BoostingDecisionTrees

Documentation for BoostingDecisionTrees.

BoostingDecisionTrees.BoostingDecisionTrees - Module

BoostingDecisionTrees

A Julia module for decision tree and boosting algorithms, including decision stumps, Gini impurity, and information gain utilities.

Overview

This module provides tools for training and evaluating simple decision trees, with support for both Gini impurity and information gain as splitting criteria.

Features

  • Splitting Criteria: Supports both Gini impurity and information gain for feature selection.
  • Utilities: Includes helper functions for entropy, Gini impurity, and majority voting.

Exports

  • Decision Tree Functions:
    • train_tree: Train a decision tree on a dataset.
    • predict: Make predictions using a trained decision tree.
  • AdaBoost Functions:
    • train_adaboost: Train an AdaBoost model on a dataset.
    • predict: Make predictions using a trained AdaBoost model.

BoostingDecisionTrees.AdaBoost - Type
AdaBoost(learner, alphas)

A strong ensemble classifier composed of multiple weak learners.
Each new learner focuses on correcting the errors made by its predecessors.

Fields

  • learner::Vector{DecisionTree}: A collection of DecisionTree objects, each acting as a weak classifier. With the default max_depth of 1, each tree is a decision stump that predicts from a single feature threshold.
  • alphas::Vector{Float64}: A vector of floating-point weights, one per tree, giving each tree's voting power. A higher alpha value means the tree was more accurate during training.

BoostingDecisionTrees.DecisionNode - Type
DecisionNode

A decision node in a decision tree that splits data based on a feature and threshold.

Fields

  • feature::Int: the feature index used for splitting.
  • threshold::Float64: the threshold value for the split.
  • left::TreeNode: the left subtree (samples where feature ≤ threshold).
  • right::TreeNode: the right subtree (samples where feature > threshold).

BoostingDecisionTrees.best_split - Method
best_split(feature, labels)

Find the threshold that splits a feature vector with the lowest weighted Gini impurity.

Arguments

  • feature::AbstractVector{<:Real}: A vector of numerical feature values.
  • labels::AbstractVector: A vector of class labels (same length as feature).

Returns

  • best_threshold::Union{Float64, Nothing}: The best numerical value to split the feature on. Returns nothing if no split is possible.
  • best_gini::Float64: The weighted Gini impurity after the split.

Examples

julia> feature = [1.0, 2.0, 3.0, 4.0];

julia> labels = ["A", "A", "B", "B"];

julia> best_threshold, best_gini = best_split(feature, labels)
(2.5, 0.0)
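
The search described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the names gini_of and best_split_sketch are hypothetical, and using midpoints between consecutive distinct values as candidate thresholds is an assumption. When no split exists, the sketch returns Inf as the impurity, which the package may handle differently.

```julia
# Hypothetical helper: Gini impurity of a label vector (0.0 for empty input).
function gini_of(labels)
    isempty(labels) && return 0.0
    n = length(labels)
    return 1.0 - sum((count(==(c), labels) / n)^2 for c in unique(labels))
end

# Hypothetical sketch of a threshold search that minimizes weighted Gini impurity.
function best_split_sketch(feature::AbstractVector{<:Real}, labels::AbstractVector)
    vals = sort(unique(feature))
    best_threshold, best_gini = nothing, Inf
    n = length(labels)
    # Candidate thresholds: midpoints between consecutive distinct values (an assumption).
    for i in 1:length(vals)-1
        t = (vals[i] + vals[i+1]) / 2
        left = labels[feature .<= t]
        right = labels[feature .> t]
        # Impurity of both sides, weighted by the share of samples on each side.
        g = (length(left) * gini_of(left) + length(right) * gini_of(right)) / n
        if g < best_gini
            best_threshold, best_gini = t, g
        end
    end
    return best_threshold, best_gini
end
```

A constant feature yields no candidate thresholds, so the sketch returns nothing as the threshold in that case.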

BoostingDecisionTrees.createWeightedDataset - Method
createWeightedDataset(X, y, weights)

Create a new dataset by sampling rows from X and y according to the probability distribution defined by weights. Samples with higher weights are more likely to be selected for the new dataset.

Arguments

  • X::AbstractMatrix: rows are samples, columns are features.
  • y::AbstractVector: class labels for each sample.
  • weights::Vector{Float64}: weight of each sample in the given dataset. The sum of all weights should be 1.

Returns

  • X_prime: A resampled matrix of the same dimensions and type as X.
  • y_prime: A resampled vector of the same length and type as y.
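
One way to implement this kind of resampling is inverse-CDF sampling over the weights. The sketch below assumes sampling with replacement, which the documentation does not state explicitly. The function name resample_sketch and the rng keyword are hypothetical and not part of the package.

```julia
using Random

# Hypothetical sketch of weighted sampling with replacement; not the package's code.
function resample_sketch(X::AbstractMatrix, y::AbstractVector, weights::Vector{Float64};
                         rng::AbstractRNG = Random.default_rng())
    n = length(y)
    cdf = cumsum(weights)                      # cumulative distribution over row indices
    idx = Vector{Int}(undef, n)
    for i in 1:n
        # Draw r in (0, total]; scaling by the total tolerates small rounding drift.
        r = (1 - rand(rng)) * cdf[end]
        # First index whose cumulative weight reaches r; zero-weight rows are never hit.
        idx[i] = min(searchsortedfirst(cdf, r), n)
    end
    return X[idx, :], y[idx]
end
```

Passing a seeded generator such as MersenneTwister(1) as rng makes the draw reproducible.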

BoostingDecisionTrees.gini_impurity - Method
gini_impurity(classes)

Compute the Gini impurity of a vector of class labels.

Arguments

  • classes::AbstractVector: A collection of class labels.

Returns

  • Float64: The Gini impurity of the input vector. Returns 0 if the input is empty.
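
For reference, Gini impurity is one minus the sum of squared class proportions. The following sketch shows that computation. The name gini_sketch is hypothetical, and the package's own function may be written differently.

```julia
# Hypothetical helper; not the package's own implementation.
function gini_sketch(classes::AbstractVector)
    isempty(classes) && return 0.0          # mirrors the documented empty-input behavior
    n = length(classes)
    counts = Dict{eltype(classes), Int}()
    for c in classes
        counts[c] = get(counts, c, 0) + 1   # tally occurrences of each label
    end
    # Gini impurity: 1 - sum over classes of (class proportion)^2
    return 1.0 - sum((k / n)^2 for k in values(counts))
end
```

A pure vector gives 0.0, and a vector split evenly between two classes gives 0.5.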

BoostingDecisionTrees.information_gain - Method
information_gain(X_column::AbstractVector{<:Real}, y::AbstractVector)

Compute the best information gain obtainable by splitting a numeric feature column using a threshold (x ≤ t vs. x > t).

Returns

  • best_threshold::Float64: threshold yielding maximum information gain
  • best_gain::Float64: corresponding information gain
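
Information gain is the entropy of the labels before the split minus the weighted entropy of the two sides after it. A minimal sketch follows, assuming base-2 entropy and midpoint candidate thresholds. Both are assumptions, as are the names entropy_sketch and information_gain_sketch and the use of NaN to mark the case where no split improves on the parent.

```julia
# Hypothetical helper: Shannon entropy (base 2) of a label vector.
function entropy_sketch(labels::AbstractVector)
    isempty(labels) && return 0.0
    n = length(labels)
    h = 0.0
    for c in unique(labels)
        p = count(==(c), labels) / n
        h -= p * log2(p)
    end
    return h
end

# Hypothetical sketch of a threshold search that maximizes information gain.
function information_gain_sketch(x::AbstractVector{<:Real}, y::AbstractVector)
    parent = entropy_sketch(y)
    vals = sort(unique(x))
    best_threshold, best_gain = NaN, 0.0    # NaN marks "no useful split" (an assumption)
    n = length(y)
    for i in 1:length(vals)-1
        t = (vals[i] + vals[i+1]) / 2       # midpoint candidate (an assumption)
        left, right = y[x .<= t], y[x .> t]
        children = (length(left) * entropy_sketch(left) +
                    length(right) * entropy_sketch(right)) / n
        gain = parent - children
        if gain > best_gain
            best_threshold, best_gain = t, gain
        end
    end
    return best_threshold, best_gain
end
```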

BoostingDecisionTrees.load_data_iris - Method
load_data_iris(path)

Load the Iris dataset from a CSV file, shuffle the observations, and split features from labels.

AI Disclaimer

This helper method was generated by AI.

Arguments

  • path::String: The file path to the CSV file (e.g., "src/data/Iris.csv").

Returns

  • X::Matrix: A matrix of feature values (columns 2 through 5).
  • y::Vector: A vector of target labels.

BoostingDecisionTrees.predict - Method
predict(model, X)

Predict class labels for samples in X using a trained AdaBoost classifier.

The function sums the weighted votes of all decision trees in the model and, for each sample, returns the class with the largest total.

Arguments

  • model::AdaBoost: A trained AdaBoost structure.
  • X::AbstractMatrix: rows are samples, columns are features.

Returns

  • Vector: A vector of predicted labels, with the same type as the labels found in the model's learners.
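
The weighted vote can be illustrated independently of the model type. In the sketch below, votes[m][i] is the label that learner m assigns to sample i and alphas[m] is that learner's weight. The function name weighted_vote_sketch and this input layout are hypothetical and chosen only for illustration.

```julia
# Hypothetical sketch of alpha-weighted voting; not the package's internal code.
function weighted_vote_sketch(votes::Vector{<:AbstractVector}, alphas::Vector{Float64})
    n = length(first(votes))
    T = eltype(first(votes))
    preds = Vector{T}(undef, n)
    for i in 1:n
        # Sum the alpha of every learner that voted for each label.
        score = Dict{T, Float64}()
        for (m, alpha) in enumerate(alphas)
            label = votes[m][i]
            score[label] = get(score, label, 0.0) + alpha
        end
        # Pick the label with the largest summed weight.
        best_label, best_score = votes[1][i], -Inf
        for (label, s) in score
            if s > best_score
                best_label, best_score = label, s
            end
        end
        preds[i] = best_label
    end
    return preds
end
```

With this scheme a single learner with a large alpha can outvote several learners with small alphas.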

BoostingDecisionTrees.predict - Method
predict(model::AdaBoost, X::AbstractVector)

A convenience method for predicting the label of a single sample.

Arguments

  • model::AdaBoost: A trained AdaBoost structure.
  • X::AbstractVector: A single sample represented as a vector of features.

Returns

  • The predicted label for the single input sample.

BoostingDecisionTrees.predict - Method
predict(tree::TreeNode, X::AbstractMatrix)

Make predictions for multiple samples using the decision tree.

Arguments

  • tree::TreeNode: a trained decision tree.
  • X::AbstractMatrix: rows are samples, columns are features.

Returns

  • Vector: predicted class labels for each sample in X.

Examples

julia> tree = DecisionNode{String}(1, 2.5, LeafNode("a"), LeafNode("b"), String);

julia> X = [1.0 2.0; 3.0 0.5; 2.0 1.5];

julia> preds = predict(tree, X)
3-element Vector{String}:
 "a"
 "b"
 "a"
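
The traversal behind this kind of prediction can be sketched with stand-in node types. SketchNode, SketchLeaf, SketchSplit, predict_one, and predict_rows are hypothetical names introduced only for this illustration. They mirror the documented fields of DecisionNode but are not the package's types.

```julia
# Hypothetical stand-ins for the package's node types.
abstract type SketchNode end

struct SketchLeaf{T} <: SketchNode
    label::T
end

struct SketchSplit <: SketchNode
    feature::Int
    threshold::Float64
    left::SketchNode     # samples where feature <= threshold
    right::SketchNode    # samples where feature > threshold
end

# Walk one sample down the tree until a leaf is reached.
predict_one(node::SketchLeaf, x::AbstractVector) = node.label
predict_one(node::SketchSplit, x::AbstractVector) =
    x[node.feature] <= node.threshold ? predict_one(node.left, x) : predict_one(node.right, x)

# Apply the single-sample traversal to every row of X.
predict_rows(tree::SketchNode, X::AbstractMatrix) =
    [predict_one(tree, X[i, :]) for i in 1:size(X, 1)]
```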

BoostingDecisionTrees.train_adaboost - Method
train_adaboost(X, y; iterations=50, max_alpha=2.5, max_depth=1)

Trains an AdaBoost classifier on the given dataset.

Arguments

  • X::AbstractMatrix: rows are samples, columns are features.
  • y::AbstractVector: class labels for each sample.
  • iterations::Integer: maximum number of weak learners. Must be at least 1. Training stops early if a perfect fit is reached. Default is 50.
  • max_alpha::Float64: cap on the "amount of say" (alpha) of any single stump. Default is 2.5, which corresponds to an accuracy of >= 99.999%. The larger the value, the more a stump with a perfect result dominates the vote.
  • max_depth::Integer: maximum depth of each tree. Default is 1, which is equivalent to a decision stump.

Returns

  • AdaBoost: a trained classifier with learners and alphas.
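
The role of max_alpha can be seen in the bookkeeping of a single boosting round. The sketch below assumes the textbook binary AdaBoost update with a natural logarithm. The package may use a different formula or logarithm base, so the accuracy that a given cap corresponds to depends on that choice. The name boosting_round_sketch and its arguments are hypothetical.

```julia
# Hypothetical sketch of one boosting round, assuming the textbook binary
# AdaBoost update with a natural logarithm. The package may differ.
function boosting_round_sketch(weights::Vector{Float64}, misclassified::AbstractVector{Bool};
                               max_alpha::Float64 = 2.5)
    err = sum(weights[misclassified])      # weighted error of the weak learner
    # A perfect learner would get an infinite alpha, so it is capped at max_alpha.
    alpha = err == 0 ? max_alpha : min(0.5 * log((1 - err) / err), max_alpha)
    # Raise the weight of misclassified samples, lower the rest, then renormalize.
    new_weights = weights .* exp.(ifelse.(misclassified, alpha, -alpha))
    return alpha, new_weights ./ sum(new_weights)
end
```

Without the cap, a learner with zero weighted error would receive unbounded voting power and decide every prediction on its own.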

BoostingDecisionTrees.train_tree - Method
train_tree(X, y; max_depth=5, criterion=:gini)

Train a decision tree using numeric threshold splits.

Arguments

  • X::AbstractMatrix: feature matrix.
  • y::AbstractVector: class labels.
  • max_depth::Int: maximum tree depth.
  • criterion::Symbol: :information_gain or :gini.

Returns

  • TreeNode: the root of the trained decision tree.