BoostingDecisionTrees

Documentation for BoostingDecisionTrees.

BoostingDecisionTrees.BoostingDecisionTrees
BoostingDecisionTrees.AdaBoost
BoostingDecisionTrees.DecisionNode
BoostingDecisionTrees.LeafNode
BoostingDecisionTrees.TreeNode
BoostingDecisionTrees.best_split
BoostingDecisionTrees.createWeightedDataset
BoostingDecisionTrees.gini_impurity
BoostingDecisionTrees.information_gain
BoostingDecisionTrees.load_data_iris
BoostingDecisionTrees.predict
BoostingDecisionTrees.predict
BoostingDecisionTrees.predict
BoostingDecisionTrees.train_adaboost
BoostingDecisionTrees.train_tree

BoostingDecisionTrees.BoostingDecisionTrees — Module

A Julia module for decision tree and boosting algorithms, including decision stumps, Gini impurity, and information gain utilities.

Overview

This module provides tools for training and evaluating simple decision trees and AdaBoost ensembles built from them, with support for both Gini impurity and information gain as splitting criteria.

Features

Splitting Criteria: Supports both Gini impurity and information gain for feature selection.
Utilities: Includes helper functions for entropy, Gini impurity, and majority voting.

Exports

Decision Tree Functions:
- train_tree: Train a decision tree on a dataset.
- predict: Make predictions using a trained decision tree.
AdaBoost Functions:
- train_adaboost: Train an AdaBoost model on a dataset.
- predict: Make predictions using a trained AdaBoost model.

source

BoostingDecisionTrees.AdaBoost — Type

AdaBoost(learner, alphas)

An ensemble classifier that combines multiple weak learners into a stronger one.
Each new learner focuses on correcting the errors made by its predecessors.

Fields

learner::Vector{DecisionTree}: a collection of DecisionTree objects, each acting as a weak classifier. With the default depth of 1, each tree is a stump that predicts from a single feature threshold.
alphas::Vector{Float64}: a vector of floating-point weights, one per tree, giving that tree's voting power. A higher alpha value means the tree was more accurate during the training phase.

source

BoostingDecisionTrees.DecisionNode — Type

DecisionNode

A decision node in a decision tree that splits data based on a feature and threshold.

Fields

feature::Int: the feature index used for splitting.
threshold::Float64: the threshold value for the split.
left::TreeNode: the left subtree (samples where feature ≤ threshold).
right::TreeNode: the right subtree (samples where feature > threshold).
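
The fields above imply a simple recursive lookup: compare one feature against the threshold, follow the left or right subtree, and stop at a leaf. The following is a minimal sketch of that idea, not the package's definitions. The names SketchNode, SketchLeaf, SketchDecision and sketch_predict are hypothetical stand-ins chosen to avoid implying the real constructors.

```julia
# Hypothetical stand-ins for the package's node types (not the real definitions).
abstract type SketchNode end

struct SketchLeaf{T} <: SketchNode
    label::T                 # predicted class label
end

struct SketchDecision <: SketchNode
    feature::Int             # column index used for the split
    threshold::Float64       # split value
    left::SketchNode         # followed when x[feature] <= threshold
    right::SketchNode        # followed when x[feature] > threshold
end

# Walk from the root to a leaf for a single sample `x`.
sketch_predict(node::SketchLeaf, x::AbstractVector) = node.label

function sketch_predict(node::SketchDecision, x::AbstractVector)
    child = x[node.feature] <= node.threshold ? node.left : node.right
    return sketch_predict(child, x)
end

tree = SketchDecision(1, 2.5, SketchLeaf("a"), SketchLeaf("b"))
sketch_predict(tree, [1.0, 2.0])   # feature 1 is 1.0 <= 2.5, so the left leaf is reached
```

Dispatching on the node type keeps the traversal free of explicit type checks, which is one common way to write this in Julia.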

source

BoostingDecisionTrees.LeafNode — Type

LeafNode

A leaf node holding a predicted class label.

source

BoostingDecisionTrees.TreeNode — Type

TreeNode

Abstract type for decision tree nodes.

source

BoostingDecisionTrees.best_split — Method

best_split(feature, labels)

Find the threshold that minimizes the weighted Gini impurity when splitting a feature vector.

Arguments

feature::AbstractVector{<:Real}: A vector of numerical feature values.
labels::AbstractVector: A vector of class labels (same length as feature).

Returns

best_threshold::Union{Float64, Nothing}: The best numerical value to split the feature on. Returns nothing if no split is possible.
best_gini::Float64: The weighted Gini impurity after the split.

Examples

julia> feature = [1.0, 2.0, 3.0, 4.0];

julia> labels = ["A", "A", "B", "B"];

julia> best_threshold, best_gini = best_split(feature, labels)
(2.5, 0.0)
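
One way such a threshold search can work is sketched below. It assumes that candidate thresholds are the midpoints between consecutive distinct feature values, which is consistent with the 2.5 in the example above but is not confirmed by the documentation. The names gini and sketch_best_split are hypothetical, and the value returned as the second element when no split exists (Inf here) is also an assumption.

```julia
# Compact Gini helper so the sketch is self-contained.
gini(v) = isempty(v) ? 0.0 : 1.0 - sum((count(==(c), v) / length(v))^2 for c in unique(v))

function sketch_best_split(feature::AbstractVector{<:Real}, labels::AbstractVector)
    vals = sort(unique(feature))
    n = length(labels)
    best_threshold, best_gini = nothing, Inf
    for i in 1:length(vals)-1
        t = (vals[i] + vals[i+1]) / 2              # assumed candidate: midpoint
        left = labels[feature .<= t]
        right = labels[feature .> t]
        # Impurity of both sides, weighted by how many samples fall on each side.
        score = (length(left) * gini(left) + length(right) * gini(right)) / n
        if score < best_gini
            best_threshold, best_gini = t, score
        end
    end
    return best_threshold, best_gini               # (nothing, Inf) if all values are equal
end

sketch_best_split([1.0, 2.0, 3.0, 4.0], ["A", "A", "B", "B"])
```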

source

BoostingDecisionTrees.createWeightedDataset — Method

createWeightedDataset(X, y, weights)

Create a new dataset by sampling rows from X and y according to the probability distribution defined by weights. Samples with higher weights are more likely to be selected for the new dataset.

Arguments

X::AbstractMatrix: rows are samples, columns are features.
y::AbstractVector: class labels for each sample.
weights::Vector{Float64}: weight of each sample in the given dataset. The sum of all weights should be 1.

Returns

X_prime: A resampled matrix of the same dimensions and type as X.
y_prime: A resampled vector of the same length and type as y.
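
A minimal sketch of weight-guided resampling follows. It assumes sampling with replacement via the cumulative sum of the weights (inverse-CDF sampling), which is a common approach but not necessarily what the package does. The name sketch_weighted_resample and the explicit rng argument are hypothetical.

```julia
using Random

function sketch_weighted_resample(rng::AbstractRNG, X::AbstractMatrix, y::AbstractVector,
                                  weights::Vector{Float64})
    n = size(X, 1)
    cdf = cumsum(weights)                     # cumulative distribution over rows
    idx = Vector{Int}(undef, n)
    for i in 1:n
        u = rand(rng) * cdf[end]              # scale by the total in case of rounding drift
        idx[i] = min(searchsortedfirst(cdf, u), n)
    end
    return X[idx, :], y[idx]                  # same shapes as the inputs
end

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
y = ["a", "b", "c"]
X_prime, y_prime = sketch_weighted_resample(MersenneTwister(1), X, y, [0.1, 0.1, 0.8])
```

Rows with larger weights cover a wider slice of the cumulative distribution, so a uniform draw lands on them more often.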

source

BoostingDecisionTrees.gini_impurity — Method

gini_impurity(classes)

Compute the Gini impurity of a vector of class labels.

Arguments

classes::AbstractVector: A collection of class labels.

Returns

Float64: The Gini impurity of the input vector. Returns 0 if the input is empty.
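
For reference, Gini impurity is commonly defined as one minus the sum of squared class proportions. The sketch below follows that definition and the empty-input behavior described above; the name sketch_gini is hypothetical and the package's implementation may differ in detail.

```julia
# Gini impurity: 1 - sum over classes of p_k^2, where p_k is the share of class k.
function sketch_gini(classes::AbstractVector)
    isempty(classes) && return 0.0
    n = length(classes)
    counts = Dict{eltype(classes),Int}()
    for c in classes
        counts[c] = get(counts, c, 0) + 1
    end
    return 1.0 - sum((k / n)^2 for k in values(counts))
end

sketch_gini(["A", "A", "B", "B"])   # two equally frequent classes: 1 - (0.25 + 0.25)
```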

source

BoostingDecisionTrees.information_gain — Method

information_gain(X_column::AbstractVector{<:Real}, y::AbstractVector)

Compute the best information gain obtainable by splitting a numeric feature column using a threshold (x ≤ t vs. x > t).

Returns

best_threshold::Float64: threshold yielding maximum information gain
best_gain::Float64: corresponding information gain
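
Information gain for a threshold split is usually computed as the entropy of the labels minus the size-weighted entropy of the two sides. The sketch below assumes Shannon entropy in bits and midpoint candidate thresholds; neither detail is stated in this documentation, and the names sketch_entropy and sketch_gain are hypothetical.

```julia
# Shannon entropy (in bits) of a label vector.
function sketch_entropy(y::AbstractVector)
    isempty(y) && return 0.0
    n = length(y)
    return -sum(p * log2(p) for p in (count(==(c), y) / n for c in unique(y)))
end

# Information gain of the split x <= t versus x > t.
function sketch_gain(x::AbstractVector{<:Real}, y::AbstractVector, t::Real)
    left, right = y[x .<= t], y[x .> t]
    n = length(y)
    weighted = (length(left) * sketch_entropy(left) +
                length(right) * sketch_entropy(right)) / n
    return sketch_entropy(y) - weighted
end

x = [1.0, 2.0, 3.0, 4.0]
y = ["A", "A", "B", "B"]
s = sort(unique(x))
candidates = (s[1:end-1] .+ s[2:end]) ./ 2          # assumed candidates: midpoints
gains = [sketch_gain(x, y, t) for t in candidates]
best_threshold = candidates[argmax(gains)]          # threshold with the largest gain
```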

source

BoostingDecisionTrees.load_data_iris — Method

load_data_iris(path)

Load the Iris dataset from a CSV file, shuffle the observations, and split features from labels.

AI Disclaimer

This helper method was generated by AI.

Arguments

path::String: The file path to the CSV file (e.g., "src/data/Iris.csv").

Returns

X::Matrix: A matrix of feature values (columns 2 through 5).
y::Vector: A vector of target labels.

source

BoostingDecisionTrees.predict — Method

predict(model, X)

Predict class labels for samples in X using a trained AdaBoost classifier.

The function sums the weighted votes of all decision trees within the model to determine the most likely class for each sample.

Arguments

model::AdaBoost: A trained AdaBoost structure.
X::AbstractMatrix: rows are samples, columns are features.

Returns

Vector: A vector of predicted labels, with the same type as the labels found in the model's learners.
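
The weighted vote for a single sample can be sketched as follows: each learner adds its alpha to the label it predicts, and the label with the largest total wins. The function name sketch_weighted_vote and its inputs (one prediction per learner plus the matching alphas) are hypothetical, and tie-breaking in the package may differ.

```julia
# Each weak learner contributes its alpha to the label it predicts;
# the label with the largest summed weight wins.
function sketch_weighted_vote(predictions::AbstractVector, alphas::AbstractVector{<:Real})
    totals = Dict{eltype(predictions),Float64}()
    for (label, alpha) in zip(predictions, alphas)
        totals[label] = get(totals, label, 0.0) + alpha
    end
    best_label, best_total = nothing, -Inf
    for (label, total) in totals
        if total > best_total
            best_label, best_total = label, total
        end
    end
    return best_label
end

# One accurate learner (alpha 1.5) outvotes two weaker ones (0.4 + 0.6).
sketch_weighted_vote(["a", "b", "b"], [1.5, 0.4, 0.6])
```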

source

BoostingDecisionTrees.predict — Method

predict(model::AdaBoost, X::AbstractVector)

A convenience method for predicting the label of a single sample.

Arguments

model::AdaBoost: A trained AdaBoost structure.
X::AbstractVector: A single sample represented as a vector of features.

Returns

The predicted label for the single input sample.

source

BoostingDecisionTrees.predict — Method

predict(tree::TreeNode, X::AbstractMatrix)

Make predictions for multiple samples using the decision tree.

Arguments

tree::TreeNode: a trained decision tree.
X::AbstractMatrix: rows are samples, columns are features.

Returns

Vector: predicted class labels for each sample in X.

Examples

julia> tree = DecisionNode{String}(1, 2.5, LeafNode("a"), LeafNode("b"), String);

julia> X = [1.0 2.0; 3.0 0.5; 2.0 1.5];

julia> preds = predict(tree, X)
3-element Vector{String}:
 "a"
 "b"
 "a"

source

BoostingDecisionTrees.train_adaboost — Method

train_adaboost(X, y; iterations, max_alpha, max_depth)

Trains an AdaBoost classifier on the given dataset.

Arguments

X::AbstractMatrix: rows are samples, columns are features.
y::AbstractVector: class labels for each sample.
iterations::Integer: maximum number of weak learners. Training stops early if a perfect fit is reached. Must be at least 1. Default is 50.
max_alpha::Float64: an upper bound on the "amount of say" (alpha) of any single stump. Default is 2.5, which corresponds to an accuracy of >= 99.999%. The larger the value, the more of a "dictator" a stump with a perfect result becomes.
max_depth::Integer: maximum depth of each tree. Default is 1, which is equivalent to a decision stump.

Returns

AdaBoost: a trained classifier with learners and alphas.
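
To illustrate why a cap on alpha is useful, here is a sketch of one boosting round. It assumes the classic binary AdaBoost update, alpha = 0.5 * log((1 - err) / err), which this documentation does not confirm; under a different formula, the accuracy that a given cap corresponds to will differ. The name sketch_boost_round and the correct argument (a Bool per sample marking whether the weak learner classified it correctly) are hypothetical.

```julia
# One boosting round under the classic binary AdaBoost update (an assumption;
# the package's exact formula is not shown in this documentation).
function sketch_boost_round(weights::Vector{Float64}, correct::AbstractVector{Bool};
                            max_alpha::Float64 = 2.5)
    err = sum(weights[.!correct])                  # weighted error of the weak learner
    # For err == 0 the logarithm is Inf, so the cap is what keeps alpha finite.
    alpha = min(0.5 * log((1 - err) / err), max_alpha)
    # Raise the weight of misclassified samples, lower the weight of correct ones.
    new_weights = weights .* exp.(ifelse.(correct, -alpha, alpha))
    return alpha, new_weights ./ sum(new_weights)  # renormalise so the weights sum to 1
end

alpha, w = sketch_boost_round(fill(0.25, 4), [true, true, true, false])
```

After this round the single misclassified sample carries half of the total weight, so the next learner is pushed to get it right.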

source

BoostingDecisionTrees.train_tree — Method

train_tree(X, y; max_depth=5, criterion=:gini)

Train a decision tree using numeric threshold splits.

Arguments

X::AbstractMatrix: feature matrix.
y::AbstractVector: class labels.
max_depth::Int: maximum tree depth.
criterion::Symbol: :information_gain or :gini.

Returns

TreeNode: the root node of the trained tree.

source