BoostingDecisionTrees.jl


This project demonstrates how different tree-based models handle multiclass classification on the Iris dataset. By comparing a plain, fully grown decision tree with an adaptive ensemble (AdaBoost), we can show the strengths of each approach.

Running the package

To use this package, open a Julia REPL and run:

pkg> add https://github.com/MateoSkmn/BoostingDecisionTrees.jl

julia> using BoostingDecisionTrees

Examples

Load dataset

For convenience, you can use the dataset provided under 'src/data/Iris.csv'. The dataset is also available at https://www.kaggle.com/datasets/uciml/iris.

julia> X, y = load_data("path/to/Iris.csv")

This loads the 150 samples in random order. Note: this method was written specifically for the Iris.csv described above and might not work for other datasets. You can always use your own dataset, as long as X is a Matrix and y is a Vector.
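If you bring your own data, the sketch below shows how to shape and shuffle it yourself so that a simple front/back split gives a fair train/test partition. It uses a stand-in random dataset; `perm`, `X_train`, etc. are illustrative names, not part of the package API.

```julia
using Random

# Stand-in dataset: X must be a Matrix of features, y a Vector of labels.
X = rand(150, 4)
y = rand(["setosa", "versicolor", "virginica"], 150)

# Shuffle the rows so a sequential split is not biased by the file order.
perm = randperm(length(y))
X, y = X[perm, :], y[perm]

X_train, y_train = X[1:100, :], y[1:100]
X_test,  y_test  = X[101:150, :], y[101:150]
```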

Decision Tree

Decision trees recursively split the data using numeric thresholds until a stopping criterion is reached.

julia> tree = train_tree(X[1:100, :], y[1:100]; max_depth=5, criterion=:gini)

julia> prediction = predict(tree, X[101:150, :])

julia> sum(prediction .== y[101:150]) / length(y[101:150]) # Accuracy

The decision tree supports two numeric splitting criteria:

:gini - Gini impurity

:information_gain - entropy-based information gain
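For intuition, both criteria can be computed by hand for a vector of class labels. This is a standalone sketch, not the package's internal code:

```julia
# Gini impurity: 1 - sum of squared class proportions.
function gini_impurity(y)
    n = length(y)
    p = [count(==(c), y) / n for c in unique(y)]
    return 1 - sum(p .^ 2)
end

# Shannon entropy (in bits), the basis of information gain.
function entropy(y)
    n = length(y)
    p = [count(==(c), y) / n for c in unique(y)]
    return -sum(pi * log2(pi) for pi in p)
end

labels = ["setosa", "setosa", "versicolor", "virginica"]
gini_impurity(labels)  # 0.625
entropy(labels)        # 1.5
```

A pure node (all labels equal) scores 0 under both measures; the tree picks the threshold whose child nodes minimize the chosen criterion.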

AdaBoost

AdaBoost is an ensemble classifier that combines multiple weak learners. Each new learner focuses on correcting the errors made by its predecessors. You can train a model on your own dataset, and you can adjust both the maximum number of boosting iterations and the maximum weight ('power') a single weak learner may receive.

julia> ada = train_adaboost(X[1:100, :], y[1:100]; iterations=50, max_alpha=2.5)

julia> ada2 = train_adaboost(X[1:100, :], y[1:100]) # Without keyword arguments, the defaults match the values above

julia> prediction = predict(ada, X[101:150, :])

julia> sum(prediction .== y[101:150]) / length(y[101:150]) # Accuracy of the ensemble
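For intuition, a single boosting round can be sketched by hand. `adaboost_round` below is an illustrative helper, not the package's implementation: it shows how sample weights shift toward misclassified examples and how a cap like max_alpha limits a single learner's vote.

```julia
# One AdaBoost round (sketch). A weak learner is represented only by a
# Bool vector saying which samples it classified correctly.
function adaboost_round(weights, correct; max_alpha=2.5)
    err = sum(weights[.!correct]) / sum(weights)         # weighted error rate
    alpha = min(0.5 * log((1 - err) / err), max_alpha)   # learner weight, capped
    w = weights .* exp.(ifelse.(correct, -alpha, alpha)) # boost the misses
    return w ./ sum(w), alpha                            # renormalized weights
end

w = fill(0.2, 5)                        # five samples, uniform weights
correct = [true, true, true, true, false]
w2, alpha = adaboost_round(w, correct)
# after the update, the misclassified sample carries half the total weight
```

Capping alpha keeps any one learner from dominating the ensemble's weighted vote, which is what the max_alpha keyword controls in train_adaboost.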