Decision Trees (DTs) are a powerful and versatile non-parametric supervised learning method widely used in both classification and regression tasks. In essence, a decision tree model learns to predict the value of a target variable by inferring simple decision rules directly from the input data features. Think of it as creating a set of if-then-else conditions that guide you to a prediction. Visually, a tree structure emerges, hence the name. You can conceptualize a decision tree as a piecewise constant approximation of the underlying data pattern.
For example, consider how a decision tree can learn to approximate a sine wave. By recursively splitting the data based on feature values, the tree constructs a series of step-wise decisions that, when combined, mimic the curve. The depth of the tree plays a crucial role; deeper trees allow for more intricate decision boundaries, leading to potentially more complex and accurate models.
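To make this concrete, here is a small sketch of our own (the depths and the noise level are arbitrary choices, not from the original text) fitting regression trees of two depths to noisy samples of a sine wave; the deeper tree produces finer steps:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)      # 80 points in [0, 5)
y = np.sin(X).ravel()
y[::5] += 0.5 * (rng.rand(16) - 0.5)          # add noise to every 5th sample
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)   # coarse steps
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)      # finer steps
X_test = np.arange(0.0, 5.0, 0.01).reshape(-1, 1)
y_shallow = shallow.predict(X_test)   # piecewise-constant approximation
y_deep = deep.predict(X_test)         # closer fit, more risk of chasing noise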
Advantages of Scikit-learn Decision Trees
Scikit-learn’s implementation of decision trees offers several compelling advantages, making them a popular choice for various machine learning problems:
- Simplicity and Interpretability: Decision trees are remarkably easy to understand and interpret, even for those without a strong statistical background. The tree structure itself can be visualized graphically, making it transparent how the model arrives at a prediction. This “white box” nature contrasts with more complex “black box” models like neural networks, where understanding the reasoning behind predictions can be challenging.
- Minimal Data Preprocessing: Unlike many other machine learning algorithms, decision trees require relatively little data preparation. They do not necessitate data normalization or the creation of dummy variables for categorical features (though scikit-learn’s current implementation has limitations on categorical variables, as noted later). Furthermore, decision trees can handle missing values, reducing the need for complex imputation techniques.
- Efficient Prediction: Once a decision tree is trained, the cost of predicting new data points is logarithmically related to the number of training samples. This efficiency makes decision trees particularly well-suited for large datasets and real-time applications where rapid predictions are essential.
- Versatile Data Handling: Decision trees can effectively handle both numerical and categorical data. While scikit-learn’s current version has some constraints regarding categorical variables, the underlying algorithm is inherently adaptable to different data types. This flexibility contrasts with some techniques that are specialized for datasets with only one type of variable.
- Multi-Output Capability: Scikit-learn decision trees are capable of handling multi-output problems, where you need to predict multiple target variables simultaneously. This is a significant advantage in scenarios where outputs are correlated, as a single tree can learn these relationships more effectively than building separate models for each output.
- White Box Model: As mentioned earlier, decision trees operate as “white box” models. Their decision-making process is transparent and easily explained through boolean logic. If a specific condition or prediction is observed, the path through the tree clearly reveals the reasoning behind it.
- Statistical Validation: Decision trees allow for model validation using statistical tests. This feature enables assessment of the model’s reliability and helps in understanding the confidence level of predictions.
- Robustness to Model Violations: Decision trees often perform well even when the underlying assumptions of the true data-generating model are somewhat violated. This robustness makes them applicable in a wider range of real-world scenarios where ideal conditions may not be met.
Disadvantages of Scikit-learn Decision Trees
Despite their numerous advantages, decision trees also have limitations to be aware of:
- Overfitting: Decision trees are prone to overfitting, especially when allowed to grow too deep. They can create overly complex trees that memorize the training data rather than generalizing to unseen data. This results in poor performance on new datasets. Techniques like pruning, limiting tree depth (max_depth), or setting minimum samples per leaf (min_samples_leaf) are crucial to mitigate overfitting (see the sketch after this list).
- Instability: Decision trees can be unstable. Small variations in the training data can lead to significantly different tree structures. This sensitivity arises because even minor changes can alter the feature chosen for splitting at a node, cascading into larger structural changes. Ensemble methods, such as Random Forests and Gradient Boosting, which combine multiple decision trees, are often used to address this instability.
- Non-Smooth and Discontinuous Predictions: Decision tree predictions are piecewise constant approximations. They are not smooth or continuous functions, leading to step-like predictions. This characteristic makes them less suitable for extrapolation tasks, where predicting values outside the training data range is required.
- Suboptimal Tree Learning: Learning a truly optimal decision tree is an NP-complete problem. Practical decision tree algorithms, including those in scikit-learn, rely on heuristic greedy algorithms. These algorithms make locally optimal decisions at each node, but they don’t guarantee finding the globally optimal tree. Ensemble methods, again, can help by training multiple trees on randomly sampled features and samples.
- Difficulty Learning Certain Concepts: Some concepts are inherently difficult for decision trees to represent efficiently. Problems like XOR, parity, or multiplexer problems, which involve complex relationships not easily captured by simple axis-parallel splits, can pose challenges for standard decision trees.
- Bias in Imbalanced Datasets: If classes in a classification problem are imbalanced (i.e., one class dominates), decision trees can become biased towards the majority class. It’s recommended to balance the dataset before training a decision tree to prevent this bias. Techniques like oversampling the minority class or undersampling the majority class can be employed.
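As a hedged illustration of the overfitting point above (our own sketch; the dataset and the max_depth/min_samples_leaf values are arbitrary), comparing train and test accuracy of an unconstrained tree against a constrained one typically makes the gap visible:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # grown until leaves are pure
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X_train, y_train)  # pre-pruned
print(full.score(X_train, y_train), full.score(X_test, y_test))      # typically ~1.0 on train, lower on test
print(pruned.score(X_train, y_train), pruned.score(X_test, y_test))  # usually a smaller train/test gap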
Decision Trees for Classification with Scikit-learn
Scikit-learn’s DecisionTreeClassifier is designed for multi-class classification tasks. Like other classifiers in scikit-learn, it takes two primary input arrays:
- X: An array (sparse or dense) of shape (n_samples, n_features) representing the training samples. Each row is a sample, and each column is a feature.
- Y: An array of shape (n_samples,) containing integer class labels corresponding to each training sample in X.
Here’s a basic example of training a DecisionTreeClassifier:
from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
Once trained (i.e., after the fit method has been called), the model can predict the class label for new samples using the predict method:
clf.predict([[2., 2.]])
In scenarios where multiple classes have the same highest probability, DecisionTreeClassifier will predict the class with the lowest index among those classes.
Beyond predicting a specific class, you can also obtain the probability of each class for a given sample using predict_proba:
clf.predict_proba([[2., 2.]])
DecisionTreeClassifier is versatile, handling both binary (two classes, e.g., [-1, 1]) and multiclass (more than two classes, e.g., [0, 1, …, K-1]) classification problems.
Let’s illustrate with the Iris dataset, a classic benchmark in machine learning:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
X, y = iris.data, iris.target
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
After training, you can visualize the learned decision tree using scikit-learn’s plot_tree function:
tree.plot_tree(clf)
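plot_tree accepts additional options for labelling and colouring the plot. The call below is a small sketch of our own (the keyword choices are illustrative, not from the original text); it renders the Iris tree with feature names, class names, and coloured nodes on a matplotlib axis:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 8))
tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names,
               filled=True, ax=ax)  # filled=True colours each node by its majority class
plt.show()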
Exporting Decision Trees
Scikit-learn provides flexible options for exporting decision trees for visualization or further analysis:
- Graphviz Export: The export_graphviz exporter allows you to export the tree in the Graphviz format. Graphviz is a powerful graph visualization tool. You’ll need to install Graphviz separately (e.g., via conda or your system’s package manager) as well as the Python graphviz package.

import graphviz
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("iris")  # Saves to iris.pdf

export_graphviz offers extensive customization options, including coloring nodes by class, displaying feature and class names, and controlling aesthetics. Jupyter notebooks can also render these plots inline.

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
- Textual Export: For a more compact, text-based representation, use export_text. This method doesn’t require external libraries and is useful for quickly understanding the tree structure.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
iris = load_iris()
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(iris.data, iris.target)
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)

|--- petal width (cm) <= 0.80
|   |--- class: 0
|--- petal width (cm) > 0.80
|   |--- petal width (cm) <= 1.75
|   |   |--- class: 1
|   |--- petal width (cm) > 1.75
|   |   |--- class: 2
Decision Trees for Regression with Scikit-learn
Decision trees are not limited to classification; they are equally applicable to regression problems using scikit-learn’s DecisionTreeRegressor.
Similar to classification, the fit method for DecisionTreeRegressor takes arrays X and y as input. However, in regression, the target variable y is expected to contain floating-point values rather than integer class labels.
from sklearn import tree
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X, y)
clf.predict([[1, 1]])
Handling Multi-output Problems with Decision Trees
Multi-output problems involve predicting multiple target variables simultaneously. In scikit-learn, this is represented when the target Y is a 2D array of shape (n_samples, n_outputs).
While you could solve multi-output problems by building independent models for each output, decision trees offer a more integrated approach. Scikit-learn’s DecisionTreeClassifier and DecisionTreeRegressor are inherently capable of handling multi-output scenarios. The key adaptations are:
- Storing Multiple Outputs in Leaves: Instead of storing a single output value in leaf nodes (as in single-output trees), multi-output trees store an array of n_outputs values.
- Averaged Splitting Criteria: The splitting criteria used at each node are adapted to compute the average reduction in impurity across all n_outputs outputs.
By using a single multi-output decision tree, you can often achieve:
- Reduced Training Time: Building one model is generally faster than training n independent models.
- Improved Generalization Accuracy: Capturing correlations between outputs within a single model can lead to better predictive performance.
When you train a decision tree on a multi-output array Y, the resulting estimator will:
- Output an array of n_output values when using predict.
- Output a list of n_output arrays of class probabilities when using predict_proba (for classification).
Multi-output regression is demonstrated in the Decision Tree Regression example. In this example, the input X is a single real value, and the outputs Y are the sine and cosine of X.
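Here is a minimal reconstruction of that setup (our own sketch, not the original example’s code): a single real-valued input and a two-column target holding the sine and cosine of the input.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(0)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)             # single real-valued input
y = np.column_stack([np.sin(X).ravel(), np.cos(X).ravel()])    # shape (100, 2): sine and cosine
reg = DecisionTreeRegressor(max_depth=5).fit(X, y)
reg.predict([[10.0]])   # shape (1, 2): one prediction per output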
Multi-output classification is showcased in the Face completion with a multi-output estimators example. Here, inputs X are the pixels of the upper half of faces, and outputs Y are the pixels of the lower half.
Complexity of Decision Tree Algorithms
The time complexity of building a balanced binary decision tree is typically \(O(n_{samples} n_{features} \log(n_{samples}))\), and the prediction time complexity is \(O(\log(n_{samples}))\).
While the tree construction aims for balanced trees, they aren’t always perfectly balanced. At each node, the algorithm searches through \(O(n_{features})\) features to find the split that maximizes impurity reduction (e.g., information gain). This search contributes a cost of \(O(n_{features} n_{samples} \log(n_{samples}))\) per node. Summing across all nodes, the total tree construction cost can reach \(O(n_{features} n_{samples}^{2} \log(n_{samples}))\) in the worst-case scenario for imbalanced trees.
Practical Tips for Effective Decision Tree Usage
To maximize the performance and avoid common pitfalls when using decision trees, consider these practical tips:
- Feature-to-Sample Ratio: Decision trees are susceptible to overfitting when dealing with datasets with a large number of features relative to the number of samples. Strive for a reasonable ratio. Trees trained on high-dimensional data with limited samples are highly likely to overfit.
- Dimensionality Reduction: Before applying decision trees, consider dimensionality reduction techniques like PCA (Principal Component Analysis), ICA (Independent Component Analysis), or feature selection methods. Reducing the number of irrelevant or redundant features can significantly improve tree performance and prevent overfitting.
- Understand Tree Structure: Explore the structure of your trained decision tree. Scikit-learn provides tools to visualize the tree structure, helping you gain insights into how the tree makes predictions and identify important features.
- Visualize During Training: Utilize the export functions (like export_graphviz or export_text) to visualize your tree as you train it, especially during initial exploration. Start with a shallow tree (max_depth=3) to understand how it fits the data and gradually increase depth as needed.
- Control Tree Depth (max_depth): Remember that the number of samples needed to populate the tree effectively doubles with each additional level. Use max_depth to control tree size and prevent overfitting. Start with smaller depths and increase cautiously.
- min_samples_split and min_samples_leaf: Use these parameters to ensure that splits are informed by a sufficient number of samples. min_samples_split sets the minimum samples required to split an internal node, while min_samples_leaf sets the minimum samples required in a leaf node. Small values can lead to overfitting, while large values might prevent the tree from learning important patterns. Start with min_samples_leaf=5 and adjust based on your dataset. For classification with few classes, min_samples_leaf=1 might be suitable. Note that min_samples_split considers raw sample counts, not sample_weight.
- Dataset Balancing: If you have imbalanced classes, balance your dataset before training. This prevents the tree from being biased towards dominant classes. Balance by either sampling an equal number of samples from each class or by normalizing sample_weight sums for each class. For weight-based pre-pruning, consider min_weight_fraction_leaf, which is less biased by dominant classes than min_samples_leaf.
- min_weight_fraction_leaf: If using sample_weight, min_weight_fraction_leaf offers a weight-aware pre-pruning criterion. It ensures leaf nodes contain at least a fraction of the total sample weights, useful for imbalanced datasets or weighted learning.
- Data Type: Internally, decision trees in scikit-learn use np.float32 arrays. If your training data isn’t in this format, a copy will be created.
Sparse Matrices: For very sparse input matrices
X
, convert tocsc_matrix
beforefit
andcsr_matrix
beforepredict
. This can significantly speed up training time compared to dense matrices, especially when features have many zero values.
Decision Tree Algorithms: ID3, C4.5, C5.0, and CART
Several decision tree algorithms exist, each with variations in their approach. Here’s a brief overview:
- ID3 (Iterative Dichotomiser 3): Developed by Ross Quinlan in 1986, ID3 builds multiway trees using categorical features and information gain for splitting. Trees are grown fully and then pruned.
- C4.5: A successor to ID3, C4.5 removes the categorical feature restriction by dynamically discretizing numerical features. It converts trees to if-then rules and prunes based on rule accuracy.
- C5.0: Quinlan’s later, proprietary version, C5.0, is more memory-efficient, builds smaller rule sets, and is generally more accurate than C4.5.
- CART (Classification and Regression Trees): CART is similar to C4.5 but supports numerical targets (regression) and builds binary trees. It uses the feature and threshold that maximize information gain at each node.
Scikit-learn’s decision tree implementation is based on an optimized version of the CART algorithm. However, it’s important to note that scikit-learn’s current implementation does not natively support categorical variables directly.
Mathematical Formulation of Decision Trees
Given training data \(x_i \in R^n\) and labels \(y \in R^l\), a decision tree recursively partitions the feature space to group samples with similar labels or target values.
At each node \(m\) with data \(Q_m\) and \(n_m\) samples, the algorithm considers candidate splits \(\theta = (j, t_m)\) defined by a feature \(j\) and threshold \(t_m\). The data is partitioned into left and right subsets:
\[Q_m^{left}(\theta) = \{(x, y) \mid x_j \leq t_m\}\]
\[Q_m^{right}(\theta) = Q_m \setminus Q_m^{left}(\theta)\]
The quality of a split is evaluated using an impurity function (or loss function) \(H(\cdot)\), chosen based on the task (classification or regression):
\[G(Q_m, \theta) = \frac{n_m^{left}}{n_m} H(Q_m^{left}(\theta)) + \frac{n_m^{right}}{n_m} H(Q_m^{right}(\theta))\]
The algorithm selects the split \(\theta^*\) that minimizes the impurity:
\[\theta^* = \operatorname{argmin}_\theta G(Q_m, \theta)\]
This process is repeated recursively for the subsets \(Q_m^{left}(\theta^*)\) and \(Q_m^{right}(\theta^*)\) until a stopping criterion is met (e.g., maximum depth, minimum samples, or node purity).
Classification Criteria
For classification targets (values 0, 1, …, K-1), let \(p_{mk}\) be the proportion of class \(k\) observations in node \(m\):
\[p_{mk} = \frac{1}{n_m} \sum_{y \in Q_m} I(y = k)\]
Common impurity measures for classification are:
- Gini Impurity:
\[H(Q_m) = \sum_k p_{mk} (1 - p_{mk})\]
- Entropy (Log Loss):
\[H(Q_m) = -\sum_k p_{mk} \log(p_{mk})\]
The entropy criterion corresponds to the Shannon entropy of the class proportions; minimizing it as a splitting criterion is equivalent to minimizing the log loss (cross-entropy) between the true labels and the tree model’s probabilistic predictions.
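To make the classification criteria concrete, here is a small self-contained sketch of our own that computes the Gini impurity and entropy of a node and the weighted impurity \(G(Q_m, \theta)\) of a candidate split (entropy uses the natural logarithm here):
import numpy as np

def gini(labels):
    # H(Q) = sum_k p_k * (1 - p_k)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def entropy(labels):
    # H(Q) = -sum_k p_k * log(p_k)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

y_node = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # 8 samples in node m
y_left, y_right = y_node[:4], y_node[4:]      # one candidate split theta
# Weighted impurity of the split: G = (n_left/n) H(left) + (n_right/n) H(right)
G = len(y_left) / len(y_node) * gini(y_left) + len(y_right) / len(y_node) * gini(y_right)
print(gini(y_node), entropy(y_node), G)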
Regression Criteria
For continuous target values, common criteria to minimize for splits include:
- Mean Squared Error (MSE):
\[\bar{y}_m = \frac{1}{n_m} \sum_{y \in Q_m} y\]
\[H(Q_m) = \frac{1}{n_m} \sum_{y \in Q_m} (y - \bar{y}_m)^2\]
- Mean Poisson Deviance: Suitable for count or frequency data (where \(y \ge 0\) is required).
\[H(Q_m) = \frac{2}{n_m} \sum_{y \in Q_m} \left(y \log\frac{y}{\bar{y}_m} - y + \bar{y}_m\right)\]
- Mean Absolute Error (MAE):
\[median(y)_m = \underset{y \in Q_m}{\mathrm{median}}(y)\]
\[H(Q_m) = \frac{1}{n_m} \sum_{y \in Q_m} |y - median(y)_m|\]
MSE and Poisson deviance use the mean value \(\bar{y}_m\) for predictions in terminal nodes, while MAE uses the median \(median(y)_m\). Poisson deviance and MAE fitting are computationally slower than MSE.
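Analogously, a tiny sketch of our own evaluating the MSE and MAE impurities of a single node from the formulas above:
import numpy as np
y_node = np.array([0.5, 1.0, 1.5, 4.0])            # target values in node m
mse = np.mean((y_node - y_node.mean()) ** 2)       # MSE impurity; terminal prediction is the mean
mae = np.mean(np.abs(y_node - np.median(y_node)))  # MAE impurity; terminal prediction is the median
print(mse, mae)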
Handling Missing Values in Scikit-learn Decision Trees
Scikit-learn decision trees (DecisionTreeClassifier, DecisionTreeRegressor) and extra-trees (ExtraTreeClassifier, ExtraTreeRegressor) have built-in support for missing values (NaNs) when using splitter='best' (for standard trees) or splitter='random' (for extra-trees).
When splitting a node with missing values, the splitter evaluates splits by considering all missing values to go to either the left or the right child node.
Prediction with Missing Values:
- Default Behavior: During prediction, samples with missing values are assigned to the child node based on the split decision made during training.

from sklearn.tree import DecisionTreeClassifier
import numpy as np
X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
tree.predict(X)
- Tie-Breaking: If the impurity evaluation is identical for both left and right nodes when considering missing values, the tie is broken by sending missing values to the right child node during prediction.

from sklearn.tree import DecisionTreeClassifier
import numpy as np
X = np.array([np.nan, -1, np.nan, 1]).reshape(-1, 1)
y = [0, 0, 1, 1]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
X_test = np.array([np.nan]).reshape(-1, 1)
tree.predict(X_test)
- Missing Values Not Seen During Training: If a feature has no missing values during training, but encounters them during prediction, missing values are mapped to the child node with the most samples.

from sklearn.tree import DecisionTreeClassifier
import numpy as np
X = np.array([0, 1, 2, 3]).reshape(-1, 1)
y = [0, 1, 1, 1]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
X_test = np.array([np.nan]).reshape(-1, 1)
tree.predict(X_test)
Extra-trees and Missing Values:
ExtraTreeClassifier and ExtraTreeRegressor handle missing values slightly differently during splitting. They choose a random threshold to split non-missing values and randomly assign missing values to either the left or right child. This process is repeated for each feature, and the best split is selected. Prediction with missing values follows the same logic as standard decision trees.
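A minimal sketch mirroring the earlier missing-value examples, assuming a scikit-learn version in which the extra-tree estimators accept NaNs:
from sklearn.tree import ExtraTreeClassifier
import numpy as np
X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]
et = ExtraTreeClassifier(random_state=0).fit(X, y)  # random thresholds; NaNs assigned to a random side
et.predict(X)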
Minimal Cost-Complexity Pruning
Minimal cost-complexity pruning is a technique to prune decision trees and prevent overfitting. It uses a complexity parameter \(\alpha \ge 0\) to define the cost-complexity measure \(R_\alpha(T)\) of a tree \(T\):
\[R_\alpha(T) = R(T) + \alpha|\widetilde{T}|\]
where \(|\widetilde{T}|\) is the number of terminal nodes, and \(R(T)\) is the total impurity of the terminal nodes (misclassification rate or sample-weighted impurity). The algorithm finds the subtree of \(T\) that minimizes \(R_\alpha(T)\).
The effective \(\alpha\) of a node \(t\) is \(\alpha_{eff}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}\), where \(T_t\) is the subtree rooted at \(t\). The weakest link (the non-terminal node with the smallest \(\alpha_{eff}\)) is pruned iteratively until the minimal \(\alpha_{eff}\) of the pruned tree exceeds the ccp_alpha parameter.
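In scikit-learn, this pruning is controlled by the ccp_alpha parameter, and cost_complexity_pruning_path exposes the effective alphas computed along the pruning path. A hedged sketch of our own (the dataset and the specific alpha are arbitrary choices):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Effective alphas at which nodes would be pruned, from the weakest link upwards
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)
# Refit with a non-zero ccp_alpha to obtain a pruned (smaller) tree
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)
print(pruned.tree_.node_count, pruned.score(X_test, y_test))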
Conclusion
Scikit-learn decision trees provide a versatile and interpretable approach to both classification and regression problems. Their strengths lie in their simplicity, ease of use, and ability to handle diverse data types with minimal preprocessing. While they have limitations like overfitting and instability, these can be effectively addressed through techniques like pruning, ensemble methods, and careful parameter tuning. Understanding the nuances of decision trees, their algorithms, and practical considerations will empower you to leverage them effectively in your machine learning projects.
References
- Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
- Decision tree learning – Wikipedia
- Predictive analytics – Wikipedia
- Quinlan, J.R. (1993). C4.5: programs for machine learning. Morgan Kaufmann.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). Elements of Statistical Learning. Springer.