In the realm of machine learning, building a model is only half the battle. The true measure of success lies in rigorously testing and evaluating its performance. Scikit-learn, a cornerstone library for machine learning in Python, provides a rich toolkit for quantifying the quality of your model’s predictions. This guide delves into the essential aspects of model evaluation within scikit-learn, ensuring your models are not just built, but robustly tested and ready for real-world applications.
This comprehensive guide is inspired by statistical decision theory and aims to provide a clear path for choosing and applying the right scoring functions for supervised learning, drawing from insights by Gneiting et al.
Predicting vs. Decision Making: Defining Your Evaluation Goals
Before diving into specific metrics, it’s crucial to understand the two key stages in utilizing machine learning models: predicting and decision making.
Predicting: In most real-world scenarios, the response variable (what you’re trying to predict) isn’t a deterministic function of the features. Instead, it follows a probability distribution. Your goal in prediction can be to estimate this entire distribution (probabilistic prediction) or, more commonly in scikit-learn, to produce a point prediction. Point predictions focus on a specific property of the distribution, such as the mean, median, or a particular quantile of the response variable, conditional on the input features.
Decision Making: Decision making builds upon prediction, particularly in classification tasks. Here, the probabilistic output of a model (like predict_proba in scikit-learn) is transformed into a concrete action. For instance, predicting the probability of rain leads to the decision of whether or not to bring an umbrella. In scikit-learn, the predict method of classifiers handles this decision-making step.
Choosing the right scoring function is paramount and directly ties to your prediction goals.
Consistent Scoring Functions: The Truth Serum for Model Evaluation
For effective model testing, you need scoring functions that are strictly consistent. A strictly consistent scoring function accurately measures the “distance” between your model’s predictions (y_pred
) and the actual, observed values (y_true
). In classification, these are known as strictly proper scoring rules.
Consistent scoring functions act like a “truth serum” for your models. They ensure that “truth telling,” or accurate prediction, is always the optimal strategy in expectation. By using a strictly consistent scoring function, you guarantee that improving your model’s prediction accuracy will directly translate to a better score.
Ideally, the same strictly consistent scoring function should be used both for training your model (as the loss function) and for evaluating its performance in model selection and comparison. This ensures that your optimization during training is aligned with your evaluation goals.
For regression tasks in scikit-learn, the predict
method is typically used for predictions. In classification, predict_proba
is often employed to obtain probability estimates, which are then used by consistent scoring functions.
Strictly Consistent Scoring Functions: A Practical List
Here’s a table outlining some of the most relevant statistical functionals and their corresponding strictly consistent scoring functions available in scikit-learn for practical tasks:

Functional | Scoring/Loss Function | Response y | Prediction Type |
---|---|---|---|
Classification | | | |
Mean | Brier Score | Multi-class | predict_proba |
Mean | Log Loss | Multi-class | predict_proba |
Mode | Zero-One Loss | Multi-class | predict, categorical |
Regression | | | |
Mean | Squared Error | All reals | predict, all reals |
Mean | Poisson Deviance | Non-negative | predict, strictly positive |
Mean | Gamma Deviance | Strictly positive | predict, strictly positive |
Mean | Tweedie Deviance | Depends on power | predict, depends on power |
Median | Absolute Error | All reals | predict, all reals |
Quantile | Pinball Loss | All reals | predict, all reals |
Mode | No consistent one exists | Reals | |
Note:
- The Brier score is essentially the squared error when applied to classification tasks.
- The zero-one loss is consistent for the mode but not strictly consistent. It is equivalent to (1 – accuracy), providing the same ranking but different score values.
- R² score provides the same ranking as squared error.
Practical Example: Network Reliability Testing
Consider a network reliability engineering scenario. As a network provider, you aim to guarantee network connection stability, promising customers no connection disruptions longer than 1 minute for at least 99% of the days. To achieve this, you need to predict the 99% quantile of the longest connection interruption duration daily.
In this case, the target functional is the 99% quantile. Referring to the table, the pinball loss is the appropriate scoring function. You would then use the pinball loss for both model training (e.g., using HistGradientBoostingRegressor(loss="quantile", quantile=0.99)) and model evaluation (mean_pinball_loss(..., alpha=0.99)). This ensures consistency in your objective throughout the model development process.
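As a rough sketch of that workflow, assuming synthetic placeholder data in place of the real daily features and interruption durations:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_pinball_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for daily features and longest-interruption durations
rng = np.random.RandomState(0)
X = rng.uniform(size=(1000, 3))
y = rng.exponential(scale=1 + X[:, 0])  # skewed, non-negative target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train with the same quantile objective that will be used for evaluation
model = HistGradientBoostingRegressor(loss="quantile", quantile=0.99)
model.fit(X_train, y_train)

# Evaluate with the matching pinball loss (alpha equals the target quantile)
print(mean_pinball_loss(y_test, model.predict(X_test), alpha=0.99))
```

Because the same 0.99 quantile drives both the training loss and the evaluation metric, a lower test pinball loss directly reflects a better estimate of the 99% quantile.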
Scikit-learn’s Scoring API: Three Ways to Evaluate
Scikit-learn offers three distinct APIs for evaluating the quality of your model’s predictions, providing flexibility and control over your testing process:
- Estimator scoring method: Models often have a default scoring method (estimator.score) suitable for basic evaluation.
- Scoring parameter: Tools like cross-validation and grid search utilize a scoring parameter for flexible metric specification.
- Metric functions: The sklearn.metrics module provides standalone functions for specific metric calculations.
For baseline comparisons, especially in early stages of model testing, dummy estimators in scikit-learn are invaluable. They provide a reference point by implementing simple prediction strategies, allowing you to quickly assess if your model offers genuine improvement over naive approaches.
Defining Model Evaluation Rules with the scoring Parameter
Many scikit-learn tools, particularly those for model selection and evaluation that employ cross-validation (like GridSearchCV, validation_curve, and LogisticRegressionCV), use a scoring parameter. This parameter dictates which metric is used to evaluate the estimators.
The scoring parameter can be specified in several ways:
- None: Uses the estimator's default scoring method (estimator.score).
- String name: For common metrics, use predefined string identifiers (e.g., 'accuracy', 'f1', 'neg_mean_squared_error').
- Callable object: For more complex or custom metrics, provide a callable function.
Some tools also support evaluating multiple metrics simultaneously, enhancing the depth of your model testing.
String Name Scorers: Quick Access to Common Metrics
For most common evaluation needs, you can use string names to specify scorer objects via the scoring
parameter. Scikit-learn follows the convention that higher scores are better. For metrics that measure error (where lower is better, like mean_squared_error
), negated versions are available (e.g., ‘neg_mean_squared_error’) to adhere to this convention.
String Name Scorers in Scikit-learn:

Scoring String Name | Function | Comment |
---|---|---|
Classification | | |
'accuracy' | metrics.accuracy_score | |
'balanced_accuracy' | metrics.balanced_accuracy_score | |
'top_k_accuracy' | metrics.top_k_accuracy_score | |
'average_precision' | metrics.average_precision_score | |
'neg_brier_score' | metrics.brier_score_loss | |
'f1' | metrics.f1_score | for binary targets |
'f1_micro' | metrics.f1_score | micro-averaged |
'f1_macro' | metrics.f1_score | macro-averaged |
'f1_weighted' | metrics.f1_score | weighted average |
'f1_samples' | metrics.f1_score | by multilabel sample |
'neg_log_loss' | metrics.log_loss | requires predict_proba support |
'precision' etc. | metrics.precision_score | suffixes apply as with 'f1' |
'recall' etc. | metrics.recall_score | suffixes apply as with 'f1' |
'jaccard' etc. | metrics.jaccard_score | suffixes apply as with 'f1' |
'roc_auc' | metrics.roc_auc_score | |
'roc_auc_ovr' | metrics.roc_auc_score | |
'roc_auc_ovo' | metrics.roc_auc_score | |
'roc_auc_ovr_weighted' | metrics.roc_auc_score | |
'roc_auc_ovo_weighted' | metrics.roc_auc_score | |
'd2_log_loss_score' | metrics.d2_log_loss_score | |
Clustering | | |
'adjusted_mutual_info_score' | metrics.adjusted_mutual_info_score | |
'adjusted_rand_score' | metrics.adjusted_rand_score | |
'completeness_score' | metrics.completeness_score | |
'fowlkes_mallows_score' | metrics.fowlkes_mallows_score | |
'homogeneity_score' | metrics.homogeneity_score | |
'mutual_info_score' | metrics.mutual_info_score | |
'normalized_mutual_info_score' | metrics.normalized_mutual_info_score | |
'rand_score' | metrics.rand_score | |
'v_measure_score' | metrics.v_measure_score | |
Regression | | |
'explained_variance' | metrics.explained_variance_score | |
'neg_max_error' | metrics.max_error | |
'neg_mean_absolute_error' | metrics.mean_absolute_error | |
'neg_mean_squared_error' | metrics.mean_squared_error | |
'neg_root_mean_squared_error' | metrics.root_mean_squared_error | |
'neg_mean_squared_log_error' | metrics.mean_squared_log_error | |
'neg_root_mean_squared_log_error' | metrics.root_mean_squared_log_error | |
'neg_median_absolute_error' | metrics.median_absolute_error | |
'r2' | metrics.r2_score | |
'neg_mean_poisson_deviance' | metrics.mean_poisson_deviance | |
'neg_mean_gamma_deviance' | metrics.mean_gamma_deviance | |
'neg_mean_absolute_percentage_error' | metrics.mean_absolute_percentage_error | |
'd2_absolute_error_score' | metrics.d2_absolute_error_score | |
Example Usage:
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(random_state=0)
cross_val_score(clf, X, y, cv=5, scoring='recall_macro')
Note: Passing an invalid scoring name will raise an InvalidParameterError. You can get a list of all available scorer names using get_scorer_names().
Callable Scorers: Customizing Your Evaluation
For more advanced testing scenarios, scikit-learn allows you to use callable objects as scorers, offering greater flexibility. This can be achieved in two primary ways:
- Adapting predefined metrics with make_scorer: Use make_scorer to create scorer objects from existing metric functions, especially when you need to set specific parameters.
- Creating custom scorer objects from scratch: For ultimate flexibility, define your own scoring function and use make_scorer or create a scorer object directly.
Adapting Predefined Metrics via make_scorer
Functions like fbeta_score, mean_tweedie_deviance, and mean_pinball_loss require additional parameters (e.g., beta, power, alpha) and cannot be directly used as string scorers. make_scorer bridges this gap by allowing you to "wrap" these functions and set their parameters.
Functions Adaptable with make_scorer:

Function | Parameter | Example Usage |
---|---|---|
Classification | | |
metrics.fbeta_score | beta | make_scorer(fbeta_score, beta=2) |
Regression | | |
metrics.mean_tweedie_deviance | power | make_scorer(mean_tweedie_deviance, power=1.5) |
metrics.mean_pinball_loss | alpha | make_scorer(mean_pinball_loss, alpha=0.95) |
metrics.d2_tweedie_score | power | make_scorer(d2_tweedie_score, power=1.5) |
metrics.d2_pinball_score | alpha | make_scorer(d2_pinball_score, alpha=0.95) |
Example: Creating a scorer for fbeta_score with beta=2:
from sklearn.metrics import fbeta_score, make_scorer
ftwo_scorer = make_scorer(fbeta_score, beta=2)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]},
scoring=ftwo_scorer, cv=5)
Remember that functions ending in _score are maximized (higher is better), while those ending in _error, _loss, or _deviance are minimized (lower is better). When using make_scorer with loss functions, set greater_is_better=False.
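For example, a deviance-based loss might be wrapped like this (a minimal sketch; the power value of 1.5 is just an illustrative choice):

```python
from sklearn.metrics import make_scorer, mean_tweedie_deviance

# Wrap a loss function: the resulting scorer negates the value so that
# "higher is better" still holds during model selection
tweedie_scorer = make_scorer(
    mean_tweedie_deviance, power=1.5, greater_is_better=False
)
```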
Creating a Custom Scorer Object
For highly specific testing needs, you can create fully custom scorer objects.
Custom Scorer Objects using make_scorer
make_scorer is a powerful tool for building custom scorers from simple Python functions. Key parameters include:
- score_func: Your Python function to calculate the score.
- greater_is_better: Indicates whether your function returns a score (True, default) or a loss (False). For losses, the output is negated to align with scikit-learn's "higher is better" convention.
- response_method: For classification metrics, specify whether your function requires probability estimates ("predict_proba") or decision values ("decision_function"). You can also provide a list of methods to try.
- `**kwargs`: Additional parameters to pass to your scoring function.
Example: Creating a custom loss function and scorer:
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier
def my_custom_loss_func(y_true, y_pred):
diff = np.abs(y_true - y_pred).max()
return np.log1p(diff)
score = make_scorer(my_custom_loss_func, greater_is_better=False)
X = [[1], [1]]
y = [0, 1]
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf = clf.fit(X, y)
my_custom_loss_func(y, clf.predict(X))
score(clf, X, y)
Custom Scorer Objects from Scratch
For maximum control, you can create scorer objects from the ground up. A scorer callable must adhere to the following protocol; a minimal sketch appears after the list.
- Signature: Accept parameters (estimator, X, y), where estimator is the model, X is validation data, and y is the ground truth (or None for unsupervised).
- Return value: Return a float representing the model's prediction quality on X relative to y. Higher values should indicate better performance. Negate loss values to adhere to this convention.
- Advanced (metadata routing): For scorers needing metadata, implement get_metadata_routing and set_score_request methods.
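Here is a minimal sketch of such a callable; the name max_error_scorer and the negated worst-case error are illustrative choices, not part of scikit-learn:

```python
import numpy as np

def max_error_scorer(estimator, X, y):
    """Scorer following the (estimator, X, y) protocol.

    Returns the negated worst-case absolute error so that higher
    values indicate better predictions.
    """
    y_pred = estimator.predict(X)
    return -float(np.max(np.abs(np.asarray(y) - y_pred)))

# Usage sketch: cross_val_score(model, X, y, scoring=max_error_scorer)
```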
Using Custom Scorers with n_jobs > 1
For robust parallel processing with custom scorers, especially when using n_jobs > 1, it's best to define your custom scoring function in a separate Python module and import it. This approach is more reliable across different joblib backends.
Example: Importing a custom scoring function from custom_scorer_module.py:
from custom_scorer_module import custom_scoring_function
from sklearn.model_selection import cross_val_score
cross_val_score(model, X_train, y_train,
scoring=make_scorer(custom_scoring_function, greater_is_better=False),
cv=5, n_jobs=-1)
Multiple Metric Evaluation: A Holistic View
Scikit-learn allows you to evaluate multiple metrics simultaneously in GridSearchCV, RandomizedSearchCV, and cross_validate, providing a more comprehensive testing picture.
You can specify multiple scoring metrics in three ways:
- List of strings: Provide a list of metric names as strings:

      scoring = ['accuracy', 'precision']

- Dictionary mapping names to scorers: Create a dictionary where keys are scorer names and values are either scorer functions or predefined metric strings:

      from sklearn.metrics import accuracy_score, make_scorer
      scoring = {'accuracy': make_scorer(accuracy_score), 'prec': 'precision'}

- Callable returning a dictionary: Define a callable function that computes and returns a dictionary of scores:

      from sklearn.model_selection import cross_validate
      from sklearn.metrics import confusion_matrix
      from sklearn.datasets import make_classification
      from sklearn.svm import LinearSVC

      X, y = make_classification(n_classes=2, random_state=0)
      svm = LinearSVC(random_state=0)

      def confusion_matrix_scorer(clf, X, y):
          y_pred = clf.predict(X)
          cm = confusion_matrix(y, y_pred)
          return {'tn': cm[0, 0], 'fp': cm[0, 1],
                  'fn': cm[1, 0], 'tp': cm[1, 1]}

      cv_results = cross_validate(svm, X, y, cv=5,
                                  scoring=confusion_matrix_scorer)
      print(cv_results['test_tp'])
      print(cv_results['test_fn'])
Classification Metrics: Evaluating Classifier Performance
The sklearn.metrics
module is rich with functions for evaluating classification performance. These metrics assess different facets of your classifier’s effectiveness, from overall accuracy to nuanced measures like precision and recall.
Some metrics are tailored for binary classification, while others extend to multiclass and multilabel scenarios. Many implementations support sample weighting via the sample_weight
parameter.
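As a quick illustration of sample weighting (the weights below are arbitrary):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# Unweighted accuracy: 3 of 4 samples correct
print(accuracy_score(y_true, y_pred))
# Weighted accuracy: the misclassified third sample counts four times as much
print(accuracy_score(y_true, y_pred, sample_weight=[1, 1, 4, 1]))
```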
Binary Classification Specific Metrics:

Function | Description |
---|---|
precision_recall_curve | Compute precision-recall pairs for different probability thresholds. |
roc_curve | Compute Receiver operating characteristic (ROC). |
class_likelihood_ratios | Compute binary classification positive and negative likelihood ratios. |
det_curve | Compute error rates for different probability thresholds. |
Multiclass Capable Metrics:

Function | Description |
---|---|
balanced_accuracy_score | Compute the balanced accuracy. |
cohen_kappa_score | Compute Cohen's kappa: a statistic that measures inter-annotator agreement. |
confusion_matrix | Compute confusion matrix to evaluate classification accuracy. |
hinge_loss | Average hinge loss (non-regularized). |
matthews_corrcoef | Compute the Matthews correlation coefficient (MCC). |
roc_auc_score | Compute Area Under the ROC Curve (ROC AUC) from prediction scores. |
top_k_accuracy_score | Top-k Accuracy classification score. |
Multilabel Capable Metrics:

Function | Description |
---|---|
accuracy_score | Accuracy classification score. |
classification_report | Build a text report showing the main classification metrics. |
f1_score | Compute the F1 score, also known as balanced F-score or F-measure. |
fbeta_score | Compute the F-beta score. |
hamming_loss | Compute the average Hamming loss. |
jaccard_score | Jaccard similarity coefficient score. |
log_loss | Log loss, aka logistic loss or cross-entropy loss. |
multilabel_confusion_matrix | Compute a confusion matrix for each class or sample. |
precision_recall_fscore_support | Compute precision, recall, F-measure and support for each class. |
precision_score | Compute the precision. |
recall_score | Compute the recall. |
roc_auc_score | Compute Area Under the ROC Curve (ROC AUC) from prediction scores. |
zero_one_loss | Zero-one classification loss. |
d2_log_loss_score | D² score function, fraction of log loss explained. |
Binary and Multilabel (Not Multiclass) Metrics:

Function | Description |
---|---|
average_precision_score | Compute average precision (AP) from prediction scores. |
From Binary to Multiclass and Multilabel: Adapting Metrics
Many classification metrics are inherently designed for binary classification. Extending them to multiclass or multilabel problems involves treating the data as a collection of binary problems, one for each class. The average parameter in scikit-learn metrics controls how these binary metric calculations are averaged across classes.
- "macro": Calculates the mean of binary metrics per class, giving equal weight to each class. Useful for highlighting performance on infrequent classes.
- "weighted": Averages binary metrics, weighting each class's score by its prevalence in the true data. Accounts for class imbalance.
- "micro": Gives each sample-class pair equal weight. Sums up dividends and divisors across classes to compute an overall quotient. Suitable for multilabel settings and multiclass problems where a majority class should be de-emphasized.
- "samples": Applies only to multilabel problems. Calculates the metric for each sample and returns the sample_weight-weighted average.
- average=None: Returns an array of scores for each class, providing a class-wise breakdown of performance.
Multiclass data is typically provided as an array of class labels, while multilabel data is represented as an indicator matrix.
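A small sketch of how the averaging choice changes a multiclass F1 score (the labels are arbitrary):

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print(f1_score(y_true, y_pred, average="micro"))     # global TP/FP/FN counts
print(f1_score(y_true, y_pred, average="weighted"))  # weighted by class support
print(f1_score(y_true, y_pred, average=None))        # one score per class
```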
Accuracy Score: Overall Correctness
The accuracy_score
function computes the accuracy, representing the fraction or count of correctly classified samples. In multilabel classification, it returns the subset accuracy, requiring the entire set of predicted labels to perfectly match the true set.
Formula:
\[\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)\]
Example:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
accuracy_score(y_true, y_pred, normalize=False) # Count of correct predictions
Top-k Accuracy Score: Considering Top Predictions
The top_k_accuracy_score
is a generalization of accuracy, where a prediction is considered correct if the true label is among the k
highest predicted scores. accuracy_score
is a special case where k=1
. This metric is useful when you care about the model’s ability to rank the correct class highly, even if it’s not the absolute top prediction.
Formula:
\[\texttt{top-k accuracy}(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \sum_{j=1}^{k} 1(\hat{f}_{i,j} = y_i)\]
Example:
import numpy as np
from sklearn.metrics import top_k_accuracy_score
y_true = np.array([0, 1, 2, 2])
y_score = np.array([[0.5, 0.2, 0.2], [0.3, 0.4, 0.2],
[0.2, 0.4, 0.3], [0.7, 0.2, 0.1]])
top_k_accuracy_score(y_true, y_score, k=2)
top_k_accuracy_score(y_true, y_score, k=2, normalize=False) # Count of top-k correct predictions
Balanced Accuracy Score: Addressing Class Imbalance
The balanced_accuracy_score
is crucial for imbalanced datasets. It’s the macro-average of recall scores per class, or equivalently, accuracy where each sample is weighted by the inverse prevalence of its true class. This metric provides a more realistic performance estimate when classes are not equally represented.
Binary Case Formula:
\[\texttt{balanced-accuracy} = \frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)\]
Multiclass Case Formula:
\[\texttt{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i) \hat{w}_i\]
where \[\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i) w_j}\]
Example:
from sklearn.metrics import balanced_accuracy_score
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0]
balanced_accuracy_score(y_true, y_pred)
Cohen’s Kappa: Inter-Annotator Agreement
The cohen_kappa_score
computes Cohen’s kappa, a statistic measuring inter-annotator agreement. It’s especially useful for comparing labelings from different human annotators, not just classifier performance against ground truth. Kappa scores range from -1 to 1, with scores above 0.8 generally considered good agreement.
Example:
from sklearn.metrics import cohen_kappa_score
labeling1 = [2, 0, 2, 2, 0, 1]
labeling2 = [0, 0, 2, 2, 0, 2]
cohen_kappa_score(labeling1, labeling2)
Confusion Matrix: Detailed Breakdown of Predictions
The confusion_matrix
is a powerful tool for visualizing classification performance. It presents a matrix where rows represent true classes and columns represent predicted classes. Element (i, j) shows the count of samples actually in class i but predicted as class j.
Example:
from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
You can visually represent confusion matrices using ConfusionMatrixDisplay. Normalization options (normalize) allow displaying ratios instead of raw counts.
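A minimal sketch of the display helper, reusing the labels above (plotting requires matplotlib):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Plot the confusion matrix normalized over the true (row) labels
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, normalize="true")
plt.show()
```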
Classification Report: Summary of Key Metrics
The classification_report
generates a text report summarizing key classification metrics: precision, recall, F1-score, and support for each class. It’s a convenient way to get a quick overview of your classifier’s performance across different classes.
Example:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
Hamming Loss: Measuring Label Mismatches
The hamming_loss
computes the average Hamming loss or Hamming distance between sets of samples. In multilabel classification, it quantifies the fraction of incorrectly predicted labels.
Formula:
\[L_{\text{Hamming}}(y, \hat{y}) = \frac{1}{n_\text{samples} \times n_\text{labels}} \sum_{i=0}^{n_\text{samples}-1} \sum_{j=0}^{n_\text{labels}-1} 1(\hat{y}_{i,j} \neq y_{i,j})\]
Example:
from sklearn.metrics import hamming_loss
y_pred = [1, 2, 3, 4]
y_true = [2, 2, 3, 4]
hamming_loss(y_true, y_pred)
Precision, Recall, and F-Measures: Balancing Error Types
Precision, recall, and F-measures are fundamental metrics for evaluating classification performance, particularly in binary and multilabel settings.
- Precision: Measures the classifier’s ability to avoid false positives (labeling negative samples as positive).
- Recall: Measures the classifier’s ability to find all positive samples (avoiding false negatives).
- F-measure: A weighted harmonic mean of precision and recall, balancing both metrics. F1-score (F-measure with beta=1) gives equal weight to precision and recall.
The precision_recall_curve
computes precision-recall pairs for varying decision thresholds, allowing you to visualize the trade-off between precision and recall. The average_precision_score
summarizes the precision-recall curve into a single average precision (AP) score.
Formulas (Binary Classification):
\[\text{precision} = \frac{\text{tp}}{\text{tp} + \text{fp}}\]
\[\text{recall} = \frac{\text{tp}}{\text{tp} + \text{fn}}\]
\[F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2 \, \text{precision} + \text{recall}}\]
Example (Binary Classification):
from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
metrics.precision_score(y_true, y_pred)
metrics.recall_score(y_true, y_pred)
metrics.f1_score(y_true, y_pred)
metrics.fbeta_score(y_true, y_pred, beta=0.5)
metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5)
For multiclass and multilabel classification, these metrics can be averaged across labels using the average
parameter (see above).
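A sketch of the threshold-based tools mentioned above, applied to arbitrary decision scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Precision-recall pairs for every decision threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(precision, recall, thresholds)

# Single-number summary of the curve
print(average_precision_score(y_true, y_scores))
```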
Jaccard Similarity Coefficient Score: Set-Based Accuracy
The jaccard_score
computes the average Jaccard similarity coefficient, also known as the Jaccard index. It measures the similarity between sets of predicted and true labels, defined as the size of the intersection divided by the size of the union of the label sets.
Formula:
\[J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}\]
Example:
import numpy as np
from sklearn.metrics import jaccard_score
y_true = np.array([[0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 1, 1], [1, 0, 0]])
jaccard_score(y_true[0], y_pred[0])
jaccard_score(y_true, y_pred, average="micro") # Micro-averaged Jaccard score
Hinge Loss: Margin-Based Loss
The hinge_loss
computes the average hinge loss, a loss function commonly used in support vector machines (SVMs). It’s a one-sided metric that focuses on prediction errors, particularly useful for margin-maximizing classifiers.
Binary Case Formula:
\[L_\text{Hinge}(y, w) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max\left\{1 - w_i y_i, 0\right\}\]
Multiclass Case Formula (Crammer & Singer):
\[L_\text{Hinge}(y, w) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max\left\{1 + \hat{w}_{i, y_i} - w_{i, y_i}, 0\right\}\]
Example:
from sklearn import svm
from sklearn.metrics import hinge_loss
X = [[0], [1]]
y = [-1, 1]
est = svm.LinearSVC(random_state=0)
est.fit(X, y)
pred_decision = est.decision_function([[-2], [3], [0.5]])
hinge_loss([-1, 1, 1], pred_decision)
Log Loss (Cross-Entropy Loss): Probabilistic Loss
The log_loss
, also known as logistic regression loss or cross-entropy loss, is defined on probability estimates. It’s commonly used in logistic regression and neural networks, evaluating the probabilistic outputs (predict_proba
) of a classifier.
Binary Case Formula:
\[L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y \log(p) + (1 - y) \log(1 - p))\]
Multiclass Case Formula:
\[L_{\log}(Y, P) = -\log \operatorname{Pr}(Y|P) = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}\]
Example:
from sklearn.metrics import log_loss
y_true = [0, 0, 1, 1]
y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
log_loss(y_true, y_pred)
Matthews Correlation Coefficient (MCC): Balanced Binary Metric
The matthews_corrcoef
computes the Matthews correlation coefficient (MCC) for binary classification. MCC is considered a balanced measure, even with very different class sizes. It ranges from -1 to +1, with +1 representing perfect prediction, 0 random prediction, and -1 inverse prediction.
Binary Case Formula:
\[MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}\]
Multiclass Case Formula:
\[MCC = \frac{c \times s - \sum_{k}^{K} p_k \times t_k}{\sqrt{(s^2 - \sum_{k}^{K} p_k^2) \times (s^2 - \sum_{k}^{K} t_k^2)}}\]
Example:
from sklearn.metrics import matthews_corrcoef
y_true = [+1, +1, +1, -1]
y_pred = [+1, -1, +1, +1]
matthews_corrcoef(y_true, y_pred)
Multi-label Confusion Matrix: Class-Wise Breakdown
The multilabel_confusion_matrix
computes class-wise or sample-wise multilabel confusion matrices. It treats multiclass data as multilabel for compatibility with binary classification metrics. This function provides a confusion matrix for each class, allowing detailed analysis of performance per label.
Example:
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
multilabel_confusion_matrix(y_true, y_pred) # Class-wise matrices
multilabel_confusion_matrix(y_true, y_pred, samplewise=True) # Sample-wise matrices
Receiver Operating Characteristic (ROC): Visualizing Trade-offs
The roc_curve
computes the receiver operating characteristic (ROC) curve, a graphical plot showing the performance of a binary classifier as its discrimination threshold varies. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different thresholds, visualizing the trade-off between sensitivity and specificity.
The roc_auc_score
function computes the Area Under the ROC Curve (ROC AUC or AUROC), summarizing the ROC curve into a single, interpretable score.
Example:
import numpy as np
from sklearn.metrics import roc_curve
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
fpr
tpr
thresholds
ROC AUC is applicable to binary, multiclass (using one-vs-one or one-vs-rest strategies), and multilabel classification.
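A sketch of multiclass ROC AUC computed from probability estimates (the classifier choice here is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# One-vs-rest, macro-averaged ROC AUC over the three classes
print(roc_auc_score(y_test, proba, multi_class="ovr"))
```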
Detection Error Tradeoff (DET): Focusing on Error Tradeoffs
The det_curve
computes the detection error tradeoff (DET) curve. DET curves are similar to ROC curves but plot False Negative Rate (FNR) against False Positive Rate (FPR), often using a non-linear scale. DET curves emphasize the critical operating region and can be more linear than ROC curves, making it easier to visually compare classifiers, especially where error tradeoffs are important.
Zero-One Loss: Strict Accuracy
The zero_one_loss
computes the sum or average of the 0-1 classification loss. It measures the fraction of samples misclassified (or the count of misclassifications if normalize=False
). In multilabel classification, it requires a perfect match of predicted and true label sets for a sample to be considered correctly classified.
Formula:
\[L_{0-1}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i \neq y_i)\]
Example:
from sklearn.metrics import zero_one_loss
y_pred = [1, 2, 3, 4]
y_true = [2, 2, 3, 4]
zero_one_loss(y_true, y_pred)
zero_one_loss(y_true, y_pred, normalize=False) # Count of misclassifications
Brier Score Loss: Probabilistic Accuracy
The brier_score_loss
computes the Brier score for binary classes. It measures the accuracy of probabilistic predictions by calculating the mean squared error between predicted probabilities and actual outcomes (0 or 1). Lower Brier scores indicate better probabilistic accuracy.
Formula:
\[BS = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (y_i - p_i)^2\]
Example:
import numpy as np
from sklearn.metrics import brier_score_loss
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.4])
brier_score_loss(y_true, y_prob)
Class Likelihood Ratios: Prevalence-Invariant Metrics
The class_likelihood_ratios
function computes positive and negative likelihood ratios (LR+ and LR-) for binary classification. These metrics are prevalence-invariant, making them useful for generalizing model performance across populations with varying class imbalances. They represent the ratio of post-test to pre-test odds, offering insights into how predictions modify the probability of belonging to a class.
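A minimal sketch of the likelihood-ratio computation on arbitrary labels:

```python
from sklearn.metrics import class_likelihood_ratios

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

# LR+ above 1 and LR- below 1 indicate predictions that carry information
pos_lr, neg_lr = class_likelihood_ratios(y_true, y_pred)
print(pos_lr, neg_lr)
```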
D² Score for Classification: Deviance Explained
The D² score generalizes R² for classification, measuring the fraction of deviance explained by the model. The d2_log_loss_score
function implements D² using log loss as the deviance measure. A higher D² score (closer to 1.0) indicates a better fit.
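A minimal sketch on hand-written probability estimates:

```python
from sklearn.metrics import d2_log_loss_score

y_true = [0, 0, 1, 1]
y_proba = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]

# Fraction of log-loss deviance explained relative to a constant baseline
print(d2_log_loss_score(y_true, y_proba))
```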
Multilabel Ranking Metrics: Evaluating Label Order
In multilabel learning, where samples can have multiple true labels, ranking metrics assess how well the model ranks true labels higher than false labels.
Coverage Error: Minimum Labels to Cover True Labels
The coverage_error
computes the average number of labels that need to be included in the top-ranked predictions to cover all true labels. Lower coverage error is better, with the ideal value being the average number of true labels per sample.
Formula:
\[coverage(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max_{j: y_{ij} = 1} \text{rank}_{ij}\]
Example:
import numpy as np
from sklearn.metrics import coverage_error
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
coverage_error(y_true, y_score)
Label Ranking Average Precision (LRAP): Precision of Label Ranking
The label_ranking_average_precision_score
computes label ranking average precision (LRAP). It averages over samples the fraction of higher-ranked labels that are true labels for each ground truth label. Higher LRAP scores (closer to 1) are better, indicating better label ranking.
Formula:
\[LRAP(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \frac{1}{\|y_i\|_0} \sum_{j: y_{ij} = 1} \frac{|\mathcal{L}_{ij}|}{\text{rank}_{ij}}\]
Example:
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
label_ranking_average_precision_score(y_true, y_score)
Ranking Loss: Incorrect Label Pair Orderings
The label_ranking_loss
computes the ranking loss, averaging over samples the number of incorrectly ordered label pairs (true labels ranked lower than false labels). Lower ranking loss (closer to 0) is better.
Formula:
\[ranking\_loss(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \frac{1}{\|y_i\|_0 (n_\text{labels} - \|y_i\|_0)} \left|\left\{(k, l): \hat{f}_{ik} \leq \hat{f}_{il}, y_{ik} = 1, y_{il} = 0 \right\}\right|\]
Example:
import numpy as np
from sklearn.metrics import label_ranking_loss
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
label_ranking_loss(y_true, y_score)
Normalized Discounted Cumulative Gain (NDCG): Ranking Quality
Discounted Cumulative Gain (DCG) and Normalized Discounted Cumulative Gain (NDCG), implemented in dcg_score
and ndcg_score
, evaluate ranking quality by comparing predicted order to ground-truth relevance scores. NDCG is DCG normalized to be between 0 and 1, with higher values indicating better ranking quality. It’s particularly useful in information retrieval and recommendation systems where graded relevance scores are available.
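A minimal sketch with graded relevance (the relevance grades and scores are arbitrary):

```python
import numpy as np
from sklearn.metrics import dcg_score, ndcg_score

# One "query" with graded relevance for five items, plus the model's ranking scores
true_relevance = np.asarray([[10, 0, 0, 1, 5]])
scores = np.asarray([[0.1, 0.2, 0.3, 4, 70]])

print(dcg_score(true_relevance, scores))   # unnormalized gain
print(ndcg_score(true_relevance, scores))  # normalized to [0, 1]
```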
Regression Metrics: Evaluating Regression Performance
The sklearn.metrics
module offers a range of functions for evaluating regression models. Many of these functions support multioutput regression, with the multioutput
parameter controlling how scores are averaged across multiple target variables.
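A small sketch of the multioutput parameter for a two-target problem:

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]

# One error per output, or a single averaged score
print(mean_squared_error(y_true, y_pred, multioutput="raw_values"))
print(r2_score(y_true, y_pred, multioutput="uniform_average"))
```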
R² Score (Coefficient of Determination): Variance Explained
The r2_score
computes the coefficient of determination (R²), representing the proportion of variance in the target variable explained by the model. R² ranges from negative infinity to 1.0, with 1.0 being the best possible score. It provides a measure of goodness of fit, indicating how well unseen samples are likely to be predicted.
Formula:
\[R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\]
Example:
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
r2_score(y_true, y_pred)
Mean Absolute Error (MAE): Average Absolute Deviation
The mean_absolute_error
computes the mean absolute error (MAE), representing the average absolute difference between predicted and true values. MAE is robust to outliers and is measured in the same units as the target variable. Lower MAE values are better.
Formula:
\[\text{MAE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \left| y_i - \hat{y}_i \right|\]
Example:
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)
Mean Squared Error (MSE): Average Squared Deviation
The mean_squared_error
computes the mean squared error (MSE), representing the average squared difference between predicted and true values. MSE is more sensitive to outliers than MAE due to the squaring operation. Lower MSE values are better. Root Mean Squared Error (RMSE), the square root of MSE, is also commonly used and available via root_mean_squared_error
.
Formula:
\[\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (y_i - \hat{y}_i)^2\]
Example:
from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_squared_error(y_true, y_pred)
Mean Squared Logarithmic Error (MSLE): Relative Error Focus
The mean_squared_log_error
computes the mean squared logarithmic error (MSLE). MSLE is particularly useful when targets have exponential growth, penalizing under-predictions more heavily than over-predictions. It is less sensitive to outliers compared to MSE.
Formula:
\[\text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (\log_e(1 + y_i) - \log_e(1 + \hat{y}_i))^2\]
Example:
from sklearn.metrics import mean_squared_log_error
y_true = [3, 5, 2.5, 7]
y_pred = [2.5, 5, 4, 8]
mean_squared_log_error(y_true, y_pred)
Root Mean Squared Logarithmic Error (RMSLE) is available via root_mean_squared_log_error
.
Mean Absolute Percentage Error (MAPE): Relative Error as Percentage
The mean_absolute_percentage_error
(MAPE) measures the mean absolute percentage error, providing a relative error metric as a percentage. MAPE is scale-invariant, meaning it’s not affected by global scaling of the target variable. It is sensitive to relative errors, making it useful when relative accuracy is important.
Formula:
\[\text{MAPE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \frac{\left| y_i - \hat{y}_i \right|}{\max(\epsilon, \left| y_i \right|)}\]
Example:
from sklearn.metrics import mean_absolute_percentage_error
y_true = [1, 10, 1e6]
y_pred = [0.9, 15, 1.2e6]
mean_absolute_percentage_error(y_true, y_pred)
Median Absolute Error (MedAE): Robust to Outliers
The median_absolute_error
is robust to outliers, calculating the median of absolute errors instead of the mean. MedAE provides a robust measure of typical error magnitude, less influenced by extreme values.
Formula:
\[\text{MedAE}(y, \hat{y}) = \text{median}(\mid y_1 - \hat{y}_1 \mid, \ldots, \mid y_n - \hat{y}_n \mid)\]
Example:
from sklearn.metrics import median_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
median_absolute_error(y_true, y_pred)
Max Error: Worst-Case Error
The max_error
computes the maximum residual error, capturing the worst-case error between predicted and true values. Max error indicates the largest prediction error the model makes.
Formula:
\[\text{Max Error}(y, \hat{y}) = \max(| y_i - \hat{y}_i |)\]
Example:
from sklearn.metrics import max_error
y_true = [3, 2, 7, 1]
y_pred = [9, 2, 7, 1]
max_error(y_true, y_pred)
Explained Variance Score: Variance Explained in Regression
The explained_variance_score
computes the explained variance regression score. It measures the proportion of variance in the target variable that is explained by the model. The best possible score is 1.0, with lower values indicating worse performance.
Formula:
\[explained\_variance(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y} \}}{Var\{ y \}}\]
Example:
from sklearn.metrics import explained_variance_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
explained_variance_score(y_true, y_pred)
Mean Poisson, Gamma, and Tweedie Deviance: Deviance-Based Loss
The mean_tweedie_deviance
computes the mean Tweedie deviance error, a metric that elicits predicted expectation values for regression targets. It generalizes several common regression loss functions based on the power
parameter:
- power=0: Normal distribution (squared error)
- power=1: Poisson distribution
- power=2: Gamma distribution
Tweedie deviance is useful for modeling data with different variance structures. Higher power
values give less weight to extreme deviations.
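A sketch showing how the power parameter changes the penalty on the same predictions (the values are arbitrary):

```python
from sklearn.metrics import mean_tweedie_deviance

y_true = [1.0, 2.0, 100.0]
y_pred = [1.5, 2.5, 90.0]

# power=0: squared error, power=1: Poisson deviance, power=2: Gamma deviance
for power in (0, 1, 2):
    print(power, mean_tweedie_deviance(y_true, y_pred, power=power))
```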
Pinball Loss: Quantile Regression Evaluation
The mean_pinball_loss
evaluates quantile regression models. It measures the deviation from a specified quantile (alpha
). When alpha=0.5
, it’s equivalent to half of the MAE. Pinball loss is asymmetric, penalizing deviations differently depending on whether the prediction is above or below the true value, according to the chosen quantile.
Formula:
\[\text{pinball}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \alpha \max(y_i - \hat{y}_i, 0) + (1 - \alpha) \max(\hat{y}_i - y_i, 0)\]
Example:
from sklearn.metrics import mean_pinball_loss
y_true = [1, 2, 3]
mean_pinball_loss(y_true, [0, 2, 3], alpha=0.1)
mean_pinball_loss(y_true, [1, 2, 4], alpha=0.9)
D² Score for Regression: Deviance Explained Generalization
The D² score generalizes R² for regression by replacing squared error with a deviance of choice. d2_tweedie_score, d2_pinball_score, and d2_absolute_error_score implement D² using Tweedie deviance, pinball loss, and mean absolute error, respectively.
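A minimal sketch of the deviance-based D² variants (the alpha and power values are arbitrary):

```python
from sklearn.metrics import (
    d2_absolute_error_score,
    d2_pinball_score,
    d2_tweedie_score,
)

y_true = [3, 2, 7, 1]
y_pred = [2.5, 2.0, 7.5, 1.5]

print(d2_absolute_error_score(y_true, y_pred))
print(d2_pinball_score(y_true, y_pred, alpha=0.9))
print(d2_tweedie_score(y_true, y_pred, power=1))
```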
Visual Evaluation of Regression Models: PredictionErrorDisplay
The PredictionErrorDisplay
class provides visual tools for regression model evaluation. It allows plotting predicted vs. actual values and residuals vs. predicted values, aiding in diagnosing model fit and identifying potential issues like heteroscedasticity or model misspecification.
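A minimal sketch using cross-validated predictions (requires matplotlib; the dataset and estimator are arbitrary choices):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import PredictionErrorDisplay
from sklearn.model_selection import cross_val_predict

X, y = load_diabetes(return_X_y=True)
y_pred = cross_val_predict(Ridge(), X, y, cv=5)

# Predicted vs. actual values; kind="residual_vs_predicted" shows residuals instead
PredictionErrorDisplay.from_predictions(y, y_pred, kind="actual_vs_predicted")
plt.show()
```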
Clustering Metrics: Assessing Unsupervised Grouping
The sklearn.metrics
module also includes metrics for evaluating clustering performance. These metrics assess the quality of unsupervised grouping, both in terms of internal cluster structure and external alignment with ground truth (when available). Refer to the Clustering performance evaluation section for a detailed overview.
Dummy Estimators: Establishing Baselines for Comparison
Dummy estimators in scikit-learn (DummyClassifier
and DummyRegressor
) serve as essential baselines for model testing. They implement simple, rule-based prediction strategies, allowing you to quickly gauge if your more complex models are providing meaningful improvements over naive approaches.
DummyClassifier Strategies:
- stratified: Random predictions respecting the class distribution.
- most_frequent: Always predicts the most frequent class.
- prior: Predicts the class maximizing the prior probability.
- uniform: Uniformly random predictions.
- constant: Always predicts a user-specified constant label.
DummyRegressor Strategies:
- mean: Always predicts the mean of the training targets.
- median: Always predicts the median of the training targets.
- quantile: Predicts a user-specified quantile of the training targets.
- constant: Always predicts a user-specified constant value.
Example: Using DummyClassifier for Baseline Comparison:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
y[y != 1] = -1 # Create imbalanced binary classification problem
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf_svc = SVC(kernel='linear', C=1).fit(X_train, y_train)
print("SVC Accuracy:", clf_svc.score(X_test, y_test))
clf_dummy = DummyClassifier(strategy='most_frequent', random_state=0).fit(X_train, y_train)
print("Dummy Classifier Accuracy:", clf_dummy.score(X_test, y_test))
clf_rbf_svc = SVC(kernel='rbf', C=1).fit(X_train, y_train)
print("RBF SVC Accuracy:", clf_rbf_svc.score(X_test, y_test)) # Improved model
By comparing your model’s performance against dummy estimators, you gain valuable insights into the actual effectiveness of your machine learning approach. If your model barely outperforms a dummy estimator, it signals potential issues with feature engineering, model selection, hyperparameter tuning, or class imbalance that need to be addressed to improve model quality.
This guide provides a comprehensive overview of model testing and evaluation in scikit-learn. By understanding and applying these metrics and tools, you can rigorously assess your models, ensuring they are not only built but also robust and reliable for your intended applications.