Scikit-learn Model Evaluation: A Comprehensive Guide to Testing Your Models

In the realm of machine learning, building a model is only half the battle. The true measure of success lies in rigorously testing and evaluating its performance. Scikit-learn, a cornerstone library for machine learning in Python, provides a rich toolkit for quantifying the quality of your model’s predictions. This guide delves into the essential aspects of model evaluation within scikit-learn, ensuring your models are not just built, but robustly tested and ready for real-world applications.

This comprehensive guide is inspired by statistical decision theory and aims to provide a clear path for choosing and applying the right scoring functions for supervised learning, drawing from insights by Gneiting et al.

Predicting vs. Decision Making: Defining Your Evaluation Goals

Before diving into specific metrics, it’s crucial to understand the two key stages in utilizing machine learning models: predicting and decision making.

Predicting: In most real-world scenarios, the response variable (what you’re trying to predict) isn’t a deterministic function of the features. Instead, it follows a probability distribution. Your goal in prediction can be to estimate this entire distribution (probabilistic prediction) or, more commonly in scikit-learn, to produce a point prediction. Point predictions focus on a specific property of the distribution, such as the mean, median, or a particular quantile of the response variable, conditional on the input features.

Decision Making: Decision making builds upon prediction, particularly in classification tasks. Here, the probabilistic output of a model (like predict_proba in scikit-learn) is transformed into a concrete action. For instance, predicting the probability of rain leads to the decision of whether or not to bring an umbrella. In scikit-learn, the predict method of classifiers handles this decision-making step.

Choosing the right scoring function is paramount and directly ties to your prediction goals.

Consistent Scoring Functions: The Truth Serum for Model Evaluation

For effective model testing, you need scoring functions that are strictly consistent. A strictly consistent scoring function accurately measures the “distance” between your model’s predictions (y_pred) and the actual, observed values (y_true). In classification, these are known as strictly proper scoring rules.

Consistent scoring functions act like a “truth serum” for your models. They ensure that “truth telling,” or accurate prediction, is always the optimal strategy in expectation. By using a strictly consistent scoring function, you guarantee that improving your model’s prediction accuracy will directly translate to a better score.

Ideally, the same strictly consistent scoring function should be used both for training your model (as the loss function) and for evaluating its performance in model selection and comparison. This ensures that your optimization during training is aligned with your evaluation goals.

For regression tasks in scikit-learn, the predict method is typically used for predictions. In classification, predict_proba is often employed to obtain probability estimates, which are then used by consistent scoring functions.

Strictly Consistent Scoring Functions: A Practical List

Here’s a table outlining some of the most relevant statistical functionals and their corresponding strictly consistent scoring functions available in scikit-learn for practical tasks:

Functional | Scoring/Loss Function | Response y | Prediction Type

Classification:
Mean | Brier Score | Multi-class | predict_proba
Mean | Log Loss | Multi-class | predict_proba
Mode | Zero-One Loss | Multi-class | predict, categorical

Regression:
Mean | Squared Error | All reals | predict, all reals
Mean | Poisson Deviance | Non-negative | predict, strictly positive
Mean | Gamma Deviance | Strictly positive | predict, strictly positive
Mean | Tweedie Deviance | Depends on power | predict, depends on power
Median | Absolute Error | All reals | predict, all reals
Quantile | Pinball Loss | All reals | predict, all reals
Mode | No consistent one exists | Reals | (not applicable)

Note:

  1. The Brier score is essentially the squared error when applied to classification tasks.
  2. The zero-one loss is consistent for the mode but not strictly consistent. It is equivalent to (1 – accuracy), providing the same ranking but different score values.
  3. R² score provides the same ranking as squared error.

Practical Example: Network Reliability Testing

Consider a network reliability engineering scenario. As a network provider, you aim to guarantee network connection stability, promising customers no connection disruptions longer than 1 minute for at least 99% of the days. To achieve this, you need to predict the 99% quantile of the longest connection interruption duration daily.

In this case, the target functional is the 99% quantile. Referring to the table, the pinball loss is the appropriate scoring function. You would then use pinball loss for both model training (e.g., using HistGradientBoostingRegressor(loss="quantile", quantile=0.99)) and model evaluation (mean_pinball_loss(..., alpha=0.99)). This ensures consistency in your objective throughout the model development process.
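
A minimal sketch of this workflow, assuming synthetic data in place of real network logs and scikit-learn ≥ 1.1 for the quantile loss; the same 0.99 quantile appears in both the training loss and the evaluation metric:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_pinball_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for daily features and longest interruption durations (seconds).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.exponential(scale=30.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train with the pinball loss at the 0.99 quantile...
model = HistGradientBoostingRegressor(loss="quantile", quantile=0.99)
model.fit(X_train, y_train)

# ...and evaluate with the same strictly consistent scoring function.
mean_pinball_loss(y_test, model.predict(X_test), alpha=0.99)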

Scikit-learn’s Scoring API: Three Ways to Evaluate

Scikit-learn offers three distinct APIs for evaluating the quality of your model’s predictions, providing flexibility and control over your testing process:

  1. Estimator scoring method: Models often have a default scoring method (estimator.score) suitable for basic evaluation.
  2. Scoring parameter: Tools like cross-validation and grid search utilize a scoring parameter for flexible metric specification.
  3. Metric functions: The sklearn.metrics module provides standalone functions for specific metric calculations.
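
A minimal sketch contrasting these three entry points on the same fitted classifier (using the iris dataset and a LogisticRegression model purely as illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

clf.score(X, y)                                       # 1. estimator scoring method
cross_val_score(clf, X, y, cv=5, scoring="accuracy")  # 2. scoring parameter
accuracy_score(y, clf.predict(X))                     # 3. metric function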

For baseline comparisons, especially in early stages of model testing, dummy estimators in scikit-learn are invaluable. They provide a reference point by implementing simple prediction strategies, allowing you to quickly assess if your model offers genuine improvement over naive approaches.

Defining Model Evaluation Rules with the scoring Parameter

Many scikit-learn tools, particularly those for model selection and evaluation that employ cross-validation (like GridSearchCV, validation_curve, and LogisticRegressionCV), use a scoring parameter. This parameter dictates which metric is used to evaluate the estimators.

The scoring parameter can be specified in several ways:

  • None: Uses the estimator’s default scoring method (estimator.score).
  • String name: For common metrics, use predefined string identifiers (e.g., 'accuracy', 'f1', 'neg_mean_squared_error').
  • Callable object: For more complex or custom metrics, provide a callable function.

Some tools also support evaluating multiple metrics simultaneously, enhancing the depth of your model testing.

String Name Scorers: Quick Access to Common Metrics

For most common evaluation needs, you can use string names to specify scorer objects via the scoring parameter. Scikit-learn follows the convention that higher scores are better. For metrics that measure error (where lower is better, like mean_squared_error), negated versions are available (e.g., ‘neg_mean_squared_error’) to adhere to this convention.

String Name Scorers in Scikit-learn:

Scoring String Name | Function | Comment

Classification:
'accuracy' | metrics.accuracy_score
'balanced_accuracy' | metrics.balanced_accuracy_score
'top_k_accuracy' | metrics.top_k_accuracy_score
'average_precision' | metrics.average_precision_score
'neg_brier_score' | metrics.brier_score_loss
'f1' | metrics.f1_score | for binary targets
'f1_micro' | metrics.f1_score | micro-averaged
'f1_macro' | metrics.f1_score | macro-averaged
'f1_weighted' | metrics.f1_score | weighted average
'f1_samples' | metrics.f1_score | by multilabel sample
'neg_log_loss' | metrics.log_loss | requires predict_proba support
'precision' etc. | metrics.precision_score | suffixes apply as with 'f1'
'recall' etc. | metrics.recall_score | suffixes apply as with 'f1'
'jaccard' etc. | metrics.jaccard_score | suffixes apply as with 'f1'
'roc_auc' | metrics.roc_auc_score
'roc_auc_ovr' | metrics.roc_auc_score
'roc_auc_ovo' | metrics.roc_auc_score
'roc_auc_ovr_weighted' | metrics.roc_auc_score
'roc_auc_ovo_weighted' | metrics.roc_auc_score
'd2_log_loss_score' | metrics.d2_log_loss_score

Clustering:
'adjusted_mutual_info_score' | metrics.adjusted_mutual_info_score
'adjusted_rand_score' | metrics.adjusted_rand_score
'completeness_score' | metrics.completeness_score
'fowlkes_mallows_score' | metrics.fowlkes_mallows_score
'homogeneity_score' | metrics.homogeneity_score
'mutual_info_score' | metrics.mutual_info_score
'normalized_mutual_info_score' | metrics.normalized_mutual_info_score
'rand_score' | metrics.rand_score
'v_measure_score' | metrics.v_measure_score

Regression:
'explained_variance' | metrics.explained_variance_score
'neg_max_error' | metrics.max_error
'neg_mean_absolute_error' | metrics.mean_absolute_error
'neg_mean_squared_error' | metrics.mean_squared_error
'neg_root_mean_squared_error' | metrics.root_mean_squared_error
'neg_mean_squared_log_error' | metrics.mean_squared_log_error
'neg_root_mean_squared_log_error' | metrics.root_mean_squared_log_error
'neg_median_absolute_error' | metrics.median_absolute_error
'r2' | metrics.r2_score
'neg_mean_poisson_deviance' | metrics.mean_poisson_deviance
'neg_mean_gamma_deviance' | metrics.mean_gamma_deviance
'neg_mean_absolute_percentage_error' | metrics.mean_absolute_percentage_error
'd2_absolute_error_score' | metrics.d2_absolute_error_score

Example Usage:

from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(random_state=0)
cross_val_score(clf, X, y, cv=5, scoring='recall_macro')

Note: Passing an invalid scoring name will raise an InvalidParameterError. You can get a list of all available scorer names using get_scorer_names().
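
A quick sketch of checking which names are accepted:

from sklearn.metrics import get_scorer_names

names = get_scorer_names()
print(len(names))               # number of built-in scorer names
print('recall_macro' in names)  # True: the name used in the example above is registered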

Callable Scorers: Customizing Your Evaluation

For more advanced testing scenarios, scikit-learn allows you to use callable objects as scorers, offering greater flexibility. This can be achieved in two primary ways:

  1. Adapting predefined metrics with make_scorer: Use make_scorer to create scorer objects from existing metric functions, especially when you need to set specific parameters.
  2. Creating custom scorer objects from scratch: For ultimate flexibility, define your own scoring function and use make_scorer or create a scorer object directly.

Adapting Predefined Metrics via make_scorer

Functions like fbeta_score, mean_tweedie_deviance, and mean_pinball_loss require additional parameters (e.g., beta, power, alpha) and cannot be directly used as string scorers. make_scorer bridges this gap by allowing you to “wrap” these functions and set their parameters.

Functions Adaptable with make_scorer:

Function | Parameter | Example Usage

Classification:
metrics.fbeta_score | beta | make_scorer(fbeta_score, beta=2)

Regression:
metrics.mean_tweedie_deviance | power | make_scorer(mean_tweedie_deviance, power=1.5)
metrics.mean_pinball_loss | alpha | make_scorer(mean_pinball_loss, alpha=0.95)
metrics.d2_tweedie_score | power | make_scorer(d2_tweedie_score, power=1.5)
metrics.d2_pinball_score | alpha | make_scorer(d2_pinball_score, alpha=0.95)

Example: Creating a scorer for fbeta_score with beta=2:

from sklearn.metrics import fbeta_score, make_scorer
ftwo_scorer = make_scorer(fbeta_score, beta=2)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]},
                    scoring=ftwo_scorer, cv=5)

Remember that functions ending in _score are maximized (higher is better), while those ending in _error, _loss, or _deviance are minimized (lower is better). When using make_scorer with loss functions, set greater_is_better=False.

Creating a Custom Scorer Object

For highly specific testing needs, you can create fully custom scorer objects.

Custom Scorer Objects using make_scorer

make_scorer is a powerful tool for building custom scorers from simple Python functions. Key parameters include:

  • score_func: Your Python function to calculate the score.
  • greater_is_better: Indicate if your function returns a score (True, default) or a loss (False). For losses, the output is negated to align with scikit-learn’s “higher is better” convention.
  • response_method: For classification metrics, specify if your function requires probability estimates ("predict_proba") or decision values ("decision_function"). You can also provide a list of methods to try.
  • **kwargs: Additional parameters to pass to your scoring function.

Example: Creating a custom loss function and scorer:

import numpy as np
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier

def my_custom_loss_func(y_true, y_pred):
    diff = np.abs(y_true - y_pred).max()
    return np.log1p(diff)

score = make_scorer(my_custom_loss_func, greater_is_better=False)
X = [[1], [1]]
y = [0, 1]
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf = clf.fit(X, y)
my_custom_loss_func(y, clf.predict(X))
score(clf, X, y)

Custom Scorer Objects from Scratch

For maximum control, you can create scorer objects from the ground up. A scorer callable must adhere to this protocol:

  • Signature: Accept parameters (estimator, X, y), where estimator is the model, X is validation data, and y is the ground truth (or None for unsupervised).
  • Return Value: Return a float representing the model’s prediction quality on X relative to y. Higher values should indicate better performance. Negate loss values to adhere to this convention.
  • Advanced (Metadata Routing): For scorers needing metadata, implement get_metadata_routing and set_score_request methods.
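
As an illustration, here is a minimal from-scratch scorer following this protocol; the negated worst-case-error scorer below is a hypothetical example, not a scikit-learn built-in:

import numpy as np

def neg_max_abs_error_scorer(estimator, X, y):
    """Callable scorer: (estimator, X, y) -> float, higher is better."""
    y_pred = estimator.predict(X)
    # Negate the worst-case absolute error so that larger values mean better models.
    return -float(np.max(np.abs(np.asarray(y) - y_pred)))

# It can be passed anywhere a scorer is accepted, e.g.:
# cross_val_score(regressor, X, y, cv=5, scoring=neg_max_abs_error_scorer)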

Using Custom Scorers with n_jobs > 1

For robust parallel processing with custom scorers, especially when using n_jobs > 1, it’s best to define your custom scoring function in a separate Python module and import it. This approach is more reliable across different joblib backends.

Example: Importing a custom scoring function from custom_scorer_module.py:

from custom_scorer_module import custom_scoring_function
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

# model, X_train and y_train are assumed to be defined elsewhere
cross_val_score(model, X_train, y_train,
                scoring=make_scorer(custom_scoring_function, greater_is_better=False),
                cv=5, n_jobs=-1)

Multiple Metric Evaluation: A Holistic View

Scikit-learn allows you to evaluate multiple metrics simultaneously in GridSearchCV, RandomizedSearchCV, and cross_validate, providing a more comprehensive testing picture.

You can specify multiple scoring metrics in three ways:

  1. List of strings: Provide a list of metric names as strings:

    scoring = ['accuracy', 'precision']
  2. Dictionary mapping names to scorers: Create a dictionary where keys are scorer names and values are either scorer functions or predefined metric strings:

    from sklearn.metrics import accuracy_score, make_scorer
    scoring = {'accuracy': make_scorer(accuracy_score),
               'prec': 'precision'}
  3. Callable returning a dictionary: Define a callable function that computes and returns a dictionary of scores:

    from sklearn.model_selection import cross_validate
    from sklearn.metrics import confusion_matrix
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC
    
    X, y = make_classification(n_classes=2, random_state=0)
    svm = LinearSVC(random_state=0)
    
    def confusion_matrix_scorer(clf, X, y):
        y_pred = clf.predict(X)
        cm = confusion_matrix(y, y_pred)
        return {'tn': cm[0, 0], 'fp': cm[0, 1],
                'fn': cm[1, 0], 'tp': cm[1, 1]}
    
    cv_results = cross_validate(svm, X, y, cv=5,
                               scoring=confusion_matrix_scorer)
    print(cv_results['test_tp'])
    print(cv_results['test_fn'])

Classification Metrics: Evaluating Classifier Performance

The sklearn.metrics module is rich with functions for evaluating classification performance. These metrics assess different facets of your classifier’s effectiveness, from overall accuracy to nuanced measures like precision and recall.

Some metrics are tailored for binary classification, while others extend to multiclass and multilabel scenarios. Many implementations support sample weighting via the sample_weight parameter.

Binary Classification Specific Metrics:

Function Description
precision_recall_curve Compute precision-recall pairs for different probability thresholds.
roc_curve Compute Receiver operating characteristic (ROC).
class_likelihood_ratios Compute binary classification positive and negative likelihood ratios.
det_curve Compute error rates for different probability thresholds.

Multiclass Capable Metrics:

Function Description
balanced_accuracy_score Compute the balanced accuracy.
cohen_kappa_score Compute Cohen’s kappa: a statistic that measures inter-annotator agreement.
confusion_matrix Compute confusion matrix to evaluate classification accuracy.
hinge_loss Average hinge loss (non-regularized).
matthews_corrcoef Compute the Matthews correlation coefficient (MCC).
roc_auc_score Compute Area Under the ROC Curve (ROC AUC) from prediction scores.
top_k_accuracy_score Top-k Accuracy classification score.

Multilabel Capable Metrics:

Function Description
accuracy_score Accuracy classification score.
classification_report Build a text report showing the main classification metrics.
f1_score Compute the F1 score, also known as balanced F-score or F-measure.
fbeta_score Compute the F-beta score.
hamming_loss Compute the average Hamming loss.
jaccard_score Jaccard similarity coefficient score.
log_loss Log loss, aka logistic loss or cross-entropy loss.
multilabel_confusion_matrix Compute a confusion matrix for each class or sample.
precision_recall_fscore_support Compute precision, recall, F-measure and support for each class.
precision_score Compute the precision.
recall_score Compute the recall.
roc_auc_score Compute Area Under the ROC Curve (ROC AUC) from prediction scores.
zero_one_loss Zero-one classification loss.
d2_log_loss_score D² score function, fraction of log loss explained.

Binary and Multilabel (Not Multiclass) Metrics:

Function Description
average_precision_score Compute average precision (AP) from prediction scores.

From Binary to Multiclass and Multilabel: Adapting Metrics

Many classification metrics are inherently designed for binary classification. Extending them to multiclass or multilabel problems involves treating the data as a collection of binary problems, one for each class. The average parameter in scikit-learn metrics controls how these binary metric calculations are averaged across classes.

  • "macro": Calculates the mean of binary metrics per class, giving equal weight to each class. Useful for highlighting performance on infrequent classes.
  • "weighted": Averages binary metrics, weighting each class’s score by its prevalence in the true data. Accounts for class imbalance.
  • "micro": Gives each sample-class pair equal weight. Sums up dividends and divisors across classes to compute an overall quotient. Suitable for multilabel settings and multiclass problems where a majority class should be de-emphasized.
  • "samples": Applies only to multilabel problems. Calculates the metric for each sample and returns the sample_weight-weighted average.
  • average=None: Returns an array of scores for each class, providing a class-wise breakdown of performance.

Multiclass data is typically provided as an array of class labels, while multilabel data is represented as an indicator matrix.
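
The effect of the average parameter can be seen directly on a small multiclass example (a sketch using f1_score; the same options apply to precision_score, recall_score, and jaccard_score):

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

f1_score(y_true, y_pred, average='macro')     # unweighted mean over classes
f1_score(y_true, y_pred, average='micro')     # pooled counts over all sample-class pairs
f1_score(y_true, y_pred, average='weighted')  # weighted by class support
f1_score(y_true, y_pred, average=None)        # per-class scores as an array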

Accuracy Score: Overall Correctness

The accuracy_score function computes the accuracy, representing the fraction or count of correctly classified samples. In multilabel classification, it returns the subset accuracy, requiring the entire set of predicted labels to perfectly match the true set.

Formula:

\[\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)\]

Example:

import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
accuracy_score(y_true, y_pred, normalize=False) # Count of correct predictions

Top-k Accuracy Score: Considering Top Predictions

The top_k_accuracy_score is a generalization of accuracy, where a prediction is considered correct if the true label is among the k highest predicted scores. accuracy_score is a special case where k=1. This metric is useful when you care about the model’s ability to rank the correct class highly, even if it’s not the absolute top prediction.

Formula:

\[\texttt{top-k accuracy}(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \sum_{j=1}^{k} 1(\hat{f}_{i,j} = y_i)\]

Example:

import numpy as np
from sklearn.metrics import top_k_accuracy_score
y_true = np.array([0, 1, 2, 2])
y_score = np.array([[0.5, 0.2, 0.2], [0.3, 0.4, 0.2],
                    [0.2, 0.4, 0.3], [0.7, 0.2, 0.1]])
top_k_accuracy_score(y_true, y_score, k=2)
top_k_accuracy_score(y_true, y_score, k=2, normalize=False) # Count of top-k correct predictions

Balanced Accuracy Score: Addressing Class Imbalance

The balanced_accuracy_score is crucial for imbalanced datasets. It’s the macro-average of recall scores per class, or equivalently, accuracy where each sample is weighted by the inverse prevalence of its true class. This metric provides a more realistic performance estimate when classes are not equally represented.

Binary Case Formula:

\[\texttt{balanced-accuracy} = \frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)\]

Multiclass Case Formula:

\[\texttt{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i\]
where \(\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i) w_j}\)

Example:

from sklearn.metrics import balanced_accuracy_score
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0]
balanced_accuracy_score(y_true, y_pred)

Cohen’s Kappa: Inter-Annotator Agreement

The cohen_kappa_score computes Cohen’s kappa, a statistic measuring inter-annotator agreement. It’s especially useful for comparing labelings from different human annotators, not just classifier performance against ground truth. Kappa scores range from -1 to 1, with scores above 0.8 generally considered good agreement.

Example:

from sklearn.metrics import cohen_kappa_score
labeling1 = [2, 0, 2, 2, 0, 1]
labeling2 = [0, 0, 2, 2, 0, 2]
cohen_kappa_score(labeling1, labeling2)

Confusion Matrix: Detailed Breakdown of Predictions

The confusion_matrix is a powerful tool for visualizing classification performance. It presents a matrix where rows represent true classes and columns represent predicted classes. Element (i, j) shows the count of samples actually in class i but predicted as class j.

Example:

from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

You can visually represent confusion matrices using ConfusionMatrixDisplay. Normalization options (normalize) allow displaying ratios instead of raw counts.
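
A short sketch of the display helper (assuming matplotlib is installed, since the display classes plot with it):

from sklearn.metrics import ConfusionMatrixDisplay

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Plot the confusion matrix with each row normalized to sum to 1.
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, normalize='true')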

Classification Report: Summary of Key Metrics

The classification_report generates a text report summarizing key classification metrics: precision, recall, F1-score, and support for each class. It’s a convenient way to get a quick overview of your classifier’s performance across different classes.

Example:

from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))

Hamming Loss: Measuring Label Mismatches

The hamming_loss computes the average Hamming loss or Hamming distance between sets of samples. In multilabel classification, it quantifies the fraction of incorrectly predicted labels.

Formula:

\[L_{\text{Hamming}}(y, \hat{y}) = \frac{1}{n_\text{samples} \cdot n_\text{labels}} \sum_{i=0}^{n_\text{samples}-1} \sum_{j=0}^{n_\text{labels}-1} 1(\hat{y}_{i,j} \neq y_{i,j})\]

Example:

from sklearn.metrics import hamming_loss
y_pred = [1, 2, 3, 4]
y_true = [2, 2, 3, 4]
hamming_loss(y_true, y_pred)

Precision, Recall, and F-Measures: Balancing Error Types

Precision, recall, and F-measures are fundamental metrics for evaluating classification performance, particularly in binary and multilabel settings.

  • Precision: Measures the classifier’s ability to avoid false positives (labeling negative samples as positive).
  • Recall: Measures the classifier’s ability to find all positive samples (avoiding false negatives).
  • F-measure: A weighted harmonic mean of precision and recall, balancing both metrics. F1-score (F-measure with beta=1) gives equal weight to precision and recall.

The precision_recall_curve computes precision-recall pairs for varying decision thresholds, allowing you to visualize the trade-off between precision and recall. The average_precision_score summarizes the precision-recall curve into a single average precision (AP) score.

Formulas (Binary Classification):

\[\text{precision} = \frac{\text{tp}}{\text{tp} + \text{fp}}\]

\[\text{recall} = \frac{\text{tp}}{\text{tp} + \text{fn}}\]

\[F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2\, \text{precision} + \text{recall}}\]

Example (Binary Classification):

from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
metrics.precision_score(y_true, y_pred)
metrics.recall_score(y_true, y_pred)
metrics.f1_score(y_true, y_pred)
metrics.fbeta_score(y_true, y_pred, beta=0.5)
metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5)
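
To examine the precision/recall trade-off across thresholds rather than at a single operating point, precision_recall_curve and average_precision_score work on continuous scores (a small sketch with illustrative scores):

import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
average_precision_score(y_true, y_scores)  # roughly 0.83 for these scores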

For multiclass and multilabel classification, these metrics can be averaged across labels using the average parameter (see above).

Jaccard Similarity Coefficient Score: Set-Based Accuracy

The jaccard_score computes the average Jaccard similarity coefficient, also known as the Jaccard index. It measures the similarity between sets of predicted and true labels, defined as the size of the intersection divided by the size of the union of the label sets.

Formula:

\[J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}\]

Example:

import numpy as np
from sklearn.metrics import jaccard_score
y_true = np.array([[0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 1, 1], [1, 0, 0]])
jaccard_score(y_true[0], y_pred[0])
jaccard_score(y_true, y_pred, average="micro") # Micro-averaged Jaccard score

Hinge Loss: Margin-Based Loss

The hinge_loss computes the average hinge loss, a loss function commonly used in support vector machines (SVMs). It’s a one-sided metric that focuses on prediction errors, particularly useful for margin-maximizing classifiers.

Binary Case Formula:

\[L_\text{Hinge}(y, w) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max\left\{1 - w_i y_i, 0\right\}\]

Multiclass Case Formula (Crammer & Singer):

\[L_\text{Hinge}(y, w) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max\left\{1 + \hat{w}_{i, y_i} - w_{i, y_i}, 0\right\}\]
where \(w_{i, y_i}\) is the predicted decision for the true label \(y_i\) of the i-th sample and \(\hat{w}_{i, y_i}\) is the maximum of the predicted decisions over all other labels.

Example:

from sklearn import svm
from sklearn.metrics import hinge_loss
X = [[0], [1]]
y = [-1, 1]
est = svm.LinearSVC(random_state=0)
est.fit(X, y)
pred_decision = est.decision_function([[-2], [3], [0.5]])
hinge_loss([-1, 1, 1], pred_decision)

Log Loss (Cross-Entropy Loss): Probabilistic Loss

The log_loss, also known as logistic regression loss or cross-entropy loss, is defined on probability estimates. It’s commonly used in logistic regression and neural networks, evaluating the probabilistic outputs (predict_proba) of a classifier.

Binary Case Formula:

\[L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y \log(p) + (1 - y) \log(1 - p))\]

Multiclass Case Formula:

\[L_{\log}(Y, P) = -\log \operatorname{Pr}(Y|P) = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}\]

Example:

from sklearn.metrics import log_loss
y_true = [0, 0, 1, 1]
y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
log_loss(y_true, y_pred)

Matthews Correlation Coefficient (MCC): Balanced Binary Metric

The matthews_corrcoef function computes the Matthews correlation coefficient (MCC). Originally defined for binary classification, it also has a multiclass generalization (see the second formula below). MCC is considered a balanced measure, even with very different class sizes. It ranges from -1 to +1, with +1 representing perfect prediction, 0 random prediction, and -1 inverse prediction.

Binary Case Formula:

\[MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}\]

Multiclass Case Formula:

\[MCC = \frac{c \times s - \sum_{k}^{K} p_k \times t_k}{\sqrt{\left(s^2 - \sum_{k}^{K} p_k^2\right) \times \left(s^2 - \sum_{k}^{K} t_k^2\right)}}\]

Example:

from sklearn.metrics import matthews_corrcoef
y_true = [+1, +1, +1, -1]
y_pred = [+1, -1, +1, +1]
matthews_corrcoef(y_true, y_pred)

Multi-label Confusion Matrix: Class-Wise Breakdown

The multilabel_confusion_matrix computes class-wise or sample-wise multilabel confusion matrices. It treats multiclass data as multilabel for compatibility with binary classification metrics. This function provides a confusion matrix for each class, allowing detailed analysis of performance per label.

Example:

import numpy as np
from sklearn.metrics import multilabel_confusion_matrix
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
multilabel_confusion_matrix(y_true, y_pred) # Class-wise matrices
multilabel_confusion_matrix(y_true, y_pred, samplewise=True) # Sample-wise matrices

Receiver Operating Characteristic (ROC): Visualizing Trade-offs

The roc_curve computes the receiver operating characteristic (ROC) curve, a graphical plot showing the performance of a binary classifier as its discrimination threshold varies. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different thresholds, visualizing the trade-off between sensitivity and specificity.

The roc_auc_score function computes the Area Under the ROC Curve (ROC AUC or AUROC), summarizing the ROC curve into a single, interpretable score.

Example:

import numpy as np
from sklearn.metrics import roc_curve
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
fpr
tpr
thresholds

ROC AUC is applicable to binary, multiclass (using one-vs-one or one-vs-rest strategies), and multilabel classification.
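
A corresponding sketch for the summary score in the binary case:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
roc_auc_score(y_true, y_scores)  # 0.75 for these scores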

Detection Error Tradeoff (DET): Focusing on Error Tradeoffs

The det_curve computes the detection error tradeoff (DET) curve. DET curves are similar to ROC curves but plot False Negative Rate (FNR) against False Positive Rate (FPR), often using a non-linear scale. DET curves emphasize the critical operating region and can be more linear than ROC curves, making it easier to visually compare classifiers, especially where error tradeoffs are important.
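
The underlying points can be computed with det_curve, which mirrors the roc_curve API but returns false negative rates instead of true positive rates (a small sketch):

import numpy as np
from sklearn.metrics import det_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, fnr, thresholds = det_curve(y_true, y_scores)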

Zero-One Loss: Strict Accuracy

The zero_one_loss computes the sum or average of the 0-1 classification loss. It measures the fraction of samples misclassified (or the count of misclassifications if normalize=False). In multilabel classification, it requires a perfect match of predicted and true label sets for a sample to be considered correctly classified.

Formula:

\[L_{0-1}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i \neq y_i)\]

Example:

from sklearn.metrics import zero_one_loss
y_pred = [1, 2, 3, 4]
y_true = [2, 2, 3, 4]
zero_one_loss(y_true, y_pred)
zero_one_loss(y_true, y_pred, normalize=False) # Count of misclassifications

Brier Score Loss: Probabilistic Accuracy

The brier_score_loss computes the Brier score for binary classes. It measures the accuracy of probabilistic predictions by calculating the mean squared error between predicted probabilities and actual outcomes (0 or 1). Lower Brier scores indicate better probabilistic accuracy.

Formula:

\[BS = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (y_i - p_i)^2\]

Example:

import numpy as np
from sklearn.metrics import brier_score_loss
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.4])
brier_score_loss(y_true, y_prob)

Class Likelihood Ratios: Prevalence-Invariant Metrics

The class_likelihood_ratios function computes positive and negative likelihood ratios (LR+ and LR-) for binary classification. These metrics are prevalence-invariant, making them useful for generalizing model performance across populations with varying class imbalances. They represent the ratio of post-test to pre-test odds, offering insights into how predictions modify the probability of belonging to a class.
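
A small sketch, assuming scikit-learn ≥ 1.2 (where class_likelihood_ratios was introduced):

import numpy as np
from sklearn.metrics import class_likelihood_ratios

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

pos_lr, neg_lr = class_likelihood_ratios(y_true, y_pred)
# pos_lr > 1 means a positive prediction raises the odds of the positive class;
# neg_lr < 1 means a negative prediction lowers them.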

D² Score for Classification: Deviance Explained

The D² score generalizes R² for classification, measuring the fraction of deviance explained by the model. The d2_log_loss_score function implements D² using log loss as the deviance measure. A higher D² score (closer to 1.0) indicates a better fit.
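
A small sketch, assuming scikit-learn ≥ 1.5 (where d2_log_loss_score was added); as with log_loss, the predictions are probability estimates:

from sklearn.metrics import d2_log_loss_score

y_true = [0, 0, 1, 1]
y_proba = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
d2_log_loss_score(y_true, y_proba)  # positive: better than always predicting the class frequencies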

Multilabel Ranking Metrics: Evaluating Label Order

In multilabel learning, where samples can have multiple true labels, ranking metrics assess how well the model ranks true labels higher than false labels.

Coverage Error: Minimum Labels to Cover True Labels

The coverage_error computes the average number of labels that need to be included in the top-ranked predictions to cover all true labels. Lower coverage error is better, with the ideal value being the average number of true labels per sample.

Formula:

\[\text{coverage}(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max_{j: y_{ij} = 1} \text{rank}_{ij}\]

Example:

import numpy as np
from sklearn.metrics import coverage_error
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
coverage_error(y_true, y_score)

Label Ranking Average Precision (LRAP): Precision of Label Ranking

The label_ranking_average_precision_score computes label ranking average precision (LRAP). It averages over samples the fraction of higher-ranked labels that are true labels for each ground truth label. Higher LRAP scores (closer to 1) are better, indicating better label ranking.

Formula:

\[LRAP(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \frac{1}{\|y_i\|_0} \sum_{j: y_{ij} = 1} \frac{|\mathcal{L}_{ij}|}{\text{rank}_{ij}}\]

Example:

import numpy as np
from sklearn.metrics import label_ranking_average_precision_score
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
label_ranking_average_precision_score(y_true, y_score)

Ranking Loss: Incorrect Label Pair Orderings

The label_ranking_loss computes the ranking loss, averaging over samples the number of incorrectly ordered label pairs (true labels ranked lower than false labels). Lower ranking loss (closer to 0) is better.

Formula:

\[\text{ranking\_loss}(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \frac{1}{\|y_i\|_0 (n_\text{labels} - \|y_i\|_0)} \left|\left\{(k, l): \hat{f}_{ik} \leq \hat{f}_{il}, y_{ik} = 1, y_{il} = 0 \right\}\right|\]

Example:

import numpy as np
from sklearn.metrics import label_ranking_loss
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
label_ranking_loss(y_true, y_score)

Normalized Discounted Cumulative Gain (NDCG): Ranking Quality

Discounted Cumulative Gain (DCG) and Normalized Discounted Cumulative Gain (NDCG), implemented in dcg_score and ndcg_score, evaluate ranking quality by comparing predicted order to ground-truth relevance scores. NDCG is DCG normalized to be between 0 and 1, with higher values indicating better ranking quality. It’s particularly useful in information retrieval and recommendation systems where graded relevance scores are available.
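
A brief sketch with graded relevance scores for a single query:

import numpy as np
from sklearn.metrics import dcg_score, ndcg_score

# True relevance of five documents for one query, and the scores a model assigned to them.
true_relevance = np.asarray([[10, 0, 0, 1, 5]])
predicted_scores = np.asarray([[0.1, 0.2, 0.3, 4, 70]])

dcg_score(true_relevance, predicted_scores)
ndcg_score(true_relevance, predicted_scores)  # normalized to the [0, 1] range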

Regression Metrics: Evaluating Regression Performance

The sklearn.metrics module offers a range of functions for evaluating regression models. Many of these functions support multioutput regression, with the multioutput parameter controlling how scores are averaged across multiple target variables.
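
A short sketch of the multioutput parameter on a two-target problem:

from sklearn.metrics import mean_squared_error, r2_score

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]

mean_squared_error(y_true, y_pred, multioutput='raw_values')  # one error per output
mean_squared_error(y_true, y_pred, multioutput=[0.3, 0.7])    # custom output weighting
r2_score(y_true, y_pred, multioutput='variance_weighted')     # weight outputs by their variance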

R² Score (Coefficient of Determination): Variance Explained

The r2_score computes the coefficient of determination (R²), representing the proportion of variance in the target variable explained by the model. R² ranges from negative infinity to 1.0, with 1.0 being the best possible score. It provides a measure of goodness of fit, indicating how well unseen samples are likely to be predicted.

Formula:

\[R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\]

Example:

from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
r2_score(y_true, y_pred)

Mean Absolute Error (MAE): Average Absolute Deviation

The mean_absolute_error computes the mean absolute error (MAE), the average absolute difference between predicted and true values. MAE is less sensitive to outliers than squared-error metrics and is measured in the same units as the target variable. Lower MAE values are better.

Formula:

\[\text{MAE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \left| y_i - \hat{y}_i \right|\]

Example:

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)

Mean Squared Error (MSE): Average Squared Deviation

The mean_squared_error computes the mean squared error (MSE), representing the average squared difference between predicted and true values. MSE is more sensitive to outliers than MAE due to the squaring operation. Lower MSE values are better. Root Mean Squared Error (RMSE), the square root of MSE, is also commonly used and available via root_mean_squared_error.

Formula:

\[\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (y_i - \hat{y}_i)^2\]

Example:

from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_squared_error(y_true, y_pred)

Mean Squared Logarithmic Error (MSLE): Relative Error Focus

The mean_squared_log_error computes the mean squared logarithmic error (MSLE). MSLE is particularly useful when targets have exponential growth, penalizing under-predictions more heavily than over-predictions. It is less sensitive to outliers compared to MSE.

Formula:

\[\text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \left(\log_e(1 + y_i) - \log_e(1 + \hat{y}_i)\right)^2\]

Example:

from sklearn.metrics import mean_squared_log_error
y_true = [3, 5, 2.5, 7]
y_pred = [2.5, 5, 4, 8]
mean_squared_log_error(y_true, y_pred)

Root Mean Squared Logarithmic Error (RMSLE) is available via root_mean_squared_log_error.

Mean Absolute Percentage Error (MAPE): Relative Error as Percentage

The mean_absolute_percentage_error (MAPE) measures the mean absolute percentage error, providing a relative error metric as a percentage. MAPE is scale-invariant, meaning it’s not affected by global scaling of the target variable. It is sensitive to relative errors, making it useful when relative accuracy is important.

Formula:

\[\text{MAPE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \frac{\left| y_i - \hat{y}_i \right|}{\max(\epsilon, \left| y_i \right|)}\]

Example:

from sklearn.metrics import mean_absolute_percentage_error
y_true = [1, 10, 1e6]
y_pred = [0.9, 15, 1.2e6]
mean_absolute_percentage_error(y_true, y_pred)

Median Absolute Error (MedAE): Robust to Outliers

The median_absolute_error is robust to outliers, calculating the median of absolute errors instead of the mean. MedAE provides a robust measure of typical error magnitude, less influenced by extreme values.

Formula:

\[\text{MedAE}(y, \hat{y}) = \text{median}(\left| y_1 - \hat{y}_1 \right|, \ldots, \left| y_n - \hat{y}_n \right|)\]

Example:

from sklearn.metrics import median_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
median_absolute_error(y_true, y_pred)

Max Error: Worst-Case Error

The max_error computes the maximum residual error, capturing the worst-case error between predicted and true values. Max error indicates the largest prediction error the model makes.

Formula:

\[\text{Max Error}(y, \hat{y}) = \max_i(\left| y_i - \hat{y}_i \right|)\]

Example:

from sklearn.metrics import max_error
y_true = [3, 2, 7, 1]
y_pred = [9, 2, 7, 1]
max_error(y_true, y_pred)

Explained Variance Score: Variance Explained in Regression

The explained_variance_score computes the explained variance regression score. It measures the proportion of variance in the target variable that is explained by the model. The best possible score is 1.0, with lower values indicating worse performance.

Formula:

\[\text{explained\_variance}(y, \hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}}\]

Example:

from sklearn.metrics import explained_variance_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
explained_variance_score(y_true, y_pred)

Mean Poisson, Gamma, and Tweedie Deviance: Deviance-Based Loss

The mean_tweedie_deviance computes the mean Tweedie deviance error, a metric that elicits predicted expectation values for regression targets. It generalizes several common regression loss functions based on the power parameter:

  • power=0: Normal distribution (squared error)
  • power=1: Poisson distribution
  • power=2: Gamma distribution

Tweedie deviance is useful for modeling data with different variance structures. Higher power values give less weight to extreme deviations.
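
A small sketch showing how the power parameter changes the sensitivity to errors of the same relative size:

from sklearn.metrics import mean_tweedie_deviance

mean_tweedie_deviance([1.0], [1.5], power=0)      # squared error: 0.25
mean_tweedie_deviance([100.0], [150.0], power=0)  # squared error grows with the scale: 2500.0

mean_tweedie_deviance([1.0], [1.5], power=2)      # Gamma deviance
mean_tweedie_deviance([100.0], [150.0], power=2)  # identical: only the relative error matters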

Pinball Loss: Quantile Regression Evaluation

The mean_pinball_loss evaluates quantile regression models. It measures the deviation from a specified quantile (alpha). When alpha=0.5, it’s equivalent to half of the MAE. Pinball loss is asymmetric, penalizing deviations differently depending on whether the prediction is above or below the true value, according to the chosen quantile.

Formula:

\[\text{pinball}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \left( \alpha \max(y_i - \hat{y}_i, 0) + (1 - \alpha) \max(\hat{y}_i - y_i, 0) \right)\]

Example:

from sklearn.metrics import mean_pinball_loss
y_true = [1, 2, 3]
mean_pinball_loss(y_true, [0, 2, 3], alpha=0.1)
mean_pinball_loss(y_true, [1, 2, 4], alpha=0.9)

D² Score for Regression: Deviance Explained Generalization

The D² score generalizes R² for regression by replacing squared error with a deviance of choice. d2_tweedie_score, d2_pinball_score, and d2_absolute_error_score implement D² using Tweedie deviance, pinball loss, and mean absolute error, respectively.

Visual Evaluation of Regression Models: PredictionErrorDisplay

The PredictionErrorDisplay class provides visual tools for regression model evaluation. It allows plotting predicted vs. actual values and residuals vs. predicted values, aiding in diagnosing model fit and identifying potential issues like heteroscedasticity or model misspecification.
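
A brief sketch (assuming scikit-learn ≥ 1.2 for PredictionErrorDisplay and matplotlib for plotting), using the diabetes dataset and a Ridge model purely as illustrative choices:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import PredictionErrorDisplay

X, y = load_diabetes(return_X_y=True)
model = Ridge().fit(X, y)

# Predicted vs. actual values; kind="residual_vs_predicted" plots residuals instead.
PredictionErrorDisplay.from_estimator(model, X, y, kind="actual_vs_predicted")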

Clustering Metrics: Assessing Unsupervised Grouping

The sklearn.metrics module also includes metrics for evaluating clustering performance. These metrics assess the quality of unsupervised grouping, both in terms of internal cluster structure and external alignment with ground truth (when available). Refer to the Clustering performance evaluation section for a detailed overview.

Dummy Estimators: Establishing Baselines for Comparison

Dummy estimators in scikit-learn (DummyClassifier and DummyRegressor) serve as essential baselines for model testing. They implement simple, rule-based prediction strategies, allowing you to quickly gauge if your more complex models are providing meaningful improvements over naive approaches.

DummyClassifier Strategies:

  • stratified: Random predictions respecting class distribution.
  • most_frequent: Always predicts the most frequent class.
  • prior: Predicts class maximizing prior probability.
  • uniform: Uniformly random predictions.
  • constant: Always predicts a user-specified constant label.

DummyRegressor Strategies:

  • mean: Always predicts the mean of training targets.
  • median: Always predicts the median of training targets.
  • quantile: Predicts a user-specified quantile of training targets.
  • constant: Always predicts a user-specified constant value.
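
A quick sketch of DummyRegressor as a baseline (the classifier counterpart appears in the example below):

import numpy as np
from sklearn.dummy import DummyRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 3.0, 5.0, 10.0])

dummy = DummyRegressor(strategy="mean").fit(X, y)
dummy.predict(X)   # always the training mean, 5.0
dummy.score(X, y)  # R^2 of a constant mean predictor on its own training data is 0.0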

Example: Using DummyClassifier for Baseline Comparison:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
y[y != 1] = -1  # Create imbalanced binary classification problem
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf_svc = SVC(kernel='linear', C=1).fit(X_train, y_train)
print("SVC Accuracy:", clf_svc.score(X_test, y_test))

clf_dummy = DummyClassifier(strategy='most_frequent', random_state=0).fit(X_train, y_train)
print("Dummy Classifier Accuracy:", clf_dummy.score(X_test, y_test))

clf_rbf_svc = SVC(kernel='rbf', C=1).fit(X_train, y_train)
print("RBF SVC Accuracy:", clf_rbf_svc.score(X_test, y_test)) # Improved model

By comparing your model’s performance against dummy estimators, you gain valuable insights into the actual effectiveness of your machine learning approach. If your model barely outperforms a dummy estimator, it signals potential issues with feature engineering, model selection, hyperparameter tuning, or class imbalance that need to be addressed to improve model quality.

This guide provides a comprehensive overview of model testing and evaluation in scikit-learn. By understanding and applying these metrics and tools, you can rigorously assess your models, ensuring they are not only built but also robust and reliable for your intended applications.
