Scikit-learn Linear Regression: A Comprehensive Guide for Machine Learning in Python

Linear regression is a fundamental and widely used algorithm in machine learning. Its simplicity and interpretability make it an excellent starting point for understanding predictive modeling. Scikit-learn, a powerful Python library, provides robust and efficient tools for implementing various linear regression techniques. This guide delves into the world of scikit-learn linear regression, exploring its different models, applications, and how to leverage them for effective data analysis and prediction.

Understanding Linear Models in Scikit-learn

In scikit-learn, linear models are designed for regression tasks where the target value is expected to be a linear combination of the input features. Mathematically, a linear model predicts the output \(\hat{y}\) based on input features \(x = (x_1, x_2, \ldots, x_p)\) and coefficients \(w = (w_1, w_2, \ldots, w_p)\), along with an intercept \(w_0\), as follows:

\[\hat{y}(w, x) = w_0 + w_1 x_1 + \ldots + w_p x_p\]

Within scikit-learn’s linear model module, \(w\) is referred to as coef_ and \(w_0\) as intercept_. While primarily used for regression, some linear models in scikit-learn, like Logistic Regression and Ridge Classifier, are adapted for classification problems.

Ordinary Least Squares (OLS) Regression

Ordinary Least Squares, implemented as LinearRegression in scikit-learn, is the most basic linear regression method. It aims to find the line (or hyperplane in higher dimensions) that minimizes the sum of squared differences between the observed target values and the values predicted by the linear model. This is known as minimizing the residual sum of squares (RSS).

Mathematically, OLS solves the following optimization problem:

\[\min_{w} ||X w - y||_2^2\]

Here, \(X\) represents the feature matrix, \(y\) is the target vector, and \(w\) is the coefficient vector we want to find.

>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression()
>>> reg.coef_
array([0.5, 0.5])

The fit method of LinearRegression trains the model using the provided data, and the estimated coefficients are stored in the coef_ attribute.

One important assumption of OLS is the independence of features. Multicollinearity, which occurs when features are highly correlated, can lead to unstable and high-variance coefficient estimates.

Non-Negative Least Squares

In scenarios where coefficients are expected to be non-negative, such as when they represent physical quantities, LinearRegression can be constrained using the positive=True parameter. This applies Non-Negative Least Squares (NNLS).
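
As a minimal sketch of how this constraint can be applied (the toy data below is made up purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data whose targets are roughly a non-negative combination of the features
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# positive=True constrains every coefficient to be >= 0 (NNLS)
nnls = LinearRegression(positive=True)
nnls.fit(X, y)
print(nnls.coef_)       # all entries are non-negative
print(nnls.intercept_)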

Complexity of Ordinary Least Squares

The computational complexity of OLS using Singular Value Decomposition (SVD) is \(O(n_{\text{samples}} n_{\text{features}}^2)\), assuming \(n_{\text{samples}} \geq n_{\text{features}}\).

Ridge Regression

Ridge Regression, implemented as Ridge in scikit-learn, is a regularized version of linear regression that addresses some limitations of OLS, particularly in the presence of multicollinearity. It adds an \(\ell_2\) penalty to the residual sum of squares, shrinking the coefficients and making the model more robust.

The Ridge regression objective function is:

\[\min_{w} ||X w - y||_2^2 + \alpha ||w||_2^2\]

The regularization parameter \(\alpha \geq 0\) controls the strength of the penalty. A larger \(\alpha\) leads to greater shrinkage and makes the coefficients more robust to collinearity.

>>> from sklearn import linear_model
>>> reg = linear_model.Ridge(alpha=.5)
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
Ridge(alpha=0.5)
>>> reg.coef_
array([0.34545455, 0.34545455])
>>> reg.intercept_
0.13636...

Like LinearRegression, Ridge uses the fit method for training and stores coefficients in coef_.

Example of Ridge Regression coefficient paths for different alpha values.

Ridge Classification

RidgeClassifier is a classifier variant of Ridge Regression. It converts binary targets to \(\{-1, 1\}\) and treats classification as a regression problem with a penalized least squares loss. For multi-class classification, it uses a multi-output regression approach. Despite using a least squares loss for classification, RidgeClassifier can perform comparably to traditional classification models like Logistic Regression, often with better computational efficiency, especially for datasets with many classes.
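
A brief sketch of the classifier in use, on a synthetic dataset generated with make_classification (the parameter values are illustrative only):

from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RidgeClassifier(alpha=1.0)
clf.fit(X, y)
print(clf.score(X, y))                # mean accuracy on the training data
print(clf.decision_function(X[:3]))   # signed scores behind the {-1, 1} encoding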

Complexity of Ridge Regression

Ridge Regression has a similar computational complexity to OLS.

Ridge Regression with Cross-Validation: RidgeCV

RidgeCV and RidgeClassifierCV provide built-in cross-validation to automatically find the optimal regularization parameter \(\alpha\). They work much like GridSearchCV, except that they default to an efficient form of Leave-One-Out Cross-Validation (LOOCV) optimized for this specific task.

>>> import numpy as np
>>> from sklearn import linear_model
>>> reg = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
RidgeCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]))
>>> reg.alpha_
0.01

Specifying the cv parameter in RidgeCV allows for using k-fold cross-validation instead of LOOCV.

Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator), implemented as Lasso in scikit-learn, is another linear model with regularization, but it uses an \(\ell_1\) penalty. This penalty encourages sparsity in the coefficients, effectively performing feature selection by driving some coefficients to exactly zero. Lasso is particularly useful when dealing with datasets with many features and when feature selection is desired.

The Lasso objective function to minimize is:

\[\min_{w} \frac{1}{2 n_{\text{samples}}} ||X w - y||_2^2 + \alpha ||w||_1\]

The \(\alpha\) parameter controls the strength of the \(\ell_1\) penalty, with higher values leading to more sparse solutions.

>>> from sklearn import linear_model
>>> reg = linear_model.Lasso(alpha=0.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
Lasso(alpha=0.1)
>>> reg.predict([[1, 1]])
array([0.8])

The lasso_path function is useful for exploring the coefficient paths for different values of \(\alpha\).

Feature Selection with Lasso

Due to its sparsity-inducing property, Lasso is often used for feature selection. Features with coefficients driven to zero are effectively removed from the model.
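
For instance, one way to inspect which features survive the \(\ell_1\) penalty is to look at the non-zero entries of coef_. The sketch below uses a synthetic dataset where only a few features are informative; the alpha value is an illustrative assumption:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 3 of the 20 features actually drive the target
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)  # indices of features with non-zero weight
print("selected features:", selected)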

Setting Regularization Parameter in Lasso

The alpha parameter in Lasso controls the sparsity of the model.

Cross-Validation for Lasso: LassoCV and LassoLarsCV

LassoCV and LassoLarsCV are scikit-learn classes that use cross-validation to automatically find the optimal alpha for Lasso. LassoLarsCV is based on the Least Angle Regression (LARS) algorithm. For high-dimensional datasets with collinear features, LassoCV is generally preferred. However, LassoLarsCV can be faster when the number of samples is much smaller than the number of features and explores more relevant alpha values.
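
A minimal sketch of LassoCV on a synthetic regression problem; the cv value is an illustrative choice, and LassoCV builds its own alpha grid by default:

from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=2.0, random_state=0)

# 5-fold cross-validation over an automatically chosen alpha grid
reg = LassoCV(cv=5, random_state=0)
reg.fit(X, y)
print("chosen alpha:", reg.alpha_)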

Examples illustrating the use of LassoCV for model selection.

Information-Criteria Based Model Selection: LassoLarsIC

LassoLarsIC uses information criteria like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) to select the optimal alpha. This is computationally cheaper than cross-validation as it computes the regularization path only once. However, these criteria rely on assumptions and might not be reliable for badly conditioned problems.
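
A short sketch of selecting alpha with BIC; the dataset is synthetic and the choice of criterion is illustrative:

from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

X, y = make_regression(n_samples=100, n_features=20, n_informative=4,
                       noise=1.0, random_state=0)

# criterion can be 'aic' or 'bic'; the regularization path is computed only once
reg = LassoLarsIC(criterion='bic')
reg.fit(X, y)
print("alpha selected by BIC:", reg.alpha_)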

Example showing model selection using LassoLarsIC with AIC and BIC.

Multi-task Lasso

MultiTaskLasso is designed for situations with multiple regression tasks that share the same features. It estimates sparse coefficients jointly for all tasks, enforcing feature selection to be consistent across tasks. This is useful when predicting multiple related target variables simultaneously and assuming the same set of features is relevant for all of them.

The objective function for Multi-task Lasso is:

\[\min_{W} \frac{1}{2 n_{\text{samples}}} ||X W - Y||_{\text{Fro}}^2 + \alpha ||W||_{21}\]

Here, \(Y\) is a 2D array of shape (n_samples, n_tasks), and \(W\) is the coefficient matrix. The \(||W||_{21}\) norm encourages sparsity across rows of \(W\), meaning features are either selected or not selected for all tasks together.
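
As a rough sketch, the estimator is fit on a 2D target array; the toy data below is fabricated so that the same two features drive every task:

import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
# Three related tasks that share the same two relevant features (0 and 1)
W_true = np.zeros((10, 3))
W_true[0, :] = 1.0
W_true[1, :] = -2.0
Y = X @ W_true + 0.1 * rng.randn(50, 3)

mtl = MultiTaskLasso(alpha=0.1)
mtl.fit(X, Y)
print(mtl.coef_.shape)   # (n_tasks, n_features) = (3, 10)
# Zero columns are shared across tasks: a feature is kept or dropped for all tasks at once
print(np.flatnonzero(np.any(mtl.coef_ != 0, axis=0)))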

Comparison of coefficient sparsity between Lasso and MultiTaskLasso.

Elastic-Net Regression

ElasticNet combines both \(\ell_1\) and \(\ell_2\) penalties. It aims to balance the feature selection benefits of Lasso with the regularization properties of Ridge. The l1_ratio parameter controls the mix between the two penalties. Elastic-Net is particularly useful when dealing with multicollinearity and when you suspect that a group of correlated features might be important.

The Elastic-Net objective function is:

\[\min_{w} \frac{1}{2 n_{\text{samples}}} ||X w - y||_2^2 + \alpha \rho ||w||_1 + \frac{\alpha(1-\rho)}{2} ||w||_2^2\]

ElasticNetCV can be used to find optimal \(\alpha\) and l1_ratio parameters through cross-validation.
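
A hedged sketch of ElasticNetCV; the l1_ratio grid and cv value below are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=2.0, random_state=0)

# Cross-validate over both alpha (internal grid) and the given l1_ratio values
reg = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5)
reg.fit(X, y)
print("alpha:", reg.alpha_)
print("l1_ratio:", reg.l1_ratio_)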

Example comparing coefficient paths for Lasso, LARS Lasso, and Elastic-Net.

Multi-task Elastic-Net

MultiTaskElasticNet extends Elastic-Net to multi-task regression problems, applying both \(\ell_1\)-\(\ell_2\) and \(\ell_2\) regularization to the coefficient matrix \(W\). MultiTaskElasticNetCV is available for cross-validation parameter tuning.

Least Angle Regression (LARS)

Least Angle Regression (LARS), implemented as Lars in scikit-learn, is an algorithm for high-dimensional data regression. It’s computationally efficient, especially when the number of features is much larger than the number of samples. LARS provides a piecewise linear solution path, which is useful for cross-validation. It’s also more stable than forward stepwise regression when features are highly correlated. However, LARS can be sensitive to noise due to its iterative residual refitting process.

LARS Lasso

LassoLars implements Lasso using the LARS algorithm, providing an exact piecewise linear solution path as a function of the \(\ell_1\) norm of the coefficients. The lars_path and lars_path_gram functions can be used to retrieve the full coefficient path.

>>> from sklearn import linear_model
>>> reg = linear_model.LassoLars(alpha=.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
LassoLars(alpha=0.1)
>>> reg.coef_
array([0.6..., 0. ])

Example of coefficient paths for Lasso and LARS Lasso.

Orthogonal Matching Pursuit (OMP)

Orthogonal Matching Pursuit (OMP), implemented as OrthogonalMatchingPursuit and orthogonal_mp, is a greedy algorithm for linear model fitting with a constraint on the number of non-zero coefficients (the \(\ell_0\) pseudo-norm). At each step, OMP selects the feature most correlated with the current residual and recomputes the residual via an orthogonal projection onto the features selected so far.
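
A minimal sketch constraining the solution to a fixed number of non-zero coefficients; the target count of 3 and the synthetic data are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=0.5, random_state=0)

# Ask the greedy search to keep at most 3 non-zero coefficients
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3)
omp.fit(X, y)
print(np.flatnonzero(omp.coef_))  # indices of the selected features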

Bayesian Regression

Bayesian Regression methods introduce regularization by incorporating prior distributions over model parameters. This allows the regularization parameter to be tuned based on the data rather than being fixed. Bayesian approaches offer advantages such as adapting to data characteristics and naturally including regularization. However, inference in Bayesian models can be computationally intensive.

Bayesian Ridge Regression

BayesianRidge implements Bayesian Ridge Regression, using Gaussian priors for coefficients and gamma priors for precision parameters. It estimates the regularization parameters \(\alpha\) and \(\lambda\) by maximizing the log marginal likelihood. Bayesian Ridge is more robust to ill-posed problems compared to OLS.

>>> from sklearn import linear_model
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> Y = [0., 1., 2., 3.]
>>> reg = linear_model.BayesianRidge()
>>> reg.fit(X, Y)
BayesianRidge()
>>> reg.predict([[1, 0.]])
array([0.50000013])
>>> reg.coef_
array([0.49999993, 0.49999993])

Automatic Relevance Determination (ARD) Regression

ARDRegression (Automatic Relevance Determination) is similar to Bayesian Ridge but uses a different prior that leads to sparser coefficients. ARD uses an elliptic Gaussian prior, allowing each coefficient to have its own precision parameter \(\lambda_i\). This often results in more effective feature selection.
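
A short sketch of ARDRegression on synthetic data (generated with make_regression purely for illustration):

from sklearn.datasets import make_regression
from sklearn.linear_model import ARDRegression

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

ard = ARDRegression()
ard.fit(X, y)
print(ard.coef_)     # many coefficients are typically pruned toward zero
print(ard.lambda_)   # one precision parameter per coefficient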

Logistic Regression

Despite its name, LogisticRegression in scikit-learn is a linear model for classification. It’s also known as logit regression or maximum-entropy classification. Logistic Regression models the probability of a binary outcome using the logistic function (sigmoid).

It can handle binary, One-vs-Rest, and multinomial classification with \(\ell_1\), \(\ell_2\), or Elastic-Net regularization. Regularization is applied by default to improve numerical stability and generalization.

Binary Logistic Regression

For binary classification, predict_proba method of LogisticRegression estimates the probability of the positive class:

\[\hat{p}(X_i) = \operatorname{expit}(X_i w + w_0) = \frac{1}{1 + \exp(-X_i w - w_0)}.\]

Binary logistic regression minimizes the following cost function with regularization term \(r(w)\):

\[\min_{w} \frac{1}{S}\sum_{i=1}^n s_i \left(-y_i \log(\hat{p}(X_i)) - (1 - y_i) \log(1 - \hat{p}(X_i))\right) + \frac{r(w)}{S C},\]

where \(s_i\) is the (optional) sample weight of sample \(i\), \(S = \sum_{i=1}^n s_i\), and \(C\) is the inverse of the regularization strength.

Regularization options via the penalty argument include 'l1', 'l2', 'elasticnet', and None (no penalty).

Multinomial Logistic Regression

Multinomial Logistic Regression extends binary logistic regression to handle more than two classes. It uses a softmax function to model probabilities for each class.
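
A rough sketch on the classic iris dataset (three classes), assuming a recent scikit-learn where the default lbfgs solver minimizes the full multinomial loss:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The softmax model yields one probability per class for each sample
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.predict_proba(X[:2]))  # each row sums to 1 across the three classes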

Solvers for Logistic Regression

LogisticRegression offers several solvers: ‘lbfgs’, ‘liblinear’, ‘newton-cg’, ‘newton-cholesky’, ‘sag’, and ‘saga’. Each solver has different capabilities regarding penalties and multiclass support. ‘lbfgs’ is the default solver for its robustness. ‘saga’ is often faster for large datasets.

LogisticRegressionCV provides built-in cross-validation to optimize the C (inverse regularization strength) and l1_ratio parameters.

Generalized Linear Models (GLM)

Generalized Linear Models (GLMs), implemented by TweedieRegressor in scikit-learn, extend linear models by linking the predicted values to a linear combination of features through an inverse link function and by generalizing the loss function beyond squared loss to unit deviance functions from exponential family distributions. This allows modeling a wider range of data types and distributions.

Distribution | Target Domain | Unit Deviance \(d(y, \hat{y})\)
Normal | \(y \in (-\infty, \infty)\) | \((y-\hat{y})^2\)
Bernoulli | \(y \in \{0, 1\}\) | \(2(y\log\frac{y}{\hat{y}}+(1-y)\log\frac{1-y}{1-\hat{y}})\)
Poisson | \(y \in [0, \infty)\) | \(2(y\log\frac{y}{\hat{y}}-y+\hat{y})\)
Gamma | \(y \in (0, \infty)\) | \(2(\log\frac{\hat{y}}{y}+\frac{y}{\hat{y}}-1)\)

Probability Density Functions for Poisson, Tweedie, and Gamma distributions.

TweedieRegressor specifically implements GLMs for the Tweedie family of distributions, which includes Normal (power=0), Poisson (power=1), Gamma (power=2), and Inverse Gaussian (power=3) distributions, among others.

>>> from sklearn.linear_model import TweedieRegressor
>>> reg = TweedieRegressor(power=1, alpha=0.5, link='log')
>>> reg.fit([[0, 0], [0, 1], [2, 2]], [0, 1, 2])
TweedieRegressor(alpha=0.5, link='log', power=1)
>>> reg.coef_
array([0.2463..., 0.4337...])
>>> reg.intercept_
-0.7638...

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD), implemented in SGDClassifier and SGDRegressor, is an efficient algorithm for fitting linear models, especially with large datasets. partial_fit enables online learning. SGD supports various loss functions and penalties, allowing implementation of models like Logistic Regression (loss="log_loss") and Linear SVM (loss="hinge").
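
A hedged sketch of both uses; note that the logistic loss is spelled "log_loss" in recent scikit-learn releases (older ones used "log"), and the hyperparameters below are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# loss="log_loss" gives a logistic-regression-like model fit by SGD
clf = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-4, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))

# partial_fit enables online / out-of-core learning on mini-batches
clf2 = SGDClassifier(loss="hinge")  # linear SVM objective
clf2.partial_fit(X[:500], y[:500], classes=np.unique(y))
clf2.partial_fit(X[500:], y[500:])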

Perceptron

Perceptron is a simple classification algorithm well-suited for large-scale learning. It’s a wrapper around SGDClassifier with perceptron loss and a constant learning rate. Perceptron is faster to train than SGD with hinge loss and often produces sparser models.

Passive Aggressive Algorithms

Passive-Aggressive algorithms, implemented in PassiveAggressiveClassifier and PassiveAggressiveRegressor, are another family for large-scale learning. They are similar to Perceptron but include a regularization parameter C. They are efficient and do not require learning rate tuning.

Robustness Regression: Handling Outliers

Robust regression techniques are designed to fit linear models in the presence of outliers or errors in the data.

RANSAC (RANdom SAmple Consensus)

RANSAC, implemented as RANSACRegressor, is a non-deterministic algorithm that fits a model to random subsets of inliers from the data. It separates data into inliers and outliers and estimates the model only from inliers. RANSAC is effective for linear and non-linear regression and is commonly used in computer vision.
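
A minimal sketch in which a fraction of the targets are corrupted; the corruption scheme and data are fabricated for illustration:

import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=100)
y[:10] += 50.0   # inject a handful of gross outliers

ransac = RANSACRegressor(random_state=0)
ransac.fit(X, y)
print(ransac.estimator_.coef_)   # fitted on the consensus set of inliers only
print(ransac.inlier_mask_.sum(), "samples flagged as inliers")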

Example of RANSAC regression in the presence of outliers.

Theil-Sen Estimator

TheilSenRegressor uses a generalized median for robust regression, making it resistant to multivariate outliers. However, its robustness decreases in high-dimensional settings. Theil-Sen is non-parametric and has a high breakdown point in univariate regression.

Comparison of OLS and Theil-Sen regression in the presence of outliers.

Huber Regression

HuberRegressor is less sensitive to outliers than OLS but, unlike RANSAC or Theil-Sen, gives outliers less weight rather than completely ignoring them. It uses a Huber loss function, which is quadratic for inliers and linear for outliers.

Comparison of Huber and Ridge Regression in the presence of outliers.

Quantile Regression

Quantile Regression, implemented as QuantileRegressor, estimates conditional quantiles of the target variable, such as the median. It’s useful when you need to predict intervals rather than just point predictions, especially when errors are non-constant or non-normally distributed. Quantile regression is more robust to outliers than OLS.
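
A short sketch predicting the conditional median and an upper quantile under heteroscedastic noise; the alpha=0 (no regularization) and solver="highs" choices are illustrative assumptions that may need adjusting depending on your scikit-learn and SciPy versions:

import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
# Noise spread grows with x, so the quantile lines diverge
y = 2.0 * X.ravel() + rng.normal(scale=0.5 + 0.5 * X.ravel())

median = QuantileRegressor(quantile=0.5, alpha=0, solver="highs").fit(X, y)
upper = QuantileRegressor(quantile=0.9, alpha=0, solver="highs").fit(X, y)
print(median.predict([[5.0]]), upper.predict([[5.0]]))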

Example of Quantile Regression predicting different quantiles.

Polynomial Regression: Extending Linear Models

Polynomial Regression extends linear models by using nonlinear functions of the input features, specifically polynomial features. This allows linear models to fit more complex, nonlinear relationships while maintaining computational efficiency.

>>> from sklearn.preprocessing import PolynomialFeatures
>>> import numpy as np
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(degree=2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])

PolynomialFeatures transformer generates polynomial and interaction features. Combined with LinearRegression in a Pipeline, it creates a polynomial regression model.
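
A brief sketch of such a pipeline, fitting a quadratic relationship on synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.3, size=100)

# Degree-2 features followed by an ordinary linear fit
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x.reshape(-1, 1), y)
print(model.predict(np.array([[0.0], [1.0], [2.0]])))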

Example of Polynomial Regression fitting data with varying polynomial degrees.

Conclusion

Scikit-learn provides a comprehensive suite of linear models for regression and classification. From basic Ordinary Least Squares to regularized models like Ridge, Lasso, and Elastic-Net, and robust methods for handling outliers, scikit-learn equips data scientists and machine learning practitioners with powerful tools for linear modeling. Understanding these models and their nuances is crucial for effectively applying them to diverse datasets and problems. By leveraging scikit-learn’s linear regression capabilities, you can build interpretable, efficient, and accurate predictive models. Dive deeper into each model’s documentation and experiment with different parameters to master the art of linear regression in scikit-learn.
