The paper by Lockhart, Taylor, Tibshirani, and Tibshirani (LTTT) is a significant advance in our understanding of inference for high-dimensional regression. It weaves together an impressive collection of results, culminating in a series of compelling convergence findings. Particularly striking is the way the test statistic automatically balances the effects of shrinkage and adaptive variable selection.
However, the results rest on considerable assumptions, a common and often necessary price for substantial theoretical progress on complex procedures. These assumptions provide a foundation, but they naturally lead to the crucial question: what can be achieved when we relax or eliminate them?
1. Examining the Core Assumptions
The assumptions underpinning this paper, and indeed most theoretical works on high-dimensional regression, can be categorized into several key components:
- Linear Model Correctness: The fundamental assumption that the relationship between variables is linear.
- Constant Variance (Homoscedasticity): The variance of the errors remains consistent across all levels of the independent variables.
- Normally Distributed Errors: The errors are assumed to follow a Normal distribution.
- Parameter Vector Sparsity: Only a small number of parameters are assumed to be non-zero, implying that only a few variables are truly influential.
- Weak Collinearity in the Design Matrix: This is often expressed through incoherence conditions, eigenvalue restrictions, or compatibility conditions, all describing a design matrix whose columns are not highly correlated.
It is crucial to recognize that the testability of these assumptions, particularly when $p > n$ (the number of predictors exceeds the number of observations), is limited. While they provide a valuable starting point for theoretical exploration, we should not lose sight of how strong they are. The true regression function could take any form, and there is no inherent reason to expect it to be linear. Similarly, the design assumptions, especially those concerning collinearity, are often questionable: in high-dimensional settings, high collinearity is the norm rather than the exception. Compressed sensing in signal processing, where the design matrix can be constructed by the analyst, is a specialized case; if such a matrix is populated with independent random Normals it will be incoherent with high probability, but this is a niche scenario.
This is not intended as a critique of the LTTT paper. Instead, it aims to highlight the importance of the question previously posed: What can we infer and achieve without relying on these strong assumptions?
Remark 1 Even in simpler, low-dimensional models, and even when model correctness is granted, model selection itself introduces complexities that are often overlooked. Variable selection, in particular, can cause the minimax risk to explode (Leeb and Pötscher, 2005, 2008). This is not a mere theoretical anomaly; the risk becomes large in a neighborhood of 0, a part of the parameter space of real practical relevance.
2. The Assumption-Free Perspective on Lasso
It’s important to highlight that the Lasso method possesses a valuable, assumption-free interpretation.
Consider observations $(X_1, Y_1), \ldots, (X_n, Y_n)$, where $Y_i \in \mathbb{R}$ and $X_i \in \mathbb{R}^p$. The regression function $m(x) = \mathbb{E}(Y \mid X = x)$ remains an unknown, arbitrary function. Without imposing assumptions on $m$, our capacity to estimate $m(x)$ is inherently limited.
However, robust theoretical foundations justify the use of Lasso with minimal assumptions, as evidenced by the work of Greenshtein and Ritov (2004) and Juditsky and Nemirovski (2000).
Let $\mathcal{L} = \{x^{\top}\beta : \beta \in \mathbb{R}^p\}$ represent the set of linear predictors. For a given $\beta$, the predictive risk is defined as
$$R(\beta) = \mathbb{E}\big(Y - \beta^{\top}X\big)^2.$$
Here, $(X, Y)$ denotes a new, independent pair of observations. We define the best sparse linear predictor $\beta_*$ (in the $\ell_1$ sense), where $\beta_*$ minimizes $R(\beta)$ over the constraint set $B(L) = \{\beta : \|\beta\|_1 \le L\}$. The Lasso estimator $\hat\beta$ then minimizes the empirical risk $\hat R(\beta) = n^{-1}\sum_{i=1}^{n}(Y_i - \beta^{\top}X_i)^2$ over the same set $B(L)$. For simplicity, assume all variables are bounded by a constant $C$ (though this is not strictly necessary). Then, without imposing linearity, design, or model assumptions, it can be shown that
$$R(\hat\beta) \;\le\; R(\beta_*) + \sqrt{\frac{8 C^4 L^4}{n}\,\log\!\left(\frac{2p^2}{\delta}\right)}.$$
This holds except on a set of probability at most $\delta$.
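To indicate where a bound of this form comes from, here is a minimal sketch of the standard persistence-style argument, in the spirit of Greenshtein and Ritov (2004); the constants are not meant to be sharp. Write $Z = (Y, X^{\top})^{\top}$ and $\Sigma = \mathbb{E}(ZZ^{\top})$, so that $R(\beta) = \gamma^{\top}\Sigma\gamma$ with $\gamma = (1, -\beta^{\top})^{\top}$, and let $\hat\Sigma$ be the empirical second-moment matrix, so that $\hat R(\beta) = \gamma^{\top}\hat\Sigma\gamma$. Then
$$\sup_{\|\beta\|_1 \le L}\big|\hat R(\beta) - R(\beta)\big| \;\le\; (1+L)^2 \max_{j,k}\big|\hat\Sigma_{jk} - \Sigma_{jk}\big|,$$
and Hoeffding's inequality combined with a union bound over the $(p+1)^2$ entries of $\hat\Sigma$ shows that, with probability at least $1-\delta$, the maximum is at most $C^2\sqrt{(2/n)\log\{2(p+1)^2/\delta\}}$. Since $\hat R(\hat\beta) \le \hat R(\beta_*)$ by definition of the Lasso estimator,
$$R(\hat\beta) - R(\beta_*) \;\le\; 2\sup_{\|\beta\|_1 \le L}\big|\hat R(\beta) - R(\beta)\big|,$$
which is of the stated $\sqrt{\log p / n}$ order.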
This inequality demonstrates that the predictive risk of the Lasso estimator closely approaches the risk associated with the best sparse linear predictor. This, in my view, elucidates why Lasso is effective. It delivers a predictor with the desirable characteristic of sparsity, is computationally feasible, and achieves a risk level close to that of the optimal sparse linear predictor.
3. Interlude: Contrasting Weak and Strong Modeling
When developing new methodologies, it is helpful to consider three distinct stages:
- Method Construction: Devising the method itself.
- Output Interpretation: Understanding and explaining the results generated by the method.
- Property Analysis: Studying the theoretical characteristics and behavior of the method.
Furthermore, distinguishing between two modeling paradigms is beneficial. Strong modeling presumes the model to be true across all three stages. Conversely, weak modeling assumes model validity primarily for stage 1 (method construction), but not necessarily for stages 2 and 3. In essence, a model can be a valuable tool for method development without requiring it to be strictly true for interpretation or theoretical analysis. My perspective in this discussion is rooted in a preference for weak modeling.
4. Assumption-Free Inference: Introducing HARNESS
Here, I want to discuss an approach developed in collaboration with Ryan Tibshirani, termed HARNESS: High-dimensional Agnostic Regression Not Employing Structure or Sparsity. This method builds upon the idea presented in Wasserman and Roeder (2009).
The core concept is data splitting. The dataset is divided into two halves, $D_1$ and $D_2$. Assume for simplicity a sample size of $2n$ observations, so that each half contains $n$ observations. Using the first half, $D_1$, a subset of variables $\hat S$ is selected. The method is deliberately agnostic to the specific variable selection technique employed; it could be forward stepwise, the Lasso, the elastic net, or any other method. The outcome of this initial phase is the selected predictor subset $\hat S$ and a coefficient estimator $\hat\beta$. The second data half, $D_2$, is then used to derive distribution-free inferences for the following questions:
- Predictive Risk Assessment: What is the predictive risk of $\hat\beta$?
- Variable Contribution to Risk: How much does each variable in $\hat S$ contribute to the predictive risk?
- Best Linear Predictor in Selected Set: What is the best linear predictor using only the variables in $\hat S$?
All inferences drawn from $D_2$ are interpreted as being conditional on $D_1$. (A variation involves using $D_1$ solely for selecting $\hat S$, and then constructing the predictor coefficients from $D_2$. For this discussion, we assume $\hat\beta$ is derived from $D_1$.)
More specifically, let
$$R = \mathbb{E}\big(\,|Y - \hat\beta^{\top}X|\;\big|\;D_1\big),$$
where the randomness is over a new pair $(X, Y)$, conditional on $D_1$. Note that $R$ is defined on an absolute scale (absolute rather than squared error) for better interpretability, and that $\hat\beta_j = 0$ for $j \notin \hat S$. The first question addresses estimating and constructing confidence intervals for $R$ (conditional on $D_1$). The second focuses on inferring
$$R_j = \mathbb{E}\big(\,|Y - \hat\beta_{(j)}^{\top}X|\;\big|\;D_1\big)$$
for each $j \in \hat S$, where $\hat\beta_{(j)}$ is identical to $\hat\beta$ except that $\hat\beta_j$ is set to 0. Thus $R_j - R$ represents the risk increase incurred by excluding $X_j$. The third question aims to infer
$$\beta_*(\hat S) = \operatorname*{argmin}_{\beta}\; \mathbb{E}\big[(Y - X_{\hat S}^{\top}\beta)^2\;\big|\;D_1\big],$$
which is the coefficient vector of the best linear predictor within the chosen variable set. We call $\beta_*(\hat S)$ the projected parameter. Thus $x_{\hat S}^{\top}\beta_*(\hat S)$ is the optimal linear approximation to $m(x)$ within the linear space spanned by the selected variables.
A consistent estimator of $R$ is
$$\hat R = \frac{1}{n}\sum_{i \in D_2}\delta_i,$$
where $\delta_i = |Y_i - \hat\beta^{\top}X_i|$ and the sum is over the observations in $D_2$. An approximate $1-\alpha$ confidence interval for $R$ is $\hat R \pm z_{\alpha/2}\, s/\sqrt{n}$, where $s$ is the standard deviation of the $\delta_i$'s.
The validity of this confidence interval is essentially distribution-free. To achieve complete distribution-freeness and avoid asymptotic approximations, $R$ could instead be defined as the median of the conditional distribution of $|Y - \hat\beta^{\top}X|$ given $D_1$. In that case, the order statistics of the $\delta_i$'s can be employed to obtain a finite-sample, distribution-free confidence interval for $R$.
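As an illustration of the order-statistic construction, here is a minimal sketch in Python (the function name and inputs are purely illustrative, not part of HARNESS itself). It uses only the fact that, for an i.i.d. sample from a continuous distribution, the number of observations falling below the median is Binomial$(n, 1/2)$, so a pair of order statistics gives a finite-sample, distribution-free confidence interval for the median.

```python
import numpy as np
from scipy.stats import binom

def median_ci(delta, alpha=0.05):
    """Finite-sample, distribution-free confidence interval for the median of delta."""
    d = np.sort(np.asarray(delta))
    n = len(d)
    # Largest k with P(Bin(n, 1/2) <= k) <= alpha/2; then [d_(k+1), d_(n-k)]
    # covers the median with probability at least 1 - alpha.
    k = int(binom.ppf(alpha / 2, n, 0.5))
    if binom.cdf(k, n, 0.5) > alpha / 2:
        k -= 1                             # step down so the coverage guarantee holds
    k = max(k, 0)                          # for very small n the nominal level may be unattainable
    return d[k], d[n - k - 1]              # 0-based indexing: these are d_(k+1) and d_(n-k)
```

Applied to the $\delta_i$'s computed on $D_2$, this yields the finite-sample interval described above.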
Estimates and confidence intervals for $R_j$ can be derived from $\hat R_j$, where
$$\hat R_j = \frac{1}{n}\sum_{i \in D_2}\big|Y_i - \hat\beta_{(j)}^{\top}X_i\big|.$$
Similarly, estimates and confidence intervals for $\beta_*(\hat S)$ can be obtained by applying standard least squares to $D_2$, using only the variables in $\hat S$.
The HARNESS Procedure Summarized:
Input: Dataset $D = \{(X_1, Y_1), \ldots, (X_{2n}, Y_{2n})\}$.
- Data Splitting: Randomly divide $D$ into two halves, $D_1$ and $D_2$.
- Variable Selection: Use $D_1$ to select a subset of variables $\hat S$, using any method (forward stepwise, the Lasso, etc.).
- Predictive Risk Definition: Define the predictive risk $R = \mathbb{E}(|Y - \hat\beta^{\top}X| \mid D_1)$, the risk of the selected model on future data, conditional on $D_1$.
- Inference from Second Half: Using $D_2$, calculate point estimates and confidence intervals for $R$, the $R_j$'s, and $\beta_*(\hat S)$ (a code sketch follows below).
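To make the procedure concrete, here is a minimal sketch of HARNESS in Python. This is not the authors' implementation: scikit-learn's `LassoCV` is used only as a stand-in for "any variable selection method", the intervals are the normal-approximation intervals described above, and confidence intervals for the projected coefficients are omitted for brevity.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LassoCV, LinearRegression

def harness(X, Y, alpha=0.05, seed=0):
    """Sketch of HARNESS: split the data, select on D1, assumption-free inference on D2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    half = len(Y) // 2
    i1, i2 = idx[:half], idx[half:]                    # indices of D1 and D2
    X1, Y1, X2, Y2 = X[i1], Y[i1], X[i2], Y[i2]

    # Stage 1: variable selection and coefficients from D1 (any method could be used here).
    sel = LassoCV(cv=5).fit(X1, Y1)
    S = np.flatnonzero(sel.coef_)                      # selected set S-hat
    beta = sel.coef_                                   # beta_j = 0 for j not in S-hat
    pred = sel.intercept_ + X2 @ beta                  # predictions on D2

    z = norm.ppf(1 - alpha / 2)
    m = len(Y2)

    # Inference for R: mean absolute prediction error on future data, conditional on D1.
    d = np.abs(Y2 - pred)
    R_hat = d.mean()
    R_ci = (R_hat - z * d.std(ddof=1) / np.sqrt(m),
            R_hat + z * d.std(ddof=1) / np.sqrt(m))

    # Inference for R_j - R: risk inflation from dropping each selected variable.
    inflation_ci = {}
    for j in S:
        beta_j = beta.copy()
        beta_j[j] = 0.0
        diff = np.abs(Y2 - (sel.intercept_ + X2 @ beta_j)) - d
        inflation_ci[j] = (diff.mean() - z * diff.std(ddof=1) / np.sqrt(m),
                           diff.mean() + z * diff.std(ddof=1) / np.sqrt(m))

    # Projected parameters: least squares of Y on the selected variables, using D2 only.
    proj = LinearRegression().fit(X2[:, S], Y2).coef_ if len(S) else np.array([])

    return {"selected": S, "R_hat": R_hat, "R_ci": R_ci,
            "risk_inflation_ci": inflation_ci, "projected_coef": proj}
```

Since no model correctness is assumed, standard errors for the projected coefficients would naturally be of the heteroscedasticity-robust (sandwich) type, computed from the least squares fit on $D_2$.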
HARNESS shares similarities with POSI (Berk et al., 2013), another inference method for model selection, in eschewing assumptions about linear model correctness. However, POSI aims for inferences valid across all possible selected models, while HARNESS focuses on the chosen model. HARNESS also emphasizes predictive inferential statements.
Example: Wine Dataset Application
Using the wine dataset (we thank the authors for providing the data), forward stepwise selection was applied to the first half of the data to select a model. The selected variables were Alcohol, Volatile-Acidity, Sulphates, Total-Sulfur-Dioxide, and pH. A 95% confidence interval for the predictive risk of the null model was (0.65, 0.70). For the selected model, the confidence interval for $R$ was (0.46, 0.53). Bonferroni-corrected 95% confidence intervals for the risk inflations $R_j - R$ are shown in the first figure below, and for the projected model parameters in the second.
[Figure 1: Bonferroni-corrected 95% confidence intervals for the risk inflation $R_j - R$ for the selected wine variables: Alcohol, Volatile-Acidity, Sulphates, Total-Sulfur-Dioxide, and pH.]
[Figure 2: Bonferroni-corrected 95% confidence intervals for the projected model parameters for the same variables.]
5. The Value Proposition of Data-Splitting
Data-splitting sometimes faces skepticism from statisticians, with two common objections: first, the randomness of inferences across repeated splits, and second, the perceived wastefulness of data.
The first objection can be addressed by performing many splits and combining the information appropriately, a more involved approach that is detailed elsewhere. The second objection is, in my view, unfounded: data splitting buys simple, assumption-free inference, and both halves of the data are used, so nothing is wasted. Splitting may cost some power relative to standard methods if the model were correct, but that comparison is misleading because our goal is inference without assuming the model is correct. It is akin to comparing nonparametric and parametric estimators: nonparametric methods have slower convergence rates as the price of weaker assumptions.
6. Conformal Prediction: Distribution-Free Predictive Inference
Given the focus on regression methods with minimal assumptions, Vladimir Vovk’s theory of conformal inference warrants mention. This is a fully distribution-free, finite-sample method for predictive regression, described in Vovk, Gammerman, and Shafer (2005) and Vovk, Nouretdinov, and Gammerman (2009). Unfortunately, it remains underappreciated by many statisticians. Conformal prediction’s statistical properties, including minimax properties, are explored in Lei, Robins, and Wasserman (2012) and Lei and Wasserman (2013).
A comprehensive explanation is beyond the scope of this discussion, but the core idea and its relevance to this paper can be outlined. Given data $(X_1, Y_1), \ldots, (X_n, Y_n)$ and a new value $X_{n+1}$, the goal is to predict $Y_{n+1}$. Let $y$ be a tentative guess for $Y_{n+1}$. Form the augmented dataset
$$(X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, y).$$
Fit a linear model to this augmented dataset and compute the residuals $e_1, \ldots, e_{n+1}$ for all $n+1$ observations. To test $H_0 : Y_{n+1} = y$, observe that under $H_0$ the residuals are exchangeable (invariant under permutations), so
$$\pi(y) = \frac{1}{n+1}\sum_{i=1}^{n+1} I\big(|e_i| \ge |e_{n+1}|\big)$$
is a distribution-free p-value for $H_0$.
Inverting this test, let $C = \{y : \pi(y) \ge \alpha\}$. It can be shown that
$$P\big(Y_{n+1} \in C\big) \ge 1 - \alpha.$$
Therefore, $C$ is a distribution-free, finite-sample prediction set for $Y_{n+1}$. Like HARNESS, its validity does not hinge on the correctness of the linear model: $C$ achieves the desired coverage probability regardless of the true model. Both HARNESS and conformal prediction use linear models as prediction tools, but their inferences remain valid without assuming the linear model is true. (In fact, conformal prediction can utilize any method for generating residuals, not just linear models; see Lei and Wasserman, 2013.)
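A minimal sketch of this construction in Python is given below; the grid over candidate values $y$ and the function name are illustrative choices, not part of the conformal method itself.

```python
import numpy as np

def conformal_interval(X, Y, x_new, alpha=0.1, grid=None):
    """Distribution-free prediction set for Y_{n+1} at x_new, via conformal inference."""
    n = len(Y)
    if grid is None:                                   # candidate y values; must cover the plausible range
        span = Y.max() - Y.min()
        grid = np.linspace(Y.min() - span, Y.max() + span, 500)

    ones = np.ones((n, 1))
    Xa = np.vstack([np.hstack([ones, X]),
                    np.concatenate([[1.0], x_new])])   # augmented design (with intercept)
    accepted = []
    for y in grid:
        Ya = np.concatenate([Y, [y]])                  # augmented response with trial value y
        beta, *_ = np.linalg.lstsq(Xa, Ya, rcond=None)
        e = np.abs(Ya - Xa @ beta)                     # residuals from the augmented fit
        pval = np.mean(e >= e[-1])                     # rank-based p-value for H0: Y_{n+1} = y
        if pval >= alpha:
            accepted.append(y)                         # y is not rejected at level alpha
    accepted = np.asarray(accepted)
    return accepted.min(), accepted.max()              # reported as a single interval
```

Reporting the minimum and maximum of the accepted grid collapses the conformal set to a single interval; in principle $C$ can be a union of intervals, and a finer grid gives a better approximation.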
Analyzing how $C$ changes as variables are removed provides another assumption-free approach to exploring the effects of predictors in regression. Minimizing the length of $C$ over the Lasso path can also serve as a distribution-free method for choosing the Lasso regularization parameter.
Relatedly, assumption-free methods for inferring graphical models are of interest, as explored in Wasserman, Kolar, and Rinaldo (2013).
7. Causation: Separating Prediction and Inference
While LTTT doesn’t address causation, it’s an underlying concern when discussing regression assumptions. Causation, prediction, and inference are distinct concepts, yet often conflated.
Even with a correct linear model, interpreting the parameters requires caution. Commonly, $\beta_j$ is described as the change in $Y$ when $x_j$ is changed by one unit, holding the other covariates constant. This is inaccurate: $\beta_j$ represents the change in our prediction of $Y$ when $x_j$ changes. This distinction is crucial; it is the difference between association and causation.
Causation is about the change in $Y$ as $x_j$ is actively changed. Association (prediction) is about the change in our prediction of $Y$ as $x_j$ changes. Prediction concerns $\mathbb{E}(Y \mid X = x)$, while causation concerns $\mathbb{E}(Y \mid \text{set } X = x)$. They coincide only if $X$ is randomly assigned. Causal claims necessitate including all potential confounding variables $Z_1, Z_2, \ldots$ in a complete causal model, say
$$Y = \beta_0 + \sum_j \beta_j X_j + \sum_k \gamma_k Z_k + \epsilon.$$
The relationship between $Y$ and $X$ alone is then
$$\mathbb{E}(Y \mid X = x) = \beta_0 + \sum_j \beta_j x_j + \sum_k \gamma_k\, \mathbb{E}(Z_k \mid X = x).$$
The causal effect of $x_j$ is $\beta_j$, while the association (prediction) is governed by the full right-hand side, including the terms $\gamma_k\, \mathbb{E}(Z_k \mid X = x)$. Omitted confounders are what drive the two apart.
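The gap between the two is easy to see in a small simulation. The sketch below uses one covariate and one omitted confounder with made-up coefficients; regressing $Y$ on $X$ alone recovers the association $\beta + \gamma\,\mathrm{Cov}(Z, X)/\mathrm{Var}(X)$ rather than the causal effect $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta, gamma = 1.0, 2.0                    # beta: causal effect of X; gamma: effect of the confounder Z

Z = rng.normal(size=n)                    # unobserved confounder
X = 0.8 * Z + rng.normal(size=n)          # X is influenced by Z
Y = beta * X + gamma * Z + rng.normal(size=n)

slope, _ = np.polyfit(X, Y, 1)            # regression of Y on X alone
print(f"causal effect beta:              {beta:.2f}")
print(f"coefficient of Y regressed on X: {slope:.2f}")   # about beta + gamma*0.8/1.64, i.e. roughly 1.98
```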
Returning to the paper: even when the linear model is correct, the interpretation of the coefficients demands caution. Non-statisticians, who are the primary users of these methods, are prone to interpreting the estimated coefficients $\hat\beta_j$ causally, no matter how many warnings we attach.
8. Conclusion
LTTT’s paper is a significant advancement in high-dimensional regression understanding, poised to inspire substantial future research.
This discussion has emphasized the role of assumptions. Low-dimensional models allow for relatively assumption-light methods. High-dimensional models pose greater challenges for low-assumption inference.
I hope to have conveyed the value of exploring the low-assumption landscape to the authors. Congratulations to them on a stimulating and impactful paper.
Acknowledgements
Thanks to Rob Kass, Rob Tibshirani, and Ryan Tibshirani for their insightful comments.
References
Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. The Annals of Statistics, 41, 802-837.
Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.
Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. The Annals of Statistics, 28, 681-712.
Leeb, H. and Pötscher, B. M. (2008). Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics, 142, 201-211.
Leeb, H. and Pötscher, B. M. (2005). Model selection and inference: Facts and fiction. Econometric Theory, 21, 21-59.
Lei, J., Robins, J. and Wasserman, L. (2012). Efficient nonparametric conformal prediction regions. Journal of the American Statistical Association.
Lei, J. and Wasserman, L. (2013). Distribution free prediction bands. Journal of the Royal Statistical Society, Series B.
Vovk, V., Gammerman, A. and Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.
Vovk, V., Nouretdinov, I. and Gammerman, A. (2009). On-line predictive linear regression. The Annals of Statistics, 37, 1566-1590.
Wasserman, L., Kolar, M. and Rinaldo, A. (2013). Estimating undirected graphs under weak assumptions. arXiv:1309.6933.
Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. The Annals of Statistics, 37, 2178-2201.