Regression Analysis Definition
Regression analysis is a statistical method used in statistical modeling to examine the relationship between a dependent (response) variable and one or more independent (predictor) variables. It shows how the typical value of the dependent variable (also called the criterion variable) changes when any one of the independent variables changes while the other independent variables are held fixed. Regression analysis not only describes how the variables relate to one another but also indicates which variables matter and which can be ignored.
A Little More on What is Regression Analysis
Regression analysis is typically used to estimate the average value of the dependent variable for fixed values of the independent variables. Less frequently, it is used to estimate a quantile or another location parameter of the conditional distribution of the dependent variable for fixed values of the independent variables. In either case, the objective is to estimate the regression function, which is a function of the independent variables.
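As a minimal sketch of what estimating a regression function can look like, the example below fits a straight line by ordinary least squares to a small made-up data set and then uses the fitted line to estimate the average response at a new predictor value. The data, the function names and the choice of a straight-line model are assumptions made for illustration only.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for the model y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept
    return a, b

# Hypothetical observations: one predictor (xs) and one response (ys).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)

def regression_function(x):
    """Estimated average value of the response at predictor value x."""
    return a + b * x

print(round(a, 2), round(b, 2))          # 0.05 1.99
print(round(regression_function(6), 2))  # 11.99
```

The fitted slope estimates how much the average response changes when the predictor increases by one unit.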
In simpler terms, regression analysis answers the following two questions:
- Do the independent (predictor) variables predict the dependent (criterion) variable well?
- Which independent variables are the most important predictors of the dependent variable?
Regression analysis also requires a probability distribution to characterize the variation of the dependent variable around the prediction of the regression function. A related approach, Necessary Condition Analysis (NCA), estimates the maximum value (rather than the average value) of the dependent variable for a given value of the independent variable. The goal of NCA is to identify the value of the independent variable that is necessary but not sufficient for a given value of the dependent variable.
Uses of Regression Analysis
There are three principal uses of regression analysis:
- To measure how strongly the independent variables (predictors) are related to the dependent variable.
- To forecast an outcome.
- To forecast a trend.
When used as a tool for prediction and forecasting, regression analysis overlaps substantially with machine learning, a branch of artificial intelligence. In some situations, regression analysis can also be used, with due caution, to infer a causal relationship between the independent and dependent variables.
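The three uses listed above can be sketched with a single fitted line. In the example below, the coefficient of determination (R squared) is used as one possible measure of predictor strength, the slope describes the trend, and the fitted line gives a forecast for the next period. The monthly sales figures and all names are hypothetical.

```python
from statistics import mean

def fit_and_score(xs, ys):
    """Least-squares line plus R squared; returns (intercept, slope, r_squared)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Share of the variation in ys that the fitted line accounts for.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical monthly sales figures.
months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 15, 15, 18, 20]

intercept, slope, r_squared = fit_and_score(months, sales)
forecast = intercept + slope * 7   # forecast for month 7

print(round(slope, 3))      # trend: average change in sales per month
print(round(r_squared, 3))  # strength of the linear relationship
print(round(forecast, 1))   # 21.8
```

A high R squared on past data does not guarantee an accurate forecast, since the forecast assumes the same trend continues beyond the observed months.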
There are several methods of performing regression analysis; however, all of them can be broadly classified into two categories:
Parametric: In parametric regression, the regression function is defined in terms of a finite number of unknown parameters that are estimated from the available data. Examples include linear regression and ordinary least squares regression.
Nonparametric: In nonparametric regression, the regression function is allowed to lie in a specified set of functions, which may be infinite-dimensional.
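To make the distinction concrete, the sketch below contrasts a parametric fit (a straight line, fully described by two estimated parameters) with a simple nonparametric one (a k-nearest-neighbour average, which keeps the whole data set and assumes no functional form). The k-nearest-neighbour method is chosen here only as an easy illustration; the data and names are assumptions.

```python
from statistics import mean

def fit_line(xs, ys):
    """Parametric: the fit is summarised by two numbers (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def knn_regress(xs, ys, x0, k=3):
    """Nonparametric: average the responses of the k points nearest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical data following a curve (y = x squared), not a straight line.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [0, 1, 4, 9, 16, 25, 36]

intercept, slope = fit_line(xs, ys)
line_pred = intercept + slope * 3   # prediction from the straight line
knn_pred = knn_regress(xs, ys, 3)   # prediction from the 3 nearest points

print(line_pred)           # 13.0
print(round(knn_pred, 2))  # 9.67 (the observed value at x = 3 is 9)
```

On this curved data the nonparametric estimate follows the local shape more closely, while the straight line is constrained by its assumed form.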
The performance of regression methods depends on the form of the data-generating process and on how well the chosen regression approach matches it. Because the true form of that process is generally unknown, regression analysis usually rests on assumptions about it. These assumptions can sometimes be tested when enough data are available. Regression models used for prediction often remain useful when the assumptions are moderately violated, although they may not perform optimally. However, in applications that involve small effects, or questions of causality based on observational data, regression methods can give misleading results.
In a narrower sense, regression may refer specifically to estimating continuous response variables, as opposed to the discrete response variables used in classification. This is known as metric regression.
References for Regression Analysis
Academic Research on Regression Analysis
Nonlinear regression analysis and its applications, Kass, R. E. (1990). Journal of the American Statistical Association, 85(410), 594-596. Although nonlinear regression is similar to linear regression in many ways, it still presents many new problems. Nonlinear regression models are usually based on solutions to differential equations, which makes it easier to connect statistical inference with scientific analysis. This paper reviews two books on the subject: one by Bates and Watts and the other by Seber and Wild.
Smooth regression analysis, Watson, G. S. (1964). Sankhyā: The Indian Journal of Statistics, Series A, 359-372. Notwithstanding the fact that graphical procedures are some of the most potent statistical tools, these procedures become cumbersome when too many observations and/or variables are involved. The author contends that developing procedures for analyzing the data output of electronic observing systems is a major challenge for statisticians. The paper presents a simple procedure for obtaining graphical representations of a large number of observations.
Locally weighted regression: an approach to regression analysis by local fitting, Cleveland, W. S., & Devlin, S. J. (1988). Journal of the American Statistical Association, 83(403), 596-610. This paper describes locally weighted regression (loess), a procedure for estimating a regression surface through multivariate smoothing. The procedure fits a function of the independent variables locally and in a moving fashion, comparable to the way a moving average is computed for a time series.
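The idea of local fitting can be sketched for a single predictor as follows. At each target point, a weighted straight line is fitted to the nearest observations, with weights that shrink as observations get farther from the target. This is a simplified illustration and not the authors' procedure: the tricube weight function, the frac parameter, the restriction to one predictor, and the assumptions of distinct x values and at least three observations are all choices made here.

```python
def tricube(u):
    """Tricube weight: largest at u = 0, falling to zero at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def loess_point(xs, ys, x0, frac=0.5):
    """Locally weighted linear fit evaluated at x0 (assumes distinct xs)."""
    n = len(xs)
    q = min(n, max(3, round(frac * n)))         # size of the local window
    h = sorted(abs(x - x0) for x in xs)[q - 1]  # distance to q-th nearest x
    w = [tricube((x - x0) / h) for x in xs]
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, xs)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, xs))
    if sxx == 0:                                # only one point has weight
        return y_bar
    sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx
    return y_bar + slope * (x0 - x_bar)

# Hypothetical data on a curve; the smoother is evaluated at each x in turn,
# much like a moving average sliding along a time series.
xs = list(range(10))
ys = [x ** 2 for x in xs]
smoothed = [loess_point(xs, ys, x) for x in xs]
```

Because each local fit is a straight line, data that lie exactly on a line are reproduced exactly, while curved data are smoothed according to the window size.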
A simulation study of the number of events per variable in logistic regression analysis, Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., & Feinstein, A. R. (1996). Journal of Clinical Epidemiology, 49(12), 1373-1379. This paper describes a Monte Carlo study performed by the authors to assess the effect of the number of events per variable (EPV) in logistic regression analysis. The simulations were based on data from a cardiac trial of 673 patients, of whom 252 died. Seven variables were identified as compelling predictors of mortality, giving 252/7 = 36 events per predictor variable for the full sample. The paper concluded that low EPV can cause major problems.
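The events-per-variable figure quoted above is a simple ratio, shown here as a short sketch. The function name is hypothetical; the numbers are those given in the summary.

```python
def events_per_variable(n_events, n_predictors):
    """Number of outcome events divided by the number of predictor variables."""
    return n_events / n_predictors

# 252 deaths and 7 predictor variables, as reported for the full sample.
epv = events_per_variable(252, 7)
print(epv)  # 36.0
```

Note that the numerator is the number of events (deaths), not the total number of patients in the trial.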
Methods of correlation and regression analysis: Linear and curvilinear, Ezekiel, M., & Fox, K. A. (1959). Essentially a revision of Ezekiel’s previous works, this third edition, rewritten with the assistance of Fox, shifts its focus from correlation to regression. The authors also describe the relationship between the analysis of variance and regression problems in much greater detail.
Tests for specification errors in classical linear least-squares regression analysis, Ramsey, J. B. (1969). Journal of the Royal Statistical Society. Series B (Methodological), 350-371. This paper studies the effects of a series of model mis-specifications on the distribution of least-squares residuals. The author observes that distributions of the least-squares residuals are normal for various specification errors, albeit with non-zero means. The author then proceeds to employ an alternative predictor of the disturbance vector to formulate four discrete techniques to investigate the presence of specification errors.
How many subjects does it take to do a regression analysis, Green, S. B. (1991). Multivariate Behavioral Research, 26(3), 499-510. Several widely accepted rules of thumb have been used to determine the minimum number of subjects needed to conduct multiple regression analyses. These rules are evaluated by comparing their results with those based on power analyses for tests of hypotheses about multiple and partial correlations. The results do not support rules that simply specify the minimum number of subjects as a fixed constant or as a minimum ratio of number of subjects (N) to number of predictors (m).
Multicollinearity in regression analysis: the problem revisited, Farrar, D. E., & Glauber, R. R. (1967). The Review of Economics and Statistics, 92-107. This paper describes a series of diagnostic techniques as well as a point of view regarding multicollinearity in regression analysis. The authors note that while the techniques supporting a particular point of view can be easily substituted, the reverse is usually not true. The paper not only describes the authors’ approach to multicollinearity but also presents a study of alternative interpretations of the problem. It should be noted that the paper defines multicollinearity as a statistical condition and not a mathematical one.
Regression analysis when the dependent variable is truncated normal, Amemiya, T. (1973). Econometrica: Journal of the Econometric Society, 997-1016. This paper considers the estimation of the parameters of a regression model with a normal dependent variable that is truncated to the left of zero. The author revisits Tobin’s iterative solution of the maximum likelihood equations and shows that Tobin’s initial estimator is inconsistent. The paper puts forward a simple, consistent and asymptotically normal estimator and establishes the asymptotic efficiency of the second-round estimator of Newton’s method.
Regression analysis of multivariate incomplete failure time data by modeling marginal distributions, Wei, L. J., Lin, D. Y., & Weissfeld, L. (1989). Journal of the American Statistical Association, 84(408), 1065-1073. This paper examines the regression analysis of multivariate failure time observations in survival studies where the times to two or more distinct failures are recorded for each subject. The authors use a Cox proportional hazards model to formulate each marginal distribution of the failure times without imposing any particular dependence structure among the distinct failure times on each subject.
On the regression analysis of multivariate failure time data, Prentice, R. L., Williams, B. J., & Peterson, A. V. (1981). Biometrika, 68(2), 373-379. This paper examines regression effects in studies where a reasonably large number of subjects may experience multiple failures. The authors employ two general classes of regression models (both of a stratified proportional hazards type) with the aim of relating the hazard or intensity function to covariates and to the preceding failure time history. The first class of models has baseline intensity functions that are arbitrary functions of time measured from the beginning of the study. The second class has baseline intensity functions that are arbitrary functions of time measured from the study subject’s immediately preceding failure.
Collinearity, power, and interpretation of multiple regression analysis, Mason, C. H., & Perreault Jr, W. D. (1991). Journal of Marketing Research, 268-280. The importance of multiple regression analysis in marketing research should not be underestimated. However, interpreting regression estimates is made difficult by correlated predictor variables and potential collinearity effects. This paper attempts to clarify the circumstances under which collinearity influences the estimates derived by multiple regression analysis and assesses the magnitude of such influences.
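As a rough illustration of how correlated predictors affect regression estimates, the sketch below computes the variance inflation factor (VIF) for a model with two predictors, where VIF = 1 / (1 - r^2) and r is the correlation between the predictors. The VIF is a commonly used diagnostic and is not taken from the paper summarized above; the data and names are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical predictor values with correlation 0.8.
x1 = [1, 2, 3, 4]
x2 = [1, 3, 2, 4]

print(round(pearson_r(x1, x2), 2))           # 0.8
print(round(vif_two_predictors(x1, x2), 2))  # 2.78
```

A VIF of 1 means the predictors are uncorrelated; larger values indicate that the variance of a coefficient estimate is inflated by that factor relative to the uncorrelated case.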