How do we create models to predict or explain?
Categories: High-Level Discussions, Paper Discussions
Galit Shmueli, “To Explain or To Predict?”, Statistical Science, 25(3): 289–310, 2010. [PDF] [Research Page]
Summary: This publication examines the fundamental and practical differences between the use of statistical and other empirical methods for prediction and causal explanation. The article’s thesis is that statistical modeling, from the early stages of study design and data collection, to choice of model type and variables, through to data usage and reporting, takes different paths and leads to different results depending on whether the objective is to predict or explain. The current conflation of explanation and prediction in empirical models is pervasive in many fields and particularly in the social sciences. Not only are explanatory power and predictive power confused, but there is a clear lack of adequate predictive modeling in those fields due to the under-appreciation of the scientific role of prediction.
As you have probably noticed, I took a break from writing because of my move to a new residence across town. I also stepped back from current and pending analytics projects in order to focus on a new job and to think more deeply about football analytics and my contribution to it. It was during that time that I came across this extremely thought-provoking paper on statistical modeling. Statistical modeling goes to the heart of the type of analytics that I do, and this publication has motivated me to scrutinize and refine my own work and processes.
Galit Shmueli is a Distinguished Professor at the Institute of Service Science at National Tsing Hua University in Taiwan. (“Service science”, as best as I understand it, is industrial and systems engineering applied to the service sector. This link from IBM Research tells more.) Prof. Shmueli held positions at Carnegie Mellon (visiting), the University of Maryland Business School, and the Indian School of Business before she joined the Institute last year.
Shmueli’s research focuses on statistical and data mining methods applied to e-commerce problems, but most of her recent work has been on big-picture topics of statistical strategy, such as the use of large datasets to make statistical inferences, the ability of a dataset to answer statistical questions, and the link between model objective and model building. The “to explain or to predict” problem is one of these big-picture topics.
Explanatory and predictive models
Analytics is the use of data to create information that either explains the world in some limited scope or provides guidance on how to interact with it. The tool of choice that interacts with data to enable this understanding or guidance is the mathematical model – hypotheses about the relationship, interaction, and association of inputs to outputs expressed in the language of mathematics and statistics. Models can be used to explain phenomena at a conceptual level. These are explanatory models. They can also be used to produce expectations of future behavior that one can measure. These are predictive models.
To give some examples: in my previous life as a basic science researcher, I developed models to describe aerodynamic forces at extreme flight conditions and to predict trajectories of flight vehicles. In my work in soccer analytics, I have created models of team and referee influence on effective match time, individual player influence on in-match goal differentials (adjusted plus/minus), sports fans’ sensitivity to buying secondary-market tickets given their team’s in-season performance, and expected performance of a team in league competition given its goal statistics (soccer Pythagorean).
Most of these models are explanatory in nature (soccer Pythagorean can be a predictive model if the goal model is itself predictive), and some of them capture part of the underlying phenomena well. But most of them are poor as predictive models. The adjusted plus/minus model has received a lot of criticism in the basketball community for its inability to predict the point differential of expected lineups, to give just one example. Yet the players with the highest adjusted plus/minus coefficients are, more often than not, the players whom writers and pundits would expect to see. It is very common, particularly in economics and the social sciences, to see researchers claim that because a model has explanatory power, it must have predictive power as well. The lack of predictive power, it is inferred, means that something is fatally wrong with the model. But is that assessment accurate?
Different goals require different approaches
Shmueli’s response to the question is the thesis of her research: many researchers in non-statistical fields conflate explanation and prediction when it comes to the use of statistical models, but both uses require different approaches to modeling that result in very different end products. These approaches have rarely been discussed in the statistics literature except for the issue of model selection. Moreover, predictive modeling has been seen by segments of the academic research community as a “money-grubbing activity” with little relevance to basic knowledge, so most of the recent advances in predictive modeling have come from the machine learning community.
Shmueli responds that predictive models can be used to inform and refine hypotheses and theories that are expressed in statistical models, especially in today’s Big Data era. She wants the statistics research community to do two things: take the lead on explaining the distinct modeling processes that create predictive and explanatory models, and embrace the use of predictive modeling to reduce the gap between theory and practice.
A very brief notational interlude
This will be the only equation in the review, but for the remainder of the post assume that we are considering a model \(\mathcal{F}\) that relates \(X\) to \(Y\):
\[ Y = \mathcal{F}\left(X\right) \]
The dimensions of statistical models
Statistical models exhibit a rich collection of competing characteristics and measures, and explanatory and predictive models weight these features to varying degrees. Shmueli describes four pairs of competing interests that the two classes of models treat differently.
- Causation vs association: Explanatory models capture causal relationships (\(X\) causes \(Y\)), while predictive models capture associations between \(X\) and \(Y\).
- Theory vs data: Explanatory models are derived from theories that hypothesize relationships between \(X\) and \(Y\). Predictive models are derived from data and are often opaque, although it is sometimes possible to extract a relationship between \(X\) and \(Y\) from them.
- Retrospective vs prospective: Explanatory models test a set of hypotheses on data in the past. Predictive models, well, predict future observations.
- Bias vs variance: Bias is the error that comes from getting the model itself wrong, while variance has two components: the model’s estimation error and the irreducible measurement error (aka noise). Because the goal of explanatory models is to explain phenomena, it’s important to minimize the bias. Predictive models seek the best trade-off between bias and variance, to the point that it’s not unusual for a model to predict future behavior well despite an inability to capture the underlying phenomena. A small numerical sketch of this trade-off follows the list.
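To make the bias-variance trade-off concrete, here is a minimal sketch in Python. It is my own illustration, not from the paper: the sine-wave “phenomenon”, the noise level, and the polynomial degrees are all arbitrary choices. A low-degree polynomial underfits (high bias); a high-degree one chases the training data and predicts new data poorly (high variance).

```python
# Sketch: the bias-variance trade-off on synthetic data (not from the paper).
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    return np.sin(2 * np.pi * x)  # the "underlying phenomenon"

x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, 0.3, 30)    # irreducible noise
x_test = rng.uniform(0, 1, 1000)
y_test = true_f(x_test) + rng.normal(0, 0.3, 1000)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)     # least-squares fit
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```

Typically the middle degree wins on test error, even though the highest degree hugs the training data most closely.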
The elements of statistical modeling
Shmueli’s major contribution in this paper is to lay out the statistical modeling process and explain how the end goal, to predict or to explain, affects every step of the process. The upshot: it is essential to know how a model will be used in order to develop it correctly and obtain credible results.
Model design and data collection. The first step involves matters of sample size, sampling schemes, and the use of either experimental or observational data.
Explanatory modeling places a high value on statistical power, precision, and reduced bias, so enough data is collected to reduce bias to an acceptable level; beyond a certain amount of data, the reduction in bias and the improvement in precision are minimal. Experimental data is greatly preferred in an explanatory model, collected in a way that represents the underlying phenomena sufficiently (factorial experimental design is almost ideal for this purpose).
Predictive modeling places a high value on lowering bias and variance simultaneously, and predictive models are in any case derived from data. For these models, collecting as much data as possible is the top priority in order to perform out-of-sample testing appropriately, and the data has to reflect reality. So observational data, perhaps collected using response surface methods, is preferred, with attention paid to measurement quality.
Data preparation. Once the data is collected, it has to be examined, and data is inherently messy even in experimental settings. This step brings up the issues of handling missing data and data partitioning for model training and testing.
Missing data, for the most part, is not very useful to explanatory modeling, so in many cases it is possible to discard those data points. Data partitioning is rarely used in explanatory modeling, as it could reduce the model’s explanatory power, but it has been used to assess a model’s robustness.
In contrast to explanatory modeling, missing data must be accounted for in the predictive modeling process as it may influence the behavior of the response variable \(Y\). Likewise, data partitioning is a critical part of predictive modeling, with a variety of methods that partition datasets in order to create training and testing datasets that preserve certain user-defined characteristics.
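As a concrete illustration of both steps, here is a minimal sketch in Python on made-up data; the mean imputation and the 75/25 split are illustrative choices of mine, not prescriptions from the paper.

```python
# Sketch: two data-preparation steps common in predictive modeling
# (synthetic data; mean imputation and a 75/25 split are illustrative choices).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "shots": rng.poisson(12, 200).astype(float),
    "possession": rng.uniform(0.3, 0.7, 200),
})
df["goals"] = 0.15 * df["shots"] + rng.normal(0, 0.5, 200)

# Impute missing predictor values instead of discarding rows: in a
# predictive setting the missingness itself may carry signal about Y.
df.loc[rng.choice(200, size=20, replace=False), "shots"] = np.nan
df["shots"] = df["shots"].fillna(df["shots"].mean())

# Partition into training and test sets so the model is evaluated
# only on data it never saw during fitting.
train, test = train_test_split(df, test_size=0.25, random_state=7)
print(len(train), "training rows /", len(test), "test rows")
```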
Exploratory analysis. This is the point where eyeballs are applied to the data in order to visualize trends, and some light analysis is performed on the data in order to make first-order summaries.
Explanatory modeling makes use of exploratory analysis to establish causal relationships, hypothesize model assumptions, and examine the effect of variable transformations. For predictive modeling, the emphasis is on assessing the quality of the data measurements and the associations between variables.
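A sketch of what those first-order summaries might look like in practice, again on made-up data (the variables and their distributions are my own assumptions):

```python
# Sketch: first-order summaries during exploratory analysis (made-up data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "goals_for": rng.poisson(1.4, 100),
    "goals_against": rng.poisson(1.1, 100),
})
# Points per match under the usual 3/1/0 scheme.
df["points"] = 3 * (df["goals_for"] > df["goals_against"]) \
             + 1 * (df["goals_for"] == df["goals_against"])

print(df.describe())  # location, spread, and range of each variable
print(df.corr())      # pairwise associations between the variables
```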
Choice of variables. This step selects the variables that the model interacts with.
In explanatory modeling, the goal is to implement the theories that explain the underlying behavior that the model seeks to capture. The variables are windows to the underlying phenomena, and play various roles in the social and medical sciences literature: antecedent and consequent, treatment and control, exposure and confounding, to give some examples. It’s important not to omit variables that belong in the model; otherwise their effects fold into the error term, which then correlates with the included variables (known in econometrics as endogeneity).
Variable choice for predictive models is much simpler. All that matters is good association between the variables, sufficiently high-quality measurements, and predictor variables that are available before the response is observed (ex-ante availability).
Choice of methods. This is where the math enters the process — whether to use statistical or algorithmic models, and whether to use an ensemble method to build a predictive model.
It makes sense that explanatory modeling favors models that can be interpreted. Such models include statistical models (e.g. those that make use of a distribution) and regression models (e.g. linear, logistic, Cox/hazard). Predictive models can use regression models as well, but over the last 15-20 years attention has shifted to non-parametric models such as neural networks, k-nearest neighbor models, and support vector machines.
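To illustrate the contrast, here is a minimal sketch in Python that fits an interpretable linear regression and an algorithmic k-nearest-neighbor model to the same synthetic data; the data-generating process and the choice of k=10 are arbitrary assumptions of mine.

```python
# Sketch: an interpretable model vs. an algorithmic one on the same data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, (300, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

lin = LinearRegression().fit(X_tr, y_tr)
knn = KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr)

# The linear model returns coefficients one can interpret and test against
# theory; k-NN has no such structure but can follow the curvature in X2.
print("linear coefficients:", lin.coef_)
print("linear R^2 on test: %.3f" % lin.score(X_te, y_te))
print("k-NN   R^2 on test: %.3f" % knn.score(X_te, y_te))
```

The linear model hands back coefficients that can be checked against theory; the k-NN model offers no such structure, but its flexibility may buy better out-of-sample accuracy on the curved term.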
Model validation, evaluation, and selection. After a model is trained, it must be evaluated and validated.
Explanatory modeling considers the following questions: Does the model fit the data well? Does the model represent the underlying theory well? Predictive modeling wants to know: How well does the model predict new, unseen data?
Both modeling approaches answer these questions by different means, whether by model specification and residual tests or by hold-out tests and confusion matrices. They also must address the twin banes of underfitting and overfitting, whether by revising the model, adding or removing variables, or incorporating more data. Multicollinearity is a particular malady of explanatory models.
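For the predictive side of the question, here is a minimal sketch of a hold-out test with a confusion matrix; the binary classification problem is made up, and logistic regression is just a convenient stand-in.

```python
# Sketch: hold-out evaluation with a confusion matrix (made-up labels).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X = rng.normal(0, 1, (400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=11)
clf = LogisticRegression().fit(X_tr, y_tr)

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_te, clf.predict(X_te)))
print("in-sample accuracy:     %.3f" % clf.score(X_tr, y_tr))
print("out-of-sample accuracy: %.3f" % clf.score(X_te, y_te))
```

A large gap between in-sample and out-of-sample accuracy is a quick symptom of overfitting.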
Model use and reporting. At the end of the explanatory modeling process, the model is used to make statistical conclusions and evaluate theories on causal relationships between variables. In contrast, the predictive model is used to make predictions from new data, and the results are used to formulate new hypotheses, establish relevance of existing theories, and assess predictability of certain relations.
Shmueli’s takeaways — and mine
Shmueli walks through the dimensions of statistical models and the taxonomy of statistical modeling for explanatory or predictive purposes and arrives at a few end points. The first is that statistical models are used in ways that would probably shock and horrify researchers in the academic field of statistics, but this use is the reality, and the statistics community needs to interact with it and ultimately influence it. The second is that practitioners erroneously blur the lines between a model’s predictive power and its explanatory power, when in reality the model’s design and construction determine its capability and power. The third is that predictive modeling needs to be treated not as a mere applied exercise, but as a key part of modeling that examines, informs, and refines the theories underpinning explanatory models.
So how do these viewpoints relate to soccer analytics, or more specifically, the kind of soccer analytics that intersects with data science?
In my opinion, a small minority of the models used in sports analytics in general, and soccer analytics in particular, are predictive models. The betting industry deploys its own models that return the likelihood of a result between two teams, perhaps taking into account some characteristics of the teams, previous matchups, and recent results. Such a model is derived very differently, and looks very different, from one that infers a player’s impact through his involvements, or even his presence, in a match. Sports science has devoted more attention to injury prediction models using neural networks or other non-parametric models, with very interesting results reported across the football codes. The resulting models may be ugly and not all that intuitive, but they have to be capturing some element of the underlying phenomena in order to predict future behavior. Those observations create questions that can be investigated through a descriptive or explanatory model.
When I consider two of my own models, the soccer Pythagorean and adjusted plus/minus, I see in the Pythagorean a good descriptive model of the effect of goal statistics on expected league performance. But its predictive power is very sensitive to the expected or projected goal model fed into it (which makes it a classic example of a hierarchical statistical model), and ultimately isn’t all that great. The same issue applies to adjusted plus/minus models. Does the observation that the soccer Pythagorean and adjusted plus/minus have good explanatory power and not-so-good predictive power make them crappy models? No; it just means that they are better explanatory models. It does mean that I need to be honest about my models’ predictive and explanatory qualities.
I’ll just wrap up this post by saying that Shmueli has written a valuable paper for those who are serious about the models that they build and the results that they obtain from them. By having a clear idea of what kind of analytics problem we want to solve, we can develop statistical models in a precise and coherent way and generate results that advance our understanding of our small corner of the world.
MORE: Prof. Shmueli has given a number of talks on the subject over the last five years, and here is one from 2010 that is a bit rushed but explains the philosophical nature of her work better. But do read her paper first!