In this Unit, we will take another look at the whole data analysis process. Here, we focus on the actual (statistical) analysis (which I also call model fitting) component of the data analysis process. A general conceptual understanding is useful before we jump into looking at and applying specific modeling/statistical approaches.
So far in this course, you briefly encountered a few simple statistical models in some of the readings and exercises (e.g., applying a linear model), but we didn’t focus on it. Instead, we looked at a lot of other components that are important for a full data analysis project but are less often taught. You might have come across the statement (e.g., in R4DS) that 80% of the time spent on data analysis is in the non-modeling/non-statistical parts. From my experience, that is true. While the statistical modeling/analysis part is certainly important, it often takes a fairly small amount of the whole project time. And while it is certainly possible to use the wrong statistical model, it seems to me that the most important and consequential mistakes that might invalidate an analysis do not happen in the modeling part. Sure, people often use the wrong statistical method, but that has - in my experience - often surprisingly (and thankfully!) little impact on the results. Not properly cleaning and processing data (e.g., not realizing that weight is reported in pounds and not kilograms when computing BMI) often has a much bigger impact on results.
No matter what statistical analysis you end up using, you will need to do the steps of getting and cleaning/processing/wrangling the data. During that process, you also explore your data, e.g., through plots and tables. Doing those steps efficiently and accurately is crucial. After you are done with the preliminary steps of getting/cleaning/wrangling data, you can move on to your main goal, fitting models to data. Once you reach the stage where you are ready to fit models, most of the hard work is done.
Once you reach the state at which your data is ready for statistical analysis, you should have a good idea of the types of models that might be appropriate. The choice of model is partly driven by the data, and partly by the kind of question you want to answer. There are several reasons why you might want to apply statistical models to your data, we’ll discuss them briefly.
There are several reasons why we might want to fit models. The following is a brief list. The next section goes into more detail.
Depending on the objective, you will likely be using different statistical approaches. Note however that there is no clear mapping. For instance you can use a linear model to come up with hypotheses (i.e., do an exploratory analysis), to test hypotheses (e.g., in a clinical trial), to estimate parameters, or make predictions. A lot of models can be used for different purposes. This might be initially confusing. Just keep in mind that while some models are better for some purposes (e.g. a complex machine learning or artificial intelligence model might be good for predictions, but bad for causal conclusions), most models can be applied to more than one type of objective.
The following way of categorizing types of data analyses follows The Art of Data Science (ADS) – specifically, chapter 3. Now might be a good time to give that chapter another quick re-read. A very similar, shortened version of the ADA chapter 3 discussion can be found in this article by Jeff Leek and Roger Peng. The following figure from their paper gives a good summary of what follows.
The most straightforward analysis is a descriptive one. At that stage, you process, summarize and present the data, and do not go further. You don’t need to fit any statistical models. A lot of data collected by surveillance systems or government agencies falls into the descriptive category. For most scientific projects, we often start with a descriptive presentation. E.g. Table 1 in a study often describes and summarizes the data source. Note that some authors have the bad habit of including model fitting quantities, such as p-values, in a descriptive table. Measures that involve fitting a model (such as p-values), go beyond a descriptive analysis and should therefore generally not be in a descriptive table.
Sometimes, a descriptive study is interesting and sufficient by itself. But often, we then want to go beyond the descriptive presentation of data. The most common analysis approach is associative. Here, we are looking for associations (i.e., patterns) in the data. We are interested in seeing if patterns exist (e.g., if there is a correlation between age and speed of solving mathematical problems) and what the shape of the pattern is (e.g., linearly increasing/decreasing or non-monotone). This is also called correlation analysis.
Depending on the way we came up with our question and the data, the results from such an associative analysis can be interpreted as an exploratory or hypothesis-generating approach, or an inferential or hypothesis-supporting approach. In general, if you asked the question/posed the hypothesis first, then went out and collected the data and analyzed it to see if your hypothesis holds, you can interpret your findings as supporting or refuting your hypothesis. If you had data that were not collected to answer your question specifically, and you analyzed the data to see if you can find some interesting patterns, then your findings should be interpreted as hypothesis-generating.
For both exploratory and inferential settings, you are usually interested in understanding how specific inputs/predictors affect the outcome(s) of interest. For that reason, you generally want to keep your models fairly simple and easy to interpret.
Essentially all standard statistical tests and approaches you are likely already familiar with (e.g., various statistical tests like t-tests, simple linear regression, or simple classification such as logistic regression) fall into these categories of associative, exploratory, or inferential. We will cover some of those modeling approaches in future units.
Often, we would like to go from association to causation, i.e., we would like to say that not only does an increase in X correlate with an increase in Y, but that X causes that increase in Y. There are two ways of doing so. One is to collect the data in the right way, namely using a randomized clinical trial or equivalent lab science approach where all contributing factors but the input of interest, X, are controlled. This way, we can say that a change in X directly causes a change in Y. If the data does not come from such a study design, methods of causal inference (which we won’t discuss in this course) can sometimes help in trying to determine causality.
Classical statistical models can get us as far as determining potential causal relations. If we want to go even further and not only try to answer if X causes Y, but how X causes Y, we will need to employ studies or models that are mechanistic. Such models explicitly include postulated mechanisms, and by comparing such models to data, one can often determine which mechanisms are more plausible. Mechanistic models are also beyond what we cover in this course (if you are interested in those, I teach two courses on mechanistic modeling in infectious diseases 😃).
While the main goal of science is generally understanding a system as well as possible, outside of science, other goals are often more important. In applied/industry/commerce settings, one often does not care too much if or how exactly certain inputs influence outcomes of interest. Instead, the main purpose is to try and predict future outcomes given a set of inputs. In this case, the interpretability of your model is not that important. Instead, a predictive modeling framework is more important. That’s where complex models, such as machine learning and artificial intelligence (neural net) approaches come into play. Those models are too complex to allow much interpretation and understanding of the system, but often these kinds of models are very good at prediction and real world performance (e.g., differentiating cats from dogs in images).
An example of an inferential analysis might be the question which, if any, immunological markers (e.g., cytokines in blood) are most influential for a given clinical outcome. Building a simple model here, e.g. a linear model if the outcome is continuous (e.g., blood sugar level) or a logistic model if the outcome is binary (e.g., heart attack within 5 years or not), allows us to quickly and transparently investigate how each variable in our model affects the outcome and which variables (immunological markers) might be important to study further.
A good example of a prediction model is the monitoring of credit cards by the issuing companies, who try to predict fraudulent transactions. To that end, they feed all the data they can get about you into their models, and if something happens that is unusual, you might get flagged, and your card denied, or you will have to call to confirm. In this case, the interest is not too much on how exactly all these data points about you and your behavior lead to the prediction of legitimate vs. fraudulent, only that the accuracy of those predictions is high. Because of this, in situations where prediction is important, models tend to be more complex, and one is willing to trade simplicity and interpretability of a model for increased predictive performance.
While the goal of the analysis will guide you toward choosing a general type of modeling approach, the data usually dictate in more detail what kinds of models are suitable. The main determinant of the model type to use is the outcome(s) one wants to analyze.
First, is there even an outcome? While the majority of datasets have an outcome(s) of interest, that is not always the case. Data without a clear outcome are sometimes called unlabeled. For instance, we might have collections of images of cell types, and our question is if these images cluster into specific types of cells - without knowledge of what those types might be. Another example is a scenario where we might have a large dataset of customers and lots of information about each customer. We might want to know if those customers can somehow be grouped based on the data we have about them (with the goal to design a marketing campaign directed at specific groups). Methods for those tasks are generally called clustering methods or unsupervised learning methods. Examples are k-means clustering, principal component analysis, and neural networks. (Note that some of these methods can also be used with data that include outcomes.)
The more common data structure is one with a specific outcome(s) of interest. This is also referred to as labeled data. Since labeled data is the most common, we focus on it in this course. In this case, we use approaches referred to as supervised learning methods. Those can be further divided based on the type of outcome. If the outcome is continuous (or can be treated as such), we use regression approaches, or if the outcome is categorical, we use classification approaches.
You are likely already familiar with some of these approaches. Most basic statistical tests are simple models for regression or classification, i.e., they try to detect patterns in data with quantitative or categorical outcomes. Some other statistical methods are generalized linear models (which include the basic linear and logistic models), generalized additive models, trees, support vector machines, neural nets, k-nearest neighbors, linear discriminant analysis, and a lot of further methods, many of which are variants of others. Some, but not all, of the more complex methods can be applied to both quantitative and categorical outcomes. We will cover a few of these methods later in the course.
Note that there is, unfortunately, no one method that is universally best for all data/questions. Both the type of question and the details of the data will influence the model choices. Often, there are several models that would be reasonable choices for a given setting, and in such instances it is often worthwhile (and fairly easy) to explore multiple alternative models.
For some additional (as well as overlapping) information to what I wrote, read Chapters 1 and 2.1. An Introduction to Statistical Learning (ILS). You don’t need to work through it in detail and can skip over the math bits if you want to. But do try to get an overall idea of the concepts these chapters are trying to convey. Chapter 1 of HMLR is another good source that you should skim through. Again, try to get the main points (which will of course overlap with the text above and the other readings).
This recent paper A practical guide to selecting models for exploration, inference, and prediction in ecology provides a nice discussion of different modeling goals and what approaches one should use. We have not yet discussed some of these approaches, but will soon. You could skim through the paper now, then revisit later once we covered more of the topics discussed in there.