Statistical Modeling Terminology

Author

Andreas Handel

Modified

2024-03-20

Overview

The field of data analysis has seen a lot of progress and change in recent years. While Data Science or Data Analysis was long synonymous with Statistics, other fields such as Computer Science, Engineering, and Business have more recently started to contribute to the overall progress in data analytics. With many old and new players engaged in this topic, a lot of new terminology has emerged. Some of this terminology can be quite confusing, especially when different terms are used for the same concept. Throughout this course, I try to mention alternative names for a concept whenever I introduce it. The following is a brief discussion of the main names given to the task of analyzing data.

One of the hot, and also confusing, terms in recent years is Machine Learning, together with the related terms Data Mining, Deep Learning, and Artificial Intelligence. Equally confusing is how these areas relate to Statistics. There is a lot of debate about what exactly those different areas are. To get a bit of an idea of what people are talking about, skim through Frank Harrell’s blog post. As you can see from the post and the many other sources he cites, there is no real agreement on what exactly these terms mean.

In my view, it is not worth spending too much time trying to come up with a clear definition. But it’s good to have some frame of reference, so that when you see these terms, you have a rough idea of what is likely meant. So here are my thoughts on how to distinguish those topics. They are arguably fuzzy, but hopefully still somewhat useful. While making those distinctions can be useful at times, the reality is that the terminology is not clearly defined and is used inconsistently.

For additional attempts at defining terms related to data analysis, see the Glossary page – and certainly feel free to contribute!

Statistics, Machine Learning and Artificial Intelligence

Statistics is the classical machinery driving data analysis. Depending on the type of data, many different approaches have been developed (parametric vs. non-parametric methods, longitudinal analysis, time-series analysis, and many more). Models are, in general, simple and interpretable, and the goal is to understand how inputs (predictors) relate to outcomes. Statistics was developed when data was scarce, computers didn’t exist, and the main users were scientists interested in a deep understanding of their data. Because of this, statistical models tend to work well on small datasets. Most of classical statistics focuses on associative, exploratory, and inferential types of analysis.
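As a rough illustration of what “simple and interpretable” can look like, here is a minimal sketch of a one-predictor linear regression fitted by ordinary least squares, using only the Python standard library. The data (hours studied and exam scores) and the function name are made up for this example and are not part of the course material.

```python
# Hypothetical toy data: hours studied (predictor) and exam score (outcome).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]


def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_linear_regression(hours, scores)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
# prints: intercept = 47.70, slope = 4.10
```

The whole model is two numbers, and each has a direct reading: in this toy dataset, every additional hour of study is associated with a score that is about 4.1 points higher.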

In Machine Learning (ML), the models tend to be more complex, and the focus is mainly on getting a “powerful” model, i.e., a model that is good at prediction. Understanding how different inputs lead to different outcomes is of secondary importance. Data is often abundant, so more complex models can be used.
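To make the prediction-first mindset a bit more concrete, here is a minimal sketch that judges a model only by how well it predicts data it has not seen. The k-nearest-neighbors method, the sine-shaped toy data, and all names below are assumptions chosen for illustration; nothing here is specific to this course.

```python
import math

# Hypothetical toy data: (predictor, outcome) pairs following a curved pattern.
data = [(i / 2, math.sin(i / 2)) for i in range(20)]

# Hold out every fourth point; the model never sees these during "training".
test = data[2::4]
train = [point for i, point in enumerate(data) if i % 4 != 2]


def knn_predict(train_points, x_new, k=2):
    """Predict the outcome at x_new as the mean outcome of the k nearest training points."""
    nearest = sorted(train_points, key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


def rmse(points, predict):
    """Root mean squared error of predict(x) over the given (x, y) points."""
    squared_errors = [(predict(x) - y) ** 2 for x, y in points]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Baseline for comparison: always predict the mean outcome of the training data.
mean_y = sum(y for _, y in train) / len(train)

knn_rmse = rmse(test, lambda x: knn_predict(train, x))
baseline_rmse = rmse(test, lambda x: mean_y)
print(f"kNN RMSE: {knn_rmse:.3f}, baseline RMSE: {baseline_rmse:.3f}")
```

Note that the k-nearest-neighbors model has no coefficients to interpret. It is evaluated purely by its error on the held-out points, which is the sense in which understanding takes a back seat to prediction.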

Artificial Intelligence (AI) can be considered a type of Machine Learning. The types of complex models that dominate AI these days are generally based on neural nets. Neural nets sound fancy, and they are certainly quite complex, but technically speaking, you can think of them as a large collection of logistic-regression-type models combined with each other.
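The idea of a neural net as a collection of logistic-regression-type pieces can be sketched in a few lines. In the example below, each unit computes a weighted sum of its inputs and passes it through the logistic function, just like a logistic regression model does. The network layout, the function names, and especially the weights are assumptions for illustration: the weights are hand-picked so that the network computes XOR, whereas a real neural net would learn them from data.

```python
import math


def logistic_unit(inputs, weights, bias):
    """One logistic-regression-type unit: weighted sum passed through the logistic function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def tiny_network(x1, x2):
    """Two hidden logistic units feeding one output logistic unit.

    The weights are hand-picked (not learned) so the output approximates XOR.
    """
    h1 = logistic_unit([x1, x2], [20.0, 20.0], -10.0)    # roughly "x1 OR x2"
    h2 = logistic_unit([x1, x2], [-20.0, -20.0], 30.0)   # roughly "NOT (x1 AND x2)"
    return logistic_unit([h1, h2], [20.0, 20.0], -30.0)  # roughly "h1 AND h2"


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(tiny_network(x1, x2), 3))
```

A single logistic unit cannot represent XOR, but three of them chained together can. Real neural nets use the same building block, just with many more units and layers.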

As mentioned before, the terminology is fuzzy. Thus, a fairly simple model like a linear or logistic regression model could be considered a “classical” statistical model, while more complex models such as a support vector machine or a random forest (we’ll visit those later) are generally considered machine learning models. However, the terminology is poorly defined, and you will see that pretty much any approach can be given any label. Often it makes sense to think of the two terms as describing more or less the same thing, namely applying some kind of mathematical or computational model to gain insight from the data. For some similar ideas, and a bit more detail, read Joshua Ebner’s blog post.

Terminology abuse

In the last several years, there has been tremendous interest in industry in anything related to Data Science. Most notably, the huge successes of AI in some areas and for some companies (Google, Amazon, Uber, …) have meant that suddenly everyone wants AI and everyone sells AI, with a lot of players not having a clue what it is and what it can and can’t do. This has led to a lot of marketing hype and another host of acronyms. For instance, I was recently at an online industry workshop with the title (quoting from memory) “From AI over BI to CI”, which apparently stands for Artificial Intelligence, Business Intelligence and Continuous Intelligence. Most often, these terms can mean whatever you want them to mean. So some companies that have been doing data analysis by running simple linear or logistic regression models are now calling what they are doing AI. In the end, the only way to know what anyone is actually doing is to look at the detailed description of their methods. And if those are not provided, be very skeptical. Most of all, don’t be unduly impressed by all these big words. There is good data analysis, and there is bad data analysis, and just because someone uses a deep neural net AI model doesn’t mean what they are doing is any good. Critical and careful thinking about your data and the question you want to answer always trumps any fancy new modeling approach.

Further reading

This course has an optional module on Deep Learning, AI and Big Data. It is a very brief introduction that you can check out if you are interested. For a broader introduction, the (non-free) book Artificial Intelligence: A Guide for Thinking Humans is a very good non-technical overview. And of course, topics like ML and AI are so “hot” these days that tons of free resources are available online. Just be sure to look at the writer’s credentials and think critically when reading online blogs.