Overview

The field of data analysis has, in recent years, seen a lot of progress and changes. While Data Science or Data analysis was long synonymous with Statistics, in past years, other fields such as Computer Science, Engineering, and Business have all started to contribute toward the overall progress in data analytics. With a lot of old and new players engaged in this topic, a lot of new terminologies have emerged. Some of this terminology can be quite confusing, especially if different terms are used for the same concept. Throughout this course, I try to mention alternative names for any concept whenever I introduce it. The following is a brief discussion of the main names given to the task of analyzing data.

One of the hot – and also confusing – terms in recent years is Machine Learning (and the related terms of Data Mining, Deep Learning, and Artificial Intelligence), and how they relate to Statistics. There is a lot of debate about what exactly those different areas are. To get a bit of an idea of what people are talking about, skim through this blog post by Frank Harrell. As you can see from the post and the many other sources he cites, there is no real agreement on what exactly these terms mean.

In my view, it is not worth spending too much time trying to come up with a clear definition. But it’s good to have some frame of reference so when you see all these terms, you know what they mean. So here are my – arguably fuzzy, but hopefully still somewhat useful – thoughts on how to distinguish those topics. While making those distinctions can be at times useful, the reality is that the terminology is not clearly defined and all over the place.

For additional attempts at defining terms related to data analysis, see the Glossary page – and certainly feel free to contribute!

Statistics, Machine Learning and Artificial Intelligence

Statistics is the classical machinery driving data analysis. Depending on the type of data, many different approaches have been developed (parametric vs. non-parametric methods, longitudinal analysis, time-series analysis, and many more). Models are, in general, simple and interpretable, and the goal is to understand how inputs (predictors) relate to outcomes. Statistics was developed when data was sparse, computers didn’t exist, and mainly scientists interested in a deep understanding of their data used it. Because of this, statistical models tend to be simple and work well on small datasets. Most of classical statistics focuses on associative/exploratory/inferential analysis types.

In Machine Learning (ML), the models tend to be more complex, and the goal is mainly on getting a “powerful” model, i.e., a model that is good at prediction. Understanding how different inputs lead to different outcomes is of secondary importance. Data is often abundant, so more complex models can be used.

Artificial intelligence (AI) can be considered a type of machine learning. The types of complex models that dominate AI these days are generally based on neural nets. Neural nets sound fancy, and they are certainly quite complex, but technically speaking, you can think of them as a large collection of logistic regression type models combined together.

As mentioned before, the terminology is fuzzy. Thus, a fairly simple model like a linear or logistic regression model could be considered a “classical” statistical model, while a more complex support vector machine or random forest (we’ll visit those later) are generally considered machine learning models. However, the terminology is poorly-defined, and you will see pretty much any approach can be given any label. Often it makes sense to think of the two terms as describing more or less the same thing, and that is applying some kind of mathematical or computational model to gain insight from the data. For some similar ideas, and a bit more details, read this blog post.

Related Terms

The term Data Mining is often used interchangeably with Machine Learning. It might sometimes indicate a “fishing” approach of combing through data to look for patterns, without pre-defined hypotheses to be tested. As such, results from Data Mining explorations need to be confirmed with independent data. Exploratory Data Analysis or Secondary Data Analysis mean pretty much the same thing, though the latter are often done using smaller (Statistics) models as opposed to larger (Machine Learning) models. Again, terminology and use of the words is fuzzy.

The term Deep Learning is generally used when one class of machine learning models, namely neural nets, are applied to a data analysis problem. The “deep” part comes from the fact that the neural net models often have many stacked layers (it has nothing to do with deep as in especially insightful).

The use of deep learning and related approaches applied to “complex” problems is often labeled Artificial Intelligence.

Neural nets (and similar complex methods) usually need a lot of data to perform well. Thus, the term Big Data often shows up together with these other terms.

In general, there is currently a lot of hype around these topics, and people – especially in the business, but also the research community – use the words Deep Learning and Artificial Intelligence quite liberally, even if all they do is fit a linear model to data.

We’ll discuss these topics a bit more at the end of the course, but since most scientific inquiry focuses on understanding patterns (i.e., questions tend to be inferential/causal/mechanistic) and amounts of data – while growing – still tend to be on the small side. So we won’t spend much time on it.

Terminology abuse

In the last several years, there has been a tremendous interest among industry in anything related to Data Science. Most notable, the huge successes of AI in some areas and for some companies (Google, Amazon, Uber,…) has meant that suddenly everyone wants AI and everyone sells AI, with a lot of players not having a clue what it is and can/can’t do. This has led to a lot of marketing hype and another host of acronyms. (For instance, I was recently at an online industry workshop with the title (quoting from memory) “From AI over BI to CI” - which apparently stand for Artificial Intelligence, Business Intelligence and Continuous Intelligence.) Most often, these terms can mean whatever you want them to mean. So some companies who have been doing data analysis by running simple linear or logistic regression models are now calling what they are doing AI. In the end, the only way to know what anyone is actually doing is to look at the detailed description of their methods. And if those are not provided, be very skeptical. Most of all, don’t be unduly impressed by all these big words. There is good data analysis, and there is bad data analysis, and just because someone uses a deep neural net AI model doesn’t mean what they are doing is any good. Critical and careful thinking about your data and the question you want to answer always trumps any new fancy modeling approach.

Statistical Modeling Terminology

Andreas Handel

2023-01-29 13:37:02.257801

Overview

Statistics, Machine Learning and Artificial Intelligence

Terminology abuse

Further reading