The field of data analysis has seen a lot of progress and change in
recent years. While **Data Science** or
**Data Analysis** was long synonymous with
**Statistics**, in past years, other fields such as
**Computer Science**, **Engineering**, and
**Business** have all started to contribute toward the
overall progress in data analytics. With a lot of old and new players
engaged in this topic, a lot of new terminology has emerged. Some of
this terminology can be quite confusing, especially if different terms
are used for the same concept. Throughout this course, I try to mention
alternative names for any concept whenever I introduce it. The following
is a brief discussion of the main names given to the task of analyzing
data.

One of the hot, and also confusing, topics in recent years is
**Machine Learning** (and the related terms **Data
Mining**, **Deep Learning**, and **Artificial
Intelligence**), and how it relates to
**Statistics**. There is a lot of debate about what exactly
those different areas are. To get a bit of an idea of what people are
talking about, skim through this blog post by Frank
Harrell. As you can see from the post and the many other sources he
cites, there is no real agreement on what exactly these terms mean.

In my view, it is not worth spending too much time trying to come up with a clear definition. But it’s good to have some frame of reference, so that when you see all these terms, you have a rough idea of what they usually refer to. So here are my (arguably fuzzy, but hopefully still somewhat useful) thoughts on how to distinguish those topics. While making those distinctions can be useful at times, keep in mind that the terminology is not clearly defined and is used inconsistently.

For additional attempts at defining terms related to data analysis, see the Glossary page – and certainly feel free to contribute!

**Statistics** is the classical machinery driving data
analysis. Depending on the type of data, many different approaches have
been developed (parametric vs. non-parametric methods, longitudinal
analysis, time-series analysis, and many more). Models are, in general,
simple and interpretable, and the goal is to understand how inputs
(predictors) relate to outcomes. Statistics was developed when data was
sparse, computers didn’t exist, and its main users were scientists
interested in a deep understanding of their data. Because of this,
statistical models tend to be simple and work well on small datasets. Most of
classical statistics focuses on
**associative**/**exploratory**/**inferential**
analysis types.
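
To make the "simple and interpretable" point a bit more concrete, here is a minimal sketch of a classical one-predictor linear regression, written in plain Python. The data values and the helper name `fit_line` are made up for illustration and are not part of the course materials.

```python
# Made-up toy data for illustration only (not from the course).
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # predictor
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # outcome


def fit_line(x, y):
    """Ordinary least squares for a single predictor.

    Returns (intercept, slope). `fit_line` is a hypothetical helper name.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

The whole model is two numbers, and the slope (about 2 for this toy data) has a direct reading: a one-unit increase in the predictor goes along with an increase of roughly 2 in the outcome. That kind of direct interpretation is what is meant by understanding how inputs relate to outcomes.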

In **Machine Learning (ML)**, the models tend to be more
complex, and the focus is mainly on getting a “powerful” model, i.e., a
model that is good at prediction. Understanding how different inputs
lead to different outcomes is of secondary importance. Data is often
abundant, so more complex models can be used.

**Artificial intelligence (AI)**, as the term is mostly used these
days, can be considered a type of machine learning (although machine
learning is usually described as a subfield of AI). The types of
complex models that dominate AI these days are generally based on neural
nets. Neural nets sound fancy, and they are certainly quite complex, but
technically speaking, you can think of them as a large collection of
logistic-regression-type models combined with each other.
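
The "collection of logistic regression type models" idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function names (`logistic_unit`, `tiny_network`) and the weights are hypothetical and hand-picked, nothing is fitted to data, and real neural nets typically use other activation functions and many more units.

```python
import math


def logistic_unit(inputs, weights, bias):
    """One logistic-regression-style unit: a weighted sum passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def tiny_network(inputs, hidden_params, output_params):
    """Several logistic units whose outputs feed one final logistic unit."""
    hidden = [logistic_unit(inputs, w, b) for w, b in hidden_params]
    w_out, b_out = output_params
    return logistic_unit(hidden, w_out, b_out)


# Hypothetical, hand-picked weights (not estimated from any data).
hidden_params = [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0)]
output_params = ([2.0, 2.0], -2.0)

prediction = tiny_network([0.5, 0.5], hidden_params, output_params)
print(prediction)
```

Each call to `logistic_unit` has the same form as a logistic regression model. The network simply feeds the outputs of some units into another unit as inputs. What makes real neural nets complex is the number of such units and the work of estimating all their weights from data.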

As mentioned before, the terminology is fuzzy. A fairly simple model like a linear or logistic regression model is usually considered a “classical” statistical model, while more complex models such as support vector machines or random forests (we’ll visit those later) are generally considered machine learning models. However, the terminology is poorly defined, and you will see that pretty much any approach can be given any label. Often it makes sense to think of the two terms as describing more or less the same thing, namely applying some kind of mathematical or computational model to gain insight from the data. For some similar ideas, and a bit more detail, read this blog post.

In the last several years, there has been tremendous interest in
industry in anything related to **Data Science**. Most
notably, the huge successes of **AI** in some areas and for
some companies (Google, Amazon, Uber, …) have meant that suddenly everyone
wants AI and everyone sells AI, with a lot of players not having a clue
what it is and what it can and can’t do. This has led to a lot of marketing hype and
another host of acronyms. (For instance, I was recently at an online
industry workshop with the title (quoting from memory) “From AI over BI
to CI”, which apparently stands for Artificial Intelligence, Business
Intelligence, and Continuous Intelligence.) Most often, these terms can
mean whatever you want them to mean. So some companies that have been
doing data analysis by running simple linear or logistic regression
models are now calling what they are doing AI. In the end, the only way
to know what anyone is actually doing is to look at the detailed
description of their methods. And if those are not provided, be very
skeptical. Most of all, don’t be unduly impressed by all these *big
words*. There is good data analysis, and there is bad data analysis,
and just because someone uses a deep neural net AI model doesn’t mean
what they are doing is any good. Critical and careful thinking about
your data and the question you want to answer **always**
trumps any new fancy modeling approach.

This course has an optional module on *Deep Learning, AI and Big
Data*. This is a very brief introduction you can check out if you
are interested. For a broader introduction, the (non-free) book *Artificial Intelligence: A
Guide for Thinking Humans* is a very good non-technical
overview. And of course, topics like ML and AI are so “hot” these days that tons
of free resources are available online, though be sure to look at the
writer’s credentials and think critically when reading online blogs.