*This page is work in progress. Feel free to contribute via a
GitHhub pull request.*

There are lots of technical terms related to data and its analysis. In this course, I try to explain them whenever they show up first. However, I figured it might also be useful to have one central place for quick reference. Terms are ordered alphabetically. This is how I define them, others might have different definitions - and so might I in the future, I try to change/update my understanding of things every so often đ.

Note that in general, there is no regulatory terminology agency, so everyone can use these words to mean whatever they want. While some are fairly well defined, other terms are often used and abused in ways outside the definitions I provide. Always look carefully at what is described to decide what exactly is going on.

Not surprisingly, Iâm not the 1st one with the idea of compiling a list of data science related terms and definitions. After starting this, I found another one by Frank Harrell, which you can find here. Iâm sure Iâll find and list more alternatives soon.

**Artificial Intelligence:** The use of **Deep
Learning** and related approaches applied to âcomplexâ problems.
Historically, it was used for trying to solve problems using modeling
approaches that mimic in a simplified form a brain (i.e.Â neural net
models). Currently this term has become *hot* and is used more
widely than it should.

**Big Data:** Any dataset that doesnât easily fit into
the memory of a regular computer (or cluster) and thus needs to be
analyzed using special tools. Alternatively, data that is so big that
doing analysis on it takes too long using standard tools (e.g.Â R on a
regular computer) and instead requires specical treatment. Of course
this also depends on the type of analysis, not only the type of data. As
computers keep getting faster and tools more flexible and integrated,
the label **big data** is a moving target.

**Binary Variable:** A categorical variable with 2
categories, e.g., yes/no, dead/alive, diseased/healthy.

**Categorical Variable:** A variable that can take on
discrete values. Those categories can be ordered (at which point it is
an **Ordinal variable** - see there) or not. Examples of
non-ordered variabales are hair color or ethnicity. No ordering is
possible. A special and common case of categorical variables are
**Binary variables**.

**Causal Modeling:** Analysis of data with the goal to
make causal inferences between variables (e.g.Â X (partially) caused
Y).

**Classification:** Analysis approaches for
**Categorical Variables**.

**Continuous Variable:** A variable that can take on any
numeric value. E.g. weight. Note that in practice the values are often
discrete, e.g.Â while age is in principle a contiuous variable, it is
usually expressed in some units (say years) and individual values are
reported in discrete units (e.g.Â whole years). For analysis purposes, we
still generally treat it as if it could have been any value.

**Data Mining:** Often used interchangeably with
**Machine Learning**. It might sometimes indicate
specifically a âfishingâ approach of combing through data to look for
patterns, without pre-defined hypotheses to be tested. I that sense it
is similar to **Exploratory Data Analysis** or
**Secondary Data Analysis**, though those two are often
done using smaller (**Statistical**) as opposed to larger
(**Machine Learning**) models.

**Deep Learning:** Generally applied to a specific class
of machine learning models, namely neural nets. The âdeepâ part comes
from the fact that the neural net models generally have multiple layers
stacked on top of it (it has nothing to do with deep as in especially
insightful).

**Dependent (variable):** An alternative name for
**Outcome**.

**Descriptive Analysis:** Describing and presenting data
in meaningful ways using tables and figures, without trying to perform
statistical modeling, i.e.Â without looking for
correlations/patterns.

**Exploratory Data Analysis:** Looking for patterns in
data, without a hypothesis specified before data collection (or at least
before looking at the data). Very useful, but any result needs to be
tested on independent data.

**Exposure (variable):** A name for a
**predictor** variable of particular interest. For
instance, if we wanted to study if the daily duration of exercise had an
impact on BMI. We would consider BMI our outcome, exercise duration our
exposure (our main predictor of interest), and any other variable we
record (e.g., a personâs age and gender) as other predictors (sometimes
called covariates). The term exposure is common in the biomedical and
related disciplines, not so much in other areas.

**Feature (variable):** An alternative name for
**Predictor** often used in the machine learning
literature.

**Independent variable:** An alternative name for
*predictor*, most often used in the statistical literature.

**Interval scale variable:** A numerical (quantitative)
variable for which taking differences makes sense. Unlike ratio scale
variables, the zero point for interval data may be arbitrary. Addition
and subtraction are meaningful for interval data, but in general, taking
ratios is less meaningful. For example, temperatures in Celsius are on
the interval scale. If the temperature was 70 degrees yesterday and 65
degrees today, it makes perfect sense to say âit is 5 degrees cooler
than it was yesterday.â But a day that is 70 degrees is not exactly
twice as warm than a day that is 35 degrees, so division does not make
sense here. There are also temperatures lower than 0 degrees, so the
zero point is arbitrary.

**Labeled data:** If we have data for which there is a
specific outcome of interest and we know it, it is called labeled data.
For instance, if we had a lot of pictures of tissue samples, and someone
had gone through them and labeled them as cancerous or not cancerous, it
is labeled data. Labeled data (the most common type) is usually analyzed
using **Supervised Learning/Analysis** approaches.

**Machine Learning:** An approach to data analysis that
tends to use more complex models. The goal is mainly to obtain a model
that is good at prediction. Understanding how different inputs lead to
different outcomes is of secondary importance. Data is often abundant,
so more complex models can be used. Often this term and
**Statistics/Statistical Learning** are used interchangably
(though some people try to distinguish them, see e.g.Â this blog post).

**Mechanistic Modeling:** Building and using models that
explicitly incorporate mechanisms and processes of the system under
study to understand *how* things interact and lead to specific
outcomes. Many models in the hard sciences are of this type. A common
way to formulate such models is with differential equations.

**Nominal variable:** A qualitative measurement that can
take on distinct categories, but there is no natural order to these
categories. For example, apples can be red, yellow, or green, but there
is no inherent way to order these colors, even if the data is
useful.

**Observation:** An observation is a recording of the
different variables for a single unit of analysis. Usually an
individual, e.g., a single human or animal, but it could also be a
picture or video, a county, a city, or whatever our level of observation
is. For each observation, values for the different *variables*
should be available. In R, it is most common that each observation is
stored as a row in a data frame.

**Ordinal variable:** A variable that can take on
distinct categories which can be ordered, but the difference between
levels might not allow for mathematical operations. For instance, if a
question asks a person to rank their level of a pain on a scale from
1-10, a 7 is clearly higher than a 6, and a 6 higher than a 5. But itâs
unclear if the difference between 5 and 6 is the same as 6 and 7.

**Outcome (variable):** The variable of main interest
for our analysis. This can be a single outcome (most common) or
multiple. For instance, the main outcome might be if an individual
survives or dies. Or it could be their BMI, or it could be if a given
picture contains a cat or not. Also called **response
(variable)** or **dependent (variable)**.

**Predictive Modeling:** Using models with the main goal
of predicting future outcomes, given a collection of predictors.

**Predictor (variable):** All variables that are not the
outcome, which we use to see if we can predict the outcome. For
instance, if we wanted to predict the price of houses, we could use the
square footage of each house and the school district as the
predictors.

**Ratio scale variable:** Ratio data are a type of
quantitative data that are continuous, have consistent differences, and
have a true zero. Addition/subtraction and multiplication/division are
all meaningful for ratio data. For example, distance is ratio scale
because the difference between two distances is meaningful, and a
distance of 0 is truly the lowest possible distance value. Ratios are
meaningful for distance: a distance of 4 miles is exactly twice as long
as a distance of 2 miles.

**Regression:** A type of supervised learning/modeling,
where the outcome of interest is quantitative (or can be treated as
such).

**Response (variable):** See **Outcome
(variable).**

**Secondary Data Analysis:** Analysis of a dataset that
was not specifically collected for the purpose of answering the question
one wants to answer. With an increasing abundance and rountine
collection of data, such secondary analyses are becoming very common. If
a clear hypothesis is formulated before one looks at the data, one might
consider any results *confirmatory*, otherwise it is an
**Exploratory Analysis**. Even just looking at the data a
little bit during the cleaning process should move such a secondary
analysis into the exploratory category.

**Statistics:** The basic/classical machinery for data
analysis. Depending on the type of data, many different approaches have
been developed (parametric vs.Â non-parametric methods, longitudinal
analysis, time-series analysis, and many more). Models tend to be simple
and interpretable, the goal is to understand how inputs (predictors)
relate to outcomes. Statistics was developed when data was sparse,
computers didnât exist, and mainly scientists interested in a deep
understanding of their data used it. Because of this, statistical models
tend to be simple and work well on small datasets.

**Statistical Learning:** A term that seems to become
more widely used in recent years. While some people distinguish this
term from **Statistics** and consider it a sub-field (see
e.g.Â Chapter 1 of *Introduction
to Statistical Learning*), the two terms are often used
interchangably.

**Supervised Learning/Analysis:** Fitting
**Labeled data**. The two types of supervised learning are
**Regression** and **Classification**.

**Unlabeled data:** If we have data for which there is
either no specific outcome variable of interest or we do not know it, it
is called unlabelled data. For instance, if we had a lot of pictures of
tissue samples, and we knew that some showed cancerous tissue and others
not, but we didnât know which are which, it is unlabeled data.
Similarly, if we had pictures of different tissue samples (or say a
number of gene sequences) and all we wanted to know is if some samples
are more related to each other than others, but there is no main
outcome, it is considered unlabeled data. Unlabeled data is usually
analyzed using *unsupervised* analysis approaches.

**Variable:** Any quantity that we record like height,
weight, income, or species type. In R, it is most common, that each
variable is stored as a column in a data frame. The column name should
be the name of the variable.

**Variable:** Any quantity that we record like height,
weight, income, or species type. In R, it is most common, that each
variable is stored as a column in a data frame. The column name should
be the name of the variable.