In this unit, we will cover the concept of regularization. We’ll also briefly mention a few related approaches.

- Learn what regularization is and when to use it.
- Be able to implement regularization in R.

The standard subset selection approach you just learned about considers a specific variable to be either in the model or not.

Newer approaches, called “regularization” methods, can take an in-between stance. In general, regularization forces some (or all) coefficients in a regression model to be smaller in magnitude than the estimates one would get without regularization (e.g., from ordinary least squares). A variable might be included, but it might be given less weight by shrinking (regularizing) the coefficient in front of it. That’s the idea of regularization.

Regularization tries to solve the same problem as subset selection,
namely preventing overfitting (without causing underfitting). Instead of
solving this by completely removing predictors (and thus model
flexibility, which might lead to underfitting), it penalizes predictors by
giving them less influence on the outcome, thus *regularizing*
model behavior (or, in technical language: making things less “wiggly”
😁).

It might be easiest to explain regularization with a specific example, so let’s consider a linear model. Note, however, that the regularization concept and approach are general and apply to many models beyond linear ones.

Our model is given by

\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_pX_p.\]

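We might decide to fit the model by minimizing the sum of squared residuals (SSR) between the model predictions \(Y_m^i\) and the data \(Y_d^i\) for each observation \(i\), i.e., we are minimizing a cost function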

\[C = SSR=\sum_i (Y_m^i - Y_d^i)^2.\]
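To make the cost function concrete, here is a minimal sketch in plain Python that computes the SSR for a linear model. The function names `predict` and `ssr` and the toy data are made up for illustration; they are not part of any package.

```python
def predict(b, x):
    """Linear model prediction b[0] + b[1]*x[0] + ... for one observation x."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))


def ssr(b, X, y):
    """Sum of squared residuals between model predictions and data."""
    return sum((predict(b, x) - yi) ** 2 for x, yi in zip(X, y))


# Hypothetical toy data: 3 observations, 2 predictors
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 7.0]
b = [1.0, 1.0, 1.0]  # b_0, b_1, b_2

print(ssr(b, X, y))  # residuals are -1, -1, -2, so this prints 6.0
```

Fitting the model without regularization means searching for the coefficients `b` that make this number as small as possible.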

Now, if we use regularization, we are going to instead minimize

\[C = SSR + R(b_j),\] where the
function *R*, called the “regularization term” or “regularizer”,
is some function of the model parameters. Although you could (in theory)
choose whatever function you want for *R*, there are 3 main ways
to choose it, described next.

One way to choose the function that penalizes the predictors is to
penalize each predictor by the square of its coefficient. Choosing
the penalty term as the sum of the squared coefficients is called *ridge
regression* (AKA *L2 regularization*, *Tikhonov
regularization*, *weight decay*, and potentially lots of
other names). This leads to the cost function:

\[C = SSR + \lambda \sum_{j=1}^p b_j^2.\]

The parameter \(\lambda\) decides
the balance between the goodness of fit (low SSR) and the penalty for
having large coefficients. With \(\lambda = 0\) we recover the unpenalized fit, and
larger values of \(\lambda\) shrink the coefficients more strongly toward 0.
Instead of trying different subsets as above
and picking the one with the best cross-validated performance, we now try
different values of \(\lambda\) and
pick the model with the best cross-validated value for our
performance measure (e.g., the SSR computed on the held-out data, without
the penalty term). The parameter \(\lambda\) is often referred to as the
*tuning parameter* or the *penalty*. Sometimes \(\lambda\) is also called a
**hyperparameter** of the model, which just means that its
best value is not estimated in a single model fit but has to be
found by comparing fits across different values of \(\lambda\).
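To see the shrinkage in action, consider the simplest possible case: a single predictor and no intercept. For that special case, setting the derivative of the ridge cost function to zero gives the closed-form solution \(b = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)\). The sketch below uses this formula; the function name `ridge_slope` and the data are made up for illustration.

```python
def ridge_slope(x, y, lam):
    """Minimizer of sum((y_i - b*x_i)^2) + lam * b^2 (one predictor, no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Hypothetical toy data with a true slope of roughly 2
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

# Larger penalties pull the estimated slope toward 0
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, round(ridge_slope(x, y, lam), 3))
```

With \(\lambda = 0\) this is the ordinary least squares slope. As \(\lambda\) grows, the slope gets smaller and smaller, but it never becomes exactly 0.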

An alternative is to penalize the coefficients by their absolute value, namely using this cost function:

\[C = SSR + \lambda \sum_{j=1}^p |b_j|.\]

This method is called *L1 regularization* or the *Least
Absolute Shrinkage and Selection Operator (LASSO)*. One nice feature
of LASSO (which ridge regression does not have) is that coefficients may
go to exactly 0. That means the predictor has been dropped from the model,
similar to the subset selection approach described previously. One can
think of the LASSO as an efficient approach for performing subset
selection. It is not quite equivalent though, since, in the LASSO, the
coefficients of the predictors that remain are also shrunk by the
regularization penalty.
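Why LASSO can set a coefficient to exactly 0 is easiest to see in the single-predictor, no-intercept case. There, minimizing \(SSR + \lambda |b|\) leads to a so-called soft-thresholding rule: the unpenalized quantity \(\sum_i x_i y_i\) is moved toward 0 by \(\lambda/2\), and set to 0 if it would cross 0. The sketch below implements that special case only; the function names and data are made up for illustration, and real LASSO software for multiple predictors uses iterative algorithms instead.

```python
def soft_threshold(z, t):
    """Move z toward 0 by t, and return exactly 0 if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_slope(x, y, lam):
    """Minimizer of sum((y_i - b*x_i)^2) + lam * |b| (one predictor, no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx


# Hypothetical toy data with a true slope of roughly 2
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

for lam in [0.0, 10.0, 50.0]:
    print(lam, round(lasso_slope(x, y, lam), 3))
```

For the largest penalty in this example, the slope is exactly 0, i.e., the predictor is dropped. For the intermediate penalty, the predictor stays in the model but with a reduced coefficient.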

One can also combine ridge regression and LASSO into an approach
called *elastic net*, which has a cost function that is the
combination of the previous two, namely:

\[ C = SSR + \lambda \left( (1-\alpha) \sum_{j=1}^p b_j^2 + \alpha \sum_{j=1}^p |b_j|\right).\]

Now one needs to try different values for \(\lambda\) (called the *penalty*
parameter) and \(\alpha\) (called the
*mixture* parameter) to determine the model with the best
(cross-validated) performance. \(\lambda\) determines the overall weight
given to the penalty term, while \(\alpha\) determines how the penalty is
distributed between the 2 alternative terms: \(\alpha = 0\) gives ridge
regression and \(\alpha = 1\) gives LASSO. There are also a few
variants of this method, such as relaxed elastic net or adaptive elastic
net, which you can look into if you are interested but which we won’t discuss
here.
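The elastic net penalty term from the cost function above can be written as a small function. This is only a sketch of the penalty as defined in this unit; the function name `elastic_net_penalty` and the coefficient values are made up, and software packages may scale the two terms slightly differently.

```python
def elastic_net_penalty(b, lam, alpha):
    """lam * ((1 - alpha) * sum(b_j^2) + alpha * sum(|b_j|)).

    b holds the coefficients b_1, ..., b_p (the intercept b_0 is left out).
    """
    ridge_part = sum(bj ** 2 for bj in b)
    lasso_part = sum(abs(bj) for bj in b)
    return lam * ((1.0 - alpha) * ridge_part + alpha * lasso_part)


coefs = [2.0, -1.0, 0.5]  # hypothetical coefficients b_1, b_2, b_3

print(elastic_net_penalty(coefs, lam=2.0, alpha=0.0))  # pure ridge penalty: 10.5
print(elastic_net_penalty(coefs, lam=2.0, alpha=1.0))  # pure LASSO penalty: 7.0
print(elastic_net_penalty(coefs, lam=2.0, alpha=0.5))  # equal mix of both: 8.75
```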

Depending on the kind of regularization model you fit, you have to
determine 1 or 2 extra parameters (\(\lambda\) and \(\alpha\)). These parameters are called
**tuning parameters** (or **hyperparameters**),
and these are the first models we have seen that have them. Most complex
machine learning models have such tuning parameters, and determining
good values for those is part of the model fitting/training process.
We’ll talk about that in the next unit.
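As a small preview of what tuning looks like, the sketch below tries a grid of \(\lambda\) values for a single-predictor ridge model and picks the one with the lowest cross-validated error. Everything here is a simplified illustration under stated assumptions: the function names, the data, the grid, and the way folds are formed (every third observation) are made up, and the held-out error is the plain mean squared error without the penalty term.

```python
def ridge_slope(x, y, lam):
    """Ridge estimate for one predictor without intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def cv_error(x, y, lam, k=3):
    """Mean squared error on held-out data, averaged over k folds."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th observation is held out
        x_train = [x[i] for i in range(n) if i not in test_idx]
        y_train = [y[i] for i in range(n) if i not in test_idx]
        b = ridge_slope(x_train, y_train, lam)  # fit on training data only
        total += sum((y[i] - b * x[i]) ** 2 for i in test_idx)
    return total / n


# Hypothetical toy data
x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y = [-3.6, -3.4, -1.7, -1.3, 0.3, 0.8, 2.4, 2.7, 4.3]

grid = [0.0, 0.1, 1.0, 10.0]  # candidate values for lambda
errors = {lam: cv_error(x, y, lam) for lam in grid}
best_lam = min(errors, key=errors.get)
print(best_lam, errors[best_lam])
```

Note that the model has to be refit for every value of \(\lambda\) and every fold, which is why tuning can become computationally expensive for larger models.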

If you are familiar with AIC, BIC or similar information criteria, you might have noticed that the cost function in regularization looks a bit like the equations for AIC or similar quantities. That is no accident. Both try to penalize the model for being overly complicated and thus have equations that contain terms for both model performance and model complexity while trying to find the model with the best balance.
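To make the comparison explicit, recall the standard definition of AIC,

\[\mathrm{AIC} = 2k - 2\ln(\hat{L}),\]

where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood. For a linear model with normally distributed errors and \(N\) observations, this can be written (up to an additive constant) as

\[\mathrm{AIC} = N \ln\left(\frac{SSR}{N}\right) + 2k.\]

As in the regularization cost function, the first term rewards a good fit (low SSR) and the second term penalizes complexity. The difference is that AIC penalizes the number of parameters, while regularization penalizes the size of the coefficients.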

If LASSO has the nice feature of potentially removing variables and thus making the model simpler, why ever use ridge regression or the elastic net? It turns out that for some problems, those other methods perform better. See section 6.2. of ISLR for more.

There is a lot of math behind the regularization concept. From an applied perspective, the focus is on understanding the overall idea and how to implement it. For more on regularization see section 6.2. of ISLR, chapter 6 of HMLR and section 34.9 of IDS. I encourage you to check out the ISLR and HMLR readings and skim through them to get a better understanding of these widespread and powerful techniques.

The term *regularization* is broad and the idea has been
applied to many different types of models. For instance, in a Bayesian
framework, the choice of informative priors regularizes the model and
thus reduces the risk of overfitting. Similarly, in a
hierarchical/multi-level statistical framework (Bayesian or
frequentist), the structure imposed on the model leads to some amount of
regularization. Unfortunately, those topics are outside of what we can
cover in this class. If you want to learn more, I highly recommend the
book *Statistical Rethinking* by McElreath. It is unfortunately not free and not cheap.
But if you are interested in Bayesian analysis, this is the book I
suggest you start with.

The main point is that if you see the word *regularization* in
the literature, it might refer to the specific approaches discussed
above, but it might also refer to other approaches related to the
general concept.