In this unit, we will cover the concept of regularization. We’ll also briefly mention a few related approaches.

- Learn what regularization is and when to use it.
- Be able to implement regularization in R.

The standard subset selection approach you just learned about considers a specific variable to be either in the model or not.

Newer approaches, called “regularization” methods, can take an in-between stance. In general, regularization forces some (or all) coefficients in a regression model to be smaller than the normal estimates. A variable might be included, but it might be given less weight than other variables by reducing (regularizing) the coefficient in front of it. That’s the idea of regularization.

Regularization tries to solve the same problem as subset selection, namely finding the right balance between overfitting and underfitting. Instead of solving this by completely removing predictors (and with them, model flexibility), it penalizes variables by giving them less influence on the outcome, thus *regularizing* model behavior (or, in technical language: making things less “wiggly” 😁).

It might be easiest to explain regularization with a specific example, so let’s consider a linear model. Note, however, that the regularization concept and approach is general and applies to many models beyond linear ones.

Our model is given by

\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p.\]

We might decide to minimize the sum of squared residuals (SSR) between model predictions and data, i.e., we are minimizing the cost function

\[C = SSR=\sum_i (Y_m^i - Y_d^i)^2,\]

where \(Y_m^i\) is the model prediction for observation \(i\) and \(Y_d^i\) is the corresponding observed value.
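As a quick numeric illustration (with made-up numbers), the SSR cost is a one-liner in R:

```r
# Hypothetical observed outcomes (Y_d) and model predictions (Y_m)
y_data  <- c(2.1, 3.9, 6.2, 7.8)
y_model <- c(2.0, 4.0, 6.0, 8.0)

# Cost C = SSR = sum of squared differences between model and data
ssr <- sum((y_model - y_data)^2)
ssr  # 0.1
```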

Now, if we use regularization, we are going to instead minimize

\[C = SSR + R(b_j),\]

where the function *R*, called the “regularization term” or “regularizer,” is some function of the model parameters. Although you could (in theory) choose whatever function you want for *R*, there are 3 main ways to choose it, described next.

One way to penalize the predictors is to weigh each one by the square of its coefficient. Choosing the penalty term as the square of the coefficients is called *ridge regression* (AKA *L2 regularization*, *Tikhonov regularization*, *weight decay*, and potentially lots of other names). This leads to the cost function:

\[C = SSR + \lambda \sum_{j=1}^{p} b_j^2.\]

The parameter \(\lambda\) controls the balance between goodness of fit (low SSR) and the penalty for having large coefficients. Instead of trying different subsets as in subset selection and picking the best based on lowest CV performance, we now try different values of \(\lambda\) and pick the model with the lowest (cross-validated) value for our performance measure, *C*. The parameter \(\lambda\) is often referred to as the *tuning parameter* or the *penalty*. Sometimes \(\lambda\) is also called a **hyperparameter** of the model, which just means that its best value cannot be found by fitting the model a single time; instead, we have to compare models fit with different values of \(\lambda\).
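Since ridge regression has a closed-form solution, we can sketch the idea directly in base R. This is a minimal illustration with simulated data and made-up coefficients (in practice you would use a package such as glmnet rather than the matrix algebra below):

```r
set.seed(1)
n <- 100; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))   # standardized predictors
y <- X %*% c(3, -2, 1) + rnorm(n)        # hypothetical "true" coefficients
y <- y - mean(y)                         # center y so we can ignore the intercept

# Ridge coefficients have a closed form: b = (X'X + lambda*I)^(-1) X'y
ridge <- function(lambda) {
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

# As lambda grows, the coefficients shrink toward (but never exactly to) zero
round(cbind(ols = ridge(0), ridge_10 = ridge(10), ridge_100 = ridge(100)), 2)
```

Note how \(\lambda = 0\) recovers the ordinary least-squares fit, while larger values pull all coefficients toward zero.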

An alternative is to penalize the coefficients by their absolute value, namely using this cost function:

\[C = SSR + \lambda \sum_{j=1}^{p} |b_j|.\]

This method is called *L1 regularization* or the *Least Absolute Shrinkage and Selection Operator (LASSO)*. One nice feature of LASSO (which ridge regression does not have) is that coefficients may go to 0. That means the predictor has been dropped from the model, similar to the subset selection approach described previously. One can think of the LASSO as an efficient approach for performing subset selection. It is not quite equivalent though, since, in the LASSO, the predictors that remain might have been shrunk in their impact due to the regularization penalty.
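To see how the L1 penalty can push a coefficient exactly to zero, consider the simplest case of a single standardized predictor with \(\sum_i x_i^2 = 1\). A standard result is that minimizing \(SSR + \lambda |b|\) then has the closed-form “soft-thresholding” solution sketched below (the illustrative numbers are made up):

```r
# Soft-thresholding operator: shrink z toward 0 by t, clipping at exactly 0
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# For one standardized predictor, the LASSO minimizer of
#   C(b) = sum((y - x*b)^2) + lambda*|b|
# is b = soft_threshold(x'y, lambda/2), where x'y is the OLS coefficient.
b_ols <- 1.5  # hypothetical least-squares coefficient
sapply(c(0, 1, 2, 4), function(lambda) soft_threshold(b_ols, lambda / 2))
# lambda = 4 drives the coefficient to exactly 0, dropping the predictor
```

Ridge regression, in contrast, rescales coefficients smoothly, so they get small but never hit zero exactly.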

One can also combine ridge regression and LASSO into an approach called *elastic net*, which has a cost function that is the combination of the previous two, namely:

\[ C = SSR + \lambda \left( (1-\alpha) \sum_{j=1}^{p} b_j^2 + \alpha \sum_{j=1}^{p} |b_j|\right).\]

Now one needs to try different values for \(\lambda\) (called the *penalty* parameter) and \(\alpha\) (called the *mixture* parameter) to determine the model with the best (cross-validated) performance. \(\lambda\) determines the overall weight given to the penalty, while \(\alpha\) determines how the penalty is distributed between the 2 alternative terms: \(\alpha = 0\) recovers ridge regression and \(\alpha = 1\) recovers the LASSO. There are also a few variants of this method, such as the relaxed or adaptive elastic net, which you can look into if you are interested but which we won’t discuss here.
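To make the mixture explicit, here is a small base R function (the name `enet_penalty` and the numbers are purely illustrative) that computes the penalty term above for a given \(\lambda\) and \(\alpha\):

```r
# Elastic net penalty for a coefficient vector b (intercept excluded):
# lambda scales the overall penalty; alpha mixes L1 (lasso) and L2 (ridge).
enet_penalty <- function(b, lambda, alpha) {
  lambda * ((1 - alpha) * sum(b^2) + alpha * sum(abs(b)))
}

b <- c(2, -1, 0.5)
enet_penalty(b, lambda = 1, alpha = 0)    # pure ridge:  sum(b^2) = 5.25
enet_penalty(b, lambda = 1, alpha = 1)    # pure lasso:  sum(|b|) = 3.5
enet_penalty(b, lambda = 1, alpha = 0.5)  # 50/50 mix:   4.375
```

In practice you would not compute this by hand; R packages such as glmnet fit elastic net models parameterized by `alpha` and `lambda` (glmnet’s penalty differs from the formula above by a constant factor on the ridge term).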

Depending on the kind of regularization model you fit, you have to determine 1 or 2 extra parameters (\(\lambda\) and \(\alpha\)). These parameters are called **tuning parameters** (or **hyperparameters**), and this is the first time we have encountered a model that has them. Most complex machine learning models have such tuning parameters, and determining good values for them is part of the model fitting/training process. We’ll talk about that in the next unit.

If you are familiar with AIC, BIC or similar information criteria, you might have noticed that the cost function in regularization looks a bit like the equations for AIC or similar quantities. That is no accident. Both try to penalize the model for being overly complicated and thus have equations that contain terms for both model performance and model complexity while trying to find the model with the best balance.

If LASSO has the nice feature of potentially removing variables and thus making the model simpler, why ever use ridge regression or the elastic net? It turns out that for some problems, those other methods perform better. See section 6.2 of ISLR for more.

There is a lot of math behind the regularization concept. From an applied perspective, the focus is on understanding the overall idea and knowing how to implement these methods. For more on regularization, see section 6.2 of ISLR, chapter 6 of HMLR, and section 34.9 of IDS. I encourage you to check out the ISLR and HMLR readings and skim through them to get a better understanding of these widespread and powerful techniques.

The term *regularization* is broad, and the idea has been applied to many different types of models. For instance, in a Bayesian framework, the choice of informative priors regularizes the model and thus reduces the risk of overfitting. Similarly, in a hierarchical/multi-level statistical framework (Bayesian or frequentist), the structure imposed on the model leads to some amount of regularization. Unfortunately, those topics are outside of what we can cover in this class. If you want to learn more, I highly recommend the book Statistical Rethinking by McElreath. It is unfortunately not free and not cheap. But if you are interested in Bayesian analysis, this is the book I suggest you start with.

The main point is that if you see the word *regularization* in the literature, it might refer to the specific approaches discussed above, but it might also refer to other approaches related to the general concept.