# Overview

In this unit, we will cover the concept of regularization. We’ll also briefly mention a few related approaches.

# Learning Objectives

• Learn what regularization is and when to use it.
• Be able to implement regularization in R.

# Introduction

The standard subset selection approach you just learned about considers a specific variable to be either in the model or not.

Newer approaches, called “regularization” methods, can take an in-between stance. In general, regularization forces some (or all) coefficients in a regression model to be smaller than the normal estimates. A variable might be included, but it might be given less weight than other variables by reducing (regularizing) the coefficient in front of it. That’s the idea of regularization.

# Regularization

Regularization tries to solve the same problem as subset selection, namely preventing overfitting (and also underfitting). Instead of solving this by completely removing predictors (and with them model flexibility, which might lead to underfitting), it penalizes variables by giving them less influence on the outcome, thus regularizing model behavior (or, in technical language: making things less “wiggly” 😁).

It might be easiest to explain regularization with a specific example, so let’s consider a linear model. Note, however, that the regularization concept and approach is general and applies to many models beyond linear ones.

Our model is given by

$Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p.$

We might decide to minimize the sum of squared residuals (SSR) between model and data, i.e., we are minimizing a cost function

$C = SSR = \sum_i (Y_m^i - Y_d^i)^2,$

where $Y_m^i$ is the model’s prediction for observation $i$ and $Y_d^i$ is the corresponding data value.
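To make this concrete, here is a minimal sketch of the SSR computation (in Python with made-up numbers; the idea carries over directly to R), for a model with a single predictor:

```python
# Sum of squared residuals (SSR) for a one-predictor linear model.
# b0, b1 are the model coefficients; xs, ys are hypothetical data.
def ssr(b0, b1, xs, ys):
    # Y_m = b0 + b1 * x is the model prediction; Y_d = y is the data value.
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3]
ys = [2.1, 3.9, 6.2]
print(ssr(2.0, 0.0, xs, ys))  # an intercept-only guess fits these points poorly
print(ssr(0.0, 2.0, xs, ys))  # a line with slope 2 fits them much better
```

Minimizing the cost means searching for the coefficients that make this quantity as small as possible.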

Now, if we use regularization, we are going to instead minimize

$C = SSR + R(b_j),$

where the function $R$, called the “regularization term” or “regularizer,” is some function of the model parameters. Although you could (in theory) choose whatever function you want for $R$, there are three main ways to choose it, described next.

## Ridge regression

One way to choose the penalty function is to penalize each predictor by the square of its coefficient. This choice is called ridge regression (also known as L2 regularization, Tikhonov regularization, weight decay, and potentially lots of other names). It leads to the cost function:

$C = SSR + \lambda \sum_{j=1}^p b_j^2.$

The parameter $$\lambda$$ determines the balance between the goodness of fit (low SSR) and the penalty for having large coefficients. Instead of trying different subsets as in subset selection and picking the best based on cross-validated performance, we now try different values of $$\lambda$$ and pick the model with the lowest (cross-validated) value for our performance measure, C. The parameter $$\lambda$$ is often referred to as the tuning parameter or the penalty. Sometimes $$\lambda$$ is also called a hyperparameter of the model, which just means that its best value cannot be found by fitting the model a single time; it has to be chosen by comparing fits across multiple values.
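To see the trade-off in action, here is a small hypothetical sketch (in Python, with made-up numbers): two candidate fits, one with a slightly lower SSR but large coefficients, one with a higher SSR but small coefficients. As $$\lambda$$ grows, the ridge cost starts to favor the fit with small coefficients:

```python
# Ridge cost: SSR plus lambda times the sum of squared coefficients.
# (Hypothetical numbers; note the intercept b_0 is conventionally not penalized.)
def ridge_cost(ssr, coefs, lam):
    return ssr + lam * sum(b ** 2 for b in coefs)

fit_a = {"ssr": 10.0, "coefs": [4.0, -3.0]}  # better fit, big coefficients
fit_b = {"ssr": 12.0, "coefs": [1.0, 0.5]}   # worse fit, small coefficients

for lam in [0.0, 0.1, 1.0]:
    cost_a = ridge_cost(fit_a["ssr"], fit_a["coefs"], lam)
    cost_b = ridge_cost(fit_b["ssr"], fit_b["coefs"], lam)
    print(lam, "favors", "A" if cost_a < cost_b else "B")
```

With $$\lambda = 0$$ we are back to plain least squares; as $$\lambda$$ increases, large coefficients become increasingly expensive.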

## LASSO

An alternative is to penalize the coefficients by their absolute value, namely using this cost function:

$C = SSR + \lambda \sum_{j=1}^p |b_j|.$

This method is called L1 regularization or the Least Absolute Shrinkage and Selection Operator (LASSO). One nice feature of the LASSO (which ridge regression does not have) is that coefficients can be shrunk all the way to exactly 0. A coefficient of 0 means the predictor has been dropped from the model, similar to the subset selection approach described previously. One can therefore think of the LASSO as an efficient approach for performing subset selection. It is not quite equivalent, though, since in the LASSO the predictors that remain might have had their impact shrunk by the regularization penalty.
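One way to see why LASSO coefficients can hit exactly zero: in the special case of standardized, uncorrelated predictors, the LASSO solution is the ordinary least-squares coefficient pushed through a “soft-thresholding” function, which maps anything smaller in magnitude than the threshold to exactly 0. A minimal Python sketch (coefficient values and threshold are hypothetical):

```python
def soft_threshold(b, t):
    # Shrink b toward zero by t; anything within [-t, t] becomes exactly 0.
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0

ols_coefs = [2.5, -0.3, 0.1, -1.8]  # hypothetical least-squares estimates
lasso_coefs = [soft_threshold(b, 0.5) for b in ols_coefs]
print(lasso_coefs)  # small coefficients are dropped, large ones shrunk
```

The two middle coefficients end up at exactly 0 (those predictors are removed), while the large ones survive but are shrunk, which is the “not quite equivalent to subset selection” point made above.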

## Elastic net

One can also combine ridge regression and LASSO into an approach called elastic net, which has a cost function that is the combination of the previous two, namely:

$C = SSR + \lambda \left( (1-\alpha) \sum_{j=1}^p b_j^2 + \alpha \sum_{j=1}^p |b_j|\right).$

Now one needs to try different values for $$\lambda$$ (called the penalty parameter) and $$\alpha$$ (called the mixture parameter) to determine the model with the best (cross-validated) performance. $$\lambda$$ determines the overall weight given to the penalty, while $$\alpha$$ determines how that penalty is distributed between the two terms: $$\alpha = 0$$ recovers ridge regression and $$\alpha = 1$$ recovers the LASSO. There are also a few variants of this method, such as the relaxed elastic net or the adaptive elastic net, which you can look into if you are interested but which we won’t discuss here.
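The combined penalty is easy to write down directly. This hypothetical Python sketch shows how $$\alpha$$ interpolates between the two penalties, with the extremes reducing to pure ridge and pure LASSO:

```python
def enet_penalty(coefs, lam, alpha):
    # (1 - alpha) weights the ridge (squared) part, alpha the LASSO (absolute) part.
    l2 = sum(b ** 2 for b in coefs)
    l1 = sum(abs(b) for b in coefs)
    return lam * ((1 - alpha) * l2 + alpha * l1)

coefs = [2.0, -0.5]  # hypothetical coefficient values
print(enet_penalty(coefs, 1.0, 0.0))  # pure ridge penalty
print(enet_penalty(coefs, 1.0, 1.0))  # pure LASSO penalty
print(enet_penalty(coefs, 1.0, 0.5))  # an even mix of the two
```

In practice one would evaluate this penalty (plus the SSR) over a grid of $$\lambda$$ and $$\alpha$$ values rather than picking them by hand.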

## Tuning a regularization model

Depending on the kind of regularization model you fit, you have to determine one or two extra parameters ($$\lambda$$ and, for the elastic net, $$\alpha$$). These parameters are called tuning parameters (or hyperparameters), and this is the first time we encounter a model that has them. Most complex machine learning models have such tuning parameters, and determining good values for them is part of the model fitting/training process. We’ll talk about that in the next unit.
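As a preview, here is a deliberately simplified tuning loop in Python (made-up data, a single predictor with no intercept, and a plain train/validation split standing in for full cross-validation). For this special case the ridge coefficient has the closed form $b = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$, so we can fit the model for each candidate $$\lambda$$ and keep the one with the lowest validation SSR:

```python
# Hypothetical data (y is roughly equal to x), split into training and validation.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.2, 3.9]
val_x, val_y = [5.0, 6.0], [5.1, 5.9]

def fit_ridge_1d(xs, ys, lam):
    # Closed-form ridge solution for one predictor and no intercept.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def val_ssr(b, xs, ys):
    # SSR of the fitted model on held-out data.
    return sum((b * x - y) ** 2 for x, y in zip(xs, ys))

# Try several candidate penalties; the held-out SSR stands in for the
# cross-validated performance measure we would use in practice.
scores = {lam: val_ssr(fit_ridge_1d(train_x, train_y, lam), val_x, val_y)
          for lam in [0.0, 0.1, 1.0, 10.0]}
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```

Real tuning workflows use the same idea, just with proper cross-validation, a finer grid of candidate values, and (for the elastic net) a second loop over $$\alpha$$.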