In this unit, we will cover the concept of parameter/model tuning and training.

- Learn what parameter tuning/model training are.
- Be able to implement training/tuning in R.

We discussed at length that one (but not the only) goal of model
fitting is to find a model that has **good performance when
applied to new/different data.**

For linear, logistic, and similar models (generalized linear models), we discussed the idea of subset/feature selection that can help determine a good model. For different sub-models, we fitted the model to some of the data and evaluated model performance on another part of the data using cross-validation. The model with the best cross-validated performance across all sub-models is then designated as the best (at least with regard to our chosen performance metric).

We then discussed regularization, which tries to solve a problem
similar to subset selection, namely preventing a model that is too
complex and thus **overfits**. In the regularization
approach, one does not compare sub-models with different predictor
variables. Instead, all predictors are present, and one (or two)
parameters (the regularization or penalty parameters, which we called
\(\lambda\) and \(\alpha\)) are varied to influence model
complexity. For each value of that parameter, the model is evaluated
through cross-validation, and the \(\lambda\) which produces the model with the
best performance is chosen (or sometimes one picks a somewhat larger
\(\lambda\) to further prevent
potential overfitting).

This approach of taking a model parameter and evaluating models for
different parameter values is called **model/parameter
tuning**. These model-specific tuning parameters are often also
referred to as **hyperparameters**. Simple models, like
linear or logistic regression, do not have any parameters that can be
tuned. However, more complicated models, which we will discuss soon,
generally have one or more tuning parameters. Very flexible models, such
as neural nets used in artificial intelligence tasks, can have thousands
or even more parameters that need tuning. For any models with tuning
parameters, it is essential to **tune the parameters/train the
model.** Without it, the model will likely not perform very
well.

Training the model by tuning its parameters follows a general approach that is conceptually the same for all models. You need to go through these steps:

- Select some values for your tuning parameters.
- Using cross-validation, fit the model to a part of the data (the *analysis* portion) and evaluate model performance on the remainder of the data (the *assessment* portion).
- Select new values for your tuning parameters and repeat the previous step.
- Keep going until you hit some stopping criterion, e.g., you tried all parameter combinations you wanted to try, you hit the maximum number of tries, or you hit the maximum amount of time you allocated for this parameter tuning. (Or you found the absolute best tuning parameter values, but that’s only likely for simple models with few tuning parameters).
- Pick your best model as the one with the parameter values that produced the overall best model performance.

Your final model consists of both the type of model and the values of the parameters.
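
To make these steps concrete, below is a minimal sketch of the workflow using `tidymodels`. It assumes a hypothetical data frame `mydata` with a binary factor outcome `y`, and it tunes the penalty \(\lambda\) of a lasso logistic regression; the specific model, grid size, and metric are just illustrative choices.

```r
library(tidymodels)

# Model specification with the tuning parameter marked via tune()
# (here: the lasso penalty lambda of a logistic regression, glmnet engine)
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

wf <- workflow() |>
  add_model(lasso_spec) |>
  add_formula(y ~ .)

# Cross-validation folds: each fold has an analysis and an assessment portion
folds <- vfold_cv(mydata, v = 5)

# Step 1: select candidate values for the tuning parameter
lambda_grid <- grid_regular(penalty(), levels = 30)

# Steps 2-3: fit and evaluate the model for every candidate value on every fold
tune_res <- tune_grid(wf, resamples = folds, grid = lambda_grid)

# Final step: pick the parameter value with the best cross-validated performance
best_lambda <- select_best(tune_res, metric = "roc_auc")
final_wf    <- finalize_workflow(wf, best_lambda)
```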

You might (or might not) have wondered whether this procedure of
repeatedly refitting the model for different values of the
**tuning parameters (hyperparameters)** and re-fitting the
model for different values of the **model parameters**
(e.g., the coefficients \(b_i\) of
a linear or generalized linear model) isn’t more or less the same thing. Both
conceptually and in practice, those approaches are quite similar. For
GLM, there are differences in how things happen, namely the \(b_i\) can be determined in a single step,
without the need for trial and error. But for other models, such as some
of the ML models we’ll explore, tuning parameters and internal
parameters associated with the model might both need to be determined by
iterative procedures. Thus, while one can try to distinguish between
model and tuning parameters (see e.g., this
blog post, which discusses that point a bit more), this is often
fuzzy. And the usual caveat applies: The terminology is not consistent,
and what some people might call a model parameter might be called a
tuning parameter by others. The good news is that in practice it doesn’t
matter much what you call a specific parameter. Some can be tuned, and
you can choose to do so (or keep them fixed); others cannot be tuned and
are determined internally.

The problem of trying a lot of different tuning parameter values to
find the ones that lead to the best performance is very similar to the
problem of trying to test a lot of different sub-models during subset
selection to find the best model. Not surprisingly then, the procedures
to perform the search over parameter space are similar to the ones one
can use to search over subset/sub-model space. The most basic one for
subset selection was to try every possible model (exhaustive search).
This works in principle for tuning parameters as well, but only if the
tuning parameters are discrete. For continuous tuning parameters (e.g.,
\(\lambda\) in regularization), it is
impossible to try *all* values. One instead chooses discrete
values for the parameter between some lower and upper limits and then
searches that grid of parameters. This is called **grid
search**. The advantage of a grid search is that you know you
tried every combination of parameters in your grid.

For instance, say you have 2 continuous tuning parameters and 1 categorical tuning parameter with 3 categories. If you chose 10 discrete values for each of the 2 continuous parameters, your grid would be 10x10x3, and you would need to evaluate the model for 300 different tuning parameter combinations. That’s not too bad. But you can see that one problem with this approach is that as the number of tuning parameters increases, or if you want to try many different discrete values (e.g., 100 instead of 10 for each continuous parameter), the number of times you need to run the model increases rapidly. That’s the same problem as with the exhaustive search for subset selection.
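
As a quick illustration of how such a grid is built and how fast it grows, here is a sketch in base R with hypothetical parameter names; the 10x10x3 grid from the example above already has 300 rows, each of which requires a full cross-validated model fit.

```r
# Hypothetical tuning parameters: two continuous, one categorical (3 levels)
param_grid <- expand.grid(
  penalty       = 10^seq(-4, 0, length.out = 10),   # 10 candidate values
  mixture       = seq(0, 1, length.out = 10),       # 10 candidate values
  weight_scheme = c("none", "linear", "quadratic")  # 3 categories
)
nrow(param_grid)  # 300 parameter combinations to evaluate
```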

To solve this problem, there are a few major methods:

- One can efficiently choose parameter values to estimate: instead of searching the entire grid, you can use a sampling method to find a smaller grid that covers the same space in the most efficient way. The most well-known algorithm for this is probably *Latin hypercube sampling* (see the sketch below).
- Methods similar to those mentioned for subset selection can be applied, e.g., Genetic Algorithms, Simulated Annealing, Racing Methods, or many other optimizer routines.

While it is, in principle, possible to write your own code that implements the tuning procedure with whatever method you want to use, for most instances, it is easier to use pre-existing methods.
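
As a sketch of the first idea, the `dials` package (part of `tidymodels`) can draw a space-filling set of candidate values with Latin hypercube sampling instead of a full grid; the parameters and sample size below are again just illustrative choices.

```r
library(dials)

# Draw 30 space-filling candidate combinations instead of a full grid
lhs_grid <- grid_latin_hypercube(
  penalty(),   # regularization amount (continuous)
  mixture(),   # elastic net mixing parameter (continuous)
  size = 30
)
```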

`tidymodels`, and its `tune` package, currently have a few different algorithms for searching the tuning parameter space implemented. Grid search is the main one, but the package also implements *Iterative Bayesian optimization*. The `finetune` package, which is not yet on CRAN, implements *Simulated Annealing* and *Racing Methods*. To learn more about those, see chapters 12-14 in Tidy Modeling With R.
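
As a hedged sketch of what iterative search looks like in `tune`, the call below reuses the workflow and cross-validation folds from the earlier sketch and runs Bayesian optimization; the number of initial points and iterations are arbitrary illustrative choices.

```r
# Iterative Bayesian optimization of the tuning parameters
# (wf and folds as defined in the earlier tidymodels sketch)
bayes_res <- tune_bayes(
  wf,
  resamples = folds,
  initial   = 5,   # start from 5 initial parameter combinations
  iter      = 15   # then propose and evaluate 15 more candidates
)
```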

The `mlr3/mlr` package I keep mentioning also has algorithms to tune parameters, including some that are not available (yet) in `tidymodels`. For tuning in `mlr`, see e.g. here and here. For this course, we’ll focus on what’s available in `tidymodels`, but if you ever need to do some major parameter tuning/model optimization, checking out `mlr` might be worth it (or implementing your own with `tidymodels`, which is possible).

Only if your model has very few tuning parameters and your data is manageable in size can you find the absolute best parameter values in a reasonable amount of time. More likely, you’ll find parameter values that give you a close-to-optimal model.

At times, cross-validation might take too much time, and you might have to use a computationally faster method, such as AIC or similar, to try to estimate model performance on future data. That’s not ideal; you might want to consider other approaches first (fewer parameter evaluations, a faster computer, running things in parallel, as sketched below).
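
If you do stick with cross-validation, one of the simplest speedups is to run the resamples in parallel. Below is a sketch assuming the `doParallel` backend; how `tune` picks up parallel backends can differ between versions, so check the current documentation.

```r
library(doParallel)

# Register a parallel backend with 4 worker processes;
# tune_grid()/tune_bayes() can then fit the resamples in parallel
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

# ... run tune_grid() / tune_bayes() as usual ...

stopCluster(cl)  # release the workers when done
```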

Any process that adjusts the model repeatedly based on prior fits to
data has the danger of **overfitting**, even if you try to
guard against this using approaches such as CV. Thus, sometimes less
tuning might actually give you a more robust/generalizable model.

The more tuning parameters in your model, the more data you need to be able to train the model properly. If you have a mismatch between the amount of data and model complexity, you are likely going to overfit. This is why complex models such as neural nets need vast amounts of data (millions or billions of observations).

Most relevant and maybe good to visit next are chapters 12-14 in Tidy Modeling With R, which discuss the general tuning process and then explain how to do grid search and iterative search using `tidymodels`.

Section 2.5.3 of HMLR provides a very brief discussion of tuning. ISLR mentions tuning in various places but doesn’t describe it in a dedicated section. IDS mentions it in the *Machine Learning* chapters but also does not have a dedicated section on the topic.