Fitting (Simple) Statistical Models in R

Author

Andreas Handel

Modified

2024-03-20

Overview

In this unit, we will discuss common approaches and packages that are useful for fitting statistical models in R.

Learning Objectives

Learn about different packages in R that allow model fitting.

Introduction

R has a few statistical model fitting routines built in, e.g., the lm() and glm() functions. Those are well tested and reliable, but they often do not allow for fitting more advanced models. For those, there are many (100+) R packages that provide additional functionality. The variety can be bewildering, and it is often hard to decide which package to use. Also, many packages do things slightly differently, which can be confusing and can make coding tedious. At some point in your analysis career, you’ll likely have to interact with packages specific to your data and questions and learn their syntax. To make things easier, especially when starting out, several groups have created packages that provide a unified approach to fitting many different models. We will focus on those packages here.
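As a quick illustration of the built-in routines, here is a minimal sketch using lm() and glm(). The mtcars dataset ships with R; the choice of outcome and predictors is an assumption made purely for illustration.

```r
# Base R only, no extra packages needed.
# The variables used here are illustrative, not a recommended model.

# Linear regression: fuel efficiency as a function of weight and horsepower
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(fit_lm))

# Logistic regression: transmission type (am: 0 = automatic, 1 = manual)
# as a function of weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(fit_glm))

# Both fitted objects work with the usual generics such as predict()
pred_lm <- predict(fit_lm, newdata = mtcars)
pred_glm <- predict(fit_glm, newdata = mtcars, type = "response")  # probabilities
```

Note that the two functions already differ in small ways (e.g., the family argument and the type of prediction), which hints at why a unified interface is convenient.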

The `tidymodels` set of packages

You’ve learned about the tidyverse already. A more recent effort by the folks from RStudio, together with many other contributors, is a set of packages called tidymodels. The idea is that, just as the tidyverse provides packages for data wrangling, tidymodels provides a set of packages that help you with the code related to fitting models. For this course, we will focus on the tidymodels set of packages.

You can use tidymodels for pretty much any part of the modeling workflow (e.g., pre-processing, model evaluation, tuning). We have not yet covered most of those steps, but will do so shortly. The goal for this unit is to start exploring the tidymodels workflow. For now, we’ll ignore a lot of the additional features and come back to them later.
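To give a first impression of what this looks like in practice, here is a minimal sketch of a basic tidymodels fit. It assumes the tidymodels package is installed and R 4.1 or newer (for the native pipe); the dataset and formula are illustrative choices, not part of the course material.

```r
library(tidymodels)  # attaches parsnip, yardstick, broom, and related packages

# 1. Specify the model type and the engine (the underlying fitting function)
lm_spec <- linear_reg() |>
  set_engine("lm")

# 2. Fit the specification to data
lm_fit <- lm_spec |>
  fit(mpg ~ wt + hp, data = mtcars)

# 3. Inspect the estimates as a tidy tibble
lm_estimates <- tidy(lm_fit)
print(lm_estimates)

# 4. Predict; the result is a tibble with a .pred column
lm_pred <- predict(lm_fit, new_data = mtcars)
print(head(lm_pred))
```

The specification in step 1 is separate from the data, which is what makes it easy to swap in a different model or engine later.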

The tidymodels suite is a relatively recent addition to the R universe. One of the main people behind tidymodels, Max Kuhn, previously wrote the caret package. It was, and still is, a nice and comprehensive package, and I used it in previous versions of this course. However, at this point, all effort by Max and his team goes into tidymodels. Thus, for this iteration of the course, we will focus on those newer packages. If you ever end up working with caret, you might want to check out the caret chapter in IDS. Since caret has existed for a while, you can find all kinds of resources online as well.

Metrics with `tidymodels`

The yardstick package implements a lot of different metrics in the tidymodels framework. For details, see the yardstick package website and the Metric types vignette.

You can also define your own metrics, as described in this article on custom metrics.
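As a rough sketch of how the built-in metrics are used (this does not cover custom metrics), the example below computes a few common regression metrics with yardstick. The data frame and its column names `observed` and `predicted` are assumptions made for illustration.

```r
library(yardstick)

# Observed and predicted values from a simple base R fit (illustrative only)
fit <- lm(mpg ~ wt + hp, data = mtcars)
results <- data.frame(
  observed = mtcars$mpg,
  predicted = unname(predict(fit))
)

# A single metric: returns a tibble with .metric, .estimator, and .estimate
rmse_result <- rmse(results, truth = observed, estimate = predicted)
print(rmse_result)

# Several metrics at once via metric_set()
my_metrics <- metric_set(rmse, rsq, mae)
all_results <- my_metrics(results, truth = observed, estimate = predicted)
print(all_results)
```

All metric functions share the same truth/estimate interface, so swapping or adding metrics only requires changing the metric_set() call.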

Other comprehensive packages

While data exploration, cleaning, and wrangling in R are strongly dominated by the various tidyverse packages, the tidymodels packages do not (yet) dominate the fitting part as much. And it is always good to have options.

Another great set of packages for model fitting is Machine Learning in R (mlr3). The goal of the various packages that are part of mlr3 is similar to that of tidymodels. While there is overlap, each set of tools can do certain things the other cannot. For instance, in my experience, mlr3 has more options for parameter tuning, though tidymodels is catching up.

The main reason we focus on tidymodels in this course is that the coding style is very similar to the tidyverse coding style, e.g., heavy use of pipes. Thus, in my opinion, it is easier to learn. The mlr3 package has its own syntax. It is of course still R, but things look and operate quite a bit differently, which means one needs more time to get used to the code. Thus, to keep things as simple as possible on the coding side, we are not looking at mlr3 in this course. If, however, you ever end up trying to do a fitting/machine learning operation that you can’t do with the tidymodels set of packages, checking out mlr3 is certainly a good option.
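To make the difference in syntax concrete, here is a minimal sketch of an mlr3 fit. It assumes the mlr3 package (and rpart, which usually comes with R) is installed; the dataset, learner, and measure are illustrative choices only.

```r
library(mlr3)

# Define the task: the data plus the outcome to predict
task <- as_task_regr(mtcars, target = "mpg")

# Pick a learner; regr.rpart (a regression tree) ships with mlr3 itself
learner <- lrn("regr.rpart")

# Train and predict by calling methods on objects rather than using pipes
learner$train(task)
prediction <- learner$predict(task)

# Score the predictions with a measure
rmse_value <- prediction$score(msr("regr.rmse"))
print(rmse_value)
```

The object$method() style comes from the R6 class system that mlr3 is built on, which is the main reason the code looks different from tidyverse-style pipelines.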

Note that, just as tidymodels has caret as its predecessor, mlr3 had a predecessor called mlr (I don’t know if there ever was an mlr2 🤷). mlr still exists, but all new development occurs in mlr3.

Direct interaction with statistical fitting packages

The idea behind tidymodels and mlr3 is that you write code that lets you easily switch between underlying models and algorithms, without having to write separate code each time. That often works rather well. Occasionally, you might need direct access to and interaction with a package. Say you want to fit some mixed-effects/multilevel/hierarchical models using a package that is not yet supported by tidymodels or mlr3. In that case, you will have to write code using the syntax that specific package needs. You might still be able to use, say, tidymodels to do a lot of the processing before and after fitting. It is generally good to start with a framework that tries to make your life easier, such as tidymodels. Once you realize you can’t get what you need through those packages, you can add custom code.
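The switching idea can be sketched as follows with tidymodels. This assumes the tidymodels and rpart packages are installed; the models, engines, and formula are illustrative choices, and the last step shows one way to get at the underlying fitted object when package-specific code is needed.

```r
library(tidymodels)

# One model type, two engines: only the set_engine() call changes
spec_lm  <- linear_reg() |> set_engine("lm")
spec_glm <- linear_reg() |> set_engine("glm")

fit_lm  <- fit(spec_lm,  mpg ~ wt + hp, data = mtcars)
fit_glm <- fit(spec_glm, mpg ~ wt + hp, data = mtcars)

# Switching to a different model type is a similarly small change
spec_tree <- decision_tree(mode = "regression") |> set_engine("rpart")
fit_tree  <- fit(spec_tree, mpg ~ wt + hp, data = mtcars)

# The underlying fitted object stays accessible for package-specific code
raw_lm <- extract_fit_engine(fit_lm)
print(class(raw_lm))
```

Everything after the specification (fitting, predicting, evaluating) keeps the same form regardless of which model or engine is used, while extract_fit_engine() provides an escape hatch to the native object.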