Fitting (Simple) Statistical Models in R
Overview
In this unit, we will discuss common approaches and packages that are useful for fitting statistical models in R.
Learning Objectives
- Learn about different packages in R that allow model fitting.
Introduction
R has a few statistical model fitting routines built in, e.g., the lm()
and glm()
functions. Those are well-tested and reliable, but often do not allow for fitting more advanced models. To fit more advanced models, there are many (100+) different R packages that provide additional functionality. The variety can be bewildering. Often it is hard to decide which package to use. Also, many packages do things slightly differently, which can be confusing and can make coding tedious. At some point in your analysis career, you’ll likely have to interact with packages specific to your data and questions and have to learn their syntax. Initially, to make things easier, several groups have undertaken efforts to create packages that allow a unified approach to fitting a lot of different models. We will focus on those packages here.
The tidymodels
set of packages
You’ve learned about the tidyverse
already. A more recent effort by the folks from RStudio – and many other contributors - is a set of packages called tidymodels
. The idea is that similar to various packages in the tidyverse
you can use for data wrangling, tidymodels
provides a set of packages that help you with the code related to fitting models. For this course, we will focus on the tidymodels
set of packages.
You can use tidymodels
for pretty much any part of the modeling workflow (e.g., pre-processing, model evaluation, tuning). We have not yet covered most of those steps, but will do so shortly. The goal for this unit is to start exploring the tidymodels
workflow – for now we’ll ignore a lot of the additional features, but we’ll discuss shortly.
The tidymodels
suite is a relatively recent addition to the R universe. One of the main persons behind tidymodels
, Max Kuhn previously wrote the package caret
. This was – and still is – a nice and comprehensive package. I used it in previous versions of this course. However, at this point, all effort by Max and his team is put into tidymodels
. Thus, for this iteration of the course, we will focus on those newer packages. If you ever end up working with the caret
package, you might want to check out the caret
chapter in IDS – caret
has existed for a while so you can find all kinds of resources online as well.
Metrics with tidymodels
The yardstick
package implements a lot of different metrics in the tidymodels
framework. For details, see the yardstick
package website and the Metric types vignette.
You can also define your own metrics, as described in this article on custom metrics.
Other comprehensive packages
While the whole data exploring, cleaning, and wrangling part in R is strongly dominated by the various tidyverse
packages, tidymodel
packages do not (yet) dominate the fitting part as much. And it is always good to have options.
Another great set of packages for model fitting is Machine Learning in R (mlr3
). The goal of the various packages which are part of mlr3
is similar to those of tidymodels
. While there is overlap, each set of tools can do certain things the other cannot do. For instance, in my experience, mlr3
has more options for parameter tuning, though tidymodels
is catching up.
The main reason why we focus on tidymodels
in this course is that the coding style is very similar to the tidyverse
coding style, e.g., heavy use of pipes. Thus, in my opinion it is easier to learn. The mlr3
package has its own syntax. It is of course still R, but things look and operate quite a bit differently, which means one needs more time to get used to the code. Thus, to keep things as simple as possible on the coding side, we are not looking at mlr3
in this course. If, however, you ever end up trying to do a fitting/machine learning operation that you can’t do with the tidymodels
set of packages, checking out mlr3
is certainly a good option.
Note that similar to tidymodels
and caret
, mlr3
had a predecessor called mlr
(I don’t know if there ever was mlr2
🤷). mlr
still exists, but all new development occurs in mlr3
.
Direct interaction with statistical fitting packages
The idea behind tidymodels
and mlr3
is that you write code that allows you to easily switch between the underlying model and algorithm you want to apply to your data, without having to write separate code each time. That often works rather well. Occasionally, you might need direct access to and interaction with a package. Say you want to fit some mixed-effects/multilevel/hierarchical models using a package that is not yet supported by tidymodels
or mlr3
. In that case, you will have to write code using the syntax your specific package needs. You might still be able to use, say, tidymodels
to do a lot of the processing before and after fitting. It is generally good to try and start with a framework that tries to make your life easier, such as tidymodels
. Once you realize you can’t get what you need through those packages, you can add custom code.
Further reading
One nice feature about tidymodels
is that the developers are placing a lot of emphasis not only on implementing new features, but also on providing good documentation. The tidymodels
website is your best starting point. It has several sections that contain documentation and help resources. I recommend you visit and browse through regularly.
Max Kuhn and tidymodels
co-maintainer Julia Silge also have on online book called Tidy Modeling with R which discusses both the general concepts and the specific details of some of the tidymodels
packages.
There are lots of good tutorials and walk-throughs both on the tidymodels
website and other places. I’ll give you some more links soon. But since most of those discuss the full workflow, and we haven’t gotten there quite yet, we’ll save most of those for later.