In this unit, we will discuss common approaches and packages that are useful for fitting statistical models in R.
R has a few statistical model fitting routines built in, e.g., the lm()
and glm() functions. Those are well-tested and reliable, but often do
not allow for fitting more advanced models. To fit more advanced models,
there are many (100+) different R packages that provide additional
functionality. The variety can be bewildering, and it is often hard to
decide which package to use. Also, many packages do things slightly
differently, which can be confusing and can make coding tedious. At some
point in your analysis career, you’ll likely have to interact with
packages specific to your data and questions and learn their syntax. To
make things easier at the start, several groups have undertaken efforts
to create packages that allow a unified approach to fitting a lot of
different models. We will focus on those packages here.
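As a quick illustration of the built-in routines mentioned above, here is a minimal sketch using lm() and glm(). The mtcars dataset (which ships with R), the chosen variables, and the object names are my own choices for illustration, not something prescribed by this course.

```r
# mtcars ships with base R, so no extra packages are needed
data(mtcars)

# Linear regression: fuel efficiency (mpg) as a function of weight and horsepower
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Logistic regression: transmission type (am is coded 0/1) as a function of weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Both functions return model objects that work with the same generic helpers
summary(fit_lm)
coef(fit_glm)
```

Both functions use the same formula interface, which is part of why they are a convenient starting point before moving on to other packages.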
tidymodels set of packages
You’ve learned about the tidyverse already. A more recent effort by the
folks from RStudio, along with many other contributors, is a set of
packages called tidymodels. The idea is that, similar to the various
tidyverse packages you can use for data wrangling, tidymodels provides a
set of packages that help you with the code related to fitting models.
For this course, we will focus on the tidymodels set of packages.
You can use tidymodels for pretty much any part of the modeling workflow
(e.g., pre-processing, model evaluation, tuning). We have not yet
covered most of those steps, but will do so shortly. The goal for this
unit is to start exploring the tidymodels workflow. For now, we’ll
ignore a lot of the additional features, but we’ll discuss them shortly.
The tidymodels suite is a relatively recent addition to the R universe.
One of the main people behind tidymodels, Max Kuhn, previously wrote the
caret package. This was, and still is, a nice and comprehensive package.
I used it in previous versions of this course. However, at this point,
all effort by Max and his team is put into tidymodels. Thus, for this
iteration of the course, we will focus on those newer packages. If you
ever end up working with the caret package, you might want to check out
the caret chapter in IDS. caret has existed for a while, so you can find
all kinds of resources online as well.
While the whole data exploring, cleaning, and wrangling part in R is
strongly dominated by the various tidyverse packages, the tidymodels
packages do not (yet) dominate the fitting part as much. And it is
always good to have options.
Another great set of packages for model fitting is Machine Learning in R
(mlr3). The goal of the various packages that are part of mlr3 is
similar to that of tidymodels. While there is overlap, each set of tools
can do certain things the other cannot do. For instance, in my
experience, mlr3 has more options for parameter tuning, though
tidymodels is catching up.
The main reason why we focus on tidymodels in this course is that the
coding style is very similar to the tidyverse coding style, e.g., heavy
use of pipes. Thus, in my opinion, it is easier to learn. The mlr3
package has its own syntax. It is of course still R, but things look and
operate quite a bit differently, which means one needs more time to get
used to the code. Thus, to keep things as simple as possible on the
coding side, we are not looking at mlr3 in this course. If, however, you
ever end up trying to do a fitting/machine learning operation that you
can’t do with the tidymodels set of packages, checking out mlr3 is
certainly a good option.
Note that, much like tidymodels followed caret, mlr3 had a predecessor
called mlr (I don’t know if there ever was mlr2 🤷). mlr still exists,
but all new development occurs in mlr3.
The idea behind tidymodels and mlr3 is that you write code that allows
you to easily switch between the underlying model and algorithm you want
to apply to your data, without having to write separate code each time.
That often works rather well. Occasionally, you might need direct access
to and interaction with a package. Say you want to fit some
mixed-effects/multilevel/hierarchical models using a package that is not
yet supported by tidymodels or mlr3. In that case, you will have to
write code using the syntax your specific package needs. You might still
be able to use, say, tidymodels to do a lot of the processing before and
after fitting. It is generally good to start with a framework that tries
to make your life easier, such as tidymodels. Once you realize you can’t
get what you need through those packages, you can add custom code.
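To make the idea of switching the underlying fitting routine a bit more concrete, here is a minimal sketch using tidymodels. It assumes the tidymodels packages are installed. The mtcars data, the predictors, and the object names are my own choices for illustration. The same model specification is fit once with the lm engine and once with the glm engine, and only the set_engine() call changes.

```r
library(tidymodels)

# Model specification: describes the type of model (linear regression),
# separate from the software (engine) that does the actual fitting
lm_spec  <- linear_reg() %>% set_engine("lm")
glm_spec <- linear_reg() %>% set_engine("glm")

# The fitting code is identical for both engines
fit_a <- lm_spec  %>% fit(mpg ~ wt + hp, data = mtcars)
fit_b <- glm_spec %>% fit(mpg ~ wt + hp, data = mtcars)

# tidy() returns the coefficient table as a tibble for either fit
tidy(fit_a)
tidy(fit_b)
```

Since the glm engine defaults to a Gaussian family here, both fits should give the same coefficient estimates. The point is only that the surrounding code stays the same when the engine changes.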
One nice feature of tidymodels is that the developers are placing a lot
of emphasis not only on implementing new features, but also on providing
good documentation. The tidymodels website is your best starting point.
It has several sections that contain documentation and help resources. I
recommend you visit and browse through it regularly. Max Kuhn and
tidymodels co-maintainer Julia Silge also have an online book called
Tidy Modeling with R, which discusses both the general concepts and the
specific details of some of the tidymodels packages.
There are lots of good tutorials and walk-throughs both on the
tidymodels website and in other places. I’ll give you some more links
soon. But since most of those discuss the full workflow, and we haven’t
gotten there quite yet, we’ll save most of those for later.