Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
We’ll practice a bit of model fitting using the
tidymodels framework. Of course, some wrangling tasks are
likely part as well. This is another solo exercise. We’ll eventually get
back to group work, I promise 😁.
We are going to use the
data from Brian McKay’s paper you saw at the beginning of the
course. Download all his files and make sure to unzip.
SympAct_Any_Pos.Rda file, which is the one we’ll
Start a new repository using the dataanalyis-template,
make it your own repository (name it
YOURNAME-MADA-flu-analysis), and clone it to your local
Place Brian’s data file into the
raw_data folder. (I
know he already processed it some, but we use it as our start, so for us
that is the raw material.)
We already spent time on the important steps of data wrangling/cleaning and exploring/visualizing. But practice makes perfect 😄 so we’ll talk a bit more and do a bit more of it here.
By now you will have figured out that the whole wrangling/exploring process is iterative. Without exploring your data, you don’t know how to clean/wrangle it. Without at least some cleaning, exploration often doesn’t work (e.g. if you have missing values, some R summary functions don’t work properly).
For this exercise, write some code (as Quarto or Quarto+R) that further processes and cleans the data as follows:
Unique.Visit. You should be left with 32 variables coding for presence or absence of some symptom. Only one, temperature, is continuous. A few have multiple categories.
Make sure to add plenty of comments explaining each step. State briefly why you are making a decision, and then if the code is not obvious, explain the how. At this point, you should also delete any bits from your code or any files that are not part of your cleaning (e.g., leftovers from the template).
At the end of your (fairly short) cleaning process, you should end up
with 730 observations and 32 variables. Save the new dataset to an
For our further analysis, we decide (well, I’m deciding that for you 😁) is that our main continuous outcome of interest is Body temperature, and our main categorical outcome is Nausea. We want to see if the other symptoms are correlated with (predict) those outcomes.
As part of the data wrangling/cleaning, and also to some extent as
stand-alone exploration to prepare you for the formal statistical
fitting, you need to explore and understand your data. Add an R Markdown
file (call it
exploration.Rmd) that does a bit of
exploration, produces a few figures and tables.
At minimum, your exploration code should do the following:
If you only have a limited number of variables, say less than 10, you can explore them all. If you have a lot, focus your exploration on the ones you think are most interesting and important, which of course is the outcome and the main predictors of interest. You can decide on the predictors to explore, the outcomes I chose for you as stated above. If you have a lot of predictors which may be important you need to use other techniques that we have not discussed.
Always make sure you document well. Prior to each code piece that produces some output, you should add text describing what you are about to do and why. After you produced the result, add some text that comments on what you see and what it means. E.g. you could write something like Histogram for height shows two persons at 20in, everyone else else is above 50in. Checked to see if those are wrong entries or not. Decided to remove those observations.
Once you are done with this part, you should generally have at a
minimum code that explores and cleans the data, produces some figures
and tables, and saves the final cleaned data set to an
file. You should also have documentation, either as comments in the
code, or text in an Rmd file, or notes in a README file (or even better,
all of those), that explain what variables the data contain, what each
variable means, which are the main outcome(s) and main predictor(s),
etc. Providing all this meta-information is important so that someone
else (or your future self) can easily understand what is going on.
For this exercise, it isn’t so important that the outcomes and predictors are of actual scientific interest. We won’t do exciting science here, just explore the analysis workflow. However, I want to remind you that having good/interesting question to answer (and data you can use to answer it) is the most important part of any research project. Rather a great question and basic stats than really fancy stats answering a question nobody cares about!
Any good data collection and analysis requires that data is documented with meta-data to describe what it contains, what units things are in, what variables are allowed, etc. Without good meta-data, analyses often go wrong. There are famous examples in the literature where the coding 0/1 was assumed to mean the opposite of what it was, so all conclusions were wrong. To guard against this, careful documentation is crucial. Also, giving variables easy to understand names and values is best. E.g. instead of coding sex as 0/1/99, code it as Male/Female/Other, or something along those lines. Factors in R allow you to do this very easily, so you may as well!
Finally, we’ll do some model fitting. Start with a clean analysis
script. Use another R Markdown file. If you want, you can place some of
the code into a separate R script and then load into your
Rmd file, or have everything inside the
file. In either case, document/comment well.
Add code that does the following operations:
Here, we decide that
RunnyNose is our main
predictor of interest, and we also care about all others. So you should
fit models with each of the 2 outcomes and
models with each of the 2 outcomes and all predictors. Consider
whichever outcome you are currently NOT fitting as a predictor of
You might have noticed that we are including some predictors twice, e.g. we have Weakness as 4 categories and also as yes/no. That’s likely not ideal. We’ll address that soon, for now let’s just “throw it all in”. Because we do that, you might get some warning messages from your code about rank deficiency. Usually, we would need to inspect those and make sure we either fix or understand what’s going on and are ok with it. For now, you can ignore.
For all of these operations, you should use the the
tidymodels framework. To do so, I suggest to go to the build a model
tutorial in the Get Started section on the
tidymodels website. There, you’ll see an example of fitting
some data with a linear model, followed by a Bayesian model. We won’t do
the Bayesian one (but if you want you can try). From that tutorial, pick
out the bits and pieces you need to fit your two linear models. You’ll
need to use the
set_engine("lm") commands. For the logistic model, you will
need to use
For each of the 4 fits (2 linear models, 2 logistic models), report
the results in some way, either as table or figure. You might want to
glance() function. You can also play around a bit
predict() function or do some other explorations
with your fits. This is open-ended, so play around as much or as little
as you like.
Comparison of models should be based on performance of the models (and anything else you want to look at). This should be heuristic, with you looking at and discussing differences. Please no comparison using p-values or other statistical shenanigans 😄.
The main point of this is to get you started fitting models with
tidymodels. You’ll likely see lots of information on the
tidymodels website that might not make sense yet, such as
pre-processing, model evaluation, tuning, etc. Just ignore that for now,
we’ll get to it soon enough.
If you are extra ambitious or just enjoy playing around with
tidymodels, add a k-nearest neighbors model fit to
the continuous and/or categorical outcome. You do that by choosing
nearest_neighbor() as the model type (instead of
logistic_reg()). It seems that
nearest_neighbor() currently only supports a single engine
(underlying implementation/model) called
kknn, so you don’t
even have to set an engine, it will use that one as default (but it’s
still good practice in case
tidymodels adds another engine
in the future 😁). The rest should work as before. The fact that we can
easily switch models and don’t have to write much new code is the beauty
tidymodels framework and will become even clearer
I’m a big fan of fitting models in a Bayesian framework. R has many
good packages. Most of them connect to the Stan software, an independent piece of
powerful Bayesian fitting software. It can be somewhat tricky to get the
installation of Stan and the connection to R working. So if you try to
do some Bayesian modeling, along the lines of the example in the
tidymodels tutorial, some initial fiddling is required.
Once you get it up and running, it usually works well. The Stan website
provides useful instructions. Learning some Bayesian statistics is very
useful. Unfortunately, it’s beyond what we can cover in this course.
Once you completed your
tidymodels exploration, make
sure everything runs and produces relevant output (tables, figures,
text, etc.). There is no need to update the manuscript file for this
exercise. We’ll do that in a future exercise. But do make sure
everything is documented well so others can understand what is going
Once done, commit and push. Then publish a link to your repository into the exercise channel for this module.
Write a post in this week’s discussion channel that answers this question:
Which major new insight did you gain from this week’s topic (or any topic related to model fitting you got exposed to over the last few weeks) and how will that maybe affect your analyses in the future? And what point(s) do you find rather confusing about the model fitting material we covered so far?
Post by Wednesday, then reply to each other by Friday. See if you can help each other reduce any existing confusion. I’ll be participating too, of course!