Get the quiz sheet for this module from the general **Assessments** page.
Fill it in, then submit to the online grading system before the
deadline.

We’ll practice a bit of model fitting using the
`tidymodels`

framework. Of course, some wrangling tasks are
likely part as well. This is another solo exercise. We’ll eventually get
back to group work, I promise 😁.

We are going to use the
data from Brian McKay’s paper you saw at the beginning of the
course. Download all his files and **make sure to unzip**.
Find the `SympAct_Any_Pos.Rda`

file, which is the one we’ll
work with.

Start a new repository using the dataanalyis-template,
make it your own repository (name it
`YOURNAME-MADA-flu-analysis`

), and clone it to your local
computer.

Place Brian’s data file into the `raw_data`

folder. (I
know he already processed it some, but we use it as our start, so for us
that is the raw material.)

We already spent time on the important steps of data wrangling/cleaning and exploring/visualizing. But practice makes perfect 😄 so we’ll talk a bit more and do a bit more of it here.

By now you will have figured out that the whole wrangling/exploring process is iterative. Without exploring your data, you don’t know how to clean/wrangle it. Without at least some cleaning, exploration often doesn’t work (e.g. if you have missing values, some R summary functions don’t work properly).

For this exercise, write some code (as Quarto or Quarto+R) that further processes and cleans the data as follows:

- Remove all variables that have
*Score*or*Total*or*FluA*or*FluB*or*Dxname*or*Activity*in their name. - Also remove the variable
`Unique.Visit`

. You should be left with 32 variables coding for presence or absence of some symptom. Only one, temperature, is continuous. A few have multiple categories. - Remove any NA observations, there aren’t many.

**Make sure to add plenty of comments explaining each step.
State briefly why you are making a decision, and then if the code is not
obvious, explain the how. At this point, you should also delete any bits
from your code or any files that are not part of your cleaning (e.g.,
leftovers from the template).**

At the end of your (fairly short) cleaning process, you should end up
with 730 observations and 32 variables. Save the new dataset to an
`Rds`

file.

For our further analysis, we decide (well, I’m deciding that for you
😁) is that our main continuous outcome of interest is **Body
temperature**, and our main categorical outcome is
**Nausea**. We want to see if the other symptoms are
correlated with (predict) those outcomes.

As part of the data wrangling/cleaning, and also to some extent as
stand-alone exploration to prepare you for the formal statistical
fitting, you need to explore and understand your data. Add an R Markdown
file (call it `exploration.Rmd`

) that does a bit of
exploration, produces a few figures and tables.

At minimum, your exploration code should do the following:

- For each (important) variable, produce and print some numerical output (e.g. a table or some summary statistics numbers).
- For each (important) continuous variable, create a histogram or density plot.
- Create scatterplots or boxplots or similar plots for the variable you decided is your main outcome of interest and the most important (or all depending on number of variables) independent variables/predictors. For this dataset, you can pick and choose a few predictor variables.
- If applicable to your data, make some pairwise correlation plots. (Not applicable here)
- If needed for your data, explore the pattern of missing values. (Not applicable here)
- Any other exploration steps that might be useful.

If you only have a limited number of variables, say less than 10, you can explore them all. If you have a lot, focus your exploration on the ones you think are most interesting and important, which of course is the outcome and the main predictors of interest. You can decide on the predictors to explore, the outcomes I chose for you as stated above. If you have a lot of predictors which may be important you need to use other techniques that we have not discussed.

Always make sure you document well. Prior to each code piece that
produces some output, you should add text describing what you are about
to do and why. After you produced the result, add some text that
comments on what you see and what it means. E.g. you could write
something like *Histogram for height shows two persons at 20in,
everyone else else is above 50in. Checked to see if those are wrong
entries or not. Decided to remove those observations.*

Once you are done with this part, you should generally have at a
minimum code that explores and cleans the data, produces some figures
and tables, and saves the final cleaned data set to an `Rds`

file. You should also have documentation, either as comments in the
code, or text in an Rmd file, or notes in a README file (or even better,
all of those), that explain what variables the data contain, what each
variable means, which are the main outcome(s) and main predictor(s),
etc. Providing all this meta-information is important so that someone
else (or your future self) can easily understand what is going on.

For this exercise, it isn’t so important that the outcomes and
predictors are of actual scientific interest. We won’t do exciting
science here, just explore the analysis workflow. However, I want to
remind you that having good/interesting question to answer (and data you
can use to answer it) is the **most important** part of any
research project. Rather a great question and basic stats than really
fancy stats answering a question nobody cares about!

Any good data collection and analysis requires that data is documented with meta-data to describe what it contains, what units things are in, what variables are allowed, etc. Without good meta-data, analyses often go wrong. There are famous examples in the literature where the coding 0/1 was assumed to mean the opposite of what it was, so all conclusions were wrong. To guard against this, careful documentation is crucial. Also, giving variables easy to understand names and values is best. E.g. instead of coding sex as 0/1/99, code it as Male/Female/Other, or something along those lines. Factors in R allow you to do this very easily, so you may as well!

Finally, we’ll do some model fitting. Start with a clean analysis
script. Use another R Markdown file. If you want, you can place some of
the code into a separate R script and then load into your
`Rmd`

file, or have everything inside the `Rmd`

file. In either case, document/comment well.

Add code that does the following operations:

- Loads cleaned data.
- Fits a linear model to the continuous outcome using only the main predictor of interest.
- Fits another linear model to the continuous outcome using all (important) predictors of interest.
- Compares the model results for the model with just the main predictor and all predictors.
- Fits a logistic model to the categorical outcome using only the main predictor of interest.
- Fits another logistic model to the categorical outcome using all (important) predictors of interest.
- Compares the model results for the categorical model with just the main predictor and all predictors.

**Here, we decide that RunnyNose is our main
predictor of interest, and we also care about all others. So you should
fit models with each of the 2 outcomes and RunnyNose and
models with each of the 2 outcomes and all predictors. Consider
whichever outcome you are currently NOT fitting as a predictor of
interest.**

You might have noticed that we are including some predictors twice,
e.g. we have *Weakness* as 4 categories and also as yes/no.
That’s likely not ideal. We’ll address that soon, for now let’s just
“throw it all in”. Because we do that, you might get some warning
messages from your code about rank deficiency. Usually, we would need to
inspect those and make sure we either fix or understand what’s going on
and are ok with it. For now, you can ignore.

For all of these operations, you should use the the
`tidymodels`

framework. To do so, I suggest to go to the build a model
tutorial in the *Get Started* section on the
`tidymodels`

website. There, you’ll see an example of fitting
some data with a linear model, followed by a Bayesian model. We won’t do
the Bayesian one (but if you want you can try). From that tutorial, pick
out the bits and pieces you need to fit your two linear models. You’ll
need to use the `linear_reg()`

and
`set_engine("lm")`

commands. For the logistic model, you will
need to use `logistic_reg()`

and
`set_engine("glm")`

.

For each of the 4 fits (2 linear models, 2 logistic models), report
the results in some way, either as table or figure. You might want to
try the `glance()`

function. You can also play around a bit
with the `predict()`

function or do some other explorations
with your fits. This is open-ended, so play around as much or as little
as you like.

Comparison of models should be based on performance of the models (and anything else you want to look at). This should be heuristic, with you looking at and discussing differences. Please no comparison using p-values or other statistical shenanigans 😄.

The main point of this is to get you started fitting models with
`tidymodels`

. You’ll likely see lots of information on the
`tidymodels`

website that might not make sense yet, such as
pre-processing, model evaluation, tuning, etc. Just ignore that for now,
we’ll get to it soon enough.

If you are extra ambitious or just enjoy playing around with
`tidymodels`

, add a *k*-nearest neighbors model fit to
the continuous and/or categorical outcome. You do that by choosing
`nearest_neighbor()`

as the model type (instead of
`linear_reg()`

or `logistic_reg()`

). It seems that
`nearest_neighbor()`

currently only supports a single engine
(underlying implementation/model) called `kknn`

, so you don’t
even have to set an engine, it will use that one as default (but it’s
still good practice in case `tidymodels`

adds another engine
in the future 😁). The rest should work as before. The fact that we can
easily switch models and don’t have to write much new code is the beauty
of the `tidymodels`

framework and will become even clearer
soon.

I’m a big fan of fitting models in a Bayesian framework. R has many
good packages. Most of them connect to the Stan software, an independent piece of
powerful Bayesian fitting software. It can be somewhat tricky to get the
installation of Stan and the connection to R working. So if you try to
do some Bayesian modeling, along the lines of the example in the
`tidymodels`

tutorial, some initial fiddling is required.
Once you get it up and running, it usually works well. The Stan website
provides useful instructions. Learning some Bayesian statistics is very
useful. Unfortunately, it’s beyond what we can cover in this course.

Once you completed your `tidymodels`

exploration, make
sure everything runs and produces relevant output (tables, figures,
text, etc.). There is no need to update the manuscript file for this
exercise. We’ll do that in a future exercise. **But do make sure
everything is documented well so others can understand what is going
on.**

Once done, commit and push. Then publish a link to your repository into the exercise channel for this module.

Write a post in this week’s discussion channel that answers this question:

Which major new insight did you gain from this week’s topic (or any topic related to model fitting you got exposed to over the last few weeks) and how will that maybe affect your analyses in the future? And what point(s) do you find rather confusing about the model fitting material we covered so far?

Post by Wednesday, then reply to each other by Friday. See if you can help each other reduce any existing confusion. I’ll be participating too, of course!