Assessment - Fitting Basic Statistical Models
Quiz
Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.
Discussion
Write a post in this week’s discussion channel that answers this question:
Which new insight(s) did you gain from this week’s topic (or any topic related to model fitting) and how will that possibly impact your analyses in the future? And what point(s) do you find rather confusing about the model fitting material we covered so far?
Post by Wednesday, then reply to each other by Friday. See if you can help each other reduce any existing confusion.
Exercise
We’ll practice a bit of model fitting using the `tidymodels` framework. Of course, some data processing and EDA components are also needed. This is a solo exercise and due Friday.
Setup
This will be part of your portfolio site. Make a new folder called `fitting-exercise`. Also make a new Quarto document inside the folder and call it `fitting-exercise.qmd`.
In general, I suggest using a folder structure like the one from `data-analysis-template` and having separate pieces of code and Quarto files. For the purpose of this exercise, we’ll keep it simple and do everything inside the `fitting-exercise` folder and `fitting-exercise.qmd` file.
Data
We are going to use data on a drug candidate called Mavoglurant. This is the original paper describing and analyzing the data. You don’t need to look at it, unless you want to 😁. In case you do, since the paper might be behind a paywall, here’s a local copy of the pdf.
The data is part of an R package called `nlmixr2data`. You could get it by installing that package. But the easiest way to get the data is to download it from this GitHub repository, which is part of another paper you don’t need to read - but might want to peek at, see below.
Go to the GitHub repository, download the `Mavoglurant_A2121_nmpk.csv` file and place it into the folder you just created.
Data processing and exploration
Before we can do any analysis, we need to explore and process the data. The data we are looking at here has the – very common – feature that at first glance, it is confusing 🙄. Variables have weird names and there seems to be a strange structure. In almost any real-world setting, this is the kind of data you’ll be given. So let’s do some exploration and processing.
Unfortunately, the data does not come with a codebook. The documentation of this dataset in the `nlmixr2data` package provides a bit of information. However, this documentation is very sparse and not enough. Another way to learn a bit more is to take a quick look at the papers that were published covering the data.
The second paper mentioned above likely gives you a quicker overview than the first one. Take a look at the Data section and Figure 3, which shows the data as symbols. Based on that, you’ll figure out that this is time-series data of drug concentrations.
You can get a quick visual idea of the data by plotting the outcome variable (which according to the documentation is called `DV` here) as a function of time, stratified by `DOSE` and using `ID` as a grouping factor.
Write code to load the data into R. Then write code to make a plot that shows a line for each individual, with `DV` on the y-axis and time on the x-axis. Stratify by dose (e.g., use a different color for each dose, or facets).
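As a minimal sketch, loading and plotting could look something like this. It assumes the CSV file sits in the `fitting-exercise` folder and that you use the `here` package for paths; adjust names and paths to match your setup.

```r
library(tidyverse)
library(here)

# Load the raw data (path assumes the here/fitting-exercise setup)
rawdat <- read_csv(here("fitting-exercise", "Mavoglurant_A2121_nmpk.csv"))

# Spaghetti plot: one line per individual, colored by dose
rawdat |>
  ggplot(aes(x = TIME, y = DV, group = ID, color = as.factor(DOSE))) +
  geom_line() +
  labs(color = "DOSE")
```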
As you look closer, the formatting of the dataset still looks a bit strange. So let’s dig deeper. One thing you will notice is that some individuals seem to have received the drug more than once, indicated by having entries with both `OCC = 1` and `OCC = 2`. Since we are not sure what the difference is, and to keep things simple, we only keep the data from one occasion for each individual. Therefore, remove all entries with `OCC = 2`.
Write code that keeps only observations with `OCC == 1`.
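For instance (a sketch, assuming the raw data from above is in `rawdat`):

```r
# Keep only the first dosing occasion
dat1 <- rawdat |> filter(OCC == 1)
```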
You will also see that each individual has an entry at time 0 that has `DV = 0` and some non-zero value for `AMT`. This is the dosing entry for everyone. All the other entries are the time-series values for the drug concentration. We won’t do a time-series analysis here, so instead we’ll compute the total amount of drug for each individual by adding all the `DV` values. Note that this is a pretty bad idea, since some individuals might have more or fewer data points. The proper way to do this would be some form of integration to get the area under the curve, e.g. with a simple trapezoid rule, or modeling the whole time-series with a function and then computing the AUC from that function. But to keep things simple, we’ll go ahead - keeping in mind that in general, outside of a practice example, this is not a good idea.
Write code to exclude the observations with `TIME == 0`, then compute the sum of the `DV` variable for each individual using `dplyr::summarize()`. Call this variable `Y`. The result from this step should be a data frame/tibble of size 120 x 2: one column for `ID`, one for `Y`. Next, create a data frame that contains only the observations where `TIME == 0`. This should be a tibble of size 120 x 17. Finally, use the appropriate join function to combine those two data frames to get a data frame of size 120 x 18.
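A sketch of that sequence, assuming `dat1` holds the `OCC == 1` data from the previous step:

```r
# Total drug (Y) per individual, excluding the dosing entry at time 0
dat_y <- dat1 |>
  filter(TIME != 0) |>
  group_by(ID) |>
  summarize(Y = sum(DV))   # should be 120 x 2

# The dosing entries
dat_t0 <- dat1 |> filter(TIME == 0)   # should be 120 x 17

# Combine: a left join by ID gives 120 x 18
dat2 <- left_join(dat_t0, dat_y, by = "ID")
```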
Finally, we’ll do a bit more cleaning. At this point, we don’t need most of these indicator variables anymore (e.g., `OCC` or `EVID`). We also want to convert `RACE` and `SEX` to factor variables.
Write code that converts `RACE` and `SEX` to factor variables and keeps only these variables: `Y`, `DOSE`, `AGE`, `SEX`, `RACE`, `WT`, `HT`.
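One way to do this (a sketch, assuming `dat2` from the join step above):

```r
# Convert categorical codes to factors and keep only the variables we need
dat <- dat2 |>
  mutate(RACE = as.factor(RACE),
         SEX  = as.factor(SEX)) |>
  select(Y, DOSE, AGE, SEX, RACE, WT, HT)
```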
Check the data and make sure everything looks ok. As needed, perform further exploration or cleaning steps.
EDA revisited
You probably already produced some EDA (figures and tables) as part of the exploratory process above. Do a few more here, once the data is clean.
- Make some useful summary tables.
- Show some scatterplots or boxplots between the main outcome of interest (total drug, `Y`) and other predictors.
- Plot the distributions of your variables to make sure they all make sense.
- Look at some pair/correlation plots.
This is a fairly open exploration; I want you to play with the data enough to get a good idea of what you are looking at and what’s “inside” the data.
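A few possible starting points (a sketch, assuming the cleaned data is in `dat`; the pairs plot needs the extra `GGally` package):

```r
# Summary table of the outcome by dose
dat |>
  group_by(DOSE) |>
  summarize(mean_Y = mean(Y), sd_Y = sd(Y), n = n())

# Outcome vs. a predictor
ggplot(dat, aes(x = as.factor(DOSE), y = Y)) + geom_boxplot()

# Distribution of the outcome
ggplot(dat, aes(x = Y)) + geom_histogram()

# Pairs/correlation plot
GGally::ggpairs(dat)
```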
General wrangling and EDA comments
Always make sure you document well. Prior to each code piece that produces some output, you should add text describing what you are about to do and why. After you produce the result, add some text that comments on what you see and what it means. E.g., you could write something like: Histogram for height shows two persons at 20in, everyone else is above 50in. Checked to see if those are wrong entries or not. Decided to remove those observations.
Once you’ve come this far in any kind of data analysis, you should have some code that explores and cleans the data, produces some figures and tables, and (in general, not required here) saves the cleaned data set to a file.
You should also have documentation, as comments in the code and as text in a Quarto document, as well as some notes in README file(s), that explain what the different parts of data and code are/do, what variables the data contain, what each variable means, which are the main outcome(s) and main predictor(s), etc. Providing all this meta-information is important so that someone else (or your future self) can easily understand what is going on.
Any good data collection and analysis requires that data is documented with meta-data to describe what it contains, what units things are in, what values are allowed, etc. Without good meta-data, analyses often go wrong. There are famous examples in the literature where a 0/1 coding was assumed to mean the opposite of what it actually meant, so all conclusions were wrong. To guard against this, careful documentation is crucial. Also, giving variables easy-to-understand names and values is best. The data we have here is a good example of how NOT to do things. There’s no codebook, and we don’t really know what SEX 0/1 stands for (which one is male/female). We also don’t know what the 4 entries 1/2/7/88 for `RACE` stand for. A codebook, some better labeling, or ideally both, would be very helpful. Unfortunately, a lot of real-world data is rather poorly documented.
Model fitting
Finally, we’ll do some model fitting. We’ll keep it simple here and use the `tidymodels` framework. I suggest you go to the build a model tutorial in the Get Started section of the `tidymodels` website. There, you’ll see an example of fitting some data with a linear model. From that tutorial, other sources on the `tidymodels` website, the TMWR book, and other online sources, figure out how to do the steps below. Of course, AI tools are also an option (that’s how I did part of it 😁).
Write code that does the following:
- Fit a linear model to the continuous outcome (`Y`) using the main predictor of interest, which we’ll assume here to be `DOSE`.
- Fit a linear model to the continuous outcome (`Y`) using all predictors.
- For both models, compute RMSE and R-squared and print them.
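As a rough sketch of how this might look with `tidymodels` (assuming the cleaned data frame is called `dat`; metrics here are computed on the training data, which is fine for this exercise):

```r
library(tidymodels)

lm_mod <- linear_reg() |> set_engine("lm")

# Model 1: Y as a function of DOSE only
fit1 <- lm_mod |> fit(Y ~ DOSE, data = dat)

# Model 2: Y as a function of all predictors
fit2 <- lm_mod |> fit(Y ~ ., data = dat)

# RMSE and R-squared (yardstick::metrics reports both)
predict(fit1, dat) |>
  bind_cols(dat) |>
  metrics(truth = Y, estimate = .pred)

predict(fit2, dat) |>
  bind_cols(dat) |>
  metrics(truth = Y, estimate = .pred)
```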
Repeat the steps above, but now we’ll consider `SEX` as the outcome of interest (that doesn’t make too much scientific sense, but we want to practice fitting both continuous and categorical outcomes).
- Fit a logistic model to the categorical/binary outcome (`SEX`) using the main predictor of interest, which we’ll again assume here to be `DOSE`.
- Fit a logistic model to `SEX` using all predictors.
- For both models, compute accuracy and ROC-AUC and print them.
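A possible sketch, again assuming the cleaned data frame `dat` with `SEX` already a factor:

```r
log_mod <- logistic_reg() |> set_engine("glm")

# Model 1: SEX as a function of DOSE only
fit3 <- log_mod |> fit(SEX ~ DOSE, data = dat)

# Model 2: SEX as a function of all predictors
fit4 <- log_mod |> fit(SEX ~ ., data = dat)

# Accuracy needs class predictions, ROC-AUC needs predicted probabilities
preds3 <- predict(fit3, dat) |>
  bind_cols(predict(fit3, dat, type = "prob")) |>
  bind_cols(dat)

preds3 |> accuracy(truth = SEX, estimate = .pred_class)
# The probability column name depends on your factor levels (e.g., .pred_1)
preds3 |> roc_auc(truth = SEX, .pred_1)
```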
Interpret the results and add additional comments and thoughts to the document, so a reader can understand what you did, why you did it, and what it all means. For instance, do you see any pattern between `Y` and `DOSE` or `SEX` and `DOSE`, and does what you see make sense? How about the other predictors?
If you are extra ambitious or just enjoy playing around with `tidymodels`, add a k-nearest neighbors model fit to the continuous and/or categorical outcome.
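If you try this, a possible starting point for the continuous outcome (a sketch; the `kknn` package needs to be installed):

```r
knn_mod <- nearest_neighbor(neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_fit <- knn_mod |> fit(Y ~ ., data = dat)

predict(knn_fit, dat) |>
  bind_cols(dat) |>
  metrics(truth = Y, estimate = .pred)
```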
For this exercise, it isn’t so important that the outcomes and predictors are of actual scientific interest. We won’t do exciting science here, just explore the analysis workflow. However, I want to remind you that having a good/interesting question to answer (and data you can use to answer it) is the most important part of any research project. Better a great question answered with basic stats than really fancy stats answering a meaningless question that nobody cares about!
Updating the portfolio website
Once you completed everything, make sure everything runs/renders and produces relevant output (tables, figures, text, etc.).
Now it’s time to add everything to the portfolio website. For that, open the `_quarto.yml` file and make a new entry in the `navbar` section as you did before, that points to your new Quarto file.
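As a sketch, such an entry might look something like this; the exact structure depends on your existing `_quarto.yml`, and the `text` label here is just a hypothetical choice:

```yaml
navbar:
  left:
    - text: "Fitting Exercise"   # hypothetical label, pick your own
      href: fitting-exercise/fitting-exercise.qmd
```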
Remember that YAML is very picky about indentation, so if you get strange error messages, it is often likely that your entries are not aligned/indented exactly right. If you suspect that you have this problem, you can show whitespace characters in RStudio by going to `Tools -> Global Options -> Code -> Display` and checking Show Whitespace Characters.
Render your whole website – not just the new document! – and make sure all looks good and all links work. If all is ready, commit and push to GitHub.
Since this is part of your portfolio site, you don’t need to post anything, I know where to find it.