Assessment - Machine Learning Models 1

Author

Andreas Handel

Modified

2024-03-25

Quiz

Get the quiz sheet for this module from the general Assessments page. Fill it in, then submit to the online grading system before the deadline.

Discussion

We’ll use this week’s discussion to talk a bit more about the projects. Last week you received feedback on your project from a few classmates; this week I want everyone to have a chance to hear about and comment on all projects.

To that end, post a summary of your project to the discussion channel for this week. Briefly describe your data, your question(s), your (planned) methods, and (expected) results, similar to an abstract for a paper or a summary for a report. This way, everyone can get a quick glimpse of what everyone else is doing. Also provide a link to the repo for easy access (for private repositories, you can skip this).

Also mention specific questions/struggles/concerns you might have regarding your project. It’s quite likely that there is some overlap in approaches and struggles among you and your classmates. Hopefully, through this discussion you can provide each other with some help and input.

Post by Wednesday. Then read the summaries of your classmates’ projects. If you see a question/topic that you think you might be able to help with, either by answering a specific question, or by providing some general feedback, do so. Of course, you are also welcome to take inspiration from what you see others doing and integrate some of that into your own project.

Since you get feedback from me at the other submission points, I will stay out of this week’s discussion 😁. So this is all you helping each other. Each of you has already seen a few projects and will be reviewing 2 projects in depth at the end. With this week’s discussion, you get a chance to learn about all the projects, get an idea of what everyone else is doing, and you can comment on any project you want.

Exercise

For this exercise, we will implement some of the machine learning models covered in this module, and some of the general approaches we covered previously. This is a continuation of the exercise you’ve been working on over the last few weeks.

Note: This exercise is likely time-consuming and you might get stuck. Please plan accordingly. Start early and ask for help if needed. The documentation on the tidymodels website will likely be very helpful for figuring out the different bits of code you need to write here.

Setup

This is a solo exercise, and it’s part of your portfolio.

Since the previous Quarto document was getting long, we’ll set this exercise up as a separate one. Make a new folder in your portfolio repo and call it ml-models-exercise.

If you haven’t done so yet, update your previous code to save the clean data (with the variable RACE included) from the last exercise into an Rds file. Copy that Rds file into the newly created folder. Also in this new folder, start a new Quarto file called ml-models-exercise.qmd.

Open the new Quarto file you just created, and we should be ready to go.

Preliminaries

Write code to load your packages. You already know you’ll need tidymodels and probably ggplot2. You can add other package loading commands later (but remember that it is good practice to add them to the top of your script). Also define a random number seed and set it to the value 1234. Then load the data.
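For instance, a minimal sketch of that setup could look like this (the file name clean_data.rds is just a placeholder; use whatever name you chose when saving):

```r
library(tidymodels)
library(ggplot2)

# random number seed, used repeatedly below
rngseed <- 1234

# load the clean data saved at the end of the previous exercise
# (file name is a placeholder, adjust to match yours)
dat <- readRDS("clean_data.rds")
```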

More processing

Just to practice, we’ll do a bit more data wrangling.

Earlier, we dropped the variable RACE. Now we’ve decided we actually want to keep it. Let’s go on a ‘treasure hunt’: try to find out what the values used to encode race stand for. This is the kind of task one has to do at times for any real-world data analysis. Spend a few minutes to see if you can figure out the meaning of the numbers. (Hint: you can look at the manuscript from our first exercise with these data.)

If you can’t find out the meaning, that’s ok for the purpose of this exercise. What we’ll do is combine the two sparse groups into a single category, so that we are left with race values 1, 2, and 3, where the latter is a re-coding of everyone with a 7 or 88.

Write code that changes the RACE variable such that it combines categories 7 and 88 and puts them in a category called 3.
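One possible sketch using dplyr; converting to character first means it works whether RACE is currently stored as a number or a factor:

```r
# combine the sparse categories 7 and 88 into a new category 3
dat <- dat %>%
  mutate(
    RACE = as.character(RACE),
    RACE = if_else(RACE %in% c("7", "88"), "3", RACE),
    RACE = factor(RACE)   # treat the recoded variable as categorical
  )
```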

Pairwise correlations

Next, let’s make a pairwise correlation plot for the continuous variables. If we were to find any very strong correlations, we might want to remove those.

Write code to make a correlation plot for the continuous variables.
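One possible sketch, using only cor() and ggplot2. Which variables count as continuous is an assumption here (Y, AGE, WT, HT); adjust the selection to your data:

```r
# correlation matrix for the continuous variables
cont_vars <- dat %>% select(Y, AGE, WT, HT)
cor_mat <- cor(cont_vars)

# reshape the matrix into long format and plot it as a heatmap
cor_df <- as.data.frame(as.table(cor_mat))
ggplot(cor_df, aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = round(Freq, 2))) +
  scale_fill_gradient2(limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "correlation")
```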

You’ll see that while some variables are somewhat strongly correlated, nothing seems excessive (e.g., above an absolute value of 0.9). So looks like we won’t have too much of a problem with collinearity.

Feature engineering

As you learned, feature engineering is a fancy word for creating new variables (or more broadly, doing stuff to existing variables, e.g., one could interpret our recoding of RACE above as feature engineering).

Let’s become a feature engineer 😁. Our data contains height and weight for each individual (HT and WT). There’s usually some correlation between those – which you likely noticed above when you did the pairwise correlation. It’s not too strong here, but maybe it would still be good to combine these variables into a new one. Let’s do that by computing BMI.

Look up the formula for BMI. Then add a new variable to the data frame called BMI, computed from the HT and WT variables. Be careful with the units! Since we don’t have a codebook that explains the variables and their units, you have to guess.
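A minimal sketch, assuming WT is recorded in kilograms and HT in meters; if your guess about the units differs, adjust the formula accordingly:

```r
# BMI = weight (kg) / height (m)^2
dat <- dat %>%
  mutate(BMI = WT / HT^2)
```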

Model building

We are finally done with our additional processing bits and are ready to fit some models. Specifically, we will explore these 3 models:

  • For the first model, we’ll revisit the one we had previously, namely a linear model with all predictors.

  • For our second model, we’ll use LASSO regression.

  • For our third model, we’ll do a random forest (RF).

As we discussed, sometimes it’s useful to do a train/test split, so that we can do a final model assessment at the end. Depending on your question and setup, that might or might not be useful. For this exercise we’ll skip this split. Instead, we’ll use all the data and perform cross-validation to try and get an “honest” assessment of the performance of our models.

First fit

We’ll start by performing a single fit to the data, without any CV or model tuning.

As always, we should compare to a null model that includes no predictors. Since we already did that previously, we don’t need to repeat it here. (Although keep in mind that you should almost always fit a null model when you do a real analysis!)

Some algorithms, such as random forest, do at times use random numbers. So to ensure reproducibility, set a seed to the value given above.

Create tidymodels workflows/recipes and fit the 3 models specified above to the data. The outcome is Y, the predictors are all other variables. Use the glmnet engine for the LASSO and the ranger engine for the random forest. Ignore any kind of tuning for now. For the LASSO model, set the penalty to 0.1; we’ll tune it later. For the random forest, we’ll keep all the defaults and not specify any tuning parameters for now. To ensure reproducibility for the random forest model, you also need to specify the random seed as an engine argument: set_engine("ranger", seed = rngseed).

You’ll likely have to install some further R packages to do LASSO and random forest, namely the glmnet and ranger packages.
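One way the model specifications and workflows could look. Object names such as lin_wf are just examples, rngseed is the seed defined earlier, and the step_dummy() line is only needed if your categorical variables are coded as factors (glmnet requires numeric predictors):

```r
# linear model with all predictors
lin_spec <- linear_reg() %>% set_engine("lm")

# LASSO with a fixed penalty of 0.1 (mixture = 1 gives a pure LASSO)
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

# random forest with defaults, seed passed to ranger for reproducibility
rf_spec <- rand_forest() %>%
  set_engine("ranger", seed = rngseed) %>%
  set_mode("regression")

# same recipe for all three models: Y is the outcome, everything else a predictor
rec <- recipe(Y ~ ., data = dat) %>%
  step_dummy(all_nominal_predictors())

lin_wf   <- workflow() %>% add_recipe(rec) %>% add_model(lin_spec)
lasso_wf <- workflow() %>% add_recipe(rec) %>% add_model(lasso_spec)
rf_wf    <- workflow() %>% add_recipe(rec) %>% add_model(rf_spec)

# set the seed before fitting, as discussed above
set.seed(rngseed)
lin_fit   <- fit(lin_wf, data = dat)
lasso_fit <- fit(lasso_wf, data = dat)
rf_fit    <- fit(rf_wf, data = dat)
```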

Once you fit each model, use the model to make predictions on the entire dataset, and report the RMSE performance metric for each model fit. Also make an observed versus predicted plot for each of the models.
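For example, for the random forest (the same pattern applies to the other two models; object names follow the sketch above):

```r
# predictions on the entire dataset, combined with the observed outcome
rf_pred <- predict(rf_fit, new_data = dat) %>%
  bind_cols(dat %>% select(Y))

# RMSE
rmse(rf_pred, truth = Y, estimate = .pred)

# observed versus predicted, with a 45-degree reference line
ggplot(rf_pred, aes(x = Y, y = .pred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Observed", y = "Predicted")
```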

You should find that the linear model and the LASSO give pretty much the same results. Can you explain why that is?

You should also find that the random forest performs best, with an RMSE of around 362 and predictions that are overall closer to the observations (closer to the 45-degree line) in the observed versus predicted plot.

This isn’t entirely surprising, because random forest models are very flexible and thus can capture a lot of the patterns seen in the data. The danger is that they easily overfit.

Tuning the models

Now we’ll do something that you should never do in the real world!

The LASSO and RF both have tuning parameters. Tuning them will often improve model performance. So let’s tune them now.

The stupid thing we’ll be doing is tuning them by repeated fitting to the data without cross-validation. This means model performance is evaluated on the data used for tuning. This pretty much guarantees that we’ll overfit our model. This is a bad idea, but it is instructive to see what happens, so let’s do that.

The Tune model parameters section of the Tidymodels tutorial and this blog post might be useful to look at.

Write code that tunes the LASSO and RF models (there is nothing to tune for the linear model, because it doesn’t have any hyperparameters).

For LASSO, define a grid of penalty values to tune over, ranging from 1E-5 to 1E2 (pick 50 values, linearly spaced on a log scale). Note that here, our “grid” is just a vector of numbers. But as you’ll see below, for models that have multiple tuning parameters, it does become a grid.

Use tune_grid() to tune the model by evaluating model performance for each parameter value in the search grid you defined.

To create an object that only contains the data, and that you can use as resamples input for tune_grid(), use the apparent() function.

Once you have done the tuning, you can take a look at some diagnostics by sending the object returned from the tune_grid() function to autoplot(). For instance, if you tuned the LASSO and saved the result as lasso_tune_res, you can run lasso_tune_res %>% autoplot(). Depending on the model, the plot will be different, but in general it shows you what happened during the tuning process.
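A sketch of what this could look like for the LASSO, following the approach described above (object names are just examples, and rec is the recipe from the earlier sketch):

```r
# LASSO workflow with the penalty marked for tuning
lasso_tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
lasso_tune_wf <- workflow() %>% add_recipe(rec) %>% add_model(lasso_tune_spec)

# 50 penalty values from 1E-5 to 1E2, linearly spaced on a log scale
lasso_grid <- tibble(penalty = 10^seq(-5, 2, length.out = 50))

# "resamples" that are really just the whole data, via apparent()
apparent_data <- apparent(dat)

lasso_tune_res <- tune_grid(
  lasso_tune_wf,
  resamples = apparent_data,
  grid = lasso_grid
)

lasso_tune_res %>% autoplot()
```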

For the LASSO, you’ll see RMSE as a function of the penalty parameter. You should see that the LASSO does best (lowest RMSE) for low penalty values and gets worse as the penalty parameter increases. At the lowest penalty, the RMSE is the same as for the linear model.

Explain why you see this behavior. What are you doing here, and what happens as the penalty parameter goes up? Why does the RMSE only increase, and why does it not go lower than the value found for the linear model or the un-tuned LASSO above? You’ll likely need to re-read the LASSO section on the course website to really understand what is going on here.

Now, let’s repeat with a random forest. Different ML algorithms have different tuning parameters. It’s usually a good idea to look them up to understand at least a bit what they mean and do. You can check this for the ranger algorithm by looking at its help file. There’s a bunch we could tune. We’ll tune over the parameters mtry and min_n and fix trees at 300. Everything else we keep at their defaults. Update your model/workflow accordingly.

Then set up a tuning grid with the grid_regular() function. Use a range for mtry between 1 and 7, a range for min_n between 1 and 21, and 7 levels for each parameter. (That means we’ll be trying 7 x 7 = 49 tuning parameter combinations.)
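A possible sketch for the random forest part, reusing the apparent() object from the LASSO sketch above:

```r
# RF spec with mtry and min_n marked for tuning, trees fixed at 300
rf_tune_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 300) %>%
  set_engine("ranger", seed = rngseed) %>%
  set_mode("regression")
rf_tune_wf <- workflow() %>% add_recipe(rec) %>% add_model(rf_tune_spec)

# 7 x 7 regular grid over mtry and min_n
rf_grid <- grid_regular(
  mtry(range = c(1, 7)),
  min_n(range = c(1, 21)),
  levels = 7
)

rf_tune_res <- tune_grid(
  rf_tune_wf,
  resamples = apparent_data,
  grid = rf_grid
)
```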

The tuning might take several seconds. Once done, use the autoplot() function to look at the results. You’ll see a plot showing how RMSE changes as you change the tuning parameters. You’ll see that a higher value of mtry and a lower one for min_n lead to the best results. This model isn’t as easily understandable as the linear and LASSO models, so it’s not quite clear why those values for the tuning parameters should give the best performance. This is one of the drawbacks of complex ML/AI models: they are harder to interpret. We could probably understand this if we thought more about it. But for now, let’s move on.

Tuning with CV

Now it’s time to do the tuning properly, by using CV to evaluate model performance during the tuning process.

Repeat the steps you did above for tuning both the LASSO and RF. But instead of creating silly resamples with the apparent() function (which weren’t actually resamples but just the full data), we’ll now create real resamples. Let’s do 5-fold cross-validation, repeated 5 times. (There’s no specific reason for this 5x5 pattern, other than to show you that there are different ways to set up the resamples, and that I want you to not use the default.) Use the vfold_cv() function to create a resample object with these specifications. Since this involves random sampling, re-set the random number seed.

Then redo the tuning of both models. You can keep everything as before, the only thing that needs to change in the tune_grid() function is to replace the resamples input, now with real samples generated by the vfold_cv() function.
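A sketch of the resampling setup and the re-run of the tuning (object names follow the earlier sketches):

```r
# creating the folds involves random sampling, so re-set the seed
set.seed(rngseed)

# 5-fold cross-validation, repeated 5 times
cv_folds <- vfold_cv(dat, v = 5, repeats = 5)

# same tuning as before, only the resamples input changes
lasso_tune_cv <- tune_grid(lasso_tune_wf, resamples = cv_folds, grid = lasso_grid)
rf_tune_cv    <- tune_grid(rf_tune_wf, resamples = cv_folds, grid = rf_grid)
```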

Running this code will now take time. It’s basically 5x5 = 25 times as long as what you ran above. So it might take several minutes.

Once done, look at the results from both the LASSO and the RF again using the autoplot() function.

Compare what you see for the RMSE here with the results above when we didn’t use CV to evaluate our model. You should find that the LASSO still does best for a small penalty, the RMSE for both models went up, and the LASSO now has lower RMSE compared to the RF. Explain why you are seeing what you do.

Based on what you find here, which model performs better?

We could also compare the results to the linear model we had initially. However, this is not really necessary, since the LASSO with a small penalty is essentially the linear model. If it is not clear to you why that is, revisit the discussion of LASSO on the course page.

Conclusion

I hope this exercise showed you a few things, the main ones to me are:

  • Complex models need tuning, and with tuning they can fit the data closely.

  • Because complex models can fit data well, it’s important to not evaluate their performance on the data that was used to tune/train them. Otherwise one will make overly confident conclusions and the models will likely perform poorly on new data.

ML and speed

This is probably the first exercise where things started slowing down. Once you work with bigger models, more data, and extensive tuning and CV, things slow down further. It is possible, and generally not too hard, to parallelize some code. For CV, for instance, you could run each fold on a different core. This can often give you a lot of speedup. We won’t go into this here.

If you need to do a lot of hyperparameter tuning, there are also more advanced tuning methods available in the tune and finetune packages that can help you reduce the number of hyperparameter combinations you need to check.

You’ll also find that as things slow down, including computationally intensive code into your Quarto documents can be annoying, since each time you re-render the code might run and take a lot of time. Some options to deal with this are:

  • Move to a mix of R scripts for computation and Quarto for showing the results.

  • Set specific code chunks to not execute during rendering. Instead, run those code chunks manually, save the results to a file, and in your Quarto document, don’t execute the code chunk that does the computation but instead just load the previously computed results (see the sketch after this list).

  • Quarto also lets you set a freeze option (either for specific Quarto documents in the YAML header, or at the project level in the _quarto.yml file) so that a file is only re-rendered when it has changed. You might want to use this in combination with the ideas above.
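As an example of the second option above, a minimal sketch (file, object, and chunk option names are just examples):

```r
# chunk 1: mark this chunk with `#| eval: false`, run it interactively,
# and save the expensive result to a file
rf_tune_cv <- tune_grid(rf_tune_wf, resamples = cv_folds, grid = rf_grid)
saveRDS(rf_tune_cv, "rf_tune_cv.rds")
```

```r
# chunk 2: this chunk runs during rendering and just reloads the saved result
rf_tune_cv <- readRDS("rf_tune_cv.rds")
autoplot(rf_tune_cv)
```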

Final test and website update

Make sure your analysis and all results are nicely documented and everything runs/renders correctly. Then, add the newly created Quarto document as an entry into your _quarto.yml file. Call it Machine Learning. Recompile your portfolio website and make sure everything works and shows up as expected. Then commit and push.

Since this is part of your portfolio site, you don’t need to post anything, I know where to find it.