Overview

This course provides a survey of modern statistical approaches to analyze data. We will cover a variety of modern approaches for analyzing and interpreting data commonly encountered in public health, biomedical sciences, and related areas. This is an applied, hands-on class. We will use real data (bring your own data if you have it) to learn different methods of analysis. We will discuss all the steps of a data analysis, including obtaining and cleaning data, exploratory and full analysis, and presentation of results.

We will discuss how to formulate scientifically solid questions for a given set of data, how to decide on the right method of analysis, how to implement the analysis in R, and how to present and communicate the results. We will cover statistical topics such as regression, tree based models, cross-validation, bootstrapping, and model selection.

Learning Objectives

The main goal for this course is for you to learn the whole process of performing a data analysis project. This starts with identifying a suitable question-data pair, proceeds to getting, cleaning and exploring the data, culminates with fitting statistical models and producing materials that communicate what you did and found. A second goal is to introduce you to some modern analysis approaches that these days often go by the name of ‘Machine Learning’. Finally, a related goal is to introduce you to a set of tools that allow for a modern, reproducible workflow of your analysis.

The specific learning objectives that I hope you will achieve by going through this course are:

Define meaningful data analysis questions and assess the feasibility of answering these questions with the available data.

  • Given a data set, define the questions that can be answered and formulate and implement suitable analytic approaches.
  • Given a data analysis question, determine the type of data and analytic approach needed to answer it.

Be able to obtain, organize and process data for detailed analysis.

  • Know how to obtain data from a variety of different sources.
  • Be knowledgeable about data types and standards and how to process them.
  • Be able to organize and process data in a reproducible, automated and documented manner.
  • Be able to thoughtfully and critically assess strengths and weaknesses of specific data sets and process the data appropriately.

Be knowledgeable of different data analysis methods and select the appropriate approach for a given project based on data and question.

  • Critically compare and evaluate the strengths and weaknesses of different data analysis approaches.
  • Judge the appropriateness of different approaches for specific questions and data sets and know how to apply an appropriate analytic approach.
  • Design and implement successful data analyses using state-of-the-art analysis software to translate data to information and knowledge that leads to actionable insights.

Efficiently communicate results from data analyses to a variety of stakeholders.

  • Summarize analysis results in ways that provide actionable conclusions and that are easily understandable by different audiences, such as laypersons, decision makers, and expert colleagues.
  • Assess the strengths and weaknesses of different formats for representing the results of data analyses.

Use modern coding and analysis tools to implement automated, reproducible analysis and project management workflows.

  • Explain the importance of workflow, project management, and reproducibility tools, and know how to use those tools.
  • Be proficient in R coding to implement and execute a complete data analysis project in a reproducible and automated manner.
  • Be comfortable using R and Github to do data analysis in a reproducible manner.
  • Be able to quickly learn how to use new software and tools, figure out how to get help when stuck, and make it work for you.

Develop skills to critically assess your own and others’ analyses and conclusions.

  • Judge the usefulness and appropriateness of data analyses described in the primary research literature.
  • Learn to look over your shoulder and critically assess what you are doing, what assumptions you make by doing certain things, and if and how you can justify these.

Topics

Here is a non-exhaustive lists of topics that this course does and does not cover.

We will cover

  • How to set up an analysis workflow that is as reproducible and automated as possible.
  • Getting, cleaning and processing messy real-world data.
  • Data visualization.
  • Modern tools for data analysis (e.g., R, Quarto, Git/GitHub).
  • The tidyverse for data processing and tidymodels for fitting models.
  • An introduction to some Machine Learning tools and techniques.

We will not (or barely) cover

  • Advanced visualization techniques using interactive tools such as R/Shiny.
  • Dealing with “non-rectangular” data, such as time-series data, images, audio, complex -omics data, etc. (we won’t cover it, but you can use such data for your class project).
  • Statistical tests and basic statistical modeling (linear and generalized linear models). Some familiarity with those techniques is assumed and they do show up a few times, but won’t be covered.
  • How to code in R. We will use R, but this course doesn’t teach R. With enough effort, a student can learn both the subject matter and pick up enough R at the same time, but this will require extra effort. In general, some basic familiarity with R or another programming language will be assumed.
  • Anything in depth. This is a survey course and covers a lot of material, thus we won’t be able to go into much depth for any topic. Resources are provided to allow anyone interested to go deeper on their own.

Prerequisites

Formal requirement for the course is BIOS 7010. Knowledge of material from BIOS 7010 is assumed. If your lack pre-requisites prevent you from enrolling in this course, please contact me to get permission to enroll.

This is a quantitative course. We will not discuss the mathematical details of specific data analysis approaches, however some statistical background and being comfortable with quantitative thinking are needed. Knowledge of statistics at the level of fitting linear or logistic models to data (e.g., as obtained in our BIOS 7010 and 7020 courses) are assumed. Some R coding skills (e.g., as obtained in our EPID 7500 class), are helpful.

If you do not have any coding or statistics knowledge, you can still take the class, but you need to be prepared to spend extra time and effort to fill any gaps. This will be especially true for the R coding part. Some of you likely have previous R experience, while others might have little to none. If you are in the little to none category, expect to spend extra time getting up to speed. I believe it’s doable and worth the effort, but you need to be prepared for it. There will be plenty of help from myself, classmates, and the internet if you end up getting stuck with some of the coding, but your effort and commitment are still required.

To re-emphasize: For those among you who have not used R or any other programming language before, this course will be time-consuming. Budget your time accordingly and plan ahead! If you do, I’m fairly certain you will find it worth it. If you are not able or willing to allot the time needed to learn enough R (and GitHub) to make things work, this course might not be ideal for you.

Should I take this course?

Here are a few more pointers to what to expect from the course, and comments from previous students (so you can hear it from them, not just from me).

  • The course is a high-level survey course. We cover a lot of material.
  • A lot of the content is fairly conceptual/broad/big-picture, we don’t go much into technical details on specific topics.
  • The pace of the course is fast.
  • There is a good bit of hands-on work/exercises that you need to do each week, which can be time-consuming.
  • This course comes with a good bit of team work.
  • You will learn skills that are likely useful for your research/work career.
  • You will be using tools like R, GitHub, Quarto and other useful tools.

Some feedback from past students

“I have found this course to be useful and relevant to my research interests. I think the mixture of the modules, additional resources, and coding exercises really helped strengthen my understanding of important data analysis concepts.”

“I have thoroughly enjoyed this class from start to finish, and I have gained a lot of knowledge, starting from Github to the entire data analysis workflow. However, I feel that we have covered a lot of material in just one semester.”

“I have learned so much from this class in what feels like such a short amount of time, and it has gotten me a lot more comfortable working with and interpreting my data and I now feel that I have a strong grasp on data analysis workflows. However, sometimes I feel like I’ve learned about so many new things that I feel dumb all over again because there is so much to take in when it comes to machine learning and I want the opportunity to explore each facet in greater detail.”

“The last half of the semester has seemed like a whirlwind. We have reviewed everything from modeling to training and even machine learning.”

Course Setup

This course is a fully online, asynchronous, cohort-based course. That means there are weekly deadlines, but other than the fixed deadlines, you can do the work whenever it is convenient for you.

All course materials are freely available online. We will make use of several freely available textbooks and other materials. All course materials are listed on the course website. We will use the R software for data analysis. We will also use a few other software tools. All are freely available.

This course is very hands-on. The weekly exercises (aka homework) are usually quite in-depth and also often time-consuming. Plan accordingly. For each exercise, I will provide detailed instructions that hopefully make it clear what you need to do. The materials provided on the course website are not meant to be memorized, but to be used to be able to do the hands-on activities, such as the exercises and the class project.

For more details on course logistics, see the other information in the General Information section of the course website.

Privacy/FERPA

The course is set up such that you work “in the open” by having a GitHub repository and accompanying website that showcases the work you are doing in this class as an online portfolio. This is a great way to showcase your work to potential employers. However, it also means that your work is publicly visible.

As a student, under the Family Educational Rights and Privacy Act (FERPA) you have the right to have all your student-related information kept private, including the fact that you are taking this class. Keeping everything private can be done using private GitHub repositories. But it means you won’t be able to create a public website/portfolio, and you can’t follow the general instructions in the exercises. Instead I’ll need to give you special instructions. If for some reason you want to keep the fact that you are taking this class a secret, and thus do not want to use public GitHub repositories/create a publicly accessible online portfolio please let me know. If you take this course and do not inform me that you want your attendance in this course to be kept private, I consider this as you consenting with producing public materials as part of this course, and thus make your participation in this course public. If you have any concerns about this, please contact me to discuss.

Grading

The grade will be made up as follows:

  • 20% quizzes
  • 30% exercises/homework
  • 10% participation/discussions
  • 40% a course long project, broken up into pieces.

The following grading scale will be used, final grades might be curved (upward, never down): A 93-100, A- 90-93, B+ 87-90, B 83-87, B- 80-83, C+ 77-80, C 73-77, C- 70-73, D 60-70, F < 60

More detailed descriptions of the different assessments is provided on the Assessments page.

Class Attendance, Make-up Policy

This class is online. You are expected to submit all assignments by their due dates. Excused misses of due dates are only provided by prior agreement with the instructor or for special reasons (medical, etc.).

Getting Help

If you have questions about any aspect of the course, please do not hesitate to ask for help. The course materials describe in detail the ways you can ask for help.

University Honor Code and Academic Honesty Policy

All academic work must meet the standards contained in A Culture of Honesty. All students are responsible to inform themselves about those standards before performing any academic work. More detailed information about academic honesty can be found at: http://www.uga.edu/honesty/

Discussions with your classmates and the instructor are encouraged. However, the final work should be your own.

Students with Disabilities

Students with disabilities who require reasonable accommodations in order to participate in course activities or meet course requirements should contact the instructor.

General Disclaimers

This syllabus is a general plan, deviations announced to the class by the instructor may be necessary.

Course Outline

For an outline of the course, please see the Course Schedule document.

More Details

The General Information section and the introductory unit of this course contains all the logistic details you need to know.