This course surveys modern statistical approaches for analyzing and interpreting data commonly encountered in public health, biomedical sciences, and related areas. This is an applied, hands-on class. We will use real data (bring your own data if you have it) to learn different methods of analysis. We will discuss all the steps of a data analysis, including obtaining and cleaning data, exploratory and full analysis, and presentation of results.

We will discuss how to formulate scientifically solid questions for a given set of data, how to decide on the right method of analysis, how to implement the analysis in R, and how to present and communicate the results. We will cover statistical topics such as regression, tree-based models, cross-validation, bootstrapping, and model selection.

The main goal for this course is for you to learn the whole process of performing a data analysis project. This starts with identifying a suitable question-data pair, proceeds to getting, cleaning and exploring the data, and culminates with fitting statistical models and producing materials that communicate what you did and found. A second goal is to introduce you to some modern analysis approaches that these days often go by the name of ‘Machine Learning’. Finally, a related goal is to introduce you to a set of tools that allow for a modern, reproducible workflow of your analysis.

The specific learning objectives that I hope you will achieve by going through this course are:

*Define meaningful data analysis questions and assess the
feasibility of answering these questions with the available
data.*

- Given a data set, define the questions that can be answered and formulate and implement suitable analytic approaches.
- Given a data analysis question, determine the type of data and analytic approach needed to answer it.

*Be able to obtain, organize and process data for detailed
analysis.*

- Know how to obtain data from a variety of different sources.
- Be knowledgeable about data types and standards and how to process them.
- Be able to organize and process data in a reproducible, automated and documented manner.
- Be able to thoughtfully and critically assess strengths and weaknesses of specific data sets and process the data appropriately.

*Be knowledgeable about different data analysis methods and select
the appropriate approach for a given project based on data and
question.*

- Critically compare and evaluate the strengths and weaknesses of different data analysis approaches.
- Judge the appropriateness of different approaches for specific questions and data sets and know how to apply an appropriate analytic approach.
- Design and implement successful data analyses using state-of-the-art analysis software to translate data to information and knowledge that leads to actionable insights.

*Efficiently communicate results from data analyses to a variety
of stakeholders.*

- Summarize analysis results in ways that provide actionable conclusions and that are easily understandable by different audiences, such as laypersons, decision makers, and expert colleagues.
- Assess the strengths and weaknesses of different formats for representing the results of data analyses.

*Use modern coding and analysis tools to implement automated,
reproducible analysis and project management workflows.*

- Explain the importance of workflow, project management, and reproducibility tools, and know how to use those tools.
- Be proficient in R coding to implement and execute a complete data analysis project in a reproducible and automated manner.
- Be comfortable using R and GitHub to do data analysis in a reproducible manner.
- Be able to quickly learn how to use new software and tools, figure out how to get help when stuck, and make it work for you.

*Develop skills to critically assess your own and others’ analyses
and conclusions.*

- Judge the usefulness and appropriateness of data analyses described in the primary research literature.
- Learn to *look over your shoulder* and critically assess what you are doing, what assumptions you make by doing certain things, and if and how you can justify these.

Here is a non-exhaustive list of topics that this course does and does not cover.

- How to set up an analysis workflow that is as reproducible and automated as possible.
- Getting, cleaning and processing messy real-world data.
- Data visualization.
- Modern tools for data analysis (e.g., R, Quarto, Git/GitHub).
- The `tidyverse` for data processing and `tidymodels` for fitting models.
- An introduction to some Machine Learning tools and techniques.

- Advanced visualization techniques using interactive tools such as R/Shiny.
- Dealing with “non-rectangular” data, such as time-series data, images, audio, complex -omics data, etc. (we won’t cover it, but you can use such data for your class project).
- Statistical tests and basic statistical modeling (linear and
generalized linear models). Some familiarity with those techniques is
assumed and they do show up a few times, but won’t be covered.

- How to code in R. We will use R, but this course doesn’t teach R. A student can learn the subject matter and pick up enough R at the same time, but this will require extra effort. In general, some basic familiarity with R or another programming language will be assumed.
- Anything in depth. This is a survey course and covers a lot of material, so we won’t be able to go into much depth for any topic. Resources are provided to allow anyone interested to go deeper on their own.

*The formal prerequisite for this course is BIOS 7010, and knowledge of
its material is assumed. If a lack of prerequisites prevents
you from enrolling in this course, please contact me to get permission
to enroll.*

This is a quantitative course. We will not discuss the mathematical details of specific data analysis approaches, but some statistical background and comfort with quantitative thinking are needed. Knowledge of statistics at the level of fitting linear or logistic models to data (e.g., as obtained in our BIOS 7010 and 7020 courses) is assumed. Some R coding skills (e.g., as obtained in our EPID 7500 class) are helpful.

If you do not have any coding or statistics knowledge, you can still
take the class, but you need to be prepared to spend extra time and
effort to fill any gaps. This will be especially true for the R coding
part. Some of you likely have previous R experience, while others might
have little to none. If you are in the *little to none* category,
expect to spend extra time getting up to speed. I believe it’s doable
and worth the effort, but you need to be prepared for it. There will be
plenty of help from me, your classmates, and the internet if you end up
getting stuck with some of the coding, but your effort and commitment
are still required.

**To re-emphasize: For those among you who have not used R or
any other programming language before, this course will be
time-consuming. Budget your time accordingly and plan ahead! If you do,
I’m fairly certain you will find it worth it. If you are not able or
willing to allot the time needed to learn enough R (and GitHub) to make
things work, this course might not be ideal for you.**

Here are a few more pointers on what to expect from the course, and comments from previous students (so you can hear it from them, not just from me).

- The course is a high-level survey course. We cover a lot of material.
- A lot of the content is fairly conceptual/broad/big-picture; we don’t go much into technical details on specific topics.
- The pace of the course is fast.
- There is a good bit of hands-on work/exercises that you need to do each week, which can be time-consuming.
- This course comes with a good bit of team work.
- You will learn skills that are likely useful for your research/work career.
- You will be using tools like `R`, `GitHub`, `Quarto` and other useful tools.

“I have found this course to be useful and relevant to my research interests. I think the mixture of the modules, additional resources, and coding exercises really helped strengthen my understanding of important data analysis concepts.”

“I have thoroughly enjoyed this class from start to finish, and I have gained a lot of knowledge, starting from Github to the entire data analysis workflow. However, I feel that we have covered a lot of material in just one semester.”

“I have learned so much from this class in what feels like such a short amount of time, and it has gotten me a lot more comfortable working with and interpreting my data and I now feel that I have a strong grasp on data analysis workflows. However, sometimes I feel like I’ve learned about so many new things that I feel dumb all over again because there is so much to take in when it comes to machine learning and I want the opportunity to explore each facet in greater detail.”

“The last half of the semester has seemed like a whirlwind. We have reviewed everything from modeling to training and even machine learning.”

This is a fully online, asynchronous, cohort-based course. There are weekly deadlines, but apart from those fixed deadlines you can do the work whenever it is convenient for you.

All course materials are freely available online. We will make use of several freely available textbooks and other materials. All course materials are listed on the course website. We will use the R software for data analysis. We will also use a few other software tools. All are freely available.

**This course is very hands-on. The weekly exercises (aka
homework) are usually quite in-depth and also often time-consuming. Plan
accordingly.** For each exercise, I will provide detailed
instructions that hopefully make it clear what you need to do. The
materials provided on the course website are not meant to be memorized;
they are there to help you do the hands-on activities, such as the
exercises and the class project.

For more details on course logistics, see the other information in
the *General Information* section of the course website.

The grade will be made up as follows:

- 20% quizzes
- 30% exercises/homework
- 10% participation/discussions
- 40% a course-long project, broken up into pieces.

The following grading scale will be used; final grades might be curved (upward, never down): A 93-100, A- 90-93, B+ 87-90, B 83-87, B- 80-83, C+ 77-80, C 73-77, C- 70-73, D 60-70, F < 60
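As a rough illustration of how the weights and the grading scale above combine, here is a minimal R sketch. The component scores are made up for the example. The rule that a score exactly on a boundary (e.g., 90) receives the higher grade is an assumption, since the scale above lists shared endpoints. Any upward curve is not modeled.

```r
# Hypothetical component scores on a 0-100 scale (illustrative only).
scores <- c(quizzes = 90, exercises = 85, participation = 100, project = 88)

# Weights taken from the grade breakdown above; they sum to 1.
weights <- c(quizzes = 0.20, exercises = 0.30, participation = 0.10, project = 0.40)

# Weighted course score; indexing by name keeps scores and weights aligned.
final_score <- sum(scores[names(weights)] * weights)

# Lower bound of each letter grade, in increasing order.
# ASSUMPTION: a score exactly on a boundary gets the higher grade.
lower_bounds  <- c(0, 60, 70, 73, 77, 80, 83, 87, 90, 93)
letter_grades <- c("F", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A")

# findInterval() returns the index of the left-closed interval containing the score.
final_letter <- letter_grades[findInterval(final_score, lower_bounds)]

print(final_score)
print(final_letter)
```

With these made-up scores, the weighted score is 18 + 25.5 + 10 + 35.2 = 88.7, which falls in the B+ range before any curve.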

More detailed descriptions of the different assessments are provided
on the *Assessments page*.

This class is online. You are expected to submit all assignments by their due dates. Missed due dates are excused only by prior agreement with the instructor or for special reasons (medical, etc.).

If you have questions about any aspect of the course, please do not hesitate to ask for help. The course materials describe in detail the ways you can ask for help.

All academic work must meet the standards contained in *A Culture
of Honesty*. All students are responsible for informing themselves about
those standards before performing any academic work. More detailed
information about academic honesty can be found at: http://www.uga.edu/honesty/

Discussions with your classmates and the instructor are encouraged. However, the final work should be your own.

Students with disabilities who require reasonable accommodations in order to participate in course activities or meet course requirements should contact the instructor.

This syllabus is a general plan; deviations announced to the class by the instructor may be necessary.

For an outline of the course, please see the *Course Schedule* document.

The *General Information* section and the introductory unit of
this course contain all the logistical details you need to know.