This course provides a survey of modern statistical approaches for analyzing and interpreting data commonly encountered in public health, biomedical sciences, and related areas. This is an applied, hands-on class. We will use real data (bring your own data if you have it) to learn different methods of analysis. We will discuss all the steps of a data analysis, including obtaining and cleaning data, exploratory and full analysis, and presentation of results.
We will discuss how to formulate scientifically solid questions for a given set of data, how to decide on the right method of analysis, how to implement the analysis in R, and how to present and communicate the results. We will cover statistical topics such as regression, tree-based models, cross-validation, bootstrapping, and model selection.
The main goal for this course is for you to learn the whole process of performing a data analysis project. This starts with identifying a suitable question-data pair, proceeds to getting, cleaning, and exploring the data, and culminates with fitting statistical models and producing materials that communicate what you did and found. A second goal is to introduce you to some modern analysis approaches that these days often go by the name of ‘Machine Learning’. Finally, a related goal is to introduce you to a set of tools that allow for a modern, reproducible workflow of your analysis.
The specific learning objectives that I hope you will achieve by going through this course are:
Define meaningful data analysis questions and assess the feasibility of answering these questions with the available data.
Be able to obtain, organize and process data for detailed analysis.
Be knowledgeable about different data analysis methods and select the appropriate approach for a given project based on data and question.
Efficiently communicate results from data analyses to a variety of stakeholders.
Use modern coding and analysis tools to implement automated, reproducible analysis and project management workflows.
Develop skills to critically assess your own and others’ analyses and conclusions.
Here is a non-exhaustive list of topics that this course does and does not cover.
tidyverse for data processing and tidymodels for fitting models.
The formal prerequisite for the course is BIOS 7010. Knowledge of material from BIOS 7010 is assumed. If a lack of prerequisites prevents you from enrolling in this course, please contact me to get permission to enroll.
This is a quantitative course. We will not discuss the mathematical details of specific data analysis approaches. However, some statistical background and being comfortable with quantitative thinking are needed. Knowledge of statistics at the level of fitting linear or logistic models to data (e.g., as obtained in our BIOS 7010 and 7020 courses) is assumed. Some R coding skills (e.g., as obtained in our EPID 7500 class) are helpful.
If you do not have any coding or statistics knowledge, you can still take the class, but you need to be prepared to spend extra time and effort to fill any gaps. This will be especially true for the R coding part. Some of you likely have previous R experience, while others might have little to none. If you are in the little-to-none category, expect to spend extra time getting up to speed. I believe it’s doable and worth the effort, but you need to be prepared for it. There will be plenty of help from me, your classmates, and the internet if you end up getting stuck with some of the coding, but your effort and commitment are still required.
To re-emphasize: For those among you who have not used R or any other programming language before, this course will be time-consuming. Budget your time accordingly and plan ahead! If you do, I’m fairly certain you will find it worth it. If you are not able or willing to allot the time needed to learn enough R (and GitHub) to make things work, this course might not be ideal for you.
Here are a few more pointers to what to expect from the course, and comments from previous students (so you can hear it from them, not just from me).
Quarto and other useful tools.
“I have found this course to be useful and relevant to my research interests. I think the mixture of the modules, additional resources, and coding exercises really helped strengthen my understanding of important data analysis concepts.”
“I have thoroughly enjoyed this class from start to finish, and I have gained a lot of knowledge, starting from Github to the entire data analysis workflow. However, I feel that we have covered a lot of material in just one semester.”
“I have learned so much from this class in what feels like such a short amount of time, and it has gotten me a lot more comfortable working with and interpreting my data and I now feel that I have a strong grasp on data analysis workflows. However, sometimes I feel like I’ve learned about so many new things that I feel dumb all over again because there is so much to take in when it comes to machine learning and I want the opportunity to explore each facet in greater detail.”
“The last half of the semester has seemed like a whirlwind. We have reviewed everything from modeling to training and even machine learning.”
This course is a fully online, asynchronous, cohort-based course. That means there are weekly deadlines, but other than the fixed deadlines, you can do the work whenever it is convenient for you.
All course materials are freely available online. We will make use of several freely available textbooks and other materials. All course materials are listed on the course website. We will use the R software for data analysis. We will also use a few other software tools. All are freely available.
This course is very hands-on. The weekly exercises (aka homework) are usually quite in-depth and often time-consuming. Plan accordingly. For each exercise, I will provide detailed instructions that hopefully make it clear what you need to do. The materials provided on the course website are not meant to be memorized, but to serve as a reference while you work through the hands-on activities, such as the exercises and the class project.
For more details on course logistics, see the other information in the General Information section of the course website.
The grade will be made up as follows:
The following grading scale will be used; final grades might be curved (upward, never down): A 93-100, A- 90-93, B+ 87-90, B 83-87, B- 80-83, C+ 77-80, C 73-77, C- 70-73, D 60-70, F < 60
More detailed descriptions of the different assessments are provided on the Assessments page.
This class is online. You are expected to submit all assignments by their due dates. Missed due dates are excused only by prior agreement with the instructor or for special reasons (medical, etc.).
If you have questions about any aspect of the course, please do not hesitate to ask for help. The course materials describe in detail the ways you can ask for help.
All academic work must meet the standards contained in A Culture of Honesty. All students are responsible for informing themselves about those standards before performing any academic work. More detailed information about academic honesty can be found at: http://www.uga.edu/honesty/
Discussions with your classmates and the instructor are encouraged. However, the final work should be your own.
Students with disabilities who require reasonable accommodations in order to participate in course activities or meet course requirements should contact the instructor.
This syllabus is a general plan; deviations announced to the class by the instructor may be necessary.
For an outline of the course, please see the Course Schedule document.
The General Information section and the introductory unit of this course contain all the logistical details you need to know.