Some of this will only fully make sense once we are a few weeks into the course. It’s nevertheless a good idea to read through it and get an overall idea before we have discussed all the different tools and details mentioned. Re-visit/re-read as needed.

Overview

A semester long data analysis project will be a major method of assessment for this class. The project should include all the different components of a data analysis we will cover in this class.

  • Finding a good data - question pair
  • Getting the data
  • Cleaning the data
  • Exploring the data
  • Processing the data
  • Analyzing the data
  • Reporting/communicating your findings
  • Doing everything reproducibly

You can do the class project either on your own or in a team of two.

Timeline

There are several deadlines throughout the course at which time you need to submit parts of your project

  • Part 1: Have a question/hypothesis & suitable data.
  • Part 2: Part 1 + exploratory analysis.
  • Part 3: Part 2 + start of statistical analysis.
  • Part 4: Part 3 + most of statistical analysis.
  • Part 5: Complete project, ready for peer review.
  • Part 6: Complete final project, updated based on reviewer feedback.

The deadlines are listed in the Schedule document (it’s unlikely, but they might change, so keep checking that document).

Project Requirements

You can pick your project topic and data source pretty much however you like. The only requirements are:

  • Real world (messy) data.
  • An interesting question.
  • An analysis that has not been done previously.
  • A fully reproducible and automated analysis using the tools we use in this class.

Data sources

You can use data from any source you like, here are some possible ones.

Your “own” data

If you have data that you are using for some research project(s) you are doing, you are welcome and encouraged to work on this as part of the class project. Of course, what you do for this class project needs to be new work, not a previously done and recycled analysis. Also, since the analysis needs to be fully reproducible, you need to provide the data at least within the class (no need to make it publicly available). I encourage you to use the class project toward helping you do an analysis and write a report that can help you with a project you want to publish as part of your research!

Publicly available data

You can use any data you can get access to. Since it needs to be reproducible, you need to be able to share the data at least with me and the classmates who will review your project. If you need some ideas for data, check out this website I maintain with various links to resources, among them is a list with links to various data sources. Of course you are not limited to data listed there.

Often, the most interesting questions can be asked by combining data from more than one source.

Logistics and formatting

Each assignment needs to be submitted in a fully reproducible form, using the tools we cover in the class (R, Quarto, GitHub, etc.). You should create a Github repository (using the template described below) which should contain all the files for your project. Name it YOURLASTNAME-MADA-project. The preferred type is a public repository, if you have data that requires to be kept confidential, you can make it a private repository. If you need to use a private repository, please let me know.

The main document should be a Quarto file, which can be turned into a suitable output format (html or word or pdf). This should have the structure of a scientific paper. I suggest you use the Manscript.qmd file of the template and replace the template content with your content. I also suggest having the output be a word document, since that is what most journals want for submission of a scientific paper. The template I provide is set up for word output. However, if you for instance decide to include interactive figures or apps (using e.g. Shiny) in your analysis project, you can also pick html as output. If for some reason you want/need pdf, that’s ok too (then you need a LaTeX system installed, see the Quarto documentation for how to do that).

Structure your project similar to the provided template (see below), with data, scripts, results and manuscript in different folders, various R scripts to perform different bits of the analysis, and a final Rmd file that pulls everything in and generates the report.

Use a setup that resembles a real research paper. A main manuscript file in Quarto format should contain text, the main results (figures/tables), references, etc. A supplementary Quarto file should contain additional results, such as some of your exploratory analysis findings. Any code and further results should be in additional Quarto files (with code either inside the Quarto file, or in separate R files). Overall, your write-up can be a bit more detailed than what would normally go into a peer reviewed paper, with the most salient parts shown in the main text, and the rest in a supplementary file.

For all your submissions, you need to provide everything needed (data, code, etc.) to allow a full and automated reproduction of your analysis.

References should be included as a bibtex file and cited in the Rmd file. See the template for an example.

Project template

I created a public Github repository called dataanalyis-template which is meant as a template for doing a data analysis project. It has different folders for organizing your project. Various readme files are provided to explain what each folder should contain. The template also contains several example files to show how the whole project workflow (or any data analysis workflow for that matter) can work.

Inside the manuscript folder, there is a Quarto template file called Manuscript.qmd with a suggested outline for the report write-up. I have also indicated which sections need to be completed/filled for which part of the project. This template is just a guide, you do not have to follow exactly that structure, as long as you provide all the requested parts for the the different submission deadlines and a final, complete, fully automated and reproducible project, with all files in one GitHub repository, at the end.

In addition to the Quarto file, there is a bibliography file in bibtex format which contains the references, and a style file which indicates reference formatting. You can use it as starting point. Feel free to switch the reference formatting style. You do need to use a setup and reference manager that plays nicely with Quarto.

Use the provided template as starting point for your project. To do so, go to its Github repository and follow the instructions to turn it into a new repository for your class project, call it YOURLASTNAME-MADA-project. Once you made the new repository, follow the usual Github workflow to get it to your local computer. Then open the readme file and change the text so it states somewhere “This is YOURNAME class project repository”.

Feedback and Assessment

You will receive feedback from me and/or your classmates after each submission.

  • I will provide feedback and assess parts 1,2 and 4, mainly checking to see that you deliver what you were supposed to and got the project to the requested stage.
  • Your classmates will provide peer-review feedback on parts 3 and 5. Details on that is provided in the Project Review document.
  • Part 6 is the final submission.

The initial submissions and peer reviews make up 40% of the project grade. The final submission counts for the remaining 60%.

More details on what exactly is expected and how it is assessed is provided in the Project Details document.

Project communications

Any communication regarding this project should happen in the Project discussion channel. Go there to ask project-specific questions, to post links to your repository whenever you have a part finished, etc. I will also post any further or clarifying information there.