Putting it all together

For your final project, I’m asking you to conduct an analysis of a full data set. I’d like you to write up the full project using RMarkdown so that I can see not only the answers to the questions you pose, but also how you answered them. Remember, I’m looking for quality: well-commented, well-thought-out, well-styled code, in addition to gorgeous visualizations that convey a message and unambiguous analyses (where possible). Before I dig into the sections you should use to organize your report, here’s my general rubric.

| Expectations | 4 (Exemplary) | 3 (Accomplished) | 2 (Developing) | 1 (Beginning) |
|---|---|---|---|---|
| Description of data | The student provides the source of the dataset. The description includes background on how the data were collected, with a focus on details of the data collection that would be relevant to how they answer their question (e.g., understanding of sampling design that may be relevant to meeting assumptions of statistical tests). The student provides exploratory summary statistics or visualisations that help the reader understand the scope, content, and coverage of the data. | The student provides the source of the dataset. The description includes background on how the data were collected, with a focus on details of the data collection that would be relevant to how they answer their question (e.g., understanding of sampling design that may be relevant to meeting assumptions of statistical tests). The student provides exploratory summary statistics and visualizations. | The student provides the source and brief description of the dataset. The student provides exploratory summary statistics or visualizations. | The student gives the name and source of the dataset and what type of data it contains. |
| Explanation and justification of question(s) | There is a single focal question that is testable given the data. There are additional questions that are subsets or follow-ups of the focal question. These questions are also testable given the data. The student has described the rationale behind the question, providing context for how they came up with this question. | There is a single focal question that is testable given the data. The student has described the rationale behind the question, providing context for how they came up with this question. | There is a single focal question that is testable given the data. The student has described the rationale behind the question. | There is a single focal question that relates to the data. The rationale for the question is unclear, however. |
| Description of workflow | The student provides a verbal description of the workflow used to answer their question. What steps did they take to answer their question? This can include everything from data tidying to visualization to analysis. Justification for the workflow is included in this description. | The student provides a verbal description of the workflow used to answer their question. What steps did they take to answer their question? This can include everything from data tidying to visualization to analysis. | The student provides a verbal description of the workflow of the analyses used to answer their question. | The student provides a broad-based verbal description of the workflow of their process, but is not able to break it down into specific steps. The description reads as an abstract rather than a concrete set of actions. |
| Selection and justification of statistical methods | The statistical methods used are appropriate and answer the question posed. A justification of statistical method choice includes a clear statement of the underlying model used, with a description of the data generating process and error generating process. The student has clearly described the assumptions of the test or tests that they used and provided support that these assumptions have been met (e.g., verbal descriptions, additional statistical tests, data visualisations). | The statistical methods used are appropriate and answer the question posed. The student has clearly described the assumptions of the test or tests that they used and provided support that these assumptions have been met (e.g., verbal descriptions, additional statistical tests, data visualisations). | The statistical methods used are appropriate and answer the question posed. | The student proposes statistical methods, but how they relate to the question being asked is unclear. |
| Code quality | The student adheres to an R style guide (e.g., http://adv-r.had.co.nz/Style.html). The code is easy to read and well commented, allowing an external reviewer to understand why certain steps are being done. This code is modular, with complex problems being broken down into small, human-readable, and logically discrete steps. Functions are used instead of repeated code chunks where appropriate. The report has been written in RMarkdown. | The code is easy to read and well commented, allowing an external reviewer to understand why certain steps are being done. This code is modular, with complex problems being broken down into small, human-readable, and logically discrete steps. Functions are used instead of repeated code chunks where appropriate. | The code is easy to read and well commented, allowing an external reviewer to understand why certain steps are being done. | The student has made an effort to make the code readable via good commenting practices. |
| Presentation of results | Data visualisations are clearly relevant to the questions being asked and models being tested. Figures are appropriately captioned and can be interpreted with minimal additional context (i.e., can stand alone). Each visualisation conveys information that is related to the questions described in the introduction. Axes are well labeled, legends are clear, and color schemes make key points easily understandable to the reader. Minutiae such as font sizes and visual aesthetics show clear attention to detail. | Data visualisations are clearly relevant to the questions being asked and models being tested. Figures are appropriately captioned and can be interpreted with minimal additional context (i.e., can stand alone). Each visualisation conveys information that is related to the questions described in the introduction. | Data visualisations are clearly relevant to the questions being asked and models being tested. Each visualisation conveys information that is related to the questions described in the introduction. | Visualizations convey information related to analyses and questions in a clear manner, but are difficult to interpret. |
| Discussion/evaluation of results | The conclusions are derived logically from the results and data visualisations. The student has examined and evaluated limitations in the analysis and has proposed ways to overcome these limitations. The student is able to synthesize multiple different results into a single strong conclusion. | The conclusions are derived logically from the results and data visualisations. The student has multiple conclusions drawn from analyses, but does not bring them together into a single strong point. | The conclusions are derived logically from the results and data visualisations. | The conclusions are derived from the results and data visualisations, but do not connect clearly and cleanly. Some analyses are ignored. |
| Quality of writing | Spelling and grammar count. Sections are written so that they can be clearly understood by the reader. The student’s prose flows cleanly and clearly. Sentences are complete, organized, and easy to understand. Clarity of communication is key. | Spelling and grammar count. Sections are written so that they can be clearly understood by the reader. Clarity of communication is key. | Spelling and grammar count. Sections are written so that they can be clearly understood by the reader. | Spelling and grammar are not great, but the writing is still clear. |

So, each section of the rubric is out of 4.

Note: Using libraries not taught in class will earn +1 point per library. Please note in your report when a library is new to you.

Organizational Structure of the Paper

The final paper should be broken up into something along the lines of the following sections. You may feel free to adapt this flexibly given your unique data set and set of problems and questions. But this is a general guide, particularly if you are lost.

1) Introduction

  • What is the dataset you are using?
      • Describe the dataset in detail
      • Visualize summary information about the data
  • Based on the data, what questions do you want to ask?
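
As a rough illustration of the kind of summary information an introduction might include, here is a minimal sketch in base R. It assumes the built-in `iris` data as a stand-in for your own data set; the variable names `group_counts` and `missing_per_column` are made up for this example.

```r
# Stand-in data: `iris` ships with base R; swap in your own data set.
data(iris)

# Scope: how many observations and variables, and of what types?
dim(iris)
str(iris)

# Content: a numeric summary of every column
summary(iris)

# Coverage: observations per group and missing values per column
group_counts <- table(iris$Species)
missing_per_column <- colSums(is.na(iris))
group_counts
missing_per_column

# One exploratory figure with labeled axes and a title
boxplot(Sepal.Length ~ Species, data = iris,
        xlab = "Species", ylab = "Sepal length (cm)",
        main = "Sepal length by species")
```

Which summaries matter depends on your question; the point is only that a reader should come away knowing the size, content, and coverage of the data before the questions are posed.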

2) Methods

  • Based on the data and question, what is the workflow you put together to answer your question? Feel free to write it out step-by-step.
  • How did you prepare and manipulate the data to get it ready for analysis?
  • What are the different steps you need to take? Why are they the right thing to do?
  • What are the visualizations and analyses that would answer your questions?
  • What are the underlying models you are going to use - what is your data generating process and your error generating process?
  • What statistical tests will you use, if any?
  • How will you validate your tests?
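
One way to make the data generating process and error generating process concrete is to write the model down, simulate from it, and check that the fitted model recovers what you put in. The sketch below is a hypothetical example, not a required method: it assumes a simple linear model with normal errors, and every parameter value (`beta0`, `beta1`, `sigma`, `n`) is invented for illustration.

```r
# Hypothetical example: simulate from a stated model, then fit that model.
# Data generating process:  y_i = beta0 + beta1 * x_i + e_i
# Error generating process: e_i ~ Normal(0, sigma^2), independent
set.seed(42)                      # make the simulation reproducible

n     <- 200                      # assumed sample size
beta0 <- 2                        # assumed intercept
beta1 <- 0.5                      # assumed slope
sigma <- 1                        # assumed residual standard deviation

x <- runif(n, min = 0, max = 10)
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
sim <- data.frame(x = x, y = y)

# Fit the model that matches the stated process
fit <- lm(y ~ x, data = sim)

# The estimates should land near the values used in the simulation
coef(fit)
confint(fit)
```

If your data call for a different error distribution (counts, proportions, and so on), the same exercise applies with a different simulation step and a different fitting function.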

3) Results

In this section, show the analyses and visualizations that answer your question, along with the code that generates them. For each analysis, be sure to walk through the entire process of model creation, evaluation of assumptions, and evaluation of any statistical tests. Feel free to show when and where you revise your models.
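
The sequence described above (create the model, evaluate assumptions, evaluate tests, revise) might look something like the following sketch. It assumes the built-in `mtcars` data and a simple linear regression purely as a stand-in; the quadratic term in the revision step is an arbitrary choice for illustration, not a recommendation.

```r
# Stand-in data: `mtcars` ships with base R.
data(mtcars)

# 1. Model creation: fuel economy as a function of car weight
fit <- lm(mpg ~ wt, data = mtcars)

# 2. Evaluation of assumptions: the standard diagnostic plots
#    (residuals vs fitted, normal Q-Q, scale-location, leverage)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

#    Shapiro-Wilk test as one additional check on residual normality
normality_check <- shapiro.test(residuals(fit))
normality_check

# 3. Evaluation of statistical tests
anova(fit)
summary(fit)

# 4. Revision: compare against a model with a quadratic weight term
fit_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)
model_comparison <- anova(fit, fit_quad)
model_comparison
```

Showing both the first model and the revised one, with the diagnostics that motivated the change, lets the reader follow why the final model was chosen.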

4) Discussion

  • What do your results say? Put it all together.
  • Do your results suggest additional visualizations/analyses? Feel free to conduct them here.
  • What final conclusion can you draw about your data set?