I love it when a Plot comes together

1. We’ve talked a lot about file paths, but they are a tricky concept. Let’s say I had the following directory structure:

Home
|
|- My_Projects
  |
  |- Downloads
  |
  |- Fun_Homework
    |
    |- Data
    |- R
    |- Figures
  |
  |- Final_Project
    |
    |- Data
    |- R
    |- Figures
    |- Paper
|
|- My_Documents
  |
  |- Random_Paper
    |
    |- Odd_Data
|
|- My_Images
  |
  |- Hey_more_data

So - one top-level directory (Home), 4 subdirectories, and multiple directories branching off under those. For the next few questions, write code when asked. In your RMarkdown, you can add the chunk option eval=FALSE so that a code block is shown but not evaluated.

1A. You are working on your homework, and have saved your script file in the R directory of Fun_Homework. Because homework is fun. Tell me two ways to set your working directory to be where that script file is located. (Hint: one is scripted, one is not)

1B. Let’s say you have a file foo.csv that you want to load for your homework. You’ve downloaded it to Downloads. How would you read in the CSV without moving the file?

1C. How would you load the CSV after moving the file to the Data directory of Fun_Homework?

1D. You realized that you want to use a data file from an old project, Random_Paper, stored in its Odd_Data directory. It’s called bizbaz.csv. How would you load it?
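As a reminder of how relative and absolute paths behave, here is a minimal sketch. It rebuilds part of the tree inside tempdir() so it can run anywhere, and it assumes Downloads sits inside My_Projects exactly as drawn above. The contents of foo.csv are made up for illustration.

```r
# Build a throwaway copy of part of the tree under tempdir() (layout as drawn above).
home <- file.path(tempdir(), "Home")
dirs <- c("My_Projects/Downloads",
          "My_Projects/Fun_Homework/Data",
          "My_Projects/Fun_Homework/R",
          "My_Documents/Random_Paper/Odd_Data")
for (d in dirs) {
  dir.create(file.path(home, d), recursive = TRUE, showWarnings = FALSE)
}

# Put a tiny made-up CSV in Downloads so there is something to read.
write.csv(data.frame(x = 1:3),
          file.path(home, "My_Projects", "Downloads", "foo.csv"),
          row.names = FALSE)

# Work from the R directory of Fun_Homework; setwd() returns the old directory.
old_wd <- setwd(file.path(home, "My_Projects", "Fun_Homework", "R"))

# Relative path: each ".." climbs one level (R -> Fun_Homework -> My_Projects),
# then the path descends into Downloads.
foo <- read.csv("../../Downloads/foo.csv")

# Absolute path: reaches the same file from any working directory.
foo_abs <- read.csv(file.path(home, "My_Projects", "Downloads", "foo.csv"))

setwd(old_wd)  # restore the previous working directory
```

The relative version keeps working if the whole Home folder moves to another machine. The absolute version keeps working if the script moves but the data does not.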

2. We’ve spent a lot of time with the plankton data set, and learned much. Let’s learn a bit more!

2A. Load up the plankton data, and generate a scatterplot of the relationship between Chlorophytes and TotalChlA. Is there more Chlorophyll when there are more Chlorophytes?

2B. While we have been looking at salinity a lot, we also know that there are other processes that covary with salinity as one moves up the estuary, such as vegetation, and the community of Zooplankton. Perhaps distance from mouth - Distance - is a better predictor of changes in relationships? Can you see any pattern of how distance alters this relationship (from 2A) by coloring the points by Distance? Use something other than the default color scale.

2C. Or - what if we made a discrete variable out of distance using cut_interval and used facet_wrap to see its influence? What patterns do you see?
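If cut_interval is new to you, the sketch below shows the mechanics on a small made-up data frame (not the plankton data). The column names x, y, and Distance_group are placeholders chosen for illustration.

```r
library(ggplot2)

# Made-up data, NOT the plankton data set.
toy <- data.frame(
  Distance = c(0, 4, 9, 13, 18, 22, 27, 30),
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2, 3, 5, 4, 7, 8, 8, 10)
)

# cut_interval() splits a numeric vector into n bins of equal width
# and returns a factor, which facet_wrap() can use directly.
toy$Distance_group <- cut_interval(toy$Distance, n = 3)

# One panel per distance bin.
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ Distance_group)
```

Printing p draws the panels. The number of bins is a judgment call, so it is worth trying a few values of n.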

2D. As the estuary was sampled at times of year when temperature varied, and distance from mouth might have a different effect under cold vs. warm temperatures, let’s look at both variables together as facets. What do you see if you create a discrete variable from Temp using cut_interval and then make a facet_grid plot?

2E. Last, are your answers from A-D made clearer or not by changing the scale of the x and y axes with log10 or any other transformation (see the handout)? Why does this transformation help, or why doesn’t it?

3. We can also use factors as something to plot on the x axis.

3A. Make a plot using geom_point where your distance groups from question 2 are on the x axis and TotalChlA is on the y.

3B. Wow. That’s a lot of overlap in points! There are a few ways to resolve that so things look better. One is to use the alpha argument in geom_point. This controls point transparency. Try setting alpha=0.5. Does that help?

3C. Another way to clear things up is a different geom. What if we added a little random noise to the x and y position of each point? Instead of geom_point, try geom_jitter. What do you see? How does this differ from 3A and 3B, if at all? Does adding an alpha here still help you?

3D. So, we talked on Tuesday about adding statistical information to help visualize patterns in the data. In ggplot these come from stat functions - just like geom functions. Let’s try stat_summary. We supply it with an argument to fun.data and it generates some summary information for us. Try using one of the plots from above with stat_summary, setting fun.data="median_hilow". What do you see? Try adjusting the color of stat_summary and the alpha level of your points to make it easier to see the trends in the data. FYI, if you are confused, look at the help file for stat_summary, including the sample code, for ideas.
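A minimal sketch of how the pieces from question 3 fit together, using made-up data rather than the plankton data. The names group and value are placeholders. Note that median_hilow relies on the Hmisc package, so drawing the plot assumes Hmisc is installed.

```r
library(ggplot2)

# Made-up data standing in for the distance groups and TotalChlA.
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("near", "far"), each = 20), levels = c("near", "far")),
  value = c(rnorm(20, mean = 5), rnorm(20, mean = 8))
)

p <- ggplot(toy, aes(x = group, y = value)) +
  # Jitter only sideways so the y values stay exact; alpha fades the raw points.
  geom_jitter(alpha = 0.3, width = 0.2, height = 0) +
  # Median plus lower and upper quantiles per group, drawn as a point with a range.
  stat_summary(fun.data = "median_hilow", color = "red")
```

Setting height = 0 is a design choice: it keeps the jitter from distorting the measured values while still separating overlapping points.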

4. Let’s make this plot look good! Choose one of the plots that you worked on in this assignment.

4A. Give it a title with ggtitle(). Change the x and y axis names with xlab() and ylab().

4B. Now, let’s make it look very nice. Install the ggthemes package and look through the theme options it gives you. Choose one, and apply it to your plot (e.g., by adding theme_bw(base_size=12)). Why did you choose this theme? What about it aids in your visualization?

4C. Extra credit - look at the theme help file. Customize your plot even more using theme() and justify your choices.
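For reference, here is a rough sketch of how titles, axis labels, a complete theme, and theme() tweaks stack onto a plot. The data and label text are made up. theme_bw() ships with ggplot2 itself, so this sketch does not need ggthemes, and the two theme() settings are just examples of what can be adjusted.

```r
library(ggplot2)

# Made-up data for illustration only.
toy <- data.frame(x = 1:6, y = c(2, 4, 3, 6, 7, 9))

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  ggtitle("A made-up title") +
  xlab("Made-up x label") +
  ylab("Made-up y label") +
  # A complete theme sets the overall look.
  theme_bw(base_size = 12) +
  # theme() then overrides individual elements; it must come after theme_bw().
  theme(plot.title = element_text(face = "bold"),
        panel.grid.minor = element_blank())
```

Order matters here: a complete theme added after theme() would overwrite the individual tweaks.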

5. What is your favorite data visualization? Grab a jpg of it and put it into this RMarkdown document (you’ll need to look at how to get images into RMarkdown documents, and you’ll need to email the image to us along with the homework so we can compile the document).

Now tell us why this is your favorite example of a data visualization.

Intro to Data Science for Biology

February 17, 2016