Putting it all together

You’ve had a whirlwind few weeks and now have some of the most powerful tools for data exploration at your disposal - dplyr and ggplot2. Today, I want us to begin to put it all together with a new data set. I want you to begin to think like data scientists, from start to finish, and see what you can ask of this data set. I’ll provide some guidance, but along the way I want you to begin asking your own questions.

Pesticide use in Kaua’i

I don’t know how many of you have ever been to Hawaii’s small island of Kaua’i. If you ever get a chance in your life, go. Wonderful beaches, kayaking tours in its rivers, waterfalls everywhere, and this little burger shack on the north shore that can’t be beat. This is not to mention the shaved ice.

But Kaua’i is also home to a lot of agriculture, much of it conventional. These farms use pesticides to manage the myriad tropical pests that try to consume their crops. Now, folks may not want to live next door to a pesticide-saturated farm, so the government mandates that farms report their pesticide use as a sort of good-neighbor reporting. The data can be found at http://healthdata.gov.

I’ve lightly reformatted the data for easier use, and today want you to begin asking questions about what we see in pesticide use in these farms.

So choose a partner (we’ll work in pairs today), and get started!

Also, for today, we’ll use an etherpad at https://etherpad.wikimedia.org/p/pesticides a little differently. Use this etherpad to document discoveries of things in the data, useful code, and questions or problems you run into with coding. We’ll check in with it throughout the class.

1) Start Your Workflow!

Before we even begin, build a project folder for today’s work, with appropriate sub-directories for code, data, outputs, etc. Download the data, and put it in the appropriate place. Fire up RStudio, start a new script or RMarkdown file, and save it to the proper directory. Set your working directory to the appropriate place.
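If you would rather build the folders from R than by hand, here is a minimal sketch. The folder names (`pesticide_lab`, `code`, `data`, `outputs`) are just my assumptions for illustration; use whatever layout you and your partner prefer.

```r
# Hypothetical project layout: rename these to suit your own habits.
# The project folder is created inside your current working directory.
project_dir <- "pesticide_lab"
sub_dirs <- c("code", "data", "outputs")

for (d in sub_dirs) {
  # recursive = TRUE also creates project_dir itself if it is missing
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Confirm that the sub-directories are there
list.files(project_dir)

# Then point R at the project, e.g.:
# setwd(project_dir)
```

In RStudio you can also set the working directory from the Session menu, which does the same thing as `setwd()`.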

2) Load the data and take a gander.

Now, just load the data. You can use read.csv with the usual options. You’ll know very quickly if you’re set up properly. If not, raise your hand and we’ll help you out. Make sure NAs are treated properly.

Check to make sure your data is loaded properly.
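As a rough sketch of what loading and checking might look like: the example below writes a tiny made-up CSV to a temporary file so that it runs on its own. The column names and values are invented for illustration and are not the real pesticide data; swap in the path to the file you downloaded, and adjust `na.strings` to however missing values are actually coded in it.

```r
# Toy stand-in for the real file: invented column names and values,
# written to a temporary file so this sketch is self-contained.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "farm,product,chemical,date,amount",
  "Farm A,Product X,chem1,2014-01-15,10.5",
  "Farm A,Product Y,chem2,2014-02-15,NA",
  "Farm B,Product X,chem1,2014-02-15,",
  "Farm B,Product Z,,2014-03-15,3.2"
), toy_path)

# Treat both the string "NA" and empty cells as missing
pesticides <- read.csv(toy_path,
                       na.strings = c("NA", ""),
                       stringsAsFactors = FALSE)

# Quick checks that the load went as expected
head(pesticides)            # first few rows
str(pesticides)             # column types
colSums(is.na(pesticides))  # number of missing values per column
```

If a column you expected to be numeric shows up as character in `str()`, that is often a sign that some missing-value code slipped past `na.strings`.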

Now, take a look at the data. What are the columns? What kind of data does each column contain? What are the ranges of dates? How many farms and farm owners are there? How many pesticide types? How many chemical types?
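Here is one way you might start answering those questions, sketched on a small invented data frame. The column names (`farm`, `product`, `chemical`, `date`, `amount`) and the date format are assumptions; check `names()` on the real data and adapt.

```r
# Invented toy data so the sketch runs on its own; not the real data set.
pesticides <- data.frame(
  farm     = c("Farm A", "Farm A", "Farm B", "Farm B"),
  product  = c("Product X", "Product Y", "Product X", "Product Z"),
  chemical = c("chem1", "chem2", "chem1", "chem3"),
  date     = c("2014-01-15", "2014-02-15", "2014-02-15", "2014-03-15"),
  amount   = c(10.5, 4.0, 7.1, 3.2),
  stringsAsFactors = FALSE
)

names(pesticides)  # what are the columns?
str(pesticides)    # what kind of data does each column hold?

# Dates often load as text; convert before asking for a range.
# (Assumes year-month-day text; the real file may need a format argument.)
pesticides$date <- as.Date(pesticides$date)
date_range <- range(pesticides$date, na.rm = TRUE)
date_range

# How many distinct farms, products, and chemicals?
n_unique <- sapply(pesticides[c("farm", "product", "chemical")],
                   function(x) length(unique(x)))
n_unique
```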

3) Formulate some questions

There is a LOT of information here. We have time, farms, products, chemicals… How do we make sense of it all? I want you and your partner to explore the data set a little. Given that the purpose of this data set is to understand pesticide deposition - and give specific information by producer, product, and even chemical type over time - what are some questions you might ask of this data set? Just brainstorm a few. As you go forward in this lab, you might come back and revise them a bit to make them more specific or more general. That’s a normal part of the process.

4) Dream of outputs

Given the questions you have come up with, sketch out with pencil and paper some possible graphs that would address the question you’re asking. Draw them in their most extreme form - with the data giving you a clear answer. What would they look like? What would be on your x and y axes? Would you be using facets, size, or color at all? How many graphs would you have to make to answer one question? Sometimes the answer is not 1!

Now choose one question that you think will be the most interesting!

5) How do you get there from here?

OK, you’ve decided on what the outputs should be. How do you generate them? Look back at the data. Summarise it, str it, maybe make a pairs plot or do a few other things to explore its properties. Then write out, in ordered steps, 1, 2, 3… what you would need to do, starting with the data as loaded, to generate those plots. Don’t worry about being specific or writing code. Just write it out in words. A general rule of workflow is to have the first few steps be about data manipulation, then data transformation, and then the actual plots.

Now put those all in as comments in your code file as a starting point.

6) Slice and dice

First, how do you have to break the data up? Are you filtering? Are you slicing? Are you aggregating? Are you making multiple different data sets to work with?

Walk through the first few steps of data processing that you have laid out in your code. What dplyr functions or other subsetting tricks will make the data sets you need to create your analysis? Refer to your dplyr cheat sheet and the data aggregation chapter in R4DS to help you out.

What is the new structure of your ready-to-analyze data sets?
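To make this concrete, here is a sketch of one possible filter-then-aggregate pipeline in dplyr. It is only an illustration on invented data with assumed column names (`farm`, `date`, `amount`); your own steps will depend on the question you chose.

```r
library(dplyr)

# Invented toy data so the sketch runs on its own; not the real data set.
pesticides <- data.frame(
  farm   = c("Farm A", "Farm A", "Farm A", "Farm B", "Farm B", "Farm B"),
  date   = as.Date(c("2014-01-10", "2014-01-20", "2014-02-05",
                     "2014-01-15", "2014-02-10", "2014-02-25")),
  amount = c(10, 5, NA, 2, 3, 4),
  stringsAsFactors = FALSE
)

farm_monthly <- pesticides %>%
  filter(!is.na(amount)) %>%                   # drop records with no amount
  mutate(month = format(date, "%Y-%m")) %>%    # e.g. "2014-01"
  group_by(farm, month) %>%                    # one group per farm per month
  summarise(total_amount = sum(amount),        # total applied in that month
            n_records = n()) %>%               # number of records summed
  ungroup()

# What is the new structure?
str(farm_monthly)
farm_monthly
```

Note that the farm-month with only a missing amount disappears entirely after the `filter()`. Whether that is what you want depends on your question, so it is worth deciding on purpose.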

7) Plot plot plot

Once you have the data sets as you’d like them, turn them into graphical outputs. What do you see? Do you need to look at just the raw data, or have you summarized things with means and variation? What metrics of variation are most useful in terms of visualizing your data? Does stat_summary help you out here?
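As one possible starting point, the sketch below overlays a mean and standard error from stat_summary on top of the raw points. The data are invented and the column names (`farm`, `amount`) are assumptions; mean and standard error are just one choice of summary, not necessarily the right one for your question.

```r
library(ggplot2)

# Invented toy data so the sketch runs on its own; not the real data set.
pesticides <- data.frame(
  farm   = rep(c("Farm A", "Farm B"), each = 4),
  amount = c(10, 12, 8, 11, 3, 5, 4, 6)
)

p <- ggplot(pesticides, aes(x = farm, y = amount)) +
  # raw data, jittered sideways only so overlapping points stay visible
  geom_point(alpha = 0.4,
             position = position_jitter(width = 0.1, height = 0)) +
  # mean plus/minus one standard error for each farm
  stat_summary(fun.data = mean_se, geom = "pointrange", color = "red") +
  labs(x = "Farm", y = "Amount applied (toy units)")

print(p)
```

Showing the raw points and the summary together lets you judge whether the mean is a fair description of what each farm is doing.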

8) Answer your question

All right! You’ve done it - taken a data set and gone from start to finish. Now, what does your visualization tell you about your question? What’s the answer? Did it match with your predictions? Or is the answer different and surprising? Raise your hand and let us know you’ve hit this point so that we can go over what you’ve found.

9) Rinse and repeat

Great, you have now taken a data set and unearthed some of its secrets! In so doing, you’ve learned about a number of steps along the way, and what the data does and does not contain. Now try a second question. This time, you should be able to work faster, create more compelling outputs, and generally have more clarity. What is the answer to your second question of interest?

Lab 5: Pest-Free Farms!

Intro to Data Science for Biology

March 2, 2016