Instructor: Jarrett Byrnes, PhD.
Email: jarrett.byrnes@umb.edu
Weekly Schedule:
Tuesday & Thursday 11-12:30
Office Hours: Prof. Byrnes will hold office hours Thursday from 2:00-3:30.


 

Overview: This course will cover the basic statistical knowledge necessary for a graduate student to design, execute, and analyze a basic research project.  The course aims to have students focus on thinking about the biological processes that they are studying in their research and how to translate them into statistical models.  The course will take a hands-on computational approach, teaching students the statistical programming language R.  In addition to teaching the fundamentals of data analysis, we will emphasize several key concepts of efficient computer programming that students can use in a variety of other areas outside of data analysis.

 

We will emphasize the underlying principle behind modern statistical analysis – that nearly every biological system can be described with a simple series of linear or nonlinear relationships with some meaningful error distribution around them.  Additionally, we will emphasize thinking about whole biological systems, causality, and the limits of inference that can be drawn from observational versus experimental studies.
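
As a minimal sketch of this principle, the R code below simulates a linear relationship with normally distributed error around it and then recovers the parameters by least squares. The variable names, the "true" intercept and slope, and the sample size are all illustrative assumptions, not course data.

```r
# Hypothetical example: every value below is made up for illustration.
set.seed(42)

n <- 200
true_intercept <- 1
true_slope <- 2

# Predictor, e.g. body size in arbitrary units
x <- runif(n, min = 0, max = 10)

# Response = linear relationship + normally distributed error
y <- true_intercept + true_slope * x + rnorm(n, mean = 0, sd = 1)

# Fit the linear model by least squares and extract the estimates
fit <- lm(y ~ x)
estimates <- coef(fit)
print(estimates)
```

The estimated intercept and slope should land close to the values used to simulate the data; how close depends on the sample size and the spread of the error term.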

 

The course will build through a series of topics.  We will begin with the basics: what data is, how we curate it, and how we visualize it efficiently.  We will move on to thinking about natural systems and sampling design to derive inferences about a single property within a system, such as the distribution of bird beak lengths or levels of gene expression.  We will then think about how we describe causal processes within a system and discuss the different techniques used to fit models that describe these causal processes.  From there, we will explore the role of experiments in deriving inferences about our study systems.  Next, we will cover how to construct and evaluate statistical models of complex systems from either experimental or observational data.  We will end with a discussion of the comparison of multiple alternative hypotheses.

 

Objectives:

1)   To learn how to think about your study system and research question of interest in a systematic way in order to design an efficient sampling and experimental research program.

2)   To understand how to analyze collected data to derive the most information possible about your research questions.

3)   To gain the grounding needed to effectively collaborate with statistical experts.

4)   To feel sufficiently comfortable with the basic principles of statistical analysis to learn and implement techniques outside of the purview of this course.

 

Prerequisites: I will assume a basic knowledge of algebra and introductory calculus (although no calculus will be used).  Undergraduate courses in probability theory and computer science are useful, but not required.  Students who are new to programming should skim chapter 1 of Adler before beginning the course.

 

Required Texts:

Adler, J. (2009) R in a Nutshell: A Desktop Quick Reference. O’Reilly Media. http://shop.oreilly.com/product/9780596801717.do

Silver, N. (2012) The Signal and the Noise. The Penguin Press. http://www.amazon.com/dp/B007V65R54/

Whitlock, M.C. and Schluter, D. (2014) The Analysis of Biological Data, Second Edition. Roberts and Company Publishers. http://www.amazon.com/Analysis-Biological-Data-Second-Edition/dp/1936221489

Recommended Texts:

I will be drawing on examples and materials from a few other sources.  They include wonderful examples of R code in the context of data analysis.  You are not required to have these, but you will either find them useful in this course or in future endeavors.

Bolker, B. (2008) Ecological Models and Data in R. Princeton University Press. http://www.amazon.com/Ecological-Models-Data-Benjamin-Bolker/dp/0691125228

Matloff, N. (2011) The Art of R Programming: A Tour of Statistical Software Design. No Starch Press. http://nostarch.com/artofr.htm

Qian, S.S. (2009) Environmental and Ecological Statistics with R. Chapman and Hall/CRC Press, London. http://www.amazon.com/Environmental-Ecological-Statistics-Chapman-Applied-ebook/dp/B005H6YDPU

 

Software

 

Content and teaching approach:  The course will be a mixture of lecture and hands-on data analysis lab.  Students will be expected to have a computer available during the course so that they can follow examples and attempt in-class problems.

 

Grading: Your grade will be determined by a combination of weekly homework, a course blog, a midterm exam, and a final paper. Homework will consist of weekly problem sets and will be worth 40% of your course grade. Reflections on the course blog will be worth 10%. The midterm exam will be take-home and worth 20%. The final paper will be worth 30%. Additionally, students may earn extra credit for preparing their final paper for submission to a journal.
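
To make the arithmetic concrete, here is a small sketch in R of how the weights above combine into a course grade. The scores are hypothetical, and extra credit is not modeled.

```r
# Weights taken from the grading policy above
weights <- c(homework = 0.40, blog = 0.10, midterm = 0.20, final_paper = 0.30)

# Hypothetical scores (0-100) for one student; not real data
scores <- c(homework = 90, blog = 100, midterm = 80, final_paper = 85)

# Sanity check: the weights should sum to 1
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Weighted average, matching components by name
course_grade <- sum(weights * scores[names(weights)])
print(course_grade)
```

With these made-up scores the weighted average is 36 + 10 + 16 + 25.5 = 87.5.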

 

Homework: All homework done using R should be turned in as a formatted PDF document using the knitr library (http://yihui.name/knitr/). I will conduct a short tutorial in class. I encourage you to use markdown for formatting. Visit the library webpage for an additional demo, and if you are using RStudio, see http://yihui.name/knitr/demo/rstudio/ and https://support.rstudio.com/hc/en-us/articles/200552086-Using-R-Markdown. Note – all slides will be written using knitr and slidify, and code will be made available as an example.

 

Course Blog: As part of this course, I want you to think beyond the immediate techniques I’m teaching you to larger issues about how we use statistics. Toward that end, I want you to write a few reflective pieces for a course blog at http://learningdata.wordpress.com as well as comment on what your classmates are thinking. You are required to post three times during the course of the semester. Sign up for dates and learn how to post over at http://goo.gl/P0L8ls. I will also be noting responses, and would ask that you respond to at least three posts. More posting and more comments are welcome, but the minimum is three for each. To spur some thoughts, I ask that each week you read one chapter of Nate Silver’s book on prediction, The Signal and the Noise. You are not required to discuss this book at all in your posts, but I hope it will get some of your thoughts going beyond just textbook materials.

 

Final Paper: The final paper will be an analysis of a topic of your choosing. This could be an opportunity for you to analyze and write up your own data.  It could be an opportunity for you to mine data from various public sources – online data repositories, sensor networks, NASA’s data archive, etc. – that are relevant to your research.  Look at this as an opportunity to contribute to your thesis.  Papers are to be fully written up in an academic journal style (intro, methods, results, discussion, etc.).  Topics must be approved by week 9, or final papers will not be accepted.  Each student will give a short (10 min) presentation on the final day of class. If a project is large enough in scope to warrant working in groups, I will consider it. I will retroactively increase students’ grades if their analysis is used in a paper submitted for publication in the following semester (e.g., from a B- to an A, or a B to B+).

Course Content:

While the topics covered are broad, each week will feature different examples from genetics, ecology, and molecular and evolutionary biology, highlighting uses of each individual set of techniques.

Week 1: Introduction to thinking about data in a computational framework
Readings: W&S 3-4, Adler 2-3, Silver 1
Lectures and Code:

Homework: Homework 1

Week 2: Sampling and Simulation: writing loops to generate simulated landscapes, sample size and natural variability
Readings: W&S 2,4, Silver 2
Lectures and Code:

  •  Sampling Lecture Handout
  • Sampling Lecture R Code
  • Simulation Lecture Handout
  • Simulation Lecture R Code
  • Desert Bird Census Data
  • Homework: Homework 2

    Week 3: Sample Variation & Data Visualization: variation in estimates of biological processes, Tufte and other principles of good data visualization
    Readings: W&S 5-6, Adler 6, Wickham’s Layered Grammar of Graphics, Wickham on Boxplots, A Survey of R Graphics by Michael Driscoll
    Lectures and Code:

  •  Data Viz Lecture Handout
  • Data Viz Lecture R Code
  • Lake Baikal Plankton Data
  •  Variation in Sample Estimates Lecture Handout
  • Sample Variation Lecture R Code
  • Homework: Homework 3

    Week 4: Probability, Hypothesis Testing, P-Values, and Power
    Readings: W&S 7-8, 10-11, Adler 9, Silver 3, Confirmation v. Falsification, A Bestiary of Distributions
    Lectures and Code:

  •  Probability & P-Values Lecture Handout
  • Probability & P-Values Lecture R Code
  •  Neyman-Pearson Hypothesis Testing & Power via Simulation Lecture Handout
  • N-P NHST & Power Lecture R Code
  • Homework: Homework 4

    Week 5: Writing Functions to Test Hypotheses & Power: automating analyses via functions, t-tests and beer quality, power analysis via simulation, Controversies in biological application of Neyman-Pearson hypothesis tests
    Readings: W&S 12-13, Hurlbert and Lombardi 2009, Silver 4
    Lectures and Code:

  •  Functions & Hypothesis Testing
  •  Code for Functions & Hypothesis Testing
  •  The T and Chi Square Tests Handout
  • Homework: Homework 5

    Week 6: Fitting Linear Models: Least Squares approaches to evaluating inbreeding depression and allometric relationships
    Readings: W&S 16-17, Silver 5
    Lectures and Code:

  •  Correlation and Regression Handout
  • Correlation and Regression R Code
  • Pufferfish Data
  • Wolf Inbreeding Data
  • Testing Regressions Handout
  • Testing Regressions R Code
  • Homework:
    Homework 6
    Data for Homework 6

    Week 7: Fitting Linear Models: Likelihood – evaluation of survivorship along an environmental gradient
    Project Proposals Due
    Readings: W&S 20, Bolker 2012, Silver 6, install emdbook and bbmle libraries
    Suggested Readings: Bolker Book Ch 6
    Lectures and Code:

  • Iteration & Likelihood Handout
  • Iteration & Likelihood R Code
  • Bee Lifespan Data
  • Fitting & Evaluating Likelihood Models Handout
  • Fitting & Evaluating Likelihood Models R Code
  • Mid-term Handed out: Exam

    Week 8: Fitting Linear Models: Bayes – Bayes theorem, incorporating prior information, false positive rates in medical testing, MCMC approaches
    Readings: Ellison 1996, Silver 7
    Lectures and Code:

  • Introduction to Bayesian Statistics Handout
  • Introduction to Bayesian Statistics R Code
  • Linear Modeling with Bayesian Statistics Handout
  • Linear Modeling with Bayesian Statistics R Code
  • No homework

    Week 9: Generalized Linear Models – incorporating nature’s inherent non-normality into models, kelp-urchin interactions
    Readings: O’Hara 2009, O’Hara and Kotze 2010, Wharton and Hui 2011, Silver 8
    Mid-Term Due
    Lectures and Code:

  • Nonlinear Modeling Handout
  • Nonlinear Modeling R Code
  • Kelp Holdfast Data
  • Generalized Linear Modeling Handout
  • Generalized Linear Modeling R Code
  • Logistic Regression Handout
  • Logistic Regression R Code
  • Cryptosporidium Data
  • Seed Predation Data
  • No Homework

    Week 10: Experiments & the Linear Model (ANOVA) – randomization of treatments, gene expression and mental disorders, predation resistance, non-parametric comparison of groups
    Readings: W&S 14-15, 18, Wickham on plyr, Silver 9
    Lectures and Code:

  • ANOVA Part 1 Handout
  • ANOVA Part 1 R Code
  • Gene Expression Disorder Data
  • Daphnia Resistance Data
  • ANOVA Part 2 Handout
  • ANOVA Part 2 R Code
  • Giant Kelp frond and holdfast data
  • Homework:

    Week 11: Multiple Predictors & Model Selection (AIC) – model weights, alternative nested and non-nested models of biological processes in plankton communities
    Readings: Hobbs and Hilborn 2001, Symonds and Moussalli 2010, Ecology Special Section on P Values, Silver 10
    Lectures and Code:

  • Multiple Regression
  • Multiple Regression R Code
  • Fire Recovery data
  • West Nile Virus data
  • Information Theoretic Approaches Handout
  • Information Theoretic Approaches R Code
  • Baikal data
  • Homework: install the car, visreg, QuantPsyc, AICcmodavg, and glmulti libraries.

    Week 12: Interactions, Covariates, and Experiments – pseudoreplication, Neanderthal morphometrics, fire damage severity in Southern California
    Readings: Hurlbert 1984, W&S 18, Silver 11
    Lectures and Code:

  • Multiway ANOVA Handout
  • Multiway ANOVA R Code
  • Zooplankton Predation data
  • Bee Gene Expression data
  • Interaction Effects Handout
  • Interaction Effects R Code
  • Useful code for 3D Plotting
  • Intertidal Algae Experiment data
  • Homework:

    Week 13: Hierarchical Models and Nesting – split plot experiments, modeling population variation, disease infection rates, introduction to time series analysis with plankton
    Readings: Schielzeth and Nakagawa 2012, Schielzeth and Nakagawa Appendix, Bolker et al 2009, Silver 12
    Writings on visualization: Visualizing Mixed Models part 1, Visualizing Mixed Models part 2, sjPlot, Random regression coefficients using lme4, Making mixed model plots look fancy, R2 for mixed models (from Jon Lefcheck)

    Lectures and Code:

  • ANCOVA and Mixed Model Handout
  • ANCOVA and Mixed Model R Code
  • Nested Plant Growth Experiment data
  • Neanderthal Brain Size data
  • Beach Invertebrate Survey data
  • Hierarchical Mixed Model Handout
  • Hierarchical Mixed Model R Code

    Week 14: Confounding Variables – isolating causal links in biological networks of interactions, introduction to graph theoretic approaches to inferences
    Readings: Cottingham, Readings from Pearl, Silver 13
    Lectures and Code:

    Week 15: Final Presentations


    Things you need: A large amount of computer programming will be necessary to successfully complete the course, so students will need easy access to computers running R (or with administrative access to download R), which is free, open-source software, and some form of spreadsheet software (Microsoft Excel, Open Office, etc.). We will learn how to load R and R packages in the class. Ideally, students will start the class with a general idea of their project system or an ecosystem of interest (e.g., studying insects in salt marshes, experimentally driven levels of gene expression, patterns of biodiversity across a bathymetric gradient, yeast reproductive rates, etc.) as there will be opportunities for students to use their own data for course credit.

     

    Code of Conduct and Academic Integrity: It is the expressed policy of the University that every aspect of academic life–not only formal coursework situations, but all relationships and interactions connected to the educational process–shall be conducted in an absolutely and uncompromisingly honest manner. The University presupposes that any submission of work for academic credit is the student’s own and is in compliance with University policies, including its policies on appropriate citation and plagiarism. These policies are spelled out in the Code of Student Conduct. Students are required to adhere to the Code of Student Conduct, including requirements for academic honesty, as delineated in the University of Massachusetts Boston Graduate Catalogue and relevant program student handbook(s). http://www.umb.edu/life_on_campus/policies/code

    You are encouraged to visit and review the UMass website on Correct Citation and Avoiding Plagiarism: http://umb.libguides.com/citations

     

    Accommodations: The University of Massachusetts Boston is committed to providing reasonable academic accommodations for all students with disabilities. This syllabus is available in alternate format upon request. If you have a disability and feel you will need accommodations in this course, please contact the Ross Center for Disability Services, Campus Center, Upper Level, Room 211 at 617.287.7430. http://www.umb.edu/academics/vpass/disability/ After registration with the Ross Center, a student should present and discuss the accommodations with the professor. Although a student can request accommodations at any time, we recommend that students inform the professor of the need for accommodations by the end of the Drop/Add period to ensure that accommodations are available for the entirety of the course.

     

    Course notes: Slides and code for each lecture will be available on the course website before each lecture.

     

    Useful Online References for R

    R-Bloggers.  Read this daily. http://www.r-bloggers.com/

    John Verzani, “simpleR”, in PDF

    Patrick Burns, The R Inferno. “If you are using R and you think you’re in hell, this is a map for you.”

    Thomas Lumley, “R Fundamentals and Programming Techniques” (large PDF)

    A list of tutorials in R from universities around the world http://pairach.com/2012/02/26/r-tutorials-from-universities-around-the-world/

     

    Additional Books About R and Statistical Computing

    A fairly comprehensive list can be found at http://www.r-project.org/doc/bib/R-books.html.  Below, I highlight a few of my favorites that overlap and extend the material in this course:

    Benjamin M. Bolker. Ecological Models and Data in R. Princeton University Press, 2008. ISBN 978-0-691-12522-0. [ http://www.zoology.ufl.edu/bolker/emdbook/ ]

    Julian J. Faraway. Extending Linear Models with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Chapman & Hall/CRC, Boca Raton, FL, 2006. ISBN 1-584-88424-X. [ http://www.maths.bath.ac.uk/~jjf23/ELM/ ]

    John Fox and Sanford Weisberg. An R Companion to Applied Regression. Sage Publications, Thousand Oaks, CA, USA, second edition, 2011. ISBN 978-1-4129-7514-8. [ http://socserv.socsci.mcmaster.ca/jfox/Books/Companion/index.html ]

    M. Henry H. Stevens. A Primer of Ecology with R. Use R. Springer, 2009. ISBN 978-0-387-89881-0.

    Paul Teetor. R Cookbook. O’Reilly, first edition, 2011. ISBN 978-0-596-80915-7. [ http://oreilly.com/catalog/9780596809157 ]

    John Verzani. Using R for Introductory Statistics. Chapman & Hall/CRC, Boca Raton, FL, 2005. ISBN 1-584-88450-9. [ http://wiener.math.csi.cuny.edu/UsingR/ ]

    Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Use R. Springer, 2009. ISBN 978-0-98140-6.

     

    Journals to Keep an Eye On

    The Journal of Statistical Software. http://www.jstatsoft.org/

    Methods in Ecology and Evolution. http://www.methodsinecologyandevolution.org/

    The R Journal. http://journal.r-project.org/current.html