Instructor: Jarrett Byrnes, PhD.
Weekly Schedule: Tuesday & Thursday 11-12:30
Office Hours: Prof. Byrnes will hold office hours Thursday from 2:00-3:30.
Overview: This course will cover the basic statistical knowledge necessary for a graduate student to design, execute, and analyze a basic research project. The course aims to have students focus on thinking about the biological processes that they are studying in their research and how to translate them into statistical models. The course will take a hands-on computational approach, teaching students the statistical programming language R. In addition to teaching the fundamentals of data analysis, we will emphasize several key concepts of efficient computer programming that students can use in a variety of other areas outside of data analysis.
We will emphasize the underlying principle behind modern statistical analysis – that nearly every biological system can be described with a simple series of linear or nonlinear relationships with some meaningful error distribution around them. Additionally, we will emphasize thinking about whole biological systems, causality, and the limits of inference that can be drawn from observational versus experimental studies.
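As a rough illustration of this principle (a minimal sketch, not course material): the R code below simulates a response that depends linearly on a predictor with normally distributed error, then recovers the relationship with `lm()`. The variable names, sample size, intercept, slope, and error standard deviation are all invented for the example.

```r
# Illustrative only: every numeric value below is an assumption, not real data.
set.seed(42)                       # make the simulation reproducible

n <- 50                            # hypothetical sample size
x <- runif(n, min = 0, max = 10)   # hypothetical predictor values
error <- rnorm(n, mean = 0, sd = 1)  # normally distributed error

# The "true" relationship: a straight line plus error
y <- 2 + 0.5 * x + error

# Fit a linear model and extract the estimated intercept and slope
fit <- lm(y ~ x)
estimates <- coef(fit)
print(estimates)
```

The estimated slope should land near the simulated value of 0.5, with the remaining scatter attributed to the error term.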
The course will build through a series of topics. We will begin with the basics: what data is, how we curate it, and how we visualize it efficiently. We will then consider natural systems and sampling design in order to derive inferences about a single property within a system, such as the distribution of bird beak lengths or levels of gene expression. Next, we will think about how to describe causal processes within a system and discuss the different techniques used to fit models that describe these processes. From there, we will explore the role of experiments in deriving inferences about our study systems. We will then turn to how to construct and evaluate statistical models of complex systems from either experimental or observational data, and we will end with a discussion of the comparison of multiple alternative hypotheses.
1) To learn how to think about your study system and research question of interest in a systematic way in order to design an efficient sampling and experimental research program.
2) To understand how to analyze collected data to derive the most information possible about your research questions.
3) To gain the grounding needed to effectively collaborate with statistical experts.
4) To feel sufficiently comfortable with the basic principles of statistical analysis to learn and implement techniques outside of the purview of this course.
Prerequisites: I will assume a basic knowledge of algebra and introductory calculus (although no calculus will be used). Undergraduate courses in probability theory and computer science are useful, but not required. Students who are new to programming should skim chapter 1 of Adler before beginning the course.
Adler, J. (2009) R in a Nutshell: A Desktop Quick Reference. O’Reilly Media. http://shop.oreilly.com/product/9780596801717.do
Silver, N. (2012) The Signal and the Noise. The Penguin Press. http://www.amazon.com/dp/B007V65R54/
Whitlock, M.C. and Schluter, D. (2014) The Analysis of Biological Data, Second Edition. Roberts and Company Publishers. http://www.amazon.com/Analysis-Biological-Data-Second-Edition/dp/1936221489
I will be drawing on examples and materials from a few other sources. They include wonderful examples of R code in the context of data analysis. You are not required to have these, but you will find them useful either in this course or in future endeavors.
Bolker, B. (2009) Ecological Models and Data in R. Princeton University Press. http://www.amazon.com/Ecological-Models-Data-Benjamin-Bolker/dp/0691125228
Matloff, N. (2011) The Art of R Programming: A Tour of Statistical Software Design. No Starch Press. http://nostarch.com/artofr.htm
Qian, S.S. (2009) Environmental and Ecological Statistics with R. Chapman and Hall/CRC Press, London. http://www.amazon.com/Environmental-Ecological-Statistics-Chapman-Applied-ebook/dp/B005H6YDPU
Content and teaching approach: The course will be a mixture of lecture and hands-on data analysis lab. Students will be expected to have a computer available during the course so that they can follow examples and attempt in-class problems.
Grading: Your grade will be determined by a combination of weekly homework, a course blog, a midterm exam, and a final paper. Homework will consist of weekly problem sets and will be worth 40% of your course grade. Reflections on the course blog will be worth 10%. The midterm exam will be take-home and worth 20%. The final paper will be worth 30%. Additionally, students may earn extra credit for preparing their final paper for submission to a journal.
Homework: All homework done using R should be turned in as a formatted PDF document using the knitr library (http://yihui.name/knitr/). I will conduct a short tutorial in class. I encourage you to use markdown for formatting. Visit the library webpage for an additional demo; if you are using RStudio, see http://yihui.name/knitr/demo/rstudio/ and https://support.rstudio.com/hc/en-us/articles/200552086-Using-R-Markdown. Note – all slides will be written using knitr and slidify, and code will be made available as an example.
Course Blog: As part of this course, I want you to think beyond the immediate techniques I’m teaching you to larger issues about how we use statistics. Towards that end, I want you to write a few reflective pieces for a course blog at http://learningdata.wordpress.com as well as comment on what your classmates are thinking. You are required to post three times during the course of the semester. Sign up for dates and learn how to post over at http://goo.gl/P0L8ls. I will also be noting responses, and would ask that you respond to at least three posts. More posting and more comments are welcome, but the minimum is three for each. To spur some thoughts, I ask that each week you read one chapter of Nate Silver’s book on prediction, The Signal and the Noise. You are not required to discuss this book at all in your posts, but I hope it will get some of your thoughts going beyond just textbook materials.
Final Paper: The final paper will be an analysis of a topic of your choosing. This could be an opportunity for you to analyze and write up your own data. It could be an opportunity for you to mine data from various public sources – online data repositories, sensor networks, NASA’s data archive, etc. – that are relevant to your research. Look at this as an opportunity to contribute to your thesis. Papers are to be fully written up in an academic journal style (intro, methods, results, discussion, etc.). Topics must be approved by week 9, or final papers will not be accepted. Each student will give a short (10 min) presentation on the final day of class. If a project is large enough in scope to warrant working in groups, I will consider it. I will retroactively increase students’ grades if their analysis is used for the submission of a published paper in the following semester (e.g., from a B- to an A, or a B to B+).
While the topics covered are broad, each week will feature different examples from genetics, ecology, and molecular and evolutionary biology, highlighting uses of each individual set of techniques.
Week 1: Introduction to thinking about data in a computational framework
Homework: Homework 1
Week 2: Sampling and Simulation: writing loops to generate simulated landscapes, sample size and natural variability
Readings: W&S 2, 4, Silver 2
Homework: Homework 2
Week 3: Sample Variation & Data Visualization: variation in estimates of biological processes, Tufte and other principles of good data visualization
Homework: Homework 3
Week 5: Writing Functions to Test Hypotheses & Power: automating analyses via functions, t-tests and beer quality, power analysis via simulation, controversies in biological application of Neyman-Pearson hypothesis tests
Homework: Homework 5
Week 6: Fitting Linear Models: Least Squares approaches to evaluating inbreeding depression and allometric relationships
Week 7: Fitting Linear Models: Likelihood – evaluation of survivorship along an environmental gradient
Midterm: Exam handed out
Week 8: Fitting Linear Models: Bayes – Bayes theorem, incorporating prior information, false positive rates in medical testing, MCMC approaches
Week 9: Generalized Linear Models – incorporating nature’s inherent non-normality into models, kelp-urchin interactions
Week 10: Experiments & the Linear Model (ANOVA) – randomization of treatments, gene expression and mental disorders, predation resistance, non-parametric comparison of groups
Week 11: Multiple Predictors & Model Selection (AIC) – model weights, alternative nested and non-nested models of biological processes in plankton communities
Homework: Install the car, visreg, QuantPsyc, AICcmodavg, and glmulti libraries.
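One way to handle this installation step is sketched below in R. It checks which of the five libraries named above are missing before installing anything. The helper name `find_missing` is hypothetical, and the install line is left commented out because it downloads from CRAN and needs an internet connection.

```r
# The five libraries named in the homework line above
course_packages <- c("car", "visreg", "QuantPsyc", "AICcmodavg", "glmulti")

# Hypothetical helper: return the subset of pkgs not yet installed on this machine
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- find_missing(course_packages)
print(to_install)

# Uncomment to download the missing packages from CRAN (requires internet access):
# if (length(to_install) > 0) install.packages(to_install)
```

Once installed, each library is loaded in a session with `library()`, for example `library(car)`.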
Week 12: Interactions, Covariates, and Experiments – pseudoreplication, Neanderthal morphometrics, fire damage severity in Southern California
Week 13: Hierarchical Models and Nesting – split plot experiments, modeling population variation, disease infection rates, introduction to time series analysis with plankton
Week 14: Confounding Variables – isolating causal links in biological networks of interactions, introduction to graph theoretic approaches to inferences
Week 15: Final Presentations
Things you need: A large amount of computer programming will be necessary to successfully complete the course, so students will need easy access to computers running R (or with administrative access to download R), which is free, open-source software, and some form of spreadsheet software (Microsoft Excel, Open Office, etc.). We will learn how to load R and R packages in the class. Ideally, students will start the class with a general idea of their project system or an ecosystem of interest (e.g., studying insects in salt marshes, experimentally driven levels of gene expression, patterns of biodiversity across a bathymetric gradient, yeast reproductive rates, etc.) as there will be opportunities for students to use their own data for course credit.
Code of Conduct and Academic Integrity: It is the expressed policy of the University that every aspect of academic life–not only formal coursework situations, but all relationships and interactions connected to the educational process–shall be conducted in an absolutely and uncompromisingly honest manner. The University presupposes that any submission of work for academic credit is the student’s own and is in compliance with University policies, including its policies on appropriate citation and plagiarism. These policies are spelled out in the Code of Student Conduct. Students are required to adhere to the Code of Student Conduct, including requirements for academic honesty, as delineated in the University of Massachusetts Boston Graduate Catalogue and relevant program student handbook(s). http://www.umb.edu/life_on_campus/policies/code
You are encouraged to visit and review the UMass website on Correct Citation and Avoiding Plagiarism: http://umb.libguides.com/citations
Accommodations: The University of Massachusetts Boston is committed to providing reasonable academic accommodations for all students with disabilities. This syllabus is available in alternate format upon request. If you have a disability and feel you will need accommodations in this course, please contact the Ross Center for Disability Services, Campus Center, Upper Level, Room 211 at 617.287.7430. http://www.umb.edu/academics/vpass/disability/ After registration with the Ross Center, a student should present and discuss the accommodations with the professor. Although a student can request accommodations at any time, we recommend that students inform the professor of the need for accommodations by the end of the Drop/Add period to ensure that accommodations are available for the entirety of the course.
Course notes: Slides and code for each lecture will be available on the course website before each lecture.
Useful Online References for R
R-Bloggers. Read this daily. http://www.r-bloggers.com/
John Verzani, “simpleR”, in PDF
Patrick Burns, The R Inferno. “If you are using R and you think you’re in hell, this is a map for you.”
Thomas Lumley, “R Fundamentals and Programming Techniques” (large PDF)
A list of tutorials in R from universities around the world http://pairach.com/2012/02/26/r-tutorials-from-universities-around-the-world/
Additional Books About R and Statistical Computing
A fairly comprehensive list can be found at http://www.r-project.org/doc/bib/R-books.html. Below, I highlight a few of my favorites that overlap and extend the material in this course:
Julian J. Faraway. Extending Linear Models with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Chapman & Hall/CRC, Boca Raton, FL, 2006. ISBN 1-584-88424-X. [ http://www.maths.bath.ac.uk/~jjf23/ELM/ ]
John Fox and Sanford Weisberg. An R Companion to Applied Regression. Sage Publications, Thousand Oaks, CA, USA, second edition, 2011. ISBN 978-1-4129-7514-8. [ http://socserv.socsci.mcmaster.ca/jfox/Books/Companion/index.html ]
Paul Teetor. R Cookbook. O’Reilly, first edition, 2011. ISBN 978-0-596-80915-7. [ http://oreilly.com/catalog/9780596809157 ]
Journals to Keep an Eye On
The Journal of Statistical Software. http://www.jstatsoft.org/
Methods in Ecology and Evolution. http://www.methodsinecologyandevolution.org/
The R Journal. http://journal.r-project.org/current.html