Hello! So, today we’re going to begin to code in R. We’re going to cover the basics of using R with a focus on data frame objects.
But let’s begin. To talk about what’s going on offline, use today’s etherpad at https://etherpad.wikimedia.org/p/r-intro-umb
Peering into your console, often the first source of fear and confusion is: what the heck do I do with this giant blank space? The easiest way to start thinking about R is as the world’s most advanced calculator. Seriously, try it out!
3+4
## [1] 7
Whoa! You can add! Wonderful. Let’s try a few other operations.
3-4
## [1] -1
4*5
## [1] 20
9/3
## [1] 3
2^7
## [1] 128
Yes, basic arithmetic is right there at your fingertips.
While arithmetic is cool (and necessary), we often want R to do more for us, such as calculating more interesting mathematical functions. One that we commonly use is the logarithm. Let’s say you want the natural log of 10.
log(10)
## [1] 2.302585
Great! Notice how that worked. We had log. Then two parentheses. Inside of which we had 10. log is a function. Functions in R are hugely powerful. They are the core of many things we do. Functions consist of a function name, those parentheses, and then inside of the parentheses one or more arguments separated by commas. Often these arguments have names. For example, what if we wanted to get the log of 10 in base 10?
log(10, base = 10)
## [1] 1
Now we’ve supplied two arguments.
We’re going to run into many different functions as we go forward. There are two ways to get help on how to use them and what arguments to supply. Let’s look at the help file for log:
?log
help("log")
Both of those do the same thing. Note that the help file lists the arguments in a particular order. If you supply arguments in that order, you don’t need to name them. But relying on order is bad practice (future you will be unhappy), as you may forget what each one means. Whenever possible, use named arguments.
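Here is a minimal sketch of positional versus named arguments, using only the log function from above. The variable names by_position, by_name, and swapped are made up for illustration.

```r
# Positional: R matches arguments by their order (x first, then base)
by_position <- log(100, 10)

# Named: the order no longer matters, and the intent is obvious
by_name <- log(x = 100, base = 10)
swapped <- log(base = 10, x = 100)

# All three give the same answer
by_position
by_name
swapped

# Swapping positional arguments silently changes the answer:
# this is the log of 10 in base 100
log(10, 100)
```

The last line runs without complaint, which is exactly why unnamed arguments can bite you later.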
Sometimes, we don’t know the name of a function we’re looking for. For example, in R, the function for arcsin is asin. But you have no way of knowing that. For that, we use ?? with quotes.
??"arcsine"
Notice this brings up a list of helpfiles that have the word arcsine in them. Bueno! You can now track down the right function.
It can be VERY easy to get lost in a sea of R code, not knowing what is going on. Fortunately, R provides something called comments. R ignores everything on a line after the comment character, which lets you write whatever love notes to yourself that you want to write. In R the comment character is #. For example:
#this is a comment
### This is also a comment
3+4 #hey, I commented after 3+4
## [1] 7
########################################
##### Oh, a comment box
##### that I can use to delineate
##### blocks of code
########################################
ALWAYS COMMENT THE HECK OUT OF YOUR CODE TO HELP OUT FUTURE YOU! I’ll include some comments in the code today to show you examples.
One of the great things about R is that you can save things as variables and use them later on. Some of them are there already. For example:
#This is Pi
pi
## [1] 3.141593
WHOA! PI IN R!
What if you wanted to make your own variable? Say, foo. And you wanted foo to always equal the square root of 2.
#let's create a variable foo
foo <- sqrt(2)
#what's inside of foo
foo
## [1] 1.414214
Note that the assignment operator is <- and not an = sign. Now, you can use =, but in R it’s generally considered bad practice, as = is also used to name arguments inside function calls (as in base = 10 earlier), and mixing the two uses gets confusing.
Now you have a variable that you can use from now on that is the square root of two! For example:
foo + 5
## [1] 6.414214
log(foo, base=2)
## [1] 0.5
foo^2
## [1] 2
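To make the two jobs of = concrete, here is a small sketch. The names root_two and also_root_two are hypothetical, chosen just for this example.

```r
# <- stores a value under a name
root_two <- sqrt(2)

# = does the same job here, but is discouraged for assignment
also_root_two = sqrt(2)

# Both names now hold the same value
identical(root_two, also_root_two)

# Inside a function call, = has a different job: it names an argument
log(root_two, base = 2)
```

Sticking to <- for assignment means that whenever you see =, you know you are looking at a function argument.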
OK, numbers are great, but there are other types of objects we’ll be dealing with in R. Primarily, we’re going to work with data frames, but let’s build up to a data frame, as it’s big and hairy.
First, are there other object types that have a single element to them beyond numbers? Well, yes! There are strings - words in quotes
"hello"
## [1] "hello"
Also, Booleans, which denote true and false
TRUE
## [1] TRUE
FALSE
## [1] FALSE
0==4
## [1] FALSE
4==4
## [1] TRUE
3 <= 4
## [1] TRUE
Note the different ways we made comparisons. These will become handy as you move on. Booleans behave like 1s and 0s in arithmetic (TRUE counts as 1 and FALSE as 0), so you can even do math with them
TRUE + TRUE + FALSE
## [1] 2
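As a rough sketch of why this is useful, you can store the result of a comparison and then count how many comparisons came out TRUE. The variable names here are invented for the example.

```r
# Comparisons return TRUE or FALSE, and the result can be stored
is_smaller   <- 3 < 4
is_equal     <- 3 == 4
is_different <- 3 != 4

# Converting to numbers shows the 1/0 behaviour
as.numeric(is_smaller)   # 1
as.numeric(is_equal)     # 0

# So adding comparisons counts how many of them are TRUE
is_smaller + is_equal + is_different   # 2
```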
OK, what about something for a missing value? For that, we have NA. This is quite important if, say, you have a missing value in your data set.
NA
## [1] NA
NA + 1
## [1] NA
No one likes an NA, and often we have to find ways around them!
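Two common ways around an NA, sketched with base R only: is.na() to detect missing values, and the na.rm argument that many summary functions accept. The numbers are arbitrary.

```r
# is.na() asks whether a value is missing
is.na(NA)
is.na(5)

# Comparing with == does not work: the answer is itself unknown
NA == NA

# One missing value poisons the whole sum...
sum(1, NA, 3)

# ...unless we ask R to remove missing values first
sum(1, NA, 3, na.rm = TRUE)
```

Note that na.rm quietly drops data, so it is worth knowing how many values are missing before you use it.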
Now that we have a few object types down, what if we have a bunch of them we want to work with together? Let’s start with what we call a vector.
What is a vector? A vector is a bunch of numbers (or other things) all in one single object that we can reference with an index. Think of it as a column in a spreadsheet. For example, let’s say I had a column containing all of the integers from 5 through 10, and I wanted to know the 2nd integer.
my_numbers <- c(5, 6, 7, 8, 9, 10)
my_numbers[2]
## [1] 6
Notice a few things. First, we created our vector with the function c. This function takes a sequence of values and puts them into a vector. These values can be numbers, strings, booleans, etc., although all of the values in a single vector end up as the same type. Second, notice that to reference the second value of the vector we used []. Specifically, we put [2] in. This means give me the second value in this object.
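A short sketch of a few more things you can do with c and []. The names first_and_third, all_but_first, and mixed are made up for illustration.

```r
my_numbers <- c(5, 6, 7, 8, 9, 10)

# How many values are in the vector?
length(my_numbers)

# An index can itself be a vector: the first and third values
first_and_third <- my_numbers[c(1, 3)]
first_and_third

# A negative index drops a value: everything except the first
all_but_first <- my_numbers[-1]
all_but_first

# Mixing types turns everything into one common type (here, strings)
mixed <- c(1, "two", TRUE)
mixed
class(mixed)
```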
Now you try it. Create a vector - any vector - and try pulling out single values. Do some math with them. Log transform them. Use your vector with an index as you would any other variable. For example
my_vector <- 1:100
my_vector[4] + my_vector[50]
## [1] 54
Oh! Notice the use of : there to get a long vector? Neat trick, no? There are other ways to generate vectors. Here are two:
First, a function to get a sequence with non-integer steps between numbers
seq(from = 1, to = 2, by=0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
Second, 10 random numbers between 0 and 100
runif(10, min = 0, max = 100)
## [1] 2.246729 83.211340 78.812395 69.149559 23.660027 94.247732 4.315947
## [8] 92.919745 20.097051 55.152500
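Two small variations on these, as a sketch. seq can take a number of values instead of a step size, and set.seed makes random draws repeatable. The seed value 42 and the variable names are arbitrary choices for this example.

```r
# Ask for a number of evenly spaced values instead of a step size
seq(from = 1, to = 2, length.out = 5)

# Random draws change every time you run them.
# set.seed() makes them repeatable; 42 is an arbitrary choice.
set.seed(42)
first_draw <- runif(3, min = 0, max = 100)

set.seed(42)
second_draw <- runif(3, min = 0, max = 100)

# Same seed, same numbers
identical(first_draw, second_draw)
```

Setting a seed at the top of a script is handy when you want someone else to get exactly the same random numbers you did.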
Now, vectors are neat, as they allow us to introduce two more concepts. First, some functions take vectors as input, and return other types of objects. For example, let’s say we wanted to sum everything in my_vector above. And then get the average of a bunch of random numbers between 0 and 100
sum(my_vector)
## [1] 5050
#a function in a function!
#oh my!
mean(runif(10, min = 0, max = 100))
## [1] 49.84039
OH! Notice I nested a function inside of a function. YES! You can do that. But only when you really need to. To keep track of things, it’s often better practice to create an object with a variable name that has meaning to you, and then feed that as an argument to another function.
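Here is a sketch of the same calculation written both ways. The names random_values and average_value are invented for the example, and because the numbers are random, the two versions will not print the same value.

```r
# Nested: compact, but you have to read it from the inside out
mean(runif(10, min = 0, max = 100))

# Step by step: each piece gets a meaningful name
random_values <- runif(10, min = 0, max = 100)
average_value <- mean(random_values)
average_value
```

The step-by-step version also leaves random_values around, so you can inspect it if the average looks strange.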
Last, often you just want some summary information about your vector. You’ll want to do this for many more complex objects in the future as well. Fortunately, there’s a function for that: summary!
summary(my_numbers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    5.00    6.25    7.50    7.50    8.75   10.00
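Each number in that summary can also be computed on its own. A minimal sketch using base R functions:

```r
my_numbers <- c(5, 6, 7, 8, 9, 10)

# The individual pieces of summary()
min(my_numbers)
max(my_numbers)
median(my_numbers)
mean(my_numbers)

# The first and third quartiles come from quantile()
quantile(my_numbers, probs = c(0.25, 0.75))
```

These are useful when you need just one of the values to feed into another calculation.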
We’ve talked before about how vectors are like the columns in a spreadsheet. Those ‘sheets’ themselves within R are called data frames. Let’s look at a sample data frame.
#load the data
data(mtcars)
#look at the top 6 rows
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
So, I’ve loaded a data set about cars. And now we’re looking at the first six lines using the head function. If you want, try typing mtcars without anything else. It runs off the screen! Hence, head is a useful function.
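A few close relatives of head, sketched with base R functions only. They all answer quick "what am I looking at?" questions about a data frame.

```r
data(mtcars)

# Ask head() for a different number of rows
head(mtcars, n = 3)

# tail() shows the bottom of the data frame instead
tail(mtcars, n = 3)

# How big is it? Rows first, then columns
dim(mtcars)
nrow(mtcars)
ncol(mtcars)

# What are the columns called?
names(mtcars)
```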
There are other ways we can get information about the data frame that you will use time and time and time again to diagnose what’s going wrong.
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
#str is life
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary gives you some nice information about each column of mtcars. But str is the real star of the show. Any time your code borks on you and you think it’s because of one of the objects you are passing to it, use str, as it will give you a rich set of information about object types, values, etc. Often your object is not doing what you think it’s doing.
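As a sketch of the kind of problem str catches, here is a hypothetical vector of measurements that were accidentally stored as text. The values and the names measurements and fixed_measurements are invented for illustration.

```r
# Hypothetical example: numbers that were accidentally stored as text
measurements <- c("1.5", "2.5", "3.5")

# str() spells out the type: chr means character, not numeric
str(measurements)

# Math on text does not work, so convert to numbers first
fixed_measurements <- as.numeric(measurements)
str(fixed_measurements)
mean(fixed_measurements)
```

This happens a lot with data read in from files, where one stray non-numeric entry can turn a whole column into text.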
First off, how do we work with individual columns? We have a few ways to access them.
mtcars[["mpg"]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
This is kind of like the [] notation of before, but instead we use [[]] with a quoted variable name.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
This is a way to get the values in a column without worrying about quotes (unless you have spaces in column names, but you won’t, will you?). Last, as we see in a spreadsheet, there are rows and columns - a matrix if you will. We can take advantage of this, as R uses the [row, column] formulation of indexing matrices. So
mtcars[1,1]
## [1] 21
gives us the value in the first row and first column of mtcars. To get the entire second column, or the entire second row, we can use similar notation.
#columns
mtcars[,2]
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
#rows
mtcars[2,]
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
Leaving out the row number means give us every row of that column; leaving out the column number means give us every column of that row. We can of course be more specific.
mtcars[1:5,1]
## [1] 21.0 21.0 22.8 21.4 18.7
Yes, we can use vectors of numbers as rows and columns to get many of them.
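The same [row, column] notation also accepts column names and comparisons, not just numbers. A minimal sketch follows; the variable name four_cyl_mpg is made up for the example.

```r
data(mtcars)

# Columns can be picked by name instead of by number
mtcars[1:5, "mpg"]

# Several columns at once, using a vector of names
mtcars[1:3, c("mpg", "cyl")]

# Rows can be picked by a comparison: only the four-cylinder cars
four_cyl_mpg <- mtcars[mtcars$cyl == 4, "mpg"]
length(four_cyl_mpg)
mean(four_cyl_mpg)
```

Using names rather than numbers keeps your code working even if the columns get reordered.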
Great! Now that you’ve got those basics, futz around with the rows and columns of mtcars. Get information about each of them, try applying arithmetic and functions, and see what shakes out!
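If you want a starting point, here is one possible sketch. It relies on the fact that the wt column is recorded in units of 1000 lbs (see ?mtcars); the names weight_lbs and power_to_weight are invented for this example.

```r
data(mtcars)

# Arithmetic works on a whole column at once.
# wt is recorded in units of 1000 lbs, so this converts it to lbs.
weight_lbs <- mtcars$wt * 1000
head(weight_lbs)

# Functions work on columns too
mean(mtcars$mpg)
log(mtcars$hp[1:3])

# Columns can be combined: horsepower per 1000 lbs of weight
power_to_weight <- mtcars$hp / mtcars$wt
summary(power_to_weight)
```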