Hello! So, today we’re going to begin to code in R. We’re going to cover the basics of using R with a focus on data frame objects.

But let’s begin. To talk about what’s going on offline, use today’s etherpad at https://etherpad.wikimedia.org/p/r-intro-umb

I have this cursor sitting here. What can I do with it?

Peering into your console, often the first source of fear and confusion is: what the heck do I do with this giant blank space? The easiest way to start thinking about R is as the world’s most advanced calculator. Seriously, try it out!

3+4
## [1] 7

Whoa! You can add! Wonderful. Let’s try a few other operations.

3-4
## [1] -1
4*5
## [1] 20
9/3
## [1] 3
2^7
## [1] 128

Yes, basic arithmetic is right there at your fingertips.

Arithmetic is great. But I want more

While arithmetic is cool (and necessary), we often want R to do more for us, such as evaluating more interesting mathematical functions. One that we commonly use is the logarithm. Let’s say you want the natural log of 10.

log(10)
## [1] 2.302585

Great! Notice how that worked. We had log, then two parentheses, inside of which we had 10. log is a function. Functions in R are hugely powerful. They are the core of many things we do. A function call consists of a function name, those parentheses, and then inside of the parentheses one or more arguments separated by commas. Often these arguments have names. For example, what if we wanted to get the log of 10 in base 10?

log(10, base = 10)
## [1] 1

Now we’ve supplied two arguments.

Help! I don’t know what arguments to give to a function!

We’re going to run into many different functions as we go forward. There are two ways to get help on how to use them and what arguments to supply. Let’s look at the help file for log:

?log

help("log")

Both of those do the same thing. Note that there is an order to arguments in the help file. If you supply arguments in that order, you don’t need to name them. Relying on order is bad practice, though (future you will be unhappy), as you may forget what each argument means. Whenever possible, use named arguments.
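
To make the named-versus-positional point concrete, here is a small sketch using log, whose two arguments are called x and base in its help file. The value 100 is just an illustrative choice.

```r
# positional: works, but you have to remember that base comes second
log(100, 10)
## [1] 2

# named: same answer, and the call explains itself
log(x = 100, base = 10)
## [1] 2

# named arguments can even be given out of order
log(base = 10, x = 100)
## [1] 2
```

All three calls do the same thing. The named versions are just easier to read six months from now.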

But I don’t know what the name of a function I need is!

Sometimes, we don’t know the name of the function we’re looking for. For example, in R, the function for arcsin is asin. But you have no way of knowing that in advance. For that, we use ?? with a quoted search term.

??"arcsine"

Notice this brings up a list of help files that have the word arcsine in them. Bueno! You can now track down the right function.

Before we go any further, a comment on comments

It can be VERY easy to get lost in a sea of R code, not knowing what is going on. Fortunately, R provides something called comments. In a comment, R stops evaluating code, and lets you write whatever love notes to yourself that you want to write. In R the comment character is #. For example:

#this is a comment

### This is also a comment

3+4 #hey, I commented after 3+4
## [1] 7
########################################
##### Oh, a comment box
##### that I can use to delineate
##### blocks of code
########################################

ALWAYS COMMENT THE HECK OUT OF YOUR CODE TO HELP OUT FUTURE YOU! I’ll include some comments in the code today to show you examples.

Variables and You

One of the great things about R is that you can save things as variables and use them later on. Some of them are there already. For example:

#This is Pi
pi
## [1] 3.141593

WHOA! PI IN R!

What if you wanted to make your own variable? Say, foo. And you wanted foo to always equal the square root of 2.

#let's create a variable foo
foo <- sqrt(2)


#what's inside of foo
foo
## [1] 1.414214

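Note that the assignment operator is <- and not an = sign. Now, you can use = for assignment, but in R it’s generally considered bad practice: = is also how you name arguments inside a function call (as in base = 10 earlier), so keeping <- for assignment makes the two jobs easy to tell apart.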
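
Here is a quick sketch of the different jobs that <-, = and == do. The variable names bar and baz are made up for this illustration.

```r
# <- creates (or overwrites) a variable
bar <- 8

# inside a function call, = names an argument
log(bar, base = 2)
## [1] 3

# = also works for assignment at the top level,
# which is exactly why mixing the two styles gets confusing
baz = 8

# == is something else again: a comparison
bar == baz
## [1] TRUE
```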

Now you have a variable that you can use from now on that is the square root of two! For example:

foo + 5
## [1] 6.414214
log(foo, base=2)
## [1] 0.5
foo^2
## [1] 2

More than a number

OK, numbers are great, but there are other types of objects we’ll be dealing with in R. Primarily, we’re going to work with data frames, but let’s build up to a data frame, as it’s big and hairy.

First, are there other object types that have a single element to them beyond numbers? Well, yes! There are strings - words in quotes:

"hello"
## [1] "hello"

Also, Booleans, which denote true and false:

TRUE
## [1] TRUE
FALSE
## [1] FALSE
0==4
## [1] FALSE
4==4
## [1] TRUE
3 <= 4
## [1] TRUE

Note the different ways we made comparisons. These will become handy as you move on. Under the hood, R treats TRUE as 1 and FALSE as 0 in arithmetic, so you can even do math with them:

TRUE + TRUE + FALSE
## [1] 2

OK, what about something for a missing value? For that, we have NA. This is quite important if, say, you have a missing value in your data set.

NA
## [1] NA
NA + 1
## [1] NA

No one likes an NA, and often we have to find ways around them!
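
As one sketch of finding a way around an NA, many summary functions accept an na.rm argument that drops missing values before calculating. The heights vector below is made-up data, and it uses the c function, which is explained in the next section.

```r
# a small made-up vector with one missing value
heights <- c(150, NA, 172)

# is.na() asks, element by element, whether a value is missing
is.na(heights)
## [1] FALSE  TRUE FALSE

# by default, a single NA makes the whole answer NA
mean(heights)
## [1] NA

# na.rm = TRUE removes missing values before taking the mean
mean(heights, na.rm = TRUE)
## [1] 161
```

Whether dropping missing values is the right call depends on your data, so treat na.rm as a tool to use deliberately rather than a default.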

Combining values into larger objects

Now that we have a few object types down, what if we have a bunch of them we want to work with together? Let’s start with what we call a vector.

What is a vector? A vector is a bunch of numbers (or other things) all in one single object that we can reference with an index. Think of it as a column in a spreadsheet. For example, let’s say I had a column containing all of the integers from 5 through 10, and I wanted to know the 2nd integer.

my_numbers <- c(5, 6, 7, 8, 9, 10)

my_numbers[2]
## [1] 6

Notice a few things. First, we created our vector with the function c. This function takes a sequence of values and puts them into a vector. These values can be anything - numbers, strings, booleans, etc. - although all of the values in a single vector end up as the same type. Second, notice that to reference the second value of the vector we used []. Specifically, we put [2] in. This means give me the second value in this object.

Now you try it. Create a vector - any vector - and try pulling out single values. Do some math with them. Log transform them. Use your vector with an index as you would any other variable. For example:
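
A few more things you can do with c and [], sketched with the same my_numbers vector. The mixed vector is a made-up example showing what happens when you combine different kinds of values.

```r
my_numbers <- c(5, 6, 7, 8, 9, 10)

# a vector of positions pulls out several values at once
my_numbers[c(1, 3)]
## [1] 5 7

# a negative index drops that position instead
my_numbers[-1]
## [1]  6  7  8  9 10

# mixing types: everything is converted to one common type (here, character)
mixed <- c(1, "two", TRUE)
mixed
## [1] "1"    "two"  "TRUE"
class(mixed)
## [1] "character"
```

That last bit is a common surprise: one stray string in a column of numbers quietly turns the whole vector into text.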

my_vector <- 1:100

my_vector[4] + my_vector[50]
## [1] 54

Oh! Notice the use of : there to get a long vector? Neat trick, no? There are other ways to generate vectors. Here are two:

First, a function to get a sequence with non-integer steps between numbers

seq(from = 1, to = 2, by=0.1)
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0

Second, 10 random numbers between 0 and 100

runif(10, min = 0, max = 100)
##  [1]  2.246729 83.211340 78.812395 69.149559 23.660027 94.247732  4.315947
##  [8] 92.919745 20.097051 55.152500
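
Because these numbers are random, you will get different values every time you run runif. If you need the same "random" numbers twice (say, so a collaborator can reproduce your results), one option is set.seed. A minimal sketch, where the seed value 42 and the variable names are arbitrary choices:

```r
# set.seed() fixes the starting point of the random number generator
set.seed(42)
first_draw <- runif(3, min = 0, max = 100)

# resetting the same seed reproduces the same draws
set.seed(42)
second_draw <- runif(3, min = 0, max = 100)

identical(first_draw, second_draw)
## [1] TRUE
```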

Now, vectors are neat, as they allow us to introduce two more concepts. First, some functions take vectors as input, and return other types of objects. For example, let’s say we wanted to sum everything in my_vector above, and then get the average of a bunch of random numbers between 0 and 100.

sum(my_vector)
## [1] 5050
#a function in a function!
#oh my!
mean(runif(10, min = 0, max = 100))
## [1] 49.84039

OH! Notice I nested a function inside of a function. YES! You can do that. But only when you really need to. To keep track of things, it’s often better practice to create an object with a variable name that has meaning to you, and then feed that as an argument to another function.
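
Here is roughly what that step-by-step alternative could look like. The names random_draws and average_draw are just suggestions; pick names that mean something to you.

```r
# nested version: compact, but you have to read it from the inside out
mean(runif(10, min = 0, max = 100))

# step-by-step version: each piece gets a meaningful name
random_draws <- runif(10, min = 0, max = 100)
average_draw <- mean(random_draws)
average_draw
```

The two versions will print different numbers, since each call to runif draws fresh random values. The payoff of the second version is that random_draws sticks around, so you can inspect it if the average looks odd.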

Last, often you just want some summary information about your vector. You’ll want to do this for many more complex objects in the future as well. Fortunately, there’s a function for that: summary!

summary(my_numbers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    6.25    7.50    7.50    8.75   10.00
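
If you only want one of those numbers, each piece of the summary has its own function. A quick sketch with the same my_numbers vector:

```r
my_numbers <- c(5, 6, 7, 8, 9, 10)

min(my_numbers)
## [1] 5
median(my_numbers)
## [1] 7.5
mean(my_numbers)
## [1] 7.5
max(my_numbers)
## [1] 10

# quantile() gives quartiles; probs = 0.25 asks for the 1st quartile,
# which matches the 1st Qu. value of 6.25 in the summary output
quantile(my_numbers, probs = 0.25)
```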

I’m all vectorized. What does this have to do with data?

We’ve talked before about how vectors are like the columns in a spreadsheet. Those ‘sheets’ themselves within R are called data frames. Let’s look at a sample data frame.

#load the data
data(mtcars)

#look at the top 6 rows
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

So, I’ve loaded a data set about cars. And now we’re looking at the first six rows using the head function. If you want, try typing mtcars without anything else. It runs off the screen! Hence, head is a useful function.

There are other ways we can get information about the data frame that you will use time and time and time again to diagnose what’s going wrong.

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
#str is life
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

summary gives you some nice information about each column of mtcars. But str is the real star of the show. Any time your code borks on you and you think it’s because of one of the objects you are passing to it, use str, as it will give you a rich set of information about object types, values, etc. Often your object is not doing what you think it’s doing.
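
A few other quick checks pair well with str when you are diagnosing a data frame. This is just a sketch of ones I find handy, not an exhaustive list:

```r
# how many rows and columns?
dim(mtcars)
## [1] 32 11

# what are the columns called?
names(mtcars)

# what kind of object is this?
class(mtcars)
## [1] "data.frame"

# str works on simpler objects too
str(1:5)
##  int [1:5] 1 2 3 4 5
```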

So, I’ve got data. How do I use it?

First off, how do we work with individual columns? We have a few ways to access them.

mtcars[["mpg"]]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

This is kind of like the [] notation of before, but instead we use [[]] with a quoted variable name.

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

This gets the values in a column too, without having to worry about quotes (unless you have spaces in column names, but you won’t, will you?). Last, as we see in a spreadsheet, there are rows and columns - a matrix if you will. We can take advantage of this, as R uses the [row, column] formulation of indexing matrices. So

mtcars[1,1]
## [1] 21

gives us that first value of mtcars, the one in row 1, column 1. To get the whole second column, and then the whole second row, we can use similar notation.

#columns
mtcars[,2]
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
#rows
mtcars[2,]
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Leaving the row number blank gives us every row, and leaving the column number blank gives us every column. We can of course be more specific.

mtcars[1:5,1]
## [1] 21.0 21.0 22.8 21.4 18.7

Yes, we can use vectors of numbers as rows and columns to get many of them.
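
To pull the different ways of indexing together, here is a short sketch. The last example goes one step beyond what we’ve covered: it uses a comparison (like the Booleans from earlier) to choose rows, and the name four_cyl is just something I made up.

```r
# the three ways of pulling out a column return the same values
identical(mtcars[["mpg"]], mtcars$mpg)
## [1] TRUE
identical(mtcars$mpg, mtcars[, 1])
## [1] TRUE

# column names work in place of column numbers
mtcars[1:3, c("mpg", "cyl")]
##                mpg cyl
## Mazda RX4     21.0   6
## Mazda RX4 Wag 21.0   6
## Datsun 710    22.8   4

# a vector of TRUE/FALSE values picks rows: here, cars with 4 cylinders
four_cyl <- mtcars[mtcars$cyl == 4, ]
nrow(four_cyl)
## [1] 11
```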

Great! Now that you’ve got those basics, futz around with the rows and columns of mtcars. Get information about each of them, try applying arithmetic and functions, and see what shakes out!
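
If you’re not sure where to start, here are a few things you might try. The variable names weight_lbs and log_hp are made up, and the unit conversion assumes the description in ?mtcars, which lists wt as weight in 1000 lbs.

```r
# summarize a single column
mean(mtcars$mpg)
summary(mtcars$hp)

# arithmetic applies to every value in a column at once
weight_lbs <- mtcars$wt * 1000
head(weight_lbs)
## [1] 2620 2875 2320 3215 3440 3460

# log transform a column and peek at the result
log_hp <- log(mtcars$hp)
head(log_hp)
```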