So far, we’ve worked with a single vector, and I’ve made the analogy that a vector is like a column in a spreadsheet, where each index references a different row. But we know that spreadsheets have many columns. What would that look like in R? Introducing the matrix.
my_matrix <- matrix(1:50, nrow=10)
my_matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50
Well, that’s a table. Notice that the row and column labels are in brackets with commas. The row labels all have a comma after a number, and the column labels have the comma before the number. This suggests an indexing scheme, [row, column], that identifies any individual cell of a matrix.
Think of the matrix as the cells of a spreadsheet. So, let’s say we wanted row 5, column 3 of my_matrix:
my_matrix[5,3]
## [1] 25
What if we wanted to get a whole row or a whole column? R lets us specify that by leaving the column or row index blank, respectively. For example, to see the entire third row of this matrix:
#third row - note no column specified
my_matrix[3,]
## [1] 3 13 23 33 43
Similarly, for the fourth column
my_matrix[,4]
## [1] 31 32 33 34 35 36 37 38 39 40
So, what class are these outputs? Find out with str.
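As a quick check, here is a minimal sketch of what str reports for a single row. It also uses the drop = FALSE argument, which has not come up in this tutorial so far, to show how the matrix structure can be kept if you need it.

```r
my_matrix <- matrix(1:50, nrow = 10)

# a single row comes back as a plain integer vector, not a matrix
third_row <- my_matrix[3, ]
str(third_row)
is.matrix(third_row)   # FALSE

# drop = FALSE keeps the result as a matrix with 1 row and 5 columns
third_row_matrix <- my_matrix[3, , drop = FALSE]
dim(third_row_matrix)
```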
Matrices are great as many functions specialize in looking at rows and columns. For example
rowMeans(my_matrix)
## [1] 21 22 23 24 25 26 27 28 29 30
colSums(my_matrix)
## [1] 55 155 255 355 455
We can also get the dimensions of a matrix in a few different fashions
#All at once
dim(my_matrix)
## [1] 10 5
#just rows
nrow(my_matrix)
## [1] 10
#just columns
ncol(my_matrix)
## [1] 5
Try creating a 10 x 10 matrix of random uniform numbers between 5 and 50, and get the row and column means. What’s the output of str?
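If you get stuck, one possible approach is sketched below. It assumes runif for the random uniform draws, and the names random_matrix, row_means and col_means are made up for illustration. Your numbers will differ on every run because they are random.

```r
# 100 random uniform numbers between 5 and 50, arranged in 10 rows
random_matrix <- matrix(runif(100, min = 5, max = 50), nrow = 10)

# one mean per row and one mean per column
row_means <- rowMeans(random_matrix)
col_means <- colMeans(random_matrix)

# compare the structure of the matrix with that of the means
str(random_matrix)
str(row_means)
```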
Matrices are somewhat unsatisfying, as we’re going to want to reference our columns by name. R has a catchall object type called a list, where objects are stored in key-value pairs. If you know the key (think column name) you can access the value. Let’s see how this works.
my_list <- list(First = 1:10,
Second = letters[1:5])
my_list
## $First
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Second
## [1] "a" "b" "c" "d" "e"
So now we have a list with two elements: a vector of ten numbers and a vector of five letters. Note the funny $ notation next to the names of the keys. Just as the bracketed indices printed with earlier objects hinted at how to access the pieces they hold, this, too, is a clue. To look at just the object stored with the key (again, think column name!) “First”, we have four options:
my_list$First
## [1] 1 2 3 4 5 6 7 8 9 10
my_list[["First"]]
## [1] 1 2 3 4 5 6 7 8 9 10
my_list["First"]
## $First
## [1] 1 2 3 4 5 6 7 8 9 10
my_list[1]
## $First
## [1] 1 2 3 4 5 6 7 8 9 10
Note that the first way (ha!) uses the $, followed by another new way to index, [[]]. The third uses our old friend [], but in a new way. The fourth assumes we want the first key-value pair from the list, and its name is unimportant. What is the difference between these methods? Try str out on all of them. Notice anything? The first two return vectors. The last two return lists. Yeah, sneaky. It’s important to know which method of accessing a list you are using so that you can then use the output in the proper way. More on that in a moment.
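Here is a small sketch of that comparison, using is.list alongside str to make the difference explicit.

```r
my_list <- list(First = 1:10,
                Second = letters[1:5])

# $ and [[ ]] pull out the stored object itself
str(my_list$First)
str(my_list[["First"]])

# [ ] returns a smaller list that still wraps the object
str(my_list["First"])
str(my_list[1])

# is.list() confirms which is which
is.list(my_list[["First"]])  # FALSE
is.list(my_list["First"])    # TRUE
```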
Back to [1]: what is first? How do you know? To get the keys in a list (think column names!) we use names
names(my_list)
## [1] "First" "Second"
Try this out. Create a list consisting of a vector of numbers, an NA, and a list which contains two vectors. Take a look at their second elements. Also, check out our old friends str and summary.
Yes, lists can be nested in lists. And yes, you’d use something like my_list$sublist$First or anything analogous to access it.
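A minimal sketch of a nested list follows. The names nested_list, numbers, missing and sublist are hypothetical, chosen only for this illustration.

```r
# a vector of numbers, an NA, and a list that holds two vectors
nested_list <- list(numbers = 1:5,
                    missing = NA,
                    sublist = list(First = c(2.5, 3.5, 4.5),
                                   Second = c("x", "y", "z")))

# chain $ to reach into the inner list
nested_list$sublist$First

# the [[ ]] form is analogous
nested_list[["sublist"]][["First"]]

# second element of the inner vector
nested_list$sublist$First[2]

str(nested_list)
```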
Last, how do you access multiple parts of a list? With matrices we could use the [,] notation, but not so with lists. Let’s first look at the second element of the first entry in my_list.
my_list$First[2]
## [1] 2
What are some other ways you can do that?
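A couple of possibilities, sketched with the accessors we have already met: pull the vector out by name or by position with [[]], then index into it with [].

```r
my_list <- list(First = 1:10, Second = letters[1:5])

# by name, using [[ ]] instead of $
my_list[["First"]][2]

# by position, since First is the first entry
my_list[[1]][2]
```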
OK, now what if we want to look at two elements of my_list? Again, there are a few ways to do that.
my_list[1:2]
## $First
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Second
## [1] "a" "b" "c" "d" "e"
my_list[c("First", "Second")]
## $First
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Second
## [1] "a" "b" "c" "d" "e"
These both yield lists with multiple elements in them. We couldn’t have used [[]], as it extracts a single element rather than returning a list, and hence can’t give us multiple elements. This will become important as we get into data frames.
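To see the contrast concretely, here is a small sketch. Given a vector of positions, [] returns a list with that many elements, whereas [[]] walks down one level per position and hands back a single value. The variable name both is made up for this example.

```r
my_list <- list(First = 1:10, Second = letters[1:5])

# [ ] with two positions: a list holding both elements
both <- my_list[1:2]
length(both)

# [[ ]] with two positions indexes recursively:
# entry 1 of the list, then element 2 of that vector
my_list[[1:2]]
identical(my_list[[1:2]], my_list[[1]][[2]])
```

So my_list[[1:2]] quietly returns one number instead of two list entries, which is worth remembering when a result looks smaller than expected.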
We’ve talked before about how vectors are like the columns in a spreadsheet. Those ‘sheets’ themselves are called data frames within R, and they are a useful combination of a matrix and a list. Let’s look at a sample data frame.
#load the data
data(mtcars)
#look at the top 6 rows
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
So, I’ve loaded a data set about cars, and now we’re looking at the first six rows using the head function. If you want, try typing mtcars without anything else. It runs off the screen! Hence, head is a useful function.
There are other ways we can get information about the data frame that you will use time and time and time again to diagnose what’s going wrong.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
#str is life
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Summary gives you some nice information about each column of mtcars. But str is the real star of the show. Any time your code borks on you and you think it’s because of one of the objects you are passing to it, use str, as it will give you a rich set of information about object types, values, etc. Often your object is not doing what you think it’s doing.
First off, how do we work with individual columns? We have a few ways to access them.
mtcars[["mpg"]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
Oh hey! It’s the [[]] notation from before! Quick, before you str, what class of object will the output be?
…
If you guessed a numeric vector, you’d be right! We can also use ["mpg"] to return a data frame. Now, we of course have other list-like ways of accessing columns.
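A quick sketch to confirm both claims, using class and dim. The variable names mpg_vector and mpg_frame are invented for this example.

```r
data(mtcars)

# [[ ]] extracts the column itself
mpg_vector <- mtcars[["mpg"]]
class(mpg_vector)  # "numeric"

# [ ] keeps the data frame wrapper: 32 rows, 1 column
mpg_frame <- mtcars["mpg"]
class(mpg_frame)   # "data.frame"
dim(mpg_frame)
```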
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
This way we don’t have to worry about quotes (unless you have spaces in column names, but you won’t, will you?) and still get the values in a column.
But I mentioned a data frame can also act like a matrix. As we see in a spreadsheet, there are rows and columns: a matrix, if you will. We can take advantage of this, as R uses the same [row, column] formulation for indexing data frames as it does for matrices. So
mtcars[1,1]
## [1] 21
gives us the first value of mtcars: row 1, column 1. To get the second column and then the second row, we can use similar notation.
#columns
mtcars[,2]
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
#rows
mtcars[2,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Leaving the row index blank returns every row of that column, and leaving the column index blank returns every column of that row. We can of course be more specific.
mtcars[1:5,1]
## [1] 21.0 21.0 22.8 21.4 18.7
Yes, we can use vectors of numbers as rows and columns to get many of them. We can also use numeric indices for the rows together with column names for the columns.
mtcars[1:5, c("disp", "hp")]
## disp hp
## Mazda RX4 160 110
## Mazda RX4 Wag 160 110
## Datsun 710 108 93
## Hornet 4 Drive 258 110
## Hornet Sportabout 360 175
Note, though, that a single index vector can’t mix numbers and names directly: c(3, "hp") is coerced to the character vector c("3", "hp"), and there is no column named "3". And this is just the beginning.
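One way to build a column selection from both a position and a name is sketched below. The helper variable wanted_columns is hypothetical; the idea is to look the position up in names(mtcars) first, so that the whole selection vector is made of names.

```r
data(mtcars)

# names(mtcars)[3] turns position 3 into its column name, "disp"
wanted_columns <- c(names(mtcars)[3], "hp")
wanted_columns

# rows by number, columns by the assembled names
mtcars[1:5, wanted_columns]
```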
Great! Now that you’ve got those basics, futz around with the rows and columns of mtcars. Get information about each of them, try applying arithmetic and functions, such as those we used on matrices, and see what shakes out!
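A few starting points, offered only as a sketch of the kind of thing to try. The variable mpg_per_cyl is made up for illustration.

```r
data(mtcars)

# arithmetic on columns works element by element
mpg_per_cyl <- mtcars$mpg / mtcars$cyl
head(mpg_per_cyl)

# the matrix summary functions work here because every column is numeric
colMeans(mtcars)
rowSums(mtcars[1:3, ])

# and so do the dimension helpers
dim(mtcars)
```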