Matrices, Lists, and Data Frames, Oh My!

A Vector is like a column… So what are rows?

So far, we’ve worked with a single vector, and I’ve made the analogy that a vector is like a column in a spreadsheet. Each index references a different row. But we know that spreadsheets have many columns. What would that be like in R? Introducing the matrix.

my_matrix <- matrix(1:50, nrow=10)

my_matrix

##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1   11   21   31   41
##  [2,]    2   12   22   32   42
##  [3,]    3   13   23   33   43
##  [4,]    4   14   24   34   44
##  [5,]    5   15   25   35   45
##  [6,]    6   16   26   36   46
##  [7,]    7   17   27   37   47
##  [8,]    8   18   28   38   48
##  [9,]    9   19   29   39   49
## [10,]   10   20   30   40   50

Well, that’s a table. Notice that the row and column labels are in brackets with commas. The row labels all have a comma after a number, and the column labels have the comma before the number. This suggests an indexing scheme - where we have [row, column] to define any individual cell of a matrix.

Think of the matrix as the cells of a spreadsheet. So, let’s say we wanted row 5, column 3 of my_matrix:

my_matrix[5,3]

## [1] 25
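Why 25? By default, matrix() fills down each column before moving on to the next one. Here is a minimal sketch of that arithmetic, re-creating my_matrix so it runs on its own. The by_row object is a made-up name for illustration, and byrow = TRUE is an argument of matrix() not used elsewhere in this lesson.

```r
# Re-create the matrix from above so this block stands alone
my_matrix <- matrix(1:50, nrow = 10)

# matrix() fills down each column first, so with 10 rows the cell in
# row 5, column 3 is element (3 - 1) * 10 + 5 = 25 of the original vector
my_matrix[5, 3]
my_matrix[25]

# byrow = TRUE fills across each row instead (by_row is a made-up name)
by_row <- matrix(1:50, nrow = 10, byrow = TRUE)
by_row[5, 3]
```

With byrow = TRUE the same [5, 3] position holds a different number, so it pays to know which way your matrix was filled.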

What if we wanted to get a whole row or a whole column? R lets us specify that by leaving the columns or rows blank, respectively. For example, to see the entire third row of this matrix:

#third row - note no column specified
my_matrix[3,]

## [1]  3 13 23 33 43

Similarly, for the fourth column

my_matrix[,4]

##  [1] 31 32 33 34 35 36 37 38 39 40

So, what class are these outputs? Find out with str.
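If you want to check your answer, here is one way to poke at it. This is only a sketch: it also uses drop = FALSE, an argument of [ that we haven’t covered, and third_row is a made-up name.

```r
my_matrix <- matrix(1:50, nrow = 10)

# Pulling out a single row or column drops down to a plain vector
str(my_matrix[3, ])
str(my_matrix[, 4])

# drop = FALSE keeps the matrix shape (here 1 row by 5 columns)
third_row <- my_matrix[3, , drop = FALSE]
dim(third_row)
```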

Matrices are great as many functions specialize in looking at rows and columns. For example

rowMeans(my_matrix)

##  [1] 21 22 23 24 25 26 27 28 29 30

colSums(my_matrix)

## [1]  55 155 255 355 455

We can also get the dimensions of a matrix in a few different fashions

#All at once
dim(my_matrix)

## [1] 10  5

#just rows
nrow(my_matrix)

## [1] 10

#just columns
ncol(my_matrix)

## [1] 5

Try creating a 10 x 10 matrix of random uniform numbers between 5 and 50, and get the row and column means. What’s the output of str?

But columns should have names!

Matrices are somewhat unsatisfying, as we’re going to want to reference our columns by name. R has a catchall object type called a list where objects are stored in key-value pairs. If you know the key - think column name - you can access the value. Let’s see how this works.

my_list <- list(First = 1:10,
                Second = letters[1:5])

my_list

## $First
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $Second
## [1] "a" "b" "c" "d" "e"

So now we have a list with two elements, a vector of ten numbers and a vector of five letters. Note the funny $ notation next to the names of the keys. Just as the bracketed indices in earlier output hinted at how to access the pieces an object holds, this, too, is a clue. To look at just the object stored with the key (again, think column name!) “First”, we have four options

my_list$First

##  [1]  1  2  3  4  5  6  7  8  9 10

my_list[["First"]]

##  [1]  1  2  3  4  5  6  7  8  9 10

my_list["First"]

## $First
##  [1]  1  2  3  4  5  6  7  8  9 10

my_list[1]

## $First
##  [1]  1  2  3  4  5  6  7  8  9 10

Note the first way (ha!) uses the $, and the second uses another new way to index - [[]]. The third uses our old friend [], but in a new way. The fourth assumes we want the first key-value pair from the list, and its name is unimportant. What is the difference between these methods? Try str out on all of them. Notice anything? The first two return vectors. The last two return lists. Yeah, sneaky. It’s important to know which method of accessing a list you are using so that you can then use the output in the proper way. More on that in a moment.
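A quick sketch of why this matters in practice, re-creating my_list so it stands alone. The commented-out line is there to show what breaks, not something to run as is.

```r
my_list <- list(First = 1:10,
                Second = letters[1:5])

# $ and [[ ]] pull the value out of the list
class(my_list$First)
class(my_list[["First"]])

# [ ] keeps the list wrapper, so you get a one-element list back
class(my_list["First"])
class(my_list[1])

# This matters when you pass the result on to another function
sum(my_list[["First"]])    # works on the integer vector
# sum(my_list["First"])    # would fail: sum() can't add up a list
```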

Back to [1] - what is first? How do you know? To get the keys in a list (think column names!) we use names

names(my_list)

## [1] "First"  "Second"

Try this out. Create a list consisting of a vector of numbers, an NA, and a list which contains two vectors. Take a look at their second elements. Also, check out our old friends str and summary. Yes, lists can be nested in lists. And yes, you’d use something like my_list$sublist$First or anything analogous to access it.
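For the nested access in particular, here is a small sketch. The list and every name in it (nested, numbers, missing, sublist) are made up for illustration, so build your own before peeking too closely.

```r
# A made-up nested list; the names here are only for illustration
nested <- list(numbers = c(2, 4, 6),
               missing = NA,
               sublist = list(First = 1:3,
                              Second = c("x", "y")))

# Chain $ to walk down into the inner list
nested$sublist$First

# [[ ]] chains the same way, and you can mix the two
nested[["sublist"]][["Second"]]
nested$sublist[["First"]][2]
```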

Last, how do you access multiple parts of a list? With matrices we could use the [,] notation, but not so with lists. Let’s first look at the second element of the first entry in my_list.

my_list$First[2]

## [1] 2

What are some other ways you can do that?

OK, now what if we want to look at two elements of my_list? Again, there are a few ways to do that.

my_list[1:2]

## $First
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $Second
## [1] "a" "b" "c" "d" "e"

my_list[c("First", "Second")]

## $First
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $Second
## [1] "a" "b" "c" "d" "e"

These both yield lists with multiple elements in them. We couldn’t have used [[]] here, as it pulls out a single element rather than returning a list, so it can’t hand back several elements at once. This will become important as we get into data frames.
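To see what [[]] does when handed more than one position, here is a short sketch (both is a made-up name). Rather than returning two elements, it digs down one level per position.

```r
my_list <- list(First = 1:10,
                Second = letters[1:5])

# [ ] with two positions returns a list holding both elements
both <- my_list[1:2]
length(both)

# [[ ]] with two positions digs down instead: this is the same as
# my_list[[1]][[2]], the second value inside First
my_list[[1:2]]
```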

What if a matrix and a list had a baby?

We’ve talked before about how vectors are like the columns in a spreadsheet. Those ‘sheets’ themselves within R are called data frames. And they are a useful combination of a matrix and a list. Let’s look at a sample data frame.

#load the data
data(mtcars)

#look at the top 6 rows
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

So, I’ve loaded a data set about cars. And now we’re looking at the first six lines using the head function. If you want, try typing mtcars without anything else. It runs off the screen! Hence, head is a useful function.

There are other ways we can get information about the data frame that you will use time and time and time again to diagnose what’s going wrong.

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

#str is life
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Summary gives you some nice information about each column of mtcars. But str is the real star of the show. Any time your code borks on you and you think it’s because of one of the objects you are passing to it, use str as it will give you a rich set of information about object types, values, etc. Often your object is not doing what you think it’s doing.
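Here is a sketch of the kind of problem str catches. The oops data frame is invented for this example; it stands in for any data set where numbers were accidentally read in as text.

```r
# A made-up data frame where the numbers were accidentally stored as text
oops <- data.frame(id = 1:3,
                   weight = c("2.5", "3.1", "4.0"),
                   stringsAsFactors = FALSE)

# Printed, the weight column looks numeric...
oops

# ...but str shows it is chr, which is why mean() on it gives NA and a warning
str(oops)

# Converting the column fixes it
oops$weight <- as.numeric(oops$weight)
mean(oops$weight)
```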

So, I’ve got data. How do I use it?

First off, how do we work with individual columns? We have a few ways to access them.

mtcars[["mpg"]]

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

Oh hey! It’s the [[]] notation from before! Quick, before you str, what class of object will the output be?

…

If you guessed a numeric vector, you’d be right! We can also use ["mpg"] to return a data frame. Now, we of course have other list-like ways of accessing columns.

mtcars$mpg

##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

This way we don’t have to worry about quotes (unless you have spaces in column names, but you won’t, will you?) and still get the values in a column.
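To put the list-style forms side by side, here is a short sketch. mpg_df is a made-up name for the one-column data frame.

```r
data(mtcars)

# [[ ]] and $ give back the column itself, a numeric vector
class(mtcars[["mpg"]])
class(mtcars$mpg)

# [ ] gives back a one-column data frame, row names and all
mpg_df <- mtcars["mpg"]
class(mpg_df)
dim(mpg_df)
head(mpg_df, 3)
```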

But, I mentioned a data frame can also act like a matrix. As we see in a spreadsheet, there are rows and columns - a matrix if you will. We can take advantage of this, as R uses the [row, column] formulation of indexing matrices. So

mtcars[1,1]

## [1] 21

gives us that first value of mtcars. To get the whole second column and then the whole second row, we can use similar notation.

#columns
mtcars[,2]

##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

#rows
mtcars[2,]

##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

By leaving the row or column position blank, we ask for every row or every column, respectively. We can of course be more specific.

mtcars[1:5,1]

## [1] 21.0 21.0 22.8 21.4 18.7

Yes, we can use vectors of numbers as rows and columns to get many of them. We can also mix up indices and column names.

mtcars[1:5, c("disp", "hp")]

##                   disp  hp
## Mazda RX4          160 110
## Mazda RX4 Wag      160 110
## Datsun 710         108  93
## Hornet 4 Drive     258 110
## Hornet Sportabout  360 175

Or we could have used column numbers instead of names in the columns vector, though not both at once, since c() would turn the numbers into text. And this is just the beginning.
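One thing to watch when indexing a data frame like a matrix is what kind of object comes back. A rough sketch using only mtcars:

```r
data(mtcars)

# A single column collapses to a plain vector
is.data.frame(mtcars[, 2])

# A single row stays a data frame, since its columns could differ in type
is.data.frame(mtcars[2, ])

# Several columns stay a data frame too
is.data.frame(mtcars[1:5, c("disp", "hp")])

# Column numbers work in place of names: disp and hp are columns 3 and 4
all.equal(mtcars[1:5, c("disp", "hp")], mtcars[1:5, 3:4])
```

As always, when in doubt about what you got back, str it.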

Great! Now that you’ve got those basics, futz around with the rows and columns of mtcars. Get information about each of them, try applying arithmetic and functions, such as those we used on matrices, and see what shakes out!

Intro to Data Science for Biology
