Data Structures

Last updated on 2024-08-19

Estimated time: 55 minutes

Overview

Questions

  • How can I read data in R?
  • What are the basic data types in R?
  • How do I represent categorical information in R?

Objectives

  • To be aware of the different types of data.
  • To begin exploring data frames, and understand how they are related to vectors, factors and lists.
  • To be able to ask questions from R about the type, class, and structure of an object.

Let’s start by creating a new R script and saving it to the scripts folder in our project directory. We will create new scripts for each episode in this workshop.

We can create a new R script by clicking the button at the top left of the RStudio window, the one that looks like a piece of paper with a green plus sign next to it. On the drop-down menu, click “R Script”. We can also create an R script using File > New File > R Script.

We can add a comment to our script to remind us what we’re working on:

R

# R script for data structures

Comments are useful notes to us that are ignored by the computer.

Now that we’ve got the basics covered, we can move on to the lesson. One of R’s most powerful features is its ability to deal with tabular data - like you may already have in a spreadsheet or a CSV file. Let’s start by downloading and reading in a file nordic-data.csv. We will save this data as an object named nordic:

R

nordic <- read.csv("data/nordic-data.csv")

Tip: Running segments of your code

RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can

  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Enter in Windows, Ctrl+Return in Linux, or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.)

To run a block of code, select it and then click Run.

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, as in CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate, or delimit, data points in such files. For convenience, R provides two other versions of read.table: read.csv for files where the data are separated with commas, and read.delim for files where the data are separated with tabs. Of these three functions, read.csv is the most commonly used. If needed, it is possible to override the default delimiter for both read.csv and read.delim.
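As a rough sketch of how the delimiter can be overridden, the example below reads the same small table three ways. The inline text (passed through the text argument instead of a file path) and the object names are made up for illustration and are not part of the workshop data.

```r
# Hypothetical inline data standing in for files on disk (an assumption for this sketch)
semicolon_text <- "country;year;lifeExp\nDenmark;2002;77.2\nSweden;2002;80.0"
tab_text <- "country\tyear\tlifeExp\nDenmark\t2002\t77.2\nSweden\t2002\t80.0"

# read.csv expects commas by default; sep overrides the delimiter
semicolon_data <- read.csv(text = semicolon_text, sep = ";")

# read.delim expects tabs by default, so no override is needed
tab_data <- read.delim(text = tab_text)

# read.table needs both the delimiter and the header row spelled out
table_data <- read.table(text = tab_text, sep = "\t", header = TRUE)

# Each call should give a data frame with 2 rows and 3 columns
str(semicolon_data)
str(tab_data)
str(table_data)
```

With a real file, the first argument would be the path to the file, as in the read.csv call used to create nordic.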

We can begin exploring our dataset right away, pulling out columns by specifying them using the $ operator:

R

nordic$country

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

R

nordic$lifeExp

OUTPUT

[1] 77.2 80.0 79.0

We can do other operations on the columns. For example, if we discovered that the life expectancy is two years higher:

R

nordic$lifeExp + 2

OUTPUT

[1] 79.2 82.0 81.0

But what about:

R

nordic$lifeExp + nordic$country

ERROR

Error in nordic$lifeExp + nordic$country: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing data in R.

Data Types


  • Learners will work with factors in the following lesson. Be sure to cover this concept.
  • If needed for time reasons, you can skip the section on lists. The learners don’t use lists in the rest of the workshop.

If you guessed that the last command will return an error because 77.2 plus "Denmark" is nonsense, you’re right - and you already have some intuition for an important concept in programming called data classes. We can ask what class of data something is:

R

class(nordic$lifeExp)

OUTPUT

[1] "numeric"

There are 6 main classes that we will meet in this lesson:

  • numeric: values like 1.254, 0.8, -7.2
  • integer: values like 1, 10, 150
  • complex: values like 1+1i
  • logical: TRUE or FALSE
  • character: values like “cats”, “dogs”, and “animals”
  • factor: categories with all possible values, like the months of the year

No matter how complicated our analyses become, all data in R is interpreted as a specific data class. This strictness has some really important consequences.
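To make the list above concrete, here is a small sketch that asks R for the class of one example value of each kind. The values are arbitrary and chosen only for illustration.

```r
class(1.254)          # "numeric"
class(10L)            # "integer" (the L suffix asks R for an integer)
class(1+1i)           # "complex"
class(TRUE)           # "logical"
class("cats")         # "character"
class(factor("May"))  # "factor"

# A whole number typed without the L suffix is stored as numeric
class(10)             # "numeric"

# typeof() reports how the value is stored internally; numeric values are doubles
typeof(1.254)         # "double"
```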

A user has added new details of life expectancy. This information is in the file data/nordic-data-2.csv.

Load the new nordic data as nordic_2, and check what class of data we find in the lifeExp column:

R

nordic_2 <- read.csv("data/nordic-data-2.csv")
class(nordic_2$lifeExp)

OUTPUT

[1] "character"

Oh no, our life expectancy values in lifeExp aren’t the numeric type anymore! If we try to do the same math we did on them before, we run into trouble:

R

nordic_2$lifeExp + 2

ERROR

Error in nordic_2$lifeExp + 2: non-numeric argument to binary operator

What happened? When R reads a csv file into one of these tables, it insists that everything in a column be the same class; if it can’t understand everything in the column as numeric, then nothing in the column gets to be numeric. The table that R loaded our nordic data into is something called a data frame, and it is our first example of something called a data structure - that is, a structure which R knows how to build out of the basic data types.

We can look at the data within R by clicking the object in our environment or using View(nordic_2). We see that the lifeExp column now has the value “79.0 or 83” in the third row. R needed all values in that column to be the same type, so it forced everything to be a character instead of a number.
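In a larger file it is not practical to spot the offending value by eye. One possible way to find it, sketched here on a hand-typed copy of the column (the vector and its name are assumptions for illustration, not something loaded from the file), is to attempt the conversion and look at which entries come back as NA:

```r
# Hypothetical stand-in for nordic_2$lifeExp, typed in by hand
life_exp_text <- c("77.2", "80", "79.0 or 83")

# as.numeric() returns NA (and a warning) wherever the text is not a number;
# suppressWarnings() hides the warning so only the result is printed
life_exp_number <- suppressWarnings(as.numeric(life_exp_text))
life_exp_number                         # 77.2 80.0 NA

# Show the original text of the entries that failed to convert
life_exp_text[is.na(life_exp_number)]   # "79.0 or 83"
```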

In order to successfully use our data in R, we need to understand what the basic data structures are, and how they behave.

Vectors and Type Coercion


To better understand this behavior, let’s meet another of the data structures: the vector.

A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type. If you don’t choose the data type, it’ll default to logical; or, you can declare an empty vector of whatever type you like.

You can specify a vector either using the vector() function or the combine (c()) function.

R

vector(length = 3) # this creates an empty vector of logical values

OUTPUT

[1] FALSE FALSE FALSE

R

c(1, 2, 3) # this creates a vector with numerical values

OUTPUT

[1] 1 2 3

R

c("this", "that", "the other") # this creates a vector with character values

OUTPUT

[1] "this"      "that"      "the other"

R

another_vector <- vector(mode = 'character', length = 3)
another_vector

OUTPUT

[1] "" "" ""

You can check if something is a vector using str which asks for an object’s structure:

R

str(another_vector)

OUTPUT

 chr [1:3] "" "" ""

The somewhat cryptic output from this command indicates the basic data type found in this vector - in this case chr, character; an indication of the number of things in the vector - actually, the indexes of the vector, in this case [1:3]; and a few examples of what’s actually in the vector - in this case empty character strings. If we similarly do

R

str(nordic$lifeExp)

OUTPUT

 num [1:3] 77.2 80 79

we see that nordic$lifeExp is a vector, too - the columns of data we load into R data frames are all vectors, and that’s the root of why R forces everything in a column to be the same basic data type.

Discussion 1

Why is R so opinionated about what we put in our columns of data? How does this help us?

By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.

Given what we’ve learned so far, what do you think the following will produce?

R

quiz_vector <- c(2, 6, '3')

This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them. When R encounters a mix of types (here numeric and character) to be combined into a single vector, it will force them all to be the same type. Consider:

R

coercion_vector <- c('a', TRUE)
coercion_vector

OUTPUT

[1] "a"    "TRUE"

The coercion rules go: logical -> integer -> numeric -> complex -> character, where -> can be read as are transformed into. You can try to force coercion against this flow using the as. functions:

R

character_vector_example <- c('0', '2', '4')
character_vector_example

OUTPUT

[1] "0" "2" "4"

R

character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric

OUTPUT

[1] 0 2 4

As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data frames, or you will get nasty surprises!
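The sketch below walks through the coercion flow one step at a time, and then shows what can happen when we push against it. The vectors and their names are invented for illustration.

```r
# Following the flow: each mix is promoted to the type further along the chain
logical_and_integer <- c(TRUE, 2L)
logical_and_integer            # 1 2
class(logical_and_integer)     # "integer"

integer_and_numeric <- c(2L, 3.5)
class(integer_and_numeric)     # "numeric"

numeric_and_character <- c(3.5, "a")
numeric_and_character          # "3.5" "a"

# Against the flow: values that cannot be interpreted become NA
against_flow_logical <- as.logical(c("TRUE", "FALSE", "maybe"))
against_flow_logical           # TRUE FALSE NA

# as.numeric() also warns about the NA; the warning is hidden here
against_flow_numeric <- suppressWarnings(as.numeric(c("1", "two")))
against_flow_numeric           # 1 NA
```

An NA appearing where a value was expected is often the first visible sign that a coercion did not go as planned.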

Challenge 1

Given what you now know about type conversion, look at the class of data in nordic_2$lifeExp and compare it with nordic$lifeExp. Why are these columns different classes?

R

str(nordic_2$lifeExp)

OUTPUT

 chr [1:3] "77.2" "80" "79.0 or 83"

R

str(nordic$lifeExp)

OUTPUT

 num [1:3] 77.2 80 79

The data in nordic_2$lifeExp is stored as a character vector, rather than as a numeric vector. This is because of the “or” character string in the third data point.

The combine function, c(), will also append things to an existing vector:

R

ab_vector <- c('a', 'b')
ab_vector

OUTPUT

[1] "a" "b"

R

combine_example <- c(ab_vector, 'DC')
combine_example

OUTPUT

[1] "a"  "b"  "DC"

You can also make series of numbers:

R

my_series <- 1:10
my_series

OUTPUT

 [1]  1  2  3  4  5  6  7  8  9 10

R

seq(from = 1, to = 10, by = 0.1)

OUTPUT

 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0

We can ask a few questions about vectors:

R

sequence_example <- 1:10
head(sequence_example)

OUTPUT

[1] 1 2 3 4 5 6

R

tail(sequence_example)

OUTPUT

[1]  5  6  7  8  9 10

R

length(sequence_example)

OUTPUT

[1] 10

R

class(sequence_example)

OUTPUT

[1] "integer"

Finally, you can give names to elements in your vector:

R

my_example <- 5:8
names(my_example) <- c("a", "b", "c", "d")
my_example

OUTPUT

a b c d 
5 6 7 8 

R

names(my_example)

OUTPUT

[1] "a" "b" "c" "d"

Challenge 2

Start by making a vector with the numbers 1 through 26. Multiply the vector by 2, and give the resulting vector names A through Z (hint: there is a built in vector called LETTERS)

R

x <- 1:26
x <- x * 2
names(x) <- LETTERS

Factors


We said that columns in data frames were vectors:

R

str(nordic$lifeExp)

OUTPUT

 num [1:3] 77.2 80 79

R

str(nordic$year)

OUTPUT

 int [1:3] 2002 2002 2002

R

str(nordic$country)

OUTPUT

 chr [1:3] "Denmark" "Sweden" "Norway"

Another important data structure in R is called a “factor”. Factors look like character data, but are used to represent data where each element of the vector must be one of a limited number of “levels”. To phrase that another way, factors are an “enumerated” type where there are a finite number of pre-defined values that your vector can have.

For example, let’s make a vector of strings labeling all the Nordic countries in our study:

R

nordic_countries <- nordic$country
str(nordic_countries)

OUTPUT

 chr [1:3] "Denmark" "Sweden" "Norway"

We can turn a vector into a factor like so:

R

categories <- factor(nordic_countries)
str(categories)

OUTPUT

 Factor w/ 3 levels "Denmark","Norway",..: 1 3 2

Now R has noticed that there are 3 possible categories in our data - but it also did something surprising: instead of printing out the strings we gave it, we got a bunch of numbers. R has replaced our human-readable categories with numbered indices under the hood. This is necessary because many statistical calculations use such numerical representations for categorical data.

Challenge 3

Can you guess why these numbers are used to represent these countries?

They are sorted in alphabetical order: each country is stored as the position of its name in the sorted list of levels.
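The alphabetical ordering can be checked directly. In this sketch the country names are typed in by hand so that it runs on its own; they are the same values as in nordic$country.

```r
# Same values as nordic$country, typed by hand for this sketch
nordic_countries <- c("Denmark", "Sweden", "Norway")
categories <- factor(nordic_countries)

# The levels are the unique values, sorted alphabetically
levels(categories)                          # "Denmark" "Norway" "Sweden"

# Each element is stored as the position of its level
as.integer(categories)                      # 1 3 2

# Looking the positions up in the levels recovers the original strings
levels(categories)[as.integer(categories)]  # "Denmark" "Sweden" "Norway"
```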

Challenge 4

Convert the country column of our nordic data frame to a factor. Then try converting it back to a character vector.

Now try converting lifeExp in our nordic data frame to a factor, then back to a numeric vector. What happens if you use as.numeric()?

Remember that you can reload the nordic data frame using read.csv("data/nordic-data.csv") if you accidentally lose some data!

Converting character vectors to factors can be done using the factor() function:

R

nordic$country <- factor(nordic$country)
nordic$country

OUTPUT

[1] Denmark Sweden  Norway 
Levels: Denmark Norway Sweden

You can convert these back to character vectors using as.character():

R

nordic$country <- as.character(nordic$country)
nordic$country

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

You can convert numeric vectors to factors in the exact same way:

R

nordic$lifeExp <- factor(nordic$lifeExp)
nordic$lifeExp

OUTPUT

[1] 77.2 80   79  
Levels: 77.2 79 80

But be careful – you can’t use as.numeric() to convert factors to numerics!

R

as.numeric(nordic$lifeExp)

OUTPUT

[1] 1 3 2

Instead, as.numeric() converts factors to those “numbers under the hood” we talked about. To go from a factor to a number, you need to first turn the factor into a character vector, and then turn that into a numeric vector:

R

nordic$lifeExp <- as.character(nordic$lifeExp)
nordic$lifeExp <- as.numeric(nordic$lifeExp)
nordic$lifeExp

OUTPUT

[1] 77.2 80.0 79.0
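The two conversion steps can also be nested in a single line. The sketch below uses a small hand-made factor (an assumption for illustration) so that it does not modify the nordic data frame.

```r
# Hypothetical factor with the same values as nordic$lifeExp
life_exp_factor <- factor(c(77.2, 80, 79))

# Direct conversion returns the level positions, not the values
as.numeric(life_exp_factor)                  # 1 3 2

# Going through character first recovers the values
life_exp_values <- as.numeric(as.character(life_exp_factor))
life_exp_values                              # 77.2 80.0 79.0
```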

Note: new students find the help files difficult to understand; make sure to let them know that this is typical, and encourage them to take their best guess based on semantic meaning, even if they aren’t sure.

When doing statistical modelling, it’s important to know what the baseline levels are. This is assumed to be the first level, but by default the levels of a factor are ordered alphabetically. You can change this by specifying the levels:

R

mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)

OUTPUT

 Factor w/ 2 levels "control","case": 2 1 1 2

In this case, we’ve explicitly told R that “control” should be represented by 1, and “case” by 2. This designation can be very important for interpreting the results of statistical models!
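If a factor already exists, one alternative to rebuilding it is relevel(), from the stats package that is loaded with R, which moves a chosen level to the front. This is a sketch of that option rather than the approach used above; the object names are invented for illustration.

```r
mydata <- c("case", "control", "control", "case")

# Without explicit levels, the alphabetical default puts "case" first
default_factor <- factor(mydata)
levels(default_factor)       # "case" "control"

# relevel() makes the level named in ref the first (baseline) level
releveled_factor <- relevel(default_factor, ref = "control")
levels(releveled_factor)     # "control" "case"

# The underlying integer codes change to match the new level order
as.integer(releveled_factor) # 2 1 1 2
```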

Lists


Another data structure you’ll want in your bag of tricks is the list. A list is simpler in some ways than the other types, because you can put anything you want in it:

R

list_example <- list(1, "a", TRUE, c(2, 6, 7))
list_example

OUTPUT

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 2 6 7

R

another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
another_list

OUTPUT

$title
[1] "Numbers"

$numbers
 [1]  1  2  3  4  5  6  7  8  9 10

$data
[1] TRUE

We can now understand something a bit surprising about our data frame. Compare str(nordic) and str(another_list):

R

str(nordic)

OUTPUT

'data.frame':	3 obs. of  3 variables:
 $ country: chr  "Denmark" "Sweden" "Norway"
 $ year   : int  2002 2002 2002
 $ lifeExp: num  77.2 80 79

R

str(another_list)

OUTPUT

List of 3
 $ title  : chr "Numbers"
 $ numbers: int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ data   : logi TRUE

We see that the output for these two objects looks very similar. This is because data frames are lists ‘under the hood’. Data frames are a special case of lists in which every element (a column of the data frame) has the same length.
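We can ask R to confirm this relationship. The sketch below builds a small copy of the nordic data by hand (nordic_copy is a made-up name, used so the example runs without the csv file) and checks that it is a list whose elements all have the same length.

```r
# Hypothetical hand-typed copy of the nordic data
nordic_copy <- data.frame(
  country = c("Denmark", "Sweden", "Norway"),
  year = c(2002L, 2002L, 2002L),
  lifeExp = c(77.2, 80, 79),
  stringsAsFactors = FALSE
)

class(nordic_copy)    # "data.frame"
typeof(nordic_copy)   # "list": this is how a data frame is stored internally
is.list(nordic_copy)  # TRUE

# lengths() reports the length of each list element, here 3 for every column
lengths(nordic_copy)
```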

In our nordic example, we have a character, an integer, and a numerical variable. As we have seen already, each column of a data frame is a vector.

R

nordic$country

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

We can also access the contents of the items in a list via indexing. For example, if we wanted to return just the contents of the first object in another_list (the title), we can do that by using double brackets and specifying either the index value of the item or its name.

R

another_list[[1]]

OUTPUT

[1] "Numbers"

R

another_list[["title"]]

OUTPUT

[1] "Numbers"
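Single brackets also work on a list, but they return something different. This short sketch compares the two forms using class(); the list is redefined here so that the example runs on its own.

```r
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

# Single brackets return a shorter list that still wraps the item
class(another_list[1])      # "list"

# Double brackets return the item itself
class(another_list[[1]])    # "character"

# The $ operator is a shorthand for double brackets with a name
another_list$title          # "Numbers"
```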

Challenge 5

There are several subtly different ways to call variables, observations and elements from data frames:

  • nordic[1]
  • nordic[[1]]
  • nordic$country
  • nordic["country"]
  • nordic[1, 1]
  • nordic[, 1]
  • nordic[1, ]

Try out these examples and explain what is returned by each one.

Hint: Use the function class() to examine what is returned in each case.

R

nordic[1]

OUTPUT

  country
1 Denmark
2  Sweden
3  Norway

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.

R

nordic[[1]]

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.

R

nordic$country

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

This example uses the $ character to address items by name. country is the first column of the data frame, again a vector of type character.

R

nordic["country"]

OUTPUT

  country
1 Denmark
2  Sweden
3  Norway

Here we are using a single bracket ["country"], replacing the index number with the column name. As in the first example, the returned object is a list (a data frame with one column).

R

nordic[1, 1]

OUTPUT

[1] "Denmark"

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a character: the first value of the first vector in our nordic object.

R

nordic[, 1]

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column vector.

R

nordic[1, ]

OUTPUT

  country year lifeExp
1 Denmark 2002    77.2

Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a data frame (and therefore a list) containing all the values in the first row.
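Following the hint, the sketch below applies class() to each of the seven expressions. It uses a hand-typed copy of the data (nordic_copy is a made-up name) so that it runs without the csv file; with the real nordic data frame the classes should be the same, provided the country column is still a character vector.

```r
# Hypothetical hand-typed copy of the nordic data
nordic_copy <- data.frame(
  country = c("Denmark", "Sweden", "Norway"),
  year = c(2002L, 2002L, 2002L),
  lifeExp = c(77.2, 80, 79),
  stringsAsFactors = FALSE
)

class(nordic_copy[1])          # "data.frame": one column, still a data frame
class(nordic_copy[[1]])        # "character": the column's contents
class(nordic_copy$country)     # "character"
class(nordic_copy["country"])  # "data.frame"
class(nordic_copy[1, 1])       # "character": a single value
class(nordic_copy[, 1])        # "character": the whole first column
class(nordic_copy[1, ])        # "data.frame": one row, several types
```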

Key Points

  • Use read.csv to read tabular data in R.
  • The basic data types in R are double, integer, complex, logical, and character.
  • Use factors to represent categories in R.