Content from Introduction to R and RStudio


Last updated on 2024-08-19

Estimated time: 25 minutes

  • Leave about 30 minutes at the start of each workshop and another 15 minutes at the start of each session for technical difficulties like WiFi and installing things (even if you asked students to install in advance, longer if not).
  • Be sure to actually go through examples of an R help page: help files can be intimidating at first, but knowing how to read them is tremendously useful.
  • Don’t worry about being correct or knowing the material back-to-front. Use mistakes as teaching moments: the most vital skill you can impart is how to debug and recover from unexpected errors.

Overview

Questions

  • How do I find my way around RStudio?
  • How do I interact with R?
  • How do I install packages?

Objectives

  • Describe the purpose and use of each pane in the RStudio IDE
  • Locate buttons and options in the RStudio IDE
  • Define a variable
  • Assign data to a variable
  • Use mathematical and comparison operators
  • Call functions
  • Manage packages

Motivation


  • This lesson is designed to introduce learners to the core concepts of R that they will need in order to complete the other lessons in this workshop.
  • It is intended for learners who have no prior experience with R. If your workshop learners have all completed another Software or Data Carpentry R workshop, or have taken courses in R, you can skip this lesson and move straight into the [Introduction to Geospatial Raster and Vector Data with R](https://datacarpentry.org/r-raster-vector-geospatial/) lesson.
  • This lesson is a trimmed-down version of the R for Reproducible Scientific Analysis SWC lesson. It does not cover visualization in detail, as the later lesson in this workshop covers visualization in the context of geospatial data.

Science is a multi-step process: once you’ve designed an experiment and collected data, the real fun begins! This lesson will teach you how to start this process using R and RStudio. We will begin with raw data, perform exploratory analyses, and learn how to plot results graphically. This example starts with a dataset from gapminder.org containing population information for many countries through time. Can you read the data into R? Can you plot the population for Senegal? Can you calculate the average income for countries on the continent of Asia? By the end of these lessons you will be able to do things like plot the populations for all of these countries in under a minute!

Before Starting The Workshop


Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages used in the workshop may not install correctly (or at all) if R is not up to date.

  • If your workshop includes the Introduction to Geospatial Concepts lesson, learners will have just been introduced to RStudio in the context of the overall Geospatial software landscape.
  • Have your learners open RStudio and follow along as you explain each pane. Make sure that your RStudio environment is the default so learners can follow along.
  • Be sure to explain how to execute code from the script window, whether you’re using the Run button or the keyboard shortcut.
  • Learners will be using several packages in the next lesson, so be sure to introduce what a package is and how it is installed.

Introduction to RStudio


Throughout this lesson, we’re going to teach you some of the fundamentals of the R language as well as some best practices for organizing code for scientific projects that will make your life easier.

We’ll be using RStudio: a free, open-source R integrated development environment (IDE). It provides a built-in editor, works on all platforms (including on servers), and offers many advantages such as integration with version control and project management.

Basic layout

When you first open RStudio, you will be greeted by three panels:

  • The interactive R console (entire left)
  • Environment/History (tabbed in upper right)
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
RStudio layout

Once you open files, such as R scripts, an editor panel will also open in the top left.

RStudio layout with .R file open

Workflow within RStudio


There are two main ways one can work within RStudio.

  1. Test and play within the interactive R console, then copy code into a .R file to run later.
  • This works well when doing small tests and initially starting off.
  • It quickly becomes laborious.
  2. Start writing in a .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines, or modified lines to the interactive R console.
  • This is a great way to start; all your code is saved for later.
  • You will be able to run the file you create from within RStudio or using R’s source() function.

Tip: Running segments of your code

RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can

  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Enter in Windows, Ctrl+Return in Linux, or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button). To run a block of code, select it and then Run.

Introduction to R


Much of your time in R will be spent in the R interactive console. This is where you will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file. This console in RStudio is the same as the one you would get if you typed in R in your command-line environment.

The first thing you will see in the R interactive session is a bunch of information, followed by a “>” and a blinking cursor. When you are running a section of your code, this is the location where R will first read your code, attempt to execute it, and then return a result.

Using R as a calculator


The simplest thing you can do with R is arithmetic:

R

1 + 100

OUTPUT

[1] 101

And R will print out the answer, with a preceding “[1]”. Don’t worry about this for now; we’ll explain it later. For now, think of it as indicating output.

Like bash, if you type in an incomplete command, R will wait for you to complete it:

R

> 1 +

OUTPUT

+

Any time you hit return and the R session shows a “+” instead of a “>”, it means it’s waiting for you to complete the command. If you want to cancel a command you can simply hit “Esc” and RStudio will give you back the “>” prompt.

Tip: Cancelling commands

If you’re using R from the command line instead of from within RStudio, you need to use Ctrl+C instead of Esc to cancel the command. This applies to Mac users as well!

Cancelling a command isn’t only useful for killing incomplete commands: you can also use it to tell R to stop running code (for example if it’s taking much longer than you expect), or to get rid of the code you’re currently writing.

When using R as a calculator, the order of operations is the same as you would have learned back in school.

From highest to lowest precedence:

  • Parentheses: ( )
  • Exponents: ^ or **
  • Divide: /
  • Multiply: *
  • Add: +
  • Subtract: -

R

3 + 5 * 2

OUTPUT

[1] 13

Use parentheses to group operations in order to force the order of evaluation if it differs from the default, or to make clear what you intend.

R

(3 + 5) * 2

OUTPUT

[1] 16

This can get unwieldy when not needed, but clarifies your intentions. Remember that others may later read your code.

R

(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2       # clear, if you remember the rules
3 + 5 * (2 ^ 2)     # if you forget some rules, this might help

The text after each line of code is called a “comment”. Anything that follows after the hash (or octothorpe) symbol # is ignored by R when it executes code.

Really small or large numbers are displayed in scientific notation:

R

2/10000

OUTPUT

[1] 2e-04

This is shorthand for “multiplied by 10^XX”. So 2e-04 is shorthand for 2 * 10^(-4).

You can write numbers in scientific notation too:

R

5e3  # Note the lack of minus here

OUTPUT

[1] 5000

Don’t worry about trying to remember every function in R. You can look them up using a search engine, or if you can remember the start of the function’s name, use the tab completion in RStudio.

This is one advantage that RStudio has over R on its own: it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.

Typing a ? before the name of a command will open the help page for that command. As well as providing a detailed description of the command and how it works, scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage. We’ll go through an example later.
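For instance, typing the following opens the documentation for the round() function (any base R function works here):

R

?round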

Comparing things


We can also do comparison in R:

R

1 == 1  # equality (note two equals signs, read as "is equal to")

OUTPUT

[1] TRUE

R

1 != 2  # inequality (read as "is not equal to")

OUTPUT

[1] TRUE

R

1 < 2  # less than

OUTPUT

[1] TRUE

R

1 <= 1  # less than or equal to

OUTPUT

[1] TRUE

R

1 > 0  # greater than

OUTPUT

[1] TRUE

R

1 >= -9 # greater than or equal to

OUTPUT

[1] TRUE

Tip: Comparing Numbers

A word of warning about comparing numbers: you should never use == to compare two numbers unless they are integers (a data type which can specifically represent only whole numbers).

Computers may only represent decimal numbers with a certain degree of precision, so two numbers which look the same when printed out by R may actually have different underlying representations and therefore differ by a small margin of error (called machine numeric tolerance).

Instead you should use the all.equal() function.
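For example, a quick illustration of the problem and the fix:

R

0.1 + 0.2 == 0.3   # looks like it should be TRUE...

OUTPUT

[1] FALSE

R

all.equal(0.1 + 0.2, 0.3)   # compares within a small tolerance

OUTPUT

[1] TRUE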

Further reading: http://floating-point-gui.de/

Variables and assignment


We can store values in variables using the assignment operator <-, like this:

R

x <- 1/40

Notice that assignment does not print a value. Instead, we stored it for later in something called a variable. x now contains the value 0.025:

R

x

OUTPUT

[1] 0.025

More precisely, the stored value is a decimal approximation of this fraction called a floating point number.

Look for the Environment tab in one of the panes of RStudio, and you will see that x and its value have appeared. Our variable x can be used in place of a number in any calculation that expects a number:

R

log(x)

OUTPUT

[1] -3.688879

Notice also that variables can be reassigned:

R

x <- 100

x used to contain the value 0.025 and now it has the value 100.

The right hand side of an assignment can even refer to the variable being assigned to:

R

x <- x + 1 # notice how RStudio updates its description of x on the top right tab
y <- x * 2

The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.

Challenge 1

What will be the value of each variable after each statement in the following program?

R

mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20

R

mass <- 47.5

This will give a value of 47.5 for the variable mass.

R

age <- 122

This will give a value of 122 for the variable age.

R

mass <- mass * 2.3

This will multiply the existing value of 47.5 by 2.3 to give a new value of 109.25 to the variable mass.

R

age <- age - 20

This will subtract 20 from the existing value of 122 to give a new value of 102 to the variable age.

Challenge 2

Run the code from the previous challenge, and write a command to compare mass to age. Is mass larger than age?

One way of answering this question in R is to use the > operator to compare the two variables:

R

mass > age

OUTPUT

[1] TRUE

This should yield a logical value of TRUE, since 109.25 is greater than 102.

Variable names can contain letters, numbers, underscores, and periods. They cannot start with a number, nor can they contain spaces. Different people use different conventions for long variable names; these include:

  • periods.between.words
  • underscores_between_words
  • camelCaseToSeparateWords

What you use is up to you, but be consistent.

It is also possible to use the = operator for assignment:

R

x = 1/40

But this is much less common among R users. The most important thing is to be consistent with the operator you use. There are occasionally places where it is less confusing to use <- than =, and it is the most common symbol used in the community. So the recommendation is to use <-.

Challenge 3

Which of the following are valid R variable names?

R

min_height
max.height
_age
.mass
MaxLength
min-length
2widths
celsius2kelvin

The following can be used as R variables:

R

min_height
max.height
MaxLength
celsius2kelvin

The following creates a hidden variable:

R

.mass

We won’t be discussing hidden variables in this lesson. We recommend not using a period at the beginning of variable names unless you intend your variables to be hidden.

The following cannot be used to create a variable:

R

_age
min-length
2widths

Installing Packages


We can use R as a calculator to do mathematical operations (e.g., addition, subtraction, multiplication, division), as we did above. However, we can also use R to carry out more complicated analyses, make visualizations, and much more. In later episodes, we’ll use R to do some data wrangling, plotting, and saving of reformatted data.

R coders around the world have developed collections of R code to accomplish themed tasks (e.g., data wrangling). These collections of R code are known as R packages. It is important to note that R packages are not automatically downloaded when we install R on our computer. Therefore, we’ll have to install each R package that we want to use (more on this below).

We will practice using the dplyr package to wrangle our datasets in episode 6 and will also practice using the ggplot2 package to plot our data in episode 7. To give an example, the dplyr package includes code for a function called filter(). A function is something that takes input(s), does some internal operations, and produces output(s). For the filter() function, the inputs are a dataset and a logical statement (e.g., keep rows where a data value is greater than or equal to 100), and the output is the data within the dataset that satisfies the logical statement.
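As a quick preview (this assumes dplyr has already been installed and loaded, which we cover next), filter() could be used like this on a tiny made-up data frame:

R

library(dplyr)
toy <- data.frame(value = c(50, 100, 150))  # hypothetical example data
filter(toy, value >= 100)                   # keep rows where value is at least 100

OUTPUT

  value
1   100
2   150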

There are two main ways to install packages in R:

  1. If we are using RStudio, we can go to Tools > Install Packages... and then search for the name of the R package we need and click Install.

  2. We can use the install.packages() function. We can do this to install the dplyr R package.

R

install.packages("dplyr")

OUTPUT

The following package(s) will be installed:
- dplyr [1.1.4]
These packages will be installed into "~/Documents/2024-07-01-ucsb-intro-geospatial/renv/profiles/lesson-requirements/renv/library/R-4.3/aarch64-apple-darwin20".

# Installing packages --------------------------------------------------------
- Installing dplyr ...                          OK [linked from cache]
Successfully installed 1 package in 5.6 milliseconds.

It’s important to note that we only need to install an R package on our computer once. However, if we install a new version of R on the same computer, we will likely need to re-install the R packages too.

Challenge 4

What code would we use to install the ggplot2 package?

We would use the following R code to install the ggplot2 package:

R

install.packages("ggplot2")

Now that we’ve installed the R package, we’re ready to use it! To use the R package, we need to “load” it into our R session. We can think of “loading” an R package as telling R that we’re ready to use the package we just installed. It’s important to note that while we only have to install the package once, we’ll have to load the package each time we open R (or RStudio).

To load an R package, we use the library() function. We can load the dplyr package like this:

R

library(dplyr)

OUTPUT


Attaching package: 'dplyr'

OUTPUT

The following objects are masked from 'package:stats':

    filter, lag

OUTPUT

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Challenge 5

Which of the following could we use to load the ggplot2 package? (Select all that apply.)

  1. install.packages("ggplot2")
  2. library("ggplot2")
  3. library(ggplot2)
  4. library(ggplo2)

The correct answers are b and c. Answer a will install, not load, the ggplot2 package. Answer b will correctly load the ggplot2 package; note the quotation marks. Answer c will also correctly load the ggplot2 package; note the lack of quotation marks. Answer d will produce an error because ggplot2 is misspelled.

Note: It is more common for coders not to use quotation marks when loading an R package (i.e., answer c).

R

library(ggplot2)

Key Points

  • Use RStudio to write and run R programs.
  • R has the usual arithmetic operators.
  • Use <- to assign values to variables.
  • Use install.packages() to install packages (libraries).

Content from Project Management With RStudio


Last updated on 2024-08-19

Estimated time: 15 minutes

Overview

Questions

  • How can I manage my projects in R?

Objectives

  • Create self-contained projects in RStudio

Introduction


The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together. Organising a project involving spatial data is no different from any other data analysis project, although you may require more disk space than usual.

Most people tend to organize their projects like this:

A screenshot of a project folder containing multiple versions of data, analysis scripts, figures, and results files

There are many reasons why we should ALWAYS avoid this:

  1. It is really hard to tell which version of your data is the original and which is the modified version;
  2. It gets really messy because it mixes files with various extensions together;
  3. It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate them.

A good project layout will ultimately make your life easier:

  • It will help ensure the integrity of your data;
  • It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
  • It allows you to easily upload your code with your manuscript submission;
  • It makes it easier to pick the project back up after a break.

A possible solution


Fortunately, there are tools and packages which can help you manage your work effectively.

One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using this today to create a self-contained, reproducible project.

Make sure learners download the data files in Challenge 1 and move those files to their data/ directory.

When learners load an RStudio project, their R session’s working directory should automatically be set to the same folder as the .Rproj file. We’ll be using relative paths throughout the lesson to refer to files, so it’s important to make sure that learners have loaded the right project and are in the right directory! You may also want to introduce other ways to make file paths, such as the here package, after creating the project.
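A minimal sketch of that alternative, assuming the here package is installed:

R

library(here)
# here() builds a path relative to the project root,
# regardless of where the calling script lives
here("data", "nordic-data.csv")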

Challenge: Creating a self-contained project

We’re going to create a new project in RStudio:

  1. Click the “File” menu button, then “New Project”.
  2. Click “New Directory”.
  3. Click “New Project”.
  4. Type in “r-geospatial” as the name of the directory.
  5. Click the “Create Project” button.

A key advantage of an RStudio Project is that whenever we open this project in subsequent RStudio sessions our working directory will always be set to the folder r-geospatial. Let’s check our working directory by entering the following into the R console:

R

getwd()

R should return your/path/r-geospatial as the working directory.

Best practices for project organization


Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:

Treat data as read only

This is probably the most important goal of setting up a project. Data is typically time-consuming and/or expensive to collect. Working with data interactively (e.g., in Excel), where it can be modified, means you are never sure where the data came from or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.

Data Cleaning

In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. I find it useful to store these scripts in a separate folder, and create a second “read-only” data folder to hold the “cleaned” data sets.

Treat generated output as disposable

Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.

There are lots of different ways to manage this output. I find it useful to have an output folder with different sub-directories for each separate analysis. This makes it easier later, as many of my analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.

Some GIS file formats are really 3-6 files that need to be kept together and have the same name, e.g. shapefiles. It may be tempting to store those components separately, but your spatial data will be unusable if you do that.

Keep a consistent naming scheme

It is generally best to avoid renaming downloaded spatial data, so that a clear connection is maintained with the point of truth. You may otherwise find yourself wondering whether file_A really is just a copy of Official_file_on_website or not.

For datasets you generate, it’s worth taking the time to come up with a naming convention that works for your project, and sticking to it. File names don’t have to be long, they just have to be long enough that you can tell what the file is about. Date generated, topic, and whether a product is intermediate or final are good bits of information to keep in a file name. For more tips on naming files, check out the slides from Jenny Bryan’s talk “Naming things” at the 2015 Reproducible Science Workshop.

Tip: Good Enough Practices for Scientific Computing

Good Enough Practices for Scientific Computing gives the following recommendations for project organization:

  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
  4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
  5. Name all files to reflect their content or function.

Save the data in the data directory

Now that we have a good directory structure, we will place/save our data files in the data/ directory.

Challenge 1

1. Download each of the data files listed below (Ctrl+S, right mouse click -> “Save as”, or File -> “Save page as”)

2. Make sure the files have the following names:

  • nordic-data.csv
  • nordic-data-2.csv
  • gapminder_data.csv

3. Save the files in the data/ folder within your project.

We will load and inspect these data later.

Challenge 2

We also want to move the data that we downloaded previously for this workshop into a subdirectory inside r-geospatial. If you haven’t already downloaded the data, you can do so by clicking this download link.

  1. Move the downloaded zip file to the data directory.
  2. Once the data have been moved, unzip all files.

Once you have completed moving the data across to the new folder, your data directory should look as follows:

data/
   gapminder_data.csv
   NEON-DS-Airborne-Remote-Sensing/
   NEON-DS-Landsat-NDVI/
   NEON-DS-Met-Time-Series/
   NEON-DS-Site-Layout-Files/
   NEON-DS-Airborne-Remote-Sensing.zip
   NEON-DS-Landsat-NDVI.zip
   NEON-DS-Met-Time-Series.zip
   NEON-DS-Site-Layout-Files.zip
   nordic-data.csv
   nordic-data-2.csv

Stage your scripts

Creating separate R scripts or R Markdown documents for different stages of a project will maximise efficiency. For instance, separating data download commands into their own file (see the sketch below) means that you won’t re-download data unnecessarily.
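For example, a stand-alone download script might look like this (the URL here is a placeholder, not a real data source):

R

# download_data.R -- run once to fetch the raw data
download.file(url      = "https://example.com/gapminder_data.csv",  # placeholder URL
              destfile = "data/gapminder_data.csv")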

Key Points

  • Use RStudio to create and manage projects with consistent layout.
  • Treat raw data as read-only.
  • Treat generated output as disposable.

Content from Data Structures


Last updated on 2024-08-19

Estimated time: 55 minutes

Overview

Questions

  • How can I read data in R?
  • What are the basic data types in R?
  • How do I represent categorical information in R?

Objectives

  • To be aware of the different types of data.
  • To begin exploring data frames, and understand how they are related to vectors, factors and lists.
  • To be able to ask questions from R about the type, class, and structure of an object.

Let’s start by creating a new R script and saving it to the scripts folder in our project directory. We will create new scripts for each episode in this workshop.

We can create a new R script by clicking the button at the top left of our RStudio, the one that looks like a piece of paper with a green plus sign next to it. On the drop down, click “R Script”. We can also create an R script using File > New file > R script.

We can add a comment to our script to remind us what we’re working on:

R

# R script for data structures

Comments are useful notes to us that are ignored by the computer.

Now that we’ve got the basics covered, we can move on to the lesson. One of R’s most powerful features is its ability to deal with tabular data - like you may already have in a spreadsheet or a CSV file. Let’s start by downloading and reading in a file nordic-data.csv. We will save this data as an object named nordic:

R

nordic <- read.csv("data/nordic-data.csv")

Tip: Running segments of your code

RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can

  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Enter in Windows, Ctrl+Return in Linux, or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button). To run a block of code, select it and then Run.

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, such as CSV files (CSV = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in CSV files. For convenience, R provides two other versions of read.table: read.csv for files where the data are separated with commas, and read.delim for files where the data are separated with tabs. Of these three functions, read.csv is the most commonly used. If needed, it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.
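Because these functions share the same underlying machinery, the two calls below read a well-formed CSV file in essentially the same way (a quick illustration):

R

nordic <- read.csv("data/nordic-data.csv")
nordic <- read.table("data/nordic-data.csv", sep = ",", header = TRUE)  # equivalent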

We can begin exploring our dataset right away, pulling out columns by specifying them using the $ operator:

R

nordic$country

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

R

nordic$lifeExp

OUTPUT

[1] 77.2 80.0 79.0

We can do other operations on the columns. For example, if we discovered that the life expectancy is two years higher:

R

nordic$lifeExp + 2

OUTPUT

[1] 79.2 82.0 81.0

But what about:

R

nordic$lifeExp + nordic$country

ERROR

Error in nordic$lifeExp + nordic$country: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing data in R.

Data Types


  • Learners will work with factors in the following lesson. Be sure to cover this concept.
  • If needed for time reasons, you can skip the section on lists. The learners don’t use lists in the rest of the workshop.

If you guessed that the last command will return an error because 77.2 plus "Denmark" is nonsense, you’re right - and you already have some intuition for an important concept in programming called data classes. We can ask what class of data something is:

R

class(nordic$lifeExp)

OUTPUT

[1] "numeric"

There are 6 main types:

  • numeric: values like 1.254, 0.8, -7.2
  • integer: values like 1, 10, 150
  • complex: values like 1+1i
  • logical: TRUE or FALSE
  • character: values like “cats”, “dogs”, and “animals”
  • factor: categories with all possible values, like the months of the year

No matter how complicated our analyses become, all data in R is interpreted as a specific data class. This strictness has some really important consequences.

A user has added new details of life expectancy. This information is in the file data/nordic-data-2.csv.

Load the new nordic data as nordic_2, and check what class of data we find in the lifeExp column:

R

nordic_2 <- read.csv("data/nordic-data-2.csv")
class(nordic_2$lifeExp)

OUTPUT

[1] "character"

Oh no, our life expectancy values in lifeExp aren’t the numeric type anymore! If we try to do the same math we did on them before, we run into trouble:

R

nordic_2$lifeExp + 2

ERROR

Error in nordic_2$lifeExp + 2: non-numeric argument to binary operator

What happened? When R reads a csv file into one of these tables, it insists that everything in a column be the same class; if it can’t understand everything in the column as numeric, then nothing in the column gets to be numeric. The table that R loaded our nordic data into is something called a data frame, and it is our first example of something called a data structure - that is, a structure which R knows how to build out of the basic data types.

We can look at the data within R by clicking the object in our environment or using View(nordic). We see that the lifeExp column now has the value “79.0 or 83” in the third row. R needed all values in that column to be the same type, so it forced everything to be a character instead of a number.

In order to successfully use our data in R, we need to understand what the basic data structures are, and how they behave.

Vectors and Type Coercion


To better understand this behavior, let’s meet another of the data structures: the vector.

A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type. If you don’t choose the data type, it’ll default to logical; or, you can declare an empty vector of whatever type you like.

You can specify a vector either using the vector() function or the combine (c()) function.

R

vector(length = 3) # this creates an empty vector of logical values

OUTPUT

[1] FALSE FALSE FALSE

R

c(1, 2, 3) # this creates a vector with numerical values

OUTPUT

[1] 1 2 3

R

c("this", "that", "the other") # this creates a vector with character values

OUTPUT

[1] "this"      "that"      "the other"

R

another_vector <- vector(mode = 'character', length = 3)
another_vector

OUTPUT

[1] "" "" ""

You can check if something is a vector using str which asks for an object’s structure:

R

str(another_vector)

OUTPUT

 chr [1:3] "" "" ""

The somewhat cryptic output from this command indicates the basic data type found in this vector - in this case chr, character; an indication of the number of things in the vector - actually, the indexes of the vector, in this case [1:3]; and a few examples of what’s actually in the vector - in this case empty character strings. If we similarly do

R

str(nordic$lifeExp)

OUTPUT

 num [1:3] 77.2 80 79

we see that nordic$lifeExp is a vector, too - the columns of data we load into R data frames are all vectors, and that’s the root of why R forces everything in a column to be the same basic data type.

Discussion 1

Why is R so opinionated about what we put in our columns of data? How does this help us?

By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.

Given what we’ve learned so far, what do you think the following will produce?

R

quiz_vector <- c(2, 6, '3')
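Printing quiz_vector shows that R has converted every element to a character string:

R

quiz_vector

OUTPUT

[1] "2" "6" "3"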

This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them. When R encounters a mix of types (here numeric and character) to be combined into a single vector, it will force them all to be the same type. Consider:

R

coercion_vector <- c('a', TRUE)
coercion_vector

OUTPUT

[1] "a"    "TRUE"

The coercion rules go: logical -> integer -> numeric -> complex -> character, where -> can be read as are transformed into. You can try to force coercion against this flow using the as. functions:

R

character_vector_example <- c('0', '2', '4')
character_vector_example

OUTPUT

[1] "0" "2" "4"

R

character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric

OUTPUT

[1] 0 2 4

As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data frames, or you will get nasty surprises!
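For example, coercing a character string that R cannot interpret as a number produces NA, along with a warning (a sketch of the behavior):

R

as.numeric(c('0', '2', 'four'))

WARNING

Warning: NAs introduced by coercion

OUTPUT

[1]  0  2 NA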

Challenge 1

Given what you now know about type conversion, look at the class of data in nordic_2$lifeExp and compare it with nordic$lifeExp. Why are these columns different classes?

R

str(nordic_2$lifeExp)

OUTPUT

 chr [1:3] "77.2" "80" "79.0 or 83"

R

str(nordic$lifeExp)

OUTPUT

 num [1:3] 77.2 80 79

The data in nordic_2$lifeExp is stored as a character vector, rather than as a numeric vector. This is because of the “or” character string in the third data point.

The combine function, c(), will also append things to an existing vector:

R

ab_vector <- c('a', 'b')
ab_vector

OUTPUT

[1] "a" "b"

R

combine_example <- c(ab_vector, 'DC')
combine_example

OUTPUT

[1] "a"  "b"  "DC"

You can also make series of numbers:

R

my_series <- 1:10
my_series

OUTPUT

 [1]  1  2  3  4  5  6  7  8  9 10

R

seq(from = 1, to = 10, by = 0.1)

OUTPUT

 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0

We can ask a few questions about vectors:

R

sequence_example <- 1:10
head(sequence_example)

OUTPUT

[1] 1 2 3 4 5 6

R

tail(sequence_example)

OUTPUT

[1]  5  6  7  8  9 10

R

length(sequence_example)

OUTPUT

[1] 10

R

class(sequence_example)

OUTPUT

[1] "integer"

Finally, you can give names to elements in your vector:

R

my_example <- 5:8
names(my_example) <- c("a", "b", "c", "d")
my_example

OUTPUT

a b c d 
5 6 7 8 

R

names(my_example)

OUTPUT

[1] "a" "b" "c" "d"

Challenge 2

Start by making a vector with the numbers 1 through 26. Multiply the vector by 2, and give the resulting vector names A through Z (hint: there is a built-in vector called LETTERS).

R

x <- 1:26
x <- x * 2
names(x) <- LETTERS

Factors


We said that columns in data frames were vectors:

R

str(nordic$lifeExp)

OUTPUT

 num [1:3] 77.2 80 79

R

str(nordic$year)

OUTPUT

 int [1:3] 2002 2002 2002

R

str(nordic$country)

OUTPUT

 chr [1:3] "Denmark" "Sweden" "Norway"

One final important data structure in R is called a “factor”. Factors look like character data, but are used to represent data where each element of the vector must be one of a limited number of “levels”. To phrase that another way, factors are an “enumerated” type where there are a finite number of pre-defined values that your vector can have.

For example, let’s make a vector of strings labeling nordic countries for all the countries in our study:

R

nordic_countries <- nordic$country
str(nordic_countries)

OUTPUT

 chr [1:3] "Denmark" "Sweden" "Norway"

We can turn a vector into a factor like so:

R

categories <- factor(nordic_countries)
str(categories)

OUTPUT

 Factor w/ 3 levels "Denmark","Norway",..: 1 3 2

Now R has noticed that there are 3 possible categories in our data - but it also did something surprising; instead of printing out the strings we gave it, we got a bunch of numbers instead. R has replaced our human-readable categories with numbered indices under the hood; this is necessary because many statistical calculations use such numerical representations for categorical data.

Challenge 3

Can you guess why these numbers are used to represent these countries?

They are sorted in alphabetical order.

Challenge 4

Convert the country column of our nordic data frame to a factor. Then try converting it back to a character vector.

Now try converting lifeExp in our nordic data frame to a factor, then back to a numeric vector. What happens if you use as.numeric()?

Remember that you can reload the nordic data frame using read.csv("data/nordic-data.csv") if you accidentally lose some data!

Converting character vectors to factors can be done using the factor() function:

R

nordic$country <- factor(nordic$country)
nordic$country

OUTPUT

[1] Denmark Sweden  Norway 
Levels: Denmark Norway Sweden

You can convert these back to character vectors using as.character():

R

nordic$country <- as.character(nordic$country)
nordic$country

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

You can convert numeric vectors to factors in the exact same way:

R

nordic$lifeExp <- factor(nordic$lifeExp)
nordic$lifeExp

OUTPUT

[1] 77.2 80   79  
Levels: 77.2 79 80

But be careful – you can’t use as.numeric() to convert factors to numerics!

R

as.numeric(nordic$lifeExp)

OUTPUT

[1] 1 3 2

Instead, as.numeric() converts factors to those “numbers under the hood” we talked about. To go from a factor to a number, you need to first turn the factor into a character vector, and then turn that into a numeric vector:

R

nordic$lifeExp <- as.character(nordic$lifeExp)
nordic$lifeExp <- as.numeric(nordic$lifeExp)
nordic$lifeExp

OUTPUT

[1] 77.2 80.0 79.0

Note: new students find the help files difficult to understand; make sure to let them know that this is typical, and encourage them to take their best guess based on semantic meaning, even if they aren’t sure.

When doing statistical modelling, it’s important to know what the baseline level is. This is assumed to be the first factor level, but by default factor levels are ordered alphabetically. You can change this by specifying the levels:

R

mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)

OUTPUT

 Factor w/ 2 levels "control","case": 2 1 1 2

In this case, we’ve explicitly told R that “control” should be represented by 1, and “case” by 2. This designation can be very important for interpreting the results of statistical models!

Lists


Another data structure you’ll want in your bag of tricks is the list. A list is simpler in some ways than the other types, because you can put anything you want in it:

R

list_example <- list(1, "a", TRUE, c(2, 6, 7))
list_example

OUTPUT

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 2 6 7

R

another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
another_list

OUTPUT

$title
[1] "Numbers"

$numbers
 [1]  1  2  3  4  5  6  7  8  9 10

$data
[1] TRUE

We can now understand something a bit surprising in our data frame; what happens if we compare str(nordic) and str(another_list):

R

str(nordic)

OUTPUT

'data.frame':	3 obs. of  3 variables:
 $ country: chr  "Denmark" "Sweden" "Norway"
 $ year   : int  2002 2002 2002
 $ lifeExp: num  77.2 80 79

R

str(another_list)

OUTPUT

List of 3
 $ title  : chr "Numbers"
 $ numbers: int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ data   : logi TRUE

We see that the output for these two objects looks very similar. This is because data frames are lists ‘under the hood’. Data frames are a special case of lists where each element (each column of the data frame) has the same length.
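We can confirm this by asking R for the underlying storage type of our data frame:

R

typeof(nordic)

OUTPUT

[1] "list"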

In our nordic example, we have a character, an integer, and a numeric variable. As we have seen already, each column of a data frame is a vector.

R

nordic$country

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

We can also access the contents of the items in a list via indexing. For example, if we wanted to return just the contents of the first item in another_list (the title), we can do that by using double brackets and specifying either the index value of the item or its name.

R

another_list[[1]]

OUTPUT

[1] "Numbers"

R

another_list[["title"]]

OUTPUT

[1] "Numbers"

Challenge 5

There are several subtly different ways to call variables, observations and elements from data frames:

  • nordic[1]
  • nordic[[1]]
  • nordic$country
  • nordic["country"]
  • nordic[1, 1]
  • nordic[, 1]
  • nordic[1, ]

Try out these examples and explain what is returned by each one.

Hint: Use the function class() to examine what is returned in each case.

R

nordic[1]

OUTPUT

  country
1 Denmark
2  Sweden
3  Norway

We can think of a data frame as a list of vectors. The single brace [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.

R

nordic[[1]]

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

The double brace [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.

R

nordic$country

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

This example uses the $ character to address items by name. country is the first column of the data frame, again a vector of type character.

R

nordic["country"]

OUTPUT

  country
1 Denmark
2  Sweden
3  Norway

Here we are using a single brace ["country"] replacing the index number with the column name. Like example 1, the returned object is a list.

R

nordic[1, 1]

OUTPUT

[1] "Denmark"

This example uses a single brace, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a character: the first value of the first vector in our nordic object.

R

nordic[, 1]

OUTPUT

[1] "Denmark" "Sweden"  "Norway" 

Like the previous example, we use single braces and provide row and column coordinates. The row coordinate is not specified; R interprets this missing value as all the elements in this column vector.

R

nordic[1, ]

OUTPUT

  country year lifeExp
1 Denmark 2002    77.2

Again we use the single brace with row and column coordinates. The column coordinate is not specified. The return value is a list containing all the values in the first row.

Key Points

  • Use read.csv to read tabular data in R.
  • The basic data types in R are double, integer, complex, logical, and character.
  • Use factors to represent categories in R.

Content from Exploring Data Frames


Last updated on 2024-08-19

Estimated time: 30 minutes

Overview

Questions

  • How can I manipulate a data frame?

Objectives

  • Append two data frames.
  • Understand what a factor is.
  • Convert a factor to a character vector and vice versa.
  • Display basic properties of data frames including size and class of the columns, names, and first few rows.

At this point, you’ve seen it all: in the last lesson, we toured all the basic data types and data structures in R. Everything you do will be a manipulation of those tools. But most of the time, the star of the show is the data frame—the table that we created by loading information from a csv file. In this lesson, we’ll learn a few more things about working with data frames.

Realistic example


We already learned that the columns of a data frame are vectors, so that our data are consistent in type throughout the columns. So far, you have seen the basics of manipulating data frames with our nordic data; now let’s use those skills to digest a more extensive dataset. Let’s read in the gapminder dataset that we downloaded previously:

Pay attention to and explain the errors and warnings generated from the examples in this episode.

R

gapminder <- read.csv("data/gapminder_data.csv")

Let’s investigate the gapminder data frame a bit; the first thing we should always do is check out what the data looks like with str:

R

str(gapminder)

OUTPUT

'data.frame':	1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

We can also interrogate the data frame for information about its dimensions; remembering that str(gapminder) said there were 1704 observations of 6 variables in gapminder, what do you think the following will produce, and why?

R

length(gapminder)
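OUTPUT

[1] 6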

A fair guess would have been to say that the length of a data frame would be the number of rows it has (1704), but this is not the case; it gives us the number of columns.

To get the number of rows and columns in our dataset, try:

R

nrow(gapminder)

OUTPUT

[1] 1704

R

ncol(gapminder)

OUTPUT

[1] 6

Or, both at once:

R

dim(gapminder)

OUTPUT

[1] 1704    6

We’ll also likely want to know what the titles of all the columns are, so we can ask for them later:

R

colnames(gapminder)

OUTPUT

[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

At this stage, it’s important to ask ourselves if the structure R is reporting matches our intuition or expectations; do the basic data types reported for each column make sense? If not, we need to sort any problems out now before they turn into bad surprises down the road, using what we’ve learned about how R interprets data, and the importance of strict consistency in how we record our data.

Once we’re happy that the data types and structures seem reasonable, it’s time to start digging into our data proper. Check out the first few lines:

R

head(gapminder)

OUTPUT

      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Challenge 1

It’s good practice to also check the last few lines of your data and some in the middle. How would you do this?

Searching for rows specifically in the middle isn’t too hard, but we could simply ask for a few lines at random. How would you code this?

To check the last few lines it’s relatively simple as R already has a function for this:

R

tail(gapminder)
tail(gapminder, n = 15)

What about a few arbitrary rows just for sanity (or insanity depending on your view)?

There are several ways to achieve this.

The solution here presents one form using nested functions, i.e., a function passed as an argument to another function. This might sound like a new concept, but in fact you are already using it.

Remember my_dataframe[rows, cols] will print to screen your data frame with the number of rows and columns you asked for (although you might have asked for a range or named columns for example). How would you get the last row if you don’t know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this.

R

gapminder[sample(nrow(gapminder), 5), ]

Challenge 2

Read the output of str(gapminder) again; this time, use what you’ve learned about factors and vectors, as well as the output of functions like colnames and dim to explain what everything that str prints out for gapminder means. If there are any parts you can’t interpret, discuss with your neighbors!

The object gapminder is a data frame with columns:

  • country and continent are character vectors.
  • year is an integer vector.
  • pop, lifeExp, and gdpPercap are numeric vectors.

Adding columns and rows in data frames


We would like to create a new column to hold information on whether the life expectancy is below the world average life expectancy (70.5) or above:

R

below_average <- gapminder$lifeExp < 70.5
head(below_average)

OUTPUT

[1] TRUE TRUE TRUE TRUE TRUE TRUE

We can then add this as a column via:

R

cbind(gapminder, below_average)

Note that if we tried to add a vector of below_average with a different number of entries than the number of rows in the data frame, it would fail:

R

below_average <- c(TRUE, TRUE, TRUE, TRUE, TRUE)
cbind(gapminder, below_average)

ERROR

Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 1704, 5

Why didn’t this work? R wants to see one element in our new column for every row in the table:

R

nrow(gapminder)

OUTPUT

[1] 1704

R

length(below_average)

OUTPUT

[1] 5

So for this to work, we need either nrow(gapminder) to equal length(below_average), or nrow(gapminder) to be a multiple of length(below_average):

R

below_average <- c(TRUE, TRUE, FALSE)
head(cbind(gapminder, below_average))

OUTPUT

      country year      pop continent lifeExp gdpPercap below_average
1 Afghanistan 1952  8425333      Asia  28.801  779.4453          TRUE
2 Afghanistan 1957  9240934      Asia  30.332  820.8530          TRUE
3 Afghanistan 1962 10267083      Asia  31.997  853.1007         FALSE
4 Afghanistan 1967 11537966      Asia  34.020  836.1971          TRUE
5 Afghanistan 1972 13079460      Asia  36.088  739.9811          TRUE
6 Afghanistan 1977 14880372      Asia  38.438  786.1134         FALSE

The sequence TRUE, TRUE, FALSE is repeated over all the gapminder rows.

Let’s overwrite the content of gapminder with our new data frame. We can do this in two ways: using cbind and overwriting the gapminder object, or by identifying and adding a new column.

R

below_average <- gapminder$lifeExp < 70.5

# Option 1 - use cbind and overwrite the object
gapminder <- cbind(gapminder, below_average)

# Option 2 - identifying and adding a new column
gapminder$below_average <- below_average

Now how about adding rows? The rows of a data frame are lists, so we can add a new row by either specifying a new list, or a new data frame.

R

# Option 1 - create a new list
new_row <- list(country = 'Norway', 
                year = 2016, 
                pop = 5000000, 
                continent = 'Nordic', 
                lifeExp = 80.3, 
                gdpPercap = 49400.0, 
                below_average = FALSE)

# Option 2 - perhaps more intuitive, create a new data frame
new_row <- data.frame(country = 'Norway', 
                      year = 2016, 
                      pop = 5000000, 
                      continent = 'Nordic', 
                      lifeExp = 80.3, 
                      gdpPercap = 49400.0, 
                      below_average = FALSE)

gapminder_norway <- rbind(gapminder, new_row)
tail(gapminder_norway)

OUTPUT

      country year      pop continent lifeExp  gdpPercap below_average
1700 Zimbabwe 1987  9216418    Africa  62.351   706.1573          TRUE
1701 Zimbabwe 1992 10704340    Africa  60.377   693.4208          TRUE
1702 Zimbabwe 1997 11404948    Africa  46.809   792.4500          TRUE
1703 Zimbabwe 2002 11926563    Africa  39.989   672.0386          TRUE
1704 Zimbabwe 2007 12311143    Africa  43.487   469.7093          TRUE
1705   Norway 2016  5000000    Nordic  80.300 49400.0000         FALSE

Factors


Here is another thing to look out for: in a factor, each different value represents what is called a level. Let’s alter our gapminder data so that the “continent” column is a factor.

R

gapminder$continent <- factor(gapminder$continent)
head(gapminder$continent)

OUTPUT

[1] Asia Asia Asia Asia Asia Asia
Levels: Africa Americas Asia Europe Oceania

Now the factor “continent” has 5 levels: “Africa”, “Americas”, “Asia”, “Europe” and “Oceania”. R will only accept values that match one of the levels. If you add a new value, it will become NA.
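For example, calling rbind() with the new_row from above (whose continent is “Nordic”) before that level exists produces a warning like the one below, and the value is stored as NA (a sketch of the behavior; gapminder_na is a throwaway name):

R

gapminder_na <- rbind(gapminder, new_row)  # "Nordic" is not yet a level of continent

WARNING

Warning: invalid factor level, NA generated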

R

levels(gapminder$continent)

OUTPUT

[1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania" 

R

levels(gapminder$continent) <- c(levels(gapminder$continent), "Nordic")
gapminder_norway  <- rbind(gapminder, new_row)
tail(gapminder_norway)

OUTPUT

      country year      pop continent lifeExp  gdpPercap below_average
1700 Zimbabwe 1987  9216418    Africa  62.351   706.1573          TRUE
1701 Zimbabwe 1992 10704340    Africa  60.377   693.4208          TRUE
1702 Zimbabwe 1997 11404948    Africa  46.809   792.4500          TRUE
1703 Zimbabwe 2002 11926563    Africa  39.989   672.0386          TRUE
1704 Zimbabwe 2007 12311143    Africa  43.487   469.7093          TRUE
1705   Norway 2016  5000000    Nordic  80.300 49400.0000         FALSE

Alternatively, we can change a factor into a character vector; we lose the handy categories of the factor, but we can subsequently add any word we want to the column without babysitting the factor levels:

R

str(gapminder)

OUTPUT

'data.frame':	1704 obs. of  7 variables:
 $ country      : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year         : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop          : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent    : Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ lifeExp      : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap    : num  779 821 853 836 740 ...
 $ below_average: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

R

gapminder$continent <- as.character(gapminder$continent)
str(gapminder)

OUTPUT

'data.frame':	1704 obs. of  7 variables:
 $ country      : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year         : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop          : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent    : chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp      : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap    : num  779 821 853 836 740 ...
 $ below_average: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

Appending to a data frame


The key to remember when adding data to a data frame is that columns are vectors and rows are lists. We can also glue two data frames together with rbind:

R

gapminder <- rbind(gapminder, gapminder)
tail(gapminder, n=3)

OUTPUT

      country year      pop continent lifeExp gdpPercap below_average
3406 Zimbabwe 1997 11404948    Africa  46.809  792.4500          TRUE
3407 Zimbabwe 2002 11926563    Africa  39.989  672.0386          TRUE
3408 Zimbabwe 2007 12311143    Africa  43.487  469.7093          TRUE

Challenge 3

You can create a new data frame right from within R with the following syntax:

R

df <- data.frame(id = c("a", "b", "c"),
                 x = 1:3,
                 y = c(TRUE, TRUE, FALSE))

Make a data frame that holds the following information for yourself:

  • first name
  • last name
  • lucky number

Then use rbind to add an entry for the people sitting beside you. Finally, use cbind to add a column with each person’s answer to the question, “Is it time for coffee break?”

R

df <- data.frame(first = c("Grace"),
                 last = c("Hopper"),
                 lucky_number = c(0))
df <- rbind(df, list("Marie", "Curie", 238) )
df <- cbind(df, coffeetime = c(TRUE, TRUE))

Key Points

  • Use cbind() to add a new column to a data frame.
  • Use rbind() to add a new row to a data frame.
  • Use levels() and as.character() to explore and manipulate factors.
  • Use str(), nrow(), ncol(), dim(), colnames(), rownames(), head(), and tail() to understand the structure of a data frame.
  • Read in a csv file using read.csv().
  • Understand what length() of a data frame represents.

Content from Subsetting Data


Last updated on 2024-08-19

Estimated time: 35 minutes

Overview

Questions

  • How can I work with subsets of data in R?

Objectives

  • To be able to subset vectors and data frames
  • To be able to extract individual and multiple elements: by index, by name, using comparison operations
  • To be able to skip and remove elements from various data structures.

R has many powerful subset operators. Mastering them will allow you to easily perform complex operations on any kind of dataset.

There are six different ways we can subset any kind of object, and three different subsetting operators ([, [[, and $) for the different data structures.

Let’s start with the workhorse of R: a simple numeric vector.

R

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
x

OUTPUT

  a   b   c   d   e 
5.4 6.2 7.1 4.8 7.5 

Atomic vectors

In R, simple vectors containing character strings, numbers, or logical values are called atomic vectors because they can’t be further simplified.

So now that we’ve created a dummy vector to play with, how do we get at its contents?

Accessing elements using their indices


To extract elements of a vector we can give their corresponding index, starting from one:

R

x[1]

OUTPUT

  a 
5.4 

It may look different, but the square brackets operator is a function. For vectors (and matrices), it means “get me the nth element”.
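
As a quick illustration of this (optional; you would rarely write it this way), the bracket operator can be called with the same syntax as any other function:

R

`[`(x, 1)

OUTPUT

  a 
5.4 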

We can ask for multiple elements at once:

R

x[c(1, 3)]

OUTPUT

  a   c 
5.4 7.1 

Or slices of the vector:

R

x[1:4]

OUTPUT

  a   b   c   d 
5.4 6.2 7.1 4.8 

The : operator creates a sequence of numbers from the left element to the right.

R

1:4

OUTPUT

[1] 1 2 3 4

R

c(1, 2, 3, 4)

OUTPUT

[1] 1 2 3 4

We can ask for the same element multiple times:

R

x[c(1, 1, 3)]

OUTPUT

  a   a   c 
5.4 5.4 7.1 

If we ask for an index beyond the length of the vector, R will return a missing value:

R

x[6]

OUTPUT

<NA> 
  NA 

This is a vector of length one containing an NA, whose name is also NA.

If we ask for the 0th element, we get an empty vector:

R

x[0]

OUTPUT

named numeric(0)

Vector numbering in R starts at 1

In many programming languages (C and Python, for example), the first element of a vector has an index of 0. In R, the first element is 1.

Skipping and removing elements


If we use a negative number as the index of a vector, R will return every element except for the one specified:

R

x[-2]

OUTPUT

  a   c   d   e 
5.4 7.1 4.8 7.5 

We can skip multiple elements:

R

x[c(-1, -5)]  # or x[-c(1,5)]

OUTPUT

  b   c   d 
6.2 7.1 4.8 

Tip: Order of operations

A common trip-up for novices occurs when trying to skip slices of a vector. It’s natural to try to negate a sequence like so:

R

x[-1:3]

This gives a somewhat cryptic error:

ERROR

Error in x[-1:3]: only 0's may be mixed with negative subscripts

But remember the order of operations. : is really a function. It takes its first argument as -1, and its second as 3, so generates the sequence of numbers: c(-1, 0, 1, 2, 3).
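
We can check this by running the sequence on its own:

R

-1:3

OUTPUT

[1] -1  0  1  2  3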

The correct solution is to wrap that function call in brackets, so that the - operator applies to the result:

R

x[-(1:3)]

OUTPUT

  d   e 
4.8 7.5 

To remove elements from a vector, we need to assign the result back into the variable:

R

x <- x[-4]
x

OUTPUT

  a   b   c   e 
5.4 6.2 7.1 7.5 

Challenge 1

Given the following code:

R

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)

OUTPUT

  a   b   c   d   e 
5.4 6.2 7.1 4.8 7.5 

Come up with at least 3 different commands that will produce the following output:

OUTPUT

  b   c   d 
6.2 7.1 4.8 

After you find 3 different commands, compare notes with your neighbour. Did you have different strategies?

R

x[2:4]

OUTPUT

  b   c   d 
6.2 7.1 4.8 

R

x[-c(1,5)]

OUTPUT

  b   c   d 
6.2 7.1 4.8 

R

x[c("b", "c", "d")]

OUTPUT

  b   c   d 
6.2 7.1 4.8 

R

x[c(2,3,4)]

OUTPUT

  b   c   d 
6.2 7.1 4.8 

Subsetting by name


We can extract elements by using their name, instead of extracting by index:

R

x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5) # we can name a vector 'on the fly'
x[c("a", "c")]

OUTPUT

  a   c 
5.4 7.1 

This is usually a much more reliable way to subset objects: the position of various elements can often change when chaining together subsetting operations, but the names will always remain the same!
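
A quick sketch of why this matters (x_sub is a throwaway name): if we drop an element, the positions of the remaining elements shift, but their names do not.

R

x_sub <- x[-2]  # drop 'b'; 'c' is now the second element...
x_sub["c"]      # ...but subsetting by name still finds it

OUTPUT

  c 
7.1 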

Subsetting through other logical operations


We can also use any logical vector to subset: elements at positions where the vector is TRUE are returned. Since comparison operators (e.g. >, <, ==) evaluate to logical vectors, we can use them to succinctly subset vectors. The following statement returns every element of x greater than 7:

R

x[x > 7]

OUTPUT

  c   e 
7.1 7.5 

Breaking it down, this statement first evaluates x>7, generating a logical vector c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the elements of x corresponding to the TRUE values.
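
To see this, we can supply that logical vector directly and get the same result:

R

x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]

OUTPUT

  c   e 
7.1 7.5 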

We can use == to mimic the previous method of indexing by name (remember you have to use == rather than = for comparisons):

R

x[names(x) == "a"]

OUTPUT

  a 
5.4 

We can also evaluate multiple logical criteria. For example, if we want all values of x that are greater than 5 and less than 7; we can use & to denote “and”:

R

x[x > 5 & x < 7]

OUTPUT

  a   b 
5.4 6.2 

Or, maybe we want the values less than 5 or more than 7; we can use ‘|’ to denote “or”:

R

x[x < 5 | x > 7]

OUTPUT

  c   d   e 
7.1 4.8 7.5 

Finally, we can evaluate whether variables are present in another vector by using the %in% command. For example, if we wanted to grab all values of x with the names “a”, “c”, and “d” we could do this:

R

x[names(x) %in% c("a", "c", "d")]

OUTPUT

  a   c   d 
5.4 7.1 4.8 

which is just a simpler way of writing it like this:

R

x[names(x) == "a" | names(x) == "c" | names(x) == "d"]

OUTPUT

  a   c   d 
5.4 7.1 4.8 

We also might want to grab the values that are not equal to something. For this we can use the != (not equal to) operator, or negate an entire logical expression, including ones built with > (greater than), < (less than), >= (greater than or equal to), or <= (less than or equal to), by wrapping it in !() (a sketch of this follows the next example). For example, this will return all values in x that do not have the name a:

R

x[names(x) != "a"]

OUTPUT

  b   c   d   e 
6.2 7.1 4.8 7.5 
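
Negating an entire expression works the same way. For example, taking the complement of the earlier “greater than 5 and less than 7” subset:

R

x[!(x > 5 & x < 7)]

OUTPUT

  c   d   e 
7.1 4.8 7.5 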

Challenge 2

Given the following code:

R

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)

OUTPUT

  a   b   c   d   e 
5.4 6.2 7.1 4.8 7.5 

Write a subsetting command to return the values in x that are greater than 4 and less than 7.

R

x_subset <- x[x<7 & x>4]
print(x_subset)

OUTPUT

  a   b   d 
5.4 6.2 4.8 

Handling special values

At some point you will encounter functions in R that cannot handle missing, infinite, or undefined data.

There are a number of special functions you can use to filter out this data:

  • is.na() returns a logical object of the same shape as its input, with TRUE at every position containing NA (or NaN)
  • likewise, is.nan() and is.infinite() do the same for NaN and Inf
  • is.finite() returns TRUE at every position in a vector, matrix, or data.frame that does not contain NA, NaN or Inf
  • na.omit() filters out all missing values from a vector
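
A minimal sketch of these functions in action (the vector y below is a made-up example, not part of the gapminder data):

R

y <- c(4, NA, 7, NaN, Inf)
is.na(y)       # TRUE for the NA and the NaN
is.finite(y)   # FALSE for the NA, the NaN, and the Inf
na.omit(y)     # drops the missing values (the Inf remains)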

Data frames


The episode after this one covers the dplyr package, which has an alternate subsetting mechanism. Learners do still need to learn the base R subsetting covered here, as dplyr won’t work in all situations. However, the examples in the rest of the workshop focus on dplyr syntax.

Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:

[] with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data frame:

R

# Read in gapminder data if you don't have it already
gapminder <- read.csv("data/gapminder_data.csv")

head(gapminder[3])

OUTPUT

       pop
1  8425333
2  9240934
3 10267083
4 11537966
5 13079460
6 14880372

Similarly, [[]] will act to extract the values from a single column:

R

head(gapminder[["lifeExp"]])

OUTPUT

[1] 28.801 30.332 31.997 34.020 36.088 38.438

And $ provides a convenient shorthand to extract columns by name:

R

head(gapminder$year)

OUTPUT

[1] 1952 1957 1962 1967 1972 1977

To select specific rows and/or columns, you can provide two arguments to []. Remember that the way to index data frames is always object[row(s), column(s)].

R

gapminder[1:3, ]

OUTPUT

      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007

If we subset a single row, the result will be a data frame (because the elements are mixed types):

R

gapminder[3, ]

OUTPUT

      country year      pop continent lifeExp gdpPercap
3 Afghanistan 1962 10267083      Asia  31.997  853.1007

But for a single column the result will be a vector (this can be changed with the third argument, drop = FALSE).
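
For example, using a column we already know:

R

head(gapminder[, "pop"])                # a single column simplifies to a vector
head(gapminder[, "pop", drop = FALSE])  # drop = FALSE keeps the data frame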

Challenge 3

Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R

gapminder[gapminder$year = 1957, ]

  2. Extract all columns except 1 through to 4

R

gapminder[, -1:4]

  3. Extract the rows where the life expectancy is longer than 80 years

R

gapminder[gapminder$lifeExp > 80]

  4. Extract the first row, and the fourth and fifth columns (lifeExp and gdpPercap).

R

gapminder[1, 4, 5]

  5. Advanced: extract rows that contain information for the years 2002 and 2007

R

gapminder[gapminder$year == 2002 | 2007,]

Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R

# gapminder[gapminder$year = 1957, ]
gapminder[gapminder$year == 1957, ]

  2. Extract all columns except 1 through to 4

R

# gapminder[, -1:4]
gapminder[, -c(1:4)]

  3. Extract the rows where the life expectancy is longer than 80 years

R

# gapminder[gapminder$lifeExp > 80]
gapminder[gapminder$lifeExp > 80, ]

  4. Extract the first row, and the fourth and fifth columns (lifeExp and gdpPercap).

R

# gapminder[1, 4, 5]
gapminder[1, c(4, 5)]

  5. Advanced: extract rows that contain information for the years 2002 and 2007

R

# gapminder[gapminder$year == 2002 | 2007,]
gapminder[gapminder$year == 2002 | gapminder$year == 2007, ]
gapminder[gapminder$year %in% c(2002, 2007), ] # another, more advanced way

Challenge 4

  1. Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?

  2. Create a new data.frame called gapminder_small that only contains rows 1 through 9 and 19 through 23. You can do this in one or two steps.

  1. gapminder is a data.frame, so it needs to be subsetted on two dimensions: with a single argument, gapminder[1:20] asks for columns 1 through 20, and the data frame has only 6 columns, hence the error. gapminder[1:20, ] subsets the data to give the first 20 rows and all columns.

  2. In one step:

R

gapminder_small <- gapminder[c(1:9, 19:23), ]

Key Points

  • Indexing in R starts at 1, not 0.
  • Access individual values by location using [].
  • Access slices of data using [low:high].
  • Access arbitrary sets of data using [c(...)].
  • Use logical operations and logical vectors to access subsets of data.

Content from Data frame Manipulation with dplyr


Last updated on 2024-08-19 | Edit this page

Estimated time: 40 minutes

Overview

Questions

  • How can I manipulate dataframes without repeating myself?

Objectives

  • To be able to use the six main dataframe manipulation ‘verbs’ with pipes in dplyr.
  • To understand how group_by() and summarize() can be combined to summarize datasets.
  • Be able to analyze a subset of data using logical filtering.

Manipulation of dataframes means many things to many researchers: we often select certain observations (rows) or variables (columns), group the data by a certain variable(s), or calculate summary statistics. We can do these operations using the normal base R operations:

R

gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)

mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])

OUTPUT

[1] 2193.755

R

mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])

OUTPUT

[1] 7136.11

R

mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])

OUTPUT

[1] 7902.15

But this isn’t very efficient, and can become tedious quickly because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

The dplyr package


  • Introduce the dplyr package as a simpler, more intuitive way of doing subsetting.
  • Unlike other SWC and DC R lessons, this lesson does not include data reshaping with tidyr as it isn’t used in the rest of the workshop.

Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Here we’re going to cover 6 of the most commonly used functions as well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. count() and n()
  6. mutate()

If you have not installed this package earlier, please do so:

R

install.packages('dplyr')

Now let’s load the package:

R

library("dplyr")

Using select()


If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use the select() function. This will keep only the variables you select.

R

year_country_gdp <- select(gapminder, year, country, gdpPercap)

Illustration of selecting two columns from a dataframe

If we open up year_country_gdp we’ll see that it only contains the year, country and gdpPercap. Above we used ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.

R

year_country_gdp <- gapminder %>% select(year,country,gdpPercap)

To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the gapminder data frame and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since it gets that from the previous pipe. Fun Fact: You may have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!

Using filter()


If we now wanted to move forward with the above, but only with European countries, we can combine select() and filter():

R

year_country_gdp_euro <- gapminder %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)

Challenge 1

Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the African values for lifeExp, country and year, but not for other continents. How many rows does your dataframe have and why?

R

year_country_lifeExp_Africa <- gapminder %>%
                           filter(continent=="Africa") %>%
                           select(year,country,lifeExp)

Your dataframe should have 624 rows: the dataset contains 52 African countries, each observed in 12 different years. As with last time, first we pass the gapminder dataframe to the filter() function, then we pass the filtered version of the gapminder data frame to the select() function. Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable continent since we would have removed it in the previous step.

Using group_by() and summarize()


Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now we haven’t done that, since we would have to repeat the above for each continent. Instead of filter(), which will only pass observations that meet your criteria (in the above: continent=="Europe"), we can use group_by(), which will essentially use every unique criterion that you could have used in filter().

R

str(gapminder)

OUTPUT

'data.frame':	1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

R

gapminder %>% group_by(continent) %>% str()

OUTPUT

gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
 - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
  ..$ .rows    : list<int> [1:5] 
  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE

You will notice that the structure of the dataframe where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value of continent (at least in the example above).

Illustration of multiple dataframes created by piping a dataframe to group_by

Using summarize()


The above was a bit on the uneventful side but group_by() is much more exciting in conjunction with summarize(). This will allow us to create new variable(s) by using functions that repeat for each of the continent-specific data frames. That is to say, using the group_by() function, we split our original dataframe into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize(). To end our group_by(), we need to close off the pipe by using ungroup().

R

gdp_bycontinents <- gapminder %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap)) %>% 
  ungroup()

gdp_bycontinents

OUTPUT

# A tibble: 5 × 2
  continent mean_gdpPercap
  <chr>              <dbl>
1 Africa             2194.
2 Americas           7136.
3 Asia               7902.
4 Europe            14469.
5 Oceania           18622.

illustration of creating a summary dataframe from grouped data

That allowed us to calculate the mean gdpPercap for each continent, but it gets even better.

Challenge 2

Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?

R

lifeExp_bycountry <- gapminder %>%
   group_by(country) %>%
   summarize(mean_lifeExp=mean(lifeExp)) %>% 
   ungroup()

lifeExp_bycountry %>%
   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))

OUTPUT

# A tibble: 2 × 2
  country      mean_lifeExp
  <chr>               <dbl>
1 Iceland              76.5
2 Sierra Leone         36.8

Another way to do this is to use the dplyr function arrange(), which arranges the rows in a data frame according to the order of one or more variables from the data frame. It has similar syntax to other functions from the dplyr package. You can use desc() inside arrange() to sort in descending order.

R

lifeExp_bycountry %>%
   arrange(mean_lifeExp) %>%
   head(1)

OUTPUT

# A tibble: 1 × 2
  country      mean_lifeExp
  <chr>               <dbl>
1 Sierra Leone         36.8

R

lifeExp_bycountry %>%
   arrange(desc(mean_lifeExp)) %>%
   head(1)

OUTPUT

# A tibble: 1 × 2
  country mean_lifeExp
  <chr>          <dbl>
1 Iceland         76.5

The function group_by() allows us to group by multiple variables. Let’s group by year and continent.

R

gdp_bycontinents_byyear <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap)) %>% 
  ungroup()

That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in summarize().

R

gdp_pop_bycontinents_byyear <- gapminder %>%
  group_by(continent,year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            sd_gdpPercap = sd(gdpPercap),
            mean_pop = mean(pop),
            sd_pop = sd(pop)) %>% 
  ungroup()

count() and n()


A very common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.

For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use the count() function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort=TRUE:

R

gapminder %>%
    filter(year == 2002) %>%
    count(continent, sort = TRUE)

OUTPUT

  continent  n
1    Africa 52
2      Asia 33
3    Europe 30
4  Americas 25
5   Oceania  2

If we need to use the number of observations in calculations, the n() function is useful. For instance, if we wanted to get the standard error of the life expectancy per continent:

R

gapminder %>%
  group_by(continent) %>%
  summarize(sd_le = sd(lifeExp), # standard deviation
            n_le = n(), # number of observations
            se_le = sd_le/sqrt(n_le)) %>% # standard error
  ungroup()

OUTPUT

# A tibble: 5 × 4
  continent sd_le  n_le se_le
  <chr>     <dbl> <int> <dbl>
1 Africa     9.15   624 0.366
2 Americas   9.35   300 0.540
3 Asia      11.9    396 0.596
4 Europe     5.43   360 0.286
5 Oceania    3.80    24 0.775

In the above example, we chained together several summary operations to simplify our calculation of standard error. We can also chain operations to calculate the minimum, maximum, mean, and se of each continent’s per-country life expectancy. This time, we will calculate the se in just one line:

R

gapminder %>%
    group_by(continent) %>%
    summarize(
      mean_le = mean(lifeExp),
      min_le = min(lifeExp),
      max_le = max(lifeExp), 
      se_le = sd(lifeExp)/sqrt(n())) %>% 
  ungroup()

OUTPUT

# A tibble: 5 × 5
  continent mean_le min_le max_le se_le
  <chr>       <dbl>  <dbl>  <dbl> <dbl>
1 Africa       48.9   23.6   76.4 0.366
2 Americas     64.7   37.6   80.7 0.540
3 Asia         60.1   28.8   82.6 0.596
4 Europe       71.9   43.6   81.8 0.286
5 Oceania      74.3   69.1   81.2 0.775

Using mutate()


We can also create new variables prior to (or even after) summarizing information using mutate().

R

gdp_pop_billion <- gapminder %>%
  mutate(gdp_billion = gdpPercap*pop/10^9)

head(gdp_pop_billion)

OUTPUT

      country year      pop continent lifeExp gdpPercap gdp_billion
1 Afghanistan 1952  8425333      Asia  28.801  779.4453    6.567086
2 Afghanistan 1957  9240934      Asia  30.332  820.8530    7.585449
3 Afghanistan 1962 10267083      Asia  31.997  853.1007    8.758856
4 Afghanistan 1967 11537966      Asia  34.020  836.1971    9.648014
5 Afghanistan 1972 13079460      Asia  36.088  739.9811    9.678553
6 Afghanistan 1977 14880372      Asia  38.438  786.1134   11.697659
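
We can also combine mutate() with the grouping tools from earlier in a single pipeline. A sketch (the object and column names here are our own choices):

R

gdp_summary <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9) %>%
  group_by(continent, year) %>%
  summarize(mean_gdp_billion = mean(gdp_billion),
            sd_gdp_billion = sd(gdp_billion)) %>%
  ungroup()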

Other great resources


Key Points

  • Use the dplyr package to manipulate dataframes.
  • Use select() to choose variables from a dataframe.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use count() and n() to obtain the number of observations in columns.
  • Use mutate() to create new variables.

Content from Introduction to Visualization


Last updated on 2024-08-19 | Edit this page

Estimated time: 35 minutes

Overview

Questions

  • What are the basics of creating graphics in R?

Objectives

  • To be able to use ggplot2 to generate histograms and bar plots.
  • To apply geometry and aesthetic layers to a ggplot plot.
  • To manipulate the aesthetics of a plot using different colors and position parameters.

Plotting our data is one of the best ways to quickly explore it and the various relationships between variables. There are three main plotting systems in R: the base plotting system, the lattice package, and the ggplot2 package. In this lesson we’ll be learning about the ggplot2 package, because it is the most effective for creating publication-quality graphics. In this episode, we will introduce the key features of a ggplot and make a few example plots. We will expand on these concepts and see how they apply to geospatial data types when we start working with geospatial data in the R for Raster and Vector Data lesson.

  • This episode introduces geom_col and geom_histogram. These geoms are used in the rest of the workshop, along with geoms specifically for geospatial data.
  • Emphasize that we will go much deeper into visualization and creating publication-quality graphics later in the workshop.

ggplot2 is built on the grammar of graphics, the idea that any plot can be expressed from the same set of components: a data set, a coordinate system, and a set of geoms, the visual representations of data points. The key to understanding ggplot2 is thinking about a figure in layers. This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape. In this episode we will focus on two geoms:

  • histograms
  • bar plots

Let’s start by loading in the appropriate packages and data. Remember that we will have to load libraries and data at the start of each R script.

R

library(dplyr)
library(ggplot2)

gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)

For our first example, we will be plotting the distribution of life expectancy in our dataset. The first thing we do is call the ggplot function. This function lets R know that we’re creating a new plot, and any of the arguments we give the ggplot() function are the global options for the plot: they apply to all layers on the plot.

We will pass in two arguments to ggplot. First, we tell ggplot what data we want to show on our figure, in this example we use the gapminder data we read in earlier. For the second argument we pass in the mapping = aes() function, which tells ggplot how variables in the data map to aesthetic properties of the figure. Here we will tell ggplot we want to plot the “lifeExp” column of the gapminder data frame on the x-axis. We don’t need to specify a y-axis for histograms.

R

ggplot(data = gapminder, mapping = aes(x = lifeExp)) +   
  geom_histogram()

Histogram of life expectancy by country showing bimodal distribution with modes at 45 and 75

By itself, the call to ggplot isn’t enough to draw a figure:

R

ggplot(data = gapminder, mapping = aes(x = lifeExp))

We need to tell ggplot how we want to visually represent the data, which we do by adding a geom layer. In our example, we used geom_histogram(), which tells ggplot we want to visually represent the distribution of one variable (in our case “lifeExp”):

R

ggplot(data = gapminder, mapping = aes(x = lifeExp)) +   
  geom_histogram()

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram of life expectancy by country showing bimodal distribution with modes at 45 and 75

Challenge 1

Modify the example so that the figure shows the distribution of gdp per capita, rather than life expectancy:

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap)) +   
 geom_histogram()

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram is a useful tool for visualizing the distribution of a single numeric variable. What if we want to compare the gdp per capita of the countries in our dataset? We can use a bar (or column) plot. To simplify our plot, let’s look at data only from the most recent year and only from countries in the Americas.

R

gapminder_small <- gapminder %>% 
  filter(year == 2007 & continent == "Americas")

This time, we will use the geom_col() function as our geometry. We will plot countries on the x-axis (listed in alphabetic order by default) and gdp per capita on the y-axis.

R

ggplot(data = gapminder_small, mapping = aes(x = country, y = gdpPercap)) + 
  geom_col()

Barplot of GDP per capita. Country names on x-axis overlap and are not readable

With this many bars plotted, it’s impossible to read all of the x-axis labels. A quick fix for this is to add the coord_flip() function to the end of our plot code.

R

ggplot(data = gapminder_small, mapping = aes(x = country, y = gdpPercap)) + 
  geom_col() +
  coord_flip()

Barplot showing GDP per capita. Country names on the y-axis are readable

There are more sophisticated ways of modifying axis labels. We will be learning some of those methods later in this workshop.

Challenge 2

In the previous examples and challenge we’ve used the aes function to tell the geom_histogram() and geom_col() functions which columns of the data set to plot. Another aesthetic property we can modify is the color. Create a new bar (column) plot showing the gdp per capita of all countries in the Americas for the years 1952 and 2007, color coded by year.

First we create a new object with our filtered data:

R

gapminder_small_2 <- gapminder %>%
                        filter(continent == "Americas",
                               year %in% c(1952, 2007))

Then we plot that data using the geom_col() geom function. We color bars using the fill parameter within the aes() function. Since there are multiple bars for each country, we use the position parameter to “dodge” them so they appear side-by-side. The default behavior for position in geom_col() is “stack”.

R

ggplot(gapminder_small_2, 
       mapping = aes(x = country, y = gdpPercap, 
       fill = as.factor(year))) +
   geom_col(position = "dodge") + 
   coord_flip()

The examples given here are just the start of creating complex and beautiful graphics with R. In a later lesson we will go into much more depth, including:

  • plotting geospatial specific data types
  • adjusting the color scheme of our plots
  • setting and formatting plot titles, subtitles, and axis labels
  • creating multi-panel plots
  • creating point (scatter) and line plots
  • layering datasets to create multi-layered plots
  • creating and customizing a plot legend
  • and much more!

The examples we’ve worked through in this episode should give you the building blocks for working with the more complex graphic types and customizations we will be working with in that lesson.

Key Points

  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, etc.

Content from Writing Data


Last updated on 2024-08-19 | Edit this page

Estimated time: 20 minutes

Overview

Questions

  • How can I save plots and data created in R?

Objectives

  • To be able to write out plots and data from R.

Learners will need to have created the directory structure described in Project Management With RStudio in order for the code in this episode to work.

First, let’s load in all relevant libraries and data to be used in this lesson:

R

library(ggplot2)
library(dplyr)

gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)

We also need to create a cleaned-data folder within the data folder and a figures folder within the main project folder. We can do this manually or using code:

R

dir.create("data/cleaned-data")
dir.create("figures")

Saving plots


You can save a plot from within RStudio using the ‘Export’ button in the ‘Plot’ window. This will give you the option of saving as a .pdf or as .png, .jpg or other image formats.

Sometimes you will want to save plots without creating them in the ‘Plot’ window first. Perhaps you want to make a pdf document, for example. Or perhaps you’re looping through multiple subsets of a file, plotting data from each subset, and you want to save each plot. In this case you can use a more flexible approach. The ggsave function saves the latest plot by default. You can control the size and resolution using the arguments to this function.

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap)) +
  geom_histogram()
ggsave("figures/Distribution-of-gdpPercap.pdf", width=12, height=4)

Open up this document and have a look.
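
If you need to save a plot other than the most recent one, you can also pass a plot object to ggsave() explicitly via its plot argument (the object name lifeExp_plot is our own):

R

lifeExp_plot <- ggplot(data = gapminder, mapping = aes(x = lifeExp)) +
  geom_histogram()

ggsave("figures/Distribution-of-lifeExp.pdf", plot = lifeExp_plot, width = 12, height = 4)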

Challenge 1

Create and save a new plot showing the side-by-side bar plot of gdp per capita in countries in the Americas in the years 1952 and 2007.

R

gapminder_small <- gapminder %>% 
  filter(continent == "Americas" & year %in% c(1952, 2007))

ggplot(data = gapminder_small, 
       mapping = aes(x = country, y = gdpPercap, fill = as.factor(year))) +
  geom_col(position = "dodge") +
  coord_flip()

# Note that ggsave saves by default the latest plot.
ggsave("figures/Distribution-of-gdpPercap.pdf", width = 12, height = 4)

To produce documents in different formats, change the file extension to .jpeg, .png, .tiff, or .bmp.
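
For example, to save the same figure as a PNG instead of a PDF:

R

ggsave("figures/Distribution-of-gdpPercap.png", width = 12, height = 4)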

Writing data


At some point, you’ll also want to write out data from R.

We can use the write.csv function for this, which is very similar to read.csv from before.

Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:

R

aust_subset <- gapminder %>% 
  filter(country == "Australia")

write.csv(aust_subset,
  file="data/cleaned-data/gapminder-aus.csv"
)

Let’s open the file to make sure it contains the data we expect. Navigate to your cleaned-data directory and double-click the file name. It will open using your computer’s default for opening files with a .csv extension. To open in a specific application, right click and select the application. Using a spreadsheet program (like Excel) to open this file shows us that we do have properly formatted data including only the data points from Australia. However, there are row numbers associated with the data that are not useful to us (they refer to the row numbers from the gapminder data frame).

Let’s look at the help file to work out how to change this behaviour.

R

?write.csv

By default R will write out the row and column names when writing data to a file. To override this behaviour, we can do the following:

R

write.csv(
  aust_subset,
  file="data/cleaned-data/gapminder-aus.csv",
  row.names=FALSE
)

Challenge 2

Subset the gapminder data to include only data points collected since 1990. Write out the new subset to a file in the cleaned-data/ directory.

R

gapminder_after_1990 <- gapminder %>% 
  filter(year > 1990)

write.csv(gapminder_after_1990,
  file = "cleaned-data/gapminder-after-1990.csv",
  row.names = FALSE)

Key Points

  • Save plots using ggsave().
  • Use write.csv to save tabular data.
  • Now that learners know the fundamentals of R, the rest of the workshop will apply these concepts to working with geospatial data in R.
  • Packages and functions specific for working with geospatial data will be the focus of the rest of the workshop.
  • They will have lots of challenges to practice applying and expanding these skills in the next lesson.

Content from Intro to Raster Data


Last updated on 2024-08-19 | Edit this page

Estimated time: 50 minutes

Overview

Questions

  • What is a raster dataset?
  • How do I work with and plot raster data in R?
  • How can I handle missing or bad data values for a raster?

Objectives

  • Describe the fundamental attributes of a raster dataset.
  • Explore raster attributes and metadata using R.
  • Import rasters into R using the terra package.
  • Plot a raster file in R using the ggplot2 package.
  • Describe the difference between single- and multi-band rasters.

In this episode, we will introduce the fundamental principles, packages and metadata/raster attributes that are needed to work with raster data in R. We will discuss some of the core metadata elements that we need to understand to work with rasters in R, including CRS and resolution. We will also explore missing and bad data values as stored in a raster and how R handles these elements.

We will continue to work with the dplyr and ggplot2 packages. We will use one additional package in this episode to work with raster data - the terra package. Make sure that you have all packages loaded.

R

library(terra)
library(ggplot2)
library(dplyr)

In this episode, we’ll be working with raster data collected from the Harvard Forest site. These data were collected using remote sensing (LiDAR) and include estimates of a canopy height model, digital elevation model, and digital surface model of the field site. Today we will be focusing on the digital surface model.

View Raster File Attributes


We will be working with a series of GeoTIFF files in this lesson. The GeoTIFF format contains a set of embedded tags with metadata about the raster data. We can use the function describe() in the terra package to get information about our raster data before we read that data into R. It is ideal to do this before importing your data.

To minimize transcription errors, show students that they can use the “tab” button after typing in describe(" to get a list of files available to choose from. Users can then click through the appropriate folders to get to the desired location/file

R

describe("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")

OUTPUT

 [1] "Driver: GTiff/GeoTIFF"                                                                                                                                                                                                                                                          
 [2] "Files: data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"                                                                                                                                                                                                          
 [3] "Size is 1697, 1367"                                                                                                                                                                                                                                                             
 [4] "Coordinate System is:"                                                                                                                                                                                                                                                          
 [5] "PROJCRS[\"WGS 84 / UTM zone 18N\","                                                                                                                                                                                                                                             
 [6] "    BASEGEOGCRS[\"WGS 84\","                                                                                                                                                                                                                                                    
 [7] "        DATUM[\"World Geodetic System 1984\","                                                                                                                                                                                                                                  
 [8] "            ELLIPSOID[\"WGS 84\",6378137,298.257223563,"                                                                                                                                                                                                                        
 [9] "                LENGTHUNIT[\"metre\",1]]],"                                                                                                                                                                                                                                     
[10] "        PRIMEM[\"Greenwich\",0,"                                                                                                                                                                                                                                                
[11] "            ANGLEUNIT[\"degree\",0.0174532925199433]],"                                                                                                                                                                                                                         
[12] "        ID[\"EPSG\",4326]],"                                                                                                                                                                                                                                                    
[13] "    CONVERSION[\"UTM zone 18N\","                                                                                                                                                                                                                                               
[14] "        METHOD[\"Transverse Mercator\","                                                                                                                                                                                                                                        
[15] "            ID[\"EPSG\",9807]],"                                                                                                                                                                                                                                                
[16] "        PARAMETER[\"Latitude of natural origin\",0,"                                                                                                                                                                                                                            
[17] "            ANGLEUNIT[\"degree\",0.0174532925199433],"                                                                                                                                                                                                                          
[18] "            ID[\"EPSG\",8801]],"                                                                                                                                                                                                                                                
[19] "        PARAMETER[\"Longitude of natural origin\",-75,"                                                                                                                                                                                                                         
[20] "            ANGLEUNIT[\"degree\",0.0174532925199433],"                                                                                                                                                                                                                          
[21] "            ID[\"EPSG\",8802]],"                                                                                                                                                                                                                                                
[22] "        PARAMETER[\"Scale factor at natural origin\",0.9996,"                                                                                                                                                                                                                   
[23] "            SCALEUNIT[\"unity\",1],"                                                                                                                                                                                                                                            
[24] "            ID[\"EPSG\",8805]],"                                                                                                                                                                                                                                                
[25] "        PARAMETER[\"False easting\",500000,"                                                                                                                                                                                                                                    
[26] "            LENGTHUNIT[\"metre\",1],"                                                                                                                                                                                                                                           
[27] "            ID[\"EPSG\",8806]],"                                                                                                                                                                                                                                                
[28] "        PARAMETER[\"False northing\",0,"                                                                                                                                                                                                                                        
[29] "            LENGTHUNIT[\"metre\",1],"                                                                                                                                                                                                                                           
[30] "            ID[\"EPSG\",8807]]],"                                                                                                                                                                                                                                               
[31] "    CS[Cartesian,2],"                                                                                                                                                                                                                                                           
[32] "        AXIS[\"(E)\",east,"                                                                                                                                                                                                                                                     
[33] "            ORDER[1],"                                                                                                                                                                                                                                                          
[34] "            LENGTHUNIT[\"metre\",1]],"                                                                                                                                                                                                                                          
[35] "        AXIS[\"(N)\",north,"                                                                                                                                                                                                                                                    
[36] "            ORDER[2],"                                                                                                                                                                                                                                                          
[37] "            LENGTHUNIT[\"metre\",1]],"                                                                                                                                                                                                                                          
[38] "    USAGE["                                                                                                                                                                                                                                                                     
[39] "        SCOPE[\"Engineering survey, topographic mapping.\"],"                                                                                                                                                                                                                   
[40] "        AREA[\"Between 78°W and 72°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Canada - Nunavut; Ontario; Quebec. Colombia. Cuba. Ecuador. Greenland. Haiti. Jamaica. Panama. Turks and Caicos Islands. United States (USA). Venezuela.\"],"
[41] "        BBOX[0,-78,84,-72]],"                                                                                                                                                                                                                                                   
[42] "    ID[\"EPSG\",32618]]"                                                                                                                                                                                                                                                        
[43] "Data axis to CRS axis mapping: 1,2"                                                                                                                                                                                                                                             
[44] "Origin = (731453.000000000000000,4713838.000000000000000)"                                                                                                                                                                                                                      
[45] "Pixel Size = (1.000000000000000,-1.000000000000000)"                                                                                                                                                                                                                            
[46] "Metadata:"                                                                                                                                                                                                                                                                      
[47] "  AREA_OR_POINT=Area"                                                                                                                                                                                                                                                           
[48] "Image Structure Metadata:"                                                                                                                                                                                                                                                      
[49] "  COMPRESSION=LZW"                                                                                                                                                                                                                                                              
[50] "  INTERLEAVE=BAND"                                                                                                                                                                                                                                                              
[51] "Corner Coordinates:"                                                                                                                                                                                                                                                            
[52] "Upper Left  (  731453.000, 4713838.000) ( 72d10'52.71\"W, 42d32'32.18\"N)"                                                                                                                                                                                                      
[53] "Lower Left  (  731453.000, 4712471.000) ( 72d10'54.71\"W, 42d31'47.92\"N)"                                                                                                                                                                                                      
[54] "Upper Right (  733150.000, 4713838.000) ( 72d 9'38.40\"W, 42d32'30.35\"N)"                                                                                                                                                                                                      
[55] "Lower Right (  733150.000, 4712471.000) ( 72d 9'40.41\"W, 42d31'46.08\"N)"                                                                                                                                                                                                      
[56] "Center      (  732301.500, 4713154.500) ( 72d10'16.56\"W, 42d32' 9.13\"N)"                                                                                                                                                                                                      
[57] "Band 1 Block=1697x1 Type=Float64, ColorInterp=Gray"                                                                                                                                                                                                                             
[58] "  Min=305.070 Max=416.070 "                                                                                                                                                                                                                                                     
[59] "  Minimum=305.070, Maximum=416.070, Mean=359.853, StdDev=17.832"                                                                                                                                                                                                                
[60] "  NoData Value=-9999"                                                                                                                                                                                                                                                           
[61] "  Metadata:"                                                                                                                                                                                                                                                                    
[62] "    STATISTICS_MAXIMUM=416.06997680664"                                                                                                                                                                                                                                         
[63] "    STATISTICS_MEAN=359.85311802914"                                                                                                                                                                                                                                            
[64] "    STATISTICS_MINIMUM=305.07000732422"                                                                                                                                                                                                                                         
[65] "    STATISTICS_STDDEV=17.83169335933"                                                                                                                                                                                                                                           

R

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"

If you wish to store this information in R, you can do the following:

R

HARV_dsmCrop_info <- capture.output(
  describe("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
)

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"

Each line of text that was printed to the console is now stored as an element of the character vector HARV_dsmCrop_info. We will be exploring this data throughout this episode. By the end of this episode, you will be able to explain and understand the output above.
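
For instance, we can confirm that each printed line became one element of the vector:

R

# One element per line of the describe() output above
length(HARV_dsmCrop_info)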

Open a Raster in R


Now that we’ve previewed the metadata for our GeoTIFF, let’s import this raster dataset into R and explore its metadata more closely. We can use the rast() function in the terra package to open a raster in R.

Data Tip - Object names

To improve code readability, use file and object names that make it clear what is in the file. The data for this episode were collected from Harvard Forest, so we’ll use a naming convention of datatype_HARV.

First we will load our raster file into R and view the data structure.

R

DSM_HARV <-
  rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")

# If you are getting an error, check your file path: 
# You might need to change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"

DSM_HARV

OUTPUT

class       : SpatRaster 
dimensions  : 1367, 1697, 1  (nrow, ncol, nlyr)
resolution  : 1, 1  (x, y)
extent      : 731453, 733150, 4712471, 4713838  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 18N (EPSG:32618) 
source      : HARV_dsmCrop.tif 
name        : HARV_dsmCrop 
min value   :       305.07 
max value   :       416.07 

The information above includes a report of min and max values, but no other data range statistics. As with other R data structures, such as vectors and data frame columns, descriptive statistics for raster data can be retrieved with the summary() function:

R

summary(DSM_HARV)

WARNING

Warning: [summary] used a sample

OUTPUT

  HARV_dsmCrop  
 Min.   :305.6  
 1st Qu.:345.6  
 Median :359.6  
 Mean   :359.8  
 3rd Qu.:374.3  
 Max.   :414.7  

but note the warning: unless you force R to calculate these statistics using every cell in the raster, it will take a random sample of 100,000 cells and calculate from that instead. To force calculation over all of the values, you can use the values() function:

R

summary(values(DSM_HARV))

OUTPUT

  HARV_dsmCrop  
 Min.   :305.1  
 1st Qu.:345.6  
 Median :359.7  
 Mean   :359.9  
 3rd Qu.:374.3  
 Max.   :416.1  

We can visualise this data in R either using the terra plot() function, or by using ggplot2. To visualise the data using terra, we run this line of code:

R

plot(DSM_HARV)

To visualise this data in R using ggplot2, we need to convert it to a dataframe. We learned about dataframes in the Exploring Data Frames lesson. The terra package has a built-in function for conversion to a plottable dataframe. Setting xy = TRUE maintains the spatial locations of the data, so that we can plot the data as a map.

R

DSM_HARV_df <- as.data.frame(DSM_HARV, xy = TRUE)

Now when we view the structure of our data, we will see a standard dataframe format.

R

str(DSM_HARV_df)

OUTPUT

'data.frame':	2319799 obs. of  3 variables:
 $ x           : num  731454 731454 731456 731456 731458 ...
 $ y           : num  4713838 4713838 4713838 4713838 4713838 ...
 $ HARV_dsmCrop: num  409 408 407 407 409 ...

We can use ggplot() to plot this data. We will set the color scale to scale_fill_viridis_c, which is a color-blindness-friendly color scale. We will also use the coord_quickmap() function to use an approximate Mercator projection for our plots. This approximation is suitable for small areas that are not too close to the poles. Other coordinate systems are available in ggplot2; if needed, you can learn about them at their help page, ?coord_map.

R

ggplot() +
    geom_raster(data = DSM_HARV_df , 
                mapping = aes(x = x, y = y, fill = HARV_dsmCrop)) +
    scale_fill_viridis_c() +
    coord_quickmap()
Raster plot with ggplot2 using the viridis color scale

Plotting Tip

More information about the viridis palette used above is available in the R viridis package documentation.

This map shows the elevation of our study site in Harvard Forest. From the legend, we can see that the maximum elevation is ~400, but we can’t tell whether this is 400 feet or 400 meters because the legend doesn’t show us the units. We can look at the metadata of our object to see what the units are. Much of the metadata that we’re interested in is part of the CRS.

Now we will see how features of the CRS appear in our data file and what meanings they have.

View Raster Coordinate Reference System (CRS) in R

We can view the CRS string associated with our R object using the crs() function.

R

crs(DSM_HARV, proj = TRUE)

OUTPUT

[1] "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"

Challenge 1

What units are our data in?

+units=m tells us that our data is in meters.

Understanding CRS in Proj4 Format


The CRS for our data is given to us by R in proj4 format. Let’s break down the pieces of a proj4 string. The string contains all of the individual CRS elements that R or another GIS might need. Each element is specified with a + sign, similar to how a .csv file is delimited, or broken up, by a ,. After each + we see the CRS element being defined, for example the projection (proj=) and the datum (datum=).

UTM Proj4 String

A projection string (like the one for DSM_HARV) specifies the UTM projection as follows:

+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0

  • proj=utm: the projection is UTM, UTM has several zones.
  • zone=18: the zone is 18
  • datum=WGS84: the datum is WGS84 (the datum refers to the 0,0 reference for the coordinate system used in the projection)
  • units=m: the units for the coordinates are in meters
  • ellps=WGS84: the ellipsoid (how the earth’s roundness is calculated) for the data is WGS84

Note that the zone is unique to the UTM projection. Not all CRSs will have a zone.

The UTM zones across the continental United States. Image source: Chrismurf at English Wikipedia, via Wikimedia Commons (CC-BY), https://upload.wikimedia.org/wikipedia/commons/8/8d/Utm-zones-USA.svg

Calculate Raster Min and Max Values


It is useful to know the minimum or maximum values of a raster dataset. In this case, given we are working with elevation data, these values represent the min/max elevation range at our site.

Raster statistics are often calculated and embedded in a GeoTIFF for us. We can view these values:

R

minmax(DSM_HARV)

OUTPUT

    HARV_dsmCrop
min       305.07
max       416.07

R

min(values(DSM_HARV))

OUTPUT

[1] 305.07

R

max(values(DSM_HARV))

OUTPUT

[1] 416.07

Data Tip - Set min and max values

If the minimum and maximum values haven’t already been calculated, we can calculate them using the setMinMax() function.

R

DSM_HARV <- setMinMax(DSM_HARV)

We can see that the elevation at our site ranges from 305.0700073m to 416.0699768m.

Raster Bands


The Digital Surface Model object (DSM_HARV) that we’ve been working with is a single band raster. This means that there is only one dataset stored in the raster: surface elevation in meters for one time period.

Multi-band raster image

A raster dataset can contain one or more bands. We can use the rast() function to import a single band from a single- or multi-band raster. We can view the number of bands in a raster using the nlyr() function.

R

nlyr(DSM_HARV)

OUTPUT

[1] 1

However, raster data can also be multi-band, meaning that one raster file contains data for more than one variable or time period for each cell. By default the rast() function imports all of the bands in a raster; to read in only a subset of bands, we can pass its lyrs argument.
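
As a sketch, we could read in just the first band by passing the lyrs argument. The path below assumes the multi-band RGB orthophoto used later in this workshop, so adjust it to match your data folder:

R

# read only the first band of a multi-band file (path is an assumption)
RGB_band1_HARV <- rast(
  "data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif",
  lyrs = 1)

nlyr(RGB_band1_HARV)  # 1: only the first band was imported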

Dealing with Missing Data


Raster data often has a NoDataValue associated with it. This is a value assigned to pixels where data is missing or no data were collected.

By default the shape of a raster is always rectangular. So if we have a dataset that has a shape that isn’t rectangular, some pixels at the edge of the raster will have NoDataValues. This often happens when the data were collected by an airplane which only flew over some part of a defined region.

In the image below, the pixels that are black have NoDataValues. The camera did not collect data in these areas.

In the next image, the black edges have been assigned NoDataValue. R doesn’t render pixels that contain a specified NoDataValue: it imports those values as NA, so they are left out of the plot.

The difference here shows up as ragged edges on the plot, rather than black spaces where there is no data.

If your raster already has NA values set correctly but you aren’t sure where they are, you can deliberately plot them in a particular colour. This can be useful when checking a dataset’s coverage. For instance, sometimes data can be missing where a sensor could not ‘see’ its target data, and you may wish to locate that missing data and fill it in.

To highlight NA values in ggplot, alter the scale_fill_*() layer to contain a colour instruction for NA values, like scale_fill_viridis_c(na.value = 'deeppink').
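
A minimal sketch, assuming we re-convert the raster while keeping its NA cells (the na.rm argument of terra’s as.data.frame(), which otherwise drops them):

R

# keep NA cells in the dataframe so they can be drawn (na.rm = FALSE)
DSM_HARV_na_df <- as.data.frame(DSM_HARV, xy = TRUE, na.rm = FALSE)

ggplot() +
    geom_raster(data = DSM_HARV_na_df,
                mapping = aes(x = x, y = y, fill = HARV_dsmCrop)) +
    scale_fill_viridis_c(na.value = "deeppink") +  # NA pixels drawn in pink
    coord_quickmap()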

The value conventionally used to flag missing data (the NoDataValue) varies by the raster data type. For floating-point rasters, the figure -3.4e+38 is a common default, and for integers, -9999 is common. Some disciplines have specific conventions that vary from these common values.

In some cases, other NA values may be more appropriate. An NA value should be a) outside the range of valid values, and b) a value that fits the data type in use. For instance, if your data ranges continuously from -20 to 100, 0 is not an acceptable NA value! Or, for categories that number 1-15, 0 might be fine for NA, but using -.000003 will force you to save the GeoTIFF on disk as a floating point raster, resulting in a bigger file.

If we are lucky, our GeoTIFF file has a tag that tells us what the NoDataValue is. If we are less lucky, we can find that information in the raster’s metadata. If a NoDataValue was stored in the GeoTIFF tag, R will assign each instance of that value to NA when it opens the raster. Values of NA will be ignored by R, as demonstrated above.
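
As a quick sketch, we can count how many cells were imported as NA:

R

# count the cells that were flagged as NoData and read in as NA
sum(is.na(values(DSM_HARV)))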

Challenge 2

Use the output from the describe() and sources() functions to find out what NoDataValue is used for our DSM_HARV dataset. We need to use sources() to give us the file path of the DSM_HARV object; describe() only works for file paths.

R

describe(sources(DSM_HARV))

OUTPUT

 [1] "Driver: GTiff/GeoTIFF"                                                                                                                                                                                                                                                          
 [2] "Files: /Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"                                                                                                                                
 [3] "Size is 1697, 1367"                                                                                                                                                                                                                                                             
 [4] "Coordinate System is:"                                                                                                                                                                                                                                                          
 [5] "PROJCRS[\"WGS 84 / UTM zone 18N\","                                                                                                                                                                                                                                             
 [6] "    BASEGEOGCRS[\"WGS 84\","                                                                                                                                                                                                                                                    
 [7] "        DATUM[\"World Geodetic System 1984\","                                                                                                                                                                                                                                  
 [8] "            ELLIPSOID[\"WGS 84\",6378137,298.257223563,"                                                                                                                                                                                                                        
 [9] "                LENGTHUNIT[\"metre\",1]]],"                                                                                                                                                                                                                                     
[10] "        PRIMEM[\"Greenwich\",0,"                                                                                                                                                                                                                                                
[11] "            ANGLEUNIT[\"degree\",0.0174532925199433]],"                                                                                                                                                                                                                         
[12] "        ID[\"EPSG\",4326]],"                                                                                                                                                                                                                                                    
[13] "    CONVERSION[\"UTM zone 18N\","                                                                                                                                                                                                                                               
[14] "        METHOD[\"Transverse Mercator\","                                                                                                                                                                                                                                        
[15] "            ID[\"EPSG\",9807]],"                                                                                                                                                                                                                                                
[16] "        PARAMETER[\"Latitude of natural origin\",0,"                                                                                                                                                                                                                            
[17] "            ANGLEUNIT[\"degree\",0.0174532925199433],"                                                                                                                                                                                                                          
[18] "            ID[\"EPSG\",8801]],"                                                                                                                                                                                                                                                
[19] "        PARAMETER[\"Longitude of natural origin\",-75,"                                                                                                                                                                                                                         
[20] "            ANGLEUNIT[\"degree\",0.0174532925199433],"                                                                                                                                                                                                                          
[21] "            ID[\"EPSG\",8802]],"                                                                                                                                                                                                                                                
[22] "        PARAMETER[\"Scale factor at natural origin\",0.9996,"                                                                                                                                                                                                                   
[23] "            SCALEUNIT[\"unity\",1],"                                                                                                                                                                                                                                            
[24] "            ID[\"EPSG\",8805]],"                                                                                                                                                                                                                                                
[25] "        PARAMETER[\"False easting\",500000,"                                                                                                                                                                                                                                    
[26] "            LENGTHUNIT[\"metre\",1],"                                                                                                                                                                                                                                           
[27] "            ID[\"EPSG\",8806]],"                                                                                                                                                                                                                                                
[28] "        PARAMETER[\"False northing\",0,"                                                                                                                                                                                                                                        
[29] "            LENGTHUNIT[\"metre\",1],"                                                                                                                                                                                                                                           
[30] "            ID[\"EPSG\",8807]]],"                                                                                                                                                                                                                                               
[31] "    CS[Cartesian,2],"                                                                                                                                                                                                                                                           
[32] "        AXIS[\"(E)\",east,"                                                                                                                                                                                                                                                     
[33] "            ORDER[1],"                                                                                                                                                                                                                                                          
[34] "            LENGTHUNIT[\"metre\",1]],"                                                                                                                                                                                                                                          
[35] "        AXIS[\"(N)\",north,"                                                                                                                                                                                                                                                    
[36] "            ORDER[2],"                                                                                                                                                                                                                                                          
[37] "            LENGTHUNIT[\"metre\",1]],"                                                                                                                                                                                                                                          
[38] "    USAGE["                                                                                                                                                                                                                                                                     
[39] "        SCOPE[\"Engineering survey, topographic mapping.\"],"                                                                                                                                                                                                                   
[40] "        AREA[\"Between 78°W and 72°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Canada - Nunavut; Ontario; Quebec. Colombia. Cuba. Ecuador. Greenland. Haiti. Jamaica. Panama. Turks and Caicos Islands. United States (USA). Venezuela.\"],"
[41] "        BBOX[0,-78,84,-72]],"                                                                                                                                                                                                                                                   
[42] "    ID[\"EPSG\",32618]]"                                                                                                                                                                                                                                                        
[43] "Data axis to CRS axis mapping: 1,2"                                                                                                                                                                                                                                             
[44] "Origin = (731453.000000000000000,4713838.000000000000000)"                                                                                                                                                                                                                      
[45] "Pixel Size = (1.000000000000000,-1.000000000000000)"                                                                                                                                                                                                                            
[46] "Metadata:"                                                                                                                                                                                                                                                                      
[47] "  AREA_OR_POINT=Area"                                                                                                                                                                                                                                                           
[48] "Image Structure Metadata:"                                                                                                                                                                                                                                                      
[49] "  COMPRESSION=LZW"                                                                                                                                                                                                                                                              
[50] "  INTERLEAVE=BAND"                                                                                                                                                                                                                                                              
[51] "Corner Coordinates:"                                                                                                                                                                                                                                                            
[52] "Upper Left  (  731453.000, 4713838.000) ( 72d10'52.71\"W, 42d32'32.18\"N)"                                                                                                                                                                                                      
[53] "Lower Left  (  731453.000, 4712471.000) ( 72d10'54.71\"W, 42d31'47.92\"N)"                                                                                                                                                                                                      
[54] "Upper Right (  733150.000, 4713838.000) ( 72d 9'38.40\"W, 42d32'30.35\"N)"                                                                                                                                                                                                      
[55] "Lower Right (  733150.000, 4712471.000) ( 72d 9'40.41\"W, 42d31'46.08\"N)"                                                                                                                                                                                                      
[56] "Center      (  732301.500, 4713154.500) ( 72d10'16.56\"W, 42d32' 9.13\"N)"                                                                                                                                                                                                      
[57] "Band 1 Block=1697x1 Type=Float64, ColorInterp=Gray"                                                                                                                                                                                                                             
[58] "  Min=305.070 Max=416.070 "                                                                                                                                                                                                                                                     
[59] "  Minimum=305.070, Maximum=416.070, Mean=359.853, StdDev=17.832"                                                                                                                                                                                                                
[60] "  NoData Value=-9999"                                                                                                                                                                                                                                                           
[61] "  Metadata:"                                                                                                                                                                                                                                                                    
[62] "    STATISTICS_MAXIMUM=416.06997680664"                                                                                                                                                                                                                                         
[63] "    STATISTICS_MEAN=359.85311802914"                                                                                                                                                                                                                                            
[64] "    STATISTICS_MINIMUM=305.07000732422"                                                                                                                                                                                                                                         
[65] "    STATISTICS_STDDEV=17.83169335933"                                                                                                                                                                                                                                           

The NoDataValue for this dataset is encoded as -9999.

Bad Data Values in Rasters


Bad data values are different from NoDataValues. Bad data values are values that fall outside of the applicable range of a dataset.

Examples of Bad Data Values:

  • The normalized difference vegetation index (NDVI), which is a measure of greenness, has a valid range of -1 to 1. Any value outside of that range would be considered a “bad” or miscalculated value.
  • Reflectance data in an image will often range from 0-1 or 0-10,000 depending upon how the data are scaled. Thus a value greater than 1 or greater than 10,000 is likely caused by an error in either data collection or processing.

Find Bad Data Values

Sometimes a raster’s metadata will tell us the range of expected values for a raster. Values outside of this range are suspect and we need to consider that when we analyze the data. Sometimes, we need to use some common sense and scientific insight as we examine the data - just as we would for field data to identify questionable values.

Plotting data with appropriate highlighting can help reveal patterns in bad values and may suggest a solution. Below, reclassification is used to highlight elevation values over 400m with a contrasting colour.
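
That plot is not reproduced here, but here is a minimal sketch of one way to approximate that reclassification, using a new logical column (over_400, a name we introduce just for this example) as the fill:

R

# flag cells above 400m elevation
DSM_HARV_df$over_400 <- DSM_HARV_df$HARV_dsmCrop > 400

ggplot() +
    geom_raster(data = DSM_HARV_df,
                mapping = aes(x = x, y = y, fill = over_400)) +
    scale_fill_manual(values = c("grey50", "red"),
                      name = "Elevation > 400m") +
    coord_quickmap()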

Create A Histogram of Raster Values


We can explore the distribution of values contained within our raster using the geom_histogram() function. Histograms are often useful in identifying outliers and bad data values in our raster data.

R

ggplot() +
    geom_histogram(data = DSM_HARV_df, mapping = aes(x = HARV_dsmCrop))

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice that a warning message is thrown when R creates the histogram.

stat_bin() using bins = 30. Pick better value with binwidth.

This warning is caused by a default setting in geom_histogram enforcing that there are 30 bins for the data. We can define the number of bins we want in the histogram by using the bins argument in the geom_histogram() function.

R

ggplot() +
    geom_histogram(data = DSM_HARV_df, 
                   mapping = aes(x = HARV_dsmCrop), 
                   bins = 40)

Note that the shape of this histogram looks similar to the previous one that was created using the default of 30 bins. The distribution of elevation values for our Digital Surface Model (DSM) looks reasonable. It is likely there are no bad data values in this particular raster.

Challenge 3: Explore Raster Metadata

Use describe() to determine the following about the NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif file:

  1. Does this file have the same CRS as DSM_HARV?
  2. What is the NoDataValue?
  3. What is the resolution of the raster data?
  4. How large would a 5x5 pixel area be on the Earth’s surface?
  5. Is the file a multi- or single-band raster?

Notice: this file is a hillshade. We will learn about hillshades in the Working with Multi-band Rasters in R episode.

R

describe("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif")

OUTPUT

 [1] "Driver: GTiff/GeoTIFF"                                                                                                                                                                                                                                                          
 [2] "Files: data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif"                                                                                                                                                                                                          
 [3] "Size is 1697, 1367"                                                                                                                                                                                                                                                             
 [4] "Coordinate System is:"                                                                                                                                                                                                                                                          
 [5] "PROJCRS[\"WGS 84 / UTM zone 18N\","                                                                                                                                                                                                                                             
 [6] "    BASEGEOGCRS[\"WGS 84\","                                                                                                                                                                                                                                                    
 [7] "        DATUM[\"World Geodetic System 1984\","                                                                                                                                                                                                                                  
 [8] "            ELLIPSOID[\"WGS 84\",6378137,298.257223563,"                                                                                                                                                                                                                        
 [9] "                LENGTHUNIT[\"metre\",1]]],"                                                                                                                                                                                                                                     
[10] "        PRIMEM[\"Greenwich\",0,"                                                                                                                                                                                                                                                
[11] "            ANGLEUNIT[\"degree\",0.0174532925199433]],"                                                                                                                                                                                                                         
[12] "        ID[\"EPSG\",4326]],"                                                                                                                                                                                                                                                    
[13] "    CONVERSION[\"UTM zone 18N\","                                                                                                                                                                                                                                               
[14] "        METHOD[\"Transverse Mercator\","                                                                                                                                                                                                                                        
[15] "            ID[\"EPSG\",9807]],"                                                                                                                                                                                                                                                
[16] "        PARAMETER[\"Latitude of natural origin\",0,"                                                                                                                                                                                                                            
[17] "            ANGLEUNIT[\"degree\",0.0174532925199433],"                                                                                                                                                                                                                          
[18] "            ID[\"EPSG\",8801]],"                                                                                                                                                                                                                                                
[19] "        PARAMETER[\"Longitude of natural origin\",-75,"                                                                                                                                                                                                                         
[20] "            ANGLEUNIT[\"degree\",0.0174532925199433],"                                                                                                                                                                                                                          
[21] "            ID[\"EPSG\",8802]],"                                                                                                                                                                                                                                                
[22] "        PARAMETER[\"Scale factor at natural origin\",0.9996,"                                                                                                                                                                                                                   
[23] "            SCALEUNIT[\"unity\",1],"                                                                                                                                                                                                                                            
[24] "            ID[\"EPSG\",8805]],"                                                                                                                                                                                                                                                
[25] "        PARAMETER[\"False easting\",500000,"                                                                                                                                                                                                                                    
[26] "            LENGTHUNIT[\"metre\",1],"                                                                                                                                                                                                                                           
[27] "            ID[\"EPSG\",8806]],"                                                                                                                                                                                                                                                
[28] "        PARAMETER[\"False northing\",0,"                                                                                                                                                                                                                                        
[29] "            LENGTHUNIT[\"metre\",1],"                                                                                                                                                                                                                                           
[30] "            ID[\"EPSG\",8807]]],"                                                                                                                                                                                                                                               
[31] "    CS[Cartesian,2],"                                                                                                                                                                                                                                                           
[32] "        AXIS[\"(E)\",east,"                                                                                                                                                                                                                                                     
[33] "            ORDER[1],"                                                                                                                                                                                                                                                          
[34] "            LENGTHUNIT[\"metre\",1]],"                                                                                                                                                                                                                                          
[35] "        AXIS[\"(N)\",north,"                                                                                                                                                                                                                                                    
[36] "            ORDER[2],"                                                                                                                                                                                                                                                          
[37] "            LENGTHUNIT[\"metre\",1]],"                                                                                                                                                                                                                                          
[38] "    USAGE["                                                                                                                                                                                                                                                                     
[39] "        SCOPE[\"Engineering survey, topographic mapping.\"],"                                                                                                                                                                                                                   
[40] "        AREA[\"Between 78°W and 72°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Canada - Nunavut; Ontario; Quebec. Colombia. Cuba. Ecuador. Greenland. Haiti. Jamaica. Panama. Turks and Caicos Islands. United States (USA). Venezuela.\"],"
[41] "        BBOX[0,-78,84,-72]],"                                                                                                                                                                                                                                                   
[42] "    ID[\"EPSG\",32618]]"                                                                                                                                                                                                                                                        
[43] "Data axis to CRS axis mapping: 1,2"                                                                                                                                                                                                                                             
[44] "Origin = (731453.000000000000000,4713838.000000000000000)"                                                                                                                                                                                                                      
[45] "Pixel Size = (1.000000000000000,-1.000000000000000)"                                                                                                                                                                                                                            
[46] "Metadata:"                                                                                                                                                                                                                                                                      
[47] "  AREA_OR_POINT=Area"                                                                                                                                                                                                                                                           
[48] "Image Structure Metadata:"                                                                                                                                                                                                                                                      
[49] "  COMPRESSION=LZW"                                                                                                                                                                                                                                                              
[50] "  INTERLEAVE=BAND"                                                                                                                                                                                                                                                              
[51] "Corner Coordinates:"                                                                                                                                                                                                                                                            
[52] "Upper Left  (  731453.000, 4713838.000) ( 72d10'52.71\"W, 42d32'32.18\"N)"                                                                                                                                                                                                      
[53] "Lower Left  (  731453.000, 4712471.000) ( 72d10'54.71\"W, 42d31'47.92\"N)"                                                                                                                                                                                                      
[54] "Upper Right (  733150.000, 4713838.000) ( 72d 9'38.40\"W, 42d32'30.35\"N)"                                                                                                                                                                                                      
[55] "Lower Right (  733150.000, 4712471.000) ( 72d 9'40.41\"W, 42d31'46.08\"N)"                                                                                                                                                                                                      
[56] "Center      (  732301.500, 4713154.500) ( 72d10'16.56\"W, 42d32' 9.13\"N)"                                                                                                                                                                                                      
[57] "Band 1 Block=1697x1 Type=Float64, ColorInterp=Gray"                                                                                                                                                                                                                             
[58] "  Min=-0.714 Max=1.000 "                                                                                                                                                                                                                                                        
[59] "  Minimum=-0.714, Maximum=1.000, Mean=0.313, StdDev=0.481"                                                                                                                                                                                                                      
[60] "  NoData Value=-9999"                                                                                                                                                                                                                                                           
[61] "  Metadata:"                                                                                                                                                                                                                                                                    
[62] "    STATISTICS_MAXIMUM=0.99999973665016"                                                                                                                                                                                                                                        
[63] "    STATISTICS_MEAN=0.31255246777216"                                                                                                                                                                                                                                           
[64] "    STATISTICS_MINIMUM=-0.71362979358008"                                                                                                                                                                                                                                       
[65] "    STATISTICS_STDDEV=0.48129385401108"                                                                                                                                                                                                                                         

R

# If you are getting an error, check your file path: 
# You might need to change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif"

  1. Does this file have the same CRS as DSM_HARV? Yes: UTM Zone 18, WGS84, meters.
  2. What is the NoDataValue? -9999
  3. What is the resolution of the raster data? 1x1
  4. How large would a 5x5 pixel area be on the Earth’s surface? 5m x 5m. The resolution is 1x1 and the units are meters, so a 5x5 pixel area covers 5m x 5m.
  5. Is the file a multi- or single-band raster? Single-band.

More Resources

  • Read more about the terra package in R.
  • We use the ggplot2 package for plotting in this workshop, but the tmap package actually makes plotting rasters even easier. tmap is especially useful if you plan to plot rasters and vectors in the same plot.

Key Points

  • The GeoTIFF file format includes metadata about the raster data.
  • To plot raster data with the ggplot2 package, we need to convert it to a dataframe.
  • R stores CRS information in the Proj4 format.
  • Be careful when dealing with missing or bad data values.

Content from Plot Raster Data


Last updated on 2024-08-19 | Edit this page

Estimated time: 70 minutes

Overview

Questions

  • How can I create categorized or customized maps of raster data?
  • How can I customize the color scheme of a raster image?
  • How can I layer raster data in a single image?

Objectives

  • Build customized plots for a single band raster using the ggplot2 package.
  • Layer a raster dataset on top of a hillshade to create an elegant basemap.

Plot Raster Data in R


This episode covers how to plot a raster in R using the ggplot2 package with customized coloring schemes. It also covers how to layer a raster on top of a hillshade to produce an elegant map. We will continue working with the Digital Surface Model (DSM) raster for the NEON Harvard Forest Field Site.

First, let’s load in the libraries and data we will need for this lesson:

R

library(terra)
library(ggplot2)
library(dplyr)

# DSM data for Harvard Forest
DSM_HARV <- 
  rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")

# If you are getting an error, check your file path: 
# You might need to change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"

DSM_HARV_df <- as.data.frame(DSM_HARV, xy = TRUE)

Plotting Data Using Breaks


In the previous episode, we viewed our data using a continuous color ramp. For clarity and visibility of the plot, we may prefer to view the data “symbolized” or colored according to ranges of values. This is comparable to a “classified” map. To do this, we need to tell ggplot how many groups to break our data into, and where those breaks should be. To make these decisions, it is useful to first explore the distribution of the data using a bar plot. To begin with, we will use dplyr’s mutate() function combined with cut() to split the data into 3 bins.

R

DSM_HARV_df <- DSM_HARV_df %>%
                mutate(fct_elevation = cut(HARV_dsmCrop, breaks = 3))

ggplot() +
    geom_bar(data = DSM_HARV_df, mapping = aes(x = fct_elevation))

If we want to know the cutoff values for the groups, we can ask for the unique values of fct_elevation:

R

unique(DSM_HARV_df$fct_elevation)

OUTPUT

[1] (379,416] (342,379] (305,342]
Levels: (305,342] (342,379] (379,416]

Notice that the cut() function shows closed intervals on the right (]), and open intervals on the left ((). In our example, (305, 342] means that values greater than 305 (but not 305 itself) and values less than or equal to 342 are included in the interval.
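
We can see this behaviour with a small example vector; note that the lowest break point itself falls outside the first interval (pass include.lowest = TRUE to cut() if you need it included):

R

# 305 is not in any interval (intervals are open on the left), so it becomes NA;
# 342 falls in (305,342] because intervals are closed on the right
cut(c(305, 306, 342, 343), breaks = c(305, 342, 379, 416))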

And we can get the count of values in each group using dplyr’s group_by() and count() functions:

R

DSM_HARV_df %>%
  group_by(fct_elevation) %>%
  count() %>% 
  ungroup()

OUTPUT

# A tibble: 3 × 2
  fct_elevation       n
  <fct>           <int>
1 (305,342]      418891
2 (342,379]     1530073
3 (379,416]      370835

We might prefer to customize the cutoff values for these groups. Let’s round the cutoff values so that we have groups for the ranges of 301–350 m, 351–400 m, and 401–450 m. To implement this we will give cut() a numeric vector of break points instead of the number of breaks we want.

R

custom_bins <- c(300, 350, 400, 450)

DSM_HARV_df <- DSM_HARV_df %>%
  mutate(fct_elevation_2 = cut(HARV_dsmCrop, breaks = custom_bins))

unique(DSM_HARV_df$fct_elevation_2)

OUTPUT

[1] (400,450] (350,400] (300,350]
Levels: (300,350] (350,400] (400,450]

And now we can plot our bar plot again, using the new groups:

R

ggplot() +
  geom_bar(data = DSM_HARV_df, mapping = aes(x = fct_elevation_2))

And we can get the count of values in each group in the same way we did before:

R

DSM_HARV_df %>%
  group_by(fct_elevation_2) %>%
  count() %>% 
  ungroup()

OUTPUT

# A tibble: 3 × 2
  fct_elevation_2       n
  <fct>             <int>
1 (300,350]        741815
2 (350,400]       1567316
3 (400,450]         10668

We can use those groups to plot our raster data, with each group being a different color:

R

ggplot() +
  geom_raster(data = DSM_HARV_df , 
              mapping = aes(x = x, y = y, fill = fct_elevation_2)) + 
  coord_quickmap()

The plot above uses the default colors inside ggplot for raster objects. We can specify our own colors to make the plot look a little nicer. R has a built-in set of colors for plotting terrain, available through the terrain.colors() function. Since we have three bins, we want to create a 3-color palette:

R

terrain.colors(3)

OUTPUT

[1] "#00A600" "#ECB176" "#F2F2F2"

The terrain.colors() function returns hex colors; each of these character strings represents a color. To use these in our map, we pass them in using the scale_fill_manual() function.

R

ggplot() +
 geom_raster(data = DSM_HARV_df , 
             mapping = aes(x = x, y = y, fill = fct_elevation_2)) + 
    scale_fill_manual(values = terrain.colors(3)) + 
    coord_quickmap()

More Plot Formatting

If we need to create multiple plots using the same color palette, we can create an R object (my_col) for the set of colors that we want to use. We can then quickly change the palette across all plots by modifying the my_col object, rather than each individual plot.

We can label the x- and y-axes of our plot too using the labs() function. We can also give the legend a more meaningful title by passing either a value to the name argument of the scale_fill_manual() function or by specifying it within the labs() function as fill = "name here".

R

my_col <- terrain.colors(3)

ggplot() +
 geom_raster(data = DSM_HARV_df , 
             mapping = aes(x = x, y = y, fill = fct_elevation_2)) + 
  scale_fill_manual(values = my_col) + 
  coord_quickmap() + 
  labs(x = "longitude", 
       y = "latitude", 
       fill = "Elevation")

We can also turn off the labels of both axes by passing a NULL value to the labs() function for the x and y values.

R

ggplot() +
 geom_raster(data = DSM_HARV_df , 
             mapping = aes(x = x, y = y, fill = fct_elevation_2)) + 
  scale_fill_manual(values = my_col) + 
  coord_quickmap() + 
  labs(x = NULL, 
       y = NULL, 
       fill = "Elevation")

Challenge 1: Plot Using Custom Breaks

Create a plot of the Harvard Forest Digital Surface Model (DSM) that has:

  1. Six classified ranges of values (break points) that are evenly divided among the range of pixel values.
  2. Axis labels.
  3. A plot title. (You can specify this in the labs() function as well, using title = "Title here")

R

DSM_HARV_df <- DSM_HARV_df  %>%
               mutate(fct_elevation_6 = cut(HARV_dsmCrop, breaks = 6)) 

my_col <- terrain.colors(6)

ggplot() +
    geom_raster(data = DSM_HARV_df, 
                mapping = aes(x = x, y = y, fill = fct_elevation_6)) + 
    scale_fill_manual(values = my_col) + 
    labs(x = "UTM Easting Coordinate (m)", 
         y = "UTM Northing Coordinate (m)", 
         fill = "Elevation", 
         title = "Classified Elevation Map - NEON Harvard Forest Field Site") + 
    coord_quickmap()

Layering Rasters


We can layer a raster on top of a hillshade raster for the same area, and use a transparency factor to create a 3-dimensional shaded effect. A hillshade is a raster that maps the shadows and texture that you would see from above when viewing terrain. We will add a custom color, making the plot grey.

First we need to read in our DSM hillshade data and view the structure:

R

DSM_hill_HARV <-
  rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif"

DSM_hill_HARV

OUTPUT

class       : SpatRaster 
dimensions  : 1367, 1697, 1  (nrow, ncol, nlyr)
resolution  : 1, 1  (x, y)
extent      : 731453, 733150, 4712471, 4713838  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 18N (EPSG:32618) 
source      : HARV_DSMhill.tif 
name        : HARV_DSMhill 
min value   :   -0.7136298 
max value   :    0.9999997 

Next we convert it to a dataframe, so that we can plot it using ggplot2:

R

DSM_hill_HARV_df <- as.data.frame(DSM_hill_HARV, xy = TRUE) 

str(DSM_hill_HARV_df)

OUTPUT

'data.frame':	2313675 obs. of  3 variables:
 $ x           : num  731454 731456 731456 731458 731458 ...
 $ y           : num  4713836 4713836 4713836 4713836 4713836 ...
 $ HARV_DSMhill: num  -0.15567 0.00743 0.86989 0.9791 0.96283 ...

Now we can plot the hillshade data. Note that the alpha aesthetic controls transparency: a value of 0 is fully transparent and a value of 1 is fully opaque. We can stop R from showing the legend for a particular scale by setting guide = "none":

R

ggplot() +
  geom_raster(data = DSM_hill_HARV_df,
              mapping = aes(x = x, y = y, alpha = HARV_DSMhill)) + 
  scale_alpha(range =  c(0.15, 0.65), guide = "none") + 
  coord_quickmap()

Data Tip

Turn off, or hide, the legend on a plot by adding guide = "none" to a scale_something() function or by setting theme(legend.position = "none").
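
For example, here is a minimal sketch of the theme() approach, using the hillshade plot from above:

R

ggplot() +
  geom_raster(data = DSM_hill_HARV_df,
              mapping = aes(x = x, y = y, alpha = HARV_DSMhill)) + 
  scale_alpha(range = c(0.15, 0.65)) + 
  coord_quickmap() + 
  theme(legend.position = "none")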

We can layer another raster on top of our hillshade by adding another call to the geom_raster() function. Let's overlay DSM_HARV on top of DSM_hill_HARV. We can also add a title to our plot within the labs() function using labs(title = "Plot title"), or we can use ggtitle("Plot title").

R

ggplot() +
  geom_raster(data = DSM_HARV_df, 
              mapping = aes(x = x, y = y, fill = HARV_dsmCrop)) + 
  geom_raster(data = DSM_hill_HARV_df, 
              mapping = aes(x = x, y = y, alpha = HARV_DSMhill)) +  
  scale_fill_viridis_c() +  
  scale_alpha(range = c(0.15, 0.65), guide = "none") +  
  labs(title = "Elevation with hillshade") +
  coord_quickmap()

Challenge 2: Create DTM & DSM for SJER

Use the files in the data/NEON-DS-Airborne-Remote-Sensing/SJER/ (or data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/ depending on your setup) directory to create a Digital Terrain Model map and Digital Surface Model map of the San Joaquin Experimental Range field site.

Make sure to:

  • include hillshade in the maps,
  • label axes on the DSM map and exclude them from the DTM map,
  • include a title for each map,
  • experiment with various alpha values and color palettes to represent the data.

R

# CREATE DSM MAPS

# import DSM data
DSM_SJER <- 
  rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif"

# convert to a df for plotting
DSM_SJER_df <- as.data.frame(DSM_SJER, xy = TRUE)

# import DSM hillshade
DSM_hill_SJER <- 
  rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmHill.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmHill.tif"

# convert to a df for plotting
DSM_hill_SJER_df <- as.data.frame(DSM_hill_SJER, xy = TRUE)

# Build Plot
ggplot() +
    geom_raster(data = DSM_SJER_df , 
                mapping = aes(x = x, y = y, 
                              fill = SJER_dsmCrop), alpha = 0.8) + 
    geom_raster(data = DSM_hill_SJER_df, 
                mapping = aes(x = x, y = y, alpha = SJER_dsmHill)) +
    scale_fill_viridis_c() +
    scale_alpha(range = c(0.4, 0.7), guide = "none") +
    labs(x = "UTM Easting Coordinate (m)", 
         y = "UTM Northing Coordinate (m)", 
         title = "DSM with Hillshade") +
    coord_quickmap()

R

# CREATE DTM MAP
# import DTM
DTM_SJER <- 
  rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmCrop.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmCrop.tif"

DTM_SJER_df <- as.data.frame(DTM_SJER, xy = TRUE)

# DTM Hillshade
DTM_hill_SJER <- 
  rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmHill.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmHill.tif"

DTM_hill_SJER_df <- as.data.frame(DTM_hill_SJER, xy = TRUE)

ggplot() +
    geom_raster(data = DTM_SJER_df,
                mapping = aes(x = x, y = y,
                              fill = SJER_dtmCrop),
                alpha = 0.8) +
    geom_raster(data = DTM_hill_SJER_df,
                mapping = aes(x = x, y = y,
                              alpha = SJER_dtmHill)) +
    scale_fill_viridis_c() +
    scale_alpha(range = c(0.4, 0.7), guide = "none") +
    labs(x = NULL, 
         y = NULL, 
         title = "DTM with Hillshade") +
    coord_quickmap()

Key Points

  • Continuous data ranges can be grouped into categories using mutate() and cut().
  • Use built-in terrain.colors() or set your preferred color scheme manually.
  • Layer rasters on top of one another by using the alpha aesthetic.

Content from Reproject Raster Data


Last updated on 2024-08-19 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How do I work with raster data sets that are in different projections?

Objectives

  • Reproject a raster in R.

Sometimes we encounter raster datasets that do not “line up” when plotted or analyzed. Rasters that don’t line up are most often in different Coordinate Reference Systems (CRS). This episode explains how to deal with rasters in different, known CRSs. It will walk through reprojecting rasters in R using the project() function in the terra package.

Let’s load the packages we’ll need for this lesson:

R

library(terra)
library(ggplot2)
library(dplyr)

Raster Projection in R


In the previous episode, we learned how to layer a raster file on top of a hillshade for a nice looking basemap. In that episode, all of our data were in the same CRS. What happens when things don’t line up?

For this episode, we will be working with the Harvard Forest Digital Terrain Model data. This differs from the surface model data we’ve been working with so far in that the digital surface model (DSM) includes the tops of trees, while the digital terrain model (DTM) shows the ground level.

Here, we will create a map of the Harvard Forest Digital Terrain Model (DTM_HARV) draped or layered on top of the hillshade (DTM_hill_HARV). The hillshade layer maps the terrain using light and shadow to create a 3D-looking image, based on a hypothetical illumination of the ground level.

Source: National Ecological Observatory Network (NEON).

First, we need to import the DTM and DTM hillshade data.

R

DTM_HARV <- 
    rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop.tif"

DTM_hill_HARV <- 
    rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_DTMhill_WGS84.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_DTMhill_WGS84.tif"

Next, we will convert each of these datasets to a dataframe for plotting with ggplot.

R

DTM_HARV_df <- as.data.frame(DTM_HARV, xy = TRUE)

DTM_hill_HARV_df <- as.data.frame(DTM_hill_HARV, xy = TRUE)

Now we can create a map of the DTM layered over the hillshade. Note that we use scale_fill_gradientn() here to specify our color scale. This allows for us to easily create a color ramp for the data we want to display.

R

ggplot() +
     geom_raster(data = DTM_HARV_df , 
                 mapping = aes(x = x, y = y, 
                  fill = HARV_dtmCrop)) + 
     geom_raster(data = DTM_hill_HARV_df, 
                 mapping = aes(x = x, y = y, 
                   alpha = HARV_DTMhill_WGS84)) +
     scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
     coord_quickmap()

Our results are curious - neither the Digital Terrain Model (DTM_HARV_df) nor the DTM Hillshade (DTM_hill_HARV_df) plotted. Let’s try to plot the DTM on its own to make sure there are data there.

R

ggplot() +
  geom_raster(data = DTM_HARV_df,
      mapping = aes(x = x, y = y, fill = HARV_dtmCrop)) +
  scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
  coord_quickmap()

Our DTM seems to contain data and plots just fine.

Next we plot the DTM Hillshade on its own to see whether everything is OK.

R

ggplot() + 
  geom_raster(data = DTM_hill_HARV_df,
              mapping = aes(x = x, y = y,
                            alpha = HARV_DTMhill_WGS84)) + 
  coord_quickmap()

If we look at the axes, we can see that the projections of the two rasters are different. When this is the case, ggplot won’t render the image. It won’t even throw an error message to tell you something has gone wrong. We can look at Coordinate Reference Systems (CRSs) of the DTM and the hillshade data to see how they differ.

R

crs(DTM_HARV, proj = TRUE)

OUTPUT

[1] "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"

R

crs(DTM_hill_HARV, proj = TRUE)

OUTPUT

[1] "+proj=longlat +datum=WGS84 +no_defs"

DTM_HARV is in the UTM projection, with units of meters. DTM_hill_HARV is in Geographic WGS84 - which is represented by latitude and longitude values.

Because the two rasters are in different CRSs, they don’t line up when plotted in R. We need to reproject (or change the projection of) DTM_hill_HARV into the UTM CRS. Alternatively, we could reproject DTM_HARV into WGS84.

Reproject Rasters


We can use the project() function to reproject a raster into a new CRS. Keep in mind that reprojection only works when you first have a defined CRS for the raster object that you want to reproject. It cannot be used if no CRS is defined. Lucky for us, the DTM_hill_HARV has a defined CRS.

Data Tip

When we reproject a raster, we move it from one “grid” to another. Thus, we are modifying the data! Keep this in mind as we work with raster data.

To use the project() function, we need to define two things:

  1. the object we want to reproject and
  2. the CRS that we want to reproject it to.

The syntax is project(x = RasterObject, y = crs)

We want the CRS of our hillshade to match the DTM_HARV raster. One option would be to use the CRS of DTM_HARV directly in the project() function, but this could cause issues later on because the resolutions of DTM_HARV and DTM_hill_HARV are different. That’s okay, because we can specify the resolution of the output file in the project() function using the res argument.
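
For reference, that approach might look something like the following sketch; we won’t use it in this lesson, and the object name is just for illustration.

R

# Sketch (not run here): reproject to the CRS of DTM_HARV and set the
# output resolution to match the template using project()'s res argument.
# The object name DTM_hill_alt is hypothetical.
DTM_hill_alt <- project(x = DTM_hill_HARV,
                        y = crs(DTM_HARV),
                        res = res(DTM_HARV))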

That’s a little complicated for our needs though. In our case, the most effective way to reproject the dataset is to use the DTM_HARV layer as a template for the reprojection. We can specify that by using the syntax project(x = RasterObject, y = TemplateRasterObject).

Note that we are using the project() function on the raster object, not the data frame we use for plotting with ggplot.

R

DTM_hill_UTMZ18N_HARV <- project(x = DTM_hill_HARV,
                                 y = DTM_HARV)
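
If you want to confirm the reprojection worked, you can compare the CRS and resolution of the new object with those of the template raster; both should now report the same UTM CRS and cell size:

R

crs(DTM_hill_UTMZ18N_HARV, proj = TRUE)
res(DTM_hill_UTMZ18N_HARV)
res(DTM_HARV)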

For plotting with ggplot(), we will need to create a dataframe from our newly reprojected raster.

R

DTM_hill_HARV_2_df <- as.data.frame(DTM_hill_UTMZ18N_HARV, xy = TRUE)

We can now create a plot of this data.

R

ggplot() +
     geom_raster(data = DTM_HARV_df , 
                 mapping = aes(x = x, y = y, 
                               fill = HARV_dtmCrop)) + 
     geom_raster(data = DTM_hill_HARV_2_df, 
                 mapping = aes(x = x, y = y, 
                               alpha = HARV_DTMhill_WGS84)) +
     scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
     coord_quickmap()

We have now successfully draped the Digital Terrain Model on top of our hillshade to produce a nice looking, textured map!

Challenge 1: Reproject, then Plot a Digital Terrain Model

Create a map of the San Joaquin Experimental Range field site using the SJER_DSMhill_WGS84.tif and SJER_dsmCrop.tif files.

Reproject the data as necessary to make things line up!

R

# import DSM
DSM_SJER <- 
    rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif"

# import DSM hillshade
DSM_hill_SJER_WGS <-
    rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_DSMhill_WGS84.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_DSMhill_WGS84.tif"

# reproject raster
DSM_hill_UTMZ18N_SJER <- project(x = DSM_hill_SJER_WGS,
                                 y = DSM_SJER)

# convert to data.frames
DSM_SJER_df <- as.data.frame(DSM_SJER, xy = TRUE)

DSM_hill_SJER_df <- as.data.frame(DSM_hill_UTMZ18N_SJER, xy = TRUE)

ggplot() +
     geom_raster(data = DSM_hill_SJER_df, 
                 mapping = aes(x = x, y = y, 
                               alpha = SJER_DSMhill_WGS84)) +
     geom_raster(data = DSM_SJER_df, 
                 mapping = aes(x = x, y = y, 
                               fill = SJER_dsmCrop), 
                 alpha = 0.8) + 
     scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
     coord_quickmap()

Key Points

  • In order to plot two raster data sets together, they must be in the same CRS.
  • Use the project() function to convert between CRSs.

Content from Open and Plot Vector Layers


Last updated on 2024-08-19 | Edit this page

Estimated time: 30 minutes

Overview

Questions

  • How can I distinguish between and visualize point, line and polygon vector data?

Objectives

  • Know the difference between point, line, and polygon vector elements.
  • Load point, line, and polygon vector layers into R.
  • Access the attributes of a spatial object in R.

Starting with this episode, we will be moving from working with raster data to working with vector data. In this episode, we will open and plot point, line and polygon vector data loaded from ESRI’s shapefile format into R. These data refer to the NEON Harvard Forest field site.

Import Vector Data


We will use the sf package to work with vector data in R. Fun fact: sf stands for “simple features”.

If you haven’t installed the sf package yet, you can do so from the Packages tab in RStudio, or by typing the following code into your R console:

install.packages("sf")

Make sure you have the sf library loaded along with the other libraries we will need.

R

library(ggplot2)
library(dplyr)
library(sf)

The vector layers that we will import from ESRI’s shapefile format are:

  • HarClip_UTMZ18 - the area of interest (AOI) boundary
  • HARV_roads - roads and trails at the field site
  • HARVtower_UTM18N - the location of the HARV tower

The first vector layer that we will open contains the boundary of our study area (or our Area Of Interest or AOI, hence the name aoiBoundary). To import a vector layer from an ESRI shapefile we use the sf function st_read(). st_read() requires the file path to the ESRI shapefile.

Let’s import our AOI:

R

aoi_boundary_HARV <- st_read(
  "data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")

OUTPUT

Reading layer `HarClip_UTMZ18' from data source 
  `/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 732128 ymin: 4713209 xmax: 732251.1 ymax: 4713359
Projected CRS: WGS 84 / UTM zone 18N

R

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"

Vector Layer Metadata & Attributes


When we import the HarClip_UTMZ18 vector layer from an ESRI shapefile into R (as our aoi_boundary_HARV object), the st_read() function automatically stores information about the data. We are particularly interested in the geospatial metadata, describing the format, CRS, extent, and other components of the vector data, and the attributes which describe properties associated with each individual vector object.

Spatial Metadata


Key metadata for all vector layers includes:

  1. Object Type: the class of the imported object.
  2. Coordinate Reference System (CRS): the projection of the data.
  3. Extent: the spatial extent (i.e. geographic area that the vector layer covers) of the data. Note that the spatial extent for a vector layer represents the combined extent for all individual objects in the vector layer.

We can view metadata of a vector layer using the st_geometry_type(), st_crs() and st_bbox() functions. First, let’s view the geometry type for our AOI vector layer:

R

st_geometry_type(aoi_boundary_HARV)

OUTPUT

[1] POLYGON
18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE

Our aoi_boundary_HARV is a polygon spatial object. The 18 levels listed beneath the output are all of the possible geometry categories. Now let’s check which CRS this file is in:

R

st_crs(aoi_boundary_HARV)

OUTPUT

Coordinate Reference System:
  User input: WGS 84 / UTM zone 18N 
  wkt:
PROJCRS["WGS 84 / UTM zone 18N",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["UTM zone 18N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-75,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    ID["EPSG",32618]]

Our data are in the CRS UTM zone 18N. The CRS is critical to interpreting the spatial object’s extent values, as it specifies the units. To find the extent of our AOI, we can use the st_bbox() function:

R

st_bbox(aoi_boundary_HARV)

OUTPUT

     xmin      ymin      xmax      ymax 
 732128.0 4713208.7  732251.1 4713359.2 

The spatial extent of a vector layer or R spatial object represents the geographic “edge” or location that is the furthest north, south, east, and west. Thus it represents the overall geographic coverage of the spatial object.

Image Source: National Ecological Observatory Network (NEON).

Lastly, we can view all of the metadata and attributes for this R spatial object by printing it to the screen:

R

aoi_boundary_HARV

OUTPUT

Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 732128 ymin: 4713209 xmax: 732251.1 ymax: 4713359
Projected CRS: WGS 84 / UTM zone 18N
  id                       geometry
1  1 POLYGON ((732128 4713359, 7...

Plot a vector layer


Next, let’s visualize the data in our sf object using the ggplot2 package. Unlike with raster data, we do not need to convert vector data to a dataframe before plotting with ggplot.

We’re going to customize our boundary plot by setting the color and fill for our plot. When plotting sf objects with ggplot2, you need to use the coord_sf() coordinate system.

R

ggplot() +
  geom_sf(data = aoi_boundary_HARV, color = "black", fill = "cyan1") +
  labs(title = "AOI Boundary Plot") +
  coord_sf()

Challenge 1: Import Line and Point Vector Layers

Using the steps above, import the HARV_roads and HARVtower_UTM18N vector layers into R. Call the HARV_roads object lines_HARV and the HARVtower_UTM18N point_HARV.

Answer the following questions:

  1. What type of R spatial object is created when you import each layer?

  2. What is the CRS and extent for each object?

  3. Do the files contain points, lines, or polygons?

  4. How many spatial objects are in each file?

First we import the data:

R

lines_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")

OUTPUT

Reading layer `HARV_roads' from data source 
  `/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 13 features and 15 fields
Geometry type: MULTILINESTRING
Dimension:     XY
Bounding box:  xmin: 730741.2 ymin: 4711942 xmax: 733295.5 ymax: 4714260
Projected CRS: WGS 84 / UTM zone 18N

R

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp"

point_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")

OUTPUT

Reading layer `HARVtower_UTM18N' from data source 
  `/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 14 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 732183.2 ymin: 4713265 xmax: 732183.2 ymax: 4713265
Projected CRS: WGS 84 / UTM zone 18N

R

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"

Then we check its geometry type.

R

st_geometry_type(lines_HARV)

OUTPUT

 [1] MULTILINESTRING MULTILINESTRING MULTILINESTRING MULTILINESTRING
 [5] MULTILINESTRING MULTILINESTRING MULTILINESTRING MULTILINESTRING
 [9] MULTILINESTRING MULTILINESTRING MULTILINESTRING MULTILINESTRING
[13] MULTILINESTRING
18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE

R

st_geometry_type(point_HARV)

OUTPUT

[1] POINT
18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE

We also check the CRS and extent of each object:

R

st_crs(lines_HARV)

OUTPUT

Coordinate Reference System:
  User input: WGS 84 / UTM zone 18N 
  wkt:
PROJCRS["WGS 84 / UTM zone 18N",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["UTM zone 18N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-75,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    ID["EPSG",32618]]

R

st_bbox(lines_HARV)

OUTPUT

     xmin      ymin      xmax      ymax 
 730741.2 4711942.0  733295.5 4714260.0 

R

st_crs(point_HARV)

OUTPUT

Coordinate Reference System:
  User input: WGS 84 / UTM zone 18N 
  wkt:
PROJCRS["WGS 84 / UTM zone 18N",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["UTM zone 18N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-75,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    ID["EPSG",32618]]

R

st_bbox(point_HARV)

OUTPUT

     xmin      ymin      xmax      ymax 
 732183.2 4713265.0  732183.2 4713265.0 

To see the number of objects in each file, we can look at the output from when we read these objects into R. lines_HARV contains 13 features (all lines) and point_HARV contains only one point.

Key Points

  • Metadata for vector layers include geometry type, CRS, and extent.
  • Load spatial objects into R with the st_read() function.
  • Spatial objects can be plotted directly with ggplot using the geom_sf() function. No need to convert to a dataframe.

Content from Explore and Plot by Vector Layer Attributes


Last updated on 2024-08-19 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How can I compute on the attributes of a spatial object?

Objectives

  • Query attributes of a spatial object.
  • Subset spatial objects using specific attribute values.
  • Plot a vector feature, colored by unique attribute values.

This episode continues our discussion of vector layer attributes and covers how to work with vector layer attributes in R. It covers how to identify and query layer attributes, as well as how to subset features by specific attribute values. Finally, we will learn how to plot a feature according to a set of attribute values.

Load the Data


We will continue using the sf and ggplot2 packages in this episode. Make sure that you have these packages loaded. We will continue to work with the three ESRI shapefiles (vector layers) that we loaded in a previous episode.

R

library(ggplot2)
library(dplyr)
library(sf)

R

point_HARV <-
    st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"

lines_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp"

aoi_boundary_HARV <-
    st_read("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"

Query Vector Feature Metadata


As we discussed in the Open and Plot Vector Layers in R episode, we can view metadata associated with an R object using:

  • st_geometry_type() - The type of vector data stored in the object.
  • st_bbox() - The spatial extent (geographic area covered by) of the object.
  • st_crs() - The CRS (spatial projection) of the data.

We started to explore our point_HARV object in the previous episode. To see a summary of all of the metadata associated with our object, we can view the object with View(point_HARV) or print a summary of the object itself to the console. We’re going to do this, but with lines_HARV today:

R

lines_HARV

OUTPUT

Simple feature collection with 13 features and 15 fields
Geometry type: MULTILINESTRING
Dimension:     XY
Bounding box:  xmin: 730741.2 ymin: 4711942 xmax: 733295.5 ymax: 4714260
Projected CRS: WGS 84 / UTM zone 18N
First 10 features:
   OBJECTID_1 OBJECTID       TYPE             NOTES MISCNOTES RULEID
1          14       48 woods road Locust Opening Rd      <NA>      5
2          40       91   footpath              <NA>      <NA>      6
3          41      106   footpath              <NA>      <NA>      6
4         211      279 stone wall              <NA>      <NA>      1
5         212      280 stone wall              <NA>      <NA>      1
6         213      281 stone wall              <NA>      <NA>      1
7         214      282 stone wall              <NA>      <NA>      1
8         215      283 stone wall              <NA>      <NA>      1
9         216      284 stone wall              <NA>      <NA>      1
10        553      674  boardwalk              <NA>      <NA>      2
            MAPLABEL SHAPE_LENG             LABEL BIKEHORSE RESVEHICLE RECMAP
1  Locust Opening Rd 1297.35706 Locust Opening Rd         Y         R1      Y
2               <NA>  146.29984              <NA>         Y         R1      Y
3               <NA>  676.71804              <NA>         Y         R2      Y
4               <NA>  231.78957              <NA>      <NA>       <NA>   <NA>
5               <NA>   45.50864              <NA>      <NA>       <NA>   <NA>
6               <NA>  198.39043              <NA>      <NA>       <NA>   <NA>
7               <NA>  143.19240              <NA>      <NA>       <NA>   <NA>
8               <NA>   90.33118              <NA>      <NA>       <NA>   <NA>
9               <NA>   35.88146              <NA>      <NA>       <NA>   <NA>
10              <NA>   67.43464              <NA>         N         R3      N
   Shape_Le_1                            ResVehic_1                  BicyclesHo
1  1297.10617    R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
2   146.29983    R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
3   676.71807 R2 - 4WD/High Clearance Vehicles Only Bicycles and Horses Allowed
4   231.78962                                  <NA>                        <NA>
5    45.50859                                  <NA>                        <NA>
6   198.39041                                  <NA>                        <NA>
7   143.19241                                  <NA>                        <NA>
8    90.33114                                  <NA>                        <NA>
9    35.88152                                  <NA>                        <NA>
10   67.43466              R3 - No Vehicles Allowed      DO NOT SHOW ON REC MAP
                         geometry
1  MULTILINESTRING ((730819.2 ...
2  MULTILINESTRING ((732040.2 ...
3  MULTILINESTRING ((732057 47...
4  MULTILINESTRING ((731903.6 ...
5  MULTILINESTRING ((732039.1 ...
6  MULTILINESTRING ((732056.2 ...
7  MULTILINESTRING ((731964 47...
8  MULTILINESTRING ((732105.2 ...
9  MULTILINESTRING ((732222.9 ...
10 MULTILINESTRING ((732153.8 ...

We can use the ncol() function to count the number of attributes associated with a spatial object too. Note that the geometry is just another column and counts towards the total.

R

ncol(lines_HARV)

OUTPUT

[1] 16

We can view the individual name of each attribute using the names() function in R:

R

names(lines_HARV)

OUTPUT

 [1] "OBJECTID_1" "OBJECTID"   "TYPE"       "NOTES"      "MISCNOTES" 
 [6] "RULEID"     "MAPLABEL"   "SHAPE_LENG" "LABEL"      "BIKEHORSE" 
[11] "RESVEHICLE" "RECMAP"     "Shape_Le_1" "ResVehic_1" "BicyclesHo"
[16] "geometry"  

We could also view just the first 6 rows of attribute values using the head() function to get a preview of the data:

R

head(lines_HARV)

OUTPUT

Simple feature collection with 6 features and 15 fields
Geometry type: MULTILINESTRING
Dimension:     XY
Bounding box:  xmin: 730741.2 ymin: 4712685 xmax: 732232.3 ymax: 4713726
Projected CRS: WGS 84 / UTM zone 18N
  OBJECTID_1 OBJECTID       TYPE             NOTES MISCNOTES RULEID
1         14       48 woods road Locust Opening Rd      <NA>      5
2         40       91   footpath              <NA>      <NA>      6
3         41      106   footpath              <NA>      <NA>      6
4        211      279 stone wall              <NA>      <NA>      1
5        212      280 stone wall              <NA>      <NA>      1
6        213      281 stone wall              <NA>      <NA>      1
           MAPLABEL SHAPE_LENG             LABEL BIKEHORSE RESVEHICLE RECMAP
1 Locust Opening Rd 1297.35706 Locust Opening Rd         Y         R1      Y
2              <NA>  146.29984              <NA>         Y         R1      Y
3              <NA>  676.71804              <NA>         Y         R2      Y
4              <NA>  231.78957              <NA>      <NA>       <NA>   <NA>
5              <NA>   45.50864              <NA>      <NA>       <NA>   <NA>
6              <NA>  198.39043              <NA>      <NA>       <NA>   <NA>
  Shape_Le_1                            ResVehic_1                  BicyclesHo
1 1297.10617    R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
2  146.29983    R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
3  676.71807 R2 - 4WD/High Clearance Vehicles Only Bicycles and Horses Allowed
4  231.78962                                  <NA>                        <NA>
5   45.50859                                  <NA>                        <NA>
6  198.39041                                  <NA>                        <NA>
                        geometry
1 MULTILINESTRING ((730819.2 ...
2 MULTILINESTRING ((732040.2 ...
3 MULTILINESTRING ((732057 47...
4 MULTILINESTRING ((731903.6 ...
5 MULTILINESTRING ((732039.1 ...
6 MULTILINESTRING ((732056.2 ...

Challenge 1: Attributes for Different Spatial Classes

Explore the attributes associated with the point_HARV and aoi_boundary_HARV spatial objects.

  1. How many attributes does each have?

  2. Who owns the site in the point_HARV data object?

  3. Which of the following is NOT an attribute of the point_HARV data object?

     A) Latitude B) County C) Country

  1. To find the number of attributes, we use the ncol() function:

R

ncol(point_HARV)

OUTPUT

[1] 15

R

ncol(aoi_boundary_HARV)

OUTPUT

[1] 2

  2. Ownership information is in a column named Ownership:

R

point_HARV$Ownership

OUTPUT

[1] "Harvard University, LTER"

  3. To see a list of all of the attributes, we can use the names() function:

R

names(point_HARV)

OUTPUT

 [1] "Un_ID"      "Domain"     "DomainName" "SiteName"   "Type"      
 [6] "Sub_Type"   "Lat"        "Long"       "Zone"       "Easting"   
[11] "Northing"   "Ownership"  "County"     "annotation" "geometry"  

“Country” is not an attribute of this object.

Explore Values within One Attribute


We can explore individual values stored within a particular attribute. Comparing attributes to a spreadsheet or a data frame, this is similar to exploring values in a column. We did this with the gapminder dataframe in an earlier lesson. For spatial objects, we can use the same syntax: objectName$attributeName.

We can see the contents of the TYPE field of our lines feature:

R

lines_HARV$TYPE

OUTPUT

 [1] "woods road" "footpath"   "footpath"   "stone wall" "stone wall"
 [6] "stone wall" "stone wall" "stone wall" "stone wall" "boardwalk" 
[11] "woods road" "woods road" "woods road"

To see only unique values within the TYPE field, we can use the unique() function for extracting the possible values of a character variable.

R

unique(lines_HARV$TYPE)

OUTPUT

[1] "woods road" "footpath"   "stone wall" "boardwalk" 

Subset Features

We can use the filter() function from dplyr to select a subset of features from a spatial object in R, just like with data frames.

For example, we might be interested only in features that are of TYPE “footpath”. Once we subset out this data, we can use it as input to other code so that code only operates on the footpath lines.

R

footpath_HARV <- lines_HARV %>%
  filter(TYPE == "footpath")
nrow(footpath_HARV)

OUTPUT

[1] 2

Our subsetting operation reduces the feature count to 2. This means that only two feature lines in our spatial object have the attribute TYPE == "footpath". We can plot only the footpath lines:

R

ggplot() +
  geom_sf(data = footpath_HARV) +
  labs(title = "NEON Harvard Forest Field Site", 
       subtitle = "Footpaths") +
  coord_sf()
Map of the footpaths in the study area.

There are two features in our footpaths subset. Why does the plot look like there is only one feature? Let’s adjust the colors used in our plot. If we have 2 features in our vector object, we can plot each using a unique color by assigning a column name to the color aesthetic (color =). We use the syntax mapping = aes(color = ) to do this. We’ll use the OBJECTID column for the color aesthetic in this example, but because it is a number, we will have to convert it to a categorical value (a factor) first. We can also alter the default line thickness by using the linewidth = parameter, as the default value can be hard to see. Note that linewidth is placed outside of the aes() function, as we are not connecting line thickness to a data variable.

R

ggplot() +
  geom_sf(data = footpath_HARV, 
          mapping = aes(color = factor(OBJECTID)), linewidth = 1.5) +
  labs(color = "Footpath ID", 
       title = "NEON Harvard Forest Field Site", 
       subtitle = "Footpaths") +
  coord_sf()
Map of the footpaths in the study area where each feature is colored differently.

Now, we see that there are in fact two features in our plot!

Challenge 2: Subset Spatial Line Objects Part 1

Subset out all boardwalk from the lines layer and plot it.

First we will save an object with only the boardwalk lines:

R

boardwalk_HARV <- lines_HARV %>%
  filter(TYPE == "boardwalk")

Let’s check how many features there are in this subset:

R

nrow(boardwalk_HARV)

OUTPUT

[1] 1

Now let’s plot that data:

R

ggplot() +
  geom_sf(data = boardwalk_HARV, linewidth = 1.5) +
  labs(title = "NEON Harvard Forest Field Site", 
       subtitle = "Boardwalks") +
  coord_sf()
Map of the boardwalks in the study area.

Challenge 3: Subset Spatial Line Objects Part 2

Subset out all stone wall features from the lines layer and plot it. For each plot, color each feature using a unique color.

First we will save an object with only the stone wall lines and check the number of features:

R

stoneWall_HARV <- lines_HARV %>%
  filter(TYPE == "stone wall")
nrow(stoneWall_HARV)

OUTPUT

[1] 6

Now we can plot the data:

R

ggplot() +
  geom_sf(data = stoneWall_HARV, 
          mapping = aes(color = factor(OBJECTID)), linewidth = 1.5) +
  labs(color = "Wall ID", 
       title = "NEON Harvard Forest Field Site",
       subtitle = "Stonewalls") +
  coord_sf()
Map of the stone walls in the study area where each feature is colored differently.

Customize Plots


In the examples above, ggplot() automatically selected colors for each line based on a default color order. If we don’t like those default colors, we can create a vector of colors - one for each line type.

First we will check how many unique line types there are:

R

unique(lines_HARV$TYPE)

OUTPUT

[1] "woods road" "footpath"   "stone wall" "boardwalk" 

Then we can create a palette of four colors, one for each line type in our vector object.

R

road_colors <- c("blue", "green", "navy", "purple")

We can tell ggplot to use these colors when we plot the data.

R

ggplot() +
  geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
  scale_color_manual(values = road_colors) +
  labs(color = "Road Type", 
       title = "NEON Harvard Forest Field Site", 
       subtitle = "Roads & Trails") +
  coord_sf()
Roads and trails in the area.

Data Tip

Note that ggplot automatically decides which colors belong with which variables, based on the order of the colors in the color vector, and the alphabetical order of the road types. You can specify which colors to use for which road type by creating a named vector:

R

road_colors <- c("stone wall" = "blue", "boardwalk" = "green",
                 "footpath" = "navy", "woods road" = "purple")

Adjust Line Width

We adjusted line width universally earlier. If we want a unique line width for each level or attribute category in our spatial object, we can use the same syntax that we used for colors, above.

We already know that we have four different TYPEs in the lines_HARV object, so we will set four different line widths.

R

line_widths <- c(1, 2, 3, 4)

We can use those line widths when we plot the data.

R

ggplot() +
  geom_sf(data = lines_HARV, 
          mapping = aes(color = TYPE, linewidth = TYPE)) +
  scale_color_manual(values = road_colors) +
  scale_linewidth_manual(values = line_widths) +
  labs(color = "Road Type",
       linewidth = "Road Type",
       title = "NEON Harvard Forest Field Site",
       subtitle = "Roads & Trails - Line width varies") +
  coord_sf()
Roads and trails in the area demonstrating how to use different line thickness and colors.

Challenge 4: Plot Line Width by Attribute

In the example above, we set the line widths to be 1, 2, 3, and 4. Because R orders alphabetically by default, this gave us a plot where woods roads (the last type) were the thickest and boardwalks were the thinnest.

Let’s create another plot where we show the different line types with the following thicknesses:

  1. woods road width = 6
  2. boardwalks width = 1
  3. footpath width = 3
  4. stone wall width = 2

First we need to look at the unique values of TYPE so we know which names to give our line widths:

R

unique(lines_HARV$TYPE)

OUTPUT

[1] "woods road" "footpath"   "stone wall" "boardwalk" 

We then can create our line_width vector setting each of the levels to the desired thickness.

R

line_width <- c("woods road" = 6, "footpath" = 3, 
                "stone wall" = 2, "boardwalk" = 1)

Now we can create our plot.

R

ggplot() +
  geom_sf(data = lines_HARV, mapping = aes(linewidth = TYPE)) +
  scale_linewidth_manual(values = line_width) +
  labs(title = "NEON Harvard Forest Field Site",
       subtitle = "Roads & Trails - Line width varies") +
  coord_sf()
Roads and trails in the area with a different line thickness for each type of path.

Add Plot Legend

When we plot a vector object and specify colors, line widths, etc. using the mapping = aes() function, R will automatically add a legend to our plot. By default, the legend will show up on the right side of the plot and will include the values that we supplied to the ggplot object. We can adjust the way our legends look in a variety of ways.

  • We can specify the location of our legend using a keyword such as bottom, top, left, or right.
  • We can change the font size
  • We can change the little things, like the color of the legend box outline
  • And so much more!

Pretty much anything you would want to change on a plot, you can! We do this using the theme() function in R.

Let’s start with our basic plot. We will use the road_colors object that we created above to color the legend. We’ll also move the legend to the bottom by specifying theme(legend.position = "bottom").

R

ggplot() +
  geom_sf(data = lines_HARV, 
          mapping = aes(color = TYPE), linewidth = 1.5) +
  scale_color_manual(values = road_colors) +
  labs(color = "Road Type", 
       title = "NEON Harvard Forest Field Site",
       subtitle = "Roads & Trails - Default Legend") +
  coord_sf() + 
  theme(legend.position = "bottom")
Roads and trails in the study area using thicker lines than the previous figure.

Now, let’s change the appearance of our legend by manually setting different parameters using the theme() function.

We can start by changing the font size in the legend. The way we would do this is by adding the following line to our plot code: theme(legend.text = element_text(size = 20))

This may seem a little confusing, but there is a method to the madness. First, we note what we’re trying to change (the legend.text argument). Next, we have to specify what kind of object it needs to be. Since we’re altering the legend text, it needs to be a text element, so we use the element_text() function. Within this function, we can specify what we want to change from the default. In this case, it’s the size of the font (size = 20), but we could do this for things like color, the face (plain, italic, bold, bold.italic), the angle of rotation for the text, etc.

R

ggplot() +
  geom_sf(data = lines_HARV, 
          mapping = aes(color = TYPE), linewidth = 1.5) +
  scale_color_manual(values = road_colors) +
  labs(color = "Road Type", 
       title = "NEON Harvard Forest Field Site",
       subtitle = "Roads & Trails - Default Legend") +
  coord_sf() + 
  theme(legend.position = "bottom", 
        legend.text = element_text(size = 20)) 
Map of the paths in the study area with a large legend font.

We can add a box around the legend by altering legend.box.background to have a black border. Since the legend is a rectangle of sorts, we use element_rect() to adjust these parameters.

So our new line would be:

theme(legend.box.background = element_rect(color = "black"))

R

ggplot() +
  geom_sf(data = lines_HARV, 
          mapping = aes(color = TYPE), linewidth = 1.5) +
  scale_color_manual(values = road_colors) +
  labs(color = "Road Type", 
       title = "NEON Harvard Forest Field Site",
       subtitle = "Roads & Trails - Default Legend") +
  coord_sf() + 
  theme(legend.position = "bottom", 
        legend.text = element_text(size = 20), 
        legend.box.background = element_rect(color = "black")) 
Map of the paths in the study area with a large legend font and an outline around the legend.

Data Tip

You can modify the default R color palette using the palette() function. For example, palette(rainbow(6)) or palette(terrain.colors(6)). You can reset the palette colors using palette("default")!
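
For example, here is a quick sketch you can try in the console:

R

palette(terrain.colors(6))  # use terrain colors as the default base-R palette
palette()                   # view the current palette colors
palette("default")          # reset to the default palette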

You can also use colorblind-friendly palettes such as those in the viridis package.

Challenge 5: Plot Lines by Attribute

Create a plot that emphasizes only roads where bicycles and horses are allowed. To emphasize this, make the lines where bicycles are not allowed THINNER than the roads where bicycles are allowed. NOTE: this attribute information is located in the lines_HARV$BicyclesHo attribute.

Be sure to add a title and legend to your map. You might consider a color palette that has all bike/horse-friendly roads displayed in a bright color. All other lines can be black.

First we explore the BicyclesHo attribute to learn the values that correspond to the roads we need.

R

lines_HARV %>%
  pull(BicyclesHo) %>%
  unique()

OUTPUT

[1] "Bicycles and Horses Allowed"     NA                               
[3] "DO NOT SHOW ON REC MAP"          "Bicycles and Horses NOT ALLOWED"

Now, we can create a data frame with only those roads where bicycles and horses are allowed.

R

lines_showHarv <-
  lines_HARV %>%
  filter(BicyclesHo == "Bicycles and Horses Allowed")

Finally, we plot the needed roads after setting them to magenta and a thicker line width.

R

ggplot() +
  geom_sf(data = lines_HARV) +
  geom_sf(data = lines_showHarv, 
          mapping = aes(color = BicyclesHo), linewidth = 2) +
  scale_color_manual(values = "magenta") +
  labs(title = "NEON Harvard Forest Field Site",
       subtitle = "Roads Where Bikes and Horses Are Allowed", 
       color = NULL) +
  coord_sf()
Roads and trails in the area highlighting paths where horses and bikes are allowed.

Challenge 6: Plot Polygon by Attribute

Create a map of the state boundaries in the United States using the data located in your downloaded data folder: NEON-DS-Site-Layout-Files/US-Boundary-Layers\US-State-Boundaries-Census-2014. Note that this dataset includes Z and M coordinates, which we do not need. Use the st_zm() function to drop these dimensions before plotting.

When plotting, apply a line color to each state using its region value. Be sure to add a legend to your plot as well.

First we read in the data and check how many levels there are in the region column:

R

state_boundary_US <-
st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp") %>%
# NOTE: We need neither Z nor M coordinates!
st_zm()

OUTPUT

Reading layer `US-State-Boundaries-Census-2014' from data source 
  `/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 58 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

R

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp"

state_boundary_US$region <- as.factor(state_boundary_US$region)
levels(state_boundary_US$region)

OUTPUT

[1] "Midwest"   "Northeast" "Southeast" "Southwest" "West"     

Next we set a color vector with that many items:

R

colors <- c("purple", "springgreen", "yellow", "brown", "navy")

Now we can create our plot:

R

ggplot() +
  geom_sf(data = state_boundary_US, 
          mapping = aes(color = region), size = 1) +
  scale_color_manual(values = colors) +
  labs(title = "Contiguous U.S. State Boundaries") +
  coord_sf()
Map of the continental United States where the state lines are colored by region.

Key Points

  • Spatial objects in sf are similar to standard data frames and can be manipulated using the same functions.
  • Almost any feature of a plot can be customized using the various functions and options in the ggplot2 package.

Content from Plot Multiple Vector and Raster Layers


Last updated on 2024-08-19 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How can I create map compositions with custom legends using ggplot?
  • How can I plot raster and vector data together?

Objectives

  • Plot multiple vector layers in the same plot.
  • Apply custom symbols to spatial objects in a plot.
  • Create a multi-layered plot with raster and vector data.

This episode builds upon previous episodes that worked with vector layers in R and explores how to plot multiple vector layers. It also covers how to plot raster and vector data together on the same plot.

Load the Data


To work with vector data in R, we can use the sf package. To work with raster data in R, we can use the terra package. Make sure that you have these packages loaded.

We will continue to work with the three ESRI shapefiles that we loaded in the Open and Plot Vector Layers in R episode and the raster file we used in previous episodes.

R

library(terra)
library(ggplot2)
library(dplyr)
library(sf)

R

# Load the data
aoi_boundary_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"

lines_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp"

point_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"

CHM_HARV <- rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif")

# If you are getting an error, check your file path: 
# You might need change your file path to: 
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif"

# Load the color palette from earlier lessons
road_colors <- c("blue", "green", "navy", "purple")

Plotting Multiple Vector Layers


In the previous episode, we learned how to plot information from a single vector layer and do some plot customization including adding a custom legend. However, what if we want to create a more complex plot with many vector layers and unique symbols that need to be represented clearly in a legend?

Now, let’s create a plot that combines our tower location (point_HARV), site boundary (aoi_boundary_HARV) and roads (lines_HARV) spatial objects. We will need to build a custom legend as well.

To begin, we will create a plot with the site boundary as the first layer. Then layer the tower location and road data on top using +.

R

ggplot() +
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
  geom_sf(data = point_HARV) +
  labs(title = "NEON Harvard Forest Field Site") +
  coord_sf()

Next, let’s build a custom legend using the symbology (the colors and symbols) that we used to create the plot above.

R

ggplot() +
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
  geom_sf(data = point_HARV) +
  scale_color_manual(values = road_colors) + 
  labs(title = "NEON Harvard Forest Field Site") +
  coord_sf()

What if we wanted to have a legend for the tower location (the point)? We can trick R into adding the legend for us, by specifying an aesthetic and value to plot within the mapping = aes() function for that layer. We’ve already used the color aesthetic in the plot for the road colors, so we’ll have to use another aesthetic type for getting the tower location to show up. In this example, let’s use the fill aesthetic (note that there are also scales we can apply to the point shape and alpha [transparency]).

We can do that by changing the line of code that plots the point: geom_sf(data = point_HARV, aes(fill = "Tower location"))

This tells the plot that we want to color the “fill” of the point to be dictated by the “Tower location” label. We can pair this with a custom fill palette so that it’s plotted the way we want. For example:

scale_fill_manual(values = c("Tower location" = "black"))

This tells the plot that anything filled with the “Tower location” tag will actually be colored black.

R

ggplot() +
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
  geom_sf(data = point_HARV, mapping = aes(fill = "Tower location")) +
  scale_color_manual(values = road_colors) + 
  scale_fill_manual(values = c("Tower location" = "black")) + 
  labs(title = "NEON Harvard Forest Field Site") +
  coord_sf()

Now let's adjust the legend titles by passing a name to the respective color and fill palettes. We can do this within the scale_color_manual() and scale_fill_manual() functions, or within the labs() function. If we don't want a title for either of those values, we can assign the name to NULL.

R

ggplot() +
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
  geom_sf(data = point_HARV, mapping = aes(fill = "Tower location")) +
  scale_color_manual(values = road_colors) + 
  scale_fill_manual(values = c("Tower location" = "black")) + 
  labs(title = "NEON Harvard Forest Field Site", 
       color = "Line Type", 
       fill = NULL) +
  coord_sf()

Finally, it might be better if the point were displayed with a more distinctive symbol. We can customize this using the shape parameter in our call to geom_sf: 16 is a point symbol, 15 is a box.

Data Tip

To view a short list of shape symbols, type ?pch into the R console.
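
For a quick visual reference, you can also draw the symbols yourself. A small sketch (not part of the lesson data):

R

# Sketch: draw base-R point shapes 0-25, labeled with their codes
plot(0:25, rep(1, 26), pch = 0:25, ylim = c(0.5, 1.5),
     axes = FALSE, xlab = "", ylab = "")
text(0:25, 1.2, labels = 0:25, cex = 0.7)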

R

ggplot() +
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
  geom_sf(data = point_HARV, mapping = aes(fill = "Tower location"), shape = 15) +
  scale_color_manual(values = road_colors) + 
  scale_fill_manual(values = c("Tower location" = "black")) + 
  labs(title = "NEON Harvard Forest Field Site", 
       color = "Line Type", 
       fill = NULL) +
  coord_sf()

Challenge 1: Plot Polygon by Attribute

  1. Using the NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp ESRI shapefile, create a map of study plot locations, with each point colored by the soil type (soilTypeOr). How many different soil types are there at this particular field site? Overlay this layer on top of the lines_HARV layer (the roads). Create a custom legend that applies line symbols to lines and point symbols to the points.

  2. Modify the plot above. Tell R to plot each point using a different symbol (shape value).

First we need to read in the data and see how many unique soils are represented in the soilTypeOr attribute.

R

plot_locations <-
  st_read("data/NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp")

OUTPUT

Reading layer `PlotLocations_HARV' from data source 
  `/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 21 features and 25 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 731405.3 ymin: 4712845 xmax: 732275.3 ymax: 4713846
Projected CRS: WGS 84 / UTM zone 18N

R

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp"

unique(plot_locations$soilTypeOr)

OUTPUT

[1] "Inceptisols" "Histosols"  

Next we can create a new color palette with one color for each soil type.

R

blue_orange <- c("cornflowerblue", "darkorange")

Finally, we will create our plot.

R

ggplot() +
  geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
  geom_sf(data = plot_locations, mapping = aes(fill = soilTypeOr),
          shape = 21) +
  scale_color_manual(values = road_colors) +
  scale_fill_manual(values = blue_orange) +
  labs(title = "NEON Harvard Forest Field Site", 
       color = "Line Type", 
       fill = "Soil Type") +
  coord_sf()

If we want each soil type to be shown with a different symbol, we can supply multiple shape values to the values argument of scale_shape_manual().

This gets a little tricky, though, because now the plot locations' soil type has two legends: one for the fill color and one for the shape. We can override the aesthetics of the legends so that we only display one legend for the plot soil type - a legend that includes both the right shape and the right colors. We do that by removing one of the legends from the plot and altering the legend we want to keep.

Let’s change the scale_shape_manual() to include the appropriate fill colors for the points. To do this, we will use guide = guide_legend() within the scale_shape_manual() function. We override the aesthetics by using override.aes and providing it with a list of things to override.

Putting it all together results in a line of code that looks like this: scale_shape_manual(values = c(21, 22), guide = guide_legend(override.aes = list(fill = blue_orange)))

Here, we’re specifying that we want to keep the legend for the shapes, but we want the legend (the guide) to look different than the default. We want to override the default fill values (which is black) to the colors in the blue_orange vector. This will make the legend match the actual points on the plot.

It’s a lot to take in, we know, but it might come in handy for our own research!

R

ggplot() +
  geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
  geom_sf(data = plot_locations, mapping = aes(fill = soilTypeOr, shape = soilTypeOr)) +
  scale_color_manual(values = road_colors) +
  scale_fill_manual(values = blue_orange, guide = "none") +
  scale_shape_manual(values = c(21, 22), 
                     guide = guide_legend(override.aes = list(fill = blue_orange))) + 
  labs(title = "NEON Harvard Forest Field Site", 
       color = "Line Type", 
       shape = "Soil Type") +
  coord_sf()

Plotting Raster & Vector Data Together


You can plot vector data layered on top of raster data using the ggplot2 package or the tmap package.

We’ll first go into how to do this using ggplot, and then we’ll create a similar plot using tmap.

Together, let’s create a plot that uses the NEON AOI Canopy Height Model data/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif (or data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif depending on your set up) as a base layer. On top of the CHM, we will add the study site AOI, roads, and the tower location.

Remember for ggplot to be able to plot raster data, we first need to convert it to a data frame.

R

CHM_HARV_df <- as.data.frame(CHM_HARV, xy = TRUE)

Now we can add it to the plot as a geom_raster() object:

R

ggplot() + 
  geom_raster(data = CHM_HARV_df, 
              mapping = aes(x = x, y = y, fill = HARV_chmCrop)) + 
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = lines_HARV) +
  geom_sf(data = point_HARV) +
  labs(title = "NEON Harvard Forest Field Site w/ Canopy Height Model", 
       fill = "Canopy Height") +
  coord_sf()

Now we can make a similar plot using tmap.

If you haven’t downloaded the tmap package yet, you can do so by going to the Packages tab in your R console and installing a new package, or you can type the following code into your R console:

install.packages("tmap")

Once downloaded, remember to load tmap into your environment using:

R

library(tmap)

The syntax for tmap is to first specify a layer to plot and then to specify the kind of layer it is. For example, if you wanted to plot a raster first, and then a vector layer on top of it, you would write something like this:

R

tm_shape(raster_object) + # first we specify the object to plot first
  tm_raster() + # then we specify the kind of object it is
  tm_shape(vector_object) + # then we specify the object to plot next
  tm_sf() # then we specify the kind of object it is

You can specify things like color and shape within the tm_sf() function as col = "black" or shape = 8 for example.

R

# Create the plot
tm_shape(CHM_HARV) + 
  tm_raster() + 
  tm_shape(lines_HARV) + 
  tm_sf() + 
  tm_shape(aoi_boundary_HARV) + 
  tm_sf(col = "grey20") + 
  tm_shape(point_HARV) + 
  tm_sf(col = "black") 

OUTPUT

SpatRaster object downsampled to 898 by 1115 cells.

Key Points

  • Use the + operator to add multiple layers to a ggplot.
  • Multi-layered plots can combine raster and vector datasets. tmap does this best (see Challenge 2)!
  • Use the show.legend argument to set legend symbol types.
  • Use the scale_fill_manual() function to set legend colors.

Content from Handling Spatial Projection & CRS


Last updated on 2024-08-19 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • What do I do when vector data don’t line up?

Objectives

  • Plot vector objects with different CRSs in the same plot.

In an earlier episode we learned how to handle a situation where you have two different files with raster data in different projections. Now we will apply those same principles to working with vector data. We will create a base map of our study site using United States state and country boundary information accessed from the United States Census Bureau. We will learn how to map vector data that are in different CRSs and thus don’t line up on a map.

We will continue to work with some of the ESRI shapefiles that we loaded in the previous episodes.

R

library(sf)
library(ggplot2)
library(dplyr)

R

point_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"

Working With Spatial Data From Different Sources


We often need to gather spatial datasets from different sources and/or data that cover different spatial extents. These data are often in different Coordinate Reference Systems (CRSs).

Some reasons for data being in different CRSs include:

  1. The data are stored in a particular CRS convention used by the data provider (for example, a government agency).
  2. The data are stored in a particular CRS that is customized to a region. For instance, many states in the US prefer to use a State Plane projection customized for that state.
Maps of the United States using data in different projections. Source: opennews.org, from: https://media.opennews.org/cache/06/37/0637aa2541b31f526ad44f7cb2db7b6c.jpg

Notice the differences in shape associated with each different projection. These differences are a direct result of the calculations used to “flatten” the data onto a 2-dimensional map. Often data are stored purposefully in a particular projection that optimizes the relative shape and size of surrounding geographic boundaries (states, counties, countries, etc.).

In this episode we will learn how to identify and manage spatial data in different projections. We will learn how to reproject the data so that they are in the same projection to support plotting / mapping. Note that these skills are also required for any geoprocessing / spatial analysis. Data need to be in the same CRS to ensure accurate results.

We will continue to use the sf package in this episode.

Import US Boundaries - Census Data


There are many good sources of boundary base layers that we can use to create a basemap. Some R packages even have these base layers built in to support quick and efficient mapping. In this episode, we will use boundary layers for the contiguous United States, provided by the United States Census Bureau. It is useful to have vector layers in ESRI's shapefile format to work with because we can add additional attributes to them if need be - for project-specific mapping.

Read US Boundary File


We will use the st_read() function to import the /US-Boundary-Layers/US-State-Boundaries-Census-2014 layer into R. This layer contains the boundaries of all contiguous states in the U.S. Please note that these data have been modified and reprojected from the original data downloaded from the Census website to support the learning goals of this episode. We only care about the x and y coordinates, not the z or m dimensions, so we will use st_zm() to drop the coordinates that we do not need.

R

state_boundary_US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp") %>%
  st_zm()

OUTPUT

Reading layer `US-State-Boundaries-Census-2014' from data source 
  `/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 58 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

R

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp"

Next, let’s plot the U.S. states data:

R

ggplot() +
  geom_sf(data = state_boundary_US) +
  labs(title = "Map of Contiguous US State Boundaries") +
  coord_sf()

U.S. Boundary Layer


We can add a boundary layer of the United States to our map - to make it look nicer. We will import NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.

R

country_boundary_US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp") %>%
  st_zm()

OUTPUT

Reading layer `US-Boundary-Dissolved-States' from data source 
  `/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XYZ
Bounding box:  xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

R

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp"

If we specify a thicker line width using linewidth = 0.5 for the border layer, it will make our map pop! We will also manually set the colors of the state boundaries and country boundaries.

R

ggplot() +
  geom_sf(data = state_boundary_US, color = "gray60") +
  geom_sf(data = country_boundary_US, color = "black", alpha = 0.25, linewidth = 0.5) +
  labs(title = "Map of Contiguous US State Boundaries") +
  coord_sf()

Next, let’s add the location of a flux tower where our study area is. As we are adding these layers, take note of the CRS of each object. First let’s look at the CRS of our tower location object:

R

st_crs(point_HARV)$proj4string

OUTPUT

[1] "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"

Our proj4 string for point_HARV specifies the UTM projection as follows:

+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs

  • proj=utm: the projection is UTM; UTM has several zones.
  • zone=18: the zone is 18
  • datum=WGS84: the datum WGS84 (the datum refers to the 0,0 reference for the coordinate system used in the projection)
  • units=m: the units for the coordinates are in METERS.

Note that the zone is unique to the UTM projection. Not all CRSs will have a zone.

Let’s check the CRS of our state and country boundary objects:

R

st_crs(state_boundary_US)$proj4string

OUTPUT

[1] "+proj=longlat +datum=WGS84 +no_defs"

R

st_crs(country_boundary_US)$proj4string

OUTPUT

[1] "+proj=longlat +datum=WGS84 +no_defs"

Our proj4 string for state_boundary_US and country_boundary_US specifies the geographic (lat/long) coordinate system as follows:

+proj=longlat +datum=WGS84 +no_defs

  • proj=longlat: the data are in a geographic (latitude and longitude) coordinate system
  • datum=WGS84: the datum WGS84 (the datum refers to the 0,0 reference for the coordinate system used in the projection)
  • no_defs: ensures that no defaults are used, but this is now obsolete

Note that there are no specified units above. This is because this geographic coordinate reference system is in latitude and longitude which is most often recorded in decimal degrees.

Data Tip

The last portion of each proj4 string could potentially be something like +towgs84=0,0,0. This is a conversion factor that is used if a datum transformation is required. We will not deal with datums in this episode series.

CRS Units - View Object Extent


Next, let’s view the extent or spatial coverage for the point_HARV spatial object compared to the state_boundary_US object.

First we’ll look at the extent for our study site:

R

st_bbox(point_HARV)

OUTPUT

     xmin      ymin      xmax      ymax 
 732183.2 4713265.0  732183.2 4713265.0 

And then the extent for the state boundary data.

R

st_bbox(state_boundary_US)

OUTPUT

      xmin       ymin       xmax       ymax 
-124.72584   24.49813  -66.94989   49.38436 

Note the difference in the units for each object. The extent for state_boundary_US is in latitude and longitude, which yields smaller numbers representing decimal degree units. Our tower location point is in UTM and is represented in meters.

Reproject Vector Data or No?


We saw in an earlier episode that when working with raster data in different CRSs, we needed to convert all objects to the same CRS. We can do the same thing with our vector data - however, we don’t need to! When using the ggplot2 package, ggplot automatically converts all objects to the same CRS before plotting. This means we can plot our three data sets together without doing any conversion:

R

ggplot() +
  geom_sf(data = state_boundary_US, color = "gray60") +
  geom_sf(data = country_boundary_US, linewidth = 0.5, alpha = 0.25, color = "black") +
  geom_sf(data = point_HARV, shape = 19, color = "purple") +
  labs(title = "Map of Contiguous US State Boundaries") +
  coord_sf()
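
Data Tip

ggplot2 takes care of CRS differences when plotting, but for geoprocessing tasks (cropping, buffering, spatial joins) you will usually need all layers in one CRS first. A minimal sketch using st_transform(), assuming we want everything in the tower's UTM CRS (the object name is just for illustration):

R

# Reproject the state boundaries to match the tower's CRS
state_boundary_UTM <- st_transform(state_boundary_US, st_crs(point_HARV))
st_crs(state_boundary_UTM)$proj4string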

Challenge 1: Plot Multiple Layers of Spatial Data

Create a map of the North Eastern United States as follows:

  1. Import and plot Boundary-US-State-NEast.shp. Adjust line width as necessary.
  2. Layer the Fisher Tower (in the NEON Harvard Forest site) point location point_HARV onto the plot.
  3. Add a title.
  4. Add a legend that shows both the state boundary and the Tower location point.

R

NE.States.Boundary.US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/Boundary-US-State-NEast.shp") %>%
  st_zm()

OUTPUT

Reading layer `Boundary-US-State-NEast' from data source 
  `/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/Boundary-US-State-NEast.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 12 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: -80.51989 ymin: 37.91685 xmax: -66.9499 ymax: 47.45716
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

R

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/Boundary-US-State-NEast.shp"

ggplot() +
    geom_sf(data = NE.States.Boundary.US, mapping = aes(color = "State Boundary")) +
    scale_color_manual(values = c("State Boundary" = "gray18")) +
    geom_sf(data = point_HARV, mapping = aes(shape = "Fisher Tower"), 
            color = "purple") +
    scale_shape_manual(values = c("Fisher Tower" = 19)) +
    labs(color = NULL, 
         shape = NULL, 
         title = "Fisher Tower location") +
    coord_sf()

Key Points

  • ggplot2 automatically converts all objects in a plot to the same CRS.
  • Still be aware of the CRS and extent for each object.

Content from Convert from .csv to a Vector Layer


Last updated on 2024-08-19 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How can I import CSV files as vector layers in R?

Objectives

  • Import .csv files containing x,y coordinate locations into R as a data frame.
  • Convert a data frame to a spatial object.
  • Export a spatial object to a text file.

This episode will review how to import spatial points stored in .csv (Comma Separated Value) format into R as an sf spatial object. We will also reproject data imported from an ESRI shapefile format and export the reprojected data as an ESRI shapefile.

Let’s load our packages and data for this episode:

R

library(ggplot2)
library(dplyr)
library(sf)

R

aoi_boundary_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"

country_boundary_US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp"

point_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"

Spatial Data in Text Format


The HARV_PlotLocations.csv file contains x, y (point) locations for study plots where NEON collects data on vegetation and other ecological metrics. We would like to:

  • Create a map of these plot locations.
  • Export the data in an ESRI shapefile format to share with our colleagues. This shapefile can be imported into most GIS software.
  • Create a map showing vegetation height with plot locations layered on top.

Spatial data are sometimes stored in a text file format (.txt or .csv). If the text file has an associated x and y location column, then we can convert it into an sf spatial object. The sf object allows us to store both the x,y values that represent the coordinate location of each point and the associated attribute data - or columns describing each feature in the spatial object.

We will continue using the sf package in this episode.

Import .csv


To begin let’s import a .csv file that contains plot coordinate x, y locations at the NEON Harvard Forest Field Site (HARV_PlotLocations.csv) and look at the structure of that new object:

R

plot_locations_HARV <-
  read.csv("data/NEON-DS-Site-Layout-Files/HARV/HARV_PlotLocations.csv")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_PlotLocations.csv"

str(plot_locations_HARV)

OUTPUT

'data.frame':	21 obs. of  16 variables:
 $ easting   : num  731405 731934 731754 731724 732125 ...
 $ northing  : num  4713456 4713415 4713115 4713595 4713846 ...
 $ geodeticDa: chr  "WGS84" "WGS84" "WGS84" "WGS84" ...
 $ utmZone   : chr  "18N" "18N" "18N" "18N" ...
 $ plotID    : chr  "HARV_015" "HARV_033" "HARV_034" "HARV_035" ...
 $ stateProvi: chr  "MA" "MA" "MA" "MA" ...
 $ county    : chr  "Worcester" "Worcester" "Worcester" "Worcester" ...
 $ domainName: chr  "Northeast" "Northeast" "Northeast" "Northeast" ...
 $ domainID  : chr  "D01" "D01" "D01" "D01" ...
 $ siteID    : chr  "HARV" "HARV" "HARV" "HARV" ...
 $ plotType  : chr  "distributed" "tower" "tower" "tower" ...
 $ subtype   : chr  "basePlot" "basePlot" "basePlot" "basePlot" ...
 $ plotSize  : int  1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 ...
 $ elevation : num  332 342 348 334 353 ...
 $ soilTypeOr: chr  "Inceptisols" "Inceptisols" "Inceptisols" "Histosols" ...
 $ plotdim_m : int  40 40 40 40 40 40 40 40 40 40 ...

We now have a data frame that contains 21 locations (rows) and 16 variables (attributes). Note that all of our character data was imported into R as character (text) data. Next, let’s explore the dataframe to determine whether it contains columns with coordinate values. If we are lucky, our .csv will contain columns labeled:

  • “X” and “Y” OR
  • Latitude and Longitude OR
  • easting and northing (UTM coordinates)

Let’s check out the column names of our dataframe.

R

names(plot_locations_HARV)

OUTPUT

 [1] "easting"    "northing"   "geodeticDa" "utmZone"    "plotID"    
 [6] "stateProvi" "county"     "domainName" "domainID"   "siteID"    
[11] "plotType"   "subtype"    "plotSize"   "elevation"  "soilTypeOr"
[16] "plotdim_m" 

Identify X,Y Location Columns


Our column names include several fields that might contain spatial information. The plot_locations_HARV$easting and plot_locations_HARV$northing columns contain coordinate values. We can confirm this by looking at the first six rows of our data.

R

head(plot_locations_HARV$easting)

OUTPUT

[1] 731405.3 731934.3 731754.3 731724.3 732125.3 731634.3

R

head(plot_locations_HARV$northing)

OUTPUT

[1] 4713456 4713415 4713115 4713595 4713846 4713295

We have coordinate values in our data frame. In order to convert our data frame to an sf object, we also need to know the CRS associated with those coordinate values.

There are several ways to figure out the CRS of spatial data in text format.

  1. We can check the file metadata in hopes that the CRS was recorded in the data.
  2. We can explore the file itself to see if CRS information is embedded in the file header or somewhere in the data columns.

Following the easting and northing columns, there is a geodeticDa and a utmZone column. These appear to contain CRS information (datum and projection). Let’s view those next.

R

head(plot_locations_HARV$geodeticDa)

OUTPUT

[1] "WGS84" "WGS84" "WGS84" "WGS84" "WGS84" "WGS84"

R

head(plot_locations_HARV$utmZone)

OUTPUT

[1] "18N" "18N" "18N" "18N" "18N" "18N"

It is not typical to store CRS information in a column. But this particular file contains CRS information this way. The geodeticDa and utmZone columns contain the information that helps us determine the CRS:

  • geodeticDa: WGS84 – this is geodetic datum WGS84
  • utmZone: 18

In the previous episode we learned about the components of a proj4 string. We have everything we need to assign a CRS to our data frame.

To create the proj4 string associated with UTM Zone 18 WGS84, we can look up the projection on the Spatial Reference website, which contains a list of CRS formats for each projection. From there, we can extract the proj4 string for UTM Zone 18N WGS84.
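
If you already know the EPSG code or the proj4 string, you can also build a crs object directly with st_crs(). A quick sketch (EPSG 32618 is WGS 84 / UTM zone 18N; the object names are just for illustration):

R

# Build a CRS object directly from an EPSG code
utm18nCRS_from_epsg <- st_crs(32618)

# Or from a proj4 string
utm18nCRS_from_proj4 <- st_crs("+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs")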

However, if we have other data in the UTM Zone 18N projection, it’s much easier to use the st_crs() function to extract the CRS in proj4 format from that object and assign it to our new spatial object. We’ve seen this CRS before with our Harvard Forest study site (point_HARV).

R

st_crs(point_HARV)

OUTPUT

Coordinate Reference System:
  User input: WGS 84 / UTM zone 18N 
  wkt:
PROJCRS["WGS 84 / UTM zone 18N",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["UTM zone 18N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-75,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    ID["EPSG",32618]]

The output above shows that the points vector layer is in UTM zone 18N. We can thus use the CRS from that spatial object to convert our non-spatial dataframe into an sf object.

Next, let’s create a crs object that we can use to define the CRS of our sf object when we create it.

R

utm18nCRS <- st_crs(point_HARV)
utm18nCRS

OUTPUT

Coordinate Reference System:
  User input: WGS 84 / UTM zone 18N 
  wkt:
PROJCRS["WGS 84 / UTM zone 18N",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["UTM zone 18N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-75,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    ID["EPSG",32618]]

R

class(utm18nCRS)

OUTPUT

[1] "crs"

.csv to sf object


Next, let’s convert our dataframe into an sf object. To do this, we need to specify:

  1. The columns containing X (easting) and Y (northing) coordinate values
  2. The CRS that the coordinate values represent (units are included in the CRS) - stored in our utm18nCRS object.

We will use the st_as_sf() function to perform the conversion.

R

plot_locations_sp_HARV <- st_as_sf(plot_locations_HARV,
                                   coords = c("easting", "northing"),
                                   crs = utm18nCRS)

We should double check the CRS to make sure it is correct and matches the CRS of point_HARV.

R

st_crs(plot_locations_sp_HARV) == st_crs(point_HARV)

OUTPUT

[1] TRUE

Plot Spatial Object


Now that we have a spatial R object, we can plot it.

R

ggplot() +
  geom_sf(data = plot_locations_sp_HARV) +
  labs(title = "Map of Plot Locations") + 
  coord_sf()

Plot Extent


In Open and Plot Vector Layers in R we learned about spatial object extent. When we plot several spatial layers in R using ggplot, all of the layers of the plot are considered in setting the boundaries of the plot. To show this, let’s plot our country_boundary_US object.

R

ggplot() + 
  geom_sf(data = country_boundary_US) + 
  geom_sf(data = plot_locations_sp_HARV) + 
  labs(title = "Country Boundary Plot") + 
  coord_sf()

When we plot the two layers together, ggplot sets the plot boundaries so that they are large enough to include all of the data included in all of the layers. That’s really handy!
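
Data Tip

ggplot sizes the plot to fit all layers, but you can also set the boundaries yourself with the xlim and ylim arguments of coord_sf(). A quick sketch, using illustrative longitude/latitude limits roughly covering the north-eastern US:

R

# Zoom the previous plot using manual limits (values are illustrative)
ggplot() +
  geom_sf(data = country_boundary_US) +
  geom_sf(data = plot_locations_sp_HARV) +
  coord_sf(xlim = c(-80, -65), ylim = c(37, 48))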

Challenge 1: Import & Plot Additional Points

We want to add two phenology plots to our existing map of vegetation plot locations.

Import the .csv: HARV/HARV_2NewPhenPlots.csv into R and do the following:

  1. Find the X and Y coordinate locations. Which value is X and which value is Y?
  2. These data were collected in a geographic coordinate system (WGS84). Convert the dataframe into an sf object.
  3. Plot the new points with the plot location points from above. Be sure to add a legend. Use a different symbol for the 2 new points!

If you have extra time, feel free to add roads and other layers to your map!

  1. First we will read in the new csv file and look at the data structure.

R

newplot_locations_HARV <-
  read.csv("data/NEON-DS-Site-Layout-Files/HARV/HARV_2NewPhenPlots.csv")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_2NewPhenPlots.csv"

str(newplot_locations_HARV)

OUTPUT

'data.frame':	2 obs. of  13 variables:
 $ decimalLat: num  42.5 42.5
 $ decimalLon: num  -72.2 -72.2
 $ country   : chr  "unitedStates" "unitedStates"
 $ stateProvi: chr  "MA" "MA"
 $ county    : chr  "Worcester" "Worcester"
 $ domainName: chr  "Northeast" "Northeast"
 $ domainID  : chr  "D01" "D01"
 $ siteID    : chr  "HARV" "HARV"
 $ plotType  : chr  "tower" "tower"
 $ subtype   : chr  "phenology" "phenology"
 $ plotSize  : int  40000 40000
 $ plotDimens: chr  "200m x 200m" "200m x 200m"
 $ elevation : num  358 346

  2. The US boundary data we worked with previously is in a geographic WGS84 CRS. We can use that data to establish a CRS for this data. First we will extract the CRS from the country_boundary_US object and confirm that it is WGS84.

R

geogCRS <- st_crs(country_boundary_US)
geogCRS

OUTPUT

Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

Then we will convert our new data to a spatial dataframe, using the geogCRS object as our CRS.

R

newPlot.Sp.HARV <- st_as_sf(newplot_locations_HARV,
                            coords = c("decimalLon", "decimalLat"),
                            crs = geogCRS)

Next we’ll confirm that the CRS for our new object is correct.

R

st_crs(newPlot.Sp.HARV)

OUTPUT

Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

We will be adding these new data points to the plot we created before. The data for the earlier plot was in UTM. Since we’re using ggplot, it will reproject the data for us.

  3. Now we can create our plot.

R

ggplot() +
  geom_sf(data = plot_locations_sp_HARV, color = "orange") +
  geom_sf(data = newPlot.Sp.HARV, color = "lightblue") +
  labs(title = "Map of All Plot Locations") + 
  coord_sf()

Export to an ESRI shapefile


We can write an R spatial object to an ESRI shapefile using the st_write function in sf. To do this we need the following arguments:

  • the name of the spatial object (plot_locations_sp_HARV)
  • the directory where we want to save our ESRI shapefile (to use the current working directory, getwd(), or you can specify a different path)
  • the name of the new ESRI shapefile (PlotLocations_HARV)
  • the driver which specifies the file format (ESRI Shapefile)

We can now export the spatial object as an ESRI shapefile.

R

st_write(obj = plot_locations_sp_HARV,
         dsn = "data/cleaned-data/PlotLocations_HARV.shp", 
         driver = "ESRI Shapefile")
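
st_write() can also export the points as plain text. A minimal sketch using GDAL's CSV driver, where the layer_options value asks the driver to write the coordinates as X and Y columns (the output file name is just an example):

R

st_write(obj = plot_locations_sp_HARV,
         dsn = "data/cleaned-data/PlotLocations_HARV.csv",
         layer_options = "GEOMETRY=AS_XY")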

Key Points

  • Know the projection (if any) of your point data prior to converting to a spatial object.
  • Convert a data frame to an sf object using the st_as_sf() function.
  • Export an sf object as text using the st_write() function.

Content from Manipulate Raster Data


Last updated on 2024-08-19 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How can I crop raster objects to vector objects, and extract the summary of raster pixels?

Objectives

  • Crop a raster to the extent of a vector layer.
  • Extract values from a raster that correspond to a vector file overlay.

This episode explains how to crop a raster using the extent of a vector layer. We will also cover how to extract values from a raster that occur within a set of polygons, or in a buffer (surrounding) region around a set of points.

Let’s start by loading the libraries and data we need for this episode:

R

library(sf)
library(terra)
library(ggplot2)
library(dplyr)

R

point_HARV <-
  st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"

aoi_boundary_HARV <-
  st_read("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"

# CHM data, and convert it to a dataframe
CHM_HARV <-
  rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif"

CHM_HARV_df <- as.data.frame(CHM_HARV, xy = TRUE)

R

plot_locations_sp_HARV <- st_read("data/cleaned-data/PlotLocations_HARV.shp")

Crop a Raster to Vector Extent


We often work with spatial layers that have different spatial extents. The spatial extent of a vector layer or R spatial object represents the geographic "edge" - the locations that are the furthest north, south, east, and west. Thus it represents the overall geographic coverage of the spatial object.

Extent illustration Image Source: National Ecological Observatory Network (NEON)

The graphic below illustrates the extent of several of the spatial layers that we have worked with in this workshop:

  • Area of interest (AOI) – blue
  • Roads and trails – purple
  • Vegetation plot locations (marked with white dots) – black
  • A canopy height model (CHM) in GeoTIFF format – green

Frequent use cases of cropping a raster file include reducing file size and creating maps. Sometimes we have a raster file that is much larger than our study area or area of interest. It is often more efficient to crop the raster to the extent of our study area to reduce file sizes as we process our data. Cropping a raster can also be useful when creating pretty maps so that the raster layer matches the extent of the desired vector layers.

Crop a Raster Using Vector Extent


We can use the crop() function to crop a raster to the extent of another spatial object. To do this, we need to specify the raster to be cropped and the spatial object that will be used to crop the raster. R will use the extent of the spatial object as the cropping boundary.

To illustrate this, we will crop the Canopy Height Model (CHM) to only include the area of interest (AOI). Let’s start by plotting the full extent of the CHM data and overlay where the AOI falls within it. The boundaries of the AOI will be colored blue, and we use fill = NA to make the area transparent.

R

ggplot() +
  geom_raster(data = CHM_HARV_df, 
              mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
  scale_fill_gradientn(colors = terrain.colors(10)) +
  geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
  labs(fill = "Canopy Height") + 
  coord_sf()

Now that we have visualized the area of the CHM we want to subset, we can perform the cropping operation. We will use the crop() function from the terra package to create a new object with only the portion of the CHM data that falls within the boundaries of the AOI.

R

CHM_HARV_Cropped <- crop(x = CHM_HARV, y = aoi_boundary_HARV)

Now we can plot the cropped CHM data, along with a bounding box showing the full CHM extent. However, remember that since this is raster data, we need to convert it to a data frame in order to plot it using ggplot. To get the bounding box from the CHM, st_bbox() will extract the four corners of the rectangle that encompasses all the features contained in this object. st_as_sfc() then converts these four coordinates into a polygon that we can plot:

R

CHM_HARV_Cropped_df <- as.data.frame(CHM_HARV_Cropped, xy = TRUE)
boundary <- st_as_sfc(st_bbox(CHM_HARV))

ggplot() +
  geom_sf(data = boundary, fill = "green",
          color = "green", alpha = .2) +
  geom_raster(data = CHM_HARV_Cropped_df,
              mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
  scale_fill_gradientn(colors = terrain.colors(10)) +
  labs(fill = "Canopy Height") + 
  coord_sf()

The plot above shows that the full CHM extent (plotted in green) is much larger than the resulting cropped raster. Our new cropped CHM now has the same extent as the aoi_boundary_HARV object that was used as a crop extent (blue border below).

R

ggplot() +
  geom_raster(data = CHM_HARV_Cropped_df,
              mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
  geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
  scale_fill_gradientn(colors = terrain.colors(10)) +
  labs(fill = "Canopy Height") + 
  coord_sf()

Challenge 1: Crop to Vector Points Extent

  1. Crop the Canopy Height Model to the extent of the study plot locations.
  2. Plot the vegetation plot location points on top of the Canopy Height Model.

R

CHM_plots_HARVcrop <- crop(x = CHM_HARV, y = plot_locations_sp_HARV)

CHM_plots_HARVcrop_df <- as.data.frame(CHM_plots_HARVcrop, xy = TRUE)

ggplot() +
  geom_raster(data = CHM_plots_HARVcrop_df,
              mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
  scale_fill_gradientn(colors = terrain.colors(10)) +
  geom_sf(data = plot_locations_sp_HARV) +
  labs(fill = "Canopy Height") + 
  coord_sf()

In the plot above, created in the challenge, all the vegetation plot locations (black dots) appear on the Canopy Height Model raster layer except for one. One is situated on the blank space to the left of the map. Why?

A modification of the first figure in this episode is below, showing the relative extents of all the spatial objects. Notice that the extent for our vegetation plot layer (black) extends further west than the extent of our CHM raster (bright green). The crop() function will make a raster extent smaller; it will not expand the extent in areas where there are no data. Thus, the extent of our vegetation plot layer will still extend further west than the extent of our (cropped) raster data (dark green).

Define an Extent


So far, we have used a vector layer to crop the extent of a raster dataset. Alternatively, we can use the ext() function to define an extent to be used as a cropping boundary. This creates a new object of class SpatExtent. Here we will provide the ext() function with our xmin, xmax, ymin, and ymax (in that order).

R

new_extent <- ext(732161.2, 732238.7, 4713249, 4713333)
class(new_extent)

OUTPUT

[1] "SpatExtent"
attr(,"package")
[1] "terra"

Data Tip

The extent can be created from a numeric vector (as shown above), a matrix, or a list. For more details see the ext() function help file (?terra::ext).

Once we have defined our new extent, we can use the crop() function to crop our raster to this extent object.

R

CHM_HARV_manual_cropped <- crop(x = CHM_HARV, y = new_extent)

To plot this data using ggplot() we need to convert it to a dataframe.

R

CHM_HARV_manual_cropped_df <- as.data.frame(CHM_HARV_manual_cropped, xy = TRUE)

Now we can plot this cropped data. We will show the AOI boundary on the same plot for scale.

R

ggplot() +
  geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
  geom_raster(data = CHM_HARV_manual_cropped_df,
              mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
  scale_fill_gradientn(colors = terrain.colors(10)) +
  labs(fill = "Canopy Height") + 
  coord_sf()

Extract Raster Pixels Values Using Vector Polygons


Often we want to extract values from a raster layer for particular locations - for example, plot locations that we are sampling on the ground. We can extract all pixel values within 20m of our x,y point of interest. These can then be summarized into some value of interest (e.g. mean, maximum, total).

Image shows raster information extraction using 20m polygon boundary. Image Source: National Ecological Observatory Network (NEON)

To do this in R, we use the extract() function. The extract() function requires:

  • The raster that we wish to extract values from,
  • The vector layer containing the polygons that we wish to use as a boundary or boundaries,
  • We can also tell it to store the output values in a data frame using raw = FALSE (this is optional).

We will begin by extracting all canopy height pixel values located within our aoi_boundary_HARV polygon which surrounds the tower located at the NEON Harvard Forest field site.

R

tree_height <- extract(x = CHM_HARV, y = aoi_boundary_HARV, raw = FALSE)

str(tree_height)

OUTPUT

'data.frame':	18450 obs. of  2 variables:
 $ ID          : num  1 1 1 1 1 1 1 1 1 1 ...
 $ HARV_chmCrop: num  21.2 23.9 23.8 22.4 23.9 ...

When we use the extract() function, R extracts the value for each pixel located within the boundary of the polygon being used to perform the extraction - in this case the aoi_boundary_HARV object (a single polygon). Here, the function extracted values from 18,450 pixels.

We can create a histogram of tree height values within the boundary to better understand the structure or height distribution of trees at our site. We will use the column HARV_chmCrop from our data frame as our x values, as this column represents the tree heights for each pixel.

R

ggplot() +
  geom_histogram(data = tree_height, mapping = aes(x = HARV_chmCrop)) +
  labs(title = "Histogram of CHM Height Values (m)", 
       x = "Tree Height", 
       y = "Frequency of Pixels")

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also use the summary() function to view descriptive statistics including min, max, and mean height values. These values help us better understand vegetation at our field site.

R

summary(tree_height$HARV_chmCrop)

OUTPUT

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.03   21.36   22.81   22.43   23.97   38.17 

Summarize Extracted Raster Values


We often want to extract summary values from a raster. We can tell R the type of summary statistic we are interested in using the fun = argument. Let’s extract a mean height value for our AOI.

R

mean_tree_height_AOI <- extract(x = CHM_HARV, y = aoi_boundary_HARV,
                                fun = mean)

mean_tree_height_AOI

OUTPUT

  ID HARV_chmCrop
1  1     22.43018

It appears that the mean height value, extracted from our LiDAR-derived canopy height model, is 22.43 meters.

Extract Data using x,y Locations


We can also extract pixel values from a raster by defining a buffer or area surrounding individual point locations using the st_buffer() function. To do this we define the summary argument (fun = mean) and the buffer distance (dist = 20) which represents the radius of a circular region around each point. By default, the units of the buffer are the same units as the data’s CRS. All pixels that are touched by the buffer region are included in the extract.

Image shows raster information extraction using 20m buffer region. Image Source: National Ecological Observatory Network (NEON)

Let’s put this into practice by figuring out the mean tree height in the 20m around the tower location (point_HARV).

R

mean_tree_height_tower <- extract(x = CHM_HARV,
                                  y = st_buffer(point_HARV, dist = 20),
                                  fun = mean)

mean_tree_height_tower

OUTPUT

  ID HARV_chmCrop
1  1     22.38806

Challenge 2: Extract Raster Height Values For Plot Locations

  1. Use the plot locations object (plot_locations_sp_HARV) to extract an average tree height for the area within 20m of each vegetation plot location in the study area. Because there are multiple plot locations, there will be multiple averages returned.

  2. Create a plot showing the mean tree height of each area.

R

# extract data at each plot location
mean_tree_height_plots_HARV <- extract(x = CHM_HARV,
                                       y = st_buffer(plot_locations_sp_HARV,
                                                     dist = 20),
                                       fun = mean)

# view data
mean_tree_height_plots_HARV

OUTPUT

   ID HARV_chmCrop
1   1          NaN
2   2     23.96756
3   3     22.34937
4   4     16.49739
5   5     21.54419
6   6     19.16772
7   7     20.61651
8   8     21.61439
9   9     12.23006
10 10     19.13398
11 11     21.36966
12 12     19.32084
13 13     17.25975
14 14     20.47120
15 15     12.68301
16 16     15.51888
17 17     18.90894
18 18     18.19369
19 19     19.67441
20 20     20.23245
21 21     20.44984

R

# plot data
ggplot(data = mean_tree_height_plots_HARV,
       mapping = aes(x = ID, y = HARV_chmCrop)) +
  geom_col() +
  labs(title = "Mean Tree Height at each Plot", 
       x = "Plot ID", 
       y = "Tree Height (m)")

WARNING

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_col()`).

Key Points

  • Use the crop() function to crop a raster object.
  • Use the extract() function to extract pixels from a raster object that fall within a particular extent boundary.
  • Use the ext() function to define an extent.

Content from Create Publication-quality Graphics


Last updated on 2024-08-19 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How can I deal with multiple raster layers?
  • How can I create a publication-quality graphic and customize plot parameters?

Objectives

  • Create a raster stack
  • Assign custom names to bands in a RasterStack.
  • Customize raster plots using the ggplot2 package.

This episode covers how to customize your raster plots using the ggplot2 package in R to create publication-quality plots.

First, let’s load the libraries that we will need for this episode.

We will be using the tidyr package in this episode, which you might not have used yet!

If you haven’t downloaded tidyr to your R yet, do so now using:

install.packages("tidyr")

Once it’s downloaded, load in all the packages for this episode:

R

library(terra)
library(ggplot2)
library(dplyr)
library(tidyr)

Faceting plots


There may come a time when you will need to plot several smaller plots within a larger plot. The result is a plot with facets. These plots can be useful if you have several variables that you want to show at once, or if you’re trying to look at maps over time.

We have a time series of raster data that are spread across several files. These data show the normalized difference vegetation index (NDVI) from the Harvard Forest Site. We can best visualize these data by using facets. To do this, we need to read the data in as a “raster stack”.

First, let’s get the list of all the raster files in our time series. We can do this using the list.files() function. We will only add files that have a .tif extension to our list. To do this, we will use the syntax pattern=".tif$". If we specify full.names = TRUE, the full path for each file will be added to the list.

Data Tip

In the pattern above, the $ character represents the end of a line. Using it ensures that our pattern will only match files that end in .tif. This pattern matching uses a language called “regular expressions”, which is beyond the scope of this workshop.
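
For instance, here is a quick sketch of what the $ anchor does (the file names are made up):

R

# ".tif$" matches only names that end in .tif
grepl(".tif$", c("005_HARV_ndvi_crop.tif", "005_HARV_ndvi_crop.tif.xml"))

OUTPUT

[1]  TRUE FALSE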

R

NDVI_HARV_path <- "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI"

all_NDVI_HARV <- list.files(NDVI_HARV_path,
                            full.names = TRUE,
                            pattern = ".tif$")

# If you are getting an error, check your file path: 
# You might need to change your file path to:
# "data/2009586/NEON-DS-Landsat-NDVI/HARV/2011/NDVI"

It’s a good idea to look at the file names that matched our search to make sure they meet our expectations.

R

all_NDVI_HARV

OUTPUT

 [1] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/005_HARV_ndvi_crop.tif"
 [2] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/037_HARV_ndvi_crop.tif"
 [3] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/085_HARV_ndvi_crop.tif"
 [4] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/133_HARV_ndvi_crop.tif"
 [5] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/181_HARV_ndvi_crop.tif"
 [6] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/197_HARV_ndvi_crop.tif"
 [7] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/213_HARV_ndvi_crop.tif"
 [8] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/229_HARV_ndvi_crop.tif"
 [9] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/245_HARV_ndvi_crop.tif"
[10] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/261_HARV_ndvi_crop.tif"
[11] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/277_HARV_ndvi_crop.tif"
[12] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/293_HARV_ndvi_crop.tif"
[13] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/309_HARV_ndvi_crop.tif"

Now we have a list of all GeoTIFF files in the NDVI directory for Harvard Forest. The number at the start of each filename represents the Julian day. Next, we will create a stack of rasters from this list using the rast() function. This is the same function we use to read in any raster file.

R

NDVI_HARV_stack <- rast(all_NDVI_HARV)
NDVI_HARV_stack

OUTPUT

class       : SpatRaster 
dimensions  : 5, 4, 13  (nrow, ncol, nlyr)
resolution  : 30, 30  (x, y)
extent      : 239415, 239535, 4714215, 4714365  (xmin, xmax, ymin, ymax)
coord. ref. : UTM Zone 19, Northern Hemisphere 
sources     : 005_HARV_ndvi_crop.tif  
              037_HARV_ndvi_crop.tif  
              085_HARV_ndvi_crop.tif  
              ... and 10 more source(s)
names       : 005_H~_crop, 037_H~_crop, 085_H~_crop, 133_H~_crop, 181_H~_crop, 197_H~_crop, ... 
min values  :        2732,        1534,        1917,        5669,        8685,        8768, ... 
max values  :        5545,        3736,        3600,        6394,        8903,        9140, ... 

We can see from looking at the data that we will need to scale the values. NDVI data should be a value between 0 and 1 (and it's currently a value between 0 and 10,000!):

R

NDVI_HARV_stack <- NDVI_HARV_stack/10000

Once we have created our RasterStack, we can visualize our data. We can use the ggplot() command to create a multi-panelled plot showing each band in our RasterStack. First we need to create a data frame object. Because there are multiple columns in our data that are not variables, we will tidy (or “pivot”) the data so that we have a single column with the NDVI observations. We will use the function pivot_longer() from the tidyr package to do this. Since there are so many columns with relevant data, instead of specifying the columns we want to pivot, we can specify the columns we don’t want to pivot instead. We do this by specifying cols = -(x:y):

R

NDVI_HARV_stack_df <- as.data.frame(NDVI_HARV_stack, xy = TRUE) %>%
    pivot_longer(cols = -(x:y), names_to = "variable", values_to = "value")

Now we can plot our data using ggplot(). We want to create a separate panel for each time point in our time series, so we will use the facet_wrap() function to create a multi-paneled plot by the variable column:

R

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable)

Although this plot is informative, it isn't something we would expect to see in a journal publication. The x- and y-axis labels aren't informative. There is a lot of unnecessary gray background, and the titles of each panel don't clearly state that the number refers to the Julian day the data were collected. In this episode, we will customize the plot above to produce a publication-quality graphic. We will go through these steps iteratively. When we're done, we will have created the plot shown below.

Adjust the Plot Theme


The first thing we will do to our plot is remove the x- and y-axis labels and axis ticks, as these are unnecessary and make our plot look messy. We can do this by using a pre-set ggplot theme that simplifies all plot parameters. We will add the function theme_void() to tell ggplot that we want to use this theme.

R

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable) +
  labs(title = "Landsat NDVI", 
       subtitle = "NEON Harvard Forest", 
       fill = "NDVI") + 
  theme_void()

Next we will center our plot title and subtitle. We need to do this after the theme_void() layer, because R interprets the ggplot layers in order. If we first tell R to center our plot title, and then set the theme to void, any adjustments we’ve made to the plot theme will be over-written by the theme_void() function. So first we make the theme void and then we center the title. We center both the title and subtitle by using the theme() function and setting the hjust parameter to 0.5. The hjust parameter stands for “horizontal justification” and takes any value between 0 and 1. A setting of 0 indicates left justification and a setting of 1 indicates right justification. We can also update the “face” of the font, to be bold or italics, or bold and italics. To do this, we would adjust the face parameter.

R

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable) +
  labs(title = "Landsat NDVI", 
       subtitle = "NEON Harvard Forest", 
       fill = "NDVI") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))
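
For contrast, here is a sketch of the ordering that would not work: because theme_void() is a complete theme, adding it after theme() replaces every theme setting, including the centering we just applied.

R

# Wrong order: theme_void() is applied last, so it overwrites
# the title centering set by theme() above it
ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable) +
  labs(title = "Landsat NDVI") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  theme_void()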

Challenge 1

Change the plot subtitle to italic font. You can (and should!) use the help menu in RStudio or any internet resources to figure out how to change this setting.

Hint: Use ?element_text() to pull up the help file and see how you might be able to change the appearance of text in the theme() function.

Learners can find this information in the help file for the element_text() function. The parameter to set is called face.

R

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable) +
  labs(title = "Landsat NDVI", 
       subtitle = "NEON Harvard Forest", 
       fill = "NDVI") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, face = "italic"))

Adjust the Color Ramp


Next, let’s adjust the color ramp used to render the rasters. We will change the default blue color ramp to a green one that is better suited to our NDVI (greenness) data, using a ColorBrewer palette via the brewer.pal() function, which requires loading the RColorBrewer library.

If you don’t have the RColorBrewer library yet, install it using:

install.packages("RColorBrewer")

Now we need to create a set of colors to use.

We can view the palettes available in RColorBrewer:

R

library(RColorBrewer)

display.brewer.all()
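
To preview a single palette rather than all of them at once, RColorBrewer also provides display.brewer.pal():

R

# Preview just the nine-color yellow-green palette
display.brewer.pal(9, "YlGn")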

We will select a set of nine colors from the “YlGn” (yellow-green) color palette. This returns a set of hex color codes:

R

brewer.pal(9, "YlGn")

OUTPUT

[1] "#FFFFE5" "#F7FCB9" "#D9F0A3" "#ADDD8E" "#78C679" "#41AB5D" "#238443"
[8] "#006837" "#004529"

We can save these colors to add to our plot later.

R

green_colors <- brewer.pal(9, "YlGn") 
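
A side note: Brewer palettes contain a fixed, small number of colors (nine here). If you ever need more, colorRampPalette() from base R can interpolate between them; a minimal sketch:

R

# colorRampPalette() returns a function that generates any
# number of colors interpolated from the palette we supply
many_greens <- colorRampPalette(green_colors)(20)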

Now we can go ahead and build our plot!

R

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable) +
  scale_fill_gradientn(colours = green_colors) + 
  labs(title = "Landsat NDVI", 
       subtitle = "NEON Harvard Forest", 
       fill = "NDVI") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5)) 

The yellow-to-green color ramp represents NDVI well, given that NDVI is a measure of greenness: someone looking at the plot can quickly see that greener pixels have higher NDVI values.

Data Tip

For all of the brewer.pal palette names, see ?brewer.pal or the RColorBrewer documentation.

Data Tip

Cynthia Brewer, the creator of ColorBrewer, offers an online tool to help choose suitable color ramps, or to create your own: ColorBrewer 2.0 (Color Advice for Cartography).

Refine Plot & Tile Labels


Next, let’s label each panel in our plot with the Julian day that the raster data for that panel was collected. The current names come from the band “layer names” stored in the RasterStack and the first part of each name is the Julian day.

To create a more meaningful label, we can remove the pieces of each label we don’t want and replace them with something we do want, using the gsub() function in R. The syntax is as follows: gsub("PatternToFind", "ReplacementText", object).
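
A minimal standalone example of gsub() (with made-up strings, just to show the pattern):

R

# Replace every occurrence of "cat" with "dog"
gsub("cat", "dog", "the cat sat on the cat mat")

OUTPUT

[1] "the dog sat on the dog mat"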

First let’s remove “_HARV_ndvi_crop” from each label to make the labels shorter and remove repetition. To illustrate how this works, we will first look at the names for our NDVI_HARV_stack_df object in the variable column:

R

unique(NDVI_HARV_stack_df$variable)

OUTPUT

 [1] "005_HARV_ndvi_crop" "037_HARV_ndvi_crop" "085_HARV_ndvi_crop"
 [4] "133_HARV_ndvi_crop" "181_HARV_ndvi_crop" "197_HARV_ndvi_crop"
 [7] "213_HARV_ndvi_crop" "229_HARV_ndvi_crop" "245_HARV_ndvi_crop"
[10] "261_HARV_ndvi_crop" "277_HARV_ndvi_crop" "293_HARV_ndvi_crop"
[13] "309_HARV_ndvi_crop"

Now we will use the gsub() function to find the character string “_HARV_ndvi_crop” and replace it with a blank string (""). We will assign this output to a new column (panel_name) using the mutate() function, then look at that column to make sure our code is doing what we want it to.

R

NDVI_HARV_stack_df <- NDVI_HARV_stack_df %>% 
  mutate(panel_name = gsub("_HARV_ndvi_crop", "", variable))

unique(NDVI_HARV_stack_df$panel_name)

OUTPUT

 [1] "005" "037" "085" "133" "181" "197" "213" "229" "245" "261" "277" "293"
[13] "309"

So far so good. Now let’s use the paste() function to add the word “Day” in front of each of those numbers:

R

NDVI_HARV_stack_df <- NDVI_HARV_stack_df %>% 
  mutate(panel_name = paste("Day", panel_name))

unique(NDVI_HARV_stack_df$panel_name)

OUTPUT

 [1] "Day 005" "Day 037" "Day 085" "Day 133" "Day 181" "Day 197" "Day 213"
 [8] "Day 229" "Day 245" "Day 261" "Day 277" "Day 293" "Day 309"

Our labels look good now. Let’s render our plot again. This time, we’ll facet by the panel_name column.

R

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~panel_name) +
  scale_fill_gradientn(colours = green_colors) + 
  labs(title = "Landsat NDVI", 
       subtitle = "NEON Harvard Forest", 
       fill = "NDVI") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5)) 

The “Day” looks a little bit cut off on our plots. We can fix that using theme() again! This time, we want to adjust the strip.text element: let’s increase the margin on the bottom of the text area to accommodate the text size. We can do this using theme(strip.text = element_text(margin = unit(c(0,0,2,0), "pt"))). The c(0,0,2,0) specifies the new margins for the top (0 pt), right (0 pt), bottom (2 pt), and left (0 pt) sides.

R

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~panel_name) +
  scale_fill_gradientn(colours = green_colors) + 
  labs(title = "Landsat NDVI", 
       subtitle = "NEON Harvard Forest", 
       fill = "NDVI") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5), 
        strip.text = element_text(margin = unit(c(0,0,2,0), "pt"))) 

Change Layout of Panels


We can adjust the layout of our panels by setting the number of columns (ncol) and the number of rows (nrow) in facet_wrap(). Let’s make our plot five panels wide.

R

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~panel_name, ncol = 5) +
  scale_fill_gradientn(colours = green_colors) + 
  labs(title = "Landsat NDVI", 
       subtitle = "NEON Harvard Forest", 
       fill = "NDVI") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5), 
        strip.text = element_text(margin = unit(c(0,0,2,0), "pt")))

Now we have a beautiful, publication-quality plot!

We can save this plot using the ggsave() function. Remember that, by default, ggsave() saves the most recently rendered plot. We can specify the height, width, and dpi for the figure when we save it. This makes it super easy to meet your target journal’s figure formatting requirements.

R

ggsave(filename = "figures/time_series_NDVI.png", 
       height = 5, width = 5, dpi = 600)
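
If you would rather not rely on the most recently rendered plot, ggsave() also accepts a plot object through its plot argument; a minimal sketch (the name ndvi_plot is just an illustration):

R

# Capture the current plot (or assign your ggplot pipeline to a name)
ndvi_plot <- last_plot()
ggsave(filename = "figures/time_series_NDVI.png", 
       plot = ndvi_plot, height = 5, width = 5, dpi = 600)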

Challenge 2: Divergent Color Ramps

When we used the paste() function to modify the tile labels, we added “Day” to the beginning of each tile label. A more descriptive label would be “Julian Day”. Update the plot above with the following changes:

  1. Label each tile “Julian Day”, with the Julian day value following.
  2. Change the color ramp to a divergent brown-to-green color ramp.

Questions: Does having a divergent color ramp represent the data better than a sequential color ramp (like “YlGn”)? Can you think of other data sets where a divergent color ramp may be best?

R

NDVI_HARV_stack_df <- NDVI_HARV_stack_df %>% 
  mutate(panel_name = gsub("Day", "Julian Day", panel_name))

brown_green_colors <- brewer.pal(9, "BrBG")

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df, 
              mapping = aes(x = x, y = y, fill = value)) +
  facet_wrap(~panel_name, ncol = 5) +
  scale_fill_gradientn(colours = brown_green_colors) + 
  labs(title = "Landsat NDVI - Julian Days", 
       subtitle = "Harvard Forest 2011", 
       fill = "NDVI") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5), 
        strip.text = element_text(margin = unit(c(0,0,2,0), "pt")))

For NDVI data, the sequential color ramp is better than the divergent one, as it is more akin to the process of greening up, which starts at one end of the scale and steadily increases.
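
That said, for genuinely divergent data, such as the change in NDVI between two years, where values fall on either side of zero, ggplot2’s scale_fill_gradient2() lets you anchor the midpoint of the ramp explicitly; a sketch, assuming a hypothetical ndvi_change_df data frame with a change column:

R

# Hypothetical data frame of NDVI differences centered on zero
ggplot() +
  geom_raster(data = ndvi_change_df, 
              mapping = aes(x = x, y = y, fill = change)) +
  scale_fill_gradient2(low = "brown", mid = "white", high = "darkgreen", 
                       midpoint = 0)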

Key Points

  • Use the theme_void() function for a clean background to your plot.
  • Use the element_text() function to adjust text size, font, and position.
  • Use the brewer.pal() function to create a custom color palette.
  • Use the gsub() function to do pattern matching and replacement in text and the paste() function to add to your current text.