Content from Introduction to R and RStudio
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How to find your way around RStudio?
- How to interact with R?
- How to install packages?
Objectives
- Describe the purpose and use of each pane in the RStudio IDE
- Locate buttons and options in the RStudio IDE
- Define a variable
- Assign data to a variable
- Use mathematical and comparison operators
- Call functions
- Manage packages
Motivation
Science is a multi-step process: once you’ve designed an experiment and collected data, the real fun begins! This lesson will teach you how to start this process using R and RStudio. We will begin with raw data, perform exploratory analyses, and learn how to plot results graphically. This example starts with a dataset from gapminder.org containing population information for many countries through time. Can you read the data into R? Can you plot the population for Senegal? Can you calculate the average income for countries on the continent of Asia? By the end of these lessons you will be able to do things like plot the populations for all of these countries in under a minute!
Before Starting The Workshop
Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages used in the workshop may not install correctly (or at all) if R is not up to date.
Introduction to RStudio
Throughout this lesson, we’re going to teach you some of the fundamentals of the R language as well as some best practices for organizing code for scientific projects that will make your life easier.
We’ll be using RStudio: a free, open source R integrated development environment (IDE). It provides a built in editor, works on all platforms (including on servers) and provides many advantages such as integration with version control and project management.
Basic layout
When you first open RStudio, you will be greeted by three panels:
- The interactive R console (entire left)
- Environment/History (tabbed in upper right)
- Files/Plots/Packages/Help/Viewer (tabbed in lower right)
Once you open files, such as R scripts, an editor panel will also open in the top left.
Workflow within RStudio
There are two main ways one can work within RStudio.
- Test and play within the interactive R console then copy code into a .R file to run later.
- This works well when doing small tests and initially starting off.
- It quickly becomes laborious
- Start writing in an .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
- This is a great way to start; all your code is saved for later
- You will be able to run the file you create from within RStudio or
using R’s
source()
function.
Tip: Running segments of your code
RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can
- click on the
Run
button above the editor panel, or - select “Run Lines” from the “Code” menu, or
- hit Ctrl+Enter in Windows,
Ctrl+Return in Linux, or
⌘+Return on OS X. (This shortcut can also be seen
by hovering the mouse over the button). To run a block of code, select
it and then
Run
.
Introduction to R
Much of your time in R will be spent in the R interactive console.
This is where you will run all of your code, and can be a useful
environment to try out ideas before adding them to an R script file.
This console in RStudio is the same as the one you would get if you
typed in R
in your command-line environment.
The first thing you will see in the R interactive session is a bunch of information, followed by a “>” and a blinking cursor. When you are running a section of your code, this is the location where R will first read your code, attempt to execute them, and then returns a result.
Using R as a calculator
The simplest thing you could do with R is do arithmetic:
R
1 + 100
OUTPUT
[1] 101
And R will print out the answer, with a preceding “[1]
”.
Don’t worry about this for now, we’ll explain that later. For now think
of it as indicating output.
Like bash, if you type in an incomplete command, R will wait for you to complete it:
OUTPUT
+
Any time you hit return and the R session shows a “+
”
instead of a “>
”, it means it’s waiting for you to
complete the command. If you want to cancel a command you can simply hit
“Esc” and RStudio will give you back the “>
”
prompt.
Tip: Cancelling commands
If you’re using R from the command line instead of from within RStudio, you need to use Ctrl+C instead of Esc to cancel the command. This applies to Mac users as well!
Cancelling a command isn’t only useful for killing incomplete commands: you can also use it to tell R to stop running code (for example if it’s taking much longer than you expect), or to get rid of the code you’re currently writing.
When using R as a calculator, the order of operations is the same as you would have learned back in school.
From highest to lowest precedence:
- Parentheses:
(
)
- Exponents:
^
or**
- Divide:
/
- Multiply:
*
- Add:
+
- Subtract:
-
R
3 + 5 * 2
OUTPUT
[1] 13
Use parentheses to group operations in order to force the order of evaluation if it differs from the default, or to make clear what you intend.
R
(3 + 5) * 2
OUTPUT
[1] 16
This can get unwieldy when not needed, but clarifies your intentions. Remember that others may later read your code.
R
(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2 # clear, if you remember the rules
3 + 5 * (2 ^ 2) # if you forget some rules, this might help
The text after each line of code is called a “comment”. Anything that
follows after the hash (or octothorpe) symbol #
is ignored
by R when it executes code.
Really small or large numbers get a scientific notation:
R
2/10000
OUTPUT
[1] 2e-04
Which is shorthand for “multiplied by 10^XX
”. So
2e-4
is shorthand for 2 * 10^(-4)
.
You can write numbers in scientific notation too:
R
5e3 # Note the lack of minus here
OUTPUT
[1] 5000
Don’t worry about trying to remember every function in R. You can look them up using a search engine, or if you can remember the start of the function’s name, use the tab completion in RStudio.
This is one advantage that RStudio has over R on its own, it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.
Typing a ?
before the name of a command will open the
help page for that command. As well as providing a detailed description
of the command and how it works, scrolling to the bottom of the help
page will usually show a collection of code examples which illustrate
command usage. We’ll go through an example later.
Comparing things
We can also do comparison in R:
R
1 == 1 # equality (note two equals signs, read as "is equal to")
OUTPUT
[1] TRUE
R
1 != 2 # inequality (read as "is not equal to")
OUTPUT
[1] TRUE
R
1 < 2 # less than
OUTPUT
[1] TRUE
R
1 <= 1 # less than or equal to
OUTPUT
[1] TRUE
R
1 > 0 # greater than
OUTPUT
[1] TRUE
R
1 >= -9 # greater than or equal to
OUTPUT
[1] TRUE
Tip: Comparing Numbers
A word of warning about comparing numbers: you should never use
==
to compare two numbers unless they are integers (a data
type which can specifically represent only whole numbers).
Computers may only represent decimal numbers with a certain degree of precision, so two numbers which look the same when printed out by R, may actually have different underlying representations and therefore be different by a small margin of error (called Machine numeric tolerance).
Instead you should use the all.equal
function.
Further reading: http://floating-point-gui.de/
Variables and assignment
We can store values in variables using the assignment operator
<-
, like this:
R
x <- 1/40
Notice that assignment does not print a value. Instead, we stored it
for later in something called a variable.
x
now contains the value
0.025
:
R
x
OUTPUT
[1] 0.025
More precisely, the stored value is a decimal approximation of this fraction called a floating point number.
Look for the Environment
tab in one of the panes of
RStudio, and you will see that x
and its value have
appeared. Our variable x
can be used in place of a number
in any calculation that expects a number:
R
log(x)
OUTPUT
[1] -3.688879
Notice also that variables can be reassigned:
R
x <- 100
x
used to contain the value 0.025 and and now it has the
value 100.
Assignment values can contain the variable being assigned to:
R
x <- x + 1 #notice how RStudio updates its description of x on the top right tab
y <- x * 2
The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.
R
mass <- 47.5
This will give a value of 47.5 for the variable mass
R
age <- 122
This will give a value of 122 for the variable age
R
mass <- mass * 2.3
This will multiply the existing value of 47.5 by 2.3 to give a new value of 109.25 to the variable mass.
R
age <- age - 20
This will subtract 20 from the existing value of 122 to give a new value of 102 to the variable age.
One way of answering this question in R is to use the
>
to set up the following:
R
mass > age
OUTPUT
[1] TRUE
This should yield a boolean value of TRUE since 109.25 is greater than 102.
Variable names can contain letters, numbers, underscores and periods. They cannot start with a number nor contain spaces at all. Different people use different conventions for long variable names, these include
- periods.between.words
- underscores_between_words
- camelCaseToSeparateWords
What you use is up to you, but be consistent.
It is also possible to use the =
operator for
assignment:
R
x = 1/40
But this is much less common among R users. The most important thing
is to be consistent with the operator you use. There
are occasionally places where it is less confusing to use
<-
than =
, and it is the most common symbol
used in the community. So the recommendation is to use
<-
.
The following can be used as R variables:
R
min_height
max.height
MaxLength
celsius2kelvin
The following creates a hidden variable:
R
.mass
We won’t be discussing hidden variables in this lesson. We recommend not using a period at the beginning of variable names unless you intend your variables to be hidden.
The following will not be able to be used to create a variable
Installing Packages
We can use R as a calculator to do mathematical operations (e.g., addition, subtraction, multiplication, division), as we did above. However, we can also use R to carry out more complicated analyses, make visualizations, and much more. In later episodes, we’ll use R to do some data wrangling, plotting, and saving of reformatted data.
R coders around the world have developed collections of R code to accomplish themed tasks (e.g., data wrangling). These collections of R code are known as R packages. It is also important to note that R packages refer to code that is not automatically downloaded when we install R on our computer. Therefore, we’ll have to install each R package that we want to use (more on this below).
We will practice using the dplyr
package to wrangle our
datasets in episode 6 and will also practice using the
ggplot2
package to plot our data in episode 7. To give an
example, the dplyr
package includes code for a function
called filter()
. A function is something that
takes input(s) does some internal operations and produces output(s). For
the filter()
function, the inputs are a dataset and a
logical statement (i.e., when data value is greater than or equal to
100) and the output is data within the dataset that has a value greater
than or equal to 100.
There are two main ways to install packages in R:
If you are using RStudio, we can go to
Tools
>Install Packages...
and then search for the name of the R package we need and clickInstall
.We can use the
install.packages( )
function. We can do this to install thedplyr
R package.
R
install.packages("dplyr")
OUTPUT
The following package(s) will be installed:
- dplyr [1.1.4]
These packages will be installed into "~/Documents/2024-07-01-ucsb-intro-geospatial/renv/profiles/lesson-requirements/renv/library/R-4.3/aarch64-apple-darwin20".
# Installing packages --------------------------------------------------------
- Installing dplyr ... OK [linked from cache]
Successfully installed 1 package in 5.6 milliseconds.
It’s important to note that we only need to install the R package on our computer once. Well, if we install a new version of R on the same computer, then we will likely need to also re-install the R packages too.
We would use the following R code to install the ggplot2
package:
R
install.packages("ggplot2")
Now that we’ve installed the R package, we’re ready to use it! To use the R package, we need to “load” it into our R session. We can think of “loading” an R packages as telling R that we’re ready to use the package we just installed. It’s important to note that while we only have to install the package once, we’ll have to load the package each time we open R (or RStudio).
To load an R package, we use the library( )
function. We
can load the dplyr
package like this:
R
library(dplyr)
OUTPUT
Attaching package: 'dplyr'
OUTPUT
The following objects are masked from 'package:stats':
filter, lag
OUTPUT
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
The correct answers are b and c. Answer a will install, not load, the ggplot2 package. Answer b will correctly load the ggplot2 package. Note there are no quotation marks. Answer c will correctly load the ggplot2 package. Note there are quotation marks. Answer d will produce an error because ggplot2 is misspelled.
Note: It is more common for coders to not use quotation marks when loading an R package (i.e., answer c).
R
library(ggplot2)
Content from Project Management With RStudio
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I manage my projects in R?
Objectives
- Create self-contained projects in RStudio
Introduction
The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together. Organising a project involving spatial data is no different from any other data analysis project, although you may require more disk space than usual.
Managing your projects in a reproducible fashion doesn’t just make your science reproducible, it makes your life easier.
— Vince Buffalo (@vsbuffalo) April 15, 2013
Most people tend to organize their projects like this:
There are many reasons why we should ALWAYS avoid this:
- It is really hard to tell which version of your data is the original and which is the modified;
- It gets really messy because it mixes files with various extensions together;
- It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate it;
A good project layout will ultimately make your life easier:
- It will help ensure the integrity of your data;
- It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
- It allows you to easily upload your code with your manuscript submission;
- It makes it easier to pick the project back up after a break.
A possible solution
Fortunately, there are tools and packages which can help you manage your work effectively.
One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using this today to create a self-contained, reproducible project.
A key advantage of an RStudio Project is that whenever we open this
project in subsequent RStudio sessions our working directory will
always be set to the folder r-geospatial
. Let’s
check our working directory by entering the following into the R
console:
R
getwd()
R should return your/path/r-geospatial
as the working
directory.
Best practices for project organization
Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:
Treat data as read only
This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel) where they can be modified means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.
Data Cleaning
In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. I find it useful to store these scripts in a separate folder, and create a second “read-only” data folder to hold the “cleaned” data sets.
Treat generated output as disposable
Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.
There are lots of different ways to manage this output. I find it useful to have an output folder with different sub-directories for each separate analysis. This makes it easier later, as many of my analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.
Keep related data together
Some GIS file formats are really 3-6 files that need to be kept together and have the same name, e.g. shapefiles. It may be tempting to store those components separately, but your spatial data will be unusable if you do that.
Keep a consistent naming scheme
It is generally best to avoid renaming downloaded spatial data, so
that a clear connection is maintained with the point of truth. You may
otherwise find yourself wondering whether file_A
really is
just a copy of Official_file_on_website
or not.
For datasets you generate, it’s worth taking the time to come up with a naming convention that works for your project, and sticking to it. File names don’t have to be long, they just have to be long enough that you can tell what the file is about. Date generated, topic, and whether a product is intermediate or final are good bits of information to keep in a file name. For more tips on naming files, check out the slides from Jenny Bryan’s talk “Naming things” at the 2015 Reproducible Science Workshop.
Tip: Good Enough Practices for Scientific Computing
Good Enough Practices for Scientific Computing gives the following recommendations for project organization:
- Put each project in its own directory, which is named after the project.
- Put text documents associated with the project in the
doc
directory. - Put raw data and metadata in the
data
directory, and files generated during cleanup and analysis in aresults
directory. - Put source for the project’s scripts and programs in the
src
directory, and programs brought in from elsewhere or compiled locally in thebin
directory. - Name all files to reflect their content or function.
Save the data in the data directory
Now we have a good directory structure we will now place/save our
data files in the data/
directory.
Challenge 1
1. Download each of the data files listed below (Ctrl+S, right mouse click -> “Save as”, or File -> “Save page as”)
2. Make sure the files have the following names:
nordic-data.csv
nordic-data-2.csv
gapminder_data.csv
3. Save the files in the data/
folder within your
project.
We will load and inspect these data later.
Challenge 2
We also want to move the data that we downloaded previously for this
workshop into a subdirectory inside r-geospatial
. If you
haven’t already downloaded the data, you can do so by clicking this
download link.
- Move the downloaded zip file to the
data
directory. - Once the data have been moved, unzip all files.
Once you have completed moving the data across to the new folder, your data directory should look as follows:
data/
gapminder_data.csv
NEON-DS-Airborne-Remote-Sensing/
NEON-DS-Landsat-NDVI/
NEON-DS-Met-Time-Series/
NEON-DS-Site-Layout-Files/
NEON-DS-Airborne-Remote-Sensing.zip
NEON-DS-Landsat-NDVI.zip
NEON-DS-Met-Time-Series.zip
NEON-DS-Site-Layout-Files.zip
nordic-data.csv
nordic-data-2.csv
Content from Data Structures
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I read data in R?
- What are the basic data types in R?
- How do I represent categorical information in R?
Objectives
- To be aware of the different types of data.
- To begin exploring data frames, and understand how they are related to vectors, factors and lists.
- To be able to ask questions from R about the type, class, and structure of an object.
Let’s start by creating a new R script and saving it to the
scripts
folder in our project directory. We will create new
scripts for each episode in this workshop.
We can create a new R script by clicking the button at the top left of our RStudio, the one that looks like a piece of paper with a green plus sign next to it. On the drop down, click “R Script”. We can also create an R script using File > New file > R script.
We can add a comment to our script to remind us what we’re working on:
R
# R script for data structures
Comments are useful notes to us that are ignored by the computer.
Now that we’ve got the basics covered, we can move on to the lesson.
One of R’s most powerful features is its ability to deal with tabular
data - like you may already have in a spreadsheet or a CSV file. Let’s
start by downloading and reading in a file nordic-data.csv
.
We will save this data as an object named nordic
:
R
nordic <- read.csv("data/nordic-data.csv")
Tip: Running segments of your code
RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can
- click on the
Run
button above the editor panel, or - select “Run Lines” from the “Code” menu, or
- hit Ctrl+Enter in Windows,
Ctrl+Return in Linux, or
⌘+Return on OS X. (This shortcut can also be seen
by hovering the mouse over the button). To run a block of code, select
it and then
Run
.
The read.table
function is used for reading in tabular
data stored in a text file where the columns of data are separated by
punctuation characters such as CSV files (csv = comma-separated values).
Tabs and commas are the most common punctuation characters used to
separate or delimit data points in csv files. For convenience R provides
2 other versions of read.table
. These are:
read.csv
for files where the data are separated with commas
and read.delim
for files where the data are separated with
tabs. Of these three functions read.csv
is the most
commonly used. If needed it is possible to override the default
delimiting punctuation marks for both read.csv
and
read.delim
.
We can begin exploring our dataset right away, pulling out columns by
specifying them using the $
operator:
R
nordic$country
OUTPUT
[1] "Denmark" "Sweden" "Norway"
R
nordic$lifeExp
OUTPUT
[1] 77.2 80.0 79.0
We can do other operations on the columns. For example, if we discovered that the life expectancy is two years higher:
R
nordic$lifeExp + 2
OUTPUT
[1] 79.2 82.0 81.0
But what about:
R
nordic$lifeExp + nordic$country
ERROR
Error in nordic$lifeExp + nordic$country: non-numeric argument to binary operator
Understanding what happened here is key to successfully analyzing data in R.
Data Types
If you guessed that the last command will return an error because
77.2
plus "Denmark"
is nonsense, you’re right
- and you already have some intuition for an important concept in
programming called data classes. We can ask what class of data
something is:
R
class(nordic$lifeExp)
OUTPUT
[1] "numeric"
There are 6 main types:
-
numeric
: values like 1.254, 0.8, -7.2 -
integer
: values like 1, 10, 150 -
complex
: values like 1+1i -
logical
: TRUE or FALSE -
character
: values like “cats”, “dogs”, and “animals” -
factor
: categories with all possible values, like the months of the year
No matter how complicated our analyses become, all data in R is interpreted a specific data class. This strictness has some really important consequences.
A user has added new details of age expectancy. This information is
in the file data/nordic-data-2.csv
.
Load the new nordic data as nordic_2
, and check what
class of data we find in the lifeExp
column:
R
nordic_2 <- read.csv("data/nordic-data-2.csv")
class(nordic_2$lifeExp)
OUTPUT
[1] "character"
Oh no, our life expectancy lifeExp aren’t the numeric type anymore! If we try to do the same math we did on them before, we run into trouble:
R
nordic_2$lifeExp + 2
ERROR
Error in nordic_2$lifeExp + 2: non-numeric argument to binary operator
What happened? When R reads a csv file into one of these tables, it insists that everything in a column be the same class; if it can’t understand everything in the column as numeric, then nothing in the column gets to be numeric. The table that R loaded our nordic data into is something called a dataframe, and it is our first example of something called a data structure - that is, a structure which R knows how to build out of the basic data types.
We can look at the data within R by clicking the object in our
environment or using View(nordic)
. We see that the
lifeExp
column now has the value “79.0 or 83” in the third
row. R needed all values in that column to be the same type, so it
forced everything to be a character instead of a number.
In order to successfully use our data in R, we need to understand what the basic data structures are, and how they behave.
Vectors and Type Coercion
To better understand this behavior, let’s meet another of the data structures: the vector.
A vector in R is essentially an ordered list of things, with the
special condition that everything in the vector must be the same basic
data type. If you don’t choose the data type, it’ll default to
logical
; or, you can declare an empty vector of whatever
type you like.
You can specify a vector either using the vector()
function or the combine (c()
) function.
R
vector(length = 3) # this creates an empty vector of logical values
OUTPUT
[1] FALSE FALSE FALSE
R
c(1, 2, 3) # this creates a vector with numerical values
OUTPUT
[1] 1 2 3
R
c("this", "that", "the other") # this creates a vector with character values
OUTPUT
[1] "this" "that" "the other"
R
another_vector <- vector(mode = 'character', length = 3)
another_vector
OUTPUT
[1] "" "" ""
You can check if something is a vector using str
which
asks for an object’s structure:
R
str(another_vector)
OUTPUT
chr [1:3] "" "" ""
The somewhat cryptic output from this command indicates the basic
data type found in this vector - in this case chr
,
character; an indication of the number of things in the vector -
actually, the indexes of the vector, in this case [1:3]
;
and a few examples of what’s actually in the vector - in this case empty
character strings. If we similarly do
R
str(nordic$lifeExp)
OUTPUT
num [1:3] 77.2 80 79
we see that nordic$lifeExp
is a vector, too - the
columns of data we load into R data frames are all vectors, and that’s
the root of why R forces everything in a column to be the same basic
data type.
Discussion 1
Why is R so opinionated about what we put in our columns of data? How does this help us?
By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.
Given what we’ve learned so far, what do you think the following will produce?
R
quiz_vector <- c(2, 6, '3')
This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them. When R encounters a mix of types (here numeric and character) to be combined into a single vector, it will force them all to be the same type. Consider:
R
coercion_vector <- c('a', TRUE)
coercion_vector
OUTPUT
[1] "a" "TRUE"
The coercion rules go: logical
->
integer
-> numeric
->
complex
-> character
, where -> can be
read as are transformed into. You can try to force coercion
against this flow using the as.
functions:
R
character_vector_example <- c('0', '2', '4')
character_vector_example
OUTPUT
[1] "0" "2" "4"
R
character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric
OUTPUT
[1] 0 2 4
As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data frames, or you will get nasty surprises!
R
str(nordic_2$lifeExp)
OUTPUT
chr [1:3] "77.2" "80" "79.0 or 83"
R
str(nordic$lifeExp)
OUTPUT
num [1:3] 77.2 80 79
The data in nordic_2$lifeExp
is stored as a character
vector, rather than as a numeric vector. This is because of the “or”
character string in the third data point.
The combine function, c()
, will also append things to an
existing vector:
R
ab_vector <- c('a', 'b')
ab_vector
OUTPUT
[1] "a" "b"
R
combine_example <- c(ab_vector, 'DC')
combine_example
OUTPUT
[1] "a" "b" "DC"
You can also make series of numbers:
R
my_series <- 1:10
my_series
OUTPUT
[1] 1 2 3 4 5 6 7 8 9 10
R
seq(from = 1, to = 10, by = 0.1)
OUTPUT
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
[16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
[31] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4
[46] 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
[61] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
[76] 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
[91] 10.0
We can ask a few questions about vectors:
R
sequence_example <- 1:10
head(sequence_example)
OUTPUT
[1] 1 2 3 4 5 6
R
tail(sequence_example)
OUTPUT
[1] 5 6 7 8 9 10
R
length(sequence_example)
OUTPUT
[1] 10
R
class(sequence_example)
OUTPUT
[1] "integer"
Finally, you can give names to elements in your vector:
R
my_example <- 5:8
names(my_example) <- c("a", "b", "c", "d")
my_example
OUTPUT
a b c d
5 6 7 8
R
names(my_example)
OUTPUT
[1] "a" "b" "c" "d"
R
x <- 1:26
x <- x * 2
names(x) <- LETTERS
Factors
We said that columns in data frames were vectors:
R
str(nordic$lifeExp)
OUTPUT
num [1:3] 77.2 80 79
R
str(nordic$year)
OUTPUT
int [1:3] 2002 2002 2002
R
str(nordic$country)
OUTPUT
chr [1:3] "Denmark" "Sweden" "Norway"
One final important data structure in R is called a “factor”. Factors look like character data, but are used to represent data where each element of the vector must be one of a limited number of “levels”. To phrase that another way, factors are an “enumerated” type where there are a finite number of pre-defined values that your vector can have.
For example, let’s make a vector of strings labeling nordic countries for all the countries in our study:
R
nordic_countries <- nordic$country
str(nordic_countries)
OUTPUT
chr [1:3] "Denmark" "Sweden" "Norway"
We can turn a vector into a factor like so:
R
categories <- factor(nordic_countries)
str(categories)
OUTPUT
Factor w/ 3 levels "Denmark","Norway",..: 1 3 2
Now R has noticed that there are 3 possible categories in our data - but it also did something surprising; instead of printing out the strings we gave it, we got a bunch of numbers instead. R has replaced our human-readable categories with numbered indices under the hood, this is necessary as many statistical calculations utilise such numerical representations for categorical data.
They are sorted in alphabetical order
Challenge 4
Convert the country
column of our nordic
data frame to a factor. Then try converting it back to a character
vector.
Now try converting lifeExp
in our nordic
data frame to a factor, then back to a numeric vector. What happens if
you use as.numeric()
?
Remember that you can reload the nordic
data frame using
read.csv("data/nordic-data.csv")
if you accidentally lose
some data!
Converting character vectors to factors can be done using the
factor()
function:
R
nordic$country <- factor(nordic$country)
nordic$country
OUTPUT
[1] Denmark Sweden Norway
Levels: Denmark Norway Sweden
You can convert these back to character vectors using
as.character()
:
R
nordic$country <- as.character(nordic$country)
nordic$country
OUTPUT
[1] "Denmark" "Sweden" "Norway"
You can convert numeric vectors to factors in the exact same way:
R
nordic$lifeExp <- factor(nordic$lifeExp)
nordic$lifeExp
OUTPUT
[1] 77.2 80 79
Levels: 77.2 79 80
But be careful – you can’t use as.numeric()
to convert
factors to numerics!
R
as.numeric(nordic$lifeExp)
OUTPUT
[1] 1 3 2
Instead, as.numeric()
converts factors to those “numbers
under the hood” we talked about. To go from a factor to a number, you
need to first turn the factor into a character vector, and then
turn that into a numeric vector:
R
nordic$lifeExp <- as.character(nordic$lifeExp)
nordic$lifeExp <- as.numeric(nordic$lifeExp)
nordic$lifeExp
OUTPUT
[1] 77.2 80.0 79.0
Note: new students find the help files difficult to understand; make sure to let them know that this is typical, and encourage them to take their best guess based on semantic meaning, even if they aren’t sure.
When doing statistical modelling, it’s important to know what the baseline levels are. This is assumed to be the first factor, but by default factors are labeled in alphabetical order. You can change this by specifying the levels:
R
mydata <- c("case", "control", "control", "case")
factor_ordering_example <- factor(mydata, levels = c("control", "case"))
str(factor_ordering_example)
OUTPUT
Factor w/ 2 levels "control","case": 2 1 1 2
In this case, we’ve explicitly told R that “control” should represented by 1, and “case” by 2. This designation can be very important for interpreting the results of statistical models!
Lists
Another data structure you’ll want in your bag of tricks is the
list
. A list is simpler in some ways than the other types,
because you can put anything you want in it:
R
list_example <- list(1, "a", TRUE, c(2, 6, 7))
list_example
OUTPUT
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 2 6 7
R
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
another_list
OUTPUT
$title
[1] "Numbers"
$numbers
[1] 1 2 3 4 5 6 7 8 9 10
$data
[1] TRUE
We can now understand something a bit surprising in our data frame;
what happens if we compare str(nordic)
and
str(another_list)
:
R
str(nordic)
OUTPUT
'data.frame': 3 obs. of 3 variables:
$ country: chr "Denmark" "Sweden" "Norway"
$ year : int 2002 2002 2002
$ lifeExp: num 77.2 80 79
R
str(another_list)
OUTPUT
List of 3
$ title : chr "Numbers"
$ numbers: int [1:10] 1 2 3 4 5 6 7 8 9 10
$ data : logi TRUE
We see that the output for these two objects look very similar. It is because data frames are lists ‘under the hood’. Data frames are a special case of lists where each element (the columns of the data frame) have the same lengths.
In our nordic
example, we have a character, an integer,
and a numerical variable. As we have seen already, each column of data
frame is a vector.
R
nordic$country
OUTPUT
[1] "Denmark" "Sweden" "Norway"
We can also call to the contents within the items in a list via
indexing. For example, if we wanted to just return the contents from the
first object in another_list
(the title), we can do that by
using double brackets and specifying either the index value of the item,
or it’s name.
R
another_list[[1]]
OUTPUT
[1] "Numbers"
R
another_list[["title"]]
OUTPUT
[1] "Numbers"
Challenge 5
There are several subtly different ways to call variables, observations and elements from data frames:
nordic[1]
nordic[[1]]
nordic$country
nordic["country"]
nordic[1, 1]
nordic[, 1]
nordic[1, ]
Try out these examples and explain what is returned by each one.
Hint: Use the function class()
to examine what
is returned in each case.
R
nordic[1]
OUTPUT
country
1 Denmark
2 Sweden
3 Norway
We can think of a data frame as a list of vectors. The single brace
[1]
returns the first slice of the list, as another list.
In this case it is the first column of the data frame.
R
nordic[[1]]
OUTPUT
[1] "Denmark" "Sweden" "Norway"
The double brace [[1]]
returns the contents of the list
item. In this case it is the contents of the first column, a
vector of type character.
R
nordic$country
OUTPUT
[1] "Denmark" "Sweden" "Norway"
This example uses the $
character to address items by
name. country is the first column of the data frame, again a
vector of type character.
R
nordic["country"]
OUTPUT
country
1 Denmark
2 Sweden
3 Norway
Here we are using a single brace ["country"]
replacing
the index number with the column name. Like example 1, the returned
object is a list.
R
nordic[1, 1]
OUTPUT
[1] "Denmark"
This example uses a single brace, but this time we provide row and
column coordinates. The returned object is the value in row 1, column 1.
The object is an character: the first value of the first vector
in our nordic
object.
R
nordic[, 1]
OUTPUT
[1] "Denmark" "Sweden" "Norway"
Like the previous example we use single braces and provide row and column coordinates. The row coordinate is not specified, R interprets this missing value as all the elements in this column vector.
R
nordic[1, ]
OUTPUT
country year lifeExp
1 Denmark 2002 77.2
Again we use the single brace with row and column coordinates. The column coordinate is not specified. The return value is a list containing all the values in the first row.
Content from Exploring Data Frames
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I manipulate a data frame?
Objectives
- Append two data frames.
- Understand what a
factor
is. - Convert a
factor
to acharacter
vector and vice versa. - Display basic properties of data frames including size and class of the columns, names, and first few rows.
At this point, you’ve seen it all: in the last lesson, we toured all the basic data types and data structures in R. Everything you do will be a manipulation of those tools. But most of the time, the star of the show is the data frame—the table that we created by loading information from a csv file. In this lesson, we’ll learn a few more things about working with data frames.
Realistic example
We already learned that the columns of a data frame are vectors, so that our data are consistent in type throughout the columns. So far, you have seen the basics of manipulating data frames with our nordic data; now let’s use those skills to digest a more extensive dataset. Let’s read in the gapminder dataset that we downloaded previously:
R
gapminder <- read.csv("data/gapminder_data.csv")
Let’s investigate the gapminder
data frame a bit; the
first thing we should always do is check out what the data looks like
with str
:
R
str(gapminder)
OUTPUT
'data.frame': 1704 obs. of 6 variables:
$ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: chr "Asia" "Asia" "Asia" "Asia" ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
We can also interrogate the data frame for information about its
dimensions; remembering that str(gapminder)
said there were
1704 observations of 6 variables in gapminder, what do you think the
following will produce, and why?
R
length(gapminder)
A fair guess would have been to say that the length of a data frame would be the number of rows it has (1704), but this is not the case; it gives us the number of columns.
To get the number of rows and columns in our dataset, try:
R
nrow(gapminder)
OUTPUT
[1] 1704
R
ncol(gapminder)
OUTPUT
[1] 6
Or, both at once:
R
dim(gapminder)
OUTPUT
[1] 1704 6
We’ll also likely want to know what the titles of all the columns are, so we can ask for them later:
R
colnames(gapminder)
OUTPUT
[1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
At this stage, it’s important to ask ourselves if the structure R is reporting matches our intuition or expectations; do the basic data types reported for each column make sense? If not, we need to sort any problems out now before they turn into bad surprises down the road, using what we’ve learned about how R interprets data, and the importance of strict consistency in how we record our data.
Once we’re happy that the data types and structures seem reasonable, it’s time to start digging into our data proper. Check out the first few lines:
R
head(gapminder)
OUTPUT
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4 Afghanistan 1967 11537966 Asia 34.020 836.1971
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
6 Afghanistan 1977 14880372 Asia 38.438 786.1134
To check the last few lines it’s relatively simple as R already has a function for this:
R
tail(gapminder)
tail(gapminder, n = 15)
What about a few arbitrary rows just for sanity (or insanity depending on your view)?
There are several ways to achieve this.
The solution here presents one form using nested functions. i.e. a function passed as an argument to another function. This might sound like a new concept but you are already using it in fact.
Remember my_dataframe[rows, cols]
will print to screen
your data frame with the number of rows and columns you asked for
(although you might have asked for a range or named columns for
example). How would you get the last row if you don’t know how many rows
your data frame has? R has a function for this. What about getting a
(pseudorandom) sample? R also has a function for this.
R
gapminder[sample(nrow(gapminder), 5), ]
Challenge 2
Read the output of str(gapminder)
again; this time, use
what you’ve learned about factors and vectors, as well as the output of
functions like colnames
and dim
to explain
what everything that str
prints out for
gapminder
means. If there are any parts you can’t
interpret, discuss with your neighbors!
The object gapminder
is a data frame with columns
-
country
andcontinent
are character vectors. -
year
is an integer vector. -
pop
,lifeExp
, andgdpPercap
are numeric vectors.
Adding columns and rows in data frames
We would like to create a new column to hold information on whether the life expectancy is below the world average life expectancy (70.5) or above:
R
below_average <- gapminder$lifeExp < 70.5
head(below_average)
OUTPUT
[1] TRUE TRUE TRUE TRUE TRUE TRUE
We can then add this as a column via:
R
cbind(gapminder, below_average)
Note that if we tried to add a vector of below_average
with a different number of entries than the number of rows in the
dataframe, it would fail:
R
below_average <- c(TRUE, TRUE, TRUE, TRUE, TRUE)
cbind(gapminder, below_average)
ERROR
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 1704, 5
Why didn’t this work? R wants to see one element in our new column for every row in the table:
R
nrow(gapminder)
OUTPUT
[1] 1704
R
length(below_average)
OUTPUT
[1] 5
So for it to work we need either to have nrow(gapminder)
= length(below_average)
or nrow(gapminder)
to
be a multiple of length(below_average)
:
R
below_average <- c(TRUE, TRUE, FALSE)
head(cbind(gapminder, below_average))
OUTPUT
country year pop continent lifeExp gdpPercap below_average
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 TRUE
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 TRUE
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 FALSE
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 TRUE
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 TRUE
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 FALSE
The sequence TRUE,TRUE,FALSE
is repeated over all the
gapminder rows.
Let’s overwrite the content of gapminder with our new data frame. We
can do this in two ways: using cbind
and overwriting the
gapminder
object, or by identifying and adding a new
column.
R
below_average <- gapminder$lifeExp<70.5
# Option 1 - use cbind and overwrite the object
gapminder <- cbind(gapminder, below_average)
# Option 2 - identifying and adding a new column
gapminder$below_average <- below_average
Now how about adding rows? The rows of a data frame are lists, so we can add a new row by either specifying a new list, or a new data frame.
R
# Option 1 - create a new list
new_row <- list(country = 'Norway',
year = 2016,
pop = 5000000,
continent = 'Nordic',
lifeExp = 80.3,
gdpPercap = 49400.0,
below_average = FALSE)
# Option 2 - perhaps more intuitive, create a new data frame
new_row <- data.frame(country = 'Norway',
year = 2016,
pop = 5000000,
continent = 'Nordic',
lifeExp = 80.3,
gdpPercap = 49400.0,
below_average = FALSE)
gapminder_norway <- rbind(gapminder, new_row)
tail(gapminder_norway)
OUTPUT
country year pop continent lifeExp gdpPercap below_average
1700 Zimbabwe 1987 9216418 Africa 62.351 706.1573 TRUE
1701 Zimbabwe 1992 10704340 Africa 60.377 693.4208 TRUE
1702 Zimbabwe 1997 11404948 Africa 46.809 792.4500 TRUE
1703 Zimbabwe 2002 11926563 Africa 39.989 672.0386 TRUE
1704 Zimbabwe 2007 12311143 Africa 43.487 469.7093 TRUE
1705 Norway 2016 5000000 Nordic 80.300 49400.0000 FALSE
Factors
Here is another thing to look out for: in a factor
, each
different value represents what is called a level
. Let’s
alter our gapminder data so that the “continent” column is a factor.
R
gapminder$continent <- factor(gapminder$continent)
head(gapminder$continent)
OUTPUT
[1] Asia Asia Asia Asia Asia Asia
Levels: Africa Americas Asia Europe Oceania
Now the factor
“continent” has 5 levels: “Africa”,
“Americas”, “Asia”, “Europe” and “Oceania”. R will only accept values
that match one of the levels. If you add a new value, it will become
NA
.
R
levels(gapminder$continent)
OUTPUT
[1] "Africa" "Americas" "Asia" "Europe" "Oceania"
R
levels(gapminder$continent) <- c(levels(gapminder$continent), "Nordic")
gapminder_norway <- rbind(gapminder, new_row)
tail(gapminder_norway)
OUTPUT
country year pop continent lifeExp gdpPercap below_average
1700 Zimbabwe 1987 9216418 Africa 62.351 706.1573 TRUE
1701 Zimbabwe 1992 10704340 Africa 60.377 693.4208 TRUE
1702 Zimbabwe 1997 11404948 Africa 46.809 792.4500 TRUE
1703 Zimbabwe 2002 11926563 Africa 39.989 672.0386 TRUE
1704 Zimbabwe 2007 12311143 Africa 43.487 469.7093 TRUE
1705 Norway 2016 5000000 Nordic 80.300 49400.0000 FALSE
Alternatively, we can change a factor into a character vector; we lose the handy categories of the factor, but we can subsequently add any word we want to the column without babysitting the factor levels:
R
str(gapminder)
OUTPUT
'data.frame': 1704 obs. of 7 variables:
$ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent : Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap : num 779 821 853 836 740 ...
$ below_average: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
R
gapminder$continent <- as.character(gapminder$continent)
str(gapminder)
OUTPUT
'data.frame': 1704 obs. of 7 variables:
$ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent : chr "Asia" "Asia" "Asia" "Asia" ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap : num 779 821 853 836 740 ...
$ below_average: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
Appending to a data frame
The key to remember when adding data to a data frame is that
columns are vectors and rows are lists. We can also glue two
data frames together with rbind
:
R
gapminder <- rbind(gapminder, gapminder)
tail(gapminder, n=3)
OUTPUT
country year pop continent lifeExp gdpPercap below_average
3406 Zimbabwe 1997 11404948 Africa 46.809 792.4500 TRUE
3407 Zimbabwe 2002 11926563 Africa 39.989 672.0386 TRUE
3408 Zimbabwe 2007 12311143 Africa 43.487 469.7093 TRUE
Challenge 3
You can create a new data frame right from within R with the following syntax:
R
df <- data.frame(id = c("a", "b", "c"),
x = 1:3,
y = c(TRUE, TRUE, FALSE))
Make a data frame that holds the following information for yourself:
- first name
- last name
- lucky number
Then use rbind
to add an entry for the people sitting
beside you. Finally, use cbind
to add a column with each
person’s answer to the question, “Is it time for coffee break?”
R
df <- data.frame(first = c("Grace"),
last = c("Hopper"),
lucky_number = c(0))
df <- rbind(df, list("Marie", "Curie", 238) )
df <- cbind(df, coffeetime = c(TRUE, TRUE))
Key Points
- Use
cbind()
to add a new column to a data frame. - Use
rbind()
to add a new row to a data frame. - Use
levels()
andas.character()
to explore and manipulate factors. - Use
str()
,nrow()
,ncol()
,dim()
,colnames()
,rownames()
,head()
, andtail()
to understand the structure of a data frame. - Read in a csv file using
read.csv()
. - Understand what
length()
of a data frame represents.
Content from Subsetting Data
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I work with subsets of data in R?
Objectives
- To be able to subset vectors and data frames
- To be able to extract individual and multiple elements: by index, by name, using comparison operations
- To be able to skip and remove elements from various data structures.
R has many powerful subset operators. Mastering them will allow you to easily perform complex operations on any kind of dataset.
There are six different ways we can subset any kind of object, and three different subsetting operators for the different data structures.
Let’s start with the workhorse of R: a simple numeric vector.
R
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
x
OUTPUT
a b c d e
5.4 6.2 7.1 4.8 7.5
So now that we’ve created a dummy vector to play with, how do we get at its contents?
Accessing elements using their indices
To extract elements of a vector we can give their corresponding index, starting from one:
R
x[1]
OUTPUT
a
5.4
It may look different, but the square brackets operator is a function. For vectors (and matrices), it means “get me the nth element”.
We can ask for multiple elements at once:
R
x[c(1, 3)]
OUTPUT
a c
5.4 7.1
Or slices of the vector:
R
x[1:4]
OUTPUT
a b c d
5.4 6.2 7.1 4.8
the :
operator creates a sequence of numbers from the
left element to the right.
R
1:4
OUTPUT
[1] 1 2 3 4
R
c(1, 2, 3, 4)
OUTPUT
[1] 1 2 3 4
We can ask for the same element multiple times:
R
x[c(1, 1, 3)]
OUTPUT
a a c
5.4 5.4 7.1
If we ask for an index beyond the length of the vector, R will return a missing value:
R
x[6]
OUTPUT
<NA>
NA
This is a vector of length one containing an NA
, whose
name is also NA
.
If we ask for the 0th element, we get an empty vector:
R
x[0]
OUTPUT
named numeric(0)
Skipping and removing elements
If we use a negative number as the index of a vector, R will return every element except for the one specified:
R
x[-2]
OUTPUT
a c d e
5.4 7.1 4.8 7.5
We can skip multiple elements:
R
x[c(-1, -5)] # or x[-c(1,5)]
OUTPUT
b c d
6.2 7.1 4.8
Tip: Order of operations
A common trip up for novices occurs when trying to skip slices of a vector. It’s natural to to try to negate a sequence like so:
R
x[-1:3]
This gives a somewhat cryptic error:
ERROR
Error in x[-1:3]: only 0's may be mixed with negative subscripts
But remember the order of operations. :
is really a
function. It takes its first argument as -1, and its second as 3, so
generates the sequence of numbers: c(-1, 0, 1, 2, 3)
.
The correct solution is to wrap that function call in brackets, so
that the -
operator applies to the result:
R
x[-(1:3)]
OUTPUT
d e
4.8 7.5
To remove elements from a vector, we need to assign the result back into the variable:
R
x <- x[-4]
x
OUTPUT
a b c e
5.4 6.2 7.1 7.5
Challenge 1
Given the following code:
R
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)
OUTPUT
a b c d e
5.4 6.2 7.1 4.8 7.5
Come up with at least 3 different commands that will produce the following output:
OUTPUT
b c d
6.2 7.1 4.8
After you find 3 different commands, compare notes with your neighbour. Did you have different strategies?
R
x[2:4]
OUTPUT
b c d
6.2 7.1 4.8
R
x[-c(1,5)]
OUTPUT
b c d
6.2 7.1 4.8
R
x[c("b", "c", "d")]
OUTPUT
b c d
6.2 7.1 4.8
R
x[c(2,3,4)]
OUTPUT
b c d
6.2 7.1 4.8
Subsetting by name
We can extract elements by using their name, instead of extracting by index:
R
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5) # we can name a vector 'on the fly'
x[c("a", "c")]
OUTPUT
a c
5.4 7.1
This is usually a much more reliable way to subset objects: the position of various elements can often change when chaining together subsetting operations, but the names will always remain the same!
Subsetting through other logical operations
We can also use any logical vector to subset. Since comparison
operators (e.g. >
, <
, ==
)
evaluate to logical vectors, we can also use them to succinctly subset
vectors: the following statement gives the same result as the previous
one.
R
x[x > 7]
OUTPUT
c e
7.1 7.5
Breaking it down, this statement first evaluates x>7
,
generating a logical vector
c(FALSE, FALSE, TRUE, FALSE, TRUE)
, and then selects the
elements of x
corresponding to the TRUE
values.
We can use ==
to mimic the previous method of indexing
by name (remember you have to use ==
rather than
=
for comparisons):
R
x[names(x) == "a"]
OUTPUT
a
5.4
We can also evaluate multiple logical criteria. For example, if we
want all values of x that are greater than 5 and less than 7;
we can use &
to denote “and”:
R
x[x > 5 & x < 7]
OUTPUT
a b
5.4 6.2
Or, maybe we want the values less than 5 or more than 7; we can use ‘|’ to denote “or”:
R
x[x < 5 | x > 7]
OUTPUT
c d e
7.1 4.8 7.5
Finally, we can evaluate whether variables are present in another
vector by using the %in%
command. For example, if we wanted
to grab all values of x with the names “a”, “c”, and “d” we could do
this:
R
x[names(x) %in% c("a", "c", "d")]
OUTPUT
a c d
5.4 7.1 4.8
which is just a simpler way of writing it like this:
R
x[names(x) == "a" | names(x) == "c" | names(x) == "d"]
OUTPUT
a c d
5.4 7.1 4.8
We also might want to grab the values that are not equal to
something. For this we can pair the !
symbol with the
=
(equal to), >
(greater than),
<
(less than), >=
(greater than or equal
to), <=
(less than or equal to) symbols to say we want
the opposite answer. For example, this will return all values
in x
that do not have the name a
:
R
x[names(x) != "a"]
OUTPUT
b c d e
6.2 7.1 4.8 7.5
R
x_subset <- x[x<7 & x>4]
print(x_subset)
OUTPUT
a b d
5.4 6.2 4.8
Handling special values
At some point you will encounter functions in R that cannot handle missing, infinite, or undefined data.
There are a number of special functions you can use to filter out this data:
-
is.na
will return all positions in a vector, matrix, or data frame containingNA
(orNaN
) - likewise,
is.nan
, andis.infinite
will do the same forNaN
andInf
. -
is.finite
will return all positions in a vector, matrix, or data.frame that do not containNA
,NaN
orInf
. -
na.omit
will filter out all missing values from a vector
Data frames
Remember the data frames are lists underneath the hood, so similar rules apply. However they are also two dimensional objects:
[]
with one argument will act the same way as for lists,
where each list element corresponds to a column. The resulting object
will be a data frame:
R
# Read in gapminder data if you don't have it already
gapminder <- read.csv("data/gapminder_data.csv")
head(gapminder[3])
OUTPUT
pop
1 8425333
2 9240934
3 10267083
4 11537966
5 13079460
6 14880372
Similarly, [[]]
will act to extract the values
from a single column:
R
head(gapminder[["lifeExp"]])
OUTPUT
[1] 28.801 30.332 31.997 34.020 36.088 38.438
And $
provides a convenient shorthand to extract columns
by name:
R
head(gapminder$year)
OUTPUT
[1] 1952 1957 1962 1967 1972 1977
To select specific rows and/or columns, you can provide two arguments
to []
. Remember that the way to index data frames is always
object[row(s), column(s)]
.
R
gapminder[1:3, ]
OUTPUT
country year pop continent lifeExp gdpPercap
1 Afghanistan 1952 8425333 Asia 28.801 779.4453
2 Afghanistan 1957 9240934 Asia 30.332 820.8530
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
If we subset a single row, the result will be a data frame (because the elements are mixed types):
R
gapminder[3, ]
OUTPUT
country year pop continent lifeExp gdpPercap
3 Afghanistan 1962 10267083 Asia 31.997 853.1007
But for a single column the result will be a vector (this can be
changed with the third argument, drop = FALSE
).
Challenge 3
Fix each of the following common data frame subsetting errors:
- Extract observations collected for the year 1957
- Extract all columns except 1 through to 4
R
gapminder[, -1:4]
- Extract the rows where the life expectancy is longer the 80 years
R
gapminder[gapminder$lifeExp > 80]
- Extract the first row, and the fourth and fifth columns
(
lifeExp
andgdpPercap
).
R
gapminder[1, 4, 5]
- Advanced: extract rows that contain information for the years 2002 and 2007
R
gapminder[gapminder$year == 2002 | 2007,]
Fix each of the following common data frame subsetting errors:
- Extract observations collected for the year 1957
R
# gapminder[gapminder$year = 1957, ]
gapminder[gapminder$year == 1957, ]
- Extract all columns except 1 through to 4
R
# gapminder[, -1:4]
gapminder[,-c(1:4)]
- Extract the rows where the life expectancy is longer the 80 years
R
# gapminder[gapminder$lifeExp > 80]
gapminder[gapminder$lifeExp > 80,]
- Extract the first row, and the fourth and fifth columns
(
lifeExp
andgdpPercap
).
R
# gapminder[1, 4, 5]
gapminder[1, c(4, 5)]
- Advanced: extract rows that contain information for the years 2002 and 2007
R
# gapminder[gapminder$year == 2002 | 2007,]
gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
gapminder[gapminder$year %in% c(2002, 2007),] # another, more advanced way
gapminder
is a data.frame so it needs to be subsetted on two dimensions.gapminder[1:20, ]
subsets the data to give the first 20 rows and all columns.
R
gapminder_small <- gapminder[c(1:9, 19:23),]
Content from Data frame Manipulation with dplyr
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I manipulate dataframes without repeating myself?
Objectives
- To be able to use the six main dataframe manipulation ‘verbs’ with
pipes in
dplyr
. - To understand how
group_by()
andsummarize()
can be combined to summarize datasets. - Be able to analyze a subset of data using logical filtering.
Manipulation of dataframes means many things to many researchers, we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:
R
gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
OUTPUT
[1] 2193.755
R
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
OUTPUT
[1] 7136.11
R
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
OUTPUT
[1] 7902.15
But this isn’t very efficient, and can become tedious quickly because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.
The dplyr
package
Luckily, the dplyr
package
provides a number of very useful functions for manipulating dataframes
in a way that will reduce the above repetition, reduce the probability
of making errors, and probably even save you some typing. As an added
bonus, you might even find the dplyr
grammar easier to
read.
Here we’re going to cover 6 of the most commonly used functions as
well as using pipes (%>%
) to combine them.
select()
filter()
group_by()
summarize()
-
count()
andn()
mutate()
If you have have not installed this package earlier, please do so:
R
install.packages('dplyr')
Now let’s load the package:
R
library("dplyr")
Using select()
If, for example, we wanted to move forward with only a few of the
variables in our dataframe we could use the select()
function. This will keep only the variables you select.
R
year_country_gdp <- select(gapminder, year, country, gdpPercap)
If we open up year_country_gdp
we’ll see that it only
contains the year, country and gdpPercap. Above we used ‘normal’
grammar, but the strengths of dplyr
lie in combining
several functions using pipes. Since the pipes grammar is unlike
anything we’ve seen in R before, let’s repeat what we’ve done above
using pipes.
R
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
To help you understand why we wrote that in that way, let’s walk
through it step by step. First we summon the gapminder
data
frame and pass it on, using the pipe symbol %>%
, to the
next step, which is the select()
function. In this case we
don’t specify which data object we use in the select()
function since in gets that from the previous pipe. Fun
Fact: You may have encountered pipes before in the shell. In R,
a pipe symbol is %>%
while in the shell it is
|
but the concept is the same!
Using filter()
If we now wanted to move forward with the above, but only with
European countries, we can combine select
and
filter
R
year_country_gdp_euro <- gapminder %>%
filter(continent == "Europe") %>%
select(year, country, gdpPercap)
R
year_country_lifeExp_Africa <- gapminder %>%
filter(continent=="Africa") %>%
select(year,country,lifeExp)
As with last time, first we pass the gapminder dataframe to the
filter()
function, then we pass the filtered version of the
gapminder data frame to the select()
function.
Note: The order of operations is very important in this
case. If we used ‘select’ first, filter would not be able to find the
variable continent since we would have removed it in the previous
step.
Using group_by()
and summarize()
Now, we were supposed to be reducing the error prone repetitiveness
of what can be done with base R, but up to now we haven’t done that
since we would have to repeat the above for each continent. Instead of
filter()
, which will only pass observations that meet your
criteria (in the above: continent=="Europe"
), we can use
group_by()
, which will essentially use every unique
criteria that you could have used in filter.
R
str(gapminder)
OUTPUT
'data.frame': 1704 obs. of 6 variables:
$ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: chr "Asia" "Asia" "Asia" "Asia" ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
R
gapminder %>% group_by(continent) %>% str()
OUTPUT
gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
$ country : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
$ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
- attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
..$ .rows : list<int> [1:5]
.. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
.. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
.. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
.. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
You will notice that the structure of the dataframe where we used
group_by()
(grouped_df
) is not the same as the
original gapminder
(data.frame
). A
grouped_df
can be thought of as a list
where
each item in the list
is a data.frame
which
contains only the rows that correspond to the a particular value
continent
(at least in the example above).
Using summarize()
The above was a bit on the uneventful side but
group_by()
is much more exciting in conjunction with
summarize()
. This will allow us to create new variable(s)
by using functions that repeat for each of the continent-specific data
frames. That is to say, using the group_by()
function, we
split our original dataframe into multiple pieces, then we can run
functions (e.g. mean()
or sd()
) within
summarize()
. To end our group_by()
, we need to
close off the pipe by using ungroup()
.
R
gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap = mean(gdpPercap)) %>%
ungroup()
gdp_bycontinents
OUTPUT
# A tibble: 5 × 2
continent mean_gdpPercap
<chr> <dbl>
1 Africa 2194.
2 Americas 7136.
3 Asia 7902.
4 Europe 14469.
5 Oceania 18622.
That allowed us to calculate the mean gdpPercap for each continent, but it gets even better.
R
lifeExp_bycountry <- gapminder %>%
group_by(country) %>%
summarize(mean_lifeExp=mean(lifeExp)) %>%
ungroup()
lifeExp_bycountry %>%
filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
OUTPUT
# A tibble: 2 × 2
country mean_lifeExp
<chr> <dbl>
1 Iceland 76.5
2 Sierra Leone 36.8
Another way to do this is to use the dplyr
function
arrange()
, which arranges the rows in a data frame
according to the order of one or more variables from the data frame. It
has similar syntax to other functions from the dplyr
package. You can use desc()
inside arrange()
to sort in descending order.
R
lifeExp_bycountry %>%
arrange(mean_lifeExp) %>%
head(1)
OUTPUT
# A tibble: 1 × 2
country mean_lifeExp
<chr> <dbl>
1 Sierra Leone 36.8
R
lifeExp_bycountry %>%
arrange(desc(mean_lifeExp)) %>%
head(1)
OUTPUT
# A tibble: 1 × 2
country mean_lifeExp
<chr> <dbl>
1 Iceland 76.5
The function group_by()
allows us to group by multiple
variables. Let’s group by year
and
continent
.
R
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap)) %>%
ungroup()
That is already quite powerful, but it gets even better! You’re not
limited to defining 1 new variable in summarize()
.
R
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop)) %>%
ungroup()
count()
and n()
A very common operation is to count the number of observations for
each group. The dplyr
package comes with two related
functions that help with this.
For instance, if we wanted to check the number of countries included
in the dataset for the year 2002, we can use the count()
function. It takes the name of one or more columns that contain the
groups we are interested in, and we can optionally sort the results in
descending order by adding sort=TRUE
:
R
gapminder %>%
filter(year == 2002) %>%
count(continent, sort = TRUE)
OUTPUT
continent n
1 Africa 52
2 Asia 33
3 Europe 30
4 Americas 25
5 Oceania 2
If we need to use the number of observations in calculations, the
n()
function is useful. For instance, if we wanted to get
the standard error of the life expectancy per continent:
R
gapminder %>%
group_by(continent) %>%
summarize(sd_le = sd(lifeExp), # standard deviation
n_le = n(), # number of observations
se_le = sd_le/sqrt(n_le)) %>% # standard error
ungroup()
OUTPUT
# A tibble: 5 × 4
continent sd_le n_le se_le
<chr> <dbl> <int> <dbl>
1 Africa 9.15 624 0.366
2 Americas 9.35 300 0.540
3 Asia 11.9 396 0.596
4 Europe 5.43 360 0.286
5 Oceania 3.80 24 0.775
In the above example, we chained together several summary operations
to simplify our calculation of standard error. We can also chain
operations to calculate the minimum
, maximum
,
mean
and se
of each continent’s per-country
life-expectancy. This time, we will calculate the se
in
just one line:
R
gapminder %>%
group_by(continent) %>%
summarize(
mean_le = mean(lifeExp),
min_le = min(lifeExp),
max_le = max(lifeExp),
se_le = sd(lifeExp)/sqrt(n())) %>%
ungroup()
OUTPUT
# A tibble: 5 × 5
continent mean_le min_le max_le se_le
<chr> <dbl> <dbl> <dbl> <dbl>
1 Africa 48.9 23.6 76.4 0.366
2 Americas 64.7 37.6 80.7 0.540
3 Asia 60.1 28.8 82.6 0.596
4 Europe 71.9 43.6 81.8 0.286
5 Oceania 74.3 69.1 81.2 0.775
Using mutate()
We can also create new variables prior to (or even after) summarizing
information using mutate()
.
R
gdp_pop_billion <- gapminder %>%
mutate(gdp_billion = gdpPercap*pop/10^9)
head(gdp_pop_billion)
OUTPUT
country year pop continent lifeExp gdpPercap gdp_billion
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 6.567086
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7.585449
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 8.758856
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 9.648014
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 9.678553
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 11.697659
Other great resources
- R for Data Science
- Data Wrangling Cheat sheet
- Introduction to dplyr
- Data wrangling with R and RStudio
Key Points
- Use the
dplyr
package to manipulate dataframes. - Use
select()
to choose variables from a dataframe. - Use
filter()
to choose data based on values. - Use
group_by()
andsummarize()
to work with subsets of data. - Use
count()
andn()
to obtain the number of observations in columns. - Use
mutate()
to create new variables.
Content from Introduction to Visualization
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- What are the basics of creating graphics in R?
Objectives
- To be able to use ggplot2 to generate histograms and bar plots.
- To apply geometry and aesthetic layers to a ggplot plot.
- To manipulate the aesthetics of a plot using different colors and position parameters.
Plotting our data is one of the best ways to quickly explore it and the various relationships between variables. There are three main plotting systems in R, the base plotting system, the lattice package, and the ggplot2 package. In this lesson we’ll be learning about the ggplot2 package, because it is the most effective for creating publication quality graphics. In this episode, we will introduce the key features of a ggplot and make a few example plots. We will expand on these concepts and see how they apply to geospatial data types when we start working with geospatial data in the R for Raster and Vector Data lesson.
ggplot2
is built on the grammar of graphics, the idea
that any plot can be expressed from the same set of components: a
data set, a coordinate system, and a
set of geoms--the visual representation of data points.
The key to understanding ggplot2
is thinking about a figure
in layers. This idea may be familiar to you if you have used image
editing programs like Photoshop, Illustrator, or Inkscape. In this
episode we will focus on two geoms
- histograms
- bar plots
Let’s start by loading in the appropriate packages and data. Remember that we will have to load libraries and data at the start of each R script.
R
library(dplyr)
library(ggplot2)
gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)
For our first example, we will be plotting the distribution of life
expectancy in our dataset. The first thing we do is call the
ggplot
function. This function lets R know that we’re
creating a new plot, and any of the arguments we give the
ggplot()
function are the global options for the plot: they
apply to all layers on the plot.
We will pass in two arguments to ggplot
. First, we tell
ggplot
what data we want to show on our figure, in this
example we use the gapminder data we read in earlier. For the second
argument we pass in the mapping = aes()
function, which
tells ggplot
how variables in the data map to aesthetic
properties of the figure. Here we will tell ggplot
we want
to plot the “lifeExp” column of the gapminder data frame on the x-axis.
We don’t need to specify a y-axis for histograms.
R
ggplot(data = gapminder, mapping = aes(x = lifeExp)) +
geom_histogram()
By itself, the call to ggplot
isn’t enough to draw a
figure:
R
ggplot(data = gapminder, mapping = aes(x = lifeExp))
We need to tell ggplot
how we want to visually represent
the data, which we do by adding a geom layer. In our example, we used
geom_histogram()
, which tells ggplot
we want
to visually represent the distribution of one variable (in our case
“lifeExp”):
R
ggplot(data = gapminder, mapping = aes(x = lifeExp)) +
geom_histogram()
OUTPUT
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap)) +
geom_histogram()
OUTPUT
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram is a useful tool for visualizing the distribution of a single categorical variable. What if we want to compare the gdp per capita of the countries in our dataset? We can use a bar (or column) plot. To simplify our plot, let’s look at data only from the most recent year and only from countries in the Americas.
R
gapminder_small <- gapminder %>%
filter(year == 2007 & continent == "Americas")
This time, we will use the geom_col()
function as our
geometry. We will plot countries on the x-axis (listed in alphabetic
order by default) and gdp per capita on the y-axis.
R
ggplot(data = gapminder_small, mapping = aes(x = country, y = gdpPercap)) +
geom_col()
With this many bars plotted, it’s impossible to read all of the
x-axis labels. A quick fix to this is the add the
coord_flip()
function to the end of our plot code.
R
ggplot(data = gapminder_small, mapping = aes(x = country, y = gdpPercap)) +
geom_col() +
coord_flip()
There are more sophisticated ways of modifying axis labels. We will be learning some of those methods later in this workshop.
Challenge 2
In the previous examples and challenge we’ve used the
aes
function to tell the geom_histogram()
and
geom_col()
functions which columns of the data set to plot.
Another aesthetic property we can modify is the color. Create a new bar
(column) plot showing the gdp per capita of all countries in the
Americas for the years 1952 and 2007, color coded by year.
First we create a new object with our filtered data:
R
gapminder_small_2 <- gapminder %>%
filter(continent == "Americas",
year %in% c(1952, 2007))
Then we plot that data using the geom_col()
geom
function. We color bars using the fill
parameter within the
aes()
function. Since there are multiple bars for each
country, we use the position
parameter to “dodge” them so
they appear side-by-side. The default behavior for postion
in geom_col()
is “stack”.
R
ggplot(gapminder_small_2,
mapping = aes(x = country, y = gdpPercap,
fill = as.factor(year))) +
geom_col(position = "dodge") +
coord_flip()
The examples given here are just the start of creating complex and beautiful graphics with R. In a later lesson we will go into much more depth, including:
- plotting geospatial specific data types
- adjusting the color scheme of our plots
- setting and formatting plot titles, subtitles, and axis labels
- creating multi-panel plots
- creating point (scatter) and line plots
- layering datasets to create multi-layered plots
- creating and customizing a plot legend
- and much more!
The examples we’ve worked through in this episode should give you the building blocks for working with the more complex graphic types and customizations we will be working with in that lesson.
Content from Writing Data
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I save plots and data created in R?
Objectives
- To be able to write out plots and data from R.
First, let’s load in all relevant libraries and data to be used in this lesson:
R
library(ggplot2)
library(dplyr)
gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)
We also need to create a cleaned-data
folder within the
data
folder and a figures
folder within the
main project folder. We can do this manually or using code:
R
dir.create("data/cleaned-data")
dir.create("figures")
Saving plots
You can save a plot from within RStudio using the ‘Export’ button in the ‘Plot’ window. This will give you the option of saving as a .pdf or as .png, .jpg or other image formats.
Sometimes you will want to save plots without creating them in the
‘Plot’ window first. Perhaps you want to make a pdf document, for
example. Or perhaps you’re looping through multiple subsets of a file,
plotting data from each subset, and you want to save each plot. In this
case you can use flexible approach. The ggsave
function
saves the latest plot by default. You can control the size and
resolution using the arguments to this function.
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap)) +
geom_histogram()
ggsave("figures/Distribution-of-gdpPercap.pdf", width=12, height=4)
Open up this document and have a look.
R
gapminder_small <- gapminder %>%
filter(continent == "Americas" & year %in% c(1952, 2007))
ggplot(data = gapminder_small,
mapping = aes(x = country, y = gdpPercap, fill = as.factor(year))) +
geom_col(position = "dodge") +
coord_flip()
# Note that ggsave saves by default the latest plot.
ggsave("figures/Distribution-of-gdpPercap.pdf", width = 12, height = 4)
To produce documents in different formats, change the file extension
for jpeg
, png
, tiff
, or
bmp
.
Writing data
At some point, you’ll also want to write out data from R.
We can use the write.csv
function for this, which is
very similar to read.csv
from before.
Let’s create a data-cleaning script, for this analysis, we only want to focus on the gapminder data for Australia:
R
aust_subset <- gapminder %>%
filter(country == "Australia")
write.csv(aust_subset,
file="data/cleaned-data/gapminder-aus.csv"
)
Let’s open the file to make sure it contains the data we expect.
Navigate to your cleaned-data
directory and double-click
the file name. It will open using your computer’s default for opening
files with a .csv
extension. To open in a specific
application, right click and select the application. Using a spreadsheet
program (like Excel) to open this file shows us that we do have properly
formatted data including only the data points from Australia. However,
there are row numbers associated with the data that are not useful to us
(they refer to the row numbers from the gapminder data frame).
Let’s look at the help file to work out how to change this behaviour.
R
?write.csv
By default R will write out the row and column names when writing data to a file. To over write this behavior, we can do the following:
R
write.csv(
aust_subset,
file="data/cleaned-data/gapminder-aus.csv",
row.names=FALSE
)
R
gapminder_after_1990 <- gapminder %>%
filter(year > 1990)
write.csv(gapminder_after_1990,
file = "cleaned-data/gapminder-after-1990.csv",
row.names = FALSE)
Content from Intro to Raster Data
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- What is a raster dataset?
- How do I work with and plot raster data in R?
- How can I handle missing or bad data values for a raster?
Objectives
- Describe the fundamental attributes of a raster dataset.
- Explore raster attributes and metadata using R.
- Import rasters into R using the
terra
package. - Plot a raster file in R using the
ggplot2
package. - Describe the difference between single- and multi-band rasters.
In this episode, we will introduce the fundamental principles, packages and metadata/raster attributes that are needed to work with raster data in R. We will discuss some of the core metadata elements that we need to understand to work with rasters in R, including CRS and resolution. We will also explore missing and bad data values as stored in a raster and how R handles these elements.
We will continue to work with the dplyr
and
ggplot2
packages. We will use one additional package in
this episode to work with raster data - the terra
package.
Make sure that you have all packages loaded.
R
library(terra)
library(ggplot2)
library(dplyr)
In this episode, we’ll be working with raster data collected from the Harvard Forest site. These data were collected using remote sensing (LiDAR) and include estimates of a canopy height model, digital elevation model, and digital surface model of the field site. Today we will be focusing on the digital surface model.
View Raster File Attributes
We will be working with a series of GeoTIFF files in this lesson. The
GeoTIFF format contains a set of embedded tags with metadata about the
raster data. We can use the function describe()
in the
terra
package to get information about our raster data
before we read that data into R. It is ideal to do this before importing
your data.
R
describe("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
OUTPUT
[1] "Driver: GTiff/GeoTIFF"
[2] "Files: data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"
[3] "Size is 1697, 1367"
[4] "Coordinate System is:"
[5] "PROJCRS[\"WGS 84 / UTM zone 18N\","
[6] " BASEGEOGCRS[\"WGS 84\","
[7] " DATUM[\"World Geodetic System 1984\","
[8] " ELLIPSOID[\"WGS 84\",6378137,298.257223563,"
[9] " LENGTHUNIT[\"metre\",1]]],"
[10] " PRIMEM[\"Greenwich\",0,"
[11] " ANGLEUNIT[\"degree\",0.0174532925199433]],"
[12] " ID[\"EPSG\",4326]],"
[13] " CONVERSION[\"UTM zone 18N\","
[14] " METHOD[\"Transverse Mercator\","
[15] " ID[\"EPSG\",9807]],"
[16] " PARAMETER[\"Latitude of natural origin\",0,"
[17] " ANGLEUNIT[\"degree\",0.0174532925199433],"
[18] " ID[\"EPSG\",8801]],"
[19] " PARAMETER[\"Longitude of natural origin\",-75,"
[20] " ANGLEUNIT[\"degree\",0.0174532925199433],"
[21] " ID[\"EPSG\",8802]],"
[22] " PARAMETER[\"Scale factor at natural origin\",0.9996,"
[23] " SCALEUNIT[\"unity\",1],"
[24] " ID[\"EPSG\",8805]],"
[25] " PARAMETER[\"False easting\",500000,"
[26] " LENGTHUNIT[\"metre\",1],"
[27] " ID[\"EPSG\",8806]],"
[28] " PARAMETER[\"False northing\",0,"
[29] " LENGTHUNIT[\"metre\",1],"
[30] " ID[\"EPSG\",8807]]],"
[31] " CS[Cartesian,2],"
[32] " AXIS[\"(E)\",east,"
[33] " ORDER[1],"
[34] " LENGTHUNIT[\"metre\",1]],"
[35] " AXIS[\"(N)\",north,"
[36] " ORDER[2],"
[37] " LENGTHUNIT[\"metre\",1]],"
[38] " USAGE["
[39] " SCOPE[\"Engineering survey, topographic mapping.\"],"
[40] " AREA[\"Between 78°W and 72°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Canada - Nunavut; Ontario; Quebec. Colombia. Cuba. Ecuador. Greenland. Haiti. Jamaica. Panama. Turks and Caicos Islands. United States (USA). Venezuela.\"],"
[41] " BBOX[0,-78,84,-72]],"
[42] " ID[\"EPSG\",32618]]"
[43] "Data axis to CRS axis mapping: 1,2"
[44] "Origin = (731453.000000000000000,4713838.000000000000000)"
[45] "Pixel Size = (1.000000000000000,-1.000000000000000)"
[46] "Metadata:"
[47] " AREA_OR_POINT=Area"
[48] "Image Structure Metadata:"
[49] " COMPRESSION=LZW"
[50] " INTERLEAVE=BAND"
[51] "Corner Coordinates:"
[52] "Upper Left ( 731453.000, 4713838.000) ( 72d10'52.71\"W, 42d32'32.18\"N)"
[53] "Lower Left ( 731453.000, 4712471.000) ( 72d10'54.71\"W, 42d31'47.92\"N)"
[54] "Upper Right ( 733150.000, 4713838.000) ( 72d 9'38.40\"W, 42d32'30.35\"N)"
[55] "Lower Right ( 733150.000, 4712471.000) ( 72d 9'40.41\"W, 42d31'46.08\"N)"
[56] "Center ( 732301.500, 4713154.500) ( 72d10'16.56\"W, 42d32' 9.13\"N)"
[57] "Band 1 Block=1697x1 Type=Float64, ColorInterp=Gray"
[58] " Min=305.070 Max=416.070 "
[59] " Minimum=305.070, Maximum=416.070, Mean=359.853, StdDev=17.832"
[60] " NoData Value=-9999"
[61] " Metadata:"
[62] " STATISTICS_MAXIMUM=416.06997680664"
[63] " STATISTICS_MEAN=359.85311802914"
[64] " STATISTICS_MINIMUM=305.07000732422"
[65] " STATISTICS_STDDEV=17.83169335933"
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"
If you wish to store this information in R, you can do the following:
R
HARV_dsmCrop_info <- capture.output(
describe("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
)
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"
Each line of text that was printed to the console is now stored as an
element of the character vector HARV_dsmCrop_info
. We will
be exploring this data throughout this episode. By the end of this
episode, you will be able to explain and understand the output
above.
Open a Raster in R
Now that we’ve previewed the metadata for our GeoTIFF, let’s import
this raster dataset into R and explore its metadata more closely. We can
use the rast()
function in the terra
package
to open a raster in R.
First we will load our raster file into R and view the data structure.
R
DSM_HARV <-
rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"
DSM_HARV
OUTPUT
class : SpatRaster
dimensions : 1367, 1697, 1 (nrow, ncol, nlyr)
resolution : 1, 1 (x, y)
extent : 731453, 733150, 4712471, 4713838 (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 18N (EPSG:32618)
source : HARV_dsmCrop.tif
name : HARV_dsmCrop
min value : 305.07
max value : 416.07
The information above includes a report of min and max values, but no other data range statistics. Similar to other R data structures like vectors and data frame columns, descriptive statistics for raster data can be retrieved like
R
summary(DSM_HARV)
WARNING
Warning: [summary] used a sample
OUTPUT
HARV_dsmCrop
Min. :305.6
1st Qu.:345.6
Median :359.6
Mean :359.8
3rd Qu.:374.3
Max. :414.7
but note the warning - unless you force R to calculate these
statistics using every cell in the raster, it will take a random sample
of 100,000 cells and calculate from that instead. To force calculation
all the values, you can use the function values
:
R
summary(values(DSM_HARV))
OUTPUT
HARV_dsmCrop
Min. :305.1
1st Qu.:345.6
Median :359.7
Mean :359.9
3rd Qu.:374.3
Max. :416.1
We can visualise this data in R either using the terra
plot()
function, or by using ggplot2
. To
visualise the data using terra
, we run this line of
code:
R
plot(DSM_HARV)
To visualise this data in R using ggplot2
, we need to
convert it to a dataframe. We learned about dataframes in the Exploring
Data Frames lesson. The terra
package has an built-in
function for conversion to a plotable dataframe. Setting
xy = TRUE
maintains the spatial locations of the data, so
that we can plot the data as a map.
R
DSM_HARV_df <- as.data.frame(DSM_HARV, xy = TRUE)
Now when we view the structure of our data, we will see a standard dataframe format.
R
str(DSM_HARV_df)
OUTPUT
'data.frame': 2319799 obs. of 3 variables:
$ x : num 731454 731454 731456 731456 731458 ...
$ y : num 4713838 4713838 4713838 4713838 4713838 ...
$ HARV_dsmCrop: num 409 408 407 407 409 ...
We can use ggplot()
to plot this data. We will set the
color scale to scale_fill_viridis_c
which is a
color-blindness friendly color scale. We will also use the
coord_quickmap()
function to use an approximate Mercator
projection for our plots. This approximation is suitable for small areas
that are not too close to the poles. Other coordinate systems are
available in ggplot2 if needed, you can learn about them at their help
page ?coord_map
.
R
ggplot() +
geom_raster(data = DSM_HARV_df ,
mapping = aes(x = x, y = y, fill = HARV_dsmCrop)) +
scale_fill_viridis_c() +
coord_quickmap()
Plotting Tip
More information about the Viridis palette used above at R Viridis package documentation.
This map shows the elevation of our study site in Harvard Forest. From the legend, we can see that the maximum elevation is ~400, but we can’t tell whether this is 400 feet or 400 meters because the legend doesn’t show us the units. We can look at the metadata of our object to see what the units are. Much of the metadata that we’re interested in is part of the CRS.
Now we will see how features of the CRS appear in our data file and what meanings they have.
Understanding CRS in Proj4 Format
The CRS for our data is given to us by R in proj4
format. Let’s break down the pieces of proj4
string. The
string contains all of the individual CRS elements that R or another GIS
might need. Each element is specified with a +
sign,
similar to how a .csv
file is delimited or broken up by a
,
. After each +
we see the CRS element being
defined. For example projection (proj=
) and datum
(datum=
).
UTM Proj4 String
A projection string (like the one of DSM_HARV
) specifies
the UTM projection as follows:
+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
- proj=utm: the projection is UTM, UTM has several zones.
- zone=18: the zone is 18
- datum=WGS84: the datum is WGS84 (the datum refers to the 0,0 reference for the coordinate system used in the projection)
- units=m: the units for the coordinates are in meters
- ellps=WGS84: the ellipsoid (how the earth’s roundness is calculated) for the data is WGS84
Note that the zone is unique to the UTM projection. Not all CRSs will have a zone. Image source: Chrismurf at English Wikipedia, via Wikimedia Commons (CC-BY).
Calculate Raster Min and Max Values
It is useful to know the minimum or maximum values of a raster dataset. In this case, given we are working with elevation data, these values represent the min/max elevation range at our site.
Raster statistics are often calculated and embedded in a GeoTIFF for us. We can view these values:
R
minmax(DSM_HARV)
OUTPUT
HARV_dsmCrop
min 305.07
max 416.07
R
min(values(DSM_HARV))
OUTPUT
[1] 305.07
R
max(values(DSM_HARV))
OUTPUT
[1] 416.07
We can see that the elevation at our site ranges from 305.0700073m to 416.0699768m.
Raster Bands
The Digital Surface Model object (DSM_HARV
) that we’ve
been working with is a single band raster. This means that there is only
one dataset stored in the raster: surface elevation in meters for one
time period.
A raster dataset can contain one or more bands. We can use the
rast()
function to import one single band from a single or
multi-band raster. We can view the number of bands in a raster using the
nlyr()
function.
R
nlyr(DSM_HARV)
OUTPUT
[1] 1
However, raster data can also be multi-band, meaning that one raster
file contains data for more than one variable or time period for each
cell. By default the rast()
function only imports the first
band in a raster regardless of whether it has one or more bands.
Dealing with Missing Data
Raster data often has a NoDataValue
associated with it.
This is a value assigned to pixels where data is missing or no data were
collected.
By default the shape of a raster is always rectangular. So if we have
a dataset that has a shape that isn’t rectangular, some pixels at the
edge of the raster will have NoDataValue
s. This often
happens when the data were collected by an airplane which only flew over
some part of a defined region.
In the image below, the pixels that are black have
NoDataValue
s. The camera did not collect data in these
areas.
In the next image, the black edges have been assigned
NoDataValue
. R doesn’t render pixels that contain a
specified NoDataValue
. R assigns missing data with the
NoDataValue
as NA
.
The difference here shows up as ragged edges on the plot, rather than black spaces where there is no data.
If your raster already has NA
values set correctly but
you aren’t sure where they are, you can deliberately plot them in a
particular colour. This can be useful when checking a dataset’s
coverage. For instance, sometimes data can be missing where a sensor
could not ‘see’ its target data, and you may wish to locate that missing
data and fill it in.
To highlight NA
values in ggplot, alter the
scale_fill_*()
layer to contain a colour instruction for
NA
values, like
scale_fill_viridis_c(na.value = 'deeppink')
The value that is conventionally used to take note of missing data
(the NoDataValue
value) varies by the raster data type. For
floating-point rasters, the figure -3.4e+38
is a common
default, and for integers, -9999
is common. Some
disciplines have specific conventions that vary from these common
values.
In some cases, other NA
values may be more appropriate.
An NA
value should be a) outside the range of valid values,
and b) a value that fits the data type in use. For instance, if your
data ranges continuously from -20 to 100, 0 is not an acceptable
NA
value! Or, for categories that number 1-15, 0 might be
fine for NA
, but using -.000003 will force you to save the
GeoTIFF on disk as a floating point raster, resulting in a bigger
file.
If we are lucky, our GeoTIFF file has a tag that tells us what is the
NoDataValue
. If we are less lucky, we can find that
information in the raster’s metadata. If a NoDataValue
was
stored in the GeoTIFF tag, when R opens up the raster, it will assign
each instance of the value to NA
. Values of NA
will be ignored by R as demonstrated above.
R
describe(sources(DSM_HARV))
OUTPUT
[1] "Driver: GTiff/GeoTIFF"
[2] "Files: /Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"
[3] "Size is 1697, 1367"
[4] "Coordinate System is:"
[5] "PROJCRS[\"WGS 84 / UTM zone 18N\","
[6] " BASEGEOGCRS[\"WGS 84\","
[7] " DATUM[\"World Geodetic System 1984\","
[8] " ELLIPSOID[\"WGS 84\",6378137,298.257223563,"
[9] " LENGTHUNIT[\"metre\",1]]],"
[10] " PRIMEM[\"Greenwich\",0,"
[11] " ANGLEUNIT[\"degree\",0.0174532925199433]],"
[12] " ID[\"EPSG\",4326]],"
[13] " CONVERSION[\"UTM zone 18N\","
[14] " METHOD[\"Transverse Mercator\","
[15] " ID[\"EPSG\",9807]],"
[16] " PARAMETER[\"Latitude of natural origin\",0,"
[17] " ANGLEUNIT[\"degree\",0.0174532925199433],"
[18] " ID[\"EPSG\",8801]],"
[19] " PARAMETER[\"Longitude of natural origin\",-75,"
[20] " ANGLEUNIT[\"degree\",0.0174532925199433],"
[21] " ID[\"EPSG\",8802]],"
[22] " PARAMETER[\"Scale factor at natural origin\",0.9996,"
[23] " SCALEUNIT[\"unity\",1],"
[24] " ID[\"EPSG\",8805]],"
[25] " PARAMETER[\"False easting\",500000,"
[26] " LENGTHUNIT[\"metre\",1],"
[27] " ID[\"EPSG\",8806]],"
[28] " PARAMETER[\"False northing\",0,"
[29] " LENGTHUNIT[\"metre\",1],"
[30] " ID[\"EPSG\",8807]]],"
[31] " CS[Cartesian,2],"
[32] " AXIS[\"(E)\",east,"
[33] " ORDER[1],"
[34] " LENGTHUNIT[\"metre\",1]],"
[35] " AXIS[\"(N)\",north,"
[36] " ORDER[2],"
[37] " LENGTHUNIT[\"metre\",1]],"
[38] " USAGE["
[39] " SCOPE[\"Engineering survey, topographic mapping.\"],"
[40] " AREA[\"Between 78°W and 72°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Canada - Nunavut; Ontario; Quebec. Colombia. Cuba. Ecuador. Greenland. Haiti. Jamaica. Panama. Turks and Caicos Islands. United States (USA). Venezuela.\"],"
[41] " BBOX[0,-78,84,-72]],"
[42] " ID[\"EPSG\",32618]]"
[43] "Data axis to CRS axis mapping: 1,2"
[44] "Origin = (731453.000000000000000,4713838.000000000000000)"
[45] "Pixel Size = (1.000000000000000,-1.000000000000000)"
[46] "Metadata:"
[47] " AREA_OR_POINT=Area"
[48] "Image Structure Metadata:"
[49] " COMPRESSION=LZW"
[50] " INTERLEAVE=BAND"
[51] "Corner Coordinates:"
[52] "Upper Left ( 731453.000, 4713838.000) ( 72d10'52.71\"W, 42d32'32.18\"N)"
[53] "Lower Left ( 731453.000, 4712471.000) ( 72d10'54.71\"W, 42d31'47.92\"N)"
[54] "Upper Right ( 733150.000, 4713838.000) ( 72d 9'38.40\"W, 42d32'30.35\"N)"
[55] "Lower Right ( 733150.000, 4712471.000) ( 72d 9'40.41\"W, 42d31'46.08\"N)"
[56] "Center ( 732301.500, 4713154.500) ( 72d10'16.56\"W, 42d32' 9.13\"N)"
[57] "Band 1 Block=1697x1 Type=Float64, ColorInterp=Gray"
[58] " Min=305.070 Max=416.070 "
[59] " Minimum=305.070, Maximum=416.070, Mean=359.853, StdDev=17.832"
[60] " NoData Value=-9999"
[61] " Metadata:"
[62] " STATISTICS_MAXIMUM=416.06997680664"
[63] " STATISTICS_MEAN=359.85311802914"
[64] " STATISTICS_MINIMUM=305.07000732422"
[65] " STATISTICS_STDDEV=17.83169335933"
NoDataValue
are encoded as -9999.
Bad Data Values in Rasters
Bad data values are different from NoDataValue
s. Bad
data values are values that fall outside of the applicable range of a
dataset.
Examples of Bad Data Values:
- The normalized difference vegetation index (NDVI), which is a measure of greenness, has a valid range of -1 to 1. Any value outside of that range would be considered a “bad” or miscalculated value.
- Reflectance data in an image will often range from 0-1 or 0-10,000 depending upon how the data are scaled. Thus a value greater than 1 or greater than 10,000 is likely caused by an error in either data collection or processing.
Find Bad Data Values
Sometimes a raster’s metadata will tell us the range of expected values for a raster. Values outside of this range are suspect and we need to consider that when we analyze the data. Sometimes, we need to use some common sense and scientific insight as we examine the data - just as we would for field data to identify questionable values.
Plotting data with appropriate highlighting can help reveal patterns in bad values and may suggest a solution. Below, reclassification is used to highlight elevation values over 400m with a contrasting colour.
Create A Histogram of Raster Values
We can explore the distribution of values contained within our raster
using the geom_histogram()
function which produces a
histogram. Histograms are often useful in identifying outliers and bad
data values in our raster data.
R
ggplot() +
geom_histogram(data = DSM_HARV_df, mapping = aes(x = HARV_dsmCrop))
OUTPUT
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that a warning message is thrown when R creates the histogram.
stat_bin()
using bins = 30
. Pick better
value with binwidth
.
This warning is caused by a default setting in
geom_histogram
enforcing that there are 30 bins for the
data. We can define the number of bins we want in the histogram by using
the bins
value in the geom_histogram()
function.
R
ggplot() +
geom_histogram(data = DSM_HARV_df,
mapping = aes(x = HARV_dsmCrop),
bins = 40)
Note that the shape of this histogram looks similar to the previous
one that was created using the default of 30 bins. The distribution of
elevation values for our Digital Surface Model (DSM)
looks
reasonable. It is likely there are no bad data values in this particular
raster.
Challenge 3: Explore Raster Metadata
Use describe()
to determine the following about the
NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif
file:
- Does this file have the same CRS as
DSM_HARV
? - What is the
NoDataValue
? - What is resolution of the raster data?
- How large would a 5x5 pixel area be on the Earth’s surface?
- Is the file a multi- or single-band raster?
Notice: this file is a hillshade. We will learn about hillshades in the Working with Multi-band Rasters in R episode.
R
describe("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif")
OUTPUT
[1] "Driver: GTiff/GeoTIFF"
[2] "Files: data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif"
[3] "Size is 1697, 1367"
[4] "Coordinate System is:"
[5] "PROJCRS[\"WGS 84 / UTM zone 18N\","
[6] " BASEGEOGCRS[\"WGS 84\","
[7] " DATUM[\"World Geodetic System 1984\","
[8] " ELLIPSOID[\"WGS 84\",6378137,298.257223563,"
[9] " LENGTHUNIT[\"metre\",1]]],"
[10] " PRIMEM[\"Greenwich\",0,"
[11] " ANGLEUNIT[\"degree\",0.0174532925199433]],"
[12] " ID[\"EPSG\",4326]],"
[13] " CONVERSION[\"UTM zone 18N\","
[14] " METHOD[\"Transverse Mercator\","
[15] " ID[\"EPSG\",9807]],"
[16] " PARAMETER[\"Latitude of natural origin\",0,"
[17] " ANGLEUNIT[\"degree\",0.0174532925199433],"
[18] " ID[\"EPSG\",8801]],"
[19] " PARAMETER[\"Longitude of natural origin\",-75,"
[20] " ANGLEUNIT[\"degree\",0.0174532925199433],"
[21] " ID[\"EPSG\",8802]],"
[22] " PARAMETER[\"Scale factor at natural origin\",0.9996,"
[23] " SCALEUNIT[\"unity\",1],"
[24] " ID[\"EPSG\",8805]],"
[25] " PARAMETER[\"False easting\",500000,"
[26] " LENGTHUNIT[\"metre\",1],"
[27] " ID[\"EPSG\",8806]],"
[28] " PARAMETER[\"False northing\",0,"
[29] " LENGTHUNIT[\"metre\",1],"
[30] " ID[\"EPSG\",8807]]],"
[31] " CS[Cartesian,2],"
[32] " AXIS[\"(E)\",east,"
[33] " ORDER[1],"
[34] " LENGTHUNIT[\"metre\",1]],"
[35] " AXIS[\"(N)\",north,"
[36] " ORDER[2],"
[37] " LENGTHUNIT[\"metre\",1]],"
[38] " USAGE["
[39] " SCOPE[\"Engineering survey, topographic mapping.\"],"
[40] " AREA[\"Between 78°W and 72°W, northern hemisphere between equator and 84°N, onshore and offshore. Bahamas. Canada - Nunavut; Ontario; Quebec. Colombia. Cuba. Ecuador. Greenland. Haiti. Jamaica. Panama. Turks and Caicos Islands. United States (USA). Venezuela.\"],"
[41] " BBOX[0,-78,84,-72]],"
[42] " ID[\"EPSG\",32618]]"
[43] "Data axis to CRS axis mapping: 1,2"
[44] "Origin = (731453.000000000000000,4713838.000000000000000)"
[45] "Pixel Size = (1.000000000000000,-1.000000000000000)"
[46] "Metadata:"
[47] " AREA_OR_POINT=Area"
[48] "Image Structure Metadata:"
[49] " COMPRESSION=LZW"
[50] " INTERLEAVE=BAND"
[51] "Corner Coordinates:"
[52] "Upper Left ( 731453.000, 4713838.000) ( 72d10'52.71\"W, 42d32'32.18\"N)"
[53] "Lower Left ( 731453.000, 4712471.000) ( 72d10'54.71\"W, 42d31'47.92\"N)"
[54] "Upper Right ( 733150.000, 4713838.000) ( 72d 9'38.40\"W, 42d32'30.35\"N)"
[55] "Lower Right ( 733150.000, 4712471.000) ( 72d 9'40.41\"W, 42d31'46.08\"N)"
[56] "Center ( 732301.500, 4713154.500) ( 72d10'16.56\"W, 42d32' 9.13\"N)"
[57] "Band 1 Block=1697x1 Type=Float64, ColorInterp=Gray"
[58] " Min=-0.714 Max=1.000 "
[59] " Minimum=-0.714, Maximum=1.000, Mean=0.313, StdDev=0.481"
[60] " NoData Value=-9999"
[61] " Metadata:"
[62] " STATISTICS_MAXIMUM=0.99999973665016"
[63] " STATISTICS_MEAN=0.31255246777216"
[64] " STATISTICS_MINIMUM=-0.71362979358008"
[65] " STATISTICS_STDDEV=0.48129385401108"
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif"
- If this file has the same CRS as DSM_HARV? Yes: UTM Zone 18, WGS84, meters.
- What format
NoDataValues
take? -9999 - The resolution of the raster data? 1x1
- How large a 5x5 pixel area would be? 5mx5m How? We are given resolution of 1x1 and units in meters, therefore resolution of 5x5 means 5x5m.
- Is the file a multi- or single-band raster? Single.
More Resources
- Read more about
the
terra
package in R. - We use the
ggplot2
package for plotting in this workshop, but thetmap
package actually makes plotting rasters even easier.tmap
is especially useful if you plan to plot rasters and vectors in the same plot.
Content from Plot Raster Data
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I create categorized or customized maps of raster data?
- How can I customize the color scheme of a raster image?
- How can I layer raster data in a single image?
Objectives
- Build customized plots for a single band raster using the
ggplot2
package. - Layer a raster dataset on top of a hillshade to create an elegant basemap.
Plot Raster Data in R
This episode covers how to plot a raster in R using the
ggplot2
package with customized coloring schemes. It also
covers how to layer a raster on top of a hillshade to produce an
eloquent map. We will continue working with the Digital Surface Model
(DSM) raster for the NEON Harvard Forest Field Site.
First, let’s load in the libraries and data we will need for this lesson:
R
library(terra)
library(ggplot2)
library(dplyr)
# DSM data for Harvard Forest
DSM_HARV <-
rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif"
DSM_HARV_df <- as.data.frame(DSM_HARV, xy = TRUE)
Plotting Data Using Breaks
In the previous episode, we viewed our data using a continuous color
ramp. For clarity and visibility of the plot, we may prefer to view the
data “symbolized” or colored according to ranges of values. This is
comparable to a “classified” map. To do this, we need to tell
ggplot
how many groups to break our data into, and where
those breaks should be. To make these decisions, it is useful to first
explore the distribution of the data using a bar plot. To begin with, we
will use dplyr
’s mutate()
function combined
with cut()
to split the data into 3 bins.
R
DSM_HARV_df <- DSM_HARV_df %>%
mutate(fct_elevation = cut(HARV_dsmCrop, breaks = 3))
ggplot() +
geom_bar(data = DSM_HARV_df, mapping = aes(x = fct_elevation))
If we want to know the cutoff values for the groups, we can ask for
the unique values of fct_elevation
:
R
unique(DSM_HARV_df$fct_elevation)
OUTPUT
[1] (379,416] (342,379] (305,342]
Levels: (305,342] (342,379] (379,416]
Notice that the cut()
function shows closed intervals on
the right (]
), and open intervals on the left
((
). In our example, (305, 342] means that values greater
than 305 (but not 305 itself) and values less than or equal to 342 are
included in the interval.
And we can get the count of values in each group using
dplyr
’s group_by()
and count()
functions:
R
DSM_HARV_df %>%
group_by(fct_elevation) %>%
count() %>%
ungroup()
OUTPUT
# A tibble: 3 × 2
fct_elevation n
<fct> <int>
1 (305,342] 418891
2 (342,379] 1530073
3 (379,416] 370835
We might prefer to customize the cutoff values for these groups. Lets
round the cutoff values so that we have groups for the ranges of 301–350
m, 351–400 m, and 401–450 m. To implement this we will give
mutate()
a numeric vector of break points instead of the
number of breaks we want.
R
custom_bins <- c(300, 350, 400, 450)
DSM_HARV_df <- DSM_HARV_df %>%
mutate(fct_elevation_2 = cut(HARV_dsmCrop, breaks = custom_bins))
unique(DSM_HARV_df$fct_elevation_2)
OUTPUT
[1] (400,450] (350,400] (300,350]
Levels: (300,350] (350,400] (400,450]
And now we can plot our bar plot again, using the new groups:
R
ggplot() +
geom_bar(data = DSM_HARV_df, mapping = aes(x = fct_elevation_2))
And we can get the count of values in each group in the same way we did before:
R
DSM_HARV_df %>%
group_by(fct_elevation_2) %>%
count() %>%
ungroup()
OUTPUT
# A tibble: 3 × 2
fct_elevation_2 n
<fct> <int>
1 (300,350] 741815
2 (350,400] 1567316
3 (400,450] 10668
We can use those groups to plot our raster data, with each group being a different color:
R
ggplot() +
geom_raster(data = DSM_HARV_df ,
mapping = aes(x = x, y = y, fill = fct_elevation_2)) +
coord_quickmap()
The plot above uses the default colors inside ggplot
for
raster objects. We can specify our own colors to make the plot look a
little nicer. R has a built in set of colors for plotting terrain, which
are built in to the terrain.colors()
function. Since we
have three bins, we want to create a 3-color palette:
R
terrain.colors(3)
OUTPUT
[1] "#00A600" "#ECB176" "#F2F2F2"
The terrain.colors()
function returns hex
colors - each of these character strings represents a color. To use
these in our map, we pass them across using the
scale_fill_manual()
function.
R
ggplot() +
geom_raster(data = DSM_HARV_df ,
mapping = aes(x = x, y = y, fill = fct_elevation_2)) +
scale_fill_manual(values = terrain.colors(3)) +
coord_quickmap()
More Plot Formatting
If we need to create multiple plots using the same color palette, we
can create an R object (my_col
) for the set of colors that
we want to use. We can then quickly change the palette across all plots
by modifying the my_col
object, rather than each individual
plot.
We can label the x- and y-axes of our plot too using the
labs()
function. We can also give the legend a more
meaningful title by passing either a value to the name
argument of the scale_fill_manual()
function or by
specifying it within the labs()
function as
fill = "name here"
.
R
my_col <- terrain.colors(3)
ggplot() +
geom_raster(data = DSM_HARV_df ,
mapping = aes(x = x, y = y, fill = fct_elevation_2)) +
scale_fill_manual(values = my_col) +
coord_quickmap() +
labs(x = "longitude",
y = "latitude",
fill = "Elevation")
Or we can also turn off the labels of both axes by passing a NULL
value to the labs()
function for the x and y values.
R
ggplot() +
geom_raster(data = DSM_HARV_df ,
mapping = aes(x = x, y = y, fill = fct_elevation_2)) +
scale_fill_manual(values = my_col) +
coord_quickmap() +
labs(x = NULL,
y = NULL,
fill = "Elevation")
Challenge 1: Plot Using Custom Breaks
Create a plot of the Harvard Forest Digital Surface Model (DSM) that has:
- Six classified ranges of values (break points) that are evenly divided among the range of pixel values.
- Axis labels.
- A plot title. (You can specify this in the
labs()
function as well, usingtitle = "Title here"
)
R
DSM_HARV_df <- DSM_HARV_df %>%
mutate(fct_elevation_6 = cut(HARV_dsmCrop, breaks = 6))
my_col <- terrain.colors(6)
ggplot() +
geom_raster(data = DSM_HARV_df,
mapping = aes(x = x, y = y, fill = fct_elevation_6)) +
scale_fill_manual(values = my_col) +
labs(x = "UTM Easting Coordinate (m)",
y = "UTM Northing Coordinate (m)",
fill = "Elevation",
title = "Classified Elevation Map - NEON Harvard Forest Field Site") +
coord_quickmap()
Layering Rasters
We can layer a raster on top of a hillshade raster for the same area, and use a transparency factor to create a 3-dimensional shaded effect. A hillshade is a raster that maps the shadows and texture that you would see from above when viewing terrain. We will add a custom color, making the plot grey.
First we need to read in our DSM hillshade data and view the structure:
R
DSM_hill_HARV <-
rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif"
DSM_hill_HARV
OUTPUT
class : SpatRaster
dimensions : 1367, 1697, 1 (nrow, ncol, nlyr)
resolution : 1, 1 (x, y)
extent : 731453, 733150, 4712471, 4713838 (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 18N (EPSG:32618)
source : HARV_DSMhill.tif
name : HARV_DSMhill
min value : -0.7136298
max value : 0.9999997
Next we convert it to a dataframe, so that we can plot it using
ggplot2
:
R
DSM_hill_HARV_df <- as.data.frame(DSM_hill_HARV, xy = TRUE)
str(DSM_hill_HARV_df)
OUTPUT
'data.frame': 2313675 obs. of 3 variables:
$ x : num 731454 731456 731456 731458 731458 ...
$ y : num 4713836 4713836 4713836 4713836 4713836 ...
$ HARV_DSMhill: num -0.15567 0.00743 0.86989 0.9791 0.96283 ...
Now we can plot the hillshade data. Note that the alpha
object translates to transparency, where a value of 0 indicates fully
transparent and a value of 1 indicates fully opaque. We can force R to
not show the legend for a particular scale if we use
guide = "none"
:
R
ggplot() +
geom_raster(data = DSM_hill_HARV_df,
mapping = aes(x = x, y = y, alpha = HARV_DSMhill)) +
scale_alpha(range = c(0.15, 0.65), guide = "none") +
coord_quickmap()
We can layer another raster on top of our hillshade by adding another
call to the geom_raster()
function. Let’s overlay
DSM_HARV
on top of the hill_HARV
. We can also
add a title to our plot within the labs()
function using
labs(title = "Plot title")
or we can use
ggtitle("Plot title")
.
R
ggplot() +
geom_raster(data = DSM_HARV_df,
mapping = aes(x = x, y = y, fill = HARV_dsmCrop)) +
geom_raster(data = DSM_hill_HARV_df,
mapping = aes(x = x, y = y, alpha = HARV_DSMhill)) +
scale_fill_viridis_c() +
scale_alpha(range = c(0.15, 0.65), guide = "none") +
labs(title = "Elevation with hillshade") +
coord_quickmap()
Challenge 2: Create DTM & DSM for SJER
Use the files in the
data/NEON-DS-Airborne-Remote-Sensing/SJER/
(or
data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/
depending on your setup) directory to create a Digital Terrain Model map
and Digital Surface Model map of the San Joaquin Experimental Range
field site.
Make sure to:
- include hillshade in the maps,
- label axes on the DSM map and exclude them from the DTM map,
- include a title for each map,
- experiment with various alpha values and color palettes to represent the data.
R
# CREATE DSM MAPS
# import DSM data
DSM_SJER <-
rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif"
# convert to a df for plotting
DSM_SJER_df <- as.data.frame(DSM_SJER, xy = TRUE)
# import DSM hillshade
DSM_hill_SJER <-
rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmHill.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmHill.tif"
# convert to a df for plotting
DSM_hill_SJER_df <- as.data.frame(DSM_hill_SJER, xy = TRUE)
# Build Plot
ggplot() +
geom_raster(data = DSM_SJER_df ,
mapping = aes(x = x, y = y,
fill = SJER_dsmCrop), alpha = 0.8) +
geom_raster(data = DSM_hill_SJER_df,
mapping = aes(x = x, y = y, alpha = SJER_dsmHill)) +
scale_fill_viridis_c() +
scale_alpha(range = c(0.4, 0.7), guide = "none") +
labs(x = "UTM Easting Coordinate (m)",
y = "UTM Northing Coordinate (m)",
title = "DSM with Hillshade") +
coord_quickmap()
R
# CREATE DTM MAP
# import DTM
DTM_SJER <-
rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmCrop.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmCrop.tif"
DTM_SJER_df <- as.data.frame(DTM_SJER, xy = TRUE)
# DTM Hillshade
DTM_hill_SJER <-
rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmHill.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmHill.tif"
DTM_hill_SJER_df <- as.data.frame(DTM_hill_SJER, xy = TRUE)
ggplot() +
geom_raster(data = DTM_SJER_df,
mapping = aes(x = x, y = y,
fill = SJER_dtmCrop),
alpha = 2.0) +
geom_raster(data = DTM_hill_SJER_df,
mapping = aes(x = x, y = y,
alpha = SJER_dtmHill)) +
scale_fill_viridis_c() +
scale_alpha(range = c(0.4, 0.7), guide = "none") +
labs(x = NULL,
y = NULL,
title = "DTM with Hillshade") +
coord_quickmap()
Content from Reproject Raster Data
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How do I work with raster data sets that are in different projections?
Objectives
- Reproject a raster in R.
Sometimes we encounter raster datasets that do not “line up” when
plotted or analyzed. Rasters that don’t line up are most often in
different Coordinate Reference Systems (CRS). This episode explains how
to deal with rasters in different, known CRSs. It will walk though
reprojecting rasters in R using the project()
function in
the terra
package.
Let’s load the packages we’ll need for this lesson:
R
library(terra)
library(ggplot2)
library(dplyr)
Raster Projection in R
In the previous episode, we learned how to layer a raster file on top of a hillshade for a nice looking basemap. In that episode, all of our data were in the same CRS. What happens when things don’t line up?
For this episode, we will be working with the Harvard Forest Digital Terrain Model data. This differs from the surface model data we’ve been working with so far in that the digital surface model (DSM) includes the tops of trees, while the digital terrain model (DTM) shows the ground level.
Here, we will create a map of the Harvard Forest Digital Terrain
Model (DTM_HARV
) draped or layered on top of the hillshade
(DTM_hill_HARV
). The hillshade layer maps the terrain using
light and shadow to create a 3D-looking image, based on a hypothetical
illumination of the ground level.
First, we need to import the DTM and DTM hillshade data.
R
DTM_HARV <-
rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop.tif"
DTM_hill_HARV <-
rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_DTMhill_WGS84.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_DTMhill_WGS84.tif"
Next, we will convert each of these datasets to a dataframe for
plotting with ggplot
.
R
DTM_HARV_df <- as.data.frame(DTM_HARV, xy = TRUE)
DTM_hill_HARV_df <- as.data.frame(DTM_hill_HARV, xy = TRUE)
Now we can create a map of the DTM layered over the hillshade. Note
that we use scale_fill_gradientn()
here to specify our
color scale. This allows for us to easily create a color ramp for the
data we want to display.
R
ggplot() +
geom_raster(data = DTM_HARV_df ,
mapping = aes(x = x, y = y,
fill = HARV_dtmCrop)) +
geom_raster(data = DTM_hill_HARV_df,
mapping = aes(x = x, y = y,
alpha = HARV_DTMhill_WGS84)) +
scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) +
coord_quickmap()
Our results are curious - neither the Digital Terrain Model
(DTM_HARV_df
) nor the DTM Hillshade
(DTM_hill_HARV_df
) plotted. Let’s try to plot the DTM on
its own to make sure there are data there.
R
ggplot() +
geom_raster(data = DTM_HARV_df,
mapping = aes(x = x, y = y, fill = HARV_dtmCrop)) +
scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) +
coord_quickmap()
Our DTM seems to contain data and plots just fine.
Next we plot the DTM Hillshade on its own to see whether everything is OK.
R
ggplot() +
geom_raster(data = DTM_hill_HARV_df,
mapping = aes(x = x, y = y,
alpha = HARV_DTMhill_WGS84)) +
coord_quickmap()
If we look at the axes, we can see that the projections of the two
rasters are different. When this is the case, ggplot
won’t
render the image. It won’t even throw an error message to tell you
something has gone wrong. We can look at Coordinate Reference Systems
(CRSs) of the DTM and the hillshade data to see how they differ.
R
crs(DTM_HARV, proj = TRUE)
OUTPUT
[1] "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"
R
crs(DTM_hill_HARV, proj = TRUE)
OUTPUT
[1] "+proj=longlat +datum=WGS84 +no_defs"
DTM_HARV
is in the UTM projection, with units of meters.
DTM_hill_HARV
is in Geographic WGS84
- which
is represented by latitude and longitude values.
Because the two rasters are in different CRSs, they don’t line up
when plotted in R. We need to reproject (or change the projection of)
DTM_hill_HARV
into the UTM CRS. Alternatively, we could
reproject DTM_HARV
into WGS84.
Reproject Rasters
We can use the project()
function to reproject a raster
into a new CRS. Keep in mind that reprojection only works when you first
have a defined CRS for the raster object that you want to reproject. It
cannot be used if no CRS is defined. Lucky for us, the
DTM_hill_HARV
has a defined CRS.
To use the project()
function, we need to define two
things:
- the object we want to reproject and
- the CRS that we want to reproject it to.
The syntax is project(x = RasterObject, y = crs)
We want the CRS of our hillshade to match the DTM_HARV
raster. One option would be to just use the CRS of DTM_HARV
in the project()
function, but this will cause other issues
later down the line because the resolutions of DTM_HARV
and
DTM_hill_HARV
are different. That’s okay, because we can
specify the resolution of the output file in the project()
function using res
.
That’s a little complicated for our needs though. In our case, the
most effective way to reproject the dataset is to use the
DTM_HARV
layer as a template for the reprojection. We can
specify that by using the syntax
project(x = RasterObject, y = TemplateRasterObject)
.
Note that we are using the project()
function on the
raster object, not the data.frame()
we use for plotting
with ggplot
.
R
DTM_hill_UTMZ18N_HARV <- project(x = DTM_hill_HARV,
y = DTM_HARV)
For plotting with ggplot()
, we will need to create a
dataframe from our newly reprojected raster.
R
DTM_hill_HARV_2_df <- as.data.frame(DTM_hill_UTMZ18N_HARV, xy = TRUE)
We can now create a plot of this data.
R
ggplot() +
geom_raster(data = DTM_HARV_df ,
mapping = aes(x = x, y = y,
fill = HARV_dtmCrop)) +
geom_raster(data = DTM_hill_HARV_2_df,
mapping = aes(x = x, y = y,
alpha = HARV_DTMhill_WGS84)) +
scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) +
coord_quickmap()
We have now successfully draped the Digital Terrain Model on top of our hillshade to produce a nice looking, textured map!
Challenge 1: Reproject, then Plot a Digital Terrain Model
Create a map of the San
Joaquin Experimental Range field site using the
SJER_DSMhill_WGS84.tif
and SJER_dsmCrop.tif
files.
Reproject the data as necessary to make things line up!
R
# import DSM
DSM_SJER <-
rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif"
# import DSM hillshade
DSM_hill_SJER_WGS <-
rast("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_DSMhill_WGS84.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_DSMhill_WGS84.tif"
# reproject raster
DSM_hill_UTMZ18N_SJER <- project(x = DSM_hill_SJER_WGS,
y = DSM_SJER)
# convert to data.frames
DSM_SJER_df <- as.data.frame(DSM_SJER, xy = TRUE)
DSM_hill_SJER_df <- as.data.frame(DSM_hill_UTMZ18N_SJER, xy = TRUE)
ggplot() +
geom_raster(data = DSM_hill_SJER_df,
mapping = aes(x = x, y = y,
alpha = SJER_DSMhill_WGS84)) +
geom_raster(data = DSM_SJER_df,
mapping = aes(x = x, y = y,
fill = SJER_dsmCrop,
alpha=0.8)) +
scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) +
coord_quickmap()
Content from Open and Plot Vector Layers
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I distinguish between and visualize point, line and polygon vector data?
Objectives
- Know the difference between point, line, and polygon vector elements.
- Load point, line, and polygon vector layers into R.
- Access the attributes of a spatial object in R.
Starting with this episode, we will be moving from working with
raster data to working with vector data. In this episode, we will open
and plot point, line and polygon vector data loaded from ESRI’s
shapefile
format into R. These data refer to the NEON
Harvard Forest field site.
Import Vector Data
We will use the sf
package to work with vector data in
R. Fun fact: sf
stands for “simple features”.
If you haven’t downloaded the sf
package yet, you can do
so by going to the Packages tab in your R console and installing a new
package, or you can type the following code into your R console:
install.packages("sf")
Make sure you have the sf
library loaded along with the
other libraries we will need.
R
library(ggplot2)
library(dplyr)
library(sf)
The vector layers that we will import from ESRI’s
shapefile
format are:
- A polygon vector layer representing our field site boundary,
- A line vector layer representing roads, and
- A point vector layer representing the location of the Fisher flux tower located at the NEON Harvard Forest field site.
The first vector layer that we will open contains the boundary of our
study area (or our Area Of Interest or AOI, hence the name
aoiBoundary
). To import a vector layer from an ESRI
shapefile
we use the sf
function
st_read()
. st_read()
requires the file path to
the ESRI shapefile
.
Let’s import our AOI:
R
aoi_boundary_HARV <- st_read(
"data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
OUTPUT
Reading layer `HarClip_UTMZ18' from data source
`/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 732128 ymin: 4713209 xmax: 732251.1 ymax: 4713359
Projected CRS: WGS 84 / UTM zone 18N
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"
Vector Layer Metadata & Attributes
When we import the HarClip_UTMZ18
vector layer from an
ESRI shapefile
into R (as our
aoi_boundary_HARV
object), the st_read()
function automatically stores information about the data. We are
particularly interested in the geospatial metadata, describing the
format, CRS, extent, and other components of the vector data, and the
attributes which describe properties associated with each individual
vector object.
Spatial Metadata
Key metadata for all vector layers includes:
- Object Type: the class of the imported object.
- Coordinate Reference System (CRS): the projection of the data.
- Extent: the spatial extent (i.e. geographic area that the vector layer covers) of the data. Note that the spatial extent for a vector layer represents the combined extent for all individual objects in the vector layer.
We can view metadata of a vector layer using the
st_geometry_type()
, st_crs()
and
st_bbox()
functions. First, let’s view the geometry type
for our AOI vector layer:
R
st_geometry_type(aoi_boundary_HARV)
OUTPUT
[1] POLYGON
18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
Our aoi_boundary_HARV
is a polygon spatial object. The
18 levels shown below our output list the possible categories of the
geometry type. Now let’s check what CRS this file data is in:
R
st_crs(aoi_boundary_HARV)
OUTPUT
Coordinate Reference System:
User input: WGS 84 / UTM zone 18N
wkt:
PROJCRS["WGS 84 / UTM zone 18N",
BASEGEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]],
CONVERSION["UTM zone 18N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-75,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
ID["EPSG",32618]]
Our data in the CRS UTM zone 18N. The CRS is
critical to interpreting the spatial object’s extent values as it
specifies units. To find the extent of our AOI, we can use the
st_bbox()
function:
R
st_bbox(aoi_boundary_HARV)
OUTPUT
xmin ymin xmax ymax
732128.0 4713208.7 732251.1 4713359.2
The spatial extent of a vector layer or R spatial object represents the geographic “edge” or location that is the furthest north, south east and west. Thus it represents the overall geographic coverage of the spatial object.
Lastly, we can view all of the metadata and attributes for this R spatial object by printing it to the screen:
R
aoi_boundary_HARV
OUTPUT
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 732128 ymin: 4713209 xmax: 732251.1 ymax: 4713359
Projected CRS: WGS 84 / UTM zone 18N
id geometry
1 1 POLYGON ((732128 4713359, 7...
Plot a vector layer
Next, let’s visualize the data in our sf
object using
the ggplot
package. Unlike with raster data, we do not need
to convert vector data to a dataframe before plotting with
ggplot
.
We’re going to customize our boundary plot by setting the color and
fill for our plot. When plotting sf
objects with
ggplot2
, you need to use the coord_sf()
coordinate system.
R
ggplot() +
geom_sf(data = aoi_boundary_HARV, color = "black", fill = "cyan1") +
labs(title = "AOI Boundary Plot") +
coord_sf()
Challenge 1: Import Line and Point Vector Layers
Using the steps above, import the HARV_roads and HARVtower_UTM18N
vector layers into R. Call the HARV_roads object lines_HARV
and the HARVtower_UTM18N point_HARV
.
Answer the following questions:
What type of R spatial object is created when you import each layer?
What is the CRS and extent for each object?
Do the files contain points, lines, or polygons?
How many spatial objects are in each file?
First we import the data:
R
lines_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")
OUTPUT
Reading layer `HARV_roads' from data source
`/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp'
using driver `ESRI Shapefile'
Simple feature collection with 13 features and 15 fields
Geometry type: MULTILINESTRING
Dimension: XY
Bounding box: xmin: 730741.2 ymin: 4711942 xmax: 733295.5 ymax: 4714260
Projected CRS: WGS 84 / UTM zone 18N
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp"
point_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
OUTPUT
Reading layer `HARVtower_UTM18N' from data source
`/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 14 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 732183.2 ymin: 4713265 xmax: 732183.2 ymax: 4713265
Projected CRS: WGS 84 / UTM zone 18N
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"
Then we check its geometry type.
R
st_geometry_type(lines_HARV)
OUTPUT
[1] MULTILINESTRING MULTILINESTRING MULTILINESTRING MULTILINESTRING
[5] MULTILINESTRING MULTILINESTRING MULTILINESTRING MULTILINESTRING
[9] MULTILINESTRING MULTILINESTRING MULTILINESTRING MULTILINESTRING
[13] MULTILINESTRING
18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
R
st_geometry_type(point_HARV)
OUTPUT
[1] POINT
18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
We also check the CRS and extent of each object:
R
st_crs(lines_HARV)
OUTPUT
Coordinate Reference System:
User input: WGS 84 / UTM zone 18N
wkt:
PROJCRS["WGS 84 / UTM zone 18N",
BASEGEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]],
CONVERSION["UTM zone 18N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-75,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
ID["EPSG",32618]]
R
st_bbox(lines_HARV)
OUTPUT
xmin ymin xmax ymax
730741.2 4711942.0 733295.5 4714260.0
R
st_crs(point_HARV)
OUTPUT
Coordinate Reference System:
User input: WGS 84 / UTM zone 18N
wkt:
PROJCRS["WGS 84 / UTM zone 18N",
BASEGEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]],
CONVERSION["UTM zone 18N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-75,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
ID["EPSG",32618]]
R
st_bbox(point_HARV)
OUTPUT
xmin ymin xmax ymax
732183.2 4713265.0 732183.2 4713265.0
To see the number of objects in each file, we can look at the output
from when we read these objects into R. lines_HARV
contains
13 features (all lines) and point_HARV
contains only one
point.
Content from Explore and Plot by Vector Layer Attributes
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I compute on the attributes of a spatial object?
Objectives
- Query attributes of a spatial object.
- Subset spatial objects using specific attribute values.
- Plot a vector feature, colored by unique attribute values.
This episode continues our discussion of vector layer attributes and covers how to work with vector layer attributes in R. It covers how to identify and query layer attributes, as well as how to subset features by specific attribute values. Finally, we will learn how to plot a feature according to a set of attribute values.
Load the Data
We will continue using the sf
and ggplot2
packages in this episode. Make sure that you have these packages loaded.
We will continue to work with the three ESRI shapefiles
(vector layers) that we loaded in a previous episode.
R
library(ggplot2)
library(dplyr)
library(sf)
R
point_HARV <-
st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"
lines_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp"
aoi_boundary_HARV <-
st_read("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"
Query Vector Feature Metadata
As we discussed in the Open and Plot Vector Layers in R episode, we can view metadata associated with an R object using:
-
st_geometry_type()
- The type of vector data stored in the object. -
st_bbox()
- The spatial extent (geographic area covered by) of the object. -
st_crs()
- The CRS (spatial projection) of the data.
We started to explore our point_HARV
object in the
previous episode. To see a summary of all of the metadata associated
with our object, we can view the object with
View(point_HARV)
or print a summary of the object itself to
the console. We’re going to do this, but with lines_HARV
today:
R
lines_HARV
OUTPUT
Simple feature collection with 13 features and 15 fields
Geometry type: MULTILINESTRING
Dimension: XY
Bounding box: xmin: 730741.2 ymin: 4711942 xmax: 733295.5 ymax: 4714260
Projected CRS: WGS 84 / UTM zone 18N
First 10 features:
OBJECTID_1 OBJECTID TYPE NOTES MISCNOTES RULEID
1 14 48 woods road Locust Opening Rd <NA> 5
2 40 91 footpath <NA> <NA> 6
3 41 106 footpath <NA> <NA> 6
4 211 279 stone wall <NA> <NA> 1
5 212 280 stone wall <NA> <NA> 1
6 213 281 stone wall <NA> <NA> 1
7 214 282 stone wall <NA> <NA> 1
8 215 283 stone wall <NA> <NA> 1
9 216 284 stone wall <NA> <NA> 1
10 553 674 boardwalk <NA> <NA> 2
MAPLABEL SHAPE_LENG LABEL BIKEHORSE RESVEHICLE RECMAP
1 Locust Opening Rd 1297.35706 Locust Opening Rd Y R1 Y
2 <NA> 146.29984 <NA> Y R1 Y
3 <NA> 676.71804 <NA> Y R2 Y
4 <NA> 231.78957 <NA> <NA> <NA> <NA>
5 <NA> 45.50864 <NA> <NA> <NA> <NA>
6 <NA> 198.39043 <NA> <NA> <NA> <NA>
7 <NA> 143.19240 <NA> <NA> <NA> <NA>
8 <NA> 90.33118 <NA> <NA> <NA> <NA>
9 <NA> 35.88146 <NA> <NA> <NA> <NA>
10 <NA> 67.43464 <NA> N R3 N
Shape_Le_1 ResVehic_1 BicyclesHo
1 1297.10617 R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
2 146.29983 R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
3 676.71807 R2 - 4WD/High Clearance Vehicles Only Bicycles and Horses Allowed
4 231.78962 <NA> <NA>
5 45.50859 <NA> <NA>
6 198.39041 <NA> <NA>
7 143.19241 <NA> <NA>
8 90.33114 <NA> <NA>
9 35.88152 <NA> <NA>
10 67.43466 R3 - No Vehicles Allowed DO NOT SHOW ON REC MAP
geometry
1 MULTILINESTRING ((730819.2 ...
2 MULTILINESTRING ((732040.2 ...
3 MULTILINESTRING ((732057 47...
4 MULTILINESTRING ((731903.6 ...
5 MULTILINESTRING ((732039.1 ...
6 MULTILINESTRING ((732056.2 ...
7 MULTILINESTRING ((731964 47...
8 MULTILINESTRING ((732105.2 ...
9 MULTILINESTRING ((732222.9 ...
10 MULTILINESTRING ((732153.8 ...
We can use the ncol
function to count the number of
attributes associated with a spatial object too. Note that the geometry
is just another column and counts towards the total.
R
ncol(lines_HARV)
OUTPUT
[1] 16
We can view the individual name of each attribute using the
names()
function in R:
R
names(lines_HARV)
OUTPUT
[1] "OBJECTID_1" "OBJECTID" "TYPE" "NOTES" "MISCNOTES"
[6] "RULEID" "MAPLABEL" "SHAPE_LENG" "LABEL" "BIKEHORSE"
[11] "RESVEHICLE" "RECMAP" "Shape_Le_1" "ResVehic_1" "BicyclesHo"
[16] "geometry"
We could also view just the first 6 rows of attribute values using
the head()
function to get a preview of the data:
R
head(lines_HARV)
OUTPUT
Simple feature collection with 6 features and 15 fields
Geometry type: MULTILINESTRING
Dimension: XY
Bounding box: xmin: 730741.2 ymin: 4712685 xmax: 732232.3 ymax: 4713726
Projected CRS: WGS 84 / UTM zone 18N
OBJECTID_1 OBJECTID TYPE NOTES MISCNOTES RULEID
1 14 48 woods road Locust Opening Rd <NA> 5
2 40 91 footpath <NA> <NA> 6
3 41 106 footpath <NA> <NA> 6
4 211 279 stone wall <NA> <NA> 1
5 212 280 stone wall <NA> <NA> 1
6 213 281 stone wall <NA> <NA> 1
MAPLABEL SHAPE_LENG LABEL BIKEHORSE RESVEHICLE RECMAP
1 Locust Opening Rd 1297.35706 Locust Opening Rd Y R1 Y
2 <NA> 146.29984 <NA> Y R1 Y
3 <NA> 676.71804 <NA> Y R2 Y
4 <NA> 231.78957 <NA> <NA> <NA> <NA>
5 <NA> 45.50864 <NA> <NA> <NA> <NA>
6 <NA> 198.39043 <NA> <NA> <NA> <NA>
Shape_Le_1 ResVehic_1 BicyclesHo
1 1297.10617 R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
2 146.29983 R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
3 676.71807 R2 - 4WD/High Clearance Vehicles Only Bicycles and Horses Allowed
4 231.78962 <NA> <NA>
5 45.50859 <NA> <NA>
6 198.39041 <NA> <NA>
geometry
1 MULTILINESTRING ((730819.2 ...
2 MULTILINESTRING ((732040.2 ...
3 MULTILINESTRING ((732057 47...
4 MULTILINESTRING ((731903.6 ...
5 MULTILINESTRING ((732039.1 ...
6 MULTILINESTRING ((732056.2 ...
Challenge 1: Attributes for Different Spatial Classes
Explore the attributes associated with the point_HARV
and aoi_boundary_HARV
spatial objects.
How many attributes does each have?
Who owns the site in the
point_HARV
data object?Which of the following is NOT an attribute of the
point_HARV
data object?
- Latitude B) County C) Country
- To find the number of attributes, we use the
ncol()
function:
R
ncol(point_HARV)
OUTPUT
[1] 15
R
ncol(aoi_boundary_HARV)
OUTPUT
[1] 2
- Ownership information is in a column named
Ownership
:
R
point_HARV$Ownership
OUTPUT
[1] "Harvard University, LTER"
- To see a list of all of the attributes, we can use the
names()
function:
R
names(point_HARV)
OUTPUT
[1] "Un_ID" "Domain" "DomainName" "SiteName" "Type"
[6] "Sub_Type" "Lat" "Long" "Zone" "Easting"
[11] "Northing" "Ownership" "County" "annotation" "geometry"
“Country” is not an attribute of this object.
Explore Values within One Attribute
We can explore individual values stored within a particular
attribute. Comparing attributes to a spreadsheet or a data frame, this
is similar to exploring values in a column. We did this with the
gapminder
dataframe in an earlier lesson. For spatial
objects, we can use the same syntax:
objectName$attributeName
.
We can see the contents of the TYPE
field of our lines
feature:
R
lines_HARV$TYPE
OUTPUT
[1] "woods road" "footpath" "footpath" "stone wall" "stone wall"
[6] "stone wall" "stone wall" "stone wall" "stone wall" "boardwalk"
[11] "woods road" "woods road" "woods road"
To see only unique values within the TYPE
field, we can
use the unique()
function for extracting the possible
values of a character variable.
R
unique(lines_HARV$TYPE)
OUTPUT
[1] "woods road" "footpath" "stone wall" "boardwalk"
Subset Features
We can use the filter()
function from dplyr
to select a subset of features from a spatial object in R, just like
with data frames.
For example, we might be interested only in features that are of
TYPE
“footpath”. Once we subset out this data, we can use
it as input to other code so that code only operates on the footpath
lines.
R
footpath_HARV <- lines_HARV %>%
filter(TYPE == "footpath")
nrow(footpath_HARV)
OUTPUT
[1] 2
Our subsetting operation reduces the features
count to
2. This means that only two feature lines in our spatial object have the
attribute TYPE == footpath
. We can plot only the footpath
lines:
R
ggplot() +
geom_sf(data = footpath_HARV) +
labs(title = "NEON Harvard Forest Field Site",
subtitle = "Footpaths") +
coord_sf()
There are two features in our footpaths subset. Why does the plot
look like there is only one feature? Let’s adjust the colors used in our
plot. If we have 2 features in our vector object, we can plot each using
a unique color by assigning a column name to the color aesthetic
(color =
). We use the syntax
mapping = aes(color = )
to do this. We’ll use the
OBJECTID
column for the color aesthetic in this example,
but because it is a number, we will have to convert it to a categorical
value (a factor) first. We can also alter the default line thickness by
using the linewidth =
parameter, as the default value can
be hard to see. Note that linewidth
is placed outside of
the aes()
function, as we are not connecting line thickness
to a data variable.
R
ggplot() +
geom_sf(data = footpath_HARV,
mapping = aes(color = factor(OBJECTID)), linewidth = 1.5) +
labs(color = "Footpath ID",
title = "NEON Harvard Forest Field Site",
subtitle = "Footpaths") +
coord_sf()
Now, we see that there are in fact two features in our plot!
First we will save an object with only the boardwalk lines:
R
boardwalk_HARV <- lines_HARV %>%
filter(TYPE == "boardwalk")
Let’s check how many features there are in this subset:
R
nrow(boardwalk_HARV)
OUTPUT
[1] 1
Now let’s plot that data:
R
ggplot() +
geom_sf(data = boardwalk_HARV, linewidth = 1.5) +
labs(title = "NEON Harvard Forest Field Site",
subtitle = "Boardwalks") +
coord_sf()
First we will save an object with only the stone wall lines and check the number of features:
R
stoneWall_HARV <- lines_HARV %>%
filter(TYPE == "stone wall")
nrow(stoneWall_HARV)
OUTPUT
[1] 6
Now we can plot the data:
R
ggplot() +
geom_sf(data = stoneWall_HARV,
mapping = aes(color = factor(OBJECTID)), linewidth = 1.5) +
labs(color = "Wall ID",
title = "NEON Harvard Forest Field Site",
subtitle = "Stonewalls") +
coord_sf()
Customize Plots
In the examples above, ggplot()
automatically selected
colors for each line based on a default color order. If we don’t like
those default colors, we can create a vector of colors - one for each
feature.
First we will check how many unique line types there are:
R
unique(lines_HARV$TYPE)
OUTPUT
[1] "woods road" "footpath" "stone wall" "boardwalk"
Then we can create a palette of four colors, one for each feature in our vector object.
R
road_colors <- c("blue", "green", "navy", "purple")
We can tell ggplot
to use these colors when we plot the
data.
R
ggplot() +
geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
scale_color_manual(values = road_colors) +
labs(color = "Road Type",
title = "NEON Harvard Forest Field Site",
subtitle = "Roads & Trails") +
coord_sf()
Data Tip
Note that ggplot
automatically decides which colors
belong with which variables, based on the order of the colors in the
color vector, and the alphabetical order of the road types. You can
specify which colors to use for which road type by creating a named
vector:
R
road_colors <- c("stone wall" = "blue", "boardwalk" = "green",
"footpath" = "navy", "woods road" = "purple")
Adjust Line Width
We adjusted line width universally earlier. If we want a unique line width for each level or attribute category in our spatial object, we can use the same syntax that we used for colors, above.
We already know that we have four different TYPE
s in the
lines_HARV object, so we will set four different line widths.
R
line_widths <- c(1, 2, 3, 4)
We can use those line widths when we plot the data.
R
ggplot() +
geom_sf(data = lines_HARV,
mapping = aes(color = TYPE, linewidth = TYPE)) +
scale_color_manual(values = road_colors) +
scale_linewidth_manual(values = line_widths) +
labs(color = "Road Type",
linewidth = "Road Type",
title = "NEON Harvard Forest Field Site",
subtitle = "Roads & Trails - Line width varies") +
coord_sf()
Challenge 4: Plot Line Width by Attribute
In the example above, we set the line widths to be 1, 2, 3, and 4. Because R orders alphabetically by default, this gave us a plot where woods roads (the last type) were the thickest and boardwalks were the thinnest.
Let’s create another plot where we show the different line types with the following thicknesses:
- woods road width = 6
- boardwalks width = 1
- footpath width = 3
- stone wall width = 2
First we need to look at the levels of our factor to see what order the road types are in:
R
unique(lines_HARV$TYPE)
OUTPUT
[1] "woods road" "footpath" "stone wall" "boardwalk"
We then can create our line_width
vector setting each of
the levels to the desired thickness.
R
line_width <- c("woods road" = 6, "footpath" = 3,
"stone wall" = 2, "boardwalk" = 1)
Now we can create our plot.
R
ggplot() +
geom_sf(data = lines_HARV, mapping = aes(linewidth = TYPE)) +
scale_linewidth_manual(values = line_width) +
labs(title = "NEON Harvard Forest Field Site",
subtitle = "Roads & Trails - Line width varies") +
coord_sf()
Add Plot Legend
When we plot a vector object and specify colors, line widths, etc.
using the mapping = aes()
function, R will automatically
add a plot to our legend. By default, the plot legend will show up on
the right side of the plot and will include the values that we supplied
to the ggplot object. We can adjust the way our legends look in a
variety of ways.
- We specify the location of our legend by using a default keyword.
For example, we could use
bottomright
,top
,topright
, etc. - We can change the font size
- We can change the little things, like the color of the legend box outline
- And so much more!
Pretty much anything you would want to change on a plot, you can! We
do this using the theme()
function in R.
Let’s start with our basic plot. We will use the
road_colors
object that we created above to color the
legend. We’ll also move the legend to the bottom by specifying
theme(legend.position = "bottom")
.
R
ggplot() +
geom_sf(data = lines_HARV,
mapping = aes(color = TYPE), linewidth = 1.5) +
scale_color_manual(values = road_colors) +
labs(color = "Road Type",
title = "NEON Harvard Forest Field Site",
subtitle = "Roads & Trails - Default Legend") +
coord_sf() +
theme(legend.position = "bottom")
Now, let’s change the appearance of our legend by manually setting
different parameters using the theme()
function.
We can start by changing the font size in the legend. The way we
would do this is by adding the following line to our plot code:
theme(legend.text = element_text(size = 20))
This may seem a little confusing, but there is a method to the
madness. First, we note what we’re trying to change (the
legend.text
argument). Next, we have to specify what kind
of object it needs to be. Since we’re altering the legend text,
it needs to be a text element, so we use the element_text()
function. Within this function, we can specify what we want to change
from the default. In this case, it’s the size of the font
(size = 20
), but we could do this for things like color,
the face (plain, italic, bold, bold.italic), the angle of rotation for
the text, etc.
R
ggplot() +
geom_sf(data = lines_HARV,
mapping = aes(color = TYPE), linewidth = 1.5) +
scale_color_manual(values = road_colors) +
labs(color = "Road Type",
title = "NEON Harvard Forest Field Site",
subtitle = "Roads & Trails - Default Legend") +
coord_sf() +
theme(legend.position = "bottom",
legend.text = element_text(size = 20))
We can decide to have a box around the legend, by altering the
legend.box.background
to have a black border. Since the
legend is a rectangle of sorts, we will use the element:
element_rect()
to adjust these parameters.
So our new line would be:
theme(legend.box.background = element_rect(color = "black"))
R
ggplot() +
geom_sf(data = lines_HARV,
mapping = aes(color = TYPE), linewidth = 1.5) +
scale_color_manual(values = road_colors) +
labs(color = "Road Type",
title = "NEON Harvard Forest Field Site",
subtitle = "Roads & Trails - Default Legend") +
coord_sf() +
theme(legend.position = "bottom",
legend.text = element_text(size = 20),
legend.box.background = element_rect(color = "black"))
Data Tip
You can modify the default R color palette using the palette method.
For example palette(rainbow(6))
or
palette(terrain.colors(6))
. You can reset the palette
colors using palette("default")
!
You can also use colorblind-friendly palettes such as those in the viridis package.
Challenge 5: Plot Lines by Attribute
Create a plot that emphasizes only roads where bicycles and horses
are allowed. To emphasize this, make the lines where bicycles are not
allowed THINNER than the roads where bicycles are allowed. NOTE: this
attribute information is located in the
lines_HARV$BicyclesHo
attribute.
Be sure to add a title and legend to your map. You might consider a color palette that has all bike/horse-friendly roads displayed in a bright color. All other lines can be black.
First we explore the BicyclesHo
attribute to learn the
values that correspond to the roads we need.
R
lines_HARV %>%
pull(BicyclesHo) %>%
unique()
OUTPUT
[1] "Bicycles and Horses Allowed" NA
[3] "DO NOT SHOW ON REC MAP" "Bicycles and Horses NOT ALLOWED"
Now, we can create a data frame with only those roads where bicycles and horses are allowed.
R
lines_showHarv <-
lines_HARV %>%
filter(BicyclesHo == "Bicycles and Horses Allowed")
Finally, we plot the needed roads after setting them to magenta and a thicker line width.
R
ggplot() +
geom_sf(data = lines_HARV) +
geom_sf(data = lines_showHarv,
mapping = aes(color = BicyclesHo), linewidth = 2) +
scale_color_manual(values = "magenta") +
labs(title = "NEON Harvard Forest Field Site",
subtitle = "Roads Where Bikes and Horses Are Allowed",
color = NULL) +
coord_sf()
Challenge 6: Plot Polygon by Attribute
Create a map of the state boundaries in the United States using the
data located in your downloaded data folder:
NEON-DS-Site-Layout-Files/US-Boundary-Layers\US-State-Boundaries-Census-2014
.
Note that this dataset includes Z and M coordinates, which we do not
need. Use the st_zm()
function to drop these dimensions
before plotting.
When plotting, apply a line color to each state using its
region
value. Be sure to add a legend to your plot as
well.
First we read in the data and check how many levels there are in the
region
column:
R
state_boundary_US <-
st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp") %>%
# NOTE: We need neither Z nor M coordinates!
st_zm()
OUTPUT
Reading layer `US-State-Boundaries-Census-2014' from data source
`/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp'
using driver `ESRI Shapefile'
Simple feature collection with 58 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY, XYZ
Bounding box: xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp"
state_boundary_US$region <- as.factor(state_boundary_US$region)
levels(state_boundary_US$region)
OUTPUT
[1] "Midwest" "Northeast" "Southeast" "Southwest" "West"
Next we set a color vector with that many items:
R
colors <- c("purple", "springgreen", "yellow", "brown", "navy")
Now we can create our plot:
R
ggplot() +
geom_sf(data = state_boundary_US,
mapping = aes(color = region), size = 1) +
scale_color_manual(values = colors) +
labs(title = "Contiguous U.S. State Boundaries") +
coord_sf()
Content from Plot Multiple Vector and Raster Layers
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I create map compositions with custom legends using ggplot?
- How can I plot raster and vector data together?
Objectives
- Plot multiple vector layers in the same plot.
- Apply custom symbols to spatial objects in a plot.
- Create a multi-layered plot with raster and vector data.
This episode builds upon previous episodes that work with vector layers in R and explore how to plot multiple vector layers. It also covers how to plot raster and vector data together on the same plot.
Load the Data
To work with vector data in R, we can use the sf
package. To work with raster data in R, we can use the
terra
package. Make sure that you have these packages
loaded.
We will continue to work with the three ESRI shapefile
that we loaded in the Open and
Plot Vector Layers in R episode and the raster file we used in
previous episodes.
R
library(terra)
library(ggplot2)
library(dplyr)
library(sf)
R
# Load the data
aoi_boundary_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"
lines_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp"
point_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"
CHM_HARV <- rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif"
# Load the color palette from earlier lessons
road_colors <- c("blue", "green", "navy", "purple")
Plotting Multiple Vector Layers
In the previous episode, we learned how to plot information from a single vector layer and do some plot customization including adding a custom legend. However, what if we want to create a more complex plot with many vector layers and unique symbols that need to be represented clearly in a legend?
Now, let’s create a plot that combines our tower location
(point_HARV
), site boundary
(aoi_boundary_HARV
) and roads (lines_HARV
)
spatial objects. We will need to build a custom legend as well.
To begin, we will create a plot with the site boundary as the first
layer. Then layer the tower location and road data on top using
+
.
R
ggplot() +
geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
geom_sf(data = point_HARV) +
labs(title = "NEON Harvard Forest Field Site") +
coord_sf()
Next, let’s build a custom legend using the symbology (the colors and symbols) that we used to create the plot above.
R
ggplot() +
geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
geom_sf(data = point_HARV) +
scale_color_manual(values = road_colors) +
labs(title = "NEON Harvard Forest Field Site") +
coord_sf()
What if we wanted to have a legend for the tower location (the
point)? We can trick R into adding the legend for us, by specifying an
aesthetic and value to plot within the mapping = aes()
function for that layer. We’ve already used the color aesthetic in the
plot for the road colors, so we’ll have to use another aesthetic type
for getting the tower location to show up. In this example, let’s use
the fill aesthetic (note that there are also scales we can apply to the
point shape and alpha [transparency]).
We can do that by changing the line of code that plots the point:
geom_sf(data = point_HARV, aes(fill = "Tower location"))
This tells the plot that we want to color the “fill” of the point to be dictated by the “Tower location” label. We can pair this with a custom fill palette so that it’s plotted the way we want. For example:
scale_fill_manual(values = c("Tower location" = "black"))
This tells the plot that anything filled with the “Tower location” tag will actually be colored black.
R
ggplot() +
geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
geom_sf(data = point_HARV, mapping = aes(fill = "Tower location")) +
scale_color_manual(values = road_colors) +
scale_fill_manual(values = c("Tower location" = "black")) +
labs(title = "NEON Harvard Forest Field Site") +
coord_sf()
Now lets adjust the legend titles by passing a name
to
the respective color
and fill
palettes. We can
do this within the color_scale_manual()
and
color_fill_manual()
functions, or within the
labs()
function. If we don’t want a title for either of
those values, we can assign the name to NULL.
R
ggplot() +
geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
geom_sf(data = point_HARV, mapping = aes(fill = "Tower location")) +
scale_color_manual(values = road_colors) +
scale_fill_manual(values = c("Tower location" = "black")) +
labs(title = "NEON Harvard Forest Field Site",
color = "Line Type",
fill = NULL) +
coord_sf()
Finally, it might be better if the points were symbolized as a
symbol. We can customize this using shape
parameters in our
call to geom_sf
: 16 is a point symbol, 15 is a box.
R
ggplot() +
geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
geom_sf(data = point_HARV, mapping = aes(fill = "Tower location"), shape = 15) +
scale_color_manual(values = road_colors) +
scale_fill_manual(values = c("Tower location" = "black")) +
labs(title = "NEON Harvard Forest Field Site",
color = "Line Type",
fill = NULL) +
coord_sf()
Challenge 1: Plot Polygon by Attribute
Using the
NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp
ESRIshapefile
, create a map of study plot locations, with each point colored by the soil type (soilTypeOr
). How many different soil types are there at this particular field site? Overlay this layer on top of thelines_HARV
layer (the roads). Create a custom legend that applies line symbols to lines and point symbols to the points.Modify the plot above. Tell R to plot each point, using a different symbol of
shape
value.
First we need to read in the data and see how many unique soils are
represented in the soilTypeOr
attribute.
R
plot_locations <-
st_read("data/NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp")
OUTPUT
Reading layer `PlotLocations_HARV' from data source
`/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp'
using driver `ESRI Shapefile'
Simple feature collection with 21 features and 25 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 731405.3 ymin: 4712845 xmax: 732275.3 ymax: 4713846
Projected CRS: WGS 84 / UTM zone 18N
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp"
unique(plot_locations$soilTypeOr)
OUTPUT
[1] "Inceptisols" "Histosols"
Next we can create a new color palette with one color for each soil type.
R
blue_orange <- c("cornflowerblue", "darkorange")
Finally, we will create our plot.
R
ggplot() +
geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
geom_sf(data = plot_locations, mapping = aes(fill = soilTypeOr),
shape = 21) +
scale_color_manual(values = road_colors) +
scale_fill_manual(values = blue_orange) +
labs(title = "NEON Harvard Forest Field Site",
color = "Line Type",
fill = "Soil Type") +
coord_sf()
If we want each soil to be shown with a different symbol, we can give
multiple values to the scale_shape_manual()
argument.
This get’s a little tricky though, because now we have a legend for the fill color and a legend for the shape for the soil type from the plot locations. We can override the aesthetics of the legends so that we only display one of the legends for the plot soil type - a legend that includes the right shape and the right colors. We do that by removing one of the legends from the plot and altering the legend we want to keep.
Let’s change the scale_shape_manual()
to include the
appropriate fill colors for the points. To do this, we will use
guide = guide_legend()
within the
scale_shape_manual()
function. We override the aesthetics
by using override.aes
and providing it with a list of
things to override.
Putting it all together results in a line of code that looks like
this:
scale_shape_manual(values = c(21, 22), guide = guide_legend(override.aes = list(fill = blue_orange)))
Here, we’re specifying that we want to keep the legend for the
shapes, but we want the legend (the guide) to look different than the
default. We want to override the default fill values (which is black) to
the colors in the blue_orange
vector. This will make the
legend match the actual points on the plot.
It’s a lot to take in, we know, but it might come in handy for our own research!
R
ggplot() +
geom_sf(data = lines_HARV, mapping = aes(color = TYPE)) +
geom_sf(data = plot_locations, mapping = aes(fill = soilTypeOr, shape = soilTypeOr)) +
scale_color_manual(values = road_colors) +
scale_fill_manual(values = blue_orange, guide = "none") +
scale_shape_manual(values = c(21, 22),
guide = guide_legend(override.aes = list(fill = blue_orange))) +
labs(title = "NEON Harvard Forest Field Site",
color = "Line Type",
shape = "Soil Type") +
coord_sf()
Plotting Raster & Vector Data Together
You can plot vector data layered on top of raster data using the
ggplot2
package or the tmap
package.
We’ll first go into how to do this using ggplot
, and
then we’ll create a similar plot using tmap
.
Together, let’s create a plot that uses the NEON AOI Canopy Height
Model
data/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif
(or
data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif
depending on your set up) as a base layer. On top of the CHM, we will
add the study site AOI, roads, and the tower location.
Remember for ggplot
to be able to plot raster data, we
first need to convert it to a data frame.
R
CHM_HARV_df <- as.data.frame(CHM_HARV, xy = TRUE)
Now we can add it to the plot as a geom_raster()
object:
R
ggplot() +
geom_raster(data = CHM_HARV_df,
mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
geom_sf(data = lines_HARV) +
geom_sf(data = point_HARV) +
labs(title = "NEON Harvard Forest Field Site w/ Canopy Height Model",
fill = "Canopy Height") +
coord_sf()
Now we can make a similar plot using tmap
.
If you haven’t downloaded the tmap
package yet, you can
do so by going to the Packages tab in your R console and installing a
new package, or you can type the following code into your R console:
install.packages("tmap")
Once downloaded, remember to load tmap
into your
environment using:
R
library(tmap)
The syntax for tmap
is to first specify a layer to plot
and then to specify the kind of layer it is. For example, if you wanted
to plot a raster first, and then a vector layer on top of it, you would
write something like this:
R
tm_shape(raster_object) + # first we specify the object to plot first
tm_raster() + # then we specify the kind of object it is
tm_shape(vector_object) + # then we specify the object to plot next
tm_sf() # then we specify the kind of object it is
You can specify things like color and shape within the
tm_sf()
function as col = "black"
or
shape = 8
for example.
R
# Create the plot
tm_shape(CHM_HARV) +
tm_raster() +
tm_shape(lines_HARV) +
tm_sf() +
tm_shape(aoi_boundary_HARV) +
tm_sf(col = "grey20") +
tm_shape(point_HARV) +
tm_sf(col = "black")
OUTPUT
SpatRaster object downsampled to 898 by 1115 cells.
Content from Handling Spatial Projection & CRS
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- What do I do when vector data don’t line up?
Objectives
- Plot vector objects with different CRSs in the same plot.
In an earlier episode we learned how to handle a situation where you have two different files with raster data in different projections. Now we will apply those same principles to working with vector data. We will create a base map of our study site using United States state and country boundary information accessed from the United States Census Bureau. We will learn how to map vector data that are in different CRSs and thus don’t line up on a map.
We will continue to work with the some of the ESRI
shapefiles
that we loaded in the previous episodes.
R
library(sf)
library(ggplot2)
library(dplyr)
R
point_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"
Working With Spatial Data From Different Sources
We often need to gather spatial datasets from different sources and/or data that cover different spatial extents. These data are often in different Coordinate Reference Systems (CRSs).
Some reasons for data being in different CRSs include:
- The data are stored in a particular CRS convention used by the data provider (for example, a government agency).
- The data are stored in a particular CRS that is customized to a region. For instance, many states in the US prefer to use a State Plane projection customized for that state.
Notice the differences in shape associated with each different projection. These differences are a direct result of the calculations used to “flatten” the data onto a 2-dimensional map. Often data are stored purposefully in a particular projection that optimizes the relative shape and size of surrounding geographic boundaries (states, counties, countries, etc).
In this episode we will learn how to identify and manage spatial data in different projections. We will learn how to reproject the data so that they are in the same projection to support plotting / mapping. Note that these skills are also required for any geoprocessing / spatial analysis. Data need to be in the same CRS to ensure accurate results.
We will continue to use the sf
package in this
episode.
Import US Boundaries - Census Data
There are many good sources of boundary base layers that we can use
to create a basemap. Some R packages even have these base layers built
in to support quick and efficient mapping. In this episode, we will use
boundary layers for the contiguous United States, provided by the United
States Census Bureau. It is useful to have vector layers in ESRI’s
shapefile
format to work with because we can add additional
attributes to them if need be - for project specific mapping.
Read US Boundary File
We will use the st_read()
function to import the
/US-Boundary-Layers/US-State-Boundaries-Census-2014
layer
into R. This layer contains the boundaries of all contiguous states in
the U.S. Please note that these data have been modified and reprojected
from the original data downloaded from the Census website to support the
learning goals of this episode. We only care about the x and y
coordinates, not the z or m dimensions, so we will use
st_zm()
to drop the coordinates that we do not need.
R
state_boundary_US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp") %>%
st_zm()
OUTPUT
Reading layer `US-State-Boundaries-Census-2014' from data source
`/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp'
using driver `ESRI Shapefile'
Simple feature collection with 58 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension: XY, XYZ
Bounding box: xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp"
Next, let’s plot the U.S. states data:
R
ggplot() +
geom_sf(data = state_boundary_US) +
labs(title = "Map of Contiguous US State Boundaries") +
coord_sf()
U.S. Boundary Layer
We can add a boundary layer of the United States to our map - to make
it look nicer. We will import
NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States
.
R
country_boundary_US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp") %>%
st_zm()
OUTPUT
Reading layer `US-Boundary-Dissolved-States' from data source
`/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XYZ
Bounding box: xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp"
If we specify a thicker line width using linewidth = 0.5
for the border layer, it will make our map pop! We will also manually
set the colors of the state boundaries and country boundaries.
R
ggplot() +
geom_sf(data = state_boundary_US, color = "gray60") +
geom_sf(data = country_boundary_US, color = "black", alpha = 0.25, linewidth = 0.5) +
labs(title = "Map of Contiguous US State Boundaries") +
coord_sf()
Next, let’s add the location of a flux tower where our study area is. As we are adding these layers, take note of the CRS of each object. First let’s look at the CRS of our tower location object:
R
st_crs(point_HARV)$proj4string
OUTPUT
[1] "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"
Our project string for point_HARV
specifies the UTM
projection as follows:
+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
- proj=utm: the projection is UTM, UTM has several zones.
- zone=18: the zone is 18
- datum=WGS84: the datum WGS84 (the datum refers to the 0,0 reference for the coordinate system used in the projection)
- units=m: the units for the coordinates are in METERS.
Note that the zone
is unique to the UTM projection. Not
all CRSs will have a zone.
Let’s check the CRS of our state and country boundary objects:
R
st_crs(state_boundary_US)$proj4string
OUTPUT
[1] "+proj=longlat +datum=WGS84 +no_defs"
R
st_crs(country_boundary_US)$proj4string
OUTPUT
[1] "+proj=longlat +datum=WGS84 +no_defs"
Our project string for state_boundary_US
and
country_boundary_US
specifies the lat/long projection as
follows:
+proj=longlat +datum=WGS84 +no_defs
- proj=longlat: the data are in a geographic (latitude and longitude) coordinate system
- datum=WGS84: the datum WGS84 (the datum refers to the 0,0 reference for the coordinate system used in the projection)
- no_defs: ensures that no defaults are used, but this is now obsolete
Note that there are no specified units above. This is because this geographic coordinate reference system is in latitude and longitude which is most often recorded in decimal degrees.
CRS Units - View Object Extent
Next, let’s view the extent or spatial coverage for the
point_HARV
spatial object compared to the
state_boundary_US
object.
First we’ll look at the extent for our study site:
R
st_bbox(point_HARV)
OUTPUT
xmin ymin xmax ymax
732183.2 4713265.0 732183.2 4713265.0
And then the extent for the state boundary data.
R
st_bbox(state_boundary_US)
OUTPUT
xmin ymin xmax ymax
-124.72584 24.49813 -66.94989 49.38436
Note the difference in the units for each object. The extent for
state_boundary_US
is in latitude and longitude which yields
smaller numbers representing decimal degree units. Our tower location
point is in UTM, is represented in meters.
Reproject Vector Data or No?
We saw in an earlier episode
that when working with raster data in different CRSs, we needed to
convert all objects to the same CRS. We can do the same thing with our
vector data - however, we don’t need to! When using the
ggplot2
package, ggplot
automatically converts
all objects to the same CRS before plotting. This means we can plot our
three data sets together without doing any conversion:
R
ggplot() +
geom_sf(data = state_boundary_US, color = "gray60") +
geom_sf(data = country_boundary_US, linewidth = 0.5, alpha = 0.25, color = "black") +
geom_sf(data = point_HARV, shape = 19, color = "purple") +
labs(title = "Map of Contiguous US State Boundaries") +
coord_sf()
Challenge 1: Plot Multiple Layers of Spatial Data
Create a map of the North Eastern United States as follows:
- Import and plot
Boundary-US-State-NEast.shp
. Adjust line width as necessary. - Layer the Fisher Tower (in the NEON Harvard Forest site) point
location
point_HARV
onto the plot. - Add a title.
- Add a legend that shows both the state boundary and the Tower location point.
R
NE.States.Boundary.US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/Boundary-US-State-NEast.shp") %>%
st_zm()
OUTPUT
Reading layer `Boundary-US-State-NEast' from data source
`/Users/echelleburns/Documents/2024-07-01-ucsb-intro-geospatial/site/built/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/Boundary-US-State-NEast.shp'
using driver `ESRI Shapefile'
Simple feature collection with 12 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY, XYZ
Bounding box: xmin: -80.51989 ymin: 37.91685 xmax: -66.9499 ymax: 47.45716
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
R
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/Boundary-US-State-NEast.shp"
ggplot() +
geom_sf(data = NE.States.Boundary.US, mapping = aes(color ="State Boundary")) +
scale_color_manual(values = c("State Boundary" = "gray18")) +
geom_sf(data = point_HARV, mapping = aes(shape = "Fisher Tower"),
color = "purple") +
scale_shape_manual(values = c("Fisher Tower" = 19)) +
labs(color = NULL,
shape = NULL,
title = "Fisher Tower location") +
coord_sf()
Content from Convert from .csv to a Vector Layer
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I import CSV files as vector layers in R?
Objectives
- Import .csv files containing x,y coordinate locations into R as a data frame.
- Convert a data frame to a spatial object.
- Export a spatial object to a text file.
This episode will review how to import spatial points stored in
.csv
(Comma Separated Value) format into R as an
sf
spatial object. We will also reproject data imported
from an ESRI shapefile
format and export the reprojected
data as an ESRI shapefile
.
Let’s load our packages and data for this episode:
R
library(ggplot2)
library(dplyr)
library(sf)
R
aoi_boundary_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"
country_boundary_US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp"
point_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"
Spatial Data in Text Format
The HARV_PlotLocations.csv
file contains
x, y
(point) locations for study plot where NEON collects
data on vegetation
and other ecological metrics. We would like to:
- Create a map of these plot locations.
- Export the data in an ESRI
shapefile
format to share with our colleagues. Thisshapefile
can be imported into most GIS software. - Create a map showing vegetation height with plot locations layered on top.
Spatial data are sometimes stored in a text file format
(.txt
or .csv
). If the text file has an
associated x
and y
location column, then we
can convert it into an sf
spatial object. The
sf
object allows us to store both the x,y
values that represent the coordinate location of each point and the
associated attribute data - or columns describing each feature in the
spatial object.
We will continue using the sf
package in this
episode.
Import .csv
To begin let’s import a .csv
file that contains plot
coordinate x, y
locations at the NEON Harvard Forest Field
Site (HARV_PlotLocations.csv
) and look at the structure of
that new object:
R
plot_locations_HARV <-
read.csv("data/NEON-DS-Site-Layout-Files/HARV/HARV_PlotLocations.csv")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_PlotLocations.csv"
str(plot_locations_HARV)
OUTPUT
'data.frame': 21 obs. of 16 variables:
$ easting : num 731405 731934 731754 731724 732125 ...
$ northing : num 4713456 4713415 4713115 4713595 4713846 ...
$ geodeticDa: chr "WGS84" "WGS84" "WGS84" "WGS84" ...
$ utmZone : chr "18N" "18N" "18N" "18N" ...
$ plotID : chr "HARV_015" "HARV_033" "HARV_034" "HARV_035" ...
$ stateProvi: chr "MA" "MA" "MA" "MA" ...
$ county : chr "Worcester" "Worcester" "Worcester" "Worcester" ...
$ domainName: chr "Northeast" "Northeast" "Northeast" "Northeast" ...
$ domainID : chr "D01" "D01" "D01" "D01" ...
$ siteID : chr "HARV" "HARV" "HARV" "HARV" ...
$ plotType : chr "distributed" "tower" "tower" "tower" ...
$ subtype : chr "basePlot" "basePlot" "basePlot" "basePlot" ...
$ plotSize : int 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 ...
$ elevation : num 332 342 348 334 353 ...
$ soilTypeOr: chr "Inceptisols" "Inceptisols" "Inceptisols" "Histosols" ...
$ plotdim_m : int 40 40 40 40 40 40 40 40 40 40 ...
We now have a data frame that contains 21 locations (rows) and 16
variables (attributes). Note that all of our character data was imported
into R as character (text) data. Next, let’s explore the dataframe to
determine whether it contains columns with coordinate values. If we are
lucky, our .csv
will contain columns labeled:
- “X” and “Y” OR
- Latitude and Longitude OR
- easting and northing (UTM coordinates)
Let’s check out the column names of our dataframe.
R
names(plot_locations_HARV)
OUTPUT
[1] "easting" "northing" "geodeticDa" "utmZone" "plotID"
[6] "stateProvi" "county" "domainName" "domainID" "siteID"
[11] "plotType" "subtype" "plotSize" "elevation" "soilTypeOr"
[16] "plotdim_m"
Identify X,Y Location Columns
Our column names include several fields that might contain spatial
information. The plot_locations_HARV$easting
and
plot_locations_HARV$northing
columns contain coordinate
values. We can confirm this by looking at the first six rows of our
data.
R
head(plot_locations_HARV$easting)
OUTPUT
[1] 731405.3 731934.3 731754.3 731724.3 732125.3 731634.3
R
head(plot_locations_HARV$northing)
OUTPUT
[1] 4713456 4713415 4713115 4713595 4713846 4713295
We have coordinate values in our data frame. In order to convert our
data frame to an sf
object, we also need to know the CRS
associated with those coordinate values.
There are several ways to figure out the CRS of spatial data in text format.
- We can check the file metadata in hopes that the CRS was recorded in the data.
- We can explore the file itself to see if CRS information is embedded in the file header or somewhere in the data columns.
Following the easting
and northing
columns,
there is a geodeticDa
and a utmZone
column.
These appear to contain CRS information (datum
and
projection
). Let’s view those next.
R
head(plot_locations_HARV$geodeticDa)
OUTPUT
[1] "WGS84" "WGS84" "WGS84" "WGS84" "WGS84" "WGS84"
R
head(plot_locations_HARV$utmZone)
OUTPUT
[1] "18N" "18N" "18N" "18N" "18N" "18N"
It is not typical to store CRS information in a column. But this
particular file contains CRS information this way. The
geodeticDa
and utmZone
columns contain the
information that helps us determine the CRS:
-
geodeticDa
: WGS84 – this is geodetic datum WGS84 -
utmZone
: 18
In the previous
episode we learned about the components of a proj4
string. We have everything we need to assign a CRS to our data
frame.
To create the proj4
associated with UTM Zone 18 WGS84 we
can look up the projection on the Spatial Reference
website, which contains a list of CRS formats for each projection.
From here, we can extract the proj4
string for UTM Zone 18N WGS84.
However, if we have other data in the UTM Zone 18N projection, it’s
much easier to use the st_crs()
function to extract the CRS
in proj4
format from that object and assign it to our new
spatial object. We’ve seen this CRS before with our Harvard Forest study
site (point_HARV
).
R
st_crs(point_HARV)
OUTPUT
Coordinate Reference System:
User input: WGS 84 / UTM zone 18N
wkt:
PROJCRS["WGS 84 / UTM zone 18N",
BASEGEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]],
CONVERSION["UTM zone 18N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-75,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
ID["EPSG",32618]]
The output above shows that the points vector layer is in UTM zone
18N. We can thus use the CRS from that spatial object to convert our
non-spatial dataframe into an sf
object.
Next, let’s create a crs
object that we can use to
define the CRS of our sf
object when we create it.
R
utm18nCRS <- st_crs(point_HARV)
utm18nCRS
OUTPUT
Coordinate Reference System:
User input: WGS 84 / UTM zone 18N
wkt:
PROJCRS["WGS 84 / UTM zone 18N",
BASEGEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]],
CONVERSION["UTM zone 18N",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",0,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-75,
ANGLEUNIT["Degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",500000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
ID["EPSG",32618]]
R
class(utm18nCRS)
OUTPUT
[1] "crs"
.csv to sf object
Next, let’s convert our dataframe into an sf
object. To
do this, we need to specify:
- The columns containing X (
easting
) and Y (northing
) coordinate values - The CRS that the column coordinate represent (units are included in
the CRS) - stored in our
utmCRS
object.
We will use the st_as_sf()
function to perform the
conversion.
R
plot_locations_sp_HARV <- st_as_sf(plot_locations_HARV,
coords = c("easting", "northing"),
crs = utm18nCRS)
We should double check the CRS to make sure it is correct and matches
the CRS of point_HARV
.
R
st_crs(plot_locations_sp_HARV) == st_crs(point_HARV)
OUTPUT
[1] TRUE
Plot Spatial Object
We now have a spatial R object, we can plot our newly created spatial object.
R
ggplot() +
geom_sf(data = plot_locations_sp_HARV) +
labs(title = "Map of Plot Locations") +
coord_sf()
Plot Extent
In Open and Plot Vector
Layers in R we learned about spatial object extent. When we plot
several spatial layers in R using ggplot
, all of the layers
of the plot are considered in setting the boundaries of the plot. To
show this, let’s plot our country_boundary_US
object.
R
ggplot() +
geom_sf(data = country_boundary_US) +
geom_sf(data = plot_locations_sp_HARV) +
labs(title = "Country Boundary Plot") +
coord_sf()
When we plot the two layers together, ggplot
sets the
plot boundaries so that they are large enough to include all of the data
included in all of the layers. That’s really handy!
Challenge 1: Import & Plot Additional Points
We want to add two phenology plots to our existing map of vegetation plot locations.
Import the .csv: HARV/HARV_2NewPhenPlots.csv
into R and
do the following:
- Find the X and Y coordinate locations. Which value is X and which value is Y?
- These data were collected in a geographic coordinate system (WGS84).
Convert the dataframe into an
sf
object. - Plot the new points with the plot location points from above. Be sure to add a legend. Use a different symbol for the 2 new points!
If you have extra time, feel free to add roads and other layers to your map!
- First we will read in the new csv file and look at the data structure.
R
newplot_locations_HARV <-
read.csv("data/NEON-DS-Site-Layout-Files/HARV/HARV_2NewPhenPlots.csv")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARV_2NewPhenPlots.csv"
str(newplot_locations_HARV)
OUTPUT
'data.frame': 2 obs. of 13 variables:
$ decimalLat: num 42.5 42.5
$ decimalLon: num -72.2 -72.2
$ country : chr "unitedStates" "unitedStates"
$ stateProvi: chr "MA" "MA"
$ county : chr "Worcester" "Worcester"
$ domainName: chr "Northeast" "Northeast"
$ domainID : chr "D01" "D01"
$ siteID : chr "HARV" "HARV"
$ plotType : chr "tower" "tower"
$ subtype : chr "phenology" "phenology"
$ plotSize : int 40000 40000
$ plotDimens: chr "200m x 200m" "200m x 200m"
$ elevation : num 358 346
- The US boundary data we worked with previously is in a geographic
WGS84 CRS. We can use that data to establish a CRS for this data. First
we will extract the CRS from the
country_boundary_US
object and confirm that it is WGS84.
R
geogCRS <- st_crs(country_boundary_US)
geogCRS
OUTPUT
Coordinate Reference System:
User input: WGS 84
wkt:
GEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]]
Then we will convert our new data to a spatial dataframe, using the
geogCRS
object as our CRS.
R
newPlot.Sp.HARV <- st_as_sf(newplot_locations_HARV,
coords = c("decimalLon", "decimalLat"),
crs = geogCRS)
Next we’ll confirm that the CRS for our new object is correct.
R
st_crs(newPlot.Sp.HARV)
OUTPUT
Coordinate Reference System:
User input: WGS 84
wkt:
GEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]]
We will be adding these new data points to the plot we created
before. The data for the earlier plot was in UTM. Since we’re using
ggplot
, it will reproject the data for us.
- Now we can create our plot.
R
ggplot() +
geom_sf(data = plot_locations_sp_HARV, color = "orange") +
geom_sf(data = newPlot.Sp.HARV, color = "lightblue") +
labs(title = "Map of All Plot Locations") +
coord_sf()
Export to an ESRI shapefile
We can write an R spatial object to an ESRI shapefile
using the st_write
function in sf
. To do this
we need the following arguments:
- the name of the spatial object
(
plot_locations_sp_HARV
) - the directory where we want to save our ESRI
shapefile
(to usecurrent = getwd()
or you can specify a different path) - the name of the new ESRI
shapefile
(PlotLocations_HARV
) - the driver which specifies the file format (ESRI Shapefile)
We can now export the spatial object as an ESRI
shapefile
.
R
st_write(obj = plot_locations_sp_HARV,
dsn = "data/cleaned-data/PlotLocations_HARV.shp",
driver = "ESRI Shapefile")
Content from Manipulate Raster Data
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I crop raster objects to vector objects, and extract the summary of raster pixels?
Objectives
- Crop a raster to the extent of a vector layer.
- Extract values from a raster that correspond to a vector file overlay.
This episode explains how to crop a raster using the extent of a vector layer. We will also cover how to extract values from a raster that occur within a set of polygons, or in a buffer (surrounding) region around a set of points.
Let’s start by loading the libraries and data we need for this episode:
R
library(sf)
library(terra)
library(ggplot2)
library(dplyr)
R
point_HARV <-
st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp"
aoi_boundary_HARV <-
st_read("data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp"
# CHM data, and convert it to a dataframe
CHM_HARV <-
rast("data/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Airborne-Remote-Sensing/HARV/CHM/HARV_chmCrop.tif"
CHM_HARV_df <- as.data.frame(CHM_HARV, xy = TRUE)
R
plot_locations_sp_HARV <- st_read("data/cleaned-data/PlotLocations_HARV.shp")
Crop a Raster to Vector Extent
We often work with spatial layers that have different spatial extents. The spatial extent of a vector layer or R spatial object represents the geographic “edge” or location that is the furthest north, south east and west. Thus it represents the overall geographic coverage of the spatial object.
Image Source: National Ecological Observatory Network (NEON)
The graphic below illustrates the extent of several of the spatial layers that we have worked with in this workshop:
- Area of interest (AOI) – blue
- Roads and trails – purple
- Vegetation plot locations (marked with white dots)– black
- A canopy height model (CHM) in GeoTIFF format – green
Frequent use cases of cropping a raster file include reducing file size and creating maps. Sometimes we have a raster file that is much larger than our study area or area of interest. It is often more efficient to crop the raster to the extent of our study area to reduce file sizes as we process our data. Cropping a raster can also be useful when creating pretty maps so that the raster layer matches the extent of the desired vector layers.
Crop a Raster Using Vector Extent
We can use the crop()
function to crop a raster to the
extent of another spatial object. To do this, we need to specify the
raster to be cropped and the spatial object that will be used to crop
the raster. R will use the extent
of the spatial object as
the cropping boundary.
To illustrate this, we will crop the Canopy Height Model (CHM) to
only include the area of interest (AOI). Let’s start by plotting the
full extent of the CHM data and overlay where the AOI falls within it.
The boundaries of the AOI will be colored blue, and we use
fill = NA
to make the area transparent.
R
ggplot() +
geom_raster(data = CHM_HARV_df,
mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
scale_fill_gradientn(colors = terrain.colors(10)) +
geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
labs(fill = "Canopy Height") +
coord_sf()
Now that we have visualized the area of the CHM we want to subset, we
can perform the cropping operation. We are going to crop()
function from the raster package to create a new object with only the
portion of the CHM data that falls within the boundaries of the AOI.
R
CHM_HARV_Cropped <- crop(x = CHM_HARV, y = aoi_boundary_HARV)
Now we can plot the cropped CHM data, along with a boundary box
showing the full CHM extent. However, remember, since this is raster
data, we need to convert to a data frame in order to plot using
ggplot
. To get the boundary box from CHM, the
st_bbox()
will extract the 4 corners of the rectangle that
encompass all the features contained in this object. The
st_as_sfc()
converts these 4 coordinates into a polygon
that we can plot:
R
CHM_HARV_Cropped_df <- as.data.frame(CHM_HARV_Cropped, xy = TRUE)
boundary <- st_as_sfc(st_bbox(CHM_HARV))
ggplot() +
geom_sf(data = boundary, fill = "green",
color = "green", alpha = .2) +
geom_raster(data = CHM_HARV_Cropped_df,
mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
scale_fill_gradientn(colors = terrain.colors(10)) +
labs(fill = "Canopy Height") +
coord_sf()
The plot above shows that the full CHM extent (plotted in green) is
much larger than the resulting cropped raster. Our new cropped CHM now
has the same extent as the aoi_boundary_HARV
object that
was used as a crop extent (blue border below).
R
ggplot() +
geom_raster(data = CHM_HARV_Cropped_df,
mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
scale_fill_gradientn(colors = terrain.colors(10)) +
labs(fill = "Canopy Height") +
coord_sf()
R
CHM_plots_HARVcrop <- crop(x = CHM_HARV, y = plot_locations_sp_HARV)
CHM_plots_HARVcrop_df <- as.data.frame(CHM_plots_HARVcrop, xy = TRUE)
ggplot() +
geom_raster(data = CHM_plots_HARVcrop_df,
mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
scale_fill_gradientn(colors = terrain.colors(10)) +
geom_sf(data = plot_locations_sp_HARV) +
labs(fill = "Canopy Height") +
coord_sf()
In the plot above, created in the challenge, all the vegetation plot locations (black dots) appear on the Canopy Height Model raster layer except for one. One is situated on the blank space to the left of the map. Why?
A modification of the first figure in this episode is below, showing
the relative extents of all the spatial objects. Notice that the extent
for our vegetation plot layer (black) extends further west than the
extent of our CHM raster (bright green). The crop()
function will make a raster extent smaller, it will not expand the
extent in areas where there are no data. Thus, the extent of our
vegetation plot layer will still extend further west than the extent of
our (cropped) raster data (dark green).
Define an Extent
So far, we have used a vector layer to crop the extent of a raster
dataset. Alternatively, we can also the ext()
function to
define an extent to be used as a cropping boundary. This creates a new
object of class extent. Here we will provide the ext()
function our xmin, xmax, ymin, and ymax (in that order).
R
new_extent <- ext(732161.2, 732238.7, 4713249, 4713333)
class(new_extent)
OUTPUT
[1] "SpatExtent"
attr(,"package")
[1] "terra"
Once we have defined our new extent, we can use the
crop()
function to crop our raster to this extent
object.
R
CHM_HARV_manual_cropped <- crop(x = CHM_HARV, y = new_extent)
To plot this data using ggplot()
we need to convert it
to a dataframe.
R
CHM_HARV_manual_cropped_df <- as.data.frame(CHM_HARV_manual_cropped, xy = TRUE)
Now we can plot this cropped data. We will show the AOI boundary on the same plot for scale.
R
ggplot() +
geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
geom_raster(data = CHM_HARV_manual_cropped_df,
mapping = aes(x = x, y = y, fill = HARV_chmCrop)) +
scale_fill_gradientn(colors = terrain.colors(10)) +
labs(fill = "Canopy Height") +
coord_sf()
Extract Raster Pixels Values Using Vector Polygons
Often we want to extract values from a raster layer for particular locations - for example, plot locations that we are sampling on the ground. We can extract all pixel values within 20m of our x,y point of interest. These can then be summarized into some value of interest (e.g. mean, maximum, total).
Image Source: National Ecological Observatory Network (NEON)
To do this in R, we use the extract()
function. The
extract()
function requires:
- The raster that we wish to extract values from,
- The vector layer containing the polygons that we wish to use as a boundary or boundaries,
- We can also tell it to store the output values in a data frame using
raw = FALSE
(this is optional).
We will begin by extracting all canopy height pixel values located
within our aoi_boundary_HARV
polygon which surrounds the
tower located at the NEON Harvard Forest field site.
R
tree_height <- extract(x = CHM_HARV, y = aoi_boundary_HARV, raw = FALSE)
str(tree_height)
OUTPUT
'data.frame': 18450 obs. of 2 variables:
$ ID : num 1 1 1 1 1 1 1 1 1 1 ...
$ HARV_chmCrop: num 21.2 23.9 23.8 22.4 23.9 ...
When we use the extract()
function, R extracts the value
for each pixel located within the boundary of the polygon being used to
perform the extraction - in this case the aoi_boundary_HARV
object (a single polygon). Here, the function extracted values from
18,450 pixels.
We can create a histogram of tree height values within the boundary
to better understand the structure or height distribution of trees at
our site. We will use the column HARV_chmCrop
from our data
frame as our x values, as this column represents the tree heights for
each pixel.
R
ggplot() +
geom_histogram(data = tree_height, mapping = aes(x = HARV_chmCrop)) +
labs(title = "Histogram of CHM Height Values (m)",
x = "Tree Height",
y = "Frequency of Pixels")
OUTPUT
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can also use the summary()
function to view
descriptive statistics including min, max, and mean height values. These
values help us better understand vegetation at our field site.
R
summary(tree_height$HARV_chmCrop)
OUTPUT
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.03 21.36 22.81 22.43 23.97 38.17
Summarize Extracted Raster Values
We often want to extract summary values from a raster. We can tell R
the type of summary statistic we are interested in using the
fun =
argument. Let’s extract a mean height value for our
AOI.
R
mean_tree_height_AOI <- extract(x = CHM_HARV, y = aoi_boundary_HARV,
fun = mean)
mean_tree_height_AOI
OUTPUT
ID HARV_chmCrop
1 1 22.43018
It appears that the mean height value, extracted from our LiDAR data derived canopy height model is 22.43 meters.
Extract Data using x,y Locations
We can also extract pixel values from a raster by defining a buffer
or area surrounding individual point locations using the
st_buffer()
function. To do this we define the summary
argument (fun = mean
) and the buffer distance
(dist = 20
) which represents the radius of a circular
region around each point. By default, the units of the buffer are the
same units as the data’s CRS. All pixels that are touched by the buffer
region are included in the extract.
Image Source: National Ecological Observatory Network (NEON)
Let’s put this into practice by figuring out the mean tree height in
the 20m around the tower location (point_HARV
).
R
mean_tree_height_tower <- extract(x = CHM_HARV,
y = st_buffer(point_HARV, dist = 20),
fun = mean)
mean_tree_height_tower
OUTPUT
ID HARV_chmCrop
1 1 22.38806
Challenge 2: Extract Raster Height Values For Plot Locations
Use the plot locations object (
plot_locations_sp_HARV
) to extract an average tree height for the area within 20m of each vegetation plot location in the study area. Because there are multiple plot locations, there will be multiple averages returned.Create a plot showing the mean tree height of each area.
R
# extract data at each plot location
mean_tree_height_plots_HARV <- extract(x = CHM_HARV,
y = st_buffer(plot_locations_sp_HARV,
dist = 20),
fun = mean)
# view data
mean_tree_height_plots_HARV
OUTPUT
ID HARV_chmCrop
1 1 NaN
2 2 23.96756
3 3 22.34937
4 4 16.49739
5 5 21.54419
6 6 19.16772
7 7 20.61651
8 8 21.61439
9 9 12.23006
10 10 19.13398
11 11 21.36966
12 12 19.32084
13 13 17.25975
14 14 20.47120
15 15 12.68301
16 16 15.51888
17 17 18.90894
18 18 18.19369
19 19 19.67441
20 20 20.23245
21 21 20.44984
R
# plot data
ggplot(data = mean_tree_height_plots_HARV,
mapping = aes(x = ID, y = HARV_chmCrop)) +
geom_col() +
labs(title = "Mean Tree Height at each Plot",
x = "Plot ID",
y = "Tree Height (m)")
WARNING
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_col()`).
Content from Create Publication-quality Graphics
Last updated on 2024-08-19 | Edit this page
Overview
Questions
- How can I deal with multiple raster layers?
- How can I create a publication-quality graphic and customize plot parameters?
Objectives
- Create a raster stack
- Assign custom names to bands in a RasterStack.
- Customize raster plots using the
ggplot2
package.
This episode covers how to customize your raster plots using the
ggplot2
package in R to create publication-quality
plots.
First, let’s load the libraries that we will need for this episode.
We will be using the tidyr
package in this episode,
which you might not have used yet!
If you haven’t downloaded tidyr
to your R yet, do so now
using:
install.packages("tidyr")
Once it’s downloaded, load in all the packages for this episode:
R
library(terra)
library(ggplot2)
library(dplyr)
library(tidyr)
Faceting plots
There may come a time when you will need to plot several smaller plots within a larger plot. The result is a plot with facets. These plots can be useful if you have several variables that you want to show at once, or if you’re trying to look at maps over time.
We have a time series of raster data that are spread across several files. These data show the normalized difference vegetation index (NDVI) from the Harvard Forest Site. We can best visualize these data by using facets. To do this, we need to read the data in as a “raster stack”.
First, let’s get the list of all the raster files in our time series.
We can do this using the list.files()
function. We will
only add files that have a .tif extension to our list. To do this, we
will use the syntax pattern=".tif$"
. If we specify
full.names = TRUE
, the full path for each file will be
added to the list.
R
NDVI_HARV_path <- "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI"
all_NDVI_HARV <- list.files(NDVI_HARV_path,
full.names = TRUE,
pattern = ".tif$")
# If you are getting an error, check your file path:
# You might need change your file path to:
# "data/2009586/NEON-DS-Landsat-NDVI/HARV/2011/NDVI"
It’s a good idea to look at the file names that matched our search to make sure they meet our expectations.
R
all_NDVI_HARV
OUTPUT
[1] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/005_HARV_ndvi_crop.tif"
[2] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/037_HARV_ndvi_crop.tif"
[3] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/085_HARV_ndvi_crop.tif"
[4] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/133_HARV_ndvi_crop.tif"
[5] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/181_HARV_ndvi_crop.tif"
[6] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/197_HARV_ndvi_crop.tif"
[7] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/213_HARV_ndvi_crop.tif"
[8] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/229_HARV_ndvi_crop.tif"
[9] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/245_HARV_ndvi_crop.tif"
[10] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/261_HARV_ndvi_crop.tif"
[11] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/277_HARV_ndvi_crop.tif"
[12] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/293_HARV_ndvi_crop.tif"
[13] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/309_HARV_ndvi_crop.tif"
Now we have a list of all GeoTIFF files in the NDVI directory for
Harvard Forest. The number at the start of the filenames represents the
julilan day. Next, we will create a stack of rasters from this list
using the rast()
function. This is the same function we use
to read in any raster file.
R
NDVI_HARV_stack <- rast(all_NDVI_HARV)
NDVI_HARV_stack
OUTPUT
class : SpatRaster
dimensions : 5, 4, 13 (nrow, ncol, nlyr)
resolution : 30, 30 (x, y)
extent : 239415, 239535, 4714215, 4714365 (xmin, xmax, ymin, ymax)
coord. ref. : UTM Zone 19, Northern Hemisphere
sources : 005_HARV_ndvi_crop.tif
037_HARV_ndvi_crop.tif
085_HARV_ndvi_crop.tif
... and 10 more source(s)
names : 005_H~_crop, 037_H~_crop, 085_H~_crop, 133_H~_crop, 181_H~_crop, 197_H~_crop, ...
min values : 2732, 1534, 1917, 5669, 8685, 8768, ...
max values : 5545, 3736, 3600, 6394, 8903, 9140, ...
We can see from looking that the data that we will need to scale the values. NDVI data should be a value between 0 and 1 (and it’s currently a value between 0 and 10,000!):
R
NDVI_HARV_stack <- NDVI_HARV_stack/10000
Once we have created our RasterStack, we can visualize our data. We
can use the ggplot()
command to create a multi-panelled
plot showing each band in our RasterStack. First we need to create a
data frame object. Because there are multiple columns in our data that
are not variables, we will tidy (or “pivot”) the data so that we have a
single column with the NDVI observations. We will use the function
pivot_longer()
from the tidyr
package to do
this. Since there are so many columns with relevant data, instead of
specifying the columns we want to pivot, we can specify the columns we
don’t want to pivot instead. We do this by specifying
cols = -(x:y)
:
R
NDVI_HARV_stack_df <- as.data.frame(NDVI_HARV_stack, xy = TRUE) %>%
pivot_longer(cols = -(x:y), names_to = "variable", values_to = "value")
Now we can plot our data using ggplot()
. We want to
create a separate panel for each time point in our time series, so we
will use the facet_wrap()
function to create a
multi-paneled plot by the variable
column:
R
ggplot() +
geom_raster(data = NDVI_HARV_stack_df,
mapping = aes(x = x, y = y, fill = value)) +
facet_wrap(~variable)
Although this plot is informative, it isn’t something we would expect to see in a journal publication. The x and y-axis labels aren’t informative. There is a lot of unnecessary gray background and the titles of each panel don’t clearly state that the number refers to the Julian day the data was collected. In this episode, we will customize this plot above to produce a publication quality graphic. We will go through these steps iteratively. When we’re done, we will have created the plot shown below.
Adjust the Plot Theme
The first thing we will do to our plot remove the x and y-axis labels
and axis ticks, as these are unnecessary and make our plot look messy.
We can do this by using a pre-set ggplot
theme that
simplifies all plot parameters. We will add the function
theme_void()
to tell ggplot
that we want to
use this theme.
R
ggplot() +
geom_raster(data = NDVI_HARV_stack_df,
mapping = aes(x = x, y = y, fill = value)) +
facet_wrap(~variable) +
labs(title = "Landsat NDVI",
subtitle = "NEON Harvard Forest",
fill = "NDVI") +
theme_void()
Next we will center our plot title and subtitle. We need to do this
after the theme_void()
layer, because R
interprets the ggplot
layers in order. If we first tell R
to center our plot title, and then set the theme to void
,
any adjustments we’ve made to the plot theme will be over-written by the
theme_void()
function. So first we make the theme
void
and then we center the title. We center both the title
and subtitle by using the theme()
function and setting the
hjust
parameter to 0.5. The hjust
parameter
stands for “horizontal justification” and takes any value between 0 and
1. A setting of 0 indicates left justification and a setting of 1
indicates right justification. We can also update the “face” of the
font, to be bold or italics, or bold and italics. To do this, we would
adjust the face
parameter.
R
ggplot() +
geom_raster(data = NDVI_HARV_stack_df,
mapping = aes(x = x, y = y, fill = value)) +
facet_wrap(~variable) +
labs(title = "Landsat NDVI",
subtitle = "NEON Harvard Forest",
fill = "NDVI") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5))
Challenge 1
Change the plot subtitle to italic font. You can (and should!) use the help menu in RStudio or any internet resources to figure out how to change this setting.
Hint: Use ?element_text()
to pull up the help
file and see how you might be able to change the appearance of text in
the theme()
function.
Learners can find this information in the help files for the
theme()
function. The parameter to set is called
face
.
R
ggplot() +
geom_raster(data = NDVI_HARV_stack_df,
mapping = aes(x = x, y = y, fill = value)) +
facet_wrap(~variable) +
labs(title = "Landsat NDVI",
subtitle = "NEON Harvard Forest",
fill = "NDVI") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, face = "italic"))
Adjust the Color Ramp
Next, let’s adjust the color ramp used to render the rasters. First,
we can change the blue color ramp to a green one that is more visually
suited to our NDVI (greenness) data using the
colorRampPalette()
function in combination with
colorBrewer
which requires loading the
RColorBrewer
library.
If you don’t have the RColorBrewer
library yet, install
it using:
install.packages("RColorBrewer")
Now we need to create a set of colors to use.
We can view the palettes available in RColorBrewer
:
R
library(RColorBrewer)
display.brewer.all()
We will select a set of nine colors from the “YlGn” (yellow-green) color palette. This returns a set of hex color codes:
R
brewer.pal(9, "YlGn")
OUTPUT
[1] "#FFFFE5" "#F7FCB9" "#D9F0A3" "#ADDD8E" "#78C679" "#41AB5D" "#238443"
[8] "#006837" "#004529"
We can save these colors to add to our plot later.
R
green_colors <- brewer.pal(9, "YlGn")
Now we can go ahead and build our plot!
R
ggplot() +
geom_raster(data = NDVI_HARV_stack_df,
mapping = aes(x = x, y = y, fill = value)) +
facet_wrap(~variable) +
scale_fill_gradientn(colours = green_colors) +
labs(title = "Landsat NDVI",
subtitle = "NEON Harvard Forest",
fill = "NDVI") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5))
The yellow to green color ramp visually represents NDVI well given it’s a measure of greenness. Someone looking at the plot can quickly understand that pixels that are more green have a higher NDVI value.
Data Tip
For all of the brewer.pal
ramp names see the brewerpal
page.
Data Tip
Cynthia Brewer, the creator of ColorBrewer, offers an online tool to help choose suitable color ramps, or to create your own. ColorBrewer 2.0; Color Advise for Cartography
Refine Plot & Tile Labels
Next, let’s label each panel in our plot with the Julian day that the
raster data for that panel was collected. The current names come from
the band “layer names” stored in the RasterStack
and the
first part of each name is the Julian day.
To create a more meaningful label we can remove the pieces of the
label we don’t want and replace it with something we do by using the
gsub()
function in R. The syntax is as follows:
gsub("StringToReplace", "TextToReplaceIt", object)
.
First let’s remove “_HARV_NDVI_crop” from each label to make the
labels shorter and remove repetition. To illustrate how this works, we
will first look at the names for our NDVI_HARV_stack_df
object in the variable
column:
R
unique(NDVI_HARV_stack_df$variable)
OUTPUT
[1] "005_HARV_ndvi_crop" "037_HARV_ndvi_crop" "085_HARV_ndvi_crop"
[4] "133_HARV_ndvi_crop" "181_HARV_ndvi_crop" "197_HARV_ndvi_crop"
[7] "213_HARV_ndvi_crop" "229_HARV_ndvi_crop" "245_HARV_ndvi_crop"
[10] "261_HARV_ndvi_crop" "277_HARV_ndvi_crop" "293_HARV_ndvi_crop"
[13] "309_HARV_ndvi_crop"
Now we will use the gsub()
function to find the
character string “_HARV_ndvi_crop” and replace it with a blank string
(““). We will assign this output to a new column
(panel_name
) using the mutate()
function and
look at that column to sure our code is doing what we want it to.
R
NDVI_HARV_stack_df <- NDVI_HARV_stack_df %>%
mutate(panel_name = gsub("_HARV_ndvi_crop", "", variable))
unique(NDVI_HARV_stack_df$panel_name)
OUTPUT
[1] "005" "037" "085" "133" "181" "197" "213" "229" "245" "261" "277" "293"
[13] "309"
So far so good. Now let’s use the paste()
function to
add the word “Day” in front of each of those numbers:
R
NDVI_HARV_stack_df <- NDVI_HARV_stack_df %>%
mutate(panel_name = paste("Day", panel_name))
unique(NDVI_HARV_stack_df$panel_name)
OUTPUT
[1] "Day 005" "Day 037" "Day 085" "Day 133" "Day 181" "Day 197" "Day 213"
[8] "Day 229" "Day 245" "Day 261" "Day 277" "Day 293" "Day 309"
Our labels look good now. Let’s render our plot again. This time,
we’ll facet by the panel_name
column.
R
ggplot() +
geom_raster(data = NDVI_HARV_stack_df,
mapping = aes(x = x, y = y, fill = value)) +
facet_wrap(~panel_name) +
scale_fill_gradientn(colours = green_colors) +
labs(title = "Landsat NDVI",
subtitle = "NEON Harvard Forest",
fill = "NDVI") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5))
The “Day” looks a little bit cut off on our plots. We can fix that
using the “theme” again! This time, we will want to add something for
the strip.text
area - let’s increase the margin on the
bottom of the text area to accommodate the text size. We can do this
using
theme(strip.text = element_text(margin = unit(c(0,0,2,0), "pt")))
.
The c(0,0,2,0)
tells us the new margins we want for the top
(0 pt), right (0 pt), bottom (2 pt), and left (0 pt) sides.
R
ggplot() +
geom_raster(data = NDVI_HARV_stack_df,
mapping = aes(x = x, y = y, fill = value)) +
facet_wrap(~panel_name) +
scale_fill_gradientn(colours = green_colors) +
labs(title = "Landsat NDVI",
subtitle = "NEON Harvard Forest",
fill = "NDVI") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
strip.text = element_text(margin = unit(c(0,0,2,0), "pt")))
Change Layout of Panels
We can adjust the columns of our plot by setting the number of
columns ncol
and the number of rows nrow
in
facet_wrap
. Let’s make our plot so that it has a width of
five panels.
R
ggplot() +
geom_raster(data = NDVI_HARV_stack_df,
mapping = aes(x = x, y = y, fill = value)) +
facet_wrap(~panel_name, ncol = 5) +
scale_fill_gradientn(colours = green_colors) +
labs(title = "Landsat NDVI",
subtitle = "NEON Harvard Forest",
fill = "NDVI") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
strip.text = element_text(margin = unit(c(0,0,2,0), "pt")))
Now we have a beautiful, publication quality plot!
We can save this plot using the ggsave()
function.
Remember that by default R will save the last rendered plot. We can
specify the height, width, and dpi for the figure when we save it. This
makes it super easy to meet your target journal’s figure formatting
requirements.
R
ggsave(filename = "figures/time_series_NDVI.png",
height = 5, width = 5, dpi = 600)
Challenge 2: Divergent Color Ramps
When we used the paste()
function to modify the tile
labels we replaced the beginning of each tile title with “Day”. A more
descriptive name could be “Julian Day”. Update the plot above with the
following changes:
- Label each tile “Julian Day” with the julian day value following.
- Change the color ramp to a divergent brown to green color ramp.
Questions: Does having a divergent color ramp represent the data better than a sequential color ramp (like “YlGn”)? Can you think of other data sets where a divergent color ramp may be best?
R
NDVI_HARV_stack_df <- NDVI_HARV_stack_df %>%
mutate(panel_name = gsub("Day", "Julian Day", panel_name))
brown_green_colors <- brewer.pal(9, "BrBG")
ggplot() +
geom_raster(data = NDVI_HARV_stack_df,
mapping = aes(x = x, y = y, fill = value)) +
facet_wrap(~panel_name, ncol = 5) +
scale_fill_gradientn(colours = brown_green_colors) +
labs(title = "Landsat NDVI - Julian Days",
subtitle = "Harvard Forest 2011",
fill = "NDVI") +
theme_void() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
strip.text = element_text(margin = unit(c(0,0,2,0), "pt")))
For NDVI data, the sequential color ramp is better than the divergent as it is more akin to the process of greening up, which starts off at one end and just keeps increasing.
Key Points
- Use the
theme_void()
function for a clean background to your plot. - Use the
element_text()
function to adjust text size, font, and position. - Use the
brewer.pal()
function to create a custom color palette. - Use the
gsub()
function to do pattern matching and replacement in text and thepaste()
function to add to your current text.