Pipeline reproducibility
Modern research workflows often involve complex, interconnected scripts where the output of one file is the input for the next. Traditional sequential approaches (e.g., first running 01_clean.R, then running 02_model.R) are prone to error: if you change a parameter in 01_clean.R, you must remember to manually re-run everything downstream.
To ensure end-to-end reproducibility and efficiency, emLab projects should use automated pipeline tools. These tools track dependencies, skip tasks that are already up-to-date, and provide a clear map of the research workflow.
The philosophy of pipeline management
Adopting a pipeline tool is not just about learning a new package; it is a shift in research philosophy. We use these tools for several critical reasons:
- Resilience to Change: In research, data is rarely static. If a raw dataset is updated or a cleaning parameter is tweaked, manual workflows require the researcher to remember every downstream script affected. Automated pipelines handle this bookkeeping for you, ensuring the “state” of your results always matches the “state” of your code.
- Efficiency and “Lazy” Computing: Many emLab analyses involve long-running operations (e.g., processing GFW data or training complex models). Pipeline tools are “lazy”: they only execute a task if its inputs or the code itself has changed. This saves hours of unnecessary computation and allows for rapid iteration on specific parts of the analysis.
- Trust and Transparency: A pipeline tool serves as a “living” README. Instead of trusting a narrative description of how the analysis was run, a collaborator can look at a dependency graph and see exactly how every figure and table was derived.
- Collaborative Scalability: In a team setting, pipelines reduce the “cognitive load” for new members. A new researcher can clone a repository and run a single command to see the entire project in action, rather than guessing which scripts are current and which are experimental leftovers.
R pipelines: the targets package
For projects primarily using R, we highly recommend the targets package. targets maintains a persistent cache and only re-runs the specific parts of your analysis that have changed.
Targets repo structure
A targets project typically follows this directory structure:
project/
├── _targets.R      # The "instruction manual" for the pipeline
├── R/              # Folder containing modular functions
│   └── functions.R
├── data/           # Raw data
├── _targets/       # (Automatic) Local cache/metadata
└── report.qmd      # Final output (e.g., paper figures and statistics)
Functionalize your code
Instead of long scripts, write small, modular functions in the R/ directory. Each function should do one thing (e.g., “clean the sea surface temperature data”).
The _targets.R file defines the relationship between data, code, and outputs.

library(targets)
source("R/functions.R")
tar_option_set(packages = c("tidyverse", "sf"))
list(
  # Track a raw data file
  tar_target(raw_data_file, "data/raw_samples.csv", format = "file"),
  # Clean data using a custom function from R/functions.R
  tar_target(data_cleaned, clean_my_data(raw_data_file)),
  # Run a model
  tar_target(model_results, run_analysis(data_cleaned)),
  # Render a Quarto report
  tarchetypes::tar_quarto(final_report, "report.qmd")
)
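As a rough illustration of "one function, one job", here is a minimal sketch of what R/functions.R might contain for a pipeline that calls clean_my_data() and run_analysis(). The function bodies, the column names (site, value), and the choice of a per-site mean as the "analysis" are all hypothetical. Base R is used only to keep the sketch self-contained.

```r
# R/functions.R (hypothetical contents; column names are assumptions)

# Read the raw CSV, drop rows with a missing measurement,
# and strip stray whitespace from the site labels.
clean_my_data <- function(path) {
  raw <- read.csv(path, stringsAsFactors = FALSE)
  cleaned <- raw[!is.na(raw$value), , drop = FALSE]
  cleaned$site <- trimws(cleaned$site)
  cleaned
}

# Summarise the cleaned data: mean value per site.
# A real project would fit its actual model here.
run_analysis <- function(data) {
  aggregate(value ~ site, data = data, FUN = mean)
}
```

Because each function takes its inputs as arguments and returns its result, targets can pass one step's output to the next and detect when a function body has changed.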
Multiple pipelines in one repo
Complex emLab projects often require split pipelines. For example:
- Pipeline A (Data Ingestion): Pulls heavy data from BigQuery/GFW. Requires specific credentials and often runs on the emLab server/GRIT.
- Pipeline B (Analysis): Uses the pulled data for modeling. Accessible to all team members for local iteration.
To manage this, create two separate target files (e.g., _targets_data.R and _targets_analysis.R) and a _targets.yaml file in the root directory to define them:
# _targets.yaml
data_pipeline:
  script: _targets_data.R
  store: _targets_data_store
analysis_pipeline:
  script: _targets_analysis.R
  store: _targets_analysis_store

You can then run them specifically from the console:
tar_make(script = "_targets_data.R", store = "_targets_data_store")
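An alternative to passing script and store on every call is to select a project by name through the TAR_PROJECT environment variable, which targets reads when it looks up settings in _targets.yaml. The sketch below assumes the project names from the example configuration (data_pipeline, analysis_pipeline) and that it is run from the project root. The tar_make() call is left commented out because it needs the real pipeline scripts.

```r
library(targets)

# Select which project in _targets.yaml this R session should use.
# "analysis_pipeline" is the project name assumed from the example config.
Sys.setenv(TAR_PROJECT = "analysis_pipeline")

# Inspect what the session resolves to before running anything.
# If no _targets.yaml is found, targets falls back to its defaults.
tar_config_get("script")
tar_config_get("store")

# With the project selected, no extra arguments are needed:
# tar_make()
```

Setting the variable once per session keeps later calls such as tar_read() and tar_visnetwork() pointed at the same store, which avoids accidentally reading results from the other pipeline.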
Rendering Quarto documents in a targets pipeline
When your targets pipeline produces a Quarto document, the render step should be the last target in your pipeline — and how you define it matters. The right tool for this is tarchetypes::tar_quarto().
Why not just use targets::tar_target()?
It might seem natural to call the Quarto render function inside a regular target:
# Not recommended
targets::tar_target(report, quarto::quarto_render("analysis/final_report.qmd"))

The issue is that targets sees this as an opaque function call with no connection to the rest of your pipeline. It has no way of knowing which upstream targets are referenced inside the Quarto document, so it cannot tell when the document is out of date. In practice, this means you either re-render unnecessarily on every tar_make() run or, worse, end up with a stale report that targets believes is current.
Use tarchetypes::tar_quarto() instead
tarchetypes::tar_quarto() fixes this by scanning the code in your .qmd file for calls to targets::tar_read() and targets::tar_load() and registering the targets they name as explicit upstream dependencies. The document is then re-rendered only when one of those dependencies, or the .qmd file itself, has actually changed.
# Recommended
tarchetypes::tar_quarto(
  name = quarto_notebook,
  path = "qmd/quarto_notebook.qmd",
  quiet = FALSE
)

Inside your .qmd file, use targets::tar_read(target_name) or targets::tar_load(target_name) to access pipeline results. These are the calls that tarchetypes::tar_quarto() scans to build the dependency graph, so make sure all pipeline inputs are accessed this way rather than loaded through other means.
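To make the scanning idea concrete, the sketch below approximates it with a regular expression over the lines of a document. This is a simplified illustration and not how tarchetypes is implemented. The helper name find_target_references() and the sample lines are hypothetical.

```r
# Simplified illustration only: a regex over raw text, not real code analysis.
# Returns the names passed to tar_read() / tar_load() in the given lines.
find_target_references <- function(qmd_lines) {
  pattern <- "tar_(read|load)\\([[:space:]]*([A-Za-z][A-Za-z0-9._]*)"
  calls <- unlist(regmatches(qmd_lines, gregexpr(pattern, qmd_lines)))
  # Strip everything up to and including the opening parenthesis.
  unique(sub(".*\\([[:space:]]*", "", calls))
}

# Hypothetical chunk contents from a report.
qmd_lines <- c(
  "model <- targets::tar_read(model_results)",
  "targets::tar_load(data_cleaned)",
  "other <- readRDS(\"results/other.rds\")  # not visible to the scan"
)

deps <- find_target_references(qmd_lines)
```

The readRDS() line is the case to avoid: nothing in it names a target, so a dependency scan of this kind cannot link the report to whatever produced that file.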
Using targets::tar_target() directly is technically possible, but you would need to manually declare all dependencies yourself to get correct invalidation behavior. We therefore strongly recommend tarchetypes::tar_quarto() for most use cases.
Non-R pipelines: GNU Make
For projects using Python, Stata, or a mix of languages, we recommend GNU Make. Make is language-agnostic, ships with most Linux distributions and with the Xcode Command Line Tools on macOS (and is easy to install on Windows via WSL), and uses a single plain-text Makefile to describe how each output in the pipeline is built.
Unlike targets, which tracks dependencies by hashing R objects, Make reasons in terms of files and timestamps: a target is considered out-of-date whenever any of its prerequisites has a newer modification time. This is a coarser signal than targets provides — Make cannot tell whether a file’s contents actually changed, only that it was touched — but it is enough for most multi-language pipelines and is easy to reason about.
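The difference between the two signals can be seen in a small base R sketch. It is only an illustration of the idea: the helper names are hypothetical, and an MD5 file hash stands in for content-based tracking in general rather than for the way targets hashes objects internally.

```r
# Make-style check: stale if the target is missing or any
# prerequisite has a newer modification time.
is_stale_by_mtime <- function(target, prerequisites) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(prerequisites) > file.mtime(target))
}

# Content-based check: has the file's hash changed since it was recorded?
has_changed_by_hash <- function(path, stored_hash) {
  !identical(unname(tools::md5sum(path)), stored_hash)
}

# Demo in a scratch directory.
scratch <- tempfile("mtime_demo_")
dir.create(scratch)
input <- file.path(scratch, "raw.csv")
output <- file.path(scratch, "processed.csv")
writeLines(c("x", "1"), input)
writeLines(c("x", "1"), output)
Sys.setFileTime(input, Sys.time() - 60)  # input is older than output
stored_hash <- unname(tools::md5sum(input))

stale_before <- is_stale_by_mtime(output, input)

# "Touch" the input: newer timestamp, identical contents.
Sys.setFileTime(input, Sys.time() + 60)
stale_after <- is_stale_by_mtime(output, input)
changed_after <- has_changed_by_hash(input, stored_hash)
```

After the touch, the timestamp check reports the output as stale while the hash check reports no change, which is the extra rebuild that Make accepts in exchange for simplicity.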
The Makefile syntax
A Makefile is a list of rules. Each rule has three parts:
target: prerequisites
<TAB>recipe

- target: the file the rule produces (e.g. data/processed.csv).
- prerequisites: files the target depends on. If any prerequisite is newer than the target, the recipe runs.
- recipe: the shell command(s) Make runs to build the target. The indentation must be a literal tab character, not spaces. Silent space-for-tab substitution by editors is the single most common Makefile bug; configure your editor to preserve tabs in your Makefile.
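One way to catch lost tabs before Make does is to scan the file for lines indented with spaces. The helper below is a hypothetical sketch, not a standard tool. It assumes a simple Makefile in which every indented line is a recipe line, so it may flag legitimate space indentation in more elaborate files.

```r
# Return the line numbers of lines that begin with one or more spaces
# followed by non-whitespace (likely a recipe whose tab was lost).
find_space_indented_lines <- function(lines) {
  which(grepl("^ +[^[:space:]]", lines))
}

# Hypothetical Makefile contents; "\t" is a literal tab.
makefile_lines <- c(
  "data/processed.csv: scripts/clean_data.py data/raw.csv",
  "\tpython scripts/clean_data.py",
  "results/model_output.log: scripts/analysis.do data/processed.csv",
  "    stata -batch do scripts/analysis.do"
)

bad_lines <- find_space_indented_lines(makefile_lines)
```

For a real project, the same function can be applied to readLines("Makefile"). Here only the fourth line is flagged, because its recipe starts with spaces instead of a tab.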
Example workflow (Python/Stata)
Makefile
.PHONY: all clean
# Default target: build everything
all: results/paper.pdf
# Clean raw data (Python)
data/processed.csv: scripts/clean_data.py data/raw.csv
	python scripts/clean_data.py
# Run analysis (Stata)
results/model_output.log: scripts/analysis.do data/processed.csv
	stata -batch do scripts/analysis.do
# Render report (Quarto)
results/paper.pdf: paper.qmd results/model_output.log
	quarto render paper.qmd --to pdf
# Wipe generated outputs so the pipeline can be re-run from scratch
clean:
	rm -f data/processed.csv results/model_output.log results/paper.pdf

A few things worth pointing out in this example:
- .PHONY: all clean tells Make that all and clean are command names, not files. Without it, a stray file named all in the project root would make Make think the pipeline is already built and refuse to do anything.
- all is listed first, so a bare make (no arguments) builds it. By convention, all depends on the final outputs of the pipeline.
- Each script appears as a prerequisite and in its own recipe. This way, edits to scripts/clean_data.py invalidate data/processed.csv and trigger a rerun on the next make.
How to use Make
| Command | Effect |
|---|---|
| make | Build the default target (the first target in the file, by convention all). |
| make <target> | Build a specific target, e.g. make results/paper.pdf. |
| make -n | Dry run: print the commands Make would execute, without running them. Useful for sanity-checking a pipeline change. |
| make -B | Force-rebuild every target, ignoring timestamps. |
| make -j N | Run up to N independent steps in parallel. |
| make clean | Run the clean rule above (or whichever cleanup rule the project defines). |
Pitfalls and tips
- Always declare phony targets. Any rule that does not actually produce a file with the target’s name (all, clean, lint, test, …) belongs in .PHONY.
- Include the script itself as a prerequisite. Otherwise edits to the analysis code won’t trigger a rebuild and downstream outputs will go stale silently.
- For scripts that don’t produce a tracked file (e.g. a Stata script that only prints to console), redirect the output to a sentinel file (results/model_output.log in the example above) so Make has something to time-stamp.
- Document the entry point in the README. A first-time collaborator should be able to clone the repo, type make, and have the whole pipeline build itself.
Minimum requirement: README documentation
If a project does not yet use an automated tool like targets or Make, it is mandatory to provide a detailed “How to Run” section in the repository’s README.md. At an absolute minimum, this must include:
- Script Execution Order: A numbered list of scripts in the exact order they must be run.
- Dependencies: Which data files are required for each script and which outputs they produce.
- Environment Setup: Instructions on how to set up the necessary libraries or software versions (e.g., using renv or conda).
Best practices for all pipelines
- Documentation is Mandatory: No matter what pipeline or tool is used, the repository README.md must clearly document how to execute the analysis. A new collaborator should be able to clone the repo and know exactly which command to run (e.g., “Run tar_make() in the R console” or “Type make in the terminal”).
- Atomic Steps: Each step in the pipeline should be small enough to be easily understood but large enough to justify being its own “target.”
- File Tracking: Always include the script itself as a prerequisite/dependency. If you change the code, the output should be considered “outdated.”
- No Manual Interventions: A pipeline is only reproducible if it can be run from start to finish on a new machine without a human clicking “save,” “export,” or manually moving files in the middle.