Pipeline reproducibility
Modern research workflows often involve complex, interconnected scripts where the output of one file is the input for the next. Traditional sequential approaches (e.g., first running 01_clean.R, then running 02_model.R) are prone to error: if you change a parameter in 01_clean.R, you must remember to manually re-run everything downstream.
To ensure end-to-end reproducibility and efficiency, emLab projects should use automated pipeline tools. These tools track dependencies, skip tasks that are already up-to-date, and provide a clear map of the research workflow.
The philosophy of pipeline management
Adopting a pipeline tool is not just about learning a new package; it is a shift in research philosophy. We use these tools for several critical reasons:
- Resilience to Change: In research, data is rarely static. If a raw dataset is updated or a cleaning parameter is tweaked, manual workflows require the researcher to remember every downstream script affected. Automated pipelines handle this bookkeeping for you, ensuring the “state” of your results always matches the “state” of your code.
- Efficiency and “Lazy” Computing: Many emLab analyses involve long-running operations (e.g., processing GFW data or training complex models). Pipeline tools are “lazy”—they only execute a task if its inputs or the code itself has changed. This saves hours of unnecessary computation and allows for rapid iteration on specific parts of the analysis.
- Trust and Transparency: A pipeline tool serves as a “living” README. Instead of trusting a narrative description of how the analysis was run, a collaborator can look at a dependency graph and see exactly how every figure and table was derived.
- Collaborative Scalability: In a team setting, pipelines reduce the “cognitive load” for new members. A new researcher can clone a repository and run a single command to see the entire project in action, rather than guessing which scripts are current and which are experimental leftovers.
R pipelines: the targets package
For projects primarily using R, we highly recommend the targets package. targets maintains a persistent cache and only re-runs the specific parts of your analysis that have changed.
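In day-to-day use, the whole workflow is driven by a handful of commands; a minimal sketch (all functions are from the targets package):

library(targets)

tar_make()        # build the pipeline, skipping targets that are already up to date
tar_visnetwork()  # visualize the dependency graph and see which targets are outdated
tar_outdated()    # list the targets that would re-run on the next tar_make()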
Targets repo structure
A targets project typically follows this directory structure:
project/
├── _targets.R # The "instruction manual" for the pipeline
├── R/ # Folder containing modular functions
│   └── functions.R
├── data/ # Raw data
├── _targets/ # (Automatic) Local cache/metadata
└── report.qmd # Final output (e.g., paper figures and statistics)
Functionalize your code
Instead of long scripts, write small, modular functions in the R/ directory. Each function should do one thing (e.g., “clean the sea surface temperature data”).
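As an illustration, the clean_my_data() function used in the _targets.R example below might look something like this; the column names and cleaning steps are hypothetical:

# R/functions.R (illustrative sketch; column names are hypothetical)
clean_my_data <- function(raw_data_file) {
  readr::read_csv(raw_data_file, show_col_types = FALSE) |>
    dplyr::filter(!is.na(sample_id)) |>   # drop rows with a missing sample ID
    dplyr::mutate(date = as.Date(date))   # standardize the date column
}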
The _targets.R file defines the relationship between data, code, and outputs. For example:

# _targets.R
library(targets)
source("R/functions.R")

tar_option_set(packages = c("tidyverse", "sf"))

list(
  # Track a raw data file
  tar_target(raw_data_file, "data/raw_samples.csv", format = "file"),
  # Clean data using a custom function from R/functions.R
  tar_target(data_cleaned, clean_my_data(raw_data_file)),
  # Run a model
  tar_target(model_results, run_analysis(data_cleaned)),
  # Render a Quarto report
  tarchetypes::tar_quarto(final_report, "report.qmd")
)
Multiple pipelines in one repo
Complex emLab projects often require split pipelines. For example:
- Pipeline A (Data Ingestion): Pulls heavy data from BigQuery/GFW. Requires specific credentials and often runs on the emLab server/GRIT.
- Pipeline B (Analysis): Uses the pulled data for modeling. Accessible to all team members for local iteration.
To manage this, create two separate target files (e.g., _targets_data.R and _targets_analysis.R) and a _targets.yaml file in the root directory to define them:
# _targets.yaml
data_pipeline:
  script: _targets_data.R
  store: _targets_data_store
analysis_pipeline:
  script: _targets_analysis.R
  store: _targets_analysis_store

You can then run them specifically from the console:
tar_make(script = "_targets_data.R", store = "_targets_data_store")
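Alternatively, because _targets.yaml defines named projects, you can select a pipeline by name via the TAR_PROJECT environment variable (this relies on targets' multi-project support) and then call tar_make() with no arguments:

# Select the "data_pipeline" entry from _targets.yaml, then run it
Sys.setenv(TAR_PROJECT = "data_pipeline")
targets::tar_make()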
Rendering Quarto documents in a targets pipeline
When your targets pipeline produces a Quarto document, the render step should be the last target in your pipeline — and how you define it matters. The right tool for this is tarchetypes::tar_quarto().
Why not just use targets::tar_target()?
It might seem natural to call the Quarto render function inside a regular target:
# Not recommended
targets::tar_target(report, quarto::quarto_render("analysis/final_report.qmd"))

The issue is that targets sees this as an opaque function call with no connection to the rest of your pipeline: it does not know which targets are referenced inside the Quarto document, so it cannot tell when the document is out of date. In practice, this means you either re-render unnecessarily on every tar_make() run or, worse, end up with a stale report that targets believes is current.
Use tarchetypes::tar_quarto() instead
tarchetypes::tar_quarto() fixes this by scanning your .qmd file for calls to targets::tar_read() and targets::tar_load() and registering the targets they reference as explicit upstream dependencies. The document is then re-rendered only when one of those dependencies has actually changed.
# Recommended
tarchetypes::tar_quarto(
name = quarto_notebook,
path = "qmd/quarto_notebook.qmd",
quiet = FALSE
)

Inside your .qmd file, use targets::tar_read(target_name) or targets::tar_load(target_name) to access pipeline results. These are the calls that tarchetypes::tar_quarto() scans to build the dependency graph, so make sure all pipeline inputs are accessed this way rather than loaded through other means.
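For example, an R chunk in the .qmd might read pipeline results like this (the target names follow the example pipeline above):

# Inside an R code chunk in the .qmd
targets::tar_load(model_results)                  # loads the `model_results` object into the session
data_cleaned <- targets::tar_read(data_cleaned)   # returns the target's value directly
summary(model_results)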
Using targets::tar_target() directly is technically possible, but you would need to manually declare all dependencies yourself to get correct invalidation behavior. We therefore strongly recommend tarchetypes::tar_quarto() for most use cases.
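For reference, a sketch of what manual declaration would look like: each upstream target has to be mentioned inside the command so that targets' code analysis treats it as a dependency, and the .qmd file itself would still need to be tracked separately.

# Sketch only; tarchetypes::tar_quarto() handles all of this for you
targets::tar_target(
  report_file,
  {
    data_cleaned    # referencing upstream targets here makes them dependencies
    model_results
    quarto::quarto_render("report.qmd")
    "report.html"   # a format = "file" target must return the path(s) it produces
  },
  format = "file"
)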
Non-R pipelines: GNU Make
For projects using Python, Stata, or a mix of languages, we recommend GNU Make. Make uses a Makefile to define “rules” for creating files.
The Makefile syntax
A rule consists of a target (the file you want to create), prerequisites (files it depends on), and a recipe (the command to run). Note that the recipe line must be indented with a tab character, not spaces.

target: prerequisites
	recipe
Example workflow (Python/Stata)
# Makefile
# Run everything
all: results/paper.pdf
# Clean raw data (Python)
data/processed.csv: scripts/clean_data.py data/raw.csv
	python scripts/clean_data.py

# Run analysis (Stata)
results/model_output.log: scripts/analysis.do data/processed.csv
	stata -b do scripts/analysis.do

# Render report (Quarto)
results/paper.pdf: paper.qmd results/model_output.log
	quarto render paper.qmd --to pdf

How to use Make
- Run the pipeline: Simply type make in your terminal. Make will check the timestamps of all files and only run the steps where the prerequisites are newer than the target.
- Dry run: Type make -n to see what commands would be run without actually executing them.
Minimum requirement: README documentation
If a project does not yet use an automated tool like targets or Make, it is mandatory to provide a detailed “How to Run” section in the repository’s README.md. At an absolute minimum, this must include:
- Script Execution Order: A numbered list of scripts in the exact order they must be run.
- Dependencies: Which data files are required for each script and which outputs they produce.
- Environment Setup: Instructions on how to set up the necessary libraries or software versions (e.g., using renv or conda).
Best practices for all pipelines
- Documentation is Mandatory: No matter what pipeline or tool is used, the repository README.md must clearly document how to execute the analysis. A new collaborator should be able to clone the repo and know exactly which command to run (e.g., “Run tar_make() in the R console” or “Type make in the terminal”).
- Atomic Steps: Each step in the pipeline should be small enough to be easily understood but large enough to justify being its own “target.”
- File Tracking: Always include the script itself as a prerequisite/dependency. If you change the code, the output should be considered “outdated.”
- No Manual Interventions: A pipeline is only reproducible if it can be run from start to finish on a new machine without a human clicking “save,” “export,” or manually moving files in the middle.