Coding at emLab

Writing code is one of the most consequential tasks that we do at emLab. Our analyses inform real policy and resource management decisions, which means a coding error is not just a technical inconvenience; it can undermine the credibility of our science and erode trust with partners. At the same time, the broader scientific community has increasingly shifted towards Open Science, a movement that calls for transparent, accessible, and reproducible research (Roth et al. 2025). Many journals now require public code repositories as a condition of publication, and peer reviewers are increasingly expected to evaluate not just the methods as described, but the implementation itself. Understandable, accurate, and reproducible code is thus essential to rigorous research. Code that can be understood and re-run by others enables collaboration, continuity as team composition changes, and a smooth publication process.

This chapter and those that follow establish a shared foundation for coding at emLab. The guiding principles below are intended to orient our coding decisions, regardless of project type, team size, or programming language. The other chapters provide instructions and guidance for putting this shared vision into practice.

Some practices are fundamental enough to effective collaboration that we designate them as universal requirements — expectations that all team members engaged in coding should follow, regardless of project. These requirements establish a shared baseline that reduces friction in collaboration and code review, and makes it possible for anyone on the team to engage with, build on, or evaluate another person’s work. Beyond these requirements, we also identify best practices that represent what we aspire to as a team. We recognize that projects vary considerably in scope, team composition, available personnel, and external collaborator expectations, and that not every best practice will be achievable on every project. The goal is to strive toward them where possible, and to make deliberate decisions when circumstances require a different approach.

Guiding principles and universal requirements

Everything we write should serve two goals: accuracy (the code does what we think it does) and reproducibility (someone else — or our future selves — can run it and get the same result).

1. Code for accuracy. Accuracy means the logic is correct (e.g., the right observations are being filtered, the model is specified as intended, and the units are consistent). Accuracy errors are often invisible: the code may run and produce plausible-looking output, obscuring that a mistake was made. Such errors may go unnoticed for a long time, or never be caught at all. They are also costly: errors discovered late in a project can require extensive revision, delay or derail publication, and undermine trust with partners. Accuracy is best protected through continuous, incremental review, which requires adequate time for developing clean code and personnel available to provide review.

2. Code for reproducibility. Reproducibility means that a person with access to your code, data, and documentation can independently re-run the analysis and arrive at the same outputs. This is increasingly a necessity: public code repositories are now required by many journals and the broader scientific community’s trust in computational research depends on it. Reproducibility is protected through good code organization, documentation, package management, and using platforms (like GitHub) that track the full history of a project.
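One way to make the otherwise invisible accuracy errors described in principle 1 easier to catch is to state our expectations in the code itself. The following is a minimal sketch in base R, not a prescribed emLab method. The data frame, its column names, and its values are all hypothetical and exist only for illustration.

```r
# Hypothetical catch records (invented for illustration only)
catch <- data.frame(
  vessel_id = c("A", "B", "C", "D"),
  year      = c(2019, 2020, 2020, 2021),
  catch_kg  = c(1200, 850, NA, 430)
)

# Filter to the year of interest and drop records with missing catch
catch_2020 <- subset(catch, year == 2020 & !is.na(catch_kg))

# Check that the filter kept what we intended; stopifnot() halts the
# script with an error if any condition is FALSE
stopifnot(
  nrow(catch_2020) > 0,
  all(catch_2020$year == 2020),
  !anyNA(catch_2020$catch_kg)
)

# Convert units explicitly and record the unit in the column name
catch_2020$catch_tonnes <- catch_2020$catch_kg / 1000
```

Checks like these do not replace code review, but they turn a silent mistake (for example, an empty filter result) into a visible error at the point where it occurs.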

The following universal requirements reflect our shared commitment to a research environment that supports excellent, collaborative code development at emLab. These are described in detail in the following chapters.

GitHub as the Platform for All Coding Work

GitHub is emLab’s required platform for all coding projects. Using GitHub means that:

  • All code is maintained in a GitHub repository. A well-organized repository, one that includes a README, documents its dependencies, and follows a logical file structure, is the foundation of a reproducible analysis.

  • Code is clear and commented. All code is formatted using the appropriate code formatter for the language (e.g., Air for R, Ruff for Python). Code is sufficiently documented using comments.

  • Every change is tracked. All code changes are tracked using Git version control, with clear and descriptive commit messages.

  • Collaboration has structure. Branches, pull requests, and issues give teams a shared vocabulary and workflow for dividing work, reviewing code, and documenting decisions. The main branch is protected. Code development decisions and discussions are documented using GitHub Issues.

  • Code review is completed through pull requests. Meaningful peer review of code requires a shared platform. GitHub’s pull request workflow is the mechanism through which we conduct structured code review at emLab.

Scripts and version control

Code should be crafted according to the following guidelines:

  • Use scripts
  • Document scripts, but not too much
  • Organize scripts consistently (see format below)
  • Use Git to version control scripts
  • Make atomic Git commits (see description below)

Script files should be documented and organized in a way that enhances readability and comprehension. For example, use a standardized header for general documentation, and use sections to make it easier to understand and find specific code of interest. Code should also be self-documenting as much as possible. Additionally, use relative file paths for importing and exporting objects. Scripts should also be modular, with each script focusing on one general task. For example, use one script for cleaning data, another script for visualizing data, etc. A makefile can then be used to document the analysis workflow. There is an art to this organization, so just keep in mind the general principle of making code easy to understand, for your future self and for others.
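As a minimal sketch of the relative file path guidance, the snippet below builds paths with base R's file.path(). It assumes the script is run from the project root, and the folder names (data/raw, results/figures) and file names are hypothetical choices for illustration, not a required emLab layout.

```r
# Hypothetical project layout (an assumption for illustration):
#   my-project/
#     data/raw/my_raw_data.csv
#     results/figures/

# Build paths relative to the project root instead of hard-coding a
# location that only exists on one machine
raw_data_path <- file.path("data", "raw", "my_raw_data.csv")
figure_path   <- file.path("results", "figures", "graph.png")

# Create the output folder if it does not exist yet, so that the
# script also runs on a fresh clone of the repository
dir.create(dirname(figure_path), recursive = TRUE, showWarnings = FALSE)
```

Because file.path() assembles the path from its components, the same script can be run unchanged by collaborators on different operating systems.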

Here is an example template for R scripts:

# =============================================================================
# Name:           script.R
# Description:    Visualizes data
#
# Inputs:         data.csv
# Outputs:        graph.png
#
# Notes:          - Use a diverging color palette for best results
#                 - Output format can be changed as needed
# =============================================================================


# Set up environment ----------------------------------------------------------

library(tidyverse)

# Set path for Google Drive filestream based on OS type
# Note - this will work for Windows and Mac machines.
# If you use Linux, you will need to set your own path to where Google Drive filestream lives.

team_path <- if (Sys.info()[["sysname"]] == "Windows") "G:/" else "/Volumes/GoogleDrive/"

# Next, set the path for data directory based on whether project is current or archived.
# Note that if you use a different Shared Drive file structure than the one recommended in the "File Structure" section, you will need to manually define your data path.
# You should always double-check the automatically generated paths in order to ensure they point to the correct directory.

# First, set the name of your project

project_name <- "my-project"

# This will automatically determine if the project exists in the "current-projects" or "archived-projects" Shared Drive folder, and set the appropriate data path accordingly.
current_project_path <- paste0(team_path, "Shared drives/emlab/projects/current-projects/", project_name)
archived_project_path <- paste0(team_path, "Shared drives/emlab/projects/archived-projects/", project_name)

data_path <- if (dir.exists(current_project_path)) {
  paste0(current_project_path, "/data/")
} else {
  paste0(archived_project_path, "/data/")
}


# Import data -----------------------------------------------------------------

# Load data from Shared Drive using appropriate data path
my_raw_data <- read_csv(paste0(data_path,"raw/my_raw_data.csv"))

# Process data ----------------------------------------------------------------


# Analyze data ----------------------------------------------------------------


# Visualize results -----------------------------------------------------------


# Save results ----------------------------------------------------------------

Git tracks changes in code line-by-line with the use of commits. Commits should be atomic by documenting single, specific changes in code as opposed to multiple, unrelated changes. Atomic commits can be small or large depending on the change being made, and they enable easier code review and reversion. Git commit messages should be informative and follow a certain style, such as the guide found here. There is also an art to the version control process, so just keep in mind the general principle of making atomic commits.

More advanced workflows for using Git and GitHub, such as using pull requests or branches, will vary from project to project. It is important that the members of each project agree to and follow a specific workflow to ensure that collaboration is effective and efficient.

Reproducibility

Prioritizing reproducibility when writing code not only fosters collaboration with others during a project, but also makes it easier for users in the future (including yourself!) to make changes and rerun analyses as new data become available. Some useful tools and practices include:

  • Commenting code: Adding brief but detailed comments to your scripts that document what your code does and how it works will help others understand and use your scripts. At the top of your scripts, describe the purpose of the code as well as its necessary inputs, required packages, and outputs.
  • File paths: Avoid writing file paths that only work on your computer. Where possible, use relative file paths instead of absolute file paths so your code can be run by different users or operating systems. For R users, using R Projects and/or the here package are great ways to help implement this practice.
  • Functions: If you find yourself copying and pasting similar blocks of code over and over to repeat tasks, turn it into a function!
  • R Markdown: R users can take advantage of R Markdown for a coding format that seamlessly integrates sections of text alongside code chunks. R Markdown also enables you to transform your code and text into report-friendly formats such as PDFs or Word documents.
  • Git and GitHub: As described above, Git tracks changes to files line-by-line in commits that are attributable to each team member working on the project. GitHub then compiles the history of any Git-tracked file online, and synchronizes the work of all collaborators in a central repository. Using Git and GitHub allows multiple people to code collaboratively, examine changes as they occur, and restore prior versions of files if necessary.
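To illustrate the point about functions in the list above, here is a minimal sketch of replacing a repeated, copy-and-pasted calculation with a single function. The function name summarize_column and the example measurements are hypothetical and invented for illustration.

```r
# Instead of copying the same mean and standard error calculation for
# every column, define it once as a function (hypothetical helper)
summarize_column <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before summarizing
  c(mean = mean(x), se = sd(x) / sqrt(length(x)))
}

# Hypothetical measurements (invented for illustration only)
measurements <- data.frame(
  length_cm = c(10, 12, NA, 14),
  weight_g  = c(100, 150, 200, NA)
)

# Apply the same function to every column; the result is a matrix with
# one column per variable and rows named "mean" and "se"
summaries <- sapply(measurements, summarize_column)
```

If the calculation later needs to change (for example, to report a median instead), it only has to be edited in one place.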

We also recommend managing package dependencies and creating a reproducible coding pipeline. These two aspects are discussed in more detail in the sections on package reproducibility and pipeline reproducibility.