3.3 Data Directory

3.3.1 emLab Data Directory

All emLab data are stored in subfolders of the Data folder on the emLab Team Drive (emlab/data). To document these data, we use the emLab Data Directory, which includes key, standardized information from each readme metadata file. Every data file in the emlab/data folder has a record (row) in the emLab Data Directory. The emLab Data Directory file contains two sheets: (1) Data directory (the record and standardized documentation for each data file); and (2) Metadata (information needed to populate the Data Directory, i.e. the meta-metadata).

In the case of placeholder metadata (as described in the Metadata section), only the following columns should be filled out: folder, filename, contact, and summary. This (mostly blank) row serves two purposes: (1) it keeps the dataset at least partly searchable and (2) it serves as a visual reminder that the dataset still needs more robust metadata.
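The placeholder rule can be illustrated with a minimal Python sketch. It reads the rule strictly, so a row passes only when exactly the four named columns are filled and every other column is blank. The dictionary-based row, the lowercase column keys, the function name `is_valid_placeholder`, and the example values are all assumptions made for illustration; the Data Directory itself is a spreadsheet.

```python
# Columns that a placeholder row should fill in (taken from the rule above).
PLACEHOLDER_COLUMNS = {"folder", "filename", "contact", "summary"}


def is_valid_placeholder(row):
    """Return True if exactly the four placeholder columns are non-empty.

    `row` is a hypothetical dict mapping lowercase column names to cell
    values; None and whitespace-only strings count as blank.
    """
    filled = {
        name for name, value in row.items()
        if value is not None and str(value).strip() != ""
    }
    return filled == PLACEHOLDER_COLUMNS


# Hypothetical example row; none of these values come from the real directory.
example_row = {
    "domain": "",
    "folder": "sst-us",
    "filename": "sst_us_2017",
    "contact": "Jane Doe (jane@example.org)",
    "summary": "Placeholder entry; full metadata still to be written.",
    "notes": None,
}
print(is_valid_placeholder(example_row))  # True
```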

Column           Description
Domain           Climate/Energy; Land; Ocean; General; Other [drop down menu]
Description      Description of a few words (e.g. SST US 2017); max 5 words
Folder           Name of folder containing data
Filename         Name of data file
Year             Year of publication
Version          Subcategory of year; NA if not applicable
Project          Project name that used these data (can have multiple listings) or ‘General’ if widely used (e.g. FAO data), hyperlinked to Google Drive/Box folder
Code             Link to GitHub repo or wherever code is stored
Data Stage       ‘raw’ if raw data; ‘final input’ for the input data used for the analysis; ‘output’ for what was used for the project and/or published [drop down menu]
Filetype         File extension (e.g. csv; tif; rds); note: do not include ‘.’
Citation         Hyperlinked reference to publication or online resource or contact for individual/group data author
URL              Link to original data source
Extent           global; regional; national; local [drop down menu]
Resolution       Resolution of spatial data (in degrees)
Permissions      open = open source/open access; restricted = need author permission; secure = confidential data and likely involves a DUA or NDA [drop down menu]
Start year       Data set start year; numeric
End year         Data set end year; numeric
Source           e.g. emLab; FAO; Rare
Contact          Name and email of contact person in emLab who used/stored data
emLab reference  Hyperlinked reference to emLab publication using data (can be NA)
Keywords         e.g. fisheries; fire; utilities; property value; VDS; MPA; oceanography; temperature; habitat; biodiversity (up to 5 per entry, separated by semicolons)
Summary          Brief description of the data (1-2 sentences). Include years for timeseries; location/spatial extent for spatial data; key variables; resolution; sampling frequency; species; etc.
Notes            Other relevant information about the data, e.g. whether it was processed (such as subset from a larger dataset) and what specifically was done; whether there are suspicious data points; any known issues. Initial your entry.
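Several of the column rules above are mechanical enough to check automatically before a row is added. The following is a minimal Python sketch of such a check, not an existing emLab tool. The dictionary-based record, the lowercase column keys, and the function name `check_record` are assumptions; the allowed values and limits are copied from the column descriptions.

```python
# Allowed drop-down values, copied from the column descriptions above.
DOMAINS = {"Climate/Energy", "Land", "Ocean", "General", "Other"}
DATA_STAGES = {"raw", "final input", "output"}
EXTENTS = {"global", "regional", "national", "local"}
PERMISSIONS = {"open", "restricted", "secure"}


def check_record(record):
    """Return a list of problems found in one Data Directory row.

    `record` is a hypothetical dict keyed by lowercase column names.
    Only columns present in the dict are checked.
    """
    problems = []

    dropdowns = {
        "domain": DOMAINS,
        "data stage": DATA_STAGES,
        "extent": EXTENTS,
        "permissions": PERMISSIONS,
    }
    for column, allowed in dropdowns.items():
        if column in record and record[column] not in allowed:
            problems.append(f"{column}: {record[column]!r} is not a drop-down value")

    # Description: a few words, at most 5.
    if len(record.get("description", "").split()) > 5:
        problems.append("description: more than 5 words")

    # Filetype: the extension without the leading '.'.
    if record.get("filetype", "").startswith("."):
        problems.append("filetype: remove the leading '.'")

    # Keywords: up to 5 per entry, separated by semicolons.
    keywords = [k for k in record.get("keywords", "").split(";") if k.strip()]
    if len(keywords) > 5:
        problems.append("keywords: more than 5 entries")

    # Start year and end year must be numeric.
    for column in ("start year", "end year"):
        if column in record and not str(record[column]).strip().isdigit():
            problems.append(f"{column}: must be numeric")

    return problems


# Hypothetical record with one deliberate mistake (the '.' in the filetype).
record = {
    "domain": "Ocean",
    "description": "SST US 2017",
    "data stage": "raw",
    "filetype": ".csv",
    "extent": "national",
    "permissions": "open",
    "start year": 2017,
    "end year": 2017,
    "keywords": "oceanography; temperature",
}
print(check_record(record))  # ["filetype: remove the leading '.'"]
```

Returning a list of problems, rather than stopping at the first one, lets a contributor fix everything in a row in one pass.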

Any time you add a new dataset to the shared emLab data folder and directory, please message the #data-streamlining Slack channel so that others on the team know about the new dataset.

3.3.2 Project-level Data Directory

We highly recommend that research teams create a data_overview spreadsheet for keeping track of project-related data (i.e. a separate spreadsheet stored in the project’s Google Shared Drive data folder). This centralized document can be used to record project-relevant information and to show team members which datasets have already been saved. It can then guide and simplify data migration to the emLab Data Directory once the project is complete. Suggested attributes include:

  • File name
  • Folder name
  • Source of data
  • Link where data was downloaded
  • Description of data
  • Name of the researcher who downloaded the data
  • Data directory entry (complete, in progress, not started, etc.)
  • Metadata sheet (complete, in progress, not started, etc.)
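As a starting point, the suggested attributes can be turned into a spreadsheet template. The minimal Python sketch below builds CSV text with one column per attribute and lists the files that still need work before migration. The snake_case column names, the function names, and the example row are assumptions for illustration; teams are free to name and store the sheet however they prefer.

```python
import csv
import io

# Suggested attributes from the list above, as hypothetical column headers.
OVERVIEW_COLUMNS = [
    "file_name",
    "folder_name",
    "source",
    "download_link",
    "description",
    "downloaded_by",
    "data_directory_entry",  # e.g. complete, in progress, not started
    "metadata_sheet",        # e.g. complete, in progress, not started
]


def overview_csv(rows):
    """Return CSV text: a header row followed by one line per dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=OVERVIEW_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def not_ready_for_migration(rows):
    """Return file names whose directory entry or metadata is not complete."""
    return [
        row["file_name"]
        for row in rows
        if row["data_directory_entry"] != "complete"
        or row["metadata_sheet"] != "complete"
    ]


# Hypothetical example row; the values are placeholders, not real emLab data.
rows = [
    {
        "file_name": "sst_us_2017.csv",
        "folder_name": "sst-us",
        "source": "Example agency",
        "download_link": "https://example.org/data",
        "description": "Sea surface temperature, US, 2017",
        "downloaded_by": "Jane Doe",
        "data_directory_entry": "complete",
        "metadata_sheet": "in progress",
    },
]
print(overview_csv(rows))
print(not_ready_for_migration(rows))  # ['sst_us_2017.csv']
```

The resulting text can be saved as a .csv file and imported into Google Sheets, where the two status columns show at a glance which datasets are ready to move to the emLab Data Directory.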