2.2 Metadata

Metadata is data about your data. It includes information about your data’s content, structure, authors, and permissions to make your data interpretable and usable by your future self and others. EVERY data file should be accompanied by a metadata file.

This includes files that people tend to overlook or think are not useful for the broader team. For example, if you’re using Google Sheets to keep track of literature or data reviews for a specific project, these documents should also have some form of accompanying metadata.

2.2.1 Metadata Standards

We use “readme” style metadata, named _readme_datafilename, and stored in the same folder as the data file.

Create one readme file for each data file. Download and use this template to create your readme file (when one is not already available).
Name the readme _readme_datafilename and save it as a text file.
Format the readme so it is easy to understand (use bullets, break up information, etc.).
Use a standardized date format (YYYY-MM-DD).

We acknowledge that it may not always be feasible to draft robust metadata immediately when a dataset is first uploaded to the emlab/data folder. In this case, create a minimal readme file as a temporary placeholder containing the following information: your name, contact info, a very brief (1-2 lines) description of the data, and a note on how you obtained them. Please refer to Section 2.2.2 (emLab common data directory) for instructions on how to incorporate this temporary documentation into the emLab Data Directory.
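As an illustration of the placeholder readme described above, here is a minimal Python sketch, not an official emLab tool. The function names, the field labels, and the "PLACEHOLDER" status line are hypothetical. It also assumes that "datafilename" means the data file's name without its extension and that the readme is saved with a .txt extension; adjust both if your team's convention differs.

```python
from datetime import date
from pathlib import Path


def placeholder_readme_path(data_file):
    """Build the readme path next to the data file, e.g. _readme_sst_us_2017.txt."""
    data_file = Path(data_file)
    # Assumption: drop the data file's extension and save the readme as .txt.
    return data_file.with_name(f"_readme_{data_file.stem}.txt")


def write_placeholder_readme(data_file, name, contact, description,
                             how_obtained, created=None):
    """Write the minimal placeholder readme and return its path."""
    created = created or date.today()
    lines = [
        f"Readme for: {Path(data_file).name}",
        f"Date created: {created.isoformat()}",  # isoformat() yields YYYY-MM-DD
        "Status: PLACEHOLDER (full metadata still needed)",  # hypothetical label
        "",
        f"- Name: {name}",
        f"- Contact: {contact}",
        f"- Description: {description}",
        f"- How obtained: {how_obtained}",
    ]
    readme = placeholder_readme_path(data_file)
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Calling write_placeholder_readme with the path to a data file writes the readme into the same folder as that file, which matches the storage rule in this section. The four bulleted fields are the ones the placeholder is required to contain.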

For files that may be considered more “internal notes” than datasets (such as the Google Sheets example mentioned above), please ensure that some form of metadata is present. One alternative to a “readme” file is to create an extra tab on the Google Sheet labeled “metadata”. In this tab, list the column names (column 1) and their definitions (column 2). This allows collaborators and team members to interpret the columns easily and use the dataset appropriately.

2.2.2 emLab common data directory

All commonly used emLab data are stored in subfolders of the Data folder on the emLab GRIT storage space (emlab/data). To document these data, we use the emLab Data Directory, which includes key, standardized information from each readme metadata file. Every data file in the emlab/data folder has a record (row) in the emLab Data Directory. The emLab Data Directory file contains two sheets: (1) Data directory (the record and standardized documentation for each data file); and (2) Metadata (information needed to populate the Data Directory, i.e. the meta-metadata).

In the case of placeholder metadata (as described in Section 2.2.1, Metadata Standards), only the following columns should be filled out: folder, filename, contact, and summary. This (mostly blank) row serves two purposes: (1) it keeps the dataset partially searchable, and (2) it serves as a visual reminder that the dataset still needs more robust metadata.

Column	Description
Domain	Climate/Energy; Land; Ocean; General; Other [drop down menu]
Description	A few word description (e.g. SST US 2017); max 5 words
Folder	Name of folder containing data
Filename	Name of data file
Year	Year of publication
Version	Sub category of year; NA if not applicable
Project	Project name that used these data (can have multiple listings) or ‘General’ if widely used (e.g. FAO data), hyperlinked to Google Drive/Box folder
Code	Link to Github repo or wherever code is stored
Data Stage	‘raw’ if raw data; ‘final input’ for the input data used for the analysis; ‘output’ for what was used for the project and/or published [drop down menu]
Filetype	File extension (e.g. csv; tif; rds); note: do not include ‘.’
Citation	Hyperlinked reference to publication or online resource or contact for individual/group data author
URL	Link to original data source
Extent	global; regional; national; local [drop down menu]
Resolution	Resolution of spatial data (in degrees)
Permissions	open = open source/open access; restricted = need author permission; secure = confidential data and likely involves a DUA or NDA [drop down menu]
Start year	Data set start year; numeric
End year	Data set end year; numeric
Source	e.g. emLab; FAO; Rare
Contact	Name and email of contact person in emLab who used/stored data
emLab reference	Hyperlinked reference to emLab publication using data (can be NA)
Keywords	e.g. fisheries; fire; utilities; property value; VDS; MPA; oceanography; temperature; habitat; biodiversity (up to 5 per entry, separated by semi-colons)
Summary	Brief description of the data (1-2 sentences). Include years for timeseries; location/spatial extent for spatial data; key variables; resolution; sampling frequency; species; etc.
Notes	Other relevant information about the data; initial your entry. For example: whether the data were processed (e.g. subset from a larger dataset) and what specifically was done; whether there are suspicious data points or other known issues
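
The column rules in the table above lend themselves to a quick automated check before a row is added. The following Python sketch is one hypothetical way to do that; it is not part of the emLab Data Directory itself. The function name check_record and the idea of passing a row as a dictionary keyed by column name are assumptions. The allowed values are copied from the drop-down descriptions in the table, and the four always-required columns are the ones listed for placeholder rows.

```python
# Allowed values for the drop-down columns, as listed in the table above.
ALLOWED = {
    "Domain": {"Climate/Energy", "Land", "Ocean", "General", "Other"},
    "Data Stage": {"raw", "final input", "output"},
    "Extent": {"global", "regional", "national", "local"},
    "Permissions": {"open", "restricted", "secure"},
}

# Columns that must be filled in even for a placeholder row.
PLACEHOLDER_COLUMNS = ("Folder", "Filename", "Contact", "Summary")


def check_record(record):
    """Return a list of problems found in one Data Directory row (a dict)."""
    def value(col):
        return str(record.get(col) or "").strip()

    problems = []
    for col in PLACEHOLDER_COLUMNS:
        if not value(col):
            problems.append(f"{col}: required, even for placeholder rows")
    for col, allowed in ALLOWED.items():
        if value(col) and value(col) not in allowed:
            problems.append(f"{col}: '{value(col)}' is not a drop-down option")
    if len(value("Description").split()) > 5:
        problems.append("Description: more than 5 words")
    if value("Filetype").startswith("."):
        problems.append("Filetype: do not include the leading '.'")
    keywords = [k.strip() for k in value("Keywords").split(";") if k.strip()]
    if len(keywords) > 5:
        problems.append("Keywords: more than 5 entries")
    for col in ("Start year", "End year"):
        if value(col) and not value(col).isdigit():
            problems.append(f"{col}: must be numeric")
    return problems
```

Blank optional columns are skipped rather than flagged, so a placeholder row that fills in only Folder, Filename, Contact, and Summary passes with an empty problem list.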

Any time you add a new dataset to the shared emLab data folder and directory, please message the #data-streamlining Slack channel so that others on the team know about the new dataset.

2.2.3 Project-specific data directories

We highly recommend that research teams create a data_overview spreadsheet for keeping track of project-related data (i.e. a separate Google Sheet stored in the project’s Google Shared Drive folder). This centralized document can be used to record project-relevant information and to show team members which datasets have already been saved. It can then be used to guide and simplify data migration to the emLab Data Directory once the project is complete. Suggested attributes include:

File name
Folder name
Source of data
Link where data was downloaded
Description of data
Name of the researcher who downloaded the data
Data directory entry (complete, in progress, not started, etc.)
Metadata sheet (complete, in progress, not started, etc.)
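
When a project wraps up, the two status attributes above can be used to see what still needs attention before migration. The sketch below assumes the data_overview sheet has been exported as CSV with column headers "File name", "Data directory entry", and "Metadata sheet" (taken from the suggested attributes; your sheet's headers may differ) and that a finished item is marked "complete". The function name pending_entries is hypothetical.

```python
import csv
import io


def pending_entries(overview_csv_text):
    """List file names whose directory entry or metadata sheet is not complete."""
    reader = csv.DictReader(io.StringIO(overview_csv_text))
    pending = []
    for row in reader:
        statuses = [
            (row.get("Data directory entry") or "").strip().lower(),
            (row.get("Metadata sheet") or "").strip().lower(),
        ]
        # Anything other than "complete" (including a blank cell) still needs work.
        if any(status != "complete" for status in statuses):
            pending.append(row["File name"])
    return pending
```

The function takes the CSV text rather than a file path, so it works equally well with an exported file read from disk or with text copied from the sheet.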