2.2 Metadata
Metadata is data about your data. It includes information about your data’s content, structure, authors, and permissions to make your data interpretable and usable by your future self and others. EVERY data file should be accompanied by a metadata file.
This includes files that people tend to overlook or think are not useful for the broader team. For example, if you’re using Google Sheets to keep track of literature or data reviews for a specific project, these documents should also have some form of accompanying metadata.
2.2.1 Metadata Standards
We use “readme” style metadata files, named _readme_datafilename and stored in the same folder as the data file.
- Create one readme file for each data file. Download and use this template to create your readme file (when one is not already available).
- Name the readme _readme_datafilename and save it as a text file.
- Format the readme so it is easy to understand (use bullets, break up information, etc.).
- Use a standardized date format (YYYY-MM-DD).
We acknowledge that it may not always be feasible to draft robust metadata immediately when a dataset is first uploaded to the emlab/data folder. In this case, a minimal readme file can be created as a temporary placeholder that contains the following information: your name, contact info, a very brief (1-2 lines) description of the data, and a note on how you obtained them. Please refer to the Data Directory subheading for further instruction on how to incorporate this form of temporary documentation into the emLab Data Directory.
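As an illustration of the naming rule and the minimal placeholder described above, here is a short Python sketch. It is not an emLab tool: the function names, the field labels, and the choice to drop the data file's extension and append `.txt` are assumptions made for this example.

```python
from datetime import date
from pathlib import Path


def readme_name(data_filename: str) -> str:
    """Build a readme file name for a data file.

    Assumption: the data file's extension is dropped and '.txt' is added,
    e.g. 'sst_us_2017.csv' becomes '_readme_sst_us_2017.txt'.
    """
    return f"_readme_{Path(data_filename).stem}.txt"


def placeholder_readme(name: str, contact: str, description: str,
                       obtained: str, created: date) -> str:
    """Return the text of a minimal placeholder readme.

    The four required pieces of information come from the guidance above;
    the header line and the field labels are hypothetical.
    """
    lines = [
        "PLACEHOLDER README (full metadata still needed)",
        f"- Date created: {created.isoformat()}",  # isoformat() gives YYYY-MM-DD
        f"- Name: {name}",
        f"- Contact: {contact}",
        f"- Description: {description}",
        f"- How obtained: {obtained}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = placeholder_readme(
        name="Jane Doe",
        contact="jane.doe@example.org",
        description="Monthly sea surface temperature for US waters, 2017.",
        obtained="Downloaded from the provider's public data portal.",
        created=date.today(),
    )
    print(readme_name("sst_us_2017.csv"))
    print(text)
```

To save the placeholder next to the data file, the returned text could be written with `Path(folder, readme_name(filename)).write_text(text)`.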
For files that may be considered more “internal notes” than datasets (Google Sheets example mentioned above), please ensure that some sort of metadata is present. One alternative to a “readme” file is to create an extra tab on the Google Sheet labeled “metadata”. Here, you can include information on the column names (column 1) and their definitions (column 2). This allows collaborators and team members to easily interpret the columns and use the dataset appropriately.
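One way to check that a “metadata” tab is complete is to compare the sheet's column names with the names defined in the tab. The Python sketch below assumes the column headers and the two-column metadata tab have already been exported or copied into plain lists; the function name and the example values are hypothetical.

```python
def undocumented_columns(data_columns, metadata_rows):
    """Return the data columns that lack a definition in the metadata tab.

    data_columns:  list of column names from the main sheet.
    metadata_rows: list of (column name, definition) pairs, mirroring
                   column 1 and column 2 of the metadata tab.
    A column counts as documented only if its definition is not blank.
    """
    documented = {
        name.strip()
        for name, definition in metadata_rows
        if definition.strip()
    }
    return [column for column in data_columns if column.strip() not in documented]


if __name__ == "__main__":
    columns = ["paper", "year", "notes"]
    metadata = [("paper", "Full citation of the reviewed paper"), ("year", "")]
    print(undocumented_columns(columns, metadata))
```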
2.2.2 emLab common data directory
All commonly used emLab data are stored in subfolders of the Data folder on the emLab GRIT storage space (emlab/data). To document these data, we use the emLab Data Directory, which includes key, standardized information from each readme metadata file. Every data file in the emlab/data folder has a record (row) in the emLab Data Directory. The emLab Data Directory file contains two sheets: (1) Data directory (the record and standardized documentation for each data file); (2) Metadata (information needed to populate the Data Directory, i.e. the meta-metadata).
In the case of placeholder metadata (as described in the Metadata section), only the following columns should be filled out: folder, filename, contact, and summary. This (mostly blank) row serves two purposes: 1) it retains some of the searchability function for that dataset and 2) it serves as a visual reminder that those datasets are in need of more robust metadata development.
Column | Description |
---|---|
Domain | Climate/Energy; Land; Ocean; General; Other [drop down menu] |
Description | A description of a few words (e.g. SST US 2017); max 5 words |
Folder | Name of folder containing data |
Filename | Name of data file |
Year | Year of publication |
Version | Sub category of year; NA if not applicable |
Project | Project name that used these data (can have multiple listings) or ‘General’ if widely used (e.g. FAO data), hyperlinked to Google Drive/Box folder |
Code | Link to Github repo or wherever code is stored |
Data Stage | ‘raw’ if raw data; ‘final input’ for the input data used for the analysis; ‘output’ for what was used for the project and/or published [drop down menu] |
Filetype | File extension (e.g. csv; tif; rds); note: do not include ‘.’ |
Citation | Hyperlinked reference to publication or online resource or contact for individual/group data author |
URL | Link to original data source |
Extent | global; regional; national; local [drop down menu] |
Resolution | Resolution of spatial data (in degrees) |
Permissions | open = open source/open access; restricted = need author permission; secure = confidential data and likely involves a DUA or NDA [drop down menu] |
Start year | Data set start year; numeric |
End year | Data set end year; numeric |
Source | e.g. emLab; FAO; Rare |
Contact | Name and email of contact person in emLab who used/stored data |
emLab reference | Hyperlinked reference to emLab publication using data (can be NA) |
Keywords | e.g. fisheries; fire; utilities; property value; VDS; MPA; oceanography; temperature; habitat; biodiversity (up to 5 per entry, separated by semi-colons) |
Summary | Brief description of the data (1-2 sentences). Include years for timeseries; location/spatial extent for spatial data; key variables; resolution; sampling frequency; species; etc. |
Notes | Other relevant information about the data; initial your entry. For example: whether the data were processed (e.g. subset from a larger dataset) and what specifically was done; whether there are suspicious data points; any other known issues. |
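Several of the column rules in the table above can be checked automatically before a row is added. The following Python sketch is only an illustration under stated assumptions: it treats a record as a dictionary keyed by the column names in the table, skips blank cells (so placeholder rows pass), and uses hypothetical function and constant names. It is not part of the emLab Data Directory itself.

```python
# Allowed values for the drop-down columns, copied from the table above.
ALLOWED = {
    "Domain": {"Climate/Energy", "Land", "Ocean", "General", "Other"},
    "Data Stage": {"raw", "final input", "output"},
    "Extent": {"global", "regional", "national", "local"},
    "Permissions": {"open", "restricted", "secure"},
}


def check_record(record: dict[str, str]) -> list[str]:
    """Return a list of problems found in one Data Directory row.

    Blank or missing cells are skipped, so a placeholder row that only
    fills in folder, filename, contact, and summary produces no problems.
    """
    problems = []

    # Drop-down columns must use one of the listed values.
    for column, allowed in ALLOWED.items():
        value = record.get(column, "").strip()
        if value and value not in allowed:
            problems.append(f"{column}: '{value}' is not one of {sorted(allowed)}")

    # Description is limited to 5 words.
    if len(record.get("Description", "").split()) > 5:
        problems.append("Description: more than 5 words")

    # Filetype is the extension without the leading '.'.
    if "." in record.get("Filetype", ""):
        problems.append("Filetype: do not include '.'")

    # Start and end years must be numeric.
    for column in ("Start year", "End year"):
        value = record.get(column, "").strip()
        if value and not value.isdigit():
            problems.append(f"{column}: must be numeric")

    # Keywords: up to 5 per entry, separated by semicolons.
    keywords = [k for k in record.get("Keywords", "").split(";") if k.strip()]
    if len(keywords) > 5:
        problems.append("Keywords: more than 5 entries")

    return problems


if __name__ == "__main__":
    row = {"Domain": "Ocean", "Description": "SST US 2017", "Filetype": ".csv"}
    for problem in check_record(row):
        print(problem)
```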
Any time you add a new dataset to the shared emLab data folder and directory, please message the #data-streamlining Slack channel so that others on the team know about the new dataset.
2.2.3 Project-specific data directories
We highly recommend that research teams create a data_overview spreadsheet for keeping track of project-related data (i.e. a separate Google Sheet stored in the project’s Google Shared Drive folder). This centralized document can be used to record project-relevant information and to show team members which datasets have already been saved. It can then be used to guide and simplify data migration to the emLab Data Directory once the project is complete. Suggested attributes include:
- File name
- Folder name
- Source of data
- Link where data was downloaded
- Description of data
- Name of the researcher who downloaded the data
- Data directory entry (complete, in progress, not started, etc.)
- Metadata sheet (complete, in progress, not started, etc.)
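A data_overview sheet with these attributes could be started from a small script and then imported into Google Sheets. The Python sketch below is one possible starting point, not a required format: the snake_case column names, the function name, and the default status of “not started” are assumptions made for this example.

```python
import csv
import io

# Column names are hypothetical snake_case versions of the attributes above.
COLUMNS = [
    "file_name",
    "folder_name",
    "source",
    "download_link",
    "description",
    "downloaded_by",
    "data_directory_entry",
    "metadata_sheet",
]
STATUS_COLUMNS = ("data_directory_entry", "metadata_sheet")


def overview_csv(rows) -> str:
    """Return CSV text for a data_overview sheet.

    rows is a list of dictionaries keyed by COLUMNS. Missing attributes are
    left blank, except the two status columns, which default to
    'not started' (an assumed default).
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        filled = {column: row.get(column, "") for column in COLUMNS}
        for column in STATUS_COLUMNS:
            filled[column] = filled[column] or "not started"
        writer.writerow(filled)
    return buffer.getvalue()


if __name__ == "__main__":
    print(overview_csv([{"file_name": "sst_us_2017.csv", "folder_name": "sst"}]))
```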