3.1 Core AIS Datasets and Assumptions
The world-fishing-827 project has many datasets, but the following five are the most commonly used: pipe_production_vYYYYMMDD, pipe_static, gfw_research, anchorages, and vessel_database.
pipe_production_vYYYYMMDD
This is GFW’s core internal dataset and is the output of the pipeline, a process that automates parsing, cleaning, augmenting, and publishing the raw AIS data. In most cases, queries should use the research tables in gfw_research, not the pipeline tables. However, the following tables are found only in the pipeline and may be useful to emLab researchers:
port_events_YYYYMMDD
- List of port events by vessel id (not ssvid)
- Important fields: vessel_id, start_timestamp, end_timestamp, start_anchorage_id, end_anchorage_id
- Relationship to other tables: match the vessel_id to the vessel_id field of the pipe_production_vYYYYMMDD.vessel_info table to obtain an ssvid. The ssvid can then be used to relate port events to other AIS data such as vessel tracks or characteristics. Match the start_anchorage_id or end_anchorage_id to the s2id in the anchorages.named_anchorages_vYYYYMMDD table to obtain information for the anchorages such as location, name, and EEZ
- Assumptions: Port events are individual events composed of a port entry/port exit pair. A port entry occurs when the vessel comes within 3 km of an anchorage point and the port exit occurs when the vessel is more than 4 km from an anchorage point
- The port events table is organized daily (one table per day). To select all events from a year (for example, 2020), use world-fishing-827.pipe_production_vYYYYMMDD.port_events_2020*; to reduce query size and cost, select a single day (for example, January 1, 2020) using world-fishing-827.pipe_production_vYYYYMMDD.port_events_20200101
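Because the entry and exit distances differ, a vessel that hovers between 3 km and 4 km from an anchorage point does not generate repeated events. The Python sketch below only illustrates the thresholds stated above; it is not GFW's pipeline code, and the function name and the input (a time-ordered list of distances to the nearest anchorage point) are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

ENTRY_KM = 3.0  # port entry: vessel comes within 3 km of an anchorage point
EXIT_KM = 4.0   # port exit: vessel is more than 4 km from an anchorage point


def port_events(distances_km: Iterable[float]) -> List[Tuple[int, int]]:
    """Return (entry_index, exit_index) pairs for one vessel.

    distances_km is a hypothetical, time-ordered series of the vessel's
    distance to its nearest anchorage point, one value per AIS position.
    """
    events = []
    entry_index = None  # None means the vessel is currently outside a port
    for i, distance in enumerate(distances_km):
        if entry_index is None and distance <= ENTRY_KM:
            entry_index = i                  # port entry
        elif entry_index is not None and distance > EXIT_KM:
            events.append((entry_index, i))  # port exit closes the event
            entry_index = None
    return events


# Made-up distances: the 3.5 km reading at index 4 does not end the first event.
track = [10.0, 5.0, 2.5, 1.0, 3.5, 2.0, 4.5, 8.0, 2.9, 6.0]
print(port_events(track))
```

An entry that is never followed by an exit is left open and not reported in this sketch.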
published_events_encounters
- List of encounter events; each encounter event is listed twice, with the event_id field ending in .1 or .2 to distinguish between the first and second vessel involved
- Important fields: event_id, vessel_id, event_start, event_end, event_info, lat_mean, lon_mean
- Relationship to other tables: match the vessel_id to the vessel_id field of the pipe_production_vYYYYMMDD.vessel_info table to obtain an ssvid. The ssvid can then be used to relate encounters to other AIS data such as vessel tracks or characteristics
- Assumptions: an encounter is 2 vessels within 500 meters of each other, traveling at < 2 knots, for a minimum duration of 2 hours, and at least 10 km from a coastal anchorage. Encounter events and loitering events may overlap: if a vessel’s behavior fits the loitering definition and an encounter event occurs within the same timeframe, the same possible transshipment event is listed in both tables
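Since every encounter appears as two rows, counting rows double-counts encounters. A minimal Python sketch of collapsing the two rows into one record per encounter is shown below. The sample event and vessel ids are made up, and the assumption that the text before the final ".1" or ".2" is shared by both rows is inferred from the description above.

```python
from collections import defaultdict


def pair_encounters(rows):
    """Map each shared encounter id to (first_vessel_id, second_vessel_id)."""
    pairs = defaultdict(dict)
    for row in rows:
        # "abc.1" -> base id "abc", suffix "1"
        base_id, _, suffix = row["event_id"].rpartition(".")
        if suffix not in ("1", "2"):
            raise ValueError("unexpected event_id: " + row["event_id"])
        pairs[base_id][suffix] = row["vessel_id"]
    return {base: (v.get("1"), v.get("2")) for base, v in pairs.items()}


# Hypothetical rows, two per encounter.
rows = [
    {"event_id": "enc-a.1", "vessel_id": "vessel-1"},
    {"event_id": "enc-a.2", "vessel_id": "vessel-2"},
    {"event_id": "enc-b.1", "vessel_id": "vessel-3"},
    {"event_id": "enc-b.2", "vessel_id": "vessel-1"},
]
encounters = pair_encounters(rows)
print(len(rows), "rows describe", len(encounters), "encounters")
```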
voyages
- List of voyages by ssvid
- Important fields: ssvid, vessel_id, trip_id, trip_start_anchorage_id, trip_end_anchorage_id, trip_start, trip_end
- Relationship to other tables: ssvid can be used to relate the vessel to other AIS data such as vessel tracks or characteristics. The trip_start_anchorage_id or trip_end_anchorage_id can be matched to the s2id in the anchorages.named_anchorages_vYYYYMMDD table to obtain information on the anchorages such as location, name, and EEZ
- Assumptions: voyages are a port exit/port entry pair, following a vessel from when it leaves a port to the next time it enters a port
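The exit/entry pairing can be pictured as joining each port event of a vessel to that vessel's next port event. The Python sketch below illustrates the definition only and is not how the voyages table is actually built. It assumes the port events are for a single vessel, are sorted by time, and that end_anchorage_id and start_anchorage_id are the anchorages at exit and entry respectively. The sample values are made up.

```python
def voyages_from_port_events(port_events):
    """Pair each port exit with the same vessel's next port entry."""
    voyages = []
    for previous, following in zip(port_events, port_events[1:]):
        voyages.append({
            "trip_start": previous["end_timestamp"],                 # port exit
            "trip_start_anchorage_id": previous["end_anchorage_id"],
            "trip_end": following["start_timestamp"],                # next port entry
            "trip_end_anchorage_id": following["start_anchorage_id"],
        })
    return voyages


# Hypothetical, time-ordered port events for one vessel.
events = [
    {"start_timestamp": "2020-01-01T06:00", "end_timestamp": "2020-01-02T09:00",
     "start_anchorage_id": "anch-A", "end_anchorage_id": "anch-A"},
    {"start_timestamp": "2020-01-10T12:00", "end_timestamp": "2020-01-11T08:00",
     "start_anchorage_id": "anch-B", "end_anchorage_id": "anch-C"},
    {"start_timestamp": "2020-01-20T03:00", "end_timestamp": "2020-01-21T03:00",
     "start_anchorage_id": "anch-A", "end_anchorage_id": "anch-A"},
]
trips = voyages_from_port_events(events)
print(trips)
```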
pipe_static
These are static data tables used by the GFW data pipeline. They come from sources that change infrequently and are generally used as lookup tables in the pipeline, but they may also be useful as lookup tables in emLab projects.
regions
- Region information (for EEZs, RFMOs, FAO regions, MPAs, etc.) for each longitude, latitude grid cell
- Important fields: gridcode, regions.eez, regions.mpant, regions.mparu, regions.rfmo, regions.major_fao
- Relationship to other tables: the gridded lon/lat can be used to spatially join the table to other AIS data of the same resolution
- Assumptions: Gridded longitude, latitude (WGS84) at 0.01 degree resolution
spatial_measures
- Distance from shore and depth for each longitude, latitude grid cell
- Important fields: gridcode, distance_from_shore_m, elevation_m
- Relationship to other tables: the gridded lon/lat can be used to spatially join the table to other AIS data of the same resolution
- Assumptions: Gridded longitude, latitude (WGS84) at 0.01 degree resolution
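To join positions to a gridded lookup table, raw coordinates first have to be snapped to the same grid. The Python sketch below shows one way to do that, assuming the 0.01 resolution is in degrees and that cells are identified by flooring the coordinate. The actual format of the gridcode field is not described here, so the integer indices returned are hypothetical stand-ins rather than real gridcode values.

```python
import math

CELLS_PER_DEGREE = 100  # 0.01 resolution, assumed to be in degrees


def grid_index(coordinate):
    """Integer index of the 0.01 degree cell containing a coordinate."""
    # Round first so that values like 0.29 * 100 (28.999999999999996 in
    # floating point) land in the intended cell before flooring.
    scaled = round(coordinate * CELLS_PER_DEGREE, 6)
    return math.floor(scaled)


def grid_cell(lat, lon):
    """(lat_index, lon_index) pair usable as a join key between tables."""
    return grid_index(lat), grid_index(lon)


# Two nearby positions fall in the same cell and share a join key.
print(grid_cell(34.4208, -119.6982))
print(grid_cell(34.4251, -119.6999))
```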
gfw_research
The gfw_research dataset is the one most commonly used by GFW research partners. These tables are versions of the pipeline tables that have been altered to make them more suitable and cost-effective for analysis. The following tables may be most relevant to emLab researchers:
eez_info
- List of Exclusive Economic Zones (EEZs); can be used to add country names or ISO3 codes to the numeric EEZ id
- Important fields: eez_id, territory1, territory1_iso3, sovereign1, sovereign1_iso3
- Relationship to other tables: the numeric EEZ id code (eez_id) can be matched to the activity.eez.value field of the vessel info tables (vi_ssvid_byyear_vYYYYMMDD) or to the regions.eez field of the pipe_static.regions and pipe_vYYYYMMDD_fishing tables to add country names, ISO3 codes, and other associated EEZ details
fishing_vessels_ssvid_vYYYYMMDD
- Current best list of active fishing vessels by ssvid by year. This list is the most restrictive filter for fishing vessels and contains fewer fishing vessels than the gfw_research.vi_ssvid_byyear_vYYYYMMDD table
- Important fields: ssvid, year, best_flag, best_vessel_class (gear type)
- Relationship to other tables: use the ssvid and year to match to other AIS data such as vessel tracks or characteristics
- Assumptions: MMSI is on_fishing_list_best, MMSI is not likely fishing gear based on shipname, MMSI is not offsetting its position, MMSI did not broadcast 5 or more different shipnames in a year, MMSI is spoofed no more than 24 hours in a year, the MMSI was active enough for the neural net to infer a vessel class, and the MMSI is active for at least 5 days and has at least 24 hours of fishing activity in a year
loitering_events_2knots_vYYYYMMDD
- List of loitering activities by vessel. Queries will likely want to further restrict results to vessels of a specific type, a minimum distance from shore, and a minimum event duration
- Important fields: ssvid, loitering_start_timestamp, loitering_end_timestamp, loitering_hours, avg_distance_from_shore_nm, start_lon, start_lat, end_lon, end_lat
- Relationship to other tables: use ssvid to match loitering events to other AIS data such as vessel tracks or characteristics
- Assumptions: vessels are moving at < 2 knots (includes all vessel types)
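Because the table includes every vessel type moving below 2 knots, most analyses start by filtering it. The Python sketch below applies a minimum duration and a minimum distance from shore using the loitering_hours and avg_distance_from_shore_nm fields listed above. The threshold values (4 hours, 20 nautical miles) and the sample rows are hypothetical and chosen only for illustration.

```python
def filter_loitering(events, min_hours, min_distance_nm):
    """Keep loitering events that are long enough and far enough offshore."""
    return [
        event for event in events
        if event["loitering_hours"] >= min_hours
        and event["avg_distance_from_shore_nm"] >= min_distance_nm
    ]


# Hypothetical rows using the field names listed above.
events = [
    {"ssvid": "111", "loitering_hours": 6.0, "avg_distance_from_shore_nm": 45.0},
    {"ssvid": "222", "loitering_hours": 1.5, "avg_distance_from_shore_nm": 80.0},
    {"ssvid": "333", "loitering_hours": 9.0, "avg_distance_from_shore_nm": 3.0},
]
# Thresholds are illustrative assumptions, not GFW recommendations.
kept = filter_loitering(events, min_hours=4.0, min_distance_nm=20.0)
print([event["ssvid"] for event in kept])
```

Restricting to a vessel type would additionally require joining on ssvid to a table with vessel characteristics.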
pipe_vYYYYMMDD_fishing
- Table of fishing activity, best table to use to find active fishing positions
- Important fields: seg_id, ssvid, timestamp, lat, lon, nnet_score, hours, night_loitering, regions records
- Relationship to other tables: use ssvid to relate fishing positions to other vessel-specific AIS data. The regions record has information on the location of the position, including the EEZ id code regions.eez, which can be related to EEZ-specific information in eez_info using the eez_id
- Assumptions: vessels are listed on at least one of the fishing lists in the vi_ssvid_byyear_vYYYYMMDD table
- This is a partitioned table. See Section 4.2.4 for more information about subsetting data in partitioned tables
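The nnet_score and hours fields can be combined to estimate fishing effort per vessel, year, and grid cell. The Python sketch below shows the idea on a few made-up positions. The 0.5 score cutoff, the 0.1 degree bin size, and the ISO-string timestamps are assumptions for illustration, since this section does not state how the score should be thresholded.

```python
import math
from collections import defaultdict

NNET_THRESHOLD = 0.5    # assumed cutoff for treating a position as fishing
CELLS_PER_DEGREE = 10   # assumed 0.1 degree bins for this example


def binned_fishing_hours(positions):
    """Sum hours of likely fishing by (ssvid, year, lat_bin, lon_bin)."""
    totals = defaultdict(float)
    for p in positions:
        score = p["nnet_score"]
        if score is None or score <= NNET_THRESHOLD:
            continue  # not classified as fishing under the assumed cutoff
        key = (
            p["ssvid"],
            int(p["timestamp"][:4]),                 # year from an ISO timestamp
            math.floor(p["lat"] * CELLS_PER_DEGREE),
            math.floor(p["lon"] * CELLS_PER_DEGREE),
        )
        totals[key] += p["hours"]
    return dict(totals)


# Hypothetical positions using the field names listed above.
positions = [
    {"ssvid": "111", "timestamp": "2020-03-01T00:00:00", "lat": 10.25,
     "lon": -50.75, "nnet_score": 0.9, "hours": 1.5},
    {"ssvid": "111", "timestamp": "2020-03-01T01:30:00", "lat": 10.27,
     "lon": -50.71, "nnet_score": 0.8, "hours": 0.5},
    {"ssvid": "111", "timestamp": "2020-03-01T02:00:00", "lat": 10.28,
     "lon": -50.72, "nnet_score": 0.2, "hours": 2.0},
    {"ssvid": "111", "timestamp": "2020-03-02T00:00:00", "lat": 12.05,
     "lon": -50.75, "nnet_score": 0.7, "hours": 1.0},
]
effort = binned_fishing_hours(positions)
print(effort)
```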
pipe_vYYYYMMDD_segs
- Used to identify good segments for inclusion in analyses
- Important fields: good_seg, positions, overlapping_and_short
- Relationship to other tables: use the seg_id to match segments passing the quality filters to vessel position segments in pipe_vYYYYMMDD_fishing
- Assumptions: to be labeled as a good_seg, a segment must have more than 5 positions, the vessel must move at least ~100 meters with an average speed > 0, and the longitude must not be between -0.109225 and 0.109225
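In practice the segments table acts as a quality filter on the positions table, joined on seg_id. The Python sketch below mimics that join in memory. It assumes good_seg and overlapping_and_short are boolean flags and that a usable segment is one with good_seg true and overlapping_and_short false. That combination is an assumption rather than something stated above, and the sample ids are made up.

```python
def good_segment_ids(segments):
    """seg_ids that pass the assumed quality filter."""
    return {
        s["seg_id"] for s in segments
        if s["good_seg"] and not s["overlapping_and_short"]
    }


def keep_good_positions(positions, segments):
    """Keep only positions whose segment passes the quality filter."""
    good = good_segment_ids(segments)
    return [p for p in positions if p["seg_id"] in good]


# Hypothetical segments and positions.
segments = [
    {"seg_id": "seg-1", "good_seg": True, "overlapping_and_short": False},
    {"seg_id": "seg-2", "good_seg": True, "overlapping_and_short": True},
    {"seg_id": "seg-3", "good_seg": False, "overlapping_and_short": False},
]
positions = [
    {"seg_id": "seg-1", "ssvid": "111", "hours": 1.0},
    {"seg_id": "seg-2", "ssvid": "111", "hours": 0.5},
    {"seg_id": "seg-3", "ssvid": "222", "hours": 2.0},
    {"seg_id": "seg-1", "ssvid": "111", "hours": 0.25},
]
clean = keep_good_positions(positions, segments)
print(len(clean), "of", len(positions), "positions kept")
```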
port_visits_no_overlapping_short_seg_vYYYYMMDD
- List of port visits by vessel id (not ssvid)
- Important fields: ssvid, vessel_id, start_anchorage_id, end_anchorage_id, start_timestamp, end_timestamp
- Relationship to other tables: use the ssvid to match to vessel tracks or characteristics. Match the start_anchorage_id or end_anchorage_id to the s2id in the anchorages.named_anchorages_vYYYYMMDD table to obtain information for the anchorages such as location, name, and EEZ
- Assumptions: this table differs from port_events_YYYYMMDD because port visits must include a port entry, a port stop or a port gap, and a port exit. A port stop begins when the vessel speed is < 0.2 knots and ends when the vessel speed is > 0.5 knots. Port gaps are defined as gaps in AIS transmission of more than 4 hours
vi_ssvid_byyear_vYYYYMMDD
- Summary of annual vessel activity and identity information by ssvid. This table is best used to get a set of best vessel characteristics or summarize vessel activity (like fishing hours) by ssvid and year
- Important fields: ssvid, year, activity records (summary of the amount and location of the vessel’s activity), best records (best vessel characteristics)
- Relationship to other tables: the ssvid can be used to match vessel characteristics to vessel tracks in the pipe_vYYYYMMDD_fishing table
- Assumptions: fishing hours are calculated by segment and summed by EEZ. If a segment borders two EEZs, its fishing hours are counted in both; it is therefore possible for the sum of the activity.eez.fishing_hours values to be greater than the total hours recorded in the activity.fishing_hours field. For a more accurate estimate of fishing hours, particularly binned fishing hours, use the pipe_vYYYYMMDD_fishing table and calculate fishing hours using the nnet_score by vessel, year, and grid cell. An example of calculating binned fishing effort is provided in Section 4.3.
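The double counting of segments that border two EEZs is easy to see with a small worked example. The Python sketch below uses two made-up segments and placeholder EEZ ids (1 and 2) to show how the per-EEZ sum can exceed the vessel total. It illustrates the arithmetic only and does not reproduce how the table is computed.

```python
from collections import defaultdict


def summarize_fishing_hours(segments):
    """Return (total_hours, hours_by_eez) with border segments counted per EEZ."""
    total_hours = 0.0
    hours_by_eez = defaultdict(float)
    for segment in segments:
        total_hours += segment["fishing_hours"]       # counted once in the total
        for eez_id in segment["eez_ids"]:
            hours_by_eez[eez_id] += segment["fishing_hours"]  # once per EEZ touched
    return total_hours, dict(hours_by_eez)


# Hypothetical segments: the second one borders both EEZ 1 and EEZ 2.
segments = [
    {"fishing_hours": 5.0, "eez_ids": [1]},
    {"fishing_hours": 3.0, "eez_ids": [1, 2]},
]
total, by_eez = summarize_fishing_hours(segments)
print("total:", total, "sum over EEZs:", sum(by_eez.values()))
```

Here the vessel fished 8 hours in total, but the per-EEZ values add up to 11 because the 3-hour border segment is attributed to both EEZs.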
anchorages
The GFW data uses anchorages which are different from ports. The anchorage dataset gridded the globe at approximately 0.5 km cells and identified grid cells where at least 20 individual vessels remained stationary from 2012-2019. Each location was assigned a unique anchorage id. Generally, there are many anchorages within a single port. More information about how anchorages are assigned can be found on the GFW website. The following table is likely the most useful for emLab researchers:
named_anchorages_vYYYYMMDD
- List of all named anchorages in the GFW data with associated information on location and EEZs
- Important fields: s2id (anchorage id), iso3, lat, lon
- Relationship to other tables: the s2id can be used to match to a start or end anchorage_id in the pipe_production_vYYYYMMDD.port_events_YYYYMMDD, pipe_production_vYYYYMMDD.voyages, and gfw_research.port_visits_no_overlapping_short_seg_vYYYYMMDD tables
- Assumptions: at least 20 vessels remained stationary between 2012 and 2019
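The s2id match works like a lookup join: each anchorage id on an event row is used to pull the anchorage's country and coordinates. The Python sketch below shows this with in-memory rows. The s2id values, coordinates, and the added anchorage_* output names are hypothetical, while s2id, iso3, lat, lon, and start_anchorage_id are the field names listed in this section.

```python
def add_anchorage_info(events, named_anchorages, id_field="start_anchorage_id"):
    """Attach iso3, lat, and lon of the matching anchorage to each event."""
    by_s2id = {a["s2id"]: a for a in named_anchorages}
    enriched = []
    for event in events:
        anchorage = by_s2id.get(event[id_field])  # None if the id is not found
        row = dict(event)
        for field in ("iso3", "lat", "lon"):
            row["anchorage_" + field] = anchorage[field] if anchorage else None
        enriched.append(row)
    return enriched


# Hypothetical lookup rows and events.
named_anchorages = [
    {"s2id": "s2-aaa", "iso3": "USA", "lat": 34.40, "lon": -119.69},
    {"s2id": "s2-bbb", "iso3": "PER", "lat": -12.05, "lon": -77.15},
]
events = [
    {"vessel_id": "vessel-1", "start_anchorage_id": "s2-bbb"},
    {"vessel_id": "vessel-2", "start_anchorage_id": "s2-zzz"},
]
enriched = add_anchorage_info(events, named_anchorages)
print(enriched)
```

Events whose anchorage id has no match keep None values here, which corresponds to a left join rather than an inner join.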
vessel_database
The vessel database is a collection of tables tracking information from over 30 different vessel registries. The database provides historic registry information and can be used to track changes in vessel identities over time. This database is particularly useful for querying lists of non-fishing vessels, such as carriers. When searching for vessel characteristics, it is better to first use the gfw_research.vi_ssvid_byyear_vYYYYMMDD table and then use the vessel database for vessels that aren’t found in the vessel info table, particularly non-fishing vessels. The following table may be the most useful to emLab researchers:
all_vessels_vYYYYMMDD
- List of all vessels in the GFW database for all years
- Important fields: matched, feature records, is_carrier, is_fishing, is_bunker, is_new
- Relationship to other tables: use ssvid to match vessel registry information to other AIS data
- The feature record summarizes vessel characteristics (geartype, length, engine power, tonnage, crew size) matched between AIS broadcasts and the vessel registries. In general this is the cleanest way to get vessel characteristics from the vessel database. The identity records summarize vessel characteristics broadcast over AIS and the registry record summarizes vessel characteristics from all the scraped vessel registries. The is_carrier, is_fishing, is_bunker, and is_new fields are helpful for easily filtering each category of vessels.
The vessel database is not comprehensive and is only as good as the AIS and registry data. The dataset may contain typos or outdated records, and caution should be used in analysis.