Improving Metadata, Improving Research

Metadata. It’s critical to our knowledge-based society, but it’s something people rarely, if ever, think about. Like rebar inside concrete, metadata provides underlying structure that increases our ability to locate and access relevant data. Here’s an example. The dataset IRS 1C LIS3 Standard Products contains over ten years of detailed data. However, unless you are already familiar with IRS 1C, how can you know if that extensive collection of data is relevant to the knowledge you are seeking to discover?

Metadata provides support to a data-driven world. (photo credit: Charley Lhasa)

Enter that overlooked workhorse, metadata. The dataset IRS 1C LIS3 Standard Products is manually tagged with the metadata keywords Earth science, land surface, surface radiative properties, erosion sedimentation, and geomorphic landform processes. It is this metadata, not the actual data contained in the dataset, that allows search engines such as NASA’s Earthdata search client to connect you with this potentially valuable collection of data. Given the increasing importance of data to our society, robust and accurate metadata across multiple parameters is essential.

Architecture of IMPACT’s GCMD Keyword Tagger tool

Given the importance of metadata and the subjectivity that arises from human curation, how can we efficiently verify the accuracy of metadata? If the keywords listed above for IRS 1C LIS3 Standard Products are inaccurate, the dataset will appear in the wrong search results, impeding data discovery and research efforts. To address this need, the machine learning team at IMPACT trained machine learning models to process dataset abstracts. These models suggest relevant metadata keywords by leveraging the ability of Word2Vec models to embed the meaning of words into numeric vectors. This approach uses machine learning techniques to provide subject matter experts with automated keyword suggestions that complement the hand curation process.
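To make the idea of suggesting keywords from word vectors concrete, here is a minimal sketch. It is not the GCMD Keyword Tagger's actual implementation: the three-dimensional vectors, the candidate keyword list, and the function names (`mean_vector`, `cosine`, `suggest_keywords`) are all invented for illustration, and a real system would use vectors learned by a Word2Vec model. The sketch averages the vectors of the words in an abstract and ranks candidate keywords by cosine similarity to that average.

```python
import math
import re

# Toy 3-dimensional "embeddings", invented for illustration only.
# A real Word2Vec model would learn vectors with hundreds of dimensions.
TOY_VECTORS = {
    "erosion": [0.9, 0.1, 0.0],
    "sediment": [0.8, 0.2, 0.1],
    "landform": [0.7, 0.3, 0.0],
    "radiance": [0.1, 0.9, 0.1],
    "reflectance": [0.2, 0.8, 0.0],
    "ocean": [0.0, 0.1, 0.9],
    "salinity": [0.1, 0.0, 0.8],
}

# Hypothetical candidate keywords, each described by a few representative words.
# The first two names come from the article; the third is made up for contrast.
KEYWORDS = {
    "erosion sedimentation": ["erosion", "sediment"],
    "surface radiative properties": ["radiance", "reflectance"],
    "ocean salinity": ["ocean", "salinity"],
}


def mean_vector(words):
    """Average the vectors of all known words; return None if none are known."""
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_keywords(abstract, top_n=2):
    """Rank candidate keywords by similarity to the abstract's mean vector."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    doc_vector = mean_vector(tokens)
    if doc_vector is None:
        return []
    scored = [
        (cosine(doc_vector, mean_vector(words)), keyword)
        for keyword, words in KEYWORDS.items()
    ]
    scored.sort(reverse=True)
    return [keyword for _, keyword in scored[:top_n]]


if __name__ == "__main__":
    print(suggest_keywords("Sediment transport and erosion of coastal landform features"))
```

In this toy setup, an abstract about sediment and erosion ranks "erosion sedimentation" first because its averaged vector points in nearly the same direction as that keyword's vector. Suggestions like these are meant to support a human curator's judgment, not replace it.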

The research behind this effort produced a valuable insight: machine learning models produce more accurate results when they are trained on a corpus aligned with the subject matter of the datasets being tagged. Muthukumaran Ramasubramanian, the lead developer on the project, explains:

The word embedding models used by the GCMD Keyword Tagger achieve higher accuracies.

This research effort has produced not only the GCMD Keyword Tagger, a tool that allows dataset curators to select metadata keywords from NASA’s Global Change Master Directory (GCMD) set of keywords, but also a conference paper at the recent IEEE SoutheastCon 2020: “ES2Vec: Earth Science Metadata Keyword Assignment using Domain-Specific Word Embeddings.”

The alpha release of the GCMD Keyword Tagger is currently being tested by an IMPACT metadata validation team. Access to the alpha release is available at the website below. Keywords can be generated from specific collection-level descriptions in NASA’s CMR or from a long-form description supplied by the user.


This is the unofficial blog of the Interagency Implementation and Advanced Concepts Team.
