Transforming Language Understanding in the Earth Sciences
The Earth science domain is home to highly specialized mathematical models, science data, and scientific principles, and Earth science research depends upon language that is just as specialized. Earth science researchers need to find separately developed concepts and bring them together to solve today's problems, and language is an indispensable tool for making that happen. Whether in data keyword tags, textual metadata, or journal articles, the ability to accurately apply words to research and data, and to quickly search for those words, is essential to advancing science.
The popular press has recently reported on impressive advancements in natural language processing (NLP) tools known as transformers: machine learning (ML) models trained on huge text datasets and built from millions, or even billions, of parameters (i.e., the learned weights in the model's equations). One example is GPT-3, which can generate articles and essays that sound as if they were written by humans. However, these generalized language models are not as well suited to domain-specific tasks, such as those in Earth science.
The BERT-E project is an effort by IMPACT’s ML team to develop an industry-standard language model for Earth science based on transformers. Describing transformers in general, IMPACT’s Prasanna Koirala commented:
I read an essay written by GPT-3, and it was in no way any less than what a human would have written, or in some ways even better. Ever since then I have been reading and understanding more and more about transformer models.
Bidirectional Encoder Representations from Transformers (BERT) is an NLP tool that uses transformers to understand the context of a given word with respect to the words to its left and right in a sentence. IMPACT's ML team has fine-tuned SciBERT, a pre-trained BERT model for science, with an additional layer to create a domain-specific Earth science model called BERT-E, which was trained on a corpus of over 270,000 Earth science articles. At the American Geophysical Union 2021 fall meeting, Prasanna presented the results of a comparison between BERT-E and the more generalized SciBERT on a masked language modeling (MLM) task, in which a word in an existing sentence is hidden and the model must predict the missing word. BERT-E showed an accuracy improvement of 2.19 percent and a total accuracy of 92.16 percent.
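To make the MLM task concrete, here is a minimal sketch using the fill-mask pipeline from the Hugging Face transformers library. The model name below is a generic stand-in (bert-base-uncased), not the BERT-E checkpoint itself, and the example sentence is invented for illustration.

# A minimal sketch of the masked language modeling (MLM) task.
# Assumption: "bert-base-uncased" is a stand-in model; to probe the Earth
# science model, swap in the BERT-E checkpoint published on Hugging Face.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The [MASK] token marks the hidden word the model must predict.
sentence = "Sea surface [MASK] is a key variable in ocean heat studies."
for prediction in fill_mask(sentence, top_k=3):
    print(f"{prediction['token_str']}: {prediction['score']:.3f}")

A domain-adapted model such as BERT-E should rank Earth-science-appropriate words (here, "temperature") more highly than a general-purpose model would.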
BERT-E can be used for myriad downstream tasks, such as named entity recognition (e.g., identifying United States or NASA), next sentence prediction (which seeks to understand dependencies across sentences), and question answering. One such task is the automated generation of metadata keywords that accurately describe Earth science datasets. NASA's Global Change Master Directory (GCMD) contains a hierarchy of keywords that form a controlled Earth science vocabulary. The more accurately the keywords describe the datasets they tag, the more consistently and precisely search tools can return results. IMPACT has used BERT-E to develop the GCMD Keyword Recommender (GKR), which provides data curators with suggested GCMD keywords using predictions based on existing dataset descriptions. BERT-E constitutes a key part of the recommender's architecture by providing embeddings: numerical representations of words that capture how a word relates to other words that commonly occur near it.
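The sketch below illustrates one way embeddings can drive keyword suggestions: embed a dataset description and candidate GCMD keywords with a BERT-style model, then rank the keywords by cosine similarity. This is an illustration of the general idea, not the actual GKR architecture; the model name, the mean-pooling strategy, and the keyword list are all assumptions.

# Illustrative only: ranking candidate keywords by embedding similarity.
# The model name is a stand-in for the BERT-E checkpoint, and mean pooling
# is one common (assumed) way to reduce token embeddings to one vector.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-uncased"  # placeholder; substitute the BERT-E checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the final hidden layer into a single vector for the text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, tokens, dim)
    return hidden.mean(dim=1).squeeze(0)

# Hypothetical GCMD-style keywords and a dataset description to tag.
keywords = ["SEA SURFACE TEMPERATURE", "SNOW COVER", "ATMOSPHERIC AEROSOLS"]
description = "Daily global ocean temperature fields from satellite radiometers."

desc_vec = embed(description)
ranked = sorted(
    keywords,
    key=lambda kw: torch.cosine_similarity(desc_vec, embed(kw), dim=0).item(),
    reverse=True,
)
print(ranked)  # keywords ordered from most to least relevant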
Looking forward, the IMPACT ML team envisions utilizing BERT-E for additional tasks such as graph convolutions and satellite/product recommendations. Language facilitates transformations in Earth science, and IMPACT's work with NLP transformers promises to enhance that process.
IMPACT’s latest work with BERT-E is available on Hugging Face.
More information about BERT-E, GKR, and IMPACT can be found at NASA Earthdata and the IMPACT project website.