HLS: Going Global, Going Cloud
There is a scientific need for more frequent global surface reflectance observations for land monitoring applications. To address this need, observations from instruments on multiple platforms can be used to improve temporal coverage. To combine data from similar instruments and platforms, the data must be harmonized, or made more consistent. To provide the frequent observations needed, the Harmonized Landsat-8 Sentinel-2 (HLS) data products were developed by a team at Goddard Space Flight Center. The HLS algorithm had been implemented for more than 120 test sites worldwide covering approximately 28% of the globe.
As described in an earlier blog post, the IMPACT team was tasked with expanding the implementation of the HLS algorithm from the initial 120 sites to a global scale, generating data products in near real time along with a full archive of the HLS data products. To provide global HLS data, the team knew they would have to go to the cloud. Reliable production of HLS data products at the global scale requires a cloud-based processing approach. Such an approach allows for the scaling of compute resources to accommodate the variable needs of HLS data production. To accomplish their mandate the team concluded that migrating to Amazon Web Services (AWS) was necessary to streamline development, facilitate dynamic scaling for archival processing and optimize data transfer to LP DAAC’s cloud distribution system.
At its core the HLS project constitutes a large data bakery. It takes ingredients (Sentinel 2, Landsat 8, Terra and Aqua data) and processes those separate ingredients to produce a uniform, coherent view of the planet’s surface. These input ingredients are produced by partners at the United States Geological Survey (Landsat 8) and the European Space Agency (Sentinel-2), so the HLS pipeline uses an event driven architecture to continuously transform the data inputs as they become available in near real time. The HLS project utilizes AWS for two main purposes: orchestration and compute. Orchestration involves listening for new data availability, downloading the input data from external providers and tracking the various stages of input data processing. Compute is where the real work occurs, and scientific algorithms are used to transform raw input data into final products for publication.
Historically, the production of planetary-scale Earth observation products has been managed by high performance compute clusters and on-premise data production systems. Cloud-based production systems provide two advantages to this approach. While on-premise systems are capable of processing data at very large scales, a network bottleneck exists when moving data between on-premise systems and end users. This bottleneck has become increasingly problematic as Earth observation data volumes have exploded. Producing data using cloud-based systems allows direct delivery to massively scalable cloud storage systems where scientists can perform large scale analysis with data-proximate computing resources while avoiding any network-based limitations.
Cloud-based production systems also provide flexibility and can adjust to variable processing resource demands. Generating products on a regular schedule requires a consistent amount of computing resources. However, products such as HLS can require a massive scaling of resources to support archival processing for historical data or re-processing due to algorithm updates. Cloud-based production systems can scale quickly and efficiently so that these large processing efforts can be accomplished on tighter schedules and products can be delivered more rapidly to the scientific community.
HLS team member Sean Harkins articulates the promise that cloud platforms such as AWS offer toward realizing the research potential of the increasing volume of Earth science data:
Early in my career, I recognized the incredible value of Earth observation data to provide insights about planetary systems. I began as a data consumer, and then as my analysis requirements grew I became increasingly interested in how new and innovative scientific products could be created. I think you would be hard pressed to find many people working in this space who envisioned the volume and sophistication of data available to us today.
All of the code and resources used to construct the AWS infrastructure for the HLS project will be made available in open source repositories. These resources may be of interest to any organization undertaking large scale data processing with an event-driven model. An understanding of how the product’s reflectance values are derived can be found in the HLS user guide. An introductory notebook demonstrates how to leverage new data access patterns and data-proximate computing that the COG format will support.
More information about IMPACT can be found at NASA Earthdata and the IMPACT project website.