Machine learning can achieve impressive results when trained on human-labeled datasets. However, hard problems remain in domains where labeled data is scarce, such as Earth science. Self-supervised learning (SSL) is designed to address this challenge. Using clever tricks that range from representation clustering to random transform comparisons, self-supervised learning for computer vision is a growing area of machine learning with a simple goal: learn meaningful vector representations of images, without human labels, such that similar images have similar vector representations.
Remote sensing in particular is characterized by a huge volume of images and, depending on the data survey, a reasonable amount of metadata contextualizing each image, such as location, time of day, temperature, and wind. However, when a phenomenon of interest cannot be found through a metadata search alone, research teams often spend hundreds of hours on visual inspection, combing through archives such as NASA’s Worldview, which covers all 197 million square miles of the Earth’s surface every day across 20 years of imagery.
This is the fundamental challenge addressed by the collaboration between IMPACT and the SpaceML initiative, which produced the Worldview image search pipeline. A key component of that pipeline is the Self-Supervised Learner (SSL), which employs self-supervised learning to build the model store. The SSL model sits on top of an unlabeled pool of data and circumvents the random search process: leveraging the vector representations it generates, researchers can provide a single reference image and search for similar images, enabling rapid curation of datasets of interest from massive unlabeled collections.
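The reference-image search works because similar images map to nearby vectors. A minimal sketch of that nearest-neighbor step in NumPy (the function name and toy 4-d embeddings are illustrative, not the pipeline's actual API):

```python
import numpy as np

def find_similar(reference: np.ndarray, embeddings: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k embeddings most similar to the reference
    vector, ranked by cosine similarity (highest first)."""
    # Normalize so the dot product equals cosine similarity.
    ref = reference / np.linalg.norm(reference)
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = emb @ ref
    return np.argsort(-scores)[:k]

# Toy corpus: four 4-d "image embeddings"; index 2 points in the same
# direction as the reference, so it should rank first.
corpus = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
reference = np.array([0.9, 0.1, 0.0, 0.0])
print(find_similar(reference, corpus, k=2))  # → [2 0]
```

In practice the corpus would hold one embedding per Worldview image, and approximate nearest-neighbor indexes replace the brute-force dot product at scale.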
The impetus behind this collaboration is to streamline and increase the efficiency of Earth science research. Rudy Venguswamy, SSL developer and winner of the NASA IMPACT team’s Exceptional Contribution Award, explains:
Machine learning has the potential to radically transform how we find out about things happening in our universe to, proverbially, more quickly find needles in our various haystacks. When I started building the SSL as a package, I wanted to build something for scientists in diverse fields, not just machine learning experts.
The SSL tool was released as an open-source package built on PyTorch Lightning. The pipeline uses GPU-based transforms from the NVIDIA DALI package for augmentation and can be trained across multiple GPUs, yielding a five- to ten-fold speedup in self-supervised training. Fresh random transforms in each epoch are critical to model learning, so faster transforms translate directly into faster training.
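The role those per-epoch random transforms play can be illustrated with a small NumPy sketch: each image yields two independently augmented views per pass, so the model never sees the exact same pair twice. (The crop size and flip probability here are invented for illustration; the real pipeline uses DALI's GPU operators.)

```python
import numpy as np

def random_views(image: np.ndarray, rng: np.random.Generator, crop: int = 24):
    """Produce two independently augmented views of one image
    (random crop + random horizontal flip), SimCLR-style."""
    views = []
    for _ in range(2):
        h, w = image.shape
        top = rng.integers(0, h - crop + 1)    # random crop origin
        left = rng.integers(0, w - crop + 1)
        view = image[top:top + crop, left:left + crop]
        if rng.random() < 0.5:
            view = view[:, ::-1]               # horizontal flip
        views.append(view)
    return views

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))          # stand-in for a satellite tile
v1, v2 = random_views(image, rng)
print(v1.shape, v2.shape)                      # each view is crop x crop
```

Because the crops and flips are resampled every epoch, the augmentation step sits on the training hot path, which is why moving it onto the GPU matters.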
The package is built with customizability in mind. SimCLR and SimSiam are currently supported, and users can pass custom encoder architectures to the model as well as tune model parameters in depth through optional, model-specific arguments. For instance, researchers can specify their own pretrained encoder or use one of the provided defaults pretrained on ImageNet.
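For context, the objective SimCLR optimizes over those paired augmented views is the normalized-temperature cross-entropy (NT-Xent) loss: each view should be closest to its partner among all other views in the batch. A minimal NumPy version (function and variable names are illustrative, not the package's API):

```python
import numpy as np

def nt_xent(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent loss for a batch where z1[i] and z2[i] are embeddings of
    two augmented views of the same image."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    n = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                     # never contrast a view with itself
    # Each view's positive target is its partner view in the other half.
    b = len(z1)
    targets = np.concatenate([np.arange(b, n), np.arange(0, b)])
    logprobs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-logprobs[np.arange(n), targets].mean())
```

Minimizing this pulls embeddings of views of the same image together and pushes all other images in the batch apart, which is what produces the similarity structure the search pipeline relies on.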
As an example of the SSL’s capabilities, the above diagram represents data from the Worldview website. The team trained SimCLR with the SSL on a sample of approximately 50,000 images and plotted a t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization, reducing the dimensionality of the embeddings to plot on a 2D plane. Even with no labels, the model clusters similar images together.
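That visualization step can be reproduced with scikit-learn's t-SNE implementation; here random vectors stand in for the real 50,000 SSL embeddings (the 128-d embedding size is an assumption for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((200, 128))   # stand-in for SSL image embeddings

# Reduce the 128-d embeddings to 2-D coordinates for plotting.
points = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(embeddings)
print(points.shape)  # → (200, 2)
```

Each row of `points` is then scattered on a 2D plane, often with the source image thumbnail drawn at its coordinates, which is how clusters like those in the diagram become visible.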
As part of the larger pipeline, the SSL streamlines the research process as scientists study phenomena such as wildfires, oil spills, desertification, and the polar vortex.