Introducing Croissant 🥐: A format for machine learning datasets
Croissant is a format for describing datasets used in machine learning (ML). It was designed to make it easier for ML practitioners to work with datasets across ML platforms and repositories. It is being developed in collaboration with the MLCommons Croissant Working Group [1].
Croissant provides enough metadata for ML platforms to load a dataset, so platform users can incorporate Croissant datasets into the training or evaluation of a model with just a few lines of code. Croissant support can also be added to the tools ML practitioners commonly use (e.g., for data preprocessing, analysis, or labeling). Besides helping developers work with ML datasets across platforms, Croissant facilitates dataset discovery: once dataset publishers generate Croissant metadata and dataset repositories support the format, dataset search engines can help users discover and use datasets regardless of where they were published. Croissant dataset descriptions can be created or edited with a visual editor and a Python library. More details about the Croissant launch are available in [2] and [3].
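To make "enough metadata to load a dataset" more concrete, the following is a minimal sketch of a Croissant-style JSON-LD description built with Python's standard library. The dataset name, file URL, and column names are hypothetical, and the `@context` is heavily abbreviated; a real description should use the full context and property definitions from the Croissant specification.

```python
import json


def build_croissant_metadata(name, file_url, columns):
    """Return a minimal Croissant-style JSON-LD dict describing one CSV file.

    `columns` maps column names to data types (e.g. "sc:Text").
    The structure is a simplified illustration, not a validated Croissant file.
    """
    return {
        # Abbreviated context (assumption): real files use the full context
        # published with the Croissant specification.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        # Resources: where the raw files live and how they are encoded.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",
                "contentUrl": file_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure: how records and fields map onto the raw files.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "default",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": f"default/{column}",
                        "dataType": data_type,
                        "source": {
                            "fileObject": {"@id": "data.csv"},
                            "extract": {"column": column},
                        },
                    }
                    for column, data_type in columns.items()
                ],
            }
        ],
    }


# Hypothetical dataset used only for illustration.
metadata = build_croissant_metadata(
    name="example-land-cover",
    file_url="https://example.org/data.csv",
    columns={"tile_id": "sc:Text", "label": "sc:Text"},
)
print(json.dumps(metadata, indent=2))
```

A platform that understands the format could read the `distribution` entries to fetch the files and the `recordSet` entries to interpret their columns, which is what keeps the user-facing loading code short.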
Croissant is designed as a modular and extensible format: its core specification can be extended to cover additional ML concepts and to integrate with other platforms and tools. One such extension is the Croissant RAI (responsible AI) vocabulary, which captures RAI concerns around biases, fairness, robustness, and the use of human labeling [4]. The geospatial use case for RAI in the Croissant RAI specification [5] was contributed by the IMPACT team.
To further incorporate geospatial data for AI, the IMPACT team, along with a proposed working group, will explore a Geo-Croissant extension built on the Croissant core and RAI specifications.
Proposed Geo-Croissant 🌍🥐 specification:
Croissant Core and the RAI extension support the efficient representation of metadata and RAI attributes. They also enhance processing in an end-to-end workflow. However, certain crucial characteristics required to define Earth observation datasets for AI are missing, some of which are:
- Spatial reference information
- Nested data attributes (file formats such as netCDF4, HDF5, Zarr)
- Interoperability with existing cloud-native geospatial data formats
- Geographical biases
- Region-restricted data access (e.g., compatibility with NASA Distributed Active Archive Centers)
- Data-fusion opportunities with datasets of other modalities (e.g., tabular, graph)
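One way to picture how some of these gaps might be filled is to attach geospatial properties to an existing dataset description. The sketch below is purely illustrative: no Geo-Croissant specification exists yet, so the `geocr:` prefix, every property name, and the example values are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GeoExtent:
    """Bounding box plus the coordinate reference system it is expressed in."""

    crs: str  # e.g. "EPSG:4326"
    west: float
    south: float
    east: float
    north: float

    def __post_init__(self):
        # Only the latitude order is checked; west > east is left alone
        # because a box may legitimately cross the antimeridian.
        if self.south > self.north:
            raise ValueError("south must not exceed north")


def with_geo_extension(metadata, extent, container_format=None, access_regions=()):
    """Return a copy of `metadata` with hypothetical geocr: properties added."""
    extended = dict(metadata)
    # Spatial reference information (hypothetical property name).
    extended["geocr:spatialExtent"] = asdict(extent)
    # Nested container format such as netCDF4, HDF5, or Zarr (hypothetical).
    if container_format is not None:
        extended["geocr:containerFormat"] = container_format
    # Regions from which the data may be accessed (hypothetical).
    if access_regions:
        extended["geocr:accessRegions"] = list(access_regions)
    return extended


# Hypothetical dataset used only for illustration.
base = {"@type": "sc:Dataset", "name": "example-flood-extent"}
geo_metadata = with_geo_extension(
    base,
    GeoExtent(crs="EPSG:4326", west=-91.5, south=29.0, east=-88.5, north=31.0),
    container_format="Zarr",
    access_regions=["US"],
)
```

Returning a copy rather than mutating the input keeps the core description usable by tools that do not know about the extension.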
With Geo-Croissant, we envision a standard way of defining geospatial datasets for AI. Developing Geo-Croissant will also involve building tools and platforms for converting existing datasets to the Croissant format, so that they can be used directly with machine learning and deep learning frameworks such as PyTorch, TensorFlow, Keras, and Hugging Face.
Geospatial datasets keep growing, with some approaching the petabyte scale and distributed across multiple archives, so there is a need for fast and efficient data input/output. To address this, Geo-Croissant will use metadata (data about data) to make datasets discoverable and to provide access to the underlying data only when it is required for training. It is also important to follow responsible Geo-AI practices: location is a key piece of information because the attributes of the data change with location. In addition, sampling strategy and geospatial bias are significant data-centric concepts that can lead to inaccuracies when training a model. The Geo-Croissant specification will represent such information efficiently, enhancing data processing in an end-to-end workflow.
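The two ideas in this paragraph, discovery from metadata alone and awareness of geographic bias, can be illustrated with a short sketch. It assumes a hypothetical in-memory catalog whose entries carry a bounding box as `(west, south, east, north)` in degrees, and it ignores boxes that cross the antimeridian; the catalog entries, function names, and the 30-degree band size are illustrative choices, not part of any specification.

```python
import math
from collections import Counter


def intersects(a, b):
    """True if two (west, south, east, north) boxes overlap or touch."""
    a_west, a_south, a_east, a_north = a
    b_west, b_south, b_east, b_north = b
    return (
        a_west <= b_east
        and b_west <= a_east
        and a_south <= b_north
        and b_south <= a_north
    )


def discover(catalog, query_bbox):
    """Return names of datasets whose metadata bbox overlaps the query.

    Only metadata is inspected; no data files are opened or downloaded.
    """
    return [entry["name"] for entry in catalog if intersects(entry["bbox"], query_bbox)]


def latitude_band_counts(latitudes, band_size=30):
    """Count samples per latitude band, keyed by each band's lower edge."""
    return Counter(math.floor(lat / band_size) * band_size for lat in latitudes)


# Hypothetical catalog entries used only for illustration.
catalog = [
    {"name": "amazon", "bbox": (-75.0, -15.0, -45.0, 5.0)},
    {"name": "sahel", "bbox": (-17.0, 10.0, 40.0, 20.0)},
    {"name": "europe", "bbox": (-10.0, 35.0, 30.0, 60.0)},
]

matches = discover(catalog, query_bbox=(-60.0, -10.0, -50.0, 0.0))
band_counts = latitude_band_counts([-3.0, -8.5, 12.0, 48.0, 51.2])
```

A strongly uneven `band_counts` would be one simple signal that a training set over-represents some regions, which is the kind of information a dataset description could record up front.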
Those interested in contributing to the development of Geo-Croissant are encouraged to contact Rajat Shinde (rajat.shinde@uah.edu).
References:
[1] https://mlcommons.org/working-groups/data/croissant/
[2] https://mlcommons.org/2024/03/croissant_metadata_announce/
[3] https://blog.research.google/2024/03/croissant-metadata-format-for-ml-ready.html
[4] Croissant and GeoCroissant discussion from the ESIP Data Readiness Cluster meeting (YouTube video): https://youtu.be/BtucRiyj3ag
[5] Croissant RAI Specification: https://mlcommons.github.io/croissant/docs/croissant-rai-spec.html