Architecting the Future: A Vision for Using Large Language Models to Enable Open Science
Kaylin Bugbee, Rahul Ramachandran
“We are called to be architects of the future, not its victims.” -R. Buckminster Fuller
Emerging new technologies are always feared by some. In 1474, priest Filippo de Strata worried that the “brothel of the printing press” would dilute the quality of information made available to readers. In the mid-1800s, people were concerned the railway would destroy community life. In 1969, Neil Postman and others worried that low-quality programming on television would divert viewers from the information that mattered. Yet, despite the fears and fearmongers, these technologies have emerged to fundamentally change our lives and our world.
We are experiencing the emergence of another transformative new technology: large language models, or LLMs. The most familiar application of LLMs is ChatGPT, powered by the Generative Pre-trained Transformer (GPT) family of language models, but there are other LLMs in development by major tech corporations. Applications like ChatGPT allow a user to ask questions and receive answers in a conversational way. More specifically, ChatGPT and similar tools open new pathways for information discovery. Instead of formulating concise, specific, keyword-based queries to seek information, a user can simply have a conversation with ChatGPT in a more natural way. Results are often returned in context instead of isolated lists of datasets that are ranked by a relevance scheme that may or may not be useful or understandable to a user.
LLMs are promising but pose a number of challenges for science. Because many of these models are generative, they are prone to hallucinations, i.e. making things up in order to give the user an answer. This poses a problem for science, and especially for information providers, who have a responsibility to deliver reasonably trustworthy results to their users. LLMs are built on the shoulders of existing content (not all of which can be properly labeled as ‘giants’), resulting in a number of challenges. Some persistent issues include propagation of bias, lack of attribution to the original content creators and gaps in LLM content coverage due to the passage of time or constraints of the original corpus on which the model was trained. Lastly, the development process and the data used for proprietary LLMs are opaque, making it difficult to understand the strengths, weaknesses and gaps in an LLM. These challenges push up against some of the central value systems of science including reproducibility, transparency, attribution and broader movements like open science. While these challenges are real, they should not stop the responsible adoption and use of technologies like LLMs. We stand in a unique position to architect the future of science and its relationship to techniques like LLMs. We should not fall victim to fear but instead work together to define a future that responsibly includes LLMs.
Open Science and AI/LLM Principles
For data systems and programs, AI and LLMs have the potential to transform the three main aspects of open science: increasing accessibility to the scientific process & knowledge, making research & knowledge sharing more efficient and understanding scientific impact. To responsibly and ethically implement LLMs into data systems, principles need to be developed for the entire AI lifecycle. While a number of legislative acts and AI principles already exist, we still lack guiding principles for designing, implementing and operating AI that were developed through the lens of open science. These open science principles must be applied to both developing new models from scratch and reusing existing models.
Given the lack of guiding principles, we propose that open science AI principles can be described as the 5 Ts: Transparency, Trust, Teamwork, Training and Techniques. Transparency emphasizes a commitment to making the models, workflows, data, code and validation techniques open. It also means openly disclosing when LLMs are used in applications. Trust focuses on building trust with the users of the LLM through providing factual answers, providing attribution for answers and reducing bias as much as possible. Teamwork acknowledges that the community is critical in the responsible development and use of LLMs. This T emphasizes the value in collaborating across organizations, sharing resources and ensuring the scientific community can participate in every step of the process. Teamwork with the community can also help ensure that valid use cases are defined for LLMs. Training acknowledges that users must be trained to interact responsibly with LLMs and to understand the strengths and weaknesses of those models. Training focuses on building AI skills like prompt engineering through workshops, documentation and notebooks. Lastly, Techniques recognizes that understanding and monitoring emerging methods for building LLM applications is critical to ensuring that trustworthiness is built into the design. Techniques like retrieval-augmented generation (RAG) or constitutional AI will be critical in ensuring LLMs are operated in a constrained and trustworthy manner. The rapidly changing nature of AI will require constant monitoring and assessment of emerging techniques that will dictate how we engineer reliable AI applications to support open science.
Approach for an LLM-Enabled, Open Science Search Infrastructure
For data systems, these 5 Ts of open science AI principles should guide the development and operation of LLM-enabled applications. We envision an LLM-enabled, open science search infrastructure that embraces these principles. At the heart of this search infrastructure are curated information sources and the LLMs themselves. Curated science information sources are essential to ensuring trust and quality in the search results. Curation grounds the model, constraining access to only documents and facts relevant to the topic at hand. The LLMs generate the retrievals needed to power the applications. For instance, LLM tools and frameworks interface with trusted external sources, like APIs, and leverage techniques like RAG to meet the demands of individual application use cases. Also, trusted system operators build a variety of applications to meet the needs of the scientific community. These applications include upstream data management processes like annotating metadata, searching for data or checking for compliance with organizational requirements. The applications can also include downstream data analysis and application needs such as analyzing how many fires took place over a given area for a given time period or how many exoplanets orbit a distant star.
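To make the grounding idea concrete, here is a minimal Python sketch of the retrieval step in a RAG-style workflow, under stated assumptions. The `Document` type, the three sample entries and their source labels are hypothetical, and the simple word-overlap ranking is a stand-in for whatever retriever a real system would use (for example, embedding-based search). The sketch does not call an LLM; it only illustrates how a curated corpus can constrain what the model sees and how source labels can be carried into the prompt for attribution.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    source: str  # label used for attribution (hypothetical identifiers below)
    text: str


# Hypothetical curated corpus; a real system would load vetted science content.
CURATED = [
    Document("hypothetical-fire-dataset-guide",
             "Active fire detections are reported daily with latitude, longitude and confidence."),
    Document("hypothetical-exoplanet-catalog-notes",
             "The catalog lists confirmed exoplanets with their host star and orbital period."),
    Document("hypothetical-metadata-policy",
             "Every dataset must include a title, abstract and temporal extent in its metadata."),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[Document], k: int = 2) -> list[Document]:
    """Rank documents by word overlap with the query and drop non-matches."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(doc.text)), doc) for doc in corpus]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in matches[:k]]


def build_prompt(query: str, passages: list[Document]) -> str:
    """Assemble a prompt that limits the model to the retrieved passages."""
    context = "\n".join(
        f"[{i}] ({doc.source}) {doc.text}"
        for i, doc in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered passages and cite them by number.\n"
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    question = "How many exoplanets orbit a given host star?"
    print(build_prompt(question, retrieve(question, CURATED, k=1)))
```

With the sample corpus, `retrieve(question, CURATED, k=1)` returns only the exoplanet catalog entry, and a question that shares no words with the curated corpus returns an empty list, which an application could treat as a signal to decline rather than guess.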
Vision for the Future
LLMs will undoubtedly transform the science research lifecycle. Defining a research goal will be streamlined through LLM-powered literature surveys, proposal surveys and analysis of research trends. Similarly, finding data and information will be optimized through integrated data and information discovery, curated and focused search applications and targeted location search. Accessing data will be more efficient as it will be easier to subset data by variable, location and/or time. Data analysis will be accelerated as tools are developed to more easily probe datasets for trends or anomalies. LLMs may also review and optimize data analysis code, making analysis faster. In addition, LLMs will help reduce the time to publication by checking data and information for compliance and by streamlining data management tasks such as creating metadata or annotating documents.
Any trusted scientific agency has a responsibility to provide knowledge with transparency and justification about the source of that information. Misinformation will always be a problem in the internet age. We need to make use of curation workflows to both mitigate and detect misinformation. We also need to systematically design and evaluate LLM workflows to ensure trustworthy results.
Collaboration is essential to successfully develop and implement LLMs. Specifically, collaborating with scientific stakeholders is essential to ensure use cases and applications are developed for various needs across diverse scientific disciplines. Similarly, as innovation moves rapidly in the LLM space, sustainable collaborations with partners help close the gap in understanding new techniques and architectures.
Innovation is needed to allow for serendipitous discovery of data and information and to ensure epistemic diversity. New, transformative ideas happen when serendipitous discovery is possible. If only one answer is ever returned, this will limit the diversity of views into a topic. This poses problems for interdisciplinary science and non-traditional users such as decision makers, educators and the general public.
Acknowledgement of existing bias in scientific systems is needed. There is a lot of discussion within the AI community about bias in AI systems, and of course, we should do everything in our power to reduce bias. However, we need to acknowledge that bias is already built into our existing scientific systems. Publications in and of themselves represent a biased sample of all research that is conducted because research with successful results is far more likely to be published. Using measures like citation counts or h-indexes for relevance is another form of bias. Frequently cited articles will keep getting cited simply because they receive more exposure, even though a given article may or may not be the best reference for a given search.
Finally, while we are committed to open science and open science principles for the AI lifecycle, not everyone will have the resources or collaboration opportunities to build open models. We should make models open whenever possible so that others can benefit. We should also establish an ongoing discussion within the science community about whether some use cases, especially those involving implementation in applications, do not require complete transparency and the use of open models.
Emerging new technologies, like large language models, are already changing the scientific research cycle. As data system providers, we have the opportunity to responsibly design and build search capabilities that align with open science principles. By embracing the 5 Ts, we can do our part to alleviate the fears surrounding LLMs to architect the open future (or futures) that we want to see.
Bugbee, Kaylin, & Ramachandran, Rahul. (2023). Architecting the Future: A Vision for Using Large Language Models to Enable Open Science. Zenodo. https://doi.org/10.5281/zenodo.8403782
The following documents and publications were referred to when writing this blog:
- Ramachandran, R., Bugbee, K., & Murphy, K. (2021). From open data to open science. Earth and Space Science, 8, e2020EA001562. https://doi.org/10.1029/2020EA001562
- Shelley Stall, Guido Cervone, Caroline Coward, et al. Ethical and Responsible Use of AI/ML in the Earth, Space, and Environmental Sciences. ESS Open Archive. April 12, 2023. DOI: 10.22541/essoar.168132856.66485758/v1
- ChatGPT: five priorities for research. Nature 614, 224–226 (2023). doi:10.1038/d41586-023-00288-7
- Tools such as ChatGPT threaten transparent science; here are our ground rules for their use. Nature 613, 612 (2023). doi:10.1038/d41586-023-00191-1
- Why open-source generative AI models are an ethical way forward for science. Nature 616, 413 (2023). doi:10.1038/d41586-023-01295-4
- ChatGPT is a black box: how AI research can break it open. Nature 619, 671–672 (2023). doi:10.1038/d41586-023-02366-2
- Coursera Course: Generative AI with Large Language Models by DeepLearning.AI & Amazon Web Services. https://www.coursera.org/learn/generative-ai-with-llms/
- Latest Generative AI Boldly Labeled As Constitutional AI Such As Claude By Anthropic Has Heart In The Right Place, Says AI Ethics And AI Law. Forbes. https://www.forbes.com/sites/lanceeliot/2023/05/25/latest-generative-ai-boldly-labeled-as-constitutional-ai-such-as-claude-by-anthropic-has-heart-in-the-right-place-says-ai-ethics-and-ai-law/?sh=5c8c86323064.