Collaborations between NASA and private, non-federal partners play a crucial role in advancing scientific research and technological innovation. One such collaboration is between NASA’s Interagency Implementation and Advanced Concepts Team (IMPACT) and International Business Machines (IBM). This partnership has led to the development of INDUS, a comprehensive suite of large language models (LLMs) specifically designed for various scientific domains.
INDUS consists of encoders and sentence transformers. The encoders convert natural language text into numeric representations that the models can process. They were trained on a corpus of 60 billion tokens covering astrophysics, planetary science, Earth science, heliophysics, and biological and physical sciences. Unlike generic tokenizers, the custom tokenizer in INDUS recognizes scientific terms, making it more adept at processing domain-specific vocabulary.
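To illustrate why a vocabulary that recognizes scientific terms matters, here is a minimal sketch of a greedy longest-match subword tokenizer. It is a toy and not the INDUS tokenizer: the function `greedy_tokenize` and both vocabularies are hypothetical, chosen only to show that a domain-aware vocabulary can keep a scientific term whole where a generic one splits it into fragments.

```python
def greedy_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabularies, for illustration only.
GENERIC_VOCAB = {"helio", "physics", "magnet", "o", "sphere", "the", "sun"}
SCIENCE_VOCAB = GENERIC_VOCAB | {"heliophysics", "magnetosphere"}

generic_pieces = greedy_tokenize("magnetosphere", GENERIC_VOCAB)
science_pieces = greedy_tokenize("magnetosphere", SCIENCE_VOCAB)

print(generic_pieces)  # ['magnet', 'o', 'sphere']
print(science_pieces)  # ['magnetosphere']
```

Fewer, more meaningful pieces per term give the encoder a cleaner input to work with, which is the intuition behind training a tokenizer on domain text.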
The IMPACT-IBM team reports that INDUS outperformed open LLMs on biomedical benchmarks, scientific question-answering, and Earth science entity recognition tests. Because the sentence transformer models were fine-tuned on a large number of text pairs, INDUS is well suited to processing researcher questions, retrieving relevant documents, and generating accurate answers. Validation tests have shown that INDUS retrieves pertinent passages from science corpora in response to NASA-curated queries.
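The retrieval step described above can be sketched as embedding the query and each document, then ranking documents by cosine similarity. The sketch below is an assumption-laden illustration, not the INDUS pipeline: `embed` is a toy bag-of-words stand-in for a real sentence transformer, and `retrieve`, `cosine`, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence transformer: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


# Hypothetical mini-corpus, for illustration only.
DOCS = [
    "solar wind interacts with the magnetosphere",
    "sea surface temperature trends in the pacific",
    "plant growth experiments in microgravity",
]

best = retrieve("how does microgravity affect plant growth", DOCS)
print(best)  # ['plant growth experiments in microgravity']
```

A trained sentence transformer replaces the word-count vectors with dense embeddings, so related questions and passages can match even when they share no exact words; the ranking logic stays the same.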
INDUS has been integrated into several NASA divisions and projects, including the Biological and Physical Sciences (BPS) Division, the Goddard Earth Sciences Data and Information Services Center (GES-DISC), and the Science Discovery Engine (SDE). With INDUS incorporated into these platforms, researchers have access to improved search capabilities, enhanced data curation systems, and more accurate dataset recommendations. The models' adaptability to different science domains makes INDUS a valuable tool for extracting and analyzing scientific information efficiently.
In keeping with NASA and IBM's commitment to open and transparent artificial intelligence, the INDUS models are openly available on platforms like Hugging Face. The release gives the scientific community access to advanced language models for a range of research tasks. Benchmark datasets for named entity recognition, extractive QA, and information retrieval have been released as well, which extends the usefulness of INDUS across multiple domains.
The collaboration between NASA and IBM in developing the INDUS suite of large language models represents a significant advancement in scientific research and data analysis. By leveraging domain-specific vocabulary and training strategies, INDUS offers researchers improved access to specialized knowledge and efficient data processing capabilities. With its open-access framework and adaptability to diverse scientific domains, INDUS stands as a testament to the power of collaborative efforts in enhancing scientific research and technological innovation.