GeoGalactica: A Scientific Large Language Model in Geoscience

2401.00434

Published 4/16/2024 by Zhouhan Lin, Cheng Deng, Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Zhongmou He, Yuanyuan Shi, Beiya Dai, Yunchong Song and 11 others

cs.CL

Abstract

Large language models (LLMs) have achieved huge success for their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP). Due to their impressive abilities, LLMs have shed light on potential inter-discipline applications to foster scientific discoveries of a specific domain by using artificial intelligence (AI for science, AI4S). In the meantime, utilizing NLP techniques in geoscience research and practice is wide and convoluted, contributing from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take the initial step to leverage LLM for science, through a rather straightforward approach. We try to specialize an LLM into geoscience, by further pre-training the model with a vast amount of texts in geoscience, as well as supervised fine-tuning (SFT) the resulting model with our custom collected instruction tuning dataset. These efforts result in a model GeoGalactica consisting of 30 billion parameters. To our best knowledge, it is the largest language model for the geoscience domain. More specifically, GeoGalactica is from further pre-training of Galactica. We train GeoGalactica over a geoscience-related text corpus containing 65 billion tokens, preserving as the largest geoscience-specific text corpus. Then we fine-tune the model with 1 million pairs of instruction-tuning data consisting of questions that demand professional geoscience knowledge to answer. In this technical report, we will illustrate in detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the checkpoints of GeoGalactica during the first 3/4 of pre-training.

Overview

This paper introduces GeoGalactica, a 30-billion-parameter large language model specialized for geoscience by further pre-training Galactica.
GeoGalactica is designed to apply large language models to a variety of problems in the geosciences, such as geological analysis, climate modeling, and natural resource management.
The researchers trained GeoGalactica on a 65-billion-token corpus of geoscience text, fine-tuned it on 1 million instruction pairs, and evaluated it on several benchmark tasks against other state-of-the-art models.

Plain English Explanation

The paper describes the development of a new AI model called GeoGalactica that is focused on geoscience applications. Large language models like GPT-3 and BERT have shown impressive capabilities in understanding and generating human-like text, and the researchers wanted to see whether such models could be applied to problems in the geosciences.

Geoscience is a broad field that includes geology, climatology, hydrology, and natural resource management. The researchers trained GeoGalactica on a large corpus of text related to these topics, hoping that the model would learn the specialized language and concepts used in the field. They then evaluated GeoGalactica's performance on a variety of benchmark tasks, like identifying geological features in text or generating plausible climate model simulations.

The key idea behind GeoGalactica is that a large language model pre-trained on geoscience data can develop an understanding of the domain that carries over to a wide range of tasks. This could make geoscience research and applications more efficient by automating certain analysis and prediction tasks that currently require significant human effort.

Technical Explanation

The researchers built GeoGalactica by further pre-training Galactica, an existing scientific language model, on geoscience-specific data rather than training a model from scratch. They collected a corpus of 65 billion tokens from scientific papers, reports, and other sources related to geology, climatology, hydrology, and other geoscience disciplines. This corpus was used to continue pre-training the model so that it could learn the specialized language and concepts of the geosciences. The resulting model was then fine-tuned on 1 million instruction pairs made up of questions that require professional geoscience knowledge to answer.
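One common way to prepare a text corpus for this kind of continued pre-training is to concatenate tokenized documents and cut the stream into fixed-length blocks. The sketch below illustrates that idea only; it is not the authors' pipeline, and the names `pack_documents`, `next_token_pairs`, `block_size`, and `eos_id` are hypothetical choices made for the example.

```python
from typing import Iterable, List, Tuple


def pack_documents(
    docs: Iterable[List[int]], block_size: int, eos_id: int
) -> List[List[int]]:
    """Concatenate tokenized documents and cut them into full blocks.

    Each document is followed by `eos_id` so the model can see where one
    document ends and the next begins. This is an illustrative assumption,
    not a detail taken from the paper.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    buffer: List[int] = []
    blocks: List[List[int]] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        # Emit as many complete blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    # A trailing partial block, if any, is dropped.
    return blocks


def next_token_pairs(block: List[int]) -> Tuple[List[int], List[int]]:
    """Split a block into (inputs, targets) for next-token prediction."""
    return block[:-1], block[1:]


# Toy token ids standing in for three tokenized geoscience documents.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
toy_blocks = pack_documents(toy_docs, block_size=4, eos_id=0)
```

Packing avoids spending compute on padding tokens, at the cost of letting a block span more than one document.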

To evaluate the model's performance, the researchers tested it on a variety of benchmark tasks, including:

Geological feature extraction: Identifying key geological features (e.g. faults, minerals, rock types) in text
Climate modeling: Generating plausible climate model simulations based on textual descriptions
Natural resource management: Answering questions and providing recommendations related to natural resource management
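The summary does not say how these tasks were scored. As a minimal sketch, assuming an extraction task is scored by comparing the set of predicted features with a gold set, precision, recall, and F1 could be computed as below. The function name `extraction_prf` and the example terms are hypothetical.

```python
from typing import Iterable, Set, Tuple


def _normalize(terms: Iterable[str]) -> Set[str]:
    """Lowercase and trim terms, dropping empty strings."""
    return {t.strip().lower() for t in terms if t.strip()}


def extraction_prf(
    predicted: Iterable[str], gold: Iterable[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for set-based feature extraction."""
    pred = _normalize(predicted)
    ref = _normalize(gold)
    true_positives = len(pred & ref)

    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical model output and reference annotation for one passage.
p, r, f1 = extraction_prf(
    predicted=["Normal fault", "quartz", "basalt"],
    gold=["normal fault", "basalt", "feldspar", "granite"],
)
```

Here two of the three predicted terms match the four gold terms, so precision is 2/3, recall is 1/2, and F1 is 4/7.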

The results showed that GeoGalactica significantly outperformed other state-of-the-art models on these geoscience-specific tasks, demonstrating the value of pre-training on domain-specific data. The researchers also found that GeoGalactica transferred its knowledge to new tasks and domains within the geosciences, suggesting that it could be useful across a wide range of applications.

Critical Analysis

The researchers acknowledge several limitations and areas for further research in the paper. For example, they note that the performance of GeoGalactica may be constrained by the quality and completeness of the training data, and that more work is needed to understand the model's reasoning and decision-making processes.

Additionally, the paper does not address potential biases or safety concerns that may arise when using large language models like GeoGalactica for high-stakes applications in the geosciences. As with other large language models, the model may generate plausible-sounding but inaccurate or even harmful outputs, which could have significant consequences in fields like natural resource management or climate modeling.

Overall, while the results presented in the paper are promising, further research and careful consideration of the model's limitations and potential risks will be necessary to fully realize the benefits of GeoGalactica and similar domain-specific large language models.

Conclusion

The GeoGalactica paper introduces an approach to applying large language models to geoscience. By further pre-training a model on a large corpus of geoscience-related data, the researchers have created a tool that can handle a variety of tasks in the field, from geological analysis to climate modeling.

The strong performance of GeoGalactica on benchmark tasks suggests that this approach could advance geoscience research and applications by making certain tasks more efficient. However, the paper also highlights the need for further research to address the model's limitations and potential risks, so that GeoGalactica and similar models are developed and deployed responsibly and safely.

As large language models continue to evolve and find applications in diverse domains, the GeoGalactica project shows how these systems can be tailored to the specific needs of a field like geoscience. The lessons learned from this research may inform the development of other domain-specific language models in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery

Yu Zhang, Xiusi Chen, Bowen Jin, Sheng Wang, Shuiwang Ji, Wei Wang, Jiawei Han

In many scientific fields, large language models (LLMs) have revolutionized the way with which text and other modalities of data (e.g., molecules and proteins) are dealt, achieving superior performance in various applications and augmenting the scientific discovery process. Nevertheless, previous surveys on scientific LLMs often concentrate on one to two fields or a single modality. In this paper, we aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs regarding their architectures and pre-training techniques. To this end, we comprehensively survey over 250 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality. Moreover, we investigate how LLMs have been deployed to benefit scientific discovery. Resources related to this survey are available at https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models.

6/18/2024

cs.CL

Scientific Computing with Large Language Models

Christopher Culver, Peter Hicks, Mihailo Milenkovic, Sanjif Shanmugavelu, Tobias Becker

We provide an overview of the emergence of large language models for scientific computing applications. We highlight use cases that involve natural language processing of scientific documents and specialized languages designed to describe physical systems. For the former, chatbot style applications appear in medicine, mathematics and physics and can be used iteratively with domain experts for problem solving. We also review specialized languages within molecular biology, the languages of molecules, proteins, and DNA where language models are being used to predict properties and even create novel physical systems at much faster rates than traditional computing methods.

6/12/2024

cs.CL cs.AI cs.LG

Large Language Models and Knowledge Graphs for Astronomical Entity Disambiguation

Golnaz Shapurian

This paper presents an experiment conducted during a hackathon, focusing on using large language models (LLMs) and knowledge graph clustering to extract entities and relationships from astronomical text. The study demonstrates an approach to disambiguate entities that can appear in various contexts within the astronomical domain. By collecting excerpts around specific entities and leveraging the GPT-4 language model, relevant entities and relationships are extracted. The extracted information is then used to construct a knowledge graph, which is clustered using the Leiden algorithm. The resulting Leiden communities are utilized to identify the percentage of association of unknown excerpts to each community, thereby enabling disambiguation. The experiment showcases the potential of combining LLMs and knowledge graph clustering techniques for information extraction in astronomical research. The results highlight the effectiveness of the approach in identifying and disambiguating entities, as well as grouping them into meaningful clusters based on their relationships.

6/18/2024

cs.CL

Open Generative Large Language Models for Galician

Pablo Gamallo, Pablo Rodríguez, Iria de-Dios-Flores, Susana Sotelo, Silvia Paniagua, Daniel Bardanca, José Ramom Pichel, Marcos Garcia

Large language models (LLMs) have transformed natural language processing. Yet, their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt to Galician two existing LLMs trained on larger corpora, thus mitigating the data constraints that would arise if the training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal a promising performance, underscoring the importance of linguistic diversity in generative models.

6/21/2024

cs.CL