Dataversifying Natural Sciences: Pioneering a Data Lake Architecture for Curated Data-Centric Experiments in Life & Earth Sciences

Read original: arXiv:2403.20063 - Published 4/1/2024 by Genoveva Vargas-Solar (LIRIS), Jérôme Darmont (ERIC), Alejandro Adorjan (LIRIS), Javier A. Espinosa-Oviedo (LIRIS), Carmem Hara (ERIC), Sabine Loudcher (ERIC), Regina Motz (DIMAP), Martin Musicante (DIMAP), José-Luis Zechinelli-Martini

Introduction

This paper discusses the challenges of managing and curating massive amounts of data in experimental and observational sciences such as the life and earth sciences. The authors note that acquiring large datasets, even continuously, is now relatively easy and inexpensive. However, they argue that traditional data management approaches such as ETL (extract, transform, load) are ineffective for the needs of these fields.

The paper proposes using data lakes, repositories that store raw data in its original format, as a more suitable approach. Data lakes can accommodate data harvested from various digital sources. According to the paper, two elements are key to extracting value from data-driven experiments in life and earth sciences:

Maintaining metadata that captures the conditions and processes of the experiments to enable understanding and reproducibility.
Adopting an open science perspective that goes beyond just data sharing, and includes sharing know-how, decision-making, expertise, project management, and people within the research projects.

The remainder of the paper outlines the general approaches for curating and managing knowledge in life and earth sciences, the challenges in building data lakes for these domains, and the principles for building, maintaining, and exploiting a data lake to create "dataverses" that capture the history of experimental processes leading to scientific knowledge.

Related work

This section of the paper introduces the main concepts and approaches for maintaining and sharing data to enable data-driven experiments. It covers data harvesting tools, data curation techniques, data labs, data lakes, science lakes, and dataverses.

Data harvesting involves collecting and structuring data from the web, such as news articles, using web scraping tools like ParseHub, 80legs, and Octoparse. Data curation is the process of preparing research data for sharing and long-term preservation, which involves extensive preprocessing, cleaning, transformation, and documentation.

Data labs like Kaggle and Dryad provide environments for data storage, exploration, and collaborative sharing. Specialized repositories like DataONE and re3data offer platforms for researchers to access, store, share, and manage their datasets.

Data lakes are large storage repositories that hold raw data in its native format, enabling scalable and flexible big data analytics. Science lakes are a variant tailored for the scientific community, allowing better metadata curation and domain-specific data models.
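
As a minimal illustration of what "raw data in its native format" can mean in practice, the following Python sketch writes incoming bytes to disk unchanged and records a small sidecar file next to them. The function name ingest_raw, the raw/<source> directory layout, the .meta.json suffix, and the sample source name are assumptions made for this example; the paper does not prescribe a storage layout.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source_name: str, filename: str, payload: bytes) -> Path:
    """Store payload unchanged under lake_root/raw/<source_name>/ and write a sidecar file."""
    target_dir = lake_root / "raw" / source_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # No parsing or conversion happens here: the native format is preserved.
    target.write_bytes(payload)
    sidecar = {
        "source": source_name,
        "filename": filename,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    (target_dir / (filename + ".meta.json")).write_text(json.dumps(sidecar, indent=2))
    return target


# Hypothetical sample: a tiny CSV trace from an invented source name.
raw_payload = b"t,amplitude\n0,0.1\n1,0.4\n"

with tempfile.TemporaryDirectory() as tmp:
    stored = ingest_raw(Path(tmp), "seismic_station_a", "trace_001.csv", raw_payload)
    stored_bytes = stored.read_bytes()
    sidecar_record = json.loads((stored.parent / "trace_001.csv.meta.json").read_text())
```

The checksum in the sidecar is one simple way to later verify that a raw file has not been altered, which matters when curated versions are derived from it.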

Dataverses are data repository platforms that enable researchers to publish, cite, and discover datasets with rich metadata and tools for data analysis and collaboration. Examples include the Dataverse Project and dataverses developed by various academic institutions.

The paper emphasizes the importance of data lakes and dataverses in life and earth sciences, where they consolidate and curate scientific data to support data-driven experiments and enable open science and interdisciplinary collaboration.

Maintaining and sharing earth and life sciences knowledge: challenges

The paper identifies three challenges in organizing and integrating life and earth science data in a data lake.

The first challenge is how to structure and organize life and earth sciences metadata. Metadata modeling can help make the data findable, accessible, interoperable, and reusable (FAIR principles). Metadata can represent structural, semantic, and contextual aspects of the data.
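
To make the three metadata facets concrete, here is a small Python sketch of a metadata record with structural, semantic, and contextual parts, plus a check that lists which parts are still empty. The class name DatasetMetadata, its field names, and the sample values are hypothetical, and the check is only a rough completeness test, not a formal FAIR assessment.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetMetadata:
    identifier: str = ""                              # supports findability
    access_url: str = ""                              # supports accessibility
    structural: dict = field(default_factory=dict)    # format, columns, units
    semantic: dict = field(default_factory=dict)      # vocabulary terms, keywords
    contextual: dict = field(default_factory=dict)    # instrument, date, conditions


def missing_fields(meta: DatasetMetadata) -> list:
    """Return the names of empty fields, in declaration order."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


# Hypothetical record with no semantic annotations yet.
record = DatasetMetadata(
    identifier="lake:seismic/trace_001",
    access_url="file:///lake/raw/seismic_station_a/trace_001.csv",
    structural={"format": "csv", "columns": ["t", "amplitude"]},
    contextual={"instrument": "hypothetical seismometer", "collected_on": "2024-01-15"},
)
gaps = missing_fields(record)

# Adding semantic annotations closes the gap.
record.semantic = {"keywords": ["seismic signal"]}
gaps_after = missing_fields(record)
```

A check like this could flag datasets that need a curator's attention before they are shared.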

The second challenge is how to integrate the heterogeneous data (text, signals, multimedia, proprietary formats) from different sources into the data lake. This requires a pipeline for data discovery, exploration, selection, and integration.
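
The four pipeline stages named above can be sketched as plain Python functions over an in-memory catalog. The catalog entries, the "kind" labels, and the rule used for selection (dropping empty files) are assumptions chosen for illustration; the paper does not specify how each stage is implemented.

```python
# Hypothetical catalog of items already stored in the lake.
catalog = [
    {"name": "trace_001.csv", "kind": "signal", "size_bytes": 2048},
    {"name": "field_notes.txt", "kind": "text", "size_bytes": 512},
    {"name": "beach_photo.jpg", "kind": "multimedia", "size_bytes": 40960},
    {"name": "trace_002.csv", "kind": "signal", "size_bytes": 0},
]


def discover(entries, kinds):
    """Discovery: find the entries whose kind is relevant to the question at hand."""
    return [e for e in entries if e["kind"] in kinds]


def explore(entries):
    """Exploration: summarize the candidates so a researcher can judge them."""
    return {
        "count": len(entries),
        "total_bytes": sum(e["size_bytes"] for e in entries),
        "empty": [e["name"] for e in entries if e["size_bytes"] == 0],
    }


def select(entries):
    """Selection: keep only usable entries (here, simply the non-empty ones)."""
    return [e for e in entries if e["size_bytes"] > 0]


def integrate(entries):
    """Integration: describe the selected entries as one combined collection."""
    return {
        "members": sorted(e["name"] for e in entries),
        "kinds": sorted({e["kind"] for e in entries}),
        "total_bytes": sum(e["size_bytes"] for e in entries),
    }


candidates = discover(catalog, {"signal", "text"})
summary = explore(candidates)
collection = integrate(select(candidates))
```

Keeping each stage as a separate function makes it easy to insert a review step between exploration and selection.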

The third challenge is how to integrate the data while considering the needs of scientists. Researcher-in-the-loop (RITL) is a crucial aspect, where researchers assess the data conditions and make decisions about future tasks. This human-in-the-loop (HITL) approach is important for supervision, exception control, optimization, and maintenance.
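
One common way to put a researcher in the loop is to let an algorithm decide only when it is confident and to route the remaining cases to a person. The Python sketch below illustrates that idea under stated assumptions: the toy labelling rule, the 0.8 threshold, and the scripted answers that stand in for an interactive researcher are all invented for the example and are not taken from the paper.

```python
def auto_label(record):
    """Toy rule: label by peak amplitude; confidence grows with distance from the cut-off."""
    cutoff = 0.5
    label = "event" if record["peak"] >= cutoff else "noise"
    confidence = min(1.0, abs(record["peak"] - cutoff) * 2 + 0.5)
    return label, confidence


def curate(records, label_fn, ask_researcher, threshold=0.8):
    """Accept confident automatic labels; defer the rest to the researcher."""
    decisions = []
    for record in records:
        label, confidence = label_fn(record)
        if confidence >= threshold:
            decided_by = "algorithm"
        else:
            # The researcher sees the suggestion and may confirm or override it.
            label = ask_researcher(record, label)
            decided_by = "researcher"
        decisions.append({"id": record["id"], "label": label, "decided_by": decided_by})
    return decisions


# Scripted answers stand in for a real interactive prompt.
scripted_answers = {"r2": "noise"}


def scripted_researcher(record, suggested_label):
    return scripted_answers.get(record["id"], suggested_label)


records = [
    {"id": "r1", "peak": 0.9},
    {"id": "r2", "peak": 0.55},  # close to the cut-off, so confidence is low
    {"id": "r3", "peak": 0.1},
]
decisions = curate(records, auto_label, scripted_researcher)
```

Recording who made each decision keeps the human contribution visible in the metadata, which supports later review of the curation process.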

The paper states that the scientific content, including data, analytics tasks, and associated metadata, should be extracted and computed to allow the produced knowledge to be reusable and the analytics results to be reproducible, adhering to FAIR principles.

Towards a curation approach for building a Life & Earth sciences data lake

The paper describes a vision for building, maintaining, and exploiting a life and earth sciences data lake. The approach involves the quantitative and qualitative curation of data harvested digitally and in situ. Heterogeneous raw data is gathered and stored in the data lake. Algorithms and researchers then process, filter, and classify the data, producing metadata that is also stored in the data lake. Data exploration and integration processes can be performed on data samples from the data lake for experimental purposes, generating content associated with the data. Clean and curated data, along with metadata representing the quantitative and qualitative perspectives of the experiments, can then be shared in a dataverse.

Figure 1: General overview of the curation approach for building, maintaining and exploiting a data lake.

This paper proposes an approach for integrating heterogeneous data from various life and earth sciences sources. The key elements are:

Developing a pivot meta-representation to capture the content and process metadata (technical, structural, semantic) of the disparate data sources. This allows integrated access to the data collections and curated versions through a global knowledge graph.
Maintaining a catalog of data-related questions, experiments, and results to promote open science and share knowledge derived from the data.
Designing experiment models and languages to enable friendly, context-aware construction of experiments in life and earth sciences.
Collecting execution data (input, datasets, calibration, results) from these experiments.
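
To illustrate how a knowledge graph can link sources, curated versions, experiments, and results, the Python sketch below stores facts as (subject, predicate, object) triples and traces a result back to the raw sources it depends on. The predicate names (harvestedFrom, curatedVersionOf, usedDataset, producedResult) and all identifiers are hypothetical; the paper does not define a vocabulary for its pivot meta-representation.

```python
# Hypothetical triples describing one dataset, its curated version, and one experiment.
graph = {
    ("dataset:trace_001", "hasFormat", "csv"),
    ("dataset:trace_001", "harvestedFrom", "source:seismic_station_a"),
    ("dataset:trace_001_clean", "curatedVersionOf", "dataset:trace_001"),
    ("experiment:exp_1", "usedDataset", "dataset:trace_001_clean"),
    ("experiment:exp_1", "producedResult", "result:exp_1_labels"),
}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def raw_sources_of(triples, result):
    """Trace a result back to the raw sources it ultimately depends on."""
    sources = set()
    for experiment in subjects(triples, "producedResult", result):
        pending = objects(triples, experiment, "usedDataset")
        seen = set()
        while pending:
            dataset = pending.pop()
            if dataset in seen:
                continue
            seen.add(dataset)
            sources.update(objects(triples, dataset, "harvestedFrom"))
            # Follow the chain from a curated version back to what it was derived from.
            pending.extend(objects(triples, dataset, "curatedVersionOf"))
    return sorted(sources)


lineage = raw_sources_of(graph, "result:exp_1_labels")
```

A production system would more likely use an RDF store or a graph database, but the lineage query follows the same pattern of walking from results to experiments to datasets to sources.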

The approach will be tested through two pilot studies: classifying seismic signals to detect natural vs. human-made earthquakes, and classifying and modeling the behavior of the Portuguese man o' war on the Brazilian coast. The goal is to apply statistical methods to unveil new patterns in the data, and build predictive models to increase knowledge about these phenomena. The integrated data lake will include the raw collected data, the data produced through data science experimentation, and contextual metadata describing the data and experiments.

Conclusions and future work

The paper proposes addressing fundamental research topics in data science, big data management, and analytics to solve data-driven problems in life and earth sciences. The key contribution is the design and exploration of a data lake with a well-adapted metadata model for life and earth sciences experiments that consume and produce quantitative and qualitative data. An important aspect of the work is defining exploration operators and pipelines to exploit the data lake's content, enabling the maintenance and development of new life and earth sciences experiments.

Acknowledgements

The paper describes work done as part of the LETITIA project, which is funded by the Fédération Informatique de Lyon.
