FAIR evaluation of ten widely used chemical datasets: Lessons learned and recommendations

Read original: arXiv:2407.15591 - Published 7/23/2024 by Marcos Da Silveira, Oona Freudenthal, Louis Deladiennee

Overview

This paper evaluates how well open data on hazardous substances in North America and Europe complies with the FAIR (Findability, Accessibility, Interoperability, Reusability) principles.
The researchers used both manual and automated approaches to evaluate FAIR compliance in 10 widely used datasets.
The highest score from the automated analysis was 54/100, indicating room for improvement in making these datasets more FAIR.
The manual analysis revealed that some FAIR metrics were satisfied, but not detected by the automated tools due to issues like missing metadata or non-standard data formats.

Plain English Explanation

The paper examines open data on hazardous chemicals found in North America and Europe. The goal is to understand how findable, accessible, interoperable, and reusable this data is, based on a set of FAIR principles.

The researchers used two approaches to evaluate the FAIR-ness of the data: a manual approach with online questionnaires, and an automated approach using specialized tools. The manual method helps start discussions on implementing FAIR principles and identifies areas needing more work. The automated tools perform a series of tests on the data, starting from the web address, to check how well the FAIR principles are applied.

The analysis covered 10 widely used datasets from Europe and North America. The highest automated score was 54 out of 100, indicating there is room for improvement to make the data more FAIR. The manual analysis found that some FAIR criteria were met, but the automated tools couldn't detect them, likely because the metadata was missing or the data format wasn't standard.

The paper provides details on the analysis results, the issues identified, and suggestions for addressing them to make the data more findable, accessible, interoperable, and reusable.

Technical Explanation

This paper presents a comprehensive analysis of the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles as applied to open data on hazardous substances in North America and Europe. The researchers implemented a two-pronged approach, involving both manual and automated assessments.

The manual approach utilized online questionnaires to guide users through a structured evaluation of FAIR compliance. This method is particularly useful for initiating discussions on FAIR implementation within research teams and identifying areas requiring further attention.

The automated approach leveraged tools like F-UJI and FAIR Checker, which perform a series of tests starting from the data resource's URL. These tools provide a more systematic and scalable way to evaluate FAIR-ness compared to the manual method.
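To make the idea of "a series of tests starting from the URL" concrete, here is a minimal sketch in Python. It is not how F-UJI or FAIR Checker are implemented: the `Resource` structure, the four check functions, their names, and the equal weighting are all assumptions chosen for illustration, and the page is passed in as already-fetched data so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Resource:
    """An already-fetched dataset landing page (hypothetical structure)."""
    url: str
    status_code: int
    headers: Dict[str, str]
    body: str


# Each check is a deliberately crude stand-in for a real FAIR metric test.
def uses_https(r: Resource) -> bool:
    return r.url.startswith("https://")


def resolves(r: Resource) -> bool:
    return r.status_code == 200


def declares_content_type(r: Resource) -> bool:
    return any(k.lower() == "content-type" for k in r.headers)


def mentions_license(r: Resource) -> bool:
    return "license" in r.body.lower()


# Hypothetical check names; equal weights are an assumption.
CHECKS: List[Tuple[str, Callable[[Resource], bool]]] = [
    ("A: served over HTTPS", uses_https),
    ("F: identifier resolves", resolves),
    ("I: content type declared", declares_content_type),
    ("R: license mentioned", mentions_license),
]


def assess(resource: Resource) -> Tuple[int, Dict[str, bool]]:
    """Run every check and return (score out of 100, per-check results)."""
    results = {name: check(resource) for name, check in CHECKS}
    score = round(100 * sum(results.values()) / len(results))
    return score, results


example = Resource(
    url="https://example.org/dataset/1",
    status_code=200,
    headers={"Content-Type": "text/html"},
    body="<html>no terms stated</html>",
)
score, results = assess(example)
print(score)  # prints 75: three of the four checks pass
```

Real tools run many more tests and weight them by metric, but the overall shape (one entry point, a battery of independent checks, an aggregate score) is the same.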

The researchers analyzed 10 widely adopted datasets managed in Europe and North America. The highest score from the automated analysis was 54 out of 100, indicating significant room for improvement in making these datasets more FAIR.

The manual analysis revealed that several FAIR metrics were satisfied, but not detectable by the automated tools. This was due to issues like missing metadata or non-standard data formats, which the automated tools were unable to interpret.
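The gap between the manual and automated results can be illustrated with a small Python sketch. It assumes, for illustration only, that an automated tool looks for a schema.org `Dataset` description embedded as JSON-LD; the `JsonLdExtractor` class, the `detected_license` function, and both sample pages are hypothetical. A license stated only in prose is obvious to a human reviewer but invisible to this kind of check.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed metadata is treated as absent


def detected_license(html):
    """Return the license from an embedded schema.org Dataset block, or None."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        if isinstance(block, dict) and block.get("@type") == "Dataset":
            return block.get("license")
    return None


# License stated only in human-readable text.
human_only = "<html><body><p>Data are released under CC BY 4.0.</p></body></html>"

# Same license, also declared as machine-readable metadata.
machine_readable = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Dataset", "name": "Example", '
    '"license": "https://creativecommons.org/licenses/by/4.0/"}'
    "</script></head><body><p>Data are released under CC BY 4.0.</p></body></html>"
)

print(detected_license(human_only))        # prints None
print(detected_license(machine_readable))  # prints the license URL
```

Both pages satisfy the reusability criterion for a person filling in a questionnaire, yet only the second would pass a metadata-based test of this kind.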

The paper presents detailed tables summarizing the outcomes, the issues identified, and suggestions for addressing them to enhance the findability, accessibility, interoperability, and reusability of the data.

Critical Analysis

The paper provides a comprehensive and rigorous assessment of the FAIR principles as applied to open data on hazardous substances. The combined use of manual and automated approaches offers a well-rounded evaluation, highlighting the strengths and limitations of each method.

One potential caveat is the relatively small sample size of 10 datasets analyzed. While these datasets are widely used, a larger and more diverse set of data sources could provide additional insights and a more representative view of FAIR compliance in this domain.

Additionally, the paper acknowledges that the manual analysis uncovered FAIR metrics that were not detected by the automated tools. This suggests that a combination of both approaches may be necessary to fully capture the FAIR-ness of a dataset, as automated tools may miss certain aspects that require human interpretation.

Further research could explore ways to enhance the interoperability and integration of these manual and automated assessment methods, potentially leading to more comprehensive and reliable FAIR evaluations. Investigating the underlying reasons for the non-standard metadata and data formats identified in the analysis could also help inform strategies for improving the overall FAIR-ness of the data.

Conclusion

This paper provides a valuable contribution to the ongoing efforts to assess and improve the FAIR principles in the context of open data on hazardous substances. The combined manual and automated approaches offer a multifaceted evaluation, revealing both the strengths and limitations of current FAIR compliance across several widely used datasets.

The findings highlight the need for continued efforts to enhance the findability, accessibility, interoperability, and reusability of this critical data, which can have significant implications for environmental protection, public health, and scientific research. The insights and recommendations provided in this paper can help guide future initiatives to address the identified issues and make this data more FAIR for the benefit of all stakeholders.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FAIR evaluation of ten widely used chemical datasets: Lessons learned and recommendations

Marcos Da Silveira, Oona Freudenthal, Louis Deladiennee

This document focuses on databases disseminating data on (hazardous) substances found on the North American and the European (EU) market. The goal is to analyse the FAIRness (Findability, Accessibility, Interoperability and Reusability) of published open data on these substances and to qualitatively evaluate to what extent the selected databases already fulfil the criteria set out in the Commission draft regulation on a common data chemicals platform. We implemented two complementary approaches: manual and automatic. The manual approach is based on online questionnaires. These questionnaires provide a structured approach to evaluating FAIRness by guiding users through a series of questions related to the FAIR principles. They are particularly useful for initiating discussions on FAIR implementation within research teams and for identifying areas that require further attention. Automated tools for FAIRness assessment, such as F-UJI and FAIR Checker, are gaining prominence and are continuously under development. Unlike manual tools, automated tools perform a series of tests automatically, starting from a dereferenceable URL to the data resource to be evaluated. We analysed ten widely adopted datasets managed in Europe and North America. The highest score from automatic analysis was 54/100. The manual analysis shows that several FAIR metrics were satisfied but not detectable by automatic tools, because there was no metadata or the information was not in a standard format and thus could not be interpreted by the tool. We present the details of the analysis and tables summarizing the outcomes, the issues, and the suggestions to address these issues.

7/23/2024

AutoFAIR : Automatic Data FAIRification via Machine Reading

Tingyan Ma, Wei Liu, Bin Lu, Xiaoying Gan, Yunqiang Zhu, Luoyi Fu, Chenghu Zhou

The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles emerge as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lacks efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automatically. First, we align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, we utilize Web Reader to automatically extract metadata based on language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment is employed to make metadata comply with FAIR principles by ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.

8/12/2024

FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for Large Language Models' Training?

Shaina Raza, Shardul Ghuge, Chen Ding, Elham Dolatabadi, Deval Pandya

The rapid evolution of Large Language Models (LLMs) highlights the necessity for ethical considerations and data integrity in AI development, particularly emphasizing the role of FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While these principles are crucial for ethical data stewardship, their specific application in the context of LLM training data remains an under-explored area. This research gap is the focus of our study, which begins with an examination of existing literature to underline the importance of FAIR principles in managing data for LLM training. Building upon this, we propose a novel framework designed to integrate FAIR principles into the LLM development lifecycle. A contribution of our work is the development of a comprehensive checklist intended to guide researchers and developers in applying FAIR data principles consistently across the model development process. The utility and effectiveness of our framework are validated through a case study on creating a FAIR-compliant dataset aimed at detecting and mitigating biases in LLMs. We present this framework to the community as a tool to foster the creation of technologically advanced, ethically grounded, and socially responsible AI models.

4/4/2024

Toward FAIR Semantic Publishing of Research Dataset Metadata in the Open Research Knowledge Graph

Raia Abu Ahmad, Jennifer D'Souza, Matthaus Zloch, Wolfgang Otto, Georg Rehm, Allard Oelen, Stefan Dietze, Soren Auer

Search engines these days can serve datasets as search results. Datasets get picked up by search technologies based on structured descriptions on their official web pages, informed by metadata ontologies such as the Dataset content type of schema.org. Despite this promotion of the content type dataset as a first-class citizen of search results, a vast proportion of datasets, particularly research datasets, still need to be made discoverable and, therefore, largely remain unused. This is due to the sheer volume of datasets released every day and the inability of metadata to reflect a dataset's content and context accurately. This work seeks to improve this situation for a specific class of datasets, namely research datasets, which are the result of research endeavors and are accompanied by a scholarly publication. We propose the ORKG-Dataset content type, a specialized branch of the Open Research Knowledge Graph (ORKG) platform, which provides descriptive information and a semantic model for research datasets, integrating them with their accompanying scholarly publications. This work aims to establish a standardized framework for recording and reporting research datasets within the ORKG-Dataset content type. This, in turn, increases research dataset transparency on the web for their improved discoverability and applied use. In this paper, we present a proposal -- the minimum FAIR, comparable, semantic description of research datasets in terms of salient properties of their supporting publication. We design a specific application of the ORKG-Dataset semantic model based on 40 diverse research datasets on scientific information extraction.

4/15/2024