[Citation needed] Data usage and citation practices in medical imaging conferences

Read original: arXiv:2402.03003 - Published 9/12/2024 by Théo Sourget, Ahmet Akkoç, Stinna Winther, Christine Lyngbye Galsgaard, Amelia Jiménez-Sánchez, Dovile Juodelyte, Caroline Petitjean, Veronika Cheplygina


Overview

Medical imaging research often focuses on methodology, but the quality and validity of the results depend heavily on the datasets used.
Creating datasets takes a lot of work, so researchers often rely on publicly available datasets.
However, there is no standard way to cite the datasets used in scientific papers, which makes their usage difficult to track.

Plain English Explanation

Papers on medical imaging usually focus on methods and algorithms, but the quality and reliability of their results depend heavily on the datasets used. Collecting and preparing medical imaging datasets takes a great deal of work, so researchers often turn to publicly available datasets instead of creating their own.

The problem is that there is no common way for researchers to cite the datasets they used in their scientific papers. This makes it hard to keep track of how these publicly available datasets are being used across the field. To address this issue, the researchers created two open-source tools that can help detect when datasets are used in papers. They applied these tools to study the usage of 20 publicly available medical datasets in papers from two major conferences, MICCAI and MIDL.

Technical Explanation

The researchers developed two tools to help track the usage of publicly available medical datasets in scientific papers:

A pipeline that uses the OpenAlex database and full-text analysis to detect when datasets are cited or mentioned in papers.
A PDF annotation software that they used to manually label whether datasets were present in the papers they studied.
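
The distinction the pipeline draws between a dataset being cited and being mentioned can be illustrated with a minimal sketch. This is not the authors' implementation: the `Dataset` and `Paper` classes, the alias list, the work IDs, and the `classify_presence` function are all hypothetical. The `referenced_works` attribute is named after the reference list field in OpenAlex work records, but here it is only a plain set filled in by hand, and no API call is made.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    aliases: list[str]        # hypothetical spellings of the dataset name
    reference_ids: set[str]   # hypothetical IDs of the paper(s) that introduce the dataset


@dataclass
class Paper:
    title: str
    full_text: str
    # IDs of the works in this paper's reference list (assumed already retrieved)
    referenced_works: set[str] = field(default_factory=set)


def is_mentioned(text: str, aliases: list[str]) -> bool:
    """Return True if any alias occurs in the text as a whole token (case-insensitive)."""
    for alias in aliases:
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


def classify_presence(paper: Paper, dataset: Dataset) -> str:
    """Label how a dataset appears in a paper: cited, mentioned, both, or neither."""
    cited = bool(paper.referenced_works & dataset.reference_ids)
    mentioned = is_mentioned(paper.full_text, dataset.aliases)
    if cited and mentioned:
        return "cited_and_mentioned"
    if cited:
        return "cited_only"
    if mentioned:
        return "mentioned_only"
    return "absent"


# Made-up example: the dataset name and IDs are placeholders, not from the study.
example = Dataset("ExampleCXR", ["ExampleCXR", "Example-CXR"], {"W0001"})
paper = Paper("Some paper", "We evaluate on Example-CXR images.", {"W0002"})
print(classify_presence(paper, example))
```

A simple alias match like this would likely miss abbreviations and misspellings and could produce false positives for common words, which is consistent with the summary's point that inconsistent practices make automated tracking difficult.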

They applied these tools to examine the usage of 20 different publicly available medical datasets in papers published at the MICCAI and MIDL conferences between 2013 and 2023. The researchers calculated the proportion and trends over time of three types of dataset presence in the papers: cited, mentioned in the full text, and both cited and mentioned.
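
As a rough illustration of how per-year proportions of each presence type could be computed, the sketch below aggregates (year, label) records. The label names, the `presence_proportions_by_year` function, and the choice of denominator (all labeled records in a given year) are assumptions made for this example, and the input records are invented rather than taken from the study.

```python
from collections import Counter, defaultdict

# Hypothetical label names for the presence types, plus "absent"
LABELS = ("cited_only", "mentioned_only", "cited_and_mentioned", "absent")


def presence_proportions_by_year(records):
    """Map each year to the share of records carrying each label.

    records: iterable of (year, label) pairs, one per paper-dataset check.
    """
    counts = defaultdict(Counter)
    for year, label in records:
        counts[year][label] += 1

    proportions = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        proportions[year] = {label: counts[year][label] / total for label in LABELS}
    return proportions


# Invented records, for illustration only
records = [
    (2013, "cited_only"),
    (2013, "absent"),
    (2013, "absent"),
    (2013, "cited_and_mentioned"),
    (2014, "mentioned_only"),
]
for year, shares in presence_proportions_by_year(records).items():
    print(year, shares)
```

Plotting these shares against the year would give the kind of trend over time described above.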

Their analysis revealed that a limited set of datasets tend to be heavily used, demonstrating a concentration of dataset usage. The researchers also observed inconsistent citing practices across papers, making it difficult to automate the tracking of dataset usage.

Critical Analysis

The researchers acknowledge that their study is limited to the datasets and papers they examined, and that there may be other publicly available medical datasets and citing practices not captured in their analysis.

Additionally, the manually annotated dataset used in their study could be subject to human error or bias. Automating the detection of dataset usage in papers remains challenging due to the lack of standardization in how datasets are cited.

Further research is needed to develop more robust and comprehensive methods for tracking the usage of publicly available datasets across the medical imaging field. Establishing best practices for dataset citation could greatly improve transparency and accountability in this area of research.

Conclusion

This study highlights the importance of dataset usage in medical imaging research and the challenges in tracking how publicly available datasets are being utilized. The researchers developed tools to analyze dataset citations and mentions, revealing a concentration of usage for a limited set of datasets and inconsistent citing practices.

While their findings are limited in scope, this work underscores the need for greater standardization and transparency around dataset usage in the field. Addressing these issues could lead to more robust and reliable medical imaging research, as well as better opportunities for collaboration and reproducibility.


Related Papers


[Citation needed] Data usage and citation practices in medical imaging conferences

Théo Sourget, Ahmet Akkoç, Stinna Winther, Christine Lyngbye Galsgaard, Amelia Jiménez-Sánchez, Dovile Juodelyte, Caroline Petitjean, Veronika Cheplygina

Medical imaging papers often focus on methodology, but the quality of the algorithms and the validity of the conclusions are highly dependent on the datasets used. As creating datasets requires a lot of effort, researchers often use publicly available datasets. There is, however, no adopted standard for citing the datasets used in scientific papers, leading to difficulty in tracking dataset usage. In this work, we present two open-source tools we created that could help with the detection of dataset usage: a pipeline (https://github.com/TheoSourget/Public_Medical_Datasets_References) using OpenAlex and full-text analysis, and a PDF annotation software (https://github.com/TheoSourget/pdf_annotator) used in our study to manually label the presence of datasets. We applied both tools on a study of the usage of 20 publicly available medical datasets in papers from MICCAI and MIDL. We compute the proportion and the evolution between 2013 and 2023 of 3 types of presence in a paper: cited, mentioned in the full text, cited and mentioned. Our findings demonstrate the concentration of the usage of a limited set of datasets. We also highlight different citing practices, making the automation of tracking difficult.

9/12/2024

Copycats: the many lives of a publicly available medical imaging dataset

Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Dovile Juodelyte, Théo Sourget, Caroline Vang-Larsen, Anna Rogers, Hubert Dariusz Zając, Veronika Cheplygina

Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper, we conduct an analysis of publicly available machine learning datasets on CCPs, discussing datasets' context, and identifying limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly in the potentially harmful downstream effects from poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.

6/11/2024


A PRISMA Driven Systematic Review of Publicly Available Datasets for Benchmark and Model Developments for Industrial Defect Detection

Can Akbas, Irem Su Arin, Sinan Onal

Recent advancements in quality control across various industries have increasingly utilized the integration of video cameras and image processing for effective defect detection. A critical barrier to progress is the scarcity of comprehensive datasets featuring annotated defects, which are essential for developing and refining automated defect detection models. This systematic review, spanning from 2015 to 2023, identifies 15 publicly available datasets and critically examines them to assess their effectiveness and applicability for benchmarking and model development. Our findings reveal a diverse landscape of datasets, such as NEU-CLS, NEU-DET, DAGM, KolektorSDD, PCB Defect Dataset, and the Hollow Cylindrical Defect Detection Dataset, each with unique strengths and limitations in terms of image quality, defect type representation, and real-world applicability. The goal of this systematic review is to consolidate these datasets in a single location, providing researchers who seek such publicly available resources with a comprehensive reference.

6/13/2024

Datasets of Visualization for Machine Learning

Can Liu, Ruike Jiang, Shaocong Tan, Jiacheng Yu, Chaofan Yang, Hanning Shao, Xiaoru Yuan

Datasets of visualization play a crucial role in automating data-driven visualization pipelines, serving as the foundation for supervised model training and algorithm benchmarking. In this paper, we survey the literature on visualization datasets and provide a comprehensive overview of existing visualization datasets, including their data types, formats, supported tasks, and openness. We propose a what-why-how model for visualization datasets, considering the content of the dataset (what), the supported tasks (why), and the dataset construction process (how). This model provides a clear understanding of the diversity and complexity of visualization datasets. Additionally, we highlight the challenges faced by existing visualization datasets, including the lack of standardization in data types and formats and the limited availability of large-scale datasets. To address these challenges, we suggest future research directions.

7/24/2024