Data Quality in Edge Machine Learning: A State-of-the-Art Survey

2406.02600

Published 6/6/2024 by Mohammed Djameleddine Belgoumri, Mohamed Reda Bouadjenek, Sunil Aryal, Hakim Hacid

Data Quality in Edge Machine Learning: A State-of-the-Art Survey

Abstract

Data-driven Artificial Intelligence (AI) systems trained using Machine Learning (ML) are shaping an ever-increasing (in size and importance) portion of our lives, including, but not limited to, recommendation systems, autonomous driving technologies, healthcare diagnostics, financial services, and personalized marketing. On the one hand, the outsized influence of these systems imposes a high standard of quality, particularly in the data used to train them. On the other hand, establishing and maintaining standards of Data Quality (DQ) becomes more challenging due to the proliferation of Edge Computing and Internet of Things devices, along with their increasing adoption for training and deploying ML models. The nature of the edge environment -- characterized by limited resources, decentralized data storage, and processing -- exacerbates data-related issues, making them more frequent, severe, and difficult to detect and mitigate. From these observations, it follows that DQ research for edge ML is a critical and urgent exploration track for the safety and robust usefulness of present and future AI systems. Despite this fact, DQ research for edge ML is still in its infancy. The literature on this subject remains fragmented and scattered across different research communities, with no comprehensive survey to date. Hence, this paper aims to fill this gap by providing a global view of the existing literature from multiple disciplines that can be grouped under the umbrella of DQ for edge ML. Specifically, we present a tentative definition of data quality in Edge computing, which we use to establish a set of DQ dimensions. We explore each dimension in detail, including existing solutions for mitigation.

Create account to get full access

Overview

This paper provides a comprehensive survey of the current state of research on data quality in edge machine learning, which refers to the deployment of AI models on resource-constrained edge devices like smartphones and IoT sensors.
The authors cover a wide range of topics, including data readiness for AI, AI competitions and dataset development, frameworks for enhancing data quality, uncertainty in machine learning, and data cleaning techniques.
The survey aims to provide researchers and practitioners with a comprehensive understanding of the key challenges and state-of-the-art solutions in ensuring data quality for edge machine learning applications.

Plain English Explanation

The paper discusses the important topic of data quality in edge machine learning, which refers to the use of AI models on small, low-power devices like smartphones and sensors. Ensuring high-quality data is crucial for these edge devices, as they often operate in unpredictable real-world environments and have limited computing resources.

The authors cover a range of relevant areas, including preparing data for AI systems, how researchers are developing new datasets and benchmarks, frameworks for improving data quality, dealing with uncertainty in machine learning models, and techniques for cleaning and preprocessing data. These are all critical considerations when deploying AI on the edge.

By providing a thorough overview of the current research landscape, the paper aims to help researchers and engineers better understand the key challenges and state-of-the-art solutions for ensuring data quality in edge machine learning applications. This is an important area of study as the use of AI on edge devices becomes more widespread, from smart home assistants to industrial sensors.

Technical Explanation

The paper begins by introducing the concept of edge machine learning, which involves deploying AI models on resource-constrained edge devices rather than in the cloud. The authors then provide a comprehensive survey of the current research on data quality challenges in this domain.

They cover topics such as data readiness for AI, exploring frameworks and methodologies for ensuring that data is properly prepared and curated for training machine learning models. The authors also discuss AI competitions and dataset development, reviewing how researchers are creating new benchmark datasets to drive progress in edge machine learning.

The survey then delves into frameworks for enhancing data quality, examining techniques like active learning, data augmentation, and federated learning that can help improve the quality of data used in edge ML models. The authors also cover uncertainty in machine learning, discussing how to quantify and manage the inherent uncertainty present in edge-deployed AI systems.

Finally, the paper reviews data cleaning techniques that can be used to identify and correct errors or anomalies in the data feeding into edge machine learning applications. This includes both traditional data cleaning methods as well as more advanced approaches like outlier detection and data imputation.

Critical Analysis

The paper provides a thorough and well-structured overview of the current state of research on data quality in edge machine learning. The authors have done an excellent job of covering a wide range of relevant topics and highlighting the key challenges and solutions in this rapidly evolving field.

One potential limitation of the survey is that it may not fully capture the most recent developments, as the paper was likely written before the latest advancements in the field. As the authors note, edge machine learning is a rapidly progressing area, and new techniques and frameworks are likely to emerge quickly.

Additionally, while the paper does a good job of summarizing the current research, it could have delved deeper into the technical details and experimental results of some of the more promising approaches. This would have provided readers with a more comprehensive understanding of the state-of-the-art solutions and their relative merits.

Overall, however, this survey is a valuable resource for researchers and practitioners working in the field of edge machine learning. By highlighting the critical importance of data quality and the various strategies for addressing it, the paper lays a solid foundation for further progress in this increasingly relevant domain.

Conclusion

This comprehensive survey on data quality in edge machine learning provides a thorough overview of the current state of research in this rapidly evolving field. The authors have done an excellent job of covering a wide range of relevant topics, from data readiness and dataset development to uncertainty management and data cleaning techniques.

By highlighting the key challenges and state-of-the-art solutions, the paper serves as an invaluable resource for researchers and practitioners working on deploying AI models on resource-constrained edge devices. As the use of edge machine learning continues to grow, ensuring high-quality data will be crucial for the reliable and robust performance of these systems in real-world applications.

The survey lays a strong foundation for further progress in this important area of study, and the authors' comprehensive coverage of the current research landscape will undoubtedly inspire and guide future work in the field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Data Readiness for AI: A 360-Degree Survey

Kaveen Hiniduma, Suren Byna, Jean Luca Bez

Data are the critical fuel for Artificial Intelligence (AI) models. Poor quality data produces inaccurate and ineffective AI models that may lead to incorrect or unsafe use. Checking for data readiness is a crucial step in improving data quality. Numerous R&D efforts have been spent on improving data quality. However, standardized metrics for evaluating data readiness for use in AI training are still evolving. In this study, we perform a comprehensive survey of metrics used for verifying AI's data readiness. This survey examines more than 120 papers that are published by ACM Digital Library, IEEE Xplore, other reputable journals, and articles published on the web by prominent AI experts. This survey aims to propose a taxonomy of data readiness for AI (DRAI) metrics for structured and unstructured datasets. We anticipate that this taxonomy can lead to new standards for DRAI metrics that would be used for enhancing the quality and accuracy of AI training and inference.

4/10/2024

cs.LG cs.AI

📊

Towards augmented data quality management: Automation of Data Quality Rule Definition in Data Warehouses

Heidi Carolina Tamm, Anastasija Nikiforova

In the contemporary data-driven landscape, ensuring data quality (DQ) is crucial for deriving actionable insights from vast data repositories. The objective of this study is to explore the potential for automating data quality management within data warehouses as data repository commonly used by large organizations. By conducting a systematic review of existing DQ tools available in the market and academic literature, the study assesses their capability to automatically detect and enforce data quality rules. The review encompassed 151 tools from various sources, revealing that most current tools focus on data cleansing and fixing in domain-specific databases rather than data warehouses. Only a limited number of tools, specifically ten, demonstrated the capability to detect DQ rules, not to mention implementing this in data warehouses. The findings underscore a significant gap in the market and academic research regarding AI-augmented DQ rule detection in data warehouses. This paper advocates for further development in this area to enhance the efficiency of DQ management processes, reduce human workload, and lower costs. The study highlights the necessity of advanced tools for automated DQ rule detection, paving the way for improved practices in data quality management tailored to data warehouse environments. The study can guide organizations in selecting data quality tool that would meet their requirements most.

6/18/2024

cs.DB cs.AI cs.CE cs.ET

AI Competitions and Benchmarks: Dataset Development

Romain Egele, Julio C. S. Jacques Junior, Jan N. van Rijn, Isabelle Guyon, Xavier Bar'o, Albert Clap'es, Prasanna Balaprakash, Sergio Escalera, Thomas Moeslund, Jun Wan

Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even in today's digital era, where substantial data is generated daily, it is uncommon for it to be readily usable; most often, it necessitates meticulous manual data preparation. The haste in developing new models can frequently result in various shortcomings, potentially posing risks when deployed in real-world scenarios (eg social discrimination, critical failures), leading to the failure or substantial escalation of costs in AI-based projects. This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience, in the development of datasets for machine learning. Initially, we develop the tasks involved in dataset development and offer insights into their effective management (including requirements, design, implementation, evaluation, distribution, and maintenance). Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation. Finally, we address practical considerations regarding dataset distribution and maintenance.

4/16/2024

cs.LG stat.ML

📊

AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration

Widad Elouataoui

The widespread adoption of big data has ushered in a new era of data-driven decision-making, transforming numerous industries and sectors. However, the efficacy of these decisions hinges on the quality of the underlying data. Poor data quality can result in inaccurate analyses and deceptive conclusions. Managing the vast volume, velocity, and variety of data sources presents significant challenges, heightening the importance of addressing big data quality issues. While there has been increased attention from both academia and industry, current approaches often lack comprehensiveness and universality. They tend to focus on limited metrics, neglecting other dimensions of data quality. Moreover, existing methods are often context-specific, limiting their applicability across different domains. There is a clear need for intelligent, automated approaches leveraging artificial intelligence (AI) for advanced data quality corrections. To bridge these gaps, this Ph.D. thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively. Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment. Secondly, we present a generic framework for detecting various quality anomalies using AI models. Thirdly, we propose an innovative framework for correcting detected anomalies through predictive modeling. Additionally, we address metadata quality enhancement within big data ecosystems. These frameworks are rigorously tested on diverse datasets, demonstrating their efficacy in improving big data quality. Finally, the thesis concludes with insights and suggestions for future research directions.

5/8/2024

cs.AI cs.DB