Generative Subspace Adversarial Active Learning for Outlier Detection in Multiple Views of High-dimensional Data

arXiv:2404.14451

Published 4/24/2024 by Jose Cribeiro-Ramallo, Vadim Arzamasov, Federico Matteucci, Denis Wambold, Klemens Böhm

Abstract

Outlier detection in high-dimensional tabular data is an important task in data mining, essential for many downstream tasks and applications. Existing unsupervised outlier detection algorithms face one or more problems, including inlier assumption (IA), curse of dimensionality (CD), and multiple views (MV). To address these issues, we introduce Generative Subspace Adversarial Active Learning (GSAAL), a novel approach that uses a Generative Adversarial Network with multiple adversaries. These adversaries learn the marginal class probability functions over different data subspaces, while a single generator in the full space models the entire distribution of the inlier class. GSAAL is specifically designed to address the MV limitation while also handling the IA and CD, being the only method to do so. We provide a comprehensive mathematical formulation of MV, convergence guarantees for the discriminators, and scalability results for GSAAL. Our extensive experiments demonstrate the effectiveness and scalability of GSAAL, highlighting its superior performance compared to other popular OD methods, especially in MV scenarios.

Overview

• This paper presents a novel framework called Generative Subspace Adversarial Active Learning (GSAAL) for outlier detection in high-dimensional data with multiple views.
• The approach combines generative adversarial networks (GANs) and active learning to efficiently identify outliers in complex, high-dimensional datasets.
• The method leverages the complementary strengths of multiple data views to improve outlier detection performance.

Plain English Explanation

Detecting outliers, or data points that are significantly different from the majority, is an important problem in many fields, such as fraud detection, image analysis, and medical diagnostics. However, this task can be challenging when dealing with high-dimensional data, where there are many features or characteristics for each data point.

The authors of this paper propose a new method called Generative Subspace Adversarial Active Learning (GSAAL) to address this challenge. GSAAL combines two powerful machine learning techniques: generative adversarial networks (GANs) and active learning.

GANs are a type of machine learning model that can generate new data that looks similar to the training data. In GSAAL, the authors use GANs to learn a generative model of the normal, or non-outlier, data. This allows them to identify outliers as data points that are significantly different from the generated normal data.
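For reference, the standard GAN training objective can be written as the minimax problem below. This is the general formulation from the original GAN literature, not a formula taken from the GSAAL paper. Here G is the generator, D is the discriminator, p_data is the distribution of the training data, and p_z is the noise prior fed to the generator. The abstract states that GSAAL uses several adversaries on different subspaces together with a single generator; how exactly that changes this objective is not stated in this summary.

```latex
\min_{G}\,\max_{D}\;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

At the optimum of this game, the generator's samples are distributed like the training data, which is why a discriminator's output can serve as a signal for how "normal" a point looks.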

Active learning is a technique where the machine learning model can actively request labels for specific data points it is uncertain about, rather than relying on a fixed labeled dataset. In GSAAL, the active learning component helps the model efficiently identify the most informative data points to label, which can lead to better outlier detection performance with fewer labeled examples.
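As a rough illustration of the general idea of picking "uncertain" points, here is a minimal sketch of uncertainty sampling, a common active learning heuristic. It is not taken from the GSAAL paper: the function name `select_most_uncertain` and the choice of 0.5 as the decision boundary are assumptions made for this example.

```python
import numpy as np

def select_most_uncertain(probs, k):
    """Return the indices of the k samples the model is least sure about.

    probs: predicted probability of the "normal" class for each sample.
    A probability near 0.5 means the model cannot decide, so those
    samples are treated as the most informative ones (an assumption of
    plain uncertainty sampling, not of GSAAL specifically).
    """
    probs = np.asarray(probs, dtype=float)
    distance_to_boundary = np.abs(probs - 0.5)
    # Stable sort keeps ties in their original order.
    order = np.argsort(distance_to_boundary, kind="stable")
    return order[:k]

if __name__ == "__main__":
    predicted = [0.95, 0.52, 0.10, 0.45, 0.70]
    print(select_most_uncertain(predicted, k=2))
```

In this toy input, the samples with probabilities 0.52 and 0.45 are closest to the boundary, so they are the ones selected.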

The key innovation in GSAAL is that it works with multiple "views" of the data, which the abstract describes as different subspaces (subsets of the features). An outlier may stand out in only some of these subspaces, so by combining the information from several views, GSAAL can identify outliers that a single view would miss.
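To make the multiple-views idea concrete, the sketch below builds a small synthetic dataset in which one point is unremarkable in every single feature and in the full feature space, yet clearly abnormal in one two-feature view. Everything here is an assumption for illustration: the data layout, the function names, and the use of a k-nearest-neighbour distance as a stand-in detector (GSAAL itself uses GAN discriminators, according to the abstract).

```python
import numpy as np

def knn_score(X, k=5):
    """Outlier score = distance to the k-th nearest neighbour (larger = more outlying)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting, column 0 is the zero distance of each point to itself.
    return np.sort(dist, axis=1)[:, k]

def rank_of_last(scores):
    """Rank of the last row's score among all rows (1 = most outlying)."""
    return 1 + int(np.sum(scores > scores[-1]))

def multiple_views_demo(n=300, d=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    # Hypothetical structure: for inliers, feature 1 closely tracks feature 0.
    X[:, 1] = X[:, 0] + rng.uniform(-0.05, 0.05, size=n)
    # The outlier breaks that relation, but each of its values is in the normal range.
    outlier = np.zeros(d)
    outlier[0], outlier[1] = 0.8, -0.8
    data = np.vstack([X, outlier])
    rank_full = rank_of_last(knn_score(data))             # all 20 features
    rank_sub = rank_of_last(knn_score(data[:, [0, 1]]))   # the 2-feature view
    return rank_full, rank_sub

if __name__ == "__main__":
    rank_full, rank_sub = multiple_views_demo()
    print("rank in full space:", rank_full)
    print("rank in view (0, 1):", rank_sub)
```

In the two-feature view the point is the top-ranked outlier, while in the full space the 18 unrelated features dilute the signal and it no longer ranks first. This is the kind of situation a detector restricted to a single view can miss.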

Technical Explanation

The GSAAL framework consists of three main components:

• Generative Subspace Learning: According to the abstract, a single generator operating in the full feature space models the distribution of the normal (inlier) data, while multiple adversaries (discriminators) each work on a different data subspace. Outliers are then identified as data points that differ significantly from the generated normal samples.
• Adversarial Active Learning: GSAAL uses an adversarial active learning strategy to efficiently select the most informative data points to label. This involves training discriminator networks to distinguish between normal and outlier data, and then using them to guide the active learning process.
• Multi-view Fusion: GSAAL combines the outlier scores from the individual data views using a weighted fusion strategy. This allows the method to use the complementary information in the different views to improve overall outlier detection performance.
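The fusion step described above can be sketched as a weighted average of per-view scores. This is only an illustration under stated assumptions: the function name `fuse_view_scores`, the min-max normalization, and the example weights are hypothetical and are not taken from the paper, which may combine scores differently.

```python
import numpy as np

def fuse_view_scores(view_scores, weights=None):
    """Combine per-view outlier scores into one score per sample.

    view_scores: array-like of shape (n_views, n_samples), higher = more outlying.
    weights: optional non-negative weights, one per view (uniform if None).
    """
    S = np.asarray(view_scores, dtype=float)
    # Min-max normalize each view so scores are comparable across views.
    lo = S.min(axis=1, keepdims=True)
    span = S.max(axis=1, keepdims=True) - lo
    span[span == 0] = 1.0  # constant view: avoid division by zero
    S_norm = (S - lo) / span
    if weights is None:
        weights = np.ones(S.shape[0])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # weights are rescaled to sum to 1
    return w @ S_norm

if __name__ == "__main__":
    scores = [[0.1, 0.2, 0.9],      # view 1 flags the third sample
              [10.0, 50.0, 30.0]]   # view 2 flags the second sample
    print(fuse_view_scores(scores))
    print(fuse_view_scores(scores, weights=[1, 3]))
```

With uniform weights the third sample gets the highest fused score. Giving the second view three times the weight shifts the top score to the second sample, which shows how the choice of weights decides which view dominates.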

The authors evaluate GSAAL on several high-dimensional datasets and report that it outperforms other popular outlier detection methods, especially in multiple-view scenarios. According to the abstract, they also provide a mathematical formulation of the multiple-views problem, convergence guarantees for the discriminators, and scalability results.

Critical Analysis

The authors acknowledge several limitations of their approach:

• The performance of GSAAL relies on the quality of the learned generative model, which can be hard to achieve in practice, especially for high-dimensional or complex data.
• The active learning component may be less effective if the views are highly correlated or if there are too few labeled examples to train the discriminator networks.
• The fusion strategy used to combine the outlier scores from different views may not be optimal, and more advanced techniques could improve performance further.

Additionally, while the authors demonstrate the effectiveness of GSAAL on several benchmark datasets, it would be valuable to see how the method performs on real-world, high-impact applications, such as fraud detection or medical anomaly identification, where the consequences of missed outliers can be severe.

Conclusion

This paper presents a novel framework called Generative Subspace Adversarial Active Learning (GSAAL) that combines GANs and active learning to detect outliers in high-dimensional data with multiple views. By combining complementary information from different data views, GSAAL is reported to outperform other popular outlier detection methods, especially in multiple-view scenarios.

The key innovations of GSAAL, namely the generative subspace learning and adversarial active learning components together with the multi-view fusion strategy, offer a promising approach to outlier detection in complex, high-dimensional datasets. Despite the limitations noted above, the GSAAL framework represents a meaningful step forward in this area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
