Rethinking Unsupervised Outlier Detection via Multiple Thresholding

Read original: arXiv:2407.05382 - Published 7/16/2024 by Zhonghang Liu, Panzhong Lu, Guoyang Xie, Zhichao Lu, Wen-Yan Lin

Overview

This paper proposes multiple thresholding (Multi-T), a module for unsupervised outlier detection that applies more than one threshold to outlier scores.
The key idea is to threshold the outlier scores of data points at several levels, rather than relying on a single threshold, to better capture the diverse nature of outliers.
The authors report that Multi-T significantly improves existing outlier scoring methods on benchmark datasets and lifts a naive distance-based method to state-of-the-art performance.

Plain English Explanation

When dealing with large datasets, it's common to encounter data points that don't fit the typical patterns or distributions of the majority of the data. These unusual data points are known as "outliers," and being able to identify them is important for tasks like fraud detection, anomaly identification, and data cleaning.

Existing unsupervised outlier detection methods often rely on a single threshold to determine whether a data point is an outlier or not. However, this can be problematic because different types of outliers may require different thresholds to be properly identified. This paper proposes a new approach that uses multiple thresholds to capture the diverse nature of outliers.

The key insight is that applying different thresholds to the outlier scores lets the method identify a wider range of outliers, including those a single threshold would miss. Adaptive thresholding appears in other settings as well: self-adaptive threshold pseudo-labeling is used to handle unreliable samples in semi-supervised image classification, and output thresholding via mixed integer linear programming has been applied to object detection.

By using multiple thresholds, the method can adapt to the specific characteristics of the data and identify a more comprehensive set of outliers. This can be particularly useful in applications where the outliers have diverse characteristics, such as in fraud detection or network anomaly monitoring.

Technical Explanation

The paper presents a novel unsupervised outlier detection method that uses multiple thresholds to identify outliers in data. The key steps of the method are as follows:

Outlier Scoring: The method first computes an outlier score for each data point using an existing unsupervised scoring algorithm, such as those described in "Quantile-based Maximum Likelihood Training for Outlier Detection" or "Enhancing 3D Object Detection by Using Neural Network".
Multiple Thresholding: Instead of using a single threshold to determine whether a data point is an outlier, the method applies multiple thresholds to the outlier scores. This allows the method to capture a wider range of outliers, as different types of outliers may require different thresholds to be properly identified.
Outlier Aggregation: The method then combines the per-threshold detection results into a final outlier score for each data point, using a weighted sum.
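
The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the toy distance-to-median scorer, the quantile-based thresholds, the equal weights, and the names `distance_scores` and `multi_threshold_score` are all assumptions made for the example.

```python
import numpy as np

def distance_scores(X):
    """Toy scorer (an assumption): Euclidean distance to the feature-wise median."""
    center = np.median(X, axis=0)
    return np.linalg.norm(X - center, axis=1)

def multi_threshold_score(scores, quantiles=(0.80, 0.90, 0.95), weights=None):
    """Combine binary decisions from several thresholds into one score in [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    # Assumption: thresholds are taken as quantiles of the score distribution.
    thresholds = np.quantile(scores, quantiles)
    if weights is None:
        weights = np.ones(len(thresholds))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the result stays in [0, 1]
    # votes[i, j] is 1.0 when point i scores above threshold j, else 0.0
    votes = (scores[:, None] > thresholds[None, :]).astype(float)
    return votes @ weights, thresholds

# Synthetic data: 200 Gaussian inliers plus 3 far-away points.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
X = np.vstack([inliers, outliers])

scores = distance_scores(X)                       # step 1: outlier scoring
agg, thresholds = multi_threshold_score(scores)   # steps 2 and 3
print("thresholds:", thresholds)
print("aggregated score of the last three points:", agg[-3:])
```

A point that exceeds every threshold receives an aggregated score of 1, while a point that exceeds only the lowest threshold receives that threshold's weight, so the aggregated score is graded rather than binary.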

The authors evaluate their method on several benchmark datasets and compare it to state-of-the-art unsupervised outlier detection techniques, such as Isolation Forest and One-Class SVM. The results show that the proposed method outperforms these existing techniques in terms of various outlier detection metrics.

Critical Analysis

The paper presents a promising approach to unsupervised outlier detection, but there are a few potential limitations and areas for further research:

Threshold Selection: The paper does not provide a systematic way to choose the multiple thresholds used in the method. The authors use a grid search to determine the optimal thresholds, but this can be computationally expensive and may not generalize well to other datasets. Further research could explore more efficient and adaptive ways to select the thresholds.
Interpretability: While the multiple thresholding approach can improve outlier detection performance, it may also reduce the interpretability of the results. Explaining why certain data points are identified as outliers may be more challenging with this method, which could be a concern in applications where interpretability is crucial, such as medical diagnostics or credit risk analysis.
Scalability: The paper evaluates the method on relatively small benchmark datasets. It's unclear how the method would scale to larger, more complex datasets common in real-world applications. Further research is needed to assess the scalability and computational efficiency of the proposed approach.
Sensitivity to Outlier Characteristics: The paper assumes that different types of outliers can be captured by different thresholds. However, it's possible that the method may not perform as well in situations where outliers have more complex or heterogeneous characteristics that cannot be easily separated by multiple thresholds.
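
To make the cost concern under Threshold Selection concrete, the sketch below counts how many threshold sets an exhaustive grid search would have to evaluate. It assumes, purely for illustration, that the thresholds are chosen as an increasing set from a shared grid of candidate values; the grid size of 50 and the name `grid_size` are hypothetical and not taken from the paper.

```python
from math import comb

def grid_size(num_candidates, num_thresholds):
    """Number of distinct increasing threshold sets drawn from a shared candidate grid."""
    return comb(num_candidates, num_thresholds)

# Assumed grid of 50 candidate threshold values.
for k in (1, 2, 3, 4):
    print(f"{k} threshold(s): {grid_size(50, k)} combinations")
```

With 50 candidates, the count rises from 50 for a single threshold to 1,225 for two and 230,300 for four, which is why an exhaustive search becomes expensive as thresholds are added.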

Overall, the paper presents a novel and promising approach to unsupervised outlier detection, but more research is needed to address the potential limitations and further improve the method.

Conclusion

This paper introduces a new unsupervised outlier detection method that uses multiple thresholds to identify a wider range of outliers in data. By applying different thresholds to the outlier scores of data points, the method can capture the diverse nature of outliers, which can be particularly useful in applications where outliers have heterogeneous characteristics.

The experimental results show that the proposed method outperforms state-of-the-art unsupervised outlier detection techniques on several benchmark datasets. The method has potential limitations, such as the lack of a systematic approach to threshold selection and a possible reduction in interpretability. Even so, it is a step forward in unsupervised outlier detection and could benefit a range of applications, from fraud detection to network anomaly monitoring.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Rethinking Unsupervised Outlier Detection via Multiple Thresholding

Zhonghang Liu, Panzhong Lu, Guoyang Xie, Zhichao Lu, Wen-Yan Lin

In the realm of unsupervised image outlier detection, assigning outlier scores holds greater significance than its subsequent task: thresholding for predicting labels. This is because determining the optimal threshold on non-separable outlier score functions is an ill-posed problem. However, the lack of predicted labels not only hinders some real applications of current outlier detectors but also causes these methods not to be enhanced by leveraging the dataset's self-supervision. To advance existing scoring methods, we propose a multiple thresholding (Multi-T) module. It generates two thresholds that isolate inliers and outliers from the unlabelled target dataset, whereas outliers are employed to obtain better feature representation while inliers provide an uncontaminated manifold. Extensive experiments verify that Multi-T can significantly improve proposed outlier scoring methods. Moreover, Multi-T contributes to a naive distance-based method being state-of-the-art.

7/16/2024

Outlier-Robust Geometric Perception: A Novel Thresholding-Based Estimator with Intra-Class Variance Maximization

Lei Sun

Geometric perception problems are fundamental tasks in robotics and computer vision. In real-world applications, they often encounter the inevitable issue of outliers, preventing traditional algorithms from making correct estimates. In this paper, we present a novel general-purpose robust estimator TIVM (Thresholding with Intra-class Variance Maximization) that can collaborate with standard non-minimal solvers to efficiently reject outliers for geometric perception problems. First, we introduce the technique of intra-class variance maximization to design a dynamic 2-group thresholding method on the measurement residuals, aiming to distinctively separate inliers from outliers. Then, we develop an iterative framework that robustly optimizes the model by approaching the pure-inlier group using a multi-layered dynamic thresholding strategy as subroutine, in which a self-adaptive mechanism for layer-number tuning is further employed to minimize the user-defined parameters. We validate the proposed estimator on 3 classic geometric perception problems: rotation averaging, point cloud registration and category-level perception, and experiments show that it is robust against 70--90% of outliers and can converge typically in only 3--15 iterations, much faster than state-of-the-art robust solvers such as RANSAC, GNC and ADAPT. Furthermore, another highlight is that: our estimator can retain approximately the same level of robustness even when the inlier-noise statistics of the problem are fully unknown.

7/2/2024

Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning

Jia-Hao Xiao, Ming-Kun Xie, Heng-Bo Fan, Gang Niu, Masashi Sugiyama, Sheng-Jun Huang

Semi-supervised multi-label learning (SSMLL) is a powerful framework for leveraging unlabeled data to reduce the expensive cost of collecting precise multi-label annotations. Unlike semi-supervised learning, one cannot select the most probable label as the pseudo-label in SSMLL due to multiple semantics contained in an instance. To solve this problem, the mainstream method developed an effective thresholding strategy to generate accurate pseudo-labels. Unfortunately, the method neglected the quality of model predictions and its potential impact on pseudo-labeling performance. In this paper, we propose a dual-perspective method to generate high-quality pseudo-labels. To improve the quality of model predictions, we perform dual-decoupling to boost the learning of correlative and discriminative features, while refining the generation and utilization of pseudo-labels. To obtain proper class-wise thresholds, we propose the metric-adaptive thresholding strategy to estimate the thresholds, which maximize the pseudo-label performance for a given metric on labeled data. Experiments on multiple benchmark datasets show the proposed method can achieve the state-of-the-art performance and outperform the comparative methods with a significant margin.

7/29/2024

Self Adaptive Threshold Pseudo-labeling and Unreliable Sample Contrastive Loss for Semi-supervised Image Classification

Xuerong Zhang, Li Huang, Jing Lv, Ming Yang

Semi-supervised learning is attracting growing attention, due to its success in exploiting unlabeled data. However, pseudo-labeling-based semi-supervised approaches suffer from two problems in image classification: (1) Existing methods might fail to adopt suitable thresholds since they either use a pre-defined/fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. (2) Discarding unlabeled data with confidence below the thresholds results in the loss of discriminating information. To solve these issues, we develop an effective method to make sufficient use of unlabeled data. Specifically, we design a self adaptive threshold pseudo-labeling strategy, in which the thresholds for each class can be dynamically adjusted to increase the number of reliable samples. Meanwhile, in order to effectively utilise unlabeled data with confidence below the thresholds, we propose an unreliable sample contrastive loss to mine the discriminative information in low-confidence samples by learning the similarities and differences between sample features. We evaluate our method on several classification benchmarks under partially labeled settings and demonstrate its superiority over the other approaches.

7/8/2024