Tabular Data Contrastive Learning via Class-Conditioned and Feature-Correlation Based Augmentation

Read original: arXiv:2404.17489 - Published 5/1/2024 by Wei Cui, Rasa Hosseinzadeh, Junwei Ma, Tongzi Wu, Yi Sui, Keyvan Golestan


Overview

Contrastive learning is a technique for pre-training machine learning models by creating similar views of the original data and encouraging the model to learn representations that bring the data and its corresponding views closer together in the embedding space.
This technique has been successful in domains like image processing and natural language processing, thanks to effective domain-specific data augmentation techniques.
However, in tabular data settings, the predominant augmentation technique of randomly swapping values has not been as effective.
The paper proposes a simple yet powerful improvement to this tabular data augmentation technique by conditioning the corruption on class identity.

Plain English Explanation

Contrastive learning is a way to pre-train machine learning models by first creating similar versions, or "views," of the original data. The model is then encouraged to learn representations that bring the original data and its corresponding views closer together in the embedding space. This technique has been very successful in areas like image processing and natural language processing, where the augmentation techniques used to create the different views are intuitive and effective.
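To make "closer together in the embedding space" concrete, here is a minimal sketch of an InfoNCE-style contrastive loss, a common choice for this kind of pre-training. This summary does not state which loss the paper uses, so the loss form, the function names, and the temperature value are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, views, temperature=0.5):
    """Mean InfoNCE loss over a batch.

    views[i] is treated as the positive for anchors[i]; every other view in
    the batch acts as a negative. The temperature is an illustrative default.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine_similarity(anchor, view) / temperature for view in views]
        # Numerically stable log-sum-exp over all candidate views.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: each aligned view sits near its own anchor,
# while the swapped views sit near the other anchor.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_views = [[0.9, 0.1], [0.1, 0.9]]
swapped_views = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(anchors, aligned_views)
loss_swapped = info_nce_loss(anchors, swapped_views)
```

The loss is lower when each view lies near its own anchor than when the views are swapped, which is the behavior the pre-training objective rewards.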

However, when it comes to tabular data, the main way of creating these different views has been to corrupt a row by swapping some of its values with values drawn at random from the same columns elsewhere in the table. This approach is less effective than the augmentations available for images and text. The paper proposes a simple but powerful improvement to this tabular data augmentation technique. Instead of sampling replacement values from the whole table, the authors suggest corrupting the tabular entries based on the class identity of the data.
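As a point of reference, the conventional corruption baseline might look like the sketch below. The function name, the `corruption_rate` parameter, and the toy table are hypothetical; the sketch only illustrates sampling replacement values uniformly from the same column over the whole table.

```python
import random


def corrupt_row_uniform(table, anchor_idx, corruption_rate=0.5, rng=None):
    """Return a corrupted view of table[anchor_idx].

    A random subset of features is replaced by values taken from the same
    column of rows sampled uniformly from the entire table.
    """
    rng = rng or random.Random()
    n_features = len(table[anchor_idx])
    n_corrupt = round(corruption_rate * n_features)
    view = list(table[anchor_idx])  # copy, so the original row is untouched
    for j in rng.sample(range(n_features), n_corrupt):
        donor = rng.randrange(len(table))  # any row, regardless of class
        view[j] = table[donor][j]
    return view


# Toy table with mixed feature types (illustrative values only).
table = [
    [5.1, "red", 0],
    [4.9, "red", 1],
    [7.0, "blue", 0],
    [6.4, "blue", 1],
]
view = corrupt_row_uniform(table, anchor_idx=0, rng=random.Random(0))
```

Because donors are drawn from every row, a view can mix values from rows of different classes, which is the weakness the class-conditioned variant targets.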

Specifically, when corrupting a value in a row, they only sample replacement values from rows that are in the same class as the original row. This helps preserve the underlying structure and relationships in the data. The authors also explore using feature correlation structures to decide which features to corrupt.
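The class-conditioned idea described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the `corruption_rate` parameter, and the toy data are hypothetical, and the features to corrupt are picked at random here.

```python
import random


def corrupt_row_class_conditioned(table, classes, anchor_idx,
                                  corruption_rate=0.5, rng=None):
    """Return a corrupted view of table[anchor_idx].

    Replacement values come only from rows whose class (true label or
    pseudo-label) matches the class of the anchor row.
    """
    rng = rng or random.Random()
    anchor_class = classes[anchor_idx]
    # The anchor itself is included, so this list is never empty.
    same_class_rows = [i for i, c in enumerate(classes) if c == anchor_class]
    n_features = len(table[anchor_idx])
    n_corrupt = round(corruption_rate * n_features)
    view = list(table[anchor_idx])
    for j in rng.sample(range(n_features), n_corrupt):
        donor = rng.choice(same_class_rows)
        view[j] = table[donor][j]
    return view


# Toy table: class 0 rows hold positive values, class 1 rows negative ones,
# so any cross-class contamination would be easy to spot.
table = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [-1.0, -10.0, -100.0],
    [-2.0, -20.0, -200.0],
]
classes = [0, 0, 1, 1]
view = corrupt_row_class_conditioned(table, classes, anchor_idx=0,
                                     rng=random.Random(0))
```

In this toy setup every value in a view of a class 0 row stays positive, since donors are restricted to class 0.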

This new approach consistently outperforms the conventional random corruption method for tabular data classification tasks, according to the experiments described in the paper.

Technical Explanation

The paper proposes a novel data augmentation technique for contrastive learning on tabular data. In contrast to common image and natural language data augmentation methods that leverage domain-specific intuitions, tabular data augmentation has relied on simpler techniques like randomly swapping values between rows.

The authors argue that this conventional corruption method is not as sound or effective for tabular data. Instead, they introduce a class-conditional corruption approach, where the corrupted value is sampled from rows within the same class as the anchor row, rather than from the entire table uniformly.

To enable this class-conditional corruption, the authors adopt a semi-supervised learning setting and use pseudo-labeling to obtain class identities for all table rows. They also explore the idea of feature selection for corruption, based on feature correlation structures.
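The pseudo-labeling step can be pictured with the sketch below. This summary does not say which classifier the authors use to produce pseudo-labels, so a nearest-centroid rule stands in here purely as an assumption; all function names and data are hypothetical.

```python
import math


def class_centroids(rows, labels):
    """Mean feature vector per class, computed from the labeled rows."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def pseudo_label(labeled_rows, labels, unlabeled_rows):
    """Assign each unlabeled row the class of its nearest centroid.

    Nearest-centroid is a stand-in classifier chosen for brevity (assumption).
    """
    centroids = class_centroids(labeled_rows, labels)
    return [
        min(centroids, key=lambda label: math.dist(row, centroids[label]))
        for row in unlabeled_rows
    ]


# Semi-supervised toy setting: a few labeled rows, the rest unlabeled.
labeled_rows = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.8, 5.2]]
labels = ["a", "a", "b", "b"]
unlabeled_rows = [[0.1, 0.3], [5.2, 4.9]]

pseudo_labels = pseudo_label(labeled_rows, labels, unlabeled_rows)
# Class identities for every row in the table, ready for class-conditioned corruption.
all_classes = labels + pseudo_labels
```

Any classifier trained on the labeled subset could replace the nearest-centroid rule; what matters is that every row ends up with a class identity.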

Extensive experiments on tabular data classification tasks show that the proposed class-conditional corruption approach consistently outperforms the conventional random corruption method. The authors make their code available at https://github.com/willtop/Tabular-Class-Conditioned-SSL.

Critical Analysis

The key innovation proposed in this paper is the class-conditional corruption approach for tabular data augmentation in contrastive learning. This is a simple yet powerful idea that addresses the limitations of the commonly used random corruption method.

One potential limitation of the approach is the reliance on pseudo-labeling to obtain class identities, which could be sensitive to the quality of the pseudo-labels. The authors do not provide a detailed analysis of the impact of pseudo-label quality on the final performance.

Additionally, the paper does not explore the interaction between feature selection for corruption and the class-conditional corruption technique. It would be interesting to see how these two components work together and whether there are any synergies or trade-offs to be considered.
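For readers who want to experiment with this interaction, one way to choose features by correlation structure is sketched below. The summary does not specify how the paper measures feature correlation or which features it prefers to corrupt, so the mean absolute Pearson correlation and the `most_correlated` switch are assumptions, and all names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def rank_features_by_correlation(table):
    """Feature indices sorted by mean |correlation| with the other features,
    from most to least correlated. Assumes numeric features, at least two."""
    n_features = len(table[0])
    columns = [[row[j] for row in table] for j in range(n_features)]
    scores = []
    for j in range(n_features):
        others = [abs(pearson(columns[j], columns[k]))
                  for k in range(n_features) if k != j]
        scores.append(sum(others) / len(others))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def select_features_to_corrupt(table, n_corrupt, most_correlated=True):
    """Pick n_corrupt features from one end of the correlation ranking."""
    ranking = rank_features_by_correlation(table)
    if most_correlated:
        return ranking[:n_corrupt]
    return ranking[len(ranking) - n_corrupt:]


# Toy table: features 0 and 1 are perfectly correlated, feature 2 only weakly.
table = [
    [1.0, 2.0, 0.3],
    [2.0, 4.0, -0.1],
    [3.0, 6.0, 0.4],
    [4.0, 8.0, -0.2],
]
strongly_linked = select_features_to_corrupt(table, n_corrupt=2)
weakly_linked = select_features_to_corrupt(table, n_corrupt=1, most_correlated=False)
```

The selected indices could replace the randomly chosen feature subset in a corruption routine, which makes it possible to compare uniform and class-conditioned sampling under the same feature-selection rule.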

Further research could investigate whether this approach generalizes to other semi-supervised or self-supervised learning settings beyond contrastive learning, and how it relates to other lines of augmentation work, such as group-wise prompting for synthetic tabular data generation or implicit adversarial data augmentation.

Overall, the proposed class-conditional corruption technique is a promising approach that could help advance the state of the art in tabular data representation learning. The authors have made their code available, which should facilitate further research and exploration in this direction.

Conclusion

This paper introduces a simple yet powerful improvement to the data augmentation technique used in contrastive learning for tabular data. By conditioning the corruption of tabular entries on class identity, the authors are able to outperform the conventional random corruption method across a range of tabular data classification tasks.

The class-conditional corruption approach leverages semi-supervised learning and pseudo-labeling to enable this more effective data augmentation. The authors also explore the idea of feature selection for corruption based on feature correlation structures.

The results demonstrate the potential of this new augmentation technique to advance the state of the art in tabular data representation learning. While the approach has some limitations, such as the reliance on pseudo-labels, it opens up exciting avenues for further research in this area, including exploring the interaction with other semi-supervised and self-supervised learning methods.



Related Papers


Tabular Data Contrastive Learning via Class-Conditioned and Feature-Correlation Based Augmentation

Wei Cui, Rasa Hosseinzadeh, Junwei Ma, Tongzi Wu, Yi Sui, Keyvan Golestan

Contrastive learning is a model pre-training technique by first creating similar views of the original data, and then encouraging the data and its corresponding views to be close in the embedding space. Contrastive learning has witnessed success in image and natural language data, thanks to the domain-specific augmentation techniques that are both intuitive and effective. Nonetheless, in tabular domain, the predominant augmentation technique for creating views is through corrupting tabular entries via swapping values, which is not as sound or effective. We propose a simple yet powerful improvement to this augmentation technique: corrupting tabular data conditioned on class identity. Specifically, when corrupting a specific tabular entry from an anchor row, instead of randomly sampling a value in the same feature column from the entire table uniformly, we only sample from rows that are identified to be within the same class as the anchor row. We assume the semi-supervised learning setting, and adopt the pseudo labeling technique for obtaining class identities over all table rows. We also explore the novel idea of selecting features to be corrupted based on feature correlation structures. Extensive experiments show that the proposed approach consistently outperforms the conventional corruption method for tabular data classification tasks. Our code is available at https://github.com/willtop/Tabular-Class-Conditioned-SSL.

5/1/2024

PairCFR: Enhancing Model Training on Paired Counterfactually Augmented Data through Contrastive Learning

Xiaoqi Qiu, Yongjie Wang, Xu Guo, Zhiwei Zeng, Yue Yu, Yuhong Feng, Chunyan Miao

Counterfactually Augmented Data (CAD) involves creating new data samples by applying minimal yet sufficient modifications to flip the label of existing data samples to other classes. Training with CAD enhances model robustness against spurious features that happen to correlate with labels by spreading the causal relationships across different classes. Yet, recent research reveals that training with CAD may lead models to overly focus on modified features while ignoring other important contextual information, inadvertently introducing biases that may impair performance on out-of-distribution (OOD) datasets. To mitigate this issue, we employ contrastive learning to promote global feature alignment in addition to learning counterfactual clues. We theoretically prove that contrastive loss can encourage models to leverage a broader range of features beyond those modified ones. Comprehensive experiments on two human-edited CAD datasets demonstrate that our proposed method outperforms the state-of-the-art on OOD datasets.

6/12/2024


Slight Corruption in Pre-training Data Makes Better Diffusion Models

Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj

Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth of the data distribution generated by the corruptly trained DMs. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.

6/3/2024

Class-aware and Augmentation-free Contrastive Learning from Label Proportion

Jialiang Wang, Ning Zhang, Shimin Di, Ruidong Wang, Lei Chen

Learning from Label Proportion (LLP) is a weakly supervised learning scenario in which training data is organized into predefined bags of instances, disclosing only the class label proportions per bag. This paradigm is essential for user modeling and personalization, where user privacy is paramount, offering insights into user preferences without revealing individual data. LLP faces a unique difficulty: the misalignment between bag-level supervision and the objective of instance-level prediction, primarily due to the inherent ambiguity in label proportion matching. Previous studies have demonstrated deep representation learning can generate auxiliary signals to promote the supervision level in the image domain. However, applying these techniques to tabular data presents significant challenges: 1) they rely heavily on label-invariant augmentation to establish multi-view, which is not feasible with the heterogeneous nature of tabular datasets, and 2) tabular datasets often lack sufficient semantics for perfect class distinction, making them prone to suboptimality caused by the inherent ambiguity of label proportion matching. To address these challenges, we propose an augmentation-free contrastive framework TabLLP-BDC that introduces class-aware supervision (explicitly aware of class differences) at the instance level. Our solution features a two-stage Bag Difference Contrastive (BDC) learning mechanism that establishes robust class-aware instance-level supervision by disassembling the nuance between bag label proportions, without relying on augmentations. Concurrently, our model presents a pioneering multi-task pretraining pipeline tailored for tabular-based LLP, capturing intrinsic tabular feature correlations in alignment with label proportion distribution. Extensive experiments demonstrate that TabLLP-BDC achieves state-of-the-art performance for LLP in the tabular domain.

8/14/2024