ExcelFormer: Can a DNN be a Sure Bet for Tabular Prediction?

2301.02819

Published 5/27/2024 by Jintai Chen, Jiahuan Yan, Qiyuan Chen, Danny Ziyi Chen, Jian Wu, Jimeng Sun

Abstract

Data organized in tabular format is ubiquitous in real-world applications, and users often craft tables with biased feature definitions and flexibly set prediction targets of their interests. Thus, a rapid development of a robust, effective, dataset-versatile, user-friendly tabular prediction approach is highly desired. While Gradient Boosting Decision Trees (GBDTs) and existing deep neural networks (DNNs) have been extensively utilized by professional users, they present several challenges for casual users, particularly: (i) the dilemma of model selection due to their different dataset preferences, and (ii) the need for heavy hyperparameter searching, failing which their performances are deemed inadequate. In this paper, we delve into this question: Can we develop a deep learning model that serves as a sure bet solution for a wide range of tabular prediction tasks, while also being user-friendly for casual users? We delve into three key drawbacks of deep tabular models, encompassing: (P1) lack of rotational variance property, (P2) large data demand, and (P3) over-smooth solution. We propose ExcelFormer, addressing these challenges through a semi-permeable attention module that effectively constrains the influence of less informative features to break the DNNs' rotational invariance property (for P1), data augmentation approaches tailored for tabular data (for P2), and attentive feedforward network to boost the model fitting capability (for P3). These designs collectively make ExcelFormer a sure bet solution for diverse tabular datasets. Extensive and stratified experiments conducted on real-world datasets demonstrate that our model outperforms previous approaches across diverse tabular data prediction tasks, and this framework can be friendly to casual users, offering ease of use without the heavy hyperparameter tuning.

Overview

Tabular data is ubiquitous in real-world applications, but users often build tables with biased feature definitions and set prediction targets to suit their own interests.
Existing models such as Gradient Boosting Decision Trees (GBDTs) and deep neural networks (DNNs) pose two challenges for casual users: choosing a model, since each family favors different datasets, and heavy hyperparameter tuning.
The paper proposes "ExcelFormer," a deep learning model intended as a versatile, user-friendly "sure bet" for tabular prediction tasks.

Plain English Explanation

Tables of data are extremely common in the real world, and people often define the columns in biased ways or pick prediction goals that suit their own interests. Powerful machine learning models such as gradient boosted decision trees and deep neural networks are widely used by experts, but they present challenges for more casual users.

These challenges include difficulties in selecting the right model for a particular dataset, as well as the need to heavily tune the model's hyperparameters (the settings that control how the model behaves) in order to get good performance. If users don't put in the time and effort to tune the hyperparameters properly, the model's performance can be inadequate.

To address these issues, the researchers developed a new deep learning model called "ExcelFormer." This model aims to be a versatile and user-friendly solution that can work well across a wide range of tabular prediction tasks, without requiring the same level of expertise and hyperparameter tuning.

Technical Explanation

The key technical contributions of the paper are:

Semi-permeable Attention Module: This module constrains the influence of less informative features. Doing so breaks the "rotational invariance" property of deep neural networks, which otherwise limits how effectively they use the information in tabular datasets.
Tabular Data Augmentation: The researchers developed data augmentation techniques specifically tailored for tabular data, which can help the model perform well even with limited training data.
Attentive Feedforward Network: This component boosts the model's ability to fit the patterns in the data, addressing the tendency of deep models to produce "over-smooth" solutions.
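
As a rough illustration of the semi-permeable attention idea in the list above, here is a minimal NumPy sketch. This summary does not spell out the masking rule, so the rule used here is an assumption: a feature may attend only to itself and to features with an equal or higher importance score, which keeps less informative features from influencing more informative ones. The function names and the importance scores are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def semi_permeable_mask(importance):
    """Return a boolean matrix where entry [i, j] says feature i may attend to feature j.

    Assumed rule: feature i may attend to feature j only if j is at least as
    informative as i. The diagonal is always allowed.
    """
    importance = np.asarray(importance, dtype=float)
    return importance[None, :] >= importance[:, None]

def masked_attention(q, k, v, allowed):
    """Scaled dot-product attention over feature tokens with blocked pairs."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Blocked pairs get -inf so they receive exactly zero weight after softmax.
    scores = np.where(allowed, scores, -np.inf)
    # Numerically stable softmax; each row has a finite entry on the diagonal.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Three feature tokens with hypothetical importance scores.
rng = np.random.default_rng(0)
importance = np.array([0.1, 0.7, 0.3])
tokens = rng.normal(size=(3, 4))

allowed = semi_permeable_mask(importance)
out, weights = masked_attention(tokens, tokens, tokens, allowed)
```

In this toy setup the most informative feature (index 1) attends only to itself, so its output row equals its input token, while the least informative feature (index 0) can draw on all three tokens.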

The researchers conducted extensive experiments on real-world datasets and found that their ExcelFormer model outperformed previous approaches across a variety of tabular prediction tasks. Importantly, they also demonstrated that ExcelFormer can be more user-friendly for casual users, as it does not require the same level of hyperparameter tuning as other models.
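
The tabular data augmentation contribution described above can be pictured with a generic mixup-style sketch. This summary does not describe the paper's actual augmentation schemes, so the code below is only an assumed stand-in: each row swaps a random subset of its feature values with a randomly chosen partner row, and the label is blended in proportion to the fraction of features kept. The name `feature_swap_mix` and the `swap_prob` parameter are hypothetical.

```python
import numpy as np

def feature_swap_mix(x, y, rng, swap_prob=0.3):
    """Create augmented rows by swapping feature values between row pairs.

    x: (n_rows, n_features) numeric array; y: (n_rows,) numeric targets.
    Assumed label rule: blend the two labels by the fraction of features kept.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pick one partner row for every row.
    partner = rng.permutation(len(x))
    # Decide independently for each cell whether to take the partner's value.
    swap = rng.random(x.shape) < swap_prob
    x_mix = np.where(swap, x[partner], x)
    # Fraction of each row's features that still come from the original row.
    keep_frac = 1.0 - swap.mean(axis=1)
    y_mix = keep_frac * y + (1.0 - keep_frac) * y[partner]
    return x_mix, y_mix

# Hypothetical toy data: 4 rows, 3 features, binary targets.
x = np.arange(12, dtype=float).reshape(4, 3)
y = np.array([0.0, 1.0, 0.0, 1.0])
x_aug, y_aug = feature_swap_mix(x, y, np.random.default_rng(0), swap_prob=0.5)
```

Because values are only ever swapped within a column, every augmented cell is a value that already occurs in that column, which avoids interpolating between categorical codes.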

Critical Analysis

The paper presents a compelling solution to the challenges faced by casual users when working with tabular prediction tasks. The researchers have identified key issues with existing models and have designed ExcelFormer to address them.

One potential limitation of the study is the specific datasets used for evaluation. While the researchers claim that the datasets cover a "diverse" range of tabular prediction tasks, it would be valuable to see how ExcelFormer performs on an even wider variety of real-world tabular datasets, including those with unique characteristics or domain-specific features.

Additionally, the paper does not provide much insight into the computational efficiency of ExcelFormer compared to other models. This could be an important consideration, especially for casual users who may have limited computational resources.

Overall, the ExcelFormer approach is a promising step towards making tabular prediction more accessible and user-friendly, and the researchers have presented a thoughtful and well-designed solution. Further research and validation on a broader range of datasets could help strengthen the case for adopting ExcelFormer in real-world applications.

Conclusion

This paper introduces ExcelFormer, a deep learning model designed to be a versatile and user-friendly solution for a wide range of tabular prediction tasks. By addressing key challenges with existing models, such as rotational invariance, data demand, and over-smoothing, the researchers have created a model that can perform well across diverse datasets without requiring extensive hyperparameter tuning.

The technical innovations, including the semi-permeable attention module, tabular data augmentation, and attentive feedforward network, demonstrate the researchers' thoughtful approach to improving the state-of-the-art in tabular prediction. While further validation on a broader range of datasets could strengthen the case for ExcelFormer, this work represents an important step towards making advanced machine learning more accessible to casual users working with tabular data.

Related Papers

Challenging Gradient Boosted Decision Trees with Tabular Transformers for Fraud Detection at Booking.com

Sergei Krutikov (Booking.com), Bulat Khaertdinov (Maastricht University), Rodion Kiriukhin (Booking.com), Shubham Agrawal (Booking.com), Kees Jan De Vries (Booking.com)

Transformer-based neural networks, empowered by Self-Supervised Learning (SSL), have demonstrated unprecedented performance across various domains. However, related literature suggests that tabular Transformers may struggle to outperform classical Machine Learning algorithms, such as Gradient Boosted Decision Trees (GBDT). In this paper, we aim to challenge GBDTs with tabular Transformers on a typical task faced in e-commerce, namely fraud detection. Our study is additionally motivated by the problem of selection bias, often occurring in real-life fraud detection systems. It is caused by the production system affecting which subset of traffic becomes labeled. This issue is typically addressed by sampling randomly a small part of the whole production data, referred to as a Control Group. This subset follows a target distribution of production data and therefore is usually preferred for training classification models with standard ML algorithms. Our methodology leverages the capabilities of Transformers to learn transferable representations using all available data by means of SSL, giving it an advantage over classical methods. Furthermore, we conduct large-scale experiments, pre-training tabular Transformers on vast amounts of data instances and fine-tuning them on smaller target datasets. The proposed approach outperforms heavily tuned GBDTs by a considerable margin of the Average Precision (AP) score. Pre-trained models show more consistent performance than the ones trained from scratch when fine-tuning data is limited. Moreover, they require noticeably less labeled data for reaching performance comparable to their GBDT competitor that utilizes the whole dataset.

5/24/2024

cs.LG

Interpretable Graph Neural Networks for Tabular Data

Amr Alkhatib, Sofiane Ennadir, Henrik Bostrom, Michalis Vazirgiannis

Data in tabular format is frequently occurring in real-world applications. Graph Neural Networks (GNNs) have recently been extended to effectively handle such data, allowing feature interactions to be captured through representation learning. However, these approaches essentially produce black-box models, in the form of deep neural networks, precluding users from following the logic behind the model predictions. We propose an approach, called IGNNet (Interpretable Graph Neural Network for tabular data), which constrains the learning algorithm to produce an interpretable model, where the model shows how the predictions are exactly computed from the original input features. A large-scale empirical investigation is presented, showing that IGNNet is performing on par with state-of-the-art machine-learning algorithms that target tabular data, including XGBoost, Random Forests, and TabNet. At the same time, the results show that the explanations obtained from IGNNet are aligned with the true Shapley values of the features without incurring any additional computational overhead.

4/22/2024

cs.LG cs.AI

BUFF: Boosted Decision Tree based Ultra-Fast Flow matching

Cheng Jiang, Sitian Qian, Huilin Qu

Tabular data stands out as one of the most frequently encountered types in high energy physics. Unlike commonly homogeneous data such as pixelated images, simulating high-dimensional tabular data and accurately capturing their correlations are often quite challenging, even with the most advanced architectures. Based on the findings that tree-based models surpass the performance of deep learning models for tasks specific to tabular data, we adopt the very recent generative modeling class named conditional flow matching and employ different techniques to integrate the usage of Gradient Boosted Trees. The performances are evaluated for various tasks on different analysis level with several public datasets. We demonstrate the training and inference time of most high-level simulation tasks can achieve speedup by orders of magnitude. The application can be extended to low-level feature simulation and conditioned generations with competitive performance.

4/30/2024

cs.LG

Cross-Table Pretraining towards a Universal Function Space for Heterogeneous Tabular Data

Jintai Chen, Zhen Lin, Qiyuan Chen, Jimeng Sun

Tabular data from different tables exhibit significant diversity due to varied definitions and types of features, as well as complex inter-feature and feature-target relationships. Cross-dataset pretraining, which learns reusable patterns from upstream data to support downstream tasks, have shown notable success in various fields. Yet, when applied to tabular data prediction, this paradigm faces challenges due to the limited reusable patterns among diverse tabular datasets (tables) and the general scarcity of tabular data available for fine-tuning. In this study, we fill this gap by introducing a cross-table pretrained Transformer, XTFormer, for versatile downstream tabular prediction tasks. Our methodology insight is pretraining XTFormer to establish a meta-function space that encompasses all potential feature-target mappings. In pre-training, a variety of potential mappings are extracted from pre-training tabular datasets and are embedded into the meta-function space, and suited mappings are extracted from the meta-function space for downstream tasks by a specified coordinate positioning approach. Experiments show that, in 190 downstream tabular prediction tasks, our cross-table pretrained XTFormer wins both XGBoost and Catboost on 137 (72%) tasks, and surpasses representative deep learning models FT-Transformer and the tabular pre-training approach XTab on 144 (76%) and 162 (85%) tasks.

6/4/2024

cs.LG cs.AI