A Closer Look at Deep Learning on Tabular Data

2407.00956

Published 7/2/2024 by Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, De-Chuan Zhan

Abstract

Tabular data is prevalent across various domains in machine learning. Although Deep Neural Network (DNN)-based methods have shown promising performance comparable to tree-based ones, in-depth evaluation of these methods is challenging due to varying performance ranks across diverse datasets. In this paper, we propose a comprehensive benchmark comprising 300 tabular datasets, covering a wide range of task types, size distributions, and domains. We perform an extensive comparison between state-of-the-art deep tabular methods and tree-based methods, revealing the average rank of all methods and highlighting the key factors that influence the success of deep tabular methods. Next, we analyze deep tabular methods based on their training dynamics, including changes in validation metrics and other statistics. For each dataset-method pair, we learn a mapping from both the meta-features of datasets and the first part of the validation curve to the final validation set performance and even the evolution of validation curves. This mapping extracts essential meta-features that influence prediction accuracy, helping the analysis of tabular methods from novel aspects. Based on the performance of all methods on this large benchmark, we identify two subsets of 45 datasets each. The first subset contains datasets that favor either tree-based methods or DNN-based methods, serving as effective analysis tools to evaluate strategies (e.g., attribute encoding strategies) for improving deep tabular models. The second subset contains datasets where the ranks of methods are consistent with the overall benchmark, acting as a probe for tabular analysis. These "tiny tabular benchmarks" will facilitate further studies on tabular data.

Overview

This paper takes a closer look at the performance of deep learning models on tabular data, which is data organized in rows and columns like a spreadsheet.
The authors build a benchmark of 300 tabular datasets spanning a wide range of task types, sizes, and domains, and use it to analyze which factors influence the success of deep tabular methods, including their training dynamics.
The paper also compares state-of-the-art deep tabular methods with tree-based methods across the benchmark and reports the average rank of each method.

Plain English Explanation

Deep learning, a type of artificial intelligence that can automatically learn patterns from data, has become very popular in recent years. However, when it comes to tabular data - the kind of data you might find in a spreadsheet - deep learning hasn't always performed as well as traditional machine learning algorithms.

This paper investigates why deep learning sometimes struggles with tabular data and which factors influence its success. The authors evaluate many methods on a large benchmark of 300 datasets and study how each model's validation performance evolves during training.

They also compare deep learning models to tree-based methods on this benchmark. The goal is to understand the strengths and weaknesses of deep learning for this type of data and provide guidance on when it might be the best choice.

Overall, the paper offers a detailed look at the challenges and potential solutions for using deep learning on tabular data, which is an important real-world application of machine learning.

Technical Explanation

The paper begins by highlighting the importance of tabular data in many real-world applications, such as finance, healthcare, and business operations. While deep learning has achieved remarkable success in domains like computer vision and natural language processing, it has not always performed as well on tabular datasets compared to traditional machine learning algorithms.

To better understand this gap, the authors conduct an extensive empirical study on a benchmark of 300 tabular datasets, evaluating deep tabular methods alongside tree-based methods. The study and the related work listed below touch on several themes:

Feature Handling: The abstract names attribute encoding strategies as one way to improve deep tabular models. The related TabReD benchmark adds that feature engineering pipelines in production data can change the mix of predictive, uninformative, and correlated features, which in turn can affect model selection.
Model Architecture: The benchmark compares state-of-the-art deep tabular methods with tree-based methods. Related work such as ExcelFormer proposes architectures designed specifically for tabular data.
Training Dynamics: The authors analyze how validation metrics and other statistics change during training. Related work on Federated Learning examines how tree-based models and neural networks compare when tabular data is distributed across clients.
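The abstract describes comparing methods by their average rank across datasets. A minimal sketch of that kind of aggregation follows, assuming each dataset yields one score per method and that tied scores share the average of their positions. The function names, method names, and scores are hypothetical and made up for illustration; they are not taken from the paper.

```python
from statistics import mean


def rank_methods(scores, higher_is_better=True):
    """Rank methods on one dataset; rank 1 is best, ties share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of methods whose score ties with ordered[i].
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        tied_rank = ((i + 1) + (j + 1)) / 2  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = tied_rank
        i = j + 1
    return ranks


def average_ranks(results, higher_is_better=True):
    """results: {dataset: {method: score}} -> {method: mean rank over datasets}."""
    per_method = {}
    for scores in results.values():
        for method, rank in rank_methods(scores, higher_is_better).items():
            per_method.setdefault(method, []).append(rank)
    return {method: mean(ranks) for method, ranks in per_method.items()}


# Made-up accuracies for illustration only; not results from the paper.
toy_results = {
    "dataset_a": {"mlp": 0.81, "gbdt": 0.85, "transformer": 0.83},
    "dataset_b": {"mlp": 0.70, "gbdt": 0.70, "transformer": 0.74},
    "dataset_c": {"mlp": 0.92, "gbdt": 0.90, "transformer": 0.91},
}
avg = average_ranks(toy_results)
for method, rank in sorted(avg.items(), key=lambda kv: kv[1]):
    print(f"{method}: {rank:.2f}")
```

Average rank is used here because raw scores are not comparable across datasets with different metrics and difficulty, whereas ranks are. For error-style metrics such as RMSE, pass higher_is_better=False.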

The results of their experiments show that no single family of methods wins everywhere. According to the abstract, DNN-based methods achieve performance comparable to tree-based ones, but method ranks vary across datasets, and some datasets clearly favor one family over the other. Related work on data-scarce classification similarly finds that a simple baseline, L2-regularized logistic regression, performs on par with more complex approaches on most of its benchmark datasets.
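The abstract mentions a subset of datasets on which the ranks of methods are consistent with the overall benchmark. The selection procedure is not described in this summary, so the sketch below swaps in one plausible criterion: keep a dataset when the Spearman correlation between its per-dataset ranks and the overall ranks exceeds a threshold. The threshold, the function names, and the toy ranks are all assumptions, and the closed-form Spearman formula used here is only valid when there are no tied ranks.

```python
def spearman_no_ties(xs, ys):
    """Spearman rank correlation for two rank lists of equal length without ties."""
    n = len(xs)
    squared_diffs = sum((x - y) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


def consistent_datasets(per_dataset_ranks, overall_ranks, threshold=0.5):
    """Return names of datasets whose method ranking agrees with the overall one."""
    methods = sorted(overall_ranks)  # fixed method order for both rank vectors
    overall = [overall_ranks[m] for m in methods]
    selected = []
    for name, ranks in per_dataset_ranks.items():
        rho = spearman_no_ties([ranks[m] for m in methods], overall)
        if rho >= threshold:
            selected.append(name)
    return selected


# Hypothetical ranks (1 = best) for four hypothetical methods; not from the paper.
overall = {"a": 1, "b": 2, "c": 3, "d": 4}
per_dataset = {
    "ds_same": {"a": 1, "b": 2, "c": 3, "d": 4},      # identical ordering
    "ds_close": {"a": 1, "b": 2, "c": 4, "d": 3},     # one adjacent swap
    "ds_reversed": {"a": 4, "b": 3, "c": 2, "d": 1},  # fully reversed ordering
}
probe = consistent_datasets(per_dataset, overall)
print(probe)
```

A small set of datasets chosen this way could serve as a cheap proxy for the full benchmark, which is the role the abstract assigns to its second subset.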

Additionally, the paper discusses the importance of carefully evaluating the performance of deep learning models on tabular data, as the authors identify potential issues with overconfident predictions and the limitations of AutoML approaches in this domain.
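The abstract also describes learning a mapping from dataset meta-features and the first part of a validation curve to the final validation performance. The model used for that mapping is not specified in this summary, so the following sketch substitutes a simple k-nearest-neighbour estimator. The feature layout, the function names, and every number are hypothetical.

```python
import math


def make_features(meta_features, early_curve):
    """Concatenate dataset meta-features with the early validation curve."""
    return list(meta_features) + list(early_curve)


def predict_final(train, query, k=2):
    """Estimate final validation performance as the mean over the k nearest runs.

    train: list of (feature_vector, final_score) pairs from completed runs.
    query: feature vector of the run whose final score is unknown.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Made-up completed runs: two scaled meta-features, then validation accuracy
# after the first three epochs, paired with the final validation accuracy.
completed_runs = [
    (make_features([0.2, 0.5], [0.60, 0.68, 0.72]), 0.80),
    (make_features([0.2, 0.4], [0.58, 0.66, 0.71]), 0.78),
    (make_features([0.9, 0.1], [0.30, 0.35, 0.38]), 0.45),
]
new_run = make_features([0.25, 0.5], [0.59, 0.67, 0.72])
estimate = predict_final(completed_runs, new_run, k=2)
print(f"estimated final validation accuracy: {estimate:.2f}")
```

A nearest-neighbour estimator is chosen only because it needs no training loop; any regressor that accepts the concatenated features could take its place. Meta-features and curve values should be put on comparable scales first, since Euclidean distance is sensitive to scale.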

Critical Analysis

The paper provides a comprehensive and rigorous evaluation of deep learning on tabular data, offering valuable insights for researchers and practitioners working in this domain. The authors have carefully designed their experiments to compare deep learning to traditional machine learning algorithms, and their findings highlight the nuances and trade-offs involved in selecting the appropriate modeling approach for tabular data.

One potential limitation concerns the two proposed subsets. The full benchmark is large, comprising 300 datasets, but each "tiny tabular benchmark" contains only 45 datasets, so conclusions drawn from a subset alone may generalize less well than conclusions drawn from the full benchmark.

Additionally, the paper primarily focuses on the predictive performance of the models, but does not delve deeply into the interpretability and explainability of the deep learning models. As tabular data is often used in domains where interpretability is crucial, such as healthcare and finance, further exploration of this aspect could provide additional insights.

Overall, this paper makes a significant contribution to the understanding of deep learning on tabular data and provides a solid foundation for future research in this area. The authors' thoughtful analysis and clear presentation of the results make this work a valuable resource for the machine learning community.

Conclusion

This paper provides a comprehensive investigation of the performance of deep learning models on tabular data, a common and important type of data in many real-world applications. The authors benchmark deep tabular methods against tree-based methods on 300 datasets, analyze training dynamics, and identify two subsets of 45 datasets each to support further study.

The study's findings offer valuable insights, demonstrating that deep learning can outperform traditional machine learning algorithms in certain scenarios, but also highlighting the importance of carefully evaluating the strengths and limitations of different modeling approaches. The paper encourages researchers and practitioners to think critically about the appropriate use of deep learning for tabular data and provides a solid foundation for future work in this area.

By shedding light on the nuances of deep learning on tabular data, this paper contributes to the ongoing efforts to advance the capabilities of machine learning in real-world applications, where tabular data is ubiquitous and the choice of the right modeling approach can have significant implications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TabReD: A Benchmark of Tabular Machine Learning in-the-Wild

Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, Artem Babenko

Benchmarks that closely reflect downstream application scenarios are essential for the streamlined adoption of new research in tabular machine learning (ML). In this work, we examine existing tabular benchmarks and find two common characteristics of industry-grade tabular data that are underrepresented in the datasets available to the academic community. First, tabular data often changes over time in real-world deployment scenarios. This impacts model performance and requires time-based train and test splits for correct model evaluation. Yet, existing academic tabular datasets often lack timestamp metadata to enable such evaluation. Second, a considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines. For each specific dataset, this can have a different impact on the absolute and relative number of predictive, uninformative, and correlated features, which in turn can affect model selection. To fill the aforementioned gaps in academic benchmarks, we introduce TabReD -- a collection of eight industry-grade tabular datasets covering a wide range of domains from finance to food delivery services. We assess a large number of tabular ML models in the feature-rich, temporally-evolving data setting facilitated by TabReD. We demonstrate that evaluation on time-based data splits leads to different methods ranking, compared to evaluation on random splits more common in academic benchmarks. Furthermore, on the TabReD datasets, MLP-like architectures and GBDT show the best results, while more sophisticated DL models are yet to prove their effectiveness.

6/28/2024

cs.LG

A Federated Learning Benchmark on Tabular Data: Comparing Tree-Based Models and Neural Networks

William Lindskog, Christian Prehofer

Federated Learning (FL) has lately gained traction as it addresses how machine learning models train on distributed datasets. FL was designed for parametric models, namely Deep Neural Networks (DNNs). Thus, it has shown promise on image and text tasks. However, FL for tabular data has received little attention. Tree-Based Models (TBMs) have been considered to perform better on tabular data and they are starting to see FL integrations. In this study, we benchmark federated TBMs and DNNs for horizontal FL, with varying data partitions, on 10 well-known tabular datasets. Our novel benchmark results indicate that current federated boosted TBMs perform better than federated DNNs in different data partitions. Furthermore, a federated XGBoost outperforms all other models. Lastly, we find that federated TBMs perform better than federated parametric models, even when increasing the number of clients significantly.

5/6/2024

cs.LG

Squeezing Lemons with Hammers: An Evaluation of AutoML and Tabular Deep Learning for Data-Scarce Classification Applications

Ricardo Knauer, Erik Rodner

Many industry verticals are confronted with small-sized tabular data. In this low-data regime, it is currently unclear whether the best performance can be expected from simple baselines, or more complex machine learning approaches that leverage meta-learning and ensembling. On 44 tabular classification datasets with sample sizes $\leq$ 500, we find that L2-regularized logistic regression performs similar to state-of-the-art automated machine learning (AutoML) frameworks (AutoPrognosis, AutoGluon) and off-the-shelf deep neural networks (TabPFN, HyperFast) on the majority of the benchmark datasets. We therefore recommend to consider logistic regression as the first choice for data-scarce applications with tabular data and provide practitioners with best practices for further method selection.

5/14/2024

cs.LG cs.AI

ExcelFormer: Can a DNN be a Sure Bet for Tabular Prediction?

Jintai Chen, Jiahuan Yan, Qiyuan Chen, Danny Ziyi Chen, Jian Wu, Jimeng Sun

Data organized in tabular format is ubiquitous in real-world applications, and users often craft tables with biased feature definitions and flexibly set prediction targets of their interests. Thus, a rapid development of a robust, effective, dataset-versatile, user-friendly tabular prediction approach is highly desired. While Gradient Boosting Decision Trees (GBDTs) and existing deep neural networks (DNNs) have been extensively utilized by professional users, they present several challenges for casual users, particularly: (i) the dilemma of model selection due to their different dataset preferences, and (ii) the need for heavy hyperparameter searching, failing which their performances are deemed inadequate. In this paper, we delve into this question: Can we develop a deep learning model that serves as a sure bet solution for a wide range of tabular prediction tasks, while also being user-friendly for casual users? We delve into three key drawbacks of deep tabular models, encompassing: (P1) lack of rotational variance property, (P2) large data demand, and (P3) over-smooth solution. We propose ExcelFormer, addressing these challenges through a semi-permeable attention module that effectively constrains the influence of less informative features to break the DNNs' rotational invariance property (for P1), data augmentation approaches tailored for tabular data (for P2), and attentive feedforward network to boost the model fitting capability (for P3). These designs collectively make ExcelFormer a sure bet solution for diverse tabular datasets. Extensive and stratified experiments conducted on real-world datasets demonstrate that our model outperforms previous approaches across diverse tabular data prediction tasks, and this framework can be friendly to casual users, offering ease of use without the heavy hyperparameter tuning.

5/27/2024

cs.LG