When Do Neural Nets Outperform Boosted Trees on Tabular Data?

Read original: arXiv:2305.02997 - Published 7/17/2024 by Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, Colin White


Overview

Tabular data is a common type of data used in machine learning
There is ongoing debate about whether neural networks (NNs) or gradient-boosted decision trees (GBDTs) perform better on tabular data
This paper takes a deeper look at this debate and provides insights to guide practitioners

Plain English Explanation

Tabular data, which is data organized into rows and columns, is one of the most common types of data used in machine learning. Recently, there has been a lot of discussion about whether neural networks (NNs) or gradient-boosted decision trees (GBDTs) perform better on tabular data. Some argue that GBDTs are consistently better, while others argue that NNs come out ahead.

This paper steps back and looks at this debate more broadly. The researchers conducted a very large analysis, comparing 19 different algorithms across 176 different datasets. They found that for many datasets, the difference in performance between NNs and GBDTs is quite small. They also found that simply tuning the settings of a GBDT can be more important than choosing between NNs and GBDTs.

The paper highlights one exception: a recently proposed neural network called TabPFN, which outperforms all other methods on average even though it is effectively limited to training sets of about 3000 samples. On larger datasets, it was evaluated using a random sample of 3000 training points. The researchers also analyzed what properties of a dataset make NNs or GBDTs better suited to perform well. For example, they found that GBDTs handle irregularities in the data, such as skewed or heavy-tailed feature distributions, much better than NNs.

Overall, this research provides helpful guidance for practitioners on choosing the right machine learning approach for their specific tabular dataset.

Technical Explanation

The paper conducts what its authors describe as the largest tabular data analysis to date, comparing 19 different machine learning algorithms across 176 datasets. The key findings are:

The performance difference between neural networks (NNs) and gradient-boosted decision trees (GBDTs) is negligible for many datasets. Simple hyperparameter tuning of a GBDT can be more important than choosing between NNs and GBDTs.
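A recently proposed prior-data fitted network called TabPFN outperforms all other algorithms on average. This holds even though TabPFN is effectively limited to training sets of size 3000, so on larger datasets it is given only 3000 randomly sampled training points.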
The researchers analyzed dozens of dataset "metafeatures" to determine what properties make NNs or GBDTs better suited. They found that GBDTs handle irregularities like skewed or heavy-tailed feature distributions much better than NNs.
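The first finding above compares two gaps on each dataset: the gain from tuning a GBDT and the gap between the GBDT and NN families. The following is a minimal sketch of that comparison, not the paper's procedure. The function name `compare_gaps`, the tolerance `tol`, and the example scores are all hypothetical and chosen only for illustration.

```python
def compare_gaps(gbdt_default, gbdt_tuned, nn_best, tol=0.01):
    """Label which effect dominates on one dataset (higher score is better).

    All inputs are test scores such as accuracy. `tol` is an assumed
    threshold below which a difference is treated as negligible.
    """
    tuning_gain = gbdt_tuned - gbdt_default   # benefit of tuning the GBDT
    family_gap = abs(nn_best - gbdt_tuned)    # benefit of picking the better family
    if family_gap <= tol:
        return "negligible family gap"
    if tuning_gain >= family_gap:
        return "tuning matters more"
    return "family choice matters more"


# Made-up scores for three hypothetical datasets.
examples = {
    "dataset_a": (0.80, 0.85, 0.855),
    "dataset_b": (0.70, 0.80, 0.83),
    "dataset_c": (0.80, 0.81, 0.90),
}
for name, scores in examples.items():
    print(name, "->", compare_gaps(*scores))
```

On real data, the three scores would come from cross-validated runs, and the tolerance would need to reflect the noise between folds.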

The paper's large-scale empirical analysis provides important insights to guide practitioners in choosing the most appropriate machine learning technique for their tabular datasets. Additionally, the researchers release a benchmark suite of the 36 "hardest" datasets from their study, called the TabZilla Benchmark Suite, to accelerate future tabular data research.
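To act on the finding about skewed or heavy-tailed features, a practitioner might first measure those properties on their own data. The sketch below computes two standard statistics, sample skewness and excess kurtosis, for a single numeric column using only the Python standard library. It is an illustration under stated assumptions: the paper's metafeatures are not necessarily computed this way, and the function names are hypothetical.

```python
def _central_moment(xs, k):
    """k-th central moment of a list of numbers (population form)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def skewness(xs):
    """Sample skewness: 0 for symmetric data, positive for a long right tail."""
    m2 = _central_moment(xs, 2)
    if m2 == 0:
        return 0.0  # constant column: no asymmetry to measure
    return _central_moment(xs, 3) / m2 ** 1.5


def excess_kurtosis(xs):
    """Excess kurtosis: about 0 for Gaussian data, large for heavy tails."""
    m2 = _central_moment(xs, 2)
    if m2 == 0:
        return 0.0  # constant column: tails are undefined, report 0
    return _central_moment(xs, 4) / m2 ** 2 - 3.0


# A symmetric column and a column with one extreme value.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

Columns with large absolute skewness or high excess kurtosis are the kind of irregularity that, according to the paper, GBDTs tend to handle better than NNs.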

Critical Analysis

The paper presents a thorough and well-designed empirical study, but there are a few potential limitations and areas for further research:

The analysis is limited to a fixed set of 19 algorithms, and it's possible that other techniques not included could outperform the ones studied.
The metafeature analysis provides helpful insights, but may not capture all the complex factors that influence algorithm performance on tabular data.
While TabPFN shows promising results, it is effectively limited to training sets of size 3000, so on larger datasets it must discard most of the training data, which could limit its practical applicability.
The paper does not delve into the computational costs and training time requirements of the different algorithms, which are also important practical considerations.
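The 3000-sample limit mentioned above is handled in the paper by randomly sampling 3000 training datapoints. A minimal sketch of such subsampling follows, assuming the data is held in plain Python lists. The function name `subsample_training_set` and its parameters are hypothetical, and the paper's codebase may sample differently (for example, with stratification).

```python
import random


def subsample_training_set(X, y, max_samples=3000, seed=0):
    """Return at most `max_samples` (row, label) pairs, chosen uniformly at random.

    Rows and labels stay aligned. Datasets at or below the cap are returned whole.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    if len(X) <= max_samples:
        return list(X), list(y)
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    keep = sorted(rng.sample(range(len(X)), max_samples))
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 10,000 rows with one feature each, label is the feature's parity.
X = [[i] for i in range(10_000)]
y = [i % 2 for i in range(10_000)]
X_small, y_small = subsample_training_set(X, y)
print(len(X_small), len(y_small))
```

Uniform sampling can under-represent rare classes, so a stratified variant may be preferable on imbalanced data.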

Despite these caveats, this research makes a valuable contribution by providing a comprehensive and balanced perspective on the "NN vs. GBDT" debate, and offering guidance to help practitioners navigate the landscape of tabular data machine learning.

Conclusion

This paper takes a step back from the ongoing debate about whether neural networks or gradient-boosted decision trees perform better on tabular data. Through an extensive empirical analysis across 176 datasets, the researchers found that for many datasets, the difference in performance between these two approaches is negligible, and that careful hyperparameter tuning of a GBDT can be more important than the choice between NNs and GBDTs.

The paper also highlights one exception: TabPFN, a recently proposed neural network architecture that outperforms all other methods on average, even though it is effectively limited to training sets of 3000 samples and must randomly subsample larger datasets. Additionally, the researchers provide insights into the dataset properties that make NNs or GBDTs better suited, which can guide practitioners in choosing the appropriate technique for their specific tabular data problem.

Overall, this research helps to reframe the "NN vs. GBDT" discussion and provides a more nuanced perspective to inform the field of tabular data machine learning.


Related Papers


When Do Neural Nets Outperform Boosted Trees on Tabular Data?

Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, Colin White

Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently-proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.

7/17/2024

Team up GBDTs and DNNs: Advancing Efficient and Effective Tabular Prediction with Tree-hybrid MLPs

Jiahuan Yan, Jintai Chen, Qianxing Wang, Danny Z. Chen, Jian Wu

Tabular datasets play a crucial role in various applications. Thus, developing efficient, effective, and widely compatible prediction algorithms for tabular data is important. Currently, two prominent model types, Gradient Boosted Decision Trees (GBDTs) and Deep Neural Networks (DNNs), have demonstrated performance advantages on distinct tabular prediction tasks. However, selecting an effective model for a specific tabular dataset is challenging, often demanding time-consuming hyperparameter tuning. To address this model selection dilemma, this paper proposes a new framework that amalgamates the advantages of both GBDTs and DNNs, resulting in a DNN algorithm that is as efficient as GBDTs and is competitively effective regardless of dataset preferences for GBDTs or DNNs. Our idea is rooted in an observation that deep learning (DL) offers a larger parameter space that can represent a well-performing GBDT model, yet the current back-propagation optimizer struggles to efficiently discover such optimal functionality. On the other hand, during GBDT development, hard tree pruning, entropy-driven feature gate, and model ensemble have proved to be more adaptable to tabular data. By combining these key components, we present a Tree-hybrid simple MLP (T-MLP). In our framework, a tensorized, rapidly trained GBDT feature gate, a DNN architecture pruning approach, as well as a vanilla back-propagation optimizer collaboratively train a randomly initialized MLP model. Comprehensive experiments show that T-MLP is competitive with extensively tuned DNNs and GBDTs in their dominating tabular benchmarks (88 datasets) respectively, all achieved with compact model storage and significantly reduced training duration.

7/16/2024

A Closer Look at Deep Learning on Tabular Data

Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, De-Chuan Zhan

Tabular data is prevalent across various domains in machine learning. Although Deep Neural Network (DNN)-based methods have shown promising performance comparable to tree-based ones, in-depth evaluation of these methods is challenging due to varying performance ranks across diverse datasets. In this paper, we propose a comprehensive benchmark comprising 300 tabular datasets, covering a wide range of task types, size distributions, and domains. We perform an extensive comparison between state-of-the-art deep tabular methods and tree-based methods, revealing the average rank of all methods and highlighting the key factors that influence the success of deep tabular methods. Next, we analyze deep tabular methods based on their training dynamics, including changes in validation metrics and other statistics. For each dataset-method pair, we learn a mapping from both the meta-features of datasets and the first part of the validation curve to the final validation set performance and even the evolution of validation curves. This mapping extracts essential meta-features that influence prediction accuracy, helping the analysis of tabular methods from novel aspects. Based on the performance of all methods on this large benchmark, we identify two subsets of 45 datasets each. The first subset contains datasets that favor either tree-based methods or DNN-based methods, serving as effective analysis tools to evaluate strategies (e.g., attribute encoding strategies) for improving deep tabular models. The second subset contains datasets where the ranks of methods are consistent with the overall benchmark, acting as a probe for tabular analysis. These "tiny tabular benchmarks" will facilitate further studies on tabular data.

7/2/2024


Challenging Gradient Boosted Decision Trees with Tabular Transformers for Fraud Detection at Booking.com

Sergei Krutikov (Booking.com), Bulat Khaertdinov (Maastricht University), Rodion Kiriukhin (Booking.com), Shubham Agrawal (Booking.com), Kees Jan De Vries (Booking.com)

Transformer-based neural networks, empowered by Self-Supervised Learning (SSL), have demonstrated unprecedented performance across various domains. However, related literature suggests that tabular Transformers may struggle to outperform classical Machine Learning algorithms, such as Gradient Boosted Decision Trees (GBDT). In this paper, we aim to challenge GBDTs with tabular Transformers on a typical task faced in e-commerce, namely fraud detection. Our study is additionally motivated by the problem of selection bias, often occurring in real-life fraud detection systems. It is caused by the production system affecting which subset of traffic becomes labeled. This issue is typically addressed by sampling randomly a small part of the whole production data, referred to as a Control Group. This subset follows a target distribution of production data and therefore is usually preferred for training classification models with standard ML algorithms. Our methodology leverages the capabilities of Transformers to learn transferable representations using all available data by means of SSL, giving it an advantage over classical methods. Furthermore, we conduct large-scale experiments, pre-training tabular Transformers on vast amounts of data instances and fine-tuning them on smaller target datasets. The proposed approach outperforms heavily tuned GBDTs by a considerable margin of the Average Precision (AP) score. Pre-trained models show more consistent performance than the ones trained from scratch when fine-tuning data is limited. Moreover, they require noticeably less labeled data for reaching performance comparable to their GBDT competitor that utilizes the whole dataset.

5/24/2024