Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data

Read original: arXiv:2407.04491 - Published 7/8/2024 by David Holzmüller, Léo Grinsztajn, Ingo Steinwart


Overview

Gradient-boosted decision trees (GBDTs) have long dominated classification and regression tasks on tabular data.
Recently, deep learning methods have challenged GBDTs, but often require extensive hyperparameter tuning.
The authors address this discrepancy by:
- Introducing RealMLP, an improved multilayer perceptron (MLP)
- Improving default parameters for GBDTs and RealMLP

Plain English Explanation

The paper addresses a discrepancy between two families of machine learning models on tabular data. Gradient-boosted decision trees (GBDTs) have traditionally been the go-to approach for classification and regression on this type of data. More recently, deep learning methods have started to challenge the dominance of GBDTs, but they are often much slower and need extensive hyperparameter tuning to perform well.

The authors of this paper aim to address this discrepancy in two ways:
- They introduce a new type of multilayer perceptron (MLP) called RealMLP, which is an improved version of the standard MLP.
- They provide improved default parameters for both GBDTs and RealMLP, so that these models can achieve good performance without extensive hyperparameter tuning.

Technical Explanation

The core of the paper is the introduction of RealMLP, an improved MLP architecture, and the optimization of default hyperparameters for both RealMLP and GBDTs.

The authors first tuned RealMLP and the default hyperparameters on a "meta-train" benchmark consisting of 71 classification and 47 regression datasets. They then compared the tuned RealMLP and the default hyperparameter settings to hyperparameter-optimized versions of the models on a disjoint "meta-test" benchmark with 48 classification and 42 regression datasets, as well as on the GBDT-friendly benchmark by Grinsztajn et al. (2022).
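The meta-train / meta-test protocol can be illustrated with a minimal sketch. It assumes that a "default" is chosen as the candidate configuration with the lowest average error across the meta-train datasets and is then scored, unchanged, on the disjoint meta-test datasets. The configuration names, dataset names, error values, and the use of a plain arithmetic mean are all hypothetical choices made for illustration; the paper may aggregate results differently.

```python
from statistics import mean

# Hypothetical errors, indexed as errors[config_name][dataset_name].
# Every name and number here is made up for illustration.
META_TRAIN_ERRORS = {
    "config_a": {"d1": 0.20, "d2": 0.10, "d3": 0.30},
    "config_b": {"d1": 0.15, "d2": 0.12, "d3": 0.25},
    "config_c": {"d1": 0.25, "d2": 0.08, "d3": 0.35},
}
META_TEST_ERRORS = {
    "config_a": {"t1": 0.22, "t2": 0.18},
    "config_b": {"t1": 0.19, "t2": 0.16},
    "config_c": {"t1": 0.27, "t2": 0.15},
}


def aggregate(errors_by_dataset):
    """Aggregate per-dataset errors into one score (assumption: plain mean)."""
    return mean(errors_by_dataset.values())


def pick_default(meta_train_errors):
    """Return the configuration with the lowest aggregate meta-train error."""
    return min(meta_train_errors, key=lambda name: aggregate(meta_train_errors[name]))


# The default is fixed using meta-train data only ...
default_config = pick_default(META_TRAIN_ERRORS)
# ... and then evaluated once on datasets it was never tuned on.
meta_test_score = aggregate(META_TEST_ERRORS[default_config])

print(default_config, round(meta_test_score, 3))
```

Keeping the two benchmarks disjoint is what makes the meta-test score an estimate of how the fixed defaults behave on unseen datasets, rather than a measure of how well they were fitted to the tuning datasets.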

The benchmark results showed that RealMLP offers a better time-accuracy tradeoff than other neural networks and is competitive with GBDTs. Moreover, the authors found that a combination of RealMLP and GBDTs with improved default parameters can achieve excellent results on medium-sized tabular datasets (1K to 500K samples) without the need for extensive hyperparameter tuning.
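This summary does not say how RealMLP and GBDTs are combined. As a hedged illustration only, the sketch below shows two simple options for combining models that were each trained with fixed defaults: picking the model with the lowest validation error, or averaging predicted class probabilities. The probability values, labels, and helper names are hypothetical and are not taken from the paper.

```python
def error_rate(probs, labels):
    """Fraction of samples whose highest-probability class differs from the label."""
    wrong = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)  # argmax over classes
        if predicted != y:
            wrong += 1
    return wrong / len(labels)


def average_probs(prob_lists):
    """Average class probabilities across models, sample by sample."""
    n_models = len(prob_lists)
    return [
        [sum(class_probs) / n_models for class_probs in zip(*sample)]
        for sample in zip(*prob_lists)
    ]


# Hypothetical validation labels and per-model class probabilities.
val_labels = [0, 1, 1, 0]
val_probs = {
    "mlp_default": [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.7, 0.3]],
    "gbdt_default": [[0.6, 0.4], [0.3, 0.7], [0.3, 0.7], [0.45, 0.55]],
}

# Option 1: keep whichever single model has the lowest validation error.
val_errors = {name: error_rate(p, val_labels) for name, p in val_probs.items()}
best_single = min(val_errors, key=val_errors.get)

# Option 2: average the probabilities of all models (a uniform ensemble).
ensemble_probs = average_probs(list(val_probs.values()))
ensemble_error = error_rate(ensemble_probs, val_labels)

print(val_errors, best_single, ensemble_error)
```

In this toy data each model makes one mistake on a different sample, so the averaged probabilities classify every sample correctly. Both options need only one training run per model, which is why combining default-parameter models can stay far cheaper than a hyperparameter search.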

Critical Analysis

The paper presents a compelling approach to improving the performance of both GBDTs and deep learning models on tabular data. The authors acknowledge that their work is limited to a specific set of datasets and that further research is needed to understand the broader applicability of their methods.

One potential concern is that the default hyperparameters are tuned on a fixed meta-train benchmark. Although they are evaluated on a disjoint meta-test benchmark, they may not generalize well to all types of tabular datasets, and the authors do not provide a clear way to adapt the default hyperparameters to new datasets.

Additionally, the authors do not provide a detailed analysis of the inner workings of RealMLP or the reasons behind its improved performance compared to other neural network architectures. A deeper understanding of the model's strengths and weaknesses could help inform future developments in this area.

Conclusion

This paper offers a promising solution to the discrepancy between the performance of GBDTs and deep learning models on tabular data. By introducing RealMLP and optimizing default hyperparameters, the authors have demonstrated that it is possible to achieve excellent results on medium-sized datasets without the need for extensive hyperparameter tuning.

The findings of this research could have significant implications for the broader field of machine learning, particularly in domains where tabular data is prevalent and the time and resources required for model development are limited. As the authors suggest, further research is needed to explore the broader applicability of their methods and to deepen the understanding of the underlying mechanisms driving the performance improvements.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data

David Holzmüller, Léo Grinsztajn, Ingo Steinwart

For classification and regression on tabular data, the dominance of gradient-boosted decision trees (GBDTs) has recently been challenged by often much slower deep learning methods with extensive hyperparameter tuning. We address this discrepancy by introducing (a) RealMLP, an improved multilayer perceptron (MLP), and (b) improved default parameters for GBDTs and RealMLP. We tune RealMLP and the default parameters on a meta-train benchmark with 71 classification and 47 regression datasets and compare them to hyperparameter-optimized versions on a disjoint meta-test benchmark with 48 classification and 42 regression datasets, as well as the GBDT-friendly benchmark by Grinsztajn et al. (2022). Our benchmark results show that RealMLP offers a better time-accuracy tradeoff than other neural nets and is competitive with GBDTs. Moreover, a combination of RealMLP and GBDTs with improved default parameters can achieve excellent results on medium-sized tabular datasets (1K to 500K samples) without hyperparameter tuning.

7/8/2024

Team up GBDTs and DNNs: Advancing Efficient and Effective Tabular Prediction with Tree-hybrid MLPs

Jiahuan Yan, Jintai Chen, Qianxing Wang, Danny Z. Chen, Jian Wu

Tabular datasets play a crucial role in various applications. Thus, developing efficient, effective, and widely compatible prediction algorithms for tabular data is important. Currently, two prominent model types, Gradient Boosted Decision Trees (GBDTs) and Deep Neural Networks (DNNs), have demonstrated performance advantages on distinct tabular prediction tasks. However, selecting an effective model for a specific tabular dataset is challenging, often demanding time-consuming hyperparameter tuning. To address this model selection dilemma, this paper proposes a new framework that amalgamates the advantages of both GBDTs and DNNs, resulting in a DNN algorithm that is as efficient as GBDTs and is competitively effective regardless of dataset preferences for GBDTs or DNNs. Our idea is rooted in an observation that deep learning (DL) offers a larger parameter space that can represent a well-performing GBDT model, yet the current back-propagation optimizer struggles to efficiently discover such optimal functionality. On the other hand, during GBDT development, hard tree pruning, entropy-driven feature gate, and model ensemble have proved to be more adaptable to tabular data. By combining these key components, we present a Tree-hybrid simple MLP (T-MLP). In our framework, a tensorized, rapidly trained GBDT feature gate, a DNN architecture pruning approach, as well as a vanilla back-propagation optimizer collaboratively train a randomly initialized MLP model. Comprehensive experiments show that T-MLP is competitive with extensively tuned DNNs and GBDTs in their dominating tabular benchmarks (88 datasets) respectively, all achieved with compact model storage and significantly reduced training duration.

7/16/2024


When Do Neural Nets Outperform Boosted Trees on Tabular Data?

Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, Colin White

Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently-proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.

7/17/2024


Challenging Gradient Boosted Decision Trees with Tabular Transformers for Fraud Detection at Booking.com

Sergei Krutikov (Booking.com), Bulat Khaertdinov (Maastricht University), Rodion Kiriukhin (Booking.com), Shubham Agrawal (Booking.com), Kees Jan De Vries (Booking.com)

Transformer-based neural networks, empowered by Self-Supervised Learning (SSL), have demonstrated unprecedented performance across various domains. However, related literature suggests that tabular Transformers may struggle to outperform classical Machine Learning algorithms, such as Gradient Boosted Decision Trees (GBDT). In this paper, we aim to challenge GBDTs with tabular Transformers on a typical task faced in e-commerce, namely fraud detection. Our study is additionally motivated by the problem of selection bias, often occurring in real-life fraud detection systems. It is caused by the production system affecting which subset of traffic becomes labeled. This issue is typically addressed by sampling randomly a small part of the whole production data, referred to as a Control Group. This subset follows a target distribution of production data and therefore is usually preferred for training classification models with standard ML algorithms. Our methodology leverages the capabilities of Transformers to learn transferable representations using all available data by means of SSL, giving it an advantage over classical methods. Furthermore, we conduct large-scale experiments, pre-training tabular Transformers on vast amounts of data instances and fine-tuning them on smaller target datasets. The proposed approach outperforms heavily tuned GBDTs by a considerable margin of the Average Precision (AP) score. Pre-trained models show more consistent performance than the ones trained from scratch when fine-tuning data is limited. Moreover, they require noticeably less labeled data for reaching performance comparable to their GBDT competitor that utilizes the whole dataset.

5/24/2024