Cross-Table Pretraining towards a Universal Function Space for Heterogeneous Tabular Data

arXiv:2406.00281

Published 6/4/2024 by Jintai Chen, Zhen Lin, Qiyuan Chen, Jimeng Sun

Abstract

Tabular data from different tables exhibit significant diversity due to varied definitions and types of features, as well as complex inter-feature and feature-target relationships. Cross-dataset pretraining, which learns reusable patterns from upstream data to support downstream tasks, has shown notable success in various fields. Yet, when applied to tabular data prediction, this paradigm faces challenges due to the limited reusable patterns among diverse tabular datasets (tables) and the general scarcity of tabular data available for fine-tuning. In this study, we fill this gap by introducing a cross-table pretrained Transformer, XTFormer, for versatile downstream tabular prediction tasks. Our methodological insight is to pretrain XTFormer to establish a meta-function space that encompasses all potential feature-target mappings. In pre-training, a variety of potential mappings are extracted from pre-training tabular datasets and are embedded into the meta-function space, and suited mappings are extracted from the meta-function space for downstream tasks by a specified coordinate positioning approach. Experiments show that, in 190 downstream tabular prediction tasks, our cross-table pretrained XTFormer outperforms both XGBoost and CatBoost on 137 (72%) tasks, and surpasses the representative deep learning model FT-Transformer and the tabular pre-training approach XTab on 144 (76%) and 162 (85%) tasks, respectively.

Overview

• This paper introduces a novel pretraining approach called "Cross-Table Pretraining" to learn a universal function space for heterogeneous tabular data.
• The key idea is to leverage the rich information contained in the relationships between columns in multiple tabular datasets, rather than training on individual datasets in isolation.
• The proposed method aims to learn a robust and generalizable representation that can be effectively applied to a wide range of tabular data tasks, including classification and regression.

Plain English Explanation

The paper describes a new way to train machine learning models for working with tabular data, which is data that is structured in rows and columns, like in a spreadsheet. Traditionally, models are trained on one specific dataset at a time, but this paper proposes a different approach.

The key insight is that there is valuable information in the relationships between the different columns (features) in a dataset. For example, in a dataset about cars, the relationships between columns like "make", "model", "year", and "price" can reveal patterns that are useful for making predictions.

The researchers developed a "Cross-Table Pretraining" method that allows the model to learn these relationships by training on multiple datasets at the same time. This helps the model develop a more general and robust understanding of tabular data, rather than being narrowly specialized to a single dataset.

The goal is to create a universal function space - a flexible, all-purpose model that can be effectively applied to a wide variety of tabular data problems, like classification (e.g. predicting if a loan will default) or regression (e.g. predicting a home's sale price). This could be very useful in real-world applications where you may need to work with many different datasets.

Technical Explanation

The key innovation of this paper is the "Cross-Table Pretraining" approach, which aims to learn a universal function space for tabular data by leveraging the relationships between columns across multiple datasets.

The overall training process involves two main steps:

1. Pretraining: The model is first trained on a large collection of tabular datasets using the Cross-Table Pretraining method. This pretraining stage allows the model to learn general patterns and relationships in tabular data.
2. Fine-tuning: The pretrained model is then fine-tuned on a specific target task and dataset using standard supervised learning techniques. This fine-tuning step allows the model to adapt and specialize to the particular problem at hand.
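
The abstract frames the downstream step as locating a suitable mapping inside a pretrained meta-function space by means of coordinates. As a toy picture only, and not the XTFormer architecture, the sketch below treats the "space" as the span of a few frozen linear basis functions and the downstream step as solving for the mixing coordinates. The names `basis`, `predict`, and `fit_coordinates` are hypothetical, the basis is random rather than pretrained, and all rows are assumed to already share a fixed-width embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pretrained" basis: K linear maps over d-dimensional row embeddings.
# Here they are random stand-ins; in practice they would come from pretraining.
K, d = 3, 6
basis = rng.normal(size=(K, d))

def predict(X, coords, basis):
    """Evaluate the mapping located at `coords`: sum_k coords[k] * (basis[k] . x)."""
    return X @ (coords @ basis)

def fit_coordinates(X, y, basis):
    """Downstream step: keep the basis frozen and solve for the coordinates only."""
    Z = X @ basis.T  # (n, K): output of each basis function on every row
    coords, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coords

# Toy downstream task whose target lies inside the span of the basis.
true_coords = np.array([0.5, -1.0, 2.0])
X = rng.normal(size=(50, d))
y = predict(X, true_coords, basis)

coords = fit_coordinates(X, y, basis)
```

In this simplified setting only `K` numbers are fitted per downstream task, which hints at why a coordinate-based approach could suit tables with little fine-tuning data.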

The Cross-Table Pretraining method works by randomly masking certain columns in the input data and then training the model to predict the masked values based on the remaining columns. This forces the model to learn the complex interdependencies between the different features in the data.
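
The masking objective described in the paragraph above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the table is assumed to be fully numeric, hidden columns are replaced with zeros, and the helper names `mask_columns` and `masked_reconstruction_loss` are hypothetical.

```python
import numpy as np

def mask_columns(X, mask_ratio=0.3, rng=None):
    """Hide a random subset of columns; return the masked table and a boolean column mask."""
    rng = np.random.default_rng() if rng is None else rng
    n_cols = X.shape[1]
    n_masked = max(1, int(round(mask_ratio * n_cols)))
    masked_idx = rng.choice(n_cols, size=n_masked, replace=False)
    col_mask = np.zeros(n_cols, dtype=bool)
    col_mask[masked_idx] = True
    X_masked = X.copy()
    X_masked[:, col_mask] = 0.0  # assumed placeholder value for hidden columns
    return X_masked, col_mask

def masked_reconstruction_loss(X_true, X_pred, col_mask):
    """Mean squared error computed over the masked columns only."""
    diff = X_true[:, col_mask] - X_pred[:, col_mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
X_masked, col_mask = mask_columns(X, mask_ratio=0.4, rng=rng)

# Baseline: "predicting" the zero placeholder for every hidden cell.
baseline_loss = masked_reconstruction_loss(X, X_masked, col_mask)
```

A model trained on this objective would take `X_masked` as input and be scored with `masked_reconstruction_loss` against `X`; beating `baseline_loss` requires using the visible columns to infer the hidden ones.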

The abstract reports results on 190 downstream tabular prediction tasks: the cross-table pretrained XTFormer outperforms both XGBoost and CatBoost on 137 tasks (72%), the deep learning model FT-Transformer on 144 tasks (76%), and the tabular pretraining approach XTab on 162 tasks (85%).

Critical Analysis

The researchers acknowledge several limitations and areas for future work in their paper. For example, they note that the current implementation of Cross-Table Pretraining is computationally expensive, as it requires training on a large number of datasets simultaneously.

Additionally, the paper does not explore the interpretability of the learned representations, which could be an important consideration for real-world applications where model transparency is crucial.

Further research could also investigate how the Cross-Table Pretraining approach might be extended to handle more complex tabular data structures, such as hierarchical or relational data, which are common in many domains.

Overall, the proposed Cross-Table Pretraining method represents a promising step towards developing more powerful and generalizable machine learning models for tabular data, but there is still room for improvement and further exploration.

Conclusion

This paper presents a novel pretraining approach called "Cross-Table Pretraining" that aims to learn a universal function space for heterogeneous tabular data. By leveraging the relationships between columns across multiple datasets, the method can develop a more robust and generalizable representation that can be effectively applied to a wide range of tabular data tasks.

The researchers demonstrate the effectiveness of their approach through extensive experiments, showing that Cross-Table Pretraining outperforms traditional pretraining and fine-tuning techniques. While the method has some limitations, it represents an important step forward in the field of tabular data machine learning and could have significant real-world implications for applications that require working with diverse and complex datasets.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

UniTable: Towards a Unified Framework for Table Recognition via Self-Supervised Pretraining

ShengYun Peng, Aishwarya Chakravarthy, Seongmin Lee, Xiaojing Wang, Rajarajeswari Balasubramaniyan, Duen Horng Chau

Tables convey factual and quantitative data with implicit conventions created by humans that are often challenging for machines to parse. Prior work on table recognition (TR) has mainly centered around complex task-specific combinations of available inputs and tools. We present UniTable, a training framework that unifies both the training paradigm and training objective of TR. Its training paradigm combines the simplicity of purely pixel-level inputs with the effectiveness and scalability empowered by self-supervised pretraining from diverse unannotated tabular images. Our framework unifies the training objectives of all three TR tasks - extracting table structure, cell content, and cell bounding box - into a unified task-agnostic training objective: language modeling. Extensive quantitative and qualitative analyses highlight UniTable's state-of-the-art (SOTA) performance on four of the largest TR datasets. UniTable's table parsing capability has surpassed both existing TR methods and general large vision-language models, e.g., GPT-4o, GPT-4-turbo with vision, and LLaVA. Our code is publicly available at https://github.com/poloclub/unitable, featuring a Jupyter Notebook that includes the complete inference pipeline, fine-tuned across multiple TR datasets, supporting all three TR tasks.

5/28/2024

cs.CV cs.LG

Tokenize features, enhancing tables: the FT-TABPFN model for tabular classification

Quangao Liu, Wei Yang, Chen Liang, Longlong Pang, Zhuozhang Zou

Traditional methods for tabular classification usually rely on supervised learning from scratch, which requires extensive training data to determine model parameters. However, a novel approach called Prior-Data Fitted Networks (TabPFN) has changed this paradigm. TabPFN uses a 12-layer transformer trained on large synthetic datasets to learn universal tabular representations. This method enables fast and accurate predictions on new tasks with a single forward pass and no need for additional training. Although TabPFN has been successful on small datasets, it generally shows weaker performance when dealing with categorical features. To overcome this limitation, we propose FT-TabPFN, which is an enhanced version of TabPFN that includes a novel Feature Tokenization layer to better handle classification features. By fine-tuning it for downstream tasks, FT-TabPFN not only expands the functionality of the original model but also significantly improves its applicability and accuracy in tabular classification. Our full source code is available for community use and development.

6/12/2024

cs.LG cs.AI

CARTE: Pretraining and Transfer for Tabular Learning

Myung Jun Kim, Léo Grinsztajn, Gaël Varoquaux

Pretrained deep-learning models are the go-to solution for images or text. However, for tabular data the standard is still to train tree-based models. Indeed, transfer learning on tables hits the challenge of data integration: finding correspondences, correspondences in the entries (entity matching) where different words may denote the same entity, correspondences across columns (schema matching), which may come in different orders, names... We propose a neural architecture that does not need such correspondences. As a result, we can pretrain it on background data that has not been matched. The architecture -- CARTE for Context Aware Representation of Table Entries -- uses a graph representation of tabular (or relational) data to process tables with different columns, string embedding of entries and columns names to model an open vocabulary, and a graph-attentional network to contextualize entries with column names and neighboring entries. An extensive benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines including the best tree-based models. CARTE also enables joint learning across tables with unmatched columns, enhancing a small table with bigger ones. CARTE opens the door to large pretrained models for tabular data.

6/3/2024

cs.LG

Large Scale Transfer Learning for Tabular Data via Language Modeling

Josh Gardner, Juan C. Perdomo, Ludwig Schmidt

Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 1.6B rows from 3.1M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme for tabular prediction. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g. XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more data. We release our model, code, and data along with the publication of this paper.

6/19/2024

cs.LG cs.AI cs.CL