TabularFM: An Open Framework For Tabular Foundational Models

Read original: arXiv:2406.09837 - Published 6/19/2024 by Quan M. Tran, Suong N. Hoang, Lam M. Nguyen, Dzung Phan, Hoang Thanh Lam

Overview

• This paper introduces TabularFM, an open framework for developing tabular foundational models.

• Tabular foundational models are a new class of large-scale machine learning models that can be applied to a wide range of tabular data tasks.

• The framework aims to facilitate research and development in this emerging field by providing standardized datasets, benchmarks, and model architectures.

Plain English Explanation

The paper presents a new framework called TabularFM that is designed to advance the field of tabular foundational models. Tabular data refers to information organized in rows and columns, like spreadsheets or databases. Foundational models are large, general-purpose AI systems that can be applied to many different tasks.

The key idea behind TabularFM is to create a common set of standardized resources (datasets, benchmarks, and model architectures) that researchers and developers can use to build and evaluate tabular foundational models more easily. This should help accelerate progress in this emerging area of AI by letting teams build on each other's work.

The paper argues that tabular foundational models could have a big impact because tabular data is ubiquitous in the real world, used for everything from financial records to scientific measurements. But developing high-performing models for tabular data has historically been challenging. TabularFM aims to make this process more streamlined and efficient.

Technical Explanation

The paper introduces the TabularFM framework, which provides standardized datasets, benchmarks, and model architectures for developing tabular foundational models.

The framework includes several key components:

• Standardized Tabular Datasets: TabularFM provides a collection of cleaned, real-world tabular datasets covering a range of domains and task types.

• Benchmarking Suite: The framework defines a set of standard evaluation tasks and metrics to assess the performance of tabular foundational models, with results published on leaderboards.

• Model Architectures: TabularFM includes reference model designs based on variations of neural architectures such as GANs, VAEs, and Transformers.

The goal is to provide a common infrastructure to facilitate research and development in this emerging field. By standardizing key elements, TabularFM aims to make it easier for teams to build on each other's work and drive progress in applying large-scale AI models to tabular data problems.
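To make the benchmarking idea concrete, here is a minimal sketch of how per-dataset scores could be aggregated into a leaderboard. It is not TabularFM's actual code or API: the function `build_leaderboard`, the mean-rank aggregation rule, the model and dataset names, and all numbers are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def build_leaderboard(scores):
    """Rank models by their average rank across datasets.

    `scores` maps dataset name -> {model name -> metric value},
    where a higher metric value is assumed to be better.
    Returns a list of (model, mean_rank) pairs, best model first.
    Ties within a dataset are not treated specially in this sketch.
    """
    ranks = defaultdict(list)
    for per_model in scores.values():
        # Order the models on this dataset from best to worst score.
        ordered = sorted(per_model, key=per_model.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks[model].append(position)
    mean_rank = {m: sum(r) / len(r) for m, r in ranks.items()}
    return sorted(mean_rank.items(), key=lambda item: item[1])


# Placeholder values for illustration only; these are NOT results from the paper.
toy_scores = {
    "dataset_1": {"model_a": 0.80, "model_b": 0.70, "model_c": 0.75},
    "dataset_2": {"model_a": 0.60, "model_b": 0.65, "model_c": 0.55},
    "dataset_3": {"model_a": 0.90, "model_b": 0.85, "model_c": 0.70},
}

leaderboard = build_leaderboard(toy_scores)
for model, rank in leaderboard:
    print(f"{model}: mean rank {rank:.2f}")
```

Averaging ranks rather than raw scores is one common way to compare models across datasets whose metrics are on different scales; a real benchmark may well use a different aggregation.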

Critical Analysis

The TabularFM framework addresses an important need in the field of AI and machine learning. Tabular data is ubiquitous in the real world, yet developing high-performing models for these types of structured datasets has historically been challenging.

One potential limitation of the framework is its particular selection of datasets and benchmark tasks. While the authors claim these cover a diverse range of domains, other important real-world tabular data problems may not be represented. Continued expansion and refinement of the included resources will be important.

Additionally, the reference model architectures proposed in the paper, while innovative, may not be the only effective approaches for tabular foundational models. As the field evolves, it will be important for TabularFM to remain flexible and open to incorporating new model designs and techniques.

Overall, the TabularFM framework represents a valuable contribution that could help accelerate progress in applying large-scale AI models to a wide range of tabular data challenges. Continued research and development in this area has the potential to yield significant real-world impacts.

Conclusion

This paper introduces TabularFM, an open framework for developing tabular foundational models, a new class of large-scale AI systems that can be applied to a broad range of structured data problems.

The key innovation of TabularFM is that it provides standardized datasets, benchmarks, and model architectures to facilitate research and development in this emerging field. By creating a common infrastructure, the framework aims to make it easier for teams to build on each other's work and drive progress in applying powerful AI techniques to tabular data challenges.

Tabular data is ubiquitous in the real world, used for everything from financial records to scientific measurements. But developing high-performing models for these structured datasets has historically been difficult. TabularFM represents an important step towards making this process more streamlined and efficient, with the potential for significant real-world impacts across many industries and domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TabularFM: An Open Framework For Tabular Foundational Models

Quan M. Tran, Suong N. Hoang, Lam M. Nguyen, Dzung Phan, Hoang Thanh Lam

Foundational models (FMs), pretrained on extensive datasets using self-supervised techniques, are capable of learning generalized patterns from large amounts of data. This reduces the need for extensive labeled datasets for each new task, saving both time and resources by leveraging the broad knowledge base established during pretraining. Most research on FMs has primarily focused on unstructured data, such as text and images, or semi-structured data, like time-series. However, there has been limited attention to structured data, such as tabular data, which, despite its prevalence, remains under-studied due to a lack of clean datasets and insufficient research on the transferability of FMs for various tabular data tasks. In response to this gap, we introduce a framework called TabularFM, which incorporates state-of-the-art methods for developing FMs specifically for tabular data. This includes variations of neural architectures such as GANs, VAEs, and Transformers. We have curated a million tabular datasets and released cleaned versions to facilitate the development of tabular FMs. We pretrained FMs on this curated data, benchmarked various learning methods on these datasets, and released the pretrained models along with leaderboards for future comparative studies. Our fully open-sourced system provides a comprehensive analysis of the transferability of tabular FMs. By releasing these datasets, pretrained models, and leaderboards, we aim to enhance the validity and usability of tabular FMs in the near future.

6/19/2024

From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models

Xumeng Wen, Han Zhang, Shun Zheng, Wei Xu, Jiang Bian

Tabular data is foundational to predictive modeling in various crucial industries, including healthcare, finance, retail, sustainability, etc. Despite the progress made in specialized models, there is an increasing demand for universal models that can transfer knowledge, generalize from limited data, and follow human instructions. These are challenges that current tabular deep learning approaches have not fully tackled. Here we introduce Generative Tabular Learning (GTL), a novel framework that integrates the advanced functionalities of large language models (LLMs), such as prompt-based zero-shot generalization and in-context learning, into tabular deep learning. GTL capitalizes on the pre-training of LLMs on diverse tabular data, enhancing their understanding of domain-specific knowledge, numerical sequences, and statistical dependencies critical for accurate predictions. Our empirical study spans 384 public datasets, rigorously analyzing GTL's convergence and scaling behaviors and assessing the impact of varied data templates. The GTL-enhanced LLaMA-2 model demonstrates superior zero-shot and in-context learning capabilities across numerous classification and regression tasks. Notably, it achieves this without fine-tuning, outperforming traditional methods and rivaling state-of-the-art models like GPT-4 in certain cases. Through GTL, we not only foster a deeper integration of LLMs' sophisticated abilities into tabular data comprehension and application but also offer a new training resource and a test bed for LLMs to enhance their ability to comprehend tabular data. To facilitate reproducible research, we release our code, data, and model checkpoints at https://github.com/microsoft/Industrial-Foundation-Models.

7/12/2024

Why Tabular Foundation Models Should Be a Research Priority

Boris van Breugel, Mihaela van der Schaar

Recent text and image foundation models are incredibly impressive, and these models are attracting an ever-increasing portion of research resources. In this position piece we aim to shift the ML research community's priorities ever so slightly to a different modality: tabular data. Tabular data is the dominant modality in many fields, yet it is given hardly any research attention and significantly lags behind in terms of scale and power. We believe the time is now to start developing tabular foundation models, or what we coin a Large Tabular Model (LTM). LTMs could revolutionise the way science and ML use tabular data: not as single datasets that are analyzed in a vacuum, but contextualized with respect to related datasets. The potential impact is far-reaching: from few-shot tabular models to automating data science; from out-of-distribution synthetic data to empowering multidisciplinary scientific discovery. We intend to excite reflections on the modalities we study, and convince some researchers to study large tabular models.

6/4/2024

TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes

Aamod Khatiwada, Harsha Kokel, Ibrahim Abdelaziz, Subhajit Chaudhury, Julian Dolby, Oktie Hassanzadeh, Zhenhan Huang, Tejaswini Pedapati, Horst Samulowitz, Kavitha Srinivas

Enterprises have a growing need to identify relevant tables in data lakes; e.g. tables that are unionable, joinable, or subsets of each other. Tabular neural models can be helpful for such data discovery tasks. In this paper, we present TabSketchFM, a neural tabular model for data discovery over data lakes. First, we propose novel pre-training: a sketch-based approach to enhance the effectiveness of data discovery in neural tabular models. Second, we finetune the pretrained model for identifying unionable, joinable, and subset table pairs and show significant improvement over previous tabular neural models. Third, we present a detailed ablation study to highlight which sketches are crucial for which tasks. Fourth, we use these finetuned models to perform table search; i.e., given a query table, find other tables in a corpus that are unionable, joinable, or that are subsets of the query. Our results demonstrate significant improvements in F1 scores for search compared to state-of-the-art techniques. Finally, we show significant transfer across datasets and tasks establishing that our model can generalize across different tasks and over different data lakes.

8/22/2024