PRICE: A Pretrained Model for Cross-Database Cardinality Estimation

Read original: arXiv:2406.01027 - Published 6/4/2024 by Tianjing Zeng, Junwei Lan, Jiahong Ma, Wenqing Wei, Rong Zhu, Pengfei Li, Bolin Ding, Defu Lian, Zhewei Wei, Jingren Zhou


Overview

This paper presents PRICE (a PRetrained multI-table CardEst model), a pretrained machine learning model for estimating the cardinality (result size) of database queries across different databases.
Cardinality estimation is a crucial task in database management, as it helps optimize query execution plans and improve overall performance.
PRICE targets two obstacles that limit the deployment of learned estimators, high preparation cost and poor transferability across databases, by pretraining a single model whose knowledge carries over to new databases.

Plain English Explanation

The paper describes a new machine learning model called PRICE that can estimate the size or 'cardinality' of database queries. Cardinality estimation is important for databases because it helps them figure out the most efficient way to run a query. However, cardinality estimation can be tricky, especially when working with different databases.

PRICE tries to solve this problem by using a 'pretrained' model. This means the model has already been trained on a lot of data, so it can take what it's learned and apply it to new databases, even ones it hasn't seen before. The key idea is to capture general patterns about how database queries work, rather than just memorizing the details of one specific database.

By using a pretrained model, PRICE can provide more accurate cardinality estimates compared to traditional techniques. This can lead to better database performance, as the database can make smarter decisions about how to execute queries. The paper demonstrates the effectiveness of PRICE through experiments on real-world databases.

Technical Explanation

The paper introduces PRICE, a pretrained model for cross-database cardinality estimation. Cardinality estimation is the task of predicting the number of result rows for a given database query, which is crucial for optimizing query execution plans.
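
To make the baseline concrete, the sketch below shows the kind of one-dimensional histogram estimate that traditional optimizers rely on: each column gets its own histogram, per-predicate selectivities are multiplied under an attribute-independence assumption, and the product is scaled by the row count. The function names (`build_histogram`, `range_selectivity`, `estimate_cardinality`), the equi-width binning, and the toy data are illustrative assumptions, not taken from the paper or from any particular database system.

```python
def build_histogram(values, num_bins=10):
    """Build an equi-width histogram; returns (low, high, counts)."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - low) / width), num_bins - 1) if width > 0 else 0
        counts[idx] += 1
    return low, high, counts


def range_selectivity(hist, lo, hi):
    """Estimate the fraction of rows with lo <= x <= hi, assuming values
    are spread uniformly inside each bin."""
    low, high, counts = hist
    width = (high - low) / len(counts)
    if width == 0:  # the column holds a single distinct value
        return 1.0 if lo <= low <= hi else 0.0
    selected = 0.0
    for i, count in enumerate(counts):
        bin_lo = low + i * width
        bin_hi = bin_lo + width
        overlap = max(0.0, min(hi, bin_hi) - max(lo, bin_lo))
        selected += count * overlap / width
    return selected / sum(counts)


def estimate_cardinality(num_rows, selectivities):
    """Multiply per-column selectivities (independence assumption)."""
    estimate = float(num_rows)
    for s in selectivities:
        estimate *= s
    return estimate


# Toy table with two perfectly correlated columns (b always equals a).
col_a = [i % 100 for i in range(1000)]
col_b = list(col_a)

sel_a = range_selectivity(build_histogram(col_a), 0, 49.5)
sel_b = range_selectivity(build_histogram(col_b), 0, 49.5)
estimated = estimate_cardinality(len(col_a), [sel_a, sel_b])
actual = sum(1 for a, b in zip(col_a, col_b) if a <= 49.5 and b <= 49.5)
print(f"estimated={estimated:.1f} actual={actual}")
```

On this toy table the independence assumption yields roughly half the true count, because both predicates select the same rows. Errors of this kind are what learned estimators such as PRICE aim to reduce.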

PRICE is a pretrained neural model that can be applied directly to a database it has never seen and, optionally, fine-tuned on queries from that target database. The model takes as input low-level, transferable features describing data distributions and query information, applies self-attention to them, and outputs a cardinality estimate. The key innovations of PRICE include:

Pretraining on a broad set of databases: The base PRICE model is pretrained on 30 diverse datasets (about 5 hours of pretraining, yielding a model of roughly 40MB), allowing it to capture general patterns about how queries behave across different schemas and data distributions.
Efficient fine-tuning: PRICE can be fine-tuned on a new target database using only a small volume of database-specific queries, thanks to the strong initial pretraining.
Robust to database changes: PRICE remains applicable under data updates, data scaling, and query workload shifts in the target database.
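
One way to read "general patterns across different schemas" is that the model's inputs must not depend on table names, column names, or value scales. The sketch below illustrates that idea with a hypothetical feature layout: a column and a range predicate are encoded as a fixed-length vector of normalized histogram fractions plus per-bin predicate coverage. The function `column_features`, the bin count, and the layout are assumptions made for illustration; they are not PRICE's actual features.

```python
def column_features(values, lo, hi, num_bins=8):
    """Encode one column and one predicate lo <= x <= hi as a
    fixed-length, scale-free vector (hypothetical layout)."""
    v_min, v_max = min(values), max(values)
    span = (v_max - v_min) or 1.0  # avoid division by zero

    # Part 1: fraction of rows per bin after rescaling values to [0, 1].
    fractions = [0.0] * num_bins
    for v in values:
        pos = (v - v_min) / span
        idx = min(int(pos * num_bins), num_bins - 1)
        fractions[idx] += 1.0 / len(values)

    # Part 2: share of each bin's width covered by the predicate range.
    p_lo = (lo - v_min) / span
    p_hi = (hi - v_min) / span
    coverage = []
    for i in range(num_bins):
        b_lo = i / num_bins
        b_hi = (i + 1) / num_bins
        overlap = max(0.0, min(p_hi, b_hi) - max(p_lo, b_lo))
        coverage.append(overlap * num_bins)

    return fractions + coverage


# Two columns from "different databases": same shape, different units.
ages = list(range(100))
prices = [v * 250 + 1000 for v in ages]

feat_ages = column_features(ages, 0, 49.5)
feat_prices = column_features(prices, 1000, 49.5 * 250 + 1000)
print(len(feat_ages), feat_ages == feat_prices)
```

Because the encoding only looks at relative positions within each column's range, two columns with the same distribution shape map to matching vectors regardless of their names or units. A model trained on such inputs has no database-specific vocabulary to memorize, which is the property that makes cross-database transfer plausible.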

The paper evaluates PRICE on a variety of real-world databases, showing that it outperforms traditional cardinality estimation techniques as well as recently proposed learned models. PRICE demonstrates strong cross-database generalization, even when the target database has very different characteristics from the pretraining data.
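
The text above does not say which accuracy metric the evaluation uses. A common choice in the cardinality estimation literature is Q-error, the factor by which an estimate is off in either direction; the sketch below assumes that metric purely for illustration. The helper names (`q_error`, `summarize`) and the sample numbers are hypothetical and are not results from the paper.

```python
from statistics import median


def q_error(estimate, truth):
    """Return max(est/true, true/est), clamping both values to at least 1
    so that zero-row results do not cause a division by zero."""
    est = max(float(estimate), 1.0)
    tru = max(float(truth), 1.0)
    return max(est / tru, tru / est)


def summarize(pairs):
    """Summarize Q-errors over (estimate, truth) pairs."""
    errors = sorted(q_error(e, t) for e, t in pairs)
    return {"median": median(errors), "max": errors[-1]}


# Made-up (estimate, truth) pairs, for demonstration only.
pairs = [(250, 500), (100, 100), (0, 10), (4000, 1000)]
summary = summarize(pairs)
print(summary)
```

Q-error treats over- and under-estimation symmetrically, so an estimate that is half the true value scores the same as one that is double it. A perfect estimate scores 1.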

Critical Analysis

The PRICE paper makes a compelling case for the benefits of a pretrained cross-database cardinality estimation model. By capturing general patterns about how queries work, PRICE can provide more accurate estimates than traditional techniques, which often struggle to generalize across different databases.

One potential limitation is that the pretraining process requires a large and diverse set of databases, which may not always be available. The paper acknowledges this and discusses techniques for efficient fine-tuning, but the reliance on extensive pretraining data could be a practical hurdle in some scenarios.

Additionally, while PRICE demonstrates strong cross-database performance, the paper does not deeply explore the model's robustness to unusual or adversarial database queries. It would be valuable to understand how PRICE behaves under edge cases or queries that deviate significantly from the training data.

Overall, the PRICE approach represents an important step forward in cardinality estimation, with the potential to significantly improve database performance across a wide range of applications. As the authors note, further research into transfer learning and cross-database knowledge sharing could lead to even more powerful techniques in this area.

Conclusion

The PRICE paper presents a novel pretrained model for cross-database cardinality estimation, a crucial task in database management. By leveraging a large and diverse set of pretraining data, PRICE can capture general patterns about how queries work and provide more accurate cardinality estimates compared to traditional techniques.

A key strength of PRICE is that it applies directly to unseen databases and can additionally be fine-tuned on a new target database, allowing it to adapt to different schemas and data distributions while maintaining strong performance. This cross-database generalization capability could lead to significant improvements in database query optimization and overall system performance.

The paper's thorough experimental evaluation and analysis of PRICE's strengths and limitations provide a solid foundation for further research in this area. As databases continue to grow in complexity and scale, techniques like PRICE that can bridge the gap between different systems will become increasingly valuable for maintaining efficient and responsive data management.


Related Papers

PRICE: A Pretrained Model for Cross-Database Cardinality Estimation

Tianjing Zeng, Junwei Lan, Jiahong Ma, Wenqing Wei, Rong Zhu, Pengfei Li, Bolin Ding, Defu Lian, Zhewei Wei, Jingren Zhou

Cardinality estimation (CardEst) is essential for optimizing query execution plans. Recent ML-based CardEst methods achieve high accuracy but face deployment challenges due to high preparation costs and lack of transferability across databases. In this paper, we propose PRICE, a PRetrained multI-table CardEst model, which addresses these limitations. PRICE takes low-level but transferable features w.r.t. data distributions and query information and elegantly applies self-attention models to learn meta-knowledge to compute cardinality in any database. It is generally applicable to any unseen new database to attain high estimation accuracy, while its preparation cost is as little as the basic one-dimensional histogram-based CardEst methods. Moreover, PRICE can be finetuned to further enhance its performance on any specific database. We pretrained PRICE using 30 diverse datasets, completing the process in about 5 hours with a resulting model size of only about 40MB. Evaluations show that PRICE consistently outperforms existing methods, achieving the highest estimation accuracy on several unseen databases and generating faster execution plans with lower overhead. After finetuning with a small volume of database-specific queries, PRICE could even find plans very close to the optimal ones. Meanwhile, PRICE is generally applicable to different settings such as data updates, data scaling, and query workload shifts. We have made all of our data and codes publicly available at https://github.com/StCarmen/PRICE.

6/4/2024

CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Ozcan

Cardinality estimation is crucial for enabling high query performance in relational databases. Recently learned cardinality estimation models have been proposed to improve accuracy but there is no systematic benchmark or datasets which allows researchers to evaluate the progress made by new learned approaches and even systematically develop new learned approaches. In this paper, we are releasing a benchmark, containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used for training and testing learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: 1-) instance-based, 2-) zero-shot, and 3-) fine-tuned. Our results show that while we get promising results for zero-shot cardinality estimation on simple single table queries; as soon as we add joins, the accuracy drops. However, we show that with fine-tuning, we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance specific models. We are open sourcing our scripts to collect statistics, generate queries and training datasets to foster more extensive research, also from the ML community on the important problem of cardinality estimation and in particular improve on recent directions such as pre-trained cardinality estimation.

8/30/2024

Cardinality Estimation on Hyper-relational Knowledge Graphs

Fei Teng, Haoyang Li, Shimin Di, Lei Chen

Cardinality Estimation (CE) for a query is to estimate the number of results without execution, which is an effective index in query optimization. Recently, CE has achieved great success over knowledge graphs (KGs) that consist of triple facts. To more precisely represent facts, current researchers propose hyper-relational KGs (HKGs) to represent a triple fact with qualifiers, where qualifiers provide additional context to the fact. However, existing CE methods over KGs achieve unsatisfying performance on HKGs due to the complexity of qualifiers in HKGs. Also, there is only one dataset for HKG query cardinality estimation, i.e., WD50K-QE, which is not comprehensive and only covers limited patterns. The lack of querysets over HKG also becomes a bottleneck to comprehensively investigate CE problems on HKGs. In this work, we first construct diverse and unbiased hyper-relational querysets over three popular HKGs for investigating CE. Besides, we also propose a novel qualifier-attached graph neural network (GNN) model that effectively incorporates qualifier information and adaptively combines outputs from multiple GNN layers, to accurately predict the cardinality. Our experiments illustrate that the proposed hyper-relational query encoder outperforms all state-of-the-art CE methods over three popular HKGs on the diverse and unbiased benchmark.

5/27/2024

CARTE: Pretraining and Transfer for Tabular Learning

Myung Jun Kim, Léo Grinsztajn, Gaël Varoquaux

Pretrained deep-learning models are the go-to solution for images or text. However, for tabular data the standard is still to train tree-based models. Indeed, transfer learning on tables hits the challenge of data integration: finding correspondences, correspondences in the entries (entity matching) where different words may denote the same entity, correspondences across columns (schema matching), which may come in different orders, names... We propose a neural architecture that does not need such correspondences. As a result, we can pretrain it on background data that has not been matched. The architecture -- CARTE for Context Aware Representation of Table Entries -- uses a graph representation of tabular (or relational) data to process tables with different columns, string embedding of entries and columns names to model an open vocabulary, and a graph-attentional network to contextualize entries with column names and neighboring entries. An extensive benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines including the best tree-based models. CARTE also enables joint learning across tables with unmatched columns, enhancing a small table with bigger ones. CARTE opens the door to large pretrained models for tabular data.

6/3/2024