CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

Read original: arXiv:2408.16170 - Published 8/30/2024 by Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Ozcan

Overview

This paper introduces CardBench, a new benchmark for evaluating learned cardinality estimation methods in relational databases.
Cardinality estimation is a critical component of query optimization in databases, and learned approaches have shown promise in improving accuracy over traditional techniques.
The CardBench benchmark provides a standardized dataset and evaluation framework to facilitate research and development in this area.

Plain English Explanation

When you query a relational database, the database management system needs to estimate how many rows or results will be returned. This cardinality estimation is a crucial step in optimizing the query plan and improving performance. Traditional cardinality estimation techniques can be inaccurate, especially for complex queries.
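To make the idea concrete, here is a minimal sketch of why a traditional estimate can miss. It assumes a toy table with made-up values and a textbook estimator that multiplies per-column selectivities as if the columns were independent; neither the data nor the function names come from the paper.

```python
# Toy table: each row is (city, age_group). All values are made up for illustration.
rows = [
    ("Paris", "young"), ("Paris", "young"), ("Paris", "young"),
    ("Paris", "old"),
    ("Rome", "old"), ("Rome", "old"), ("Rome", "old"),
    ("Rome", "young"),
]


def true_cardinality(rows, city, age_group):
    """Count the rows that actually match both predicates."""
    return sum(1 for c, a in rows if c == city and a == age_group)


def independence_estimate(rows, city, age_group):
    """Estimate the count by multiplying per-column selectivities.

    This treats the two columns as statistically independent, which is a
    common simplifying assumption and is wrong for correlated data.
    """
    n = len(rows)
    sel_city = sum(1 for c, _ in rows if c == city) / n
    sel_age = sum(1 for _, a in rows if a == age_group) / n
    return n * sel_city * sel_age


actual = true_cardinality(rows, "Paris", "young")
estimate = independence_estimate(rows, "Paris", "young")
print(actual, estimate)  # prints: 3 2.0
```

Because city and age group are correlated in this toy data, the independence-based estimate undercounts the matching rows. Learned estimators aim to capture such correlations from data.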

Recent research has explored using machine learning models to provide more accurate cardinality estimates. However, evaluating and comparing these learned approaches has been challenging due to the lack of a standardized benchmark.

The authors introduce CardBench, a benchmark designed for evaluating learned cardinality estimation methods. According to the paper's abstract, it contains thousands of queries over 20 distinct real-world databases, together with ground truth cardinality information, and can be used for both training and testing learned models. This lets researchers and practitioners test and compare different machine learning models for cardinality estimation on common ground.

By providing a common benchmark, CardBench aims to accelerate progress in this important area of database research and facilitate the development of more accurate cardinality estimation techniques.

Technical Explanation

The key contributions of this paper are:

CardBench Dataset: The benchmark contains thousands of queries over 20 distinct real-world databases, along with their true cardinality values. The authors describe it as much more diverse than earlier benchmarks and usable for both training and testing learned models.
CardBench Evaluation: The paper studies learned cardinality estimation in three setups: instance-based (a model trained for one specific database), zero-shot (a model applied to an unseen database), and fine-tuned (a pre-trained model adapted to the target database). Estimates are scored against the true cardinalities, for simple single-table queries as well as queries with joins.
Benchmark Baseline: The authors train GNN-based and transformer-based models as reference points. Zero-shot results are promising on simple single-table queries, but accuracy drops as soon as joins are added. Fine-tuning still makes pre-trained models usable, with significantly lower training overhead than instance-specific models.
Extensibility: CardBench is designed to be easily extensible, allowing researchers to add new queries, datasets, and evaluation metrics as the field progresses.
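As a rough illustration of how an estimate can be scored against the true cardinality, the sketch below computes relative error and Q-error. Q-error is a metric widely used in the cardinality estimation literature; its use here is an assumption for illustration and not a statement of CardBench's exact evaluation protocol. The function names and sample numbers are hypothetical.

```python
import statistics


def relative_error(estimate, truth):
    """Absolute deviation from the truth, as a fraction of the truth."""
    if truth <= 0:
        raise ValueError("relative error is undefined for a non-positive truth")
    return abs(estimate - truth) / truth


def q_error(estimate, truth):
    """Factor by which the estimate is off, treating over- and under-estimates alike.

    Both values are clamped to at least 1 so empty results do not divide by zero.
    """
    est = max(float(estimate), 1.0)
    tru = max(float(truth), 1.0)
    return max(est / tru, tru / est)


# Hypothetical (estimate, truth) pairs for three queries.
pairs = [(100, 200), (400, 200), (50, 50)]

rel_errors = [relative_error(e, t) for e, t in pairs]
q_errors = [q_error(e, t) for e, t in pairs]
median_q = statistics.median(q_errors)

print(rel_errors)  # prints: [0.5, 1.0, 0.0]
print(q_errors)    # prints: [2.0, 2.0, 1.0]
```

Note that halving and doubling the true value give different relative errors (0.5 and 1.0) but the same Q-error (2.0), which is one reason a ratio-based metric is often preferred when estimates span several orders of magnitude.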

The CardBench benchmark aims to facilitate the development and adoption of more accurate cardinality estimation techniques, ultimately leading to improved query optimization and performance in relational databases.
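The paper's abstract (reproduced under Related Papers) describes three training setups: instance-based, zero-shot, and fine-tuned. The sketch below shows one plausible way to split queries for each setup. The function `split_for_setup`, its parameters, and the database and query names are hypothetical and are not taken from the CardBench scripts.

```python
def split_for_setup(queries_by_db, target_db, setup, n_test=2, n_finetune=1):
    """Split queries into pretraining, training, and test sets for one target database.

    The last `n_test` queries of the target database are held out for testing
    in every setup, so the setups can be compared on the same queries.
    """
    target = queries_by_db[target_db]
    test = target[-n_test:]
    seen_target = target[:-n_test]
    # Queries from every database except the target one.
    others = [q for db, qs in queries_by_db.items() if db != target_db for q in qs]

    if setup == "instance":
        # Train only on the target database itself.
        return {"pretrain": [], "train": seen_target, "test": test}
    if setup == "zero-shot":
        # Train only on other databases; the target database is unseen.
        return {"pretrain": others, "train": [], "test": test}
    if setup == "fine-tuned":
        # Pretrain on other databases, then adapt with a few target queries.
        return {"pretrain": others, "train": seen_target[:n_finetune], "test": test}
    raise ValueError(f"unknown setup: {setup!r}")


# Hypothetical query identifiers grouped by database.
queries_by_db = {
    "db_a": ["a1", "a2", "a3", "a4"],
    "db_b": ["b1", "b2", "b3"],
    "db_c": ["c1", "c2"],
}

zero_shot = split_for_setup(queries_by_db, "db_a", "zero-shot")
fine_tuned = split_for_setup(queries_by_db, "db_a", "fine-tuned")
instance = split_for_setup(queries_by_db, "db_a", "instance")
```

In the zero-shot split no query from `db_a` is used for training, while the fine-tuned split adds a small number of target queries on top of the same pretraining data.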

Critical Analysis

The authors acknowledge several limitations of the CardBench benchmark:

The dataset, while diverse, may not fully capture the entire spectrum of real-world database queries and schema characteristics.
The evaluation metrics, while comprehensive, may not capture all aspects of cardinality estimation performance that are relevant to specific applications or use cases.
The baseline models provided may not represent the absolute state-of-the-art, and new approaches may outperform them.

Additionally, the paper's results indicate that multi-table queries remain the harder case: zero-shot estimation is promising on simple single-table queries, but accuracy drops as soon as joins are added unless the pre-trained model is fine-tuned.

Future research could explore expanding the benchmark to include more diverse query types, schema characteristics, and evaluation metrics to further improve its utility and representativeness.

Conclusion

The CardBench benchmark provides a valuable tool for advancing the field of learned cardinality estimation in relational databases. By offering a standardized dataset and evaluation framework, the authors hope to accelerate research and development in this critical area of query optimization.

The availability of CardBench can help researchers and practitioners benchmark their techniques, compare them to state-of-the-art approaches, and identify areas for further improvement. This, in turn, can lead to more accurate cardinality estimation models, which can significantly enhance the performance and efficiency of database management systems.

This summary was produced with help from an AI and may contain inaccuracies. Consult the original paper (arXiv:2408.16170) for details.

Related Papers

CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Ozcan

Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset that allows researchers to evaluate the progress made by new learned approaches and even systematically develop new learned approaches. In this paper, we are releasing a benchmark, containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used for training and testing learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: (1) instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that while we get promising results for zero-shot cardinality estimation on simple single-table queries, the accuracy drops as soon as we add joins. However, we show that with fine-tuning, we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance-specific models. We are open-sourcing our scripts to collect statistics, generate queries and training datasets to foster more extensive research, also from the ML community, on the important problem of cardinality estimation, and in particular to improve on recent directions such as pre-trained cardinality estimation.

8/30/2024

PRICE: A Pretrained Model for Cross-Database Cardinality Estimation

Tianjing Zeng, Junwei Lan, Jiahong Ma, Wenqing Wei, Rong Zhu, Pengfei Li, Bolin Ding, Defu Lian, Zhewei Wei, Jingren Zhou

Cardinality estimation (CardEst) is essential for optimizing query execution plans. Recent ML-based CardEst methods achieve high accuracy but face deployment challenges due to high preparation costs and lack of transferability across databases. In this paper, we propose PRICE, a PRetrained multI-table CardEst model, which addresses these limitations. PRICE takes low-level but transferable features w.r.t. data distributions and query information and elegantly applies self-attention models to learn meta-knowledge to compute cardinality in any database. It is generally applicable to any unseen new database to attain high estimation accuracy, while its preparation cost is as little as the basic one-dimensional histogram-based CardEst methods. Moreover, PRICE can be finetuned to further enhance its performance on any specific database. We pretrained PRICE using 30 diverse datasets, completing the process in about 5 hours with a resulting model size of only about 40MB. Evaluations show that PRICE consistently outperforms existing methods, achieving the highest estimation accuracy on several unseen databases and generating faster execution plans with lower overhead. After finetuning with a small volume of database-specific queries, PRICE could even find plans very close to the optimal ones. Meanwhile, PRICE is generally applicable to different settings such as data updates, data scaling, and query workload shifts. We have made all of our data and code publicly available at https://github.com/StCarmen/PRICE.

6/4/2024

RelBench: A Benchmark for Deep Learning on Relational Databases

Joshua Robinson, Rishabh Ranjan, Weihua Hu, Kexin Huang, Jiaqi Han, Alejandro Dobles, Matthias Fey, Jan E. Lenssen, Yiwen Yuan, Zecheng Zhang, Xinwei He, Jure Leskovec

We present RelBench, a public benchmark for solving predictive tasks over relational databases with graph neural networks. RelBench provides databases and tasks spanning diverse domains and scales, and is intended to be a foundational infrastructure for future research. We use RelBench to conduct the first comprehensive study of Relational Deep Learning (RDL) (Fey et al., 2024), which combines graph neural network predictive models with (deep) tabular models that extract initial entity-level representations from raw tables. End-to-end learned RDL models fully exploit the predictive signal encoded in primary-foreign key links, marking a significant shift away from the dominant paradigm of manual feature engineering combined with tabular models. To thoroughly evaluate RDL against this prior gold-standard, we conduct an in-depth user study where an experienced data scientist manually engineers features for each task. In this study, RDL learns better models whilst reducing human work needed by more than an order of magnitude. This demonstrates the power of deep learning for solving predictive tasks over relational databases, opening up many new research opportunities enabled by RelBench.

7/30/2024

Cardinality Estimation on Hyper-relational Knowledge Graphs

Fei Teng, Haoyang Li, Shimin Di, Lei Chen

Cardinality Estimation (CE) for a query estimates the number of results without execution, which is an effective index in query optimization. Recently, CE has achieved great success over knowledge graphs (KGs) that consist of triple facts. To more precisely represent facts, researchers have proposed hyper-relational KGs (HKGs), which represent a triple fact with qualifiers that provide additional context to the fact. However, existing CE methods over KGs achieve unsatisfying performance on HKGs due to the complexity of qualifiers in HKGs. Also, there is only one dataset for HKG query cardinality estimation, i.e., WD50K-QE, which is not comprehensive and only covers limited patterns. The lack of querysets over HKGs also becomes a bottleneck to comprehensively investigating CE problems on HKGs. In this work, we first construct diverse and unbiased hyper-relational querysets over three popular HKGs for investigating CE. Besides, we also propose a novel qualifier-attached graph neural network (GNN) model that effectively incorporates qualifier information and adaptively combines outputs from multiple GNN layers, to accurately predict the cardinality. Our experiments illustrate that the proposed hyper-relational query encoder outperforms all state-of-the-art CE methods over three popular HKGs on the diverse and unbiased benchmark.

5/27/2024