Do Text-to-Vis Benchmarks Test Real Use of Visualisations?

Read original: arXiv:2407.19726 - Published 8/16/2024 by Hy Nguyen, Xuefei He, Andrew Reeson, Cecile Paris, Josiah Poon, Jonathan K. Kummerfeld

Overview

This paper investigates whether current text-to-visualization benchmarks accurately reflect how visualizations are used in the real world.
The authors run an empirical study that compares existing benchmark datasets with visualization code from public repositories.
They find a substantial gap: the benchmarks do not test the same distribution of chart types, attributes, and number of actions that appears in practice.

Plain English Explanation

The paper asks whether the benchmarks used to test text-to-visualization models really capture how people use visualizations in practice. To find out, the authors ran an empirical study comparing the benchmark datasets with visualization code from public repositories, which serves as a record of what people actually do.

The key finding is a substantial gap: the benchmarks do not test the same distribution of chart types, attributes, and number of actions found in real code. The only dataset the authors found to be representative would still need modification to serve as an end-to-end, practical benchmark.

Technical Explanation

The paper begins by surveying related work on visualization benchmarks, such as VisEval and ChartBench. It notes that there are relatively few text-to-visualization benchmarks and that it is unknown whether those that exist are representative of what people do in practice.

To investigate this gap, the authors compared the benchmark datasets with visualization code drawn from public repositories. They analyzed the chart types, attributes, and number of actions that appear in each.
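As a rough illustration of how visualization types might be tallied from real-world material, the sketch below assumes the material is Python source that calls Matplotlib-style plotting methods. It is not the authors' pipeline: the `PLOT_CALLS` mapping and the `count_chart_types` helper are hypothetical names introduced here, and matching on method names alone is a crude heuristic.

```python
import ast
from collections import Counter

# Hypothetical mapping from plotting-method names to chart types.
# The method names are common Matplotlib ones; the selection is an assumption.
PLOT_CALLS = {
    "plot": "line",
    "bar": "bar",
    "barh": "bar",
    "scatter": "scatter",
    "hist": "histogram",
    "pie": "pie",
    "boxplot": "box",
}


def count_chart_types(source: str) -> Counter:
    """Count chart types by finding attribute calls such as `ax.bar(...)`."""
    counts = Counter()
    # ast.parse only parses the text; the analysed code is never executed.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            chart_type = PLOT_CALLS.get(node.func.attr)
            if chart_type is not None:
                counts[chart_type] += 1
    return counts


example = """
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.bar(["a", "b"], [1, 2])
ax.plot([0, 1], [0, 1])
plt.scatter([1, 2], [3, 4])
"""

print(count_chart_types(example))
```

Because the match is on the method name only, any unrelated object with a `.plot` or `.bar` method would also be counted, so a real analysis would need to resolve what each call refers to.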

The results show a substantial gap: the benchmarks do not test the same distribution of chart types, attributes, and number of actions as real-world code. The only dataset found to be representative requires modification before it can serve as an end-to-end, practical benchmark. The authors argue that new benchmarks are needed to support systems that address users' actual visualization needs.
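One way to make a gap of this kind concrete is to compare how often each chart type appears in a benchmark with how often it appears in real-world usage. The sketch below is an illustration only, not the authors' analysis: the chart-type counts are made-up placeholder values, and total variation distance is an assumed choice of metric (0 means identical distributions, 1 means no overlap).

```python
from collections import Counter


def normalise(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalise empty counts")
    return {key: value / total for key, value in counts.items()}


def total_variation(p_counts, q_counts):
    """Half the sum of absolute differences between two distributions."""
    p, q = normalise(p_counts), normalise(q_counts)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Placeholder counts, invented for illustration only.
benchmark = Counter({"bar": 70, "line": 20, "pie": 10})
real_world = Counter({"line": 50, "scatter": 30, "bar": 20})

print(f"{total_variation(benchmark, real_world):.2f}")
```

With these placeholder counts the distance is 0.6. The same comparison could be repeated for attributes or for the number of actions per program, given counts for each.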

Critical Analysis

The paper raises important points about the limitations of current text-to-visualization benchmarks. By highlighting the gap between benchmark tasks and real-world visualization usage, it suggests that these benchmarks may not accurately assess the capabilities of models for practical applications.

However, the paper does not delve deeply into the specific challenges or tradeoffs involved in designing more representative benchmarks. It also does not discuss potential biases or limitations in the authors' own data collection process.

Further research could explore ways to bridge the gap between benchmarks and real-world usage, such as incorporating more diverse task types, interaction patterns, and visualization types into benchmark design. Longitudinal studies on how people use visualizations in different contexts could also provide valuable insights.

Conclusion

This paper demonstrates that current text-to-visualization benchmarks do not fully capture the real-world usage of visualizations. By comparing benchmark datasets with code from public repositories, the authors highlight the need for more representative evaluation of text-to-visualization systems.

The findings suggest that future benchmark creation should focus on the features that matter to users in practice, such as the chart types, attributes, and number of actions seen in real code. This shift could lead to more impactful and practical advancements in the field of data visualization.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Do Text-to-Vis Benchmarks Test Real Use of Visualisations?

Hy Nguyen, Xuefei He, Andrew Reeson, Cecile Paris, Josiah Poon, Jonathan K. Kummerfeld

Large language models are able to generate code for visualisations in response to user requests. This is a useful application, and an appealing one for NLP research because plots of data provide grounding for language. However, there are relatively few benchmarks, and it is unknown whether those that exist are representative of what people do in practice. This paper aims to answer that question through an empirical study comparing benchmark datasets and code from public repositories. Our findings reveal a substantial gap in datasets, with evaluations not testing the same distribution of chart types, attributes, and the number of actions. The only representative dataset requires modification to become an end-to-end and practical benchmark. This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs. These observations will guide future data creation, highlighting which features hold genuine significance for users.

8/16/2024

PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation

Aneta Pawelec, Victoria Sara Wesołowska, Zuzanna Bączek, Piotr Sankowski

The ability of large language models (LLMs) to interpret visual representations of data is crucial for advancing their application in data analysis and decision-making processes. This paper presents a novel synthetic dataset designed to evaluate the proficiency of LLMs in interpreting various forms of data visualizations, including plots like time series, histograms, violins, boxplots, and clusters. Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios. We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models like ChatGPT or Gemini, assessing their understanding and interpretative accuracy. To ensure data integrity, our benchmark dataset is generated automatically, making it entirely new and free from prior exposure to the models being tested. This strategy allows us to evaluate the models' ability to truly interpret and understand the data, eliminating the possibility of pre-learned responses, and allowing for an unbiased evaluation of the models' capabilities. We also introduce quantitative metrics to assess the performance of the models, providing a robust and comprehensive evaluation tool. Benchmarking several state-of-the-art LLMs with this dataset reveals varying degrees of success, highlighting specific strengths and weaknesses in interpreting diverse types of visual data. The results provide valuable insights into the current capabilities of LLMs and identify key areas for improvement. This work establishes a foundational benchmark for future research and development aimed at enhancing the visual interpretative abilities of language models. In the future, improved LLMs with robust visual interpretation skills can significantly aid in automated data analysis, scientific research, educational tools, and business intelligence applications.

9/5/2024

VisEval: A Benchmark for Data Visualization in the Era of Large Language Models

Nan Chen, Yuge Zhang, Jiahang Xu, Kan Ren, Yuqing Yang

Translating natural language to visualization (NL2VIS) has shown great promise for visual data analysis, but it remains a challenging task that requires multiple low-level implementations, such as natural language processing and visualization design. Recent advancements in pre-trained large language models (LLMs) are opening new avenues for generating visualizations from natural language. However, the lack of a comprehensive and reliable benchmark hinders our understanding of LLMs' capabilities in visualization generation. In this paper, we address this gap by proposing a new NL2VIS benchmark called VisEval. Firstly, we introduce a high-quality and large-scale dataset. This dataset includes 2,524 representative queries covering 146 databases, paired with accurately labeled ground truths. Secondly, we advocate for a comprehensive automated evaluation methodology covering multiple dimensions, including validity, legality, and readability. By systematically scanning for potential issues with a number of heterogeneous checkers, VisEval provides reliable and trustworthy evaluation outcomes. We run VisEval on a series of state-of-the-art LLMs. Our evaluation reveals prevalent challenges and delivers essential insights for future advancements.

8/9/2024

ChartBench: A Benchmark for Complex Visual Reasoning in Charts

Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, Jian Guo

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation. However, current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics. To address this, we propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning. ChartBench includes 42 categories, 66.6k charts, and 600k question-answer pairs. Notably, many charts lack data point annotations, which requires MLLMs to derive values similar to human understanding by leveraging inherent chart elements such as color, legends, and coordinate systems. We also design an enhanced evaluation metric, Acc+, to evaluate MLLMs without extensive manual or costly LLM-based evaluations. Furthermore, we propose two baselines based on the chain of thought and supervised fine-tuning to improve model performance on unannotated charts. Extensive experimental evaluations of 18 open-sourced and 3 proprietary MLLMs reveal their limitations in chart comprehension and offer valuable insights for further research. Code and dataset are publicly available at https://chartbench.github.io.

6/21/2024