SS-Bench: A Benchmark for Social Story Generation and Evaluation

Read original: arXiv:2406.15695 - Published 9/10/2024 by Yi Feng, Mingyang Song, Jiaqi Wang, Zhuang Chen, Guanqun Bi, Minlie Huang, Liping Jing, Jian Yu

SS-Bench: A Benchmark for Social Story Generation and Evaluation

Overview

This paper introduces SS-Bench, a new benchmark for evaluating the performance of language models in generating and understanding social stories.
Social stories are narratives that involve human relationships, emotions, and social interactions - an important aspect of language understanding and generation.
The benchmark includes a diverse dataset of social stories, as well as automated and human-based evaluation metrics to assess the quality, coherence, and social awareness of generated stories.
The goal is to provide a standardized way to measure the progress of language models in this socially-aware narrative generation task.

Plain English Explanation

The researchers have created a new benchmarking tool called SS-Bench to test how well AI language models can generate and understand "social stories" - narratives that involve human relationships, emotions, and social interactions. This is an important part of language understanding, as being able to communicate about social situations is crucial for human-like communication.

The benchmark includes a large dataset of diverse social stories, as well as ways to automatically and manually evaluate the quality, coherence, and social awareness of stories generated by AI models. This gives researchers a standardized way to measure how well different language models perform at this socially-aware narrative generation task.

The hope is that by having this benchmark, researchers can better track the progress of AI models in this area and work towards developing language technologies that can engage in more natural, human-like conversations.

Technical Explanation

The key elements of the SS-Bench benchmark are:

Dataset: The researchers curated a diverse dataset of over 20,000 social stories, gathered from online writing communities. These stories cover a range of topics, characters, and social dynamics.
Evaluation Metrics: The benchmark includes both automated metrics (e.g. coherence, sentiment, social awareness) and human evaluation to assess the quality of generated stories. This allows for a multifaceted assessment of the models' performance.
Standardized Evaluation: By providing a common dataset and evaluation framework, SS-Bench enables fair and consistent comparisons between different language models and their social story generation capabilities.

The researchers tested several large language models on the SS-Bench and found that while the models could generate plausible stories, they struggled to capture the nuanced social dynamics and emotional intelligence required for truly engaging social narratives. This highlights the need for further advancements in socially-aware language understanding and generation.

Critical Analysis

The SS-Bench benchmark is a valuable contribution to the field, as it addresses an important gap in the evaluation of language models' abilities to understand and generate socially-aware narratives. However, some potential limitations and areas for further research include:

Benchmark Scope: While the dataset covers a wide range of social stories, it may not fully represent the diversity of real-world social interactions and narratives. Expanding the dataset or creating multiple benchmarks could further strengthen the evaluation.
Evaluation Metrics: The automated metrics, while useful, may not fully capture the subjective and contextual nature of social intelligence. Incorporating more human-centric evaluation methods could provide additional insights.
Generalization: The benchmark focuses on the task of social story generation. Assessing how well the models' social understanding generalizes to other language tasks, such as StructBench or NewsBench, could yield valuable insights.
Ethical Considerations: As with any language modeling task, there are potential issues around bias, fairness, and the societal impact of these models. The authors could explore ways to incorporate ethical considerations into the benchmark design and evaluation.

Conclusion

The SS-Bench benchmark is a significant step forward in the evaluation of language models' social intelligence and narrative generation capabilities. By providing a standardized dataset and evaluation framework, it enables researchers to track progress and identify areas for improvement in this crucial aspect of human-like language understanding and generation.

As AI systems become more integrated into our daily lives, the ability to engage in socially-aware and emotionally intelligent communication will be increasingly important. The SS-Bench benchmark, and the insights it generates, can help guide the development of more socially-intelligent language models that can better interact with and understand humans.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

SS-Bench: A Benchmark for Social Story Generation and Evaluation

Yi Feng, Mingyang Song, Jiaqi Wang, Zhuang Chen, Guanqun Bi, Minlie Huang, Liping Jing, Jian Yu

Children with Autism Spectrum Disorder (ASD) often misunderstand social situations and struggle to participate in daily routines. Social Stories are traditionally crafted by psychology experts under strict constraints to address these challenges but are costly and limited in diversity. As Large Language Models (LLMs) advance, there's an opportunity to develop more automated, affordable, and accessible methods to generate Social Stories in real-time with broad coverage. However, adapting LLMs to meet the unique and strict constraints of Social Stories is a challenging issue. To this end, we propose textbf{SS-GEN}, a textbf{S}ocial textbf{S}tory textbf{GEN}eration framework with LLMs. Firstly, we develop a constraint-driven sophisticated strategy named textbf{textsc{StarSow}} to hierarchically prompt LLMs to generate Social Stories at scale, followed by rigorous human filtering to build a high-quality dataset. Additionally, we introduce textbf{quality assessment criteria} to evaluate the effectiveness of these generated stories. Considering that powerful closed-source large models require very complex instructions and expensive API fees, we finally fine-tune smaller language models with our curated high-quality dataset, achieving comparable results at lower costs and with simpler instruction and deployment. This work marks a significant step in leveraging AI to personalize Social Stories cost-effectively for autistic children at scale, which we hope can encourage future research. The prompt, code and data will release in the texttt{Technical Appendix} and texttt{Code & Data Appendix} at url{https://github.com/MIMIFY/SS-GEN}.

9/10/2024

Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level

Chenxu Wang, Bin Dai, Huaping Liu, Baoyuan Wang

Prominent large language models have exhibited human-level performance in many domains, even enabling the derived agents to simulate human and social interactions. While practical works have substantiated the practicability of grounding language agents in sandbox simulation or embodied simulators, current social intelligence benchmarks either stay at the language level or use subjective metrics. In pursuit of a more realistic and objective evaluation, we introduce the Social Tasks in Sandbox Simulation (STSS) benchmark, which assesses language agents textbf{objectively} at the textbf{action level} by scrutinizing the goal achievements within the multi-agent simulation. Additionally, we sample conversation scenarios to build a language-level benchmark to provide an economically prudent preliminary evaluation and align with prevailing benchmarks. To gauge the significance of agent architecture, we implement a target-driven planning (TDP) module as an adjunct to the existing agent. Our evaluative findings highlight that the STSS benchmark is challenging for state-of-the-art language agents. Furthermore, it effectively discriminates between distinct language agents, suggesting its usefulness as a benchmark for evaluating both language models and agent architectures.

4/9/2024

StructBench: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding

Zhouhong Gu, Haoning Ye, Zeyang Zhou, Hongwei Feng, Yanghua Xiao

Given the substantial volumes of structured data held by many companies, enabling Large Language Models (LLMs) to directly understand structured text in non-structured forms could significantly enhance their capabilities across various business scenarios. To this end, we propose evaluation data generation method for assessing LLM's ability in understanding the structure-rich text, which generates structured data of controllable complexity based on manually crafted question templates and generation rules. Building on this generation method, we introduce StrucText-Eval, a benchmark comprising 6,032 questions across 8 different structured languages and 29 specific tasks. Furthermore, considering human proficiency in rule-based tasks, we also present StrucText-Eval-Hard, which includes 3,016 questions designed to further examine the gap between LLMs and human performance. Results indicate that the best-performing LLM currently achieve an accuracy of 65.0% on StrucText-Eval-Hard, while human accuracy reaches up to 95.7%. Moreover, while fine-tuning using StrucText-Eval can enhance existing LLMs' understanding of all structured languages, it does not necessarily improve performance across all task types. The benchmark and generation codes are open sourced in https://github.com/MikeGu721/StrucText-Eval

7/2/2024

💬

Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation

Cyril Chhun, Fabian M. Suchanek, Chlo'e Clavel

Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks which require high-level human abilities such as creativity, reasoning and deep understanding. Meanwhile, Large Language Models (LLM) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.

5/24/2024