Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets

Read original: arXiv:2406.18906 - Published 6/28/2024 by Melanie Walsh, Anna Preus, Maria Antoniak

Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets

Overview

This paper evaluates the ability of large language models to generate poetic content, specifically focused on the task of distinguishing sonnets from non-sonnets.
The authors analyze the performance of several state-of-the-art language models on this task, using a range of datasets and evaluation metrics.
The findings provide insights into the current limitations and capabilities of language models in the domain of creative writing and poetry generation.

Plain English Explanation

The paper investigates whether large language models, which are powerful AI systems trained on massive amounts of text data, can accurately identify poetic forms like sonnets. Sonnets are a specific type of poem with a structured rhyme scheme and meter. The researchers tested several advanced language models to see how well they could distinguish sonnets from other types of poetry.

They used different datasets of poems and a variety of evaluation methods to assess the models' performance on this task. The results shed light on the current strengths and weaknesses of these language models when it comes to understanding and generating poetic content. This work provides important insights into the capabilities and limitations of AI systems in the creative domain of poetry writing.

Technical Explanation

The paper explores the performance of large language models on the task of identifying poetic forms, specifically sonnets. The authors evaluate several state-of-the-art models, including GPT-3, Phonology Bench, and others, on their ability to distinguish sonnets from non-sonnets.

The researchers use a range of datasets, including the SONNET and POEMOTIONS datasets, to train and evaluate the models. They employ various evaluation metrics, such as accuracy, F1-score, and perplexity, to assess the models' performance.

The results show that while the language models demonstrate some ability to identify sonnets, they still struggle with this task, particularly when faced with more complex or ambiguous poetic forms. The authors discuss the potential reasons for these limitations, including the models' lack of deeper understanding of poetic structure and the challenges of capturing the nuances of creative expression.

Critical Analysis

The paper provides a valuable contribution to the understanding of large language models' capabilities in the domain of poetry generation and analysis. However, the authors acknowledge several caveats and areas for further research.

One limitation is the reliance on relatively small datasets, which may not fully capture the diversity of poetic forms and styles. Additionally, the evaluation metrics used, while standard, may not fully capture the subjective and qualitative aspects of poetry appreciation.

The authors also note that the language models employed in the study may have been trained on data that did not include a significant amount of poetic content, which could limit their ability to learn the intricate patterns and structures of poetry. Further research is needed to explore the impact of training data composition on the models' performance in this domain.

Another area for exploration is the potential role of human-in-the-loop approaches, where language models are combined with human expertise and feedback to enhance their understanding of poetry and creative writing.

Conclusion

This paper provides a valuable contribution to the ongoing research on the capabilities of large language models in the domain of poetry generation and analysis. The findings highlight the current limitations of these models in accurately identifying poetic forms, such as sonnets, and suggest the need for further advancements in AI-powered creativity and artistic expression.

The insights from this work can inform the development of more sophisticated language models and evaluation frameworks for creative writing tasks, ultimately contributing to the broader goal of bridging the gap between human and artificial intelligence in the realm of artistic and cultural expression.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets

Melanie Walsh, Anna Preus, Maria Antoniak

Large language models (LLMs) can now generate and recognize text in a wide range of styles and genres, including highly specialized, creative genres like poetry. But what do LLMs really know about poetry? What can they know about poetry? We develop a task to evaluate how well LLMs recognize a specific aspect of poetry, poetic form, for more than 20 forms and formal elements in the English language. Poetic form captures many different poetic features, including rhyme scheme, meter, and word or line repetition. We use this task to reflect on LLMs' current poetic capabilities, as well as the challenges and pitfalls of creating NLP benchmarks for poetry and for other creative tasks. In particular, we use this task to audit and reflect on the poems included in popular pretraining datasets. Our findings have implications for NLP researchers interested in model evaluation, digital humanities and cultural analytics scholars, and cultural heritage professionals.

6/28/2024

Understanding Literary Texts by LLMs: A Case Study of Ancient Chinese Poetry

Cheng Zhao, Bin Wang, Zhen Wang

The birth and rapid development of large language models (LLMs) have caused quite a stir in the field of literature. Once considered unattainable, AI's role in literary creation is increasingly becoming a reality. In genres such as poetry, jokes, and short stories, numerous AI tools have emerged, offering refreshing new perspectives. However, it's difficult to further improve the quality of these works. This is primarily because understanding and appreciating a good literary work involves a considerable threshold, such as knowledge of literary theory, aesthetic sensibility, interdisciplinary knowledge. Therefore, authoritative data in this area is quite lacking. Additionally, evaluating literary works is often complex and hard to fully quantify, which directly hinders the further development of AI creation. To address this issue, this paper attempts to explore the mysteries of literary texts from the perspective of LLMs, using ancient Chinese poetry as an example for experimentation. First, we collected a variety of ancient poems from different sources and had experts annotate a small portion of them. Then, we designed a range of comprehension metrics based on LLMs to evaluate all these poems. Finally, we analyzed the correlations and differences between various poem collections to identify literary patterns. Through our experiments, we observed a series of enlightening phenomena that provide technical support for the future development of high-level literary creation based on LLMs.

9/12/2024

Benchmarking LLMs for Translating Classical Chinese Poetry:Evaluating Adequacy, Fluency, and Elegance

Andong Chen, Lianzhang Lou, Kehai Chen, Xuefeng Bai, Yang Xiang, Muyun Yang, Tiejun Zhao, Min Zhang

Large language models (LLMs) have shown remarkable performance in general translation tasks. However, the increasing demand for high-quality translations that are not only adequate but also fluent and elegant. To assess the extent to which current LLMs can meet these demands, we introduce a suitable benchmark for translating classical Chinese poetry into English. This task requires not only adequacy in translating culturally and historically significant content but also a strict adherence to linguistic fluency and poetic elegance. Our study reveals that existing LLMs fall short of this task. To address these issues, we propose RAT, a textbf{R}etrieval-textbf{A}ugmented machine textbf{T}ranslation method that enhances the translation process by incorporating knowledge related to classical poetry. Additionally, we propose an automatic evaluation metric based on GPT-4, which better assesses translation quality in terms of adequacy, fluency, and elegance, overcoming the limitations of traditional metrics. Our dataset and code will be made available.

8/20/2024

Evaluating Diversity in Automatic Poetry Generation

Yanran Chen, Hannes Groner, Sina Zarrie{ss}, Steffen Eger

Natural Language Generation (NLG), and more generally generative AI, are among the currently most impactful research fields. Creative NLG, such as automatic poetry generation, is a fascinating niche in this area. While most previous research has focused on forms of the Turing test when evaluating automatic poetry generation - can humans distinguish between automatic and human generated poetry - we evaluate the diversity of automatically generated poetry, by comparing distributions of generated poetry to distributions of human poetry along structural, lexical, semantic and stylistic dimensions, assessing different model types (word vs. character-level, general purpose LLMs vs. poetry-specific models), including the very recent LLaMA3, and types of fine-tuning (conditioned vs. unconditioned). We find that current automatic poetry systems are considerably underdiverse along multiple dimensions - they often do not rhyme sufficiently, are semantically too uniform and even do not match the length distribution of human poetry. Our experiments reveal, however, that style-conditioning and character-level modeling clearly increases diversity across virtually all dimensions we explore. Our identified limitations may serve as the basis for more genuinely diverse future poetry generation models.

6/24/2024