Good Books are Complex Matters: Gauging Complexity Profiles Across Diverse Categories of Perceived Literary Quality

Read original: arXiv:2404.04022 - Published 4/16/2024 by Yuri Bizzoni, Pascale Feldkamp, Ida Marie Lassen, Mia Jacobsen, Mads Rosendahl Thomsen, Kristoffer Nielbo

Overview

The study investigates the linguistic profiles of different categories of literary quality, including canonical/high-brow texts, bestsellers, and award-winning novels.
The researchers built a corpus of texts from the Norton Anthology, Penguin Classics, Open Syllabus Project, contemporary bestsellers, and Nobel Prize/prestigious literary award winners.
They used a machine learning approach called Random Forest to differentiate between these quality categories and found that high-brow texts exhibit distinct textual features compared to other categories.
The study suggests that while literary quality features may be distinguishable, they are also shared across different quality proxies.

Plain English Explanation

The researchers wanted to see if different types of high-quality literature, such as canonical texts and award-winning novels, have unique linguistic patterns that set them apart. They collected a large set of texts, including titles from well-known literature collections, bestsellers, and works that have won prestigious awards.

Using a machine learning algorithm, the researchers separated the categories with F1 scores of up to 77%. F1 combines precision and recall, so this is not the same as being right 77% of the time. The result suggests that high-brow or "literary" texts do have distinct textual characteristics, but that these features are also present, to some degree, in other types of quality literature.

In other words, there seem to be both unique and shared qualities across different levels of literary merit. The researchers believe this indicates that while we can identify markers of high-quality writing, the boundaries between "literary" and more popular or commercial fiction may not be as clear-cut as some might assume.

Technical Explanation

The researchers constructed a large corpus of texts spanning different levels of literary quality, including:

Canonical works from the Norton Anthology and Penguin Classics
Titles from the Open Syllabus Project, which represents texts commonly used in college courses
Contemporary bestsellers
Novels that have won the Nobel Prize or other major literary awards

They then applied a supervised machine learning approach, specifically the Random Forest classifier, to differentiate between these quality categories. The goal was to see if high-brow or "literary" texts exhibited distinct linguistic profiles compared to more commercially successful or award-winning works.
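The classification setup described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn and NumPy are installed, the feature matrix is synthetic (the summary does not list the paper's actual linguistic features), and the function name `evaluate_quality_vs_control` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def evaluate_quality_vs_control(n_per_class=200, n_features=5, seed=0):
    """Train a Random Forest on synthetic data and return its F1 score.

    Each row stands in for one novel and each column for one textual
    feature. The values are random placeholders, not real measurements.
    """
    rng = np.random.default_rng(seed)
    # Assumption: the two groups differ slightly in the mean of every feature.
    control = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
    quality = rng.normal(1.0, 1.0, size=(n_per_class, n_features))

    X = np.vstack([control, quality])
    y = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = control, 1 = quality

    # Hold out a stratified test split so both classes are evaluated evenly.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test))


score = evaluate_quality_vs_control()
print(f"F1 on synthetic data: {score:.2f}")
```

Running the same function on two quality categories instead of a quality category and a control group would be one way to compare how separable different pairs of groups are.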

The Random Forest model reached F1 scores of up to 77% when separating quality novels from control groups. This suggests that canonical/high-brow texts have textual features that set them apart, both from control groups and from other types of quality literature.
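For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The short sketch below computes it by hand for a binary task. The label lists are made-up toy values, not data from the paper, and the helper name `f1_score_binary` is hypothetical.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Compute the F1 score for one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0

    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


# Toy example: 1 = "quality" novel, 0 = control novel.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

f1 = f1_score_binary(y_true, y_pred)
print(f"F1 = {f1:.2f}")
```

Here three of the four quality novels are found and one control novel is mislabeled as quality, so precision and recall are both 0.75 and F1 is 0.75.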

However, the researchers also found that it was generally easier to differentiate the quality categories from control groups than it was to distinguish them from each other. This implies that while there are identifiable markers of literary quality, these features are also shared, to some degree, across different quality proxies.

Critical Analysis

The study offers evidence about how the linguistic characteristics of literary works vary with the way quality is defined. By combining a corpus drawn from several quality proxies with a Random Forest classifier, the researchers were able to compare "literary" fiction with bestsellers and award-winning fiction on the same footing.

One limitation of the research is that the quality categorization was based on established literary canons and award lists, which may not fully capture the subjective and evolving nature of literary merit. Additionally, the Open Syllabus Project data, while representing texts commonly used in academia, may not be representative of the full spectrum of literary quality.

Further research could explore the linguistic profiles of specific genres within each quality category, as well as investigate how these patterns may shift over time or across different cultural contexts. Additionally, incorporating cross-lingual analysis could provide a more comprehensive understanding of the universal and culturally-specific aspects of literary quality.

Conclusion

This study offers a nuanced perspective on the linguistic characteristics of different levels of literary quality. While the researchers found that canonical and high-brow texts exhibit distinct textual features, they also observed that these markers of literary merit are shared, to some degree, across various quality proxies, such as bestsellers and award-winning novels.

This suggests that the boundaries between "literary" and more commercially successful or critically acclaimed fiction may not be as clear-cut as some might assume. The findings have implications for our understanding of literary quality and the complex interplay between artistic merit, popular appeal, and critical recognition in the world of literature.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Good Books are Complex Matters: Gauging Complexity Profiles Across Diverse Categories of Perceived Literary Quality

Yuri Bizzoni, Pascale Feldkamp, Ida Marie Lassen, Mia Jacobsen, Mads Rosendahl Thomsen, Kristoffer Nielbo

In this study, we employ a classification approach to show that different categories of literary quality display unique linguistic profiles, leveraging a corpus that encompasses titles from the Norton Anthology, Penguin Classics series, and the Open Syllabus project, contrasted against contemporary bestsellers, Nobel prize winners and recipients of prestigious literary awards. Our analysis reveals that canonical and so-called high-brow texts exhibit distinct textual features when compared to other quality categories such as bestsellers and popular titles as well as to control groups, likely responding to distinct (but not mutually exclusive) models of quality. We apply a classic machine learning approach, namely Random Forest, to distinguish quality novels from control groups, achieving up to 77% F1 scores in differentiating between the categories. We find that quality categories tend to be easier to distinguish from control groups than from other quality categories, suggesting that literary quality features might be distinguishable but shared through quality proxies.

4/16/2024

QuRating: Selecting High-Quality Data for Training Language Models

Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen

Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value - and find that LLMs are able to discern these qualities, especially when making pairwise judgments of texts. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity. When we sample using quality ratings as logits over documents, our models obtain lower perplexity and stronger in-context learning performance than baselines. Our best model is based on educational value and performs similarly to a model trained with uniform sampling for 50% more steps. Beyond data selection, we use the quality ratings to construct a training curriculum which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.

7/19/2024

A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

Gokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas, Ozen Nergis Dolcerocca

This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available.

7/23/2024

Separating Style from Substance: Enhancing Cross-Genre Authorship Attribution through Data Selection and Presentation

Steven Fincke, Elizabeth Boschee

The task of deciding whether two documents are written by the same author is challenging for both machines and humans. This task is even more challenging when the two documents are written about different topics (e.g. baseball vs. politics) or in different genres (e.g. a blog post vs. an academic article). For machines, the problem is complicated by the relative lack of real-world training examples that cross the topic boundary and the vanishing scarcity of cross-genre data. We propose targeted methods for training data selection and a novel learning curriculum that are designed to discourage a model's reliance on topic information for authorship attribution and correspondingly force it to incorporate information more robustly indicative of style no matter the topic. These refinements yield a 62.7% relative improvement in average cross-genre authorship attribution, as well as 16.6% in the per-genre condition.

8/12/2024