Toward accessible comics for blind and low vision readers

Read original: arXiv:2407.08248 - Published 9/11/2024 by Christophe Rigaud (L3I), Jean-Christophe Burie (L3I), Samuel Petit (Comix AI)

Overview

  • The paper explores how to fine-tune large language models using prompt engineering techniques with contextual information to generate accurate text descriptions of full stories, ready to be used with off-the-shelf speech synthesis tools.
  • The researchers propose using existing computer vision and optical character recognition techniques to build a grounded context from comic strip image content, such as panels, characters, text, reading order, and the association of speech bubbles and characters.
  • They then infer character identification and generate comic book scripts with context-aware panel descriptions, including details about characters' appearances, postures, moods, and dialogues.
  • The researchers believe this enriched content description can be easily used to produce audiobooks and e-books with various voices for characters, captions, and sound effects.
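To make the idea of associating speech bubbles with characters concrete, here is a minimal sketch. The summary does not say how the authors perform this step, so the nearest-center heuristic below is an assumption chosen for illustration, and the `Box` class and `associate_bubbles` function are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical format)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


def associate_bubbles(
    bubbles: dict[str, Box], characters: dict[str, Box]
) -> dict[str, str | None]:
    """Map each bubble id to the character whose box center is closest.

    Assumption: distance between centers is a crude stand-in for a real
    speaker-association model. A bubble maps to None if no character exists.
    """
    result: dict[str, str | None] = {}
    for bubble_id, bubble in bubbles.items():
        bx, by = bubble.center
        best_id, best_dist = None, float("inf")
        for char_id, char in characters.items():
            cx, cy = char.center
            dist = hypot(bx - cx, by - cy)
            if dist < best_dist:
                best_id, best_dist = char_id, dist
        result[bubble_id] = best_id
    return result


if __name__ == "__main__":
    bubbles = {"b1": Box(0, 0, 10, 10), "b2": Box(100, 0, 110, 10)}
    characters = {"c1": Box(0, 20, 10, 40), "c2": Box(100, 20, 110, 40)}
    print(associate_bubbles(bubbles, characters))
```

A heuristic like this is likely to fail when a bubble's tail points at a distant character, which is one reason a learned association step may be preferable in practice.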

Plain English Explanation

The paper describes a way to improve the text descriptions that large language models generate for comic strips. The researchers first use existing computer vision and optical character recognition techniques to extract contextual information from the comic strip images: the layout of panels, the characters present, the text in speech bubbles, and the order in which the story is meant to be read. This grounded context is then supplied to the language model through carefully engineered prompts, so that it can generate more accurate and detailed descriptions of the full story, including how the characters look, their body language, and what they are saying. The researchers believe this enriched description could readily be turned into audiobooks and e-books with different voices for characters, captions, and sound effects.

Technical Explanation

The paper proposes a system that combines computer vision, optical character recognition, and prompt-engineered large language models to generate rich text descriptions of comic strip content. First, existing techniques extract contextual information from the comic strip images: panels, characters, text in speech bubbles, reading order, and the association between speech bubbles and characters. This grounded context is then supplied to a large language model through prompt engineering, which the authors frame as a way of fine-tuning the model. The model infers character identities and generates a script with a description of each panel, covering character appearances, postures, moods, and dialogue.
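As a rough illustration of how extracted context might be passed to a language model, the sketch below serializes panels, characters, and bubble text into a prompt in reading order. The `Panel` and `Bubble` classes, the `build_prompt` function, and the prompt wording are all hypothetical; the summary does not give the authors' actual prompt or data format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Bubble:
    text: str                    # OCR output for one text region
    speaker: str | None = None   # associated character label; None for captions


@dataclass
class Panel:
    index: int                                        # position in reading order
    characters: list[str] = field(default_factory=list)
    bubbles: list[Bubble] = field(default_factory=list)


def build_prompt(panels: list[Panel]) -> str:
    """Serialize the grounded context into a plain-text prompt (assumed format)."""
    lines = [
        "You are describing a comic strip for a blind or low vision reader.",
        "Grounded context extracted from the image:",
    ]
    # Present panels in reading order regardless of input order.
    for panel in sorted(panels, key=lambda p: p.index):
        lines.append(f"Panel {panel.index}:")
        chars = ", ".join(panel.characters) or "none detected"
        lines.append(f"  Characters: {chars}")
        for bubble in panel.bubbles:
            speaker = bubble.speaker or "caption"
            lines.append(f'  Text ({speaker}): "{bubble.text}"')
    lines.append(
        "Write a script with one description per panel, covering each "
        "character's appearance, posture, mood, and dialogue."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    panels = [
        Panel(2, ["char_1"], [Bubble("Bye!", "char_1")]),
        Panel(1, [], [Bubble("Later that day...")]),
    ]
    print(build_prompt(panels))
```

The resulting string would be sent to a language model together with, or instead of, the image. That choice is left open here.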

The researchers draw on previous work in zero-shot character identification and speaker prediction in comics, as well as large-scale dialogue datasets for comics, to inform their approach. The goal is to produce content descriptions that can be easily used with off-the-shelf speech synthesis tools to create audiobooks and e-books with varied character voices, captions, and sound effects.
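The step from a generated script to speech synthesis with varied voices can be sketched as follows. This is an assumption-laden illustration: `ScriptLine`, `assign_voices`, and the voice names are hypothetical, and no particular speech synthesis tool or API is implied.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScriptLine:
    kind: str                    # "dialogue", "caption", or "sfx"
    text: str
    speaker: str | None = None   # character label for dialogue lines


def assign_voices(
    script: list[ScriptLine],
    voice_pool: list[str],
    narrator_voice: str = "narrator",
    sfx_voice: str = "sfx",
) -> list[tuple[str, str]]:
    """Return (voice, text) segments, giving each character a stable voice.

    Characters receive voices from voice_pool in order of first appearance,
    cycling if there are more characters than voices. Captions and any
    unattributed dialogue fall back to the narrator voice.
    """
    if not voice_pool:
        raise ValueError("voice_pool must contain at least one voice")
    voices: dict[str, str] = {}
    segments: list[tuple[str, str]] = []
    for line in script:
        if line.kind == "dialogue" and line.speaker is not None:
            if line.speaker not in voices:
                voices[line.speaker] = voice_pool[len(voices) % len(voice_pool)]
            voice = voices[line.speaker]
        elif line.kind == "sfx":
            voice = sfx_voice
        else:
            voice = narrator_voice
        segments.append((voice, line.text))
    return segments


if __name__ == "__main__":
    script = [
        ScriptLine("caption", "Morning."),
        ScriptLine("dialogue", "Hi!", "Ann"),
        ScriptLine("dialogue", "Hello.", "Bob"),
        ScriptLine("sfx", "BANG"),
        ScriptLine("dialogue", "What was that?", "Ann"),
    ]
    for voice, text in assign_voices(script, ["voice_a", "voice_b"]):
        print(f"{voice}: {text}")
```

Each segment could then be handed to whichever speech synthesis tool is in use, one call per segment, with the voice label mapped to that tool's own voice identifiers.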

Critical Analysis

The paper presents a promising approach to enhancing the text descriptions generated by large language models for comic strip content. The use of computer vision and optical character recognition techniques to build a grounded context is a logical and well-justified step, as it allows the language model to better understand the nuances of the comic strip format.

However, the paper does not delve into the specific challenges or limitations of this approach. For example, it is unclear how the system would handle complex or ambiguous comic strip layouts, or how well it would perform on a diverse range of comic styles and genres. Additionally, the researchers do not discuss the potential issues with using off-the-shelf speech synthesis tools, such as the quality of the generated voices or the ability to capture the intended emotional tone and delivery.

Further research and testing would be needed to fully evaluate the effectiveness and scalability of this approach, as well as to identify any areas for improvement or refinement.

Conclusion

The paper proposes an innovative approach to enhancing the text descriptions generated by large language models when dealing with comic strip content. By leveraging computer vision and optical character recognition techniques to build a grounded context, the researchers aim to enable language models to produce more accurate and detailed narratives that can be easily converted into audiobooks and e-books.

While the paper presents a promising solution, it also raises questions about the specific challenges and limitations of the approach that would need to be explored further. Overall, this research represents an important step towards improving the capabilities of language models in the domain of text-heavy visual content, with potential applications in a variety of industries, from entertainment to education.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Toward accessible comics for blind and low vision readers

Christophe Rigaud (L3I), Jean-Christophe Burie (L3I), Samuel Petit (Comix AI)

This work explores how to fine-tune large language models using prompt engineering techniques with contextual information for generating an accurate text description of the full story, ready to be forwarded to off-the-shelf speech synthesis tools. We propose to use existing computer vision and optical character recognition techniques to build a grounded context from the comic strip image content, such as panels, characters, text, reading order and the association of bubbles and characters. Then we infer character identification and generate a comic book script with context-aware panel descriptions including each character's appearance, posture, mood, dialogues, etc. We believe that such enriched content description can be easily used to produce audiobooks and eBooks with various voices for characters, captions and sound effects.

9/11/2024

Context-Aware Image Descriptions for Web Accessibility

Ananya Gubbi Mohanbabu, Amy Pavel

Blind and low vision (BLV) internet users access images on the web via text descriptions. New vision-to-language models such as GPT-V, Gemini, and LLaVa can now provide detailed image descriptions on-demand. While prior research and guidelines state that BLV audiences' information preferences depend on the context of the image, existing tools for accessing vision-to-language models provide only context-free image descriptions by generating descriptions for the image alone without considering the surrounding webpage context. To explore how to integrate image context into image descriptions, we designed a Chrome Extension that automatically extracts webpage context to inform GPT-4V-generated image descriptions. We gained feedback from 12 BLV participants in a user study comparing typical context-free image descriptions to context-aware image descriptions. We then further evaluated our context-informed image descriptions with a technical evaluation. Our user evaluation demonstrated that BLV participants frequently prefer context-aware descriptions to context-free descriptions. BLV participants also rated context-aware descriptions significantly higher in quality, imaginability, relevance, and plausibility. All participants shared that they wanted to use context-aware descriptions in the future and highlighted the potential for use in online shopping, social media, news, and personal interest blogs.

9/6/2024

The Manga Whisperer: Automatically Generating Transcriptions for Comics

Ragav Sachdeva, Andrew Zisserman

In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged by everyone. Specifically, we tackle the problem of diarisation, i.e. generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: https://github.com/ragavsachdeva/magi.

8/2/2024

Enhancing Vision Models for Text-Heavy Content Understanding and Interaction

Adithya TG, Adithya SK, Abhinav R Bharadwaj, Abhiram HA, Dr. Surabhi Narayan

Interacting with and understanding text-heavy visual content that spans multiple images is a major challenge for traditional vision models. This paper focuses on enhancing vision models' ability to comprehend and learn from images containing large amounts of textual information, such as textbooks and research papers, which include multiple images like graphs and tables with different types of axes and scales. The approach involves dataset preprocessing, fine-tuning on instruction-oriented data, and evaluation. We also built a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark, developed to consider both textual and visual inputs. An accuracy of 96.71% was obtained. The aim of the project is to enhance advanced vision models' capabilities in understanding complex, interconnected visual and textual data, contributing to multimodal AI.

6/3/2024