Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection

Read original: arXiv:2402.09055 - Published 4/16/2024 by Yang Liu, Tongfei Shen, Dong Zhang, Qingying Sun, Shoushan Li, Guodong Zhou

🔎

Overview

The growing importance of detecting humor in short-form videos on social media platforms is driving advancements in multi-modal affective computing.
This paper proposes a novel two-branch hierarchical model for short-form video humor detection (SVHD), called Comment-aided Video-Language Alignment (CVLA).
CVLA aligns video and language components within a consistent semantic space by leveraging raw signals across different modalities and data-augmented multi-modal contrastive pre-training.

Plain English Explanation

The way people express humor is changing with the rise of short-form video platforms like TikTok and YouTube Shorts. Detecting humor in these videos is becoming increasingly important for understanding user sentiment and improving content recommendations.

The researchers developed a new model called CVLA that can identify humor in short videos. CVLA looks at both the video content and the accompanying text (like video titles and comments) to determine if a video is funny. It does this by aligning the video and text information into a common semantic space, allowing the model to understand the relationship between the visual and language elements.

This approach outperformed other state-of-the-art methods for detecting humor in two benchmark datasets. The researchers have made their dataset, code, and model publicly available to help advance research in this area.

Technical Explanation

The CVLA model uses a two-branch architecture to process video and language inputs separately, then aligns them in a shared semantic space. The video branch encodes the raw video signal, while the language branch encodes any accompanying text like titles and comments.

To improve the model's ability to understand humor across modalities, the researchers used data-augmented multi-modal contrastive pre-training. This involves training the model on a large, diverse dataset to learn general representations of humor-related concepts, before fine-tuning on specific humor detection tasks.

Experiments on two humor detection benchmarks, DY11k and UR-FUNNY, showed that CVLA significantly outperformed other state-of-the-art approaches. This demonstrates the value of the model's ability to leverage both video and language cues to accurately identify humor in short-form videos.

Critical Analysis

The paper provides a thorough evaluation of the CVLA model on established datasets, but there are a few limitations to consider:

The model was only tested on English-language videos, so its performance on other languages is unclear.
The datasets used are relatively small, so the model's scalability to larger real-world applications is uncertain.
The paper does not delve into the model's interpretability or provide much insight into the specific humor-related features it learns.

Future research could explore expanding CVLA to handle more diverse video and language inputs, as well as improving the model's ability to explain its humor detection decisions.

Conclusion

This paper presents a novel approach to multi-modal humor detection in short-form videos, a task that is growing in importance as social media platforms continue to shape how people consume and express humor online. The CVLA model's strong performance on benchmark datasets demonstrates the value of aligning video and language cues to better understand the nuances of digital humor. As research in this area continues, models like CVLA could help social platforms better curate and recommend humorous content, ultimately enhancing user experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🔎

Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection

Yang Liu, Tongfei Shen, Dong Zhang, Qingying Sun, Shoushan Li, Guodong Zhou

The growing importance of multi-modal humor detection within affective computing correlates with the expanding influence of short-form video sharing on social media platforms. In this paper, we propose a novel two-branch hierarchical model for short-form video humor detection (SVHD), named Comment-aided Video-Language Alignment (CVLA) via data-augmented multi-modal contrastive pre-training. Notably, our CVLA not only operates on raw signals across various modal channels but also yields an appropriate multi-modal representation by aligning the video and language components within a consistent semantic space. The experimental results on two humor detection datasets, including DY11k and UR-FUNNY, demonstrate that CVLA dramatically outperforms state-of-the-art and several competitive baseline approaches. Our dataset, code and model release at https://github.com/yliu-cs/CVLA.

4/16/2024

Unified Video-Language Pre-training with Synchronized Audio

Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang

Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either captured the correspondence of image-text pairs or utilized temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed as VLSA, that can learn tri-modal representations in a unified self-supervised transformer. Specifically, our VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we utilize local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and text. We conduct extensive experiments on retrieval across text, video, and audio. Our simple model pre-trained on only 0.9M data achieves improving results against state-of-the-art baselines. In addition, qualitative visualizations vividly showcase the superiority of our VLSA in learning discriminative visual-textual representations.

5/14/2024

ViLA: Efficient Video-Language Alignment for Video Question Answering

Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming Lin, Shan Yang

In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable text-guided Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA). However, how to efficiently and effectively sample video frames when adapting pre-trained large image-language model to video-language alignment is still the major challenge. Compared with prior work, our ViLA model demonstrates the capability of selecting key frames with critical contents, thus improving the video-language alignment accuracy while reducing the inference latency +3.3% on NExT-QA Temporal with 3.0X speed up). Overall, our ViLA network outperforms the state-of-the-art methods on the video question-answering benchmarks: +4.6% on STAR Interaction, +2.2% on STAR average with 3.0X speed up, ours 2-frames out-perform SeViLA 4-frames on the VLEP dataset with 4.2X speed-up.

4/30/2024

🔮

Towards Multimodal Prediction of Spontaneous Humour: A Novel Dataset and First Results

Lukas Christ, Shahin Amiriparian, Alexander Kathan, Niklas Muller, Andreas Konig, Bjorn W. Schuller

Humor is a substantial element of human social behavior, affect, and cognition. Its automatic understanding can facilitate a more naturalistic human-AI interaction. Current methods of humor detection have been exclusively based on staged data, making them inadequate for real-world applications. We contribute to addressing this deficiency by introducing the novel Passau-Spontaneous Football Coach Humor (Passau-SFCH) dataset, comprising about 11 hours of recordings. The Passau-SFCH dataset is annotated for the presence of humor and its dimensions (sentiment and direction) as proposed in Martin's Humor Style Questionnaire. We conduct a series of experiments employing pretrained Transformers, convolutional neural networks, and expert-designed features. The performance of each modality (text, audio, video) for spontaneous humor recognition is analyzed and their complementarity is investigated. Our findings suggest that for the automatic analysis of humor and its sentiment, facial expressions are most promising, while humor direction can be best modeled via text-based features. Further, we experiment with different multimodal approaches to humor recognition, including decision-level fusion and MulT, a multimodal Transformer approach. In this context, we propose a novel multimodal architecture that yields the best overall results. Finally, we make our code publicly available at https://www.github.com/lc0197/passau-sfch. The Passau-SFCH dataset is available upon request.

7/9/2024