UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track

Read original: arXiv:2408.10129 - Published 8/27/2024 by Hao Fang, Feiyu Pan, Xiankai Lu, Wei Zhang, Runmin Cong

Introduction

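UNINEXT-Cutie is the 1st-place solution for the RVOS (Referring Video Object Segmentation) track of the 6th LSVOS Challenge.
RVOS segments a target object in a video based on a natural language expression.
The paper combines a leading RVOS model with a leading VOS (video object segmentation) model in a simple pipeline, then further improves the RVOS model with semi-supervised learning.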

Method

UNINEXT Architecture

UNINEXT-Cutie uses a two-stage pipeline.
The first stage is a fine-tuned RVOS model (UNINEXT, as the solution's name suggests) that produces mask sequences correlated with the language description.
The second stage is a VOS model (Cutie) that improves the quality and temporal consistency of those masks.
The RVOS model is a vision-language model that segments the target object based on a referring expression.
The VOS model then starts from reliable, high-quality key frames and propagates their masks through the rest of the video.
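
The two-stage flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the source does not say how key frames are chosen, so picking the frame with the highest RVOS confidence is an assumption, and `select_key_frame`, `propagate_from_key_frame`, and `copy_mask_step` are hypothetical names. The `vos_step` callback stands in for a real VOS model.

```python
from typing import Callable, List, Sequence

Mask = List[List[int]]  # binary mask stored as rows of 0/1


def select_key_frame(scores: Sequence[float]) -> int:
    """Return the index of the frame with the highest confidence.

    Assumption: a per-frame confidence from the RVOS stage is available.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return max(range(len(scores)), key=lambda i: scores[i])


def propagate_from_key_frame(
    num_frames: int,
    key_idx: int,
    key_mask: Mask,
    vos_step: Callable[[Mask, int, int], Mask],
) -> List[Mask]:
    """Propagate the key-frame mask forward and backward through the video.

    vos_step(prev_mask, src_frame, dst_frame) is a hypothetical stand-in
    for one propagation step of a VOS model.
    """
    masks: List[Mask] = [[] for _ in range(num_frames)]
    masks[key_idx] = key_mask
    # Forward pass: key frame towards the end of the video.
    for t in range(key_idx + 1, num_frames):
        masks[t] = vos_step(masks[t - 1], t - 1, t)
    # Backward pass: key frame towards the start of the video.
    for t in range(key_idx - 1, -1, -1):
        masks[t] = vos_step(masks[t + 1], t + 1, t)
    return masks


def copy_mask_step(prev_mask: Mask, src: int, dst: int) -> Mask:
    """Toy stand-in for a VOS model: carries the mask over unchanged."""
    return [row[:] for row in prev_mask]


# Toy RVOS output: one confidence score and one 2x2 mask per frame.
rvos_scores = [0.42, 0.91, 0.63, 0.55]
rvos_masks = [
    [[0, 1], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[0, 0], [1, 0]],
]

key_idx = select_key_frame(rvos_scores)
refined = propagate_from_key_frame(
    len(rvos_masks), key_idx, rvos_masks[key_idx], copy_mask_step
)
```

With the toy step function every refined frame simply inherits the key-frame mask. A real VOS model would instead track the object as it moves, which is where the gain in temporal consistency would come from.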

Training Process

The RVOS model is trained in stages.
It starts from a pre-trained state-of-the-art RVOS model.
Then, it is fine-tuned on MeViS, the target dataset for this year's LSVOS Challenge RVOS track.
Finally, semi-supervised learning, which draws on both labeled and unlabeled videos, further improves the RVOS model.
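
One common form of semi-supervised learning is pseudo-labeling: the current model labels unlabeled videos, and only confident predictions are added to the training set. The source does not say which semi-supervised scheme the authors use, so the sketch below is an assumption meant only to make the idea concrete. `Sample`, `build_pseudo_labels`, `toy_predict`, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    video_id: str
    expression: str
    mask_id: str  # hypothetical handle to a stored mask sequence
    is_pseudo: bool = False


def build_pseudo_labels(
    unlabeled: List[Tuple[str, str]],
    predict: Callable[[str, str], Tuple[str, float]],
    min_confidence: float = 0.8,
) -> List[Sample]:
    """Keep only predictions the current model is confident about.

    predict(video_id, expression) returns (mask_id, confidence) and
    stands in for inference with the fine-tuned RVOS model.
    """
    pseudo: List[Sample] = []
    for video_id, expression in unlabeled:
        mask_id, confidence = predict(video_id, expression)
        if confidence >= min_confidence:
            pseudo.append(Sample(video_id, expression, mask_id, is_pseudo=True))
    return pseudo


# Toy predictor backed by a lookup table instead of a real model.
toy_outputs = {"vid_b": ("mask_b", 0.93), "vid_c": ("mask_c", 0.41)}


def toy_predict(video_id: str, expression: str) -> Tuple[str, float]:
    return toy_outputs[video_id]


labeled = [Sample("vid_a", "the dog running to the left", "mask_a")]
unlabeled = [
    ("vid_b", "the bird flying away"),
    ("vid_c", "the car turning right"),
]

pseudo = build_pseudo_labels(unlabeled, toy_predict)
training_set = labeled + pseudo  # data for the next training round
```

Here only `vid_b` passes the threshold, so the next round would train on one labeled and one pseudo-labeled sample.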

Technical Explanation

The RVOS model uses a transformer-based architecture to fuse visual and linguistic features.
The VOS model works from the visual side: given masks on reliable key frames, it improves mask quality and temporal consistency across the remaining frames.
Semi-supervised learning is applied to the RVOS model on top of fine-tuning.
The full solution achieves 62.57 J&F on the MeViS test set, ranking 1st in the 6th LSVOS Challenge RVOS track.
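
J&F is the mean of region similarity J (intersection over union of the predicted and ground-truth masks) and contour accuracy F (an F-measure over boundary pixels). The sketch below computes both for a single frame in plain Python. It is a simplification, not the official evaluation code: boundary pixels are matched exactly, whereas benchmark toolkits typically allow a small distance tolerance, and benchmark scores are averaged over all frames and expressions. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]  # binary mask stored as rows of 0/1


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection over union of two binary masks."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return 1.0 if union == 0 else inter / union


def boundary_pixels(mask: Mask) -> Set[Tuple[int, int]]:
    """Foreground pixels with a background or out-of-image 4-neighbour."""
    h, w = len(mask), len(mask[0])
    out: Set[Tuple[int, int]] = set()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if ny < 0 or ny >= h or nx < 0 or nx >= w or not mask[ny][nx]:
                    out.add((y, x))
                    break
    return out


def contour_accuracy(pred: Mask, gt: Mask) -> float:
    """F: boundary F-measure with exact pixel matching (a simplification)."""
    pb, gb = boundary_pixels(pred), boundary_pixels(gt)
    if not pb and not gb:
        return 1.0
    if not pb or not gb:
        return 0.0
    matched = len(pb & gb)
    precision = matched / len(pb)
    recall = matched / len(gb)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred: Mask, gt: Mask) -> float:
    """Mean of J and F for one frame."""
    return (region_similarity(pred, gt) + contour_accuracy(pred, gt)) / 2


gt_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]

j = region_similarity(pred_mask, gt_mask)
f = contour_accuracy(pred_mask, gt_mask)
score = j_and_f(pred_mask, gt_mask)
```

In this toy case the prediction covers the object plus one extra column, so J is 4/6 and the simplified F is 0.8. A reported score such as 62.57 is the same kind of quantity expressed as a percentage.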

Critical Analysis

The paper does not provide detailed information on the hyperparameter tuning or ablation studies conducted.
It would be helpful to understand the specific design choices and their impact on the final performance.
The generalization of the method to other video segmentation tasks is not discussed in depth.
Further research could explore the potential of this approach for other video understanding problems.

Conclusion

UNINEXT-Cutie is a simple and effective pipeline for the LSVOS Challenge RVOS track that leverages semi-supervised learning.
The two-stage pipeline, an RVOS model followed by a VOS model, reaches 62.57 J&F on the MeViS test set and took 1st place in the challenge.
The paper provides a valuable contribution to the field of video object segmentation and highlights the potential of combining vision and language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track

Hao Fang, Feiyu Pan, Xiankai Lu, Wei Zhang, Runmin Cong

Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video. This year, the LSVOS Challenge RVOS Track replaced the original YouTube-RVOS benchmark with MeViS. MeViS focuses on referring to the target object in a video through its motion descriptions instead of static attributes, posing a greater challenge to the RVOS task. In this work, we integrate the strengths of leading RVOS and VOS models to build a simple and effective pipeline for RVOS. First, we fine-tune a state-of-the-art RVOS model to obtain mask sequences that are correlated with language descriptions. Second, based on reliable and high-quality key frames, we leverage a VOS model to enhance the quality and temporal consistency of the mask results. Finally, we further improve the performance of the RVOS model using semi-supervised learning. Our solution achieved 62.57 J&F on the MeViS test set and ranked 1st place in the 6th LSVOS Challenge RVOS Track.

8/27/2024

3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation

Feiyu Pan, Hao Fang, Xiankai Lu

Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video, emphasizing modeling dense text-video relations. Current RVOS methods typically use independently pre-trained vision and language models as backbones, resulting in a significant domain gap between video and text. In cross-modal feature interaction, text features are only used as query initialization and do not fully utilize important information in the text. In this work, we propose using frozen pre-trained vision-language models (VLM) as backbones, with a specific emphasis on enhancing cross-modal feature interaction. First, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the issue of domain gap and reducing training costs. Second, we add more cross-modal feature fusion in the pipeline to enhance the utilization of multi-modal information. Furthermore, we propose a novel video query initialization method to generate higher quality video queries. Without bells and whistles, our method achieved 51.5 J&F on the MeViS test set and ranked 3rd place for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation.

6/10/2024

The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation

Tuyen Tran

Referring Video Object Segmentation (RVOS) is a challenging task due to its requirement for temporal understanding. Due to the obstacle of computational complexity, many state-of-the-art models are trained on short time intervals. During testing, while these models can effectively process information over short time steps, they struggle to maintain consistent perception over prolonged time sequences, leading to inconsistencies in the resulting semantic segmentation masks. To address this challenge, we take a step further in this work by leveraging the tracking capabilities of the newly introduced Segment Anything Model version 2 (SAM-v2) to enhance the temporal consistency of the referring object segmentation model. Our method achieved a score of 60.40 J&F on the test set of the MeViS dataset, placing 2nd in the final ranking of the RVOS Track at the ECCV 2024 LSVOS Challenge.

8/23/2024

The Instance-centric Transformer for the RVOS Track of LSVOS Challenge: 3rd Place Solution

Bin Cao, Yisi Zhang, Hanyi Wang, Xingjian He, Jing Liu

Referring Video Object Segmentation is an emerging multi-modal task that aims to segment objects in the video given a natural language expression. In this work, we build two instance-centric models and fuse predicted results from the frame level and the instance level. First, we introduce instance masks into the DETR-based model for query initialization to achieve temporal enhancement and employ SAM for spatial refinement. Second, we build an instance retrieval model that performs binary instance mask classification to determine whether the instance is referred to. Finally, we fuse the predicted results, and our method achieved a score of 52.67 J&F in the validation phase and 60.36 J&F in the test phase, securing the final ranking of 3rd place in the 6th LSVOS Challenge RVOS Track.

8/21/2024