UniParser: Multi-Human Parsing with Unified Correlation Representation Learning

Read original: arXiv:2310.08984 - Published 5/21/2024 by Jiaming Chu, Lei Jin, Junliang Xing, Jian Zhao

Overview

This paper introduces UniParser, a new approach for multi-human parsing that integrates instance-level and category-level information in a unified framework.
Prior methods have typically processed these two types of information separately, leading to inefficient and redundant frameworks.
UniParser aims to address this by learning instance and category features within a shared cosine space, using a homogeneous output format and a joint optimization procedure.

Plain English Explanation

The paper focuses on multi-human parsing, which involves segmenting an image to identify each individual person and, within each person, fine-grained categories such as body parts and clothing items (for example, hair, face, or upper clothes). Prior approaches have typically handled the instance-level information (which person a pixel belongs to) and the category-level information (which part or clothing item it is) separately, resulting in complex and redundant systems.

UniParser, the new method introduced in this paper, takes a more unified approach. It learns to represent both the instance-level and category-level features in a shared cosine space, allowing the network to jointly optimize these two types of information. Every module's output is also expressed as a pixel-level segmentation result, which removes the need for manually designed post-processing.
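To make the idea of pixel-level instance and category outputs concrete, here is a small illustrative sketch. It is not taken from the paper: the category ids and names, the toy arrays, and the helper `parts_per_person` are all hypothetical. It only shows how a per-pixel "which person" map and a per-pixel "which part" map can be read together without any extra matching step.

```python
import numpy as np

# Hypothetical label ids for illustration; real datasets define their own.
BACKGROUND = 0
CATEGORY_NAMES = {0: "background", 1: "hair", 2: "upper-clothes", 3: "pants"}

# Toy 4x6 image. instance_map says which person a pixel belongs to,
# category_map says which part or clothing item it is.
instance_map = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
])
category_map = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 2, 2, 0, 2, 2],
    [0, 2, 2, 0, 3, 3],
    [0, 3, 3, 0, 3, 3],
])


def parts_per_person(instance_map, category_map):
    """Return {person_id: {category_id: pixel_count}}, ignoring background."""
    result = {}
    for person_id in np.unique(instance_map):
        if person_id == BACKGROUND:
            continue
        person_pixels = instance_map == person_id
        cats, counts = np.unique(category_map[person_pixels], return_counts=True)
        result[int(person_id)] = {
            int(c): int(n) for c, n in zip(cats, counts) if c != BACKGROUND
        }
    return result


summary = parts_per_person(instance_map, category_map)
for person_id, parts in summary.items():
    readable = {CATEGORY_NAMES[c]: n for c, n in parts.items()}
    print(f"person {person_id}: {readable}")
```

Because both maps live on the same pixel grid, combining them is a simple masking operation, which is the kind of simplification a homogeneous output format is meant to allow.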

By integrating the instance and category information in this way, UniParser outperforms state-of-the-art methods on benchmark multi-human parsing datasets, without the complexity of separate processing pipelines.

Technical Explanation

The key innovations in UniParser are:

Unified Correlation Representation Learning: The network learns instance and category features within a shared cosine space, allowing it to capture the relationships between them.
Unified Output Format: The network produces pixel-level segmentation results for both instance and category information, using a homogeneous label and an auxiliary loss to supervise these outputs.
Joint Optimization: The network is trained using a joint optimization procedure that fuses the instance and category representations, rather than processing them separately.
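The three points above can be sketched roughly in code. The following is a minimal NumPy illustration of correlation in a cosine space, not the authors' implementation: the function names, tensor shapes, random features, and the multiplicative fusion at the end are all assumptions made for the example. The summary does not specify how the paper's network computes or fuses these maps.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cosine_correlation(pixel_feats, kernels):
    """Cosine similarity between every pixel feature and every kernel.

    pixel_feats: (H, W, C) feature map
    kernels:     (N, C) one vector per instance or per category
    returns:     (N, H, W) correlation maps with values in [-1, 1]
    """
    p = l2_normalize(pixel_feats)
    k = l2_normalize(kernels)
    return np.einsum("hwc,nc->nhw", p, k)


# Random stand-ins for learned features (assumption: 2 people, 4 categories).
rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
pixel_feats = rng.normal(size=(H, W, C))
instance_kernels = rng.normal(size=(2, C))
category_kernels = rng.normal(size=(4, C))

# Both kinds of kernels are compared against the same pixel features,
# so both outputs are pixel-level maps in the same similarity range.
inst_corr = cosine_correlation(pixel_feats, instance_kernels)  # (2, H, W)
cat_corr = cosine_correlation(pixel_feats, category_kernels)   # (4, H, W)

# One simple fusion choice (an assumption, not the paper's procedure):
# rescale similarities to [0, 1] and multiply to score each
# (instance, category) pair at each pixel.
inst_score = (inst_corr + 1.0) / 2.0
cat_score = (cat_corr + 1.0) / 2.0
fused = inst_score[:, None] * cat_score[None]  # (2, 4, H, W)
print(fused.shape)
```

Since instance and category maps share one similarity measure and one output shape, they can be supervised with the same kind of per-pixel label and combined directly, which is the intuition behind the unified output and joint optimization described above.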

The authors evaluate UniParser on two popular multi-human parsing datasets, MHPv2.0 and CIHP, and show that it outperforms state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP.
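AP-style segmentation metrics are generally built on the overlap between predicted and ground-truth masks. The summary does not describe the exact evaluation protocol used for MHPv2.0 or CIHP, so the sketch below shows only generic mask intersection-over-union (IoU). The helper name `mask_iou` and the toy masks are hypothetical.

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; define IoU as 0 to avoid dividing by zero.
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)


# Toy example: two 8-pixel masks that share one row of 4 pixels.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, :] = True
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, :] = True

iou = mask_iou(pred, gt)  # intersection 4, union 12
print(round(iou, 3))
```

A prediction is usually counted as correct when its overlap with a ground-truth mask exceeds a chosen threshold, and AP summarizes precision over the ranked predictions.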

Critical Analysis

The paper presents a well-designed and effective approach to the challenging task of multi-human parsing. By unifying the instance-level and category-level information, UniParser avoids the complexities and inefficiencies of prior methods that treated these aspects separately.

However, the paper does not discuss any significant limitations or potential drawbacks of the UniParser approach. For example, it would be valuable to understand how the method performs on more diverse or challenging datasets, or how it might scale to real-world applications with a larger number of people and categories.

Additionally, more insight into the key design choices and the reasons behind them, along with ablation studies or comparisons to alternative architectures, would help readers judge which components drive the reported gains.

Overall, the research presented in this paper is a significant contribution to the field of multi-human parsing, and the UniParser framework sets a new standard for integrating instance-level and category-level information. However, further exploration of the method's limitations and potential areas for improvement could strengthen the impact of this work.

Conclusion

The UniParser paper introduces a novel approach to multi-human parsing that unifies instance-level and category-level information in a shared representation and optimization framework. By avoiding the complexity of separate processing pipelines, UniParser achieves state-of-the-art performance on benchmark datasets, demonstrating the potential of integrated visual representations for complex image understanding tasks.

This research opens up new directions for unified visual understanding that could have widespread applications in areas like human-centric computer vision and scene analysis. As the field continues to evolve, the principles and techniques introduced in this paper are likely to influence the development of more efficient and effective approaches to multi-modal perception and reasoning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

UniParser: Multi-Human Parsing with Unified Correlation Representation Learning

Jiaming Chu, Lei Jin, Junliang Xing, Jian Zhao

Multi-human parsing is an image segmentation task necessitating both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through separate branches and distinct output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the form of outputs of each module as pixel-level segmentation results while supervising instance and category features using a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By virtue of unifying instance-level and category-level output, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We will release our source code, pretrained models, and online demos to facilitate future studies.

5/21/2024

You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception

Sheng Jin, Shuhuai Li, Tong Li, Wentao Liu, Chen Qian, Ping Luo

Human-centric perception (e.g. detection, segmentation, pose estimation, and attribute analysis) is a long-standing problem for computer vision. This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP). Our approach centers on learning a unified human query representation, denoted as Human Query, which captures intricate instance-level features for individual persons and disentangles complex multi-person scenarios. Although different HCP tasks have been well-studied individually, single-stage multi-task learning of HCP tasks has not been fully exploited in the literature due to the absence of a comprehensive benchmark dataset. To address this gap, we propose COCO-UniHuman benchmark to enable model development and comprehensive evaluation. Experimental results demonstrate the proposed method's state-of-the-art performance among multi-task HCP models and its competitive performance compared to task-specific HCP models. Moreover, our experiments underscore Human Query's adaptability to new HCP tasks, thus demonstrating its robust generalization capability. Codes and data are available at https://github.com/lishuhuai527/COCO-UniHuman.

7/16/2024

UniCorn: A Unified Contrastive Learning Approach for Multi-view Molecular Representation Learning

Shikun Feng, Yuyan Ni, Minghao Li, Yanwen Huang, Zhi-Ming Ma, Wei-Ying Ma, Yanyan Lan

Recently, a noticeable trend has emerged in developing pre-trained foundation models in the domains of CV and NLP. However, for molecular pre-training, there is still no universal model that applies effectively to various categories of molecular tasks, since existing prevalent pre-training methods are effective only for specific types of downstream tasks. Furthermore, the lack of profound understanding of existing pre-training methods, including 2D graph masking, 2D-3D contrastive learning, and 3D denoising, hampers the advancement of molecular foundation models. In this work, we provide a unified comprehension of existing pre-training methods through the lens of contrastive learning. Thus their distinctions lie in clustering different views of molecules, which is shown to be beneficial to specific downstream tasks. To achieve a complete and general-purpose molecular representation, we propose a novel pre-training framework, named UniCorn, that inherits the merits of the three methods, depicting molecular views in three different levels. SOTA performance across quantum, physicochemical, and biological tasks, along with a comprehensive ablation study, validates the universality and effectiveness of UniCorn.

5/20/2024

UniProcessor: A Text-induced Unified Low-level Image Processor

Huiyu Duan, Xiongkuo Min, Sijing Wu, Wei Shen, Guangtao Zhai

Image processing, including image restoration, image enhancement, etc., involves generating a high-quality clean image from a degraded input. Deep learning-based methods have shown superior performance for various image processing tasks in terms of single-task conditions. However, they require training separate models for different degradations and levels, which limits the generalization abilities of these models and restricts their applications in the real world. In this paper, we propose a text-induced unified image processor for low-level vision tasks, termed UniProcessor, which can effectively process various degradation types and levels, and support multimodal control. Specifically, our UniProcessor encodes degradation-specific information with the subject prompt and processes degradations with the manipulation prompt. These context control features are injected into the UniProcessor backbone via cross-attention to control the processing procedure. For automatic subject-prompt generation, we further build a vision-language model for general-purpose low-level degradation perception via instruction tuning techniques. Our UniProcessor covers 30 degradation types, and extensive experiments demonstrate that our UniProcessor can process these degradations well without additional training or tuning and outperforms other competing methods. Moreover, with the help of degradation-aware context control, our UniProcessor is the first to show the ability to individually handle a single distortion in an image with multiple degradations.

7/31/2024