Tell Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs

Read original: arXiv:2408.08575 - Published 8/19/2024 by Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

Overview

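Proposes a semantically disentangled image coding approach for machines that uses large multimodal models (LMMs) to tell the codec what to compress
Prompts the LMM to ground, identify, and rank objects by importance, then encodes the objects in that order in a structured bitstream
Compares the method with state-of-the-art codecs on vision tasks such as image classification, object detection, and instance segmentation, reporting more flexible reconstruction and broader task support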

Plain English Explanation

This paper introduces an approach to image coding for machines that is guided by large multimodal models (LMMs). LMMs are AI systems that understand and generate human language and can also process other modalities such as images and video.

The key idea is to compress images in a way that preserves the information most relevant to downstream machine analysis, rather than simply optimizing for human visual perception. This is achieved by "disentangling" the image content before compression: the LMM is prompted to locate the objects in the image, identify them, and rank them by importance.

By focusing the compression process on these semantically meaningful components, the authors report that their approach compares favorably with traditional image codecs on downstream vision tasks. In other words, the compressed images contain less data but remain useful for the analysis tasks that consume them.

This work is part of a broader effort to develop new visual coding paradigms that are tailored to the needs of modern AI systems, rather than simply optimizing for human visual perception. The potential benefits include more efficient data transmission, lower computational requirements, and better alignment between the visual representations and the AI system's internal conceptual understanding.

Technical Explanation

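The paper proposes an image coding framework that the authors name SDComp (Semantically Disentangled Compression). This summary refers to it as TCWWC, after the title phrase "Tell Codec What Worth Compressing". The framework semantically disentangles image content before compression, using an LMM to decide what matters.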

The key components of the TCWWC framework are:

Semantic Disentanglement Module: An LMM is prompted to perform object grounding, identification, and importance ranking, which separates the image content into objects before compression.
Prioritized Coding Module: Based on these semantic priors, the objects are encoded and transmitted in order of importance as a structured bitstream.
Reconstruction Module: The decoder rebuilds the image from the structured bitstream, preserving the semantically important objects while allowing more distortion in less important areas.
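
The ordering idea behind these components can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the `SemanticObject` fields, the header layout, and the use of `zlib` as a stand-in for a real image codec are all assumptions made here. The sketch sorts objects by an importance score (imagined as coming from an LMM) and packs them one after another, so a decoder can stop after the first few objects.

```python
import struct
import zlib
from dataclasses import dataclass

# Hypothetical header: label length, box (x, y, w, h), payload length.
HEADER = ">B4HI"
HEADER_SIZE = struct.calcsize(HEADER)


@dataclass
class SemanticObject:
    label: str         # object identity (assumed to come from an LMM prompt)
    box: tuple         # (x, y, w, h) location of the object
    importance: float  # higher means more important for the downstream task
    pixels: bytes      # raw bytes of the cropped region


def pack_bitstream(objects):
    """Encode objects in descending importance order."""
    ordered = sorted(objects, key=lambda o: o.importance, reverse=True)
    chunks = []
    for obj in ordered:
        label = obj.label.encode("utf-8")
        payload = zlib.compress(obj.pixels)  # stand-in for a real codec
        header = struct.pack(HEADER, len(label), *obj.box, len(payload))
        chunks.append(header + label + payload)
    return b"".join(chunks)


def unpack_bitstream(stream, max_objects=None):
    """Decode objects in order; optionally stop after max_objects."""
    decoded, offset = [], 0
    while offset < len(stream):
        if max_objects is not None and len(decoded) >= max_objects:
            break
        label_len, x, y, w, h, payload_len = struct.unpack_from(
            HEADER, stream, offset)
        offset += HEADER_SIZE
        label = stream[offset:offset + label_len].decode("utf-8")
        offset += label_len
        pixels = zlib.decompress(stream[offset:offset + payload_len])
        offset += payload_len
        decoded.append((label, (x, y, w, h), pixels))
    return decoded
```

Because the most important objects come first, calling `unpack_bitstream(stream, max_objects=1)` recovers only the top-ranked object, which is one way a structured bitstream could serve tasks that need only part of the image.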

The authors evaluate the framework on a variety of vision tasks, including image classification, object detection, and instance segmentation, and compare it with state-of-the-art codecs. They report more flexible reconstruction results, good decoded visual quality, and more general support for downstream analysis tasks.

Critical Analysis

The TCWWC framework addresses an important challenge in visual coding for AI systems. By moving beyond pixel-level compression and prioritizing semantically relevant information, the authors show a promising path towards more efficient and effective image representation for machine vision tasks.

However, the paper does not provide a comprehensive evaluation of the framework's limitations and potential downsides. For example, it would be helpful to understand how the semantic disentanglement step performs on a wider range of image types and datasets, and how sensitive the overall approach is to errors or biases in the LMM's object grounding and importance ranking.

Additionally, the authors do not discuss the computational and memory footprint of the TCWWC framework, which could be an important consideration for real-world deployment, especially on resource-constrained devices.

Further research could also explore the potential for user-controllable and versatile image coding, where the prioritization of semantic components can be adjusted based on the specific needs of the target LMM or application.
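
One way such adjustable prioritization might look is sketched below. Nothing here comes from the paper: the `allocate_bits` function, the `sharpness` knob, and the integer importance scores are hypothetical names and values chosen for illustration. The sketch splits a fixed bit budget across objects in proportion to their importance, with `sharpness` controlling how strongly the budget favors the top-ranked objects.

```python
def allocate_bits(importance, total_bits, sharpness=1.0):
    """Split total_bits across objects in proportion to importance**sharpness.

    importance: dict mapping object name to a positive score (hypothetical,
                e.g. derived from an LMM's importance ranking).
    sharpness:  0 gives an equal split; larger values concentrate the budget
                on the highest-scoring objects.
    """
    weights = {name: score ** sharpness for name, score in importance.items()}
    total_weight = sum(weights.values())
    # Round down so the allocation never exceeds the budget.
    return {name: int(total_bits * w / total_weight)
            for name, w in weights.items()}


scores = {"person": 6, "car": 3, "sky": 1}
balanced = allocate_bits(scores, total_bits=1000, sharpness=1.0)
focused = allocate_bits(scores, total_bits=1000, sharpness=2.0)
```

With these example scores, `focused` gives the person a larger share of the budget than `balanced` does, at the expense of the car and the sky. A real system would map such a budget to codec quality settings rather than raw bit counts.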

Conclusion

The TCWWC framework presented in this paper is a step towards more efficient and effective image coding for machines, guided by large multimodal models. By prioritizing semantically relevant information, the approach is reported to compare favorably with traditional image codecs on downstream vision tasks.

This work highlights the potential benefits of developing visual coding paradigms that are tailored to the needs of modern AI systems, rather than solely optimizing for human visual perception. As LMMs and other AI models become increasingly prevalent, such domain-specific image coding techniques could play a crucial role in enabling more efficient and effective visual processing capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Tell Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs

Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

We present a new image compression paradigm to achieve "intelligently coding for machine" by cleverly leveraging the common sense of Large Multimodal Models (LMMs). We are motivated by the evidence that large language/multimodal models are powerful general-purpose semantics predictors for understanding the real world. Different from traditional image compression typically optimized for human eyes, the image coding for machines (ICM) framework we focus on requires the compressed bitstream to more comply with different downstream intelligent analysis tasks. To this end, we employ LMM to tell codec what to compress: 1) first utilize the powerful semantic understanding capability of LMMs w.r.t object grounding, identification, and importance ranking via prompts, to disentangle image content before compression, 2) and then based on these semantic priors we accordingly encode and transmit objects of the image in order with a structured bitstream. In this way, diverse vision benchmarks including image classification, object detection, instance segmentation, etc., can be well supported with such a semantically structured bitstream. We dub our method "SDComp" for "Semantically Disentangled Compression", and compare it with state-of-the-art codecs on a wide variety of different vision tasks. SDComp codec leads to more flexible reconstruction results, promised decoded visual quality, and a more generic/satisfactory intelligent task-supporting ability.

8/19/2024

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Pingping Zhang, Jinlong Li, Meng Wang, Nicu Sebe, Sam Kwong, Shiqi Wang

Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including Text-Text-to-Video (TT2V) mode to ensure high-quality semantic information and Image-Text-to-Video (IT2V) mode to achieve superb perceptual consistency. In addition, we propose an efficient frame interpolation model for IT2V mode via Low-Rank Adaption (LoRA) tuning to guarantee perceptual quality, which allows the generated motion cues to behave smoothly. Experiments on benchmarks indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency. These results highlight potential directions for future research in video coding.

8/16/2024

CMC-Bench: Towards a New Paradigm of Visual Signal Compression

Chunyi Li, Xiele Wu, Haoning Wu, Donghui Feng, Zicheng Zhang, Guo Lu, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, Weisi Lin

Ultra-low bitrate image compression is a challenging and demanding topic. With the development of Large Multimodal Models (LMMs), a Cross Modality Compression (CMC) paradigm of Image-Text-Image has emerged. Compared with traditional codecs, this semantic-level compression can reduce image data size to 0.1% or even lower, which has strong potential applications. However, CMC has certain defects in consistency with the original image and perceptual quality. To address this problem, we introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression. This benchmark covers 18,000 and 40,000 images respectively to verify 6 mainstream I2T and 12 T2I models, including 160,000 subjective preference scores annotated by human experts. At ultra-low bitrates, this paper proves that the combination of some I2T and T2I models has surpassed the most advanced visual signal codecs; meanwhile, it highlights where LMMs can be further optimized toward the compression task. We encourage LMM developers to participate in this test to promote the evolution of visual signal codec protocols.

6/14/2024

High Efficiency Image Compression for Large Visual-Language Models

Binzhe Li, Shurun Wang, Shiqi Wang, Yan Ye

In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer to propose a variable bitrate image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed based on the representation and discrimination capability with token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained by the losses based on semantic tokens of the large model, which introduce enhanced generalization capability for various data and tasks. Experimental results demonstrate that the proposed framework could efficiently achieve much better rate-accuracy performance compared to the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments with multi-modal tasks have revealed the robustness and generalization capability of the proposed framework.

7/25/2024