CMC-Bench: Towards a New Paradigm of Visual Signal Compression

Read original: arXiv:2406.09356 - Published 6/14/2024 by Chunyi Li, Xiele Wu, Haoning Wu, Donghui Feng, Zicheng Zhang, Guo Lu, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, Weisi Lin

Overview

The paper proposes CMC-Bench, a benchmark for evaluating visual signal compression. According to the abstract, it measures how well Image-to-Text (I2T) and Text-to-Image (T2I) models cooperate to compress images.
CMC-Bench aims to provide a more comprehensive and challenging test of compression methods than existing benchmarks.
The benchmark covers diverse visual content, such as natural images, computer-generated images, and text-based images.
The authors argue that existing compression benchmarks are limited in both the types of content they cover and the scope of their evaluation.

Plain English Explanation

The researchers created CMC-Bench to evaluate how well different algorithms compress visual content such as images. The goal is a more comprehensive and challenging way to assess compression methods.

Existing benchmarks are often limited in the types of visual content they cover and in the scope of their evaluation. CMC-Bench addresses these limitations by covering a wider range of content: natural images, computer-generated images, and text-based images.

By having a more diverse and challenging dataset, the researchers hope to better understand the strengths and weaknesses of different compression algorithms and push the field of visual signal compression forward. This could lead to more efficient and effective compression methods that can be used in a variety of applications, such as text-guided image encoding or video compression.

Technical Explanation

The paper introduces a new dataset called CMC-Bench that is designed to evaluate the performance of visual signal compression algorithms. The dataset includes a diverse range of visual content, including natural images, computer-generated images, and text-based images. This is in contrast to existing compression benchmarks, which often have a more limited scope in terms of the types of visual content they include.

The authors argue that the diversity and complexity of CMC-Bench allow a more comprehensive and challenging evaluation of compression methods. For example, text-based images test how well a compression algorithm handles content with both visual and textual elements, a setting related to work on language-oriented semantic latent representations.

The paper also describes how CMC-Bench was constructed, including the selection of source data and the curation of the final set. Per the abstract, the benchmark covers 18,000 and 40,000 images to verify 6 mainstream I2T and 12 T2I models, respectively, and includes 160,000 subjective preference scores annotated by human experts. The authors compare these model combinations against state-of-the-art visual signal codecs and analyze the results in detail.
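The paper's abstract describes the scheme being benchmarked as an Image-Text-Image round trip: an I2T model turns the image into text, the text is the transmitted payload, and a T2I model reconstructs an image from it. The toy sketch below illustrates only that data flow. The functions `toy_i2t` and `toy_t2i` are hypothetical stand-ins (a mean-colour caption and a flat-colour renderer), not the models or code evaluated in the paper.

```python
# Toy Image-Text-Image round trip. Everything here is illustrative:
# toy_i2t / toy_t2i are hypothetical stand-ins for real I2T and T2I models.
import re
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels


def toy_i2t(image: Image) -> str:
    """'Encoder': describe the image by its size and mean colour."""
    height, width = len(image), len(image[0])
    n = height * width
    mean = tuple(sum(px[c] for row in image for px in row) // n for c in range(3))
    return f"a {width}x{height} image, mostly rgb({mean[0]},{mean[1]},{mean[2]})"


def toy_t2i(caption: str) -> Image:
    """'Decoder': rebuild a flat-colour image from the caption text."""
    w, h, r, g, b = map(int, re.findall(r"\d+", caption))
    return [[(r, g, b) for _ in range(w)] for _ in range(h)]


def round_trip(image: Image) -> Tuple[Image, int]:
    """Return the reconstruction and the payload size in bytes."""
    caption = toy_i2t(image)
    payload = caption.encode("utf-8")  # the text is all that is transmitted
    return toy_t2i(payload.decode("utf-8")), len(payload)


if __name__ == "__main__":
    original = [[(200, 30, 30)] * 4 for _ in range(4)]
    original[0][0] = (0, 0, 0)  # one detail the caption cannot carry
    restored, payload_bytes = round_trip(original)
    raw_bytes = 4 * 4 * 3
    print(f"payload {payload_bytes} B vs raw {raw_bytes} B")
    print("exact match:", restored == original)
```

The single black pixel is lost in the reconstruction, a small-scale version of the consistency problem the abstract attributes to CMC. On an image this tiny the caption is barely smaller than the raw pixels; the large savings the abstract describes only appear when a short text stands in for a full-size image.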

Critical Analysis

The CMC-Bench dataset proposed in this paper represents an important step forward in the field of visual signal compression. By including a more diverse and challenging set of visual content, the dataset can help researchers and practitioners better understand the strengths and weaknesses of different compression algorithms.

However, the paper does acknowledge some limitations of the dataset, such as the potential for biases in the selection of source data. Additionally, the authors note that the evaluation of compression algorithms on CMC-Bench is primarily focused on objective metrics, such as compression ratio and image quality, and may not fully capture the user experience or practical considerations of real-world applications.
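The objective metrics mentioned above can be made concrete with a little arithmetic. The sketch below computes compression ratio, bits per pixel, and PSNR. PSNR is used here only as one familiar full-reference quality measure; it is an assumption, not a metric this summary attributes to the paper. The image and payload sizes in the example are illustrative values, not results from CMC-Bench.

```python
import math
from typing import Sequence


def compression_ratio(raw_bytes: int, payload_bytes: int) -> float:
    """How many times smaller the payload is than the raw data."""
    return raw_bytes / payload_bytes


def bits_per_pixel(payload_bytes: int, width: int, height: int) -> float:
    """Payload size in bits, averaged over the pixels of the image."""
    return payload_bytes * 8 / (width * height)


def psnr(reference: Sequence[int], distorted: Sequence[int], peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, distorted)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    width, height = 1024, 1024      # illustrative image size
    raw = width * height * 3        # uncompressed 8-bit RGB
    payload = 200                   # illustrative caption size in bytes
    print("ratio:", compression_ratio(raw, payload))
    print("bpp:  ", bits_per_pixel(payload, width, height))
    print("size as % of raw:", 100 * payload / raw)
```

Pixel-wise measures such as PSNR penalize any reconstruction that differs in detail from the source, which is one reason a purely objective evaluation may not reflect how a generated reconstruction looks to a viewer.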

Further research could explore ways to incorporate more user-centric evaluations or to investigate the performance of compression algorithms on emerging types of visual content, such as 3D models or virtual environments.

Conclusion

CMC-Bench gives researchers and practitioners a more diverse and challenging benchmark for evaluating compression algorithms. That should help them develop more efficient and effective methods for real-world image and video compression.

The insights gained from the evaluation of compression algorithms on the CMC-Bench dataset could also have broader implications for the field of multimedia processing and the development of advanced content delivery systems. As the demand for high-quality visual content continues to grow, the ability to compress and transmit this information efficiently will become increasingly important.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CMC-Bench: Towards a New Paradigm of Visual Signal Compression

Chunyi Li, Xiele Wu, Haoning Wu, Donghui Feng, Zicheng Zhang, Guo Lu, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, Weisi Lin

Ultra-low bitrate image compression is a challenging and demanding topic. With the development of Large Multimodal Models (LMMs), a Cross Modality Compression (CMC) paradigm of Image-Text-Image has emerged. Compared with traditional codecs, this semantic-level compression can reduce image data size to 0.1% or even lower, which has strong potential applications. However, CMC has certain defects in consistency with the original image and perceptual quality. To address this problem, we introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression. This benchmark covers 18,000 and 40,000 images respectively to verify 6 mainstream I2T and 12 T2I models, including 160,000 subjective preference scores annotated by human experts. At ultra-low bitrates, this paper proves that the combination of some I2T and T2I models has surpassed the most advanced visual signal codecs; meanwhile, it highlights where LMMs can be further optimized toward the compression task. We encourage LMM developers to participate in this test to promote the evolution of visual signal codec protocols.

6/14/2024

Tell Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs

Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

We present a new image compression paradigm to achieve "intelligently coding for machine" by cleverly leveraging the common sense of Large Multimodal Models (LMMs). We are motivated by the evidence that large language/multimodal models are powerful general-purpose semantics predictors for understanding the real world. Different from traditional image compression typically optimized for human eyes, the image coding for machines (ICM) framework we focus on requires the compressed bitstream to more comply with different downstream intelligent analysis tasks. To this end, we employ LMM to tell codec what to compress: 1) first utilize the powerful semantic understanding capability of LMMs w.r.t object grounding, identification, and importance ranking via prompts, to disentangle image content before compression, 2) and then based on these semantic priors we accordingly encode and transmit objects of the image in order with a structured bitstream. In this way, diverse vision benchmarks including image classification, object detection, instance segmentation, etc., can be well supported with such a semantically structured bitstream. We dub our method "SDComp" for "Semantically Disentangled Compression", and compare it with state-of-the-art codecs on a wide variety of different vision tasks. SDComp codec leads to more flexible reconstruction results, promised decoded visual quality, and a more generic/satisfactory intelligent task-supporting ability.

8/19/2024

When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Pingping Zhang, Jinlong Li, Meng Wang, Nicu Sebe, Sam Kwong, Shiqi Wang

Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including Text-Text-to-Video (TT2V) mode to ensure high-quality semantic information and Image-Text-to-Video (IT2V) mode to achieve superb perceptual consistency. In addition, we propose an efficient frame interpolation model for IT2V mode via Low-Rank Adaption (LoRA) tuning to guarantee perceptual quality, which allows the generated motion cues to behave smoothly. Experiments on benchmarks indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency. These results highlight potential directions for future research in video coding.

8/16/2024

High Efficiency Image Compression for Large Visual-Language Models

Binzhe Li, Shurun Wang, Shiqi Wang, Yan Ye

In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer to propose a variable bitrate image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed based on the representation and discrimination capability with token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained by the losses based on semantic tokens of the large model, which introduce enhanced generalization capability for various data and tasks. Experimental results demonstrate that the proposed framework could efficiently achieve much better rate-accuracy performance compared to the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments with multi-modal tasks have revealed the robustness and generalization capability of the proposed framework.

7/25/2024