Learned Image Transmission with Hierarchical Variational Autoencoder

Read original: arXiv:2408.16340 - Published 9/11/2024 by Guangyi Zhang, Hanlei Li, Yunlong Cai, Qiyu Hu, Guanding Yu, Runmin Zhang

Overview

The paper proposes a hierarchical joint source-channel coding (HJSCC) framework for learned image transmission, built on a hierarchical variational autoencoder (HVAE).
The transmitter learns multiple hierarchical latent representations of the input image, which a JSCC encoder maps directly to channel symbols for transmission and reconstruction at the receiver.
The hierarchical structure lets the transmitter adapt its bandwidth by encoding these representations into a varying number of channel symbols, so the image can be roughly reconstructed from fewer symbols and refined with more.

Plain English Explanation

The researchers have developed a way to send images over a noisy network link more efficiently. Their method uses a special type of neural network called a hierarchical variational autoencoder (HVAE). This network learns to compress the image into a smaller, encoded form that can be transmitted using less bandwidth.

On the receiving end, the HVAE can then reconstruct the original image from the received data. Importantly, the hierarchical structure of the HVAE allows the image to be reconstructed in stages. This means the receiver can get a rough version of the image first, using less bandwidth, and then gradually refine it as more data arrives. This can be useful in situations with limited bandwidth, where you want to prioritize getting some information through quickly rather than waiting for the full high-quality image.

The researchers tested their model on images of varying resolutions and report that it outperforms existing baselines in rate-distortion performance while remaining robust to channel noise. This suggests their approach could be valuable for applications like image compression and remote visualization.

Technical Explanation

The core of the researchers' approach is the hierarchical variational autoencoder (HVAE) model. This consists of an encoder network that compresses the input image into a latent representation, and a decoder network that reconstructs the image from the latent code.

The key innovation is the hierarchical structure of the HVAE. At the transmitter, a bottom-up path and a top-down path together generate multiple latent representations autoregressively, with each level capturing features at a different scale or resolution. This allows the model to progressively refine the image reconstruction as more channel symbols are transmitted.

For example, the first layer of the decoder might produce a low-resolution version of the image using a small number of channel symbols. Subsequent layers can then add higher-frequency details to gradually improve the reconstruction quality.
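The coarse-to-fine idea can be illustrated with a tiny hand-built example. The sketch below is only an analogy: it uses a fixed two-level pyramid (2x2 averaging plus a residual) rather than the learned HVAE, and the 4x4 image and all function names are hypothetical. It shows that a receiver holding only the coarse level gets a rough image, while adding the detail level restores it.

```python
# Illustrative coarse-to-fine reconstruction on a tiny grayscale image.
# This is a hand-built two-level pyramid, NOT the learned HVAE from the paper.

def downsample(img):
    """Average each 2x2 block to halve the resolution."""
    h, w = len(img), len(img[0])
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]

def upsample(img):
    """Nearest-neighbour upsampling: repeat each pixel in a 2x2 block."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def mse(a, b):
    """Mean squared error between two equally sized images."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

# Hypothetical 4x4 image with values in [0, 1].
image = [
    [0.0, 0.2, 0.8, 1.0],
    [0.1, 0.3, 0.9, 0.7],
    [0.5, 0.5, 0.2, 0.0],
    [0.4, 0.6, 0.1, 0.3],
]

# Level 1 (coarse): a 2x2 summary, i.e. a quarter of the values.
coarse = downsample(image)
# Level 2 (detail): whatever the upsampled coarse level fails to explain.
prediction = upsample(coarse)
residual = [[x - p for x, p in zip(ri, rp)] for ri, rp in zip(image, prediction)]

# Receiver that has only the coarse level.
rough = upsample(coarse)
# Receiver that has both levels.
refined = [[p + d for p, d in zip(rp, rd)] for rp, rd in zip(rough, residual)]

print("MSE, coarse only:    ", mse(image, rough))
print("MSE, coarse + detail:", mse(image, refined))
```

In a learned system the levels would be produced by neural networks and the detail level would itself be lossy, so the refined image would improve on the rough one without matching the original exactly.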

The researchers trained the HVAE end-to-end using a variational learning objective, which encourages the latent representation to be compact and informative. According to the abstract, experiments on images of varying resolutions show that the model outperforms existing baselines in rate-distortion performance and remains robust to channel noise.
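To make the shape of such an objective concrete, here is a minimal sketch of a generic hierarchical VAE-style loss: a distortion term plus a weighted sum of KL divergences, one per latent level. It is not the paper's exact loss. The scalar Gaussian latents, the fixed prior parameters, the weight `beta`, and all names and numbers are assumptions made for illustration.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) in nats, for scalars."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def hierarchical_loss(x, x_hat, posteriors, priors, beta=1.0):
    """Distortion plus beta-weighted KL summed over all latent levels.

    posteriors / priors: one list per level of (mu, sigma) pairs.
    In a real hierarchical VAE the prior at each level is predicted from
    the levels above it; here the priors are simply given as numbers.
    """
    distortion = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    rate = sum(
        gaussian_kl(mq, sq, mp, sp)
        for level_q, level_p in zip(posteriors, priors)
        for (mq, sq), (mp, sp) in zip(level_q, level_p)
    )
    return distortion + beta * rate, distortion, rate

# Hypothetical two-level example: one coarse latent, two fine latents.
x = [0.2, 0.8, 0.5]
x_hat = [0.25, 0.7, 0.5]
posteriors = [[(0.3, 0.5)], [(0.1, 0.8), (-0.2, 0.9)]]
priors = [[(0.0, 1.0)], [(0.0, 1.0), (0.0, 1.0)]]

total, distortion, rate = hierarchical_loss(x, x_hat, posteriors, priors, beta=0.01)
print("distortion (MSE):", distortion)
print("rate (sum of KL, nats):", rate)
print("total loss:", total)
```

The weight `beta` sets the trade-off: a larger value pushes the posteriors toward the priors (a more compact representation) at the cost of higher distortion.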

Critical Analysis

The paper provides a thorough technical explanation of the HVAE architecture and its application to learned image transmission. The hierarchical structure is a clever way to enable progressive reconstruction, which could be very useful in bandwidth-constrained scenarios.

That said, the practical limitations and potential downsides of the approach deserve more attention. The abstract reports experiments on images of varying resolutions and robustness against channel noise, but it does not say how far that robustness extends, for example across different channel noise conditions or to very high resolutions.
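Channel noise conditions are usually stated as a signal-to-noise ratio (SNR) in dB. As a rough sketch of what testing under different conditions involves, the code below passes power-normalized symbols through an additive white Gaussian noise (AWGN) channel at several SNRs. The AWGN model, the real-valued symbols, and all names are assumptions for illustration; the summary does not state which channel models the paper evaluates.

```python
import math
import random

def normalize_power(symbols):
    """Scale real-valued symbols to unit average power."""
    power = sum(s * s for s in symbols) / len(symbols)
    scale = 1.0 / math.sqrt(power)
    return [s * scale for s in symbols]

def awgn(symbols, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power = SNR.

    Assumes the input symbols already have unit average power.
    """
    noise_power = 10.0 ** (-snr_db / 10.0)
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in symbols]

rng = random.Random(0)  # fixed seed so runs are repeatable

# Hypothetical stand-in for the encoder output: random values in [-1, 1].
latent = [rng.uniform(-1.0, 1.0) for _ in range(10000)]
tx = normalize_power(latent)

for snr_db in (0, 10, 20):
    rx = awgn(tx, snr_db, rng)
    # Per-symbol squared error between sent and received symbols.
    err = sum((a - b) ** 2 for a, b in zip(tx, rx)) / len(tx)
    print(f"SNR {snr_db:>2} dB -> mean squared symbol error {err:.4f}")
```

A robustness study would replace the random `latent` with real encoder outputs and measure reconstruction quality at the decoder for each SNR, ideally including SNRs the model was not trained on.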

Additionally, while the rate-distortion results are promising, the paper does not contextualize the improvements relative to the complexity and computational cost of the HVAE model. It would be helpful to understand the tradeoffs in terms of encoding/decoding speed, memory usage, and other practical deployment considerations.

Overall, the research represents an interesting advance in learned image coding, but further analysis of the method's real-world applicability and limitations would strengthen the paper.

Conclusion

The proposed hierarchical variational autoencoder (HVAE) model offers a novel approach to learned image transmission that enables progressive image reconstruction. By learning a multi-scale latent representation of the input image, the HVAE can transmit and reconstruct images using a variable number of channel symbols.

The technical evaluation demonstrates the HVAE's strong rate-distortion performance compared to prior work. This suggests the method could be valuable for applications like image compression, remote visualization, and other scenarios where bandwidth is limited.

While the paper provides a solid technical foundation, further research is needed to fully understand the practical implications and tradeoffs of the HVAE approach. Exploring scalability, the limits of its robustness to noise, and deployment efficiency would help contextualize the significance of this contribution to the field of learned image coding.

Related Papers

Learned Image Transmission with Hierarchical Variational Autoencoder

Guangyi Zhang, Hanlei Li, Yunlong Cai, Qiyu Hu, Guanding Yu, Runmin Zhang

In this paper, we introduce an innovative hierarchical joint source-channel coding (HJSCC) framework for image transmission, utilizing a hierarchical variational autoencoder (VAE). Our approach leverages a combination of bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly mapped to channel symbols for transmission by the JSCC encoder. We extend this framework to scenarios with a feedback link, modeling transmission over a noisy channel as a probabilistic sampling process and deriving a novel generative formulation for JSCC with feedback. Compared with existing approaches, our proposed HJSCC provides enhanced adaptability by dynamically adjusting transmission bandwidth, encoding these representations into varying amounts of channel symbols. Extensive experiments on images of varying resolutions demonstrate that our proposed model outperforms existing baselines in rate-distortion performance and maintains robustness against channel noise. The source code will be made available upon acceptance.

9/11/2024

Deep Joint Source-Channel Coding for Adaptive Image Transmission over MIMO Channels

Haotian Wu, Yulin Shao, Chenghong Bian, Krystian Mikolajczyk, Deniz Gunduz

This paper introduces a vision transformer (ViT)-based deep joint source and channel coding (DeepJSCC) scheme for wireless image transmission over multiple-input multiple-output (MIMO) channels, denoted as DeepJSCC-MIMO. We consider DeepJSCC-MIMO for adaptive image transmission in both open-loop and closed-loop MIMO systems. The novel DeepJSCC-MIMO architecture surpasses the classical separation-based benchmarks with robustness to channel estimation errors and showcases remarkable flexibility in adapting to diverse channel conditions and antenna numbers without requiring retraining. Specifically, by harnessing the self-attention mechanism of ViT, DeepJSCC-MIMO intelligently learns feature mapping and power allocation strategies tailored to the unique characteristics of the source image and prevailing channel conditions. Extensive numerical experiments validate the significant improvements in transmission quality achieved by DeepJSCC-MIMO for both open-loop and closed-loop MIMO systems across a wide range of scenarios. Moreover, DeepJSCC-MIMO exhibits robustness to varying channel conditions, channel estimation errors, and different antenna numbers, making it an appealing solution for emerging semantic communication systems.

7/16/2024

Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission

Mingyu Yang, Bowen Liu, Boyang Wang, Hun-Seok Kim

Deep learning-based joint source-channel coding (deep JSCC) has been demonstrated to be an effective approach for wireless image transmission. Nevertheless, most existing work adopts an autoencoder framework to optimize conventional criteria such as Mean Squared Error (MSE) and Structural Similarity Index (SSIM), which do not suffice to maintain the perceptual quality of reconstructed images. Such an issue is more prominent under stringent bandwidth constraints or low signal-to-noise ratio (SNR) conditions. To tackle this challenge, we propose DiffJSCC, a novel framework that leverages the prior knowledge of the pre-trained Stable Diffusion model to produce high-realism images via the conditional diffusion denoising process. Our DiffJSCC first extracts multimodal spatial and textual features from the noisy channel symbols in the generation phase. Then, it produces an initial reconstructed image as an intermediate representation to aid robust feature extraction and a stable training process. In the following diffusion step, DiffJSCC uses the derived multimodal features, together with channel state information such as the signal-to-noise ratio (SNR), as conditions to guide the denoising diffusion process, which converts the initial random noise to the final reconstruction. DiffJSCC employs a novel control module to fine-tune the Stable Diffusion model and adjust it to the multimodal conditions. Extensive experiments on diverse datasets reveal that our method significantly surpasses prior deep JSCC approaches on both perceptual metrics and downstream task performance, showcasing its ability to preserve the semantics of the original transmitted images. Notably, DiffJSCC can achieve highly realistic reconstructions for 768x512 pixel Kodak images with only 3072 symbols (<0.008 symbols per pixel) under 1dB SNR channels.

7/18/2024

Discriminative Hamiltonian Variational Autoencoder for Accurate Tumor Segmentation in Data-Scarce Regimes

Aghiles Kebaili, Jérôme Lapuyade-Lahorgue, Pierre Vera, Su Ruan

Deep learning has gained significant attention in medical image segmentation. However, the limited availability of annotated training data presents a challenge to achieving accurate results. In efforts to overcome this challenge, data augmentation techniques have been proposed. However, the majority of these approaches primarily focus on image generation. For segmentation tasks, providing both images and their corresponding target masks is crucial, and the generation of diverse and realistic samples remains a complex task, especially when working with limited training datasets. To this end, we propose a new end-to-end hybrid architecture based on Hamiltonian Variational Autoencoders (HVAE) and a discriminative regularization to improve the quality of generated images. Our method provides an accurate estimation of the joint distribution of the images and masks, resulting in the generation of realistic medical images with reduced artifacts and off-distribution instances. As generating 3D volumes requires substantial time and memory, our architecture operates on a slice-by-slice basis to segment 3D volumes, capitalizing on the richly augmented dataset. Experiments conducted on two public datasets, BRATS (MRI modality) and HECKTOR (PET modality), demonstrate the efficacy of our proposed method on different medical imaging modalities with limited data.

6/18/2024