Hierarchical B-frame Video Coding for Long Group of Pictures

Read original: arXiv:2406.16544 - Published 6/26/2024 by Ivan Kirillov, Denis Parkhomenko, Kirill Chernyshev, Alexander Pletnev, Yibo Shi, Kai Lin, Dmitry Babin

Overview

This paper presents an end-to-end learned video codec for the random-access (RA) scenario, built on hierarchical B-frame coding over a long Group of Pictures (GOP).
The method uses a hierarchical structure with multiple levels of B-frames, so that bits can be allocated according to each frame's importance in the hierarchy.
The authors report compression performance competitive with VTM, the VVC reference software, under the JVET common test conditions.

Plain English Explanation

Video compression is important for efficiently storing and transmitting video content, especially for long videos. Traditional video coding methods often use a Group of Pictures (GOP) structure, where a sequence of frames is encoded together. However, because frames within a GOP are predicted from one another, a decoder cannot jump straight to an arbitrary frame; it must first decode the frames that the target depends on. The ability to start decoding quickly at a chosen point is known as random access.

The researchers in this paper propose a new hierarchical video coding technique that uses multiple levels of B-frames (bi-directional frames that can reference both past and future frames). By arranging the frames in a hierarchical structure, the system can adaptively encode and decode the video based on the importance of different frames. This allows for better compression efficiency while still enabling faster random access compared to traditional GOP-based coding.

The hierarchical structure means that some frames are more critical than others for reconstructing the full video. The system can prioritize the encoding of these key frames to maintain video quality, while using fewer bits for less important frames. Balancing compression efficiency against random access in this way is the central idea of this work.

Technical Explanation

The paper introduces a hierarchical B-frame video coding technique that aims to improve both compression efficiency and random access for long GOP video sequences. The method uses a multi-level hierarchy of B-frames, where frames at higher levels of the hierarchy serve as references for the frames below them and therefore matter more for reconstructing the full video.
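To make the multi-level hierarchy concrete, here is a minimal sketch of a dyadic hierarchical-B layout, the arrangement commonly used in random-access configurations. It is an illustration under stated assumptions, not the paper's implementation: the function name `hierarchical_gop`, the power-of-two GOP size, and the convention that level 0 is the top of the hierarchy are all choices made for this example.

```python
def hierarchical_gop(gop_size):
    """Return (frame_index, level, references) tuples in coding order
    for one dyadic hierarchical-B GOP covering frames 0..gop_size.

    Level 0 is the top of the hierarchy (the two GOP boundary frames);
    larger level numbers are lower in the hierarchy.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")

    # Boundary frames come first: frame 0 has no references,
    # frame gop_size is predicted from frame 0 (an assumption here).
    order = [(0, 0, ()), (gop_size, 0, (0,))]

    def split(left, right, level):
        # The middle frame is a B-frame referencing both ends of the span.
        if right - left < 2:
            return
        mid = (left + right) // 2
        order.append((mid, level, (left, right)))
        split(left, mid, level + 1)
        split(mid, right, level + 1)

    split(0, gop_size, 1)
    return order


if __name__ == "__main__":
    for frame, level, refs in hierarchical_gop(8):
        print(f"frame {frame}: level {level}, references {refs}")
```

Note that the coding order differs from the display order: a B-frame can only be decoded after both of its references, so the later boundary frame is coded before the frames that precede it on screen.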

The authors design an adaptive coding and decoding process that treats frames according to their position in the hierarchy. Key frames at the top of the hierarchy are encoded at higher quality, since errors in them propagate to every frame predicted from them, while frames lower in the hierarchy are encoded with fewer bits. This allows the system to achieve better overall compression compared to traditional GOP-based coding.
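One conventional way to express "more bits at the top, fewer at the bottom" is to raise the quantization parameter (QP) with each level of the hierarchy. The sketch below shows that idea only; it is not the rate allocation used in the paper, whose details are not given in this summary. The names `temporal_level` and `qp_for_frame`, the base QP of 32, and the step of 1 per level are hypothetical values chosen for illustration.

```python
def temporal_level(frame, gop_size):
    """Level of a frame in a dyadic hierarchy (0 = top).

    Assumes gop_size is a power of two and that GOP boundaries fall on
    multiples of gop_size.
    """
    pos = frame % gop_size
    if pos == 0:
        return 0  # GOP boundary frame
    depth = gop_size.bit_length() - 1            # log2(gop_size)
    trailing_zeros = (pos & -pos).bit_length() - 1
    return depth - trailing_zeros


def qp_for_frame(frame, gop_size, base_qp=32, step=1):
    """Illustrative QP schedule: a higher QP (coarser quantization, fewer
    bits) for frames lower in the hierarchy. base_qp and step are
    made-up example values."""
    return base_qp + step * temporal_level(frame, gop_size)


if __name__ == "__main__":
    print([qp_for_frame(f, 8) for f in range(9)])
```

With a GOP of 8, the boundary frames get the base QP, the middle frame gets one step more, and the odd-numbered frames at the bottom of the hierarchy get the coarsest quantization.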

Crucially, the hierarchical structure also enables faster random access, because the decoder needs only the frames on a target frame's reference chain: the frames at the GOP boundaries plus the B-frames above the target in the hierarchy. In a purely sequential prediction chain, by contrast, every frame between the last intra-coded frame and the target must be decoded first.
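The random-access benefit can be illustrated by counting how many frames must be decoded to reach one target frame. The sketch below compares a dyadic hierarchical GOP with a sequential chain in which each frame references its predecessor. It assumes a power-of-two GOP size, an independently decodable frame 0, and a boundary frame at `gop_size` predicted from frame 0; the function names are hypothetical and the model ignores any extra reference frames a real codec might use.

```python
def needed_hierarchical(target, gop_size):
    """Frames to decode, in order, to reconstruct `target` in a dyadic
    hierarchical-B GOP spanning frames 0..gop_size."""
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    if not 0 <= target <= gop_size:
        raise ValueError("target must lie inside the GOP")
    if target == 0:
        return [0]
    needed = [0, gop_size]
    if target == gop_size:
        return needed
    left, right = 0, gop_size
    while True:
        # Descend the hierarchy toward the target, one level per step.
        mid = (left + right) // 2
        needed.append(mid)
        if mid == target:
            return needed
        if target < mid:
            right = mid
        else:
            left = mid


def needed_sequential(target):
    """Frames to decode when every frame references its predecessor and
    frame 0 is the only intra-coded frame."""
    return list(range(target + 1))


if __name__ == "__main__":
    print(needed_hierarchical(31, 32))     # short reference chain
    print(len(needed_sequential(31)))      # every earlier frame
```

In this simplified model the hierarchical chain grows with the logarithm of the GOP size, while the sequential chain grows linearly with the distance from the intra frame.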

The authors report that the codec combines training on long sequences of frames, rate allocation designed for hierarchical coding, and content adaptation at inference time. Under the JVET common test conditions (JVET-CTC), it achieves results comparable to VTM, the VVC reference software, in YUV-PSNR BD-Rate on some classes of videos, and outperforms it on almost all test sets in VMAF BD-Rate.
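The paper's abstract states its results as BD-Rate, the average bitrate difference between two codecs at equal quality, where a negative value means the tested codec needs fewer bits than the anchor. The sketch below is a simplified version for intuition: the standard Bjøntegaard calculation fits a cubic curve through the rate-quality points, whereas this example swaps in piecewise-linear interpolation of log-bitrate to stay dependency-free. The function names and the sample rate-quality points are made up for illustration and are not taken from the paper.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the curve")


def _integral(xs, ys, lo, hi):
    """Integrate the piecewise-linear curve (xs, ys) over [lo, hi]."""
    knots = [lo] + [x for x in xs if lo < x < hi] + [hi]
    vals = [_interp(x, xs, ys) for x in knots]
    return sum((b - a) * (va + vb) / 2
               for a, b, va, vb in zip(knots, knots[1:], vals, vals[1:]))


def bd_rate(anchor, test):
    """Simplified BD-Rate in percent.

    anchor, test: lists of (bitrate, quality) points with distinct
    quality values, e.g. quality measured as PSNR in dB.
    """
    a = sorted((q, math.log(r)) for r, q in anchor)
    t = sorted((q, math.log(r)) for r, q in test)
    aq, ar = zip(*a)
    tq, tr = zip(*t)
    # Compare only over the quality range covered by both curves.
    lo, hi = max(aq[0], tq[0]), min(aq[-1], tq[-1])
    if lo >= hi:
        raise ValueError("quality ranges do not overlap")
    avg_diff = (_integral(tq, tr, lo, hi) - _integral(aq, ar, lo, hi)) / (hi - lo)
    return (math.exp(avg_diff) - 1) * 100


if __name__ == "__main__":
    # Made-up example points: (bitrate in kbps, PSNR in dB).
    anchor = [(1000, 32.0), (2000, 35.0), (4000, 38.0), (8000, 41.0)]
    test = [(0.9 * r, q) for r, q in anchor]
    print(f"{bd_rate(anchor, test):.2f}%")
```

Because the comparison is made in the log-bitrate domain, a codec that reaches every quality level at a constant fraction of the anchor's bitrate yields a BD-Rate equal to that fractional saving.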

Critical Analysis

The hierarchical B-frame coding approach presented in this paper is a promising technique for improving video compression and random access capabilities, particularly for long video sequences. The adaptive coding and decoding process based on frame importance is a key innovation that allows for efficient compression while still enabling fast random access.

One potential limitation of the approach is the complexity of the hierarchical structure and the associated coding/decoding processes. Implementing this system efficiently may require significant computational resources, which could be a challenge for some real-world applications. The authors acknowledge this and suggest that future work could explore ways to simplify the hierarchical structure or optimize the implementation.

Additionally, the paper does not provide a detailed analysis of the trade-offs between compression efficiency and random access performance. While the results indicate improvements in both areas, it would be valuable to understand the specific performance characteristics and how they might vary under different video content or system constraints.

Overall, the hierarchical B-frame video coding technique presented in this paper is a promising approach that merits further investigation and refinement. As video content continues to grow in both volume and complexity, innovations like this that can balance compression, quality, and accessibility will be increasingly important.

Conclusion

This paper introduces a novel hierarchical B-frame video coding technique that aims to improve both compression efficiency and random access performance for long GOP video sequences. By using a multi-level hierarchy of B-frames and an adaptive coding/decoding process, the system can prioritize the encoding of key frames to maintain video quality while efficiently compressing less important frames.

The hierarchical structure also enables faster random access, as the decoder can decode only the frames on a target frame's reference chain. This is an advantage over sequential prediction chains, where every frame between the last intra-coded frame and the target must be decoded first.

While the proposed approach introduces some additional complexity, the potential benefits in terms of improved compression and random access make it a compelling area for further research and development. As the demand for efficient video storage and delivery continues to grow, techniques like this that can balance these competing requirements will become increasingly valuable.


Related Papers

Hierarchical B-frame Video Coding for Long Group of Pictures

Ivan Kirillov, Denis Parkhomenko, Kirill Chernyshev, Alexander Pletnev, Yibo Shi, Kai Lin, Dmitry Babin

Learned video compression methods already outperform VVC in the low-delay (LD) case, but the random-access (RA) scenario remains challenging. Most works on learned RA video compression either use HEVC as an anchor or compare it to VVC in specific test conditions, using RGB-PSNR metric instead of Y-PSNR and avoiding comprehensive evaluation. Here, we present an end-to-end learned video codec for random access that combines training on long sequences of frames, rate allocation designed for hierarchical coding and content adaptation on inference. We show that under common test conditions (JVET-CTC), it achieves results comparable to VTM (VVC reference software) in terms of YUV-PSNR BD-Rate on some classes of videos, and outperforms it on almost all test sets in terms of VMAF BD-Rate. On average it surpasses open LD and RA end-to-end solutions in terms of VMAF and YUV BD-Rates.

6/26/2024

PNVC: Towards Practical INR-based Video Compression

Ge Gao, Ho Man Kwan, Fan Zhang, David Bull

Neural video compression has recently demonstrated significant potential to compete with conventional video codecs in terms of rate-quality performance. These learned video codecs are however associated with various issues related to decoding complexity (for autoencoder-based methods) and/or system delays (for implicit neural representation (INR) based models), which currently prevent them from being deployed in practical applications. In this paper, targeting a practical neural video codec, we propose a novel INR-based coding framework, PNVC, which innovatively combines autoencoder-based and overfitted solutions. Our approach benefits from several design innovations, including a new structural reparameterization-based architecture, hierarchical quality control, modulation-based entropy modeling, and scale-aware positional embedding. Supporting both low delay (LD) and random access (RA) configurations, PNVC outperforms existing INR-based codecs, achieving nearly 35%+ BD-rate savings against HEVC HM 18.0 (LD) - almost 10% more compared to one of the state-of-the-art INR-based codecs, HiNeRV and 5% more over VTM 20.0 (LD), while maintaining 20+ FPS decoding speeds for 1080p content. This represents an important step forward for INR-based video coding, moving it towards practical deployment. The source code will be available for public evaluation.

9/4/2024

Benchmarking Conventional and Learned Video Codecs with a Low-Delay Configuration

Siyue Teng (University of Bristol), Yuxuan Jiang (University of Bristol), Ge Gao (University of Bristol), Fan Zhang (University of Bristol), Thomas Davis (Visionular Inc), Zoe Liu (Visionular Inc), David Bull (University of Bristol)

Recent advances in video compression have seen significant coding performance improvements with the development of new standards and learning-based video codecs. However, most of these works focus on application scenarios that allow a certain amount of system delay (e.g., Random Access mode in MPEG codecs), which is not always acceptable for live delivery. This paper conducts a comparative study of state-of-the-art conventional and learned video coding methods based on a low delay configuration. Specifically, this study includes two MPEG standard codecs (H.266/VVC VTM and JVET ECM), two AOM codecs (AV1 libaom and AVM), and two recent neural video coding models (DCVC-DC and DCVC-FM). To allow a fair and meaningful comparison, the evaluation was performed on test sequences defined in the AOM and MPEG common test conditions in the YCbCr 4:2:0 color space. The evaluation results show that the JVET ECM codecs offer the best overall coding performance among all codecs tested, with a 16.1% (based on PSNR) average BD-rate saving over AOM AVM, and 11.0% over DCVC-FM. We also observed inconsistent performance with the learned video codecs, DCVC-DC and DCVC-FM, for test content with large background motions.

8/12/2024

Accelerating Learned Video Compression via Low-Resolution Representation Learning

Zidian Qiu, Zongyao He, Zhi Jin

In recent years, the field of learned video compression has witnessed rapid advancement, exemplified by the latest neural video codec DCVC-DC, which has outperformed the upcoming next-generation codec ECM in terms of compression ratio. Despite this, learned video compression frameworks often exhibit low encoding and decoding speeds, primarily due to their increased computational complexity and unnecessary high-resolution spatial operations, which hugely hinder their applications in reality. In this work, we introduce an efficiency-optimized framework for learned video compression that focuses on low-resolution representation learning, aiming to significantly enhance the encoding and decoding speeds. Firstly, we diminish the computational load by reducing the resolution of inter-frame propagated features obtained from reused features of decoded frames, including I-frames. We implement a joint training strategy for both the I-frame and P-frame models, further improving the compression ratio. Secondly, our approach efficiently leverages multi-frame priors for parameter prediction, minimizing computation at the decoding end. Thirdly, we revisit the application of the Online Encoder Update (OEU) strategy for high-resolution sequences, achieving notable improvements in compression ratio without compromising decoding efficiency. Our efficiency-optimized framework has significantly improved the balance between compression ratio and speed for learned video compression. In comparison to traditional codecs, our method achieves performance levels on par with the low-delay P configuration of the H.266 reference software VTM. Furthermore, when contrasted with DCVC-HEM, our approach delivers a comparable compression ratio while boosting encoding and decoding speeds by a factor of 3 and 7, respectively. On an RTX 2080Ti, our method can decode each 1080p frame in under 100 ms.

7/24/2024