MaskCRT

Masked Conditional Residual Transformer for Learned Video Compression

Authors: Yi Hsin Chen, Hong Sheng Xie, Cheng Wei Chen, Zong Lin Gao, Martin Benjak, Wen Hsiao Peng, Jorn Ostermann
Abstract: Conditional coding has lately emerged as the main-stream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when the information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than that of the intra frame. Recognizing that this assumption is not always true due to dis-occlusion phenomena or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a hybrid of conditional coding and conditional residual coding in a pixel adaptive manner. We introduce a Transformer-based conditional autoencoder. Several strategies are investigated with regard to how to condition a Transformer-based autoencoder for inter-frame coding, a topic that is largely under-explored. Additionally, we propose a channel transform module (CTM) to decorrelate the image latents along the channel dimension, with the aim of using the simple hyperprior to approach similar compression performance to the channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) to both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT shows comparable BD-rate results to VTM-17.0 under the low delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.
Organisational unit(s): Institut für Informationsverarbeitung
External organisation(s): National Yang Ming Chiao Tung University (NSTC)
Type: Article
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Pages: 11980-11992
Number of pages: 13
ISSN: 1051-8215
Publication date: 12 July 2024
Publication status: Published
Peer-reviewed: Yes
ASJC Scopus subject areas: Media Technology, Electrical and Electronic Engineering
Electronic version(s): https://doi.org/10.1109/TCSVT.2024.3427426 (Access: Closed)
https://doi.org/10.48550/arXiv.2312.15829 (Access: Open)

BibTeX

@article{bc6f5c43c3d147e795f5cdff9088acec,
title = "MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression",
abstract = "Conditional coding has lately emerged as the main-stream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when the information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than that of the intra frame. Recognizing that this assumption is not always true due to dis-occlusion phenomena or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a hybrid of conditional coding and conditional residual coding in a pixel adaptive manner. We introduce a Transformer-based conditional autoencoder. Several strategies are investigated with regard to how to condition a Transformer-based autoencoder for inter-frame coding, a topic that is largely under-explored. Additionally, we propose a channel transform module (CTM) to decorrelate the image latents along the channel dimension, with the aim of using the simple hyperprior to approach similar compression performance to the channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) to both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT shows comparable BD-rate results to VTM-17.0 under the low delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.",
keywords = "Encoding, Entropy, Feature extraction, Image coding, Learned video compression, masked conditional residual coding, Transformer-based video compression, Transformers, Video codecs, Video compression",
author = "Chen, {Yi Hsin} and Xie, {Hong Sheng} and Chen, {Cheng Wei} and Gao, {Zong Lin} and Benjak, Martin and Peng, {Wen Hsiao} and Ostermann, Jorn",
note = "Publisher Copyright: IEEE",
year = "2024",
month = jul,
day = "12",
doi = "10.1109/TCSVT.2024.3427426",
language = "English",
volume = "34",
pages = "11980--11992",
journal = "IEEE Transactions on Circuits and Systems for Video Technology",
issn = "1051-8215",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "11",
}