MDL-AE: Investigating the trade-off between compressive fidelity and discriminative utility in self-supervised learning
Abstract
Current paradigms in self-supervised learning (SSL) achieve state-of-the-art results through complex, heuristic-driven pretext tasks such as contrastive learning or masked image modeling. We propose a departure from these heuristics by reframing SSL through the Minimum Description Length (MDL) principle. We introduce the MDL-Autoencoder (MDL-AE), which learns visual representations by optimizing a Vector Quantized Variational AutoEncoder (VQ-VAE)-based objective for efficient, discrete compression of visual data. Through experiments on the CIFAR-10 dataset, we demonstrate that this compression-driven objective learns a rich vocabulary of local visual concepts. However, we uncover a critical architectural insight: although a more powerful tokenizer learns a visibly superior, higher-fidelity vocabulary, it fails to improve downstream performance. We show that the MDL-AE learns holistic object parts rather than generic, composable primitives; consequently, a sophisticated Vision Transformer (ViT) head consistently fails to outperform a simple linear probe on the flattened feature map. This mismatch reveals that the nature of the learned representation dictates the optimal downstream architecture. To validate this, we demonstrate that a dedicated self-supervised alignment task, based on masked autoencoding of the discrete tokens, resolves the mismatch and substantially improves performance, bridging the gap between generative fidelity and discriminative utility. Our work provides a case study in co-designing pretraining objectives and downstream architectures.
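To make the compression objective concrete, the sketch below shows the VQ-VAE-style quantization step that such an objective rests on: encoder features are snapped to their nearest codebook entry, trained with the standard codebook and commitment terms, and the resulting grid of discrete tokens is what a downstream linear probe or masked-token model would consume. This is a minimal PyTorch illustration under our own assumptions; the names and hyperparameters (VectorQuantizer, codebook_size=512, dim=64, beta=0.25) are illustrative, not the authors' implementation. With a fixed codebook of K entries, each token in the H×W grid costs log2(K) bits, giving one simple rate/distortion reading of the MDL principle invoked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour lookup into a learned codebook (standard VQ-VAE)."""
    def __init__(self, codebook_size: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.codebook.weight.data.uniform_(-1.0 / codebook_size, 1.0 / codebook_size)
        self.beta = beta  # weight on the commitment term

    def forward(self, z_e: torch.Tensor):
        # z_e: (B, C, H, W) continuous encoder features.
        B, C, H, W = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, C)        # (B*H*W, C)
        dists = torch.cdist(flat, self.codebook.weight)      # L2 distance to every code
        idx = dists.argmin(dim=1)                            # discrete token per position
        z_q = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)
        # Codebook loss pulls codes toward encoder outputs; the commitment
        # loss keeps encoder outputs close to their assigned code.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: gradients flow from z_q back into z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.view(B, H, W), vq_loss

# Tiny smoke test with random features standing in for an encoder's output.
vq = VectorQuantizer()
z_e = torch.randn(2, 64, 8, 8)       # e.g. an 8x8 token grid for 32x32 CIFAR images
z_q, tokens, vq_loss = vq(z_e)
print(tokens.shape, vq_loss.item())  # torch.Size([2, 8, 8]) and a scalar loss
# In training, the total objective would be reconstruction (distortion) plus
# vq_loss, e.g. F.mse_loss(decoder(z_q), x) + vq_loss; the decoder is omitted here.
```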
Copyright (c) 2025 Zaryab Rahman, Mattia Ottoborgo

This work is licensed under a Creative Commons Attribution 4.0 International License.