
Awesome Masked Image Modeling for Visual Representation


We summarize masked image modeling (MIM) methods proposed for self-supervised visual representation learning. The list is organized in chronological order and is continuously updated; a minimal code sketch of the recipe these methods share is given after the notes below.

  • To find related papers and their relationships, check out Connected Papers, which visualizes the literature as a graph.

  • To export BibTeX citations of papers, check out the paper's arXiv or Semantic Scholar page, both of which provide well-formatted references.
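Most of the methods listed below share one recipe: split an image into patches, hide a random subset, and train a network to predict the hidden content from the visible context. The following is a minimal PyTorch sketch of that recipe in the SimMIM style (a learnable mask token plus a pixel-regression loss computed only on masked patches). All names and hyperparameters here (`ToyMIM`, `patchify`, the 0.6 mask ratio) are illustrative choices of ours, not taken from any listed paper, and positional embeddings are omitted for brevity.

```python
# Minimal SimMIM-style MIM sketch: embed all patches, replace a random subset
# with a learnable mask token, encode, and regress raw pixels of the masked
# patches. Names and hyperparameters are illustrative, not from any paper.
import torch
import torch.nn as nn

def patchify(imgs, p=16):
    """(B, C, H, W) -> (B, N, p*p*C) non-overlapping patches."""
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)

class ToyMIM(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, width=256, depth=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, width)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, width))
        layer = nn.TransformerEncoderLayer(width, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(width, patch_dim)  # one-layer pixel-regression head

    def forward(self, patches, mask_ratio=0.6):
        B, N, _ = patches.shape
        tokens = self.embed(patches)
        mask = torch.rand(B, N, device=patches.device) < mask_ratio  # True = hidden
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1), tokens)
        pred = self.head(self.encoder(tokens))
        loss = ((pred - patches) ** 2).mean(dim=-1)           # per-patch L2 error
        return (loss * mask).sum() / mask.sum().clamp(min=1)  # masked patches only

imgs = torch.randn(2, 3, 224, 224)   # stand-in for a real image batch
loss = ToyMIM()(patchify(imgs))
loss.backward()
```

MAE-style variants differ mainly in dropping the masked tokens at the encoder and reconstructing them with a separate lightweight decoder, while tokenizer-based methods such as BEiT replace the raw-pixel target with discrete visual tokens.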

MIM for Backbones

MIM for Transformers

  • Generative Pretraining from Pixels
    Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, David Luan, Ilya Sutskever
    ICML’2020 [Paper] [Code]

    iGPT Framework

  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
    ICLR’2021 [Paper] [Code]

    ViT Framework

  • BEiT: BERT Pre-Training of Image Transformers
    Hangbo Bao, Li Dong, Furu Wei
    ICLR’2022 [Paper] [Code]

    BEiT Framework

  • iBOT: Image BERT Pre-Training with Online Tokenizer
    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong
    ICLR’2022 [Paper] [Code]

    iBOT Framework

  • Masked Autoencoders Are Scalable Vision Learners
    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
    CVPR’2022 [Paper] [Code]

    MAE Framework

  • SimMIM: A Simple Framework for Masked Image Modeling
    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, Han Hu
    CVPR’2022 [Paper] [Code]

    SimMIM Framework

  • Masked Feature Prediction for Self-Supervised Visual Pre-Training
    Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, Christoph Feichtenhofer
    CVPR’2022 [Paper] [Code]

    MaskFeat Framework

  • data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
    Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli
    ICML’2022 [Paper] [Code]

    data2vec Framework

  • Position Prediction as an Effective Pretraining Strategy
    Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Yitan Cheng, Walter Talbott, Chen Huang, Hanlin Goh, Joshua Susskind
    ICML’2022 [Paper]

    MP3 Framework

  • PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers
    Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu
    AAAI’2023 [Paper] [Code]

    PeCo Framework

  • MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning
    Sara Atito, Muhammad Awais, Ammarah Farooq, Zhenhua Feng, Josef Kittler
    ArXiv’2021 [Paper]

    MC-SSL0.0 Framework

  • mc-BEiT: Multi-choice Discretization for Image BERT Pre-training
    Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-Yu Duan
    ECCV’2022 [Paper] [Code]

    mc-BEiT Framework

  • Bootstrapped Masked Autoencoders for Vision BERT Pretraining
    Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu
    ECCV’2022 [Paper] [Code]

    BootMAE Framework

  • SdAE: Self-distillated Masked Autoencoder
    Yabo Chen, Yuchen Liu, Dongsheng Jiang, Xiaopeng Zhang, Wenrui Dai, Hongkai Xiong, Qi Tian
    ECCV’2022 [Paper] [Code]

    SdAE Framework

  • MultiMAE: Multi-modal Multi-task Masked Autoencoders
    Roman Bachmann, David Mizrahi, Andrei Atanov, Amir Zamir
    ECCV’2022 [Paper] [Code]

    MultiMAE Framework

  • SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners
    Feng Liang, Yangguang Li, Diana Marculescu
    ArXiv’2022 [Paper] [Code]

    SupMAE Framework

  • MVP: Multimodality-guided Visual Pre-training
    Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, Qi Tian
    ArXiv’2022 [Paper]

    MVP Framework

  • The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training
    Hao Liu, Xinghua Jiang, Xin Li, Antai Guo, Deqiang Jiang, Bo Ren
    AAAI’2023 [Paper]

    Ge2AE Framework

  • ConvMAE: Masked Convolution Meets Masked Autoencoders
    Peng Gao, Teli Ma, Hongsheng Li, Ziyi Lin, Jifeng Dai, Yu Qiao
    NeurIPS’2022 [Paper] [Code]

    ConvMAE Framework

  • Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
    Peng Gao, Renrui Zhang, Rongyao Fang, Ziyi Lin, Hongyang Li, Hongsheng Li, Yu Qiao
    ArXiv’2023 [Paper] [Code]

    MR-MAE (ConvMAE.V2) Framework

  • Green Hierarchical Vision Transformer for Masked Image Modeling
    Lang Huang, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Toshihiko Yamasaki
    NeurIPS’2022 [Paper] [Code]

    GreenMIM Framework

  • Test-Time Training with Masked Autoencoders
    Yossi Gandelsman, Yu Sun, Xinlei Chen, Alexei A. Efros
    NeurIPS’2022 [Paper] [Code]

    TTT-MAE Framework

  • HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling
    Xiaosong Zhang, Yunjie Tian, Wei Huang, Qixiang Ye, Qi Dai, Lingxi Xie, Qi Tian
    ICLR’2023 [Paper]

    HiViT Framework

  • Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation
    Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo
    ArXiv’2022 [Paper] [Code]

    FD Framework

  • Object-wise Masked Autoencoders for Fast Pre-training
    Jiantao Wu, Shentong Mo
    ArXiv’2022 [Paper]

    ObjMAE Framework

  • Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction
    Jun Chen, Ming Hu, Boyang Li, Mohamed Elhoseiny
    ArXiv’2022 [Paper] [Code]

    LoMaR Framework

  • Extreme Masking for Learning Instance and Distributed Visual Representations
    Zhirong Wu, Zihang Lai, Xiao Sun, Stephen Lin
    ArXiv’2022 [Paper]

    ExtreMA Framework

  • BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
    Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei
    ArXiv’2022 [Paper] [Code]

    BEiT.V2 Framework

  • MILAN: Masked Image Pretraining on Language Assisted Representation
    Zejiang Hou, Fei Sun, Yen-Kuang Chen, Yuan Xie, Sun-Yuan Kung
    ArXiv’2022 [Paper] [Code]

    MILAN Framework

  • Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei
    ArXiv’2022 [Paper] [Code]

    BEiT.V3 Framework

  • Masked Autoencoders Enable Efficient Knowledge Distillers
    Yutong Bai, Zeyu Wang, Junfei Xiao, Chen Wei, Huiyu Wang, Alan Yuille, Yuyin Zhou, Cihang Xie
    ArXiv’2022 [Paper] [Code]

    DMAE Framework

  • Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders
    Youngwan Lee, Jeffrey Willette, Jonghee Kim, Juho Lee, Sung Ju Hwang
    ICLR’2023 [Paper]

    RC-MAE Framework

  • Denoising Masked AutoEncoders are Certifiable Robust Vision Learners
    Quanlin Wu, Hang Ye, Yuntian Gu, Huishuai Zhang, Liwei Wang, Di He
    ArXiv’2022 [Paper] [Code]

    DMAE Framework

  • A Unified View of Masked Image Modeling
    Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei
    ArXiv’2022 [Paper] [Code]

    MaskDistill Framework

  • Masked Vision and Language Modeling for Multi-modal Representation Learning
    Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto
    ICLR’2023 [Paper]

    MaskVLM Framework

  • DILEMMA: Self-Supervised Shape and Texture Learning with Transformers
    Sepehr Sameni, Simon Jenni, Paolo Favaro
    AAAI’2023 [Paper]

    DILEMMA Framework

  • MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
    Xiaoyi Dong, Yinglin Zheng, Jianmin Bao, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu
    ArXiv’2022 [Paper]

    MaskCLIP Framework

  • i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?
    Kevin Zhang, Zhiqiang Shen
    ArXiv’2022 [Paper] [Code]

    i-MAE Framework

  • EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
    CVPR’2023 [Paper] [Code]

    EVA Framework

  • Context Autoencoder for Self-Supervised Representation Learning
    Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang
    IJCV’2023 [Paper] [Code]

    CAE Framework

  • CAE v2: Context Autoencoder with CLIP Target
    Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
    ArXiv’2022 [Paper]

    CAE.V2 Framework

  • FastMIM: Expediting Masked Image Modeling Pre-training for Vision
    Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Yunhe Wang, Chang Xu
    ArXiv’2022 [Paper]

    FastMIM Framework

  • Exploring Target Representations for Masked Autoencoders
    Xingbin Liu, Jinghao Zhou, Tao Kong, Xianming Lin, Rongrong Ji
    ArXiv’2022 [Paper] [Code]

    dBOT Framework

  • Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
    Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
    ICML’2023 [Paper] [Code]

    Data2Vec.V2 Framework

  • Scaling Language-Image Pre-training via Masking
    Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He
    ArXiv’2022 [Paper]

    FLIP Framework

  • Attentive Mask CLIP
    Yifan Yang, Weiquan Huang, Yixuan Wei, Houwen Peng, Xinyang Jiang, Huiqiang Jiang, Fangyun Wei, Yin Wang, Han Hu, Lili Qiu, Yuqing Yang
    ArXiv’2022 [Paper]

    A-CLIP Framework

  • Masked autoencoders are effective solution to transformer data-hungry
    Jiawei Mao, Honggu Zhou, Xuesong Yin, Yuanqi Chang, Binling Nie, Rui Xu
    ArXiv’2022 [Paper] [Code]

    SDMAE Framework

  • TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models
    Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu
    ArXiv’2023 [Paper] [Code]

    TinyMIM Framework

  • Disjoint Masking with Joint Distillation for Efficient Masked Image Modeling
    Xin Ma, Chang Liu, Chunyu Xie, Long Ye, Yafeng Deng, Xiangyang Ji
    ArXiv’2023 [Paper] [Code]

    DMJD Framework

  • Mixed Autoencoder for Self-supervised Visual Representation Learning
    Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung
    CVPR’2023 [Paper]

    MixedAE Framework

  • Masked Image Modeling with Local Multi-Scale Reconstruction
    Haoqing Wang, Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhi-Hong Deng, Kai Han
    CVPR’2023 [Paper] [Code]

    LocalMAE Framework

  • Stare at What You See: Masked Image Modeling without Reconstruction
    Hongwei Xue, Peng Gao, Hongyang Li, Yu Qiao, Hao Sun, Houqiang Li, Jiebo Luo
    CVPR’2023 [Paper] [Code]

    MaskAlign Framework

  • Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas
    CVPR’2023 [Paper]

    I-JEPA Framework

  • MOMA: Distill from Self-Supervised Teachers
    Yuchong Yao, Nandakishor Desai, Marimuthu Palaniswami
    ArXiv’2023 [Paper]

    MOMA Framework

  • PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling
    Yuan Liu, Songyang Zhang, Jiacheng Chen, Kai Chen, Dahua Lin
    ArXiv’2023 [Paper] [Code]

    PixMIM Framework

  • Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders
    Heng Pan, Chenyang Liu, Wenxiao Wang, Li Yuan, Hongfa Wang, Zhifeng Li, Wei Liu
    ArXiv’2023 [Paper]

    Img2Vec Framework

  • A Closer Look at Self-Supervised Lightweight Vision Transformers
    Shaoru Wang, Jin Gao, Zeming Li, Xiaoqin Zhang, Weiming Hu
    ICML’2023 [Paper] [Code]

    MAE-Lite Framework

  • Architecture-Agnostic Masked Image Modeling - From ViT back to CNN
    Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Stan Z. Li
    ICML’2023 [Paper] [Code] [project]

    A2MIM Framework

  • Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
    Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer
    ICML’2023 [Paper] [Code]

    Hiera Framework

  • The effectiveness of MAE pre-pretraining for billion-scale pretraining
    Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
    ArXiv’2023 [Paper]

    WSP Framework

  • Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training
    Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara
    ArXiv’2023 [Paper] [Code]

    MaPeT Framework

  • R-MAE: Regions Meet Masked Autoencoders
    Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R. Oswald, Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen
    ArXiv’2023 [Paper] [Code]

    R-MAE Framework

  • Improving Pixel-based MIM by Reducing Wasted Modeling Capability
    Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin
    ICCV’2023 [Paper] [Code]

    MFF Framework

(back to top)

MIM with Contrastive Learning

  • MST: Masked Self-Supervised Transformer for Visual Representation
    Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang
    NeurIPS’2021 [Paper]

    MST Framework

  • Are Large-scale Datasets Necessary for Self-Supervised Pre-training?
    Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, Edouard Grave
    ArXiv’2021 [Paper]

    SplitMask Framework

  • Masked Siamese Networks for Label-Efficient Learning
    Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas
    ArXiv’2022 [Paper] [Code]

    MSN Framework

  • Siamese Image Modeling for Self-Supervised Vision Representation Learning
    Chenxin Tao, Xizhou Zhu, Gao Huang, Yu Qiao, Xiaogang Wang, Jifeng Dai
    ArXiv’2022 [Paper] [Code]

    SIM Framework

  • Masked Image Modeling with Denoising Contrast
    Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, Xiaohu Qie
    ICLR’2023 [Paper] [Code]

    ConMIM Framework

  • RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training
    Luya Wang, Feng Liang, Yangguang Li, Honggang Zhang, Wanli Ouyang, Jing Shao
    ArXiv’2022 [Paper]

    RePre Framework

  • Masked Siamese ConvNets
    Li Jing, Jiachen Zhu, Yann LeCun
    ArXiv’2022 [Paper]

    MSCN Framework

  • Contrastive Masked Autoencoders are Stronger Vision Learners
    Zhicheng Huang, Xiaojie Jin, Chengze Lu, Qibin Hou, Ming-Ming Cheng, Dongmei Fu, Xiaohui Shen, Jiashi Feng
    ArXiv’2022 [Paper] [Code]

    CMAE Framework

  • A simple, efficient and scalable contrastive masked autoencoder for learning visual representations
    Shlok Mishra, Joshua Robinson, Huiwen Chang, David Jacobs, Aaron Sarna, Aaron Maschinot, Dilip Krishnan
    ArXiv’2022 [Paper]

    CAN Framework

  • MimCo: Masked Image Modeling Pre-training with Contrastive Teacher
    Qiang Zhou, Chaohui Yu, Hao Luo, Zhibin Wang, Hao Li
    ArXiv’2022 [Paper]

    MimCo Framework

  • Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining
    Shaofeng Zhang, Feng Zhu, Rui Zhao, Junchi Yan
    ICLR’2023 [Paper] [Code]

    ccMIM Framework

  • How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders
    Qi Zhang, Yifei Wang, Yisen Wang
    NeurIPS’2022 [Paper] [Code]

    U-MAE Framework

  • Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations
    Ziyu Jiang, Yinpeng Chen, Mengchen Liu, Dongdong Chen, Xiyang Dai, Lu Yuan, Zicheng Liu, Zhangyang Wang
    ICLR’2023 [Paper] [Code]

    Layer Grafted Framework

  • Self-supervision through Random Segments with Autoregressive Coding (RandSAC)
    Tianyu Hua, Yonglong Tian, Sucheng Ren, Michalis Raptis, Hang Zhao, Leonid Sigal
    ICLR’2023 [Paper]

    RandSAC Framework

(back to top)

MIM for Transformers and CNNs

  • Context Encoders: Feature Learning by Inpainting
    Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros
    CVPR’2016 [Paper] [Code]

    Context-Encoder Framework

  • Corrupted Image Modeling for Self-Supervised Visual Pre-Training
    Yuxin Fang, Li Dong, Hangbo Bao, Xinggang Wang, Furu Wei
    ICLR’2023 [Paper]

    CIM Framework

  • Architecture-Agnostic Masked Image Modeling - From ViT back to CNN
    Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Stan Z. Li
    ICML’2023 [Paper] [Code] [project]

    A2MIM Framework

  • Masked Frequency Modeling for Self-Supervised Visual Pre-Training
    Jiahao Xie, Wei Li, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, Chen Change Loy
    ICLR’2023 [Paper] [Code]

    MFM Framework

  • MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers
    Jihao Liu, Xin Huang, Jinliang Zheng, Yu Liu, Hongsheng Li
    CVPR’2023 [Paper] [Code]

    MixMAE Framework

  • Masked Autoencoders are Robust Data Augmentors
    Haohang Xu, Shuangrui Ding, Xiaopeng Zhang, Hongkai Xiong, Qi Tian
    ArXiv’2022 [Paper] [Code]

    MRA Framework

  • Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling
    Keyu Tian, Yi Jiang, Qishuai Diao, Chen Lin, Liwei Wang, Zehuan Yuan
    ICLR’2023 [Paper] [Code]

    SparK Framework

  • ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
    Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie
    ArXiv’2023 [Paper] [Code]

    ConvNeXt.V2 Framework

(back to top)

MIM with Advanced Masking

  • MST: Masked Self-Supervised Transformer for Visual Representation
    Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang
    NeurIPS’2021 [Paper]

    MST Framework

  • Adversarial Masking for Self-Supervised Learning
    Yuge Shi, N. Siddharth, Philip H.S. Torr, Adam R. Kosiorek
    ICML’2022 [Paper] [Code]

    ADIOS Framework

  • What to Hide from Your Students: Attention-Guided Masked Image Modeling
    Ioannis Kakogeorgiou, Spyros Gidaris, Bill Psomas, Yannis Avrithis, Andrei Bursuc, Konstantinos Karantzalos, Nikos Komodakis
    ECCV’2022 [Paper] [Code]

    AttMask Framework

  • Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality
    Xiang Li, Wenhai Wang, Lingfeng Yang, Jian Yang
    ArXiv’2022 [Paper] [Code]

    UnMAE Framework

  • SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders
    Gang Li, Heliang Zheng, Daqing Liu, Chaoyue Wang, Bing Su, Changwen Zheng
    NeurIPS’2022 [Paper] [Code]

    SemMAE Framework

  • Hard Patches Mining for Masked Image Modeling
    Haochen Wang, Kaiyou Song, Junsong Fan, Yuxi Wang, Jin Xie, Zhaoxiang Zhang
    CVPR’2023 [Paper] [Code]

    HPM Framework

  • Improving Masked Autoencoders by Learning Where to Mask
    Haijian Chen, Wendong Zhang, Yunbo Wang, Xiaokang Yang
    ArXiv’2023 [Paper]

    AutoMAE Framework

Image Generation

  • Discrete Variational Autoencoders
    Jason Tyler Rolfe
    ICLR’2017 [Paper] [Code]

  • Neural Discrete Representation Learning
    Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
    NeurIPS’2017 [Paper] [Code]

  • Theory and Experiments on Vector Quantized Autoencoders (EM VQ-VAE)
    Aurko Roy, Ashish Vaswani, Arvind Neelakantan, Niki Parmar
    ArXiv’2018 [Paper] [Code]

  • DVAE: Discrete Variational Autoencoders with Relaxed Boltzmann Priors
    Arash Vahdat, Evgeny Andriyash, William G. Macready
    NeurIPS’2018 [Paper] [Code]

  • DVAE++: Discrete Variational Autoencoders with Overlapping Transformations
    Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, Evgeny Andriyash
    ICML’2018 [Paper] [Code]

  • Generating Diverse High-Fidelity Images with VQ-VAE-2
    Ali Razavi, Aaron van den Oord, Oriol Vinyals
    NeurIPS’2019 [Paper] [Code]

  • Generative Pretraining from Pixels
    Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, David Luan, Ilya Sutskever
    ICML’2020 [Paper] [Code]

    iGPT Framework

  • Taming Transformers for High-Resolution Image Synthesis
    Patrick Esser, Robin Rombach, Björn Ommer
    CVPR’2021 [Paper] [Code]

    VQGAN Framework

  • MaskGIT: Masked Generative Image Transformer
    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman
    CVPR’2022 [Paper] [Code]

    MaskGIT Framework

  • ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation
    Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
    ArXiv’2021 [Paper] [Project]

  • NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
    Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan
    ArXiv’2021 [Paper] [Code]

  • ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis
    Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer
    NeurIPS’2021 [Paper] [Code] [Project]

  • Vector-quantized Image Modeling with Improved VQGAN
    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu
    ICLR’2022 [Paper] [Code]

    ViT-VQGAN Framework

  • MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis
    Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan
    CVPR’2023 [Paper] [Code]

    MAGE Framework

  • Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment
    Hao Liu, Wilson Yan, Pieter Abbeel
    ArXiv’2023 [Paper] [Code]

    LQAE Framework

  • SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
    Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang
    ArXiv’2023 [Paper] [Code]

    SPAE Framework

(back to top)

MIM for Downstream Tasks

Object Detection

  • Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
    Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang Wang
    ArXiv’2022 [Paper] [Code]

    MIMDet Framework

  • SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers
    Guoqiang Jin, Fan Yang, Mingshan Sun, Ruyi Zhao, Yakun Liu, Wei Li, Tianpeng Bao, Liwei Wu, Xingyu Zeng, Rui Zhao
    ArXiv’2022 [Paper]

    SeqCo-DETR Framework

  • Integrally Pre-Trained Transformer Pyramid Networks
    Yunjie Tian, Lingxi Xie, Zhaozhi Wang, Longhui Wei, Xiaopeng Zhang, Jianbin Jiao, Yaowei Wang, Qi Tian, Qixiang Ye
    CVPR’2023 [Paper] [Code]

    iTPN Framework

  • PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection
    Anthony Chen, Kevin Zhang, Renrui Zhang, Zihan Wang, Yuheng Lu, Yandong Guo, Shanghang Zhang
    CVPR’2023 [Paper] [Code]

    PiMAE Framework

  • Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection
    Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin
    ICCV’2023 [Paper] [Code]

    imTED Framework

Video Representation

  • VideoGPT: Video Generation using VQ-VAE and Transformers
    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas
    ArXiv’2021 [Paper] [Code]

    VideoGPT Framework

  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
    Zhan Tong, Yibing Song, Jue Wang, Limin Wang
    NeurIPS’2022 [Paper] [Code]

    VideoMAE Framework

  • VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao
    CVPR’2023 [Paper] [Code]

    VideoMAE.V2 Framework

  • Masked Autoencoders As Spatiotemporal Learners
    Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He
    NeurIPS’2022 [Paper] [Code]

    MAE Framework

  • Less is More: Consistent Video Depth Estimation with Masked Frames Modeling
    Yiran Wang, Zhiyu Pan, Xingyi Li, Zhiguo Cao, Ke Xian, Jianming Zhang
    ACMMM’2022 [Paper] [Code]

    FMNet Framework

  • MaskViT: Masked Visual Pre-Training for Video Prediction
    Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei
    CVPR’2022 [Paper] [Code]

    MaskViT Framework

  • OmniMAE: Single Model Masked Pretraining on Images and Videos
    Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
    ArXiv’2022 [Paper] [Code]

    OmniMAE Framework

  • MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
    Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, Xiaohu Qie, Ping Luo
    ArXiv’2022 [Paper] [Code]

    MILES Framework

  • MAR: Masked Autoencoders for Efficient Action Recognition
    Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang, Yiliang Lv, Changxin Gao, Nong Sang
    ArXiv’2022 [Paper]

    MAR Framework

  • Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders
    Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang, Xiatian Zhu, Zehuan Yuan
    ArXiv’2022 [Paper] [Code]

    MotionMAE Framework

  • It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training
    Yuxin Song, Min Yang, Wenhao Wu, Dongliang He, Fu Li, Jingdong Wang
    ArXiv’2022 [Paper]

    MAM2 Framework

  • MIMT: Masked Image Modeling Transformer for Video Compression
    Jinxi Xiang, Kuan Tian, Jun Zhang
    ICLR’2023 [Paper]

    MIMT Framework

  • DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks
    Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, Antoni B. Chan
    CVPR’2023 [Paper] [Code]

    DropMAE Framework

  • AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders
    Wele Gedara Chaminda Bandara, Naman Patel, Ali Gholami, Mehdi Nikkhah, Motilal Agrawal, Vishal M. Patel
    CVPR’2023 [Paper] [Code]

    AdaMAE Framework

  • MAGVIT: Masked Generative Video Transformer
    Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
    CVPR’2023 [Paper] [Code]

    MAGVIT Framework

  • CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition
    Cheng-Ze Lu, Xiaojie Jin, Zhicheng Huang, Qibin Hou, Ming-Ming Cheng, Jiashi Feng
    ArXiv’2023 [Paper]

    CMAE-V Framework

  • Siamese Masked Autoencoders
    Agrim Gupta, Jiajun Wu, Jia Deng, Li Fei-Fei
    ArXiv’2023 [Paper] [Code]

    SiamMAE Framework

  • MGMAE: Motion Guided Masking for Video Masked Autoencoding
    Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, Limin Wang
    ICCV’2023 [Paper] [Code]

    MGMAE Framework

Knowledge Distillation

  • Generic-to-Specific Distillation of Masked Autoencoders
    Wei Huang, Zhiliang Peng, Li Dong, Furu Wei, Jianbin Jiao, Qixiang Ye
    CVPR’2023 [Paper] [Code]

    G2SD Framework

Efficient Fine-tuning

  • Masked Images Are Counterfactual Samples for Robust Fine-tuning
    Yao Xiao, Ziyi Tang, Pengxu Wei, Cong Liu, Liang Lin
    CVPR’2023 [Paper] [Code]

    Robust Finetuning Framework

  • Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget
    Johannes Lehner, Benedikt Alkin, Andreas Fürst, Elisabeth Rumetshofer, Lukas Miklautz, Sepp Hochreiter
    ArXiv’2023 [Paper] [Code]

    MAE-CT Framework

Medical Image

  • Self Pre-training with Masked Autoencoders for Medical Image Analysis
    Lei Zhou, Huidong Liu, Joseph Bae, Junjun He, Dimitris Samaras, Prateek Prasanna
    ArXiv’2022 [Paper]

  • Self-distillation Augmented Masked Autoencoders for Histopathological Image Classification
    Yang Luo, Zhineng Chen, Xieping Gao
    ArXiv’2022 [Paper]

  • Global Contrast Masked Autoencoders Are Powerful Pathological Representation Learners
    Hao Quan, Xingyu Li, Weixing Chen, Qun Bai, Mingchen Zou, Ruijie Yang, Tingting Zheng, Ruiqun Qi, Xinghua Gao, Xiaoyu Cui
    ArXiv’2022 [Paper]

  • FreMAE: Fourier Transform Meets Masked Autoencoders for Medical Image Segmentation
    Wenxuan Wang, Jing Wang, Chen Chen, Jianbo Jiao, Lichao Sun, Yuanxiu Cai, Shanshan Song, Jiangyun Li
    ArXiv’2023 [Paper]

  • Masked Image Modeling Advances 3D Medical Image Analysis
    Zekai Chen, Devansh Agarwal, Kshitij Aggarwal, Wiem Safta, Samit Hirawat, Venkat Sethuraman, Mariann Micsinai Balan, Kevin Brown
    WACV’2023 [Paper]

Face Recognition

  • FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders
    Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Jiankang Deng, Xinchao Wang, Hakan Bilen, Yang You
    ArXiv’2022 [Paper]

Scene Text Recognition (OCR)

  • MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
    Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
    ArXiv’2022 [Paper]

  • DocMAE: Document Image Rectification via Self-supervised Representation Learning
    Shaokai Liu, Hao Feng, Wengang Zhou, Houqiang Li, Cong Liu, Feng Wu
    ICME’2023 [Paper]

Remote Sensing Image

  • SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery
    Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David B. Lobell, Stefano Ermon
    ArXiv’2022 [Paper]

  • CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding
    Dilxat Muhtar, Xueliang Zhang, Pengfeng Xiao, Zhenshi Li, Feng Gu
    TGRS’2023 [Paper]

3D Point Cloud

  • Pre-Training 3D Point Cloud Transformers with Masked Point Modeling
    Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, Jiwen Lu
    CVPR’2022 [Paper]

  • Masked Autoencoders for Point Cloud Self-supervised Learning
    Yatian Pang, Wenxiao Wang, Francis E.H. Tay, Wei Liu, Yonghong Tian, Li Yuan
    ECCV’2022 [Paper]

  • Masked Discrimination for Self-Supervised Learning on Point Clouds
    Haotian Liu, Mu Cai, Yong Jae Lee
    ECCV’2022 [Paper]

  • MeshMAE: Masked Autoencoders for 3D Mesh Data Analysis
    Yaqian Liang, Shanshan Zhao, Baosheng Yu, Jing Zhang, Fazhi He
    ECCV’2022 [Paper]

  • Voxel-MAE: Masked Autoencoders for Pre-training Large-scale Point Clouds
    Chen Min, Xinli Xu, Dawei Zhao, Liang Xiao, Yiming Nie, Bin Dai
    ArXiv’2022 [Paper]

  • Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training
    Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, Hongsheng Li
    NeurIPS’2022 [Paper]

  • Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
    Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, Hongsheng Li
    CVPR’2023 [Paper]

  • GeoMAE: Masked Geometric Target Prediction for Self-supervised Point Cloud Pre-Training
    Xiaoyu Tian, Haoxi Ran, Yue Wang, Hang Zhao
    CVPR’2023 [Paper]

  • Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?
    Runpei Dong, Zekun Qi, Linfeng Zhang, Junbo Zhang, Jianjian Sun, Zheng Ge, Li Yi, Kaisheng Ma
    ICLR’2023 [Paper]

  • Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
    Zekun Qi, Runpei Dong, Guofan Fan, Zheng Ge, Xiangyu Zhang, Kaisheng Ma, Li Yi
    ICML’2023 [Paper]

  • MGM: A meshfree geometric multilevel method for systems arising from elliptic equations on point cloud surfaces
    Grady B. Wright, Andrew M. Jones, Varun Shankar
    ICCV’2023 [Paper]

Reinforcement Learning

  • Mask-based Latent Reconstruction for Reinforcement Learning
    Tao Yu, Zhizheng Zhang, Cuiling Lan, Yan Lu, Zhibo Chen
    ArXiv’2022 [Paper]

Audio

  • MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation
    Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang
    ArXiv’2021 [Paper]

  • MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
    Alan Baade, Puyuan Peng, David Harwath
    ArXiv’2022 [Paper]

  • Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training
    Dading Chong, Helin Wang, Peilin Zhou, Qingcheng Zeng
    ArXiv’2022 [Paper]

  • Masked Autoencoders that Listen
    Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer
    NeurIPS’2022 [Paper]

  • Contrastive Audio-Visual Masked Autoencoder
    Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
    ICLR’2023 [Paper]

(back to top)

Analysis and Understanding of MIM

  • Demystifying Self-Supervised Learning: An Information-Theoretical Framework
    Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, Louis-Philippe Morency
    ICLR’2021 [Paper]

  • A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
    Nikunj Saunshi, Sadhika Malladi, Sanjeev Arora
    ICLR’2021 [Paper]

  • Predicting What You Already Know Helps: Provable Self-Supervised Learning
    Jason D. Lee, Qi Lei, Nikunj Saunshi, Jiacheng Zhuo
    NeurIPS’2021 [Paper]

  • How to Understand Masked Autoencoders
    Shuhao Cao, Peng Xu, David A. Clifton
    ArXiv’2022 [Paper]

  • Masked prediction tasks: a parameter identifiability view
    Bingbin Liu, Daniel Hsu, Pradeep Ravikumar, Andrej Risteski
    ArXiv’2022 [Paper]

  • Revealing the Dark Secrets of Masked Image Modeling
    Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, Yue Cao
    ArXiv’2022 [Paper]

  • Architecture-Agnostic Masked Image Modeling - From ViT back to CNN
    Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Kai Wang, Lei Shang, Baigui Sun, Hao Li, Stan Z. Li
    ArXiv’2022 [Paper]

  • On Data Scaling in Masked Image Modeling
    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, Han Hu
    CVPR’2023 [Paper]

  • Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks
    Jiachun Pan, Pan Zhou, Shuicheng Yan
    ArXiv’2022 [Paper]

  • An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers
    Gokul Karthik Kumar, Sahal Shaji Mullappilly, Abhishek Singh Gehlot
    ArXiv’2022 [Paper]

  • Understanding Masked Image Modeling via Learning Occlusion Invariant Feature
    Xiangwen Kong, Xiangyu Zhang
    ArXiv’2022 [Paper]

  • How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders
    Qi Zhang, Yifei Wang, Yisen Wang
    NeurIPS’2022 [Paper]

  • i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?
    Kevin Zhang, Zhiqiang Shen
    ArXiv’2022 [Paper]

  • Understanding Masked Autoencoders via Hierarchical Latent Variable Models
    Lingjing Kong, Martin Q. Ma, Guangyi Chen, Eric P. Xing, Yuejie Chi, Louis-Philippe Morency, Kun Zhang
    CVPR’2023 [Paper]

  • Evaluating Self-Supervised Learning via Risk Decomposition
    Yann Dubois, Tatsunori Hashimoto, Percy Liang
    ICML’2023 [Paper]

  • Regeneration Learning: A Learning Paradigm for Data Generation
    Xu Tan, Tao Qin, Jiang Bian, Tie-Yan Liu, Yoshua Bengio
    ArXiv’2023 [Paper]

(back to top)

Survey

  • A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond
    Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, Kang Zhang, In So Kweon
    ArXiv’2022 [Paper]

Contribution

Feel free to send pull requests to add more links in the following Markdown format. Note that the abbreviation, the code link, and the figure link are optional.

```markdown
* **TITLE**<br>
*AUTHOR*<br>
PUBLISH'YEAR [[Paper](link)] [[Code](link)]
   <details close>
   <summary>ABBREVIATION Framework</summary>
   <p align="center"><img width="90%" src="link_to_image" /></p>
   </details>
```
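
For example, an entry for MAE filled in with this template (using its public arXiv and GitHub pages; the figure link is left as a placeholder) would look like:

```markdown
* **Masked Autoencoders Are Scalable Vision Learners**<br>
*Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick*<br>
CVPR'2022 [[Paper](https://arxiv.org/abs/2111.06377)] [[Code](https://github.com/facebookresearch/mae)]
   <details close>
   <summary>MAE Framework</summary>
   <p align="center"><img width="90%" src="link_to_image" /></p>
   </details>
```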