Masked Autoencoders Are Scalable Vision Learners
During pre-training, a large random subset of image patches (75%) is masked
out. The encoder is applied only to the small subset of visible patches. Mask
tokens are introduced after the encoder, and the full set of encoded patches
and mask tokens is processed by a small decoder that reconstructs the original
image in pixels.
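A minimal PyTorch sketch of the pipeline described in this caption. The patch count, embedding sizes, layer counts, and the `TinyMAE` / `random_masking` names are illustrative assumptions, not the paper's implementation; positional embeddings and the patchify step are omitted for brevity.

```python
import torch
import torch.nn as nn

def random_masking(patches, mask_ratio=0.75):
    """Keep a random 25% of patches; return them plus indices to restore order."""
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                       # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)      # lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore, len_keep

class TinyMAE(nn.Module):
    """Toy masked autoencoder: big(ger) encoder on visible patches, small decoder."""
    def __init__(self, dim=128, dec_dim=64, patch_pixels=768):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=2)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.to_pixels = nn.Linear(dec_dim, patch_pixels)   # per-patch pixel output

    def forward(self, patch_embeddings, mask_ratio=0.75):
        visible, ids_restore, len_keep = random_masking(patch_embeddings, mask_ratio)
        encoded = self.encoder(visible)                      # encoder sees visible patches only
        x = self.enc_to_dec(encoded)
        B, N = ids_restore.shape
        # mask tokens enter only after the encoder; unshuffle back to original order
        mask_tokens = self.mask_token.expand(B, N - len_keep, -1)
        x = torch.cat([x, mask_tokens], dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
        return self.to_pixels(self.decoder(x))               # reconstruct pixels per patch

mae = TinyMAE()
dummy = torch.randn(2, 196, 128)   # (batch, patches, embedding dim), already embedded
recon = mae(dummy)                 # (2, 196, 768) per-patch pixel predictions
```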
Introduction
Solutions based on autoregressive language modeling in GPT and masked
autoencoding in BERT are conceptually simple: they remove a portion of the
data and learn to predict the removed content.
The idea of masked autoencoders, a form of the more general denoising
autoencoder, is natural and applicable in computer vision as well.
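As a concrete illustration of this shared "remove a portion and predict it" objective, a short sketch: reconstruction error is computed only on the removed positions. The tensor shapes and variable names are assumptions for illustration, not any particular model's API.

```python
import torch

# Hypothetical shapes: per-position predictions and targets, plus a 0/1 mask
# marking which positions were removed (1 = removed, 0 = visible to the model).
pred = torch.randn(8, 196, 768)                 # model output for every position
target = torch.randn(8, 196, 768)               # ground-truth content at every position
removed = (torch.rand(8, 196) < 0.75).float()   # 75% of positions removed

# Mean squared error per position, averaged only over the removed positions.
per_pos = ((pred - target) ** 2).mean(dim=-1)
loss = (per_pos * removed).sum() / removed.sum()
```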
What makes autoencoding different between vision and language?
Languages are human-generated signals that are highly semantic and
information-dense. When a model is trained to predict only a few missing
words per sentence, the task appears to induce sophisticated language
understanding. Images, on the contrary, are natural signals with heavy spatial
redundancy — e.g., a missing patch can be recovered from neighboring patches
with little high-level understanding of parts, objects, and scenes.