1. MLP-Mixer: An all-MLP Architecture for Vision [May 04-21]
Proposed by Google Brain, MLP-Mixer is an all-MLP architecture for vision that uses neither convolutions nor attention, just pure MLPs. Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
MLP-Mixer contains two types of layers: one with MLPs applied independently to
image patches (i.e. “mixing” the per-location features), and one with MLPs applied
across patches (i.e. “mixing” spatial information). When trained on large datasets,
or with modern regularization schemes, MLP-Mixer attains competitive scores on
image classification benchmarks, with pre-training and inference cost comparable
to state-of-the-art models. We hope that these results spark further research beyond
the realms of well-established CNNs and Transformers.
The core of the MLP-Mixer architecture is the Mixer block, sketched below:
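The two layer types map directly onto a short piece of code. Below is a minimal numpy sketch of a single Mixer block, assuming GELU activations, layer norm, and skip connections as in the paper; the dimensions and names are illustrative, not the released implementation:

```python
import numpy as np

# Minimal sketch of one Mixer block: a token-mixing MLP acts across patches
# ("mixing" spatial information) and a channel-mixing MLP acts independently
# on each patch ("mixing" per-location features).

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    return gelu(x @ w1 + b1) @ w2 + b2

def mixer_block(x, token_mlp, channel_mlp):
    # x: (patches, channels) table of per-patch embeddings.
    # Token mixing: transpose so the MLP runs across patches, then transpose back.
    y = x + mlp(layer_norm(x).T, *token_mlp).T
    # Channel mixing: the MLP runs within each patch independently.
    return y + mlp(layer_norm(y), *channel_mlp)

# Illustrative shapes: 196 patches, 512 channels, hypothetical hidden widths.
rng = np.random.default_rng(0)
P, C, D_TOK, D_CH = 196, 512, 256, 2048
token_mlp = (0.02 * rng.normal(size=(P, D_TOK)), np.zeros(D_TOK),
             0.02 * rng.normal(size=(D_TOK, P)), np.zeros(P))
channel_mlp = (0.02 * rng.normal(size=(C, D_CH)), np.zeros(D_CH),
               0.02 * rng.normal(size=(D_CH, C)), np.zeros(C))
out = mixer_block(rng.normal(size=(P, C)), token_mlp, channel_mlp)  # (196, 512)
```

In the full model, a per-patch linear embedding precedes a stack of such blocks, and a global average pooling followed by a linear classifier sits on top.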
[Link paper]
[Link Github - vision transformer]
2. Going deeper with Image Transformers - Proposed by Facebook AI
Transformers have recently been adapted for large-scale image classification, achieving high scores and shaking up the long supremacy of convolutional neural networks. However, the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two transformer architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose performance does not saturate early with more depth: for instance, we obtain 86.5% top-1 accuracy on ImageNet when training with no external data, thus attaining the current SOTA with fewer FLOPs and parameters. Moreover, our best model establishes a new state of the art on ImageNet with reassessed labels and on ImageNet-V2 (matched frequency), in the setting with no additional training data. We share our code and models.
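The two changes are LayerScale, a learnable per-channel scaling of each residual branch initialized to a small value, and a class-attention stage that gives the CaiT models their name. Here is a minimal numpy sketch of just the LayerScale idea; the branch function and the initialization constant below are illustrative stand-ins:

```python
import numpy as np

# Minimal sketch of LayerScale: each residual branch (self-attention or MLP)
# is multiplied by a learnable per-channel vector gamma before being added
# back to the residual stream. The branch used here is a stand-in.

def residual_with_layerscale(x, branch, gamma):
    # x: (tokens, dim); branch: (tokens, dim) -> (tokens, dim); gamma: (dim,)
    return x + gamma * branch(x)

dim = 384
gamma = np.full(dim, 1e-5)   # small initialization; the paper varies it with depth
x = np.random.default_rng(0).normal(size=(197, dim))
out = residual_with_layerscale(x, lambda t: t @ np.eye(dim), gamma)  # (197, 384)
```

Keeping gamma small at initialization lets very deep stacks start close to the identity mapping, which is what allows accuracy to keep improving with depth instead of saturating.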
[Link paper]
[Link Github - CaiT]
3. Training data-efficient image transformers & distillation through attention [Jan 15-21] - Facebook AI
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers by training on ImageNet only. We train them on a single computer in less than
3 days. Our reference vision transformer (86M parameters) achieves top-1
accuracy of 83.1% (single-crop) on ImageNet with no external data.
More importantly, we introduce a teacher-student strategy specific to
transformers. It relies on a distillation token ensuring that the student
learns from the teacher through attention. We show the benefit of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet
(where we obtain up to 85.2% accuracy) and when transferring to other
tasks. We share our code and models.
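To make the teacher-student strategy concrete, here is a minimal numpy sketch of a hard-label distillation objective in the spirit of the paper: the class-token head is trained on the ground-truth label while the distillation-token head is trained on the teacher's hard prediction. The names and the equal weighting of the two terms are a simplification, not the released DeiT code:

```python
import numpy as np

# Minimal sketch of a DeiT-style hard-distillation objective. The student is
# assumed to emit two sets of logits: one from the class token and one from
# the extra distillation token appended to the patch sequence.

def cross_entropy(logits, target):
    # logits: (classes,), target: integer class index
    logits = logits - logits.max()                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    # Class-token head learns the ground-truth label; distillation-token head
    # learns the teacher's hard decision; the two terms are averaged.
    teacher_label = int(np.argmax(teacher_logits))
    return 0.5 * cross_entropy(cls_logits, label) \
         + 0.5 * cross_entropy(dist_logits, teacher_label)

rng = np.random.default_rng(0)
loss = hard_distillation_loss(rng.normal(size=1000), rng.normal(size=1000),
                              rng.normal(size=1000), label=42)
```

At inference, the paper fuses the predictions of the class and distillation heads, so the extra token adds little cost at test time.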
[Link paper]
[Link Github - DEIT]