A few days ago, DeepMind announced the release of NFNets, a family of image classification models that are:

  • State of the art on ImageNet (86.5% top-1 without extra data)
  • Up to 8.7x faster to train than EfficientNets to a given accuracy
  • Normalizer-free (no BatchNorm!)
  • Large (more than 438.4 million parameters)

Paper: http://dpmd.ai/06171 Code: http://dpmd.ai/nfnets
Image adapted from https://arxiv.org/abs/2102.06171: "Training Latency (s/step) on TPUv3, Batch Size per Device = 32"

The paper describing their implementation, and some previous work leading to their results, can be found at:

https://arxiv.org/abs/2102.06171

DeepMind’s code implementation of this paper is found at:

https://github.com/deepmind/deepmind-research/tree/master/nfnets

What is new about NFNets?

NFNets stands for Normalizer-Free (Res)Nets. Part of the novelty is that these image recognition models are trained without batch normalization layers. Instead, the authors rely on a new gradient clipping algorithm, and the resulting models outperform the previous state-of-the-art classifiers (best ImageNet top-1 accuracy without additional data). As an extra benefit, the approach significantly reduces training time and is also reported to be state of the art for transfer learning.

What is batch normalization?

Batch normalization is a technique for improving the performance and stability of neural networks, and it is commonly used in the realm of deep artificial networks. The idea is to normalize the inputs of each layer so that, across the batch, they have a mean of zero and a standard deviation of one. This is analogous to how the inputs to networks are standardized.

When we apply a tf.keras.layers.BatchNormalization, we normalize the input coming from the previous layer.
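To make the normalization step concrete, here is a minimal NumPy sketch of what a batch normalization layer computes in training mode. It is an illustration only, not the tf.keras implementation: the function name `batch_norm_train`, the `eps` value, and the toy data are assumptions made for this example, and the running statistics a real layer keeps for inference are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift.

    x has shape (batch, features). gamma and beta play the role of the
    learnable scale and offset. eps is an assumed small constant that
    avoids division by zero.
    """
    mean = x.mean(axis=0)   # per-feature mean over the batch
    var = x.var(axis=0)     # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy activations with a mean far from 0 and a std far from 1.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # close to 0 for every feature
print(y.std(axis=0))   # close to 1 for every feature
```

With `gamma` set to ones and `beta` to zeros, the output is simply the standardized input; during training the network is free to learn other values for both.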

I really recommend taking a look at Andrew Ng’s video explaining why batch normalization works.

Why does batch norm work?

Why "should" we remove batch normalization?

But batch norm works! Why do we need to remove it now!?

As pointed out by the authors from DeepMind, batch normalization has three significant practical disadvantages:

  1. It is a surprisingly expensive computational primitive, which incurs memory overhead, and significantly increases the time required to evaluate the gradient in some networks.
  2. It introduces a discrepancy between the behavior of the model during training and at inference time, introducing hidden hyper-parameters that have to be tuned.
  3. Last and most important, batch normalization breaks the independence between training examples in the minibatch (batch size matters with batch norm, and distributed training becomes extremely cumbersome).

Point number 3 has several negative consequences. It has been found that batch norm networks can be difficult to replicate on different hardware, and batch norm is often the cause of implementation errors, especially during distributed training.
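Points 2 and 3 can be seen in a few lines of NumPy. The sketch below is a toy illustration under assumed values: the helper `normalize`, the batch contents, and the "running" statistics are all made up for this example. It shows that the normalized output for one example changes when its batch companions change, and that it differs again when fixed statistics are used at inference time.

```python
import numpy as np

def normalize(x, mean, var, eps=1e-5):
    """Standardize x with the given statistics (hypothetical helper)."""
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
example = rng.normal(size=(1, 3))  # the one example we track

# The same example placed in two different minibatches.
batch_a = np.vstack([example, rng.normal(size=(7, 3))])
batch_b = np.vstack([example, rng.normal(loc=2.0, size=(7, 3))])

# Training mode: statistics come from whichever batch the example is in.
out_a = normalize(batch_a, batch_a.mean(axis=0), batch_a.var(axis=0))[0]
out_b = normalize(batch_b, batch_b.mean(axis=0), batch_b.var(axis=0))[0]
depends_on_batch = not np.allclose(out_a, out_b)

# Inference mode: fixed running statistics (assumed values for this sketch).
running_mean, running_var = np.zeros(3), np.ones(3)
out_inference = normalize(example, running_mean, running_var)[0]
train_eval_gap = np.abs(out_a - out_inference).max()

print("output depends on the rest of the batch:", depends_on_batch)
print("largest train vs. inference difference:", train_eval_gap)
```

In distributed training each device typically sees only its own slice of the batch, so the same effect appears between devices unless the statistics are synchronized.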

Better than batch normalization!

In their work, the DeepMind researchers built on top of their previous results, presented in "Normalizer-Free ResNets". The NFNets paper makes three main contributions:

  • They propose Adaptive Gradient Clipping (AGC), which clips gradients based on the unit-wise ratio of gradient norms to parameter norms, and they demonstrate that AGC allows them to train normalizer-free networks with larger batch sizes and stronger data augmentations.
  • They design a family of Normalizer-Free ResNets, called NFNets, which set new state-of-the-art validation accuracies on ImageNet for a range of training latencies. Their NFNet-F1 model achieves similar accuracy to EfficientNet-B7 while being 8.7× faster to train, and their largest model sets a new overall state of the art without extra data of 86.5% top-1 accuracy.
  • They show that NFNets achieve substantially higher validation accuracies than batch-normalized networks when fine-tuning on ImageNet after pre-training on a large private dataset of 300 million labelled images. Their best model achieves 89.2% top-1 after fine-tuning.
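The first contribution, AGC, can be sketched in NumPy as follows. This is a minimal illustration rather than DeepMind's implementation: the function names, the assumption that each row of a weight array is one output unit, and the default `clipping` value are choices made for this example. The rule itself follows the paper: a unit's gradient is rescaled whenever the ratio of its gradient norm to its parameter norm exceeds a threshold, with the parameter norm floored at a small `eps` so that zero-initialized weights can still receive updates.

```python
import numpy as np

def unitwise_norm(x):
    """L2 norm per output unit.

    Assumes axis 0 indexes the output units; 0-D and 1-D parameters
    (for example biases) are treated as a single unit.
    """
    if x.ndim <= 1:
        return np.sqrt(np.sum(x ** 2))
    axes = tuple(range(1, x.ndim))
    return np.sqrt(np.sum(x ** 2, axis=axes, keepdims=True))

def adaptive_grad_clip(grad, weight, clipping=0.01, eps=1e-3):
    """Clip grad unit-wise so that ||G_i|| / max(||W_i||, eps) <= clipping.

    The default for clipping is an illustrative value, not a recommendation.
    """
    w_norm = np.maximum(unitwise_norm(weight), eps)
    g_norm = unitwise_norm(grad)
    max_norm = clipping * w_norm
    # Rescale only the units whose gradient norm exceeds the allowed maximum.
    scale = max_norm / np.maximum(g_norm, 1e-6)
    return np.where(g_norm > max_norm, grad * scale, grad)

# Two units with identical weights: one huge gradient, one tiny gradient.
weight = np.ones((2, 4))                    # each row has norm 2
grad = np.vstack([10.0 * np.ones(4),        # norm 20, far above 0.01 * 2
                  0.001 * np.ones(4)])      # norm 0.002, below the limit
clipped = adaptive_grad_clip(grad, weight)
print(unitwise_norm(clipped).ravel())       # first unit shrunk, second unchanged
```

Unlike standard gradient clipping, the threshold here is relative to the size of the weights being updated, so the same `clipping` value applies sensibly to units of very different scale.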

NAS is here to stay…

NAS, or Neural Architecture Search, is here to stay. It is super exciting to see that DeepMind’s researchers didn’t design the architecture by hand. Instead, after adding AGC (a gradient clipping step rather than a layer), they used NAS to find the best possible architecture automatically… with pretty impressive results.

If you would like to see a TF2 implementation of AGC, check out this repository.

DeepMind’s NFNet on a webcam stream

If you would like to take DeepMind’s sexiest model for a spin, check out our GitHub repository:

https://github.com/digitalemerge/nfnets-cv2-demo

There, you will find instructions on how to set up the requirements and a script to run the model on a webcam stream. In the video below, you can see me having fun with the SOTA model 🙂

Taking NFNets for a ride!
