Pre-trained models are easy to use, but are you glossing over details that could impact your model performance?

How many times have you run the following snippets:
import torchvision.models as models
inception = models.inception_v3(pretrained=True)
or
from keras.applications.inception_v3 import InceptionV3
base_model = InceptionV3(weights='imagenet', include_top=False)
It seems like using these pre-trained models has become the new standard for industry best practices. After all, why wouldn’t you take advantage of a model that’s been trained on more data and compute than you could ever muster by yourself?

Long live pre-trained models!

There are several substantial benefits to leveraging pre-trained models: they are simple to incorporate, they can deliver solid performance with a fraction of the training time, and they typically require far less labeled data than training from scratch.
Advances within the NLP space have also encouraged the use of pre-trained language models like GPT and GPT-2, AllenNLP’s ELMo, Google’s BERT, and Sebastian Ruder and Jeremy Howard’s ULMFiT (for an excellent overview of these models, see this TOPBOTS post).
One common technique for leveraging pre-trained models is feature extraction, where you retrieve intermediate representations produced by the pre-trained model and use those representations as inputs for a new model. The layers just before the final fully-connected classifier are generally assumed to capture information that is relevant for solving a new task.
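
As a rough illustration, here is a minimal sketch of feature extraction in Keras, reusing the InceptionV3 base model from the snippet above. The pooling layer and the random stand-in images are my own illustrative choices, not part of any benchmark:

import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.models import Model
from keras.layers import GlobalAveragePooling2D

# Load the pre-trained convolutional base without its ImageNet classifier head.
base_model = InceptionV3(weights='imagenet', include_top=False)

# Pool the final feature maps into one 2048-dimensional vector per image.
features = GlobalAveragePooling2D()(base_model.output)
feature_extractor = Model(inputs=base_model.input, outputs=features)

# Extract features for a batch of 299x299 RGB images (random stand-ins here).
images = np.random.rand(4, 299, 299, 3).astype('float32') * 255.0
feature_vectors = feature_extractor.predict(preprocess_input(images))
print(feature_vectors.shape)  # (4, 2048)

You could then train a small, cheap classifier (logistic regression, or a single dense layer) on these 2048-dimensional vectors instead of fine-tuning the entire network.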

Everyone’s in on the game

Every major framework, including TensorFlow, Keras, PyTorch, and MXNet, offers pre-trained models such as Inception V3, ResNet, and AlexNet, complete with trained weights.
But are these benchmarks reproducible?

The article that inspired this post came from Curtis Northcutt, a computer science PhD candidate at MIT.
His article ‘Towards Reproducibility: Benchmarking Keras and PyTorch’ made several interesting claims:
1. ResNet architectures perform better in PyTorch, while Inception architectures perform better in Keras
2. The published benchmarks on Keras Applications cannot be reproduced, even when exactly copying the example code. In fact, the reported accuracies (as of February 2019) are usually higher than the actual accuracies (see 1 and 2)
3. Some pre-trained Keras models yield inconsistent or lower accuracies when deployed on a server (3) or run in sequence with other Keras models (4)
4. Keras models using batch normalization can be unreliable. For some models, forward-pass evaluations (with gradients supposedly off) still result in weights changing at inference time (see 5; a quick way to check for this yourself is sketched below)
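
To make claim 4 concrete, here is a minimal sketch (my own check, not code from Northcutt’s article) for detecting whether a model’s weights drift during pure inference: snapshot the weights, run a forward pass, and diff.

import numpy as np
from keras.applications.inception_v3 import InceptionV3

# Load a pre-trained model and snapshot every weight tensor.
model = InceptionV3(weights='imagenet')
before = [w.copy() for w in model.get_weights()]

# A pure forward pass: no fit(), no gradient updates requested.
_ = model.predict(np.random.rand(2, 299, 299, 3).astype('float32'))

# If inference is truly side-effect free, nothing should have changed.
after = model.get_weights()
changed = [i for i, (b, a) in enumerate(zip(before, after))
           if not np.allclose(b, a)]
print('weight tensors modified during inference:', changed or 'none')

If any indices show up, the usual suspects are batch normalization’s moving mean and variance, which are meant to update only during training.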

You might be wondering: 
How is that possible?
Aren’t these the same models, and shouldn’t they have the same performance if trained under the same conditions?

Well, you’re not alone. Curtis’ article sparked a number of reactions on Twitter, along with some interesting insights into the reasons for these differences.
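
One frequently mentioned culprit, illustrated below with a PyTorch sketch of my own (not taken from those tweets), is evaluation mode: layers like batch normalization and dropout behave differently at train and test time, so benchmarking a model in the wrong mode can quietly change the numbers.

import torch
import torchvision.models as models

# Pre-trained or not, a freshly constructed model starts in training mode.
model = models.resnet50(pretrained=True)
print(model.training)  # True: BatchNorm uses batch statistics, Dropout is active

# Switching to eval mode makes BatchNorm use its stored running statistics
# and disables Dropout; forgetting this call is a classic benchmarking bug.
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])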