Batch Normalization In Neural Networks (Code Included)

Implemented With TensorFlow (Keras)



Batch Normalization (BN) is a technique many machine learning practitioners encounter.

If you haven’t, this article explains the basic intuition behind BN, including its origin and how it can be implemented within a neural network using TensorFlow and Keras.

If you are already familiar with the BN technique and would like to focus on the implementation, you can skip to the Code section below.

If you are more interested in the maths behind it, feel free to read the article below.

Batch Normalization In Neural Networks Explained (Algorithm Breakdown)


Batch Normalization is a technique that mitigates the effect of unstable gradients within a neural network by introducing an additional layer that performs operations on the inputs from the previous layer. The operations first standardize and normalize the input values; the resulting values are then transformed through scaling and shifting operations.
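As a rough illustration of the operations described above, the following sketch applies them to a single feature across one mini-batch in plain Python. The function name `batch_norm_feature`, the sample batch and the `epsilon` value are assumptions made for this example; a real BN layer performs the equivalent computation on tensors, and `gamma` and `beta` are learned during training rather than fixed.

```python
import math

def batch_norm_feature(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    batch_mean = sum(values) / len(values)
    # Biased (population) variance of the mini-batch
    batch_var = sum((v - batch_mean) ** 2 for v in values) / len(values)
    # Standardize: zero mean and (approximately) unit variance;
    # epsilon guards against division by zero
    normalized = [(v - batch_mean) / math.sqrt(batch_var + epsilon) for v in values]
    # Rescale with gamma and offset with beta
    return [gamma * n + beta for n in normalized]

batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one unit
print(batch_norm_feature(batch))
print(batch_norm_feature(batch, gamma=2.0, beta=0.5))
```

With the default `gamma=1.0` and `beta=0.0`, the output has a mean of zero; changing `gamma` and `beta` stretches and moves that distribution.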


The first step is to import the tools and libraries that will be used to implement, or support the implementation of, the neural network. The tools used are as follows:

  • TensorFlow: An open-source platform for the implementation, training, and deployment of machine learning models.
  • Keras: An open-source library used for the implementation of neural network architectures that run on both CPUs and GPUs.
import tensorflow as tf
from tensorflow import keras

The dataset we’ll be utilizing is the simple Fashion-MNIST dataset.

The Fashion-MNIST dataset contains 70,000 images of clothing. More specifically, it includes 60,000 training examples and 10,000 testing examples, all of which are grayscale images with dimensions 28 x 28, categorized into ten classes.

Preparation of the dataset includes normalizing the training and test images by dividing each pixel value by 255.0. This places the pixel values in the range 0 to 1.

A validation portion of the dataset is also created at this stage. This subset is used during training to assess the performance of the network at various iterations.

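# Load Fashion-MNIST and scale pixel values from [0, 255] to [0, 1]
(train_images, train_labels), (test_images, test_labels) = keras.datasets.fashion_mnist.load_data()
train_images = train_images / 255.0
test_images = test_images / 255.0
# Hold out the first 5,000 training examples for validation and train on the rest,
# so that the validation set does not overlap with the training data
validation_images, train_images = train_images[:5000], train_images[5000:]
validation_labels, train_labels = train_labels[:5000], train_labels[5000:]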

Keras provides the tools required to implement the classification model, including a Sequential API for stacking the layers of the neural network consecutively.

Below is some information on the layers that will be implemented to make up our neural network.

  • Flatten: Takes an input shape and flattens the input image data into a one-dimensional array.
  • Dense: A dense layer contains a configurable number of units (neurons). Each neuron is a perceptron.
  • A perceptron is a fundamental component of an artificial neural network, invented by Frank Rosenblatt in 1958. A perceptron utilizes operations based on the threshold logic unit.
  • Batch Normalization: A Batch Normalization layer performs a series of operations on the incoming input data: standardization, normalization, rescaling, and shifting (offsetting) of the input values coming into the BN layer.
  • Activation Layer: Applies a specified activation function to its inputs, which introduces non-linearity into the network. The model implemented in this article uses two activation functions: Rectified Linear Unit (ReLU) and softmax.
  • The transformation imposed by ReLU on values from a neuron is represented by the formula y = max(0, x). The ReLU activation function clamps any negative value from the neuron to 0, while positive values remain unchanged. The result of this transformation is used as the activation of the current layer and as the input to the next.
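To make the two activation functions named in the list concrete, here is a minimal plain-Python sketch. The helper names `relu` and `softmax` and the sample inputs are chosen for illustration only; Keras ships its own implementations, which operate on whole tensors.

```python
import math

def relu(x):
    """y = max(0, x): negative values become 0, positive values pass through."""
    return max(0.0, x)

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print([relu(x) for x in [-2.0, -0.5, 0.0, 1.5]])  # [0.0, 0.0, 0.0, 1.5]
print(softmax([1.0, 2.0, 3.0]))
```

ReLU is applied element by element, whereas softmax looks at all the scores of the output layer together, which is why it is used on the final ten-unit layer to produce class probabilities.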
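# Placing batch normalization layer before the activation layers
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, use_bias=False),
    keras.layers.BatchNormalization(), keras.layers.Activation(keras.activations.relu),
    keras.layers.Dense(200, use_bias=False),
    keras.layers.BatchNormalization(), keras.layers.Activation(keras.activations.relu),
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(), keras.layers.Activation(keras.activations.relu),
    keras.layers.Dense(10, activation=keras.activations.softmax)
])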

Let’s take a look at the internal components of a BN layer.

Accessing the layer at index two provides information on the variables within the first BN layer and their contents.

I won’t go into too many details here, but take note of the variable names ‘gamma’ and ‘beta’. The values held within these variables are responsible for the rescaling and offsetting of activations within the layer.

for variable in model.layers[2].variables:
    print(variable.name)
>> batch_normalization/gamma:0
>> batch_normalization/beta:0
>> batch_normalization/moving_mean:0
>> batch_normalization/moving_variance:0
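The two `moving_*` variables listed above are not trained by gradient descent. As a hedged sketch of how such running statistics are typically maintained, the snippet below applies an exponential moving average with a `momentum` of 0.99, which is the documented default of the Keras BatchNormalization layer. The helper name `update_moving_stat` and the batch means are made up for illustration.

```python
def update_moving_stat(moving_value, batch_value, momentum=0.99):
    """Blend the running statistic with the statistic of the current batch."""
    return momentum * moving_value + (1.0 - momentum) * batch_value

moving_mean = 0.0  # the running mean starts at zero
for batch_mean in [5.0, 5.2, 4.8]:  # hypothetical per-batch means
    moving_mean = update_moving_stat(moving_mean, batch_mean)
    print(round(moving_mean, 4))
```

With a momentum this close to 1, the running value moves slowly towards the batch statistics. These accumulated values are what the layer uses in place of batch statistics at inference time.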

This article goes into more detail regarding the operations within BN layers.

Within the dense layers, the bias component is disabled (use_bias=False). The bias is omitted because any constant added to the activations is cancelled out by the mean subtraction that occurs during normalization.
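The cancellation can be checked numerically. The sketch below standardizes a small batch of pre-activation values with and without a constant bias added; the values, the bias of 7.0 and the helper name `standardize` are all hypothetical and chosen only for this demonstration.

```python
import math

def standardize(values, epsilon=1e-3):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + epsilon) for v in values]

pre_activations = [0.3, -1.2, 2.5, 0.9]  # weighted sums of one unit over a batch of four
bias = 7.0                               # hypothetical constant bias term
with_bias = [z + bias for z in pre_activations]

# The bias shifts the batch mean by the same constant, so it vanishes after subtraction
difference = max(abs(a - b) for a, b in zip(standardize(pre_activations), standardize(with_bias)))
print(difference)  # zero up to floating-point rounding
```

Because the bias has no effect on the normalized output, the `beta` variable of the BN layer takes over the role of a learnable offset.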

Below is a snippet of a Twitter post by Andrej Karpathy, current Director of AI at Tesla. His tweet covered common neural network mistakes, and not setting bias to false when using BN was on the list.

In the next snippet of code we set and specify the optimization algorithm to train the implemented neural network with, along with the loss function and hyperparameters such as learning rate and the number of epochs.

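# `lr` and `decay` are legacy arguments; newer Keras versions expect `learning_rate` and a schedule
lr_schedule = keras.optimizers.schedules.InverseTimeDecay(initial_learning_rate=0.01, decay_steps=1, decay_rate=1e-6)
sgd = keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9, nesterov=True)
model.compile(loss="sparse_categorical_crossentropy", optimizer=sgd, metrics=["accuracy"])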
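The decay value of 1e-6 given to the optimizer corresponds to time-based decay, in which the learning rate shrinks slowly as the number of update steps grows. The following sketch shows that relationship in isolation; the function name `decayed_learning_rate` and the step counts are assumptions made for illustration, not part of the Keras API.

```python
def decayed_learning_rate(initial_lr, decay, step):
    """Time-based decay: lr = initial_lr / (1 + decay * step)."""
    return initial_lr / (1.0 + decay * step)

for step in [0, 1_000, 100_000, 1_000_000]:
    print(step, decayed_learning_rate(0.01, 1e-6, step))
```

With such a small decay value, the learning rate stays close to 0.01 for the first few thousand steps and only halves after a million updates.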

Now we train the network. The model’s ‘fit’ method provides the tools to train the implemented network. We will skip some details regarding how the neural network model is trained. For a detailed explanation of the training and implementation of neural networks, refer to the link below.

(In-depth) Machine Learning Image Classification With TensorFlow 2.0

model.fit(train_images, train_labels, epochs=60, validation_data=(validation_images, validation_labels))

The evaluation of the model performance is conducted using the test data set aside earlier.

With evaluation results, you can decide to fine-tune the network hyperparameters or move forward to production after observing the accuracy of the evaluation over the test dataset.

model.evaluate(test_images, test_labels)

During the training phase, you might notice that each epoch takes longer to train in comparison to training a network without batch normalization layers. This is because batch normalization adds a layer of complexity to the neural network, along with extra parameters that the model must learn during training.

However, the increase in time per epoch is offset by the fact that Batch Normalization reduces the time taken for the model to converge to an optimal solution.
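To get a feel for how many extra parameters are involved, the arithmetic below tallies them for the architecture described in this article. It assumes a 28 x 28 input flattened to 784 values and one BN layer after each of the three bias-free hidden Dense layers; the variable names are illustrative.

```python
# (inputs, units) for the three bias-free hidden Dense layers
hidden_layers = [(28 * 28, 300), (300, 200), (200, 100)]

dense_params = sum(n_in * n_out for n_in, n_out in hidden_layers)
output_params = 100 * 10 + 10  # the final Dense layer keeps its bias

# Each BN layer stores gamma, beta, moving_mean and moving_variance per unit;
# only gamma and beta are updated by the optimizer
bn_trainable = sum(2 * units for _, units in hidden_layers)
bn_non_trainable = sum(2 * units for _, units in hidden_layers)

print("Dense parameters:", dense_params + output_params)  # 316210
print("BN trainable parameters:", bn_trainable)           # 1200
print("BN non-trainable parameters:", bn_non_trainable)   # 1200
```

The BN parameters are few compared with the Dense weights, so the longer epochs come mainly from the additional normalization computations rather than from the parameter count itself.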

The model implemented in this article is too shallow for us to notice the full benefits of utilizing batch normalization within a neural network architecture. Typically, batch normalization is found in deeper convolutional neural networks such as Xception, ResNet50 and Inception V3.


  • The neural network implemented above has the Batch Normalization layer just before the activation layers. But it is entirely possible to add BN layers after activation layers.
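# Placing batch normalization layer after the activation layers
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, use_bias=False),
    keras.layers.Activation(keras.activations.relu), keras.layers.BatchNormalization(),
    keras.layers.Dense(200, use_bias=False),
    keras.layers.Activation(keras.activations.relu), keras.layers.BatchNormalization(),
    keras.layers.Dense(100, use_bias=False),
    keras.layers.Activation(keras.activations.relu), keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation=keras.activations.softmax)
])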


BN is a commonly used technique within neural networks, so understanding how the technique works and how it is implemented is useful knowledge, especially when analyzing neural network architectures.

Below is a GitHub link to a notebook that includes the code snippets presented in this article.

Batch Normalization In Neural Networks (Code) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.