Understanding He Initialization in Deep Learning
Training a deep learning model starts with initializing the weights of its layers. These weights are updated throughout training, so the initial values might seem unimportant. In practice, however, the initial weights can significantly affect how quickly the model converges, or even whether it converges at all. This blog post focuses on a popular technique called He initialization and compares it to other common initialization methods.
What is He Initialization?
He initialization (its most common variant is called He normal initialization) is a weight initialization method named after Kaiming He, the first author of the paper that introduced it. The technique draws initial weights from a truncated normal distribution centered at zero.
The key to He initialization lies in how it accounts for the size of the previous layer in the network (the number of input nodes feeding the layer). The standard deviation of the normal distribution is proportional to the square root of the inverse of the number of input nodes: if we denote the number of input nodes as n, each initial weight w is drawn from a normal distribution with standard deviation sqrt(2/n).
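As a minimal sketch of this rule (using NumPy, with a plain rather than truncated normal for brevity, and arbitrary layer sizes chosen purely for illustration):

```python
import numpy as np

def he_normal(n_in, n_out, rng=None):
    """Draw an (n_in, n_out) weight matrix with He initialization.

    n_in is the number of input nodes (fan-in); the standard deviation
    is sqrt(2 / n_in). A non-truncated normal is used here for brevity.
    """
    rng = rng or np.random.default_rng()
    std = np.sqrt(2.0 / n_in)
    return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

# Example with arbitrary layer sizes: 512 inputs, 256 outputs.
W = he_normal(512, 256)
print(W.std())  # close to sqrt(2 / 512) ≈ 0.0625
```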
Why is He Initialization Used?
The purpose of He initialization, and of other initialization techniques, is to prevent exploding or vanishing gradients. Exploding gradients occur when large error gradients accumulate and produce very large updates to the network's weights during training. Vanishing gradients are the opposite problem: gradients shrink toward zero as they are propagated backward through the layers, so the weights update very slowly and the model effectively stops learning.
He initialization is primarily used with layers that have ReLU (or ReLU variants) as the activation function. Because ReLU outputs zero for all negative inputs, it discards roughly half of the signal's variance at each layer; if the initial weights are not scaled to compensate, activations shrink or grow layer by layer, and badly scaled weights can also contribute to the "dying ReLU" problem, where neurons become inactive and only output 0.
He initialization addresses this by scaling the initial weights according to the size of the previous layer: the factor of 2 in sqrt(2/n) compensates for the variance lost to ReLU, so the magnitude of activations stays roughly constant from layer to layer, which alleviates the vanishing/exploding gradients problem. Like any random initialization, it also breaks the symmetry between hidden units in the same layer.
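To see why this scaling matters, here is a small NumPy experiment (the depth, width, and the 0.01 comparison scale are arbitrary choices for illustration) that pushes random data through a stack of ReLU layers and compares He-scaled weights with naively small weights:

```python
import numpy as np

def forward(x, depth, std_fn, n=512, seed=0):
    """Pass x through `depth` fully connected ReLU layers of width n,
    with weights drawn from a normal distribution of std = std_fn(n)."""
    rng = np.random.default_rng(seed)
    for _ in range(depth):
        W = rng.normal(0.0, std_fn(n), size=(n, n))
        x = np.maximum(0.0, x @ W)  # ReLU
    return x

x = np.random.default_rng(1).normal(size=(64, 512))

he_out = forward(x, depth=20, std_fn=lambda n: np.sqrt(2.0 / n))  # He scaling
tiny_out = forward(x, depth=20, std_fn=lambda n: 0.01)            # naive small weights

print(he_out.std())    # stays roughly on the order of 1: the signal is preserved
print(tiny_out.std())  # collapses toward 0: activations (and their gradients) vanish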
Other Common Initialization Methods
While He initialization is commonly used, especially with ReLU and its variants, there are other initialization methods that work better with different activation functions.
- Xavier/Glorot Initialization: Similar to He initialization, Xavier initialization, named after Xavier Glorot, also considers the size of the previous layer. However, instead of sqrt(2/n), it uses a standard deviation of sqrt(1/n) (Glorot's original formulation averages the input and output sizes, giving sqrt(2/(n_in + n_out))). The smaller scale makes Xavier initialization better suited for sigmoid or tanh activation functions, which, unlike ReLU, do not zero out half of their inputs.
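As a usage sketch, assuming TensorFlow/Keras is available: both initializers can be selected by their Keras string names and paired with matching activations (the layer sizes here are arbitrary).

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    # ReLU-activated layer: pair it with He initialization.
    keras.layers.Dense(256, activation="relu", kernel_initializer="he_normal"),
    # tanh-activated layer: pair it with Xavier/Glorot initialization.
    keras.layers.Dense(128, activation="tanh", kernel_initializer="glorot_normal"),
    keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```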