Exponential Weighted Average, RMSprop, and Adam optimisation algorithms
Exponential Weighted Average, RMSprop, and Adam optimisation algorithms are important concepts in the field of machine learning, specifically for training deep learning models. Each of these techniques plays a role in updating the weights of a neural network during training to minimize the loss function. I'll explain each concept and their relationships below.
1. Exponential Weighted Average (EWA)
Exponential Weighted Average is a technique used to compute a moving average of time-series data, where more recent data points receive higher weight than older ones. This is done by introducing a smoothing factor, beta (0 < beta < 1), that determines how quickly the weights decay.
EWA can be calculated as follows:
V_t = beta * V_(t-1) + (1 - beta) * x_t
Here, V_t is the weighted average at time step t, V_(t-1) is the weighted average at time step t-1, x_t is the actual data point at time step t, and beta is the smoothing factor.
The choice of beta determines how quickly the weights decay. A small value of beta gives most of the weight to recent data points, so the average reacts quickly but stays noisy, while a larger value spreads the weight over more of the older data points and produces a smoother average. Roughly, the average covers about the last 1 / (1 - beta) data points, so beta = 0.9 averages over roughly the last 10 values.
EWA helps in reducing noise and fluctuations in the data, making it easier to identify trends and patterns. It's a crucial component in the optimization algorithms discussed below.
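As a small illustration, here is a minimal Python sketch of the EWA recursion; the data values and the choice of beta = 0.9 are example assumptions, not anything prescribed above:

# Minimal sketch of an exponentially weighted average (EWA).
def ewa(series, beta=0.9):
    v = 0.0                               # V_0, initialised to zero
    averages = []
    for x in series:
        v = beta * v + (1 - beta) * x     # V_t = beta * V_(t-1) + (1 - beta) * x_t
        averages.append(v)
    return averages

# Example: smooth a short noisy sequence.
data = [1.0, 2.0, 1.5, 3.0, 2.5]
print(ewa(data))

Note that because V_0 starts at zero, the first few averages are biased low; this is the bias that Adam's correction terms (described below) are designed to remove.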
2. RMSprop (Root Mean Square Propagation)
RMSprop is an adaptive learning rate optimization algorithm proposed by Geoffrey Hinton. It addresses the issue of using a single fixed learning rate in gradient-based optimization algorithms, such as Gradient Descent, by adapting the learning rate for each parameter during training.
RMSprop keeps an exponentially weighted average of the squared gradients and divides the current gradient by the square root of this average. Parameters with consistently large gradients therefore take smaller effective steps, while parameters with small gradients take larger ones, resulting in more stable and efficient convergence.
The update rule for RMSprop is as follows:
S_t = beta * S_(t-1) + (1 - beta) * (g_t)^2
theta_t = theta_(t-1) - (learning_rate / sqrt(S_t + epsilon)) * g_t
Here, S_t is the weighted average of the squared gradients at time step t, S_(t-1) is the weighted average of the squared gradients at time step t-1, g_t is the gradient at time step t, theta_t and theta_(t-1) are the model parameters at time steps t and t-1, respectively, epsilon is a small constant to prevent division by zero, and learning_rate is the global learning rate.
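To make the update concrete, here is a minimal NumPy sketch of one RMSprop step applied to a toy quadratic loss; the hyperparameter values (learning_rate = 0.001, beta = 0.9, epsilon = 1e-8) are common defaults assumed for illustration:

import numpy as np

def rmsprop_step(theta, grad, s, learning_rate=0.001, beta=0.9, epsilon=1e-8):
    # S_t = beta * S_(t-1) + (1 - beta) * (g_t)^2
    s = beta * s + (1 - beta) * grad ** 2
    # theta_t = theta_(t-1) - (learning_rate / sqrt(S_t + epsilon)) * g_t
    theta = theta - learning_rate * grad / np.sqrt(s + epsilon)
    return theta, s

# Example usage with a toy loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
s = np.zeros_like(theta)
for _ in range(1000):
    grad = theta                    # gradient of the toy loss
    theta, s = rmsprop_step(theta, grad, s)
print(theta)                        # parameters move toward the minimum at zero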
3. Adam (Adaptive Moment Estimation)
Adam is another adaptive learning rate optimization algorithm, proposed by Diederik P. Kingma and Jimmy Ba. It combines the ideas of Momentum and RMSprop, making it well-suited for a wide range of deep learning tasks.
Adam maintains two exponentially weighted averages: one for the gradients (similar to Momentum) and one for the squared gradients (similar to RMSprop). These two weighted averages are used to compute adaptive learning rates for each parameter.
The update rule for Adam is as follows:
m_t = beta1 * m_(t-1) + (1 - beta1) * g_t
v_t = beta2 * v_(t-1) + (1 - beta2) * (g_t)^2
m_t_hat = m_t / (1 - beta1^t)
v_t_hat = v_t / (1 - beta2^t)
theta_t = theta_(t-1) - (learning_rate / (sqrt(v_t_hat) + epsilon)) * m_t_hat
Here, m_t and v_t are the weighted averages of the gradients and the squared gradients at time step t, beta1 and beta2 are their smoothing factors (commonly 0.9 and 0.999), m_t_hat and v_t_hat are the bias-corrected averages (the division by 1 - beta^t compensates for initialising m and v at zero), epsilon is a small constant to prevent division by zero, and learning_rate is the global learning rate.
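Finally, here is a minimal NumPy sketch of a full Adam step; the defaults beta1 = 0.9, beta2 = 0.999, learning_rate = 0.001, and epsilon = 1e-8 are the commonly used values and are assumed here for illustration:

import numpy as np

def adam_step(theta, grad, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    # First moment:  m_t = beta1 * m_(t-1) + (1 - beta1) * g_t
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: v_t = beta2 * v_(t-1) + (1 - beta2) * (g_t)^2
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero initialisation of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return theta, m, v

# Example usage with the same toy loss L(theta) = 0.5 * ||theta||^2.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):            # t starts at 1 so the bias correction is well defined
    grad = theta                    # gradient of the toy loss
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                        # parameters approach the minimum at zero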