Mini-batch gradient descent
Gradient descent is a popular optimization algorithm used to find the minimum of a function, usually the cost function of a machine learning model. The idea behind gradient descent is to repeatedly take steps in the direction of the negative gradient of the cost function, which is the direction of steepest descent, until the algorithm reaches a minimum.
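As a minimal sketch of this update rule (the cost function, data, and learning rate below are illustrative assumptions, not taken from the text), one gradient descent step on a linear model with a mean-squared-error cost might look like this:

```python
import numpy as np

def cost_gradient(theta, X, y):
    """Gradient of the mean squared error cost for a linear model (illustrative)."""
    predictions = X @ theta
    return X.T @ (predictions - y) / len(y)

# Toy data and hypothetical hyperparameters for demonstration.
X = np.random.randn(100, 3)
y = np.random.randn(100)
theta = np.zeros(3)
learning_rate = 0.1

# One gradient descent step: move the parameters against the gradient of the cost.
theta = theta - learning_rate * cost_gradient(theta, X, y)
```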
However, in practice, the cost function may involve millions or even billions of data points. Computing the gradient of the cost function over the entire dataset can be computationally expensive, and the full dataset may not even fit into memory. To address this issue, the mini-batch gradient descent algorithm was introduced.
Mini-batch gradient descent is a variant of gradient descent that, instead of computing the gradient over the entire dataset, computes it over a smaller subset of the data called a mini-batch. The mini-batch size is typically chosen to be a power of 2, such as 32, 64, 128, or 256, and is a hyperparameter that needs to be tuned.
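One common way to form mini-batches, sketched here with hypothetical NumPy arrays and a helper name chosen for illustration, is to shuffle the example indices and slice them into chunks of the chosen size:

```python
import numpy as np

def make_mini_batches(X, y, batch_size=32, rng=None):
    """Shuffle the data and yield (X_batch, y_batch) chunks of the given size."""
    rng = rng or np.random.default_rng()
    indices = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]
```

Each yielded pair then drives one gradient computation and one parameter update.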
Instead of calculating the cost over all data points in the training set, mini-batch gradient descent computes the cost over a small subset of data points at each iteration, which is computationally efficient. This approach can be much faster than batch gradient descent, especially when dealing with large datasets.
For instance, suppose you have a training set of 10,000 data points and the mini-batch size is set to 32. In that case, mini-batch gradient descent computes the cost and updates the parameters of the model after every 32 data points, so each update is significantly cheaper than computing the cost over all 10,000 data points in a single batch.
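Putting these pieces together, a hedged sketch of a full mini-batch training loop on a hypothetical 10,000-example linear regression problem (all data and hyperparameter values here are illustrative) could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: 10,000 examples, 3 features, linear targets plus noise.
X = rng.standard_normal((10_000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(10_000)

theta = np.zeros(3)
learning_rate = 0.1
batch_size = 32

for epoch in range(5):
    indices = rng.permutation(len(y))                   # reshuffle the data each epoch
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        grad = X_b.T @ (X_b @ theta - y_b) / len(y_b)   # gradient on the mini-batch only
        theta = theta - learning_rate * grad            # update after every mini-batch
```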
By computing the gradient over a smaller subset of the data, mini-batch gradient descent provides an approximation of the true gradient at a much lower computational cost. This approximation allows for a faster convergence rate, and the noise it introduces into the updates can even help the algorithm escape shallow local minima.
Overall, the main point of mini-batch gradient descent is to converge faster.
Although both batch gradient descent and mini-batch gradient descent eventually use the entire training set, mini-batch gradient descent can be computationally more efficient than batch gradient descent, particularly when the training dataset is very large.
The main reason for this is that mini-batch gradient descent updates the model parameters after each mini-batch, while batch gradient descent updates them only after computing the gradient over the entire training set. Each mini-batch update therefore requires far less computation, and the parameters are refined many times per pass through the data, resulting in faster convergence at a lower computational cost.
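As a quick back-of-envelope check of the update frequency, using the 10,000-example figure from the earlier illustration (the numbers and variable names here are assumptions for demonstration only):

```python
# Updates per pass over the data (epoch) for the hypothetical 10,000-example set.
n_examples = 10_000
batch_size = 32

updates_per_epoch_batch = 1                                 # batch GD: one update per epoch
updates_per_epoch_minibatch = -(-n_examples // batch_size)  # ceil(10000 / 32) = 313 updates

print(updates_per_epoch_batch, updates_per_epoch_minibatch)  # 1 313
```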