Residual prediction models in gradient boosting trees


Here is the question I had when I first learned how the residual prediction models in gradient boosting trees work: if the first model makes errors, and we're training the next model on those errors, wouldn't we just be adding more error?

The key point here is to remember what the second model is trying to predict. It's not trying to predict the original target variable, but rather the errors (residuals) of the first model. Essentially, it's learning how much the first model was off by for each instance. If the second model is reasonably accurate, then its predictions represent a good estimate of the first model's errors.
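To make that concrete, here is a rough sketch using scikit-learn's `DecisionTreeRegressor` on a made-up toy dataset (the data and variable names are just illustrative, not anything from a real project):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# First model: fit the original target.
model_1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
pred_1 = model_1.predict(X)

# Second model: fit the residuals (errors) of the first model,
# NOT the original target.
residuals_1 = y - pred_1
model_2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals_1)
```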

So when we add the predictions from the second model (the estimated errors) to the predictions of the first model, we're correcting the first model's predictions. If the first model predicted too high a value for a particular instance, and the second model correctly learned this, it would predict a negative residual for that instance. Adding that negative residual to the first model's prediction brings it closer to the true value, reducing the error.

Similarly, if the first model predicted too low a value for an instance, the second model would predict a positive residual, and adding this to the first model's prediction would again bring it closer to the true value.
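Continuing the sketch above, the correction is literally an addition. The second model's output estimates the first model's error (positive or negative), so adding it should shrink the overall error:

```python
# Estimated residuals from the second model (may be positive or negative).
pred_2 = model_2.predict(X)

# Corrected predictions: first model's guess plus its estimated error.
corrected = pred_1 + pred_2

mse_before = np.mean((y - pred_1) ** 2)
mse_after = np.mean((y - corrected) ** 2)
print(f"MSE of first model alone: {mse_before:.4f}")
print(f"MSE after adding residual predictions: {mse_after:.4f}")
```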

This is why the second model is trained on the residuals of the first model: its job is to learn how to correct the first model's mistakes. The process then repeats, with each new model trying to correct the errors made by the sum of all previous models. Over multiple iterations, this procedure can significantly improve the accuracy of the predictions.
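For completeness, here is a sketch of the iterated version of the same idea. The `learning_rate` shrinkage factor and the helper names `fit_boosted_trees` / `predict_boosted_trees` are my own additions for illustration, not part of any library:

```python
def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit a sequence of trees, each on the residuals of the ensemble so far."""
    trees = []
    prediction = np.zeros_like(y, dtype=float)
    for _ in range(n_trees):
        residuals = y - prediction               # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        trees.append(tree)
        prediction += learning_rate * tree.predict(X)  # partial correction
    return trees

def predict_boosted_trees(trees, X, learning_rate=0.1):
    """Sum the (shrunken) corrections from every tree in the ensemble."""
    return learning_rate * sum(tree.predict(X) for tree in trees)
```

The shrinkage factor means each tree only applies part of its correction, which is a common refinement in practice; the core mechanism is still the same "fit the residuals, add the result" loop described above.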