Deep Q-Learning


Deep Q-Learning is a reinforcement learning algorithm that allows an agent to learn how to make optimal decisions from past experience. It extends the classical Q-learning algorithm, which uses a table to store Q-values, i.e., the expected cumulative rewards for each possible action in a given state. Deep Q-Learning instead uses a neural network to approximate the Q-values, making it suitable for large, complex environments where a table would be impractical.
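
To make the function-approximation idea concrete, here is a minimal sketch of such a Q-network in PyTorch. The library choice, layer sizes, and the state/action dimensions (suggestive of a small environment like CartPole) are assumptions for illustration, not details from the article:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action.

    state_dim and n_actions are hypothetical values for a small
    environment such as CartPole (4-dimensional state, 2 actions).
    """
    def __init__(self, state_dim: int = 4, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```

The network takes a state as input and outputs one Q-value per action, so a single forward pass scores every available action at once.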

Training

The training process of Deep Q-Learning can be broken down into several steps, which are as follows:

  1. Initialization: The network's weights are initialized randomly.

  2. Exploration: The agent explores the environment by taking random actions. This helps in discovering new states and actions.

  3. Experience Replay: The agent stores its experiences in a replay buffer. Each experience consists of the state, action, reward, and next state. The replay buffer is used to sample random batches of experiences during training (a sketch of the buffer, together with the action selection of the next step, follows this list).

  4. Action Selection: The agent selects actions using an epsilon-greedy policy: with probability epsilon it selects a random action, and with probability 1 − epsilon it selects the action with the highest Q-value.

  5. Update: The agent updates the weights of the network using the Bellman equation, a recursive equation that relates the value of a state to the values of its successor states:

Q(s, a) = r + γ * max_a' Q(s', a')
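
As a rough illustration of steps 3 and 4 above, the sketch below implements a replay buffer and epsilon-greedy action selection. The buffer capacity, batch size, and the q_network interface (carried over from the earlier sketch) are assumptions for the example:

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        # Random sampling breaks the temporal correlation
        # between consecutive experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def select_action(q_network, state, n_actions: int, epsilon: float):
    """Epsilon-greedy policy: explore with probability epsilon,
    otherwise pick the action with the highest predicted Q-value."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())
```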

Deep Q-Learning depends entirely on this Bellman equation during training. The objective is to learn an appropriate Q(s, a) for every state-action input. By interacting with the environment beforehand, many experience tuples {s, a, r, s'} are collected.

Therefore, if you start from a randomly initialized Q-function, the temporal-difference error

r + γ * max_a' Q(s', a') − Q(s, a)

is large. But gradually Q is corrected and approaches its ideal value, satisfying the Bellman equation almost exactly. To train the neural network, the state-action pair (s, a) is used as the input and Q(s, a) as the output, with r + γ * max_a' Q(s', a') as the target value. In this way, the network learns how to behave in each situation. This procedure is repeated 10,000 times.
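
Putting the pieces together, one training step can be sketched as a gradient update that pushes Q(s, a) toward the target r + γ * max_a' Q(s', a'). The mean-squared-error loss, the done-flag handling, and the optimizer interface are standard choices assumed for illustration:

```python
import torch
import torch.nn as nn

def train_step(q_network, optimizer, batch, gamma: float = 0.99):
    """One gradient step toward the Bellman target
    r + gamma * max_a' Q(s', a')."""
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Current estimate Q(s, a) for the actions actually taken.
    q_sa = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target; no gradient flows through the target.
    with torch.no_grad():
        max_next_q = q_network(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q

    # Squared TD error: (r + gamma * max Q(s', a') - Q(s, a))^2
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Looping this step, together with environment interaction and buffer filling, for the roughly 10,000 repetitions mentioned above gradually shrinks the TD error. In practice, a separate, slowly updated copy of the network is often used to compute the target more stably.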