This article is an attempt to establish the intuition behind Deep Q Networks. Before we get to that, let’s spend some time understanding what Q Learning is. Too long, do read!
In simple terms, Q Learning is a model-free Reinforcement Learning approach that enables an agent to take actions, navigate an environment, and reach its goal. Model-free doesn’t mean the absence of any network to train the agent (such as the one used in a Deep Q Network), but the absence of a model of the environment that can predict the next state or reward. In model-free learning such as this, the agent relies mainly on its memory of past transitions (an action a taken at state s yielding some reward r) to train the network. This set of transitions is also termed Experience Replay, and the structure that stores it a Replay Buffer.
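To make the replay buffer concrete, here is a minimal sketch in Python; the class name and default capacity are illustrative assumptions, not from any particular library. It stores (state, action, reward, next state) transitions and hands back random mini-batches for training:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and serves random mini-batches for training."""

    def __init__(self, capacity=10000):
        # Oldest transitions are discarded once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Record one transition: action a taken at state s gave reward r.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample uniformly at random from the stored transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling at random, rather than replaying transitions in order, breaks the correlation between consecutive steps, which tends to stabilize training.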
Rewards or penalties are laid out by the environment at every step, given a state. Reaching the goal would, of course, give a higher reward.
When it comes to real-life use cases, the environment is going to be complex, with large state and action spaces. We are going to use a simple grid-based example here with just 4 actions: Left, Up, Right, Down.
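As a rough sketch of such a grid world (the grid size, goal cell, and encoding below are assumptions for illustration), the four actions can be encoded as row/column offsets:

```python
# The four actions, encoded as (row_offset, col_offset) pairs.
ACTIONS = {
    "Left":  (0, -1),
    "Up":    (-1, 0),
    "Right": (0, 1),
    "Down":  (1, 0),
}

GRID_SIZE = 3   # assumed 3x3 grid for illustration
GOAL = (0, 2)   # assumed goal cell

def step(state, action):
    """Apply an action deterministically; stay in place at the grid edges."""
    row, col = state
    dr, dc = ACTIONS[action]
    new_row = min(max(row + dr, 0), GRID_SIZE - 1)
    new_col = min(max(col + dc, 0), GRID_SIZE - 1)
    return (new_row, new_col)
```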
The highlighted box is the destination of the agent. With no rewards or penalties in the given situation, it’s easy for the agent to reach the destination. A kid who loves to solve puzzles may draw it like this:
Now let’s make the grid a bit more interesting. The agent is going to receive a reward of 1 when it reaches the goal, but is going to be fined -0.5 (a living penalty) for every step taken. Now the kid, being smart, chooses the shortest path. His answer looks something like this:
While it was easy for the kid to find the optimal path, the agent here has to trace its path back from the goal. The path highlighted in blue is the optimal path for the given state (the starting position). For every state, a plan of action comprising optimal paths such as Path 2 could be identified, given the reward-penalty system.
Cost of Path 1 = 4 * -0.5 + 1 = -1
Cost of Path 2 = 2 * -0.5 + 1 = 0
Cost of Path 3 = 4 * -0.5 + 1 = -1
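The same arithmetic as a quick check in Python, with the step counts and reward values taken from the example above:

```python
def path_cost(steps, living_penalty=-0.5, goal_reward=1.0):
    """Total return for a path: a penalty per step plus the goal reward."""
    return steps * living_penalty + goal_reward

print(path_cost(4))  # Path 1: -1.0
print(path_cost(2))  # Path 2:  0.0
print(path_cost(4))  # Path 3: -1.0
```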
In this scheme of things, the agent’s actions are deterministic: when the agent decides to go right, it takes a step to the right. Now you may wonder, how could the agent ever go down when it has decided to go right? Imagine a robot driving on a sandy road. While it decided to turn right by 20 degrees, it couldn’t overcome the resistance of the sand and went straight. There are other environmental conditions that impact an action in the same way. This is where actions become nondeterministic. Let’s hold on to this thought; we will come back to it in a few moments.
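One common way to model this nondeterminism (a sketch of the general idea, reusing the `ACTIONS` and `step` definitions from the grid sketch above; the slip probability is an assumption) is to let the intended action succeed only with some probability, and otherwise slip to a random action:

```python
import random

def noisy_step(state, intended_action, slip_prob=0.2):
    """With probability slip_prob the environment overrides the intended action,
    like the robot that tried to turn right but kept going straight in the sand."""
    if random.random() < slip_prob:
        actual_action = random.choice(list(ACTIONS.keys()))
    else:
        actual_action = intended_action
    return step(state, actual_action)
```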