
Single Agent Reinforcement Learning: Basic Concepts and Terminologies

Whether you are a student in Georgia Tech's OMSCS CS 7642 Reinforcement Learning course or a curious learner, this article aims to explain concepts and terminologies discussed in the reinforcement learning literature. It is inspired by the many teachers in the field who have come before, such as Sutton and Barto's seminal book, Reinforcement Learning: An Introduction (Second Edition).

Here are some of the topics that will be discussed:

  • Defining RL terms
  • Themes
  • Core Concepts, including MDPs, Returns, and TD Updates
  • Learning Updates: Monte-Carlo, N-Step, Lambda Returns, TD(\(\lambda\)), SARSA, Q-Learning

This article is part of a series of articles for the OMSCS CS 7642 Reinforcement Learning class.

OMSCS Reinforcement Learning Series:

  1. Georgia Tech Reinforcement Learning: Preparing for Success
  2. Single Agent Reinforcement Learning: Basic Concepts and Terminologies
  3. Turbocharging Advantage Actor-Critic with Proximal Policy Optimization
  4. Advantage Actor-Critic with Proximal Policy Optimization: A Journey Through Code
  5. Multi-Agent Reinforcement Learning Soft Introduction: Cooperation

Friendly Caution

Hello dear reader! I'm thrilled that I get to share my knowledge. However, I am still a student, journeying through the complex field of computer science. While I strive to present information that is accurate and insightful, there may be instances where I misinterpret or oversimplify concepts. Please approach this article with a curious and critical mind and consult the learning material sections for the most accurate and comprehensive information available.

Defining Terms

This section goes over the terminology used in reinforcement learning. Although there are many ways to convey the following terms, the article will try its best to use the terminology found in the Sutton and Barto Reinforcement Learning book. I will also try my best to use The Alberta Plan for AI Research for terminology descriptions.

Learning materials:

  • Chapter 3 in Reinforcement Learning (Sutton and Barto)
  • Chapter 2 in Grokking Deep Reinforcement Learning (Miguel Morales)

Environment

The environment is simply a term for the collection of things that make up the agent's world, a world in which the agent is attempting to learn how to complete a goal in a given amount of time.

An environment is a place that has rules on how an agent can act and how the world can react to the agent's actions. This dance between the agent and the world determines how an agent improves and learns over time.

Agent

The goal of reinforcement learning is to have some agent (actor) capable of learning and completing some goal without human intervention.

Although this is the perfect scenario for agent learning and goal completion, it is sometimes not the case. Human intervention may be necessary regardless; as Sutton states, it's best to start with no human intervention and introduce it only afterward. More of this kind of framework can be found here.

Goal

The goal is what the agent is trying to accomplish. Some terms that indirectly capture the goal include the objective function, reward function, loss function, etc.

Ultimately, the agent wishes to maximize or minimize some function to achieve the goal. In this case, optimize the rewards or reduce the loss. To keep things simple, for this article, the agent wishes to maximize the objective function, which will be the cumulative discounted rewards.

Lastly, the goal can also be referred to as the "true goal," which is used when environments may have subgoals or subtasks that are intermediary steps to the true goal. An environment may have subgoals because the problem space and the complexity of the true goal are so immense that to train an agent properly, a designer must add subgoals to get the agent to learn and achieve the true goal.

States

States are the possible configurations an agent can occupy in the environment. All states must be reachable by the agent: even if the agent finds it challenging to get to some states, it cannot be impossible for the agent to reach any state in the environment. A single state belonging to the entire state space of the environment is denoted as \(s \in S\).

Lastly, the state where the goal resides is called the goal or terminal state.

Actions

An agent's ability to interact with the environment is called the action, whereas all possible actions are called the action space. In the context of a video game, an action would be things such as moving, jumping, and shooting. A single action in the action space is denoted as \(a \in A\).

Rewards

The reward, or reward function, captures the idea that an agent may receive an immediate reward from the environment based on its transitions.

An agent can receive rewards from any state in the environment (such states can also be called reward states). In a small environment, however, it's usually expected that the agent is only rewarded for accomplishing the goal; that state can also be called the goal state, and reaching it usually terminates the episode.

Transitions

In reinforcement learning, the agent's action in its current state can lead to a probabilistic transition, where it ends up in a new state and receives a reward. The new state can be any state in the environment, including the state the agent was in before taking the action.

This idea of being in a state, moving to a new state, and receiving a reward is called a transition or transition function, which is denoted as:

$$ P(s_{t}, a_{t}) \rightarrow (s_{t+1}, r_{t+1})$$

where the transition results in the agent being in a new state and receiving a reward.

If transitions are 100% predictable regarding the new state and reward, then they are deterministic; otherwise, they are stochastic.
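
To make this concrete, here is a minimal sketch of a stochastic transition function in Python. The tiny chain environment, its states, and its probabilities are invented purely for illustration and are not tied to any particular library:

```python
import random

# (state, action) -> list of (probability, next_state, reward) outcomes.
# These states and numbers are made up for illustration only.
TRANSITIONS = {
    ("s0", "right"): [(0.9, "s1", 0.0), (0.1, "s0", 0.0)],  # stochastic
    ("s1", "left"):  [(0.9, "s0", 0.0), (0.1, "s1", 0.0)],  # stochastic
    ("s1", "right"): [(1.0, "s2", 1.0)],                    # deterministic
}

def step(state, action):
    """Sample (next_state, reward) from P(s', r | s, a)."""
    outcomes = TRANSITIONS[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

print(step("s0", "right"))  # usually ('s1', 0.0), occasionally ('s0', 0.0)
```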

Experiences

At every timestep, the agent "experiences" the world. This experience is a collection, or tuple, that expresses the following sentence: an agent in a given state takes an action—also known as a state-action pair—which leads the agent to a new state and results in a reward for that action.

This can be expressed as the following tuple:

$$ < s_t, a_t, r_{t+1}, s_{t+1}> $$

where \(t\) is the exact timestep the agent took the action.
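
In code, a single experience is often stored as a simple tuple whose fields mirror \( < s_t, a_t, r_{t+1}, s_{t+1}> \). A minimal sketch (the field names are one common convention, not a standard API):

```python
from collections import namedtuple

# Fields mirror the tuple <s_t, a_t, r_{t+1}, s_{t+1}> from the text.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

exp = Experience(state="s0", action="right", reward=0.0, next_state="s1")
print(exp.reward)   # 0.0
print(tuple(exp))   # ('s0', 'right', 0.0, 's1')
```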

Episodes

An agent requires time to learn how to achieve their goal. In many cases, the agent has to simulate interacting with the environment multiple times before it learns to accomplish its goal.

An agent usually has a starting state position and tries to reach the goal state, also termed the terminal state, to receive a reward.

An episode is a term for describing an agent's completion of a single simulation from the starting state to the terminal state.

The word trajectory describes all the states the agent has visited, from the start to the terminal state. In some books, the entire trajectory of the episode is denoted as \(\tau\) where \(\tau = (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, \text{ ... }, s_{T}, a_{T}, r_{T+1})\).

Action Selection Policy

Action selection policy describes how an agent chooses an action based on their state.

A simple action-selection policy called epsilon-greedy is where an agent takes a random action with probability \(\epsilon\), where \(\epsilon \in [0,1]\), and takes the best known action with probability \(1 - \epsilon\). A variant of epsilon-greedy, called epsilon-decay, is where the value of \(\epsilon\) decays over time, either through a linear or exponential decay factor.

Other action selection choices include: Thompson Sampling, Softmax, Mellowmax.
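
As a rough sketch of epsilon-greedy and epsilon-decay (the `q_values` dictionary and the decay schedule below are invented for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, otherwise the greedy action.

    q_values: dict mapping each action to its estimated action value.
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)        # exploit

# Epsilon-decay variant: shrink epsilon after every episode.
epsilon, decay, min_epsilon = 1.0, 0.995, 0.05
for episode in range(1000):
    action = epsilon_greedy({"left": 0.1, "right": 0.8}, epsilon)
    epsilon = max(min_epsilon, epsilon * decay)   # exponential decay
```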

Policy

A policy, or strategy, is the agent's "plan" for achieving the goal. Ultimately, reinforcement learning aims for the agent to achieve some goal and achieve it better than humans.

In essence, think of a policy as a function that maps a state to an action that usually maximizes the potential for achieving the true goal.

The policy, as written in the literature, is denoted as \(\pi\). A single policy \(\pi\) exists in the policy space where all possible policies exist, denoted as \(\pi_{all}\). The agent aims to find the best policy in the policy space, also called the optimal policy, denoted as \(\pi_*\). Multiple optimal policies may exist in the policy space; however, at least one optimal policy exists, denoted as \(\pi_* \in \pi_{all}\).

Themes

This section goes over some soft topics to think about when working on Reinforcement Learning problems.

Exploration and Exploitation Tradeoff

An agent needs to learn how to accomplish the true goal but doesn't know how to at the beginning of training. Essentially, the agent should explore the environment, trying all sorts of random transitions to see what happens. As the agent explores, it should learn pieces about the environment.

At some point, the agent will think it knows the environment and should eventually learn to take actions that seek the maximum reward it can get, which should lead to achieving the goal; this is called exploitation.

The tradeoff comes from the fact that when an agent explores the environment, it learns more and more about it. When an agent starts exploiting, it learns more about its transitions and, more importantly, its actions. The agent will discover which transitions are the best, but it does so based only on its current knowledge of the environment, which may or may not be the best knowledge.

For this reason, the exploration and exploitation dilemma is an active area of research in reinforcement learning.

One popular technique for "balancing" exploration and exploitation that produces high-performance results is the Epsilon Greedy Action Selection Strategy.

Credit Assignment Problem

The credit assignment problem occurs when the agent does not know which specific transition, or state-action pair, is ultimately responsible for achieving the goal.

For simpler problems, this may not be something to worry about. The real problem comes into play for much larger problems, such as the games of Chess or Go. In these types of games, many actions or moves have different levels of importance, and worst of all, they are sequence-dependent. This means that the importance of a move is not fixed, but changes depending on the sequence of moves that follow it. For instance, a seemingly insignificant move early in the game could turn out to be crucial later on.

Is the victory due to the very first or last move, or perhaps an earlier move, say the "God Tier" White 78 move in the game of Go?

No one knows for sure what move contributes to the overall achievement of the goal; however, reinforcement learning aims to address these problems with techniques such as Monte Carlo Methods, Eligibility Traces, and Temporal Difference Learning.

Core Concepts

This section goes over the core of Reinforcement Learning. All algorithms are derived from the topics of this section.

Markov Decision Process

The Markov Decision Process, or MDP, is at the heart of reinforcement learning.

An MDP is a mathematical framework used for modeling decision-making situations. In an MDP, outcomes can be random but are still under the control of a decision-maker. In this case, the decision-maker is the agent; the decision-making situation is simply the state of the agent.

Markov Decision Process Example:

mdp-example

The image has states in green, actions in orange where \(a \in \{left,right\}\), and rewards are shown as dotted red lines when a state-action pair has been taken. Notice that some transitions are stochastic while other transitions are deterministic. The possibilities are endless, and an MDP cannot be reasonably drawn in most complex cases. This is not a problem in reinforcement learning because an agent can still understand everything about the world, even when the MDP is unknown to us. As long as the environment is represented as an MDP, the agent will learn the rest.

At the start of training, an agent, like a blank slate, exists in a world. It is devoid of any knowledge about the world, and is unaware of the actions it may take in this world or the true goal it should strive for. As programmers, we possess this knowledge in advance, but the agent starts from scratch, knowing nothing.

Regardless, it is through interacting with the world that the agent learns more about it. Eventually, the agent will learn what the goal is, what actions to take when in certain parts of the world, and better yet, how to accomplish the true goal optimally.

This process is called the Agent-Environment Interaction Loop:

interaction-loop

So, to read this, an agent starts at a state, then takes an action, the environment returns a reward, and the agent finds itself in a new state. The agent can repeat the interaction loop until it has fully learned about the environment. The environment is just an MDP underneath it all, and the agent is trying to solve this MDP.

Lastly, let's consider a simple scenario to showcase the evolution of the MDP through the eyes of the agent. The agent will exist in a 2D grid world environment of size 2x2, meaning that this grid world has four states. The agent can select actions \(a \in \{up, down, left, right\}\). All actions and transitions are deterministic, meaning they happen 100% of the time. The agent will start at the top-left position, and the true goal is to reach the bottom-right position, which has a reward of +1.

Before any training occurs, the MDP is unknown in the eyes of the agent.

Unknown MDP:

unknown-mdp

This simplified MDP represents the environment space. The agent knows nothing in the beginning, and so this MDP represents all possible MDPs. The agent doesn't even know it can't move diagonally in the environment.

However, the agent will interact with the world and eventually learn the constraints imposed on it, whether from the environment or its actions, along with what it needs to do to achieve the goal.

Optimal MDP:

known-mdp

The green square represents the goal or terminal state, and the grey squares represent non-terminal states. In this illustration, the agent has fully learned the environment and understands what actions are needed to achieve the true goal, which is called the optimal policy.

However, this illustration is for us to see. In reality, the agent has built an understanding of the world through math; internally, it sees something different.

State-Value Function Representation Example:

internal-state-value

Each state is now represented by a value. This value represents how good it is to be in any given state and is called the value function, denoted as \(V(s)\). Notice how all states closer to the goal or terminal state have higher state values. Therefore, from any state in the environment, the agent should pick actions that lead to states with the highest possible value; the policy that does this is called the optimal policy.

However, the agent can also represent the world regarding how good an action is when being in any given state; this is called the action-value function, also known as the Q-function, denoted as \(Q(s,a)\).

Action-Value Function Representation Example:

internal-action-value

Regarding this 2D grid world problem, the action-value function can be represented by a 2D array, where each row corresponds to a state and each column corresponds to an action.

In the action-value function image, green represents actions that lead to higher values than the current state, red represents actions that lead to lower values than the current state, and orange represents actions that lead to states with the same value as the current state. One thing to note is that the \(s_3\) row is greyed out, with all values at 0. This is because \(s_3\) is the terminal state that ends the episode, so the agent cannot take any actions in that state.
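
For a tabular problem like this, the action-value function is often just a 2D array. The sketch below assumes the state numbering described above (top-left \(s_0\), goal at \(s_3\)) and fills in a couple of illustrative values by hand rather than learning them:

```python
import numpy as np

actions = ["up", "down", "left", "right"]
Q = np.zeros((4, len(actions)))      # one row per state s0..s3, one column per action
Q[1, actions.index("down")] = 1.0    # illustrative: moving down from s1 reaches the goal
Q[2, actions.index("right")] = 1.0   # illustrative: moving right from s2 reaches the goal
# The s3 row stays all zeros: it is the terminal state, so no actions are taken there.

greedy_action = actions[int(np.argmax(Q[2]))]
print(greedy_action)  # 'right'
```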

To better understand value function calculations, we must first understand Returns.

Learning Materials:

  • Chapter 3.1 - 3.2 in Reinforcement Learning (Sutton and Barto)

Returns

The agent only cares about maximizing the environment's rewards, which are usually only given when the true goal has been achieved. The sum of the rewards collected over the time steps of an episode is called the Return. In the end, the agent wishes to maximize the total Return it can achieve per episode.

To that end, we can calculate returns for every episode in two ways, either cumulative or discounted.

(Cumulative) Returns:

$$ G_{t:T} = R_{t+1} + R_{t+2} + R_{t+3} + \text{ ... } + R_{T+1} = \sum_{k=1}^{T-t+1} R_{t+k}$$

where \(R_{t+k}\) is the reward received at \(k\) steps in the future from time \(t\).

(Cumulative) Discounted Returns:

$$ G_{t:T} = \gamma^0R_{t+1} + \gamma^1R_{t+2} + \gamma^2R_{t+3} + \text{ ... } + \gamma^{T-t}R_{T+1} = \sum_{k=1}^{T-t+1} \gamma^{k-1}R_{t+k}$$

where \(R_{t+k}\) is the reward received at \(k\) steps in the future from time \(t\), \(\gamma \in [0,1]\) is the discount factor.

When \(\gamma = 1\), the calculation is no different from the Cumulative Return. When \(\gamma = 0\), the calculation only cares about the reward received immediately after an action is taken. The \(\gamma\) term therefore determines how important future rewards are to the current decision. Discounted Returns work even in infinite-horizon problems because of \(\gamma\) (as long as \(\gamma < 1\)).

Let's look at a single-chain example to illustrate why we would do calculations this way. An agent can take actions \(a \in \{left, right\}\), \(\gamma=1\), and the goal is to reach the right end of the chain. Let's say the action selection policy always selects right. The orange circle is an intermediary reward of +2 received upon entering the third state; the green square is the terminal state where the episode ends on entering and where an agent gets a +5 reward.

t=0:

t-0-robot-example

In this example, the agent's start position is represented as \(t=0\). From this position, the agent's potential Return based on its policy is +7.

t=3:

t-3-robot-example

In this example, we see that at timestep \(t=3\), the agent's potential Return based on its policy is +5.

T=t=4:

t-4-robot-example

This example is the last timestep for the episode where the agent's potential Return based on its policy is still +5.

Final Return Values:

return-robot-example

Now, we can see the potential Returns for each state in the environment. The third state, which contained the intermediary reward, is +5 because the reward was given upon entering the state, not simply being inside the state.
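
To sanity-check these numbers in code, here is a minimal sketch of the discounted Return calculation. The exact reward sequence [0, 2, 0, 5] is my assumption about the timing of the chain's rewards (+2 on entering the third state, +5 on entering the terminal state):

```python
def discounted_return(rewards, gamma=1.0):
    """G_t = sum over k of gamma^(k-1) * R_{t+k}, given future rewards R_{t+1}..R_{T+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Assumed rewards received after t=0: 0, +2 (entering the third state), 0, +5 (terminal).
print(discounted_return([0, 2, 0, 5], gamma=1.0))  # 7, matching the t=0 Return
print(discounted_return([5], gamma=1.0))           # 5, matching the Return near the end
print(discounted_return([0, 2, 0, 5], gamma=0.9))  # discounting shrinks distant rewards
```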

Learning Materials:

  • Chapter 3.3 in Reinforcement Learning (Sutton and Barto)

Bellman Equation

How can we use Returns to estimate both the state-value and action-value functions? Enter the Bellman Equation.

Since an environment is just an MDP, we can use the concept of the Bellman Equation to estimate the Returns that should be seen from each state or from each state-action pair.

State-Value Bellman Equation:

$$ v_\pi(s) = \mathbb{E}[G_t \mid S_t=s] = \mathbb{E}[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1} \mid S_t=s] \text{ , for all } s \in S $$

This math says that the state-value function is the expected discounted Return from a state, which can then be expanded to produce the final State-Value Bellman Equation.

Final State-Value Bellman Equation:

$$ v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \sum_{s',r}p(s',r \mid s,a)[r + \gamma v_\pi(s')] \text{ , for all } s \in S $$

where \(a \in A\) is an action, \(s\) is the current state, \(s'\) is the next state, \(r\) is the reward for taking action \(a\) in state \(s\), and \(\gamma\) is the discount rate.

Let's break this down.

Remember that \( p(s',r \mid s,a)\) represents the transition probability. It's the probability that the agent experiences a specific next state and reward given the action taken in the current state.

\(\pi(a \mid s)\) represents the probability the agent takes an action in the current state, according to the policy.

\([r + \gamma v_\pi(s')]\), called the update target, is crucial, as it indicates the direction of our updates; \(v_\pi(s)\) is the average of this target over all actions and transition scenarios. If the next state is terminal, the update moves towards the reward alone, since \(v_\pi(s_{terminal}) = 0\).

To further illustrate this, let's go through an example using the 2x2 room MDP from earlier. All state-value functions are initialized to zero; we are computing \(v(s)\) for the bottom-left state. The policy for that specific state is to go 'right' 80% of the time and to select a random action 20% of the time. This works out to 'right' 85% of the time and every other action 5% of the time. The transitions are deterministic, meaning the transition probability is 1. The reward is +1, the \(v(s)\) of terminal states is 0, and \(\gamma = .99\).

$$\begin{aligned} \text{right} &= \pi(a_{right} \mid s) \times \sum_{s',r}(p(s',r \mid s,a) \times [r + \gamma \times v_\pi(s')]) \\ \text{right} &= .85 \times (1. \times [1. + .99 \times 0.]) = .85 \\ \end{aligned}$$

In this example we focus only on the right action calculation. We still have to calculate all other actions. Luckily, the calculation is the same for 'up,' 'down,' and 'left' actions, which is 0.

$$ \text{{up, down, left}} = .05 \times (1. \times [0. + .99 \times 0.]) = 0.$$

In the end our \(v_\pi(s)\) calculation is:

$$ v_\pi(s) = \text{left + right + up + down} = 0.85 $$

Also, remember that the calculations above used deterministic transitions. In a stochastic environment, actions don't always go according to plan; an action can lead the agent to a random new state. For example, going right might move the agent to the intended state only 50% of the time and to a random state the other 50%. In that case, you'd have to calculate the inner sum over all possible next state and reward pairs.
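
Here is a tiny sketch that reproduces the \(v_\pi(s)\) calculation above for the bottom-left state (deterministic transitions, \(\gamma = 0.99\), all next-state values still 0):

```python
gamma = 0.99
policy = {"right": 0.85, "left": 0.05, "up": 0.05, "down": 0.05}

# For each action: (transition probability, reward, value of the next state).
# Only 'right' reaches the +1 goal; all next-state values are still 0.
transitions = {
    "right": (1.0, 1.0, 0.0),
    "left":  (1.0, 0.0, 0.0),
    "up":    (1.0, 0.0, 0.0),
    "down":  (1.0, 0.0, 0.0),
}

v = sum(policy[a] * p * (r + gamma * v_next)
        for a, (p, r, v_next) in transitions.items())
print(v)  # 0.85
```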

The most important thing to understand about this equation is that the state-value function can capture everything about the environment at any particular state and determines how good it is to be in that state. This holds true as long as all states are visited infinitely often.

The objective of the agent is to maximize this equation. Many algorithms in reinforcement learning achieve this by deriving strategies from the Bellman Equation.

$$ v_*(s) = \max_\pi v_\pi(s) \text{ , for all } s \in S$$

Let's look at the Bellman Equation for the action-value function.

Action-Value Bellman Equation:

$$ q_\pi(s,a) = \mathbb{E}[G_t \mid S_t=s, A_t = a] = \mathbb{E}[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1} \mid S_t=s, A_t = a] $$

where \(a \in A\) is an action, \(s\) is the current state, \(s'\) is the next state, \(r\) is the reward for taking action \(a\) in state \(s\), and \(\gamma\) is the discount rate.

This can be derived further to get the final action-value Bellman Equation.

Final Action-Value Bellman Equation:

$$ q_\pi(s,a) = \sum_{s',r}p(s',r \mid s,a)[r + \gamma v_\pi(s')] $$

where \(a \in A\) is an action, \(s\) is the current state, \(s'\) is the next state, \(r\) is the reward for taking action \(a\) in state \(s\), and \(\gamma\) is the discount rate.

\([r + \gamma v_\pi(s')]\) is the direction \(q_\pi(s,a)\) updates towards for all possible transition scenarios. If the next state is terminal, the update moves towards the reward alone, since \(v_\pi(s_{terminal}) = 0\).

Let's do the same example as previously.

$$\begin{aligned} q_\pi(s,a_{right}) &= \sum_{s',r}(p(s',r \mid s,a) \times [r + \gamma \times v_\pi(s')]) \\ \text{right} &= 1. \times [1. + .99 \times 0.] = 1. \\ \text{{left, up, down}} &= 1. \times [0. + .99 \times 0.] = 0. \\ \end{aligned}$$

The algorithm lets us know that at the time of calculation, a right action will lead to a reward of 1.

To calculate \(v_\pi(s)\), just average all the action-value functions of that particular state, weighted by the probabilities the agent's policy assigns to each action:

$$\begin{aligned} v_\pi(s) &= \sum_{a \in A} \pi(a \mid s)Q(s,a) \\ &= (.85 \times 1) + (.05 \times 0) + (.05 \times 0) + (.05 \times 0) \\ &= .85 \end{aligned}$$

where \(a\) is the actions belonging to a set of actions \(a \in A\), \(s\) is the state, \(\pi(a \mid s)\) is the probability of taking action given a policy, and \(Q(s,a)\) is the action-value function for the state-action pair.

Lastly, we can define the optimal policy by selecting the best state-action pair in any given state.

$$ q_*(s,a) = \max_\pi q_\pi(s,a)$$

The main issue with these calculations is that we need to know everything about the MDP, which is impossible for most complex RL problems.

Learning Materials:

  • Chapter 3.5 in Reinforcement Learning (Sutton and Barto)
  • Chapter 4.2 in Reinforcement Learning (Sutton and Barto)

Bootstrapping

Bootstrapping allows algorithms to update the state-value or action-value function based on only another state's state-value function. This means we can make updates without knowing anything about the environment's transition probabilities or action policy probabilities.

You can recognize an algorithm or mathematical expression as bootstrapping by checking whether the updates are based on the state-value function \(V(s_{t+n})\) or action-value function \(Q(s_{t+n}, a_{t+n})\) of other states, where \(n \ge 1\) is the number of steps ahead of the state you are updating.

Usually, the state-value function and the action-value function are stand-ins for the Returns you should expect to see from that state. This "guesstimate" is the reason some algorithms are able to update immediately instead of waiting for an episode to terminate.

However, there is a caveat. Since bootstrapping essentially updates on a guess, if the guess is good, the updates are good; if the guess is bad, the updates are bad. This is called bias. Bias can still be worth it when the tradeoff is faster convergence and lower memory use, since we don't have to save full episode trajectories. In other words, we increase bias but reduce variance.

TD(0) Updates

Temporal Difference (TD) Learning is a central theme of reinforcement learning and is a subset of bootstrapping. It's the core idea behind popular algorithms such as Q-Learning and SARSA.

Temporal Difference updates fall into three categories: TD(0), TD(1), and TD(\(\lambda\)). TD(0) means we make updates based on the next timestep, as in Q-Learning and SARSA. TD(1) means we make updates after an episode has finished, as in Monte-Carlo Updates. TD(\(\lambda\)) is a general case where updates can range from one-step to complete-episode updates.

This section looks at TD(0) updates and breaks down the formulas.

TD(0) Update:

$$ V(s_t) = V(s_t) + \alpha[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)] $$

where \(\alpha \in [0,1]\) is the learning rate, and \( \gamma \in [0,1] \) is the discount factor. The alpha value \(\alpha\) determines how large or small of an update takes place.

The Temporal Difference Update has a TD error term \(\delta\) which lets us know how far off from the true value the current state is.

TD Error:

$$ \delta = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) $$

where \( \gamma \) is the discount factor, and \(\delta\) is the TD Error term.

The TD Error also contains something called the TD target, where the TD target is the value we wish to update our state-value function towards.

TD target: $$ TD_{target} = r_{t+1} + \gamma V(s_{t+1}) $$

where the TD target is the "estimated guess" the agent is updating towards and \( \gamma \) is the discount factor. Note that the TD Target can also be called the One-Step Return.
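
A minimal tabular TD(0) sketch, assuming the value table is a plain dictionary and the experience came from one step of interaction (the state names and numbers are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one TD(0) update to V[s] from a single observed transition."""
    target = r + (0.0 if terminal else gamma * V[s_next])  # TD target
    delta = target - V[s]                                  # TD error
    V[s] += alpha * delta
    return delta

V = {"s0": 0.0, "s1": 0.0}
td0_update(V, "s0", r=1.0, s_next="s1")
print(V["s0"])  # 0.1
```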

Learning Materials:

  • Chapter 6 in Reinforcement Learning (Sutton and Barto)

Learning Updates

This section covers some algorithms used for solving MDPs online. An online algorithm requires the agent to interact with the world to generate its updates; the specific algorithm determines whether an update happens immediately, at the end of the episode, or anywhere in between.

Monte-Carlo Updates

The simplest form of updating states is called the Monte-Carlo Update, and it is simply the complete discounted returns from the starting state to the terminal state, also called the complete trajectory \(\tau\).

Monte-Carlo Update:

$$ G_{t:T} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \text{ ... } + \gamma^{T-t}R_{T+1} = \sum_{k=1}^{T-t+1} \gamma^{k-1}R_{t+k} $$

where \(R_{t+k}\) is the reward received at \(k\) steps in the future from time \(t\), \(\gamma \in [0,1]\) is the discount factor.

Monte-Carlo Updates require tracking all states visited in the episode, making this a memory-intensive algorithm. Despite this, the \(G\) Returns are unbiased, as updates are based on Returns actually seen during an episode. However, updates are high-variance, since Returns rely on episode trajectories, and trajectories can differ from episode to episode.

Another negative is that updates are done at the end of the episode, meaning that the agent cannot update its policy during episode runs.

Lastly, the Monte-Carlo Update is not a bootstrapping algorithm.
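
A minimal every-visit Monte-Carlo sketch: walk the finished episode backwards, accumulate the discounted Return, and nudge each visited state towards it (the episode data below is made up for illustration):

```python
def monte_carlo_update(V, episode, alpha=0.1, gamma=0.99):
    """episode: list of (state, reward) pairs, where reward is the one received
    after leaving that state. Updates V in place once the episode has ended."""
    G = 0.0
    for state, reward in reversed(episode):   # accumulate returns backwards
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V

V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]   # +1 on reaching the goal
print(monte_carlo_update(V, episode))
```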

Learning Materials:

  • Chapter 5 in Reinforcement Learning (Sutton and Barto)
  • Chapter 7.1 in Reinforcement Learning (Sutton and Barto)

N-Step Returns

N-Step Returns is simply a variant of the Monte-Carlo updates in which an agent improves its policies by updating based on N-Steps in the environment. Bootstrapping makes this possible.

The N-Step Return can range anywhere from one time step to infinite timesteps.

One-Step Return:

$$ G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$$

where \(\gamma \in [0,1]\) is the discount rate, and \(R\) is the reward. In the case of the One-Step Return, \(n=1\).

There is a pattern you should take note of when it comes to N-Step Returns, and that is that instead of waiting until reaching the end of an episode to perform updates, we can instead update the current state based on the value function of the state that our agent has transitioned to N-Steps ahead.

This can be seen better in the Two-Step Return.

Two-Step Return:

$$ G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$$

where \(\gamma \in [0,1]\) is the discount rate, and \(R\) is the reward. In the case of the Two-Step Return, \(n=2\).

So again, a pattern emerges: as more time steps are considered in the calculation, the more we discount future rewards. Depending on the value of \(\gamma\), future rewards are either very important or not necessary at all.

Let us now look at the state-value function estimate portion of the equation.

In the case of the Two-Step Return, \(V_{t+1}(S_{t+2})\) is used to estimate the following:

$$ \gamma^2 V_{t+1}(S_{t+2}) \approx \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \gamma^4 R_{t+5} + \text{ ... } + \gamma^{T-t} R_{T+1}$$

where \(\gamma^2 V_{t+1}(S_{t+2})\) is just an estimate of the rewards we believe we will see when we reach the terminal state.

This can be further simplified from the Sutton and Barto book to see what is really going on here:

$$ V_{t+1}(S_{t+2}) \approx R_{t+3} + \gamma R_{t+4} + \gamma^2 R_{t+5} + \text{ ... } + \gamma^{T-t-2} R_{T+1}$$

where \(V_{t+1}(S_{t+2})\) is shown to be an estimate of the rewards we believe we will see when we reach the terminal state from that state.

Lastly, this can then be boiled down to the N-Step Return, where we can update any trajectory length of \(N\) size.

N-Step Return:

$$ G_{t:t+n} = \gamma^0 R_{t+1} + \gamma^1 R_{t+2} + \gamma^2 R_{t+3} + \text{ ... } + \gamma^{n-1}R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}) $$

Also, note that the state we update is \(S_t\), the state visited \(n\) steps before the last step of the N-Step trajectory.

This update can be represented by the following equation:

$$ V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)] \text{ , } 0 \leq t \leq T $$

The reason for updating towards a state-value function is that, unlike Monte-Carlo Updates, we don't know what the reward of the true goal is since we are only updating every N-Step. Therefore, we use an estimate of the \(G\) Returns. This presents a problem; since we are basing our updates on an estimate, this introduces some bias but simultaneously reduces variance depending on the value of \(n\).

The N-Step Return algorithm is considered a bootstrapping algorithm because of the use of the state-value function in the update equation. The TD Target is the term \(G_{t:t+n}\), and the TD Error comes in the form of \(G_{t:t+n} - V_{t+n-1}(S_t)\).
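
A small sketch of the n-step return and its update, assuming `rewards` holds \(R_{t+1} \ldots R_{t+n}\) and `v_bootstrap` is \(V(S_{t+n})\) (0 if that state is terminal); the numbers are illustrative:

```python
def n_step_return(rewards, v_bootstrap, gamma=0.99):
    """G_{t:t+n} = sum of discounted rewards plus a bootstrapped tail value."""
    G = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return G + (gamma ** len(rewards)) * v_bootstrap

# Two-step example: R_{t+1}=0, R_{t+2}=1, V(S_{t+2})=0.5.
G = n_step_return([0.0, 1.0], v_bootstrap=0.5, gamma=0.9)
print(G)  # 0.9 + 0.81 * 0.5 = 1.305

# The corresponding update to V(S_t):
V_st, alpha = 0.0, 0.1
V_st += alpha * (G - V_st)
print(V_st)  # 0.1305
```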

Lastly, the Lambda Return, covered next, is a powerful learning target and has been shown to perform better than TD(\(\lambda\)), albeit much slower and more complex.

Learning Materials:

  • Chapter 7.0 - 7.2 in Reinforcement Learning (Sutton and Barto)

Lambda Returns

The Lambda Return update combines both TD(0) and Monte-Carlo methods by using the parameter \(\lambda\) that controls the balance between both types of updates.

Lambda Return:

$$ G_t^\lambda = (1- \lambda) \sum_{n=1}^{T-t-1}(\lambda^{n-1}G_{t:t+n}) + \lambda^{T-t-1} G_t$$

where \(G_{t:t+n}\) is the n-step return value starting at time \(t\) that includes the rewards up to \(n\) steps in the future, and \(\lambda \in [0,1]\) is a parameter that controls the balance between the One-Step Return when \(\lambda = 0\) and the Monte-Carlo Return when \(\lambda = 1\).

\(G_{t:t+n}\) represents the return you receive from the N-Step Returns algorithm.

Lambda Return is considered a bootstrapping algorithm because it uses an estimate of the state N-Steps ahead. It is also considered a forward-view update because we update by looking forward to future rewards and states. Despite its complexity, it is one of the best performing TD algorithms.
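
A sketch of the episodic Lambda Return built out of n-step returns. Here `rewards` holds \(R_{t+1} \ldots R_{T+1}\) and `values` holds the bootstrap values \(V(S_{t+1}) \ldots V(S_{T-1})\); both lists are invented for the example:

```python
def lambda_return(rewards, values, gamma=0.99, lam=0.9):
    """Episodic lambda-return: weighted mix of n-step returns plus the MC tail."""
    steps = len(rewards)                              # T - t

    def n_step(n):                                    # G_{t:t+n}
        head = sum((gamma ** k) * rewards[k] for k in range(n))
        if n < steps:
            return head + (gamma ** n) * values[n - 1]
        return head                                   # n = T - t: the full MC return

    weighted = (1 - lam) * sum((lam ** (n - 1)) * n_step(n) for n in range(1, steps))
    return weighted + (lam ** (steps - 1)) * n_step(steps)

print(lambda_return([0.0, 0.0, 1.0], [0.2, 0.5], gamma=1.0, lam=0.5))  # 0.475
```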

Learning Materials:

  • Chapter 12.1 in Reinforcement Learning (Sutton and Barto)

TD(\(\lambda\))

TD(\(\lambda\)) is an evolution of the Lambda Return algorithm. It introduces three concepts: the weight vector, a long-term array that accumulates over the lifetime of training; the eligibility trace, a short-term array whose values decay on every timestep; and updates based on the temporal difference (TD) error between the value estimates, with respect to the weights, of the current state and the next state.

The eligibility trace is a vector denoted as \(z\), with one entry for every state in the environment. All values start at 0, and when the agent visits a specific state, that state's entry in the eligibility trace vector is increased by some value. After every timestep, all values in the eligibility trace \(z\) are decayed exponentially, using a decay rate denoted as \(\lambda\), and the entry for the most recently visited state is then bumped up.

The weight vector is denoted by the term \(w\) and is a vector of the same size and shape as the eligibility trace vector \(z\). The weights represent a generalization of the space, learned values that we can use to determine the value of a state. The statement "function approximator with respect to the weights" means that the function approximator's outputs depend on correctly adjusting the weight vector.

In essence, we pass states, or features, into some prediction function \(v(s,w)\) with respect to the weights; we then get an output that tells what the approximate state-value function is for that given state.

However, if the prediction function \(v(s,w)\) is a linear function approximator with respect to the weights, we can then say that \(\nabla v(S_t,w_t)\) or the gradient of the approximate value function with respect to the weights is just the feature vector:

$$ \nabla v(s,w) = x(s)$$

where \(x(s) = (x_1(s), x_2(s), x_3(s), \ldots, x_n(s))^\top\) is the feature vector, \(w\) is the weight vector, and \(s\) is the state.

The feature vector, denoted as \(x (s)\), is a versatile concept in reinforcement learning. Imagine the feature vector as your states. In a continuous problem, like CartPole, the feature vector could be represented as: [cart position, cart velocity, pole angle, pole angular velocity]. In a discrete problem, such as a 2x2 grid world, the feature vector might be a one-hot encoding of the environment where a value of 1 signifies the agent's location: [0,0,1,0].

Lastly, the weight vector \(w\) is initialized according to some initialization strategy, depending on the problem, at the start of training. In contrast, the eligibility trace vector \(z\) is initialized to all zeros at the beginning of every episode.

Updates depend on the algorithm, but they all follow the same flow.

  1. Set Eligibility Trace Vector \(z\) to all zeros; set Weight Vector \(w\) to (random/zeros/etc) values
  2. Update Eligibility Trace Vector \(z\) by the "fade away" term \(\gamma\)\(\lambda\) and increase by arbitrary value(s) for a given state
  3. Select and Take Action, Observe Rewards and Next State
  4. Calculate \(\delta\) using Reward, Current State, and Next State
  5. Update Weight Vector \(w\) by \(z\), \(\delta\), and \(\alpha\)
  6. If the terminal state is reached, reset \(z\) to all zeros and restart the episode
  7. Repeat 2-6 Until Satisfied

This is the core of all TD(\(\lambda\)) updates.

What makes the TD(\(\lambda\)) update unique is that we only update based on the gradient. In the case of the Reinforcement Learning book by Sutton and Barto, the TD Lambda's gradient is represented by a linear approximation function, which can be defined as a vector or array for simple tabular problems.

Semi-Gradient TD(\(\lambda\)) example:

$$\begin{aligned} & \text{Loop for each step of episode} \\ & \quad \text{Choose } A \sim \pi (\cdot \mid S) \\ & \quad \text{Take action } A \text{, observe } R,S' \\ & \quad z \leftarrow \gamma \lambda z + \nabla v(S,w) \\ & \quad \delta \leftarrow R + \gamma v(S',w) - v(S,w) \\ & \quad w \leftarrow w + \alpha \delta z \\ & \quad S \leftarrow S' \\ & \text{until } S' \text{ is terminal} \end{aligned}$$

Eligibility Trace Vector \(z\) Update:

$$ z = \gamma \lambda z + \nabla v(S_t,w_t)$$

where we "fade away" all values in the eligibility trace vector \(z\) by the term \(\lambda \gamma\), and where \(\nabla v(S_t,w_t)\) in a linear function approximator with respect to weights is just the feature vector. In a one-hot encoding scenario such as the 2x2 grid world, we first "fade away" all values by \(\lambda \gamma\) and then update the agent's state position in the eligibility trace by one.

TD(\(\lambda\))'s TD Error:

$$ \delta_t = R_{t+1} + \gamma v(S_{t+1}, w_t) - v(S_t, w_t)$$

where \(\delta\) is the Temporal Difference (TD) error term, \(\gamma\) is the discount factor, and \(w\) is the weight vector.

Imagine we have an environment with five states. The weight vector is \(w = [1,0,5,0,10]\), which is a generalization of the space. Let's also say the feature vector is \(x(s) = [0,0,0,0,1]^\top\). If \(v(S_t, w_t)\) is a linear function approximator with respect to the weights (linear regression), we can simply find the dot product between the weight vector \(w\) and the feature vector \(x(s)\):

$$ v(S, w) = w \cdot x(s) = w^\top x(s) = \sum_{i=1}^d w_ix_i(s) = 10$$

where \(w\) is the weight vector, \(x(s)\) is the feature vector representing state \(s\), \(w^\top\) is the transpose of the weight vector, \(d\) is the dimension of the vector, and where the final result will be a scalar value.

The same calculation would then be done on \(v(S_{t+1}, w_t)\).

Once \(\delta\) has been computed, we can update the weight vector \(w\) using \(\delta\), the eligibility trace \(z\), and a step size called the learning rate \(\alpha\).

Weight Vector \(w\) Update:

$$ w = w + \alpha \delta z$$

where \(\alpha\) is the learning rate in a range \(\alpha \in [0,1]\), \(\delta\) is the TD Error term, and \(z\) is the eligibility trace.

Essentially, we would like to update all values in the weight vector by the "faded away" vector of our features and the TD update value. How significant of an update the weight vector experiences depends on the value of the learning rate \(\alpha\).
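
Putting the pieces together, here is a minimal sketch of one semi-gradient TD(\(\lambda\)) step with linear function approximation and one-hot features, following the pseudocode above. Choosing the action and observing the reward and next state are assumed to happen elsewhere; the state indices and numbers are illustrative:

```python
import numpy as np

n_states = 5
alpha, gamma, lam = 0.1, 0.99, 0.9

w = np.zeros(n_states)   # weight vector (long-term, persists across episodes)
z = np.zeros(n_states)   # eligibility trace (reset to zeros every episode)

def features(state_index):
    """One-hot feature vector x(s) for a tabular problem."""
    x = np.zeros(n_states)
    x[state_index] = 1.0
    return x

def td_lambda_step(s, r, s_next, terminal):
    global w, z
    x, x_next = features(s), features(s_next)
    z = gamma * lam * z + x                          # fade all traces, bump current state
    v, v_next = w @ x, (0.0 if terminal else w @ x_next)
    delta = r + gamma * v_next - v                   # TD error
    w = w + alpha * delta * z                        # update every traced weight

td_lambda_step(s=0, r=1.0, s_next=1, terminal=False)
print(w)  # only w[0] has moved: [0.1, 0., 0., 0., 0.]
```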

TD(\(\lambda\)) is considered a backward-view update, as the algorithm uses the eligibility trace along with the weights and feature vectors to determine the value function of any state. This is very powerful, as we can approximate the value of states we have never visited. However, there are drawbacks; the main drawback is that updates are now done through approximation, which comes with all the problems of the specific function approximator being used. A secondary drawback is the "fade away" term \(\gamma \lambda\), which needs to be fine-tuned depending on the problem the agent is trying to solve.

This is no different than the "No Free Lunch" theorem in mathematics; there are always pros and cons with the choices of algorithms. In this case, you have the pros and cons of TD(\(\lambda)\) and the function approximator.

Some types of function approximators are listed here:

  • Linear Function Approximators
    • Linear Regression
    • Polynomial Regression
    • Fourier Basis
    • Tile Coding
    • Radial Basis Function (RBF)
  • NonLinear Function Approximators
    • Neural Networks
    • Decision Trees
    • Support Vector Machines (SVM)
  • Local Approximators
    • k-Nearest Neighbor

In theory, TD(\(\lambda\)) can be used with any function approximator. Still, in practice, it may be harder to implement some than others, which is why the Reinforcement Learning book by Sutton and Barto, along with Sutton's 1988 Paper, suggests the use of Linear Function Approximators.

Learning Materials:

  • Chapter 12.2 in Reinforcement Learning (Sutton and Barto)

SARSA

SARSA is a unique algorithm in that it was one of the first shown to converge to \(\pi_*\) without needing to save transitions or experiences. SARSA uses an action-selection strategy as part of its update, making it an on-policy algorithm, and it can make updates on every timestep, making it an online algorithm.

SARSA Algorithm:

$$\begin{aligned} & \text{Hyperparameters: } \alpha \in (0,1] \text{, } \epsilon > 0 \\ & \text{Initialize } Q(s,a) \text{, for all } s \in S^+, a \in A, Q(terminal, \cdot) = 0 \\ & \\ & \text{Loop for each episode:} \\ & \quad \text{Initialize } S \\ & \quad \text{Choose } A \text{ from } S \text{ using policy derived from } Q \text{ (e.g. } \epsilon \text{-greedy)} \\ & \quad \text{Loop for each step in episode:} \\ & \quad \quad \text{Take action } A \text{, observe } R,S' \\ & \quad \quad \text{Choose } A' \text{ from } S' \text{ using policy derived from } Q \text{ (e.g. } \epsilon \text{-greedy)} \\ & \quad \quad Q(s_t,a_t) = Q(s_t,a_t) + \alpha[r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) ] \\ & \quad \quad S \leftarrow S'; A \leftarrow A'; \\ & \quad \text{until } S \text{ is terminal} \end{aligned}$$

The TD updates are based on the next timestep's action chosen by the policy. Regardless of the action-selection choice, SARSA has been shown to converge to \(\pi_*\) as long as all state-action pairs are visited an infinite number of times (and the policy becomes greedy in the limit).

While SARSA may tend to make 'conservative' actions due to its policy-based updates, it's remarkably adaptable. Depending on the action-selection strategy, the policy can be tailored to lean towards non-greedy, such as the case with epsilon-decay. This adaptability underscores the flexibility and versatility of SARSA.
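
A compact tabular SARSA sketch that mirrors the pseudocode above. The `env` object (with `reset()`, `step(action)` returning `(next_state, reward, done)`, and an `actions` list) is a hypothetical, Gym-like stand-in rather than a real library interface:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                        # Q[(state, action)]

    def choose(state):                            # epsilon-greedy behavior policy
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = choose(s_next)               # the action actually taken next
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # on-policy update
            s, a = s_next, a_next
    return Q
```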

Learning Materials:

  • Chapter 6.4 in Reinforcement Learning (Sutton and Barto)

Q-Learning

Understanding Q-Learning will give you the tools to tackle algorithms like Deep Q-Networks and its derivatives. Q-Learning is a model-free algorithm: it does not require knowing anything about the environment's model (its transition probabilities). The best part is that its update does not depend on the action-selection policy, making Q-Learning an off-policy algorithm.

Q-Learning Algorithm:

$$\begin{aligned} & \text{Hyperparameters: } \alpha \in (0,1] \text{, } \epsilon > 0 \\ & \text{Initialize } Q(s,a) \text{, for all } s \in S^+, a \in A, Q(terminal, \cdot) = 0 \\ & \\ & \text{Loop for each episode:} \\ & \quad \text{Initialize } S \\ & \quad \text{Loop for each step in episode:} \\ & \quad \quad \text{Choose } A \text{ from } S \text{ using policy derived from } Q \text{ (e.g. } \epsilon \text{-greedy)} \\ & \quad \quad \text{Take action } A \text{, observe } R,S' \\ & \quad \quad Q(s_t,a_t) = Q(s_t,a_t) + \alpha[r_{t+1} + \gamma \max_{a}Q(s_{t+1},a) - Q(s_t,a_t) ] \\ & \quad \quad S \leftarrow S'\\ & \quad \text{until } S \text{ is terminal} \end{aligned}$$

Q-Learning is powerful because it is not beholden to updating based on any particular behavior policy, nor does it need to save trajectories for updates as N-Step Returns or Monte-Carlo Updates do. Q-Learning has been shown to converge to \(\pi_*\) as long as all state-action pairs are visited an infinite number of times.
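
For comparison, a compact tabular Q-Learning sketch mirroring the pseudocode above, using the same hypothetical Gym-like `env` assumed in the SARSA sketch:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)

    def choose(state):                            # epsilon-greedy behavior policy
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = choose(s)
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, a2)] for a2 in env.actions)
            target = r if done else r + gamma * best_next   # off-policy (greedy) target
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```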

One behavior of Q-Learning is that its update always uses the maximum action-value of the next state. This makes the update aggressive, since it always moves towards the greediest action; this is simply an observation about its character, not a formal property of the algorithm.

Although the Epsilon Greedy Action-Selection Strategy is a popular choice for Q-Learning, consider using Thompson Sampling. In small environments where actions are discrete, Thompson Sampling has been shown to converge as long as initial exploration is not "funky" or extremely unlucky.

Learning Materials:

  • Chapter 6.5 in Reinforcement Learning (Sutton and Barto)

Conclusion

In the next article, I'll take a look at the Advantage Actor-Critic with Proximal Policy Optimization algorithm.

OMSCS Reinforcement Learning Series:

  1. Georgia Tech Reinforcement Learning: Preparing for Success
  2. Single Agent Reinforcement Learning: Basic Concepts and Terminologies
  3. Turbocharging Advantage Actor-Critic with Proximal Policy Optimization
  4. Advantage Actor-Critic with Proximal Policy Optimization: A Journey Through Code
  5. Multi-Agent Reinforcement Learning Soft Introduction: Cooperation
