Temporal Difference Learning: A Deep Dive into Prediction and Control

Temporal Difference (TD) learning stands as a pivotal concept in reinforcement learning and artificial intelligence. Unlike supervised learning methods that rely on explicit target outputs, TD learning lets an agent learn predictions about future values from the sequence of states it experiences. This article delves into the principles of TD learning and its powerful applications, particularly in training intelligent agents to interact effectively with their environments.

Understanding the Essence of Temporal Difference Learning

Imagine predicting Saturday’s weather at the start of the week. A straightforward, supervised approach would be to record Monday’s conditions and pair them with Saturday’s actual weather. Repeat this daily, building a dataset of weekday weather paired with the subsequent Saturday outcome. As discussed in earlier chapters on supervised learning, this data could train a model to predict Saturday’s weather. However, this method has a significant drawback: it requires waiting until Saturday to observe the actual weather before any learning can occur. This is inefficient, especially when we consider the valuable information available during the week leading up to Saturday. For instance, persistent rain on Thursday and Friday strongly suggests a higher likelihood of rain on Saturday.

Temporal Difference Learning offers a more nuanced, unsupervised approach. It leverages intermediate information, not just the final outcome, to refine predictions. The core idea is that predictions about future values are not confirmed all at once but incrementally, step-by-step, as new observations emerge. This is particularly relevant in multi-step prediction problems where waiting for the final outcome to learn is inefficient and overlooks valuable intermediate signals.

While various unsupervised prediction techniques exist, TD learning stands out. In supervised learning, we minimize prediction errors against pre-defined target values. In neural networks, this translates to adjusting connection weights to reduce the difference between network output and the target. However, in multi-step prediction scenarios, the target value (like Saturday’s weather) is only available at the sequence’s end. TD learning cleverly circumvents this by using the difference in predictions between successive time steps to drive learning.

The traditional supervised learning update rule is

Δwt = α (z – V(st)) ∇wV(st)

where ‘z’ is the target value, ‘V(st)’ is the prediction for state ‘st’, ‘α’ is the learning rate, ‘wt’ are the adjustable parameters, and ‘∇wV(st)’ is the gradient of the prediction with respect to those parameters. However, ‘z’ is unknown until the sequence ends, hindering incremental updates.

TD learning’s innovation lies in expressing the error, (z – Vt), as a sum of prediction differences across successive time steps:

z – Vt = Σk=t..m (Vk+1 – Vk)

Here, ‘Vt’ is shorthand for ‘V(st)’, and ‘Vm+1’ is defined as ‘z’, the terminal outcome. The formula’s validity is easily confirmed by expanding the sum: the intermediate terms cancel (the sum telescopes), leaving z – Vt.

Substituting this error representation into the sequence update formula and rearranging yields the TD learning update rule for a single time step:

Δwt = α (Vt+1 – Vt) Σk=1..t ∇wVk

This rule is remarkably insightful. It uses two consecutive predictions, ‘Vt+1’ and ‘Vt’, to form the error term. Learning for the prediction ‘Vt’ (updating ‘wt’) occurs retroactively, even after the agent has moved on to state ‘st+1’. This is achieved by maintaining a running sum of the gradients ‘∇wVk’. When the error arises at time ‘t+1’, it is multiplied by this summed gradient, adjusting the weights responsible for past predictions. Essentially, TD learning updates prior predictions to align them with the subsequent prediction ‘Vt+1’, hence the name “temporal difference.”

This single-step update rule achieves the same cumulative weight changes over a sequence as supervised learning but allows for incremental updates: only the successive predictions and the accumulated gradient sum are needed. Sutton (1988) terms this TD(1). He further generalizes it to TD(λ), which applies an exponential weighting, λ, to past gradients:

Δwt = α (Vt+1 – Vt) Σk=1..t λ^(t–k) ∇wVk

The λ parameter, ranging from 0 to 1, controls the influence of past gradients. λ = 1 mirrors supervised learning, while λ = 0 simplifies the update to:

Δwt = α (Vt+1 – Vt) ∇wVt

TD(0) resembles supervised learning with training pairs of (st, Vt+1). Intermediate λ values bridge these extremes, determining how strongly past predictions are adjusted by the current error. The term

et = Σk=1..t λ^(t–k) ∇wVk

is the eligibility trace. It is the core mechanism for temporal credit assignment in TD learning: credit or blame for the TD error is distributed to preceding steps in proportion to this trace. Higher λ values result in larger updates to earlier predictions for a given error signal.

Consider learning state values in a sequence (a, b, c, d). Initially, value predictions are zero. Learning is minimal until the sequence ends with an outcome ‘z’ (e.g., z=1). As we process states, we track gradients and eligibility traces. Upon reaching ‘z’, the error (z – V(d)) is calculated and multiplied by the eligibility trace, updating weights. This corrects past predictions to be closer to the final outcome value. After multiple sequences, predictions become more informed. If we encounter states ‘c’ then ‘b’, and ‘b’ has a pre-existing value, the TD error (V(b) – V(c)) adjusts V(c) towards V(b). This “bootstrapping” nature of TD learning uses its own predictions to refine others, propagating the outcome value backward through the sequence as value predictions. This propagation is effective even with λ = 0, and higher λ values accelerate the process. TD learning, while still requiring multiple sequence passes, generally learns more efficiently than supervised methods.
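
To make this concrete, here is a minimal tabular TD(λ) sketch (in Matlab, the language used by the tdbp software discussed later) for a four-state sequence like (a, b, c, d) ending in the outcome z = 1. The learning rate, λ, the number of episodes, and the lookup-table representation are illustrative assumptions rather than the tdbp implementation; with a lookup table, the gradient ∇wVt reduces to an indicator for the current state.

    % Tabular TD(lambda) sketch for the sequence a, b, c, d with outcome z = 1.
    alpha  = 0.1;                   % learning rate (illustrative)
    lambda = 0.7;                   % trace decay (illustrative)
    V      = zeros(4, 1);           % value estimates for states a..d
    z      = 1;                     % terminal outcome

    for episode = 1:100
        e   = zeros(4, 1);          % eligibility traces, reset each episode
        seq = [1 2 3 4];            % visit a, b, c, d in order
        for t = 1:numel(seq)
            s    = seq(t);
            e    = lambda * e;      % decay traces for earlier states
            e(s) = e(s) + 1;        % current state becomes fully eligible
            if t < numel(seq)
                target = V(seq(t+1));   % bootstrap from the next prediction
            else
                target = z;             % terminal outcome ends the sequence
            end
            delta = target - V(s);          % TD error between successive predictions
            V     = V + alpha * delta * e;  % credit flows back along the traces
        end
    end
    disp(V')                        % all four values approach z over episodes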

[Figure 9.1: A game state space illustrating the “NOVEL” and “BAD” states discussed below (after Sutton, 1988; reprinted by permission).]

Consider Figure 9.1, illustrating a game state space. Suppose the state “BAD” is known to lead to a loss 90% of the time and a win 10% of the time, and hence has a low value. Now suppose a “NOVEL” state leads to “BAD”, after which the game happens to be won. Supervised learning, upon encountering the “NOVEL”–win pairing, might initially assign “NOVEL” a high value. TD learning, pairing “NOVEL” with the subsequent “BAD” state, correctly assigns “NOVEL” a low value. TD learning, in essence, recognizes that “NOVEL” led to a win only by passing through a known “BAD” state. While supervised learning can eventually achieve similar performance with enough data, TD learning uses data more efficiently in multi-step prediction problems by bootstrapping, leading to faster learning from limited experience.

Reinforcement Learning: Extending TD for Control

TD learning excels at prediction. Now, we extend it to the control problem – enabling an agent to learn how to control its environment. This is the heart of reinforcement learning (RL): an agent learns to predict state values and use these predictions to make decisions that alter its environment, maximizing rewards.

Discounted Returns: Valuing Future Rewards

We now introduce the concept of “reward” for the outcome value. In RL, agents aim to maximize reward. Rewards can be positive (desirable) or negative (undesirable/punishment). TD learning is adaptable to scenarios with intermediate rewards, not just terminal ones. We associate a reward value with each state in a sequence, though many states may have zero reward. The prediction problem shifts from predicting a terminal reward to predicting the sum of rewards – the “return.”

For episodic tasks (tasks composed of distinct episodes), the simple sum of rewards works. However, for continuing, non-episodic tasks, the return could grow without bound. To address this, we introduce a discount factor, gamma (γ), where 0 ≤ γ ≤ 1. The discounted return Rt is:

Rt = rt+1 + γ rt+2 + γ² rt+3 + … = Σk=0..∞ γ^k rt+k+1

Discounting means future rewards are valued less than immediate rewards. Lower γ values emphasize immediate rewards. Gamma also controls how far ahead the agent considers future rewards when making current predictions.
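
As a quick worked example with assumed values (γ = 0.9 and a made-up reward sequence), a single reward of 1 arriving at step t+3 contributes γ² = 0.81 to the return at time t:

    gamma = 0.9;                     % assumed discount factor
    r     = [0 0 1];                 % rewards received at t+1, t+2, t+3
    k     = 0:numel(r) - 1;          % exponents on gamma
    Rt    = sum((gamma .^ k) .* r);  % Rt = 0.9^2 * 1 = 0.81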

We want our prediction ‘Vt’ to approximate the expected return at time ‘t’. Let ‘V*t’ denote this true expected return. From the return equation, we can write it recursively:

V*t = rt+1 + γ V*t+1

The error in the prediction at time ‘t’ is Et = V*t – Vt. Using the next prediction ‘Vt+1’ as an estimate of ‘V*t+1’ (bootstrapping), the generalized TD error becomes:

Et = rt+1 + γ Vt+1 – Vt

This error, scaled by the learning rate ‘α’ and applied through the eligibility trace, drives the weight updates:

Δwt = α Et Σk=1..t λ^(t–k) ∇wVk

When using function approximators like neural networks, ‘Vt’ is updated by adjusting weights via gradient descent.

Crucially, the return value replaces explicit planning. Instead of planning long sequences of actions, a TD agent aims to maximize total return. The value function ‘Vt’ reflects the expected future return. To maximize returns, an agent simply chooses actions leading to states with higher predicted values. Planning is thus reduced to selecting the next state with the highest value.
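
A sketch of this kind of greedy selection over candidate next states, using a stand-in linear value function with made-up weights in place of a trained network:

    w          = [0.2; -0.1; 0.5];       % stand-in value-function weights
    nextStates = [1 0 0; 0 1 0; 0 0 1];  % one candidate next state per row
    vals       = nextStates * w;         % predicted value of each candidate
    [~, best]  = max(vals);
    chosen     = nextStates(best, :);    % act so as to reach this state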

The Control Problem: Agent-Environment Interaction

In defining the control problem, we conceptualize agent-environment interaction as distinct modules. These modules provide a framework for understanding control, independent of specific environments or tasks. Figure 9.2 illustrates this modular architecture. The environment provides the agent with the current state ‘st’ and a reward signal ‘rt’. The reward is the sole training signal from the environment. Internally, the agent generates another training signal: the value of the successor state for TD error calculation. The agent also acts on the environment, changing its state.

[Figure 9.2: The modular architecture of agent–environment interaction: the environment supplies the state and reward signal, and the agent acts on the environment (after Sutton and Barto, 1998; reprinted by permission).]

The agent-environment boundary is defined by what the agent can control. Anything the agent cannot directly control is part of the environment, even if it is physically part of, or close to, the agent. For example, reward signals, though often computed within a biological agent’s own nervous system, are treated as part of the environment because the agent cannot simply alter them at will. Similarly, a robot’s hand counts as part of the environment while the robot is learning to control that hand.

Action selection is policy-driven. A policy is an agent-internal rule mapping state values to actions. In the simplest case, the agent evaluates possible next states, estimates their values, and chooses based on these estimates. Consider a Tic-Tac-Toe agent. It evaluates potential board states resulting from each possible move and chooses the move leading to the highest-valued state.

Alternatively, agents can learn state-action values, Q(s, a), representing the expected return of taking action ‘a’ in state ‘s’. In neural networks, the Q-function can be implemented by having input units represent state and action. For a robot in a grid world, state might be sensory input, and actions could be move forward, back, left, right, or pickup. To choose an action, the robot inputs the current state and then, for each action, computes the state-action value. The action with the highest value is selected.
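
The following sketch shows how such action selection might look with the state and a one-hot action code sharing the input layer; a single linear layer with made-up weights stands in for the real network, and the 9-unit state plus 4-unit action coding mirrors the grid-world example:

    stateUnits = [1 0 0 0 0 0 0 0 0];            % e.g. the agent in square (1,1)
    nActions   = 4;                              % up, down, left, right
    w          = rand(numel(stateUnits) + nActions, 1);  % stand-in weights

    q = zeros(nActions, 1);
    for a = 1:nActions
        actionUnits    = zeros(1, nActions);
        actionUnits(a) = 1;                      % clamp the candidate action
        q(a) = [stateUnits actionUnits] * w;     % value of this state-action pair
    end
    [~, bestAction] = max(q);                    % choose the highest-valued action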

The update rule for state-action values, known as SARSA (State-Action-Reward-State-Action), is:

Q(st, at) ← Q(st, at) + α [rt+1 + γ Q(st+1, at+1) – Q(st, at)]

SARSA uses the state and action at time ‘t’, and the reward, state, and action at time ‘t+1’ to calculate the TD error. This is associative learning, as values are tied to specific states. A non-associative version learns action values independent of state, like a slot machine agent learning the value of each lever. State-action value learning is often considered the complete RL problem, though state and action value learning are also achievable with TD methods.
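
A minimal sketch of a single tabular SARSA update; the Q table, state and action indices, reward, and parameters are made-up stand-ins (in tdbp the state-action values come from the network rather than a table):

    alpha = 0.1; gamma = 0.9;
    Q = zeros(9, 4);               % 9 states, 4 actions (illustrative)

    s = 1; a = 3;                  % state and action at time t
    r = 0;                         % reward observed at t+1
    s1 = 2; a1 = 1;                % state and action actually taken at t+1

    tdError = r + gamma * Q(s1, a1) - Q(s, a);   % SARSA TD error
    Q(s, a) = Q(s, a) + alpha * tdError;         % adjust the value of (s, a)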

Policy choice is critical for balancing exploration and exploitation. An RL agent must learn state/action values (exploration) and control the environment to maximize reward (exploitation). Initially, exploration is crucial to learn value estimates, potentially involving suboptimal actions. A purely “greedy” policy, always choosing the highest-valued action, exploits current knowledge but may miss better states due to initially low value estimates. Exploration is also vital in non-stationary environments where reward values change over time.

A common exploration-exploitation strategy is the ε-greedy policy: with probability 1 – ε it follows the greedy policy, and with probability ε it takes a random action. The softmax policy, familiar from Boltzmann machines, offers another approach. The probability of choosing action ‘a’ at time ‘t’ is:

P(at = a) = e^(Qt(a)/τ) / Σb e^(Qt(b)/τ)

where ‘Qt(a)’ is the current value estimate for action ‘a’ and ‘τ’ is the temperature coefficient. High ‘τ’ makes actions nearly equiprobable (exploration); low ‘τ’ favors the greedy action (exploitation). Annealing ‘τ’ over training allows for more exploration early on and more exploitation later. Various policies, including greedy, ε-greedy, and softmax, are implemented in software tools.
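
The sketch below illustrates both ε-greedy and softmax selection over a vector of estimated action values; the values, ε, and τ are made-up:

    qvals   = [0.2 0.5 0.1 0.4];        % estimated value of each action
    epsilon = 0.1;
    tau     = 1.0;                      % softmax temperature

    % Epsilon-greedy: usually exploit, occasionally pick a random action.
    if rand < epsilon
        aEps = randi(numel(qvals));
    else
        [~, aEps] = max(qvals);
    end

    % Softmax: higher-valued actions are chosen more often, but every action
    % keeps some probability; tau near zero approaches greedy selection.
    p     = exp(qvals / tau);
    p     = p / sum(p);
    aSoft = find(rand <= cumsum(p), 1, 'first');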

TD Learning and Backpropagation: A Powerful Combination

TD learning addresses temporal credit assignment, distributing credit/blame across prediction sequences. A simple TD implementation uses lookup tables to store and update state or state-action values. This works for enumerable state spaces. However, many tasks, especially with continuous state spaces, have too many states for lookup tables to be practical.

Backpropagation (BP), on the other hand, solves structural credit assignment, adjusting network weights to minimize error. Neural networks excel at generalizing learning across similar states. Combining TD learning with backpropagation (TDBP) creates a powerful agent capable of learning to maximize reward over time while generalizing predictions to novel states. This synergy is highly effective.

Backpropagating TD Error: Implementing TDBP

The gradient descent form of TD learning, including eligibility traces, is:

Δwt = α Et Σk=1..t λ^(t–k) ∇wVk

where the weighted sum of past gradients is the eligibility trace.

TDBP’s key difference from standard BP is that the weight adjustments for the input presented at time ‘t’ are made at time ‘t+1’. This requires saving the output gradient for the time-‘t’ input until the TD error becomes available at ‘t+1’. As in regular BP, the logistic activation function, f(neti) = 1/(1 + e^(–neti)), is commonly used; its derivative is f′(neti) = ai(1 – ai).

Output gradients with respect to the weights are calculated using the chain rule, as in standard BP. We use δik to denote ∂ai/∂netk, the derivative of the activation of output unit ‘i’ with respect to the net input of unit ‘k’. For the output layer, δii = ai(1 – ai). For a unit ‘k’ in a hidden layer:

δik = ak(1 – ak) Σj δij wjk

where ‘j’ ranges over the units in the next layer (the layer that ‘k’ projects to). Unlike regular BP, where the error is incorporated into the δ term at the output, TDBP’s δ terms are initially error-free. Thus the δ terms for a hidden layer form a two-dimensional matrix (i × k), where ‘i’ is the number of output units and ‘k’ the number of hidden units; output-layer δ terms are vectors.

The gradient of an output with respect to a weight is obtained by multiplying the corresponding δ term by the activation of the sending unit in the projecting layer. For the weight wjk (to receiving unit ‘j’ from sending unit ‘k’):

∂ai/∂wjk = δij ak

This yields a three-dimensional gradient array indexed by output unit ‘i’ and weight ‘wjk’. The eligibility trace ‘eijk’ for that projection is then updated by decaying the existing trace by λ and adding this gradient. When the TD error ‘Ei’ is observed at time ‘t+1’, the weight changes are computed as:

Δwjk = α Σi Ei eijk
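
The timing of these updates can be illustrated with a deliberately simplified case: a single logistic output unit with no hidden layer, so the eligibility trace has just one entry per weight. The inputs, reward, and parameter values below are made up, and the real tdbp code maintains the full three-dimensional traces described above; the point is only that the gradient stored at time t is applied when the TD error arrives at time t+1.

    alpha = 0.1; lambda = 0.7; gamma = 0.9;
    nIn = 5;
    w = zeros(nIn, 1);                        % weights of the single output unit
    e = zeros(nIn, 1);                        % one trace entry per weight

    s_t  = rand(nIn, 1);                      % input at time t (made up)
    s_t1 = rand(nIn, 1);                      % input at time t+1 (made up)
    r_t1 = 0;                                 % reward observed at t+1

    % Time t: make the prediction and fold its gradient into the trace.
    Vt   = 1 / (1 + exp(-(w' * s_t)));
    grad = Vt * (1 - Vt) * s_t;               % dV/dw for a logistic unit
    e    = lambda * e + grad;

    % Time t+1: the TD error is now available, so the weights can change.
    Vt1 = 1 / (1 + exp(-(w' * s_t1)));
    E   = r_t1 + gamma * Vt1 - Vt;            % generalized TD error
    w   = w + alpha * E * e;                  % delayed update for the time-t input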

For networks with multiple paths to the output, TDBP handles hidden layers projecting to multiple layers by concatenating the δ matrices (and eligibility traces) along the ‘i’ dimension. The TD error is also a vector encompassing errors for all output units.

Case Study: TD-Gammon – Mastering Backgammon with TD Learning

Backgammon is a complex board game with a high branching factor, which limits the effectiveness of tree-search methods in computer play. The vast number of board positions also renders lookup tables impractical. Gerald Tesauro’s TD-Gammon (Tesauro, 1992), developed at IBM in the early 1990s, marked a breakthrough by successfully applying TDBP to learn backgammon state values.

TD-Gammon’s network had an input layer representing the board state and a single hidden layer. The output layer comprised four logistic units estimating probabilities for white/black regular wins and gammons. It simultaneously estimated four outcome values. The initial TD-Gammon version used 198 input units. Board points were represented by 8 units each (4 for white, 4 for black), encoding checker counts. Additional units tracked checkers on the bar, removed checkers, and whose turn it was. This input layer projected to a 40-unit hidden layer.

Unlike Tesauro’s earlier Neurogammon, trained via supervised learning on expert games, TD-Gammon learned entirely through self-play. The network simulated dice rolls, generated possible next board positions, and evaluated them. The highest-ranked position was chosen as the next move. Initially, moves were random due to random weights. Over games, the network learned, exploring strategies in depth. Backgammon’s stochastic nature facilitated thorough state space exploration. This self-play approach surpassed Neurogammon’s supervised technique.

Later TD-Gammon versions incorporated conceptual features relevant to expert backgammon play, such as hit probability and blockade strength (Tesauro, 2002). This augmented input representation led to expert-level performance, establishing TD-Gammon as a top backgammon program, still used for game analysis and expert decision evaluation. Input representation is a crucial factor in TDBP network success.

TD-Gammon’s success also benefited from backgammon’s non-deterministic nature and relatively smooth state space. Similar board positions have similar values. Deterministic games like chess have discontinuous state spaces, making generalization harder. Self-play learning, while powerful, can lead to strategies optimized for self-play but weak against diverse opponents. Backgammon’s stochasticity mitigated this. Parameter tuning, often heuristic, involved λ = 0.7 and α = 0.1 for much of training. Parameter adjustments are task-dependent and modeler-driven.

Implementation: Using the tdbp Program

The tdbp program implements the TD backpropagation algorithm and mirrors the structure of the bp program. Pools are declared in feedforward order: input pools first, then hidden, then output, then context. Output activation functions can be logistic or linear; linear activation allows the network to predict rewards outside the [0, 1] range. Context pools function as they do in the srn program.

Gamma can be set per output pool or globally. Output units have ‘actfunction’ (logistic/linear). Pools have ‘delta’ (2D matrix) and ‘error’ variables. Projections can be from lower to higher numbered pools. Bias pool can project to any pool. Context pools can receive copyback projections. Self-projections are disallowed. Projections have eligibility traces (3D matrices), ‘lambda’, and ‘lrate’ parameters, defaulting to global values if unspecified.

Network operation modes are “beforestate” and “afterstate”, defined in the .net file. “Beforestate” mode: (1) input current state, (2) weight changes, (3) evaluate next states, (4) set next state. “Afterstate” mode: (1) evaluate next states, (2) set chosen state as input, (3) weight changes, (4) set next state. Learning occurs after the second state in both modes. “Beforestate” is for prediction problems, “afterstate” for environment-modifying tasks and SARSA learning. SARSA requires “afterstate” to ensure reward follows action selection, correctly adjusting state-action values.

Specifying the Environment: Matlab Class Files

The key difference from bp is using Matlab class files instead of pattern files to define the environment. An “environment class” is a Matlab object with properties and methods, simulating the task environment. It provides input patterns, reward signals, and possible next states. Environment classes must adhere to a standard interface (see envir_template.m for a template). Select environment files (.m extension) in the pattern selection dialog in pdptool. The tdbp program instantiates the class as a global variable “envir.”

The environment class interface includes one property (lflag) and ten methods (two optional). lflag controls learning; setting it to 0 skips training for the current step. This allows controlling training steps from within the environment class, useful for games where alternating player turns might require selective training.

Environment class methods fall into two groups: control methods, which manage interaction with the program, and formal methods, which reflect RL concepts (a minimal sketch of such a class follows the method lists below). Note that Matlab requires methods that modify an object’s properties to accept the object as an argument and return it.

Control Methods:

  • environment_template() (Constructor): Creates and returns a new environment instance. Rename to your environment class name. Calls construct() and returns its value. Allows initializing persistent variables that survive across training episodes.
  • construct(obj): Initializes environment variables, reset at the start of each episode. Set obj.lflag=1; to enable learning.
  • reset(obj): Resets environment state for a new episode by calling construct(obj). Save episode data to persistent variables here before calling construct(obj).
  • add_to_list=runStats(obj) (Optional): Manipulates persistent variables to extract statistics and return them as a cell array of strings for display in the pattern list at the end of a run.
  • next_state=doPolicy(obj,next_states,state_values) (Optional): Implement custom action/state selection policies based on network outputs. Takes next_states (possible next states) and state_values (network outputs) as input, returns the chosen next_state. Useful for policies using multiple output units.
  • endFlag=isTerminal(obj): Returns a logical value (0 or 1) indicating if the environment is in a terminal state. Tests for terminal conditions.

Formal Methods:

  • reward=getCurrentReward(obj): Returns a reward vector for the current step, used for TD error calculation. Length must match the number of output units. Reward determination depends on the environment dynamics.
  • state=getCurrentState(obj): Returns an input vector representing the current environment state. Length must match input units.
  • states=getNextStates(obj): Returns a matrix of possible next states/actions. Each row is a possible next state.
  • obj=setNextState(obj,next_state): Sets the environment state based on the chosen next_state (passed by tdbp). For action-value learning, interpret the action part of next_state and update the environment accordingly. For “defer” policy, next_state is all zeros and should be ignored; environment state changes are determined elsewhere.
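
To show how these pieces fit together, here is a minimal, hypothetical class skeleton exposing the interface just described, saved in its own file (e.g. sketch_envir.m). It is written in classdef style for readability; the actual template shipped with pdptool is envir_template.m and should be used as the real starting point. The class name, the extra properties beyond lflag, the 13-unit state vector, and the termination rule are illustrative assumptions, and the optional runStats and doPolicy methods are omitted.

    classdef sketch_envir
        properties
            lflag    % 1 = train on this step, 0 = skip training
            state    % current input vector handed to the network (illustrative)
            nsteps   % step counter for the current episode (illustrative)
        end
        methods
            function obj = sketch_envir()            % constructor
                obj = construct(obj);
            end
            function obj = construct(obj)            % per-episode initialization
                obj.lflag  = 1;
                obj.state  = zeros(1, 13);
                obj.nsteps = 0;
            end
            function obj = reset(obj)                % start a new episode
                obj = construct(obj);
            end
            function endFlag = isTerminal(obj)       % has the episode ended?
                endFlag = obj.nsteps >= 20;          % illustrative condition
            end
            function reward = getCurrentReward(obj)  % one entry per output unit
                reward = 0;
            end
            function state = getCurrentState(obj)    % current input vector
                state = obj.state;
            end
            function states = getNextStates(obj)     % one candidate state per row
                states = eye(13);
            end
            function obj = setNextState(obj, next_state)
                obj.state  = next_state;             % adopt the chosen state
                obj.nsteps = obj.nsteps + 1;
            end
        end
    end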

Running the Program: tdbp in Action

tdbp usage is similar to bp. Key differences:

Pattern List: Displays patterns generated by the environment class during learning. Granularity: “epoch” (episode summaries) or “step” (step-by-step state and reward info). “Show Values” mode (step mode only) displays evaluated states and their network outputs. “Reset” and “Newstart” commands function as in bp.

Training Options: Sets epochs, global lambda, gamma, lrate, learning granularity, wrange, wdecay, context layer parameters, and state selection policy. Policies: “greedy”, “egreedy” (epsilon parameter), “softmax” (temperature via “annealsched” button), “linearwt”, “defer”, “userdef” (custom doPolicy function). Policies typically use the first output unit of the lowest numbered output pool. “Userdef” is needed for policies considering multiple output units.

Testing Options: Similar to training options, but for testing. “Stepcutoff” parameter prevents infinite loops during testing with greedy policies by limiting episode length.

Exercises: Hands-on Temporal Difference Learning

Ex 9.1. Trash Robot: Exploring Reward and Parameters

Objective: Investigate reward structure and parameter effects on RL agent behavior.

  1. Setup: Open trashgrid.m (text editor) and 3by3grid.net. Start with start3by3grid in Matlab. Network: SARSA, softmax policy (temperature=1). Observe network window: 3×3 grid, 4 “action” units (up, down, left, right), 13 inputs, 6 hidden, 1 output.

  2. Task: Agent navigates a 3×3 grid to a terminal square, receiving reward based on path length. Trash piece present in one square. Agent = trash-collecting robot.

  3. Initial Exploration: Test panel active, “Update After” set to “Step”. Step through episodes. Observe unit (1,1) (top-left) active (agent location). Bottom-left unit (.5 value) indicates trash. Terminal square: (2,3) (unmarked). Robot moves randomly. “Newstart” to see path variations.

Q.9.1.1. Reward Structure:
Analyze getCurrentReward() in trashgrid.m. Determine the reward structure. What reward for the fastest path to the terminal square? Explain possible rewards and how they are obtained.

  4. Training: Set "Update After" to 10 epochs, train for 350 epochs. Observe epoch summaries in pattern list. Note initial long episodes with 0 reward, and improvement over training.

Q.9.1.2. Greedy Policy Behavior:
After training, switch to test panel (greedy policy). Step through episodes. What path does the robot take? Does it match your reward prediction? “New Start” and retrain multiple times. Is the path consistent across training runs? If not, what are the commonalities in different paths?

  5. Softmax Policy Testing: Set the test policy to softmax (temperature=1).

Q.9.1.3. Softmax Policy Changes:
Step through softmax episodes. What behavioral changes do you observe compared to greedy? Set “nepochs” to 50, check “Runstats”, “epoch” update mode, run test. Observe average terminal reward and trash pickup proportion. Run multiple tests, noting average reward. Explain changes in behavior, reward, and relation to action choice policy.

  6. Value Inspection: Test options: "nepochs" = 1, check "Show values". Step through episodes, observing action values in pattern list. Actions encoded at input vector end (up, down, left, right).

Q.9.1.4. Implicit Value Encoding:
What is implicitly encoded in relative value estimates at each step? How does this lead to the bistable behavior observed?

  7. Reward Function Modification: Training panel, reset network. Inspect getCurrentReward in trashgrid.m.

Q.9.1.5. Reward Function Manipulation:
Remember current getCurrentReward (copy/paste). Modify reward assignment to favor a different path. Retrain (350 epochs), observe greedy policy behavior. Did the change produce expected behavior? Why?

  8. Revert and Reset: Revert reward function changes. Reset network.

Q.9.1.6. Gamma Parameter Manipulation:
Change gamma value to induce the same behavior pattern as reward function manipulation. Retrain after each change, reset between runs. Test with greedy policy. May require multiple attempts and more training epochs if cycling occurs. What gamma change was made and why did it cause this behavioral change? Explain using return calculation, gamma role, reward function, and grid layout.

  9. Further Exploration: Modify trashgrid.m and network for larger grids, multiple trash pieces, or changing trash locations to further explore RL learning.
