Course 2 - Sample-based Learning Methods
Metadata
- URL:: Sample-based Learning Methods | Coursera
- Tags:: #on/Reinforcement-Learning
- Links:: Reinforcement Learning Specialization
- Status:: #triage
- Instructors:: Martha White, Adam White
- Created:: 2022-04-28 22:44:26
- Modified:: 2022-06-05 23:58:44
About this Course
In this course, you will learn about several algorithms that can learn near optimal policies based on trial and error interaction with the environment—learning from the agent’s own experience. Learning from actual experience is striking because it requires no prior knowledge of the environment’s dynamics, yet can still attain optimal behavior. We will cover intuitively simple but powerful Monte Carlo methods, and temporal difference learning methods including Q-learning. We will wrap up this course investigating how we can get the best of both worlds: algorithms that can combine model-based planning (similar to dynamic programming) and temporal difference updates to radically accelerate learning. By the end of this course you will be able to:
- Understand Temporal-Difference learning and Monte Carlo as two strategies for estimating value functions from sampled experience
- Understand the importance of exploration, when using sampled experience rather than dynamic programming sweeps within a model
- Understand the connections between Monte Carlo, Dynamic Programming, and TD
- Implement and apply the TD algorithm for estimating value functions
- Implement and apply Expected Sarsa and Q-learning (two TD methods for control)
- Understand the difference between on-policy and off-policy control
- Understand planning with simulated experience (as opposed to classic planning strategies)
- Implement a model-based approach to RL, called Dyna, which uses simulated experience
- Conduct an empirical study to see the improvements in sample efficiency when using Dyna
# Notes
# Week 1
# Introduction to Monte Carlo Methods
# What is Monte Carlo?
- The term Monte Carlo is often used more broadly for any estimation method that relies on repeated random sampling.
- In RL, Monte Carlo methods allow us to estimate values directly from experience - from sequences of states, actions and rewards.
- Monte Carlo methods are only applicable to episodic tasks.
- In reinforcement learning we want to learn a value function. Value functions represent expected returns. So a Monte Carlo method for learning a value function would first observe multiple returns from the same state. Then, it averages those observed returns to estimate the expected return from that state.
- As the number of samples increases, the average tends to get closer and closer to the expected return.
- These returns can only be observed at the end of an episode.
- Monte Carlo methods learn directly from interaction. They don’t need to model the environment dynamics.
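A minimal sketch of every-visit Monte Carlo prediction for a fixed policy, to make "average the observed returns" concrete. The `env.reset()`/`env.step()` interface and the `policy` callable are assumptions for illustration, not part of the course materials:

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, gamma=1.0):
    """Every-visit Monte Carlo prediction: estimate v_pi by averaging observed returns."""
    returns_sum = defaultdict(float)   # sum of returns observed from each state
    returns_count = defaultdict(int)   # number of returns observed from each state
    V = defaultdict(float)             # value estimates

    for _ in range(num_episodes):
        # Generate one full episode following the policy.
        episode = []                   # list of (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Work backwards through the episode, accumulating the return G.
        G = 0.0
        for state, reward in reversed(episode):
            G = gamma * G + reward
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]

    return V
```

Note that the values can only be updated after the episode terminates, which is why Monte Carlo methods are restricted to episodic tasks.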
# Monte Carlo for Control
# Off-policy learning for prediction
- On-Policy Learning: Improve and evaluate the policy being used to select actions.
- Off-Policy Learning: Improve and evaluate a different policy from the one used to select actions.
- Off-policy learning allows learning an optimal policy from suboptimal behavior.
- The policy that we are learning is the target policy, denoted $\pi (a | s)$.
- The policy that we are choosing actions from is the behavior policy, denoted $b(a|s)$.
- [?] Why does off-policy learning matter?
- Because $\epsilon$-soft policies are suboptimal, on-policy methods that both learn and act with them can only ever reach a suboptimal policy. We need another way to learn an optimal policy while still exploring all possible state-action pairs.
- Off-policy learning allows us to learn an optimal target policy while taking actions from an exploratory behavior policy.
- Importance Sampling uses samples from one probability distribution to estimate the expectation of a different distribution.
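A minimal sketch of ordinary importance sampling on a toy problem (the distributions and payoffs below are made up for illustration): samples are drawn from the behavior distribution $b$, and each sample is weighted by the ratio $\pi(x)/b(x)$ to estimate an expectation under the target distribution $\pi$.

```python
import random

# Hypothetical target and behavior distributions over three outcomes.
target = {"a": 0.7, "b": 0.2, "c": 0.1}    # pi: distribution we care about
behavior = {"a": 0.3, "b": 0.3, "c": 0.4}  # b: distribution we actually sample from
payoff = {"a": 1.0, "b": 5.0, "c": 10.0}   # X: value of each outcome

def importance_sampling_estimate(num_samples):
    """Estimate E_target[X] using samples drawn from the behavior distribution."""
    outcomes = list(behavior.keys())
    weights = list(behavior.values())
    total = 0.0
    for _ in range(num_samples):
        x = random.choices(outcomes, weights=weights)[0]  # sample from b
        rho = target[x] / behavior[x]                     # importance ratio pi(x)/b(x)
        total += rho * payoff[x]
    return total / num_samples

# True expectation under the target: 0.7*1 + 0.2*5 + 0.1*10 = 2.7
print(importance_sampling_estimate(100_000))  # should be close to 2.7
```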
# Week 2
# Introduction to Temporal Difference Learning
- Importance of Temporal-Difference Learning by Richard S. Sutton
- Methods that scale with computation are the future of AI.
- e.g., learning and search (general purpose methods)
- Weak general purpose methods are better than strong methods. #to/lookup
- Supervised Learning and Model-free RL methods are only weakly scalable.
- Prediction Learning is scalable.
- Prediction learning means you try to predict what will happen next: you make a prediction, wait to see what actually happens, and then learn from the outcome.
- It’s the unsupervised supervised learning.
- We have a target (just by waiting).
- No human labeling is needed!
- Prediction Learning is the scalable model-free learning.
- Temporal-Difference Learning is a learning method that is specialized for Prediction Learning.
- It is widely used in RL to predict future reward (value function).
- It is key to major RL algorithms:
- Q-Learning
- Sarsa
- TD($\lambda$)
- Deep Q-Network
- TD-Gammon
- Actor-critic methods
- Samuel’s checker player
- Appears to be how the brain’s reward system works
- It can be used to predict any signal, not just reward.
- It is relevant only for multi-step prediction problems.
- Methods that scale with computation are the future of AI.
# Advantages of TD Learning
- Unlike Dynamic Programming, TD methods do not require a model of the environment.
- Unlike Monte Carlo, TD methods are online and incremental.
- They can update the values on every step.
- Bootstrapping allows us to update the estimates based on other estimates.
- TD methods asymptotically converge to the correct predictions.
- They usually converge faster than Monte Carlo methods.
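A minimal sketch of tabular TD(0) prediction for a fixed policy, to make the online, incremental bootstrapping update concrete. The `env` and `policy` interfaces are the same hypothetical ones as in the Monte Carlo sketch above:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): move V(S) toward the bootstrapped target R + gamma * V(S')."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # The target bootstraps off the current estimate of the next state's value,
            # so the update can happen on every step rather than at the end of the episode.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```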
# Week 3
- The TD control algorithms are based on Bellman equations.
- Sarsa uses a sample-based version of the Bellman equation.
- Q-learning uses the Bellman Optimality equation. It learns $q^*$.
- Expected Sarsa uses the same Bellman equation as Sarsa, but samples it differently. It takes an expectation over the next action values.
- Sarsa is an on-policy algorithm that learns the action values for the policy it’s currently following.
- Q-learning is an off-policy algorithm that learns the optimal action values.
- Expected Sarsa can be used as either an on-policy or an off-policy algorithm; it can learn the action values for any target policy.
- Sarsa can do better than Q-learning when performance is measured online. This is because on-policy control methods account for their own exploration.
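To make the distinction between the three TD control updates concrete, here is a minimal sketch of each update rule on a tabular action-value table. The action set, hyperparameters, and $\epsilon$-greedy target policy are assumptions for illustration:

```python
from collections import defaultdict

GAMMA = 0.9
ALPHA = 0.1
EPSILON = 0.1
ACTIONS = [0, 1, 2, 3]                      # hypothetical action set
Q = defaultdict(float)                      # Q[(state, action)] -> value

def epsilon_greedy_probs(state):
    """Action probabilities of an epsilon-greedy policy with respect to Q."""
    best = max(ACTIONS, key=lambda a: Q[(state, a)])
    probs = {a: EPSILON / len(ACTIONS) for a in ACTIONS}
    probs[best] += 1.0 - EPSILON
    return probs

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: the target uses the action actually selected in the next state.
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next):
    # Off-policy: the target uses the greedy (max) action value in the next state.
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def expected_sarsa_update(s, a, r, s_next):
    # The target is an expectation over next-action values under the target policy.
    probs = epsilon_greedy_probs(s_next)
    expected_value = sum(probs[a2] * Q[(s_next, a2)] for a2 in ACTIONS)
    target = r + GAMMA * expected_value
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

With a greedy target policy, the Expected Sarsa target reduces to the Q-learning target, which is one way to see Q-learning as a special case.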

# Week 4
- Models are used to store knowledge about the environment’s dynamics.
- From a particular state and action, the model should produce a possible next state and reward. This allows us to see an outcome of an action without having to actually take it.
- Planning refers to the process of using a model to improve a policy.

- Models can generally be divided into two main types:
- Sample Models
- Sample models procedurally generate samples, without explicitly storing the probability of each outcome.
- They can be computationally inexpensive because random outcomes can be produced according to a set of rules.
- Sample models can be more compact than distribution models.
- Distribution Models
- Distribution models contain a list of all outcomes and their probabilities.
- They contain more information, but can be difficult to specify and can become very large.
- The explicit probabilities stored by them can be used to exactly identify the expected outcome.
- The Dyna architecture uses direct RL and planning updates to learn from both environment and model experience (see the Dyna-Q sketch at the end of this section).
- Direct RL updates use environment experiences to improve a policy or value function.
- Planning updates use model experience to improve a policy or value function.
- Models can be inaccurate if:
- They are incomplete.
- The environment changes.
- Planning with an inaccurate model improves the policy or value function with respect to the model, and not the environment.
- Dyna-Q can plan with an incomplete model by only sampling state-action pairs that have been previously visited.
- Dealing with model inaccuracy produces a variant of the exploration-exploitation trade-off: the agent must choose between exploiting its current (possibly wrong) model and re-exploring the environment to keep the model accurate.
- Dyna-Q+ adds exploration bonuses to the simulated rewards, encouraging the agent to revisit state-action pairs it has not tried in a long time.
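A minimal sketch of tabular Dyna-Q, combining direct RL updates from real experience with planning updates from a learned sample model. The environment interface, action set, and hyperparameters are assumptions, and the model assumes deterministic dynamics for simplicity:

```python
import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1
ACTIONS = [0, 1, 2, 3]                       # hypothetical action set

def dyna_q(env, num_episodes, planning_steps=10):
    Q = defaultdict(float)                   # Q[(state, action)] -> value
    model = {}                               # sample model: (state, action) -> (reward, next_state)

    def epsilon_greedy(state):
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    def q_update(s, a, r, s_next):
        target = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)

            # Direct RL update from real experience.
            q_update(state, action, reward, next_state)

            # Model learning: remember the observed transition.
            model[(state, action)] = (reward, next_state)

            # Planning: replay transitions from previously visited state-action pairs only,
            # which is how Dyna-Q copes with an incomplete model.
            for _ in range(planning_steps):
                (s, a), (r, s_next) = random.choice(list(model.items()))
                q_update(s, a, r, s_next)

            state = next_state
    return Q
```

Setting `planning_steps=0` recovers plain Q-learning, which is a convenient baseline for the sample-efficiency comparison mentioned in the course objectives.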