Problem Set 1: Basics of Implementation

Exercise 1.1: Gaussian Log-Likelihood

Path to Exercise:

  • PyTorch version: spinup/exercises/pytorch/problem_set_1/exercise1_1.py
  • Tensorflow version: spinup/exercises/tf1/problem_set_1/exercise1_1.py

Path to Solution:

  • PyTorch version: spinup/exercises/pytorch/problem_set_1_solutions/exercise1_1_soln.py
  • Tensorflow version: spinup/exercises/tf1/problem_set_1_solutions/exercise1_1_soln.py

Instructions. Write a function that takes in the means and log stds of a batch of diagonal Gaussian distributions, along with (previously-generated) samples from those distributions, and returns the log likelihoods of those samples. (In the Tensorflow version, you will write a function that creates computation graph operations to do this; in the PyTorch version, you will directly operate on given Tensors.)

You may find it useful to review the formula given in this section of the RL introduction.

Implement your solution in exercise1_1.py, and run that file to automatically check your work.

Evaluation Criteria. Your solution will be checked by comparing outputs against a known-good implementation, using a batch of random inputs.
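The quantity you need is the log-density of a diagonal Gaussian, summed over dimensions. As a framework-agnostic NumPy sketch of that math (the function name and shapes are illustrative, not the starter code's API; in the actual exercise you would operate on PyTorch Tensors or build Tensorflow ops):

```python
import numpy as np

def gaussian_likelihood(x, mu, log_std):
    """Log-likelihood of samples x under diagonal Gaussians N(mu, exp(log_std)^2).

    x, mu, log_std: arrays of shape (batch, dim).
    Returns an array of shape (batch,).
    """
    # Per-dimension log-density, then sum over the last axis:
    # -0.5 * ( ((x - mu)/sigma)^2 + 2*log_std + log(2*pi) )
    pre_sum = -0.5 * (((x - mu) / np.exp(log_std))**2
                      + 2 * log_std + np.log(2 * np.pi))
    return pre_sum.sum(axis=-1)

# Sanity check: the density of a 1-D standard normal at its mean
# is 1/sqrt(2*pi), so the log-likelihood is -0.5*log(2*pi).
out = gaussian_likelihood(np.zeros((1, 1)), np.zeros((1, 1)), np.zeros((1, 1)))
print(out)  # [-0.91893853]
```

A useful sanity check for your own implementation is to compare against a library density (e.g. `torch.distributions.Normal(...).log_prob(...).sum(axis=-1)` in the PyTorch version).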

Exercise 1.2: Policy for PPO

Path to Exercise:

  • PyTorch version: spinup/exercises/pytorch/problem_set_1/exercise1_2.py
  • Tensorflow version: spinup/exercises/tf1/problem_set_1/exercise1_2.py

Path to Solution:

  • PyTorch version: spinup/exercises/pytorch/problem_set_1_solutions/exercise1_2_soln.py
  • Tensorflow version: spinup/exercises/tf1/problem_set_1_solutions/exercise1_2_soln.py

Instructions. Implement an MLP diagonal Gaussian policy for PPO.

Implement your solution in exercise1_2.py, and run that file to automatically check your work.

Evaluation Criteria. Your solution will be evaluated by running it for 20 epochs in the InvertedPendulum-v2 Gym environment, which should take in the ballpark of 3-5 minutes (depending on your machine and whatever other processes you are running in the background). The bar for success is an average score above 500 over the last 5 epochs, or reaching a score of 1000 (the maximum possible score) in the last 5 epochs.
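The key structural idea in an MLP diagonal Gaussian policy is that the mean is a state-dependent network output, while the log stds are typically a state-independent learnable vector; actions are sampled as mu + sigma * noise. A minimal NumPy sketch of that forward pass (sizes, names, and the plain-NumPy MLP are all illustrative assumptions, not the starter code's interface):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights, biases):
    # Tanh MLP producing the Gaussian mean; the final layer is linear.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(x @ W + b)
    return x @ weights[-1] + biases[-1]

def sample_action(obs, weights, biases, log_std):
    # mu depends on the observation; log_std is a free parameter vector
    # shared across states (one entry per action dimension).
    mu = mlp(obs, weights, biases)
    noise = rng.standard_normal(mu.shape)
    return mu + np.exp(log_std) * noise

# Hypothetical sizes: obs dim 4, hidden layers (64, 64), action dim 1.
sizes = [4, 64, 64, 1]
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
log_std = -0.5 * np.ones(1)

a = sample_action(np.zeros((2, 4)), weights, biases, log_std)
print(a.shape)  # (2, 1)
```

For PPO you also need the log-likelihood of each sampled action under the policy, which is exactly the computation from Exercise 1.1.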

Exercise 1.3: Computation Graph for TD3

Path to Exercise:

  • PyTorch version: spinup/exercises/pytorch/problem_set_1/exercise1_3.py
  • Tensorflow version: spinup/exercises/tf1/problem_set_1/exercise1_3.py

Path to Solution:

  • PyTorch version: spinup/algos/pytorch/td3/td3.py
  • Tensorflow version: spinup/algos/tf1/td3/td3.py

Instructions. Implement the main mathematical logic for the TD3 algorithm.

As starter code, you are given the entirety of the TD3 algorithm except for the main mathematical logic (essentially, the loss functions and intermediate calculations needed for them). Find “YOUR CODE HERE” to begin.

You may find it useful to review the pseudocode in our page on TD3.

Implement your solution in exercise1_3.py, and run that file to see the results of your work. There is no automatic checking for this exercise.
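The heart of that mathematical logic is the Bellman backup with clipped double-Q learning plus target policy smoothing. Here is a framework-agnostic NumPy sketch of those two pieces (function names and default hyperparameter values are illustrative, not the starter code's API):

```python
import numpy as np

def td3_target(r, d, q1_targ, q2_targ, gamma=0.99):
    # Clipped double-Q Bellman backup: bootstrap from the minimum of the
    # two target Q-values, and zero out the bootstrap at terminal states.
    return r + gamma * (1 - d) * np.minimum(q1_targ, q2_targ)

def smoothed_target_action(mu_targ, act_limit=1.0, noise_scale=0.2, noise_clip=0.5):
    # Target policy smoothing: add clipped Gaussian noise to the target
    # policy's action, then clip the result to the valid action range.
    eps = np.clip(noise_scale * np.random.randn(*mu_targ.shape),
                  -noise_clip, noise_clip)
    return np.clip(mu_targ + eps, -act_limit, act_limit)

# At a terminal transition (d=1) the target reduces to the reward alone.
print(td3_target(np.array([1.0]), np.array([1.0]),
                 np.array([5.0]), np.array([3.0])))  # [1.]
```

Both Q-networks regress onto this same target (mean-squared error), while the policy is trained to maximize the first Q-network's value of its own actions.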

Evaluation Criteria. Evaluate your code by running exercise1_3.py with HalfCheetah-v2, InvertedPendulum-v2, and one other Gym MuJoCo environment of your choosing (set via the --env flag). It is set up to use smaller neural networks (hidden sizes [128,128]) than are typical for TD3, with a maximum episode length of 150, and to run for only 10 epochs. The goal is to see significant learning progress relatively quickly (in terms of wall-clock time). Experiments will likely take on the order of 10 minutes.

Use the --use_soln flag to run Spinning Up’s TD3 instead of your implementation. Anecdotally, within 10 epochs, the score in HalfCheetah should go over 300, and the score in InvertedPendulum should max out at 150.

Problem Set 2: Algorithm Failure Modes

Exercise 2.1: Value Function Fitting in TRPO

Path to Exercise. (Not applicable, there is no code for this one.)

Path to Solution. Solution available here.

Many factors can impact the performance of policy gradient algorithms, but few more drastically than the quality of the learned value function used for advantage estimation.

In this exercise, you will compare results between runs of TRPO where you put lots of effort into fitting the value function (train_v_iters=80), versus where you put very little effort into fitting the value function (train_v_iters=0).

Instructions. Run the following command:

python -m spinup.run trpo --env Hopper-v2 --train_v_iters[v] 0 80 --exp_name ex2-1 --epochs 250 --steps_per_epoch 4000 --seed 0 10 20 --dt

and plot the results. (These experiments might take ~10 minutes each, and this command runs six of them.) What do you find?
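To make concrete what train_v_iters controls: each epoch, the value function takes train_v_iters gradient steps on the mean-squared error against empirical returns, so train_v_iters=0 means the value function never moves from its initialization and the advantage estimates are built on garbage. A minimal NumPy sketch of that fitting loop (a linear value function and these names are illustrative assumptions; Spinning Up's TRPO actually fits an MLP):

```python
import numpy as np

def fit_value_fn(w, obs, returns, train_v_iters, lr=0.05):
    # One epoch's value update: train_v_iters gradient steps on the MSE
    # between V(s) = obs @ w and the empirical returns-to-go.
    # With train_v_iters=0, w is returned unchanged.
    for _ in range(train_v_iters):
        err = obs @ w - returns
        w = w - lr * (obs.T @ err) / len(returns)
    return w

rng = np.random.default_rng(0)
obs = rng.standard_normal((256, 3))
w_true = np.array([1.0, -2.0, 0.5])
returns = obs @ w_true

w0 = np.zeros(3)
err_0 = np.abs(fit_value_fn(w0, obs, returns, 0) - w_true).max()
err_80 = np.abs(fit_value_fn(w0, obs, returns, 80) - w_true).max()
print(err_0)   # 2.0 -- no fitting happened at all
print(err_80)  # near zero after 80 gradient steps
```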

Exercise 2.2: Silent Bug in DDPG

Path to Exercise:

  • PyTorch version: spinup/exercises/pytorch/problem_set_2/exercise2_2.py
  • Tensorflow version: spinup/exercises/tf1/problem_set_2/exercise2_2.py

Path to Solution. Solution available here.

The hardest part of writing RL code is dealing with bugs, because failures are frequently silent. The code will appear to run correctly, but the agent’s performance will degrade relative to a bug-free implementation—sometimes to the extent that it never learns anything.

In this exercise, you will observe a bug in vivo and compare results against correct code. The bug is the same (conceptually, if not in exact implementation) for both the PyTorch and Tensorflow versions of this exercise.

Instructions. Run exercise2_2.py, which will launch DDPG experiments with and without a bug. The non-bugged version runs the default Spinning Up implementation of DDPG, using a default method for creating the actor and critic networks. The bugged version runs the same DDPG code, except uses a bugged method for creating the networks.

There will be six experiments in all (three random seeds for each case), and each should take in the ballpark of 10 minutes. When they’re finished, plot the results. What is the difference in performance with and without the bug?

Without referencing the correct actor-critic code (which is to say—don’t look in DDPG’s core.py file), try to figure out what the bug is and explain how it breaks things.

Hint. To figure out what’s going wrong, think about how the DDPG code implements the DDPG computation graph. For the Tensorflow version, look at this excerpt:

# Bellman backup for Q function
backup = tf.stop_gradient(r_ph + gamma*(1-d_ph)*q_pi_targ)

# DDPG losses
pi_loss = -tf.reduce_mean(q_pi)
q_loss = tf.reduce_mean((q-backup)**2)

How could a bug in the actor-critic code have an impact here?
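As general background for thinking about this question (an illustration of one common class of silent failure in losses like the one above, not necessarily the exact bug in this exercise): NumPy and Tensorflow broadcasting can quietly change the shape of an error term without raising any exception.

```python
import numpy as np

q = np.ones((4, 1))   # e.g. a network output carrying a trailing singleton axis
backup = np.zeros(4)  # e.g. a flat batch of targets

# (q - backup) broadcasts (4, 1) against (4,) into shape (4, 4), so what
# looks like a per-sample squared error is silently a 4x4 matrix before
# the mean is taken -- the code runs, but the loss is not what you meant.
diff = q - backup
loss = np.mean(diff**2)
print(diff.shape)  # (4, 4)
```

Bugs of this flavor produce code that trains without errors while optimizing a subtly wrong objective, which is exactly why they are hard to catch.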

Bonus. Are there any choices of hyperparameters which would have hidden the effects of the bug?


Write Code from Scratch

As we suggest in the essay, try reimplementing various deep RL algorithms from scratch.

Requests for Research

If you feel comfortable with writing deep learning and deep RL code, consider trying to make progress on any of OpenAI’s standing requests for research: