CSC2541-F18: Topics in Machine Learning
Deep Reinforcement Learning


Instructor: Jimmy Ba

Lecture hours: Wednesday 3–5, ES B142

Office hours: Jimmy: W 5–6, PT290D; TAs: Th 3–4, PT290C

Teaching assistants: Tingwu Wang, Michael Zhang


  • New Oct 30: TA hours moved to Thursday, 3–4 PM, in Pratt 290.

  • New Oct 30: You are encouraged to add a link to your presentation slides in the seminar Excel sheet.

  • Oct 11: The course project guideline is now posted. Guideline

  • Oct 3: Updated software resources. Enroll on Piazza to find project partners.

  • Sept 18: Classroom changed from BA1240 to ES B142.

Course Overview:

Learning by interaction, or trial and error, is a core aspect of any intelligent system. Reinforcement learning (RL) is a paradigm that aims to develop computational methods allowing intelligent agents to learn by interacting with their environments. In this course, we will cover the basic formulation of the Markov decision process (MDP) and learning algorithms for tabular MDPs. The course will focus mainly on function approximation methods using deep neural networks. Examples will include game playing and robot locomotion control.


  Reading Topic / Slides
Week 1 Sept 12 Sutton & Barto, Ch 1 Introduction
pdf slides
Week 2 Sept 19 Sutton & Barto, Ch 2-4 Multi-Armed Bandits, MDPs, Dynamic Programming
pdf slides
Week 3 Sept 26 Sutton & Barto, Ch 5-7 Monte-Carlo and Temporal Difference Learning
pdf slides
  Value Function Approximation  
  Baird. Residual Algorithms: Reinforcement Learning with Function Approximation  
  Tsitsiklis, Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation  
  Tesauro. Temporal Difference Learning and TD-Gammon Karthik Raja
  Riedmiller. Neural Fitted Q Iteration Oliver Limoyo
  Mnih, et al. Human-level control through deep reinforcement learning Hyunmin Lee
Week 4 Oct 3 Monte-Carlo Planning  
  Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search Bret Nestor
  Kocsis, Szepesvári. Bandit based Monte-Carlo Planning Ranjani Murali
  Gelly, Silver. Combining Online and Offline Knowledge in UCT Alberto Camacho
  Silver, et al. Temporal-Difference Search in Computer Go Jesse Bettencourt
  Silver, et al. Mastering the game of Go with deep neural networks and tree search Will Grathwohl
  Policy Search  
  Konda, Tsitsiklis. Actor-Critic Algorithms Xiaohui Zeng
  Sutton, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation Lan Xiao
  Baxter, Bartlett. Infinite-Horizon Policy-Gradient Estimation  
  Sutton, et al. Comparing Policy-Gradient Algorithms Angeline Yasodhara
  Tang, Abbeel. On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient Amanjit Singh Kainth
Week 5 Oct 10 Policy Search Cont.  
  Mnih, et al. Asynchronous Methods for Deep Reinforcement Learning Alex Adam
  Grudic, et al. Refining Autonomous Robot Controllers Using Reinforcement Learning Jonathan Lorraine
  Hafner, Riedmiller. Reinforcement learning in feedback control Srinivasan
  Silver, et al. Deterministic Policy Gradient Algorithms Silviu Pitis
  Lillicrap, et al. Continuous control with deep reinforcement learning Sergio Casas
  Hierarchical RL  
  Parr, Russell. Reinforcement Learning with Hierarchies of Machines Bryan Chan
  Barto, et al. Intrinsically Motivated Learning of Hierarchical Collections of Skills Zihang Fu
  Sutton, et al. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning Reid McIlroy-Young
  Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function  
  Guestrin, et al. Coordinated Reinforcement Learning Safwan Hossain
Project proposal due (Oct 14)    
Week 6 Oct 17 Exploration, Intrinsic Motivation, Curiosity  
  Schmidhuber. Curious model-building control systems. S. Vineeth Bhaskara
  Mohamed, Rezende. Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning Yingying Fu
  Stadie, et al. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models Zeqi Li
  Pathak, et al. Curiosity-driven Exploration by Self-supervised Prediction Harris Chan
  Houthooft, et al. VIME: Variational Information Maximizing Exploration Zahra Shekarchi
  Model-based RL  
  Levine, Koltun. Guided Policy Search Shengyang Sun
  Nagabandi, et al. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning Yin-Hung Chen
  Chua, et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models Paul Briggs
  Botev, et al. The Cross-Entropy Method for Optimization Haicheng Wang
  Sutton. Dyna, an Integrated Architecture for Learning, Planning, and Reacting + Kurutach, et al. Model-Ensemble Trust-Region Policy Optimization Eric Langlois
  Ha, Schmidhuber. World Models Farzaneh Mahdisoltani
Week 7 Oct 24 Energy-based control/inference  
  Haarnoja, et al. Reinforcement Learning with Deep Energy-Based Policies  
  Haarnoja, et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor Mohammad Firouzi
  Nachum, et al. Bridging the Gap Between Value and Policy Based Reinforcement Learning Kelvin Wong
  Ziebart, et al. Modeling interaction via the principle of maximum causal entropy  
  Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review Sean Segal
  Inverse RL  
  Ng, Russell. Algorithms for inverse reinforcement learning Simon Suo
  Ziebart, et al. Maximum Entropy Inverse Reinforcement Learning Elsa Riachi
  Finn, et al. A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models Stephane Aroca-Ouellette
  Hadfield-Menell, et al. Inverse Reward Design Sneha Desai
  Evolutionary strategy  
  Brockhoff, et al. Mirrored Sampling and Sequential Selection for Evolution Strategies Jingcheng Niu
Week 8 Oct 31    
Week 9 Fall reading week   no class
Week 10 Nov 14    
Week 11 Nov 21    
Week 12 Nov 28    
    Project presentation
Week 13 Dec 5    
    Project presentation
Final project due (Dec 16)    


Type Name Description
RL Code base OpenAI Baselines Implementations of common reinforcement learning algorithms.
  Google Dopamine Research framework for fast prototyping of reinforcement learning algorithms.
  Evolution-strategies-starter Evolution Strategies as a Scalable Alternative to Reinforcement Learning.
  Pytorch-a2c-ppo-acktr PyTorch implementation of A2C, PPO and ACKTR.
  Model-Agnostic Meta-Learning Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.
  Reptile Reptile is a meta-learning algorithm that finds a good initialization.
General Framework TensorFlow An open source machine learning framework.
  PyTorch An open source deep learning platform that provides a seamless path from research prototyping to production deployment.
Environments OpenAI Gym Gym is a toolkit for developing and comparing reinforcement learning algorithms.
  DeepMind Control Suite A set of Python Reinforcement Learning environments powered by the MuJoCo physics engine.
Suggested (Free) online computation platform AWS-EC2 Amazon Elastic Compute Cloud (EC2) forms a central part of Amazon's cloud-computing platform, Amazon Web Services (AWS), by allowing users to rent virtual computers on which to run their own computer applications.
  GCE Google Compute Engine delivers virtual machines running in Google’s innovative data centers and worldwide fiber network.
  Colab Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.