CSC2541-F18: Topics in Machine Learning
Deep Reinforcement Learning


Instructor: Jimmy Ba

Lecture hours: Wednesday 3–5, ES B142

Office hours: Jimmy: Wednesday 5–6, PT290D; TAs: Thursday 3–4, PT290C

Teaching assistants: Tingwu Wang, Michael Zhang


Announcements:

  • New Oct 30: TA office hours moved to 3–4 PM on Thursdays, in Pratt 290.

  • New Oct 30: You are encouraged to upload a link to your presentation slides to the seminar Excel sheet.

  • Oct 11: The course project guideline is now posted. Guideline

  • Oct 3: Updated software resources. Enroll on Piazza to find project partners.

  • Sept 18: Classroom changed from BA1240 to ES B142.


Course Overview:

Learning by interaction, or trial and error, is a core aspect of any intelligent system. Reinforcement learning (RL) is a paradigm that aims to develop computational methods allowing intelligent agents to learn by interacting with their environments. In this course, we will cover the basic formulation of the Markov decision process (MDP) and learning algorithms for tabular MDPs. The course will then focus mainly on function approximation methods using deep neural networks, with examples including game playing and robot locomotion control.
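As a concrete illustration of the tabular setting mentioned above, here is a minimal Q-learning sketch in Python. It assumes a Gym-style environment with discrete observation and action spaces (for example, FrozenLake-v0) and the classic reset/step interface that returns a 4-tuple from step; the hyperparameter values are arbitrary.

    import numpy as np

    def tabular_q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Q-table over discrete states and actions.
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        for _ in range(num_episodes):
            state = env.reset()          # classic Gym API: reset() returns the initial state
            done = False
            while not done:
                # Epsilon-greedy action selection.
                if np.random.rand() < epsilon:
                    action = env.action_space.sample()
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward, done, _ = env.step(action)
                # TD(0) update toward the bootstrapped target r + gamma * max_a Q(s', a).
                target = reward + gamma * np.max(Q[next_state]) * (not done)
                Q[state, action] += alpha * (target - Q[state, action])
                state = next_state
        return Q

For example, with Gym installed, tabular_q_learning(gym.make("FrozenLake-v0")) returns a Q-table whose greedy policy is np.argmax(Q, axis=1).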


Calendar:

  Date | Reading | Topic / Slides | Presenter (in parentheses)
Week 1 Sept 12 Sutton & Barto, Ch 1 Introduction
pdf slides
     
Week 2 Sept 19 Sutton & Barto, Ch 2-4 Multi-Armed Bandits, MDPs, Dynamic Programming
pdf slides
     
Week 3 Sept 26 Sutton & Barto, Ch 5-7 Monte-Carlo and Temporal Difference Learning
pdf slides
  Value Function Approximation  
  Baird. Residual Algorithms: Reinforcement Learning with Function Approximation
  Tsitsiklis, Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation
  Tesauro. Temporal Difference Learning and TD-Gammon (Karthik Raja)
  Riedmiller. Neural Fitted Q Iteration (Oliver Limoyo)
  Mnih, et al. Human-level control through deep reinforcement learning (Hyunmin Lee)
     
Week 4 Oct 3 Monte-Carlo Planning  
  Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search (Bret Nestor)
  Kocsis, Szepesvari. Bandit Based Monte-Carlo Planning (Ranjani Murali)
  Gelly, Silver. Combining Online and Offline Knowledge in UCT (Alberto Camacho)
  Silver, et al. Temporal-Difference Search in Computer Go (Jesse Bettencourt)
  Silver, et al. Mastering the game of Go with deep neural networks and tree search (Will Grathwohl)
  Policy Search  
  Konda, Tsitsiklis. Actor-Critic Algorithms (Xiaohui Zeng)
  Sutton, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation (Lan Xiao)
  Baxter, Bartlett. Infinite-Horizon Policy-Gradient Estimation
  Sutton, et al. Comparing Policy-Gradient Algorithms (Angeline Yasodhara)
  Tang, Abbeel. On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient (Amanjit Singh Kainth)
     
Week 5 Oct 10 Policy Search (cont.)  
  Mnih, et al. Asynchronous Methods for Deep Reinforcement Learning (Alex Adam)
  Grudic, et al. Refining Autonomous Robot Controllers Using Reinforcement Learning (Jonathan Lorraine)
  Hafner, Riedmiller. Reinforcement learning in feedback control (Srinivasan)
  Silver, et al. Deterministic Policy Gradient Algorithms (Silviu Pitis)
  Lillicrap, et al. Continuous control with deep reinforcement learning (Sergio Casas)
  Hierarchical RL  
  Parr, Russell. Reinforcement Learning with Hierarchies of Machines (Bryan Chan)
  Barto, et al. Intrinsically Motivated Learning of Hierarchical Collections of Skills (Zihang Fu)
  Sutton, et al. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning (Reid McIlroy-Young)
  Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition
  Guestrin, et al. Coordinated Reinforcement Learning (Safwan Hossain)
     
Project proposal due (Oct 14)    
     
Week 6 Oct 17 Exploration, Intrinsic Motivation, Curiosity  
  Schmidhuber. Curious model-building control systems (S. Vineeth Bhaskara)
  Mohamed, Rezende. Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning (Yingying Fu)
  Stadie, et al. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models (Zeqi Li)
  Pathak, et al. Curiosity-driven Exploration by Self-supervised Prediction (Harris Chan)
  Houthooft, et al. VIME: Variational Information Maximizing Exploration (Zahra Shekarchi)
  Model-based RL  
  Levine, Koltun. Guided Policy Search (Shengyang Sun)
  Nagabandi, et al. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning (Yin-Hung Chen)
  Chua, et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (Paul Briggs)
  Botev, et al. The Cross-Entropy Method for Optimization (Haicheng Wang)
  Sutton. Dyna, an Integrated Architecture for Learning, Planning, and Reacting + Kurutach, et al. Model-Ensemble Trust-Region Policy Optimization (Eric Langlois)
  Ha, Schmidhuber. World Models (Farzaneh Mahdisoltani)
     
Week 7 Oct 24 Energy-based control/inference  
  Haarnoja, et al. Reinforcement Learning with Deep Energy-Based Policies
  Haarnoja, et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (Mohammad Firouzi)
  Nachum, et al. Bridging the Gap Between Value and Policy Based Reinforcement Learning (Kelvin Wong)
  Ziebart, et al. Modeling interaction via the principle of maximum causal entropy
  Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review (Sean Segal)
  Inverse RL  
  Ng, Russell. Algorithms for inverse reinforcement learning (Simon Suo)
  Ziebart, et al. Maximum Entropy Inverse Reinforcement Learning (Elsa Riachi)
  Finn, et al. A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models (Stephane Aroca-Ouellette)
  Hadfield-Menell, et al. Inverse Reward Design (Sneha Desai)
  Evolution Strategies  
  Brockhoff, et al. Mirrored Sampling and Sequential Selection for Evolution Strategies (Jingcheng Niu)
     
Week 8 Oct 31    
     
Week 9 Fall reading week (no class)
     
Week 10 Nov 14    
     
Week 11 Nov 21    
     
Week 12 Nov 28    
    Project presentation
Week 13 Dec 5    
    Project presentation
Final project due (Dec 16)    
     

Resources:

RL code bases:
  OpenAI Baselines: Implementations of common reinforcement learning algorithms.
  Google Dopamine: A research framework for fast prototyping of reinforcement learning algorithms.
  Evolution-strategies-starter: Evolution Strategies as a Scalable Alternative to Reinforcement Learning.
  Pytorch-a2c-ppo-acktr: PyTorch implementation of A2C, PPO, and ACKTR.
  Model-Agnostic Meta-Learning: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.
  Reptile: A meta-learning algorithm that finds a good initialization.

General frameworks:
  TensorFlow: An open-source machine learning framework.
  PyTorch: An open-source deep learning platform that provides a seamless path from research prototyping to production deployment.

Environments:
  OpenAI Gym: A toolkit for developing and comparing reinforcement learning algorithms (a minimal usage sketch appears at the end of this page).
  DeepMind Control Suite: A set of Python reinforcement learning environments powered by the MuJoCo physics engine.

Suggested (free) online computation platforms:
  AWS-EC2: Amazon Elastic Compute Cloud (EC2) forms a central part of Amazon's cloud-computing platform, Amazon Web Services (AWS), allowing users to rent virtual computers on which to run their own applications.
  GCE: Google Compute Engine delivers virtual machines running in Google's data centers and worldwide fiber network.
  Colab: Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.
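
As a quick orientation to the OpenAI Gym toolkit listed above, here is a minimal random-agent episode loop in Python. It assumes the classic Gym interface of this course's era (reset() returns an observation, step() returns a 4-tuple) and a standard environment name such as CartPole-v1; newer Gym/Gymnasium releases use a slightly different API.

    import gym

    # Classic Gym API: reset() -> observation, step() -> (observation, reward, done, info).
    env = gym.make("CartPole-v1")
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = env.action_space.sample()   # random policy, purely for illustration
        obs, reward, done, info = env.step(action)
        total_reward += reward
    env.close()
    print("Episode return:", total_reward)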