
Homework2 - Policy Gradient

Please complete the homework as a team, and mention who contributed which parts in your report.

Introduction

In this assignment, we will solve a classic control problem: CartPole.

CartPole is an environment containing a pendulum attached by an un-actuated joint to a cart; the goal is to prevent the pendulum from falling over. You can apply a force of +1 or -1 to the cart, and a reward of +1 is provided for every timestep that the pendulum remains upright.
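For a feel of the environment, here is a minimal sketch of a random-action episode, written against the gym API of this assignment's era, where reset() returns only the observation and step() returns a 4-tuple (newer gym/gymnasium releases changed both signatures):

    import gym

    env = gym.make("CartPole-v0")
    obs = env.reset()            # old gym API: reset() returns the observation
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()          # random push: 0 (left) or 1 (right)
        obs, reward, done, info = env.step(action)  # old gym API: 4-tuple
        total_reward += reward                      # +1 for every upright timestep
    print("episode return:", total_reward)

A random policy typically keeps the pole up for only a couple dozen timesteps, which is the baseline your learned policy should beat.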

Setup

  • OpenAI gym
  • TensorFlow
  • NumPy
  • SciPy
  • IPython Notebook

If you already have some of the above libraries installed, manage the remaining dependencies yourself.

If you are setting up a new (possibly virtual) environment, the preferred way to install the above dependencies is Anaconda, a Python distribution that includes many of the most popular Python packages for science, math, engineering, and data analysis.

  1. Install Anaconda: follow the instructions on the Anaconda download site.
  2. Install TensorFlow: see the Anaconda section of the TensorFlow installation page.
  3. Install OpenAI gym: follow the official installation documents.

Prerequisites

If you are unfamiliar with NumPy or IPython, you should read the tutorial materials from CS231n.

Also, knowing the basics of TensorFlow is required to complete this assignment.

For introductory material on TensorFlow, see the official TensorFlow tutorials.

Feel free to skip these materials if you are already familiar with these libraries.

How to Start

  1. Start IPython: after you clone this repository and install all the dependencies, start the IPython notebook server from the repository's root directory.
  2. Open the assignment: open HW2_Policy_Graident.ipynb, and it will walk you through completing the assignment.

To-Do

  • [+20] Construct a 2-layer neural network to represent the policy (see the TensorFlow sketch after this list)

  • [+30] Compute the surrogate loss (also covered in the TensorFlow sketch)

  • [+20] Compute the accumulated discounted rewards at each timestep (see the NumPy sketch after this list)

  • [+10] Use a baseline to reduce the variance (also covered in the NumPy sketch)

  • [+10] Modify the code and write a report comparing the variance and performance before and after adding the baseline (figures are encouraged)

  • [+10] In the function process_paths of class PolicyOptimizer, why do we need to normalize the advantages? That is, what is the purpose of this line:

    p["advantages"] = (a - a.mean()) / (a.std() + 1e-8)

    Include the answer in your report.
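For orientation, here is a minimal sketch of a 2-layer policy network and the surrogate loss, written against the TensorFlow 1.x-style graph API of this assignment's era; the dimensions and all variable names (observations, actions, advantages, hidden_dim) are illustrative, not the notebook's actual names:

    import tensorflow as tf

    obs_dim, hidden_dim, n_actions = 4, 8, 2   # CartPole: 4-dim observation, 2 actions

    observations = tf.placeholder(tf.float32, [None, obs_dim])
    actions      = tf.placeholder(tf.int32,   [None])
    advantages   = tf.placeholder(tf.float32, [None])

    # 2-layer network: one hidden tanh layer, then a softmax over the two actions.
    w1 = tf.Variable(tf.random_normal([obs_dim, hidden_dim], stddev=0.1))
    b1 = tf.Variable(tf.zeros([hidden_dim]))
    hidden = tf.tanh(tf.matmul(observations, w1) + b1)
    w2 = tf.Variable(tf.random_normal([hidden_dim, n_actions], stddev=0.1))
    b2 = tf.Variable(tf.zeros([n_actions]))
    action_probs = tf.nn.softmax(tf.matmul(hidden, w2) + b2)

    # Surrogate loss: L = -mean_t( log pi(a_t | s_t) * A_t ).
    # Minimizing L performs gradient ascent on the expected return.
    flat_indices = tf.range(tf.shape(action_probs)[0]) * n_actions + actions
    picked_probs = tf.gather(tf.reshape(action_probs, [-1]), flat_indices)
    surrogate_loss = -tf.reduce_mean(tf.log(picked_probs + 1e-8) * advantages)

    train_op = tf.train.AdamOptimizer(1e-2).minimize(surrogate_loss)

And a minimal NumPy sketch of the reward-processing steps for a single episode; the function name discount_cumsum is hypothetical, and the constant mean baseline is the crudest possible choice, standing in for whatever baseline the notebook asks you to implement:

    import numpy as np

    def discount_cumsum(rewards, gamma=0.99):
        """Accumulated discounted rewards: R_t = sum_{k >= t} gamma^(k - t) * r_k."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    rewards = np.ones(5)                # CartPole pays +1 per upright timestep
    returns = discount_cumsum(rewards)  # -> [4.90, 3.94, 2.97, 1.99, 1.00]

    # Baseline: subtracting a state-value estimate b(s_t) from R_t leaves the
    # policy gradient unbiased while reducing its variance.
    a = returns - returns.mean()

    # Normalizing the advantages to zero mean / unit variance keeps the
    # effective gradient step size stable across iterations; the 1e-8 term
    # guards against division by zero when all advantages are equal.
    advantages = (a - a.mean()) / (a.std() + 1e-8)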

Other

  • Office hours: 2-3 pm in 資電館 (the EECS building) with YenChen Lin.
  • Due on Oct. 17 before class.
