Homework 2 李冠毅 李季紘 #9

Open · wants to merge 3 commits into base: master
Binary file added Average_return_with_baseline.png
Binary file added Average_return_without_baseline.png
Binary file added CEDL_HW2_Report.pdf
Binary file not shown.
259 changes: 247 additions & 12 deletions HW2_Policy_Graident.ipynb

Large diffs are not rendered by default.

68 changes: 15 additions & 53 deletions README.md
@@ -2,61 +2,23 @@
Please complete each homework for each team, and <br>
mention who contributed which parts in your report.

# Introduction
In this assignment, we will solve the classic control problem - CartPole.
# Our report for this assignment
We present our report as a <a href="CEDL_HW2_Report.pdf">PDF file</a>.

<img src="https://cloud.githubusercontent.com/assets/7057863/19025154/dd94466c-8946-11e6-977f-2db4ce478cf3.gif" width="400" height="200" />
# Problem 5
### Average return with baseline
![with baseline](Average_return_with_baseline.png "Average return with baseline")
### Average return without baseline
![without baseline](Average_return_without_baseline.png "Average return without baseline")

CartPole is an environment which contains a pendulum attached by an un-actuated joint to a cart,
and the goal is to prevent it from falling over. You can apply a force of +1 or -1 to the cart.
A reward of +1 is provided for every timestep that the pendulum remains upright.
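
For orientation, here is a minimal sketch of interacting with this environment through OpenAI gym; the `CartPole-v0` id and the random policy are illustrative only, since the assignment's notebook drives the environment through its own rollout code:

```python
import gym

env = gym.make("CartPole-v0")
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()            # push left or right at random, for illustration
    observation, reward, done, info = env.step(action)
    total_reward += reward                        # +1 for every timestep the pole stays upright
print(total_reward)
```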
In the two figures above, the blue line is the average return and the black band marks the standard-deviation range. The upper figure uses a baseline in the policy gradient while the lower one does not. The standard deviations differ noticeably: in terms of variance, adding the baseline reduces the variance by roughly 300–400. With the variance reduced we expected the number of iterations to drop as well, and in our experiments we did observe a reduction of about 10–20 iterations; however, since every run produces different results, these numbers are only approximate.
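
As a concrete illustration of this comparison, below is a minimal, hypothetical sketch of how a baseline enters the advantage computation; `compute_advantages`, `returns`, and `baseline_values` are illustrative names rather than the notebook's exact variables:

```python
import numpy as np

def compute_advantages(returns, baseline_values, use_baseline=True):
    # Subtracting a state-value baseline shifts each discounted return without
    # changing the expected policy gradient, so only the variance is reduced.
    returns = np.asarray(returns, dtype=float)
    if not use_baseline:
        return returns
    return returns - np.asarray(baseline_values, dtype=float)

# Same update direction in expectation, but a much smaller spread of values.
returns = np.array([30.0, 20.0, 10.0])
baseline = np.array([25.0, 18.0, 9.0])
print(compute_advantages(returns, baseline))         # [5. 2. 1.]
print(compute_advantages(returns, baseline, False))  # [30. 20. 10.]
```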

# Setup
* OpenAI gym
* TensorFlow
* Numpy
* Scipy
* IPython Notebook
# Problem 6
Normalizing the advantages dampens the effect of the variance in the rewards, which in turn keeps the number of iterations stable. We originally expected this step to also stabilize the gradient, but after repeated experiments we found that the iteration count increased rather than decreased; adding normalization only reduces the influence of whether a baseline is used. We therefore conclude that normalization mainly makes the training process more stable.
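
For reference, a small self-contained sketch of the normalization step discussed above (the same operation as the `p["advantages"]` line quoted in the To-Do list below), using illustrative numbers:

```python
import numpy as np

a = np.array([12.0, 3.0, -5.0, 40.0])   # illustrative raw advantages

# Shift to zero mean and scale to unit standard deviation; the 1e-8 term
# guards against division by zero when all advantages are equal.
normalized = (a - a.mean()) / (a.std() + 1e-8)

print(normalized.mean())  # ~0.0
print(normalized.std())   # ~1.0
```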

**If you already have some of the above libraries installed, try to manage the dependencies yourself.**

If you are using a new (possibly virtual) environment, the preferred approach for installing the above dependencies is to use [Anaconda](https://www.continuum.io/downloads), a Python distribution that includes many of the most popular Python packages for science, math, engineering, and data analysis.

1. **Install Anaconda**: Follow the instructions on the [Anaconda download site](https://www.continuum.io/downloads).
2. **Install TensorFlow**: See [anaconda section](https://www.tensorflow.org/versions/r0.11/get_started/os_setup.html#anaconda-installation) of TensorFlow installation page.
3. **Install OpenAI gym**: Follow the official installation documents [here](https://gym.openai.com/docs).

# Prerequisites
If you are unfamiliar with Numpy or IPython, you should read materials from [CS231n](http://cs231n.github.io/):
* [Numpy tutorial](http://cs231n.github.io/python-numpy-tutorial/)
* [IPython tutorial](http://cs231n.github.io/ipython-tutorial/)

Also, knowing the basics of TensorFlow is required to complete this assignment.

For introductory material on TensorFlow, see
* [MNIST For ML Beginners](https://www.tensorflow.org/versions/r0.11/tutorials/mnist/beginners/index.html) from official site
* [Tutorial Video](https://www.youtube.com/watch?v=l6K-MFgIEjc&t=3334s) from [Stanford CS224D](http://cs224d.stanford.edu/syllabus.html)

Feel free to skip these materials if you are already familiar with these libraries.

# How to Start
1. **Start IPython**: After you clone this repository and install all the dependencies, you should start the IPython notebook server from the home directory
2. **Open the assignment**: Open ``HW2_Policy_Graident.ipynb``, and it will walk you through completing the assignment.

# To-Do
* [**+20**] Construct a 2-layer neural network to represent policy
* [**+30**] Compute the surrogate loss
* [**+20**] Compute the accumulated discounted rewards at each timestep
* [**+10**] Use baseline to reduce the variance
* [**+10**] Modify the code and write a report to compare the variance and performance before and after adding baseline (with figures is better)
* [**+10**] In function `process_paths` of class `PolicyOptimizer`, why do we need to normalize the advantages?
i.e., what is the purpose of this line:

`p["advantages"] = (a - a.mean()) / (a.std() + 1e-8)`

Include the answer in your report

# Other
* Office hour 2-3 pm in 資電館 with [YenChen Lin](http://yclin.me/).
* Due on Oct. 17 before class.
# Team members and contribution
- Name: <a href="https://github.com/Timforce">李冠毅</a> Student ID: 104064510 <br>
Contribution: result plots, experiment discussion, environment setup

- Name: <a href="https://github.com/gjlnnv">李季紘</a> Student ID: (NCTU) 0556083 <br>
Contribution: programming, experiment discussion, data collection
6 changes: 5 additions & 1 deletion policy_gradient/policy.py
@@ -27,7 +27,9 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
Sample solution is about 2~4 lines.
"""
# YOUR CODE HERE >>>>>>
# 2-layer policy network: a tanh hidden layer followed by a softmax output
# layer that maps observations to action probabilities.
hidden_layer = tf.contrib.layers.fully_connected(self._observations, hidden_dim, tf.tanh)
probs = tf.contrib.layers.fully_connected(hidden_layer, out_dim, tf.nn.softmax)
# <<<<<<<<

# --------------------------------------------------
@@ -69,6 +71,8 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
Sample solution is about 1~3 lines.
"""
# YOUR CODE HERE >>>>>>
# The optimizer minimizes, so negate the policy-gradient objective:
# minimizing -E[log_prob * advantage] maximizes the expected return.
surr_loss = -tf.reduce_mean(log_prob * self._advantages)
# <<<<<<<<

Expand Down
11 changes: 9 additions & 2 deletions policy_gradient/util.py
@@ -19,7 +19,14 @@ def flatten_space(space):
Sample solution is about 1~7 lines.
"""

#def discount_cumsum(x, discount_rate):
# YOUR CODE HERE >>>>>>
# return ???
# <<<<<<<<

def discount_cumsum(x, discount_rate):
    # Discounted reward-to-go at each timestep:
    #   result[t] = x[t] + discount_rate * x[t+1] + discount_rate**2 * x[t+2] + ...
    # A single backward pass accumulates the discounted tail sums.
    result = x.astype(float)
    for index in reversed(range(len(x) - 1)):
        result[index] += discount_rate * result[index + 1]

    return result
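
# Usage sketch (illustrative values only; not part of the assignment's rollout code).
# An equivalent vectorized form uses scipy.signal.lfilter, which applies the
# recursion y[n] = x[n] + discount_rate * y[n-1] over the reversed rewards;
# scipy is already listed among the homework dependencies.
if __name__ == "__main__":
    import numpy as np
    import scipy.signal

    rewards = np.ones(4)           # e.g., CartPole gives +1 per timestep
    print(discount_cumsum(rewards, 0.99))
    print(scipy.signal.lfilter([1], [1, -0.99], rewards[::-1])[::-1])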