Issues with Critic Network Loss in Policy Gradient Algorithms #1300

Open
Pagl1acci opened this issue Dec 16, 2024 · 4 comments

Comments

@Pagl1acci

I ran policy gradient with the 'qpg', 'rpg', and 'neurd' losses on the 'brps' game, but all of them failed and the policy converged to the corner of the simplex, [0, 1, 0].

It seems that the critic network cannot approximate the true values properly. I suspect this is because the training samples are not reweighted: they are drawn from the same distribution as the policy π, so when most samples come from the majority action, the loss on the rarely played actions effectively gets ignored by the network.
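For concreteness, a minimal sketch of the kind of reweighting I have in mind, assuming a simple MSE critic over per-action Q-values; the 1/π(a) importance weight and the names here are my own illustration, not what policy_gradient.py actually does:

import torch

def weighted_critic_loss(q_pred, q_target, actions, policy_probs, eps=1e-3):
    """MSE critic loss where each sample is reweighted by 1 / pi(a).

    q_pred:       (batch, num_actions) critic outputs
    q_target:     (batch,) sampled returns for the taken actions
    actions:      (batch,) indices of the taken actions
    policy_probs: (batch, num_actions) behaviour policy at sampling time
    """
    q_taken = q_pred.gather(1, actions.unsqueeze(1)).squeeze(1)
    pi_taken = policy_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    weights = 1.0 / pi_taken.clamp(min=eps)   # rarely played actions get larger weight
    weights = weights / weights.mean()        # keep the overall loss scale unchanged
    return (weights * (q_taken - q_target) ** 2).mean()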

I couldn’t find anyone reporting similar issues, and other implementations appear to work fine. Am I overlooking something? Any suggestions would be appreciated!

@lanctot
Collaborator

lanctot commented Dec 16, 2024

Hi @Pagl1acci,

What game?

My guess is that the learning rate(s) is/are too high.

@Pagl1acci
Author

Hi @lanctot.

It's 'brps', biased RPS. According to the paper, with the NeuRD loss the learning trajectories should resemble replicator dynamics, so the policy should cycle around the NE rather than converge to a corner.
[Plot of the learning trajectory with critic_lr = 0.001 and pi_lr = 0.001]

[Plot of the learning trajectory with critic_lr = 0.001 and pi_lr = 0.0001]

My original description of convergence to a single point is not entirely accurate. For the second run, the q_values are now [-20.2396, 0.1002, -7.2484] and π is [0.00443135, 0.98928986, 0.00627879]. Against this policy, action 2 should actually have the highest q_value, about 4.72, but the critic network seems unable to capture that even after many more steps.
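For reference, a quick sanity check of those numbers, treating the reported π as the opponent's policy as well (both agents collapse similarly) and assuming the biased RPS payoff matrix below; the matrix values are my reading of matrix_brps, so treat them as an assumption:

import numpy as np

# Assumed row-player payoff matrix for biased RPS (rock, paper, scissors).
R = np.array([[0, -25, 50],
              [25, 0, -5],
              [-50, 5, 0]], dtype=float)

opponent_pi = np.array([0.00443135, 0.98928986, 0.00627879])

# True expected value of each row action against the reported policy.
true_q = R @ opponent_pi
print(true_q)  # action 2 comes out around 4.7, far from the critic's -7.25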

Any thoughts on what might be going wrong? Thanks in advance for your insights!

@lanctot
Collaborator

lanctot commented Dec 16, 2024

We need more information to be able to reproduce.

Which policy gradient implementation? Did you base it on the kuhn_example.py?

Also:

According to the paper, with the NeuRD loss the learning trajectories should resemble replicator dynamics

Which paper? Are you referring to, e.g., the RPG paper, the NeuRD paper, or the R-NaD paper? It'd be helpful if you could point to the specific claim to give some context.

Once we have more details we'll be better suited to help.

@Pagl1acci
Author

Thank you for your patient guidance.

I tried to use NeuRD from https://arxiv.org/abs/1906.00190 on the biased RPS game. In that paper, NeuRD is introduced as a one-line fix to the policy gradient algorithm, which I believe is the one presented in https://arxiv.org/abs/1810.09026, and I think policy_gradient.py follows that framework.
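To make the "one-line fix" concrete, this is my understanding of the difference, written as a small sketch over policy logits and per-action advantages (my own notation, not the exact code in policy_gradient.py):

import torch

def pg_logit_update(logits, advantages, lr):
    # Softmax policy gradient (all-actions form): logit i moves by lr * pi_i * adv_i.
    pi = torch.softmax(logits, dim=-1)
    return logits + lr * pi * advantages

def neurd_logit_update(logits, advantages, lr):
    # NeuRD "one-line fix": drop the pi_i factor, so rarely played
    # actions still receive a full-size update from their advantage.
    return logits + lr * advantages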

I used the PyTorch version from open_spiel/python/pytorch/policy_gradient.py, as there is no NeuRD loss implemented in the TF version. What I expect to see is something similar to the plot below: the trajectories of replicator dynamics in the same game setting.
[Plot of replicator-dynamics trajectories on biased RPS]

Changing the loss type doesn't change much; I tried 'rpg', 'qpg', and 'neurd'. The same issue with the critic network shows up in all of them.

Here is the code I used:

from open_spiel.python import rl_environment
from open_spiel.python.pytorch import policy_gradient

game_name = 'matrix_brps'
env = rl_environment.Environment(game_name)
info_state_size = env.observation_spec()["info_state"][0]
num_actions = env.action_spec()["num_actions"]

agents = [
    policy_gradient.PolicyGradient(
        player_id=player_id,
        info_state_size=info_state_size,
        num_actions=num_actions,
        loss_str="neurd",
        hidden_layers_sizes=[32],
        batch_size=64,
        entropy_cost=None,
        critic_learning_rate=0.001,
        pi_learning_rate=0.0001,
        num_critic_before_pi=8,
        optimizer_str="sgd") for player_id in [0, 1]
]

policy_history_ = []
for episode_counter in range(1000000):
  time_step = env.reset()
  # Both players act on the same (simultaneous-move) time step.
  agent_output1 = agents[0].step(time_step)
  agent_output2 = agents[1].step(time_step)
  time_step = env.step([agent_output1.action, agent_output2.action])
  # Final step so the agents see the terminal rewards.
  for agent in agents:
    agent.step(time_step)
  if episode_counter % 512 == 0:
    policy_history_.append(agent_output1.probs)
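To produce plots comparable to the replicator-dynamics figure, I project policy_history_ onto the 2-simplex roughly like this (a matplotlib-based sketch, names are mine):

import numpy as np
import matplotlib.pyplot as plt

probs = np.array(policy_history_)              # shape (T, 3)
# Barycentric projection of (p_rock, p_paper, p_scissors) onto the 2D simplex.
x = probs[:, 1] + 0.5 * probs[:, 2]
y = (np.sqrt(3) / 2) * probs[:, 2]

plt.plot(x, y, lw=0.5)
plt.plot([0, 1, 0.5, 0], [0, 0, np.sqrt(3) / 2, 0], 'k-')  # simplex boundary
plt.axis('equal')
plt.show()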
