Issues with Critic Network Loss in Policy Gradient Algorithms #1300
Comments
Hi @Pagl1acci, which game? My guess is that the learning rate(s) are too high.
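For reference, lowering the learning rates amounts to passing smaller values to the agent constructor. A minimal sketch, assuming the keyword names `critic_learning_rate` and `pi_learning_rate` in the PyTorch `policy_gradient.py` match your version, with hypothetical values chosen only to illustrate going well below the defaults:

```python
from open_spiel.python import rl_environment
from open_spiel.python.pytorch import policy_gradient

env = rl_environment.Environment("matrix_brps")

# Hypothetical values: the point is simply to try rates well below the defaults.
agent = policy_gradient.PolicyGradient(
    player_id=0,
    info_state_size=env.observation_spec()["info_state"][0],
    num_actions=env.action_spec()["num_actions"],
    loss_str="neurd",
    critic_learning_rate=1e-3,
    pi_learning_rate=1e-4,
)
```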
Hi @lanctot. It's 'brps', biased RPS. As stated in the paper, if the NeuRD loss is used, the learning trajectories should resemble replicator dynamics, so they should cycle around the Nash equilibrium.
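For comparison, the replicator-dynamics trajectories can be generated directly from the payoff matrix. Below is a minimal sketch (not from the thread), assuming matrix_brps is a symmetric zero-sum matrix game so a single-population replicator update applies; it uses only the generic pyspiel state API:

```python
import numpy as np
import pyspiel

game = pyspiel.load_game("matrix_brps")
n = game.num_distinct_actions()

# Recover the row player's payoff matrix via the generic state API.
A = np.zeros((n, n))
for r in range(n):
    for c in range(n):
        state = game.new_initial_state()
        state.apply_actions([r, c])
        A[r, c] = state.returns()[0]

# Discretized replicator dynamics: x_i' = x_i * ((A x)_i - x^T A x).
x = np.array([0.2, 0.5, 0.3])  # arbitrary interior starting point
dt, trajectory = 0.01, [x.copy()]
for _ in range(20000):
    fitness = A @ x
    x = x + dt * x * (fitness - x @ fitness)
    trajectory.append(x.copy())
# trajectory can then be projected onto the simplex and plotted.
```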
The description in my original post about convergence to a point is not entirely accurate. With the latter setup, the q_value is now [-20.2396, 0.1002, -7.2484] and the policy π is [0.00443135, 0.98928986, 0.00627879]. Here, action 2 should have the highest q-value, which is about 4.72, but the critic network seems unable to capture that even after more training steps. Any thoughts on what might be going wrong? Thanks in advance for your insights!
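One way to check a figure like 4.72, and to see how far the critic is off, is to compute the exact expected return of each action against the opponent's current mixed strategy. A minimal sketch using only the generic pyspiel state API; `opponent_policy` is a hypothetical placeholder, not taken from the thread:

```python
import numpy as np
import pyspiel

game = pyspiel.load_game("matrix_brps")
n = game.num_distinct_actions()

def payoff(a, b):
    # Row player's return when the joint action (a, b) is played.
    state = game.new_initial_state()
    state.apply_actions([a, b])
    return state.returns()[0]

opponent_policy = np.array([0.00443135, 0.98928986, 0.00627879])  # placeholder
exact_q = np.array(
    [sum(opponent_policy[b] * payoff(a, b) for b in range(n)) for a in range(n)]
)
print(exact_q)  # compare against the critic's estimate [-20.2396, 0.1002, -7.2484]
```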
We need more information to be able to reproduce. Which policy gradient implementation? Did you base it on kuhn_example.py? Also:
Which paper? Are you referring to, e.g., the RPG paper, the NeuRD paper, or the R-NaD paper? It'd be helpful if you could point to the specific claim to give some context. Once we have more details, we'll be better suited to help.
Thank you for your patient guidance. I tried to use NeuRD from https://arxiv.org/abs/1906.00190 on the biased RPS game. In that paper, NeuRD is introduced as a one-line fix to the policy gradient algorithm, which I believe is the framework presented in https://arxiv.org/abs/1810.09026, and I think policy_gradient.py follows that framework. I used the Torch version, open_spiel/python/pytorch/policy_gradient.py, as there is no NeuRD loss implemented in the TF version.

What I expect to see is something similar to the plot below: the trajectories of replicator dynamics in the same game setting.

[Figure: replicator-dynamics trajectories on the simplex for biased RPS]

Changing the loss type doesn't change much: I tried 'rpg', 'qpg', and 'neurd', and the same issue can also be observed in the critic network. Here is the code I used:

```python
from open_spiel.python.pytorch import policy_gradient

game_name = 'matrix_brps'

for episode_counter in range(1000000):
    # (training loop body truncated in the original post)
```
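For completeness, a full training loop for a two-player matrix game with these agents typically looks roughly like the sketch below. This is an assumption based on the standard OpenSpiel rl_environment pattern (e.g. the Kuhn policy gradient example), not the poster's exact script, and constructor arguments may differ between versions:

```python
from open_spiel.python import rl_environment
from open_spiel.python.pytorch import policy_gradient

env = rl_environment.Environment("matrix_brps")
info_state_size = env.observation_spec()["info_state"][0]
num_actions = env.action_spec()["num_actions"]

agents = [
    policy_gradient.PolicyGradient(
        player_id=pid,
        info_state_size=info_state_size,
        num_actions=num_actions,
        loss_str="neurd",  # or "rpg" / "qpg"
    )
    for pid in range(2)
]

for episode_counter in range(1000000):
    time_step = env.reset()
    while not time_step.last():
        # Matrix games are simultaneous-move: both players act every step.
        outputs = [agent.step(time_step) for agent in agents]
        time_step = env.step([out.action for out in outputs])
    for agent in agents:
        agent.step(time_step)  # final step so agents see the terminal rewards
```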
I implemented policy gradient with 'qpg', 'rpg', and 'neurd' losses in the 'brps' game, but all of them failed and converged to the corner of the simplex [0, 1, 0].
It seems that the critic network couldn't approximate the true values properly. I think this might be because the samples weren't reweighted: they follow the same distribution as the policy π, so when most sampled actions come from the dominant action, the loss contribution of the rarely sampled actions effectively gets ignored by the network.
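A minimal sketch of that reweighting idea (not part of the OpenSpiel implementation): scale each sample's critic loss by the inverse of the probability with which its action was drawn, so rarely sampled actions still contribute to the regression:

```python
import torch

def weighted_critic_loss(q_values, actions, returns, behavior_probs, eps=1e-6):
    """Mean-squared critic loss with inverse-propensity weights.

    q_values:       (batch, num_actions) critic outputs.
    actions:        (batch,) sampled action indices.
    returns:        (batch,) sampled returns used as regression targets.
    behavior_probs: (batch,) pi(a|s) of each sampled action at sampling time.
    """
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    weights = 1.0 / (behavior_probs + eps)
    weights = weights / weights.sum()  # normalize so the loss scale stays stable
    return (weights * (q_taken - returns) ** 2).sum()
```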
I couldn’t find anyone reporting similar issues, and other implementations appear to work fine. Am I overlooking something? Any suggestions would be appreciated!