developing agents for team dominoes #1218
Comments
Hi @Brunozml, Apologies for the lag in communication. I am currently travelling and will be away at a conference until mid May. The tic-tac-toe QLearner example is probably using a tabular method. You might need an analogous one that uses neural networks. Take a look at python/examples/breakthrough_dqn.py which is similar but uses DQN. For the second point, the game should not have any states where a player has no actions. You can do two things in this case: add a special pass action which is the only legal action for the player when they have no actions, or change the apply_action function so that it sets the current player id to the next one with actions. (There could be another game-specific solution that is more appropriate of course, I am just not familiar with the rules.)
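The two options mentioned above can be sketched in a toy form that does not depend on OpenSpiel. This is only an illustration under stated assumptions: `ToyState`, `PASS`, `_moves` and `apply_action_skipping` are hypothetical names, the set of playable tiles is kept fixed for simplicity, and a real game would follow its own rules for what becomes playable after each move.

```python
# Toy sketch, not OpenSpiel code: two ways to avoid states with no legal actions.
PASS = -1  # hypothetical sentinel; a real game would reserve a proper action id


class ToyState:
    def __init__(self, hands, playable):
        self.hands = [list(h) for h in hands]  # hands[p] = tiles held by player p
        self.playable = set(playable)          # tiles that may be placed (fixed here)
        self.num_players = len(hands)
        self.current = 0

    def _moves(self, player):
        return [t for t in self.hands[player] if t in self.playable]

    def legal_actions(self):
        # Option 1: a blocked player gets PASS as their only legal action.
        moves = self._moves(self.current)
        return moves if moves else [PASS]

    def apply_action(self, action):
        # Option 1 counterpart: PASS changes nothing except whose turn it is.
        if action != PASS:
            self.hands[self.current].remove(action)
        self.current = (self.current + 1) % self.num_players

    def apply_action_skipping(self, action):
        # Option 2: no PASS action; jump to the next player who can act.
        self.hands[self.current].remove(action)
        for offset in range(1, self.num_players + 1):
            candidate = (self.current + offset) % self.num_players
            if self._moves(candidate):
                self.current = candidate
                return
        self.current = None  # nobody can move: treat the game as blocked/terminal


state = ToyState(hands=[[1, 2], [3], [4]], playable=[1, 4])
state.apply_action(1)
print(state.current, state.legal_actions())
```

With option 1 every agent, including a random one, always finds at least one legal action; with option 2 agents are simply never asked to act in a blocked position.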
Thank you @lanctot! I have made some progress since then, focusing firstly on the 2-person version (i.e. Here are some dilemmas I'm facing, and I would love it if you could offer some advice whenever you have the time.
(a) Training locally using the parameters used in the original papers takes ~20min/iteration, and I don't know if these are even applicable to this larger game! (e.g. the memory buffer probably shouldn't be as small as that used for Kuhn/Leduc poker). However, results from a veeeeery preliminary experiment of 30 training iterations showed that the algorithm was making progress against random agents. How do I scale this, and how do I go about selecting appropriate hyper-parameters here? (b) From policy to agent: what interface should I use for the DeepCFR solver? I'm struggling to understand the difference between a
(same for JAX, and tf2 version doesn't seem to be working)
    # Episode is over, step all agents with final info state.
    for agent in agents:
      agent.step(time_step)

I ask this because I believe the reason my 4-agent training scripts haven't succeeded is due to my attempt at adapting this evaluation methodology to four players... and I'm also generally curious about some of the design choices I've observed in open_spiel =) Finally (and sorry for condensing so many comments into one reply), giogix2's
Wow, cool to see the nice graphics @Brunozml @giogix2! I will change the docs to advertise pyspiel_game more prominently, it looks great! Ok, these are non-trivial questions. 1a. Deep CFR is based on external-sampling MCCFR. For balanced trees, each iteration will take time roughly proportional to the square root of the size of the game. This just won't scale without some form of abstraction to keep the abstract game small enough. A more scalable approach would be, for example, DREAM (Steinberger et al.) or ARMAC (Gruslys et al.); both are based on single-trajectory sampling... but we don't have implementations of either in OpenSpiel. Contributions of them would be very welcome but they're non-trivial. 1b. A
If you detect any other bugs we should be aware of, please let us know. I am not sure how much use these Deep CFR implementations are getting and they were all external contributions. There may still be bugs of the kind you found (some of these could be introduced when upgrading versions of things too, as sometimes the APIs and behaviors change over time). However, we did ask the authors to reproduce the original paper's curves when they were contributed originally (~2 years ago). It may be worthwhile checking that's still true on Kuhn and Leduc poker. Also, keep in mind the metrics could look very different when using different hyper-parameters. For easy saving/loading I would avoid TF1 and use PyTorch or JAX. Is there a specific problem with the JAX implementation? It has still been passing tests in every recent sync to GitHub.
If you haven't yet, I strongly recommend taking a look at the OpenSpiel tutorial: https://www.youtube.com/watch?v=8NCPqtPwlFQ. It explains the basic API components. I also explain why that last step is necessary in training but not evaluation. Here's an ASCII diagram. Example, suppose you have two DQN agents in self-play at the end of a game of tic-tac-toe: s_0: player 0 takes cell 4 (center) Player 0's (s, a, r, s') transitions in this game are: (s_0, 4, 0, s_2), (s_2, 0, 0, s_4), (s_4, 1, 0, s_6), (s_6, 2, +1, s_7). If there was no agent step on the final state, then Player 1 would never receive the last transition (s_5, 8, -1, s_7), which DQN needs to put into its replay buffer to be able to learn. Similarly for Player 0's final transition (s_6, 2, +1, s_7). However, during evaluation you don't need to worry about the agents learning, you just need the (+1, -1) reward at the end of the episode. Hope this helps!
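The effect of the final step can be reproduced with a toy recorder that has nothing to do with OpenSpiel's actual agent classes. In this sketch `RecordingAgent`, `play` and the simplified `step(state, reward, action)` signature are hypothetical, and Player 1's first two moves (cells 3 and 5) are assumed purely for illustration; only Player 0's moves and Player 1's last move (cell 8) come from the example above.

```python
class RecordingAgent:
    """Hypothetical stand-in for a learning agent: stores (s, a, r, s') tuples."""

    def __init__(self, player_id):
        self.player_id = player_id
        self.prev_state = None
        self.prev_action = None
        self.transitions = []

    def step(self, state, reward, action=None):
        # A transition can only be closed once the next state and reward are seen.
        if self.prev_state is not None:
            self.transitions.append(
                (self.prev_state, self.prev_action, reward, state))
        self.prev_state, self.prev_action = state, action


def play(moves, final_rewards, final_step):
    """moves is a list of (player, action); the state at time t is named 's_t'."""
    agents = [RecordingAgent(0), RecordingAgent(1)]
    for t, (player, action) in enumerate(moves):
        agents[player].step(f"s_{t}", 0, action)
    if final_step:
        terminal = f"s_{len(moves)}"
        # Episode is over: every agent sees the terminal state and its reward.
        for agent in agents:
            agent.step(terminal, final_rewards[agent.player_id])
    return agents


# Player 0 plays 4, 0, 1, 2; Player 1 plays 3, 5 (assumed) and 8.
moves = [(0, 4), (1, 3), (0, 0), (1, 5), (0, 1), (1, 8), (0, 2)]
with_final = play(moves, final_rewards=[+1, -1], final_step=True)
without_final = play(moves, final_rewards=[+1, -1], final_step=False)
print(with_final[1].transitions[-1])
print(len(without_final[1].transitions))
```

Without the final step, neither recorder ever stores a transition carrying the terminal reward, which is the only non-zero reward in this game.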
As a follow-up to 1, instead of DREAM or ARMAC I suggest trying R-NaD (in this repos), which is the Friction FoReL algorithm behind the Mastering Stratego paper... or MMD by @ssokota @ryan-dorazio et al. We do not have a copy of the RL form of the MMD in our repos but you can find it on Sam's github (https://github.com/ssokota/mmd) |
Hi @lanctot, Thanks for the reply!! I took another look at the tutorial, and it all makes more sense now that I have a better grasp of OpenSpiel as a whole! In terms of your follow-up reply, I have a few more doubts:
Finally, out of pragmatic curiosity, how do you suggest I set-up a systematic approach to training and testing different algorithms for Again, I thank you for your dedication and patience. Much much appreciated |
Hi @Brunozml The deep RL form of MMD is basically just PPO with appropriate hyperparameters, so you could use any reasonable PPO implementation. |
Sorry to jump in in the middle of an ongoing discussion, but I wanted to say thank you for looking at pygame_spiel. I'm happy to hear it served as an inspiration. Creating a UI for Dominoes is not trivial, so congrats on what you've done @Brunozml, it looks awesome. My initial goal was to implement Hive for both open_spiel (likely to be hosted in my fork, because of license-related issues), and later in pygame_spiel. Not easy at all.
@giogix2 Cool, yes I'd love to hear about more developments in pygame_spiel and would be happy to advertise its use on our development page (in fact I will do that in the next few weeks). Funny that you should mention Hive in particular. Someone is currently working on an implementation and we are discussing a potential license / agreement with the publishers to have it in the main repos. See #1194 for details and feel free to contact me or @rhstephens via email if you want updates. |
That would be amazing! @giogix2 I've just invited you as a collaborator on my
@ssokota I've been playing around with MMD. I might have misunderstood something about the OpenSpiel implementation, but its performance doesn't seem to reach that of the paper (see here). @lanctot In terms of R-NaD, has anyone already used the OpenSpiel implementation to do something similar (albeit at a smaller scale ofc) to the Mastering Stratego paper? That would be great.
Well I'm not sure what you mean by something similar to the Mastering Stratego paper. If you mean super-human performance, there are some people working on it in Liar's poker and they've managed to recover good strategies at small scale (@ciamac). And there's this paper that used it as a baseline: https://arxiv.org/pdf/2312.15220v2. Note: several people who have worked with it / are working with it have reported issues, some of which we have fixed but some remain open, so take a look at the GitHub issues before using it as there are some known open questions. There's also NFSP, btw, which is always a good place to start. But I'd say it's still worth trying MMD with annealing temperatures and waiting for Sam's response. He might have some suggested learning rates and decay schedules. There's very recent work using search and transformers (https://arxiv.org/abs/2404.13150). I'd love to see an implementation of something like this in OpenSpiel, but this is a fair bit of effort.
@Brunozml What results are you trying to reproduce? The implementation in OpenSpiel is for sequence form, whereas the other implementation you linked is for behavioral form. So, it's expected that they give different results. |
Great to hear about the ongoing work with the Hive implementation @lanctot. Looking forward to testing it out. @Brunozml I'd be happy to have a chat about pygame_spiel. I have some ideas about the philosophy of the project, and on how to make it more expandable. Making it easier to switch between player 0 and 1 is quite fundamental, I agree.
Hi,
for the last couple of days I have been working on scripts for training and evaluating RL agents for multiplayer dominoes, but I have faced two main issues:
tic_tac_toe_qlearner.py example to this four-player game; however, I've been unable to resolve an issue regarding legal_actions() being called by the random agent's step function when it's empty (no available actions) and should therefore not be called. Being the only Python game with more than two players, it has been hard to understand if the bug is in my own implementation of the game, the observer, or in some other script.
Any resources and/or advice help!