-
Hi Fernando, thanks for offering to help. You need to change MCTSModel.calc_loss in mcts_models.py to do this. A recent change (#1158), which uses quantile regression for reward and value regression, shows where the relevant code is. If you are using create_simple_prediction_net() (the default for SimpleMCTSModel) to construct the prediction model, you also need to change create_simple_prediction_net() to output logits instead of a scalar. I'd love to know how it compares to quantile regression. Don't hesitate to ask if anything is still unclear. Wei
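For readers following along, here is a minimal sketch of the loss change described above: the prediction head emits one logit per support bin, and the loss is the cross entropy against a target distribution over those bins. It is written in plain Python rather than ALF's PyTorch code, and `log_softmax`, `categorical_loss`, and `expected_bin` are hypothetical helper names, not ALF APIs. The integer support `-support_size..support_size` is also an assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def categorical_loss(logits, target_probs):
    """Cross entropy between a soft target distribution and the logits."""
    log_p = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target_probs, log_p))


def expected_bin(logits, support_size):
    """Expected support value under softmax(logits).

    Bin i stands for the integer i - support_size (an assumption of this
    sketch). The result is still in the transformed space, so the inverse of
    whatever value transform is used must be applied afterwards.
    """
    log_p = log_softmax(logits)
    return sum(math.exp(lp) * (i - support_size) for i, lp in enumerate(log_p))


# Example: 5 bins representing -2..2, target mass split between bins 0 and 1.
target = [0.0, 0.0, 0.25, 0.75, 0.0]
good_logits = [-5.0, -5.0, 1.0, 2.0, -5.0]
bad_logits = [2.0, 1.0, -5.0, -5.0, -5.0]
good_loss = categorical_loss(good_logits, target)
bad_loss = categorical_loss(bad_logits, target)
```

In the example, `good_loss` is smaller than `bad_loss` because `good_logits` puts most of its probability on the bins the target occupies. In a real model the same computation would be done on batched tensors.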
-
FYI, I am implementing this.
-
The three networks (encoding_net_ctor, dynamics_net_ctor, prediction_net_ctor) need to be changed for cartpole. The current version assumes an image as input.
-
Hi @emailweixu, I think I'm on the right track. the config
-
Some of the hyperparameters (e.g. weight decay) are tuned for the Atari 100k setting. If you already have a working cartpole config, you can change it slightly to use the new loss without introducing other changes (e.g., train_repr_prediction). Although train_repr_prediction is very useful for Atari, it has never been tested on cartpole.
-
@ipsec The batch dimension is the first dimension. For rollout, it is the same as num_parallel_environments, which is 1, 2, and 5 respectively in the three cases you mentioned.
-
Hi all,
First, congratulations on the excellent project.
In the Network Architecture section of the MuZero paper (https://rdcu.be/ccErB), the authors apply an invertible transformation to the reward and value targets and then convert them to a categorical representation.
Werner Duvaud's code, available on GitHub, does this here and here.
With this representation, the loss function can be cross entropy, which gives more stable results than MSE (according to the paper).
I don't know how to implement this in ALF.
Maybe you could give me some guidance, so I can try to help you with this.
Best regards,
Fernando
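For reference, the transformation mentioned in the post can be sketched as follows. This is a minimal plain-Python illustration, not code from ALF or from Werner Duvaud's repository. It assumes the transform h(x) = sign(x)(sqrt(|x| + 1) - 1) + εx with ε = 0.001 as given in the paper's appendix, and an integer support `-support_size..support_size`. The function names are hypothetical.

```python
import math

EPS = 0.001  # epsilon from the MuZero paper's appendix (assumed here)


def h(x, eps=EPS):
    """Invertible scaling: sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return math.copysign(math.sqrt(abs(x) + 1.0) - 1.0, x) + eps * x


def h_inv(y, eps=EPS):
    """Closed-form inverse of h, from solving eps*u^2 + u - (1 + eps + |y|) = 0."""
    u = (math.sqrt(1.0 + 4.0 * eps * (abs(y) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return math.copysign(u * u - 1.0, y)


def to_two_hot(x, support_size):
    """Encode scalar x as probabilities over the integers -S..S.

    The scaled value is clipped to the support, then its mass is split
    between the two neighbouring integer bins.
    """
    y = max(-support_size, min(support_size, h(x)))
    low = math.floor(y)
    p_high = y - low
    probs = [0.0] * (2 * support_size + 1)
    probs[low + support_size] = 1.0 - p_high
    if p_high > 0.0:
        probs[low + support_size + 1] = p_high
    return probs


def from_probs(probs, support_size):
    """Decode: expected support value, then undo the scaling."""
    expected = sum(p * (i - support_size) for i, p in enumerate(probs))
    return h_inv(expected)


# Round trip for a value whose scaled form lies inside the support.
encoded = to_two_hot(37.5, support_size=10)
decoded = from_probs(encoded, support_size=10)
```

The encoded vector serves as the target distribution for a cross-entropy loss, and `from_probs` recovers a scalar from the predicted distribution. Values whose scaled form falls outside the support are clipped, so the round trip is only exact inside it.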