At the moment, SeaPearl is only able to learn a value heuristic from previous experience: given a Constraint Programming problem at a certain step of the solving process and a variable, our algorithm returns what it considers to be the best value, within the domain of that variable, to branch on.
However, a good constraint programming solver is the combination of two heuristics:
a Variable heuristic: decides which variable to branch on. This has not been implemented in SeaPearl yet.
a Value heuristic: decides, given a variable, which value to assign to it. This has already been implemented in SeaPearl.
The next big step for SeaPearl is to be able to learn both heuristics, using a specific agent for each task. This is a case of sequential Multi-agent Reinforcement learning or SMARL.
But why don't we just implement a nearly identical pipeline for variable selection?
Indeed, why wouldn't it be possible to have two pipelines, i.e. two agents evolving in parallel, taking the same tripartite graph as input and inferring respectively the variable to branch on and the value to assign to it? After each assignment, a fix-point is computed, and the associated reward is identical for both agents.
To this question come two observations:
First, the two tasks are relatively close to each other, so it seems interesting, at least from a computational point of view, to try to intertwine the two pipelines. We could think of sharing some of the weights between the two agents, especially those of the first convolutions.
Second, and this appears to be the major issue, there is what is often referred to as the non-stationarity problem. In our case the two agents evolve in parallel and take their actions sequentially; they then receive the same reward signal. The actions of the second agent impact (via the reward) the learning of the first agent. As a result, the first agent may receive spurious reward signals that originate from its teammate's behavior, since independent Q-learners cannot distinguish a teammate's exploration from stochasticity in the environment.
This can lead to the lazy-agent problem, where only one agent learns a useful policy while the other is discouraged from learning, because its exploration would hinder the first agent and lead to a worse team reward.
We, therefore, need a way to correlate the two Q-learners.
A Deep-RL Architecture for Coop-MARL:
This idea is partly inspired by this paper, but rearranged to leverage the fact that the agents' decisions are made one after another. Our problem belongs to cooperative multi-agent reinforcement learning, in which a set of agents must optimize a single reward signal.
Even though each agent has its own observations and is responsible (through its own Q-function approximator) for its own action, their Q-values are summed into a joint Q-function for training.
In their paper, they introduce the concept of learned additive value decomposition. The main assumption we make and exploit is that the joint action-value function for the system can be additively decomposed into value functions across agents:

Q_tot((s_1, s_2), (a_1, a_2)) ≈ Q_1(s_1, a_1) + Q_2(s_2, a_2)

where each Q_i depends only on that agent's local observations: s_1 is the CP input state given to the variable-selection agent, and s_2 is the CP input state together with the variable x on which we branch, given to the value-selection agent; a_1 is the action of selecting the variable and a_2 is the action of assigning a value to this variable.
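As an illustration, here is a minimal, hypothetical sketch of this decomposition for our two sequential agents, written as PyTorch-style Python. None of the names (SharedEncoder, QHead, q_tot, the dimensions) come from SeaPearl, and the graph convolutions on the tripartite graph are replaced by a plain MLP so the snippet stays self-contained. It also shows the weight sharing of the first layers suggested above, via a common encoder feeding both heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    """Stand-in for the shared first layers (e.g. the graph convolutions on
    the tripartite graph). A plain MLP keeps the sketch self-contained."""
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class QHead(nn.Module):
    """One Q-value head per agent (variable selection, then value selection)."""
    def __init__(self, hidden_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Linear(hidden_dim, n_actions)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

# Hypothetical dimensions, purely for illustration.
encoder    = SharedEncoder(in_dim=16, hidden_dim=32)   # shared "first convolutions"
q_variable = QHead(hidden_dim=32, n_actions=5)         # agent 1: which variable to branch on
q_value    = QHead(hidden_dim=32, n_actions=10)        # agent 2: which value to assign

def q_tot(s1, s2, a1, a2):
    """Additive decomposition: Q_tot = Q_1(s_1, a_1) + Q_2(s_2, a_2).
    s2 is assumed to already encode the variable chosen by agent 1."""
    q1 = q_variable(encoder(s1)).gather(1, a1.unsqueeze(1)).squeeze(1)
    q2 = q_value(encoder(s2)).gather(1, a2.unsqueeze(1)).squeeze(1)
    return q1 + q2

def td_loss(batch, gamma=0.99):
    """One training step on a batch of transitions sharing a single reward.
    Because Q_tot is a sum, the greedy joint action splits into per-agent maxes."""
    s1, s2, a1, a2, r, s1_next, s2_next = batch
    with torch.no_grad():
        target = r + gamma * (q_variable(encoder(s1_next)).max(dim=1).values
                              + q_value(encoder(s2_next)).max(dim=1).values)
    return F.mse_loss(q_tot(s1, s2, a1, a2), target)
```

In this sketch a single loss on Q_tot back-propagates through both heads (and through the shared encoder), which is what correlates the two Q-learners during training.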
The pipeline can be summarized as follows :
One interesting property is that, once co-trained, each agent can be deployed totally independently from the other, as long as each acts greedily with respect to its local Q-value estimator.
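With the hypothetical names from the sketch above, decentralized greedy execution at deployment time would simply look like this:

```python
import torch

# Dummy observations with the dimensions assumed in the sketch above.
s1 = torch.randn(1, 16)   # CP state seen by the variable-selection agent
s2 = torch.randn(1, 16)   # CP state + chosen variable, seen by the value-selection agent

with torch.no_grad():
    variable_action = q_variable(encoder(s1)).argmax(dim=1)  # greedy variable choice
    value_action    = q_value(encoder(s2)).argmax(dim=1)     # greedy value choice
```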
For more details on the theoretical validity of such an approach, feel free to look at the cited paper.
reward signal
Q value update
why have one correlated Q-value?
how to implement it in practice?
limits of such a method?
Highlight the difference regarding the moving target between the correlated and the decorrelated method; change in the Q-learning algorithm: max_a Q_{t+1} (see the sketch below).
Explain why this impacts the first agent but not the second one.
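As a rough sketch of the point above (the notation is mine, reusing the symbols from the decomposition section): with two fully independent Q-learners, each agent bootstraps on its own target while the shared reward already carries the other agent's behavior; with the additive decomposition, there is a single shared target whose joint max splits into per-agent maxes.

```latex
% Independent Q-learners: each agent i bootstraps on its own estimate,
% while the shared reward r already carries the other agent's (possibly
% exploratory) contribution -> a moving, non-stationary target.
y_i = r + \gamma \max_{a_i} Q_i(s_i', a_i)

% Correlated (additive) decomposition: one shared target; the max over the
% joint action splits into per-agent maxes because Q_tot is a sum.
y = r + \gamma \Big( \max_{a_1} Q_1(s_1', a_1) + \max_{a_2} Q_2(s_2', a_2) \Big)
```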