
The theory behind a Smart double Variable / Value Heuristic #232

Open

3rdCore opened this issue May 20, 2022 · 0 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), heuristic

3rdCore commented May 20, 2022

So far, SeaPearl is only able to learn a value heuristic from previous experience: given a Constraint Programming problem at a certain step of the solving process and a variable to branch on, our algorithm returns what it considers to be the best value in that variable's domain of definition.

However, a good constraint programming solver is the combination of two heuristics, whose interaction in a single branching step is sketched just after this list:

  • a Variable heuristic: decides which variable to branch on next. This has not been implemented in SeaPearl yet.
  • a Value heuristic: decides, given a variable, which value to assign to it. This has already been implemented in SeaPearl.
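
As a point of reference, here is a minimal pseudocode-style sketch of how the two heuristics cooperate inside one branching step. All names (`branching_step`, `variable_heuristic`, `value_heuristic`, `model`) are hypothetical placeholders, not SeaPearl's actual API:

```python
# Hypothetical sketch of a single CP branching step (not SeaPearl's API).

def branching_step(model, variable_heuristic, value_heuristic):
    """Pick an unbound variable, then a value from its domain, and return the decision."""
    x = variable_heuristic(model)        # 1) variable heuristic: which variable to branch on
    if x is None:                        # no unbound variable left: nothing to branch on
        return None
    v = value_heuristic(model, x)        # 2) value heuristic: which value of x to try first
    return (x, v)                        # the solver then posts x = v and runs propagation
```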

The next big step for SeaPearl is to learn both heuristics, using a specific agent for each task. This is a case of sequential multi-agent reinforcement learning, or SMARL.

But why not simply implement a nearly identical pipeline for variable selection?

Indeed, why couldn't we have two pipelines, i.e. two agents evolving in parallel, taking the same tripartite graph as input and inferring respectively the variable to branch on and the value assigned to it? After each assignment, a fix-point is computed, together with the associated reward, which is identical for both agents.

Two observations arise from this question:

  • First, the two tasks appear to be relatively close to each other, so it seems interesting, at least from a computational point of view, to try to intertwine the two pipelines. We could share some of the weights between the two agents, especially perhaps those of the first convolutions (a toy sketch of this sharing follows the list).
  • Secondly, and this appears to be the major issue, we face what is often referred to as the non-stationarity problem. In our case the two agents evolve in parallel and take their actions sequentially, then receive the same reward signal. The actions of the second agent impact (via the reward) the learning of the first agent, so the first agent may receive spurious reward signals that originate from its teammate's behavior. This comes from the fact that independent Q-learners cannot distinguish a teammate's exploration from stochasticity in the environment.
    This can lead to the lazy agent problem, where only one agent learns a useful policy while the other is discouraged from learning, because its exploration would hinder the first agent and lead to a worse team reward.
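
To make the weight-sharing idea in the first observation concrete, here is a toy NumPy sketch in which a shared linear layer stands in for the first graph convolutions and each agent keeps its own Q-head. Shapes and names are made up for illustration:

```python
import numpy as np

# Toy sketch of weight sharing between the two agents (illustrative only).
rng = np.random.default_rng(0)

W_shared = rng.normal(size=(16, 32))     # stands in for the shared first graph convolutions
W_var    = rng.normal(size=(32, 5))      # Q-head of the variable-selection agent (5 candidate variables)
W_val    = rng.normal(size=(32, 10))     # Q-head of the value-selection agent (10 candidate values)

def encode(features):
    """Shared embedding of the tripartite-graph features (linear + ReLU stand-in)."""
    return np.maximum(features @ W_shared, 0.0)

def q_variable(features):
    return encode(features) @ W_var          # Q-values over candidate variables

def q_value(features_with_x):
    return encode(features_with_x) @ W_val   # Q-values over candidate values, given the chosen variable

# Both heads backpropagate through W_shared during co-training,
# which is where the computational interest of intertwining the pipelines lies.
state = rng.normal(size=16)
print(q_variable(state).shape, q_value(state).shape)   # (5,) (10,)
```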

We therefore need a way to correlate the two Q-learners.

A Deep-RL Architecture for Coop-MARL:

This idea is partly inspired by this paper but rearranged to leverage the fact that the agents' decisions are made one after another. Our problem belongs to cooperative multi-agent reinforcement learning, in which a set of agents must optimize a single reward signal.

Even though each agent has its own individual observations and is responsible (through its own Q-value approximator) for its own action, their Q-values are summed into a joint Q-function for training.

In their paper, they introduce the concept of learned additive value decomposition. The main assumption we make and exploit is that the joint action-value function for the system can be additively decomposed into value functions across agents:

$$Q\big((s_1, s_2),\, (a_1, a_2)\big) \;\approx\; Q_1(s_1, a_1) + Q_2(s_2, a_2)$$

where each $Q_i$ depends only on its agent's local observations: $s_1$ is the CP input state given to the variable-selection agent, $s_2$ is the CP input state together with the variable $x$ on which we branch, given to the value-selection agent; $a_1$ is the action of selecting the variable and $a_2$ is the action of assigning a value to this variable.
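
As an illustration of how this decomposition is used during training, here is a toy NumPy sketch of one joint temporal-difference update for a single transition. The numbers and shapes are made up; this is a sketch of the idea, not the SeaPearl pipeline:

```python
import numpy as np

# Toy sketch of one joint TD update for a single transition (illustrative only).
gamma = 0.9
reward = -1.0                              # shared reward, e.g. computed after the fix-point

q1 = np.array([0.2, 0.7, 0.1])             # Q_1(s_1, .) : variable-selection agent
q2 = np.array([0.5, 0.3, 0.9, 0.4])        # Q_2(s_2, .) : value-selection agent
a1, a2 = 1, 2                              # actions actually taken at this step

q1_next = np.array([0.6, 0.2, 0.3])        # Q_1(s_1', .) at the next decision point
q2_next = np.array([0.1, 0.8, 0.4, 0.0])   # Q_2(s_2', .)

# Additive decomposition: joint Q-value of the pair of actions that was taken.
q_joint = q1[a1] + q2[a2]

# Joint target: the max over joint actions splits into a max per head,
# because each head only depends on its own action.
td_target = reward + gamma * (q1_next.max() + q2_next.max())

td_error = td_target - q_joint             # one error signal trains both agents
print(q_joint, td_target, td_error)
```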

The pipeline can be summarized as follows:

*(figure: summary diagram of the two-agent pipeline)*

One interesting property is that, once co-trained, each agent can be deployed completely independently of the other, as long as each agent acts greedily with respect to its local Q-value estimator.
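
A minimal sketch of what this independent deployment looks like, with toy Q-values in place of the trained estimators:

```python
import numpy as np

# Deployment sketch: once co-trained, each agent acts greedily on its own head,
# with no summation or communication between the two agents needed any more.
def greedy(q_values: np.ndarray) -> int:
    return int(np.argmax(q_values))

q_var_head = np.array([0.6, 0.2, 0.3])        # Q_1(s_1, .) over candidate variables
chosen_variable = greedy(q_var_head)          # the variable-selection agent acts first

q_val_head = np.array([0.1, 0.8, 0.4, 0.0])   # Q_2(s_2, .) for the chosen variable
chosen_value = greedy(q_val_head)             # the value-selection agent acts second

print(chosen_variable, chosen_value)          # the decision (x, v) handed back to the solver
```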

For more details on the theoretical validity of such an approach, feel free to look at the cited paper.

$$Q_1(s_1, a_1) + Q_2(s_2, a_2) \;\leftarrow\; r + \gamma \max_{a_1',\, a_2'} \big[ Q_1(s_1', a_1') + Q_2(s_2', a_2') \big]$$

Points still to be detailed:

  • reward signal
  • Q-value update
  • why have one correlated Q-value?
  • how to implement it in practice?
  • limits of such a method?

  • Highlight the difference in the moving target between the correlated and the decorrelated methods; this changes the Q-learning algorithm at the max_a(Q_(t+1)) step (see the comparison below).
  • Explain why this impacts the first agent but not the second.
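
As a starting point for that comparison, the two moving targets can be written side by side (notation from the decomposition above; this is the intended contrast, to be checked against the cited paper).

Decorrelated (independent) Q-learners, each bootstrapping on its own head only:

$$y_1 = r + \gamma \max_{a_1'} Q_1(s_1', a_1'), \qquad y_2 = r + \gamma \max_{a_2'} Q_2(s_2', a_2')$$

Correlated (summed) Q-learners, a single target for the joint value:

$$y = r + \gamma \max_{a_1',\, a_2'} \big[ Q_1(s_1', a_1') + Q_2(s_2', a_2') \big]$$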