
Code repository implementing classic reinforcement learning methods with the poke_env environment <poke-env.readthedocs.io/en/latest/>.

Install the requirements listed in requirements_poke_env.yml.

Training requires a Pokémon Showdown server (https://play.pokemonshowdown.com) running on localhost.

The problem addressed

Pokémon is a popular Japanese RPG (Role Playing Game) that holds a world championship every year;
A single Pokémon battle has two players. Each player has a team of 6 Pokémon;
Each Pokémon has:
- 6 stats (Health Points, Attack, Defense, Special Attack, Special Defense, Speed). The first 5 are used in the damage calculation. Speed determines which Pokémon moves first in the turn.
  - Health Points go from 100% (healthy) to 0% (fainted);
- 4 possible moves (each with a limited number of uses);
- one ability that has special effects on the field;
- one nature that specifies which stats are higher and which are lower;
- one item, which can restore Health Points or increase the power of an attack.
The winner of the battle is the player who makes all Pokémon of the opposing team faint (all opposing Pokémon with Health Points equal to zero, a "last man standing" criterion);
Only one Pokémon of each team can be on the battlefield at a time;
Every turn, each player selects one action: one of the 4 moves of their active Pokémon, or a switch to one of the other non-fainted Pokémon of their team;
A Pokémon battle can be summarized as a repeated sequence: analyze the state (turn) -> take an action.
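By default, Pokémon is a stochastic game:
- A move can have an accuracy value below 100%, in which case it has a probability of missing;
- Damaging moves (attacks) use the following damage calculation:

$$\text{Damage} = \left( \frac{\left( \frac{2 \times \text{Level}}{5} + 2 \right) \times \text{Power} \times \frac{A}{D}}{50} + 2 \right) \times \text{Targets} \times \text{Weather} \times \text{Badge} \times \text{Critical} \times \text{random} \times \text{STAB} \times \text{Type} \times \text{Burn} \times \text{other}$$

where: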

*Level* is the level of the attacking Pokémon;
*A* is the effective Attack stat of the attacking Pokémon if the used move is a physical move, or the effective Special Attack stat of the attacking Pokémon if the used move is a special move;
*D* is the effective Defense stat of the target if the used move is a physical move or a special move that uses the target's Defense stat, or the effective Special Defense of the target if the used move is another special move;
*Power* is the effective power of the used move;
*Weather* is 1.5 if a Water-type move is being used during rain or a Fire-type move during harsh sunlight, 0.5 if a Water-type move is used during harsh sunlight or a Fire-type move during rain, and 1 otherwise;
*Critical* is 1.5 on a critical hit, which has a 6.25% chance of occurring, and 1 otherwise;
*random* is a random factor between 0.85 and 1.00 (inclusive);
*STAB* is the same-type attack bonus. This is equal to 1.5 if the move's type matches any of the user's types, 2 if the user of the move additionally has the ability Adaptability, and 1 otherwise;
*Type* is the type effectiveness. This can be 0 (ineffective); 0.25 or 0.5 (not very effective); 1 (normally effective); 2 or 4 (super effective), depending on both the move's and the target's types;
*Burn* is 0.5 (from Generation III onward) if the attacker is burned, its Ability is not Guts, and the used move is a physical move (other than Facade from Generation VI onward), and 1 otherwise;
*other* is 1 in most cases, and a different multiplier when specific interactions of moves, Abilities, or items take effect. In this work, it applies only to Pokémon holding the item *Life Orb*, which multiplies the damage by 1.3.
Not used in this work (equal to 1):
- *Targets* (for battles with more than two active Pokémon on the field);
- *Badge* (applied only in Generation II).
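
The damage calculation described above can be sketched in Python as follows. This is a minimal illustration, not the simulator's code: the function and parameter names are hypothetical, the intermediate rounding done by the real game is ignored, and *random* is drawn from a continuous uniform distribution as an approximation. *Targets* and *Badge* are fixed at 1, as in this work.

```python
import random


def damage(level, power, attack, defense, *, weather=1.0, critical=False,
           random_factor=1.0, stab=1.0, type_eff=1.0, burn=1.0, other=1.0):
    """Unrounded damage; Targets and Badge are fixed at 1, as in this work."""
    # Base term: ((2 * Level / 5 + 2) * Power * A / D) / 50 + 2
    base = (2 * level / 5 + 2) * power * attack / defense / 50 + 2
    # A critical hit multiplies the damage by 1.5
    crit = 1.5 if critical else 1.0
    return base * weather * crit * random_factor * stab * type_eff * burn * other


def sample_stochastic_factors(rng):
    """Draw the two random terms of the formula.

    Returns (critical, random_factor): a critical hit with 6.25% probability
    and a factor in [0.85, 1.00] (continuous approximation, an assumption).
    """
    critical = rng.random() < 0.0625
    random_factor = rng.uniform(0.85, 1.00)
    return critical, random_factor
```

For example, `damage(50, 80, 100, 100, stab=1.5, other=1.3)` would model a level-50 attacker with equal Attack and Defense stats using an 80-power same-type move while holding a Life Orb.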

MDP formulation and discretization model

Original (stochastic)

We model the original (stochastic) MDP as a tuple M = (S, A, \phi, R), where:

S is the set of all possible states. A state *s \in S* is defined at each turn by 12 concatenated battle elements:
- [0] Our Active Pokémon index (0: Venusaur, 1: Pikachu, 2: Tauros, 3: Sirfetch'd, 4: Blastoise, 5: Charizard);
- [1] Opponent Active Pokémon index (0: Eevee, 1: Vaporeon, 2: Leafeon, 3: Sylveon, 4: Jolteon, 5: Umbreon);
- [2-5] Active Pokémon moves base power (if a move has no base power, it defaults to -1);
- [6-9] Active Pokémon moves damage multipliers;
- [10] Our remaining Pokémon;
- [11] Opponent remaining Pokémon.
A is the set of all possible actions. Our action space is the integer range [0, 8] (9 actions). An action *a \in A* is one of the following choices:
- [0] 1st Active Pokémon move;
- [1] 2nd Active Pokémon move;
- [2] 3rd Active Pokémon move;
- [3] 4th Active Pokémon move;
- [4] Switch to 1st next Pokémon;
- [5] Switch to 2nd next Pokémon;
- [6] Switch to 3rd next Pokémon;
- [7] Switch to 4th next Pokémon;
- [8] Switch to 5th next Pokémon.

When a selected action cannot be executed, we randomly select another available action.
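
As an illustration of the 12-element state described above, the sketch below packs the per-turn information into a flat tuple. The `TurnInfo` container and `encode_state` function are hypothetical names, not the repository's code. Padding a missing move slot with base power -1 and multiplier 1.0 is an assumption; the source only states that a move without base power defaults to -1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TurnInfo:
    """Hypothetical container for the information observed in one turn."""
    our_active_index: int          # element [0], 0..5
    opp_active_index: int          # element [1], 0..5
    move_base_powers: List[float]  # elements [2-5], -1 for no base power
    move_multipliers: List[float]  # elements [6-9], damage multipliers
    our_remaining: int             # element [10]
    opp_remaining: int             # element [11]


def encode_state(info: TurnInfo) -> Tuple:
    """Concatenate the 12 battle elements into one hashable state."""
    # Pad to exactly four move slots (padding values are an assumption)
    powers = (list(info.move_base_powers) + [-1] * 4)[:4]
    mults = (list(info.move_multipliers) + [1.0] * 4)[:4]
    return (info.our_active_index, info.opp_active_index,
            *powers, *mults,
            info.our_remaining, info.opp_remaining)
```

Returning a tuple keeps the state hashable, so it can be used directly as a dictionary key in tabular methods.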

\phi is a stochastic transition function that takes state *s* to state *s'* when action *a* is taken. The following parameters make the transition stochastic:
- Move accuracy (the probability that the move hits rather than misses);
- Damage calculation: the *Critical* parameter (6.25% chance of occurring) and the *random* parameter, ranging between 0.85 and 1.00 (inclusive).
R is the set of rewards. A reward *r \in R* is obtained in state *s* by taking action *a*. Rewards are calculated at the end of each turn. The value of reward *r* is the sum of the following terms:
- +Our Active Pokémon current Health Points;
- -2 if our Active Pokémon fainted;
- -1 if our Active Pokémon has a negative status condition;
- +Number of remaining Pokémon in our team;
- -Opponent Active Pokémon current Health Points;
- +2 if opponent Active Pokémon fainted;
- +1 if opponent Active Pokémon has a negative status condition;
- -Number of remaining Pokémon in opponent team;
- +15 if we won the battle;
- -15 if we lost the battle.
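
The reward terms above are symmetric between the two sides, which suggests the following sketch. The names `SideStatus`, `side_value` and `turn_reward` are hypothetical, and Health Points are assumed to be expressed as a fraction in [0, 1]; the source does not state the scale.

```python
from dataclasses import dataclass


@dataclass
class SideStatus:
    """Hypothetical end-of-turn summary of one player's side."""
    active_hp: float         # assumed to be a fraction in [0, 1]
    active_fainted: bool
    active_has_status: bool  # negative status condition on the active Pokémon
    remaining: int           # number of remaining Pokémon in the team


def side_value(side: SideStatus) -> float:
    """HP - 2 * fainted - 1 * status + remaining, for one side."""
    return (side.active_hp
            - 2 * side.active_fainted
            - 1 * side.active_has_status
            + side.remaining)


def turn_reward(ours: SideStatus, opponent: SideStatus,
                won: bool = False, lost: bool = False) -> float:
    """Our side's value minus the opponent's, plus the end-of-battle bonus."""
    reward = side_value(ours) - side_value(opponent)
    if won:
        reward += 15
    if lost:
        reward -= 15
    return reward
```

Writing the reward as a difference of two side values makes the sign convention easy to check: every term that is positive for our side appears with the opposite sign for the opponent.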

Stochastic Team

Our stochastic team, including each Pokémon's ability, nature, item, moves (with base power and accuracy) and possible switches, is shown in Team.

Deterministic

To adapt Pokémon to a deterministic environment, we use Pokémon that cannot receive a critical hit, use only moves with 100% accuracy, and edit the server code to ignore the random parameter in the damage calculation. This removes the stochastic transition function \phi from our MDP, which becomes a tuple M = (S, A, R), where:

S is the set of all possible states. A state *s \in S* is defined at each turn by 12 concatenated battle elements:
- [0] Our Active Pokémon index;
- [1] Opponent Active Pokémon index;
- [2-5] Active Pokémon moves base power (if a move has no base power, it defaults to -1);
- [6-9] Active Pokémon moves damage multipliers;
- [10] Our remaining Pokémon;
- [11] Opponent remaining Pokémon.
A is the set of all possible actions. Our action space is the integer range [0, 8] (9 actions). An action *a \in A* is one of the following choices:
- [0] 1st Active Pokémon move;
- [1] 2nd Active Pokémon move;
- [2] 3rd Active Pokémon move;
- [3] 4th Active Pokémon move;
- [4] Switch to 1st next Pokémon;
- [5] Switch to 2nd next Pokémon;
- [6] Switch to 3rd next Pokémon;
- [7] Switch to 4th next Pokémon;
- [8] Switch to 5th next Pokémon.

When a selected action cannot be executed, we randomly select another available action.
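
The fallback rule above can be sketched as follows. The function name `resolve_action` and the idea of passing the set of currently executable action indices are assumptions made for illustration; how the repository obtains the available actions from poke_env is not shown here.

```python
import random

N_ACTIONS = 9  # actions 0-3 are moves, actions 4-8 are switches


def resolve_action(action, available, rng=random):
    """Return `action` if it can be executed, else a random available one.

    `available` is the collection of action indices that are executable in
    the current turn (hypothetical input).
    """
    if not available:
        raise ValueError("no action can be executed in this turn")
    if action in available:
        return action
    # Sort so the choice is reproducible for a seeded generator
    return rng.choice(sorted(available))
```

For example, if the agent picks action 7 (switch to the 4th next Pokémon) while only actions 0, 2 and 4 are executable, `resolve_action(7, {0, 2, 4})` returns one of those three at random.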

R is the set of rewards. A reward *r \in R* is obtained in state *s* by taking action *a*. Rewards are calculated at the end of each turn. The value of reward *r* is the sum of the following terms:
- +Our Active Pokémon current Health Points;
- -2 if our Active Pokémon fainted;
- -1 if our Active Pokémon has a negative status condition;
- +Number of remaining Pokémon in our team;
- -Opponent Active Pokémon current Health Points;
- +2 if opponent Active Pokémon fainted;
- +1 if opponent Active Pokémon has a negative status condition;
- -Number of remaining Pokémon in opponent team;
- +15 if we won the battle;
- -15 if we lost the battle.

Deterministic Team

Our deterministic team, including each Pokémon's ability, nature, item, moves (with base power and accuracy) and possible switches, is shown in Team.

The environments built

The environment used is Pokémon Showdown, an open-source Pokémon battle simulator.

Example of one battle in Pokémon Showdown.

Characteristics of the problem

Both of our environments (stochastic and deterministic) are episodic: each battle is one episode, in which states follow one another turn by turn;
Our terminal states are:
- When all our Pokémon have fainted (we lose);
- When all opponent Pokémon have fainted (we win).
As specified before, a reward *r* is calculated at the end of each turn. The value of reward *r* is the sum of the following terms:
- +Our Active Pokémon current Health Points;
- -2 if our Active Pokémon fainted;
- -1 if our Active Pokémon has a negative status condition;
- +Number of remaining Pokémon in our team;
- -Opponent Active Pokémon current Health Points;
- +2 if opponent Active Pokémon fainted;
- +1 if opponent Active Pokémon has a negative status condition;
- -Number of remaining Pokémon in opponent team;
- +15 if we won the battle;
- -15 if we lost the battle.

Methods implemented

Monte Carlo Control First-Visit;
Function Approximation with Monte Carlo Control First-Visit;
Q-Learning;
Function Approximation with Q-Learning;
SARSA(\lambda);
Function Approximation with SARSA(\lambda).
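
As a reference point for the tabular methods listed above, here is a minimal sketch of the standard Q-Learning update with epsilon-greedy action selection over the 9-action space. It is a textbook formulation, not the repository's implementation; the function names and the default values of `alpha` and `gamma` are assumptions.

```python
import random
from collections import defaultdict

N_ACTIONS = 9


def make_q_table():
    """Q-table mapping a hashable state to a list of 9 action values."""
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = q[state]
    return max(range(N_ACTIONS), key=values.__getitem__)


def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    # At a terminal state there is no bootstrap term
    target = reward if done else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
```

Any hashable state works as a key here, for instance a 12-element tuple of battle elements.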
