Code repository with classic reinforcement learning methods using poke_env environment <poke-env.readthedocs.io/en/latest/>.
It is necessary to install the requirements available at requirements_poke_env.yml.
For training, it is necessary to run Pokémon Showdown https://play.pokemonshowdown.com on localhost.
-
Pokémon is a popular Japanese RPG (Role Playing Game) which stands a world championship every year;
-
One single battle of Pokémon has two players. Each player has a 6-Pokémon team;
-
Each Pokémon has:
- 6 stats (Health Points, Attack, Defense, Special Attack, Special Defense, Speed). The first 5 are used in the damage calculation. The speed defined which Pokémon moves first in the turn.
- The Health Points goes from 100% (healthy) to 0% (fainted);
- 4 possible moves (each with a limited number of uses);
- one ability that has special effects in the field;
- one nature that specifies which stats are higher and which are lower;
- one item, that can restore Health Points or increase the Power of an Attack.
- 6 stats (Health Points, Attack, Defense, Special Attack, Special Defense, Speed). The first 5 are used in the damage calculation. The speed defined which Pokémon moves first in the turn.
-
The winner of the battle is the player that makes all Pokémon of the oposing team to faint (all oposing Pokémon with health points equals zero, "last man standing" criteria);
-
Only one Pokémon of each team can be at the battle field at the same time;
-
Every turn, each players select one action: one of the 4 moves of their active Pokémon or switching for one of other non-fainted Pokémon of their team;
-
Pokémon can be summarized as an analyze state (turn) -> take action sequence game.
-
By standard, Pokémon is a stochastic game:
- One move can have an accuracy value less than 100%, then this move has a probability to be missed;
- The damage moves (attacks) have the following damage calculation:
where:
-
*Level * (the level of the attacking Pokémon);
-
*A * is the effective Attack stat of the attacking Pokémon if the used move is a physical move, or the effective Special Attack stat of the attacking Pokémon if the used move is a special move;
-
*D * is the effective Defense stat of the target if the used move is a physical move or a special move that uses the target's Defense stat, or the effective Special Defense of the target if the used move is an other special move;
-
*Power * is the effective power of the used move;
-
*Weather * is 1.5 if a Water-type move is being used during rain or a Fire-type move during harsh sunlight, and 0.5 if a Water-type move is used during harsh sunlight or a Fire-type move during rain, and 1 otherwise.
-
*Critical * has 6.25% chance of occurs and multiplies the damage by 1.5;
-
*random * is a random factor between 0.85 and 1.00 (inclusive):
-
*STAB * is the same-type attack bonus. This is equal to 1.5 if the move's type matches any of the user's types, 2 if the user of the move additionally has the ability Adaptability, and 1 if otherwise;
-
*Type * is the type effectiveness. This can be 0 (ineffective); 0.25, 0.5 (not very effective); 1 (normally effective); 2, or 4 (super effective), depending on both the move's and target's types;
-
*Burn * is 0.5 (from Generation III onward) if the attacker is burned, its Ability is not Guts, and the used move is a physical move (other than Facade from Generation VI onward), and 1 otherwise.
-
*other * is 1 in most cases, and a different multiplier when specific interactions of moves, Abilities, or items take effect. In this work, this is applied just to Pokémon that has the item *Life Orb *, which multiplies the damage by 1.3.
-
Not used in this work (equals 1):
- Targets (for Battles with more than two active Pokémon in the field);
- Badge ( just applied in Generation II);
We considered our original (stochastic) MDP as a tuple M = (S, A, \phi, R), where:
-
S is the whole set of possible states. One state *s in * is defined at each turn with 12 battle elements concatenated, that correspond to:
- [0] Our Active Pokémon index (0: Venusaur, 1: Pikachu, 2: Tauros, 3: Sirfetch'd, 4: Blastoise, 5: Charizard);
- [1] Opponent Active Pokémon index (0: Eevee, 1: Vaporeon, 2: Leafeon, 3: Sylveon, 4: Jolteon, 5: Umbreon);
- [2-5] Active Pokémon moves base power (if a move doesn't have base power, default to -1);
- [6-9] Active Pokémon moves damage multipliers;
- [10] Our remaining Pokémon;
- [11] Opponent remaining Pokémon.
-
A is the whole set of possible actions. Our action space is a range [0, 8]. One action *a \in A * is one of the possible choices:
- [0] 1st Active Pokémon move;
- [1] 2nd Active Pokémon move;
- [2] 3rd Active Pokémon move;
- [3] 4th Active Pokémon move;
- [4] Switch to 1st next Pokémon;
- [5] Switch to 2nd next Pokémon;
- [6] Switch to 3rd next Pokémon;
- [7] Switch to 4th next Pokémon;
- [8] Switch to 5th next Pokémon.
When a selected action cannot be executed, we random select another possible action.
-
\phi is a stochastic transition function that occurs from state *s * to state *s' *, by taking an action *a *. The following parameters are part of our stochastic transition function:
- Move's accuracy (chance of the move successfully occurs or to fail);
- Damage calculation: The *Critical * parameter (6.25% chance of occurs) and the *random * parameter, ranging from 0.85 and 1.00 (inclusive).
-
R is a set of rewards. A reward *r \in R * is acquired in state *s * by taking the action *a *. The rewards are calculated at the end of the turn. The value of reward *r * is defined by:
- +Our Active Pokémon current Health Points;
- -2 if our Active Pokémon fainted;
- -1 if our Active Pokémon have a negative status condition;
- +Number of remaining Pokémon in our team;
- -Opponent Active Pokémon current Health Points;
- +2 if opponent Active Pokémon fainted;
- +1 if opponent Active Pokémon have a negative status condition;
- -Number of remaining Pokémon in opponent team;
- +15 if we won the battle;
- -15 if we lost the battle.
Our stochastic team, with each Pokémon, their abilities, natures, items, moves (with base power and accuracy) and possible switches are shown in Team.
To adapt Pokémon to a deterministic environment, we use Pokémon that cannot receive a critical hit, moves with only 100% accuracy and edit the server code to ignore the random parameter in damage calculation, removing the stochastic transition function \phi from our MDP. Therefore, now our MDP is a tuple M = (S, A, R), where:
-
S is the whole set of possible states. One state *s in * is defined at each turn with 12 battle elements concatenated, that correspond to:
- [0] Our Active Pokémon index ;
- [1] Opponent Active Pokémon index ;
- [2-5] Active Pokémon moves base power (if a move doesn't have base power, default to -1);
- [6-9] Active Pokémon moves damage multipliers;
- [10] Our remaining Pokémon;
- [11] Opponent remaining Pokémon.
-
A is the whole set of possible actions. Our action space is a range [0, 8] (len: 9). One action *a \in A * is one of the possible choices:
- [0] 1st Active Pokémon move;
- [1] 2nd Active Pokémon move;
- [2] 3rd Active Pokémon move;
- [3] 4th Active Pokémon move;
- [4] Switch to 1st next Pokémon;
- [5] Switch to 2nd next Pokémon;
- [6] Switch to 3rd next Pokémon;
- [7] Switch to 4th next Pokémon;
- [8] Switch to 5th next Pokémon.
When a selected action cannot be executed, we random select another possible action.
- R is a set of rewards. A reward *r \in R * is acquired in state *s * by taking the action *a *. The rewards are calculated at the end of each turn. The value of reward *r * is defined by:
- +Our Active Pokémon current Health Points;
- -2 if our Active Pokémon fainted;
- -1 if our Active Pokémon have a negative status condition;
- +Number of remaining Pokémon in our team;
- -Opponent Active Pokémon current Health Points;
- +2 if opponent Active Pokémon fainted;
- +1 if opponent Active Pokémon have a negative status condition;
- -Number of remaining Pokémon in opponent team;
- +15 if we won the battle;
- -15 if we lost the battle.
Our deterministic team, with each Pokémon, their abilities, natures, items, moves (with base power and accuracy) and possible switches are shown in Team.
The environment used is Pokémon Showdown, a open-source Pokémon battle simulator.
Example of one battle in Pokémon Showdown.
-
Both of our environments (stochastic and deterministic) are episodic. One state occurs after another;
-
Our terminal states are:
- When all our Pokémon are fainted (we lose);
- When all opponent Pokémon are fainted (we won).
-
As specified before, a reward *r * is calculated at the end of a turn. The value of reward *r * is defined by:
- +Our Active Pokémon current Health Points;
- -2 if our Active Pokémon fainted;
- -1 if our Active Pokémon have a negative status condition;
- +Number of remaining Pokémon in our team;
- -Opponent Active Pokémon current Health Points;
- +2 if opponent Active Pokémon fainted;
- +1 if opponent Active Pokémon have a negative status condition;
- -Number of remaining Pokémon in opponent team;
- +15 if we won the battle;
- -15 if we lost the battle.
- Monte Carlo Control First-Visit;
- Function Approximation with Monte Carlo Control First-Visit;
- Q-Learning;
- Function Approximation with Q-Learning;
- SARSA(\lambda)
- Function Approximation with SARSA(\lambda)