The multi-armed bandit problem is one of the classic reinforcement learning problems; it captures the tension an agent faces between exploration and exploitation.
Thompson sampling is an algorithm for online decision problems in which actions are taken sequentially and each action must balance two goals: exploiting what is already known to maximize immediate performance, and exploring to gather new information that may improve future performance.
More details can be found in this paper.
https://www.youtube.com/watch?v=I0XmHQJPaVM
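As a rough illustration of the idea, here is a minimal sketch of Thompson sampling on a simulated Bernoulli bandit. It is not the code in ts_bandits.py: the function name thompson_sampling, its parameters, and the uniform Beta(1, 1) prior on each arm are assumptions made for this example. On every round it draws one sample from each arm's Beta posterior, plays the arm with the largest sample, and updates that arm's success and failure counts.

```python
import random


def thompson_sampling(true_probs, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson sampling on a simulated bandit (illustrative only).

    true_probs: hypothetical success probability of each arm, hidden from the agent.
    Returns the number of pulls per arm and the total reward collected.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    total_reward = 0

    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its Beta posterior.
        # The +1 terms encode the assumed uniform Beta(1, 1) prior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose sampled rate is highest.
        arm = max(range(n_arms), key=samples.__getitem__)

        # Simulate a Bernoulli reward from the environment.
        reward = 1 if rng.random() < true_probs[arm] else 0

        # Update the posterior counts of the arm that was played.
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward

    pulls = [s + f for s, f in zip(successes, failures)]
    return pulls, total_reward
```

Calling thompson_sampling([0.2, 0.5, 0.8]) returns the pull count of each arm and the total reward. Arms with few observations have wide posteriors and are still sampled occasionally, which provides exploration, while arms with strong evidence of high reward are chosen most often, which provides exploitation.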
You can install the required Python packages using the following command:
pipenv sync
You can train the agent using the following command:
pipenv run python ts_bandits.py