You can see the live demo here.
- Quickstart
- Introduction
- Physics Simulation Engines
- Environment
- Algorithms
- Run locally
Explore the project quickly through the following Colab notebooks:
Grasp: Pick-and-place with a robotic hand
- This demo notebook compares the first three algorithms and trains agents on the Grasp environment provided by Brax. At the end, it also shows a trained PPO agent interacting with the environment.
Step-by-step training with PPO
- This notebook shows step-by-step training of a PPO agent on the Grasp environment provided by Brax.
The field of robotics has seen incredible advancements in recent years, with the development of increasingly sophisticated machines capable of performing a wide range of tasks. One area of particular interest is the ability of robots to manipulate objects in their environment, known as grasping. In this project, we focus on a specific grasping task: training a robotic hand to pick up a moving ball and place it at a specific target location using the Brax physics simulation engine.
Grasp: a robotic hand that picks up a moving ball and moves it to a specific target
The reason for choosing this project is twofold. Firstly, the ability for robots to grasp and manipulate objects is a fundamental skill that is crucial for many real-world applications, such as manufacturing, logistics, and service industries. Secondly, the use of a physics simulation engine allows us to train our robotic hand in a realistic and controlled environment, without the need for expensive hardware and the associated costs and safety concerns.
Reinforcement learning is a powerful tool for training robots to perform complex tasks, as it allows the robot to learn through trial and error. In this project, we will be using reinforcement learning techniques to train our robotic hand, and we hope to demonstrate the effectiveness of this approach in solving the grasping task.
The use of a physics simulation engine is essential for training a robotic hand to perform the grasping task, as it allows us to simulate the real-world physical interactions between the robot and the ball. Without a physics simulation engine, it would be difficult to accurately model the dynamics of the task, including the forces and torques required for the robotic hand to pick up the ball and move it to the target location.
In this project, we explored several different physics simulation engines, including:
Each of these engines has its own strengths and weaknesses, and we carefully considered the trade-offs between them before making a final decision.
Ultimately, we chose Brax for its highly scalable and parallelizable architecture, which makes it well suited to accelerated hardware (XLA backends such as GPUs and TPUs). This lets us simulate the grasping task at a high level of realism and detail while using the computational power of modern hardware to speed up training.
The grasping environment provided by Brax is a simple pick-and-place task, where a 4-fingered claw hand must pick up a ball and move it to a target location. The environment is designed to simulate the physical interactions between the robotic hand and the ball, including the forces and torques required for the hand to grasp the ball and move it to the target location.
The hand is able to pick up the ball and carry it to a series of red targets. Once the ball gets close to the red target, the target is respawned at a different random location.
In the environment, the robotic hand is represented by a 4-fingered claw, which is capable of opening and closing to grasp the ball. The ball is placed in a random location at the beginning of each episode, and the target location is also randomly chosen. The goal of the robotic hand is to move the ball to the target location as quickly and efficiently as possible. For more details, check 4.2.2.
The environment observes three main bodies: the Hand, the Object, and the Target. The agent uses these observations to learn how to control the robotic hand and move the object to the target location.
- The Hand observation includes information about the state of the robotic hand, such as the position and orientation of the fingers, the joint angles, and the forces and torques applied to the hand. The agent uses this information to control the hand and pick up the object.
- The Object observation includes information about the state of the object, such as its position, velocity, and orientation. The agent uses this information to track the object and move it to the target location.
- The Target observation includes information about the target location, such as its position and orientation. The agent uses this information to navigate the hand and the object to the target location.
When the object reaches the target location, the agent is rewarded. The agent is also given a penalty if the object falls or if the hand collides with any obstacle. The agent's goal is to maximize the reward, which means reaching the target location as quickly and efficiently as possible.
Overall, the observations provided by the Grasp environment are designed to give the agent the information it needs to learn how to control the robotic hand and move the object to the target location. The combination of the Hand, Object, and Target observations allows the agent to learn from the environment and improve its performance over time.
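As a rough illustration of how the three observation groups could be combined into a single input vector for the agent, the sketch below concatenates them in a fixed order. The function name, field names, and sizes are hypothetical and do not reflect the actual observation layout used by Brax.

```python
from typing import Dict, List


def flatten_observation(groups: Dict[str, List[float]]) -> List[float]:
    """Concatenate the Hand, Object, and Target groups in a fixed order."""
    order = ("hand", "object", "target")  # a fixed order keeps the layout stable
    flat: List[float] = []
    for name in order:
        flat.extend(groups[name])
    return flat


# Hypothetical example values, chosen only for illustration.
example = {
    "hand": [0.0, 0.1, 0.2],     # e.g. a few joint angles
    "object": [1.0, 0.5, 0.3],   # e.g. ball position
    "target": [2.0, -1.0, 0.3],  # e.g. target position
}
obs = flatten_observation(example)
```

A policy network would then consume `obs` as one flat vector, regardless of how many values each group contributes.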
The action has 19 dimensions, covering the hand's position and the joints' angles. Each dimension is a continuous value normalized to the range [-1, 1].
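The normalization can be pictured with a small sketch: a raw policy output is clipped into [-1, 1], and a normalized value can be mapped linearly onto a physical range. The helper names and the example range are assumptions for illustration only; Brax handles actuator scaling internally.

```python
from typing import List

ACTION_SIZE = 19  # number of action dimensions stated above


def clip_action(raw: List[float]) -> List[float]:
    """Clamp every action dimension into the normalized range [-1, 1]."""
    if len(raw) != ACTION_SIZE:
        raise ValueError(f"expected {ACTION_SIZE} values, got {len(raw)}")
    return [max(-1.0, min(1.0, a)) for a in raw]


def scale_to_range(action: List[float], low: float, high: float) -> List[float]:
    """Linearly map normalized values in [-1, 1] onto [low, high]."""
    return [low + (a + 1.0) * 0.5 * (high - low) for a in action]


clipped = clip_action([2.0] * ACTION_SIZE)           # out-of-range values are clamped
scaled = scale_to_range([-1.0, 0.0, 1.0], 0.0, 10.0)  # hypothetical physical range
```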
The reward function is built from the following terms:
- moving to object: small reward for moving towards the object.
- close to object: small reward for being close to the object.
- touching object: small reward for touching the object.
- target hit: high reward for hitting the target (max. reward).

Each small step toward completing the task is rewarded, while the target hit earns the largest reward.
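To make the shaping idea concrete, here is a minimal sketch that assumes the terms are simply summed, with a larger weight on the target hit. The function name and the weight value are hypothetical; the exact coefficients used by the environment are not stated in this section.

```python
def shaped_reward(
    moving_to_object: float,
    close_to_object: float,
    touching_object: float,
    target_hit: float,
    target_hit_weight: float = 5.0,  # hypothetical weight, for illustration only
) -> float:
    """Sum the small shaping terms and add a heavily weighted target-hit term."""
    shaping = moving_to_object + close_to_object + touching_object
    return shaping + target_hit_weight * target_hit


approach_only = shaped_reward(0.1, 0.1, 0.1, 0.0)  # agent is near the ball, no hit yet
with_hit = shaped_reward(0.1, 0.1, 0.1, 1.0)       # same state, plus a target hit
```

Under these assumptions the target hit dominates the total, which matches the description that it earns the largest reward.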
We use Brax's optimized implementations of four algorithms: PPO, ES, ARS, and SAC.
Proximal Policy Optimization (PPO) is a model-free, on-policy policy gradient reinforcement learning algorithm developed at OpenAI in 2017. PPO strikes a balance between ease of implementation, sample complexity, and ease of tuning: at each step it tries to compute an update that minimizes the cost function while keeping the deviation from the previous policy relatively small. Generally speaking, it can be seen as a clipped version of the A2C algorithm.
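The "small deviation from the previous policy" idea is usually expressed through a clipped surrogate objective. The sketch below is a plain-Python illustration of that objective, not Brax's implementation; the function name and the default `clip_eps` are assumptions.

```python
import math
from typing import List


def ppo_clip_objective(
    new_log_probs: List[float],
    old_log_probs: List[float],
    advantages: List[float],
    clip_eps: float = 0.2,  # commonly used value, assumed here
) -> float:
    """Mean clipped surrogate objective (to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio
        # outside the interval [1 - clip_eps, 1 + clip_eps].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


unchanged_policy = ppo_clip_objective([0.0], [0.0], [2.0])  # ratio is exactly 1
large_step = ppo_clip_objective([1.0], [0.0], [1.0])        # ratio e, clipped to 1.2
```

With a positive advantage, the gain from increasing an action's probability is capped once the ratio exceeds `1 + clip_eps`, which is what keeps each update close to the previous policy.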
Evolution Strategy (ES) is a powerful black-box optimization technique inspired by natural evolution. A population of random noise perturbations is applied to the network parameters, and the highest-scoring parameter vectors are chosen to evolve the network. This differs from methods that backpropagate a loss function through the network. ES can be parallelized on XLA backends (CPU/GPU/TPU) to speed up training.
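A minimal sketch of one such generation is shown below, assuming a simple variant that averages the best-scoring perturbed parameter vectors. It is not the Brax implementation; `es_step`, `toy_fitness`, and all hyperparameters are hypothetical.

```python
import random
from typing import Callable, List


def es_step(
    params: List[float],
    fitness: Callable[[List[float]], float],
    rng: random.Random,
    population: int = 20,
    sigma: float = 0.1,
    top_k: int = 5,
) -> List[float]:
    """Perturb the parameters with Gaussian noise and average the best candidates."""
    scored = []
    for _ in range(population):
        candidate = [p + sigma * rng.gauss(0.0, 1.0) for p in params]
        scored.append((fitness(candidate), candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    elite = [candidate for _, candidate in scored[:top_k]]
    return [sum(values) / len(elite) for values in zip(*elite)]


def toy_fitness(x: List[float]) -> float:
    """Hypothetical objective with its maximum at [3, 3]."""
    return -sum((xi - 3.0) ** 2 for xi in x)


rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(200):
    theta = es_step(theta, toy_fitness, rng)
```

Each candidate's fitness can be evaluated independently of the others, which is why this family of methods parallelizes well.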
Augmented Random Search (ARS) is a random search method for training linear policies for continuous control problems. It operates directly on the policy weights: in each epoch, the agent perturbs its current policy N times and collects 2N rollouts using the modified policies. The rewards from these rollouts are used to update the current policy weights, and the process repeats until completion. The algorithm is known to have high variance, and not all seeds obtain high rewards, but to our knowledge the original ARS work in many ways represents the state of the art on the benchmarks it targets.
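The perturb-and-update loop can be sketched as follows. This is a simplified version that assumes a basic update (every direction is used, and there is no observation normalization), so it is not the full ARS algorithm or the Brax implementation; `ars_step`, `toy_return`, and the hyperparameters are hypothetical.

```python
import random
import statistics
from typing import Callable, List


def ars_step(
    theta: List[float],
    rollout_return: Callable[[List[float]], float],
    rng: random.Random,
    n_directions: int = 8,
    step_size: float = 0.05,
    noise: float = 0.1,
) -> List[float]:
    """One update from N random directions, evaluated with 2N rollouts."""
    deltas, plus_returns, minus_returns = [], [], []
    for _ in range(n_directions):
        delta = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = [t + noise * d for t, d in zip(theta, delta)]
        minus = [t - noise * d for t, d in zip(theta, delta)]
        deltas.append(delta)
        plus_returns.append(rollout_return(plus))
        minus_returns.append(rollout_return(minus))
    # Scale the step by the spread of the collected returns (1.0 if they are all equal).
    sigma_r = statistics.pstdev(plus_returns + minus_returns) or 1.0
    new_theta = list(theta)
    for delta, r_plus, r_minus in zip(deltas, plus_returns, minus_returns):
        scale = step_size * (r_plus - r_minus) / (n_directions * sigma_r)
        new_theta = [t + scale * d for t, d in zip(new_theta, delta)]
    return new_theta


def toy_return(weights: List[float]) -> float:
    """Hypothetical stand-in for an episode return, maximal at [1, 1, 1]."""
    return -sum((w - 1.0) ** 2 for w in weights)


rng = random.Random(0)
weights = [0.0, 0.0, 0.0]
for _ in range(300):
    weights = ars_step(weights, toy_return, rng)
```

In a real setting, `rollout_return` would run a full episode with the perturbed linear policy and return its total reward.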
Soft Actor-Critic (SAC) is an off-policy, model-free reinforcement learning framework. The actor aims to maximize expected reward while also maximizing entropy, that is, to succeed at the task while acting as randomly as possible, which is why it is called soft. Because it is off-policy and can reuse past experience, SAC typically has better sample efficiency than PPO.
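The entropy bonus can be illustrated with a toy calculation. The sketch below uses a discrete action distribution for simplicity, whereas SAC itself targets continuous actions; the function name and the temperature `alpha` are assumptions.

```python
import math
from typing import List


def soft_value(q_values: List[float], probs: List[float], alpha: float = 0.2) -> float:
    """Expected Q-value plus alpha times the policy entropy, for a discrete policy."""
    return sum(
        p * (q - alpha * math.log(p))
        for q, p in zip(q_values, probs)
        if p > 0.0  # zero-probability actions contribute nothing
    )


q = [1.0, 1.0]                               # two equally good actions
deterministic = soft_value(q, [1.0, 0.0])    # always picks the first action
uniform = soft_value(q, [0.5, 0.5])          # spreads probability evenly
```

When both actions are equally good, the uniform policy scores higher because of its entropy term, which is the sense in which SAC prefers to act as randomly as the task allows.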
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Clone the repository
git clone https://github.com/mohammadzainabbas/Reinforcement-Learning-CS.git
cd Reinforcement-Learning-CS/
- Create a new environment and install all dependencies
First, install mamba, a fast and efficient package manager for conda.
conda install mamba -n base -c conda-forge
Then create a new environment with all dependencies and activate it.
mamba env create -n reinforcement_learning -f docs/config/reinforcement_learning_env.yaml
conda activate reinforcement_learning
- Run the code
train_ppo.py - train the reinforcement learning agent using the PPO algorithm:
python src/train_ppo.py
You will get the following output files:
- ppo_training.png - Training progress plot
- result_with_ppo.html - Simulation of the trained agent (in HTML format)
- ppo_params - Trained parameters of the agent
train_sac.py - train the reinforcement learning agent using the SAC algorithm:
python src/train_sac.py
You will get the same output files as for the PPO algorithm.
generate_results.py - generate the results of the trained PPO agent:
python src/generate_results.py
You can see the live output here.
ppo_with_pytorch.py - implementation of the PPO algorithm with PyTorch.