Buffer is checkpointed by default
belerico committed May 9, 2024
1 parent 9e4cce2 commit 5ec76bc
Showing 6 changed files with 8 additions and 8 deletions.
6 changes: 3 additions & 3 deletions howto/logs_and_checkpoints.md
@@ -195,11 +195,11 @@ size: ???
memmap: True
validate_args: False
from_numpy: False
- checkpoint: False # Used only for off-policy algorithms
+ checkpoint: True # Used only for off-policy algorithms
```

There can be a few scenarios to pay attention to:

* If the buffer is memory-mapped (i.e. `buffer.memmap=True`) and the buffer is saved in the checkpoint, then one **mustn't delete the buffer folder** of the stopped experiment: when the buffer is memory-mapped, a file is created on disk for every key stored in the replay buffer (e.g. `observations.memmap`, `rewards.memmap`), and when the experiment is resumed those files are read back from the exact same location
- * If the buffer is memory-mapped (i.e. `buffer.memmap=True`), the buffer is saved in the checkpoint, and the buffer had been filled completely during the previous experiment (meaning that the oldest trajectories have been overwritten by newer ones), then the agent may end up being trained on "future" trajectories coming from a "future" policy. To be more precise, the buffer is simply a pre-allocated numpy array with an attribute `pos` that points to the first free slot to be written; if we are using a `sheeprl.data.buffers.SequentialReplayBuffer` we sample sequences starting in `[0, pos - sequence_length) ∪ [pos, buffer_size)` or simply in `[0, pos - sequence_length)`, depending on whether the buffer has been filled or not respectively. When we save the buffer into the checkpoint we save all the relevant information regarding it (the `pos` attribute and the path to the memory-mapped files, which represent the buffer content to be retrieved upon resuming). Suppose we saved a checkpoint at step `N` and the experiment ran for another `K < N` steps before stopping, with a buffer that had already been filled at least once. When we resume, the buffer is loaded from the checkpoint, so the `pos` attribute points to the same position it pointed to at step `N`, and because the buffer is memory-mapped we find in `[pos, pos + K]` a bunch of trajectories that come from a "future" policy: the one we were training in the previous experiment before it stopped! Currently we don't know whether this can cause problems for the agent, nor have we found a nice solution to mitigate it. We have thought of a few ways to solve it: one is to memory-map the buffer metadata, such as the current `pos`; in this way, when we load the buffer from the checkpoint we can remove all the unwanted trajectories in `[old_pos, current_pos]`, although this could erase a lot of the buffer content if, for example, one has a checkpoint at step `N` and the experiment stopped at step `2N - 1`. Another solution could be to employ an online queue where trajectories are stored temporarily and flushed to the replay buffer only upon checkpointing; the problem is that a lot of data has to be kept in memory, and RAM could easily explode when working with images (this can be avoided by also memory-mapping the online queue). Practically, another possible solution is to set `algo.learning_starts=K` from the CLI or in the algorithm section of the experiment config: in this way all the future trajectories will be overwritten by random samples collected by the resumed agent.
- * In any case, when the checkpoint is resumed the buffer **could potentially be pre-filled for `algo.learning_starts` steps** with random actions sampled by the resumed agent. If you don't want to pre-fill the buffer, set `algo.learning_starts=0`
+ * If the buffer is memory-mapped (i.e. `buffer.memmap=True`), the buffer is saved in the checkpoint, and the buffer had been filled completely during the previous experiment (meaning that the oldest trajectories have been overwritten by newer ones), then the agent may end up being trained on "future" trajectories coming from a "future" policy. To be more precise, the buffer is simply a pre-allocated numpy array with an attribute `pos` that points to the first free slot to be written; if we are using a `sheeprl.data.buffers.SequentialReplayBuffer` we sample sequences starting in `[0, pos - sequence_length) ∪ [pos, buffer_size)` or simply in `[0, pos - sequence_length)`, depending on whether the buffer has been filled or not respectively (a minimal sketch of this rule follows after this list). When we save the buffer into the checkpoint we save all the relevant information regarding it (the `pos` attribute and the path to the memory-mapped files, which represent the buffer content to be retrieved upon resuming). Suppose we saved a checkpoint at step `N` and the experiment ran for another `K < N` steps before stopping, with a buffer that had already been filled at least once. When we resume, the buffer is loaded from the checkpoint, so the `pos` attribute points to the same position it pointed to at step `N`, and because the buffer is memory-mapped we find in `[pos, pos + K]` a bunch of trajectories that come from a "future" policy: the one we were training in the previous experiment before it stopped! Currently we don't know whether this can cause problems for the agent, nor have we found a nice solution to mitigate it. We have thought of a few ways to solve it: one is to memory-map the buffer metadata, such as the current `pos`; in this way, when we load the buffer from the checkpoint we can remove all the unwanted trajectories in `[old_pos, current_pos]`, although this could erase a lot of the buffer content if, for example, one has a checkpoint at step `N` and the experiment stopped at step `2N - 1`. Another solution could be to employ an online queue where trajectories are stored temporarily and flushed to the replay buffer only upon checkpointing; the problem is that a lot of data has to be kept in memory, and RAM could easily explode when working with images (this can be avoided by also memory-mapping the online queue). Practically, another possible solution is to set `algo.learning_starts=K` from the CLI or in the algorithm section of the experiment config: in this way all the future trajectories will be overwritten by trajectories conditioned by the resumed agent.
+ * In any case, when the checkpoint is resumed the buffer **could potentially be pre-filled for `algo.learning_starts` steps** with trajectories conditioned by the resumed agent. If you don't want to pre-fill the buffer, set `algo.learning_starts=0`
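To make the scenario above concrete, here is a minimal sketch of the valid-start-index rule quoted in the list, i.e. sampling sequence starts in `[0, pos - sequence_length) ∪ [pos, buffer_size)` when the buffer is full. It is not the actual `sheeprl.data.buffers.SequentialReplayBuffer` implementation; the function name, its arguments, and the numbers in the example are illustrative only.

```python
# A minimal sketch, NOT the actual sheeprl SequentialReplayBuffer code: it only
# illustrates which sequence start indices stay valid around the write cursor `pos`.
import numpy as np


def valid_start_indices(pos: int, buffer_size: int, sequence_length: int, full: bool) -> np.ndarray:
    """Start indices allowed when sampling sequences of length `sequence_length`.

    If the buffer has wrapped around at least once (full=True) the valid starts are
    [0, pos - sequence_length) U [pos, buffer_size); otherwise only [0, pos - sequence_length).
    """
    before_cursor = np.arange(0, max(pos - sequence_length, 0))
    if not full:
        return before_cursor
    after_cursor = np.arange(pos, buffer_size)
    return np.concatenate([before_cursor, after_cursor])


# Hypothetical numbers: checkpoint saved with pos == 500 on a full buffer of 1000 slots.
# After resuming, slots [500, 500 + K) still hold the K "future" steps collected after
# the checkpoint, and they remain samplable until the write cursor overwrites them.
starts = valid_start_indices(pos=500, buffer_size=1000, sequence_length=64, full=True)
print(starts.min(), starts.max(), starts.size)  # 0 999 936
```

This is also why the `algo.learning_starts=K` workaround mentioned above helps: the resumed run starts writing again from `pos`, so collecting `K` environment steps before training should overwrite the slots that still hold the "future" data.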
2 changes: 1 addition & 1 deletion sheeprl/configs/buffer/default.yaml
@@ -2,4 +2,4 @@ size: ???
memmap: True
validate_args: False
from_numpy: False
- checkpoint: False # Used only for off-policy algorithms
+ checkpoint: True # Used only for off-policy algorithms
2 changes: 1 addition & 1 deletion sheeprl/configs/exp/dreamer_v1.yaml
@@ -22,7 +22,7 @@ checkpoint:
# Buffer
buffer:
size: 5000000
- checkpoint: False
+ checkpoint: True

# Distribution
distribution:
2 changes: 1 addition & 1 deletion sheeprl/configs/exp/dreamer_v2.yaml
@@ -23,7 +23,7 @@ checkpoint:
buffer:
size: 5000000
type: sequential
- checkpoint: False
+ checkpoint: True
prioritize_ends: False

# Distribution
2 changes: 1 addition & 1 deletion sheeprl/configs/exp/dreamer_v3.yaml
@@ -23,7 +23,7 @@ checkpoint:
# Buffer
buffer:
size: 1000000
- checkpoint: False
+ checkpoint: True

# Distribution
distribution:
2 changes: 1 addition & 1 deletion sheeprl/configs/exp/sac.yaml
@@ -21,7 +21,7 @@ checkpoint:
# Buffer
buffer:
size: 1000000
- checkpoint: False
+ checkpoint: True
sample_next_obs: False

# Environment
