class TorchOptimizer(Optimizer)
| create_reward_signals(reward_signal_configs: Dict[RewardSignalType, RewardSignalSettings]) -> None
Create the reward signals from the given configuration.
Arguments:
reward_signal_configs
: Reward signal config.
| get_trajectory_value_estimates(batch: AgentBuffer, next_obs: List[np.ndarray], done: bool, agent_id: str = "") -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]
Get value estimates and memories for a trajectory, in batch form.
Arguments:
batch
: An AgentBuffer that consists of a trajectory.
next_obs
: The next observation (after the trajectory). Used for bootstrapping if this is not a terminal trajectory.
done
: Set true if this is a terminal trajectory.
agent_id
: Agent ID of the agent that this trajectory belongs to.
Returns:
A Tuple of the Value Estimates as a Dict of [name, np.ndarray(trajectory_len)], the final value estimate as a Dict of [name, float], and optionally (if using memories) an AgentBufferField of initial critic memories to be used during update.
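The shape of the returned tuple, and the role of `done` in bootstrapping, can be illustrated with a minimal sketch. This is not the library's implementation: `sketch_value_estimates` is a hypothetical helper, plain Python lists stand in for `np.ndarray`, and zeroing the final estimate on a terminal trajectory is an assumption about how bootstrapping is typically skipped.

```python
from typing import Dict, List, Optional, Tuple


def sketch_value_estimates(
    per_step_values: Dict[str, List[float]],
    next_step_values: Dict[str, float],
    done: bool,
) -> Tuple[Dict[str, List[float]], Dict[str, float], Optional[List[float]]]:
    """Hypothetical stand-in that mirrors the documented return shape.

    per_step_values:  one list of length trajectory_len per reward signal name.
    next_step_values: the critic's estimate for next_obs, per reward signal name.
    """
    final_values = dict(next_step_values)
    if done:
        # Assumption: a terminal trajectory has no future return to bootstrap from.
        final_values = {name: 0.0 for name in final_values}
    # No recurrent critic in this sketch, so there are no memories to return.
    return per_step_values, final_values, None


estimates, final, memories = sketch_value_estimates(
    {"extrinsic": [0.2, 0.4, 0.1]}, {"extrinsic": 0.7}, done=True
)
_, final_not_done, _ = sketch_value_estimates(
    {"extrinsic": [0.2, 0.4, 0.1]}, {"extrinsic": 0.7}, done=False
)
```

With `done=False` the final estimate is passed through unchanged, so it can serve as the bootstrap value when computing returns for a truncated trajectory.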
class Optimizer(abc.ABC)
Creates loss functions and auxiliary networks (e.g. Q or Value) needed for training. Provides methods to update the Policy.
| @abc.abstractmethod
| update(batch: AgentBuffer, num_sequences: int) -> Dict[str, float]
Update the Policy based on the batch that was passed in.
Arguments:
batch
: AgentBuffer that contains the minibatch of data used for this update.
num_sequences
: Number of recurrent sequences found in the minibatch.
Returns:
A Dict containing statistics (name, value) from the update (e.g. loss).
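The abstract `update` contract can be sketched as follows. This is a self-contained illustration under stated assumptions, not the library's code: `SketchOptimizer` and `MeanSquaredOptimizer` are hypothetical classes, a plain dict of lists stands in for `AgentBuffer`, and the field names `returns` and `value_estimates` and the statistic name `value_loss` are invented for the example.

```python
import abc
from typing import Dict, List

# Assumption: a dict of float lists stands in for AgentBuffer.
Batch = Dict[str, List[float]]


class SketchOptimizer(abc.ABC):
    """Hypothetical mirror of the abstract interface described above."""

    @abc.abstractmethod
    def update(self, batch: Batch, num_sequences: int) -> Dict[str, float]:
        """Update from one minibatch and return (name, value) statistics."""


class MeanSquaredOptimizer(SketchOptimizer):
    """Toy subclass: reports a mean squared value error and changes no weights."""

    def update(self, batch: Batch, num_sequences: int) -> Dict[str, float]:
        squared_errors = [
            (target - estimate) ** 2
            for target, estimate in zip(batch["returns"], batch["value_estimates"])
        ]
        loss = sum(squared_errors) / len(squared_errors)
        return {"value_loss": loss}


stats = MeanSquaredOptimizer().update(
    {"returns": [1.0, 2.0], "value_estimates": [0.5, 2.0]}, num_sequences=1
)

# The abstract base cannot be instantiated until update() is implemented.
try:
    SketchOptimizer()
    base_is_abstract = False
except TypeError:
    base_is_abstract = True
```

Returning a flat `Dict[str, float]` keeps the statistics easy to log or aggregate across updates, whatever the concrete optimizer computes internally.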