learning.learner#

Learner module encapsulating the Deep Deterministic Policy Gradient (DDPG) algorithm.

The Learner initializes the actor, critic, and normalizers, handles checkpointing during training, and loads networks when starting from pre-trained models.

Module Contents#

class learning.learner.Learner(plant: jacta.planner.dynamics.simulator_plant.SimulatorPlant, graph: jacta.planner.core.graph.Graph, replay_buffer: jacta.learning.replay_buffer.ReplayBuffer, params: jacta.planner.core.parameter_container.ParameterContainer, save_local: bool = True, load_local: bool = False, verbose: bool = True)#

Deep Deterministic Policy Gradient algorithm class.

Uses a state/goal normalizer and the HER sampling method to solve sparse reward environments.

Parameters:
  • plant (jacta.planner.dynamics.simulator_plant.SimulatorPlant) –

  • graph (jacta.planner.core.graph.Graph) –

  • replay_buffer (jacta.learning.replay_buffer.ReplayBuffer) –

  • params (jacta.planner.core.parameter_container.ParameterContainer) –

  • save_local (bool) –

  • load_local (bool) –

  • verbose (bool) –
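
A minimal usage sketch, assuming `plant`, `graph`, `replay_buffer`, and `params` have already been constructed with the corresponding jacta planner and learning modules; only the Learner calls below follow the signatures documented on this page, and the import path is inferred from the module name.

```python
# Usage sketch (not from the source): wire an existing planner setup into the Learner.
from jacta.learning.learner import Learner  # import path assumed from the module name above

learner = Learner(
    plant=plant,                  # SimulatorPlant wrapping the simulated dynamics
    graph=graph,                  # planner Graph used for demonstrations and rollouts
    replay_buffer=replay_buffer,  # ReplayBuffer holding (HER-relabeled) transitions
    params=params,                # ParameterContainer with the training hyperparameters
    save_local=True,              # keep checkpoints locally during training
    load_local=False,             # do not start from pre-trained networks
    verbose=True,
)

learner.train(num_epochs=50)             # DDPG + HER training
success_rate, *_ = learner.eval_agent()  # first element assumed to be the averaged success rate
learner.save_models()                    # stored under /models/<model_filename>/
```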

reset() → None#

actor_actions(actor: jacta.learning.networks.Actor, node_ids: torch.IntTensor, action_time_step: float) → torch.FloatTensor#
Parameters:
  • actor (jacta.learning.networks.Actor) –

  • node_ids (torch.IntTensor) –

  • action_time_step (float) –

Return type:

torch.FloatTensor

relative_distances_to(data_container: jacta.planner.core.graph.Graph | jacta.learning.replay_buffer.ReplayBuffer, ids: torch.IntTensor, target_states: torch.FloatTensor) → torch.FloatTensor#
Parameters:
  • data_container (Union[jacta.planner.core.graph.Graph, jacta.learning.replay_buffer.ReplayBuffer]) –

  • ids (torch.IntTensor) –

  • target_states (torch.FloatTensor) –

Return type:

torch.FloatTensor

reward_function(data_container: jacta.planner.core.graph.Graph | jacta.learning.replay_buffer.ReplayBuffer, node_ids: torch.FloatTensor, goals: torch.FloatTensor) → Tuple[torch.FloatTensor, torch.FloatTensor]#
Parameters:
  • data_container (Union[jacta.planner.core.graph.Graph, jacta.learning.replay_buffer.ReplayBuffer]) –

  • node_ids (torch.FloatTensor) –

  • goals (torch.FloatTensor) –

Return type:

Tuple[torch.FloatTensor, torch.FloatTensor]
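
The exact reward computation is not documented here; the sketch below only illustrates the common sparse-reward pattern such a method typically implements, with the returned pair assumed to be (rewards, success indicators) and the distance threshold a hypothetical parameter.

```python
import torch
from typing import Tuple

def sparse_reward_sketch(
    distances: torch.FloatTensor, success_threshold: float = 0.05
) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
    """Illustrative only: sparse reward of 0 on success and -1 otherwise.

    `distances` plays the role of the relative_distances_to(...) output; the
    threshold is a hypothetical parameter, not part of the jacta API.
    """
    successes = (distances < success_threshold).float()
    rewards = successes - 1.0  # 0 within the threshold, -1 otherwise
    return rewards, successes
```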

planner_exploration(root_states: torch.FloatTensor) → torch.FloatTensor#
Parameters:

root_states (torch.FloatTensor) –

Return type:

torch.FloatTensor

update_norm(states: torch.FloatTensor, goals: torch.FloatTensor) → None#

Update the normalizers with the current episode of play experience.

Samples from the trajectory instead of using every experience, so that the resulting goal distribution matches what the networks encounter during training.

Parameters:
  • states (torch.FloatTensor) –

  • goals (torch.FloatTensor) –

Return type:

None
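
State/goal normalizers in DDPG+HER setups are commonly running mean/std estimators updated from sampled experience; a generic sketch of that pattern follows (this is not the jacta implementation, and the class name is hypothetical).

```python
import torch

class RunningNormalizerSketch:
    """Illustrative running mean/std normalizer (not the jacta implementation)."""

    def __init__(self, size: int, eps: float = 1e-2) -> None:
        self.mean = torch.zeros(size)
        self.var = torch.ones(size)
        self.count = eps

    def update(self, batch: torch.FloatTensor) -> None:
        # Welford-style batched update of the running statistics.
        batch_mean = batch.mean(dim=0)
        batch_var = batch.var(dim=0, unbiased=False)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta.pow(2) * self.count * batch_count / total) / total
        self.count = total

    def normalize(self, x: torch.FloatTensor) -> torch.FloatTensor:
        return (x - self.mean) / torch.sqrt(self.var + 1e-8)
```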

policy_rollout(temporary: bool = False) → Tuple[torch.FloatTensor, bool]#
Parameters:

temporary (bool) –

Return type:

Tuple[torch.FloatTensor, bool]

graph_rollout(temporary: bool = False) → torch.FloatTensor#
Parameters:

temporary (bool) –

Return type:

torch.FloatTensor
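
A short usage sketch for the two rollout helpers above; the unpacked names are assumptions based on the annotated return types (the boolean is presumably a success flag), and the meaning of temporary is not documented here.

```python
# Usage sketch: roll out the learned policy and the planner graph for comparison.
policy_states, reached_goal = learner.policy_rollout(temporary=True)  # bool assumed to flag success
graph_states = learner.graph_rollout(temporary=True)
```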

set_demonstration_injection(final_success_rate: float) → None#
Parameters:

final_success_rate (float) –

Return type:

None
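
The signature suggests that planner demonstrations are gated on the evaluated success rate, a common pattern when combining DDPG+HER with demonstrations; the rule below is purely hypothetical, including the threshold.

```python
# Hypothetical gating rule (not from the source): keep injecting planner
# demonstrations into the replay buffer while the policy is still weak.
def should_inject_demonstrations(final_success_rate: float, threshold: float = 0.5) -> bool:
    return final_success_rate < threshold
```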

train(num_epochs: int = 50) → None#

Train a policy to solve the environment with DDPG.

Trajectories are resampled with HER to solve sparse reward environments.

See the DDPG paper and the HER paper.

Parameters:

num_epochs (int) –

Return type:

None
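
HER ("future" strategy) relabels a fraction of stored goals with goals actually achieved later in the same episode, which turns failed sparse-reward episodes into useful training signal. The sketch below illustrates that relabeling step in isolation; it does not use the jacta ReplayBuffer, and the relabel fraction is an assumed hyperparameter.

```python
import torch

def her_relabel_sketch(
    goals: torch.FloatTensor,     # (T, goal_dim) goals originally stored for one episode
    achieved: torch.FloatTensor,  # (T, goal_dim) goals actually achieved at each step
    relabel_fraction: float = 0.8,
) -> torch.FloatTensor:
    """Illustrative 'future' HER relabeling, independent of the jacta ReplayBuffer."""
    T = goals.shape[0]
    new_goals = goals.clone()
    relabel = torch.rand(T) < relabel_fraction
    # For each relabeled step t, sample a future index uniformly from [t, T).
    future = (torch.rand(T) * (T - torch.arange(T))).long() + torch.arange(T)
    new_goals[relabel] = achieved[future[relabel]]
    return new_goals
```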

state_action_training_data(num_trajectories: int = 1000, discount_factor: float = 0.98) → Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]#
Parameters:
  • num_trajectories (int) –

  • discount_factor (float) –

Return type:

Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]

pretrain(num_epochs: int = 100, num_trajectories: int = 1000, train_critic: bool = True) → None#
Parameters:
  • num_epochs (int) –

  • num_trajectories (int) –

  • train_critic (bool) –

Return type:

None
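
A workflow sketch combining the two calls above with the main training loop; only the call signatures come from this page, and the ordering (supervised pretraining from planner data, then DDPG + HER fine-tuning) is an assumption.

```python
# Sketch: supervised data from planner trajectories, pretraining, then DDPG + HER.
data = learner.state_action_training_data(num_trajectories=1000, discount_factor=0.98)
# `data` is a 5-tuple of FloatTensors; the meaning of each element is not
# documented here, so it is left unpacked.

learner.pretrain(num_epochs=100, num_trajectories=1000, train_critic=True)
learner.train(num_epochs=50)
```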

eval_agent() → Tuple[float, float, float]#

Evaluate the current agent performance on the task.

Runs the evaluation learner_evals times and averages the success rate.

Return type:

Tuple[float, float, float]

save_models() → None#

Save the actor and critic networks and the normalizers.

Saves are located under /models/<model_filename>/.

Return type:

None

load_models(path: str) → None#

Load the actor and critic networks and the normalizers.

Parameters:

path (str) –

Return type:

None
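
A checkpoint round trip using the two methods above; the path is a placeholder mirroring the save location described for save_models().

```python
# Checkpoint round trip sketch.
learner.save_models()                           # writes to /models/<model_filename>/
# ... later, e.g. with a freshly constructed Learner:
learner.load_models("models/<model_filename>")  # placeholder path; point to the actual checkpoint directory
```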