learning.networks#

The networks module contains the actor and critic classes as well as their underlying policy and value networks and the associated training utilities.

The Actor acts as a wrapper around the actual deterministic policy network to provide action selection as well as saving and loading utilities.

DDP is a vanilla deep deterministic policy network implementation.

Module Contents#

learning.networks.soft_update(network: torch.nn.Module, target: torch.nn.Module, tau: float) torch.nn.Module#

Perform a soft update of the target network’s weights.

Shifts the target's weights toward the network's weights by a factor of tau; a sketch of this rule is given below.

Parameters:
  • network (torch.nn.Module) – Network from which to copy the weights.

  • target (torch.nn.Module) – Network that gets updated.

  • tau (float) – Controls how much the weights are shifted. Valid in [0, 1].

Returns:

The updated target network.

Return type:

torch.nn.Module
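This is the standard Polyak averaging rule used in DDPG-style methods. A minimal sketch of an equivalent implementation (not necessarily the code used here):

    import torch
    from torch import nn

    def soft_update_sketch(network: nn.Module, target: nn.Module, tau: float) -> nn.Module:
        # target_param <- tau * network_param + (1 - tau) * target_param
        with torch.no_grad():
            for param, target_param in zip(network.parameters(), target.parameters()):
                target_param.mul_(1.0 - tau).add_(tau * param)
        return target

With tau=0 the target is left unchanged; with tau=1 it becomes an exact copy of the network.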

class learning.networks.Actor(size_s: int, size_a: int, nlayers: int = 4, layer_width: int = 256, lr: float = 0.001, eps: float = 0.3, action_clip: float = 1.0)#

Actor class encapsulating the action selection and training process for the DDPG actor.

Parameters:
  • size_s (int) –

  • size_a (int) –

  • nlayers (int) –

  • layer_width (int) –

  • lr (float) –

  • eps (float) –

  • action_clip (float) –

select_action(state_norm: jacta.learning.normalizer.Normalizer, state: torch.FloatTensor, goal: torch.FloatTensor, exploration_function: Optional[Callable] = None) torch.FloatTensor#

Select an action for the given input state and goal.

If in train mode, samples noise and chooses completely random actions with probability self.eps. If in evaluation mode, only clips the action to the maximum value. A sketch of this scheme is given after the parameter list below.

Parameters:
  • state_norm (jacta.learning.normalizer.Normalizer) –

  • state (torch.FloatTensor) – Input states.

  • goal (torch.FloatTensor) –

  • exploration_function (Optional[Callable]) –

Returns:

An action tensor.

Return type:

torch.FloatTensor
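Illustratively, the train/eval behavior described above can be outlined as follows. This is a minimal sketch, not the actual implementation: the normalization of state and goal via state_norm is omitted, and the noise scale (0.1) and the uniform resampling of random actions are assumptions.

    import torch

    def select_action_sketch(actor, policy_input: torch.FloatTensor,
                             train: bool, eps: float, action_clip: float) -> torch.FloatTensor:
        # Deterministic action from the policy network.
        action = actor(policy_input)
        if train:
            # Add Gaussian exploration noise (the scale is an assumption).
            action = action + 0.1 * torch.randn_like(action)
            # With probability eps, replace with a uniformly random action.
            if torch.rand(1).item() < eps:
                action = action_clip * (2.0 * torch.rand_like(action) - 1.0)
        # In both modes, clip to the allowed action range.
        return action.clamp(-action_clip, action_clip)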

__call__(states: torch.FloatTensor) torch.FloatTensor#

Run a forward pass directly on the action net.

Parameters:

states (torch.FloatTensor) – Input states.

Returns:

An action tensor.

Return type:

torch.FloatTensor

backward_step(loss: torch.FloatTensor) None#

Perform a backward pass with an optimizer step.

Parameters:

loss (torch.FloatTensor) – Actor network loss.

Return type:

None
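A typical realization of such a step, assuming the actor holds an internal optimizer (not part of the public API above), is:

    import torch

    def backward_step_sketch(optimizer: torch.optim.Optimizer, loss: torch.Tensor) -> None:
        optimizer.zero_grad()  # clear gradients accumulated by earlier passes
        loss.backward()        # backpropagate the actor loss
        optimizer.step()       # apply the gradient update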

target(states: torch.FloatTensor) torch.FloatTensor#

Compute actions with the target network and without noise.

Parameters:

states (torch.FloatTensor) – Input states.

Returns:

An action tensor.

Return type:

torch.FloatTensor

eval() None#

Set the actor to eval mode without noise in the action selection.

Return type:

None

train() None#

Set the actor to train mode with noisy actions.

Return type:

None

update_target(tau: float = 0.05) None#

Update the target network with a soft parameter transfer update.

Parameters:

tau (float) – Averaging fraction of the parameter update for the action network.

Return type:

None
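In a training loop this is usually invoked once per optimization step; with the default tau=0.05 the target keeps 95% of its previous weights on each call. A usage sketch (compute_actor_loss is a hypothetical placeholder):

    loss = compute_actor_loss()    # placeholder for the actual loss computation
    actor.backward_step(loss)
    actor.update_target(tau=0.05)  # target slowly tracks the online network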

load(checkpoint: Any) None#

Load data for the actor.

Parameters:

checkpoint (Any) – dict containing loaded data.

Return type:

None

save(f: io.BufferedWriter) None#

Save data for the actor.

Parameters:

f (io.BufferedWriter) –

Return type:

None
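A hedged checkpointing sketch, assuming the data written by save() is read back with torch.load (the checkpoint layout and the file name actor.pt are illustrative assumptions):

    import torch

    # Save: the actor serializes its data into an open binary file.
    with open("actor.pt", "wb") as f:
        actor.save(f)

    # Load: restore the actor from a previously loaded checkpoint dict.
    checkpoint = torch.load("actor.pt")
    actor.load(checkpoint)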

class learning.networks.DDP(size_s: int, size_a: int, nlayers: int, layer_width: int)#

Bases: torch.nn.Module

Continuous action choice network for the agent.

Parameters:
  • size_s (int) –

  • size_a (int) –

  • nlayers (int) –

  • layer_width (int) –

forward(x: torch.FloatTensor) torch.FloatTensor#

Compute the network forward pass.

Parameters:

x (torch.FloatTensor) – Input tensor.

Returns:

The network output.

Return type:

torch.FloatTensor
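As a mental model, a vanilla deterministic policy network of this shape is typically a plain MLP from state to action. The sketch below is an assumption about the architecture (layer layout and activations included), not the actual DDP definition:

    import torch
    from torch import nn

    class DDPSketch(nn.Module):
        """Illustrative MLP mapping a state vector to an action vector."""

        def __init__(self, size_s: int, size_a: int, nlayers: int, layer_width: int):
            super().__init__()
            # Interpretation of nlayers as the number of hidden layers is an assumption.
            layers = [nn.Linear(size_s, layer_width), nn.ReLU()]
            for _ in range(nlayers - 1):
                layers += [nn.Linear(layer_width, layer_width), nn.ReLU()]
            # Tanh keeps actions in [-1, 1]; whether DDP does this is an assumption.
            layers += [nn.Linear(layer_width, size_a), nn.Tanh()]
            self.net = nn.Sequential(*layers)

        def forward(self, x: torch.FloatTensor) -> torch.FloatTensor:
            return self.net(x)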

class learning.networks.Critic(size_s: int, size_a: int, nlayers: int = 4, layer_width: int = 256, lr: float = 0.001)#

Critic class encapsulating the critic and training process for the DDPG critic.

Parameters:
  • size_s (int) –

  • size_a (int) –

  • nlayers (int) –

  • layer_width (int) –

  • lr (float) –

__call__(states: torch.FloatTensor, actions: torch.FloatTensor) torch.FloatTensor#

Run a critic net forward pass.

Parameters:
  • states (torch.FloatTensor) – Input states.

  • actions (torch.FloatTensor) – Input actions.

Returns:

An action value tensor.

Return type:

torch.FloatTensor
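For orientation, the critic maps batches of state and action tensors to action values; a small usage sketch (all sizes are hypothetical):

    import torch

    batch_size, size_s, size_a = 256, 12, 4       # hypothetical dimensions
    states = torch.randn(batch_size, size_s)
    actions = torch.randn(batch_size, size_a)

    q_values = critic(states, actions)            # one action value per (state, action) pair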

target(states: torch.FloatTensor, actions: torch.FloatTensor) torch.FloatTensor#

Compute the action value with the target network.

Parameters:
  • states (torch.FloatTensor) – Input states.

  • actions (torch.FloatTensor) – Input actions.

Returns:

An action value tensor.

Return type:

torch.FloatTensor

backward_step(loss: torch.FloatTensor) None#

Perform a backward pass with an optimizer step.

Parameters:

loss (torch.FloatTensor) – Critic network loss.

Return type:

None

update_target(tau: float = 0.05) None#

Update the target network with a soft parameter transfer update.

Parameters:

tau (float) – Averaging fraction of the parameter update for the critic network.

Return type:

None

load(checkpoint: Any) None#

Load data for the critic.

Parameters:

checkpoint (Any) – dict containing loaded data.

Return type:

None

save(f: io.BufferedWriter) None#

Save data for the critic.

Parameters:

f (io.BufferedWriter) –

Return type:

None

class learning.networks.CriticNetwork(size_s: int, size_a: int, nlayers: int, layer_width: int)#

Bases: torch.nn.Module

State action critic network for the critic.

Parameters:
  • size_s (int) –

  • size_a (int) –

  • nlayers (int) –

  • layer_width (int) –

forward(state: torch.FloatTensor, action: torch.FloatTensor) torch.FloatTensor#

Compute the network forward pass.

Parameters:
  • state (torch.FloatTensor) – Input state tensor.

  • action (torch.FloatTensor) – Input action tensor.

Returns:

The network output.

Return type:

torch.FloatTensor
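A common way to realize such a state-action critic is to concatenate state and action and pass them through an MLP with a scalar output. The sketch below is an assumption about the layout, not the actual CriticNetwork code:

    import torch
    from torch import nn

    class CriticNetworkSketch(nn.Module):
        """Illustrative Q-network: (state, action) -> scalar value."""

        def __init__(self, size_s: int, size_a: int, nlayers: int, layer_width: int):
            super().__init__()
            layers = [nn.Linear(size_s + size_a, layer_width), nn.ReLU()]
            for _ in range(nlayers - 1):
                layers += [nn.Linear(layer_width, layer_width), nn.ReLU()]
            layers += [nn.Linear(layer_width, 1)]
            self.net = nn.Sequential(*layers)

        def forward(self, state: torch.FloatTensor, action: torch.FloatTensor) -> torch.FloatTensor:
            # Concatenating state and action along the last dimension is an assumption.
            return self.net(torch.cat([state, action], dim=-1))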

learning.networks.train_actor_critic(actor: Actor, actor_expert: Actor, critic: Critic, state_norm: jacta.learning.normalizer.Normalizer, replay_buffer: jacta.learning.replay_buffer.ReplayBuffer, reward_fun: Callable, batch_size: int = 256, discount_factor: float = 0.98, her_probability: float = 0.0) float#

Train the actor and critic networks with experience sampled from the replay buffer. A sketch of the core update follows the parameter list below.

Parameters:
  • actor (Actor) –

  • actor_expert (Actor) –

  • critic (Critic) –

  • state_norm (jacta.learning.normalizer.Normalizer) –

  • replay_buffer (jacta.learning.replay_buffer.ReplayBuffer) –

  • reward_fun (Callable) –

  • batch_size (int) –

  • discount_factor (float) –

  • her_probability (float) –

Return type:

float
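The update performed here follows the usual DDPG pattern: bootstrap a critic target from the target networks, regress the critic onto it, and update the actor to maximize the critic's value of its own actions. The sketch below shows that core step under stated assumptions; replay-buffer sampling, state/goal normalization, HER relabeling, and the actor_expert term are omitted, and the exact losses and the value returned are assumptions.

    import torch
    import torch.nn.functional as F

    def ddpg_update_sketch(actor, critic, states, actions, rewards, next_states,
                           discount_factor: float = 0.98) -> float:
        # Bootstrapped critic target: r + gamma * Q_target(s', pi_target(s')).
        with torch.no_grad():
            next_actions = actor.target(next_states)
            q_next = critic.target(next_states, next_actions)
            q_target = rewards + discount_factor * q_next

        # Critic regression toward the bootstrapped target.
        critic_loss = F.mse_loss(critic(states, actions), q_target)
        critic.backward_step(critic_loss)

        # Actor maximizes the critic's value of its own actions.
        actor_loss = -critic(states, actor(states)).mean()
        actor.backward_step(actor_loss)

        # Targets slowly track the online networks.
        actor.update_target()
        critic.update_target()
        return actor_loss.item()  # which loss is reported is an assumption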

learning.networks.train_actor_imitation(actor: Actor, state_norm: jacta.learning.normalizer.Normalizer, states: torch.FloatTensor, goals: torch.FloatTensor, actor_actions: torch.FloatTensor, batch_size: int = 256) float#
Parameters:
  • actor (Actor) –

  • state_norm (jacta.learning.normalizer.Normalizer) –

  • states (torch.FloatTensor) –

  • goals (torch.FloatTensor) –

  • actor_actions (torch.FloatTensor) –

  • batch_size (int) –

Return type:

float
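The signature suggests a behavior-cloning style update that regresses the actor onto recorded expert actions. A sketch of that idea, in which the minibatch sampling, the state/goal concatenation, and the MSE loss are all assumptions (normalization via state_norm is omitted):

    import torch
    import torch.nn.functional as F

    def actor_imitation_sketch(actor, states, goals, actor_actions, batch_size: int = 256) -> float:
        # Sample a random minibatch of demonstration transitions.
        idx = torch.randint(0, states.shape[0], (batch_size,))
        inputs = torch.cat([states[idx], goals[idx]], dim=-1)  # input handling is an assumption

        # Regress the policy output onto the demonstrated actions.
        loss = F.mse_loss(actor(inputs), actor_actions[idx])
        actor.backward_step(loss)
        return loss.item()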

learning.networks.train_critic_imitation(critic: Critic, state_norm: jacta.learning.normalizer.Normalizer, states: torch.FloatTensor, goals: torch.FloatTensor, actor_actions: torch.FloatTensor, q_values: torch.FloatTensor, batch_size: int = 256) float#
Parameters:
  • critic (Critic) –

  • state_norm (jacta.learning.normalizer.Normalizer) –

  • states (torch.FloatTensor) –

  • goals (torch.FloatTensor) –

  • actor_actions (torch.FloatTensor) –

  • q_values (torch.FloatTensor) –

  • batch_size (int) –

Return type:

float
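Analogously, the critic can be fit by supervised regression onto the provided q_values; again a hedged sketch rather than the documented implementation (sampling, input handling, and loss are assumptions):

    import torch
    import torch.nn.functional as F

    def critic_imitation_sketch(critic, states, goals, actor_actions, q_values,
                                batch_size: int = 256) -> float:
        idx = torch.randint(0, states.shape[0], (batch_size,))
        inputs = torch.cat([states[idx], goals[idx]], dim=-1)  # input handling is an assumption

        # Regress predicted action values onto the supplied targets.
        loss = F.mse_loss(critic(inputs, actor_actions[idx]), q_values[idx])
        critic.backward_step(loss)
        return loss.item()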