Collaboration and Competition

Algorithm

In this environment, I use Twin Delayed Deep Deterministic Policy Gradient (TD3) to train the agents. There are two agents; they share a replay buffer and the critic networks, while each agent has its own independent actor network. During training, each agent selects an action based on its own observation and adds its experience to the shared replay buffer.
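
The sharing described above can be pictured with the following rough sketch (illustrative only; the real network definitions are listed under Model below): both agents hold references to the same critic networks and the same replay buffer, while each keeps its own actor.

from collections import deque
import torch.nn as nn

state_size, action_size = 24, 2

def tiny_net(in_dim, out_dim):
    # stand-in network; the actual layer sizes are shown in the Model section
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

shared_buffer = deque(maxlen=int(1e5))                                       # one replay buffer for both agents
shared_critics = [tiny_net(state_size + action_size, 1) for _ in range(2)]   # twin critics, shared
agents = [{"actor": tiny_net(state_size, action_size),                       # independent actor per agent
           "critics": shared_critics,
           "memory": shared_buffer}
          for _ in range(2)]

# both agents really do point at the same objects
assert agents[0]["memory"] is agents[1]["memory"]
assert agents[0]["critics"][0] is agents[1]["critics"][0]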

TD3 is based on Deep Deterministic Policy Gradient (DDPG) and adds several modifications to it. The main ideas are listed below.

  • Clipped Double Q-Learning for Actor-Critic

    It uses the concept of Double Q-Learning to alleviate the overestimation bias in Actor-Critic methods. Specifically, it keeps two pairs of critic networks (two local and two target critics). When calculating the target Q value, it evaluates both target critics and takes the smaller of the two outputs, which reduces the influence of the overestimation problem.

  • Target Networks and Delayed Policy Updates

    Like DDPG, TD3 uses target networks and soft updates to stabilize training. In addition, it delays the policy updates: the actor and the target networks are only updated every few timesteps. This results in higher-quality policy updates and thus improves the agent's performance.

  • Target Policy Smoothing Regularization

    Intuitively, similar actions should have similar values. To implement this idea, TD3 adds a small amount of clipped random noise to the target policy's action. In other words, fitting the value of a small area around the target action has the benefit of smoothing the value estimate over similar state-action pairs (see the sketch after this list).
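
Below is a condensed sketch of how these three ideas fit together when computing the learning targets. The network arguments are stand-ins and the default values mirror the hyperparameters listed later (GAMMA, noise, noise_clip, UPDATE_EVERY); this is an illustration of the technique, not the exact update code in my agent.

import torch

def td3_targets(rewards, next_states, dones,
                actor_target, critic_target_1, critic_target_2,
                gamma=0.99, policy_noise=0.2, noise_clip=0.5):
    """Compute TD3 target Q values with target policy smoothing and clipped double Q-learning."""
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped Gaussian noise
        next_actions = actor_target(next_states)
        smoothing = (torch.randn_like(next_actions) * policy_noise).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + smoothing).clamp(-1.0, 1.0)

        # Clipped Double Q-Learning: take the smaller of the two target critics' estimates
        q1 = critic_target_1(next_states, next_actions)
        q2 = critic_target_2(next_states, next_actions)
        q_next = torch.min(q1, q2)

        return rewards + gamma * (1.0 - dones) * q_next

# Delayed policy updates: the critics are trained against these targets every
# learning step, while the actor and the target networks are only refreshed
# every UPDATE_EVERY learning steps (with a soft update).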

This is the pseudocode from the TD3 paper:

pseudo code

Model

The actor model architecture

Actor(
  (fc1): Linear(in_features=24, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=2, bias=True)
)
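
For reference, an Actor class that produces the printed architecture above could look like the sketch below. The ReLU hidden activations and the tanh output (to bound actions to [-1, 1]) are assumptions here, since the printed module only lists the linear layers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Actor (Policy) Model."""
    def __init__(self, state_size=24, action_size=2, fc1_units=256, fc2_units=128):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        """Map a state to a deterministic action in [-1, 1]."""
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))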

The critic model architecture

Critic(
  (fc1): Linear(in_features=24, out_features=256, bias=True)
  (fc2): Linear(in_features=258, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=1, bias=True)
)

Note that the critic model architecture differs from the TD3 authors' implementation. The official TD3 critic concatenates the state and action before the first dense layer, whereas Udacity's DDPG implementation concatenates them at the second layer. I found that the agent could not learn anything with the TD3 critic implementation, so in the end I chose the DDPG critic model architecture.

The code comparison

import torch
import torch.nn as nn
import torch.nn.functional as F

class DDPGCritic(nn.Module):
    """DDPG Critic (Value) Model: concatenates the action at the second layer."""
    def __init__(self, state_size, action_size, fc1_units=256, fc2_units=128):
        super(DDPGCritic, self).__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        xs = F.leaky_relu(self.fc1(state))
        x = torch.cat((xs, action), dim=1)
        x = F.leaky_relu(self.fc2(x))
        return self.fc3(x)


class TD3Critic(nn.Module):
    """TD3 Critic (Value) Model: concatenates state and action before the first layer."""
    def __init__(self, state_size, action_size, fc1_units=256, fc2_units=128):
        super(TD3Critic, self).__init__()
        self.fc1 = nn.Linear(state_size + action_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        state_action = torch.cat([state, action], dim=1)
        xs = F.relu(self.fc1(state_action))
        x = F.relu(self.fc2(xs))
        return self.fc3(x)

Hyperparameters

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 256        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-2              # for soft update of target parameters
LR_ACTOR = 1e-3         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
LEARN_EVERY = 4         # how often to learn from the experience
UPDATE_EVERY = 2        # how often to update the target network
random_seed = 0
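
A sketch of how LEARN_EVERY, UPDATE_EVERY and TAU are meant to interact (the agent code itself may organize this differently): learning from the replay buffer happens every LEARN_EVERY environment steps, and within the learning routine the actor and the target networks are only refreshed every UPDATE_EVERY learning steps, using a soft update with factor TAU.

def soft_update(local_model, target_model, tau=1e-2):
    """Blend target parameters toward the local ones: theta_target = tau*theta_local + (1-tau)*theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

# inside the training loop (t = total timestep counter), roughly:
# if t % LEARN_EVERY == 0 and len(memory) > BATCH_SIZE:
#     update_critics(...)                       # every learning call
#     if learn_step % UPDATE_EVERY == 0:        # delayed policy update
#         update_actor(...)
#         soft_update(actor_local, actor_target, TAU)
#         soft_update(critic_local, critic_target, TAU)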

TD3 noise parameters

noise = 0.2             # scale of the noise added to the target action while learning (target policy smoothing)
noise_std = 0.3         # scale of the noise added to the chosen action while acting (exploration)
noise_clip = 0.5        # the target-policy-smoothing noise is clipped to [-noise_clip, noise_clip]

Training noise parameters

# scale applied to the exploration noise; reset to this value at the beginning of every episode.
noise_degree = 2.0

# the noise scale is multiplied by this factor every timestep.
noise_decay = 0.999
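
A sketch of how these two parameters can be applied when selecting actions (the exact schedule in the agent code may differ slightly): the Gaussian exploration noise is scaled by a factor that starts at noise_degree at the beginning of each episode and is multiplied by noise_decay after every timestep.

import numpy as np

def act_with_noise(action, noise_scale, noise_std=0.3, noise_decay=0.999):
    """Add scaled exploration noise to an action and decay the scale for the next step."""
    noisy = np.clip(action + noise_scale * np.random.normal(0.0, noise_std, size=np.shape(action)), -1.0, 1.0)
    return noisy, noise_scale * noise_decay

# noise_scale is reset to noise_degree (2.0) at the start of every episode:
# action, noise_scale = act_with_noise(actor_action, noise_scale)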

Result

The environment is considered solved when the average score over 100 consecutive episodes reaches at least +0.5. The left plot shows the maximum score between the two agents in each episode, and the right plot shows the average score of the two agents in each episode. We can see that the score exceeds 0.5 at around episode 470.
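
Written out as a small check (a sketch, not the exact notebook code), the solve criterion tracks the per-episode maximum of the two agents' scores and looks at its moving average over the last 100 episodes:

import numpy as np
from collections import deque

scores_window = deque(maxlen=100)   # per-episode max scores of the last 100 episodes

def episode_solved(agent_scores):
    """Record one episode's scores and report whether the +0.5 average has been reached."""
    scores_window.append(np.max(agent_scores))   # max over the two agents
    return len(scores_window) == 100 and np.mean(scores_window) >= 0.5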

Plot

Agent

Ideas for Future Work

I have only tried DDPG and TD3 in this environment, so other algorithms such as PPO or D4PG might work better here. Continuing to tune the hyperparameters could also improve the performance.