Training a DQN agent to play Connect Four
During the development of a zeroknowledge application called zk Connect Four, the appropriate context was given for the implementation of a reinforcement learning agent.
Reinforcement learning is a machine learning paradigm in which an agent learns from its environment through trial and error. That is, it is a computational approach to learning from actions.
The goal is to learn what the most appropriate actions are in each game state to maximise a reward. To this end, a DQN agent and a double DQN (DDQN) agent are implemented in this notebook to optimize an approximator in the Connect Four game using PyTorch.
In this article
Setup
Fundamental concepts
The main actors in the reinforcement learning paradigm are the environment and the agent. The environment is the context in which the agent lives and interacts with.
In each iteration, the agent receives an observation of the current state of the game, decides the best action to take, and acts on the environment.
In this way, a change in the state of the environment is generated and a reward is obtained from this transition.
The objective of this exercise is to obtain a policy that is capable of maximising this reward, thus obtaining the best action for each transition and promoting an optimal approach to the Connect Four game.
Below we introduce the main classes that underpin this paradigm and that will allow us to establish the actors and the context of the game.
Environment
The environment implementation consists of a class whose main methods and tasks are the following:

get_valid_actions
: provides a list with all valid actions given the current state, i.e., the board at that moment in the game. 
step
 undertakes a transition with the action received from the agent (an index of a column on the board), checks whether it has been won or tied, and returns a reward depending on the result. 
switch_turn
: changes the game turn between both players. 
reset
 initializes the state and all game variables and returns an initial state.
DQN and DDQN agents
DQN and DDQN are essentially algorithms that consist of learning an approximator for a Q function actionvalue.
In our case, we will try to learn the strategies of both players of the Connect Four game, so instead of being an agent that is only trained on one strategy, it would be a multiagent reinforcement learning (MARL) modality.
Then, an alternating Markov game scenario is presented, or a zerosum game where one player's gain is proportional to the other's loss.
The implementation of the class used to instantiate the agents consists of the following methods:
In each iteration, the agent is prompted to provide us with an action according to an epsilongreedy strategy to treat the explorationexploitation dilemma.
Sometimes our model will predict the action to be taken and other times an action will be sampled.
As can be seen in the graph below, in each iteration the probability of new actions being explored or learnt actions being exploited will decay exponentially based on the rate of decay eps_decay
.
This strategy is implemented in the method act
of the DQN agent below.
If the sample value is greater than the epsilon threshold, the best action given the last observation of the environment will be approximated (exploitation of learnt actions).
Otherwise, an action will be chosen (exploration) from among the valid actions uniformly.
The option is given, through
enforce_valid_action
, to ensure that the chosen action is valid, or to urge the model to learn which columns of the board can qualify for a new piece in each turn, i.e., to learn this rule of the game.
The data used to train the agent is generated from scratch during each training. This data are stored in an object of class ExperienceReplay
and are randomly sampled with the aim of optimizing a strategy or action plan (policy).
The observations in each iteration are appended as named tuples (itertools' named_tuple
) to a sequence list (deque
, or doubleended queue) with limited capacity, where observations can be inserted and deleted from both ends, thus creating a gaming experience.
Each observation includes a state (the game board at that precise moment), an action (the index of the column that has been chosen), a next state (the game board after considering the last action, or None
if it is a final state, i.e., if a winning action has been taken) and a reward.
Once we have obtained and memorized more observations than the number of batch_size we use, we will proceed to optimize the model.
In our example, it is only optimized once per episode since the values of the transitions will sometimes depend on previous transitions.
For example, when a player wins the game the player gets a positive reward, so for the state transition in this iteration a tuple is generated as follows:
transition = Transition(state, action, None, reward)
The transition would consist of the state in t
, the action taken, the state in t+1
(as it is a final state, it is assigned None
) and the reward obtained.
What happens is that during the generation of the observation in the previous iteration, we did not yet know about this outcome and therefore the reward was not negative for having lost the game.
For this reason, once each training game ends, this relationship of rewards is resolved to be saved later in memory.
Therefore, the transition in t1
from the transition above would be:
transition_t1 = Transition(state, action, next_state, reward + env.params['rewards']['loss'])
The loss reward is added to the assigned reward and not directly assigned since it is possible to configure an extension reward in the game as a hyperparameter.
The optimization process is shown below:
First, a number of observations are sampled in memory in the method optimize
, and all state and action values of the observations are concatenated into separate arrays.
Furthermore, both the states in t
and t+1
get a new dimension since a board transformation is applied using the get_two_channels
function.
The idea is that for each board with value 0, 1 or 2, two matrices with binary values are obtained, one for each player.
These two matrices will be the two input channels per observation that our convolutional neural net will get as input.
A logical array is then generated that will be used as a mask to assign Q values to nonfinal states, while final states will have a value of zero.
In the case that a DQN agent is used, only the target model will be used to obtain the values of Q, obtaining the values of maximum magnitude among all the actions.
While with a DDQN agent, both our policy model and our "fixed" target model are used to obtain the time difference values of Q that we want to approach in order to solve a possible overestimation of these values.
Subsequently, the longterm profit that can be anticipated is calculated. For this, a discount factor gamma is applied and is subtracted from the reward for that transition (rather than added as in the Bellman equation as suggested in this Kaggle notebook).
In this way we adapt our optimization for the learning of both strategies to the alternating zerosum game at hand.
It is at this moment when the parameters of our policy are optimized. To do this, the loss is calculated (in our case the criterion torch.nn.HuberLoss
is used) to minimize the error of the temporal difference δ, null values are assigned to the gradients to later be computed using backpropagation, a clipping of them is applied if the training has been configured that way and the parameters are updated based on the calculated gradient.
Once the parameters of the policy have been optimized during training, the parameters of our target are considered to be updated. For this, two strategies are implemented: soft update y hard update.
The former will update the parameters gradually in each episode only obtaining a percentage of the parameters of policy based on the TAU value considered as a hyperparameter in training.
For example, if TAU is 0.001, target will use 0.1% of the parameters of policy_ and will keep the rest (99.9%).
On the contrary, through a hard update, all parameters will be copied at once if a number of iterations, to be configured, is reached.
Finally, three methods are implemented to save transitions in memory, save current parameters of policy to configure checkpoints during training, and to load the weights of a saved model at the beginning of a training.
Note that the optimizer is instantiated once the weights of the previously saved model have been loaded.
Modeling
The model chosen to learn the Connect Four game consists of two convolutional layers with ReLU activation and two fully connected layers (the first also with ReLU activation) at the output.
In the same object of class ConnectFourNet
two modules are instantiated, one for policy and another for target. The latter is copied from the former once its parameters (weights and bias) have been initialized.
In this way we can save the parameters of both models during training in the same state dictionary.
Training
As previously noted, the training
of the game policy combines the strategies of both players. To do this, we experiment with the use of the DQN and DDQN agents.
DQN
First of all, we configure all the parameters not only of the training but also of the evaluation, environment and agent, and with these parameters we instantiate the module with the neural networks, the agent and the environment.
Once all the entities and parameters that are needed have been instantiated, we can start training the agent to optimize the policy.
The function train
shows iteration metrics periodically. Some values will correspond to averages obtained from current values and previous iterations in order to recognize trends during training.
In each episode the policy model is evaluated playing against a random agent.
The rewards for each episode are usually zero since the winning score is added to the losing score (1 + 1), however the average that appears in parentheses is sometimes different from zero. This is because in some episodes the model chooses an action that is not valid and obtains a reward as if it had lost the game, but a winning reward is not assigned to the other player since the next transition does not occur.
As we can see, this average tends to decrease, i.e., the model tends to make fewer errors when choosing actions that are legal.
In the previous graphs we can verify what we mentioned previously. Furthermore, it is observed how the number of iterations tends to increase slightly during training and evaluation.
It would be expected that the games were longer in training given that as the model learns it continues to play against itself.
However, with regards to evaluation, and considering that the model plays against random decision making, it would be expected that, although there is a slight reduction in the number of turns it takes to win with the passing of the iterations, the reduction was greater.
In any case, the model tends to win almost all the games, as we can see in Evaluation rates.
An option to try to incentivize the model to try to make the games shorter could be to assign a negative reward to
params_env['rewards']['prolongation']
.
Besides, since the agent is learning a strategy or action policy for a nonstationary scenario, which depends on specific actions, the graph of Running loss is only considered for the choice of the training batch size hyperparameter (batch_size
) and to verify that the gradients neither increase nor decrease considerably or exponentially (exploding gradients and vanishing gradients).
We can check below how our policy acts with examples of possible legal boards in key game situations, for illustration purposes only.
First we see if the policy model is able to complete some lines of four to win a game.
Horizontal P2
t
t+1
Horizontal P1
t
t+1
Vertical P2
t
t+1
Vertical P1
t
t+1
Diagonal P2
t
t+1
Diagonal P1
t
t+1
AntiDiagonal P2
t
t+1
AntiDiagonal P1
t
t+1
We can see how this model is capable of choosing actions to win the game either horizontally, vertically, diagonally or antidiagonally, although not for both players in all line modalities, for both players in different modalities.
A similar thing happens when our approximator is presented with the opportunity to block lines of four consecutive counters to win a game just in the turn before these winning transitions can happen.
Horizontal P2
t
t+1
Horizontal P1
t
t+1
Vertical P2
t
t+1
Vertical P1
t
t+1
Diagonal P2
t
t+1
Diagonal P1
t
t+1
Anti diagonal P2
t
t+1
Anti diagonal P1
t
t+1
DDQN
Finally, a double DQN agent was trained in the same exercise. Similar training hyperparameters have been used.
As we can see below, the training results are similar to the DQN agent, however, the results of the model when faced with random decision making are slightly worse.
Again, just for the sake of illustration, we pose some observations of key scenarios in the game to our model to get an idea of how it performs.
Horizontal P2
t
t+1
Horizontal P1
t
t+1
Vertical P2
t
t+1
Vertical P1
t
t+1
Diagonal P2
t
t+1
Diagonal P1
t
t+1
Anti diagonal P2
t
t+1
Anti diagonal P1
t
t+1
As we see, it is still capable of resolving the best action on the majority of the boards both to win the game and to block the opposing player from winning it:
Horizontal P2
t
t+1
Horizontal P1
t
t+1
Vertical P2
t
t+1
Vertical P1
t
t+1
Diagonal P2
t
t+1
Diagonal P1
t
t+1
Anti diagonal P2
t
t+1
Anti diagonal P1
t
t+1
It is possible that the models do not perform as well in prolonged games since during training they receive more observations of boards with fewer counters.
There are many training parameters to modify and try to get a better generalization of the model. For example, rewards, epsilon decay rate, observation batch size, memory capacity, learning rate or even implement another neural network architecture or modify and/or include convolutional or fully connected layers .
In addition, more modern reinforcement learning algorithms such as proximal policy optimization (PPO) can be implemented, or make use of a memory of observations with prioritized sampling.
The library TorchRL, which is a reinforcement learning library for PyTorch, implements these and many other options.
Export
As noted in the introduction to this Jupyter notebook, this model is used in a decentralised zeroknowledge web application.
For this reason, the module torch.onnx
is used to convert the native PyTorch calculation graph to a ONNX graph and thus, using the ONNX Runtime Web JavaScript library, this machine learning model can be deployed in our web application.