1
00:00:00,000 --> 00:00:05,720
Chapter 13 AlphaGo—Bringing It All Together

2
00:00:05,720 --> 00:00:09,760
This chapter covers Diving into the guiding principles that led

3
00:00:09,760 --> 00:00:13,320
Go bots to play at superhuman strength.

4
00:00:13,320 --> 00:00:17,480
Using tree search, supervised deep learning, and reinforcement learning to build such a

5
00:00:17,480 --> 00:00:19,840
bot.

6
00:00:19,840 --> 00:00:23,879
Implementing your own version of DeepMind's AlphaGo engine

7
00:00:23,879 --> 00:00:29,920
When DeepMind's Go bot AlphaGo played Move 37 of Game 2 against Lee Sedol in 2016, it

8
00:00:29,920 --> 00:00:32,160
took the Go world by storm.

9
00:00:32,160 --> 00:00:36,240
Commentator Michael Redmond, a professional player with nearly a thousand top-level games

10
00:00:36,240 --> 00:00:39,080
under his belt, did a double-take on air.

11
00:00:39,080 --> 00:00:43,279
He even briefly removed the stone from the demo board while looking around as if to confirm

12
00:00:43,279 --> 00:00:45,520
that AlphaGo made the right move.

13
00:00:45,520 --> 00:00:50,840
"I still don't really understand the mechanics of it," Redmond told the American Go E-Journal

14
00:00:50,840 --> 00:00:52,560
the next day.

15
00:00:52,560 --> 00:00:58,000
Lee, the world's dominant player of the past decade, spent twelve minutes studying

16
00:00:58,000 --> 00:01:00,500
the board before responding.

17
00:01:00,500 --> 00:01:05,879
Figure 13.1 displays the legendary move.

18
00:01:05,879 --> 00:01:11,599
Figure 13.1—the legendary shoulder hit that AlphaGo played against Lee Sedol in the second

19
00:01:11,599 --> 00:01:13,400
game of their series.

20
00:01:13,400 --> 00:01:17,440
This move stunned many professional players.

21
00:01:17,440 --> 00:01:20,199
The move defied conventional Go theory.

22
00:01:20,199 --> 00:01:24,879
The diagonal approach, or shoulder hit, is an invitation for the white stone to extend

23
00:01:24,879 --> 00:01:27,639
along the side and make a solid wall.

24
00:01:27,639 --> 00:01:31,639
If the white stone is on the third line and the black stone is on the fourth line, this

25
00:01:31,639 --> 00:01:34,440
is considered a roughly even exchange.

26
00:01:34,440 --> 00:01:38,800
White gets points on the side, while black gets influence toward the center.

27
00:01:38,800 --> 00:01:43,599
But when the white stone is on the fourth line, the wall locks up too much territory.

28
00:01:43,599 --> 00:01:48,279
To any strong Go players who are reading, we apologize for drastically oversimplifying

29
00:01:48,279 --> 00:01:49,279
this.

30
00:01:49,279 --> 00:01:54,480
A fifth-line shoulder hit looks a little amateurish, or at least it did until Professor AlphaGo took

31
00:01:54,480 --> 00:01:57,440
four out of five games against a legend.

32
00:01:57,440 --> 00:02:01,120
The shoulder hit was the first of many surprises from AlphaGo.

33
00:02:01,120 --> 00:02:05,720
Fast forward a year, and everyone from top pros to casual club players is experimenting

34
00:02:05,720 --> 00:02:07,900
with AlphaGo moves.

35
00:02:07,900 --> 00:02:12,360
In this chapter, you're going to learn how AlphaGo works by implementing all of its building

36
00:02:12,360 --> 00:02:13,360
blocks.

37
00:02:13,360 --> 00:02:18,679
AlphaGo is a clever combination of supervised deep learning from professional Go records,

38
00:02:18,679 --> 00:02:23,960
which you learned about in chapters 5 through 8, deep reinforcement learning with self-play,

39
00:02:23,960 --> 00:02:28,800
covered in chapters 9 through 12, and using these deep networks to improve tree search

40
00:02:28,800 --> 00:02:30,639
in a novel way.

41
00:02:30,639 --> 00:02:35,080
You might be surprised by how much you already know about the ingredients of AlphaGo.

42
00:02:35,080 --> 00:02:41,600
To be more precise, the AlphaGo system we'll be describing in detail works as follows.

43
00:02:41,600 --> 00:02:46,639
You start off by training two deep convolutional neural networks, policy networks, for move

44
00:02:46,639 --> 00:02:48,240
prediction.

45
00:02:48,240 --> 00:02:52,960
One of these network architectures is a bit deeper and produces more accurate results,

46
00:02:52,960 --> 00:02:56,440
whereas the other one is smaller and faster to evaluate.

47
00:02:56,440 --> 00:03:00,839
We'll call them the strong and fast policy networks, respectively.

48
00:03:00,839 --> 00:03:05,279
The strong and fast policy networks use a slightly more sophisticated board encoder

49
00:03:05,279 --> 00:03:07,600
with 48 feature planes.

50
00:03:07,600 --> 00:03:11,759
They also use a deeper architecture than what you've seen in chapters 6 and 7, but other

51
00:03:11,759 --> 00:03:14,759
than that, they should look familiar.

52
00:03:14,759 --> 00:03:19,639
Section 13.1 covers AlphaGo's policy network architectures.

53
00:03:19,639 --> 00:03:24,199
After the first training step of policy networks is complete, you take the strong policy network

54
00:03:24,199 --> 00:03:28,320
as a starting point for self-play in section 13.2.

55
00:03:28,320 --> 00:03:32,059
If you do this with a lot of compute power, this will result in a massive improvement

56
00:03:32,059 --> 00:03:34,039
of your bot.

57
00:03:34,039 --> 00:03:38,320
As a next step, you take the strong self-play network to derive a value network from it

58
00:03:38,320 --> 00:03:40,720
in section 13.3.

59
00:03:40,720 --> 00:03:44,600
This completes the network training stage, and you don't do any deep learning after this

60
00:03:44,600 --> 00:03:46,360
point.

61
00:03:46,360 --> 00:03:50,759
To play a game of Go, you use tree search as a basis for play, but instead of plain

62
00:03:50,759 --> 00:03:55,639
Monte Carlo rollouts as in chapter 4, you use the fast policy network to guide the next

63
00:03:55,639 --> 00:03:56,639
steps.

64
00:03:56,639 --> 00:04:00,860
Also, you balance the output of this tree search algorithm with what your value function

65
00:04:00,860 --> 00:04:01,860
tells you.

66
00:04:01,860 --> 00:04:06,720
We'll tell you all about this innovation in section 13.4.

67
00:04:06,720 --> 00:04:11,240
Performing this whole process from training policies to self-play to running games with

68
00:04:11,240 --> 00:04:16,679
search on a superhuman level requires massive compute resources and time.

69
00:04:16,679 --> 00:04:21,640
Section 13.5 gives you some ideas on what it took to make AlphaGo as strong as it is,

70
00:04:21,640 --> 00:04:25,040
and what to expect from your own experiments.

71
00:04:25,040 --> 00:04:29,640
Figure 13.2 gives an overview of the whole process we just sketched.

72
00:04:29,640 --> 00:04:33,440
Throughout the chapter, we'll zoom into parts of this diagram and provide you with more

73
00:04:33,440 --> 00:04:37,000
details in the respective sections.

74
00:04:37,000 --> 00:04:43,119
Figure 13.2, how to train the three neural networks that power the AlphaGo AI.

75
00:04:43,119 --> 00:04:47,519
Starting with a collection of human game records, you can train two neural networks to predict

76
00:04:47,519 --> 00:04:52,480
the next move, a small, fast network and a large, strong network.

77
00:04:52,480 --> 00:04:56,399
You can then further improve the playing strength of the large network through reinforcement

78
00:04:56,399 --> 00:04:57,399
learning.

79
00:04:57,399 --> 00:05:01,880
The self-play games also provide data to train a value network.

80
00:05:01,880 --> 00:05:06,279
AlphaGo then uses all three networks in a tree search algorithm that can produce incredibly

81
00:05:06,279 --> 00:05:09,000
strong gameplay.

82
00:05:09,000 --> 00:05:13,399
Section 13.1, training deep neural networks for AlphaGo.

83
00:05:13,399 --> 00:05:19,040
In the introduction, you learned that AlphaGo uses three neural networks, two policy networks

84
00:05:19,040 --> 00:05:21,320
and one value network.

85
00:05:21,320 --> 00:05:25,799
Although this may seem like a lot at first, in this section, you'll see that these networks

86
00:05:25,799 --> 00:05:31,079
and the input features that feed into them are conceptually close to each other.

87
00:05:31,079 --> 00:05:35,119
Perhaps the most surprising part about deep learning as used in AlphaGo is how much you

88
00:05:35,119 --> 00:05:39,279
already know about it after completing chapters 5 to 12.

89
00:05:39,279 --> 00:05:43,480
Before we go into details of how these neural networks are built and trained, let's quickly

90
00:05:43,480 --> 00:05:47,239
discuss their role in the AlphaGo system.

91
00:05:47,239 --> 00:05:49,720
Fast policy network.

92
00:05:49,720 --> 00:05:54,200
This Go move-prediction network is comparable in size to the networks you trained in chapters

93
00:05:54,200 --> 00:05:55,799
7 and 8.

94
00:05:55,799 --> 00:06:00,160
Its purpose isn't to be the most accurate move predictor, but rather a good predictor

95
00:06:00,160 --> 00:06:03,119
that's really fast at predicting moves.

96
00:06:03,119 --> 00:06:08,200
This network is used in section 13.4 in tree search rollouts, and you've seen in chapter

97
00:06:08,200 --> 00:06:12,880
4 that you need to generate a lot of rollouts quickly for tree search to be viable.

98
00:06:12,880 --> 00:06:17,500
We'll put a little less emphasis on this network and focus on the following two.

99
00:06:17,500 --> 00:06:19,519
Strong policy network.

100
00:06:19,519 --> 00:06:23,720
This move prediction network is optimized for accuracy, not speed.

101
00:06:23,720 --> 00:06:27,679
It's a convolutional network that's deeper than the fast version and can be more than

102
00:06:27,679 --> 00:06:30,559
twice as good at predicting Go moves.

103
00:06:30,559 --> 00:06:35,519
Like the fast version, this network is trained on human gameplay data, as you did in chapter

104
00:06:35,519 --> 00:06:36,920
7.

105
00:06:36,920 --> 00:06:41,019
After this training step is completed, the strong policy network is used as a starting

106
00:06:41,019 --> 00:06:46,440
point for self-play by using reinforcement learning techniques from chapters 9 and 10.

107
00:06:46,440 --> 00:06:50,200
This step will make this policy network even stronger.

108
00:06:50,200 --> 00:06:52,160
Value network.

109
00:06:52,160 --> 00:06:56,760
The self-play games played by the strong policy network generate a new data set that you can

110
00:06:56,760 --> 00:06:59,079
use to train a value network.

111
00:06:59,079 --> 00:07:03,440
Specifically, you use the outcome of these games and the techniques from chapters 11

112
00:07:03,440 --> 00:07:06,559
and 12 to learn a value function.

113
00:07:06,559 --> 00:07:12,119
This value network will then play an integral role in section 13.4.

114
00:07:12,119 --> 00:07:16,920
Section 13.1.1, network architectures in AlphaGo.

115
00:07:16,920 --> 00:07:20,519
Now that you roughly know what each of the three deep neural networks is used for in

116
00:07:20,519 --> 00:07:25,839
AlphaGo, we can show you how to build these networks in Python using Keras.

117
00:07:25,839 --> 00:07:30,359
Here's a quick description of the network architectures before we show you the code.

118
00:07:30,359 --> 00:07:34,720
If you need a refresher on terminology for convolutional networks, have a look at chapter

119
00:07:34,720 --> 00:07:36,440
7 again.

120
00:07:36,440 --> 00:07:40,679
The strong policy network is a 13-layer convolutional network.

121
00:07:40,679 --> 00:07:44,160
All of these layers produce 19 by 19 filters.

122
00:07:44,160 --> 00:07:48,279
You consistently keep the original board size across the whole network.

123
00:07:48,279 --> 00:07:53,320
For this to work, you need to pad the inputs accordingly, as you did in chapter 7.

124
00:07:53,320 --> 00:07:58,119
The first convolutional layer has a kernel size of 5, and all following layers work with

125
00:07:58,119 --> 00:08:00,200
a kernel size of 3.

126
00:08:00,200 --> 00:08:05,440
The last layer uses softmax activations and has one output filter, and the first 12 layers

127
00:08:05,440 --> 00:08:10,720
use ReLU activations and have 192 output filters each.

128
00:08:10,720 --> 00:08:15,720
The value network is a 16-layer convolutional network, the first 12 of which are exactly

129
00:08:15,720 --> 00:08:18,640
the same as the strong policy network.

130
00:08:18,640 --> 00:08:23,600
Layer 13 is an additional convolutional layer, structurally identical to layers 2 through

131
00:08:23,600 --> 00:08:24,959
12.

132
00:08:24,959 --> 00:08:30,119
Layer 14 is a convolutional layer with kernel size 1 and one output filter.

133
00:08:30,119 --> 00:08:36,640
The network is topped off with two dense layers, one with 256 outputs and ReLU activations,

134
00:08:36,640 --> 00:08:40,599
and a final one with one output and tanh activation.

135
00:08:40,599 --> 00:08:45,880
As you can see, both policy and value networks in AlphaGo are the same kind of deep convolutional

136
00:08:46,159 --> 00:08:49,440
neural network that you already encountered in chapter 6.

137
00:08:49,440 --> 00:08:54,080
The fact that these two networks are so similar allows you to define them in a single Python

138
00:08:54,080 --> 00:08:55,760
function.

139
00:08:55,760 --> 00:09:00,599
Before doing so, we introduce a little shortcut in Keras that shortens the network definition

140
00:09:00,599 --> 00:09:02,239
quite a bit.

141
00:09:02,239 --> 00:09:07,760
Recall from chapter 7 that you can pad input images in Keras with the ZeroPadding2D utility

142
00:09:07,760 --> 00:09:08,760
layer.

143
00:09:08,760 --> 00:09:12,700
It's perfectly fine to do so, but you can save some ink in your model definition by

144
00:09:12,700 --> 00:09:16,340
moving the padding into the Conv2D layer.

145
00:09:16,340 --> 00:09:21,539
What you want to do in both value and policy networks is to pad the input to each convolutional

146
00:09:21,539 --> 00:09:27,500
layer so that the output filters have the same size as the input, 19 by 19.

147
00:09:27,500 --> 00:09:33,140
For instance, instead of explicitly padding the 19 by 19 input of the first layer to 23

148
00:09:33,140 --> 00:09:40,140
by 23 images so that the following convolutional layer with kernel size 5 produces 19 by 19

149
00:09:40,140 --> 00:09:44,859
output filters, you tell the convolutional layer to retain the input size.

150
00:09:44,859 --> 00:09:49,900
You do this by providing the argument padding='same' to your convolutional layer, which

151
00:09:49,900 --> 00:09:52,559
will take care of the padding for you.
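
As a quick sketch, assuming the channels-first, 48-plane input described later in this chapter, both approaches produce 19 by 19 output filters:

from keras.models import Sequential
from keras.layers import Conv2D, ZeroPadding2D

# Option 1: explicit padding, as in chapter 7.
model = Sequential()
model.add(ZeroPadding2D(padding=2, input_shape=(48, 19, 19),
                        data_format='channels_first'))  # 19x19 -> 23x23
model.add(Conv2D(192, (5, 5), data_format='channels_first',
                 activation='relu'))                    # 23x23 -> 19x19

# Option 2: let the convolutional layer handle the padding itself.
model = Sequential()
model.add(Conv2D(192, (5, 5), padding='same', input_shape=(48, 19, 19),
                 data_format='channels_first', activation='relu'))  # stays 19x19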

152
00:09:52,559 --> 00:09:57,299
With this neat shortcut in mind, let's define the first 11 layers that AlphaGo's policy

153
00:09:57,299 --> 00:09:59,659
and value networks have in common.

154
00:09:59,659 --> 00:10:07,219
You find this definition in our GitHub repository in alphago.py in the dlgo.networks module.

155
00:10:07,299 --> 00:10:14,580
Listing 13.1, initializing a neural network for both policy and value networks in AlphaGo.
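
A sketch of this shared stack, assuming Keras's Sequential API; the helper name alphago_layer_stack is ours, and the repository wraps the same idea in a single alphago_model function:

from keras.models import Sequential
from keras.layers import Conv2D

def alphago_layer_stack(input_shape, num_filters=192):
    # The 11 layers shared by policy and value networks: kernel size 5
    # for the first layer, 3 for the rest, ReLU activations, and
    # padding='same' so every layer keeps the 19 x 19 board size.
    model = Sequential()
    model.add(Conv2D(num_filters, 5, input_shape=input_shape,
                     padding='same', data_format='channels_first',
                     activation='relu'))
    for _ in range(10):
        model.add(Conv2D(num_filters, 3, padding='same',
                         data_format='channels_first', activation='relu'))
    return model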

156
00:10:14,580 --> 00:10:18,299
Note that you didn't yet specify the input shape of the first layer.

157
00:10:18,299 --> 00:10:22,179
That's because that shape differs slightly for policy and value networks.

158
00:10:22,179 --> 00:10:27,099
You'll see the difference when we introduce the AlphaGo board encoder in the next section.

159
00:10:27,099 --> 00:10:31,739
To continue the definition of model, you're just one final convolutional layer away from

160
00:10:31,739 --> 00:10:35,299
defining the strong policy network.

161
00:10:35,299 --> 00:10:41,500
Listing 13.2, creating AlphaGo's strong policy network in Keras.
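
A sketch of this step, reusing the alphago_layer_stack helper from above; the 1 x 1 kernel of the final convolution is an assumption, and the 48-plane input shape anticipates the encoder discussed in the next section:

from keras.layers import Conv2D, Flatten

policy_input_shape = (48, 19, 19)  # 48 feature planes for policy networks
strong_policy = alphago_layer_stack(policy_input_shape)
# Final layer: a single output filter with softmax activation, then a
# Flatten layer to yield one move probability per board point.
strong_policy.add(Conv2D(1, 1, padding='same',
                         data_format='channels_first',
                         activation='softmax'))
strong_policy.add(Flatten())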

162
00:10:41,500 --> 00:10:46,419
As you can see, you add a final Flatten layer to flatten the predictions and ensure consistency

163
00:10:46,419 --> 00:10:50,820
with your previous model definitions from chapters 5 to 8.

164
00:10:50,820 --> 00:10:56,299
If you want to return AlphaGo's value network instead, adding two more Conv2D layers, two

165
00:10:56,299 --> 00:11:01,820
Dense layers, and one Flatten layer to connect them will do the job.

166
00:11:01,820 --> 00:11:07,380
Listing 13.3, building AlphaGo's value network in Keras.
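
A sketch of the value network along the same lines, again building on the alphago_layer_stack helper; the activation of layer 14 isn't spelled out above, so the ReLU there is an assumption:

from keras.layers import Conv2D, Dense, Flatten

value_input_shape = (49, 19, 19)  # one extra feature plane for value networks
alphago_value = alphago_layer_stack(value_input_shape)
alphago_value.add(Conv2D(192, 3, padding='same',
                         data_format='channels_first', activation='relu'))
alphago_value.add(Conv2D(1, 1, padding='same',
                         data_format='channels_first',
                         activation='relu'))  # assumed activation
alphago_value.add(Flatten())
alphago_value.add(Dense(256, activation='relu'))
alphago_value.add(Dense(1, activation='tanh'))  # value between -1 and 1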

167
00:11:07,380 --> 00:11:11,580
We don't explicitly discuss the architecture of the fast policy network here.

168
00:11:11,580 --> 00:11:16,099
The definition of input features and network architecture of the fast policy is technically

169
00:11:16,099 --> 00:11:21,140
involved and doesn't contribute to a deeper understanding of the AlphaGo system.

170
00:11:21,140 --> 00:11:26,460
For your own experiments, it's perfectly fine to use one of the networks from our dlgo.networks

171
00:11:26,460 --> 00:11:30,619
module, such as small, medium, or large.

172
00:11:30,619 --> 00:11:35,260
The main idea for the fast policy is to have a smaller network than the strong policy that's

173
00:11:35,260 --> 00:11:37,140
quick to evaluate.
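
For instance, a stand-in fast policy could be sketched like this, assuming the small network module and its layers helper from chapter 7:

from keras.models import Sequential
from keras.layers import Dense
from dlgo.networks import small

input_shape = (48, 19, 19)
fast_policy = Sequential()
for layer in small.layers(input_shape):  # the compact network from chapter 7
    fast_policy.add(layer)
# One move probability per board point, as with the strong policy.
fast_policy.add(Dense(19 * 19, activation='softmax'))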

174
00:11:37,140 --> 00:11:42,340
We'll guide you through the training process in more detail throughout the next sections.

175
00:11:42,340 --> 00:11:46,500
Section 13.1.2, the AlphaGo board encoder.

176
00:11:46,500 --> 00:11:51,219
Now that you know all about the network architectures used in AlphaGo, let's discuss how to encode

177
00:11:51,219 --> 00:11:53,659
Go board data the AlphaGo way.

178
00:11:53,659 --> 00:11:58,700
You've implemented quite a few board encoders in chapters 6 and 7 already, including OnePlane,

179
00:11:58,700 --> 00:12:04,059
SevenPlane, and Simple, all of which you stored in the dlgo.encoders module.

180
00:12:04,059 --> 00:12:07,900
The feature planes used in AlphaGo are just a little more sophisticated than what you've

181
00:12:07,900 --> 00:12:13,979
encountered before, but represent a natural continuation of the encoders shown so far.

182
00:12:13,979 --> 00:12:18,380
The AlphaGo board encoder for policy networks has 48 feature planes.

183
00:12:18,380 --> 00:12:22,780
For value networks, you augment these features with one additional plane.

184
00:12:22,780 --> 00:12:27,859
These 48 planes are made up of 11 concepts, some of which you've used before and others

185
00:12:27,940 --> 00:12:29,299
that are new.

186
00:12:29,299 --> 00:12:31,979
We'll discuss each of them in more detail.

187
00:12:31,979 --> 00:12:37,500
In general, AlphaGo makes a bit more use of Go-specific tactical situations than the board

188
00:12:37,500 --> 00:12:40,419
encoder examples we've discussed so far.

189
00:12:40,419 --> 00:12:45,179
A prime example of this is making the concept of ladder captures and escapes (see figure

190
00:12:45,179 --> 00:12:49,380
13.3) part of the feature set.

191
00:12:49,380 --> 00:12:51,380
Figure 13.3.

192
00:12:51,380 --> 00:12:56,020
AlphaGo encoded many Go tactical concepts directly into its feature planes, including

193
00:12:56,020 --> 00:12:57,020
ladders.

194
00:12:57,020 --> 00:13:01,900
In the first example, a white stone has just one liberty, meaning black could capture on

195
00:13:01,900 --> 00:13:03,619
the next turn.

196
00:13:03,619 --> 00:13:08,739
The white player extends the white stone to gain an extra liberty, but black can again

197
00:13:08,739 --> 00:13:11,940
reduce the white stones to one liberty.

198
00:13:11,940 --> 00:13:16,700
This sequence continues until it hits the edge of the board, where white is captured.

199
00:13:16,700 --> 00:13:21,219
On the other hand, if there's a white stone in the path of the ladder, white may be able

200
00:13:21,219 --> 00:13:23,460
to escape capture.

201
00:13:23,460 --> 00:13:29,460
AlphaGo included feature planes that indicated whether a ladder would be successful.

202
00:13:29,460 --> 00:13:33,580
A technique you consistently used in all of your Go board encoders that's also present

203
00:13:33,580 --> 00:13:37,099
in AlphaGo is the use of binary features.

204
00:13:37,099 --> 00:13:41,739
For instance, when encoding liberties (the empty points adjacent to a stone), you didn't just

205
00:13:41,739 --> 00:13:46,640
use one feature plane with liberty counts for each stone on the board, but chose a binary

206
00:13:46,640 --> 00:13:53,719
representation with planes indicating whether a stone had one, two, three, or more liberties.

207
00:13:53,719 --> 00:13:59,580
In AlphaGo, you see the exact same idea, but with eight feature planes to binarize counts.

208
00:13:59,580 --> 00:14:05,359
In the example of liberties, that means eight planes to indicate one, two, three, four,

209
00:14:05,359 --> 00:14:10,219
five, six, seven, or at least eight liberties for a stone.
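
To make this binarization concrete, here's a small illustrative helper; the function and plane layout are ours, not the repository's exact encoder code:

import numpy as np

def liberty_planes(liberty_counts, num_planes=8, size=19):
    # liberty_counts: size x size array of liberty counts, 0 where
    # there's no stone. Plane p is 1 where a stone has exactly p + 1
    # liberties; the last plane marks num_planes or more liberties.
    planes = np.zeros((num_planes, size, size))
    for p in range(num_planes - 1):
        planes[p][liberty_counts == p + 1] = 1
    planes[num_planes - 1][liberty_counts >= num_planes] = 1
    return planes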

210
00:14:10,219 --> 00:14:13,880
The only fundamental difference from what you've seen in chapters 6 to 8 is that

211
00:14:13,880 --> 00:14:18,080
AlphaGo doesn't encode stone color explicitly in separate feature planes.

212
00:14:18,080 --> 00:14:22,239
Recall that in the seven-plane encoder from chapter 7, you had liberty planes for

213
00:14:22,239 --> 00:14:24,440
both black and white stones.

214
00:14:24,440 --> 00:14:28,440
In AlphaGo, you have only one set of features counting liberties.

215
00:14:28,440 --> 00:14:33,119
Additionally, all features are expressed in terms of the player to play next.

216
00:14:33,119 --> 00:14:37,320
For instance, the feature set capture size, counting the number of stones that would be

217
00:14:37,320 --> 00:14:41,919
captured by a move, counts the stones the current player would capture, whatever stone

218
00:14:41,919 --> 00:14:44,440
color this might be.

219
00:14:44,440 --> 00:14:48,799
Table 13.1 summarizes all the features used in AlphaGo.

220
00:14:48,799 --> 00:14:55,739
The first 48 planes are used for policy networks, and the last one only for value networks.

221
00:14:55,739 --> 00:15:01,799
Table 13.1, Feature Planes Used in AlphaGo.

222
00:15:01,799 --> 00:15:07,320
The implementation of these features can be found in our GitHub repository under alphago.py

223
00:15:07,320 --> 00:15:10,119
in the dlgo.encoders module.

224
00:15:10,119 --> 00:15:14,840
Although implementing each of the feature sets from Table 13.1 isn't difficult, it's

225
00:15:14,840 --> 00:15:19,200
also not particularly interesting when compared to all the exciting parts making up AlphaGo

226
00:15:19,200 --> 00:15:21,960
that still lie ahead of us.

227
00:15:21,960 --> 00:15:26,159
Implementing ladder captures is somewhat tricky, and encoding the number of turns since a move

228
00:15:26,159 --> 00:15:30,679
was played requires modifications to your Go board definition.

229
00:15:30,679 --> 00:15:36,239
So if you're interested in how this can be done, check out our implementation on GitHub.

230
00:15:36,239 --> 00:15:40,520
Let's quickly look at how an AlphaGo encoder can be initialized so you can use it to train

231
00:15:40,520 --> 00:15:42,280
deep neural networks.

232
00:15:42,280 --> 00:15:46,679
You provide a Go board size and a Boolean called use_player_plane that indicates whether

233
00:15:46,679 --> 00:15:48,760
to use the 49th feature plane.

234
00:15:48,760 --> 00:15:52,440
This is shown in the following listing.

235
00:15:52,440 --> 00:15:58,719
Listing 13.4, Signature and Initialization of Your AlphaGo Board Encoder.
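
A sketch of that signature, following the encoder conventions from chapter 6; attribute names are modeled on the repository but may differ in detail:

class AlphaGoEncoder:
    def __init__(self, board_size=(19, 19), use_player_plane=True):
        self.board_width, self.board_height = board_size
        self.use_player_plane = use_player_plane
        # 48 feature planes for policy networks, plus the 49th
        # player plane when encoding input for the value network.
        self.num_planes = 48 + use_player_plane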

236
00:15:58,719 --> 00:16:03,760
Section 13.1.3, Training AlphaGo Style Policy Networks.

237
00:16:03,760 --> 00:16:08,640
Having network architectures and input features ready, the first step of training policy networks

238
00:16:08,640 --> 00:16:13,719
for AlphaGo follows the exact procedure we introduced in Chapter 7, specifying a board

239
00:16:13,719 --> 00:16:18,960
encoder and an agent, loading Go data, and training the agent with this data.

240
00:16:18,960 --> 00:16:21,799
Figure 13.4 illustrates the process.

241
00:16:21,799 --> 00:16:25,679
The fact that you use slightly more elaborate features and networks doesn't change this

242
00:16:25,679 --> 00:16:28,080
one bit.

243
00:16:29,080 --> 00:16:34,599
The supervised training process for AlphaGo's policy networks is exactly the same as the

244
00:16:34,599 --> 00:16:37,239
flow covered in Chapters 6 and 7.

245
00:16:37,239 --> 00:16:41,119
You replay human game records and reproduce the game states.

246
00:16:41,119 --> 00:16:43,760
Each game state is encoded as a tensor.

247
00:16:43,760 --> 00:16:47,119
This diagram shows a tensor with only two planes.

248
00:16:47,119 --> 00:16:49,479
AlphaGo used 48 planes.

249
00:16:49,479 --> 00:16:54,440
The training target is a vector the same size as the board, with a one where the human actually

250
00:16:54,440 --> 00:16:55,440
played.

251
00:16:56,280 --> 00:17:01,119
To initialize and train AlphaGo's strong policy network, you first need to instantiate an

252
00:17:01,119 --> 00:17:06,400
AlphaGo encoder and create two Go data generators for training and testing, just as you did

253
00:17:06,400 --> 00:17:07,880
in Chapter 7.

254
00:17:07,880 --> 00:17:14,839
You find this step on GitHub under examples/alphago/alphago_policy_sl.py.

255
00:17:14,839 --> 00:17:20,719
Listing 13.5, Loading Data for the First Step of Training AlphaGo's Policy Network.
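
A sketch of the data-loading step, assuming the GoDataProcessor and generator API from chapter 7 and the encoder module path from the repository:

from dlgo.data.parallel_processor import GoDataProcessor
from dlgo.encoders.alphago import AlphaGoEncoder

rows, cols = 19, 19
num_classes = rows * cols
num_games = 10000  # adjust to your compute budget

encoder = AlphaGoEncoder()
processor = GoDataProcessor(encoder=encoder.name())
generator = processor.load_go_data('train', num_games, use_generator=True)
test_generator = processor.load_go_data('test', num_games, use_generator=True)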

256
00:17:21,040 --> 00:17:25,959
Next, you can load AlphaGo's policy network by using the alphago_model function defined

257
00:17:25,959 --> 00:17:31,199
earlier in this section and compile this Keras model with categorical cross-entropy and stochastic

258
00:17:31,199 --> 00:17:32,760
gradient descent.

259
00:17:32,760 --> 00:17:38,000
We call this model alphago_sl_policy to signify that it's a policy network trained by supervised

260
00:17:38,000 --> 00:17:40,920
learning (SL).

261
00:17:40,920 --> 00:17:46,880
Listing 13.6, Creating an AlphaGo Policy Network with Keras.
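
A sketch of this step, continuing from the generators above; the input shape is channels first, matching the encoder's 48 planes:

from dlgo.networks.alphago import alphago_model

input_shape = (encoder.num_planes, rows, cols)
alphago_sl_policy = alphago_model(input_shape, is_policy_net=True)
alphago_sl_policy.compile(optimizer='sgd',
                          loss='categorical_crossentropy',
                          metrics=['accuracy'])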

262
00:17:46,880 --> 00:17:51,160
Now all that's left for this first stage of training is to call fit_generator on this

263
00:17:51,160 --> 00:17:56,520
policy network, using both training and test generators as you did in Chapter 7.

264
00:17:56,520 --> 00:18:01,239
Apart from using a larger network and a more sophisticated encoder, this is precisely what

265
00:18:01,239 --> 00:18:04,280
you did in Chapters 6 to 8.

266
00:18:04,280 --> 00:18:08,680
After training has finished, you can create a deep learning agent from this model and encoder

267
00:18:08,680 --> 00:18:13,599
and store it for the next two training phases that we discuss next.

268
00:18:13,599 --> 00:18:18,719
Listing 13.7, Training and Persisting a Policy Network.
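
A sketch of this final supervised step; the epoch and batch-size values are placeholders, and DeepLearningAgent is the agent class from chapter 8:

import h5py
from keras.callbacks import ModelCheckpoint
from dlgo.agent.predict import DeepLearningAgent

epochs = 200       # placeholder; tune to your hardware
batch_size = 128
alphago_sl_policy.fit_generator(
    generator=generator.generate(batch_size, num_classes),
    epochs=epochs,
    steps_per_epoch=generator.get_num_samples() / batch_size,
    validation_data=test_generator.generate(batch_size, num_classes),
    validation_steps=test_generator.get_num_samples() / batch_size,
    callbacks=[ModelCheckpoint('alphago_sl_policy_{epoch}.h5')])

# Wrap model and encoder in an agent and persist it for the next phases.
alphago_sl_agent = DeepLearningAgent(alphago_sl_policy, encoder)
with h5py.File('alphago_sl_policy.h5', 'w') as sl_agent_out:
    alphago_sl_agent.serialize(sl_agent_out)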

269
00:18:18,719 --> 00:18:22,880
For the sake of simplicity, in this chapter you don't need to train fast and strong policy

270
00:18:22,880 --> 00:18:26,719
networks separately, as in the original AlphaGo paper.

271
00:18:26,719 --> 00:18:32,000
Instead of training a smaller and faster second policy network, you can use the alphago_sl_agent

272
00:18:32,000 --> 00:18:33,839
as the fast policy.

273
00:18:33,839 --> 00:18:38,079
In the next section, you'll see how to use this agent as a starting point for reinforcement

274
00:18:38,079 --> 00:18:41,119
learning, which will lead to a stronger policy network.


