﻿1
00:00:04,490 --> 00:00:07,078
so I hope you enjoyed last week's

2
00:00:07,278 --> 00:00:11,910
tutorial on TensorFlow and this week we

3
00:00:12,109 --> 00:00:13,290
again have something very special for

4
00:00:13,490 --> 00:00:15,960
you Simon Osindero here will give a

5
00:00:16,160 --> 00:00:18,449
lecture about neural networks back

6
00:00:18,649 --> 00:00:20,699
propagation how to train those networks

7
00:00:20,899 --> 00:00:24,660
and so on and it's really quite special

8
00:00:24,859 --> 00:00:26,399
to have Simon here he really is an

9
00:00:26,599 --> 00:00:29,129
expert on the topic he also works at

10
00:00:29,329 --> 00:00:32,059
DeepMind in the deep learning group he is

11
00:00:32,259 --> 00:00:34,948
educated locally to some degree at least

12
00:00:35,149 --> 00:00:38,239
yeah so a Master's at

13
00:00:38,439 --> 00:00:42,869
Cambridge then PhD at UCL and then later

14
00:00:43,070 --> 00:00:47,128
worked with Geoff Hinton in Canada so

15
00:00:47,329 --> 00:00:49,529
there couldn't be a better person to do

16
00:00:49,729 --> 00:00:51,779
this before we start just a quick

17
00:00:51,979 --> 00:00:54,298
announcement Terry Williams who attended

18
00:00:54,499 --> 00:00:56,608
here last week he's running a reading

19
00:00:56,808 --> 00:00:58,318
group on deep learning and the game of

20
00:00:58,518 --> 00:01:02,579
Go I'll put this book cover and his card

21
00:01:02,780 --> 00:01:04,108
here on the table in case anyone's

22
00:01:04,308 --> 00:01:06,329
interested it's basically a new book

23
00:01:06,530 --> 00:01:10,769
that came out that tries to explain deep

24
00:01:10,969 --> 00:01:14,488
learning based on on the game of go in

25
00:01:14,688 --> 00:01:17,069
the wake of AlphaGo okay thank you very

26
00:01:17,269 --> 00:01:20,459
much over to you Simon. Hi there and good

27
00:01:20,659 --> 00:01:24,599
afternoon everyone can everyone hear

28
00:01:24,799 --> 00:01:32,640
me okay? You can hear me okay? Yes so as I

29
00:01:32,840 --> 00:01:33,000
was saying

30
00:01:33,200 --> 00:01:34,709
today's lecture is just me covering some

31
00:01:34,909 --> 00:01:36,238
of the foundations of neural networks

32
00:01:36,438 --> 00:01:39,028
and I'm guessing that some of you will

33
00:01:39,228 --> 00:01:41,278
be quite familiar with the material that

34
00:01:41,478 --> 00:01:43,259
we're going to go over today and I hope

35
00:01:43,459 --> 00:01:44,459
that most of you have seen bits of it

36
00:01:44,659 --> 00:01:46,140
before but nevertheless it's kind of

37
00:01:46,340 --> 00:01:48,599
good to go back over the foundations to

38
00:01:48,799 --> 00:01:49,980
make sure that they're very solid and

39
00:01:50,180 --> 00:01:51,418
also one of the things that I'm going to

40
00:01:51,618 --> 00:01:52,948
hope to do as we go through is in

41
00:01:53,149 --> 00:01:54,929
addition to kind of conveying some of

42
00:01:55,129 --> 00:01:56,369
the mathematics also try and give you a

43
00:01:56,569 --> 00:01:58,588
sense of the intuition to get a kind of

44
00:01:58,789 --> 00:01:59,969
deeper and more visceral understanding

45
00:02:00,170 --> 00:02:04,018
of what's going on and as we go through

46
00:02:04,218 --> 00:02:06,149
there'll be a couple of natural section

47
00:02:06,349 --> 00:02:08,279
breaks between the sections so that's

48
00:02:08,479 --> 00:02:09,750
probably a good time to do questions

49
00:02:09,949 --> 00:02:11,160
from the preceding section if there are

50
00:02:11,360 --> 00:02:15,090
any and we'll also have a break in

51
00:02:15,289 --> 00:02:16,800
the middle probably two-thirds of the

52
00:02:17,000 --> 00:02:17,660
way through

53
00:02:17,860 --> 00:02:20,030
and then the last point is these

54
00:02:20,229 --> 00:02:20,960
slides are all going to be available

55
00:02:21,159 --> 00:02:23,060
online and in the slides I've added

56
00:02:23,259 --> 00:02:25,430
quite a few hyperlinks out to additional

57
00:02:25,629 --> 00:02:27,500
material which if one of the topics

58
00:02:27,699 --> 00:02:28,310
we're talking about is particularly

59
00:02:28,509 --> 00:02:29,780
interesting to you you can kind of go

60
00:02:29,979 --> 00:02:37,130
off and read more about that okay and so

61
00:02:37,330 --> 00:02:39,980
this slide is in some sense a TL;DR of

62
00:02:40,180 --> 00:02:42,590
what we're going to do today and at a

63
00:02:42,789 --> 00:02:44,390
high level it's also kind of a TL;DR of

64
00:02:44,590 --> 00:02:46,250
what we're going to do in this entire

65
00:02:46,449 --> 00:02:49,070
course so deep learning with neural

66
00:02:49,270 --> 00:02:50,810
networks is actually pretty simple as

67
00:02:51,009 --> 00:02:52,280
it's more or less just the composition

68
00:02:52,479 --> 00:02:54,410
of linear transforms and nonlinear

69
00:02:54,610 --> 00:02:57,439
functions and it turns out that by

70
00:02:57,639 --> 00:02:59,689
composing these quite simple building

71
00:02:59,889 --> 00:03:02,360
blocks into large graphs we gain

72
00:03:02,560 --> 00:03:05,450
massively powerful flexible

73
00:03:05,650 --> 00:03:07,550
modeling power and when I say massive

74
00:03:07,750 --> 00:03:09,620
I do mean quite massive so these

75
00:03:09,819 --> 00:03:12,170
days we routinely train neural networks

76
00:03:12,370 --> 00:03:13,340
with hundreds of millions of parameters

77
00:03:13,539 --> 00:03:16,700
and when I say training or learning what

78
00:03:16,900 --> 00:03:18,640
does that mean well it basically means

79
00:03:18,840 --> 00:03:21,439
optimizing a loss function that in some

80
00:03:21,639 --> 00:03:22,310
sense describes a problem we're

81
00:03:22,509 --> 00:03:25,340
interested in over some data set or in

82
00:03:25,539 --> 00:03:27,349
the case of reinforcement learning with

83
00:03:27,549 --> 00:03:30,620
respect to world experience with

84
00:03:30,819 --> 00:03:32,569
respect to our parameters and we do that

85
00:03:32,769 --> 00:03:34,969
using various gradient optimization

86
00:03:35,169 --> 00:03:36,800
methods one of the most common of those

87
00:03:37,000 --> 00:03:38,840
is SGD or stochastic gradient descent

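[The training loop just described, a loss minimized over a data set with respect to parameters via SGD, could be sketched as follows. This is an illustrative sketch, not code from the lecture; the model, data, and learning rate are all made up.]

```python
import numpy as np

# Minimal SGD sketch (illustrative, not from the lecture):
# fit y = w*x + b by stochastic gradient descent on a squared-error loss.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0                 # synthetic data: true w = 3.0, true b = 1.0

w, b, lr = 0.0, 0.0, 0.1
for step in range(2000):
    i = rng.integers(len(x))      # "stochastic": one random sample per step
    err = (w * x[i] + b) - y[i]   # derivative of 0.5 * err**2 w.r.t. the prediction
    w -= lr * err * x[i]          # gradient step on the weight
    b -= lr * err                 # gradient step on the bias
```

[On this noiseless toy data the parameters approach the true values of 3.0 and 1.0.]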
88
00:03:39,039 --> 00:03:42,020
and so from a thousand feet that's

89
00:03:42,219 --> 00:03:43,939
that's kind of it it's pretty simple but

90
00:03:44,139 --> 00:03:46,219
in this course what we're going to do is

91
00:03:46,419 --> 00:03:48,410
look at the details of the different

92
00:03:48,610 --> 00:03:50,120
building blocks when you might want to

93
00:03:50,319 --> 00:03:52,340
make certain choices and also how to do

94
00:03:52,539 --> 00:03:57,230
this well at a very large scale so

95
00:03:57,430 --> 00:03:59,180
before we dive in let's step back a

96
00:03:59,379 --> 00:04:01,009
little bit and ask why are we doing this

97
00:04:01,209 --> 00:04:02,920
what are neural nets good for and it

98
00:04:03,120 --> 00:04:04,969
turns out they're actually useful for a

99
00:04:05,169 --> 00:04:06,740
whole ton of things but these days you

100
00:04:06,939 --> 00:04:08,810
know I think a better question is you

101
00:04:09,009 --> 00:04:09,770
know if you can come up with the right

102
00:04:09,969 --> 00:04:11,569
loss function and acquire training data

103
00:04:11,769 --> 00:04:14,660
what are neural nets not good for so

104
00:04:14,860 --> 00:04:17,480
just to kind of go over some examples in

105
00:04:17,680 --> 00:04:19,639
recent years we've seen some very

106
00:04:19,839 --> 00:04:22,069
impressive steps forward in computer

107
00:04:22,269 --> 00:04:24,889
vision we can now recognize objects in

108
00:04:25,089 --> 00:04:27,020
images with very high accuracy there's

109
00:04:27,220 --> 00:04:29,740
all sorts of cool more esoteric

110
00:04:29,939 --> 00:04:31,439
applications where folks have done

111
00:04:31,639 --> 00:04:34,499
some very nice work looking at

112
00:04:34,699 --> 00:04:38,610
doing superhuman recognition of human

113
00:04:38,810 --> 00:04:41,129
emotions by having a neural network that

114
00:04:41,329 --> 00:04:42,989
can recognize micro-expressions on folks'

115
00:04:43,189 --> 00:04:45,299
faces so essentially reading

116
00:04:45,499 --> 00:04:48,629
human emotions better than humans later in

117
00:04:48,829 --> 00:04:50,269
this course there'll be a module on

118
00:04:50,468 --> 00:04:52,468
sequence models with a recurrent neural

119
00:04:52,668 --> 00:04:54,718
networks and there we've seen incredible

120
00:04:54,918 --> 00:04:57,629
gains in speech recognition one of the

121
00:04:57,829 --> 00:04:59,100
cool things again in recent years that

122
00:04:59,300 --> 00:05:02,968
came up is this idea of using neural

123
00:05:03,168 --> 00:05:04,739
networks for machine translation and

124
00:05:04,939 --> 00:05:07,290
furthermore it turns out that you can

125
00:05:07,490 --> 00:05:09,769
use neural networks for multilingual

126
00:05:09,968 --> 00:05:16,930
machine translation so the echo is

127
00:05:19,360 --> 00:05:22,528
hello hello, maybe the mic's not on

128
00:05:22,728 --> 00:05:23,939
I'll raise my voice

129
00:05:24,139 --> 00:05:25,319
yeah please do raise your hand if

130
00:05:25,519 --> 00:05:26,579
you're having trouble hearing

131
00:05:26,778 --> 00:05:30,749
me yes so one of the particularly cool

132
00:05:30,949 --> 00:05:33,360
things that came out in the last year or

133
00:05:33,560 --> 00:05:35,489
so is this idea of doing multilingual

134
00:05:35,689 --> 00:05:36,869
translation through a common

135
00:05:37,069 --> 00:05:39,509
representation so we can translate from

136
00:05:39,709 --> 00:05:43,989
many languages into many other languages

137
00:05:49,778 --> 00:05:52,990
[inaudible]

138
00:06:03,209 --> 00:06:04,579
okay

139
00:06:04,779 --> 00:06:08,749
is that better for folks right yeah this

140
00:06:08,949 --> 00:06:10,129
notion of a kind of interlingua so

141
00:06:10,329 --> 00:06:12,528
if we have a common representation space

142
00:06:12,728 --> 00:06:14,509
that is the bottleneck when we're

143
00:06:14,709 --> 00:06:15,800
translating from one language to another

144
00:06:16,000 --> 00:06:18,170
then in a very real sense you can think

145
00:06:18,370 --> 00:06:19,968
of the representations in that space as

146
00:06:20,168 --> 00:06:21,800
some kind of interlingua so it's kind

147
00:06:22,000 --> 00:06:23,600
of representing concepts across many

148
00:06:23,800 --> 00:06:26,540
different languages along similar lines

149
00:06:26,740 --> 00:06:28,670
there's been some excellent work from

150
00:06:28,870 --> 00:06:31,160
DeepMind on speech synthesis so going

151
00:06:31,360 --> 00:06:35,809
from text to speech and WaveNet was

152
00:06:36,009 --> 00:06:37,670
something that was developed at DeepMind

153
00:06:37,870 --> 00:06:39,619
starting back two years ago and now it's

154
00:06:39,819 --> 00:06:42,079
in production so a lot of the voices

155
00:06:42,279 --> 00:06:43,910
that you'll hear in say Google home or

156
00:06:44,110 --> 00:06:45,619
Google assistant are now synthesized

157
00:06:45,819 --> 00:06:47,600
with WaveNet so a very fast turnaround

158
00:06:47,800 --> 00:06:51,249
from research to large-scale deployment

159
00:06:51,449 --> 00:06:54,699
another place where we've seen

160
00:06:54,899 --> 00:06:56,959
impressive uses is in reinforcement

161
00:06:57,158 --> 00:06:58,218
learning and you'll hear much more about

162
00:06:58,418 --> 00:06:59,930
that in the other half of the course so

163
00:07:00,129 --> 00:07:03,860
things like DQN or A3C and applying that

164
00:07:04,060 --> 00:07:05,870
to game settings like Atari and then

165
00:07:06,069 --> 00:07:08,809
also moving into more realistic games and 3D

166
00:07:09,009 --> 00:07:11,749
environments also with reinforcement

167
00:07:11,949 --> 00:07:13,430
learning you guys are all probably

168
00:07:13,629 --> 00:07:17,509
familiar with AlphaGo which was able to

169
00:07:17,709 --> 00:07:20,680
beat the human world champion at go and

170
00:07:20,879 --> 00:07:23,778
has now even superseded that by playing

171
00:07:23,978 --> 00:07:25,129
games just against itself so not even

172
00:07:25,329 --> 00:07:27,920
using any human data now the list

173
00:07:28,120 --> 00:07:30,170
goes on and in all these cases what

174
00:07:30,370 --> 00:07:32,240
we're dealing with is pretty simple and

175
00:07:32,439 --> 00:07:34,490
there's just a couple of different

176
00:07:34,689 --> 00:07:37,040
elements let's see I'll grab a laser pointer

177
00:07:37,240 --> 00:07:41,749
yeah cool yes so we essentially have our

178
00:07:41,949 --> 00:07:43,689
neural network so we define some

179
00:07:43,889 --> 00:07:46,610
architecture we have our inputs so it

180
00:07:46,810 --> 00:07:48,528
could be images spectrograms

181
00:07:48,728 --> 00:07:51,619
you name it we have parameters that

182
00:07:51,819 --> 00:07:53,718
define the network and some outputs that

183
00:07:53,918 --> 00:07:55,939
we want to predict and essentially all

184
00:07:56,139 --> 00:07:58,730
we're doing is formulating a loss

185
00:07:58,930 --> 00:08:00,949
function between our inputs and our

186
00:08:01,149 --> 00:08:03,199
outputs and then optimizing that loss

187
00:08:03,399 --> 00:08:04,790
function with respect to our parameters

188
00:08:04,990 --> 00:08:07,300
and again at a high level

189
00:08:07,500 --> 00:08:09,290
everything we're doing is very simple

190
00:08:09,490 --> 00:08:13,399
but the devil is in the details so

191
00:08:13,598 --> 00:08:14,730
here's a road map

192
00:08:14,930 --> 00:08:19,020
for most of the rest of today so

193
00:08:19,220 --> 00:08:20,670
the field of neural networks has been

194
00:08:20,870 --> 00:08:22,740
around for a long time and there's a

195
00:08:22,939 --> 00:08:24,870
fairly rich history so there's you know

196
00:08:25,069 --> 00:08:26,249
not time to cover all that today what

197
00:08:26,449 --> 00:08:27,990
we are going to cover today and in

198
00:08:28,189 --> 00:08:29,699
the course overall are the things that

199
00:08:29,899 --> 00:08:31,319
are having the most impact right now but

200
00:08:31,519 --> 00:08:33,539
I just wanted to begin by calling out

201
00:08:33,740 --> 00:08:35,429
some of the topics that I think are

202
00:08:35,629 --> 00:08:36,870
interesting but that we're not going to

203
00:08:37,070 --> 00:08:39,750
cover and I'd also encourage you to kind

204
00:08:39,950 --> 00:08:41,758
of delve into the history of the field

205
00:08:41,958 --> 00:08:42,958
if there are particular topics that

206
00:08:43,158 --> 00:08:44,549
you're interested in because there's a

207
00:08:44,750 --> 00:08:46,469
lot of work dating back to the sort of

208
00:08:46,669 --> 00:08:48,689
early 2000s and even the 80s and 90s that

209
00:08:48,889 --> 00:08:52,649
is probably worth revisiting in the rest

210
00:08:52,850 --> 00:08:54,539
of the course we'll begin with a treatment

211
00:08:54,740 --> 00:08:55,948
of single-layer networks and just seeing

212
00:08:56,149 --> 00:08:58,828
ok what can we do with just one layer of

213
00:08:59,028 --> 00:09:01,349
weights and neurons we'll then move on

214
00:09:01,549 --> 00:09:03,750
to talk about the advantages that we get

215
00:09:03,950 --> 00:09:07,258
by adding just one hidden layer and then

216
00:09:07,458 --> 00:09:09,328
we'll kind of switch gears and kind of

217
00:09:09,528 --> 00:09:11,669
focus on what I call modern deep nets

218
00:09:11,870 --> 00:09:14,639
so here it's useful just to think in

219
00:09:14,839 --> 00:09:17,008
terms of abstract compute graphs and

220
00:09:17,208 --> 00:09:20,370
we'll see some very large networks and

221
00:09:20,570 --> 00:09:22,740
also how to think about composing those

222
00:09:22,940 --> 00:09:25,979
in software there'll be a session and

223
00:09:26,179 --> 00:09:27,299
this is probably the most math heavy

224
00:09:27,500 --> 00:09:29,429
part of today on learning and so there

225
00:09:29,629 --> 00:09:31,740
will kind of recap some concepts from

226
00:09:31,940 --> 00:09:35,399
calculus and vector algebra and then

227
00:09:35,600 --> 00:09:37,199
we'll talk about modular backprop and

228
00:09:37,399 --> 00:09:39,149
automatic differentiation and those are

229
00:09:39,350 --> 00:09:41,639
tools that allow us to build these

230
00:09:41,839 --> 00:09:43,620
extremely esoteric graphs without having

231
00:09:43,820 --> 00:09:45,269
to think too much about how learning

232
00:09:45,470 --> 00:09:48,269
operates I'll talk a bit about what I'm

233
00:09:48,470 --> 00:09:50,819
calling a model Zoo so when we think

234
00:09:51,019 --> 00:09:52,439
about these networks in terms of these

235
00:09:52,639 --> 00:09:54,089
modules then what are the building

236
00:09:54,289 --> 00:09:55,409
blocks that we can use to construct them

237
00:09:55,610 --> 00:09:58,378
from and then toward the end I'll

238
00:09:58,578 --> 00:10:00,029
touch on some kind of practical topics

239
00:10:00,230 --> 00:10:01,439
in terms of if you're actually doing this

240
00:10:01,639 --> 00:10:02,549
in practice what are things that you

241
00:10:02,750 --> 00:10:03,990
might want to be aware of what tricks

242
00:10:04,190 --> 00:10:05,309
you can use to sort of diagnose if

243
00:10:05,509 --> 00:10:06,990
things are going wrong and maybe we'll

244
00:10:07,190 --> 00:10:11,370
talk about a research topic yes but as I

245
00:10:11,570 --> 00:10:12,719
was saying it's a large field with many

246
00:10:12,919 --> 00:10:17,219
branches dating back depending on when you

247
00:10:17,419 --> 00:10:18,779
count to the 60s and then

248
00:10:18,980 --> 00:10:20,189
there was another resurgence in the 80s

249
00:10:20,389 --> 00:10:22,919
so a couple of things that I think are

250
00:10:23,120 --> 00:10:24,328
interesting that won't be covered in

251
00:10:24,528 --> 00:10:26,669
this lecture course are Boltzmann machines

252
00:10:26,870 --> 00:10:28,169
and Hopfield networks

253
00:10:28,370 --> 00:10:32,639
they were developed back in the 80s

254
00:10:32,839 --> 00:10:34,828
and for quite a while were extremely

255
00:10:35,028 --> 00:10:37,679
popular and there was some interesting

256
00:10:37,879 --> 00:10:39,059
early work I guess in the second wave of

257
00:10:39,259 --> 00:10:41,129
neural networks they're not in

258
00:10:41,330 --> 00:10:43,500
favor as much now but I think they're

259
00:10:43,700 --> 00:10:45,078
still useful so particularly for

260
00:10:45,278 --> 00:10:46,979
situations where we're interested in

261
00:10:47,179 --> 00:10:48,870
models of memory and in particular

262
00:10:49,070 --> 00:10:51,839
associative memory so I think for me

263
00:10:52,039 --> 00:10:52,979
that's that's one thing that's worth

264
00:10:53,179 --> 00:10:55,828
revisiting another area that was

265
00:10:56,028 --> 00:10:57,629
popular at one time that doesn't

266
00:10:57,830 --> 00:11:00,419
receive as much attention now is models

267
00:11:00,620 --> 00:11:01,620
that operate in the continuous time

268
00:11:01,820 --> 00:11:03,870
domain so in particular spiking neural

269
00:11:04,070 --> 00:11:06,149
networks and one of the reasons that

270
00:11:06,350 --> 00:11:07,620
they're interesting is that it's a

271
00:11:07,820 --> 00:11:10,409
different learning paradigm but if you

272
00:11:10,610 --> 00:11:12,359
have that kind of model it's possible to

273
00:11:12,559 --> 00:11:14,969
do extremely efficient implementations

274
00:11:15,169 --> 00:11:17,490
in hardware so you can have very

275
00:11:17,690 --> 00:11:18,709
low-power

276
00:11:18,909 --> 00:11:21,839
neural networks so as I said yeah

277
00:11:22,039 --> 00:11:23,879
there's lots of things to look at I'd

278
00:11:24,080 --> 00:11:25,679
encourage you to look at the history of

279
00:11:25,879 --> 00:11:26,909
the field in addition to the stuff that

280
00:11:27,110 --> 00:11:34,078
we cover in this course oh and one last

281
00:11:34,278 --> 00:11:36,839
thing at a high level a small caveat

282
00:11:37,039 --> 00:11:40,828
on terminology and this is a little bit of

283
00:11:41,028 --> 00:11:43,649
a function of the history of the field

284
00:11:43,850 --> 00:11:46,289
we sometimes use different names to

285
00:11:46,490 --> 00:11:49,500
refer to the same thing so I'll try and

286
00:11:49,700 --> 00:11:51,149
be consistent but I'm sure I won't

287
00:11:51,350 --> 00:11:54,149
manage it fully so for instance people

288
00:11:54,350 --> 00:11:56,519
interchangeably might use the word unit

289
00:11:56,720 --> 00:12:00,508
or neuron to describe the

290
00:12:00,708 --> 00:12:02,328
activity in a single element of a layer

291
00:12:02,528 --> 00:12:06,179
similarly you might hear non-linearity

292
00:12:06,379 --> 00:12:08,099
or activation function and they they

293
00:12:08,299 --> 00:12:10,199
also mean the same thing slightly

294
00:12:10,399 --> 00:12:12,929
trickier is that we sometimes use the

295
00:12:13,129 --> 00:12:14,699
same name to refer to different things

296
00:12:14,899 --> 00:12:18,179
so in the more traditional view of the

297
00:12:18,379 --> 00:12:20,849
field folks would refer to the composition

298
00:12:21,049 --> 00:12:22,979
of say a linear transformation plus a

299
00:12:23,179 --> 00:12:25,370
non-linearity as a layer

300
00:12:25,570 --> 00:12:27,839
in more modern parlance particularly

301
00:12:28,039 --> 00:12:28,859
when we're thinking about implementation

302
00:12:29,059 --> 00:12:30,359
in things like TensorFlow then we

303
00:12:30,559 --> 00:12:33,659
kind of tend to describe as a layer

304
00:12:33,860 --> 00:12:35,429
these more atomic operations so in this

305
00:12:35,629 --> 00:12:37,529
case we'd call the linear transformation

306
00:12:37,730 --> 00:12:40,639
one layer and the

307
00:12:40,839 --> 00:12:44,519
nonlinearity another layer and link to

308
00:12:44,720 --> 00:12:45,750
that there's also slightly different

309
00:12:45,950 --> 00:12:47,939
graphical conventions when we're

310
00:12:48,139 --> 00:12:49,859
depicting models it should usually be

311
00:12:50,059 --> 00:12:51,120
obvious from context but I just wanted

312
00:12:51,320 --> 00:12:52,199
to call that out just in case that's

313
00:12:52,399 --> 00:12:56,490
confusing okay so as I said we're gonna

314
00:12:56,690 --> 00:12:58,589
start off with what can we do with

315
00:12:58,789 --> 00:13:00,269
single layer networks and to begin with

316
00:13:00,470 --> 00:13:02,490
I'm gonna make a very short digression

317
00:13:02,690 --> 00:13:06,990
on real neurons and describe some of the

318
00:13:07,190 --> 00:13:08,939
kind of inspiration for the artificial

319
00:13:09,139 --> 00:13:10,769
neurons we use it's a very loose

320
00:13:10,970 --> 00:13:11,909
connection and I won't dwell there too

321
00:13:12,110 --> 00:13:14,849
much we'll then talk about what we can do

322
00:13:15,049 --> 00:13:17,099
with a linear layer and sigmoid activation

323
00:13:17,299 --> 00:13:20,689
function and then we'll kind of recap

324
00:13:20,889 --> 00:13:23,129
binary classification or logistic

325
00:13:23,330 --> 00:13:24,899
regression which should have been in

326
00:13:25,100 --> 00:13:26,399
either the last lecture or in the

327
00:13:26,600 --> 00:13:28,828
lecture before that and then we'll

328
00:13:29,028 --> 00:13:30,990
move on from binary classification into

329
00:13:31,190 --> 00:13:36,000
multi-class classification okay so in

330
00:13:36,200 --> 00:13:38,370
the slide here in the bottom right this

331
00:13:38,570 --> 00:13:42,089
is a cartoon depiction of a real neuron

332
00:13:42,289 --> 00:13:43,740
so there's a couple things going on we

333
00:13:43,940 --> 00:13:46,919
have a cell body the dendrites which is

334
00:13:47,120 --> 00:13:48,328
where the inputs from other neurons are

335
00:13:48,528 --> 00:13:51,539
received and then the axon with the

336
00:13:51,740 --> 00:13:52,799
terminal bulbs and that's kind of the

337
00:13:53,000 --> 00:13:55,139
output from this neuron and more or less

338
00:13:55,339 --> 00:13:57,689
the way this operates when a neuron is

339
00:13:57,889 --> 00:13:59,459
active an electrical impulse travels

340
00:13:59,659 --> 00:14:01,859
down the axon it reaches the terminal

341
00:14:02,059 --> 00:14:03,719
bulb which causes vesicles of

342
00:14:03,919 --> 00:14:05,758
neurotransmitter to be released those

343
00:14:05,958 --> 00:14:08,000
kind of diffuse across the gap between

344
00:14:08,200 --> 00:14:09,959
this neuron and the neuron that it's

345
00:14:10,159 --> 00:14:12,240
communicating with when it's received in

346
00:14:12,440 --> 00:14:14,399
the dendrites it causes a depolarization

347
00:14:14,600 --> 00:14:17,370
that eventually makes its way back to

348
00:14:17,570 --> 00:14:20,879
the cell body and B so some of the

349
00:14:21,080 --> 00:14:21,929
depolarizations

350
00:14:22,129 --> 00:14:23,939
from all these dendrites is what

351
00:14:24,139 --> 00:14:25,740
determines whether or not the receiving

352
00:14:25,940 --> 00:14:30,059
neuron is going to fire or not and in a

353
00:14:30,259 --> 00:14:32,519
very very coarse way this process of

354
00:14:32,720 --> 00:14:34,979
receiving inputs of different strengths

355
00:14:35,179 --> 00:14:37,019
and integrating it in the cell body is

356
00:14:37,220 --> 00:14:41,069
what this equation is describing so it's

357
00:14:41,269 --> 00:14:43,199
just a weighted sum of inputs or an

358
00:14:43,399 --> 00:14:45,120
affine transformation if you will so the

359
00:14:45,320 --> 00:14:50,549
inputs X the weights W and maybe

360
00:14:50,750 --> 00:14:53,000
some bias B and so

361
00:14:53,200 --> 00:14:55,899
this is what we'd call a simple linear neuron

362
00:14:56,100 --> 00:14:58,219
if we have a whole collection of them

363
00:14:58,419 --> 00:15:00,709
then we can move into matrix vector

364
00:15:00,909 --> 00:15:04,789
notation so this vector Y is a vector of

365
00:15:04,990 --> 00:15:08,329
linear neuron states and we obtain that

366
00:15:08,529 --> 00:15:10,729
by doing a matrix vector multiplication

367
00:15:10,929 --> 00:15:12,709
between the inputs and our weight matrix

368
00:15:12,909 --> 00:15:16,879
and some bias vector B and there's not

369
00:15:17,080 --> 00:15:18,529
an awful lot we can do with that setup

370
00:15:18,730 --> 00:15:20,509
but we are able to do linear regression

371
00:15:20,710 --> 00:15:22,719
which I think you guys saw previously

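[The matrix-vector form just described, a whole layer of linear neurons computed as y = Wx + b, could be sketched as follows. This is an illustrative sketch; the layer sizes and values are made up.]

```python
import numpy as np

# A layer of linear neurons in matrix-vector notation, y = Wx + b.
# Sizes are illustrative: 3 inputs feeding 2 linear neurons.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # weight matrix: one row of weights per neuron
b = rng.normal(size=2)        # one bias per neuron
x = rng.normal(size=3)        # input vector

y = W @ x + b                 # the whole layer in one matrix-vector product

# Same result as computing each neuron's weighted sum of inputs individually.
y_each = np.array([W[i] @ x + b[i] for i in range(2)])
```

[Each row of W holds one neuron's incoming weights, so the matrix product does all the weighted sums at once.]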
372
00:15:22,919 --> 00:15:26,059
but in practice we typically combine

373
00:15:26,259 --> 00:15:28,009
these linear layers with some

374
00:15:28,210 --> 00:15:29,659
non-linearity and particularly for

375
00:15:29,860 --> 00:15:33,109
stacking them in depth so let's

376
00:15:33,309 --> 00:15:33,889
take a look at one of those

377
00:15:34,090 --> 00:15:35,269
nonlinearities and this will kind of

378
00:15:35,470 --> 00:15:37,759
complete the picture of our artificial

379
00:15:37,960 --> 00:15:41,269
neuron so what I'm showing here is

380
00:15:41,470 --> 00:15:43,459
something called the sigmoid function

381
00:15:43,659 --> 00:15:45,019
you can think of it as a kind of

382
00:15:45,220 --> 00:15:48,889
squashing function so this equation here

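[The sigmoid squashing function just named, 1 / (1 + exp(-z)), could be sketched as follows, along with its saturating behaviour; the weight, bias, and input values are made up for illustration.]

```python
import numpy as np

# The sigmoid squashing function applied to an affine transform w.x + b,
# i.e. one artificial sigmoid neuron. Numbers are illustrative.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.0])
b = -0.5
x = np.array([1.0, 0.5])

out = sigmoid(w @ x + b)      # w.x + b = 1.0, squashed into (0, 1)

# Well below threshold the output is near 0; well above, it saturates near 1.
low, mid, high = sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0)
```

[This saturation at both extremes is the "squashing" the lecture refers to, and also the reason for the poor gradient properties mentioned shortly afterward.]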
383
00:15:49,090 --> 00:15:50,449
describes the input-output relationship

384
00:15:50,649 --> 00:15:52,359
and so when we combine that with the

385
00:15:52,559 --> 00:15:55,609
linear mapping from previously we have a

386
00:15:55,809 --> 00:15:57,439
weighted sum of inputs offset by a bias

387
00:15:57,639 --> 00:15:58,939
and then we pass it through this

388
00:15:59,139 --> 00:16:00,949
squashing function and this in a very

389
00:16:01,149 --> 00:16:04,099
coarse way reproduces what happens in it

390
00:16:04,299 --> 00:16:05,509
in a real neuron when it receives input

391
00:16:05,710 --> 00:16:09,129
so there's some threshold below which

392
00:16:09,330 --> 00:16:11,329
the neuron isn't going to fire at all

393
00:16:11,529 --> 00:16:13,789
once it's above threshold then it

394
00:16:13,990 --> 00:16:15,919
increases its firing rate but there's

395
00:16:16,120 --> 00:16:17,659
only so fast that a real neuron can fire

396
00:16:17,860 --> 00:16:21,769
and so it saturates and so at a

397
00:16:21,970 --> 00:16:23,240
very high level that's what this

398
00:16:23,440 --> 00:16:28,039
function is is performing for us it used

399
00:16:28,240 --> 00:16:31,459
to be that this was the sort of

400
00:16:31,659 --> 00:16:32,870
canonical choice in neural networks so if

401
00:16:33,070 --> 00:16:34,959
you look at papers particularly from the

402
00:16:35,159 --> 00:16:37,159
90s or the early two-thousands you'll

403
00:16:37,360 --> 00:16:39,559
see this kind of activation function

404
00:16:39,759 --> 00:16:41,509
everywhere it's not that common anymore

405
00:16:41,710 --> 00:16:43,429
and we'll go into some of the reasons

406
00:16:43,629 --> 00:16:45,859
why but at a high level it doesn't have

407
00:16:46,059 --> 00:16:48,919
as nice gradient properties as we'd like

408
00:16:49,120 --> 00:16:49,969
when we're building these very deep

409
00:16:50,169 --> 00:16:53,029
models however it is still actively

410
00:16:53,230 --> 00:16:54,229
used in a couple of places so in

411
00:16:54,429 --> 00:16:56,659
particular for gating units if we want

412
00:16:56,860 --> 00:16:59,899
to kind of have some kind of soft

413
00:17:00,100 --> 00:17:01,969
differentiable switch and one of the

414
00:17:02,169 --> 00:17:03,139
most common places that you'll see this

415
00:17:03,340 --> 00:17:06,368
is in long short-term memory cells

416
00:17:06,568 --> 00:17:08,229
which you'll hear a lot more about in the

417
00:17:08,429 --> 00:17:14,259
class on recurrent networks so yeah as I

418
00:17:14,459 --> 00:17:17,948
said even with just a simple linear

419
00:17:18,148 --> 00:17:19,448
sigmoid neuron we can actually do useful

420
00:17:19,648 --> 00:17:23,169
things so I just grabbed this purple box

421
00:17:23,369 --> 00:17:24,848
here I grabbed some earlier slides so

422
00:17:25,048 --> 00:17:26,848
there's a slight change in notation but

423
00:17:27,048 --> 00:17:30,009
if you think back to logistic regression

424
00:17:30,210 --> 00:17:32,919
what do we have we have a linear model a

425
00:17:33,119 --> 00:17:34,629
link function and then a cross-entropy

426
00:17:34,829 --> 00:17:37,389
loss and this linear model is exactly

427
00:17:37,589 --> 00:17:39,938
what's going on in this linear layer and

428
00:17:40,138 --> 00:17:42,969
the link function is what the sigmoid is

429
00:17:43,169 --> 00:17:45,009
doing so there's an extremely tight

430
00:17:45,210 --> 00:17:48,219
relationship between logistic regression

431
00:17:48,419 --> 00:17:51,159
and binary classification and these

432
00:17:51,359 --> 00:17:53,108
layers in in a neural network and so

433
00:17:53,308 --> 00:17:54,549
with just a single neuron we can

434
00:17:54,750 --> 00:17:57,729
actually build a binary classifier so in

435
00:17:57,929 --> 00:17:59,709
this toy example I've got two classes 0

436
00:17:59,909 --> 00:18:03,219
& 1 if I arrange to have my weight

437
00:18:03,419 --> 00:18:04,448
vector pointing in this direction so

438
00:18:04,648 --> 00:18:07,500
orthogonal to this red separating plane

439
00:18:07,700 --> 00:18:10,328
and I adjust the strength of the weights

440
00:18:10,528 --> 00:18:12,428
and the biases appropriately then I can

441
00:18:12,628 --> 00:18:14,409
have a system where when I give it an

442
00:18:14,609 --> 00:18:17,428
input from class 0 the output is 0 and

443
00:18:17,628 --> 00:18:19,808
when I give it an input from class 1 the

444
00:18:20,009 --> 00:18:25,119
output is 1 so that was binary

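The single-neuron binary classifier just described can be sketched as follows (the weight and bias values are hypothetical, hand-picked rather than learned):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy 2-D problem: the weight vector points from class 0 towards
# class 1, orthogonal to the separating plane; the bias shifts the
# plane, and scaling w sharpens the decision.
w = np.array([4.0, 4.0])
b = -4.0

def predict(x):
    """Linear model followed by the sigmoid link function."""
    return sigmoid(w @ x + b)

x_class0 = np.array([0.0, 0.0])
x_class1 = np.array([2.0, 2.0])
print(predict(x_class0))  # close to 0
print(predict(x_class1))  # close to 1
```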
445
00:18:25,319 --> 00:18:26,318
classification we're now going to move

446
00:18:26,519 --> 00:18:28,869
on and discuss something called a soft

447
00:18:29,069 --> 00:18:31,149
max layer and this essentially extends

448
00:18:31,349 --> 00:18:33,818
binary classification into multi-class

449
00:18:34,019 --> 00:18:36,938
classification so this type of layer is

450
00:18:37,138 --> 00:18:39,068
a way to allow us to do either

451
00:18:39,269 --> 00:18:40,629
multi-class classification another place

452
00:18:40,829 --> 00:18:42,719
that you you might see this used is

453
00:18:42,919 --> 00:18:45,009
internally in networks if you need to do

454
00:18:45,210 --> 00:18:46,869
some kind of multi-way switching so if

455
00:18:47,069 --> 00:18:49,088
say you have a junction in your network

456
00:18:49,288 --> 00:18:50,678
and there's multiple different inputs

457
00:18:50,878 --> 00:18:54,909
and one of them needs to be routed this

458
00:18:55,109 --> 00:18:56,108
is something you can use as a kind of

459
00:18:56,308 --> 00:19:00,789
multi way gating mechanism so what does

460
00:19:00,990 --> 00:19:02,799
it actually do well if we first think

461
00:19:03,000 --> 00:19:05,979
about the Arg max function so when we

462
00:19:06,179 --> 00:19:08,558
apply that to some input vector X all

463
00:19:08,759 --> 00:19:11,678
but the largest element is zero and the

464
00:19:11,878 --> 00:19:15,088
largest element is one the softmax is

465
00:19:15,288 --> 00:19:17,769
essentially just a soft version of the

466
00:19:17,970 --> 00:19:19,838
Arg max so rather than

467
00:19:20,038 --> 00:19:21,908
only the largest element being one and

468
00:19:22,108 --> 00:19:23,828
everything else being zero the largest

469
00:19:24,028 --> 00:19:25,509
element will be the one that's closest

470
00:19:25,710 --> 00:19:27,219
to one the others will be close to zero

471
00:19:27,419 --> 00:19:30,399
and the sum of activities across the output

472
00:19:30,599 --> 00:19:32,740
vector is one which is what we want so it also gives

473
00:19:32,940 --> 00:19:35,828
us a probability distribution the

474
00:19:36,028 --> 00:19:37,598
mathematical form is here so we have

475
00:19:37,798 --> 00:19:41,019
these exponentials and I don't know if the

476
00:19:41,220 --> 00:19:42,519
resolution is high enough on this

477
00:19:42,720 --> 00:19:44,408
monitor but what I'm showing in these

478
00:19:44,608 --> 00:19:46,598
two bar plots here is two slightly

479
00:19:46,798 --> 00:19:49,629
different scenarios so the red bars are

480
00:19:49,829 --> 00:19:52,688
the inputs the blue bars are the outputs

481
00:19:52,888 --> 00:19:56,139
and the scale of the red bars in

482
00:19:56,339 --> 00:19:58,418
the lower plot is double that of the one

483
00:19:58,618 --> 00:20:02,588
in the upper plot so in this example

484
00:20:02,788 --> 00:20:05,168
here the output for the largest

485
00:20:05,368 --> 00:20:08,558
input is the largest and you can't quite

486
00:20:08,759 --> 00:20:11,048
see but it's about 0.6 so the closest to

487
00:20:11,249 --> 00:20:13,448
one however if I increase the magnitude

488
00:20:13,648 --> 00:20:15,219
of all the inputs so that the ratios are

489
00:20:15,419 --> 00:20:17,379
still the same but now this is 0.9 so

490
00:20:17,579 --> 00:20:19,209
it's much much closer to 1 so as the

491
00:20:19,409 --> 00:20:20,889
scale of the inputs gets larger and

492
00:20:21,089 --> 00:20:22,688
larger this gets closer and closer to

493
00:20:22,888 --> 00:20:28,298
doing a hard max operation and so what

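The softmax and its scale-dependent sharpening can be sketched like this (a minimal NumPy version; the max-subtraction is a standard numerical-stability trick, not something from the lecture):

```python
import numpy as np

def softmax(x):
    """Exponentiate then normalize, so the outputs are positive and
    sum to 1 -- a probability distribution, a soft arg max."""
    z = np.exp(x - np.max(x))  # subtract max for numerical stability
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])
p1 = softmax(x)
p2 = softmax(2.0 * x)   # same ratios between inputs, doubled scale

print(p1)        # largest output around 0.67
print(p2)        # largest output around 0.87 -- closer to a hard arg max
print(p1.sum())  # 1.0
```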
494
00:20:28,499 --> 00:20:29,709
can we use this for well as I said we

495
00:20:29,909 --> 00:20:32,348
can use it to do multi way

496
00:20:32,548 --> 00:20:34,838
classification so if you combine this

497
00:20:35,038 --> 00:20:38,558
kind of unit with a cross entropy loss

498
00:20:38,759 --> 00:20:40,448
we're able to train something that will

499
00:20:40,648 --> 00:20:43,298
do classification of inputs into one of

500
00:20:43,499 --> 00:20:46,509
several different classes so let's take

501
00:20:46,710 --> 00:20:48,759
a look at what this relationship looks

502
00:20:48,960 --> 00:20:52,778
like so the output for the i-th element

503
00:20:52,979 --> 00:20:53,889
which you can think of it as the

504
00:20:54,089 --> 00:20:55,448
probability that the input is assigned

505
00:20:55,648 --> 00:20:59,948
to class I is given by in the numerator

506
00:21:00,148 --> 00:21:01,778
we have an exponential of a weighted

507
00:21:01,979 --> 00:21:04,658
sum of inputs plus a bias and then this

508
00:21:04,858 --> 00:21:06,219
is normalized by that same expression

509
00:21:06,419 --> 00:21:08,469
over all the other possible outputs so

510
00:21:08,669 --> 00:21:10,678
we have a probability distribution and

511
00:21:10,878 --> 00:21:13,269
in a sense you can think of what's going

512
00:21:13,470 --> 00:21:15,668
on in this exponent as being the amount

513
00:21:15,868 --> 00:21:17,619
of evidence that we have for the

514
00:21:17,819 --> 00:21:22,328
presence of the i-th class and how do we

515
00:21:22,528 --> 00:21:24,250
train this how do we learn we can just

516
00:21:24,450 --> 00:21:26,379
do that by minimizing the negative log

517
00:21:26,579 --> 00:21:28,159
likelihood or accordingly

518
00:21:28,359 --> 00:21:29,930
the cross-entropy of the true labels

519
00:21:30,130 --> 00:21:32,809
under our predictive distribution in

520
00:21:33,009 --> 00:21:34,899
terms of notation how we represent that

521
00:21:35,099 --> 00:21:36,649
something that you commonly see these

522
00:21:36,849 --> 00:21:39,829
things called one-hot vectors to encode

523
00:21:40,029 --> 00:21:42,559
the true class label and what's that look

524
00:21:42,759 --> 00:21:44,750
like well basically it's a vector that

525
00:21:44,950 --> 00:21:48,229
is of the dimensionality of the output

526
00:21:48,429 --> 00:21:51,289
space the element for the true class

527
00:21:51,490 --> 00:21:52,669
like the the entry for the element of

528
00:21:52,869 --> 00:21:54,440
the true class label is one and

529
00:21:54,640 --> 00:21:56,899
everything else is zero so it's this

530
00:21:57,099 --> 00:21:58,490
vector here in the example above these

531
00:21:58,690 --> 00:22:01,309
digits so for digit four the one-hot

532
00:22:01,509 --> 00:22:03,229
label vector would look like this so the

533
00:22:03,429 --> 00:22:04,969
fourth element is one everything else is

534
00:22:05,169 --> 00:22:09,049
zero if we plug this into our expression

535
00:22:09,250 --> 00:22:11,240
for the negative log likelihood then we

536
00:22:11,440 --> 00:22:13,399
see something like this so since the

537
00:22:13,599 --> 00:22:16,848
only element of t that is going to

538
00:22:17,048 --> 00:22:19,009
be nonzero is the target we're

539
00:22:19,210 --> 00:22:22,450
essentially asking this probability here

540
00:22:22,650 --> 00:22:24,500
the log probability of this to be

541
00:22:24,700 --> 00:22:25,909
maximized and then we just sum that

542
00:22:26,109 --> 00:22:29,899
across our data cases so even just with

543
00:22:30,099 --> 00:22:32,389
a linear layer if we were to optimize

544
00:22:32,589 --> 00:22:35,269
this we could form a very simple linear

545
00:22:35,470 --> 00:22:37,879
multi way classifier for say digits

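The one-hot encoding and the cross-entropy loss just derived can be sketched as follows (plain NumPy; the example probabilities are made up for illustration):

```python
import numpy as np

def one_hot(label, num_classes):
    """Vector with a 1 at the true class index and 0 elsewhere."""
    t = np.zeros(num_classes)
    t[label] = 1.0
    return t

def cross_entropy(t, p):
    """Negative log likelihood of the true labels under prediction p.
    Because t is one-hot, only -log p[true class] survives the sum."""
    return -np.sum(t * np.log(p))

t = one_hot(4, 10)            # e.g. the digit 4 out of 10 classes
p = np.full(10, 0.05)
p[4] = 0.55                   # model puts most of its mass on class 4
print(t)                      # fourth index is 1, everything else 0
print(cross_entropy(t, p))    # -log(0.55), about 0.598
```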
546
00:22:38,079 --> 00:22:39,769
it wouldn't work super well and we'll

547
00:22:39,970 --> 00:22:42,169
talk about adding depth but that's

548
00:22:42,369 --> 00:22:43,519
something that you can actually usefully

549
00:22:43,720 --> 00:22:48,049
do with one of these layers now as I

550
00:22:48,250 --> 00:22:50,839
said it used to be the case that

551
00:22:51,039 --> 00:22:53,240
the sigmoid was the dominant

552
00:22:53,440 --> 00:22:55,219
non-linearity and that's fallen out of

553
00:22:55,419 --> 00:22:57,049
favor and so in a lot of the neural

554
00:22:57,250 --> 00:22:58,848
networks that you'll see nowadays a much

555
00:22:59,048 --> 00:23:00,379
more common activation function is

556
00:23:00,579 --> 00:23:02,269
something called the rectified linear

557
00:23:02,470 --> 00:23:04,250
unit or as it's usually shortened a

558
00:23:04,450 --> 00:23:08,779
ReLU and it has a couple of nice

559
00:23:08,980 --> 00:23:12,108
properties so it's a lot simpler and

560
00:23:12,308 --> 00:23:13,899
computationally cheaper than the sigmoid

561
00:23:14,099 --> 00:23:16,129
it's basically a function that

562
00:23:16,329 --> 00:23:20,240
thresholds below at 0 or otherwise has a

563
00:23:20,440 --> 00:23:23,569
pass through so we can write it down as

564
00:23:23,769 --> 00:23:26,690
this so if the input to the

565
00:23:26,890 --> 00:23:29,029
ReLU function is below zero then the

566
00:23:29,230 --> 00:23:31,279
output is just zero and then above zero

567
00:23:31,480 --> 00:23:34,608
it's just a linear pass-through and it

568
00:23:34,808 --> 00:23:36,108
has a couple of nice properties one of

569
00:23:36,308 --> 00:23:40,129
which is in this region here the

570
00:23:40,329 --> 00:23:41,769
gradient is constant

571
00:23:41,970 --> 00:23:44,329
and generally in neural networks we

572
00:23:44,529 --> 00:23:46,450
want to have gradients flowing so it's

573
00:23:46,650 --> 00:23:49,009
maybe not so nice here that there's no

574
00:23:49,210 --> 00:23:50,389
gradient information here but at least once

575
00:23:50,589 --> 00:23:52,579
it's active the gradient is constant and

576
00:23:52,779 --> 00:23:54,769
we don't have any saturation regions

577
00:23:54,970 --> 00:23:56,559
once the unit you know is active so

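The ReLU and its gradient behaviour just described can be sketched as (a minimal NumPy version, with my own function names):

```python
import numpy as np

def relu(x):
    """max(0, x): zero below the threshold, linear pass-through above."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient is 0 when the unit is inactive and a constant 1 once
    it is active: no saturation region for positive inputs."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 1. 1.]
```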
578
00:23:56,759 --> 00:24:00,109
you'll hear I think a lot more about the

579
00:24:00,309 --> 00:24:02,629
details of the gradient properties of

580
00:24:02,829 --> 00:24:04,159
this kind of stuff in James Martens's

581
00:24:04,359 --> 00:24:06,220
lecture later on in optimization but

582
00:24:06,420 --> 00:24:08,089
these are kind of some of the subtleties

583
00:24:08,289 --> 00:24:08,899
that I was talking about they're

584
00:24:09,099 --> 00:24:13,129
important to think about ok so we've now

585
00:24:13,329 --> 00:24:17,119
seen just a very basic single layer now

586
00:24:17,319 --> 00:24:19,549
let's move on one step and ask ok what

587
00:24:19,750 --> 00:24:23,210
can we do if we have more than one layer

588
00:24:23,410 --> 00:24:25,129
so what can we do with neural networks

589
00:24:25,329 --> 00:24:29,720
with a hidden layer and to motivate this

590
00:24:29,920 --> 00:24:31,789
we'll take a look at a very simple

591
00:24:31,990 --> 00:24:33,470
example so what happens if we want to do

592
00:24:33,670 --> 00:24:35,899
binary classification but the inputs are

593
00:24:36,099 --> 00:24:39,680
not linearly separable and then in the

594
00:24:39,880 --> 00:24:42,649
second part of this section I'll kind of

595
00:24:42,849 --> 00:24:45,859
give a a visual proof for why we can see

596
00:24:46,059 --> 00:24:47,329
that neural networks are universal

597
00:24:47,529 --> 00:24:49,129
proper function approximate is so with

598
00:24:49,329 --> 00:24:50,480
enough with a large enough network we

599
00:24:50,680 --> 00:24:54,740
can approximate any function so when I

600
00:24:54,940 --> 00:24:56,869
say a single hidden layer this is what I

601
00:24:57,069 --> 00:24:59,539
mean so we have some inputs here a

602
00:24:59,740 --> 00:25:04,519
linear module of weights some nonlinear

603
00:25:04,720 --> 00:25:06,619
activations to give us this hidden

604
00:25:06,819 --> 00:25:09,680
representation another linear mapping

605
00:25:09,880 --> 00:25:11,029
and then either directly to the output

606
00:25:11,230 --> 00:25:13,669
or some output non-linearity and

607
00:25:13,869 --> 00:25:16,159
basically another way of thinking about

608
00:25:16,359 --> 00:25:18,440
why this is useful is that the outputs

609
00:25:18,640 --> 00:25:19,909
of one layer are the inputs to the next

610
00:25:20,109 --> 00:25:21,710
and so it allows us to transform our

611
00:25:21,910 --> 00:25:24,289
input through a series of intermediate

612
00:25:24,490 --> 00:25:27,169
representations and the hope is that

613
00:25:27,369 --> 00:25:29,119
rather than trying to solve the problem

614
00:25:29,319 --> 00:25:30,619
we're interested in directly an input

615
00:25:30,819 --> 00:25:32,240
space we can find this series of

616
00:25:32,440 --> 00:25:34,700
transformations that render our problem

617
00:25:34,900 --> 00:25:37,190
simpler in some transform representation

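The single-hidden-layer pipeline just described (linear map, nonlinearity, another linear map) can be sketched like this; the layer sizes are arbitrary illustrative choices and the weights are random rather than trained:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Minimal single-hidden-layer network: input dim 4 -> 16 hidden -> 3 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), np.zeros(16)
W2, b2 = rng.normal(size=(3, 16)), np.zeros(3)

def forward(x):
    h = relu(W1 @ x + b1)   # hidden representation: outputs of layer 1...
    return W2 @ h + b2      # ...become the inputs to the next layer

x = rng.normal(size=4)
print(forward(x).shape)  # (3,)
```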
618
00:25:37,390 --> 00:25:40,369
so again I think this was covered

619
00:25:40,569 --> 00:25:41,779
towards the end of those previous

620
00:25:41,980 --> 00:25:42,799
lecture but if you think back to what's

621
00:25:43,000 --> 00:25:44,839
going on with basis functions it's a

622
00:25:45,039 --> 00:25:50,960
similar kind of idea so this is probably

623
00:25:51,160 --> 00:25:52,309
the simplest example that can

624
00:25:52,509 --> 00:25:54,649
exemplify that so it's kind of simple

625
00:25:54,849 --> 00:25:55,759
XOR task so

626
00:25:55,960 --> 00:25:57,379
let's imagine that I have four data

627
00:25:57,579 --> 00:26:00,740
points living in 2d a B C and D and a

628
00:26:00,940 --> 00:26:04,309
and B are members of class 0 C and D are

629
00:26:04,509 --> 00:26:09,759
members of class 1 now if I just have a

630
00:26:09,960 --> 00:26:11,869
single linear layer plus logistic

631
00:26:12,069 --> 00:26:13,519
there's no way that I can correctly

632
00:26:13,720 --> 00:26:15,500
classify these points there's

633
00:26:15,700 --> 00:26:18,379
no line I can draw that will put the

634
00:26:18,579 --> 00:26:20,089
yellow points to one side

635
00:26:20,289 --> 00:26:22,159
and the blue points on the other now

636
00:26:22,359 --> 00:26:23,899
let's think about what we can do with a

637
00:26:24,099 --> 00:26:26,539
very simple Network as I've drawn here

638
00:26:26,740 --> 00:26:28,250
so we're just gonna have two hidden

639
00:26:28,450 --> 00:26:33,139
units and so let's imagine that the the

640
00:26:33,339 --> 00:26:35,809
first hidden unit has a weight vector

641
00:26:36,009 --> 00:26:39,079
pointing this direction so in terms of

642
00:26:39,279 --> 00:26:42,649
its outputs these will be 0 in this red

643
00:26:42,849 --> 00:26:45,799
shaded region and one here and then the

644
00:26:46,000 --> 00:26:47,930
second hidden unit will have a slightly

645
00:26:48,130 --> 00:26:49,369
different decision boundary it'll be

646
00:26:49,569 --> 00:26:52,269
this one so it'll be 0 here and one here

647
00:26:52,470 --> 00:26:55,519
and now if we ask ourselves ok in this

648
00:26:55,720 --> 00:26:58,039
space of hidden activities if I rewrite

649
00:26:58,240 --> 00:27:00,019
the data if I plot it again which

650
00:27:00,220 --> 00:27:01,609
I'm doing down here

651
00:27:01,809 --> 00:27:04,129
what does my classification problem like

652
00:27:04,329 --> 00:27:06,440
in this new space so let's go through

653
00:27:06,640 --> 00:27:11,629
the steps of that so point a had one for

654
00:27:11,829 --> 00:27:13,369
the first hidden unit and 0 for the

655
00:27:13,569 --> 00:27:16,299
second so it would live here point B

656
00:27:16,500 --> 00:27:19,879
same again 1 and 0 also lives there

657
00:27:20,079 --> 00:27:22,819
Point C has 0 for the first hidden unit 0

658
00:27:23,019 --> 00:27:26,409
for the second it lives here and then D

659
00:27:26,609 --> 00:27:30,440
has 1 and 1 so it lives here so this is

660
00:27:30,640 --> 00:27:32,569
the representation of these four data

661
00:27:32,769 --> 00:27:34,669
points in the input space this is the

662
00:27:34,869 --> 00:27:38,000
representation in this first hidden

663
00:27:38,200 --> 00:27:41,000
layer and so in this space the two

664
00:27:41,200 --> 00:27:43,009
classes now are linearly separable and

665
00:27:43,210 --> 00:27:48,139
so if I add an additional linear plus

666
00:27:48,339 --> 00:27:50,240
sigmoid on top of this then I'm able to

667
00:27:50,440 --> 00:27:53,119
classify these two classes in this data set

668
00:27:53,319 --> 00:27:57,259
correctly and so this is again it's a

669
00:27:57,460 --> 00:27:58,339
very simple example but I think it's a

670
00:27:58,539 --> 00:28:01,940
useful motivation for why having a

671
00:28:02,140 --> 00:28:05,289
hidden layer gives us additional power

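The XOR construction just walked through can be sketched with hand-set weights (not learned; the particular values are mine, chosen large so the sigmoids behave almost like hard threshold units):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Two hidden units remap the inputs into a space where the classes
# become linearly separable, then one sigmoid unit classifies.
W1 = np.array([[20.0, 20.0],    # hidden unit 1 ~ OR(x1, x2)
               [20.0, 20.0]])   # hidden unit 2 ~ AND(x1, x2)
b1 = np.array([-10.0, -30.0])
w2 = np.array([20.0, -20.0])    # output ~ h1 AND NOT h2, i.e. XOR
b2 = -10.0

def mlp(x):
    h = sigmoid(W1 @ x + b1)    # hidden representation of the input
    return sigmoid(w2 @ h + b2)

for x, label in [((0, 0), 0), ((1, 1), 0), ((0, 1), 1), ((1, 0), 1)]:
    print(x, round(float(mlp(np.array(x, dtype=float)))))  # matches label
```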
672
00:28:05,490 --> 00:28:07,279
actually looks like there's a couple of

673
00:28:07,480 --> 00:28:09,109
seats free I see a couple for anyone standing

674
00:28:09,309 --> 00:28:09,289
good

675
00:28:09,490 --> 00:28:11,448
if you want to take a second to sit down

676
00:28:11,648 --> 00:28:14,059
if that's easy for you there's a couple

677
00:28:14,259 --> 00:28:17,908
down here at the front and in the second row

678
00:28:22,259 --> 00:28:26,328
so here's another problem of a similar

679
00:28:26,528 --> 00:28:31,729
flavor but slightly less trivial so if we

680
00:28:31,929 --> 00:28:36,379
now have the setting here where the data

681
00:28:36,579 --> 00:28:37,430
from different classes live in these

682
00:28:37,630 --> 00:28:39,979
quadrants then just two hidden units on

683
00:28:40,179 --> 00:28:41,509
their own won't cut it but it turns out

684
00:28:41,710 --> 00:28:45,259
that with 16 units you can actually do a

685
00:28:45,460 --> 00:28:46,969
pretty good job at carving up this input

686
00:28:47,169 --> 00:28:51,019
space into the four quadrants and there's

687
00:28:51,220 --> 00:28:52,639
a link on the slide it's something

688
00:28:52,839 --> 00:28:54,129
that if you guys are not aware of it

689
00:28:54,329 --> 00:28:56,389
it's nice to look at there's a

690
00:28:56,589 --> 00:28:58,639
tensorflow web playground that basically

691
00:28:58,839 --> 00:29:01,578
lets you take some of these very simple

692
00:29:01,778 --> 00:29:03,529
problems in your browser and play around

693
00:29:03,730 --> 00:29:05,419
with different numbers of Units

694
00:29:05,619 --> 00:29:07,190
different nonlinearities and so on and

695
00:29:07,390 --> 00:29:08,690
it'll typically train on these

696
00:29:08,890 --> 00:29:11,539
problems in a few seconds and although it

697
00:29:11,740 --> 00:29:12,588
looks very simple I think it's a really

698
00:29:12,788 --> 00:29:14,209
nice thing to look at to refine your

699
00:29:14,409 --> 00:29:17,299
intuition for what sorts of things these

700
00:29:17,500 --> 00:29:18,619
models learn what the decision

701
00:29:18,819 --> 00:29:21,430
boundaries look like and it can help to add

702
00:29:21,630 --> 00:29:23,659
detail to your kind of mental picture of

703
00:29:23,859 --> 00:29:26,119
what's going on so yeah when the slides

704
00:29:26,319 --> 00:29:27,709
are shared I'd encourage you to take a

705
00:29:27,909 --> 00:29:29,180
look at that and just kind of play with

706
00:29:29,380 --> 00:29:30,769
some of these simple problems in the

707
00:29:30,970 --> 00:29:34,500
browser to kind of refine your intuition

708
00:29:35,009 --> 00:29:37,519
okay so we've seen that the power that

709
00:29:37,720 --> 00:29:40,789
we can get for these toy problems I'm

710
00:29:40,990 --> 00:29:43,369
now going to go through I guess I'd call

711
00:29:43,569 --> 00:29:45,769
it a sort it's not quite a proof but a

712
00:29:45,970 --> 00:29:49,000
visual intuition pump if you will for

713
00:29:49,200 --> 00:29:52,069
why neural networks with just one hidden

714
00:29:52,269 --> 00:29:53,809
layer can still be viewed as universal

715
00:29:54,009 --> 00:29:56,088
function approximators and this is one

716
00:29:56,288 --> 00:29:59,259
of those ideas that was arrived at by

717
00:29:59,460 --> 00:30:02,500
several people more or less concurrently

718
00:30:02,700 --> 00:30:05,719
one of the kind of well-known sort of

719
00:30:05,919 --> 00:30:08,719
proponents of a proof of this was a guy

720
00:30:08,919 --> 00:30:11,899
Cybenko from 89 and the papers are

721
00:30:12,099 --> 00:30:14,000
linked here there's also again in terms

722
00:30:14,200 --> 00:30:15,619
of the hyperlinks there's again some

723
00:30:15,819 --> 00:30:19,490
nice interactive web demos one of them

724
00:30:19,690 --> 00:30:21,229
in Michael Nielsen's web book on deep

725
00:30:21,429 --> 00:30:22,190
learning that

726
00:30:22,390 --> 00:30:24,789
I'd recommend you take a look at and

727
00:30:24,990 --> 00:30:26,569
going a little beyond the scope of this

728
00:30:26,769 --> 00:30:29,419
class it turns out there are interesting

729
00:30:29,619 --> 00:30:31,700
links along these lines to be made

730
00:30:31,900 --> 00:30:33,829
between neural networks and something

731
00:30:34,029 --> 00:30:36,259
called Gaussian processes they're not

732
00:30:36,460 --> 00:30:39,319
going to be covered today but again I'd

733
00:30:39,519 --> 00:30:40,220
encourage you to take a look if you're

734
00:30:40,420 --> 00:30:44,509
interested okay so what what is our

735
00:30:44,710 --> 00:30:47,629
visual proof going to be that with enough

736
00:30:47,829 --> 00:30:49,490
hidden units we can use a neural network

737
00:30:49,690 --> 00:30:52,009
to approximate anything so let's begin

738
00:30:52,210 --> 00:30:55,879
by just considering two of our linear

739
00:30:56,079 --> 00:30:59,210
plus sigmoid units here and let's

740
00:30:59,410 --> 00:31:01,159
imagine that we arranged for the weight

741
00:31:01,359 --> 00:31:02,119
vectors to point in the same direction

742
00:31:02,319 --> 00:31:04,730
or maybe we'll start off with just a

743
00:31:04,930 --> 00:31:06,349
scalar case so the only difference

744
00:31:06,549 --> 00:31:10,129
between unit 1 and unit 2 is the bias so

745
00:31:10,329 --> 00:31:11,899
that's the kind of offset of where the

746
00:31:12,099 --> 00:31:15,019
sigmoid kicks in and then let's imagine

747
00:31:15,220 --> 00:31:16,730
okay what happens if we take this pair

748
00:31:16,930 --> 00:31:20,299
of units and we subtract them from

749
00:31:20,500 --> 00:31:24,409
each other what does that difference

750
00:31:24,609 --> 00:31:26,180
output look like and it turns out it

751
00:31:26,380 --> 00:31:27,409
looks something a little like this this

752
00:31:27,609 --> 00:31:30,589
kind of bump of activity why well over to

753
00:31:30,789 --> 00:31:33,069
the far left both these units are 0 so

754
00:31:33,269 --> 00:31:36,289
the the difference is 0 over to the far

755
00:31:36,490 --> 00:31:38,690
right both outputs are 1 so

756
00:31:38,890 --> 00:31:40,639
they cancel and then in the middle we

757
00:31:40,839 --> 00:31:42,980
have this this little bump and so by

758
00:31:43,180 --> 00:31:45,440
having this pair of units were able to

759
00:31:45,640 --> 00:31:47,359
create this this bump here which is a

760
00:31:47,559 --> 00:31:50,809
lot like a basis function right so let's

761
00:31:51,009 --> 00:31:53,029
imagine that we want to use a neural

762
00:31:53,230 --> 00:31:55,250
network with a hidden layer to model

763
00:31:55,450 --> 00:31:57,470
this gray this arbitrary gray function

764
00:31:57,670 --> 00:31:59,599
here one of the ways we could do it it's

765
00:31:59,799 --> 00:32:01,430
probably not the best way but just as a

766
00:32:01,630 --> 00:32:03,220
kind of proof to show it can be done is

767
00:32:03,420 --> 00:32:05,690
you could imagine now that I've got

768
00:32:05,890 --> 00:32:08,480
these little bumps of activity I can

769
00:32:08,680 --> 00:32:10,639
arrange for that offset to lie at

770
00:32:10,839 --> 00:32:12,680
different points along this line and I

771
00:32:12,880 --> 00:32:17,089
can also scale the bump a multiplicative

772
00:32:17,289 --> 00:32:19,909
scale on this so the idea is through

773
00:32:20,109 --> 00:32:21,529
pairs of units we can kind of come up

774
00:32:21,730 --> 00:32:23,539
with these little bumps and if we think

775
00:32:23,740 --> 00:32:25,009
of what the sum of all these bumps look

776
00:32:25,210 --> 00:32:27,049
like if I have enough of them and

777
00:32:27,250 --> 00:32:28,700
they're narrow enough then it starts to

778
00:32:28,900 --> 00:32:32,029
look like this gray curve that we're

779
00:32:32,230 --> 00:32:34,399
trying to fit so the more bumps we have

780
00:32:34,599 --> 00:32:36,019
i.e. the bigger the

781
00:32:36,220 --> 00:32:38,269
the hidden layer the more accurate our

782
00:32:38,470 --> 00:32:40,219
approximation and so that's the kind of

783
00:32:40,419 --> 00:32:44,059
sketch proof for 1d in 2d this same

784
00:32:44,259 --> 00:32:47,000
sorts of ideas apply except we now need

785
00:32:47,200 --> 00:32:50,088
a pair of hidden units for each

786
00:32:50,288 --> 00:32:53,209
dimension of the input so it's hard to

787
00:32:53,409 --> 00:32:55,789
visualize in dimensions beyond two but a

788
00:32:55,990 --> 00:32:58,190
similar sort of thing would apply in 2d

789
00:32:58,390 --> 00:33:01,459
where if we have four neurons we can

790
00:33:01,659 --> 00:33:02,930
build these little towers of activity

791
00:33:03,130 --> 00:33:04,369
that we can kind of shift around and

792
00:33:04,569 --> 00:33:07,930
again the same idea would apply so

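The bump construction just sketched can be written out in 1-D as follows (a NumPy illustration under my own choice of target function, bump width, and sharpness; evaluated away from the boundary, where the tiling of bumps is incomplete):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bump(x, center, width, height, sharpness=50.0):
    """One pair of hidden units: the difference of two shifted
    sigmoids gives a localized bump of activity."""
    left = sigmoid(sharpness * (x - (center - width / 2)))
    right = sigmoid(sharpness * (x - (center + width / 2)))
    return height * (left - right)

# Tile narrow bumps along the axis, each scaled to the target's value
# at its center, and sum them to trace out the target curve.
target = lambda x: np.sin(2 * np.pi * x) + 1.0
xs = np.linspace(0.1, 0.9, 161)
centers = np.linspace(0.05, 0.95, 10)
approx = sum(bump(xs, c, 0.1, target(c)) for c in centers)

print(np.max(np.abs(approx - target(xs))))  # small worst-case error
```

Narrower, more numerous bumps (a bigger hidden layer) make the approximation tighter, which is the content of the sketch proof.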
793
00:33:08,130 --> 00:33:10,309
hopefully this has convinced you that

794
00:33:10,509 --> 00:33:13,219
with enough units we can approximate

795
00:33:13,419 --> 00:33:14,750
everything although it doesn't sound

796
00:33:14,950 --> 00:33:15,979
very efficient and you'd hope that

797
00:33:16,179 --> 00:33:17,358
there's a much better way of doing that

798
00:33:17,558 --> 00:33:24,619
and it turns out that there is so now

799
00:33:24,819 --> 00:33:50,869
that we've seen what we can do [inaudible question] I

800
00:33:51,069 --> 00:33:56,719
don't think so you're you're not taking

801
00:33:56,919 --> 00:33:59,029
the area under each bump you're just

802
00:33:59,230 --> 00:34:01,459
taking the kind of magnitude of the

803
00:34:01,659 --> 00:34:06,529
function so does that answer your question I

804
00:34:06,730 --> 00:34:09,699
think does that answer your question

805
00:34:09,898 --> 00:34:19,010
okay okay I see any more questions

806
00:34:19,210 --> 00:34:24,260
before we move on okay so now we're

807
00:34:24,460 --> 00:34:27,740
gonna start to think about deeper

808
00:34:27,940 --> 00:34:31,310
networks so we've seen what we can do

809
00:34:31,510 --> 00:34:35,269
with just a single hidden layer and we

810
00:34:35,469 --> 00:34:36,740
do have this Universal approximation

811
00:34:36,940 --> 00:34:38,389
property but we've also seen that it is

812
00:34:38,588 --> 00:34:40,370
kind of a horrible way to do it it needs

813
00:34:40,570 --> 00:34:43,340
many many units and it turns out that as

814
00:34:43,539 --> 00:34:47,120
we add depth things get a lot more

815
00:34:47,320 --> 00:34:48,550
powerful and we've become

816
00:34:48,750 --> 00:34:50,920
a lot more efficient and again I'll give

817
00:34:51,119 --> 00:34:54,340
a kind of a reference to a paper that

818
00:34:54,539 --> 00:34:56,620
has the full proof but for the class

819
00:34:56,820 --> 00:34:58,450
I'll try and give you a sort of more

820
00:34:58,650 --> 00:35:01,210
visual motivation for how you can see

821
00:35:01,409 --> 00:35:03,640
that that is something that happens and

822
00:35:03,840 --> 00:35:07,539
again to kind of motivate what you were

823
00:35:07,739 --> 00:35:11,050
what you get if you allow these very

824
00:35:11,250 --> 00:35:13,420
deep transformations again coming back

825
00:35:13,619 --> 00:35:15,370
this idea of rather than trying to kind

826
00:35:15,570 --> 00:35:17,970
of go from inputs to outputs in one go

827
00:35:18,170 --> 00:35:20,470
it allows us to potentially break it

828
00:35:20,670 --> 00:35:24,250
down into into smaller steps so you know

829
00:35:24,449 --> 00:35:26,530
cartoon from vision might be rather than

830
00:35:26,730 --> 00:35:29,470
going straight from a vector of pixels

831
00:35:29,670 --> 00:35:31,570
into some kind of scene level analysis

832
00:35:31,769 --> 00:35:33,789
maybe it's easier if in the first stage

833
00:35:33,989 --> 00:35:35,650
of transformation we can extract the

834
00:35:35,849 --> 00:35:38,080
edges or infer the edges from an image

835
00:35:38,280 --> 00:35:39,820
from those you can start to think about

836
00:35:40,019 --> 00:35:43,060
composing those edges into say junctions

837
00:35:43,260 --> 00:35:46,780
and small shapes from there into parts

838
00:35:46,980 --> 00:35:48,670
of objects and then from there

839
00:35:48,869 --> 00:35:50,800
into a full scene so breaking down

840
00:35:51,000 --> 00:35:53,170
these complicated computations into

841
00:35:53,369 --> 00:35:55,930
smaller chunks in the second half

842
00:35:56,130 --> 00:35:59,789
of the section we'll kind of flip to this

843
00:35:59,989 --> 00:36:01,870
what I'm calling a more modern

844
00:36:02,070 --> 00:36:04,630
compute graph perspective and there will

845
00:36:04,829 --> 00:36:07,090
kind of really start to see the creative

846
00:36:07,289 --> 00:36:08,800
designs that you can do in these very

847
00:36:09,000 --> 00:36:10,570
large networks and I'll also throw in

848
00:36:10,769 --> 00:36:13,060
just a couple of examples of real-world

849
00:36:13,260 --> 00:36:14,680
networks that you can see what I mean

850
00:36:14,880 --> 00:36:16,690
when I say that the structure of

851
00:36:16,889 --> 00:36:21,160
these things can get very elaborate okay

852
00:36:21,360 --> 00:36:26,110
so yeah what I'm gonna do for this slide

853
00:36:26,309 --> 00:36:30,310
and the next one is just go over how we

854
00:36:30,510 --> 00:36:33,070
can see the benefits of depth you can

855
00:36:33,269 --> 00:36:35,440
ignore this, it's my slide from last year

856
00:36:35,639 --> 00:36:36,670
when there was an exam, but these sorts of

857
00:36:36,869 --> 00:36:38,050
things caused, shall we say, a

858
00:36:38,250 --> 00:36:43,810
bit of worry so here's the

859
00:36:44,010 --> 00:36:46,810
construction so if we imagine taking the

860
00:36:47,010 --> 00:36:49,750
rectified linear unit that we saw

861
00:36:49,949 --> 00:36:53,980
previously so one of these is just zero

862
00:36:54,179 --> 00:36:59,650
if its input is below zero and it's

863
00:36:59,849 --> 00:37:02,019
linear above that and imagine we take

864
00:37:02,219 --> 00:37:04,899
another one of these rectifiers and

865
00:37:05,099 --> 00:37:06,430
essentially flip the signs of the

866
00:37:06,630 --> 00:37:07,780
weights and biases so it's kind of the

867
00:37:07,980 --> 00:37:11,560
converse what this gives us oriented

868
00:37:11,760 --> 00:37:13,570
around the origin in this case is a full

869
00:37:13,769 --> 00:37:17,860
rectifier and so in 1d this has the

870
00:37:18,059 --> 00:37:20,710
property that anything we build on top

871
00:37:20,909 --> 00:37:24,310
of this will have the same output for a

872
00:37:24,510 --> 00:37:27,280
point at plus X as it will at minus X so

873
00:37:27,480 --> 00:37:29,140
it's kind of mirroring where you

874
00:37:29,340 --> 00:37:32,350
can imagine it as kind of folding a

875
00:37:32,550 --> 00:37:35,019
space over so you get multiple points in the

876
00:37:35,219 --> 00:37:36,519
input mapped to the same point in the

877
00:37:36,719 --> 00:37:40,840
output and so this lets us have multiple

878
00:37:41,039 --> 00:37:42,100
regions of the input sharing the same

879
00:37:42,300 --> 00:37:44,950
functional mapping we'll kind of extend

880
00:37:45,150 --> 00:37:48,580
that from 1d into 2d here so imagine

881
00:37:48,780 --> 00:37:51,700
that I have two pairs of these full

882
00:37:51,900 --> 00:37:55,630
rectifiers so that gives you four

883
00:37:55,829 --> 00:37:58,149
hidden units in this layer in total one

884
00:37:58,349 --> 00:37:58,840
of the rectifiers

885
00:37:59,039 --> 00:38:01,840
is arranged along the x axis and one

886
00:38:02,039 --> 00:38:03,909
along the y axis and so what it means is

887
00:38:04,108 --> 00:38:09,010
that any function of the output of

888
00:38:09,210 --> 00:38:10,960
these is replicated in each of these

889
00:38:11,159 --> 00:38:13,030
quadrants and so one way you can think

890
00:38:13,230 --> 00:38:14,800
about what these rectifiers are doing is

891
00:38:15,000 --> 00:38:16,840
if I were to take that 2d plane and kind

892
00:38:17,039 --> 00:38:18,730
of fold it over and then fold it back on

893
00:38:18,929 --> 00:38:22,450
itself functions that I would map on

894
00:38:22,650 --> 00:38:24,190
that folded representation if I unfold

895
00:38:24,389 --> 00:38:25,960
it they kind of fall back into the

896
00:38:26,159 --> 00:38:29,139
original input space so that's the kind

897
00:38:29,338 --> 00:38:32,620
of underlying intuition you guys okay
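
As a rough sketch of that folding idea in code (illustrative only, not from the lecture; the function names are my own):

```python
import numpy as np

def relu(x):
    # rectified linear unit: zero below zero, linear above
    return np.maximum(0.0, x)

def full_rectifier(x):
    # a ReLU plus its sign-flipped copy sums to |x|, so anything built
    # on top sees +x and -x as the same point: the input space has
    # been "folded" over at the origin
    return relu(x) + relu(-x)

xs = np.linspace(-2.0, 2.0, 9)
assert np.allclose(full_rectifier(xs), np.abs(xs))
```

Any function applied to `full_rectifier(x)` is automatically mirror-symmetric in x, which is the 1D version of the folding picture.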

898
00:38:32,820 --> 00:38:37,060
yeah and so this is from this paper from

899
00:38:37,260 --> 00:38:40,690
2014 by Montufar, Pascanu, Cho and Bengio

900
00:38:40,889 --> 00:38:45,220
and what I just described is the sort of

901
00:38:45,420 --> 00:38:47,950
basic operation they use to come up with

902
00:38:48,150 --> 00:38:49,149
this interesting proof about the

903
00:38:49,349 --> 00:38:50,860
representational power of deep networks

904
00:38:51,059 --> 00:38:54,550
so I'll kind of step through this

905
00:38:54,750 --> 00:38:56,260
diagram fairly quickly again if

906
00:38:56,460 --> 00:38:58,090
you're interested then it's a nice paper

907
00:38:58,289 --> 00:38:59,740
and fairly easy to read but it's just

908
00:38:59,940 --> 00:39:01,930
too many details to go through today

909
00:39:02,130 --> 00:39:06,490
so as I said we imagine by applying

910
00:39:06,690 --> 00:39:07,899
these pairs of rectifiers what you end

911
00:39:08,099 --> 00:39:12,490
up with is this folded space and on

912
00:39:12,690 --> 00:39:14,610
the outputs of that so

913
00:39:14,809 --> 00:39:18,390
I can apply a new set of units on top of

914
00:39:18,590 --> 00:39:20,039
that which would end up kind of folding

915
00:39:20,239 --> 00:39:23,370
this space again and so what we end up

916
00:39:23,570 --> 00:39:26,100
with is that any decision boundary we have in the

917
00:39:26,300 --> 00:39:28,500
final layer as we kind of backtrack so

918
00:39:28,699 --> 00:39:30,480
going through this unfolding gets

919
00:39:30,679 --> 00:39:33,750
replicated or distributed to different

920
00:39:33,949 --> 00:39:37,890
parts of the input space so probably the

921
00:39:38,090 --> 00:39:39,300
most helpful thing to look at is this

922
00:39:39,500 --> 00:39:43,620
figure here so if we have a network

923
00:39:43,820 --> 00:39:46,590
arranged like this in this output layer

924
00:39:46,789 --> 00:39:48,320
if we have a linear decision boundary

925
00:39:48,519 --> 00:39:55,070
when we unfold that we end up with four

926
00:39:55,269 --> 00:39:57,360
full boundaries one in each of the

927
00:39:57,559 --> 00:40:00,630
quadrants represented here so we've gone

928
00:40:00,829 --> 00:40:03,510
from two regions that we can separate

929
00:40:03,710 --> 00:40:04,920
here to eight regions that we can

930
00:40:05,119 --> 00:40:07,710
separate here if we were to unfold that

931
00:40:07,909 --> 00:40:12,570
again then we end up with 32 regions so

932
00:40:12,769 --> 00:40:16,140
the kind of the high-level take home

933
00:40:16,340 --> 00:40:19,200
from this is that the number of regions

934
00:40:19,400 --> 00:40:21,890
that we can assign different labels to

935
00:40:22,090 --> 00:40:25,710
increases exponentially with depth and

936
00:40:25,909 --> 00:40:26,850
it turns out it only increases

937
00:40:27,050 --> 00:40:29,070
polynomially with the number of units per

938
00:40:29,269 --> 00:40:33,539
layer so all else being equal for a

939
00:40:33,739 --> 00:40:38,210
fixed total number of neurons there's

940
00:40:38,409 --> 00:40:40,620
potentially much more power by making a

941
00:40:40,820 --> 00:40:42,930
narrow deep network than there is in

942
00:40:43,130 --> 00:40:46,170
having a shallow wide Network you know

943
00:40:46,369 --> 00:40:47,670
the details of that will depend on your

944
00:40:47,869 --> 00:40:49,920
problem but that's one of the intuitions

945
00:40:50,119 --> 00:40:54,519
for why adding depth is so helpful it's
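
A rough 1D analogue of that counting argument, as a hedged sketch (the `tent` fold and the 0.5 decision threshold are my own illustrative choices, not the exact construction from the paper): each extra fold roughly doubles the number of input regions that one final linear boundary can separate.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def tent(x):
    # one "fold" of [0, 1] onto itself, built from two ReLU units:
    # rises to 1 at x = 0.5, then falls back to 0
    return 2 * relu(x) - 4 * relu(x - 0.5)

def decision_regions(depth, n_grid=99991):
    # compose `depth` folds, apply one linear decision boundary
    # h(x) > 0.5, and count the input regions it separates
    xs = np.linspace(0.0, 1.0, n_grid)
    h = xs
    for _ in range(depth):
        h = tent(h)
    signs = np.sign(h - 0.5)
    return 1 + np.count_nonzero(np.diff(signs))

# the count roughly doubles with each extra layer of folding
counts = [decision_regions(d) for d in (1, 2, 3, 4)]
```

With width held fixed, the region count grows exponentially in depth, which is the take-home of the slide.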

946
00:41:08,829 --> 00:41:12,600
guess so it's hard to answer these questions

947
00:41:12,800 --> 00:41:14,610
so I'd say the state of theory in deep

948
00:41:14,809 --> 00:41:16,470
learning in general is not where

949
00:41:16,670 --> 00:41:19,980
we'd like it to be so there isn't a

950
00:41:20,179 --> 00:41:21,980
good rigorous demonstration of that

951
00:41:22,179 --> 00:41:27,420
empirically in a lot of problems what

952
00:41:27,619 --> 00:41:28,530
you'll find is if

953
00:41:28,730 --> 00:41:31,110
you try and tackle something with a

954
00:41:31,309 --> 00:41:36,180
fixed budget of Units then in practice

955
00:41:36,380 --> 00:41:37,590
often you will get better empirical

956
00:41:37,789 --> 00:41:39,840
performance by adding a couple of hidden

957
00:41:40,039 --> 00:41:42,769
layers rather than having one very wide

958
00:41:42,969 --> 00:41:46,019
one but it's also problem

959
00:41:46,219 --> 00:41:49,200
dependent yeah I think there's another

960
00:41:49,400 --> 00:41:54,060
question somewhere over there okay does

961
00:41:54,260 --> 00:41:56,390
that answer your question

962
00:41:56,590 --> 00:42:05,340
sure yeah [inaudible] yeah

963
00:42:05,539 --> 00:42:07,170
I just encourage you to read the paper

964
00:42:07,369 --> 00:42:10,200
because it's it's really nicely written

965
00:42:10,400 --> 00:42:11,940
and to the extent that this works

966
00:42:12,139 --> 00:42:14,730
for you as an intuition pump it's worth

967
00:42:14,929 --> 00:42:15,720
taking the time to kind of go through

968
00:42:15,920 --> 00:42:24,280
that argument and understand it okay so

969
00:42:26,139 --> 00:42:29,250
now I said we're gonna switch gears a

970
00:42:29,449 --> 00:42:32,610
bit and move from this what I would say

971
00:42:32,809 --> 00:42:35,840
is a kind of more traditional style of

972
00:42:36,039 --> 00:42:38,130
depicting and thinking about neural

973
00:42:38,329 --> 00:42:40,470
networks in which we sort of bundle

974
00:42:40,670 --> 00:42:43,410
the nonlinearities into our description

975
00:42:43,610 --> 00:42:45,720
of layers and move towards this

976
00:42:45,920 --> 00:42:51,990
kind of more explicit compute graph

977
00:42:52,190 --> 00:42:55,289
representation where we have separate

978
00:42:55,489 --> 00:42:57,510
nodes for our weights and we

979
00:42:57,710 --> 00:42:58,860
separate out the linear transformation

980
00:42:59,059 --> 00:43:04,350
from the nonlinearities and this is more

981
00:43:04,550 --> 00:43:05,400
similar to the kind of thing that you'll

982
00:43:05,599 --> 00:43:07,950
see if you look at say visualizations in

983
00:43:08,150 --> 00:43:12,330
TensorBoard so these are kind of

984
00:43:12,530 --> 00:43:14,910
isomorphic to each other and to these

985
00:43:15,110 --> 00:43:16,769
equations here I just put

986
00:43:16,969 --> 00:43:18,840
together an arbitrary graph just to kind

987
00:43:19,039 --> 00:43:25,830
of highlight this so we have input to a

988
00:43:26,030 --> 00:43:29,370
first hidden layer with a sigmoid the

989
00:43:29,570 --> 00:43:31,710
outputs of this go to a second hidden layer

990
00:43:31,909 --> 00:43:33,950
which I decided to pick a ReLU for

991
00:43:34,150 --> 00:43:37,680
there's another pathway so that yeah

992
00:43:37,880 --> 00:43:38,910
this one is a ReLU there's another

993
00:43:39,110 --> 00:43:41,640
pathway coming through here and then

994
00:43:41,840 --> 00:43:42,419
they combine at the output

995
00:43:42,619 --> 00:43:44,369
and it's exactly the same thing here I'm

996
00:43:44,568 --> 00:43:47,068
just kind of adding these additional

997
00:43:47,268 --> 00:43:53,099
nodes and it seems like we've kind of

998
00:43:53,298 --> 00:43:54,899
made this one look more complicated

999
00:43:55,099 --> 00:43:56,460
than this one but there's a reason for

1000
00:43:56,659 --> 00:43:57,720
kind of breaking it down like this which

1001
00:43:57,920 --> 00:44:00,269
we'll kind of move on to in the next

1002
00:44:00,469 --> 00:44:04,500
sections and that's the idea of kind of

1003
00:44:04,699 --> 00:44:07,639
looking at these systems just as kind of

1004
00:44:07,838 --> 00:44:09,750
compute graphs from modular building

1005
00:44:09,949 --> 00:44:13,230
blocks and the nice thing is if we

1006
00:44:13,429 --> 00:44:14,849
represent and think about our models in

1007
00:44:15,048 --> 00:44:17,519
this way then there's a nice link into

1008
00:44:17,719 --> 00:44:20,129
software implementation so we can kind

1009
00:44:20,329 --> 00:44:21,659
of take a very object-oriented approach

1010
00:44:21,858 --> 00:44:23,730
to composing these graphs and

1011
00:44:23,929 --> 00:44:27,359
implementing them and for most of what

1012
00:44:27,559 --> 00:44:29,430
we need to do there's a very small

1013
00:44:29,630 --> 00:44:31,470
minimal set of API functions that each

1014
00:44:31,670 --> 00:44:33,030
of these modules needs to be able to

1015
00:44:33,230 --> 00:44:35,849
carry out and you can basically have

1016
00:44:36,048 --> 00:44:37,798
anything as a module in your graph as

1017
00:44:37,998 --> 00:44:40,619
long as it can carry out these these

1018
00:44:40,818 --> 00:44:43,559
three functionalities so and well we'll

1019
00:44:43,759 --> 00:44:45,240
go through them and in the subsequent

1020
00:44:45,440 --> 00:44:48,659
slides but just to kind of signpost them

1021
00:44:48,858 --> 00:44:50,280
there's a forward pass so how we go from

1022
00:44:50,480 --> 00:44:52,530
inputs to outputs there's a backwards

1023
00:44:52,730 --> 00:44:55,019
pass so given some gradients of the loss

1024
00:44:55,219 --> 00:44:58,769
we care about how do we compute those

1025
00:44:58,969 --> 00:44:59,818
gradients all the way through the graph

1026
00:45:00,018 --> 00:45:02,818
and then how do we compute the parameter

1027
00:45:03,018 --> 00:45:10,200
updates and this is just putting this up
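
A minimal sketch of what such a module API might look like (the class and method names here are hypothetical, not any particular library's): each module implements the forward pass, the backward pass, and a parameter update.

```python
import numpy as np

class Linear:
    # a module with parameters: implements the three functionalities
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_in, n_out)) * 0.1
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                     # cache the input for backward
        return x @ self.W + self.b

    def backward(self, grad_out):
        self.dW = np.outer(self.x, grad_out)   # parameter gradients
        self.db = grad_out
        return grad_out @ self.W.T     # gradient passed to the module below

    def update(self, lr):
        self.W -= lr * self.dW
        self.b -= lr * self.db

class ReLU:
    # a parameter-free module: update is a no-op
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask

    def update(self, lr):
        pass

# chain modules: forward left to right, backward right to left
rng = np.random.default_rng(0)
net = [Linear(3, 4, rng), ReLU(), Linear(4, 1, rng)]
h = np.array([1.0, -2.0, 0.5])
for m in net:
    h = m.forward(h)
g = np.ones(1)                         # pretend dLoss/dOutput = 1
for m in reversed(net):
    g = m.backward(g)
for m in net:
    m.update(0.01)
```

Because each module only talks to its neighbours through these three calls, arbitrary graphs can be composed without tracking the whole network by hand.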

1028
00:45:10,400 --> 00:45:12,629
here this is what the compute graph for

1029
00:45:12,829 --> 00:45:15,149
Inception v4 looks like and I just

1030
00:45:15,349 --> 00:45:16,919
wanted to kind of put this up to

1031
00:45:17,119 --> 00:45:20,430
ground why it's important to have this

1032
00:45:20,630 --> 00:45:22,559
kind of modular framework because you

1033
00:45:22,759 --> 00:45:24,030
know for the for the small networks that

1034
00:45:24,230 --> 00:45:25,859
I was showing you initially it kind of

1035
00:45:26,059 --> 00:45:27,389
doesn't matter how you set up your code

1036
00:45:27,588 --> 00:45:28,379
you could you know you can derive

1037
00:45:28,579 --> 00:45:31,169
everything by hand you know maybe you

1038
00:45:31,369 --> 00:45:32,430
want to fuse some of the operations

1039
00:45:32,630 --> 00:45:34,230
yourself just to make things efficient

1040
00:45:34,429 --> 00:45:36,930
but once you have these massive massive

1041
00:45:37,130 --> 00:45:39,269
graphs then keeping track of that in

1042
00:45:39,469 --> 00:45:41,190
your head or by by hand is just not

1043
00:45:41,389 --> 00:45:43,500
really feasible and so you need to have

1044
00:45:43,699 --> 00:45:44,970
some automated way of plugging these

1045
00:45:45,170 --> 00:45:46,470
things together and being able to to

1046
00:45:46,670 --> 00:45:50,129
deal with them so this I think it's not

1047
00:45:50,329 --> 00:45:51,899
state-of-the-art anymore that's a kind

1048
00:45:52,099 --> 00:45:54,059
of sign of how the field's moving but as

1049
00:45:54,259 --> 00:45:55,450
of around this

1050
00:45:55,650 --> 00:45:57,780
last year this was a state-of-the-art

1051
00:45:57,980 --> 00:46:00,490
vision architecture it's still pretty

1052
00:46:00,690 --> 00:46:03,700
good this is another example this time

1053
00:46:03,900 --> 00:46:05,680
from deep reinforcement learning and

1054
00:46:05,880 --> 00:46:08,019
again and just kind of putting this up

1055
00:46:08,219 --> 00:46:11,350
there to give you a sense of what sorts

1056
00:46:11,550 --> 00:46:13,750
of architectures we end up using it in

1057
00:46:13,949 --> 00:46:15,780
real-world problems and the sorts of

1058
00:46:15,980 --> 00:46:19,240
somewhat arbitrary topologies that we

1059
00:46:19,440 --> 00:46:20,800
can have depending on on what we need to

1060
00:46:21,000 --> 00:46:24,100
do the details of this don't matter too

1061
00:46:24,300 --> 00:46:25,900
much but I I think towards the end of

1062
00:46:26,099 --> 00:46:28,120
the RL course Hado might cover some of

1063
00:46:28,320 --> 00:46:41,289
this stuff ok so the next section

1064
00:46:41,489 --> 00:46:43,150
we're going to cover learning and it's

1065
00:46:43,349 --> 00:46:45,130
probably going to be one of the more

1066
00:46:45,329 --> 00:46:48,670
math heavy sections and I guess I'll

1067
00:46:48,869 --> 00:46:51,610
cover the material but I usually

1068
00:46:51,809 --> 00:46:53,289
find it's not super productive to be

1069
00:46:53,489 --> 00:46:55,120
very detailed with mathematics in a

1070
00:46:55,320 --> 00:46:57,850
lecture but you can kind of refer to the

1071
00:46:58,050 --> 00:47:00,280
slides for details afterwards so

1072
00:47:00,480 --> 00:47:02,170
what is learning as I said it's very

1073
00:47:02,369 --> 00:47:05,200
simple we have some loss function

1074
00:47:05,400 --> 00:47:07,900
defined with respect to our data and

1075
00:47:08,099 --> 00:47:11,200
model parameters and then learning is

1076
00:47:11,400 --> 00:47:13,990
just using optimization methods to find

1077
00:47:14,190 --> 00:47:15,820
a set of model parameters which minimize

1078
00:47:16,019 --> 00:47:20,289
this loss and typically we'll use some

1079
00:47:20,489 --> 00:47:22,030
form of gradient descent to do this and

1080
00:47:22,230 --> 00:47:23,560
there'll be a whole lecture that kind of

1081
00:47:23,760 --> 00:47:26,289
covers various ways of the optimization

1082
00:47:26,489 --> 00:47:28,780
I guess something else that I'll add

1083
00:47:28,980 --> 00:47:33,580
just cuz it's starting to become popular

1084
00:47:33,780 --> 00:47:35,080
in places and is something that I'm working

1085
00:47:35,280 --> 00:47:37,090
on in my research at the moment so there

1086
00:47:37,289 --> 00:47:38,920
are gradient-free ways of doing

1087
00:47:39,119 --> 00:47:41,980
optimization so kind of 0th order

1088
00:47:42,179 --> 00:47:43,450
approximations to gradients or

1089
00:47:43,650 --> 00:47:46,930
evolutionary methods and again I guess

1090
00:47:47,130 --> 00:47:48,130
one of those things where you know these

1091
00:47:48,329 --> 00:47:50,470
things come in waves of fashion they

1092
00:47:50,670 --> 00:47:52,240
were kind of popular in the early 2000s

1093
00:47:52,440 --> 00:47:53,890
they've fallen out of favor they're

1094
00:47:54,090 --> 00:47:57,250
actually appearing again particularly in

1095
00:47:57,449 --> 00:47:58,750
some reinforcement learning contexts

1096
00:47:58,949 --> 00:48:01,390
where you have the situation that sure

1097
00:48:01,590 --> 00:48:03,160
we can kind of deal with gradients in our

1098
00:48:03,360 --> 00:48:06,910
models but depending on the data that we

1099
00:48:07,110 --> 00:48:08,420
have available so in

1100
00:48:08,619 --> 00:48:10,250
reinforcement learning the data you train on

1101
00:48:10,449 --> 00:48:11,450
depends on how well you're exploring the

1102
00:48:11,650 --> 00:48:12,800
environment it might be that there just

1103
00:48:13,000 --> 00:48:14,360
isn't a very good gradient signal there

1104
00:48:14,559 --> 00:48:18,440
and so we won't cover it today I don't

1105
00:48:18,639 --> 00:48:20,000
know if James will touch on a bit on his

1106
00:48:20,199 --> 00:48:21,230
lecture but it's just useful to be

1107
00:48:21,429 --> 00:48:22,820
aware that there are these sorts of

1108
00:48:23,019 --> 00:48:24,500
gradient free optimization methods as

1109
00:48:24,699 --> 00:48:26,750
well and depending on your problem that

1110
00:48:26,949 --> 00:48:28,070
might be something useful to think about

1111
00:48:28,269 --> 00:48:31,400
and at least be aware of so in this

1112
00:48:31,599 --> 00:48:34,130
section I'll start by doing a kind of a

1113
00:48:34,329 --> 00:48:37,010
recap of some calculus and linear

1114
00:48:37,210 --> 00:48:40,310
algebra we'll recap gradient descent and

1115
00:48:40,510 --> 00:48:42,410
then we'll talk about how to put these

1116
00:48:42,610 --> 00:48:44,330
together on the compute graphs we were

1117
00:48:44,530 --> 00:48:46,100
just discussing with automatic

1118
00:48:46,300 --> 00:48:47,330
differentiation and something called

1119
00:48:47,530 --> 00:48:50,390
modular backprop and what I'll do at the

1120
00:48:50,590 --> 00:48:52,190
end of the section is we can kind of go

1121
00:48:52,389 --> 00:48:54,650
through a more detailed derivation of

1122
00:48:54,849 --> 00:48:56,870
how we do a setup if we wanted to say

1123
00:48:57,070 --> 00:48:59,390
do classification of MNIST digits with

1124
00:48:59,590 --> 00:49:01,519
a network with one hidden layer so just

1125
00:49:01,719 --> 00:49:03,230
a kind of very crude example but once

1126
00:49:03,429 --> 00:49:04,550
you've got that it kind of generalizes

1127
00:49:04,750 --> 00:49:05,600
to all sorts of other things that you'd

1128
00:49:05,800 --> 00:49:13,940
want to do so there's two concepts that

1129
00:49:14,139 --> 00:49:16,250
it's useful to have in mind they're kind

1130
00:49:16,449 --> 00:49:18,320
of objects that allow us to write some

1131
00:49:18,519 --> 00:49:20,120
of the the equations more efficiently

1132
00:49:20,320 --> 00:49:21,920
and to kind of think about these things

1133
00:49:22,119 --> 00:49:24,830
in a slightly more compact way so one of

1134
00:49:25,030 --> 00:49:27,130
them is this notion of a gradient vector

1135
00:49:27,329 --> 00:49:31,580
so if I have some scalar function f of a

1136
00:49:31,780 --> 00:49:35,660
vector argument then the elements of the

1137
00:49:35,860 --> 00:49:37,630
gradient vector which is denoted here

1138
00:49:37,829 --> 00:49:41,269
with respect to X are just the partial

1139
00:49:41,469 --> 00:49:43,610
derivatives of the scalar output with

1140
00:49:43,809 --> 00:49:45,380
respect to the individual dimensions of

1141
00:49:45,579 --> 00:49:47,990
the vector the other concept that's

1142
00:49:48,190 --> 00:49:50,360
going to be useful in terms of writing

1143
00:49:50,559 --> 00:49:52,660
some of these things down concisely

1144
00:49:52,860 --> 00:49:56,870
is the Jacobian matrix and so there if

1145
00:49:57,070 --> 00:49:59,240
we have a vector function of vector

1146
00:49:59,440 --> 00:50:04,550
arguments then the Jacobian matrix the

1147
00:50:04,750 --> 00:50:07,490
(n, m) element of that is just the partial

1148
00:50:07,690 --> 00:50:10,010
derivative of the nth element of the output

1149
00:50:10,210 --> 00:50:12,470
vector with respect to the mth element of

1150
00:50:12,670 --> 00:50:15,450
the input vector
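
Both objects can be approximated by finite differences, which is a handy sanity check; here is a small sketch (the helper name `jacobian` is my own, not the lecture's):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    # finite-difference estimate: J[n, m] = d f_n / d x_m
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for m in range(x.size):
        dx = np.zeros_like(x)
        dx[m] = eps
        J[:, m] = (f(x + dx) - y0) / eps
    return J

# example: f(x) = (x0 * x1, x0 + x1) has Jacobian [[x1, x0], [1, 1]];
# a gradient vector is just the single-row special case (scalar output)
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
J = jacobian(f, np.array([2.0, 3.0]))
assert np.allclose(J, [[3.0, 2.0], [1.0, 1.0]], atol=1e-4)
```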

1151
00:50:17,090 --> 00:50:21,930
and in terms of gradient descent what

1152
00:50:22,130 --> 00:50:24,600
does that mean well if we have some loss

1153
00:50:24,800 --> 00:50:26,460
function that we want to minimize then

1154
00:50:26,659 --> 00:50:27,990
essentially we were just kind of

1155
00:50:28,190 --> 00:50:29,490
repeatedly doing these updates where we

1156
00:50:29,690 --> 00:50:31,740
take our previous parameter value we

1157
00:50:31,940 --> 00:50:33,360
compute the gradient and we can do this

1158
00:50:33,559 --> 00:50:37,650
either over our entire data set which

1159
00:50:37,849 --> 00:50:39,780
would be kind of batch or a kind of

1160
00:50:39,980 --> 00:50:41,250
subset of the data which would be mini-batch

1161
00:50:41,449 --> 00:50:43,200
or something that we end up calling

1162
00:50:43,400 --> 00:50:44,640
online gradient descent which is if we

1163
00:50:44,840 --> 00:50:47,340
take one data point at a time we just

1164
00:50:47,539 --> 00:50:48,780
compute the gradient of our loss with

1165
00:50:48,980 --> 00:50:51,180
respect to that data and then take a

1166
00:50:51,380 --> 00:50:53,370
small step scaled by this learning rate

1167
00:50:53,570 --> 00:50:55,500
eta in the descent direction

1168
00:50:55,699 --> 00:50:58,890
and then we end up repeating this in

1169
00:50:59,090 --> 00:51:00,780
what I'm gonna talk about in the coming

1170
00:51:00,980 --> 00:51:03,030
slides I'm gonna operate on the

1171
00:51:03,230 --> 00:51:05,220
assumption that we're doing it online it

1172
00:51:05,420 --> 00:51:07,500
doesn't change it much if we do batch

1173
00:51:07,699 --> 00:51:09,120
methods it's just easier to represent if

1174
00:51:09,320 --> 00:51:11,460
we just have one data case to

1175
00:51:11,659 --> 00:51:16,560
think about and I'll cover this a couple
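
A minimal sketch of that update rule, theta &lt;- theta - eta * grad, on a toy quadratic loss (illustrative, not from the slides; `sgd` and the example loss are my own choices):

```python
def sgd(grad_fn, theta, lr=0.1, steps=200):
    # repeatedly step against the gradient: theta <- theta - eta * grad
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3);
# one gradient evaluation per update, in the spirit of online descent
theta = sgd(lambda t: 2 * (t - 3.0), theta=0.0)
assert abs(theta - 3.0) < 1e-8
```

Swapping `grad_fn` for a gradient computed on one data point, a mini-batch, or the whole set gives the online, mini-batch and batch variants.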

1176
00:51:16,760 --> 00:51:17,910
of times later as well but it's just

1177
00:51:18,110 --> 00:51:19,620
worth stressing that the choice of

1178
00:51:19,820 --> 00:51:21,269
learning rate or the step size

1179
00:51:21,469 --> 00:51:22,920
parameter ends up making a big

1180
00:51:23,119 --> 00:51:26,550
difference to how quickly you can find

1181
00:51:26,750 --> 00:51:30,000
solutions and in fact the quality of

1182
00:51:30,199 --> 00:51:32,310
solutions that you end up finding and so

1183
00:51:32,510 --> 00:51:33,510
that's something that we'll touch on when

1184
00:51:33,710 --> 00:51:34,800
we talk a bit about hyper parameter

1185
00:51:35,000 --> 00:51:38,310
optimization and moving beyond simple

1186
00:51:38,510 --> 00:51:39,539
gradient descent there's a lot more

1187
00:51:39,739 --> 00:51:41,430
sophisticated method so things like

1188
00:51:41,630 --> 00:51:43,740
momentum where you kind of keep around

1189
00:51:43,940 --> 00:51:45,750
gradients from previous iterations and

1190
00:51:45,949 --> 00:51:47,310
blend them with gradients from the current

1191
00:51:47,510 --> 00:51:49,590
iteration there's things like rmsprop or

1192
00:51:49,789 --> 00:51:53,910
atom which are adaptive ways of scaling

1193
00:51:54,110 --> 00:51:56,490
some of the step sizes along different

1194
00:51:56,690 --> 00:51:58,860
directions and I think James is going to

1195
00:51:59,059 --> 00:52:00,210
go into a lot more detail about that in

1196
00:52:00,409 --> 00:52:07,839
a couple weeks time okay

1197
00:52:08,820 --> 00:52:10,940
so if you think back to kind of high

1198
00:52:11,139 --> 00:52:14,060
school calculus and in particular the

1199
00:52:14,260 --> 00:52:19,690
chain rule so let's start off with this

1200
00:52:19,889 --> 00:52:25,280
nested function so Y is f of G of X and

1201
00:52:25,480 --> 00:52:30,590
so if we ask okay what's the derivative

1202
00:52:30,789 --> 00:52:33,470
of Y with respect to X well we just plug

1203
00:52:33,670 --> 00:52:36,019
in the chain rule so it's the derivative

1204
00:52:36,219 --> 00:52:38,840
of F with respect to G considering

1205
00:52:39,039 --> 00:52:40,730
g-tube its argument and then the

1206
00:52:40,929 --> 00:52:43,210
derivative of G with respect to X so a

1207
00:52:43,409 --> 00:52:46,519
similar scalar case scalar output scalar

1208
00:52:46,719 --> 00:52:52,519
input if we make this multivariate so

1209
00:52:52,719 --> 00:52:55,519
now let's imagine that our function f is

1210
00:52:55,719 --> 00:52:57,260
a function of multiple arguments each of

1211
00:52:57,460 --> 00:52:59,150
which is a different function G 1

1212
00:52:59,349 --> 00:53:02,630
through M of X and again we're interested

1213
00:53:02,829 --> 00:53:06,050
in the same question what's the the

1214
00:53:06,250 --> 00:53:08,560
derivative of Y with respect to X well

1215
00:53:08,760 --> 00:53:12,980
we sum over all these individual

1216
00:53:13,179 --> 00:53:15,670
functions and then for any one of them

1217
00:53:15,869 --> 00:53:17,780
it's again just the chain rule from

1218
00:53:17,980 --> 00:53:19,820
above so the partial of F with respect

1219
00:53:20,019 --> 00:53:22,400
to G i and then the partial of G i with

1220
00:53:22,599 --> 00:53:28,000
respect to X so we basically for each

1221
00:53:28,199 --> 00:53:31,580
level of nesting we take a product along

1222
00:53:31,780 --> 00:53:34,130
a single path and then we sum over all

1223
00:53:34,329 --> 00:53:36,190
possible paths to get the total

1224
00:53:36,389 --> 00:53:39,710
derivative and well basically just gonna

1225
00:53:39,909 --> 00:53:43,940
take these concepts and scale them up so

1226
00:53:44,139 --> 00:53:46,400
that we can apply them to these compute

1227
00:53:46,599 --> 00:53:49,940
graphs and the only thing to be aware of

1228
00:53:50,139 --> 00:53:51,590
and I'll mention this again

1229
00:53:51,789 --> 00:53:52,970
in a second there's a couple of

1230
00:53:53,170 --> 00:53:54,560
efficiency tricks that we should be

1231
00:53:54,760 --> 00:53:57,380
aware of so if there are junctions as we

1232
00:53:57,579 --> 00:53:59,269
traverse there's opportunities to

1233
00:53:59,469 --> 00:54:01,160
factorize these expressions and that

1234
00:54:01,360 --> 00:54:02,690
becomes particularly important if you

1235
00:54:02,889 --> 00:54:07,100
have a graph with a lot of branching in

1236
00:54:07,300 --> 00:54:13,130
its topology so let's take some

1237
00:54:13,329 --> 00:54:14,840
arbitrary graph as an example

1238
00:54:15,039 --> 00:54:16,190
again so

1239
00:54:16,389 --> 00:54:18,589
it's a little dense when I write it out

1240
00:54:18,789 --> 00:54:20,329
but hopefully this will kind of like

1241
00:54:20,528 --> 00:54:21,950
carry over the point so imagine we

1242
00:54:22,150 --> 00:54:24,139
have some function mapping from X to Y

1243
00:54:24,338 --> 00:54:27,409
and the way this is going to be composed

1244
00:54:27,608 --> 00:54:31,460
it's gonna be some G of F where F is going to

1245
00:54:31,659 --> 00:54:34,730
be a function of its two inputs E and J

1246
00:54:34,929 --> 00:54:38,690
and then E is this kind of nested

1247
00:54:38,889 --> 00:54:40,818
sequence of functions or operations all

1248
00:54:41,018 --> 00:54:45,519
the way to X and similarly J so if I

1249
00:54:45,719 --> 00:54:47,810
take what I just set up here and ask

1250
00:54:48,010 --> 00:54:50,810
okay what's the derivative of Y with

1251
00:54:51,010 --> 00:54:54,470
respect to X then we take the product

1252
00:54:54,670 --> 00:54:56,599
along these two paths as I said so A

1253
00:54:56,798 --> 00:54:58,490
through G and then there's also this

1254
00:54:58,690 --> 00:55:01,220
path through here and so we get these

1255
00:55:01,420 --> 00:55:04,818
two expressions down here what I was

1256
00:55:05,018 --> 00:55:06,349
saying about kind of some of the

1257
00:55:06,548 --> 00:55:07,789
efficiency tricks is you'll notice

1258
00:55:07,989 --> 00:55:10,039
there's some common terms towards the

1259
00:55:10,239 --> 00:55:11,810
end of this expression and this

1260
00:55:12,010 --> 00:55:13,460
expression and so we could actually

1261
00:55:13,659 --> 00:55:17,329
group these together factor those out of

1262
00:55:17,528 --> 00:55:19,849
that sum in the scalar case it doesn't

1263
00:55:20,048 --> 00:55:21,619
matter too much but we'll move to the

1264
00:55:21,818 --> 00:55:23,659
vector case and more elaborate

1265
00:55:23,858 --> 00:55:25,099
graphs you'll see why it's important

1266
00:55:25,298 --> 00:55:26,450
essentially if there's a lot of

1267
00:55:26,650 --> 00:55:28,579
branching and joining then we have to do

1268
00:55:28,778 --> 00:55:29,960
these sums over the kind of

1269
00:55:30,159 --> 00:55:32,180
combinatorially many paths through the

1270
00:55:32,380 --> 00:55:33,889
graph for the mapping that we're

1271
00:55:34,088 --> 00:55:43,250
interested in the other point that is

1272
00:55:43,449 --> 00:55:44,750
worth mentioning is so if you look at

1273
00:55:44,949 --> 00:55:46,369
the literature on automatic

1274
00:55:46,568 --> 00:55:48,559
differentiation you might hear a couple

1275
00:55:48,759 --> 00:55:50,568
of different terms so there's something

1276
00:55:50,768 --> 00:55:53,000
called forwards mode automatic

1277
00:55:53,199 --> 00:55:55,339
differentiation and something called

1278
00:55:55,539 --> 00:55:56,750
reverse mode automatic differentiation

1279
00:55:56,949 --> 00:56:01,568
and that's really referring to

1280
00:56:01,768 --> 00:56:05,568
when we're computing these expressions do

1281
00:56:05,768 --> 00:56:08,240
we compute the product starting from the

1282
00:56:08,440 --> 00:56:10,220
input working towards the output or do

1283
00:56:10,420 --> 00:56:13,839
we work in reverse and the difference

1284
00:56:14,039 --> 00:56:18,349
between the two is to do with what sorts

1285
00:56:18,548 --> 00:56:20,180
of intermediate quantities that we end

1286
00:56:20,380 --> 00:56:24,859
up with so if I work from the input

1287
00:56:25,059 --> 00:56:29,629
towards the output so if I compute

1288
00:56:29,829 --> 00:56:32,059
this product from the inputs to the

1289
00:56:32,259 --> 00:56:33,649
outputs then my intermediate terms are

1290
00:56:33,849 --> 00:56:37,849
things like da/dx if I then compute this

1291
00:56:38,048 --> 00:56:42,740
then I basically would end up with db/dx

1292
00:56:42,940 --> 00:56:46,220
dz/dx so in forwards mode we get the

1293
00:56:46,420 --> 00:56:48,798
partial derivatives of the internal

1294
00:56:48,998 --> 00:56:52,099
nodes with respect to the inputs which

1295
00:56:52,298 --> 00:56:53,750
is actually not super useful for what we

1296
00:56:53,949 --> 00:56:55,039
want to do but it's great if you want to

1297
00:56:55,239 --> 00:56:57,409
say do sensitivity analysis so if I want

1298
00:56:57,608 --> 00:56:59,539
to know how much changing a little bit

1299
00:56:59,739 --> 00:57:00,710
of the input would affect the output

1300
00:57:00,909 --> 00:57:03,250
this is exactly what we want to do and

1301
00:57:03,449 --> 00:57:05,659
that can be useful in deep learning if

1302
00:57:05,858 --> 00:57:08,240
you want to get a sense of how models

1303
00:57:08,440 --> 00:57:11,119
are representing functions or which bits

1304
00:57:11,318 --> 00:57:12,829
of the input are important but it is not

1305
00:57:13,028 --> 00:57:15,099
useful for learning however if we

1306
00:57:15,298 --> 00:57:17,809
traverse this in the opposite direction

1307
00:57:18,009 --> 00:57:20,960
so from outputs towards inputs then we

1308
00:57:21,159 --> 00:57:25,700
end up with two terms that are

1309
00:57:25,900 --> 00:57:27,528
derivatives of the output with respect

1310
00:57:27,728 --> 00:57:29,450
to the internal nodes and it turns out

1311
00:57:29,650 --> 00:57:32,690
that's exactly what we need for

1312
00:57:32,889 --> 00:57:38,379
learning so it's interesting kind of

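To make the forward-mode/reverse-mode distinction concrete, here is a small sketch on a toy chain — the chain x → a → b → y and all names here are my own example, not from the slides:

```python
import math

# Chain: x -> a = x**2 -> b = sin(a) -> y = 3*b
# Local derivatives: da/dx = 2x, db/da = cos(a), dy/db = 3.

def forward_mode(x):
    """Accumulate d(node)/dx while moving from input towards output."""
    a, da_dx = x**2, 2 * x
    b, db_dx = math.sin(a), math.cos(a) * da_dx   # db/dx = db/da * da/dx
    y, dy_dx = 3 * b, 3 * db_dx
    # Intermediates are derivatives of internal nodes w.r.t. the INPUT.
    return y, {"da_dx": da_dx, "db_dx": db_dx, "dy_dx": dy_dx}

def reverse_mode(x):
    """Forward pass first, then accumulate dy/d(node) from output to input."""
    a = x**2
    b = math.sin(a)
    y = 3 * b
    dy_db = 3.0
    dy_da = dy_db * math.cos(a)                   # dy/da = dy/db * db/da
    dy_dx = dy_da * 2 * x
    # Intermediates are derivatives of the OUTPUT w.r.t. internal nodes.
    return y, {"dy_db": dy_db, "dy_da": dy_da, "dy_dx": dy_dx}

y_f, fwd = forward_mode(0.7)
y_r, rev = reverse_mode(0.7)
assert abs(fwd["dy_dx"] - rev["dy_dx"]) < 1e-12   # same total derivative
```

Both orderings give the same dy/dx; only the intermediate quantities differ, and it's the reverse-mode intermediates (output with respect to internal nodes) that learning needs.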
1313
00:57:38,579 --> 00:57:41,329
explaining this stuff because on the one

1314
00:57:41,528 --> 00:57:43,068
hand it's all kind of trivial it's you

1315
00:57:43,268 --> 00:57:44,690
know it's basically the chain rule you

1316
00:57:44,889 --> 00:57:47,359
know you'll have seen this in high

1317
00:57:47,559 --> 00:57:49,278
school so it's kind of one of these

1318
00:57:49,478 --> 00:57:52,849
simple ideas that actually had quite a

1319
00:57:53,048 --> 00:57:55,278
big impact so even though it's kind of

1320
00:57:55,478 --> 00:57:58,009
obvious when you look at it like this in

1321
00:57:58,208 --> 00:58:00,710
terms of the impact on efficiency when

1322
00:58:00,909 --> 00:58:02,269
you're computing gradient updates for

1323
00:58:02,469 --> 00:58:04,039
neural networks it makes a big

1324
00:58:04,239 --> 00:58:06,769
difference organizing the computation in

1325
00:58:06,969 --> 00:58:08,298
this efficient way and I think that's

1326
00:58:08,498 --> 00:58:10,818
one of the reasons why when backprop was

1327
00:58:11,018 --> 00:58:12,859
introduced it had such a big impact even

1328
00:58:13,059 --> 00:58:14,720
though at heart it's a kind of

1329
00:58:14,920 --> 00:58:17,240
fundamentally simple method and also

1330
00:58:17,440 --> 00:58:18,889
what we'll see as we move on to kind of

1331
00:58:19,088 --> 00:58:20,778
the more vector calculus kind of things

1332
00:58:20,978 --> 00:58:22,639
it all looks pretty trivial if we're

1333
00:58:22,838 --> 00:58:25,490
dealing with scalars but once we move

1334
00:58:25,690 --> 00:58:30,919
into large models then again we'll see

1335
00:58:31,119 --> 00:58:35,470
why the ordering makes a difference so

1336
00:58:35,670 --> 00:58:38,180
yeah essentially reverse mode automatic

1337
00:58:38,380 --> 00:58:39,980
differentiation a clever application of

1338
00:58:40,179 --> 00:58:41,990
the chain rule and backprop are all the

1339
00:58:42,190 --> 00:58:42,919
same thing

1340
00:58:43,119 --> 00:58:46,460
so basically in the backwards pass

1341
00:58:46,659 --> 00:58:47,690
through the network what we're going to

1342
00:58:47,889 --> 00:58:51,079
want to do is compute the derivative of

1343
00:58:51,278 --> 00:58:53,240
the loss with respect to the inputs of

1344
00:58:53,440 --> 00:58:57,289
each module and if we have that then

1345
00:58:57,489 --> 00:58:59,359
that kind of goes into part of this

1346
00:58:59,559 --> 00:59:01,129
minimal API that I was describing those

1347
00:59:01,329 --> 00:59:03,500
three methods that if our modules

1348
00:59:03,699 --> 00:59:05,480
implement those then we can just plug

1349
00:59:05,679 --> 00:59:07,940
them together however we like and go

1350
00:59:08,139 --> 00:59:10,059
ahead and train the other thing that's

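A minimal sketch of what that three-method module API might look like — the method names and the toy Scale module are my own assumptions, the lecture doesn't pin down an exact interface:

```python
class Module:
    """Minimal interface: any module implementing these three methods
    can be plugged into an arbitrary graph and trained."""
    def forward(self, x):               # output given input
        raise NotImplementedError
    def backward(self, dL_dy):          # dL/dx given dL/dy
        raise NotImplementedError
    def param_grads(self, dL_dy):       # dL/dtheta given dL/dy (if any params)
        return []

class Scale(Module):
    """Toy one-parameter module y = w * x, just to show the interface."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        self.x = x                      # cache the forward state for backward
        return self.w * x
    def backward(self, dL_dy):
        return dL_dy * self.w           # dL/dx = dL/dy * dy/dx
    def param_grads(self, dL_dy):
        return [dL_dy * self.x]         # dL/dw = dL/dy * dy/dw

m = Scale(2.0)
y = m.forward(3.0)
dx = m.backward(1.0)
```

Anything satisfying this contract can be composed in any order, which is the plug-and-train property being described.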
1351
00:59:10,259 --> 00:59:12,289
worth mentioning is that

1352
00:59:12,489 --> 00:59:17,778
this idea doesn't just apply to things

1353
00:59:17,978 --> 00:59:18,798
that you might consider to be simple

1354
00:59:18,998 --> 00:59:20,389
mathematical operations you can actually

1355
00:59:20,588 --> 00:59:22,849
apply this to the entire compute graph

1356
00:59:23,048 --> 00:59:25,250
including constructs like for loops or

1357
00:59:25,449 --> 00:59:28,369
conditionals and so on essentially we

1358
00:59:28,568 --> 00:59:30,950
just backtrack through the forward

1359
00:59:31,150 --> 00:59:33,559
execution path so if something has a

1360
00:59:33,759 --> 00:59:35,778
derivative we take it but if in the case

1361
00:59:35,978 --> 00:59:38,089
of an if clause then essentially

1362
00:59:38,289 --> 00:59:40,099
there's multiple execution branches that

1363
00:59:40,298 --> 00:59:41,990
we could have ended up following when we

1364
00:59:42,190 --> 00:59:44,089
work backwards we just need to remember

1365
00:59:44,289 --> 00:59:45,769
which branch we followed going forward

1366
00:59:45,969 --> 00:59:49,849
and that's the one that we use when

1367
00:59:50,048 --> 00:59:52,419
we're going in the reverse direction so

1368
00:59:52,619 --> 00:59:54,379
essentially we can take an entire

1369
00:59:54,579 --> 00:59:56,329
computer program more or less and

1370
00:59:56,528 --> 00:59:58,309
everything we can apply this automatic

1371
00:59:58,509 --> 01:00:00,409
differentiation to and that's one of the

1372
01:00:00,608 --> 01:00:01,789
powerful things that TensorFlow does

1373
01:00:01,989 --> 01:00:02,990
for you it allows you to write these

1374
01:00:03,190 --> 01:00:05,240
arbitrary compute graphs and then when

1375
01:00:05,440 --> 01:00:07,399
it comes time to learn it does the hard

1376
01:00:07,599 --> 01:00:09,649
work of doing all this backtracking for

1377
01:00:09,849 --> 01:00:11,450
you and kind of bookkeeping in terms

1378
01:00:11,650 --> 01:00:15,139
of how the gradients flow there's a

1379
01:00:15,338 --> 01:00:18,440
couple of things that you need to be

1380
01:00:18,639 --> 01:00:21,559
aware of so in most implementations of

1381
01:00:21,759 --> 01:00:23,990
this you need to store the variables

1382
01:00:24,190 --> 01:00:27,500
during the forward pass so in very big

1383
01:00:27,699 --> 01:00:29,659
models or sequence models over very long

1384
01:00:29,858 --> 01:00:33,019
sequence lengths this can lead to us

1385
01:00:33,219 --> 01:00:35,269
requiring a lot of memory but there are

1386
01:00:35,469 --> 01:00:38,139
also clever tricks to get around that so

1387
01:00:38,338 --> 01:00:40,369
there's a nice paper that I linked to

1388
01:00:40,568 --> 01:00:43,399
here which is one way of being memory

1389
01:00:43,599 --> 01:00:45,079
efficient and it essentially boils

1390
01:00:45,278 --> 01:00:49,639
down to being smart about caching states

1391
01:00:49,838 --> 01:00:52,099
during the forward execution

1392
01:00:52,298 --> 01:00:55,548
so rather than remembering everything

1393
01:00:55,748 --> 01:00:56,240
you can think

1394
01:00:56,440 --> 01:00:58,310
of it like every few layers say we

1395
01:00:58,510 --> 01:01:00,610
checkpoint then in the backprop pass

1396
01:01:00,809 --> 01:01:03,110
rather than having to remember

1397
01:01:03,309 --> 01:01:05,030
everything or the other thing would be

1398
01:01:05,230 --> 01:01:06,110
to kind of compute everything from

1399
01:01:06,309 --> 01:01:10,369
scratch we can find the most recent or

1400
01:01:10,568 --> 01:01:12,409
the closest cached state and then

1401
01:01:12,608 --> 01:01:14,629
just do a little forward computation

1402
01:01:14,829 --> 01:01:16,430
from that to get the states we need to

1403
01:01:16,630 --> 01:01:21,320
evaluate the gradients and yeah that

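The checkpoint-and-recompute idea can be sketched like this — a toy illustration of the general trick, not the specific algorithm from the linked paper:

```python
def run_layers(layers, x, k=2):
    """Forward pass that only caches every k-th activation (a checkpoint),
    instead of keeping every intermediate state in memory."""
    ckpts = {0: x}
    h = x
    for i, f in enumerate(layers, start=1):
        h = f(h)
        if i % k == 0:
            ckpts[i] = h
    return h, ckpts

def activation_at(layers, ckpts, i, k=2):
    """Recover activation i in the backward pass: start from the closest
    checkpoint at or before i and recompute forward from there."""
    j = (i // k) * k                    # nearest checkpoint index <= i
    h = ckpts[j]
    for f in layers[j:i]:
        h = f(h)
    return h

# Five toy "layers" that just add a constant each.
layers = [lambda v, c=c: v + c for c in (1, 2, 3, 4, 5)]
out, ckpts = run_layers(layers, 0)
```

Memory drops from one cached state per layer to one per checkpoint, at the cost of a short forward recomputation during the backward pass.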
1404
01:01:21,519 --> 01:01:23,119
most of this is taken care of

1405
01:01:23,318 --> 01:01:26,000
automatically by things like TensorFlow

1406
01:01:26,199 --> 01:01:28,129
and even I think this memory efficiency

1407
01:01:28,329 --> 01:01:29,840
stuff is probably going to find its way

1408
01:01:30,039 --> 01:01:31,850
into the core tensor flow code probably

1409
01:01:32,050 --> 01:01:34,550
the next release or two so a lot of

1410
01:01:34,750 --> 01:01:36,769
these things on a day to day basis

1411
01:01:36,969 --> 01:01:37,970
you don't need to worry about but again

1412
01:01:38,170 --> 01:01:39,800
I think it's always useful to kind of

1413
01:01:40,000 --> 01:01:42,080
know what's going on under the hood in

1414
01:01:42,280 --> 01:01:43,550
case you are doing something unusual or

1415
01:01:43,750 --> 01:01:45,200
if you are running into some of these

1416
01:01:45,400 --> 01:01:56,600
problems okay so in this cartoon here

1417
01:01:56,800 --> 01:02:01,159
what I'm showing is how those different

1418
01:02:01,358 --> 01:02:03,619
pieces fit together and the sorts of

1419
01:02:03,818 --> 01:02:05,180
things it looks like once we're in a

1420
01:02:05,380 --> 01:02:08,269
more realistic setting so we have vector

1421
01:02:08,469 --> 01:02:11,570
inputs vector outputs and as I said

1422
01:02:11,769 --> 01:02:16,340
there's these three API methods that as

1423
01:02:16,539 --> 01:02:17,419
long as we have some sort of

1424
01:02:17,619 --> 01:02:20,119
implementation of these then we can plug

1425
01:02:20,318 --> 01:02:21,560
together these arbitrary graphs of

1426
01:02:21,760 --> 01:02:26,090
modules and figure out the outputs given

1427
01:02:26,289 --> 01:02:27,889
inputs figure out the derivatives we

1428
01:02:28,088 --> 01:02:28,909
need to figure out the parameter update

1429
01:02:29,108 --> 01:02:33,440
so what are they the first one is what

1430
01:02:33,639 --> 01:02:34,940
I'm calling the forward pass so this is

1431
01:02:35,139 --> 01:02:39,118
just what's the output given the input

1432
01:02:39,329 --> 01:02:43,810
so through here and then there's two

1433
01:02:44,010 --> 01:02:45,769
methods that involve gradient so one

1434
01:02:45,969 --> 01:02:49,550
which I call the backward pass is we'd

1435
01:02:49,750 --> 01:02:52,519
like to know the gradient of the loss

1436
01:02:52,719 --> 01:02:55,850
with respect to the inputs given the

1437
01:02:56,050 --> 01:02:57,289
gradient of the loss with respect to the

1438
01:02:57,489 --> 01:03:02,090
output and so it turns out that what

1439
01:03:02,289 --> 01:03:03,980
does that look like well thinking back

1440
01:03:04,179 --> 01:03:06,740
to the chain rule slides from a few slides ago

1441
01:03:06,940 --> 01:03:09,050
if I want to think about this element

1442
01:03:09,250 --> 01:03:10,130
wise then

1443
01:03:10,329 --> 01:03:12,050
the gradient of the loss with respect to

1444
01:03:12,250 --> 01:03:15,350
the i-th input is just the sum over all the

1445
01:03:15,550 --> 01:03:17,539
outputs of the gradient of the loss with

1446
01:03:17,739 --> 01:03:18,920
respect to each of those outputs and

1447
01:03:19,119 --> 01:03:20,660
then the gradient of those outputs with

1448
01:03:20,860 --> 01:03:23,539
respect to the input and if we want to

1449
01:03:23,739 --> 01:03:28,160
use our vector matrix notation then it's

1450
01:03:28,360 --> 01:03:30,740
the product of this gradient vector with

1451
01:03:30,940 --> 01:03:34,280
the Jacobian of Y so this is

1452
01:03:34,480 --> 01:03:37,010
just a kind of compact way of

1453
01:03:37,210 --> 01:03:41,870
representing things similarly to get

1454
01:03:42,070 --> 01:03:43,640
parameter gradients that's just the

1455
01:03:43,840 --> 01:03:45,170
derivative of the loss with respect to

1456
01:03:45,369 --> 01:03:49,880
the parameters which is then the sum over

1457
01:03:50,079 --> 01:03:52,280
all the outputs of the derivative of the loss

1458
01:03:52,480 --> 01:03:54,680
with respect to those outputs times the

1459
01:03:54,880 --> 01:03:56,180
derivative of those outputs with respect to the parameters and

1460
01:03:56,380 --> 01:03:59,800
then these are obviously evaluated at

1461
01:04:00,000 --> 01:04:02,090
the state that we were in when doing the

1462
01:04:02,289 --> 01:04:03,350
forward pass and that's why I was

1463
01:04:03,550 --> 01:04:05,090
saying before that we need to keep these

1464
01:04:05,289 --> 01:04:08,810
states around because typically these

1465
01:04:09,010 --> 01:04:11,030
derivative terms will involve an

1466
01:04:11,230 --> 01:04:12,170
expression that involves what the

1467
01:04:12,369 --> 01:04:16,519
current state is so yeah these are kind

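The element-wise sum and the vector–Jacobian form are the same computation; here is a quick numeric check on a made-up two-output module — the function f is my own toy example:

```python
import numpy as np

def f(x):                       # toy vector -> vector module
    return np.array([x[0] * x[1], x[0] + 3.0 * x[2]])

def jacobian(x):                # dy_j/dx_i, written out by hand
    return np.array([[x[1], x[0], 0.0],
                     [1.0,  0.0, 3.0]])

x = np.array([1.0, 2.0, 0.5])
dL_dy = np.array([0.3, -1.2])   # pretend this arrives from the module above

# Backward pass: dL/dx_i = sum_j dL/dy_j * dy_j/dx_i  ==  J^T @ dL/dy
dL_dx = jacobian(x).T @ dL_dy

# Element-wise sum form agrees with the compact matrix form
manual = np.array([sum(dL_dy[j] * jacobian(x)[j, i] for j in range(2))
                   for i in range(3)])
assert np.allclose(dL_dx, manual)
```

In real frameworks the full Jacobian is rarely formed explicitly (it's large and mostly zero, as noted next); modules implement the product directly.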
1468
01:04:16,719 --> 01:04:18,019
of compact ways of representing this in

1469
01:04:18,219 --> 01:04:21,500
practice we actually don't if you

1470
01:04:21,699 --> 01:04:23,390
were to write these models yourself you

1471
01:04:23,590 --> 01:04:25,010
probably wouldn't want to form the full

1472
01:04:25,210 --> 01:04:28,280
Jacobian in these cases just because the

1473
01:04:28,480 --> 01:04:30,440
Jacobians tend to be very sparse so if

1474
01:04:30,639 --> 01:04:34,580
there are many inputs that might

1475
01:04:34,780 --> 01:04:37,130
not have an influence on an output and

1476
01:04:37,329 --> 01:04:38,810
so many elements of the Jacobian are

1477
01:04:39,010 --> 01:04:42,680
often 0 but it's useful notationally

1478
01:04:42,880 --> 01:04:44,390
particularly if you kind of go back and

1479
01:04:44,590 --> 01:04:45,769
forth between this and the subscript

1480
01:04:45,969 --> 01:04:47,840
notation if you ever need to kind of

1481
01:04:48,039 --> 01:04:50,690
derive how to implement a brand new

1482
01:04:50,889 --> 01:04:52,100
module for yourself say if you have some

1483
01:04:52,300 --> 01:04:53,570
weird function that's not supported by

1484
01:04:53,769 --> 01:05:01,220
TensorFlow so yeah that's more or

1485
01:05:01,420 --> 01:05:03,670
less what I just said so we have these

1486
01:05:03,869 --> 01:05:06,650
methods that we need to

1487
01:05:06,849 --> 01:05:10,550
implement and we chained the forward

1488
01:05:10,750 --> 01:05:13,370
passes together so how would we operate

1489
01:05:13,570 --> 01:05:15,950
this we'd call the forward method

1490
01:05:16,150 --> 01:05:17,660
for the linear unit given the parameter

1491
01:05:17,860 --> 01:05:19,580
an input that would give us some output

1492
01:05:19,780 --> 01:05:21,260
the forward method of the ReLU the

1493
01:05:21,460 --> 01:05:22,860
forward method of the linear the forward

1494
01:05:23,059 --> 01:05:26,519
method of the softmax and then we'd get

1495
01:05:26,719 --> 01:05:28,080
a loss and then we just call the

1496
01:05:28,280 --> 01:05:31,650
backwards methods on these to get our

1497
01:05:31,849 --> 01:05:36,420
derivatives of outputs with respect to

1498
01:05:36,619 --> 01:05:38,190
inputs and derivatives with respect to

1499
01:05:38,389 --> 01:05:41,519
parameters we apply the gradient that we

1500
01:05:41,719 --> 01:05:44,100
get from the parameters to take a small

1501
01:05:44,300 --> 01:05:45,600
descent step and then we just iterate

1502
01:05:45,800 --> 01:05:49,769
that so what I'm going to do in the next

1503
01:05:49,969 --> 01:05:51,840
couple of slides is go through what some

1504
01:05:52,039 --> 01:05:53,190
of those operations look like for these

1505
01:05:53,389 --> 01:05:55,320
building blocks and by the end of it

1506
01:05:55,519 --> 01:05:58,080
we'll have everything we need to do to

1507
01:05:58,280 --> 01:06:01,560
put together something like MNIST

1508
01:06:01,760 --> 01:06:03,120
classification with cross-entropy loss

1509
01:06:03,320 --> 01:06:13,710
and a single hidden layer okay so the

1510
01:06:13,909 --> 01:06:16,320
forward pass for a linear module we're

1511
01:06:16,519 --> 01:06:18,150
calling the linear class is just

1512
01:06:18,349 --> 01:06:20,220
given by this expression here so the

1513
01:06:20,420 --> 01:06:22,950
vector output is a matrix vector

1514
01:06:23,150 --> 01:06:26,640
operation plus a bias again as I say in

1515
01:06:26,840 --> 01:06:28,200
these derivations is often useful to

1516
01:06:28,400 --> 01:06:31,010
kind of flip back and forth between

1517
01:06:31,210 --> 01:06:33,600
matrix vector notation and subscript

1518
01:06:33,800 --> 01:06:36,900
notation so this is just kind of

1519
01:06:37,099 --> 01:06:39,240
unpacking what the nth element of this

1520
01:06:39,440 --> 01:06:44,039
output vector is so we can compose the

1521
01:06:44,239 --> 01:06:45,870
relevant bits of the Jacobian that we

1522
01:06:46,070 --> 01:06:49,050
need so what do we need we want the

1523
01:06:49,250 --> 01:06:51,230
partial of Y with respect to the inputs

1524
01:06:51,429 --> 01:06:53,460
the partial of Y with respect to the

1525
01:06:53,659 --> 01:06:55,590
bias and the partial of Y with respect

1526
01:06:55,789 --> 01:07:01,110
to the weights and we get these

1527
01:07:01,309 --> 01:07:02,789
expressions this is what I was saying

1528
01:07:02,989 --> 01:07:07,140
before so this Kronecker Delta here most

1529
01:07:07,340 --> 01:07:09,480
of the elements of this Jacobian are

1530
01:07:09,679 --> 01:07:11,550
zero because if a particular

1531
01:07:11,750 --> 01:07:16,289
weight isn't involved in

1532
01:07:16,489 --> 01:07:18,030
producing a particular output

1533
01:07:18,230 --> 01:07:20,430
then that element of the

1534
01:07:20,630 --> 01:07:21,870
Jacobian is exactly zero

1535
01:07:22,070 --> 01:07:24,470
and so it's quite sparse

1536
01:07:24,670 --> 01:07:28,260
so armed with this we can put this together

1537
01:07:28,460 --> 01:07:30,900
and get our backwards pass so what is

1538
01:07:31,099 --> 01:07:33,630
that it's just given by this expression

1539
01:07:33,829 --> 01:07:37,330
so we kind of plug in

1540
01:07:37,530 --> 01:07:39,289
these things that we've already derived

1541
01:07:39,489 --> 01:07:43,789
so as I said in the backward

1542
01:07:43,989 --> 01:07:45,760
pass we assume that we're given the

1543
01:07:45,960 --> 01:07:48,169
gradient of the loss with

1544
01:07:48,369 --> 01:07:49,070
respect to the

1545
01:07:49,269 --> 01:07:51,710
output and so we just have this matrix

1546
01:07:51,909 --> 01:07:56,360
vector expression here similarly for the

1547
01:07:56,559 --> 01:07:58,100
parameter gradient if we kind of churn

1548
01:07:58,300 --> 01:08:00,200
through this math then we get

1549
01:08:00,400 --> 01:08:02,390
this outer product of the gradient

1550
01:08:02,590 --> 01:08:04,850
vector with the inputs and there's a

1551
01:08:05,050 --> 01:08:08,080
similarly simple thing for the biases so

1552
01:08:08,280 --> 01:08:10,280
armed with that we have everything we

1553
01:08:10,480 --> 01:08:12,620
need to do forward propagation backward

1554
01:08:12,820 --> 01:08:14,390
propagation and parameter updates for

1555
01:08:14,590 --> 01:08:19,039
the linear module the ReLU module is

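Putting the linear module's three operations into code — a sketch built from the expressions just derived (dL/dx = Wᵀg, dL/dW = g xᵀ, dL/db = g); the class shape itself is my own:

```python
import numpy as np

class Linear:
    """y = W @ x + b, with backward pass dL/dx = W^T g and parameter
    gradients dL/dW = outer(g, x), dL/db = g, where g = dL/dy."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):
        self.x = x                      # keep the state around for backward
        return self.W @ x + self.b
    def backward(self, g):
        return self.W.T @ g             # dL/dx
    def param_grads(self, g):
        return np.outer(g, self.x), g   # dL/dW (outer product), dL/db

W = np.array([[1.0, 2.0],
              [0.0, -1.0]])
lin = Linear(W, np.array([0.5, 0.0]))
y = lin.forward(np.array([2.0, 3.0]))
```

Note the outer product dL/dW never needs the full (sparse) Jacobian of y with respect to W.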
1556
01:08:19,239 --> 01:08:22,070
super simple so there's no parameters so

1557
01:08:22,270 --> 01:08:25,639
the forward pass is just this max

1558
01:08:25,838 --> 01:08:27,440
of 0 and the input so it's our kind of

1559
01:08:27,640 --> 01:08:31,070
floor at zero and then the backward pass

1560
01:08:31,270 --> 01:08:33,800
is also simple it's this kind of element

1561
01:08:34,000 --> 01:08:37,699
wise comparison so if the output was

1562
01:08:37,899 --> 01:08:39,889
above zero then the gradient with

1563
01:08:40,088 --> 01:08:41,630
respect to inputs is just 1 we're in

1564
01:08:41,829 --> 01:08:43,789
the linear pass-through if the output

1565
01:08:43,989 --> 01:08:45,320
was below zero then there are no

1566
01:08:45,520 --> 01:08:54,579
gradients the softmax module is a little

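The ReLU module in code — my sketch, using the convention that the gradient at exactly zero is taken to be zero:

```python
import numpy as np

class ReLU:
    """Forward: elementwise max(0, x).  Backward: pass the incoming
    gradient through wherever the input was strictly positive,
    zero elsewhere (no parameters, so no parameter gradients)."""
    def forward(self, x):
        self.mask = x > 0               # remember where we were in the linear regime
        return np.where(self.mask, x, 0.0)
    def backward(self, g):              # g = dL/dy
        return np.where(self.mask, g, 0.0)

relu = ReLU()
y = relu.forward(np.array([-2.0, 0.0, 3.0]))
dx = relu.backward(np.array([1.0, 1.0, 1.0]))
```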
1567
01:08:54,779 --> 01:08:56,630
trickier to derive the elements for

1568
01:08:56,829 --> 01:08:59,389
but it's basically still simple calculus

1569
01:08:59,588 --> 01:09:02,779
so if we recall what that was the

1570
01:09:02,979 --> 01:09:06,340
nth output is just the exponential of the

1571
01:09:06,539 --> 01:09:09,770
nth input normalized by that

1572
01:09:09,970 --> 01:09:13,820
same expression summed over all units we can

1573
01:09:14,020 --> 01:09:17,510
plug these in derive our Jacobian

1574
01:09:17,710 --> 01:09:20,779
element and then similarly we can plug

1575
01:09:20,979 --> 01:09:21,650
them in the backwards pass I've actually

1576
01:09:21,850 --> 01:09:24,079
skipped the derivation for this and I

1577
01:09:24,279 --> 01:09:25,460
think for the next one in the slides

1578
01:09:25,659 --> 01:09:26,480
just because that's going to come up

1579
01:09:26,680 --> 01:09:29,809
as something on your assignments but in

1580
01:09:30,009 --> 01:09:30,980
a later version of the slides I'll

1581
01:09:31,180 --> 01:09:33,020
update it with the solution in

1582
01:09:33,220 --> 01:09:36,060
there okay

1583
01:09:47,020 --> 01:10:25,860
[inaudible question] I don't think so yeah good

1584
01:10:26,060 --> 01:10:30,600
question um I think I usually

1585
01:10:30,800 --> 01:10:32,400
do a greater than zero so if it's equal

1586
01:10:32,600 --> 01:10:34,020
to zero then I treat the gradient as zero

1587
01:10:34,220 --> 01:10:37,230
it's not well-defined in

1588
01:10:37,430 --> 01:10:38,579
practice you can kind of assume it

1589
01:10:38,779 --> 01:10:43,260
doesn't happen much but I would just

1590
01:10:43,460 --> 01:10:45,449
define the gradient at zero to be

1591
01:10:45,649 --> 01:10:47,520
zero but it actually doesn't matter too

1592
01:10:47,720 --> 01:10:50,340
much just because numerically or

1593
01:10:50,539 --> 01:10:51,869
extremely unlikely to hit something

1594
01:10:52,069 --> 01:11:04,020
that's exactly zero yeah so the final

1595
01:11:04,220 --> 01:11:05,970
part of this was the loss the loss

1596
01:11:06,170 --> 01:11:12,690
itself and so again there's no

1597
01:11:12,890 --> 01:11:14,810
parameters in the forward pass we just

1598
01:11:15,010 --> 01:11:18,180
this is our definition of the loss when

1599
01:11:18,380 --> 01:11:19,739
we take derivatives we end up with this

1600
01:11:19,939 --> 01:11:21,869
expression and you might look at this

1601
01:11:22,069 --> 01:11:25,320
and be a little bit worried particularly

1602
01:11:25,520 --> 01:11:27,390
that with with this kind of expression

1603
01:11:27,590 --> 01:11:30,420
and X can vary a lot then you might

1604
01:11:30,619 --> 01:11:32,460
worry that if X is very small we might

1605
01:11:32,659 --> 01:11:34,380
run into numerical precision issues and

1606
01:11:34,579 --> 01:11:36,690
in fact actually that is a real concern

1607
01:11:36,890 --> 01:11:42,930
so what people typically do is use this

1608
01:11:43,130 --> 01:11:45,989
kind of compound module so it's softmax

1609
01:11:46,189 --> 01:11:48,119
plus cross-entropy and you'll see that

1610
01:11:48,319 --> 01:11:49,579
in TensorFlow I think there are

1611
01:11:49,779 --> 01:11:52,110
implementations for both but

1612
01:11:52,310 --> 01:11:53,890
you know unless you have your own

1613
01:11:54,090 --> 01:11:55,119
special reasons you probably should use

1614
01:11:55,319 --> 01:11:59,650
that the softmax plus cross-entropy so

1615
01:11:59,850 --> 01:12:02,500
it basically combines both the

1616
01:12:02,699 --> 01:12:04,390
softmax operation and the cross-entropy

1617
01:12:04,590 --> 01:12:07,840
loss into a single operation and the

1618
01:12:08,039 --> 01:12:09,279
reason for that is if we do that and

1619
01:12:09,479 --> 01:12:10,840
look at the gradients that we get out

1620
01:12:11,039 --> 01:12:12,850
then it's this much more stable form

1621
01:12:13,050 --> 01:12:18,159
here so if we kind of go back what have

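A sketch of the fused module — my own code, following the standard max-shift / log-sum-exp trick; the combined gradient simplifies to probabilities minus the one-hot target:

```python
import numpy as np

def softmax_xent(logits, target):
    """Fused softmax + cross-entropy.  Shifting by the max doesn't change
    the softmax but keeps the exponentials in range, and log-sum-exp
    avoids the unstable log(softmax) composition."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())   # log softmax, stably
    loss = -log_probs[target]
    grad = np.exp(log_probs)                  # softmax probabilities
    grad[target] -= 1.0                       # fused backward pass: p - t
    return loss, grad

loss, grad = softmax_xent(np.array([2.0, 1.0, -1.0]), target=0)
```

Even with extreme logits like [1000, 0, 0] this stays finite, which is why the combined op is the one to reach for unless you have a special reason not to.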
1622
01:12:18,359 --> 01:12:20,289
we done we had this graph that we wanted

1623
01:12:20,489 --> 01:12:23,140
to do learning in say for digit

1624
01:12:23,340 --> 01:12:25,180
classification we've gone through and

1625
01:12:25,380 --> 01:12:28,690
for each of these module types we

1626
01:12:28,890 --> 01:12:30,730
figured out what we need to do to kind

1627
01:12:30,930 --> 01:12:32,860
of propagate forwards what we need to do

1628
01:12:33,060 --> 01:12:34,300
to propagate backwards and what we need

1629
01:12:34,500 --> 01:12:36,850
to do to come up with the parameter

1630
01:12:37,050 --> 01:12:39,880
derivatives and armed with that we're

1631
01:12:40,079 --> 01:12:41,289
ready to go and we can plug together

1632
01:12:41,489 --> 01:12:47,440
things in whatever order we like so in

1633
01:12:47,640 --> 01:12:48,789
terms of learning we just kind of

1634
01:12:48,989 --> 01:12:51,279
iterate through getting an input and a

1635
01:12:51,479 --> 01:12:53,850
label running forward propagation

1636
01:12:54,050 --> 01:12:56,619
running backwards propagation getting

1637
01:12:56,819 --> 01:12:58,630
parameter updates applying the parameter

1638
01:12:58,829 --> 01:13:00,520
updates and cycling and the nice thing

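The loop itself, boiled down to a runnable toy — one scalar "module" y = w·x and a squared-error loss stand in for the linear/ReLU/softmax stack; everything here is my own minimal example:

```python
def train(data, w=0.0, lr=0.1, epochs=50):
    """Iterate: forward pass, backward pass, parameter update, repeat."""
    for _ in range(epochs):
        for x, label in data:
            y = w * x                   # forward pass
            dL_dy = 2.0 * (y - label)   # gradient of squared loss at output
            dL_dw = dL_dy * x           # backward: parameter gradient
            w -= lr * dL_dw             # small descent step
    return w

# Data generated from y = 3x, so w should approach 3.
data = [(x, 3.0 * x) for x in (-2.0, -1.0, 1.0, 2.0)]
w = train(data)
```

Swapping in a deeper stack just means more forward calls and more backward calls per iteration; the outer loop doesn't change.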
1639
01:13:00,720 --> 01:13:02,320
is if we'd you know written this from

1640
01:13:02,520 --> 01:13:03,850
scratch ourselves and we wanted to try

1641
01:13:04,050 --> 01:13:07,570
adding in you know an extra hidden layer

1642
01:13:07,770 --> 01:13:10,239
then it'd be very simple we just

1643
01:13:10,439 --> 01:13:12,400
kind of put another one of these

1644
01:13:12,600 --> 01:13:15,550
modules here change the call sequence

1645
01:13:15,750 --> 01:13:18,789
and we're good to go so once we have

1646
01:13:18,989 --> 01:13:20,380
those in place it's then very easy to

1647
01:13:20,579 --> 01:13:22,060
explore different topologies if I wanted

1648
01:13:22,260 --> 01:13:24,489
to kind of come up with some crazy

1649
01:13:24,689 --> 01:13:25,989
non-linearity instead of the ReLU

1650
01:13:26,189 --> 01:13:28,140
then I am free to do so I would just

1651
01:13:28,340 --> 01:13:31,720
implement a module that has those three

1652
01:13:31,920 --> 01:13:33,970
API methods and everything should

1653
01:13:34,170 --> 01:13:36,670
just work in this next section I'm going

1654
01:13:36,869 --> 01:13:38,710
to kind of do a quick tour of what I'm

1655
01:13:38,909 --> 01:13:42,119
calling a module zoo so we've seen some

1656
01:13:42,319 --> 01:13:44,970
basic module types that are useful so

1657
01:13:45,170 --> 01:13:50,680
linear sigmoid ReLU softmax just gonna

1658
01:13:50,880 --> 01:13:52,920
go through some of the other operations

1659
01:13:53,119 --> 01:13:56,590
that you might see so there's actually

1660
01:13:56,789 --> 01:13:58,659
two main types of linear model the first

1661
01:13:58,859 --> 01:14:01,449
is the kind of simple matrix

1662
01:14:01,649 --> 01:14:02,760
multiplication that we've seen already

1663
01:14:02,960 --> 01:14:04,840
convolution and deconvolution layers

1664
01:14:05,039 --> 01:14:08,050
are also linear I'm not gonna talk

1665
01:14:08,250 --> 01:14:10,090
about those but Karen's going to cover

1666
01:14:10,289 --> 01:14:12,360
those in the next lecture on convnets

1667
01:14:12,560 --> 01:14:15,520
there's a couple of basic sort of

1668
01:14:15,720 --> 01:14:18,310
element wise operations so addition and

1669
01:14:18,510 --> 01:14:20,289
element-wise multiplication some group

1670
01:14:20,489 --> 01:14:22,680
operations and then a couple of other

1671
01:14:22,880 --> 01:14:24,340
nonlinearities that are worth knowing

1672
01:14:24,539 --> 01:14:29,800
about also the list in the slides

1673
01:14:30,000 --> 01:14:32,050
is far from exhaustive in terms of

1674
01:14:32,250 --> 01:14:33,970
possible activation functions you'd

1675
01:14:34,170 --> 01:14:37,810
wanna use typically the ones that we're

1676
01:14:38,010 --> 01:14:40,000
gonna cover today will will be in the

1677
01:14:40,199 --> 01:14:41,320
vast majority of things you see but it's

1678
01:14:41,520 --> 01:14:43,360
also worth remembering that if you know

1679
01:14:43,560 --> 01:14:44,980
if you have a particular problem or if

1680
01:14:45,180 --> 01:14:46,180
you feel like you need to think

1681
01:14:46,380 --> 01:14:47,680
creatively about it you have

1682
01:14:47,880 --> 01:14:49,119
license to put pretty much

1683
01:14:49,319 --> 01:14:50,710
anything you want in these models as

1684
01:14:50,909 --> 01:14:52,539
long as they're differentiable you're

1685
01:14:52,739 --> 01:14:54,460
absolutely fine and even if they're not

1686
01:14:54,659 --> 01:14:56,739
perfectly differentiable you might still

1687
01:14:56,939 --> 01:14:57,909
be able to kind of come up with

1688
01:14:58,109 --> 01:15:02,860
something that's usable so yeah I'll go

1689
01:15:03,060 --> 01:15:06,760
through these relatively quickly so if

1690
01:15:06,960 --> 01:15:09,100
we want to do addition then the forward

1691
01:15:09,300 --> 01:15:11,770
prop method is obviously just simple vector

1692
01:15:11,970 --> 01:15:15,699
addition the backprop method is also

1693
01:15:15,899 --> 01:15:19,029
relatively straightforward there's no

1694
01:15:19,229 --> 01:15:20,020
parameters so there's no gradient

1695
01:15:20,220 --> 01:15:20,500
update

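A sketch of such parameter-free modules, including the element-wise multiplication discussed next; the class and method names are assumptions in the same style as above:

```python
import numpy as np

# Hypothetical parameter-free modules in the same fprop/bprop style.
class Add:
    def fprop(self, a, b):
        return a + b

    def bprop(self, grad_out):
        # Addition just routes the incoming gradient to both inputs,
        # and with no parameters there is no gradient update step.
        return grad_out, grad_out

class Mul:
    def fprop(self, a, b):
        self.a, self.b = a, b          # cache both inputs
        return a * b

    def bprop(self, grad_out):
        # Each input's gradient is gated by the other input, which is
        # what makes element-wise multiplication useful for gating.
        return grad_out * self.b, grad_out * self.a
```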
1696
01:15:20,699 --> 01:15:24,159
similarly for multiplication so element

1697
01:15:24,359 --> 01:15:26,710
wise multiplication this kind of thing

1698
01:15:26,909 --> 01:15:29,890
is kind of useful as I was saying in

1699
01:15:30,090 --> 01:15:32,079
like gating situations where depending

1700
01:15:32,279 --> 01:15:35,800
on some context say you might want to

1701
01:15:36,000 --> 01:15:37,779
propagate some parts of the state and

1702
01:15:37,979 --> 01:15:42,340
not others also comes up in modulation

1703
01:15:42,539 --> 01:15:45,190
or things like attention so if I want to

1704
01:15:45,390 --> 01:15:46,449
emphasize some parts of my

1705
01:15:46,649 --> 01:15:47,770
representation relative to others

1706
01:15:47,970 --> 01:15:49,869
that's where you'd see this kind

1707
01:15:50,069 --> 01:15:54,159
of operation there's a couple of kind of

1708
01:15:54,359 --> 01:15:58,180
group wise operations so summing for

1709
01:15:58,380 --> 01:16:02,020
instance so if we have a sum then the

1710
01:16:02,220 --> 01:16:05,680
backward gradient kind of gets

1711
01:16:05,880 --> 01:16:07,119
distributed across

1712
01:16:07,319 --> 01:16:10,150
all the elements if we have a max so you

1713
01:16:10,350 --> 01:16:12,159
might see this in max pooling in

1714
01:16:12,359 --> 01:16:15,730
convnets for instance then basically

1715
01:16:15,930 --> 01:16:18,909
for the backprop if the element was

1716
01:16:19,109 --> 01:16:21,220
the max then the gradient just passes

1717
01:16:21,420 --> 01:16:25,150
through otherwise there's no gradient if

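The two reduction rules just described (a sum distributes the gradient, a max routes it to the winning element) can be sketched like this; the function names are made up:

```python
import numpy as np

# Sketches of the group-wise reductions; function names are made up.
def sum_bprop(grad_out, n):
    # A sum distributes the single incoming gradient to every element.
    return np.full(n, grad_out)

def max_bprop(grad_out, x):
    # Max (as in max pooling) routes the gradient only to the element
    # that was the max; every other element gets zero gradient.
    grad = np.zeros_like(x, dtype=float)
    grad[np.argmax(x)] = grad_out
    return grad
```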
1718
01:16:25,350 --> 01:16:28,480
we have a switch or a conditional one

1719
01:16:28,680 --> 01:16:30,340
way of representing it as I was saying

1720
01:16:30,539 --> 01:16:31,810
is with this kind of element-wise

1721
01:16:32,010 --> 01:16:33,730
multiplication and we basically just

1722
01:16:33,930 --> 01:16:36,250
need to remember which branch of the

1723
01:16:36,449 --> 01:16:39,100
switch was active that gets back propped

1724
01:16:39,300 --> 01:16:45,159
everything else gets set to zero here's

1725
01:16:45,359 --> 01:16:47,770
a couple of slight variance on

1726
01:16:47,970 --> 01:16:49,980
activation function we've seen already

1727
01:16:50,180 --> 01:16:54,430
so the tanh is basically just a kind of

1728
01:16:54,630 --> 01:16:55,960
scaled and shifted version of the

1729
01:16:56,159 --> 01:17:00,550
sigmoid so at 0 it's 0 and it

1730
01:17:00,750 --> 01:17:05,050
saturates at 1 and minus 1 if you were

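That scaled-and-shifted relationship can be checked numerically, since tanh(x) = 2*sigmoid(2x) - 1:

```python
import numpy as np

# Numerically checking that tanh is a scaled and shifted sigmoid:
# tanh(x) = 2 * sigmoid(2x) - 1
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)
```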
1731
01:17:05,250 --> 01:17:07,210
to build a feed-forward Network there's

1732
01:17:07,409 --> 01:17:09,550
potentially in some cases

1733
01:17:09,750 --> 01:17:12,850
advantages to using tanh over sigmoid in

1734
01:17:13,050 --> 01:17:16,989
that if you initialize with small

1735
01:17:17,189 --> 01:17:18,310
weights and small biases then you

1736
01:17:18,510 --> 01:17:19,810
basically get to initialize in this

1737
01:17:20,010 --> 01:17:24,010
linear region here and in practice it's

1738
01:17:24,210 --> 01:17:26,920
often nice if you can initialize your

1739
01:17:27,119 --> 01:17:28,659
network so that it does a kind of

1740
01:17:28,859 --> 01:17:30,070
simple straightforward function rather

1741
01:17:30,270 --> 01:17:31,750
than kind of risking being in some of

1742
01:17:31,949 --> 01:17:33,100
these saturated regions where the

1743
01:17:33,300 --> 01:17:36,190
gradients aren't going to flow for similar

1744
01:17:36,390 --> 01:17:38,860
kind of gradient flow reasons rather

1745
01:17:39,060 --> 01:17:41,440
than using the ReLU which would be

1746
01:17:41,640 --> 01:17:43,539
kind of zero here another thing that

1747
01:17:43,739 --> 01:17:45,460
people sometimes use is to have a very

1748
01:17:45,659 --> 01:17:48,130
small but nonzero slope in this negative

1749
01:17:48,329 --> 01:17:50,050
region and again it just kind of helps

1750
01:17:50,250 --> 01:17:53,590
with gradient propagation in that you no

1751
01:17:53,789 --> 01:17:55,720
longer lose all gradient if you're below

1752
01:17:55,920 --> 01:17:58,470
zero and that can also be useful

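A sketch of this "leaky" variant; the 0.01 slope is a common but arbitrary choice, not one given in the lecture:

```python
import numpy as np

# A "leaky" ReLU: small nonzero slope below zero, so the gradient is
# not lost entirely for negative inputs.
def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def leaky_relu_grad(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)
```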
1753
01:17:58,670 --> 01:18:00,520
I'd say this is actually one of those

1754
01:18:00,720 --> 01:18:04,989
things where it's probably not a

1755
01:18:05,189 --> 01:18:08,170
default choice but maybe it should be in

1756
01:18:08,369 --> 01:18:10,720
that in my experience it's often better to

1757
01:18:10,920 --> 01:18:13,600
use this than it is to use a ReLU

1758
01:18:13,800 --> 01:18:16,060
that said I often don't use it just to

1759
01:18:16,260 --> 01:18:17,770
kind of keep as few moving parts as

1760
01:18:17,970 --> 01:18:18,940
possible because you know there are

1761
01:18:19,140 --> 01:18:20,440
design choices that you'd want to make

1762
01:18:20,640 --> 01:18:23,260
here so if there was something that I

1763
01:18:23,460 --> 01:18:24,940
really really cared about getting the

1764
01:18:25,140 --> 01:18:26,800
best performance out of I'd probably start

1765
01:18:27,000 --> 01:18:28,699
to explore some of these variants but

1766
01:18:28,899 --> 01:18:30,498
day to day I tend to kind of stick with

1767
01:18:30,698 --> 01:18:31,519
the simple choices just because then

1768
01:18:31,719 --> 01:18:33,498
there's fewer things to keep track

1769
01:18:33,698 --> 01:18:38,900
of in terms of mental overhead we've

1770
01:18:39,100 --> 01:18:41,720
already seen the cross-entropy loss and so

1771
01:18:41,920 --> 01:18:43,369
there's just another simple one so if

1772
01:18:43,569 --> 01:18:44,900
we're doing say regression problems then

1773
01:18:45,100 --> 01:18:50,420
squared error is a common choice yeah I

1774
01:18:50,619 --> 01:18:51,680
didn't have this on the slides but I can add

1775
01:18:51,880 --> 01:18:53,150
it later it's just again worth

1776
01:18:53,350 --> 01:18:53,600
noting

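Hedged sketches of the squared-error and L1 losses; note how the L1 gradient stays bounded for outliers, which is the robustness point made in a moment:

```python
import numpy as np

# Squared-error and L1 losses with their gradients. The L1 gradient
# is bounded even for huge errors, which is the robustness point.
def squared_error(pred, target):
    return 0.5 * np.sum((pred - target) ** 2)

def squared_error_grad(pred, target):
    return pred - target             # grows linearly with the error

def l1_loss(pred, target):
    return np.sum(np.abs(pred - target))

def l1_grad(pred, target):
    return np.sign(pred - target)    # always in {-1, 0, 1}
```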
1777
01:18:53,800 --> 01:18:55,340
so squared error is very common in

1778
01:18:55,539 --> 01:18:57,829
regression problems again in practice I

1779
01:18:58,029 --> 01:18:59,630
would probably try squared error if I

1780
01:18:59,829 --> 01:19:01,729
had this but I'd probably also try other

1781
01:19:01,929 --> 01:19:05,090
norms as well so in particular L1 one of

1782
01:19:05,289 --> 01:19:07,279
the problems with squared error is if

1783
01:19:07,479 --> 01:19:09,619
you have outliers or observations that for

1784
01:19:09,819 --> 01:19:11,720
whatever reason happen to

1785
01:19:11,920 --> 01:19:13,220
be way off the mark then you can get

1786
01:19:13,420 --> 01:19:15,229
extremely large gradients and so

1787
01:19:15,429 --> 01:19:16,670
sometimes that can make learning

1788
01:19:16,869 --> 01:19:20,810
unstable so again in all these cases

1789
01:19:21,010 --> 01:19:22,458
there's sort of like reasonable defaults

1790
01:19:22,658 --> 01:19:23,958
that are sensible to start with but it's

1791
01:19:24,158 --> 01:19:26,208
also useful to know kind of okay what

1792
01:19:26,408 --> 01:19:27,800
would the design choices that I might

1793
01:19:28,000 --> 01:19:30,890
want to revisit be if things for

1794
01:19:31,090 --> 01:19:32,510
whatever reason aren't working or if

1795
01:19:32,710 --> 01:19:35,900
gradients are kind of blowing up and actually

1796
01:19:36,100 --> 01:19:39,470
that brings me on to this next section

1797
01:19:39,670 --> 01:19:43,190
where what I'll do is kind of go through

1798
01:19:43,390 --> 01:19:46,220
some sort of high-level practical tips

1799
01:19:46,420 --> 01:19:47,510
in terms of things that might be useful

1800
01:19:47,710 --> 01:19:49,458
for you when you're dealing with these

1801
01:19:49,658 --> 01:19:53,208
models and kind of good things to to

1802
01:19:53,408 --> 01:19:56,119
bear in mind this came up a bit in the

1803
01:19:56,319 --> 01:19:59,690
break as well so in the field

1804
01:19:59,890 --> 01:20:01,220
at the moment there's definitely a kind

1805
01:20:01,420 --> 01:20:04,489
of scarcity of strong theoretical

1806
01:20:04,689 --> 01:20:06,010
statements we can make and so

1807
01:20:06,210 --> 01:20:09,489
unfortunately that kind of means that a

1808
01:20:09,689 --> 01:20:12,110
lot of deep learning is still a bit more

1809
01:20:12,310 --> 01:20:14,769
of a dark art than would be ideal so

1810
01:20:14,969 --> 01:20:16,820
there are some things that you can kind

1811
01:20:17,020 --> 01:20:19,130
of plug in and just rely on but there's

1812
01:20:19,329 --> 01:20:24,979
also a lot of trial and error and it's

1813
01:20:25,179 --> 01:20:26,329
some pieces where you kind of have to do

1814
01:20:26,529 --> 01:20:28,998
more of an iterated loop of okay

1815
01:20:29,198 --> 01:20:30,498
is this model working if so great if not

1816
01:20:30,698 --> 01:20:33,829
okay what might be going wrong and a lot

1817
01:20:34,029 --> 01:20:35,269
of getting good at this kind of stuff is

1818
01:20:35,469 --> 01:20:38,029
refining your intuition for if something

1819
01:20:38,229 --> 01:20:38,840
isn't working

1820
01:20:39,039 --> 01:20:40,918
what might the causes be

1821
01:20:41,118 --> 01:20:43,679
how to quickly diagnose that and also what

1822
01:20:43,878 --> 01:20:45,060
sort of things you could do to fix that

1823
01:20:45,260 --> 01:20:53,850
so let's go through these so one problem

1824
01:20:54,050 --> 01:20:57,469
that you can run into is overfitting so

1825
01:20:57,668 --> 01:21:00,298
you get very good loss on your training

1826
01:21:00,498 --> 01:21:04,469
set but you don't generalize well so one

1827
01:21:04,668 --> 01:21:06,810
thing you can do there and this was kind

1828
01:21:07,010 --> 01:21:10,469
of popular in the early days is early

1829
01:21:10,668 --> 01:21:12,060
stopping so you basically just rather

1830
01:21:12,260 --> 01:21:14,248
than training to kind of push your loss

1831
01:21:14,448 --> 01:21:18,719
all the way to zero you kind of in

1832
01:21:18,918 --> 01:21:20,429
parallel evaluating on some

1833
01:21:20,628 --> 01:21:22,829
validation set and you stop once say

1834
01:21:23,029 --> 01:21:24,390
that the loss on your validation set

1835
01:21:24,590 --> 01:21:27,989
starts to go up that's one method

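A skeleton of that early-stopping rule, assuming hypothetical train_step and val_loss callables that are stand-ins for your training and validation code:

```python
# Early-stopping skeleton: train while monitoring a validation loss
# and stop once it stops improving for `patience` checks in a row.
# train_step and val_loss are hypothetical stand-ins.
def train_with_early_stopping(train_step, val_loss, max_steps=1000,
                              patience=5):
    best, bad_checks = float("inf"), 0
    for _ in range(max_steps):
        train_step()
        v = val_loss()
        if v < best:
            best, bad_checks = v, 0   # still improving: reset counter
        else:
            bad_checks += 1           # validation loss went back up
            if bad_checks >= patience:
                break                 # stop before overfitting further
    return best
```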
1836
01:21:28,189 --> 01:21:29,699
something else you can do and you know

1837
01:21:29,899 --> 01:21:30,859
you can do all these in combination

1838
01:21:31,059 --> 01:21:33,149
there's something else called weight

1839
01:21:33,349 --> 01:21:37,769
decay and it basically penalizes the

1840
01:21:37,969 --> 01:21:39,179
weights in your network from becoming

1841
01:21:39,378 --> 01:21:42,479
too big and one intuition for why this

1842
01:21:42,679 --> 01:21:44,519
might be helpful is if we think about

1843
01:21:44,719 --> 01:21:46,588
something like the sigmoid with small

1844
01:21:46,788 --> 01:21:48,029
weights we're going to tend to be in

1845
01:21:48,229 --> 01:21:54,119
this linear region more often so

1846
01:21:54,319 --> 01:21:56,689
our kind of functional mapping will be

1847
01:21:56,889 --> 01:22:01,469
closer to linear and so potentially

1848
01:22:01,668 --> 01:22:04,439
lower complexity one thing to

1849
01:22:04,639 --> 01:22:05,668
mention actually about weight decay is

1850
01:22:05,868 --> 01:22:07,979
that it doesn't have as much an effect

1851
01:22:08,179 --> 01:22:10,469
on ReLU units as it does on some of

1852
01:22:10,668 --> 01:22:14,819
these others so it may be a less useful

1853
01:22:15,019 --> 01:22:17,009
form of regularization for your relu

1854
01:22:17,208 --> 01:22:18,208
layers it'll still obviously have an

1855
01:22:18,408 --> 01:22:22,048
effect on the output but with ReLUs

1856
01:22:22,248 --> 01:22:23,640
you can rescale all the weights

1857
01:22:23,840 --> 01:22:25,829
down and you still have the same set of

1858
01:22:26,029 --> 01:22:27,269
decision boundaries so it doesn't quite

1859
01:22:27,469 --> 01:22:30,649
regularize ReLUs in the same way

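Weight decay as it typically appears in an SGD update; a sketch, with the default hyperparameter values chosen arbitrarily:

```python
import numpy as np

# SGD step with weight decay: the decay term pulls every weight
# toward zero, penalizing large weights. decay=0.0 recovers plain
# SGD; the default values here are arbitrary.
def sgd_step(w, grad, lr=0.1, decay=1e-4):
    return w - lr * (grad + decay * w)
```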
1860
01:22:30,849 --> 01:22:34,739
something else that you can do as I

1861
01:22:34,939 --> 01:22:37,829
said is add noise and this kind of brings us

1862
01:22:38,029 --> 01:22:40,979
on to things like drop out and there's a

1863
01:22:41,179 --> 01:22:42,659
couple of ways of interpreting what's

1864
01:22:42,859 --> 01:22:46,469
going on so you can add noise to your

1865
01:22:46,668 --> 01:22:48,418
your inputs which you could also think

1866
01:22:48,618 --> 01:22:49,859
of as a form of data augmentation you

1867
01:22:50,059 --> 01:22:52,319
could add noise to your activities you

1868
01:22:52,519 --> 01:22:53,850
can add noise to your parameters you can

1869
01:22:54,050 --> 01:22:54,619
kind of

1870
01:22:54,819 --> 01:22:58,050
mask out some of the activities of units

1871
01:22:58,250 --> 01:23:01,260
within layers and yet in terms of the

1872
01:23:01,460 --> 01:23:02,880
like what is this doing well you can

1873
01:23:03,079 --> 01:23:03,570
kind of think of it in a couple

1874
01:23:03,770 --> 01:23:05,640
different ways one is that it prevents

1875
01:23:05,840 --> 01:23:08,550
the network from being too reliant

1876
01:23:08,750 --> 01:23:12,260
on very precise conjunctions or features

1877
01:23:12,460 --> 01:23:15,090
so you can imagine that you know that'd

1878
01:23:15,289 --> 01:23:17,579
be one way to memorize your data set if

1879
01:23:17,779 --> 01:23:19,890
you kind of have very precise activities

1880
01:23:20,090 --> 01:23:21,918
that depend on the very precise pattern

1881
01:23:22,118 --> 01:23:25,470
that you see in a particular input you

1882
01:23:25,670 --> 01:23:27,208
can also view it as a kind of cheap way

1883
01:23:27,408 --> 01:23:31,140
of doing ensembling say I run the model multiple

1884
01:23:31,340 --> 01:23:32,640
times adding different amounts of noise

1885
01:23:32,840 --> 01:23:36,418
then you might expect

1886
01:23:36,618 --> 01:23:37,829
that to have somewhat similar effects to

1887
01:23:38,029 --> 01:23:40,439
if I had an ensemble of similar models

1888
01:23:40,639 --> 01:23:42,149
and so you can also kind of tie that

1889
01:23:42,349 --> 01:23:46,409
into some ideas from Bayesian

1890
01:23:46,609 --> 01:23:48,269
statistics where rather than having a

1891
01:23:48,469 --> 01:23:49,498
single model you have a posterior

1892
01:23:49,698 --> 01:23:51,449
distribution over parameters and adding

1893
01:23:51,649 --> 01:23:53,489
noise in a hand-wavy sense is a little

1894
01:23:53,689 --> 01:23:55,739
bit like looking at a Laplace

1895
01:23:55,939 --> 01:24:00,149
approximation and then probably the best

1896
01:24:00,349 --> 01:24:02,550
known of these is dropout and so in

1897
01:24:02,750 --> 01:24:05,100
this you sort of randomly set a fraction

1898
01:24:05,300 --> 01:24:08,208
of activities in a given layer to 0 and

1899
01:24:08,408 --> 01:24:11,519
at testing time you kind of need to

1900
01:24:11,719 --> 01:24:13,529
rescale things by the proper fraction

1901
01:24:13,729 --> 01:24:16,050
because at test time you're gonna have

1902
01:24:16,250 --> 01:24:20,449
everything active so otherwise the

1903
01:24:20,649 --> 01:24:23,640
typical magnitude of the activities in a

1904
01:24:23,840 --> 01:24:25,949
given layer are going to be higher it's

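A dropout sketch; here the rescaling is folded into training time ("inverted dropout"), which is an equivalent alternative to rescaling at test time as just described:

```python
import numpy as np

# Dropout sketch. The rescaling is done at train time ("inverted
# dropout"), which is equivalent to rescaling at test time and
# means test time needs no change at all.
def dropout(x, drop_frac=0.5, training=True):
    if not training:
        return x                            # test: everything active
    mask = np.random.rand(*x.shape) >= drop_frac
    return x * mask / (1.0 - drop_frac)     # keep magnitudes matched
```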
1905
01:24:26,149 --> 01:24:27,570
also worth noting that sort of drop out

1906
01:24:27,770 --> 01:24:29,159
it's one of those things that kind of

1907
01:24:29,359 --> 01:24:30,989
you know peaked in popularity I guess

1908
01:24:31,189 --> 01:24:34,529
around like 2012 or so it's not used as

1909
01:24:34,729 --> 01:24:38,159
much these days as it used to be I think

1910
01:24:38,359 --> 01:24:41,820
one of the reasons for that is the sort

1911
01:24:42,020 --> 01:24:45,239
of introduction of normalization so I'll

1912
01:24:45,439 --> 01:24:46,890
talk about that in a second but another

1913
01:24:47,090 --> 01:24:49,019
another factor that can be important in

1914
01:24:49,219 --> 01:24:51,149
terms of whether your models train well

1915
01:24:51,349 --> 01:24:54,979
or not is how you initialize them

1916
01:24:55,179 --> 01:24:57,989
and yeah this connects with what I was

1917
01:24:58,189 --> 01:24:59,369
saying about you know the tanh being

1918
01:24:59,569 --> 01:25:01,229
somewhat nice in that if you have small

1919
01:25:01,429 --> 01:25:02,760
weights then you can get to initialize

1920
01:25:02,960 --> 01:25:05,399
things in a more or less linear region

1921
01:25:05,599 --> 01:25:06,338
but at

1922
01:25:06,538 --> 01:25:08,529
the beginning of training you want to

1923
01:25:08,729 --> 01:25:10,810
make sure that you have good gradients

1924
01:25:11,010 --> 01:25:12,038
flowing all the way through your network

1925
01:25:12,238 --> 01:25:15,519
so you don't want them to be too big and you

1926
01:25:15,719 --> 01:25:18,069
don't want them to be too small there's

1927
01:25:18,269 --> 01:25:20,140
various heuristics for kind of arranging

1928
01:25:20,340 --> 01:25:22,958
for this to be the case I link to a

1929
01:25:23,158 --> 01:25:27,519
couple of papers here and for some

1930
01:25:27,719 --> 01:25:28,600
reason a lot of these are kind of named

1931
01:25:28,800 --> 01:25:32,708
after the first author of the paper that

1932
01:25:32,908 --> 01:25:34,029
proposed it so there's something called

1933
01:25:34,229 --> 01:25:38,529
Xavier initialization named after Xavier

1934
01:25:38,729 --> 01:25:42,418
Glorot who's at DeepMind I forget the

1935
01:25:42,618 --> 01:25:44,829
first name of He but there's a

1936
01:25:45,029 --> 01:25:46,149
follow-on paper the difference

1937
01:25:46,349 --> 01:25:49,109
between these two is that both are trying to

1938
01:25:49,309 --> 01:25:52,119
say okay how should I scale my weights

1939
01:25:52,319 --> 01:25:55,439
and biases at initialization so that the

1940
01:25:55,639 --> 01:25:59,649
input to my nonlinearities say have

1941
01:25:59,849 --> 01:26:03,310
some particular distribution so maybe 0

1942
01:26:03,510 --> 01:26:06,189
mean unit variance but the difference in

1943
01:26:06,389 --> 01:26:09,010
these two is that the assumptions that

1944
01:26:09,210 --> 01:26:10,628
you might want to make if you're using

1945
01:26:10,828 --> 01:26:13,239
say a sigmoid unit are different from

1946
01:26:13,439 --> 01:26:15,878
those if you're using say a rectified

1947
01:26:16,078 --> 01:26:18,519
linear unit so yeah there's a couple of

1948
01:26:18,719 --> 01:26:19,600
papers here that you might want to take

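The two scaling rules as they are commonly stated (this is the usual formulation, not necessarily the exact one in the linked papers): Glorot/Xavier for sigmoid- or tanh-style units, He for rectified linear units:

```python
import numpy as np

# Common statements of the two initialization scaling rules.
def xavier_init(n_in, n_out):
    std = np.sqrt(2.0 / (n_in + n_out))
    return np.random.randn(n_in, n_out) * std

def he_init(n_in, n_out):
    std = np.sqrt(2.0 / n_in)   # the ReLU zeroes half the variance
    return np.random.randn(n_in, n_out) * std
```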
1949
01:26:19,800 --> 01:26:22,449
a look at then there's this thing batch

1950
01:26:22,649 --> 01:26:26,378
norm which is used quite extensively now

1951
01:26:26,578 --> 01:26:29,550
particularly in feed-forward networks

1952
01:26:29,750 --> 01:26:33,369
it's still not used as much in recurrent

1953
01:26:33,569 --> 01:26:35,649
models just because there's some

1954
01:26:35,849 --> 01:26:36,939
subtleties about how you'd actually go

1955
01:26:37,139 --> 01:26:39,220
about doing that and it's used

1956
01:26:39,420 --> 01:26:42,159
I'd say hardly at all in deep RL but

1957
01:26:42,359 --> 01:26:43,869
there's probably modifications to this

1958
01:26:44,069 --> 01:26:45,489
kind of idea that you could do if

1959
01:26:45,689 --> 01:26:47,109
you wanted to apply those approaches

1960
01:26:47,309 --> 01:26:52,510
there and it kind of subsumes some of

1961
01:26:52,710 --> 01:26:55,418
the stuff in that you can think of it as

1962
01:26:55,618 --> 01:26:57,310
being similar to what we do in some of

1963
01:26:57,510 --> 01:27:00,418
these initialization methods but we also

1964
01:27:00,618 --> 01:27:04,180
continuously update to maintain these

1965
01:27:04,380 --> 01:27:09,279
properties so the idea is we'd like the

1966
01:27:09,479 --> 01:27:13,300
the summed inputs to our units

1967
01:27:13,500 --> 01:27:14,760
to have a zero mean and unit variance

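A batch-norm forward sketch, including the learnable scale and offset correction factors described in a moment; the variable names are assumptions:

```python
import numpy as np

# Batch-norm forward sketch: normalize each unit's summed input
# over the batch, then apply the learnable scale (gamma) and offset
# (beta) correction factors.
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta
```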
1968
01:27:14,960 --> 01:27:16,989
but for the reasons I described in terms

1969
01:27:17,189 --> 01:27:19,029
of initialization what batch norm does

1970
01:27:19,229 --> 01:27:19,748
is it kind of

1971
01:27:19,948 --> 01:27:22,239
enforces that but it also introduces

1972
01:27:22,439 --> 01:27:24,699
some additional trainable correction

1973
01:27:24,899 --> 01:27:27,220
factors so that if it turned out in fact

1974
01:27:27,420 --> 01:27:29,409
I would rather have something that had

1975
01:27:29,609 --> 01:27:32,378
variance 10 and a mean of one then

1976
01:27:32,578 --> 01:27:34,269
there's kind of scalings and offsets

1977
01:27:34,469 --> 01:27:36,548
that I can learn during training to help

1978
01:27:36,748 --> 01:27:39,609
that be the case but all else being

1979
01:27:39,809 --> 01:27:41,708
equal it kind of helps keep my

1980
01:27:41,908 --> 01:27:44,140
activities in you know a reasonable regime

1981
01:27:44,340 --> 01:27:46,298
with respect to nonlinearities and also with

1982
01:27:46,498 --> 01:27:47,918
respect to the kind of gradient scaling

1983
01:27:48,118 --> 01:27:51,189
that we get when we do backprop another

1984
01:27:51,389 --> 01:27:55,239
nice benefit of batch norm that I

1985
01:27:55,439 --> 01:27:57,009
think is actually mentioned less often but

1986
01:27:57,208 --> 01:27:59,288
is interesting and is perhaps part of

1987
01:27:59,488 --> 01:28:02,498
the reason why drop out isn't as favored

1988
01:28:02,698 --> 01:28:04,979
as much is that you get a sort of

1989
01:28:05,179 --> 01:28:08,109
drop out like noise effect from batch

1990
01:28:08,309 --> 01:28:11,949
normalization and that in order to

1991
01:28:12,149 --> 01:28:15,759
enforce or to encourage these kind of 0

1992
01:28:15,958 --> 01:28:17,560
mean unit variance properties you look

1993
01:28:17,760 --> 01:28:19,810
at your local data batch and so just

1994
01:28:20,010 --> 01:28:22,329
because of randomization amongst the

1995
01:28:22,529 --> 01:28:25,390
cases that you get in a given batch from

1996
01:28:25,590 --> 01:28:26,588
the point of view of any one of those

1997
01:28:26,788 --> 01:28:29,288
data cases the contribution to the batch

1998
01:28:29,488 --> 01:28:30,998
normalization from the rest of the batch

1999
01:28:31,198 --> 01:28:33,128
members looks a lot like noise and so

2000
01:28:33,328 --> 01:28:35,498
that kind of gives you some some sort of

2001
01:28:35,698 --> 01:28:38,259
regularization effect anyway there'll be

2002
01:28:38,458 --> 01:28:40,869
a lot more about this in Karen's lecture

2003
01:28:41,069 --> 01:28:46,509
on convnets another kind of area that's

2004
01:28:46,708 --> 01:28:49,538
important in practice is how to pick good

2005
01:28:49,738 --> 01:28:55,930
hyper parameters so how do

2006
01:28:56,130 --> 01:28:58,449
I know what a good learning rate is if

2007
01:28:58,649 --> 01:29:00,970
I'm using dropout how do I know what

2008
01:29:01,170 --> 01:29:02,979
fraction of units to drop out or how much

2009
01:29:03,179 --> 01:29:04,689
noise to add if I'm doing weight decay

2010
01:29:04,889 --> 01:29:09,759
and so on and we're still relatively

2011
01:29:09,958 --> 01:29:12,509
primitive in how we deal with this so

2012
01:29:12,708 --> 01:29:15,430
basically the idea is just to try many

2013
01:29:15,630 --> 01:29:17,439
combinations and kind of evaluate the

2014
01:29:17,639 --> 01:29:20,069
final results on some held out data set

2015
01:29:20,269 --> 01:29:23,319
and then pick the best but there are a

2016
01:29:23,519 --> 01:29:24,579
couple of practical

2017
01:29:24,779 --> 01:29:27,189
tricks and subtleties to it so if

2018
01:29:27,389 --> 01:29:28,449
there's lots and lots of hyper

2019
01:29:28,649 --> 01:29:30,279
parameters then the search space can be

2020
01:29:30,479 --> 01:29:32,708
huge so that's something that you might

2021
01:29:32,908 --> 01:29:33,699
worry about

2022
01:29:33,899 --> 01:29:35,829
for a long time people advocated grid

2023
01:29:36,029 --> 01:29:37,720
search so essentially for each hyper

2024
01:29:37,920 --> 01:29:41,409
parameter that you care about maybe

2025
01:29:41,609 --> 01:29:43,180
kind of come up with some grid of things

2026
01:29:43,380 --> 01:29:45,070
to try and kind of systematically try

2027
01:29:45,270 --> 01:29:47,820
the cross product of all possibilities

2028
01:29:48,020 --> 01:29:51,640
turns out that in a lot of cases that's

2029
01:29:51,840 --> 01:29:54,420
actually not the best thing to do and

2030
01:29:54,619 --> 01:29:56,619
there's a nice paper by Bergstra and

2031
01:29:56,819 --> 01:29:58,360
Bengio which I've linked here and I've

2032
01:29:58,560 --> 01:30:00,400
taken this figure from it and this kind

2033
01:30:00,600 --> 01:30:02,229
of tries to illustrate why that might be

2034
01:30:02,429 --> 01:30:06,729
so depending on what the sensitivity

2035
01:30:06,929 --> 01:30:09,038
of your model is to the hyper parameters

2036
01:30:09,238 --> 01:30:11,619
if you do grid search you could very easily

2037
01:30:11,819 --> 01:30:13,810
miss these good regions just if your

2038
01:30:14,010 --> 01:30:17,229
grid happens to be poorly aligned with

2039
01:30:17,429 --> 01:30:20,159
respect to the regions that are useful so

2040
01:30:20,359 --> 01:30:23,288
they advocate and kind of empirically

2041
01:30:23,488 --> 01:30:25,150
demonstrated that this often gets better

2042
01:30:25,350 --> 01:30:27,159
results just doing random search so

2043
01:30:27,359 --> 01:30:29,288
rather than defining a grid for each

2044
01:30:29,488 --> 01:30:31,510
dimension you might define some sampling

2045
01:30:31,710 --> 01:30:35,880
distribution and then you essentially

2046
01:30:36,079 --> 01:30:38,769
just sample from that joint

2047
01:30:38,969 --> 01:30:41,949
probability space run your models

2048
01:30:42,149 --> 01:30:43,208
and then a nice thing there is that you

2049
01:30:43,408 --> 01:30:47,920
can get broader coverage of any

2050
01:30:48,119 --> 01:30:51,729
individual parameter value and there's a

2051
01:30:51,929 --> 01:30:53,050
better chance that you'll find a good

2052
01:30:53,250 --> 01:30:55,060
region that you can then explore more

2053
01:30:55,260 --> 01:30:57,489
carefully so I would say if

2054
01:30:57,689 --> 01:30:58,449
you're faced with this kind of issue

2055
01:30:58,649 --> 01:31:01,300
then unless you have a good reason not

2056
01:31:01,500 --> 01:31:03,130
to don't do grid search do a

2057
01:31:03,329 --> 01:31:06,579
random search there's actually kind of a

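A random-search sketch; the particular hyperparameters and sampling ranges are illustrative assumptions, with log-uniform sampling for scale-like quantities such as the learning rate:

```python
import random

# Random-search sketch. evaluate() is a hypothetical stand-in that
# trains a model with the given config and returns a validation
# score (higher is better).
def random_search(evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_score, best_cfg = float("-inf"), None
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-5, -1),        # log-uniform
            "dropout": rng.uniform(0.0, 0.7),
            "weight_decay": 10 ** rng.uniform(-6, -2),
        }
        score = evaluate(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

Because every trial samples each dimension independently, no individual hyperparameter value is ever repeated the way it is on a grid, which is where the broader coverage comes from.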
2058
01:31:06,779 --> 01:31:10,570
lot of ongoing research in terms of ways

2059
01:31:10,770 --> 01:31:12,760
to get around some of these problems or

2060
01:31:12,960 --> 01:31:14,619
at least a kind of automate this search

2061
01:31:14,819 --> 01:31:18,550
process so there's some approaches from

2062
01:31:18,750 --> 01:31:21,850
kind of Bayesian modeling where the idea

2063
01:31:22,050 --> 01:31:27,038
there is if I could somehow form a

2064
01:31:27,238 --> 01:31:29,739
predictive model of

2065
01:31:29,939 --> 01:31:31,600
the performance of the models that I'm

2066
01:31:31,800 --> 01:31:33,729
training then I could be smarter about

2067
01:31:33,929 --> 01:31:35,470
figuring out which hyper parameter

2068
01:31:35,670 --> 01:31:40,560
values to try next there's also some

2069
01:31:40,760 --> 01:31:42,369
reinforcement learning approaches which

2070
01:31:42,569 --> 01:31:46,539
essentially there's some upfront cost

2071
01:31:46,739 --> 01:31:47,949
in terms of having to run training many

2072
01:31:48,149 --> 01:31:50,230
times but the hope is that I can

2073
01:31:50,430 --> 01:31:52,750
essentially learn how to dynamically

2074
01:31:52,949 --> 01:31:54,670
adjust these hyperparameters through

2075
01:31:54,869 --> 01:31:56,980
training so that if I then have another

2076
01:31:57,180 --> 01:31:58,630
instance of the same sort of learning

2077
01:31:58,829 --> 01:32:01,900
problem I can be much smarter about how

2078
01:32:02,100 --> 01:32:05,610
I treat that and then there's actually a

2079
01:32:05,810 --> 01:32:09,820
paper that I, along with some other folks

2080
01:32:10,020 --> 01:32:12,400
at DeepMind, published on arXiv at the

2081
01:32:12,600 --> 01:32:14,800
end of last year which is this idea of

2082
01:32:15,000 --> 01:32:16,989
borrowing some tricks from evolutionary

2083
01:32:17,189 --> 01:32:20,230
optimization and a population of

2084
01:32:20,430 --> 01:32:22,750
simultaneously training models and

2085
01:32:22,949 --> 01:32:25,510
essentially the idea there is instead of

2086
01:32:25,710 --> 01:32:27,940
doing a grid search or random search

2087
01:32:28,140 --> 01:32:30,279
let's say we initialize with random

2088
01:32:30,479 --> 01:32:34,449
search we're training everything all

2089
01:32:34,649 --> 01:32:37,029
together and periodically we look at the

2090
01:32:37,229 --> 01:32:40,420
training progress that each of the

2091
01:32:40,619 --> 01:32:43,150
jobs in our population has made and if

2092
01:32:43,350 --> 01:32:45,010
something seems to be doing particularly

2093
01:32:45,210 --> 01:32:47,230
poorly then we look for something that's

2094
01:32:47,430 --> 01:32:49,420
doing particularly well we copy its

2095
01:32:49,619 --> 01:32:51,520
parameters over and then do a small

2096
01:32:51,720 --> 01:32:53,170
adjustment to its hyperparameters and

2097
01:32:53,369 --> 01:32:55,900
then continue training and that lets us

2098
01:32:56,100 --> 01:33:00,090
do; it's kind of a nice combination of

2099
01:33:00,289 --> 01:33:02,529
hyperparameter search and a little bit

2100
01:33:02,729 --> 01:33:04,140
of online model selection in that we're

2101
01:33:04,340 --> 01:33:06,610
devoting more compute to the models that

2102
01:33:06,810 --> 01:33:08,489
seem to be doing better and also

2103
01:33:08,689 --> 01:33:10,630
exploring in regions of hyperparameter

2104
01:33:10,829 --> 01:33:12,750
space that seemed to be more promising

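The exploit/explore step just described can be sketched roughly as follows. This is a simplified illustration of the population-based training idea, not DeepMind's implementation; the member layout (`score`, `weights`, `hparams`) and the perturbation factor are assumptions made for the example.

```python
import random

def pbt_step(population, rng, perturb=1.2):
    """One exploit/explore step of population-based training:
    the worst performer copies the best performer's weights
    (exploit) and randomly perturbs its hyperparameters (explore)."""
    ranked = sorted(population, key=lambda m: m["score"])
    worst, best = ranked[0], ranked[-1]
    worst["weights"] = dict(best["weights"])  # exploit: copy parameters over
    worst["hparams"] = {k: v * rng.choice([1 / perturb, perturb])
                        for k, v in best["hparams"].items()}  # explore
    return population

# Two concurrently training members; in practice each would keep
# training between pbt_step calls and refresh its score.
pop = [
    {"score": 0.2, "weights": {"w": 0.1}, "hparams": {"lr": 1e-2}},
    {"score": 0.9, "weights": {"w": 0.7}, "hparams": {"lr": 1e-3}},
]
pbt_step(pop, random.Random(0))
```

Periodically running this over a live population is what lets the hyperparameters drift over the course of training instead of staying fixed.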
2105
01:33:12,949 --> 01:33:15,520
this has another particularly nice

2106
01:33:15,720 --> 01:33:18,520
benefit in reinforcement learning so one

2107
01:33:18,720 --> 01:33:21,340
of the kind of hallmarks of many RL

2108
01:33:21,539 --> 01:33:23,079
problems is that the data distribution

2109
01:33:23,279 --> 01:33:24,550
that we deal with is non-stationary

2110
01:33:24,750 --> 01:33:27,130
so you know if I'm a robot that's

2111
01:33:27,329 --> 01:33:29,110
learning to operate in the world, it may

2112
01:33:29,310 --> 01:33:30,760
be, you know, the data distribution in

2113
01:33:30,960 --> 01:33:32,050
this room might be completely different

2114
01:33:32,250 --> 01:33:34,360
to the data distribution when I go into

2115
01:33:34,560 --> 01:33:35,829
the hallway and so it could well be the

2116
01:33:36,029 --> 01:33:37,930
case that throughout learning

2117
01:33:38,130 --> 01:33:39,610
the hyperparameters that would

2118
01:33:39,810 --> 01:33:40,659
allow me to make the best learning

2119
01:33:40,859 --> 01:33:43,289
progress might be quite different and so

2120
01:33:43,489 --> 01:33:45,369
some of these methods like random search

2121
01:33:45,569 --> 01:33:48,369
just can't address that whereas the

2122
01:33:48,569 --> 01:33:49,900
population-based method that we propose

2123
01:33:50,100 --> 01:33:52,890
is actually kind of locally adaptive so

2124
01:33:53,090 --> 01:33:56,230
that's worth looking at. It works super

2125
01:33:56,430 --> 01:33:58,600
well and at DeepMind we're kind of

2126
01:33:58,800 --> 01:34:00,220
using this for

2127
01:34:00,420 --> 01:34:01,869
the vast majority of our experiments now

2128
01:34:02,069 --> 01:34:05,890
the downside is it's simple to implement

2129
01:34:06,090 --> 01:34:08,470
but it's a little resource-hungry in

2130
01:34:08,670 --> 01:34:11,500
terms of how much compute you're able to

2131
01:34:11,699 --> 01:34:15,039
access concurrently so if you're able to

2132
01:34:15,239 --> 01:34:18,779
run say 30 or 40 replicas of your

2133
01:34:18,979 --> 01:34:23,079
experiment in parallel then this is I

2134
01:34:23,279 --> 01:34:25,480
think, as I said, a clearly better way to do

2135
01:34:25,680 --> 01:34:27,400
hyperparameter search but yeah if you don't

2136
01:34:27,600 --> 01:34:29,440
have some of Google's resources then it can

2137
01:34:29,640 --> 01:34:31,270
be trickier to kind of do that so you

2138
01:34:31,470 --> 01:34:32,640
might want to do these more sequential

2139
01:34:32,840 --> 01:34:39,250
methods so yeah here is just some kind of

2140
01:34:39,449 --> 01:34:41,289
rules of thumb but there's a much longer

2141
01:34:41,489 --> 01:34:42,880
list of these and some of those are

2142
01:34:43,079 --> 01:34:44,289
things that you just kind of build up

2143
01:34:44,489 --> 01:34:46,659
experience over time but a couple of

2144
01:34:46,859 --> 01:34:49,510
kind of easy things to do if you're not

2145
01:34:49,710 --> 01:34:50,739
getting the performance that you've

2146
01:34:50,939 --> 01:34:53,860
hoped for: one is to check for dead

2147
01:34:54,060 --> 01:34:56,710
units so you could say take a large

2148
01:34:56,909 --> 01:34:59,440
mini-batch and look at the histogram for

2149
01:34:59,640 --> 01:35:01,300
a given layer look at the histogram of

2150
01:35:01,500 --> 01:35:03,070
activities of units in that layer and

2151
01:35:03,270 --> 01:35:05,079
what you're looking for is basically you

2152
01:35:05,279 --> 01:35:06,640
know some units that maybe never turn on

2153
01:35:06,840 --> 01:35:08,770
for whatever reason, maybe your

2154
01:35:08,970 --> 01:35:10,720
initialization was off or you went to a

2155
01:35:10,920 --> 01:35:12,730
weird learning regime but it might be the

2156
01:35:12,930 --> 01:35:14,890
case that say if you have ReLU units

2157
01:35:15,090 --> 01:35:16,270
many of them are just never in that

2158
01:35:16,470 --> 01:35:18,070
linear region and so you have the

2159
01:35:18,270 --> 01:35:19,180
capacity there but it's actually not

2160
01:35:19,380 --> 01:35:21,279
useful for you and so it's just getting

2161
01:35:21,479 --> 01:35:22,659
in the way

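The dead-unit check described above can be sketched in a few lines. This is an illustrative helper, with an assumed name (`dead_relu_fraction`) and synthetic pre-activations standing in for a real layer's forward pass.

```python
import numpy as np

def dead_relu_fraction(pre_activations):
    """pre_activations: (batch, units) array of inputs to a ReLU layer.
    A unit is 'dead' if it never turns on for any example in the batch."""
    active = (pre_activations > 0).any(axis=0)  # did each unit ever fire?
    return 1.0 - active.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 100))   # stand-in for a large mini-batch
x[:, :10] -= 100.0                # force the first 10 units to stay off
frac = dead_relu_fraction(x)      # ~0.10
```

In practice you would also plot the full histogram (e.g. `np.histogram(x.ravel())`) per layer, since units that fire only very rarely are nearly as wasteful as fully dead ones.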
2162
01:35:22,859 --> 01:35:26,110
a similar diagnostic is it can be useful

2163
01:35:26,310 --> 01:35:29,800
to look at histograms of your gradients

2164
01:35:30,000 --> 01:35:31,989
say again visualized over a large mini

2165
01:35:32,189 --> 01:35:33,640
batch and again you're kind of looking

2166
01:35:33,840 --> 01:35:35,529
out for you know gradients that are

2167
01:35:35,729 --> 01:35:36,909
always zero in which case you're gonna

2168
01:35:37,109 --> 01:35:40,270
not be making any progress or very

2169
01:35:40,470 --> 01:35:41,949
heavy-tailed gradient distributions in

2170
01:35:42,149 --> 01:35:44,500
which case maybe there's some data cases

2171
01:35:44,699 --> 01:35:47,619
that are dominating or there's some kind

2172
01:35:47,819 --> 01:35:48,970
of numerical issues with your gradients

2173
01:35:49,170 --> 01:35:50,500
blowing up

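The gradient diagnostic can be summarized numerically as well as visually. A small sketch, with assumed names; the specific statistics (zero fraction, max-over-median ratio) are one reasonable choice for flagging the two failure modes mentioned: no signal and heavy tails.

```python
import numpy as np

def gradient_diagnostics(grads, eps=1e-12):
    """Quick summary of a gradient tensor: a high zero fraction means
    no learning signal; a huge max/median ratio suggests heavy tails,
    a few dominating examples, or numerical blow-ups."""
    g = np.abs(np.asarray(grads, dtype=float).ravel())
    return {
        "zero_frac": float((g < eps).mean()),
        "mean": float(g.mean()),
        "max_over_median": float(g.max() / (np.median(g) + eps)),
    }

healthy = gradient_diagnostics([0.1, -0.2, 0.05, -0.1])
stuck = gradient_diagnostics([0.0, 0.0, 0.0, 0.0])
spiky = gradient_diagnostics([1e-4, -1e-4, 1e-4, 50.0])
```

Computing these per layer over a large mini-batch makes it easy to spot which part of the network is stuck or exploding.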
2174
01:35:50,699 --> 01:35:52,180
something else that's a really useful thing

2175
01:35:52,380 --> 01:35:56,680
to try is take a kind of a very small

2176
01:35:56,880 --> 01:35:59,770
subset of data or if it's an RL setting

2177
01:35:59,970 --> 01:36:01,539
if there's a kind of a simplified

2178
01:36:01,739 --> 01:36:04,659
version of your task, just try your

2179
01:36:04,859 --> 01:36:05,890
model on that simpler version of the

2180
01:36:06,090 --> 01:36:08,650
task and for a smaller subset you should

2181
01:36:08,850 --> 01:36:11,529
be able to get zero training error or

2182
01:36:11,729 --> 01:36:12,789
you know close to it depending on, you

2183
01:36:12,989 --> 01:36:13,119
know

2184
01:36:13,319 --> 01:36:15,039
noisy labeling that kind of stuff but

2185
01:36:15,239 --> 01:36:17,500
the idea is if you're not seeing the

2186
01:36:17,699 --> 01:36:18,640
performance on the real world problem

2187
01:36:18,840 --> 01:36:20,590
you care about just as a kind of sanity

2188
01:36:20,789 --> 01:36:22,210
check scale back the size of your data

2189
01:36:22,409 --> 01:36:24,789
set and make sure that you can overfit

2190
01:36:24,989 --> 01:36:31,510
on a small amount of data

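The overfitting sanity check can be sketched as follows. This is an assumed setup, not anything specific from the lecture: a tiny logistic-regression "model" stands in for your network, and 16 linearly separable points stand in for the scaled-down dataset, so training error should go to (near) zero if the optimization is working.

```python
import numpy as np

def can_overfit(X, y, steps=5000, lr=0.5):
    """Fit logistic regression by gradient descent; return training error."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err_vec = p - y                      # gradient of the log loss
        w -= lr * (X.T @ err_vec) / len(y)
        b -= lr * err_vec.mean()
    preds = (X @ w + b) > 0
    return float((preds != y.astype(bool)).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))                 # tiny subset of 16 examples
true_w = np.array([1.0, -2.0, 0.5, 1.5])
y = (X @ true_w > 0).astype(float)           # separable, so overfitting is possible
train_err = can_overfit(X, y)                # expect near-zero training error
```

If even this fails, the problem is in the optimization or the model wiring, not in the difficulty of the real task.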
2191
01:36:31,710 --> 01:36:33,180
and because we have just about ten minutes left

2192
01:36:33,380 --> 01:36:35,860
I'll go through this fairly quickly it's

2193
01:36:36,060 --> 01:36:38,560
a kind of research topic again from

2194
01:36:38,760 --> 01:36:39,850
DeepMind that relates to some of the stuff

2195
01:36:40,050 --> 01:36:41,619
we've talked about but I'll leave

2196
01:36:41,819 --> 01:36:43,300
five minutes at the end for questions as

2197
01:36:43,500 --> 01:36:48,220
well so this is some work that's

2198
01:36:48,420 --> 01:36:50,260
from I guess a year and a half ago now

2199
01:36:50,460 --> 01:36:53,220
although that kind of stuff is ongoing and

2200
01:36:53,420 --> 01:36:55,920
it was this idea that we called

2201
01:36:56,119 --> 01:36:58,420
decoupled neural interfaces using

2202
01:36:58,619 --> 01:37:03,600
synthetic gradients and basically the idea is

2203
01:37:03,800 --> 01:37:09,250
rather than running say our forward

2204
01:37:09,449 --> 01:37:10,779
propagation all the way to the end and

2205
01:37:10,979 --> 01:37:12,520
then backpropagation all the way from the

2206
01:37:12,720 --> 01:37:16,960
end can we say midway through this chain

2207
01:37:17,159 --> 01:37:19,960
predict what the backpropagated

2208
01:37:20,159 --> 01:37:21,670
gradients are gonna be before we

2209
01:37:21,869 --> 01:37:25,539
actually get them and it turns out that

2210
01:37:25,739 --> 01:37:29,230
you can do that you might ask why

2211
01:37:29,430 --> 01:37:32,980
would I want to, so there are two places I

2212
01:37:33,180 --> 01:37:35,320
think where it's useful one is if we

2213
01:37:35,520 --> 01:37:37,329
have, this is more of a, I guess,

2214
01:37:37,529 --> 01:37:38,350
infrastructure thing that we have

2215
01:37:38,550 --> 01:37:41,170
massive massive graphs and we

2216
01:37:41,369 --> 01:37:42,550
need to do lots of computation

2217
01:37:42,750 --> 01:37:46,840
before we can do an update then if this

2218
01:37:47,039 --> 01:37:48,600
were model parallel say then essentially

2219
01:37:48,800 --> 01:37:52,690
the machines holding these

2220
01:37:52,890 --> 01:37:54,880
nodes would be waiting for the backprop

2221
01:37:55,079 --> 01:37:56,079
to happen before they could do an

2222
01:37:56,279 --> 01:37:57,940
update after the forward pass so one way

2223
01:37:58,140 --> 01:37:59,860
is to kind of allow for potentially

2224
01:38:00,060 --> 01:38:02,110
better pipelining the other benefit and

2225
01:38:02,310 --> 01:38:04,380
that's partly why I kind of have this

2226
01:38:04,579 --> 01:38:06,190
graph here that's more of a sequence

2227
01:38:06,390 --> 01:38:09,699
model is there are some settings where

2228
01:38:09,899 --> 01:38:12,100
we actually don't want to have to wait

2229
01:38:12,300 --> 01:38:13,570
for the future to arrive before we

2230
01:38:13,770 --> 01:38:15,100
update our parameters so if I have a

2231
01:38:15,300 --> 01:38:16,630
sequence model over an extremely long

2232
01:38:16,829 --> 01:38:20,170
sequence or in the case of an and RL

2233
01:38:20,369 --> 01:38:22,239
agent you know it's kind of indefinite

2234
01:38:22,439 --> 01:38:25,060
so I don't want to wait for an

2235
01:38:25,260 --> 01:38:26,440
extremely long time before I can run my

2236
01:38:26,640 --> 01:38:26,860
backprop

2237
01:38:27,060 --> 01:38:29,199
through time to get gradients and it

2238
01:38:29,399 --> 01:38:32,529
might not be feasible right

2239
01:38:32,729 --> 01:38:34,630
now so what people typically do is

2240
01:38:34,829 --> 01:38:36,670
they'll take a long sequence and they'll

2241
01:38:36,869 --> 01:38:38,529
chop it into chunks and they'll run

2242
01:38:38,729 --> 01:38:40,390
something called truncated back prop

2243
01:38:40,590 --> 01:38:43,420
through time and if you sit down and

2244
01:38:43,619 --> 01:38:44,770
think about what that's doing then it's

2245
01:38:44,970 --> 01:38:47,170
essentially assuming that outside

2246
01:38:47,369 --> 01:38:48,789
of the kind of truncation window the

2247
01:38:48,989 --> 01:38:50,350
gradients from the future are zero

2248
01:38:50,550 --> 01:38:51,760
because we're just ignoring them

2249
01:38:51,960 --> 01:38:53,710
and so if you look at it like that

2250
01:38:53,909 --> 01:38:55,720
the argument behind synthetic gradients is

2251
01:38:55,920 --> 01:38:57,310
kind of obvious you're basically

2252
01:38:57,510 --> 01:39:00,190
saying if my default was to do

2253
01:39:00,390 --> 01:39:02,829
truncated backprop through time which

2254
01:39:03,029 --> 01:39:04,060
implicitly makes the assumption that

2255
01:39:04,260 --> 01:39:05,829
gradients from outside the truncation

2256
01:39:06,029 --> 01:39:08,170
window are zero could I possibly do

2257
01:39:08,369 --> 01:39:09,400
better by predicting something other

2258
01:39:09,600 --> 01:39:11,890
than zero and the answer is probably yes

2259
01:39:12,090 --> 01:39:13,750
in most cases and so that's a kind of

2260
01:39:13,949 --> 01:39:16,590
good motivation for why it's interesting

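The "truncation assumes zero, synthetic gradients predict something better" argument can be made concrete on a toy scalar RNN. This is an illustrative sketch, not the DNI method itself: the function names are assumptions, the "synthetic" gradient here is an oracle (the true boundary gradient) rather than a learned prediction, and the loss is simply the final hidden state.

```python
def run_forward(xs, w, h0=0.0):
    """Toy linear RNN: h_t = w * h_{t-1} + x_t; loss is just L = h_T."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs

def chunk_grad(hs, w, start, end, incoming):
    """dL/dw contribution from steps (start, end], given dL/dh_end = incoming."""
    g_h, g_w = incoming, 0.0
    for t in range(end, start, -1):
        g_w += g_h * hs[t - 1]   # dL/dw picks up h_{t-1} at each step
        g_h *= w                 # carry dL/dh one step further back
    return g_w

xs = [0.5, -1.0, 2.0, 1.0, 0.25, -0.5]
w, cut = 0.9, 3
hs = run_forward(xs, w)
T = len(xs)

full = chunk_grad(hs, w, 0, T, 1.0)  # full BPTT, dL/dh_T = 1
# Truncated BPTT: the early chunk implicitly receives a zero boundary gradient.
truncated = chunk_grad(hs, w, cut, T, 1.0) + chunk_grad(hs, w, 0, cut, 0.0)
# An oracle "synthetic gradient" supplies the true dL/dh_cut = w**(T-cut):
synthetic = chunk_grad(hs, w, cut, T, 1.0) + chunk_grad(hs, w, 0, cut, w ** (T - cut))
```

With the true boundary gradient the chunked computation recovers the full gradient exactly; truncation does not, and the synthetic-gradient bet is that a learned prediction of that boundary term beats assuming zero.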
2261
01:39:16,789 --> 01:39:18,489
there's a couple of papers that we

2262
01:39:18,689 --> 01:39:19,960
published on this now already and

2263
01:39:20,159 --> 01:39:22,600
there's a nice kind of interactive blog

2264
01:39:22,800 --> 01:39:23,980
post that you can look at here

2265
01:39:24,180 --> 01:39:27,340
if you want to hear some more so

2266
01:39:27,539 --> 01:39:29,020
you know that's it for today the next

2267
01:39:29,220 --> 01:39:30,489
lecture is going to be ConvNets with

2268
01:39:30,689 --> 01:39:33,039
Koray but yeah there's time for some

2269
01:39:33,239 --> 01:39:34,000
questions now and if there's more

2270
01:39:34,199 --> 01:39:35,619
questions afterwards I'm happy to kind

2271
01:39:35,819 --> 01:39:37,869
of hang around outside for a bit more than

2272
01:39:38,069 --> 01:39:49,570
we have time for yeah that's another

2273
01:39:49,770 --> 01:39:52,900
great question and that's a kind of

2274
01:39:53,100 --> 01:39:58,659
another ongoing area of research so the

2275
01:39:58,859 --> 01:40:01,989
sort of the default at the moment is more

2276
01:40:02,189 --> 01:40:04,600
like you know kind of human-driven

2277
01:40:04,800 --> 01:40:05,860
greedy optimization in that you

2278
01:40:06,060 --> 01:40:07,150
know I have some idea in my head of what

2279
01:40:07,350 --> 01:40:08,440
the kind of fitness of different

2280
01:40:08,640 --> 01:40:09,760
architectures would be and I kind of

2281
01:40:09,960 --> 01:40:12,640
prioritize trying those there's some

2282
01:40:12,840 --> 01:40:15,909
interesting work going on again using

2283
01:40:16,109 --> 01:40:18,420
some of these gradient-free methods to

2284
01:40:18,619 --> 01:40:22,180
search over architectures so at a high

2285
01:40:22,380 --> 01:40:24,100
level the idea is if I can start to build

2286
01:40:24,300 --> 01:40:26,529
a predictive model of how different

2287
01:40:26,729 --> 01:40:28,659
architectures might perform then I can

2288
01:40:28,859 --> 01:40:30,730
use that to automate the priority list

2289
01:40:30,930 --> 01:40:33,460
of what I should try next on the

2290
01:40:33,659 --> 01:40:37,000
population training side of things some

2291
01:40:37,199 --> 01:40:38,619
of the stuff that we're

2292
01:40:38,819 --> 01:40:41,550
working on actually at the moment is

2293
01:40:41,750 --> 01:40:46,659
there are ways of adapting network

2294
01:40:46,859 --> 01:40:49,418
architectures online without having to

2295
01:40:49,618 --> 01:40:51,958
restart training so one example of that

2296
01:40:52,158 --> 01:40:54,220
there's a couple of papers on a technique

2297
01:40:54,420 --> 01:40:56,019
called Net2Net or network

2298
01:40:56,219 --> 01:40:57,430
morphism and various other

2299
01:40:57,630 --> 01:41:01,600
transformations so imagine for a

2300
01:41:01,800 --> 01:41:05,409
minute that I have some architecture and

2301
01:41:05,609 --> 01:41:07,300
I'm thinking would that architecture be

2302
01:41:07,500 --> 01:41:08,890
better if I were to inject an

2303
01:41:09,090 --> 01:41:10,989
additional hidden layer somewhere

2304
01:41:11,189 --> 01:41:13,810
I could just start training from scratch

2305
01:41:14,010 --> 01:41:15,760
but something else that I can do is take

2306
01:41:15,960 --> 01:41:17,110
something that's been trained

2307
01:41:17,310 --> 01:41:18,970
originally and figure out a way to

2308
01:41:19,170 --> 01:41:21,100
inject an additional hidden layer in

2309
01:41:21,300 --> 01:41:22,659
there that doesn't change the function

2310
01:41:22,859 --> 01:41:23,739
that's been learned so far

2311
01:41:23,939 --> 01:41:25,659
but then after I've added that

2312
01:41:25,859 --> 01:41:27,128
hidden layer I can then continue

2313
01:41:27,328 --> 01:41:29,350
training and potentially allow the

2314
01:41:29,550 --> 01:41:32,680
model to make use of that additional

2315
01:41:32,880 --> 01:41:37,269
capacity and one cartoon of how to

2316
01:41:37,469 --> 01:41:45,230
see I could do that is I could say

2317
01:41:45,710 --> 01:41:48,010
arrange to have an additional hidden

2318
01:41:48,210 --> 01:41:51,329
layer with say tanh units and

2319
01:41:51,529 --> 01:41:53,829
initialize them so that they're kind of

2320
01:41:54,029 --> 01:41:55,390
in their linear region so it's more

2321
01:41:55,590 --> 01:41:58,628
or less a linear pass through so I could

2322
01:41:58,828 --> 01:42:00,729
take my previous model add in an

2323
01:42:00,929 --> 01:42:02,439
additional layer with the existing

2324
01:42:02,639 --> 01:42:04,869
weight matrix initialize the outgoing

2325
01:42:05,069 --> 01:42:09,189
weight matrix of that tanh layer to be some

2326
01:42:09,389 --> 01:42:10,989
kind of large values and that

2327
01:42:11,189 --> 01:42:15,159
will locally give me something that has

2328
01:42:15,359 --> 01:42:18,128
a very similar functional mapping as the

2329
01:42:18,328 --> 01:42:19,869
the network I started out with but now I

2330
01:42:20,069 --> 01:42:22,239
have the potential to learn additional

2331
01:42:22,439 --> 01:42:23,829
connections going from those tanh

2332
01:42:24,029 --> 01:42:25,449
units so there are potentially ways of

2333
01:42:25,649 --> 01:42:27,489
doing this kind of architecture search

2334
01:42:27,689 --> 01:42:30,729
online and then there's model-based

2335
01:42:30,929 --> 01:42:32,829
approaches and then evolutionary methods

2336
01:42:33,029 --> 01:42:34,538
I'd say they're kind of the three main

2337
01:42:34,738 --> 01:42:37,789
ways of doing that

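The tanh pass-through cartoon above can be sketched numerically. This is one way to realize the idea, with assumed names and a slightly simplified scheme: a small incoming matrix keeps the new tanh units in their linear region, and a large compensating outgoing matrix approximately preserves the original function.

```python
import numpy as np

def insert_tanh_identity(W, eps=1e-3):
    """Split a linear map y = W x into y ~= W_out @ tanh(W_in @ x):
    W_in = eps * I keeps the tanh units in their linear region
    (tanh(z) ~= z for small z) and W_out = W / eps undoes the scaling,
    so the network's function is (almost) unchanged at insertion time."""
    d_in = W.shape[1]
    W_in = eps * np.eye(d_in)   # new hidden layer: near-linear tanh units
    W_out = W / eps             # compensating outgoing weights
    return W_in, W_out

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))               # existing layer's weight matrix
x = rng.normal(size=4)
W_in, W_out = insert_tanh_identity(W)
y_before = W @ x
y_after = W_out @ np.tanh(W_in @ x)       # approximately equal to y_before
```

After the insertion, training can resume and gradually move the new units out of their linear region, letting the model exploit the added capacity without the cost of restarting from scratch.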
2338
01:42:46,270 --> 01:42:50,239
learners are you looking at kind of held-out

2339
01:42:50,439 --> 01:42:51,829
set performance are you looking at

2340
01:42:52,029 --> 01:42:53,720
convergence rates yeah it's a good

2341
01:42:53,920 --> 01:42:54,380
question

2342
01:42:54,579 --> 01:42:56,810
so I've mostly been thinking of this in

2343
01:42:57,010 --> 01:42:58,610
the context of reinforcement learning

2344
01:42:58,810 --> 01:43:02,840
and so there, sort of, your test set is

2345
01:43:03,039 --> 01:43:05,570
your training set in a sense yeah so

2346
01:43:05,770 --> 01:43:07,039
for kind of supervised problems then

2347
01:43:07,239 --> 01:43:09,560
yeah looking at it on a held out set

2348
01:43:09,760 --> 01:43:10,760
another thing that's worth mentioning

2349
01:43:10,960 --> 01:43:12,110
and again this is something that were

2350
01:43:12,310 --> 01:43:13,820
kind of actively working on at the

2351
01:43:14,020 --> 01:43:17,900
moment is you might not want to make

2352
01:43:18,100 --> 01:43:20,930
greedy decisions about that so a good

2353
01:43:21,130 --> 01:43:22,670
example is you know in supervised

2354
01:43:22,869 --> 01:43:25,880
learning, so often it's

2355
01:43:26,079 --> 01:43:27,199
good to have a fairly high learning rate

2356
01:43:27,399 --> 01:43:28,159
initially and then to kind of drop it

2357
01:43:28,359 --> 01:43:31,369
down but one of the things we noticed in

2358
01:43:31,569 --> 01:43:32,270
applying this to some of the supervised

2359
01:43:32,470 --> 01:43:36,980
problems is that you can if you kind of

2360
01:43:37,180 --> 01:43:39,289
look greedily you can appear to be doing

2361
01:43:39,489 --> 01:43:41,360
better by dropping the learning rate

2362
01:43:41,560 --> 01:43:42,920
earlier than you would in an optimal

2363
01:43:43,119 --> 01:43:44,210
setting because it kind of gives you that

2364
01:43:44,409 --> 01:43:47,840
local boost and so something that again

2365
01:43:48,039 --> 01:43:50,029
this appears to be less of a problem

2366
01:43:50,229 --> 01:43:52,760
in the RL settings we've looked at but

2367
01:43:52,960 --> 01:43:55,039
what I'd say you probably want to do

2368
01:43:55,239 --> 01:43:57,710
as we extend these methods is think

2369
01:43:57,909 --> 01:43:59,329
about kind of performance metrics that

2370
01:43:59,529 --> 01:44:00,920
aren't just how well am I doing now but

2371
01:44:01,119 --> 01:44:02,060
kind of combining in some of that

2372
01:44:02,260 --> 01:44:04,670
model-based forward-looking things so not

2373
01:44:04,869 --> 01:44:06,560
how well am I doing now but given

2374
01:44:06,760 --> 01:44:07,640
everything I've seen about learning

2375
01:44:07,840 --> 01:44:10,820
progress so far how well could this run

2376
01:44:11,020 --> 01:44:13,220
or its descendants end up doing and kind

2377
01:44:13,420 --> 01:44:16,789
of use a less greedy performance

2378
01:44:16,989 --> 01:44:22,880
metric anyway if there are no more

2379
01:44:23,079 --> 01:44:24,680
questions then thank you and yeah feel

2380
01:44:24,880 --> 01:44:27,670
free to ask afterwards

2381
01:44:27,869 --> 01:44:32,869
[Applause]


