1
00:00:07,770 --> 00:00:10,620
Evolutionary fuzzing is the state of the art.

2
00:00:10,650 --> 00:00:18,390
It's an application of genetic algorithms to make smarter and better and faster fathers to understand

3
00:00:18,390 --> 00:00:19,890
how it works.

4
00:00:19,890 --> 00:00:26,730
Let's talk about genetic algorithms and how they model evolution in their genetic algorithm.

5
00:00:26,730 --> 00:00:30,970
We have something called a population which consists of individuals.

6
00:00:31,020 --> 00:00:34,780
In our case individuals are fuzzing inputs.

7
00:00:34,830 --> 00:00:41,450
These individuals have a fitness function which is a measure of how good they are for fuzzing.

8
00:00:41,460 --> 00:00:46,740
It might be some metric we're using to assess how good an input is.

9
00:00:46,740 --> 00:00:50,610
So we might say that an input is good when it causes a crash.

10
00:00:50,670 --> 00:00:57,140
Or we can say that's good when it covers the most code or it's good when it covers the most new code

11
00:00:57,290 --> 00:01:06,040
or some other variation the less fit individuals get called out the die whereas the more fat ones remain

12
00:01:06,740 --> 00:01:12,660
and the language just means that inputs that we're not interested in have low fitness.

13
00:01:12,770 --> 00:01:19,480
We're not going to be reusing these individuals that do remain that survive because of high fitness

14
00:01:19,940 --> 00:01:21,750
then reproduce.

15
00:01:21,770 --> 00:01:31,440
So you pick pairs of individuals and those will reproduce to create a hybrid or genetically speaking.

16
00:01:31,640 --> 00:01:36,920
Genetically speaking they'll will do B combination of their genes for fuzzing.

17
00:01:37,130 --> 00:01:47,450
We might take our two inputs that scored high on our metric and then combine them somehow.

18
00:01:47,450 --> 00:01:54,670
For instance we might take the initial segment of one and replace of the initial segment of the other.

19
00:01:54,890 --> 00:02:01,340
And this will create a new input so an offspring in the language of biology in addition so that you're

20
00:02:01,340 --> 00:02:10,270
not just shuffling the same data over and over Balaji introduces mutations and mutations you have all

21
00:02:10,270 --> 00:02:10,860
your genes.

22
00:02:10,870 --> 00:02:18,250
But there were some small probability something random happens and it introduces new genes.

23
00:02:19,040 --> 00:02:28,690
And new structures and this allows your gene pool to expand to other possible individuals that were

24
00:02:28,690 --> 00:02:33,750
not already present in the gene pool for phasing.

25
00:02:33,760 --> 00:02:43,090
We might take our input and then perform some mutation and I'll create a new input which we will then

26
00:02:43,120 --> 00:02:43,780
retain.

27
00:02:44,440 --> 00:02:49,030
And this basically describes the main ideas of genetic algorithms.

28
00:02:49,030 --> 00:02:55,810
There should be some fitness function there should be some method for recombination.

29
00:02:55,900 --> 00:03:02,050
There should be a method for mutation and there should be a phase where the population gets called out.

30
00:03:03,480 --> 00:03:08,550
So note that this is the method used for AFL State of the art father.

31
00:03:08,880 --> 00:03:11,310
So now let's see how we can use it ourselves.

32
00:03:11,340 --> 00:03:18,990
How this works so I'll be using the same function we're working with earlier CGI decoder and here's

33
00:03:18,990 --> 00:03:21,510
all the tracing functionality.

34
00:03:21,510 --> 00:03:29,610
Now go to the final class of an individual some member of that population and an individual is going

35
00:03:29,610 --> 00:03:30,860
to have some value.

36
00:03:32,290 --> 00:03:37,260
So when I'm fuzzing the CGI decoder it's going to be some string.

37
00:03:37,420 --> 00:03:43,670
That's the value and I'm also going to record its coverage.

38
00:03:43,700 --> 00:03:50,410
I'm going to use coverage for fitness and it's going to have a numerical score for its fitness.

39
00:03:50,410 --> 00:03:55,770
So let's say that I'm initializing it I give it a value

40
00:03:58,570 --> 00:04:06,490
which will be substring and then I try to compute the coverage using our functions well before the reason

41
00:04:06,490 --> 00:04:10,360
to try and not a guaranteed success.

42
00:04:12,200 --> 00:04:20,330
Is that it could be actually an input that causes the CGI decoder to crash which is what we're looking

43
00:04:20,330 --> 00:04:20,630
for.

44
00:04:23,760 --> 00:04:31,560
But in case that happens then I will start recording the coverage and lines

45
00:04:34,110 --> 00:04:44,140
I'll record what the type of bargains printed out for our reference and I'll be recording all the coverage

46
00:04:44,410 --> 00:04:45,400
in a dictionary.

47
00:04:47,170 --> 00:04:56,010
And the reason for that is that I want to make sure that I'm keeping count of the coverage of how many

48
00:04:56,010 --> 00:05:01,790
times each type of coverage takes place and I'll be using that in my fitness function.

49
00:05:01,800 --> 00:05:12,390
So if for instance lines 1 2 3 get covered many times I want to make sure that these inputs don't get

50
00:05:12,390 --> 00:05:17,350
repeated too much so I'll be using that in my fitness function.

51
00:05:18,540 --> 00:05:26,760
Initially fitness will be just 9 so discuss this stuff in more detail but for now just bear with me

52
00:05:27,150 --> 00:05:32,940
another function I want just to be able to print to be able to see the values I mentioned the value

53
00:05:33,360 --> 00:05:35,730
coverage and fitness.

54
00:05:35,730 --> 00:05:43,170
And then I want to be able to compute the fitness and the formula I've decided on it makes the most

55
00:05:43,170 --> 00:05:50,250
sense for the application here is for it to be equal to the reciprocal of the number of times that the

56
00:05:50,250 --> 00:05:52,540
coverage is covered.

57
00:05:54,100 --> 00:06:02,650
So again if lines 1 2 3 get covered many times together in the same pattern then I want to make sure

58
00:06:02,650 --> 00:06:10,390
that this side type of input sees a low fitness so that it is less likely to go on to the next generation

59
00:06:10,390 --> 00:06:19,090
because I don't want it to be repeated on the other hand coverages that rarely take place will have

60
00:06:19,090 --> 00:06:21,120
high fitness which is what I want.

61
00:06:21,130 --> 00:06:24,370
So they get explored further.

62
00:06:24,370 --> 00:06:28,620
So this is my individual class which will make up my population.

63
00:06:28,810 --> 00:06:32,040
Now I'll be wanting to keep track of the important samples.

64
00:06:32,080 --> 00:06:38,920
So those are samples that produce new coverages and then be covered dict.

65
00:06:39,610 --> 00:06:46,300
We'll be keeping track of how many times each coverage was covered so that I can use it in my fitness

66
00:06:46,300 --> 00:06:51,880
calculation more to define a function called add to coverage dict.

67
00:06:52,150 --> 00:06:53,550
And it does what it sounds like.

68
00:06:53,560 --> 00:06:58,000
It takes the value and it simply increment the counter.

69
00:06:58,000 --> 00:07:01,950
And in fact we are referring to it even earlier.

70
00:07:02,080 --> 00:07:09,310
So where we instantiate an individual or calculate it coverage and then we'll add it to the dictionary.

71
00:07:10,210 --> 00:07:16,620
Now I'm going to define something called the carrying capacity which will determine the size of the

72
00:07:16,620 --> 00:07:17,670
population.

73
00:07:17,670 --> 00:07:25,630
Basically I will allow the population to grow but then I will kill off.

74
00:07:26,190 --> 00:07:33,090
Call the lower fitness individuals until I'm back down to 100.

75
00:07:33,290 --> 00:07:36,470
So the population always goes back to 100.

76
00:07:36,500 --> 00:07:45,670
Once it goes past 100 they'll go back to 100 and this allows me to focus the year on the most promising

77
00:07:45,760 --> 00:07:53,080
inputs as opposed to a huge number of them where many are not promising I'm going to define an initial

78
00:07:53,080 --> 00:08:03,010
seed which will be the initial members of my population and this will be a couple your cells that seem

79
00:08:03,010 --> 00:08:03,990
to be working.

80
00:08:04,530 --> 00:08:11,940
So then my population will consist of the individuals instantiated on these you or else another function

81
00:08:11,950 --> 00:08:12,610
I'll be wanting.

82
00:08:12,610 --> 00:08:18,070
Is the update fitness scores function which will allow you to.

83
00:08:18,210 --> 00:08:25,640
Which will allow me to update all the fitness scores we'll see that in more detail soon enough.

84
00:08:25,690 --> 00:08:30,310
I also aren't a convenience function that will just show me the information about my population.

85
00:08:30,310 --> 00:08:36,410
This is for debugging purposes and for explanatory purposes.

86
00:08:36,880 --> 00:08:45,400
And I'll also find a get fitness course function which will give me an array of all the fitness scores.

87
00:08:46,150 --> 00:08:53,680
And the reason is that I'll be wanting to sample from it because the fitness is the fitness scores will

88
00:08:53,680 --> 00:08:55,510
be used in a pub ballistic manner.

89
00:08:55,510 --> 00:09:01,840
I won't just be taking the highest fitness individuals but rather the highest fitness individuals will

90
00:09:01,840 --> 00:09:06,280
be more likely to carry on and reproduce.

91
00:09:06,460 --> 00:09:14,950
But there is no guarantee just like in life I'll be defining a function called sample pair for reproduction

92
00:09:15,340 --> 00:09:25,610
and it does what it sounds like it will sample a pair from my from my population based on the fitness.

93
00:09:25,740 --> 00:09:36,430
So first that's going to get the fitness scores then it will choose one individual based on the probability

94
00:09:36,430 --> 00:09:41,630
distribution from a fitness course and then another individual.

95
00:09:43,670 --> 00:09:49,760
And what we're going to do is pair them up and perform recombination.

96
00:09:51,170 --> 00:10:02,860
So once have individual on an individual to I'm going to pick a random point in the string and then

97
00:10:02,860 --> 00:10:10,720
I'll simply swap the first part of one with the first part of the other and then I'll get to New offsprings

98
00:10:13,580 --> 00:10:19,440
we'll see this as an example soon enough and let me update the fitness scores.

99
00:10:19,690 --> 00:10:21,880
In other words initialize them.

100
00:10:21,910 --> 00:10:25,920
Now let's take a look finally at what they look like.

101
00:10:26,290 --> 00:10:33,120
So this year well here gets this code coverage and this one has the same code coverage.

102
00:10:33,130 --> 00:10:42,470
And the third on the same code coverage as well and their fitness scores are 33 percent ish one third

103
00:10:42,980 --> 00:10:44,240
like it should be.

104
00:10:44,930 --> 00:10:54,900
Now we're going to undergo our population is going to undergo a recombination which will later on put

105
00:10:54,900 --> 00:10:55,690
into a loop.

106
00:10:55,740 --> 00:11:00,350
But for now basically what we do is take our population.

107
00:11:00,490 --> 00:11:08,400
We combine half using the methods we just defined so sample a pair for reproduction.

108
00:11:08,400 --> 00:11:17,990
We combine them add them to the new population and then update the scores and then we have a new population.

109
00:11:18,440 --> 00:11:19,850
So let's see what that looks like.

110
00:11:19,880 --> 00:11:29,250
I take my current population which you're seeing here consisting of three inputs and I'm combining and

111
00:11:29,290 --> 00:11:36,860
now that's printed and you see there are a couple new individuals there are no one two three four five

112
00:11:38,040 --> 00:11:45,920
5 members in the populations that 2 are new and they also happen to have the same coverage but that's

113
00:11:45,950 --> 00:11:46,350
OK.

114
00:11:48,980 --> 00:11:52,640
We'll keep going and we can look at them and see.

115
00:11:53,810 --> 00:11:57,420
So in the initial population we had one question mark here.

116
00:11:57,420 --> 00:12:02,710
Now we have two individuals with the question marks and they're different.

117
00:12:02,720 --> 00:12:07,200
You see the ending here has this to G and this one does not.

118
00:12:07,460 --> 00:12:12,240
So you can tell these guys were some recombination happened there.

119
00:12:12,310 --> 00:12:16,200
And if we look at this a little closer we'll be able to tell which ones were combined.

120
00:12:16,240 --> 00:12:17,900
But that's not too important.

121
00:12:17,940 --> 00:12:18,710
Now.

122
00:12:18,940 --> 00:12:20,880
Next let's introduce mutation.

123
00:12:20,980 --> 00:12:28,730
So pick a probability of mutation because mutation is a very powerful thing.

124
00:12:28,750 --> 00:12:35,740
You take some individual regardless of how fit they are you're going to mutate them and now they can

125
00:12:35,740 --> 00:12:37,870
be extremely unfair.

126
00:12:37,990 --> 00:12:45,040
So you can take something really good and then make it really bad for that reason we got to be careful

127
00:12:45,040 --> 00:12:47,200
about how much mutation we're doing.

128
00:12:47,820 --> 00:12:56,170
But at the same time we need mutation so that we don't have too much constraints on the genome.

129
00:12:58,050 --> 00:12:59,710
It allows us to explore more.

130
00:13:00,550 --> 00:13:12,000
So I'm going to recall our functions from the mutations lesson delete insert swap randomize and a convenience

131
00:13:12,000 --> 00:13:12,780
function here.

132
00:13:12,800 --> 00:13:19,590
Produce a random character which will simply pick a random character and then I'm also going to recall

133
00:13:19,590 --> 00:13:29,210
I mutate function which chose at random from these types of mutations and then repeated this for no

134
00:13:29,330 --> 00:13:30,730
mutations times.

135
00:13:30,800 --> 00:13:39,860
And now we are ready to define the mutation phase in which we loop over the whole population looking

136
00:13:39,860 --> 00:13:41,140
at each individual.

137
00:13:41,450 --> 00:13:54,310
And we flip a coin not a fair coin but a coin that's biased using this constant and if it flips one

138
00:13:54,310 --> 00:13:57,470
way then we will mutate the individual.

139
00:13:57,910 --> 00:14:04,080
So we'll apply the mutation and otherwise we will not.

140
00:14:04,360 --> 00:14:12,730
And I will update the fitness because we have changed the population so let's apply mutation and take

141
00:14:12,730 --> 00:14:13,870
a look at what happens

142
00:14:17,830 --> 00:14:20,490
so it doesn't look like much has happened.

143
00:14:21,430 --> 00:14:23,130
Or at least it's hard to find.

144
00:14:23,460 --> 00:14:30,330
But if you look closely you'll see here is Google Now it's Google that's fine.

145
00:14:31,570 --> 00:14:32,750
That's not a problem.

146
00:14:33,040 --> 00:14:35,020
But the coverage is still the same.

147
00:14:35,020 --> 00:14:42,520
Now we're going to define the calling phase in which we kill off individuals that are less fit.

148
00:14:42,550 --> 00:14:49,360
So what we're gonna do is look at whether it's at carrying beyond the carrying capacity or not which

149
00:14:49,360 --> 00:14:51,160
we said to be 100.

150
00:14:51,160 --> 00:14:58,360
Now we're going to get the distribution of fitness is we're going to pick up to Capital an individual

151
00:14:58,810 --> 00:15:03,580
exactly capital and individuals according to the fitness.

152
00:15:03,590 --> 00:15:09,860
So now you see that the more fit individuals are more likely to be retained and the less fit wants less

153
00:15:09,860 --> 00:15:16,100
likely to mutate and then that's what we do in our case we'll be retaining everyone because we don't

154
00:15:16,100 --> 00:15:17,890
have carrying capacity just yet.

155
00:15:18,230 --> 00:15:23,960
So now we run our calling phase for be exactly the same because nothing was called.

156
00:15:24,650 --> 00:15:31,050
And what we just did complete one generation of our algorithm for evolution.

157
00:15:31,670 --> 00:15:39,860
So we started off all the way here with the seeds and we grew our population.

158
00:15:39,950 --> 00:15:51,570
In this case to five members from three and we added some mutations and now if you just repeat this

159
00:15:52,740 --> 00:15:58,560
over and over again you'll see that the coverages will change and we will find bugs and these things

160
00:15:58,560 --> 00:15:59,680
do evolve with time.

161
00:16:00,470 --> 00:16:04,890
So we're going to set a number of generations and I'll be iterating.

162
00:16:04,890 --> 00:16:17,710
So for each generation go through all these phases and let's run this so population generation 0 1 2

163
00:16:17,710 --> 00:16:28,090
3 4 5 and then generation 5 already caught some error so value heir with this input here.

164
00:16:28,200 --> 00:16:29,570
And there's lots of that.

165
00:16:30,250 --> 00:16:38,740
And as you can see even though there is a ton of it eventually it stops doing as many value errors and

166
00:16:38,740 --> 00:16:46,240
that's because of the way we gave reciprocal weight to more frequent coverages.

167
00:16:46,270 --> 00:16:50,730
So the value air coverage was increasing in frequency in the dictionary.

168
00:16:50,950 --> 00:16:58,070
And since we're taking the reciprocal it becomes less and less likely that these individuals survive.

169
00:16:58,090 --> 00:17:05,830
So then it evolved and then now we've found this index error which is actually an error.

170
00:17:05,980 --> 00:17:15,480
We are not expecting so the value error is something you expect.

171
00:17:16,320 --> 00:17:24,780
But there is nothing said about index error and we could see there's more stuff going on.

172
00:17:25,560 --> 00:17:27,760
And the end of the evolution.

173
00:17:28,020 --> 00:17:35,120
Now it can take a look at what went on in by looking at the important samples.

174
00:17:35,210 --> 00:17:38,450
And that's a representative of each coverage.

175
00:17:38,450 --> 00:17:40,730
So if I look at the coverage dictionary

176
00:17:44,170 --> 00:17:51,820
you can see the different types of coverages that have taken place and the important samples are just

177
00:17:51,820 --> 00:17:55,720
representatives one from each.

178
00:17:55,730 --> 00:18:04,090
So for example if you were to run this individual then it would get this coverage.

179
00:18:04,610 --> 00:18:13,330
And if you're on if you were to run this one then you would get this coverage and so on and now it can

180
00:18:13,330 --> 00:18:18,870
take a look at this index air which is a genuine bug.

181
00:18:18,880 --> 00:18:31,130
The whole point of finding the whole point of finding and we can even fur to adhere and see what happened

182
00:18:31,300 --> 00:18:32,250
in more details

183
00:18:38,580 --> 00:18:46,350
and we look and we see that we are getting string index out of range.

184
00:18:46,440 --> 00:18:56,100
And the reason is that if we look here we see that if there is a percentage sign then the next two characters

185
00:18:56,910 --> 00:19:02,270
should be existent but they're not.

186
00:19:02,280 --> 00:19:07,290
For instance I can also put in this input and that still doesn't work.

187
00:19:07,290 --> 00:19:10,290
But if I add another character then it's fine.

188
00:19:11,280 --> 00:19:20,640
So basically our buzzer found us a real vulnerability and it tells us that there's actually a bug in

189
00:19:20,640 --> 00:19:21,220
our code.

190
00:19:22,620 --> 00:19:29,000
So congratulations to us who successfully built up from scratch and executed an evolutionary phase and

191
00:19:29,110 --> 00:19:31,820
found a vulnerability in a piece of code.

192
00:19:31,830 --> 00:19:34,500
Next up we'll be learning how to use the AFL.

