1
00:00:11,840 --> 00:00:13,720
Good afternoon, everyone.

2
00:00:15,820 --> 00:00:19,220
My name is Gustavo Vugo CSH.

3
00:00:19,340 --> 00:00:22,080
I'm the old guard of Brazilian Hacking.

4
00:00:25,240 --> 00:00:28,060
This is my second year in H2HC.

5
00:00:28,060 --> 00:00:29,980
I think the first time I came was in 2012.

6
00:00:31,320 --> 00:00:34,200
And now, in these last two years.

7
00:00:35,940 --> 00:00:57,180
Although I always talk about more technical things, low level, this year I decided to talk a little about an experience I had last year, which was to participate in a call for papers, where there were votes, and people had to vote, but I'll talk about that in a little while.

8
00:00:57,180 --> 00:01:05,880
But the idea was that I had to automate to vote for my paper, because I don't have that many friends.

9
00:01:09,540 --> 00:01:12,040
It was really fun.

10
00:01:12,040 --> 00:01:35,900
I want to share with you what I learned, because I don't really understand what's behind Computer Vision, or Optical Recognition of Characters, but I wanted to do this because I was really pissed off.

11
00:01:38,060 --> 00:01:40,680
So, let's go.

12
00:01:40,680 --> 00:01:41,920
What is CAPTCHA?

13
00:01:42,000 --> 00:01:43,600
This ugly name here.

14
00:01:43,600 --> 00:01:48,660
Completely Automated Public Turing test to tell Computers and Humans Apart.

15
00:01:49,400 --> 00:01:55,900
Basically, it's a test that the site will do, to know if you're really a human, or if you're a robot.

16
00:01:56,220 --> 00:01:57,840
But that's exactly what I wanted.

17
00:01:59,060 --> 00:02:00,820
I wanted to be a robot.

18
00:02:01,260 --> 00:02:04,080
I wanted to make several votes for me, obviously.

19
00:02:07,600 --> 00:02:08,660
So, I wrote...

20
00:02:09,520 --> 00:02:19,380
Actually, the story was really funny, because I work at Intel with Rubira, with Gabriel, with Igor, we are doing the Brazilian Mafia at Intel.

21
00:02:20,200 --> 00:02:23,440
The main language is Portuguese.

22
00:02:23,600 --> 00:02:31,700
Marian, who is Austrian, or Tim, who is Russian, we have a few more Chinese on the team.

23
00:02:31,700 --> 00:02:36,120
They are already learning several words in Portuguese, because if they don't, they get really pissed off.

24
00:02:36,880 --> 00:02:39,220
So, it's really fun.

25
00:02:39,220 --> 00:02:39,980
Well, anyway.

26
00:02:39,980 --> 00:02:51,660
Continuing here, so as not to change course, I decided to write an article for the H2HC magazine, talking about all the details of breaking CAPTCHAs.

27
00:02:51,920 --> 00:02:57,860
So, my presentation will be a little around this same subject.

28
00:02:58,300 --> 00:03:06,240
So, like, how to do this first analysis, how to tell one CAPTCHA from another, and how to break the thing.

29
00:03:06,240 --> 00:03:20,300
So, I decided to create a story here, of the dog and such, but for it to be, like, an analogy, because I didn't want to talk about it directly, I didn't want to expose who organized this call for papers.

30
00:03:21,380 --> 00:03:23,680
So, I wanted to do something more fun.

31
00:03:23,680 --> 00:03:32,220
So, I tried to do a little comedy, and speaking more technically, a little of what I did to break it.

32
00:03:32,220 --> 00:03:40,520
So, just for those who didn't read, to go over here, what I wrote, is the story of Carlos Salo Humberto, who is CSH.

33
00:03:40,540 --> 00:03:42,060
Who discovered this?

34
00:03:42,900 --> 00:03:48,380
So, Carlos is the hero of the Gurizada, he is the guy who was penalized.

35
00:03:48,940 --> 00:03:56,700
Salo is Carlos' popular friend, who has a dog, whose name I didn't give, but who is uglier than Humberto, who is Carlos' dog.

36
00:03:57,020 --> 00:04:04,520
And the story I wrote in Reia, is more or less, Humberto is much prettier than Salo's dog.

37
00:04:04,540 --> 00:04:09,140
But then, the score that Humberto was getting was very low.

38
00:04:09,140 --> 00:04:09,500
Why?

39
00:04:09,500 --> 00:04:21,980
Because Salo posted on Facebook that there was a contest, that they were going to travel to Bahamas, and their friends went there, not only gave a high score to his dog, which was uglier, but also zeroed Humberto.

40
00:04:21,980 --> 00:04:23,840
And Humberto went to the last place.

41
00:04:23,960 --> 00:04:26,380
And this made Carlos get pissed off.

42
00:04:26,440 --> 00:04:28,440
So, any resemblance is a mere coincidence.

43
00:04:30,460 --> 00:04:39,780
So, introduction: whenever I talk about a session, for those who are not used to the term, I mean the session between the user and the server.

44
00:04:39,780 --> 00:04:50,840
So, everything needed to make a request, all the session cookies, etc., is considered a session. A solution is the correct answer to the CAPTCHA challenge.

45
00:04:50,840 --> 00:04:57,860
So, if some letters appear there, the solution would be those letters that for us it is obvious, but for a computer it is not so obvious.

46
00:04:58,160 --> 00:05:01,280
And an image is a generic figure.

47
00:05:03,080 --> 00:05:09,780
So, this is how the CAPTCHA of this event appears.

48
00:05:10,000 --> 00:05:12,800
And above I put an image, just with a word.

49
00:05:13,100 --> 00:05:23,280
Because the type of software that does an analysis of an image and converts it to text is called OCR, Optical Character Recognition.

50
00:05:24,840 --> 00:05:31,460
And let's agree that if you are a computer and you are going to read the first image, it is much easier to decode than the second.

51
00:05:31,460 --> 00:05:46,980
That is precisely why the CAPTCHA adds this bunch of noise, makes the letters crooked, etc.: to make it difficult for you to write a program to automate it or to just throw the image into OCR software.

52
00:05:47,420 --> 00:06:06,480
Then, when I came across this, because the step of the site, you had to create an account, to create the account, it showed the image, it had to have an email, then it sent a link to that email, you had to confirm the email to activate the account, and then you logged in with that account to make the vote.

53
00:06:06,480 --> 00:06:15,520
So, it's more or less, you are already used to this, not only on voting sites, but any site where you want information, they don't give a damn about it.

54
00:06:16,820 --> 00:06:20,280
So, my challenge was, OK, how am I going to do this?

55
00:06:20,280 --> 00:06:21,900
I'm going to learn Computer Vision.

56
00:06:22,380 --> 00:06:26,460
I even started to take a look, but I said, no, this is too complicated.

57
00:06:29,680 --> 00:06:30,740
Use a ready-made tool.

58
00:06:30,740 --> 00:06:36,320
Actually, I put it in reverse order, because I think the first thing I did was, oh, I'm going to see if anyone has already done this.

59
00:06:36,320 --> 00:06:44,560
Then I found a lot of tutorials and scripts that said they broke and such, and obviously I tested it, and it didn't work the way I wanted.

60
00:06:44,600 --> 00:06:47,680
Then I said, I'm going to have to learn this shit now, to do this.

61
00:06:48,140 --> 00:06:59,060
But I said, oh, I have little time, because it was 30 days, I think, from when we entered the Call for Papers; then there would be all the voting, and in the end they would give the result.

62
00:06:59,620 --> 00:07:02,860
Then I said, well, I don't have time to learn all this, I have to find a way.

63
00:07:03,660 --> 00:07:13,200
Then I saw that some of these sites that had tutorials on breaking CAPTCHAs used Tesseract, which I didn't know until then, but it's a tool from Google that does OCR.

64
00:07:13,920 --> 00:07:15,900
You pass it the image and it gives you the OCR result.

65
00:07:15,900 --> 00:07:22,940
But obviously, if you pass it the first image, it does wonderfully well; the second, not so much.

66
00:07:22,940 --> 00:07:28,300
Then I thought, no, ok, what do I need to know to tell if I can break this CAPTCHA?

67
00:07:28,300 --> 00:07:37,100
Because if the guy changes the image every time I make a mistake, or, I don't know, blocks my IP, I don't know.

68
00:07:37,540 --> 00:07:39,000
Then I started to think.

69
00:07:39,460 --> 00:07:46,480
I took the URL of the image, opened a new tab, pasted the URL there, and started pressing F5 like crazy.

70
00:07:47,080 --> 00:07:52,320
Every time I did this, the image would change, it would jump to one side, it would jump to the other.

71
00:07:53,320 --> 00:07:56,180
Some details of the image remained the same.

72
00:07:57,120 --> 00:07:59,580
Then I said, ok, I think I can break it.

73
00:07:59,720 --> 00:08:04,720
I had never done this before, but I was indignant with this story.

74
00:08:05,180 --> 00:08:06,200
My dog.

75
00:08:07,060 --> 00:08:09,640
And then, no, I'm going to do this.

76
00:08:09,900 --> 00:08:25,980
Then I thought, no, for the session, for the URL, I have to check a lot of things: if I keep pressing F5, will there come a time when it changes this image, or stops giving me this image and gives me a 404 error, something like that?

77
00:08:25,980 --> 00:08:31,360
Because I wanted to determine how long a time window I have to keep capturing these images.

78
00:08:33,620 --> 00:08:41,520
And then, of course, I tested other sites to have this table to help me understand this.

79
00:08:41,520 --> 00:08:47,580
But basically, what are the criteria for me to determine if a captcha is breakable or not.

80
00:08:47,740 --> 00:08:50,240
So, the more images you have, the better.

81
00:08:50,240 --> 00:08:52,080
I will explain why.

82
00:08:52,500 --> 00:08:54,620
And the question of the session.

83
00:08:54,620 --> 00:09:10,200
So, if there is a session, a unique URL for the image, from which you can get several images, or if you can associate this session with the solution, that is also a good thing.

84
00:09:10,680 --> 00:09:29,840
Another important thing: if I submit a POST with the captcha solution, and this solution is wrong, does the site that is evaluating it change the captcha? In my case, there were four letters.

85
00:09:30,100 --> 00:09:38,320
If I tried A, B, C, D, and that was not it, will it change the solution, or will it remain the same?

86
00:09:38,320 --> 00:09:38,900
Why is that?

87
00:09:38,900 --> 00:09:54,780
Because if it remains the same, I can keep trying several times until this time window runs out, or forever, if it lets me do this indefinitely.

88
00:09:55,240 --> 00:09:59,260
So, this is more or less what I put in my criteria.

89
00:09:59,700 --> 00:10:08,280
It is funny to be giving this lecture after watching Edgar and Thais' lecture, because now I think that all of this could be the constraints of my solver.

90
00:10:08,740 --> 00:10:16,180
And to know if a captcha is breakable or not, I could use an SMT solver to know if it is possible or not.

91
00:10:17,340 --> 00:10:27,620
So, here is just an example of some images that you can see, like this, for example, UPSD, this U goes there, then U goes here, you see?

92
00:10:28,740 --> 00:10:29,680
So, what did I do?

93
00:10:29,680 --> 00:10:47,100
In order not to bother the guys with my IP, taking the images, I did, no, I will do a sample, I will take 50 sessions, 50 images of each session, then I can keep playing, creating my algorithms, see if I can filter this, and so, and break it.

94
00:10:47,920 --> 00:10:54,540
And it was from there that I wanted a proof that it was possible to break it, before automating it.

95
00:10:54,540 --> 00:11:00,020
For me, I think that the most difficult problem was to break the captcha.

96
00:11:00,020 --> 00:11:14,900
Because then it is this automation, like creating an automatic e-mail, I found out there, Guerrilla Mail, I don't know if anyone knows it, but it is cool that you create a temporary account, while the session is open there, you can go to a website, create an account with that temporary e-mail,

97
00:11:14,900 --> 00:11:26,320
it will send the e-mail, you don't need, there is even an AJAX API, that creates automatic users, and it is really cool.

98
00:11:27,860 --> 00:11:45,900
So, decoding process, I put four steps, one, I will normalize this image, what does normalization mean, I will get into it, I do a pre-filter, and then I separate the letters, and I send each letter to Tesseract, because I found out that, because they keep dancing,

99
00:11:45,900 --> 00:11:53,280
they put the letters in different angles, Tesseract didn't recognize it very well for me.

100
00:11:53,280 --> 00:12:14,740
Maybe there are other options, or configurations of Tesseract that I don't know, because there are dozens, maybe hundreds of options in Tesseract, and I didn't have time, so I had to, I will separate these letters, because I think it is easier.

101
00:12:15,800 --> 00:12:31,840
So, the first thing I did: I will make a histogram of these images by width. I will take all the images, I will know how many pixels each is in width, how many pixels in height, I will see what the average is, and I will normalize all the images to the same size,

102
00:12:31,840 --> 00:12:38,500
because I can then overlay them, and maybe there is something that I can see, in this story.

103
00:12:38,500 --> 00:12:52,280
So, although you can't see it, but I think the average here is like 78 pixels in width, by, I think, 58, something like that, in height.

104
00:12:52,280 --> 00:13:03,960
But this one you can see that it is more or less normalized, but this other one, the vast majority, is in that height there, and I think it was the height I used.

105
00:13:04,700 --> 00:13:19,440
And then I realized that this line, I think, this is a trick they do, because the image with the lowest resolution is more difficult to break.

106
00:13:19,440 --> 00:13:35,300
So, the higher the resolution, the more definition the images have, noise included, and the easier it is for you to break the CAPTCHA, because there are several ways for you to work on the images to reduce the noise.

107
00:13:35,820 --> 00:13:42,800
But I found that there was a line here, I think there is another line there, and an arc, that all the images have.

108
00:13:42,800 --> 00:14:06,320
So, when I normalized all my 2,500 images, and made a sum of all of them, and divided by the number of images, you can do this, OpenCV, I had to, I did all this stuff in Python, so OpenCV helped me, how to open an image, convert it to a matrix, all these things,

109
00:14:06,320 --> 00:14:09,320
so far so good.

110
00:14:09,720 --> 00:14:21,500
But, only with these images, I was able to extract these things, applying this average of all the images, on top of the images, you can see that the noise has decreased a lot.

111
00:14:21,640 --> 00:14:28,140
Removing the peripheral part, around the image, it almost cleaned everything automatically.
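
A minimal sketch of the averaging idea, not the code from the talk: the talk used OpenCV, while this version uses plain Python lists so it runs anywhere. It assumes the images are already normalized to the same size and converted to grayscale (0 is ink, 255 is paper); the names `mean_image` and `remove_static_background` and the threshold of 128 are hypothetical.

```python
def mean_image(images):
    """Pixel-wise average of equally sized grayscale images (lists of rows)."""
    height, width = len(images[0]), len(images[0][0])
    total = [[0] * width for _ in range(height)]
    for img in images:
        for y in range(height):
            for x in range(width):
                total[y][x] += img[y][x]
    return [[total[y][x] / len(images) for x in range(width)] for y in range(height)]


def remove_static_background(img, mean, threshold=128, white=255):
    """Whiten every pixel whose average over the whole sample is dark.

    A pixel that is dark on average is dark in most images, so it probably
    belongs to the fixed lines or arc rather than to a letter that moves.
    """
    return [
        [white if mean[y][x] < threshold else img[y][x] for x in range(len(row))]
        for y, row in enumerate(img)
    ]


# Three tiny 3x4 "images". Row 1 is a fixed line present in every image;
# the single letter pixel is in a different place each time.
samples = [
    [[0, 255, 255, 255], [0, 0, 0, 0], [255, 255, 255, 255]],
    [[255, 255, 0, 255], [0, 0, 0, 0], [255, 255, 255, 255]],
    [[255, 255, 255, 255], [0, 0, 0, 0], [255, 255, 255, 0]],
]
mean = mean_image(samples)
cleaned = remove_static_background(samples[0], mean)
```

With this approach, any part of a letter that happens to overlap the fixed line is erased together with the line, which is one way letters end up with small gaps that a later filter has to close.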

112
00:14:28,140 --> 00:14:33,900
But you can see that the thing is still a little strange, it is not very clear.

113
00:14:34,680 --> 00:14:54,140
But then you have some things, I think I used two or three functions of OpenCV, to improve, because it has something that I think is, I forgot the name of the function, but it kind of agglutinates small spaces between, for example, in the C there, that has some flaws,

114
00:14:54,140 --> 00:15:00,820
it will extend that shape, to try to complete it.

115
00:15:00,820 --> 00:15:16,420
And then, once I did that, I think the hardest part was breaking the four letters, I tried a lot of things, like, you can see that there is a certain cloud here, between each letter.

116
00:15:17,580 --> 00:15:28,300
I imagined, maybe if I take only these regions here, extract these regions after applying this pre-filter, extract each one of these, send it to Tesseract and see what happens.

117
00:15:28,560 --> 00:15:29,820
I did that.

118
00:15:29,820 --> 00:15:39,320
And I had a good initial result, I don't know, 20, 30% of the images, I was able to decode the letters.

119
00:15:39,320 --> 00:15:48,440
But a lot of times, because of that dancing of the letters, there was a lot of pieces of the letter out of each one of these regions, and that got in the way.

120
00:15:48,440 --> 00:15:59,700
So I had to use, again, OpenCV, there is a function that returns me, given this image, how many outlines do I have?

121
00:15:59,700 --> 00:16:02,200
Then it goes and says, there are so many outlines.

122
00:16:02,280 --> 00:16:13,600
Then, of course, you can see, for example, this T down here, it would tell me that there were two outlines, or regions of outlines, which is the T itself, and there is a dirt down there.

123
00:16:14,100 --> 00:16:21,200
So, for example, that other H there, from 40 to 40, there is a dot in the middle of the H, and there is also a dirt up there.

124
00:16:21,660 --> 00:16:39,160
So I had to do, I have N outlines, then I made the algorithm, there is an outline inside another outline, and I will consider making a union of these two outlines, then I say, what is the size that I have of my width of each outline, adding all of them,

125
00:16:39,160 --> 00:16:42,440
does it make sense to have four letters?

126
00:16:42,440 --> 00:16:43,820
Because if there were more...

127
00:16:44,420 --> 00:16:55,660
Then I did something kind of rough, just to separate, when I think that the thing is not very good, I discard that image, and I try to find a new image on the site.
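
A rough sketch of this separation step, under stated assumptions: it works on bounding boxes `(x, y, w, h)` such as the ones a contour finder would return, and it is not the script from the talk. The function names, the minimum area used to throw away dirt, and the maximum total width (taken from the roughly 78 pixel average image width mentioned earlier) are all hypothetical.

```python
def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer`; boxes are (x, y, w, h)."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


def merge_nested(boxes):
    """Union each nested box with its enclosing box, which leaves only the outer one."""
    return [
        b for i, b in enumerate(boxes)
        if not any(i != j and contains(other, b) for j, other in enumerate(boxes))
    ]


def split_letters(boxes, expected=4, min_area=12, max_total_width=78):
    """Return `expected` letter boxes sorted left to right, or None to discard the image."""
    boxes = list(dict.fromkeys(boxes))                      # remove exact duplicates
    boxes = [b for b in boxes if b[2] * b[3] >= min_area]   # drop small specks of dirt
    boxes = merge_nested(boxes)
    total_width = sum(w for _, _, w, _ in boxes)
    if len(boxes) != expected or total_width > max_total_width:
        return None                                         # does not look like four letters
    return sorted(boxes)


boxes = [
    (2, 10, 14, 30),    # first letter
    (20, 12, 15, 28),   # second letter
    (24, 18, 4, 5),     # blob inside the second letter
    (40, 8, 16, 32),    # third letter
    (60, 11, 14, 29),   # fourth letter
    (70, 50, 2, 2),     # dirt
]
letters = split_letters(boxes)
```

Returning `None` corresponds to discarding the image and fetching a new one from the site.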

128
00:16:56,280 --> 00:17:06,160
But this was my proof of concept, I am doing this on top of my 2,500 images, to know if my algorithm will work, before I lose my time automating that.

129
00:17:07,580 --> 00:17:08,060
And...

130
00:17:08,060 --> 00:17:17,380
I think this is where I lost more of my time, to try to do things, just to write this little Python, to make the letters pretty, and that took me some time.

131
00:17:18,320 --> 00:17:24,280
But it helped me to have a general vision, like, I think the algorithm is good, I think it is not.

132
00:17:25,440 --> 00:17:30,340
And then I did a statistical analysis, of each of the letters.

133
00:17:30,340 --> 00:17:38,240
So, for example, the first set of images, I apologize to those of you who cannot see well, but I will try to explain more or less.

134
00:17:38,980 --> 00:17:52,360
But the first set of images, the first letter, sorry, the second letter, which is this first image here, the solution is R.

135
00:17:52,400 --> 00:17:57,980
But the tesseract of the 15 images that it decoded...

136
00:17:58,960 --> 00:18:00,540
Wait, let me do...

137
00:18:00,540 --> 00:18:02,360
No, I am not on my computer.

138
00:18:02,360 --> 00:18:03,140
Damn.

139
00:18:03,220 --> 00:18:05,200
This is a Google Spreadsheet.

140
00:18:05,200 --> 00:18:10,260
You can send me an e-mail, I can share this spreadsheet with you, you can look at the numbers.

141
00:18:10,260 --> 00:18:12,300
But the idea is...

142
00:18:12,300 --> 00:18:16,540
Tesseract does not always give me the correct answer.

143
00:18:16,540 --> 00:18:29,580
It told me twice that it was a C, four times that it was a D, once that it was an O, once that it was a P, once that it was a Q, and six times that it was an R.

144
00:18:29,580 --> 00:18:31,420
The right answer was an R.

145
00:18:31,660 --> 00:18:36,360
But as you can see, it is not very precise.

146
00:18:36,380 --> 00:18:37,820
It is really rough.

147
00:18:39,560 --> 00:18:40,040
So...

148
00:18:42,300 --> 00:18:46,880
My message to you is that image decoding is an analog process.

149
00:18:47,940 --> 00:18:48,980
I do not know if...

150
00:18:48,980 --> 00:18:58,440
I was talking to someone on Friday who told me that there is a tool that breaks CAPTCHAs, that applies the filters automatically, and it already breaks them.

151
00:18:58,440 --> 00:19:02,720
I was interested, because I did not find this tool when I researched it.

152
00:19:02,720 --> 00:19:07,460
So, maybe there are more automated ways, but...

153
00:19:07,460 --> 00:19:14,760
I went through this to try to understand and make a shortcut in this process.

154
00:19:15,480 --> 00:19:16,000
And...

155
00:19:16,000 --> 00:19:18,300
And this is what I wanted to share.

156
00:19:19,440 --> 00:19:19,960
So...

157
00:19:19,960 --> 00:19:23,940
My conclusion was that image decoding is an analog process.

158
00:19:23,940 --> 00:19:30,980
In my case, I have to have a function that decodes: it takes an image as a parameter and returns four letters.

159
00:19:31,280 --> 00:19:37,300
It may be that Tesseract cannot interpret a letter at all, or it may be that the letter it returns is not the correct answer.

160
00:19:39,160 --> 00:19:39,680
So...

161
00:19:39,680 --> 00:19:41,660
There is a high error rate.

162
00:19:42,900 --> 00:19:43,800
After I made the presentation...

163
00:19:44,660 --> 00:19:49,660
For the first time in my life, I finished a presentation before half an hour of my presentation.

164
00:19:49,820 --> 00:19:53,380
So I'll have to remember the lies I'm going to tell you.

165
00:19:53,560 --> 00:19:55,860
So sometimes I'll get stuck, but...

166
00:19:55,860 --> 00:19:56,540
Let's go.

167
00:19:56,840 --> 00:19:59,080
So, it's a statistical process.

168
00:19:59,200 --> 00:20:03,840
If the number of images is infinite, it is certain that you will break the CAPTCHA.

169
00:20:03,840 --> 00:20:06,580
This is kind of obvious.

170
00:20:06,880 --> 00:20:15,200
If the number of attempts is infinite, it is also kind of obvious that you will be able to crack, but you can also use brute force.

171
00:20:15,580 --> 00:20:24,680
So, in my case, there are 26 letters in the alphabet; 26 to the fourth power is the number of combinations to break the CAPTCHA.
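
The arithmetic behind that number, as a small sketch. It assumes the solution is four uppercase letters A to Z, each equally likely; the variable names are illustrative only.

```python
import string

ALPHABET = string.ascii_uppercase      # 26 letters
LENGTH = 4                             # letters per CAPTCHA in this case

keyspace = len(ALPHABET) ** LENGTH     # 26^4 possible solutions
# Expected number of guesses when every candidate is tried once, in any order
average_guesses = (keyspace + 1) / 2
```

Under these assumptions the keyspace is 456,976 combinations, so pure brute force needs about 228 thousand requests on average, which is why it only pays off if the site never changes the solution.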

172
00:20:24,980 --> 00:20:34,940
So if there is a small latency in the network, with the network of the contest, it trolls the guys, and then it breaks.

173
00:20:36,980 --> 00:20:37,420
And...

174
00:20:38,180 --> 00:20:45,940
Here I made a histogram with the four letters and...

175
00:20:47,000 --> 00:20:58,140
the axis of the Y will tell me how many images it was able to decode from each of my image sets, which are 25.

176
00:20:58,140 --> 00:21:02,440
So there are 25 sets of four bars there.

177
00:21:02,520 --> 00:21:11,280
So you can see, for example, the fourth letter, there are some places where it was able to decode very few letters.

178
00:21:12,600 --> 00:21:13,200
And...

179
00:21:13,200 --> 00:21:23,820
But I was calm, because I thought, but it is obvious that if I can map this number of things, I can guess what the capture is.

180
00:21:23,820 --> 00:21:28,160
And this is, I think, the great moral of the story.

181
00:21:28,200 --> 00:21:39,660
I will try to go a little into my algorithm. Those who have more statistical knowledge will understand this directly; for those who don't, I will try to be a little didactic.

182
00:21:39,700 --> 00:21:43,660
But the idea is the following, let's say that my initial state is the first line there.

183
00:21:43,800 --> 00:21:53,760
In the first line I have no letter that I was able to decode and no letter from that set of five letters or...

184
00:21:53,760 --> 00:21:57,160
that set of letters that I have, of possibilities.

185
00:21:58,120 --> 00:22:10,280
So, when my script processes the first image, the Tesseract returns three letters, T, R, C, and I couldn't decode the fourth letter.

186
00:22:10,280 --> 00:22:22,780
So I will enter my frequency table that will have there, well, I have one T in the first, one R in the second, one C in the third, and I have one unknown in the fourth letter.

187
00:22:23,080 --> 00:22:29,200
Then the question is, could I make some attempt to decode?

188
00:22:29,240 --> 00:22:32,580
No, because I don't have enough information, right?

189
00:22:32,840 --> 00:22:42,040
So I go on, until, for example, in the third image a K appears in the fourth letter.

190
00:22:42,040 --> 00:22:49,460
So I have here, for example, the T, I analyzed three images and the Tesseract returned three times the letter T.

191
00:22:49,780 --> 00:22:56,220
I said, wow, this gives me hope that the first letter is the T.

192
00:22:56,220 --> 00:23:02,860
In the second letter it gave me two R's and one unknown.

193
00:23:02,860 --> 00:23:09,800
So I have a 66% chance to know that the second letter is R.

194
00:23:09,800 --> 00:23:15,920
My other 33%, or my other third, tells me that I have no way to evaluate.

195
00:23:15,960 --> 00:23:25,180
In the third letter, the same thing, I have a 66% chance that it is a C and a 33% chance that it is a G.

196
00:23:25,320 --> 00:23:38,480
In the fourth letter I have a 66.666%, or two thirds, chance that I don't know what it is, and one third that it is a K.

197
00:23:39,120 --> 00:23:45,400
Then I made an algorithm that thought like this, what is the most obvious to try?

198
00:23:45,400 --> 00:23:50,820
I take the letter with the highest frequency in each of the positions, and this is the combination I'm going to try.

199
00:23:52,340 --> 00:23:58,640
And if it doesn't work, well, if it doesn't work, I'm going to start doing a simulation of a lot of things.

200
00:23:58,640 --> 00:24:05,540
I'm going to take a dice, I'm going to start throwing a dice, and I'm going to try to know what the letters are.

201
00:24:06,200 --> 00:24:15,180
And that's what I did, but notice that in the third image, my script will try to guess the CAPTCHA.

202
00:24:15,180 --> 00:24:19,540
It will try to say that it's T, R, C, K.

203
00:24:20,500 --> 00:24:24,400
And the site will reject it, saying, no, this CAPTCHA is not valid.

204
00:24:24,400 --> 00:24:25,360
What do I do?

205
00:24:25,360 --> 00:24:31,160
I take this attempt and throw it in a table of attempts that didn't work.

206
00:24:33,360 --> 00:24:40,520
I'm only going to have another letter in the fourth position, here, at this point.

207
00:24:40,720 --> 00:24:43,240
But at this point, my whole scenario has changed.

208
00:24:44,300 --> 00:24:46,140
I'm going to have...

209
00:24:47,200 --> 00:24:49,180
Sorry, let me go back a little before.

210
00:24:50,060 --> 00:24:54,100
Since I have K in the fourth position, I'm going to have...

211
00:24:54,100 --> 00:24:58,640
But in the third image, I only have...

212
00:24:59,780 --> 00:25:01,760
No, actually, it even makes a permutation.

213
00:25:01,980 --> 00:25:07,420
Because I have the possibility of being T, R, G, K.

214
00:25:07,660 --> 00:25:10,460
Because in the second image, it said it could be a G.

215
00:25:11,080 --> 00:25:11,800
Is that it?

216
00:25:11,800 --> 00:25:12,760
That's it.

217
00:25:12,760 --> 00:25:19,400
So my script will take and play randomly, based on the weight.

218
00:25:19,400 --> 00:25:30,480
So it will take, for example, for the first letter, I have 3 T's, 2 R's, 2 C's and 1 G.

219
00:25:30,480 --> 00:25:38,100
So it will make, in each of the groups, a random sample with weights.

220
00:25:38,140 --> 00:25:39,560
And it will try this other one without.

221
00:25:39,560 --> 00:25:46,320
So it could try, there, in the fourth image, it could try T, R, G, K, which would also fail.

222
00:25:46,320 --> 00:25:53,800
It takes this guess and puts it in the table of attempts that were not successful.

223
00:25:53,800 --> 00:25:54,700
And so it goes.

224
00:25:54,700 --> 00:26:02,800
So, for example, in the fifth line, it introduces an O, then it introduces an I and an F.

225
00:26:02,800 --> 00:26:04,340
And that's where it will be stuck.

226
00:26:04,340 --> 00:26:15,860
So, while it is capturing the images and updating the frequency table, the algorithm that tries to generate the combinations of the answer will be running in parallel.

227
00:26:16,200 --> 00:26:27,400
But it will always try, this is what I did, it will always try the one with the highest frequency before trying to make this random attempt.
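
The guessing strategy described above can be sketched roughly as follows. This is an illustration under assumptions, not the script from the talk: the class name `Guesser`, the `?` placeholder for a letter that could not be decoded, and the `max_tries` limit are all hypothetical, and the example data follows the T, R, C, K walk-through.

```python
import random
from collections import Counter

UNKNOWN = "?"  # placeholder for a position the OCR could not decode


class Guesser:
    def __init__(self, length=4, seed=None):
        self.tables = [Counter() for _ in range(length)]  # one frequency table per position
        self.failed = set()                               # guesses the site rejected
        self.rng = random.Random(seed)

    def observe(self, letters):
        """Record one OCR result, e.g. ['T', 'R', 'C', '?']."""
        for table, letter in zip(self.tables, letters):
            table[letter] += 1

    def _known(self, table):
        return {letter: n for letter, n in table.items() if letter != UNKNOWN}

    def most_frequent(self):
        """Most frequent letter per position, or None if a position has no letter yet."""
        guess = ""
        for table in self.tables:
            known = self._known(table)
            if not known:
                return None
            guess += max(known, key=known.get)
        return guess

    def next_guess(self, max_tries=1000):
        """Try the most frequent combination first, then weighted random samples."""
        best = self.most_frequent()
        if best is None:
            return None                    # not enough information yet
        if best not in self.failed:
            return best
        for _ in range(max_tries):
            guess = ""
            for table in self.tables:
                known = self._known(table)
                guess += self.rng.choices(list(known), weights=list(known.values()))[0]
            if guess not in self.failed:
                return guess
        return None                        # nothing new to try until more images arrive

    def reject(self, guess):
        self.failed.add(guess)


g = Guesser(seed=0)
g.observe(["T", "R", "C", "?"])
too_early = g.next_guess()                 # fourth position still unknown
g.observe(["T", "?", "G", "?"])
g.observe(["T", "R", "C", "K"])
first = g.next_guess()                     # highest frequency in every position
g.reject(first)
second = g.next_guess()                    # weighted random sample, skipping failures
```

Here `first` is `TRCK` and, once that is rejected, the only remaining combination is `TRGK`. Keeping the rejected guesses in a set is what stops the sampler from sending the same wrong answer twice.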

228
00:26:28,000 --> 00:26:47,980
And when I did this, I tested it several times, I was improving my filter algorithm of the image, and I managed, out of my 50 samples, I managed to break 48 of them.

229
00:26:48,320 --> 00:26:52,360
Just doing these tricks.

230
00:26:52,360 --> 00:27:00,180
Some had like 400, 500 attempts to get to the right answer, but they did it.

231
00:27:02,060 --> 00:27:06,760
So, I put here, like, what are the results.

232
00:27:06,760 --> 00:27:14,740
For the first letter, Tesseract managed to decode 57% of the images, whether the result is the correct one or not.

233
00:27:14,880 --> 00:27:24,320
Like, Tesseract can see an A, but it will tell you that it is a, I don't know, a symbol of truth, or, I don't know, upside down, I don't know.

234
00:27:24,320 --> 00:27:36,060
It gives a crazy code, which I treat as null: anything that is not between A and Z, I consider to be garbage.
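
A small sketch of that filtering rule. It assumes the OCR step hands back a raw string per letter; the function name, the `?` placeholder, and the choice to uppercase the result are hypothetical.

```python
import string

UNKNOWN = "?"  # what gets recorded when the OCR output is garbage


def clean_ocr_letter(raw):
    """Map one raw OCR result to a single letter A-Z, or UNKNOWN for anything else."""
    text = raw.strip().upper()
    if len(text) == 1 and text in string.ascii_uppercase:
        return text
    return UNKNOWN
```

For example, `clean_ocr_letter("r\n")` gives `R`, while an empty string, a symbol, or more than one character all become `?`.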

235
00:27:36,660 --> 00:27:44,100
But, incredibly, my process of filter and letter separation gave me all these results.

236
00:27:44,820 --> 00:27:55,460
And then, I managed to decode the solution, like, the first letter, 100%, the second letter, 98%, because two cases failed, I think, the third letter, the same thing, the fourth letter, 100%.

237
00:27:55,460 --> 00:27:58,500
And this is what is interesting.

238
00:27:58,660 --> 00:28:14,460
Like, the solution is inside the most frequently decoded letter, which means that, from A to Z, let's say I have decoded 50 images, I have 40 Zs, is this Z the solution?

239
00:28:14,920 --> 00:28:17,980
In 76% of the cases, yes.

240
00:28:18,960 --> 00:28:21,880
The second letter, in 70% of the cases, no.

241
00:28:21,880 --> 00:28:23,740
In other words, I...

242
00:28:24,360 --> 00:28:36,460
Actually, the process I did was to get to this point, and then I decided, no, I will have to go back and do that random weight sample to improve it.

243
00:28:36,460 --> 00:28:37,770
Because this made...

244
00:28:38,240 --> 00:28:47,570
That algorithm I explained to you before, of the random weight sample, made me raise my solution from 76% of the cases, from the first letter, to 100%.

245
00:28:48,520 --> 00:28:57,460
Because then it is not always focused on what has the highest frequency, it tries to catch all the letters it managed to decode.

246
00:28:59,340 --> 00:29:23,500
So, in summary, of this decoding process, having the sample, making a way to automate, like I did, 50 sessions, 50 images each, I can work without alerting the organization of the contest, or whatever, because I can work this locally, manipulate the images,

247
00:29:23,500 --> 00:29:30,380
make the filters, I don't need to automate all this stuff now, it gives me a good idea to know, well, I think this is going to work.

248
00:29:30,860 --> 00:29:35,980
And when I got to this point, I said, no, it's cool.

249
00:29:36,320 --> 00:29:39,460
So I can start automating.

250
00:29:39,460 --> 00:29:49,140
I made some quizzes here for you, because the idea I want to pass is that maybe one captcha is different from another captcha, right?

251
00:29:49,420 --> 00:29:56,960
And then I got a government website that I knew had online voting, and then I did what I...

252
00:29:56,960 --> 00:30:08,920
I took the URL of the image, opened it in another tab, and pressed refresh 7 times. Although you may not notice it, first, it's the same captcha.

253
00:30:09,200 --> 00:30:12,960
Second, the letters don't dance, so it seems to be easier to break.

254
00:30:13,360 --> 00:30:16,740
The only thing that changes is the noise they add.

255
00:30:16,740 --> 00:30:21,600
So, for example, here you have some dots here, which are not here, here are others, and so on.

256
00:30:21,660 --> 00:30:24,360
Then I thought, ah, this is very easy to break.

257
00:30:24,760 --> 00:30:39,460
You just apply a simple filter and pass it to Tesseract. I even believe that if you pass this directly to Tesseract, it will decode it right away, and then you already have the answer to the captcha.
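One possible "simple filter" for this case, sketched under assumptions: since the letters stay put and only the noise changes between refreshes, a pixel-wise majority vote over several copies should cancel most of the dots. The talk does not say which filter was used, so this is a swapped-in technique. Images are assumed to be already binarized into nested lists (1 = ink, 0 = background).

```python
def majority_filter(images):
    """Pixel-wise majority vote over same-sized binary images.

    A pixel is kept as ink only if it is ink in more than half of the
    copies, so noise that appears in a single copy is dropped.
    """
    if not images:
        raise ValueError("need at least one image")
    height, width = len(images[0]), len(images[0][0])
    threshold = len(images) / 2
    return [
        [
            1 if sum(img[y][x] for img in images) > threshold else 0
            for x in range(width)
        ]
        for y in range(height)
    ]
```

The cleaned grid could then be written back to an image file and handed to the OCR step.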

258
00:30:40,540 --> 00:30:41,220
Like...

259
00:30:42,040 --> 00:30:43,800
Very easy.

260
00:30:44,080 --> 00:30:46,120
This is another captcha.

261
00:30:46,120 --> 00:30:50,580
This one is cool, this kind of captcha, because it applies...

262
00:30:50,580 --> 00:30:56,620
it's a simple equation, like 7x4, 9-1, 4x1, blah, blah, blah.

263
00:30:56,620 --> 00:30:58,400
You have to give the right answer.

264
00:30:59,500 --> 00:31:02,680
If you miss the answer, it gives you another equation.

265
00:31:03,680 --> 00:31:18,360
So, like, it's good to have an image analysis algorithm that can apply all the necessary filters, so that you can decode that into text, and then solve the equation.

266
00:31:19,420 --> 00:31:27,200
It would even be cool to make a little thing there, like an eval, something like that, and then...

267
00:31:27,200 --> 00:31:29,640
It would also be cool to...

268
00:31:30,220 --> 00:31:32,000
This is what I'm thinking about now.

269
00:31:32,760 --> 00:31:48,340
But if you wanted to break the breaker, it would be cool, if it's doing an eval, to put an exploit in the image: then Tesseract decodes it, puts it in the eval, and you open a shell on the cracker's machine.

270
00:31:48,340 --> 00:31:50,320
That would be cool.
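Given that risk, a solver for these equation captchas does not need `eval` at all. The sketch below assumes the OCR step returns short strings such as "7x4" or "9-1" (two integers and one operator); anything else is rejected instead of executed. The accepted operator set is an assumption.

```python
import operator
import re

# Only these operators are accepted; "x" is how the captcha writes multiplication.
_OPS = {"+": operator.add, "-": operator.sub, "x": operator.mul, "*": operator.mul}
_PATTERN = re.compile(r"\s*(\d+)\s*([+\-x*])\s*(\d+)\s*=?\s*\??\s*")


def solve_equation(ocr_text):
    """Solve 'A op B' without eval; raise ValueError on any other input."""
    match = _PATTERN.fullmatch(ocr_text.lower())
    if match is None:
        raise ValueError(f"not a simple equation: {ocr_text!r}")
    left, op, right = match.groups()
    return _OPS[op](int(left), int(right))
```

`solve_equation("7x4")` returns 28, while a decoded payload such as `__import__('os').system('sh')` only raises `ValueError`.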

271
00:31:53,460 --> 00:32:08,880
This was another one, captchas.net, another site I saw, and although you can't see it, it bends the letters a little, but they stay at the same angle.

272
00:32:09,100 --> 00:32:16,240
So, it generated these random letters, it generated more or less the angle of each letter, and it remained the same.

273
00:32:16,240 --> 00:32:20,940
The only difference is that it added a lot of random noise in the background.

274
00:32:21,240 --> 00:32:22,640
So, again, this with...

275
00:32:25,320 --> 00:32:31,160
basic image manipulation, you can remove all this noise and send it to Tesseract to break it.

276
00:32:33,660 --> 00:32:52,960
The interesting thing about this is that this site that provides this solution, I think it's in PHP, I don't remember now, they keep, they give you the same image with the same answer several times.

277
00:32:52,960 --> 00:32:54,900
So, for me, this is a failure.

278
00:32:55,180 --> 00:33:02,200
So, the more dynamic it is, the better: on an error, it should generate a totally new thing.

279
00:33:02,200 --> 00:33:06,820
Like Google's Recaptcha, which shows some sentences...

280
00:33:08,320 --> 00:33:15,400
Actually, Google's Recaptcha is a separate chapter, maybe I could make a whole presentation about it.

281
00:33:15,400 --> 00:33:18,620
I looked at it, discovered several things.

282
00:33:18,820 --> 00:33:27,840
One, Google's APIs keep looking at where you are moving the mouse, where you are clicking, and so on.

283
00:33:27,840 --> 00:33:29,940
And then it tells you, ah, you are not a robot.

284
00:33:30,180 --> 00:33:42,580
So, every time you enter a site that shows Google's robot verification, and it gives you a little OK right away so that you can continue, it's because it's looking at where you are moving the mouse, where you are clicking, and assuming,

285
00:33:42,580 --> 00:33:47,920
analyzing the set of these events, that you are not a robot script.

286
00:33:48,120 --> 00:33:57,040
If it thinks that you are not moving the mouse enough, it shows you those things like, ah, identify traffic signs in these images.

287
00:33:57,040 --> 00:33:59,960
Then you go there, you have to keep clicking about four times.

288
00:34:02,180 --> 00:34:04,460
But I discovered one thing.

289
00:34:05,060 --> 00:34:07,060
I'm sorry, it was very interesting.

290
00:34:07,240 --> 00:34:10,140
I'm just opening a parenthesis here for this Google thing.

291
00:34:11,620 --> 00:34:31,320
If you have vision problems, or are blind, Google's Captcha will put you in the old ReCaptcha, because it shows two words, all distorted, but you have the option to hear what is in that word.

292
00:34:31,480 --> 00:34:35,040
Then I heard this and said, ah, but I'm going to research about it.

293
00:34:35,120 --> 00:34:42,180
Because, in my opinion, I think that our hearing is much worse than our vision.

294
00:34:42,180 --> 00:34:50,840
So, you show some things like this, and most human beings can read them, even though the noise level is very high.

295
00:34:50,840 --> 00:34:54,120
But now, what kind of noise are you going to put in audio?

296
00:34:54,120 --> 00:35:03,840
Because I'm talking to you here without any noise, maybe you don't understand a word I've said, and verbal communication is much more difficult than written communication.

297
00:35:03,840 --> 00:35:05,320
This is a fact.

298
00:35:06,400 --> 00:35:27,900
Then I found a guy who wrote an article where he took the audio output of this ReCaptcha, that is, he told Google that he was blind, got the audio, and fed the ReCaptcha audio into a dictation service, I think it could even be from Google,

299
00:35:27,900 --> 00:35:38,240
to transcribe it, to turn the voice into text, and so he just used Google's API against Google's API, and broke ReCaptcha.

300
00:35:39,140 --> 00:35:46,560
But anyway, on the site I went to, the guys don't even care about those who are blind, so you had to see it to break it, there was no such option.

301
00:35:47,140 --> 00:35:57,740
So what I did was leave it as homework, because maybe one day, if I'm a little bored, I'll do this, which must be different.

302
00:35:58,380 --> 00:36:00,240
Because I think that's what he did.

303
00:36:00,240 --> 00:36:03,320
I think this is the site, but I don't remember.

304
00:36:04,420 --> 00:36:09,240
If not, it will be something else cool about ReCaptcha, I put it here for some reason.

305
00:36:12,830 --> 00:36:14,470
And that was it.

306
00:36:14,470 --> 00:36:18,490
I mean, this is my name, my email, you can find me.

307
00:36:20,450 --> 00:36:22,410
I think we have time.

308
00:36:22,610 --> 00:36:23,950
I have ten minutes.

309
00:36:24,350 --> 00:36:32,910
I broke it here in three points, which was what I did later to automate this vote, just in terms of curiosity.

310
00:36:33,190 --> 00:36:37,870
But the idea is, I have the registration part, I have the confirmation part, and I have the vote part.

311
00:36:37,870 --> 00:37:01,250
Once I reached a result, adequate to break the Captchas, then I set up a program to go there, I generate a random name, a random email, I talk to Guerrilla Mail there, I registered an email here, with a password, I go to the site, I try to kill the Captcha,

312
00:37:01,250 --> 00:37:18,170
I take, I don't know, half a dozen images, ten images, it breaks, and then I receive the email, I click on the link in the email to activate that account, then I log in and post the vote.

313
00:37:18,370 --> 00:37:25,950
Of course, my first version took, I don't know, about two minutes, I think, to do all this process.

314
00:37:26,370 --> 00:37:28,950
Then I broke it in...

315
00:37:28,950 --> 00:37:34,910
Actually, I started breaking it in two scripts, one that created the accounts, and another that only voted.

316
00:37:35,290 --> 00:37:53,670
Then I got all this, and I made a thread that only analyzed the images and sent them to Tesseract, updating my frequency table, while another thread only did the permutations, and generating random accounts, and trying to get the password in parallel.

317
00:37:53,990 --> 00:38:07,990
Once he got that password, there was another thread that was just to confirm the email, but it already sent the other thread that was capturing the images and cracking, like, start a new session, start a new session, then I was able to automate it.

318
00:38:07,990 --> 00:38:09,170
Then I created, like, what?

319
00:38:09,170 --> 00:38:11,570
About 20,000, 30,000 accounts.

320
00:38:13,630 --> 00:38:24,010
And then I started voting, and voting, and voting, and voting, and then I left the scores of the others at zero, on average.

321
00:38:24,870 --> 00:38:27,790
But then the thing kind of got out of control.

322
00:38:29,370 --> 00:38:40,030
Like, the guys from the event said, like, the first thing that got out of control, I asked them to vote, and, like, it took about two minutes for them to compute the vote.

323
00:38:40,030 --> 00:38:41,470
Damn, the system crashed.

324
00:38:41,570 --> 00:38:43,190
What the fuck did I do?

325
00:38:44,070 --> 00:38:46,730
Then I came to the conclusion that, no, it can't be.

326
00:38:46,730 --> 00:38:49,650
I had voted this, I don't know, about 10,000, 15,000 times.

327
00:38:50,770 --> 00:38:53,850
Then I kept thinking, like, no, there's something wrong.

328
00:38:53,890 --> 00:38:56,810
Then I thought, well, what do I think the guys are doing?

329
00:38:56,810 --> 00:39:12,110
There must be a table there, or whatever, that must have, like, which talk it is, which username, I don't know, with a foreign key, and what the score is, like, the grade that you gave for that lecture.

330
00:39:12,110 --> 00:39:20,270
And then, when it showed the website, it would select from that table, compute the average, and put it on the screen.

331
00:39:20,270 --> 00:39:27,890
So I think that when I did 10,000, it was kind of rough to go through the whole table, so it took a long time.

332
00:39:27,890 --> 00:39:35,710
Then the guys went on Twitter and said, like, we're going to use our own criteria, it won't be the one with the highest average.

333
00:39:36,550 --> 00:39:37,950
Then I was like, okay, cool.

334
00:39:38,170 --> 00:39:42,170
So, like, I know the guys and stuff, like, who knows, it will work out.

335
00:39:44,470 --> 00:39:48,910
Then, in parallel, they changed the system, from one time to another, it was fast.

336
00:39:49,290 --> 00:39:51,930
Like, I was like, wow, they changed the system.

337
00:39:51,930 --> 00:39:52,910
And they really did.

338
00:39:52,910 --> 00:39:54,190
Like, that's what they did.

339
00:39:54,190 --> 00:39:56,270
Like, they updated it right away.

340
00:39:56,270 --> 00:40:03,430
Keep the number of votes and the sum of all scores, divide one by the other, so you can have the average, right?

341
00:40:03,630 --> 00:40:04,790
And then it got faster.
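A minimal sketch of that change, assuming one counter pair per talk. The class name is hypothetical and the organizers' real schema is not known. Each vote costs constant time, instead of rescanning every stored vote to recompute the average on each page view.

```python
class RunningAverage:
    """Keep only the number of votes and the sum of scores for one talk."""

    def __init__(self):
        self.count = 0
        self.total = 0

    def add(self, score):
        """Record one vote in O(1); no per-vote rows need to be scanned later."""
        self.count += 1
        self.total += score

    @property
    def average(self):
        """Sum divided by count, or None while there are no votes yet."""
        if self.count == 0:
            return None
        return self.total / self.count
```

After adding the scores 5, 3 and 4, `average` is 4.0, and displaying it never touches the individual votes.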

342
00:40:04,790 --> 00:40:08,910
Then I was like, okay, I kept going, I put another 30,000 votes there.

343
00:40:09,470 --> 00:40:11,590
There was a week left until the end.

344
00:40:11,590 --> 00:40:17,230
Then the guys went on Twitter and said, like, now we've changed the criteria, the criteria will be the one with the highest average.

345
00:40:17,250 --> 00:40:18,590
I was like, holy shit.

346
00:40:19,110 --> 00:40:22,970
Because I had put myself between the 5th and the 6th, so as not to draw attention.

347
00:40:25,470 --> 00:40:32,310
But I basically chose who was going to give the lectures, because I said, man, this lecture seems to be cool, so I'm going to give a higher grade to this guy.

348
00:40:32,450 --> 00:40:33,790
And so it went.

349
00:40:34,990 --> 00:40:39,950
Then, after I saw in the whole story that it wasn't going to work, I said, okay, so I'm going to change my...

350
00:40:39,950 --> 00:40:41,190
There were about two days left.

351
00:40:41,190 --> 00:40:43,850
Okay, I'm going to change all the votes I cast.

352
00:40:44,210 --> 00:40:49,470
Then I took it like this, no, okay, so I'm going to make the average of my talk be pi.

353
00:40:49,650 --> 00:40:51,430
3.1415, something like that.

354
00:40:51,430 --> 00:40:55,390
So I kept doing it, changing the grades, to see if it was closer to pi or not.

355
00:40:55,970 --> 00:40:57,810
Until the value converged.
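The talk describes adjusting grades by trial until the average converged. As an alternative sketch, the grades can be computed directly: pick the integer sum closest to the target and spread it over the controlled votes. The 1 to 5 grade scale and all parameter names here are assumptions.

```python
import math


def grades_for_target(n_votes, target, low=1, high=5, other_sum=0, other_count=0):
    """Return n_votes integer grades so the overall mean lands near target.

    other_sum and other_count describe votes that cannot be changed.
    """
    # Sum our own votes must reach so that the overall mean is close to target.
    wanted = round(target * (n_votes + other_count)) - other_sum
    # Stay inside what n_votes grades in [low, high] can actually add up to.
    wanted = max(n_votes * low, min(n_votes * high, wanted))
    base, extra = divmod(wanted, n_votes)
    # 'extra' votes get base + 1, the rest get base; the sum is exactly 'wanted'.
    return [base + 1] * extra + [base] * (n_votes - extra)
```

For example, `grades_for_target(10000, math.pi)` mixes grades of 3 and 4 so that the mean is 3.1416.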

356
00:40:58,110 --> 00:40:59,810
And it was a lot of fun.

357
00:41:00,570 --> 00:41:06,030
Obviously, I got an email saying, oh, you weren't selected, and such.

358
00:41:06,330 --> 00:41:09,330
But there's no time for fooling around, right?

359
00:41:09,330 --> 00:41:18,130
I did this presentation here, they're going to do the same thing again, I doubt they'll change the CAPTCHA thing, and I'm going to make fun of them again.

360
00:41:18,650 --> 00:41:22,430
But the title of my lecture will be How They Made Fun of the Call for Papers.

361
00:41:23,730 --> 00:41:25,970
I don't think they're going to accept it.

362
00:41:26,570 --> 00:41:27,790
But I'm going to try.

363
00:41:29,310 --> 00:41:34,490
Then I put it here, like, you know, to make it complete, in quotes, right?

364
00:41:34,490 --> 00:41:38,370
But how would you protect a CAPTCHA? Ideas?

365
00:41:38,490 --> 00:41:41,210
Oh, I don't have any, I like to break things, I don't like to protect.

366
00:41:41,530 --> 00:41:47,550
But if you have ideas, if you want to share, or if you have questions, that's it.

367
00:41:47,630 --> 00:41:48,390
That's it for now.

368
00:41:57,310 --> 00:41:59,590
Let's open for questions.

369
00:41:59,670 --> 00:42:01,350
I'm going to ask.

370
00:42:02,490 --> 00:42:04,170
First question.

371
00:42:04,270 --> 00:42:06,190
Very cool lecture.

372
00:42:06,290 --> 00:42:08,010
My question is this.

373
00:42:08,090 --> 00:42:12,330
You said that the automation process was a little boring.

374
00:42:12,330 --> 00:42:14,170
What did you use to automate?

375
00:42:14,170 --> 00:42:18,470
Especially that part, I don't know, a Selenium or something like that?

376
00:42:18,530 --> 00:42:20,590
No, I used Python.

377
00:42:21,050 --> 00:42:22,150
Pure Python?

378
00:42:22,150 --> 00:42:25,870
GET, POST, everything by hand.

379
00:42:26,170 --> 00:42:28,630
So here's a tip for you: Selenium.

380
00:42:28,630 --> 00:42:29,930
Ok, I'll take a look.

381
00:42:30,090 --> 00:42:39,630
By the way, in the article I wrote in the magazine, there's a URL where I put all the Python scripts I wrote to do this analysis on the samples.

382
00:42:39,630 --> 00:42:44,530
So, all those images I generated, you can generate with the scripts I left there.

383
00:42:44,530 --> 00:42:47,470
Any questions you have, send me an email.

384
00:42:48,070 --> 00:42:57,090
I think the guerrilla mail part and the automation of the bots is not there, but whoever wants a tip, just send it to me, I'll help.

385
00:42:57,590 --> 00:43:01,250
Good lecture, I liked the research, it was interesting.

386
00:43:01,710 --> 00:43:10,310
Gustavo, maybe you could try to work on an algorithm that would vectorize the letters and give a certain score.

387
00:43:10,510 --> 00:43:29,650
For example, you take a letter without any noise, generate its vectorization, give it a score, and then you try to break all those letters, even dirty ones, try to vectorize what is more solid, and try to get an approximation inside it, because even if it is dancing,

388
00:43:29,650 --> 00:43:42,850
you, by its outline and texture, you would have more or less a number close and you start working with this type of vectorization and score.

389
00:43:42,930 --> 00:43:50,150
So, for example, even if I take two very close letters, but they are different, one would always have a higher score.

390
00:43:50,150 --> 00:43:53,250
And see if this would really work with CAPTCHA.

391
00:43:54,150 --> 00:43:56,630
I think you are right, but I don't know.

392
00:43:56,630 --> 00:44:01,750
I think this is the path of computer vision that I didn't have time to enter.

393
00:44:01,750 --> 00:44:08,050
I think it makes sense what you are saying, but I don't have this knowledge.

394
00:44:08,050 --> 00:44:13,770
I even think I could vectorize these letters, but I don't know what I would do next, you know?

395
00:44:13,770 --> 00:44:17,310
Like, how do I make them all the same?

396
00:44:17,310 --> 00:44:18,890
How do I compare one to the other?

397
00:44:18,890 --> 00:44:25,370
Which, in fact, is the basis of machine learning and computer vision, of having all these things.

398
00:44:25,390 --> 00:44:34,170
Like, I saw several articles that said how to break CAPTCHA when I researched there, ah, you will use TensorFlow.

399
00:44:36,330 --> 00:44:46,290
Because then I think the knowledge base that I would have to have to make this useful to me, it would take a long time, I think.

400
00:44:46,390 --> 00:44:49,550
And then there would be no time for me to vote in the thing.

401
00:44:49,550 --> 00:44:51,970
I think that was the interesting thing, you know?

402
00:44:51,970 --> 00:44:58,770
Because I always liked to understand things in a complete way.

403
00:44:58,770 --> 00:45:00,750
And time was never a factor.

404
00:45:00,750 --> 00:45:10,350
I was never good at capture the flag, because I like to have time to look, like, ah, I'm not good with time.

405
00:45:10,350 --> 00:45:20,530
But this made me take away my perfectionism of trying to have the basic knowledge for later, to solve the problem, to be more practical.

406
00:45:20,530 --> 00:45:22,610
Because I was pissed off with the guys, you know?

407
00:45:23,650 --> 00:45:30,910
I mean, I wasn't pissed off with the organizers, I was pissed off with the guys that had, I don't know, for example, a professor from a college.

408
00:45:31,250 --> 00:45:33,870
How many people does this guy have to vote for him?

409
00:45:33,870 --> 00:45:36,050
It has nothing to do with this thing of information.

410
00:45:36,670 --> 00:45:43,370
So, the moral of this story, I think: if you want to run an online vote, don't do it.

411
00:45:43,370 --> 00:45:49,630
Because I think that, like, specifically in this issue of the Call for Papers, you know?

412
00:45:49,990 --> 00:45:51,870
Like, for other things, yeah.

413
00:45:52,190 --> 00:45:58,610
But I don't know, like, those things like Big Brother and such, that you can vote and such.

414
00:45:58,950 --> 00:46:02,510
Maybe it's a cool thing to take a look to see if...

415
00:46:03,270 --> 00:46:04,930
It would be cool, right?

416
00:46:05,170 --> 00:46:15,290
Find one of those datacenters from Globo and then you get a cheap virtual hosting and a match post.

417
00:46:15,290 --> 00:46:16,850
It would be cool, right?

418
00:46:17,290 --> 00:46:18,870
Any other questions, guys?

419
00:46:20,710 --> 00:46:23,810
Well, as you may have seen, I had a lot of fun.


