1
00:00:11,840 --> 00:00:13,720
Good afternoon, everyone.

2
00:00:15,820 --> 00:00:19,220
My name is Gustavo, a.k.a. CSH.

3
00:00:19,340 --> 00:00:22,080
I'm from the old guard of Brazilian hacking.

4
00:00:25,240 --> 00:00:28,060
This is my second year in H2HC.

5
00:00:28,060 --> 00:00:29,980
I think the first time I came was in 2012.

6
00:00:31,320 --> 00:00:34,200
And now, in these last two years.

7
00:00:35,940 --> 00:00:57,180
Although I usually talk about more technical, low-level things, this year I decided to talk a little about an experience I had last year: taking part in a call for papers that was decided by votes, where people had to vote. I'll get to that in a little while.

8
00:00:57,180 --> 00:01:05,880
But the idea was that I had to automate voting for my own paper, because I don't have that many friends.

9
00:01:09,540 --> 00:01:12,040
It was really fun.

10
00:01:12,040 --> 00:01:35,900
I want to share with you what I learned. I don't really understand what's behind Computer Vision or Optical Character Recognition, but I wanted to do this because I was really pissed off.

11
00:01:38,060 --> 00:01:40,680
So, let's go.

12
00:01:40,680 --> 00:01:41,920
What is CAPTCHA?

13
00:01:42,000 --> 00:01:43,600
This ugly name here.

14
00:01:43,600 --> 00:01:48,660
Completely Automated Public Turing test to tell Computers and Humans Apart.

15
00:01:49,400 --> 00:01:55,900
Basically, it's a test that the site will do, to know if you're really a human, or if you're a robot.

16
00:01:56,220 --> 00:01:57,840
But that's exactly what I wanted.

17
00:01:59,060 --> 00:02:00,820
I wanted to be a robot.

18
00:02:01,260 --> 00:02:04,080
I wanted to cast several votes for myself, obviously.

19
00:02:07,600 --> 00:02:08,660
So, I wrote...

20
00:02:09,520 --> 00:02:19,380
Actually, the story was really funny, because I work at Intel with Rubira, with Gabriel, with Igor; we are the Brazilian mafia at Intel.

21
00:02:20,200 --> 00:02:23,440
The main language is Portuguese.

22
00:02:23,600 --> 00:02:31,700
Marian, who is Austrian, or Tim, who is Russian, we have a few more Chinese on the team.

23
00:02:31,700 --> 00:02:36,120
They are already learning several words in Portuguese, because if they don't, they get really pissed off.

24
00:02:36,880 --> 00:02:39,220
So, it's really fun.

25
00:02:39,220 --> 00:02:39,980
Well, anyway.

26
00:02:39,980 --> 00:02:51,660
Continuing here, so as not to get off track: I decided to write an article for the H2HC magazine, covering all the details of breaking CAPTCHAs.

27
00:02:51,920 --> 00:02:57,860
So, my presentation will be a little around this same subject.

28
00:02:58,300 --> 00:03:06,240
So: how to do this first analysis, how to tell one CAPTCHA from another, and how to break the thing.

29
00:03:06,240 --> 00:03:20,300
So, I decided to create a story here, of the dog and such, but for it to be, like, an analogy, because I didn't want to talk about it directly, I didn't want to expose who organized this call for papers.

30
00:03:21,380 --> 00:03:23,680
So, I wanted to do something more fun.

31
00:03:23,680 --> 00:03:32,220
So, I tried to do a little comedy, and speaking more technically, a little of what I did to break it.

32
00:03:32,220 --> 00:03:40,520
So, just for those who didn't read, to go over here, what I wrote, is the story of Carlos Salo Humberto, who is CSH.

33
00:03:40,540 --> 00:03:42,060
Who discovered this?

34
00:03:42,900 --> 00:03:48,380
So, Carlos is the hero of the gang (the "gurizada"); he is the guy who was wronged.

35
00:03:48,940 --> 00:03:56,700
Salo is Carlos' popular friend, who has a dog, whose name I didn't give, but who is uglier than Humberto, who is Carlos' dog.

36
00:03:57,020 --> 00:04:04,520
And the story I wrote in the magazine is, more or less: Humberto is much prettier than Salo's dog.

37
00:04:04,540 --> 00:04:09,140
But then, the score that Humberto was getting was very low.

38
00:04:09,140 --> 00:04:09,500
Why?

39
00:04:09,500 --> 00:04:21,980
Because Salo posted on Facebook that there was a contest and that they were going to travel to the Bahamas, and his friends went there and not only gave a high score to his dog, which was uglier, but also gave Humberto a zero.

40
00:04:21,980 --> 00:04:23,840
And Humberto went to the last place.

41
00:04:23,960 --> 00:04:26,380
And this made Carlos get pissed off.

42
00:04:26,440 --> 00:04:28,440
So, any resemblance is a mere coincidence.

43
00:04:30,460 --> 00:04:39,780
So, introduction. Whenever I say "session", for those who are not used to the term, I mean the session between the user and the server.

44
00:04:39,780 --> 00:04:50,840
So, everything needed to make a request, all the session cookies, etc., is considered a session. A solution is a correct answer to the CAPTCHA challenge.

45
00:04:50,840 --> 00:04:57,860
So, if some letters appear there, the solution is those letters, which are obvious to us but not so obvious to a computer.

46
00:04:58,160 --> 00:05:01,280
And an image is a generic figure.

47
00:05:03,080 --> 00:05:09,780
So, this is what this event's CAPTCHA looks like.

48
00:05:10,000 --> 00:05:12,800
And above I put an image, just with a word.

49
00:05:13,100 --> 00:05:23,280
Because the type of software that does an analysis of an image and converts it to text is called OCR, Optical Character Recognition.

50
00:05:24,840 --> 00:05:31,460
And let's agree that if you are a computer and you are going to read the first image, it is much easier to decode than the second.

51
00:05:31,460 --> 00:05:46,980
That is precisely why the CAPTCHA adds all this noise, distorts the letters, etc.: to make it hard for you to write a program to automate it or to just feed it into OCR software.

52
00:05:47,420 --> 00:06:06,480
Then I came across this. The steps on the site were: you had to create an account; to create the account, it showed the image and asked for an e-mail; then it sent a link to that e-mail; you had to confirm the e-mail to activate the account; and then you logged in with that account to cast the vote.

53
00:06:06,480 --> 00:06:15,520
So, it's more or less, you are already used to this, not only on voting sites, but any site where you want information, they don't give a damn about it.

54
00:06:16,820 --> 00:06:20,280
So, my challenge was, OK, how am I going to do this?

55
00:06:20,280 --> 00:06:21,900
I'm going to learn Computer Vision.

56
00:06:22,380 --> 00:06:26,460
I even started to take a look, but I said, no, this is too complicated.

57
00:06:29,680 --> 00:06:30,740
Use a ready-made tool.

58
00:06:30,740 --> 00:06:36,320
Actually, I put it in reverse order, because I think the first thing I did was, oh, I'm going to see if anyone has already done this.

59
00:06:36,320 --> 00:06:44,560
Then I found a lot of tutorials and scripts that said they broke and such, and obviously I tested it, and it didn't work the way I wanted.

60
00:06:44,600 --> 00:06:47,680
Then I said, I'm going to have to learn this shit now, to do this.

61
00:06:48,140 --> 00:06:59,060
But I said, oh, I have little time, because it was 30 days, I think, from when we entered the Call for Papers, with all the voting in between, and at the end they would announce the result.

62
00:06:59,620 --> 00:07:02,860
Then I said, well, I don't have time to learn all this, I have to find a way.

63
00:07:03,660 --> 00:07:13,200
Then I saw that some of these sites that had tutorials on breaking CAPTCHAs used Tesseract, which I didn't know until then, but it's a tool from Google that does OCR.

64
00:07:13,920 --> 00:07:15,900
You pass it the image and it gives you the OCR result.

65
00:07:15,900 --> 00:07:22,940
But obviously, passing the first, it will do wonderfully well, the second, not so much.

66
00:07:22,940 --> 00:07:28,300
Then I thought, OK, what do I need to know to tell whether I can break this CAPTCHA?

67
00:07:28,300 --> 00:07:37,100
Because if the guy changes the image every time I make a mistake, or, I don't know, blocks my IP, I don't know.

68
00:07:37,540 --> 00:07:39,000
Then I started to think.

69
00:07:39,460 --> 00:07:46,480
I took the URL of the image, opened a new tab, pasted the URL there, and started pressing F5 like crazy.

70
00:07:47,080 --> 00:07:52,320
Every time I did this, the image would change, it would jump to one side, it would jump to the other.

71
00:07:53,320 --> 00:07:56,180
Some details of the image remained the same.

72
00:07:57,120 --> 00:07:59,580
Then I said, ok, I think I can break it.

73
00:07:59,720 --> 00:08:04,720
I had never done this before, but I was indignant with this story.

74
00:08:05,180 --> 00:08:06,200
My dog.

75
00:08:07,060 --> 00:08:09,640
And then, no, I'm going to do this.

76
00:08:09,900 --> 00:08:25,980
Then I thought, for the session, for the URL, I have to check a lot of things. Like, if I keep pressing F5, will there come a time when it changes this image, or stops giving me this image, or gives me a 404 error, something like that?

77
00:08:25,980 --> 00:08:31,360
Because I wanted to determine how long a time window I have to keep capturing these images.

78
00:08:33,620 --> 00:08:41,520
And then, of course, I tested other sites to have this table to help me understand this.

79
00:08:41,520 --> 00:08:47,580
But basically, what are the criteria for me to determine if a captcha is breakable or not.

80
00:08:47,740 --> 00:08:50,240
So, the more images you have, the better.

81
00:08:50,240 --> 00:08:52,080
I will explain why.

82
00:08:52,500 --> 00:08:54,620
And the question of the session.

83
00:08:54,620 --> 00:09:10,200
So, if there is a session, a unique URL for the image, from which you can get several images, or if you can associate this session with the solution, that is also a good thing.

84
00:09:10,680 --> 00:09:29,840
Another important thing: if I submit a POST with the CAPTCHA solution and that solution is wrong, does the site that is evaluating it change the CAPTCHA? In my case, it was four letters.

85
00:09:30,100 --> 00:09:38,320
If I tried A, B, C, D, and that was wrong, will it change the solution or will it remain the same?

86
00:09:38,320 --> 00:09:38,900
Why is that?

87
00:09:38,900 --> 00:09:54,780
Because if it remains the same, I can keep trying several times until that time window expires, or forever, if it lets me do this indefinitely.

88
00:09:55,240 --> 00:09:59,260
So, this is more or less what I put in my criteria.

89
00:09:59,700 --> 00:10:08,280
It is funny to be giving this talk after watching Edgar and Thais' talk, because now I think that all of this could be the constraints of my solver.

90
00:10:08,740 --> 00:10:16,180
And to know if a CAPTCHA is breakable or not, I could use an SMT solver to tell whether it is possible.

91
00:10:17,340 --> 00:10:27,620
So, here is just an example of some images that you can see, like this, for example, UPSD, this U goes there, then U goes here, you see?

92
00:10:28,740 --> 00:10:29,680
So, what did I do?

93
00:10:29,680 --> 00:10:47,100
In order not to bother the guys by fetching images from my IP all the time, I decided to take a sample: 50 sessions, 50 images from each session. Then I can keep playing with it, creating my algorithms, seeing if I can filter this, and break it.

94
00:10:47,920 --> 00:10:54,540
And it was from there that I wanted a proof that it was possible to break it, before automating it.

95
00:10:54,540 --> 00:11:00,020
For me, I think that the most difficult problem was to break the captcha.

96
00:11:00,020 --> 00:11:14,900
Because then it is this automation, like creating an automatic e-mail, I found out there, Guerrilla Mail, I don't know if anyone knows it, but it is cool that you create a temporary account, while the session is open there, you can go to a website, create an account with that temporary e-mail,

97
00:11:14,900 --> 00:11:26,320
it will send the e-mail, you don't need, there is even an AJAX API, that creates automatic users, and it is really cool.

98
00:11:27,860 --> 00:11:45,900
So, decoding process, I put four steps, one, I will normalize this image, what does normalization mean, I will get into it, I do a pre-filter, and then I separate the letters, and I send each letter to Tesseract, because I found out that, because they keep dancing,

99
00:11:45,900 --> 00:11:53,280
they put the letters in different angles, Tesseract didn't recognize it very well for me.

100
00:11:53,280 --> 00:12:14,740
Maybe there are other options, or configurations of Tesseract that I don't know, because there are dozens, maybe hundreds of options in Tesseract, and I didn't have time, so I had to, I will separate these letters, because I think it is easier.

101
00:12:15,800 --> 00:12:31,840
So, the first thing I did was make a histogram of the image sizes: I take all the images, find out how many pixels wide and how many pixels high each one is, see what the average is, and normalize all the images to the same size,

102
00:12:31,840 --> 00:12:38,500
because I can then overlay them, and maybe there is something that I can see, in this story.
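
The size survey described here can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_sizes` and the sample sizes are made up, and the talk does not say whether the mean or the most common size was used as the target, so both are computed.

```python
from collections import Counter
from statistics import mean


def summarize_sizes(sizes):
    """sizes: list of (width, height) tuples, one per CAPTCHA image."""
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    return {
        # Average size, rounded to whole pixels.
        "mean": (round(mean(widths)), round(mean(heights))),
        # Most common width and height (the tallest histogram bars).
        "mode": (Counter(widths).most_common(1)[0][0],
                 Counter(heights).most_common(1)[0][0]),
        # The histograms themselves: size -> number of images.
        "width_hist": Counter(widths),
        "height_hist": Counter(heights),
    }


# Hypothetical sizes, loosely based on the ~78x58 mentioned in the talk.
sizes = [(78, 58), (80, 58), (76, 56), (78, 60)]
summary = summarize_sizes(sizes)
print(summary["mean"], summary["mode"])
```

Once a target size is chosen, each image would be resized to it with an image library (OpenCV's `cv2.resize` is one option) so that all images can be overlaid pixel by pixel.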

103
00:12:38,500 --> 00:12:52,280
So, although you can't see it, but I think the average here is like 78 pixels in width, by, I think, 58, something like that, in height.

104
00:12:52,280 --> 00:13:03,960
But this one you can see that it is more or less normalized, but this other one, the vast majority, is in that height there, and I think it was the height I used.

105
00:13:04,700 --> 00:13:19,440
And then I realized that this line, I think, this is a trick they do, because the image with the lowest resolution is more difficult to break.

106
00:13:19,440 --> 00:13:35,300
So, the higher the resolution, the more definition the images and the noise have, and the easier it is for you to break the CAPTCHA, because there are several ways to process the images to reduce the noise.

107
00:13:35,820 --> 00:13:42,800
But I found that there was a line here, I think there is another line there, and an arc, that appear in all the images.

108
00:13:42,800 --> 00:14:06,320
So, when I normalized all my 2,500 images, and made a sum of all of them, and divided by the number of images, you can do this, OpenCV, I had to, I did all this stuff in Python, so OpenCV helped me, how to open an image, convert it to a matrix, all these things,

109
00:14:06,320 --> 00:14:09,320
so far so good.

110
00:14:09,720 --> 00:14:21,500
But, only with these images, I was able to extract these things, applying this average of all the images, on top of the images, you can see that the noise has decreased a lot.
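
A minimal sketch of the averaging idea, in plain Python instead of OpenCV so that it runs anywhere. The talk does not say exactly how the average was applied on top of each image; here it is assumed that pixels which are dark in the average (that is, dark in almost every image, like the fixed lines and the arc) get painted as background. The names `mean_image` and `remove_static_noise`, the threshold, and the toy data are all illustrative.

```python
def mean_image(images):
    """Pixel-wise mean of equally sized grayscale images (lists of rows, 0-255)."""
    n = len(images)
    height, width = len(images[0]), len(images[0][0])
    return [[sum(img[y][x] for img in images) / n for x in range(width)]
            for y in range(height)]


def remove_static_noise(image, average, threshold=64, background=255):
    """Paint as background every pixel that is dark in the average image."""
    return [[background if average[y][x] < threshold else pixel
             for x, pixel in enumerate(row)]
            for y, row in enumerate(image)]


# Toy data: row 1 is a fixed dark line present in every image,
# while the single dark "letter" pixel moves from image to image.
images = [
    [[0, 255, 255], [0, 0, 0], [255, 255, 255]],
    [[255, 0, 255], [0, 0, 0], [255, 255, 255]],
    [[255, 255, 0], [0, 0, 0], [255, 255, 255]],
    [[255, 255, 255], [0, 0, 0], [0, 255, 255]],
]
average = mean_image(images)
cleaned = remove_static_noise(images[0], average)
print(cleaned)
```

One limitation of this sketch: wherever a letter crosses one of the fixed lines, those letter pixels are erased too, which may be part of why the result still looks a little strange.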

111
00:14:21,640 --> 00:14:28,140
Removing the peripheral part, around the image, it almost cleaned everything automatically.

112
00:14:28,140 --> 00:14:33,900
But you can see that the thing is still a little strange, it is not very clear.

113
00:14:34,680 --> 00:14:54,140
But then you have some things, I think I used two or three functions of OpenCV, to improve, because it has something that I think is, I forgot the name of the function, but it kind of agglutinates small spaces between, for example, in the C there, that has some flaws,

114
00:14:54,140 --> 00:15:00,820
it will extend that shape, to try to complete it.
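
The function name is not given in the talk. One operation that behaves as described is a morphological closing (a dilation followed by an erosion), which OpenCV exposes through `cv2.morphologyEx` with `cv2.MORPH_CLOSE`. That this is the function used is an assumption. The pure-Python sketch below shows the effect on a tiny binary image, where 1 is ink and 0 is background.

```python
NEIGHBOURHOOD = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _is_ink(grid, y, x):
    """Pixels outside the image count as background."""
    return 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] == 1


def dilate(grid):
    """A pixel becomes ink if it or any of its 8 neighbours is ink."""
    return [[1 if any(_is_ink(grid, y + dy, x + dx) for dy, dx in NEIGHBOURHOOD) else 0
             for x in range(len(grid[0]))]
            for y in range(len(grid))]


def erode(grid):
    """A pixel stays ink only if it and all 8 neighbours are ink."""
    return [[1 if all(_is_ink(grid, y + dy, x + dx) for dy, dx in NEIGHBOURHOOD) else 0
             for x in range(len(grid[0]))]
            for y in range(len(grid))]


def close_gaps(grid):
    """Morphological closing: fills small holes without thickening the stroke."""
    return erode(dilate(grid))


# A horizontal stroke with a one-pixel flaw in the middle.
stroke = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
closed = close_gaps(stroke)
print(closed[2])
```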

115
00:15:00,820 --> 00:15:16,420
And then, once I did that, I think the hardest part was breaking the four letters, I tried a lot of things, like, you can see that there is a certain cloud here, between each letter.

116
00:15:17,580 --> 00:15:28,300
I imagined, maybe if I take only these regions here, extract these regions after applying this pre-filter, extract each one of these, send it to Tesseract and see what happens.

117
00:15:28,560 --> 00:15:29,820
I did that.

118
00:15:29,820 --> 00:15:39,320
And I had a good initial result, I don't know, 20, 30% of the images, I was able to decode the letters.

119
00:15:39,320 --> 00:15:48,440
But a lot of times, because of that dancing of the letters, there was a lot of pieces of the letter out of each one of these regions, and that got in the way.

120
00:15:48,440 --> 00:15:59,700
So I had to use OpenCV again: there is a function that, given this image, tells me how many outlines (contours) it contains.

121
00:15:59,700 --> 00:16:02,200
Then it goes and says, there are so many outlines.

122
00:16:02,280 --> 00:16:13,600
Then, of course, you can see, for example, this T down here: it would tell me that there were two outlines, or outline regions, which are the T itself and a speck of dirt down there.

123
00:16:14,100 --> 00:16:21,200
So, for example, that other H there, from 40 to 40, there is a dot in the middle of the H, and there is also a speck of dirt up there.

124
00:16:21,660 --> 00:16:39,160
So what I did was: I have N outlines, and in my algorithm, if there is an outline inside another outline, I merge the two into one. Then I ask: given the width of each outline, adding all of them up,

125
00:16:39,160 --> 00:16:42,440
does it make sense to have four letters?

126
00:16:42,440 --> 00:16:43,820
Because if there were more...

127
00:16:44,420 --> 00:16:55,660
Then I did something kind of rough, just to separate, when I think that the thing is not very good, I discard that image, and I try to find a new image on the site.
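
The rough separation step can be sketched as follows, working on bounding boxes `(x, y, width, height)` such as the ones OpenCV's `cv2.findContours` plus `cv2.boundingRect` would give. Everything here is an assumption made for illustration: the function names, the `min_area` cutoff for dirt, and the use of a simple box count instead of the width-sum check mentioned in the talk.

```python
def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer`; boxes are (x, y, w, h)."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


def merge_nested(boxes):
    """The union of a box with a box inside it is the outer box, so drop inner ones."""
    boxes = list(dict.fromkeys(boxes))  # remove exact duplicates, keep order
    return [box for i, box in enumerate(boxes)
            if not any(i != j and contains(other, box)
                       for j, other in enumerate(boxes))]


def split_letters(boxes, expected=4, min_area=20):
    """Return the letter boxes from left to right, or None to discard the image."""
    boxes = merge_nested(boxes)
    boxes = [box for box in boxes if box[2] * box[3] >= min_area]  # drop specks of dirt
    if len(boxes) != expected:
        return None  # does not look like four letters: fetch a new image instead
    return sorted(boxes)  # tuples sort by x first, i.e. left to right


# Hypothetical boxes: four letters, a dot inside the first letter, a speck of dirt.
boxes = [(2, 5, 14, 30), (6, 15, 3, 3), (20, 6, 15, 28),
         (38, 4, 14, 31), (55, 5, 16, 30), (40, 40, 2, 2)]
letters = split_letters(boxes)
print(letters)
```

Discarding is cheap in this setting because, as described in the talk, a fresh image for the same session is always one request away.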

128
00:16:56,280 --> 00:17:06,160
But this was my proof of concept, I am doing this on top of my 2,500 images, to know if my algorithm will work, before I lose my time automating that.

129
00:17:07,580 --> 00:17:08,060
And...

130
00:17:08,060 --> 00:17:17,380
I think this is where I spent most of my time, trying to do things, just writing this little Python script to make the letters pretty, and that took me some time.

131
00:17:18,320 --> 00:17:24,280
But it helped me to have a general vision, like, I think the algorithm is good, I think it is not.

132
00:17:25,440 --> 00:17:30,340
And then I did a statistical analysis, of each of the letters.

133
00:17:30,340 --> 00:17:38,240
So, for example, the first set of images, I apologize to those of you who cannot see well, but I will try to explain more or less.

134
00:17:38,980 --> 00:17:52,360
But the first set of images, the first letter, sorry, the second letter, which is this first image here, the solution is R.

135
00:17:52,400 --> 00:17:57,980
But Tesseract, of the 15 images that it decoded...

136
00:17:58,960 --> 00:18:00,540
Wait, let me do...

137
00:18:00,540 --> 00:18:02,360
No, I am not on my computer.

138
00:18:02,360 --> 00:18:03,140
Damn.

139
00:18:03,220 --> 00:18:05,200
This is a Google Spreadsheet.

140
00:18:05,200 --> 00:18:10,260
You can send me an e-mail, I can share this spreadsheet with you, you can look at the numbers.

141
00:18:10,260 --> 00:18:12,300
But the idea is...

142
00:18:12,300 --> 00:18:16,540
Tesseract does not always give me the correct answer.

143
00:18:16,540 --> 00:18:29,580
It told me twice that it was a C, four times that it was a D, once that it was an O, once that it was a P, once that it was a Q, and six times that it was an R.

144
00:18:29,580 --> 00:18:31,420
The right answer was an R.
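
Those counts can be tallied with a `Counter` to see how weak the signal from a single position is. The numbers are the ones quoted in the talk; the variable names are illustrative.

```python
from collections import Counter

# Tesseract readings for one letter position across 15 decoded images.
readings = Counter({"C": 2, "D": 4, "O": 1, "P": 1, "Q": 1, "R": 6})

total = sum(readings.values())
best_letter, best_count = readings.most_common(1)[0]
share = best_count / total

print(f"{best_letter}: {best_count}/{total} = {share:.0%}")
```

The correct letter wins, but with only 6 of 15 readings, which is why a single image is not enough.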

145
00:18:31,660 --> 00:18:36,360
But as you can see, it is not very precise.

146
00:18:36,380 --> 00:18:37,820
It is really rough.

147
00:18:39,560 --> 00:18:40,040
So...

148
00:18:42,300 --> 00:18:46,880
My message to you is that image decoding is an analog process.

149
00:18:47,940 --> 00:18:48,980
I do not know if...

150
00:18:48,980 --> 00:18:58,440
I was talking to someone on Friday who told me that there is a tool that breaks CAPTCHAs, that applies filters automatically and just breaks them.

151
00:18:58,440 --> 00:19:02,720
I was interested, because I did not find this tool when I researched it.

152
00:19:02,720 --> 00:19:07,460
So, maybe there are more automated ways, but...

153
00:19:07,460 --> 00:19:14,760
I went through this to try to understand and make a shortcut in this process.

154
00:19:15,480 --> 00:19:16,000
And...

155
00:19:16,000 --> 00:19:18,300
And this is what I wanted to share.

156
00:19:19,440 --> 00:19:19,960
So...

157
00:19:19,960 --> 00:19:23,940
My conclusion was that image decoding is an analog process.

158
00:19:23,940 --> 00:19:30,980
In my case, I have a function that takes an image as a parameter and returns four letters.

159
00:19:31,280 --> 00:19:37,300
It may be that Tesseract cannot interpret a given letter, or it may be that the letter it returns is not the correct answer.

160
00:19:39,160 --> 00:19:39,680
So...

161
00:19:39,680 --> 00:19:41,660
There is a high error rate.

162
00:19:42,900 --> 00:19:43,800
After I made the presentation...

163
00:19:44,660 --> 00:19:49,660
For the first time in my life, I finished preparing a presentation half an hour before giving it.

164
00:19:49,820 --> 00:19:53,380
So I'll have to remember the lies I'm going to tell you.

165
00:19:53,560 --> 00:19:55,860
So sometimes I'll get stuck, but...

166
00:19:55,860 --> 00:19:56,540
Let's go.

167
00:19:56,840 --> 00:19:59,080
So, it's a statistical process.

168
00:19:59,200 --> 00:20:03,840
If the number of images is infinite, it is certain that you will solve the CAPTCHA.

169
00:20:03,840 --> 00:20:06,580
This is kind of obvious.

170
00:20:06,880 --> 00:20:15,200
If the number of attempts is infinite, it is also kind of obvious that you will be able to crack, but you can also use brute force.

171
00:20:15,580 --> 00:20:24,680
So, in my case, there are 26 letters in the alphabet, to the fourth power: that is the number of combinations for the CAPTCHA.
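
As a quick worked example of that number, assuming only the 26 letters A to Z are used and upper and lower case are not distinguished:

```python
ALPHABET_SIZE = 26   # letters A-Z (assumed case-insensitive)
CAPTCHA_LENGTH = 4   # four letters per CAPTCHA

combinations = ALPHABET_SIZE ** CAPTCHA_LENGTH

# If the solution stays fixed and no guess is repeated, a blind brute force
# needs on average (N + 1) / 2 attempts.
average_attempts = (combinations + 1) / 2

print(combinations)       # 456976
print(average_attempts)   # 228488.5
```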

172
00:20:24,980 --> 00:20:34,940
So if there is a small latency in the network, with the network of the contest, it trolls the guys, and then it breaks.

173
00:20:36,980 --> 00:20:37,420
And...

174
00:20:38,180 --> 00:20:45,940
Here I made a histogram with the four letters and...

175
00:20:47,000 --> 00:20:58,140
the Y axis tells me how many images it was able to decode in each of my image sets, of which there are 25.

176
00:20:58,140 --> 00:21:02,440
So there are 25 sets of four bars there.

177
00:21:02,520 --> 00:21:11,280
So you can see, for example, the fourth letter, there are some places where it was able to decode very few letters.

178
00:21:12,600 --> 00:21:13,200
And...

179
00:21:13,200 --> 00:21:23,820
But I was calm, because I thought, it is obvious that if I can map this number of things, I can guess what the CAPTCHA is.

180
00:21:23,820 --> 00:21:28,160
And this is, I think, the great moral of the story.

181
00:21:28,200 --> 00:21:39,660
I will try to go through my algorithm a little. Those with more statistical knowledge will understand it right away; for those without, I will try to be a little didactic.

182
00:21:39,700 --> 00:21:43,660
But the idea is the following, let's say that my initial state is the first line there.

183
00:21:43,800 --> 00:21:53,760
In the first line I have no letter that I was able to decode and no letter from that set of five letters or...

184
00:21:53,760 --> 00:21:57,160
that set of letters that I have, of possibilities.

185
00:21:58,120 --> 00:22:10,280
So, when my script processes the first image, Tesseract returns three letters, T, R, C, and it couldn't decode the fourth letter.

186
00:22:10,280 --> 00:22:22,780
So I will enter my frequency table that will have there, well, I have one T in the first, one R in the second, one C in the third, and I have one unknown in the fourth letter.

187
00:22:23,080 --> 00:22:29,200
Then the question is, could I make some attempt to decode?

188
00:22:29,240 --> 00:22:32,580
No, because I don't have enough information, right?

189
00:22:32,840 --> 00:22:42,040
So I go on, until, for example, in the third image a K appears in the fourth letter.

190
00:22:42,040 --> 00:22:49,460
So I have here, for example, the T: I analyzed three images and Tesseract returned the letter T three times.

191
00:22:49,780 --> 00:22:56,220
I said, wow, this gives me hope that the first letter is the T.

192
00:22:56,220 --> 00:23:02,860
In the second letter it gave me two R's and one unknown.

193
00:23:02,860 --> 00:23:09,800
So I have a 66% chance that the second letter is R.

194
00:23:09,800 --> 00:23:15,920
My other 33%, or my other third, tells me that I have no way to evaluate.

195
00:23:15,960 --> 00:23:25,180
In the third letter, the same thing: I have a 66% chance that it is a C and a 33% chance that it is a G.

196
00:23:25,360 --> 00:23:38,460
In the fourth letter, I have a 66.6%, or two-thirds, chance that it is unknown and a one-third chance that it is the letter K.

197
00:23:39,320 --> 00:23:45,420
So I made the algorithm that thought, well, what is the most obvious thing to try?

198
00:23:45,420 --> 00:23:50,860
I take the letters with the highest frequency in each of the positions and this is the one I will try.
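
The frequency table and the "most frequent letter per position" guess can be sketched like this. The three decoded results are reconstructed from the example in the talk (the exact contents of the second image are an assumption), `None` stands for a letter Tesseract could not decode, and the function names are illustrative.

```python
from collections import Counter

UNKNOWN = None  # marks a letter that could not be decoded


def update_table(table, decoded):
    """Add one decoded image (a 4-tuple of letters or UNKNOWN) to the table."""
    for position, letter in enumerate(decoded):
        table[position][letter] += 1


def most_frequent_guess(table):
    """Most frequent known letter per position, or None if a position has none yet."""
    guess = []
    for counts in table:
        known = {letter: n for letter, n in counts.items() if letter is not UNKNOWN}
        if not known:
            return None  # not enough information to attempt a solution
        guess.append(max(known, key=known.get))
    return "".join(guess)


table = [Counter() for _ in range(4)]
for decoded in [("T", "R", "C", UNKNOWN),
                ("T", UNKNOWN, "G", UNKNOWN),
                ("T", "R", "C", "K")]:
    update_table(table, decoded)
    print(most_frequent_guess(table))
```

With this input the script has nothing to try after the first two images and proposes TRCK after the third, matching the walk-through.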

199
00:23:52,320 --> 00:23:58,640
And if it doesn't work, well, if it doesn't work, I will start doing a Monte Carlo simulation.

200
00:23:58,640 --> 00:24:05,540
I will take a die, start rolling it, and try to find out what the letters are.

201
00:24:06,200 --> 00:24:15,200
And this is what I did, but notice that in the third image my script will try to guess the Captcha.

202
00:24:15,200 --> 00:24:19,420
It will try to say that it is T, R, C, K.

203
00:24:20,100 --> 00:24:24,360
And the site will reject it, saying, no, this Captcha is not valid.

204
00:24:24,360 --> 00:24:25,360
So what do I do?

205
00:24:25,360 --> 00:24:31,160
I take this attempt and throw it in a table of attempts that didn't work.

206
00:24:33,160 --> 00:24:38,120
I will only have another letter in the fourth position, here.

207
00:24:38,240 --> 00:24:40,500
At this point.

208
00:24:40,760 --> 00:24:43,240
But at this point my whole scenario changed.

209
00:24:44,040 --> 00:24:45,420
I will have...

210
00:24:47,040 --> 00:24:48,860
Sorry, let me go back a little bit.

211
00:24:50,020 --> 00:24:54,120
Since I have the K in the fourth position, I will have...

212
00:24:55,180 --> 00:24:55,620
But...

213
00:24:55,620 --> 00:24:57,820
In the third image I only have...

214
00:24:59,780 --> 00:25:01,800
Actually, it even makes a permutation.

215
00:25:01,980 --> 00:25:07,460
Because I have the possibility of being T, R, G, K.

216
00:25:07,720 --> 00:25:10,500
Because in the second image it said it could be a G.

217
00:25:11,100 --> 00:25:11,800
Is that it?

218
00:25:11,800 --> 00:25:12,800
That's it.

219
00:25:13,000 --> 00:25:19,400
So my script will draw randomly, based on the weights.

220
00:25:19,400 --> 00:25:30,500
So it will take, for example: for the first letter I have three T's; for the second, two R's; for the third, two C's and a G.

221
00:25:30,500 --> 00:25:32,530
So it will make in each group

222
00:25:36,000 --> 00:25:38,120
a random sample with weights.

223
00:25:38,140 --> 00:25:39,580
And it will try that other combination.

224
00:25:39,580 --> 00:25:45,320
So it could try in the fourth image T, R, G, K.

225
00:25:45,320 --> 00:25:46,300
That would also fail.

226
00:25:46,300 --> 00:25:53,780
It takes this guess and puts it in the table of attempts that were not successful.

227
00:25:53,780 --> 00:25:54,680
And so it goes.

228
00:25:54,680 --> 00:26:02,800
So, for example, in the fifth line it introduces an O, then it introduces an I and an F.

229
00:26:02,800 --> 00:26:04,340
And it will be in parallel.

230
00:26:04,340 --> 00:26:15,860
So, while it is capturing the images and updating the frequency table, the algorithm that tries to generate combinations of the answer will be running in parallel.

231
00:26:16,300 --> 00:26:27,400
But it will always try what I did, it will always try the one with the highest frequency before trying to make this random sample.
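
A minimal sketch of that guessing strategy: try the most frequent combination first, then fall back to weighted random draws, always skipping guesses already known to fail. It is sequential rather than running in parallel with the image capture, and the function name `next_guess` and the `max_draws` limit are assumptions for illustration.

```python
import random
from collections import Counter


def next_guess(table, tried, rng=random, max_draws=1000):
    """Return an untried 4-letter guess, or None if none can be produced."""
    # Keep only real letters; None marks an undecodable letter.
    pools = [{letter: n for letter, n in counts.items() if letter is not None}
             for counts in table]
    if any(not pool for pool in pools):
        return None  # some position has no candidate letter yet

    # 1) Always try the highest-frequency combination first.
    best = "".join(max(pool, key=pool.get) for pool in pools)
    if best not in tried:
        return best

    # 2) Otherwise draw each position at random, weighted by frequency.
    for _ in range(max_draws):
        guess = "".join(
            rng.choices(list(pool), weights=list(pool.values()))[0]
            for pool in pools)
        if guess not in tried:
            return guess
    return None  # every drawn combination was already rejected


# Frequency table after three images, as in the walk-through.
table = [Counter({"T": 3}),
         Counter({"R": 2, None: 1}),
         Counter({"C": 2, "G": 1}),
         Counter({"K": 1, None: 2})]

tried = set()
first = next_guess(table, tried)
tried.add(first)                      # the site rejected it
second = next_guess(table, tried, rng=random.Random(0))
tried.add(second)                     # rejected again
third = next_guess(table, tried, rng=random.Random(0))
print(first, second, third)
```

With this table only two combinations exist, so the sketch yields TRCK, then TRGK, and then gives up until new images add more letters to the table.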

232
00:26:27,940 --> 00:26:48,380
And when I did this, I tested it several times while improving my filter algorithm, and out of my 50 samples I managed to break 48.

233
00:26:48,380 --> 00:26:52,400
Just doing these tricks.

234
00:26:52,400 --> 00:27:00,440
Some took 400 or 500 attempts to get the right answer, but they got there.

235
00:27:02,360 --> 00:27:03,440
So...

236
00:27:03,440 --> 00:27:06,400
I put here the results.

237
00:27:06,400 --> 00:27:14,660
For the first letter, it managed to decode something in 57% of the images, whether it was the right letter or not.

238
00:27:14,920 --> 00:27:24,340
Like, you can see an A but it will tell you that it is, I don't know, a symbol of truth or something like that, I don't know, upside down, I don't know.

239
00:27:24,340 --> 00:27:26,340
It gives a crazy code.

240
00:27:27,600 --> 00:27:29,460
And then I assume that it is null, right?

241
00:27:29,460 --> 00:27:36,180
Anything that is not between A and Z, I consider trash.
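
That rule can be written as a small filter over the raw OCR output. The name `clean_letter` is illustrative, and folding lower case to upper case is an assumption; the talk only says that anything outside A to Z is discarded.

```python
import string


def clean_letter(raw):
    """Return a single letter A-Z, or None if the OCR output is trash."""
    text = raw.strip().upper()
    if len(text) == 1 and text in string.ascii_uppercase:
        return text
    return None  # empty, several characters, or a symbol outside A-Z


print(clean_letter("a\n"))   # A
print(clean_letter("?"))     # None
```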

242
00:27:36,620 --> 00:27:44,100
But, incredibly, my process of filter and letter separation gave me all these results.

243
00:27:44,760 --> 00:27:55,440
And then, I managed to decode the solution, like, the first letter 100%, the second letter 98, because two cases failed, I think, the third letter the same thing, the fourth letter 100%.

244
00:27:56,160 --> 00:27:58,360
And this is interesting.

245
00:27:58,660 --> 00:28:14,460
Next: is the solution the most frequently decoded letter? That is, from A to Z, let's say I have decoded 50 images and I have 40 Zs; is this Z the solution?

246
00:28:15,080 --> 00:28:17,940
In 76% of the cases, yes.

247
00:28:19,000 --> 00:28:21,860
The second letter, in 70% of the cases, no.

248
00:28:21,880 --> 00:28:24,360
That is, I...

249
00:28:24,360 --> 00:28:36,340
Actually, the process I followed got me to this point, and then I decided, no, I will have to go back and do that weighted random sample to improve.

250
00:28:36,400 --> 00:28:38,120
Because this made...

251
00:28:38,120 --> 00:28:48,500
that algorithm I explained to you before, the weighted random sample, raised my success rate for the first letter from 76% of the cases to 100%.

252
00:28:48,500 --> 00:28:51,460
Because then it is not always focused on what has the highest frequency.

253
00:28:51,460 --> 00:28:57,500
It tries to get all the letters it managed to decode.

254
00:28:59,280 --> 00:29:24,380
So, in summary, of this decoding process, having a sample, doing a way of automating, like I did, 50 sessions, 50 images each, I can work without alerting the organization of the contest, because I can work this locally, manipulate the images, make the filters,

255
00:29:24,380 --> 00:29:26,760
I don't need to automate all this stuff now.

256
00:29:27,080 --> 00:29:30,360
It gives me a good notion of knowing, well, I think this is going to work.

257
00:29:30,840 --> 00:29:36,080
And when I got to this point, I said, no, it's cool.

258
00:29:36,340 --> 00:29:39,460
So I can start automating.

259
00:29:39,460 --> 00:29:47,680
I made some quizzes here for you, because the idea I want to get across is that one CAPTCHA may be different from another.

260
00:29:48,200 --> 00:29:49,100
Right?

261
00:29:49,380 --> 00:29:56,920
And then I took a government website, because I knew there was online voting, and then I did what I...

262
00:29:56,920 --> 00:30:08,880
I took the URL of the image, opened another tab, and pressed refresh seven times. Although you may not notice it, first, it's the same CAPTCHA.

263
00:30:09,180 --> 00:30:11,060
Second, the letters don't dance.

264
00:30:11,060 --> 00:30:13,000
So it seems to be easier to break.

265
00:30:13,360 --> 00:30:16,740
The only thing that changes is the noise they add.

266
00:30:16,740 --> 00:30:21,580
So, for example, here you have some dots here, which are not here, here there are others, and so on.

267
00:30:21,580 --> 00:30:24,700
Then I thought, ah, this is very easy to break.

268
00:30:24,960 --> 00:30:36,400
You apply a simple filter and pass it to Tesseract. I even believe that if you pass this directly to Tesseract, it will decode it as is.
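
Since the letters stay put and only the noise changes between refreshes, one possible "simple filter" is a per-pixel majority vote over several downloads. The speaker does not say which filter he had in mind, so this is an assumption; the images are toy binary grids (1 = dark pixel) and `majority_vote` is a hypothetical helper.

```python
def majority_vote(images):
    """Per-pixel majority vote over same-sized binary images (1 = dark pixel)."""
    height, width = len(images[0]), len(images[0][0])
    threshold = len(images) / 2
    return [
        [1 if sum(img[y][x] for img in images) > threshold else 0
         for x in range(width)]
        for y in range(height)
    ]


# Three hypothetical downloads of the same glyph, each with different noise dots.
a = [[1, 0, 0], [1, 0, 1], [1, 1, 0]]
b = [[1, 0, 1], [1, 0, 0], [1, 1, 0]]
c = [[1, 1, 0], [1, 0, 0], [1, 1, 0]]
clean = majority_vote([a, b, c])
```

Pixels that are dark in most downloads survive, while a noise dot that appears in only one download is voted out.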

269
00:30:37,240 --> 00:30:39,280
And then you already have the CAPTCHA answer.

270
00:30:41,460 --> 00:30:42,020
Like...

271
00:30:42,020 --> 00:30:43,680
Very easy.

272
00:30:44,020 --> 00:30:46,100
This is another CAPTCHA.

273
00:30:46,100 --> 00:30:48,660
This one is cool, this kind of CAPTCHA.

274
00:30:48,660 --> 00:30:56,580
Because it's a simple equation, like seven times four, nine minus one, four times one, blah, blah, blah.

275
00:30:56,580 --> 00:30:58,460
You have to give the right answer.

276
00:30:59,440 --> 00:31:02,720
If you miss the answer, it gives you another equation.

277
00:31:03,680 --> 00:31:18,360
So, like, it's good to have an image-analysis algorithm that applies all the necessary filters, so that you can decode that into text and then solve it.

278
00:31:19,460 --> 00:31:26,100
It would be even cool to do something like an eval or something like that.

279
00:31:26,100 --> 00:31:28,120
It would also be cool to...

280
00:31:30,020 --> 00:31:32,080
This is what I'm thinking about right now.

281
00:31:32,420 --> 00:31:42,780
But if you wanted to break the breaker, and it's doing an eval, it would be cool to put an exploit here.

282
00:31:42,840 --> 00:31:48,820
Then Tesseract decodes it, puts it in the eval, and you open a shell on the cracker.

283
00:31:48,860 --> 00:31:50,300
That would be cool.
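
The eval idea and the risk joked about above can be made concrete. A minimal sketch of solving such an equation without `eval`, assuming the OCR output is digits with a single operator (the real site's format is not shown in the talk, and `solve` is a hypothetical name):

```python
import operator
import re

# Only these operators are accepted; anything else is rejected.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "x": operator.mul,
}

EQUATION = re.compile(r"\s*(\d+)\s*([+\-*x])\s*(\d+)\s*=?\s*")


def solve(ocr_text):
    """Solve 'a op b' from OCR output without ever calling eval()."""
    match = EQUATION.fullmatch(ocr_text.lower())
    if match is None:
        raise ValueError(f"not a simple equation: {ocr_text!r}")
    left, op, right = match.groups()
    return OPERATORS[op](int(left), int(right))


print(solve("7 x 4"))
print(solve("9 - 1"))
```

Because the input has to match the whole pattern, text crafted to run code in the solver is rejected instead of executed.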

284
00:31:53,480 --> 00:31:58,520
This was another one, captchas.net, another site I saw.

285
00:31:59,320 --> 00:32:08,860
And, although you can't see it, it bends the letters a little bit, but they stay at the same angle.

286
00:32:09,100 --> 00:32:16,220
So, it generated these random letters, it generated, more or less, the angle of each letter and it stayed the same.

287
00:32:16,220 --> 00:32:20,980
The only difference is that it added a lot of noise randomly in the background.

288
00:32:21,220 --> 00:32:31,260
So, again, with basic image manipulation you can remove all this noise and send it to Tesseract to break it.
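
As a rough idea of what "basic image manipulation" could mean here, the sketch below clears dark pixels that have almost no dark neighbors. It is a toy on binary grids, not the speaker's filter; a real pipeline would use an imaging library, and `remove_isolated_pixels` and its threshold are assumptions.

```python
def remove_isolated_pixels(image, min_neighbors=2):
    """Clear dark pixels (1) that have fewer than min_neighbors dark neighbors."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            # Count dark pixels in the 3x3 window around (y, x), excluding itself.
            neighbors = sum(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors < min_neighbors:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical glyph fragment (the 2x2 block) with two stray noise dots.
noisy = [
    [0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
]
cleaned = remove_isolated_pixels(noisy)
```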

289
00:32:33,580 --> 00:32:53,000
The interesting thing about this is that this site that gives you this solution, which I think is in PHP, I don't remember now, gives you the same image with the same answer several times.

290
00:32:53,000 --> 00:32:54,920
So, for me, this is a flaw.

291
00:32:55,180 --> 00:33:02,240
So, it should be more dynamic: on an error, it should generate a totally new thing.

292
00:33:02,260 --> 00:33:07,900
Like Google's reCAPTCHA, which shows some sentences.

293
00:33:07,900 --> 00:33:12,820
Actually, Google's reCAPTCHA is a separate chapter.

294
00:33:12,820 --> 00:33:15,620
Maybe I could do an entire presentation about it.

295
00:33:15,740 --> 00:33:18,680
I looked over it and found out several things.

296
00:33:18,840 --> 00:33:27,880
One, Google's APIs keep looking at where you are moving the mouse, where you are clicking, and so on.

297
00:33:27,880 --> 00:33:30,000
And then it tells you, ah, you are not a robot.

298
00:33:30,160 --> 00:33:45,180
So, every time you enter a site that says, Google's robot verification, and it gives you a little OK sign, that you can continue, it is because it is looking at where you are moving your mouse, where you are clicking, and assuming that, analyzing the set of these events,

299
00:33:45,180 --> 00:33:47,900
that you are not a robot script.

300
00:33:48,060 --> 00:33:57,040
If it thinks that you are not moving the mouse enough, it shows you those things, like, ah, identify traffic signs in these images.

301
00:33:57,040 --> 00:33:59,940
Then you go there and have to keep clicking about four times.

302
00:34:01,480 --> 00:34:01,980
And...

303
00:34:01,980 --> 00:34:04,420
But I discovered something.

304
00:34:04,420 --> 00:34:07,060
Sorry, this was very interesting.

305
00:34:07,120 --> 00:34:10,180
I am just opening a parenthesis for this Google thing.

306
00:34:11,600 --> 00:34:16,420
If you have vision problems, or are blind, or whatever, there is...

307
00:34:18,260 --> 00:34:31,280
Google's CAPTCHA will put you in the old reCAPTCHA, the one that shows two words, all distorted, but you have the option to hear what is in those words.

308
00:34:31,580 --> 00:34:34,920
Then I heard this and said, ha, but I will research about this.

309
00:34:35,120 --> 00:34:42,120
Because, in my opinion, I think that our hearing is much worse than our vision.

310
00:34:42,180 --> 00:34:50,840
So, when it shows something like this, most human beings can still read it, even though the noise level is very high.

311
00:34:50,840 --> 00:34:54,280
But now, what kind of noise can we put in audio?

312
00:34:54,280 --> 00:35:03,860
Because I am talking to you here without any noise, and maybe you still didn't understand some word I said. So verbal communication is much more difficult than written communication.

313
00:35:03,860 --> 00:35:05,480
This is a fact.

314
00:35:06,240 --> 00:35:27,920
Then I found a guy who wrote an article where he took the output of this reCAPTCHA. That is, he told Google that he was blind, it showed this, and he played the reCAPTCHA audio to a dictation service like this one, I think it could even be from Google,

315
00:35:27,920 --> 00:35:32,280
to transcribe it, to do the speech-to-text.

316
00:35:32,320 --> 00:35:38,180
And then he just used Google's API, and broke the reCAPTCHA.

317
00:35:38,180 --> 00:35:38,680
It was like...

318
00:35:39,180 --> 00:35:45,160
But anyway, on the site I went to, the guys don't even care about who is blind, so you had to be able to see to break it.

319
00:35:45,160 --> 00:35:46,580
There was no such option.

320
00:35:47,100 --> 00:35:57,780
Which is exactly why I put it here as homework, because maybe one day, if I am a little bored, I will do this, which must be different.

321
00:35:58,380 --> 00:36:00,260
I think that was what he did.

322
00:36:00,260 --> 00:36:03,360
I think this is the site, but I don't remember.

323
00:36:04,440 --> 00:36:07,060
Otherwise, it will be something else cool about reCAPTCHA.

324
00:36:07,060 --> 00:36:09,280
I put it here for some reason.

325
00:36:13,010 --> 00:36:14,490
And that was it.

326
00:36:14,490 --> 00:36:18,410
I mean, this is my name, my email, you can find me.

327
00:36:20,570 --> 00:36:22,310
I think we have time.

328
00:36:22,670 --> 00:36:23,830
I have 10 minutes.

329
00:36:23,830 --> 00:36:32,910
I broke it here in three points, which is what I did later to automate this vote, just out of curiosity.

330
00:36:32,930 --> 00:36:37,830
But the idea is, I have the registration part, the confirmation part, and the vote part.

331
00:36:37,830 --> 00:37:01,230
Once I achieved an adequate result breaking the CAPTCHAs, I set up a program to go there, generate a random name and a random email, talk to Guerrilla Mail, register an email there with a password, go to the site, try to kill the CAPTCHA,

332
00:37:01,230 --> 00:37:18,330
take, I don't know, half a dozen images, 10 images, break it, and then receive the email, click on the link in the email to activate that account, then log in and post the vote.

333
00:37:18,350 --> 00:37:25,890
Of course, my first version took I don't know, about two minutes, I think, to do all this process.

334
00:37:26,390 --> 00:37:29,070
Then I broke it in three...

335
00:37:29,070 --> 00:37:34,850
Actually, I started breaking it in two scripts, one that created the accounts and the other that only voted.

336
00:37:35,270 --> 00:37:53,670
Then I took all of this and made a thread that only kept analyzing the images and sending them to Tesseract, updating my list, my frequency table, while another thread only kept doing the permutations, generating random accounts and trying to get the password in parallel.

337
00:37:53,890 --> 00:38:07,990
Once it got that password right, there was another thread that was only there to confirm the email, but it already signaled the other thread that was capturing the images and cracking them, like, a new session starts, a new session starts, and I managed to automate it.

338
00:38:07,990 --> 00:38:11,610
Then I created, like, what, 20,000, 30,000 accounts?

339
00:38:13,550 --> 00:38:19,070
And then I started voting, and voting, and voting, and voting, and...

340
00:38:19,590 --> 00:38:24,010
And then I left the others' scores at zero, on average.

341
00:38:24,870 --> 00:38:27,790
But then things kind of went out of control.

342
00:38:29,370 --> 00:38:40,050
Like, the guys from the event said, like... oh, the first thing that went out of control was that when I cast a vote, it took about two minutes for it to compute the vote.

343
00:38:40,050 --> 00:38:41,470
Damn, the system sank.

344
00:38:41,470 --> 00:38:43,330
What the fuck did I do?

345
00:38:43,330 --> 00:38:46,790
Then I came to the conclusion that no, it can't be.

346
00:38:46,790 --> 00:38:49,690
I had voted, I don't know, about 10,000, 15,000 times.

347
00:38:50,730 --> 00:38:53,890
Then I kept thinking, like, no, there's something wrong.

348
00:38:53,890 --> 00:38:56,810
Then I thought, well, what do I think the guys are doing?

349
00:38:56,810 --> 00:39:12,110
There must be a table there, or whatever, that must have, like, which talk it is, which username, I don't know, with a foreign key, and the score that you gave to that lecture.

350
00:39:12,110 --> 00:39:20,290
And then, when it rendered the website, it would select from that table, compute the average, and put it on the screen.

351
00:39:20,310 --> 00:39:26,450
So I think that when I did 10,000, it was kind of rough to keep going through the whole table.

352
00:39:26,450 --> 00:39:27,790
So it took a long time.

353
00:39:28,170 --> 00:39:35,710
Then the guys went to Twitter and said, like, ah, we're going to use our own criteria, it won't be the one with the biggest number.

354
00:39:36,530 --> 00:39:37,950
Then I was like, okay, fine.

355
00:39:38,150 --> 00:39:41,910
So, like, well, I know the guys, maybe it will work.

356
00:39:44,490 --> 00:39:47,050
Then, at the same time, they changed the system.

357
00:39:47,050 --> 00:39:48,910
From one moment to the next, it was fast.

358
00:39:49,310 --> 00:39:51,910
I was like, wow, they changed the system.

359
00:39:51,910 --> 00:39:52,970
And they really did.

360
00:39:53,010 --> 00:39:56,250
I knew they were going to update it right away.

361
00:39:56,290 --> 00:40:03,250
They keep the number of votes and the sum of all the scores, and divide one by the other to get the average.

362
00:40:03,370 --> 00:40:04,770
And then it was faster.
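
The two designs guessed at above can be compared in a few lines. This is a sketch of the speaker's guess, not the site's actual code; `TalkScore`, `average_by_scan`, and the row layout are hypothetical.

```python
class TalkScore:
    """Keeps a running vote count and score sum for one talk."""

    def __init__(self):
        self.votes = 0
        self.total = 0

    def add_vote(self, score):
        self.votes += 1
        self.total += score

    def average(self):
        # Constant time, no matter how many votes were cast.
        return self.total / self.votes if self.votes else 0.0


def average_by_scan(rows, talk):
    """The slow variant: walk every vote row on each page view."""
    scores = [score for (row_talk, _user, score) in rows if row_talk == talk]
    return sum(scores) / len(scores) if scores else 0.0


rows = [("talk-a", "user1", 4), ("talk-a", "user2", 2), ("talk-b", "user3", 5)]
running = TalkScore()
for row_talk, _user, score in rows:
    if row_talk == "talk-a":
        running.add_vote(score)

print(running.average(), average_by_scan(rows, "talk-a"))
```

Both return the same average, but the scan gets slower with every vote, which would match the two-minute page loads after tens of thousands of votes.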

363
00:40:04,770 --> 00:40:06,370
Then I was like, okay, fine.

364
00:40:06,370 --> 00:40:09,110
I put another 30 thousand votes there.

365
00:40:09,610 --> 00:40:10,990
There was one week left.

366
00:40:11,390 --> 00:40:17,250
Then the guys went to Twitter and said, like, ah, now we changed the criteria, it will be the one with the biggest number.

367
00:40:17,310 --> 00:40:18,630
I was like, holy shit.

368
00:40:19,090 --> 00:40:23,030
Because I had put myself between the 5th and the 6th, so as not to draw attention.

369
00:40:25,550 --> 00:40:28,570
But I basically chose who was going to give the lectures.

370
00:40:28,570 --> 00:40:32,390
Because I said, well, this lecture seems to be cool, so I'm going to give a higher number for this guy.

371
00:40:32,390 --> 00:40:33,870
And so it went.

372
00:40:34,830 --> 00:40:39,930
Then, after I saw in the whole story that it wasn't going to work, I said, okay, so I'm going to change my...

373
00:40:39,930 --> 00:40:41,210
There were about two days left.

374
00:40:41,210 --> 00:40:43,970
I said, okay, I'm going to change all the scores I gave.

375
00:40:44,090 --> 00:40:49,590
Then I said, okay, I'm going to make the average of my talk be pi.

376
00:40:49,670 --> 00:40:50,930
3.1415...

377
00:40:50,930 --> 00:40:55,390
So I kept doing it, changing the scores, to see if it was closer to pi or not.

378
00:40:55,970 --> 00:40:57,810
Until the value converged.
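
Steering an average toward pi can be sketched greedily: each new score is the integer that brings the running average closest to the target. This is an illustration under assumptions, since the talk does not give the score range (0 to 10 is assumed here) and the speaker changed existing scores rather than adding new ones; `steer_average` is a hypothetical helper.

```python
import math


def steer_average(count, total, target, new_votes, lowest=0, highest=10):
    """Greedily add integer scores so the running average approaches target."""
    for _ in range(new_votes):
        # The score that would put the new average exactly on target,
        # rounded and clamped to the allowed range.
        ideal = target * (count + 1) - total
        score = min(highest, max(lowest, round(ideal)))
        total += score
        count += 1
    return count, total


count, total = steer_average(0, 0, math.pi, 1000)
print(total / count)
```

Once the average is near the target, every step keeps the sum within 0.5 of `target * count`, so the error shrinks roughly as one over the number of votes.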

379
00:40:58,090 --> 00:40:59,830
And it was also pretty fun.

380
00:41:00,530 --> 00:41:05,530
Obviously I got an email saying, oh, you weren't selected.

381
00:41:06,490 --> 00:41:08,950
But there's no time for jokes.

382
00:41:08,970 --> 00:41:11,550
I made this presentation here.

383
00:41:11,670 --> 00:41:14,010
They're going to do the same thing again.

384
00:41:14,050 --> 00:41:16,390
I doubt they're going to change the CAPTCHA.

385
00:41:16,390 --> 00:41:18,230
And I'm going to make fun of them again.

386
00:41:18,550 --> 00:41:22,650
But the title of my lecture is going to be How They Made Fun of the Call for Papers.

387
00:41:23,530 --> 00:41:26,090
I don't think they're going to accept it.

388
00:41:26,390 --> 00:41:27,950
But I'm going to try.

389
00:41:29,130 --> 00:41:34,490
Then I put here, I'm going to make it complete, in quotes.

390
00:41:34,730 --> 00:41:37,670
How would you protect against this? Ideas?

391
00:41:38,130 --> 00:41:39,230
Oh, I don't have any.

392
00:41:39,230 --> 00:41:40,130
I like to break things.

393
00:41:40,130 --> 00:41:41,270
I don't like to protect.

394
00:41:41,270 --> 00:41:47,570
But if you have ideas and want to share or have questions, that's it.

395
00:41:57,410 --> 00:41:59,630
Let's open for questions.

396
00:41:59,830 --> 00:42:01,230
I'm going to ask.

397
00:42:02,610 --> 00:42:04,110
First question.

398
00:42:04,110 --> 00:42:06,230
Very cool lecture.

399
00:42:06,270 --> 00:42:07,910
My question is this.

400
00:42:08,130 --> 00:42:12,330
You said that the automation process was a little boring.

401
00:42:12,330 --> 00:42:14,150
What did you use to automate?

402
00:42:14,150 --> 00:42:18,330
Especially the part of Selenium or something like that?

403
00:42:18,650 --> 00:42:20,470
No, I used Python.

404
00:42:21,190 --> 00:42:22,130
Pure Python?

405
00:42:22,130 --> 00:42:22,930
Pure Python.

406
00:42:22,930 --> 00:42:25,550
GET, POST, everything by hand.

407
00:42:26,170 --> 00:42:28,590
Here's the Selenium tip for you.

408
00:42:28,590 --> 00:42:29,930
Ok, I'll take a look.

409
00:42:29,970 --> 00:42:39,650
By the way, in the article I wrote in the magazine, there's a URL where I put all the Python scripts that I wrote to do this analysis on the samples.

410
00:42:39,890 --> 00:42:44,550
All those images that I generated, you can generate with the scripts I left there.

411
00:42:44,550 --> 00:42:47,510
Any questions you have, send me an email.

412
00:42:48,130 --> 00:42:57,090
I think the Guerrilla Mail part and the automation of the votes are not there, but if you want a tip, just send me an email and I'll help.

413
00:42:57,610 --> 00:42:58,710
Good lecture.

414
00:42:58,710 --> 00:42:59,910
I liked the research.

415
00:42:59,910 --> 00:43:01,210
It was interesting.

416
00:43:01,750 --> 00:43:10,270
Gustavo, maybe you could try working on an algorithm that vectorizes the letters and gives them a certain score.

417
00:43:10,470 --> 00:43:29,650
For example, you take a letter without any noise, generate its vectorization, give it a score, and then you try to break all those letters, even the dirty ones, try to vectorize what is more solid, and try to get an approximation from it, because even if it is dancing,

418
00:43:29,650 --> 00:43:42,970
by its outline and its texture, you would have a similar number, and you start working with this type of vectorization and scoring.

419
00:43:42,970 --> 00:43:49,810
So, for example, even if I take two letters that are very similar, but different, one would have a higher score.

420
00:43:50,090 --> 00:43:53,210
And see if this really works with CAPTCHA.

421
00:43:54,070 --> 00:43:55,910
I think you're right.

422
00:43:55,910 --> 00:43:56,670
But I don't know.

423
00:43:56,670 --> 00:44:01,750
I think this is the path of computer vision that I didn't have time to enter.

424
00:44:01,750 --> 00:44:08,070
I think it makes sense what you're saying, but I don't have this knowledge.

425
00:44:08,070 --> 00:44:13,610
I even think I could vectorize these letters, but I don't know what I would do next.

426
00:44:13,770 --> 00:44:17,290
How do I make them all the same?

427
00:44:17,290 --> 00:44:18,890
How do I compare one to the other?

428
00:44:18,890 --> 00:44:25,470
Which is actually the basis of machine learning and computer vision, of having all these things.
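
One very simple answer to "how do I compare one to the other", in the spirit of the question above, is to score a glyph against clean templates by pixel overlap. This is a toy sketch, not what either speaker implemented; the 3x3 templates and the helpers `similarity` and `best_match` are hypothetical, and it does not handle rotated ("dancing") letters.

```python
def similarity(glyph, template):
    """Jaccard overlap of dark pixels between two same-sized binary grids."""
    both = either = 0
    for row_g, row_t in zip(glyph, template):
        for g, t in zip(row_g, row_t):
            both += g & t
            either += g | t
    return both / either if either else 1.0


def best_match(glyph, templates):
    """Return the label of the template with the highest similarity score."""
    return max(templates, key=lambda label: similarity(glyph, templates[label]))


# Hypothetical 3x3 templates; real glyphs would be larger and normalized first.
TEMPLATES = {
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}
noisy_l = [[1, 0, 0], [1, 0, 1], [1, 1, 1]]

print(best_match(noisy_l, TEMPLATES))
```

Two similar but different letters get different scores against the same glyph, which is the scoring idea the question describes.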

429
00:44:25,570 --> 00:44:30,870
Like, I saw several articles that said how to break CAPTCHA when I was researching there.

430
00:44:30,870 --> 00:44:34,170
Ah, you're going to use TensorFlow.

431
00:44:34,710 --> 00:44:35,190
No.

432
00:44:36,410 --> 00:44:46,290
Because I think building the knowledge base I would need for this to be useful would take a long time.

433
00:44:46,410 --> 00:44:49,590
And then I wouldn't have time to vote on the thing.

434
00:44:49,590 --> 00:45:10,450
I think this was the interesting part, I liked to understand things in a complete way, and time was never a factor, I was never good at Capture the Flag, because I like to have time to look and, like, I'm not good with time.

435
00:45:10,450 --> 00:45:22,610
But this made me get rid of my perfectionism of trying to have the basic knowledge for later and solve the problem to be more practical, because I was pissed off with the guys, you know?

436
00:45:23,650 --> 00:45:30,930
I mean, I wasn't pissed off with the organizers, I was pissed off with the guys who had, I don't know, a professor from a college.

437
00:45:31,230 --> 00:45:33,850
How many people does this guy have to vote for him?

438
00:45:33,850 --> 00:45:36,070
It has nothing to do with information security.

439
00:45:36,670 --> 00:45:43,370
So, the moral of my story, of this story, I think, is that if you want to vote online, don't do it.

440
00:45:43,370 --> 00:46:05,050
Because I think specifically in this matter of Call for Papers, for other things, but I don't know, those things like Big Brother and such, that you can vote and such, maybe it's a cool thing to take a look and see if it would be cool, right?

441
00:46:05,190 --> 00:46:15,450
Find one of those datacenters at Globo and get a cheap virtual hosting and a match post.

442
00:46:15,450 --> 00:46:16,430
It would be cool.

443
00:46:17,170 --> 00:46:18,790
Any more questions, guys?

444
00:46:20,590 --> 00:46:23,790
Well, as you may have seen, I had a lot of fun.


