1
00:00:00,060 --> 00:00:05,070
In this tutorial, I'm going to be showing you the different ways you can scrape data from PDF documents

2
00:00:05,220 --> 00:00:07,130
using optical character recognition.

3
00:00:07,950 --> 00:00:15,700
So the most common way is to use the Get OCR Text activity over here.

4
00:00:16,020 --> 00:00:22,900
So under UI Automation, OCR, Screen Scraping, we have the Get OCR Text activity.

5
00:00:23,220 --> 00:00:27,690
So this is what you want to use when you want to scrape a certain piece of text from a PDF.

6
00:00:28,410 --> 00:00:33,390
If you want to scrape all the text from the entire PDF, you'd have to use something else.

7
00:00:33,750 --> 00:00:35,580
So you'd have to install this package.

8
00:00:35,580 --> 00:00:41,220
So you would have to come to Manage Packages over here and you can come to All Packages and you can

9
00:00:41,220 --> 00:00:44,640
search for PDF over here.

10
00:00:44,640 --> 00:00:49,200
You'll see UiPath.PDF.Activities and you can install this.

11
00:00:50,190 --> 00:00:53,910
Okay, then click save to make sure that the installation completes.

12
00:00:54,660 --> 00:00:56,280
It'll download the dependencies.

13
00:00:58,430 --> 00:00:58,790
Great.

14
00:00:58,820 --> 00:01:04,730
So now the PDF activities should have been added to your Activities panel on the left. You see, if

15
00:01:04,730 --> 00:01:12,360
I search for PDF, you'll see here there is now a Read PDF Text and a Read PDF With OCR.

16
00:01:12,620 --> 00:01:14,100
So this is the one we're looking for.

17
00:01:14,330 --> 00:01:20,300
What this does is it reads all the characters from a specified PDF and it'll output everything that it

18
00:01:20,300 --> 00:01:21,320
reads as text.

19
00:01:21,620 --> 00:01:26,300
So let's try and run it here, just so I can show you how it works. We can drag this in.

20
00:01:26,960 --> 00:01:33,740
And you can see here it asks you for a file name and then it asks for an OCR engine.

21
00:01:34,130 --> 00:01:37,940
So the filename would simply be the PDF.

22
00:01:37,940 --> 00:01:40,640
So you can just type in PDF.

23
00:01:41,870 --> 00:01:42,410
All right.

24
00:01:42,710 --> 00:01:47,140
You can see it probably requires a string over here.

25
00:01:47,150 --> 00:01:47,990
Let's just come here.

26
00:01:48,020 --> 00:01:49,520
Yes, it requires a string.

27
00:01:49,940 --> 00:01:52,330
So you can say .ToString, you can see there.

28
00:01:52,340 --> 00:01:53,030
It's a string.

29
00:01:53,040 --> 00:01:54,020
That's how I knew that.

30
00:01:54,020 --> 00:01:56,010
Click OK, click away.

31
00:01:56,780 --> 00:02:00,820
So there's still an error. That's probably because we haven't added our OCR engine yet.

32
00:02:00,830 --> 00:02:02,390
Let's just see what it says.

33
00:02:02,690 --> 00:02:06,050
Yes, it says no OCR engine assigned so you can close this.

34
00:02:06,050 --> 00:02:14,830
We can type in OCR and there should be a section with OCR engines here and a UI automation OCR engine.

35
00:02:15,320 --> 00:02:21,050
These are the different OCR engines which you could use, which will actually do the converting from

36
00:02:21,050 --> 00:02:22,070
image into text.

37
00:02:22,310 --> 00:02:23,870
So here there are four listed ones.

38
00:02:23,870 --> 00:02:28,340
There's ABBYY, there's Google, there's Microsoft and there's Tesseract.

39
00:02:28,520 --> 00:02:29,360
What tends to work

40
00:02:29,360 --> 00:02:35,930
well when you're scraping components of a PDF would be the Google OCR.

41
00:02:36,140 --> 00:02:42,530
But when you're scraping an entire PDF, usually from my experience, I find the Microsoft OCR works

42
00:02:42,530 --> 00:02:42,910
better.

43
00:02:43,520 --> 00:02:43,910
All right.

44
00:02:43,910 --> 00:02:50,450
So you can experiment with all the different ones depending on your PDF, because sometimes one might

45
00:02:50,450 --> 00:02:51,680
work better than another.

46
00:02:52,580 --> 00:02:56,360
So we can use Microsoft OCR over here and we'll drag that in.

47
00:02:56,690 --> 00:03:01,460
What we can then do is click on our Read PDF With OCR, and over here you can see there's a text

48
00:03:01,460 --> 00:03:05,120
output so we can create a variable which will then store that text.

49
00:03:05,450 --> 00:03:07,250
So I'll just press Ctrl+K

50
00:03:07,250 --> 00:03:09,530
to create a variable and I'll call it

51
00:03:12,050 --> 00:03:13,010
OCR

52
00:03:14,340 --> 00:03:14,970
Output.

53
00:03:18,860 --> 00:03:23,700
Just like that. OK, we're not going to be keeping this; this is just for demonstration purposes. I'll

54
00:03:23,720 --> 00:03:25,700
use a Write Line activity

55
00:03:27,030 --> 00:03:29,050
to write everything that it read to the Output panel.

56
00:03:29,400 --> 00:03:34,730
So over here, we want to write our OCR output and that is all.

57
00:03:35,310 --> 00:03:37,800
So I'm just going to clear our output panel.

58
00:03:37,830 --> 00:03:41,780
I'll make sure there's no PDF open and we can run this and see what happens.

59
00:03:44,980 --> 00:03:49,720
So now it's opened all three and it should have scraped all the data and here you can see.

60
00:03:50,790 --> 00:03:51,450
There you go.

61
00:03:51,870 --> 00:03:56,970
You'll notice that it spits out the titles first, such as product name, trademark middle name, that

62
00:03:56,970 --> 00:04:00,840
is all of these here on the left hand side, product name, trademark middle name, et cetera.

63
00:04:01,170 --> 00:04:07,950
And on the right hand side, it's, for example, GRV Sports is listed just below that.

64
00:04:07,950 --> 00:04:12,570
It should be there, GRV Sports, etc., and then it has all the values just below that.

65
00:04:12,960 --> 00:04:19,170
So it doesn't spit it out in a nice formatted order, unfortunately, but it does scrape all the data

66
00:04:19,170 --> 00:04:20,100
from the PDF.

67
00:04:20,610 --> 00:04:26,460
An application I could think of for this would be something like where you want to search all the text

68
00:04:26,460 --> 00:04:31,680
inside a scanned PDF document and you want to search for a certain term.

69
00:04:32,040 --> 00:04:39,290
If that term exists, then you can, for example, save that PDF into a certain folder.

70
00:04:39,600 --> 00:04:46,060
And if that term doesn't exist, then it saves that PDF into a different folder or something like that.

71
00:04:46,260 --> 00:04:51,990
So the way you could go about doing that is you could use something like an If statement. You could use

72
00:04:51,990 --> 00:04:58,830
an If and you can search the text which it outputs; in our case, it was OCR output.

73
00:04:59,160 --> 00:05:09,430
So you can select OCR output, you can say .Contains and then you can search for a certain term.

74
00:05:09,690 --> 00:05:19,020
For example, let's say you're looking for all the invoices which have the product GRV-E-Sports in it.

75
00:05:19,410 --> 00:05:28,350
Then what you do is you simply write right here "GRV-E-Sports".

76
00:05:29,040 --> 00:05:34,950
Then from there you can assign certain sequences to do whatever you want.

77
00:05:35,490 --> 00:05:40,180
If it does contain that, do this; if it doesn't contain that, do that, etc.

78
00:05:40,710 --> 00:05:41,130
All right.

79
00:05:41,140 --> 00:05:47,010
So that is just a quick summary of how to use the Read PDF With OCR activity.

80
00:05:47,460 --> 00:05:54,390
In the next video, I'll be showing you how to scrape certain elements from a PDF using OCR.
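
The folder-routing idea described above (check whether the OCR text contains a term, then save the PDF to one folder or another) could look roughly like this outside of Studio. This is only a minimal Python sketch, not the workflow from the video: the function name route_pdf, the folder arguments, and the sample text are all made up for illustration, and it assumes the OCR text has already been read into a string by some other step.

```python
from pathlib import Path
import shutil


def route_pdf(pdf_path, ocr_text, term, match_dir, other_dir):
    """Copy the PDF into match_dir if term occurs in ocr_text, else into other_dir.

    ocr_text is assumed to be the text an OCR step already produced for this PDF.
    The check is a plain, case-sensitive substring test.
    """
    target_dir = Path(match_dir) if term in ocr_text else Path(other_dir)
    # Create the destination folder if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 returns the path of the newly created copy.
    return Path(shutil.copy2(pdf_path, target_dir))
```

You would call it once per scanned PDF, for example route_pdf("invoice.pdf", ocr_text, "GRV-E-Sports", "matched", "unmatched"). Because the check is case-sensitive, you may want to lower-case both the text and the term first if the OCR output is inconsistent.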


