1
00:00:01,720 --> 00:00:04,150
Adds brute force approach
to pattern matching

2
00:00:04,150 --> 00:00:08,440
is slow when we try to
match billions of patterns.

3
00:00:08,440 --> 00:00:11,330
Let's try to figure out why.

4
00:00:11,330 --> 00:00:15,965
So, what is happening if we put
each pattern in its own car,

5
00:00:15,965 --> 00:00:21,992
then the first dives the first car,
then the second car, then the third car,

6
00:00:21,992 --> 00:00:28,000
the next car and next car, and
that's why it takes a lot of time.

7
00:00:28,000 --> 00:00:33,080
Here's a new idea,
let's pack all patterns in a bus, and

8
00:00:33,080 --> 00:00:36,750
let's drive this bus along the text.

9
00:00:36,750 --> 00:00:38,740
But how do the constructors bus?

10
00:00:39,780 --> 00:00:45,370
Let me show you how we can construct
the bus from multiplied patterns.

11
00:00:45,370 --> 00:00:52,090
Let's start from the first pattern and
represent it as a pass in a tree.

12
00:00:52,090 --> 00:00:55,510
Continue to the next button,
continue this next button, and

13
00:00:55,510 --> 00:00:57,770
continue this next button.

14
00:00:57,770 --> 00:01:00,770
So far it was easy and not interesting.

15
00:01:00,770 --> 00:01:06,700
We have four patterns and we constructed
for passes from the root of the tree.

16
00:01:06,700 --> 00:01:09,070
Let's go to the next one, antenna.

17
00:01:09,070 --> 00:01:12,880
Now the first letter in
antenna actually already

18
00:01:12,880 --> 00:01:16,120
appears on the way from
the root it is right here.

19
00:01:16,120 --> 00:01:19,060
The second letter also
appear away from the root.

20
00:01:19,060 --> 00:01:22,490
And then, we need to branch

21
00:01:22,490 --> 00:01:27,780
the previous pass into two passes
to construct the pass for antenna.

22
00:01:27,780 --> 00:01:29,360
Now let's do bandana.

23
00:01:29,360 --> 00:01:32,940
So bandana we press it further.

24
00:01:32,940 --> 00:01:35,990
And now we again have to branch the pass.

25
00:01:35,990 --> 00:01:39,040
Continue with ananas, again branching.

26
00:01:39,040 --> 00:01:43,510
And finally continue with nana,
branching again.

27
00:01:43,510 --> 00:01:48,330
And of what we've constructed
is actually our bus.

28
00:01:48,330 --> 00:01:51,115
And which is called trie of patters.

29
00:01:52,330 --> 00:01:53,590
How do we use this bus?

30
00:01:53,590 --> 00:01:55,160
How would we drive?

31
00:01:55,160 --> 00:01:58,007
After constructing this bus,
how would we drive with along text?

32
00:01:59,040 --> 00:02:01,360
Well, you'll use TrieMatching.

33
00:02:01,360 --> 00:02:06,140
And we'll drive

34
00:02:06,140 --> 00:02:11,380
this whole trie along the text,
and at each position of the text,

35
00:02:11,380 --> 00:02:16,340
we will walk down the trie
by spelling symbols of text.

36
00:02:17,380 --> 00:02:24,010
A pattern from the set patterns matches
text each time we reach a leaf.

37
00:02:24,010 --> 00:02:26,010
Let me show you how it works.

38
00:02:26,010 --> 00:02:31,070
So our bus, in now looking at
the first letter of the text, p, so

39
00:02:31,070 --> 00:02:36,960
if we walk along the edge,
labeled by p, the next letter is a,

40
00:02:36,960 --> 00:02:42,200
we walk along this edge, the next
letter is n, we walk along this edge,

41
00:02:42,200 --> 00:02:47,920
and we found that part of pan
appears in panamabananas.

42
00:02:47,920 --> 00:02:51,508
Next, we move to the next letter of
the text and we start my chunk again.

43
00:02:51,508 --> 00:02:55,527
A, n, a and now there is no match so

44
00:02:55,527 --> 00:03:03,830
at this position there is no match
between, patterns and texts.

45
00:03:03,830 --> 00:03:06,010
Continue further, n, a.

46
00:03:07,050 --> 00:03:09,130
Once again, there is no match.

47
00:03:09,130 --> 00:03:12,450
A, once again, there is no match.

48
00:03:12,450 --> 00:03:14,880
For m, from the very beginning,
there is no match.

49
00:03:14,880 --> 00:03:15,990
We continue further.

50
00:03:15,990 --> 00:03:17,190
A.

51
00:03:17,190 --> 00:03:18,750
Once again, there is no match.

52
00:03:18,750 --> 00:03:19,410
Let's try here.

53
00:03:19,410 --> 00:03:23,628
B, a, n, a, n, a we came to so

54
00:03:23,628 --> 00:03:29,950
we found the pattern, but
we have to continue further.

55
00:03:29,950 --> 00:03:30,660
Further.

56
00:03:30,660 --> 00:03:31,410
Further.

57
00:03:31,410 --> 00:03:33,450
Further.
Found the pattern again.

58
00:03:33,450 --> 00:03:34,260
Continue further.

59
00:03:35,460 --> 00:03:37,070
Again found the pattern.

60
00:03:37,070 --> 00:03:39,950
And now, there is no more match.

61
00:03:39,950 --> 00:03:43,542
So, we found, in a single run on our bus,

62
00:03:43,542 --> 00:03:48,074
we found all matches of
patterns against the jacks.

63
00:03:50,190 --> 00:03:51,660
Actually, I haven't finished yet.

64
00:03:51,660 --> 00:03:53,340
We also have to match n, a, s.

65
00:03:54,420 --> 00:03:58,220
No. a? No.
Now we are done.

66
00:03:58,220 --> 00:04:03,048
Our bus is very fast,
recalls at runtime of our brute

67
00:04:03,048 --> 00:04:07,080
force approach was O text time patterns.

68
00:04:07,080 --> 00:04:10,910
Where pattern says the total lens or
for pattern.

69
00:04:10,910 --> 00:04:15,580
That 's why it was slow the total
length of all patterns is huge.

70
00:04:15,580 --> 00:04:20,190
In the case when we tried to
match reeds against the genome.

71
00:04:20,190 --> 00:04:24,040
But the run time of
TrieMatching is only o of

72
00:04:24,040 --> 00:04:29,200
text time the length of
the longest pattern.

73
00:04:29,200 --> 00:04:32,521
And typically in modern
sequencing pattern,

74
00:04:32,521 --> 00:04:36,700
the reeds have lengths only 200,
300 nucleotide.

75
00:04:38,360 --> 00:04:41,690
So it looks like finally they are done.

76
00:04:41,690 --> 00:04:42,580
Should we go home?

77
00:04:43,890 --> 00:04:46,220
We are not ready to go home just yet.

78
00:04:47,290 --> 00:04:52,450
Note that trie we
constructed has 30 edges.

79
00:04:52,450 --> 00:04:55,490
But in general, the number of edges for

80
00:04:55,490 --> 00:05:00,780
a trie is O of the total
length of the patterns.

81
00:05:01,890 --> 00:05:03,140
And for the human genomes,

82
00:05:03,140 --> 00:05:07,370
the total length of the patterns
will be in trillions.

83
00:05:07,370 --> 00:05:14,080
So unfortunately, our algorithm will
be impractical for read matrix.

84
00:05:15,250 --> 00:05:16,116
Should we give up?
