1
00:00:00,000 --> 00:00:07,760
In this video, we dive deeper into the engine behind our chunking workflow.

2
00:00:07,760 --> 00:00:14,439
You will see how we handle storage, how the main file orchestrates the entire process,

3
00:00:14,439 --> 00:00:19,600
and how everything finally comes together in the output.

4
00:00:19,600 --> 00:00:21,040
Let's get started.

5
00:00:21,040 --> 00:00:27,799
In this part of the system, we focus on how chunks are stored, retrieved, and organized

6
00:00:27,799 --> 00:00:31,159
inside a local SQLite database.

7
00:00:31,159 --> 00:00:36,159
We begin by defining the database name, which in this case is chunks.db.

8
00:00:36,159 --> 00:00:42,560
SQLite is lightweight, serverless, and easy to integrate, making it ideal for small-to-medium

9
00:00:42,560 --> 00:00:46,240
scale document processing pipelines.

10
00:00:46,240 --> 00:00:50,119
The first function is init_db.

11
00:00:50,119 --> 00:00:55,439
This function is responsible for initializing the database and ensuring the required table

12
00:00:55,439 --> 00:00:57,200
exists.

13
00:00:57,200 --> 00:01:04,000
It creates a connection to the SQLite file, obtains a cursor, and executes a table creation

14
00:01:04,000 --> 00:01:05,519
statement.

15
00:01:05,519 --> 00:01:11,559
The table is named chunks and contains fields such as the document ID, chunk index, title,

16
00:01:11,559 --> 00:01:15,040
summary, keywords, and the actual text content.

17
00:01:15,040 --> 00:01:20,839
If the table already exists, SQLite simply ignores the creation request.

18
00:01:20,839 --> 00:01:25,559
After execution, the connection is committed and safely closed.

19
00:01:25,559 --> 00:01:30,000
This function is typically run once during application startup.
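A minimal sketch of what init_db might look like, assuming the column names described in the narration (the actual field names in the course code may differ):

```python
import sqlite3

DB_NAME = "chunks.db"  # database file name, as described in the video

def init_db(db_name: str = DB_NAME) -> None:
    """Create the chunks table if it does not already exist."""
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()
    # IF NOT EXISTS makes repeated startup calls harmless:
    # SQLite simply ignores the request when the table is present.
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS chunks (
            doc_id      TEXT,
            chunk_index INTEGER,
            title       TEXT,
            summary     TEXT,
            keywords    TEXT,
            text        TEXT
        )
        """
    )
    conn.commit()
    conn.close()
```

Because of the IF NOT EXISTS clause, the function is idempotent and safe to call on every startup.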

20
00:01:30,000 --> 00:01:34,360
Next, we have the save_chunks function.

21
00:01:34,360 --> 00:01:39,080
This is where we take all generated chunks for a given document and write them into the

22
00:01:39,080 --> 00:01:40,760
database.

23
00:01:40,760 --> 00:01:48,080
We again open a connection to the SQLite file and iterate through the list of chunk dictionaries.

24
00:01:48,080 --> 00:01:52,199
For each chunk, we insert a new row into the chunks table.

25
00:01:52,199 --> 00:01:57,160
We store the document ID, the sequential chunk index, and any optional metadata such

26
00:01:57,160 --> 00:01:59,959
as title, summary, and keywords.

27
00:01:59,959 --> 00:02:04,879
If keywords exist, we convert the keyword list into a single comma-separated string

28
00:02:04,879 --> 00:02:07,279
for database storage.

29
00:02:07,279 --> 00:02:11,240
The full chunk text is also stored in a dedicated field.

30
00:02:11,240 --> 00:02:16,559
After inserting all the chunks, we commit the transaction and close the connection.

31
00:02:16,559 --> 00:02:23,279
This gives us a durable, queryable storage format for semantic and agentic chunking outputs.
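A hedged sketch of save_chunks based on that description; the exact INSERT statement and dictionary keys are assumptions, but the shape (one row per chunk, keywords flattened to a comma-separated string) follows the narration:

```python
import sqlite3

def save_chunks(doc_id, chunks, db_name="chunks.db"):
    """Insert one row per chunk dictionary for the given document ID."""
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()
    for i, chunk in enumerate(chunks):
        keywords = chunk.get("keywords")
        # Store the optional keyword list as a single comma-separated string.
        keywords_str = ",".join(keywords) if keywords else None
        cur.execute(
            "INSERT INTO chunks (doc_id, chunk_index, title, summary, keywords, text) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (doc_id, i, chunk.get("title"), chunk.get("summary"),
             keywords_str, chunk["text"]),
        )
    conn.commit()
    conn.close()
```

Parameter placeholders (`?`) keep the inserts safe regardless of what characters the chunk text contains.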

32
00:02:23,279 --> 00:02:28,520
The third function, load_chunks, is used to retrieve stored chunks from the

33
00:02:28,520 --> 00:02:30,199
database.

34
00:02:30,199 --> 00:02:36,880
It accepts a document ID and queries the chunks table for all rows matching that ID.

35
00:02:36,880 --> 00:02:41,880
The results are ordered by chunk index so that we reconstruct the text in the correct

36
00:02:41,880 --> 00:02:43,380
sequence.

37
00:02:43,380 --> 00:02:49,779
Once the query results are fetched, we iterate over each row and rebuild a structured Python

38
00:02:49,779 --> 00:02:52,380
dictionary for each chunk.

39
00:02:52,380 --> 00:02:58,100
The keywords column, which is stored as a comma-separated string, is split back into

40
00:02:58,100 --> 00:02:59,899
a list.

41
00:02:59,899 --> 00:03:05,500
The function then returns a list of chunk objects that can be directly consumed by retrieval

42
00:03:05,500 --> 00:03:11,220
systems, analysis pipelines, or downstream AI agents.
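Following the same description, load_chunks might look like the sketch below; again the column names are assumptions, but the ordering by chunk index and the splitting of the keywords string match the narration:

```python
import sqlite3

def load_chunks(doc_id, db_name="chunks.db"):
    """Return the stored chunks for a document, ordered by chunk index."""
    conn = sqlite3.connect(db_name)
    rows = conn.execute(
        "SELECT chunk_index, title, summary, keywords, text "
        "FROM chunks WHERE doc_id = ? ORDER BY chunk_index",
        (doc_id,),
    ).fetchall()
    conn.close()
    chunks = []
    for index, title, summary, keywords, text in rows:
        chunks.append({
            "chunk_index": index,
            "title": title,
            "summary": summary,
            # Split the comma-separated string back into a list.
            "keywords": keywords.split(",") if keywords else [],
            "text": text,
        })
    return chunks
```

The ORDER BY clause is what guarantees the chunks come back in their original document sequence.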

43
00:03:11,220 --> 00:03:19,500
Overall, with this in place, your chunking system moves from being an in-memory utility

44
00:03:19,500 --> 00:03:23,820
to becoming a fully persistent, production-ready workflow.

45
00:03:23,820 --> 00:03:28,860
Let's quickly walk through the main file, which acts as the central controller for our

46
00:03:28,860 --> 00:03:31,339
entire chunking workflow.

47
00:03:31,339 --> 00:03:36,740
It begins by importing the chunking engine and the storage utilities.

48
00:03:36,740 --> 00:03:42,699
Then it prompts the user to paste a multi-line document, which becomes the input for every

49
00:03:42,699 --> 00:03:44,380
chunking mode.

50
00:03:44,380 --> 00:03:49,460
Once the document is collected, the script initializes the SQLite database to ensure

51
00:03:49,460 --> 00:03:51,619
the chunks table is ready.

52
00:03:51,619 --> 00:03:59,779
Next, it loops through all five chunking modes – fixed, recursive, document-based, semantic,

53
00:03:59,779 --> 00:04:01,660
and agentic.

54
00:04:01,660 --> 00:04:07,660
For each mode, it generates chunks, reports how many were created, and saves them into

55
00:04:07,660 --> 00:04:12,419
the database using the mode name as the document ID.

56
00:04:12,419 --> 00:04:17,779
After processing all methods, the script retrieves every stored chunk set from the database and

57
00:04:17,779 --> 00:04:19,500
prints them out.

58
00:04:19,500 --> 00:04:25,019
For each chunk, it displays the index, any available metadata, and a short preview of

59
00:04:25,019 --> 00:04:26,500
the text.

60
00:04:26,500 --> 00:04:33,299
This gives us a full end-to-end run, taking input, chunking it, storing it, and verifying

61
00:04:33,299 --> 00:04:34,660
the final output.
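The control flow of the main file can be sketched roughly as follows. Note that chunk_document here is a hypothetical stand-in for the real chunking engine (which is not shown in this video), and save_fn stands in for the database save step:

```python
# The five modes the main script loops through, per the walkthrough above.
MODES = ["fixed", "recursive", "document", "semantic", "agentic"]

def chunk_document(text, mode):
    # Placeholder: the real engine would dispatch to fixed, recursive,
    # document-based, semantic, or agentic chunking depending on `mode`.
    return [{"text": part} for part in text.split("\n\n") if part.strip()]

def run_pipeline(document, save_fn):
    """Chunk the document with every mode and persist each result set."""
    for mode in MODES:
        chunks = chunk_document(document, mode)
        print(f"{mode}: {len(chunks)} chunks created")
        # The mode name doubles as the document ID, as described above.
        save_fn(mode, chunks)
```

This mirrors the loop described in the walkthrough: generate, report the count, then save under the mode's name.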

62
00:04:34,660 --> 00:04:37,660
Now it's time for execution.

63
00:04:37,660 --> 00:04:42,420
Let's open the terminal and run the program from the command line.

64
00:04:42,420 --> 00:04:48,100
Once the script starts, go ahead and paste the content you want to process.

65
00:04:48,100 --> 00:04:53,980
This will trigger the full chunking pipeline and store the results in our database.

66
00:04:53,980 --> 00:04:59,140
As you can see, the system has now created chunks for all five modes.

67
00:04:59,140 --> 00:05:06,500
Each set of chunks has been processed, indexed, and successfully saved into our database,

68
00:05:06,500 --> 00:05:09,760
stored inside the chunks.db file.

69
00:05:09,760 --> 00:05:15,859
And here on the terminal, you can clearly see an example chunk with its title, summary,

70
00:05:15,899 --> 00:05:25,179
keywords, and a preview of the extracted text, confirming that our pipeline is working end-to-end.

71
00:05:25,179 --> 00:05:26,500
Thanks for joining in.

72
00:05:26,500 --> 00:05:28,299
I'll see you in the next one.

73
00:05:28,299 --> 00:05:31,500
Until then, keep learning and exploring.
