Improve create index #868
# logic: We assume that the maximum number of files in the table is <=
# MAGIC_NUMBER and the number of frames or chunks for each video/document is <=
# MAGIC_NUMBER. Based on this assumption, we can safely say that
# `_row_id` * MAGIC_NUMBER + `chunk_id` for document table and
How do you determine the magic number? And when will the chunking be done?
Right now, I set it to an arbitrarily large value. The idea is that as long as the number of files in the table is smaller than this number, our assumption will hold.
Regarding chunking: it is done when we read the document, similar to frame decoding in videos.
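For illustration, the ID scheme described in the code comment can be sketched as below. This is a minimal sketch, not the code in this PR: the value of `MAGIC_NUMBER` and the helper names `composite_id` and `split_id` are hypothetical, and it assumes `chunk_id` is a zero-based integer strictly smaller than `MAGIC_NUMBER` (an id equal to `MAGIC_NUMBER` would collide with chunk 0 of the next row).

```python
# Hypothetical value; the discussion only says "an arbitrarily large value".
MAGIC_NUMBER = 10**9


def composite_id(row_id: int, chunk_id: int, magic_number: int = MAGIC_NUMBER) -> int:
    """Combine a persisted row id and a per-file chunk id into one integer.

    Uniqueness holds only while 0 <= chunk_id < magic_number.
    """
    if row_id < 0:
        raise ValueError("row_id must be non-negative")
    if not 0 <= chunk_id < magic_number:
        raise ValueError("chunk_id must be in [0, magic_number)")
    return row_id * magic_number + chunk_id


def split_id(combined: int, magic_number: int = MAGIC_NUMBER) -> tuple:
    """Recover (row_id, chunk_id) from a combined id."""
    return divmod(combined, magic_number)


if __name__ == "__main__":
    cid = composite_id(3, 7)
    print(cid)            # 3000000007
    print(split_id(cid))  # (3, 7)
```

The range check makes the assumption explicit: instead of silently producing a duplicate id when a file has more chunks than expected, the sketch fails loudly.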
I see. And for video, each video will get a unique ID and each frame will be assigned a different frame ID? Is this assumption correct?
We already have these IDs (`_row_id` for video and `id` for frame), and this change builds on top of them. The assumption is that these IDs won't change across runs. `_row_id` is persisted, so there is no issue there. `id` is generated at runtime, and as long as the reader is deterministic across runs, we don't have a problem.
evadb/readers/pdf_reader.py
Outdated
@@ -34,10 +34,12 @@ def _read(self) -> Iterator[Dict]:
        doc = fitz.open(self.file_url)

        # PAGE ID, PARAGRAPH ID, STRING
        # Maintain a global paragraph number per PDF
        global_paragraph_no = 0
I added this so that the ID calculated for the create index method stays unique. Without it, the ID is still not unique, because paragraphs on different pages can have the same paragraph ID. @gaurav274
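A small sketch of the idea, assuming the PDF has already been split into pages and paragraphs (plain lists of strings stand in for the `fitz` document here). The function name `read_paragraphs`, the output keys, and the choice to start numbering at 1 are illustrative assumptions, not the reader's actual implementation.

```python
from typing import Dict, Iterator, List


def read_paragraphs(pages: List[List[str]]) -> Iterator[Dict]:
    """Yield one row per paragraph with a paragraph number unique per document.

    A counter that restarts on every page would give the first paragraph of
    each page the same id; a single counter kept across pages avoids that.
    """
    global_paragraph_no = 0
    for page_no, paragraphs in enumerate(pages, start=1):
        for text in paragraphs:
            global_paragraph_no += 1  # never reset between pages
            yield {"page": page_no, "paragraph": global_paragraph_no, "data": text}


if __name__ == "__main__":
    sample = [["intro", "background"], ["method"]]
    for row in read_paragraphs(sample):
        print(row)
```

Because the counter depends only on the order of pages and paragraphs, it stays stable across runs as long as the reader is deterministic.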
Close this for now. #1073 is merged as a fix.