-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
H-3614: Introduce chonky
PDF Embeddings
#5673
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JesusFileto!
I noticed that you have used curl
which means there is no asynchronous execution of requests possible. In particular for AI calls, asynchronous calls are preferred. In addition, curl
pulls in libcurl
(and OpenSSL), while reqwest
is written in Rust and can use rustls
.
As async backend we use tokio
. If you add the "macros"
and "rt-multi-thread"
to tokio
you can use #[tokio::main]
on fn main()
and #[tokio::test]
instead of #[test]
which makes the function async
.
As this effectively makes the code async
we can also use tokio
to read files (by using the fs
feature on tokio
and tokio::fs
instead of std::fs
)
Cargo.toml
Outdated
bumpalo = { version = "=3.16.0", default-features = false } | ||
bytes = { version = "1.6.0" } | ||
clap_builder = { version = "=4.5.21", default-features = false, features = ["std"] } | ||
criterion = { version = "=0.5.1" } | ||
curl = { version = "=0.4.47" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use reqwest
instead. It's already configured in the root-manifest. You probably want to enable the json
feature so we don't need to convert JSON to string first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok have pushed out some changes that do this, but unable to solve the tokio
linter errors that occur from moves during the await
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, the lints are saying that you cannot send PdfDocument
across threads. This is because PDFium is not threadsafe. We basically have two options:
- Don't pass PDFDocument around in
async
functions but instead read the PDF in the function itself (see their example about thread safety) - Don't care about multithreading here.
Basically this means we cannot send a document across threads in any world (which also would happen if you want to offload a task to a thread). However, this can be worked around by not sending the document but the API as a whole instead (Pdfium
). This would allow to use the library in a multi-threaded environment, however, the PdfDocument
needs to stay in a single thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I'd say we try not passing PdfDocument
if not needed and ignore the lint #[expect(<lint name>, reason = "...")]
if we cannot or do not want to avoid it.
chonky
PDF Embeddings
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5673 +/- ##
==========================================
- Coverage 22.90% 22.88% -0.03%
==========================================
Files 572 572
Lines 19389 19412 +23
Branches 2745 2752 +7
==========================================
Hits 4442 4442
- Misses 14894 14917 +23
Partials 53 53
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work moving to async
🎉
I did a more in-depth review. I think it's a good time to improve the code quality a bit 🙂
.to_owned()) | ||
} | ||
|
||
fn base64_json(image_data: Vec<u8>) -> Result<String, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not needed to pass a vector, a slice is enough
fn base64_json(image_data: Vec<u8>) -> Result<String, Report<ChonkyError>> { | |
fn base64_json(image_data: &[u8]) -> Result<String, Report<ChonkyError>> { |
If you want to be even more generic you can use the same approach as above:
fn base64_json(image_data: Vec<u8>) -> Result<String, Report<ChonkyError>> { | |
fn base64_json(image_data: impl AsRef<[u8]>) -> Result<String, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, will add this in next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second suggestion is a more flexible way without side effects (the calling site doesn't need to be updated).
However, you can change the caller to
- payload = base64_json(image_payload);
+ payload = base64_json(&image_payload);
} | ||
|
||
// Parses the response to extract the image embedding vector | ||
fn extract_embedding(response_text: &str) -> Result<Vec<f64>, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have a struct for embedding, such as TextEmbedding
and ImageEmbedding
we probably want to return an enum here
enum Embedding {
Text(TextEmbedding),
Image(ImageEmbedding),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was implementing this but the ImageEmbedding
struct requires the image in the struct, which is not accessible when just requesting the embedding, since the calling function should have knowledge on the type of vector it is receiving, it may be easier to not implement this (to avoid having to handling if let
branches upstream where the compiler does not know what datatype we are expecting but we do)
Cargo.toml
Outdated
bumpalo = { version = "=3.16.0", default-features = false } | ||
bytes = { version = "1.6.0" } | ||
clap_builder = { version = "=4.5.21", default-features = false, features = ["std"] } | ||
criterion = { version = "=0.5.1" } | ||
curl = { version = "=0.4.47" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, the lints are saying that you cannot send PdfDocument
across threads. This is because PDFium is not threadsafe. We basically have two options:
- Don't pass PDFDocument around in
async
functions but instead read the PDF in the function itself (see their example about thread safety) - Don't care about multithreading here.
Basically this means we cannot send a document across threads in any world (which also would happen if you want to offload a task to a thread). However, this can be worked around by not sending the document but the API as a whole instead (Pdfium
). This would allow to use the library in a multi-threaded environment, however, the PdfDocument
needs to stay in a single thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a re-review and the changes look very good. I added a few more comments.
Also, let's make sure to consistently use the new Embedding
structs as they really contributes to the readability of the code.
let xmin: i32 = num_traits::cast(bbox.xmin).ok_or(ChonkyError::Pdfium)?; | ||
let ymin: i32 = num_traits::cast(bbox.ymin).ok_or(ChonkyError::Pdfium)?; | ||
let xmax: i32 = num_traits::cast(bbox.xmax).ok_or(ChonkyError::Pdfium)?; | ||
let ymax: i32 = num_traits::cast(bbox.ymax).ok_or(ChonkyError::Pdfium)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the exact behavior here we expect? Because num_traits::cast
simply discards the decimals which is the behavior of floor
. We should add to the comment that flooring it is the expected behavior.
libs/chonky/src/lib.rs
Outdated
pub mod embedding; | ||
|
||
pub use embedding::{hugging_face_api, multi_modal_embedding}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need both to be pub. Either is sufficient.
libs/chonky/src/lib.rs
Outdated
|
||
#[derive(Debug, Clone)] | ||
pub struct Embedding { | ||
_model_used: String, //model name reveals image or text embedding model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's often the case that the model is statically non
_model_used: String, //model name reveals image or text embedding model | |
_model_used: Cow<'static, str>, //model name reveals image or text embedding model |
Also, is there a specific reason why you hide this detail and not making this pub
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason the field is not pub
is since it is not used yet so clippy complains.
pub async fn embed_pdf_object_images( | ||
pdf_image_extract: Vec<Vec<DynamicImage>>, | ||
project_id: &str, | ||
) -> Result<Vec<Vec<Vec<f64>>>, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely see the improvement already. I'll keep this comment open for the Box<[f64]>
approach.
/// | ||
/// [`ChonkyError::VertexAPI`] when there are HTTP request errors | ||
pub async fn embed_screenshots( | ||
pdf_image_extract: Vec<DynamicImage>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual idiomatic version (but also quite hard to start with) Is using streams.
This way the function would become
pub fn embed_pdf_object_images(
pdf_image_extract: impl IntoIterator<Item = PageImageObjects>,
project_id: &str,
) -> impl Stream<Item = Result<PageImageObjectsEmbeddings, Report<ChonkyError>>> {
stream::iter(pdf_image_extract)
.then(|page_images| embed_screenshots(page_images, project_id))
.map(|embeddings| {
Ok(PageImageObjectsEmbeddings {
_embeddings: embeddings?,
})
})
}
Note, that this function is not async
, because it returns a Stream
(which is pretty much an async iterator).
I don't say we need to take this route because streams are not yet well matured in Rust (and a more built-in feature is on the way), but it avoids allocations.
pub async fn embed_tables( | ||
pdf_table_bounds: Vec<Vec<ExtractedTable>>, | ||
project_id: &str, | ||
) -> Result<Vec<Vec<Vec<f64>>>, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we still have this signature 😅
/// | ||
/// [`ChonkyError::HuggingFaceAPI`] when there are HTTP request errors | ||
pub async fn make_table_recognition_request( | ||
image_path: impl AsRef<Path> + core::marker::Send + core::marker::Sync, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Send
and Sync
are in the prelude.
image_path: impl AsRef<Path> + core::marker::Send + core::marker::Sync, | |
image_path: impl AsRef<Path> + Send + Sync, |
async fn get_binary_image_data( | ||
image_path: impl AsRef<Path> + core::marker::Send + core::marker::Sync, | ||
) -> Result<Vec<u8>, Report<ChonkyError>> { | ||
fs::read(image_path) | ||
.await | ||
.change_context(ChonkyError::ImageError) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really doesn't more than reading a file. We should probably inline it. Also, the error ImageError
seems to be inaccurate as this is really a file operation. (Btw, tokio::fs::read
only wraps the reading operation in an another task, nothing else).
If we keep it, we should use Send
only and directly:
async fn get_binary_image_data( | |
image_path: impl AsRef<Path> + core::marker::Send + core::marker::Sync, | |
) -> Result<Vec<u8>, Report<ChonkyError>> { | |
fs::read(image_path) | |
.await | |
.change_context(ChonkyError::ImageError) | |
} | |
async fn get_binary_image_data( | |
image_path: impl AsRef<Path> + Send, | |
) -> Result<Vec<u8>, Report<ChonkyError>> { | |
fs::read(image_path) | |
.await | |
.change_context(ChonkyError::ImageError) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In trying to move the function inline, found that the explicit references to Send + Sync
bounds avoid compiler errors when using references to image_path
further downstream when retrying the request if the hugging face model was initially cold. For this reason it may be wise to not inline this function.
.to_owned()) | ||
} | ||
|
||
fn base64_json(image_data: Vec<u8>) -> Result<String, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second suggestion is a more flexible way without side effects (the calling site doesn't need to be updated).
However, you can change the caller to
- payload = base64_json(image_payload);
+ payload = base64_json(&image_payload);
… `chonky` tests (#5687) Co-authored-by: JesusFileto <[email protected]> Co-authored-by: Jesus Fileto <[email protected]>
} | ||
|
||
/// A function that performs authentication with Google Vertex API and performs | ||
/// a curl request to obtain multimodal embeddings given an image path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// a curl request to obtain multimodal embeddings given an image path | |
/// a request to obtain multimodal embeddings given an image path |
Benchmark results
|
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/organization/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/book/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/page/v/2
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/song/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/playlist/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/building/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/person/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/block/v/1
|
Flame Graph |
representative_read_multiple_entities
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_property | depths: DT=255, PT=255, ET=255, E=255 | Flame Graph | |
entity_by_property | depths: DT=0, PT=0, ET=0, E=0 | Flame Graph | |
entity_by_property | depths: DT=2, PT=2, ET=2, E=2 | Flame Graph | |
entity_by_property | depths: DT=0, PT=0, ET=0, E=2 | Flame Graph | |
entity_by_property | depths: DT=0, PT=0, ET=2, E=2 | Flame Graph | |
entity_by_property | depths: DT=0, PT=2, ET=2, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=255, PT=255, ET=255, E=255 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=0, ET=0, E=0 | Flame Graph | |
link_by_source_by_property | depths: DT=2, PT=2, ET=2, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=0, ET=0, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=0, ET=2, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=2, ET=2, E=2 | Flame Graph |
representative_read_entity_type
Function | Value | Mean | Flame graphs |
---|---|---|---|
get_entity_type_by_id | Account ID: d4e16033-c281-4cde-aa35-9085bf2e7579
|
Flame Graph |
scaling_read_entity_complete_one_depth
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | 50 entities | Flame Graph | |
entity_by_id | 5 entities | Flame Graph | |
entity_by_id | 1 entities | Flame Graph | |
entity_by_id | 10 entities | Flame Graph | |
entity_by_id | 25 entities | Flame Graph |
scaling_read_entity_linkless
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | 1 entities | Flame Graph | |
entity_by_id | 100 entities | Flame Graph | |
entity_by_id | 10 entities | Flame Graph | |
entity_by_id | 1000 entities | Flame Graph | |
entity_by_id | 10000 entities | Flame Graph |
scaling_read_entity_complete_zero_depth
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | 50 entities | Flame Graph | |
entity_by_id | 5 entities | Flame Graph | |
entity_by_id | 1 entities | Flame Graph | |
entity_by_id | 10 entities | Flame Graph | |
entity_by_id | 25 entities | Flame Graph |
@@ -92,7 +92,8 @@ ahash = { version = "=0.8.11", default-features = false } | |||
ariadne = { version = "=0.5.0", default-features = false } | |||
aws-types = { version = "=1.3.3", default-features = false } | |||
axum = { version = "0.7.5" } | |||
axum-core = { version = "0.4.5" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this was by accident?
[[package]] | ||
name = "curl" | ||
version = "0.4.47" | ||
source = "registry+https://github.com/rust-lang/crates.io-index" | ||
checksum = "d9fb4d13a1be2b58f14d60adba57c9834b78c62fd86c3e76a148f732686e9265" | ||
dependencies = [ | ||
"curl-sys", | ||
"libc", | ||
"openssl-probe", | ||
"openssl-sys", | ||
"schannel", | ||
"socket2", | ||
"windows-sys 0.52.0", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you need to update this file, it seems outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add json files to .gitattributes
as well?
🌟 What is the purpose of this PR?
This PR outlines the work for the embeddings that will be done in the PDF and preliminary structs with how they will be stored. The main goal is to split embeddings into multiple levels of textual information such as table embeddings. Future PRs will focus on implementing the XML metadata that will accompany these embeddings to be used for entity extraction.
🚫 Blocked by
🔍 What does this change?
Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
🐾 Next steps
🛡 What tests cover this?