simd-doc, gazette, avro, and dekaf crates #1448
Merged
Save an allocation and copy. Also make broker::process_spec::Id Hash-able.
jgraettinger force-pushed the johnny/read-transcode branch 4 times, most recently from d2f4e59 to 54cc9b2 (April 20, 2024 18:02)
psFried approved these changes on Apr 23, 2024
Overall looks good. Left a few minor nit picks and questions here and there.
This crate offers highly optimized parsing of JSON documents into doc::HeapNode instances, as well as optimized transcoding of documents into a byte layout that exactly matches and is fully compatible with doc::ArchivedNode.
* Router for dynamic dispatch using a request Route.
* read_json_lines() using simd-doc for the common case of reading JSON documents as quickly as possible.
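As a rough illustration of the framing step behind reading JSON lines, the sketch below splits a buffer of newline-delimited JSON into complete documents and returns any trailing partial document so it can be carried into the next read. The function name `split_json_lines` is hypothetical and is not the `gazette` crate's API; per the commit message, the real `read_json_lines()` parses with simd-doc, which this std-only sketch does not attempt.

```rust
/// Split `buf` into complete newline-terminated JSON documents.
/// Returns the complete lines plus the trailing partial line (possibly empty),
/// which a caller would prepend to the next chunk it reads.
/// Hypothetical helper for illustration only.
pub fn split_json_lines(buf: &[u8]) -> (Vec<&[u8]>, &[u8]) {
    let mut lines = Vec::new();
    let mut start = 0;

    for (i, b) in buf.iter().enumerate() {
        // A raw newline cannot appear inside a JSON string (it must be
        // escaped), so it is a safe delimiter for single-line documents.
        if *b == b'\n' {
            if i > start {
                lines.push(&buf[start..i]); // Skip empty lines.
            }
            start = i + 1;
        }
    }
    (lines, &buf[start..])
}
```

Given the input `{"a":1}\n{"b":2}\n{"c"`, this yields two complete documents and the remainder `{"c"`.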
avro_rs was contributed to Apache (as apache_avro) and is no longer maintained itself.
The `avro` crate maps static inference over a JSON schema (`doc::Shape`) into a compatible Avro schema that maintains as much fidelity to the schema as possible. It supports nested subrecords and arrays, many logical types, default values, and more. It knows how to generate and encode both "key" and "value" schemas, where key schemas are a flat record encoding an ordered Flow collection key, and value schemas reflect entire documents. This facilitates interop with ecosystems like Kafka, which use Avro schemas to encode topic keys. When encoding, this crate directly encodes doc::AsNode instances into their schematized Avro wire encoding. The `apache_avro` crate is used to a) represent Avro schema, and b) for test-time validation of the compatibility of encodings. It's not used for non-test parsing or encoding.
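To make "schematized Avro wire encoding" concrete, here is a minimal sketch of how the Avro binary encoding represents a `long`, a `string`, and a flat record, following the Avro specification. It is not the `avro` crate's code: `encode_long`, `encode_string`, and `encode_key` are hypothetical names, and the two-field key (an integer `id` and a string `region`) is an assumed example of a flat key record.

```rust
/// Append an Avro `long`: zigzag-map the sign bit, then emit the value as a
/// little-endian base-128 varint (high bit set on every byte but the last).
pub fn encode_long(n: i64, out: &mut Vec<u8>) {
    let mut z = ((n << 1) ^ (n >> 63)) as u64;
    while z >= 0x80 {
        out.push((z as u8 & 0x7f) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Append an Avro `string`: its byte length as a `long`, then the UTF-8 bytes.
pub fn encode_string(s: &str, out: &mut Vec<u8>) {
    encode_long(s.len() as i64, out);
    out.extend_from_slice(s.as_bytes());
}

/// Encode a hypothetical flat key record with fields (id: long, region: string).
/// An Avro record is just its fields in schema order, with no names or delimiters.
pub fn encode_key(id: i64, region: &str) -> Vec<u8> {
    let mut out = Vec::new();
    encode_long(id, &mut out);
    encode_string(region, &mut out);
    out
}
```

Because the encoding carries no field names or type tags, a reader needs the matching schema to decode it, which is why the key and value schemas are generated alongside the encoder.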
Dekaf acts as a Kafka broker and supports a sufficient portion of the protocol for Kafka consumers to fetch data from Estuary collections as if they were Kafka topics. It also implements a subset of the Confluent Schema Registry API and performs JSON Schema => Avro schema transliteration and read-time encoding, which allows readers to consume "topic" (collection) data as if it were natively Avro-encoded with a registered schema.
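Consumers that use a Confluent Schema Registry typically expect each message to carry the Confluent wire-format header: a zero magic byte, a 4-byte big-endian schema ID, and then the Avro-encoded payload. The sketch below shows that framing. That Dekaf emits exactly this header is an assumption, since the commit message does not spell it out, and `frame` and `unframe` are hypothetical names rather than Dekaf functions.

```rust
/// Magic byte that prefixes every Confluent-framed message.
const MAGIC: u8 = 0;

/// Prefix an Avro payload with the magic byte and big-endian schema ID.
/// Hypothetical helper; assumes the Confluent wire format.
pub fn frame(schema_id: u32, avro_payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + avro_payload.len());
    out.push(MAGIC);
    out.extend_from_slice(&schema_id.to_be_bytes());
    out.extend_from_slice(avro_payload);
    out
}

/// Split a framed message into its schema ID and Avro payload.
/// Returns None if the header is truncated or the magic byte is wrong.
pub fn unframe(buf: &[u8]) -> Option<(u32, &[u8])> {
    if buf.len() < 5 || buf[0] != MAGIC {
        return None;
    }
    let id = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]);
    Some((id, &buf[5..]))
}
```

A consumer would use the recovered schema ID to look up the registered Avro schema before decoding the payload.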
Replace serde with simd_doc::Parser::parse_one() for parsing documents into doc::HeapNode. Introduce a runtime::Accumulator composite struct which carries a simd_doc::Parser instance.
Also, `protoc` isn't required to build `flowctl`.
jgraettinger force-pushed the johnny/read-transcode branch from 54cc9b2 to 305d8e0 (May 3, 2024 19:23)
github-actions bot pushed a commit to estuary/homebrew-flowctl that referenced this pull request on May 31, 2024
## What's Changed

* `sum` annotation now supports arbitrary precision using string-encoded numerics
* Add experimental `flowctl raw stats` sub-command
* Various minor JSON Schema handling improvements.
* Switch to simd-json for fast JSON parsing and transcoding.

### Filtered PRs impacting `flowctl`:

* crates/json: don't validate strings with underscores as integers or numbers by @williamhbaker in estuary/flow#1364
* Update `runtime::container::start()` to take a new `allow_local` flag by @jshearer in estuary/flow#1361
* json: fix ordering of integers greater than i64::MAX by @psFried in estuary/flow#1367
* validation: fix bucket name validation for GCS and Azure by @psFried in estuary/flow#1370
* thread through `--allow-local` argument when running locally by @psFried in estuary/flow#1374
* validation: allow unsatisfiable constraints on excluded fields by @psFried in estuary/flow#1375
* update a number of dependencies, including RocksDB (to 8.10) by @jgraettinger in estuary/flow#1389
* connector-init: set connector_type on protocol check Spec by @jgraettinger in estuary/flow#1400
* models/journals: region configuration for S3 storage mappings by @williamhbaker in estuary/flow#1410
* improve schema validation errors by including metadata about the collection that failed by @jgraettinger in estuary/flow#1408
* flowctl: resurrect stats subcommand under raw by @psFried in estuary/flow#1432
* make: codesign binaries on mac by @mdibaiee in estuary/flow#1436
* simd-doc, gazette, avro, and dekaf crates by @jgraettinger in estuary/flow#1448
* flowctl(preview): multiple bindings may read from one collection by @mdibaiee in estuary/flow#1466
* crates/doc: support arbitrary precision with `sum` annotation by @jgraettinger in estuary/flow#1477
* crates/doc: relax `sum` inspection to allow numeric strings by @jgraettinger in estuary/flow#1481

**Full Changelog**: estuary/flow@v0.3.12...v0.3.13
Description:

Please review commit by commit.

This PR introduces:

* `simd-doc` crate for very fast JSON parsing and transcoding into our preferred runtime representations (HeapNode and ArchivedNode).
* `gazette` crate for dispatched reads from Gazette brokers, with full fragment metadata (and later, journal appends).
* `avro` crate for translation from our JSON schemas (`doc::Shape`) into high-fidelity AVRO schemas.
* `dekaf` crate which emulates a Kafka broker and schema registry, and proxies to our existing APIs.

The `runtime` and legacy `derive` crate are also switched over to using `simd-doc` for JSON parsing.

Testing:

* `dekaf` -- which also exercises `simd-doc`, `gazette`, and `avro` -- using the `kaf` CLI and Materialize.
* `avro` and `simd-doc`, as well as new fuzz tests and benchmarks for the latter.

Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)