simd-doc, gazette, avro, and dekaf crates #1448

Merged 10 commits on May 3, 2024

@jgraettinger jgraettinger commented Apr 19, 2024

Description:

Please review commit by commit.

This PR introduces:

  • simd-doc crate for very fast JSON parsing and transcoding into our preferred runtime representations (HeapNode and ArchivedNode).
  • A gazette crate for dispatched reads from Gazette brokers, with full fragment metadata (and later, journal appends).
  • An avro crate for translation from our JSON schemas (doc::Shape) into high-fidelity AVRO schemas.
  • A dekaf crate which emulates a Kafka broker and schema registry, and proxies to our existing APIs.

The runtime and legacy derive crate are also switched over to using simd-doc for JSON parsing.

Testing:

  • Significant manual testing with dekaf, which also exercises simd-doc, gazette, and avro, using the kaf CLI and Materialize.
  • Unit testing of avro and simd-doc, as well as new fuzz tests and benchmarks for the latter.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



Save an allocation and copy.

Also make broker::process_spec::Id Hash-able.
@jgraettinger jgraettinger requested a review from psFried April 19, 2024 23:56
@jgraettinger jgraettinger force-pushed the johnny/read-transcode branch 4 times, most recently from d2f4e59 to 54cc9b2 Compare April 20, 2024 18:02
@psFried psFried left a comment


Overall looks good. Left a few minor nitpicks and questions here and there.

crates/doc/src/bump_str.rs (resolved)
crates/doc/src/bump_vec.rs (resolved)
crates/simd-doc/src/lib.rs (outdated, resolved)
crates/simd-doc/src/lib.rs (resolved)
crates/gazette/src/journal/mod.rs (resolved)
crates/gazette/src/journal/read.rs (resolved)
crates/gazette/src/router.rs (resolved)
crates/avro/src/encode.rs (resolved)
crates/avro/src/schema.rs (outdated, resolved)
crates/dekaf/src/lib.rs (resolved)
This crate offers highly optimized parsing of JSON documents into
doc::HeapNode instances, as well as optimized transcoding of
documents into a byte layout that exactly matches and is fully
compatible with doc::ArchivedNode.
* Router for dynamic dispatch using a request Route.

* read_json_lines() using simd-doc for the common case of reading JSON
  documents as quickly as possible.
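
To illustrate the framing problem that a JSON-lines reader has to solve, here is a minimal sketch using only the standard library. It is not the gazette crate's implementation: `drain_complete_lines` is a hypothetical helper, and the real `read_json_lines()` signature is not shown in this PR text. The idea is that read chunks can end mid-document, so only newline-terminated documents are handed to the parser and the partial tail is kept for the next chunk.

```rust
/// Splits complete, newline-terminated lines out of `buf`, leaving any
/// trailing partial line in `buf` for the next chunk to complete.
/// Hypothetical helper for illustration; not part of the gazette crate.
fn drain_complete_lines(buf: &mut Vec<u8>) -> Vec<Vec<u8>> {
    let mut lines = Vec::new();

    // Everything after the last newline is an incomplete document.
    let Some(last_newline) = buf.iter().rposition(|b| *b == b'\n') else {
        return lines;
    };

    // Remove the complete portion, including its final newline.
    let complete: Vec<u8> = buf.drain(..=last_newline).collect();

    // Each non-empty segment is one JSON document ready for parsing.
    for line in complete.split(|b| *b == b'\n') {
        if !line.is_empty() {
            lines.push(line.to_vec());
        }
    }
    lines
}
```

A caller would append each chunk read from the broker to `buf`, call the helper, and pass the returned documents to a parser. Copying each line into its own `Vec` keeps the sketch simple; a performance-oriented reader would more likely hand the parser one contiguous slice of complete lines.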
avro_rs was contributed to Apache (as apache_avro) and is no longer maintained itself.
The `avro` crate maps static inference over a JSON schema (`doc::Shape`)
into a compatible Avro schema that maintains as much fidelity to the
schema as possible. It supports nested subrecords and arrays, many
logical types, default values, and more.

It knows how to generate and encode both "key" and "value" schemas,
where key schemas are a flat record encoding an ordered Flow collection key,
and value schemas reflect entire documents. This facilitates interop
with ecosystems like Kafka, which use Avro schemas to encode topic keys.

When encoding, this crate directly encodes doc::AsNode instances into
their schematized Avro wire encoding. The `apache_avro` crate is used to
a) represent Avro schema, and b) for test-time validation of the
compatibility of encodings. It's not used for non-test parsing or encoding.
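
For readers unfamiliar with the Avro wire encoding mentioned above, the following is a small sketch of two of its primitives as defined by the Avro specification: a `long` is zig-zag encoded and then written as a little-endian base-128 varint, and a `string` is its byte length (as a `long`) followed by UTF-8 bytes. The function names `encode_long` and `encode_string` are illustrative assumptions and are not taken from the `avro` crate in this PR.

```rust
/// Appends `n` using Avro's `long` wire encoding:
/// zig-zag first, then a little-endian base-128 varint.
/// Illustrative name; not the avro crate's API.
fn encode_long(n: i64, out: &mut Vec<u8>) {
    // Zig-zag maps small magnitudes of either sign to small unsigned values:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    let mut z = ((n << 1) ^ (n >> 63)) as u64;

    // Emit 7 bits per byte, setting the high bit while more bytes follow.
    while z >= 0x80 {
        out.push((z as u8 & 0x7f) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Appends a string: its byte length as a `long`, then the UTF-8 bytes.
/// Illustrative name; not the avro crate's API.
fn encode_string(s: &str, out: &mut Vec<u8>) {
    encode_long(s.len() as i64, out);
    out.extend_from_slice(s.as_bytes());
}
```

Because the encoding carries no field names or type tags, a record is just its fields' encodings concatenated in schema order, which is why encoding directly from a document node against a known schema can avoid an intermediate value representation.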
Dekaf acts as a Kafka broker and supports a sufficient portion of the
protocol for Kafka consumers to fetch data from Estuary collections as
if they were Kafka topics.

It also implements a subset of the Confluent Schema Registry API
and performs JSON Schema => Avro schema transliteration and read-time
encoding, which allows readers to consume "topic" (collection) data
as if it were natively Avro-encoded with a registered schema.
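
As background on what Schema Registry clients typically expect, here is a sketch of the Confluent wire format: a zero magic byte, a 4-byte big-endian schema ID, then the Avro-encoded body. Whether Dekaf frames messages exactly this way is an assumption, since this PR text does not say so, and the names `frame_confluent` and `parse_confluent` are hypothetical.

```rust
/// Frames an Avro-encoded body in the Confluent wire format:
/// magic byte 0, big-endian 4-byte schema ID, then the body.
/// Hypothetical helper; not taken from the dekaf crate.
fn frame_confluent(schema_id: u32, avro_body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + avro_body.len());
    out.push(0u8);
    out.extend_from_slice(&schema_id.to_be_bytes());
    out.extend_from_slice(avro_body);
    out
}

/// Splits a framed message into (schema_id, body).
/// Returns None if the header is too short or the magic byte is wrong.
fn parse_confluent(msg: &[u8]) -> Option<(u32, &[u8])> {
    if msg.len() < 5 || msg[0] != 0 {
        return None;
    }
    let id = u32::from_be_bytes([msg[1], msg[2], msg[3], msg[4]]);
    Some((id, &msg[5..]))
}
```

A consumer reads the schema ID from the header, fetches that schema from the registry, and decodes the remaining bytes against it.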
Replace serde with simd_doc::Parser::parse_one() for parsing documents
into doc::HeapNode.

Introduce a runtime::Accumulator composite struct which carries a
simd_doc::Parser instance.
Also, `protoc` isn't required to build `flowctl`.
@jgraettinger jgraettinger force-pushed the johnny/read-transcode branch from 54cc9b2 to 305d8e0 Compare May 3, 2024 19:23
@jgraettinger jgraettinger merged commit 6fb15f6 into master May 3, 2024
5 checks passed
@jgraettinger jgraettinger deleted the johnny/read-transcode branch May 3, 2024 19:24
github-actions bot pushed a commit to estuary/homebrew-flowctl that referenced this pull request May 31, 2024
## What's Changed

* `sum` annotation now supports arbitrary precision using string-encoded numerics
* Add experimental `flowctl raw stats` sub-command
* Various minor JSON Schema handling improvements.
* Switch to simd-json for fast JSON parsing and transcoding.

### Filtered PRs impacting `flowctl`:

* crates/json: don't validate strings with underscores as integers or numbers by @williamhbaker in estuary/flow#1364
* Update `runtime::container::start()` to take a new `allow_local` flag by @jshearer in estuary/flow#1361
* json: fix ordering of integers greater than i64::MAX by @psFried in estuary/flow#1367
* validation: fix bucket name validation for GCS and Azure by @psFried in estuary/flow#1370
* thread through `--allow-local` argument when running locally by @psFried in estuary/flow#1374
* validation: allow unsatisfiable constraints on excluded fields by @psFried in estuary/flow#1375
* update a number of dependencies, including RocksDB (to 8.10) by @jgraettinger in estuary/flow#1389
* connector-init: set connector_type on protocol check Spec by @jgraettinger in estuary/flow#1400
* models/journals: region configuration for S3 storage mappings by @williamhbaker in estuary/flow#1410
* improve schema validation errors by including metadata about the collection that failed by @jgraettinger in estuary/flow#1408
* flowctl: resurrect stats subcommand under raw by @psFried in estuary/flow#1432
* make: codesign binaries on mac by @mdibaiee in estuary/flow#1436
* simd-doc, gazette, avro, and dekaf crates by @jgraettinger in estuary/flow#1448
* flowctl(preview): multiple bindings may read from one collection by @mdibaiee in estuary/flow#1466
* crates/doc: support arbitrary precision with `sum` annotation by @jgraettinger in estuary/flow#1477
* crates/doc: relax `sum` inspection to allow numeric strings by @jgraettinger in estuary/flow#1481

**Full Changelog**: estuary/flow@v0.3.12...v0.3.13