Add benchmarks for magnolify-parquet vs parquet-avro R/W #1040

Merged 11 commits on Sep 20, 2024

@clairemcginty clairemcginty commented Sep 18, 2024

Adds benchmarks for Parquet read/write performance, for both magnolify-parquet and parquet-avro (although we don't own parquet-avro, it's helpful to compare against IMO).

Parquet is a little tricky in that it doesn't have a granular "write/read a single record to/from a file" operation, due to its complex file structure and encodings. This benchmark sets up an in-memory page store that can read or write Parquet "groups", which are Parquet's internal record structure. Read/write is invoked with a record type T and a matching RecordConverter[T], which converts either case classes (magnolify-parquet) or Avro records (parquet-avro) to and from Parquet groups. Thus, what we're benchmarking here is Group-to-record and record-to-Group conversion, which is the core functionality of magnolify-parquet 👍
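To make the shape of that setup concrete, here is a minimal plain-Scala sketch of "record type T plus a matching converter, round-tripped through an in-memory store". Everything in it is a hypothetical stand-in for illustration: `Group`, `RecordConverter`, `InMemoryStore`, and `User` are toy types defined here, not the parquet-java or magnolify-parquet classes, and the real benchmark works on Parquet pages rather than a simple buffer.

```scala
object ConverterSketch {
  // Hypothetical stand-in for a Parquet group: an ordered list of (field name, value) pairs.
  final case class Group(fields: Vector[(String, Any)]) {
    def get[A](name: String): A =
      fields.collectFirst { case (`name`, v) => v.asInstanceOf[A] }.get
  }

  // Hypothetical stand-in for the converter that is paired with the record type T.
  trait RecordConverter[T] {
    def toGroup(record: T): Group
    def fromGroup(group: Group): T
  }

  // Hypothetical stand-in for the in-memory page store: groups live in a buffer, no file I/O.
  final class InMemoryStore {
    private val buffer = scala.collection.mutable.ArrayBuffer.empty[Group]

    def write[T](record: T)(implicit c: RecordConverter[T]): Unit =
      buffer += c.toGroup(record)

    def readAll[T](implicit c: RecordConverter[T]): Vector[T] =
      buffer.iterator.map(c.fromGroup).toVector
  }

  // Example record type; the case-class side of the conversion.
  final case class User(id: Long, name: String)

  implicit val userConverter: RecordConverter[User] = new RecordConverter[User] {
    def toGroup(u: User): Group = Group(Vector("id" -> u.id, "name" -> u.name))
    def fromGroup(g: Group): User = User(g.get[Long]("id"), g.get[String]("name"))
  }

  // Write path (record-to-Group) followed by read path (Group-to-record).
  val users: Vector[User] = Vector(User(1L, "a"), User(2L, "b"))
  val store = new InMemoryStore
  users.foreach(u => store.write(u))
  val roundTripped: Vector[User] = store.readAll[User]
}
```

In this toy version the only work done per record is the `toGroup` / `fromGroup` call, which mirrors why the benchmark isolates conversion cost rather than file I/O.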

Results (run locally on a 64GB M1 Mac with OpenJDK 17.0.5):

% sbt "jmh/jmh:run -i 10 -wi 10 -f1 -t .*parquet.*"
[info] Benchmark                            Mode  Cnt      Score     Error  Units
[info] ParquetBench.parquetReadAvro         avgt   10  12693.357 ± 208.175  ns/op
[info] ParquetBench.parquetReadMagnolify    avgt   10  13695.172 ± 311.972  ns/op
[info] ParquetBench.parquetWriteAvro        avgt   10   9621.541 ±  81.569  ns/op
[info] ParquetBench.parquetWriteMagnolify   avgt   10   5527.228 ±  70.377  ns/op
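
For a quick read of these numbers, the sketch below computes the magnolify-parquet to parquet-avro time ratio for each direction from the mean scores above. It is only arithmetic on the reported means: it ignores the error intervals, and the object and value names are made up for illustration.

```scala
object BenchRatios {
  // Mean scores in ns/op, copied from the JMH output above (lower is better).
  val readAvro = 12693.357
  val readMagnolify = 13695.172
  val writeAvro = 9621.541
  val writeMagnolify = 5527.228

  // Ratio of magnolify-parquet time to parquet-avro time; below 1.0 means magnolify is faster.
  val readRatio: Double = readMagnolify / readAvro    // about 1.08
  val writeRatio: Double = writeMagnolify / writeAvro // about 0.57

  def main(args: Array[String]): Unit = {
    println(f"read:  magnolify/avro = $readRatio%.3f")
    println(f"write: magnolify/avro = $writeRatio%.3f")
  }
}
```

On this single local run, that works out to magnolify-parquet reads taking roughly 8% longer than parquet-avro, and writes taking a bit under 60% of the parquet-avro time.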

/**
 * An in-memory Parquet page store modeled after parquet-java's MemPageStore, used to benchmark
 * ParquetType conversion between Parquet Groups and Scala case classes
 */
class ParquetInMemoryPageStore(rowCount: Long) extends PageReadStore with PageWriteStore {
@clairemcginty clairemcginty Sep 18, 2024


These classes are heavily based on this parquet-java package, which sadly is not part of any published artifact: https://github.com/apache/parquet-java/tree/master/parquet-column/src/test/java/org/apache/parquet/column/page/mem


codecov bot commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.50%. Comparing base (a3708ba) to head (bded9b4).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1040   +/-   ##
=======================================
  Coverage   95.50%   95.50%           
=======================================
  Files          56       56           
  Lines        1980     1980           
  Branches      186      186           
=======================================
  Hits         1891     1891           
  Misses         89       89           


@clairemcginty clairemcginty merged commit 073c2e3 into main Sep 20, 2024
13 checks passed
@clairemcginty clairemcginty deleted the parquet-read-write-bench branch September 20, 2024 11:11