engineccl: Add benchmark for ctr_stream encryption #113999

Merged
merged 1 commit into from
Dec 1, 2023

@bdarnell bdarnell commented Nov 8, 2023

Start measuring performance of this code in anticipation of improving it.

Epic: none

Release note: None

@bdarnell bdarnell requested a review from a team as a code owner November 8, 2023 02:46
@bdarnell bdarnell requested a review from sumeerbhola November 8, 2023 02:46

bdarnell commented Nov 8, 2023

"Before" numbers on a gceworker:

goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
BenchmarkCTRBlockCipherStream
BenchmarkCTRBlockCipherStream/key=128,block=16
BenchmarkCTRBlockCipherStream/key=128,block=16-24         	35133664	        37.72 ns/op	 424.16 MB/s
BenchmarkCTRBlockCipherStream/key=128,block=1024
BenchmarkCTRBlockCipherStream/key=128,block=1024-24       	  563631	      2134 ns/op	 479.85 MB/s
BenchmarkCTRBlockCipherStream/key=128,block=10240
BenchmarkCTRBlockCipherStream/key=128,block=10240-24      	   56616	     21501 ns/op	 476.26 MB/s
BenchmarkCTRBlockCipherStream/key=192,block=16
BenchmarkCTRBlockCipherStream/key=192,block=16-24         	32802205	        36.63 ns/op	 436.78 MB/s
BenchmarkCTRBlockCipherStream/key=192,block=1024
BenchmarkCTRBlockCipherStream/key=192,block=1024-24       	  517165	      2289 ns/op	 447.39 MB/s
BenchmarkCTRBlockCipherStream/key=192,block=10240
BenchmarkCTRBlockCipherStream/key=192,block=10240-24      	   51337	     22827 ns/op	 448.60 MB/s
BenchmarkCTRBlockCipherStream/key=256,block=16
BenchmarkCTRBlockCipherStream/key=256,block=16-24         	30893163	        40.18 ns/op	 398.22 MB/s
BenchmarkCTRBlockCipherStream/key=256,block=1024
BenchmarkCTRBlockCipherStream/key=256,block=1024-24       	  492710	      2438 ns/op	 420.03 MB/s
BenchmarkCTRBlockCipherStream/key=256,block=10240
BenchmarkCTRBlockCipherStream/key=256,block=10240-24      	   49321	     24286 ns/op	 421.64 MB/s
PASS

On an M2 MacBook it's about twice as fast, at roughly 900 MB/s.

I have a quick win for an 18% improvement (coming in a separate PR so we can get the before numbers into roachperf), but the real prize will be making the larger batch sizes useful. We can compare to openssl speed -evp aes-128-ctr:

$ openssl speed -evp aes-128-ctr
Doing aes-128-ctr for 3s on 16 size blocks: 100356351 aes-128-ctr's in 3.00s
Doing aes-128-ctr for 3s on 64 size blocks: 81137142 aes-128-ctr's in 3.00s
Doing aes-128-ctr for 3s on 256 size blocks: 42020281 aes-128-ctr's in 3.00s
Doing aes-128-ctr for 3s on 1024 size blocks: 14060738 aes-128-ctr's in 3.00s
Doing aes-128-ctr for 3s on 8192 size blocks: 1941268 aes-128-ctr's in 3.00s
Doing aes-128-ctr for 3s on 16384 size blocks: 980920 aes-128-ctr's in 3.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Wed May 24 17:14:51 2023 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-mSG92N/openssl-1.1.1f=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-ctr     535233.87k  1730925.70k  3585730.65k  4799398.57k  5300955.82k  5357131.09k

We're not far behind at the 16-byte block size (424 MB/s vs. openssl's 535 MB/s), but openssl is able to take advantage of larger blocks and reach about 5.4 GB/s at 16384 bytes, roughly 10x the speed.

@sumeerbhola sumeerbhola left a comment

:lgtm:

We should also benchmark fileCipherStream.Encrypt, or alternatively benchmark that instead of newCTRBlockCipherStream (though that won't cover the read path -- but reads often hit in the block cache, so maybe that is ok). It is a thin wrapper that does the iteration over chunks of size ctrBlockSize that you are doing here. Also, it includes some copying when the data is not completely aligned at the end with the ctrBlockSize, which is important to include in the cost when trying to increase block size.

We are using 32KB blocks in Pebble, but that is the uncompressed size; fileCipherStream.Encrypt sees the block after compression. I am not sure what compression ratio we actually see (I vaguely remember seeing ~3x in some sstables, but I could be misremembering).

Hmm, looks like we typically do two writes, one for the block and one for the trailer (5 bytes) in https://github.com/cockroachdb/pebble/blob/81d9a4fea01c84f2213cf68ccdc886dc2febc15f/sstable/writer.go#L1788-L1794.
That suggests we should benchmark the write path at an even higher level, by writing an sstable. We could set block size to be 16KB (to approximate 2x compression), turn off compression, and vary the key and value sizes so each block is not exactly 16KB.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

bdarnell commented Nov 8, 2023

Ah yes, fileCipherStream is a better benchmark target. I skipped over it because I didn't want to deal with files but it's actually the level I want.

I want to keep this benchmark pretty low-level for simplicity instead of benchmarking the entire sst write process. It's true that consolidating those two writes would help a bit but an extra encryption call every 16KB won't make much of a difference. The openssl speed results show diminishing returns with larger block sizes; once we're calling the encryption function once per kilobyte instead of once per 16 bytes, we'll have gotten enough of the benefit that I don't think we'll need to push further here.
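
The diminishing returns can be read directly off the openssl speed row quoted earlier in this thread. The short Go sketch below (the names `opensslKBps` and `fractionOfPeak` are made up for illustration) expresses each block size's throughput as a fraction of the 16384-byte figure: 16-byte calls reach about 10% of it, while 1024-byte calls already reach just under 90%.

```go
package main

import "fmt"

// opensslKBps copies the aes-128-ctr row from the openssl speed output above:
// block size in bytes -> throughput in thousands of bytes per second.
var opensslKBps = map[int]float64{
	16:    535233.87,
	64:    1730925.70,
	256:   3585730.65,
	1024:  4799398.57,
	8192:  5300955.82,
	16384: 5357131.09,
}

// fractionOfPeak returns the throughput at blockSize relative to the
// throughput at the largest measured block size (16384 bytes).
func fractionOfPeak(blockSize int) float64 {
	return opensslKBps[blockSize] / opensslKBps[16384]
}

func main() {
	for _, size := range []int{16, 64, 256, 1024, 8192, 16384} {
		fmt.Printf("%6d bytes: %5.1f%% of the 16384-byte throughput\n",
			size, 100*fractionOfPeak(size))
	}
}
```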

One more benchmark result: the FIPS-ready build is very slow, only 68 MB/s:

goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
BenchmarkCTRBlockCipherStream/key=128,block=16-24         	 5093323	       233.8 ns/op	  68.43 MB/s
BenchmarkCTRBlockCipherStream/key=128,block=1024-24       	   81940	     14668 ns/op	  69.81 MB/s
BenchmarkCTRBlockCipherStream/key=128,block=10240-24      	    8038	    147555 ns/op	  69.40 MB/s
BenchmarkCTRBlockCipherStream/key=192,block=16-24         	 5102220	       232.3 ns/op	  68.87 MB/s
BenchmarkCTRBlockCipherStream/key=192,block=1024-24       	   80656	     14845 ns/op	  68.98 MB/s
BenchmarkCTRBlockCipherStream/key=192,block=10240-24      	    8004	    148217 ns/op	  69.09 MB/s
BenchmarkCTRBlockCipherStream/key=256,block=16-24         	 5033361	       235.7 ns/op	  67.88 MB/s
BenchmarkCTRBlockCipherStream/key=256,block=1024-24       	   79814	     14907 ns/op	  68.69 MB/s
BenchmarkCTRBlockCipherStream/key=256,block=10240-24      	    7963	    150181 ns/op	  68.18 MB/s
PASS

@sumeerbhola sumeerbhola left a comment

> FIPS-ready build is very slow, only 68 MB/s

That is crazy slow. Do you know why?

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell)

bdarnell commented Nov 8, 2023

Yes, it's because it crosses the cgo boundary every 16 bytes, for something that should be a single CPU instruction.
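
A rough way to see what a fixed per-call cost does is a simple cost model, sketched below in Go. The model is an assumption, not a measurement: it attributes the entire difference between the FIPS-ready and regular 16-byte results above to a fixed per-call overhead, and treats the regular build's time as pure per-byte work. The function names are hypothetical. Under those assumptions it estimates how throughput would change if the overhead were paid once per larger batch instead of once per 16 bytes.

```go
package main

import "fmt"

// nsPerCall converts a throughput in MB/s (1 MB = 1e6 bytes, the unit Go
// benchmarks report) into nanoseconds per call of bytesPerCall bytes.
func nsPerCall(mbPerSec float64, bytesPerCall int) float64 {
	return float64(bytesPerCall) / (mbPerSec * 1e6) * 1e9
}

// amortizedMBps estimates throughput when a fixed overhead (ns) is paid once
// per batch of batchBytes and each byte costs nsPerByte. Assumed model only.
func amortizedMBps(overheadNs, nsPerByte float64, batchBytes int) float64 {
	totalNs := overheadNs + nsPerByte*float64(batchBytes)
	return float64(batchBytes) / totalNs * 1e3 // bytes/ns -> MB/s
}

func main() {
	fipsNs := nsPerCall(68.43, 16)     // FIPS-ready build, 16-byte result
	regularNs := nsPerCall(424.16, 16) // regular build, 16-byte result
	overheadNs := fipsNs - regularNs   // assumed to be all boundary cost
	nsPerByte := regularNs / 16

	fmt.Printf("assumed per-call overhead: %.1f ns\n", overheadNs)
	for _, batch := range []int{16, 256, 1024, 16384} {
		fmt.Printf("batch %6d bytes: %.0f MB/s (estimated)\n",
			batch, amortizedMBps(overheadNs, nsPerByte, batch))
	}
}
```

The estimates are only as good as the model; a real batched implementation would need its own benchmark.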

@bdarnell

Rewriting the benchmark to use fileCipherStream shows that this function has significant performance overhead for small (even block-aligned) reads:

BenchmarkFileCipherStream/fips=false/key=128/block=16/-24               17483790                64.06 ns/op      249.77 MB/s
BenchmarkFileCipherStream/fips=false/key=128/block=1024/-24               534123              2223 ns/op         460.60 MB/s
BenchmarkFileCipherStream/fips=false/key=128/block=10240/-24               55297             21707 ns/op         471.73 MB/s
BenchmarkFileCipherStream/fips=false/key=192/block=16/-24               16831272                65.84 ns/op      243.02 MB/s
BenchmarkFileCipherStream/fips=false/key=192/block=1024/-24               503914              2380 ns/op         430.17 MB/s
BenchmarkFileCipherStream/fips=false/key=192/block=10240/-24               51432             23333 ns/op         438.86 MB/s
BenchmarkFileCipherStream/fips=false/key=256/block=16/-24               16352678                68.27 ns/op      234.36 MB/s
BenchmarkFileCipherStream/fips=false/key=256/block=1024/-24               467551              2545 ns/op         402.28 MB/s
BenchmarkFileCipherStream/fips=false/key=256/block=10240/-24               48033             24934 ns/op         410.69 MB/s

BenchmarkFileCipherStream/fips=true/key=128/block=16/-24                 4374481               269.1 ns/op        59.45 MB/s
BenchmarkFileCipherStream/fips=true/key=128/block=1024/-24                 79358             14932 ns/op          68.58 MB/s
BenchmarkFileCipherStream/fips=true/key=128/block=10240/-24                 7944            148345 ns/op          69.03 MB/s
BenchmarkFileCipherStream/fips=true/key=192/block=16/-24                 4381461               271.7 ns/op        58.89 MB/s
BenchmarkFileCipherStream/fips=true/key=192/block=1024/-24                 79731             15064 ns/op          67.98 MB/s
BenchmarkFileCipherStream/fips=true/key=192/block=10240/-24                 7804            149125 ns/op          68.67 MB/s
BenchmarkFileCipherStream/fips=true/key=256/block=16/-24                 4301301               271.7 ns/op        58.90 MB/s
BenchmarkFileCipherStream/fips=true/key=256/block=1024/-24                 77972             14914 ns/op          68.66 MB/s
BenchmarkFileCipherStream/fips=true/key=256/block=10240/-24                 7653            149559 ns/op          68.47 MB/s

I've also rebased onto #114709 so I can verify FIPS status. Only the last commit is for this PR.

@bdarnell bdarnell force-pushed the ctr-encryption-benchmark branch from 41fa7a8 to f0f8bb1 Compare November 27, 2023 20:51
@bdarnell bdarnell requested review from a team as code owners November 27, 2023 20:51
@bdarnell bdarnell force-pushed the ctr-encryption-benchmark branch from f0f8bb1 to d3077b0 Compare November 28, 2023 20:16
@sumeerbhola sumeerbhola left a comment

:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell)

bdarnell commented Dec 1, 2023

bors r=sumeerbhola

craig bot commented Dec 1, 2023

Build succeeded:

@craig craig bot merged commit 19ce36d into cockroachdb:master Dec 1, 2023
8 checks passed