Store issuer chains #132

Merged
merged 15 commits into transparency-dev:main from issuernoreorg
Aug 22, 2024

Conversation

phbnf
Contributor

@phbnf phbnf commented Aug 12, 2024

#88

@codecov-commenter

codecov-commenter commented Aug 12, 2024

Codecov Report

Attention: Patch coverage is 11.11111% with 72 lines in your changes missing coverage. Please review.

Project coverage is 34.65%. Comparing base (46ec9c2) to head (fdd5de7).
Report is 78 commits behind head on main.

File                                         Patch %   Lines missing
personalities/sctfe/storage/gcp/issuers.go     0.00%   51 missing ⚠️
personalities/sctfe/storage.go                 0.00%   12 missing ⚠️
personalities/sctfe/ct_server_gcp/main.go      0.00%    7 missing ⚠️
personalities/sctfe/handlers.go               33.33%    1 missing, 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #132      +/-   ##
==========================================
- Coverage   35.80%   34.65%   -1.16%     
==========================================
  Files          16       34      +18     
  Lines        1363     2955    +1592     
==========================================
+ Hits          488     1024     +536     
- Misses        801     1820    +1019     
- Partials       74      111      +37     


@phbnf phbnf force-pushed the issuernoreorg branch 2 times, most recently from 3c45769 to bf979c6 on August 14, 2024 12:54
@phbnf phbnf requested a review from AlCutter August 14, 2024 12:56
@phbnf phbnf marked this pull request as ready for review August 14, 2024 13:05
@phbnf phbnf force-pushed the issuernoreorg branch 2 times, most recently from 2944928 to 1b5388a on August 15, 2024 15:38
personalities/sctfe/ct_server_gcp/main.go (outdated review thread, resolved)
personalities/sctfe/storage.go (outdated review thread, resolved)
```go
func (cts *CTStorage) AddIssuerChain(ctx context.Context, chain []*x509.Certificate) error {
	errG := errgroup.Group{}
	for _, c := range chain {
		errG.Go(func() error {
```
Collaborator

Did you consider pushing this down into the storage implementation to decide how best to store these entries? (e.g. some storage infra may do better with some sort of non-transactional batch insert as opposed to several concurrent write transactions).

May be more of an issue for fresh logs and become less so once the chain storage is close to being maximally populated.
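
Purely as an illustration of the shape being suggested here, a minimal sketch in which the storage layer receives every issuer of a chain in a single call and decides internally whether to batch or fan out. The issuerStorage type, the kvWriter/SetMulti interface, and the content-addressed keying are assumptions for the sketch, not code from this PR.

```go
package storage

import (
	"context"
	"crypto/sha256"
	"crypto/x509"
	"fmt"
)

// kvWriter is a hypothetical lower-level interface exposing a native batch write.
type kvWriter interface {
	SetMulti(ctx context.Context, objs map[string][]byte) error
}

// issuerStorage owns the decision of how to persist a set of issuer certs
// (single batch insert, concurrent writes, or anything else its backend prefers).
type issuerStorage struct {
	kv kvWriter
}

// AddIssuers receives all issuers of a chain at once, so the backend can pick
// a batch insert instead of one write transaction per certificate.
func (s *issuerStorage) AddIssuers(ctx context.Context, issuers []*x509.Certificate) error {
	objs := make(map[string][]byte, len(issuers))
	for _, c := range issuers {
		key := fmt.Sprintf("%x", sha256.Sum256(c.Raw)) // assumed content-addressed keying
		objs[key] = c.Raw
	}
	return s.kv.SetMulti(ctx, objs)
}
```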

Contributor Author

Storage implementations can, if they want to, stage all the requests and then batch-write them. Each goroutine will block until then. That would be batched across all AddChain requests though, which means that two issuers.Add calls for the same chain might not end up in the same batch.

I could change the API to accept multiple k,v pairs instead, but since map keys can't be of type []byte, I'd need to pass a map[string][]byte, or two parallel [][]byte slices that are meant to track one another. Can you think of a better way of doing this?

The PR that will come after this one adds a local cache. That means issuers.Add won't necessarily be called on all the issuers. So if we modify the API to take multiple k,v pairs, I'd have to batch caching as well, and then call issuers.Add with a potential "skipped chain".

It feels to me that this would add a non-trivial amount of complexity. What do you think?

Collaborator

> Storage implementations can, if they want to, stage all the requests and then batch-write them. Each goroutine will block until then. That would be batched across all AddChain requests though, which means that two issuers.Add calls for the same chain might not end up in the same batch.

I agree that storage implementations should be free to batch up writes if that's what's best for them, but your proposal here is a multiplier for the goroutines which are serving the HTTP requests these certs came from. I.e. if the queue is configured to flush at 1000 entries, there will be 1000 goroutines parked on the futures; if each of those chains has 3 entries, then with this change we'll have 1000 + 3*1000 = 4000 parked goroutines.

Obviously that's fine if that really is the best way for the storage impl to handle the calls to Set, but if we're saying that storage impls which would prefer a native "batch" write then have to block and "fan in" all those extra 3000 goroutines just to get back to a slice of entries, it seems like we've let an implementation detail of "lots of goroutines" storage types leak out into this higher level.

It feels like the "fan-out" should be done by the storage implementation, iff that's what's needed?

> I could change the API to accept multiple k,v pairs instead, but since map keys can't be of type []byte, I'd need to pass a map[string][]byte, or two parallel [][]byte slices that are meant to track one another. Can you think of a better way of doing this?

Hmm, could you pass a []struct{K []byte, V []byte}?

As you say, these would need to be independently added rather than transactional (i.e. I think that means the semantics of the API become a bit more AddTheseIntermediates() than AddIssuerChain()), both for local caching reasons, and also for distributed deployments where there's a central shared intermediates cache.
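
As a sketch of the []struct{K, V []byte} shape and the "add these intermediates" semantics (illustrative only; the KV and IssuerStorage names here are not necessarily what the PR ended up with):

```go
package storage

import (
	"context"
	"crypto/sha256"
	"crypto/x509"
)

// KV is a single key/value pair, standing in for Go's missing tuple type.
type KV struct {
	K []byte
	V []byte
}

// IssuerStorage stores issuer certificates independently rather than
// transactionally, so the semantics are closer to "add these intermediates"
// than "add this chain".
type IssuerStorage interface {
	AddIssuers(ctx context.Context, kvs []KV) error
}

// chainToKVs converts a chain of issuer certs into the KV slice handed to storage,
// keyed here (as an assumption) by the SHA-256 of each cert's DER encoding.
func chainToKVs(chain []*x509.Certificate) []KV {
	kvs := make([]KV, 0, len(chain))
	for _, c := range chain {
		h := sha256.Sum256(c.Raw)
		kvs = append(kvs, KV{K: h[:], V: c.Raw})
	}
	return kvs
}
```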

Contributor Author

Yeah - I wanted to avoid defining a new struct because it leads to more dependencies. Go lacks tuples :/
I had a go at it (still need to add comments and co), but have a look and let me know what you think. I could go even further and leave it to the storage implementation to do the "Exists" calls.

While I'm starting with the SCTFE here, I'm thinking of making deduplication a feature that other personalities can use. That probably means we'll need a base "personalitystorage" package where KV will be defined?

Contributor Author

I think it's either this, or a map[string]string / map[string][]byte, and we do some type conversion.

Collaborator

I think it's probably ok for the storage impls to take a dep on a struct here - they have to meet the implicit interface requirement specified here anyway.

I think having the storage do the Exists inline with the Add would be good; they might be able to optimize that away - e.g. GCS might do a single WriteIfNotExists and save a round-trip, etc.

For now, I'd keep the chain storage and dedup storages separate and have one do []byte -> []byte and the other do []byte -> uint64 natively, rather than prematurely push them together just yet.
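
To illustrate the WriteIfNotExists idea: a minimal sketch assuming the cloud.google.com/go/storage client, where the existence check is folded into the write via a DoesNotExist precondition and a 412 response is treated as "already stored". The setIfNotExists helper is hypothetical, not code from this PR.

```go
package gcp

import (
	"context"
	"errors"
	"fmt"

	gcs "cloud.google.com/go/storage"
	"google.golang.org/api/googleapi"
)

// setIfNotExists writes data under name only if no object with that name exists yet.
// If the object already exists the precondition fails, which is treated as success.
func setIfNotExists(ctx context.Context, bkt *gcs.BucketHandle, name string, data []byte) error {
	w := bkt.Object(name).If(gcs.Conditions{DoesNotExist: true}).NewWriter(ctx)
	if _, err := w.Write(data); err != nil {
		_ = w.Close()
		return fmt.Errorf("writing %q: %v", name, err)
	}
	if err := w.Close(); err != nil {
		var gErr *googleapi.Error
		if errors.As(err, &gErr) && gErr.Code == 412 {
			// Precondition failed: the object is already there, nothing to do.
			return nil
		}
		return fmt.Errorf("closing %q: %v", name, err)
	}
	return nil
}
```

Whether this is actually cheaper than checking for existence first is exactly the trade-off discussed in the next comment (writes cost more than reads once the issuer set stops growing).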

Contributor Author
@phbnf phbnf Aug 21, 2024

> I think having the storage do the Exists inline with the Add would be good; they might be able to optimize that away - e.g. GCS might do a single WriteIfNotExists and save a round-trip, etc.

Maybe, yes. Not 100% sure, since writes are more expensive than reads, and eventually the list of issuers will stop growing, so it will be cheaper to check for existence first. But also, caching should make all of this even cheaper. There's already a TODO somewhere to evaluate this. I've also left a TODO for parallel writes; let's keep them out for now. I've pushed it down to the storage layer.

> For now, I'd keep the chain storage and dedup storages separate and have one do []byte -> []byte and the other do []byte -> uint64 natively, rather than prematurely push them together just yet.

In practice, I had written a wrapper around map.go to handle the conversion: https://github.com/phbnf/trillian-tessera/blob/dedup/personalities/sctfe/storage/gcp/dedup.go. Honestly, I'm not even super attached to this dedup code on GCS: I don't think any CT log will use it, since it doesn't scale price-wise. It might be useful for a test personality, though, or for other applications once deduplication becomes a "standard" personality plugin.
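
Sketching the kind of conversion wrapper described above (hypothetical names and interface; the real code lives in the linked dedup.go): a generic []byte -> []byte map store adapted to the []byte -> uint64 mapping a dedup layer wants, by encoding the index into the stored value.

```go
package gcp

import (
	"context"
	"encoding/binary"
)

// byteKV is the generic []byte -> []byte map storage being wrapped.
type byteKV interface {
	Set(ctx context.Context, key, value []byte) error
	Get(ctx context.Context, key []byte) ([]byte, bool, error)
}

// dedupStorage exposes the []byte -> uint64 mapping a deduplication layer
// wants (e.g. leaf identity hash -> index in the log) on top of a byteKV.
type dedupStorage struct {
	kv byteKV
}

func (d *dedupStorage) Add(ctx context.Context, leafHash []byte, idx uint64) error {
	v := make([]byte, 8)
	binary.BigEndian.PutUint64(v, idx)
	return d.kv.Set(ctx, leafHash, v)
}

func (d *dedupStorage) Get(ctx context.Context, leafHash []byte) (uint64, bool, error) {
	v, ok, err := d.kv.Get(ctx, leafHash)
	if err != nil || !ok {
		return 0, false, err
	}
	return binary.BigEndian.Uint64(v), true, nil
}
```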

personalities/sctfe/storage/gcp/map.go (outdated review thread, resolved)
personalities/sctfe/storage/gcp/map.go (outdated review thread, resolved)
@phbnf phbnf merged commit 5423c83 into transparency-dev:main Aug 22, 2024
8 checks passed
@phbnf phbnf mentioned this pull request Aug 29, 2024