Store issuer chains #132
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #132      +/-   ##
==========================================
- Coverage   35.80%   34.65%   -1.16%
==========================================
  Files          16       34      +18
  Lines        1363     2955    +1592
==========================================
+ Hits          488     1024     +536
- Misses        801     1820    +1019
- Partials       74      111      +37

☔ View full report in Codecov by Sentry.
personalities/sctfe/storage.go (Outdated)

func (cts *CTStorage) AddIssuerChain(ctx context.Context, chain []*x509.Certificate) error {
	errG := errgroup.Group{}
	for _, c := range chain {
		// One goroutine per certificate in the chain, each doing its own write.
		errG.Go(func() error {
Did you consider pushing this down into the storage implementation to decide how best to store these entries? (e.g. some storage infra may do better with some sort of non-transactional batch insert as opposed to several concurrent write transactions).
This may be more of an issue for fresh logs, and become less so once the chain storage is close to being maximally populated.
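
To illustrate the suggestion (a sketch only; the type and method names below are assumptions, not this PR's API), pushing the decision down could look like the personality handing the whole chain to the storage layer in one call:

```go
// Sketch only: IssuerStorage, AddIssuers and the issuers field are assumed
// names for illustration, not this PR's API.
package sketch

import (
	"context"
	"crypto/x509"
)

// IssuerStorage lets each backend decide how to persist a whole chain,
// e.g. one non-transactional batch insert vs. several concurrent writes.
type IssuerStorage interface {
	AddIssuers(ctx context.Context, chain []*x509.Certificate) error
}

// CTStorage stands in for the personality-level storage wrapper.
type CTStorage struct {
	issuers IssuerStorage
}

// AddIssuerChain no longer fans out per certificate here; that decision is
// pushed down into the storage implementation.
func (cts *CTStorage) AddIssuerChain(ctx context.Context, chain []*x509.Certificate) error {
	return cts.issuers.AddIssuers(ctx, chain)
}
```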
Storage implementations can, if they want to, stage all the requests and then batch-write them. Each goroutine will block until then. That batch would span all AddChain requests though, which means that two issuers.Add calls for the same chain might not end up in the same batch.
I could change the API to accept multiple k,v pairs instead, but since map keys can't be of type []byte, I'd need to pass a map[string][]byte, or two [][]byte slices that are meant to track one another. Can you think of a better way of doing this?
The PR that will come after this one adds a local cache. That means that issuers.Add won't necessarily be called on all the issuers. So if we modify the API to take multiple k,v pairs, I'd have to batch caching as well, and then call issuers.Add with a potential "skipped chain".
It feels to me that this would add a non-trivial amount of complexity. What do you think?
> Storage implementations can, if they want to, stage all the requests and then batch-write them. Each goroutine will block until then. That batch would span all AddChain requests though, which means that two issuers.Add calls for the same chain might not end up in the same batch.

I agree that storage implementations should be free to batch up writes if that's what's best for them, but your proposal here is a multiplier on the goroutines that are serving the HTTP requests these certs came from - i.e. if the queue is configured to flush at 1000 entries, there will be 1000 goroutines parked on the futures, and if each of those chains has 3 entries then with this change we'll have 1000 + 3*1000 parked goroutines.
Obviously that's fine if that really is the best way for the storage impl to handle the calls to Set, but if we're saying that storage impls which would prefer a native "batch" write then have to block and "fan in" all those extra 3000 goroutines just to get back to a slice of entries, it seems like we've let an implementation detail of "lots of goroutines"-style storage types leak out into this higher level.
Feels like the "fan-out" should be done by the storage implementation iff that's what's needed?

> I could change the API to accept multiple k,v pairs instead, but since map keys can't be of type []byte, I'd need to pass a map[string][]byte, or two [][]byte slices that are meant to track one another. Can you think of a better way of doing this?

Hmm, could you pass a []struct{K []byte, V []byte}?
As you say, these would need to be independently added rather than transactional (i.e. I think that means the semantics of the API become a bit more AddTheseIntermediates() rather than AddIssuerChain()), both for local caching reasons, but also because of distributed systems in the case where there's a central shared intermediates cache.
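
A rough sketch of that shape, with illustrative names and an assumed choice of key (SHA-256 of the certificate's DER bytes), might look like:

```go
// Sketch only: the names and the choice of key are illustrative.
package sketch

import (
	"context"
	"crypto/sha256"
	"crypto/x509"
)

// KV is a key/value pair; a slice of these avoids both map[string][]byte
// (string-converted keys) and two parallel [][]byte slices.
type KV struct {
	K []byte
	V []byte
}

// IssuerStorage adds each pair independently rather than transactionally,
// i.e. "add these intermediates" rather than "add this chain"; implementations
// are free to batch, parallelize, or skip pairs that already exist.
type IssuerStorage interface {
	AddIssuersIfNotExist(ctx context.Context, kvs []KV) error
}

// chainToKVs shows how a caller might build the pairs, assuming the key is
// the SHA-256 of the certificate's DER bytes and the value is the DER itself.
func chainToKVs(chain []*x509.Certificate) []KV {
	kvs := make([]KV, 0, len(chain))
	for _, c := range chain {
		h := sha256.Sum256(c.Raw)
		kvs = append(kvs, KV{K: h[:], V: c.Raw})
	}
	return kvs
}
```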
Yeah - I wanted to avoid defining a new struct because it leads to more dependencies. Go lacks tuples :/
I had a go at it (I still need to add comments and co.), but have a look and let me know what you think. I could go even further and leave it to the storage implementation to do the "Exists" calls.
While I'm starting with the SCTFE here, I'm thinking of making deduplication a feature that other personalities can use. That probably means we'll need a base "personalitystorage" package where KV will be defined?
I think it's either this, or a map[string]string / map[string][]byte, and we do some type conversion.
I think it's probably ok for the storage impls to take a dep on a struct here - they have to meet the implicit interface requirements specified here anyway.
I think having the storage do the Exists inline with the Add would be good; they might be able to optimize that away - e.g. GCS might do a single WriteIfNotExists and save a round-trip, etc.
For now, I'd keep the chain storage and dedup storages separate, and have one do []byte -> []byte and the other do []byte -> uint64 natively, rather than prematurely push them together just yet.
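
To make the "keep them separate" point concrete, a sketch with illustrative names could be two narrow interfaces, one per store; the method names and the meaning of the uint64 are assumptions here:

```go
// Sketch only: two deliberately separate, narrow stores.
package sketch

import "context"

// IssuerStore holds issuer certificates, mapping []byte keys to []byte values.
type IssuerStore interface {
	Exists(ctx context.Context, key []byte) (bool, error)
	Add(ctx context.Context, key, value []byte) error
}

// DedupStore maps []byte keys to uint64 values natively, e.g. the index a
// previously-seen entry was assigned.
type DedupStore interface {
	Get(ctx context.Context, key []byte) (idx uint64, ok bool, err error)
	Set(ctx context.Context, key []byte, idx uint64) error
}
```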
> I think having the storage do the Exists inline with the Add would be good; they might be able to optimize that away - e.g. GCS might do a single WriteIfNotExists and save a round-trip, etc.

Maybe, yes. I'm not 100% sure, since writes are more expensive than reads, and eventually the list of issuers will stop growing, so it will be cheaper to check for existence first. But also, caching should make all of this even cheaper. There's already a TODO somewhere to evaluate this. I've also left a TODO for parallel writes; let's keep them out for now. I've pushed it down to the storage layer.

> For now, I'd keep the chain storage and dedup storages separate, and have one do []byte -> []byte and the other do []byte -> uint64 natively, rather than prematurely push them together just yet.

In practice, I had written a wrapper around map.go to handle the conversion: https://github.com/phbnf/trillian-tessera/blob/dedup/personalities/sctfe/storage/gcp/dedup.go. Honestly, I'm not even super attached to this dedup code on GCS; I don't think any CT log will use it, as it doesn't scale price-wise. It might be useful for a test personality though, or for other applications once deduplication becomes a "standard" personality plugin.
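
For illustration only (this is not the linked dedup.go), such a wrapper can be little more than encoding the uint64 into the value of a generic []byte -> []byte store:

```go
// Sketch only: the shape of a uint64-over-bytes dedup wrapper.
package sketch

import (
	"context"
	"encoding/binary"
)

// KVStore is an assumed generic []byte -> []byte key/value store.
type KVStore interface {
	Get(ctx context.Context, key []byte) (value []byte, ok bool, err error)
	Set(ctx context.Context, key, value []byte) error
}

// Dedup presents a []byte -> uint64 API on top of a KVStore by encoding the
// uint64 into the stored value.
type Dedup struct {
	kv KVStore
}

func (d *Dedup) Set(ctx context.Context, key []byte, idx uint64) error {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, idx)
	return d.kv.Set(ctx, key, buf)
}

func (d *Dedup) Get(ctx context.Context, key []byte) (uint64, bool, error) {
	v, ok, err := d.kv.Get(ctx, key)
	if err != nil || !ok {
		return 0, false, err
	}
	return binary.BigEndian.Uint64(v), true, nil
}
```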