
Implement key synchronization. #32

Merged · 99 commits into master from key-sync · Sep 25, 2023
Conversation

@NullHypothesis (Contributor) commented Aug 8, 2023

Resolves #10

@rillian (Contributor) commented Aug 8, 2023

Looks like a good start. Initial thoughts:

  • The attester interface seems generally useful. It could land separately to reduce the size of this PR.

  • In addition to reporting update failures, the leader should probably drop workers from its list if it can't update them, to handle instances that have failed. We probably also want some guards against stale nodes, especially since key rotation may happen no more often than every few weeks. Maybe workers should re-register periodically, as a keep-alive against expiry from the leader's list. Likewise, workers should periodically check their key material against the leader and terminate if they haven't received an update. Maybe the two could be combined into some sort of keep-alive ping? (A rough sketch of leader-side expiry follows below.)
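To make the expiry idea concrete, here is a rough sketch of leader-side pruning of stale workers. The `workerSet` type, the `maxHeartbeatAge` constant, and the five-minute threshold are illustrative assumptions, not code from this PR:

```go
package main

import (
	"log"
	"sync"
	"time"
)

// maxHeartbeatAge is a hypothetical threshold: a worker that hasn't sent a
// heartbeat within this window is considered stale.
const maxHeartbeatAge = 5 * time.Minute

// workerSet tracks registered workers by address and when each was last seen.
type workerSet struct {
	sync.Mutex
	lastSeen map[string]time.Time
}

// pruneStale drops every worker whose last heartbeat is older than
// maxHeartbeatAge, so failed instances don't linger in the leader's list.
func (s *workerSet) pruneStale() {
	s.Lock()
	defer s.Unlock()
	for addr, seen := range s.lastSeen {
		if time.Since(seen) > maxHeartbeatAge {
			log.Printf("Dropping stale worker %s (last seen %v ago).", addr, time.Since(seen).Round(time.Second))
			delete(s.lastSeen, addr)
		}
	}
}

func main() {
	s := &workerSet{lastSeen: map[string]time.Time{
		"worker-1:8443": time.Now().Add(-10 * time.Minute), // Stale; will be dropped.
		"worker-2:8443": time.Now(),                        // Fresh; stays registered.
	}}
	s.pruneStale()
}
```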

@NullHypothesis (Contributor, Author) commented Aug 16, 2023

Quick summary of where we are with the PR:

  • From the leader's PoV:
    • Upon receiving a worker's registration, the leader immediately initiates key synchronization.
    • Upon receiving a worker's heartbeat, the leader updates the worker's "last seen" timestamp and lets the worker know whether its key material is up to date.
    • Upon receiving new keys from star-randsrv, the leader immediately re-synchronizes with all registered workers.
    • If a given worker hasn't sent a heartbeat in X minutes, the leader logs an error and removes it from the worker pool.
  • From the worker's PoV:
    • Immediately after bootstrapping, the worker registers itself with the leader.
    • Every X minutes, the worker sends a heartbeat to the leader, containing a hash over its key material. If the leader signals that the key material is outdated, the worker re-registers itself. (A sketch of this loop follows after the list.)
    • If key synchronization fails, the worker terminates.
    • If the leader is temporarily unavailable for the heartbeat, the worker logs an error.
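For illustration, here is a minimal sketch of the worker-side heartbeat loop described above. The endpoint path, the non-200 "keys outdated" signal, and the `register` helper are assumptions for the sketch, not necessarily what sync_worker.go implements:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type heartbeat struct {
	HashedKeys string `json:"hashed_keys"`
}

// sendHeartbeat posts the Base64-encoded SHA-256 hash of the worker's key
// material to the leader.  A non-200 response stands in for the leader
// signalling "your keys are outdated", in which case the worker re-registers.
func sendHeartbeat(leaderURL string, keyMaterial []byte) {
	digest := sha256.Sum256(keyMaterial)
	body, _ := json.Marshal(heartbeat{
		HashedKeys: base64.StdEncoding.EncodeToString(digest[:]),
	})

	resp, err := http.Post(leaderURL+"/enclave/heartbeat", "application/json", bytes.NewReader(body))
	if err != nil {
		// Leader temporarily unavailable: log and try again on the next tick.
		log.Printf("Failed to reach leader: %v", err)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Print("Leader says our key material is outdated; re-registering.")
		register(leaderURL)
	}
}

// register is a stand-in for the worker's registration (and key sync) step.
func register(leaderURL string) {}

func main() {
	keys := []byte("placeholder enclave key material")
	register("https://leader.example.com")
	for range time.Tick(time.Minute) { // "Every X minutes" from the summary above.
		sendHeartbeat("https://leader.example.com", keys)
	}
}
```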

What remains to be done:

  • Test key sync in the context of k8s.
  • Log important errors via Prometheus, so we know when there are sync issues.
  • Write more tests.
  • Provide a mechanism that lets star-randsrv know when its keys were updated.

Also, the scripts/ directory contains a few shell scripts that help with testing key synchronization locally.

@rillian (Contributor) commented Aug 16, 2023

Are there advantages to having separate registration and heartbeat endpoints? If the initial registration request contained an empty body (or a hash of null key material), the leader could use the same logic to schedule a key exchange. When workers send subsequent registration requests with a current key hash, that could work the same as a heartbeat, updating the leader's list of workers without triggering an immediate keysync.
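For what it's worth, a minimal sketch of what such a single endpoint could look like, purely to illustrate the suggestion above; the `leader` type, the `/enclave/register` path, and the helper methods are hypothetical, not taken from this PR:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// leader is a stand-in for the real leader state; the method names below are
// made up for this sketch.
type leader struct{}

func (l *leader) markSeen(worker string)     {}
func (l *leader) currentKeyHash() string     { return "" }
func (l *leader) syncKeysWith(worker string) {}

// registerHandler handles both initial registrations and subsequent
// heartbeats on a single endpoint: an empty or stale key hash schedules a
// key sync, while a current hash only refreshes the worker's last-seen time.
func (l *leader) registerHandler(w http.ResponseWriter, r *http.Request) {
	var req struct {
		HashedKeys string `json:"hashed_keys"`
	}
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	l.markSeen(r.RemoteAddr) // Same bookkeeping for new and known workers.

	if req.HashedKeys == "" || req.HashedKeys != l.currentKeyHash() {
		go l.syncKeysWith(r.RemoteAddr) // New worker or outdated keys: kick off a sync.
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.Handle("/enclave/register", http.HandlerFunc((&leader{}).registerHandler))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```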

@NullHypothesis marked this pull request as ready for review on September 11, 2023 at 12:39
@NullHypothesis changed the title from "WIP: Implement key synchronization." to "Implement key synchronization." on Sep 11, 2023
@NullHypothesis (Contributor, Author) commented

Removing the "work in progress" marker because this is no longer a work in progress. (cc @rillian, @DJAndries)

@kdenhartog (Member) left a comment

All in all LGTM. A few more non-blocking questions, but the leader concern I had during the original feedback has been addressed, in my opinion.

> `hashed_keys` contains the Base64-encoded SHA-256 hash over the worker's enclave key material.
> If all goes well, the leader responds with status code `200 OK`.
>
> * `GET /enclave/leader?nonce={nonce}` Exposed by all enclaves, this endpoint
@kdenhartog (Member) commented Sep 12, 2023

Non-blocker: for the FQDN, is there an internal DNS service running that's configured to point at the leader node as well?

@NullHypothesis (Contributor, Author) replied
There's no internal DNS service. For now, we assume that the domains for both leader and workers are public. That may change though, depending on how the Kubernetes tests are going to go.

```go
	return nil
}

// setupLeader performs necessary setup tasks like starting the worker event
// loop and installing leader-specific HTTP handlers.
func (e *Enclave) setupLeader() {
```
@kdenhartog (Member) commented

Non-blocker: what's the expected method of handing leadership over if the leader node goes down? My current understanding is that if the leader falls over and then restarts properly, a worker will contact it during the next heartbeat check, notice that its key material is wrong, and resync.

Wouldn't we run into a sync issue between the leader restarting and that heartbeat recheck? That intermediate state seems like a concern, but the window is short enough (heartbeats look to be once a minute) that it's probably not worth worrying about. Is that aligned with your view?

@NullHypothesis (Contributor, Author) replied

> Is that aligned with your view?

Yes. Heartbeats are cheap and we can increase their frequency if this turns out to be a concern.

@rillian previously approved these changes Sep 14, 2023

@rillian (Contributor) left a comment
Seems ready to land.

@rillian (Contributor) left a comment

New commits also look good.

@NullHypothesis merged commit ea48d48 into master on Sep 25, 2023 · 4 checks passed
@NullHypothesis (Contributor, Author) commented
Let's merge and address future issues in subsequent PRs.

@NullHypothesis deleted the key-sync branch on September 25, 2023 at 21:57
Linked issue: Specify mechanism for enclave synchronization (#10) · 5 participants