Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/runtime: Support hot-loading of runtime bundles #5961

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

peternose
Copy link
Contributor

Issue: #5755

Example status output:

"bundles": [
  {
    "name": "test-runtime",
    "id": "8000000000000000000000000000000000000000000000000000000000000000",
    "components": [
      {
        "kind": "ronl",
        "version": {}
      },
      {
        "kind": "rofl",
        "version": {}
      }
    ]
  }
],

Copy link

netlify bot commented Dec 6, 2024

Deploy Preview for oasisprotocol-oasis-core canceled.

Name Link
🔨 Latest commit 979c1c2
🔍 Latest deploy log https://app.netlify.com/sites/oasisprotocol-oasis-core/deploys/675acb25e1adc00008166450

@peternose peternose force-pushed the peternose/feature/hot-loading branch 2 times, most recently from 58dd5fb to 9e3cb33 Compare December 7, 2024 01:45
The node can now fetch and verify runtime bundles from remote repositories
and automatically update to new versions.
@peternose peternose force-pushed the peternose/feature/hot-loading branch from 9e3cb33 to 5da378e Compare December 7, 2024 02:50
Copy link

codecov bot commented Dec 7, 2024

Codecov Report

Attention: Patch coverage is 72.08589% with 273 lines in your changes missing coverage. Please review.

Project coverage is 65.33%. Comparing base (ba6735a) to head (35d2d45).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
go/runtime/bundle/discovery.go 61.45% 75 Missing and 26 partials ⚠️
go/runtime/registry/registry.go 76.30% 24 Missing and 17 partials ⚠️
go/runtime/registry/config.go 70.33% 24 Missing and 11 partials ⚠️
go/runtime/bundle/registry.go 90.06% 9 Missing and 6 partials ⚠️
go/worker/common/committee/node.go 54.16% 9 Missing and 2 partials ⚠️
go/worker/keymanager/worker.go 65.51% 9 Missing and 1 partial ⚠️
go/runtime/host/sandbox/sandbox.go 61.90% 6 Missing and 2 partials ⚠️
go/runtime/host/tdx/qemu.go 0.00% 8 Missing ⚠️
go/runtime/registry/host.go 75.75% 4 Missing and 4 partials ⚠️
go/runtime/bundle/bundle.go 28.57% 4 Missing and 1 partial ⚠️
... and 13 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5961      +/-   ##
==========================================
+ Coverage   64.69%   65.33%   +0.63%     
==========================================
  Files         629      631       +2     
  Lines       64297    64752     +455     
==========================================
+ Hits        41599    42304     +705     
+ Misses      17775    17485     -290     
- Partials     4923     4963      +40     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@peternose peternose marked this pull request as ready for review December 7, 2024 03:18
Copy link
Member

@ptrus ptrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(self-note): only reviewed the discovery part now

go/runtime/bundle/discovery.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Outdated Show resolved Hide resolved
"filename", filename,
)

if err := d.registry.AddBundle(src, manifestHash); err != nil {
Copy link
Member

@ptrus ptrus Dec 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit scary that AddBundle needs to "Open" the (untrusted) bundle before verifying the manifest hash. I am not sure, but the current implementation of Open might assume that bundle came from a (somewhat) trusted source, which might not be the case in future. We should ensure that Open is safe to use with malicious inputs.


Somehwat related, I remember we talked about 38edded . Now that I have more context, shouldn't that actually be the hash of the entire bundle (not just the manifest)? That would prevent this problem and allow us to verify the downloaded file before opening it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation is actually hidden in the Open function, so the bundle doesn't need to be trusted.

We can refactor this to use bundle checksum instead (maybe in another PR), as that might be more intuitive. However, the current approach allows us to remove the bundles entirely as one could just publish the manifest (e.g. on chain) and the users should then fetch binaries from untrusted sources and verify their checksums.

@kostko What are your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea for using the manifest hash was that one would not need to always download all of the bundle artifacts as some may be shared between bundles.

We should probably add a separate Open variant (or have the existing one take options), one that takes a manifest hash. Obviously the zip archive parser should be well-fuzzed (which the current implementation is) and since the manifest must be the first file in the archive, we can just verify the manifest hash before continuing to load or parse anything else. I would opt for this variant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we should do some bundle fuzzing, just to be sure these things are somewhat robust. Will open an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kostko I think in this context it would be worth adding some additional doc of the current and intendend status of bundles.

Here is what I have in mind:

  • right now we have legacy and non-legacy bundles (components)
  • furthermore we will have two ways of adding bundles (via the paths or configuring them to download).
  • finally, there is this things with versions that is not 100% clear yet (what is allowed to be reused, what if you get same version but different content, exact bundle validation).

I think it may be worth aggregating requirements/invariant somewhere together with legacy context, as in addition to make it more accessible it would also ease us in finding problematic corner cases/interleaving. Maybe a README.md under go/runtime/bundle? We can later move this to official docs once those features are released?

Hope I am not complicating, we could also offline :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that runtime bundles don't need to be compatible with all previous versions, so I expect we will remove all these legacy fields once we are done refactoring and all runtimes upgrade to new bundles. So I'm not sure if we need a readme file.

go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
Comment on lines 74 to 85
// Init sets up bundle discovery using node configuration and adds configured
// and cached bundles to the registry.
func (d *Discovery) Init() error {
// Consolidate all bundles in one place, which could be useful
// if we implement P2P sharing in the future.
if err := d.copyBundles(); err != nil {
return err
}

// Add copied and cached bundles to the registry.
if err := d.Discover(); err != nil {
return err

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Init only adds configured bundles to the registry. In addition, some of them may be already previously exploded (cached)? If this is the case somewhere nested WriteExploded will only verify...

However, it does not guarantee to add all the "cached" bundles, as looking at Discover method you only add the *.orc bundles to the registry (i.e. the ones that were configured). This is desired as some of the exploded bundles may be stale, i.e. no longer part of the config. (Automatic removal will be handled as part of #5737).

Do I understand your logic correctly? If so I would suggest removing cached bundles reference from the Init toplevel docs and before d.Discover() to avoid confusion?

Thanks for your reply! :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All bundles are copied to the bundle folder.:

  • Bundles from the config are copied in Init on line 79.
  • Bundles fetched from the web are first verified and then copied to the bundle folder.

So the Discovery loads all bundles in the config, all bundles that used to be in the config, and all bundles that were downloaded. The idea is that you always have bundle and the extracted folder in the bundle directory. This way, it will be easy to implement p2p2 sharing as one can just check which orc files are in the bundle directory.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see and thanks.

So what I suggest to change (super minor but to avoid confusion):

// Init sets up bundle discovery using node configuration and adds configured
// and cached bundles to the registry.

and

// Add copied and cached bundles to the registry.

to

// Init sets up bundle discovery using node configuration and adds configured
// bundles (that are guaranteed to be exploded) to the registry.

and

// Add copied bundles (that are guaranteed to be exploded) to the registry.

or smtg along the lines, just to emphasize that we are not also moving stale cached bundles (see issue #5737)

Hope it makes sense(?), and feel free to reword as you see fit...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all bundles that used to be in the config

Correct since at init time the .orc file (if configured) was moved to /data/runtimes/bundles and exploded.
Note, the nodes that were running previous software (prior to this PR/current version), may have some bundles exploded, later removed them from config (and thus also bundle), so only the exploded part will stay. I guess with #5737 the idea was to remove those stale "exploded" bundles + the ones whose version is now outdated.

Now if we push for the p2p sharing, I guess we go in the other direction, i.e. we further increase the storage load on the nodes as they store essentially all the historical bundles and their exploded paths (technically we could be removing those once outdated and only store bundles). On the positive side we get a more resilient data availability for our software components.

Thinking aloud, if we want to have some kind of p2p sharing we may want only archive nodes responsible for that? For the normal nodes I don't think we want to store outdated bundles and exploded components. In any case we may want to adapt #5737 cleanup requirements?

Happy to tackle this myself in the follow-up once we align on the strategy.

cc @kostko

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no sense in storing or sharing historic bundles, so any pruning should get rid of them IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this make sense. RONL + system components in one bundle, and ROFL components in detached bundles.

Also note that ROFL components can only be fetched from detached bundles, not from bundles that contain RONLs.

Our code doesn't distinguish between ROFL and system components. But it would be nice to have a check that ROFLs are always in a detached bundle, and system components in the runtime bundle.

Copy link
Contributor Author

@peternose peternose Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add ROFLSystemKind Kind= "rofl-system" or SystemKind Kind = "system" and verify this in the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ups, maybe we don't need this and we just mark the component as system ROFL once is read from an undetached bundle.

Copy link

@martintomazic martintomazic Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought "system components" can only be ROFL component with current design? (update expect for one and only one RONL)

But it would be nice to have a check that ROFLs are always in a detached bundle, and system components in the runtime bundle.

If not is this really "always" then, if system component is not rofl and may be used by different bundles, then you also want to package it separately, i.e. detached?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So "system ROFL component" = non-detached ROFL. There is already the detached flag for this, but maybe it is not obvious? Or there could be an additional component attribute (e.g. system), which would only be allowed to be set if the bundle is non-detached?

Copy link

@martintomazic martintomazic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked new discovery and registry part. Overall looks super solid. Will try to review the whole PR/integration tmr :)

go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
go/runtime/bundle/registry.go Show resolved Hide resolved
go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Show resolved Hide resolved
go/runtime/bundle/discovery.go Show resolved Hide resolved
go/oasis-node/cmd/node/node.go Show resolved Hide resolved
go/runtime/config/config.go Show resolved Hide resolved
go/runtime/bundle/discovery.go Show resolved Hide resolved
go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
Comment on lines 74 to 85
// Init sets up bundle discovery using node configuration and adds configured
// and cached bundles to the registry.
func (d *Discovery) Init() error {
// Consolidate all bundles in one place, which could be useful
// if we implement P2P sharing in the future.
if err := d.copyBundles(); err != nil {
return err
}

// Add copied and cached bundles to the registry.
if err := d.Discover(); err != nil {
return err

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see and thanks.

So what I suggest to change (super minor but to avoid confusion):

// Init sets up bundle discovery using node configuration and adds configured
// and cached bundles to the registry.

and

// Add copied and cached bundles to the registry.

to

// Init sets up bundle discovery using node configuration and adds configured
// bundles (that are guaranteed to be exploded) to the registry.

and

// Add copied bundles (that are guaranteed to be exploded) to the registry.

or smtg along the lines, just to emphasize that we are not also moving stale cached bundles (see issue #5737)

Hope it makes sense(?), and feel free to reword as you see fit...

go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
go/runtime/bundle/discovery.go Show resolved Hide resolved
go/control/api/api.go Outdated Show resolved Hide resolved
go/control/api/api.go Show resolved Hide resolved
go/runtime/host/sandbox/sandbox.go Show resolved Hide resolved
go/runtime/registry/config.go Outdated Show resolved Hide resolved
go/runtime/bundle/manifest.go Outdated Show resolved Hide resolved
go/runtime/bundle/manifest.go Outdated Show resolved Hide resolved
Comment on lines 74 to 85
// Init sets up bundle discovery using node configuration and adds configured
// and cached bundles to the registry.
func (d *Discovery) Init() error {
// Consolidate all bundles in one place, which could be useful
// if we implement P2P sharing in the future.
if err := d.copyBundles(); err != nil {
return err
}

// Add copied and cached bundles to the registry.
if err := d.Discover(); err != nil {
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no sense in storing or sharing historic bundles, so any pruning should get rid of them IMO.

return fmt.Errorf("failed to construct URL: %w", err)
}

src, err := d.fetchBundle(url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this scheme assumes that all bundles are directly downloadable in the registry by their hash. I would like to avoid the need to copy bundles and instead do it like it was suggested in #5755 (comment), namely that there would be an indirection step:

  • You construct the URL based on the registry base URL and the manifest hash.
  • But instead of containing the entire bundle, this URL only contains metadata on where to find the bundle.

This way we can use existing GitHub releases and just point to those releases in these metadata files instead of needing to copy over the actual bundles.

go/runtime/bundle/registry.go Outdated Show resolved Hide resolved
@peternose peternose force-pushed the peternose/feature/hot-loading branch from 49e86ad to 45b2903 Compare December 12, 2024 11:11
.changelog/5962.cfg.md Outdated Show resolved Hide resolved
@peternose peternose force-pushed the peternose/feature/hot-loading branch from 45b2903 to 979c1c2 Compare December 12, 2024 11:38
Comment on lines 29 to 35
// requestTimeout is the time limit for http client requests.
requestTimeout = 10 * time.Second

// maxBundleSizeBytes is the maximum allowed bundle size in bytes.
maxBundleSizeBytes = 20 * 1024 * 1024 // 20 MB
// maxDefaultBundleSizeBytes is the maximum allowed default bundle size
// in bytes.
maxDefaultBundleSizeBytes = 20 * 1024 * 1024 // 20 MB
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Will requestTimeout of 10s be sufficient for big bundles or do we also want to make this configurable/function of size?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just idle timeout (e.g. waiting for response while nothing is being sent) or max time it takes for the request to complete?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say for the request to complete since we use it to set http.Client.Timeout

       // Timeout specifies a time limit for requests made by this
	// Client. The timeout includes connection time, any
	// redirects, and reading the response body. The timer remains
	// running after Get, Head, Post, or Do return and will
	// interrupt reading of the Response.Body.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the max size we anticipate? For the reference I see that our documentation recommends "Recommended: 1 Gbps internet connection with low latency".

Might be a good idea to future proof us by making this parametric...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants