Merge branch 'firecracker-microvm:main' into clippy_cast_lossless
StemCll authored Nov 20, 2022
2 parents 05127c5 + 1644b3c commit 42cdc44
Showing 11 changed files with 221 additions and 229 deletions.
3 changes: 2 additions & 1 deletion .buildkite/pipeline_pr.py
@@ -55,7 +55,8 @@ def group(group_name, command, agent_tags=None, priority=0, timeout=30):
step_style = {
"command": "./tools/devtool -y test -- ../tests/integration_tests/style/",
"label": "🪶 Style",
# no agent tags, it doesn't matter where this runs
# we only install the required dependencies in x86_64
"agents": ["platform=x86_64.metal"]
}

build_grp = group(
56 changes: 36 additions & 20 deletions docs/snapshotting/snapshot-support.md
@@ -8,7 +8,8 @@
- [Overview](#overview)
- [Snapshot files management](#snapshot-files-management)
- [Performance](#performance)
- [Known issues](#known-issues)
- [Developer preview status](#developer-preview-status)
- [Limitations](#limitations)
- [Firecracker Snapshotting characteristics](#firecracker-snapshotting-characteristics)
- [Snapshot versioning](#snapshot-versioning)
- [Snapshot API](#snapshot-api)
@@ -38,6 +39,7 @@ guest workload at that particular point in time.

The Firecracker snapshot feature is in [developer preview](../RELEASE_POLICY.md)
on all CPU micro-architectures listed in [README](../../README.md#supported-platforms).
See [this section](#developer-preview-status) for more info.

### Overview

@@ -82,8 +84,6 @@ resumed microVM.

The Firecracker snapshot design offers a very simple interface to interact with
snapshots but provides no functionality to package or manage them on the host.
Using snapshots in production is currently not recommended as there are open
[Known issues](#known-issues).

The [threat containment model](../design.md#threat-containment) states
that the host, host/API communication and snapshot files are trusted by Firecracker.
@@ -93,33 +93,49 @@ snapshot files by implementing authentication and encryption schemes while
managing their lifecycle or moving them across the trust boundary, like for
example when provisioning them from a repository to a host over the network.

Firecracker is optimized for fast load/resume and it's designed to do some very basic
sanity checks only on the vm state file. It only verifies integrity using a 64
bit CRC value embedded in the vm state file, but this is only as a partial
measure to protect against accidental corruption, as the disk files and memory
file need to be secured as well. It is important to note that CRC computation
is validated before trying to load the snapshot. Should it encounter failure,
an error will be shown to the user and the Firecracker process will be terminated.
Firecracker is optimized for fast load/resume, and it's designed to do some
very basic sanity checks only on the vm state file. It only verifies integrity
using a 64-bit CRC value embedded in the vm state file, but this is only
a partial measure to protect against accidental corruption, as the disk
files and memory file need to be secured as well. It is important to note that
CRC computation is validated before trying to load the snapshot. Should it
encounter failure, an error will be shown to the user and the Firecracker
process will be terminated.
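
The integrity check described above can be sketched as follows. This is a minimal illustration, not Firecracker's implementation: the CRC-64/XZ parameters, the trailing little-endian checksum layout, and the helper names `crc64`, `seal` and `verify` are all assumptions made for the example. It also shows why the CRC is only a partial measure: anyone who can modify the file can recompute the checksum, so it guards against accidental corruption and not against tampering.

```python
import struct

# Assumption: CRC-64/XZ parameters (reflected polynomial, all-ones initial
# value and final XOR). The variant Firecracker actually uses may differ.
_POLY = 0xC96C5795D7870F42
_MASK = 0xFFFFFFFFFFFFFFFF


def crc64(data: bytes) -> int:
    """Bitwise (table-free) reflected CRC-64 over `data`."""
    crc = _MASK
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
    return crc ^ _MASK


def seal(payload: bytes) -> bytes:
    """Append the checksum to the payload (hypothetical file layout)."""
    return payload + struct.pack("<Q", crc64(payload))


def verify(blob: bytes) -> bytes:
    """Validate the embedded checksum before using the payload."""
    if len(blob) < 8:
        raise ValueError("blob too short to contain a checksum")
    payload = blob[:-8]
    (stored,) = struct.unpack("<Q", blob[-8:])
    if crc64(payload) != stored:
        raise ValueError("CRC mismatch: state file is corrupted")
    return payload
```

A loader built this way calls `verify` first and refuses to continue on `ValueError`, which mirrors the "validate before load, terminate on failure" behaviour described in the paragraph.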

### Performance

The Firecracker snapshot create/resume performance depends on the memory size,
vCPU count and emulated devices count. The Firecracker CI runs snapshots tests
on AWS **m5d.metal** instances for Intel and on AWS **m6g.metal** for ARM.
The baseline for snapshot resume latency target on Intel is under **8ms** with
5ms p90, and on ARM is under **3ms** for a microVM with the following specs:
2vCPU/512MB/1 block/1 net device.
vCPU count and emulated devices count.
The Firecracker CI runs snapshot tests on:

### Known issues
- AWS **m5d.metal** and **m6i.metal** instances for Intel
- AWS **m6g.metal** for ARM
- AWS **m6a.metal** for AMD

- High snapshot latency on 5.4+ host kernels - [#2129](https://github.com/firecracker-microvm/firecracker/issues/2129)
We are running nightly performance tests for all the enumerated platforms on
all supported kernel versions.
The baselines can be found in their [respective config file](../../tests/integration_tests/performance/configs/).

### Developer preview status

The snapshot functionality is still in developer preview due to the following:

- Poor entropy and replayable randomness when resuming multiple microVMs from
the same snapshot. We do not recommend using snapshotting in production if
there is no mechanism to guarantee proper secrecy and uniqueness between
guests.
Please see [Snapshot security and uniqueness](#snapshot-security-and-uniqueness).

### Limitations

- High snapshot latency on 5.4+ host kernels due to cgroups V1. We
strongly recommend deploying snapshots on cgroups V2 enabled hosts for the
affected kernel versions - [related issue](https://github.com/firecracker-microvm/firecracker/issues/2129).
- Guest network connectivity is not guaranteed to be preserved after resume.
For recommendations related to guest network connectivity for clones please
see [Network connectivity for clones](network-for-clones.md).
- Vsock device does not have full snapshotting support.
Please see [Vsock device limitation](#vsock-device-limitation).
- Poor entropy and replayable randomness when resuming multiple microvms which
deal with cryptographic secrets. Please see [Snapshot security and uniqueness](#snapshot-security-and-uniqueness).
- Snapshotting on arm64 works for both GICv2 and GICv3 enabled guests.
However, restoring between different GIC versions is not possible.

@@ -542,7 +558,7 @@ Boot microVM A -> ... -> Create snapshot S -> Resume -> ...
-> Load S in microVM B -> Resume -> ...
```

Here, both microVM A and B do work staring from the state stored in snapshot S.
Here, both microVM A and B do work starting from the state stored in snapshot S.
Unique identifiers, random numbers, and cryptographic tokens that are meant to
be used once may be used twice. It doesn't matter if microVM A is terminated
before microVM B resumes execution from snapshot S or not. In this example, we
Expand Down
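
The duplication risk described in this hunk can be reproduced in miniature. The sketch below is only an analogy: a seeded `random.Random` instance stands in for guest state captured in a snapshot, and `resume_and_make_token` is a hypothetical helper, not part of Firecracker. Two "clones" restored from the same saved state produce the same supposedly unique token.

```python
import random

# Hypothetical stand-in for guest state: a PRNG that has already done some work.
guest_rng = random.Random(1234)
guest_rng.random()

# "Create snapshot S": capture the generator state at this point in time.
snapshot_state = guest_rng.getstate()


def resume_and_make_token(state):
    """Restore a fresh generator from the saved state and draw one token."""
    rng = random.Random()
    rng.setstate(state)
    return rng.getrandbits(64)


# "microVM A" and "microVM B" both continue from snapshot S.
token_a = resume_and_make_token(snapshot_state)
token_b = resume_and_make_token(snapshot_state)
print(token_a == token_b)
```

Both calls start from identical state, so the tokens collide regardless of which clone runs first or whether the other is still alive. This is the reason the documentation asks for a mechanism that guarantees uniqueness between guests after resume.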
9 changes: 8 additions & 1 deletion resources/tests/setup_rootfs.sh
@@ -11,7 +11,14 @@ prepare_fc_rootfs() {
SSH_DIR="$BUILD_DIR/ssh"
RESOURCE_DIR="$2"

packages="udev systemd-sysv openssh-server iproute2 msr-tools"
packages="udev systemd-sysv openssh-server iproute2"

# msr-tools is only supported on x86-64.
arch=$(uname -m)
if [ "${arch}" == "x86_64" ]; then
packages="$packages msr-tools"
fi

apt-get update
apt-get install -y --no-install-recommends $packages

4 changes: 2 additions & 2 deletions src/cpuid/src/common.rs
@@ -31,8 +31,8 @@ pub enum Error {
/// Extract entry from the cpuid.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn get_cpuid(function: u32, count: u32) -> Result<CpuidResult, Error> {
// TODO: replace with validation based on `has_cpuid()` when it becomes stable:
// https://doc.rust-lang.org/core/arch/x86/fn.has_cpuid.html
// TODO: Use `core::arch::x86_64::has_cpuid`
// (https://github.com/firecracker-microvm/firecracker/issues/3271)
#[cfg(target_env = "sgx")]
{
return Err(Error::NotSupported);
2 changes: 1 addition & 1 deletion tests/integration_tests/build/test_pylint.py
@@ -20,7 +20,7 @@ def test_python_pylint():
'--variable-rgx="[a-z_][a-z0-9_]{1,30}$" --disable='
"fixme,too-many-instance-attributes,import-error,"
"too-many-locals,too-many-arguments,consider-using-f-string,"
"consider-using-with,implicit-str-concat"
"consider-using-with,implicit-str-concat,line-too-long"
)

# Get all *.py files from the project
21 changes: 14 additions & 7 deletions tests/integration_tests/functional/test_balloon.py
@@ -45,22 +45,29 @@ def get_rss_from_pmap():

def make_guest_dirty_memory(ssh_connection, should_oom=False, amount=8192):
"""Tell the guest, over ssh, to dirty `amount` pages of memory."""
logger = logging.getLogger("make_guest_dirty_memory")

amount_in_mbytes = amount / MB_TO_PAGES

exit_code, _, _ = ssh_connection.execute_command(
"/sbin/fillmem {}".format(amount_in_mbytes)
)
cmd = f"/sbin/fillmem {amount_in_mbytes}"
exit_code, stdout, stderr = ssh_connection.execute_command(cmd)
# add something to the logs for troubleshooting
if exit_code != 0:
logger.error("while running: %s", cmd)
logger.error("stdout: %s", stdout.read())
logger.error("stderr: %s", stderr.read())

cmd = "cat /tmp/fillmem_output.txt"
_, stdout, _ = ssh_connection.execute_command(cmd)
if should_oom:
assert (
exit_code == 0
and ("OOM Killer stopped the program with " "signal 9, exit code 0")
in stdout.read()
"OOM Killer stopped the program with "
"signal 9, exit code 0" in stdout.read()
)
else:
assert exit_code == 0 and ("Memory filling was " "successful") in stdout.read()
assert exit_code == 0, stderr.read()
stdout_txt = stdout.read()
assert "Memory filling was successful" in stdout_txt, stdout_txt


def build_test_matrix(network_config, bin_cloner_path, logger):