Initial implementation of the vm-memory crate #1
Conversation
Please refer to the following links for related discussions:
Pass 1. Will do some more reviewing in the following days.
Mostly looked at the documentation and examples, a bit at the code. Will look at the code tomorrow.
Force-pushed from d0db04a to 172c4f8
Hi, how long will this still be open for comments?
Hope it can be closed within this week :)
@alexandruag Until at least 2 team members approve it. We can definitely hold it until you review it.
Thanks, I just had some comments but also wanted to really try and understand the entire thing first. Will post something before the end of this week if still allowed :D
@alexandruag Did you have time to look at this PR? |
Still not done, and I don't think I have any time to review it till maybe Friday. I'll defer to Alex in case he has more time.
Please also update the travis configuration file so that we test builds with no features and with backend-mmap. Also, run the unit tests for all features.
I still have to go through some things. It would be great if you can allow us a couple more days. I'm very very sorry about the delay, and it's totally understandable if you want to draw a line and move forward.
#[derive(Debug)]
pub struct MmapRegion {
    addr: *mut u8,
    size: usize,
Could you extend the structure the same way I did it here: sboeuf/firecracker@248da22#diff-149449c5f11974990eadca8fd5a76982R43 ?
The goal is to always have access to the file descriptor backing the memory mapping, so that in the case of vhost-user we can provide this fd through the vhost-user socket.
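A minimal sketch of what that extension could look like (the field and method names here are illustrative, not the final API): the region keeps an optional backing file and offset next to the raw mapping, so a vhost-user device can later retrieve the fd and send it over the socket.

use std::fs::File;

#[derive(Debug)]
pub struct MmapRegion {
    addr: *mut u8,
    size: usize,
    // None for anonymous mappings; Some((file, offset)) when the
    // mapping is backed by a file that vhost-user backends may need.
    file_offset: Option<(File, u64)>,
}

impl MmapRegion {
    /// Returns the backing file and the offset into it, if any.
    pub fn file_offset(&self) -> Option<&(File, u64)> {
        self.file_offset.as_ref()
    }
}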
How about adding another abstraction, AddressSpace, to support the vhost crate?
The idea is to manage the guest's physical address space with an AddressSpace object, which contains a group of address regions. Each address region maintains a type (memory or MMIO), properties (RWX), usage, GPA, an optional fd, an optional offset into the fd, etc.
GuestMemory is then a mechanism for the current process to access the guest memory, and it could be generated from the AddressSpace object on demand.
Currently QEMU maps all of the guest memory into the current process, but we have some thoughts about mapping only part, or even none, of the guest memory into the hypervisor process. For the vhost-user crate, we could simply generate a group of address regions accessible by the IO subsystem (or a specific device backend) and pass the (GPA, fd, offset) tuples to vhost-user backends.
What are your thoughts here?
Thanks!
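A rough sketch of the AddressSpace idea described above; every name here is hypothetical and only meant to make the proposal concrete:

use std::os::unix::io::RawFd;

pub enum RegionType {
    Memory,
    Mmio,
}

pub struct AddressRegion {
    pub region_type: RegionType, // memory or MMIO
    pub perm_rwx: u8,            // RWX property bits
    pub gpa: u64,                // guest physical address
    pub size: u64,
    pub fd: Option<RawFd>,       // optional backing file descriptor
    pub fd_offset: Option<u64>,  // optional offset into the fd
}

/// The guest's physical address space: a group of address regions from
/// which a GuestMemory view (or a vhost-user memory table) can be
/// generated on demand.
pub struct AddressSpace {
    pub regions: Vec<AddressRegion>,
}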
// else is holding a reference to it.
unsafe {
    libc::munmap(self.addr as *mut libc::c_void, self.size);
}
A few comments regarding this behavior.
First, this assumes that part of the memory security is enforced through the code, instead of being enforced by the language itself. The assumption that nobody else is holding a reference is very hard to enforce...
The way mmap() behaves, we can clearly expect mmap() to be called multiple times on the same address with different ranges, without having to call munmap(). Calling munmap() would prevent us from atomically changing the mapping at a specific fixed address.
This is a very specific case that I ran into while porting virtio-fs to Firecracker (the DAX case), since the memory mapping that happens on behalf of the libfuse daemon is performed on top of the existing mapping holding the whole shared region.
The problem is that this Drop behavior forces the virtio-fs device to maintain a list of MemoryMapping structures if we don't want to see the mapping disappear. And even by maintaining a list, what happens when someone wants to mmap() on top of an address that is already tracked?
Ok, let's take an example:
- Map 16MB of anonymous pages at address 0x0
- Map 4MB of pages from a file descriptor at address 0x200000 (2MB)
- At this point, you have two entries in the list of MemoryMapping
- Unmap 16MB of pages at address 0x0
- At this point, you only have one entry, corresponding to the second mapping
- Trying to access the page at address 0x200000 will generate a segfault
What I'm trying to say is that by enforcing the unmap from the code itself, we need deterministic behavior. In the case of virtio-fs, the hypervisor performs actions (mmap(), basically) on behalf of the vhost-user backend, which is why it seems pretty hard to maintain proper state every time an mmap() happens.
My question is: do we want to keep this behavior for any memory mapping, by maintaining a complex list of descriptors, or do we simply want to create another type of MmapRegion, like MmapRegionRaw, that would not include any Drop behavior?
@sboeuf Actually, is this an MmapRegionRaw, or is this a special type of region where you want to free the mapping of the mapped contents but not give up the virtual memory reservation?
To do this atomically you cannot do an unmap; you need to do an mmap of anonymous memory at the same virtual address with the same size. Hence the VMM gets to hold on to the virtual address reservation.
The only assumption here is that there is an overarching region that covers all of these *special* regions, that it will outlive all of these sub-regions, and that it will be responsible for the actual mmap.
@mcastelino you're right! Having a special type whose Drop does an mmap(MAP_ANONYMOUS) instead of a munmap() looks like the only clean way to address this issue.
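A sketch of the approach being agreed on here (the type name is hypothetical): instead of calling munmap(), Drop maps anonymous memory over the sub-region at the same fixed address, so the parent mapping keeps its virtual address reservation.

struct MmapSubRegion {
    addr: *mut u8,
    size: usize,
}

impl Drop for MmapSubRegion {
    fn drop(&mut self) {
        // Sound only under the invariant stated above: a parent
        // mapping covering [addr, addr + size) outlives this value.
        unsafe {
            libc::mmap(
                self.addr as *mut libc::c_void,
                self.size,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_ANONYMOUS | libc::MAP_PRIVATE | libc::MAP_FIXED,
                -1,
                0,
            );
        }
    }
}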
I am not sure how you can bypass language safety unless: 1) someone calls libc::munmap, 2) someone calls libc::mmap with MAP_FIXED. Both of these are unsafe, and even in the case of (1) it is possible to write a safe wrapper using std::mem::forget. Since aliasing can only happen through other unsafe calls, I'm not sure any change is needed for safety.
Of course, it is possible to make changes to make the API more ergonomic, for example:
- add an MmapRegionRaw that is exactly the same as MmapRegion but does not unmap automatically.
- if you added a constructor for mmap(MAP_FIXED), such as pub fn from_fd_fixed(fd: &AsRawFd, size: usize, offset: libc::off_t, address: *mut u8) -> io::Result<Self>, then it would have to be unsafe. And you could then also have a safe wrapper around it, like pub fn remap(self, fd: &AsRawFd, size: usize, offset: libc::off_t) -> io::Result<Self>.
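For the std::mem::forget part, a possible safe wrapper could look like this (a sketch only, assuming it lives in the same module as MmapRegion so it can read the private fields): it consumes the region and skips its Drop, so munmap() is never called and the caller takes over the mapping's lifetime.

fn into_raw_parts(region: MmapRegion) -> (*mut u8, usize) {
    let parts = (region.addr, region.size);
    std::mem::forget(region); // Drop (and thus munmap) never runs
    parts
}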
... but I would do that after the first version goes in.
I didn't know about std::mem::forget, but this could be a potential solution to the problem too.
That being said, @mcastelino's suggestion would be the cleanest approach IMO, since we would make sure the memory region stays reserved without leaving the previous mapping in place.
This can be part of a follow-up PR, I don't mind, but I want to make sure we discuss and decide how we want to solve this ASAP, as it may influence the current code.
It's a special requirement to support virtio-fs, and we haven't considered such a use case before.
Let's first talk about the requirements of the virtio-fs driver. Per my understanding, to support virtio-fs devices, an MMIO window is created in the guest's physical address space, and virtiofsd dynamically maps/remaps contents from files into the MMIO window. Is that correct?
If so, how about separating the virtio-fs logic from the vm-memory crate? We could do it this way:
1. Create a GuestRegionMmap object to allocate the virtual address range and, optionally, anonymous memory for the MMIO window.
2. Introduce a new VirtioFsManager to manage the virtual addresses allocated in step 1 as a pool, and provide interfaces to:
   a) allocate virtual addresses from the pool to map file content into the MMIO window;
   b) free virtual addresses back into the pool when a file is deleted/closed;
   c) reclaim virtual addresses by LRU when the pool runs out of them;
   d) map/remap file content to a specific virtual address with mmap(MAP_FIXED).
3. The VirtioFsManager object outlives the GuestRegionMmap object for the MMIO window.
This way, we don't need to change the vm-memory crate; we just introduce a new manager for virtio-fs.
As a bonus, we may make a minor improvement to the vm-memory crate to support mmap(PROT_NONE) in step 1, so that a page fault happens when the guest accesses a virtual address in the MMIO window which hasn't been associated with any file content yet.
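A sketch of that mmap(PROT_NONE) bonus idea (a hypothetical helper, not part of the crate): reserve the MMIO window's virtual address range with no access rights, so any guest access to a page that hasn't been associated with file content yet faults.

use std::ptr;

/// Reserve `size` bytes of virtual address space without granting any
/// access; later mmap(MAP_FIXED) calls populate parts of the window.
unsafe fn reserve_window(size: usize) -> *mut u8 {
    let addr = libc::mmap(
        ptr::null_mut(),
        size,
        libc::PROT_NONE,
        libc::MAP_ANONYMOUS | libc::MAP_PRIVATE,
        -1,
        0,
    );
    assert_ne!(addr, libc::MAP_FAILED, "mmap failed");
    addr as *mut u8
}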
@jiangliu I am not sure this requirement is limited to virtio-fs. Will we have a need for something similar when managing the PCI hole and the BARs within the PCI hole? @bonzini may have a better idea. Will the regions corresponding to the BARs be allocated on the fly? If so, then the current implementation of GuestMemoryMmap should probably support adding regions.
Let's first talk about the requirements of the virtio-fs driver. Per my understanding, to support virtio-fs devices, an MMIO window is created in the guest's physical address space, and virtiofsd dynamically maps/remaps contents from files into the MMIO window. Is that correct?

Yes, this is correct.
Introducing a VirtioFsManager seems like a lot to address a problem that is more generic, as we could end up doing multiple mmap calls in any situation that requires submappings within an existing primary mapping.
With your solution, you would create a new entry holding onto an MmapRegion every time a new submapping happens. And at the end, when all those entries go out of scope, they get dropped via Drop, which means a bunch of munmap() calls. This works, but it seems more complex than the solution proposed by @mcastelino.
Eventually, I think we would want to be able to define that an MmapRegion holds a list of submappings that could implement a different Drop (doing an mmap instead).
Last thing: the mapping could be handled by a manager as you proposed, but if we end up going down this path, we should define this manager as a trait that can be reused by any device. We cannot expect every device to handle this use case in its own way.
I think only @sboeuf's request to extend MmapRegion remains?
Hi! Once again I apologize for the delayed review. I left most of my comments in the code, but I would also like to add the following:
- It seems some functions are marked as #[inline] or #[inline(always)], but most pub functions aren't. As far as I know, Rust does not inline functions across crate boundaries unless they are explicitly marked #[inline] or the project is built with LTO enabled. Should we make a conscious effort to mark the implementation of publicly visible functions/methods? (See the sketch after this list.)
- Why not switch to Rust 2018? (I know Firecracker is still using 2015 👎)
- Both use some_module::* and pub use some_module::* appear pretty often in the code. I generally don't like wildcard imports (I don't know the proper name for them) because they make code harder to understand, and every now and then you end up exposing something you don't want. What do you think about only using explicit imports, unless there is a very compelling reason to do otherwise?
- It looks like the crate would be easier to inspect if we split the code a bit more into modules. What do you think about placing things like Bytes, VolatileMemory, VolatileSlice, VolatileRef, etc. into their own modules (assuming this doesn't make testing too awkward)?
- I think we should create a new PR by squash-merging the changes from this one into another branch (after addressing the current comments). There are already quite a large number of commits, and it's a bit difficult to pinpoint the best place to leave a comment. I had a few more remarks (for example, about error handling for GuestMemory), and it would be much easier to bring them to a clean-slate version of the PR.
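To illustrate the #[inline] and wildcard-import points above (the module and item names here are made up): small pub accessors get #[inline] so they can be inlined across crate boundaries without LTO, and re-exports name the items explicitly instead of using pub use module::*.

mod volatile_memory {
    pub struct VolatileSlice {
        pub(crate) size: usize,
    }

    impl VolatileSlice {
        /// Without #[inline] (or LTO), this trivial accessor would not
        /// be inlined into downstream crates.
        #[inline]
        pub fn len(&self) -> usize {
            self.size
        }
    }
}

// Explicit re-export, rather than `pub use volatile_memory::*`.
pub use self::volatile_memory::VolatileSlice;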
use Bytes;

static MAX_ACCESS_CHUNK: usize = 4096;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering, what's the basis for choosing this value?
Guess it's based on typical physical page sizes.
src/guest_memory.rs
/// - error code returned by the callback 'f'
/// - size of data already handled when encountering the first hole
/// - size of data already handled when the whole range has been handled
fn try_access<F>(&self, count: usize, addr: GuestAddress, mut f: F) -> Result<usize, Error>
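To make the documented contract concrete, here is a simplified, self-contained sketch of the iteration pattern behind try_access (the real signature uses GuestAddress and region objects; plain integers are used here for brevity):

// `regions` is a list of (start, len) pairs; `f` receives an
// (address, capped length) pair and returns how many bytes it handled.
fn try_access<F>(
    regions: &[(u64, u64)],
    mut addr: u64,
    mut count: u64,
    mut f: F,
) -> Result<u64, String>
where
    F: FnMut(u64, u64) -> Result<u64, String>,
{
    let total = count;
    while count > 0 {
        // Find the region containing `addr`; a hole ends the walk.
        let region = regions
            .iter()
            .find(|&&(start, len)| addr >= start && addr < start + len);
        let (start, len) = match region {
            Some(&r) => r,
            // Size of data already handled when hitting the first hole.
            None => return Ok(total - count),
        };
        // Cap the chunk at the end of the current region.
        let chunk = count.min(start + len - addr);
        let done = f(addr, chunk)?; // callback errors propagate as-is
        if done == 0 {
            return Ok(total - count);
        }
        addr += done;
        count -= done;
    }
    Ok(total) // the whole range has been handled
}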
I'm a bit apprehensive about this function and its uses throughout the PR. I'm mainly concerned about the way Bytes is implemented for T: GuestMemory (somewhere in a later commit), and the fact that we silently allow reads/writes across multiple regions (even if they have to be adjacent) by default. I think that cases where a read/write spills into another region are erroneous situations more often than not, and the implementation of Bytes for T: GuestMemory should not touch more than one region.
To get around this limitation for cases where it actually makes sense, we can offer a helper method like try_access that iterates through all relevant regions and invokes some logic passed by the user. What do you think?
The main target is to support memory hotplug (hot-adding). A request crossing a region boundary may be valid if memory hot-adding is enabled. Providing a helper such as try_access() doesn't solve the issue, because the consumer has no way to tell whether a memory region is hot-added or not, so it will always use try_access().
@jiangliu can we add another feature to vm-memory called "hot-plug" or something of that sort? @alexandruag would that be an acceptable compromise?
I don't think it is a matter of compromising; one behavior is right and the other is wrong.
Hotplug is an example, but in general this is just how DMA works, and at its heart a VMM is just emulating hardware. Firecracker does not implement this behavior, but that's incorrect. The purpose of rust-vmm should also be to avoid this kind of shortcut and do things the right way.
Thanks for the added context! There are obviously a lot of things that I'm missing, but it's not super clear that one behavior is correct/desirable every single time, whereas the other shouldn't ever make sense for anyone :(
How about decoupling any particular behavior from the interface and leaving it up to the implementation? In this case, we remove try_access entirely from GuestMemory and leave every method that relies on it unimplemented. We can then use try_access in the reference implementation, where we might also (at some point) add a parameter/feature/compilation flag that configures whether cross-region access is allowed.
How is it useful not to cross regions? IMO the point of this crate is exactly to provide a good implementation for things that are not so obvious, and this qualifies perfectly...
I will take a look at Alexandru's review and make some changes tomorrow.
Do you have some specific 2018 features in mind? The only limit is that we cannot use nightly features, because we want to build this code in distro packages and not just distribute it as crates (and distro rustc packages don't enable nightly features).
Will fix.
Unfortunately it's not possible due to the orphan trait rules; the compiler rejects the impl because of the uncovered type parameter :(
Rust 2018 is indeed not yet available in distros; maybe we can wait a bit before switching to edition 2018. I think the current version in the ones that I've tried is 1.30.0.
Some really minor comments :)
Hello, sorry for not being able to attend the meeting today.
I don't fully grasp all the enhancements brought by the 2018 edition, but I really like not having to write explicit extern crate declarations anymore.
I thought the orphan rule does not apply when splitting things into separate modules (within the same crate), but anyway, that comment was essentially about refactoring, which can be done at a later time.
Paolo has mentioned a good point, that virtual machines mimic physical machines :)
For physical machines, a request crossing a memory region boundary is allowed in several cases, such as:
1) the memory hotplug case we have mentioned,
2) a request crossing a NUMA node boundary,
3) even a request crossing the boundary between physical memory and MMIO is valid.
I guess what I want to ask is this: are there any downsides to having an interface (the GuestMemory trait) that is agnostic in terms of cross-region memory access, especially since users have full control over which implementation they choose?
How about reimplementing the try_access() method for a specific GuestMemory implementation, if the default implementation isn't desired?
Seems that may solve the concern :)
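A sketch of that suggestion in trait form (simplified types, hypothetical names): try_access gets a default body on the trait, and an implementation that does not want cross-region access simply overrides it.

trait GuestMemoryLike {
    // Default body: stand-in for the permissive, cross-region walk.
    fn try_access(&self, count: usize, _addr: u64) -> Result<usize, ()> {
        Ok(count)
    }
}

struct StrictMemory;

impl GuestMemoryLike for StrictMemory {
    // Stricter override: reject accesses instead of crossing regions.
    fn try_access(&self, _count: usize, _addr: u64) -> Result<usize, ()> {
        Err(())
    }
}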
The advantage is in code reuse.
Provide a default and sample implementation of the GuestMemory trait by mmap-ing the guest's memory into the current process. Three data structures are introduced here:
- MmapRegion: mmap a continuous range of the guest's physical memory into the current process and provide methods to access the mmapped memory.
- GuestRegionMmap: a wrapper structure to map a guest physical address into an (mmap_region, offset) tuple.
- GuestMemoryMmap: manage a collection of GuestRegionMmap objects for a virtual machine.
One of the main responsibilities of the GuestMemoryMmap object is to handle the use cases where an access request crosses a memory region boundary. This scenario may be triggered when memory hotplug is supported, so there's a tradeoff between functionality and complexity:
- use the following pattern for simplicity; it fails when the request crosses a region boundary (the current default behavior in the crosvm and firecracker projects): guest_memory_mmap.find_region(addr).access_ops(buf, addr)
- use the following pattern to support requests crossing region boundaries: guest_memory_mmap.access_ops(buf, addr)
Signed-off-by: Liu Jiang <[email protected]>
Do not conflate backend errors with memory access errors. Remove unused error codes. Do not be over-specific on errors when the cause can be deduced from the backtrace.
The Bytes trait provides a common abstraction for volatile memory access. The basic implementation will be provided by VolatileSlice, but other implementations can be made that use newtypes for addresses.
The read_obj/write_obj methods need not be reimplemented all over the place. Move the implementation to the trait.
This is more idiomatic.
Remove a bunch of abstractions that are not used yet. AddressRegion can be replaced by Bytes, AddressSpace can be moved directly in GuestMemory instead of making it a marker trait.
A small cleanup.
Just name them Error and Result. This is in preparation to making volatile_memory a completely separate crate from memory_model.
This will simplify using VolatileMemory from GuestRegionMmap. Signed-off-by: Paolo Bonzini <[email protected]>
mmap.rs is a reference implementation, so if possible we want generic implementation outside it. Do it for Bytes<MmapAddress>: drop the newtype, and just make it Bytes<usize> so that VolatileSlice can implement it. The Bytes<GuestAddress> implementation is adjusted to go from the MmapRegion to the VolatileSlice.
Yet another piece that can be made generic. Region access now goes through a newtype MemoryRegionAddress, which is similar to how MmapAddress was used but generic. Additionally, as_slice() and as_mut_slice() are made optional, so that they are usable even if a hypervisor wants to use GuestMemory for e.g. MMIO regions.
This should never happen due to the capping that is done by try_access before invoking the callback.
This is a legal operation, albeit a bit weird.
Some comments got out of date, refine them according to the new design. Signed-off-by: Liu Jiang <[email protected]>
Add documentation for the crate. Signed-off-by: Liu Jiang <[email protected]>
Explicitly export symbols instead of exporting all public symbols. Signed-off-by: Liu Jiang <[email protected]>
Rename trait DataInit to trait ByteValued, also rename data_init.rs to bytes.rs. So all traits in the file bytes.rs are related to byte operations. Signed-off-by: Liu Jiang <[email protected]>
Fix cargo clippy errors:

error: long literal lacking separators
  --> src/endian.rs:121:29
    |
121 |     let v = 0x0123456789ABCDEF as $old_type;
    |             ^^^^^^^^^^^^^^^^^^ help: consider: `0x0123_4567_89AB_CDEF`
...
140 |     endian_test!(u16, Le16, test_le16, NATIVE_LITTLE);
    |     -------------------------------------------------- in this macro invocation
    |
    = note: `-D clippy::unreadable-literal` implied by `-D warnings`
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unreadable_literal

Signed-off-by: Liu Jiang <[email protected]>
This requires a small change to the VolatileRef API, which now takes and returns a *u8. Signed-off-by: Paolo Bonzini <[email protected]>
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM