Hardware access implementation discussion #425
One of the first things we got asked was how to enable VLAN support. So that should probably be on your list of things to support. I don't know how much you have to do on the link layer end, but it's probably not going to be that much. :)
Thanks for the excellent upfront planning of your forthcoming thesis work; I look forward to seeing this progress! My initial reaction is to encourage you to look at previous work in this space as regards virtualising PCI; there are a number of details and device-specific quirks that previous hypervisors have run into (for instance, when expanding out to display hardware or other non-networking PCI devices). In particular, Xen has a well-trodden PCI front/back split device that has also been deployed on quite a variety of real-world PCI hardware. Its pcifront PV interface should provide good inspiration for the Solo5 support, but without the Xen-specific pieces. KVM has yet another approach, which requires Intel VT-d for the majority of use cases (and, on arm64, fairly intricate use of MSIs). My instinct is to push for the "most software" solution that doesn't leave hardware underutilised, and the Xen PV approach seems to fit that bill; VT-d is a very large hammer to use for virtualising PCI devices.

Once you do decide on a good mechanism to get pages through to the host device, I'd be curious to know if you're interested in putting an FPGA into the mix. The Mirage stack makes it fairly easy to decide on the structure of the inputs to the software, which do not have to be what we traditionally have (TCP/IP offload and checksumming). If you had a hardware device that could do some custom protocol-level parsing, we could implement protocols that are not traditionally hardware-accelerated, by implementing some scanning logic in the "NIC/FPGA" and passing those structures through to a software Mirage stack to parse. I'm thinking specifically of QUIC/HTTP3 here, which is difficult to accelerate using TCP offload since... it's not TCP :-)

This is a very exciting research direction to take the Mirage software stack, and particularly so in the context of RISC-V, which @kayceesrk @samoht @nojb and I have been working on recently.
@fwsGonzo @avsm One question up front, though: is Xen really the right approach? This sounds very naive, but do we want unikernels to deal with the intricacies of PCIe devices? My initial idea was to bind our device to the `vfio-pci` driver on a Linux host. Other host OSs would be a completely different story, of course, and I admittedly have not looked into that yet.

I'm definitely interested in other hardware besides network cards; I just happen to know a bit about network cards already. I don't have any experience regarding FPGAs; I know what they are and that's about it. I'd have to ask around at university if there are specific FPGAs around I could experiment with. Custom protocol-level parsing goes a bit beyond the scope of my thesis though :-)

Do you have resources and/or tips for approaching RISC-V? Especially IOMMU stuff? I'll start with x86_64 for now, but I'm always curious about other architectures.

I feel like I'm approaching the problem from too much of an ixy.ml-focussed perspective: ixy.ml doesn't care about anything Linux's PCI code does and also just runs as root without a care in the world.
@Reperator Sorry, I misunderstood your task then. :)
This is quite an impressive plan for your thesis! Wishing you all the best for your progress! @avsm, I think even the Xen virtualization link talks about using the Intel VT-d extensions for PCI passthrough, if I am right?
In that case, it would always be advisable to use the VT-d extension, because I assume Xen is taking care of the address mappings through the IOMMU as well. Either way, @Reperator, whether you are going through Xen or KVM, both of them implement PCIe passthrough using the Intel IOMMU.
I think I misunderstand some part here about writing a userspace PCIe driver against `vfio`. My major concern is with regard to multiple guest processes trying to access the PCI cards. For example, in the case of multiple processes running on Linux which need to access the NIC, the driver sits in the kernel, so it can take the requests from each application, do the necessary work and dispatch them. In the case of unikernels, where the driver lives inside each guest, it is not clear to me how that sharing would work.

Anyway, the above concern is more on the practical side of things, which I am not sure you intend to address in your research.

Finally, RISC-V does not have a complete virtualization story yet. There are draft hypervisor specifications and KVM ports underway, but nothing on the side of I/O virtualization yet. I am open to questions about the RISC-V state of the union, if required.
I was going to map everything the device offers into the unikernel's address space and be done with it. That's what I meant by having unikernels deal with the intricacies. I actually don't know how hypervisors like Xen targeting "real" OSs deal with PCIe; I just assumed that VMs would have to do all the same setup steps (bus enumeration, register mappings, etc.) as bare-metal OSs. I figured it'd be easier for unikernel developers to just have PCIe devices "magically" show up in a fixed place (like an array at a well-known address).

I have thought a bit about multiple processes accessing the same device, though admittedly not too much. For my initial prototype I was going to take the "user must make sure to only use each device once" stance.
That ain't happening: one unikernel will take full control of the device. The others will have to use different NICs. I don't see an easy solution for this in the general case; with modern NICs specifically you could theoretically use different rx/tx queues for each unikernel and have the NIC multiplex in hardware, but then the driver would have to run either in Solo5 (@mato is screaming already ;-)) or one unikernel would have to configure the NIC for all the others. Don't think that's viable. I guess in the distant future we could have something of a "plug-in" unikernel that exposes functionality to other unikernels in a structured fashion.
I think a general approach to sharing a PCIe device is SR-IOV with VFs (Virtual Functions). Each MirageOS unikernel with its own device driver can then manage an assigned VF (i.e. a virtualized device) independently, without interfering with the others.
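For context, VFs are typically created on the Linux host through the physical function's sysfs entry before they can be handed to guests. A minimal sketch in C (the PCI address and VF count are placeholders):

```c
/* Sketch: create 4 SR-IOV Virtual Functions for a physical function on a
 * Linux host. The PCI address below is a placeholder for the actual PF. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:ab:cd.0/sriov_numvfs";
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open sriov_numvfs");
        return 1;
    }
    /* Writing the desired VF count asks the PF driver to spawn that many VFs;
     * each VF then shows up as its own PCI device that can be passed through. */
    if (write(fd, "4", 1) != 1) {
        perror("write sriov_numvfs");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
```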
That sounds about right. As far as I have understood, the Xen documentation describes the same two steps: the device is first made assignable on the host, which is the equivalent of binding it to `vfio-pci` on Linux, and the virtual guest is then given access to the assigned device through its configuration.

So it would seem that you'd have to bind the device on the host first and then hand it to the guest. Though, I still lack the necessary expertise in this area, and it would be preferable if you consult someone who understands these intricacies better!
Also, my comment about multiple user processes accessing the NIC was raised only in the context of Jitsu, where you spawn processes for each request. I couldn't think of a clear way of setting such a system up, given each unikernel would handle the networking stack as well, so I asked the question to get some ideas!

Anyway, I think virtualization is an orthogonal issue: once we have unikernels which take care of the network stack, the virtualization question can be revisited. As @TImada suggests, SR-IOV could be one way to do it, among others! I am definitely intrigued by dedicated processes existing to handle the NICs, maybe multiplexing over cards using SR-IOV and worker processes communicating their IP requests to each of these, sort of like a load balancing setup for multi-NIC systems! Though, I think this is veering off topic and is a discussion for some other issue!
Thanks for this interesting discussion.
Not sure why this should be Solo5's responsibility.
This is what we do for native Ada/SPARK components on Muen. Since the whole system is static, we can have these addresses as constants at compilation time. For subjects that do resource discovery during runtime/startup we have a mechanism called Subject Info which enables querying e.g. memory mappings and their attributes (rwx) for a given name. This facility is already used in the Solo5/Muen bindings to get the memory channels for memory stream network devices; see `solo5/bindings/muen/muen-net.c`, line 159 at commit `6a98a19`.
If Solo5 provided a way for unikernels to query similar information, the driver could set up its memory mappings initially and then start operation. Another idea worth investigating could be how the new manifest could be leveraged to fit this use case.
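To illustrate the idea, here is a purely hypothetical sketch of what such a query interface could look like from the C bindings; none of these names exist in Solo5 today:

```c
/* Hypothetical sketch only -- none of these names exist in Solo5.
 * The idea: the unikernel asks the bindings/tender for a named resource
 * (declared in the manifest) and gets back its mapping and attributes. */
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

struct solo5_mapping_info {
    uint64_t addr;                 /* guest address of the mapping */
    size_t   size;                 /* length of the mapping in bytes */
    bool     read, write, execute; /* rwx attributes, as on Muen */
};

/* Look up a resource by the name it was given in the manifest. */
bool solo5_query_mapping(const char *name, struct solo5_mapping_info *info);

/* A PCI driver could then do something like this at startup: */
static int driver_init(void)
{
    struct solo5_mapping_info bar0, dma;
    if (!solo5_query_mapping("ixy0-bar0", &bar0) ||
        !solo5_query_mapping("ixy0-dma", &dma))
        return -1;
    /* ... program device registers at bar0.addr, place rings in dma ... */
    return 0;
}
```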
To summarize some discussion I've already had with @Reperator and @Kensan elsewhere: from the Solo5 point of view, ease of portability, isolation and a clear separation of concerns are important goals if this is to be eventually upstreamed. With that in mind, I would aim for a minimal implementation that starts with the hvt target on Linux.
Regarding the requirement for DMA memory: You're right that requiring the operator to specify the "right" amount of DMA memory when starting the tender is somewhat unintuitive and fragile. When I was designing the manifest I considered defining entries for other resources, such as memory, but did not go down that route since I wanted to push out the user-visible features enabled by the manifest sooner rather than later. In order to understand how this could be designed, I have a couple of questions:
Thanks for the summary! Regarding your questions:
@Reperator:
@Reperator: OK, looking at how the VFIO mapping works, it's simple enough. Once you verify that this approach actually works, we can then proceed to figuring out what the "real" APIs and implementation should be.
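For reference, the core of the VFIO setup is small; this sketch follows roughly the sequence ixy.c uses (error handling omitted; the IOMMU group number, IOVA and sizes are placeholders):

```c
/* Sketch of VFIO-based DMA setup. Error handling is omitted and the
 * IOMMU group number, IOVA and mapping size are placeholders. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/42", O_RDWR);   /* IOMMU group of the NIC */

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Memory to be used for DMA (rx/tx descriptor rings, packet buffers). */
    size_t len = 16 * 1024 * 1024;
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Map this memory at IOVA 0 in the IOMMU: the device can only ever
     * reach this region, and it uses the same addresses the driver uses. */
    struct vfio_iommu_type1_dma_map dma_map;
    memset(&dma_map, 0, sizeof(dma_map));
    dma_map.argsz = sizeof(dma_map);
    dma_map.vaddr = (unsigned long)mem;
    dma_map.iova  = 0;
    dma_map.size  = len;
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

    return 0;
}
```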
On a very obtuse note, do you think some sort of AVX/AVX2 codegen capabilities in OCaml would benefit your work? I am assuming that if you want to do some SIMD computation you'd have to bind to a C library, and my initial impression was that you would like to keep that to a minimum. I had read in a Reddit thread that DPDK drivers are faster due to AVX/AVX2 but also completely unreadable. So I imagine that being able to use AVX/AVX2 from high-level constructs might be useful.
@anmolsahoo25: |
Makes sense. Thanks!
tl;dr: I'll be implementing PCIe and DMA support in hvt and MirageOS and am looking for other people's requirements/suggestions.
I'm about to start my master's thesis at the Chair of Network Architectures and Services at the Technical University of Munich. The goal of my thesis is to implement direct hardware access (PCIe and DMA) into MirageOS, specifically to allow drivers (including my driver ixy.ml) to take direct control of hardware devices.
Motivation
The main reason hardware support is of interest is obviously performance: Moving packet buffers (or any other data) between unikernel and hypervisor/runtime takes time. Having the unikernel directly control its hardware should (hopefully) show some significant performance improvements. I'll mostly focus on networking, but hardware support could also be the foundation for other people to build drivers on top of, maybe something like storage drivers (NVMe?).
Additionally, as our paper showed, there's a case to be made for writing drivers in high-level languages instead of using the standard C drivers baked into your garden-variety Linux kernel.

Finally, I'm also interested in the flexibility this brings for unikernels: @hannesm and I discussed implementing batch rx/tx and zero-copy in MirageOS's network stack, something we could more easily test with ixy.ml, which already has support for batching. Parts of #290, for example, could be implemented inside the unikernel itself, keeping Solo5's codebase small and maintainable.
I'm hoping to get some discussion on implementation details going here. The reason I'm opening this issue here instead of on the main Mirage repo is that most of the changes I'll have to make will be on Solo5. Right now I'm only looking at mirage-solo5 and hvt on Linux as I'm hoping to get a proof-of-concept up and running early into the 6 months I have to finish the thesis. Depending on how much time I have left I'll also take a look at the BSDs.
So let's talk details! There are two main features my driver specifically needs.
PCIe
First off, PCIe register access: ixy.ml needs to configure the network card it wants to control by writing to its configuration registers. On Linux those registers are mapped to the `/sys/bus/pci/devices/$PCI_ADDRESS/resource*` files. Additionally, ixy.ml needs to enable DMA by setting a bit in the PCI configuration space, which is mapped to `/sys/bus/pci/devices/$PCI_ADDRESS/config`. There needs to be some way for unikernels to access these files in an implementation-hiding fashion. Also there are some trivial details to take care of, like what the command line flag for mapping a device into a unikernel (`--pci=0000:ab:cd.e` for example) should be.
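For concreteness, this is roughly how an ixy-style driver uses those files on Linux today (a sketch: the PCI address is a placeholder and error handling is omitted):

```c
/* Sketch: map BAR0 of a NIC and enable DMA (bus mastering) via sysfs,
 * roughly what ixy-style userspace drivers do on Linux. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Map the device's register BAR so it can be programmed with plain loads/stores. */
    int fd = open("/sys/bus/pci/devices/0000:ab:cd.0/resource0", O_RDWR);
    struct stat st;
    fstat(fd, &st);
    volatile uint8_t *regs =
        mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    (void)regs; /* the driver would now poke device registers through this pointer */

    /* Enable DMA by setting the Bus Master bit (bit 2) in the PCI command
     * register, which lives at offset 4 of the configuration space. */
    int cfg = open("/sys/bus/pci/devices/0000:ab:cd.0/config", O_RDWR);
    uint16_t cmd;
    pread(cfg, &cmd, sizeof(cmd), 4);
    cmd |= (1 << 2);
    pwrite(cfg, &cmd, sizeof(cmd), 4);

    return 0;
}
```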
DMA

Secondly, there's the (more complicated) issue of DMA: modern high-performance PCIe devices such as network/storage/graphics cards all use DMA (direct memory access) to communicate with their host system. Generally a driver instructs a device to read/write data from/to specific physical memory addresses. Those addresses must never change (without the driver knowing, I guess), otherwise the device will access arbitrary memory, leading to stability and (more importantly) security problems. Imagine the OS places some secret data at a physical location that was previously mapped to the device, while the device has not been reconfigured. A NIC for example would happily send out your private data as network packets! This is not acceptable, so @mato and I already decided that IOMMU support is 100% required.

The IOMMU does for PCIe devices what your processor's MMU does for your programs: it translates virtual addresses into physical addresses. When we configure our DMA mappings in the IOMMU (for example by using Linux's `vfio` framework, like ixy.c and ixy.rs already do), the device will also use virtual addresses. This means the driver won't have to take care of translating between virtual and physical addresses. Additionally, the IOMMU will block any access to memory areas outside of the mappings the device is allowed to access.

When configuring the IOMMU we must also look at what hardware we are actually running on. The TLB inside the IOMMU usually has far fewer entries than the main MMU's TLB, and different page sizes (4KiB and 2MiB on x86_64, for example) are available on different architectures.
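To illustrate why the page size matters: backing the DMA area with 2 MiB hugepages keeps the number of IOMMU TLB entries small. A sketch of allocating such a region on Linux (the size is a placeholder and hugepages must be reserved on the host beforehand):

```c
/* Sketch: allocate a 16 MiB DMA region backed by 2 MiB hugepages, so the
 * IOMMU needs only 8 TLB entries instead of 4096 4 KiB ones. */
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 1024 * 1024;
    void *dma_mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (dma_mem == MAP_FAILED) {
        perror("mmap hugepages");
        return 1;
    }
    /* This region would then be handed to the IOMMU (e.g. via vfio) and
     * used for rx/tx descriptor rings and packet buffers. */
    return 0;
}
```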
So how should I go about implementing this? As I see it, there needs to be some command line flag like `--dma=16M`, instructing hvt to configure a 16 Mebibyte mapping in the IOMMU. Then there needs to be a way for the unikernel to retrieve its mappings.

I think some mechanism for drivers to indicate their specific requirements (possibly before actually running) would be helpful for users. For example, ixy needs some contiguous DMA areas for its packet buffer rings. Otherwise driver authors would be forced to do something like this when these requirements aren't met:
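(The following is an illustrative sketch only; the query function and the constants are hypothetical.)

```c
/* Sketch of the awkward fallback a driver author would need if the driver
 * cannot state its DMA requirements up front. All names are hypothetical. */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_RX_DESCS 512
#define NUM_TX_DESCS 512
#define DESC_SIZE    16
#define NUM_BUFS     4096
#define BUF_SIZE     2048

/* Hypothetical: whatever amount of contiguous DMA memory the tender set up. */
extern size_t solo5_dma_size(void);

void ixy_init(void)
{
    size_t needed = (NUM_RX_DESCS + NUM_TX_DESCS) * DESC_SIZE
                  + NUM_BUFS * BUF_SIZE;
    if (solo5_dma_size() < needed) {
        /* Nothing sensible left to do: the device cannot be driven without
         * its rings and buffers, so bail out at runtime. */
        fprintf(stderr,
                "ixy: need %zu bytes of contiguous DMA memory, "
                "restart the tender with a larger --dma value\n", needed);
        exit(1);
    }
    /* ... carve descriptor rings and packet buffers out of the DMA area ... */
}
```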
Suggestions?
Those were just the features ixy.ml and, I imagine, other network drivers need. I'm interested in other people's requirements; are there applications that need other features? Do you have suggestions/wishes/tips?