diff --git a/doc/assets/css/misc.css b/doc/assets/css/misc.css new file mode 100644 index 00000000000..beb5a28e43a --- /dev/null +++ b/doc/assets/css/misc.css @@ -0,0 +1,73 @@ +.revision-table { + width: 50%; + margin: 1em auto 1em auto; + font-size: 80%; +} + +.label { + display: inline; + padding: .2em .6em .3em; + font-size: 75%; + font-weight: 700; + line-height: 1; + color: #fff; + text-align: center; + white-space: nowrap; + vertical-align: baseline; + border-radius: .25em; +} + +.label.label-default { + background-color: #777; +} + +.label.label-info { + background-color: #5bc0de; +} + +.label.label-danger { + background-color: #d9534f; +} + +.label.label-warning { + background-color: #f0ad4e; +} + +.label.label-success { + background-color: #5cb85c; +} + +.table-condensed > thead > tr > th, +.table-condensed > tbody > tr > th, +.table-condensed > tfoot > tr > th, +.table-condensed > thead > tr > td, +.table-condensed > tbody > tr > td, +.table-condensed > tfoot > tr > td { + padding: 5px; + +} + +.table-striped > tbody > tr:nth-child(odd) { + background-color: #f9f9f9; +} + +.btn { + display: inline-block; + padding: 6px 12px; + margin-bottom: 0; + font-weight: normal; + line-height: 1.42857143; + text-align: center; + white-space: nowrap; + vertical-align: middle; + cursor: pointer; + background-image: none; + border: 1px solid transparent; + border-radius: 4px; +} + +.btn-link { + font-weight: normal; + color: #337ab7; + border-radius: 0; +} diff --git a/doc/content/design/RDP.md b/doc/content/design/RDP.md new file mode 100644 index 00000000000..8793a3066d6 --- /dev/null +++ b/doc/content/design/RDP.md @@ -0,0 +1,98 @@ +--- +title: RDP control +layout: default +design_doc: true +revision: 2 +status: released (XenServer 6.5 SP1) +design_review: 12 +--- +### Purpose + +To administer guest VMs it can be useful to connect to them over Remote Desktop Protocol (RDP). XenCenter supports this; it has an integrated RDP client. 
+ +First it is necessary to turn on the RDP service in the guest. + +This can be controlled from XenCenter. Several layers are involved. This description starts in the guest and works up the stack to XenCenter. + +This feature was completed in the first quarter of 2015, and released in Service Pack 1 for XenServer 6.5. + +### The guest agent + +The XenServer guest agent installed in Windows VMs can turn the RDP service on and off, and can report whether it is running. + +The guest agent is at https://github.com/xenserver/win-xenguestagent + +Interaction with the agent is done through some Xenstore keys: + +The guest agent running in domain N writes two xenstore nodes when it starts up: +* `/local/domain/N/control/feature-ts = 1` +* `/local/domain/N/control/feature-ts2 = 1` + +This indicates support for the rest of the functionality described below. + +(The "...ts2" flag is new for this feature; older versions of the guest agent wrote the "...ts" flag and had support for only a subset of the functionality (no firewall modification), and had a bug in updating `.../data/ts`.) + +To indicate whether RDP is running, the guest agent writes the string "1" (running) or "0" (disabled) to xenstore node + +`/local/domain/N/data/ts`. + +It does this on start-up, and also in response to the deletion of that node. + +The guest agent also watches xenstore node `/local/domain/N/control/ts` and it turns RDP on and off in response to "1" or "0" (respectively) being written to that node. The agent acknowledges the request by deleting the node, and afterwards it deletes `/local/domain/N/data/ts`, thus triggering itself to update that node as described above. + +When the guest agent turns the RDP service on/off, it also modifies the standard Windows firewall to allow/forbid incoming connections to the RDP port. This is the same as the firewall change that happens automatically when the RDP service is turned on/off through the standard Windows GUI. + +### XAPI etc.
+ +xenopsd sets up watches on xenstore nodes including the `control` tree and `data/ts`, and prompts xapi to react by updating the relevant VM guest metrics record, which is available through a XenAPI call. + +XenAPI includes a new message (function call) which can be used to ask the guest agent to turn RDP on and off. + +This is `VM.call_plugin` (analogous to `Host.call_plugin`) in the hope that it can be used for other purposes in the future, even though for now it does not really call a plugin. + +To use it, supply `plugin="guest-agent-operation"` and either `fn="request_rdp_on"` or `fn="request_rdp_off"`. + +See http://xapi-project.github.io/xen-api/classes/vm.html + +The function strings are named with "request" (rather than, say, "enable_rdp" or "turn_rdp_on") to make it clear that xapi only makes a request of the guest: when one of these calls returns successfully this means only that the appropriate string (1 or 0) was written to the `control/ts` node and it is up to the guest whether it responds. + +### XenCenter + +#### Behaviour on older XenServer versions that do not support RDP control + +Note that the current behaviour depends on some global options: "Enable Remote Desktop console scanning" and "Automatically switch to the Remote Desktop console when it becomes available". + +1. When tools are not installed: + * As of XenCenter 6.5, the RDP button is absent. +2. When tools are installed but RDP is not switched on in the guest: + 1. If "Enable Remote Desktop console scanning" is on: + * The RDP button is present but greyed out. (It seems to sometimes read "Switch to Remote Desktop" and sometimes read "Looking for guest console...": I haven't yet worked out the difference). + * We scan the RDP port to detect when RDP is turned on + 2. If "Enable Remote Desktop console scanning" is off: + * The RDP button is enabled and reads "Switch to Remote Desktop" +3. When tools are installed and RDP is switched on in the guest: + 1. 
If "Enable Remote Desktop console scanning" is on: + * The RDP button is enabled and reads "Switch to Remote Desktop" + * If "Automatically switch" is on, we switch to RDP immediately we detect it + 2. If "Enable Remote Desktop console scanning" is off: + * As above, the RDP button is enabled and reads "Switch to Remote Desktop" + +#### New behaviour on XenServer versions that support RDP control + +1. This new XenCenter behaviour is only for XenServer versions that support RDP control, with guests with the new guest agent: behaviour must be unchanged if the server or guest-agent is older. +2. There should be no change in the behaviour for Linux guests, either PV or HVM varieties: this must be tested. +3. We should never scan the RDP port; instead we should watch for a change in the relevant variable in guest_metrics. +4. The XenCenter option "Enable Remote Desktop console scanning" should change to read "Enable Remote Desktop console scanning (XenServer 6.5 and earlier)" +5. The XenCenter option "Automatically switch to the Remote Desktop console when it becomes available" should be enabled even when "Enable Remote Desktop console scanning" is off. +6. When tools are not installed: + * As above, the RDP button should be absent. +7. When tools are installed but RDP is not switched on in the guest: + * The RDP button should be enabled and read "Turn on Remote Desktop" + * If pressed, it should launch a dialog with the following wording: "Would you like to turn on Remote Desktop in this VM, and then connect to it over Remote Desktop? [Yes] [No]" + * That button should turn on RDP, wait for RDP to become enabled, and switch to an RDP connection. It should do this even if "Automatically switch" is off. +8. 
When tools are installed and RDP is switched on in the guest: + * The RDP button should be enabled and read "Switch to Remote Desktop" + * If "Automatically switch" is on, we should switch to RDP immediately + * There is no need for us to provide UI to switch RDP off again +9. We should also test the case where RDP has been switched on in the guest before the tools are installed. + diff --git a/doc/content/design/_index.md b/doc/content/design/_index.md new file mode 100644 index 00000000000..e90cfc7b21f --- /dev/null +++ b/doc/content/design/_index.md @@ -0,0 +1,6 @@ ++++ +title = "Design Documents" +menuTitle = "Designs" ++++ + +{{< design_docs_list >}} diff --git a/doc/content/design/aggr-storage-reboots.md b/doc/content/design/aggr-storage-reboots.md new file mode 100644 index 00000000000..b3173f6e19e --- /dev/null +++ b/doc/content/design/aggr-storage-reboots.md @@ -0,0 +1,67 @@ +--- +title: Aggregated Local Storage and Host Reboots +layout: default +design_doc: true +revision: 3 +status: proposed +design_review: 144 +revision_history: +- revision_number: 1 + description: Initial version +- revision_number: 2 + description: Included some open questions under Xapi point 2 +- revision_number: 3 + description: Added new error, task, and assumptions +--- + +## Introduction + +When hosts use an aggregated local storage SR, then disks are going to be mirrored to several different hosts in the pool (RAID). This ensures that if a host goes down (e.g. due to a reboot after installing a hotfix or upgrade, or when "fenced" by the HA feature), all disk contents in the SR are still accessible. This also means that if all disks are mirrored to just two hosts (worst-case scenario), just one host may be down at any point in time to keep the SR fully available. + +When a node comes back up after a reboot, it will resynchronise all its disks with the related mirrors on the other hosts in the pool. 
This syncing takes some time, and only after this is done, we may consider the host "up" again, and allow another host to be shut down. + +Therefore, when installing a hotfix to a pool that uses aggregated local storage, or doing a rolling pool upgrade, we need to make sure that we do hosts one-by-one, and we wait for the storage syncing to finish before doing the next. + +This design aims to provide guidance and protection around this by blocking hosts from being shut down or rebooted through the XenAPI except when safe, and setting the `host.allowed_operations` field accordingly. + + +## XenAPI + +If an aggregated local storage SR is in use, and one of the hosts is rebooting or down (for whatever reason), or resynchronising its storage, the operations `reboot` and `shutdown` will be removed from the `host.allowed_operations` field of _all_ hosts in the pool that have a PBD for the SR. + +This is a conservative approach in that it assumes that this kind of SR tolerates only one node "failure", and assumes no knowledge about how the SR distributes its mirrors. We may refine this in future, in order to allow some hosts to be down simultaneously. + +The presence of the `reboot` operation in `host.allowed_operations` indicates whether the `host.reboot` XenAPI call is allowed or not (similarly for `shutdown` and `host.shutdown`). It will not, of course, prevent anyone from rebooting a host from the dom0 console or power switch. + +Clients, such as XenCenter, can use `host.allowed_operations`, when applying an update to a pool, to guide them when it is safe to update and reboot the next host in the sequence. + +In case `host.reboot` or `host.shutdown` is called while the storage is busy resyncing mirrors, the call will fail with a new error `MIRROR_REBUILD_IN_PROGRESS`. + +## Xapi + +Xapi needs to be able to: + +1. Determine whether aggregated local storage is in use; this just means that a PBD for such an SR is present.
+ * TBD: To avoid SR-specific code in xapi, the storage backend should tell us whether it is an aggregated local storage SR. +2. Determine whether the storage system is resynchronising its mirrors; it will need to be able to query the storage backend for this kind of information. + * Xapi will poll for this and will reflect that a resync is happening by creating a `Task` for it (in the DB). This task can be used to track progress, if available. + * The exact way to get the syncing information from the storage backend is SR specific. The check may be implemented in a separate script or binary that xapi calls from the polling thread. Ideally this would be integrated with the storage backend. +3. Update `host.allowed_operations` for all hosts in the pool according to the rules described above. This comes down to updating the function `valid_operations` in `xapi_host_helpers.ml`, and will need to use a combination of the functionality from the two points above, plus an indication of host liveness from `host_metrics.live`. +4. Trigger an update of the allowed operations when a host shuts down or reboots (due to a XenAPI call or otherwise), and when it has finished resynchronising when back up. Triggers must be in the following places (some may already be present, but are listed for completeness, and to confirm this): + * Wherever `host_metrics.live` is updated to detect pool slaves going up and down (probably at least in `Db_gc.check_host_liveness` and `Xapi_ha`). + * Immediately when a `host.reboot` or `host.shutdown` call is executed: `Message_forwarding.Host.{reboot,shutdown,with_host_operation}`. + * When a storage resync is starting or finishing. + +All of the above runs on the pool master (= SR master) only. + +## Assumptions + +The above will be safe if the storage cluster is equal to the XenServer pool.
In general, however, it may be desirable to have a storage cluster that is larger than the pool, have multiple XS pools on a single cluster, or even share the cluster with other kinds of nodes. + +To ensure that the storage is "safe" in these scenarios, xapi needs to be able to ask the storage backend: + +1. if a mirror is being rebuilt "somewhere" in the cluster, AND +2. if "some node" in the cluster is offline (even if the node is not in the XS pool). + +If the cluster is equal to the pool, then xapi can do point 2 without asking the storage backend, which will simplify things. For the moment, we assume that the storage cluster is equal to the XS pool, to avoid making things too complicated (while keeping in mind that we may change this in future). + diff --git a/doc/content/design/archival-redesign.md b/doc/content/design/archival-redesign.md new file mode 100644 index 00000000000..34a3b898019 --- /dev/null +++ b/doc/content/design/archival-redesign.md @@ -0,0 +1,95 @@ +--- +title: RRDD archival redesign +layout: default +design_doc: true +revision: 1 +status: released (7,0) +--- + +## Introduction + +Current problems with rrdd: + +* rrdd stores knowledge about whether it is running on a master or a slave + +This determines the host to which rrdd will archive a VM's rrd when the VM's +domain disappears - rrdd will always try to archive to the master. However, +when a host joins a pool as a slave rrdd is not restarted so this knowledge is +out of date. When a VM shuts down on the slave rrdd will archive the rrd +locally. When starting this VM again the master xapi will attempt to push any +locally-existing rrd to the host on which the VM is being started, but since +no rrd archive exists on the master the slave rrdd will end up creating a new +rrd and the previous rrd will be lost.
+ +* rrdd handles rebooting VMs unpredictably + +When rebooting a VM, there is a chance rrdd will attempt to update that VM's rrd +during the brief period when there is no domain for that VM. If this happens, +rrdd will archive the VM's rrd to the master, and then create a new rrd for the +VM when it sees the new domain. If rrdd doesn't attempt to update that VM's rrd +during this period, rrdd will continue to add data for the new domain to the old +rrd. + +## Proposal + +To solve these problems, we will remove some of the intelligence from rrdd and +make it into more of a slave process of xapi. This will entail removing all +knowledge from rrdd of whether it is running on a master or a slave, and also +modifying rrdd to only start monitoring a VM when it is told to, and only +archiving an rrd (to a specified address) when it is told to. This matches the +way xenopsd only manages domains which it has been told to manage. + +## Design + +For most VM lifecycle operations, xapi and rrdd processes (sometimes across more +than one host) cooperate to start or stop recording a VM's metrics and/or to +restore or backup the VM's archived metrics. Below we will describe, for each +relevant VM operation, how the VM's rrd is currently handled, and how we propose +it will be handled after the redesign. + +#### VM.destroy + +The master xapi makes a remove_rrd call to the local rrdd, which causes rrdd to +delete the VM's archived rrd from disk. This behaviour will remain unchanged. + +#### VM.start(\_on) and VM.resume(\_on) + +The master xapi makes a push_rrd call to the local rrdd, which causes rrdd to +send any locally-archived rrd for the VM in question to the rrdd of the host on +which the VM is starting. This behaviour will remain unchanged. + +#### VM.shutdown and VM.suspend + +Every update cycle rrdd compares its list of registered VMs to the list of +domains actually running on the host.
Any registered VMs which do not have a +corresponding domain have their rrds archived to the rrdd running on the host +believed to be the master. We will change this behaviour by stopping rrdd from +doing the archiving itself; instead we will expose a new function in rrdd's +interface: + +``` +val archive_rrd : vm_uuid:string -> remote_address:string -> unit +``` + +This will cause rrdd to remove the specified rrd from its table of registered +VMs, and archive the rrd to the specified host. When a VM has finished shutting +down or suspending, the xapi process on the host on which the VM was running +will call archive_rrd to ask the local rrdd to archive back to the master rrdd. + +#### VM.reboot + +Removing rrdd's ability to automatically archive the rrds for disappeared +domains will have the bonus effect of fixing how the rrds of rebooting VMs are +handled, as we don't want the rrds of rebooting VMs to be archived at all. + +#### VM.checkpoint + +This will be handled automatically, as internally VM.checkpoint carries out a +VM.suspend followed by a VM.resume. + +#### VM.pool_migrate and VM.migrate_send + +The source host's xapi makes a migrate_rrd call to the local rrdd, with a +destination address and an optional session ID. The session ID is only required +for cross-pool migration. The local rrdd sends the rrd for that VM to the +destination host's rrdd as an HTTP PUT. This behaviour will remain unchanged. diff --git a/doc/content/design/backtraces.md b/doc/content/design/backtraces.md new file mode 100644 index 00000000000..f8374be0d46 --- /dev/null +++ b/doc/content/design/backtraces.md @@ -0,0 +1,298 @@ +--- +title: Backtrace support +layout: default +design_doc: true +revision: 1 +status: Confirmed +--- + +We want to make debugging easier by recording exception backtraces which are + +- reliable +- cross-process (e.g. xapi to xenopsd) +- cross-language +- cross-host (e.g.
master to slave) + +We therefore need + +- to ensure that backtraces are captured in our OCaml and python code +- a marshalling format for backtraces +- conventions for storing and retrieving backtraces + +Backtraces in OCaml +=================== + +OCaml has fast exceptions which can be used for both + +- control flow i.e. fast jumps from inner scopes to outer scopes +- reporting errors to users (e.g. the toplevel or an API user) + +To keep the exceptions fast, exceptions and backtraces are decoupled: +there is a single active backtrace per-thread at any one time. If you +have caught an exception and then throw another exception, the backtrace +buffer will be reinitialised, destroying your previous records. For example +consider a 'finally' function: + +```ocaml +let finally f cleanup = + try + let result = f () in + cleanup (); + result + with e -> + cleanup (); + raise e (* <-- backtrace starts here now *) +``` + +This function performs some action (i.e. `f ()`) and guarantees to +perform some cleanup action (`cleanup ()`) whether or not an exception +is thrown. This is a common pattern to ensure resources are freed (e.g. +closing a socket or file descriptor). Unfortunately the `raise e` in +the exception handler loses the backtrace context: when the exception +gets to the toplevel, `Printexc.get_backtrace ()` will point at the +`finally` rather than the real cause of the error. + +We will use a variant of the solution proposed by +[Jacques-Henri Jourdan](http://gallium.inria.fr/blog/a-library-to-record-ocaml-backtraces/) +where we will record backtraces when we catch exceptions, before the +buffer is reinitialised. Our `finally` function will now look like this: + +```ocaml +let finally f cleanup = + try + let result = f () in + cleanup (); + result + with e -> + Backtrace.is_important e; + cleanup (); + raise e +``` + +The function `Backtrace.is_important e` associates the exception `e` +with the current backtrace before it gets deleted. 
+ +Xapi always has high-level exception handlers or other wrappers around all the +threads it spawns. In particular Xapi tries really hard to associate threads +with active tasks, so it can prefix all log lines with a task id. This helps +admins see the related log lines even when there is lots of concurrent activity. +Xapi also tries very hard to label other threads with names for the same reason +(e.g. `db_gc`). Every thread should end up being wrapped in `with_thread_named` +which allows us to catch exceptions and log stacktraces from `Backtrace.get` +on the way out. + +OCaml design guidelines +----------------------- + +Making nice backtraces requires us to think when we write our exception raising +and handling code. In particular: + +- If a function handles an exception and re-raises it, you must call + `Backtrace.is_important e` with the exception to capture the backtrace first. +- If a function raises a different exception (e.g. `Not_found` becoming a XenAPI + `INTERNAL_ERROR`) then you must use `Backtrace.reraise ` to + ensure the backtrace is preserved. +- All exceptions should be printable -- if the generic printer doesn't do a good + enough job then register a custom printer. +- If you are the last person who will see an exception (because you aren't going + to rethrow it) then you *may* log the backtrace via `Debug.log_backtrace e` + *if and only if* you reasonably expect the resulting backtrace to be helpful + and not spammy. +- If you aren't the last person who will see an exception (because you are going + to rethrow it or another exception), then *do not* log the backtrace; the + next handler will do that. +- All threads should have a final exception handler at the outermost level; + for example `Debug.with_thread_named` will do this for you. + + +Backtraces in python +==================== + +Python exceptions behave similarly to the OCaml ones: if you raise a new +exception while handling an exception, the backtrace buffer is overwritten.
+Therefore the same considerations apply. + +Python design guidelines +------------------------ + +The function [sys.exc_info()](https://docs.python.org/2/library/sys.html#sys.exc_info) +can be used to capture the traceback associated with the last exception. +We must guarantee to call this before constructing another exception. In +particular, this does not work: + +```python + raise MyException(sys.exc_info()) +``` + +Instead you must capture the traceback first: + +```python + exc_info = sys.exc_info() + raise MyException(exc_info) +``` + +Marshalling backtraces +====================== + +We need to be able to take an exception thrown from python code, gather +the backtrace, transmit it to an OCaml program (e.g. xenopsd) and glue +it onto the end of the OCaml backtrace. We will use a simple json marshalling +format for the raw backtrace data consisting of + +- a string summary of the error (e.g. an exception name) +- a list of filenames +- a corresponding list of lines + +(Note we don't use the more natural list of pairs as this confuses the +"rpclib" code generating library) + +In python: + +```python + results = { + "error": str(s[1]), + "files": files, + "lines": lines, + } + print json.dumps(results) +``` + +In OCaml: + +```ocaml + type error = { + error: string; + files: string list; + lines: int list; + } with rpc + print_string (Jsonrpc.to_string (rpc_of_error ...)) +``` + +Retrieving backtraces +===================== + +Backtraces will be written to syslog as usual. However it will also be +possible to retrieve the information via the CLI to allow diagnostic +tools to be written more easily. 
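As an illustration of the Python guidelines and the marshalling format described above, the following is a minimal sketch rather than the actual SM or xapi code. It is written for Python 3, and the names `BackendError`, `marshal_backtrace`, `failing_operation` and `run` are hypothetical, introduced only for this example. It captures `sys.exc_info()` before constructing the new exception, then emits the `error`, `files` and `lines` fields. The frame ordering shown here (outermost call first, as returned by `traceback.extract_tb`) is an assumption; this document does not state which order the OCaml side expects.

```python
import json
import sys
import traceback


class BackendError(Exception):
    """Hypothetical exception type, used only for this illustration."""


def marshal_backtrace(exc_info):
    """Turn a sys.exc_info() triple into the error/files/lines JSON format."""
    _exc_type, exc_value, tb = exc_info
    # extract_tb returns one entry per frame, outermost call first.
    # Each entry can be indexed as (filename, lineno, name, line).
    frames = traceback.extract_tb(tb)
    return json.dumps({
        "error": str(exc_value),
        "files": [frame[0] for frame in frames],
        "lines": [frame[1] for frame in frames],
    })


def failing_operation():
    raise ValueError("disk not found")


def run():
    try:
        failing_operation()
    except ValueError:
        # Capture the traceback first, before building another exception.
        exc_info = sys.exc_info()
        raise BackendError(marshal_backtrace(exc_info))
```

A caller that catches `BackendError` can pass `str(e)` through `json.loads` to recover the two parallel lists, which keeps the format free of the list-of-pairs shape that the text says confuses "rpclib".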
+ +The CLI +------- + +We add a global CLI argument "--trace" which requests the backtrace be +printed, if one is available: + +``` +# xe vm-start vm=hvm --trace +Error code: SR_BACKEND_FAILURE_202 +Error parameters: , General backend error [opterr=exceptions must be old-style classes or derived from BaseException, not str], +Raised Server_error(SR_BACKEND_FAILURE_202, [ ; General backend error [opterr=exceptions must be old-style classes or derived from BaseException, not str]; ]) +Backtrace: +0/50 EXT @ st30 Raised at file /opt/xensource/sm/SRCommand.py, line 110 +1/50 EXT @ st30 Called from file /opt/xensource/sm/SRCommand.py, line 159 +2/50 EXT @ st30 Called from file /opt/xensource/sm/SRCommand.py, line 263 +3/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1486 +4/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 83 +5/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1519 +6/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1567 +7/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1065 +8/50 EXT @ st30 Called from file /opt/xensource/sm/EXTSR.py, line 221 +9/50 xenopsd-xc @ st30 Raised by primitive operation at file "lib/storage.ml", line 32, characters 3-26 +10/50 xenopsd-xc @ st30 Called from file "lib/task_server.ml", line 176, characters 15-19 +11/50 xenopsd-xc @ st30 Raised at file "lib/task_server.ml", line 184, characters 8-9 +12/50 xenopsd-xc @ st30 Called from file "lib/storage.ml", line 57, characters 1-156 +13/50 xenopsd-xc @ st30 Called from file "xc/xenops_server_xen.ml", line 254, characters 15-63 +14/50 xenopsd-xc @ st30 Called from file "xc/xenops_server_xen.ml", line 1643, characters 15-76 +15/50 xenopsd-xc @ st30 Called from file "lib/xenctrl.ml", line 127, characters 13-17 +16/50 xenopsd-xc @ st30 Re-raised at file "lib/xenctrl.ml", line 127, characters 56-59 +17/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 937, characters 3-54 +18/50 
xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1103, characters 4-71 +19/50 xenopsd-xc @ st30 Called from file "list.ml", line 84, characters 24-34 +20/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1098, characters 2-367 +21/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1203, characters 3-46 +22/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1441, characters 3-9 +23/50 xenopsd-xc @ st30 Raised at file "lib/xenops_server.ml", line 1452, characters 9-10 +24/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1458, characters 48-60 +25/50 xenopsd-xc @ st30 Called from file "lib/task_server.ml", line 151, characters 15-26 +26/50 xapi @ st30 Raised at file "xapi_xenops.ml", line 1719, characters 11-14 +27/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9 +28/50 xapi @ st30 Raised at file "xapi_xenops.ml", line 2005, characters 13-14 +29/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9 +30/50 xapi @ st30 Raised at file "xapi_xenops.ml", line 1785, characters 15-16 +31/50 xapi @ st30 Called from file "message_forwarding.ml", line 233, characters 25-44 +32/50 xapi @ st30 Called from file "message_forwarding.ml", line 915, characters 15-67 +33/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9 +34/50 xapi @ st30 Raised at file "lib/pervasiveext.ml", line 26, characters 9-12 +35/50 xapi @ st30 Called from file "message_forwarding.ml", line 1205, characters 21-199 +36/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9 +37/50 xapi @ st30 Raised at file "lib/pervasiveext.ml", line 26, characters 9-12 +38/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9 +39/50 xapi @ st30 Raised at file "rbac.ml", line 236, characters 10-15 +40/50 xapi @ st30 Called from file "server_helpers.ml", line 75, characters 11-41 +41/50 xapi @ st30 Raised at file
"cli_util.ml", line 78, characters 9-12 +42/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9 +43/50 xapi @ st30 Raised at file "lib/pervasiveext.ml", line 26, characters 9-12 +44/50 xapi @ st30 Called from file "cli_operations.ml", line 1889, characters 2-6 +45/50 xapi @ st30 Re-raised at file "cli_operations.ml", line 1898, characters 10-11 +46/50 xapi @ st30 Called from file "cli_operations.ml", line 1821, characters 14-18 +47/50 xapi @ st30 Called from file "cli_operations.ml", line 2109, characters 7-526 +48/50 xapi @ st30 Called from file "xapi_cli.ml", line 113, characters 18-56 +49/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9 +``` + +One can automatically set "--trace" for a whole shell session as follows: + +```bash +export XE_EXTRA_ARGS="--trace" +``` + +The XenAPI +---------- + +We already store error information in the XenAPI "Task" object and so we +can store backtraces at the same time. We shall add a field "backtrace" +which will have type "string" but which will contain s-expression encoded +backtrace data. Clients should not attempt to parse this string: its +contents may change in future. The reason it is different from the json +mentioned before is that it also contains host and process information +supplied by Xapi, and may be extended in future to contain other diagnostic +information. + + +The Xenopsd API +--------------- + +We already store error information in the xenopsd API "Task" objects, +we can extend these to store the backtrace in an additional field ("backtrace"). +This field will have type "string" but will contain s-expression encoded +backtrace data. + + +The SMAPIv1 API +--------------- + +Errors in SMAPIv1 are returned as XMLRPC "Faults" containing a code and +a status line. Xapi transforms these into XenAPI exceptions usually of the +form `SR_BACKEND_FAILURE_`. We can extend the SM backends to use the +XenAPI exception type directly: i.e. 
to marshal exceptions as dictionaries: + +```python + results = { + "Status": "Failure", + "ErrorDescription": [ code, param1, ..., paramN ] + } +``` + +We can then define a new backtrace-carrying error: + +- code = `SR_BACKEND_FAILURE_WITH_BACKTRACE` +- param1 = json-encoded backtrace +- param2 = code +- param3 = reason + +which is internally transformed into `SR_BACKEND_FAILURE_` and +the backtrace is appended to the current Task backtrace. From the client's +point of view the final exception should look the same, but Xapi will have +a chance to see and log the whole backtrace. + +As a side-effect, it is possible for SM plugins to throw XenAPI errors directly, +without interpretation by Xapi. diff --git a/doc/content/design/bonding-improvements.md b/doc/content/design/bonding-improvements.md new file mode 100644 index 00000000000..fc09a8a7fe1 --- /dev/null +++ b/doc/content/design/bonding-improvements.md @@ -0,0 +1,288 @@ +--- +title: Bonding Improvements design +layout: default +design_doc: true +revision: 1 +status: released (6.0) +--- + +This document describes design details for the +PR-1006 requirements. + +XAPI and XenAPI +=============== + +Creating a Bond +--------------- + +### Current Behaviour on Bond creation + +Steps for a user to create a bond: + +1. Shutdown all VMs with VIFs using the interfaces that will be bonded, + in order to unplug those VIFs. +2. Create a Network to be used by the bond: `Network.create` +3. Call `Bond.create` with a ref to this Network, a list of refs of + slave PIFs, and a MAC address to use. +4. Call `PIF.reconfigure_ip` to configure the bond master. +5. Call `Host.management_reconfigure` if one of the slaves is the + management interface. This command will call `interface-reconfigure` + to bring up the master and bring down the slave PIFs, thereby + activating the bond. Otherwise, call `PIF.plug` to activate the + bond. + +`Bond.create` XenAPI call: + +1. Remove duplicates in the list of slaves. +2. 
Validate the following: + - Slaves must not be in a bond already. + - Slaves must not be VLAN masters. + - Slaves must be on the same host. + - Network does not already have a PIF on the same host as the + slaves. + - The given MAC is valid. + +3. Create the master PIF object. + - The device name of this PIF is `bond`*x*, with *x* the smallest + unused non-negative integer. + - The MAC of the first-named slave is used if no MAC was + specified. + +4. Create the Bond object, specifying a reference to the master. The + value of the `PIF.master_of` field on the master is dynamically + computed on request. +5. Set the `PIF.bond_slave_of` fields of the slaves. The value of the + `Bond.slaves` field is dynamically computed on request. + +### New Behaviour on Bond creation + +Steps for a user to create a bond: + +1. Create a Network to be used by the bond: `Network.create` +2. Call `Bond.create` with a ref to this Network, a list of refs of + slave PIFs, and a MAC address to use.\ + The new bond will automatically be plugged if one of the slaves was + plugged. + +In the following, for a host *h*, a *VIF-to-move* is a VIF associated +with a VM that is either + +- running, suspended or paused on *h*, OR +- halted, and *h* is the only host that the VM can be started on. + +The `Bond.create` XenAPI call is updated to do the following: + +1. Remove duplicates in the list of slaves. +2. Validate the following, and raise an exception if any of these checks + fails: + - Slaves must not be in a bond already. + - Slaves must not be VLAN masters. + - Slaves must not be Tunnel access PIFs. + - Slaves must be on the same host. + - Network does not already have a PIF on the same host as the + slaves. + - The given MAC is valid. + +3. Try unplugging all currently attached VIFs of the set of VIFs that + need to be moved. Roll back and raise an exception if one of the + VIFs cannot be unplugged (e.g. due to the absence of PV drivers in + the VM). +4.
Determine the *primary slave*: the management PIF (if among the + slaves), or the first slave with IP configuration. +5. Create the master PIF object. + - The device name of this PIF is `bond`*x*, with *x* the smallest + unused non-negative integer. + - The MAC of the primary slave is used if no MAC was specified. + - Include the IP configuration of the primary slave. + - If any of the slaves has `PIF.disallow_unplug = true`, this will + be copied to the master. + +6. Create the Bond object, specifying a reference to the master. The + value of the `PIF.master_of` field on the master is dynamically + computed on request. Also a reference to the primary slave is + written to `Bond.primary_slave` on the new Bond object. +7. Set the `PIF.bond_slave_of` fields of the slaves. The value of the + `Bond.slaves` field is dynamically computed on request. +8. Move VLANs, plus the VIFs-to-move on them, to the master. + - If all VLANs on the slaves have different tags, all VLANs will + be moved to the bond master, while the same Network is used. The + network effectively moves up to the bond and therefore no VIFs + need to be moved. + - If multiple VLANs on different slaves have the same tag, they + necessarily have different Networks as well. Only one VLAN with + this tag is created on the bond master. All VIFs-to-move on the + remaining VLAN networks are moved to the Network that was moved + up. + +9. Move Tunnels to the master. The tunnel Networks move up with the + tunnels. As tunnel keys are different for all tunnel networks, there + are no complications as in the VLAN case. +10. Move VIFs-to-move on the slaves to the master. +11. If one of the slaves is the current management interface, move + management to the master; the master will automatically be plugged. + If none of the slaves is the management interface, plug the master + if any of the slaves was plugged. In both cases, the slaves will + automatically be unplugged. +12. 
On all slaves, reset the IP configuration and set `disallow_unplug` + to false. + +*Note: "moving" a VIF, VLAN or tunnel means "re-creating somewhere else, +and destroying the old one".* + +Destroying a Bond +----------------- + +### Current Behaviour on Bond destruction + +Steps for a user to destroy a bond: + +1. If the management interface is on the bond, move it to another PIF + using `PIF.reconfigure_ip` and `Host.management_reconfigure`. + Otherwise, no `PIF.unplug` needs to be called on the bond master, as + `Bond.destroy` does this automatically. +2. Call `Bond.destroy` with a ref to the Bond object. +3. If desired, bring up the former slave PIFs by calls to `PIF.plug` + (this does not happen automatically). + +`Bond.destroy` XenAPI call: + +1. Validate the following constraints: + - No VLANs are attached to the bond master. + - The bond master is not the management PIF. + +2. Bring down the master PIF and clean up the underlying network + devices. +3. Remove the Bond and master PIF objects. + +### New Behaviour on Bond destruction + +Steps for a user to destroy a bond: + +1. Call `Bond.destroy` with a ref to the Bond object. +2. If desired, move VIFs/VLANs/tunnels/management from (former) primary + slave to other PIFs. + +`Bond.destroy` XenAPI call is updated to do the following: + +1. Try unplugging all currently attached VIFs of the set of VIFs that + need to be moved. Roll back and raise an exception if one of the + VIFs cannot be unplugged (e.g. due to the absence of PV drivers in + the VM). +2. Copy the IP configuration of the master to the primary slave. +3. Move VLANs, with their Networks, to the primary slave. +4. Move Tunnels, with their Networks, to the primary slave. +5. Move VIFs-to-move on the master to the primary slave. +6. If the master is the current management interface, move management + to the primary slave. The primary slave will automatically be + plugged. +7. If the master was plugged, plug the primary slave.
This will + automatically clean up the underlying devices of the bond. +8. If the master has `PIF.disallow_unplug = true`, this will be copied + to the primary slave. +9. Remove the Bond and master PIF objects. + +Using Bond Slaves +----------------- + +### Current Behaviour for Bond Slaves + +- It is possible to plug any existing PIF, even bond slaves. Any other + PIFs that cannot be attached at the same time as the PIF that is + being plugged are automatically unplugged. +- Similarly, it is possible to make a bond slave the management + interface. Any other PIFs that cannot be attached at the same time + as the PIF that is being plugged are automatically unplugged. +- It is possible to have a VIF on a Network associated with a bond + slave. When the VIF's VM is started, or the VIF is hot-plugged, the + PIF it relies on is automatically plugged, and any other PIFs that + cannot be attached at the same time as this PIF are automatically + unplugged. +- It is possible to have a VLAN on a bond slave, though the bond + (master) and the VLAN may not be simultaneously attached. This is + not currently enforced (which may be considered a bug). + +### New Behaviour for Bond Slaves + +- It is no longer possible to plug a bond slave. The exception + CANNOT\_PLUG\_BOND\_SLAVE is raised when trying to do so. +- It is no longer possible to make a bond slave the management + interface. The exception CANNOT\_PLUG\_BOND\_SLAVE is raised when + trying to do so. +- It is still possible to have a VIF on the Network of a bond slave. + However, it is not possible to start such a VIF's VM on a host, if + this would need a bond slave to be plugged. Trying this will result + in a CANNOT\_PLUG\_BOND\_SLAVE exception. Likewise, it is not + possible to hot-plug such a VIF. +- It is no longer possible to place a VLAN on a bond slave. The + exception CANNOT\_ADD\_VLAN\_TO\_BOND\_SLAVE is raised when trying + to do so. +- It is no longer possible to place a tunnel on a bond slave.
The + exception CANNOT\_ADD\_TUNNEL\_TO\_BOND\_SLAVE is raised when trying + to do so. + +Actions on Start-up +------------------- + +### Current Behaviour on Start-up + +When a pool slave starts up, bonds and VLANs on the pool master are +replicated on the slave: + +- Create all VLANs that the master has, but the slave has not. VLANs + are identified by their tag, the device name of the slave PIF, and + the Networks of the master and slave PIFs. +- Create all bonds that the master has, but the slave has not. If the + interfaces needed for the bond are not all available on the slave, a + partial bond is created. If some of these interfaces are already + bonded on the slave, this bond is destroyed first. + +### New Behaviour on Start-up + +- The current VLAN/tunnel/bond recreation code is retained, as it uses + the new Bond.create and Bond.destroy functions, and therefore does + what it needs to do. +- Before VLAN/tunnel/bond recreation, any violations of the rules + defined in R2 are rectified, by moving VIFs, VLANs, tunnels or + management up to bonds. + +CLI +=== + +The behaviour of the `xe` CLI commands `bond-create`, `bond-destroy`, +`pif-plug`, and `host-management-reconfigure` is changed to match their +associated XenAPI calls. + +XenCenter +========= + +XenCenter already automatically moves the management interface when a +bond is created or destroyed. This is no longer necessary, as the +`Bond.create/destroy` calls already do this. XenCenter only needs to +copy any `PIF.other_config` keys that it needs between primary slave and +bond master. + +Manual Tests +============ + +- Create a bond of two interfaces... + - without VIFs/VLANs/management on them; + - with management on one of them; + - with a VLAN on one of them; + - with two VLANs on two different interfaces, having the same VLAN + tag; + - with a VIF associated with a halted VM on one of them; + - with a VIF associated with a running VM (with and without PV + drivers) on one of them.
+- Destroy a bond of two interfaces... + - without VIFs/VLANs/management on it; + - with management on it; + - with a VLAN on it; + - with a VIF associated with a halted VM on it; + - with a VIF associated with a running VM (with and without PV + drivers) on it. +- In a pool of two hosts, having VIFs/VLANs/management on the + interfaces of the pool slave, create a bond on the pool master, and + restart XAPI on the slave. +- Restart XAPI on a host with a networking configuration that has + become illegal due to these requirements. + diff --git a/doc/content/design/coverage/coverage-screenshot.png b/doc/content/design/coverage/coverage-screenshot.png new file mode 100644 index 00000000000..f14d8070e3f Binary files /dev/null and b/doc/content/design/coverage/coverage-screenshot.png differ diff --git a/doc/content/design/coverage/index.md b/doc/content/design/coverage/index.md new file mode 100644 index 00000000000..3b3f6ec3ec7 --- /dev/null +++ b/doc/content/design/coverage/index.md @@ -0,0 +1,267 @@ +--- +layout: default +title: Code Coverage Profiling +design_doc: true +status: proposed +revision: 2 +--- + +We would like to add optional coverage profiling to existing [OCaml] +projects in the context of [XenServer] and [XenAPI]. This article +presents how we do it. + +Binaries instrumented for coverage profiling in the XenServer project +need to run in an environment where several services act together as +they provide operating-system-level services. This makes it a little +harder than profiling code that can be profiled and executed in +isolation. 
+ +## TL;DR + +To build binaries with coverage profiling, do: + + ./configure --enable-coverage + make + +Binaries will log coverage data to `/tmp/bisect*.out` from which a +coverage report can be generated in `coverage/`: + + bisect-ppx-report -I _build -html coverage /tmp/bisect*.out + + +## Profiling Framework Bisect-PPX + +The open-source [BisectPPX] instrumentation framework uses extension +points (PPX) in the [OCaml] compiler to instrument code during +compilation. Instrumented code for a binary is then compiled as usual +and, during execution, logs data to in-memory data structures. Before an +instrumented binary terminates, it writes the logged data to a file. +This data can then be analysed with the `bisect-ppx-report` tool, to +produce a summary of annotated code that highlights what part of a +codebase was executed. + +[BisectPPX] has several desirable properties: + +* a robust code base that is well tested +* it is easy to integrate into the compilation pipeline (see below) +* it is specific to the [OCaml] language; an expression-oriented language + like OCaml doesn't fit the traditional statement coverage well +* it is actively maintained +* it generates useful reports for interactive and non-interactive use + that help to improve code coverage + +![Coverage Analysis](./coverage-screenshot.png) + +Red parts indicate code that wasn't executed whereas green parts were. +Hovering over a dark green spot reveals how often that point was +executed.
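The instrumentation model described in this section (counters kept in memory, written out once when the process exits) can be illustrated with a small sketch. It is written in Python purely for illustration and is not how BisectPPX is implemented: the names `hits`, `visit`, `dump` and `classify`, the JSON output format and the output file name are all assumptions made for this example.

```python
import atexit
import json
import os
import tempfile
from collections import Counter

# Hypothetical in-memory store: point identifier -> number of times it ran.
hits = Counter()

def visit(point):
    """Record one execution of an instrumented point."""
    hits[point] += 1

def dump(path):
    """Write the counters to a file (JSON is an arbitrary choice for this sketch)."""
    with open(path, "w") as out:
        json.dump(dict(hits), out)

def classify(a, b):
    # Roughly what an instrumented function could look like: every
    # branch records a visit before doing its real work.
    visit("classify/entry")
    if a > b:
        visit("classify/then")
        return "greater"
    visit("classify/else")
    return "not greater"

# As described above, the data is only written when the process exits.
atexit.register(dump, os.path.join(tempfile.gettempdir(), "coverage-sketch.json"))

classify(2, 1)
classify(3, 1)
print(dict(hits))
```

In this run the `classify/else` point is never recorded, which corresponds to the parts a coverage report would highlight as not executed.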
+ +The individual steps of instrumenting code with [BisectPPX] are greatly +abstracted by OCamlfind (OCaml's library manager) and OCamlbuild +(OCaml's compilation manager): + + # write code + vim example.ml + + # build it with instrumentation from bisect_ppx + ocamlbuild -use-ocamlfind -pkg bisect_ppx -pkg unix example.native + + # execute it - generates files ./bisect*.out + ./example.native + + # generate report + bisect-ppx-report -I _build -html coverage bisect000* + + # view coverage/index.html + + Summary: + - 'binding' points: 2/2 (100.00%) + - 'sequence' points: 10/10 (100.00%) + - 'match/function' points: 5/8 (62.50%) + - total: 17/20 (85.00%) + +The fourth step generates a HTML report in `coverage/`. All it takes is +to declare to [OCamlbuild] that a module depends on `bisect_ppx` and it +will be instrumented during compilation. Behind the scenes `ocamlfind` +makes sure that the compiler uses a preprocessing step that instruments +the code. + +## Signal Handling + +During execution the code instrumentation leads to the collection of +data. This code registers a function with `at_exit` that writes the data +to `bisect*.out` when `exit` is called. A binary can terminate without +calling `exit` and in that case the file would not be written. It is +therefore important to make sure that `exit` is called. If this does not +happen naturally, for example in the context of a daemon that is +terminated by receiving the `TERM` signal, a signal handler must be +installed: + + let stop signal = + printf "caught signal %d\n" signal; + exit 0 + + Sys.set_signal Sys.sigterm (Sys.Signal_handle stop) + +## Dumping coverage information at runtime + +By default coverage data can only be dumped at exit, which is inconvenient if you have a test-suite +that needs to reuse a long running daemon, and starting/stopping it each time is not feasible. + +In such cases we need an API to dump coverage at runtime, which *is* provided by `bisect_ppx >= 1.3.0`. 
+However, each daemon will need to set up a way to listen for an event that triggers this coverage dump. +Furthermore, it is desirable to compile runtime coverage dumping in conditionally, to be absolutely sure +that production builds do *not* use coverage-preprocessed code. + +Hence, instead of duplicating all this build logic in each daemon (`xapi`, `xenopsd`, etc.), we provide this +functionality in a common library `xapi-idl` that: + + * logs a message on startup so we know it is active + * sets the `BISECT_FILE` environment variable to dump coverage in the appropriate place + * listens on the `org.xen.xapi.coverage.` message queue for runtime coverage dump commands: + * sending `dump ` will cause runtime coverage to be dumped to a file + named `bisect--..out` + * sending `reset` will cause the runtime coverage counters to be reset + +Daemons that use `Xcp_service.configure2` (e.g. `xenopsd`) will benefit from this runtime trigger automatically, +provided they are themselves preprocessed with `bisect_ppx`. + +Since we are interested in collecting coverage data for system-wide test-suite runs, we need a way to trigger +dumping of coverage data centrally, and a good candidate for that is `xapi` as the top-level daemon. + +It will call `Xcp_coverage.dispatcher_init ()`, which listens on `org.xen.xapi.coverage.dispatch` and +dispatches the coverage dump command to all message queues under `org.xen.xapi.coverage.*` except itself. + +On production and regular builds all of this is a no-op, ensured by using separate `lib/coverage/disabled.ml` and `lib/coverage/enabled.ml` +files which implement the same interface, and choosing which one to use at build time. + + +## Where Data is Written + +By default, [BisectPPX] writes data in a binary's current working +directory as `bisectXXXX.out`. It doesn't overwrite existing files and +files from several runs can be combined during analysis. However, this +name and the location can be inconvenient when multiple programs share a +directory.
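One way a non-overwriting `bisectXXXX.out` naming scheme can work is to pick the first number that is not yet taken. The sketch below is in Python for illustration only. The helper `next_output_name`, the four-digit counter and the "smallest unused number" rule are assumptions drawn from the file name pattern above, not a description of BisectPPX's actual code.

```python
import os
import tempfile

def next_output_name(directory, prefix="bisect", suffix=".out"):
    """Return the first <prefix>NNNN<suffix> path in `directory` that does not exist yet."""
    n = 0
    while True:
        candidate = os.path.join(directory, f"{prefix}{n:04d}{suffix}")
        if not os.path.exists(candidate):
            return candidate
        n += 1

# Two consecutive "runs" in the same directory get distinct file names,
# so the data of the first run is preserved.
with tempfile.TemporaryDirectory() as workdir:
    first = next_output_name(workdir)
    open(first, "w").close()          # pretend the first run wrote its data
    second = next_output_name(workdir)
    print(os.path.basename(first), os.path.basename(second))
```

Because every run gets its own file, several programs writing to one shared directory produce files that cannot be told apart by name, which is the inconvenience noted above.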
+ +[BisectPPX]'s default can be overridden with the `BISECT_FILE` +environment variable. This can happen on the command line: + + BISECT_FILE=/tmp/example ./example.native + +In the context of XenServer we could do this in startup scripts. +However, we added a bit of code + + val Coverage.init: string -> unit + +that sets the environment variable from inside the program. The files +are written to a temporary directory (respecting `$TMP` or using `/tmp`), +and the `string`-typed argument is included in the file name. To be +effective, this function must be called before the program exits. For +clarity it is called at the beginning of program execution. + +## Instrumenting an Oasis Project + +While instrumentation is easy on the level of a small file or project it +is challenging in a bigger project. We decided to focus on projects that +are built with the [Oasis] build and packaging manager. These have a +well-defined structure and compilation process that is controlled by a +central `_oasis` file. This file describes for each library and binary +its dependencies at a package level. From this, [Oasis] generates a +`configure` script and compilation rules for the [OCamlbuild] system. +[Oasis] is designed so that the generated files can be shipped without +requiring [Oasis] itself to be available. + +Goals for instrumentation are: + +* what files are instrumented should be obvious and easy to manage +* instrumentation must be optional, yet easy to activate +* avoid methods that require keeping several files in sync like multiple + `_oasis` files +* avoid separate Git branches for instrumented and non-instrumented + code + +In the ideal case, we could introduce a configuration switch +`./configure --enable-coverage` that would prepare compilation for +coverage instrumentation. While [Oasis] supports the creation of such +switches, they cannot be used to control build dependencies like +compiling a file with or without package `bisect_ppx`.
We have chosen a +different method: + +A `Makefile` target `coverage` augments the `_tags` file to include the +rules in file `_tags.coverage` that cause files to be instrumented: + + make coverage # prepare + make # build + +leads to the execution of this code during preparation: + + coverage: _tags _tags.coverage + test ! -f _tags.orig && mv _tags _tags.orig || true + cat _tags.coverage _tags.orig > _tags + +The file `_tags.coverage` contains two simple [OCamlbuild] rules that +could be tweaked to instrument only some files: + + <**/*.ml{,i,y}>: pkg_bisect_ppx + <**/*.native>: pkg_bisect_ppx + +When `make coverage` is not called, these rules are not active and +hence, code is not instrumented for coverage. We believe that this +solution to control instrumentation meets the goals from above. In +particular, what files are instrumented and when is controlled by very +few lines of declarative code that lives in the main repository of a +project. + +## Project Layout + +The crucial files in an [Oasis]-controlled project that is set up for +coverage analysis are: + + ./_oasis - make "profiling" a build dependency + ./_tags.coverage - what files get instrumented + ./profiling/coverage.ml - support file, sets env var + ./Makefile - target 'coverage' + +The `_oasis` file bundles the files under `profiling/` into an internal +library which executables then depend on: + + # Support files for profiling + Library profiling + CompiledObject: best + Path: profiling + Install: false + Findlibname: profiling + Modules: Coverage + BuildDepends: + + Executable set_domain_uuid + CompiledObject: best + Path: tools + ByteOpt: -warn-error +a-3 + NativeOpt: -warn-error +a-3 + MainIs: set_domain_uuid.ml + Install: false + BuildDepends: + xenctrl, + uuidm, + cmdliner, + profiling # <-- here + +The `Makefile` target `coverage` primes the project for a profiling build: + + # make coverage - prepares for building with coverage analysis + + coverage: _tags _tags.coverage + test !
-f _tags.orig && mv _tags _tags.orig || true + cat _tags.coverage _tags.orig > _tags + + +[OCamlbuild]: https://github.com/ocaml/ocamlbuild/blob/master/manual/manual.adoc +[BisectPPX]: https://github.com/aantron/bisect_ppx +[OCaml]: http://ocaml.org +[XenServer]: https://github.com/xenserver +[XenAPI]: https://github.com/xapi-project +[Oasis]: http://oasis.forge.ocamlcore.org + + diff --git a/doc/content/design/cpu-levelling-v2.md b/doc/content/design/cpu-levelling-v2.md new file mode 100644 index 00000000000..2192c1665a3 --- /dev/null +++ b/doc/content/design/cpu-levelling-v2.md @@ -0,0 +1,202 @@ +--- +title: CPU feature levelling 2.0 +layout: default +design_doc: true +status: released (7.0) +revision: 7 +revision_history: +- revision_number: 1 + description: Initial version +- revision_number: 2 + description: Add details about VM migration and import +- revision_number: 3 + description: Included and excluded use cases +- revision_number: 4 + description: Rolling Pool Upgrade use cases +- revision_number: 5 + description: Lots of changes to simplify the design +- revision_number: 6 + description: Use case refresh based on simplified design +- revision_number: 7 + description: RPU refresh based on simplified design +--- + +Executive Summary +================= + +The old XS 5.6-style Heterogeneous Pool feature that is based around hardware-level CPUID masking will be replaced by a safer and more flexible software-based levelling mechanism. + +History +======= + +- Original XS 5.6 design: [heterogeneous-pools](../heterogeneous-pools) +- Changes made in XS 5.6 FP1 for the DR feature (added CPUID checks upon migration) +- XS 6.1: migration checks extended for cross-pool scenario + +High-level Interfaces and Behaviour +=================================== + +A VM can only be migrated safely from one host to another if both hosts offer the set of CPU features which the VM expects. 
If this is not the case, CPU features may appear or disappear as the VM is migrated, causing it to crash. The purpose of feature levelling is to hide features which the hosts do not have in common from the VM, so that it does not see any change in CPU capabilities when it is migrated. + +Most pools start off with homogeneous hardware, but over time it may become impossible to source new hosts with the same specifications as the ones already in the pool. The main use of feature levelling is to allow such newer, more capable hosts to be added to an existing pool while preserving the ability to migrate existing VMs to any host in the pool. + +Principles for Migration +------------------------ + +The CPU levelling feature aims to both: + +1. Make VM migrations _safe_ by ensuring that a VM will see the same CPU features before and after a migration. +2. Make VMs as _mobile_ as possible, so that they can be freely migrated around in a XenServer pool. + +To make migrations safe: + +* A migration request will be blocked if the destination host does not offer some of the CPU features that the VM currently sees. +* Any additional CPU features that the destination host is able to offer will be hidden from the VM. + +_Note:_ Due to the limitations of the old Heterogeneous Pools feature, we are not able to guarantee the safety of VMs that are migrated to a Levelling-v2 host from an older host, during a rolling pool upgrade. This is because such VMs may be using CPU features that were not captured in the old feature sets, of which we are therefore unaware. However, migrations between the same two hosts, but before the upgrade, may have already been unsafe. The promise is that we will not make migrations _more_ unsafe during a rolling pool upgrade. + +To make VMs mobile: + +* A VM that is started in a XenServer pool will be able to see only CPU features that are common to all hosts in the pool.
The set of common CPU features is referred to in this document as the _pool CPU feature level_, or simply the _pool level_. + +Use Cases for Pools +------------------- + +1. A user wants to add a new host to an existing XenServer pool. The new host has all the features of the existing hosts, plus extra features which the existing hosts do not. The new host will be allowed to join the pool, but its extra features will be hidden from VMs that are started on the host or migrated to it. The join does not require any host reboots. + +2. A user wants to add a new host to an existing XenServer pool. The new host does not have all the features of the existing ones. XenCenter warns the user that adding the host to the pool is possible, but it would lower the pool's CPU feature level. The user accepts this and continues the join. The join does not require any host reboots. VMs that are started anywhere on the pool, from now on, will only see the features of the new host (the lowest common denominator), such that they are migratable to any host in the pool, including the new one. VMs that were running before the pool join will not be migratable to the new host, because these VMs may be using features that the new host does not have. However, after a reboot, such VMs will be fully mobile. + +3. A user wants to add a new host to an existing XenServer pool. The new host does not have all the features of the existing ones, and at the same time, it has certain features that the pool does not have (the feature sets overlap). This is essentially a combination of the two use cases above, where the pool's CPU feature level will be downgraded to the intersection of the feature sets of the pool and the new host. The join does not require any host reboots. + +4. A user wants to upgrade or repair the hardware of a host in an existing XenServer pool. After upgrade the host has all the features it used to have, plus extra features which other hosts in the pool do not have. 
The extra features are masked out and the host resumes its place in the pool when it is booted up again. + +5. A user wants to upgrade or repair the hardware of a host in an existing XenServer pool. After upgrade the host has fewer features than it used to have. When the host is booted up again, the pool's CPU feature level will be automatically lowered, and the user will be alerted of this fact (through the usual alerting mechanism). + +6. A user wants to remove a host from an existing XenServer pool. The host will be removed as normal after any VMs on it have been migrated away. The feature set offered by the pool will be automatically re-levelled upwards in case the host which was removed was the least capable in the pool, and additional features common to the remaining hosts will be unmasked. + + +Rolling Pool Upgrade +-------------------- + +* A VM which was running on the pool before the upgrade is expected to continue to run afterwards. However, when the VM is migrated to an upgraded host, some of the CPU features it had been using might disappear, either because they are not offered by the host or because the new feature-levelling mechanism hides them. To have the best chance for such a VM to successfully migrate (see the note under "Principles for Migration"), it will be given a temporary VM-level feature set providing all of the destination's CPU features that were unknown to XenServer before the upgrade. When the VM is rebooted it will inherit the pool-level feature set. + +* A VM which is started during the upgrade will be given the current pool-level feature set. The pool-level feature set may drop after the VM is started, as more hosts are upgraded and re-join the pool; however, the VM is guaranteed to be able to migrate to any host which has already been upgraded. If the VM is started on the master, there is a risk that it may only be able to run on that host.
+ +* To allow the VMs with grandfathered-in flags to be migrated around in the pool, the intra-pool VM migration pre-checks will compare the VM's feature flags to the target host's flags, not the pool flags. This will maximise the chance that a VM can be migrated somewhere in a heterogeneous pool, particularly in the case where only a few hosts in the pool do not have features which the VMs require. + +* To allow cross-pool migration, including to a pool of a higher XenServer version, we will still check the VM's requirements against the *pool-level* features of the target pool. This is to avoid the possibility that we migrate a VM to an 'island' in the other pool, from which it cannot be migrated any further. + + +XenAPI Changes +-------------- + +### Fields + +* `host.cpu_info` is a field of type `(string -> string) map` that contains information about the CPUs in a host. It contains the following keys: `cpu_count`, `socket_count`, `vendor`, `speed`, `modelname`, `family`, `model`, `stepping`, `flags`, `features`, `features_after_reboot`, `physical_features` and `maskable`. + * The following keys are specific to hardware-based CPU masking and will be removed: `features_after_reboot`, `physical_features` and `maskable`. + * The `features` key will continue to hold the current CPU features that the host is able to use. In practice, these features will be available to Xen itself and dom0; guests may only see a subset. The current format is a string of four 32-bit words represented as four groups of 8 hexadecimal digits, separated by dashes. This will change to an arbitrary number of 32-bit words. Each bit at a particular position (starting from the left) still refers to a distinct CPU feature (`1`: feature is present; `0`: feature is absent), and feature strings may be compared between hosts. The old format simply becomes a special (4 word) case of the new format, and bits in the same position may be compared between old and new feature strings.
+ * The new key `features_pv` will be added, representing the subset of `features` that the host is able to offer to a PV guest. + * The new key `features_hvm` will be added, representing the subset of `features` that the host is able to offer to an HVM guest. +* A new field `pool.cpu_info` of type `(string -> string) map` (read only) will be added. It will contain: + * `vendor`: The common CPU vendor across all hosts in the pool. + * `features_pv`: The intersection of `features_pv` across all hosts in the pool, representing the feature set that a PV guest will see when started on the pool. + * `features_hvm`: The intersection of `features_hvm` across all hosts in the pool, representing the feature set that an HVM guest will see when started on the pool. + * `cpu_count`: the total number of CPU cores in the pool. + * `socket_count`: the total number of CPU sockets in the pool. +* The `pool.other_config:cpuid_feature_mask` override key will no longer have any effect on pool join or VM migration. +* The field `VM.last_boot_CPU_flags` will be updated to the new format (see `host.cpu_info:features`). It will still contain the feature set that the VM was started with as well as the vendor (under the `features` and `vendor` keys respectively). + +### Messages + +* `pool.join` currently requires that the CPU vendor and feature set (according to `host.cpu_info:vendor` and `host.cpu_info:features`) of the joining host are equal to those of the pool master. This requirement will be loosened to mandate only equality in CPU vendor: + * The join will be allowed if `host.cpu_info:vendor` equals `pool.cpu_info:vendor`. + * This means that xapi will additionally allow hosts that have a _more_ extensive feature set than the pool (as long as the CPU vendor is common). Such hosts are transparently down-levelled to the pool level (without needing reboots). 
+ * This further means that xapi will additionally allow hosts that have a _less_ extensive feature set than the pool (as long as the CPU vendor is common). In this case, the pool is transparently down-levelled to the new host's level (without needing reboots). Note that this does not affect any running VMs in any way; the mobility of running VMs will not be restricted: they can still migrate to any host they could migrate to before. It does mean that those running VMs will not be migratable to the new host. + * The current error raised in case of a CPU mismatch is `POOL_HOSTS_NOT_HOMOGENEOUS` with `reason` argument `"CPUs differ"`. This will remain the error that is raised if the pool join fails due to incompatible CPU vendors. + * The `pool.other_config:cpuid_feature_mask` override key will no longer have any effect. +* `host.set_cpu_features` and `host.reset_cpu_features` will be removed: it will no longer be possible to use the old method of CPU feature masking (CPU feature sets are controlled automatically by xapi). Calls will fail with `MESSAGE_REMOVED`. +* VM lifecycle operations will be updated internally to use the new feature fields, to ensure that: + * Newly started VMs will be given CPU features according to the pool level for maximal mobility. + * For safety, running VMs will maintain their feature set across migrations and suspend/resume cycles. CPU features will transparently be hidden from VMs. + * Furthermore, migrate and resume will only be allowed in case the target host's CPUs are capable enough, i.e. `host.cpu_info:vendor` = `VM.last_boot_CPU_flags:vendor` and `host.cpu_info:features_{pv,hvm}` ⊇ `VM.last_boot_CPU_flags:features`. A `VM_INCOMPATIBLE_WITH_THIS_HOST` error will be returned otherwise (as happens today).
+ * For cross pool migrations, to ensure maximal mobility in the target pool, a stricter condition will apply: the VM must satisfy the pool CPU level rather than just the target host's level: `pool.cpu_info:vendor` = `VM.last_boot_CPU_flags:vendor` and `pool.cpu_info:features_{pv,hvm}` ⊇ `VM.last_boot_CPU_flags:features` + + +CLI Changes +----------- + +The following changes to the `xe` CLI will be made: + +* `xe host-cpu-info` (as well as `xe host-param-list` and friends) will return the fields of `host.cpu_info` as described above. +* `xe host-set-cpu-features` and `xe host-reset-cpu-features` will be removed. +* `xe host-get-cpu-features` will still return the value of `host.cpu_info:features` for a given host. + +Low-level implementation +======================== + +Xenctrl +------- + +The old `xc_get_boot_cpufeatures` hypercall will be removed, and replaced by two new functions, which are available to xenopsd through the Xenctrl module: + + external get_levelling_caps : handle -> int64 = "stub_xc_get_levelling_caps" + + type featureset_index = Featureset_host | Featureset_pv | Featureset_hvm + external get_featureset : handle -> featureset_index -> int64 array = "stub_xc_get_featureset" + +In particular, the `get_featureset` function will be used by xapi/xenopsd to ask Xen which are the widest sets of CPU features that it can offer to a VM (PV or HVM). I don't think there is a use for `get_levelling_caps` yet. + +Xenopsd +------- + +* Update the type `Host.cpu_info`, which contains all the fields that need to go into the `host.cpu_info` field in the xapi DB. The type already exists but is unused. Add the function `HOST.get_cpu_info` to obtain an instance of the type. Some code from xapi and the cpuid.ml from xen-api-libs can be reused. +* Add a platform key `featureset` (`Vm.t.platformdata`), which xenopsd will write to xenstore along with the other platform keys (no code change needed in xenopsd). 
Xenguest will pick this up when a domain is created, and will apply the CPUID policy to the domain. This has the effect of masking out features that the host may have, but which have a `0` in the feature set bitmap. +* Review current cpuid-related functions in `xc/domain.ml`. + +Xapi +---- + +### Xapi startup + +* Update `Create_misc.create_host_cpu` function to use the new xenopsd call. +* If the host features fall below pool level, e.g. due to a change in hardware: down-level the pool by updating `pool.cpu_info.features_{pv,hvm}`. Newly started VMs will inherit the new level; already running VMs will not be affected, but will not be able to migrate to this host. +* To notify the admin of this event, an API alert (message) will be set: `pool_cpu_features_downgraded`. + +### VM start + +- Inherit feature set from pool (`pool.cpu_info.features_{pv,hvm}`) and set `VM.last_boot_CPU_flags` (`cpuid_helpers.ml`). +- The domain will be started with this CPU feature set enabled, by writing the feature set string to `platformdata` (see above). + +### VM migrate and resume + +- There are already CPU compatibility checks on migration, both in-pool and cross-pool, as well as resume. Xapi compares `VM.last_boot_CPU_flags` of the VM to-migrate with `host.cpu_info` of the receiving host. Migration is only allowed if the CPU vendors are the same, and `host.cpu_info:features` ⊇ `VM.last_boot_CPU_flags:features`. The check can be overridden by setting the `force` argument to `true`. +- For in-pool migrations, these checks will be updated to use the appropriate `features_pv` or `features_hvm` field. +- For cross-pool migrations, these checks will be updated to use `pool.cpu_info` (`features_pv` or `features_hvm` depending on how the VM was booted) rather than `host.cpu_info`.
+- If the above checks pass, then the `VM.last_boot_CPU_flags` will be maintained, and the new domain will be started with the same CPU feature set enabled, by writing the feature set string to `platformdata` (see above). +- In case the VM is migrated to a host with a higher xapi software version (e.g. a migration from a host that does not have CPU levelling v2), the feature string may be longer. This may happen during a rolling pool upgrade or a cross-pool migration, or when a suspended VM is resumed after an upgrade. In this case, the following safety rules apply: + - Only the existing (shorter) feature string will be used to determine whether the migration will be allowed. This is the best we can do, because we are unaware of the state of the extended feature set on the older host. + - The existing feature set in `VM.last_boot_CPU_flags` will be extended with the extra bits in `host.cpu_info:features_{pv,hvm}`, i.e. the widest feature set that can possibly be granted to the VM (just in case the VM was using any of these features before the migration). + - Strictly speaking, a migration of a VM from host A to B that was allowed before B was upgraded, may no longer be allowed after the upgrade, due to stricter feature sets in the new implementation (from the `xc_get_featureset` hypercall). However, the CPU features that are switched off by the new implementation are features that a VM would not have been able to actually use. We therefore need a don't-care feature set (similar to the old `pool.other_config:cpuid_feature_mask` key) with bits that we may ignore in migration checks, and switch off after the migration. This will be a xapi config file option. + - XXX: Can we actually block a cross-pool migration at the receiver end?? + +### VM import + +The `VM.last_boot_CPU_flags` field must be upgraded to the new format (only really needed for VMs that were suspended while exported; `preserve_power_state=true`), as described above.
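
The feature-string handling described in the sections above (a subset check on dash-separated 32-bit hexadecimal words, and extending a shorter string when a VM arrives from an older host) can be illustrated with a small sketch. This is not xapi's implementation; the function names are hypothetical and the five-word host string is made up for illustration.

```python
from typing import List


def parse_features(text: str) -> List[int]:
    """Parse 'xxxxxxxx-xxxxxxxx-...' into a list of 32-bit words."""
    return [int(word, 16) for word in text.split("-")]


def format_features(words: List[int]) -> str:
    """Render a list of 32-bit words back into the dash-separated form."""
    return "-".join(f"{word:08x}" for word in words)


def is_subset(vm: List[int], host: List[int]) -> bool:
    """True if every bit set in the VM's words is also set in the host's.

    Only the words present in the VM string are compared, so a shorter
    (old-format) VM string is checked against the host's leading words.
    Host words that are missing are treated as all zeroes.
    """
    padded = host + [0] * max(0, len(vm) - len(host))
    return all((v & ~h & 0xFFFFFFFF) == 0 for v, h in zip(vm, padded))


def upgrade_flags(vm: List[int], host: List[int]) -> List[int]:
    """Extend a shorter VM feature list with the host's extra words."""
    return vm + host[len(vm):]


if __name__ == "__main__":
    vm = parse_features("ffffff7f-ffffffff-ffffffff-ffffffff")
    host = parse_features("ffffffff-ffffffff-ffffffff-ffffffff-00000021")
    if is_subset(vm, host):
        print(format_features(upgrade_flags(vm, host)))
```

Because the old four-word format is a prefix of the new one, both helpers work on word lists of any length without a separate code path for old strings.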
+ +### Pool join + +Update pool join checks according to the rules above (see `pool.join`), i.e. remove the CPU features constraints. + +### Upgrade + +* The pool level (`pool.cpu_info`) will be initialised when the pool master upgrades, and automatically adjusted if needed (downwards) when slaves are upgraded, by each upgraded host's start-up sequence (as above under "Xapi startup"). +* The `VM.last_boot_CPU_flags` fields of running and suspended VMs will be "upgraded" to the new format on demand, when a VM is migrated to or resumed on an upgraded host, as described above. + + +XenCenter integration +--------------------- + +- Don't explicitly down-level upon join anymore +- Become aware of new pool join rule +- Update Rolling Pool Upgrade + diff --git a/doc/content/design/distributed-database/architecture.png b/doc/content/design/distributed-database/architecture.png new file mode 100644 index 00000000000..5756472b025 Binary files /dev/null and b/doc/content/design/distributed-database/architecture.png differ diff --git a/doc/content/design/distributed-database/index.md b/doc/content/design/distributed-database/index.md new file mode 100644 index 00000000000..b56d043d9e8 --- /dev/null +++ b/doc/content/design/distributed-database/index.md @@ -0,0 +1,181 @@ +--- +title: Distributed database +layout: default +design_doc: true +revision: 1 +status: proposed +--- + +All hosts in a pool use the shared database by sending queries to +the pool master. This creates + +- a performance bottleneck as the pool size increases +- a reliability problem when the master fails. + +The reliability problem can be ameliorated by running with HA enabled, +but this is not always possible. + +Both problems can be addressed by observing that the database objects +correspond to distinct physical objects where eventual consistency is +perfectly ok.
For example if host 'A' is running a VM and changes the +VM's name, it doesn't matter if it takes a while before the change shows +up on host 'B'. If host 'B' changes its network configuration then it +doesn't matter how long it takes host 'A' to notice. We would still like +the metadata to be replicated to cope with failure, but we can allow +changes to be committed locally and synchronised later. + +Note the one exception to this pattern: the current SM plugins use database +fields to implement locks. This should be shifted to a special-purpose +lock acquire/release API. + +Using git via Irmin +------------------- + +A git repository is a database of key=value pairs with branching history. +If we placed our host and VM metadata in git then we could `commit` +changes and `pull` and `push` them between replicas. The +[Irmin](https://github.com/mirage/irmin) library provides an easy programming +interface on top of git which we could link with the Xapi database layer. + +Proposed new architecture +------------------------- + +![Pools of one](architecture.png) + +The diagram above shows two hosts: one a master and the other a regular host. +The XenAPI client has sent a request to the wrong host; normally this would +result in a `HOST_IS_SLAVE` error being sent to the client. In the new +world, the host is able to process the request, only contacting the master +if it is necessary to acquire a lock. Starting a VM would require a lock; but +rebooting or migrating an existing VM would not. Assuming the lock can +be acquired, then the operation is executed locally with all state updates +being made to a git topic branch. + +![Topic branches](topic.png) + +Roughly we would have 1 topic branch per +pending XenAPI Task. Once the Task completes successfully, the topic branch +(containing the new VM state) is merged back into master. +Separately each +host will pull and push updates between each other for replication. 
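
To make the topic-branch idea concrete, here is a toy in-memory model of "one branch per Task, merged back on success". It is only a sketch: plain dictionaries stand in for git and Irmin, the class and method names are hypothetical, and the conflict check is a simplification rather than a description of how Irmin merges.

```python
class MergeConflict(Exception):
    """Raised when master changed a key that the topic branch also changed."""


class ToyDatabase:
    def __init__(self):
        # 'master' maps keys such as "VM/<ref>/power_state" to string values.
        self.master = {}

    def start_task(self):
        """Create a topic branch: remember the base and hand out a private copy."""
        base = dict(self.master)
        return base, dict(base)

    def finish_task(self, base, topic):
        """Merge a completed Task's branch into master (three-way, per key)."""
        keys = set(base) | set(topic)
        changed = {k for k in keys if base.get(k) != topic.get(k)}
        for key in changed:
            # Someone else modified the same key since the branch was taken.
            if self.master.get(key) != base.get(key):
                raise MergeConflict(key)
        for key in changed:
            if key in topic:
                self.master[key] = topic[key]
            else:
                self.master.pop(key, None)


if __name__ == "__main__":
    db = ToyDatabase()
    db.master["VM/a/power_state"] = "Halted"
    base, topic = db.start_task()
    topic["VM/a/power_state"] = "Running"   # invisible to other Tasks so far
    db.finish_task(base, topic)             # Task succeeded: merge into master
    print(db.master)
```

Until `finish_task` runs, other readers of `master` never see the half-finished state, which is the isolation property the design relies on. A failed Task would simply drop its branch without merging.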
+ +We would avoid merge conflicts by construction; either + +- a host's configuration will always be "owned" by the host and it will be + an error for anyone else to merge updates to it +- the master's locking will guarantee that a VM is running on at most one + host at a time. It will be an error for anyone else to merge updates to it. + +What we gain +------------ + +We will gain the following + +- the master will only be a bottleneck when the number of VM locks gets + really large; +- you will be able to connect XenCenter to hosts without a master and manage + them. Today such hosts are unmanageable. +- the database will have a history and you'll be able to "go back in time" + either for debugging or to recover from mistakes +- bugs caused by concurrent threads (in separate Tasks) confusing each other + will be vanquished. A typical failure mode is: one active thread destroys + an object; a passive thread sees the object and then tries to read it + and gets a database failure instead. Since every thread is operating a + separate Task they will all have their own branch and will be isolated from + each other. + +What we lose +------------ + +We will lose the following + +- the ability to use the Xapi database as a "lock" +- coherence between hosts: there will be no guarantee that an effect seen + by host 'A' will be seen immediately by host 'B'. In particular this means + that clients should send all their commands and `event.from` calls to + the same host (although any host will do) + + +Stuff we need to build +---------------------- + +- A `pull`/`push` replicator: this would have to monitor the list + of hosts in the pool and distribute updates to them in some vaguely + efficient manner. Ideally we would avoid hassling the pool master and + use some more efficient topology: perhaps a tree? 
+ +- A `git diff` to XenAPI event converter: whenever a host `pull`s + updates from another it needs to convert the diff into a set of touched + objects for any `event.from` to read. We could send the changeset hash + as the `event.from` token. + +- Irmin nested views: since Tasks can be nested (and git branches can be + nested) we need to make sure that Irmin views can be nested. + +- We need to go through the xapi code and convert all mixtures of database + access and XenAPI updates into pure database calls. With the previous system + it was better to use a XenAPI to remote large chunks of database effects to + the master than to perform them locally. It will now be better to run them + all locally and merge them at the end. Additionally since a Task will have + a local branch, it won't be possible to see the state on a remote host + without triggering an early merge (which would harm efficiency) + +- We need to create a first-class locking API to use instead of the + `VDI.sm_config` locks. + +Prototype +--------- + +A basic prototype has been created: + +```bash +$ opam pin xen-api-client git://github.com/djs55/xen-api-client#improvements +$ opam pin add xapi-database git://github.com/djs55/xapi-database +$ opam pin add xapi git://github.com/djs55/xen-api#schema-sexp +``` + +The `xapi-database` is clone of the existing Xapi database code +configured to run as a separate process. There is +[code to convert from XML to git](https://github.com/djs55/xapi-database/blob/master/core/db_git.ml#L55) +and +[an implementation of the Xapi remote database API](https://github.com/djs55/xapi-database/blob/master/core/db_git.ml#L186) +which uses the following layout: + +```bash +$ git clone /xapi.db db +Cloning into 'db'... +done. 
+ +$ cd db; ls +xapi + +$ ls xapi +console host_metrics PCI pool SR user VM +host network PIF session tables VBD VM_metrics +host_cpu PBD PIF_metrics SM task VDI + +$ ls xapi/pool +OpaqueRef:39adc911-0c32-9e13-91a8-43a25939110b + +$ ls xapi/pool/OpaqueRef\:39adc911-0c32-9e13-91a8-43a25939110b/ +crash_dump_SR __mtime suspend_image_SR +__ctime name_description uuid +default_SR name_label vswitch_controller +ha_allow_overcommit other_config wlb_enabled +ha_enabled redo_log_enabled wlb_password +ha_host_failures_to_tolerate redo_log_vdi wlb_url +ha_overcommitted ref wlb_username +ha_plan_exists_for _ref wlb_verify_cert +master restrictions + +$ ls xapi/pool/OpaqueRef\:39adc911-0c32-9e13-91a8-43a25939110b/other_config/ +cpuid_feature_mask memory-ratio-hvm memory-ratio-pv + +$ cat xapi/pool/OpaqueRef\:39adc911-0c32-9e13-91a8-43a25939110b/other_config/cpuid_feature_mask +ffffff7f-ffffffff-ffffffff-ffffffff +``` + +Notice how: + +- every object is a directory +- every key/value pair is represented as a file diff --git a/doc/content/design/distributed-database/topic.png b/doc/content/design/distributed-database/topic.png new file mode 100644 index 00000000000..bcb94f19667 Binary files /dev/null and b/doc/content/design/distributed-database/topic.png differ diff --git a/doc/content/design/emergency-network-reset.md b/doc/content/design/emergency-network-reset.md new file mode 100644 index 00000000000..44451205b51 --- /dev/null +++ b/doc/content/design/emergency-network-reset.md @@ -0,0 +1,143 @@ +--- +title: Emergency Network Reset Design +layout: default +design_doc: true +revision: 1 +status: released (6.0.2) +--- + +This document describes design details for the PR-1032 requirements. + +The design consists of four parts: + +1. A new XenAPI call `Host.reset_networking`, which removes all the + PIFs, Bonds, VLANs and tunnels associated with the given host, and a + call `PIF.scan_bios` to bring back the PIFs with device names as + defined in the BIOS. +2. 
A `xe-reset-networking` script that can be executed on a XenServer + host, which prepares the reset and causes the host to reboot. +3. An xsconsole page that essentially does the same as + `xe-reset-networking`. +4. A new item in the XAPI start-up sequence, which when triggered by + `xe-reset-networking`, calls `Host.reset_networking` and re-creates + the PIFs. + +Command-Line Utility +-------------------- + +The `xe-reset-networking` script takes the following parameters: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Parameter | Description |
+|-----------|-------------|
+| `-m`, `--master` | The IP address of the master. Optional if the host is pool slave, ignored otherwise. |
+| `--device` | Device name of management interface. Optional. If not specified, it is taken from the firstboot data. |
+| `--mode` | IP configuration mode for management interface. Optional. Either `dhcp` or `static` (default is `dhcp`). |
+| `--ip` | IP address for management interface. Required if `--mode=static`, ignored otherwise. |
+| `--netmask` | Netmask for management interface. Required if `--mode=static`, ignored otherwise. |
+| `--gateway` | Gateway for management interface. Optional; ignored if `--mode=dhcp`. |
+| `--dns` | DNS server for management interface. Optional; ignored if `--mode=dhcp`. |
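
The interplay between `--mode` and the other options in the table above can be sketched with Python's `argparse`. This is an illustration of the rules only, not the actual `xe-reset-networking` script; the function name `parse_reset_args` is hypothetical.

```python
import argparse


def parse_reset_args(argv):
    """Parse and validate options following the parameter table."""
    parser = argparse.ArgumentParser(prog="xe-reset-networking")
    parser.add_argument("-m", "--master", help="IP address of the pool master")
    parser.add_argument("--device", help="device name of the management interface")
    parser.add_argument("--mode", choices=["dhcp", "static"], default="dhcp")
    parser.add_argument("--ip")
    parser.add_argument("--netmask")
    parser.add_argument("--gateway")
    parser.add_argument("--dns")
    args = parser.parse_args(argv)

    if args.mode == "static":
        # --ip and --netmask are required for a static configuration.
        missing = [name for name in ("ip", "netmask") if getattr(args, name) is None]
        if missing:
            parser.error("--mode=static requires: " + ", ".join("--" + n for n in missing))
    else:
        # In dhcp mode the static-only options are ignored.
        args.ip = args.netmask = args.gateway = args.dns = None
    return args


if __name__ == "__main__":
    print(parse_reset_args(["--mode=static", "--ip=10.0.0.2", "--netmask=255.255.255.0"]))
```

`parser.error` prints a usage message and exits with status 2, which is the usual behaviour for a command-line tool given an inconsistent set of options.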
+ +The script takes the following steps after processing the given +parameters: + +1. Inform the user that the host will be restarted, and that any + running VMs should be shut down. Make the user confirm that they + really want to reset the networking by typing 'yes'. +2. Read `/etc/xensource/pool.conf` to determine whether the host is a + pool master or pool slave. +3. If a pool slave, update the IP address in the `pool.conf` file to + the one given in the `-m` parameter, if present. +4. Shut down networking subsystem (`service network stop`). +5. If no management device is specified, take it from + /etc/firstboot.d/data/management.conf. +6. If XAPI is running, stop it. +7. Reconfigure the management interface and associated bridge by + `interface-reconfigure --force`. +8. Update `MANAGEMENT_INTERFACE` and clear `CURRENT_INTERFACES` in + `/etc/xensource-inventory`. +9. Create the file `/tmp/network-reset` to trigger XAPI to complete the + network reset after the reboot. This file should contain the full + configuration details of the management interface as key/value pairs + (format: `<key>=<value>\n`), and looks similar to the firstboot data + files. The file contains at least the keys `DEVICE` and `MODE`, and + `IP`, `NETMASK`, `GATEWAY`, or `DNS` when appropriate. +10. Reboot + +XAPI +---- + +### XenAPI + +A new *hidden* API call: + +- `Host.reset_networking` + - Parameter: host reference `host` + - Calling this function removes all the PIF, Bond, VLAN and tunnel + objects associated with the given host from the master database. + All Network and VIF objects are maintained, as these do not + necessarily belong to a single host. + +### Start-up Sequence + +After reboot, in the XAPI start-up sequence triggered by the presence of +`/tmp/network-reset`: + +1. Read the desired management configuration from `/tmp/network-reset`. +2. Call `Host.reset_networking` with a ref to the localhost. +3.
Call `PIF.scan` with a ref to the localhost to recreate the + (physical) PIFs. +4. Call `PIF.reconfigure_ip` to configure the management interface. +5. Call `Host.management_reconfigure`. +6. Delete `/tmp/network-reset`. + +xsconsole +--------- + +Add an "Emergency Network Reset" option under the "Network and +Management Interface" menu. Selecting this option will show some +explanation in the pane on the right-hand side. Pressing \<Enter\> will +bring up a dialogue to select the interfaces to use as management +interface after the reset. After choosing a device, the dialogue +continues with configuration options like in the "Configure Management +Interface" dialogue. After completing the dialogue, the same steps as +listed for `xe-reset-networking` are executed. + +Notes +----- + +- On a pool slave, the management interface should be the same as on + the master (the same device name, e.g. eth0). +- Resetting the networking configuration on the master should + ideally be followed by resets of the pool slaves as well, in order + to synchronise their configuration (especially bonds/VLANs/tunnels). + Furthermore, in case the IP address of the master has changed, as a + result of a network reset or `Host.management_reconfigure`, pool + slaves may also use the network reset functionality to reconnect to + the master on its new IP. + diff --git a/doc/content/design/emulated-pci-spec.md b/doc/content/design/emulated-pci-spec.md new file mode 100644 index 00000000000..7a976ede1a2 --- /dev/null +++ b/doc/content/design/emulated-pci-spec.md @@ -0,0 +1,61 @@ +--- +title: Specifying Emulated PCI Devices +layout: default +design_doc: true +revision: 1 +status: proposed +--- + +### Background and goals + +At present (early March 2015) the datamodel defines a VM as having a "platform" string-string map, in which two keys are interpreted as specifying a PCI device which should be emulated for the VM.
Those keys are "device_id" and "revision" (with int values represented as decimal strings). + +Limitations: +* Hardcoded defaults are used for the vendor ID and all other parameters except device_id and revision. +* Only one emulated PCI device can be specified. + +When instructing qemu to emulate PCI devices, qemu accepts twelve parameters for each device. + +Future guest-agent features rely on additional emulated PCI devices. We cannot know in advance the full details of all the devices that will be needed, but we can predict some. + +We need a way to configure VMs such that they will be given additional emulated PCI devices. + +### Design + +In the datamodel, there will be a new type of object for emulated PCI devices. + +Tentative name: "emulated_pci_device" + +Fields to be passed through to qemu are the following, all static read-only, and all ints except devicename: +* devicename (string) +* vendorid +* deviceid +* command +* status +* revision +* classcode +* headertype +* subvendorid +* subsystemid +* interruptline +* interruptpin + +We also need a "built_in" flag: see below. + +Allow creation of these objects through the API (and CLI). + +(It would be nice, but by no means essential, to be able to create one by specifying an existing one as a basis, along with one or more altered fields, e.g. "Make a new one just like that existing one except with interruptpin=9.") + +Create some of these devices to be defined as standard in XenServer, along the same lines as the VM templates. Those ones should have built_in=true. + +Allow destruction of these objects through the API (and CLI), but not if they are in use or if they have built_in=true. + +A VM will have a list of zero or more of these emulated-pci-device objects. (OPEN QUESTION: Should we forbid having more than one of a given device?) + +Provide API (and CLI) commands to add and remove one of these devices from a VM (identifying the VM and device by uuid or other identifier such as name).
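
As a rough illustration of the proposed object, the sketch below models the twelve qemu fields plus the `built_in` flag as an immutable record, together with the "copy with altered fields" and "refuse to destroy" rules described above. It is a sketch only: the class and function names are hypothetical, the numeric values are placeholders, and the real object would live in the xapi datamodel rather than in Python.

```python
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass(frozen=True)  # frozen mirrors "all static read-only"
class EmulatedPciDevice:
    devicename: str
    vendorid: int
    deviceid: int
    command: int
    status: int
    revision: int
    classcode: int
    headertype: int
    subvendorid: int
    subsystemid: int
    interruptline: int
    interruptpin: int
    built_in: bool = False


def derive(base: EmulatedPciDevice, **changes) -> EmulatedPciDevice:
    """Create a new device from an existing one with some fields altered.

    The copy is not built-in unless the caller explicitly says so.
    """
    return replace(base, **{"built_in": False, **changes})


def can_destroy(device: EmulatedPciDevice, vms_using_it: Sequence[str]) -> bool:
    """Destruction is refused for built-in devices and for devices in use."""
    return not device.built_in and len(vms_using_it) == 0


if __name__ == "__main__":
    # Placeholder values, chosen only to make the example runnable.
    standard = EmulatedPciDevice("example-device", 0x1234, 0x0001, 0, 0, 1,
                                 0xFF80, 0, 0x1234, 0x0001, 0, 1, built_in=True)
    custom = derive(standard, interruptpin=9)
    print(custom.interruptpin, can_destroy(standard, []), can_destroy(custom, []))
```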
+ +The CLI should allow performing this on multiple VMs in one go, based on a selector or filter for the VMs. We have this concept already in the CLI in commands such as vm-start. + +In the function that adds an emulated PCI device to a VM, we must check if this is the first device to be added, and must refuse if the VM's Virtual Hardware Platform Version is too low. (Or should we just raise the version automatically if needed?) + +When starting a VM, check its list of emulated pci devices and pass the details through to qemu (via xenopsd). diff --git a/doc/content/design/fcoe-nics.md b/doc/content/design/fcoe-nics.md new file mode 100644 index 00000000000..59b2634f609 --- /dev/null +++ b/doc/content/design/fcoe-nics.md @@ -0,0 +1,56 @@ +--- +title: FCoE capable NICs +layout: default +design_doc: true +revision: 3 +status: proposed +design_review: 120 +--- + +It is possible to identify the NICs of a Host which can support FCoE. +This property is listed in the `capabilities` field of the PIF object. + +Introduction +------------ + +* FCoE support on a NIC is a hardware property. With the help of dcbtool, we can identify which NICs support FCoE. +* The new field `capabilities` in the PIF object will be of type `Set(String)`. An FCoE-capable NIC will have the string "fcoe" in its PIF `capabilities` field. +* The `capabilities` field will be read-only: it cannot be modified by the user. + +PIF Object +------- + +New field: + +* Field `PIF.capabilities` will be type `Set(string)`. +* The default value of `PIF.capabilities` will be the empty set. + +Xapi Changes +------- + +* Set the field capabilities "fcoe" depending on output of xcp-networkd call `get_capabilities`. +* Field capabilities "fcoe" can be set during `introduce_internal` when creating a PIF. +* Field capabilities "fcoe" can be updated during `refresh_all` on xapi startup. +* The above field will be set every time xapi restarts.
+ +XCP-Networkd Changes +------- + +New function: + +* String list `string list get_capabilties (string)` +* Argument: device_name for the PIF. +* This function calls method `capable` exposed by `fcoe_driver.py` as part of dom0. +* It returns string list ["fcoe"] or [] depending on `capable` method output. + +Defaults, Installation and Upgrade +------------------------ + +* Any newly introduced PIF will have its capabilities field as empty set until `fcoe_driver` method `capable` states FCoE is supported on the NIC. +* It includes PIFs obtained after a fresh install of Xenserver, as well as PIFs created using `PIF.introduce` then `PIF.scan`. +* During an upgrade Xapi Restart will call `refresh_all` which then populate the capabilities field as empty set. + +Command Line Interface +---------------------- + +* The `PIF.capabilities` field is exposed through `xe pif-list` and `xe pif-param-list` as usual. diff --git a/doc/content/design/gpu-passthrough.md b/doc/content/design/gpu-passthrough.md new file mode 100644 index 00000000000..db6841cee38 --- /dev/null +++ b/doc/content/design/gpu-passthrough.md @@ -0,0 +1,365 @@ +--- +title: GPU pass-through support +layout: default +design_doc: true +revision: 1 +status: released (6.0) +--- + +This document contains the software design for GPU pass-through. This +code was originally included in the version of Xapi used in XenServer 6.0. + +Overview +-------- + +Rather than modelling GPU pass-through from a PCI perspective, and +having the user manipulate PCI devices directly, we are taking a +higher-level view by introducing a dedicated graphics model. The +graphics model is similar to the networking and storage model, in which +virtual and physical devices are linked through an intermediate +abstraction layer (e.g. the "Network" class in the networking model). + +The basic graphics model is as follows: + +- A host owns a number of physical GPU devices (*pGPUs*), each of + which is available for passing through to a VM. 
+- A VM may have a virtual GPU device (*vGPU*), which means it expects + to have access to a GPU when it is running. +- Identical pGPUs are grouped across a resource pool in *GPU groups*. + GPU groups are automatically created and maintained by XS. +- A GPU group connects vGPUs to pGPUs in the same way as VIFs are + connected to PIFs by Network objects: for a VM *v* having a vGPU on + GPU group *p* to run on host *h*, host *h* must have a pGPU in GPU + group *p* and pass it through to VM *v*. +- VM start and non-live migration rules are analogous to the network + API and follow the above rules. +- In case a VM that has a vGPU is started, while no pGPU available, an + exception will occur and the VM won't start. As a result, in order + to guarantee that a VM always has access to a pGPU, the number of + vGPUs should not exceed the number of pGPUs in a GPU group. + +Currently, the following restrictions apply: + +- Hotplug is not supported. +- Suspend/resume and checkpointing (memory snapshots) are not + supported. +- Live migration (XenMotion) is not supported. +- No more than one GPU per VM will be supported. +- Only Windows guests will be supported. + +XenAPI Changes +-------------- + +The design introduces a new generic class called *PCI* to capture state +and information about relevant PCI devices in a host. By default, xapi +would not create PCI objects for all PCI devices, but only for the ones +that are managed and configured by xapi; currently only GPU devices. + +The PCI class has no fields specific to the type of the PCI device (e.g. +a graphics card or NIC). Instead, device specific objects will contain a +link to their underlying PCI device's object. + +The new XenAPI classes and changes to existing classes are detailed +below. 
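
The placement rule from the overview (a VM with a vGPU in GPU group *p* can only run on a host that has a free pGPU in group *p*, and the start fails otherwise) can be sketched as follows. This is an illustrative model only: the class, function and exception choices are hypothetical and are not the XenAPI classes described in this document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PGPU:
    host: str
    gpu_group: str
    attached_vm: Optional[str] = None  # None means the pGPU is free


def hosts_able_to_run(vgpu_group: str, pgpus: List[PGPU]) -> List[str]:
    """Hosts that have at least one free pGPU in the vGPU's GPU group."""
    return sorted({p.host for p in pgpus
                   if p.gpu_group == vgpu_group and p.attached_vm is None})


def plug_vgpu(vm: str, vgpu_group: str, host: str, pgpus: List[PGPU]) -> PGPU:
    """Pass a free pGPU on 'host' through to 'vm', or fail the VM start."""
    for pgpu in pgpus:
        if (pgpu.host == host and pgpu.gpu_group == vgpu_group
                and pgpu.attached_vm is None):
            pgpu.attached_vm = vm
            return pgpu
    raise RuntimeError(f"no free pGPU in group {vgpu_group} on host {host}")


if __name__ == "__main__":
    pool = [PGPU("host-1", "group-a"), PGPU("host-2", "group-b")]
    print(hosts_able_to_run("group-a", pool))
    plug_vgpu("vm-1", "group-a", "host-1", pool)
    print(hosts_able_to_run("group-a", pool))
```

Once the only pGPU in a group is attached, no host can start a second VM with a vGPU in that group, which is why the number of vGPUs in a group should not exceed the number of pGPUs if every VM must always be startable.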
+ +### PCI class + +Fields: + +| Name | Type | Description | +|----------------|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------| +| uuid | string | Unique identifier/object reference. | +| class_id | string | PCI class ID (hidden field) | +| class_name | string | PCI class name (GPU, NIC, ...) | +| vendor_id | string | Vendor ID (hidden field). | +| vendor_name | string | Vendor name. | +| device_id | string | Device ID (hidden field). | +| device_name | string | Device name. | +| host | host ref | The host that owns the PCI device. | +| pci_id | string | BDF (domain/Bus/Device/Function identifier) of the (physical) PCI function, e.g. "0000:00:1a.1". The format is hhhh:hh:hh.h, where h is a hexadecimal digit. | +| functions | int | Number of (physical + virtual) functions; currently fixed at 1 (hidden field). | +| attached_VMs | VM ref set | List of VMs that have this PCI device "currently attached", i.e. plugged, i.e. passed-through to (hidden field). | +| dependencies | PCI ref set | List of dependent PCI devices: all of these need to be passed-thru to the same VM (co-location). | +| other_config | (string -> string) map | Additional optional configuration (as usual). | + +*Hidden fields* are only for use by xapi internally, and not visible to +XenAPI users. + +Messages: none. + +### PGPU class + +A physical GPU device (pGPU). + +Fields: + +| Name | Type | Description | +|----------------|--------------------------|----------------------------------------------------| +| uuid | string | Unique identifier/object reference. | +| PCI | PCI ref | Link to the underlying PCI device. | +| other_config | (string -> string) map | Additional optional configuration (as usual). | +| host | host ref | The host that owns the GPU. | +| GPU_group | GPU_group ref | GPU group the pGPU is contained in. Can be Null. | + +Messages: none. 
+ +### GPU\_group class + +A group of identical GPUs across hosts. A VM that is associated with a +GPU group can use any of the GPUs in the group. A VM does not need to +install new GPU drivers if moving from one GPU to another one in the +same GPU group. + +Fields: + +| Name | Type | Description | +|--------------------|--------------------------|----------------------------------------------------------------------------------| +| VGPUs | VGPU ref set | List of vGPUs in the group. | +| uuid | string | Unique identifier/object reference. | +| PGPUs | PGPU ref set | List of pGPUs in the group. | +| other_config | (string -> string) map | Additional optional configuration (as usual). | +| name_label | string | A human-readable name. | +| name_description | string | A notes field containing human-readable description. | +| GPU_types | string set | List of GPU types (vendor+device ID) that can be in this group (hidden field). | + +Messages: none. + +### VGPU class + +A virtual GPU device (vGPU). + +Fields: + +| Name | Type | Description | +|----------------------|--------------------------|--------------------------------------------------------------------------------------| +| uuid | string | Unique identifier/object reference. | +| VM | VM ref | VM that owns the vGPU. | +| GPU_group | GPU_group ref | GPU group the vGPU is contained in. | +| currently_attached | bool | Reflects whether the virtual device is currently "connected" to a physical device. | +| device | string | Order in which the devices are plugged into the VM. Restricted to "0" for now. | +| other_config | (string -> string) map | Additional optional configuration (as usual). 
+ +Messages: + +| Prototype | Description | | +|---------------------------------------------------|--------------------------------------------------------------------------------------------------------|---| +| VGPU ref create (GPU_group ref, string, VM ref) | Manually assign the vGPU device to the VM given a device number, and link it to the given GPU group. | | +| void destroy (VGPU ref) | Remove the association between the GPU group and the VM. | | + +It is possible to assign more vGPUs to a group than the number of +pGPUs in the group. When a VM is started, a pGPU must be available; if +not, the VM will not start. Therefore, to guarantee that a VM has access +to a pGPU at any time, one must manually enforce that the number of +vGPUs in a GPU group does not exceed the number of pGPUs. XenCenter +might display a warning, or simply refuse to assign a vGPU, if this +constraint is violated. This is analogous to the handling of memory +availability in a pool: a VM may not be able to start if there is no +host having enough free memory. + +### VM class + +Fields: + +- Deprecate (unused) `PCI_bus` field +- Add field `VGPU ref set VGPUs`: List of vGPUs. +- Add field `PCI ref set attached_PCIs`: List of PCI devices that are + "currently attached" (plugged, passed-through) (*hidden field*). + +### host class + +Fields: + +- Add field `PCI ref set PCIs`: List of PCI devices. +- Add field `PGPU ref set PGPUs`: List of physical GPU devices. +- Add field `(string -> string) map chipset_info`, which contains at + least the key `iommu`. The value `"true"` indicates that the + host has IOMMU/VT-d support built in, **and** that this functionality is + enabled by Xen; the value will be `"false"` otherwise. + +Initialisation and Operations +----------------------------- + +### Enabling IOMMU/VT-d + +(This may not be needed in Xen 4.1. Confirm with Simon.)
+ +Provide a command that does this: + +- `/opt/xensource/libexec/xen-cmdline --set-xen iommu=1` +- reboot + +### Xapi startup + +Definitions: + +- PCI devices are matched on the combination of their `pci_id`, + `vendor_id`, and `device_id`. + +First boot and any subsequent xapi start: + +1. Find out from dmesg whether IOMMU support is present and enabled in + Xen, and set `host.chipset_info:iommu` accordingly. +2. Detect GPU devices currently present in the host. For each: + 1. If there is no matching PGPU object yet, create a PGPU object, + and add it to a GPU group containing identical PGPUs, or a new + group. + 2. If there is no matching PCI object yet, create one, and also + create or update the PCI objects for dependent devices. + +3. Destroy all existing PCI objects of devices that are not currently + present in the host (i.e. objects for devices that have been + replaced or removed). +4. Destroy all existing PGPU objects of GPUs that are not currently + present in the host. Send a XenAPI alert to notify the user of this + fact. +5. Update the list of `dependencies` on all PCI objects. +6. Sync `VGPU.currently_attached` on all `VGPU` objects. + +### Upgrade + +For any VMs that have `VM.other_config:pci` set to use a GPU, create an +appropriate vGPU, and remove the `other_config` option. + +### Generic PCI Interface + +A generic PCI interface exposed to higher-level code, such as the +networking and GPU management modules within Xapi. This functionality +relies on Xenops. + +The PCI module exposes the following functions: + +- Check whether a PCI device has free (unassigned) functions. This is + the case if the number of assignments in `PCI.attached_VMs` is + smaller than `PCI.functions`. +- Plug a PCI function into a running VM. + 1. Raise exception if there are no free functions. + 2. Plug PCI device, as well as dependent PCI devices. 
The PCI + module must also tell device-specific modules to update the + `currently_attached` field on dependent `VGPU` objects etc. + 3. Update `PCI.attached_VMs`. +- Unplug a PCI function from a running VM. + 1. Raise exception if the PCI function is not owned by (passed + through to) the VM. + 2. Unplug PCI device, as well as dependent PCI devices. The PCI + module must also tell device-specific modules to update the + `currently_attached` field on dependent `VGPU` objects etc. + 3. Update `PCI.attached_VMs`. + +### Construction and Destruction + +VGPU.create: + +1. Check license. Raise FEATURE\_RESTRICTED if the GPU feature has not + been enabled. +2. Raise INVALID\_DEVICE if the given device number is not "0", or + DEVICE\_ALREADY\_EXISTS if (indeed) the device already exists. This + is a convenient way of enforcing that only one vGPU per VM is + supported, for now. +3. Create `VGPU` object in the DB. +4. Initialise `VGPU.currently_attached = false`. +5. Return a ref to the new object. + +VGPU.destroy: + +1. Raise OPERATION\_NOT\_ALLOWED if `VGPU.currently_attached = true` + and the VM is running. +2. Destroy `VGPU` object. + +### VM Operations + +VM.start(\_on): + +1. If `host.chipset_info:iommu = "false"`, raise VM\_REQUIRES\_IOMMU. +2. Raise FEATURE\_REQUIRES\_HVM (carrying the string "GPU passthrough + needs HVM") if the VM is PV rather than HVM. +3. For each of the VM's vGPUs: + 1. Confirm that the given host has a pGPU in its associated GPU + group. If not, raise VM\_REQUIRES\_GPU. + 2. Consult the generic PCI module for all pGPUs in the group to + find out whether a suitable PCI function is available. If a + physical device is not available, raise VM\_REQUIRES\_GPU. + 3. Ask PCI module to plug an available pGPU into the VM's domain + and set `VGPU.currently_attached` to `true`. As a side-effect, + any dependent PCI devices would be plugged. + +VM.shutdown: + +1. Ask PCI module to unplug all GPU devices. +2. 
Set `VGPU.currently_attached` to `false` for all the VM's VGPUs. + +VM.suspend, VM.resume(\_on): + +- Raise VM\_HAS\_PCI\_ATTACHED if the VM has any plugged `VGPU` + objects, as suspend/resume for VMs with GPUs is currently not + supported. + +VM.pool\_migrate: + +- Raise VM\_HAS\_PCI\_ATTACHED if the VM has any plugged `VGPU` + objects, as live migration for VMs with GPUs is currently not + supported. + +VM.clone, VM.copy, VM.snapshot: + +- Copy `VGPU` objects along with the VM. + +VM.import, VM.export: + +- Include `VGPU` and `GPU_group` objects in the VM export format. + +VM.checkpoint + +- Raise VM\_HAS\_PCI\_ATTACHED if the VM has any plugged `VGPU` + objects, as checkpointing for VMs with GPUs is currently not + supported. + +### Pool Join and Eject + +Pool join: + +1. For each `PGPU`: + 1. Copy it to the pool. + 2. Add it to a `GPU_group` of identical PGPUs, or a new one. + +2. Copy each `VGPU` to the pool together with the VM that owns it, and + add it to the GPU group containing the same `PGPU` as before the + join. + +Step 1 is done automatically by the xapi startup code, and step 2 is +handled by the VM export/import code. Hence, no work needed. + +Pool eject: + +1. `VGPU` objects will be automatically GC'ed when the VMs are removed. +2. Xapi's startup code recreates the `PGPU` and `GPU_group` objects. + +Hence, no work needed. + +Required Low-level Interface +---------------------------- + +Xapi needs a way to obtain a list of all PCI devices present on a host. +For each device, xapi needs to know: + +- The PCI ID (BDF). +- The type of device (NIC, GPU, ...) according to a well-defined and + stable list of device types (as in `/usr/share/hwdata/pci.ids`). +- The device and vendor ID+name (currently, for PIFs, xapi looks up + the name in `/usr/share/hwdata/pci.ids`). +- Which other devices/functions are required to be passed through to + the same VM (co-located), e.g. other functions of a compound PCI + device. 
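The per-device information listed under "Required Low-level Interface" can be pictured as a simple record. The Python sketch below is an illustration only: `PciDeviceInfo` and `link_compound_functions` are hypothetical names, and deriving co-location purely from a shared domain:bus:device prefix is an assumption covering just the "other functions of a compound PCI device" example; the real dependency information would have to come from the low-level interface itself.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PciDeviceInfo:
    """Hypothetical record of what xapi needs to know about one PCI function."""
    pci_id: str        # BDF, e.g. "0000:00:1a.1"
    class_name: str    # device type, e.g. "GPU" or "NIC"
    vendor_id: str
    vendor_name: str
    device_id: str
    device_name: str
    # BDFs that must be passed through to the same VM (co-location).
    dependencies: list = field(default_factory=list)


def link_compound_functions(devices):
    """Make all functions of one compound device depend on each other.

    Assumption: functions sharing the domain:bus:device part of the BDF
    belong to the same compound device.
    """
    by_slot = defaultdict(list)
    for dev in devices:
        slot = dev.pci_id.rsplit(".", 1)[0]  # strip the function number
        by_slot[slot].append(dev)
    for siblings in by_slot.values():
        for dev in siblings:
            dev.dependencies = sorted(s.pci_id for s in siblings if s is not dev)
    return devices
```

A two-function GPU at `0000:01:00.0` and `0000:01:00.1` would end up with each function listing the other as a dependency, while a single-function device keeps an empty list.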
+ +Command-Line Interface (xe) +----------------------------- + +- xe pgpu-list +- xe pgpu-param-list/get/set/add/remove/clear +- xe gpu-group-list +- xe gpu-group-param-list/get/set/add/remove/clear +- xe vgpu-list +- xe vgpu-create +- xe vgpu-destroy +- xe vgpu-param-list/get/set/add/remove/clear +- xe host-param-get param-name=chipset-info param-key=iommu + diff --git a/doc/content/design/gpu-support-evolution.md b/doc/content/design/gpu-support-evolution.md new file mode 100644 index 00000000000..73d1cb00d75 --- /dev/null +++ b/doc/content/design/gpu-support-evolution.md @@ -0,0 +1,209 @@ +--- +title: GPU support evolution +layout: default +design_doc: true +revision: 3 +status: released (7.0) +revision_history: +- revision_number: 1 + description: Documented interface changes between xapi and xenopsd for vGPU +- revision_number: 2 + description: Added design for storing vGPU-to-pGPU allocation in xapi database +- revision_number: 3 + description: Marked new xapi DB fields as internal-only +--- + +Introduction +------------ + +As of XenServer 6.5, VMs can be provisioned with access to graphics processors +(either emulated or passed through) in four different ways. Virtualisation of +Intel graphics processors will exist as a fifth kind of graphics processing +available to VMs. These five situations all require the VM's device model to be +created in subtly different ways: + +__Pure software emulation__ + +- qemu is launched either with no special parameter, if the basic Cirrus + graphics processor is required, otherwise qemu is launched with the + `-std-vga` flag. + +__Generic GPU passthrough__ + +- qemu is launched with the `-priv` flag to turn on privilege separation +- qemu can additionally be passed the `-std-vga` flag to choose the + corresponding emulated graphics card. + +__Intel integrated GPU passthrough (GVT-d)__ + +- As well as the `-priv` flag, qemu must be launched with the `-std-vga` and + `-gfx_passthru` flags. 
The actual PCI passthrough is handled separately + via xen. + +__NVIDIA vGPU__ + +- qemu is launched with the `-vgpu` flag +- a secondary display emulator, demu, is launched with the following parameters: + - `--domain` - the VM's domain ID + - `--vcpus` - the number of vcpus available to the VM + - `--gpu` - the PCI address of the physical GPU on which the emulated GPU will + run + - `--config` - the path to the config file which contains details of the GPU to + emulate + +__Intel vGPU (GVT-g)__ + +- here demu is not used, but instead qemu is launched with five parameters: + - `-xengt` + - `-vgt_low_gm_sz` - the low GM size in MiB + - `-vgt_high_gm_sz` - the high GM size in MiB + - `-vgt_fence_sz` - the number of fence registers + - `-priv` + +xenopsd +------- + +To handle all these possibilities, we will add some new types to xenopsd's +interface: + +``` +module Pci = struct + type address = { + domain: int; + bus: int; + device: int; + fn: int; + } + + ... +end + +module Vgpu = struct + type gvt_g = { + physical_pci_address: Pci.address; + low_gm_sz: int64; + high_gm_sz: int64; + fence_sz: int; + } + + type nvidia = { + physical_pci_address: Pci.address; + config_file: string + } + + type implementation = + | GVT_g of gvt_g + | Nvidia of nvidia + + type id = string * string + + type t = { + id: id; + position: int; + implementation: implementation; + } + + type state = { + plugged: bool; + emulator_pid: int option; + } +end + +module Vm = struct + type igd_passthrough = + | GVT_d + + type video_card = + | Cirrus + | Standard_VGA + | Vgpu + | Igd_passthrough of igd_passthrough + + ... +end + +module Metadata = struct + type t = { + vm: Vm.t; + vbds: Vbd.t list; + vifs: Vif.t list; + pcis: Pci.t list; + vgpus: Vgpu.t list; + domains: string option; + } +end +``` + +The `video_card` type is used to indicate to the function +`Xenops_server_xen.VM.create_device_model_config` how the VM's emulated graphics +card will be implemented.
A value of `Vgpu` indicates that the VM needs to be +started with one or more virtualised GPUs - the function will need to look at +the list of GPUs associated with the VM to work out exactly what parameters to +send to qemu. + +If `Vgpu.state.emulator_pid` of a plugged vGPU is `None`, this indicates that +the emulation of the vGPU is being done by qemu rather than by a separate +emulator. + +n.b. adding the `vgpus` field to `Metadata.t` will break backwards compatibility +with old versions of xenopsd, so some upgrade logic will be required. + +This interface will allow us to support multiple vGPUs per VM in future if +necessary, although this may also require reworking the interface between +xenopsd, qemu and demu. For now, xenopsd will throw an exception if it is asked +to start a VM with more than one vGPU. + +xapi +---- + +To support the above interface, xapi will convert all of a VM's non-passthrough +GPUs into `Vgpu.t` objects when sending VM metadata to xenopsd. + +In contrast to GVT-d, which can only be run on an Intel GPU which has been +hidden from dom0, GVT-g will only be allowed to run on a GPU which has +_not_ been hidden from dom0. + +If a GVT-g-capable GPU is detected, and it is not hidden from dom0, xapi will +create a set of VGPU_type objects to represent the vGPU presets which can run on +the physical GPU. Exactly how these presets are defined is TBD, but a likely +solution is via a set of config files as with NVIDIA vGPU. + +__Allocation of vGPUs to physical GPUs__ + +For NVIDIA vGPU, when starting a VM, each vGPU attached to the VM is assigned +to a physical GPU as a result of capacity planning at the pool level.
The +resulting configuration is stored in the VM.platform dictionary, under +specific keys: + +- `vgpu_pci_id` - the address of the physical GPU on which the vGPU will run +- `vgpu_config` - the path to the vGPU config file which the emulator will use + +Instead of storing the assignment in these fields, we will add a new +internal-only database field: + +- `VGPU.scheduled_to_be_resident_on (API.ref_PGPU)` + +This will be set to the ref of the physical GPU on which the vGPU will run. From +here, xapi can easily obtain the GPU's PCI address. Capacity planning will also +take into account which vGPUs are scheduled to be resident on a physical GPU, +which will avoid races resulting from many vGPU-enabled VMs being started at +once. + +The path to the config file is already stored in the `VGPU_type.internal_config` +dictionary, under the key `vgpu_config`. xapi will use this value directly +rather than copying it to VM.platform. + +To support other vGPU implementations, we will add another internal-only +database field: + +- `VGPU_type.implementation enum(Passthrough|Nvidia|GVT_g)` + +For the `GVT_g` implementation, no config file is needed. Instead, +`VGPU_type.internal_config` will contain three key-value pairs, with the keys + +- `vgt_low_gm_sz` +- `vgt_high_gm_sz` +- `vgt_fence_sz` + +The values of these pairs will be used to construct a value of type +`Xenops_interface.Vgpu.gvt_g`, which will be passed down to xenopsd. diff --git a/doc/content/design/heterogeneous-pools.md b/doc/content/design/heterogeneous-pools.md new file mode 100644 index 00000000000..c3ca083f381 --- /dev/null +++ b/doc/content/design/heterogeneous-pools.md @@ -0,0 +1,289 @@ +--- +title: Heterogeneous pools +layout: default +design_doc: true +revision: 1 +status: released (5.6) +--- + +Notes +===== + +- The `cpuid` instruction is used to obtain a CPU's manufacturer, + family, model, stepping and features information. 
+- The feature bitvector is 128 bits wide: 2 times 32 bits of base + features plus 2 times 32 bits of extended features, which are + referred to as `base_ecx`, `base_edx`, `ext_ecx` and `ext_edx` + (after the registers used by `cpuid` to store the results). +- The feature bits can be masked by Intel FlexMigration and AMD + Extended Migration. This means that features can be made to appear + as absent. Hence, a CPU can appear as a less-capable CPU. + - AMD Extended Migration is able to mask both base and extended + features. + - Intel FlexMigration on Core 2 CPUs (Penryn) is able to mask + **only the base features** (`base_ecx` and `base_edx`). The + newer Nehalem and Westmere CPUs support extended-feature masking + as well. +- A process in dom0 (e.g. xapi) is able to call `cpuid` to obtain the + (possibly modified) CPU info, or can obtain this information from + Xen. Masking is done only by Xen at boot time, before any domains + are loaded. +- To apply a feature mask, a dom0 process may specify the mask in the + Xen command line in the file `/boot/extlinux.conf`. After a reboot, + the mask will be enforced. +- It is not possible to obtain the original features from a dom0 + process, if the features have been masked. Before applying the first + mask, the process could remember/store the original feature vector, + or obtain the information from Xen. +- All CPU cores on a host can be assumed to be identical. Masking will + be done simultaneously on all cores in a host. +- Whether a CPU supports FlexMigration/Extended Migration can (only) + be derived from the family/model/stepping information. +- XS5.5 has an exception for the EST feature in base\_ecx. This flag + is ignored on pool join. + +Overview of XenAPI Changes +========================== + +Fields +------ + +Currently, the datamodel has `Host_cpu` objects for each CPU core in a +host. 
As they are all identical, we are considering keeping just one CPU +record in the `Host` object itself, and deprecating the `Host_cpu` +class. For backwards compatibility, the `Host_cpu` objects will remain +as they are in MNR, but may be removed in subsequent releases. + +Hence, there will be a new field called `Host.cpu_info`, a read-only +string-string map, containing the following fixed set of keys: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Key nameDescription
cpu_countThe number of CPU cores in the host.
familyThe family (number) of the CPU.
featuresThe current (possibly masked) feature vector, as given by cpuid. Format: "<base_ecx>-<base_edx>-<ext_ecx>-<ext_edx>", 4 groups of 8 hexadecimal digits, separated by dashes.
features_after_rebootThe feature vector to be used after rebooting the host. This field can be modified by calling Host.set_cpu_features. Same format as features.
flagsThe flags of the physical CPU (a decoded version of the features field).
maskableIndicating whether the CPU supports Intel FlexMigration or AMD Extended Migration. There are three possible values: "no" means that masking is not possible, "base" means that only base features can be masked, and "full" means that base as well as extended features can be masked.
modelThe model number of the CPU.
modelnameThe model name of the CPU.
physical_featuresThe original, unmasked features. Same format as features.
speedThe speed of the CPU.
steppingThe stepping of the CPU.
vendorThe manufacturer of the CPU.
+ +Indicating whether the CPU supports Intel FlexMigration or AMD Extended +Migration. There are three possible values: `"no"` means that masking is +not possible, `"base"` means that only base features can be masked, and +`"full"` means that base as well as extended features can be masked. + +Note: When the `features` and `features_after_reboot` are different, +XenCenter could display a warning saying that a reboot is needed to +enforce the feature masking. + +The `Pool.other_config:cpuid_feature_mask` key is recognised. If this +key is present and if it contains a value in the same format as +`Host.cpu_info:features`, the value is used to mask the feature vectors +before comparisons during any pool join in the pool it is defined on. +This can be used to white-list certain feature flags, i.e. to ignore +them when adding a new host to a pool. The default it +`ffffff7f-ffffffff-ffffffff-ffffffff`, which white-lists the EST feature +for compatibility with XS 5.5 and earlier. + +Messages +-------- + +New messages: + +- `Host.set_cpu_features` + - Parameters: Host reference `host`, new CPU feature vector + `features`. + - Roles: only Pool Operator and Pool Admin. + - Sets the feature vector to be used after a reboot + (`Host.cpu_info:features_after_reboot`), if `features` is valid. +- `Host.reset_cpu_features` + - Parameter: Host reference `host`. + - Roles: only Pool Operator and Pool Admin. + - Removes the feature mask, such that after a reboot all features + of the CPU are enabled. + +XAPI +==== + +Back-end +-------- + +- Xen keeps the physical (unmasked) CPU features in memory when + starts, before applying any masks. Xen exposes the physical + features, as well as the current (possibly masked) features, to + dom0/xapi via the function `xc_get_boot_cpufeatures` in libxc. +- A dom0 script `/etc/xensource/libexec/xen-cmdline`, which provides a + future-proof way of modifying the Xen command-line key/value pairs. 
+ This script has the following options, where `mask` is one of + `cpuid_mask_ecx`, `cpuid_mask_edx`, `cpuid_mask_ext_ecx` or + `cpuid_mask_ext_edx`, and `value` is `0xhhhhhhhh` (`h` represents + a hex digit): + - `--list-cpuid-masks` + - `--set-cpuid-masks mask=value mask=value` + - `--delete-cpuid-masks mask mask` +- A `restrict_cpu_masking` key has been added to the host licensing + restrictions map. This will be `true` when the `Host.edition` is + `free`, and `false` if it is `enterprise` or `platinum`. + +Start-up +-------- + +The `Host.cpu_info` field is refreshed: + +- The values for the keys `cpu_count`, `vendor`, `speed`, `modelname`, + `flags`, `stepping`, `model`, and `family` are obtained from + `/etc/xensource/boot_time_cpus` (and ultimately from + `/proc/cpuinfo`). +- The values of the `features` and `physical_features` keys are obtained + from Xen and the `features_after_reboot` key is made equal to the + `features` key. +- The value of the `maskable` key is determined by the CPU details. + - for Intel Core2 (Penryn) CPUs: + `family = 6 and (model = 1dh or (model = 17h and stepping >= 4))` + (`maskable = "base"`) + - for Intel Nehalem/Westmere CPUs: + `family = 6 and ((model = 1ah and stepping > 2) or model = 1eh or model = 25h or model = 2ch or model = 2eh or model = 2fh)` + (`maskable = "full"`) + - for AMD CPUs: `family >= 10h` (`maskable = "full"`) + +Setting (Masking) and Resetting the CPU Features +------------------------------------------------ + +- The `Host.set_cpu_features` call: + - checks whether the license of the host is Enterprise or + Platinum; throws FEATURE\_RESTRICTED if not. + - expects a string of 32 hexadecimal digits, optionally containing + spaces; throws INVALID\_FEATURE\_STRING if malformed. + - checks whether the given feature vector can be formed by masking + the physical feature vector; throws INVALID\_FEATURE\_STRING if + not. Note that on Intel Core 2 CPUs, it is only possible to + mask the base features!
+ - checks whether the CPU supports FlexMigration/Extended + Migration; throws CPU\_FEATURE\_MASKING\_NOT\_SUPPORTED if not. + - sets the value of `features_after_reboot` to the given feature + vector. + - adds the new feature mask to the Xen command-line via the + `xen-cmdline` script. The mask is represented by one or more of + the following key/value pairs (where `h` represents a hex + digit): + - `cpuid_mask_ecx=0xhhhhhhhh` + - `cpuid_mask_edx=0xhhhhhhhh` + - `cpuid_mask_ext_ecx=0xhhhhhhhh` + - `cpuid_mask_ext_edx=0xhhhhhhhh` +- The `Host.reset_cpu_features` call: + - copies `physical_features` to `features_after_reboot`. + - removes the feature mask (if any) from the Xen command-line via the + `xen-cmdline` script. + +Pool Join and Eject +------------------- + +- `Pool.join` fails when the `vendor` and `features` keys do not match, + and disregards any other key in `Host.cpu_info`. + - However, as XS5.5 disregards the EST flag, there is a new way to + disregard/ignore feature flags on pool join, by setting a mask + in `Pool.other_config:cpuid_feature_mask`. The value of this + field should have the same format as `Host.cpu_info:features`. + When comparing the CPUID features of the pool and the joining + host for equality, this mask is applied before the comparison. + The default is `ffffff7f-ffffffff-ffffffff-ffffffff`, which + defines the EST feature, bit 7 of the base ecx flags, as "don't + care". +- `Pool.eject` clears the database (as usual), and additionally + removes the feature mask (if any) from `/boot/extlinux.conf`. + +CLI +=== + +New commands: + +- `host-cpu-info` + - Parameters: `uuid` (optional, uses localhost if absent). + - Lists `Host.cpu_info` associated with the host. +- `host-get-cpu-features` + - Parameters: `uuid` (optional, uses localhost if absent). + - Returns the value of `Host.cpu_info:features` associated with + the host.
+- `host-set-cpu-features` + - Parameters: `features` (string of 32 hexadecimal digits, + optionally containing spaces or dashes), `uuid` (optional, uses + localhost if absent). + - Calls `Host.set_cpu_features`. +- `host-reset-cpu-features` + - Parameters: `uuid` (optional, uses localhost if absent). + - Calls `Host.reset_cpu_features`. + +The following commands will be deprecated: `host-cpu-list`, +`host-cpu-param-get`, `host-cpu-param-list`. + +WARNING: + +If the user is able to set any mask they like, they may end up disabling +CPU features that are required by dom0 (and probably other guest OSes), +resulting in a kernel panic when the machine restarts. Hence, using the +set function is potentially dangerous. + +It is apparently not easy to find out exactly which flags are safe to +mask and which aren't, so we cannot prevent an API/CLI user from making +mistakes in this way. However, using XenCenter would always be safe, as +XC always copies features masks from real hosts. + +If a machine ends up in such a bad state, there is a way to get out of +it. At the boot prompt (before Xen starts), you can type "menu.c32", +select a boot option and alter the Xen command-line to remove the +feature masks, after which the machine will again boot normally (note: +in our set-up, there is first a PXE boot prompt; the second prompt is +the one we mean here). + +The API/CLI documentation should stress the potential danger of using +this functionality, and explain how to get out of trouble again. 
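The feature-vector checks described in this design (parsing the 32-hex-digit string, testing that a requested vector can be formed by masking the physical one, and comparing vectors under `Pool.other_config:cpuid_feature_mask` on pool join) come down to a few bitwise operations. The Python sketch below illustrates them under stated assumptions: the function names are hypothetical, and the `base_only` flag stands in for the `maskable = "base"` case; it is not xapi's actual code.

```python
import string

# Default from the design: bit 7 of base_ecx (EST) is "don't care".
DEFAULT_JOIN_MASK = "ffffff7f-ffffffff-ffffffff-ffffffff"


def parse_features(text):
    """Parse 32 hex digits (dashes and spaces allowed) into four 32-bit words:
    (base_ecx, base_edx, ext_ecx, ext_edx)."""
    digits = text.replace("-", "").replace(" ", "")
    if len(digits) != 32 or any(c not in string.hexdigits for c in digits):
        raise ValueError(f"malformed feature string: {text!r}")
    return tuple(int(digits[i:i + 8], 16) for i in range(0, 32, 8))


def can_be_formed_by_masking(requested, physical, base_only=False):
    """True if `requested` only clears bits of `physical` and never sets new ones.

    With base_only=True (the maskable = "base" case), the two extended
    words must be left untouched.
    """
    req, phys = parse_features(requested), parse_features(physical)
    if base_only and req[2:] != phys[2:]:
        return False
    return all(r & ~p == 0 for r, p in zip(req, phys))


def features_equal_for_join(pool, host, mask=DEFAULT_JOIN_MASK):
    """Compare two feature vectors after applying the pool-join mask to both."""
    words = zip(parse_features(pool), parse_features(host), parse_features(mask))
    return all((a & m) == (b & m) for a, b, m in words)
```

With the default mask, two hosts whose vectors differ only in bit 7 of `base_ecx` compare as equal, which matches the EST exception described for XS 5.5.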
+ diff --git a/doc/content/design/integrated-gpu-passthrough/index.md b/doc/content/design/integrated-gpu-passthrough/index.md new file mode 100644 index 00000000000..4b9a827ef5d --- /dev/null +++ b/doc/content/design/integrated-gpu-passthrough/index.md @@ -0,0 +1,95 @@ +--- +title: Integrated GPU passthrough support +layout: default +design_doc: true +revision: 3 +status: released (6.5 SP1) +design_review: 33 +--- + +Introduction +------------ + +Passthrough of discrete GPUs has been +[available since XenServer 6.0]({{site.baseurl}}/xapi/design/gpu-passthrough.html). +With some extensions, we will also be able to support passthrough of integrated +GPUs. + +- Whether an integrated GPU will be accessible to dom0 or available to + passthrough to guests must be configurable via XenAPI. +- Passthrough of an integrated GPU requires an extra flag to be sent to qemu. + +Host Configuration +------------------ + +New fields will be added (both read-only): + +- `PGPU.dom0_access enum(enabled|disable_on_reboot|disabled|enable_on_reboot)` +- `host.display enum(enabled|disable_on_reboot|disabled|enable_on_reboot)` + +as well as new API calls used to modify the state of these fields: + +- `PGPU.enable_dom0_access` +- `PGPU.disable_dom0_access` +- `host.enable_display` +- `host.disable_display` + +Each of these API calls will return the new state of the field e.g. calling +`host.disable_display` on a host with `display = enabled` will return +`disable_on_reboot`. + +Disabling dom0 access will modify the xen commandline (using the xen-cmdline +tool) such that dom0 will not be able to access the GPU on next boot. + +Calling host.disable_display will modify the xen and dom0 commandlines such +that neither will attempt to send console output to the system display device. 
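The four-valued fields `PGPU.dom0_access` and `host.display` behave like a small state machine. The Python sketch below is one reading of it, not the authoritative definition: only the transition from `enabled` to `disable_on_reboot` is stated explicitly in the text, and the remaining transitions (including undoing a pending change before the reboot) are assumptions inferred from the field names. The function names are hypothetical.

```python
from enum import Enum


class State(Enum):
    ENABLED = "enabled"
    DISABLE_ON_REBOOT = "disable_on_reboot"
    DISABLED = "disabled"
    ENABLE_ON_REBOOT = "enable_on_reboot"


# Assumed transition tables; states not listed are left unchanged.
_ON_DISABLE = {
    State.ENABLED: State.DISABLE_ON_REBOOT,
    State.ENABLE_ON_REBOOT: State.DISABLED,   # cancels a pending enable
}
_ON_ENABLE = {
    State.DISABLED: State.ENABLE_ON_REBOOT,
    State.DISABLE_ON_REBOOT: State.ENABLED,   # cancels a pending disable
}
_ON_REBOOT = {
    State.DISABLE_ON_REBOOT: State.DISABLED,
    State.ENABLE_ON_REBOOT: State.ENABLED,
}


def disable(state):
    """New field value after a disable call (e.g. host.disable_display)."""
    return _ON_DISABLE.get(state, state)


def enable(state):
    """New field value after an enable call (e.g. host.enable_display)."""
    return _ON_ENABLE.get(state, state)


def reboot(state):
    """Field value once the host has rebooted and pending changes applied."""
    return _ON_REBOOT.get(state, state)
```

Under these assumptions, `disable(State.ENABLED)` returns `State.DISABLE_ON_REBOOT`, matching the example given for `host.disable_display`.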
+ +A state diagram for the fields `PGPU.dom0_access` and `host.display` is shown +below: + +![host.integrated_GPU_passthrough flow diagram](integrated-gpu-passthrough.png) + +While it is possible for these two fields to be modified independently, a +client must disable both the host display and dom0 access to the system display +device before that device can be passed through to a guest. + +Note that when a client enables or disables either of these fields, the change +can be cancelled until the host is rebooted. + +Handling vga_arbiter +-------------------- + +Currently, xapi will not create a PGPU object for the PCI device with address +reported by `/dev/vga_arbiter`. This is to prevent a GPU in use by dom0 from +being passed through to a guest. This behaviour will be changed - instead +of not creating a PGPU object at all, xapi will create a PGPU, but its +supported_VGPU_types field will be empty. + +However, the PGPU's supported_VGPU_types will be populated as normal if: + +1. dom0 access to the GPU is disabled. +2. The host's display is disabled. +3. The vendor ID of the device is contained in a whitelist provided by xapi's + config file. + +A read-only field will be added: + +- `PGPU.is_system_display_device bool` + +This will be true for a PGPU iff `/dev/vga_arbiter` reports the PGPU as the +system display device for the host on which the PGPU is installed. + +Interfacing with xenopsd +------------------------ + +When starting a VM attached to an integrated GPU, the VM config sent to xenopsd +will contain a video_card of type IGD_passthrough. This will override the type +determined from VM.platform:vga. xapi will consider a GPU to be integrated if +both: + +1. It resides on bus 0. +2. The vendor ID of the device is contained in a whitelist provided by xapi's + config file. + +When xenopsd starts qemu for a VM with a video_card of type IGD_passthrough, +it will pass the flags "-std-vga" AND "-gfx_passthru".
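The "integrated GPU" test and the resulting qemu flags can be summarised in a few lines. This Python sketch only illustrates the two conditions and the flag selection described above; the function names are hypothetical, and the whitelist contents (here the single vendor ID `8086`) are an assumed example rather than the contents of xapi's config file.

```python
def is_integrated_gpu(pci_id, vendor_id, vendor_whitelist):
    """True if the device is on bus 0 and its vendor ID is whitelisted.

    `pci_id` is a BDF string such as "0000:00:02.0".
    """
    bus = int(pci_id.split(":")[1], 16)
    return bus == 0 and vendor_id.lower() in vendor_whitelist


def qemu_video_flags(video_card):
    """Video-related qemu flags for a few video_card values named in the design."""
    flags = {
        "Cirrus": [],                                     # no special parameter
        "Standard_VGA": ["-std-vga"],
        "IGD_passthrough": ["-std-vga", "-gfx_passthru"],
    }
    if video_card not in flags:
        raise ValueError(f"video_card not covered by this sketch: {video_card!r}")
    return flags[video_card]
```

For example, with an assumed whitelist of `{"8086"}`, a device at `0000:00:02.0` with vendor ID `8086` counts as integrated, while the same vendor's device at `0000:01:00.0` does not.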
diff --git a/doc/content/design/integrated-gpu-passthrough/integrated-gpu-passthrough.png b/doc/content/design/integrated-gpu-passthrough/integrated-gpu-passthrough.png new file mode 100644 index 00000000000..0d3b28a3de6 Binary files /dev/null and b/doc/content/design/integrated-gpu-passthrough/integrated-gpu-passthrough.png differ diff --git a/doc/content/design/local-database.md b/doc/content/design/local-database.md new file mode 100644 index 00000000000..2393df63760 --- /dev/null +++ b/doc/content/design/local-database.md @@ -0,0 +1,68 @@ +--- +title: Local database +layout: default +design_doc: true +revision: 1 +status: proposed +--- + +All hosts in a pool use the shared database by sending queries to +the pool master. This creates a performance bottleneck as the pool +size increases. All hosts in a pool receive a database backup from +the master periodically, every couple of hours. This creates a +reliability problem as updates may be lost if the master fails during +the window before the backup. + +The reliability problem can be avoided by running with HA or the redo +log enabled, but this is not always possible. + +We propose to: + +- adapt the existing event machinery to allow every host to maintain + an up-to-date database replica; +- actively cache the database locally on each host and satisfy read + operations from the cache. Most database operations are reads so + this should reduce the number of RPCs across the network. + +In a later phase we can move to a completely +[distributed database](../distributed-database). + +Replicating the database +------------------------ + +We will create a database-level variant of the existing XenAPI `event.from` +API. The new RPC will block until a database event is generated, and then +the events will be returned using the existing "redo-log" event types. We +will add a few second delay into the RPC to batch the updates. 
+ +We will replace the pool database download logic with an `event.from`-like +loop which fetches all the events from the master's database and applies +them to the local copy. The first call will naturally return the full database +contents. + +We will turn on the existing "in memory db cache" mechanism on all hosts, +not just the master. This will be where the database updates will go. + +The result should be that every host will have a `/var/xapi/state.db` file, +with writes going to the master first and then filtering down to all slaves. + +Using the replica as a cache +---------------------------- + +We will re-use the [Disaster Recovery](../../toolstack/features/DR) multiple +database mechanism to allow slaves to access their local database. We will +change the default database "context" to snapshot the local database, +perform reads locally and write-through to the master. + +We will add an HTTP header to all forwarded XenAPI calls from the master which +will include the current database generation count. When a forwarded XenAPI +operation is received, the slave will deliberately wait until the local cache +is at least as new as this, so that we always use fresh metadata for XenAPI +calls (e.g. the VM.start uses the absolute latest VM memory size). + +We will document the new database coherence policy, i.e. that writes on a host +will not immediately be seen by reads on another host. We believe that this +is only a problem when we are using the database for locking and are attempting +to hand over a lock to another host. We are already using XenAPI calls forwarded +to the master for some of this, but may need to do a bit more of this; in +particular the storage backends may need some updating.
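The "wait until the local cache is at least as new as the generation in the header" rule can be sketched with a condition variable. The Python example below is a minimal illustration, not xapi's implementation: `GenerationGate` and its methods are hypothetical names, and the replication loop that would call `advance` after applying each batch of events is not shown.

```python
import threading


class GenerationGate:
    """Tracks the generation of the local replica and lets callers wait for it."""

    def __init__(self):
        self._generation = 0
        self._cond = threading.Condition()

    def advance(self, generation):
        """Called by the replication loop after applying events up to `generation`."""
        with self._cond:
            if generation > self._generation:
                self._generation = generation
                self._cond.notify_all()

    def wait_for(self, generation, timeout=None):
        """Block until the replica has reached `generation`.

        Returns True once the replica is fresh enough, or False on timeout.
        """
        with self._cond:
            return self._cond.wait_for(
                lambda: self._generation >= generation, timeout
            )
```

A slave handling a forwarded call would read the generation count from the HTTP header and call `wait_for` with it before touching the local database. What to do on timeout (fail the call, or fall back to querying the master) is a design decision this sketch leaves open.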
diff --git a/doc/content/design/management-interface-on-vlan.md b/doc/content/design/management-interface-on-vlan.md new file mode 100644 index 00000000000..b9af46c91fd --- /dev/null +++ b/doc/content/design/management-interface-on-vlan.md @@ -0,0 +1,224 @@ +--- +title: Management Interface on VLAN +layout: default +design_doc: true +revision: 3 +status: proposed +revision_history: +- revision_number: 1 + description: Initial version +- revision_number: 2 + description: Addition of `networkd_db` update for Upgrade +- revision_number: 3 + description: More info on `networkd_db` and API Errors +--- + +This document describes design details for the +REQ-42: Support Use of VLAN on XAPI Management Interface. + +XAPI and XCP-Networkd +=============== + +### Creating a VLAN + +VLAN creation is already supported. The steps are listed here because they are used later in the document. +Steps: + +1. Check the PIFs created on a Host for physical devices `eth0`, `eth1`. + `xe pif-list params=uuid physical=true host-uuid=UUID` this will list `pif-UUID` +2. Create a new network for the VLAN interface. + `xe network-create name-label=VLAN1` + It returns a new `network-UUID` +3. Create a VLAN PIF. + `xe vlan-create pif-uuid=pif-UUID network-uuid=network-UUID vlan=VLAN-ID` + It returns a new VLAN PIF `new-pif-UUID` +4. Plug the VLAN PIF. + `xe pif-plug uuid=new-pif-UUID` +5. Configure IP on the VLAN PIF. + `xe pif-reconfigure-ip uuid=new-pif-UUID mode= IP= netmask= gateway= DNS= ` + This will configure IP on the PIF. Here `mode` is mandatory; the other parameters are needed only when `mode=static`. + +Similarly, a VLAN PIF can be created with the corresponding XenAPI calls. + +Recognise VLAN config from management.conf +---------------------------------------------- + +For a newly installed host, if the host installer was asked to put the management interface on a given VLAN, +we will expect a new entry `VLAN=ID` in `/etc/firstboot.d/data/management.conf`. 
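Reading the new `VLAN` entry from `management.conf` could be sketched as below. This is an illustration only: the function names are hypothetical, and it assumes the file consists of simple `KEY=value` lines whose values may or may not be wrapped in quotes.

```python
def parse_management_conf(text):
    """Parse KEY=value lines into a dict, stripping optional quotes."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip().strip("'\"")
    return conf


def management_vlan(conf):
    """Return the VLAN tag as an int, or None if no VLAN entry is present."""
    vlan = conf.get("VLAN")
    if vlan is None or vlan == "":
        return None
    tag = int(vlan)
    # 802.1Q VLAN IDs 1-4094 are usable; 0 and 4095 are reserved.
    if not 1 <= tag <= 4094:
        raise ValueError(f"VLAN tag out of range: {tag}")
    return tag
```

A host installed without a VLAN yields `None`, so existing behaviour is unchanged when the entry is absent.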
+ +The current contents of management.conf, which will be used later in the document, are: +`LABEL`=`eth0` -> Represents the physical device on which the Management Interface must reside. +`MODE`=`dhcp`||`static` -> Represents the IP configuration mode for the Management Interface. There can be other parameters like IP, NETMASK, GATEWAY and DNS when we have `static` mode. +`VLAN`=`ID` -> New entry specifying the VLAN tag to be configured on device `LABEL`. + The management interface will be configured on this VLAN ID with the specified mode. + +### Firstboot script needs to recognise VLAN config + +Firstboot script `/etc/firstboot.d/30-prepare-networking` needs to be updated to configure the +management interface on the provided VLAN ID. + +Steps to be followed: + +1. `PIF.scan` performed in the script must have created the PIFs for the underlying physical devices. +2. Get the PIF UUID for physical device `LABEL`. +3. Repeat the steps mentioned in `Creating a VLAN`, i.e. network-create, vlan-create and pif-plug. Now we have a new PIF for the VLAN. +4. Perform `pif-reconfigure-ip` for the new VLAN PIF. +5. Perform `host-management-reconfigure` using the new VLAN PIF. + +### XCP-Networkd needs to recognise VLAN config during startup + +During first boot and on boot after pool eject, XCP-Networkd gets the initial network setup from the `management.conf` and `xensource-inventory` files to update the network.db with the management interface info. +XCP-Networkd must honour the new VLAN config. + +Steps to be followed: + +1. During startup the `read_config` step tries to read the `/var/lib/xcp/networkd.db` file, which has not yet been created just after host installation. +2. Since the `networkd.db` read throws `Read_Error`, it tries to read `network.dbcache`, which is also not available, hence it falls back to `read_management_conf`. +3. There can be two possible MODE values, `static` or `dhcp`, taken from management.conf. +4. 
`bridge_name` is taken as `MANAGEMENT_INTERFACE` from xensource-inventory; `bridge_config` and `interface_config` are then built based on MODE. +5. `Bridge.make_config()` and `Interface.make_config()` are called with the respective `bridge_config` and `interface_config`. + +Updating networkd_db program +---------------------------- + +`networkd_db` provides the management interface info to the host installer during upgrade. +It reads the `/var/lib/xcp/networkd.db` file to output the Management Interface information. Here we need to update networkd_db to output the VLAN information when a VLAN bridge is given as input. + +Steps to be followed: + +1. Currently VLAN interface IP information is provided correctly on passing VLAN bridge as input. + `networkd_db -iface xapi0` this will list `mode` as dhcp or static, if mode=static then it will provide `ipaddr` and `netmask` too. +2. We need to update this program to provide VLAN ID and parent bridge info on passing VLAN bridge as input. + `networkd_db -bridge xapi0` It should output the VLAN info like: + `interfaces=` + `vlan=vlanID` + `parent=xenbr0` using the parent bridge user can identify the physical interfaces. + Here we will extract VLAN and parent bridge from `bridge_config` under `networkd.db`. + +Additional VLAN parameter for Emergency Network Reset +----------------------------------------------------- + +The detailed design is described at http://xapi-project.github.io/xapi/design/emergency-network-reset.html +To use the `xe-reset-networking` utility to configure the management interface on a VLAN, we need to add one more parameter, `--vlan=vlanID`, to the utility. +The parameters accepted by this utility are: --master, --device, --mode, --ip, --netmask, --gateway, --dns and the new --vlan. + +### VLAN parameter addition to xe-reset-networking + +Steps to be followed: + +1. Check if `VLANID` is passed then let bridge=`xapi0`. +2. 
Write the `bridge=xapi0` into the xensource-inventory file. This should work as Xapi checks available bridges while creating networks. +3. Write the `VLAN=vlanID` into `management.conf` and `/tmp/network-reset`. +4. Modify `check_network_reset` under xapi.ml to perform steps `Creating a VLAN` and perform `management_reconfigure` on vlan pif. + Step `Creating a VLAN` must have created the VLAN record in Xapi DB similar to firstboot script. +5. If no VLANID is specified then retain the current one. This utility must take the management interface info from the `networkd_db` program and handle the VLAN config. + +### VLAN parameter addition to xsconsole Emergency Network Reset + +Under `Emergency Network Reset` option under the `Network and Management Interface` menu. +Selecting this option will show some explanation in the pane on the right-hand side. +Pressing will bring up a dialogue to select the interfaces to use as management interface after the reset. +After choosing a device, the dialogue continues with configuration options like in the `Configure Management Interface` dialogue. +There will be an additional option for VLAN in the dialogue. +After completing the dialogue, the same steps as listed for xe-reset-networking are executed. + +Updating Pool Join/Eject operations +----------------------------------- + +### Pool Join while Pool having Management Interface on a VLAN + +Currently `pool-join` fails if VLANs are present on the host joining a pool. +We need to allow pool-join only if the pool and the joining host both have their management interface on the same VLAN. + +Steps to be followed: + +1. Under `pre_join_checks` update function `assert_only_physical_pifs` to check Pool master management_interface is on same VLAN. +2. Call `Host.get_management_interface` on Pool master and get the vlanID, match it with `localhost` management_interface VLAN ID. + If it matches then allow pool-join. +3. 
If there are multiple VLANs on the host joining a pool, fail the pool-join gracefully. +4. After the pool-join, the host's xapi db will be synced from the pool master's xapi db. It is fine for this to have the management interface on a VLAN. + +### Pool Eject while host ejected having Management Interface on a VLAN + +Currently the management interface VLAN config on a host is not retained in the `xensource-inventory` or `management.conf` file. +We need to retain the vlanID in the config files. + +Steps to be followed: + +1. Under call `Pool.eject` we need to update `write_first_boot_management_interface_configuration_file` function. +2. Check if management_interface is on VLAN then get the VLANID from the pif. +3. Update the VLANID into the `management.conf` file and the `bridge` into the `xensource-inventory` file, + so that they are retained by XCP-Networkd on startup after the host is ejected. + +New API for Pool Management Reconfigure +--------------------------------------- + +Currently there is no Pool Level API to reconfigure management_interface for all of the Hosts in a Pool at once. +API `Pool.management_reconfigure` will be needed in order to reconfigure `management_interface` on all hosts in a Pool to the same Network either VLAN or Physical. + + +### Current behaviour to change the Management Interface on Host + +Currently call `Host.management_reconfigure` with VLAN pif-uuid can change the management_interface to specified VLAN. +The steps are listed here to explain the workflow of `management_interface` reconfigure. We will be using the `Host.management_reconfigure` call inside the new API. + +Steps performed during management_reconfigure: + +1. `bring_pif_up` gets called for the pif. +2. `xensource-inventory` gets updated with the latest info of the interface. +3. `update-mh-info` updates the management_mac into xenstore. +4. The HTTP server is restarted. Even though xapi listens on all IP addresses, the new interface, as `_the_ management` interface, is used by slaves to connect to the pool master. 
+5. `on_dom0_networking_change` refreshes console URIs for the new IP address. +6. Xapi db is updated with new management interface info. + +### Management Reconfigure on Pool from Physical Network to VLAN Network or from VLAN Network to Other VLAN Network or from VLAN Network to Physical Network + +The steps below must be performed manually on each Host or Pool as a prerequisite to using the new API. +We need to make sure that the new network which is going to be the management interface has PIFs configured on each Host. +In the case of a physical network we will assume PIFs are configured on each host; in the case of a VLAN network we need to create VLAN PIFs on each Host. +We assume that the VLAN is available on the switch/network. + +Manual steps to be performed before calling new API: + +1. Create a VLAN network on the pool via `network.create`. In the case of physical NICs the network must already be present. +2. Create a VLAN PIF on each host via `VLAN.create` using the above network ref, the physical PIF ref and the vlanID. This is not needed in the case of a physical network. + Alternatively, calling `pool.create_VLAN` with `device` and the above `network` creates VLAN PIFs for all hosts in the pool. +3. Perform `PIF.reconfigure_ip` for each new Network PIF on each Host. + +If the user wishes to change the management interface manually on each Host in a Pool, we should allow it. The guideline for that is: + +The user can individually change the management interface on each host by calling `Host.management_reconfigure` using PIFs on physical devices or VLAN PIFs. +This must be performed on the slaves first and lastly on the master, because changing management_interface on the master will disconnect the slaves from the master, and further `Host.management_reconfigure` calls cannot be performed until the master recovers the slaves via `pool.recover_slaves`. + +### API Details + +- `Pool.management_reconfigure` + - Parameter: network reference `network`. + - Calling this function configures `management_interface` on each host of a pool. 
+ - For the `network` provided it will check that PIFs are present on each Host. + In the case of a VLAN network it will check that VLAN PIFs on the provided network are present on each Host of the Pool. + - Check IP is configured on above pifs on each Host. + - If PIFs are not present or IP is not configured on the PIFs, this call must fail gracefully, asking the user to configure them. + - Call `Host.management_reconfigure` on each slave then lastly on master. + - Call `pool.recover_slaves` on master in order to recover slaves which might have lost the connection to master. + +### API errors + +Possible API errors that may be raised by `pool.management_reconfigure`: + +- `INTERFACE_HAS_NO_IP` : the specified PIF (`pif` parameter) has no IP configuration. The new API checks that all PIFs on the new Network have an IP configured. There might be a case when the user has forgotten to configure IP on the PIF on one or more of the Hosts in a Pool. + +New API ERROR: + +- `REQUIRED_PIF_NOT_PRESENT` : the specified Network (`network` parameter) has no PIF present on the host in pool. There might be a case when the user has forgotten to create a VLAN PIF on one or more of the Hosts in a Pool. + +CP-Tickets +---------- + +1. CP-14027 +2. CP-14028 +3. CP-14029 +4. CP-14030 +5. CP-14031 +6. CP-14032 +7. CP-14033 diff --git a/doc/content/design/multiple-cluster-managers.md b/doc/content/design/multiple-cluster-managers.md new file mode 100644 index 00000000000..6c0e783fe66 --- /dev/null +++ b/doc/content/design/multiple-cluster-managers.md @@ -0,0 +1,73 @@ +--- +title: Multiple Cluster Managers +layout: default +design_doc: true +revision: 2 +status: confirmed +revision_history: +- revision_number: 1 + description: Initial revision +- revision_number: 2 + description: Short-term simplifications and scope reduction +--- + +Introduction +------------ + +Xapi currently uses a cluster manager called [xhad](../../features/HA/HA.html). 
Sometimes other software comes with its own built-in way of managing clusters, which would clash with xhad (example: xhad could choose to fence node 'a' while the other system could fence node 'b' resulting in a total failure). To integrate xapi with this other software we have 2 choices: + +1. modify the other software to take membership information from xapi; or +2. modify xapi to take membership information from this other software. + +This document proposes a way to do the latter. + +XenAPI changes +-------------- + +### New field + +We will add the following new field: + +- `pool.ha_cluster_stack` of type `string` (read-only) + - If HA is enabled, this field reflects which cluster stack is in use. + - Set to `"xhad"` on upgrade, which implies that so far we have used XenServer's own cluster stack, called `xhad`. + +### Cluster-stack choice + +We assume for now that a particular cluster manager will be mandated (only) by certain types of clustered storage, recognisable by SR type (e.g. OCFS2 or Melio). The SR backend will be able to inform xapi if the SR needs a particular cluster stack, and if so, what is the name of the stack. + +When `pool.enable_ha` is called, xapi will determine which cluster stack to use based on the presence or absence of such SRs: + +- If an SR that needs its own cluster stack is attached to the pool, then xapi will use that cluster stack. +- If no SR that needs a particular cluster stack is attached to the pool, then xapi will use `xhad`. + +If multiple SRs that need a particular cluster stack exist, then the storage parts of xapi must ensure that no two such SRs are ever attached to a pool at the same time. + +### New errors + +We will add the following API error that may be raised by `pool.enable_ha`: + +- `INCOMPATIBLE_STATEFILE_SR`: the specified SRs (`heartbeat_srs` parameter) are not of the right type to hold the HA statefile for the `cluster_stack` that will be used. 
For example, there is a Melio SR attached to the pool, and therefore the required cluster stack is the Melio one, but the given heartbeat SR is not a Melio SR. The single parameter will be the name of the required SR type. + +The following new API error may be raised by `PBD.plug`: + +- `INCOMPATIBLE_CLUSTER_STACK_ACTIVE`: the operation cannot be performed because an incompatible cluster stack is active. The single parameter will be the name of the required cluster stack. This could happen (for example) if you tried to create an OCFS2 SR with XenServer HA already enabled. + +### Future extensions + +In future, we may add a parameter to explicitly choose the cluster stack: + +- New parameter to `pool.enable_ha` called `cluster_stack` of type `string` which will have the default value of empty string (meaning: let the implementation choose). +- With the additional parameter, `pool.enable_ha` may raise two new errors: + - `UNKNOWN_CLUSTER_STACK`: + The operation cannot be performed because the requested cluster stack does not exist. The user should check the name was entered correctly and, failing that, check to see if the software is installed. The exception will have a single parameter: the name of the cluster stack which was not found. + - `CLUSTER_STACK_CONSTRAINT`: HA cannot be enabled with the provided cluster stack because some third-party software is already active which requires a different cluster stack setting. The two parameters are: a reference to an object (such as an SR) which has created the restriction, and the name of the cluster stack that this object requires. + +Implementation +-------------- + +The `xapi.conf` file will have a new field: `cluster-stack-root` which will have the default value `/usr/libexec/xapi/cluster-stack`. The existing `xhad` scripts and tools will be moved to `/usr/libexec/xapi/cluster-stack/xhad/`. A hypothetical cluster stack called `foo` would be placed in `/usr/libexec/xapi/cluster-stack/foo/`. 
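The stack selection rule and the directory layout can be sketched together as follows. This is a minimal illustration under stated assumptions: the function names, the exception class and the `required_by_srs` input are hypothetical, and only the default root path and the `xhad` fallback come from this document.

```python
import os

# Default value of cluster-stack-root, as given in the text.
DEFAULT_ROOT = "/usr/libexec/xapi/cluster-stack"


class UnknownClusterStack(Exception):
    """Stands in for the proposed UNKNOWN_CLUSTER_STACK API error."""


def choose_cluster_stack(required_by_srs):
    # required_by_srs: names of the cluster stacks demanded by the SRs
    # attached to the pool (hypothetical input; empty if none need one).
    stacks = set(required_by_srs)
    if not stacks:
        return "xhad"
    if len(stacks) > 1:
        # The storage layer is expected to prevent this situation.
        raise ValueError("conflicting cluster stacks: %s" % sorted(stacks))
    return stacks.pop()


def cluster_stack_dir(name, root=DEFAULT_ROOT):
    # Each stack lives in its own subdirectory of the root.
    path = os.path.join(root, name)
    if not os.path.isdir(path):
        raise UnknownClusterStack(name)
    return path
```

Keeping the lookup purely directory-based means a new stack can be installed by dropping its scripts under the root, without changing xapi itself.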
+ +In `Pool.enable_ha` with `cluster_stack="foo"` we will verify that the subdirectory `/foo` exists. If it does not exist, then the call will fail with `UNKNOWN_CLUSTER_STACK`. + +Alternative cluster stacks will need to conform to the exact same interface as [xhad](../../features/HA/HA.html). diff --git a/doc/content/design/multiple-device-emulators.md b/doc/content/design/multiple-device-emulators.md new file mode 100644 index 00000000000..ae0225fb452 --- /dev/null +++ b/doc/content/design/multiple-device-emulators.md @@ -0,0 +1,72 @@ +--- +title: Multiple device emulators +layout: default +design_doc: true +revision: 1 +status: proposed +--- + +Xen's `ioreq-server` feature allows for several device emulator +processes to be attached to the same domain, each emulating different +sets of virtual hardware. This makes it possible, for example, to +emulate network devices in a separate process for improved security +and isolation, or to provide special purpose emulators for particular +virtual hardware devices. + +`ioreq-server` is currently used in XenServer to support vGPU, where it +is configured via the legacy toolstack interface. These changes will make +multiple emulators usable in open source Xen via the new libxl interface. + +libxl changes +------------- + +- The singleton device_model_version, device_model_stubdomain and + device_model fields in the b_info structure will be replaced by a list of + (version, stubdomain, model, arguments) tuples, one for each emulator. + +- libxl_domain_create_new() will be changed to spawn a new device model + for each entry in the list. + +It may also be useful to spawn the device models separately and only +attach them during domain creation. This could be supported by +making each device_model entry a union of `pid | parameter_tuple`. +If such an entry specifies a parameter tuple, it is processed as above; +if it specifies a pid, libxl_domain_create_new() attaches the existing +device model with that pid instead. 
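The `pid | parameter_tuple` union can be illustrated with a small model of the decision domain creation would make for each entry. This is a sketch only, written in Python rather than libxl's C: all type and function names are hypothetical, and it returns a plan of actions instead of spawning anything.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DeviceModelParams:
    """The (version, stubdomain, model, arguments) tuple from the text."""
    version: str
    stubdomain: bool
    model: str
    arguments: List[str] = field(default_factory=list)


@dataclass
class ExistingDeviceModel:
    """An already-running device model, identified by its pid."""
    pid: int


DeviceModelEntry = Union[DeviceModelParams, ExistingDeviceModel]


def plan_device_models(entries):
    """Return the action domain creation would take for each entry."""
    actions = []
    for entry in entries:
        if isinstance(entry, ExistingDeviceModel):
            actions.append(("attach", entry.pid))
        elif isinstance(entry, DeviceModelParams):
            actions.append(("spawn", entry.model, tuple(entry.arguments)))
        else:
            raise TypeError(f"unsupported device model entry: {entry!r}")
    return actions
```

For example, a list holding one parameter tuple and one pid yields one `spawn` action followed by one `attach` action, in list order.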
+ +QEMU changes +------------ + +- Patches to make QEMU register with Xen as an ioreq-server have been + submitted upstream, but not yet applied. + +- QEMU's `--machine none` and `--nodefaults` options should make it + possible to create an empty machine and add just a host bus, PCI bus + and device. This has not yet been fully demonstrated, so QEMU changes + may be required. + +Xen changes +----------- + +- Until now, `ioreq-server` has only been used to connect one extra + device model, in addition to the default one. Multiple emulators + should work, but there is a chance that bugs will be discovered. + +Interfacing with xenopsd +------------------------ + +This functionality will only be available through the experimental +Xenlight-based xenopsd. + + - the `VM_build` clause in the `atomics_of_operation` function will be + changed to fill in the list of emulators to be created (or attached) + in the b_info struct + +Host Configuration +------------------ + +vGPU support is implemented mostly in xenopsd, so no Xapi changes are +required to support vGPU through the generic device model mechanism. +Changes would be required if we decided to expose the additional device +models through the API, but in the near future it is more likely that +any additional device models will be dealt with entirely by xenopsd. diff --git a/doc/content/design/ocfs2/index.md b/doc/content/design/ocfs2/index.md new file mode 100644 index 00000000000..c8d0852e0a9 --- /dev/null +++ b/doc/content/design/ocfs2/index.md @@ -0,0 +1,491 @@ +--- +title: OCFS2 storage +layout: default +design_doc: true +revision: 1 +status: proposed +--- + + +OCFS2 is a (host-)clustered filesystem which runs on top of a shared raw block +device. Hosts using OCFS2 form a cluster using a combination of network and +storage heartbeats and host fencing to avoid split-brain. 
+ +The following diagram shows the proposed architecture with `xapi`: + +![Proposed architecture](ocfs2.png) + +Please note the following: + +- OCFS2 is configured to use global heartbeats rather than per-mount heartbeats + because we quite often have many SRs and therefore many mountpoints +- The OCFS2 global heartbeat should be collocated on the same SR as the XenServer + HA SR so that we depend on fewer SRs (the storage is a single point of failure + for OCFS2) +- The OCFS2 global heartbeat should itself be a raw VDI within an LVHDSR. +- Every host can be in at-most-one OCFS2 cluster i.e. the host cluster membership + is a per-host thing rather than a per-SR thing. Therefore `xapi` will be + modified to configure the cluster and manage the cluster node numbers. +- Every SR will be a filesystem mount, managed by a SM plugin called "OCFS2". +- Xapi HA uses the `xhad` process which runs in userspace but in the realtime + scheduling class so it has priority over all other userspace tasks. `xhad` + sends heartbeats via the `ha_statefile` VDI and via UDP, and uses the + Xen watchdog for host fencing. +- OCFS2 HA uses the `o2cb` kernel driver which sends heartbeats via the + `o2cb_statefile` and via TCP, fencing the host by panicking domain 0. + +Managing O2CB +============= + +OCFS2 uses the O2CB "cluster stack" which is similar to our `xhad`. To configure +O2CB we need to + +- assign each host an integer node number (from zero) +- on pool/cluster join: update the configuration on every node to include the + new node. In OCFS2 this can be done online. +- on pool/cluster leave/eject: update the configuration on every node to exclude + the old node. In OCFS2 this needs to be done offline. + +In the current Xapi toolstack there is a single global implicit cluster called a "Pool" +which is used for: resource locking; "clustered" storage repositories and fault handling (in HA). 
In the long term we will allow these types of clusters to be +managed separately or all together, depending on the sophistication of the +admin and the complexity of their environment. We will take a small step in that +direction by keeping the OCFS2 O2CB cluster management code at "arms length" +from the Xapi Pool.join code. + +In +[xcp-idl](https://github.com/xapi-project/xcp-idl) +we will define a new API category called "Cluster" (in addition to the +categories for +[Xen domains](https://github.com/xapi-project/xcp-idl/blob/37c676548a53b927ac411ab51f33892a7b891fda/xen/xenops_interface.ml#L102) +, [ballooning](https://github.com/xapi-project/xcp-idl/blob/37c676548a53b927ac411ab51f33892a7b891fda/memory/memory_interface.ml#L38) +, [stats](https://github.com/xapi-project/xcp-idl/blob/37c676548a53b927ac411ab51f33892a7b891fda/rrd/rrd_interface.ml#L76) +, +[networking](https://github.com/xapi-project/xcp-idl/blob/37c676548a53b927ac411ab51f33892a7b891fda/network/network_interface.ml#L106) +and +[storage](https://github.com/xapi-project/xcp-idl/blob/37c676548a53b927ac411ab51f33892a7b891fda/storage/storage_interface.ml#L51) +). These APIs will only be called by Xapi on localhost. In particular they will +not be called across-hosts and therefore do not have to be backward compatible. +These are "cluster plugin APIs". + +We will define the following APIs: + +- `Plugin:Membership.create`: add a host to a cluster. 
On exit the local host cluster software + will know about the new host but it may need to be restarted before the + change takes effect + - in:`hostname:string`: the hostname of the management domain + - in:`uuid:string`: a UUID identifying the host + - in:`id:int`: the lowest available unique integer identifying the host + where an integer will never be re-used unless it is guaranteed that + all nodes have forgotten any previous state associated with it + - in:`address:string list`: a list of addresses through which the host + can be contacted + - out: Task.id +- `Plugin:Membership.destroy`: removes a named host from the cluster. On exit the local + host software will know about the change but it may need to be restarted + before it can take effect + - in:`uuid:string`: the UUID of the host to remove +- `Plugin:Cluster.query`: queries the state of the cluster + - out:`maintenance_required:bool`: true if there is some outstanding configuration + change which cannot take effect until the cluster is restarted. + - out:`hosts`: a list of all known hosts together with a state including: + whether they are known to be alive or dead; or whether they are currently + "excluded" because the cluster software needs to be restarted +- `Plugin:Cluster.start`: turn on the cluster software and let the local host join +- `Plugin:Cluster.stop`: turn off the cluster software + +Xapi will be modified to: + +- add table `Cluster` which will have columns + - `name: string`: this is the name of the Cluster plugin (TODO: use same + terminology as SM?) + - `configuration: Map(String,String)`: this will contain any cluster-global + information, overrides for default values etc. + - `enabled: Bool`: this is true when the cluster "should" be running. It + may require maintenance to synchronise changes across the hosts. 
+ - `maintenance_required: Bool`: this is true when the cluster needs to + be placed into maintenance mode to resync its configuration +- add method `XenAPI:Cluster.enable` which sets `enabled=true` and waits for all + hosts to report `Membership.enabled=true`. +- add method `XenAPI:Cluster.disable` which sets `enabled=false` and waits for all + hosts to report `Membership.enabled=false`. +- add table `Membership` which will have columns + - `id: int`: automatically generated lowest available unique integer + starting from 0 + - `cluster: Ref(Cluster)`: the type of cluster. This will never be NULL. + - `host: Ref(host)`: the host which is a member of the cluster. This may + be NULL. + - `left: Date`: if not 1/1/1970 this means the time at which the host + left the cluster. + - `maintenance_required: Bool`: this is true when the Host believes the + cluster needs to be placed into maintenance mode. +- add field `Host.memberships: Set(Ref(Membership))` +- extend enum `vdi_type` to include `o2cb_statefile` as well as `ha_statefile` +- add method `Pool.enable_o2cb` with arguments + - in: `heartbeat_sr: Ref(SR)`: the SR to use for global heartbeats + - in: `configuration: Map(String,String)`: available for future configuration tweaks + - Like `Pool.enable_ha` this will find or create the heartbeat VDI, create the + `Cluster` entry and the `Membership` entries. All `Memberships` will have + `maintenance_required=true` reflecting the fact that the desired cluster + state is out-of-sync with the actual cluster state. +- add method `XenAPI:Membership.enable` + - in: `self:Host`: the host to modify + - in: `cluster:Cluster`: the cluster. +- add method `XenAPI:Membership.disable` + - in: `self:Host`: the host to modify + - in: `cluster:Cluster`: the cluster name. 
+- add a cluster monitor thread which + - watches the `Host.memberships` field and calls `Plugin:Membership.create` and + `Plugin:Membership.destroy` to keep the local cluster software up-to-date + when any host in the pool changes its configuration + - calls `Plugin:Cluster.query` after a `Plugin:Membership.create` or + `Plugin:Membership.destroy` to see whether the + SR needs maintenance + - when all hosts have a last start time later than a `Membership` + record's `left` date, deletes the `Membership`. +- modify `XenAPI:Pool.join` to resync with the master's `Host.memberships` list. +- modify `XenAPI:Pool.eject` to + - call `Membership.disable` in the cluster plugin to stop the `o2cb` service + - call `Membership.destroy` in the cluster plugin to remove every other host + from the local configuration + - remove the `Host` metadata from the pool + - set `XenAPI:Membership.left` to `NOW()` +- modify `XenAPI:Host.forget` to + - remove the `Host` metadata from the pool + - set `XenAPI:Membership.left` to `NOW()` + - set `XenAPI:Cluster.maintenance_required` to true + +A Cluster plugin called "o2cb" will be added which + +- on `Plugin:Membership.destroy` + - comment out the relevant node id in cluster.conf + - set the 'needs a restart' flag +- on `Plugin:Membership.create` + - if the provided node id is too high: return an error. This means the + cluster needs to be rebooted to free node ids. + - if the node id is not too high: rewrite the cluster.conf using + the "online" tool. +- on `Plugin:Cluster.start`: find the VDI with `type=o2cb_statefile`; + add this to the "static-vdis" list; `chkconfig` the service on. We + will use the global heartbeat mode of `o2cb`. 
+- on `Plugin:Cluster.stop`: stop the service; `chkconfig` the service off; + remove the "static-vdis" entry; leave the VDI itself alone +- keeps track of the current 'live' cluster.conf which allows it to + - report the cluster service as 'needing a restart' (which implies + we need maintenance mode) + +Summary of differences between this and xHA: + +- we allow for the possibility that hosts can join and leave, without + necessarily taking the whole cluster down. In the case of `o2cb` we + should be able to have `join` work live and only `eject` requires + maintenance mode +- rather than write explicit RPCs to update cluster configuration state + we instead use an event watch and resync pattern, which is hopefully + more robust to network glitches while a reconfiguration is in progress. + +Managing xhad +============= + +We need to ensure `o2cb` and `xhad` do not try to conflict by fencing +hosts at the same time. We shall: + +- use the default `o2cb` timeouts (hosts fence if no I/O in 60s): this + needs to be short because disk I/O *on otherwise working hosts* can + be blocked while another host is failing/ has failed. + +- make the `xhad` host fence timeouts much longer: 300s. It's much more + important that this is reliable than fast. We will make this change + globally and not just when using OCFS2. + +In the `xhad` config we will cap the `HeartbeatInterval` and `StatefileInterval` +at 5s (the default otherwise would be 31s). This means that 60 heartbeat +messages have to be lost before `xhad` concludes that the host has failed. + +SM plugin +========= + +The SM plugin `OCFS2` will be a file-based plugin. + +TODO: which file format by default? + +The SM plugin will first check whether the `o2cb` cluster is active and fail +operations if it is not. + +I/O paths +========= + +When either HA or OCFS O2CB "fences" the host it will look to the admin like +a host crash and reboot. We need to (in priority order) + +1. 
help the admin *prevent* fences by monitoring their I/O paths + and fixing issues before they lead to trouble +2. when a fence/crash does happen, help the admin + - tell the difference between an I/O error (admin to fix) and a software + bug (which should be reported) + - understand how to make their system more reliable + + +Monitoring I/O paths +-------------------- + +If heartbeat I/O fails for more than 60s when running `o2cb` then the host will fence. +This can happen either + +- for a good reason: for example the host software may have deadlocked or someone may + have pulled out a network cable. + +- for a bad reason: for example a network bond link failure may have been ignored + and then the second link failed; or the heartbeat thread may have been starved of + I/O bandwidth by other processes + +Since the consequences of fencing are severe -- all VMs on the host crash simultaneously -- +it is important to avoid the host fencing for bad reasons. + +We should recommend that all users + +- use network bonding for their network heartbeat +- use multipath for their storage heartbeat + +Furthermore we need to *help* users monitor their I/O paths. It's no good if they use +a bonded network but fail to notice when one of the paths has failed. + +The current XenServer HA implementation generates the following I/O-related alerts: + +- `HA_HEARTBEAT_APPROACHING_TIMEOUT` (priority 5 "informational"): when half the + network heartbeat timeout has been reached. +- `HA_STATEFILE_APPROACHING_TIMEOUT` (priority 5 "informational"): when half the + storage heartbeat timeout has been reached. +- `HA_NETWORK_BONDING_ERROR` (priority 3 "service degraded"): when one of the bond + links has failed. +- `HA_STATEFILE_LOST` (priority 2 "service loss imminent"): when the storage heartbeat + has completely failed and only the network heartbeat is left. +- `MULTIPATH_PERIODIC_ALERT` (priority 3 "service degraded"): when one of the multipath + links has failed. 
+ +Unfortunately alerts are triggered on "edges" i.e. when state changes, and not on "levels" +so it is difficult to see whether the link is currently broken. + +We should define datasources suitable for use by xcp-rrdd to expose the current state +(and the history) of the I/O paths as follows: + +- `pif__paths_failed`: the total number of paths which we know have failed. +- `pif__paths_total`: the total number of paths which are configured. +- `sr__paths_failed`: the total number of storage paths which we know have failed. +- `sr__paths_total`: the total number of storage paths which are configured. + +The `pif` datasources should be generated by `xcp-networkd` which already has a +[network bond monitoring thread](https://github.com/xapi-project/xcp-networkd/blob/bc0140feba19cf8dcced3bd66e54eeee112af819/networkd/network_monitor_thread.ml#L52). +The `sr` datasources should be generated by `xcp-rrdd` plugins since there is no +storage daemon to generate them. +We should create RRDs using the `MAX` consolidation function, otherwise information +about failures will be lost by averaging. + +XenCenter (and any diagnostic tools) should warn when the system is at risk of fencing, +in particular if any of the following are true: + +- `pif__paths_failed` is non-zero +- `sr__paths_failed` is non-zero +- `pif__paths_total` is less than 2 +- `sr__paths_total` is less than 2 + +XenCenter (and any diagnostic tools) should warn if any of the following *have been* +true over the past 7 days: + +- `pif__paths_failed` is non-zero +- `sr__paths_failed` is non-zero + + +Heartbeat "QoS" +--------------- + +The network and storage paths used by heartbeats *must* remain responsive otherwise +the host will fence (i.e. the host and all VMs will crash). + +Outstanding issue: how slow can `multipathd` get? How does it scale with the number of +LUNs? + +Post-crash diagnostics +====================== + +When a host crashes the effect on the user is severe: all the VMs will also +crash. 
In cases where the host crashed for a bad reason (such as a single failure +after a configuration error) we must help the user understand how they can +avoid the same situation happening again. + +We must make sure the crash kernel runs reliably when `xhad` and `o2cb` +fence the host. + +Xcp-rrdd will be modified to store RRDs in an `mmap(2)`d file in the dom0 +filesystem (rather than in-memory). Xcp-rrdd will call `msync(2)` every 5s +to ensure the historical records have hit the disk. We should use the same +on-disk format as RRDtool (or as close to it as makes sense) because it has +already been optimised to minimise the amount of I/O. + +Xapi will be modified to run a crash-dump analyser program `xen-crash-analyse`. + +`xen-crash-analyse` will: + +- parse the Xen and dom0 stacks and diagnose whether + - the dom0 kernel was panic'ed by `o2cb` + - the Xen watchdog was fired by `xhad` + - anything else: this would indicate a bug that should be reported +- in cases where the system was fenced by `o2cb` or `xhad` then the analyser + - will read the archived RRDs and look for recent evidence of a path failure + or of a bad configuration (i.e. one where the total number of paths is 1) + - will parse the `xhad.log` and look for evidence of heartbeats "approaching + timeout" + +TODO: depending on what information we can determine from the analyser, we +will want to record some of it in the `Host_crash_dump` database table. 
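
The RRD check performed by the analyser could be sketched as follows. This is only an illustration, not the proposed `xen-crash-analyse` implementation: the `PathSample` record, the `find_path_evidence` function and the one-hour look-back window are hypothetical. The only rules taken from this design are "a path has failed" and "the total number of paths is 1".

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PathSample:
    """One archived RRD sample for a set of I/O paths (hypothetical record)."""
    timestamp: int      # seconds since the Unix epoch
    paths_failed: int   # number of paths known to have failed
    paths_total: int    # number of paths configured


def find_path_evidence(samples: Iterable[PathSample], crash_time: int,
                       window: int = 3600) -> List[str]:
    """Return the findings supported by samples taken shortly before the crash.

    The length of the look-back window is an assumption made for this sketch.
    """
    findings = set()
    for s in samples:
        if not (crash_time - window <= s.timestamp <= crash_time):
            continue  # outside the look-back window: not "recent" evidence
        if s.paths_failed > 0:
            findings.add("path failure")
        if s.paths_total == 1:
            findings.add("bad configuration: only one path")
    return sorted(findings)
```

An empty result would suggest that the I/O paths looked healthy before the fence, so the `xhad.log` check becomes the next thing to look at.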
+ +XenCenter will be modified to explain why the host crashed and explain what +the user should do to fix it, specifically: + +- if the host crashed for no obvious reason then consider this a software + bug and recommend a bugtool/system-status-report is taken and uploaded somewhere +- if the host crashed because of `o2cb` or `xhad` then either + - if there is evidence of path failures in the RRDs: recommend the user + increase the number of paths or investigate whether some of the equipment + (NICs or switches or HBAs or SANs) is unreliable + - if there is evidence of insufficient paths: recommend the user add more + paths + +Network configuration +===================== + +The documentation should strongly recommend + +- the management network is bonded +- the management network is dedicated i.e. used only for management traffic + (including heartbeats) +- the OCFS2 storage is multipathed + +`xcp-networkd` will be modified to change the behaviour of the DHCP client. +Currently the `dhclient` will wait for a response and eventually background +itself. This is a big problem since DHCP can reset the hostname, and this can +break `o2cb`. Therefore we must insist that `PIF.reconfigure_ip` becomes +fully synchronous, supporting timeout and cancellation. Once the call returns +-- whether through success or failure -- there must not be anything in the +background which will change the system's hostname. + +TODO: figure out whether we need to request "maintenance mode" for hostname +changes. + +Maintenance mode +================ + +The purpose of "maintenance mode" is to take a host out of service and leave +it in a state where it's safe to fiddle with it without affecting services +in VMs. 
+ +XenCenter currently does the following: + +- `Host.disable`: prevents new VMs starting here +- makes a list of all the VMs running on the host +- `Host.evacuate`: move the running VMs somewhere else + +The problems with maintenance mode are: + +- it's not safe to fiddle with the host network configuration with storage + still attached. For NFS this risks deadlocking the SR. For OCFS2 this + risks fencing the host. +- it's not safe to fiddle with the storage or network configuration if HA + is running because the host will be fenced. It's not safe to disable fencing + unless we guarantee to reboot the host on exit from maintenance mode. + +We should also + +- `PBD.unplug`: all storage. This allows the network to be safely reconfigured. + If the network is reconfigured when NFS storage is plugged then the SR can + permanently deadlock; if the network is reconfigured when OCFS2 storage is + plugged then the host can crash. + +TODO: should we add a `Host.prepare_for_maintenance` (better name TBD) +to take care of all this without XenCenter having to script it? This would also +help CLI and PowerShell users do the right thing. + +TODO: should we insist that the host is rebooted to leave maintenance +mode? This would make maintenance mode more reliable and allow us to integrate +maintenance mode with xHA (where maintenance mode is a "staged reboot"). + +TODO: should we leave all clusters as part of maintenance mode? We +probably need to do this to avoid fencing. + +Walk-through: adding OCFS2 storage +================================== + +Assume you have an existing Pool of 2 hosts. First the client will set up +the O2CB cluster, choosing where to put the global heartbeat volume. The +client should check that the I/O paths have all been set up correctly with +bonding and multipath and prompt the user to fix any obvious problems. 
+ +![The client enables O2CB and then creates an SR](o2cb-enable-external.svg) + +Internally within `Pool.enable_o2cb` Xapi will set up the cluster metadata +on every host in the pool: + +![Xapi creates the cluster configuration and each host updates its metadata](o2cb-enable-internal1.svg) + +At this point all hosts have in-sync `cluster.conf` files but all cluster +services are disabled. We also have `requires_maintenance=true` on all +`Membership` entries and the global `Cluster` has `enabled=false`. +The client will now try to enable the cluster with `Cluster.enable`: + +![Xapi enables the cluster software on all hosts](o2cb-enable-internal2.svg) + +Now all hosts are in the cluster and the SR can be created using the standard +SM APIs. + +Walk-through: remove a host +=========================== + +Assume you have an existing Pool of 2 hosts with `o2cb` clustering enabled +and at least one `ocfs2` filesystem mounted. If the host is online then +`XenAPI:Pool.eject` will: + +![Xapi ejects a host from the pool](pool-eject.svg) + +Note that: + +- All hosts will have modified their `o2cb` `cluster.conf` to comment out + the former host +- The `Membership` table still remembers the node number of the ejected host -- + this cannot be re-used until the SR is taken down for maintenance. +- All hosts can see the difference between their current `cluster.conf` + and the one they would use if they restarted the cluster service, so all + hosts report that the cluster must be taken offline i.e. `requires_maintenance=true`. + +Summary of the impact on the admin +================================== + +OCFS2 is fundamentally a different type of storage to all existing storage +types supported by xapi. OCFS2 relies upon O2CB, which provides +[Host-level High Availability](../../../features/HA/HA.html). All HA implementations +(including O2CB and `xhad`) impose restrictions on the server admin to +prevent unnecessary host "fencing" (i.e. crashing). 
Once we have OCFS2 as +a feature, we will have to live with these restrictions which previously only +applied when HA was explicitly enabled. To reduce complexity we will not try +to enforce restrictions only when OCFS2 is being used or is likely to be used. + +Impact even if not using OCFS2 +------------------------------ + +- "Maintenance mode" now includes detaching all storage. +- Host network reconfiguration can only be done in maintenance mode +- XenServer HA enable takes longer +- XenServer HA failure detection takes longer +- Network configuration with DHCP must be fully synchronous i.e. it will block + until the DHCP server responds. On a timeout, the change will not be made. + +Impact when using OCFS2 +----------------------- + +- Sometimes a host will not be able to join the pool without taking the + pool into maintenance mode +- Every VM will have to be XSM'ed (is that a verb?) to the new OCFS2 storage. + This means that VMs with more than 2 snapshots will have their snapshots + deleted; it means you need to provision another storage target, temporarily + doubling your storage needs; and it will take a long time. +- There will now be 2 different reasons why a host has fenced which the + admin needs to understand. 
diff --git a/doc/content/design/ocfs2/o2cb-enable-external.msc b/doc/content/design/ocfs2/o2cb-enable-external.msc new file mode 100644 index 00000000000..9ee67403dcd --- /dev/null +++ b/doc/content/design/ocfs2/o2cb-enable-external.msc @@ -0,0 +1,13 @@ +Client->Xapi: Pool.enable_o2cb +Xapi->LVHD: vdi_create +LVHD-->Xapi: OK +Note left of LVHD: VDI will be used for\nO2CB global heartbeat +Xapi-->Client: OK +Note right of Xapi: DB objects initialised\nbut cluster is offline +Client->Xapi: Cluster.enable +Xapi-->Client: OK +Note right of Xapi: O2CB cluster is online\non all hosts +Client->Xapi: SR.create +Xapi->OCFS2: sr_create +OCFS2-->Xapi: OK +Xapi-->Client: OK diff --git a/doc/content/design/ocfs2/o2cb-enable-external.svg b/doc/content/design/ocfs2/o2cb-enable-external.svg new file mode 100644 index 00000000000..46832e6219c --- /dev/null +++ b/doc/content/design/ocfs2/o2cb-enable-external.svg @@ -0,0 +1,15 @@ +Client->Xapi: Pool.enable_o2cb +Xapi->LVHD: vdi_create +LVHD-->Xapi: OK +Note left of LVHD: VDI will be used for\nO2CB global heartbeat +Xapi-->Client: OK +Note right of Xapi: DB objects initialised\nbut cluster is offline +Client->Xapi: Cluster.enable +Xapi-->Client: OK +Note right of Xapi: O2CB cluster is online\non all hosts +Client->Xapi: SR.create +Xapi->OCFS2: sr_create +OCFS2-->Xapi: OK +Xapi-->Client: OK + +Created with Raphaël 2.1.0 \ No newline at end of file diff --git a/doc/content/design/ocfs2/o2cb-enable-internal1.msc b/doc/content/design/ocfs2/o2cb-enable-internal1.msc new file mode 100644 index 00000000000..3bcf4a4f6b6 --- /dev/null +++ b/doc/content/design/ocfs2/o2cb-enable-internal1.msc @@ -0,0 +1,14 @@ +Participant Plugin\nSlave +Participant Xapi\nSlave +Participant Xapi\nMaster +Xapi\nSlave->Xapi\nMaster: events.from +Note over Xapi\nMaster: Create Cluster\nMemberships in DB +Xapi\nMaster-->Xapi\nSlave: events.from OK +Xapi\nMaster->Plugin\nMaster: Membership.create +Note over Plugin\nMaster: edit cluster.conf 
+Xapi\nMaster->Plugin\nMaster: Cluster.query +Plugin\nMaster->Xapi\nMaster: requires_maintenance=true +Xapi\nSlave->Plugin\nSlave: Membership.create +Note over Plugin\nSlave: edit cluster.conf +Xapi\nSlave->Plugin\nSlave: Cluster.query +Plugin\nSlave->Xapi\nSlave: requires_maintenance=true diff --git a/doc/content/design/ocfs2/o2cb-enable-internal1.svg b/doc/content/design/ocfs2/o2cb-enable-internal1.svg new file mode 100644 index 00000000000..b3f6ed0d571 --- /dev/null +++ b/doc/content/design/ocfs2/o2cb-enable-internal1.svg @@ -0,0 +1,15 @@ +Participant Plugin\nSlave +Participant Xapi\nSlave +Participant Xapi\nMaster +Xapi\nSlave->Xapi\nMaster: events.from +Note over Xapi\nMaster: Create Cluster\nMemberships in DB +Xapi\nMaster-->Xapi\nSlave: events.from OK +Xapi\nMaster->Plugin\nMaster: Membership.create +Note over Plugin\nMaster: edit cluster.conf +Xapi\nMaster->Plugin\nMaster: Cluster.query +Plugin\nMaster->Xapi\nMaster: requires_maintenance=true +Xapi\nSlave->Plugin\nSlave: Membership.create +Note over Plugin\nSlave: edit cluster.conf +Xapi\nSlave->Plugin\nSlave: Cluster.query +Plugin\nSlave->Xapi\nSlave: requires_maintenance=true +Created with Raphaël 2.1.0 \ No newline at end of file diff --git a/doc/content/design/ocfs2/o2cb-enable-internal2.msc b/doc/content/design/ocfs2/o2cb-enable-internal2.msc new file mode 100644 index 00000000000..d917cd8e734 --- /dev/null +++ b/doc/content/design/ocfs2/o2cb-enable-internal2.msc @@ -0,0 +1,16 @@ +Participant Plugin\nMaster +Xapi\nMaster->Xapi\nSlave: Membership.enable +Xapi\nMaster->Xapi\nMaster: Membership.enable +Xapi\nSlave->Plugin\nSlave: Membership.enable +Xapi\nMaster->Plugin\nMaster: Membership.enable +Note over Plugin\nSlave: chkconfig o2cb on +Note over Plugin\nMaster: chkconfig o2cb on +Xapi\nSlave->Plugin\nSlave: Cluster.query +Plugin\nSlave->Xapi\nSlave: not enabled yet +Xapi\nSlave->Plugin\nSlave: Cluster.query +Plugin\nSlave->Xapi\nSlave: enabled +Note over Xapi\nSlave: requires_maintenance=false 
+Xapi\nMaster->Plugin\nMaster: Cluster.query +Plugin\nMaster->Xapi\nMaster: enabled +Note over Xapi\nMaster: requires_maintenance=false +Note over Xapi\nMaster: enabled=true diff --git a/doc/content/design/ocfs2/o2cb-enable-internal2.svg b/doc/content/design/ocfs2/o2cb-enable-internal2.svg new file mode 100644 index 00000000000..e8c4da29580 --- /dev/null +++ b/doc/content/design/ocfs2/o2cb-enable-internal2.svg @@ -0,0 +1,17 @@ +Participant Plugin\nMaster +Xapi\nMaster->Xapi\nSlave: Membership.enable +Xapi\nMaster->Xapi\nMaster: Membership.enable +Xapi\nSlave->Plugin\nSlave: Membership.enable +Xapi\nMaster->Plugin\nMaster: Membership.enable +Note over Plugin\nSlave: chkconfig o2cb on +Note over Plugin\nMaster: chkconfig o2cb on +Xapi\nSlave->Plugin\nSlave: Cluster.query +Plugin\nSlave->Xapi\nSlave: not enabled yet +Xapi\nSlave->Plugin\nSlave: Cluster.query +Plugin\nSlave->Xapi\nSlave: enabled +Note over Xapi\nSlave: requires_maintenance=false +Xapi\nMaster->Plugin\nMaster: Cluster.query +Plugin\nMaster->Xapi\nMaster: enabled +Note over Xapi\nMaster: requires_maintenance=false +Note over Xapi\nMaster: enabled=true +Created with Raphaël 2.1.0 \ No newline at end of file diff --git a/doc/content/design/ocfs2/ocfs2.graffle b/doc/content/design/ocfs2/ocfs2.graffle new file mode 100644 index 00000000000..2ef614fc919 Binary files /dev/null and b/doc/content/design/ocfs2/ocfs2.graffle differ diff --git a/doc/content/design/ocfs2/ocfs2.png b/doc/content/design/ocfs2/ocfs2.png new file mode 100644 index 00000000000..aaf25f8a574 Binary files /dev/null and b/doc/content/design/ocfs2/ocfs2.png differ diff --git a/doc/content/design/ocfs2/pool-eject.msc b/doc/content/design/ocfs2/pool-eject.msc new file mode 100644 index 00000000000..9c71208c203 --- /dev/null +++ b/doc/content/design/ocfs2/pool-eject.msc @@ -0,0 +1,13 @@ +Participant Plugin\nMaster +Participant Xapi\nMaster +Xapi\nMaster->Xapi\nSlave:Pool.eject +Xapi\nSlave->Plugin\nSlave:Membership.destroy 
+Xapi\nSlave->Plugin\nSlave:Membership.destroy +Xapi\nSlave->Plugin\nSlave:Cluster.disable +Note over Plugin\nSlave:chkconfig o2cb off\nreboot +Note over Xapi\nMaster:remove Host metadata +Note over Xapi\nMaster:set Membership.left to NOW() +Xapi\nMaster->Plugin\nMaster:Membership.destroy +Xapi\nMaster->Plugin\nMaster:Cluster.query +Note over Xapi\nMaster:Cluster.requires_maintenance=true + diff --git a/doc/content/design/ocfs2/pool-eject.svg b/doc/content/design/ocfs2/pool-eject.svg new file mode 100644 index 00000000000..5f57ce745ad --- /dev/null +++ b/doc/content/design/ocfs2/pool-eject.svg @@ -0,0 +1,13 @@ +Participant Plugin\nMaster +Participant Xapi\nMaster +Xapi\nMaster->Xapi\nSlave:Pool.eject +Xapi\nSlave->Plugin\nSlave:Membership.destroy +Xapi\nSlave->Plugin\nSlave:Membership.destroy +Xapi\nSlave->Plugin\nSlave:Cluster.disable +Note over Plugin\nSlave:chkconfig o2cb off\nreboot +Note over Xapi\nMaster:remove Host metadata +Note over Xapi\nMaster:set Membership.left to NOW() +Xapi\nMaster->Plugin\nMaster:Membership.destroy +Xapi\nMaster->Plugin\nMaster:Cluster.query +Note over Xapi\nMaster:Cluster.requires_maintenance=true +Created with Raphaël 2.1.0 \ No newline at end of file diff --git a/doc/content/design/patches-in-vdis.md b/doc/content/design/patches-in-vdis.md new file mode 100644 index 00000000000..13c1f9f4a68 --- /dev/null +++ b/doc/content/design/patches-in-vdis.md @@ -0,0 +1,77 @@ +--- +title: patches in VDIs +layout: default +design_doc: true +revision: 1 +status: proposed +--- + +"Patches" are signed binary blobs which can be queried and applied. +They are stored in the dom0 filesystem under `/var/patch`. Unfortunately +the patches can be quite large -- imagine a repo full of RPMs -- and +the dom0 filesystem is usually quite small, so it can be difficult +to upload and apply some patches. + +Instead of writing patches to the dom0 filesystem, we shall write them +to disk images (VDIs) instead. 
We can then take advantage of features like + +- shared storage +- cross-host `VDI.copy` + +to manage the patches. + +XenAPI changes +============== + +1. Add a field `pool_patch.VDI` of type `Ref(VDI)`. When a new patch is + stored in a VDI, it will be referenced here. Older patches and cleaned + patches will have invalid references here. + +2. The HTTP handler for uploading patches will choose an SR to stream the + patch into. It will prefer to use the `pool.default_SR` and fall back + to choosing an SR on the master whose driver supports the `VDI_CLONE` + capability: we want the ability to fast clone patches, one per host + concurrently installing them. A VDI will be created whose size is 4x + the apparent size of the patch, defaulting to 4GiB if we have no size + information (i.e. no `content-length` header). + +3. `pool_patch.clean_on_host` will be deprecated. It will still try to + clean a patch *from the local filesystem* but this is pointless for + the new VDI patch uploads. + +4. `pool_patch.clean` will be deprecated. It will still try to clean a patch + from the *local filesystem* of the master but this is pointless for the + new VDI patch uploads. + +5. `pool_patch.pool_clean` will be deprecated. It will destroy any associated + patch VDI. Users will be encouraged to call `VDI.destroy` instead. + + + +Changes beneath the XenAPI +========================== + +1. `pool_patch` records will only be deleted if both the `filename` field + refers to a missing file on the master *and* the `VDI` field is a dangling + reference. + +2. Patches stored in VDIs will be stored within a filesystem, like we used + to do with suspend images. This is needed because (a) we want to execute + the patches and block devices cannot be executed; and (b) we can use + spare space in the VDI as temporary scratch space during the patch + application process. Within the VDI we will call patches `patch` rather + than using a complicated filename. + +3. 
When a host wishes to apply a patch it will call `VDI.copy` to duplicate + the VDI to a locally-accessible SR, mount the filesystem and execute it. + If the patch is still in the master's dom0 filesystem then it will fall + back to the HTTP handler. + + +Summary of the impact on the admin +================================== + +- There will no longer be a size limit on hotfixes imposed by the mechanism + itself. +- There must be enough free space in an SR connected to the host to be able + to apply a patch on that host. diff --git a/doc/content/design/pci-passthrough.md b/doc/content/design/pci-passthrough.md new file mode 100644 index 00000000000..c91cc8d7b02 --- /dev/null +++ b/doc/content/design/pci-passthrough.md @@ -0,0 +1,76 @@ +--- +title: PCI passthrough support +layout: default +design_doc: true +revision: 1 +status: proposed +--- + +Introduction +------------ + +GPU passthrough is already available in XAPI; this document proposes to also +offer passthrough for all PCI devices through XAPI. 
+ +Design proposal +--------------- + +New methods for PCI object: +- `PCI.enable_dom0_access` +- `PCI.disable_dom0_access` +- `PCI.get_dom0_access_status`: compares the outputs of `/opt/xensource/libexec/xen-cmdline` + and `/proc/cmdline` to produce one of the four values that can be currently contained + in the `PGPU.dom0_access` field: + - disabled + - disabled_on_reboot + - enabled + - enabled_on_reboot + + How to determine the expected dom0 access state: + If the device id is present in neither the `pciback.hide` of `/proc/cmdline` nor the one of `xen-cmdline`: `enabled` + If the device id is present in both the `pciback.hide` of `/proc/cmdline` and the one of `xen-cmdline`: `disabled` + If the device id is present in the `pciback.hide` of `/proc/cmdline` but not in the one of `xen-cmdline`: `enabled_on_reboot` + If the device id is not present in the `pciback.hide` of `/proc/cmdline` but is in the one of `xen-cmdline`: `disabled_on_reboot` + + A function rather than a field makes the data always accurate and even accounts for + changes made by users outside XAPI, directly through `/opt/xensource/libexec/xen-cmdline` + +With these generic methods available, the following field and methods will be *deprecated*: +- `PGPU.enable_dom0_access` +- `PGPU.disable_dom0_access` +- `PGPU.dom0_access` (DB field) + +They would still be usable and up to date with the same info as for the PCI methods. 
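
The status computation could be sketched as a small function. This is an illustrative sketch, not XAPI code: the function name and its two boolean parameters are hypothetical, and it assumes that a device listed in `pciback.hide` is hidden from dom0 (i.e. dom0 access is disabled), which is what the "hide a PCI" test case expects.

```python
def dom0_access_status(hidden_now: bool, hidden_on_next_boot: bool) -> str:
    """Map the two `pciback.hide` observations to a dom0 access state.

    hidden_now:          device id found in `pciback.hide` of /proc/cmdline
    hidden_on_next_boot: device id found in `pciback.hide` reported by xen-cmdline
    """
    if hidden_now and hidden_on_next_boot:
        return "disabled"
    if not hidden_now and not hidden_on_next_boot:
        return "enabled"
    if hidden_now:
        # Hidden for this boot only: access comes back after a reboot.
        return "enabled_on_reboot"
    # Visible for this boot only: access goes away after a reboot.
    return "disabled_on_reboot"
```

Because both inputs are read from the command lines on every call, changes made outside XAPI are picked up without any database update.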
+ +Test cases +---------- + +- hide a PCI: + - call `PCI.disable_dom0_access` on an `enabled` PCI + - check the PCI goes into state `disabled_on_reboot` + - reboot the host + - check the PCI goes into state `disabled` + + +- unhide a PCI: + - call `PCI.enable_dom0_access` on a `disabled` PCI + - check the PCI goes into state `enabled_on_reboot` + - reboot the host + - check the PCI goes into state `enabled` + +- get a PCI dom0 access state: + - on an `enabled` PCI, make sure the `get_dom0_access_status` returns `enabled` + - hide the PCI + - make sure the `get_dom0_access_status` returns `disabled_on_reboot` + - reboot + - make sure the `get_dom0_access_status` returns `disabled` + - unhide the PCI + - make sure the `get_dom0_access_status` returns `enabled_on_reboot` + - reboot + - make sure the `get_dom0_access_status` returns `enabled` + +- Check PCI/PGPU dom0 access coherence: + - hide a PCI belonging to a PGPU and make sure both states remain coherent at every step + - unhide a PCI belonging to a PGPU and make sure both states remain coherent at every step + - hide a PGPU and make sure its state and its PCI's state remain coherent at every step + - unhide a PGPU and make sure its state and its PCI's state remain coherent at every step diff --git a/doc/content/design/pif-properties.md b/doc/content/design/pif-properties.md new file mode 100644 index 00000000000..f53c7efbc3e --- /dev/null +++ b/doc/content/design/pif-properties.md @@ -0,0 +1,64 @@ +--- +title: GRO and other properties of PIFs +layout: default +design_doc: true +revision: 1 +status: released (6.5) +--- + +It has been possible to enable and disable GRO and other "ethtool" features on +PIFs for a long time, but there was never an official API for it. Now there is. + +Introduction +------------ + +The former way to enable GRO via the CLI is as follows: + + xe pif-param-set uuid= other-config:ethtool-gro=on + xe pif-plug uuid= + +The `other-config` field is a grab-bag of options that are not clearly defined. 
+The options exposed through `other-config` are mostly experimental features, and +the interface is not considered stable. Furthermore, the field is read/write, +does not have any input validation, and cannot trigger any actions +immediately. This is why `pif-plug` must be called after setting +the `ethtool-gro` key, in order to actually make things happen. + +New API +------- + +New field: + +* Field `PIF.properties` of type `(string -> string) map`. +* Physical and bond PIFs have a `gro` key in their `properties`, with possible values `on` and `off`. There are currently no other properties defined. +* VLAN and Tunnel PIFs do not have any properties. They implicitly inherit the properties from the PIF they are based upon (either a physical PIF or a bond). +* For backwards compatibility, if there is an `other-config:ethtool-gro` key present on the PIF, it will be treated as an override of the `gro` key in `PIF.properties`. + +New function: + +* Message `void PIF.set_property (PIF ref, string, string)`. + * First argument: the reference of the PIF to act on. + * Second argument: the key to change in the `properties` field. + * Third argument: the value to write. +* The function can only be used on physical PIFs that are not bonded, and on bond PIFs. Attempts to call the function on bond slaves, VLAN PIFs, or Tunnel PIFs fail with `CANNOT_CHANGE_PIF_PROPERTIES`. +* Calls with invalid keys or values fail with `INVALID_VALUE`. +* When called on a bond PIF, the key in the `properties` of the associated bond slaves will also be set to the same value. +* The function automatically causes the settings to be applied to the network devices (no additional `plug` is needed). This includes any VLANs that are on top of the PIF to-be-changed, as well as any bond slaves. + +Defaults, Installation and Upgrade +---------------------------------- + +* Any newly introduced PIF will have its `properties` field set to `"gro" -> "on"`. 
This includes PIFs obtained after a fresh installation of XenServer, as well as PIFs created using `PIF.introduce` or `PIF.scan`. In other words, GRO will be "on" by default. +* An upgrade from a version of XenServer that does not have the `PIF.properties` field will give every physical and bond PIF a `properties` field set to `"gro" -> "on"`. In other words, GRO will be "on" by default after an upgrade. + +Bonding +------- + +* When creating a bond, the bond-slaves-to-be must all have equal `PIF.properties`. If not, the `bond.create` call will fail with `INCOMPATIBLE_BOND_PROPERTIES`. +* When a bond is created successfully, the `properties` of the bond PIF will be equal to the properties of the bond slaves. + +Command Line Interface +---------------------- + +* The `PIF.properties` field is exposed through `xe pif-list` and `xe pif-param-list` as usual. +* The `PIF.set_property` call is exposed through `xe pif-param-set`. For example: `xe pif-param-set uuid= properties:gro=off`. diff --git a/doc/content/design/plugin-protocol-v2.md b/doc/content/design/plugin-protocol-v2.md new file mode 100644 index 00000000000..8c02b85c61f --- /dev/null +++ b/doc/content/design/plugin-protocol-v2.md @@ -0,0 +1,198 @@ +--- +title: RRDD plugin protocol v2 +layout: default +design_doc: true +revision: 1 +status: released (7.0) +revision_history: +- revision_number: 1 + description: Initial version +--- + +Motivation +---------- + +rrdd plugins currently report datasources via a shared-memory file, using the +following format: + +``` +DATASOURCES +000001e4 +dba4bf7a84b6d11d565d19ef91f7906e +{ + "timestamp": 1339685573, + "data_sources": { + "cpu-temp-cpu0": { + "description": "Temperature of CPU 0", + "type": "absolute", + "units": "degC", + "value": "64.33", + "value_type": "float" + }, + "cpu-temp-cpu1": { + "description": "Temperature of CPU 1", + "type": "absolute", + "units": "degC", + "value": "62.14", + "value_type": "float" + } + } +} +``` + +This format contains four main 
components: + +* A constant header string + +`DATASOURCES` + +This should always be present. + +* The JSON data length, encoded as hexadecimal + +`000001e4` + +* The md5sum of the JSON data + +`dba4bf7a84b6d11d565d19ef91f7906e` + +* The JSON data itself, encoding the values and metadata associated with the +reported datasources. + +### Example +``` +{ + "timestamp": 1339685573, + "data_sources": { + "cpu-temp-cpu0": { + "description": "Temperature of CPU 0", + "type": "absolute", + "units": "degC", + "value": "64.33", + "value_type": "float" + }, + "cpu-temp-cpu1": { + "description": "Temperature of CPU 1", + "type": "absolute", + "units": "degC", + "value": "62.14", + "value_type": "float" + } + } +} +``` + +The disadvantage of this protocol is that rrdd has to parse the entire JSON +structure each tick, even though most of the time only the values will change. + +For this reason a new protocol is proposed. + +Protocol V2 +----------- + +|value|bits|format|notes| +|-----|----|------|-----| +|header string |(string length)*8|string|"DATASOURCES" as in the V1 protocol | +|data checksum |32 |int32 |binary-encoded crc32 of the concatenation of the encoded timestamp and datasource values| +|metadata checksum |32 |int32 |binary-encoded crc32 of the metadata string (see below) | +|number of datasources|32 |int32 |only needed if the metadata has changed - otherwise RRDD can use a cached value | +|timestamp |64 |int64 |Unix epoch | +|datasource values |n * 64 |int64 \| double |n is the number of datasources exported by the plugin, type dependent on the setting in the metadata for value_type [int64\|float] | +|metadata length |32 |int32 | | +|metadata |(string length)*8|string| | + +All integers and doubles are big-endian. The metadata will have the same JSON-based format as +in the V1 protocol, minus the timestamp and `value` key-value pair for each +datasource. 
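
As an illustration of the binary layout in the table above, the following sketch packs one update with Python's `struct` and `zlib` modules. It is not the rrdd implementation: the function name `encode_v2` is hypothetical, the metadata is serialised with `json.dumps` purely for the example, and the checksum is assumed to be the standard CRC32 computed by `zlib.crc32`, written as an unsigned 32-bit field.

```python
import json
import struct
import zlib

HEADER = b"DATASOURCES"


def encode_v2(timestamp: int, values: list, metadata: dict) -> bytes:
    """Pack one update; `values` follow the order of metadata["datasources"]."""
    # Data section: big-endian timestamp followed by one 64-bit slot per value.
    body = struct.pack(">q", timestamp)
    for spec, value in zip(metadata["datasources"].values(), values):
        fmt = ">q" if spec["value_type"] == "int64" else ">d"
        body += struct.pack(fmt, value)
    meta = json.dumps(metadata).encode("utf-8")
    return b"".join([
        HEADER,
        struct.pack(">I", zlib.crc32(body)),   # data checksum
        struct.pack(">I", zlib.crc32(meta)),   # metadata checksum
        struct.pack(">i", len(values)),        # number of datasources
        body,                                  # timestamp + datasource values
        struct.pack(">i", len(meta)),          # metadata length
        meta,                                  # metadata string
    ])
```

Keeping the values in the same order as the metadata map is what lets a reader match each 64-bit slot to its datasource without re-parsing the JSON.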
+ +| field | values | notes | required | +|-------|--------|-------|----------| +|description|string|Description of the datasource|no| +|owner|host \| vm \| sr|The object to which the data relates|no, default host| +|value_type|int64 \| float|The type of the datasource|yes| +|type|absolute \| derive \| gauge|The type of measurement being sent. Absolute for counters which are reset on reading, derive stores the derivative of the recorded values (useful for metrics which continually increase like amount of data written since start), gauge for things like temperature|no, default absolute| +|default|true \| false|Whether the source is default enabled or not|no, default false| +|units||The units the data should be displayed in|no| +|min||The minimum value for the datasource|no, default -infinity| +|max||The maximum value for the datasource|no, default +infinity| + + +### Example +``` +{ + "datasources": { + "memory_reclaimed": { + "description":"Host memory reclaimed by squeezed", + "owner":"host", + "value_type":"int64", + "type":"absolute", + "default":"true", + "units":"B", + "min":"-inf", + "max":"inf" + }, + "memory_reclaimed_max": { + "description":"Host memory that could be reclaimed by squeezed", + "owner":"host", + "value_type":"int64", + "type":"absolute", + "default":"true", + "units":"B", + "min":"-inf", + "max":"inf" + }, + "cpu-temp-cpu0": { + "description": "Temperature of CPU 0", + "owner":"host", + "value_type": "float", + "type": "absolute", + "default":"true", + "units": "degC", + "min":"-inf", + "max":"inf" + }, + "cpu-temp-cpu1": { + "description": "Temperature of CPU 1", + "owner":"host", + "value_type": "float", + "type": "absolute", + "default":"true", + "units": "degC", + "min":"-inf", + "max":"inf" + } + } +} +``` + +The above formatting is not required, but added here for readability. 
+ +Reading algorithm +----------------- + +``` +if header != expected_header: + raise InvalidHeader() +if data_checksum == last_data_checksum: + raise NoUpdate() +if data_checksum != crc32(encoded_timestamp_and_values): + raise InvalidChecksum() +if metadata_checksum == last_metadata_checksum: + for datasource, value in zip(cached_datasources, values): + update(datasource, value) +else: + if metadata_checksum != crc32(metadata): + raise InvalidChecksum() + cached_datasources = create_datasources(metadata) + for datasource, value in zip(cached_datasources, values): + update(datasource, value) +``` + +This means that for a normal update, RRDD will only have to read the header plus +the first (4 + 4 + 4 + 8 + 8*n) bytes of data, where n is the number of +datasources exported by the plugin. If the metadata changes RRDD will have to +read all the data (and parse the metadata). + +n.b. the timestamp reported by plugins is not currently used by RRDD - it uses +its own global timestamp. diff --git a/doc/content/design/plugin-protocol-v3.md b/doc/content/design/plugin-protocol-v3.md new file mode 100644 index 00000000000..27596b63488 --- /dev/null +++ b/doc/content/design/plugin-protocol-v3.md @@ -0,0 +1,81 @@ +--- +title: RRDD plugin protocol v3 +layout: default +design_doc: true +revision: 1 +status: proposed +revision_history: +- revision_number: 1 + description: Initial version +--- + +Motivation +---------- + +The rrdd plugin protocol v2 reports datasources via a shared-memory file; however, it +has various limitations: + - metrics are identified uniquely by name, so it is not possible to have + several metrics that share the same name (e.g. vCPU usage per VM) + - only numeric metrics are supported; for example, we can't expose string + metrics (e.g. CPU model) + +This in turn constrains plugins and limits +[OpenMetrics](https://openmetrics.io/) support for the metrics daemon.
+ +Moreover, it may not be practical for plugin developers and parser implementations: + - JSON implementations may not keep insertion order on maps, which can cause + issues when exposing datasource values, since these are sensitive to the order of the metadata map + - header length is not constant and depends on datasource count, which complicates parsing + - it still requires a quite advanced parser to convert between bytes and numbers according to metadata + +A simpler protocol is proposed, based on the OpenMetrics binary format, to ease plugin and parser implementations. + +Protocol V3 +----------- + +For this protocol, we still use a shared-memory file, but significantly change the structure of the file. + +| value | bits | format | notes +| -------------- | ------------------ | ------ | ------------------------------------------------------------ +| header string | 12*8=96 | string | "OPENMETRICS1" which is one byte longer than "DATASOURCES", intentionally made at 12 bytes for alignment purposes +| data checksum | 32 | uint32 | Checksum of the concatenation of the rest of the header (from timestamp) and the payload data +| timestamp | 64 | uint64 | Unix epoch +| payload length | 32 | uint32 | Payload length +| payload data | 8*(payload length) | binary | OpenMetrics encoded metrics data (protocol-buffers format) + +All values are big-endian. + +The header size is constant (28 bytes), which implementations can rely on (to read +the entire header in one go and to simplify the use of memory mapping). + +Unlike protocol v2, but like protocol v1, metadata is included alongside +metrics in OpenMetrics format. + +The `owner` attribute of a metric should be exposed using an OpenMetrics label instead (named `owner`). + +Multiple metrics that share the same name should be exposed under the same +Metric Family and be differentiated by labels (e.g. `owner`).
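
To make the fixed 28-byte header concrete, here is a minimal sketch of framing and parsing a V3 file. The names `frame_v3` and `parse_v3` are hypothetical, the payload is treated as opaque bytes (producing the OpenMetrics protobuf is out of scope here), and the checksum is assumed to be a CRC32, as the reading algorithm in this document suggests.

```python
import struct
import zlib

MAGIC = b"OPENMETRICS1"
HEADER_FMT = ">12sIQI"  # magic, checksum, timestamp, payload length (big-endian)
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 12 + 4 + 8 + 4 = 28 bytes


def frame_v3(timestamp, payload):
    """Build a V3 file image around an already-encoded payload."""
    # The checksum covers the rest of the header (from timestamp) plus the payload.
    tail = struct.pack(">QI", timestamp, len(payload)) + payload
    return MAGIC + struct.pack(">I", zlib.crc32(tail)) + tail


def parse_v3(buf):
    """Return (timestamp, payload) or raise ValueError on a bad header/checksum."""
    magic, checksum, timestamp, length = struct.unpack_from(HEADER_FMT, buf)
    if magic != MAGIC:
        raise ValueError("invalid header")
    payload = bytes(buf[HEADER_SIZE:HEADER_SIZE + length])
    if zlib.crc32(bytes(buf[16:HEADER_SIZE]) + payload) != checksum:
        raise ValueError("invalid checksum")
    return timestamp, payload
```

Since every header field sits at a fixed offset, a reader can unpack the whole header with a single `struct.unpack_from` call on a memory-mapped buffer before deciding whether the payload needs to be parsed at all.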
+ +Reading algorithm +----------------- + +```python +if header != expected_header: + raise InvalidHeader() +if data_checksum == last_data_checksum: + raise NoUpdate() +if timestamp == last_timestamp: + raise NoUpdate() +if data_checksum != crc32(concat_header_end_payload): + raise InvalidChecksum() + +metrics = parse_openmetrics(payload_data) + +for family in metrics: + if family_exists(family): + update_family(family) + else: + create_family(family) + +track_removed_families(metrics) +``` \ No newline at end of file diff --git a/doc/content/design/pool-wide-ssh.md b/doc/content/design/pool-wide-ssh.md new file mode 100644 index 00000000000..a903c545fab --- /dev/null +++ b/doc/content/design/pool-wide-ssh.md @@ -0,0 +1,80 @@ +--- +title: Pool-wide SSH +layout: default +design_doc: true +revision: 1 +status: proposed +--- + +## Background + +The SMAPIv3 plugin architecture requires that storage plugins are able to work +in the absence of xapi. Amongst other benefits, this allows them to be tested +in isolation, to be shared more widely than just within the XenServer +community, and to cause less load on xapi's database. + +However, many of the currently existing SMAPIv1 backends require inter-host +operations to be performed. This is achieved via the use of the Xen-API call +'host.call_plugin', which allows an API user to execute a pre-installed plugin +on any pool member. This is important for operations such as coalesce / snapshot +where the active data path for a VM somewhere in the pool needs to be refreshed +in order to complete the operation. In order to use this, the RPM in which the +SM backend lives is used to deliver a plugin script into /etc/xapi.d/plugins, +and this executes the required function when the API call is made. + +In order to support these use-cases without xapi running, a new mechanism needs +to be provided to allow the execution of required functionality on remote hosts.
+The canonical method for remotely executing scripts is ssh - the secure shell. +This design proposal is setting out how xapi might manage the public and +private keys to enable passwordless authentication of ssh sessions between all +hosts in a pool. + +## Modifications to the host + +On firstboot (and after being ejected), the host should generate a +host key (already done I believe), and an authentication key for the +user (root/xapi?). + +## Modifications to xapi + +Three new fields will be added to the host object: + +- ```host.ssh_public_host_key : string```: This is the host key that identifies the host +during the initial ssh key exchange protocol. This should be added to the +'known_hosts' field of any other host wishing to ssh to this host. + +- ```host.ssh_public_authentication_key : string```: This field is the public + key used for authentication when sshing from the root account on that host - + host A. This can be added to host B's ```authorized_keys``` file in order to + allow passwordless logins from host A to host B. + +- ```host.ssh_ready : bool```: A boolean flag indicating that the configuration + files in use by the ssh server/client on the host are up to date. + +One new field will be added to the pool record: + +- ```pool.revoked_authentication_keys : string list```: This field records all +authentication keys that have been used by hosts in the past. It is updated +when a host is ejected from the pool. + +### Pool Join + +On pool join, the master creates the record for the new host and populates the +two public key fields with values supplied by the joining host. It then sets +the ```ssh_ready``` field on all other hosts to ```false```. + +On each host in the pool, a thread is watching for updates to the +```ssh_ready``` value for the local host. When this is set to false, the host +then adds the keys from xapi's database to the appropriate places in the ssh +configuration files and restarts sshd. 
Once this is done, the host sets the +```ssh_ready``` field to 'true'. + +### Pool Eject + +On pool eject, the host's ssh_public_host_key is lost, but the authentication key is added to a list of revoked keys on the pool object. This allows all other hosts to remove the key from the authorized_keys list when they next sync, which in the usual case is immediately after the database is modified, due to the event watch thread. If the host is offline though, the authorized_keys file will be updated the next time the host comes online. + + +## Questions + +- Do we want a new user? e.g. 'xapi' - how would we then use this user to execute privileged things? setuid binaries? +- Is keeping the revoked_keys list useful? If we 'control the world' of the authorized_keys file, we could just remove anything that's currently in there that xapi doesn't know about diff --git a/doc/content/design/schedule-snapshot.md b/doc/content/design/schedule-snapshot.md new file mode 100644 index 00000000000..75c8348d91f --- /dev/null +++ b/doc/content/design/schedule-snapshot.md @@ -0,0 +1,89 @@ +--- +title: Schedule Snapshot Design +layout: default +design_doc: true +design_review: 186 +revision: 2 +status: proposed +revision_history: +- revision_number: 1 + description: Initial version +- revision_number: 2 + description: Renaming VMSS fields and APIs. API message_create supersedes vmss_create_alerts. +- revision_number: 3 + description: Remove VMSS alarm_config details and use existing pool wide alarm config +- revision_number: 4 + description: Renaming field from retention-value to retained-snapshots and schedule-snapshot to scheduled-snapshot +- revision_number: 5 + description: Add new API task_set_status +--- + +The scheduled snapshot feature will utilize the existing architecture of VMPR. In terms of functionality, scheduled snapshot is basically VMPR without its archiving capability. + +Introduction +------------ + +* Schedule snapshot will be a new object in xapi as VMSS.
+* A pool can have multiple VMSS. +* Multiple VMs can be a part of VMSS but a VM cannot be a part of multiple VMSS. +* A VMSS takes VMs snapshot with type [`snapshot`, `checkpoint`, `snapshot_with_quiesce`]. +* VMSS takes snapshot of VMs on configured intervals: + * `hourly` -> On everyday, Each hour, Mins [0;15;30;45] + * `daily` -> On everyday, Hour [0 to 23], Mins [0;15;30;45] + * `weekly` -> Days [`Monday`,`Tuesday`,`Wednesday`,`Thursday`,`Friday`,`Saturday`,`Sunday`], Hour[0 to 23], Mins [0;15;30;45] +* VMSS will have a limit on retaining number of VM snapshots in range [1 to 10]. + +Datapath Design +--------------- + +* There will be a cron job for VMSS. +* VMSS plugin will go through all the scheduled snapshot policies in the pool and check if any of them are due. +* If a snapshot is due then : Go through all the VM objects in XAPI associated with this scheduled snapshot policy and create a new snapshot. +* If the snapshot operation fails, create a notification alert for the event and move to the next VM. +* Check if an older snapshot now needs to be deleted to comply with the retained snapshots defined in the scheduled policy. +* If we need to delete any existing snapshots, delete the oldest snapshot created via scheduled policy. +* Set the last-run timestamp in the scheduled policy. + +Xapi Changes +------------ + +There is a new record for VM Scheduled Snapshot with new fields. + +New fields: + +* `name-label` type `String` : Name label for VMSS. +* `name-description` type `String` : Name description for VMSS. +* `enabled` type `Bool` : Enable/Disable VMSS to take snapshot. +* `type` type `Enum` [`snapshot`; `checkpoint`; `snapshot_with_quiesce`] : Type of snapshot VMSS takes. +* `retained-snapshots` type `Int64` : Number of snapshots limit for a VM, max limit is 10 and default is 7. +* `frequency` type `Enum` [`hourly`; `daily`; `weekly`] : Frequency of taking snapshot of VMs. 
+* `schedule` type `Map(String,String)` with (key, value) pair: + * hour : 0 to 23 + * min : [0;15;30;45] + * days : [`Monday`,`Tuesday`,`Wednesday`,`Thursday`,`Friday`,`Saturday`,`Sunday`] +* `last-run-time` type Date : DateTime of last execution of VMSS. +* `VMs` type VM refs : List of VMs part of VMSS. + +New fields to VM record: + +* `scheduled-snapshot` type VMSS ref : VM part of VMSS. +* `is-vmss-snapshot` type Bool : If snapshot created from VMSS. + +New APIs +-------- + +* vmss_snapshot_now (Ref vmss, Pool_Operator) -> String : This call executes the scheduled snapshot immediately. +* vmss_set_retained_snapshots (Ref vmss, Int value, Pool_Operator) -> unit : Set the value of vmss retained snapshots, max is 10. +* vmss_set_frequency (Ref vmss, String "value", Pool_Operator) -> unit : Set the value of the vmss frequency field. +* vmss_set_type (Ref vmss, String "value", Pool_Operator) -> unit : Set the snapshot type of the vmss type field. +* vmss_set_scheduled (Ref vmss, Map(String,String) "value", Pool_Operator) -> unit : Set the vmss scheduled to take snapshot. +* vmss_add_to_schedule (Ref vmss, String "key", String "value", Pool_Operator) -> unit : Add key value pair to VMSS schedule. +* vmss_remove_from_schedule (Ref vmss, String "key", Pool_Operator) -> unit : Remove key from VMSS schedule. +* vmss_set_last_run_time (Ref vmss, DateTime "value", Local_Root) -> unit : Set the last run time for VMSS. +* task_set_status (Ref task, status_type "value", READ_ONLY) -> unit : Set the status of task owned by same user, Pool_Operator can set status for any tasks. + +New CLIs +-------- + +* vmss-create (required : "name-label";"type";"frequency", optional : "name-description";"enabled";"schedule:";"retained-snapshots") -> unit : Creates VM scheduled snapshot. +* vmss-destroy (required : uuid) -> unit : Destroys a VM scheduled snapshot.
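
The retention step from the Datapath Design section (delete the oldest scheduled snapshots once a VM exceeds `retained-snapshots`) could be sketched as follows. This is only an illustration: the function name `snapshots_to_delete` and the tuple layout are hypothetical, and it assumes that snapshots not created by VMSS (those without `is-vmss-snapshot` set) are never pruned.

```python
def snapshots_to_delete(snapshots, retained):
    """Pick the scheduled snapshots of one VM that exceed the retention limit.

    snapshots -- iterable of (snapshot_time, ref, is_vmss_snapshot) tuples
    retained  -- value of the policy's retained-snapshots field (1 to 10)
    """
    if not 1 <= retained <= 10:
        raise ValueError("retained-snapshots must be in the range 1 to 10")

    # Only snapshots created by the scheduled policy are candidates.
    scheduled = sorted(
        (s for s in snapshots if s[2]),
        key=lambda s: s[0],  # oldest first
    )
    excess = max(len(scheduled) - retained, 0)
    return [ref for _, ref, _ in scheduled[:excess]]
```

For instance, with three scheduled snapshots and `retained` set to 2, only the oldest scheduled snapshot is returned for deletion; manual snapshots are left untouched.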
diff --git a/doc/content/design/smapiv3/index.md b/doc/content/design/smapiv3/index.md new file mode 100644 index 00000000000..219e0c1bcea --- /dev/null +++ b/doc/content/design/smapiv3/index.md @@ -0,0 +1,95 @@ +--- +title: SMAPIv3 +layout: default +design_doc: true +revision: 1 +status: released (7.6) +--- + +Xapi accesses storage through "plugins" which currently use a protocol +called "SMAPIv1". This protocol has a number of problems: + +1. the protocol has many missing features, and this leads to people + using the XenAPI from within a plugin, which is racy, difficult to + get right, unscalable and makes component testing impossible. + +2. the protocol expects plugin authors to have a deep knowledge of the + Xen storage datapath (`tapdisk`, `blkback` etc) *and* the storage. + +3. the protocol is undocumented. + +We shall create a new revision of the protocol ("SMAPIv3") to address these +problems. + +The following diagram shows the new control plane: + +![Storage control plane](smapiv3.png) + +Requests from xapi are filtered through the existing `storage_access` +layer which is responsible for managing the mapping between VM VBDs and +VDIs. + +Each plugin is represented by a named queue, with APIs for + +- querying the state of each queue +- explicitly cancelling or replying to messages + +Legacy SMAPIv1 plugins will be processed via the existing `storage_access.SMAPIv1` +module. Newer SMAPIv3 plugins will be handled by a new `xapi-storage-script` +service. + +The SMAPIv3 APIs will be defined in an IDL format in a separate repo. 
+ +xapi-storage-script +=================== + +The `xapi-storage-script` will run as a service and will + +- use `inotify` to monitor a well-known path in dom0 +- when a directory is created, check whether it contains storage plugins by + executing a `Plugin.query` +- assuming the directory contains plugins, it will register the queue name + and start listening for messages +- when messages from `xapi` or the CLI are received, it will generate the SMAPIv3 + .json message and fork the relevant script. + +SMAPIv3 IDL +=========== + +The IDL will support + +- documentation for all functions, parameters and results + - this will be extended to be a XenAPI-style versioning scheme in future +- generating hyperlinked HTML documentation, published on github +- generating libraries for python and OCaml + - the libraries will include marshalling, unmarshalling, type-checking + and command-line parsing and help generation + +Diagnostic tools +================ + +It will be possible to view the contents of the queue associated with any +plugin, and see whether + +- the queue is being served or not (perhaps the `xapi-storage-script` has + crashed) +- there are unanswered messages (perhaps one of the messages has caused + a deadlock in the implementation?) + +It will be possible to + +- delete/clear queues/messages +- download a message-sequence chart of the last N messages for inclusion in + bugtools. + +Anatomy of a plugin +=================== + +The following diagram shows what a plugin would look like: + +![Anatomy of a plugin](plugin.png) + +The SMAPIv3 +=========== + +Please read [the current SMAPIv3 documentation](https://xapi-project.github.io/xapi-storage). 
diff --git a/doc/content/design/smapiv3/plugin.graffle b/doc/content/design/smapiv3/plugin.graffle new file mode 100644 index 00000000000..4887d51f86a Binary files /dev/null and b/doc/content/design/smapiv3/plugin.graffle differ diff --git a/doc/content/design/smapiv3/plugin.png b/doc/content/design/smapiv3/plugin.png new file mode 100644 index 00000000000..6fe3857e6f0 Binary files /dev/null and b/doc/content/design/smapiv3/plugin.png differ diff --git a/doc/content/design/smapiv3/smapiv3.graffle b/doc/content/design/smapiv3/smapiv3.graffle new file mode 100644 index 00000000000..ef59a96d525 Binary files /dev/null and b/doc/content/design/smapiv3/smapiv3.graffle differ diff --git a/doc/content/design/smapiv3/smapiv3.png b/doc/content/design/smapiv3/smapiv3.png new file mode 100644 index 00000000000..85c0b35af9a Binary files /dev/null and b/doc/content/design/smapiv3/smapiv3.png differ diff --git a/doc/content/design/snapshot-revert.md b/doc/content/design/snapshot-revert.md new file mode 100644 index 00000000000..4618e1ee9ce --- /dev/null +++ b/doc/content/design/snapshot-revert.md @@ -0,0 +1,103 @@ +--- +title: Improving snapshot revert behaviour +layout: default +design_doc: true +revision: 1 +status: confirmed +--- + +Currently there is a XenAPI `VM.revert` which reverts a "VM" to the state it +was in when a VM-level snapshot was taken. There is no `VDI.revert` so +`VM.revert` uses `VDI.clone` to change the state of the disks. + +The use of `VDI.clone` has the side-effect of changing VDI refs and uuids. +This causes the following problems: + +- It is difficult for clients + such as [Apache CloudStack](http://cloudstack.apache.org) to keep track + of the disks it is actively managing +- VDI snapshot metadata (`VDI.snapshot_of` et al) has to be carefully + fixed up since all the old refs are now dangling + +We will fix these problems by: + +1. adding a `VDI.revert` to the SMAPIv2 and calling this from `VM.revert` +2. 
defining a new SMAPIv1 operation `vdi_revert` and a corresponding capability + `VDI_REVERT` +3. the Xapi implementation of `VDI.revert` will first try the `vdi_revert`, + and fall back to `VDI.clone` if that fails +4. implement `vdi_revert` for common storage types, including File and LVM-based + SRs. + +XenAPI changes +-------------- + +We will add the function `VDI.revert` with arguments: + +- in: `snapshot: Ref(VDI)`: the snapshot to which we want to revert +- in: `driver_params: Map(String,String)`: optional extra parameters +- out: `Ref(VDI)` the new VDI + +The function will look up the VDI which this is a `snapshot_of`, and change +the VDI to have the same contents as the snapshot. The snapshot will not be +modified. If the implementation is able to revert in-place, then the reference +returned will be the VDI this is a `snapshot_of`; otherwise it is a reference +to a fresh VDI (created by the `VDI.clone` fallback path). + +References: + +- @johnelse's [pull request](https://github.com/xapi-project/xen-api/pull/1963) + which implements this + +SMAPIv1 changes +--------------- + +We will define the function `vdi_revert` with arguments: + +- in: `sr_uuid`: the UUID of the SR containing both the VDI and the snapshot +- in: `vdi_uuid`: the UUID of the snapshot whose contents should be duplicated +- in: `target_uuid`: the UUID of the target whose contents should be replaced + +The function will replace the contents of the `target_uuid` VDI with the +contents of the `vdi_uuid` VDI without changing the identity of the target +(i.e. name-label, uuid and location are guaranteed to remain the same). +The `vdi_uuid` is preserved by this operation. The operation is obviously +idempotent.
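
The try-then-fall-back behaviour of `VDI.revert` can be sketched as below. This is not xapi code: `NotImplementedBySR`, `revert_vdi` and the two callables passed in are hypothetical stand-ins for the SMAPIv1 `vdi_revert` call and the existing `VDI.clone` path, and the sketch assumes the backend signals lack of support by raising an exception.

```python
class NotImplementedBySR(Exception):
    """Hypothetical signal that the SR backend has no vdi_revert."""


def revert_vdi(sr_uuid, snapshot_uuid, target_uuid, sm_vdi_revert, vdi_clone):
    """Revert target_uuid to the contents of snapshot_uuid.

    Returns the UUID of the VDI now holding the reverted contents: the
    original target if the backend reverted in place, otherwise the fresh
    VDI produced by the clone fallback.
    """
    try:
        # In-place revert keeps the target's identity (uuid, name-label, location).
        sm_vdi_revert(sr_uuid=sr_uuid, vdi_uuid=snapshot_uuid, target_uuid=target_uuid)
        return target_uuid
    except NotImplementedBySR:
        # Fallback: clone the snapshot; callers must then switch to the new VDI.
        return vdi_clone(snapshot_uuid)
```

The return value mirrors the `out: Ref(VDI)` result of `VDI.revert`: clients only see a changed reference when the fallback path was taken.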
+ +Xapi changes +------------ + +Xapi will + +- use `VDI.revert` in the `VM.revert` code-path +- expose a new `xe vdi-revert` CLI command +- implement the `VDI.revert` by calling the SMAPIv1 function and falling back + to `VDI.clone` if a `Not_implemented` exception is thrown + +References: + +- @johnelse's [pull request](https://github.com/xapi-project/xen-api/pull/1963) + +SM changes +---------- + +We will modify + +- SRCommand.py and VDI.py to add a new `vdi_revert` function which throws + a 'not implemented' exception +- FileSR.py to implement `VDI.revert` using a variant of the existing + snapshot/clone machinery +- EXTSR.py and NFSSR.py to advertise the `VDI_REVERT` capability +- LVHDSR.py to implement `VDI.revert` using a variant of the existing + snapshot/clone machinery +- LVHDoISCSISR.py and LVHDoHBASR.py to advertise the `VDI_REVERT` capability + +Prototype code +============== + +Prototype code exists here: + +- [xapi-project/xcp-idl#37](https://github.com/xapi-project/xcp-idl/pull/37) by @johnelse +- [xapi-project/xen-api#2058](https://github.com/xapi-project/xen-api/pull/2058) mainly by @johnelse but with 2 extra patches from me +- [Definition of SMAPIv1 vdi_revert](https://github.com/djs55/sm/commit/cbc28755c9c4300ed067abc089081f58f821f504) +- [Hacky implementation for EXT/NFS](https://github.com/djs55/sm/commit/eb31d6205ccd707152a5b59c9a733fd48db5316b) diff --git a/doc/content/design/sr-level-rrds.md b/doc/content/design/sr-level-rrds.md new file mode 100644 index 00000000000..6822b7c64cf --- /dev/null +++ b/doc/content/design/sr-level-rrds.md @@ -0,0 +1,147 @@ +--- +title: SR-Level RRDs +layout: default +design_doc: true +revision: 11 +status: confirmed +design_review: 139 +revision_history: +- revision_number: 1 + description: Initial version +- revision_number: 2 + description: Added details about the VDI's binary format and size, and the SR capability name. +- revision_number: 3 + description: Tar was not needed after all! 
+- revision_number: 4 + description: Add details about discovering the VDI using a new vdi_type. +- revision_number: 5 + description: Add details about the http handlers and interaction with xapi's database +- revision_number: 6 + description: Add details about the framing of the data within the VDI +- revision_number: 7 + description: Redesign semantics of the rrd_updates handler +- revision_number: 8 + description: Redesign semantics of the rrd_updates handler (again) +- revision_number: 9 + description: Magic number change in framing format of vdi +- revision_number: 10 + description: Add details of new APIs added to xapi and xcp-rrdd +- revision_number: 11 + description: Remove unneeded API calls + +--- + +## Introduction + +Xapi has RRDs to track VM- and host-level metrics. There is a desire to have SR-level RRDs as a new category, because SR stats are not specific to a certain VM or host. Examples are size and free space on the SR. While recording SR metrics is relatively straightforward within the current RRD system, the main question is where to archive them, which is what this design aims to address. + +## Stats Collection + +All SR types, including the existing ones, should be able to have RRDs defined for them. Some RRDs, such as a "free space" one, may make sense for multiple (if not all) SR types. However, the way to measure something like free space will be SR specific. Furthermore, it should be possible for each type of SR to have its own specialised RRDs. + +It follows that each SR will need its own `xcp-rrdd` plugin, which runs on the SR master and defines and collects the stats. For the new thin-lvhd SR this could be `xenvmd` itself. The plugin registers itself with `xcp-rrdd`, so that the latter records the live stats from the plugin into RRDs. + +## Archiving + +SR-level RRDs will be archived in the SR itself, in a VDI, rather than in the local filesystem of the SR master. This way, we don't need to worry about master failover. 
+ +The VDI will be 4MB in size. This is a little more space than we would need for the RRDs we have in mind at the moment, but will give us enough headroom for the foreseeable future. It will not have a filesystem on it for simplicity and performance. There will only be one RRD archive file for each SR (possibly containing data for multiple metrics), which is gzipped by `xcp-rrdd`, and can be copied onto the VDI. + +There will be a simple framing format for the data on the VDI. This will be as follows: + +Offset | Type | Name | Comment +-------|--------------------------|---------|-------------------------- +0 | 32 bit network-order int | magic | Magic number = 0x7ada7ada +4 | 32 bit network-order int | version | 1 +8 | 32 bit network-order int | length | length of payload +12 | gzipped data | data | + +Xapi will be in charge of the lifecycle of this VDI, not the plugin or `xcp-rrdd`, which will make it a little easier to manage them. Only xapi will attach/detach and read from/write to this VDI. We will keep `xcp-rrdd` as simple as possible, and have it archive to its standard path in the local file system. Xapi will then copy the RRDs in and out of the VDI. + +A new value `"rrd"` in the `vdi_type` enum of the datamodel will be defined, and the `VDI.type` of the VDI will be set to that value. The storage backend will write the VDI type to the LVM metadata of the VDI, so that xapi can discover the VDI containing the SR-level RRDs when attaching an SR to a new pool. This means that SR-level RRDs are currently restricted to LVM SRs. + +Because we will not write plugins for all SRs at once, and therefore do not need xapi to set up the VDI for all SRs, we will add an SR "capability" for the backends to be able to tell xapi whether it has the ability to record stats and will need storage for them. The capability name will be: `SR_STATS`. + +## Management of the SR-stats VDI + +The SR-stats VDI will be attached/detached on `PBD.plug`/`unplug` on the SR master. 
+ +* On `PBD.plug` on the SR master, if the SR has the stats capability, xapi: + * Creates a stats VDI if not already there (search for an existing one based on the VDI type). + * Attaches the stats VDI if it did already exist, and copies the RRDs to the local file system (standard location in the filesystem; asks `xcp-rrdd` where to put them). + * Informs `xcp-rrdd` about the RRDs so that it will load the RRDs and add newly recorded data to them (needs a function like `push_rrd_local` for VM-level RRDs). + * Detaches stats VDI. + +* On `PBD.unplug` on the SR master, if the SR has the stats capability xapi: + * Tells `xcp-rrdd` to archive the RRDs for the SR, which it will do to the local filesystem. + * Attaches the stats VDI, copies the RRDs into it, detaches VDI. + +## Periodic Archiving + +Xapi's periodic scheduler regularly triggers `xcp-rrdd` to archive the host and VM RRDs. It will need to do this for the SR ones as well. Furthermore, xapi will need to attach the stats VDI and copy the RRD archives into it (as on `PBD.unplug`). + +## Exporting + +There will be a new handler for downloading an SR RRD: + + http:///sr_rrd?session_id=&uuid= + +RRD updates for the host, VMs and SRs are handled by +a single handler at `/rrd_updates`. +Exactly what is returned will be determined by the parameters +passed to this handler. + +Whether the host RRD updates are returned is governed by the presence of +`host=true` in the parameters. `host=` or the absence of the +`host` key will mean the host RRD is not returned. + +Whether the VM RRD updates are returned is governed by the `vm_uuid` key in the +URL parameters. `vm_uuid=all` will return RRD updates for all VM RRDs. +`vm_uuid=xxx` will return the RRD updates for the VM with uuid `xxx` only. +If `vm_uuid` is `none` (or any other string which is not a valid VM UUID) then +the handler will return no VM RRD updates.
If the `vm_uuid` key is absent, RRD +updates for all VMs will be returned. + +Whether the SR RRD updates are returned is governed by the `sr_uuid` key in the +URL parameters. `sr_uuid=all` will return RRD updates for all SR RRDs. +`sr_uuid=xxx` will return the RRD updates for the SR with uuid `xxx` only. +If `sr_uuid` is `none` (or any other string which is not a valid SR UUID) then +the handler will return no SR RRD updates. If the `sr_uuid` key is absent, no +SR RRD updates will be returned. + +It will be possible to mix and match these parameters; for example to return +RRD updates for the host and all VMs, the URL to use would be: + + http:///rrd_updates?session_id=&start=10258122541&host=true&vm_uuid=all&sr_uuid=none + +Or, to return RRD updates for all SRs but nothing else, the URL to use would be: + + http:///rrd_updates?session_id=&start=10258122541&host=false&vm_uuid=none&sr_uuid=all + +While behaviour is defined if any of the keys `host`, `vm_uuid` and `sr_uuid` is +missing, this is for backwards compatibility and it is recommended that clients +specify each parameter explicitly. + +## Database updating. + +If the SR is presenting a data source called 'physical_utilisation', +xapi will record this periodically in its database. In order to do +this, xapi will fork a thread that, every n minutes (2 suggested, but +open to suggestions here), will query the attached SRs, then query +RRDD for the latest data source for these, and update the database. + +The utilisation of VDIs will _not_ be updated in this way until +scalability worries for RRDs are addressed. + +Xapi will cache whether it is SR master for every attached SR and only +attempt to update if it is the SR master. + +## New APIs. 
+ +#### xcp-rrdd: + +* Get the filesystem location where sr rrds are archived: `val sr_rrds_path : uid:string -> string` + +* Archive the sr rrds to the filesystem: `val archive_sr_rrd : sr_uuid:string -> unit` + +* Load the sr rrds from the filesystem: `val push_sr_rrd : sr_uuid:string -> unit` diff --git a/doc/content/design/thin-lvhd/allocation-plane.graffle b/doc/content/design/thin-lvhd/allocation-plane.graffle new file mode 100644 index 00000000000..4d1b7465bdf Binary files /dev/null and b/doc/content/design/thin-lvhd/allocation-plane.graffle differ diff --git a/doc/content/design/thin-lvhd/allocation-plane.png b/doc/content/design/thin-lvhd/allocation-plane.png new file mode 100644 index 00000000000..ad2bd55855f Binary files /dev/null and b/doc/content/design/thin-lvhd/allocation-plane.png differ diff --git a/doc/content/design/thin-lvhd/control-plane.graffle b/doc/content/design/thin-lvhd/control-plane.graffle new file mode 100644 index 00000000000..b64b79a2ff1 Binary files /dev/null and b/doc/content/design/thin-lvhd/control-plane.graffle differ diff --git a/doc/content/design/thin-lvhd/control-plane.png b/doc/content/design/thin-lvhd/control-plane.png new file mode 100644 index 00000000000..289b3a32441 Binary files /dev/null and b/doc/content/design/thin-lvhd/control-plane.png differ diff --git a/doc/content/design/thin-lvhd/index.md b/doc/content/design/thin-lvhd/index.md new file mode 100644 index 00000000000..1bf209ca266 --- /dev/null +++ b/doc/content/design/thin-lvhd/index.md @@ -0,0 +1,878 @@ +--- +title: thin LVHD storage +layout: default +design_doc: true +revision: 3 +status: proposed +--- + +LVHD is a block-based storage system built on top of Xapi and LVM. LVHD +disks are represented as LVM LVs with vhd-format data inside. When a +disk is snapshotted, the LVM LV is "deflated" to the minimum-possible +size, just big enough to store the current vhd data. All other disks are +stored "inflated" i.e. consuming the maximum amount of storage space. 
+This proposal describes how we could add dynamic thin-provisioning to +LVHD such that + +- disks only consume the space they need (plus an adjustable small + overhead) +- when a disk needs more space, the allocation can be done *locally* + in the common-case; in particular there is no network RPC needed +- when the resource pool master host has failed, allocations can still + continue, up to some limit, allowing time for the master host to be + recovered; in particular there is no need for very low HA timeouts. +- we can (in future) support in-kernel block allocation through the + device mapper dm-thin target. + +The following diagram shows the "Allocation plane": + +![Allocation plane](allocation-plane.png) + +All VM disk writes are channelled through `tapdisk` which keeps track +of the remaining reserved space within the device mapper device. When +the free space drops below a "low-water mark", tapdisk sends a message +to a local per-SR daemon called `local-allocator` and requests more +space. + +The `local-allocator` maintains a free pool of blocks available for +allocation locally (hence the name). It will pick some blocks and +transactionally send the update to the `xenvmd` process running +on the SRmaster via the shared ring (labelled `ToLVM queue` in the diagram) +and update the device mapper tables locally. + +There is one `xenvmd` process per SR on the SRmaster. `xenvmd` receives +local allocations from all the host shared rings (labelled `ToLVM queue` +in the diagram) and combines them together, appending them to a redo-log +also on shared storage. When `xenvmd` notices that a host's free space +(represented in the metadata as another LV) is low it allocates new free blocks +and pushes these to the host via another shared ring (labelled `FromLVM queue` +in the diagram). + +The `xenvmd` process maintains a cache of the current VG metadata for +fast query and update. All updates are appended to the redo-log to ensure +they operate in O(1) time. 
The redo log updates are periodically flushed +to the primary LVM metadata. + +Since the operations are stored in the redo-log and will only be removed +after the real metadata has been written, the implication is that it is +possible for the operations to be performed more than once. This will +occur if the xenvmd process exits between flushing to the real metadata +and acknowledging the operations as completed. For this to work as expected, +every individual operation stored in the redo-log _must_ be idempotent. + +Note on running out of blocks +----------------------------- + +Note that, while the host has plenty of free blocks, local allocations should +be fast. If the master fails and the local free pool starts running out +and `tapdisk` asks for more blocks, then the local allocator won't be able +to provide them. +`tapdisk` should start to slow +I/O in order to provide the local allocator more time. +Eventually if ```tapdisk``` runs +out of space before the local allocator can satisfy the request then +guest I/O will block. Note Windows VMs will start to crash if guest +I/O blocks for more than 70s. Linux VMs, whether PV or HVM, may suffer +from the "block for more than 120 seconds" issue due to slow I/O. This is a +known issue: slow I/O during dirty-page writeback/flush may +cause memory starvation, which then blocks other userland processes or kernel +threads. + +The following diagram shows the control-plane: + +![control plane](control-plane.png) + +When thin-provisioning is enabled we will be modifying the LVM metadata at +an increased rate. We will cache the current metadata in the `xenvmd` process +and funnel all queries through it, rather than "peeking" at the metadata +on-disk. Note it will still be possible to peek at the on-disk metadata but it +will be out-of-date. Peeking can still be used to query the PV state of the volume +group.
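The interplay between the cached metadata, the redo-log and the lazily updated on-disk metadata can be illustrated with a small Python sketch. This is only a toy model under stated assumptions: the `MetadataCache` class, its fields and the `crash_before_ack` flag are hypothetical names invented for illustration, and a "segment" is reduced to a (logical extent, physical extent) pair. It is not the `xenvmd` implementation.

```python
class MetadataCache:
    """Toy model of cached metadata backed by a redo-log (hypothetical names)."""

    def __init__(self):
        self.on_disk = {}    # the "real" metadata, only updated by flush()
        self.redo_log = []   # every update is appended here first
        self.cached = {}     # the up-to-date view that queries are served from

    @staticmethod
    def _apply(metadata, lv, segment):
        # "Place physical extent x at logical extent y": applying this twice
        # leaves the same result, so the operation is idempotent.
        logical, physical = segment
        metadata.setdefault(lv, {})[logical] = physical

    def update(self, lv, segment):
        # Appending to the log is a constant-time operation.
        self.redo_log.append((lv, segment))
        self._apply(self.cached, lv, segment)

    def flush(self, crash_before_ack=False):
        # Replay the whole log into the on-disk metadata.
        for lv, segment in self.redo_log:
            self._apply(self.on_disk, lv, segment)
        if crash_before_ack:
            # Simulate exiting before the operations are acknowledged:
            # the log is kept and will be replayed by the next flush.
            return
        self.redo_log.clear()
```

In this model, "peeking" at `on_disk` before a flush returns stale data, and a flush that is interrupted before the log is truncated is simply repeated, which is harmless only because each logged operation is idempotent.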
+ +The `xenvm` CLI uses a simple +RPC interface to query the `xenvmd` process, tunnelled through `xapi` over +the management network. The RPC interface can be used for + +- activating volumes locally: `xenvm` will query the LV segments and program + device mapper +- deactivating volumes locally +- listing LVs, PVs etc + +Note that current LVHD requires the management network for these control-plane +functions. + +When the SM backend wishes to query or update volume group metadata it should use the +`xenvm` CLI while thin-provisioning is enabled. + +The `xenvmd` process shall use a redo-log to ensure that metadata updates are +persisted in constant time and flushed lazily to the regular metadata area. + +Tunnelling through xapi will be done by POSTing to the localhost URI + + /services/xenvmd/ + +Xapi will then either proxy the request transparently to the SRmaster, or issue an +HTTP-level redirect that the xenvm CLI would need to follow. + +If the xenvmd process is not running on the host on which it should +be, xapi will start it.
+ + +Components: roles and responsibilities +====================================== + +`xenvmd`: + +- one per plugged SRmaster PBD +- owns the LVM metadata +- provides a fast query/update API so we can (for example) create lots of LVs very fast +- allocates free blocks to hosts when they are running low +- receives block allocations from hosts and incorporates them in the LVM metadata +- can safely flush all updates and downgrade to regular LVM + +`xenvm`: + +- a CLI which talks the `xenvmd` protocol to query / update LVs +- can be run on any host, calls (except "format" and "upgrade") are forwarded by `xapi` +- can "format" a LUN to prepare it for `xenvmd` +- can "upgrade" a LUN to prepare it for `xenvmd` + +`local_allocator`: + +- one per plugged PBD +- exposes a simple interface to `tapdisk` for requesting more space +- receives free block allocations via a queue on the shared disk from `xenvmd` +- sends block allocations to `xenvmd` and updates the device mapper target locally + +`tapdisk`: + +- monitors the free space inside LVs and requests more space when running out +- slows down I/O when nearly out of space + +`xapi`: + +- provides authenticated communication tunnels +- ensures the xenvmd daemons are only running on the correct hosts. + +`SM`: + +- writes the configuration file for xenvmd (though doesn't start it) +- has an on/off switch for thin-provisioning +- can use either normal LVM or the `xenvm` CLI + +`membership_monitor` + +- configures and manages the connections between `xenvmd` and the `local_allocator` + +Queues on the shared disk +========================= + +The `local_allocator` communicates with `xenvmd` via a pair +of queues on the shared disk. Using the disk rather than the network means +that VMs will continue to run even if the management network is not working. 
+In particular + +- if the (management) network fails, VMs continue to run on SAN storage +- if a host changes IP address, nothing needs to be reconfigured +- if xapi fails, VMs continue to run. + +Logical messages in the queues +------------------------------ + +The `local_allocator` needs to tell the `xenvmd` which blocks have +been allocated to which guest LV. `xenvmd` needs to tell the +`local_allocator` which blocks have become free. Since we are based on +LVM, a "block" is an extent, and an "allocation" is a segment i.e. the +placing of a physical extent at a logical extent in the logical volume. + +The `local_allocator` needs to send a message with logical contents: + +- `volume`: a human-readable name of the LV +- `segments`: a list of LVM segments which says + "place physical extent x at logical extent y using a linear mapping". + +Note this message is idempotent. + +The `xenvmd` needs to send a message with logical contents: + +- `extents`: a list of physical extents which are free for the host to use + +Although +for internal housekeeping `xenvmd` will want to assign these +physical extents to logical extents within the host's free LV, the +`local_allocator` +doesn't need to know the logical extents. It only needs to know +the set of blocks which it is free to allocate. + +Starting up the local_allocator +------------------------------- + +What happens when a `local_allocator` (re)starts, after a + +- process crash, respawn +- host crash, reboot? + +When the `local_allocator` starts up, there are 2 cases: + +1. the host has just rebooted, there are no attached disks and no running VMs +2. the process has just crashed, there are attached disks and running VMs + +Case 1 is uninteresting. In case 2 there may have been an allocation in +progress when the process crashed and this must be completed. Therefore +the operation is journalled in a local filesystem in a directory which +is deliberately deleted on host reboot (Case 1). 
The allocation operation +consists of: + +1. `push`ing the allocation to `xenvmd` on the SRmaster +2. updating the device mapper + +Note that both parts of the allocation operation are idempotent and hence +the whole operation is idempotent. The journalling will guarantee it executes +at-least-once. + +When the `local_allocator` starts up it needs to discover the list of +free blocks. Rather than have 2 code paths, it's best to treat everything +as if it is a cold start (i.e. no local caches already populated) and to +ask the master to resync the free block list. The resync is performed by +executing a "suspend" and "resume" of the free block queue, and requiring +the remote allocator to: + +- `pop` all block allocations and incorporate these updates +- send the complete set of free blocks "now" (i.e. while the queue is + suspended) to the local allocator. + +Starting xenvmd +--------------- + +`xenvmd` needs to know + +- the device containing the volume group +- the hosts to "connect" to via the shared queues + +The device containing the volume group should be written to a config +file when the SR is plugged. + +`xenvmd` does not remember which hosts it is listening to across crashes, +restarts or master failovers. The `membership_monitor` will keep the +`xenvmd` list in sync with the `PBD.currently_attached` fields. + +Shutting down the local_allocator +--------------------------------- + +The `local_allocator` should be able to crash at any time and recover +afterwards. 
If the user requests a `PBD.unplug` we can perform a +clean shutdown by: + +- signalling `xenvmd` to suspend the block allocation queue +- arranging for the `local_allocator` to acknowledge the suspension and exit +- when the `xenvmd` sees the acknowledgement, we know that the + `local_allocator` is offline and it doesn't need to poll the queue any more + +Downgrading metadata +-------------------- + +`xenvmd` can be terminated at any time and restarted, since all compound +operations are journalled. + +Downgrade is a special case of shutdown. +To downgrade, we need to stop all hosts allocating and ensure all updates +are flushed to the global LVM metadata. `xenvmd` can shut down +by: + +- shutting down all `local_allocator`s (see previous section) +- flushing all outstanding block allocations to the LVM redo log +- flushing the LVM redo log to the global LVM metadata + +Queues as rings +--------------- + +We can use a simple ring protocol to represent the queues on the disk. +Each queue will have a single consumer and single producer and reside within +a single logical volume. + +To make diagnostics simpler, we can require the ring to only support `push` +and `pop` of *whole* messages i.e. there can be no partial reads or partial +writes. This means that the `producer` and `consumer` pointers will always +point to valid message boundaries.
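As a rough illustration of the whole-message rule, the following Python sketch models the data area of a ring in memory. It assumes the framing this document describes for the prototype (a 4-byte little-endian length prefix, payloads padded to a 4-byte boundary, and free-running offsets reduced modulo the data-area size). The `Ring` class and its method names are hypothetical and are not taken from the prototype; the on-disk header sectors and the suspend flags are left out.

```python
import struct


class Ring:
    """In-memory stand-in for the data area of one queue (hypothetical helper)."""

    def __init__(self, size):
        assert size % 4 == 0
        self.data = bytearray(size)
        self.size = size
        self.producer = 0  # free-running byte offset, always on a message boundary
        self.consumer = 0  # free-running byte offset, always on a message boundary

    @staticmethod
    def _framed_length(n):
        # 4-byte length prefix plus the payload rounded up to a multiple of 4.
        return 4 + ((n + 3) & ~3)

    def _write(self, offset, buf):
        for i, b in enumerate(buf):
            self.data[(offset + i) % self.size] = b

    def _read(self, offset, n):
        return bytes(self.data[(offset + i) % self.size] for i in range(n))

    def push(self, payload):
        needed = self._framed_length(len(payload))
        if needed > self.size:
            # Permanent error: the message can never fit.
            raise ValueError("message too big for this ring")
        if needed > self.size - (self.producer - self.consumer):
            # Transient error: not enough free space right now.
            return False
        self._write(self.producer, struct.pack("<I", len(payload)) + payload)
        self.producer += needed  # advance only after the whole message is written
        return True

    def pop(self):
        if self.consumer == self.producer:
            return None  # transient: nothing unconsumed
        (length,) = struct.unpack("<I", self._read(self.consumer, 4))
        payload = self._read(self.consumer + 4, length)
        self.consumer += self._framed_length(length)
        return payload
```

Because both pointers only ever move by a whole framed message, a reader never observes a partially written message. Note that this sketch advances the consumer pointer as soon as the payload has been read; a journal built on the ring would process the item first and advance afterwards.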
+ +One possible format used by the [prototype](https://github.com/mirage/shared-block-ring/blob/master/lib/ring.ml) is as follows: + +- sector 0: a magic string +- sector 1: producer state +- sector 2: consumer state +- sector 3...: data + +Within the producer state sector we can have: + +- octets 0-7: producer offset: a little-endian 64-bit integer +- octet 8: 1 means "suspend acknowledged"; 0 otherwise + +Within the consumer state sector we can have: + +- octets 0-7: consumer offset: a little-endian 64-bit integer +- octet 8: 1 means "suspend requested"; 0 otherwise + +The consumer and producer pointers point to message boundaries. Each +message is prefixed with a 4 byte length and padded to the next 4-byte +boundary. + +To push a message onto the ring we need to + +- check whether the message is too big to ever fit: this is a permanent + error +- check whether the message is too big to fit given the current free + space: this is a transient error +- write the message into the ring +- advance the producer pointer + +To pop a message from the ring we need to + +- check whether there is unconsumed space: if not this is a transient + error +- read the message from the ring and process it +- advance the consumer pointer + +Journals as queues +------------------ + +When we journal an operation we want to guarantee to execute it never +*or* at-least-once. We can re-use the queue implementation by `push`ing +a description of the work item to the queue and waiting for the +item to be `pop`ped, processed and finally consumed by advancing the +consumer pointer. The journal code needs to check for unconsumed data +during startup, and to process it before continuing. + +Suspending and resuming queues +------------------------------ + +During startup (resync the free blocks) and shutdown (flush the allocations) +we need to suspend and resume queues. 
The ring protocol can be extended +to allow the *consumer* to suspend the ring by: + +- the consumer asserts the "suspend requested" bit +- the producer `push` function checks the bit and writes "suspend acknowledged" +- the producer also periodically polls the queue state and writes + "suspend acknowledged" (to catch the case where no items are to be pushed) +- after the producer has acknowledged it will guarantee to `push` no more + items +- when the consumer polls the producer's state and spots the "suspend acknowledged", + it concludes that the queue is now suspended. + +The key detail is that the handshake on the ring causes the two sides +to synchronise and both agree that the ring is now suspended/resumed. + + +Modelling the suspend/resume protocol +------------------------------------- + +To check that the suspend/resume protocol works well enough to be used +to resynchronise the free blocks list on a slave, a simple +[promela model](queue.pml) was created. We model the queue state as +2 boolean flags: + +``` +bool suspend /* suspend requested */ +bool suspend_ack /* suspend acknowledged */ +``` + +and an abstract representation of the data within the ring: + +``` +/* the queue may have no data (none); a delta or a full sync. + the full sync is performed immediately on resume. */ +mtype = { sync delta none } +mtype inflight_data = none +``` + +There is a "producer" and a "consumer" process which run forever, +exchanging data and suspending and resuming whenever they want. +The special data item `sync` is only sent immediately after a resume +and we check that we never desynchronise with asserts: + +``` + :: (inflight_data != none) -> + /* In steady state we receive deltas */ + assert (suspend_ack == false); + assert (inflight_data == delta); + inflight_data = none +``` +i.e. when we are receiving data normally (outside of the suspend/resume +code) we aren't suspended and we expect deltas, not full syncs.
+ +The model-checker [spin](http://spinroot.com/spin/whatispin.html) +verifies this property holds. + +Interaction with HA +=================== + +Consider what will happen if a host fails when HA is disabled: + +- if the host is a slave: the VMs running on the host will crash but + no other host is affected. +- if the host is a master: allocation requests from running VMs will + continue provided enough free blocks are cached on the hosts. If a + host eventually runs out of free blocks, then guest I/O will start to + block and VMs may eventually crash. + +Therefore we *recommend* that users enable HA and only disable it +for short periods of time. Note that, unlike other thin-provisioning +implementations, we will allow HA to be disabled. + +Host-local LVs +============== + +When a host calls SMAPI `sr_attach`, it will use `xenvm` to tell `xenvmd` on the +SRmaster to connect to the `local_allocator` on the host. The `xenvmd` +daemon will create the volumes for queues and a volume to represent the +"free blocks" which a host is allowed to allocate. + +Monitoring +========== + +The `xenvmd` process should export RRD datasources over shared +memory named + +- ```sr___free```: the number of free blocks in + the local cache. It's useful to look at this and verify that it doesn't + usually hit zero, since that's when allocations will start to block. + For this reason we should use the `MIN` consolidation function. +- ```sr___requests```: a counter of the number + of satisfied allocation requests. If this number is too high then the quantum + of allocation should be increased. For this reason we should use the + `MAX` consolidation function. +- ```sr___allocations```: a counter of the number of + bytes being allocated. If the allocation rate is too high compared with + the number of free blocks divided by the HA timeout period then the + `SRmaster-allocator` should be reconfigured to supply more blocks with the host. 
+ +Modifications to tapdisk +======================== + +TODO: to be updated by Germano + +```tapdisk``` will be modified to + +- on open: discover the current maximum size of the file/LV (for a file + we assume there is no limit for now) +- read a low-water mark value from a config file ```/etc/tapdisk3.conf``` +- read a very-low-water mark value from a config file ```/etc/tapdisk3.conf``` +- read a Unix domain socket path from a config file ```/etc/tapdisk3.conf``` +- when there is less free space available than the low-water mark: connect + to the Unix domain socket and write an "extend" request +- upon receiving the "extend" response, re-read the maximum size of the + file/LV +- when there is less free space available than the very-low-water mark: + start to slow I/O responses and write a single 'error' line to the log. + +The extend request +------------------ + +TODO: to be updated by Germano + +The request has the following format: + +Octet offsets | Name | Description +-----------------|----------|------------ +0,1 | tl | Total length (including this field) of message (in network byte order) +2 | type | The value '0' indicating an extend request +3 | nl | The length of the LV name in octets, including NULL terminator +4,..,4+nl-1 | name | The LV name +4+nl,..,12+nl-1 | vdi_size | The virtual size of the logical VDI (in network byte order) +12+nl,..,20+nl-1 | lv_size | The current size of the LV (in network byte order) +20+nl,..,28+nl-1 | cur_size | The current size of the vhd metadata (in network byte order) + +The extend response +------------------- + +The response is a single byte value "0" which is a signal to re-examine +the LV size. The request will block indefinitely until it succeeds. The +request will block for a long time if + +- the SR has genuinely run out of space. The admin should observe the + existing free space graphs/alerts and perform an SR resize. +- the master has failed and HA is disabled.
The admin should re-enable + HA or fix the problem manually. + +The local_allocator +=================== + +There is one `local_allocator` process per plugged PBD. +The process will be +spawned by the SM `sr_attach` call, and shutdown from the `sr_detach` call. + +The `local_allocator` accepts the following configuration (via a config file): + +- `socket`: path to a local Unix domain socket. This is where the `local_allocator` + listens for requests from `tapdisk` +- `allocation_quantum`: number of megabytes to allocate to each tapdisk on request +- `local_journal`: path to a block device or file used for local journalling. This + should be deleted on reboot. +- `free_pool`: name of the LV used to store the host's free blocks +- `devices`: list of local block devices containing the PVs +- `to_LVM`: name of the LV containing the queue of block allocations sent to `xenvmd` +- `from_LVM`: name of the LV containing the queue of messages sent from `xenvmd`. + There are two types of messages: + 1. Free blocks to put into the free pool + 2. Cap requests to remove blocks from the free pool. + +When the `local_allocator` process starts up it will read the host local +journal and + +- re-execute any pending allocation requests from tapdisk +- suspend and resume the `from_LVM` queue to trigger a full retransmit + of free blocks from `xenvmd` + +The procedure for handling an allocation request from tapdisk is: + +1. if there aren't enough free blocks in the free pool, wait polling the + `from_LVM` queue +2. choose a range of blocks to assign to the tapdisk LV from the free LV +3. write the operation (i.e. exactly what we are about to do) to the journal. + This ensures that it will be repeated if the allocator crashes and restarts. + Note that, since the operation may be repeated multiple times, it must be + idempotent. +5. push the block assignment to the `toLVM` queue +6. suspend the device mapper device +7. add/modify the device mapper target +8. 
resume the device mapper device +9. remove the operation from the local journal (i.e. there's no need to repeat + it now) +10. reply to tapdisk + +Shutting down the local-allocator +--------------------------------- + +The SM `sr_detach` called from `PBD.unplug` will use the `xenvm` CLI to request +that `xenvmd` disconnects from a host. The procedure is: + +1. SM calls `xenvm disconnect host` +2. `xenvm` sends an RPC to `xenvmd` tunnelled through `xapi` +3. `xenvmd` suspends the `to_LVM` queue +4. the `local_allocator` acknowledges the suspend and exits +5. `xenvmd` flushes all updates from the `to_LVM` queue and stops listening + +xenvmd +====== + +`xenvmd` is a daemon running per SRmaster PBD, started in `sr_attach` and +terminated in `sr_detach`. `xenvmd` has a config file containing: + +- `socket`: Unix domain socket where `xenvmd` listens for requests from + `xenvm` tunnelled by `xapi` +- `host_allocation_quantum`: number of megabytes to hand to a host at a time +- `host_low_water_mark`: threshold below which we will hand blocks to a host +- `devices`: local devices containing the PVs + +`xenvmd` continually + +- peeks updates from all the `to_LVM` queues +- calculates how much free space each host still has +- if the size of a host's free pool drops below some threshold: + - choose some free blocks +- if the size of a host's free pool goes above some threshold: + - request a cap of the host's free pool +- writes the change it is going to make to a journal stored in an LV +- pops the updates from the `to_LVM` queues +- pushes the updates to the `from_LVM` queues +- pushes updates to the LVM redo-log +- periodically flush the LVM redo-log to the LVM metadata area + +The membership monitor +====================== + +The role of the membership monitor is to keep the list of `xenvmd` connections +in sync with the `PBD.currently_attached` fields. 
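The reconciliation the membership monitor performs amounts to a set difference. A minimal Python sketch follows, assuming the monitor can obtain the set of hosts `xenvmd` is currently connected to and a mapping from host to its `PBD.currently_attached` value. The `reconcile` function and both input shapes are hypothetical and chosen only for illustration.

```python
def reconcile(connected, pbds):
    """Return (to_connect, to_disconnect) as two sets of hosts.

    connected: set of hosts xenvmd currently has queues open to (assumed shape).
    pbds: dict mapping host -> value of PBD.currently_attached (assumed shape).
    """
    attached = {host for host, is_attached in pbds.items() if is_attached}
    # Hosts with an attached PBD but no connection need connecting;
    # hosts with a connection but no attached PBD need disconnecting.
    return attached - connected, connected - attached
```

For example, with `connected = {"a", "b"}` and PBDs attached on `a` and `c` only, the monitor would ask `xenvmd` to connect to `c` and to disconnect from `b`.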
+ +We shall + +- install a ```host-pre-declare-dead``` script to use `xenvm` to send an RPC + to `xenvmd` to forcibly flush (without acknowledgement) the `to_LVM` queue + and destroy the LVs. +- modify XenAPI ```Host.declare_dead``` to call ```host-pre-declare-dead``` before + the VMs are unlocked +- add a ```host-pre-forget``` hook type which will be called just before a Host + is forgotten +- install a ```host-pre-forget``` script to use `xenvm` to call `xenvmd` to + destroy the host's local LVs + +Modifications to LVHD SR +======================== + +- `sr_attach` should: + - if an SRmaster, update the `MGT` major version number to prevent old hosts + from plugging in the SR + - write the xenvmd configuration file (on _all_ hosts, not just SRmaster) + - spawn `local_allocator` +- `sr_detach` should: + - call `xenvm` to request the shutdown of `local_allocator` +- `vdi_deactivate` should: + - call `xenvm` to request the flushing of all the `to_LVM` queues to the + redo log +- `vdi_activate` should: + - if necessary, call `xenvm` to deflate the LV to the minimum size (with some slack) + +Note that it is possible to attach and detach the individual hosts in any order +but when the SRmaster is unplugged then there will be no "refilling" of the host +local free LVs; it will behave as if the master host has failed. + +Modifications to xapi +===================== + +- Xapi needs to learn how to forward xenvm connections to the SR master. +- Xapi needs to start and stop xenvmd at the appropriate times +- We must disable unplugging the PBDs for shared SRs on the pool master + if any slave still has its PBD plugged. This is actually fixing an + issue that exists today - LVHD SRs require the master PBD to be + plugged to do many operations. +- Xapi should provide a mechanism by which the xenvmd process can be killed + once the last PBD for an SR has been unplugged. + +Enabling thin provisioning +========================== + +Thin provisioning will be automatically enabled on upgrade.
When the SRmaster +plugs in `PBD` the `MGT` major version number will be bumped to prevent old +hosts from plugging in the SR and getting confused. +When a VDI is activated, it will be deflated to the new low size. + +Disabling thin provisioning +=========================== + +We shall make a tool which will + +- allow someone to downgrade their pool after enabling thin provisioning +- allow developers to test the upgrade logic without fully downgrading their + hosts + +The tool will + +- check if there is enough space to fully inflate all non-snapshot leaves +- unplug all the non-SRmaster `PBD`s +- unplug the SRmaster `PBD`. As a side-effect all pending LVM updates will be + written to the LVM metadata. +- modify the `MGT` volume to have the lower metadata version +- fully inflate all non-snapshot leaves + +Walk-through: upgrade +===================== + +Rolling upgrade should work in the usual way. As soon as the pool master has been +upgraded, hosts will be able to use thin provisioning when new VDIs are attached. +A VM suspend/resume/reboot or migrate will be needed to turn on thin provisioning +for existing running VMs. + +Walk-through: downgrade +======================= + +A pool may be safely downgraded to a previous version without thin provisioning +provided that the downgrade tool is run. If the tool hasn't run then the old +pool will refuse to attach the SR because the metadata has been upgraded. + +Walk-through: after a host failure +================================== + +If HA is enabled: + +- ```xhad``` elects a new master if necessary +- ```Xapi``` on the master will start xenvmd processes for shared thin-lvhd SRs +- the ```xhad``` tells ```Xapi``` which hosts are alive and which have failed. +- ```Xapi``` runs the ```host-pre-declare-dead``` scripts for every failed host +- the ```host-pre-declare-dead``` tells `xenvmd` to flush the `to_LVM` updates +- ```Xapi``` unlocks the VMs and restarts them on new hosts. 
+ +If HA is not enabled: + +- The admin should verify the host is definitely dead +- If the dead host was the master, a new master must be designated. This will + start the xenvmd processes for the shared thin-lvhd SRs. +- the admin must tell ```Xapi``` which hosts have failed with ```xe host-declare-dead``` +- ```Xapi``` runs the ```host-pre-declare-dead``` scripts for every failed host +- the ```host-pre-declare-dead``` script tells `xenvmd` to flush the `to_LVM` updates +- ```Xapi``` unlocks the VMs +- the admin may now restart the VMs on new hosts. + +Walk-through: co-operative master transition +============================================ + +The admin calls Pool.designate_new_master. This initiates a two-phase +commit of the new master. As part of this, the slaves will restart, +and on restart each host's xapi will kill any xenvmd that should only +run on the pool master. The new designated master will then restart itself +and start up the xenvmd process on itself. + +Future use of dm-thin? +====================== + +Dm-thin also uses 2 local LVs: one for the "thin pool" and one for the metadata. +After replaying our journal we could potentially delete our host local LVs and +switch over to dm-thin. + +Summary of the impact on the admin +================================== + +- If the VM workload performs a lot of disk allocation, then the admin *should* + enable HA. +- The admin *must* not downgrade the pool without first cleanly detaching the + storage. +- Extra metadata is needed to track thin provisioning, reducing the amount of + space available for user volumes. +- If an SR is completely full then it will not be possible to enable thin + provisioning. +- There will be more fragmentation, but the extent size is large (4MiB) so it + shouldn't be too bad. + +Ring protocols +============== + +Each ring consists of 3 sectors of metadata followed by the data area.
The +contents of the first 3 sectors are: + +Sector, Octet offsets | Name | Type | Description +----------------------|-------------|--------|------ +0,0-30 | signature | string | Signature ("mirage shared-block-device 1.0") +1,0-7 | producer | uint64 | Pointer to the end of data written by the producer +1,8 | suspend_ack | uint8 | Suspend acknowledgement byte +2,0-7 | consumer | uint64 | Pointer to the end of data read by the consumer +2,8 | suspend | uint8 | Suspend request byte + + +Note. producer and consumer pointers are stored in little endian +format. + +The pointers are free running byte offsets rounded up to the next +4-byte boundary, and the position of the actual data is found by +finding the remainder when dividing by the size of the data area. The +producer pointer points to the first free byte, and the consumer +pointer points to the byte after the last data consumed. The actual +payload is preceded by a 4-byte length field, stored in little endian +format. When writing a 1 byte payload, the next value of the producer +pointer will therefore be 8 bytes on from the previous - 4 for the +length (which will contain [0x01,0x00,0x00,0x00]), 1 byte for the +payload, and 3 bytes padding. + +A ring is suspended and resumed by the consumer. To suspend, the +consumer first checks that the producer and consumer agree on the +current suspend status. If they do not, the ring cannot be +suspended. The consumer then writes the byte 0x02 into byte 8 of +sector 2. The consumer must then wait for the producer to acknowledge +the suspend, which it will do by writing 0x02 into byte 8 of sector 1. + +The FromLVM ring +---------------- + +Two different types of message can be sent on the FromLVM ring. + +The FreeAllocation message contains the blocks for the free pool. 
+Example message: + + (FreeAllocation((blocks((pv0(12326 12249))(pv0(11 1))))(generation 2))) + +Pretty-printed: + + (FreeAllocation + ( + (blocks + ( + (pv0(12326 12249)) + (pv0(11 1)) + ) + ) + (generation 2) + ) + ) + +This is a message to add two new sets of extents to the free pool: a +span of length 12249 extents starting at extent 12326, and a span of +length 1 starting from extent 11, both within the physical volume +'pv0'. The generation count of this message is '2'. The semantics of +the generation is that the local allocator must record the generation +of the last message it received since the FromLVM ring was resumed, +and ignore any message with a generation less than or equal to that of the last +message received. + +The CapRequest message contains a request to cap the free pool at +a maximum size. +Example message: + + (CapRequest((cap 6127)(name host1-freeme))) + +Pretty-printed: + + (CapRequest + ( + (cap 6127) + (name host1-freeme) + ) + ) + +This is a request to cap the free pool at a maximum size of 6127 +extents. The 'name' parameter reflects the name of the LV into which +the extents should be transferred. + +The ToLVM Ring +-------------- + +The ToLVM ring only contains 1 type of message. Example: + + ((volume test5)(segments(((start_extent 1)(extent_count 32)(cls(Linear((name pv0)(start_extent 12328)))))))) + +Pretty-printed: + + ( + (volume test5) + (segments + ( + ( + (start_extent 1) + (extent_count 32) + (cls + (Linear + ( + (name pv0) + (start_extent 12328) + ) + ) + ) + ) + ) + ) + ) + +This message is extending an LV named 'test5' by giving it 32 extents +starting at extent 1, coming from PV 'pv0' starting at extent +12328. The 'cls' field should always be 'Linear' - this is the only +acceptable value. + + +Cap requests +============ + +Xenvmd will try to keep the free pools of the hosts within a range +set as a fraction of free space.
There are 3 parameters adjustable +via the config file: + +- low_water_mark_factor +- medium_water_mark_factor +- high_water_mark_factor + +These three are all numbers between 0 and 1. Xenvmd will sum the free +size and the sizes of all hosts' free pools to find the total +effective free size in the VG, `F`. It will then subtract the sizes of +any pending desired space from in-flight create or resize calls `s`. This +will then be divided by the number of hosts connected, `n`, and +multiplied by the three factors above to find the 3 absolute values +for the high, medium and low watermarks. + + {high, medium, low} * (F - s) / n + +When xenvmd notices that a host's free pool size has dropped below +the low watermark, it will be topped up such that the size is equal +to the medium watermark. If xenvmd notices that a host's free pool +size is above the high watermark, it will issue a 'cap request' to +the host's local allocator, which will then respond by allocating +from its free pool into the fake LV, which xenvmd will then delete +as soon as it gets the update. + +Xenvmd keeps track of the last update it has sent to the local +allocator, and will not resend the same request twice, unless it +is restarted. + diff --git a/doc/content/design/thin-lvhd/queue.pml b/doc/content/design/thin-lvhd/queue.pml new file mode 100644 index 00000000000..681f3027f77 --- /dev/null +++ b/doc/content/design/thin-lvhd/queue.pml @@ -0,0 +1,64 @@ +/* queue suspend/resume protocol */ + +/* flags in the shared disk */ +bool suspend /* suspend requested */ +bool suspend_ack /* suspend acknowledged */ + +/* the queue may have no data (none); a delta or a full sync. + the full sync is performed immediately on resume.
*/ +mtype = { sync delta none } +mtype inflight_data = none + +proctype consumer(){ + + /* get the channel back to a known state by suspending, + resuming and receiving the initial resync */ +resync: + (suspend == suspend_ack) + suspend = true; + (suspend == suspend_ack) +resync2: + /* drop old data */ + inflight_data = none; + suspend = false; + (suspend == suspend_ack) + (inflight_data == sync) + /* receive initial sync */ + inflight_data = none; + do + /* Consumer.pop */ + :: (inflight_data != none) -> + /* In steady state we receive deltas */ + assert (suspend_ack == false); + assert (inflight_data == delta); + inflight_data = none + /* Consumer.suspend */ + :: ((suspend == false)&&(suspend_ack == false)) -> + goto resync + /* Consumer.resume */ + :: ((suspend == true)&&(suspend_ack == true)) -> + goto resync2 + od; +} + +proctype producer(){ + do + /* Producer.state = Running */ + :: ((suspend == false)&&(suspend_ack==true)) -> + suspend_ack = false; + inflight_data = sync + /* Producer.state = Suspended */ + :: ((suspend == true) && (suspend_ack == false)) -> + suspend_ack = true + /* Producer.push */ + :: ((suspend == false) && (suspend_ack == false) && (inflight_data != sync)) -> + inflight_data = delta + od +} + +init { + atomic { + run producer(); + run consumer(); + } +} diff --git a/doc/content/design/thin-lvhd/thin-lvhd.graffle b/doc/content/design/thin-lvhd/thin-lvhd.graffle new file mode 100644 index 00000000000..b6ddda0dc15 Binary files /dev/null and b/doc/content/design/thin-lvhd/thin-lvhd.graffle differ diff --git a/doc/content/design/thin-lvhd/thin-lvhd.png b/doc/content/design/thin-lvhd/thin-lvhd.png new file mode 100644 index 00000000000..29f43d57ec8 Binary files /dev/null and b/doc/content/design/thin-lvhd/thin-lvhd.png differ diff --git a/doc/content/design/thin-lvhd/xenvmd.graffle b/doc/content/design/thin-lvhd/xenvmd.graffle new file mode 100644 index 00000000000..42c8cfc9cf2 Binary files /dev/null and 
b/doc/content/design/thin-lvhd/xenvmd.graffle differ diff --git a/doc/content/design/tunnelling.md b/doc/content/design/tunnelling.md new file mode 100644 index 00000000000..6452e19e9ec --- /dev/null +++ b/doc/content/design/tunnelling.md @@ -0,0 +1,197 @@ +--- +title: Tunnelling API design +layout: default +design_doc: true +revision: 1 +status: released (5.6 FP1) +--- + +To isolate network traffic between VMs (e.g. for security reasons) one can use +VLANs. The number of possible VLANs on a network, however, is limited, and +setting up a VLAN requires configuring the physical switches in the network. +GRE tunnels provide a similar, though more flexible solution. This document +proposes a design that integrates the use of tunnelling in the XenAPI. The +design relies on the recent introduction of the Open vSwitch, and +requires an Open vSwitch +([OpenFlow](https://www.opennetworking.org/sdn-resources/openflow)) controller +(further referred to as +_the controller_) to set up and maintain the actual GRE tunnels. + +We suggest following the way VLANs are modelled in the datamodel. Introducing a +VLAN involves creating a Network object for the VLAN, that VIFs can connect to. +The `VLAN.create` API call takes references to a PIF and Network to use and a +VLAN tag, and creates a VLAN object and a PIF object. 
We propose something +similar for tunnels; the resulting objects and relations for two hosts would +look like this: + + PIF (transport) -- Tunnel -- PIF (access) \ / VIF + Network -- VIF + PIF (transport) -- Tunnel -- PIF (access) / \ VIF + + +XenAPI changes +-------------- + +### New tunnel class + +#### Fields + +* `string uuid` (read-only) +* `PIF ref access_PIF` (read-only) +* `PIF ref transport_PIF` (read-only) +* `(string -> string) map status` (read/write); owned by the controller, containing at least the + key `active`, and `key` and `error` when appropriate (see below) +* `(string -> string) map other_config` (read/write) + +New fields in PIF class (automatically linked to the corresponding `tunnel` +fields): + +* `PIF ref set tunnel_access_PIF_of` (read-only) +* `PIF ref set tunnel_transport_PIF_of` (read-only) + +#### Messages + +* `tunnel ref create (PIF ref, network ref)` +* `void destroy (tunnel ref)` + +### Backends + +For clients to determine which network backend is in use (to decide whether +tunnelling functionality is enabled) a key `network_backend` is added to the +`Host.software_version` map on each host. The value of this key can be: + +* `bridge`: the Linux bridging backend is in use; +* `openvswitch`: the [Open vSwitch] backend is in use. + +### Notes + +* The user is responsible for creating tunnel and network objects, associating + VIFs with the right networks, and configuring the physical PIFs, all using + the XenAPI/CLI/XC. + +* The `tunnel.status` field is owned by the controller. It + may be possible to define an RBAC role for the controller, such that only the + controller is able to write to it. + +* The `tunnel.create` message does not take + a tunnel identifier (GRE key). The controller is responsible for assigning + the right keys transparently. When a tunnel has been set up, the controller + will write its key to `tunnel.status:key`, and it will set + `tunnel.status:active` to `"true"` in the same field. 
+ +* In case a tunnel could + not be set up, an error code (to be defined) will be written to + `tunnel.status:error`, and `tunnel.status:active` will be `"false"`. + +Xapi +---- + +### tunnel.create + +* Fails with `OPENVSWITCH_NOT_ACTIVE` if the Open vSwitch networking sub-system + is not active (the host uses linux bridging). +* Fails with `IS_TUNNEL_ACCESS_PIF` if the specified transport PIF is a tunnel access PIF. +* Takes care of creating and connecting the new tunnel and PIF objects. + * Sets a random MAC on the access PIF. + * IP configuration of the tunnel + access PIF is left blank. (The IP configuration on a PIF is normally used for + the interface in dom0. In this case, there is no tunnel interface for dom0 to + use. Such functionality may be added in future.) + * The `tunnel.status:active` + field is initialised to `"false"`, indicating that no actual tunnelling + infrastructure has been set up yet. +* Calls `PIF.plug` on the new tunnel access PIF. + +### tunnel.destroy + +* Calls `PIF.unplug` on the tunnel access PIF. Destroys the `tunnel` and + tunnel access PIF objects. + +### PIF.plug on a tunnel access PIF + +* Fails with `TRANSPORT_PIF_NOT_CONFIGURED` if the underlying transport PIF has + `PIF.ip_configuration_mode = None`, as this interface needs to be configured + for the tunnelling to work. Otherwise, the transport PIF will be plugged. +* Xapi requests `interface-reconfigure` to "bring up" the tunnel access PIF, + which causes it to create a local bridge. +* No link will be made between the + new bridge and the physical interface by `interface-reconfigure`. The + controller is responsible for setting up these links. If the controller is + not available, no links can be created, and the tunnel network degrades to an + internal network (only intra-host connectivity). +* `PIF.currently_attached` is set to `true`. 
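The `PIF.plug` steps listed above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions, not xapi code: the `Pif` class, the `plug_tunnel_access_pif` function and the `bring_up` callback (standing in for the request to `interface-reconfigure`) are hypothetical names. Only the error name `TRANSPORT_PIF_NOT_CONFIGURED` and the field names `ip_configuration_mode` and `currently_attached` come from the design.

```python
from dataclasses import dataclass
from typing import Callable


class TransportPifNotConfigured(Exception):
    """Stands in for the TRANSPORT_PIF_NOT_CONFIGURED API error."""


@dataclass
class Pif:
    # Hypothetical, heavily simplified stand-in for a PIF record.
    name: str
    ip_configuration_mode: str = "None"
    currently_attached: bool = False


def plug_tunnel_access_pif(access: Pif, transport: Pif,
                           bring_up: Callable[[Pif], None]) -> None:
    # The transport PIF needs an IP configuration for tunnelling to work.
    if transport.ip_configuration_mode == "None":
        raise TransportPifNotConfigured(transport.name)
    # Otherwise the transport PIF is plugged first (modelled as a flag only).
    transport.currently_attached = True
    # Stand-in for asking interface-reconfigure to create the local bridge.
    # No link to the physical interface is made here; in the design that is
    # the controller's job.
    bring_up(access)
    access.currently_attached = True
```

The callback keeps the sketch free of any real networking side effects; a caller can pass a function that simply records which PIFs were brought up.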
+ +### PIF.unplug on a tunnel access PIF + +* Xapi requests `interface-reconfigure` to "bring down" the tunnel PIF, which + causes it to destroy the local bridge. +* `PIF.currently_attached` is set to `false`. + +### PIF.unplug on a tunnel transport PIF + +* Calls `PIF.unplug` on the associated tunnel access PIF(s). + +### PIF.forget on a tunnel access or transport PIF + +* Fails with `PIF_TUNNEL_STILL_EXISTS`. + +### VLAN.create + +* Tunnels can only exist on top of physical/VLAN/Bond PIFs, and not the other + way around. `VLAN.create` fails with `IS_TUNNEL_ACCESS_PIF` if given an + underlying PIF that is a tunnel access PIF. + +### Pool join + +* As for VLANs, when a host joins a pool, it will inherit the tunnels that are + present on the pool master. +* Any tunnels (tunnel and access PIF objects) + configured on the host are removed, which will leave their networks + disconnected (the networks become internal networks). As a joining host is + always a single host, there is no real use for having had tunnels on it, so + this probably will never be an issue. + +The controller +-------------- + +* The controller tracks the `tunnel` class to determine which bridges/networks + require GRE tunnelling. + * On start-up, it calls `tunnel.get_all` to obtain the information about all + tunnels. + * Registers for events on the `tunnel` class to stay up-to-date. +* A tunnel network is organised as a star topology. The controller is free to + decide which host will be the central host ("switching host"). +* If the + current switching host goes down, a new one will be selected, and GRE tunnels + will be reconstructed. +* The controller creates GRE tunnels connecting each + existing Open vSwitch bridge that is associated with the same tunnel network, + after assigning the network a unique GRE key. +* The controller destroys GRE + tunnels if associated Open vSwitch bridges are destroyed.
If the destroyed + bridge was on the switching host, and other hosts are still using the same + tunnel network, a new switching host will be selected, and GRE tunnels will + be reconstructed. +* The controller sets `tunnel.status:active` to `"true"` for + all tunnel links that have been set up, and `"false"` if links are broken. +* The controller writes an appropriate error code (to be defined) to + `tunnel.status:error` if something goes wrong. +* When an access PIF is + plugged, and the controller succeeds in setting up the tunnelling infrastructure, + it writes the GRE key to `tunnel.status:key` on the associated tunnel object + (at the same time `tunnel.status:active` will be set to `"true"`). +* When the + tunnel infrastructure is not up and running, the controller may remove the + key `tunnel.status:key` (optional; the key should anyway be disregarded if + `tunnel.status:active` is `"false"`). + +CLI +--- + +New `xe` commands (analogous to `xe vlan-`): + +* `tunnel-create` +* `tunnel-destroy` +* `tunnel-list` +* `tunnel-param-get` +* `tunnel-param-list` diff --git a/doc/content/design/vgpu-type-identifiers.md b/doc/content/design/vgpu-type-identifiers.md new file mode 100644 index 00000000000..234a16d4827 --- /dev/null +++ b/doc/content/design/vgpu-type-identifiers.md @@ -0,0 +1,112 @@ +--- +title: VGPU type identifiers +layout: default +design_doc: true +revision: 1 +status: released (7.0) +design_review: 156 +revision_history: +- revision_number: 1 + description: Initial version +--- + +Introduction +------------ + +When xapi starts, it may create a number of VGPU_type objects. These act as +VGPU presets, and exactly which VGPU_type objects are created depends on the +installed hardware and in certain cases the presence of certain files in dom0. + +When deciding which VGPU_type objects need to be created, xapi needs to +determine whether a suitable VGPU_type object already exists, as there should +never be duplicates.
At the moment the combination of vendor name and model name +is used as a primary key, but this is not ideal as these values are subject to +change. We therefore need a way of creating a primary key to uniquely identify +VGPU_type objects. + +Identifier +---------- + +We will add a new read-only field to the database: + +- `VGPU_type.identifier (string)` + +This field will contain a string representation of the parameters required to +uniquely identify a VGPU_type. The parameters required can be summed up with the +following OCaml type: + +``` +type nvidia_id = { + pdev_id : int; + psubdev_id : int option; + vdev_id : int; + vsubdev_id : int; +} + +type gvt_g_id = { + pdev_id : int; + low_gm_sz : int64; + high_gm_sz : int64; + fence_sz : int64; + monitor_config_file : string option; +} + +type t = + | Passthrough + | Nvidia of nvidia_id + | GVT_g of gvt_g_id +``` + +When converting this type to a string, the string will always be prefixed with +`0001:` enabling future versioning of the serialisation format. + +For passthrough, the string will simply be: + +`0001:passthrough` + +For NVIDIA, the string will be `nvidia` followed by the four device IDs +serialised as four-digit hex values, separated by commas. If `psubdev_id` is +`None`, the empty string will be used e.g. + +``` +Nvidia { + pdev_id = 0x11bf; + psubdev_id = None; + vdev_id = 0x11b0; + vsubdev_id = 0x109d; +} +``` + +would map to + +`0001:nvidia,11bf,,11b0,109d` + +For GVT-g, the string will be `gvt-g` followed by the physical device ID encoded +as four-digit hex, followed by `low_gm_sz`, `high_gm_sz` and `fence_sz` encoded +as hex, followed by `monitor_config_file` (or the empty string if it is `None`) +e.g. 
+ +``` +GVT_g { + pdev_id = 0x162a; + low_gm_sz = 128L; + high_gm_sz = 384L; + fence_sz = 4L; + monitor_config_file = None; +} +``` + +would map to + +`0001:gvt-g,162a,80,180,4,,` + +Having this string in the database will allow us to do a simple lookup to test +whether a certain VGPU_type already exists. Although it is not currently +required, this string can also be converted back to the type from which it was +generated. + +When deciding whether to create VGPU_type objects, xapi will generate the +identifier string and use it to look for existing VGPU_type objects in the +database. If none are found, xapi will look for existing VGPU_type objects with +the tuple of model name and vendor name. If still none are found, xapi will +create a new VGPU_type object. diff --git a/doc/content/design/virt-hw-platform-vn.md b/doc/content/design/virt-hw-platform-vn.md new file mode 100644 index 00000000000..ec4f21ce4cb --- /dev/null +++ b/doc/content/design/virt-hw-platform-vn.md @@ -0,0 +1,39 @@ +--- +title: Virtual Hardware Platform Version +layout: default +design_doc: true +revision: 1 +status: released (7.0) +--- + +### Background and goal + +Some VMs can only be run on hosts of sufficiently recent versions. + +We want a clean way to ensure that xapi only tries to run a guest VM on a host that supports the "virtual hardware platform" required by the VM. + +### Suggested design + +* In the datamodel, VM has a new integer field "hardware_platform_version" which defaults to zero. +* In the datamodel, Host has a corresponding new integer-list field "virtual_hardware_platform_versions" which defaults to a list containing a single zero element (i.e. `[0]` or `[0L]` in OCaml notation). The zero represents the implicit version supported by older hosts that lack the code to handle the Virtual Hardware Platform Version concept. +* When a host boots it populates its own entry from a hardcoded value, currently `[0; 1]` i.e. a list containing the two integer elements `0` and `1`.
(Alternatively this could come from a config file.) + * If this new version-handling functionality is introduced in a hotfix, at some point the pool master will have the new functionality while at least one slave does not. An old slave-host that does not yet have software to handle this feature will not set its DB entry, which will therefore remain as `[0]` (maintained in the DB by the master). +* The existing test for whether a VM can run on (or migrate to) a host must include a check that the VM's virtual hardware platform version is in the host's list of supported versions. +* When a VM is made to start using a feature that is available only in a certain virtual hardware platform version, xapi must set the VM's hardware_platform_version to the maximum of that version-number and its current value (i.e. raise if needed). + +For the version we could consider some type other than integer, but a strict ordering is needed. + +### First use-case + +Version 1 denotes support for a certain feature: + +> When a VM starts, if a certain flag is set in VM.platform then XenServer will provide an emulated PCI device which will trigger the guest Windows OS to seek drivers for the device, or updates for those drivers. Thus updated drivers can be obtained through the standard Windows Update mechanism. + +If the PCI device is removed, the guest OS will fail to boot. A VM using this feature must not be migrated to or started on a XenServer that lacks support for the feature. + +Therefore at VM start, we can look at whether this feature is being used; if it is, then if the VM's Virtual Hardware Platform Version is less than 1 we should raise it to 1. + +### Limitation +Consider a VM that requires version 1 or higher. Suppose it is exported, then imported into an old host that does not support this feature. Then the host will not check the versions but will attempt to run the VM, which will then have difficulties. 
+ +The only way to prevent this would be to make a backwards-incompatible change to the VM metadata (e.g. a new item in an enum) so that the old hosts cannot read it, but that seems like a bad idea. diff --git a/doc/content/design/xenopsd_events.md b/doc/content/design/xenopsd_events.md new file mode 100644 index 00000000000..55ee74f7a5b --- /dev/null +++ b/doc/content/design/xenopsd_events.md @@ -0,0 +1,47 @@ +--- +layout: default +title: Process events from xenopsd in a timely manner +design_doc: true +status: proposed +revision: 1 +--- + +# Background + +There is a significant delay between the VM being unpaused and XAPI reporting it +as started during a bootstorm. +It can happen that the VM is able to send UDP packets already, but XAPI still reports it as not started for minutes. + +XAPI currently processes all events from xenopsd in a single thread, the unpause +events get queued up behind a lot of other events generated by the already +running VMs. + +We need to ensure that unpause events from xenopsd get processed in a timely +manner, even if XAPI is busy processing other events. + +# Timely processing of events + +If we process the events in a Round-Robin fashion then `unpause` events are reported in a timely fashion. +We need to ensure that events operating on the same VM are not processed in parallel. + +Xenopsd already has code that does exactly this, the purpose of the [xapi-work-queues refactoring PR](https://github.com/xapi-project/xenopsd/pull/337) is to +reuse this code in XAPI by creating a shared package between xenopsd and xapi: `xapi-work-queues`. + +# xapi-work-queues + +From the documentation of the new [Worker Pool interface](https://edwintorok.github.io/xapi-work-queues/Xapi_work_queues.html): + +A worker pool has a limited number of worker threads. +Each worker pops one tagged item from the queue in a round-robin fashion. +While the item is executed the tag temporarily doesn't participate in round-robin scheduling. 
+If during execution more items get queued with the same tag they get redirected to a private queue. +Once the item finishes execution the tag will participate in RR scheduling again. + +This ensures that items with the same tag do not get executed in parallel, +and that a tag with a lot of items does not starve the execution of other tags. + +The XAPI side of the changes will [look like this](https://github.com/edwintorok/xen-api/commit/b367bf86d3af4f773db9bf5d1500a4dec0f99bfa?diff=unified#diff-344dc1d17c4663add7fe5500813feef2). + +Known limitations: The number of active per-VM events should be small; this is already ensured in the `push_with_coalesce` / `should_keep` code on the [xenopsd side](https://github.com/xapi-project/xenopsd/blob/master/lib/xenops_server.ml#L441). Events to XAPI from xenopsd should already arrive coalesced. + + diff --git a/doc/content/design/xenprep.md b/doc/content/design/xenprep.md new file mode 100644 index 00000000000..42a393310c6 --- /dev/null +++ b/doc/content/design/xenprep.md @@ -0,0 +1,79 @@ +--- +title: XenPrep +layout: default +design_doc: true +revision: 2 +status: proposed +--- + +### Background +Windows guests should have XenServer-specific drivers installed. As of mid-2015 these have always been installed and upgraded by an essentially manual process involving an ISO carrying the drivers. We have a plan to enable automation through the standard Windows Update mechanism. This will involve an additional virtual PCI device being provided to the VM, to trigger Windows Update to fetch drivers for the device. + +There are many existing Windows guests that have drivers installed already. These drivers must be uninstalled before the new drivers are installed (and ideally before the new PCI device is added). To make this easier, we are planning a XenAPI call that will cause the removal of the old drivers and the addition of the new PCI device.
+ +Since this is only to help with updating old guests, the call may well be removed at some point in the future. + +### Brief high-level design +The XenAPI call will be called `VM.xenprep_start`. It will update the VM record to note that the process has started, and will insert a special ISO into the VM's virtual CD drive. + +That ISO will contain a tool which will be set up to auto-run (if auto-run is enabled in the guest). The tool will: + +1. Lock the CD drive so other Windows programs cannot eject the disc. +2. Uninstall the old drivers. +3. Eject the CD to signal success. +4. Shut down the VM. + +XenServer will interpret the ejection of the CD as a success signal, and when the VM shuts down without the special ISO in the drive, XenServer will: + +1. Update the VM record: + * Remove the mark that shows that the xenprep process is in progress + * Give it the new PCI device: set `VM.auto_update_drivers` to `true`. + * If `VM.virtual_hardware_platform_version` is less than 2, then set it to 2. +2. Start the VM. + +### More details of the xapi-project parts +(The tool that runs in the guest is out of scope for this document.) + +#### Start +The XenAPI call `VM.xenprep_start` will throw a power-state error if the VM is not running. +For RBAC roles, it will be available to "VM Operator" and above. + +It will: + +1. Insert the xenprep ISO into the VM's virtual CD drive. +2. Write `VM.other_config` key `xenprep_progress=ISO_inserted` to record the fact that the xenprep process has been initiated. + +If `xenprep_start` is called on a VM already undergoing xenprep, the call will return successfully but will not do anything. + +If the VM does not have an empty virtual CD drive, the call will fail with a suitable error. + +#### Cancellation +While xenprep is in progress, any request to eject the xenprep ISO (except from inside the guest) will be rejected with a new error "VBD_XENPREP_CD_IN_USE". + +There will be a new XenAPI call `VM.xenprep_abort` which will: + +1. 
Remove the `xenprep_progress` entry from `VM.other_config`. +2. Make a best-effort attempt to eject the CD. (The guest might prevent ejection.) + +This is not intended for cancellation while the xenprep tool is running, but rather for use before it starts, for example if auto-run is disabled or if the VM has a non-Windows OS. + +#### Completion + +Aim: when the guest shuts down after ejecting the CD, XenServer will start the guest again with the new PCI device. + +Xapi works through the queue of events it receives from xenopsd. It is possible that by the time xapi processes the cd-eject event, the guest might have shut down already. + +When the shutdown (not reboot) event is handled, we shall check whether we need to do anything xenprep-related. If +* The VM `other_config` map has `xenprep_progress` as either of `ISO_inserted` or `shutdown`, and +* The xenprep ISO is no longer in the drive + +then we must (in the specified order) + +1. Update the VM record: + 1. In `VM.other_config` set `xenprep_progress=shutdown` + 2. If `VM.virtual_hardware_platform_version` is less than 2, then set it to 2. + 3. Give it the new PCI device: set `VM.auto_update_drivers` to `true`. +2. Initiate VM start. +3. Remove `xenprep_progress` from `VM.other_config` + +The most relevant code is probably the `update_vm` function in `ocaml/xapi/xapi_xenops.ml` in the `xen-api` repo (or in some function called from there). diff --git a/doc/layouts/partials/content-header.html b/doc/layouts/partials/content-header.html new file mode 100644 index 00000000000..9043445a3a4 --- /dev/null +++ b/doc/layouts/partials/content-header.html @@ -0,0 +1,48 @@ +{{ if eq $.Page.Params.design_doc true }} + + + + + + + + + + + {{ with $.Page.Params.status | lower }} + + {{ end }} + + {{ with $.Page.Params.design_review }} + + + + + {{ end }} + {{ with $.Page.Params.revision_history }} + + + + {{ range . }} + + + + + {{ end }} + {{ end }} +
Design document
Revisionv{{$.Page.Params.revision}}
Status + {{.}} +
Review + #{{.}} +
Revision history
v{{.revision_number}}{{.description}}
+{{ end }} diff --git a/doc/layouts/partials/custom-header.html b/doc/layouts/partials/custom-header.html new file mode 100644 index 00000000000..e26070a9468 --- /dev/null +++ b/doc/layouts/partials/custom-header.html @@ -0,0 +1,2 @@ +{{ $style := resources.Get "css/misc.css" }} + diff --git a/doc/layouts/shortcodes/design_docs_list.html b/doc/layouts/shortcodes/design_docs_list.html new file mode 100644 index 00000000000..d64334f3ab1 --- /dev/null +++ b/doc/layouts/shortcodes/design_docs_list.html @@ -0,0 +1,34 @@ +
+Key: +Revision +Proposed +Confirmed +Released (vA.B) +Unrecognised status +
+ + + +{{ range sort $.Page.Pages "Params.status"}} + +{{ end }} + +
+ {{ .Title }} + v{{.Params.revision}} + {{ with .Params.status | lower }} + + {{.}} + + {{ end }} +
\ No newline at end of file