diff --git a/enhancements/network/images/VRFs.svg b/enhancements/network/images/VRFs.svg new file mode 100644 index 0000000000..855417fa73 --- /dev/null +++ b/enhancements/network/images/VRFs.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/egress-ip-l2-primary.svg b/enhancements/network/images/egress-ip-l2-primary.svg new file mode 100644 index 0000000000..e1454122a9 --- /dev/null +++ b/enhancements/network/images/egress-ip-l2-primary.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/egress-ip-vrf-lgw.svg b/enhancements/network/images/egress-ip-vrf-lgw.svg new file mode 100644 index 0000000000..cb6222bf6a --- /dev/null +++ b/enhancements/network/images/egress-ip-vrf-lgw.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/egress-ip-vrf-sgw.svg b/enhancements/network/images/egress-ip-vrf-sgw.svg new file mode 100644 index 0000000000..e2387ae778 --- /dev/null +++ b/enhancements/network/images/egress-ip-vrf-sgw.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/local-gw-node-setup-vrfs.svg b/enhancements/network/images/local-gw-node-setup-vrfs.svg new file mode 100644 index 0000000000..9b7ba269a5 --- /dev/null +++ b/enhancements/network/images/local-gw-node-setup-vrfs.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/multi-homing-l2-gw.svg b/enhancements/network/images/multi-homing-l2-gw.svg new file mode 100644 index 0000000000..f633254ac4 --- /dev/null +++ b/enhancements/network/images/multi-homing-l2-gw.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/images/openshift-router-multi-network.svg b/enhancements/network/images/openshift-router-multi-network.svg new file mode 100644 index 0000000000..0386aecabf --- /dev/null +++ b/enhancements/network/images/openshift-router-multi-network.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/enhancements/network/user-defined-network-segmentation.md b/enhancements/network/user-defined-network-segmentation.md new file mode 100644 index 0000000000..3a30845530 --- /dev/null +++ b/enhancements/network/user-defined-network-segmentation.md @@ -0,0 +1,969 @@ +--- +title: user-defined-network-segmentation +authors: + - "@trozet" + - "@qinqon" +reviewers: + - "@tssurya" + - "@danwinship" + - "@fedepaol" + - "@maiqueb" + - "@jcaamano" + - "@Miciah" + - "@dceara" + - "@dougbtv" +approvers: + - "@tssurya" + - "@jcaamano" +api-approvers: + - "None" +creation-date: 2024-05-03 +last-updated: 2024-05-28 +tracking-link: + - https://issues.redhat.com/browse/SDN-4789 +--- + +# User-Defined Network Segmentation + +## Summary + +OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. +Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all +pods connecting to the same layer 3 virtual topology. The scope of this effort is to bring the same flexibility of the +secondary network to the primary network. Therefore, pods are able to connect to different types of networks as their +primary network. + +Additionally, multiple and different instances of primary networks may co-exist for different users, and they will provide +native network isolation. + +## Terminology + +* **Primary Network** - The network which is used as the default gateway for the pod. Typically recognized as the eth0 +interface in the pod. 
+* **Secondary Network** - An additional network and interface presented to the pod. Typically created as an additional +Network Attachment Definition (NAD), leveraging Multus. Secondary Network in the context of this document refers to a +secondary network provided by the OVN-Kubernetes CNI. +* **Cluster Default Network** - This is the routed OVN network that pods attach to by default today as their primary network. +The pods default route, service access, as well as kubelet probe are all served by the interface (typically eth0) on this network. +* **User-Defined Network** - A network that may be primary or secondary, but is declared by the user. +* **Layer 2 Type Network** - An OVN-Kubernetes topology rendered into OVN where pods all connect to the same distributed +logical switch (layer 2 segment) which spans all nodes. Uses Geneve overlay. +* **Layer 3 Type Network** - An OVN-Kubernetes topology rendered into OVN where pods have a per-node logical switch and subnet. +Routing is used for pod to pod communication across nodes. This is the network type used by the cluster default network today. +Uses Geneve overlay. +* **Localnet Type Network** - An OVN-Kubernetes topology rendered into OVN where pods connect to a per-node logical switch +that is directly wired to the underlay. + +## Motivation + +As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, +each tenant (akin to Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. +Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes +the paradigm is opposite, by default all pods can reach other pods, and security is provided by implementing Network Policy. +Network Policy can be cumbersome to configure and manage for a large cluster. It also can be limiting as it only matches +TCP, UDP, and SCTP traffic. Furthermore, large amounts of network policy can cause performance issues in CNIs. With all +these factors considered, there is a clear need to address network security in a native fashion, by using networks per +tenant to isolate traffic. + +### User Stories + +* As a user I want to be able to migrate applications traditionally on OpenStack to Kubernetes, keeping my tenant network + space isolated and having the ability to use a layer 2 network. +* As a user I want to be able to ensure network security between my namespaces without having to manage and configure + complex network policy rules. +* As an administrator, I want to be able to provision networks to my tenants to ensure their networks and applications + are natively isolated from other tenants. +* As a user, I want to be able to request a unique, primary network for my namespace without having to get administrator + permission. +* As a user, I want user-defined primary networks to be able to have similar functionality as the cluster default network, + regardless of being on a layer 2 or layer 3 type network. Features like Egress IP, Egress QoS, Kubernetes services, + Ingress, and pod Egress should all function as they do today in the cluster default network. +* As a user, I want to be able to use my own consistent IP addressing scheme in my network. I want to be able to specify + and re-use the same IP subnet for my pods across different namespaces and clusters. This provides a consistent + and repeatable network environment for administrators and users. 
+ +### Goals + +* Provide a configurable way to indicate that a pod should be connected to a user-defined network of a specific type as a +primary interface. +* The primary network may be configured as a layer 3 or layer 2 type network. +* Allow networks to have overlapping pod IP address space. This range may not overlap with the default cluster subnet +used for allocating pod IPs on the cluster default network today. +* The cluster default primary network defined today will remain in place as the default network pods attach to. The cluster +default network will continue to serve as the primary network for pods in a namespace that has no primary user-defined network. Pods +with primary user-defined networks will still attach to the cluster default network with limited access to Kubernetes system resources. +Pods with primary user-defined networks will have at least two network interfaces, one connected to the cluster default network and one +connected to the user-defined network. Pods with primary user-defined networks will use the user-defined network as their default +gateway. +* Allow multiple namespaces per network. +* Support cluster ingress/egress traffic for user-defined networks, including secondary networks. +* Support for ingress/egress features on user-defined primary networks where possible: + * EgressQoS + * EgressService + * EgressIP + * Load Balancer and NodePort Services, as well as services with External IPs. +* In addition to ingress service support, there will be support for Kubernetes services in user-defined networks. The +scope of reachability to that service as well as endpoints selected for that service will be confined to the network +and corresponding namespace(s) where that service was created. +* Support for pods to continue to have access to the cluster default primary network for DNS and KAPI service access. +* Kubelet healthchecks/probes will still work on all pods. +* OpenShift Router/Ingress will work with some limitations for user-defined networks. + +### Non-Goals + +* Allowing different service CIDRs to be used in different networks. +* Localnet will not be supported initially for primary networks. +* Allowing multiple primary networks per namespace. +* Hybrid overlay support on user-defined networks. + +### Future-Goals + +* DNS lookup for pods returning records for IPs on the user-defined network. In the first phase DNS will return the pod +IP on the cluster default network instead. +* Admin ability to configure networks to have access to all services and/or expose services to be accessible from all +networks. +* Ability to advertise user-defined networks to external networks using BGP/EVPN. This will enable things like: + * External -> Pod ingress per VRF (Ingress directly to pod IP) + * Multiple External Gateway (MEG) in a BGP context, with ECMP routes +* Allow connection of multiple networks via explicit router API configuration. +* An API to allow user-defined ports for pods to be exposed on the cluster default network. This may be used for things +like promethus metric scraping. +* Potentially, coming up with an alternative solution for requiring the cluster default network connectivity to the pod, +and presenting the IP of the pod to Kubernetes as the user-defined primary network IP, rather than the cluster default +network IP. + +## Proposal + +By default in OVN-Kubernetes, pods are attached to what is known as the “cluster default" network, which is a routed network +divided up into a subnet per node. 
All pods will continue to have an attachment to this network, even when assigned a +different primary network. Therefore, when a pod is assigned to a user-defined network, it will have two interfaces, one +to the cluster default network, and one to the user-defined network. The cluster default network is required in order to provide: + +1. KAPI service access +2. DNS service access +3. Kubelet healthcheck probes to the pod + +All other traffic from the pod will be dropped by firewall rules on this network, when the pod is assigned a user-defined +primary network. Routes will be added to the pod to route KAPI/DNS traffic out towards the cluster default network. Note, +it may be desired to allow access to any Kubernetes service on the cluster default network (instead of just KAPI/DNS), +but at a minimum KAPI/DNS will be accessible. Furthermore, the IP of the pod from the Kubernetes API will continue to +show the IP assigned in the cluster default network. + +In OVN-Kubernetes secondary networks are defined using Network Attachment Definitions (NADs). For more information on +how these are configured, refer to: + +[https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/features/multi-homing.md](https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/features/multi-homing.md) + +The proposal here is to leverage this existing mechanism to create the network. A new field, “primaryNetwork” is +introduced to the NAD spec which indicates that this network should be used for the pod's primary network. Additionally, +a new "joinSubnet" field is added in order to specify the join subnet used inside the OVN network topology. An +example OVN-Kubernetes NAD may look like: + +``` +apiVersion: k8s.cni.cncf.io/v1 +kind: NetworkAttachmentDefinition +metadata: + name: l3-network + namespace: default +spec: + config: |2 + { + "cniVersion": "0.3.1", + "name": "l3-network", + "type": "ovn-k8s-cni-overlay", + "topology":"layer3", + "subnets": "10.128.0.0/16/24,2600:db8::/29", + "joinSubnet": "100.65.0.0/24,fd99::/64", + "mtu": 1400, + "netAttachDefName": "default/l3-network", + "primaryNetwork": true + } +``` + +The NAD must be created before any pods are created for this namespace. If cluster default networked pods existed before +the user-defined network was created, any further pods created in this namespace after the NAD was created will return +an error on CNI ADD. + +Only one primary network may exist per namespace. If more than one user-defined network is created with the +"primaryNetwork" key set to true, then future pod creations will return an error on CNI ADD until the network +configuration is corrected. + +A pod may not connect to multiple primary networks other than the cluster default. When the NAD is created, +OVN-Kubernetes will validate the configuration, as well as that no pods have been created in the namespace already. If +pods existed before the NAD was created, errors will be logged, and no further pods will be created in this namespace +until the network configuration is fixed. + +After creating the NAD, pods created in this namespace will connect to the newly defined network as their primary +network. The primaryNetwork key is used so that OVN-Kubernetes knows which network should be used, in case there are multiple +NADs created for a namespace (secondary networks). 
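For comparison, a layer 2 primary network would be requested in the same way, with only the topology and subnet fields
changing. A hypothetical example is shown below (the name and address ranges are illustrative only and must not overlap
with the cluster default subnets or with any other network attached to the same pods):

```
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: l2-network
  namespace: default
spec:
  config: |2
    {
      "cniVersion": "0.3.1",
      "name": "l2-network",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "10.200.0.0/16,2600:db8:1::/60",
      "joinSubnet": "100.66.0.0/24,fd9a::/64",
      "mtu": 1400,
      "netAttachDefName": "default/l2-network",
      "primaryNetwork": true
    }
```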
After a pod is created that will connect to a user-defined network, it will then be annotated by OVN-Kubernetes with the
appropriate networking config:

```
trozet@fedora:~/Downloads$ oc get pods -o yaml -n ns1
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.244.1.6/24","fd00:10:244:2::6/64"],"mac_address":"0a:58:0a:f4:01:06","routes":[{"dest":"10.244.0.0/16","nextHop":"10.244.1.1"},{"dest":"100.64.0.0/16","nextHop":"10.244.1.1"},{"dest":"fd00:10:244::/48","nextHop":"fd00:10:244:2::1"},{"dest":"fd98::/64","nextHop":"fd00:10:244:2::1"}],"type":"default"},"default/l3-network":{"ip_addresses":["10.128.1.3/24","2600:db8:0:2::3/64"],"mac_address":"0a:58:0a:80:01:03","gateway_ips":["10.128.1.1","2600:db8:0:2::1"],"routes":[{"dest":"10.128.0.0/16","nextHop":"10.128.1.1"},{"dest":"10.96.0.0/16","nextHop":"10.128.1.1"},{"dest":"100.64.0.0/16","nextHop":"10.128.1.1"},{"dest":"2600:db8::/29","nextHop":"2600:db8:0:2::1"},{"dest":"fd00:10:96::/112","nextHop":"2600:db8:0:2::1"},{"dest":"fd99::/64","nextHop":"2600:db8:0:2::1"}],"type":"primary"}}'
      k8s.v1.cni.cncf.io/network-status: |-
        [{
            "name": "default",
            "interface": "eth0",
            "ips": [
                "10.244.1.6"
            ],
            "mac": "0a:58:0a:f4:02:03",
            "default": true,
            "dns": {}
        }]
  status:
    phase: Running
    podIP: 10.244.1.6
    podIPs:
    - ip: 10.244.1.6
    - ip: fd00:10:244:2::6
```

In the above output the primary network is listed within the k8s.ovn.org/pod-networks annotation. However, the CNCF
network-status annotation does not contain the primary network. This is because OVN-Kubernetes will do an implicit CNI ADD
for both the cluster default network and the primary network. This way a user does not have to manually request that the
pod be attached to the primary network.

Multiple namespaces may also be configured to use the same network. In this case the underlying OVN network will be the
same, following a similar pattern to what is [already supported today for secondary networks](https://docs.openshift.com/container-platform/4.15/networking/multiple_networks/configuring-additional-network.html#configuration-ovnk-network-plugin-json-object_configuring-additional-network).

### CRDs for Managing Networks

Network Attachment Definitions (NADs) are the current way to configure the network in OVN-Kubernetes today, and the
method proposed in this enhancement. There are two major shortcomings of NADs:

1. The configuration is free-form and depends on the CNI. There is no API validation of what a user enters, leading to
mistakes which are not caught at configuration time and may cause unexpected functional behavior at runtime.
2. Creating a NAD requires cluster admin RBAC.

In order to address these issues, a proper CRD may be implemented which indirectly creates the NAD for OVN-Kubernetes.
This solution may consist of more than one CRD, namely an admin-scoped CRD and one that is namespace scoped for tenants.
The reasoning behind this is that we want tenants to be able to create their own user-defined network for their namespace,
but we do not want them to be able to connect to another namespace's network without permission. The admin-scoped version
would give higher-level access and allow an administrator to create a network that multiple namespaces could connect to.
+It may also expose more settings in the future for networks that would not be safe in the hands of a tenant, like +deciding if a network is able to reach other services in other networks. With tenants having access to be able to create +multiple networks, we need to consider potential attack vectors like a tenant trying to exhaust OVN-Kubernetes +resources by creating too many secondary networks. + +Furthermore, by utilizing a CRD, the status of the network CR itself can be used to indicate whether it is configured +by OVN-Kubernetes. For example, if a user creates a network CR and there is some problem (like pods already existed) then +an error status can be reported to the CR, rather than relying on the user to check OVN-Kubernetes logs. + +### IP Addressing + +As previously mentioned, one of the goals is to allow user-defined networks to have overlapping pod IP addresses. This +is enabled by allowing a user to configure what CIDR to use for pod addressing when they create the network. However, +this range cannot overlap with the default cluster CIDR used by the cluster default network today. + +Furthermore, the internal masquerade subnet and the Kubernetes service subnet will remain unique and will exist globally +to serve all networks. The masquerade subnet must be large enough to accommodate enough networks. Therefore, the +subnet size of the masquerade subnet is equal to the number of desired networks * 2, as we need 2 masquerade IPs per +network. The masquerade subnet remains localized to each node, so each node can use the same IP addresses and the size +of the subnet does not scale with number of nodes. + +The transit switch subnets may overlap between all networks. This network is just used for transport between nodes, and +is never seen by the pods or external clients. + +The join subnet of the default cluster network may not overlap with the join subnet of user-defined networks. This is +due to the fact that the pod is connected to the default network, as well as the user-defined primary network. The join +subnet is SNAT'ed by the GR of that network in order to facilitate ingress reply service traffic going back to the +proper GR, in case it traverses the overlay. For this reason, the pods may see this IP address and routes are added to +the pod to steer the traffic to the right interface (100.64.0.0/16 is the default cluster network join subnet): + +``` +[root@pod3 /]# ip route show +default via 10.244.1.1 dev eth0 +10.96.0.0/16 via 10.244.1.1 dev eth0 +10.244.0.0/16 via 10.244.1.1 dev eth0 +10.244.1.0/24 dev eth0 proto kernel scope link src 10.244.1.8 +100.64.0.0/16 via 10.244.1.1 dev eth0 +``` + +Since the pod needs routes for each join subnet, any layer 3 or layer 2 network that is attached to the pod needs a unique +join subnet. 
Consider a pod connected to the default cluster network, a user-defined, layer 3, primary network, and a +layer 2, secondary network: + +| Network | Pod Subnet | Node Pod Subnet | Join Subnet | +|-----------------|---------------|-----------------|---------------| +| Cluster Default | 10.244.0.0/16 | 10.244.0.0/24 | 100.64.0.0/16 | +| Layer 3 | 10.245.0.0/16 | 10.245.0.0/24 | 100.65.0.0/16 | +| Layer 2 | 10.246.0.0/16 | N/A | 100.66.0.0/16 | + + +The routing table would look like: + +``` +[root@pod3 /]# ip route show +default via 10.245.0.1 dev eth1 +10.96.0.0/16 via 10.245.0.1 dev eth1 +10.244.0.0/16 via 10.244.0.1 dev eth0 +10.245.0.0/16 via 10.245.0.1 dev eth1 +10.244.0.0/24 dev eth0 proto kernel scope link src 10.244.0.8 +10.245.0.0/24 dev eth1 proto kernel scope link src 10.245.0.8 +10.246.0.0/16 dev eth2 proto kernel scope link src 10.246.0.8 +100.64.0.0/16 via 10.244.0.1 dev eth0 +100.65.0.0/16 via 10.245.0.1 dev eth1 +100.66.0.0/16 via 10.246.0.1 dev eth2 +``` + +Therefore, when specifying a user-defined network it will be imperative to ensure that the networks a pod will connect to +do not have overlapping pod network or join network subnets. OVN-Kubernetes should be able to detect this scenario and +refuse to CNI ADD a pod with conflicts. + +### DNS + +DNS lookups will happen via every pod’s access to the DNS service on the cluster default network. CoreDNS lookups for +pods will resolve to the pod’s IP on the cluster default network. This is a limitation of the first phase of this feature +and will be addressed in a future enhancement. DNS lookups for services and external entities will function correctly. + +### Services + +Services in Kubernetes are namespace scoped. Any creation of a service in a namespace without a user-defined network +(using cluster default network as primary) will only be accessible by other namespaces also using the default network as +their primary network. Services created in namespaces served by user-defined networks, will only be accessible to +namespaces connected to the user-defined network. + +Since most applications require DNS and KAPI access, there is an exception to the above conditions where pods that are +connected to user-defined networks are still able to access KAPI and DNS services that reside on the cluster default +network. In the future, access to more services on the default network may be granted. However, that would require more +groundwork around enforcing network policy (which is evaluated typically after service DNAT) as potentially nftables +rules. Such work is considered a future enhancement and beyond the scope of this initial implementation. + +With this proposal, OVN-Kubernetes will check which network is being used for this namespace, and then only enable the +service there. The cluster IP of the service will only be available in the network of that service, except for KAPI and +DNS as previously explained. Host networked pods in a namespace with a user-defined primary network will also be limited +to only accessing the cluster IP of the services for that network. Load balancer IP and nodeport services are also +supported on user-defined networks. Service selectors are only able to select endpoints from the same namespace where the +service exists. Services that exist before the user-defined network is assigned to a namespace will result in +OVN-Kubernetes executing a re-sync on all services in that namespace, and updating all load balancers. 
Keep in mind that +pods must not exist in the namespace when the namespace is assigned to a new network or the new network assignment will +not be accepted by OVN-Kubernetes. + +Services in a user-defined network will be reachable by other namespaces that share the same network. + +As previously mentioned, Kubernetes API and DNS services will be accessible by all pods. + +Endpoint slices will provide the IPs of the cluster default network in Kubernetes API. For this implementation the required +endpoints are those IP addresses which reside on the user-defined primary network. In order to solve this problem, +OVN-Kubernetes may create its own endpoint slices or may choose to do dynamic lookups at runtime to map endpoints to +their primary IP address. Leveraging a second set of endpoint slices will be the preferred method, as it creates less +indirection and gives explicit Kube API access to what IP addresses are being used by OVN-Kubernetes. + +Kubelet health checks to pods are queried via the cluster default network. When endpoints are considered unhealthy they +will be removed from the endpoint slice, and thus their primary IP will be removed from the OVN load balancer. However, +it is important to note that the healthcheck is being performed via the cluster default network interface on the pod, +which ensures the application is alive, but does not confirm network connectivity of the primary interface. Therefore, +there could be a situation where OVN networking on the primary interface is broken, but the default interface continues +to work and reports 200 OK to Kubelet, thus rendering the pod serving in the endpoint slice, but unable to function. +Although this is an unlikely scenario, it is good to document. + +### Network Policy + +Network Policy will be fully supported for user-defined primary networks as it is today with the cluster default network. +However, configuring network policies that allow traffic between namespaces that connect to different user-defined +primary networks will have no effect. This traffic will not be allowed, as the networks have no connectivity to each other. +These types of policies will not be invalidated by OVN-Kubernetes, but the configuration will have no effect. Namespaces +that share the same user-defined primary network will still benefit from network policy that applies access control over +a shared network. Additionally, policies that block/allow cluster egress or ingress traffic will still be enforced for +any user-defined primary network. + +### API Extensions + +The main API extension here will be a namespace scoped network CRD as well a cluster scoped network CRD. These CRDs +will be registered by Cluster Network Operator (CNO). See the [CRDs for Managing Networks](#crds-for-managing-networks) +section for more information on how the CRD will work. There will be a finalizer on the CRDs, so that upon deletion +OVN-Kubernetes can validate that there are no pods still using this network. If there are pods still attached to this +network, the network will not be removed. + +### Workflow Description + +#### Tenant Use Case + +As a tenant I want to ensure when I create pods in my namespace their network traffic is isolated from other tenants on +the cluster. 
In order to ensure this, I first create a network CRD that is namespace scoped and indicate:

 - Type of network (Layer 3 or Layer 2)
 - IP addressing scheme I wish to use (optional)
 - Indicate this network will be the primary network

After creating this CRD, I can check the status of the CRD to ensure it is actively being used as the primary network
for my namespace by OVN-Kubernetes. Once verified, I can now create pods and they will be in their own isolated SDN.

#### Admin Use Case

As an admin, I have a customer who has multiple namespaces and wants to connect them all to the same private network. In
order to accomplish this, I first create an admin network CRD that is cluster scoped and indicate:

- Type of network (Layer 3 or Layer 2)
- IP addressing scheme I wish to use (optional)
- Indicate this network will be the primary network
- Selector to decide which namespaces may connect to this network. May use the `kubernetes.io/metadata.name` label to
guarantee uniqueness and eliminate the ability to falsify access.

After creating the CRD, check the status to ensure OVN-Kubernetes has accepted this network to serve the namespaces
selected. Now tenants may go ahead and have their namespaces provisioned.

### Topology Considerations

#### Hypershift / Hosted Control Planes

The management cluster should have no reason to use multiple networks, unless for security reasons it makes sense
to use native network isolation over network policy. It makes more sense that multiple primary networks will be used in
the hosted cluster in order to provide tenants with better isolation from each other, without the need for network
policy. There should be no HyperShift platform-specific considerations with this feature.

#### Standalone Clusters

Full support.

#### Single-node Deployments or MicroShift

SNO and MicroShift will both have full support for creating multiple primary networks. There may be some increased
resource usage when scaling up the number of networks, as it requires more configuration in OVN and more processing in
OVN-Kubernetes.

### Risks and Mitigations

The biggest risk with this feature is hitting scale limitations. With many namespaces and networks, the number of
internal OVN objects will multiply, as will the number of internal kernel devices, rules, and VRFs. There will need to
be a large-scale effort to determine how many networks we can comfortably support.

There is also a risk of breaking secondary projects that integrate with OVN-Kubernetes, such as MetalLB or Submariner.

### Drawbacks

As described in the Design Details section, this proposal will require reserving two IPs per network in the masquerade
subnet. This is a private subnet only used internally by OVN-Kubernetes, but it will require increasing the subnet size
in order to accommodate multiple networks. Today this subnet by default is configured as a /29 for IPv4, and only 6 IP
addresses are used. With this new design, users will need to reconfigure their subnet to be large enough to hold the
desired number of networks. Note, API changes will need to be made in order to support changing the masquerade subnet
post-installation.

### Implementation Details/Notes/Constraints

OVN offers the ability to create multiple virtual topologies. As with secondary networks in OVN-Kubernetes today,
separate topologies are created whenever a new network is needed. The same methodology will be leveraged for this design.
+Whenever a new network of type layer 3 or layer 2 is requested, a new topology will be created for that network where +pods may connect to. + +The limitation today with secondary networks is that there is only support for east/west traffic. This RFE will address +adding support for user-defined primary and secondary network north/south support. In order to support north/south +traffic, pods on different networks need to be able to egress, typically using the host’s IP. Today in shared gateway +mode we use a Gateway Router (GR) in order to provide this external connectivity, while in local gateway mode, the host +kernel handles SNAT’ing and routing out egress traffic. Ingress traffic also follows similar, and reverse paths. There +are some exceptions to these rules: + +1. MEG traffic always uses the GR to send and receive traffic. +2. Egress IP on the primary NIC always uses the GR, even in local gateway mode. +3. Egress Services always use the host kernel for egress routing. + +To provide an ingress/egress point for pods on different networks the most simple solution may appear to be to connect +them all to a single gateway router. This introduces an issue where now networks are all connected to a single router, +and there may be routing happening between networks that were supposed to be isolated from one another. Furthermore in +the future, we will want to extend these networks beyond the cluster, and to do that in OVN would require making a +single router VRF aware, which adds more complexity into OVN. + +The proposal here is to create a GR per network. With this topology, OVN will create a patch port per network to the +br-ex bridge. OVN-Kubernetes will be responsible for being VRF/network aware and forwarding packets via flows in br-ex +to the right GR. Each per-network GR will only have load balancers configured on it for its network, and only be able to +route to pods in its network. The logical topology would look something like this, if we use an example of having a +cluster default primary network, a layer 3 primary network, and a layer 2 primary network: + +![VRF Topology](images/VRFs.svg) + +In the above diagram, each network is assigned a unique conntrack zone and conntrack mark. These are required in order +to be able to handle overlapping networks egressing into the same VRF and SNAT’ing to the host IP. Note, the default +cluster network does not need to use a unique CT mark or zone, and will continue to work as it does today. This is due +to the fact that no user-defined network may overlap with the default cluster subnet. More details in the next section. + +#### Shared Gateway Mode + +##### Pod Egress + +On pod egress, the respective GR of that network will handle doing the SNAT to a unique masquerade subnet IP assigned to +this network. For example, in the above diagram packets leaving GR-layer3 would be SNAT’ed to 169.254.169.5 in zone 64005. +The packet will then enter br-ex, where flows in br-ex will match this packet, and then SNAT the packet to the node IP +in zone 0, and apply its CT mark of 5. Finally, the packet will be recirculated back to table 0, where the packet will +be CT marked with 1 in zone 64000, and sent out of the physical interface. In OVN-Kubernetes we use zone 64000 to track +things from OVN or the host and additionally, we mark packets from OVN with a CT Mark of 1 and packets from the host +with 2. 
Pseudo OpenFlow rules would look like this (assuming a node IP of 172.18.0.3):

```
pod-->GR(snat, 169.254.169.5, zone 64005)->br-ex(snat, 172.18.0.3, zone 0, mark 5, table=0) -->recirc table0 (commit
zone 64000, mark 1) -->eth0
```

The above design will accommodate overlapping networks with overlapping ports. The worst-case scenario is if two
networks share the same address space, and two pods with identical IPs are trying to connect externally using the same
source and destination port. Although unlikely, we have to plan for this type of scenario. When each pod tries to send a
packet through its respective GR, SNAT’ing to the unique GR masquerade IP differentiates the conntrack entries. Now,
when the final SNAT occurs in br-ex with zone 0, they can be determined as different connections via source IP, and when
SNAT’ing to the host IP, conntrack will detect a collision using the same layer 4 port, and choose a different port to use.

When reply traffic comes back into the cluster, we must now submit the packet to conntrack to find which network this
traffic belongs to. The packet is always first sent into zone 64000, where it is determined whether this packet
belonged to OVN (CT mark of 1) or the host. Once identified by CT mark as OVN traffic, the packet will then be unSNAT’ed
in zone 0 via br-ex rules, and the CT mark of the network it belongs to is restored. Finally, we can send the packet to
the correct GR via the right patch port, by matching on the restored CT mark. From there, OVN will handle unSNAT’ing the
masquerade IP and forwarding the packet to the original pod.

To support KubeVirt live migration, the GR LRP will have an extra address: the configured gateway IP for the layer 2
subnet (so that the gateway IP is independent of the node where the VM is running). After live migration succeeds,
OVN should send a GARP so that VMs clean up their ARP tables, since the gateway IP now has a different MAC address.

The layer 2 live migration feature described here will work only with OVN interconnect (OVN IC, which is used by OCP).
Since there is no MAC learning between zones, the same extra address can be configured on every gateway router port,
essentially implementing anycast for this SVI address.

The following diagram illustrates these pieces in a single topology:

![Layer 2 Egress Topology](images/multi-homing-l2-gw.svg)

##### Services

When ingress service traffic enters br-ex, there are flows installed that steer service traffic towards the OVN GR. With
additional networks, these flows will be modified to steer traffic to the correct GR-<network>’s patch port.

When a host process or host networked pod on a Kubernetes node initiates a connection to a service, iptables rules will
DNAT the nodeport or loadbalancer IP into the cluster IP, and then send the traffic via br-ex where it is masqueraded
and sent into the OVN GR. These flows can all be modified to detect the service IP and then send the traffic to the
correct GR-<network> patch port.
For example, in the br-ex (breth0) bridge today we have flows that match on packets sent +to the service CIDR (10.96.0.0/24): + +``` +[root@ovn-worker ~]# ovs-ofctl dump-flows breth0 table=0 | grep 10.96 + cookie=0xdeff105, duration=22226.373s, table=0, n_packets=41, n_bytes=4598, idle_age=19399,priority=500,ip,in_port=LOCAL,nw_dst=10.96.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) +``` + +Packets that are destined to the service CIDR are SNAT'ed to the masquerade IP of the host (169.254.169.2) and then +sent to the dispatch table 2: + +``` +[root@ovn-worker ~]# ovs-ofctl dump-flows breth0 table=2 + cookie=0xdeff105, duration=22266.310s, table=2, n_packets=41, n_bytes=4598, actions=mod_dl_dst:02:42:ac:12:00:03,output:"patch-breth0_ov" +``` + +In the above flow, all packets have the dest MAC address changed to be that of the OVN GR, and then sent on the patch port +towards the OVN GR. With multiple networks, host access to cluster IP service flows will now be modified to be on a per +cluster IP basis. For example, if we assume two services exist on two user defined namespaces with cluster IPs 10.96.0.5 +and 10.96.0.6. The flows would look like: + +``` +[root@ovn-worker ~]# ovs-ofctl dump-flows breth0 table=0 | grep 10.96 + cookie=0xdeff105, duration=22226.373s, table=0, n_packets=41, n_bytes=4598, idle_age=19399,priority=500,ip,in_port=LOCAL,nw_dst=10.96.0.5 actions=set_field:2->reg1,ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) + cookie=0xdeff105, duration=22226.373s, table=0, n_packets=41, n_bytes=4598, idle_age=19399,priority=500,ip,in_port=LOCAL,nw_dst=10.96.0.6 actions=set_field:3->reg1,ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) +``` + +The above flows are now per cluster IP and will send the packet to the dispatch table while also setting unique register +values to differentiate which OVN network these packets should be delivered to: + +``` +[root@ovn-worker ~]# ovs-ofctl dump-flows breth0 table=2 + cookie=0xdeff105, duration=22266.310s, table=2, n_packets=41, n_bytes=4598, reg1=0x2 actions=mod_dl_dst:02:42:ac:12:00:05,output:"patch-breth0-net1" + cookie=0xdeff105, duration=22266.310s, table=2, n_packets=41, n_bytes=4598, reg1=0x3 actions=mod_dl_dst:02:42:ac:12:00:06,output:"patch-breth0-net2" +``` + +Furthermore, host networked pod access to services will be restricted to the network it belongs to. For more information +see the [Host Networked Pods](#host-networked-pods) section. + +Additionally, in the case where there is hairpin service traffic to the host +(Host->Service->Endpoint is also the host), the endpoint reply traffic will need to be distinguishable on a per network +basis. In order to achieve this, each OVN GR’s unique masquerade IP will be leveraged. + +For service access towards KAPI/DNS or potentially other services on the cluster default network, there are two potential +technical solutions. Assume eth0 is the pod interface connected to the cluster default network, and eth1 is connected to the +user-defined primary network: + +1. Add routes for KAPI/DNS specifically into the pod to go out eth0, while all other service access will go to eth1. +This will then just work normally with the load balancers on the switches for the respective networks. + +2. Do not send any service traffic out of eth0, instead all service traffic goes to eth1. In this case all service +traffic is flowing through the user-defined primary network, where only load balancers for that network are configured +on that network's OVN worker switch. 
Therefore, packets to KAPI/DNS (services not on this network) are not DNAT'ed at +the worker switch and are instead forwarded onwards to the ovn_cluster_router_<user-defined network> or +GR-<node-user-defined-network> for layer 3 or layer 2 networks, respectively . This router is +configured to send service CIDR traffic to ovn-k8s-mp0-<user-defined network>. IPTables rules in the host only permit +access to KAPI/DNS and drop all other service traffic coming from ovn-k8s-mp0-<user-defined network>. The traffic then +gets routed to br-ex and default GR where it hits the OVN load balancer there and forwarded to the right endpoint. + +While the second option is more complex, it allows for not configuring routes to service addresses in the pod that could +hypothetically change. + +##### Egress IP + +This feature works today by labeling and choosing a node+network to be used for egress, and then OVN logical routes and +logical route policies are created which steer traffic from a pod towards a specific gateway router (for primary network +egress). From there the packets are SNAT’ed by the OVN GR to the egress IP, and sent to br-ex. Egress IP is cluster +scoped, but applies to selected namespaces, which will allow us to only apply the SNAT and routes to the GR and OVN +topology elements of that network. In the layer 3 case, the current design used today for the cluster default primary +network will need some changes. Since Egress IP may be served on multiple namespaces and thus networks, it is possible +that there could be a collision as previously mentioned in the Pod Egress section. Therefore, the same solution provided +in that section where the GR SNATs to the masquerade subnet must be utilized. However, once the packet arrives in br-ex +we will need a way to tell if it was sent from a pod affected by a specific egress IP. To address this, pkt_mark will be +used to mark egress IP packets and signify to br-ex which egress IP to SNAT to. An example where the egress IP is +1.1.1.1 that maps to pkt_mark 10 would look something like this: + +![Egress IP VRF SGW](images/egress-ip-vrf-sgw.svg) + +For layer 2, egress IP has never been supported before. With the IC design, there is no need to have an +ovn_cluster_router and join switch separating the layer 2 switch network (transit switch) from the GR. For non-IC cases +this might be necessary, but for OpenShift purposes will only describe the behavior of IC in this proposal. In the layer +2 IC model, GRs per node on a network will all be connected to the layer 2 transit switch: + +![Egress IP Layer 2](images/egress-ip-l2-primary.svg) + +In the above diagram, Node 2 is chosen to be the egress IP node for any pods in namespace A. Pod 1 and Pod 2 have +default gateway routes to their respective GR on their node. When egress traffic leaves Pod 2, it is sent towards its +GR-A on node 2, where it is SNAT’ed to the egress IP and the traffic sent to br-ex. For Pod 1, its traffic is sent to +its GR-A on Node 1, where it is then rerouted towards GR-A on Node 2 for egress. + +##### Egress Firewall + +Egress firewall is enforced at the OVN logical switch, and this proposal has no effect on its functionality. + +##### Egress QoS + +Egress QoS is namespace scoped and functions by marking packets at the OVN logical switch, and this proposal has no +effect on its functionality. + +##### Egress Service + +Egress service is namespace scoped and its primary function is to SNAT egress packets to a load balancer IP. 
As +previously mentioned, the feature works the same in shared and local gateway mode, by leveraging the local gateway mode +path. Therefore, its design will be covered in the Local Gateway Mode section of the Design Details. + +##### Multiple External Gateways (MEG) + +There will be no support for MEG or pod direct ingress on any network other than the primary, cluster default network. +This support may be enhanced later by extending VRFs/networks outside the cluster. + +#### Local Gateway Mode + +With local gateway mode, egress/ingress traffic uses the kernel’s networking stack as a next hop. OVN-Kubernetes +leverages an interface named “ovn-k8s-mp0” in order to facilitate sending traffic to and receiving traffic from the +host. For egress traffic, the host routing table decides where to send the egress packet, and then the source IP is +masqueraded to the node IP of the egress interface. For ingress traffic, the host routing table steers packets destined +for pods via ovn-k8s-mp0 and SNAT’s the packet to the interface address. + +For multiple networks to use local gateway mode, some changes are necessary. The ovn-k8s-mp0 port is a logical port in +the OVN topology tied to the cluster default network. There will need to be multiple ovn-k8s-mp0 ports created, one per +network. Additionally, all of these ports cannot reside in the default VRF of the host network. Doing so would result in +an inability to have overlapping subnets, as well as the host VRF would be capable of routing packets between namespace +networks, which is undesirable. Therefore, each ovn-k8s-mp0-<network> interface must be placed in its own VRF: + +![Local GW Node Setup](images/local-gw-node-setup-vrfs.svg) + +The VRFs will clone the default routing table, excluding routes that are created by OVN-Kubernetes for its networks. +This is similar to the methodology in place today for supporting +[Egress IP with multiple NICs](https://github.com/openshift/enhancements/blob/master/enhancements/network/egress-ip-multi-nic.md). + +##### Pod Egress + +Similar to the predicament outlined in Shared Gateway mode, we need to solve the improbable case where two networks have +the same address space, and pods with the same IP/ports are trying to talk externally to the same server. In this case, +OVN-Kubernetes will reserve an extra IP from the masquerade subnet per network. This masquerade IP will be used to SNAT +egress packets from pods leaving via mp0. The SNAT will be performed by ovn_cluster_router for layer 3 networks and +the gateway router (GR) for layer 2 networks using configuration like: + +``` +ovn-nbctl --gateway-port=rtos-ovn-worker lr-nat-add DR snat 10.244.0.6 169.254.169.100 +``` + +Now when egress traffic arrives in the host via mp0, it will enter the VRF, where clone routes will route the packet as +if it was in the default VRF out a physical interface, typically towards br-ex, and the packet is SNAT’ed to the host IP. + +When the egress reply comes back into the host, iptables will unSNAT the packet and the destination will be +169.254.169.100. At this point, an ip rule will match the destination on the packet and do a lookup in the VRF where a +route specifying 169.254.169.100/32 via 10.244.0.1 will cause the packet to be sent back out the right mp0 port for the +respective network. + +Note, the extra masquerade SNAT will not be required on the cluster default network's ovn-k8s-mp0 port. 
This will +preserve the previous behavior, and it is not necessary to introduce this SNAT since the default cluster network subnet +may not overlap with user-defined networks. + +##### Services + +Local gateway mode services function similar to the behavior described in host -> service description in the Shared +Gateway Mode Services section. When the packet enters br-ex, it is forwarded to the host, where it is then DNAT’ed to +the cluster IP and typically sent back into br-ex towards the OVN GR. This traffic will behave the same as previously +described. There are some exceptions to this case, namely when external traffic policy (ETP) is set to local. In this +case traffic is DNAT’ed to a special masquerade IP (169.254.169.3) and sent via ovn-k8s-mp0. There will need to be IP +rules to match on the destination node port and steer traffic to the right VRF for this case. Additionally, with internal +traffic policy (ITP) is set to local, packets are marked in the mangle table and forwarded via ovn-k8s-mp0 with an IP +rule and routing table 7. This logic will need to ensure the right ovn-k8s-mp0 is chosen for this case as well. + +##### Egress IP + +As previously mentioned, egress IP on the primary NIC follows the pathway of shared gateway mode. The traffic is not +routed by the kernel networking stack as a next hop. However, for multi-nic support, packets are sent into the kernel +via the ovn-k8s-mp0 port. Here the packets are matched on, sent to an egress IP VRF, SNAT’ed and sent out the chosen +interface. The detailed steps for a pod with IP address 10.244.2.3 affected by egress IP look like: + +1. Pod sends egress packet, arrives in the kernel via ovn-k8s-mp0 port, the packet is marked with 1008 (0x3f0 in hex) +if it should skip egress IP. It has no mark if the packet should be affected by egress IP. +2. IP rules match the source IP of the packet, and send it into an egress IP VRF (rule 6000): + + ``` + sh-5.2# ip rule + + 0: from all lookup local + 30: from all fwmark 0x1745ec lookup 7 + 5999: from all fwmark 0x3f0 lookup main + 6000: from 10.244.2.3 lookup 1111 + 32766: from all lookup main + 32767: from all lookup default + ``` + +3. Iptables rules save the packet mark in conntrack. This is only applicable to packets that were marked with 1008 and +are bypassing egress IP: + + ``` + sh-5.2# iptables -t mangle -L PREROUTING + + Chain PREROUTING (policy ACCEPT) + target prot opt source destination + CONNMARK all -- anywhere anywhere mark match 0x3f0 CONNMARK save + CONNMARK all -- anywhere anywhere mark match 0x0 CONNMARK restore + ``` + +4. VRF 1111 has a route in it to steer the packet to the right egress interface: + + ``` + sh-5.2# ip route show table 1111 + default dev eth1 + ``` + +5. IPTables rules in NAT table SNAT the packet: + + ``` + -A OVN-KUBE-EGRESS-IP-MULTI-NIC -s 10.244.2.3/32 -o eth1 -j SNAT --to-source 10.10.10.100 + ``` + +6. For reply bypass traffic, the 0x3f0 mark is restored, and ip rules 5999 send it back into default VRF for routing +back into mp0 for non-egress IP packets. This is rule and connmark restoring is required for the packet to pass the +reverse path filter (RPF) check. For egress IP reply packets, there is no connmark restored and the packets hit the +default routing table to go back into mp0. + +This functionality will continue to work, with ip rules steering the packets from the per network VRF to the appropriate +egress IP VRF. CONNMARK will continue to be used so that return traffic is sent back to the correct VRF. 
Step 5 in the +above may need to be tweaked to match on mark in case 2 pods have overlapping IPs, and are both egressing +the same interface with different Egress IPs. The flow would look something like this: + +![Egress IP VRF LGW](images/egress-ip-vrf-lgw.svg) + +##### Egress Firewall + +Egress firewall is enforced at the OVN logical switch, and this proposal has no effect on its functionality. + + +##### Egress QoS + +Egress QoS is namespace scoped and functions by marking packets at the OVN logical switch, and this proposal has no +effect on its functionality. + +##### Egress Service + +Egress service functions similar to Egress IP in local gateway mode, with the exception that all traffic paths go +through the kernel networking stack. Egress Service also uses IP rules and VRFs in order to match on traffic and forward +it out the right network (if specified in the CRD). It uses iptables in order to SNAT packets to the load balancer IP. +Like Egress IP, with user-defined networks there will need to be IP rules with higher precedence to match on packets from +specific networks and direct them to the right VRF. + +##### Multiple External Gateways (MEG) + +There will be no support for MEG or pod direct ingress on any network other than the primary, cluster default network. +Remember, MEG works the same way in local or shared gateway mode, by utilizing the shared gateway path. This support may +be enhanced later by extending VRFs/networks outside the cluster. + +#### Kubernetes Readiness/Liveness Probes + +As previously mentioned, Kubelet probes will continue to work. This includes all types of probes such as TCP, HTTP or +GRPC. Additionally, we want to restrict host networked pods in namespaces that belong to user-defined networks from +being able to access pods in other networks. For that reason, we need to block host networked pods from being able to +access pods via the cluster default network. In order to do this, but still allow Kubelet to send probes; the cgroup +module in iptables will be leveraged. For example: + +``` +root@ovn-worker:/# iptables -L -t raw -v +Chain PREROUTING (policy ACCEPT 6587 packets, 1438K bytes) + pkts bytes target prot opt in out source destination + +Chain OUTPUT (policy ACCEPT 3003 packets, 940K bytes) + pkts bytes target prot opt in out source destination + 3677 1029K ACCEPT all -- any any anywhere anywhere cgroup kubelet.slice/kubelet.service + 0 0 ACCEPT all -- any any anywhere anywhere ctstate ESTABLISHED + 564 33840 DROP all -- any any anywhere 10.244.0.0/16 +``` + +From the output we can see that traffic to the pod network ```10.244.0.0/16``` will be dropped by default. However, +traffic coming from kubelet will be allowed. + +#### Host Networked Pods + +##### VRF Considerations + +By encompassing VRFs into the host, this introduces some constraints and requirements for the behavior of host networked +type pods. If a host networked pod is created in a Kubernetes namespace that has a user-defined network, it should be +confined to only talking to ovn-networked pods on that same user-defined network. + +With Linux VRFs, different socket types behave differently by default. Raw, unbound sockets by default are allowed to +listen and span multiple VRFs, while TCP, UDP, SCTP and other protocols are restricted to the default VRF. 
There are
settings to control this behavior via sysctl, with the defaults looking like this:

```
trozet@fedora:~/Downloads/ip-10-0-169-248.us-east-2.compute.internal$ sudo sysctl -A | grep net | grep l3mdev
net.ipv4.raw_l3mdev_accept = 1
net.ipv4.tcp_l3mdev_accept = 0
net.ipv4.udp_l3mdev_accept = 0
```

Note, there is no current [support in the kernel for SCTP](https://lore.kernel.org/netdev/bf6bcf15c5b1f921758bc92cae2660f68ed6848b.1668357542.git.lucien.xin@gmail.com/),
and it does not look like there is support for IPv6. Given the desired behavior to restrict host networked pods to
talking to only pods in their namespace/network, it may make sense to set raw_l3mdev_accept to 0. This is set to 1 by
default to allow legacy ping applications to work over VRFs. Furthermore, a user modifying sysctl settings to allow
applications to listen across all VRFs will be unsupported. Reasons include the odd behavior and interactions that can
occur with applications communicating across multiple VRFs, as well as the fact that this would break the native
network isolation paradigm offered by this feature.

For host network pods to be able to communicate with pod IPs on their user-defined network, the only supported method
will be for the applications to bind their socket to the VRF device. Many applications will not be able to support this,
so in the future it makes sense to come up with a better solution. One possibility is to use eBPF in order to intercept
the socket bind call of an application (that typically will bind to INADDR_ANY) and force it to bind to the VRF device.
Note, host network pods will still be able to communicate with pods via services that belong to their user-defined network
without any limitations. See the next section [Service Access](#service-access) for more information.

Keep in mind that if a host network pod runs and does not bind to the VRF device, it will be able to communicate on the
default VRF. This means the host networked pod will be able to talk to other host network pods. However, due to nftables
rules in the host, it will not be able to talk to OVN networked pods via the default cluster network/VRF.

For more information on how VRFs function in Linux and the settings discussed in this section, refer to
[https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/networking/vrf.rst?h=v6.1](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/networking/vrf.rst?h=v6.1).

##### Service Access

Host networked pods in a user-defined network will be restricted to only accessing services in either:
1. The cluster default network.
2. The user-defined network to which the host networked pod's namespace belongs.

This will be enforced by iptables/nftables rules that match on the cgroup of the host networked pod.
For example: + +``` +root@ovn-worker:/# iptables -L -t raw -v +Chain PREROUTING (policy ACCEPT 60862 packets, 385M bytes) + pkts bytes target prot opt in out source destination + +Chain OUTPUT (policy ACCEPT 36855 packets, 2504K bytes) + pkts bytes target prot opt in out source destination + 17 1800 ACCEPT all -- any any anywhere 10.96.0.1 cgroup /kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-pod992d3b9e_3f85_42e2_9558_9d4273d4236f.slice +23840 6376K ACCEPT all -- any any anywhere anywhere cgroup kubelet.slice/kubelet.service + 0 0 ACCEPT all -- any any anywhere anywhere ctstate ESTABLISHED + 638 37720 DROP all -- any any anywhere 10.244.0.0/16 + 28 1440 DROP all -- any any anywhere 10.96.0.0/16 +``` +In the example above, access to the service network of ```10.96.0.0/16``` is denied by default. However, one host networked +pod is given access to the 10.96.0.1 cluster IP service, while other host networked pods are blocked from access. + +#### OpenShift Router/Ingress + +One of the goals of this effort is to allow ingress services to pods that live on primary user-defined networks. + +OpenShift router is deployed in two different ways: + +* On-prem: deployed as a host networked pod running haproxy, using a keepalived VIP. +* Cloud: deployed as an OVN networked pod running haproxy, fronted by a cloud external load balancer which forwards +traffic to a nodeport service with ETP Local. + +OpenShift router is implemented as HAProxy, which is capable of terminating TLS or allowing TLS passthrough and +forwarding HTTP connections directly to endpoints of services. + +##### On-Prem + +Due to previously listed constraints in the Host Networked Pods section, host networked pods are unable to talk directly +to OVN networked pods unless bound to the VRF itself. HAProxy needs to communicate with potentially many VRFs, so +therefore this functionality does not work. However, a host networked pod is capable of reaching any service CIDR. In +order to solve this problem, openshift-ingress-operator will be modified so that when it configures HAProxy, it will +check if the service belongs to a user-defined network, and then use the service CIDR as the forwarding path. + +##### Cloud + +In the cloud case, the ovn-networked HAProxy pod cannot reach pods on other networks. Like the On-Prem case, HAProxy +will be modified for user-defined networks to forward to the service CIDR rather than the endpoint directly. The caveat +here is that an OVN networked pod only has service access to: + +1. CoreDNS and Kube API via the cluster default network +2. Any service that resides on the same namespace as the pod + +To solve this problem, the OpenShift ingress namespace will be given permission to access any service, by forwarding +service access via mp0, and allowing the access with ipTables rules. In OVN-Kubernetes upstream, we can allow a +configuration to list certain namespaces that are able to have access to all services. This should be reserved for +administrators to configure. + +A diagram of how the Cloud traffic path will work: + +![OpenShift Router Multi Network](images/openshift-router-multi-network.svg) + +##### Limitations + +With OpenShift Router, stickiness is used to ensure that traffic for a session always reaches the same endpoint. For TLS +passthrough, HAProxy uses the source IP address as that's all it has (it's not decrypting the connection, so it can't +look at cookies, for example). Otherwise, it uses a cookie. 
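As explained below, cookie-based stickiness will not be available when OpenShift router forwards to a user-defined
network; session affinity by client IP can instead be requested on the Service itself. A minimal, hypothetical example
(names and timeout are illustrative):

```
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: tenant-a
spec:
  selector:
    app: my-app
  ports:
  - port: 8080
    targetPort: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
```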
+ +By changing the behavior of HAProxy to forward to a service CIDR instead of an endpoint directly, we cannot accommodate +using cookies for stickiness. However, a user can configure the service with sessionAffinity, which will cause OVN to +use stickiness by IP. Therefore, users who wish to use OpenShift router with user-defined networks will be limited to +only enabling session stickiness via client IP. + +## Test Plan + +* E2E upstream and downstream jobs covering supported features across multiple networks. +* E2E tests which ensure network isolation between OVN networked and host networked pods, services, etc. +* E2E tests covering network subnet overlap and reachability to external networks. +* Scale testing to determine limits and impact of multiple user-defined networks. This is not only limited to OVN, but + also includes OVN-Kubernetes’ design where we spawn a new network controller for every new network created. +* Integration testing with other features like IPSec to ensure compatibility. + +## Graduation Criteria + +### Dev Preview -> Tech Preview + +There will be no dev or tech preview for this feature. + +### Tech Preview -> GA + +Targeting GA in OCP version 4.17. + +### Removing a deprecated feature + +N/A + +## Upgrade / Downgrade Strategy + +There are no specific requirements to be able to upgrade. The cluster default primary network will continue to function +as it does in previous releases on upgrade. Only newly created namespaces and pods will be able to leverage this feature +post upgrade. + +## Version Skew Strategy + +N/A + +## Operational Aspects of API Extensions + +Operating with user-defined primary networks using a network CRD API will function mostly the same as default cluster +network does today. Creating the user-defined networks shall be done for namespace(s) before pods are created in that +namespace. Otherwise, pods will fail during CNI add. Any functional limitations with user-defined networks are outlined +in other sections of this document. + +## Support Procedures + +## Alternatives