
Adding xDS Adapter CFP #14 (Draft)

New file: `cilium/CFP-30235-xds-adapter.md` (+204 lines)

---
# CFP-30235: xDS Adapter for ClusterIP Routing

**SIG:** Agent

**Begin Design Discussion:** 2023-01-03

**Cilium Release:** 1.16

**Authors:** Rob Scott <[email protected]>

## Summary

Add a new xDS adapter in Cilium that could take advantage of some of the
strengths of xDS, particularly the feedback loop via LRS and the overall
potential for scalability improvements when adjustments to routing
configurations don't need to round trip through the Kubernetes API Server.

> **Comment (Member):** Won't they need to round trip through an xDS controller / server?
>
> **Reply (Contributor):** It seems like one of the major advantages of an xDS control plane managing this information is that it's only handling this information, rather than being a generic workload-management API like Kubernetes.
>
> For some use cases, I don't think it will ever make sense to put xDS between Cilium and Kubernetes, but for others, particularly endpoints and endpoint grouping (Clusters in xDS, Services in Kubernetes), it seems more straightforward to map these to xDS objects.

This adapter would be an alternative source of endpoints and would not replace
the existing default behavior of reading directly from Kubernetes APIs.

## Motivation

There are some capabilities that the Kubernetes API Server is not well equipped
to handle. This includes topology aware routing (or any form of routing that
requires a feedback loop) and routing to endpoints outside of the cluster.

## Goals

* Describe how an xDS adapter could work in Cilium
* Highlight new capabilities that xDS could enable

## Non-Goals

* Build anything additional on top of this xDS adapter. Although this could be a
foundational technology for several new features, those are out of scope for
this proposal.

> **Comment on lines +38 to +40 (Member):** I don't understand how we can have an xDS adapter without having an implementation of the server / controller.
>
> **Reply (Contributor):** I agree that to validate that this works, we'll need an xDS control plane to talk to. Which is a not-insignificant engineering problem in itself.
>
> **Reply (Member Author):** Yeah, that's a good point. What I'd really intended here was that this CFP was meant to be about foundational building blocks/infrastructure, not features. I agree that an xDS control plane needs to be bundled with this in some way and have already added it as phase 3 of this CFP.


## Background

Several years ago, Matt Klein published a blog post charting a vision for xDS as
a [universal data plane
API](https://blog.envoyproxy.io/the-universal-data-plane-api-d15cec7a). Although
these APIs are primarily used by Envoy and gRPC today, there has been a broader
goal in the community to enable more widespread usage as part of a truly
universal data plane API. xDS is CNCF-governed and is gradually transitioning
content from the
[envoyproxy/data-plane-api](https://github.com/envoyproxy/data-plane-api) repo
to [cncf/xds](https://github.com/cncf/xds) repo as part of that overall vision.

In parallel to these efforts by the xDS community, GKE is planning to introduce
xDS as an additional data source for DPv2 configuration. This feels sufficiently
generic and helpful that it could be contributed to upstream Cilium. This could
be particularly useful for at least two common use cases:

1. Supporting Services and Endpoints from outside of the local cluster.
2. Supporting advanced routing techniques, such as topology aware routing.

> **Comment (Member):** DPv2 - has any public high-level summary been written about this that could be linked here?
>
> **Reply (Member Author):** Sadly we don't have anything public yet; we're still in pretty early stages.

Although it is out of scope for this specific CFP to provide complete solutions
for either of these use cases, it will demonstrate the benefits of having an
xDS adapter when developing a solution for either of these use cases.

### Use Cases

#### Topology Aware Routing

As the author of both EndpointSlices and the existing Topology Aware Routing
approach in Kubernetes, I feel like I’m well equipped to discuss the limitations
of both. What most people want from Topology Aware Routing is essentially to
fill the local zone to capacity and then spill over to the next closest set of
endpoints with capacity. Unfortunately, that is very difficult to achieve with
the existing Kubernetes tooling. There are a few key problems here:

1. Kubernetes has no feedback loop to its dataplane and therefore no reliable
   way to understand if endpoints have reached capacity.
2. Even if there were a feedback loop, spillover to other zones benefits from
   some form of centralized orchestration to avoid a thundering herd problem.
3. All changes to endpoint routing currently have to go through Services or
   EndpointSlices, and by extension, the Kubernetes API Server. It would be very
   expensive to implement incremental weighting/spillover adjustments with that
   overhead. For example, adjusting the weight of any individual endpoint would
   require a write to the EndpointSlice API, which would then need to be
   distributed to all consumers of that API. That can be problematic from a
   scalability perspective when there are frequent updates paired with a large
   number of endpoints and/or nodes.

> **Comment (Member):** Just to be clear, in Cilium, an "xDS adapter" alone doesn't solve this. We will also need changes to Cilium's control plane to propagate its internal state back to the xDS adapter.
>
> **Reply (Member Author):** +1, this just provides the foundation to solve this. The whole idea of topology aware routing, or really any form of routing that requires a feedback loop, is deceptively simple until you start to think through this part of it. What xDS gives us here is an established API plus patterns for completing a feedback loop, but as you're saying, we'd still need to connect the dots here.

> **Comment on lines +83 to +90 (Member):** I don't understand this problem. Kubernetes API Server is the "control plane" of the cluster; won't we have this same problem for any "control plane" that receives the load reports from Cilium?
>
> **Reply (Contributor):** This is where the bidirectional nature of xDS comes in a bit more handy - every xDS communication is a message and response pair, unlike the Kubernetes API where each operation is either a write or a read. The design of the Load Reporting Service uses this to its advantage, effectively making the control plane a client instead of the server, since one "give me your load numbers" request will produce a stream of load numbers (that the control plane will simply acknowledge).
>
> That said, I think it's important to remember that actually building these control planes is full of really hard concurrency problems. More on that under the "incremental" section in a few lines.
>
> **Reply (@joestringer, Member, Jan 25, 2024):** LRS providing feedback on top can only be a net increase in load. I could see maybe an argument there that kube-apiserver protocols are not as well suited to the problem, but before we even consider handling the load feedback, what about the existing load of just configuring K8sEndpoints on all of the cilium-agents?
>
> Perhaps some guiding questions:
>
> * In the target deployment scenario, what percentage of kube-apiserver resource (CPU, memory, network) usage corresponds to service/endpoint handling? Is the bottleneck having a big enough kube-apiserver node, and the consideration is that we can increase the available CPU by moving this functionality to a secondary node with an additional 2x, 5x, 10x CPU to process the events? That's an argument that moving this service handling to a dedicated process can gain some multiplier of available resources to handle the scale.
> * In the proposed design, is there something inherent that reduces the number of requests/responses that are being sent around the cluster? Let's say for a ballpark number the naive assumption is that if today's cluster has 1K nodes, then you have 1K events per endpoint update. If you move the service handling to a second process, then kube-apiserver as source of truth will reduce down to only transmitting one event to the new service control plane manager, but then the service control plane manager is responsible for (a) the original 1K events and now also (b) marshalling/unmarshalling between disparate protocols. (Incidentally, I realize that with EndpointSlice this example is not the correct math at all, but hopefully you get the point of my question.)
> * How does this impact availability?
>
> **Reply (Member Author):** Let me preface this by saying that scalability was not my primary goal with this CFP, but I do think it could be a nice coincidental benefit of the approach even outside of load reporting. When we get to load reporting, I can't think of any feasible way to do it with existing k8s constructs, and I'd rather not invent something new here when LRS already exists.
>
> I spent a fair amount of time thinking about the Kubernetes endpoint scalability problem as I was working on the EndpointSlice API, and went as far as scale testing large k8s clusters until they broke to really understand what was happening (relevant slides). This is a relatively unique problem in terms of Kubernetes APIs. Here's what we have to deal with:
>
> 1. EndpointSlices are updated frequently (essentially any time Pod status changes).
> 2. Every update needs to be distributed to every node in the cluster.
> 3. There are often thousands of endpoints in a cluster.
>
> Some Kubernetes APIs have to deal with the scale of EndpointSlices but are only consumed in one or two places, i.e. some kind of centralized controller(s). The Pod API is consumed by every Node thanks to Kubelet, but Kubelet is able to filter only the Pods that are local to it, meaning that each Pod is distributed to exactly one Node.
>
> Let's consider a rolling update of a deployment with 100 Pods. Each of those Pods is going to go through a transition of unready -> ready while the old 100 Pods are transitioning from ready -> unready. Let's say that translates to 400 distinct events to process ((old Pod -> unready, old Pod -> terminated, new Pod -> unready, new Pod -> ready) * 100). In a naive implementation, that would mean that the Kubernetes API Server would need to transmit 400 EndpointSlice updates to every node in the cluster. That becomes especially fun when you consider that some providers support 15,000 Nodes per cluster.
>
> Fortunately the EndpointSlice controller uses batching to mitigate that to some extent, but hopefully that sets the stage for the unique problem that endpoints pose to the Kubernetes API Server. Like @joestringer mentions above, from the perspective of the API Server, it would be amazing if it didn't need to distribute all of those updates to each individual Node.
>
> As far as LRS specifically, +1 to everything @youngnick mentioned above. Load reporting would be especially complex with existing Kubernetes constructs but relatively straightforward with LRS. Everything that goes through the Kubernetes API Server would need to be persisted through writes to etcd. On the other hand, LRS combined with an xDS control plane could enable us to bypass that entirely.
>
> > In the proposed design, is there something inherent that reduces the number of requests/responses that are being sent around the cluster?
>
> I don't think this necessarily does that. It may have some marginal impact here, but I don't think it will be a huge one. In my opinion, a lot of the value here would be that we'd have a path to moving this load from the API Server to a separate control plane that could be scaled independently and potentially be more optimized for this specific purpose. My proposal is not really focused on scalability, but I think it provides the potential for significant improvements here. For example, you might choose to deploy an xDS control plane per zone as a solution that might improve both availability and scalability.
>
> > How does this impact availability?
>
> In the base case where people just don't use this feature, I think there's no impact. Assuming they do, I think it entirely depends on how xDS control planes are deployed. Using the example above, if there were a separate instance of a control plane in each zone, it may improve overall availability. On the other hand, if you're adding a single xDS control plane instance to a cluster that has 3 API Server replicas, you might decrease availability.

I believe that xDS is uniquely positioned to address these limitations.
[LRS](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/load_stats/v3/lrs.proto)
(Load Reporting Service) provides a straightforward way to send load reports
to a centralized control plane. Similarly, [Delta
xDS](https://www.envoyproxy.io/docs/envoy/latest/configuration/overview/xds_api#delta-endpoints)
enables xDS to distribute incremental updates that can be smaller in scope than
comparable Kubernetes APIs. All of this could be combined to avoid adding
additional load on the Kubernetes API Server.
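
To make the feedback loop concrete, here is a minimal sketch of the agent side of an LRS session, using the generated types from envoyproxy/go-control-plane. Everything specific here is an assumption for illustration: the node ID, the connection setup, and where the load counters come from are not part of this proposal.

```go
package main

import (
	"context"
	"time"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	loadstatsv3 "github.com/envoyproxy/go-control-plane/envoy/service/load_stats/v3"
	"google.golang.org/grpc"
)

// streamLoadReports sketches an agent-side LRS loop: the client identifies
// itself once, the control plane answers with the clusters it wants reports
// for and an interval, and the client then streams stats on that interval.
// A real implementation would also watch for updated responses and handle
// reconnects; both are elided here.
func streamLoadReports(ctx context.Context, conn *grpc.ClientConn) error {
	stream, err := loadstatsv3.NewLoadReportingServiceClient(conn).StreamLoadStats(ctx)
	if err != nil {
		return err
	}
	// The initial request identifies this node to the control plane.
	if err := stream.Send(&loadstatsv3.LoadStatsRequest{
		Node: &corev3.Node{Id: "cilium-node-1"}, // hypothetical node ID
	}); err != nil {
		return err
	}
	// The response names the clusters to report on and how often.
	resp, err := stream.Recv()
	if err != nil {
		return err
	}
	ticker := time.NewTicker(resp.GetLoadReportingInterval().AsDuration())
	defer ticker.Stop()
	for range ticker.C {
		stats := make([]*endpointv3.ClusterStats, 0, len(resp.GetClusters()))
		for _, cluster := range resp.GetClusters() {
			stats = append(stats, &endpointv3.ClusterStats{
				ClusterName:        cluster,
				LoadReportInterval: resp.GetLoadReportingInterval(),
				// Real counters would come from Cilium's datapath here.
			})
		}
		if err := stream.Send(&loadstatsv3.LoadStatsRequest{ClusterStats: stats}); err != nil {
			return err
		}
	}
	return nil
}
```

Note that nothing in this exchange touches etcd: the reports live only on the stream, which is what would let load data bypass the Kubernetes API Server entirely.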

> **Comment on lines +92 to +99 (Contributor):** I agree that LRS is designed to do exactly this, but implementing incremental xDS is not straightforward. Unless your control plane is very carefully designed, it's nearly impossible (which is why many simpler Envoy control planes don't do it).
>
> To be clear, the hard part is on whoever's building the control plane, not inside the Cilium Agent. But given that we'll need an open-source thing to test with at the very least, the engineering cost of building this shouldn't be underestimated.
>
> **Reply:** Yes, which is why the scope of the control plane work should be clear enough and be approached in a phased manner. Implementing something for e2e testing is much simpler than building one usable in production.
>
> **Reply (Member):** Discussing not running this in production gives me a bit of concern. How will Cilium users benefit from this functionality if the reference implementation isn't built with production in mind?
>
> **Reply:** To clarify my previous comment, I am not suggesting we should build an experimental control plane. Instead, we should agree on the scope of the work as an initial deliverable. Building a production-level control plane that handles all the edge cases can be challenging. There should definitely be a reference implementation that is usable outside of tests.
>
> **Reply (@robscott, Member Author, Jan 25, 2024):** I think the minimum bar here in terms of initial development would be an OSS/CNCF reference xDS control plane that is sufficiently reliable to run for e2e testing and development, and also at least one that is production ready. Those will probably be the same project, but in the odd chance we are successful at building a good ecosystem here, it's at least theoretically possible that there will be multiple OSS production-ready options and we won't need the reference/testing implementation to also be production ready. Our e2e tests here could become something that could be paired with any compatible xDS control plane in the future.

#### Endpoints From Other Sources

In some cases, it’s useful to be able to route to endpoints from other sources,
such as other Kubernetes clusters or outside of Kubernetes altogether. This can
be particularly relevant when implementing Multi-Cluster Service routing.
Implementations today often rely on one of the following approaches:

1. Mirror endpoints from other clusters into Kubernetes by creating custom
Services and EndpointSlices.
2. Introduce Gateways on the edges of clusters that can be used to route
requests from other clusters.

With xDS support, it could be possible to have a centralized xDS control plane
that is aware of endpoints from multiple sources, providing a potentially
more efficient alternative to either of the approaches commonly used today.
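
As a sketch of what that could look like at the API level, the snippet below builds a single EDS resource whose endpoints come from two different sources (the local cluster and a remote one), using envoyproxy/go-control-plane types. The service name, zones, and addresses are invented for the example.

```go
package main

import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
)

// lbEndpoint wraps an address/port pair in the xDS LbEndpoint envelope.
func lbEndpoint(addr string, port uint32) *endpointv3.LbEndpoint {
	return &endpointv3.LbEndpoint{
		HostIdentifier: &endpointv3.LbEndpoint_Endpoint{
			Endpoint: &endpointv3.Endpoint{
				Address: &corev3.Address{
					Address: &corev3.Address_SocketAddress{
						SocketAddress: &corev3.SocketAddress{
							Address:       addr,
							PortSpecifier: &corev3.SocketAddress_PortValue{PortValue: port},
						},
					},
				},
			},
		},
	}
}

// mergedAssignment builds one ClusterLoadAssignment that mixes endpoints
// mirrored from the local cluster with endpoints from a remote source,
// without either set ever existing as an EndpointSlice.
func mergedAssignment() *endpointv3.ClusterLoadAssignment {
	return &endpointv3.ClusterLoadAssignment{
		ClusterName: "default/my-service",
		Endpoints: []*endpointv3.LocalityLbEndpoints{
			{
				Locality:    &corev3.Locality{Zone: "zone-a"}, // local cluster
				LbEndpoints: []*endpointv3.LbEndpoint{lbEndpoint("10.0.1.5", 8080)},
			},
			{
				Locality:    &corev3.Locality{Zone: "zone-b"}, // remote cluster
				LbEndpoints: []*endpointv3.LbEndpoint{lbEndpoint("192.168.3.7", 8080)},
			},
		},
	}
}
```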

> **Comment on lines +113 to +115 (@joestringer, Member, Jan 25, 2024):** What is it about this centralized xDS control plane that is more efficient?
>
> Probably worth also contrasting with kvstoremesh (see CiliumCon @ KubeCon NA 2023 or https://arthurchiao.art/blog/trip-first-step-towards-cloud-native-security/#222-kvstoremesh).
>
> **Reply:** cilium/cilium#30283 describes the benefits of centralization in a different context. Just wanted to share this for reviewers not aware of the adjacent proposal.


## Proposal

### Logic

The introduction of this adapter would mean that Cilium could accept endpoints
from more than one source. Cilium could be configured to read exclusively from
Kubernetes (default), xDS, or both, depending on the use case.

When both data sources were in use, precedence in any conflicts would be given
to data received directly from Kubernetes. As a concrete example, it's possible
that the same Service could be represented in both Kubernetes and xDS. If that
were the case, we'd give precedence to the Service from Kubernetes APIs and drop
the one from xDS, likely surfacing some kind of warning along the way.
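
A minimal sketch of that precedence rule, assuming a source-agnostic service map; the `Service` type and string keying below are hypothetical placeholders, not Cilium's actual types:

```go
package main

import "log"

// Service is a hypothetical, source-agnostic service representation.
type Service struct {
	Frontend string
	Backends []string
}

// mergeServices applies the proposed precedence rule: when the same service
// key appears in both sources, the Kubernetes entry wins and the xDS entry
// is dropped with a warning.
func mergeServices(fromK8s, fromXDS map[string]Service) map[string]Service {
	merged := make(map[string]Service, len(fromK8s)+len(fromXDS))
	for key, svc := range fromXDS {
		merged[key] = svc
	}
	for key, svc := range fromK8s {
		if _, conflict := merged[key]; conflict {
			log.Printf("service %q present in both Kubernetes and xDS; preferring Kubernetes", key)
		}
		merged[key] = svc // Kubernetes entries overwrite xDS entries.
	}
	return merged
}
```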

### Architecture

This xDS adapter would be similar conceptually to KVStore. Essentially this
would be another backend that Cilium could get data from. Similar to how Cilium
can read from Kubernetes APIs and/or KVStore today, we would add a third option
for xDS as a data source, and develop a shared interface that worked across all
data sources.
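
One possible shape for that shared interface is sketched below; it reuses the hypothetical `Service` type from the previous sketch, and the method set is illustrative rather than Cilium's actual API.

```go
// EventKind distinguishes additions/updates from removals.
type EventKind int

const (
	Upsert EventKind = iota
	Delete
)

// ServiceEvent is a source-agnostic update emitted by any backend.
type ServiceEvent struct {
	Kind    EventKind
	Service Service // the translated, source-agnostic representation
}

// ServiceSource is the interface a Kubernetes, KVStore, or xDS backend
// would each implement; consumers receive events without caring about
// their origin. (Continues the package above; add "context" to imports.)
type ServiceSource interface {
	// Name identifies the backend ("k8s", "kvstore", "xds") for logging
	// and for resolving conflicts between sources.
	Name() string
	// Subscribe streams updates onto events until ctx is cancelled.
	Subscribe(ctx context.Context, events chan<- ServiceEvent) error
}
```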

### API Mapping

To keep the initial work as focused as possible, I propose using existing xDS
APIs to model ClusterIP routing. In the future, we may want to consider either
additions to those existing xDS APIs or new xDS APIs to cover the entirety of
Cilium functionality. To start, it seems highly valuable to focus on the areas
where existing xDS APIs overlap with existing Cilium capabilities.

This shows how we could use existing xDS APIs to represent Cilium capabilities
for ClusterIP routing:

> **Comment (Member):** Could you link the Cilium API this is modeled to target?
>
> **Reply (Member Author):** This is currently referring to Cilium's Service cache, but I expect that at least some details may change a bit if we're building a common interface around that as part of phase 1: https://github.com/cilium/cilium/blob/0632058f820c05013dfcb010d6bda0911b1269d7/pkg/service/service.go#L73-L97
>
> **Reply:** A bit confusing with so many Service types within the code, but can this be mapped 1:1 to loadbalancer.SVC?

| Cilium Field | xDS Source | Comments |
| - | - | - |
| Frontend.Address | `filter_chain_match.prefix_ranges` | |
| Frontend.Port | `filter_chain_match.destination_port` | |
| Frontend.Protocol | `cluster.metadata.protocol` | |
| Backend[*].FEPortName | `cluster.metadata.port_name` | |
| Backend[*].NodeName | ? | TBD - This is only used for logic to determine if an endpoint is local. Instead of sending this through xDS for every endpoint we could just look at the CIDR configured for each Cilium instance. |
| Backend[*].Addr, Backend[*].Port | `lb_endpoints.endpoint.address.socket_address` | |
| Backend[*].State | `lb_endpoints.health_status` | |
| Backend[*].Preferred | N/A | Not used in Kubernetes |
| Backend[*].Weight | `lb_endpoints.endpoint.load_balancing_weight` | |
| Type | N/A | Out of scope - only focusing on ClusterIP Services |
| ExternalTrafficPolicy, InternalTrafficPolicy | `cluster.load_balancing_policy` | Defined in the typed struct `type.googleapis.com/cilium_io_traffic_policy` |
| NatPolicy | N/A | Not used in Kubernetes |
| SessionAffinity | `tcpproxy.hash_policy` | |
| SessionAffinityTimeout | ? | TBD - This represents the maximum session stickiness time. There is no natural place for this in existing xDS APIs. |
| HealthCheckNodePort | N/A | Out of scope - only focusing on ClusterIP Services |
| Name | `cluster.metadata.service_name` | |
| Namespace | `cluster.metadata.service_namespace` | |
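
To ground the table, here is a sketch of how the frontend and metadata rows could look in go-control-plane types. The metadata namespace `cilium.io/service` and all concrete values are invented for the example; the field names follow the table above.

```go
package main

import (
	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	listenerv3 "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
	"google.golang.org/protobuf/types/known/structpb"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// exampleMapping shows Frontend.Address/Port carried in a filter chain match
// and the Kubernetes-specific fields carried in cluster metadata.
func exampleMapping() (*listenerv3.FilterChainMatch, *clusterv3.Cluster) {
	match := &listenerv3.FilterChainMatch{
		// Frontend.Address -> filter_chain_match.prefix_ranges
		PrefixRanges: []*corev3.CidrRange{{
			AddressPrefix: "10.96.0.10", // example ClusterIP
			PrefixLen:     wrapperspb.UInt32(32),
		}},
		// Frontend.Port -> filter_chain_match.destination_port
		DestinationPort: wrapperspb.UInt32(80),
	}
	cluster := &clusterv3.Cluster{
		Name: "default/my-service",
		Metadata: &corev3.Metadata{
			FilterMetadata: map[string]*structpb.Struct{
				"cilium.io/service": { // hypothetical metadata namespace
					Fields: map[string]*structpb.Value{
						"protocol":          structpb.NewStringValue("TCP"),
						"port_name":         structpb.NewStringValue("http"),
						"service_name":      structpb.NewStringValue("my-service"),
						"service_namespace": structpb.NewStringValue("default"),
					},
				},
			},
		},
	}
	return match, cluster
}
```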

## Impacts / Key Questions

_List crucial impacts and key questions. They likely require discussion and are required to understand the trade-offs of the CFP. During the lifecycle of a CFP, discussion on design aspects can be moved into this section. After reading through this section, it should be possible to understand any potentially negative or controversial impact of this CFP. It should also be possible to derive the key design questions: X vs Y._

### Impact: ... 1

_Describe crucial impacts and key questions that likely require discussion and debate._

### Key Question: ... 2

_Describe a key question_

### Option 1:

#### Pros

* ...

#### Cons

* ...

### Option 2:

#### Pros

* ...

#### Cons

* ...

## Future Milestones

_List things that this CFP will enable but that are out of scope for now. This can help understand the greater impact of a proposal without requiring to extend the scope of a CFP unnecessarily._

### Deferred Milestone 1

_Description of deferred milestone_

### Deferred Milestone 2