Adding xDS Adapter CFP #14
base: main
Conversation
to [cncf/xds](https://github.com/cncf/xds) repo as part of that overall vision.

In parallel to these efforts by the xDS community, GKE is planning to introduce
xDS as an additional data source for DPv2 configuration. This feels sufficiently
DPv2
Has any public high-level summary been written about this that could be linked here?
Sadly, we don't have anything public yet; we're still in pretty early stages.
Overall this seems like a reasonable starting point to enable xDS
cilium/CFP-30235-xds-adapter.md
potential for scalability improvements when adjustments to routing
configurations don't need to round trip through the Kubernetes API Server.
Won't they need to round trip through an xDS controller / server?
It seems like one of the major advantages of an xDS control plane managing this information is that it's only handling this information, rather than being a generic workload-management API like Kubernetes.
For some cases, I don't think it will ever make sense to put xDS between Cilium and Kubernetes, but for other use cases, particularly endpoints and endpoint grouping (Clusters in xDS, Services in Kubernetes), it seems more straightforward to map these to xDS objects.
* Build anything additional on top of this xDS adapter. Although this could be a
foundational technology for several new features, those are out of scope for
this proposal.
I don't understand how we can have an xDS adapter without having an implementation of the server / controller.
I agree that to validate that this works, we'll need an xDS control plane to talk to. Which is a not-insignificant engineering problem in itself.
Yeah that's a good point. What I'd really intended here was that this CFP was meant to be about foundational building blocks/infrastructure, not features. I agree that an xDS control plane needs to be bundled with this in some way and have already added it as phase 3 of this CFP.
endpoints with capacity. Unfortunately that is very difficult to achieve with
the existing Kubernetes tooling. There are a few key problems here:

1. Kubernetes has no feedback loop to its dataplane and therefore no reliable
Just to be clear, in Cilium, an "xDS adapter" alone doesn't solve this. We will also need changes to Cilium's control plane to propagate its internal state back to the xDS adapter.
+1, this just provides the foundation to solve this. The whole idea of topology aware routing or really any form of routing that requires a feedback loop is deceptively simple until you start to think through this part of it. What xDS gives us here is an established API + patterns for completing a feedback loop, but like you're saying, we'd still need to connect the dots here.
3. All changes to endpoint routing currently have to go through Services or
EndpointSlices, and by extension, the Kubernetes API Server. It would be very
expensive to implement incremental weighting/spillover adjustments with that
overhead. For example, adjusting the weight of any individual endpoint would
require a write to the EndpointSlice API which would then need to be
distributed to all consumers of that API. That can be problematic from a
scalability perspective when there are frequent updates paired with a large
number of endpoints and/or nodes.
I don't understand this problem. Kubernetes API Server is the "control plane" of the cluster, won't we have this same problem for any "control plane" that receives the load reports from Cilium?
This is where the bidirectional nature of xDS comes in a bit more handy - every xDS communication is a message and response pair, unlike the Kubernetes API where each operation is either a write or a read. The design of the Load Reporting Service uses this to its advantage, effectively making the control plane a client instead of the server, since one "give me your load numbers" request will produce a stream of load numbers (that the control plane will simply acknowledge).
That said, I think it's important to remember that actually building these control planes is full of really hard concurrency problems. More on that under the "incremental" section in a few lines.
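To make that inversion concrete, here is a minimal sketch of an LRS server built on go-control-plane's generated types. The cluster name, port, and 10-second interval are illustrative assumptions, not details from this CFP:

```go
package main

import (
	"io"
	"log"
	"net"
	"time"

	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	loadstatsv3 "github.com/envoyproxy/go-control-plane/envoy/service/load_stats/v3"
	"google.golang.org/grpc"
	"google.golang.org/protobuf/types/known/durationpb"
)

// lrsServer implements the single bidirectional-streaming RPC in LRS.
type lrsServer struct {
	loadstatsv3.UnimplementedLoadReportingServiceServer
}

func (s *lrsServer) StreamLoadStats(stream loadstatsv3.LoadReportingService_StreamLoadStatsServer) error {
	// The first message identifies the reporting node (e.g. a Cilium agent).
	first, err := stream.Recv()
	if err != nil {
		return err
	}
	log.Printf("load reports from node %q", first.GetNode().GetId())

	// One response tells the client what to report and how often; after
	// this, the control plane mostly just consumes the client's stream.
	if err := stream.Send(&loadstatsv3.LoadStatsResponse{
		Clusters:              []string{"default/my-service"}, // hypothetical
		LoadReportingInterval: durationpb.New(10 * time.Second),
	}); err != nil {
		return err
	}
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		for _, cs := range req.GetClusterStats() {
			handleClusterStats(cs)
		}
	}
}

func handleClusterStats(cs *endpointv3.ClusterStats) {
	log.Printf("cluster %s: %d locality stats", cs.GetClusterName(), len(cs.GetUpstreamLocalityStats()))
}

func main() {
	lis, err := net.Listen("tcp", ":18000") // port is arbitrary
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	loadstatsv3.RegisterLoadReportingServiceServer(srv, &lrsServer{})
	log.Fatal(srv.Serve(lis))
}
```

Note how, after the single LoadStatsResponse, the data flows almost entirely from the client to the control plane, which only needs to acknowledge it.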
LRS providing feedback on top can only be a net increase in load. I could see maybe an argument there that kube-apiserver protocols are not as well suited to the problem, but before we even consider handling the load feedback, what about the existing load of just configuring K8sEndpoints on all of the cilium-agents?
Perhaps some guiding questions:
- In the target deployment scenario, what percentage of kube-apiserver resource usage (CPU, memory, network) corresponds to service/endpoint handling? Is the bottleneck having a big enough kube-apiserver node, and is the idea that we can increase the available CPU by moving this functionality to a secondary node with 2x, 5x, 10x more CPU to process the events? That's an argument that moving this service handling to a dedicated process can gain some multiplier of available resources to handle the scale.
- In the proposed design, is there something inherent that reduces the number of requests/responses that are being sent around the cluster? Let's say for a ballpark number the naive assumption is that if today's cluster has 1K nodes, then you have 1K events per endpoint update. If you move the service handling to a second process, then kube-apiserver as source of truth will reduce down to only transmitting one event to the new service control plane manager, but then the service control plane manager is responsible for (a) the original 1K events and now also (b) marshalling/unmarshalling between disparate protocols. (Incidentally I realize that with EndpointSlice this example is not the correct math at all, but hopefully you get the point of my question)
- How does this impact availability?
Let me preface this by saying that scalability was not my primary goal with this CFP, but I do think it could be a nice coincidental benefit of the approach even outside of load reporting. When we get to load reporting, I can't think of any feasible way to do it with existing k8s constructs, and I'd rather not invent something new here when LRS already exists.
I spent a fair amount of time thinking about the Kubernetes endpoint scalability problem as I was working on the EndpointSlice API, and went as far as scale testing large k8s clusters until they broke to really understand what was happening (relevant slides). This is a relatively unique problem in terms of Kubernetes APIs. Here's what we have to deal with:
- EndpointSlices are updated frequently (essentially any time Pod status changes)
- Every update needs to be distributed to every node in the cluster
- There are often thousands of endpoints in a cluster
Some Kubernetes APIs have to deal with the scale of EndpointSlices but are only consumed in one or two places, i.e. some kind of centralized controller(s). The Pod API is consumed by every Node thanks to Kubelet, but Kubelet is able to filter for only the Pods that are local to it, meaning that each Pod is distributed to exactly one Node.
Let's consider a rolling update of a Deployment with 100 Pods. Each of those Pods is going to go through a transition of unready -> ready while the old 100 Pods are transitioning from ready -> unready. Let's say that translates to 400 distinct events to process ((old Pod -> unready, old Pod -> terminated, new Pod -> unready, new Pod -> ready) * 100). In a naive implementation, that would mean that the Kubernetes API Server would need to transmit 400 EndpointSlice updates to every node in the cluster. That becomes especially fun when you consider that some providers support 15,000 Nodes per cluster.
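(Back-of-envelope, using the numbers above and assuming no batching: 4 transitions per Pod × 100 Pods = 400 EndpointSlice updates, and 400 updates × 15,000 watching Nodes = 6,000,000 update transmissions for a single rolling update.)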
Fortunately the EndpointSlice controller uses batching to mitigate that to some extent, but hopefully that sets the stage for the unique problem that endpoints pose to the Kubernetes API Server. Like @joestringer mentions above, from the perspective of the API Server, it would be amazing if it didn't need to distribute all of those updates to each individual Node.
As far as LRS specifically, +1 to everything @youngnick mentioned above. Load reporting would be especially complex with existing Kubernetes constructs but relatively straightforward with LRS. Everything that goes through Kubernetes API Server would need to be persisted through writes to etcd. On the other hand, LRS combined with an xDS control plane could enable us to bypass that entirely.
> In the proposed design, is there something inherent that reduces the number of requests/responses that are being sent around the cluster?
I don't think this necessarily does that. It may have some marginal impact here, but I don't think it will be a huge one. In my opinion, a lot of the value here would be that we'd have a path to moving this load from the API Server to a separate control plane that could be scaled independently and potentially more optimized for this specific purpose. My proposal is not really focused on scalability, but I think it provides the potential for significant improvements here. For example, you might choose to deploy an xDS control plane per-zone as a solution that might improve both availability and scalability.
> How does this impact availability?
So in the base case where people just don't use this feature, I think there's no impact. Assuming they do, I think it entirely depends on how xDS control planes are deployed. Using the example above, if there were a separate instance of a control plane in each zone it may improve overall availability. On the other hand, if you're adding a single xDS control plane instance to a cluster that has 3 API Server replicas, you might decrease availability.
cilium/CFP-30235-xds-adapter.md
for example if Services from different sources have the same IP, the following
order will be used for precedence:

1. Kubernetes
0. local API
- It's possible to add endpoints and services using the local API.
Yes, I think that the first part of this effort needs to be to understand the existing internal model for endpoints and endpoint groups and how well it aligns with the other models (including xDS).
+1, phase 1 in this proposal will be to define a common interface that can be used for all Service and Endpoint data sources. I think that mostly exists, but it would be good to ensure that everything is going through the same shared path here with an interface we all agree on before going any further. I also added the local API to the list as suggested by @aanm above.
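As an illustration of what that precedence could look like behind the phase 1 common interface, here is a hypothetical sketch; the Source constants and Service struct below are stand-ins, not part of the CFP:

```go
package sources

// Source identifies where a Service definition came from. The ordering is
// hypothetical and encodes the precedence list discussed above: a lower
// value wins when two sources claim the same frontend IP.
type Source int

const (
	SourceLocalAPI   Source = iota // 0. local API
	SourceKubernetes               // 1. Kubernetes
	SourceXDS                      // 2. xDS (illustrative position)
)

// Service is a stand-in for whatever common model phase 1 settles on.
type Service struct {
	FrontendIP string
	Source     Source
}

// Resolve picks the winning definition per frontend IP when multiple data
// sources disagree, keeping the one from the highest-precedence source.
func Resolve(services []Service) map[string]Service {
	winners := make(map[string]Service)
	for _, svc := range services {
		if cur, ok := winners[svc.FrontendIP]; !ok || svc.Source < cur.Source {
			winners[svc.FrontendIP] = svc
		}
	}
	return winners
}
```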
| Name | `cluster.metadata.service_name` | |
| Namespace | `cluster.metadata.service_namespace` | |

### 3. Sample xDS Control Plane
Food for thought: it might make sense to add this xDS Control plane into Cilium Operator.
+1 if the community decides to provide this to Cilium users.
I'm -1 on adding it into the Operator, personally, because the operator is currently tightly focussed on its current role: reading from Kubernetes objects, building some internal model, then writing back to Kubernetes. The Operator doesn't have any actual routing logic itself.
An xDS control plane will need to have a lot of routing logic added. I think it's better to keep this as a separate thing that's built with the same tooling as the Agent and Operator (that is, using the Hive and Cell pattern), so that it can be merged in later if necessary.
Seems reasonable to have a standalone component that implements the control plane. Adding it to the operator has the benefit of not requiring a new component, and it can simply be feature gated. Though from a stability perspective, keeping this out of the operator is beneficial.
I believe that xDS is uniquely positioned to address these limitations.
[LRS](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/load_stats/v3/lrs.proto)
(Load Reporting Service) provides a straightforward way to provide load reports
to a centralized control plane. Similarly, [Delta
xDS](https://www.envoyproxy.io/docs/envoy/latest/configuration/overview/xds_api#delta-endpoints)
enables xDS to distribute incremental updates that can be smaller in scope than
comparable Kubernetes APIs. All of this could be combined to avoid adding
additional load on the Kubernetes API Server.
I agree that LRS is designed to do exactly this, but implementing incremental xDS is not straightforward. Unless your control plane is very carefully designed, it's nearly impossible (which is why many simpler Envoy control planes don't do it.)
To be clear, the hard part is on whoever's building the control plane, not inside the Cilium Agent. But given that we'll need an open-source thing to test with at the very least, the engineering cost of building this shouldn't be underestimated.
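For readers who haven't built one, a rough sketch of the per-client bookkeeping that makes incremental xDS hard, using the discovery v3 request type from go-control-plane; the state struct is a deliberate simplification, not a proposed design:

```go
package deltaxds

import (
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// deltaClientState is the per-connection bookkeeping a delta xDS server has
// to maintain: the client's current subscriptions and the nonce of the last
// response still awaiting an ACK/NACK. A real server also tracks which
// resource versions each client has acknowledged so it can compute minimal
// diffs, which is where the concurrency problems live.
type deltaClientState struct {
	subscribed   map[string]struct{} // resource names the client wants
	pendingNonce string              // nonce of the in-flight response
}

// onRequest applies one DeltaDiscoveryRequest. The subtlety: a single
// request can simultaneously change subscriptions and ACK or NACK the
// previous response, so ordering and nonce matching must be exact.
func (s *deltaClientState) onRequest(req *discoveryv3.DeltaDiscoveryRequest) {
	for _, name := range req.GetResourceNamesSubscribe() {
		s.subscribed[name] = struct{}{}
	}
	for _, name := range req.GetResourceNamesUnsubscribe() {
		delete(s.subscribed, name)
	}
	if s.pendingNonce != "" && req.GetResponseNonce() == s.pendingNonce {
		if req.GetErrorDetail() != nil {
			// NACK: the client rejected the last delta; decide what to resend.
		} else {
			// ACK: the client applied the last delta; record those versions.
		}
		s.pendingNonce = ""
	}
	// Requests with an empty or stale nonce are spontaneous subscription
	// changes (or late replies) and must not be treated as ACKs.
}
```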
Yes, which is why the scope of the control plane work should be clear enough and be approached in a phased manner. Implementing something for e2e testing is much simpler than building one usable in production.
Discussing not running this in production gives me a bit of concern. How will Cilium users benefit from this functionality if the reference implementation isn't built with production in mind?
To clarify my previous comment, I am not suggesting we should build an experimental control plane. Instead, we should agree on the scope of the work as an initial deliverable. Building a production level control plane that handles all the edge cases can be challenging. There should definitely be a reference implementation that is usable outside of tests.
I think the minimum bar here in terms of initial development would be an OSS/CNCF reference xDS control plane that is sufficiently reliable to run for e2e testing and development, and also at least one that is production ready. Those will probably be the same project, but in the odd chance we are successful at building a good ecosystem here, it's at least theoretically possible that there will be multiple OSS production-ready options and we won't need the reference/testing implementation to also be production ready. Our e2e tests here could become something that could be paired with any compatible xDS control plane in the future.
This shows how we could use existing xDS APIs to represent Cilium capabilities
for ClusterIP routing:
Could you link the Cilium API this is modeled to target?
This is currently referring to Cilium's Service cache, but I expect that at least some details may change a bit if we're building a common interface around that as part of phase 1: https://github.com/cilium/cilium/blob/0632058f820c05013dfcb010d6bda0911b1269d7/pkg/service/service.go#L73-L97
A bit confusing with so many Service types within the code, but can this be mapped 1:1 to loadbalancer.SVC?
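For concreteness, here is a rough sketch of the direction of that mapping using go-control-plane types, starting from a deliberately simplified (name, namespace, backends) tuple rather than the real loadbalancer.SVC fields; the `io.cilium` filter-metadata key is an assumption, while the `service_name`/`service_namespace` fields come from the table quoted above:

```go
package xdsadapter

import (
	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	"google.golang.org/protobuf/types/known/structpb"
)

// toXDS maps a (simplified) Cilium service to an xDS Cluster plus its
// ClusterLoadAssignment. The inputs are illustrative; the real mapping would
// start from whatever common model phase 1 defines (today, roughly the
// Service cache / loadbalancer.SVC).
func toXDS(name, namespace string, backends map[string]uint32) (*clusterv3.Cluster, *endpointv3.ClusterLoadAssignment) {
	clusterName := namespace + "/" + name

	// Per the CFP's table, the Kubernetes identity rides along in
	// cluster.metadata; the exact filter-metadata key is an assumption.
	meta, _ := structpb.NewStruct(map[string]any{
		"service_name":      name,
		"service_namespace": namespace,
	})

	var lbEndpoints []*endpointv3.LbEndpoint
	for ip, port := range backends {
		lbEndpoints = append(lbEndpoints, &endpointv3.LbEndpoint{
			HostIdentifier: &endpointv3.LbEndpoint_Endpoint{
				Endpoint: &endpointv3.Endpoint{
					Address: &corev3.Address{
						Address: &corev3.Address_SocketAddress{
							SocketAddress: &corev3.SocketAddress{
								Address:       ip,
								PortSpecifier: &corev3.SocketAddress_PortValue{PortValue: port},
							},
						},
					},
				},
			},
		})
	}

	cluster := &clusterv3.Cluster{
		Name: clusterName,
		Metadata: &corev3.Metadata{
			FilterMetadata: map[string]*structpb.Struct{"io.cilium": meta},
		},
	}
	cla := &endpointv3.ClusterLoadAssignment{
		ClusterName: clusterName,
		Endpoints:   []*endpointv3.LocalityLbEndpoints{{LbEndpoints: lbEndpoints}},
	}
	return cluster, cla
}
```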
With xDS support, it could be possible to have a centralized xDS control plane
that was familiar with endpoints from multiple sources, providing a potentially
more efficient alternative to either of the approaches commonly used today.
What is it about this centralized xDS control plane that is more efficient?
Probably worth also contrasting with kvstoremesh (see CiliumCon @ kubecon NA 2023 or https://arthurchiao.art/blog/trip-first-step-towards-cloud-native-security/#222-kvstoremesh)
cilium/cilium#30283 describes the benefits of centralization in a different context. Just wanted to share this for reviewers not aware of the adjacent proposal.
Cilium we'd likely want to report new and active connections to an xDS control
plane via LRS.
Just to note (partly coming from the session we had earlier in the week): I think it makes sense to clearly outline what sort of stats we are looking for, format, etc. and also acknowledge that this may require datapath changes. I suspect that this may end up being a logically fairly separate change that could have other debuggability benefits even in existing environments if exposed through prometheus or other relevant interfaces.
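If new and active connections were the chosen signals, one plausible (assumed, not decided) shape reuses LRS's ClusterStats, mapping connection counters onto its request-oriented fields:

```go
package loadreport

import (
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
)

// connStats is a stand-in for whatever the datapath ends up exposing.
type connStats struct {
	newConns    uint64 // connections opened since the last report
	activeConns uint64 // currently established connections
}

// toClusterStats maps datapath connection counters onto LRS's ClusterStats.
// The field choices are an assumption: LRS speaks in "requests", so here new
// connections ride in total_issued_requests and active connections in
// total_requests_in_progress. A custom load_metric_stats entry would be the
// alternative if this overloading proves too confusing.
func toClusterStats(clusterName string, cs connStats) *endpointv3.ClusterStats {
	return &endpointv3.ClusterStats{
		ClusterName: clusterName,
		UpstreamLocalityStats: []*endpointv3.UpstreamLocalityStats{{
			TotalIssuedRequests:     cs.newConns,
			TotalRequestsInProgress: cs.activeConns,
		}},
	}
}
```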
Could you provide some more detail around where the proposed xDS Adapter would sit in the Cilium stack? I could see how this would inform some of the interface API design decisions. In particular, I would like to hear your thoughts around how the xDS Adapter would interact with the embedded Envoy instance. For instance, does the xDS adapter proxy/cache requests from Envoy to an upstream server, or is Envoy communicating directly with upstream xDS servers?
One other thought: Would it make sense to explore an approach that supports only a single backend (rather than trying to reconcile between several sources). Basically:
While cilium/cilium#30283 is a separate proposal to build a full alternative xDS interface, my understanding is that the narrow scope of this proposal is part of the attraction, as it allows xDS to be used just where it provides a specific benefit, as @youngnick mentions in #14 (comment)
The xDS adapter in this CFP is totally orthogonal to the built-in Envoy instance, except in so far as it would take endpoints from somewhere else and make them available as Cilium endpoints to be used in Envoy config like any other. But that's a very indirect link, and code-wise they will be completely separate.
@robscott we now have statuses for CFPs: https://github.com/cilium/design-cfps#status. Are you still trying to move this towards implementable, or is it dormant for now?
FYI, there's an implementation submitted for review at cilium/cilium#34484.
Next step on this CFP is a refresh from the latest developments; onus is on @robscott for the update. Marking draft until then.
This is a follow up from the original proposal doc I wrote and corresponds with cilium/cilium#30235.
Note: I'm not quite sure how to fill out impacts and key questions yet, it's possible those will come more naturally with more review of this proposal?
/cc @joestringer @youngnick