Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CFP-30984 : Add CFP for DNS proxy HA - V2 #54

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

hemanthmalla
Copy link
Member

Add CFP for cilium/cilium#30984

Follow up of #32

**Cilium Release:** 1.17

**Authors:** Hemanth Malla <[email protected]>, Vipul Singh <[email protected]>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Administrative point:

If we think we agree on the basics here, but would like more of the design to be fleshed out:

Suggested change
Status: Provisional

(If provisional, please highlight anything that still needs to be expanded with the text "UNRESOLVED: description of unresolved design point").

Or alternatively if we think the design is satisfactorily complete:

Suggested change
Status: Implementable

See status in the readme for more details. https://github.com/cilium/design-cfps?tab=readme-ov-file#status

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the point I made at the end of the discussion with @cilium/sig-policy today. Whatever we agree upon, we should be able to merge in. If there are points left to iterate on, we can highlight those points. If we've resolved the points of difference we can mark it implementable. Goal is to make iteration easier so we don't have to overthink an entire design that is subject to change during implementation, but rather we can more incrementally change the design in-tree as we discuss, debate, implement and learn.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a status field once we close out the open discussion items.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looks like we haven't documented Provisional in the readme yet ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looks like we haven't documented Provisional in the readme yet ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, provisional is just under discussion in #48 but not documented yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's mark this as implementable.

Signed-off-by: Hemanth Malla <[email protected]>
@tamilmani1989
Copy link

tamilmani1989 commented Sep 16, 2024

@squeed / @joestringer we addressed your comments and appreciate if you can another look. Haven't decided on status of CFP though. Do you think we can set to Implementable?

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine with merging in "Provisional" status with these changes. Given how this feature interacts with availability during upgrade, I think we'd need to align on what upgrade should look like before really considering this as implementable.

The danger I see for not aligning around upgrade would be that this feature could either block initiatives from @cilium/sig-datapath / @cilium/sig-policy to optimize the dataplane, or alternatively such initiatives could end up breaking this feature which we don't really want either.

Either this or we say that figuring out upgrade is part of moving this feature from alpha->beta or beta->stable. We haven't outlined criteria for alpha/beta/stable features in the project yet but I'd be open to discussions about how we set expectations around that.

cilium/CFP-30984-dns-proxy-ha-v2.md Outdated Show resolved Hide resolved
cilium/CFP-30984-dns-proxy-ha-v2.md Show resolved Hide resolved
@tamilmani1989
Copy link

Given how this feature interacts with availability during upgrade, I think we'd need to align on what upgrade should look like before really considering this as implementable

Agreed. we can create separate doc/CFP just to brainstorm about how upgrade works with SDP and in-agent dnsproxy. we can note upgrade as unresolved and merge this CFP with provisional status

The danger I see for not aligning around upgrade would be that this feature could either block initiatives from @cilium/sig-datapath / @cilium/sig-policy to optimize the dataplane, or alternatively such initiatives could end up breaking this feature which we don't really want either.

That's true. For version compatibility, if required, we can follow same path as envoy by tying SDP version with cilium versions https://github.com/cilium/proxy#version-compatibility-matrix. The motive should be try to support upgrade without breaking but it may become necessary to break in some cases where we can come up with something like version compatibility.

@joestringer
Copy link
Member

joestringer commented Sep 24, 2024

For version compatibility, if required, we can follow same path as envoy by tying SDP version with cilium versions

AFAIK cilium-proxy doesn't try to solve upgrade, the link is just stating which version of Cilium ships with which Envoy version.

Signed-off-by: Hemanth Malla <[email protected]>
@tamilmani1989
Copy link

For version compatibility, if required, we can follow same path as envoy by tying SDP version with cilium versions

AFAIK cilium-proxy doesn't try to solve upgrade, the link is just stating which version of Cilium ships with which Envoy version.

@joestringer hope you are talking about minor upgrades. In my opinion, upgrade should be backward compatible atleast with immediate previous minor. Old SDP <-> New cilium or New SDP <-> old cilium should work. It might require for us to support both old/new api/fields in new version for some time. Subsequently in later releases, we can remove deprecated api/fields after we make wide announcement.

@@ -23,10 +23,11 @@ Users rely on toFQDN policies to enforce network policies against traffic to des
* Introduce a streaming gRPC API for exchanging FQDN policy related information.
* Introduce a standalone DNS proxy (SDP) that binds on the same port as built-in proxy with SO_REUSEPORT.
* Enforce L7 DNS policy via SDP.
* When an endpoint's DNS traffic is selected by an L7 policy, DNS requests and responses will be forwarded to their destinations via SDP even if cilium-agent is not running. So, clients re-resolving DNS to establish new connections will not be blocked anymore if the IP addresses from the new resolution are unchanged. Note that the L3/L4 policy for the resolved names should have already been plumbed when the agent was running.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a goal, this is implementation detail. Should this be dropped or moved under the proposal section?


* SDP will not listen on the DNS proxy port until a connection is established with cilium agent and initial L7 DNS policy rules are received. Meanwhile, built-in DNS proxy will continue to serve requests. SDP relies on cilium agent for initial bootstrap. In future, we could make SDP retrieve initial policy information from other sources, but this is not in scope for this CFP.

### Handling Upgrades
Copy link
Member

@joestringer joestringer Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't feel that this section adequately explains upgrades. The text is capturing some of the key discussions we have had though, which helps. What I have in mind is something a bit more like a test plan that outlines the bringing up / shutting down of different components and exactly what's expected in terms of forwarding state around.

Overall though I don't feel that continuing this path by debating back and forth in this design doc is really helping drive this upgrades discussion. I think that at this point it makes sense to just accept that some aspects of this CFP are underspecified / not agreed upon, but the overall design is still "implementable" and ultimately subject to review of the actual implementation. As we learn more through implementation we can iron out any remaining differences, and figure out how to best test the functionality.

EDIT 2024-11-23: Better wording, more helpful suggestions.

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions:

  1. Add the status
  2. Move the new goal into the proposal section

Then let's move forward with this. I think that developing and testing the implementation will do more practical good than continuing discussion on the CFP.

Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the late comments. As Joe said, let's discuss more on an actual PR - please do include me on review.

}
```

Method : SubscribeToDNSPolicies ( Invoked from agent to SDP via bi-directional stream )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invoked from agent to SDP

Should probably be "invoked from SDP to agent" right?

Otherwise I think this doesn't match how the RPC is specified, and also would mean that the SDP exposes a gRPC API itself, which seems unnecessary.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's correct. its invoked by SDP to agent and that opens up a stream which used by Cilium Agent to send updates


*Note: `dns_pattern` follows the same convention used in CNPs. See https://docs.cilium.io/en/stable/security/policy/language/#dns-based for more details*

*Note: `DNSPolicies` is a snapshot of the latest known policy information for all endpoints on the host. Sending a snapshot allows for dealing with deletions automatically*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you intend to send a full snapshot each time? It seems possible to match k8s with a ListAndWatch approach to avoid sending a full snapshot.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we intend to send the full snapshot of the policies applied. Not familiar with the ListAndWatch approach, just curious if that will require permissions for SDP to read the CRD ? Then SDP will map the policy rules (labels<>identity) too ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants