This repository has been archived by the owner on Jul 25, 2022. It is now read-only.

gardenctl v2 - SSH controller proposal - old #508

Closed
petersutter opened this issue Mar 19, 2021 · 23 comments
Comments

@petersutter
Contributor

petersutter commented Mar 19, 2021

Motivation

gardenctl (v1) has the functionality to set up SSH sessions to the targeted shoot cluster. For this, infrastructure resources like VMs, firewall rules, etc. have to be created. gardenctl cleans up these resources after the SSH session. However, there were issues in the past where these infrastructure resources did not get cleaned up properly, e.g. because an error occurred and the cleanup was not retried. Hence the proposal to have a dedicated controller (for each infrastructure) that manages the infrastructure resources. gardenctl also re-used the SSH node credentials, which were shared by all users. Instead, SSH credentials should be created for each user / SSH session.

This way, operators would have auditable and personalised ssh credentials that last only for a given time.
#499 (comment)

Assumption

The ssh controller on the seed has access to the garden cluster and is allowed to read/watch SSH resources created by the user

New Components involved in SSH Flow (naming to be defined)

  • gardenctlv2
  • Provider specific SSH Controller
  • Central SSH Controller
    • Runs on garden runtime cluster
    • Defines types (e.g. SSH resource)
    • Heartbeat controller
    • Runs Admission Webhook server

SSH Flow

  1. User/gardenctlv2 creates the SSH resource in the garden cluster (see resource example below)
  2. Mutating Webhook (Central SSH Controller) for the SSH resource (a sketch of this defaulting follows below the flow)
  • according to shootRef, sets the spec.seedName
  • according to shootRef, sets the spec.providerType
  3. Validating Webhook (Central SSH Controller) ensures
  • that there is a validating webhook configuration for the SSH resource
  • that the user has a certain role in the project where the referenced cluster is (or has the permission to read the secret)? What about the garden namespace not having a corresponding project? See question (6) at the bottom
  • that the SSH resource was created AFTER the validating webhook configuration was created
  4. SSH Controller on Seed:
  • Watches SSH resources for its own seed
  • Creates the user's node credentials (public key authentication)
  • Node fetches these credentials and updates the sshd configuration
  • Controller reads cloudprovider credentials from the seed-shoot namespace
  • Deploys infrastructure resources
    • bastion host VM; should instrument the machine-controller-manager if possible
    • creates security groups / firewall rules etc. using the respective infrastructure SDKs
  • Writes the SSH keypair to the garden cluster as a secret (to be discussed)
  • Updates the status of the SSH resource with the bastionNode and sshKeypair secret reference. Sets a ready flag on the resource so that the client knows when to initiate the SSH connection
  5. gardenctl
  • reads the secret of the sshKeypair and initiates the SSH session
  • runs a heartbeat in parallel as long as the SSH session is open
  6. After a missing heartbeat (configurable), a heartbeat controller, running on the garden runtime cluster, will delete the SSH resource
  7. Once the SSH resource gets deleted, all created resources should be cleaned up by the provider-specific SSH controller
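
A minimal sketch of the defaulting done by the mutating webhook in step 2, written in Go. The SSH types below are hypothetical stand-ins for the proposed resource; only the referenced Shoot fields (Spec.SeedName, Spec.Provider.Type) are real gardener API fields. The webhook handler is assumed to have fetched the Shoot named in shootRef from the garden cluster.

package webhook

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
)

// SSH is a trimmed-down, hypothetical stand-in for the proposed
// operations.gardener.cloud/v1alpha1 SSH resource.
type SSH struct {
	Spec SSHSpec
}

type SSHSpec struct {
	ShootRef     ShootRef
	SeedName     string // set by the mutating webhook
	ProviderType string // set by the mutating webhook
}

type ShootRef struct {
	Name string
}

// defaultSeedAndProvider fills spec.seedName and spec.providerType from the
// referenced Shoot, as described in step 2 of the flow. The caller (the
// webhook handler) is expected to have fetched the Shoot from the garden
// cluster using ssh.Spec.ShootRef.Name and the SSH resource's namespace.
func defaultSeedAndProvider(ssh *SSH, shoot *gardencorev1beta1.Shoot) {
	if ssh.Spec.SeedName == "" && shoot.Spec.SeedName != nil {
		ssh.Spec.SeedName = *shoot.Spec.SeedName
	}
	if ssh.Spec.ProviderType == "" {
		ssh.Spec.ProviderType = shoot.Spec.Provider.Type
	}
}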

Example

apiVersion: operations.gardener.cloud/v1alpha1
kind: SSH
metadata:
  generateName: cli-
  namespace: garden-myproject
  annotations:
    operations.gardener.cloud/created-by: foo # set by the mutating webhook
    operations.gardener.cloud/last-heartbeat-at: "2021-03-19T11:58:00Z"
    # operations.gardener.cloud/operation: keepalive # this annotation is removed by the mutating webhook and the last-heartbeat-at timestamp will be updated
spec:
  shootRef: # namespace cannot be set / it's the same as .metadata.namespace
    name: my-cluster

  # seedName: aws-eu2 # is set by the mutating webhook
  # providerType: aws # is set by the mutating webhook

  clientIP: # external ip of the user
    ipv4: 1.2.3.4
    # ipv6: ::1

  # sshConfig: # optional
  #   apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
  #   kind: SSHConfig
  #   foo: bar

  # lifetime: 30m # alternative to the last-heartbeat-at timestamp in the annotations. Only one approach should be chosen

status:
  bastionNode: 1.2.3.5 # or with user, like [email protected], or have own field for the user, or should the user information be stored in own (secret) resource?
  sshKeypair: # secret reference
    name: foo
    namespace: bar
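
To illustrate the keepalive mechanism from the annotations above: the client periodically sets the operations.gardener.cloud/operation: keepalive annotation, and the mutating webhook removes it again while bumping last-heartbeat-at. A minimal Go sketch using a dynamic client; the plural resource name "sshs" in the GroupVersionResource is an assumption.

package heartbeat

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// sshGVR is an assumption: the plural resource name of the proposed SSH kind.
var sshGVR = schema.GroupVersionResource{
	Group:    "operations.gardener.cloud",
	Version:  "v1alpha1",
	Resource: "sshs",
}

// keepAlive periodically annotates the SSH resource with the keepalive
// operation so that the heartbeat controller does not garbage-collect it
// while the SSH session is still open.
func keepAlive(ctx context.Context, client dynamic.Interface, namespace, name string, interval time.Duration) error {
	patch := []byte(`{"metadata":{"annotations":{"operations.gardener.cloud/operation":"keepalive"}}}`)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if _, err := client.Resource(sshGVR).Namespace(namespace).Patch(
				ctx, name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
				return err
			}
		}
	}
}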

Open questions

  1. How to handle seed migrations of a shoot? Central SSH Controller deletes the SSH resource?
  • Proposal: forbid the migration when SSH resources still exist
  2. Should the user be allowed to set the spec.seedName in the SSH resource?
  • Answer: Mutating webhook will ensure that the correct seedName is set
  3. Is there a better way to "hand out" the user's node credentials to the user (spec.sshKeypair), without storing it as secret in the garden cluster in the respective project namespace?
  • Answer: Sounds reasonable
  4. Garden cluster access of the SSH controller:
    a) "static" kubeconfig is set in the controller registrations' providerConfig.values https://github.com/gardener/gardener/blob/master/docs/extensions/controllerregistration.md#scenario-1-deployed-by-gardener.
    b) Sync with @mvladev regarding the concept of authentication between clusters without static credentials
  5. Having spec.clientIP in the SSH resource exposes the external IP of the user to an attacker that has access to the garden cluster with read permission for the SSH resources
  6. How to ensure that the user is allowed to create an SSH session for the respective shoot? With the terminal-controller-manager, the approach is to check if the user has the permission to read the referenced kubeconfig secret. If allowed, the Terminal resource is created and the controller acts upon it. The proposal above was to check if the user has a certain role in the project; however, operators currently have no role in a user's project. So there must be other means to check this. Ideas?
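
Regarding question 6: one conceivable option, analogous to the terminal-controller-manager check mentioned there, would be a SubjectAccessReview asking whether the requesting user may read the shoot's kubeconfig secret. A minimal Go sketch; the "<shoot-name>.kubeconfig" secret name is an assumption borrowed from that approach.

package webhook

import (
	"context"
	"fmt"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canReadKubeconfigSecret checks via SubjectAccessReview whether the given
// user is allowed to get the shoot's kubeconfig secret in the project
// namespace. The "<shoot>.kubeconfig" secret name is an assumption.
func canReadKubeconfigSecret(ctx context.Context, client kubernetes.Interface, user, namespace, shootName string) (bool, error) {
	sar := &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			User: user,
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Namespace: namespace,
				Verb:      "get",
				Group:     "", // core API group
				Resource:  "secrets",
				Name:      fmt.Sprintf("%s.kubeconfig", shootName),
			},
		},
	}
	result, err := client.AuthorizationV1().SubjectAccessReviews().Create(ctx, sar, metav1.CreateOptions{})
	if err != nil {
		return false, err
	}
	return result.Status.Allowed, nil
}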
@rfranzke
Contributor

Thanks, a few comments:

  1. I don't see the provider type in your SSH resource proposal.

  2. What is .spec.node/spec.flagProviderID? Why is it needed? The SSH resource should only create a bastion with access to the nodes, that's it. From there the user can SSH to whatever node is desired.

  3. What about calling the API group operations.gardener.cloud?

  4. Why does the .spec.shootRef have a namespace? Probably it should be removed and always be the same as .metadata.namespace.

  5. the SSH resource was created AFTER the validating webhook configuration was created

    What does this mean? How is this ensured and why is this necessary? The webhook configuration is probably applied when the landscape gets deployed, and from then on it will never be deleted?

  6. I would not call it "central SSH controller in garden cluster" because it is not a controller. It's rather a webhook server with multiple handlers (depends on previous).

  7. that the user has certain role in the project where the referenced cluster is (or has the permission to read the secret)? What about garden namespace not having a corresponding project? See question (5) at the bottom

    Why do we need to care about this, i.e., wouldn't it be enough to allow project admins to create this resource via RBAC? No extra code would be needed?

  8. Deploy infrastructure resources

    Probably it makes sense to evaluate whether we can re-use what already exists (Infrastructure resource and MCM resources) before implementing this again.

  9. About the heartbeat handling - wouldn't it be better to specify a "lifetime" in the SSH resource (that is defaulted to 30m or so (some low value) and can potentially be extended if a longer session would be required)?

  10. How to handle seed migrations of a shoot? Central SSH Controller deletes SSH resource?

    Hm, maybe the easiest would be to forbid the migration when SSH resources still exist?

  11. Should the user be allowed to set the spec.seedName in the SSH resource?

    Even if no, how could you prevent it? Can't you simply validate that only the "correct" seed name (= the same name as for the Shoot) can be used?

  12. Is there a better way to "hand out" the user's node credentials to the user (spec.sshKeypair), without storing it as secret in the garden cluster in the respective project namespace?

    Hm, why aren't you happy with this approach? It actually sounds good?

  13. Having spec.clientIP in the SSH resources exposes the external ip of the user to an attacker that has access to the garden cluster with read permission for the SSH resources

    I didn't understand this field. Why do you want to allow the user to specify anything here? Usually, the IP is anyways determined by the infrastructure provider. Move this from spec to status?

  14. How to ensure that the user is allowed to create an ssh session for the respective shoot?

    See above. You can probably make use of https://github.com/gardener/gardener/blob/master/docs/extensions/project-roles.md.

@petersutter
Contributor Author

Thanks @rfranzke for your comments. I updated the description accordingly.

I don't see the provider type in your SSH resource proposal.

Done

What is .spec.node/spec.flagProviderID? Why is it needed? The SSH resource should only create a bastion with access to the nodes, that's it. From there the user can SSH to whatever node is desired.

Right, I have removed it from the example

What about calling the API group operations.gardener.cloud?

Sounds good, done

Why does the .spec.shootRef have a namespace? Probably it should be removed and always be the same as .metadata.namespace.

Very good point! Restricting this to the current namespace makes the validation obsolete

Why do we need to care about this, i.e., wouldn't it be enough to allow project admins to create this resource via RBAC? No extra code would be needed?

Yes, with your recommendation from (4) it makes things much easier and it would not be required. I have struck through the respective paragraphs in the proposal

What does this mean? How is this ensured and why is this necessary? The webhook configuration is probably applied when the landscape gets deployed, and from then on it will never be deleted?

Not needed anymore because of (4). We needed to do this with the terminal-controller-manager to make absolutely sure no one created a Terminal resource at a time when e.g. the validating webhook was temporarily deleted, e.g. by human error. Please let me know if you want to know more details about why we did it this way. But basically it's because in the Terminal resource you can reference (kubeconfig) secrets from other namespaces and the validating webhook makes sure that you have the permission to read them. This was also pentested.

I would not call it "central SSH controller in garden cluster" because it is not a controller. It's rather a webhook server with multiple handlers (depends on previous).

I'm also not sure what to call it. However, it still has some minimalistic heartbeat/lifetime controller that takes care of deleting the SSH resource once it's "expired"

Probably it makes sense to evaluate whether we can re-use what already exists (Infrastructure resource and MCM resources) before implementing this again.

Yes that's also the task to evaluate what can be re-used. I hope the @gardener/mcm-maintainers (and others) can give feedback on what can be re-used as I lack experience in this area

About the heartbeat handling - wouldn't it be better to specify a "lifetime" in the SSH resource (that is defaulted to 30m or so (some low value) and can potentially be extended if a longer session would be required)?

I have added a spec.lifetime field as an alternative to the heartbeat annotation (as it is done with the Terminal resource). If we go with the heartbeat annotation approach, we can re-use parts of the terminalheartbeat_controller.go and mutating_webhook
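
For illustration, a minimal Go sketch of the expiry decision such a heartbeat/lifetime controller could make, assuming the last-heartbeat-at annotation from the example above plus a configurable TTL and maximum lifetime; this is not the terminal-controller-manager code, just a sketch of the idea.

package heartbeat

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const lastHeartbeatAnnotation = "operations.gardener.cloud/last-heartbeat-at"

// isExpired returns true if the SSH resource should be deleted, either
// because the last heartbeat is older than the configured TTL or because the
// overall maximum lifetime since creation has been exceeded.
func isExpired(obj metav1.Object, ttl, maxLifetime time.Duration, now time.Time) bool {
	if now.Sub(obj.GetCreationTimestamp().Time) > maxLifetime {
		return true
	}
	last, ok := obj.GetAnnotations()[lastHeartbeatAnnotation]
	if !ok {
		// no heartbeat yet: fall back to the creation timestamp
		return now.Sub(obj.GetCreationTimestamp().Time) > ttl
	}
	t, err := time.Parse(time.RFC3339, last)
	if err != nil {
		return true // an unparsable heartbeat counts as missing
	}
	return now.Sub(t) > ttl
}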

Even if no, how could you prevent it? Can't you simply validate that only the "correct" seed name (= the same name as for the Shoot) can be used?

Yes, makes sense. Or it's just always set/overwritten by the mutating webhook

Hm, why aren't you happy with this approach? It actually sounds good?

Currently, every admin in the project can read the secrets. So every admin can read the ssh credentials for another user/admin in the project.

I didn't understand this field. Why do you want to allow the user to specify anything here? Usually, the IP is anyways determined by the infrastructure provider. Move this from spec to status?

spec.clientIP is the external IP of the user that initiates the SSH session. This IP needs to be added to the firewall rules so only this user can access the bastion host.

See above. You can probably make use of https://github.com/gardener/gardener/blob/master/docs/extensions/project-roles.md.

Yes, good point, makes sense

@petersutter
Contributor Author

@holgerkoser proposed to rename the resource to SSHSession.

(👍 / 👎) If there are no objections, I will rename it accordingly.

@rfranzke
Contributor

@holgerkoser proposed to rename the resource to SSHSession.

Hm, not sure, I'd rather vote for Bastion (because that's what the controller actually creates - the SSH session needs to be established by the user afterwards and is out of scope of the controller)?

I'm also not sure what to call it. However, it still has some minimalistic heartbeat/lifetime controller that takes care of deleting the SSH resource once it's "expired"

Yes, OK, right.

Yes that's also the task to evaluate what can be re-used. I hope the @gardener/mcm-maintainers (and others) can give feedback on what can be re-used as I lack experience in this area

👍🏻

Currently, every admin in the project can read the secrets. So every admin can read the ssh credentials for another user/admin in the project.

OK, yes, true, but is this really a problem? I would think it's good enough to have one dedicated SSH keypair per SSH/SSHSession/Bastion resource. User-specific key pairs seem to not bring a huge benefit (compared to the complexity) as such a bastion would just be short-lived anyway?

spec.clientIP is the external IP of the user that initiates the SSH session. This IP needs to be added to the firewall rules so only this user can access the bastion host.

Ah, ok, makes sense. However, I would rename this to make it clearer what this is about. Probably it could be helpful to align with the ingress naming in the NetworkPolicy resource?

@vlerenc
Contributor

vlerenc commented Mar 21, 2021

@petersutter Thank you for picking this topic up. Here a few comments:

New Components involved in SSH Flow (naming to be defined)

I miss the component that actually sees the public key in the seed and reconfigures sshd.

SSH Flow

Why is the SSH request/resource not node-specific? I guess, it was included, but then @rfranzke wrote (and you removed it):

What is .spec.node/spec.flagProviderID? Why is it needed? The SSH resource should only create a bastion with access to the nodes, that's it. From there the user can SSH to whatever node is desired.

Now it's generally open to all nodes? Do we really want/need that? Here are my thoughts:

Pro:

  • One request to have access to all nodes
  • No complicated bastion handling, whereas with individual node authorizations we would need to implement something like a reference counter for the bastion; but that cannot be too hard to implement and it's anyway needed for multi-user authorizations (two or more operators accessing the same shoot cluster nodes)

Con:

  • Security problem, because it opens up all nodes (security is always also about minimal access)
  • Auditing problem, because we now cannot ensure anymore which nodes an operator touches
  • Rolling update violation, because now I assume all nodes fetch (around the same time) the same public key and reconfigure their sshd

In principle, we must avoid changes that involve all nodes in parallel, because we can never know whether the sudden change, executed in parallel on all nodes, has an impact (a bug that breaks them, or even one that only temporarily causes a hiccup, as we have seen in the recent past with the kubelet restarts on configuration changes). I know, we are talking about a non-critical component here, sshd, but bugs are not unheard of in the world of IT.

After missing heartbeat (configurable) a heartbeat controller, running on the garden runtime cluster, will delete the SSH resource
What does that mean? Where is the heartbeat coming from? From gardenctlv2 or what? I don’t understand. Haven’t we said, let’s make it 8h (a shift or whatever) unless returned back by gardenctlv2. Then the user is free to use whatever client in whatever network setup. Or 1h with the means to extend it.
How to handle seed migrations of a shoot? Central SSH Controller deletes SSH resource?
Proposal: forbid the migration when SSH resources still exist

No, I would not forbid that. CPM trumps everything (we don't do it for fun, but because an operator has invoked the highest emergency measure possible: DR protocol to save a cluster). Besides, where is the problem? Everything that happens on the nodes continues as before (nodes, their workload, here the bastion and the sshd configuration). The seed controller is never acting on the nodes directly (as is no other component) - it is always vice versa, which we use for CPM already. The resources in the garden cluster remain the same, too. Only the controller in the new seed must know of the old state (e.g. bastion) and that’s what the shoot state is good for. If the old seed is not cooperative, it may actively delete the bastion, but that's a corner case in a corner case and acceptable. The new seed will have to detect the missing bastion (that can happen anyway by customer intervention) and either drop the requests for ssh or recreate a bastion.

Should the user be allowed to set the spec.seedName in the SSH resource?

No, that’s not his concern and nothing good can come out of it. Why ask? It always has to be the right seed holding the control plane and that's our job to manage.

Is there a better way to "hand out" the user's node credentials to the user (spec.sshKeypair), without storing it as secret in the garden cluster in the respective project namespace?

I think so. Why not reuse @mvladev's idea here: gardener/gardener#1433 (comment) to implement something like this:

When gardenctl issues the request, the controller in the garden cluster computes a keypair on-the-fly and in-memory, returns the private key in the status, which is ephemeral, but keeps the public key, which is persistent (and will be replicated into the seed and finally onto the node(s)). I would think that's much safer and prevents impersonation by the (Gardener) machinery or anybody with access (another operator or attacker).
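
A minimal Go sketch of such an on-the-fly key pair generation, under the assumption that only the authorized_keys-formatted public key is persisted while the PEM-encoded private key is returned once and never stored:

package sshkeys

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"encoding/pem"

	"golang.org/x/crypto/ssh"
)

// generateKeypair creates an ephemeral SSH key pair in memory. The returned
// private key (PEM) is handed out to the requesting user exactly once; only
// the authorized_keys-formatted public key is persisted and distributed to
// the bastion/node sshd configuration.
func generateKeypair() (privatePEM, authorizedKey []byte, err error) {
	key, err := rsa.GenerateKey(rand.Reader, 4096)
	if err != nil {
		return nil, nil, err
	}
	privatePEM = pem.EncodeToMemory(&pem.Block{
		Type:  "RSA PRIVATE KEY",
		Bytes: x509.MarshalPKCS1PrivateKey(key),
	})
	pub, err := ssh.NewPublicKey(&key.PublicKey)
	if err != nil {
		return nil, nil, err
	}
	authorizedKey = ssh.MarshalAuthorizedKey(pub)
	return privatePEM, authorizedKey, nil
}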

Having spec.clientIP in the SSH resources exposes the external ip of the user to an attacker that has access to the garden cluster with read permission for the SSH resources

Where is that coming from? Isn’t that hard to obtain (on the client, gardenctl doesn’t know it and on the garden cluster it may be disguised by the LB/NAT GWs)? As much as I like it, is that realistic?

@timebertt
Contributor

Is there a better way to "hand out" the user's node credentials to the user (spec.sshKeypair), without storing it as secret in the garden cluster in the respective project namespace?

Why do we have to generate new node credentials at all? I would give the client responsibility for that.
So simply add the user's ssh public key to the BastionHost resource and the controller is then responsible for allowing access with it to the bastion host.
Then it could add the Shoot's SSH credentials to the bastion host so that the user can log in to all machines from there on.

By this, we would lock down the bastion host to a single user (the one who requested it) and we wouldn't need to generate/store/transport any private credentials.

spec.clientIP is the external IP of the user that initiates the SSH session. This IP needs to be added to the firewall rules so only this user can access the bastion host.

This doesn't sound like cloud-native security to me. Using client IP address restrictions is just bandaid and rather indicates a lack of proper authentication mechanisms.
As Vedran said, the client IP could be NATed, meaning it doesn't identify a "single user". And we would only restrict access to the bastion host to this IP, because there are shared credentials stored in the garden cluster (meaning anyone with access to those keys can "authenticate" as the user requesting the bastion host).
So I would drop this field/mechanism in favor of proper authentication mechanisms using the user's ssh public key as proposed above.

@vlerenc
Contributor

vlerenc commented Mar 21, 2021

Why do we have to generate new node credentials at all? I would give the client responsibility for that.
So simply add the user's ssh public key to the BastionHost resource and the controller is then responsible for allowing access with it to the bastion host.

That's what we actually don't want @timebertt, because then we cannot control how well they were picked, whether they are rotated and stored safely (the user hasn't leaked her private key), etc. At no point can/should we trust the end user/operator, I believe.

This doesn't sound like cloud-native security to me. Using client IP address restrictions is just bandaid and rather indicates a lack of proper authentication mechanisms.

I believe that is not quite true. As with security in general, it is always about multiple layers of security. There is no single measure that, if done properly, makes everything suddenly safe. It is always the combination of many things.

That said, I am not and never was a friend of IP whitelisting, but in some cases it is a valid means, e.g. our infrastructure clusters can be (additionally) protected this way (and just because it's additional it doesn't mean it's not also increasing our security posture) and I would not reject the thought (hiding end points often helps not becoming prey in the first place).

The client IP could be NATed...

Yes, I have actually technical doubts here (speaking against).

@prashanth26

prashanth26 commented Mar 22, 2021

Thanks @petersutter for looking into this topic.

Yes that's also the task to evaluate what can be re-used. I hope the @gardener/mcm-maintainers (and others) can give feedback on what can be re-used as I lack experience in this area

To help with spinning up a bastion VM via MCM, the following would be the flow (a rough sketch follows at the end of this comment):

  • We would have to create/set up all the required MCM resources with their fields - MachineDeployment, MachineClass and Secret(s) (referenced by the MachineClass)

  • A chart for machineClass deployment can be found here.

  • And the way to fill the chart up can be found here at the extension controller.

  • Once you have created these three resources (in the shoot namespace on the seed), there must already be an MCM running in that namespace.

  • This MCM will be able to manage the created resources by creating MachineSet and Machine objects for them, which will spin up the required VM.

  • The key difference would be the values given to the fields. You can probably reuse most of the fields from the extension controller code; however, fields like SSHKey, SecurityGroups, etc. need to be taken care of (or created) by an external entity/controller as required.

Lastly, we can help you out with this part. Once we have the proposal cemented, @AxiomSamarth and I can help you with the implementation details.

Controller reads cloudprovider credentials from seed-shoot namespace
Deploy infrastructure resources
bastion host vm, should instrument the machine controller manager if possible
create security groups / firewall rules etc using the respective infrastructure SDKs.
Write SSH keypair to garden cluster as secret (to be discussed)

Regarding this, we will also have to create the SSH key pair prior to trying to create the bastion VM. Probably something you have already thought of, but I just thought to mention it.
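
To make the MCM flow above a bit more tangible, here is a rough Go sketch that creates a one-replica MachineDeployment for the bastion with a controller-runtime client. The field paths follow the machine.sapcloud.io/v1alpha1 API but should be treated as assumptions; the real structure (and the MachineClass/Secret contents) has to be taken from the linked chart and extension controller code.

package bastion

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createBastionMachineDeployment creates a one-replica MachineDeployment in
// the shoot namespace on the seed, referencing a previously created
// MachineClass (plus its Secret). Field paths are assumptions to be verified
// against the machineclass chart and extension controller code linked above.
func createBastionMachineDeployment(ctx context.Context, c client.Client, namespace, name, machineClassName string) error {
	md := &unstructured.Unstructured{}
	md.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "machine.sapcloud.io",
		Version: "v1alpha1",
		Kind:    "MachineDeployment",
	})
	md.SetNamespace(namespace)
	md.SetName(name)
	md.Object["spec"] = map[string]interface{}{
		"replicas": int64(1),
		"selector": map[string]interface{}{
			"matchLabels": map[string]interface{}{"name": name},
		},
		"template": map[string]interface{}{
			"metadata": map[string]interface{}{
				"labels": map[string]interface{}{"name": name},
			},
			"spec": map[string]interface{}{
				"class": map[string]interface{}{
					"kind": "MachineClass",
					"name": machineClassName,
				},
			},
		},
	}
	return c.Create(ctx, md)
}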

@vlerenc
Contributor

vlerenc commented Mar 22, 2021

@prashanth26 In regards to the "SSH key pair" on infrastructure level: it is the purpose of this BLI to completely eliminate them/have none/drop that code. It is inherently unsafe to have them in the first place as they are neither personal nor can they be rotated (at least not in AWS, I believe to remember; there they are immutable once the EC2 instance is created). So, we want to get rid of them for good and take full control of sshd ourselves, have no means of access unless a controller on the machine detects the authorised request for SSH access and adds the public key(s) (and removes it/them later). No key pairs in the infrastructure anymore.

@timebertt
Contributor

That's what we actually don't want @timebertt, because then we cannot control how well they were picked, whether they are rotated and stored safely (the user hasn't leaked her private key), etc. At no point can/should we trust the end user/operator, I believe.

Hmm, ok. That wasn't exactly clear to me. But how would it be better to hand out credentials of the same kind to the user/operator on the fly? They could also immediately leak them and we have to trust them not to do so. The only thing that would make it less risky is that the credentials are fresh, right?
If the controller creates them on the fly, there is similarly no way to rotate them. I guess the only way to invalidate them would be to shut down the bastion host (which would also work for the flow I described).

So, we want to get rid of them for good and take full control of sshd ourselves, have no means of access unless a controller on the machine detects the authorised request for SSH access and adds the public key(s) (and removes it/them later). No key pairs in the infrastructure anymore.

Ah ok, that one's new to me. I didn't see it in this proposal. If we want to do this, it would of course totally change the game.
It would give us means to invalidate credentials without shutting down the host.

That said, I am not and never was a friend of IP whitelisting, but in some cases it is a valid means, e.g. our infrastructure clusters can be (additionally) protected this way (and just because it's additional it doesn't mean it's not also increasing our security posture) and I would not reject the thought (hiding end points often helps not becoming prey in the first place).
Yes, I have actually technical doubts here (speaking against).

Hmm, ok. Maybe it can be done as an additional security measure.

But I can already picture myself cursing this restriction, when I switch from my unstable ISP connection to LTE, then later realizing that I need to connect to the VPN and thereby switching client IPs multiple times a day.

Still, my point was that in the above proposal, the client IP restriction is the only thing making the bastion host (or say SSHSession) personal, because SSH credentials generated by the controller are basically shared and are stored in the infrastructure (i.e. garden cluster).
If we can't trust a public key given by the user, then let's at least somehow make the credentials actually personal by encrypting them on the fly or doing some ElGamal/DH-like key exchange per user and per ssh session or something similar.

@vlerenc
Contributor

vlerenc commented Mar 22, 2021

But how would it be better to hand out credentials of the same kind to the user/operator on the fly? They could also immediately leak them and we have to trust them not to do so.

Why would they leak "immediately"? Doesn't the generation on-the-fly have two very evident key properties:

  • They were just created, so they first have to leak, but age-old credentials may have leaked a long time ago
  • The whole process will invalidate them within hours (they will always be created on-the-fly, because there simply are no old ones to still use)

The only thing that would make it less risky is that the credentials are fresh, right?

I don't understand your point. They are fresh, again and again (and old ones are dropped automatically), because they are generated on-the-fly (see description above). Where is the problem in this proposal?

It would of course totally change the game.
It would give us means to invalidate credentials without shutting down the host.

Ah, but yes. It's the purpose of this BLI to improve the SSH handling, and the IaaS key pairs are a primitive means at best that have certain unacceptable limitations. We cannot continue to use them and still meet the high security requirements.

But I can already picture myself cursing this restriction, when I switch from my unstable ISP connection to LTE, then later realizing that I need to connect to the VPN and thereby switching client IPs multiple times a day.

Yes, that's why I also believe it's not worth it. It isn't practical and I see technical challenges (where the IP is coming from was my question: the client cannot send it as it doesn't know it, nor can we be sure that the client still has it when it reaches the backend).

Still, my point was that in the above proposal, the client IP restriction is the only thing making the bastion host (or say SSHSession) personal, because SSH credentials generated by the controller are basically shared and are stored in the infrastructure (i.e. garden cluster).

No, they aren't (shared) and that's what is written above (always personal). Why do you think anything is shared?

Only the public key is used to configure bastion and node. Only the user of gardenctl has "her" private key. Different people can access the same or another node through the same bastion, but always with their private keys. Just because someone has created SSH access, another one cannot impersonate that person. That won't be possible, because only the user of gardenctl who requested access has her private key.

If we can't trust a public key given by the user, then let's at least somehow make the credentials actually personal by encrypting them on the fly or doing some ElGamal/DH-like key exchange per user and per ssh session or something similar.

I do not understand. Why would we not trust them and why are they not personal? I am at a loss here. Maybe we continue out-of-band @timebertt ...

@vlerenc
Contributor

vlerenc commented Mar 22, 2021 via email

@timebertt
Contributor

Maybe we continue out-of-band @timebertt ...

Yep, please. There are some misunderstandings at play here.

@prashanth26

@prashanth26 In regards to the "SSH key pair" on infrastructure level: it is the purpose of this BLI to completely eliminate them/have none/drop that code. It is inherently unsafe to have them in the first place as they are neither personal nor can they be rotated (at least not in AWS, I believe to remember; there they are immutable once the EC2 instance is created). So, we want to get rid of them for good and take full control of sshd ourselves, have no means of access unless a controller on the machine detects the authorised request for SSH access and adds the public key(s) (and removes it/them later). No key pairs in the infrastructure anymore.

Okay. Got it. Will keep that in mind.

@timebertt
Contributor

Ok, there were indeed some misunderstandings happening here. @vlerenc and I had a chat to clarify.

Originally (based on the initial proposal from @petersutter), I assumed that the SSH controller would store the generated credentials in some secret in the garden. And that's why I thought of the credentials as "shared", as they are accessible by every project member / gardener operator.

The proposal from @mvladev to make credentials available via some temporary subresource or a similar mechanism will require an extension API server and won't work with CRDs. It seems undesirable to us to reuse gardener-apiserver for it (we rather want to decouple gardenctl from g/g's core components) or to add another extension API server, so we would need to take another approach.

Another option, which I tried to describe above, for making the credentials only accessible to a specific user (the requestor of the "SSH session") would be to include the user's public GPG key in the SSH resource; the generated private key would then be encrypted with this key so that only they can decrypt it (which can be done by gardenctl on the fly).
The advantage of this approach would be that credentials are freshly generated but still "personalized".
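
A minimal Go sketch of that encryption step, assuming the user's armored public GPG key is submitted with the SSH resource; it uses the (deprecated but widely available) golang.org/x/crypto/openpgp package purely to illustrate the idea.

package sshkeys

import (
	"bytes"
	"strings"

	"golang.org/x/crypto/openpgp"
	"golang.org/x/crypto/openpgp/armor"
)

// encryptForUser encrypts the freshly generated private SSH key with the
// user's public GPG key so that only the requesting user can decrypt it
// (e.g. done by gardenctl on the fly). Returns an armored PGP message.
func encryptForUser(privateKeyPEM []byte, armoredGPGPublicKey string) ([]byte, error) {
	recipients, err := openpgp.ReadArmoredKeyRing(strings.NewReader(armoredGPGPublicKey))
	if err != nil {
		return nil, err
	}

	var buf bytes.Buffer
	armorWriter, err := armor.Encode(&buf, "PGP MESSAGE", nil)
	if err != nil {
		return nil, err
	}
	plaintext, err := openpgp.Encrypt(armorWriter, recipients, nil, nil, nil)
	if err != nil {
		return nil, err
	}
	if _, err := plaintext.Write(privateKeyPEM); err != nil {
		return nil, err
	}
	if err := plaintext.Close(); err != nil {
		return nil, err
	}
	if err := armorWriter.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}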

Other questions/topics that are unclear to us / that we are missing in the initial proposal:

  • Are bastion hosts reused between different SSH sessions (from potentially different users)?
  • What about the component/controller sitting on the nodes and reconfiguring sshd? What's the flow here?
  • Which credentials are then used for accessing the nodes from the bastion host?

@vlerenc Please add if I forgot something.
@ all: a joint call about the open questions will hopefully speed up the discussion here.

@petersutter
Contributor Author

[@vlerenc] What does that mean? Where is the heartbeat coming from? From gardenctlv2 or what? I don’t understand. Haven’t we said, let’s make it 8h (a shift or whatever) unless returned back by gardenctlv2. Then the user is free to use whatever client in whatever network setup. Or 1h with the means to extend it.

I didn't understand the remark about the network setup. The heartbeat is coming from whatever client, e.g. gardenctl. The landscape administrator configures the controller that cleans up the resource on missing heartbeats:

  • max lifetime - to not have bastions that can exist "forever"
  • TTL

If the landscape admin configures the TTL high enough (like 8h), the client would not necessarily send a heartbeat, however I would plan that gardenctl sends a heartbeat.

Are bastion hosts reused between different SSH sessions (from potentially different users)?

My initial thinking was that it's not reused. If it is reused, there MUST be means to extend its lifetime. If the max lifetime is about to be reached (e.g. it expires in less than 30 minutes) a new Bastion should be created.
I guess it should then also be clear who keeps the bastion alive. The mutating webhook should take care to document this accordingly on the resource.

What about the component/controller sitting on the nodes and reconfiguring sshd? What's the flow here?

Good point, I was actually citing @vlerenc on this one from the gardenctlv2 issue, assuming that there are already existing concepts on gardener that can be reused or that there are some ideas on that already. Seems not to be the case

Which credentials are then used for accessing the nodes from the bastion host?

@timebertt Are you suggesting to have different (temporary) credentials for the bastion and the node? Otherwise I did not get the question

@vlerenc
Contributor

vlerenc commented Mar 22, 2021

The heartbeat is coming from whatever client, e.g. gardenctl. The landscape administrator configures the controller that cleans up the resource on missing heartbeats:
If the landscape admin configures the TTL high enough (like 8h), the client would not necessarily send a heartbeat, however I would plan that gardenctl sends a heartbeat.

@petersutter I was coming from the practical side and how to use this feature in conjunction with custom SSH clients, etc. If the heart-beat is not short-lived, but rather 4h or 8h (and can be extended once or twice), great. :-)

My initial thinking was that [bastions] are not reused. If it is reused, there MUST be means to extend its lifetime. If the max lifetime is about to be reached (e.g. it expires in less than 30 minutes) a new Bastion should be created.
I guess it should then also be clear who keeps the bastion alive. The mutating webhook should take care to document this accordingly on the resource.

Personally, I would rather handle it this way:

  • In the garden cluster the operator places the "intent" to gain access to a particular node or nodes (can be extended afterwards, both in time and scope, i.e. more/other nodes) of a particular shoot in a particular project (implicit via namespace). A controller either creates the key pair on-the-fly (if the resource was created for that operator for the first time) and returns the private key (similar to Martin's proposal in a different context) while persisting the public key as a secret that is referenced in the status of the intent... or it generates the key pair and (for the arguments Tim proposed) persists them both, but encrypts the private key with the user's public key that gardenctl presented, so that nobody can impersonate that operator (in a separate call, gardenctl can download the encrypted private key and decrypt it locally with the user's private key).
  • This gets broken down into separate resources in the seed, one for the shoot bastion (that then gets created, if it doesn't exist yet) and one per node (to which the user has requested access, but only those, i.e. each node has its own resource) - both created/managed by a controller in the seed watching the intent(s) in the garden cluster and populated with the public keys for sshd on bastion and nodes. That controller also does the reference counting or uses Kubernetes ownerRefs for it (all nodes are owners of the bastion; see the sketch at the end of this comment), and if the last node access of the last operator is gone, the bastion can be deleted (though having custom logic that keeps the bastion around for another hour or so would be cool).

Bastions are heavy-weight. I would definitely reuse them/not recreate them again and again if multiple operators need access. This is helpful not only in the case of multiple operators accessing one shoot, but also if we want fine-grained access to individual nodes. You wouldn't want to create a bastion to access node A and another one to access node B.
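
A minimal Go sketch of the ownerRef-based reference counting from the second bullet, using hypothetical Bastion and per-node access objects: each access "intent" is added as an owner of the Bastion, so the Kubernetes garbage collector removes the Bastion only once all access objects are gone.

package bastion

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// addAccessOwnerRef registers a (hypothetical) per-node access object as an
// additional owner of the (equally hypothetical) Bastion object. With one
// ownerReference per access "intent", the Kubernetes garbage collector only
// removes the Bastion once all access objects are gone - a simple form of
// reference counting.
func addAccessOwnerRef(bastion, access metav1.Object, accessGVK schema.GroupVersionKind) {
	refs := bastion.GetOwnerReferences()
	for _, ref := range refs {
		if ref.UID == access.GetUID() {
			return // this access object is already registered as an owner
		}
	}
	refs = append(refs, metav1.OwnerReference{
		APIVersion: accessGVK.GroupVersion().String(),
		Kind:       accessGVK.Kind,
		Name:       access.GetName(),
		UID:        access.GetUID(),
	})
	bastion.SetOwnerReferences(refs)
}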

@petersutter
Contributor Author

I would once again come back to the proposal from @timebertt of using the user's public key for the authentication, however with some slight adjustments. @vlerenc you said that

we cannot control how well they were picked, whether they are rotated and stored safely (the user hasn't leaked her private key), etc. At no point can/should we trust the end user/operator, I believe.

However, the validating webhook could take a look at the algorithm used, the validity bounds (NotBefore should be a recent date, NotAfter should not be too far in the future), the key length, and maybe perform other checks as necessary.
This way the client is forced to provide a new and strong public key. This would make things much easier.
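
For illustration, a minimal Go sketch of such checks on a user-supplied key in authorized_keys format. Note that, as discussed right below, a plain SSH public key carries no NotBefore/NotAfter validity, so only the algorithm and (for RSA) the key size can be checked this way; the 4096-bit minimum is an arbitrary assumption.

package webhook

import (
	"crypto/rsa"
	"fmt"

	"golang.org/x/crypto/ssh"
)

// validatePublicKey rejects weak user-supplied SSH public keys. A plain SSH
// public key has no validity bounds (NotBefore/NotAfter), so only the
// algorithm and, for RSA, the key size can be checked here.
func validatePublicKey(authorizedKey []byte) error {
	pub, _, _, _, err := ssh.ParseAuthorizedKey(authorizedKey)
	if err != nil {
		return fmt.Errorf("cannot parse public key: %w", err)
	}
	switch pub.Type() {
	case ssh.KeyAlgoED25519:
		return nil
	case ssh.KeyAlgoRSA:
		cryptoPub, ok := pub.(ssh.CryptoPublicKey)
		if !ok {
			return fmt.Errorf("cannot inspect RSA key")
		}
		rsaPub, ok := cryptoPub.CryptoPublicKey().(*rsa.PublicKey)
		if !ok || rsaPub.N.BitLen() < 4096 {
			return fmt.Errorf("RSA keys must be at least 4096 bits")
		}
		return nil
	default:
		return fmt.Errorf("unsupported key algorithm %q", pub.Type())
	}
}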

@vlerenc
Contributor

vlerenc commented Mar 22, 2021

However, the validating webhook could take a look at the algorithm used, the validity bounds (NotBefore should be a recent date, NotAfter should not be too far in the future), the key length, and maybe perform other checks as necessary.
This way the client is forced to provide a new and strong public key. This would make things much easier.

Is that information in a plain (SSH RSA) private/public key? Aren't these fields only available in certs? Also, if we require strong/quick pseudo-rotation/validity, how would that help?

@petersutter
Contributor Author

Oh right, I was not thinking of a plain SSH public key. This does not work then.

@tedteng
Contributor

tedteng commented Mar 23, 2021

Are bastion hosts reused between different SSH sessions (from potentially different users)?

How does the exit/clean-up mechanism for the bastion work in v2? Maybe it was already discussed somewhere and I missed it.
We may have the following issue if we reuse the bastion host:

  • The current design (gardenctl v1 ssh) is: when the user exits the bastion host, the cleanup method is invoked immediately and all resources are removed (removal order: bastion host -> firewall rules (public IP allow) -> security group).
  • User A creates a bastion host as a jump server to SSH into node A of project A from public IP A. Next, user B accesses node B in the same project A via the bastion host from public IP B. The bastion host is reused as it was already created by user A in the same project A.
  • The issue is that the cleanup is triggered when user A completes the task and exits the bastion host. That means the bastion host, security group, and firewall rules (public IP A allow) created by user A are destroyed; however, user B may still need a session on node B.

That brings a dilemma: the cleanup process from user A will destroy the bastion host and the firewall rules (public IP A), and then the process breaks when removing the security group because public IP B still exists. User B loses the session because the bastion host is missing, and the security group (public IP B) remains at the hyperscaler.

@vlerenc
Contributor

vlerenc commented Mar 23, 2021

@tedteng That's what is meant by the reference counter for the bastion above. The controller in the seed "keeps book" of all "intents" for this bastion and deletes the bastion (ideally delayed, but that's a bit more effort) after the last "intent" (= usage) is gone.

@petersutter changed the title from "gardenctl v2 - SSH controller proposal" to "gardenctl v2 - SSH controller proposal - old" on Mar 29, 2021
@petersutter
Contributor Author

petersutter commented Mar 29, 2021

Will open a new issue with a new proposal and a different approach as a result of "offline" discussions -> #510
