Running primera3par-csp as a DaemonSet or Deployment with multiple replicas #237

obacak opened this issue Mar 23, 2021 · 16 comments

@obacak
Contributor

obacak commented Mar 23, 2021

This is an enhancement request on top of hpe-csi v1.4, where csi.NodeServiceCapability_RPC_GET_VOLUME_STATS is enabled on hpe-csi-node. As a result, too much strain is put on the single-replica primera3par-csp deployment on big K8s clusters (we have clusters with 48 workers): that single pod is polled with many requests and saturated, so detach/attach operations are hindered to the point where the cluster is not usable.

We had to ask HPE for a test image with csi.NodeServiceCapability_RPC_GET_VOLUME_STATS disabled on v1.4, and that solved the immediate problem on the big cluster.

The proposal is the following:

Would it be possible to deploy primera3par-csp as a DaemonSet and set its service spec.externalTrafficPolicy to Local, so that the hpe-csi-node pods poll their local primera3par-csp pod for volume stats, and the CSP pod running on the same K8s worker as the controller acts as a "leader"?
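
To illustrate, here is roughly the shape I have in mind. This is only a sketch: the namespace, labels and image are assumptions on my side (the image tag is the one we see in our logs), not HPE's shipped manifests, and the exact traffic-policy knob is an open question since externalTrafficPolicy only takes effect on NodePort/LoadBalancer Services.

    # Sketch only: names, labels, namespace and image are assumptions, not HPE's manifests.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: primera3par-csp
      namespace: hpe-storage
    spec:
      selector:
        matchLabels:
          app: primera3par-csp
      template:
        metadata:
          labels:
            app: primera3par-csp
        spec:
          containers:
            - name: primera3par-csp
              image: quay.io/hpestorage/hpe3parprimera-csp:v1.2.1  # tag taken from our logs
              ports:
                - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: primera3par-csp-svc
      namespace: hpe-storage
    spec:
      selector:
        app: primera3par-csp
      ports:
        - port: 8080
          targetPort: 8080
      # externalTrafficPolicy only governs NodePort/LoadBalancer traffic; keeping plain
      # in-cluster ClusterIP traffic node-local would need a different mechanism
      # (newer Kubernetes offers internalTrafficPolicy: Local for that).
      type: NodePort
      externalTrafficPolicy: Local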

Could you please advise?

@datamattsson
Collaborator

How has this been isolated to the single replica CSP? Do you have empirical tests to share that show what you're suggesting actually helps? It could be the backend API server (primera/3par) that gets slammed and grinds to a halt. The CSP is just an API gateway that should be handling this just fine. Also, csi.NodeServiceCapability_RPC_GET_VOLUME_STATS is not implemented properly on the primera/3par CSP yet, and that could be a different problem altogether.

@sneharai4
Contributor

@datamattsson csi.NodeServiceCapability_RPC_GET_VOLUME_STATS invokes GetVolumeById on the CSP; does it really matter what response is received (missing params free_bytes/used_bytes)? BTW, we only got to know about the changes on the spec side last Friday.

@obacak
Contributor Author

obacak commented Mar 23, 2021

> How has this been isolated to the single replica CSP? Do you have empirical tests to share that show what you're suggesting actually helps?

On v1.4 we see many calls from the hpe-csi-node pods (48 of them, since we have 48 workers) to primera3par-csp-svc, like:

>>>>> Get Volume Cmd - Volume name/id: pvc-36acb784-6dd5-4f4f-b748-995c9aecadfe" file="get_volume_cmd.go:59"

In return, the primera3par-csp pod makes a call to the kube-apiserver to find out whether that hpevolumeinfo object exists or not. Those requests stack up to the point that a volume create request coming from the controller, which triggers the same method on primera3par-csp (>>>>> Get Volume Cmd - Volume name/id), takes too long to return; that initial loss of time has a cascading effect on provisioning, attaching and mounting operations.

> It could be the backend API server (primera/3par) that gets slammed and grinds to a halt.

If by API server you mean the storage array, then no. The blockage happens at the primera3par-csp pod level: when it receives the >>>>> Get Volume Cmd - Volume name/id request it asks the kube-apiserver for that volume, and there are no calls to the storage array at that point. I believe a call is made to the storage array once the kube-apiserver responds that no such object exists, but I might be wrong on that one since the primera3par code is not open source; maybe @sneharai4 can shed some light here.

On a 48 worker node cluster where we have v1.3 deployed and around 425 volumes, please see the request IDs (which are all sequential) and the timestamps:

    Mar 21, 2021 @ 12:38:27.861    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:38:27Z" level=info msg="[ REQUEST-ID 126096 ] -- >>>>> Create Volume Cmd for volume pvc-d24cf511-0639-42a4-9604-a209bf3eca95" file="create_volume_cmd.go:100"
    Mar 21, 2021 @ 12:38:27.861    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:38:27Z" level=info msg="[ REQUEST-ID 126096 ] -- Create volume request (after unmarshal): &models.CreateVolumeRequest{Name:\"pvc-d24cf511-0639-42a4-9604-a209bf3eca95\", Size:1073741824, Description:\"Block Volume created with the HPE CSI Driver for Kubernetes\", BaseSnapshotId:\"\", Clone:false, Config:models.Config{Cpg:\"\", SnapCpg:\"\", ProvisioningType:\"tpvv\", ImportVol:\"\", ImportVolAsClone:\"\", CloneOf:\"\", Compression:false, ReplicationDevices:\"\", RemoteCopyGroup:\"\", VirtualCopyOf:\"\", VolumeGroup:\"\", IscsiPortalIps:\"\"}}\n" file="request_handler.go:93"
    Mar 21, 2021 @ 12:38:27.859    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:38:27Z" level=info msg="[ REQUEST-ID 126095 ] -- <<<<<< Get Volume By Name" file="request_handler.go:143"
    Mar 21, 2021 @ 12:38:27.859    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:38:27Z" level=info msg="[ REQUEST-ID 126095 ] -- <<<<< Get Volume Cmd - Volume name/id: pvc-d24cf511-0639-42a4-9604-a209bf3eca95" file="get_volume_cmd.go:80"
    Mar 21, 2021 @ 12:38:27.840    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:38:27Z" level=info msg="[ REQUEST-ID 126095 ] -- >>>>>>>>>  Get Volume By Name " file="request_handler.go:137"
    Mar 21, 2021 @ 12:38:27.840    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:38:27Z" level=info msg="[ REQUEST-ID 126095 ] -- >>>>> Get Volume Cmd - Volume name/id: pvc-d24cf511-0639-42a4-9604-a209bf3eca95" file="get_volume_cmd.go:60"
    Mar 21, 2021 @ 12:36:13.644    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:36:13Z" level=info msg="[ REQUEST-ID 126094 ] -- Published value in get volume cmd %!(EXTRA bool=false)" file="get_volume_cmd.go:137"
    Mar 21, 2021 @ 12:36:13.644    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:36:13Z" level=info msg="[ REQUEST-ID 126094 ] --  GET VOLUME BY ID: {\"Mountpoint\":\"\",\"config\":{\"cpg\":\"\",\"provisioning_type\":\"tpvv\"},\"description\":\"Block Volume created with the HPE CSI Driver for Kubernetes\",\"id\":\"pvc-0260012f-a93e-4314-8224-1f281ff2ece1\",\"name\":\"pvc-0260012f-a93e-4314-8224-1f281ff2ece1\",\"published\":false,\"size\":1073741824}" file="get_volume_cmd.go:170"
    Mar 21, 2021 @ 12:36:13.644    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:36:13Z" level=info msg="[ REQUEST-ID 126094 ] -- <<<<<< Get Volume By Id" file="request_handler.go:133"
    Mar 21, 2021 @ 12:36:13.644    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:36:13Z" level=info msg="[ REQUEST-ID 126094 ] -- <<<<< Get Volume Cmd - Volume name/id: pvc-0260012f-a93e-4314-8224-1f281ff2ece1" file="get_volume_cmd.go:174"
    Mar 21, 2021 @ 12:36:13.641    primera3par-csp-77f98b579f-xf9nl    hpestorage/hpe3parprimera-csp:v1.1.0    time="2021-03-21T11:36:13Z" level=info msg="[ REQUEST-ID 126094 ] -- >>>>> Get Volume Cmd - Volume name/id: pvc-0260012f-a93e-4314-8224-1f281ff2ece1" file="get_volume_cmd.go:60"

And see below another 48 worker node cluster where we have v1.4 deployed with around 600 volumes; again, please notice the timestamps and the request IDs:

    Mar 21, 2021 @ 13:07:49.465    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101735 ] -- <<<<<< Get Volume By Id" file="request_handler.go:228"
    Mar 21, 2021 @ 13:07:49.465    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101735 ] -- Published value in get volume cmd %!(EXTRA bool=true)" file="get_volume_cmd.go:69"
    Mar 21, 2021 @ 13:07:49.465    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101735 ] -- <<<<< Get Volume Cmd - Volume name/id: pvc-0857d47f-818f-4e73-91b5-92ad3a51d6d8" file="get_volume_cmd.go:110"
    Mar 21, 2021 @ 13:07:49.465    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101735 ] --  GET VOLUME BY ID: {\"Mountpoint\":\"/var/lib/kubelet/plugins/hpe.com/mounts/pvc-0857d47f-818f-4e73-91b5-92ad3a51d6d8\",\"config\":{\"compression\":\"false\",\"cpg\":\"\",\"provisioning_type\":\"tpvv\",\"snap_cpg\":\"\"},\"description\":\"Block Volume created with the HPE CSI Driver for Kubernetes\",\"id\":\"pvc-0857d47f-818f-4e73-91b5-92ad3a51d6d8\",\"name\":\"pvc-0857d47f-818f-4e73-91b5-92ad3a51d6d8\",\"published\":true,\"size\":1073741824,\"volume_group_id\":\"\"}" file="get_volume_cmd.go:106"
    Mar 21, 2021 @ 13:07:49.398    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101823 ] -- >>>>> Get Volume Cmd - Volume name/id: pvc-32cb257e-aa41-4615-8b1f-76a1feb76d78" file="get_volume_cmd.go:59"
    Mar 21, 2021 @ 13:07:49.398    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101823 ] -- >>>>>>>>>  Get Volume By Name " file="request_handler.go:232"
    Mar 21, 2021 @ 13:07:49.260    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101734 ] --  GET VOLUME BY ID: {\"Mountpoint\":\"/var/lib/kubelet/plugins/hpe.com/mounts/pvc-7ba439ff-4f73-42d1-82ac-fe62e2b91a32\",\"config\":{\"compression\":\"false\",\"cpg\":\"\",\"provisioning_type\":\"tpvv\",\"snap_cpg\":\"\"},\"description\":\"Block Volume created with the HPE CSI Driver for Kubernetes\",\"id\":\"pvc-7ba439ff-4f73-42d1-82ac-fe62e2b91a32\",\"name\":\"pvc-7ba439ff-4f73-42d1-82ac-fe62e2b91a32\",\"published\":true,\"size\":5368709120,\"volume_group_id\":\"\"}" file="get_volume_cmd.go:106"
    Mar 21, 2021 @ 13:07:49.260    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101734 ] -- Published value in get volume cmd %!(EXTRA bool=true)" file="get_volume_cmd.go:69"
    Mar 21, 2021 @ 13:07:49.260    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101734 ] -- <<<<<< Get Volume By Id" file="request_handler.go:228"
    Mar 21, 2021 @ 13:07:49.260    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101734 ] -- <<<<< Get Volume Cmd - Volume name/id: pvc-7ba439ff-4f73-42d1-82ac-fe62e2b91a32" file="get_volume_cmd.go:110"
    Mar 21, 2021 @ 13:07:49.197    primera3par-csp-85858fff66-6wtqn    quay.io/hpestorage/hpe3parprimera-csp:v1.2.1    time="2021-03-21T12:07:49Z" level=info msg="[ REQUEST-ID 101822 ] -- >>>>>>>>>  Get Volume By Id " file="request_handler.go:221"

> Also, csi.NodeServiceCapability_RPC_GET_VOLUME_STATS is not implemented properly on the primera/3par CSP yet, and that could be a different problem altogether.

I know; the implementation is done in hpe-csi-node, but it is hpe-csi-node that asks primera3par-csp whether the volume exists, and the CSP in turn forwards that request to the kube-apiserver.

Once we received a "patched" hpe-csi-node with that single line of code taken out, provisioning of a new volume (provisioning/attaching/mounting) came down from 10 minutes to 36 seconds on that same 48 worker node cluster, and we started to see fewer >>>>> Get Volume Cmd - Volume name/id entries in the primera3par-csp logs.

I would like to reiterate my initial question: is it possible to deploy primera3par-csp as a DaemonSet or as a Deployment with multiple replicas?

@datamattsson
Collaborator

Thanks for elaborating! I hope we can get some more eyes on this (@rgcostea @raunakkumar @rkumpf). AFAIK the CSP should not make any calls to the kube-apiserver for csi.NodeServiceCapability_RPC_GET_VOLUME_STATS, and if that's what you're seeing, that is a problem in itself. Bear in mind I'm not overly familiar with the primera/3par CSP as it uses several CRDs not part of the CSP spec.

I do, however, like the DaemonSet idea with the traffic policy, as the controller-driver would never need to traverse the network to reach a CSP and the node-driver would have its own CSP to fulfill Node* operations against. What I "like" is not usually what ends up in the CSI driver, so I'm hoping the team can get back to you.

@obacak
Contributor Author

obacak commented Mar 23, 2021

Thank you, @datamattsson.

> AFAIK the CSP should not make any calls to the kube-apiserver for csi.NodeServiceCapability_RPC_GET_VOLUME_STATS, and if that's what you're seeing, that is a problem in itself.

I believe that is the case but again @sneharai4 should be able to confirm it.

I look forward to hearing from the team.

@raunakkumar
Collaborator

The node driver does make a call to the CSP to fetch the volume attributes, so there could be multiple nodes querying the CSP to retrieve volume attributes. We will try to test the above scenario with other CSPs (Nimble, CV) and confirm whether we hit the same issue. We haven't considered the DaemonSet approach, but we were thinking of including a flag, disableNodeVolumeStats, by which this feature could be optionally disabled on large-scale systems.
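
For illustration, something like this in the chart values is what we have in mind; the key name and its placement are not final, this is only a sketch of the idea:

    # values.yaml (sketch; flag name and location subject to change)
    disableNodeVolumeStats: true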

@obacak
Contributor Author

obacak commented Mar 24, 2021

@raunakkumar, thanks for your message.
We would definitely need that flag on the v1.4.x driver, but we would also like to be able to get the metrics from the driver with v1.5; this is something we have been keeping an eye on since last year.

As I cannot see the source code of the CSP: can primera3par-csp work with multiple replicas? Are there any issues you can think of?

@obacak changed the title from "Running primera3par-csp as a DaemonSet" to "Running primera3par-csp as a DaemonSet or Deployment with multiple replicas" on Mar 25, 2021
@obacak
Contributor Author

obacak commented Mar 25, 2021

Some updates on this:

In our lab (a 3 worker cluster), I have tried primera3par-csp with 3 replicas.
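
For reference, the only change for this test was the replica count on the existing Deployment; the anti-affinity part below is just what I would add to spread the pods across workers, not something that ships today, and the app label is an assumption:

    # primera3par-csp Deployment (snippet); only replicas was changed for the test
    spec:
      replicas: 3
      template:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    topologyKey: kubernetes.io/hostname
                    labelSelector:
                      matchLabels:
                        app: primera3par-csp  # label is an assumption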

I expect a DaemonSet with a Local traffic policy on the service would just work, but that will not help us in our normal cluster (48 workers), since the controller will always talk to its local CSP pod and all requests will be stacked on it. For example, on one worker we have around 70-80 pods with around 20 hpe.csi.com volumes; draining the node keeps the CSP pod busy for around 12-13 minutes until all the volumes are detached from that host and attached to other workers across the cluster. For instance, an elasticsearch pod takes around 15 minutes to start on another worker, 12 of which are spent on detach/attach and mount operations.

So the idea is that by increasing the replica count of the CSP, at least the detach and attach operations (DeleteVLUNRequest and CreateVLUNRequest) can be sent to the storage array in parallel by multiple CSP pods for different volumes.

During the test I discovered that the session between the CSP pod and the storage array is initiated by the controller. This becomes problematic when the second request to the CSP service hits the second CSP pod: there is no session on that one, so a new session has to be established, which results in further time loss. Once all the CSP pods have their sessions with the storage array things get fast, but that only lasts 15 minutes; then all the CSP pods lose their sessions at the same time and another round of time loss hits the cluster, just because the controller needs to initiate the session renewal for the CSP pods.

I believe that with minimal code changes both in primera3par-csp and in the csi-driver, the CSP deployment could run with more than 1 replica, which would speed up create/detach/attach/delete volume operations considerably in big clusters.

I have one question though: can the storage array handle this multi-replica set-up? Or what is the limit?

@imran-ansari

@obacak - A couple of things I would like to clarify here:

  1. The default session timeout for the array is 15 minutes, which can be changed to a value between 3 minutes and 24 hours.
  2. The controller initiates session creation only once for a given username. When the array session expires, the CSP automatically renews it. The CSI driver continues to use the same session ID that was initially returned to it by the CSP in the session-creation flow. The time loss here is minimal.

@obacak
Contributor Author

obacak commented Mar 26, 2021

@imran-ansari, thanks for the extra info. Maybe I should write down my test results below for a 3 replica CSP and a single replica controller on our lab cluster, which has 3 workers:

I create a single replica test deployment which uses a PVC.

csi-provisioner notices the PVC and issues a CreateVolumeRequest to hpe-csi-controller.hpe-csi-driver (the controller). Initially the controller checks whether the PV (which is to be created) exists via a GET to http://primera3par-csp-svc:8080/containers/v1/volumes?name=pvc-d496f2c0-939c-438e-b90c-40118a8af101.
The request arrives at primera3par-csp pod nr. 1 (csp-0). csp-0 checks against the kube-apiserver whether an hpevolumeinfo with that name exists and sends back a 404 to the controller. The controller then does a POST to the CSP service, http://primera3par-csp-svc:8080/containers/v1/volumes; this request hits the csp-1 pod, which at this point returns the following message to the controller: session renewal required. The controller attempts to log in, and this request arrives on csp-2.
Now, out of the 3 CSP pods, only csp-2 has a session with the storage array, and the next Create Volume Request arrives at csp-0, which replies to the controller with Sending following message to CSI driver: session renewal required and a 404 for the PV that was requested to be created. The controller logs Received a null reader. That is not expected. and again does About to attempt login to CSP for backend ...

So on a big cluster there will be a huge ping-pong between the controller and the CSP pods until all CSP pods have a valid session with the storage array, as the CSP service will balance requests to a different CSP pod each time.

I did not look at the controller's source code in detail, but I can imagine that if the CSP is to run in a K8s cluster with multiple replicas, the controller needs to be freed of any session creation with the storage array on behalf of the CSP; that responsibility should be on the CSP only.

@imran-ansari

Maybe, to get around this, the session information could be maintained in a CRD (besides being in-memory) so that all the CSPs see the same info.

@obacak
Contributor Author

obacak commented Mar 30, 2021

In a new session CRD?

@imran-ansari

> In a new session CRD?

Yes, but that would be in addition to the in-memory session cache, so the first lookup will always be in-memory; if the session is not found there, the CRD will be queried. If the session is still not found, the session-creation flow will take place.
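
Roughly what such an object could look like; this is entirely hypothetical (group, kind and fields are made up for illustration), and the actual session token would most likely have to be encrypted or referenced through a Secret rather than stored in plain text:

    apiVersion: storage.hpe.com/v1      # group/version are assumptions
    kind: HPESessionInfo                # hypothetical kind, does not exist today
    metadata:
      name: primera-backend-1
    spec:
      backend: 10.0.0.10                # array WSAPI endpoint (placeholder)
      username: hpe-csi-user            # placeholder
    status:
      sessionKey: "<opaque or encrypted token>"
      expiresAt: "2021-03-30T12:00:00Z"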

@obacak
Contributor Author

obacak commented Mar 30, 2021

I think that makes sense. I have been looking at the controller source code, and your proposal would ensure minimal code change there; plus, the CSP can always check that CRD for a valid session in case the in-memory session cache is empty.

@raunakkumar
Collaborator

@obacak Is the issue with volume creation or with volume attach/detach? Storing the sessions in a CRD works, but it creates a security risk: anyone who could read that session could hit the CSPs with it.
The ping-pong would occur only on the initial requests; those sessions are cached until the storage provider's TTL is reached. Scaling replicas does come with this challenge.
We are working on retrieving the stats using StatFS (as in your initial PR).

@obacak
Contributor Author

obacak commented Apr 15, 2021

@raunakkumar, the issue is there for any operation involving communication between the CSP and the storage array, so create volume, detach volume, attach volume and delete volume.
The security risk on the CRD can be mitigated by RBAC, granting specific rights to the HPE service accounts only. In our clusters, no one but cluster admins has access to the storage system where hpe-csi is deployed. Plus, this is no different than having the secret necessary to create a session against the storage array. On top of that, we use Calico, so we can easily create a NetworkPolicy so that only the controller can talk to the CSP service.
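
For example, roughly the policy we would apply (the namespace and labels are placeholders for our setup, not the chart's defaults):

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-controller-to-csp
      namespace: hpe-storage                # namespace is an assumption
    spec:
      podSelector:
        matchLabels:
          app: primera3par-csp              # label is an assumption
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app: hpe-csi-controller   # label is an assumption
          ports:
            - protocol: TCP
              port: 8080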

In our clusters (especially the big ones), such ping-pong is not acceptable; it could hit anyone, delaying deployments or upgrades of services.
