Running primera3par-csp as a DaemonSet or Deployment with multiple replicas #237
Comments
How has this been isolated to the single replica CSP? Do you have any empirical tests to share that what you're suggesting actually helps? It could be the backend API server (Primera/3PAR) that gets slammed and grinds to a halt. The CSP is just an API gateway that should be handling this just fine. Also ...
@datamattsson
On v1.4 we see many calls from the hpe-csi-node pods (48 of them, since we have 48 workers) to the primera3par-csp pod; in return the primera3par-csp pod makes a call to the kube-apiserver to find out whether that volume exists.
If by API server you mean the storage array, then I say no. This blockage is happening at the primera3par-csp pod. Below is a 48 worker node cluster where we have v1.3 deployed and around 425 volumes; please see the request IDs (which are all sequential) and the timestamps:
And below is another 48 worker node cluster where we have v1.4 deployed with around 600 volumes; again please notice the timestamps and the request IDs:
I know the implementation is done on the hpe-csi-node, but it is hpe-csi-node that asks primera3par-csp for the existence of the volume, which in turn forwards that lookup to the kube-apiserver. Once we received a "patched" hpe-csi-node with that single line of code taken out, provisioning of a new volume (provisioning/attaching/mounting) came down from 10 minutes to 36 seconds on that same 48 worker node cluster, and we started to see less ... I would like to reiterate my initial question: is it possible to deploy primera3par-csp as a DaemonSet?
Thanks for elaborating! I hope we can get some more eyes on this (@rgcostea @raunakkumar @rkumpf). AFAIK the CSP should not make any calls to the kube-apiserver for ... I do, however, like the DaemonSet idea.
Thank you @datamattsson.
I believe that is the case, but again @sneharai4 should be able to confirm it. I look forward to hearing from the team.
The node driver does make a call to the CSP to fetch the volume attributes, so there could be multiple nodes trying to query the CSP to retrieve the volume attributes. We will try to test out the above scenario with other CSPs (Nimble, CV) and confirm whether we hit the same issue. We haven't considered the DaemonSet approach, but we were thinking of including a flag ...
@raunakkumar, thanks for your message. As I cannot see the source code of the CSP: can the primera3par-csp work with multiple replicas? Are there any issues you can think of?
Some updates on this: in our lab (3 worker cluster) I have tried primera3par-csp with 3 replicas. I expect a DaemonSet with an externalTrafficPolicy: Local service will just work, but that will not help us in our normal cluster (48 workers), since the controller will talk to its local CSP pod and all requests will be stacked on it. For example, on one worker we have around 70-80 pods with around 20 hpe.csi.com volumes; draining that node keeps the CSP pod busy for around 12-13 minutes until all the volumes are detached from that host and attached to other workers across the cluster. I.e. an elasticsearch pod takes around 15 minutes to start on another worker, and 12 of those minutes are gone on detach/attach and mount operations. So the idea is that by increasing the replica count of the CSP pod, at least the detach and attach operations ...
During the test I discovered that the session between the CSP pod and the storage array is initiated by the controller, so this becomes problematic when the second request to the CSP service hits the second CSP pod: there is no session on that one, so a new session has to be established, which results in further time loss. When all the CSP pods have their sessions with the storage array things get fast, but that only lasts for 15 minutes; then all the CSP pods lose their sessions at the same time and another round of time loss hits the cluster, just because the controller needs to initiate the session renewal for the CSP pods.
I believe that with minimal code change both on primera3par-csp and on the csi-driver, the CSP deployment can have more than 1 replica, which will speed up create/detach/attach/delete volume operations considerably in big clusters. I have one question though: can the storage array handle this multi-replica set-up? Or what is the limit?
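For reference, the multi-replica test above boils down to something like the following sketch, assuming the CSP runs as a Deployment called `primera3par-csp` in the `hpe-storage` namespace; the object names, labels and port are assumptions and may differ from the shipped manifests:

```yaml
# Sketch only: scaling the CSP to 3 replicas as in the lab test above.
# Names, labels and the port are assumptions, not the shipped manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: primera3par-csp
  namespace: hpe-storage
spec:
  replicas: 3                  # shipped default is a single replica
  selector:
    matchLabels:
      app: primera3par-csp
  template:
    metadata:
      labels:
        app: primera3par-csp
    spec:
      containers:
      - name: primera3par-csp
        image: <csp-image>     # unchanged CSP image
        ports:
        - containerPort: 8080  # assumed CSP port
```

With a plain ClusterIP service in front of this, requests from the controller are load-balanced across the replicas, which is exactly what exposes the per-pod session problem described above.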
@obacak - A couple of things I would like to clarify here: ...
@imran-ansari, thanks for the extra info. Maybe I should write down my test results below for a 3 replica CSP and a single replica controller on our lab cluster, which has 3 workers: I create a single replica test ... csi-provisioner notices the ...
So on a big cluster there will be huge ping-pong between the controller and the CSP pods until all the CSP pods have a valid session for the storage array, as the CSP ... I did not look at the controller's source code in detail, but I can imagine that if the CSP is to run in a k8s cluster with multiple replicas, the controller needs to be free of any session creation for the CSP against the storage array, and that responsibility should be on the CSP only.
Maybe to get around this, the session information can be maintained in a CRD (besides being in-memory) so that all the CSPs see the same info.
In a new session CRD? |
Yes, but that will be in addition to the in-memory session cache. So the first lookup will always be in-memory, and if the session is not found there, then the CRD will be queried. If the session is still not found, then the session-creation flow will take place.
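For illustration only, a sketch of what such a shared session record could look like as a custom resource; the API group, kind and field names here are hypothetical and do not exist in the driver or CSP today:

```yaml
# Hypothetical sketch: a CRD-backed session record that CSP replicas could
# consult after an in-memory cache miss, before starting the session-creation flow.
# The group, kind and fields are illustrative only.
apiVersion: csp.example.com/v1alpha1
kind: ArraySession
metadata:
  name: primera-array-1
  namespace: hpe-storage
spec:
  arrayAddress: 192.0.2.10              # backend array the session belongs to
  sessionSecretRef: primera-array-1-key # actual token kept in a Secret, not in the CR
  expiresAt: "2021-02-01T10:15:00Z"     # replicas reuse the session until it expires
```

Keeping the actual token in a Secret rather than in the custom resource itself would also limit how much a reader of the CRD could do with the session.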
I think that makes sense. I have been looking at the controller source code and your proposal will ensure minimal code change there; plus, the CSP can always check that CRD for a valid session in case the in-memory session cache is empty.
@obacak Is the issue with volume create or volume attach/detach? Storing the sessions in a CRD works, but it creates a security risk: anyone who could read the session could hit the CSPs with it.
@raunakkumar, the issue is there for any operation involving communication between the CSP and the storage array: create volume, detach volume, attach volume and delete volume. In our clusters (especially the big ones) this ping-pong is not acceptable; it can hit anyone, delaying deployment or upgrade of services.
This is an enhancement request on top of hpe-csi v1.4, where `csi.NodeServiceCapability_RPC_GET_VOLUME_STATS` is enabled on hpe-csi-node. As a result, this puts too much strain on the single replica primera3par-csp deployment on big K8s clusters (we have clusters with 48 workers): the node pods poll that single pod with many requests, saturating it, so detach/attach operations are hindered to a point where the cluster is not usable. We had to ask HPE for a test image in which `csi.NodeServiceCapability_RPC_GET_VOLUME_STATS` is disabled on v1.4, and that solved the immediate problem on the big cluster.
The proposal is the following: would it be possible to deploy the primera3par-csp as a DaemonSet and set its service `spec.externalTrafficPolicy` to `Local`, so the hpe-csi-node pods poll their local primera3par-csp pod for the volume stats, and the CSP pod which runs on the same K8s worker as the controller works as a "leader"?
Could you please advise?
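For completeness, a minimal sketch of the proposed set-up, with assumed names, labels and port. One caveat: `spec.externalTrafficPolicy` only affects NodePort/LoadBalancer traffic, so for in-cluster requests from the hpe-csi-node pods the equivalent knob is `spec.internalTrafficPolicy: Local` (Kubernetes 1.21+), which keeps each node's requests on its local CSP pod:

```yaml
# Sketch only: CSP as a DaemonSet with a node-local Service in front of it.
# Names, labels and the port are assumptions, not the shipped manifests.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: primera3par-csp
  namespace: hpe-storage
spec:
  selector:
    matchLabels:
      app: primera3par-csp
  template:
    metadata:
      labels:
        app: primera3par-csp
    spec:
      containers:
      - name: primera3par-csp
        image: <csp-image>        # same CSP image as the shipped Deployment
        ports:
        - containerPort: 8080     # assumed CSP port
---
apiVersion: v1
kind: Service
metadata:
  name: primera3par-csp-svc
  namespace: hpe-storage
spec:
  selector:
    app: primera3par-csp
  internalTrafficPolicy: Local    # route each node's requests to its local CSP pod
  ports:
  - port: 8080
    targetPort: 8080
```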