Unable to attach iSCSI volumes #66
Comments
Thanks for reporting this. I'm out of pocket for the next couple of weeks. Can you describe more about the cluster sizes, PVC counts, etc., and I'll try to reproduce this. Historically the CSI driver has struggled with Flannel as a CNI, especially on K3s; is that what you're using?
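Output from something like the following would be enough to size a reproduction (plain kubectl, nothing driver-specific):

```shell
# Node count, OS and kernel per cluster
kubectl get nodes -o wide

# PVC count and sizes across all namespaces
kubectl get pvc --all-namespaces

# Current attach state of the volumes
kubectl get volumeattachments
```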
Yes, I am using k3s with Flannel as the CNI, and k3s is started using …

Each cluster has 3 nodes running Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-131-generic x86_64) and 8 PVCs (sized between 2 and 40 GB) at the moment. Each cluster has a different …
I've enabled the trace log level in the CSI driver.
Even if volumes should be published, if I run … I also tried to manually mount the volume via SSH, and it seems that in such a case there is no problem.

After this, both …
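For reference, the manual check over SSH was roughly the following (portal address, target IQN and device name are placeholders from my environment):

```shell
# Discover targets exposed by the TrueNAS iSCSI portal (placeholder address)
sudo iscsiadm -m discovery -t sendtargets -p 192.0.2.10

# Log in to the target backing the volume (placeholder IQN)
sudo iscsiadm -m node -T iqn.2005-10.org.freenas.ctl:example-volume -p 192.0.2.10 --login

# Locate the new block device and mount it
lsblk
sudo mount /dev/sdX /mnt

# Clean up afterwards
sudo umount /mnt
sudo iscsiadm -m node -T iqn.2005-10.org.freenas.ctl:example-volume -p 192.0.2.10 --logout
```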
Thanks for providing the logs and manually verifying the data path. This is definitely a control-plane issue. I have a hard time following the logs on my phone and I'll have to get back to you later.
Can you capture the logs from the node driver as well? Something must error out there somehow.
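Assuming the default hpe-storage namespace and the stock object names from the chart, something like this should capture them:

```shell
# Find the hpe-csi-node pod on the worker where the volume fails to attach
kubectl get pods -n hpe-storage -o wide | grep hpe-csi-node

# Dump the driver container's log from that pod (placeholder pod name)
kubectl logs -n hpe-storage hpe-csi-node-xxxxx -c hpe-csi-driver > hpe-csi-node.log
```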
I made some more experiments with this issue. I tried k3s with Calico instead of Flannel and also used RKE2 with Calico, but in both cases I got the same error, so it's not something Flannel- or k3s-related.

My clusters are all provisioned from the same VM template on vSphere, so I thought this was the problem. The strange thing is that it only appeared after a certain number of nodes were connected, even though at that point I already had 10+ nodes with duplicated IQNs. Anyway, it seems that even now, with a unique IQN for every node (regenerated as shown below), scaling up makes the same VolumeAttachment error appear.

This is, for example, a cluster I created a few minutes ago: …

It still took 2+ minutes and a lot of failures to attach this volume. Finally, here are the requested logs: …
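For completeness, this is how I checked and fixed the cloned initiator names on each node (standard open-iscsi tooling, nothing CSP-specific):

```shell
# The IQN a node presents to TrueNAS; nodes cloned from the same template share this file
cat /etc/iscsi/initiatorname.iscsi

# Generate a fresh, unique IQN on a cloned node and restart the initiator
echo "InitiatorName=$(/sbin/iscsi-iname)" | sudo tee /etc/iscsi/initiatorname.iscsi
sudo systemctl restart iscsid
```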
This we need to investigate. Can you turn on tracing for the CSI driver? If you're installing with the chart, use … While not ideal perhaps, are you getting the same issues if you disable volume encryption?
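Roughly along these lines; the repo alias and value key here are assumptions, so check `helm show values` against whichever chart you installed (truenas-csp or hpe-csi-driver) first:

```shell
# Inspect which log-related values the installed chart actually exposes
helm show values hpe-storage/hpe-csi-driver | grep -i log

# Assumed key: logLevel (bump to trace and let the pods restart)
helm upgrade --install hpe-csi-driver hpe-storage/hpe-csi-driver \
  -n hpe-storage --set logLevel=trace
```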
These are the logs with tracing enabled.
I didn't try with volume encryption disabled, but I can have a try if you think this test may be helpful.
The node is receiving the publish request:
... and completes it a few seconds later.
So, this leaves us with the CSI controller; something is stalling in the control plane. The … Can you grab the CSI controller logs, with tracing on too, for a failed request? It's the "hpe-csi-driver" container in the …
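Assuming the stock hpe-storage namespace, something like this around the time of a failed attach:

```shell
# Trace-level log from the controller's driver container around a failed request
kubectl logs -n hpe-storage deploy/hpe-csi-controller -c hpe-csi-driver --since=10m > hpe-csi-controller.log
```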
I see a lot of these:
It's strange that it's repeatedly POSTing the host and repeatedly getting nothing useful back. Do you have anything in the CSP log that corresponds to all these POSTs?
I don't see anything too strange; this POST seems to behave like it should.

A suspicious thing may be the following:

The response is ~35k lines long. Is there any limitation on the response size that the system can handle?
I'm sure there is, but I've not seen anything hitting any boundaries. There are some parameters to tighten up the response for …
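As a rough sanity check on how large that listing really is, you could hit the TrueNAS REST API directly; the hostname and API key are placeholders, and the dataset endpoint is my assumption about what the CSP is querying here:

```shell
# Size in bytes of the full dataset listing the CSP has to parse
curl -sk -H "Authorization: Bearer $TRUENAS_API_KEY" \
  "https://truenas.example.com/api/v2.0/pool/dataset" | wc -c

# The collection endpoints accept limit/offset query parameters to trim the response
curl -sk -H "Authorization: Bearer $TRUENAS_API_KEY" \
  "https://truenas.example.com/api/v2.0/pool/dataset?limit=10" | wc -c
```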
While I can't be sure that this is the problem, I saw that the VolumeAttachFailure is more likely to occur when:
I also leave here a complete CSP log file; maybe I missed something.
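To catch the failures while reproducing, I watch the attachments and events like this (plain kubectl; pod and namespace names are placeholders):

```shell
# Watch attach/detach objects while scaling up or deploying charts
kubectl get volumeattachments -w

# Events for a stuck workload; FailedAttachVolume shows up here
kubectl describe pod my-stuck-pod -n my-namespace
kubectl get events -A --field-selector reason=FailedAttachVolume
```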
I have seen troubles with parallelism. An example is running multiple e2e test suites in parallel on a single cluster: it fails 100% of the time with TrueNAS and succeeds 100% of the time on any other CSP (like Nimble). Whether this is a TrueNAS REST API issue or a CSP issue I haven't isolated yet. I'm just baffled that there are no obvious bugs producing an error with any sensible clues.
OK, next week I'll try to rearrange my deployment pipeline to avoid parallel volume mounts and I'll report back.
I've reordered the deployment of my charts to mount a single PV at a time, but I didn't get any improvements.
@datamattsson I got the same issue from the CSI driver. Logs:
Can you open a new issue with this and include the CSI node driver log, not the controller log.
I’ll do it when I encounter this issue again. |
I am using a single virtualized TrueNAS SCALE Dragonfish-24.04.2.2 to provide storage to various k3s clusters.
Each cluster has its own dataset and uses truenas-csp version 2.5.1 to create volumes.
I followed the install instructions and everything seems to work, since PVCs can be provisioned and mounted.
However, once 7 or 8 nodes from the different clusters have connected to TrueNAS, volumes struggle to mount.
I start to see the following event:
Sometimes they are still successfully mounted after some time (varying from 10 seconds to 1 hour), but other times they stay in this error state.
Also, sometimes volumes are mounted without problems.
Volumes are always correctly provisioned, since I can see them listed in the TrueNAS web UI; they just fail to be mounted.
I don't see any obvious error or timeout log in hpe-csi-controller, hpe-csi-node, or truenas-csp, even after enabling debug logging. It just seems that the mount request is pending indefinitely (and sometimes ends successfully).
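For reference, I pulled the logs like this (assuming the default hpe-storage namespace and the object names from the install instructions):

```shell
# Controller, node driver and CSP logs; names are from the stock deployment
kubectl logs -n hpe-storage deploy/hpe-csi-controller -c hpe-csi-driver
kubectl logs -n hpe-storage ds/hpe-csi-node -c hpe-csi-driver
kubectl logs -n hpe-storage deploy/truenas-csp
```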
TrueNAS doesn't seem to show any resource problems like high CPU usage, and I wasn't able to find any error logs there either.
Any idea?