ZooKeeperNetEx connection loss issue: an acquired lock seems not to be released #156
Have you checked your ZooKeeper?
Thanks, I'll check it.
@MajeQafouri
+1
What makes you think this is the issue? The exception you reported mentions connection loss.
Can you please report the full stack trace?
This is the stack trace: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
@MajeQafouri looks like a ZooKeeper issue. Did you try checking the logs that @hoerup suggested? With something like this I would expect that the application might hit some errors but would ultimately recover so long as you were catching those errors (e.g. in a web app you might have some failed requests due to the lost connection, but later on a new connection would be established and it would work again). What are you finding?
Actually, after hosting on different envs, I can say it relates to the client hosting environment. Let me make it clear:
- client host env: local Docker Desktop (Win 10 x64)
- client host env: running in debug mode in Visual Studio 2022 (Win 10 x64)
- client host env: hosted on k8s

But I still don't know what the issue is or how I should fix it.
It seems that, in case of an exception, the acquired lock is not released, and even during the next tries it cannot be acquired.
As I described with the different client hosting envs, I can say it's not related to pod/container restarts.
I forgot to mention that I don't receive requests from outside; my application is a web application that has more than 50 workers, each trying to do some work every five seconds (with a different lock key specific to each worker). Is it possible the issue relates to the number of TCP sockets we are opening for each request? Maybe that is why it works without issue on Windows (Visual Studio) and starts to face issues as soon as I containerize the application.
This is possible, but I really would recommend pulling the logs from the zookeeper side and seeing what zookeeper thinks is the reason for connection loss (e.g. is the client timing out, too many connections, etc).
@MajeQafouri Have you seen this one?
Thanks for mentioning the link; I believe the root of the issue is the same. But do you have any solution for that?
I tried to directly use the zookeeper library https://www.nuget.org/profiles/shayhatsor2
@MajeQafouri I'm looking into whether we can work around this problem within DistributedLock itself.
@MajeQafouri I've published a prerelease version of the Medallion.ZooKeeper package here: https://www.nuget.org/packages/DistributedLock.ZooKeeper/1.0.1-alpha001. This replaces the ZooKeeperNetEx dependency with my initial attempt at resolving the issue (I haven't reproduced the issue locally so we'll see how it goes). Can you pull down the prerelease and let me know whether that resolves your issue? Thanks!
@MajeQafouri did you get a chance to test out the prerelease package?
Same issue here!
Sorry for the late answer; unfortunately, I'm packed these days and couldn't test the new package. BTW, thanks for the effort. As soon as I manage to test it, I'll keep you posted.
Hi, we could finally test the alpha package.
@devlnull excellent. Thanks for testing! Feel free to use the prerelease version for now and keep me posted on any issues you encounter. The only change is the ZooKeeperNetEx alternative. If @MajeQafouri also comes back with a good test result I'll publish it as a stable version.
Finally, I tested the new alpha version. Overall it's much better, with just a few connection loss exceptions. Client host env: hosted on k8s (3 pods). The exception stack trace:
I hope it's helpful.
Thanks @MajeQafouri! For these few remaining connection losses, do they also appear in the ZooKeeper logs or are they still “phantom connection losses”? Does the application recover after this happens or not?
In the ZooKeeper log we don't get exception logs, but on the app side we have them, and the application seems to recover itself after the connection loss exception, actually after a SessionExpiredException. Again, I ran the app on k8s with 3 pods and the ZooKeeper service on one pod. I'm sending you the app log and also the ZooKeeper log; you can also check the acquire and release logs, which include the exceptions. I hope it's helpful. MyApp_Pod1_Logs.txt
@MajeQafouri I do see a disconnect/session expired log in the zookeeper log, though. Could that be related?
I hardly think so, but I couldn't figure out why we get this error once in a while in the ZooKeeper console: "Connection reset by peer"
Unfortunately we are getting Connection Loss sometimes, but it will be gone in a minute.
@devlnull are you seeing "connection reset by peer"/"connection expired" in the zookeeper logs like @MajeQafouri is?
@madelson Actually I don't have access to zookeeper logs yet, but I will soon.
@devlnull @MajeQafouri I thought it might make sense for me to add some additional verbose logging to the underlying ZooKeeper .NET package; then you could test your apps with the additional logging and we can see if that helps point out the root cause of the issue. Would either/both of you be willing to test in this way?
Good idea, I'll test it.
@MajeQafouri ok I just published the version with logging. Here's how to use it: In your csproj, add the debug version like so:
Then you can link the library's logging to your logging system via the LogMessageEmitted event.
Let me know if you have any trouble getting it working!
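For readers following along, a minimal sketch of wiring that up might look like the following. Only the LogMessageEmitted name appears in this thread; the type that exposes it, the delegate shape, and the log-level knob are assumptions about the debug build, so adjust to its actual API:

```csharp
using System.Diagnostics;           // TraceLevel
using Microsoft.Extensions.Logging;
using org.apache.zookeeper;         // namespace used by the ZooKeeperNetEx-style client

public static class ZooKeeperLogBridge
{
    // Forwards the ZooKeeper client's verbose logging into the application's logger.
    // ASSUMPTION: the debug build exposes a static LogMessageEmitted event on the
    // ZooKeeper class that delivers a plain string, plus a TraceLevel-based LogLevel
    // property; the real hooks may be shaped differently.
    public static void Attach(ILogger logger)
    {
        ZooKeeper.LogLevel = TraceLevel.Verbose;
        ZooKeeper.LogMessageEmitted += message =>
            logger.LogDebug("[ZooKeeper] {Message}", message);
    }
}
```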
@madelson I tried to use the debug package, and the log level was set as you said, but I'm not sure whether I've set them properly or not; I'm still receiving the connection loss exception. The stack trace references:
C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSequentialPathHelper.cs:li
C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:li
C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:li
Even with one instance, it means there is no concurrent request for the resource key.
@MajeQafouri sorry if I wasn't clear. The new build isn't attempting to fix the issue, it's trying to give us more useful logs to potentially diagnose what is going on.
This line pipes the zookeeper logs into whatever logging system you have. You can then grab those logs and investigate and/or post them here.
@madelson I believe I did it; please check the log settings that I've done:
Is it right?
@MajeQafouri that code looks reasonable to me. Are you getting any logs from this? If so can you share?
No, that's the only type of exception and log detail which I receive.
@MajeQafouri that's strange. I see that you're currently logging to your own logger. Could you set a break point inside the LogMessageEmitted event handler?
@MajeQafouri any luck getting the logs? I just did a quick test like this:
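(The original snippet wasn't captured in this copy of the thread; a quick test of that kind, assuming the public ZooKeeperDistributedLock API from DistributedLock.ZooKeeper and a local ZooKeeper, could look roughly like the sketch below. The lock name and connection string are placeholders.)

```csharp
using System;
using System.Threading.Tasks;
using Medallion.Threading.ZooKeeper;

public static class QuickLockTest
{
    public static async Task Main()
    {
        // Repeatedly acquire and release a lock against a local ZooKeeper so the
        // client's verbose logs (connect, create, delete, disconnect) get emitted.
        var zooKeeperLock = new ZooKeeperDistributedLock("log-test-lock", "localhost:2181");

        for (var i = 0; i < 10; i++)
        {
            await using (await zooKeeperLock.AcquireAsync(TimeSpan.FromSeconds(30)))
            {
                await Task.Delay(100); // hold the lock briefly
            }
        }
    }
}
```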
And I get logs like this:
Finally, I managed to put the logger in, but at the Verbose log level there was no detail about the exception. There are just two types of log I've got: response::org.apache.zookeeper.proto.CreateResponse or Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
@MajeQafouri actually this is what I'm looking for. Any chance you can upload a log file?
Got it, but I'm receiving just these two types of logs, not errors (I mean inside the LogMessageEmitted event). But inside the application sometimes the locks cannot be acquired, and the rate is quite high in comparison to other distributed locks; I'm also using other distributed locks, like Redis.
So do the logs you posted account for 100% of the unique lines? That would surprise me since my test generated some different log types. Note that I'm not just interested in error logs, I want to see if we can understand what is happening on your system.
As an experiment, can you try something like this to see if it improves reliability? Again, I'm just trying to understand where the problem lies:
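(The suggested snippet isn't captured above; given the later question about "using retry", one reading is a small retry wrapper around acquisition. A sketch under that assumption, using the DistributedLock.ZooKeeper API and the ConnectionLossException type from the stack traces in this thread:)

```csharp
using System;
using System.Threading.Tasks;
using Medallion.Threading;            // IDistributedSynchronizationHandle
using Medallion.Threading.ZooKeeper;  // ZooKeeperDistributedLock
using org.apache.zookeeper;           // KeeperException.ConnectionLossException

public static class LockRetry
{
    // Retries acquisition a few times when the ZooKeeper client reports connection
    // loss, giving the session time to re-establish instead of failing the worker run.
    public static async Task<IDistributedSynchronizationHandle?> TryAcquireWithRetryAsync(
        ZooKeeperDistributedLock distributedLock, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                // Returns null if the lock is still held by someone else when the wait expires.
                return await distributedLock.TryAcquireAsync(TimeSpan.FromSeconds(10));
            }
            catch (KeeperException.ConnectionLossException) when (attempt < maxAttempts)
            {
                await Task.Delay(TimeSpan.FromSeconds(2 * attempt)); // simple backoff before retrying
            }
        }
    }
}
```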
No problem, I'll follow the approach, but unfortunately for now I don't have access to the server; I'll keep you posted as soon as I apply the changes.
@MajeQafouri / @devlnull any updates here? Any success with collecting logs or using retry?
Hey, I'm afraid not yet.
Hi @madelson, I am facing the same issue, even after using https://www.nuget.org/packages/DistributedLock.ZooKeeper/1.0.1-alpha001.
@dansuse yes, this code is all available; I'd love to have more eyes on it / contributors since I don't have an environment where I can reproduce this. The only change within DistributedLock itself to make the alpha package is swapping out the reference to ZooKeeperNetEx with my fork of that package:
The changes I've made on my fork are on this branch. You can see the diff here. The only real change is in SocketContext.cs. If you have a chance to play around and have any thoughts on how to further improve the fork, I'd love to collaborate on this. One thing I saw in the ZooKeeper docs is:
I wonder if this implies that we should be catching this connection loss exception and just waiting for the client to recover (but for how long?). Without the ability to reproduce this easily in my environment, it's been hard for me to validate. I also suspect there is more to it, since people didn't start facing issues with this until more recent versions of .NET, and it seems like people who are using the ZooKeeperNetEx package directly (vs. through DistributedLock) are also struggling with this...
@madelson we have just tested your change locally and in a K8s cluster, and that code change fixed the issue. Could you issue a PR for this change against the main repo?
@Jetski5822 I filed shayhatsor/zookeeper#54, but the maintainer has not been very active, so I don't anticipate it being merged. If I were to publish a stable version of my Medallion.ZooKeeperNetEx fork and released a stable version of DistributedLock.ZooKeeper which used that, would that be valuable to you?
Has anyone having these issues tried the https://www.nuget.org/packages/ZooKeeper.Net package? Seems like another variant on ZooKeeperNetEx that was published a bit more recently.
Same issue here running ZooKeeper in a Docker container; the alpha release seems to have fixed the issue.
The Vostok.ZooKeeper.Client package was last published in 2022 and has a decent number of downloads. Has anyone tried it? For context, while I can move forward with my patched fork of ZooKeeperNetEx, I'd love to find a higher-quality alternative to rely on going forward since I'm not in a position to maintain a ZooKeeper client.
That used ZooKeeperEx under the hood too :( |
@madelson Where I work, we have a custom ZK build which fixes this and also takes in some of the code you use to manage the ZooKeeper instance. Given our potential reliance on ZK, it might be worth us publishing it; I'll have a chat internally and see if that's possible, then you could rely on that.
I've implemented a lock with ZooKeeper with this configuration:
There are several worker services inside the application, each of them working with a different lock key.
Periodically each one tries to acquire the lock and do some processing. It seems they are working without problems, but after a while I get this exception:
Locking failed. Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown. org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
It seems the lock cannot be acquired because it has not been released, although there is no concurrent request for the lock key.
The LockService code in .NET:
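(The code itself didn't come through in this copy of the issue. As a rough, hypothetical reconstruction only, not the reporter's actual code, a LockService along the lines described, one lock key per worker with a short wait, might look like the sketch below; the connection string, timeout, and member names are placeholders.)

```csharp
using System;
using System.Threading.Tasks;
using Medallion.Threading;            // IDistributedSynchronizationHandle
using Medallion.Threading.ZooKeeper;  // ZooKeeperDistributedLock

public class LockService
{
    private readonly string _connectionString;

    public LockService(string connectionString) => _connectionString = connectionString;

    // Each worker calls this every few seconds with its own lock key. If the lock
    // cannot be acquired within the wait window, the cycle is skipped rather than failed.
    public async Task<bool> RunWithLockAsync(string lockKey, Func<Task> work)
    {
        var distributedLock = new ZooKeeperDistributedLock(lockKey, _connectionString);

        await using IDistributedSynchronizationHandle? handle =
            await distributedLock.TryAcquireAsync(TimeSpan.FromSeconds(5));
        if (handle == null) return false;

        await work();
        return true;
    }
}
```

Whether the original service skipped, retried, or threw when acquisition failed isn't shown in the post, so the skip-on-timeout behavior above is just one possible choice.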
I appreciate any suggestions.