
ZooKeeperNetEx connection loss issue: an acquired lock seems is not released #156

Open
MattGhafouri opened this issue Nov 21, 2022 · 55 comments

Comments

@MattGhafouri

I've implemented a lock with ZooKeeper with this configuration:

  1. DistributedLock.ZooKeeper - Version="1.0.0"
  2. dotnet version 6.0
  3. Hosted on K8s (one pod, there is no concurrent request)
  4. Zookeeper server configuration on K8s :

version: "3.9"
services:
  zk1:
    container_name: zk1
    hostname: zk1
    image: bitnami/zookeeper:3.8.0-debian-11-r57
    ports:
      - 2181:2181
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
      - ZOO_SERVER_ID=1
      - ZOO_SERVERS=0.0.0.0:2888:3888
      - ZOO_MAX_CLIENT_CNXNS=500

There are several worker services inside the application, each working with a different lock key.
Periodically each one tries to acquire its lock and do some processing. They seem to work without problems, but after a while I get this exception:
Locking failed.Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown. org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.

It seems the lock cannot be acquired because it has not been released, although there is no concurrent request for the lock key.

The LockService code in .NET:

private TimeSpan _connectionTimeoutInSecond = TimeSpan.FromSeconds(30);
private TimeSpan _waitingForLockInSecond = TimeSpan.FromSeconds(30);

public async Task<LockProcessResult> DoActionWithLockAsync(string lockKey, Func<Task> func)
{
    var processResult = new LockProcessResult();
    try
    {
        var @lock = new ZooKeeperDistributedLock(lockKey, _configuration.ConnectionString, opt =>
        {
            opt.ConnectTimeout(_connectionTimeoutInSecond);
        });

        await using (var handle = await @lock.TryAcquireAsync(timeout: _waitingForLockInSecond))
        {
            if (handle != null)
            {
                // I have the lock
                await func();
            }
            else
            {
                processResult.SetException(new LockAcquisitionFailedException(lockKey));
            }
        }
    }
    catch (Exception ex)
    {
        // I got the exceptions here
        processResult.SetException(ex);
    }

    return processResult;
}

I appreciate any suggestions.

@hoerup

hoerup commented Nov 21, 2022

Have you checked your zookeeper ?
Made sure that there aren't any pod/container restarts - and checked its log for indications on why it would drop a connection ?

@MattGhafouri
Author

Have you checked your zookeeper ? Made sure that there aren't any pod/container restarts - and checked its log for indications on why it would drop a connection ?

Thanks, I'll check it.

@madelson
Owner

@MajeQafouri

Have you checked your zookeeper ?
Made sure that there aren't any pod/container restarts - and checked its log for indications on why it would drop a connection ?

+1

It seems the lock cannot be acquired because it has not been released

What makes you think this is the issue? The exception you reported mentions connection loss.

after a while, I get this exception

Can you please report the full exception.ToString() including the stack trace? This will allow us to see where inside DistributedLock it is hitting the exception.

@MattGhafouri
Author

This is the stack trace :

Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.

org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
   at org.apache.zookeeper.ZooKeeper.createAsync(String path, Byte[] data, List`1 acl, CreateMode createMode) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 535
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperNodeCreator.cs:line 38
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperNodeCreator.cs:line 54
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 70
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 132
   at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperDistributedLock.cs:line 97

@madelson
Owner

@MajeQafouri looks like a zookeeper issue. Did you try checking the logs that @hoerup suggested?

With something like this I would expect that the application might hit some errors but would ultimately recover so long as you were catching those errors (e. g. in a web app you might have some failed requests due to the lost connection but later on a new connection would be established and it would work again). What are you finding?

@MattGhafouri
Author

Actually, after hosting on different environments, I can say it is related to the client hosting environment. Let me lay it out:

client host env: local Docker desktop (win 10 64x)
zookeeper: hosted on k8s
result: facing the same exception once in a while

client host env: running in debug mode of Visual Studio 2022(win 10 64x)
zookeeper: hosted on k8s
result: working without problem

client host env: hosted on k8s
zookeeper: hosted on k8s
result: facing the same exception once in a while.

But I still don't know what the issue is or how to fix it.

@MattGhafouri
Author

@MajeQafouri looks like a zookeeper issue. Did you try checking the logs that @hoerup suggested?

With something like this I would expect that the application might hit some errors but would ultimately recover so long as you were catching those errors (e. g. in a web app you might have some failed requests due to the lost connection but later on a new connection would be established and it would work again). What are you finding?

It seems that in case of an exception the acquired lock is not released, and even on subsequent tries it cannot be acquired.

@MattGhafouri
Author

Have you checked your zookeeper ? Made sure that there aren't any pod/container restarts - and checked its log for indications on why it would drop a connection ?

As I described with the different client hosting environments, I can say it's not related to pod/container restarts.

@MattGhafouri
Author

MattGhafouri commented Nov 22, 2022

@MajeQafouri looks like a zookeeper issue. Did you try checking the logs that @hoerup suggested?

With something like this I would expect that the application might hit some errors but would ultimately recover so long as you were catching those errors (e. g. in a web app you might have some failed requests due to the lost connection but later on a new connection would be established and it would work again). What are you finding?

I forgot to mention that I don't receive requests from outside: my application is a web application with more than 50 workers, and every five seconds each worker tries to do some work (with a different lock key specific to each worker).

Is it possible the issue relates to the number of TCP sockets we open for each request? Maybe that is why it works without issue on Windows (Visual Studio) and starts facing issues once I containerize the application.
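
As one quick way to check the socket hypothesis from inside the container, a small diagnostic along these lines could be run periodically (an illustrative sketch, not part of the original report; it assumes the ZooKeeper client port is the default 2181):

using System.Linq;
using System.Net.NetworkInformation;

public static class ZooKeeperSocketDiagnostics
{
    // Counts machine-visible TCP connections whose remote port is the ZooKeeper
    // client port. A number that keeps growing over time would support the
    // "too many sockets" theory; a small, stable number would point elsewhere.
    public static int CountZooKeeperConnections(int zooKeeperPort = 2181)
    {
        return IPGlobalProperties.GetIPGlobalProperties()
            .GetActiveTcpConnections()
            .Count(c => c.RemoteEndPoint.Port == zooKeeperPort);
    }
}

Logging this count every few seconds alongside the lock attempts would show whether socket build-up correlates with the ConnectionLossException.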

@madelson
Owner

Is it possible the issue relates to the number of TCP sockets for each request we are opening?

This is possible, but I really would recommend pulling the logs from the zookeeper side and seeing what zookeeper thinks is the reason for connection loss (e.g. is the client timing out, too many connections, etc).

@hoerup

hoerup commented Nov 23, 2022

@MajeQafouri Have you seen this one
shayhatsor/zookeeper#45

@MattGhafouri
Author

MattGhafouri commented Nov 23, 2022

@MajeQafouri Have you seen this one shayhatsor/zookeeper#45

Thanks for mentioning the link; I believe the root of the issue is the same, but do you have any solution for it?

@MattGhafouri
Author

I tried to use the ZooKeeper library directly (https://www.nuget.org/profiles/shayhatsor2).
I faced the same issue in the dockerized environment, then traced the NuGet package code and got an exception in the WriteLock method. Maybe the author can help us @shayhatsor

WARNING WriteLock Caught: org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
   at org.apache.zookeeper.ZooKeeper.deleteAsync(String path, Int32 version) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 587
   at org.apache.zookeeper.recipes.lock.WriteLock.DeleteNode.execute() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 120
   at org.apache.zookeeper.recipes.lock.WriteLock.unlock() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 85
Exc level 0: org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
   at org.apache.zookeeper.ZooKeeper.deleteAsync(String path, Int32 version) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 587
   at org.apache.zookeeper.recipes.lock.WriteLock.DeleteNode.execute() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 120
   at org.apache.zookeeper.recipes.lock.WriteLock.unlock() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 85

@madelson
Owner

@MajeQafouri I'm looking into whether we can work around this problem within DistributedLock itself.

@madelson
Owner

madelson commented Dec 3, 2022

@MajeQafouri I've published a prerelease version of the DistributedLock.ZooKeeper package here: https://www.nuget.org/packages/DistributedLock.ZooKeeper/1.0.1-alpha001

This replaces the ZooKeeperNetEx dependency with my initial attempt at resolving the issue (I haven't reproduced the issue locally so we'll see how it goes).

Can you pull down the prerelease and let me know whether that resolves your issue? Thanks!

@madelson madelson changed the title An acquired lock seems is not released ZooKeeperNetEx connection loss issue: an acquired lock seems is not released Dec 16, 2022
@madelson
Owner

@MajeQafouri did you get a chance to test out the prerelease package?

@devlnull

Same issue here!
Actually I just faced this issue with 3 pods trying to use ZooKeeper in k8s.
I'll try the alpha package and give feedback here.

@MattGhafouri
Author

@MajeQafouri did you get a chance to test out the prerelease package?

Sorry for the late answer; unfortunately I'm packed these days and couldn't test the new package yet. BTW, thanks for the effort. As soon as I manage to test it, I'll keep you posted.

@devlnull

Hi, we could finally test the alpha package.
I don't know if it helps, but we tested it and compared it with the old package; the result was acceptable. No connection loss anymore.
Good luck.

@madelson
Owner

@devlnull excellent. Thanks for testing! Feel free to use the prerelease version for now and keep me posted on any issues you encounter. The only change is the ZooKeeperNetEx alternative.

If @MajeQafouri also comes back with a good test result I'll publish it as a stable version.

@MattGhafouri
Author

MattGhafouri commented Dec 23, 2022

Finally, I tested the new alpha version. Overall it's much better, with just a few connection loss exceptions.
The hosting environment:

client host env: hosted on k8s (3 pods)
zookeeper: hosted on k8s (1 pod)
result: occasionally facing connection loss exceptions.

The exception stack trace :

.ZookeeperLockProviderService[0]
---------Locking failed. Key:MyKeyName--.Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
   at org.apache.zookeeper.ZooKeeper.createAsync(String path, Byte[] data, List`1 acl, CreateMode createMode) in C:\dev\zookeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 542
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 72
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 64
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 27
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 65
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 127
   at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in /_/DistributedLock.ZooKeeper/ZooKeeperDistributedLock.cs:line 91
   at .ZookeeperLockProviderService.DoActionWithLockAsync[TParameter](String lockKey, TParameter parameter, Func`2 func) in /src/src/Infrastructure//ZookeeperLockProviderService.cs:line 113
.ZookeeperLockProviderService[0]

I hope it would be helpful.

@madelson
Owner

Thanks @MajeQafouri !

For these few remaining connection losses, do they also appear in the Zookeeper logs or are they still “phantom connection losses”? Does the application recover after this happens or no?

@MattGhafouri
Author

MattGhafouri commented Dec 26, 2022

In the ZooKeeper log we don't get exception logs, but on the app side we do, and the application seems to recover itself after the connection loss exception, actually after the SessionExpiredException.

Again I ran the app on k8s with 3 pods and the ZooKeeper service on one pod. I'm sending you the app logs and the ZooKeeper log; you can also check the acquire and release logs, which include exceptions. I hope it is helpful.

MyApp_Pod1_Logs.txt
MyApp_Pod2_Logs.txt
MyApp_Pod3_Logs.txt
logs-from-zookeeper.txt

@madelson
Owner

madelson commented Jan 1, 2023

@MajeQafouri I do see a disconnect/session expired log in the zookeeper log, though. Could that be related?

@MattGhafouri
Author

I hardly think so, but I couldn't figure out why we occasionally get this error in the ZooKeeper console: "Connection reset by peer"

@devlnull

devlnull commented Jan 2, 2023

Unfortunately we are getting Connection Loss sometimes, but it will be gone in a minute.

Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.

@madelson
Owner

madelson commented Jan 4, 2023

@devlnull are you seeing "connection reset by peer"/"connection expired" in the zookeeper logs like @MajeQafouri is?

@devlnull

devlnull commented Jan 4, 2023

@madelson Actually I don't have access to the ZooKeeper logs yet, but I will soon.
For now I have this stack trace info; it might help.

   at org.apache.zookeeper.ZooKeeper.createAsync(String path, Byte[] data, List`1 acl, CreateMode createMode) in C:\dev\zookeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 542
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 72
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 64
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 27
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 65
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 127
   at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in /_/DistributedLock.ZooKeeper/ZooKeeperDistributedLock.cs:line 91
   at Medallion.Threading.Internal.DistributedLockHelpers.ThrowTimeoutIfNull[T](ValueTask`1 task, String object) in /_/DistributedLock.Core/Internal/DistributedLockHelpers.cs:line 142
   at Medallion.Threading.Internal.Helpers.Convert[TDerived,TBase](ValueTask`1 task, ValueTaskConversion _) in /_/DistributedLock.Core/Internal/Helpers.cs:line 24

@madelson
Owner

madelson commented Jan 6, 2023

@devlnull @MajeQafouri I thought it might make sense for me to add some additional verbose logging to the underlying ZooKeeper .NET package; then you could test your apps with the additional logging and we can see if that helps point out the root cause of the issue. Would either/both of you be willing to test in this way?

@MattGhafouri
Author

@devlnull @MajeQafouri I thought it might make sense for me to add some additional verbose logging to the underlying ZooKeeper .NET package; then you could test your apps with the additional logging and we can see if that helps point out the root cause of the issue. Would either/both of you be willing to test in this way?

Good idea, I'll test it

@madelson
Owner

madelson commented Jan 7, 2023

@MajeQafouri ok I just published the version with logging. Here's how to use it:

In your csproj, add the debug version like so:

<PackageReference Include="Medallion.ZooKeeperNetEx" Version="1.0.0-debug001" />

Then you can link the library's logging to your logging system via the DebugLog class:

ZooKeeperNetEx.DebugLog.LogLevel = TraceLevel.Verbose; // to enable full verbose logging
ZooKeeperNetEx.DebugLog.LogMessageEmitted += message => yourLogger.Log(message);

Let me know if you have any trouble getting it working!

@MattGhafouri
Author

MattGhafouri commented Jan 9, 2023

@madelson I tried to use the debug package, and the log level was set as you said, but I'm not sure whether I set everything up properly; I'm still receiving the ConnectionLossException.

org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
   at org.apache.zookeeper.ZooKeeper.existsAsync(String path, Watcher watcher) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 745
   at Medallion.Threading.ZooKeeper.ZooKeeperSequentialPathHelper.GetNodeCreationTimeAsync(ZooKeeperConnection connection, String path) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSequentialPathHelper.cs:line 145
   at Medallion.Threading.ZooKeeper.ZooKeeperSequentialPathHelper.FilterAndSortAsync(String parentNode, IEnumerable`1 childrenNames, Func`2 getNodeCreationTimeAsync, String prefix, String alternatePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSequentialPathHelper.cs:line 59
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 76
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 132
   at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperDistributedLock.cs:line 97
   at RGS.Jackpot.Infrastrucure.Common.ZookeeperLockProviderService.DoActionWithLockAsync(String lockKey, Func`1 func) in /src/src/Infrastructure/RGS.Jackpot.Infrastrucure.Common/ZookeeperLockProviderService.cs:line 71

This is even with one instance, which means there is no concurrent request for the resource key.
Let me know if I'm doing something wrong.

@madelson
Owner

madelson commented Jan 9, 2023

@MajeQafouri sorry if I wasn't clear. The new build isn't attempting to fix the issue, it's trying to give us more useful logs to potentially diagnose what is going on.

ZooKeeperNetEx.DebugLog.LogMessageEmitted += message => yourLogger.Log(message);

This line pipes the zookeeper logs into whatever logging system you have. You can then grab those logs and investigate and/or post them here.

@MattGhafouri
Author

MattGhafouri commented Jan 9, 2023

@madelson I believe I did it; please check the logging setup I used:

builder.Logging.ClearProviders();
builder.Logging.SetMinimumLevel(LogLevel.Trace);

ZooKeeperNetEx.DebugLog.LogLevel = TraceLevel.Verbose; // to enable full verbose logging
builder.Host.UseNLog();

ZooKeeperNetEx.DebugLog.LogMessageEmitted += message => logger.Log(LogLevel.Trace, message);

Is it right?

@madelson
Owner

@MajeQafouri that code looks reasonable to me. Are you getting any logs from this? If so can you share?

@MattGhafouri
Author

No, that's the only type of exception and log detail which I receive.

@madelson
Owner

@MajeQafouri that's strange. I see that you're currently logging to your own logger using LogLevel.Trace. Is it possible you're not enabling trace logs?

Could you set a break point inside the LogMessageEmitted event handler and see if it is getting called?

@madelson
Owner

@MajeQafouri any luck getting the logs? I just did a quick test like this:

        ZooKeeperNetEx.DebugLog.LogLevel = System.Diagnostics.TraceLevel.Verbose;
        ZooKeeperNetEx.DebugLog.LogMessageEmitted += m => Console.WriteLine(m);
        var @lock = new ZooKeeperDistributedLock(Guid.NewGuid().ToString(), connectionString);
        await using (var handle = await @lock.AcquireAsync())
        {
        }

And I get logs like this:

Info ZooKeeper Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=20000 watcher=Medallion.Threading.ZooKeeper.ZooKeeperConnection+ConnectionWatcher sessionId=0 sessionPasswd=<hidden>
Verbose DynamicHostProvider Trying to resolve at least one IP from hosts:{127.0.0.1:2181}
Verbose DynamicHostProvider Resolving Hosts={127.0.0.1:2181}
Verbose DynamicHostProvider Resolved Hosts={127.0.0.1:2181} to {{127.0.0.1:2181}}
Info ClientCnxn Opening socket connection to server {127.0.0.1:2181}
...

@MattGhafouri
Author

Finally, I managed to hook up the logger, but at the Verbose log level there was no detail about the exception. I only got two types of log entries:

Verbose ClientCnxn Reading reply sessionid:0x1087668dd3f000e, packet:: clientPath: serverPath: finished:False header::org.apache.zookeeper.proto.RequestHeader replyHeader::org.apache.zookeeper.proto.ReplyHeader request::org.apache.zookeeper.proto.CreateRequest response::org.apache.zookeeper.proto.CreateResponse
Verbose ClientCnxn Reading reply sessionid:0x1087668dd3f000e, packet:: clientPath: serverPath: finished:False header::org.apache.zookeeper.proto.RequestHeader replyHeader::org.apache.zookeeper.proto.ReplyHeader request::org.apache.zookeeper.proto.CreateRequest response::org.apache.zookeeper.proto.CreateResponse

or

Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True

@madelson
Owner

@MajeQafouri actually this is what I'm looking for. Any chance you can upload a log file?

@MattGhafouri
Author

Got it, but I'm receiving just these two types of logs, no errors (I mean inside the LogMessageEmitted event). Inside the application, though, the locks sometimes cannot be acquired, and the failure rate is quite high compared to other distributed locks; I'm also using other distributed locks, like Redis.

@madelson
Owner

but I'm receiving just these two types of logs, not errors(

So do the logs you posted account for 100% of the unique lines? That would surprise me since my test generated some different log types. Note that I'm not just interested in error logs, I want to see if we can understand what is happening on your system.

inside the application sometimes the locks cannot be acquired

As an experiment, can you try something like this to see if it improves reliability? Again, I'm just trying to understand where the problem lies:

// use this routine to acquire your ZooKeeper locks instead of calling TryAcquireAsync() directly
public static async ValueTask<ZooKeeperDistributedLockHandle?> TryAcquireAsyncWithRetry(
    ZooKeeperDistributedLock @lock,
    TimeSpan? timeout = null,
    CancellationToken cancellationToken = default)
{
    var retriesRemaining = 3;
    while (true)
    {
        try
        {
            return await @lock.TryAcquireAsync(timeout, cancellationToken);
        }
        catch (org.apache.zookeeper.KeeperException.ConnectionLossException)
            when (retriesRemaining > 0)
        {
            // maybe issue a log here that we're retrying?
            --retriesRemaining;
        }
    }
}
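
For illustration, a call site for this helper could look like the sketch below (hypothetical; it reuses the lockKey, func, and _waitingForLockInSecond names from the LockService example earlier in this thread):

// Hypothetical usage of the retry helper above.
var @lock = new ZooKeeperDistributedLock(lockKey, _configuration.ConnectionString);

await using (var handle = await TryAcquireAsyncWithRetry(@lock, timeout: _waitingForLockInSecond))
{
    if (handle != null)
    {
        await func(); // lock held
    }
    else
    {
        // timed out waiting for the lock (as opposed to repeatedly losing the connection)
    }
}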

@MattGhafouri
Author

No problem, I'll follow that approach, but unfortunately I don't have access to the server for now. I'll keep you posted as soon as I apply the changes.

@madelson
Owner

@MajeQafouri / @devlnull any updates here? Any success with collecting logs or using retry?

@MattGhafouri
Author

Hey, I'm afraid not yet.

@dansuse

dansuse commented Jun 8, 2023

Hi @madelson, I am facing the same issue, even after using https://www.nuget.org/packages/DistributedLock.ZooKeeper/1.0.1-alpha001.
Is it possible to open source the code for the above NuGet package?
This is the stack trace from my application.
stack-trace

@madelson
Owner

madelson commented Jun 9, 2023

Is it possible to open source the code for above nuget package?

@dansuse yes this code is all available; I'd love to have more eyes on it/contributors since I don't have an environment where I can reproduce this.

The only change within DistributedLock itself needed to make the alpha package is swapping out the reference to ZooKeeperNetEx with my fork of that package:

-    <PackageReference Include="ZooKeeperNetEx" Version="3.4.12.4" />
+    <PackageReference Include="Medallion.ZooKeeperNetEx" Version="1.0.0-debug001" />

The changes I've made on my fork are on this branch. You can see the diff here. The only real change is in SocketContext.cs.

If you have a chance to play around and have any thoughts on how to further improve the fork, I'd love to collaborate on this.

One thing I saw in the ZooKeeper docs is:

You must test ZooKeeper server failures. The ZooKeeper service can survive failures as long as a majority of servers are active. The question to ask is: can your application handle it? In the real world a client's connection to ZooKeeper can break. (ZooKeeper server failures and network partitions are common reasons for connection loss.) The ZooKeeper client library takes care of recovering your connection and letting you know what happened, but you must make sure that you recover your state and any outstanding requests that failed. Find out if you got it right in the test lab, not in production - test with a ZooKeeper service made up of a several of servers and subject them to reboots.

I wonder if this implies that we should be catching this connection loss exception and just waiting for the client to recover (but for how long?). Without the ability to reproduce this easily in my environment it's been hard for me to validate. I also suspect there is more to it, since people didn't start facing issues with this until more recent versions of .NET, and it seems like people who are using the ZooKeeperNetEx package directly (vs. through DistributedLock) are also struggling with this...

@Jetski5822

@madelson we have just tested your change locally and in a K8s cluster, and that code change fixed the issue. Could you issue a PR for this change against the main repo?

@madelson
Owner

@Jetski5822 I filed shayhatsor/zookeeper#54 , but the maintainer has not been very active so I don't anticipate it being merged. If I were to publish a stable version of my Medallion.ZooKeeperNetEx fork and released a stable version of DistributedLock.ZooKeeper which used that would that be valuable to you?

@madelson
Owner

Has anyone having these issues tried the https://www.nuget.org/packages/ZooKeeper.Net package? Seems like another variant on ZooKeeperNetEx that was published a bit more recently.

@BoutemineOualid

Same issue here running zookeeper in a docker container, the alpha release seems to have fixed the issue.

@madelson
Owner

The Vostok.ZooKeeper.Client package was last published in 2022 and has a decent number of downloads. Has anyone tried it?

For context, while I can move forward with my patched fork of ZooKeeperNetEx, I'd love to find a higher-quality alternative to rely on going forward since I'm not in position to maintain a ZooKeeper client.

@Jetski5822

That uses ZooKeeperNetEx under the hood too :(

@madelson
Owner

I think to really solve this we need to be handling ConnectionLoss explicitly like they do here and as described here.

Perhaps with that fully in place we wouldn't even need the ZooKeeperNetEx patch.
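
To make "handling ConnectionLoss explicitly" concrete, here is a rough, illustrative sketch built on ZooKeeperNetEx's Watcher API (this is not what DistributedLock currently does; the ConnectionStateWatcher class and its methods are invented for the example). It tracks the session state reported to the client's watcher so a caller can wait for reconnection before retrying, instead of failing immediately.

using System;
using System.Threading.Tasks;
using org.apache.zookeeper;

// Illustrative only: tracks the session state reported to the ZooKeeper client's
// watcher so callers can wait for SyncConnected after a ConnectionLossException.
public sealed class ConnectionStateWatcher : Watcher
{
    private volatile TaskCompletionSource<bool> _connected =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public override Task process(WatchedEvent @event)
    {
        switch (@event.getState())
        {
            case Event.KeeperState.SyncConnected:
                _connected.TrySetResult(true); // connection (re)established
                break;
            case Event.KeeperState.Disconnected:
                // reset so that WaitForConnectionAsync blocks until we reconnect
                if (_connected.Task.IsCompleted)
                {
                    _connected = new(TaskCreationOptions.RunContinuationsAsynchronously);
                }
                break;
            case Event.KeeperState.Expired:
                // the session is gone for good; a new ZooKeeper client must be created
                _connected.TrySetException(new InvalidOperationException("ZooKeeper session expired"));
                break;
        }
        return Task.CompletedTask;
    }

    // Wait (up to maxWait) for the client to report SyncConnected before retrying an operation.
    public async Task WaitForConnectionAsync(TimeSpan maxWait)
    {
        var connectedTask = _connected.Task;
        var completed = await Task.WhenAny(connectedTask, Task.Delay(maxWait));
        if (completed != connectedTask)
        {
            throw new TimeoutException("ZooKeeper connection was not re-established in time");
        }
        await connectedTask; // propagates the session-expired exception, if any
    }
}

A retry wrapper like the TryAcquireAsyncWithRetry sketch earlier in the thread could then call WaitForConnectionAsync before each retry rather than retrying immediately, which also bounds the "but for how long?" question with an explicit maxWait.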

madelson added the zookeeper, bug, and dependencies labels and removed the dependencies label on Mar 17, 2024
@Jetski5822

@madelson Where I work, we have a custom ZK build which fixes this and also takes in some of the code you use to manage the ZooKeeper instance. Given our potential reliance on ZK, it might be worth us publishing it; I'll have a chat internally and see if that's possible, then you could rely on that.
