not enough room in chunk cache - return 503 #332
And the strange thing is that this error happens early, even if I stop the server with
The 503 is generated when the chunk cache is full of dirty chunks and there's no room for the incoming change. Normally this should just be a temporary state while the dirty chunks get flushed to storage.
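A minimal sketch of the admission logic described above, purely for illustration; the names and the cache limit are assumptions, not HSDS's actual code:

```python
# Illustrative sketch of the behavior described above, not HSDS's actual code.
CACHE_LIMIT = 128 * 1024 * 1024   # assumed chunk cache size in bytes


class ChunkCache:
    def __init__(self):
        self.dirty_bytes = 0       # bytes written but not yet flushed to storage

    def try_write(self, chunk_size: int) -> int:
        """Return an HTTP status for an incoming chunk write."""
        if self.dirty_bytes + chunk_size > CACHE_LIMIT:
            return 503             # no room until the background flush catches up
        self.dirty_bytes += chunk_size
        return 201

    def on_flushed(self, flushed_bytes: int) -> None:
        """Called after the background sync task writes chunks to storage."""
        self.dirty_bytes = max(0, self.dirty_bytes - flushed_bytes)
```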
Hm, actually I've changed s3_sync_interval to some large number, as I assumed it had something to do with AWS S3, which I don't use, and apparently that was incorrect. That explains why the errors started to happen early. I changed it because the s3_sync messages were spamming the logs (from my perspective) when I was originally trying to figure out what the problem was. So am I right that this error can happen if hsload uploads the data faster than the server can write it to disk? I've changed the interval back to 1 second and increased the chunk cache size to 256M; let's see if I'll be able to upload the file now.
Sorry about the confusion with the config names -- it's a bit of a historical artifact: HSDS was first developed with only AWS S3 for storage. Later, support was added for Azure blobs and Posix. By that point, it was too late to go back and rename everything. Yes, you'll get the error if data is coming in faster than it can be written (once the memory buffer is used up). In h5pyd, there's built-in backoff logic to resend requests when a 503 is received.
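For reference, the general shape of that retry-with-backoff pattern looks something like the sketch below; this is not h5pyd's actual implementation, and the delay values are made up:

```python
import random
import time


def send_with_retries(send_request, retries=10, base_delay=0.5):
    """Call send_request() until it returns a non-503 status or retries run out."""
    for attempt in range(retries):
        status = send_request()
        if status != 503:
            return status
        # exponential backoff with a little jitter before the next attempt
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("request failed: too many 503 responses")
```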
Yeah, I totally understand the historic reasons :) Maybe the corresponding comment (https://github.com/HDFGroup/hsds/blob/master/admin/config/config.yml#L36) in the config file could be updated to expand on that a bit. Meanwhile, I still have a problem with too many 503s, but now for different reasons. Here is my setup and what I get:
So the data node went down from 20:46:12 to 20:48:22.
So this list_keys procedure brought the node down for 2.5 minutes, which hsload wasn't able to deal with. Is that expected behavior? Or did I misconfigure something? Shouldn't multiple data nodes resolve this problem? I believe I still had it with 4 nodes, but maybe that was a different problem. I'm now trying to run hsload with retries=20. If I understand correctly that the delay between retries grows exponentially, that should work around this issue.
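As a rough sanity check (assuming the delay doubles from a 1-second base, which is a guess rather than h5pyd's documented behavior), exponential backoff covers a 2.5-minute outage well before 20 retries are used up:

```python
# Back-of-the-envelope arithmetic; the 1-second base delay is an assumption.
base, total = 1.0, 0.0
for attempt in range(20):
    total += base * (2 ** attempt)
    if total >= 150:  # ~2.5 minutes of data-node downtime
        print(f"covered after {attempt + 1} retries ({total:.0f} s of waiting)")
        break
# prints: covered after 8 retries (255 s of waiting)
```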
No luck, but now it's a slightly different problem again. I'm now using the following command
I guess the retries number is not converted to an int somewhere?
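Purely as a hypothetical illustration of the kind of bug being described (the actual traceback and code aren't shown here), the symptom appears when a numeric command-line option is kept as a string:

```python
# Hypothetical illustration only, not h5pyd's actual code.
retries = "20"                      # value taken straight from the command line
try:
    for attempt in range(retries):  # TypeError: 'str' object cannot be interpreted as an integer
        pass
except TypeError:
    retries = int(retries)          # the missing cast fixes it
```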
After fixing the conversion of retries to int, I was finally able to upload the file. I didn't open a PR as I'm not sure it's the appropriate way to do this, but here is my patch.
Still, I don't really like how the history of the server status looks. Here is an excerpt from the logs (I grepped for "update_dn_info" in the SN node logs):
Basically, after some time, data nodes constantly go down because of this list_keys operation (I'm not sure what triggers it or what its purpose is). So even if I have multiple nodes, at some point all of them end up being down at the same time. I would be thankful for some guidance on how to configure the server better, if that's possible.
So if you do a
Anyway, I'll try loading the file myself and see if I hit similar issues.
@jreadey No, the DN containers aren't fully failing and continue to work. However, when this list_keys operation is happening they don't process other requests, so after about 20 seconds the SN marks them as inactive. After list_keys finishes (which takes a couple of minutes) they report back to the SN and it marks them as active again. The problem seems to stem from the listKeys operation, which needs to walk over 93K keys and takes a significant amount of time. This number of keys is very similar to the number of chunks reported by
OK, I think I understand what's going on now. The SN container(s) poll the DNs periodically to make sure they are still alive. If a DN gets bogged down in a request that takes a long time to process (without any await), the SN will assume the DN is down and not send any more requests until the health check comes back. The DNs use asyncio, so if, say, there's a bit of a wait to read or write data, that won't interfere with the SN health checks. With S3 storage this all goes as planned, since all the S3 calls are asynchronous. For Posix storage we use aiofiles, which has async calls for reading and writing, but not for listing files (see Tinche/aiofiles#160). So with a large number of files to list, list_dir takes too long and misses the health check. Ironically, list_dir is not critical for hsload. It gets called by scanRoot, a background task whose purpose is to tabulate the total storage used by a domain (which you'll see if you run hsstat <domain_path>). I have a patch which should fix this. I'll give it a try and report back.
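One common asyncio pattern for keeping the event loop responsive during a blocking listing is to push the scan onto a worker thread; this is just an illustration of the general technique, not the actual HSDS patch:

```python
import asyncio
import os


async def list_keys(root: str) -> list:
    """List directory entries without starving other tasks on the event loop."""
    loop = asyncio.get_running_loop()
    # os.listdir blocks; run_in_executor moves it to a thread so the periodic
    # health-check coroutine keeps running on schedule.
    return await loop.run_in_executor(None, os.listdir, root)
```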
Great! Thanks for looking into this!
My load failed about a third of the way through with a "retries exceeded" message. I'll try another run, update the default retries, and increase the time before a node is considered inactive (the head_sleep_time and node_sleep_time configs).
Second attempt worked - no errors with
These change the time between health checks from 10 seconds to 60. @assaron - let me know if you are able to do an hsload with this setup.
@jreadey thanks for the updates! I've checked the new version, and the number of inactive nodes definitely decreased, even with the default server settings. It still failed on my first attempt with the default number of retries, but the second one with
Though I don't understand why the DN nodes still go down at all. I'd have expected that your patch adding breaks to ListKeys would have fixed that completely.
I'm not sure why we are still seeing an occasional inactive node. I suspect there is still somewhere in the code that doesn't let other tasks run for a significant time. I'll keep an eye out for any pieces of code that seem suspicious.
@jreadey these downtimes still seem to be associated with the list keys operation. Do I get this right that the DN nodes are supposed to do a healthCheck every 10 seconds, in which they tell the head node that everything's OK, and otherwise the head node can declare the DN node no longer active? Because I see this in the DN log (I'm only showing the healthCheck and listKeys lines):
So
Here is the part of the SN node log with updates on node statuses for the same period:
I don't know how to check it properly, but I assume the DN node whose log I showed above is this dn-5fbd7. This looks suspicious to me. I even double-checked that I'm running the Docker image I built myself from the repo, and it seemed fine.
The way asyncio works in Python, until a task does an "await", no other task can run. The DN containers send a periodic health check to let the head container know that they haven't crashed or gone away. My update put in some awaits, which I thought would solve the issue. I'll take another look. Until then, you can use the head_sleep_time/node_sleep_time config to allow more time between health check runs. Probably 120 is a better setting than 60, judging from the logs.
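A tiny demonstration of that point, separate from the HSDS code itself: a coroutine that never awaits holds the event loop, so the heartbeat task doesn't get a turn until it finishes.

```python
import asyncio
import time


async def heartbeat():
    for _ in range(3):
        print("heartbeat", time.strftime("%H:%M:%S"))
        await asyncio.sleep(1)


async def long_listing():
    time.sleep(3)        # blocking call with no await: nothing else can run
    print("listing done")


async def main():
    # the listing task is scheduled first and blocks the loop for 3 seconds,
    # so no heartbeat is printed until it finishes
    await asyncio.gather(long_listing(), heartbeat())


asyncio.run(main())
```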
Just to note, I used the default config and didn't change the sleep time from 10 to 60 seconds as you suggested. Actually, when I think about it, I find that to be a way of just hiding the problem: the node won't be declared inactive, but it still won't respond to any requests. Or am I wrong here? Anyway, I've tried to look a bit closer at how this sleep works and added the following debug output code:
To my surprise, here are the DN logs that I got:
All of these sleeps happened before the file stat commands were executed, instead of in between! I don't think that's the expected behavior...
I guess these sleeps should be done in the next cycle actually. Let me try to do that. |
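For what it's worth, the interleaving described above would look roughly like the sketch below: yield to the event loop inside the loop, between batches of stat calls, rather than sleeping up front. Illustrative only, not the actual HSDS code.

```python
import asyncio
import os


async def stat_keys(root: str, keys: list) -> dict:
    """Stat each key, yielding to the event loop periodically so other tasks run."""
    sizes = {}
    for i, key in enumerate(keys):
        sizes[key] = os.stat(os.path.join(root, key)).st_size
        if i % 100 == 0:
            await asyncio.sleep(0)   # let the health check (and other tasks) run
    return sizes
```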
Thanks for the PR. Change is merged into master. |
Hi, I'm continuing to try to upload the large file from #329, and I still get issues with the upload failing because of too many 503 errors.
Currently I have a situation where the upload fails very soon after starting, I guess because the server got into some bad state. Here is the relevant portion of the hsload log:
It looks like it fails when trying to submit even the first chunk.
When I check the datanode container, I see the following errors in the log:
What's going on here? Is this expected behavior? To me it doesn't seem intuitive that a client request should fail when some internal cache gets saturated. It doesn't look like any flushing of the cache is initiated, either.