Load testing SMT and NMT in preparation for 50 projects #22
This issue is about verifying initial scalability in 2 parts:
8 hours per training, 24 per week per GPU - can handle 50 projects per week on 2 GPUs.
Two questions, @johnml1135:
As soon as I hear from you, I will commit and we can test and close this issue.
Great - let's see if we can test against the internal QA - now at https://qa-int.serval-api.org/swagger/index.html (only on VPN). We can monitor the CPU and memory load.
@johnml1135 OK, great! What kind of additional setup will I need to bombard that (besides being on the VPN and changing the URL)? Anything?
The right auth0 client - use
OK, sweet. I will paste those results when I get them. Unfortunately, the queue has been full for much of today, so it hasn't been convenient to get results. Hopefully, tomorrow morning 🤞. (Update: The queue is still full - waiting. Maybe we should consider routing tests to a higher-priority queue.)
Finally managed to get some results in: Fetching authorization token...
These are results of running the same script locally: Fetching authorization token...
Everything looks good except Mongo's response to the 30,000 docs and the 400 codes on word graph (which don't appear locally). I'd like to investigate more as to why it's failing. That really shouldn't be such an overwhelming number for MongoDB, so I wonder if it's still processing the added docs and maybe adding a sleep would help. I'll investigate. Any thoughts? @johnml1135
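(For reference, a rough sketch of what that wait could look like in a C# test harness using HttpClient - the status endpoint, retry count, and delay below are purely illustrative, not Serval's actual API:)

```csharp
// Illustrative sketch only: instead of a fixed sleep, poll a (hypothetical)
// status endpoint until the freshly added docs are visible, then start the
// timed bombardment.
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class LoadTestWarmup
{
    public static async Task WaitForIngestAsync(HttpClient client, string statusUrl, int maxAttempts = 30)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            var response = await client.GetAsync(statusUrl);
            if (response.IsSuccessStatusCode)
                return; // docs are queryable; safe to begin the load phase
            await Task.Delay(TimeSpan.FromSeconds(10)); // back off and retry
        }
        throw new TimeoutException($"{statusUrl} never became ready within the polling window.");
    }
}
```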
I'm 99% sure that the issue causing the timeouts after adding the files is that some requests get canceled at the end of the 60-second window of bombardment, and these are causing something akin to the cancellation subscription issue we'd seen earlier. Wasn't that issue addressed? Do we need to investigate further solutions?
On the 30,000 docs:
On the 601s
@Enkidu93 I'm trying to interpret the results. Are all of the requests to the get-all-engines endpoint timing out?
Right, so like I mentioned above (way up there ^), this is 50 concurrent connections at 10 requests per second. In other words, I imagine this is significantly more traffic than we should expect (but I figured I ought to push it). I can experiment and see what a sufficiently slow rate would be. One option is to do like I've seen elsewhere and, like you said, allow filtering on the 'get-all' endpoints as well as having a max-number-of-results parameter that we default to a 'safe' value. As for the 601 400s, I'm really unsure. Given that I can't recreate the problem locally, I'm probably going to need to dig in the logs on the QA server (what's the best way to do that?).
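(A minimal sketch of what such a capped 'get-all' could look like in an ASP.NET Core controller - all names, routes, and defaults below are made up for illustration, not the actual Serval code:)

```csharp
// Sketch only (hypothetical names): cap how much a single "get all" request can
// return, so one call can never pull tens of thousands of engines from Mongo.
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.AspNetCore.Mvc;

public record EngineDto(string Id, string Name);

[ApiController]
[Route("translation/engines")]
public class EnginesController : ControllerBase
{
    // Stand-in for the MongoDB-backed repository.
    private static readonly List<EngineDto> Engines = new();

    [HttpGet]
    public ActionResult<IEnumerable<EngineDto>> GetAll(
        [FromQuery] int offset = 0,
        [FromQuery] int limit = 100) // "safe" default; clients page for more
    {
        limit = Math.Clamp(limit, 1, 1000); // hard ceiling on any single response
        return Ok(Engines.Skip(Math.Max(offset, 0)).Take(limit));
    }
}
```

The same offset/limit pair maps straight onto a Mongo skip/limit query, which also bounds how much the driver has to deserialize per request.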
Yes, that's right. They're all timing out.
Should we spin off separate issues for addressing these problems and close this one (or leave it open)?
400s on word-graph are the result of running out of memory => being unable to build more engines:
This looks like it is running out of disk space, not memory.
About 30 seconds. (Note: this is with ~20,000 engines.) I can try again after deleting them. Getting a single engine takes a little more than half a second.
OK, I'll investigate further, but it looks like that hadn't happened on the internal QA (given that, at one point, there were nearly no engines while there were hundreds of thousands of pretranslations). I'll verify, though, before opening an issue.
Something fishy is definitely going on, and I think it's a user error. A couple of days ago, I did manage to run the load testing script without issue, and the numbers that came back were good. I wanted to tweak one of the values, came back to it, and now this. It's almost as though there's a delay between my posting engines and them actually being created. I'll delete them all again and try once more.
It's still intermittent for me, and I haven't found a way to successfully run my test and paste results here because of the timeout issues. Is it possible I'm triggering some kind of security protocol meant to protect the server from a DoS attack or something? Any ideas? @johnml1135
If it takes 30 seconds to respond to a single get-all-engines request with 20,000 engines, I would like to understand what is taking so much time. Is it the Mongo query, deserializing the query results into model objects, serializing the results into JSON, etc.? @Enkidu93, you said this does not seem to happen on your dev machine, so using a profiler might not be an option. Maybe we could add some debug log entries that benchmark different parts of the request? At the very least, we should be able to see how much time it takes ASP.NET Core to handle the request using the existing logs. That should tell us whether it is the actual Serval service that is slow or something external, such as the reverse proxy.
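(One cheap way to get those numbers without a profiler is a couple of stopwatch-scoped debug log entries around the suspect phases - a sketch with made-up names, not the actual Serval code:)

```csharp
// Sketch only: time the Mongo query and the JSON serialization separately so
// the debug logs show where the 30 seconds actually go.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public class GetAllEnginesBenchmark
{
    private readonly ILogger _logger;

    public GetAllEnginesBenchmark(ILogger logger) => _logger = logger;

    public async Task<string> RunAsync(
        Func<Task<IReadOnlyList<object>>> queryEngines,   // Mongo query + model deserialization
        Func<IReadOnlyList<object>, string> serialize)    // model -> JSON
    {
        var sw = Stopwatch.StartNew();
        var engines = await queryEngines();
        _logger.LogDebug("Query returned {Count} engines in {Elapsed} ms", engines.Count, sw.ElapsedMilliseconds);

        sw.Restart();
        var json = serialize(engines);
        _logger.LogDebug("Serialized response in {Elapsed} ms", sw.ElapsedMilliseconds);
        return json;
    }
}
```

ASP.NET Core's built-in request logging already records the total handling time per request, so comparing that number against what the client sees would separate Serval from the reverse proxy.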
Yes - or is it just an issue on my end? We don't know. That's correct, @ddaspit, I can't recreate the issue locally. @johnml1135, I can follow the instructions from the README to redeploy the internal QA with modified Serval-Machine code for debugging this, right?
@Enkidu93 - yes you may (that is what it is for). Another way to try to reproduce it, without having to create docker images and deploy (at least a 10-minute cycle), is to use CPU limits in docker compose: https://stackoverflow.com/questions/42345235/how-to-specify-memory-cpu-limit-in-docker-compose-version-3.
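(Roughly what that looks like in a version-3 compose file - the service names and numbers below are placeholders; note that the older docker-compose CLI only honors these deploy limits with --compatibility, while newer docker compose applies them directly:)

```yaml
# Illustrative values only - the real service names and limits belong in the repo's compose file.
services:
  serval-api:
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 128M
  mongo:
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 128M
```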
Good idea! I can now recreate the issue locally. I'll be debugging more tomorrow.
The settings are in the k8s yaml files. Did you also limit the mongo database to the same limits?
(In reply to Eli C. Lowry, Tue, Sep 19, 2023, 8:34 PM: "Good idea! I tried limiting each service to 0.5 CPUs and 128MB and wasn't able to recreate the problem locally. Do you know off-hand what the limits are on the int-qa?")
Sure, OK. Thank you! I did. I am now able to recreate the issue regardless (I edited my comment). Working on debugging now.
Are we only allocating 128MB of memory to the MongoDB instance?
I believe it's set up with a 1500MB limit. I'm investigating this as we speak.
I tried timing different elements of the logic, and the majority of the time is spent outside the visible code. Serval and MongoDB were maxing out CPU usage. I tried bumping up the allotment without too much luck. I did notice something odd, though: Serval seems to use a huge amount of memory that scales with the amount I afford it (i.e., if I give it 1000MB, it'll use 90% of it after creating all the engines; same with 2000MB). I'm not sure what's going on there - thoughts?
Is there some way that Serval might be keeping all of the engines in memory even when not querying?
So, I can get it to work by increasing the memory and CPU allotment for Serval and Mongo. I'm now experimenting with how tightly I can bound those. Is there a reason we can't afford them more resources? What's our actual limit? I don't seem to have access to viewing information about our nodes' specs, etc.
We have 4 CPUs for Mongo and Serval - for all engines. We can likely increase the memory if needed, but increasing the CPU would be more $$$. Let's see if we can get acceptable behavior without increasing the levels first.
65% and 15% CPU usage?
Yes - continuous for weeks. There is nothing in the logs that would indicate what is happening.
So we should pursue paging? @johnml1135
Here is some info on allocating sufficient CPU and memory to Mongo.
I'm checking if we can give more CPUs to Mongo...
Working with @g3mackay on allocating CPUs more dynamically - https://stackoverflow.com/questions/52487333/how-to-assign-a-namespace-to-certain-nodes. We should be able to ensure that Mongo gets up to (or more than) 2 CPUs without starving the Serval API.
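(In the k8s yaml, the usual way to reserve that share for Mongo while still leaving headroom for the API is a CPU request plus a higher limit - the numbers below are illustrative, not the actual manifests:)

```yaml
# Illustrative only: the request reserves capacity for Mongo on the node;
# the limit caps how far it can burst, leaving headroom for the Serval API.
containers:
  - name: mongo
    resources:
      requests:
        cpu: "2"
        memory: 2Gi
      limits:
        cpu: "3"
        memory: 3Gi
```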
What has yet to be done in this issue? Where do we go from here, @johnml1135?
Please add the scripts to the Serval repo so we can refer to them in the future as needed.
Addressed here - at least for the time being.
Here are some ways to increase the number of simultaneous users on Machine: