Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initialize processor on server boot #2059

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

vvmruder
Copy link
Collaborator

@vvmruder vvmruder commented Oct 28, 2024

Fixes #2056

As discussed this PR changes pyramid_oereb behaviour. It loads Processor and therefore all related theme configuration on server boot time. Before the theme configuration was initialized on every process. This should provide a recognizable gain in performance.

pyramid_oereb will fail to start if Processor could not be initialized for any reason.

TODO:

  • fix tests
  • test in example infrastructures
    • BS ✅
    • BL
    • NE ✅
    • TI ✅
    • BE ✅

@vvmruder
Copy link
Collaborator Author

@michmuel @voisardf @svamaa : I think this is worth a try now in your infra. Please let me know if you need anything.

@vvmruder
Copy link
Collaborator Author

I did a quick manual check. On the master branch the test request lasts around 1.7seconds on my laptop. While this branch delivers the same request in 0.24seconds. A nice gain I think.

@lopo977
Copy link
Collaborator

lopo977 commented Oct 29, 2024

When running a local test on my workstation, your implementation of the singleton pattern processes the JSON request (CH452802850772) in 4.37 seconds, down from 6.26 seconds. Looks good to me.

@vvmruder
Copy link
Collaborator Author

When running a local test on my workstation, your implementation of the singleton pattern processes the JSON request (CH452802850772) in 4.37 seconds, down from 6.26 seconds. Looks good to me.

@lopo977 Thank you for testing. Did you run the request multiple times? For me, the first was a bit longer, the following were much faster than. Can you confirm that?

@lopo977
Copy link
Collaborator

lopo977 commented Oct 29, 2024

When running a local test on my workstation, your implementation of the singleton pattern processes the JSON request (CH452802850772) in 4.37 seconds, down from 6.26 seconds. Looks good to me.

@lopo977 Thank you for testing. Did you run the request multiple times? For me, the first was a bit longer, the following were much faster than. Can you confirm that?

I ran the same request multiple times. The first was longer also for me. I confirm.

Copy link
Collaborator

@svamaa svamaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested this branch within our local dev environment. Overall I discovered a 9% improvment over 3 different parcels with each 20 requests. However, the improvement is significantly smaller, the larger the parcel and the associated data get. This makes sense.

However, I discovered in general a large variety in response time (this happens also on the master branch).
Even within a small parcel the response time varies up to 2 seconds between the same request. Does anyone else also finds this behaviour on their systems? I did not connect a db on my machine but on our intranet. The network should actually not be a large factor here.

@voisardf
Copy link
Collaborator

voisardf commented Nov 1, 2024

I also am under the impression our extract creation times do vary for the same pdf even between the 2nd and following extracts. I plan to do some more structured testing next week.

@peterschaer
Copy link
Collaborator

I finally found time to do some tests:

EGRID Description No. of DB-Queries master No. of DB-Queries PR Processing time master Processing time PR
CH523822463517 parcel with only one PLR 198 157 4.8s 0.7s
CH743546874207 parcel with two topics 208 167 4.9s 1.2s
CH762046353030 parcel with six topics 261 220 9.1s 4.7s

I made all my tests on my development machine (Win10, no Docker, DB remote (but staging DB, so not a lot of load to be expected)). All values were taken from the Pyramid Debug Toolbar. All requested extracts were XML extracts.

All in all a very significant gain!

@michmuel
Copy link
Collaborator

I compared the branch of this PR with the master branch on our dev instance. The request of a XML extract is faster using the branch of this PR. The gain is in the order as outlined by @peterschaer. That's great!

We use the oereblex source for many oereb topics. In our extended oereblex sources we handle the indexing of the legal provisions which are retrieved for the relevant public law restrictions at each request from oereblex. The index is required for the extract and shall be in the range -1000 to 1000. At least in our instance of oereblex, legal provisions do not have an index (only laws have). When all the sources are initialised only at server start up, we will have to change the reset of this index. That should be faesible.

Just important to note that request related information attached to sources may have to be handled differently with this changes.

@svamaa
Copy link
Collaborator

svamaa commented Nov 19, 2024

Thank you all for testing!
@vvmruder, I think you can start implementing the testing.

@michmuel
Copy link
Collaborator

With the new approach, oereblex links are cached from server start-up (

identifier = '{}{}{}'.format(geolink, law_status.code, params.language)
). This means that changes in oereblex for links that have been invoked by pyramid oereb will not take effect in the app till server restart.

Shall we change the caching behavior? Any opinions? (e.g. @jwkaltz?)

@jwkaltz
Copy link
Member

jwkaltz commented Nov 22, 2024

With the new approach, oereblex links are cached from server start-up (

identifier = '{}{}{}'.format(geolink, law_status.code, params.language)

). This means that changes in oereblex for links that have been invoked by pyramid oereb will not take effect in the app till server restart.
Shall we change the caching behavior? Any opinions? (e.g. @jwkaltz?)

@michmuel We already have a caching of oereblex links (see #993), this fills the links as needed. I don't think we should cache all links on server start-up.
Maybe I misunderstand?

@michmuel
Copy link
Collaborator

With the new approach, oereblex links are cached from server start-up (

identifier = '{}{}{}'.format(geolink, law_status.code, params.language)

). This means that changes in oereblex for links that have been invoked by pyramid oereb will not take effect in the app till server restart.
Shall we change the caching behavior? Any opinions? (e.g. @jwkaltz?)

@michmuel We already have a caching of oereblex links (see #993), this fills the links as needed. I don't think we should cache all links on server start-up. Maybe I misunderstand?

Let me know if I do not understand correctly, but I think, today, it works like this:

  1. Server start
  2. Request for parcel XY -> get public law restrictions -> download corresponding oereblex xmls (different plrs may have the same lexlinks but every lexlink is downloaded only once due to caching)
  3. Request for parcel XY again -> get public law restrictions -> download corresponding oereblex xmls (different plrs may have the same lexlinks but every lexlink is downloaded only once due to caching)

Changes / corrections in Oereblex are active at least for the next request.

With this pull request the behavior changes as the oereblex classes are initialised only on server start:

  1. Server start
  2. Request for parcel XY -> get public law restrictions -> download corresponding oereblex xmls (different plrs may have the same lexlinks but every lexlink is downloaded only once due to caching)
  3. Request for parcel XY again -> get public law restrictions -> no lexlink is downloaded because all lexlinks are cached

Changes / corrections in Oereblex are active immediately for lexlinks that have not been requested since server start and after next server start for lexlinks that have been requested since server start.

@vvmruder
Copy link
Collaborator Author

@michmuel You were right with your assumptions.

It took me a while to follow the path and find a maybe nice solution without changing too much in the code. However, I would be happy if you could check this out first on some real world data. Is that possible?

@michmuel
Copy link
Collaborator

@michmuel You were right with your assumptions.

It took me a while to follow the path and find a maybe nice solution without changing too much in the code. However, I would be happy if you could check this out first on some real world data. Is that possible?

Thanks @vvmruder

Looking into the logs I found that a specific geolink is retrieved once for every request. If more than one plr with the same geolink is found for a real estate, the geolink is not retrieved again.

However, is it robust in an uwsgi-setting with many threads? Can you provide some reasoning?

@vvmruder
Copy link
Collaborator Author

@michmuel You made a valid point. I fixed it now. You may have a look if that is how you expected it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of data collection
7 participants