Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seeking views of administrators on proposal to expose audiobook keywords and promote exploration of our catalog via keyword hyperlinks #210

Open
peterjdann opened this issue Mar 19, 2024 · 8 comments · May be fixed by #225

Comments

@peterjdann
Copy link
Contributor

redrun has suggested to me that before creating any signifcant pull request, it would be a good idea if I were to put what I have in mind before administrators to get their views on whether the proposed change makes sense at all, or perhaps makes sense in principle, but requires modification. I think this is a most reasonable suggestion, but am not sure by what mechanism I should make any such approach.

I have attached a PDF file which I hope shows pretty clearly, at a GUI level at least, what I have in mind, with screenshots taken on my local development environment.

I'd be interested in the thoughts of all stakeholders. If nothing else, I've learned a hell of a lot by exploring this question. If this doesn't go anywhere, I've at least got that to hang onto!

Proposal to expose keyterms to catalog users 2024 03 19.pdf

@redrun45
Copy link
Collaborator

Personally, I like this a lot. As for consensus: discussion has just begun. Public discussion here, and admins will probably have an internal discussion as well.

On a purely technical note (else I'd post it in the linked forum thread), I think we could benefit from counting up the projects that use each keyword with a cron job, just like we do with authors and genres.
It would grow our database, but I think it would still be much faster than cross-referencing and counting projects on page load. Worth testing!

@peterjdann
Copy link
Contributor Author

peterjdann commented Mar 20, 2024 via email

@redrun45
Copy link
Collaborator

If you're writing your own 'locustfile' test specs, this might be of use. Apparently, you can use the same options as this module.
Without trying it myself:
.get('https://librivox.org/<your query>', verify=False)
... ought to help with that SSL verification.

@peterjdann
Copy link
Contributor Author

peterjdann commented Mar 21, 2024

Suggestion from redrun45 proved to be a lifesaver. Tbank you!
I have now run load tests with locust against three scenarios:

  • Baseline (main branch, untouched)
  • Version of code that calculated keyword usage dynamically at runtime
  • Version of code that reads usage statistics from keywords table that are presumed to be updated by a cron job, but which I have, for now, updated manually using the same SQL that would be used in any putative cron job

I have attached in a zip file reports from all three runs, together with the locust file I used for these tests.
In brief, the results show the following average response times on my local machine under the test conditions I set up:

  • Baseline (no code changes): 491 ms
  • Precalculated statistics: 484 ms
  • Dynamically calculated statistics: 816 ms

At face value, anyway, this suggests to me that we could display keywords to catalog users with no discernible degradation of server performance provided we kept the keyword usage statistics updated with a cron job, rather than calculating them each time on the fly.
I'm no expert in load testing. If the test conditions I've used here do not appear to be sufficiently rigorous, please let me know.
locust file and load test reports.zip

@notartom
Copy link
Member

That's a pretty massive hit for dynamic statistics. We could always go with the Cron job, but I'm curious what the SQL to calculate those was, and whether it can be made faster by adding foreign keys and/or indexes to our schema.

@peterjdann
Copy link
Contributor Author

peterjdann commented Mar 22, 2024

Good point.

I have used the following SQL for a dynamic count of the number of instances where a keyword is in use. As you can see, it performs a join involving project_keywords.keyword.id. This column, however, is not currently indexed in the database.

    SELECT keywords.id, keywords.value, sub.keyword_count
    FROM keywords
    JOIN project_keywords 
    ON keywords.id = project_keywords.keyword_id
    JOIN (SELECT keyword_id,
           COUNT(project_id) AS keyword_count
           FROM project_keywords
           GROUP BY 1
           ) sub
    ON keywords.id = sub.keyword_id
    WHERE project_keywords.project_id = ?
    ORDER BY sub.keyword_count DESC' 

It would certainly be a good idea to establish a foreign key relationship between project_keywords.keyword_id and keywords.id.

My first attempt to set one up was frustrated by the fact that there are 1910 instances where project_keywords.keyword_id = 0. I needed to delete these before I could set up a foreign key on this column. I then set up an index on. project_keywords.keyword_id.

Used the following SQL through these adventures:

SELECT keyword_id from project_keywords where keyword_id NOT IN (select id from keywords);

===

DELETE from project_keywords where keyword_id < 1;

===

ALTER TABLE project_keywords
ADD CONSTRAINT FOREIGN KEY (keyword_id) REFERENCES keywords(id)
ON DELETE CASCADE;

===

CREATE INDEX keyword_id
ON project_keywords(keyword_id);

After making these DB changes, I load tested my version of the code that counts keyword instances at runtime (ie, no cron job updating required) using the same load test I used previously, and this time got an average response time of 546 ms.

I have attached this load test report.
05 load test using dynamically calculated counts of keyword usage after indexing previously unindexed column in project_keywords table.html.zip

@peterjdann
Copy link
Contributor Author

After sleeping on it, realised that logically there should also be a foreign key relationship between project_keywords and projects table. Set this up:
ALTER TABLE project_keywords
ADD CONSTRAINT FOREIGN KEY (projects_id) REFERENCES projects(id)
ON DELETE CASCADE;

After making this further DB change, ran the same load test again, this time getting an average response time of 571 ms.
I have attached this load test report.
06 load test using dynamicallly calculated counts of keyworkd usage after additionaly settin up foreign key relationship between project_keywords and projects tables.html.zip

@redrun45 redrun45 removed the question label Mar 29, 2024
@redrun45
Copy link
Collaborator

We've kicked this back and forth and not had any objections, other than to changing the 'keyword' term to 'keyterm'. When the PR is ready, I'll be glad to test it, and we might also want this on the staging server for other admins. 😄

@redrun45 redrun45 linked a pull request Jul 2, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants