
top_fields.sql times out #54

Open
jmelot opened this issue Sep 17, 2024 · 4 comments
jmelot commented Sep 17, 2024

Or at least it did when I tried to run it ~2 weeks ago. The l2_candidates subquery seemed to be the culprit. I started looking into how it could be sped up but didn't solve it before I went OOO. Going ahead and opening an issue to give you guys a heads-up since it sounds like you'll be rerunning this soon (possibly before I have a chance to take a look again).

jmelot commented Sep 20, 2024

(Keeping notes on what I tried for when someone has a chance to dive into this)

Even just the first two joins of l2_candidates time out after 6h:

  select
    top_l1_fields.merged_id,
  from top_l1_fields
  # Get the L2 children of the top L1 fields; this is just a taxonomy lookup
  inner join fields_of_study_v2.field_hierarchy l2_children
    on trim(top_l1_fields.name) = trim(l2_children.display_name)
  # Also get top-L0 fields with their L1 children to restrict against
  inner join top_l0_fields
    on trim(top_l1_fields.name) = trim(top_l0_fields.l1_child_name)
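
One avenue worth trying (a sketch only, assuming the tables in the snippet above and that the per-row trim() calls in the join predicates are part of the problem): precompute the trimmed keys once in CTEs and deduplicate the taxonomy lookup before joining, so the joins run on plain columns.

```sql
-- Sketch: move trim() into CTEs and dedupe before joining.
-- Column names beyond those in the snippet above are assumptions.
with l1_keys as (
  select distinct trim(name) as name_key, merged_id
  from top_l1_fields
),
l2_lookup as (
  select distinct trim(display_name) as name_key
  from fields_of_study_v2.field_hierarchy
),
l0_children as (
  select distinct trim(l1_child_name) as name_key
  from top_l0_fields
)
select l1_keys.merged_id
from l1_keys
inner join l2_lookup using (name_key)
inner join l0_children using (name_key)
```

Whether this helps depends on how much the distinct steps shrink the inputs; if the fan-out is in the data rather than the trim() evaluation, it won't change much.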

@jamesdunham

One solution would be restricting scores earlier, e.g. keeping just the top k scores by level during inference, for some value of k that's higher than we think we'd ever really need. Say 10. This is what MAG used to do, and it would solve the efficiency problem for paper-level scoring without measurable impact on analysis.

But cluster-level averages should probably be calculated over papers beforehand, which would introduce some complexity into the inference pipeline.
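
A minimal sketch of what "keep just the top k scores by level" could look like in BigQuery, assuming a hypothetical field_scores table with merged_id, field_id, level, and score columns (none of these names are confirmed by the pipeline):

```sql
-- Sketch: keep the 10 highest-scoring fields per paper and level.
-- Table and column names here are assumptions for illustration.
select merged_id, field_id, level, score
from field_scores
qualify row_number() over (
  partition by merged_id, level
  order by score desc
) <= 10
```

Applying this during (or immediately after) inference would shrink the inputs to top_fields.sql before the expensive joins ever run.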

jmelot commented Sep 23, 2024

In the short term I like the idea of restricting scores earlier, especially if you all are going to rerun this pipeline soon.

@jamesdunham

Per in-person discussion today, we're re-running this pipeline soonish, assuming we can find efficiency gains.
