
top_fields.sql times out #54

Open
jmelot opened this issue Sep 17, 2024 · 4 comments
jmelot commented Sep 17, 2024

Or at least it did when I tried to run it ~2 weeks ago. The l2_candidates subquery seemed to be the culprit. I started looking into how it could be sped up but didn't solve it before I went OOO. Going ahead and opening an issue to give you guys a heads-up since it sounds like you'll be rerunning this soon (possibly before I have a chance to take a look again).

jmelot commented Sep 20, 2024

(Keeping notes on what I tried for when someone has a chance to dive into this)

Even just the first two joins of l2_candidates time out after 6h:

  select
    top_l1_fields.merged_id,
  from top_l1_fields
  # Get the L2 children of the top L1 fields; this is just a taxonomy lookup
  inner join fields_of_study_v2.field_hierarchy l2_children
    on trim(top_l1_fields.name) = trim(l2_children.display_name)
  # Also get top-L0 fields with their L1 children to restrict against
  inner join top_l0_fields
    on trim(top_l1_fields.name) = trim(top_l0_fields.l1_child_name)
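
One avenue worth trying (a sketch only, assuming the tables in the snippet above and that the per-row trim() calls in the join predicates are part of the problem): precompute the trimmed keys once in CTEs and deduplicate the taxonomy lookup before joining, so the joins run on plain columns.

```sql
-- Sketch: move trim() into CTEs and dedupe before joining.
-- Column names beyond those in the snippet above are assumptions.
with l1_keys as (
  select distinct trim(name) as name_key, merged_id
  from top_l1_fields
),
l2_lookup as (
  select distinct trim(display_name) as name_key
  from fields_of_study_v2.field_hierarchy
),
l0_children as (
  select distinct trim(l1_child_name) as name_key
  from top_l0_fields
)
select l1_keys.merged_id
from l1_keys
inner join l2_lookup using (name_key)
inner join l0_children using (name_key)
```

Whether this helps depends on how much the distinct steps shrink the inputs; if the fan-out is in the data rather than the trim() evaluation, it won't change much.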

@jamesdunham

One solution would be restricting scores earlier, e.g. keeping just the top k scores by level during inference, for some value of k that's higher than we think we'd ever really need. Say 10. This is what MAG used to do, and it would solve the efficiency problem for paper-level scoring without measurable impact on analysis.

But cluster-level averages should probably be calculated over papers beforehand, which would introduce some complexity into the inference pipeline.
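
A minimal sketch of what "keep just the top k scores by level" could look like in BigQuery, assuming a hypothetical field_scores table with merged_id, field_id, level, and score columns (none of these names are confirmed by the pipeline):

```sql
-- Sketch: keep the 10 highest-scoring fields per paper and level.
-- Table and column names here are assumptions for illustration.
select merged_id, field_id, level, score
from field_scores
qualify row_number() over (
  partition by merged_id, level
  order by score desc
) <= 10
```

Applying this during (or immediately after) inference would shrink the inputs to top_fields.sql before the expensive joins ever run.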

jmelot commented Sep 23, 2024

In the short term I like the idea of restricting scores earlier, especially if you all are going to rerun this pipeline soon.

@jamesdunham

Per in-person discussion today, we're re-running this pipeline soonish, assuming we can find efficiency gains.
