Apply some needed efficiency savings to pycbc_bin_trigger_rates_dq #4519

Merged

Conversation

GarethCabournDavies
Contributor

Using this meant that I was able to actually complete a run of this code using realistic analysis files!

Basically, use the mechanism described in #4510 to pre-mask the triggers before loading, and move some calculations out of the loops where they were repeated.

Some increased logging as well.

n_triggers_orig = trig_file[f'{ifo}/snr'].size
logging.info("Trigger file has %d triggers", n_triggers_orig)
logging.info('Generating trigger mask')
idx, _ = trig_file.select(
    lambda snr: snr > args.stat_threshold,
    f'{ifo}/snr',
    return_indices=True
)
Contributor

Can we make an action here: it would be really good if we could do a select on the ranking, instead of needing to use SNR as a proxy.

This https://github.com/gwastro/pycbc/blob/master/pycbc/io/hdf.py#L520 is another place where one just wants a set of indices of triggers with loud rankings. A second part of this is that the unused _ return here is using more memory than anything else. We may want a select method that only returns idx and never computes the rest.
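
A minimal sketch of such an indices-only select, assuming an h5py-backed group and chunked reads; the helper name and chunking scheme are illustrative, not the PyCBC API:

import numpy as np

def select_indices(h5group, fcn, *dataset_names, chunksize=2**20):
    # Return only the indices where fcn(*chunks) is True, reading the
    # datasets in chunks so no full filtered array is ever held in memory.
    size = h5group[dataset_names[0]].size
    found = []
    for start in range(0, size, chunksize):
        stop = min(start + chunksize, size)
        chunks = [h5group[name][start:stop] for name in dataset_names]
        found.append(np.flatnonzero(fcn(*chunks)) + start)
    return np.concatenate(found) if found else np.zeros(0, dtype=np.int64)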

Contributor Author

I have changed to use ranking straight away - there's a bit of jiggery-pokery to get the trigs into the right format, but it seems okay at the moment.

There is a bit of inefficiency with unused datasets if they aren't part of the ranking, but this is done in a couple of minutes anyway, and is nothing compared to the background_bins.

I've also updated the select() function to allow an indices_only return as suggested.
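
For illustration, a hypothetical call using that option (the keyword name follows the comment above; the exact signature in the updated select() may differ):

idx = trig_file.select(
    lambda snr: snr > args.stat_threshold,
    f'{ifo}/snr',
    indices_only=True,
)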

Contributor

@spxiwh seems like you want an issue for future development opened?

Contributor

Oh I see, Gareth already set something up.

@spxiwh
Contributor

spxiwh commented Oct 5, 2023

Can you also move the trigger file parsing above the background bin computation? Otherwise it takes an hour to compute tons of SEOBNRv5 durations, and then the job gets evicted when the memory usage spikes. Do the high-memory-using thing first.

@spxiwh
Contributor

spxiwh commented Oct 6, 2023

Thanks for taking the next steps with this, but for this PR, could we bring this back to just doing this by SNR (and moving the trigger file reading up)? That change has been tested and should be good to go.

The other parts to this (having a select method that only returns indices, and being able to run select with the ranking only) affect more codes than just this one, so I'd prefer to pull them off into a separate (and broader) PR (with less urgency). I'd also like to consider this a bit further: if we have ranking = snr (for instance), this code would be 4x slower than it needs to be while it reads a bunch of arrays that it won't use. I'd also like the "mask_to_loudest_n_..." method to use this functionality.

@GarethCabournDavies
Contributor Author

I have de-scoped and added notes in #4524

@GarethCabournDavies
Contributor Author

I ran a comparison of the output file against the one from the workflow (every attribute, every dataset; values and dtypes).

The files do differ slightly after this change, as the 'binX/rates' datasets are float64 in the old run, and float32 in the new test.

I think this is due to this line not specifying the exact float dtype, and the two being run on different hardware.

The datasets are numerically equal to one another though
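
For reference, a minimal sketch of that kind of comparison, with hypothetical file names:

import h5py
import numpy as np

with h5py.File('workflow_output.hdf', 'r') as f1, \
        h5py.File('test_output.hdf', 'r') as f2:
    def compare(name, obj):
        # Visit every dataset: report both dtypes and whether values agree
        if isinstance(obj, h5py.Dataset):
            a, b = obj[...], f2[name][...]
            print(name, a.dtype, b.dtype, 'equal:', np.allclose(a, b))
    f1.visititems(compare)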

@spxiwh
Contributor

spxiwh commented Oct 9, 2023

np.array([1,2,3,4], dtype=np.float32).astype('float').dtype returns float64 not float32

@GarethCabournDavies
Contributor Author

np.array([1,2,3,4], dtype=np.float32).astype('float').dtype returns float64 not float32

Will that always be the case, though? I can't find the source code to check that it will be the same whatever hardware is used.

I think changing the code here to use .astype(np.float64) is safest for this specific use case (and also for the other astypes in the executable).

@spxiwh
Contributor

spxiwh commented Oct 9, 2023

I tried print statements in the code in question, and indeed this ends up as a float32, but I've no idea why. This shouldn't be hardware dependent etc., as float means float64 in Python. (I know there's some weirdness around things like longdouble, but the built-in Python float should be clear.)

Anyway, changing astype('float') to astype(np.float64) at least solves the discrepancy, and should let us proceed once we've verified that the files are identical. (This seems to store ints anyway, so I'm not really sure why it needs to be float64 ... but let's stick with "doesn't change output files".)

@GarethCabournDavies
Contributor Author

GarethCabournDavies commented Oct 10, 2023

Anyway, changing astype('float') to astype(np.float64) at least solves the discrepancy, and should let us proceed if we've verified that files are identical.

Weirdly, this seems to make some of the datasets numerically different now! I would advocate leaving things as they are (i.e. .astype('float')), as numerical agreement is more important than matching dtypes.

e.g.:

Dataset bin9/rates is not the same
Maximum proportional difference: [0.09051649 0.21997125]
Values, File one: [ 0.87410188 17.6588747 ], File two: [ 0.79848112 21.99397667]

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

Interesting observation:

import numpy as np
print(np.array([1,2,3,4], dtype=np.float32).astype('float').dtype)
import logging
import argparse
import pycbc
import pycbc.events
from pycbc.events import stat as pystat
from pycbc.types.timeseries import load_timeseries
import numpy as np
import h5py as h5
from pycbc.io.hdf import HFile, SingleDetTriggers
from pycbc.version import git_verbose_msg as version
print(np.array([1,2,3,4], dtype=np.float32).astype('float').dtype)

returns different dtypes for the two prints! ... So something in io.hdf (which seems to trigger this) is messing with how this works.

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

Here's a small example that demonstrates why this is happening:

import numpy as np
print(np.array([1,2,3,4], dtype=np.float32).astype('float').dtype)
from ligo.lw import types as ligolw_types
print([(k,np.sctypeDict[k],ligolw_types.ToNumPyType[k]) for k in ligolw_types.ToNumPyType if k in np.sctypeDict])
# This line is the one that causes the change! It's in io/record.py
np.sctypeDict.update(ligolw_types.ToNumPyType)

print(np.array([1,2,3,4], dtype=np.float32).astype('float').dtype)

It also shows what will change. It does mean that the behaviour of np.astype('int') or np.astype('float') (and perhaps np.astype('double')) will change.

However, if we replace all np.astype('int') with np.astype(np.int64) and np.astype('float') with np.astype(np.float64), previous behaviour would be restored. I'll check with @ahnitz in pycbc-code whether this behaviour was ever intended (I assume we just wanted the weird things like 'int_2s' added, not the existing things overwritten).
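
A minimal sketch of that assumed intent (add only the ligolw-specific names, without overwriting numpy's existing entries); this is illustrative, not the current io/record.py code:

from ligo.lw import types as ligolw_types
import numpy as np

# Only add keys numpy does not already define (e.g. 'int_2s');
# leave existing entries such as 'float' untouched.
for key, value in ligolw_types.ToNumPyType.items():
    np.sctypeDict.setdefault(key, value)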

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

@GarethCabournDavies Can you see if we get identical output if we change all the astype commands in this code to match previous behaviour? That should work, and then we can get this merged. The bigger question of "is this intended" can be considered outside of this PR.

@GarethCabournDavies
Contributor Author

The output is not identical after changing the astype('float') to astype(np.float64) - I think this is a case of later calculations being messed up, as the direct comparison after the change does not fail.

I don't know if we need to go through this code line by line and check that types are appropriate, but that is not part of this PR.

If we leave things as they are (using the string types), then the output is identical apart from the dtype of the datasets.

@maxtrevor
Contributor

The output is not identical after changing the astype('float') to astype(np.float64) - I think this is a case of later calculations being messed up, as the direct comparison after the change does not fail.

I don't know if we need to go through this code line by line and check that types are appropriate, but that is not part of this PR.

If we leave things as they are (using the string types), then the output is identical apart from the dtype of the datasets.

How much are the data products changed by switching to astype(np.float64)? Does it look consistent with just being related to the finite precision of a float?
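
For scale, float32 rounding alone gives relative differences of order 1e-7; a quick check using one of the values reported above:

import numpy as np

x = np.float64(17.6588747)
print(abs(np.float64(np.float32(x)) - x) / x)
# ~1e-8: float32 precision is nowhere near the ~0.22 proportional difference in bin9/rates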

@maxtrevor
Contributor

My only remaining concern is the numerical changes resulting from changing dtypes. Once that is understood / resolved, this should be good to go.

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

So, to avoid confusion on this point: this patch should not change any of the dtypes. Because of the issue being resolved in #4525, some arrays (with this patch as is) would use reduced precision. Once #4525 is merged this will no longer be the case. I would also recommend explicit type casting rather than using astype('int') in any case, so I propose adding this to the patch:

diff --git a/bin/all_sky_search/pycbc_bin_trigger_rates_dq b/bin/all_sky_search/pycbc_bin_trigger_rates_dq
index 9457b61971..eb2bed4c14 100644
--- a/bin/all_sky_search/pycbc_bin_trigger_rates_dq
+++ b/bin/all_sky_search/pycbc_bin_trigger_rates_dq
@@ -82,7 +82,7 @@ with h5.File(args.trig_file, 'r') as trig_file:
         if args.prune_number < 1:
             del stat
 
-trig_times_int = trig_times.astype('int')
+trig_times_int = trig_times.astype(np.int64)
 
 dq_times = np.array([])
 dq_logl = np.array([])
@@ -102,12 +102,12 @@ dq_percentiles = np.percentile(dq_logl, percentiles)[1:]
 
 # seconds bin tells what bin each second ends up
 seconds_bin = np.array([n_bins - np.sum(dq_percentiles >= dq_ll)
-                        for dq_ll in dq_logl]).astype('int')
+                        for dq_ll in dq_logl]).astype(np.int64)
 del dq_percentiles
 
 # bin times tells how much time ends up in each bin
 bin_times = np.array([np.sum(seconds_bin == i)
-                      for i in range(n_bins)]).astype('float')
+                      for i in range(n_bins)]).astype(np.float64)
 full_time = float(len(seconds_bin))
 times_nz = (bin_times > 0)
 del dq_logl
@@ -148,7 +148,7 @@ with h5.File(args.output_file, 'w') as f:
         logging.info('Processing %d triggers...', len(trig_percentile))
 
         (counts, bins) = np.histogram(trig_percentile, bins=(percentiles)/100)
-        counts = counts.astype('float')
+        counts = counts.astype(np.float64)
         rates = np.zeros_like(bin_times)
         rates[times_nz] = counts[times_nz]/bin_times[times_nz]
         mean_rate = len(trig_percentile) / full_time

with that patch in place there should not be any difference due to dtypes, because the dtypes of arrays will not change (regardless of whether or not #4525 is applied).

The remaining question is why has Gareth observed changes in output files before/after this patch? I'm beginning to suspect that this may be completely unrelated to dtype changes (and possibly the dtype change led Gareth to believe that files were otherwise the same, when in fact they were not). I haven't yet reproduced this though ...

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

It seems that the issue here was that we were assuming that SNR is always bigger than the single ranking. This is not the case when PSD variation is present: it can increase (or decrease) the SNR. The patch also uses > stat whereas the old code used >= stat ... one event in the test case had a stat of exactly 6.500000000.
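
A minimal illustration of both effects, with made-up numbers (6.5 matches the threshold event mentioned above):

import numpy as np

snr = np.array([6.5, 7.0, 8.0])
psd_var_val = np.array([1.0, 0.81, 1.21])  # values < 1 push the reweighted SNR above the raw SNR
rw_snr = snr / psd_var_val**0.5

print(rw_snr)            # [6.5  7.78  7.27]: the 7.0 trigger's ranking exceeds its SNR
print(rw_snr > 6.5)      # [False  True  True]: > drops the stat == 6.5 event
print(rw_snr >= 6.5)     # [ True  True  True]: >= keeps it, as the old code did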

I'll propose a fix to this tomorrow, but it would have to be a little less flexible than I would like until #4524 can be implemented.

@spxiwh
Contributor

spxiwh commented Oct 11, 2023

Here's the rest of the proposed change:

(pycbc_python3_throwawaytest4) [ian.harry@ldas-grid 01:13 am gareth_test]$ diff ~/virtualenvs/pycbc_python3_throwawaytest4/bin/pycbc_bin_trigger_rates_dq ./pycbc_bin_trigger_rates_dq_NEW_fixed
61,62c61,62
<     idx, _ = trig_file.select(
<         lambda snr: snr > args.stat_threshold,
---
>     idx, _, _ = trig_file.select(
>         lambda snr, psd_var_val: (snr/psd_var_val**0.5) >= args.stat_threshold,
63a64
>         f'{ifo}/psd_var_val',
94c95,96
< keep = stat > args.stat_threshold
---
> keep = stat >= args.stat_threshold
> 

In particular, the keep line on line 94 changes from > to >=. Also, the select becomes a bit hacky and explicitly includes psd_var_val:

with HFile(args.trig_file, 'r') as trig_file:
    n_triggers_orig = trig_file[f'{ifo}/snr'].size
    logging.info("Trigger file has %d triggers", n_triggers_orig)
    logging.info('Generating trigger mask')
    idx, _, _ = trig_file.select(
        lambda snr, psd_var_val: (snr/psd_var_val**0.5) >= args.stat_threshold,
        f'{ifo}/snr',
        f'{ifo}/psd_var_val',
        return_indices=True
    )
    data_mask = np.zeros(n_triggers_orig, dtype=bool)
    data_mask[idx] = True

Perhaps there should be some logic around this so that if psd_var_val is not computed (or worse, if it's 0!) this doesn't get included?

#4524 would help fix this properly, but needs some thought (and a bit more time) over how to do it properly.

value given SNR and psd variation statistic
"""
rw_snr = snr / psd_var_val ** 0.5
rw_snr[psd_var_val == 0] = 0
Contributor

What does it mean for psd var to be 0? I.e., what could cause that?

What value do we get if it is not computed in the first place?

Contributor

pycbc_inspiral does not store it if not computed. PyCBC Live has not yet defined behaviour in this case.

... If this will be fixed "in a better way" soon anyway, we probably don't need the "is it equal to 0" check (but it won't harm). Note that in the ranking computation, psd_var rescaling is not applied if the value is smaller than 0.65.

Contributor

Noting that the current code would fail if psd_var_val is not computed.

Contributor Author

Do you want me to prevent that failure? It would be a case of minor changes in the select function: if a dataset isn't present but is asked for, it could be set to zero?

Contributor

For now (avoiding things that should be in #4524), I would limit changes to this executable, and use an if statement to use the old call to select if psd_var is not present, and the new one if it is.
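
A minimal sketch of that if statement, assembled from the snippets above (the in-file presence test is an assumption about the final implementation):

with HFile(args.trig_file, 'r') as trig_file:
    n_triggers_orig = trig_file[f'{ifo}/snr'].size
    if f'{ifo}/psd_var_val' in trig_file:
        # New call: select on the PSD-variation-reweighted SNR
        idx, _, _ = trig_file.select(
            lambda snr, psd_var_val: (snr / psd_var_val**0.5) >= args.stat_threshold,
            f'{ifo}/snr',
            f'{ifo}/psd_var_val',
            return_indices=True
        )
    else:
        # Old call: plain SNR as the proxy for the ranking (>= per the fix above)
        idx, _ = trig_file.select(
            lambda snr: snr >= args.stat_threshold,
            f'{ifo}/snr',
            return_indices=True
        )
    data_mask = np.zeros(n_triggers_orig, dtype=bool)
    data_mask[idx] = True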

Contributor

Failure is good if the alternative is continuing while producing a plausible but wrong answer.
Just noting that substituting a 'silent' default value might be rather dangerous.

Contributor Author

I think wrapping in an if statement would work here. We aren't ignoring an option the user gave; we are simply working around a dataset which may or may not have been saved.

Contributor

tdent commented Oct 11, 2023

Can we then remove rw_snr[psd_var_val == 0] = 0? If psd var is calculated and the psd var val is 0, there is a big problem, and I think we want the code to produce an error rather than ignore it.

Contributor

Please remove this line

@GarethCabournDavies
Contributor Author

GarethCabournDavies commented Oct 11, 2023

Files now differ at a binary level (though I don't think they should), but all attributes and datasets are numerically equal.

I haven't been able to test with the missing-psd_var_val check active, as finding a run with DQ but without psd_var_val is not obvious. However, the function is the same as previously tested.

Contributor

spxiwh left a comment

One line needs removing, otherwise approved.

@GarethCabournDavies GarethCabournDavies enabled auto-merge (squash) October 11, 2023 13:44
@GarethCabournDavies GarethCabournDavies merged commit d7325c3 into gwastro:master Oct 11, 2023
31 checks passed
@GarethCabournDavies GarethCabournDavies deleted the dq_trig_rates_efficiency branch October 16, 2023 09:04
PRAVEEN-mnl pushed a commit to PRAVEEN-mnl/pycbc that referenced this pull request Nov 3, 2023
…wastro#4519)

* Apply a few efficiency savings to pycbc_bin_trigger_rates_dq

* Use more logging. stat-threshold now has a default

* Some improvements to logging

* Move background_bins after trigger loading, allow indices_only return in HFile.select

* Use sngl-ranking straight away in the select function

* Descope PR

* cleanup reversion to descope

* cleanup reversion to descope

* Explicit numpy types

* gt --> ge

* IH bugfix

* apply some logic around psd var reweighting

* wrap trigger selection to allow for missing psd_var_val case

* remove checks around the psd_var_val
bhooshan-gadre pushed a commit to bhooshan-gadre/pycbc that referenced this pull request Mar 4, 2024
lpathak97 pushed a commit to lpathak97/pycbc that referenced this pull request Mar 13, 2024
acorreia61201 pushed a commit to acorreia61201/pycbc that referenced this pull request Apr 4, 2024