Apply some needed efficiency savings to pycbc_bin_trigger_rates_dq #4519

Merged

Conversation

GarethCabournDavies
Contributor

Using this meant that I was able to actually complete a run of this code using realistic analysis files!

Basically, use the mechanism described in #4510 to pre-mask the triggers before loading, and move some calculations out of the loops where they were repeated.

Some increased logging as well.

n_triggers_orig = trig_file[f'{ifo}/snr'].size
logging.info("Trigger file has %d triggers", n_triggers_orig)
logging.info('Generating trigger mask')
idx, _ = trig_file.select(
    lambda snr: snr > args.stat_threshold,
    f'{ifo}/snr',
    return_indices=True
)
Contributor

Can we make an action here: it would be really good if we could do a select on the ranking, instead of needing to use SNR as a proxy.

This https://github.com/gwastro/pycbc/blob/master/pycbc/io/hdf.py#L520 is another place where one just wants a set of indices of triggers with loud rankings. A second part of this is that the unused _ return here is using more memory than anything else. We may want a select method that only returns idx and never computes the rest.
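
A minimal sketch of such an indices-only select, assuming an h5py-backed group and chunked reads; the helper name and chunking scheme are illustrative, not the PyCBC API:

import numpy as np

def select_indices(h5group, fcn, *dataset_names, chunksize=2**20):
    # Return only the indices where fcn(*chunks) is True, reading the
    # datasets in chunks so no full filtered array is ever held in memory.
    size = h5group[dataset_names[0]].size
    found = []
    for start in range(0, size, chunksize):
        stop = min(start + chunksize, size)
        chunks = [h5group[name][start:stop] for name in dataset_names]
        found.append(np.flatnonzero(fcn(*chunks)) + start)
    return np.concatenate(found) if found else np.zeros(0, dtype=np.int64)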

Contributor Author

I have changed to use ranking straight away - there's a bit of jiggery-pokery to get the trigs into the right format, but it seems okay at the moment.

There is a bit of inefficiency with unused datasets if they aren't part of the ranking, but this is done in a couple of minutes anyway, and is nothing compared to the background_bins.

I've also updated the select() function to allow an indices_only return as suggested.
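
For illustration, a hypothetical call using that option (the keyword name follows the comment above; the exact signature in the updated select() may differ):

idx = trig_file.select(
    lambda snr: snr > args.stat_threshold,
    f'{ifo}/snr',
    indices_only=True,
)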

Contributor

@spxiwh seems like you want an issue for future development opened?

Contributor

Oh I see, Gareth already set something up.

@spxiwh
Contributor

spxiwh commented Oct 5, 2023

Can you also move the trigger file parsing above the background bin computation? Otherwise it takes an hour to compute tons of SEOBNRv5 durations, and then the job gets evicted when the memory usage spikes. Do the high-memory-using thing first.

@spxiwh
Contributor

spxiwh commented Oct 6, 2023

Thanks for taking the next steps with this, but for this PR, could we bring this back to just doing this by SNR (and moving the trigger file reading up)? That change has been tested and should be good to go.

The other parts to this (having a select method that only returns indices, and being able to run select with the ranking only) affect more codes than just this one, so I'd prefer to pull them off into a separate (and broader) PR (with less urgency). I'd also like to consider this a bit further: if we have ranking = snr (for instance), this code would be 4x slower than it needs to be while it reads a bunch of arrays that it won't use. I'd also like the "mask_to_loudest_n_..." method to use this functionality.

@GarethCabournDavies
Contributor Author

I have de-scoped and added notes in #4524

@GarethCabournDavies
Contributor Author

I ran a comparison of the output file against the one from the workflow (every attribute, every dataset; values and dtypes).

The files do differ slightly after this change, as the 'binX/rates' datasets are float64 in the old run, and float32 in the new test.

I think this is due to this line not specifying the exact float dtype, and the two being run on different hardware.

The datasets are numerically equal to one another though
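
For reference, a minimal sketch of that kind of comparison, with hypothetical file names:

import h5py
import numpy as np

with h5py.File('workflow_output.hdf', 'r') as f1, \
        h5py.File('test_output.hdf', 'r') as f2:
    def compare(name, obj):
        # Visit every dataset: report both dtypes and whether values agree
        if isinstance(obj, h5py.Dataset):
            a, b = obj[...], f2[name][...]
            print(name, a.dtype, b.dtype, 'equal:', np.allclose(a, b))
    f1.visititems(compare)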

@spxiwh
Contributor

spxiwh commented Oct 9, 2023

np.array([1,2,3,4], dtype=np.float32).astype('float').dtype returns float64 not float32

@GarethCabournDavies
Contributor Author

np.array([1,2,3,4], dtype=np.float32).astype('float').dtype returns float64 not float32

Will that always be the case, though? I can't find the source code to check that it will be the same whatever hardware is used.

I think changing the code here to use .astype(np.float64) is safest for this specific use case (and also for the other astypes in the executable).

@spxiwh
Contributor

spxiwh commented Oct 9, 2023

I tried print statements in the code in question, and indeed this ends up as a float32, but I've no idea why. This shouldn't be hardware dependent etc., as float means float64 in Python. (I know there's some weirdness around things like longdouble, but the built-in Python float should be clear.)

Anyway, changing astype('float') to astype(np.float64) at least solves the discrepancy, and should let us proceed once we've verified that the files are identical. (This seems to store ints anyway, so I'm not really sure why it needs to be float64 ... but let's stick with "doesn't change output files".)

@GarethCabournDavies
Contributor Author

GarethCabournDavies commented Oct 10, 2023

Anyway, changing astype('float') to astype(np.float64) at least solves the discrepancy, and should let us proceed if we've verified that files are identical.

Weirdly, this seems to make some of the datasets numerically different now! I would advocate leaving things as they are (i.e. .astype('float')), as numerical agreement is more important than matching dtypes.

e.g.:

Dataset bin9/rates is not the same
Maximum proportional difference: [0.09051649 0.21997125]
Values, File one: [ 0.87410188 17.6588747 ], File two: [ 0.79848112 21.99397667]

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

Interesting observation:

import numpy as np
print(np.array([1,2,3,4], dtype=np.float32).astype('float').dtype)
import logging
import argparse
import pycbc
import pycbc.events
from pycbc.events import stat as pystat
from pycbc.types.timeseries import load_timeseries
import numpy as np
import h5py as h5
from pycbc.io.hdf import HFile, SingleDetTriggers
from pycbc.version import git_verbose_msg as version
print(np.array([1,2,3,4], dtype=np.float32).astype('float').dtype)

returns different dtypes for the two prints! ... So something in io.hdf (which seems to trigger this) is messing with how this works.

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

Here's a small example that demonstrates why this is happening:

import numpy as np
print(np.array([1,2,3,4], dtype=np.float32).astype('float').dtype)
from ligo.lw import types as ligolw_types
print([(k,np.sctypeDict[k],ligolw_types.ToNumPyType[k]) for k in ligolw_types.ToNumPyType if k in np.sctypeDict])
# This line is the one that causes the change! It's in io/record.py
np.sctypeDict.update(ligolw_types.ToNumPyType)

print(np.array([1,2,3,4], dtype=np.float32).astype('float').dtype)

It also shows what will change. It does mean that the behaviour of np.astype('int') or np.astype('float') (and perhaps np.astype('double')) will change.

However, if we replace all np.astype('int') with np.astype(np.int64) and np.astype('float') with np.astype(np.float64), previous behaviour would be restored. I'll check with @ahnitz in pycbc-code whether this behaviour was ever intended (I assume we just wanted the weird things like 'int_2s' added, not the existing things overwritten).
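
A minimal sketch of that assumed intent (add only the ligolw-specific names, without overwriting numpy's existing entries); this is illustrative, not the current io/record.py code:

from ligo.lw import types as ligolw_types
import numpy as np

# Only add keys numpy does not already define (e.g. 'int_2s');
# leave existing entries such as 'float' untouched.
for key, value in ligolw_types.ToNumPyType.items():
    np.sctypeDict.setdefault(key, value)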

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

@GarethCabournDavies Can you see if we get identical output if we change all the astype commands in this code to match previous behaviour? That should work, and then we can get this merged. The bigger question of "is this intended" can be considered outside of this PR.

@GarethCabournDavies
Contributor Author

The output is not identical after changing the astype('float') to astype(np.float64) - I think this is a case of later calculations being messed up, as the direct comparison after the change does not fail.

I don't know if we need to go through this code line by line and check that types are appropriate, but that is not part of this PR.

If we leave things as they are (using the string types), then the output is identical apart from the dtype of the datasets.

@maxtrevor
Contributor

The output is not identical after changing the astype('float') to astype(np.float64) - I think this is a case of later calculations being messed up, as the direct comparison after the change does not fail.

I don't know if we need to go through this code line by line and check that types are appropriate, but that is not part of this PR.

If we leave things as they are (using the string types), then the output is identical apart from the dtype of the datasets.

How much are the data products changed by switching to astype(np.float64)? Does it look consistent with just being related to the finite precision of a float?
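
For scale, float32 rounding alone gives relative differences of order 1e-7; a quick check using one of the values reported above:

import numpy as np

x = np.float64(17.6588747)
print(abs(np.float64(np.float32(x)) - x) / x)
# ~1e-8: float32 precision is nowhere near the ~0.22 proportional difference in bin9/rates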

@maxtrevor
Contributor

My only remaining concern is the numerical changes resulting from changing dtypes. Once that is understood / resolved, this should be good to go.

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

So, to avoid confusion on this point: this patch should not change any of the dtypes. Because of the issue being resolved in #4525, some arrays (with this patch as is) would use reduced precision. Once #4525 is merged this will no longer be the case. I would also recommend explicit type casting rather than using astype('int') in any case, so I propose adding this to the patch:

diff --git a/bin/all_sky_search/pycbc_bin_trigger_rates_dq b/bin/all_sky_search/pycbc_bin_trigger_rates_dq
index 9457b61971..eb2bed4c14 100644
--- a/bin/all_sky_search/pycbc_bin_trigger_rates_dq
+++ b/bin/all_sky_search/pycbc_bin_trigger_rates_dq
@@ -82,7 +82,7 @@ with h5.File(args.trig_file, 'r') as trig_file:
         if args.prune_number < 1:
             del stat
 
-trig_times_int = trig_times.astype('int')
+trig_times_int = trig_times.astype(np.int64)
 
 dq_times = np.array([])
 dq_logl = np.array([])
@@ -102,12 +102,12 @@ dq_percentiles = np.percentile(dq_logl, percentiles)[1:]
 
 # seconds bin tells what bin each second ends up
 seconds_bin = np.array([n_bins - np.sum(dq_percentiles >= dq_ll)
-                        for dq_ll in dq_logl]).astype('int')
+                        for dq_ll in dq_logl]).astype(np.int64)
 del dq_percentiles
 
 # bin times tells how much time ends up in each bin
 bin_times = np.array([np.sum(seconds_bin == i)
-                      for i in range(n_bins)]).astype('float')
+                      for i in range(n_bins)]).astype(np.float64)
 full_time = float(len(seconds_bin))
 times_nz = (bin_times > 0)
 del dq_logl
@@ -148,7 +148,7 @@ with h5.File(args.output_file, 'w') as f:
         logging.info('Processing %d triggers...', len(trig_percentile))
 
         (counts, bins) = np.histogram(trig_percentile, bins=(percentiles)/100)
-        counts = counts.astype('float')
+        counts = counts.astype(np.float64)
         rates = np.zeros_like(bin_times)
         rates[times_nz] = counts[times_nz]/bin_times[times_nz]
         mean_rate = len(trig_percentile) / full_time

with that patch in place there should not be any difference due to dtypes, because the dtypes of arrays will not change (regardless of whether or not #4525 is applied).

The remaining question is why has Gareth observed changes in output files before/after this patch? I'm beginning to suspect that this may be completely unrelated to dtype changes (and possibly the dtype change led Gareth to believe that files were otherwise the same, when in fact they were not). I haven't yet reproduced this though ...

@spxiwh
Contributor

spxiwh commented Oct 10, 2023

It seems that the issue here was that we were assuming that SNR is always bigger than the single ranking. This is not the case when PSD variation is present: it can increase (or decrease) the SNR. The patch also uses > stat whereas the old code used >= stat ... one event in the test case had a stat of exactly 6.500000000.
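
A minimal illustration of both effects, with made-up numbers (6.5 matches the threshold event mentioned above):

import numpy as np

snr = np.array([6.5, 7.0, 8.0])
psd_var_val = np.array([1.0, 0.81, 1.21])  # values < 1 push the reweighted SNR above the raw SNR
rw_snr = snr / psd_var_val**0.5

print(rw_snr)            # [6.5  7.78  7.27]: the 7.0 trigger's ranking exceeds its SNR
print(rw_snr > 6.5)      # [False  True  True]: > drops the stat == 6.5 event
print(rw_snr >= 6.5)     # [ True  True  True]: >= keeps it, as the old code did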

I'll propose a fix to this tomorrow, but it would have to be a little less flexible than I would like until #4524 can be implemented.

@spxiwh
Contributor

spxiwh commented Oct 11, 2023

Here's the rest of the proposed change:

(pycbc_python3_throwawaytest4) [ian.harry@ldas-grid 01:13 am gareth_test]$ diff ~/virtualenvs/pycbc_python3_throwawaytest4/bin/pycbc_bin_trigger_rates_dq ./pycbc_bin_trigger_rates_dq_NEW_fixed
61,62c61,62
<     idx, _ = trig_file.select(
<         lambda snr: snr > args.stat_threshold,
---
>     idx, _, _ = trig_file.select(
>         lambda snr, psd_var_val: (snr/psd_var_val**0.5) >= args.stat_threshold,
63a64
>         f'{ifo}/psd_var_val',
94c95,96
< keep = stat > args.stat_threshold
---
> keep = stat >= args.stat_threshold
> 

In particular, the keep line on line 94 changes from > to >=. Also, the select becomes a bit hacky and explicitly includes psd_var_val:

with HFile(args.trig_file, 'r') as trig_file:
    n_triggers_orig = trig_file[f'{ifo}/snr'].size
    logging.info("Trigger file has %d triggers", n_triggers_orig)
    logging.info('Generating trigger mask')
    idx, _, _ = trig_file.select(
        lambda snr, psd_var_val: (snr/psd_var_val**0.5) >= args.stat_threshold,
        f'{ifo}/snr',
        f'{ifo}/psd_var_val',
        return_indices=True
    )
    data_mask = np.zeros(n_triggers_orig, dtype=bool)
    data_mask[idx] = True

Perhaps there should be some logic around this so that if psd_var_val is not computed (or worse, if it's 0!) this doesn't get included?

#4524 would help fix this properly, but needs some thought (and a bit more time) over how to do it properly.

value given SNR and psd variation statistic
"""
rw_snr = snr / psd_var_val ** 0.5
rw_snr[psd_var_val == 0] = 0
Contributor

What does it mean for psd var to be 0? I.e., what could cause that?

What value do we get if it is not computed in the first place?

Contributor

pycbc_inspiral does not store it if not computed. PyCBC Live has not yet defined behaviour in this case.

... If this will be fixed "in a better way" soon anyway, we probably don't need the "is it equal to 0" check (but it won't harm). Note that in the ranking computation, psd_var rescaling is not applied if the value is smaller than 0.65.

Contributor

Noting that the current code would fail if psd_var_val is not computed.

Contributor Author

Do you want me to prevent that failure? It would be a case of minor changes in the select function: if a dataset isn't present but is asked for, it could be set to zero?

Contributor

For now (avoiding things that should be in #4524), I would limit changes to this executable, and use an if statement to use the old call to select if psd_var is not present, and the new one if it is.
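
A minimal sketch of that if statement, assembled from the snippets above (the in-file presence test is an assumption about the final implementation):

with HFile(args.trig_file, 'r') as trig_file:
    n_triggers_orig = trig_file[f'{ifo}/snr'].size
    if f'{ifo}/psd_var_val' in trig_file:
        # New call: select on the PSD-variation-reweighted SNR
        idx, _, _ = trig_file.select(
            lambda snr, psd_var_val: (snr / psd_var_val**0.5) >= args.stat_threshold,
            f'{ifo}/snr',
            f'{ifo}/psd_var_val',
            return_indices=True
        )
    else:
        # Old call: plain SNR as the proxy for the ranking (>= per the fix above)
        idx, _ = trig_file.select(
            lambda snr: snr >= args.stat_threshold,
            f'{ifo}/snr',
            return_indices=True
        )
    data_mask = np.zeros(n_triggers_orig, dtype=bool)
    data_mask[idx] = True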

Contributor

Failure is good if the alternative is continuing while producing a plausible but wrong answer.
Just noting that substituting a 'silent' default value might be rather dangerous.

Contributor Author

I think wrapping in an if statement would work here. We aren't ignoring an option the user gave; we are simply working around a dataset which may or may not have been saved.

Contributor

tdent commented Oct 11, 2023

Can we then remove rw_snr[psd_var_val == 0] = 0? If psd var is calculated and the psd var val is 0, there is a big problem, and I think we want the code to produce an error rather than ignore it.

Contributor

Please remove this line

@GarethCabournDavies
Contributor Author

GarethCabournDavies commented Oct 11, 2023

Files now differ at a binary level (though I don't think they should), but all attributes and datasets are numerically equal.

I haven't been able to test with the missing-psd_var_val check active, as finding a run with DQ but without psd_var_val is not obvious. However, the function is the same as previously tested.

Contributor

spxiwh left a comment

One line needs removing, otherwise approved.

@GarethCabournDavies GarethCabournDavies enabled auto-merge (squash) October 11, 2023 13:44
@GarethCabournDavies GarethCabournDavies merged commit d7325c3 into gwastro:master Oct 11, 2023
31 checks passed
@GarethCabournDavies GarethCabournDavies deleted the dq_trig_rates_efficiency branch October 16, 2023 09:04
PRAVEEN-mnl pushed a commit to PRAVEEN-mnl/pycbc that referenced this pull request Nov 3, 2023
…wastro#4519)

* Apply a few efficiency savings to pycbc_bin_trigger_rates_dq

* Use more logging. stat-threshold now has a default

* Some improvements to logging

* Move background_bins after trigger loading, allow indices_only return in HFile.select

* Use sngl-ranking straight away in the select function

* Descope PR

* cleanup reversion to descope

* cleanup reversion to descope

* Explicit numpy types

* gt --> ge

* IH bugfix

* apply some logic around psd var reweighting

* wrap trigger selection to allow for missing psd_var_val case

* remove checks around the psd_var_val
bhooshan-gadre pushed a commit to bhooshan-gadre/pycbc that referenced this pull request Mar 4, 2024
lpathak97 pushed a commit to lpathak97/pycbc that referenced this pull request Mar 13, 2024
acorreia61201 pushed a commit to acorreia61201/pycbc that referenced this pull request Apr 4, 2024