Apply some needed efficiency savings to pycbc_bin_trigger_rates_dq #4519
Conversation
n_triggers_orig = trig_file[f'{ifo}/snr'].size
logging.info("Trigger file has %d triggers", n_triggers_orig)
logging.info('Generating trigger mask')
idx, _ = trig_file.select(
Can we make an action here: it would be really good if we could do a select on the ranking directly, instead of needing to use SNR as a proxy.
This https://github.com/gwastro/pycbc/blob/master/pycbc/io/hdf.py#L520 is another place where one just wants a set of indices of triggers with loud rankings. A second part of this is that the unused `_` return here is using more memory than anything else. We may want a `select` method that only returns `idx` and never computes the rest.
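As an illustration of the memory point, here is a sketch against plain h5py (not the actual HFile.select implementation; the file layout, helper name and threshold are assumptions): compute the mask from one small column and return only the indices, never materialising the other columns.

```python
import h5py
import numpy as np

def select_indices_only(filename, ifo, threshold):
    """Return only the indices of triggers whose SNR is above threshold.

    Sketch: the mask is computed from a single, comparatively small
    column, and nothing else is read or returned, so peak memory stays
    close to the size of that one column.
    """
    with h5py.File(filename, "r") as trig_file:
        snr = trig_file[f"{ifo}/snr"][:]
        idx = np.flatnonzero(snr >= threshold)
    return idx

# Hypothetical usage: load the large columns only at the kept indices.
# idx = select_indices_only("H1-TRIGGERS.hdf", "H1", 6.0)
# with h5py.File("H1-TRIGGERS.hdf", "r") as f:
#     end_time = f["H1/end_time"][:][idx]
```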
I have changed to use the ranking straight away - there's a bit of jiggery-pokery to get the trigs into the right format, but it seems okay at the moment.
There is a bit of inefficiency with unused datasets if they aren't part of the ranking, but this is done in a couple of minutes anyway, and is nothing compared to the background_bins.
I've also updated the select() function to allow an indices_only return, as suggested.
@spxiwh it seems like you want an issue opened for future development?
Oh I see, Gareth already set something up.
Can you also move the trigger file parsing above the background bin computation? Otherwise it takes an hour to compute tons of SEOBNRv5 durations, and then the job gets evicted when the memory usage spikes. Do the high-memory-using thing first.
Thanks for taking the next steps with this, but for this PR, could we bring this back to just doing the selection by SNR (and moving the trigger file reading up)? That change has been tested and should be good to go. The other parts of this (a select that only returns indices, and selecting on the ranking directly) can follow later.
I have de-scoped this and added notes in #4524.
I ran a comparison of the output file to the one from the workflow (every attribute and every dataset; values and dtypes). The files do differ slightly after this change, as the 'binX/rates' datasets are float64 in the old run and float32 in the new test. I think this is due to this line not specifying the exact float dtype, plus the two runs being on different hardware. The datasets are numerically equal to one another though.
Will that always be the case though? I can't find source code to check that it will be the same whatever hardware is used. I think changing the code here to use an explicit numpy dtype would be safer.
I tried print statements in the code in question, and indeed this ends up as a float32. Anyway, I've changed it to use explicit numpy dtypes.
Weirdly, this seems to make some of the datasets numerically different now! I would advocate leaving things as they are (i.e. using the string types as before).
Interesting observation: the same quantity returns different dtypes for the two prints! ... So something in between is changing the dtype.
Here's a small example that demonstrates why this is happening. It also shows what will change. It does mean that the behaviour of these casts is platform dependent. However, if we replace all np.astype('int') calls with explicit numpy types, the dtypes should no longer depend on where the code is run.
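A minimal stand-in illustration of this behaviour (a sketch, not the exact example from this thread; the resolved dtypes depend on NumPy version and platform):

```python
import numpy as np

a = np.arange(5, dtype=np.float64)

# The string 'int' is resolved via np.dtype('int'), which historically
# maps to the platform's C long: typically int64 on Linux/macOS and
# int32 on 64-bit Windows (NumPy < 2.0).
b = a.astype('int')
print(b.dtype)          # platform / NumPy-version dependent

# An explicit NumPy type is unambiguous everywhere.
c = a.astype(np.int64)
print(c.dtype)          # always int64
```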
@GarethCabournDavies Can you see if we get identical output if we change all the string dtypes to explicit numpy types?
The output is not identical after changing the types. I don't know if we need to go through this code line-by-line and check that types are appropriate, but that is not part of this PR. If we leave things as they are (using the string types), then the output is identical apart from the dtype of the datasets.
How much are the data products changed by switching to use explicit types?
My only remaining concern is the numerical changes resulting from changing dtypes. Once that is understood/resolved, this should be good to go.
So to avoid confusion on this point: this patch should not change any of the dtypes. Because of the issue being resolved in #4525, some arrays (with this patch as is) would use reduced precision. Once #4525 is merged this will no longer be the case. I would also recommend explicit type casting rather than string dtypes; with that change in place there should not be any difference due to dtypes, because the dtypes of arrays will not change (regardless of whether or not #4525 is applied). The remaining question is why Gareth has observed changes in output files before/after this patch. I'm beginning to suspect that this may be completely unrelated to dtype changes (and possibly the dtype change led Gareth to believe that the files were otherwise the same, when in fact they were not). I haven't yet reproduced this though ...
It seems that the issue here was that we are assuming that SNR is always bigger than the single ranking. This is not the case when PSD variation is present: this can increase (or decrease) the SNR. The patch also uses a ... I'll propose a fix to this tomorrow, but it would have to be a little less flexible than I would like until #4524 can be implemented.
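A small numerical illustration of that point (hypothetical SNR and psd_var_val values, using the snr / sqrt(psd_var_val) reweighting excerpted later in this thread): when psd_var_val < 1 the reweighted statistic exceeds the raw SNR, so cutting on raw SNR at the statistic threshold can drop triggers that actually pass.

```python
import numpy as np

# Hypothetical trigger values for illustration only.
snr = np.array([5.8, 6.5, 7.0])
psd_var_val = np.array([0.8, 1.0, 1.3])

# Reweighting as in the excerpt later in this thread.
rw_snr = snr / psd_var_val ** 0.5

threshold = 6.0
keep_by_snr = snr >= threshold       # SNR used as a proxy for the ranking
keep_by_rank = rw_snr >= threshold   # the cut we actually want

print(rw_snr)         # ~[6.48, 6.50, 6.14]
print(keep_by_snr)    # [False  True  True]
print(keep_by_rank)   # [ True  True  True] -> the proxy cut loses the first trigger
```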
Here's the rest of the proposed change. In particular, note the reweighted-SNR handling in the excerpt below. Perhaps there should be some logic around this so that the reweighting is only applied when appropriate; #4524 would help fix this properly, but that needs some thought (and a bit more time) over how to do it properly.
value given SNR and psd variation statistic
"""
rw_snr = snr / psd_var_val ** 0.5
rw_snr[psd_var_val == 0] = 0
What does it mean for psd var to be = 0? I.e., what could cause that?
What value do we get if it is not computed in the first place?
pycbc_inspiral does not store it if not computed. PyCBC Live has not yet defined behaviour in this case.
... If this will be fixed "in a better way" soon anyway, we probably don't need the "is it equal to 0" check (but it won't harm). Note that in the ranking computation, psd_var rescaling is not applied if the value is smaller than 0.65.
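For reference, a minimal sketch of a rescaling helper that follows the rule just described (the 0.65 floor and the snr / sqrt(psd_var_val) form come from this thread; the real PyCBC ranking code may differ in detail):

```python
import numpy as np

def reweight_snr(snr, psd_var_val, floor=0.65):
    """Rescale SNR by the PSD variation statistic.

    Sketch only: values of psd_var_val below `floor` leave the SNR
    unscaled, matching the behaviour described for the ranking
    computation above.
    """
    snr = np.asarray(snr, dtype=np.float64)
    psd_var_val = np.asarray(psd_var_val, dtype=np.float64)
    rw_snr = snr.copy()
    apply = psd_var_val >= floor
    rw_snr[apply] = snr[apply] / psd_var_val[apply] ** 0.5
    return rw_snr

print(reweight_snr([6.0, 6.0], [0.5, 1.2]))   # [6.0, ~5.48]
```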
Noting that the current code would fail if psd_var_val is not computed.
Do you want me to prevent that failure? It would be a case of minor changes in the select function: if the dataset isn't present but is asked for, it could be set to zero.
For now (avoiding things that should be in #4524), I would limit changes to this executable, and use an if statement to use the old call to select if psd_var is not present, and the new one if it is.
Failure is good if the alternative is continuing while producing a plausible but wrong answer.
Just noting that substituting a 'silent' default value might be rather dangerous.
I think wrapping this in an if statement would work here. We aren't ignoring an option that was given; we are simply working around a dataset which may or may not have been saved.
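A minimal sketch of that wrapping, assuming a plain h5py file handle and a hypothetical helper name (this is not the actual executable code):

```python
import h5py

def get_selection_stat(trig_file, ifo):
    """Return the statistic to threshold on: reweighted SNR if the
    psd_var_val dataset was saved, plain SNR otherwise.

    Hypothetical helper for illustration; the real change lives in the
    executable's call to select().
    """
    snr = trig_file[f"{ifo}/snr"][:]
    if f"{ifo}/psd_var_val" in trig_file:
        psd_var_val = trig_file[f"{ifo}/psd_var_val"][:]
        return snr / psd_var_val ** 0.5
    # Dataset absent (psd variation never computed): fall back to the
    # old SNR-based selection rather than inventing a default value.
    return snr

# with h5py.File("H1-TRIGGERS.hdf", "r") as f:   # hypothetical file name
#     stat = get_selection_stat(f, "H1")
```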
Can we then remove rw_snr[psd_var_val == 0] = 0? If psd var is calculated and the psd var val is 0, there is a big problem and I think we want the code to produce an error rather than ignore it.
Please remove this line
Files now differ at a binary level (though I don't think they should), but all attributes and datasets are numerically equal. I haven't been able to test with the psd_var_val-missing check active, as finding a run with dq but without psd_var_val is not obvious. However, the function is the same as previously tested.
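For completeness, a rough sketch of the kind of comparison described (every dataset and attribute, values and dtypes), written against plain h5py; the function name and structure are assumptions, not the script actually used:

```python
import h5py
import numpy as np

def compare_hdf(path_a, path_b):
    """Print any attribute/dataset differences between two HDF5 files."""
    with h5py.File(path_a, "r") as fa, h5py.File(path_b, "r") as fb:
        names_a, names_b = set(), set()
        fa.visit(names_a.add)
        fb.visit(names_b.add)
        for name in sorted(names_a ^ names_b):
            print(f"only in one file: {name}")
        for name in sorted(names_a & names_b):
            da, db = fa[name], fb[name]
            if isinstance(da, h5py.Dataset) and isinstance(db, h5py.Dataset):
                if da.dtype != db.dtype:
                    print(f"{name}: dtype {da.dtype} vs {db.dtype}")
                same = (np.allclose(da[...], db[...], equal_nan=True)
                        if da.dtype.kind in "fc" else
                        np.array_equal(da[...], db[...]))
                if not same:
                    print(f"{name}: values differ")
            for key in sorted(set(da.attrs) | set(db.attrs)):
                if key not in da.attrs or key not in db.attrs:
                    print(f"{name}@{key}: attribute only in one file")
                elif not np.array_equal(da.attrs[key], db.attrs[key]):
                    print(f"{name}@{key}: attribute values differ")
```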
One line needs removing, otherwise approved.
…wastro#4519)

* Apply a few efficiency savings to pycbc_bin_trigger_rates_dq
* Use more logging. stat-threshold now has a default
* Some improvements to logging
* Move background_bins after trigger loading, allow indices_only return in HFile.select
* Use sngl-ranking straight away in the select function
* Descope PR
* cleanup reversion to descope
* cleanup reversion to descope
* Explicit numpy types
* gt --> ge
* IH bugfix
* apply some logic around psd var reweighting
* wrap trigger selection to allow for missing psd_var_val case
* remove checks around the psd_var_val
Using this meant that I was able to actually complete a run of this code using realistic analysis files!
Basically, use the mechanism described in #4510 to pre-mask the triggers before loading, and move some repeated calculations out of the loops they were done in.
Some increased logging as well.
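As a rough sketch of that flow (a hypothetical script fragment against plain h5py; the file name, column names and threshold are assumptions, and #4510 describes the real mechanism): decide which triggers to keep from a single column before loading anything else, and compute loop-invariant quantities once.

```python
import h5py
import numpy as np

ifo = "H1"
trig_path = "H1-TRIGGERS.hdf"    # hypothetical file name
stat_threshold = 6.0             # hypothetical threshold

with h5py.File(trig_path, "r") as trig_file:
    # Pre-mask: decide which triggers survive using a single column,
    # before any of the larger datasets are touched.
    snr = trig_file[f"{ifo}/snr"][:]
    idx = np.flatnonzero(snr >= stat_threshold)

    # Load the remaining columns only at the kept indices.
    end_time = trig_file[f"{ifo}/end_time"][:][idx]
    template_id = trig_file[f"{ifo}/template_id"][:][idx]

# Loop-invariant work is done once, outside any per-bin loop.
kept_snr = snr[idx]
```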