Some efficiency savings for pycbc_fit_sngls_over_multiparam #4957

Merged (4 commits) on Dec 12, 2024

Conversation

GarethCabournDavies (Contributor) commented Nov 26, 2024

Add a few efficiency savings to pycbc_fit_sngls_over_multiparam.

Standard information about the request

This is an efficiency update.
This change affects the offline search and the live search.
This change does not alter the output.
This change follows style guidelines (see e.g. PEP 8) and has been proposed using the contribution guidelines.

Motivation

I was looking at pycbc_fit_sngls_over_multiparam, thinking it would be a good place to learn and implement some GPU efficiency, but then we (mainly Ian) saw that there were some huge efficiency savings that could be made fairly quickly.

Contents

Biggest saving:

  • Move one instance where an array is fancy-indexed to re-sort it from inside a loop to outside the loop.

Additional savings:

  • Pre-allocate memory for an array whose length is known in advance, rather than appending to a list in a loop.
  • Index on a boolean mask in the case where we are using a significant portion of the templates (smooth_distance_weighted).
  • Allow numpy.average() to fall back to the more efficient numpy.mean() path rather than passing equal weights.
  • Use the pre-calculated number of templates when reporting progress rather than recalculating it every time.
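As an illustrative sketch of two of the numpy-level points above (all variable names here are stand-ins, not the script's actual code): hoisting a fancy-index re-sort out of a per-template loop avoids repeating an O(n) operation n times, and numpy.average() with no weights argument falls through to the cheaper numpy.mean() path:

```python
import numpy

rng = numpy.random.default_rng(0)
n_templates = 1_000
parvals = rng.random(n_templates)   # stand-in for the sorting parameter
nabove = rng.random(n_templates)    # stand-in for per-template fit values

par_sort = numpy.argsort(parvals)

# Old pattern: the re-sort is re-evaluated on every loop iteration
slow = []
for i in range(n_templates):
    slow.append(nabove[par_sort][i])

# New pattern: sort once, outside the loop
nabove_sort = nabove[par_sort]
fast = [nabove_sort[i] for i in range(n_templates)]

assert slow == fast

# With no weights given, numpy.average() reduces to numpy.mean()
assert numpy.average(nabove) == numpy.mean(nabove)
```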

Testing performed

Note that for this testing, I set the loop over templates to break after 10,000 iterations.
This meant that for the "old" runs I needed to apply the pre-allocated array change described above in order for the outputs to match.

Differences in output files

All equivalent files' datasets match by equality (i.e. numpy.array_equal(dataset1, dataset2)), and the dtypes and attributes also match.

Hashes of the files do not match, though I am unsure how this can be the case.
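For what it's worth, byte-level hashes of HDF5 files may differ even when all datasets match, for example because of object creation order or the modification timestamps HDF5 stores by default. The content-level check described above can be sketched with plain dicts of arrays standing in for the HDF5 groups (an illustrative helper, not the actual test code):

```python
import numpy

def datasets_match(d1, d2):
    """True if both mappings have the same keys, dtypes and values."""
    if set(d1) != set(d2):
        return False
    return all(
        d1[k].dtype == d2[k].dtype and numpy.array_equal(d1[k], d2[k])
        for k in d1
    )

# Hypothetical dataset names, for illustration only
a = {"fit_coeff": numpy.arange(5.0), "count_above_thresh": numpy.ones(5)}
b = {k: v.copy() for k, v in a.items()}
```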

Profiling

smooth_tophat (default) smoothing:

Summary:

The "old" profiling graph shows that the dominant cost is something that is not inside any function. I found that making the code skip the call to the smooth() function drastically improved performance, and we then found that the problem was in the arguments being passed to the function, not in the function itself.

The time output shows that the 'new' method takes approximately 1/35 of the 'old' time.
The old version also racks up a very large number of minor page faults and voluntary context switches, which likely reflects the repeated large temporary allocations.
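One way this effect shows up (a minimal sketch, not the actual PyCBC profiling setup): work done while building a call's arguments is charged to the caller rather than the callee in a cProfile trace, so a cheap function can look innocent while its call sites dominate:

```python
import cProfile
import io
import pstats
import numpy

rng = numpy.random.default_rng(1)
vals = rng.random(100_000)
order = numpy.argsort(vals)

def smooth_stub(sorted_vals, i):
    # trivially cheap stand-in for the real smoothing function
    return sorted_vals[i]

def caller(n):
    for i in range(n):
        # the costly fancy-index happens in the argument expression,
        # so its time is attributed to caller(), not smooth_stub()
        smooth_stub(vals[order], i)

profiler = cProfile.Profile()
profiler.enable()
caller(200)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats()
report = stream.getvalue()
```

In a report like this, nearly all of the cumulative time lands on caller() itself, mirroring the "cost outside any function" seen in the 'old' graph.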

Profiling graphs:
new and old

time -v output:
Old:

    User time (seconds): 1391.56
    System time (seconds): 612.39
    Percent of CPU this job got: 94%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 35:21.35
    Maximum resident set size (kbytes): 412880
    Major (requiring I/O) page faults: 803
    Minor (reclaiming a frame) page faults: 139735804
    Voluntary context switches: 94621
    Involuntary context switches: 276684

New:

    User time (seconds): 67.32
    System time (seconds): 2.73
    Percent of CPU this job got: 102%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:08.55
    Maximum resident set size (kbytes): 419592
    Major (requiring I/O) page faults: 12
    Minor (reclaiming a frame) page faults: 97534
    Voluntary context switches: 10431
    Involuntary context switches: 702104

distance_weighted smoothing

Summary:
In this case there is the extra cost of generating a normal PDF for every template, so the savings are not quite as significant, but they are still noteworthy.
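A hedged sketch of why this path is costlier (the Gaussian kernel and all names here are illustrative, not the exact PyCBC implementation): each smoothed value needs a normal PDF evaluated over the distance to every template before averaging:

```python
import numpy

rng = numpy.random.default_rng(4)
parvals = numpy.sort(rng.random(500))   # stand-in smoothing parameter
nabove = rng.random(500)                # stand-in values to smooth
sigma = 0.05                            # hypothetical smoothing width

def smooth_distance_weighted(target):
    # a normal PDF over parameter distance, evaluated per template
    d = parvals - target
    weights = numpy.exp(-0.5 * (d / sigma) ** 2)
    return numpy.average(nabove, weights=weights)

smoothed = numpy.array([smooth_distance_weighted(t) for t in parvals])
```

The per-template PDF evaluation is unavoidable here, which is why this path cannot reach the 35x speed-up of the default smoothing.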

Profiling graphs:
new
and old

time -v output:

old:

    User time (seconds): 1837.79
    System time (seconds): 580.82
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 40:17.43
    Maximum resident set size (kbytes): 434604
    Minor (reclaiming a frame) page faults: 236214226
    Voluntary context switches: 9713
    Involuntary context switches: 670573

new:

    User time (seconds): 1082.32
    System time (seconds): 3.87
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 18:04.86
    Maximum resident set size (kbytes): 414860
    Minor (reclaiming a frame) page faults: 95858
    Voluntary context switches: 10375
    Involuntary context switches: 487334

n_closest smoothing

This path is essentially unchanged, as the major saving in this PR does not affect this code path.

For completeness, profiles can be found at this link under fit_over_n_closest_{new,old}.{txt,png} for new/old, giving the time output and profiling graph respectively.

  • The author of this pull request confirms they will adhere to the code of conduct

    n_templates = len(nabove)
    rang = numpy.arange(0, n_templates)

    nabove_smoothed = numpy.zeros_like(parvals[0])
GarethCabournDavies (Contributor, author) commented:

This preallocation saves some time, writing to indices of the array rather than appending.
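A minimal sketch of the append-versus-preallocate pattern (illustrative names, not the actual script):

```python
import numpy

n_templates = 10_000
values = numpy.random.default_rng(3).random(n_templates)

# Old: grow a Python list, then convert at the end
appended = []
for i in range(n_templates):
    appended.append(values[i] * 2.0)
appended = numpy.array(appended)

# New: allocate once, write by index
prealloc = numpy.zeros(n_templates)
for i in range(n_templates):
    prealloc[i] = values[i] * 2.0
```

Both give identical results; the preallocated version avoids repeated list growth and the final conversion to an array.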

Contributor comment:

Also on this line, why is parvals[0] being used specifically if we have other variables for the number of templates?

@spxiwh (Contributor) commented Nov 28, 2024

I feel someone else should review this, given my involvement in creating this.

@GarethCabournDavies GarethCabournDavies removed the request for review from spxiwh November 28, 2024 13:50
    nabove_smoothed.append(smoothed_tuple[0])
    alpha_smoothed.append(smoothed_tuple[1])
    ntotal_smoothed.append(smoothed_tuple[2])
    smoothed_tuple = smooth(
Contributor comment:

So here one could do e.g. smooth_vals[i, :] = smooth(...) rather than having to get a tuple and assign 3 things separately.
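A sketch of that suggestion, with a hypothetical smooth() stand-in returning a 3-tuple:

```python
import numpy

def smooth(x):
    # hypothetical stand-in returning (nabove, alpha, ntotal) for one template
    return x.sum(), x.mean(), float(len(x))

data = numpy.arange(12.0).reshape(4, 3)
n_templates = len(data)

# Instead of three lists with .append(), fill rows of one (n, 3) array
smooth_vals = numpy.zeros((n_templates, 3))
for i in range(n_templates):
    smooth_vals[i, :] = smooth(data[i])
```

The row assignment unpacks the tuple in one step, so the three output quantities stay bundled together.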

@tdent (Contributor) left a comment:

Looks good, up to a couple of requests to deal with apparently redundant num_templates variables, and to shorten the code / reduce repetition by bundling the 3 outputs of smoothing together.

    invalphan_sort = invalphan[par_sort]
    ntotal_sort = ntotal[par_sort]
    # Preallocate memory for *param_vals[0]-sorted* smoothing results
    nabove_smoothed = numpy.zeros(num_templates)
Contributor comment:

I think I was envisaging that these separate arrays would go away in favour of the smoothed_vals. Do we need to keep them here?

    nabove_smoothed = numpy.array(nabove_smoothed)[unsort]
    alpha_smoothed = numpy.array(alpha_smoothed)[unsort]
    ntotal_smoothed = numpy.array(ntotal_smoothed)[unsort]
    smoothed_vals[:,0] = nabove_smoothed[unsort]
Contributor comment:

All these assignments in triplicate still seem redundant; can we go directly from smoothed_tuple to smoothed_vals?

GarethCabournDavies (author) replied:

I thought we did; the un-sorting was needed for this part of the calculation, but that can be done on the n×3 array in one go.
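A sketch of un-sorting the n×3 array in a single fancy-index (illustrative names; par_sort/unsort mirror the sorting indices discussed above):

```python
import numpy

rng = numpy.random.default_rng(2)
parvals0 = rng.random(6)
par_sort = numpy.argsort(parvals0)
unsort = numpy.argsort(par_sort)   # inverse permutation of par_sort

# Smoothing results for all three quantities, in sorted-template order
smooth_vals = rng.random((6, 3))

# One row-index undoes the sort for all three columns at once
smooth_vals_unsorted = smooth_vals[unsort, :]
```

Re-applying par_sort to the rows recovers the sorted array, confirming that unsort is the inverse permutation.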

@tdent (Contributor) left a comment:

See comments, I think we're nearly there

@tdent tdent enabled auto-merge (squash) December 12, 2024 12:50
@tdent tdent merged commit 3f53cfa into gwastro:master Dec 12, 2024
29 checks passed
@GarethCabournDavies GarethCabournDavies deleted the fit_over_param_gpu branch December 13, 2024 09:48
@GarethCabournDavies GarethCabournDavies added the v23_release_branch PRs applied to the v2.3.X release branch or to be cherry-picked if merging to master label Dec 13, 2024
@GarethCabournDavies (Contributor, author) commented:

Note that I can't be entirely certain that this will translate nicely onto the v23 release branch, but I'm fairly sure it will

Labels: offline search, v23_release_branch