
<algorithm>: nth_element does not comply with worst case running time requirements #856

Open
BillyONeal opened this issue May 22, 2020 · 13 comments · May be fixed by #5100
Labels
bug (Something isn't working), performance (Must go faster)

Comments

@BillyONeal
Member

BillyONeal commented May 22, 2020

Describe the bug
The standard requires nth_element (both std and std::ranges overloads) to perform only O(n lg n) swaps. However, we currently implement only quickselect:

while (_ISORT_MAX < _ULast - _UFirst) { // divide and conquer, ordering partition containing Nth

This has worst-case running time O(n^2) if we always choose bad pivots, just like quicksort. We need some form of 'ideal' checking similar to what std::sort does, falling back to median-of-medians or a similar implementation.
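(For intuition: if every pivot lands at one end of the range, each partition peels off only O(1) elements, so the total partitioning work is n + (n-1) + ... + 1 = n(n+1)/2 = O(n^2). The well-known "median-of-3 killer" input families force exactly this behavior against a median-of-3 pivot choice.)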

Additional context

This was previously recorded in a Microsoft-internal bug database but I closed the issue there as I had filed it against myself.

@BillyONeal BillyONeal added bug Something isn't working performance Must go faster labels May 22, 2020
@CaseyCarter CaseyCarter changed the title <algorithm>: std::nth_element does not comply with worst case running time requirements <algorithm>: nth_element does not comply with worst case running time requirements Jul 21, 2020
@Andor233
Contributor

Complexity: For the overloads with no ExecutionPolicy, linear on average. For the overloads with an ExecutionPolicy, O(N) applications of the predicate, and O(N log N) swaps, where N = last - first.

It doesn't mention the worst-case time for the overloads without an ExecutionPolicy. I also checked LLVM's code, and its worst case is O(n^2) too.
As for the ExecutionPolicy overloads, MSVC's STL doesn't implement parallelized versions at present.
So maybe a better title would be "Implement a parallelized nth_element".

@Andor233
Contributor

And do you think this is an LWG issue?

@CaseyCarter
Member

Our implementations of parallel algorithms delegate to the serial implementations when the ExecutionPolicy is execution::seq, or when they fail to allocate memory needed to parallelize. Consequently, our serial implementations need to meet the complexity requirements the Standard puts on the parallel algorithms.

@BillyONeal
Member Author

And do you think this is an LWG issue?

I consistently argued that a big-O worst case that is worse for the parallel algorithms than for the serial ones was nonsense (how is a user supposed to aim for the right size of N where the parallel one actually wins?), but it seemed that not enough people cared about nth_element at all.

@Andor233
Contributor

Our implementations of parallel algorithms delegate to the serial implementations when the ExecutionPolicy is execution::seq, or when they fail to allocate memory needed to parallelize. Consequently, our serial implementations need to meet the complexity requirements the Standard puts on the parallel algorithms.

But if you require an O(N log N) worst-case algorithm, the average complexity might get worse, because the original algorithm is O(N) on average and degrades to O(N^2) only in rare cases.

So I think we need a clearer definition in the standard here.

@BillyONeal
Member Author

So I think we need a clearer definition in the standard here.

I think the existing wording in the standard is very clear. The serial version must be O(n), while the parallel version must be O(n lg n). O(n^2) is never allowed. My argument with SG1 was that it should always be O(n), which can be achieved by limiting the number of subdivisions to the available parallelism.

They said O(n lg n) because the divide and conquer version of the algorithm is indeed O(n lg n) but there's no reason the same trick that makes introselect O(n) can't be applied to the decision to do more parallelism, as in the following pseudocode:

max_depth = max(42, number_of_threads)
# this fallback is the one missing from our STL (and from the complexity reasoning in the standard)
if (current_depth > max_depth) { return median_of_medians(); }
pivot = choose_pivot()
if (parallel) { parallel_partition(pivot); } else { partition(pivot); }
if (pivot < n) { return nth_element(pivot, end, n - (pivot - begin), current_depth + 1); }
if (n < pivot) { return nth_element(begin, pivot, n, current_depth + 1); }
# pivot == n
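For concreteness, here is a minimal serial C++ sketch of that depth-limited loop. It is illustrative only, not the STL's actual code: the names and the median-of-3 pivot choice are mine, and the exhausted-budget fallback uses heap-based std::partial_sort, which guarantees O(n lg n); swapping in a median-of-medians selection there would give the O(n) bound argued for above.

#include <algorithm>
#include <iterator>

// Illustrative introselect sketch; not the STL's actual implementation.
// Assumes first <= nth < last and random-access iterators.
template <class It>
void intro_select(It first, It nth, It last, int depth_budget) {
    using T = typename std::iterator_traits<It>::value_type;
    while (last - first > 3) {
        if (depth_budget-- == 0) {
            // The fallback missing today: heap-based selection gives a
            // guaranteed O(n lg n) worst case; a median-of-medians
            // selection here would make the whole algorithm O(n).
            std::partial_sort(first, nth + 1, last);
            return;
        }
        It mid = first + (last - first) / 2;
        // Median-of-3 pivot; an adversary can still defeat this choice,
        // which is why the depth budget above is load-bearing.
        T pivot = std::max(std::min(*first, *mid),
                           std::min(std::max(*first, *mid), *(last - 1)));
        It cut1 = std::partition(first, last,
                                 [&](const T& x) { return x < pivot; });
        It cut2 = std::partition(cut1, last,
                                 [&](const T& x) { return !(pivot < x); });
        // Now [first, cut1) < pivot, [cut1, cut2) == pivot,
        // [cut2, last) > pivot; [cut1, cut2) is nonempty because the
        // pivot value came from the range, so every iteration progresses.
        if (nth < cut1) {
            last = cut1;    // the target is left of the pivot block
        } else if (nth < cut2) {
            return;         // nth already holds a pivot-valued element
        } else {
            first = cut2;   // the target is right of the pivot block
        }
    }
    std::sort(first, last); // at most 3 elements remain
}

template <class It>
void intro_select(It first, It nth, It last) {
    int depth = 0;
    for (auto n = last - first; n > 0; n >>= 1) {
        ++depth; // depth ~= lg(n)
    }
    intro_select(first, nth, last, 2 * depth);
}

Calling intro_select(v.begin(), v.begin() + k, v.end()) leaves the k-th smallest element at index k, matching nth_element's postcondition, while the depth budget bounds the worst case.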

@Andor233
Contributor

So I think we need a clearer definition in the standard here.

I think the existing wording in the standard is very clear. The serial version must be O(n), while the parallel version must be O(n lg n). O(n^2) is never allowed. […]

I don't get your point, because there is no algorithm that can find the nth element in O(n).

@BillyONeal
Member Author

I don't get your point, because there is no algorithm that can find the nth element in O(n).

Median-of-medians is O(n). https://en.wikipedia.org/wiki/Median_of_medians#Proof_of_linear_running_time
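(Sketch of the standard argument from that page: choosing the pivot as the median of the n/5 group-of-5 medians guarantees roughly 3n/10 elements on each side of it, so the recursion satisfies T(n) <= T(n/5) + T(7n/10) + O(n); since 1/5 + 7/10 = 9/10 < 1, the per-level work shrinks geometrically and T(n) = O(n).)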

@BillyONeal
Member Author

BillyONeal commented Oct 8, 2024

I wrote a long discussion on the isocpp mailing lists about this that I think might be useful to reference, so I'm reposting the substance of it here:

The intended implementation of std::nth_element is introselect. Selection has a ‘fast in practice but slow in the worst case’ algorithm pair just like sorting does. In the same way that std::sort must implement something like introsort, falling back to heap sort or similar if the quicksort partitions become awful, nth_element is supposed to choose between quickselect and median-of-medians. Quickselect is like quicksort except that, after putting the pivot into position, rather than recursively sorting both sides you only recurse into the side containing the element you want.

In Master theorem terms, if the pivot is good, we on average divide the work into 2 subproblems (b is 2), we solve only one of them (a is 1), and we did a partition op f(n) = O(n). This makes c_crit 0, putting us in Case 3, for overall O(n).
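(Spelled out, the recurrence is T(n) = T(n/2) + O(n); with a = 1 and b = 2, c_crit = log_2 1 = 0, and f(n) = Θ(n^1) grows polynomially faster than n^0, so Case 3 gives T(n) = Θ(n).)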

Just like quicksort, it can degrade to an O(N^2) worst case if the chosen pivots are awful. Median-of-medians takes linear time in the worst case, but has poor real-world constant factors.

Based on the complexity listed for the parallel version, I believe SG1 intends the same quickselect algorithm to be used, but quickselect’s partition operation replaced with parallelized partition.

I apologize for the confusion, further discussion forces us to use ‘partition’ both ways:

  • as in “put all the true elements at the beginning of the range”
  • as in “a division of the work to do that can be handed out to other parallel processors”

I’m going to call the first thing a ‘partition’ and the second one a ‘part’.

The straightforward way to parallelize partition is to do what stable_partition does. At each step, divide the input into 2 parts, and recursively partition each part, giving a sequence like TTTTTTFFFFFFTTTTTTFFFF, then rotate* the middle FFFFFTTTTTT section to put all the Ts at the front and all the Fs at the end. In Master theorem terms, f(n) is the rotate which is O(n), a and b are both 2. That gives c_crit of 1, putting us in Case 2, and an overall result of O(n lg n) , matching the standard.
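(As a recurrence: T(n) = 2T(n/2) + Θ(n); a = b = 2 gives c_crit = log_2 2 = 1, and f(n) = Θ(n^1) matches n^c_crit exactly, so Case 2 yields T(n) = Θ(n lg n).)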

Putting this back into Master theorem terms for Quickselect, f(n) is now O(n lg n), b is still 2, and a is still 1, c_crit is still 0, still putting us in Case 3, for an overall result of O(n lg n), the current requirement for parallelized std::nth_element.
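(That is, T(n) = T(n/2) + O(n lg n): the top-level parallel partition already costs O(n lg n), the recursive calls shrink geometrically, so the top level dominates and the whole selection is O(n lg n).)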

* partition can use reverse rather than the rotate that stable_partition needs, but I don’t think that detail matters for explaining how it works

P.S.: The reason I believe the complexity requirement for parallelized partition is wrong is that you can stop dividing into subproblems and recursing once you have divided the work into enough parts to saturate the available parallelism. We don’t need N subproblems, we only need T subproblems. We can recurse to depth lg(T) to generate T subproblems, and run the classic linear partition operation on those subproblems. That gives a tree of depth lg(T), and at each layer of the tree we do O(n) work to combine the results of the subproblems, giving overall time O(N lg T). But T, for purposes of N, is a constant factor, so the requirement in the standard should be O(n), not O(n lg n).
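(A worked example of that claim: with T = 8 threads, recursion depth lg 8 = 3 suffices to create 8 parts, so combining costs about 3N element moves no matter how large N is; recursing all the way down instead costs about N lg N, which is already roughly 20N at N = 10^6. Since T is fixed by the hardware rather than by the input, O(N lg T) is O(N).)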

@Andor233
Contributor

Andor233 commented Oct 9, 2024

The intended implementation of std::nth_element is introselect. […] But T, for purposes of N, is a constant factor, so the requirement in the standard should be O(n), not O(n lg n).

Could you add me to the mailing list? [email protected]

@frederick-vs-ja
Contributor

And do you think this is an LWG issue?

I've filed LWG-4162 for this. But per @BillyONeal's thoughts, I now think the currently proposed resolution is less than ideal; perhaps O(N) complexity should be required instead.

So I think we need a clearer definition in the standard here.

It's permitted to implement a faster algorithm even though the LWG issue hasn't been resolved. IIUC there's no real or theoretical obstruction.

@BillyONeal
Member Author

Could you add me to the mailing list? [email protected]

Unfortunately that stuff is controlled by ISO rules; I am not the list owner.

@BillyONeal
Member Author

BillyONeal commented Oct 9, 2024

I've filed LWG-4162 for this. But per @BillyONeal's thoughts, I now think the currently proposed resolution is less than ideal; perhaps O(N) complexity should be required instead.

I agree, but I already had that argument with LEWG and SG1 and don't think an LWG issue is the right place to litigate such a thing. It needs a paper if someone is willing to write one.
