
<algorithm>: nth_element does not comply with worst case running time requirements #856

Open
BillyONeal opened this issue May 22, 2020 · 13 comments · May be fixed by #5100
Labels
bug (Something isn't working), performance (Must go faster)

Comments

@BillyONeal
Member

BillyONeal commented May 22, 2020

Describe the bug
The standard requires nth_element (both std and std::ranges overloads) to perform only O(n lg n) swaps. However, we currently implement only quickselect:

while (_ISORT_MAX < _ULast - _UFirst) { // divide and conquer, ordering partition containing Nth

This has worst-case running time O(n^2) if we always choose bad pivots, just like quicksort. We need some form of 'ideal' checking similar to what std::sort does, falling back to median-of-medians or a similar implementation.
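(For intuition: if every pivot lands at one end of the range, each partition peels off only O(1) elements, so the total partitioning work is n + (n-1) + ... + 1 = n(n+1)/2 = O(n^2). The well-known "median-of-3 killer" input families force exactly this behavior against a median-of-3 pivot choice.)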

Additional context

This was previously recorded in a Microsoft-internal bug database but I closed the issue there as I had filed it against myself.

@BillyONeal BillyONeal added bug Something isn't working performance Must go faster labels May 22, 2020
@CaseyCarter CaseyCarter changed the title <algorithm>: std::nth_element does not comply with worst case running time requirements <algorithm>: nth_element does not comply with worst case running time requirements Jul 21, 2020
@Andor233
Contributor

Complexity: For the overloads with no ExecutionPolicy, linear on average. For the overloads with an ExecutionPolicy, O(N) applications of the predicate, and O(N log N) swaps, where N = last - first.

It doesn't mention the worst-case time for the overloads without an ExecutionPolicy. I also checked LLVM's code, and its worst case is O(n^2) too.
As for the ExecutionPolicy overloads, MSVC's STL doesn't implement parallelized versions at present.
So maybe a better title would be "Implement a parallelized nth_element".

@Andor233
Contributor

And do you think this is an LWG issue?

@CaseyCarter
Member

Our implementations of parallel algorithms delegate to the serial implementations when the ExecutionPolicy is execution::seq, or when they fail to allocate memory needed to parallelize. Consequently, our serial implementations need to meet the complexity requirements the Standard puts on the parallel algorithms.

@BillyONeal
Member Author

And do you think this is an LWG issue?

I consistently argued that a big-O worst case that is worse for the parallel algorithms than for the serial ones was nonsense (how is a user supposed to aim for the right size of N where the parallel one actually wins?), but it seemed that not enough people cared about nth_element at all.

@Andor233
Contributor

Our implementations of parallel algorithms delegate to the serial implementations when the ExecutionPolicy is execution::seq, or when they fail to allocate memory needed to parallelize. Consequently, our serial implementations need to meet the complexity requirements the Standard puts on the parallel algorithms.

But if you require an O(N log N) worst-case algorithm, the average complexity might get worse, because the original algorithm is O(N) on average and degrades to O(N^2) only in rare cases.

So I think we need a clearer definition in the standard here.

@BillyONeal
Member Author

So I think we need a clearer definition in the standard here.

I think the existing wording in the standard is very clear. The serial version must be O(n), while the parallel version must be O(n lg n). O(n^2) is never allowed. My argument with SG1 was that it should always be O(n), which can be achieved by limiting the number of subdivisions to the available parallelism.

They said O(n lg n) because the divide and conquer version of the algorithm is indeed O(n lg n) but there's no reason the same trick that makes introselect O(n) can't be applied to the decision to do more parallelism, as in the following pseudocode:

max_depth = max(42, number_of_threads)
# this fallback is the one missing from our STL (and from the complexity reasoning in the standard)
if (current_depth > max_depth) { return median_of_medians(); }
pivot = choose_pivot()
if (parallel) { parallel_partition(pivot); } else { partition(pivot); }
if (pivot < n) { return nth_element(pivot, end, n - (pivot - begin), current_depth + 1); }
if (n < pivot) { return nth_element(begin, pivot, n, current_depth + 1); }
# pivot == n
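For concreteness, here is a minimal serial C++ sketch of that depth-limited loop. It is illustrative only, not the STL's actual code: the names and the median-of-3 pivot choice are mine, and the exhausted-budget fallback uses heap-based std::partial_sort, which guarantees O(n lg n); swapping in a median-of-medians selection there would give the O(n) bound argued for above.

#include <algorithm>
#include <iterator>

// Illustrative introselect sketch; not the STL's actual implementation.
// Assumes first <= nth < last and random-access iterators.
template <class It>
void intro_select(It first, It nth, It last, int depth_budget) {
    using T = typename std::iterator_traits<It>::value_type;
    while (last - first > 3) {
        if (depth_budget-- == 0) {
            // The fallback missing today: heap-based selection gives a
            // guaranteed O(n lg n) worst case; a median-of-medians
            // selection here would make the whole algorithm O(n).
            std::partial_sort(first, nth + 1, last);
            return;
        }
        It mid = first + (last - first) / 2;
        // Median-of-3 pivot; an adversary can still defeat this choice,
        // which is why the depth budget above is load-bearing.
        T pivot = std::max(std::min(*first, *mid),
                           std::min(std::max(*first, *mid), *(last - 1)));
        It cut1 = std::partition(first, last,
                                 [&](const T& x) { return x < pivot; });
        It cut2 = std::partition(cut1, last,
                                 [&](const T& x) { return !(pivot < x); });
        // Now [first, cut1) < pivot, [cut1, cut2) == pivot,
        // [cut2, last) > pivot; [cut1, cut2) is nonempty because the
        // pivot value came from the range, so every iteration progresses.
        if (nth < cut1) {
            last = cut1;    // the target is left of the pivot block
        } else if (nth < cut2) {
            return;         // nth already holds a pivot-valued element
        } else {
            first = cut2;   // the target is right of the pivot block
        }
    }
    std::sort(first, last); // at most 3 elements remain
}

template <class It>
void intro_select(It first, It nth, It last) {
    int depth = 0;
    for (auto n = last - first; n > 0; n >>= 1) {
        ++depth; // depth ~= lg(n)
    }
    intro_select(first, nth, last, 2 * depth);
}

Calling intro_select(v.begin(), v.begin() + k, v.end()) leaves the k-th smallest element at index k, matching nth_element's postcondition, while the depth budget bounds the worst case.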

@Andor233
Contributor

So I think we need a clearer definition in the standard here.

I think the existing wording in the standard is very clear. The serial version must be O(n), while the parallel version must be O(n lg n). O(n^2) is never allowed. […]

I don't get your point, because there is no algorithm that can find the nth element in O(n).

@BillyONeal
Member Author

I don't get your point, because there is no algorithm that can find the nth element in O(n).

Median-of-medians is O(n). https://en.wikipedia.org/wiki/Median_of_medians#Proof_of_linear_running_time
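(Sketch of the standard argument from that page: choosing the pivot as the median of the n/5 group-of-5 medians guarantees roughly 3n/10 elements on each side of it, so the recursion satisfies T(n) <= T(n/5) + T(7n/10) + O(n); since 1/5 + 7/10 = 9/10 < 1, the per-level work shrinks geometrically and T(n) = O(n).)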

@BillyONeal
Member Author

BillyONeal commented Oct 8, 2024

I wrote a long discussion on the isocpp mailing lists about this that I think might be useful to reference, so I'm reposting the substance of it here:

The intended implementation of std::nth_element is introselect. Selection has a ‘fast in practice but slow in the worst case’ algorithm pair just like sorting does. In the same way that std::sort must implement something like introsort, falling back to heap sort or similar if the quicksort partitions become awful, nth_element is supposed to choose between quickselect and median-of-medians. Quickselect is like quicksort except that, after putting the pivot into position, rather than recursively sorting both sides you only recurse into the side containing the element you want.

In Master theorem terms, if the pivot is good, we on average divide the work into 2 subproblems (b is 2), we solve only one of them (a is 1), and we did a partition op f(n) = O(n). This makes c_crit 0, putting us in Case 3, for overall O(n).
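(Spelled out, the recurrence is T(n) = T(n/2) + O(n); with a = 1 and b = 2, c_crit = log_2 1 = 0, and f(n) = Θ(n^1) grows polynomially faster than n^0, so Case 3 gives T(n) = Θ(n).)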

Just like quicksort, it can degrade to an O(N^2) worst case if the chosen pivots are awful. Median-of-medians takes linear time in the worst case, but has poor real-world constant factors.

Based on the complexity listed for the parallel version, I believe SG1 intends the same quickselect algorithm to be used, but quickselect’s partition operation replaced with parallelized partition.

I apologize for the confusion, further discussion forces us to use ‘partition’ both ways:

  • as in “put all the true elements at the beginning of the range”
  • as in “a division of the work to do that can be handed out to other parallel processors”

I’m going to call the first thing a ‘partition’ and the second one a ‘part’.

The straightforward way to parallelize partition is to do what stable_partition does. At each step, divide the input into 2 parts, and recursively partition each part, giving a sequence like TTTTTTFFFFFFTTTTTTFFFF, then rotate* the middle FFFFFTTTTTT section to put all the Ts at the front and all the Fs at the end. In Master theorem terms, f(n) is the rotate which is O(n), a and b are both 2. That gives c_crit of 1, putting us in Case 2, and an overall result of O(n lg n) , matching the standard.
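(As a recurrence: T(n) = 2T(n/2) + Θ(n); a = b = 2 gives c_crit = log_2 2 = 1, and f(n) = Θ(n^1) matches n^c_crit exactly, so Case 2 yields T(n) = Θ(n lg n).)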

Putting this back into Master theorem terms for Quickselect, f(n) is now O(n lg n), b is still 2, and a is still 1, c_crit is still 0, still putting us in Case 3, for an overall result of O(n lg n), the current requirement for parallelized std::nth_element.
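(That is, T(n) = T(n/2) + O(n lg n): the top-level parallel partition already costs O(n lg n), the recursive calls shrink geometrically, so the top level dominates and the whole selection is O(n lg n).)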

* partition can use reverse rather than the rotate that stable_partition needs, but I don’t think that detail matters for explaining how it works

P.S.: The reason I believe the complexity requirement for parallelized partition is wrong is that you can stop dividing into subproblems and recursing once you have divided the work into enough parts to saturate the available parallelism. We don’t need N subproblems, we only need T subproblems. We can recurse to depth lg(T) to generate T subproblems, and run the classic linear partition operation on those subproblems. That gives a tree of depth lg(T), and at each layer of the tree we do O(n) work to combine the results of the subproblems, giving overall time O(N lg T). But T, for purposes of N, is a constant factor, so the requirement in the standard should be O(n), not O(n lg n).
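(A worked example of that claim: with T = 8 threads, recursion depth lg 8 = 3 suffices to create 8 parts, so combining costs about 3N element moves no matter how large N is; recursing all the way down instead costs about N lg N, which is already roughly 20N at N = 10^6. Since T is fixed by the hardware rather than by the input, O(N lg T) is O(N).)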

@Andor233
Contributor

Andor233 commented Oct 9, 2024

The intended implementation of std::nth_element is introselect. […] But T, for purposes of N, is a constant factor, so the requirement in the standard should be O(n), not O(n lg n).

Could you add me to the mailing list? [email protected]

@frederick-vs-ja
Contributor

And do you think this is an LWG issue?

I've filed LWG-4162 for this. But per @BillyONeal's thoughts, I now think the currently proposed resolution is less than ideal; perhaps O(N) complexity should be required instead.

So I think we need a clearer definition in the standard here.

It's permitted to implement a faster algorithm even though the LWG issue hasn't been resolved. IIUC there's no real or theoretical obstruction.

@BillyONeal
Member Author

Could you add me to the mailing list? [email protected]

Unfortunately that stuff is controlled by ISO rules; I am not the list owner.

@BillyONeal
Member Author

BillyONeal commented Oct 9, 2024

I've filed LWG-4162 for this. But per @BillyONeal's thoughts, I now think the currently proposed resolution is less than ideal; perhaps O(N) complexity should be required instead.

I agree, but I already had that argument with LEWG and SG1 and don't think an LWG issue is the right place to litigate such a thing. It needs a paper if someone is willing to write one.
