-
Thanks for these questions. I'll do some experiments with the test problems you mention and formulate replies for you to read before we chat, as the replies may prompt a lot more questions!
-
If you're wanting to solve the relaxation of a MIP, use the option …. You're also using ….
-
One more pillar of benchmarking is to prevent self-timing. A system is properly measured from outside the system, so the SPEC CPU harness is responsible for timing the run. We recognize that the total binary execution time may include setup time that you don't usually measure and report; hopefully that setup time is minimal. Beyond ignoring the output of self-timing, we should actually disable the calls to the timer, since they take up valuable cycles. I looked into adding a runtime option to HiGHS to disable timing, but feeding that option all the way down to the modules that need it started getting cumbersome. Instead, something like this gets us what we need with minimal fuss.
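A minimal sketch of the idea, assuming HiGHS's wall-clock helper lives in `util/HighsTimer.h`; the guard macro here is ours for illustration, not an existing HiGHS or SPEC flag:

```cpp
// Sketch only: stub out the wall-clock read at compile time so the hot
// path never pays for a clock call. The SPEC harness times the run from
// outside, so the value returned here is never reported.
#include <chrono>

double getWallTime() {
#ifdef SPEC_NO_TIMING
  return 0.0;  // no syscall, no serializing clock read
#else
  using namespace std::chrono;
  return duration_cast<duration<double>>(
             steady_clock::now().time_since_epoch())
      .count();
#endif
}
```

With no time limit set, a constant return value is harmless, since downstream time-limit checks simply never trigger.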
Incidentally, when we apply this simple patch, we see a 4% performance improvement, proving that timing takes time. So there may be motivation to implement a command-line option for your users.
-
Yes (for reasons, see below), to the latest version, 1.7.2, unless we need to make some modifications to suit your purposes.
-
As discussed, we will identify a set of LP problems that each run in around 4 minutes on a modern system, use under 1.8 GB of memory, and yield the same simplex iteration count on all architectures we can test.
-
As discussed, the speed-up of the parallel simplex solver for LP is largely limited by the number of memory channels, because the typical simplex computation is: take an (integer) index and a (double) value from memory, multiply the component of a vector given by the index by the value, then either overwrite the vector entry with the result or add the result to an accumulating sum. Hopefully the whole of the vector is in cache; otherwise it may have to be read from memory, too! Hence speed-up tails off at 4-8 threads. Thread parallelism is the same in all versions of HiGHS.
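For the computer architects in the room, the kernel being described looks roughly like this (an illustrative loop, not HiGHS source):

```cpp
// Illustrative sparse update (not HiGHS source). Each iteration does an
// indexed load, a value load, a multiply-add, and an indexed store, so
// throughput is bounded by memory bandwidth, not arithmetic units.
void sparse_update(int nnz, const int* index, const double* value,
                   double multiplier, double* vec) {
  for (int k = 0; k < nnz; k++)
    vec[index[k]] += multiplier * value[k];  // gather, multiply-add, scatter
}
```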
-
LP problems typically have multiple solutions, and different settings of the random seed may lead to different ones being found. Since the algorithm is deterministic, there's no issue relating to your "equal work guarantee is only possible when there are zero solutions, since that means the entire search space was visited" situation. For LP, the entire search space is too large for all of it to be visited.
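A toy example of multiple optima (ours, for illustration): minimize $x_1 + x_2$ subject to $x_1 + x_2 \ge 1$ with $x_1, x_2 \ge 0$. Every point on the segment between $(1, 0)$ and $(0, 1)$ is optimal, so which vertex gets reported depends on the path the simplex method takes, even though any single run is perfectly deterministic.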
-
As discussed, HiGHS has its own random number generator (in `util/HighsRandom.h`) that has not changed since v1.5.3. As I understand it, it is (deliberately) compiler/architecture independent. Different settings of the `random_seed` option change the sequence it produces.
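From memory, the declaration in `lp_data/HighsOptions.h` looks something like the following; treat the identifiers as a sketch rather than exact source:

```cpp
// Sketch of an integer option record (identifiers from memory):
// name, description, advanced flag, pointer to the option's storage,
// then lower bound, default value, upper bound.
record_int = new OptionRecordInt(
    "random_seed", "Random seed used in HiGHS", advanced,
    &random_seed, 0, 0, kHighsIInf);
```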
where the last three parameters are the lower bound, default value and upper bound of an option.
-
For question 9, our testers are requesting a way to debug without having to rebuild the binary. Are there knobs we can enable to dump verbose logs?
-
Hello friends,
I’m a CPU architect at Ampere Computing, where I do performance analysis and workload characterization. I also serve on the SPEC CPU committee, working on benchmarks for the next version of SPEC CPU, CPUv8. We try to find computationally intensive workloads in diverse fields, to help measure performance across a wide variety of behaviors and application domains. Based on the longevity of HiGHS, its large active community in linear optimization fields, and its use in education, we have proposed that the HiGHS solver be included in the next set of marquee benchmarks in SPEC CPU.
As part of the effort, we have ported and integrated the HiGHS mainline code into the SPEC CPU harness so that it can be tested on a wide variety of systems in a controlled environment to produce reproducible results. It is a cross-company effort, and this work has been done mostly by my counterpart at Intel. We have both single-threaded and lightly-multi-threaded HiGHS command lines which run and produce the same output across many compilers (llvm, gcc, icc, aocc, nvhpc, cray), ISAs (aarch64, x86, power) and operating systems (linux, windows, android, macosx).
Each benchmark candidate undergoes intense scrutiny, so we are seeking some help and guidance on issues that have come up. I hope we can use this GitHub discussion to share and ask questions. Please be aware that we are computer architects and low-level systems engineers, not mathematicians! Here is our current set of questions.
- We commented out the lines in `lp_data/Highs.cpp` that print out the `simplex_iteration_count`, because the counts were varying between runs and that led to miscompares in the output files between systems/compiler options. Can you share what the simplex iteration count represents and how it can vary between runs? Is this how we can measure equal work, and verify it through some tolerance? E.g. could we say the run completed 20000 simplex iterations plus or minus 100, and that signifies equal work between systems? We do have some benchmarks that take a non-linear path to their result, and we check for almost-equal work by printing out how many steps they took, or some other intermediate metric that lets us view the work accomplished. (Just like a math professor asks a student to show their work!)
- What is the effect of `--random-seed`? How much is randomness used in the solver algorithms? For SPEC CPU we replace `std::rand` with our own “deterministic rand” to ensure the same sequence of random numbers is produced regardless of the system under test; however, we are still seeing some big variance (10% or more). If you can clarify the intention of the randomness, that can help us mitigate this.
- A miscompare shows up with `supportcast10.mps`, and is reproducible on one system. What is the best way to debug this? The expectation is that all systems would produce the same output, but when something like this happens we need to provide a way for the benchmark user to debug what happened. (Did their new compiler produce incorrect assembly, or apply unsafe optimizations?) Changing the `random_seed` fixed the issue on that system, but the error moved and showed up on another system, so we need to fix it the correct way.

We appreciate your input and willingness to help us! Please share your thoughts at your convenience.
Thank you!