NOTES.txt

Perf/BM library

OBJ: microbenchmark library for C++ small binaries with significant data structures.
OBJ: targeted for benchmarking operations on thread safe data structures
OBJ: exploit the highest resolution clock on Intel and AMD chipsets

REQ: Intel and AMD x86
REQ: C++, v >= 11
REQ: header only. All a user needs to do is #include "bm.h"
REQ: provide results that have locally minimized variance

NONREQ: reduce variance caused by hardware/kernel interrupts.
 NOTE: cannot prevent kernel interrupts or hardware interrupts so please run a few times
 NOTE: cannot prevent OOO execution
NONREQ: provide results that have globally minimized variance
 IDEA: perhaps we use some kind of voting mechanism: get 3 or 5 local minimas of variance and choose the majority?
NONREQ: exception safety
NONREQ: Windows

ALT:  google/benchmark [https://github.com/google/benchmark]
DIFF: overkill for small binaries
ALT:  sheredom/ubench [https://github.com/sheredom/ubench.h]
DIFF: uses linux clocks which loses resolution on ns if the underlying kernel isn't using rdtsc

NOTE: check kernel's clock source by
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource

Example:
static void BM_memcpy(BM::Controller &c) {
  char *src = new char[c.Arg(0)];
  char *dst = new char[c.Arg(0)];
  memset(src, 'x', c.Arg(0));
  for (auto _ : c) {
    memcpy(dst, src, c.Arg(0));
  }
  delete[] src;
  delete[] dst;
}
BM_Register(BM_memcpy)->Arg(8)->Arg(64)->Arg(512);
BM_Main();

TESTING
Tests are a series of binaries in tests/
Create a series of interesting binaries to benchmark.
Use godbolt to verify that the correct machine instructions are being generated
 especially for preventing harmful micro-optimisations.

DESIGN NOTES
central contiguous array of Benchmark structs
add to it with BM_Register
run benchmarks and store with BM_Main
(will use flags to configure how it is run and where results are written to)

FORMATTING:
Use Google C++ style.

--------------------------------------------------------------------------------

TODO: Run() to execute and print results
controller is iterable and injects/extract information from the key benchmarkable region
each iteration represents a new Arg or Thread
we iterate once we have a result
we have a result when we have gathered enough samples
each sample we rdtsc to get how long that section took
if negative, keep counter and warn user at the end if greater than zero.
we stop collecting samples when:
- running stddev or variance stops changing for some number of iterations?
  the tsc sample equals our running average
- minimum_time < cpu_time
- 1|min_iter_count < iters < 1e9
- 5*minimum_time < real_time
we produce a results vector
results store:
- name + state (like any arg or thread)
- cpu_time total (accumulation of rdtsc calls)
- iterations (number of times we gathered samples)
- cpu_time per iter estimate
- wall_time
- negative sample count (only if we encountered it). Warn the user if > 0.

TODO: warmup flag:
if set high, it will warmup the code path before measuring --benchmark_warmup=true

TODO: random interleaving can reduce variation --benchmark_enable_random_interleaving=true

TODO: specify non-zero repetition count --benchmark_repetitions=9

TODO: decrease the per-repetition time --benchmark_min_time=0.1

TODO: Add Args(A) and ArgRange(A-B)? To test how parameter size impacts benchmarks
c.Arg(0) and c.Arg(1) to access integers.
MAYBE: need to pause timing within the critical section.

TODO: Add Threads(N)? To test how multithread impacts benchmarks

TODO: BM::DoNotOptimize(void *p);
Escape and Clobber work together to trick the compiler into thinking we might globally use some data, thus it avoids optimizing it out
Effect of clobber() is limited to memory that is potentially accessible from an imagineary global root pointer.
From a compilers point of view, some data is completely hidden from the initial scope. Clobber pretends to touch all the data the global scope can see.
To ensure all data can be seen, we use scape.
// Marks p as arbitrarily aliased (to prevent dead code elim or scalar replacement of aggregate passes)
static void escape(void *p) {
	asm volatile("" : : "g"(p) : "memory");
}

// Aliases all the memory in the scope
static void clobber() {
	asm volatile("" : : : "memory");
}

void benchmark() {
	vector<int> v;
	v.reserve(1);
	v.push_back(10);
}

Benchmark would normally get mostly optimized out.
v.reserve(1) is not visible to clobber until we escape

void benchmark() {
	vector<int> v;
	v.reserve(1);
	escape(v.data());
	v.push_back(10);
	clobber();
}

TODO: prevent reordering of instructions with fencing
You can probably just use memory_order_acquire_release on some random atomic int?

e.g. Say we have uint8_t array;
prevent clflush from being reordered by the CPU or the compiler in this direction w/
_mm_mfence();
flush the line containing the element w/
_mm_clflush(&array[0]);

mfence and lfence must be in this order + compiler barrier for rdtsc
_mm_mfence();
_mm_lfence();
time1 = __rdtsc();

serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load
_mm_lfence();
int temp = array[0];  // this is a cache miss

measuring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss
no need for mfence because there are no stores in between
mfence and lfence must be in this order + compiler barrier for rdtsc and the load
_mm_lfence();
time2 = __rdtsc();

serialize __rdtsc with respect to trailing instructions
_mm_lfence();
msl = time2 - time1;

TODO: suggesting pinning benchmark to a CPU to reduce variance.
MAYBE? Make this a parameter to BM::State instead of a flag?

TODO: check Linux scaling governor.

TODO: check randomize virtual addressing space

TODO: CSV and JSON output

IDEA: fallback to chrono if rdtsc isn't available? Warn user about this.

-----

FOR ANOTHER DAY
IDEA: flags library? I might try to break this out into its own library
	FLAG_DECLARE("FLAG_NAME", type, default_value, "what is it and how to use it");
	RETURN_IF_ERROR(FLAG_SET("FLAG_NAME", value));
	StatusOr<int> f = ASSIGN_OR_RETURN(FLAG_GET("FLAG_NAME"));