-
Notifications
You must be signed in to change notification settings - Fork 0
/
NOTES.txt
173 lines (138 loc) · 6.04 KB
/
NOTES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
Perf/BM library
OBJ: microbenchmark library for C++ small binaries with significant data structures.
OBJ: targeted for benchmarking operations on thread safe data structures
OBJ: exploit the highest resolution clock on Intel and AMD chipsets
REQ: Intel and AMD x86
REQ: C++, v >= 11
REQ: header only. All a user needs to do is #include "bm.h"
REQ: provide results that have locally minimized variance
NONREQ: reduce variance caused by hardware/kernel interrupts.
NOTE: cannot prevent kernel interrupts or hardware interrupts so please run a few times
NOTE: cannot prevent OOO execution
NONREQ: provide results that have globally minimized variance
IDEA: perhaps we use some kind of voting mechanism: get 3 or 5 local minimas of variance and choose the majority?
NONREQ: exception safety
NONREQ: Windows
ALT: google/benchmark [https://github.com/google/benchmark]
DIFF: overkill for small binaries
ALT: sheredom/ubench [https://github.com/sheredom/ubench.h]
DIFF: uses linux clocks which loses resolution on ns if the underlying kernel isn't using rdtsc
NOTE: check kernel's clock source by
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
Example:
static void BM_memcpy(BM::Controller &c) {
char *src = new char[c.Arg(0)];
char *dst = new char[c.Arg(0)];
memset(src, 'x', c.Arg(0));
for (auto _ : c) {
memcpy(dst, src, c.Arg(0));
}
delete[] src;
delete[] dst;
}
BM_Register(BM_memcpy)->Arg(8)->Arg(64)->Arg(512);
BM_Main();
TESTING
Tests are a series of binaries in tests/
Create a series of interesting binaries to benchmark.
Use godbolt to verify that the correct machine instructions are being generated
especially for preventing harmful micro-optimisations.
DESIGN NOTES
central contiguous array of Benchmark structs
add to it with BM_Register
run benchmarks and store with BM_Main
(will use flags to configure how it is run and where results are written to)
FORMATTING:
Use Google C++ style.
--------------------------------------------------------------------------------
TODO: Run() to execute and print results
controller is iterable and injects/extract information from the key benchmarkable region
each iteration represents a new Arg or Thread
we iterate once we have a result
we have a result when we have gathered enough samples
each sample we rdtsc to get how long that section took
if negative, keep counter and warn user at the end if greater than zero.
we stop collecting samples when:
- running stddev or variance stops changing for some number of iterations?
the tsc sample equals our running average
- minimum_time < cpu_time
- 1|min_iter_count < iters < 1e9
- 5*minimum_time < real_time
we produce a results vector
results store:
- name + state (like any arg or thread)
- cpu_time total (accumulation of rdtsc calls)
- iterations (number of times we gathered samples)
- cpu_time per iter estimate
- wall_time
- negative sample count (only if we encountered it). Warn the user if > 0.
TODO: warmup flag:
if set high, it will warmup the code path before measuring --benchmark_warmup=true
TODO: random interleaving can reduce variation --benchmark_enable_random_interleaving=true
TODO: specify non-zero repetition count --benchmark_repetitions=9
TODO: decrease the per-repetition time --benchmark_min_time=0.1
TODO: Add Args(A) and ArgRange(A-B)? To test how parameter size impacts benchmarks
c.Arg(0) and c.Arg(1) to access integers.
MAYBE: need to pause timing within the critical section.
TODO: Add Threads(N)? To test how multithread impacts benchmarks
TODO: BM::DoNotOptimize(void *p);
Escape and Clobber work together to trick the compiler into thinking we might globally use some data, thus it avoids optimizing it out
Effect of clobber() is limited to memory that is potentially accessible from an imagineary global root pointer.
From a compilers point of view, some data is completely hidden from the initial scope. Clobber pretends to touch all the data the global scope can see.
To ensure all data can be seen, we use scape.
// Marks p as arbitrarily aliased (to prevent dead code elim or scalar replacement of aggregate passes)
static void escape(void *p) {
asm volatile("" : : "g"(p) : "memory");
}
// Aliases all the memory in the scope
static void clobber() {
asm volatile("" : : : "memory");
}
void benchmark() {
vector<int> v;
v.reserve(1);
v.push_back(10);
}
Benchmark would normally get mostly optimized out.
v.reserve(1) is not visible to clobber until we escape
void benchmark() {
vector<int> v;
v.reserve(1);
escape(v.data());
v.push_back(10);
clobber();
}
TODO: prevent reordering of instructions with fencing
You can probably just use memory_order_acquire_release on some random atomic int?
e.g. Say we have uint8_t array;
prevent clflush from being reordered by the CPU or the compiler in this direction w/
_mm_mfence();
flush the line containing the element w/
_mm_clflush(&array[0]);
mfence and lfence must be in this order + compiler barrier for rdtsc
_mm_mfence();
_mm_lfence();
time1 = __rdtsc();
serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load
_mm_lfence();
int temp = array[0]; // this is a cache miss
measuring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss
no need for mfence because there are no stores in between
mfence and lfence must be in this order + compiler barrier for rdtsc and the load
_mm_lfence();
time2 = __rdtsc();
serialize __rdtsc with respect to trailing instructions
_mm_lfence();
msl = time2 - time1;
TODO: suggesting pinning benchmark to a CPU to reduce variance.
MAYBE? Make this a parameter to BM::State instead of a flag?
TODO: check Linux scaling governor.
TODO: check randomize virtual addressing space
TODO: CSV and JSON output
IDEA: fallback to chrono if rdtsc isn't available? Warn user about this.
-----
FOR ANOTHER DAY
IDEA: flags library? I might try to break this out into its own library
FLAG_DECLARE("FLAG_NAME", type, default_value, "what is it and how to use it");
RETURN_IF_ERROR(FLAG_SET("FLAG_NAME", value));
StatusOr<int> f = ASSIGN_OR_RETURN(FLAG_GET("FLAG_NAME"));