Speed up the VCF parser #1644
Conversation
// which is ~40% faster again, but it's not so portable.
// i.e. p = (uint8_t *)strchrnul((char *)start, aux->sep);
uint8_t *p2 = (uint8_t *)strchr((char *)start, aux->sep);
p = p2 ? p2 : start + strlen((char *)start);
I realise there's a faster alternative here too.
Drop the if (*p == 0) code below (i.e. move it into the earlier if-else cases) and use p2 == NULL instead. E.g.:
uint8_t *p2 = (uint8_t *)strchr((char *)start, aux->sep);
if (p2) {
aux->p = (const char *)p2;
} else {
aux->p = (const char *)start + strlen((char *)start);
aux->finished = 1;
}
This avoids a second conditional on *p. Isolated testing of kstrtok though shows it's only around 3-4% quicker, so I can't be bothered to change it now, and that corresponds to a tiny amount in most VCF parsing.
Note the old kstrtok was about 18% slower, so the use of strchr has been significant. We probably ought to apply this to klib, but I see there hasn't been movement on the last PR I made so I suspect support is dead.
char *end = s->s + s->l;
// detect FORMAT "."
static int vcf_parse_format_empty1(kstring_t *s, const bcf_hdr_t *h, bcf1_t *v,
Maybe time to get rid of the numbers on the end of these function names?
I can do so if you wish, but I added them deliberately as an aide-memoire as to which part of the original function they belong to, i.e. as an indicator for part 1, part 2, etc. It could just be done in comments, but it's irrelevant anyway as they're all static and won't occur externally.
Force-pushed from e1bbc3f to cadca75
This simply turns the monolithic vcf_parse_format function into a series of numbered sub-functions whose primary purpose is to localise the variables to that code block and to make it easier to see the structure of the tasks being performed. There is no code optimisation here and the main algorithm is unchanged, so this is just moving code from one function to multiple functions. However it makes the next commit easier to understand as we're not trying to see a delta mixed in with restructuring.

An unexpected consequence of making the variables local to their blocks is that it also speeds up the code. The first step in separating this code into functions was simply adding curly braces around each code segment and moving the function-global variables into their respective blocks. The before/after benchmarks on 100,000 lines of a multi-sample G1K VCF are ("perf stat" cycle counts):

            ORIG         LOCALISED
    gcc7    29335942870  27353443704
    gcc13   31974757094  31452452908
    clang13 31290989382  29508665020

Benchmarked again after moving to actual functions, but the difference was tiny in comparison.
This is the main meat of the VCF read speedup, following on from the previous code refactoring. Combined timings on testing a GNOMAD very INFO-heavy single-sample file, a many-sample (approx 4000) FORMAT-rich file, and the GIAB HG002 VCF truth set for different compilers are:

    INFO heavy   (15-29% speedup)   (34-39% speedup)
                 dev(s)   PR(s)     dev(s)   PR(s)
    clang13      6.29     5.34      2.84     1.85
    gcc13        6.74     5.22      2.93     1.93
    gcc7         7.96     5.65      3.25     1.98

    FORMAT heavy (6-19% speedup)    (18-22% speedup)
                 dev      PR        dev      PR
    clang13      9.17     8.58      5.45     4.48
    gcc13        9.88     8.04      5.08     3.95
    gcc7         9.12     8.33      4.87     3.98

    GIAB HG002   (28-29% speedup)   (33-37% speedup)
                 dev      PR        dev      PR
    clang13      12.88    9.30      5.12     3.29
    gcc13        12.04    8.60      4.74     3.19
    gcc7         12.87    9.37      5.32     3.34

(Tested on an Intel Xeon Gold 6142 and an AMD Zen4 respectively.) Bigger speedups (see first message in PR) were seen on some older hardware.

Specific optimisations, along with estimates of their benefit, include, in approximate order of writing / testing:

- Adding consts and caching of bcf_hdr_nsamples(h). No difference on system gcc (gcc7) and clang13, but a couple of percent gain on gcc13.
- Remove the need for most calls to hts_str2uint by recognising that most GT numbers are single digits. This was a 4-5% saving for gcc and 9-10% on clang.
- Remove kputc calls in bcf_enc_vint / bcf_enc_size, avoiding repeated ks_resize checking. This is a further ~10% speedup.
- Unrolling in bcf_enc_vint to encourage SIMD.
- Improve speed of bgzf_getline and kstrtok via memchr/strchr. In tabix timings indexing VCF, the bgzf_getline change is 9-22% quicker with clang 13 and 19-25% quicker with gcc 7. I did investigate a manually unrolled 64-bit search, before I remembered the existence of memchr (doh!). This is often faster on clang (17-19%), but marginally slower on gcc. The actual speed up to this function however is considerably more (3-4x quicker).
For interest, I include the equivalent code here, as it may be useful in other contexts:

    #if HTS_ALLOW_UNALIGNED != 0 && ULONG_MAX == 0xffffffffffffffff
        // 64-bit unrolled delim detection
        #define haszero(x) (((x)-0x0101010101010101UL)&~(x)&0x8080808080808080UL)

        // Quicker broadcast on clang than bit shuffling in delim
        union { uint64_t d8; uint8_t d[8]; } u;
        memset(u.d, delim, 8);
        const uint64_t d8 = u.d8;

        uint64_t *b8 = (uint64_t *)(&buf[fp->block_offset]);
        const int l8 = (fp->block_length-fp->block_offset)/8;
        for (l = 0; l < (l8 & ~3); l+=4) {
            if (haszero(b8[l+0] ^ d8)) break;
            if (haszero(b8[l+1] ^ d8)) { l++;  break; }
            if (haszero(b8[l+2] ^ d8)) { l+=2; break; }
            if (haszero(b8[l+3] ^ d8)) { l+=3; break; }
        }
        l *= 8;
        for (l += fp->block_offset; l < fp->block_length && buf[l] != delim; l++);

The analogous kstrtok change is using strchr+strlen instead of memchr, as we don't know the string end. This makes kstrtok around 150% quicker when parsing a single-sample VCF. When not finding aux->sep in the string, strchr returns NULL rather than the end of string, so we need an additional strlen to set aux->p. However there is also glibc's strchrnul which solves this in a single call. This makes kstrtok another 40% quicker on this test, but overall it's not a big bottleneck any more.

- Use strchr in vcf_parse_info. This is a major speed increase over manual searching on Linux. TODO: is this just glibc? E.g. libmusl speeds, etc? Other OSes? It saves about 33% of the time in vcf_parse (vcf_parse_info is inlined into it) with gcc. Even more with clang. The total speed gain on a single-sample VCF view (GIAB truth set) is 12-19% fewer cycles.
- Minor "GT" check improvement. This has no real effect on gcc13 and clang13, but the system gcc (gcc7) speeds up single-sample VCF decoding by 7%.
- Speed up the unknown value check (strcmp(p, ".")). Helps gcc7 the most (9%), with gcc13/clang13 in the 3-4% range.
- Speed up vcf_parse_format_max3. This is the first parse through the FORMAT fields.
Ideally we'd merge this and fill5 (the other parse through), but that is harder due to the data pivot / rotate. For now we just optimise the existing code path. Instead of a laborious character-by-character switch, we have an initial tight loop to find the first meta-character and then a switch to do character-dependent code. This is a 5% to 13% speed up depending on the data set.
- Remove kputc and minimise resizing for bcf_enc_int1. 3-8% speedup depending on data / compiler.
- Use memcmp instead of strcmp for "END" and ensure we have room. Also memset over explicit nulling of arrays.
- Force BCF header dicts to be larger than needed. This is a tactic to reduce hash collisions due to the use of overly simple hash functions. It seems to typically be around a 3-8% speed gain.
- Restructure the main vcf_parse function. This can speed things up by 6-7% on basic single-sample files. The previous loop caused lots of branch prediction misses due to the counter 'i' being used to do 8 different parts of code depending on the token number. Additionally it's got better error checking now, as previously running out of tokens early just did a return 0 rather than complaining about missing columns.
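The single-digit GT fast path described above can be sketched roughly as follows. The helper name and the use of strtol as the general fallback are illustrative stand-ins (the real code falls back to hts_str2uint):

```c
#include <ctype.h>
#include <stdlib.h>

/* Parse one allele number from *pp, advancing the pointer past the
 * digits.  Most GT values are a single digit, so test that case first
 * and only take a general integer parse when it is not.
 * Illustrative sketch, not the actual htslib code. */
static int parse_allele(const char **pp)
{
    const char *p = *pp;
    if (isdigit((unsigned char)p[0]) && !isdigit((unsigned char)p[1])) {
        *pp = p + 1;               /* common case: no loop, no strtol */
        return p[0] - '0';
    }
    char *end;
    long v = strtol(p, &end, 10);  /* rare multi-digit allele */
    *pp = end;
    return (int)v;
}
```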
Rebased and a massive squashathon. Retested dev vs this on a couple of platforms with 3 compilers and 3 datasets. One thing that needs testing is other systems, e.g. MacOS. There is some assumption on the C library being efficient, particularly things like
The two we keep are the internal while loop to find the next ; or = instead of iterating back in the outer for loop, and memcmp instead of strcmp for "END". The strchr changes do help glibc on excessively long INFO tokens, seen in the GIAB truth set and GNOMAD files, but they have no impact on most mainstream VCF outputs. Furthermore, other C libraries, such as MUSL, are considerably slowed down by the use of strchr. Hence this isn't a particularly robust or warranted change.
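The retained inner while loop amounts to something like this (an illustrative helper, not the actual vcf_parse_info code):

```c
/* Advance over an INFO key in a tight inner loop, stopping at the next
 * '=' (key with a value) or ';' (flag key / next key), rather than
 * re-testing every character in the outer per-token loop.
 * Illustrative sketch only. */
static const char *skip_info_key(const char *p)
{
    while (*p && *p != ';' && *p != '=')
        p++;
    return p;
}
```

A simple byte loop like this is portable and fast on all C libraries, which is why it survived the strchr revert.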
Merged, including the separate commit that reverts the
This is an amalgamation of many small optimisations which jointly can have a significant effect.
I've benchmarked it on data that is very heavy on INFO fields, very heavy on FORMAT fields (many samples), and more "normal" single sample data. Also tested it with the system gcc (7), gcc 13, and clang 13. All with default optimisation levels (-O2).
With gcc7 it's 26 to 75% quicker, gcc13 is 23-53% quicker and clang13 is 12-57% quicker, depending on how modern the host is and whether it's INFO or FORMAT dominated.
We may wish to test these same files on a Mac as the speed could well be very dependent on the efficiency of the C library. Ask me for details, or use your own files.
Small tests, but representative of a variety of data types. Obviously it'll depend on what contents there are in the files.
The changes are fairly non-invasive and there is no fundamental change of algorithm here. Just careful optimisation of what is there.
Larger scale rewrites may offer more potential, such as merging the find-max + alloc + fill-buffer nature of FORMAT handling. It may be better to store integer fields in a 4-byte fixed width instead of the smallest width, avoiding the max calculation stage. The extra memory isn't that bad, I'd think. Strings are a bit trickier, but if there are genuine differences in string length then it's likely that storing packed strings with an index into the buffer is smaller than expanding them to their maximum size. Similarly for other arrays. These may well have a minimal impact on memory usage (maybe even shrinking it in some cases), while paving the way for a single-pass parser. It's a lot more work to do though, so it's parked for now.
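The packed-strings idea can be sketched with a minimal pool-plus-offsets layout. All names and sizes here are hypothetical, purely to illustrate the memory trade-off, and nothing like htslib's actual structures:

```c
#include <string.h>

#define MAX_SAMPLES 16
#define POOL_SIZE   256

/* Illustrative packed storage for per-sample strings: one byte pool
 * plus per-sample offsets, instead of padding every value out to the
 * widest string seen. */
typedef struct {
    char   pool[POOL_SIZE];  /* concatenated NUL-terminated strings */
    size_t off[MAX_SAMPLES]; /* off[i] = start of string i in pool */
    size_t used;             /* bytes consumed in pool */
    size_t n;                /* number of strings stored */
} packed_str_t;

/* Append one string; returns 0 on success, -1 if full. */
static int packed_add(packed_str_t *ps, const char *s)
{
    size_t len = strlen(s) + 1;
    if (ps->n >= MAX_SAMPLES || ps->used + len > POOL_SIZE)
        return -1;
    ps->off[ps->n++] = ps->used;
    memcpy(ps->pool + ps->used, s, len);
    ps->used += len;
    return 0;
}

static const char *packed_get(const packed_str_t *ps, size_t i)
{
    return ps->pool + ps->off[i];
}
```

For example, storing "A" and "ACGTACGT" costs 2 + 9 = 11 pool bytes plus two offsets, where max-width padding would need 2 × 9 = 18 bytes, and the pool can be filled in a single pass with no max-length pre-scan.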
Even better would be to simply go with VCF API v2, which doesn't do lots of parsing in the first place, and has a more API-driven query that processes just-in-time and only the smallest amount we need. We spend a lot of time in hashing, for example, which is irrelevant if we don't actually manipulate that key. Similar to #1081. That had a huge speed up (3-4x) for some operations, but was essentially vetoed if I recall. However if we're breaking the API then it's clearly the way to go. (That PR didn't break the API though, provided people used the proper functions instead of just grubbing around internally without first calling unpack.)
For now this is many commits, as that can make it easier to review. In particular the first commit puts blocks around the main function (which speeds it up by itself) and moves most variables to become block-local, while the second simply lifts those blocks into smaller functions for better code structuring (and no other changes). From then on it's more piecemeal tweaks.
A summary of changes:

- Split `vcf_parse_format` into 7 separate functions, for ease of understanding.
- Added consts and caching of `bcf_hdr_nsamples(h)`. Shouldn't really change anything, but it's good practice and it's possible the compilers can't optimise as well as we'd think due to potential aliasing.
- Removed `kputc` in `bcf_enc_vint` and `bcf_enc_size`.
- Unrolled `bcf_enc_vint` array handling to encourage SIMD generation by the compilers.
- Replaced manual searching with `strchr` or `memchr` (or, for writing, `memset`). These are typically implemented in 32-bit or 64-bit words and are considerably faster than the naive approaches we do. This is the bit we may wish to test on other OSes, as I have my own alternatives if needs be (see commit 1db4543).
- Replaced `strcmp(t, "GT")` with a manual check. No effect on modern compilers, but old compilers do actually call strcmp here. Similarly `strcmp(p, ".")` and `strcmp(key, "END")` are rewritten with `memcmp` as the compilers normally inline that for short fixed strings. (We know it'll be nul terminated already due to kstrtok.)

I've tested it via the bcftools test harness too, which initially found a bug I'd missed.
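The fixed-length comparison trick in the last bullet can be sketched with illustrative helpers (not the actual htslib code):

```c
#include <string.h>

/* kstrtok guarantees the token is NUL-terminated, so comparing 4 bytes
 * covers "END" plus its terminator in one fixed-size memcmp, which
 * compilers inline for short constant strings.  The caller must ensure
 * at least 4 bytes are readable (the PR adds such a room check). */
static int is_end_key(const char *key)
{
    return memcmp(key, "END", 4) == 0;
}

/* Same idea for the missing-value check strcmp(p, "."): two direct
 * byte tests instead of a library call on old compilers. */
static int is_missing(const char *p)
{
    return p[0] == '.' && p[1] == '\0';
}
```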
TODO: VCF writing speed. That'll be a new PR, but I can already see it's got lots of inefficient use of `kputc` everywhere.