
Speed up the VCF parser #1644

Merged: 3 commits, Aug 11, 2023
Conversation

@jkbonfield (Contributor) commented Jul 12, 2023

This is an amalgamation of many small optimisations which jointly can have a significant effect.

I've benchmarked it on data that is very heavy on INFO fields, very heavy on FORMAT fields (many samples), and more "normal" single-sample data. Also tested with the system gcc (7), gcc 13, and clang 13, all at the default optimisation level (-O2).

With gcc7 it's 26-75% quicker, with gcc13 23-53% quicker, and with clang13 12-57% quicker, depending on how modern the host is and whether the data is INFO- or FORMAT-dominated.

A chunk of the GIAB truth set, single sample:

        Seconds for dev / PR
        seq4d           seq4-head1
gcc7    12.62 / 7.17    10.13 / 6.49
gcc13   10.65 / 6.96     8.65 / 6.29
clang13 11.87 / 7.55     8.61 / 6.49

gnomad-g1k.10k.vcf (many-samples):

        seq4d           seq4-head1
gcc7    10.62 / 8.46    8.29 / 6.58
gcc13   11.11 / 8.80    8.21 / 6.68
clang13 11.91 / 9.23    8.14 / 6.94

gnomad.sites.50k.vcf - no samples but very heavy in INFO field:

        seq4d           seq4-head1
gcc7    8.27 / 4.83     5.91 / 3.84
gcc13   6.97 / 4.76     4.99 / 3.84
clang13 6.43 / 5.06     4.49 / 4.01

We may wish to test these same files on a Mac as the speed could well be very dependent on the efficiency of the C library. Ask me for details, or use your own files.

Small tests, but representative of a variety of data types. Obviously the gains will depend on what's in the files.

The changes are fairly non-invasive and there is no fundamental change of algorithm here. Just careful optimisation of what is there.

Larger-scale rewrites may offer more potential, such as merging the find-max + alloc + fill-buffer nature of FORMAT handling. It may be better to store integer fields in a 4-byte fixed width instead of the smallest width, avoiding the max-calculation stage; the extra memory isn't that bad, I'd think. Strings are a bit trickier, but if there are genuine differences in string length then storing packed strings with an index into the buffer is likely smaller than expanding every string to its maximum size. Similar for other arrays. These may well have minimal impact on memory usage (maybe even shrinking it in some cases), while paving the way for a single-pass parser. It's a lot more work though, so it's parked for now.
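
Roughly, the packed layout would look something like this (a hypothetical sketch, not code from this PR):

#include <stdint.h>

// Hypothetical layout: all sample strings stored back to back with an
// offset index, instead of every sample padded to the longest string.
typedef struct {
    char     *buf;   // all strings, concatenated
    uint32_t *off;   // n+1 offsets; string i is buf[off[i]] .. buf[off[i+1]]
    int       n;     // number of strings (samples)
} packed_str_t;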

Even better would be to simply go with a VCF API v2 that doesn't do lots of parsing in the first place, and has a more API-driven query that processes just-in-time and only the smallest amount we need. We spend a lot of time in hashing, for example, which is irrelevant if we never actually manipulate that key. Similar to #1081. That had a huge speed up (3-4x) for some operations, but was essentially vetoed if I recall. However, if we're breaking the API then it's clearly the way to go. (That PR didn't break the API though, provided people used the proper functions instead of just grubbing around internally without first calling unpack.)

For now this is many commits, as that can make it easier to review. In particular, the first commit puts blocks around the main function (which speeds it up by itself) and moves most variables to become block-local, while the second simply lifts those blocks into smaller functions for better code structure (and no other changes). From then on it's more piecemeal tweaks.

A summary of changes:

  • Split up vcf_parse_format into 7 separate functions, for ease of understanding.
  • Hints for const and caching of lookups (e.g. bcf_hdr_nsamples). This shouldn't really change anything, but it's good practice and it's possible the compilers can't optimise as well as we'd think due to potential aliasing.
  • Removal of str2uint calls for simple single-digit GTs (see the sketch after this list).
  • Removal of many kputc calls in bcf_enc_vint and bcf_enc_size.
  • Unrolled bcf_enc_vint array handling to encourage SIMD generation by the compilers.
  • Replaced manual byte-by-byte loops with strchr or memchr (or memset when writing). These are typically implemented in 32-bit or 64-bit words and are considerably faster than our naive byte loops. This is the bit we may wish to test on other OSes, as I have my own alternatives if need be (see commit 1db4543).
  • Replaced strcmp(t, "GT") with a manual check. No effect on modern compilers, but old compilers do actually call strcmp here. Similarly strcmp(p, ".") and strcmp(key, "END") are rewritten with memcmp, as compilers normally inline that for short fixed strings. (We know it'll be nul-terminated already due to kstrtok.)
  • Sped up max value parsing via a lookup table and/or range checking.
  • Plus more comments and documentation (it could do with a lot more still).
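
As a flavour of the GT change, a minimal sketch of the single-digit fast path (illustrative only; the real code falls back to htslib's own integer parser, with strtol standing in here):

#include <stdlib.h>

// Most GT alleles are a single digit, so peek at the following character
// and avoid a general string-to-integer call for the common case.
static inline int gt_allele_sketch(const char *p, char **end) {
    if (p[0] >= '0' && p[0] <= '9' && !(p[1] >= '0' && p[1] <= '9')) {
        *end = (char *)p + 1;        // fast path: no function call
        return p[0] - '0';
    }
    return (int)strtol(p, end, 10);  // rare multi-digit allele: slow path
}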

I've tested it via the bcftools test harness too, which initially found a bug I'd missed.

TODO: VCF writing speed. That'll be a new PR, but I can already see it's got lots of inefficient use of kputc everywhere.

An inline review comment on the new strchr-based kstrtok code:

// which is ~40% faster again, but it's not so portable.
// i.e. p = (uint8_t *)strchrnul((char *)start, aux->sep);
uint8_t *p2 = (uint8_t *)strchr((char *)start, aux->sep);
p = p2 ? p2 : start + strlen((char *)start);

@jkbonfield (Contributor Author) commented:

I realise there's a faster alternative here too.

Drop the if (*p == 0) code below (so move it to the earlier if-else cases) and use p2 == NULL instead. E.g.:

uint8_t *p2 = (uint8_t *)strchr((char *)start, aux->sep);
if (p2) {
    aux->p = (const char *)p2;
} else {
    aux->p = (const char *)start + strlen((char *)start);
    aux->finished = 1;
}

This avoids a second conditional on *p. Isolated testing of kstrtok, though, shows it's only around 3-4% quicker, so I can't be bothered to change it now; it corresponds to a tiny amount of time in most VCF parsing.

Note the old kstrtok was about 18% slower, so the use of strchr has been significant. We probably ought to apply this to klib, but I see there hasn't been movement on the last PR I made so I suspect support is dead.

@daviesrob self-assigned this Jul 25, 2023

char *end = s->s + s->l;
// detect FORMAT "."
static int vcf_parse_format_empty1(kstring_t *s, const bcf_hdr_t *h, bcf1_t *v,

A reviewer (Member) commented on the new function names:

Maybe time to get rid of the numbers on the end of these function names?

@jkbonfield (Contributor Author) replied:

I can do if you wish, but I added them deliberately as an aide-memoire for which part of the original function they belong to, i.e. as an indicator for part 1, part 2, etc. It could just be done in comments, but it's irrelevant anyway as they're all static and won't be visible externally.

@jkbonfield force-pushed the vcf_speed3 branch 2 times, most recently from e1bbc3f to cadca75, on August 7, 2023 14:28

This simply turns the monolithic vcf_parse_format function into a
series of numbered sub-functions, whose primary purpose is to localise
the variables to their code blocks and to make the structure of the
tasks being performed easier to see.

There is no code optimisation here and the main algorithm is
unchanged, so this is just movement of code from one function to
several.  However, it makes the next commit easier to understand, as
we're not trying to see a delta mixed in with restructuring.

An unexpected consequence, however, of making the variables local to
their blocks is that it also speeds up the code.

The first step in separating this code into functions was simply
adding curly braces around each code segment and moving the
function-global variables into their respective blocks.  The
before/after benchmarks on 100,000 lines of a multi-sample G1K VCF
are ("perf stat" cycle counts):

        ORIG            LOCALISED
gcc7    29335942870     27353443704
gcc13   31974757094     31452452908
clang13 31290989382     29508665020

Benchmarked again after moving to actual functions, but the difference
was tiny in comparison.
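
Schematically, that first step had this shape (illustrative code only,
not the real parser):

    // Before: one flat function; every temporary is live for the whole
    // call and is shared by all parsing stages.
    int parse_before(void) {
        int i = 0, n = 0;
        /* ... CHROM ... POS ... INFO ... FORMAT, all reusing i and n ... */
        return i + n;
    }

    // After step one: each stage in its own block with its own locals,
    // shortening live ranges and giving the optimiser more freedom.
    int parse_after(void) {
        int ret = 0;
        {   // CHROM
            int i = 0; ret += i;
        }
        {   // POS
            int n = 0; ret += n;
        }
        return ret;
    }
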
This is the main meat of the VCF read speedup, following on from the
previous code refactoring.  Combined timings on a very INFO-heavy
single-sample GNOMAD file, a many-sample (approx. 4000) FORMAT-rich
file, and the GIAB HG002 VCF truth set, for different compilers:

                    Xeon Gold 6142      AMD Zen4
INFO heavy          (15-29% speedup)    (34-39% speedup)
                    dev(s)  PR(s)       dev(s)  PR(s)
    clang13         6.29    5.34        2.84    1.85
    gcc13           6.74    5.22        2.93    1.93
    gcc7            7.96    5.65        3.25    1.98

FORMAT heavy        (6-19% speedup)     (18-22% speedup)
                    dev     PR          dev     PR
    clang13         9.17    8.58        5.45    4.48
    gcc13           9.88    8.04        5.08    3.95
    gcc7            9.12    8.33        4.87    3.98

GIAB HG002          (28-29% speedup)    (33-37% speedup)
                    dev     PR          dev     PR
    clang13         12.88   9.30        5.12    3.29
    gcc13           12.04   8.60        4.74    3.19
    gcc7            12.87   9.37        5.32    3.34

(Tested on an Intel Xeon Gold 6142 and an AMD Zen4, respectively.)

Bigger speedups (see first message in PR) were seen on some older
hardware.

Specific optimisations along with estimates of their benefit include,
in approximate order of writing / testing:

- Adding consts and caching of bcf_hdr_nsamples(h).
  No difference on the system gcc (gcc7) or clang13, but a couple of
  percent gain on gcc13.

- Remove the need for most calls to hts_str2uint by recognising that
  most GT numbers are single digits.  This was a 4-5% saving for gcc and
  9-10% on clang.

- Remove kputc calls in bcf_enc_vint / bcf_enc_size, avoiding repeated
  ks_resize checking.  This is a further ~10% speedup.
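
  A sketch of the pattern (simplified from the real bcf_enc_size; only
  sizes below 15 are handled here):

    #include "htslib/kstring.h"

    // One capacity check up front, then a raw store, instead of kputc
    // calls that each re-check the buffer size.
    static inline int enc_size_small(kstring_t *s, int size, int type) {
        if (ks_resize(s, s->l + 1) < 0)
            return -1;
        s->s[s->l++] = (char)((size << 4) | type); // BCF: 4-bit size, 4-bit type
        return 0;
    }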

- Unrolling in bcf_enc_vint to encourage SIMD.
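
  The unrolling has roughly this shape (illustrative; the real code also
  handles wider types and missing values):

    #include <stdint.h>
    #include <stddef.h>

    // Narrow 32-bit values to bytes four at a time; the fixed-count
    // stores give the compiler an easy vectorisation target.
    static void narrow_u8(uint8_t *q, const int32_t *a, size_t n) {
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            q[i+0] = (uint8_t)a[i+0];
            q[i+1] = (uint8_t)a[i+1];
            q[i+2] = (uint8_t)a[i+2];
            q[i+3] = (uint8_t)a[i+3];
        }
        for (; i < n; i++)
            q[i] = (uint8_t)a[i];
    }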

- Improve speed of bgzf_getline and kstrtok via memchr/strchr.

  In tabix timings indexing VCF, the bgzf_getline change is 9-22% quicker
  with clang 13 and 19-25% quicker with gcc 7.  I did investigate a
  manually unrolled 64-bit search, before I remembered the existence of
  memchr (doh!).  This is often faster on clang (17-19%), but marginally
  slower on gcc.  The actual speedup to this function itself, however,
  is considerably more (3-4x quicker).

  For interest, I include the equivalent code here, as it may be useful in
  other contexts:

    #if HTS_ALLOW_UNALIGNED != 0 && ULONG_MAX == 0xffffffffffffffff
    // 64-bit unrolled delim detection
    #define haszero(x) (((x)-0x0101010101010101UL)&~(x)&0x8080808080808080UL)
            // Quicker broadcast on clang than bit shuffling in delim
            union {
                uint64_t d8;
                uint8_t d[8];
            } u;
            memset(u.d, delim, 8);
            const uint64_t d8 = u.d8;

            uint64_t *b8 = (uint64_t *)(&buf[fp->block_offset]);
            const int l8 = (fp->block_length-fp->block_offset)/8;
            for (l = 0; l < (l8 & ~3); l+=4) {
                if (haszero(b8[l+0] ^ d8))
                    break;
                if (haszero(b8[l+1] ^ d8)) {
                    l++;
                    break;
                }
                if (haszero(b8[l+2] ^ d8)) {
                    l+=2;
                    break;
                }
                if (haszero(b8[l+3] ^ d8)) {
                    l+=3;
                    break;
                }
            }
            l *= 8;
            for (l += fp->block_offset;
                 l < fp->block_length && buf[l] != delim;
                 l++);

  The analogous kstrtok change is using strchr+strlen instead of memchr
  as we don't know the string end. This makes kstrtok around 150%
  quicker when parsing a single sample VCF.

  When not finding aux->sep in the string, strchr returns NULL rather
  than end of string, so we need an additional strlen to set aux->p.
  However there is also glibc's strchrnul which solves this in a single
  call.  This makes kstrtok another 40% quicker on this test, but
  overall it's not a big bottleneck any more.

- Use strchr in vcf_parse_info.

  This is a major speed increase over manual searching on Linux.
  TODO: is this just glibc?  Eg libmusl speeds, etc?  Other OSes?

  It saves about 33% of the time in vcf_parse (with vcf_parse_info
  inlined into it) with gcc, and even more with clang.  The total speed
  gain on a single-sample VCF view (GIAB truth set) is 12-19% fewer
  cycles.

- Minor "GT" check improvement.  This has no real affect on gcc13 and
  clang13, but the system gcc (gcc7) speeds up single sample VCF decoding by 7%

- Speed up the unknown value check (strcmp(p, ".")).  Helps gcc7 the
  most (9%), with gcc13/clang13 seeing 3-4% gains.

- Speed up vcf_parse_format_max3.

  This is the first parse through the FORMAT fields.  Ideally we'd merge
  this and fill5 (the other parse through), but that is harder due to
  the data pivot / rotate.

  For now we just optimise the existing code path.  Instead of a
  laborious character-by-character switch, we have an initial tight
  loop to find the first meta-character, followed by a switch to do the
  character-dependent work.

  This is 5% to 13% speed up depending on data set.
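
  Schematically (the character set here is illustrative):

    // A tight scan over plain value bytes, then character-dependent
    // handling only once a meta-character is reached, rather than a
    // switch executed on every byte.
    static const char *scan_field(const char *p) {
        while (*p && *p != ':' && *p != ',' && *p != '\t')
            p++;
        switch (*p) {
        case ':':  /* next FORMAT field      */ break;
        case ',':  /* next element of a list */ break;
        case '\t': /* next sample column     */ break;
        default:   /* '\0': end of the line  */ break;
        }
        return p;
    }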

- Remove kputc and minimise resize for bcf_enc_int1.
  3-8% speedup depending on data / compiler.

- Use memcmp instead of strcmp for "END" and ensure we have room.
  Also memset over explicit nulling of arrays.
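
  The "END" check becomes something like this (sketch; "ensure we have
  room" means the caller guarantees at least 4 readable bytes at key):

    #include <string.h>

    // memcmp with a tiny fixed length is inlined by compilers (often as
    // a single 32-bit compare), where strcmp on older gcc stays a real
    // library call.  Comparing 4 bytes includes the NUL, so "ENDX"
    // cannot match.
    static inline int is_end_key(const char *key) {
        return memcmp(key, "END", 4) == 0;
    }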

- Force BCF header dicts to be larger than needed.

  This is a tactic to reduce hash collisions due to the use of overly
  simple hash functions.  It seems to typically be around 3-8% speed gain.
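
  In klib khash terms the tactic looks like this (the dict type and the
  growth factor are illustrative, not the PR's exact values):

    #include "htslib/khash.h"
    KHASH_MAP_INIT_STR(str2id, int)   // hypothetical string -> id dict

    // Ask for more buckets than the key count needs, keeping the load
    // factor low; with a simple hash this shortens collision chains.
    static khash_t(str2id) *make_dict(int expected_keys) {
        khash_t(str2id) *d = kh_init(str2id);
        if (d)
            kh_resize(str2id, d, expected_keys * 4);
        return d;
    }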

- Restructure of main vcf_parse function.

  This can speed things up by 6-7% on basic single-sample files.

  The previous loop caused lots of branch-prediction misses, due to the
  counter 'i' being used to select between 8 different pieces of code
  depending on token number.  It also has better error checking now:
  previously, running out of tokens early just did a return 0 rather
  than complaining about missing columns.
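
  In effect the restructure replaces a token-counter dispatch with
  straight-line code (sketch using kstrtok, as the parser does; most
  error handling elided):

    #include "htslib/kstring.h"

    // Before (shape): for (i = 0; (p = kstrtok(...)); i++) switch (i) ...
    // where the branch on i changes every iteration and mispredicts.
    // After (shape): each column parsed in sequence, with a missing
    // column being an explicit error instead of a silent return 0.
    static int parse_line_sketch(char *line, ks_tokaux_t *aux) {
        char *p;
        if (!(p = kstrtok(line, "\t", aux))) return -1;  // CHROM
        if (!(p = kstrtok(NULL, "\t", aux))) return -1;  // POS
        if (!(p = kstrtok(NULL, "\t", aux))) return -1;  // ID
        /* ... REF, ALT, QUAL, FILTER, INFO, FORMAT + samples ... */
        return 0;
    }
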
@jkbonfield (Contributor Author) commented Aug 7, 2023

Rebased and a massive squashathon.

Retested dev vs this PR on a couple of platforms with 3 compilers and 3 datasets.

One thing that needs testing is other systems, e.g. MacOS. There is some assumption of the C library being efficient, particularly for things like strchr. It's true with glibc, but I'm not sure whether it's universally true.

From the follow-up commit reverting the vcf_parse_info() strchr changes:

The two changes we keep are the internal while loop to find the next
';' or '=' instead of iterating back in the outer for loop, and memcmp
instead of strcmp for "END".

The strchr changes do help glibc on excessively long INFO tokens, as
seen in the GIAB truth set and GNOMAD files, but they have no impact on
most mainstream VCF outputs.  Furthermore, other C libraries, such as
MUSL, are considerably slowed down by the use of strchr.  Hence this
isn't a particularly robust or warranted change.
@daviesrob merged commit ac70212 into samtools:develop on Aug 11, 2023 (8 checks passed)
@daviesrob (Member) commented:

Merged including the separate commit that reverts the vcf_parse_info() changes - I think it's useful to keep the history there should we want to revisit that area later.
