
Speed up the VCF parser #1644

Merged: 3 commits, Aug 11, 2023
Conversation

@jkbonfield (Contributor) commented Jul 12, 2023

This is an amalgamation of many small optimisations which jointly can have a significant effect.

I've benchmarked it on data that is very heavy on INFO fields, very heavy on FORMAT fields (many samples), and more "normal" single-sample data. Also tested with the system gcc (7), gcc 13, and clang 13, all at the default optimisation level (-O2).

With gcc7 it's 26-75% quicker, with gcc13 23-53% quicker, and with clang13 12-57% quicker, depending on how modern the host is and whether the data is INFO- or FORMAT-dominated.

A chunk of the GIAB truth set, single sample:

        Seconds for dev / PR
        seq4d           seq4-head1
gcc7    12.62 / 7.17    10.13 / 6.49
gcc13   10.65 / 6.96     8.65 / 6.29
clang13 11.87 / 7.55     8.61 / 6.49

gnomad-g1k.10k.vcf (many-samples):

        seq4d           seq4-head1
gcc7    10.62 / 8.46    8.29 / 6.58
gcc13   11.11 / 8.80    8.21 / 6.68
clang13 11.91 / 9.23    8.14 / 6.94

gnomad.sites.50k.vcf - no samples but very heavy in INFO field:

        seq4d           seq4-head1
gcc7    8.27 / 4.83     5.91 / 3.84
gcc13   6.97 / 4.76     4.99 / 3.84
clang13 6.43 / 5.06     4.49 / 4.01

We may wish to test these same files on a Mac as the speed could well be very dependent on the efficiency of the C library. Ask me for details, or use your own files.

Small tests, but representative of a variety of data types. Obviously the gains will depend on what's in the files.

The changes are fairly non-invasive and there is no fundamental change of algorithm here. Just careful optimisation of what is there.

Larger-scale rewrites may offer more potential, such as merging the find-max + alloc + fill-buffer nature of FORMAT handling. It may be better to store integer fields in a 4-byte fixed width instead of the smallest width, avoiding the max-calculation stage; the extra memory isn't that bad, I'd think. Strings are a bit trickier, but if there are genuine differences in string length then storing packed strings with an index into the buffer is likely smaller than expanding every string to its maximum size. Similar for other arrays. These may well have minimal impact on memory usage (maybe even shrinking it in some cases), while paving the way for a single-pass parser. It's a lot more work though, so it's parked for now.
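
Roughly, the packed layout would look something like this (a hypothetical sketch, not code from this PR):

#include <stdint.h>

// Hypothetical layout: all sample strings stored back to back with an
// offset index, instead of every sample padded to the longest string.
typedef struct {
    char     *buf;   // all strings, concatenated
    uint32_t *off;   // n+1 offsets; string i is buf[off[i]] .. buf[off[i+1]]
    int       n;     // number of strings (samples)
} packed_str_t;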

Even better would be to simply go with a VCF API v2 that doesn't do lots of parsing in the first place, and has a more API-driven query that processes just-in-time and only the smallest amount we need. We spend a lot of time in hashing, for example, which is irrelevant if we never actually manipulate that key. Similar to #1081. That had a huge speed up (3-4x) for some operations, but was essentially vetoed if I recall. However, if we're breaking the API then it's clearly the way to go. (That PR didn't break the API though, provided people used the proper functions instead of just grubbing around internally without first calling unpack.)

For now this is many commits, as that can make it easier to review. In particular, the first commit puts blocks around the main function (which speeds it up by itself) and moves most variables to become block-local, while the second simply lifts those blocks into smaller functions for better code structure (and no other changes). From then on it's more piecemeal tweaks.

A summary of changes:

  • Split up vcf_parse_format into 7 separate functions, for ease of understanding.
  • Hints for const and caching of lookups (e.g. bcf_hdr_nsamples). This shouldn't really change anything, but it's good practice and it's possible the compilers can't optimise as well as we'd think due to potential aliasing.
  • Removal of str2uint calls for simple single-digit GTs (see the sketch after this list).
  • Removal of many kputc calls in bcf_enc_vint and bcf_enc_size.
  • Unrolled bcf_enc_vint array handling to encourage SIMD generation by the compilers.
  • Replaced manual byte-by-byte loops with strchr or memchr (or memset when writing). These are typically implemented in 32-bit or 64-bit words and are considerably faster than our naive byte loops. This is the bit we may wish to test on other OSes, as I have my own alternatives if need be (see commit 1db4543).
  • Replaced strcmp(t, "GT") with a manual check. No effect on modern compilers, but old compilers do actually call strcmp here. Similarly strcmp(p, ".") and strcmp(key, "END") are rewritten with memcmp, as compilers normally inline that for short fixed strings. (We know it'll be nul-terminated already due to kstrtok.)
  • Sped up max value parsing via a lookup table and/or range checking.
  • Plus more comments and documentation (it could do with a lot more still).
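
As a flavour of the GT change, a minimal sketch of the single-digit fast path (illustrative only; the real code falls back to htslib's own integer parser, with strtol standing in here):

#include <stdlib.h>

// Most GT alleles are a single digit, so peek at the following character
// and avoid a general string-to-integer call for the common case.
static inline int gt_allele_sketch(const char *p, char **end) {
    if (p[0] >= '0' && p[0] <= '9' && !(p[1] >= '0' && p[1] <= '9')) {
        *end = (char *)p + 1;        // fast path: no function call
        return p[0] - '0';
    }
    return (int)strtol(p, end, 10);  // rare multi-digit allele: slow path
}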

I've tested it via the bcftools test harness too, which initially found a bug I'd missed.

TODO: VCF writing speed. That'll be a new PR, but I can already see it's got lots of inefficient use of kputc everywhere.

An inline review comment on the new strchr-based kstrtok code:

// which is ~40% faster again, but it's not so portable.
// i.e. p = (uint8_t *)strchrnul((char *)start, aux->sep);
uint8_t *p2 = (uint8_t *)strchr((char *)start, aux->sep);
p = p2 ? p2 : start + strlen((char *)start);

@jkbonfield (Contributor Author) commented:

I realise there's a faster alternative here too.

Drop the if (*p == 0) code below (so move it to the earlier if-else cases) and use p2 == NULL instead. E.g.:

uint8_t *p2 = (uint8_t *)strchr((char *)start, aux->sep);
if (p2) {
    aux->p = (const char *)p2;
} else {
    aux->p = (const char *)start + strlen((char *)start);
    aux->finished = 1;
}

This avoids a second conditional on *p. Isolated testing of kstrtok, though, shows it's only around 3-4% quicker, so I can't be bothered to change it now; it corresponds to a tiny amount of time in most VCF parsing.

Note the old kstrtok was about 18% slower, so the use of strchr has been significant. We probably ought to apply this to klib, but I see there hasn't been movement on the last PR I made so I suspect support is dead.

@daviesrob self-assigned this Jul 25, 2023

char *end = s->s + s->l;
// detect FORMAT "."
static int vcf_parse_format_empty1(kstring_t *s, const bcf_hdr_t *h, bcf1_t *v,

A reviewer (Member) commented on the new function names:

Maybe time to get rid of the numbers on the end of these function names?

@jkbonfield (Contributor Author) replied:

I can do if you wish, but I added them deliberately as an aide-memoire for which part of the original function they belong to, i.e. as an indicator for part 1, part 2, etc. It could just be done in comments, but it's irrelevant anyway as they're all static and won't be visible externally.

@jkbonfield force-pushed the vcf_speed3 branch 2 times, most recently from e1bbc3f to cadca75, on August 7, 2023 14:28

This simply turns the monolithic vcf_parse_format function into a
series of numbered sub-functions, whose primary purpose is to localise
the variables to their code blocks and to make the structure of the
tasks being performed easier to see.

There is no code optimisation here and the main algorithm is
unchanged, so this is just movement of code from one function to
several.  However, it makes the next commit easier to understand, as
we're not trying to see a delta mixed in with restructuring.

An unexpected consequence, however, of making the variables local to
their blocks is that it also speeds up the code.

The first step in separating this code into functions was simply
adding curly braces around each code segment and moving the
function-global variables into their respective blocks.  The
before/after benchmarks on 100,000 lines of a multi-sample G1K VCF
are ("perf stat" cycle counts):

        ORIG            LOCALISED
gcc7    29335942870     27353443704
gcc13   31974757094     31452452908
clang13 31290989382     29508665020

Benchmarked again after moving to actual functions, but the difference
was tiny in comparison.
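
Schematically, that first step had this shape (illustrative code only,
not the real parser):

    // Before: one flat function; every temporary is live for the whole
    // call and is shared by all parsing stages.
    int parse_before(void) {
        int i = 0, n = 0;
        /* ... CHROM ... POS ... INFO ... FORMAT, all reusing i and n ... */
        return i + n;
    }

    // After step one: each stage in its own block with its own locals,
    // shortening live ranges and giving the optimiser more freedom.
    int parse_after(void) {
        int ret = 0;
        {   // CHROM
            int i = 0; ret += i;
        }
        {   // POS
            int n = 0; ret += n;
        }
        return ret;
    }
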
This is the main meat of the VCF read speedup, following on from the
previous code refactoring.  Combined timings on a very INFO-heavy
single-sample GNOMAD file, a many-sample (approx. 4000) FORMAT-rich
file, and the GIAB HG002 VCF truth set, for different compilers:

                    Xeon Gold 6142      AMD Zen4
INFO heavy          (15-29% speedup)    (34-39% speedup)
                    dev(s)  PR(s)       dev(s)  PR(s)
    clang13         6.29    5.34        2.84    1.85
    gcc13           6.74    5.22        2.93    1.93
    gcc7            7.96    5.65        3.25    1.98

FORMAT heavy        (6-19% speedup)     (18-22% speedup)
                    dev     PR          dev     PR
    clang13         9.17    8.58        5.45    4.48
    gcc13           9.88    8.04        5.08    3.95
    gcc7            9.12    8.33        4.87    3.98

GIAB HG002          (28-29% speedup)    (33-37% speedup)
                    dev     PR          dev     PR
    clang13         12.88   9.30        5.12    3.29
    gcc13           12.04   8.60        4.74    3.19
    gcc7            12.87   9.37        5.32    3.34

(Tested on an Intel Xeon Gold 6142 and an AMD Zen4, respectively.)

Bigger speedups (see first message in PR) were seen on some older
hardware.

Specific optimisations along with estimates of their benefit include,
in approximate order of writing / testing:

- Adding consts and caching of bcf_hdr_nsamples(h).
  No difference on the system gcc (gcc7) or clang13, but a couple of
  percent gain on gcc13.

- Remove the need for most calls to hts_str2uint by recognising that
  most GT numbers are single digits.  This was a 4-5% saving for gcc and
  9-10% on clang.

- Remove kputc calls in bcf_enc_vint / bcf_enc_size, avoiding repeated
  ks_resize checking.  This is a further ~10% speedup.
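
  A sketch of the pattern (simplified from the real bcf_enc_size; only
  sizes below 15 are handled here):

    #include "htslib/kstring.h"

    // One capacity check up front, then a raw store, instead of kputc
    // calls that each re-check the buffer size.
    static inline int enc_size_small(kstring_t *s, int size, int type) {
        if (ks_resize(s, s->l + 1) < 0)
            return -1;
        s->s[s->l++] = (char)((size << 4) | type); // BCF: 4-bit size, 4-bit type
        return 0;
    }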

- Unrolling in bcf_enc_vint to encourage SIMD.
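
  The unrolling has roughly this shape (illustrative; the real code also
  handles wider types and missing values):

    #include <stdint.h>
    #include <stddef.h>

    // Narrow 32-bit values to bytes four at a time; the fixed-count
    // stores give the compiler an easy vectorisation target.
    static void narrow_u8(uint8_t *q, const int32_t *a, size_t n) {
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            q[i+0] = (uint8_t)a[i+0];
            q[i+1] = (uint8_t)a[i+1];
            q[i+2] = (uint8_t)a[i+2];
            q[i+3] = (uint8_t)a[i+3];
        }
        for (; i < n; i++)
            q[i] = (uint8_t)a[i];
    }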

- Improve speed of bgzf_getline and kstrtok via memchr/strchr.

  In tabix timings indexing VCF, the bgzf_getline change is 9-22% quicker
  with clang 13 and 19-25% quicker with gcc 7.  I did investigate a
  manually unrolled 64-bit search, before I remembered the existence of
  memchr (doh!).  This is often faster on clang (17-19%), but marginally
  slower on gcc.  The actual speedup to this function itself, however,
  is considerably more (3-4x quicker).

  For interest, I include the equivalent code here, as it may be useful in
  other contexts:

    #if HTS_ALLOW_UNALIGNED != 0 && ULONG_MAX == 0xffffffffffffffff
    // 64-bit unrolled delim detection
    #define haszero(x) (((x)-0x0101010101010101UL)&~(x)&0x8080808080808080UL)
            // Quicker broadcast on clang than bit shuffling in delim
            union {
                uint64_t d8;
                uint8_t d[8];
            } u;
            memset(u.d, delim, 8);
            const uint64_t d8 = u.d8;

            uint64_t *b8 = (uint64_t *)(&buf[fp->block_offset]);
            const int l8 = (fp->block_length-fp->block_offset)/8;
            for (l = 0; l < (l8 & ~3); l+=4) {
                if (haszero(b8[l+0] ^ d8))
                    break;
                if (haszero(b8[l+1] ^ d8)) {
                    l++;
                    break;
                }
                if (haszero(b8[l+2] ^ d8)) {
                    l+=2;
                    break;
                }
                if (haszero(b8[l+3] ^ d8)) {
                    l+=3;
                    break;
                }
            }
            l *= 8;
            for (l += fp->block_offset;
                 l < fp->block_length && buf[l] != delim;
                 l++);

  The analogous kstrtok change is using strchr+strlen instead of memchr
  as we don't know the string end. This makes kstrtok around 150%
  quicker when parsing a single sample VCF.

  When not finding aux->sep in the string, strchr returns NULL rather
  than end of string, so we need an additional strlen to set aux->p.
  However there is also glibc's strchrnul which solves this in a single
  call.  This makes kstrtok another 40% quicker on this test, but
  overall it's not a big bottleneck any more.

- Use strchr in vcf_parse_info.

  This is a major speed increase over manual searching on Linux.
  TODO: is this just glibc?  Eg libmusl speeds, etc?  Other OSes?

  It saves about 33% of the time in vcf_parse (with vcf_parse_info
  inlined into it) with gcc, and even more with clang.  The total speed
  gain on a single-sample VCF view (GIAB truth set) is 12-19% fewer
  cycles.

- Minor "GT" check improvement.  This has no real affect on gcc13 and
  clang13, but the system gcc (gcc7) speeds up single sample VCF decoding by 7%

- Speed up the unknown value check (strcmp(p, ".")).  Helps gcc7 the
  most (9%), with gcc13/clang13 seeing 3-4% gains.

- Speed up vcf_parse_format_max3.

  This is the first parse through the FORMAT fields.  Ideally we'd merge
  this and fill5 (the other parse through), but that is harder due to
  the data pivot / rotate.

  For now we just optimise the existing code path.  Instead of a
  laborious character-by-character switch, we have an initial tight
  loop to find the first meta-character, followed by a switch to do the
  character-dependent work.

  This is 5% to 13% speed up depending on data set.
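
  Schematically (the character set here is illustrative):

    // A tight scan over plain value bytes, then character-dependent
    // handling only once a meta-character is reached, rather than a
    // switch executed on every byte.
    static const char *scan_field(const char *p) {
        while (*p && *p != ':' && *p != ',' && *p != '\t')
            p++;
        switch (*p) {
        case ':':  /* next FORMAT field      */ break;
        case ',':  /* next element of a list */ break;
        case '\t': /* next sample column     */ break;
        default:   /* '\0': end of the line  */ break;
        }
        return p;
    }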

- Remove kputc and minimise resize for bcf_enc_int1.
  3-8% speedup depending on data / compiler.

- Use memcmp instead of strcmp for "END" and ensure we have room.
  Also memset over explicit nulling of arrays.
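
  The "END" check becomes something like this (sketch; "ensure we have
  room" means the caller guarantees at least 4 readable bytes at key):

    #include <string.h>

    // memcmp with a tiny fixed length is inlined by compilers (often as
    // a single 32-bit compare), where strcmp on older gcc stays a real
    // library call.  Comparing 4 bytes includes the NUL, so "ENDX"
    // cannot match.
    static inline int is_end_key(const char *key) {
        return memcmp(key, "END", 4) == 0;
    }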

- Force BCF header dicts to be larger than needed.

  This is a tactic to reduce hash collisions due to the use of overly
  simple hash functions.  It seems to typically be around 3-8% speed gain.
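
  In klib khash terms the tactic looks like this (the dict type and the
  growth factor are illustrative, not the PR's exact values):

    #include "htslib/khash.h"
    KHASH_MAP_INIT_STR(str2id, int)   // hypothetical string -> id dict

    // Ask for more buckets than the key count needs, keeping the load
    // factor low; with a simple hash this shortens collision chains.
    static khash_t(str2id) *make_dict(int expected_keys) {
        khash_t(str2id) *d = kh_init(str2id);
        if (d)
            kh_resize(str2id, d, expected_keys * 4);
        return d;
    }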

- Restructure of main vcf_parse function.

  This can speed things up by 6-7% on basic single-sample files.

  The previous loop caused lots of branch-prediction misses, due to the
  counter 'i' being used to select between 8 different pieces of code
  depending on token number.  It also has better error checking now:
  previously, running out of tokens early just did a return 0 rather
  than complaining about missing columns.
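
  In effect the restructure replaces a token-counter dispatch with
  straight-line code (sketch using kstrtok, as the parser does; most
  error handling elided):

    #include "htslib/kstring.h"

    // Before (shape): for (i = 0; (p = kstrtok(...)); i++) switch (i) ...
    // where the branch on i changes every iteration and mispredicts.
    // After (shape): each column parsed in sequence, with a missing
    // column being an explicit error instead of a silent return 0.
    static int parse_line_sketch(char *line, ks_tokaux_t *aux) {
        char *p;
        if (!(p = kstrtok(line, "\t", aux))) return -1;  // CHROM
        if (!(p = kstrtok(NULL, "\t", aux))) return -1;  // POS
        if (!(p = kstrtok(NULL, "\t", aux))) return -1;  // ID
        /* ... REF, ALT, QUAL, FILTER, INFO, FORMAT + samples ... */
        return 0;
    }
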
@jkbonfield (Contributor Author) commented Aug 7, 2023

Rebased and a massive squashathon.

Retested dev vs this PR on a couple of platforms with 3 compilers and 3 datasets.

One thing that needs testing is other systems, e.g. MacOS. There is some assumption of the C library being efficient, particularly for things like strchr. It's true with glibc, but I'm not sure whether it's universally true.

From the follow-up commit reverting the vcf_parse_info() strchr changes:

The two changes we keep are the internal while loop to find the next
';' or '=' instead of iterating back in the outer for loop, and memcmp
instead of strcmp for "END".

The strchr changes do help glibc on excessively long INFO tokens, as
seen in the GIAB truth set and GNOMAD files, but they have no impact on
most mainstream VCF outputs.  Furthermore, other C libraries, such as
MUSL, are considerably slowed down by the use of strchr.  Hence this
isn't a particularly robust or warranted change.
@daviesrob merged commit ac70212 into samtools:develop on Aug 11, 2023 (8 checks passed)
@daviesrob (Member) commented:

Merged including the separate commit that reverts the vcf_parse_info() changes - I think it's useful to keep the history there should we want to revisit that area later.
