-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add mmap hfile backend #690
base: develop
Are you sure you want to change the base?
Conversation
Is there a reason why the mmap-based-hFILE backend in https://github.com/samtools/htslib-plugins doesn't fit your needs? (Especially if you updated it to use a fixed buffer.) |
@jmarshall the intention of this PR is to replace |
^ also just made an edit description that matches the purpose of this PR |
That's probably my fault giving incorrect advice to Thomas - I didn't realise there was even an mmap hFILE already, although is there a reason why it's hidden away in the plugin package instead of main stream? Mentally I label that repo as "here be dragons" and avoid it, due to the association with iRODS. ;-) Our intention here is to remove the last reason for the existence of mFILE which in turn will aid lifting the reference querying machinery out of the cram subdirectory. |
That plugin code followed conversations with @lairdm (see lairdm/develop circa August 2016), who was interested in using I am certainly in favour of exterminating mFILE. Last I looked I thought mmap wasn't actually activated (so plain Whether you access a local file via Since
|
Actually selecting this via a mode letter hint turns out to be rather tidy, means it works with |
Just done a little bit of an experiment with the performance of the using a mmap inside a hfile (using @jmarshall's implementation) vs using a hfile vs an operating directly on the mmapped memory on the new reference fetching logic I'm going to push to #589. Here are my results:
where the sequential test is executing the command: time samtools view ~/NA12878.alt_bwamem_GRCh38DH.20150826.CEU.exome.cram | head -1000000 > /dev/null and the parallel test is executing the command: time parallel --ungroup "echo Started {} && samtools view ~/NA12878.alt_bwamem_GRCh38DH.20150826.CEU.exome.cram | head -500000 > /dev/null && echo {} Finished" ::: {1..10} S indicates the system time and U the user time, according to GNU time. Both of these commands are being executed on my mac, which has specifications here: https://support.apple.com/kb/sp715?locale=en_GB. It might be better to do benchmarking on a machine that is more likely to run it (i.e. have more cores), but this is an indication. |
It may be best to use "samtools view -c" to count records. Otherwise you're largely just benchmarking how long it takes to print up SAM records (which isn't something htslib shines with). Try BAM too as the decode times are smaller and you'll see the impact of I/O more than with CRAM. |
I've now got around to doing this benchmarking now. To test the reference fetching code exclusively, I've composed a cram file which is formed from a single sample that doesn't differ from the reference, on chromosome 22: $ cp reference.fa reference.fq
$ bwa index reference.fa
$ bwa mem reference.fa reference.fq > ref.sam
$ samtools view -C -T reference.fa ref.sam > ref.cram
$ echo -e "@SQ\tSN:chr22\tM5:ac37ec46683600f808cdd41eac1d55cd" > new_header
$ samtools reheader -i new_header ref.cram Then I applied a patch on #589: diff --git a/cram/cram_io.c b/cram/cram_io.c
index 1b175e2..9457f80 100644
--- a/cram/cram_io.c
+++ b/cram/cram_io.c
@@ -44,6 +44,7 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
* a way to return errors for when malloc fails.
*/
+#include <fcntl.h>
#include <config.h>
#include <stdio.h>
@@ -68,6 +69,7 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#include <math.h>
#include <time.h>
#include <stdint.h>
+#include <sys/mman.h>
#ifdef HAVE_LIBDEFLATE
#include <libdeflate.h>
@@ -1500,8 +1502,8 @@ char *cram_content_type2str(enum cram_content_type t) {
* Frees/unmaps a reference sequence and associated file handles.
*/
static void ref_entry_free_seq(ref_entry *e) {
- if (e->seq)
- free(e->seq);
+ // if (e->seq)
+ // free(e->seq);
e->seq = NULL;
}
@@ -2006,6 +2008,12 @@ static int cram_populate_ref(cram_fd *fd, int id, ref_entry *r) {
}
r->length = st.st_size;
+ if (0){
+ int ref_fd = open(ref_fn, O_RDONLY);
+ char* mapped_file = mmap(0, st.st_size, PROT_READ, MAP_SHARED, ref_fd, 0);
+ r->seq = mapped_file;
+ }
+ else{
r->offset = r->line_length = r->bases_per_line = 0;
fd->refs->fn = r->fn = string_dup(fd->refs->pool, ref_fn);
@@ -2018,6 +2026,7 @@ static int cram_populate_ref(cram_fd *fd, int id, ref_entry *r) {
fd->refs->fp = bgzf_open_ref(ref_fn, "r", 1);
}
+ }
free(ref_fn); I then ran: $ for i in `seq 1 50`; do time samtools view ref.cram &> /dev/null; done 2> unmmapped_times
<CHANGE CONDITION, RECOMPLE>
$ for i in `seq 1 50`; do time samtools view ref.cram &> /dev/null; done 2> mmapped_times
$ grep real unmmapped_times | cut -c 10-12 > unmmapped_real_times
$ grep real mmapped_times | cut -c 10-12 > mapped_real_times Below is a summary of the times:
In this situation, it looks like mmapping the reference is quicker than reading using the previous method |
I'm starting to look at this again now. I prefer @jmarshall's implementation using the 'm' field in the mode string as it's cleaner to use and doesn't require juggling of temporary buffers to prepend "mmap:" to filenames either. It's also closer to the way the mFILE code works that we'll be replacing here. It's questionable whether we should permit mmap on all of the modes, eg append isn't going to work, but that's arguably a calling code error. |
Starting from @jmarshall's branch, I extended this to replace mFILE.c with hfiles instead. It needs more testing and is a work in progress, but this is the code so far: https://github.com/jkbonfield/htslib/tree/m-mode-mmap NB: still needing fixes, eg for Windows:
|
It's probably a good idea to implement a |
The eventual use of this PR will be to replace the reference file wrapper in the reference fetching logic from an
mFILE
to an mmapped hFILE. This will eventually be used in pushed code to #589)