-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Progress and Cancelation for Indexing #1379
base: develop
Are you sure you want to change the base?
Conversation
This is a nice idea, but currently it changes the signature of The |
I wonder if there is a general way this can be handled. It's rather specific to indexing only. What about any other task iterating over a file? It seems to me that file offset, if known, is a more workable progression marker. We don't necessarily know the start / end of the region in question, but maybe that too can be made accessable by analysing the index query results. We could almost do this right now by using ftell type functions (on the relevant hts file descriptor) from a separate thread, were it not for potential thread memory access clashes. If there was a thread-safe API function that could be called though then the issue is solved. It's a different model of operation, requiring a separate UI thread that periodically polls to perform updates, but this feels cleaner than adding callbacks to potentially more and more pieces of code. Edit: note the above is for progress indicators, but not cancelling. I'd guess you would also need a cancelling method by setting a flag on the file descriptor. The exsiting code should handle errors when reading / writing data anyway, so cancelling is simply a matter of forcing that code to be triggered. |
We have |
I'm not making myself clear. I'm talking about separate UI threads from computation threads. This is often how many GUIs are written. So it doesn't matter that the function consumes all input before finishing, provided it's possible to query the progress from another thread and feed back. That's why I say a thread safe interface to query position in the input file. (With also a thread safe way of determining the start / end positions for purposes of range queries) Specifically I'm trying to think of general library design choices that can be applied universally, hopefully without too many icky internals, without having to modify every function in htslib and/or command in samtools to have callback functions or some sort of co-routine yield interface to be interruptable. |
FYI I have had a draft adding |
I guess you could do this by probing from another thread. It's the thread safe part of it that would be hard... |
Using the file position/offset works well when reading the file records. Though a combined Would the thread safe progress api allow for canceling the index job? Another nice feature of using a callback is you can pause the computation (by not returning immediately) . |
Maybe we need to know what the actual problem is you're trying to solve. For example why is being able to pause computation via a callback a beneficial feature? My worry about doing this for indexing, is what's next? General reading through a file, needing a new API with a call back there? Then for VCF too? Then in the synced reader VCF interface? You can see there is inevitable room for feature creep. You personally may only need indexing, but once precedence has been set it's inevitable other people will start to ask for a callback interface in every part of the library. |
Indexing takes a long time by nature of the fact that you have to read the whole file. As a result there are two problems I'm trying to solve:
I wouldn't worry too much about creep, as neither of these cases are an issue with general reading through the files. When reading the files iterating record by record provides very granular steps through the file. As a result it is easy to calculate progress based on the current offset, and it is very easy to stop reading. I suppose an alternate option would be to make apis to run the index creation in steps, something like
And i could drive the indexing loop outside of htslib. |
That's a good point about single long-running function vs many repeated calls to (eg) read, merge, etc. I was thinking more broadly, including those that manipulate samtools. (Although frankly the way tools like pysam pretend samtools is a library rather than a subprocess has never really sat well with me.) |
…x2 to preserve API and ABI
@daviesrob I made the changes that you suggested. I'm not sure if this was waylaid by the following discussion. If this is still of interest hopefully it is a little closer to where it needs to be. |
It would be really nice to be able to capture the progress of the indexing functions, especially for large files as this can take some time.
Adding callbacks to the index functions would provide a way to pass progress to the embedding library. Additionally it allow for canceling the indexing function (without killing the program).
Is this something that you are interested in supporting? Let me know if this is of any interest, or if there is anything i can do to help. Thanks!