Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSS] Add HTS_PARSE_* parsing flags #260

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from
Open

Conversation

jmarshall
Copy link
Member

Issue samtools/bcftools#309 is due to hts_parse_decimal() now matching hts_parse_reg() in allowing commas as thousands-separators, while _regions_init_string() expects to be able to use commas as list delimiters. In fact _regions_init_string() contains its own region parser presumably because hts_parse_reg() was not suitable as _regions_init_string() wanted to treat commas specially itself.

One solution to this is to add a flags parameter to hts_parse_decimal() so the caller can say whether to eat thousands-separating commas. This would be handy as further flags can be envisioned, and hts_parse_decimal() has not appeared in an htslib release yet so we can still change its signature.

It would also be useful to be able to specify the same flags in region parsing; this proposes adding a new hts_parse_region() alongside the existing hts_parse_reg() (which is of course kept as is for compatibility reasons).

On the downside, users will not know whether they can use thousands separators in particular options. However this is already the case, e.g., samtools mpileup -r REGION has always allowed such commas but bcftools merge ‑r REGIONS does not.

Currently not for merging, just proposing API function changes. Thoughts?

@jkbonfield
Copy link
Contributor

It seems a sensible addition to me.

However it may perhaps be possible to parse unambiguously, albeit in a higher complexity parser (I forget my parser terms, but maybe it makes it LR instead of LL?). Some examples

"bcftools -r 11,13,13" - clearly 3 chromosomes.

"bcftools -r 11:1,000-2,000" - one 1k-2k region of chr11.

"bcftools -r 11:1,000-2,000,12:1,000-2,000,13:1,000-2,000" - obviously 1000-2000 bp in 3 separate chromosomes. The second comma in "-2,000,12:" is ambiguous until we observe the ":". This clearly means a new reference so backing up from there it becomes obvious ",12:" is new ref 12.

In short, right to left parsing instead of left to right will unambiguously allow us to handle commas in numbers and commas between chromosome names... PROVIDED no one uses a comma in their reference name! If you do that, then sucks to be you :P as it wouldn't work already.

@pd3
Copy link
Member

pd3 commented Aug 26, 2015

@jkbonfield Unfortunately, bcftools accepts also single positions, not just ranges (-r 1:123), making expressions like this ambiguous: -r 1:1,123 can be interpreted as -r 1:1123 but also as -r 1:1 -r 123.

@jkbonfield
Copy link
Contributor

Ah I see, so -r 1:1,123 is pos 1 onwards of chr 1 and all of chr 123. So yes, it's ambiguous. There goes that plan.

I also note though that it's basically breaking the SAM spec already. You are permitted comma in reference names, which means it already cannot work on all legal data.

One thing that would be useful in region parsing is to specify a file containing a list of regions. Using -r 1:100-200 -r 2:100-200 -r 3:100-200 is a pain in unix as you have to prefix all the regions with -r. A samtools view syntax works niceley where the region is just the end argument, as you can do view blah.bam cat regions.txt. However that blows up if you get too many.

Therefore more useful would be a region that meant "the contents of this file". Eg -r "*regions.txt" (I'd prefer @regions.txt, but it's not permitted) where we use something similar to a bed file with one region per line in an unambiguous notation.

@jeromekelleher
Copy link
Member

This isn't really anything to do with me, so sorry to butt in. I'm curious though: why allow this thousands separator at all? I agree it's pretty tedious typing in all the zeros for long coordinates, and it can be hard to keep track of how big the numbers are, but this seems quite complicated. Perhaps an alternative would be to allow exponent style notation, i.e.

bcftools -r 11:1e3-2e3 

Obviously for fully precise coordinates, you'd need to type the complete integer strings in, but for command line exploratory use I think this could be a useful shorthand. This would also substantially easier to parse, and be less likely to lead to ambiguity I think.

@jmarshall
Copy link
Member Author

@jkbonfield My original plan for hts_parse_decimal was to parse a substring (i.e., give it [str,strlim) rather than just str), which would facilitate parsing up to a bit before the next colon… but I discarded it on the grounds that that sort of thing leads to madness and unpredictability.

Unpredictability of the sort @pd3 points out, except that currently it treats -r 1:1,123 as -r 1:1 -r 123 (all of it).

@jkbonfield there are already also options for region-list files in these tools.

@jmarshall
Copy link
Member Author

@jeromekelleher: one of the advantages of this hts_parse_decimal() is that it does accept exponent-style notation, and could in future accept more.

Allowing commas as thousands separators is indeed a pain in the neck. Heng allowed them in regions (see bam_parse_region()) right from samtools 0.1.1 in 2008, so here we are.

@pd3
Copy link
Member

pd3 commented Aug 26, 2015

@jeromekelleher Funny you should mention this, because the scientific notation is how this whole thing started. By the way, the scientific representation is safe to use for precise coordinates, 1e4 is guaranteed to be interpreted as 10000. I agree thousands separator is not necessary, there is a parallel discussion here samtools/bcftools#309.

@pd3
Copy link
Member

pd3 commented Aug 26, 2015

This may be a good place to add a comment about one difference between region parsing in samtools and bcftools: single position, such as 20:12345 is interpreted by samtools to mean all records starting with 20:12345. In bcftools this means a single position. The samtools behaviour is achieved by giving -r 20:12345-.

@jkbonfield
Copy link
Contributor

On Wed, Aug 26, 2015 at 02:36:23AM -0700, pd3 wrote:

@jeromekelleher Funny you should mention this, because the
scientific notation is how this whole thing started. By the way, the
scientific representation is safe to use for precise coordinates, 1e4
is guaranteed to be interpreted as 10000. I agree thousands separator
is not necessary, there is a parallel discussion here
samtools/bcftools#309.

Frankly I never liked the scientific notations anyway for regions and
find tools that accept k, m, g suffixes to be far clearer to read and
easier to type. But that's another can of worms!

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@jmarshall
Copy link
Member Author

find tools that accept k, m, g suffixes to be far clearer to read

You might like to guess at what further flags I was envisioning 😄

…except I think you mean k, M, G

@jeromekelleher
Copy link
Member

@jmarshall --- that makes sense, thanks for clearing it up.

@pd3 --- I was thinking more about specifying a coordinate like 123456789, or 10000001. In this case, exponent notation doesn't do you any good, and you just have to type out all the digits.

[IN PROGRESS]  Need to figure out whether hts_parse_region() is workable
with a strend argument and the possibility of colons in chromosome names...
@jmarshall
Copy link
Member Author

As suggested by @jkbonfield, f859e8d (already on develop) fixes the motivating bug by adding flags to hts_parse_decimal(). This PR now represents the wishlist item of adding a similarly general hts_parse_region() with flags and end-of-string parameters that would allow it to be used by _regions_init_string() et al.

@jmarshall jmarshall added this to the wishlist milestone Sep 2, 2015
pd3 pushed a commit to pd3/htslib that referenced this pull request Sep 25, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants