[DISCUSS] Add HTS_PARSE_* parsing flags #260
It seems a sensible addition to me. However, it may be possible to parse unambiguously, albeit with a more complex parser (I forget my parser terms, but maybe it makes it LR instead of LL?). Some examples: "bcftools -r 11,13,13" is clearly 3 chromosomes. "bcftools -r 11:1,000-2,000" is one 1k-2k region of chr11. "bcftools -r 11:1,000-2,000,12:1,000-2,000,13:1,000-2,000" is obviously 1000-2000 bp in 3 separate chromosomes. The second comma in "-2,000,12:" is ambiguous until we observe the ":". This clearly means a new reference, so backing up from there it becomes obvious that ",12:" is new ref 12. In short, right-to-left parsing instead of left-to-right will unambiguously allow us to handle commas in numbers and commas between chromosome names... PROVIDED no one uses a comma in their reference name! If you do that, then sucks to be you :P as it wouldn't work already.
@jkbonfield Unfortunately, bcftools accepts also single positions, not just ranges ( |
Ah I see, so -r 1:1,123 is pos 1 onwards of chr 1 and all of chr 123. So yes, it's ambiguous. There goes that plan. I also note though that it's basically breaking the SAM spec already. You are permitted a comma in reference names, which means it already cannot work on all legal data. One thing that would be useful in region parsing is to specify a file containing a list of regions. Using -r 1:100-200 -r 2:100-200 -r 3:100-200 is a pain in Unix as you have to prefix all the regions with -r. A samtools view syntax works nicely where the region is just the end argument, as you can do view blah.bam Therefore more useful would be a region that meant "the contents of this file". E.g. -r "*regions.txt" (I'd prefer @regions.txt, but it's not permitted) where we use something similar to a bed file with one region per line in an unambiguous notation.
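To make the look-ahead idea and its counterexample concrete, here is a minimal sketch in C. The function name `count_regions_lookahead` and the exact rule it applies are assumptions made for illustration; this is not how bcftools or htslib actually split region lists. The rule: a comma is a thousands separator only when we are already past a ":" in the current region and the next special character after the comma is not another ":".

```c
#include <stddef.h>

/* Hypothetical helper for illustration only; not part of htslib or bcftools.
 * Counts the regions in a comma-separated list using the look-ahead rule:
 * a comma splits the list unless we are inside coordinates (a ':' has been
 * seen in the current region) and no ':' appears before the next comma. */
int count_regions_lookahead(const char *s)
{
    int n = 1;          /* a non-empty list holds at least one region */
    int in_coords = 0;  /* set once ':' has been seen in the current region */
    size_t i, j;

    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] == ':') {
            in_coords = 1;
        } else if (s[i] == ',') {
            /* Look ahead to the next ',' or ':' (or the end of the string). */
            j = i + 1;
            while (s[j] != '\0' && s[j] != ',' && s[j] != ':')
                j++;
            if (!in_coords || s[j] == ':') {
                n++;            /* list delimiter: a new reference name follows */
                in_coords = 0;
            }
            /* otherwise the comma is read as a thousands separator */
        }
    }
    return n;
}
```

Under this rule "11,13,13" and "11:1,000-2,000,12:1,000-2,000,13:1,000-2,000" both come out as three regions, but "1:1,123" comes out as one region (chr 1, position 1123), whereas bcftools reads it as two (position 1 of chr 1, plus all of chr 123). That is exactly the ambiguity described above.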
This isn't really anything to do with me, so sorry to butt in. I'm curious though: why allow this thousands separator at all? I agree it's pretty tedious typing in all the zeros for long coordinates, and it can be hard to keep track of how big the numbers are, but this seems quite complicated. Perhaps an alternative would be to allow exponent style notation, i.e.
Obviously for fully precise coordinates, you'd need to type the complete integer strings in, but for command line exploratory use I think this could be a useful shorthand. This would also be substantially easier to parse, and less likely to lead to ambiguity, I think.
@jkbonfield My original plan for Unpredictability of the sort @pd3 points out, except that currently it treats @jkbonfield there are already also options for region-list files in these tools. |
@jeromekelleher: one of the advantages of this Allowing commas as thousands separators is indeed a pain in the neck. Heng allowed them in regions (see |
@jeromekelleher Funny you should mention this, because the scientific notation is how this whole thing started. By the way, the scientific representation is safe to use for precise coordinates: 1e4 is guaranteed to be interpreted as 10000. I agree a thousands separator is not necessary; there is a parallel discussion at samtools/bcftools#309.
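As an illustration of why exponent notation can be exact, here is a minimal sketch of a coordinate parser that works purely in integer arithmetic, so "1e4" yields exactly 10000 with no floating-point rounding. The name `parse_sci_coord` is hypothetical and this is not the htslib implementation; it assumes non-negative input and an unsigned exponent, and does no overflow checking.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch, not the htslib implementation.
 * Parses digits with an optional fraction and an optional unsigned exponent
 * ("1e4", "1.5e6", "123456789") using integer arithmetic only.
 * If strend is non-NULL it is set to the first unparsed character.
 * No overflow checking; fractional remainders are truncated. */
long long parse_sci_coord(const char *str, char **strend)
{
    const char *s = str;
    long long n = 0;
    int frac_digits = 0, exponent = 0;

    while (isdigit((unsigned char) *s))
        n = n * 10 + (*s++ - '0');

    if (*s == '.') {
        s++;
        while (isdigit((unsigned char) *s)) {
            n = n * 10 + (*s++ - '0');
            frac_digits++;  /* each fractional digit costs one power of ten */
        }
    }

    if ((*s == 'e' || *s == 'E') && isdigit((unsigned char) s[1])) {
        s++;
        while (isdigit((unsigned char) *s))
            exponent = exponent * 10 + (*s++ - '0');
    }

    /* Scale by 10^(exponent - frac_digits) without leaving integers. */
    exponent -= frac_digits;
    while (exponent > 0) { n *= 10; exponent--; }
    while (exponent < 0) { n /= 10; exponent++; }

    if (strend)
        *strend = (char *) s;
    return n;
}
```

With this sketch, "1e4" parses to 10000 and "1.5e6" to 1500000, while a coordinate such as 123456789 still has to be typed out in full, as noted above.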
This may be a good place to add a comment about one difference between region parsing in samtools and bcftools: single position, such as |
On Wed, Aug 26, 2015 at 02:36:23AM -0700, pd3 wrote:
Frankly I never liked the scientific notations anyway for regions and The Wellcome Trust Sanger Institute is operated by Genome Research |
You might like to guess at what further flags I was envisioning 😄 …except I think you mean |
@jmarshall --- that makes sense, thanks for clearing it up. @pd3 --- I was thinking more about specifying a coordinate like 123456789, or 10000001. In this case, exponent notation doesn't do you any good, and you just have to type out all the digits. |
[IN PROGRESS] Need to figure out whether hts_parse_region() is workable with a strend argument and the possibility of colons in chromosome names...
As suggested by @jkbonfield, f859e8d (already on develop) fixes the motivating bug by adding flags to |
(via @wkretzsch) Closes samtools#260
Issue samtools/bcftools#309 is due to `hts_parse_decimal()` now matching `hts_parse_reg()` in allowing commas as thousands-separators, while `_regions_init_string()` expects to be able to use commas as list delimiters. In fact `_regions_init_string()` contains its own region parser, presumably because `hts_parse_reg()` was not suitable as `_regions_init_string()` wanted to treat commas specially itself.

One solution to this is to add a flags parameter to `hts_parse_decimal()` so the caller can say whether to eat thousands-separating commas. This would be handy as further flags can be envisioned, and `hts_parse_decimal()` has not appeared in an htslib release yet so we can still change its signature.

It would also be useful to be able to specify the same flags in region parsing; this proposes adding a new `hts_parse_region()` alongside the existing `hts_parse_reg()` (which is of course kept as is for compatibility reasons).

On the downside, users will not know whether they can use thousands separators in particular options. However this is already the case, e.g., `samtools mpileup -r REGION` has always allowed such commas but `bcftools merge -r REGIONS` does not.

Currently not for merging, just proposing API function changes. Thoughts?
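The flags idea in the description can be sketched roughly as follows. The names `sketch_parse_decimal` and `SKETCH_PARSE_THOUSANDS_SEP` are placeholders invented for this illustration; the real signature and flag names are whatever the proposed commits define. The point is only that the caller decides whether a comma is eaten as a thousands separator or left in place as a list delimiter.

```c
#include <ctype.h>
#include <stddef.h>

/* Placeholder flag name for illustration; not the actual htslib constant. */
#define SKETCH_PARSE_THOUSANDS_SEP 1

/* Hypothetical sketch of a flag-controlled integer parser.
 * With SKETCH_PARSE_THOUSANDS_SEP set, a comma between digits is skipped;
 * without it, parsing stops at the comma so the caller can treat it as a
 * list delimiter. If strend is non-NULL it receives the stop position. */
long long sketch_parse_decimal(const char *str, char **strend, int flags)
{
    const char *s = str;
    long long n = 0;

    for (;;) {
        if (isdigit((unsigned char) *s)) {
            n = n * 10 + (*s++ - '0');
        } else if (*s == ',' && (flags & SKETCH_PARSE_THOUSANDS_SEP) &&
                   s > str && isdigit((unsigned char) s[1])) {
            s++;  /* eat a thousands-separating comma */
        } else {
            break;
        }
    }

    if (strend)
        *strend = (char *) s;
    return n;
}
```

A caller like `samtools mpileup -r` would pass the flag and read "1,123" as 1123, while a caller that splits lists on commas would pass 0, get 1 back, and find the stop pointer sitting on the comma.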