Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_vcf cannot read VCF file #21

Closed
GeorgeGkafas opened this issue Jun 17, 2024 · 8 comments
Closed

read_vcf cannot read VCF file #21

GeorgeGkafas opened this issue Jun 17, 2024 · 8 comments
Assignees
Labels
help wanted Extra attention is needed

Comments

@GeorgeGkafas
Copy link

Dear Dr. Bitarello,
I have a vcf file and I'm keen on examining for evidence of balancing selection. Using the read_vcf command there is an error:

read_vcf(x=system.file(package="balselr", "crup.vcf"))
Error: Items of 'old' not found in column names: [#CHROM]. Consider skip_absent=TRUE.

Trying to incorprorate the skip_absent=TRUE,

read_vcf(x=system.file(package="balselr", "crup.vcf", skip_absent=TRUE))

Error in data.table::fread(x, skip = "##", header = TRUE) :
Input is empty or only contains BOM or terminal control characters

Any help, will be much appreciated.

@GeorgeGkafas GeorgeGkafas changed the title read_vcf cannot read VCF ile read_vcf cannot read VCF file Jun 17, 2024
@bbitarello
Copy link
Member

Dear @GeorgeGkafas,
thank you for using balselr. Since you sent me this input file, I will personally look into it to replicate the error.

Can you provide the following info:

  • operating system you're using
  • R version
  • R studio version (if using R studio)
    Thank you

@bbitarello bbitarello pinned this issue Jun 21, 2024
@bbitarello bbitarello self-assigned this Jun 21, 2024
@bbitarello bbitarello added the help wanted Extra attention is needed label Jun 21, 2024
@bbitarello
Copy link
Member

Oh, wait, I see the problem already.

read_vcf(x=system.file(package="balselr", "crup.vcf"))

This won't work with your personal file. In the example, we use:

read_vcf(x=system.file(package="balselr", "example.vcf"))

which means the function read_vcf() is reading in the file example.vcf that is part of the package (hence the system.file() function.

To read your own file you should simple provide the path and file name:

read_vcf(x="crup.vcf")

Please try that and report back if possible.

Bárbara

@GeorgeGkafas
Copy link
Author

Hi Barbara,
Thank you for your reply.
The error is still there

read_vcf(x="Contig1.f.GT.vcf")
Error: Items of 'old' not found in column names: [#CHROM]. Consider skip_absent=TRUE.

As I tried to figure out the issue, I found this on a help-page "If a function, it will be called with old and expected to return the new column names. The new column names must be the same length as columns provided to old argument." Do you think that the read_vcf tries to find the exact same number of individuals?

I use R version 4.1.2 (2021-11-01) -- "Bird Hippie", in a cluster sever using Linux.

Thank you
George

@bbitarello
Copy link
Member

Hello, I was able to reproduce this error with the file you sent ("crup.vcf"). It appears the file has strange formatting separating the columns. read_vcf() uses data.table::fread() to read in the file and expects tab delimited columns in the VCF format. Can you provide info on how this file was generated?
Thank you

@bbitarello
Copy link
Member

So the problem is your vcf file has tabs ("\t") in the info lines beginning with "##". This confused the data.table::fread(), which reads tabs as the delimiters. In your case it is trying to read the lines it is supposed to skip (those starting with "##").

The best thing to do would be to select the contents in your vcf of the first 66 lines beginning with ## and replace tabs with spaces. Then paste that back into your file and it should work.

Since this is not really a package issue I will likely not change the code, though I could eventually do so and replace the following line in read_vcf:

inp <- data.table::fread(x, skip = "##", header = TRUE)

with

inp <- data.table::fread(x, skip = "CHROM", header = TRUE)

since both work.

@bbitarello
Copy link
Member

bbitarello commented Jul 3, 2024

Here's how I did it in the command line. The modified version foryour file is read in properly by read.vcf:

#replace tabs with spaces in info lines
grep "##" crup.vcf|sed 's/\t/\s/g'  >  head.txt

#save new header lines into new file
cat head.txt >  crup_mod.vcf 

#add the non info lines to the new file
 grep -v "##" crup2.vcf  >> crup_mod.vcf 

then, in R:

> read_vcf(x="~/Desktop/crup_mod.vcf")
                 CHR   POS     ID    REF    ALT        QUAL FILTER
              <char> <int> <char> <char> <char>       <num> <char>
 1:  Crup_1041_33310    25      .      G      T 7.65490e+04      .
 2:  Crup_1041_33310    34      .      G      A 9.54484e+04      .

works perfectly.

Cheers

@GeorgeGkafas
Copy link
Author

Thank you very much!!!! It's much appreciated!

best,
George

@GeorgeGkafas
Copy link
Author

Hi,
Also if you delete all the "##" lines from the VCF file, but keep the first one (##fileformat=VCFv4.2) the parse_vcf works.
best,
George

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants