read_vcf cannot read VCF file #21

GeorgeGkafas · 2024-06-17T23:10:13Z

Dear Dr. Bitarello,
I have a vcf file and I'm keen on examining for evidence of balancing selection. Using the read_vcf command there is an error:

read_vcf(x=system.file(package="balselr", "crup.vcf"))
Error: Items of 'old' not found in column names: [#CHROM]. Consider skip_absent=TRUE.

Trying to incorprorate the skip_absent=TRUE,

read_vcf(x=system.file(package="balselr", "crup.vcf", skip_absent=TRUE))

Error in data.table::fread(x, skip = "##", header = TRUE) :
Input is empty or only contains BOM or terminal control characters

Any help, will be much appreciated.

bbitarello · 2024-06-21T17:37:33Z

Dear @GeorgeGkafas,
thank you for using balselr. Since you sent me this input file, I will personally look into it to replicate the error.

Can you provide the following info:

operating system you're using
R version
R studio version (if using R studio)
Thank you

bbitarello · 2024-06-21T17:40:06Z

Oh, wait, I see the problem already.

read_vcf(x=system.file(package="balselr", "crup.vcf"))

This won't work with your personal file. In the example, we use:

read_vcf(x=system.file(package="balselr", "example.vcf"))

which means the function read_vcf() is reading in the file example.vcf that is part of the package (hence the system.file() function.

To read your own file you should simple provide the path and file name:

read_vcf(x="crup.vcf")

Please try that and report back if possible.

Bárbara

GeorgeGkafas · 2024-06-21T20:51:04Z

Hi Barbara,
Thank you for your reply.
The error is still there

read_vcf(x="Contig1.f.GT.vcf")
Error: Items of 'old' not found in column names: [#CHROM]. Consider skip_absent=TRUE.

As I tried to figure out the issue, I found this on a help-page "If a function, it will be called with old and expected to return the new column names. The new column names must be the same length as columns provided to old argument." Do you think that the read_vcf tries to find the exact same number of individuals?

I use R version 4.1.2 (2021-11-01) -- "Bird Hippie", in a cluster sever using Linux.

Thank you
George

bbitarello · 2024-07-03T17:59:18Z

Hello, I was able to reproduce this error with the file you sent ("crup.vcf"). It appears the file has strange formatting separating the columns. read_vcf() uses data.table::fread() to read in the file and expects tab delimited columns in the VCF format. Can you provide info on how this file was generated?
Thank you

bbitarello · 2024-07-03T19:54:25Z

So the problem is your vcf file has tabs ("\t") in the info lines beginning with "##". This confused the data.table::fread(), which reads tabs as the delimiters. In your case it is trying to read the lines it is supposed to skip (those starting with "##").

The best thing to do would be to select the contents in your vcf of the first 66 lines beginning with ## and replace tabs with spaces. Then paste that back into your file and it should work.

Since this is not really a package issue I will likely not change the code, though I could eventually do so and replace the following line in read_vcf:

inp <- data.table::fread(x, skip = "##", header = TRUE)

with

inp <- data.table::fread(x, skip = "CHROM", header = TRUE)

since both work.

bbitarello · 2024-07-03T20:03:37Z

Here's how I did it in the command line. The modified version foryour file is read in properly by read.vcf:

#replace tabs with spaces in info lines
grep "##" crup.vcf|sed 's/\t/\s/g'  >  head.txt

#save new header lines into new file
cat head.txt >  crup_mod.vcf 

#add the non info lines to the new file
 grep -v "##" crup2.vcf  >> crup_mod.vcf

then, in R:

> read_vcf(x="~/Desktop/crup_mod.vcf")
                 CHR   POS     ID    REF    ALT        QUAL FILTER
              <char> <int> <char> <char> <char>       <num> <char>
 1:  Crup_1041_33310    25      .      G      T 7.65490e+04      .
 2:  Crup_1041_33310    34      .      G      A 9.54484e+04      .

works perfectly.

Cheers

GeorgeGkafas · 2024-07-03T20:05:42Z

Thank you very much!!!! It's much appreciated!

best,
George

GeorgeGkafas · 2024-07-04T14:20:30Z

Hi,
Also if you delete all the "##" lines from the VCF file, but keep the first one (##fileformat=VCFv4.2) the parse_vcf works.
best,
George

GeorgeGkafas changed the title ~~read_vcf cannot read VCF ile~~ read_vcf cannot read VCF file Jun 17, 2024

bbitarello pinned this issue Jun 21, 2024

bbitarello self-assigned this Jun 21, 2024

bbitarello added the help wanted Extra attention is needed label Jun 21, 2024

bbitarello closed this as completed Jul 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_vcf cannot read VCF file #21

read_vcf cannot read VCF file #21

GeorgeGkafas commented Jun 17, 2024

bbitarello commented Jun 21, 2024

bbitarello commented Jun 21, 2024

GeorgeGkafas commented Jun 21, 2024

bbitarello commented Jul 3, 2024

bbitarello commented Jul 3, 2024

bbitarello commented Jul 3, 2024 •

edited

Loading

GeorgeGkafas commented Jul 3, 2024

GeorgeGkafas commented Jul 4, 2024

read_vcf cannot read VCF file #21

read_vcf cannot read VCF file #21

Comments

GeorgeGkafas commented Jun 17, 2024

bbitarello commented Jun 21, 2024

bbitarello commented Jun 21, 2024

GeorgeGkafas commented Jun 21, 2024

bbitarello commented Jul 3, 2024

bbitarello commented Jul 3, 2024

bbitarello commented Jul 3, 2024 • edited Loading

GeorgeGkafas commented Jul 3, 2024

GeorgeGkafas commented Jul 4, 2024

bbitarello commented Jul 3, 2024 •

edited

Loading