Can't find any of the first 10 BED transcript_ids in fasta file #22

eggrandio · 2024-08-30T14:08:23Z

Hi,

I am getting the

Can't find any of the first 10 BED transcript_ids in fasta file

error while the transcript_ids of the bed file are clearly in the transcript fasta file.

I parsed the .gtf file with 'agat_convert_sp_gff2gtf.pl' and made a custom bash script to retain only the transcript_id as the FASTA header so there are no issues. Here is a sample of the gtf, the bed and the fasta files:
GTF:

#!genome-build DOE Joint Genome Institute Ccitriodora_v2_1
#!genome-version Ccitriodora_v2_1
#!genome-date 2020-10
#!genome-build-accession GCA_014858505.1
#!genebuild-last-updated 2021-09
1	Ccitriodora_v2_1	region	1	31549113	.	.	.	gene_id "region:1"; Alias "CM026410.1"; ID "region:1";
1	JGI	gene	17519	18454	.	+	.	gene_id "gene-BT93_A0001"; ID "gene:gene-BT93_A0001"; biotype "protein_coding"; logic_name "genemodel_jgi";
1	JGI	mRNA	17519	18454	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; Parent "gene:gene-BT93_A0001"; biotype "protein_coding"; tag "Ensembl_canonical";
1	JGI	exon	17519	17859	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "2"; ensembl_phase "0"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; rank "1";
1	JGI	exon	17879	17949	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "1"; ensembl_phase "2"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; rank "2";
1	JGI	exon	18051	18454	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "0"; ensembl_phase "1"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; rank "3";
1	JGI	CDS	17519	17859	.	+	0	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";
1	JGI	CDS	17879	17949	.	+	1	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";
1	JGI	CDS	18051	18454	.	+	2	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";

BED:

1	17518	18454	rna-gnl|WGS:JABURB|Cocit.A0001.1	0	+	17518	18454	0	3	341,71,404,	0,360,532,
1	31589	32512	rna-gnl|WGS:JABURB|Cocit.A0002.1	0	+	31589	32512	0	2	415,407,	0,516,
1	46163	47204	rna-gnl|WGS:JABURB|Cocit.A0003.1	0	+	46163	47204	0	2	441,475,	0,566,
1	62155	63760	rna-gnl|WGS:JABURB|Cocit.A0004.1	0	+	62155	63760	0	2	421,407,	0,1198,
1	64048	68743	rna-gnl|WGS:JABURB|Cocit.A0005.1	0	-	64048	68743	0	5	352,116,158,74,1541,	0,1195,1836,2993,3154,
1	70379	70571	rna-gnl|WGS:JABURB|Cocit.A0006.1	0	-	70379	70571	0	1	192,	0,
1	76482	79181	rna-gnl|WGS:JABURB|Cocit.A0007.1	0	+	76482	79181	0	3	360,387,570,	0,1132,2129,
1	76482	79181	rna-gnl|WGS:JABURB|Cocit.A0007.2	0	+	76482	79181	0	3	360,309,570,	0,1132,2129,
1	79626	88276	rna-gnl|WGS:JABURB|Cocit.A0008.1	0	-	79626	88276	0	10	551,102,86,272,92,138,71,181,157,243,	0,1512,1761,2584,4213,4387,5882,6738,7027,8407,
1	90394	103967	rna-gnl|WGS:JABURB|Cocit.A0009.1	0	+	90394	103967	0	17	450,91,119,191,211,241,416,109,200,350,97,219,240,210,286,197,882,	0,603,924,2440,2718,4600,5837,7150,7535,7867,8376,8552,9331,10114,11699,12126,12691,
1	111647	117022	rna-gnl|WGS:JABURB|Cocit.A0010.1	0	+	111647	117022	0	8	264,129,855,246,24,141,48,471,	0,1487,1745,3106,3458,3742,4009,4904,

FASTA:

>rna-gnl|WGS:JABURB|Cocit.A0001.1
ATGACAGCCCTCAAGCTCAAGAAGCTCCTCCTGACCGCCATCGCGGTCGCTGGGATCGTTGTCTCTGCTCTGCCTGACACCGCCTCGGCCCAGAACTGCGGGTGTGCAGCCAACC

Maybe it is because of the symbols in the transcript names?

The text was updated successfully, but these errors were encountered:

etwatson · 2024-11-08T22:03:41Z

So, there is an edit that is needed to fix this problem, especially if you installed with pip3, since it doesn't appear that the fix is in that dist. Check the thread here.

To find where your check_strandedness.py file is, run:
pip3 show how_are_we_stranded_here

Then, you can make the edit there to fix the problem.

Oh, and you will probably need to fix another bug as stated here.

Your welcome @signalbash

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can't find any of the first 10 BED transcript_ids in fasta file #22

Can't find any of the first 10 BED transcript_ids in fasta file #22

eggrandio commented Aug 30, 2024

etwatson commented Nov 8, 2024 •

edited

Loading

Can't find any of the first 10 BED transcript_ids in fasta file #22

Can't find any of the first 10 BED transcript_ids in fasta file #22

Comments

eggrandio commented Aug 30, 2024

etwatson commented Nov 8, 2024 • edited Loading

etwatson commented Nov 8, 2024 •

edited

Loading