Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't find any of the first 10 BED transcript_ids in fasta file #22

Open
eggrandio opened this issue Aug 30, 2024 · 1 comment
Open

Comments

@eggrandio
Copy link

Hi,

I am getting the

Can't find any of the first 10 BED transcript_ids in fasta file

error while the transcript_ids of the bed file are clearly in the transcript fasta file.

I parsed the .gtf file with 'agat_convert_sp_gff2gtf.pl' and made a custom bash script to retain only the transcript_id as the FASTA header so there are no issues. Here is a sample of the gtf, the bed and the fasta files:
GTF:

#!genome-build DOE Joint Genome Institute Ccitriodora_v2_1
#!genome-version Ccitriodora_v2_1
#!genome-date 2020-10
#!genome-build-accession GCA_014858505.1
#!genebuild-last-updated 2021-09
1	Ccitriodora_v2_1	region	1	31549113	.	.	.	gene_id "region:1"; Alias "CM026410.1"; ID "region:1";
1	JGI	gene	17519	18454	.	+	.	gene_id "gene-BT93_A0001"; ID "gene:gene-BT93_A0001"; biotype "protein_coding"; logic_name "genemodel_jgi";
1	JGI	mRNA	17519	18454	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; Parent "gene:gene-BT93_A0001"; biotype "protein_coding"; tag "Ensembl_canonical";
1	JGI	exon	17519	17859	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "2"; ensembl_phase "0"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; rank "1";
1	JGI	exon	17879	17949	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "1"; ensembl_phase "2"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; rank "2";
1	JGI	exon	18051	18454	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "0"; ensembl_phase "1"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; rank "3";
1	JGI	CDS	17519	17859	.	+	0	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";
1	JGI	CDS	17879	17949	.	+	1	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";
1	JGI	CDS	18051	18454	.	+	2	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";

BED:

1	17518	18454	rna-gnl|WGS:JABURB|Cocit.A0001.1	0	+	17518	18454	0	3	341,71,404,	0,360,532,
1	31589	32512	rna-gnl|WGS:JABURB|Cocit.A0002.1	0	+	31589	32512	0	2	415,407,	0,516,
1	46163	47204	rna-gnl|WGS:JABURB|Cocit.A0003.1	0	+	46163	47204	0	2	441,475,	0,566,
1	62155	63760	rna-gnl|WGS:JABURB|Cocit.A0004.1	0	+	62155	63760	0	2	421,407,	0,1198,
1	64048	68743	rna-gnl|WGS:JABURB|Cocit.A0005.1	0	-	64048	68743	0	5	352,116,158,74,1541,	0,1195,1836,2993,3154,
1	70379	70571	rna-gnl|WGS:JABURB|Cocit.A0006.1	0	-	70379	70571	0	1	192,	0,
1	76482	79181	rna-gnl|WGS:JABURB|Cocit.A0007.1	0	+	76482	79181	0	3	360,387,570,	0,1132,2129,
1	76482	79181	rna-gnl|WGS:JABURB|Cocit.A0007.2	0	+	76482	79181	0	3	360,309,570,	0,1132,2129,
1	79626	88276	rna-gnl|WGS:JABURB|Cocit.A0008.1	0	-	79626	88276	0	10	551,102,86,272,92,138,71,181,157,243,	0,1512,1761,2584,4213,4387,5882,6738,7027,8407,
1	90394	103967	rna-gnl|WGS:JABURB|Cocit.A0009.1	0	+	90394	103967	0	17	450,91,119,191,211,241,416,109,200,350,97,219,240,210,286,197,882,	0,603,924,2440,2718,4600,5837,7150,7535,7867,8376,8552,9331,10114,11699,12126,12691,
1	111647	117022	rna-gnl|WGS:JABURB|Cocit.A0010.1	0	+	111647	117022	0	8	264,129,855,246,24,141,48,471,	0,1487,1745,3106,3458,3742,4009,4904,

FASTA:

>rna-gnl|WGS:JABURB|Cocit.A0001.1
ATGACAGCCCTCAAGCTCAAGAAGCTCCTCCTGACCGCCATCGCGGTCGCTGGGATCGTTGTCTCTGCTCTGCCTGACACCGCCTCGGCCCAGAACTGCGGGTGTGCAGCCAACC

Maybe it is because of the symbols in the transcript names?

@etwatson
Copy link

etwatson commented Nov 8, 2024

So, there is an edit that is needed to fix this problem, especially if you installed with pip3, since it doesn't appear that the fix is in that dist. Check the thread here.

To find where your check_strandedness.py file is, run:
pip3 show how_are_we_stranded_here

Then, you can make the edit there to fix the problem.

Oh, and you will probably need to fix another bug as stated here.

Your welcome @signalbash

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants