Segmentation fault on MashMap step of generateDecoyTranscriptome.sh #5

Open
jaclyn-taroni opened this issue Jun 5, 2019 · 14 comments

Comments

@jaclyn-taroni

Hi all,

I get Segmentation fault (core dumped) on step 3 of generateDecoyTranscriptome.sh.

I've filed marbl/MashMap#21 upstream with more detailed information. I wanted to file an issue here in case you have any insight or I am using the script improperly.

Here's how I'm using this:

bash scripts/generateDecoyTranscriptome.sh \
	-j 8 \
	-g Homo_sapiens.GRCh38.dna.toplevel.fa \
	-t Homo_sapiens.GRCh38.cdna.all.fa \
	-a Homo_sapiens.GRCh38.96.gtf \
	-o ${human_output}

I realize you have gentrome.fa and decoys.txt for human here: https://github.com/COMBINE-lab/salmon#pre-computed-decoy-transcriptomes

I'm interested in generating this for zebrafish and happened to run into this problem with human first, before I found those files in the Salmon README.

Thank you!

@k3yavi
Member

k3yavi commented Jun 6, 2019

Hi @jaclyn-taroni ,

Thanks for raising this issue; one other user is facing a similar issue with the human genome.
While the MashMap folks and we look for the cause and a solution, if you can send me links to the zebrafish genome and GTF, I can run it on our system and send you the decoy sequences.

@jaclyn-taroni
Author

Hi @k3yavi,

Thanks for the quick reply and the offer. I was planning on using the most recent Ensembl release for zebrafish. Here are the relevant links:

ftp://ftp.ensembl.org/pub/release-96/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.toplevel.fa.gz
ftp://ftp.ensembl.org/pub/release-96/gtf/danio_rerio/Danio_rerio.GRCz11.96.gtf.gz

Thanks again!

@rob-p
Contributor

rob-p commented Jun 7, 2019

Hi @jaclyn-taroni,

@k3yavi has built the decoy transcriptome for zebrafish; you can grab it from the link on the salmon README.

--Rob

@jaclyn-taroni
Author

Thank you very much @k3yavi and @rob-p!

@cmatKhan

hi @k3yavi

I'm getting the same error with data from a tick species -- any chance you'd be willing to run this for me, too?

The genome is (we use the first one, Ixodes-Scapularis-ISE6_...):
https://www.vectorbase.org/downloads?field_organism_taxonomy_tid%5B%5D=340&field_download_file_type_tid%5B%5D=457&field_download_file_format_tid=All&field_status_value=Current

The .gtf (ISE6, same as above):
https://www.vectorbase.org/downloads?field_organism_taxonomy_tid%5B%5D=340&field_download_file_type_tid%5B%5D=412&field_download_file_format_tid=473&field_status_value=Current

And a transcriptome that is as yet unpublished and not posted anywhere -- I'd have to send it.

@k3yavi
Member

k3yavi commented Jun 26, 2019

Hi @cmatKhan ,
Ixodes_scapularis.tar.gz should do it.

@choulabucsf

Very much appreciated.

I realized after I hit send that there is a transcriptome on vectorbase -- I assume that's what you used?

@k3yavi
Member

k3yavi commented Jun 26, 2019

Actually, I just used the GTF and the genome to extract the transcriptome.
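
For readers who want to do the same kind of extraction locally, here is a minimal sketch. It assumes gffread is installed; the thread does not say which tool was actually used, and the function name and file names are placeholders.

```shell
# Sketch: build a transcriptome FASTA from a genome FASTA plus a GTF.
# Assumption: gffread is on PATH (the tool used in the thread is not stated).
extract_transcriptome() {
    local genome_fa="$1" gtf="$2" out_fa="$3"
    if ! command -v gffread >/dev/null 2>&1; then
        echo "gffread not found on PATH" >&2
        return 127
    fi
    # -g: genome FASTA to pull sequence from
    # -w: write the spliced exon sequence of each transcript to this FASTA
    gffread -w "$out_fa" -g "$genome_fa" "$gtf"
}
```

It would be called as `extract_transcriptome genome.fa annotation.gtf transcripts.fa`, with the inputs uncompressed.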

@k3yavi
Member

k3yavi commented Jul 17, 2019

Hi Guys,

Just a heads-up: we have curated decoy sequences for a subset of model organisms, and they can be found here.

@doubtfulresearch

doubtfulresearch commented Jul 24, 2019

I'm having this issue as well. I've tried it on a couple of machines, although the most RAM so far is 24GB (20 free).

Any chance you could generate decoys for RefSeq human and mouse? RefSeq provides GFF annotation files, which I was feeding directly into step 2 (instead of the exons.bed). Step 2 completes fine, but step 3 fails pretty early with a segmentation fault.

Alternatively, can you give an estimate of how much RAM this script uses on a machine where it completes successfully? Also, how long does it typically take? I've not used MashMap before. I did a trial run with a smaller genome and 10 threads, and while it didn't segfault, I gave up after ~6 hours in step 3. I didn't really need the decoys, but I was surprised at how long it was taking.

Thanks!
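
As an aside on the GFF input mentioned above: a minimal sketch of pulling exon intervals out of a GFF3 or GTF file into BED-style coordinates. Whether this matches the exact layout the script expects for exons.bed is an assumption, and the function name is hypothetical.

```shell
# Sketch: print exon intervals from a GFF3/GTF file as tab-separated
# seqid, start, end. GFF/GTF starts are 1-based and BED starts are 0-based,
# so the start is shifted by one. Comment lines (starting with '#') are skipped.
gff_exons_to_bed() {
    awk -F '\t' -v OFS='\t' \
        '!/^#/ && $3 == "exon" { print $1, $4 - 1, $5 }' "$1"
}
```

For example, `gff_exons_to_bed annotation.gff > exons.bed` writes one line per exon feature.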

@k3yavi
Member

k3yavi commented Jul 25, 2019

Hi, please fill out the decoy generation request form at https://forms.gle/3baJc5SYrkSWb1z48 and we will let you know once we have the decoys.

On our machine it took ~100G of RAM and approximately an hour to run on human GENCODE data.

Thanks!
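
Given the ~100G figure above, it may be worth checking available memory before starting a long run. A minimal sketch, assuming a Linux machine with /proc/meminfo; both function names are hypothetical, and the threshold is only the human GENCODE number quoted above, not a general requirement.

```shell
# Sketch: report available memory in whole GiB, read from /proc/meminfo (Linux only).
# An alternative meminfo-style file can be passed as the first argument.
mem_available_gb() {
    local meminfo="${1:-/proc/meminfo}"
    # MemAvailable is reported in kB; convert to GiB and truncate.
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "$meminfo"
}

# Sketch: return non-zero if less than the requested number of GiB is available.
require_mem_gb() {
    local needed="$1" meminfo="${2:-/proc/meminfo}"
    local have
    have="$(mem_available_gb "$meminfo")"
    if [ -z "$have" ] || [ "$have" -lt "$needed" ]; then
        echo "only ${have:-unknown} GiB available, ${needed} GiB requested" >&2
        return 1
    fi
}
```

Something like `require_mem_gb 100 && bash scripts/generateDecoyTranscriptome.sh ...` would then refuse to start on a 24GB machine instead of failing partway through step 3.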

@k3yavi
Member

k3yavi commented Nov 3, 2019

Hi guys,

Just wanted to let you know that we recently released a new version of salmon in which you don't have to run the MashMap pipeline explicitly. As of v1.0, salmon can consume both the genome and the transcriptome without needing annotations. Please check out the new preprint or follow this tutorial for reindexing.
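
A rough sketch of what that workflow can look like, using the whole genome as the decoy. It assumes uncompressed FASTA inputs whose headers separate the sequence ID from the description with a space; the function names, output paths, and thread count are placeholders, and the linked tutorial remains the authoritative reference.

```shell
# Sketch: prepare inputs for a decoy-aware salmon index (salmon >= 1.0).
# Assumes uncompressed FASTA files; for .gz inputs, decompress first.
prepare_gentrome() {
    local genome_fa="$1" transcripts_fa="$2" outdir="$3"
    mkdir -p "$outdir"
    # Decoy names are the genome sequence IDs: the header text after '>'
    # up to the first space.
    grep '^>' "$genome_fa" | cut -d ' ' -f 1 | sed 's/^>//' > "$outdir/decoys.txt"
    # Transcripts go first, then the genome sequences that serve as decoys.
    cat "$transcripts_fa" "$genome_fa" > "$outdir/gentrome.fa"
}

# Sketch: build the index from the prepared files (requires salmon on PATH).
# The index directory name and thread count are placeholders.
build_decoy_index() {
    local outdir="$1" threads="${2:-8}"
    salmon index -t "$outdir/gentrome.fa" -d "$outdir/decoys.txt" \
        -i "$outdir/salmon_index" -p "$threads"
}
```

Running `prepare_gentrome genome.fa transcripts.fa out` followed by `build_decoy_index out 8` would produce the index under `out/salmon_index`. No GTF is needed at any point, which is the practical difference from the MashMap-based script.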

@lpantano

lpantano commented Nov 4, 2019

Thank you so much! I asked in the chat, but just in case: do you have an estimate of memory use during indexing and quantification, assuming a human-genome-like reference? Thanks!

@rob-p
Contributor

rob-p commented Nov 4, 2019

Hi @lpantano,

Indexing with the entire human genome as decoy and the whole transcriptome (GENCODE v29) as the actual target sequence takes ~20G of RAM in our runs. The final (dense) index is ~19G, so construction needs only a little more RAM than the index itself. Interestingly, while the final index built with the whole genome as decoy is considerably bigger than one built with the MashMap decoy sequences, the indexing memory is quite a bit smaller.

6 participants