Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible redundant rowname assignment in tximport.r #1365

Closed
pmoris opened this issue Aug 29, 2024 · 2 comments
Closed

Possible redundant rowname assignment in tximport.r #1365

pmoris opened this issue Aug 29, 2024 · 2 comments

Comments

@pmoris
Copy link
Contributor

pmoris commented Aug 29, 2024

I was running the tximport.r code manually while trying to wrap my head around how it works and how it relates to the different types of salmon quant files and gene_count files generated by the pipeline (see https://nfcore.slack.com/archives/CE8SSJV3N/p1724863202019919?thread_ts=1686156473.934039&cid=CE8SSJV3N). While doing so, I stumbled upon the following line of code that seems to be doing nothing?

rownames(gene_info) <- gene_info[["tx"]]

As far as I can tell, gene_info is derived from transcript_info which contains the columns "gene_id", "gene_name", but not tx. So, at least when I was trying to reproduce the steps manually, the assignment did nothing (NULL was being assigned to the row names, changing nothing).

Related, the line just above it

gene_info <- transcript_info$gene[match(rownames(gi[[1]]), transcript_info$gene[["gene_id"]]),]

also appears to be redundant. I believe gene_info just ends up holding the contents of transcript_info$gene. Or are there cases where the genes resulting from summarizeToGene can differ from those in transcript_info (which are ultimately derived from the tx2gene file)?

Caveat to all of the above: since I'm actively trying to understand the code and tximport methodology, I might be completely off the mark here ;)

If anyone can confirm, I'd be happy to make a PR for this.

@pinin4fjords
Copy link
Member

pinin4fjords commented Aug 29, 2024

Note that all of this logic is now moved to an nf-core module (see the dev branch). But the lines you highlight are still there, so I'll explain:

gene_info <- transcript_info\$gene[match(rownames(gi[[1]]), transcript_info\$gene[["gene_id"]]),]

This fetches the gene-level metadata table from the transcript_info list object, and also sorts it to match the gene order in the gi object. The gene metadata table is a uniquified subset of the gene-wise columns from the input metadata table.

So that line is not redundant.

But yes, this line probably is, maybe a relic of prior development. It's not doing any harm, but feel free to PR on nf-core/modules, and once it's in we can bump the module here.

@pmoris
Copy link
Contributor Author

pmoris commented Aug 29, 2024

Cheers, thanks for the explainer!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants