Ingest INRB data with permission #242

Merged
merged 2 commits into master from add-inrb-with-permission
Apr 30, 2024

j23414
Contributor

@j23414 j23414 commented Apr 26, 2024

Description of proposed changes

From Slack, it was requested that the new INRB data on mpox clade I from https://www.medrxiv.org/content/10.1101/2024.04.12.24305195v2 be added to our Nextstrain analysis. INRB is working to add the data to NCBI, so this is a temporary solution similar to what has been done previously.

After obtaining permission to do so, this PR temporarily adds the records here to be included in the curated dataset.

Please feel free to push further commits to this branch or suggest changes.

Related issue(s)

Checklist

  • Checks pass

@j23414 j23414 requested review from trvrb and a team April 26, 2024 23:04
j23414 added 2 commits April 26, 2024 16:06
From the Slack channel (https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1714159380201999),
it was requested that the new INRB data on mpox clade 1b from the following preprint be added to our Nextstrain analysis:

* https://www.medrxiv.org/content/10.1101/2024.04.12.24305195v2

INRB is working to add the data to NCBI, so this is a temporary solution similar to:

* fb871ef#diff-2b15577b072066f9c4c63eeb20343e6dc4f1e40ed43239d702743648ef35325eR2

After obtaining permission to do so, this PR temporarily adds the records here to be included in the curated dataset.

* Mainly followed instructions from [Adding new sequences not from GenBank](https://github.com/nextstrain/mpox/tree/59eaf472e7ca870567f21d83e082942fd31a3646/ingest#static-files)
* Assigned records temporary IDs `TMP0000` to `TMP0046`
* Set `authors` to "INRB"
@j23414 j23414 force-pushed the add-inrb-with-permission branch from ee7e06b to 613d9ab Compare April 26, 2024 23:08
@corneliusroemer
Member

corneliusroemer commented Apr 29, 2024

Thanks @j23414! How did you populate the metadata? From the FASTA headers? We might need to s/find/replace some of the fields to conform with what ingest expects them to be called - unless you've already done so manually!

Have you done a test run of ingest to see whether the output looks right? Would be good to do that and link to the results! I'll see whether I can do that now.

This might work: https://github.com/nextstrain/mpox/actions/runs/8883875132
Will have to check once workflow is done.

@j23414
Contributor Author

j23414 commented Apr 29, 2024

How did you populate the metadata? From the FASTA headers?

Hi @corneliusroemer! I hacked a fix for the fasta file headers using the following perl script (add_ids.pl):

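#! /usr/bin/env perl
# add_ids.pl: prefix each FASTA header with a temporary ID and fixed metadata fields.

use strict;
use warnings;

# Pre-generate the temporary IDs TMP0000 .. TMP0099
# (Perl's string range increments the numeric suffix).
my @TMPIDS=();

for my $i ("TMP0000" .. "TMP0099") {
    push @TMPIDS, $i;
}

my $i=0;
while(<>){
  if(/>(.*)/){
    # Header line: emit ">ID|authors|region|country|<original header>".
    # Note: only 100 IDs are available, so inputs are assumed to have at most 100 records.
    my $header=$1;
    print ">$TMPIDS[$i++]";
    print "|INRB";
    print "|Africa";
    print "|Democratic Republic of the Congo";
    print "|$header\n";
  }else{
    # Sequence line: pass through unchanged.
    print;
  }
}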

Then ran

perl add_ids.pl ingest/submission01_mpox47_2024.fasta > fixedheaders.fasta
./ingest/bin/fasta-to-ndjson \
 --fasta fixedheaders.fasta \
 --fields genbank_accession authors region country strain host ocountry division collected \
 --exclude ocountry \
 > ingest/data/inrb.ndjson

And then kept checking nextstrain build ingest runs, editing the field names as needed to get it to run successfully.
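Since the `--fields` list above names nine fields, one way to shorten that trial-and-error loop is to confirm that every rewritten header actually has nine pipe-delimited fields before converting. This is only a minimal sketch: the function name `check_header_fields` is hypothetical and not part of the ingest tooling, and it assumes headers are `|`-delimited as produced by `add_ids.pl`.

```shell
# check_header_fields FASTA EXPECTED_COUNT   (hypothetical helper, not part of ingest)
# Prints every header line whose pipe-delimited field count differs from
# EXPECTED_COUNT and returns non-zero if any such header is found.
check_header_fields() {
  awk -F'|' -v want="$2" '
    # Only header lines (starting with ">") are inspected; sequence lines are skipped.
    /^>/ && NF != want { print FILENAME ":" FNR ": " NF " fields: " $0; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
```

For example, `check_header_fields fixedheaders.fasta 9` would print nothing and return zero when every header lines up with the nine names passed to `--fields`.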

This might work: https://github.com/nextstrain/mpox/actions/runs/8883875132
Will have to check once workflow is done.

Ohh, thanks for submitting the GitHub Action check! 🙌 Should be able to grep "TMP" from the final sequences.fasta and metadata.tsv files.
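A rough sketch of that check, assuming the workflow outputs have been downloaded and decompressed to plain `sequences.fasta` and `metadata.tsv`, that each temporary ID appears at the start of a FASTA header, and that it appears as a whole tab-separated field in the metadata. The function names are hypothetical.

```shell
# Count FASTA records whose header starts with a temporary ID (TMP + four digits).
count_tmp_fasta() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the function usable under set -e.
  grep -c -E '^>TMP[0-9]{4}' "$1" || true
}

# Count metadata rows in which some tab-separated field is exactly a temporary ID.
count_tmp_metadata() {
  awk -F'\t' '
    { for (i = 1; i <= NF; i++) if ($i ~ /^TMP[0-9][0-9][0-9][0-9]$/) { n++; break } }
    END { print n + 0 }
  ' "$1"
}
```

Running `count_tmp_fasta sequences.fasta` and `count_tmp_metadata metadata.tsv` should give the same number if every temporary record made it through ingest; with IDs `TMP0000` to `TMP0046` assigned, that number would be expected to be 47.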

@corneliusroemer
Member

Great, thanks for filling me in on the details! There might be a typo in one of your commands: ocountry rather than country.

@j23414
Contributor Author

j23414 commented Apr 29, 2024

ocountry

Thanks for pointing this out! This was on purpose (--exclude ocountry) ;) It's so I could create a new country column and avoid

sed 's/DRC/Democratic Republic of the Congo/g'

While DRC shouldn't be in the nucleotides section of a fasta file, I've seen stranger things.

@corneliusroemer
Copy link
Member

Test run seems to have worked!

wget data.nextstrain.org/files/workflows/mpox/branch/add-inrb-with-permission/metadata.tsv.gz
wget data.nextstrain.org/files/workflows/mpox/branch/add-inrb-with-permission/sequences.fasta.xz      

I'll merge then as it simplifies including the sequences in our builds. If there are outliers/issues, we can always simply exclude the accessions post-ingest, in the phylogenetic/nextclade workflows.

@corneliusroemer corneliusroemer merged commit 56fb8cb into master Apr 30, 2024
26 checks passed
@corneliusroemer corneliusroemer deleted the add-inrb-with-permission branch April 30, 2024 15:04
@joverlee521
Contributor

Nice work @j23414! I hope the old instructions were somewhat helpful? Please feel free to update them with any extra steps you had to take here.


Thanks for pointing this out! This was on purpose (--exclude ocountry) ;) It's so I could create a new country column and avoid

sed 's/DRC/Democratic Republic of the Congo/g'

While DRC shouldn't be in the nucleotides section of a fasta file, I've seen stranger things.

FYI, since all sources go through the ingest pipeline, you could have added this to the geolocation-rules.tsv as

Africa/DRC/*/*    Africa/Democratic Republic of the Congo/*/*
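To illustrate how such a rule would rewrite a record, here is a small stand-alone sketch. It is not the ingest pipeline's own implementation. It assumes the two columns are tab-separated, that `*` on the left matches any value, and that `*` on the right keeps the record's original value; the function name `apply_geo_rule` is hypothetical.

```shell
# apply_geo_rule RULES_TSV "region/country/division/location"   (hypothetical helper)
# Prints the location rewritten by the first matching rule,
# or the input unchanged when no rule matches.
apply_geo_rule() {
  awk -F'\t' -v loc="$2" '
    BEGIN { n = split(loc, in_parts, "/") }
    {
      # Skip lines that are not two slash-separated paths of the same depth as the input.
      if (split($1, raw, "/") != n || split($2, out, "/") != n) next
      ok = 1
      for (i = 1; i <= n; i++)
        if (raw[i] != "*" && raw[i] != in_parts[i]) ok = 0
      if (!ok) next
      # Build the result: "*" on the right-hand side keeps the original value.
      res = ""
      for (i = 1; i <= n; i++)
        res = res (i > 1 ? "/" : "") (out[i] == "*" ? in_parts[i] : out[i])
      print res; found = 1; exit
    }
    END { if (!found) print loc }
  ' "$1"
}
```

With a rules file containing only the line above, `apply_geo_rule rules.tsv 'Africa/DRC/SomeDivision/SomeLocation'` would print `Africa/Democratic Republic of the Congo/SomeDivision/SomeLocation`, while records from other countries pass through unchanged.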
