Add dengue lineages dataset #223

jamessiqueirap · 2024-08-12T20:58:19Z

These datasets are based on the dengue virus lineage systems described by Verity et al., 2024, and are suitable for the analysis of viral sequences from the four dengue virus serotypes.

ivan-aksamentov · 2024-08-13T08:02:39Z

Hi @jamessiqueirap! Thanks!

I will let our science team review. It has been challenging for them to produce serotype datasets so far. Let's see what they say.

In the meantime a couple of technical nuances:

Could you please create an additional level of directories to make sure the datasets for any pathogens are not directly in the community/ directory, as described here: https://github.com/nextstrain/nextclade_data/blob/master/docs/dataset-curation-guide.md#dataset-paths

> [...] We only ask to not submit datasets directly into the community/, to avoid clashes between datasets from different authors and organizations. [...]

This is not mandatory, but it would be nice to have some example sequences for each dataset as sequences.fasta (and to declare them in the pathogen.json field "files" as "examples": "sequences.fasta"). This allows Nextclade users to quickly try a dataset and decide whether they like it. It also helps the Nextstrain team review the datasets and debug our software. So if you have some sequences with permissive licenses, please add them. Somewhere between 10 and 100 sequences should be perfect.
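As a rough illustration of the "examples" declaration described above, the following sketch adds the entry to an in-memory copy of a pathogen.json. Only the "files" field and the "examples": "sequences.fasta" pair come from the comment; the other keys in the sample dictionary and the helper name `declare_examples` are assumptions made for the example.

```python
import json


def declare_examples(pathogen: dict, examples_name: str = "sequences.fasta") -> dict:
    """Return a copy of a pathogen.json dict with "examples" set under "files"."""
    updated = dict(pathogen)
    # Copy the "files" object (or start an empty one) so the input is not mutated.
    files = dict(updated.get("files", {}))
    files["examples"] = examples_name
    updated["files"] = files
    return updated


# Hypothetical minimal pathogen.json content, for illustration only.
pathogen = {"files": {"reference": "reference.fasta", "pathogenJson": "pathogen.json"}}
print(json.dumps(declare_examples(pathogen), indent=2))
```

The function returns a new dictionary rather than editing in place, which makes it easy to compare the before and after states when reviewing a dataset.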

jamessiqueirap · 2024-08-13T12:59:09Z

Hi @ivan-aksamentov, thank you very much for your valuable feedback and corrections!

I have implemented all the suggested changes, including creating the additional level of directories and adding the example sequences, following your advice. I appreciate your guidance on this; it is really helpful.

Thanks again

ivan-aksamentov · 2024-08-13T13:15:02Z

I pushed results of the rebuild script:

./scripts/rebuild --input-dir 'data/' --output-dir 'data_output/'

(This is normally done automatically, but we haven't figured out how to do this securely for third-party contributions yet. It needs to be rerun whenever the data/ directory changes.)

This allows the data_output/ directory to be used as a dataset server:

https://clades.nextstrain.org/?dataset-server=gh:jamessiqueirap/dengue-lineages-dataset@master@/data_output
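The link above follows the pattern `gh:<owner>/<repo>@<branch>@<path>` in the `dataset-server` query parameter. A minimal sketch for composing such links is shown here; the helper name `dataset_server_url` is hypothetical, and the pattern is inferred only from the URL in this comment.

```python
def dataset_server_url(owner: str, repo: str, branch: str, path: str,
                       base: str = "https://clades.nextstrain.org/") -> str:
    """Build a Nextclade Web link that points at a GitHub-hosted dataset server."""
    # The URL in the thread is not percent-encoded, so the pieces are joined as-is.
    return f"{base}?dataset-server=gh:{owner}/{repo}@{branch}@{path}"


print(dataset_server_url("jamessiqueirap", "dengue-lineages-dataset", "master", "/data_output"))
```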

Here are the links to Nextclade with datasets preselected, for easier testing:

rneher · 2024-08-13T15:10:46Z

Hi @jamessiqueirap , thanks for contributing these. Very exciting!

The trees look good to me.

Two things I noticed:

- The private mutation QC parameters are too stringent. My usual rule of thumb is that the typical value should be similar to the average number of mutations on terminal branches of the reference tree. I normally set the cut-off value to 3 times the typical value.
- You have one dataset for each serotype, so you probably don't need the all-serotypes in the path. I'd say community/v-gen-lab/dengue-lineages/denv4 is preferable to community/v-gen-lab/dengue-lineages/all-serotypes/denv4. If you want, you can later add an all-serotypes tree alongside denv1, denv2, etc.
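The rule of thumb for the private mutation QC can be sketched as follows. This is an illustration under assumptions, not the reviewers' tooling: it assumes the reference tree is an Auspice-style JSON in which each node lists its nucleotide mutations under `branch_attrs.mutations.nuc`, and that the resulting numbers go into `typical` and `cutoff` fields of the private mutation QC settings. The function names are hypothetical.

```python
from statistics import mean


def terminal_branch_mutation_counts(node: dict) -> list:
    """Collect the number of nucleotide mutations on every terminal branch."""
    children = node.get("children", [])
    if not children:
        muts = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
        return [len(muts)]
    counts = []
    for child in children:
        counts.extend(terminal_branch_mutation_counts(child))
    return counts


def private_mutation_qc(tree_root: dict, cutoff_factor: int = 3) -> dict:
    """typical = mean terminal-branch mutation count; cutoff = factor * typical."""
    typical = mean(terminal_branch_mutation_counts(tree_root))
    return {"typical": round(typical), "cutoff": round(cutoff_factor * typical)}


# Toy tree with two tips carrying 4 and 8 mutations on their terminal branches.
toy_root = {
    "name": "root",
    "children": [
        {"name": "tip_a", "branch_attrs": {"mutations": {"nuc": ["A1T", "C2G", "G3A", "T4C"]}}},
        {"name": "tip_b", "branch_attrs": {"mutations": {"nuc": [f"A{i}T" for i in range(10, 18)]}}},
    ],
}
print(private_mutation_qc(toy_root))
```

For very deep trees the recursive traversal could be replaced with an explicit stack.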

data/community/v-gen-lab/dengue-lineages/all-serotypes/denv1/pathogen.json

jamessiqueirap · 2024-08-13T17:48:30Z

> Hi @jamessiqueirap , thanks for contributing these. Very exciting!
>
> The trees look good to me.
>
> Two things I noticed:
>
> the private mutation QC parameters are too stringent. My usual rule of thumb is that the typical value should be similar to the average number of mutations on terminal branches of the reference tree. The cut-off value I normally set to 3 times the typical value.
>
> you have one dataset of each serotype. So you probably don't need the all-serotypes in the path. I'd say community/v-gen-lab/dengue-lineages/denv4 is preferable to community/v-gen-lab/dengue-lineages/all-serotypes/denv4. If you want, you can later add an all-serotypes tree alongside denv1, denv2, etc.

@rneher Thank you so much! I think this is a great approach, and I've already implemented it.

corneliusroemer · 2024-08-13T19:54:23Z

You can use these URLs to test directly, without having to wait for Ivan to rebuild:

corneliusroemer · 2024-08-13T21:26:06Z

Impressive work, great job making these datasets! A few thoughts and comments (not necessarily blocking release)

General points

You could shorten the path to dengue (i.e. removing the -lineages from v-gen-lab/dengue-lineages/denv1 to make it v-gen-lab/dengue/denv1) to keep in line with the pattern that we just mention the virus name and nothing else. Almost all datasets have lineages so this is redundant.
One could remove pr from the genome annotation as it is a true subset of M, being in frame (i.e. it's just duplication of information)
It might be nice to have the lineage colors be topologically ordered (using color ordering). They currently mostly are, but there are a few outliers, e.g. 1I is followed by 1II rather than by 1I_A etc.
It might be nice to include collection country names in the strain names. If you download them from NCBI Virus, you can customize the strain name to include the collection country.
Some branches are really long, one might potentially want to exclude sequences on long branches as these are either sequencing/assembly errors or due to recombination.
It might be good to list the strain name of the reference under which the sequence is usually known. In this case it seems to be 45AZ5 for denv1 and Thailand/16681/84 for denv2 - not sure if this rings a bell for anyone but it might
It might be good to include the ref seq of each dataset in the example sequences, so one can see where it falls on the tree.
DENV1:
- Lineage 1VI seems to be entirely missing, is this on purpose? Maybe it's gone out of business? Or is this a typo? In the preprint it mentions VI but not VII
DENV2:
- In DENV2, there's a very overdiverged sequence, probably best to exclude (likely artefact): OM744110.1|2021-11-16
- I can't find lineage/serotype 2I, is this on purpose?

Regarding recombination - I was wondering how the lineage system plans to deal with it. It's reportedly common in Dengue (e.g. https://www.pnas.org/doi/full/10.1073/pnas.96.13.7352, https://pubmed.ncbi.nlm.nih.gov/10331266/). I had a look at the lineage preprint but couldn't find mention of how to deal with recombination (I did a string search for recomb and only found 3 hits, all in reference to SARS-CoV-2). SARS-CoV-2 pango gives recombinants special names (e.g. XBB) - has it been discussed how recombination will be treated in the Dengue nomenclature? This might be something worth discussing in the paper, potentially.

jamessiqueirap · 2024-08-14T17:03:55Z

@corneliusroemer Thank you very much, I'm delighted to receive suggestions from you! The implementation process for dengue lineage nomenclature is a collaborative effort involving various research groups from different countries.

Regarding your suggestions:

General points

> You could shorten the path to dengue (i.e., removing the -lineages from v-gen-lab/dengue-lineages/denv1 to make it v-gen-lab/dengue/denv1) to keep in line with the pattern that we just mention the virus name and nothing else. Almost all datasets have lineages, so this is redundant.
>
> One could remove pr from the genome annotation as it is a true subset of M, being in frame (i.e., it's just duplication of information).
>
> It might be nice to have the lineage colors be topologically ordered (using color ordering). They currently mostly are, but there are a few outliers, e.g., 1I is followed by 1II rather than by 1I_A, etc.
>
> It might be nice to include collection country names in the strain names. If you download them from NCBI Virus, you can customize the strain name to include the collection country.
>
> It might be good to list the strain name of the reference under which the sequence is usually known. In this case, it seems to be 45AZ5 for DENV1 and Thailand/16681/84 for DENV2 - not sure if this rings a bell for anyone, but it might.
>
> It might be good to include the ref seq of each dataset in the example sequences so one can see where it falls on the tree.

You are absolutely correct, and I will implement these changes in the dataset right away.

> Some branches are really long, one might potentially want to exclude sequences on long branches as these are either sequencing/assembly errors or due to recombination.
>
> DENV2:
>
> In DENV2, there's a very overdiverged sequence, probably best to exclude (likely artefact): OM744110.1|2021-11-16

These trees only contain the representative topology of each designated lineage, and in some cases, certain branches may indeed appear very long. However, we will review these points with the other members of the project.

> I can't find lineage/serotype 2I, is this on purpose?

Given the granularity of lineages, this label only appears when you enable the visualization of all lineages.

> DENV1:
>
> Lineage 1VI seems to be entirely missing, is this on purpose? Maybe it's gone out of business? Or is this a typo? In the preprint it mentions VI but not VII

Regarding this, in the preprint, we used "1VI" to designate what we will now call "1VII" based on the suggestion of the scientific committee. Since there is already another genotype in the literature designated as "VI," we made this change to avoid any mismatch with the literature. We didn't include VI in the dataset as it's no longer in circulation.

> Regarding recombination - I was wondering how the lineage system plans to deal with it. It's reportedly common in Dengue (e.g. https://www.pnas.org/doi/full/10.1073/pnas.96.13.7352, https://pubmed.ncbi.nlm.nih.gov/10331266/). I had a look at the lineage preprint but couldn't find mention of how to deal with recombination (I did a string search for recomb and only found 3 hits, all in reference to SARS-CoV-2). SARS-CoV-2 pango gives recombinants special names (e.g. XBB) - has it been discussed how recombination will be treated in the Dengue nomenclature? This might be something worth discussing in the paper, potentially.

Regarding the nomenclature of recombinants, how they will be addressed is not yet fully defined. Nonetheless, I appreciate your concern, and I will bring this up for discussion with the project members.

corneliusroemer · 2024-08-14T17:08:36Z

If you have any questions on how to implement certain things do let me know! It's a pleasure to see others make datasets and I want to help as much as I can!

Regarding long branches, we usually exclude them as they are most likely sequencing errors or potentially recombination, both of which can mess up the tree. Removing them doesn't cause problems in lineage assignments.

I'd only include long branches if there's clear evidence they are real (most likely only recombinants).
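One way to shortlist long-branch candidates for manual review is sketched below. The thread does not say how long branches were identified, so the heuristic here (terminal branches with more than `factor` times the median number of nucleotide mutations) is purely an assumption, as are the function names and the Auspice-style `branch_attrs.mutations.nuc` layout.

```python
from statistics import median


def terminal_mutation_counts(tree_root: dict) -> dict:
    """Map each tip name to the number of nucleotide mutations on its branch."""
    counts = {}
    stack = [tree_root]
    while stack:
        node = stack.pop()
        children = node.get("children", [])
        if children:
            stack.extend(children)
        else:
            muts = node.get("branch_attrs", {}).get("mutations", {}).get("nuc", [])
            counts[node["name"]] = len(muts)
    return counts


def flag_long_branches(tree_root: dict, factor: float = 5.0) -> list:
    """Return tips whose terminal branch exceeds factor * median (median >= 1)."""
    counts = terminal_mutation_counts(tree_root)
    threshold = factor * max(median(counts.values()), 1)
    return sorted(name for name, n in counts.items() if n > threshold)


def make_tip(name: str, n_mutations: int) -> dict:
    """Build a toy tip with n placeholder mutations (illustration only)."""
    return {"name": name, "branch_attrs": {"mutations": {"nuc": [f"A{i + 1}T" for i in range(n_mutations)]}}}


toy_tree = {"name": "root", "children": [make_tip("a", 2), make_tip("b", 3), make_tip("c", 2), make_tip("d", 40)]}
print(flag_long_branches(toy_tree))
```

The output is only a list of candidates; whether a flagged sequence is an artefact or a real recombinant still needs the kind of judgement described above.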

jamessiqueirap · 2024-08-15T18:14:53Z

@corneliusroemer I have implemented the changes you suggested. However, I couldn't find an efficient (and aesthetically pleasing) way to add country information directly to the sequence names. To address this, I opted to export the country information for each sample. Now, it's possible to both check the country of origin via shift + click and color the branches according to country information.
Thank you so much once again!
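A minimal sketch of how per-sample country information might be attached to a tree so that it can be used for coloring is given below. It assumes an Auspice-style tree JSON where tip metadata lives under `node_attrs.<key>.value` and available colorings are listed in `meta.colorings`; the function name and the sample strain names are hypothetical, and this is not necessarily how the change in this PR was made.

```python
def add_country_coloring(tree_json: dict, country_by_strain: dict) -> dict:
    """Attach a country to each known node and register a 'country' coloring."""
    stack = [tree_json["tree"]]
    while stack:
        node = stack.pop()
        country = country_by_strain.get(node.get("name"))
        if country is not None:
            node.setdefault("node_attrs", {})["country"] = {"value": country}
        stack.extend(node.get("children", []))
    colorings = tree_json.setdefault("meta", {}).setdefault("colorings", [])
    # Register the coloring only once, even if the function is rerun.
    if not any(c.get("key") == "country" for c in colorings):
        colorings.append({"key": "country", "title": "Country", "type": "categorical"})
    return tree_json


# Toy input with made-up strain names.
toy = {"meta": {}, "tree": {"name": "root", "children": [{"name": "strain_1"}, {"name": "strain_2"}]}}
add_country_coloring(toy, {"strain_1": "Brazil", "strain_2": "Thailand"})
print(toy["meta"]["colorings"])
```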

ivan-aksamentov · 2024-08-15T19:01:26Z

I pushed a rebuild to assess how it works in Nextclade Web as a whole (index, examples, columns, tree, exports, autosuggestions etc.)

@jamessiqueirap There is a small defect in the tree.json files:

{
  "meta": {
    "extensions": {
      "nextclade": {
        "clade_node_attrs": [
          {
            "name": "clade_membership",
            "displayName": "Dengue Lineages (Nextclade)",
            "description": ""
          }
        ]
      }
    }
  }
}

The clade_membership ("built-in" or "default" clades) attribute is treated specially and does not need to be declared in the clade_node_attrs. So this entire extensions object can be safely removed. Only additional clade-like node attributes (e.g. a competing second nomenclature) need to be declared there. With this object in place, Nextclade currently gets confused and creates an additional empty column Dengue Lineages (Nextclade) in web and an empty clade_membership in output TSV files. Not critical, but it would be nice to remove.

I can easily remove it from the tree.json files here, but you probably want to remove it from your workflow repo as well. Let me know.
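The cleanup described above could be scripted roughly as follows. This is a sketch only: the key names come from the JSON snippet in this comment, while the function name `drop_builtin_clade_attr` and the choice to prune empty parent objects are assumptions.

```python
def drop_builtin_clade_attr(tree_json: dict) -> dict:
    """Remove a clade_membership entry from meta.extensions.nextclade.clade_node_attrs."""
    meta = tree_json.get("meta", {})
    extensions = meta.get("extensions", {})
    nextclade = extensions.get("nextclade", {})
    attrs = nextclade.get("clade_node_attrs")
    if attrs is None:
        return tree_json
    kept = [attr for attr in attrs if attr.get("name") != "clade_membership"]
    if kept:
        # Other clade-like attributes (e.g. a second nomenclature) stay declared.
        nextclade["clade_node_attrs"] = kept
    else:
        # Nothing left to declare: prune the now-empty objects.
        del nextclade["clade_node_attrs"]
        if not nextclade:
            del extensions["nextclade"]
        if not extensions:
            del meta["extensions"]
    return tree_json


tree_json = {
    "meta": {
        "extensions": {
            "nextclade": {
                "clade_node_attrs": [
                    {"name": "clade_membership", "displayName": "Dengue Lineages (Nextclade)", "description": ""}
                ]
            }
        }
    }
}
print(drop_builtin_clade_attr(tree_json))
```

Running the same function as the last step of a dataset workflow would keep the declaration from reappearing in later tree.json builds.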

jamessiqueirap · 2024-08-15T19:38:25Z

@ivan-aksamentov Thank you very much for the observation; this was part of some old tests we were doing, but I've already corrected all the files! Let's try again haha

ivan-aksamentov · 2024-08-15T20:11:28Z

@jamessiqueirap Done!

This looks good to me. I don't have any more technical recommendations. If the science team has no other comments, then this is ready to be merged and released. And we can of course release follow-up updates and fixes any time if needed.

corneliusroemer · 2024-08-16T13:11:47Z

Excellent stuff @jamessiqueirap, really great job!

Here's a second round of comments, please don't see them as criticism :) They don't need to be implemented, they are suggestions, you can also work on them later if you like after release:

Strain name of ref 4 is "rDEN4" I think - it might not be such a great reference if it's a recombinant clone that was a vaccine candidate. But that's maybe for Eneida Hatcher (don't know her Github account name), as you're just using the official refseq - so that's fine, maybe they could add another one that's more typical.
You could add your affiliation to the README, i.e. your lab/uni
Typo in readme; also, you can add a line break there before the "For bugs" by adding an extra blank line in the markdown, or by ending the previous line with two spaces
You could enable the cluster QC metric - but not necessary
The color scale is a bit random - you could use color ordering the way we do in most nextstrain workflows, see:
- https://github.com/nextstrain/mpox/blob/2ce0d9284ccc8cf9b06e8094c7fa28c8f9d85771/nextclade/Snakefile#L423-L437
- https://github.com/nextstrain/mpox/blob/master/nextclade/scripts/assign-colors.py
You can remove the "lineages" from the dataset name, currently it's "DENV-2 lineages", we don't need to say that there are lineages in there as there's no other dataset without lineages
There's still the kind of unnecessary pr gene in the genome annotation which is just a subset of prM - I'd probably remove it, you can call the result prM
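The idea of ordering lineage colors can be illustrated with a small sort key. This is not the mpox assign-colors.py script linked above; it is a simplified sketch that assumes lineage names of the form serotype digit, Roman-numeral genotype, then an optional `_`-separated sublineage path (for example 1I, 1I_A, 1I_K.1, 1II), and a hypothetical five-color palette.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10}
LINEAGE_RE = re.compile(r"^(\d)([IVX]+)(?:_(.+))?$")


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral made of I, V and X to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def lineage_sort_key(name: str) -> tuple:
    """Sort key placing each lineage right before its own sublineages."""
    match = LINEAGE_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected lineage name: {name}")
    serotype, genotype, rest = match.groups()
    parts = rest.split(".") if rest else []
    encoded = tuple((1, int(p), "") if p.isdigit() else (0, 0, p) for p in parts)
    return (int(serotype), roman_to_int(genotype), encoded)


def assign_colors(lineages: list, palette: list) -> dict:
    """Spread the ordered lineages evenly over an ordered palette."""
    ordered = sorted(lineages, key=lineage_sort_key)
    if len(ordered) <= 1:
        return {name: palette[0] for name in ordered}
    step = (len(palette) - 1) / (len(ordered) - 1)
    return {name: palette[round(i * step)] for i, name in enumerate(ordered)}


palette = ["#4042C7", "#4C90C0", "#7EB876", "#D9AD3D", "#DB2823"]  # hypothetical colors
print(assign_colors(["1II", "1I_A", "1I", "1I_K.1", "1I_K"], palette))
```

With more lineages than palette entries, neighbouring lineages share a color, so a real workflow would want a palette at least as long as the lineage list.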

I think that's it for now - let me know if you have any questions.

jamessiqueirap · 2024-08-16T17:51:17Z

@corneliusroemer thank you so much for your suggestions—I truly appreciate it! I went ahead and implemented all the changes right away. I'm really excited about this project and can hardly contain my enthusiasm! 😄

I did run into one issue with the coloration of certain branches. Even though I followed the script's color assignment flow that you recommended, it seems like Nextclade is mixing up some of the tones when exporting the colors.

For example, branches 1I_K and 1I_K.1 were assigned the colors #5098B9 and #539CB3, which should be in the blue palette, but they’re showing up as shades of red in the exported tree. I'm a bit puzzled by this...
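A quick way to confirm that the two assigned hex codes really are bluish, independent of how they are later rendered, is to compute their hue. The sketch below uses only the Python standard library; the helper name `hex_to_hue_degrees` is hypothetical, and the check says nothing about where in the export the colors get mixed up.

```python
import colorsys


def hex_to_hue_degrees(hex_color: str) -> float:
    """Return the HSV hue (0-360 degrees) of a #RRGGBB color."""
    value = hex_color.lstrip("#")
    r, g, b = (int(value[i:i + 2], 16) / 255 for i in (0, 2, 4))
    hue, _saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    return hue * 360


for color in ("#5098B9", "#539CB3"):
    # Hues near 0 or 360 are red; hues around 180-260 are cyan to blue.
    print(color, round(hex_to_hue_degrees(color)))
```

Both values fall in the cyan-to-blue range, which suggests the red shades come from how the colors are attached to lineages rather than from the assigned codes themselves.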

corneliusroemer · 2024-08-16T19:17:32Z

> I'm really excited about this project and can hardly contain my enthusiasm! 😄

That's great to hear :) If you enjoy making Nextclade datasets, there are many viruses left that people would love having datasets for! For example Chikungunya, I've heard!

> I did run into one issue with the coloration of certain branches. Even though I followed the script's color assignment flow that you recommended, it seems like Nextclade is mixing up some of the tones when exporting the colors.
>
> For example, branches 1I_K and 1I_K.1 were assigned the colors #5098B9 and #539CB3, which should be in the blue palette, but they’re showing up as shades of red in the exported tree. I'm a bit puzzled by this...

I'd be more than happy to have a look at your workflow. Where does it live? I tried the repo mentioned in the README but that seems to be just a folder with the workflow outputs, not the source code of the workflow that makes the datasets.

jamessiqueirap · 2024-08-16T20:41:03Z

> That's great to hear :) If you enjoy making Nextclade datasets, there are many viruses left that people would love having datasets for! For example Chikungunya, I've heard!

@corneliusroemer Thank you! My group is actually very interested in developing something for Chikungunya, especially since the center where I'm working on my doctorate may soon start monitoring this pathogen. We're excited about the possibilities!

> I'd be more than happy to have a look at your workflow. Where does it live? I tried the repo mentioned in the README but that seems to be just a folder with the workflow outputs, not the source code of the workflow that makes the datasets.

You're absolutely right—the links currently lead to the outputs because, to be honest, I'm still learning a lot about the tool. Everything I've done so far has been a bit of a manual, brute-force effort, hahaha! But I'm working on getting more organized!

corneliusroemer · 2024-08-16T21:06:15Z

Great, let me know if you want to make CHIKV or similar, I'm happy to help!

Regarding workflow organization, I recommend you look into snakemake. It's how I make all my workflows. See e.g. the mpox one: https://github.com/nextstrain/mpox/tree/master/nextclade

jamessiqueirap · 2024-08-17T18:30:49Z

> Great, let me know if you want to make CHIKV or similar, I'm happy to help!

@corneliusroemer Thanks a lot! We’ll definitely get in touch when we start on CHIKV or something similar.

> Regarding workflow organization, I recommend you look into snakemake. It's how I make all my workflows. See e.g. the mpox one: https://github.com/nextstrain/mpox/tree/master/nextclade

I wanted to let you know that your m-pox folder was incredibly helpful in showing me how to set up my workflow. Take a look!!

The process kicks off after the trees are created because, as I mentioned before, it was built using representative trees from a lot more samples, generated by another team member.

corneliusroemer · 2024-08-30T16:21:25Z

> I wanted to let you know that your m-pox folder was incredibly helpful in showing me how to set up my workflow. Take a look!!

It's amazing! I'll have a thorough look once I have a little more free time, I'm super busy with https://pathoplexus.org/ at the moment (that contributed to the delay, sorry for that!)

> The process kicks off after the trees are created because, as I mentioned before, it was built using representative trees from a lot more samples, generated by another team member.

Great! That's good to know.

I will merge your PR and release - congratulations on this fantastic work! Please open a PR/issue or write an email if you'd like to contribute datasets for other pathogens.

jamessiqueirap · 2024-08-31T18:54:51Z

@corneliusroemer Thank you so much! Do you have any idea when the dataset might be released on the Nextclade site? We’re wrapping up the manuscript submission, and it would be fantastic if we could include the announcement in the final version. 😊

Additionally, I'm excited to collaborate on creating new datasets for other pathogens!

corneliusroemer · 2024-08-31T20:34:35Z

I will release now. It's on master already (master.clades.nextstrain.org), and the release will follow in <1hr. Thanks for the reminder, I got interrupted by some urgent pathoplexus.org thing again.

corneliusroemer · 2024-08-31T20:45:17Z

@jamessiqueirap here we go - it's released on release branch - will take another 5min until you can see it on clades.nextstrain.org - but it will be there soon. Ping me if not!

https://github.com/nextstrain/nextclade_data/releases/tag/2024-08-31--20-44-06Z

corneliusroemer · 2024-08-31T21:05:30Z

@jamessiqueirap it's live on clades.nextstrain.org

For the manuscript, you can provide a link that automatically selects the right dataset like this:

https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv1
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv2
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv3
https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv4

Try it out: https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv1
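For a manuscript or README, the four links above can be generated from the dataset names instead of being typed by hand. This is a small sketch; the helper name `dataset_link` is hypothetical and the `dataset-name` parameter is taken from the URLs in this comment.

```python
def dataset_link(dataset_name: str, base: str = "https://clades.nextstrain.org/") -> str:
    """Build a Nextclade Web link that preselects the given dataset."""
    return f"{base}?dataset-name={dataset_name}"


# One link per dengue serotype dataset (denv1 to denv4).
links = [dataset_link(f"community/v-gen-lab/dengue/denv{n}") for n in range(1, 5)]
for link in links:
    print(link)
```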

jamessiqueirap · 2024-08-31T21:10:44Z

@corneliusroemer Awesome! thank you!

corneliusroemer · 2024-08-31T21:18:22Z

No, thank you 😄

corneliusroemer

Great stuff

dengue-lineages added

bf9fd0e

jamessiqueirap changed the title ~~dengue-lineages added~~ Add dengue lineages dataset Aug 12, 2024

Add example sequences

e009a93

chore: rebuild [skip ci]

9602798

ivan-aksamentov reviewed Aug 13, 2024

data/community/v-gen-lab/dengue-lineages/all-serotypes/denv1/pathogen.json (outdated, resolved)

ivan-aksamentov and others added 3 commits August 13, 2024 17:40

feat: fix order of datasets in the index

6e97f8a

Updated QC parameters

cc5b64b

Updated QC parameters

2a46388

jamessiqueirap and others added 4 commits August 13, 2024 16:05

Merge branch 'nextstrain:master' into master

9a4c779

Readme updated

14bdabd

Update README.md

1f082cb

Update README.md

4d366f3

jamessiqueirap added 4 commits August 13, 2024 17:36

Updated QC parameters

66b7b47

Updated QC parameters

c4523cc

Updated QC parameters

a723897

Updated parameters

24d6d48

jamessiqueirap added 4 commits August 14, 2024 20:49

Update for all datasets

04397b9

Update for denv1

3fde71d

Update for denv1

273cdfb

Tree files update

f6683a3

jamessiqueirap and others added 2 commits August 15, 2024 15:26

Readme update

b9eade7

chore: rebuild [skip ci]

3f546ec

Tree files updaated

f0dcf09

chore: rebuild [skip ci]

e13e9fc

Coloring updates

803d542

outliers removed from trees

14758b2

Readme Update

051f3eb

Merge branch 'nextstrain:master' into master

6f58bbd

corneliusroemer deployed to refs/pull/225/merge August 30, 2024 16:07 — with GitHub Actions Active

corneliusroemer merged commit c1d05f8 into nextstrain:master Aug 30, 2024
2 checks passed

corneliusroemer reviewed Aug 31, 2024
