Add dengue lineages dataset #223
Hi @jamessiqueirap! Thanks! I will let our science team review. It has been challenging for them to produce serotype datasets so far. Let's see what they say. In the meantime, a couple of technical nuances:
Hi @ivan-aksamentov, thank you very much for your valuable feedback and corrections! I have already implemented all the suggested changes, including creating the additional level of directories and I have already included the example sequences, following your advice. I appreciate your guidance on this — it is really helpful. Thanks again |
I pushed results of the ./scripts/rebuild --input-dir 'data/' --output-dir 'data_output/' (This is normally done automatically, but we haven't figured out how to do this securely for third-party contributions yet. This needs to be rerun if there are changes to the This allows to use the Here are the links to Nextclade with datasets preselected, for easier testing:
Hi @jamessiqueirap , thanks for contributing these. Very exciting! The trees look good to me. Two things I noticed:
data/community/v-gen-lab/dengue-lineages/all-serotypes/denv1/pathogen.json
@rneher Thank you so much! I think this is a great approach, and I've already implemented it. |
Impressive work, great job making these datasets! A few thoughts and comments (not necessarily blocking release) General points
Regarding recombination - I was wondering how the lineage system plans to deal with it. It's reportedly common in Dengue (e.g. https://www.pnas.org/doi/full/10.1073/pnas.96.13.7352, https://pubmed.ncbi.nlm.nih.gov/10331266/). I had a look at the lineage preprint but couldn't find mention of how to deal with recombination (I did a string search for |
@corneliusroemer Thank you very much, I'm delighted to receive suggestions from you! The implementation process for dengue lineage nomenclature is a collaborative effort involving various research groups from different countries. Regarding your suggestions:
You are absolutely correct, and I will implement these changes in the dataset right away.
These trees only contain the representative topology of each designated lineage, and in some cases, certain branches may indeed appear very long. However, we will review these points with the other members of the project.
Given the granularity of lineages, this label only appears when you enable the visualization of all lineages.
Regarding this, in the preprint, we used "1VI" to designate what we will now call "1VII" based on the suggestion of the scientific committee. Since there is already another genotype in the literature designated as "VI," we made this change to avoid any mismatch with the literature. We didn't include VI in the dataset as it's no longer in circulation.
Regarding the nomenclature of recombinants, how they will be addressed is not yet fully defined. Nonetheless, I appreciate your concern, and I will bring this up for discussion with the project members. |
If you have any questions on how to implement certain things, do let me know! It's a pleasure to see others make datasets and I want to help as much as I can! Regarding long branches, we usually exclude them as they are most likely sequencing errors or potentially recombination, both of which can mess up the tree. Removing them doesn't cause problems in lineage assignments. I'd only include long branches if there's clear evidence they are real (most likely only recombinants).
@corneliusroemer I have implemented the changes you suggested. However, I couldn't find an efficient (and aesthetically pleasing) way to add country information directly to the sequence names. To address this, I opted to export the country information for each sample. Now, it's possible to both check the country of origin via shift + click and color the branches according to country information. |
I pushed a rebuild to assess how it works in Nextclade Web as a whole (index, examples, columns, tree, exports, autosuggestions etc.) @jamessiqueirap There is a small defect in the tree.json files:
{
  "meta": {
    "extensions": {
      "nextclade": {
        "clade_node_attrs": [
          {
            "name": "clade_membership",
            "displayName": "Dengue Lineages (Nextclade)",
            "description": ""
          }
        ]
      }
    }
  }
}
The I can remove easily from the
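For anyone checking their own files before a rebuild, here is a minimal sketch that lists the entries declared under `meta.extensions.nextclade.clade_node_attrs`, following the JSON structure quoted above. The helper name `list_clade_node_attrs` and the inline example are illustrative assumptions, not part of Nextclade or of this PR; the snippet only shows what is declared and does not decide what should be removed.

```python
import json


def list_clade_node_attrs(tree: dict) -> list[str]:
    """Return the names declared under meta.extensions.nextclade.clade_node_attrs.

    Hypothetical helper for inspection only; missing keys yield an empty list.
    """
    nextclade_ext = tree.get("meta", {}).get("extensions", {}).get("nextclade", {})
    return [attr.get("name", "") for attr in nextclade_ext.get("clade_node_attrs", [])]


# Inline example mirroring the snippet quoted in the comment above.
example_tree = json.loads("""
{
  "meta": {
    "extensions": {
      "nextclade": {
        "clade_node_attrs": [
          {
            "name": "clade_membership",
            "displayName": "Dengue Lineages (Nextclade)",
            "description": ""
          }
        ]
      }
    }
  }
}
""")

print(list_clade_node_attrs(example_tree))
```

To run it on a real file, replace the inline example with `json.load(open(path))` for each tree.json in the dataset directory.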
@ivan-aksamentov Thank you very much for the observation; this was part of some old tests we were doing, but I've already corrected all the files! Let's try again haha |
@jamessiqueirap Done! This looks good to me. I don't have any more technical recommendations. If science team has no other comments, then this is ready to be merged and released. And we can of course release followup updates and fixes any time if needed. |
Excellent stuff @jamessiqueirap, really great job! Here's a second round of comments, please don't see them as criticism :) They don't need to be implemented, they are suggestions, you can also work on them later if you like after release:
I think that's it for now - let me know if you have any questions. |
@corneliusroemer thank you so much for your suggestions—I truly appreciate it! I went ahead and implemented all the changes right away. I'm really excited about this project and can hardly contain my enthusiasm! 😄 I did run into one issue with the coloration of certain branches. Even though I followed the script's color assignment flow that you recommended, it seems like Nextclade is mixing up some of the tones when exporting the colors. For example, branches 1I_K and 1I_K.1 were assigned the colors #5098B9 and #539CB3, which should be in the blue palette, but they’re showing up as shades of red in the exported tree. I'm a bit puzzled by this... |
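As a quick sanity check on the two colors mentioned above, the following sketch converts each hex code to a hue angle using Python's standard `colorsys` module. The helper name `hex_to_hue_degrees` is an illustrative assumption. It only confirms which part of the color wheel the assigned codes fall in; it does not explain why the exported tree shows different tones.

```python
import colorsys


def hex_to_hue_degrees(hex_color: str) -> float:
    """Convert a '#RRGGBB' string to its HSV hue in degrees (0 = red, 240 = blue)."""
    digits = hex_color.lstrip("#")
    # Split into the three two-digit channels and scale each to the 0..1 range.
    r, g, b = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
    hue, _saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    return hue * 360


# Lineage names and hex codes as quoted in the comment above.
assigned = {"1I_K": "#5098B9", "1I_K.1": "#539CB3"}

for lineage, color in assigned.items():
    print(lineage, color, round(hex_to_hue_degrees(color)))
```

Both codes land in the cyan-to-blue range of the wheel, so a red rendering would point to the export or color-mapping step rather than to the assigned values.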
That's great to hear :) If you enjoy making Nextclade datasets, there are many viruses left that people would love having datasets for! For example Chikungunya, I've heard!
I'd be more than happy to have a look at your workflow. Where does it live? I tried the repo mentioned in the README but that seems to be just a folder with the workflow outputs, not the source code of the workflow that makes the datasets. |
@corneliusroemer Thank you! My group is actually very interested in developing something for Chikungunya, especially since the center where I'm working on my doctorate may soon start monitoring this pathogen. We're excited about the possibilities!
You're absolutely right—the links currently lead to the outputs because, to be honest, I'm still learning a lot about the tool. Everything I've done so far has been a bit of a manual, brute-force effort, hahaha! But I'm working on getting more organized! |
Great, let me know if you want to make CHIKV or similar, I'm happy to help! Regarding workflow organization, I recommend you look into snakemake. It's how I make all my workflows. See e.g. the mpox one: https://github.com/nextstrain/mpox/tree/master/nextclade
@corneliusroemer Thanks a lot! We’ll definitely get in touch when we start on CHIKV or something similar.
I wanted to let you know that your m-pox folder was incredibly helpful in showing me how to set up my workflow. Take a look!! The process kicks off after the trees are created because, as I mentioned before, it was built using representative trees from a lot more samples, generated by another team member. |
It's amazing! I'll have a thorough look once I have a little more free time, I'm super busy with https://pathoplexus.org/ at the moment (that contributed to the delay, sorry for that!)
Great! That's good to know. I will merge your PR and release - congratulations on this fantastic work! Please open a PR/issue or write an email if you'd like to contribute datasets for other pathogens. |
@corneliusroemer Thank you so much! Do you have any idea when the dataset might be released on the Nextclade site? We’re wrapping up the manuscript submission, and it would be fantastic if we could include the announcement in the final version. 😊 Additionally, I'm excited to collaborate on creating new datasets for other pathogens! |
I will release now. It's on master already (master.clades.nextstrain.org), and the release will follow in <1hr. Thanks for the reminder, I got interrupted by some urgent pathoplexus.org thing again.
@jamessiqueirap here we go - it's released on release branch - will take another 5min until you can see it on clades.nextstrain.org - but it will be there soon. Ping me if not! https://github.com/nextstrain/nextclade_data/releases/tag/2024-08-31--20-44-06Z |
@jamessiqueirap it's live on clades.nextstrain.org For the manuscript, you can provide a link that automatically selects the right dataset like this:
Try it out: https://clades.nextstrain.org/?dataset-name=community/v-gen-lab/dengue/denv1 |
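A small sketch of how such preselected links could be generated, for example when listing them in a manuscript. It uses only the `dataset-name` query parameter shown in the link above. The function name `dataset_link` is illustrative, and the `denv2` to `denv4` dataset names are an assumption based on the `denv1` pattern, so check them against the released dataset index before use.

```python
from urllib.parse import urlencode

BASE_URL = "https://clades.nextstrain.org/"


def dataset_link(dataset_name: str) -> str:
    """Build a Nextclade Web link that preselects the given dataset.

    safe="/" keeps the slashes in the dataset name readable instead of %2F.
    """
    return BASE_URL + "?" + urlencode({"dataset-name": dataset_name}, safe="/")


# denv1 is taken from the link above; denv2..denv4 are assumed to follow the same pattern.
for serotype in range(1, 5):
    print(dataset_link(f"community/v-gen-lab/dengue/denv{serotype}"))
```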
@corneliusroemer Awesome! thank you! |
No, thank you 😄 |
Great stuff
These datasets are based on the dengue virus lineage systems described by Verity et al., 2024, and are suitable for the analysis of viral sequences from the four dengue virus serotypes.