
LO - 4 - getting data related to a publication #37

Open
GavinHuttley opened this issue Nov 10, 2023 · 2 comments

Comments

@GavinHuttley
Contributor

  • explain the nature of the problem you were trying to solve
  • explain the information you needed (a tree, the alignments, information on how those were derived, etc...)
  • what was actually linked to the paper?
  • how did you get the data in the end (thanks to David Duchene!)
  • what were the indicated formats?
  • what was the experience? (Bad Phylip!)
  • go to actual definition of phylip format
  • show actual format -- crying emoji
  • we will then solve this in the final hour when we write a custom app

Write content for a wiki page that sketches the motivation, providing links to:

  • the paper, web site(s)
  • the data url defined within the paper
  • upload, in a separate comment on this issue, some representative compressed data that exhibits the indicated problem; we can then link to that comment so folk can download the data

Do all the above as a single comment on this issue. Also make a presentation (10') on this.

@YapengLang write text in a comment on this issue, including screen grabs of the relevant papers.

@YapengLang
Collaborator

YapengLang commented Nov 12, 2023

sketches for the coming wiki page:

In this LO, we will take a close look at data obtained from a publication. The first two sections introduce the background to the data wrangling and the problem of an unexpected file format. These experiences are then consolidated in the final section using Cogent3 apps.

Aims of the data sampling

  • the nature of my problem

    In my study, published data sets were required to develop my novel model-based measure of the limit of phylogenetic inference. Divergent sequences in these published data had been curated to develop a measure of site saturation (paper URL: https://pubmed.ncbi.nlm.nih.gov/34508605/), and the statistical test associated with that measure is believed to detect the limit of inference. I therefore want to investigate the relationship between my measure of the inference limit and the site-saturation measure, using the same data sets as the paper.

    [screenshot of paper title and abstract]

  • the information I needed

    The original data sets cited in the above paper are necessary to inspect the behaviour of both the published site-saturation measure and my own. To reveal the distribution of the site-saturation measure extensively, I want data sets spanning a gradient of site saturation.

    [bar chart of site-saturation of data sets]

    Each data set consists of alignments of different, homologous genes. To link the two measures, the alignment for each locus and its corresponding site-saturation measure are needed, from the files below:

    1. Compressed file of alignments -- for developing the novel model-based measure of the phylogenetic inference limit. In the next section, I will demonstrate how to read this file and handle the unexpected format indicated by the file extension; the phylogenetic models can then be estimated from these alignments.

      [a brief workflow of my experiment]

    2. Statistics of site saturation -- the existing measure of site saturation. Retrieving these statistics will not be covered in this demonstration.

Parsing the alignments files

  • The properties of alignments files

    • Alignment formats are inconsistent across different studies. (e.g. fasta, NEXUS, Phylip)
    • For one study, alignments of each locus are stored in the same format.
  • The difficulty when parsing the data
    Some files did not follow the format definition indicated by their file extension. Take the Phylip format as an example:

    • strict Phylip format definition: [screenshot from the PHYLIP doc]

      • begins with a line indicating the number of taxa and sequence length
      • restricts taxon names to exactly 10 characters, padded with blanks
      • the sequence can continue for more than one line
    • an example of bad Phylip [screenshot of the duchene's data] 😞😢😭

      • taxon names longer than 10 characters, separated from the sequences by blanks
        Hence, an error message is reported when running the strict Phylip parser.
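To make the contrast concrete, here is a hedged illustration (invented taxon names and sequences, not the actual data from the study):

```
# strict Phylip: names occupy exactly 10 characters, padded with blanks
 2 12
Taxon_A   ACGTACGTACGT
Taxon_B   ACGTACGTACGT

# "bad" Phylip: names longer than 10 characters, delimited by whitespace
 2 12
A_much_longer_name ACGTACGTACGT
Another_long_name  ACGTACGTACGT
```

A strict parser treats the first 10 columns of each record as the name, so longer names bleed into the sequence field and the parse fails.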

Demo for parsing bad Phylip by creating a Cogent3 App

To achieve the workflow in section 2, we need a customised function to load the bad Phylip files. After that, the phylogenetic models can be estimated on all alignments sequentially using the composable features of Cogent3 apps.

An instance of a study is here: phy_data.zip

  • Step1: Write a Python function to read the relaxed Phylip format, with key components:

    1. read the first line to get the stated number of taxa and sequence length
    2. read the taxon name at the start of each record
    3. concatenate the sequence lines following each name
  • Step2: Wrap the function into a Cogent3 app using the decorator @define_app(app_type=LOADER).

    The script can be downloaded here: parse_bad_phylip.py.zip

  • Step3: Construct the whole phylogenetic analysis from the composable apps.

from cogent3 import open_data_store, get_app

# change "data/phy_data.zip" to your data path
in_dstore = open_data_store("data/phy_data.zip", suffix="phy", mode="r")

# "data/outdb.sqlitedb" is the database where output is written
out_dstore = open_data_store("data/outdb.sqlitedb", mode="w")

# construct the prior tree for each locus from pairwise distances
dist_cal = get_app("fast_slow_dist", fast_calc="paralinear", moltype="DNA")
est_tree = get_app("quick_tree", drop_invalid=False)
calc_tree = dist_cal + est_tree

loader = load_bad_phylip()  # your customised cogent3 app (defined in parse_bad_phylip.py)
model = get_app("model", "GTR", tree_func=calc_tree)  # the phylogenetic modeller
writer = get_app("write_db", data_store=out_dstore)  # writes estimates into the database

# compose the whole phylogenetic analysis
process = loader + model + writer
process.apply_to(in_dstore[:5])  # fine to change how many records it is applied to
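The actual loader is distributed above in parse_bad_phylip.py.zip; the following is only a minimal, stdlib-only sketch of what Steps 1 and 2 describe, assuming the files are sequential (not interleaved) Phylip. The function name and return type are invented for illustration, not taken from the script.

```python
def parse_relaxed_phylip(text):
    """Parse sequential Phylip where taxon names may exceed the strict
    10-character limit and are separated from sequences by whitespace.

    Returns a dict mapping taxon name -> sequence.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # step 1: the header gives the stated number of taxa and sequence length
    num_taxa, seq_len = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for ln in lines[1:]:
        if len(names) < num_taxa and (not seqs or len(seqs[-1]) >= seq_len):
            # step 2: a new record -- taxon name, then the start of its sequence
            name, *rest = ln.split(None, 1)
            names.append(name)
            seqs.append(rest[0].replace(" ", "") if rest else "")
        else:
            # step 3: continuation line -- concatenate onto the current sequence
            seqs[-1] += ln.replace(" ", "")
    if len(names) != num_taxa or any(len(s) != seq_len for s in seqs):
        raise ValueError("file does not match its stated dimensions")
    return dict(zip(names, seqs))
```

Such a function would then be wrapped with Cogent3's @define_app(app_type=LOADER) decorator, as in Step 2, so it composes with the model and writer apps above.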

@GavinHuttley GavinHuttley changed the title LO getting data related to a publication LO - 4 - getting data related to a publication Nov 12, 2023
@fredjaya

Main comments

Great wiki structure and content @YapengLang, I think it flows really well for a workshop!

It would be good to limit repeating content across the presentation and tutorial. In the 10min presentation:

  • Spend a slide on introducing the context of your research (i.e. saturation and comparison of methods to detect them)
  • Refer to your conference talk for more information
  • Then spend the remaining time providing an outline of your experience with accessing the data and the problems encountered

Then for the workshop/wiki:

  • Aim to limit the discussion of your project to a few sentences (~2) at the beginning. Again, this is just to provide context to the key data issues encountered

Phylogenetic inference is restricted by the divergence of sequences. An empirical measure of divergence (so-called site saturation) has been developed to track the limit of inference; it is based on the global nucleotide frequencies for a given alignment and is hence model-free. In my study, I want to investigate the link between that measure of site saturation and my novel model-based measure of the phylogenetic inference limit.

The focus of this paragraph can be shifted from (limits due to) saturation and to the data requirements for good phylogenetic inference.

The reason for this is that, to cater to the participants (phylogeneticists), the limit of saturation would need to be explained in more detail, which goes beyond the scope of the workshop. Focusing on the issues arising from divergent sequences would be ideal.

Things that could be mentioned include why phylogenetic inference is restricted by divergent sequences e.g. hard to align, might bias inference if highly saturated and unaccounted for etc.

Other comments

alignments of homologies for different genes (including the priori trees constructed by the distance method) - for my novel measure of inference limit

  • Needs clarity - were these alignments of different, homologous genes?

Each data set is a study for a species family, which consists of various alignments per locus.

  • "species" not needed, perhaps change to "family-level data sets"

The difficulty when parsing the data: some Phylip formats did not follow the strict Phylip definition.

  • Could mention that in general, all alignment formats can be presented differently. For example, interleaved/non-interleaved fasta, NEXUS files can include data for trees, partitions etc. (or not!), but the focus of this L/O is PHYLIP.

Demo for parsing bad Phylip by Cogent3 App

  • "by creating a Cogent3 app"
