
LO - 4 - getting data related to a publication #37

Open
GavinHuttley opened this issue Nov 10, 2023 · 2 comments

Comments

@GavinHuttley
Contributor

  • explain the nature of the problem you were trying to solve
  • explain the information you needed (a tree, the alignments, information on how those were derived, etc...)
  • what was actually linked to the paper?
  • how did you get the data in the end (thanks to David Duchene!)
  • what were the indicated formats?
  • what was the experience? (Bad Phylip!)
  • go to actual definition of phylip format
  • show actual format -- crying emoji
  • we will then solve this in the final hour when we write a custom app

Write content for a wiki page that sketches the motivation, providing links to:

  • the paper, web site(s)
  • the data url defined within the paper
  • upload, in a separate comment on this issue, some representative compressed data that exhibits the indicated problem; we can then link to that comment so folk can download the data

Do all the above as a single comment on this issue. Also make a presentation (10') on this.

@YapengLang write text in a comment on this issue, including screen grabs of the relevant papers.

@YapengLang
Collaborator

YapengLang commented Nov 12, 2023

sketches for the coming wiki page:

In this LO, we will take a close look at data obtained from a publication. The first two sections introduce the background to the data wrangling and the problem of an unexpected file format. These experiences are then consolidated in the final section using Cogent3 apps.

Aims of the data sampling

  • the nature of my problem

    In my study, published data sets were required to develop my novel model-based measure of the limit of phylogenetic inference. Divergent sequences in these published data had been curated to develop a measure of site saturation (paper URL: https://pubmed.ncbi.nlm.nih.gov/34508605/), and the statistical test associated with that measure is believed to detect the limit of inference. I therefore want to investigate the relationship between my measure of the inference limit and the site-saturation measure, using the same data sets as the paper.

    [screenshot of paper title and abstract]

  • the information I needed

    The original data sets cited in the above paper are necessary to inspect the behaviour of both the published site-saturation measure and my own. To reveal the distribution of the site-saturation measure extensively, I want data sets spanning a gradient of site saturation.

    [bar chart of site-saturation of data sets]

    Each data set consists of alignments of different, homologous genes. To link the two measures, the alignment for each locus and its corresponding site-saturation measure are needed, from the files below:

    1. Compressed file of alignments -- for developing the novel model-based measure of the phylogenetic inference limit. In the next section, I will demonstrate how to read this file and handle the unexpected format indicated by the file extension; the phylogenetic models can then be estimated from these alignments.

      [a brief workflow of my experiment]

    2. Statistics of site saturation -- the existing measure of site saturation. Retrieving these statistics will not be covered in this demonstration.

Parsing the alignments files

  • The properties of alignments files

    • Alignment formats are inconsistent across different studies. (e.g. fasta, NEXUS, Phylip)
    • For one study, alignments of each locus are stored in the same format.
  • The difficulty when parsing the data
    Some files did not follow the format definition indicated by their file extension. Take the Phylip format as an example:

    • strict Phylip format definition: [screenshot from the PHYLIP doc]

      • begins with a line indicating the number of taxa and sequence length
      • restricts taxon names to exactly 10 characters, padded with blanks
      • the sequence can continue for more than one line
    • an example of bad Phylip [screenshot of the duchene's data] 😞😢😭

      • taxon names longer than 10 characters, separated from the sequences by blanks
        Hence, an error message is reported when running the strict Phylip parser.
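To make the contrast concrete, here is a hedged illustration (invented taxon names and sequences, not the actual data from the study):

```
# strict Phylip: names occupy exactly 10 characters, padded with blanks
 2 12
Taxon_A   ACGTACGTACGT
Taxon_B   ACGTACGTACGT

# "bad" Phylip: names longer than 10 characters, delimited by whitespace
 2 12
A_much_longer_name ACGTACGTACGT
Another_long_name  ACGTACGTACGT
```

A strict parser treats the first 10 columns of each record as the name, so longer names bleed into the sequence field and the parse fails.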

Demo for parsing bad Phylip by creating a Cogent3 App

To achieve the workflow in section 2, we need a customised function to load the bad Phylip files. After that, the phylogenetic models can be estimated on all alignments sequentially using the composable features of Cogent3 apps.

An instance of a study is here: phy_data.zip

  • Step1: Write a Python function to read the relaxed Phylip format, with key components:

    1. read the first line to get the stated number of taxa and sequence length
    2. read the taxon name at the start of each record
    3. concatenate the sequence lines following each name
  • Step2: Wrap the function into a Cogent3 app using the decorator @define_app(app_type=LOADER).

    The script can be downloaded here: parse_bad_phylip.py.zip

  • Step3: Construct the whole phylogenetic analysis from the composable apps.

from cogent3 import open_data_store, get_app

# change "data/phy_data.zip" to your data path
in_dstore = open_data_store("data/phy_data.zip", suffix="phy", mode="r")

# "data/outdb.sqlitedb" is the database where output is written
out_dstore = open_data_store("data/outdb.sqlitedb", mode="w")

# construct the prior tree for each locus from pairwise distances
dist_cal = get_app("fast_slow_dist", fast_calc="paralinear", moltype="DNA")
est_tree = get_app("quick_tree", drop_invalid=False)
calc_tree = dist_cal + est_tree

loader = load_bad_phylip()  # your customised cogent3 app (defined in parse_bad_phylip.py)
model = get_app("model", "GTR", tree_func=calc_tree)  # the phylogenetic modeller
writer = get_app("write_db", data_store=out_dstore)  # writes estimates into the database

# compose the whole phylogenetic analysis
process = loader + model + writer
process.apply_to(in_dstore[:5])  # fine to change how many records it is applied to
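The actual loader is distributed above in parse_bad_phylip.py.zip; the following is only a minimal, stdlib-only sketch of what Steps 1 and 2 describe, assuming the files are sequential (not interleaved) Phylip. The function name and return type are invented for illustration, not taken from the script.

```python
def parse_relaxed_phylip(text):
    """Parse sequential Phylip where taxon names may exceed the strict
    10-character limit and are separated from sequences by whitespace.

    Returns a dict mapping taxon name -> sequence.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # step 1: the header gives the stated number of taxa and sequence length
    num_taxa, seq_len = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for ln in lines[1:]:
        if len(names) < num_taxa and (not seqs or len(seqs[-1]) >= seq_len):
            # step 2: a new record -- taxon name, then the start of its sequence
            name, *rest = ln.split(None, 1)
            names.append(name)
            seqs.append(rest[0].replace(" ", "") if rest else "")
        else:
            # step 3: continuation line -- concatenate onto the current sequence
            seqs[-1] += ln.replace(" ", "")
    if len(names) != num_taxa or any(len(s) != seq_len for s in seqs):
        raise ValueError("file does not match its stated dimensions")
    return dict(zip(names, seqs))
```

Such a function would then be wrapped with Cogent3's @define_app(app_type=LOADER) decorator, as in Step 2, so it composes with the model and writer apps above.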

@GavinHuttley GavinHuttley changed the title LO getting data related to a publication LO - 4 - getting data related to a publication Nov 12, 2023
@fredjaya

Main comments

Great wiki structure and content @YapengLang, I think it flows really well for a workshop!

It would be good to limit repeating content across the presentation and tutorial. In the 10min presentation:

  • Spend a slide on introducing the context of your research (i.e. saturation and comparison of methods to detect them)
  • Refer to your conference talk for more information
  • Then spend the remaining time providing an outline of your experience with accessing the data and the problems encountered

Then for the workshop/wiki:

  • Aim to limit the discussion of your project to a few sentences (~2) at the beginning. Again, this is just to provide context to the key data issues encountered

Phylogenetic inference is restricted by the divergence of sequences. An empirical measure of divergence (so-called site saturation) has been developed to track the limit of inference; it is based on the global nucleotide frequencies for a given alignment and is hence model-free. In my study, I want to investigate the link between that measure of site saturation and my novel model-based measure of the phylogenetic inference limit.

The focus of this paragraph can be shifted from (limits due to) saturation and to the data requirements for good phylogenetic inference.

The reason for this is that, to cater to the participants (phylogeneticists), the limit of saturation would need to be explained in more detail, which goes beyond the scope of the workshop. Focusing on the issues arising from divergent sequences would be ideal.

Things that could be mentioned include why phylogenetic inference is restricted by divergent sequences e.g. hard to align, might bias inference if highly saturated and unaccounted for etc.

Other comments

alignments of homologies for different genes (including the priori trees constructed by the distance method) - for my novel measure of inference limit

  • Needs clarity - were these alignments of different, homologous genes?

Each data set is a study for a species family, which consists of various alignments per locus.

  • "species" not needed, perhaps change to "family-level data sets"

The difficulty when parsing the data: some Phylip formats did not follow the strict Phylip definition.

  • Could mention that in general, all alignment formats can be presented differently. For example, interleaved/non-interleaved fasta, NEXUS files can include data for trees, partitions etc. (or not!), but the focus of this L/O is PHYLIP.

Demo for parsing bad Phylip by Cogent3 App

  • "by creating a Cogent3 app"
