Commit

paper: fixing pandas citation
biomadeira committed Sep 26, 2024
1 parent 9497e64 commit fac5c40
Showing 2 changed files with 3 additions and 3 deletions.
4 changes: 2 additions & 2 deletions paper/paper.bib
@@ -66,14 +66,14 @@ @incollection{kans_entrez_2024
year = {2024},
}

-@misc{team_pandas-devpandas_2024,
+@misc{pandas_2024,
title = {pandas-dev/pandas: {Pandas}},
shorttitle = {pandas-dev/pandas},
url = {https://zenodo.org/records/13819579},
abstract = {Pandas is a powerful data structures for data analysis, time series, and statistics.},
urldate = {2024-09-25},
publisher = {Zenodo},
-author = {team, The pandas development},
+author = {The pandas development team},
month = sep,
year = {2024},
doi = {10.5281/zenodo.13819579},
2 changes: 1 addition & 1 deletion paper/paper.md
@@ -53,7 +53,7 @@ Taxonomy Resolver has been developed with simplicity in mind and it can be used
* **filtering** a tree based on the inclusion and/or exclusion of certain TaxIDs
* **writing and loading** tree data structures using Python’s object serialisation

-A taxonomy tree is a hierarchical structure that can be seen as a collection of deeply nested containers - nodes connected by edges, following the hierarchy, from the parent node - the root, all the way down to the children nodes - the leaves. An object-oriented programming (OOP) tree implementation based on recursion does not typically scale well for large trees, such as the NCBI Taxonomy, which is composed of >2.6 million nodes. To improve performance, Taxonomy Resolver represents the tree structure following the Nested Set Model, which is a technique developed to represent hierarchical data in relational databases lacking recursion capabilities. This allows for efficient and inexpensive querying of parent-child relationships. The full tree is traversed following the Modified Preorder Tree Traversal (MPTT) strategy [@celko_chapter_2004], in which each node in the tree is visited twice. In a preorder traversal, the root node is visited first, then recursively a preorder traversal of the left sub-tree, followed by a recursive preorder traversal of the right subtree, in order, until every node has been visited. The modified strategy allows capturing the 'left' and 'right' ($lft$ and $rgt$, respectively) boundaries of each subtree, which are stored as two additional attributes. Finding a subtree is as simple as searching for the nodes of interest where $lft > node lft$ and $rgt < node rgt$. Likewise, finding the full path to a node is as simple as searching for the nodes where $lft < node lft$ and $rgt > node rgt$. Traversal attributes, depth and node indexes are captured for each tree node and are stored as a pandas DataFrame [@team_pandas-devpandas_2024].
+A taxonomy tree is a hierarchical structure that can be seen as a collection of deeply nested containers - nodes connected by edges, following the hierarchy, from the parent node - the root, all the way down to the children nodes - the leaves. An object-oriented programming (OOP) tree implementation based on recursion does not typically scale well for large trees, such as the NCBI Taxonomy, which is composed of >2.6 million nodes. To improve performance, Taxonomy Resolver represents the tree structure following the Nested Set Model, which is a technique developed to represent hierarchical data in relational databases lacking recursion capabilities. This allows for efficient and inexpensive querying of parent-child relationships. The full tree is traversed following the Modified Preorder Tree Traversal (MPTT) strategy [@celko_chapter_2004], in which each node in the tree is visited twice. In a preorder traversal, the root node is visited first, then recursively a preorder traversal of the left sub-tree, followed by a recursive preorder traversal of the right subtree, in order, until every node has been visited. The modified strategy allows capturing the 'left' and 'right' ($lft$ and $rgt$, respectively) boundaries of each subtree, which are stored as two additional attributes. Finding a subtree is as simple as searching for the nodes of interest where $lft > node lft$ and $rgt < node rgt$. Likewise, finding the full path to a node is as simple as searching for the nodes where $lft < node lft$ and $rgt > node rgt$. Traversal attributes, depth and node indexes are captured for each tree node and are stored as a pandas DataFrame [@pandas_2024].

In conclusion, Taxonomy Resolver has been developed to take advantage of the Nested Set Model tree structure, so it can perform fast validation and create lists of taxa that compose a particular subtree. Inclusion and exclusion lists can also be seamlessly used to produce subset trees with wide applications, particularly for sequence similarity search.
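
For readers of this diff who are unfamiliar with the Nested Set Model described in the paragraph above, the following is a minimal sketch of the idea, not Taxonomy Resolver's actual code. It assumes a toy tree given as a child-to-parent mapping, and the function names (`build_nested_set`, `subtree`, `path_to`) and node IDs are hypothetical. The paper stores the traversal attributes in a pandas DataFrame; plain dictionaries are used here only to keep the example dependency-free. The walk is iterative rather than recursive, in keeping with the paper's remark that recursion scales poorly on large trees.

```python
from collections import defaultdict


def build_nested_set(parent_of, root):
    """Assign lft, rgt and depth to every node with an iterative MPTT walk."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    nodes = {}
    counter = 0
    # Each node is pushed twice: once to record lft, once to record rgt.
    stack = [(root, 0, False)]
    while stack:
        node, depth, seen = stack.pop()
        counter += 1
        if seen:
            nodes[node]["rgt"] = counter
        else:
            nodes[node] = {"lft": counter, "depth": depth}
            stack.append((node, depth, True))
            # Reverse so that children are visited in their original order.
            for child in reversed(children[node]):
                stack.append((child, depth + 1, False))
    return nodes


def subtree(nodes, node):
    """Descendants of `node`: lft > node lft and rgt < node rgt."""
    lft, rgt = nodes[node]["lft"], nodes[node]["rgt"]
    hits = [n for n, a in nodes.items() if a["lft"] > lft and a["rgt"] < rgt]
    return sorted(hits, key=lambda n: nodes[n]["lft"])


def path_to(nodes, node):
    """Ancestors of `node`, root first: lft < node lft and rgt > node rgt."""
    lft, rgt = nodes[node]["lft"], nodes[node]["rgt"]
    hits = [n for n, a in nodes.items() if a["lft"] < lft and a["rgt"] > rgt]
    return sorted(hits, key=lambda n: nodes[n]["lft"])


# Hypothetical toy tree: 1 is the root; 2 and 3 are its children;
# 4 and 5 are children of 2.
toy_parents = {2: 1, 3: 1, 4: 2, 5: 2}
toy_nodes = build_nested_set(toy_parents, root=1)
```

With this toy tree, `subtree(toy_nodes, 2)` selects nodes 4 and 5, and `path_to(toy_nodes, 4)` selects nodes 1 and 2. Both queries are simple comparisons on the stored boundaries, so no tree traversal is needed at query time.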
