This package is a utility tool for the releases found in the Representative Sets of RNA 3D Structures.
There is one file, representative_set.py
, with two classes:
RNARepresentativeSet
WebsiteParser
For the most part, you should only use RNARepresentativeSet
. The WebsiteParser
methods parse the live web page, which isn't very fast. The RNARepresentativeSet
class persists the data parsed from the webpage so the website doesn't have to be scraped more than once. If you do happen to need to parse the web page though, the RNARepresentativeSet
object has an attribute called parser
that is the WebsiteParser
class.
This package depends on BeautifulSoup
. Installation instructions can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
I usually install it the easy way: pip install beautifulsoup4
Once you have the BeautifulSoup package installed, yo ucan use this package. First, create an instance of the RNARepresentativeSet
class:
from RNARepresentativeSet import RNARepresentativeSet
reps = RNARepresentativeSet()
The default instantation (i.e., creating an instance with no input parameters to the constructor) will parse the webpage's latest release and the 4.0A resolution cutoff. You can get whatever release and cutoff you want though. Here are some examples:
# Get latest data
reps = RNARepresentativeSet()
# Get latest data at 20.0A resolution cutoff
reps = RNARepresentativeSet(resolution_cutoff="20.0A")
# Get release 3.292 with default 4.0A resolution cutoff
reps = RNARepresentativeSet(relase_id="3.292")
# Get release 3.292 with 20.0A resolution cutoff
reps = RNARepresentativeSet(relase_id="3.292", resolution_cutoff="20.0A")
You can get representative info for a chain by accessing the object the same way you'd access a dictionary object:
data = reps["6XU8|1|A5+6XU8|1|A8"]
print(data)
""" Output of printing `data`:
{'pdb_id': '6XU8', 'info': ['28S ribosomal RNA, 5.8S ribosomal RNA, 2[...]', 'Electron microscopy', 'Release Date: 2021-07-28', 'Standardized name: LSU rRNA + 5.8S rRNA', 'Source: Eukarya', 'Rfam: RF02543 + RF00002']}
"""
Notice there are also two keys in the object returned, pdb_id
and info
:
key = "6XU8|1|A5+6XU8|1|A8"
print(reps[key]["pdb_id"])
# Output: 6XU8
print(reps[key]["info"])
# Output: ['28S ribosomal RNA, 5.8S ribosomal RNA, 2[...]', 'Electron microscopy', 'Release Date: 2021-07-28', 'Standardized name: LSU rRNA + 5.8S rRNA', 'Source: Eukarya', 'Rfam: RF02543 + RF00002']
Note: When you try to access a chain that doesn't exist in the representative set, the value returned will be false. Example:
print(reps["SomeNonexistentChain"])
# Ouputs: False
There are also two possibly helpful methods on the RNARepresentativeSet
object:
get_rep_for
get_unique_reps_from_list
The get_rep_for
method takes an entry and returns the representative PDB ID for it. For example:
reps = RNARepresentativeSet()
constituent = "7P8Q|1|B"
rep = reps.get_rep_for(constituent)
print(rep) # Outputs "6E0O"
The method get_unique_reps_from_list
takes in a list of entries and returns a unique set of PDB IDs. For example:
reps = RNARepresentativeSet()
list_of_constituents = [
"1VQ6|1|4", # Repped by 1VQ6
"6OY5|1|I", # Repped by 6N6I
"7MQC|1|B" # Repped by 6N6I
]
rep_set = reps.get_unique_reps_from_list(list_of_constituents)
print(rep_set) # Outputs {1VQ6, 6N61}
Note that even though the method was passed three entries, it only returns two IDs because two of the entries are represented by the same PDB ID.
The classes here parse the data on the linked page. Basically, it creates a huge dictionary. The keys for this dictionary are all of the individual entries in the "Class members" column. The value for each of those keys is whatever the PDB ID in the "Representative" column is.
The default behavior of this package is to grab the most recent data published on the RNA3DHub list here: http://rna.bgsu.edu/rna3dhub/nrlist
They try to release a new version every week, so the data you grabbed a few weeks ago may be different from the data you grab today! Please keep this in mind and save the release ID and cutoff (default is 4.0A) that you used.
I have only tested a few combinations of release IDs and cutoffs. There may be issues with older versions of releases. There is some error-handling and fault-detection baked into the package, but it's not perfect.