Skip to content

YesselmanLab/RNA_Representative_Set

 
 

Repository files navigation

RNA Representative Set

This package is a utility tool for the releases found in the Representative Sets of RNA 3D Structures.

There is one file, representative_set.py, with two classes:

  • RNARepresentativeSet
  • WebsiteParser

For the most part, you should only use RNARepresentativeSet. The WebsiteParser methods parse the live web page, which isn't very fast. The RNARepresentativeSet class persists the data parsed from the webpage so the website doesn't have to be scraped more than once. If you do happen to need to parse the web page though, the RNARepresentativeSet object has an attribute called parser that is the WebsiteParser class.

How to use it

This package depends on BeautifulSoup. Installation instructions can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

I usually install it the easy way: pip install beautifulsoup4

Once you have the BeautifulSoup package installed, yo ucan use this package. First, create an instance of the RNARepresentativeSet class:

from RNARepresentativeSet import RNARepresentativeSet

reps = RNARepresentativeSet()

The default instantation (i.e., creating an instance with no input parameters to the constructor) will parse the webpage's latest release and the 4.0A resolution cutoff. You can get whatever release and cutoff you want though. Here are some examples:

# Get latest data
reps = RNARepresentativeSet()

# Get latest data at 20.0A resolution cutoff
reps = RNARepresentativeSet(resolution_cutoff="20.0A")

# Get release 3.292 with default 4.0A resolution cutoff
reps = RNARepresentativeSet(relase_id="3.292")

# Get release 3.292 with 20.0A resolution cutoff
reps = RNARepresentativeSet(relase_id="3.292", resolution_cutoff="20.0A")

You can get representative info for a chain by accessing the object the same way you'd access a dictionary object:

data = reps["6XU8|1|A5+6XU8|1|A8"]
print(data)
""" Output of printing `data`:
{'pdb_id': '6XU8', 'info': ['28S ribosomal RNA, 5.8S ribosomal RNA, 2[...]', 'Electron microscopy', 'Release Date: 2021-07-28', 'Standardized name:  LSU rRNA +  5.8S rRNA', 'Source: Eukarya', 'Rfam: RF02543 + RF00002']}
"""

Notice there are also two keys in the object returned, pdb_id and info:

key = "6XU8|1|A5+6XU8|1|A8"

print(reps[key]["pdb_id"])
# Output: 6XU8

print(reps[key]["info"])
# Output: ['28S ribosomal RNA, 5.8S ribosomal RNA, 2[...]', 'Electron microscopy', 'Release Date: 2021-07-28', 'Standardized name:  LSU rRNA +  5.8S rRNA', 'Source: Eukarya', 'Rfam: RF02543 + RF00002']

Note: When you try to access a chain that doesn't exist in the representative set, the value returned will be false. Example:

print(reps["SomeNonexistentChain"])
# Ouputs: False

There are also two possibly helpful methods on the RNARepresentativeSet object:

  • get_rep_for
  • get_unique_reps_from_list

The get_rep_for method takes an entry and returns the representative PDB ID for it. For example:

reps = RNARepresentativeSet()
constituent = "7P8Q|1|B"
rep = reps.get_rep_for(constituent)
print(rep) # Outputs "6E0O"

The method get_unique_reps_from_list takes in a list of entries and returns a unique set of PDB IDs. For example:

reps = RNARepresentativeSet()

list_of_constituents = [
    "1VQ6|1|4", # Repped by 1VQ6
    "6OY5|1|I", # Repped by 6N6I
    "7MQC|1|B" # Repped by 6N6I
]

rep_set = reps.get_unique_reps_from_list(list_of_constituents)

print(rep_set) # Outputs {1VQ6, 6N61}

Note that even though the method was passed three entries, it only returns two IDs because two of the entries are represented by the same PDB ID.

How it works

The classes here parse the data on the linked page. Basically, it creates a huge dictionary. The keys for this dictionary are all of the individual entries in the "Class members" column. The value for each of those keys is whatever the PDB ID in the "Representative" column is.

Warnings!!!!

Release Versions

The default behavior of this package is to grab the most recent data published on the RNA3DHub list here: http://rna.bgsu.edu/rna3dhub/nrlist

They try to release a new version every week, so the data you grabbed a few weeks ago may be different from the data you grab today! Please keep this in mind and save the release ID and cutoff (default is 4.0A) that you used.

Testing

I have only tested a few combinations of release IDs and cutoffs. There may be issues with older versions of releases. There is some error-handling and fault-detection baked into the package, but it's not perfect.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%