Pull genbank files with Accession numbers #33

Open
chacalle opened this issue Sep 2, 2016 · 12 comments

Comments

@chacalle
Contributor

chacalle commented Sep 2, 2016

The NCBI is phasing out GI numbers per this announcement. The code works for now but vdb.parse needs to be updated to get genbank files by accession number and not gi number.

@chacalle
Contributor Author

chacalle commented Sep 3, 2016

So the NCBI and BioPython don't actually let you pull genbank files via accession numbers. An issue was opened on the BioPython page to track changes for this: biopython/biopython#926

@pkundert

Any updates on the change-over from GI to accession numbers? I use a similar script to automate generating knock-out vectors. Even after updating xcode and biopython, I still see this error message:

  File "dictyko.py", line 624, in <module>
    gbrecord = locus_maps(gene, flank) ###returns outfile name to use in primer stuff
  File "dictyko.py", line 89, in locus_maps
    gi_id, ORF_start, ORF_end, strand = fetch_gene_coordinates(gene)
  File "dictyko.py", line 59, in fetch_gene_coordinates
    rec = Entrez.read(handle)
  File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py", line 376, in read
    record = handler.read(handle)
  File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line 205, in read
    self.parser.ParseFile(handle)
  File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line 513, in externalEntityRefHandler
    self.dtd_urls.append(url)
UnboundLocalError: local variable 'url' referenced before assignment
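
For readers unfamiliar with this error class, the sketch below shows the general pattern that produces an UnboundLocalError like the one in the traceback above: a local variable that is assigned in only some branches and then used unconditionally. It is not the Biopython source; `pick_url` and `pick_url_checked` are hypothetical names, and the `http://` check is an assumption chosen purely for illustration.

```python
def pick_url(system_id):
    """Hypothetical helper: `url` is only assigned in one branch."""
    if system_id.startswith("http://"):
        url = system_id
    # No else branch: any other input leaves `url` unassigned,
    # so the next line raises UnboundLocalError.
    return url


def pick_url_checked(system_id):
    """Same logic, but unsupported input fails with a clear message."""
    if system_id.startswith("http://"):
        return system_id
    raise ValueError("unsupported identifier: %r" % (system_id,))


if __name__ == "__main__":
    try:
        pick_url("https://example.org/some.dtd")
    except UnboundLocalError as err:
        print("reproduced:", err)
```

An error of this kind usually means the called code met an input it did not anticipate, so it tends to be something to report upstream rather than to fix in the calling script.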

@chacalle
Contributor Author

@pkundert I'd recommend asking on the BioPython page. I haven't heard anything yet.
biopython/biopython#926

@sidneymbell
Contributor

I think this issue has now become pressing; running dengue_upload, my accessions list and query are being formed correctly, but this returns giList==[].
https://github.com/nextstrain/fauna/blob/master/vdb/parse.py#L195

I'm investigating now (starting with the biopython issue thread @chacalle mentioned above), but wanted to give people a heads up in the meantime.

@sidneymbell
Contributor

Seems like people are on it, but it's also a bit of a mess for the time being:
https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/comment-page-1/#comment-35754

@chacalle
Contributor Author

chacalle commented Dec 9, 2016

@sidneymbell Hey Sidney, I can try helping with this later. Do you know if running update on test_vdb is also failing? python vdb/zika_update.py -db test_vdb -v zika. I feel like people would be creating issues on biopython if this is failing for others as well.

@sidneymbell
Contributor

Hey @chacalle - I wondered about that as well. I'm planning to spend the morning investigating in more detail, and will certainly start with the zika implementation to see if it's just something specific about the way my code interacts with the base scripts.

It's definitely failing at the step where it tries to run the query with GI numbers (the query itself is being created and formatted correctly), and it hasn't failed there in the past, which makes me rather suspicious. I'll update here with what I figure out today. Thanks!

@sidneymbell
Contributor

sidneymbell commented Dec 9, 2016

@chacalle --
So, the good news is it's a false alarm. It is failing on the esearch step (just returning an empty ID list with an error message that sounds a whole lot like it's a GI number issue), but luckily I don't think it's the case (I totally leapt to conclusions here).

The less awesome news is that I'm pretty sure it's related to the number of accessions. This doesn't make a whole lot of sense given that, from the docs:

Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 100,000 records.

and retmax == 10**9 for our queries (in my case, n==6000). But, it's reproducible.

Shouldn't be hard to fix. I'll patch it and submit a PR for your thoughts. Thanks for looking at this, and sorry for the confusion!
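
If the failure really does depend on how many accessions go into one query, one possible workaround is to split the accession list into smaller batches and merge the returned IDs. The sketch below is a guess at the shape of such a patch, not the actual change made to vdb/parse.py. The helper names, the `" OR "` term format and the default batch size of 500 are all assumptions, and the `search` argument is any callable that takes a term string and returns a list of IDs.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_term(accessions: Sequence[str]) -> str:
    """Join accessions into one search term (assumed format)."""
    return " OR ".join(accessions)


def collect_ids(
    accessions: Sequence[str],
    search: Callable[[str], List[str]],
    batch_size: int = 500,  # assumed value, tune as needed
) -> List[str]:
    """Run `search` once per batch and concatenate the ID lists."""
    ids: List[str] = []
    for batch in batches(accessions, batch_size):
        ids.extend(search(build_term(batch)))
    return ids


if __name__ == "__main__":
    # Stand-in search function so the sketch runs without network access.
    def fake_search(term: str) -> List[str]:
        return ["id-for-" + acc for acc in term.split(" OR ")]

    print(collect_ids(["KX000001", "KX000002", "KX000003"], fake_search, 2))
```

With Biopython, `search` could be a small wrapper that calls `Entrez.esearch` with the term and returns the `"IdList"` entry of the parsed result. Keeping the network call behind a callable means the batching logic can be tested offline.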

@pawlowac

pawlowac commented Feb 4, 2017

Has there been any solution to this? Entrez (efetch/epost) won't accept accession.version, but most of their results are given as an accession. Otherwise, is there a way to replace thousands of accession.version with GI numbers?

@trvrb
Member

trvrb commented Feb 25, 2017

After upgrading to biopython 1.68

pip install biopython --upgrade
Successfully installed biopython-1.68

--update_citations is working again for me. I don't know if the underlying bug is actually resolved, however.

@chacalle
Contributor Author

@trvrb --update_citations stopped working? It seems like biopython 1.68 was released in August 2016 (http://biopython.org/wiki/Download) so I don't think the underlying problem is fixed. I wonder why it wasn't working.

According to this comment (https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/comment-page-1/#comment-35754) they are supposed to blog about it when they do finally change things.

@trvrb
Member

trvrb commented Feb 25, 2017

Oh. This was entirely me then. Thanks for the update.


5 participants