Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support pandas version 2 #169

Closed
victorlin opened this issue Nov 20, 2024 · 5 comments
Closed

Support pandas version 2 #169

victorlin opened this issue Nov 20, 2024 · 5 comments
Assignees

Comments

@victorlin
Copy link
Member

In the process of nextstrain/public#12, it was noticed that pandas dependency version constraint in this repo is preventing Nextstrain runtimes from resolving to pandas version 2.

pandas >=1.1.5, ==1.*

It's possible that the code is already compatible with pandas v2, in which case the fix is as simple as changing that line to pandas >=1.1.5, <3 (also mentioned in nextstrain/public#12).

Here are some resources for the migration:

@joverlee521
Copy link
Contributor

joverlee521 commented Nov 21, 2024

Aaand of course this is tied to #113...

Trial run of our flu_upload with Python 3.10 runs into error:

ImportError: Pandas requires version '2.0.1' or newer of 'xlrd' (version '1.2.0' currently installed).

@joverlee521
Copy link
Contributor

joverlee521 commented Nov 21, 2024

I was able to install packages without any issues because xlrd is an optional dependency that is only checked with pandas[excel]. Installing with pandas[excel] >=1.1.5, <3 and xlrd >=1.0.0, ==1.* keeps pandas at v1.5.3. because the extra excel was only added in v2.

Upgrading xlrd to v2 shouldn't be an issue for sequence uploads since GISAID exports .xls files.
However, titer data come as .xlsx files, which does require xlrd v1 with Python 3.7
We can make this version difference explicit with environment markers:

diff --git a/requirements.txt b/requirements.txt
index 5694eed..5964ad0 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,8 +1,10 @@
 biopython >=1.73, ==1.*
 boto >=2.38, ==2.*
 certifi
-pandas >=1.1.5, ==1.*
+pandas[excel] >=2.0.0, <3; python_version>="3.8"
 rethinkdb >=2.4.8, ==2.4.*, !=2.4.10
 requests >=2.20.0, ==2.*
 unidecode >=1.0.22, ==1.*
-xlrd >=1.0.0, ==1.*
+# Use Python 3.7 for tbd uploading scripts
+pandas >=1.1.5, ==1.*; python_version<"3.8"
+xlrd >=1.0.0, ==1.*; python_version<"3.8"

Depending on how many issues we run into with replacing use of xlrd with excelrd (as suggested in #113 (comment)), we can also change the dependencies to

diff --git a/requirements.txt b/requirements.txt
index 5694eed..7cc1d7d 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,8 +1,8 @@
 biopython >=1.73, ==1.*
 boto >=2.38, ==2.*
 certifi
-pandas >=1.1.5, ==1.*
+pandas[excel] >=2.0.0, <3
 rethinkdb >=2.4.8, ==2.4.*, !=2.4.10
 requests >=2.20.0, ==2.*
 unidecode >=1.0.22, ==1.*
-xlrd >=1.0.0, ==1.*
+excelrd >=3.0.0, <4

@victorlin
Copy link
Member Author

victorlin commented Nov 21, 2024

Thanks for digging into this @joverlee521! Stepping back a bit, a couple points:

  • The Nextstrain runtimes don't need to have pandas v2 any time soon. It's just that Augur (finally) supports it so we are one step closer to the runtimes resolving to pandas v2, at which point we should be on the lookout for any breakage. Though I think Augur's pathogen-repo-ci tests gives us some confidence.
  • Is it worth considering decoupling Fauna from the Nextstrain runtimes (e.g. using an extended version of the base image when working with Fauna)? That would allow us to keep using fauna with pandas v1 and everything else with pandas v2. I think it's worth considering if the only reason for Fauna to support pandas v2 is to unblock the Nextstrain runtime.

@joverlee521
Copy link
Contributor

The Nextstrain runtimes don't need to have pandas v2 any time soon.

Gotcha! Didn't want fauna to be a blocker here...

Is it worth considering decoupling Fauna from the Nextstrain runtimes (e.g. using an extended version of the base image when working with Fauna)?

Yes! It might makes sense to move it out of base image since it is only used by seasonal-flu and avian-flu. It should be straightforward to point the automated GH Action workflows to use the extended image. However, it might be extra burden on users to remember to use the correct image when running things manually.

@victorlin
Copy link
Member Author

Decoupling seems like the better option, I will write up a separate issue and link it soon.

Feel free to close this one as not planned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants