-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet_integration #122
Parquet_integration #122
Conversation
… opened and closed only once per compartment, in the higher-level function seed(). ToDo: 1. Write lower-level second version, 2. Design and run tests.
…eading out of config file (in seed() from ingest.py)
…. 2) Corrected writer path using .join() in getWriters(). 3) Corrected construction of the local variable TablePaths, with which getWriters is called. It is now a string list, not a list of lists. 4)Capitalization of the string value of the name variable, i.e. images is now capitalized when called as name, which was the default case for the compartments.
Convert schemata
…s in ingest_parquet.py.
….py, the original file is named ingest_old.px
thanks for filing the PR @diskontinuum - it is clear this represents a ton of work and its great you're bringing the project to the next level. It is not quite there yet - we can iterate on this pull request until we get there. First, before I perform a more in depth review, we need to get the testing suite figured out. In general, travis has a really nice interface to figure out exactly where problems arise. (after clicking on the KeyError: PosixPath('/home/travis/build/cytomining/cytominer-database/tests/conftest.py') which is related to the following error on line 425 File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/_pytest/config/argparsing.py", line 314, in addoption
raise ValueError("option names %s already added" % conflict) Old BehaviorWe don't see this testing error in new buildes (example here). Recommended SolutionPart 1It is quite difficult at the moment to parse through exactly what change is causing this error. I noticed that you've created a new folder structure called Part 2While you are merging, it will also be good to consider how you will drop most of all of the additional data files added in this PR (225 files! 😱 ). I understand the reason for the data files (they're named appropriately for their purpose - which is great), but we should try our best to keep the data files consistent (and not add many different versions with slightly different inputs). Instead of removing a column or inserting an NA in the testing data by copying and manually removing for each new test, load existing data and drop the columns or inject NA entries after loading. This reduces bloat and is more transparent to the software engineer reading the code Thanks again for the very thoroughly documented code! I am looking forward to jumping in after we've addressed these two important points. Addressing these two points will save a ton of time for PR review, but also for all future cytominer-database developments (of which you have become a major contributor) |
Great! I see that most files were redundant. One potential concern though - do we keep track of all of those specific data files and which tests were using which data (and the specific manipulations that necessitated data folders in the first place)? (sorry if this is clear in the updated code, I haven't given it a detailed look just yet) Also, based on the travis build error: ==================================== ERRORS ====================================
____________________ ERROR collecting tests/test_ingest.py _____________________
ImportError while importing test module '/home/travis/build/cytomining/cytominer-database/tests/test_ingest.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_ingest.py:7: in <module>
import cytominer_database.ingest
cytominer_database/ingest.py:92: in <module>
import pyarrow
E ImportError: No module named 'pyarrow' It looks I'll also copy here what we discussed to be the next steps for getting this merged in and your contributions realized.
|
Co-authored-by: Greg Way <[email protected]>
…abase into colab_updates
…imports and tests accordingly.
…parquet backend. The default (no flag) is SQLite.
… a separate, new command module and included all corresponding tests for the command (including default, explicit sqlite and explicit parquet) on all test data. All tests pass.
… to'ingest_variable_engine.py'/'test_ingest_variable_engine.py'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Congrats for this huge contribution - version 0.4 here we come!
Please do think about the remaining comments. Once we touch base about them and finalize their discussions, I will merge and initiate the new version release. Thanks for sticking with this!
cytominer_database/ingest.py
Outdated
""" | ||
Read CSV files into a database backend. | ||
|
||
:param config_file: Configuration file. | ||
:param source: Directory containing subdirectories that contain CSV files. | ||
:param target: Connection string for the database. | ||
:param skip_image_prefix: True if the prefix of image table name should be excluded | ||
:param : True if the prefix of image table name should be excluded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment still needs addressing
Co-authored-by: Greg Way <[email protected]>
Co-authored-by: Greg Way <[email protected]>
Includes:
(1) Option to handle very large data sets by writing all .CSV files (per compartment) to a Parquet File -> issue #96 .
(2) determination of a reference table (via sampling or with a specified folder) -> issue #100 (missing columns)
(3) alignment w.r.t. reference table -> partly addresses issue #91, because there is no ingestion error for missing columns and zero-valued columns. However, the table is still read (and missing values are set to NaN)
(3) type conversion to deal with type inconsistencies (necessary to be able to write to same Parquet file)
(4) Several functions that return a dictionary of table paths of a certain table type. This needs work. Ideally, we condense all of them to a single general function which also solves issue #88.
Needs Work:
Note: