Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet_integration #122

Merged
merged 153 commits into from
Sep 17, 2020
Merged

Parquet_integration #122

merged 153 commits into from
Sep 17, 2020

Conversation

diskontinuum
Copy link
Contributor

Includes:

(1) Option to handle very large data sets by writing all .CSV files (per compartment) to a Parquet File -> issue #96 .

(2) determination of a reference table (via sampling or with a specified folder) -> issue #100 (missing columns)

(3) alignment w.r.t. reference table -> partly addresses issue #91, because there is no ingestion error for missing columns and zero-valued columns. However, the table is still read (and missing values are set to NaN)
(3) type conversion to deal with type inconsistencies (necessary to be able to write to same Parquet file)

(4) Several functions that return a dictionary of table paths of a certain table type. This needs work. Ideally, we condense all of them to a single general function which also solves issue #88.

Needs Work:

  • eliminate separate processing of image.CSV v.s. CSV files of compartments (so much redundant code and many potential bugs due to the inconsistent data structures).
  • transfer testing functions from colaboratory notebooks
  • add Pytest test scripts: (1) Modification of old test_ingest.py that runs with Parquet option (comparing the shape of files) and (2) Pytest script that also includes comparison of tabular values (modify from colaboratory notebooks).
  • condense functions of similar functionality
  • find and delete unused functions and change/delete hard-coded parts s.t. everything can be read from config.ini, consistently. This addresses almost all functions in utils.py.

Note:

  • .CSV tables from different wells are now in the same Parquet File, but distinguishable as distinct "row-group". The different table types are in different Parquet Files (e.g. image.parquet, Cell.parquet, ...)
  • An obvious and easy next step is to write a function that concatenates (horizontally) the Parquet files of different types. We can simply merge tables according to their (identical) TableNumber.

diskontinuum and others added 18 commits December 16, 2019 18:15
… opened and closed only once per compartment, in the higher-level function seed(). ToDo: 1. Write lower-level second version, 2. Design and run tests.
…eading out of config file (in seed() from ingest.py)
…. 2) Corrected writer path using .join() in getWriters(). 3) Corrected construction of the local variable TablePaths, with which getWriters is called. It is now a string list, not a list of lists. 4)Capitalization of the string value of the name variable, i.e. images is now capitalized when called as name, which was the default case for the compartments.
….py, the original file is named ingest_old.px
@gwaybio
Copy link
Member

gwaybio commented Feb 18, 2020

thanks for filing the PR @diskontinuum - it is clear this represents a ton of work and its great you're bringing the project to the next level. It is not quite there yet - we can iterate on this pull request until we get there.

First, before I perform a more in depth review, we need to get the testing suite figured out. In general, travis has a really nice interface to figure out exactly where problems arise. (after clicking on the details link above next to the continuous-integration/travis-ci/ build) you will notice some strange error on line 350 (https://travis-ci.org/cytomining/cytominer-database/jobs/650395552?utm_medium=notification&utm_source=github_status)

KeyError: PosixPath('/home/travis/build/cytomining/cytominer-database/tests/conftest.py')

which is related to the following error on line 425

  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/_pytest/config/argparsing.py", line 314, in addoption
    raise ValueError("option names %s already added" % conflict)

Old Behavior

We don't see this testing error in new buildes (example here).

Recommended Solution

Part 1

It is quite difficult at the moment to parse through exactly what change is causing this error. I noticed that you've created a new folder structure called tests_WIP. I think now is time to merge that folder into the real testing suite. This will help immensely for viewing exact changes you've made (e.g. there is no quick way to tell if tests_WIP/conftest.py is different than tests/conftest.py.

Part 2

While you are merging, it will also be good to consider how you will drop most of all of the additional data files added in this PR (225 files! 😱 ). I understand the reason for the data files (they're named appropriately for their purpose - which is great), but we should try our best to keep the data files consistent (and not add many different versions with slightly different inputs). Instead of removing a column or inserting an NA in the testing data by copying and manually removing for each new test, load existing data and drop the columns or inject NA entries after loading. This reduces bloat and is more transparent to the software engineer reading the code

Thanks again for the very thoroughly documented code! I am looking forward to jumping in after we've addressed these two important points. Addressing these two points will save a ton of time for PR review, but also for all future cytominer-database developments (of which you have become a major contributor)

@gwaybio
Copy link
Member

gwaybio commented Feb 20, 2020

Great! I see that most files were redundant. One potential concern though - do we keep track of all of those specific data files and which tests were using which data (and the specific manipulations that necessitated data folders in the first place)? (sorry if this is clear in the updated code, I haven't given it a detailed look just yet)

Also, based on the travis build error:

==================================== ERRORS ====================================
____________________ ERROR collecting tests/test_ingest.py _____________________
ImportError while importing test module '/home/travis/build/cytomining/cytominer-database/tests/test_ingest.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_ingest.py:7: in <module>
    import cytominer_database.ingest
cytominer_database/ingest.py:92: in <module>
    import pyarrow
E   ImportError: No module named 'pyarrow'

It looks pyarrow should also be added to a file called requirements.txt.

I'll also copy here what we discussed to be the next steps for getting this merged in and your contributions realized.

so, to proceed lets

  1. Get pytest running locally
  2. Make sure the collab code is following pytest conventions (the examples in the repo are a great place to start)
  3. Let me know if there are any specific questions

diskontinuum and others added 19 commits July 1, 2020 18:39
Co-authored-by: Greg Way <[email protected]>
…parquet backend. The default (no flag) is SQLite.
… a separate, new command module and included all corresponding tests for the command (including default, explicit sqlite and explicit parquet) on all test data. All tests pass.
… to'ingest_variable_engine.py'/'test_ingest_variable_engine.py'
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congrats for this huge contribution - version 0.4 here we come!

Please do think about the remaining comments. Once we touch base about them and finalize their discussions, I will merge and initiate the new version release. Thanks for sticking with this!

CONTRIBUTORS.md Show resolved Hide resolved
README.rst Show resolved Hide resolved
"""
Read CSV files into a database backend.

:param config_file: Configuration file.
:param source: Directory containing subdirectories that contain CSV files.
:param target: Connection string for the database.
:param skip_image_prefix: True if the prefix of image table name should be excluded
:param : True if the prefix of image table name should be excluded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment still needs addressing

doc/conf.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants