Parquet_integration #122

diskontinuum · 2020-02-10T22:43:27Z

Includes:

(1) Option to handle very large data sets by writing all .CSV files (per compartment) to a Parquet file -> issue #96.

(2) Determination of a reference table (via sampling or with a specified folder) -> issue #100 (missing columns).

(3) Alignment w.r.t. the reference table -> partly addresses issue #91, because missing columns and zero-valued columns no longer cause an ingestion error. However, the table is still read (and missing values are set to NaN).

(4) Type conversion to deal with type inconsistencies (necessary to be able to write to the same Parquet file).

(5) Several functions that return a dictionary of table paths of a certain table type. This needs work. Ideally, we condense all of them into a single general function, which would also solve issue #88.
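The reference-table, alignment, and type-conversion steps listed above can be illustrated with a small standard-library sketch. This is not the code in this PR: the function names (`reference_columns`, `to_float`, `align_table`), the sample column names, and the choice to convert every cell to float are assumptions made only for illustration.

```python
import csv
import io
import math


def reference_columns(csv_texts):
    """Union of header names across sampled CSV tables, in first-seen order."""
    columns = []
    for text in csv_texts:
        header = next(csv.reader(io.StringIO(text)))
        for name in header:
            if name not in columns:
                columns.append(name)
    return columns


def to_float(value):
    """Convert a CSV cell to float; missing or non-numeric cells become NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def align_table(csv_text, columns):
    """Return rows as lists ordered like `columns`; absent columns are NaN."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append([to_float(record.get(name)) for name in columns])
    return rows


# Two hypothetical per-well tables; the second lacks the 'Intensity' column.
well_a = "ImageNumber,Area,Intensity\n1,10,0.5\n"
well_b = "ImageNumber,Area\n1,12\n"

columns = reference_columns([well_a, well_b])
aligned_b = align_table(well_b, columns)
```

After alignment every table has the same columns in the same order and a single value type, which is the precondition for appending them to one file with a fixed schema.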

Needs Work:

Eliminate the separate processing of image.CSV vs. the CSV files of compartments (it causes much redundant code and many potential bugs due to the inconsistent data structures).
Transfer testing functions from the Colaboratory notebooks.
Add Pytest test scripts: (1) a modification of the old test_ingest.py that runs with the Parquet option (comparing the shape of files) and (2) a Pytest script that also compares tabular values (adapted from the Colaboratory notebooks).
Condense functions of similar functionality.
Find and delete unused functions, and change or delete hard-coded parts so that everything is read consistently from config.ini. This affects almost all functions in utils.py.
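For the planned test that compares tabular values, one detail worth handling is that NaN never compares equal to itself, so a plain equality check fails on aligned tables that contain missing values. A minimal sketch, assuming tables are held as lists of rows; `tables_equal` is a hypothetical helper, not a function from this repository:

```python
import math


def tables_equal(left, right):
    """Compare two tables (lists of rows) by shape first, then cell by cell.

    NaN cells are treated as equal to each other, since missing values are
    stored as NaN after alignment.
    """
    if len(left) != len(right):
        return False
    for row_l, row_r in zip(left, right):
        if len(row_l) != len(row_r):
            return False
        for a, b in zip(row_l, row_r):
            both_nan = (
                isinstance(a, float)
                and isinstance(b, float)
                and math.isnan(a)
                and math.isnan(b)
            )
            if not both_nan and a != b:
                return False
    return True


def test_tables_equal_handles_nan():
    # pytest collects functions whose names start with "test_".
    source = [[1.0, math.nan], [2.0, 3.0]]
    ingested = [[1.0, math.nan], [2.0, 3.0]]
    assert tables_equal(source, ingested)
    assert not tables_equal(source, [[1.0, math.nan]])
```

The shape check comes first so that a mismatch in row or column count is reported before any values are compared.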

Note:

.CSV tables from different wells are now in the same Parquet file, but remain distinguishable as distinct row groups. The different table types are in different Parquet files (e.g. image.parquet, Cell.parquet, ...).
An obvious and easy next step is to write a function that concatenates (horizontally) the Parquet files of different types. We can simply merge tables according to their (identical) TableNumber.
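A minimal sketch of that horizontal merge, using plain dictionaries instead of Parquet tables. It assumes exactly one image row per TableNumber; if several images share a TableNumber, the key would presumably need to include ImageNumber as well. The function name and column names are hypothetical.

```python
def merge_on_table_number(image_rows, cell_rows):
    """Attach the matching image columns to every compartment row."""
    images_by_table = {row["TableNumber"]: row for row in image_rows}
    merged = []
    for cell in cell_rows:
        image = images_by_table.get(cell["TableNumber"], {})
        # Image columns first, then compartment columns
        # (compartment values win if a column name clashes).
        merged.append({**image, **cell})
    return merged


image_rows = [
    {"TableNumber": 1, "Image_Count_Cells": 2},
    {"TableNumber": 2, "Image_Count_Cells": 1},
]
cell_rows = [
    {"TableNumber": 1, "Cells_Area": 10.0},
    {"TableNumber": 1, "Cells_Area": 11.5},
    {"TableNumber": 2, "Cells_Area": 9.0},
]
merged = merge_on_table_number(image_rows, cell_rows)
```

The result has one row per compartment object, each carrying the image-level columns of its table.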

…. 2) Corrected writer path using .join() in getWriters(). 3) Corrected construction of the local variable TablePaths, with which getWriters is called. It is now a list of strings, not a list of lists. 4) Capitalization of the string value of the name variable, i.e. images is now capitalized when called as name, which was already the default for the compartments.

gwaybio · 2020-02-18T20:07:01Z

Thanks for filing the PR @diskontinuum - it is clear this represents a ton of work, and it's great you're bringing the project to the next level. It is not quite there yet, but we can iterate on this pull request until we get there.

First, before I perform a more in-depth review, we need to get the testing suite figured out. In general, Travis has a really nice interface for figuring out exactly where problems arise. After clicking on the details link above (next to the continuous-integration/travis-ci/ build), you will notice a strange error on line 350 (https://travis-ci.org/cytomining/cytominer-database/jobs/650395552?utm_medium=notification&utm_source=github_status):

KeyError: PosixPath('/home/travis/build/cytomining/cytominer-database/tests/conftest.py')

which is related to the following error on line 425

  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/_pytest/config/argparsing.py", line 314, in addoption
    raise ValueError("option names %s already added" % conflict)
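The ValueError quoted above is raised when an option name is registered a second time. Whether that is what happened in this build (for example, two conftest.py files registering the same option) is an assumption that the log does not confirm. The class below is a hypothetical, simplified stand-in that reproduces the message, not pytest's implementation, and `--dataset` is a made-up option name.

```python
class OptionRegistry:
    """Simplified stand-in for a parser that rejects duplicate option names."""

    def __init__(self):
        self._names = set()

    def addoption(self, *names):
        conflict = self._names.intersection(names)
        if conflict:
            raise ValueError("option names %s already added" % conflict)
        self._names.update(names)


registry = OptionRegistry()
registry.addoption("--dataset")  # e.g. registered by one conftest.py

try:
    registry.addoption("--dataset")  # e.g. registered again by a second one
    duplicate_error = None
except ValueError as error:
    duplicate_error = str(error)
```

If duplicate registration is indeed the cause, keeping a single conftest.py that owns the option definitions would avoid the conflict.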

Old Behavior

We don't see this testing error in new builds (example here).

Recommended Solution

Part 1

It is quite difficult at the moment to parse through exactly what change is causing this error. I noticed that you've created a new folder structure called tests_WIP. I think now is the time to merge that folder into the real testing suite. This will help immensely for viewing the exact changes you've made (e.g. there is no quick way to tell if tests_WIP/conftest.py is different from tests/conftest.py).

Part 2

While you are merging, it would also be good to consider how to drop most or all of the additional data files added in this PR (225 files! 😱). I understand the reason for the data files (they're named appropriately for their purpose, which is great), but we should try our best to keep the data files consistent (and not add many different versions with slightly different inputs). Instead of copying the testing data and manually removing a column or inserting an NA for each new test, load the existing data and drop the columns or inject NA entries after loading. This reduces bloat and is more transparent to the software engineer reading the code.
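As a sketch of that suggestion, the helpers below derive test variants in memory from one existing table instead of storing a modified copy on disk. The helper names and the inline CSV are hypothetical; real tests would load one of the existing data files.

```python
import csv
import io
import math


def load_rows(csv_text):
    """Load a CSV table into a list of dicts (one dict per row)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def drop_column(rows, name):
    """Return a copy of the table without the column `name`."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


def inject_missing(rows, row_index, name):
    """Return a copy of the table with one cell replaced by NaN."""
    copied = [dict(row) for row in rows]
    copied[row_index][name] = math.nan
    return copied


# Stand-in for one existing test file; real tests would read it from disk.
CELLS_CSV = "ImageNumber,ObjectNumber,Area\n1,1,10\n1,2,12\n"

base = load_rows(CELLS_CSV)
without_area = drop_column(base, "Area")
with_gap = inject_missing(base, 0, "Area")
```

Both helpers return copies, so the base table stays unchanged and can be reused by other tests.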

Thanks again for the very thoroughly documented code! I am looking forward to jumping in after we've addressed these two important points. Addressing them will save a ton of time for PR review, but also for all future cytominer-database developments (of which you have become a major contributor).

gwaybio · 2020-02-20T22:16:57Z

Great! I see that most files were redundant. One potential concern though - do we keep track of all of those specific data files and which tests were using which data (and the specific manipulations that necessitated data folders in the first place)? (sorry if this is clear in the updated code, I haven't given it a detailed look just yet)

Also, based on the travis build error:

==================================== ERRORS ====================================
____________________ ERROR collecting tests/test_ingest.py _____________________
ImportError while importing test module '/home/travis/build/cytomining/cytominer-database/tests/test_ingest.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_ingest.py:7: in <module>
    import cytominer_database.ingest
cytominer_database/ingest.py:92: in <module>
    import pyarrow
E   ImportError: No module named 'pyarrow'

It looks like pyarrow should also be added to a file called requirements.txt.

I'll also copy here what we discussed to be the next steps for getting this merged in and your contributions realized.

So, to proceed, let's:

Get pytest running locally

Make sure the Colab code is following pytest conventions (the examples in the repo are a great place to start)

Let me know if there are any specific questions

gwaybio

Congrats for this huge contribution - version 0.4 here we come!

Please do think about the remaining comments. Once we touch base about them and finalize their discussions, I will merge and initiate the new version release. Thanks for sticking with this!

CONTRIBUTORS.md

README.rst

cytominer_database/commands/command_ingest_variable_engine.py

gwaybio · 2020-07-29T12:25:37Z

cytominer_database/ingest.py

    """
    Read CSV files into a database backend.

    :param config_file: Configuration file.
    :param source: Directory containing subdirectories that contain CSV files.
    :param target: Connection string for the database.
-    :param skip_image_prefix: True if the prefix of image table name should be excluded
+    :param  : True if the prefix of image table name should be excluded


this comment still needs addressing

doc/conf.py

diskontinuum and others added 18 commits December 16, 2019 18:15

[WIP] Integrated Parquet Engine Option, started with test modifications

d606e21

Included Parquet-Writer in ingest.py : In this version, the writer is…

6a66fa8

… opened and closed only once per compartment, in the higher-level function seed(). ToDo: 1. Write lower-level second version, 2. Design and run tests.

Added 'database_engine' section to the config files, incl. tests.

7e766a5

Added Parquet Option to config files. Corrected argument list of into().

abe71e9

Fixed SyntaxError: positional argument follows keyword argument

7b9e1e0

debuggig: removed comments from config.ini, removed splitext() when r…

cf6039e

…eading out of config file (in seed() from ingest.py)

debugged: changed argument list of getWriter()

847dae7

Started integration of schema-checking

5369033

-

98e673c

Added sampling and conversion functions

3194ce5

Merge pull request #1 from diskontinuum/convert_schemata

a321c09

Convert schemata

Does not contain changes from colab notebook.

0e84e68

Added original ingest.py and deleted old partial code. All new code i…

fc27c3e

…s in ingest_parquet.py.

fixed imports and naming bugs

cee0029

Added option to ignore object.csv

d9e8185

renamed WIP folder

0c07ad2

deleted deprecated ingest.py. The new version is named ingest_parquet…

e1292dd

….py, the original file is named ingest_old.px

diskontinuum added 2 commits February 19, 2020 13:48

removed additional test folder, updated test_ingest.py, renamed inges…

1b64786

…t.py-versions

updated install requirement 'pyarrow' in setup.py

848baad

diskontinuum added 8 commits February 24, 2020 10:39

added 'pyarrow>=0.16.1' to requirements.txt

6c414f3

removed pyarrow from install_requires in setup.py

a5af51c

comma

7f26041

quotation marks

a34c99d

reduced requirements pyarrow>=0.16.1 to pyarrow>=0.16.0

9016f01

added numpy to requirements.txt

62d4685

renaming of ingest files to check if tests work on original files in …

5a3ef5b

…this branch

Added original test_ingest.py code

a7cb5a3

diskontinuum and others added 19 commits July 1, 2020 18:39

Update README.rst

92e90bd

Co-authored-by: Greg Way <[email protected]>

bracket

8563d6c

Merge branch 'colab_updates' of github.com:diskontinuum/cytominer-dat…

512c48e

…abase into colab_updates

added author 'Frances Hubis' and updated year

c55560b

added author 'Frances Hubis' and git link to CONTRIBUTORS.md

a2c9ae6

renamed module ingest_incl_parquet.py to ingest_parquet.py . Changed …

d1610fc

…imports and tests accordingly.

added a 'parquet' flag option

6759ce7

removed old-named ingest_incl_parquet.py

022fb94

changed command structure. Now the flag '--parquet' will trigger the …

81f6e89

…parquet backend. The default (no flag) is SQLite.

updated README.md to contain the click command instruction for the 'p…

6126085

…arquet' option.

removed deprecated filename

d2660ac

removed 'parquet' flag - command tests pass again

fe10d81

included new command flag '--variable_engine/--no-variable_engine' in…

d16bcfb

… a separate, new command module and included all corresponding tests for the command (including default, explicit sqlite and explicit parquet) on all test data. All tests pass.

renamed new code version 'ingest_parquet.py'/'test_ingest_parquet.py'…

2529591

… to'ingest_variable_engine.py'/'test_ingest_variable_engine.py'

prettified code with black

45a39e9

reversed accidental erasure

c17d852

reversed ordering

aae91e9

-

094707a

Merge branch 'master' into colab_updates

edf2953

gwaybio approved these changes Jul 29, 2020

diskontinuum and others added 5 commits July 29, 2020 16:26

Update CONTRIBUTORS.md

ed68760

Co-authored-by: Greg Way <[email protected]>

Update cytominer_database/commands/command_ingest_variable_engine.py

6cfd9d1

Co-authored-by: Greg Way <[email protected]>

Added Hubis to copyright list and ran black again.

799997f

Addressed incomplete doctring in ingest.seed()

1b2976c

Update cytominer_database/ingest.py

91c871c

gwaybio merged commit 9cf4340 into cytomining:master Sep 17, 2020

diskontinuum deleted the colab_updates branch September 21, 2020 09:16

This was referenced Sep 24, 2020

Ingest fails when columns don't align across sets #100

Open

Integrate the use of a reference schema for the '--sqlite' option (done for --parquet) #127

Open

gwaybio mentioned this pull request Dec 14, 2020

Modular single cell ingest cytomining/pycytominer#112

Open
