Parquet_integration #122

Merged
merged 153 commits into from
Sep 17, 2020
Commits
d606e21
[WIP] Integrated Parquet Engine Option, started with test modifications
diskontinuum Dec 16, 2019
6a66fa8
Included Parquet-Writer in ingest.py : In this version, the writer is…
diskontinuum Jan 2, 2020
7e766a5
Added 'database_engine' section to the config files, incl. tests.
diskontinuum Jan 3, 2020
abe71e9
Added Parquet Option to config files. Corrected argument list of into().
diskontinuum Jan 6, 2020
7b9e1e0
Fixed SyntaxError: positional argument follows keyword argument
diskontinuum Jan 6, 2020
cf6039e
debuggig: removed comments from config.ini, removed splitext() when r…
diskontinuum Jan 6, 2020
847dae7
debugged: changed argument list of getWriter()
diskontinuum Jan 6, 2020
25557c2
corrected 4 bugs: 1)completed argument list when calling getWriters()…
diskontinuum Jan 7, 2020
5369033
Started integration of schema-checking
diskontinuum Jan 16, 2020
98e673c
-
diskontinuum Jan 18, 2020
3194ce5
Added sampling and conversion functions
diskontinuum Jan 22, 2020
a321c09
Merge pull request #1 from diskontinuum/convert_schemata
diskontinuum Jan 22, 2020
0e84e68
Does not contain changes from colab notebook.
diskontinuum Feb 8, 2020
fc27c3e
Added original ingest.py and deleted old partial code. All new code i…
diskontinuum Feb 10, 2020
cee0029
fixed imports and naming bugs
diskontinuum Feb 10, 2020
d9e8185
Added option to ignore object.csv
diskontinuum Feb 10, 2020
0c07ad2
renamed WIP folder
diskontinuum Feb 11, 2020
e1292dd
deleted deprecated ingest.py. The new version is named ingest_parquet…
diskontinuum Feb 14, 2020
1b64786
removed additional test folder, updated test_ingest.py, renamed inges…
diskontinuum Feb 19, 2020
848baad
updated install requirement 'pyarrow' in setup.py
diskontinuum Feb 19, 2020
6c414f3
added 'pyarrow>=0.16.1' to requirements.txt
diskontinuum Feb 24, 2020
a5af51c
removed pyarrow from install_requires in setup.py
diskontinuum Feb 24, 2020
7f26041
comma
diskontinuum Feb 24, 2020
a34c99d
quotation marks
diskontinuum Feb 24, 2020
9016f01
reduced requirements pyarrow>=0.16.1 to pyarrow>=0.16.0
diskontinuum Feb 24, 2020
62d4685
added numpy to requirements.txt
diskontinuum Feb 24, 2020
5a3ef5b
renaming of ingest files to check if tests work on original files in …
diskontinuum Feb 24, 2020
a7cb5a3
Added original test_ingest.py code
diskontinuum Feb 24, 2020
7efc190
changed function argument name in munge.py. Reading the config file (…
diskontinuum Feb 24, 2020
0d2f585
new separate folders for new layout of config file (cytominer_databas…
diskontinuum Feb 24, 2020
d34ce21
changed config_file arguments to config_path for function calls of cy…
diskontinuum Feb 27, 2020
a0a72c6
added new options to the default config.ini. Old tests pass.
diskontinuum Feb 27, 2020
1dedbda
modified the local config.ini files in all test folders. All original…
diskontinuum Feb 27, 2020
e67a239
Changed entries in config.ini files to default value 'sample' and sta…
diskontinuum Feb 28, 2020
2565b6c
Refatored. ToDo: Finish original tests, add new tests, clean up and d…
diskontinuum Apr 28, 2020
c42dd91
Original tests pass with given config files. ToDo: Test with differen…
diskontinuum Apr 28, 2020
e5193b9
deleted test duplicates
diskontinuum Apr 28, 2020
8e45b21
Added a documentation for the configuration file config.ini to rhe re…
diskontinuum Apr 28, 2020
e08cb83
updated readme.
diskontinuum Apr 28, 2020
406d164
updated readme.
diskontinuum Apr 28, 2020
b72dc1d
updated readme.
diskontinuum Apr 28, 2020
dd7d50e
updated readme.
diskontinuum Apr 28, 2020
c4f3de7
updated readme.
diskontinuum Apr 28, 2020
38a988d
updated readme.
diskontinuum Apr 28, 2020
cfad784
updated readme.
diskontinuum Apr 28, 2020
be5a24a
updated readme.
diskontinuum Apr 28, 2020
d2eec56
updated readme.
diskontinuum Apr 28, 2020
3226f95
updated readme.
diskontinuum Apr 28, 2020
92e0c85
updated readme.
diskontinuum Apr 28, 2020
adc98df
updated readme.
diskontinuum Apr 28, 2020
0cd216c
updated readme.
diskontinuum Apr 28, 2020
bfb56cb
updated readme.
diskontinuum Apr 28, 2020
a97b644
updated readme.
diskontinuum Apr 28, 2020
382c8ce
updated readme.
diskontinuum Apr 28, 2020
3e7c795
Merged load_df(). Debugged. Passes all tests. ToDo: Further unificati…
diskontinuum May 4, 2020
e7e089f
Massive refactoring and clean-up. Merged engine-options. Tests pass.
diskontinuum May 5, 2020
0d19431
Added docstrings. Deleted old comments.
diskontinuum May 7, 2020
d96303f
Further code cleanse
diskontinuum May 7, 2020
d553b38
added crucial assertions to write.py. ToDo: build pytest analogon.
diskontinuum May 8, 2020
df2eba3
removed unnecessary assignments in main script
diskontinuum May 8, 2020
51cb5f6
added one missing assignment
diskontinuum May 8, 2020
309ab0d
Update README.rst
diskontinuum May 8, 2020
e166bf9
Update README.rst
diskontinuum May 8, 2020
c987cf5
Updated documentation in main script
diskontinuum May 8, 2020
6ed7b8c
Merge branch 'colab_updates' of github.com:diskontinuum/cytominer-dat…
diskontinuum May 8, 2020
5f9631d
Finalized docstring comments in main script, including general explan…
diskontinuum May 8, 2020
d0e95c6
change all type-converting functions (definitions and calls) to not r…
diskontinuum May 18, 2020
d0af406
merging the tableSchema functions
diskontinuum May 19, 2020
1834cd2
simplified and unified the generation of reference files dictionary
diskontinuum May 26, 2020
84ec340
removed redundant list unpacking
diskontinuum May 26, 2020
733d96d
More expclicit name retrieval
diskontinuum May 27, 2020
91ff904
Tests pass : corrected string capitalization error
diskontinuum May 27, 2020
33d3f82
Update cytominer_database/ingest_incl_parquet.py
diskontinuum May 27, 2020
03cd7b2
Update README.rst
diskontinuum May 27, 2020
befc8a9
Update cytominer_database/load.py
diskontinuum May 27, 2020
874a22b
comments removed from read_config()
diskontinuum May 27, 2020
ea2f5e8
Merge branch 'colab_updates' of github.com:diskontinuum/cytominer-dat…
diskontinuum May 27, 2020
09fec3b
various PR suggestions
diskontinuum May 27, 2020
5e8236a
renamed get_table_paths_from_directory_list() to directory_list_to_pa…
diskontinuum May 27, 2020
3241f2d
renamed get_unique_reference_dirs() to get_path_dictionary
diskontinuum May 27, 2020
79b5c35
renamed return arguments ref_dirs and ref_dir to sampled_path_diction…
diskontinuum May 27, 2020
9bc8d54
Update cytominer_database/config/config_cellpainting.ini
diskontinuum May 31, 2020
800b6d3
Changed default sampling fractions from 1 to 0.3 in the config files.
diskontinuum May 31, 2020
dd527e0
Update cytominer_database/config/config_htqc.ini
diskontinuum May 31, 2020
fb246b6
Update cytominer_database/ingest.py
diskontinuum May 31, 2020
1850236
Update cytominer_database/ingest.py
diskontinuum May 31, 2020
de908d6
deleted old code
diskontinuum May 31, 2020
2cf7a99
deleted space
diskontinuum May 31, 2020
1bacfac
removed docstring prose
diskontinuum May 31, 2020
e2fd6b7
isolated warning-block
diskontinuum May 31, 2020
810a52c
isolated warning-block
diskontinuum May 31, 2020
038b77b
warning box .rst
diskontinuum May 31, 2020
f33d42f
Added .html version of README.rst. Warning box is displayed correctly.
diskontinuum May 31, 2020
51c6462
strong emphasis instead of warning box
diskontinuum May 31, 2020
4d7fbc2
added .vscode json files to .gitignore
diskontinuum May 31, 2020
c4fd31d
Update .gitignore
diskontinuum Jun 24, 2020
7630aae
vscode cleanup
diskontinuum Jun 24, 2020
bd50178
Update README.rst
diskontinuum Jun 24, 2020
8276cae
Update cytominer_database/config/config_default.ini
diskontinuum Jun 24, 2020
a841db0
Update cytominer_database/ingest.py
diskontinuum Jun 24, 2020
14428e0
Update cytominer_database/write.py
diskontinuum Jun 24, 2020
4c3c021
Update cytominer_database/ingest.py
diskontinuum Jul 1, 2020
aa10636
rm unused imports
diskontinuum Jul 1, 2020
2d10a34
smaller import: os -> os.path
diskontinuum Jul 1, 2020
00ddfeb
decreased or removed os imports
diskontinuum Jul 1, 2020
3ce001e
narrowed remaining os imports where possible
diskontinuum Jul 1, 2020
8b4f35b
Update cytominer_database/ingest.py
diskontinuum Jul 1, 2020
32b30e2
removed comments
diskontinuum Jul 1, 2020
cfa0c27
Merge branch 'colab_updates' of github.com:diskontinuum/cytominer-dat…
diskontinuum Jul 1, 2020
97c09ca
Removed unnecessary comments
diskontinuum Jul 1, 2020
80728cb
Added raise ValueError to type_conver_dataframe(), for the case in wh…
diskontinuum Jul 1, 2020
4ba394c
Moved and renamed constant list of column names that should not be co…
diskontinuum Jul 1, 2020
b0ca497
Added UserWarning in case no values were converted
diskontinuum Jul 1, 2020
5b80f1d
missing bracket error
diskontinuum Jul 1, 2020
13c3369
Added further ValueErrors and UserWarnings (in case of conversion err…
diskontinuum Jul 1, 2020
076c939
ran BLACK formatting tool over all files
diskontinuum Jul 1, 2020
52a8812
removed python 2
diskontinuum Jul 1, 2020
367cf5c
replaced tricky list comprehension
diskontinuum Jul 1, 2020
c729cf4
replaced the f-string with .format()
diskontinuum Jul 1, 2020
4a07121
removed print()-statements
diskontinuum Jul 1, 2020
f273299
removed remaining print()-statements from chaotic tableSchema.py
diskontinuum Jul 1, 2020
339ac19
f-strings for the original ingest()
diskontinuum Jul 1, 2020
f709963
fixed prefixing of column labels in both ingest() and load() (no list…
diskontinuum Jul 1, 2020
4bc4ab7
Update README.rst
diskontinuum Jul 1, 2020
7e14bcc
Update README.rst
diskontinuum Jul 1, 2020
8138130
Update README.rst
diskontinuum Jul 1, 2020
0ebe5c3
Update README.rst
diskontinuum Jul 1, 2020
080b0cc
Update README.rst
diskontinuum Jul 1, 2020
9a1d4fd
Update README.rst
diskontinuum Jul 1, 2020
92e90bd
Update README.rst
diskontinuum Jul 1, 2020
8563d6c
bracket
diskontinuum Jul 2, 2020
512c48e
Merge branch 'colab_updates' of github.com:diskontinuum/cytominer-dat…
diskontinuum Jul 2, 2020
c55560b
added author 'Frances Hubis' and updated year
diskontinuum Jul 24, 2020
a2c9ae6
added author 'Frances Hubis' and git link to CONTRIBUTORS.md
diskontinuum Jul 24, 2020
d1610fc
renamed module ingest_incl_parquet.py to ingest_parquet.py . Changed …
diskontinuum Jul 24, 2020
6759ce7
added a 'parquet' flag option
diskontinuum Jul 24, 2020
022fb94
removed old-named ingest_incl_parquet.py
diskontinuum Jul 24, 2020
81f6e89
changed command structure. Now the flag '--parquet' will trigger the …
diskontinuum Jul 24, 2020
6126085
updated README.md to contain the click command instruction for the 'p…
diskontinuum Jul 24, 2020
d2660ac
removed deprecated filename
diskontinuum Jul 24, 2020
fe10d81
removed 'parquet' flag - command tests pass again
diskontinuum Jul 26, 2020
d16bcfb
included new command flag '--variable_engine/--no-variable_engine' i…
diskontinuum Jul 26, 2020
2529591
renamed new code version 'ingest_parquet.py'/'test_ingest_parquet.py'…
diskontinuum Jul 26, 2020
45a39e9
prettified code with black
diskontinuum Jul 26, 2020
c17d852
reversed accidental erasure
diskontinuum Jul 28, 2020
aae91e9
reversed ordering
diskontinuum Jul 28, 2020
094707a
-
diskontinuum Jul 28, 2020
edf2953
Merge branch 'master' into colab_updates
diskontinuum Jul 28, 2020
ed68760
Update CONTRIBUTORS.md
diskontinuum Jul 29, 2020
6cfd9d1
Update cytominer_database/commands/command_ingest_variable_engine.py
diskontinuum Jul 29, 2020
799997f
Added Hubis to copyright list and ran black again.
diskontinuum Jul 29, 2020
1b2976c
Addressed incomplete doctring in ingest.seed()
diskontinuum Jul 29, 2020
91c871c
Update cytominer_database/ingest.py
gwaybio Sep 17, 2020
1 change: 1 addition & 0 deletions .gitignore
@@ -21,3 +21,4 @@ tests/__pycache__/
.pytest_cache/
*.ipynb
.DS_Store
.vscode/
1 change: 0 additions & 1 deletion .travis.yml
@@ -12,7 +12,6 @@ install:
language:
- python
python:
- 2.7
- 3.5
script:
- pytest
2 changes: 2 additions & 0 deletions CONTRIBUTORS.md
@@ -4,3 +4,5 @@ cytominer-database initially developed by
- [Allen Goodman](https://github.com/0x00b1/)
- [Claire McQuin](https://github.com/mcquin/)
- [Shantanu Singh](https://github.com/shntnu/)
- [Gregory Way](https://github.com/gwaygenomics)
- [Frances Hubis](https://github.com/diskontinuum/)
85 changes: 81 additions & 4 deletions README.rst
@@ -1,3 +1,4 @@
==================
cytominer-database
==================

@@ -17,16 +18,92 @@ high-throughput imaging experiment. The measurements are stored across thousands
cytominer-database helps you organize these data into a single database backend, such as SQLite.

Why cytominer-database?
-----------------------
=======================
While tools like CellProfiler can store measurements directly in databases, it is usually infeasible to create a
centralized database in which to store these measurements. A more scalable approach is to create a set of CSVs per
“batch” of images, and then later merge these CSVs.

cytominer-database ingest reads these CSVs, checks for errors, then ingests them into a database backend, including
SQLite, MySQL, PostgresSQL, and several other backends supported by odo.
cytominer-database ingest reads these CSVs, checks for errors, then ingests
them into a database backend. The default backend is `SQLite`.

.. code-block:: sh

    cytominer-database ingest source_directory sqlite:///backend.sqlite -c ingest_config.ini

will ingest the CSV files nested under source_directory into a SQLite backend
will ingest the CSV files nested under source_directory into a `SQLite` backend
To select the `Parquet` backend add a `--parquet` flag:

.. code-block:: sh

    cytominer-database ingest source_directory sqlite:///backend.sqlite -c ingest_config.ini --parquet

The ingest_config.ini file then needs to be adjusted to contain the `Parquet` specifications.

How to use the configuration file
=================================
The configuration file ingest_config.ini must be located in the source_directory and can be modified to configure the ingestion.
There are three different sections.

The [filenames] section
-----------------------

.. code-block::

    [filenames]
    image = image.csv #or: Image.csv
    object = object.csv #or: Object.csv

cytominer-database is currently limited to the following measurement file kinds:
Cells.csv, Cytoplasm.csv, Nuclei.csv, Image.csv, Object.csv.
The [filenames] section in the configuration file saves the correct basename of existing measurement files.
This may be important in the case of inconsistent capitalization.

The [database_engine] section
-----------------------------

.. code-block::

    [ingestion_engine]
    engine = Parquet #or: SQLite

The [database_engine] section specifies the backend.
Possible key-value pairs are:
**engine** = *SQLite* or **engine** = *Parquet*.
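
As a rough illustration of how the backend choice could be read from the configuration file, the sketch below uses Python's standard `configparser`. It assumes the section and key names found in the config files shipped with this pull request (`[database_engine]`, `database`); the snippet above spells them `[ingestion_engine]` and `engine`, and which spelling the ingest code actually reads is not shown here. The helper name `read_engine` is hypothetical. One detail worth knowing: `configparser` keeps a trailing `#or: ...` note as part of the value unless `inline_comment_prefixes` is set.

```python
import configparser

EXAMPLE_CONFIG = """
[filenames]
image = Image.csv

[database_engine]
database = Parquet #or: SQLite
"""


def read_engine(config_text, default="SQLite"):
    """Return the backend named in [database_engine] (hypothetical helper)."""
    # inline_comment_prefixes makes configparser drop the trailing "#or: ..." note;
    # by default it would be kept as part of the value.
    config = configparser.ConfigParser(inline_comment_prefixes=("#",))
    config.read_string(config_text)
    # Assumption: section "database_engine" and key "database",
    # as in the .ini files shipped with this pull request.
    engine = config.get("database_engine", "database", fallback=default)
    if engine.lower() not in ("sqlite", "parquet"):
        raise ValueError("unsupported engine: {}".format(engine))
    return engine


print(read_engine(EXAMPLE_CONFIG))
```

Falling back to SQLite when the section is absent mirrors the statement that SQLite is the default backend.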

The [schema] section
--------------------

.. code-block::

    [schema]
    reference_option = sample #or: path/to/reference/folder relative to source_directory
    ref_fraction = 1 #or: any decimal value in [0, 1]
    type_conversion = int2float #or: all2string

The [schema] section specifies how to manage incompatibilities in the table schema of the files.
A Parquet file is fixed to the schema with which it was first opened, i.e. the schema of the first file that is written (the reference file).
To append the data of all .csv files of that file-kind, it is important to ensure that the reference file satisfies certain compatibility requirements.
For example, make sure the reference file is not missing any columns and all existing files can be automatically converted to the reference schema.
Note: This section is used only if the files are ingested to Parquet format. It was developed to handle the special cases in which tables cannot be concatenated automatically.

There are two options for the key **reference_option**:

The first option is to create a designated folder containing one .csv reference file for every kind of file ("Cytoplasm.csv", "Nuclei.csv", ...) and save the folder path in the config file as **reference_option** = *path/to/reference/folder*, where the path is relative to the source_directory from the ingest command.
These reference files' schema will determine the schema of the Parquet file into which all .csv files of its kind will be ingested.


**This option relies on manual selection, hence the chosen reference files must be checked explicitly: Make sure the .csv files are complete in the number of columns and contain no NaN values.**

Alternatively, the reference files can be found automatically from a sampled subset of all existing files.
This is the case if **reference_option** = *sample* is set.
A subset of all files is sampled uniformly at random and the first table with the maximum number of columns among all sampled .csv files is chosen as the reference table.
In this case, an additional key **ref_fraction** can be set, which specifies the fraction of files sampled among all files.
The default value is **ref_fraction** = *1* , for which all tables are compared in width.
This key is only used if "reference_option=sample".
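
The sampling rule described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the code from this pull request: the function names `read_header` and `choose_reference`, the `seed` parameter, and the choice to always sample at least one file are all hypothetical.

```python
import csv
import math
import random


def read_header(path):
    """Return the column names from the first row of a .csv file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle), [])


def choose_reference(paths, ref_fraction=1.0, seed=None):
    """Pick the widest table among a uniformly sampled subset of paths."""
    paths = list(paths)
    if not paths:
        raise ValueError("no .csv files to sample from")
    if not 0 <= ref_fraction <= 1:
        raise ValueError("ref_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    # Assumption: always sample at least one file, even for tiny fractions.
    sample_size = max(1, math.ceil(ref_fraction * len(paths)))
    sampled = rng.sample(paths, sample_size)
    # max() returns the first element with the largest key, i.e. the first
    # sampled table that has the maximum number of columns.
    return max(sampled, key=lambda path: len(read_header(path)))
```

With `ref_fraction=1` every file is inspected, so the result is always a table of maximal width; smaller fractions trade that guarantee for fewer file reads.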

Lastly, the key **type_conversion** determines how the schema types are handled in the case of disagreement.
The default value is *int2float*, for which all integer columns are converted to floats.
This has proven helpful for trivial (all-zero) columns, which may be typed as "int" and therefore cannot be written into the same table as files whose corresponding columns hold non-zero float values.
Automatic type conversion can be avoided by converting all values to string-type.
This can be done by setting **type_conversion** = *all2string*.
However, the loss of type information might be a disadvantage in downstream tasks.
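
To make the two **type_conversion** settings concrete, here is a minimal standard-library sketch that applies them to a single row of values. It is only an illustration: the function name `convert_row` is hypothetical, and the actual implementation in this pull request operates on whole tables rather than on Python dictionaries.

```python
def convert_row(row, type_conversion="int2float"):
    """Apply a type_conversion option to one row (hypothetical helper)."""
    if type_conversion not in ("int2float", "all2string"):
        raise ValueError("unknown type_conversion: {}".format(type_conversion))
    converted = {}
    for name, value in row.items():
        if type_conversion == "all2string":
            # Every value becomes a string, so no schema can disagree on types.
            converted[name] = str(value)
        elif isinstance(value, int) and not isinstance(value, bool):
            # int2float: integers are widened to floats, other types are kept.
            converted[name] = float(value)
        else:
            converted[name] = value
    return converted


row = {"Cells_AreaShape_Area": 0, "Cells_Intensity_Mean": 1.5, "Metadata_Plate": "P1"}
print(convert_row(row))
print(convert_row(row, "all2string"))
```

Under `int2float` the all-zero integer column ends up with the same type as a float column from another file, which is what allows both to be appended to one table.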
23 changes: 8 additions & 15 deletions cytominer_database/commands/command_ingest.py
@@ -15,25 +15,18 @@
SOURCE is a directory containing subdirectories that contain CSV files.

TARGET is a connection string for the database.
"""
)
@click.argument(
"source",
type=click.Path(exists=True)
)
@click.argument(
"target",
type=click.Path(writable=True)
""",
)
@click.argument("source", type=click.Path(exists=True))
@click.argument("target", type=click.Path(writable=True))
@click.option(
"-c",
"--config-file",
default=pkg_resources.resource_filename(
"cytominer_database",
os.path.join("config", "config_cellpainting.ini")
"cytominer_database", os.path.join("config", "config_cellpainting.ini")
),
help="Configuration file.",
type=click.Path(exists=True)
type=click.Path(exists=True),
)
@click.option(
"--munge/--no-munge",
@@ -43,7 +36,7 @@
have been merged into a single CSV file; \
the CSV will be split into one CSV per compartment \
(Default: true).
"""
""",
)
@click.option(
"--skip-image-prefix/--no-skip-image-prefix",
@@ -53,10 +46,10 @@
excluded from the names of columns from per image \
table e.g. use `Metadata_Plate` instead of \
`Image_Metadata_Plate` (Default: true).
"""
""",
)
def command(source, target, config_file, munge, skip_image_prefix):
if munge:
cytominer_database.munge.munge(config_file=config_file, source=source)
cytominer_database.munge.munge(config_path=config_file, source=source)

cytominer_database.ingest.seed(source, target, config_file, skip_image_prefix)
80 changes: 80 additions & 0 deletions cytominer_database/commands/command_ingest_variable_engine.py
@@ -0,0 +1,80 @@
import click
import configparser
import os
import pkg_resources

import cytominer_database.ingest
import cytominer_database.ingest_variable_engine
import cytominer_database.munge

"""
Runs new code (ingest_variable_engine.py instead of ingest.py).
Two backend engines are available: Sqlite and Parquet.
In effect, these options are read from the config file.
In terms of the command (and testing the command),
the config file name needs to be specified
(each backend choice has its own config file).
"""


@click.command(
    "ingest_new",
    help="""\
Import CSV files into a database.

SOURCE is a directory containing subdirectories that contain CSV files.

TARGET is a connection string for the database.
""",
)
@click.argument("source", type=click.Path(exists=True))
@click.argument("target", type=click.Path(writable=True))
@click.option(
    "-c",
    "--config-file",
    default=pkg_resources.resource_filename(
        "cytominer_database", os.path.join("config", "config_cellpainting.ini")
    ),
    help="Configuration file.",
    type=click.Path(exists=True),
)
@click.option(
    "--munge/--no-munge",
    default=True,
    help="""\
True if the CSV files for individual compartments \
have been merged into a single CSV file; \
the CSV will be split into one CSV per compartment \
(Default: true).
""",
)
@click.option(
    "--skip-image-prefix/--no-skip-image-prefix",
    default=True,
    help="""\
True if the prefix of image table name should be \
excluded from the names of columns from per image \
table e.g. use `Metadata_Plate` instead of \
`Image_Metadata_Plate` (Default: true).
""",
)
@click.option(
    "--variable-engine/--no-variable-engine",
    default=False,
    help="""\
True if multiple backend engines (SQLite or Parquet) \
can be selected. The config file then determines
which backend engine is used (path of which is passed as a flag). \
Default: False (--no-variable-engine)
""",
)
def command(source, target, config_file, munge, skip_image_prefix, variable_engine):
    if munge:
        cytominer_database.munge.munge(config_path=config_file, source=source)

    if variable_engine:
        cytominer_database.ingest_variable_engine.seed(
            source, target, config_file, skip_image_prefix
        )
    else:
        cytominer_database.ingest.seed(source, target, config_file, skip_image_prefix)
7 changes: 7 additions & 0 deletions cytominer_database/config/config_cellpainting.ini
@@ -3,3 +3,10 @@ image = Image.csv
object = object.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float
8 changes: 8 additions & 0 deletions cytominer_database/config/config_default.ini
@@ -1,3 +1,11 @@
[filenames]
image = Image.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float
8 changes: 8 additions & 0 deletions cytominer_database/config/config_htqc.ini
@@ -2,3 +2,11 @@
image = image.csv
object = object.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float