Merge pull request #122 from diskontinuum/colab_updates
Parquet_integration
gwaybio authored Sep 17, 2020
2 parents 3926b55 + 91c871c commit 9cf4340
Showing 34 changed files with 1,712 additions and 175 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -21,3 +21,4 @@ tests/__pycache__/
.pytest_cache/
*.ipynb
.DS_Store
.vscode/
1 change: 0 additions & 1 deletion .travis.yml
@@ -12,7 +12,6 @@ install:
language:
- python
python:
- 2.7
- 3.5
script:
- pytest
2 changes: 2 additions & 0 deletions CONTRIBUTORS.md
@@ -4,3 +4,5 @@ cytominer-database initially developed by
- [Allen Goodman](https://github.com/0x00b1/)
- [Claire McQuin](https://github.com/mcquin/)
- [Shantanu Singh](https://github.com/shntnu/)
- [Gregory Way](https://github.com/gwaygenomics)
- [Frances Hubis](https://github.com/diskontinuum/)
85 changes: 81 additions & 4 deletions README.rst
@@ -1,3 +1,4 @@
==================
cytominer-database
==================

@@ -17,16 +18,92 @@ high-throughput imaging experiment. The measurements are stored across thousands
cytominer-database helps you organize these data into a single database backend, such as SQLite.

Why cytominer-database?
-----------------------
=======================
While tools like CellProfiler can store measurements directly in databases, it is usually infeasible to create a
centralized database in which to store these measurements. A more scalable approach is to create a set of CSVs per
“batch” of images, and then later merge these CSVs.

cytominer-database ingest reads these CSVs, checks for errors, then ingests them into a database backend, including
SQLite, MySQL, PostgresSQL, and several other backends supported by odo.
cytominer-database ingest reads these CSVs, checks for errors, then ingests
them into a database backend. The default backend is `SQLite`.

.. code-block:: sh

    cytominer-database ingest source_directory sqlite:///backend.sqlite -c ingest_config.ini

will ingest the CSV files nested under source_directory into a SQLite backend
will ingest the CSV files nested under source_directory into a `SQLite` backend
To select the `Parquet` backend add a `--parquet` flag:

.. code-block:: sh

    cytominer-database ingest source_directory sqlite:///backend.sqlite -c ingest_config.ini --parquet

The ingest_config.ini file then needs to be adjusted to contain the `Parquet` specifications.

How to use the configuration file
=================================
The configuration file ingest_config.ini must be located in the source_directory and can be modified to control how the files are ingested.
It has three sections.

The [filenames] section
-----------------------

.. code-block::

    [filenames]
    image = image.csv #or: Image.csv
    object = object.csv #or: Object.csv

cytominer-database is currently limited to the following kinds of measurement files:
Cells.csv, Cytoplasm.csv, Nuclei.csv, Image.csv, Object.csv.
The [filenames] section of the configuration file records the exact basename of each existing measurement file.
This matters when capitalization is inconsistent (for example, image.csv versus Image.csv).

The [database_engine] section
-----------------------------

.. code-block::

    [database_engine]
    engine = Parquet #or: SQLite

The [database_engine] section specifies the backend.
Possible key-value pairs are:
**engine** = *SQLite* or **engine** = *Parquet*.
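
As a rough illustration of how these sections could be read, the sketch below parses a configuration with Python's standard ``configparser``. It is not the package's own loader: the helper name ``read_engine`` is hypothetical, and because the README example writes the key as ``engine`` while the shipped config files use ``database``, the sketch accepts either spelling. ``inline_comment_prefixes`` is set so that trailing ``#or: ...`` comments are not read as part of the value.

```python
import configparser

# Example configuration text; the values mirror the README examples.
CONFIG_TEXT = """
[filenames]
image = Image.csv  #or: image.csv
object = object.csv  #or: Object.csv

[database_engine]
database = Parquet  #or: SQLite
"""


def read_engine(config_text):
    """Return (filenames dict, backend engine name) parsed from config text.

    Hypothetical helper for illustration; not part of cytominer-database.
    """
    # By default configparser keeps trailing "#..." text as part of the value,
    # so inline comment stripping has to be enabled explicitly.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(config_text)

    filenames = dict(parser["filenames"])
    section = parser["database_engine"]
    # Assumption: the key may be spelled `engine` or `database`;
    # fall back to SQLite when neither is present.
    engine = section.get(
        "engine", fallback=section.get("database", fallback="SQLite")
    )
    return filenames, engine


filenames, engine = read_engine(CONFIG_TEXT)
print(filenames["image"], engine)
```

Reading the basenames from the ``[filenames]`` section rather than hard-coding them is what lets the same code handle both ``image.csv`` and ``Image.csv``.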

The [schema] section
--------------------

.. code-block::

    [schema]
    reference_option = sample #or: path/to/reference/folder relative to source_directory
    ref_fraction = 1 #or: any decimal value in [0, 1]
    type_conversion = int2float #or: all2string

The [schema] section specifies how to manage incompatibilities in the table schema of the files.
A Parquet file is fixed to the schema with which it was first opened, that is, the schema of the first file written (the reference file).
To append the data of all .csv files of that file kind, the reference file must satisfy certain compatibility requirements.
For example, make sure the reference file is not missing any columns and that all existing files can be automatically converted to the reference schema.
Note: This section is used only when the files are ingested to Parquet format. It was developed to handle the special cases in which tables cannot be concatenated automatically.

There are two options for the key **reference_option**:

The first option is to create a designated folder containing one .csv reference file for every kind of file ("Cytoplasm.csv", "Nuclei.csv", ...) and save the folder path in the config file as **reference_option** = *path/to/reference/folder*, where the path is relative to the source_directory from the ingest command.
Each reference file's schema determines the schema of the Parquet file into which all .csv files of its kind are ingested.


**This option relies on manual selection, hence the chosen reference files must be checked explicitly: Make sure the .csv files are complete in the number of columns and contain no NaN values.**

Alternatively, the reference files can be found automatically from a sampled subset of all existing files.
This happens when **reference_option** = *sample* is set.
A subset of all files is sampled uniformly at random, and the first table with the maximum number of columns among all sampled .csv files is chosen as the reference table.
In this case, an additional key **ref_fraction** can be set, which specifies the fraction of files sampled among all files.
The default value is **ref_fraction** = *1*, for which the widths of all tables are compared.
This key is used only if **reference_option** = *sample*.
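
The sampling rule described above can be sketched in plain Python as follows. This is a minimal illustration and not the package's implementation: the function name ``choose_reference``, the ``seed`` parameter, the file paths, and the choice to always sample at least one file are all assumptions made for the example.

```python
import random


def choose_reference(column_counts, ref_fraction=1.0, seed=None):
    """Pick a reference file from a uniformly sampled subset of files.

    column_counts maps a file path to the number of columns in that file.
    Hypothetical helper for illustration only.
    """
    if not 0 <= ref_fraction <= 1:
        raise ValueError("ref_fraction must lie in [0, 1]")
    paths = sorted(column_counts)
    if not paths:
        raise ValueError("no files to sample from")

    # Assumption: at least one file is always sampled, even for tiny fractions.
    sample_size = max(1, round(ref_fraction * len(paths)))
    sampled = random.Random(seed).sample(paths, sample_size)

    # max() returns the first item with the largest key, which matches
    # "the first table with the maximum number of columns".
    return max(sampled, key=lambda path: column_counts[path])


# Hypothetical file paths and column counts.
counts = {
    "plate_a/Cells.csv": 12,
    "plate_b/Cells.csv": 15,
    "plate_c/Cells.csv": 9,
}
print(choose_reference(counts, ref_fraction=1.0, seed=0))
```

With ``ref_fraction = 1`` every file is inspected, so the widest table is always found; smaller fractions trade that guarantee for speed on large collections.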

Lastly, the key **type_conversion** determines how schema types are handled when files disagree.
The default value is *int2float*, which converts all integer columns to floats.
This has proven helpful for trivial columns (columns containing only zeros), which may be typed as "int" and then cannot be written into the same table as columns from other files that hold non-zero float values.
Alternatively, automatic type conversion can be avoided by converting all values to strings, which is done by setting **type_conversion** = *all2string*.
However, the loss of type information may be a disadvantage in downstream tasks.
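
To make the two ``type_conversion`` settings concrete, the sketch below applies them to a schema represented as a plain dictionary from column name to type name. It only illustrates the idea and does not use the Parquet writer: the function ``convert_schema``, the type names, and the column names are assumptions made for the example.

```python
def convert_schema(schema, type_conversion="int2float"):
    """Return a converted copy of a schema (column name -> type name).

    Hypothetical helper for illustration only.
    """
    if type_conversion == "int2float":
        # Promote integer columns so they match float columns in other files.
        return {
            column: "float" if dtype == "int" else dtype
            for column, dtype in schema.items()
        }
    if type_conversion == "all2string":
        # Drop all type information; every column becomes a string column.
        return {column: "string" for column in schema}
    raise ValueError(f"unknown type_conversion: {type_conversion!r}")


# A file whose area column holds only zeros may be inferred as "int",
# while the same column in another file is inferred as "float".
trivial = {"AreaShape_Area": "int", "Metadata_Well": "string"}
nontrivial = {"AreaShape_Area": "float", "Metadata_Well": "string"}

print(convert_schema(trivial) == nontrivial)
print(convert_schema(nontrivial, "all2string"))
```

After *int2float* the two schemas agree, so both files could be appended to the same table; *all2string* also makes them agree but discards the numeric types.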
23 changes: 8 additions & 15 deletions cytominer_database/commands/command_ingest.py
@@ -15,25 +15,18 @@
SOURCE is a directory containing subdirectories that contain CSV files.
TARGET is a connection string for the database.
"""
)
@click.argument(
"source",
type=click.Path(exists=True)
)
@click.argument(
"target",
type=click.Path(writable=True)
""",
)
@click.argument("source", type=click.Path(exists=True))
@click.argument("target", type=click.Path(writable=True))
@click.option(
"-c",
"--config-file",
default=pkg_resources.resource_filename(
"cytominer_database",
os.path.join("config", "config_cellpainting.ini")
"cytominer_database", os.path.join("config", "config_cellpainting.ini")
),
help="Configuration file.",
type=click.Path(exists=True)
type=click.Path(exists=True),
)
@click.option(
"--munge/--no-munge",
@@ -43,7 +36,7 @@
have been merged into a single CSV file; \
the CSV will be split into one CSV per compartment \
(Default: true).
"""
""",
)
@click.option(
"--skip-image-prefix/--no-skip-image-prefix",
@@ -53,10 +46,10 @@
excluded from the names of columns from per image \
table e.g. use `Metadata_Plate` instead of \
`Image_Metadata_Plate` (Default: true).
"""
""",
)
def command(source, target, config_file, munge, skip_image_prefix):
if munge:
cytominer_database.munge.munge(config_file=config_file, source=source)
cytominer_database.munge.munge(config_path=config_file, source=source)

cytominer_database.ingest.seed(source, target, config_file, skip_image_prefix)
80 changes: 80 additions & 0 deletions cytominer_database/commands/command_ingest_variable_engine.py
@@ -0,0 +1,80 @@
import click
import configparser
import os
import pkg_resources

import cytominer_database.ingest
import cytominer_database.ingest_variable_engine
import cytominer_database.munge

"""
Runs new code (ingest_variable_engine.py instead of ingest.py).
Two backend engines are available: Sqlite and Parquet.
In effect, these options are read from the config file.
In terms of the command (and testing the command),
the config file name needs to be specified
(each backend choice has its own config file).
"""


@click.command(
"ingest_new",
help="""\
Import CSV files into a database.
SOURCE is a directory containing subdirectories that contain CSV files.
TARGET is a connection string for the database.
""",
)
@click.argument("source", type=click.Path(exists=True))
@click.argument("target", type=click.Path(writable=True))
@click.option(
"-c",
"--config-file",
default=pkg_resources.resource_filename(
"cytominer_database", os.path.join("config", "config_cellpainting.ini")
),
help="Configuration file.",
type=click.Path(exists=True),
)
@click.option(
"--munge/--no-munge",
default=True,
help="""\
True if the CSV files for individual compartments \
have been merged into a single CSV file; \
the CSV will be split into one CSV per compartment \
(Default: true).
""",
)
@click.option(
"--skip-image-prefix/--no-skip-image-prefix",
default=True,
help="""\
True if the prefix of image table name should be \
excluded from the names of columns from per image \
table e.g. use `Metadata_Plate` instead of \
`Image_Metadata_Plate` (Default: true).
""",
)
@click.option(
"--variable-engine/--no-variable-engine",
default=False,
help="""\
True if multiple backend engines (SQLite or Parquet)\
can be selected. The config file then determines
which backend engine is used (path of which is passed as a flag).\
Default: False (--no-variable-engine)
""",
)
def command(source, target, config_file, munge, skip_image_prefix, variable_engine):
    if munge:
        cytominer_database.munge.munge(config_path=config_file, source=source)

    if variable_engine:
        cytominer_database.ingest_variable_engine.seed(
            source, target, config_file, skip_image_prefix
        )
    else:
        cytominer_database.ingest.seed(source, target, config_file, skip_image_prefix)
7 changes: 7 additions & 0 deletions cytominer_database/config/config_cellpainting.ini
@@ -3,3 +3,10 @@ image = Image.csv
object = object.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float
8 changes: 8 additions & 0 deletions cytominer_database/config/config_default.ini
@@ -1,3 +1,11 @@
[filenames]
image = Image.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float
8 changes: 8 additions & 0 deletions cytominer_database/config/config_htqc.ini
@@ -2,3 +2,11 @@
image = image.csv
object = object.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float