Merge pull request #122 from diskontinuum/colab_updates

Parquet_integration
cytomining · Sep 17, 2020 · 9cf4340
2 parents 3926b55 + 91c871c
commit 9cf4340

Showing 34 changed files with 1,712 additions and 175 deletions.
diff --git a/.gitignore b/.gitignore
@@ -21,3 +21,4 @@ tests/__pycache__/
 .pytest_cache/
 *.ipynb
 .DS_Store
+.vscode/
diff --git a/.travis.yml b/.travis.yml
@@ -12,7 +12,6 @@ install:
 language:
   - python
 python:
-  - 2.7
   - 3.5
 script:
   - pytest

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
@@ -4,3 +4,5 @@ cytominer-database initially developed by
 - [Allen Goodman](https://github.com/0x00b1/)
 - [Claire McQuin](https://github.com/mcquin/)
 - [Shantanu Singh](https://github.com/shntnu/)
+- [Gregory Way](https://github.com/gwaygenomics)
+- [Frances Hubis](https://github.com/diskontinuum/)
diff --git a/README.rst b/README.rst
@@ -1,3 +1,4 @@
+==================
 cytominer-database
 ==================
 
@@ -17,16 +18,92 @@ high-throughput imaging experiment. The measurements are stored across thousands
 cytominer-database helps you organize these data into a single database backend, such as SQLite.
 
 Why cytominer-database?
------------------------
+=======================
 While tools like CellProfiler can store measurements directly in databases, it is usually infeasible to create a
 centralized database in which to store these measurements. A more scalable approach is to create a set of CSVs per
 “batch” of images, and then later merge these CSVs.
 
-cytominer-database ingest reads these CSVs, checks for errors, then ingests them into a database backend, including
-SQLite, MySQL, PostgresSQL, and several other backends supported by odo.
+cytominer-database ingest reads these CSVs, checks for errors, then ingests
+them into a database backend. The default backend is `SQLite`.
 
 .. code-block:: sh
 
 	cytominer-database ingest source_directory sqlite:///backend.sqlite -c ingest_config.ini
 
-will ingest the CSV files nested under source_directory into a SQLite backend
+will ingest the CSV files nested under source_directory into a `SQLite` backend.
+To select the `Parquet` backend, add a `--parquet` flag:
+
+.. code-block:: sh
+
+	cytominer-database ingest source_directory sqlite:///backend.sqlite -c ingest_config.ini --parquet
+
+The ingest_config.ini file then needs to be adjusted to contain the `Parquet` specifications.
+
+How to use the configuration file
+=================================
+The configuration file ingest_config.ini must be located in the source_directory and can be edited to control how the files are ingested.
+It has three sections.
+
+The [filenames] section
+-----------------------
+
+.. code-block::
+
+  [filenames]
+  image   = image.csv      #or: Image.csv
+  object  = object.csv     #or: Object.csv
+
+cytominer-database is currently limited to the following measurement file kinds:
+Cells.csv, Cytoplasm.csv, Nuclei.csv, Image.csv, Object.csv.
+The [filenames] section in the configuration file records the exact basenames of the existing measurement files.
+This matters when capitalization is inconsistent, for example image.csv versus Image.csv.
+
+The [database_engine] section
+-----------------------------
+
+.. code-block::
+
+  [database_engine]
+  database = Parquet      #or: SQLite
+
+The [database_engine] section specifies the backend.
+Possible key-value pairs are:
+**database** = *SQLite* or **database** = *Parquet*.
+
+The [schema] section
+--------------------
+
+.. code-block::
+
+ [schema]
+ reference_option = sample         #or: path/to/reference/folder relative to source_directory
+ ref_fraction     = 1              #or: any decimal value in [0, 1]
+ type_conversion  = int2float      #or: all2string
+
+The [schema] section specifies how to handle incompatibilities between the table schemas of the files.
+A Parquet file is fixed to the schema with which it was first opened, i.e. the schema of the first file that is written (the reference file).
+To append the data of all .csv files of that file kind, the reference file must satisfy certain compatibility requirements.
+For example, make sure the reference file is not missing any columns and that all existing files can be automatically converted to the reference schema.
+Note: This section is used only if the files are ingested to Parquet format. It was developed to handle the special cases in which tables cannot be concatenated automatically.
+
+There are two options for the key **reference_option**:
+
+The first option is to create a designated folder containing one .csv reference file for every kind of file ("Cytoplasm.csv", "Nuclei.csv", ...) and save the folder path in the config file as **reference_option** = *path/to/reference/folder*, where the path is relative to the source_directory from the ingest command.
+The schema of each reference file determines the schema of the Parquet file into which all .csv files of its kind are ingested.
+
+
+**This option relies on manual selection, hence the chosen reference files must be checked explicitly: Make sure the .csv files are complete in the number of columns and contain no NaN values.**
+
+Alternatively, the reference files can be found automatically from a sampled subset of all existing files.
+This is the case if **reference_option** = *sample* is set.
+A subset of all files is sampled uniformly at random and the first table with the maximum number of columns among all sampled .csv files is chosen as the reference table.
+In this case, an additional key **ref_fraction** can be set, which specifies the fraction of files sampled among all files.
+The default value is **ref_fraction** = *1*, for which all tables are compared in width.
+This key is only used if **reference_option** = *sample*.
+
+Lastly, the key **type_conversion** determines how the schema types are handled in the case of disagreement.
+The default value is *int2float*, for which all integer columns are converted to floats.
+This has proven helpful for trivial columns (columns containing only zeros), which may be typed as "int" and then cannot be written into the same table as columns from other files that hold non-zero float values.
+Automatic type conversion can be avoided by converting all values to string-type.
+This can be done by setting **type_conversion** = *all2string*.
+However, the loss of type information might be a disadvantage in downstream tasks.
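The reference selection for **reference_option** = *sample* and the two **type_conversion** modes described in the README text above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in `ingest_variable_engine.py`: the names `read_header`, `choose_reference`, and `convert_value` are hypothetical, table width is taken from the CSV header row, the conversion is shown per value rather than per column, and the rule of always sampling at least one file is an assumption.

```python
import csv
import math
import random


def read_header(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


def choose_reference(paths, ref_fraction=1.0, seed=None):
    """Sample a fraction of the files and return the first one with the most columns."""
    if not paths:
        raise ValueError("no files to sample from")
    if not 0 <= ref_fraction <= 1:
        raise ValueError("ref_fraction must lie in [0, 1]")
    # Assumption: always inspect at least one file, even for tiny fractions.
    sample_size = max(1, math.ceil(ref_fraction * len(paths)))
    sampled = random.Random(seed).sample(list(paths), sample_size)
    # max() returns the first maximal element in the sampled order.
    return max(sampled, key=lambda path: len(read_header(path)))


def convert_value(value, type_conversion="int2float"):
    """Apply a type_conversion mode to a single Python value."""
    if type_conversion == "all2string":
        return str(value)
    if type_conversion == "int2float":
        # bool is a subclass of int, so leave booleans untouched.
        if isinstance(value, int) and not isinstance(value, bool):
            return float(value)
        return value
    raise ValueError(f"unknown type_conversion: {type_conversion!r}")
```

With `ref_fraction=1` every file is inspected, so the widest table is always found; smaller fractions trade that guarantee for fewer files opened.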
diff --git a/cytominer_database/commands/command_ingest.py b/cytominer_database/commands/command_ingest.py
@@ -15,25 +15,18 @@
 SOURCE is a directory containing subdirectories that contain CSV files.
 
 TARGET is a connection string for the database.
-"""
-)
-@click.argument(
-    "source",
-    type=click.Path(exists=True)
-)
-@click.argument(
-    "target",
-    type=click.Path(writable=True)
+""",
 )
+@click.argument("source", type=click.Path(exists=True))
+@click.argument("target", type=click.Path(writable=True))
 @click.option(
     "-c",
     "--config-file",
     default=pkg_resources.resource_filename(
-        "cytominer_database",
-        os.path.join("config", "config_cellpainting.ini")
+        "cytominer_database", os.path.join("config", "config_cellpainting.ini")
     ),
     help="Configuration file.",
-    type=click.Path(exists=True)
+    type=click.Path(exists=True),
 )
 @click.option(
     "--munge/--no-munge",
@@ -43,7 +36,7 @@
 have been merged into a single CSV file; \
 the CSV will be split into one CSV per compartment \
 (Default: true).
-"""
+""",
 )
 @click.option(
     "--skip-image-prefix/--no-skip-image-prefix",
@@ -53,10 +46,10 @@
 excluded from the names of columns from per image \
 table e.g. use  `Metadata_Plate` instead of \
 `Image_Metadata_Plate` (Default: true).
-"""
+""",
 )
 def command(source, target, config_file, munge, skip_image_prefix):
     if munge:
-        cytominer_database.munge.munge(config_file=config_file, source=source)
+        cytominer_database.munge.munge(config_path=config_file, source=source)
 
     cytominer_database.ingest.seed(source, target, config_file, skip_image_prefix)
diff --git a/cytominer_database/commands/command_ingest_variable_engine.py b/cytominer_database/commands/command_ingest_variable_engine.py
@@ -0,0 +1,80 @@
+import click
+import configparser
+import os
+import pkg_resources
+
+import cytominer_database.ingest
+import cytominer_database.ingest_variable_engine
+import cytominer_database.munge
+
+"""
+Runs the new code (ingest_variable_engine.py instead of ingest.py).
+Two backend engines are available: SQLite and Parquet.
+The choice of engine is read from the config file.
+When running (or testing) the command,
+the config file name therefore needs to be specified
+(each backend choice has its own config file).
+"""
+
+
+@click.command(
+    "ingest_new",
+    help="""\
+Import CSV files into a database.
+
+SOURCE is a directory containing subdirectories that contain CSV files.
+
+TARGET is a connection string for the database.
+""",
+)
+@click.argument("source", type=click.Path(exists=True))
+@click.argument("target", type=click.Path(writable=True))
+@click.option(
+    "-c",
+    "--config-file",
+    default=pkg_resources.resource_filename(
+        "cytominer_database", os.path.join("config", "config_cellpainting.ini")
+    ),
+    help="Configuration file.",
+    type=click.Path(exists=True),
+)
+@click.option(
+    "--munge/--no-munge",
+    default=True,
+    help="""\
+True if the CSV files for individual compartments \
+have been merged into a single CSV file; \
+the CSV will be split into one CSV per compartment \
+(Default: true).
+""",
+)
+@click.option(
+    "--skip-image-prefix/--no-skip-image-prefix",
+    default=True,
+    help="""\
+True if the prefix of image table name should be \
+excluded from the names of columns from per image \
+table e.g. use  `Metadata_Plate` instead of \
+`Image_Metadata_Plate` (Default: true).
+""",
+)
+@click.option(
+    "--variable-engine/--no-variable-engine",
+    default=False,
+    help="""\
+True if multiple backend engines (SQLite or Parquet) \
+can be selected. The config file, whose path is passed \
+with --config-file, then determines which backend engine \
+is used (Default: false).
+""",
+)
+def command(source, target, config_file, munge, skip_image_prefix, variable_engine):
+    if munge:
+        cytominer_database.munge.munge(config_path=config_file, source=source)
+
+    if variable_engine:
+        cytominer_database.ingest_variable_engine.seed(
+            source, target, config_file, skip_image_prefix
+        )
+    else:
+        cytominer_database.ingest.seed(source, target, config_file, skip_image_prefix)
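The `help` strings in this file are triple-quoted literals whose lines end in a backslash. Inside such a literal, a trailing backslash removes the newline and joins the next line without inserting any whitespace, so each continued line needs an explicit space before the backslash. A short, self-contained demonstration (the variable names `fused` and `spaced` are only for illustration):

```python
# Without a space before the backslash, the two source lines fuse together.
fused = """\
(SQLite or Parquet)\
can be selected.
"""

# A space before the backslash keeps the words apart.
spaced = """\
(SQLite or Parquet) \
can be selected.
"""

print(repr(fused))   # '(SQLite or Parquet)can be selected.\n'
print(repr(spaced))  # '(SQLite or Parquet) can be selected.\n'
```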
diff --git a/cytominer_database/config/config_cellpainting.ini b/cytominer_database/config/config_cellpainting.ini
@@ -3,3 +3,10 @@ image = Image.csv
 object = object.csv
 experiment = Experiment.csv
 
+[database_engine]
+database = SQLite   
+
+[schema]
+reference_option = sample
+ref_fraction = 1
+type_conversion = int2float 
diff --git a/cytominer_database/config/config_default.ini b/cytominer_database/config/config_default.ini
@@ -1,3 +1,11 @@
 [filenames]
 image = Image.csv
 experiment = Experiment.csv
+
+[database_engine]
+database = SQLite   
+
+[schema]
+reference_option = sample
+ref_fraction = 1
+type_conversion = int2float 
diff --git a/cytominer_database/config/config_htqc.ini b/cytominer_database/config/config_htqc.ini
@@ -2,3 +2,11 @@
 image = image.csv
 object = object.csv
 experiment = Experiment.csv
+
+[database_engine]
+database = SQLite   
+
+[schema]
+reference_option = sample
+ref_fraction = 1
+type_conversion = int2float
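As a rough sketch of how the new `[database_engine]` and `[schema]` sections could be read, the example below parses a config with the standard library's `configparser`. The function name `load_ingest_options`, the fallback defaults, and the validation rules are assumptions for illustration; the section and key names are the ones that appear in the three config files above.

```python
import configparser

CONFIG_TEXT = """\
[filenames]
image = image.csv
object = object.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float
"""


def load_ingest_options(text):
    """Parse the engine and schema options, falling back to the defaults shown above."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = {
        "engine": parser.get("database_engine", "database", fallback="SQLite"),
        "reference_option": parser.get("schema", "reference_option", fallback="sample"),
        "ref_fraction": parser.getfloat("schema", "ref_fraction", fallback=1.0),
        "type_conversion": parser.get("schema", "type_conversion", fallback="int2float"),
    }
    # Hypothetical validation: reject values the README does not list.
    if options["engine"] not in {"SQLite", "Parquet"}:
        raise ValueError(f"unsupported engine: {options['engine']!r}")
    if not 0 <= options["ref_fraction"] <= 1:
        raise ValueError("ref_fraction must lie in [0, 1]")
    if options["type_conversion"] not in {"int2float", "all2string"}:
        raise ValueError(f"unsupported type_conversion: {options['type_conversion']!r}")
    return options


print(load_ingest_options(CONFIG_TEXT))
```

`configparser` strips surrounding whitespace from values, so the trailing spaces after `SQLite` in the files above are harmless. It does not strip inline comments under its default settings, so an annotation such as `#or: SQLite` on the same line would become part of the value unless `inline_comment_prefixes` is set.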