
Parquet_integration #122

Merged Sep 17, 2020 (153 commits)
d606e21
[WIP] Integrated Parquet Engine Option, started with test modifications
diskontinuum Dec 16, 2019
6a66fa8
Included Parquet-Writer in ingest.py : In this version, the writer is…
diskontinuum Jan 2, 2020
7e766a5
Added 'database_engine' section to the config files, incl. tests.
diskontinuum Jan 3, 2020
abe71e9
Added Parquet Option to config files. Corrected argument list of into().
diskontinuum Jan 6, 2020
7b9e1e0
Fixed SyntaxError: positional argument follows keyword argument
diskontinuum Jan 6, 2020
cf6039e
debuggig: removed comments from config.ini, removed splitext() when r…
diskontinuum Jan 6, 2020
847dae7
debugged: changed argument list of getWriter()
diskontinuum Jan 6, 2020
25557c2
corrected 4 bugs: 1)completed argument list when calling getWriters()…
diskontinuum Jan 7, 2020
5369033
Started integration of schema-checking
diskontinuum Jan 16, 2020
98e673c
-
diskontinuum Jan 18, 2020
3194ce5
Added sampling and conversion functions
diskontinuum Jan 22, 2020
a321c09
Merge pull request #1 from diskontinuum/convert_schemata
diskontinuum Jan 22, 2020
0e84e68
Does not contain changes from colab notebook.
diskontinuum Feb 8, 2020
fc27c3e
Added original ingest.py and deleted old partial code. All new code i…
diskontinuum Feb 10, 2020
cee0029
fixed imports and naming bugs
diskontinuum Feb 10, 2020
d9e8185
Added option to ignore object.csv
diskontinuum Feb 10, 2020
0c07ad2
renamed WIP folder
diskontinuum Feb 11, 2020
e1292dd
deleted deprecated ingest.py. The new version is named ingest_parquet…
diskontinuum Feb 14, 2020
1b64786
removed additional test folder, updated test_ingest.py, renamed inges…
diskontinuum Feb 19, 2020
848baad
updated install requirement 'pyarrow' in setup.py
diskontinuum Feb 19, 2020
6c414f3
added 'pyarrow>=0.16.1' to requirements.txt
diskontinuum Feb 24, 2020
a5af51c
removed pyarrow from install_requires in setup.py
diskontinuum Feb 24, 2020
7f26041
comma
diskontinuum Feb 24, 2020
a34c99d
quotation marks
diskontinuum Feb 24, 2020
9016f01
reduced requirements pyarrow>=0.16.1 to pyarrow>=0.16.0
diskontinuum Feb 24, 2020
62d4685
added numpy to requirements.txt
diskontinuum Feb 24, 2020
5a3ef5b
renaming of ingest files to check if tests work on original files in …
diskontinuum Feb 24, 2020
a7cb5a3
Added original test_ingest.py code
diskontinuum Feb 24, 2020
7efc190
changed function argument name in munge.py. Reading the config file (…
diskontinuum Feb 24, 2020
0d2f585
new separate folders for new layout of config file (cytominer_databas…
diskontinuum Feb 24, 2020
d34ce21
changed config_file arguments to config_path for function calls of cy…
diskontinuum Feb 27, 2020
a0a72c6
added new options to the default config.ini. Old tests pass.
diskontinuum Feb 27, 2020
1dedbda
modified the local config.ini files in all test folders. All original…
diskontinuum Feb 27, 2020
e67a239
Changed entries in config.ini files to default value 'sample' and sta…
diskontinuum Feb 28, 2020
2565b6c
Refatored. ToDo: Finish original tests, add new tests, clean up and d…
diskontinuum Apr 28, 2020
c42dd91
Original tests pass with given config files. ToDo: Test with differen…
diskontinuum Apr 28, 2020
e5193b9
deleted test duplicates
diskontinuum Apr 28, 2020
8e45b21
Added a documentation for the configuration file config.ini to rhe re…
diskontinuum Apr 28, 2020
e08cb83
updated readme.
diskontinuum Apr 28, 2020
406d164
updated readme.
diskontinuum Apr 28, 2020
b72dc1d
updated readme.
diskontinuum Apr 28, 2020
dd7d50e
updated readme.
diskontinuum Apr 28, 2020
c4f3de7
updated readme.
diskontinuum Apr 28, 2020
38a988d
updated readme.
diskontinuum Apr 28, 2020
cfad784
updated readme.
diskontinuum Apr 28, 2020
be5a24a
updated readme.
diskontinuum Apr 28, 2020
d2eec56
updated readme.
diskontinuum Apr 28, 2020
3226f95
updated readme.
diskontinuum Apr 28, 2020
92e0c85
updated readme.
diskontinuum Apr 28, 2020
adc98df
updated readme.
diskontinuum Apr 28, 2020
0cd216c
updated readme.
diskontinuum Apr 28, 2020
bfb56cb
updated readme.
diskontinuum Apr 28, 2020
a97b644
updated readme.
diskontinuum Apr 28, 2020
382c8ce
updated readme.
diskontinuum Apr 28, 2020
3e7c795
Merged load_df(). Debugged. Passes all tests. ToDo: Further unificati…
diskontinuum May 4, 2020
e7e089f
Massive refactoring and clean-up. Merged engine-options. Tests pass.
diskontinuum May 5, 2020
0d19431
Added docstrings. Deleted old comments.
diskontinuum May 7, 2020
d96303f
Further code cleanse
diskontinuum May 7, 2020
d553b38
added crucial assertions to write.py. ToDo: build pytest analogon.
diskontinuum May 8, 2020
df2eba3
removed unnecessary assignments in main script
diskontinuum May 8, 2020
51cb5f6
added one missing assignment
diskontinuum May 8, 2020
309ab0d
Update README.rst
diskontinuum May 8, 2020
e166bf9
Update README.rst
diskontinuum May 8, 2020
c987cf5
Updated documentation in main script
diskontinuum May 8, 2020
6ed7b8c
Merge branch 'colab_updates' of github.com:diskontinuum/cytominer-dat…
diskontinuum May 8, 2020
5f9631d
Finalized docstring comments in main script, including general explan…
diskontinuum May 8, 2020
d0e95c6
change all type-converting functions (definitions and calls) to not r…
diskontinuum May 18, 2020
d0af406
merging the tableSchema functions
diskontinuum May 19, 2020
1834cd2
simplified and unified the generation of reference files dictionary
diskontinuum May 26, 2020
84ec340
removed redundant list unpacking
diskontinuum May 26, 2020
733d96d
More expclicit name retrieval
diskontinuum May 27, 2020
91ff904
Tests pass : corrected string capitalization error
diskontinuum May 27, 2020
33d3f82
Update cytominer_database/ingest_incl_parquet.py
diskontinuum May 27, 2020
03cd7b2
Update README.rst
diskontinuum May 27, 2020
befc8a9
Update cytominer_database/load.py
diskontinuum May 27, 2020
874a22b
comments removed from read_config()
diskontinuum May 27, 2020
ea2f5e8
Merge branch 'colab_updates' of github.com:diskontinuum/cytominer-dat…
diskontinuum May 27, 2020
09fec3b
various PR suggestions
diskontinuum May 27, 2020
5e8236a
renamed get_table_paths_from_directory_list() to directory_list_to_pa…
diskontinuum May 27, 2020
3241f2d
renamed get_unique_reference_dirs() to get_path_dictionary
diskontinuum May 27, 2020
79b5c35
renamed return arguments ref_dirs and ref_dir to sampled_path_diction…
diskontinuum May 27, 2020
9bc8d54
Update cytominer_database/config/config_cellpainting.ini
diskontinuum May 31, 2020
800b6d3
Changed default sampling fractions from 1 to 0.3 in the config files.
diskontinuum May 31, 2020
dd527e0
Update cytominer_database/config/config_htqc.ini
diskontinuum May 31, 2020
fb246b6
Update cytominer_database/ingest.py
diskontinuum May 31, 2020
1850236
Update cytominer_database/ingest.py
diskontinuum May 31, 2020
de908d6
deleted old code
diskontinuum May 31, 2020
2cf7a99
deleted space
diskontinuum May 31, 2020
1bacfac
removed docstring prose
diskontinuum May 31, 2020
e2fd6b7
isolated warning-block
diskontinuum May 31, 2020
810a52c
isolated warning-block
diskontinuum May 31, 2020
038b77b
warning box .rst
diskontinuum May 31, 2020
f33d42f
Added .html version of README.rst. Warning box is displayed correctly.
diskontinuum May 31, 2020
51c6462
strong emphasis instead of warning box
diskontinuum May 31, 2020
4d7fbc2
added .vscode json files to .gitignore
diskontinuum May 31, 2020
c4fd31d
Update .gitignore
diskontinuum Jun 24, 2020
7630aae
vscode cleanup
diskontinuum Jun 24, 2020
bd50178
Update README.rst
diskontinuum Jun 24, 2020
8276cae
Update cytominer_database/config/config_default.ini
diskontinuum Jun 24, 2020
a841db0
Update cytominer_database/ingest.py
diskontinuum Jun 24, 2020
14428e0
Update cytominer_database/write.py
diskontinuum Jun 24, 2020
4c3c021
Update cytominer_database/ingest.py
diskontinuum Jul 1, 2020
aa10636
rm unused imports
diskontinuum Jul 1, 2020
2d10a34
smaller import: os -> os.path
diskontinuum Jul 1, 2020
00ddfeb
decreased or removed os imports
diskontinuum Jul 1, 2020
3ce001e
narrowed remaining os imports where possible
diskontinuum Jul 1, 2020
8b4f35b
Update cytominer_database/ingest.py
diskontinuum Jul 1, 2020
32b30e2
removed comments
diskontinuum Jul 1, 2020
cfa0c27
Merge branch 'colab_updates' of github.com:diskontinuum/cytominer-dat…
diskontinuum Jul 1, 2020
97c09ca
Removed unnecessary comments
diskontinuum Jul 1, 2020
80728cb
Added raise ValueError to type_conver_dataframe(), for the case in wh…
diskontinuum Jul 1, 2020
4ba394c
Moved and renamed constant list of column names that should not be co…
diskontinuum Jul 1, 2020
b0ca497
Added UserWarning in case no values were converted
diskontinuum Jul 1, 2020
5b80f1d
missing bracket error
diskontinuum Jul 1, 2020
13c3369
Added further ValueErrors and UserWarnings (in case of conversion err…
diskontinuum Jul 1, 2020
076c939
ran BLACK formatting tool over all files
diskontinuum Jul 1, 2020
52a8812
removed python 2
diskontinuum Jul 1, 2020
367cf5c
replaced tricky list comprehension
diskontinuum Jul 1, 2020
c729cf4
replaced the f-string with .format()
diskontinuum Jul 1, 2020
4a07121
removed print()-statements
diskontinuum Jul 1, 2020
f273299
removed remaining print()-statements from chaotic tableSchema.py
diskontinuum Jul 1, 2020
339ac19
f-strings for the original ingest()
diskontinuum Jul 1, 2020
f709963
fixed prefixing of column labels in both ingest() and load() (no list…
diskontinuum Jul 1, 2020
4bc4ab7
Update README.rst
diskontinuum Jul 1, 2020
7e14bcc
Update README.rst
diskontinuum Jul 1, 2020
8138130
Update README.rst
diskontinuum Jul 1, 2020
0ebe5c3
Update README.rst
diskontinuum Jul 1, 2020
080b0cc
Update README.rst
diskontinuum Jul 1, 2020
9a1d4fd
Update README.rst
diskontinuum Jul 1, 2020
92e90bd
Update README.rst
diskontinuum Jul 1, 2020
8563d6c
bracket
diskontinuum Jul 2, 2020
512c48e
Merge branch 'colab_updates' of github.com:diskontinuum/cytominer-dat…
diskontinuum Jul 2, 2020
c55560b
added author 'Frances Hubis' and updated year
diskontinuum Jul 24, 2020
a2c9ae6
added author 'Frances Hubis' and git link to CONTRIBUTORS.md
diskontinuum Jul 24, 2020
d1610fc
renamed module ingest_incl_parquet.py to ingest_parquet.py . Changed …
diskontinuum Jul 24, 2020
6759ce7
added a 'parquet' flag option
diskontinuum Jul 24, 2020
022fb94
removed old-named ingest_incl_parquet.py
diskontinuum Jul 24, 2020
81f6e89
changed command structure. Now the flag '--parquet' will trigger the …
diskontinuum Jul 24, 2020
6126085
updated README.md to contain the click command instruction for the 'p…
diskontinuum Jul 24, 2020
d2660ac
removed deprecated filename
diskontinuum Jul 24, 2020
fe10d81
removed 'parquet' flag - command tests pass again
diskontinuum Jul 26, 2020
d16bcfb
included new command flag '--variable_engine/--no-variable_engine' i…
diskontinuum Jul 26, 2020
2529591
renamed new code version 'ingest_parquet.py'/'test_ingest_parquet.py'…
diskontinuum Jul 26, 2020
45a39e9
prettified code with black
diskontinuum Jul 26, 2020
c17d852
reversed accidental erasure
diskontinuum Jul 28, 2020
aae91e9
reversed ordering
diskontinuum Jul 28, 2020
094707a
-
diskontinuum Jul 28, 2020
edf2953
Merge branch 'master' into colab_updates
diskontinuum Jul 28, 2020
ed68760
Update CONTRIBUTORS.md
diskontinuum Jul 29, 2020
6cfd9d1
Update cytominer_database/commands/command_ingest_variable_engine.py
diskontinuum Jul 29, 2020
799997f
Added Hubis to copyright list and ran black again.
diskontinuum Jul 29, 2020
1b2976c
Addressed incomplete doctring in ingest.seed()
diskontinuum Jul 29, 2020
91c871c
Update cytominer_database/ingest.py
gwaybio Sep 17, 2020
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -21,3 +21,5 @@ tests/__pycache__/
.pytest_cache/
*.ipynb
.DS_Store
.vscode/settings.json
.vscode/launch.json
15 changes: 15 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,15 @@
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Python: Current File",
"type": "python",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal"
}
]
}
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
Member: same comment here - lets remove these two files and add .vscode to .gitignore.

Contributor Author: I added .vscode/settings.json and .vscode/launch.json to the .gitignore. Not sure how to proceed - surely I can't just delete the files?

Member: I haven't used vscode before. If you delete these files, will the software not work anymore? If so, here is a workaround:

# From top directory
mv .vscode .vscode_temp
git rm -r .vscode/
git commit -m 'vscode cleanup'
git push
mv .vscode_temp .vscode

If vscode autogenerates these files, then no need to do the whole mv .vscode_temp rigamarole.

"python.pythonPath": "/opt/anaconda3/envs/cytominer1/bin/python"
}
82 changes: 80 additions & 2 deletions README.rst
@@ -1,3 +1,4 @@
==================
cytominer-database
==================

Expand All @@ -17,16 +18,93 @@ high-throughput imaging experiment. The measurements are stored across thousands
cytominer-database helps you organize these data into a single database backend, such as SQLite.

Why cytominer-database?
-----------------------
=======================
While tools like CellProfiler can store measurements directly in databases, it is usually infeasible to create a
centralized database in which to store these measurements. A more scalable approach is to create a set of CSVs per
“batch” of images, and then later merge these CSVs.

cytominer-database ingest reads these CSVs, checks for errors, then ingests
them into a database backend, including
SQLite, MySQL, PostgreSQL, and several other backends supported by odo.

.. code-block:: sh

cytominer-database ingest source_directory sqlite:///backend.sqlite -c ingest_config.ini

will ingest the CSV files nested under source_directory into a SQLite backend.
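After ingestion, the resulting SQLite backend can be inspected with Python's standard library. A minimal sketch (the file name ``backend.sqlite`` and the table names are assumptions based on the command and config defaults above):

```python
import sqlite3

# connect to the backend produced by the ingest command above
con = sqlite3.connect("backend.sqlite")

# list the tables that were created (e.g. Image, Cells, Cytoplasm, Nuclei)
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
con.close()
```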

How to use the configuration file
=================================
The configuration file ingest_config.ini must be located in the source_directory
and can be modified to customize the ingestion.
It has three sections.

The [filenames] section
-----------------------

.. code-block::

[filenames]
image = image.csv #or: Image.csv
object = object.csv #or: Object.csv

Cytominer-Database is currently limited to the following measurement file kinds:
Cells.csv, Cytoplasm.csv, Nuclei.csv, Image.csv, Object.csv.
The [filenames] section stores the exact basenames of the existing measurement files
(this matters when capitalization is inconsistent across files).

The [database_engine] section
-----------------------------

.. code-block::

[database_engine]
database = SQLite #or: Parquet

The [database_engine] section specifies the backend. Possible key-value pairs are:
**database** = *SQLite* or **database** = *Parquet*.

The [schema] section
--------------------

.. code-block::

[schema]
reference_option = sample #or: path/to/reference/folder relative to source_directory
ref_fraction = 1 #or: any decimal value in [0, 1]
type_conversion = int2float #or: all2string

The [schema] section specifies how to manage incompatibilities between the table
schemas of the .csv files.
A Parquet file is fixed to the schema with which it was first opened,
i.e. the schema of the first file written to it (the reference file). To append the data
of all .csv files of a given file kind, the reference file must satisfy certain
compatibility requirements, e.g. it must not be missing any columns,
and every existing file must be automatically convertible to the reference schema.
This section is used only when files are ingested to Parquet format; it was
developed to handle the special cases in which tables cannot be concatenated automatically.

There are two options for the key **reference_option**:
The first option is to create a designated folder containing one .csv reference file for every kind of file ("Cytoplasm.csv", "Nuclei.csv", ...)
and save the folder path in the config file as **reference_option** = *path/to/reference/folder*,
where the path is relative to the source_directory passed to the ingest command.
The schema of these reference files determines the schema of the Parquet file into which all .csv files of the corresponding kind are ingested.


**This option relies on manual selection, hence the chosen reference files must be checked explicitly: Make sure the .csv files are complete in the number of columns and contain no NaN values.**

Alternatively, the reference files can be determined automatically from a sampled subset of all existing files.
This is the case if **reference_option** = *sample* is set.
A subset of all files is sampled uniformly at random, and the first table with
the maximum number of columns among all sampled .csv files is chosen as the reference table.
In this case, an additional key **ref_fraction** can be set, which specifies the fraction of files
sampled among all files. The default value is **ref_fraction** = *1*, for which
all tables are compared in width. This key is only used if **reference_option** = *sample*.
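A minimal sketch of this sampling strategy (a hypothetical helper for illustration, not the package's actual implementation; the function and variable names are assumptions):

```python
import random

def pick_reference(csv_headers, ref_fraction=1.0, seed=None):
    """Pick a reference table: sample a fraction of tables uniformly at
    random and return one with the maximum column count.

    csv_headers: dict mapping file path -> list of column names.
    """
    rng = random.Random(seed)
    paths = sorted(csv_headers)
    k = max(1, round(ref_fraction * len(paths)))
    sampled = rng.sample(paths, k)
    # widest schema among the sampled subset wins
    return max(sampled, key=lambda p: len(csv_headers[p]))

headers = {
    "plate1/Cells.csv": ["ImageNumber", "ObjectNumber", "Area"],
    "plate2/Cells.csv": ["ImageNumber", "ObjectNumber", "Area", "Extra"],
    "plate3/Cells.csv": ["ImageNumber", "ObjectNumber"],
}
ref = pick_reference(headers, ref_fraction=1.0)  # all files compared in width
```

With ``ref_fraction`` below 1, only the sampled subset is compared, which trades robustness for speed on large batches.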

Lastly, the key **type_conversion** determines how schema types are handled in case of disagreement.
The default value is *int2float*, for which all integer columns are converted to floats.
This is helpful for trivial (all-zero) columns, which may be inferred as "int" type
and could otherwise not be written into the same table as files in which that column holds non-zero float values.
Automatic type conversion can be avoided by instead converting all values to string type,
by setting **type_conversion** = *all2string*.
However, the loss of type information may be a disadvantage in downstream tasks.
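The *int2float* behavior can be sketched with pandas (an illustrative sketch of the described behavior, not the package's exact code):

```python
import pandas as pd

def int2float(df: pd.DataFrame) -> pd.DataFrame:
    """Convert all integer columns to float, so that an all-zero column
    inferred as int can share a schema with float-valued columns."""
    int_cols = df.select_dtypes(include="integer").columns
    return df.astype({col: "float64" for col in int_cols})

df = pd.DataFrame({"AreaShape_Area": [0, 0, 0], "Label": ["a", "b", "c"]})
converted = int2float(df)
# the all-zero int column is now float64; non-numeric columns are untouched
```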
2 changes: 1 addition & 1 deletion cytominer_database/commands/command_ingest.py
Expand Up @@ -57,6 +57,6 @@
)
def command(source, target, config_file, munge, skip_image_prefix):
if munge:
cytominer_database.munge.munge(config_file=config_file, source=source)
cytominer_database.munge.munge(config_path=config_file, source=source)

cytominer_database.ingest.seed(source, target, config_file, skip_image_prefix)
7 changes: 7 additions & 0 deletions cytominer_database/config/config_cellpainting.ini
Expand Up @@ -3,3 +3,10 @@ image = Image.csv
object = object.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float
9 changes: 9 additions & 0 deletions cytominer_database/config/config_default.ini
@@ -1,3 +1,12 @@
[filenames]
image = Image.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float

8 changes: 8 additions & 0 deletions cytominer_database/config/config_htqc.ini
Expand Up @@ -2,3 +2,11 @@
image = image.csv
object = object.csv
experiment = Experiment.csv

[database_engine]
database = SQLite

[schema]
reference_option = sample
ref_fraction = 1
type_conversion = int2float
77 changes: 32 additions & 45 deletions cytominer_database/ingest.py
Expand Up @@ -69,48 +69,36 @@ def into(input, output, name, identifier, skip_table_prefix=False):
from the names of columns.
"""

with backports.tempfile.TemporaryDirectory() as directory:
source = os.path.join(directory, os.path.basename(input))

# create a temporary CSV file which is identical to the input CSV file
# but with the column names prefixed with the name of the compartment
# (or `Image`, if this is an image CSV file, and `skip_table_prefix` is False)
with open(input, "r") as fin, open(source, "w") as fout:
reader = csv.reader(fin)
writer = csv.writer(fout)

headers = next(reader)
if not skip_table_prefix:
headers = [__format__(name, header) for header in headers]

# The first column is `TableNumber`, which is the unique identifier for the image CSV
headers = ["TableNumber"] + headers

writer.writerow(headers)

[writer.writerow([identifier] + row) for row in reader]

# Now ingest the temp CSV file (with the modified column names) into the database backend
# the rows of the CSV file are inserted into a table with name `name`.
with warnings.catch_warnings():
# Suppress the following warning on Python 3:
#
# /usr/local/lib/python3.6/site-packages/odo/utils.py:128: DeprecationWarning: inspect.getargspec() is
# deprecated, use inspect.signature() or inspect.getfullargspec()
warnings.simplefilter("ignore", category=DeprecationWarning)

engine = create_engine(output)
con = engine.connect()

df = pd.read_csv(source, index_col=0)
df.to_sql(name=name, con=con, if_exists="append")
with warnings.catch_warnings():
# Suppress the following warning on Python 3:
#
# /usr/local/lib/python3.6/site-packages/odo/utils.py:128: DeprecationWarning: inspect.getargspec() is
# deprecated, use inspect.signature() or inspect.getfullargspec()
warnings.simplefilter("ignore", category=DeprecationWarning)
engine = create_engine(output)
con = engine.connect()

df = pd.read_csv(input)
#print("In into(): df['Index']") # no index prepended yet
#print(df['Index'])
# add "name" prefix to column headers
if not skip_table_prefix:
no_prefix = ["ImageNumber", "ObjectNumber"] # exception columns
df.columns = [
i if i in no_prefix else name + "_" + i for i in df.columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
i if i in no_prefix else name + "_" + i for i in df.columns
i if i in no_prefix else f"{name}_{i}" for i in df.columns

f strings are legit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason to not accept this "suggestion"? Github makes these "suggestions" easy to manage. Let's hop on a call next week to discuss :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diskontinuum - let's discuss how to "accept" a commit via the github gui. it really is quite nice! (and is very close to zero effort)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gwaygenomics : I kept ingest.py as an unmodified copy of the original :) , hence I did not introduce any changes.
The modified code is in ingest_incl_parquet.py, where I introduced the fstring for this list comprehension :) .
BTW, this suggestion failed at the Travis CI - I then rewrote the line without ist comprehensions (or fstrings).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Now I introduced this change to the original as well.

]
# add TableNumber
number_of_rows, _ = df.shape
table_number_column = [identifier] * number_of_rows # create additional column
df.insert(0, "TableNumber", table_number_column, allow_duplicates=False)
print("In into(): df.shape is ", df.shape)
df.to_sql(name=name, con=con, if_exists="append", index = False)
Member: minor style suggestion. Suggested change:
df.to_sql(name=name, con=con, if_exists="append", index=False)


def checksum(pathname, buffer_size=65536):
"""
Generate a 32-bit unique identifier for a file.

:param pathname: input file
:param buffer_size: buffer size

"""
with open(pathname, "rb") as stream:
result = zlib.crc32(bytes(0))
Expand All @@ -125,36 +113,36 @@ def checksum(pathname, buffer_size=65536):

return result & 0xffffffff

def seed(source, target, config_file, skip_image_prefix=True):
def seed(source, target, config_path, skip_image_prefix=True):
"""
Read CSV files into a database backend.

:param config_file: Configuration file.
:param source: Directory containing subdirectories that contain CSV files.
:param target: Connection string for the database.
:param skip_image_prefix: True if the prefix of image table name should be excluded
:param : True if the prefix of image table name should be excluded
from the names of columns from per image table
"""

Member: Suggested change:
:param skip_image_prefix: True if the prefix of image table name should be excluded

Member: this comment still needs addressing
config_file = cytominer_database.utils.read_config(config_file)
config_file = cytominer_database.utils.read_config(config_path)

# list the subdirectories that contain CSV files
directories = sorted(list(cytominer_database.utils.find_directories(source)))

for directory in directories:

# get the image CSV and the CSVs for each of the compartments
try:
compartments, image = cytominer_database.utils.validate_csv_set(config_file, directory)
except IOError as e:
click.echo(e)

continue
except sqlalchemy.exc.DatabaseError as e:
click.echo(e)
continue

# get a unique identifier for the image CSV. This will later be used as the TableNumber column
# the casting to int is to allow the database to be readable by CellProfiler Analyst, which
# requires TableNumber to be an integer.
identifier = checksum(image)

name, _ = os.path.splitext(config_file["filenames"]["image"])

# ingest the image CSV
Expand All @@ -163,7 +151,6 @@ def seed(source, target, config_file, skip_image_prefix=True):
skip_table_prefix=skip_image_prefix)
except sqlalchemy.exc.DatabaseError as e:
click.echo(e)

continue

# ingest the CSV for each compartment