Add value-based tests for table ingestion #129

diskontinuum · 2020-09-29T13:15:05Z

The current tests only check the size of the concatenated tables, not their content (for both --parquet and --sqlite).
However, the tables are modified during ingestion:

A TableNumber column and prefixes are added in get_and_modify_df89.
The tables are aligned with the reference dataframe in write_to_disk().

To compare the written and chopped files with the original ones, value-based tests would have to modify the original tables accordingly.
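
A minimal sketch of what such a value-based test could look like. The helper `expected_table`, the column names, the `Image_` prefix, and the table number are all hypothetical; the real modifications are whatever get_and_modify_df and write_to_disk() actually do, so this only illustrates the idea of transforming the original table before comparing values.

```python
import pandas as pd


def expected_table(original, table_number, prefix):
    """Hypothetical stand-in for the ingestion-time modifications."""
    # Prefix every original column name (assumed prefix format).
    expected = original.add_prefix(prefix)
    # Add the TableNumber column in front (assumed position and value).
    expected.insert(0, "TableNumber", table_number)
    return expected


# Original table as it would be read from the source file (made-up data).
original = pd.DataFrame({"ImageNumber": [1, 2], "Area": [10.5, 11.0]})

# Table as it would be read back after ingestion (made-up data).
written = pd.DataFrame(
    {
        "TableNumber": [7, 7],
        "Image_ImageNumber": [1, 2],
        "Image_Area": [10.5, 11.0],
    }
)

expected = expected_table(original, table_number=7, prefix="Image_")

# Compares column names, dtypes, and every value, not just the shape.
pd.testing.assert_frame_equal(written, expected)
```

`pd.testing.assert_frame_equal` reports the first differing column or value, which is more informative than a bare shape check.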


diskontinuum · 2020-09-29T13:19:52Z

Attention: When the files are read back from Parquet into a pandas dataframe, nullable values are implicitly type-converted from int to float (read here), which makes an equality assertion fail.

To make sure the types are identical even for the new NaN columns, cast them explicitly (with df[x] = df[x].astype(ref_df[x].dtype)).
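
A small sketch of the explicit cast, using only pandas. The helper name `cast_like` and the example columns are assumptions for illustration; the float column stands in for an integer column that came back as float, and the all-NaN column for one added during alignment.

```python
import numpy as np
import pandas as pd


def cast_like(df, ref_df):
    """Cast every column of df to the dtype of the same-named column in ref_df."""
    out = df.copy()
    for col in ref_df.columns:
        out[col] = out[col].astype(ref_df[col].dtype)
    return out


# Reference table with an integer column and a string column (made-up data).
ref_df = pd.DataFrame({"ObjectNumber": [1, 2], "Label": ["a", "b"]})

# Table read back: integers arrived as floats, "Label" is an all-NaN column.
df = pd.DataFrame({"ObjectNumber": [1.0, 2.0], "Label": [np.nan, np.nan]})

assert not df.dtypes.equals(ref_df.dtypes)  # dtypes differ before the cast

df = cast_like(df, ref_df)

assert df.dtypes.equals(ref_df.dtypes)  # identical pandas schemata afterwards
```

Note that a column which really contains NaN cannot be cast to a plain NumPy integer dtype (pandas raises an error), so the cast only works when the reference dtype can hold missing values or the column has none.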

diskontinuum · 2020-09-29T13:24:32Z

Attention:
Do not use the index_col=0 parameter, as in pandas_df = pd.read_csv(tmp_source, index_col=0), when importing from CSV into pandas dataframes, because the first column will then be used as the row label and cease to exist as a column.
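
To illustrate the effect, here is a short sketch with an in-memory CSV. The column names and values are made up, and `io.StringIO` stands in for the real source file.

```python
import io

import pandas as pd

# Made-up CSV content; in the tests this would be the source file.
csv_text = "ImageNumber,ObjectNumber,Area\n1,1,10.5\n1,2,11.0\n"

# With index_col=0 the first column becomes the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, all three columns are kept as regular columns.
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # first column is missing here
print(list(without_index.columns))  # all columns present
```

A value-based comparison against a table that still has the first column would therefore fail on the column set alone.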

gwaybio · 2020-10-29T19:06:47Z

totally agree with adding value-based tests... our current system of only checking shape is very dangerous!

gwaybio · 2020-10-29T19:07:19Z

I am adding a documentation snippet previously in write.py (removed in #130)

""" --------- code snippets for testing code ---------
# -------- pandas dataframe alignment ----------------
# (note: missing columns are added with same name and type
#  as in ref_dataframe, but containing NaN values.)
dataframe, ref_dataframe_new = dataframe.align(ref_dataframe, join="right", axis=1)
# assert that the reference table has not been modified by the alignment.
assert ref_dataframe_new.equals(ref_dataframe)
# --------- identical pandas schemata-----------------
assert dataframe.dtypes.equals(ref_dataframe.dtypes)
# --------- identical pyarrow schemata----------------
# (note: use "==" for pyarrow schema comparisons, not "is")
table = pyarrow.Table.from_pandas(dataframe)
assert (table.schema.types == writers_dict[name]["schema"].types)
assert (table.schema.names == writers_dict[name]["schema"].names)
"""

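For anyone picking this up, a self-contained sketch of the pandas alignment part of the snippet above, with made-up tables. The column names are assumptions, and the pyarrow schema checks are left out because they depend on writers_dict from write.py.

```python
import pandas as pd

# Reference table defining the expected columns (made-up data).
ref_dataframe = pd.DataFrame({"ImageNumber": [1, 2], "Area": [10.5, 11.0]})

# Table with one missing column ("Area") and one extra column ("Extra").
dataframe = pd.DataFrame({"ImageNumber": [3, 4], "Extra": [0, 0]})

# join="right" keeps exactly the columns of ref_dataframe, in its order;
# axis=1 aligns columns only and leaves the row index untouched.
dataframe, ref_dataframe_new = dataframe.align(
    ref_dataframe, join="right", axis=1
)

# The reference table has not been modified by the alignment.
assert ref_dataframe_new.equals(ref_dataframe)

# The aligned table now has the reference columns; the missing one is NaN.
assert list(dataframe.columns) == list(ref_dataframe.columns)
assert dataframe["Area"].isna().all()
```

The column added by the alignment holds only NaN values, so its dtype may still need an explicit cast before the dtypes of both tables compare equal.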