Add value-based tests for table ingestion #129

Open
diskontinuum opened this issue Sep 29, 2020 · 4 comments

Comments

@diskontinuum
Contributor

Current tests check only the size, not the content, of the concatenated tables (for both --parquet and --sqlite).
However, the tables are modified during ingestion:

  • A TableNumber column and prefixes are added in get_and_modify_df89
  • The table is aligned with the reference dataframe in write_to_disk()

To compare the written and chopped files with the original ones, value-based tests would have to modify the original tables accordingly.
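
A minimal sketch of what such a value-based test could look like. The helper `expected_table`, the `Cells_` prefix, the integer TableNumber value, and the column names are all hypothetical stand-ins, not the project's actual code; the `ingested` frame stands in for a table read back from the written file.

```python
import pandas as pd


def expected_table(original, table_number, prefix):
    """Hypothetical helper: apply the ingestion-time changes to the original table."""
    expected = original.add_prefix(prefix)           # prefix every column name
    expected.insert(0, "TableNumber", table_number)  # scalar is broadcast to all rows
    return expected


# Original table as it would be loaded from the source CSV (made-up values).
original = pd.DataFrame({"ObjectNumber": [1, 2], "AreaShape_Area": [10.0, 12.5]})

# Stand-in for the table read back from the written Parquet/SQLite file.
ingested = pd.DataFrame(
    {
        "TableNumber": [7, 7],
        "Cells_ObjectNumber": [1, 2],
        "Cells_AreaShape_Area": [10.0, 12.5],
    }
)

expected = expected_table(original, table_number=7, prefix="Cells_")

# Compares values, column names, and dtypes, not just the shape.
pd.testing.assert_frame_equal(ingested, expected)
```

Unlike a shape check, `pd.testing.assert_frame_equal` fails when a value, a column name, or a dtype differs.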

@diskontinuum
Contributor Author

diskontinuum commented Sep 29, 2020

Attention: When reading the files from Parquet back into a Pandas dataframe, nullable values are implicitly type-converted from int to float, which will make an equality assertion fail.

To make sure the types are identical even for the new NaN columns, cast them explicitly (with df[x] = df[x].astype(ref_df[x].dtype)).
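
The cast can be sketched as follows. This is a pandas-only illustration: `read_back` merely simulates a table whose integer column came back as float, and `cast_like` is a hypothetical helper name, not part of the project.

```python
import pandas as pd


def cast_like(df, ref_df):
    """Return a copy of df whose shared columns use the dtypes of ref_df."""
    out = df.copy()
    for col in ref_df.columns.intersection(out.columns):
        out[col] = out[col].astype(ref_df[col].dtype)
    return out


ref_df = pd.DataFrame({"ObjectNumber": [1, 2, 3], "Area": [0.5, 1.5, 2.5]})

# Simulated read-back table: same values, but the int column is now float64.
read_back = ref_df.astype({"ObjectNumber": "float64"})
assert not read_back.dtypes.equals(ref_df.dtypes)

fixed = cast_like(read_back, ref_df)
pd.testing.assert_frame_equal(fixed, ref_df)
```

Note that casting a column that actually contains NaN to a plain integer dtype raises an error in pandas, so this only works when the reference dtype can represent the missing values (for example float, or the nullable "Int64" extension dtype).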

@diskontinuum
Contributor Author

Attention:
Do not use the index_col=0 parameter in pandas_df = pd.read_csv(tmp_source, index_col=0) when importing from CSV to Pandas dataframes, because the first column will be used as the row index and cease to exist as a column.
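
A small illustration of the difference, using an in-memory CSV with made-up column names instead of tmp_source:

```python
import io

import pandas as pd

csv_text = "ObjectNumber,Area\n1,0.5\n2,1.5\n"

# With index_col=0 the first column becomes the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, every CSV column stays a regular column.
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # ['Area']
print(list(without_index.columns))  # ['ObjectNumber', 'Area']
```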

@gwaybio
Member

gwaybio commented Oct 29, 2020

totally agree with adding value-based tests... our current system of only checking shape is very dangerous!

@gwaybio
Member

gwaybio commented Oct 29, 2020

I am adding a documentation snippet previously in write.py (removed in #130)

""" --------- code snippets for testing code ---------
# -------- pandas dataframe alignment ----------------
# (note: missing columns are added with same name and type
#  as in ref_dataframe, but containing NaN values.)
dataframe, ref_dataframe_new = dataframe.align(ref_dataframe, join="right", axis=1)
# assert that the reference table has not been modified by the alignment.
assert ref_dataframe_new.equals(ref_dataframe)
# --------- identical pandas schemata-----------------
assert dataframe.dtypes.equals(ref_dataframe.dtypes)
# --------- identical pyarrow schemata----------------
# (note: use "==" for pyarrow schema comparisons, not "is")
table = pyarrow.Table.from_pandas(dataframe)
assert (table.schema.types == writers_dict[name]["schema"].types)
assert (table.schema.names == writers_dict[name]["schema"].names)
"""
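
For reference, the alignment part of the snippet above can be tried out on toy data. The frames and column names here are made up, and the final cast to the reference dtypes is an added step for the sketch, not something the snippet itself does.

```python
import pandas as pd

# Reference table that defines the expected columns and dtypes.
ref_dataframe = pd.DataFrame({"TableNumber": [1], "Area": [2.0], "Intensity": [0.3]})

# Table lacking "Intensity" and carrying an extra column.
dataframe = pd.DataFrame({"TableNumber": [1, 1], "Area": [4.0, 5.0], "Extra": [9, 9]})

# join="right" keeps exactly the reference columns; axis=1 aligns columns only.
aligned, ref_dataframe_new = dataframe.align(ref_dataframe, join="right", axis=1)

# The reference table has not been modified by the alignment.
assert ref_dataframe_new.equals(ref_dataframe)

# The missing column is added and filled with NaN; the extra column is dropped.
assert list(aligned.columns) == ["TableNumber", "Area", "Intensity"]
assert aligned["Intensity"].isna().all()

# Cast explicitly so the pandas schemata are identical.
aligned = aligned.astype(ref_dataframe.dtypes.to_dict())
assert aligned.dtypes.equals(ref_dataframe.dtypes)
```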
