To run unit tests, use a terminal at the root of the project you are working on (sql, synth, or evaluator):
pip install -r tests/requirements.txt
pytest tests
Many tests use CSV datasets stored under the datasets folder. Some of these files are large and will be downloaded on the first run of pytest. The tests use a shared set of database connections, test_databases, loaded in a fixture.
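For example, any test can use the shared connections by declaring the fixture as an argument; a minimal sketch (the test name below is illustrative):
def test_uses_shared_connections(test_databases):
    # pytest injects the shared connection collection here
    print(test_databases)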
Get a PrivateReader over the 1000-row PUMS sample using the pandas engine:
from snsql import Privacy

privacy = Privacy(epsilon=1.0, delta=0.01)
reader = test_databases.get_private_reader(privacy=privacy, engine='pandas', database='PUMS')
res = reader.execute_df('SELECT COUNT(age) AS age_count FROM PUMS.PUMS')  # DataFrame result
assert(res['age_count'][0] > 500)
The fixture will cache connections for multiple database engines and databases, if they're available on the test machine.
If you want to access the cached connections outside of pytest, you can instantiate test_databases manually:
from tests.setup.dataloader.db import DbCollection

test_databases = DbCollection()
print(test_databases)  # prints connection status for each configured engine and database
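Once instantiated, the collection can be used the same way as the pytest fixture; for example, to grab a reader interactively (a sketch reusing the names from the example above):
from snsql import Privacy

privacy = Privacy(epsilon=1.0, delta=0.01)
reader = test_databases.get_private_reader(privacy=privacy, engine='pandas', database='PUMS')
if reader is not None:
    print(reader.execute('SELECT COUNT(*) AS n FROM PUMS.PUMS'))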
You can get the same status from the command line by running python tests/check_databases.py from the root of the SQL project. Output will look like this; if there are problems connecting to any of the configured test databases, those entries will print an error and display without the "connected" status:
None@pandas://None:None
PUMS -> PUMS (connected)
PUMS_pid -> PUMS_pid (connected)
PUMS_dup -> PUMS_dup (connected)
postgres@postgres://localhost:5432
PUMS -> pums (connected)
PUMS_large -> pums (connected)
PUMS_pid -> pums_pid (connected)
PUMS_dup -> pums_pid (connected)
sa@sqlserver://tcp:10.0.0.199:1433
PUMS -> pums (connected)
PUMS_large -> pums (connected)
PUMS_pid -> pums_pid (connected)
PUMS_dup -> pums_pid (connected)
The current pre-cached datasets are:
- PUMS: The 1000-row sample of PUMS, using row_privacy. When querying, use PUMS.PUMS as the table name.
- PUMS_pid: The 1000-row sample of PUMS, with an additional column, pid, for a person primary key. Does not use row_privacy. When querying, use PUMS.PUMS as the table name.
- PUMS_dup: Oversample of PUMS_pid with ~1750 rows, in which each person appears 1-3 times. Use this table for tests where reservoir sampling needs to work. When querying, use PUMS.PUMS as the table name.
- PUMS_null: PUMS_pid with ~40% of rows having one NULL value. Use this table for tests where NULL behavior is important. When querying, use PUMS.PUMS as the table name.
- PUMS_large: A sample of PUMS with 1.2 million records, and a person primary key. When querying, use PUMS.PUMS_large as the table name.
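Note that all of the 1000-row variants are queried with the same table name; only PUMS_large differs. A quick sketch using the fixture from above:
readers = test_databases.get_private_readers(database='PUMS_dup', privacy=privacy)
for reader in readers:
    # PUMS_dup (like PUMS, PUMS_pid, and PUMS_null) is still queried as PUMS.PUMS
    res = reader.execute('SELECT COUNT(*) AS n FROM PUMS.PUMS')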
The current supported engines are:
- pandas - does not include PUMS_large, due to memory constraints
- postgres
- sqlserver
- spark - can use all datasets, but requires export TEST_SPARK=1
- bigquery
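Which of these engines the fixture actually hands back depends on what is configured on the machine. A quick way to see what a test will exercise (a sketch using the fixture from above):
readers = test_databases.get_private_readers(database='PUMS', privacy=privacy)
# typically just pandas locally; more engines appear in CI, or when the
# service engines are configured and TEST_SPARK=1 is exported
print([reader.engine for reader in readers])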
The service-hosted SQL engines, postgres and sqlserver, will be connected if the connections are configured locally. See this for more information.
The GitHub Actions CI runners for SQL Server, Postgres, Spark and BigQuery will automatically install and run these engines.
Note that the connections-unit.yaml can be configured to use different table names in the database, and the unit test fixture will automatically update test queries and metadata to use the appropriate table name. In the default local and CI builds for Postgres, we map PUMS_dup to pums.pums.pums_dup to test three-part names, and we install PUMS_null in the postgres public schema, pointing to pums (with no schema), to test public schema search path resolution. You can check the connections-unit.yaml under tests/setup/postgres to see the syntax.
Unit tests which check query outputs should run on all available engines. In normal development, this will be pandas only, but CI will include spark, postgres, sqlserver, and bigquery. Developers can have all of these installed locally (except bigquery, which requires GCP credentials), and pandas can be disabled with export SKIP_PANDAS=1. If you have a unit test that requests pandas only, make sure the test checks for None from the reader, since DbCollection will not return pandas readers when SKIP_PANDAS is set.
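For example, a pandas-only test can guard against the missing reader like this (a sketch; the query and threshold are illustrative):
reader = test_databases.get_private_reader(privacy=privacy, engine='pandas', database='PUMS_pid')
if reader is None:
    return  # pandas was disabled with SKIP_PANDAS=1, so there is nothing to test
res = reader.execute('SELECT COUNT(*) AS n FROM PUMS.PUMS')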
The following runs a test against all available engines that have PUMS_pid:
readers = test_databases.get_private_readers(database='PUMS_pid', privacy=privacy)
print(f'We have {len(readers)} engines available to test against.')
for reader in readers:
    print(f'Querying engine {reader.engine}')
    res = reader.execute('SELECT COUNT(*) AS n FROM PUMS.PUMS')
    print(res)
    if reader.engine == "spark":
        res = test_databases.to_tuples(res)
    n = res[1][0]  # res[0] is the header row
    assert(n > 500)
Note that we select from PUMS.PUMS
in the PUMS_pid
database. This is because the schemas are identical, apart from a hidden primary key. All queries that work against one should work against the other.
In the example above, note the different code for the spark engine. This is required because spark by default returns a DataFrame or RDD. It's a great idea to tune your tests to work with spark, but there may be cases where you are sure your unit tests should never run in spark. In these cases, you can simply skip the reader when cycling through:
readers = test_databases.get_private_readers(database='PUMS_pid', privacy=privacy)
for reader in [r for r in readers if r.engine != 'spark']:
    res = reader.execute('SELECT COUNT(*) AS n FROM PUMS.PUMS')
    n = res[1][0]  # res[0] is the header row
    assert(n > 100)
Some unit tests require overriding behavior specified in the metadata, such as clamp_counts or censor_dims. These can be overridden when accessing the private readers, by passing the desired values in the overrides argument:
overrides = {'censor_dims': False, 'clamp_counts': False}
readers = test_databases.get_private_readers(database='PUMS_pid', privacy=privacy, overrides=overrides)
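The readers returned will then use the overridden settings; for example (a sketch, with an illustrative query over the educ column):
for reader in readers:
    # these readers were created with censor_dims and clamp_counts disabled
    res = reader.execute('SELECT educ, COUNT(*) AS n FROM PUMS.PUMS GROUP BY educ')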