Tests and minor fixes for "anonymize_database.anonymize_data" function #22

Merged
merged 18 commits into from
Aug 28, 2020
2907b5b
Added df_least_nan, df_duplicated_columns methods to create dataframe…
lorenz-gorini Aug 18, 2020
a300809
Changed from sklearn.preprocessing.OneHotEncoder to EncodingFunctions…
lorenz-gorini Aug 21, 2020
0d1d191
Fixed issue 12 by changing from EncodingFunctions.ONEHOT/ORDINAL clas…
lorenz-gorini Aug 21, 2020
a0bf2f0
Fixed repeated code after rebase
lorenz-gorini Aug 25, 2020
ce6be4d
Added df_generic Mock, that is used for mocking a generic Pandas Data…
lorenz-gorini Aug 20, 2020
4f0d72f
Changed from sklearn.preprocessing.OneHotEncoder to EncodingFunctions…
lorenz-gorini Aug 21, 2020
dec0ea1
Fixed issue 12 by changing from EncodingFunctions.ONEHOT/ORDINAL clas…
lorenz-gorini Aug 21, 2020
8f15a1e
In import_df_with_info_from_file function:
lorenz-gorini Aug 25, 2020
6c33926
Completed tests for import/export_df_with_info functions for DataFram…
lorenz-gorini Aug 25, 2020
80f0af0
Fixes issue #19 because now "show_columns_type" considers every value…
lorenz-gorini Aug 25, 2020
96cbaeb
Refactored according to flake8
lorenz-gorini Aug 25, 2020
93cff6d
Moved temporary_data_dir fixture to conftest.py since it is a generic…
lorenz-gorini Aug 26, 2020
82a5997
Fixes issue #21 by adding a "random_seed" argument to "anonymize_data…
lorenz-gorini Aug 26, 2020
591b0fc
Added test for "anonymize_database.anonymize_data" function.
lorenz-gorini Aug 26, 2020
1df8074
Fixed tests after rebase
lorenz-gorini Aug 27, 2020
f56c41a
Fixed according to PR comments (minor typos in docstrings)
lorenz-gorini Aug 27, 2020
37aa71c
Formatted according to flake8
lorenz-gorini Aug 27, 2020
1fd2ef8
Fixed minor typos docstrings
lorenz-gorini Aug 28, 2020
182 changes: 129 additions & 53 deletions src/pd_extras/anonymize_database.py
@@ -2,18 +2,34 @@
import os
import random
import string
from pathlib import Path
from typing import Tuple, Union

import numpy as np
import pandas as pd


def add_nonce_func(string_array):
def add_nonce_func(
string_array: Union[str, int, float, np.array]
) -> Union[str, int, float, np.array]:
"""
This function takes an array of strings passed as "string_array" and
attaches them nonces (random prefix and suffix), using Vectorization.

:param cols_values: This is a list of numpy arrays, i.e. the columns we add nonce to
:return: np.array of strings with nonces
Add random prefix and suffix to an array of strings ``string_array``

This function takes an array of strings passed as ``string_array`` and
attaches nonces (random prefix and suffix) to each string.
It can also be used in a vectorized way.
Prefix and suffix will contain 12 random characters each.

Parameters
----------
string_array: Union[str, int, float, np.array]
This can be a number, a string or a numpy array of values
(e.g. a DataFrame column)

Returns
-------
np.array
Array of strings with nonces
"""
return (
"".join(random.choice(string.hexdigits) for i in range(12))
@@ -22,17 +38,28 @@ def add_nonce_func(string_array):
)
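The nonce idea above can be sketched in isolation (standalone illustration with a hypothetical `_nonce` helper, not the module's exact code): string concatenation broadcasts over a pandas column, so one prefix/suffix pair is attached to every value in a single vectorized expression.

```python
import random
import string

import pandas as pd


def _nonce(n: int = 12) -> str:
    # n random hexadecimal characters, as in the 12-character nonces above
    return "".join(random.choice(string.hexdigits) for _ in range(n))


col = pd.Series(["alice", "bob"])
# str + Series broadcasts: every value gets the same prefix/suffix pair
with_nonces = _nonce() + col + _nonce()
```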


def add_id_owner_col(private_df, cols_to_hash):
def add_id_owner_col(
private_df: pd.DataFrame, cols_to_hash: Tuple[str]
) -> pd.DataFrame:
"""
This function uses the columns of the "private_df" database to generate an hash value
and it creates an "ID_OWNER" column with those values.
To generate hash values, we add nonces (random prefix and suffix) to the column values and then we use "sha256".
See https://medium.com/luckspark/hashing-pandas-dataframe-column-with-nonce-763a8c23a833 for more info.

:param private_df: Pandas.DataFrame with the owner's private data
:param cols_to_hash: This is a list of column names with the infos we want to hash

:return: Pandas.DataFrame similar to "private_df" with a new "ID_OWNER" column
To generate hash values, the function adds nonces (random prefix and suffix)
to the column values and then applies "sha256".
See https://medium.com/luckspark/hashing-pandas-dataframe-column-with-nonce-763a8c23a833
for more info.

Parameters
----------
private_df: pd.DataFrame
DataFrame with the owner's private data
cols_to_hash: Tuple[str]
List of column names with the info we want to hash

Returns
-------
pd.DataFrame
DataFrame similar to ``private_df`` with a new "ID_OWNER" column
"""
# Turn rows into strings to be used
rows_into_strings = np.sum(
@@ -53,29 +80,41 @@ def hash_lambda(owner_name):
return private_df
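The hashing step itself can be sketched on a single row string (assumed helper name, following the nonce + SHA-256 approach described in the linked Medium post):

```python
import hashlib
import random
import string


def hash_owner_row(row_as_string: str) -> str:
    """Wrap the row in random nonces, then hash with SHA-256."""

    def nonce() -> str:
        return "".join(random.choice(string.hexdigits) for _ in range(12))

    salted = nonce() + row_as_string + nonce()
    return hashlib.sha256(salted.encode()).hexdigest()


owner_id = hash_owner_row("ada lovelace|london")
```

Because of the random nonces, repeated calls yield different digests; fixing the `random` seed (as the PR does via `random_seed`) makes the IDs reproducible.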


def create_private_info_db(df, private_cols_to_map):
def create_private_info_db(
df: pd.DataFrame, private_cols_to_map: Tuple[str]
) -> pd.DataFrame:
"""
This function creates a Pandas.DataFrame where you will store all the owner's
private data needed to identify them.
These informations are listed in "private_cols_to_map" argument.

:param df: Pandas.DataFrame that we will anonymize
:param private_cols_to_map: This is a list of the columns that will be stored in the
private_db that will be returned, along with the new "ID_OWNER"
:return: Pandas.DataFrame with the values of the "private_cols_to_map" and their hashed value in the column "ID_OWNER"
Create a DataFrame with private data and a unique ID.

This function will store in a DataFrame all the owner's private data
contained in the columns ``private_cols_to_map`` needed to identify them.
The function will also add a unique owner ID (in the column "ID_OWNER") that
is hashed based on ``private_cols_to_map``.
In case there are multiple rows with the same private info
(e.g.: multiple data from the same customer), only one of those rows
is included in the returned DataFrame.

Parameters
----------
df: pd.DataFrame
DataFrame that we will anonymize
private_cols_to_map: Tuple[str]
Columns that will be stored in the returned private DataFrame,
along with the new "ID_OWNER" column

Returns
-------
pd.DataFrame
DataFrame with the values of the ``private_cols_to_map`` and
their hashed value in the column "ID_OWNER"
"""
# Create the private_db with the columns with private infos only
# Create the private_db with the columns with private info only
private_df = df[private_cols_to_map]

# Get unique combinations of the columns you chose
private_df = (
private_df.groupby(private_cols_to_map, as_index=False, group_keys=False)
.size()
.reset_index()
)

# Eliminate size column
private_df = private_df.drop(columns=[0])
# In case there are multiple rows with the same private info
# (e.g.: multiple data from the same customer), only one of these rows
# should be included in ``private_df``
private_df.drop_duplicates(inplace=True)

# Add the ID_OWNER column with the hash value of the row
private_df = add_id_owner_col(private_df, private_cols_to_map)
@@ -84,40 +123,77 @@ def create_private_info_db(df, private_cols_to_map):


def anonymize_data(
df, file_name, private_cols_to_remove, private_cols_to_map, dest_path
):
df: pd.DataFrame,
file_name: str,
private_cols_to_remove: Tuple[str],
private_cols_to_map: Tuple[str],
dest_path: Union[Path, str],
random_seed: int = 42,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
"""
This function will take the Pandas DataFrame "df" and it will return two files written inside the "dest_path":
1. One file (called "[file_name]_anonym") will contain the database "df" where
we replaced the columns "private_cols_to_remove" with the column "ID_OWNER"
2. Another file (called "[file_name]_private_info") will contain only the
owner infos "private_cols_to_map", which we associated an ID_OWNER to.
The ID_OWNER will be hashed using SHA256.
Separate generic from private data, leaving a unique ID as a map between them.

:param df: Pandas.DataFrame that we will anonymize
:param file_name: Name of the database we are working on (no ".csv" suffix). Used as prefix when saving csv output files.
:param private_cols_to_remove: Columns that will be removed from "_anonym" file
:param private_cols_to_map: Columns of the "_private_info" files
:param dest_path: The directory where we will save the two files

:return: [file_name]_anonym : pd.DataFrame
[file_name]_private_info : pd.DataFrame
This function will take the Pandas DataFrame ``df`` and it will return two
files written inside the ``dest_path`` directory:
1. One file (called "[file_name]_anonym") will contain the database ``df`` where
we replaced the columns ``private_cols_to_remove`` with the column "ID_OWNER"
2. Another file (called "[file_name]_private_info") will contain only the
owner info ``private_cols_to_map``, to which an "ID_OWNER" is associated.
To generate the "ID_OWNER" hash values, the function adds nonces
(random prefix and suffix) to the column values and then applies the
"SHA256" algorithm.

Parameters
----------
df: pd.DataFrame
DataFrame that we will anonymize
file_name: str
Name of the database we are working on (no ".csv" suffix). Used as
prefix when saving csv output files.
private_cols_to_remove: Tuple[str]
Columns that will be removed from "_anonym" file
private_cols_to_map: Tuple[str]
Columns of the "_private_info" files
dest_path: Union[Path, str]
The directory where we will save the two files
random_seed: int
Integer value used as "seed" for the generation of random prefixes and
suffixes in "nonces".

Returns
-------
pd.DataFrame
DataFrame containing only the private info ``private_cols_to_map``,
along with an "ID_OWNER" column that maps this private
information to the data in the other DataFrame. This DataFrame is
also saved to the "[``dest_path``] / [``file_name``]_private_info.csv" file.
pd.DataFrame
DataFrame containing the same information as ``df``, but with
the columns ``private_cols_to_remove`` replaced by the "ID_OWNER"
column.
This DataFrame is also saved to the "[``dest_path``] / [``file_name``]_anonym.csv"
file.
"""
# Fix the random seed for the generation of random prefixes and
# suffixes in "nonces", used for creating "ID_OWNER" column.
random.seed(random_seed)
# Create the "_anonym" DataFrame which will contain the anonymized database
anonym_df = df.copy()
# Fill NaN values in the columns we will map, to make DataFrame merge easier
df[private_cols_to_map] = df[private_cols_to_map].fillna("----")
# Create the "_private_info" db which will contain the map to owner's private data
private_df = create_private_info_db(df, private_cols_to_map)

# Create the "_anonym" DataFrame which will contain the anonymized database
anonym_df = pd.DataFrame(df_sani)

# Merge to insert the new ID_OWNER column
# Merge to insert the new ID_OWNER column corresponding to the
# private column value combinations
anonym_df = anonym_df.merge(private_df)

# Delete the columns with private owner's data
anonym_df = anonym_df.drop(private_cols_to_remove, axis=1)

# Write the two DataFrames to CSV files
dest_path = str(dest_path)
file_name = str(file_name)
try:
private_df.to_csv(
os.path.join(dest_path, f"{file_name}_private_info.csv"),
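End to end, the flow in ``anonymize_data`` amounts to the following self-contained sketch (toy data and simplified hashing without nonces; illustrative only, not the package API):

```python
import hashlib

import pandas as pd

df = pd.DataFrame(
    {
        "name": ["ada", "ada", "bob"],
        "city": ["rome", "rome", "oslo"],
        "amount": [10, 20, 30],
    }
)
private_cols = ["name", "city"]

# 1. One row per owner, plus a hashed ID column
private_df = df[private_cols].drop_duplicates().copy()
private_df["ID_OWNER"] = private_df.apply(
    lambda row: hashlib.sha256("|".join(row).encode()).hexdigest(), axis=1
)

# 2. Replace the private columns with ID_OWNER in the anonymized frame
anonym_df = df.merge(private_df, on=private_cols).drop(columns=private_cols)
```

The merge maps every original row (including repeated customers) to its owner ID, while the private frame keeps exactly one row per owner.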
38 changes: 38 additions & 0 deletions src/tests/conftest.py
@@ -0,0 +1,38 @@
import os
import shutil
from pathlib import Path

import pytest


@pytest.fixture(scope="module")
def temporary_data_dir(request) -> Path:
"""
Create a temporary directory for test data and delete it after test end.

The temporary directory is created in the working directory and it is
named "temp_test_data_folder".
The fixture registers a finalizer that deletes the temporary directory
where the test data was saved. Therefore, whenever tests use this
fixture (and save data inside the returned directory), the directory is
deleted at the end of the test run.

Parameters
----------
request
Built-in pytest ``request`` fixture, used here to register the finalizer.
Returns
-------
Path
Path where every temporary file used by tests is saved.
"""
temp_data_dir = Path(os.getcwd()) / "temp_test_data_folder"
try:
os.mkdir(temp_data_dir)
except FileExistsError:
pass

def remove_temp_dir_created():
shutil.rmtree(temp_data_dir)

request.addfinalizer(remove_temp_dir_created)
return temp_data_dir
40 changes: 40 additions & 0 deletions src/tests/dataframewithinfo_util.py
@@ -1,6 +1,7 @@
import itertools
import random
from datetime import date
from typing import Tuple

import pandas as pd

@@ -381,6 +382,45 @@ def df_duplicated_columns(duplicated_cols_count: int) -> pd.DataFrame:

return pd.DataFrame(df_duplicated)

@staticmethod
def df_with_private_info(private_cols: Tuple[str]):
"""
Create DataFrame with private info columns along with data columns

The returned DataFrame mock contains (len(private_cols) + 2) columns
and 5 rows. In particular, it contains the columns listed in
``private_cols`` with string values, and 2 data columns containing
integer values.
Two of these rows have the same values in the ``private_cols`` columns,
but different values in the 2 data columns (simulating a DataFrame with
multiple rows related to the same customer/patient).

Parameters
----------
private_cols: Tuple[str]
List of columns that will be created as private columns

Returns
-------
pd.DataFrame
DataFrame mock containing (len(private_cols) + 2) columns
and 5 rows. In particular, it contains the columns listed in
``private_cols`` with generic string values, and 2 data columns
containing integer values.

"""
df_private_info_dict = {}
sample_size = 5
for i, col in enumerate(private_cols):
df_private_info_dict[col] = [
f"col_{i}_value_{k}" for k in range(sample_size - 1)
]
# Add a duplicated row (it may be associated to the same customer)
df_private_info_dict[col].append(f"col_{i}_value_{sample_size-2}")
df_private_info_dict["data_col_0"] = list(range(sample_size))
df_private_info_dict["data_col_1"] = list(range(sample_size))
return pd.DataFrame(df_private_info_dict)
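As a quick check of the shape this mock produces, the same construction re-derived inline for ``private_cols = ("name", "email")`` (illustration only):

```python
import pandas as pd

private_cols = ("name", "email")
sample_size = 5
data = {}
for i, col in enumerate(private_cols):
    data[col] = [f"col_{i}_value_{k}" for k in range(sample_size - 1)]
    # Duplicate the last owner, as the mock does
    data[col].append(f"col_{i}_value_{sample_size - 2}")
data["data_col_0"] = list(range(sample_size))
data["data_col_1"] = list(range(sample_size))
df_mock = pd.DataFrame(data)
```

Rows 3 and 4 share the same private values but differ in the data columns, which is exactly the duplicated-customer case `anonymize_data` must collapse to one "ID_OWNER".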


class SeriesMock:
@staticmethod