IPython magic for simple, organized, compressed and encrypted storage & transfer of files between notebooks.
The %vault magic provides a reproducible caching mechanism for exchanging variables between notebooks. The cache is compressed, persistent and safe.
Unlike the built-in %store magic, the variables are stored in plain sight, in a zipped archive, so that they can be easily accessed for manual inspection or used by other tools.
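Before the magics below can be used in a notebook, the extension has to be loaded; a minimal sketch, assuming the extension module shares the name of the PyPI package:

%load_ext data_vault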
Let's open the vault (it will be created if it does not exist yet):
%open_vault -p data/storage.zip
Generate a dummy dataset:
from pandas import DataFrame
from random import choice, randint
cities = ['London', 'Delhi', 'Tokyo', 'Lagos', 'Warsaw', 'Chongqing']
salaries = DataFrame([
    {'salary': randint(0, 100), 'city': choice(cities)}
    for i in range(10000)
])
And store it in the vault:
%vault store salaries in datasets
Stored salaries (None → 40CA7812) at Sunday, 08. Dec 2019 11:58
A short description (including a CRC32 checksum and a timestamp) is printed out by default; this can be disabled by passing --timestamp False to the %open_vault magic.
Even more information enhancing reproducibility is stored in the cell metadata.
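For reference, a CRC32 checksum in the same 8-character hexadecimal form as above can be computed with the standard library; this sketch only illustrates the kind of checksum printed and does not necessarily hash the exact bytes the package stores:

from zlib import crc32

csv_bytes = salaries.to_csv().encode()
print(f'{crc32(csv_bytes):08X}')  # prints an 8-character hexadecimal checksum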
We can now load the stored DataFrame in another (or the same) notebook:
%vault import salaries from datasets
Reduced memory usage by 87.28%, from 0.79 MB to 0.10 MB.
Imported salaries (40CA7812) at Sunday, 08. Dec 2019 12:02
Thanks to the (optional) memory optimizations we saved some RAM (87% compared to the unoptimized pd.read_csv() result).
If we already have the salaries variable defined, we can use as, just like in the Python import system:
%vault import salaries from datasets as salaries_dataset
A custom function can be used to serialize the variable on store (and to parse it on import), using the with keyword:
from pandas import read_csv
to_csv = lambda df: df.to_csv()
%vault store salaries in datasets with to_csv as salaries_csv
%vault import salaries_csv from datasets with read_csv
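The same pattern should work with other plain-text serializers; a rough sketch, assuming (as the examples above suggest) that the store function returns text and the import function receives a file-like object. The json helpers below are illustrative and not part of the package:

import json

to_json = lambda obj: json.dumps(obj)        # store: variable -> text
from_json = lambda f: json.loads(f.read())   # import: file-like object -> variable

parameters = {'cities': cities, 'n': 10000}

%vault store parameters in datasets with to_json as parameters_json
%vault import parameters_json from datasets with from_json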
A file can also be imported from the vault by name, using a custom import function:
from pandas import read_excel
%vault import 'cars.xlsx' as cars_dataset with read_excel
Syntax:
- easy to understand in plain language (avoid abbreviations when possible),
- while intuitive for Python developers,
- ...but sufficiently different so that it would not be mistaken for Python constructs
  - for example, we could have %from x import y, but this looks very much like normal Python; having %vault from x import y instead makes it sufficiently easy to distinguish
- star imports are better avoided, thus not supported
- as imports may be confusing if there is more than one
Reproducibility:
- promote good, reproducible and traceable organization of files:
  - promote storage in plain-text files and the use of DataFrames
    - pickling is initially fun, but just try to change your class definitions and load your data again (see the sketch after this list)
  - print out a short hashsum and a human-readable datetime (always in UTC),
  - while providing even more details in the cell metadata
- allow tracing of instances where the code was modified after execution
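To illustrate the pickling pitfall mentioned above (a plain Python sketch, independent of the vault API): once the class definition changes or disappears, previously pickled instances can no longer be loaded:

import pickle

class Measurement:
    def __init__(self, value):
        self.value = value

blob = pickle.dumps(Measurement(42))

# later the class gets renamed, moved or refactored away...
del Measurement

pickle.loads(blob)   # AttributeError: Can't get attribute 'Measurement'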
Security:
- think of it as a tool to minimize the damage in case of an accidental git add of data files (even if those should have been kept elsewhere and .gitignore'd in the first place),
- or as an additional layer of security for already anonymized data,
- but this tool is not aimed at facilitating the storage of highly sensitive data
- you have to set a password, or explicitly pass --secure False to get rid of a security warning
Each operation will print out the timestamp and the CRC32 short checksum of the files involved. The timestamp of the operation is reported in the UTC timezone in a human-readable format.
This can be disabled by setting -t False or --timestamp False; however, for the sake of reproducibility it is encouraged to keep this information visible in the notebook.
More precise information, including a SHA256 checksum (with a lower probability of collisions) and a full timestamp (to detect potential race conditions in file write operations), is embedded in the metadata of the cell. You can disable this by setting --metadata False.
The exact command line is also stored in the metadata, so that if you accidentally modify the code cell without re-running the code, the change can be tracked down.
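The cell metadata can be inspected directly in the notebook file; a minimal sketch using only the standard library (the notebook path is hypothetical and the exact metadata keys used by the package are not shown here):

import json

with open('analysis.ipynb') as f:        # hypothetical notebook file
    notebook = json.load(f)

for cell in notebook['cells']:
    metadata = cell.get('metadata', {})
    if metadata:
        print(metadata)                   # vault-related entries live among the cell metadata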
In order to enforce interoperability, plain-text files are used for pandas DataFrame and Series objects. Other variables are stored as pickle objects. The location of the storage archive on disk defaults to storage.zip in the current directory, and can be changed using the %open_vault magic:
%open_vault -p custom_storage.zip
The encryption is not intended as a high security mechanism, but only as an additional layer of protection for already anonymized data.
The password to encrypt the storage archive is retrieved from an environment variable, whose name is provided with encryption_variable during the setup:
%open_vault -e ENV_STORAGE_KEY
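A minimal sketch of providing the key; the variable name follows the example above, and setting it from within Python is shown only for illustration - in practice it would typically be set in the shell or by a secrets manager before starting the notebook server:

from os import environ

environ['ENV_STORAGE_KEY'] = 'a-long-random-passphrase'   # illustration only

%open_vault -e ENV_STORAGE_KEY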
Pandas DataFrames are by default memory-optimized by converting string variables to (ordered) categorical columns (the pandas equivalent of R's factors/levels). Each string column is tested for the memory improvement, and the optimization is only applied if it does reduce memory usage.
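The idea behind this optimization can be reproduced manually in pandas; a rough sketch of the kind of check involved (not the package's actual implementation):

def maybe_categorize(column):
    """Return the column as ordered categorical if that reduces memory usage."""
    as_category = column.astype('category').cat.as_ordered()
    if as_category.memory_usage(deep=True) < column.memory_usage(deep=True):
        return as_category
    return column

salaries['city'] = maybe_categorize(salaries['city'])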
The storage archive is conceptually similar to a Hierarchical Data Format (e.g. HDF5) object - it contains:
- a hierarchy of files, and
- metadata files
I believe that HDF may be the future, but this future is not here yet - numerous issues with the packages handling HDF files, as well as low performance and compression rates, prompted me to stay with the simple ZIP format for now.
ZIP is a popular file format with well-known features and limitations - files can be password-encrypted, while the file list is always accessible. This is acceptable given that the code of the project is assumed to be public, and only the files in the storage area are assumed to be encrypted, increasing security in case of unauthorized access.
As the limitations of ZIP encryption are assumed to be common knowledge, I hope that managing expectations of the level of security offered by this package will be easier.
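The fact that file names remain visible even when the archive is encrypted can be verified with the standard library:

from zipfile import ZipFile

with ZipFile('data/storage.zip') as archive:
    print(archive.namelist())   # names are listed even though the contents are encrypted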
Pre-requirements:
- Python 3.6+
- 7zip (16.02+)
Installation:
pip3 install data_vault
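A quick way to check that the 7zip prerequisite is available on the PATH (a generic sketch, not part of the package; the binary is usually called 7z, but the name may differ between distributions):

import shutil

print(shutil.which('7z') or 'please install 7zip (16.02+) / p7zip')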
Not implemented, up for discussion:
To enable high-performance subsetting, a simple, grep-like pre-filtering would be provided.
Import only the first five rows:
%vault from notebook import large_frame.rows[:5] as large_frame_head
When subsetting, the use of as is required to prevent potential confusion of the original large_frame object with its subset.
To import only rows including text "SNP":
%vault from notebook import large_frame.grep("SNP") as large_frame_snps
By design, no advanced filtering is intended at this step.
However, if your file is too big to fit into memory and you need more advanced filtering, you can provide your custom import function to the low-level load_storage_object magic:
def your_function(f):
    return f.read()  # do some fancy filtering here
%vault import 'notebook_path/variable.tsv' as variable with your_function
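For example, a custom import function that keeps only the rows containing "SNP" could look roughly like this (a sketch; whether the file object yields text or bytes is an assumption handled defensively below, and the archive path is reused from the example above):

from io import StringIO
from pandas import read_csv

def read_snp_rows(f):
    content = f.read()
    if isinstance(content, bytes):      # the file may be opened in binary mode
        content = content.decode()
    header, *rows = content.splitlines()
    kept = [header] + [row for row in rows if 'SNP' in row]
    return read_csv(StringIO('\n'.join(kept)), sep='\t')

%vault import 'notebook_path/variable.tsv' as variable_snps with read_snp_rows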