A DataJoint extension that integrates hashes and other features. Requires DataJoint version 0.12.9.
Minimal examples for using DataJointPlus. Documentation to be added.
```
pip3 install datajoint-plus
```

```python
import datajoint_plus as djp

schema = djp.schema("project_schema")
```
Set the flag `enable_hashing=True` in any DataJoint user class (`djp.Lookup`, `djp.Manual`, `djp.Computed`, `djp.Imported`, `djp.Part`) to enable automatic hashing. This requires two additional properties to be defined:

- `hash_name`: (string) The attribute that will contain the hash values.
  - The `hash_name` must be added to the definition as a `varchar` type and can be added as a primary or secondary key attribute.
  - The hashing mechanism is `md5`, which has a maximum of 32 characters. Specify how many characters from [1, 32] of the hash should be stored.
    - E.g. a 12-character hash should be added to the definition as `varchar(12)`.
- `hashed_attrs`: (string, or list/tuple of strings) One or more primary and/or secondary attributes that should be hashed upon insertion.
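Conceptually, the stored hash is the hexadecimal `md5` digest of the hashed attributes, truncated to the declared `varchar` width. The sketch below illustrates that idea in plain Python; it is an assumption-laden simplification, and the exact serialization DataJointPlus uses internally may differ:

```python
import hashlib

def simple_hash(row, hashed_attrs, width=32):
    # Concatenate the values of the hashed attributes in a fixed order,
    # then keep the first `width` hex characters of the md5 digest.
    payload = ''.join(str(row[a]) for a in sorted(hashed_attrs))
    return hashlib.md5(payload.encode()).hexdigest()[:width]

h = simple_hash({'reason': 'no data'}, ['reason'], width=6)
print(h, len(h))  # a deterministic 6-character hex string, matching varchar(6)
```

The same row always produces the same truncated digest, which is what lets a table like the one below prefill its `contents` with reproducible hashes.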
The following lookup table, named `Exclusion`, will automatically add two entries under `reason` when created, `"no data"` and `"incorrect data"`, with corresponding hashes under the primary key `exclusion_hash`.
```python
@schema
class Exclusion(djp.Lookup):
    enable_hashing = True
    hash_name = 'exclusion_hash'
    hashed_attrs = 'reason'

    definition = f"""
    # reasons for excluding imported data entries
    exclusion_hash : varchar(6)
    ---
    reason : varchar(48) # reason for exclusion
    """

    contents = [
        {'reason': 'no data'},
        {'reason': 'incorrect data'}
    ]
```
Note: The table name is automatically added to the header next to the special character `~` that is used for parsing. In addition, the `hash_name` and `hashed_attrs` are also added to the header. This can be turned off, but it is useful because it allows virtual modules to parse the header and reproduce the hashing configuration, even in the absence of the code that defined the original table.
New entries can be added to this table with direct insertion. Direct insertions must use named datatypes. Supported types include:

- pandas DataFrame
- DataJoint expression
- Python dictionary (or list of dictionaries for multiple rows)
E.g.

```python
Exclusion.insert1(
    {'reason': 'requires manual review'}
)
```

or

```python
Exclusion.insert(
    [
        {'reason': 'requires manual review'},
        {'reason': 'some other reason'}
    ]
)
```
This example uses the master table as an aggregator for hashes. Each part table represents a unique method type with a pre-defined set of parameters; every row in a part table is one method with specific parameter values. If a new method type with a new set of parameters is desired, a new part table can be added for it at any time. When individual methods are added to part tables, they receive a hash that is also automatically added to the master table. For maximum flexibility, downstream tables should depend on the master table.
```python
@schema
class ImportMethod(djp.Lookup):
    hash_name = 'import_method_hash'  # we define a hash_name for the master even though the hashing happens in the parts
    hash_part_table_names = True  # True by default and need not be included, but shown for clarity. With this enabled, every method will generate a unique hash across all part tables

    definition = """
    # method for importing data
    import_method_hash : varchar(12) # method hash
    """

    class FromCSV(djp.Part):
        enable_hashing = True  # the part table does the hashing
        hash_name = 'import_method_hash'
        hashed_attrs = 'param1', 'param2'  # these values generate the hash

        definition = """
        # methods for loading data from a CSV
        -> master
        ---
        param1 : int # some parameter
        param2 : varchar(48) # some other parameter
        ts_inserted=CURRENT_TIMESTAMP : timestamp
        """

        def run(self):
            return self.fetch1()

    class FromAPI(djp.Part):
        enable_hashing = True
        hash_name = 'import_method_hash'
        hashed_attrs = 'param1', 'param2', 'param3'

        definition = """
        # methods for importing data using some api
        -> master
        ---
        param1 : varchar(12) #
        param2 : Decimal(3,2) #
        param3 : int #
        ts_inserted=CURRENT_TIMESTAMP : timestamp
        """

        def run(self):
            return self.fetch1()
```
To insert new methods, we insert into the `Part` table directly and use `insert_to_master=True`. The order of events is:

1. The hash is generated by the part table.
2. The hash is inserted into the master table.
3. The hash and params are inserted into the part table.

Importantly, these steps occur in one transaction, so if any of them fails, no insertions will occur.
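The all-or-nothing behavior can be pictured with a small stand-alone sketch. This is illustrative only, not the library's implementation: a plain list stands in for the database, and restoring a snapshot plays the role of a transaction rollback:

```python
def insert_with_master(store, part_row, make_hash, fail=False):
    # `store` is a plain list standing in for the database;
    # the snapshot lets us roll back like a transaction would.
    snapshot = list(store)
    try:
        h = make_hash(part_row)               # 1. part table generates the hash
        store.append(('master', h))           # 2. hash inserted into the master
        if fail:
            raise RuntimeError('simulated failure')
        store.append(('part', h, part_row))   # 3. hash + params into the part
    except Exception:
        store[:] = snapshot                   # rollback: no insertions survive
        raise
    return h

db = []
insert_with_master(db, {'param1': 1}, make_hash=lambda r: 'abc123')
print(len(db))  # 2 (one master row + one part row)
try:
    insert_with_master(db, {'param1': 2}, make_hash=lambda r: 'def456', fail=True)
except RuntimeError:
    pass
print(len(db))  # still 2: the failed insert left nothing behind
```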
Inserting some methods:

```python
ImportMethod.FromCSV.insert1({'param1': 1, 'param2': 'some parameter'}, insert_to_master=True)
ImportMethod.FromCSV.insert1({'param1': 32, 'param2': 'some other parameter'}, insert_to_master=True)
ImportMethod.FromAPI.insert1({'param1': 'a param', 'param2': 4.2, 'param3': 38}, insert_to_master=True)
ImportMethod.FromAPI.insert1({'param1': 'another param', 'param2': 6.9, 'param3': 99}, insert_to_master=True)
```
Now, to get to a single method we can use the master table `ImportMethod` and the `import_method_hash` as such:

```python
ImportMethod.restrict_one_part({'import_method_hash': '902421e75df6'})
ImportMethod.r1p({'import_method_hash': '902421e75df6'})  # alias for restrict_one_part
```

or

```python
ImportMethod.restrict_one_part_with_hash('902421e75df6')
ImportMethod.r1pwh('902421e75df6')  # alias for restrict_one_part_with_hash
```
By default, the restricted part table object is returned, so we can call the `run` method directly:

```python
ImportMethod.r1pwh('902421e75df6').run()
```

Here, the `run` method calls `fetch1` on the restricted table to ensure that only one row remains after the restriction. Then it returns the parameters.
Output:
```python
{'import_method_hash': '902421e75df6',
 'param1': 'alt',
 'param2': 6.9,
 'param3': 99,
 'ts_inserted': datetime.datetime(2022, 5, 10, 6, 19, 38)}
```
We can also recover the original hash with the `hash1` method by rehashing the parameters:

```python
ImportMethod.FromAPI.hash1(
    {'param1': 'alt',
     'param2': 6.9,
     'param3': 99}
)
```

Output:

```python
'902421e75df6'
```
Example to be added: Multiple hashes (master table performs hashing and part table performs hashing)
Descriptions will be expanded in the future.
- `djp.schema()` - Same function as the DataJoint `schema`.
- `djp.create_djp_module()` - DataJointPlus virtual module. Can be used to add DataJointPlus functionality to all DataJoint tables in an imported module that did not originally have DataJointPlus, or to create a virtual module with DataJointPlus functionality from an existing schema.
- `schema.load_dependencies()` - Loads (or reloads) the DataJoint networkx graph. Runs by default when the schema is instantiated for both `djp.schema` and `djp.create_djp_module`. If the graph is not loaded (it is not loaded by default in DataJoint 0.12.9), then `Table.parts()` returns an empty list even if the table has part tables.
- `schema.tables` - A DataJoint table created automatically for every schema (similar to `~log`) that logs all tables in the schema and keeps track of whether they currently exist (`schema.tables.exists`) or were deleted (`schema.tables.deleted`).
These flags must be set during table definition and are constant once a table is defined. To change these features after the table is instantiated, the best practice is to delete and remake the table.
- `enable_hashing` - (`bool`) default `False`. If `True`, `hash_name` and `hashed_attrs` must be defined.
- `hash_name` - (`str`) The name of the primary or secondary DataJoint attribute that contains the hashes.
- `hashed_attrs` - (`str` or `list`/`tuple` of `str`) The DataJoint primary and/or secondary key attributes that will be hashed upon insertion.
- `hash_group` - (`bool`) default `False`. If `True`, multiple rows inserted simultaneously are hashed together and given the same hash.
- `hash_table_name` - (`bool`) default `False`. If `True`, all hashes made in the table will also include the name of the table.
- `hash_part_table_names` - (`bool`) default `True`. Property of a master table. If `True`, enforces that all hashes made in its part tables will always include the part table name in the hash (therefore, hashes will always be unique across parts).
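For intuition, the effect of `hash_group` can be sketched in plain Python: when rows are hashed as a group, every row in the batch receives one shared digest. This is an illustrative simplification only, not the library's actual serialization:

```python
import hashlib

def _digest(payload, width):
    return hashlib.md5(payload.encode()).hexdigest()[:width]

def hash_rows(rows, width=12, hash_group=False):
    if hash_group:
        # hash the whole batch together; every row receives the same hash
        payload = ''.join(str(sorted(r.items())) for r in rows)
        return [_digest(payload, width)] * len(rows)
    # otherwise each row is hashed on its own
    return [_digest(str(sorted(r.items())), width) for r in rows]

rows = [{'a': 1}, {'a': 2}]
grouped = hash_rows(rows, hash_group=True)
separate = hash_rows(rows, hash_group=False)
print(grouped[0] == grouped[1])    # True: one hash shared by the batch
print(separate[0] == separate[1])  # False: each row hashed individually
```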
Methods and properties common to all DJP user classes (`djp.Lookup`, `djp.Manual`, `djp.Computed`, `djp.Imported`, `djp.Part`) after they are defined.
- `Table.Log` - Log record manager and logger. To create a log: `Table.Log('info', msg)`. To view logs: `Table.Log.head()` or `Table.Log.tail()`. The directory for saving logs is controlled by `djp.config['log_base_dir']` or the `ENV` variable `DJ_LOG_BASE_DIR`. The default logging level is set globally with `djp.config['loglevel']` or the `ENV` variable `DJ_LOGLEVEL`, or for specific tables with `Table.loglevel`.
- `Table.class_name` - Name of the class that generated the table. Part tables are formatted like `'Master.Part'`.
- `Table.table_id` - Unique hash for every table, based on `full_table_name`.
- `Table.get_earliest_entries()` - If the `Table` has a timestamp attribute, returns the `Table` restricted to the entry (or entries) that was (were) inserted earliest.
- `Table.get_latest_entries()` - If the `Table` has a timestamp attribute, returns the `Table` restricted to the entry (or entries) that was (were) inserted latest.
- `Table.aggr_max()` - Given an attribute name, returns the `Table` restricted to the row with the max value of that attribute.
- `Table.aggr_min()` - Given an attribute name, returns the `Table` restricted to the row with the min value of that attribute.
- `Table.aggr_nunique()` - Given an attribute name, returns the number of unique values in `Table` for that attribute.
- `Table.include_attrs()` - Returns a projection of `Table` with only the provided attributes (not guaranteed to have unique rows).
- `Table.exclude_attrs()` - Returns a projection of `Table` without the provided attributes (not guaranteed to have unique rows).
- `Table.is_master()` - `True` if `Table` is a master type.
- `Table.is_part()` - `True` if `Table` is a part table type.
- `Table.comment` - Just the user-added portion of the table header.
- `Table.hash_name` - Name of the attribute containing the hash (`None` if no `hash_name` was defined).
- `Table.hashed_attrs` - List of attributes that get hashed upon insertion (`None` if `enable_hashing` is `False`).
- `Table.hash_len` - Number of characters in the hash.
- `Table.hash_group` - `True` if `hash_group` is enabled for `Table`.
- `Table.hash_table_name` - `True` if `hash_table_name` is enabled for `Table`.
Methods and properties common to DJP master user classes (`djp.Lookup`, `djp.Manual`, `djp.Computed`, `djp.Imported`) after they are defined.
- `Table.parts()` - Returns a list of part tables (option to return full table names, `dj.FreeTable`, or class).
- `Table.has_parts()` - `True` if the master has any part tables in the graph.
- `Table.number_of_parts()` - Number of part tables found in the graph.
- `Table.restrict_parts()` - Restricts all part tables with the provided restriction.
- `Table.restrict_one_part()` or alias `Table.r1p` - Enforces that only one part table can be restricted successfully and returns the restricted part table.
- `Table.restrict_with_hash()` - Pass a hash and return `Table` restricted by the hash (`Table.hash_name` must be defined, or `hash_name` can be provided).
- `Table.restrict_part_with_hash()` - Pass a hash and return a list of part tables restricted with the hash. Tries to return the part table class by default.
- `Table.restrict_one_part_with_hash()` or alias `Table.r1pwh` - Same as `restrict_part_with_hash` but errors if more than one part table contains the hash. If successful, returns the part table.
- `Table.hash()` - Provide one or more rows and generate the same hash that `Table` would generate if those rows were inserted.
- `Table.hash1()` - Same as `Table.hash()` but enforces that only one hash is returned.
- `Table.join_parts()` - Joins all part tables according to specific methods.
- `Table.union_parts()` - Union across all part tables.
- `Table.hash_part_table_names` - `True` if `hash_part_table_names` was enabled for `Table`.
- `Table.part_table_names_with_hash()` - Returns the names of all part tables that contain an entry matching the provided hash.
- `Table.hashes_not_in_parts()` - Returns `Table` restricted to hashes that are not found in any of its part tables.
Methods and properties common to DJP part table classes (`djp.Part`) after they are defined.

- `Table.class_name_valid_id` - Returns `Table.class_name` with `'.'` replaced with `'xx'`. Useful when using `class_name` as an identifier where `'.'` characters are not allowed.
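The transformation itself is a simple string replacement. A sketch with a hypothetical helper (the real behavior lives on the part table class as a property):

```python
def to_valid_id(class_name: str) -> str:
    # Replace '.' with 'xx' so the name is usable where dots are not allowed,
    # e.g. 'ImportMethod.FromCSV' -> 'ImportMethodxxFromCSV'
    return class_name.replace('.', 'xx')

print(to_valid_id('ImportMethod.FromCSV'))  # ImportMethodxxFromCSV
```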
Documentation of special cases and considerations
In general, decimals can be hashed; however, depending on your use case they can pose a problem. Below is an example table that has one value in `hashed_attrs`, called `param`, that is mapped to `Decimal(4,2)` in SQL:
```python
from decimal import Decimal

@schema
class DecimalExample(djp.Lookup):
    enable_hashing = True
    hash_name = 'hash'
    hashed_attrs = 'param'

    definition = """
    hash : varchar(20)
    ---
    param: Decimal(4,2) # decimal type
    """
```
The first point to note is that the following inserts give two distinct hashes, even though the value of `param` is identical in SQL:

```python
DecimalExample.insert1({'param': 8.9})
DecimalExample.insert1({'param': Decimal(8.9)})
```
Secondly, because of how SQL outputs the Decimal, neither entry can reproduce its original hash; instead, both produce yet another hash:

```python
DecimalExample.hash1(DecimalExample.restrict_with_hash('01a86da7f62a5f33f613'))
DecimalExample.hash1(DecimalExample.restrict_with_hash('36bf4f8d74d683f345d2'))
```

Output:

```python
> 'db01069f5c8ad563c10f'
```
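The root cause is visible in plain Python: the three forms of "the same" value stringify differently, so anything hashing their string representations diverges (illustrative sketch):

```python
from decimal import Decimal

print(str(8.9))              # '8.9'  -- the Python float
print(str(Decimal(8.9)))     # a long binary-float expansion, not '8.9'
print(str(Decimal('8.90')))  # '8.90' -- the form SQL returns for Decimal(4,2)
```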
Therefore, if hash reproducibility is required, `float` should be considered over `Decimal`.
Example with float:

```python
@schema
class FloatExample(djp.Lookup):
    enable_hashing = True
    hash_name = 'hash'
    hashed_attrs = 'param'

    definition = """
    hash : varchar(20)
    ---
    param: float # float type
    """

# insert
FloatExample.insert1({'param': 8.9})

# recover hash
FloatExample.hash1(FloatExample.restrict_with_hash('01a86da7f62a5f33f613'))
```

Output:

```python
> '01a86da7f62a5f33f613'
```