Releases: VIDA-NYU/domain-discovery-d4
Alternative D4 commands
This release fixes a bug that occurred when streaming robust signature blocks from an in-memory buffer. It also introduces two alternative workflow steps for D4:
No expansion: Use the no-expand option when discovering local domains on the original dataset columns (without expansion). This option outputs a columns file in the same format as the expand-columns step, which can be used as input for the local-domains step.
$> java -jar /home/user/lib/D4.jar no-expand --help
D4 - Data-Driven Domain Discovery - Version (0.30.1)
no-expand
--eqs=<file> [default: 'compressed-term-index.txt.gz']
--verbose=<boolean> [default: true]
--columns=<file> [default: 'expanded-columns.txt.gz']
Whole column as domain: Instead of discovering local domains within (expanded) columns there is now an option to treat each unique (expanded) column as a local domain.
$> java -jar /home/user/lib/D4.jar columns-as-domains --help
D4 - Data-Driven Domain Discovery - Version (0.30.1)
columns-as-domains
--eqs=<file> [default: 'compressed-term-index.txt.gz']
--columns=<file> [default: 'expanded-columns.txt.gz']
--verbose=<boolean> [default: true]
--localdomains=<file> [default: 'local-domains.txt.gz']
Maintain Term Frequencies
This release introduces several major changes to the file formats as well as additional options for context signature generation and signature robustification.
Term Index and Equivalence Classes Files
D4 now maintains frequency counts for each term (equivalence class) for each column that the term (equivalence class) occurs in. For terms and equivalence classes, the list of columns is now a comma-separated list of column-id:frequency pairs.
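As a small illustration of the new column list, the sketch below parses a comma-separated list of column-id:frequency pairs into a dictionary. The function name and the sample string are hypothetical, and the sketch assumes that column identifiers and frequencies are integers. The layout of the rest of a line in the term index file is not described here and is not assumed.

```python
def parse_column_frequencies(column_list: str) -> dict:
    """Parse a string such as '0:12,4:1,17:3' into {column_id: frequency}.

    Assumption: both the column identifier and the frequency are integers.
    """
    frequencies = {}
    for pair in column_list.split(","):
        column_id, frequency = pair.split(":")
        frequencies[int(column_id)] = int(frequency)
    return frequencies


# Hypothetical column list for a single term.
example = parse_column_frequencies("0:12,4:1,17:3")
```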
Signature Files
In the robust signature files, D4 now maintains the size of each block (the number of terms in all equivalence classes in the block) as the first value of the comma-separated list. The following elements are eq-identifier:overlap pairs.
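A minimal sketch of reading one such block, assuming the block is available as a single comma-separated string and that identifiers and overlap values are integers. The function name and the sample value are hypothetical; how blocks are delimited within a signature file is not covered here.

```python
def parse_signature_block(block: str):
    """Split a block into (block_size, {eq_id: overlap}).

    Assumption: the first value is the block size and every following
    value is an 'eq-identifier:overlap' pair of integers.
    """
    values = block.split(",")
    block_size = int(values[0])
    overlaps = {}
    for pair in values[1:]:
        eq_id, overlap = pair.split(":")
        overlaps[int(eq_id)] = int(overlap)
    return block_size, overlaps


# Hypothetical block: 25 terms in total, spread over two equivalence classes.
size, overlaps = parse_signature_block("25,3:10,8:7")
```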
Robust Signatures
D4 contains a new similarity measure for equivalence classes that is based on tf-idf (option --sim=TF-ICF when creating signatures).
For minor drops, D4 now also includes a new robustifier (--robustifier=IGNORE-LAST) that ignores the last block (instead of the largest block as LIBERAL does).
Ignore Minor Drop
This release introduces a new feature for robust signature generation and some changes to data preparation steps.
Ignore Minor Drops
The signature generation step now includes the --ignoreMinorDrops option. A minor drop is defined as a steepest drop where the delta between the boundary terms of the drop is smaller than the delta between the terms in the block to the left of the drop. The aim is to avoid steepest drops that are more likely to occur by chance in areas of low similarity. If the difference of similarities within a block is greater than the next steepest drop, all remaining elements of the context signature are placed in a final block. This block may then be pruned by the LIBERAL trimmer (if it is used).
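One possible reading of this rule is sketched below. It is not the D4 implementation: the function name, the repeated search for the steepest drop in the remaining elements, and the sample similarities are assumptions made for illustration. The input is a list of similarity values sorted in decreasing order.

```python
def split_into_blocks(similarities, ignore_minor_drops=True):
    """Split a descending list of similarities into blocks at steepest drops.

    Assumption: the steepest drop is searched repeatedly in the part of the
    list that has not yet been assigned to a block.
    """
    blocks = []
    start = 0
    n = len(similarities)
    while start < n:
        if n - start == 1:
            # A single remaining element forms its own block.
            blocks.append(similarities[start:])
            break
        # Index of the last element before the steepest drop.
        drop_at = max(
            range(start, n - 1),
            key=lambda i: similarities[i] - similarities[i + 1],
        )
        drop_delta = similarities[drop_at] - similarities[drop_at + 1]
        block_delta = similarities[start] - similarities[drop_at]
        if ignore_minor_drops and drop_delta < block_delta:
            # Minor drop: put all remaining elements into a final block.
            blocks.append(similarities[start:])
            break
        blocks.append(similarities[start:drop_at + 1])
        start = drop_at + 1
    return blocks


# Hypothetical similarities (chosen to be exactly representable as floats).
sims = [1.0, 0.875, 0.5, 0.375, 0.25, 0.0625]
with_option = split_into_blocks(sims, ignore_minor_drops=True)
without_option = split_into_blocks(sims, ignore_minor_drops=False)
```

In this example the second steepest drop (0.1875) is smaller than the spread inside the block to its left (0.25), so with the option enabled the last four values end up in one final block instead of being split again.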
Changes to Data Preparation Steps
Term Index Generator
Term index generation is now multi-threaded. Individual threads parse different columns and add the terms to a common term index generator. The term index buffer is still being written to disk (in a blocking step) and all files are merged at the end (by a single thread).
Column Generation
Basic transformation of terms (trimming, removing consecutive white spaces, upper case) is now performed as part of the column generation step (and not in the term index generator).
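The three transformations can be illustrated with a short sketch. The function name is hypothetical, and the sketch assumes that a run of consecutive white space characters is collapsed into a single space; the exact rules applied by D4 may differ.

```python
import re


def normalize_term(term: str) -> str:
    """Trim, collapse runs of white space, and convert to upper case."""
    trimmed = term.strip()
    # Assumption: any run of white space becomes a single space.
    collapsed = re.sub(r"\s+", " ", trimmed)
    return collapsed.upper()


normalized = normalize_term("  New   York\tCity ")
```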
Signature Trimmer and Robust Blocks Generator
This release introduces several major changes.
Distinguish between signature trimmer and robustification
We now distinguish between robustifying a context signature and trimming a robust signature. The robustifying step takes a context signature and returns a list of signature blocks. Each block contains a list of equivalence class identifiers. Signature blocks are generated using the steepest drop. We support two strategies for pruning signature blocks (into a robust set of signature blocks):
- LIBERAL: As before, all signature blocks starting with the largest block (unless it is the first block) are discarded.
- COLSUPP: Discard all blocks starting from the first block where not all equivalence classes co-occur in at least one column.
The motivation behind column support (COLSUPP) is that signature blocks are intended to represent subsets of semantic domains that a term belongs to. We can therefore assume that all the terms in the domain subset occur in at least one column together. Note that this is a very strong assumption since there are domains where some terms never co-occur in a column (e.g., last names). We can still discover these domains, however, when we merge local domains into strong domains.
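The column support check can be sketched as follows. This is an illustration under stated assumptions, not the D4 code: blocks are given as non-empty lists of equivalence class identifiers, and `columns_of` is a hypothetical mapping from each identifier to the set of columns the equivalence class occurs in.

```python
def prune_by_column_support(blocks, columns_of):
    """Keep the leading blocks whose members all share at least one column.

    Assumption: every block is non-empty and every identifier in a block
    has an entry in columns_of.
    """
    kept = []
    for block in blocks:
        shared = set.intersection(*(set(columns_of[eq]) for eq in block))
        if not shared:
            # First block without a common column: discard it and the rest.
            break
        kept.append(block)
    return kept


# Hypothetical equivalence classes and the columns they occur in.
columns_of = {1: {10, 11}, 2: {11, 12}, 3: {12}, 4: {10}, 5: {10, 13}}
blocks = [[1, 2], [2, 3], [3, 4], [4, 5]]
robust_blocks = prune_by_column_support(blocks, columns_of)
```

Note that the last block `[4, 5]` has column support on its own but is still discarded, because pruning stops at the first block that fails the check.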
Signature trimmers are used to prune robust signature blocks into robust signatures during column expansion and local domain discovery. The supported trimmers are as before (CONSERVATIVE, CENTRIST, LIBERAL).
Ignore Minor Drops
We added an option to further reduce the number of robust signature blocks in areas of low similarity by introducing the --ignoreMinorDrop flag for signature generation. A minor drop occurs if the delta of the drop is smaller than the delta within the preceding block. The motivation is that we want to avoid steepest drops in areas of low variability.
Merge Similar Equivalence Classes
This release contains a tool that merges equivalence classes whose sets of columns are similar (above a given threshold). With high similarity thresholds, the merged equivalence classes are likely to belong to the same domain, so merging further reduces the overall number of equivalence classes while it should not have a negative impact on the quality (precision and recall) of the discovered domains.
The current implementation is a first, naive implementation that builds a graph where nodes are equivalence classes and there is an undirected edge between two nodes if their similarity is above the threshold. Connected components in this graph are then merged into single equivalence classes.
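The graph-based merge can be sketched with a union-find structure, which finds connected components without building the graph explicitly. The function names are hypothetical, and the use of Jaccard similarity over column sets is an assumption; the similarity measure used by the tool is not stated here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, for illustration)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar(columns_of, threshold):
    """Group equivalence classes into connected components.

    Two classes are connected if the similarity of their column sets is
    strictly above the threshold.
    """
    parent = {eq: eq for eq in columns_of}

    def find(eq):
        # Follow parent links to the component root, compressing the path.
        while parent[eq] != eq:
            parent[eq] = parent[parent[eq]]
            eq = parent[eq]
        return eq

    for a, b in combinations(columns_of, 2):
        if jaccard(columns_of[a], columns_of[b]) > threshold:
            parent[find(a)] = find(b)

    components = {}
    for eq in columns_of:
        components.setdefault(find(eq), set()).add(eq)
    return sorted(sorted(members) for members in components.values())


# Hypothetical equivalence classes and their column sets.
columns_of = {1: {1, 2, 3, 4}, 2: {1, 2, 3, 4, 5}, 3: {2, 3, 4, 5}, 4: {9}}
merged = merge_similar(columns_of, threshold=0.7)
```

In this example classes 1 and 3 have a direct similarity of only 0.6, yet they end up in the same component because both are similar to class 2. This transitive effect is a property of merging connected components.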
Other Changes
There are additional minor changes included in this release:
- Remove option for fixed threshold when generating signature blocks. This is part of a refactoring of the steepest drop finder to not only return the index of the steepest drop but also the delta of the drop.
- Introduce base classes for signature block sketches. The idea behind signature block sketches is to reduce the size of signature blocks by only maintaining a sketch (e.g., a random sample) of elements in the block.
- Re-introduce tools for result evaluation when given ground truth domains.
New Strong Domains
Implements a new strong-domain discovery step (#2):
- Cluster local domains that support each other. Each cluster forms a strong domain
- For each strong domain rank terms based on the number of columns (in the strong domain) that they occur in
- Use steepest drop to group terms based on their weights
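The ranking and grouping steps above can be sketched as follows. This is an illustration only: it assumes a strong domain is available as a hypothetical mapping from column identifiers to the terms of the domain found in that column, that the weight of a term is its column count, and that terms are split once at the single steepest drop. The clustering of local domains is not shown.

```python
from collections import Counter


def rank_terms(columns):
    """Rank terms by the number of columns they occur in (descending)."""
    counts = Counter(term for terms in columns.values() for term in set(terms))
    # Ties are broken alphabetically to make the order deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def split_at_steepest_drop(ranked):
    """Split ranked (term, weight) pairs into two groups at the steepest drop."""
    if len(ranked) < 2:
        return [ranked]
    drop_at = max(
        range(len(ranked) - 1),
        key=lambda i: ranked[i][1] - ranked[i + 1][1],
    )
    return [ranked[:drop_at + 1], ranked[drop_at + 1:]]


# Hypothetical strong domain made up of three columns.
columns = {
    "c1": {"red", "green", "blue"},
    "c2": {"red", "green"},
    "c3": {"red", "green", "teal"},
}
ranked = rank_terms(columns)
groups = split_at_steepest_drop(ranked)
```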
VLDB 2020
This release contains the latest version of the D4 code at the time of publishing the D4 paper in VLDB 2020.
Initial release
0.1 Add information about datasets