DNS TAPIR Datasets

Datasets generated and processed within the DNS TAPIR data collection and analysis components.

Events

Events are sent for analysis as they happen.

TAPIR Edge New Domain Event

These events are sent from Edge whenever a domain is new locally. Effectively, the Edge DNSTAP Minimiser (EDM) keeps a lookup table for domains it processes (excluding the Well-known set, see below).
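The local check can be pictured as a set lookup. The sketch below is only an illustration, not EDM's actual implementation: the class name `LocalSeenTable`, the in-memory `set`, and the sample domains are all assumptions.

```python
class LocalSeenTable:
    """Hypothetical in-memory stand-in for EDM's local lookup table."""

    def __init__(self, well_known):
        self.well_known = set(well_known)  # never reported as "new"
        self.seen = set()                  # domains already processed locally

    def observe(self, domain):
        """Return True if `domain` is locally new (an event would be sent)."""
        name = domain.rstrip(".").lower()
        if name in self.well_known or name in self.seen:
            return False
        self.seen.add(name)
        return True


table = LocalSeenTable(well_known={"example.com"})
print(table.observe("example.com"))      # well-known: no event
print(table.observe("new.example.net"))  # first sighting: event
print(table.observe("new.example.net"))  # already seen: no event
```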

Globally New Domain

TAPIR Core receives the "New Domain" events from TAPIR Edge and compares them to a central lookup table that summarises new domains across all edge nodes. Patterns in how these events propagate are also analysed, for instance co-occurrence across multiple nodes. The lookup is implemented using a Key-Value database containing the "Previously Seen" set (see below), where non-existence indicates a globally new domain.

This data is sent to Edge Policy Processor as part of other observations, described in "Aggregated Observations" below.

The data format for transmitting this data outside of DNS TAPIR is evolving (see "Detailed Observations" below).
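The central lookup described above can be sketched as a key-value check in which a missing key signals a globally new domain. This is a rough illustration under stated assumptions: the class `PreviouslySeen`, the node identifiers, and the use of a Python dictionary in place of a real Key-Value database are all hypothetical.

```python
from collections import defaultdict


class PreviouslySeen:
    """Hypothetical in-memory stand-in for the central key-value store."""

    def __init__(self):
        # domain -> set of edge node identifiers that reported it
        self.nodes_by_domain = defaultdict(set)

    def report(self, domain, node_id):
        """Record a "New Domain" event; return True if globally new."""
        globally_new = domain not in self.nodes_by_domain
        self.nodes_by_domain[domain].add(node_id)
        return globally_new

    def co_occurrence(self, domain):
        """Number of distinct edge nodes that have reported the domain."""
        return len(self.nodes_by_domain.get(domain, ()))


store = PreviouslySeen()
print(store.report("new.example.net", "edge-1"))  # True
print(store.report("new.example.net", "edge-2"))  # False
print(store.co_occurrence("new.example.net"))     # 2
```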

Reports

Reports are collections of data that are sent for analysis at timed intervals.

TAPIR Edge Histogram

This report is generated by Edge DNSTAP Minimiser (EDM) for the Well-known Domain data collected from DNSTAP data. The Edge local analysis, when fully implemented, will also aggregate data into this format for less well-known domains after ensuring privacy levels are met.

TAPIR Core Wellknown Domain Histogram

Reports from TAPIR Edge instances are aggregated and summarised into this report over (currently) 5-minute windows.
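As a rough sketch of this kind of windowed aggregation, the snippet below sums per-domain counts into fixed 300-second buckets. The report shape (a Unix timestamp paired with a domain-to-count mapping) and the function name `aggregate` are assumptions made for illustration, not the format used by TAPIR Core.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 300  # the (currently) 5-minute window


def aggregate(reports):
    """Sum per-domain counts from edge reports into fixed time windows.

    `reports` is an iterable of (unix_timestamp, {domain: count}) pairs;
    this shape is assumed for illustration only.
    """
    windows = defaultdict(Counter)
    for timestamp, counts in reports:
        # Floor the timestamp to the start of its window.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        windows[window_start].update(counts)
    return windows


reports = [
    (1_700_000_100, {"example.com": 3}),
    (1_700_000_250, {"example.com": 2, "example.org": 1}),
    (1_700_000_400, {"example.com": 5}),
]
for start, counts in sorted(aggregate(reports).items()):
    print(start, dict(counts))
```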

Vectors

Vectors are encoded sequences of queries, mainly for machine learning purposes. All tokenisation and encoding is done in TAPIR Edge and is a work in progress, strongly dependent on the implementation of the edge analysis engine. The following can be seen as a rough example of one such strategy:

TAPIR Edge Query Vector

Vectorised query information from Edge - Work in progress

Sets

Sets are, in essence, lookup tables. These are generally viewed as either known good or known bad baselines, but without actual ground truth the labels are more correctly described as probably good or probably bad. Note that sets are not necessarily implemented as separate tables, and can very well be a unified lookup table with multiple sets.

TAPIR Edge uses a number of sets to map incoming data into categories. The most central of these is the list of well-known domains from which to generate summarised statistics.

Wellknown Domains

Examples of such a list (or lists, for exact and wildcard) can be found here:

This dataset is generated by TAPIR Core based on inputs such as OpenPageRank as well as internal research, and is used by EDM to categorise and minimise data sent to Core.

Since some aspects of data only exist in TAPIR Edge, any categorisation based on those parameters needs to happen at the edge. The following are examples of such sets with which to tag the data:

Suspect Domains: Include these domains in histogram - Work in progress
Suspect Nameservers: Include domains served by these nameservers - Work in progress
Suspect Response Address: Include domains that respond with these addresses - Work in progress

To highlight data for Edge Analyse, data on known client addresses that exhibit suspicious behaviour can be tagged. This is helpful as metadata for generating vectors of query streams that may contain malicious domains.

Suspect Clients: Tag local data from known suspicious clients - Work in progress

Edge Policy Manager also requires some datasets for generating policy decisions, such as allow-lists for domains to be excluded from policy. These can be local or received from TAPIR Core, for example

TAPIR Core maintains a global list of seen domains, used to assert that (within the time window of that data) a domain is new.

Previously Seen

Edge DNSTAP Minimiser also requires datasets that ensure some data is never processed. For obvious reasons, these cannot come from or be handled by TAPIR Core. Some examples are:

Opt-out IP addresses: Client IP addresses to be ignored by TAPIR Edge
Internal Domains - exact: Domains to exclude
Internal Domains - wildcard: Suffixes to exclude
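The exclusion sets listed above could be applied along the following lines. This is a minimal sketch and not the Edge implementation: the function name `is_excluded`, the argument layout, and the sample addresses and domains are hypothetical, and it assumes that a wildcard entry matches both the suffix itself and every name below it.

```python
import ipaddress


def is_excluded(client_ip, qname, opt_out_nets, exact, suffixes):
    """Return True if a query should be dropped before any processing."""
    address = ipaddress.ip_address(client_ip)
    if any(address in net for net in opt_out_nets):
        return True
    name = qname.rstrip(".").lower()
    if name in exact:
        return True
    # Assumption: a wildcard entry covers the suffix and all names below it.
    return any(name == s or name.endswith("." + s) for s in suffixes)


opt_out = [ipaddress.ip_network("192.0.2.0/24")]
exact = {"intranet.example"}
suffixes = {"corp.example"}

print(is_excluded("192.0.2.7", "www.example.com", opt_out, exact, suffixes))
print(is_excluded("198.51.100.1", "host.corp.example.", opt_out, exact, suffixes))
print(is_excluded("198.51.100.1", "www.example.com", opt_out, exact, suffixes))
```

Matching on a leading dot keeps a name such as "notcorp.example" from being caught by the "corp.example" suffix.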

Transformations

Transformations are processes that transform one dataset into a different dataset, typically for the purpose of feature extraction, aggregation and/or privacy enhancement.

Pseudonymisation

Classification tags differ across the system. This is on purpose, since it became very hard to align the requirements of the different components. Edge DNSTAP Minimiser uses a 64-bit integer to represent 64 different tags on the incoming data. TAPIR Core uses a UTF-8 string where Unicode glyphs can represent a large (practically infinite) number of observable traits. Finally, Edge Policy Processor is limited to a 32-bit integer, representing 32 different "meta-tags" created by joining Core traits into observations.
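To illustrate how a fixed-width integer can carry one tag per bit, here is a small sketch using Python's `enum.IntFlag`. The tag names and bit positions are invented for the example and do not reflect the tags actually used by EDM.

```python
from enum import IntFlag


class EdgeTag(IntFlag):
    """Hypothetical tag names; up to 64 fit in a 64-bit integer."""
    WELL_KNOWN = 1 << 0
    SUSPECT_DOMAIN = 1 << 1
    SUSPECT_NAMESERVER = 1 << 2
    SUSPECT_CLIENT = 1 << 63  # highest bit of a 64-bit field


# Combine tags with bitwise OR and test membership with `in`.
tags = EdgeTag.SUSPECT_DOMAIN | EdgeTag.SUSPECT_NAMESERVER
print(int(tags))                                 # 6
print(EdgeTag.SUSPECT_DOMAIN in tags)            # True
print(EdgeTag.WELL_KNOWN in tags)                # False
print(int(EdgeTag.SUSPECT_CLIENT).bit_length())  # 64
```

The same pattern applies to the 32-bit meta-tags, with the highest usable bit at position 31.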

From EDM to TAPIR Core
From TAPIR Core to POP

TAPIR Core implements automated processes to refine the incoming data. Currently the incoming histograms are grouped into an extended histogram (see "Well-known Histogram" above) spanning a 5-minute window. The transform can be found in this Jupyter Notebook example:

DataLoad

Filters

Filters primarily serve to minimise data by removing uninteresting data or noise, such as known single-label queries and other artefacts that cannot resolve. These filters act on the collected data and are different from filters acting on the DNS query-response process.
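A filter of this kind might look like the following sketch. It is an assumption-laden illustration: the helper `is_single_label` and the sample queries are hypothetical, and it drops every single-label name rather than consulting a list of known ones.

```python
def is_single_label(qname):
    """True for names with exactly one label, e.g. "localhost."."""
    return len(qname.rstrip(".").split(".")) == 1


queries = ["localhost.", "wpad", "www.example.com."]
kept = [q for q in queries if not is_single_label(q)]
print(kept)  # ['www.example.com.']
```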

TBD

Observations

Observations are publicised domains that pass a threshold. There may be multiple thresholds signifying an estimation of reliability or risk.

Observation Events (aggregated) are events sent from Core with aggregated observations (≤ 32 tags)
Observation Events (detailed) are events with full tag information (∞ tags)

To receive Events and generate Observations, two example Jupyter Notebooks can be found below. One uses a one-shot mechanism to send continuous MQTT messages to Edge Policy Processor, and the other implements a server that prints out incoming Events - and if a domain arrives as "something.something.foo.example.com" generates an observation for "something.something.foo".
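The naming rule in the server example can be sketched as a plain suffix strip. This is one reading of the behaviour described above, not the notebook's code: the function `observation_name` and its `suffix` parameter are hypothetical, and no MQTT handling is shown.

```python
def observation_name(domain, suffix="example.com"):
    """Return the part of `domain` in front of `suffix`, or None if absent."""
    name = domain.rstrip(".")
    tail = "." + suffix
    if name.endswith(tail):
        return name[: -len(tail)]
    return None


print(observation_name("something.something.foo.example.com"))
# something.something.foo
print(observation_name("www.example.org"))
# None
```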
