Please join our Telegram channel
https://t.me/joinchat/EYtM0k4eAKoPDjJhVamHpg
And please fill in this google form so we'll know who you are:
https://docs.google.com/forms/d/e/1FAIpQLSdcWkf3ciQeC_uqPS-l6qxcT-EMGdzteqxtSQ9HUvyy0L6tmw/viewform?usp=sf_link
Ever wondered what would happen if you just plugged in that seemingly innocent USB you found lying around? You’re about to find out! In this devices-gone-rogue challenge - should you choose to accept it - you will gain access to traffic data of ~100K devices, and will be tasked with finding the devices that, well, misbehave. This challenge is fully unsupervised - so put your anomaly belt on and get to it!
Ok, so finally the data is here:
Devices
Sessions
The dataset contains device data and network traffic taken from several different networks. There are two CSV files:
- Devices.csv - Data of devices and their type, manufacturer and model.
- Sessions.csv - Details of the connections between a device and its hosts, aggregated by hour (each row holds the aggregated data of several sessions).
Field | Description |
---|---|
network_id | A numeric network identifier; this file contains device information from 4 independent networks (0, 1, 2, 3) |
device_id | A numeric device identifier, unique inside the network |
type | The device type, one of ("MOBILE_PHONE", "TABLET", "PC", "WATCH", "VOIP", "PRINTER", "IP_CAMERA") |
model | The device model |
manufacturer | The device manufacturer |
operating_system_version | The device operating system version |
- Other than "network_id", "device_id" and "type", all fields are optional, and can be null.
In the snippet below there are 4 devices: 3 Apple Watches and 1 iPad. They are all from network 0.
Field | Description |
---|---|
network_id | A numeric network identifier; the data contains sessions from 4 independent networks (0, 1, 2, 3) |
device_id | A numeric device identifier; it is only unique within its specific network |
timestamp | The hour for which the sessions are aggregated |
host | The domain the device was connected to; if domain is unknown, the host IP address will be displayed (this field is hashed) |
host_ip | The IP address of the host the device was connected to (this field is hashed) |
port_dst | The destination port used in the session |
transport_protocol | The connection protocol - could be TCP or UDP |
service_device_id | The device id of the host our device was connected to |
packets_count | Total number of packets transferred during the aggregated sessions |
outbound_bytes_count | Total bytes sent during the aggregated sessions |
inbound_bytes_count | Total bytes received during the aggregated sessions |
packet_loss | Total number of packets that were lost during the aggregated sessions |
retransmit_count | Total number of packets that were retransmitted during the aggregated sessions |
latency | Network latency during the aggregated sessions |
session_count | Number of sessions this aggregated row holds |
outbound_packets_count | Total number of packets sent during the aggregated sessions |
inbound_packets_count | Total number of packets received during the aggregated sessions |
outbound_bytes_max | Max number of bytes sent in one session of the aggregated sessions |
outbound_bytes_min | Min number of bytes sent in one session of the aggregated sessions |
outbound_bytes_mean | Mean number of bytes sent during the aggregated sessions |
outbound_bytes_median | Median number of bytes sent during the aggregated sessions |
outbound_bytes_stddev | Standard deviation of bytes sent during the aggregated sessions |
inbound_bytes_max | Max number of bytes received in one session of the aggregated sessions |
inbound_bytes_min | Min number of bytes received in one session of the aggregated sessions |
inbound_bytes_mean | Mean number of bytes received during the aggregated sessions |
inbound_bytes_median | Median number of bytes received during the aggregated sessions |
inbound_bytes_stddev | Standard deviation of bytes received during the aggregated sessions |
outbound_packet_size_max | Max packet size sent in one session of the aggregated sessions |
outbound_packet_size_min | Min packet size sent in one session of the aggregated sessions |
outbound_packet_size_mean | Mean packet size sent during the aggregated sessions |
outbound_packet_size_median | Median packet size sent during the aggregated sessions |
outbound_packet_size_stddev | Standard deviation of packet size sent during the aggregated sessions |
inbound_packet_size_max | Max packet size received in one session of the aggregated sessions |
inbound_packet_size_min | Min packet size received in one session of the aggregated sessions |
inbound_packet_size_mean | Mean packet size received during the aggregated sessions |
inbound_packet_size_median | Median packet size received during the aggregated sessions |
inbound_packet_size_stddev | Standard deviation of packet size received during the aggregated sessions |
- Other than "network_id" and "device_id", all fields are optional and can be null.
- Sessions between two devices in the same network will only be displayed once, with the device_id field indicating the id of the device that initiated the session, and service_device_id indicating the id of the target device.
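The notes above have two practical consequences: a device is only identified by the pair (network_id, device_id), and most numeric fields may be missing. A minimal sketch of per-device aggregation follows. The rows are made up for illustration, and treating an empty CSV cell as null is an assumption about how the files encode missing values.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the Sessions.csv layout (only a few columns shown).
# All values are invented; an empty cell is assumed to mean null.
SAMPLE = """network_id,device_id,service_device_id,session_count,packets_count
0,7,,12,300
0,7,,3,
0,9,7,1,10
1,7,,4,8
"""

def to_int(value):
    """Parse an optional numeric cell; empty or absent cells become None."""
    return int(value) if value not in ("", None) else None

def aggregate(rows):
    """Sum session_count and packets_count per (network_id, device_id)."""
    totals = defaultdict(lambda: {"session_count": 0, "packets_count": 0})
    for row in rows:
        # device_id is only unique inside a network, so the key is a pair.
        key = (int(row["network_id"]), int(row["device_id"]))
        for field in ("session_count", "packets_count"):
            value = to_int(row[field])
            if value is not None:  # skip nulls instead of failing on them
                totals[key][field] += value
    return dict(totals)

def inbound_sessions(rows):
    """Count sessions in which a device was the target (service_device_id)."""
    counts = defaultdict(int)
    for row in rows:
        target = to_int(row["service_device_id"])
        sessions = to_int(row["session_count"])
        if target is not None and sessions is not None:
            counts[(int(row["network_id"]), target)] += sessions
    return dict(counts)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
totals = aggregate(rows)
inbound = inbound_sessions(rows)
print(totals[(0, 7)])  # {'session_count': 15, 'packets_count': 300}
print(inbound)         # {(0, 7): 1}
```

Because a device-to-device session is listed only under its initiator, the second function is one way to recover the traffic a device received from its peers. Device 7 in network 1 is kept separate from device 7 in network 0.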
In the snippet below you can find some aggregations for network 0.
Let's look at the first line - what does it tell us?
We can see that device no. 35 initiated 39 sessions with host "ecbb92...", using protocol TCP and port 49152, during the hour that started at 1565074800 (epoch time for 06.08.2019 07:00 UTC). During those sessions, a total of 260 packets were transferred.
What can you learn from the other lines?
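When reading the timestamp column it helps to convert between epoch values and readable hours. The sketch below assumes the column holds Unix epoch seconds in UTC, and uses the hour 06.08.2019 07:00 UTC as its example; the helper names are illustrative only.

```python
from datetime import datetime, timezone

def hour_to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

def utc_to_hour(moment):
    """Convert a timezone-aware datetime to whole epoch seconds."""
    return int(moment.timestamp())

start = datetime(2019, 8, 6, 7, 0, tzinfo=timezone.utc)
epoch = utc_to_hour(start)
print(epoch)                           # 1565074800
print(hour_to_utc(epoch).isoformat())  # 2019-08-06T07:00:00+00:00
```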
Let's review some connections from network number 1. The following tables display the data of 5 devices and their relevant sessions:
From these snippets we can learn there is one PC, connected to 3 IP cameras, and that one of the IP cameras is connected to another IP camera. Here is a quick network diagram showing the connections:
Please note: the scoring guidelines have been updated.
As this is an unsupervised challenge, the evaluation process will be a mix of classic "leaderboard" evaluation and in-person review of the models used.
The final score will be composed of:
- Leaderboard Evaluation (60%): Your model results will be matched against a prelabeled dataset, and AUC (Area Under the Curve) will be calculated on the test set.
- Explainability (20%): By what measures is this an anomaly? How important is it?
- Innovation (20%): Use of a non-trivial algorithm. Creation of ingenious, useful features. Any other creative ideas could also be credited with extra points. Surprise us.
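To get a feel for what the AUC part of the score rewards, here is a small sketch that assumes the usual ROC AUC definition: the probability that a randomly chosen anomalous device gets a higher score than a randomly chosen normal one, with ties counting as half. The labels below are invented, since the real labels are not published, and the exact procedure used by the leaderboard may differ.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of anomalous vs. normal scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one device of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # anomalous device ranked above normal one
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical ground truth (1 = anomalous) and model scores.
labels = [1, 0, 0, 1, 0]
scores = [0.75, 0.11, 0.40, 0.35, 0.35]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

Only the ordering of the scores matters for this metric, so a model is rewarded for ranking anomalous devices above normal ones rather than for the absolute values it outputs.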
There are many approaches to anomaly detection in network security.
In the solution example notebook you'll find a very naive approach using an Elliptic Envelope model.
The model is run on each network separately, using only 5 features to detect anomalies.
Ignoring the unsatisfying but predictable score, this solution contains the full flow - reading the data, creating a feature set, running the model and then sending the results to our leaderboard.
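The notebook itself is the reference for the Elliptic Envelope flow. As a rough illustration of the "score each network separately" idea only, the sketch below swaps in a much simpler stand-in: a robust z-score based on the median and the median absolute deviation. It is not the notebook's model, and the feature vectors are invented.

```python
from collections import defaultdict
from statistics import median

def robust_scores(values):
    """Distance from the median, in units of the median absolute deviation."""
    center = median(values)
    mad = median(abs(v - center) for v in values) or 1.0  # guard against 0
    return [abs(v - center) / mad for v in values]

def score_per_network(features):
    """Map (network_id, device_id) -> score in [0, 1], scaled per network.

    `features` maps (network_id, device_id) to a list of numeric features.
    """
    by_network = defaultdict(list)
    for (network_id, device_id), vector in features.items():
        by_network[network_id].append((device_id, vector))
    scores = {}
    for network_id, devices in by_network.items():
        # Transpose so that each column holds one feature for all devices.
        columns = list(zip(*(vector for _, vector in devices)))
        per_feature = [robust_scores(column) for column in columns]
        raw = [sum(col[i] for col in per_feature) / len(per_feature)
               for i in range(len(devices))]
        top = max(raw) or 1.0
        for (device_id, _), value in zip(devices, raw):
            scores[(network_id, device_id)] = value / top
    return scores

# Hypothetical two-feature vectors for devices in two networks.
features = {
    (0, 1): [10.0, 100.0], (0, 2): [12.0, 110.0],
    (0, 3): [11.0, 105.0], (0, 4): [90.0, 900.0],
    (1, 1): [5.0, 50.0], (1, 2): [5.0, 55.0], (1, 3): [6.0, 52.0],
}
scores = score_per_network(features)
print(scores[(0, 4)])  # 1.0
```

Scaling by the per-network maximum always gives one device in every network a score of 1.0, even in a network with no real outliers. That is a limitation of this stand-in, not a recommendation.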
Our leaderboard can be accessed here: https://leaderboard.datahack.org.il/armis
In order to automatically grade your results - and, of course, appear in the challenge leaderboard - you need to send your results using an HTTP POST request to our API endpoint.
The results - a list of all device IDs, with the anomaly score for each device - need to be sent as a JSON list, with the following structure:
[
[
"network_id - The network id int",
"device_id - The device id int",
"confidence - The anomaly score for this device - float between 0 and 1"
]
]
Please pay close attention to the order of the inner array - network id, device id, confidence.
For example:
[
[ 1, 222, 0.75 ],
[ 0, 24, 0.11 ]
]
In this example there are 2 devices: the first one is device_id 222 from network 1, and its anomaly score is 0.75 (the higher the score, the more likely it is that this device is anomalous).
The second device is device_id 24 from network 0, and it received the anomaly score of 0.11.
Detailed submission code is available in the solution example.
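As a rough sketch of the payload format only, the code below builds the JSON list in the required order and prepares a POST request without sending it. The URL is a placeholder and the JSON content type header is an assumption; the real endpoint and any required headers or credentials are in the solution example.

```python
import json
import urllib.request

def build_payload(scores):
    """Turn {(network_id, device_id): confidence} into the required list."""
    rows = []
    for (network_id, device_id), confidence in sorted(scores.items()):
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        # Order matters: network id, device id, confidence.
        rows.append([int(network_id), int(device_id), float(confidence)])
    return rows

scores = {(1, 222): 0.75, (0, 24): 0.11}
body = json.dumps(build_payload(scores))
print(body)  # [[0, 24, 0.11], [1, 222, 0.75]]

# Placeholder URL, NOT the real API endpoint.
SUBMIT_URL = "https://example.invalid/submit"
request = urllib.request.Request(
    SUBMIT_URL,
    data=body.encode("utf-8"),
    headers={"Content-Type": "application/json"},  # assumed header
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted on purpose here.
```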
After each submission, please send us your code, in a zipped file, to "[email protected]", so we can review your solution.
More details about the DataHack leaderboard can be found here
For any questions, suggestions, or other observations you might have, please do not hesitate to contact us: [email protected]