This software submission was made for the Insight Coding Challenge 2017 – Anomaly Detection. The software determines when a purchase made by a user is “out of the ordinary” by the standards of their social network's purchasing habits. The formal definition of the problem is available here.
The approach followed is outlined below.
The batch_log.json and stream_log.json files are read into memory. This is done purely as a technicality (to hold the lock on the files for as little time as possible). Besides, in a production scenario the files would be arriving from a (streaming) API.
The batch_log.json file is read first to obtain the social network parameters (the depth and the number of consecutive purchases to track). The rest of its data is used to "train" the social network. No anomaly detection is performed at this point; that happens when the other log file is evaluated.
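As an illustration, a minimal sketch of reading those parameters with Json.NET might look like the following. It assumes the header format given in the challenge description (a first JSON line carrying "D" for the depth and "T" for the number of tracked purchases); the class and member names are hypothetical, not this project's actual API.

```csharp
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;

// Illustrative only: reads the first line of batch_log.json, which (per the challenge
// description) is assumed to carry the network parameters D and T.
public static class NetworkParametersReader
{
    public static void Read(string batchLogPath, out int depth, out int trackedPurchases)
    {
        string firstLine = File.ReadLines(batchLogPath).First();
        JObject header = JObject.Parse(firstLine);

        depth = header.Value<int>("D");              // depth of the social network
        trackedPurchases = header.Value<int>("T");   // number of consecutive purchases to track
    }
}
```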
The challenge description states that the purchase and relationship data in the input log files "arrive" in order from older to newer. Code that handles cases where newer records contain older data was considered tedious to write and is therefore not included.
This is the component that builds the social network, adding persons and relationships according to the streaming data.
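A minimal sketch of such a builder, kept as an adjacency map of friendships, is shown below. The class and method names (Befriend/Unfriend) are illustrative and merely assume the relationship events described in the challenge; they are not this project's actual API.

```csharp
using System.Collections.Generic;

// Minimal sketch of maintaining the friendship graph as an adjacency map;
// class and method names are illustrative, not the actual component's API.
public class SocialNetwork
{
    private readonly Dictionary<int, HashSet<int>> _friends = new Dictionary<int, HashSet<int>>();

    public void Befriend(int id1, int id2)
    {
        GetOrAddPerson(id1).Add(id2);   // relationships are symmetric,
        GetOrAddPerson(id2).Add(id1);   // so the edge is stored in both directions
    }

    public void Unfriend(int id1, int id2)
    {
        GetOrAddPerson(id1).Remove(id2);
        GetOrAddPerson(id2).Remove(id1);
    }

    private HashSet<int> GetOrAddPerson(int id)
    {
        HashSet<int> set;
        if (!_friends.TryGetValue(id, out set))
        {
            set = new HashSet<int>();
            _friends[id] = set;
        }
        return set;
    }
}
```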
It comprises a NetworkTraverser component responsible for finding the social group of a given person up to a given depth. It uses a simple but efficient BFS algorithm (with a twist in order to track the depth).
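A minimal sketch of such a depth-limited BFS over an adjacency map is given below; the "twist" here is expanding the frontier one level per iteration, so the loop counter doubles as the depth. The names are illustrative, not the actual NetworkTraverser API.

```csharp
using System.Collections.Generic;

// Illustrative sketch of a depth-limited BFS (names are assumptions, not the actual API).
public static class SocialGroupFinder
{
    public static HashSet<int> FindSocialGroup(
        IReadOnlyDictionary<int, HashSet<int>> friends, int startId, int maxDepth)
    {
        var visited = new HashSet<int> { startId };
        var frontier = new List<int> { startId };

        // Expand one level per iteration; the loop counter tracks the depth.
        for (int depth = 0; depth < maxDepth && frontier.Count > 0; depth++)
        {
            var next = new List<int>();
            foreach (var id in frontier)
            {
                HashSet<int> neighbours;
                if (!friends.TryGetValue(id, out neighbours)) continue;

                foreach (var friendId in neighbours)
                {
                    if (visited.Add(friendId))
                        next.Add(friendId);
                }
            }
            frontier = next;
        }

        visited.Remove(startId); // a person is not a member of their own social group
        return visited;
    }
}
```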
Another important component is PurchaseMerger. Instead of the naive approach of holding all the purchase records of every individual and scanning the entire collection for the past M records of their social group, a different approach is used: every person keeps their own rolling list of the M latest purchases (the RollingList component, essentially a simple LinkedList). In this manner, i.e. by scattering the data and then gathering it back, we increase the locality of reference of the required data.
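A minimal sketch of such a rolling list, assuming it simply wraps a LinkedList and evicts the oldest entry once capacity M is exceeded, could look like this (the actual RollingList component may differ):

```csharp
using System.Collections.Generic;

// Minimal sketch of a bounded "rolling list" keeping only the M latest items
// (illustrative; the actual RollingList component may differ).
public class RollingList<T>
{
    private readonly LinkedList<T> _items = new LinkedList<T>();
    private readonly int _capacity;

    public RollingList(int capacity)
    {
        _capacity = capacity;
    }

    public void Add(T item)
    {
        _items.AddLast(item);             // newest items go to the tail
        if (_items.Count > _capacity)
            _items.RemoveFirst();         // evict the oldest item
    }

    public IEnumerable<T> Items
    {
        get { return _items; }            // oldest-to-newest order
    }
}
```

Gathering the data back then means merging at most M entries from each member of the social group, rather than scanning a person's entire purchase history.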
A simple component that calculates the statistical parameters: count, mean, variance and standard deviation of the population (it is debatable whether the population standard deviation (1/N) should be used rather than the sample standard deviation with Bessel's correction (1/(N-1))).
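A minimal sketch of computing these statistics with the population (1/N) variance is shown below; the class name and members are illustrative, not the actual component.

```csharp
using System;
using System.Linq;

// Minimal sketch of the population statistics described above (variance with 1/N);
// the class and member names are illustrative, not the actual component's API.
public class PopulationStatistics
{
    public int Count { get; private set; }
    public double Mean { get; private set; }
    public double Variance { get; private set; }
    public double StandardDeviation { get; private set; }

    public static PopulationStatistics From(double[] amounts)
    {
        var stats = new PopulationStatistics { Count = amounts.Length };
        if (stats.Count == 0) return stats;

        stats.Mean = amounts.Average();
        // Population variance: divide by N (not N-1, which would give the sample variance).
        stats.Variance = amounts.Sum(a => (a - stats.Mean) * (a - stats.Mean)) / stats.Count;
        stats.StandardDeviation = Math.Sqrt(stats.Variance);
        return stats;
    }
}
```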
This software is --I assume-- different from most software submitted for this Challenge, in the sense that it uses the brand new technology from Microsoft called .NET Core. .NET Core brings the richness, speed and maturity of the .NET Framework to Linux and Mac. Binaries compiled with .NET Core run natively on Linux, Windows and Mac. The code is written in C# 6 using Visual Studio Code (on Linux) and Visual Studio (on Windows). The library used for JSON parsing and manipulation is the excellent Json.NET from Newtonsoft. For the unit testing project, xUnit.net was chosen.
The application requires .NET Core version 1.1.2 to be installed in order to run. You can find detailed installation instructions for Linux, Windows and Mac here. On all operating systems, administrative privileges are required for the installation.
Containers are popular these days, especially Docker. Why not submit the C#-based software in one? The answer is the performance overhead incurred by running the software through another layer.
The application DLL is located in the ./src/AnomalyDetection.Publish folder and it is executed using the following command syntax:
dotnet <path_to_execution_dll_file> <path_to_batch_log_json> <path_to_stream_log_json> <path_to_output_log_json> [--verbose]
The --verbose switch displays detailed information about the anomalies as soon as they are detected.
For example, on Linux:
dotnet "./src/AnomalyDetection.Publish/AnomalyDetection.dll" "./log_input/batch_log.json" "./log_input/stream_log.json" "./log_output/flagged_purchases.json" --verbose
Due to the test suite's intolerance regarding newline characters and spaces, I was unable to provide a screenshot of a successful test message.
However, running the first test of the suite (test1) using the run.sh script yielded the same correct result.
Additionally, the 50MB sample file was used as well (this can be seen by running run-sample.sh).
The application has been tested successfully on Windows, Linux Mint 18.1 and Ubuntu Linux 16.10.
There are unit tests for every component used in the application. Feel free to (in fact, you should) look around in the ./src/AnomalyDetection.Tests folder.
It was a nice challenge and a lovely weekend :)