This is the main file to be run. It takes in user arguments and dispatches them to the functions that implement each stage of the pipeline. The flow consists of:
- Initialize the environment
- Read files and generate features
- Generate scatter plots for each feature pair and assign outlier scores to points for each scatter plot
- Obtain the points-of-interest, i.e., the global outlier points
- Generate time series plots depicting the outliers
- Run the LookOut algorithm to obtain the best plots to show to the user
The user arguments are explained as follows:
- `-f` | `--datafile` : The file with data to fit on the model
- `-t` | `--trainfile` : The file with data to train the model
- `-l` | `--logfile` : The logfile ; default - log.txt
- `-df` | `--datafolder` : The folder containing the datafile and trainfile ; default - Data/
- `-lf` | `--logfolder` : The folder containing the logfiles ; default - Logs/
- `-pf` | `--plotfolder` : The folder into which to output the plots ; default - Plots/
- `-d` | `--delimiter` : The csv datafile delimiter ; default - ","
- `-b` | `--budget` : Number of plots to display ; default - 3
- `-n` | `--number` : Number of outliers to choose ; default - 10
- `-p` | `--pval` : Outlier score scaling factor ; default - 1.0
- `-s` | `--show` : Specify if all generated plots are to be stored in the plotfolder ; default - false
- `-bs` | `--baselines` : Specify if you want to run the baseline algorithms ; default - false
- `-mrg` | `--merge` : Specify if the global set of outliers will be picked from a merged ranklist ; default - false
- `-if` | `--iforests` : Specify if the global set of outliers will be picked using iForests ; default - false
- `-dict` | `--dictated` : Specify if the global set of outliers will be dictated (see feature_file.py) ; default - false
This file takes the bipartite graph between outliers and plots as input and runs the LookOut algorithm to obtain the b (budget) best plots. There are also two baselines that can be run for comparison with LookOut, namely Greedy TopK and Random Selection.
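For intuition, here is a minimal sketch of the greedy selection at the heart of LookOut; the function and variable names are illustrative, not the repo's actual API. An outlier is covered by the best score any chosen plot gives it, and each round picks the plot with the largest marginal gain in total coverage.

```python
# Minimal sketch of LookOut-style greedy plot selection (illustrative names).

def lookout_greedy(scores, budget):
    """scores[p][o] = outlier o's score on plot p; returns up to `budget` plot ids."""
    chosen, best = [], {}  # best[o] = highest score o gets from the chosen plots
    for _ in range(budget):
        candidates = [p for p in scores if p not in chosen]
        if not candidates:
            break

        def gain(p):
            # Marginal coverage added by plot p over the current best scores
            return sum(max(s - best.get(o, 0.0), 0.0)
                       for o, s in scores[p].items())

        plot = max(candidates, key=gain)
        chosen.append(plot)
        for o, s in scores[plot].items():
            best[o] = max(best.get(o, 0.0), s)
    return chosen
```

Greedy selection suits this kind of coverage objective because it is submodular: each added plot yields diminishing marginal gains.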
This file creates an iForests model. It trains the model on the training data and then scores the test data with the trained model. The scores can be improved further by also training another model on the test data itself and generating new test scores; the two sets of scores can then be interpolated for best results.
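A hedged sketch of that two-model scheme, using scikit-learn's IsolationForest as a stand-in for the file's model; the interpolation weight alpha is an assumed parameter.

```python
from sklearn.ensemble import IsolationForest

def iforest_scores(train_X, test_X, alpha=0.5):
    base = IsolationForest(random_state=0).fit(train_X)
    base_scores = -base.score_samples(test_X)    # higher = more anomalous

    # Optional refinement: fit a second forest on the test data itself...
    refit = IsolationForest(random_state=0).fit(test_X)
    refit_scores = -refit.score_samples(test_X)

    # ...and interpolate the two score vectors.
    return alpha * base_scores + (1 - alpha) * refit_scores
```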
The objective of this file is to process the outlier scores from each pair-wise scatter plot and generate an output data matrix that can be used to populate the bipartite graph. We generate two types of output matrices, scaled_matrix and normal_matrix. The scaled_matrix is equivalent to the normal_matrix, except with all the scores scaled by a factor pval with the help of a scaling function defined in helper.py.
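For illustration, a sketch of how the two matrices might relate, with a simple pval power transform standing in for the actual scaling function in helper.py:

```python
import numpy as np

def build_matrices(raw_scores, pval=1.0):
    """scaled_matrix = normal_matrix with each score passed through the scaler."""
    normal_matrix = np.asarray(raw_scores, dtype=float)
    scaled_matrix = normal_matrix ** pval  # pval = 1.0 leaves scores unchanged
    return normal_matrix, scaled_matrix
```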
This file returns the global outlier objects that serve as the points-of-interest for the user. These global outliers can be calculated via one of three methods (a sketch of the merge option follows the list):
- `merge` : The ranklists of all the points, from their individual 2-dimensional iForest scores, are merged into a single ranklist by a merging algorithm. The top n ranked points then serve as the set of outliers.
- `iforests` : The set of n outliers is chosen by running the iForests algorithm in the complete multidimensional feature space. The top n scored points are chosen.
- `dictated` : The user defines the set of outlier ids in feature_file.py. In this case the number of outliers chosen equals the length of the outlier id list.
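A minimal sketch of the merge option, assuming a Borda-style rank sum as the merging algorithm (the repo's actual merge may differ):

```python
from collections import defaultdict

def merge_ranklists(ranklists, n):
    """Assumes every ranklist ranks all points; rank 0 = most anomalous."""
    combined = defaultdict(int)
    for ranklist in ranklists:                     # one ranklist per scatter plot
        for rank, point_id in enumerate(ranklist):
            combined[point_id] += rank
    return sorted(combined, key=combined.get)[:n]  # lowest rank sum wins
```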
This file, along with test.py, controls the flow of the program. It focuses in particular on creating the environment in which to run the LookOut algorithm.
- This file takes in the intermediate outputs: rank_list, outliers, and features.
- It then calls ranklist.py to generate scaled_matrix and normal_matrix.
- Then the scaled_matrix is used to generate the bipartite graph, which is passed on to LookOut.py, which in turn produces the final list of best plots.
- The various metrics of the chosen plots are calculated using the Algorithm Helpers functions in helper.py.
- The respective focus_plots are created and saved to the Plots folder.
This file contains various helper functions that are used by several of the algorithm files. It serves a dual purpose of removing unnecessary logic from the main files and also increases reusability of code. Broadly the helper functions can be grouped as:
- Data Analysis Functions : These functions return different statistical metrics on the input data (list format), such as min, max, mean, median, and std_dev.
- Pandas Data Parse Functions : Several functions designed specifically to take a pandas data object as input. They either calculate different metrics or perform minor fixes on the data.
- Feature Handlers : These functions help manipulate and merge multiple features required for generating plots and iForest scores.
- Scaling Function : Contains the logic to scale the outlier scores of each point before comparing.
- Algorithm Helpers : These functions, get_coverage and generate_frequency_list, calculate metrics critical to the LookOut algorithm (sketched after this list).
- Initialize Environment : This function parses user arguments and checks that all the required folders and files are available at runtime.
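As referenced above, hedged sketches of the two Algorithm Helpers; the real signatures in helper.py may differ.

```python
def get_coverage(best_scores):
    """Total coverage: sum over outliers of the best edge weight any chosen
    plot gives them (best_scores maps outlier id -> best weight)."""
    return sum(best_scores.values())

def generate_frequency_list(chosen_plots, plot_outliers):
    """Count how many of the chosen plots each outlier appears in
    (plot_outliers maps plot id -> iterable of outlier ids)."""
    freq = {}
    for plot in chosen_plots:
        for outlier in plot_outliers.get(plot, ()):
            freq[outlier] = freq.get(outlier, 0) + 1
    return freq
```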
This file contains functions that help create the different types of plots that might be useful to the user. It consists of four main functions:
- generate_scatter_plots : This function iterates over a list of feature pairs, calling the scatter_plot function for each. It consolidates the scores of the points in each of the generated scatter plots in a variable rank_matrix.
- scatter_plot : Given two features, this function creates a scatter plot image that is saved to a folder, and also generates outlier scores of the points w.r.t. the two features using the iForests algorithm in 2 dimensions (see the sketch after this list).
- scatter_outliers : This function creates the final user-output focus-plots (scatter plots) that highlight outlying points in easy-to-see colors.
- time_series_plots : This function is responsible for creating the time series plots for each of the features w.r.t. the multi-day averages.
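As referenced in the scatter_plot item, an illustrative sketch for one feature pair; the argument names and defaults are assumptions, not the file's exact API.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

def scatter_plot(x, y, name, plotfolder="Plots/"):
    # Score the points with a 2-D IsolationForest, then save the figure.
    points = np.column_stack([x, y])
    scores = -IsolationForest(random_state=0).fit(points).score_samples(points)
    plt.figure()
    plt.scatter(x, y, c=scores, cmap="viridis")
    plt.colorbar(label="outlier score")
    plt.savefig(f"{plotfolder}{name}.png")
    plt.close()
    return scores
```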
This file declares four classes that are used by the LookOut algorithm to calculate the best visualization plots.
- Outlier : These class objects define an outlier in the bipartite graph, maintaining its identity and edge weight
- Plot : These class objects define a plot in the bipartite graph, maintaining its identity and global influence
- Edge : These class objects map the relation between an outlier object and a plot object
- Graph : This class defines a global object that declares the bipartite graph between outliers and plots. It consists of several member functions to update and manipulate the graph based on rules defined by the LookOut algorithm. A condensed sketch of these classes follows.
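The sketch below condenses the four classes; the field names are inferred from the descriptions above, not copied from the file.

```python
class Outlier:
    def __init__(self, oid):
        self.id = oid
        self.weight = 0.0      # best edge weight from any chosen plot

class Plot:
    def __init__(self, pid):
        self.id = pid
        self.influence = 0.0   # global influence on total coverage

class Edge:
    def __init__(self, outlier, plot, score):
        self.outlier, self.plot, self.score = outlier, plot, score

class Graph:
    def __init__(self):
        self.outliers, self.plots, self.edges = {}, {}, []

    def add_edge(self, outlier, plot, score):
        self.edges.append(Edge(outlier, plot, score))
```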
This file declares two classes that deal with data representation (sketched as dataclasses after the field lists):
- Feature : An object of this class contains all the logistical information required to handle the data of a feature we have defined. It contains the following fields:
- name : The feature name
- description : The displayed description of the feature
- type : Defines the type of the data, one of {continuous, discrete, time_series}
- log : Specifies whether the data should be represented on a log scale or a linear scale
- analytics : Basic stats of the included data such as mean, median, min, max and std_dev
- data : The actual numeric data of the feature
- ids : The identity labels of each data entry
- Outlier : Objects of this class help capture important information of some selected points of interest. Each object contains the following fields:
- id : The outlier id
- score : A calculated anomaly score for this object / outlier
- anomaly : A bool value specifying whether this point of interest actually registers as an outlier
- raw_data : The raw feature values of the outlier
- stat_data : The aggregate values of the raw data including mean and std_dev
- ratios : The ratios of the raw data w.r.t. the mean and std_dev of the aggregates
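As mentioned, a sketch of the two classes as Python dataclasses; the field names mirror the lists above, while the types are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    name: str
    description: str
    type: str            # one of {"continuous", "discrete", "time_series"}
    log: bool            # True for log-scale display
    analytics: dict = field(default_factory=dict)
    data: list = field(default_factory=list)
    ids: list = field(default_factory=list)

@dataclass
class Outlier:
    id: int
    score: float
    anomaly: bool
    raw_data: dict = field(default_factory=dict)
    stat_data: dict = field(default_factory=dict)
    ratios: dict = field(default_factory=dict)
```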
This file contains parameters that specify the display variables such as the terminal color prompts, and the styling of standard logging functions.
This file includes all the data- and feature-specific variables:
- `identity_field` : used to declare the identity object column.
- `identity_is_time` : used to specify that the identity object is time based; this is used to make the time-series graphs.
- `entry_limit` : defines the lower limit for the number of entries of an object from the identity_field. An object with fewer entries than the limit will be ignored.
- `time_series_data` : defines whether or not the data is temporal in nature, with time-based data entries.
- `timestamp_field` : declares the column name that contains the timestamp data.
- `aggregate_fields` : a list of the aggregate field columns.
- `object_fields` : a list of the object field columns.
- `norm_field` : declares the aggregate column to use as the base. All other aggregate fields will be normalized against this base column.
- `outlier_list` : used to particularly observe the characteristics of certain points of focus. This is a list of the object ids of those objects.
This file is used to create a datafile (.csv) with desired entries. The user can provide their preferences via the following options:
- `-t` | `--team` : The team id ; default - 15 (LimeStone)
- `-p` | `--product` : The product id ; default - 2 (Futures)
- `-v` | `--venue` : The venue id ; default - 23 (CME)
- `-s` | `--sid` : The symbol id ; default - 0
- `-y` | `--year` : The year of the historical file ; default - 2018
- `-m` | `--month` : The month of the historical file ; default - 5
- `-d` | `--day` : The day of the historical file ; default - 29
- `-b` | `--bucket` : The bucket size (data sampling rate in seconds) ; default - 30
- `-pr` | `--period` : The periodicity of the data ; default - 0 (1 day)
It makes a call to the Elasticsearch engine with all the above parameters and generates a corresponding file (csv), which is placed in the Data folder.
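A rough sketch of this query-and-dump step, assuming the elasticsearch-py (v8) and pandas packages; the index name, field names, and query shape are placeholders rather than the script's real ones.

```python
import pandas as pd
from elasticsearch import Elasticsearch

def dump_day(team, product, venue, sid, year, month, day,
             out="Data/data.csv"):
    es = Elasticsearch("http://localhost:9200")
    query = {"bool": {"filter": [
        {"term": {"team": team}}, {"term": {"product": product}},
        {"term": {"venue": venue}}, {"term": {"sid": sid}},
        {"term": {"date": f"{year:04d}-{month:02d}-{day:02d}"}},
    ]}}
    hits = es.search(index="hist-data", query=query, size=10000)["hits"]["hits"]
    # Flatten the hits into rows and write the csv into the Data folder.
    pd.DataFrame(h["_source"] for h in hits).to_csv(out, index=False)
```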
This file is used to create a datafile (.csv) with desired entries. The user can provide their preferences via the following options:
- `-f` | `--datafile` : The file from which to extract data ; default - ""
- `-t` | `--targetfile` : The file to export the data ; default - target.csv
- `-m` | `--mode` : Specify type of extraction {full, partial, random} ; default - full
- `-p` | `--portion` : Fraction of data to extract ; default - 1.0
- `-i` | `--include` : Specify columns to include ; default - all
- `-e` | `--exclude` : Specify columns to exclude ; default - none
It reads a csv file and generates a corresponding target file (csv), which is placed in the Data folder. It should be used in conjunction with create_files.py to further selectively modify the data.
This file is used to extract features from the read datafile. It makes use of the python pandas library and transforms the data to create feature data objects. There are four main processing steps in this file:
- Read File
  - It reads the csv file data based on a delimiter and creates a pandas dataframe object. A boolean parameter, `train`, can be passed to specify whether to read the train file or the test file.
- Transform Data
  - Apply filters to remove unwanted rows. The `entry_limit` variable declared and defined in feature_file.py is one such filter; for the current aggregated time series data we disable it with `entry_limit = 0`.
  - Calculate time series. Features like object Lifetime and Inter-Arrival Time are added as columns to the pandas dataframe. Currently, for aggregated data, we won't calculate any time series features.
- Create Features
  - Features are of four categories:
    - Identity Features : These are the objects of identity, storing their `IDs` and `COUNT`.
    - Time Series Features : These are temporal data features like `LIFETIME`, `IAT_VAR_MEAN`, `MEAN_IAT`, and `MEDIAN_IAT`. In the case of aggregated time series data we won't create time series features.
    - Aggregate Features : The features containing data that can be aggregated (summed, averaged, etc.). For an aggregate data column FIELD we calculate the features `FIELD` (summed) and `stddev_FIELD`. Note) We currently work with aggregated data, so the `stddev_FIELD` features can be obtained from the train file.
    - Object Fields : The features containing identities like object names. For an object data column FIELD we calculate the feature `FIELD` (unique count).
- Normalize Features
  - First we delete flat features (near-zero mean or std_dev).
  - Second we scale the numeric feature values using a scaling algorithm, as sketched below.
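A sketch of this normalization step, with min-max scaling standing in for the repo's scale algorithm, which may differ:

```python
import numpy as np

def normalize_features(features, eps=1e-9):
    """features maps name -> list of values; flat features are dropped."""
    kept = {}
    for name, values in features.items():
        arr = np.asarray(values, dtype=float)
        if abs(arr.mean()) < eps or arr.std() < eps:  # near-zero mean/std: flat
            continue
        kept[name] = (arr - arr.min()) / (arr.max() - arr.min())
    return kept
```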
There are two stages to running the code: Data Preparation and Run LookOut.
Call on create_files.py to get a readable csv file for a particular team, venue, product and date. Read here for argument specification.
Note) You will have to set the file path `HIST_DATA_DIR` to a suitable location to access the raw files.
python create_files.py -t <team_id> -y <year> -m <month> -d <day> [args list]
Call on extract.py to further filter out unwanted columns. Read here for argument specification.
Note) This step can be ignored if no columns have to be deleted
python extract.py -f <datafile> -t <targetfile> [args list]
e.g., let's extract certain columns from the data:
python extract.py -f <datafile> -t <targetfile> -i ['orders', 'cancels', 'trades', 'buyShares', 'sellShares']
Modify feature_file.py to the appropriate specification. Current default values should work okay.
identity_field = 'ts_epoch'
identity_is_time = True
entry_limit = 0 # The minimum entries that must exist per identity item (Set to 0 to disable)
time_series_data = False # Calculates lifetimes and IAT data (should be set to false if only one entry per identity)
timestamp_field = 'TIMESTAMP' # Used only if time_series_data is set to true
aggregate_fields = ['orders', 'cancels', 'trades', 'buyShares', 'sellShares', 'buyTradeShares', \
'sellTradeShares', 'buyNotional', 'sellNotional', 'buyTradeNotional', \
'sellTradeNotional', 'alters', 'selfTradePrevention']
object_fields = []
norm_field = 'orders'
outlier_list = []
Run the LookOut algorithm on the files generated in the Data Preparation stage with the configurations specified in feature_file.py. Read here for argument specification.
python test.py -f <filename> -t <trainfile> -b <budget> -n <number> [-mrg | -if | -dict] (-if recommended) [args list]
The -f (filename) and -t (trainfile) must be specified by the user.
The -b (budget) and -n (number of outliers) have default values of 3 and 10 respectively. They don't have to be specified, but it is good practice to declare them.
It is recommended to use the -if (iForests) for best results in calculating global outliers. Look here for more information.
The outputs of the algorithm are reflected in the Plots folder. There will be three types of files here:
- scatterplot.pdf - This contains all the scatter plots created for the pairwise features
- timeseries.pdf - This is the time series data plot for each feature from the test file w.r.t the average aggregated values in the trainfile
- LookOut-n-b-i-(train).png - These are the output focus plots from the algorithm. n - number of outliers, b - budget, and i - the i^th plot. For each value of n, b, and i there are two pngs: the first is the testfile scatter plot and the second, appended with train, is the trainfile scatter plot for that chosen feature pair.
Let us look at some LookOut-n-b-i-(train).png outputs with n = 6 and b = 3:
LookOut-6-3-0.png | LookOut-6-3-1.png | LookOut-6-3-2.png |
LookOut-6-3-0-train.png | LookOut-6-3-1-train.png | LookOut-6-3-2-train.png |