To empower DS2 customers and add value to the data delivered to customer object stores (AWS S3, Azure Blob Storage), this package provides a business-value-driven SDK that customers can run on a serverless function (AWS Lambda, Azure Functions) to derive metrics and deliver them to various destinations such as CloudWatch, SNS, DynamoDB, and CosmosDB.
Built With
- python3.8
- pandas
- httpagentparser
- Features
- Input
- Sample Output
- How to clone the GIT repo
- Code Structure
- Configuration Files
- Configuration File Setup
- Script Usage
- Testing with an Input file locally
- How to setup in AWS
- How to setup in Azure
- Derive metrics from DS2 logs uploaded to AWS S3 or Azure Blob Storage
- Supported DS2 formats
- JSON
- STRUCTURED
Reads Structured (CSV) or JSON format input files produced by DataStream 2.
Generates the following aggregated output of the selected fields per file:
[
{
"start_timestamp": 1606768500,
"bytes_max": 3241.0,
"bytes_sum": 97230.0,
"bytes_count": 30.0,
"objsize_min": 10.0,
"objsize_max": 10.0,
"objsize_sum": 300.0,
"objsize_count": 30.0,
"uncompressedsize_sum": 3000.0,
"transfertimemsec_sum": 60.0,
"totalbytes_min": 3241.0,
"totalbytes_max": 3241.0,
"totalbytes_sum": 97230.0,
"totalbytes_mean": 3241.0,
"tlsoverheadtimemsec_min": 0.0,
"tlsoverheadtimemsec_max": 0.0,
"total_hits": 42,
"hits_2xx": 25,
"hits_3xx": 1,
"hits_4xx": 1,
"hits_5xx": 2,
"traffic_volume": 97230,
"cache_hit": 24,
"cache_miss": 6,
"offload_rate": 80.0,
"origin_response_time": 0,
"os": {
"Windows": 30
},
"browser": {
"Chrome": 30
},
"platform": {
"Windows": 30
}
}
]
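The package handles input parsing internally; purely as an illustration of the two supported formats, here is a minimal pandas sketch of loading such files. The file names, and the assumption that the CSV variant carries a header row, are hypothetical:

```python
import pandas as pd

# JSON format: typically one JSON object per line, gzip-compressed (assumption)
df_json = pd.read_json("sample-input/ds2-logs.json.gz", lines=True, compression="gzip")

# STRUCTURED format: CSV, gzip-compressed (assuming a header row; real files may
# need explicit column names)
df_csv = pd.read_csv("sample-input/ds2-logs.csv.gz", compression="gzip")

# A couple of the aggregates shown in the sample output above (column name assumed)
print(df_csv["bytes"].agg(["max", "sum", "count"]))
```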
Clone the repo,
git clone <repo url>
Reference - https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-clone
File/Module | Description
---|---
run_aggregations.py | Main module that is invoked to aggregate the input data file
/aggregation_modules | Modules for data aggregation
/cloud_modules_* | Utilities to interact with the respective cloud services (AWS or Azure)
/configs | Contains sample configuration files
/frontend_modules/provision_ui | UI framework using Django that helps to create the provision.json file containing the list of selected metrics for aggregation
/tools | Contains standalone tools that can be used to set up other cloud services that help in analysing DataStream data
After cloning, cd to the repository directory and start the pydoc server to get more details on the code base.
python -m pydoc -b .
The following input configuration files are used by this package.

all_datastream2_fields.json

- This JSON file consists of all the dataset fields available in DataStream 2.
Example
{ [...] "2003": { "name": "objSize", "cname": "Object size", "dtype": "bigint", "agg": [ "min", "max", "sum" ] }, [...] }
- This file contains the following details:
  - field id - the field id (say, "2003") corresponds to the datasetFieldId in the stream.json file.
  - "name" - the field name.
  - "dtype" - the data type of the field.
  - "cname" - the field description.
  - "agg" - the list of aggregate functions that can be supported by this field, provided the field is selected in the stream.json file. Removing a function from the "agg" list disables it for that field. Reference: Pandas > DataFrame > API Reference.
- The following functions are currently supported:
  - min - Return the minimum of the values over the requested axis.
  - max - Return the maximum of the values over the requested axis.
  - sum - Return the sum of the values over the requested axis.
  - count - Count non-NA cells for each column or row.
  - mean - Return the mean of the values over the requested axis.
  - median - Return the median of the values over the requested axis.
  - var - Return unbiased variance over the requested axis.
  - any - Returns False unless there is at least one element within a series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).
  - unique_counts - Returns JSON containing counts of unique rows in the DataFrame. For example, for column country it returns "country": { "US": 42 },
- Sample file is stored in: configs/all_datastream2_fields.json
- This is a common file and is updated only when new fields are added to DataStream 2.
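To make the role of the "agg" list concrete, the sketch below (not the package's actual code) shows how those entries map onto pandas reductions, with unique_counts approximated via value_counts. The lowercase column names and the sample data are assumptions:

```python
import pandas as pd

# Hypothetical subset of all_datastream2_fields.json entries, keyed by lowercase field name
field_defs = {
    "objsize": {"dtype": "bigint", "agg": ["min", "max", "sum"]},
    "country": {"dtype": "string", "agg": ["unique_counts"]},
}

df = pd.DataFrame({"objsize": [10, 10, 10], "country": ["US", "US", "IN"]})

result = {}
for column, spec in field_defs.items():
    for func in spec["agg"]:
        if func == "unique_counts":
            # Counts of unique values, e.g. "country": {"US": 2, "IN": 1}
            result[column] = df[column].value_counts().to_dict()
        else:
            # min/max/sum/count/mean/median/var/any map to pandas reductions
            result[f"{column}_{func}"] = df[column].agg(func)

print(result)
```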
all_custom_functions.json

- This JSON file contains the list of all the available custom functions that can be selected to aggregate the data.
Example
{ [...] "get_status_code_level_hit_counts": { "required-fields": [ "statuscode" ], "description": "Show Stat Count of HTTP requests" }, [...] }
- This file contains the following details:
  - function name - unique name for the function.
  - "description" - short description of this function.
  - "required-fields" - dataset field names (in lowercase) from all_datastream2_fields.json that are required to derive this function.
- Sample file is stored in: configs/all_custom_functions.json
- This is a common file and is updated only when new functions are added.
- Following is the list of currently available custom functions and their recommended memory:

Custom function | Required DS2 fields | Recommended Memory | Sample Output
---|---|---|---
get_total_hits | | > 512 MB | "total_hits": 42,
get_traffic_volume | totalbytes | > 512 MB | "traffic_volume": 145964,
get_status_code_level_hit_counts | statuscode | > 512 MB | "hits_2xx": 42, "hits_3xx": 0, "hits_4xx": 0, "hits_5xx": 0,
get_cachestatus | cachestatus | > 512 MB | "cache_hit": 42, "cache_miss": 0,
get_offload_rate | cachestatus | > 512 MB | "offload_rate": 10.0,
get_origin_response_time | cachestatus, cacherefreshsrc, turnaroundtimemsec | > 512 MB | "origin_response_time": 10,
get_user_agent_details | ua | > 1024 MB | "os": { "Windows": 30 }, "browser": { "Chrome": 30 }, "platform": { "Windows": 30 },
get_unique_visitor (Azure only) | ua, cliip | > 2048 MB | HTTP response code 200 OK. Example: { "2023-02-22": 11, "2023-02-21": 6 }
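As a rough illustration of how a custom function derives its metric from the required DS2 fields, here is a hypothetical sketch in the spirit of get_status_code_level_hit_counts and get_offload_rate. It is not the packaged implementation, and the cachestatus encoding of 1 = hit is an assumption:

```python
import pandas as pd

def status_code_level_hit_counts(df: pd.DataFrame) -> dict:
    """Bucket HTTP status codes into 2xx/3xx/4xx/5xx hit counts (requires 'statuscode')."""
    levels = df["statuscode"].astype(int) // 100
    return {f"hits_{lvl}xx": int((levels == lvl).sum()) for lvl in (2, 3, 4, 5)}

def offload_rate(df: pd.DataFrame) -> dict:
    """Percentage of requests served from cache (requires 'cachestatus'; 1 = hit assumed)."""
    hits = int((df["cachestatus"] == 1).sum())
    return {"offload_rate": round(100.0 * hits / len(df), 1)}

# Tiny fabricated example frame, only to show the shape of the output
df = pd.DataFrame({"statuscode": [200, 200, 404, 301, 503], "cachestatus": [1, 1, 0, 1, 0]})
print({**status_code_level_hit_counts(df), **offload_rate(df)})
```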
stream.json

- This is a JSON file containing the stream-specific details.
- i.e. this file is used to understand the fields configured for this stream.
- This can be pulled from the portal using the steps mentioned here.
- Sample file is stored in: configs/stream.json
- This needs to be updated with the stream-specific file.
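The datasetFieldId values in stream.json are what tie a stream to the field catalogue above. Below is a minimal lookup sketch; the recursive search is used because the exact location of datasetFieldId within the stream payload is not assumed here:

```python
import json

# Load the common field catalogue and the stream-specific configuration
with open("configs/all_datastream2_fields.json") as f:
    all_fields = json.load(f)

with open("configs/stream.json") as f:
    stream = json.load(f)

def collect_field_ids(node):
    """Yield every datasetFieldId found anywhere in the stream payload."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "datasetFieldId":
                yield str(value)
            else:
                yield from collect_field_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_field_ids(item)

# Resolve each configured field id to its definition in the catalogue
for field_id in collect_field_ids(stream):
    print(field_id, all_fields.get(field_id, {}).get("name"))
```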
provision.json

- This is a JSON file containing the subset of custom functions that are selected for this stream.
Example
{ "aggregation-interval": 300, "custom-functions": [ "get_status_code_level_hit_counts", "get_traffic_volume", [...] ] [...] "bytes": [ "max", "sum" ], [...] "city": [ "unique_counts" ], [...] }
- The selected custom functions and aggregate functions listed here are triggered to generate output for the input files.
- "aggregation-interval" specifies the time in seconds over which to aggregate the data, based on the Request Time. Setting this to -1 disables time-based aggregation.
- Sample file is stored in: configs/provision.json
- This needs to be updated with the stream-specific file.
- This file can be manually edited or generated using the steps mentioned here.
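To illustrate what "aggregation-interval" controls, here is a small sketch of time-based bucketing. The reqtimesec column name is an assumption for the Request Time field:

```python
import pandas as pd

AGGREGATION_INTERVAL = 300  # seconds, from provision.json

df = pd.DataFrame({
    "reqtimesec": [1606768501, 1606768512, 1606768907],  # epoch seconds (assumed field name)
    "bytes": [3241, 1200, 900],
})

if AGGREGATION_INTERVAL > 0:
    # Floor each request time to the start of its interval, then aggregate per bucket
    df["start_timestamp"] = (df["reqtimesec"] // AGGREGATION_INTERVAL) * AGGREGATION_INTERVAL
    print(df.groupby("start_timestamp")["bytes"].agg(["max", "sum", "count"]))
else:
    # aggregation-interval of -1: aggregate the whole file as a single bucket
    print(df["bytes"].agg(["max", "sum", "count"]))
```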
Follow this document to create the config files customised for your stream: Config Files Setup Reference
% python run_aggregations.py --help
usage: [...]/run_aggregations.py [-h] [--loglevel {critical,error,warn,info,debug}]
[--input INPUT]
Helps aggregate data
options:
-h, --help show this help message and exit
--loglevel {critical,error,warn,info,debug}
logging level.
(default: info)
--input INPUT specify the input file to aggregate.
(default: /[...]/sample-input/test-data-custom.gz)
Before setting up on the cloud, this can be tested locally to ensure the aggregated output is as expected.
- Ensure the configs in the configs/ directory are updated as needed.
- A custom input file can be specified via the --input option.
- Ensure the output generated is as expected.
Example
% python run_aggregations.py --input sample-input/test-data-custom.gz
Result...
[
{
"start_timestamp": 1606768500,
"bytes_max": 3241.0,
"bytes_sum": 97230.0,
"bytes_count": 30.0,
"objsize_min": 10.0,
"objsize_max": 10.0,
"objsize_sum": 300.0,
"objsize_count": 30.0,
"uncompressedsize_sum": 3000.0,
"transfertimemsec_sum": 60.0,
"totalbytes_min": 3241.0,
"totalbytes_max": 3241.0,
"totalbytes_sum": 97230.0,
"totalbytes_mean": 3241.0,
"tlsoverheadtimemsec_min": 0.0,
"tlsoverheadtimemsec_max": 0.0,
"tlsoverheadtimemsec_sum": 0.0,
"total_hits": 30,
"hits_2xx": 25,
"hits_3xx": 1,
"hits_4xx": 1,
"hits_5xx": 2,
"traffic_volume": 97230,
"cache_hit": 24,
"cache_miss": 6,
"offload_rate": 80.0,
"origin_response_time": 0,
"os": {
"Windows": 30
},
"browser": {
"Chrome": 30
},
"platform": {
"Windows": 30
}
}
]
Follow this document for detailed setup instructions on AWS: AWS Setup Reference
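For orientation, the AWS wiring generally amounts to a Lambda triggered by the S3 upload event that runs the aggregation and publishes the derived metrics, for example to CloudWatch. The handler below is only a hypothetical sketch; run_aggregation is a placeholder, not the package's documented entry point:

```python
import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

def run_aggregation(path):
    # Placeholder for invoking this package's aggregation on the downloaded file;
    # see run_aggregations.py and /aggregation_modules for the real entry points.
    return [{"total_hits": 0}]

def handler(event, context):
    """Hypothetical Lambda handler: S3 upload event -> aggregate -> CloudWatch metric."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Download the uploaded DS2 log file to Lambda's temporary storage
        local_path = "/tmp/" + key.split("/")[-1]
        s3.download_file(bucket, key, local_path)

        # Aggregate the file and publish one of the derived metrics to CloudWatch
        metrics = run_aggregation(local_path)
        cloudwatch.put_metric_data(
            Namespace="DataStream2",
            MetricData=[{"MetricName": "total_hits",
                         "Value": float(metrics[0]["total_hits"])}],
        )
```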
Follow this document for detailed setup instructions on Azure: Azure Setup Reference