To empower DS2 customers and add value to the data delivered to customer object stores (AWS S3, Azure Blob Storage), this package provides a business-value-driven SDK that customers can run on a serverless function (AWS Lambda, Azure Functions) to derive metrics and deliver them to various destinations such as CloudWatch, SNS, DynamoDB, and CosmosDB.
Built With
- python3.8
- pandas
- httpagentparser
- Features
- Input
- Sample Output
- How to clone the GIT repo
- Code Structure
- Configuration Files
- Configuration File Setup
- Script Usage
- Testing with an Input file locally
- How to setup in AWS
- How to setup in Azure
- Derive metrics from DS2 logs uploaded to AWS S3 or Azure Blob Storage
- Supported DS2 formats
- JSON
- STRUCTURED
Reads Structured (CSV) or JSON format input files produced by DataStream 2.
Generates the following aggregated output of the selected fields per file:
[
{
"start_timestamp": 1606768500,
"bytes_max": 3241.0,
"bytes_sum": 97230.0,
"bytes_count": 30.0,
"objsize_min": 10.0,
"objsize_max": 10.0,
"objsize_sum": 300.0,
"objsize_count": 30.0,
"uncompressedsize_sum": 3000.0,
"transfertimemsec_sum": 60.0,
"totalbytes_min": 3241.0,
"totalbytes_max": 3241.0,
"totalbytes_sum": 97230.0,
"totalbytes_mean": 3241.0,
"tlsoverheadtimemsec_min": 0.0,
"tlsoverheadtimemsec_max": 0.0,
"total_hits": 42,
"hits_2xx": 25,
"hits_3xx": 1,
"hits_4xx": 1,
"hits_5xx": 2,
"traffic_volume": 97230,
"cache_hit": 24,
"cache_miss": 6,
"offload_rate": 80.0,
"origin_response_time": 0,
"os": {
"Windows": 30
},
"browser": {
"Chrome": 30
},
"platform": {
"Windows": 30
}
}
]
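The package handles input parsing internally; purely as an illustration of the two supported formats, here is a minimal pandas sketch of loading such files. The file names, and the assumption that the CSV variant carries a header row, are hypothetical:

```python
import pandas as pd

# JSON format: typically one JSON object per line, gzip-compressed (assumption)
df_json = pd.read_json("sample-input/ds2-logs.json.gz", lines=True, compression="gzip")

# STRUCTURED format: CSV, gzip-compressed (assuming a header row; real files may
# need explicit column names)
df_csv = pd.read_csv("sample-input/ds2-logs.csv.gz", compression="gzip")

# A couple of the aggregates shown in the sample output above (column name assumed)
print(df_csv["bytes"].agg(["max", "sum", "count"]))
```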
Clone the repo,
git clone <repo url>
Reference - https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-clone
File/Module | Description
---|---
run_aggregations.py | Main module that is invoked to aggregate the input data file
/aggregation_modules | Modules for data aggregation
/cloud_modules_* | Utilities to interact with the respective cloud services (AWS or Azure)
/configs | Contains sample configuration files
/frontend_modules/provision_ui | UI framework using Django that helps to create the provision.json file containing the list of selected metrics for aggregation
/tools | Contains standalone tools that can be used to set up other cloud services that help in analysing DataStream data
After cloning, cd to the repository directory and start the pydoc server to get more details on the code base.
python -m pydoc -b .
The following input configuration files are used by this package.

all_datastream2_fields.json

- This JSON file consists of all the dataset fields available in DataStream 2.
Example
{ [...] "2003": { "name": "objSize", "cname": "Object size", "dtype": "bigint", "agg": [ "min", "max", "sum" ] }, [...] }
- This file contains the following details:
  - field id - the field id (say, "2003") corresponds to the datasetFieldId in the stream.json file.
  - "name" - the field name.
  - "dtype" - the data type of the field.
  - "cname" - the field description.
  - "agg" - the list of aggregate functions that can be supported by this field, provided the field is selected in the stream.json file. Removing a function from the "agg" list disables it for that field. Reference: Pandas > DataFrame > API Reference.
- The following functions are currently supported:
  - min - Return the minimum of the values over the requested axis.
  - max - Return the maximum of the values over the requested axis.
  - sum - Return the sum of the values over the requested axis.
  - count - Count non-NA cells for each column or row.
  - mean - Return the mean of the values over the requested axis.
  - median - Return the median of the values over the requested axis.
  - var - Return unbiased variance over the requested axis.
  - any - Returns False unless there is at least one element within a series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).
  - unique_counts - Returns JSON containing counts of unique rows in the DataFrame. For example, for column country it returns "country": { "US": 42 },
- Sample file is stored in: configs/all_datastream2_fields.json
- This is a common file and is updated only when new fields are added to DataStream 2.
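To make the role of the "agg" list concrete, the sketch below (not the package's actual code) shows how those entries map onto pandas reductions, with unique_counts approximated via value_counts. The lowercase column names and the sample data are assumptions:

```python
import pandas as pd

# Hypothetical subset of all_datastream2_fields.json entries, keyed by lowercase field name
field_defs = {
    "objsize": {"dtype": "bigint", "agg": ["min", "max", "sum"]},
    "country": {"dtype": "string", "agg": ["unique_counts"]},
}

df = pd.DataFrame({"objsize": [10, 10, 10], "country": ["US", "US", "IN"]})

result = {}
for column, spec in field_defs.items():
    for func in spec["agg"]:
        if func == "unique_counts":
            # Counts of unique values, e.g. "country": {"US": 2, "IN": 1}
            result[column] = df[column].value_counts().to_dict()
        else:
            # min/max/sum/count/mean/median/var/any map to pandas reductions
            result[f"{column}_{func}"] = df[column].agg(func)

print(result)
```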
all_custom_functions.json

- This JSON file contains the list of all the available custom functions that can be selected to aggregate the data.
Example
{ [...] "get_status_code_level_hit_counts": { "required-fields": [ "statuscode" ], "description": "Show Stat Count of HTTP requests" }, [...] }
- This file contains the following details:
  - function name - unique name for the function.
  - "description" - short description of this function.
  - "required-fields" - dataset field names (in lowercase) from all_datastream2_fields.json that are required to derive this function.
- Sample file is stored in: configs/all_custom_functions.json
- This is a common file and is updated only when new functions are added.
- Following is the list of currently available custom functions and their recommended memory:

Custom function | Required DS2 fields | Recommended Memory | Sample Output
---|---|---|---
get_total_hits | | > 512 MB | "total_hits": 42,
get_traffic_volume | totalbytes | > 512 MB | "traffic_volume": 145964,
get_status_code_level_hit_counts | statuscode | > 512 MB | "hits_2xx": 42, "hits_3xx": 0, "hits_4xx": 0, "hits_5xx": 0,
get_cachestatus | cachestatus | > 512 MB | "cache_hit": 42, "cache_miss": 0,
get_offload_rate | cachestatus | > 512 MB | "offload_rate": 10.0,
get_origin_response_time | cachestatus, cacherefreshsrc, turnaroundtimemsec | > 512 MB | "origin_response_time": 10,
get_user_agent_details | ua | > 1024 MB | "os": { "Windows": 30 }, "browser": { "Chrome": 30 }, "platform": { "Windows": 30 },
get_unique_visitor (Azure only) | ua, cliip | > 2048 MB | HTTP response code 200 OK. Example: { "2023-02-22": 11, "2023-02-21": 6 }
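As a rough illustration of how a custom function derives its metric from the required DS2 fields, here is a hypothetical sketch in the spirit of get_status_code_level_hit_counts and get_offload_rate. It is not the packaged implementation, and the cachestatus encoding of 1 = hit is an assumption:

```python
import pandas as pd

def status_code_level_hit_counts(df: pd.DataFrame) -> dict:
    """Bucket HTTP status codes into 2xx/3xx/4xx/5xx hit counts (requires 'statuscode')."""
    levels = df["statuscode"].astype(int) // 100
    return {f"hits_{lvl}xx": int((levels == lvl).sum()) for lvl in (2, 3, 4, 5)}

def offload_rate(df: pd.DataFrame) -> dict:
    """Percentage of requests served from cache (requires 'cachestatus'; 1 = hit assumed)."""
    hits = int((df["cachestatus"] == 1).sum())
    return {"offload_rate": round(100.0 * hits / len(df), 1)}

# Tiny fabricated example frame, only to show the shape of the output
df = pd.DataFrame({"statuscode": [200, 200, 404, 301, 503], "cachestatus": [1, 1, 0, 1, 0]})
print({**status_code_level_hit_counts(df), **offload_rate(df)})
```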
stream.json

- This is a JSON file containing the stream-specific details.
- i.e. this file is used to understand the fields configured for this stream.
- This can be pulled from the portal using the steps mentioned here.
- Sample file is stored in: configs/stream.json
- This needs to be updated with the stream-specific file.
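The datasetFieldId values in stream.json are what tie a stream to the field catalogue above. Below is a minimal lookup sketch; the recursive search is used because the exact location of datasetFieldId within the stream payload is not assumed here:

```python
import json

# Load the common field catalogue and the stream-specific configuration
with open("configs/all_datastream2_fields.json") as f:
    all_fields = json.load(f)

with open("configs/stream.json") as f:
    stream = json.load(f)

def collect_field_ids(node):
    """Yield every datasetFieldId found anywhere in the stream payload."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "datasetFieldId":
                yield str(value)
            else:
                yield from collect_field_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_field_ids(item)

# Resolve each configured field id to its definition in the catalogue
for field_id in collect_field_ids(stream):
    print(field_id, all_fields.get(field_id, {}).get("name"))
```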
provision.json

- This is a JSON file containing the subset of custom functions that are selected for this stream.
Example
{ "aggregation-interval": 300, "custom-functions": [ "get_status_code_level_hit_counts", "get_traffic_volume", [...] ] [...] "bytes": [ "max", "sum" ], [...] "city": [ "unique_counts" ], [...] }
- The selected custom functions and aggregate functions listed here are triggered to generate output for the input files.
- "aggregation-interval" specifies the time in seconds over which to aggregate the data, based on the Request Time. Setting this to -1 disables time-based aggregation.
- Sample file is stored in: configs/provision.json
- This needs to be updated with the stream-specific file.
- This file can be manually edited or generated using the steps mentioned here.
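To illustrate what "aggregation-interval" controls, here is a small sketch of time-based bucketing. The reqtimesec column name is an assumption for the Request Time field:

```python
import pandas as pd

AGGREGATION_INTERVAL = 300  # seconds, from provision.json

df = pd.DataFrame({
    "reqtimesec": [1606768501, 1606768512, 1606768907],  # epoch seconds (assumed field name)
    "bytes": [3241, 1200, 900],
})

if AGGREGATION_INTERVAL > 0:
    # Floor each request time to the start of its interval, then aggregate per bucket
    df["start_timestamp"] = (df["reqtimesec"] // AGGREGATION_INTERVAL) * AGGREGATION_INTERVAL
    print(df.groupby("start_timestamp")["bytes"].agg(["max", "sum", "count"]))
else:
    # aggregation-interval of -1: aggregate the whole file as a single bucket
    print(df["bytes"].agg(["max", "sum", "count"]))
```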
Follow this document to create the config files customised for your stream: Config Files Setup Reference
% python run_aggregations.py --help
usage: [...]/run_aggregations.py [-h] [--loglevel {critical,error,warn,info,debug}]
[--input INPUT]
Helps aggregate data
options:
-h, --help show this help message and exit
--loglevel {critical,error,warn,info,debug}
logging level.
(default: info)
--input INPUT specify the input file to aggregate.
(default: /[...]/sample-input/test-data-custom.gz)
Before setting up on the cloud, this can be tested locally to ensure the aggregated output is as expected.
- Ensure the configs in the configs/ directory are updated as needed.
- A custom input file can be specified via the --input option.
- Ensure the output generated is as expected.
Example
% python run_aggregations.py --input sample-input/test-data-custom.gz
Result...
[
{
"start_timestamp": 1606768500,
"bytes_max": 3241.0,
"bytes_sum": 97230.0,
"bytes_count": 30.0,
"objsize_min": 10.0,
"objsize_max": 10.0,
"objsize_sum": 300.0,
"objsize_count": 30.0,
"uncompressedsize_sum": 3000.0,
"transfertimemsec_sum": 60.0,
"totalbytes_min": 3241.0,
"totalbytes_max": 3241.0,
"totalbytes_sum": 97230.0,
"totalbytes_mean": 3241.0,
"tlsoverheadtimemsec_min": 0.0,
"tlsoverheadtimemsec_max": 0.0,
"tlsoverheadtimemsec_sum": 0.0,
"total_hits": 30,
"hits_2xx": 25,
"hits_3xx": 1,
"hits_4xx": 1,
"hits_5xx": 2,
"traffic_volume": 97230,
"cache_hit": 24,
"cache_miss": 6,
"offload_rate": 80.0,
"origin_response_time": 0,
"os": {
"Windows": 30
},
"browser": {
"Chrome": 30
},
"platform": {
"Windows": 30
}
}
]
Follow this document for detailed setup instructions on AWS: AWS Setup Reference
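For orientation, the AWS wiring generally amounts to a Lambda triggered by the S3 upload event that runs the aggregation and publishes the derived metrics, for example to CloudWatch. The handler below is only a hypothetical sketch; run_aggregation is a placeholder, not the package's documented entry point:

```python
import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

def run_aggregation(path):
    # Placeholder for invoking this package's aggregation on the downloaded file;
    # see run_aggregations.py and /aggregation_modules for the real entry points.
    return [{"total_hits": 0}]

def handler(event, context):
    """Hypothetical Lambda handler: S3 upload event -> aggregate -> CloudWatch metric."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Download the uploaded DS2 log file to Lambda's temporary storage
        local_path = "/tmp/" + key.split("/")[-1]
        s3.download_file(bucket, key, local_path)

        # Aggregate the file and publish one of the derived metrics to CloudWatch
        metrics = run_aggregation(local_path)
        cloudwatch.put_metric_data(
            Namespace="DataStream2",
            MetricData=[{"MetricName": "total_hits",
                         "Value": float(metrics[0]["total_hits"])}],
        )
```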
Follow this document for detailed setup instructions on Azure: Azure Setup Reference