
Tools that use third-party cloud services to help customers get the most out of DataStream (DS) data and gain visibility into activity on the Akamai platform.


DataStream2 SDK

To empower DS2 customers and add value to the data delivered to customer object stores (AWS S3, Azure Blob Storage), this business-value-driven SDK/package is provided to run on a serverless function (AWS Lambda, Azure Functions) and derive metrics that can be sent to destinations such as CloudWatch, SNS, DynamoDB, CosmosDB, etc.
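For example, once metrics are aggregated, the serverless function can forward them to a destination such as CloudWatch. The snippet below is a minimal sketch of that destination step using boto3; the metric dictionary and namespace are made-up examples, and this is not part of the SDK itself.

import boto3

# Hypothetical aggregated result, in the shape of the sample output shown later
aggregated = {"total_hits": 42, "traffic_volume": 97230}

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="DataStream2",  # example namespace, not mandated by the SDK
    MetricData=[
        {"MetricName": name, "Value": float(value), "Unit": "Count"}
        for name, value in aggregated.items()
    ],
)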

Built With

  • python3.8
  • pandas
  • httpagentparser


Features

Back to Top

  1. Derive metrics from DS2 logs uploaded to AWS S3 or Azure Blob Storage
  2. Supports both DS2 formats:
    • JSON
    • STRUCTURED

Input

Back to Top

Reads STRUCTURED (CSV) or JSON format input files produced by DataStream 2.
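For reference, a gzip-compressed DS2 file can be loaded with pandas roughly as follows. This is a minimal sketch, not the SDK's own loader; the file path is a placeholder, and it assumes one JSON log record per line for the JSON format.

import pandas as pd

# "ds2-logs.gz" is a placeholder path, not a file shipped with this repository.
# JSON format: one log record per line.
df = pd.read_json("ds2-logs.gz", lines=True, compression="gzip")

# A STRUCTURED (CSV) file could instead be read with, for example:
# df = pd.read_csv("ds2-logs.gz", compression="gzip")

print(df.columns.tolist())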

Sample Output

Back to Top

Generates the following aggregated output of the selected fields per input file:

[
  {
    "start_timestamp": 1606768500,
    "bytes_max": 3241.0,
    "bytes_sum": 97230.0,
    "bytes_count": 30.0,
    "objsize_min": 10.0,
    "objsize_max": 10.0,
    "objsize_sum": 300.0,
    "objsize_count": 30.0,
    "uncompressedsize_sum": 3000.0,
    "transfertimemsec_sum": 60.0,
    "totalbytes_min": 3241.0,
    "totalbytes_max": 3241.0,
    "totalbytes_sum": 97230.0,
    "totalbytes_mean": 3241.0,
    "tlsoverheadtimemsec_min": 0.0,
    "tlsoverheadtimemsec_max": 0.0,
    "total_hits": 42,
    "hits_2xx": 25,
    "hits_3xx": 1,
    "hits_4xx": 1,
    "hits_5xx": 2,
    "traffic_volume": 97230,
    "cache_hit": 24,
    "cache_miss": 6,
    "offload_rate": 80.0,
    "origin_response_time": 0,
    "os": {
      "Windows": 30
    },
    "browser": {
      "Chrome": 30
    },
    "platform": {
      "Windows": 30
    }
  }
]

How to clone the Git repo

Back to Top

Clone the repo,

git clone <repo url>

Reference - https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-clone

Code Structure

Back to Top

File/Module: Description

run_aggregations.py: Main module that is invoked to aggregate the input data file
/aggregation_modules: Modules for data aggregation
/cloud_modules_*: Utilities to interact with the respective cloud services. For example,
  • /cloud_modules_aws: Utilities to interact with AWS S3 storage
  • /cloud_modules_azure: Utilities to interact with Azure Blob Storage
/configs: Contains sample configuration files
/frontend_modules/provision_ui: UI framework using Django that helps to create the provision.json file containing the list of selected metrics for aggregation
/tools: Contains standalone tools that can be used to set up other cloud services that help in analysing DataStream data. For example,
  • tools/athena: Sets up Athena in AWS to query DataStream 2 data directly from S3 buckets

More Details

After cloning, cd to the repository directory and start the pydoc server to get more details on the code base:

python -m pydoc -b .

Configuration Files

Back to Top

The following are details on the input configuration files used by this package.

all_datastream2_fields.json

  1. This JSON file consists of all dataset fields available in DataStream

    Example
    {
        [...]
        "2003": {
            "name": "objSize",
            "cname": "Object size",
            "dtype": "bigint",
            "agg": [
                "min", "max", "sum"
            ]
        },
        [...]
    }
  2. This file contains the following details:

    field id: The field id (say, "2003") corresponds to the datasetFieldId in the stream.json file.
    "name": name of the field
    "dtype": data type of the field
    "cname": field description
    "agg":
    • The "agg" tag consists of the list of aggregate functions supported by this field, provided that field is selected in the stream.json file (see the sketch after this list).
    • Removing a function from the "agg" list disables it for that field.
    • Reference: Pandas > DataFrame > API Reference
    • The following functions are currently supported:
      min: Return the minimum of the values over the requested axis.
      max: Return the maximum of the values over the requested axis.
      sum: Return the sum of the values over the requested axis.
      count: Count non-NA cells for each column or row.
      mean: Return the mean of the values over the requested axis.
      median: Return the median of the values over the requested axis.
      var: Return unbiased variance over the requested axis.
      any: Returns False unless there is at least one element within a series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).
      unique_counts: Returns JSON containing counts of unique rows in the DataFrame.
      For example, for the column country it returns:
       "country": {
         "US": 42
       },

  3. Sample File is stored in: configs/all_datastream2_fields.json

  4. This is a common file and is updated only when new fields are added to DataStream.
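To illustrate how the "agg" lists translate into the aggregated output fields (such as objsize_min, objsize_max and objsize_sum in the sample output above), here is a minimal pandas sketch. The field excerpt and the column data are made up, and this is not the SDK's actual code path.

import pandas as pd

# Hypothetical excerpt of all_datastream2_fields.json
fields = {
    "2003": {"name": "objSize", "dtype": "bigint", "agg": ["min", "max", "sum"]}
}

# Hypothetical input column, keyed by the lowercased field name
df = pd.DataFrame({"objsize": [10, 10, 10]})

results = {}
for meta in fields.values():
    column = meta["name"].lower()
    for func in meta["agg"]:
        # e.g. objsize_min, objsize_max, objsize_sum
        results[f"{column}_{func}"] = getattr(df[column], func)()

print(results)  # {'objsize_min': 10, 'objsize_max': 10, 'objsize_sum': 30}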

all_custom_functions.json

  1. This JSON file contains the list of all the available custom functions that can be selected to aggregate the data.

    Example
    {
        [...]
        "get_status_code_level_hit_counts": {
            "required-fields": [
                "statuscode"
            ],
            "description": "Show Stat Count of HTTP requests"
        },
        [...]
    }
  2. This file contains the following details:

    function name: unique name for the function
    "description": short description of this function
    "required-fields": dataset field names (in lowercase) from all_datastream2_fields.json that are required to derive this function.
  3. Sample File is stored in: configs/all_custom_functions.json

  4. This is a common file and is updated only when new custom functions are added to the package.

  5. The following are the currently available custom functions, the DS2 fields they require, and their recommended memory (an illustrative sketch of one such function follows this list):

    get_total_hits
    • Required DS2 fields: (none listed)
    • Recommended memory: > 512 MB
    • Sample output:
      "total_hits": 42,

    get_traffic_volume
    • Required DS2 fields: totalbytes
    • Recommended memory: > 512 MB
    • Sample output:
      "traffic_volume": 145964,

    get_status_code_level_hit_counts
    • Required DS2 fields: statuscode
    • Recommended memory: > 512 MB
    • Sample output:
      "hits_2xx": 42,
      "hits_3xx": 0,
      "hits_4xx": 0,
      "hits_5xx": 0,

    get_cachestatus
    • Required DS2 fields: cachestatus
    • Recommended memory: > 512 MB
    • Sample output:
      "cache_hit": 42,
      "cache_miss": 0,

    get_offload_rate
    • Required DS2 fields: cachestatus
    • Recommended memory: > 512 MB
    • Sample output:
      "offload_rate": 10.0,

    get_origin_response_time
    • Required DS2 fields: cachestatus, cacherefreshsrc, turnaroundtimemsec
    • Recommended memory: > 512 MB
    • Sample output:
      "origin_response_time": 10,

    get_user_agent_details
    • Required DS2 fields: ua
    • Recommended memory: > 1024 MB
    • Sample output:
      "os": {
          "Windows": 30
      },
      "browser": {
          "Chrome": 30
      },
      "platform": {
          "Windows": 30
      },

    get_unique_visitor (Azure only)
    • Required DS2 fields: ua, cliip
    • Recommended memory: > 2048 MB
    • Sample output: HTTP response code 200 OK, for example:
      {
        "2023-02-22": 11,
        "2023-02-21": 6
      }
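As a rough illustration of how a custom function uses its "required-fields", the sketch below reimplements the idea behind get_status_code_level_hit_counts with pandas. It mirrors the sample output keys shown above but is an assumption-based example, not the SDK's implementation.

import pandas as pd

def get_status_code_level_hit_counts(df):
    # Illustrative only: bucket HTTP status codes into 2xx-5xx hit counts,
    # mirroring the sample output keys ("hits_2xx", "hits_3xx", ...).
    codes = pd.to_numeric(df["statuscode"], errors="coerce")
    return {
        f"hits_{level}xx": int(((codes >= level * 100) & (codes < (level + 1) * 100)).sum())
        for level in (2, 3, 4, 5)
    }

print(get_status_code_level_hit_counts(
    pd.DataFrame({"statuscode": [200, 204, 301, 404, 503]})
))
# {'hits_2xx': 2, 'hits_3xx': 1, 'hits_4xx': 1, 'hits_5xx': 1}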

stream.json

  1. This is a JSON file containing the stream-specific details.
  2. That is, this file is used to understand the fields configured for this stream (a hypothetical excerpt is shown after this list).
  3. This can be pulled from the portal using the steps mentioned here.
  4. Sample File is stored in: configs/stream.json
    • This needs to be updated with the stream-specific file.
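The only stream.json detail this README relies on is the datasetFieldId values that map to all_datastream2_fields.json, so a hypothetical excerpt might look like the following. The surrounding key names are assumptions; the actual layout should be taken from configs/stream.json or the portal export.

{
    [...]
    "datasetFields": [
        {
            "datasetFieldId": 2003
        },
        [...]
    ],
    [...]
}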

provision.json

  1. This is a JSON file containing the subset of custom functions that are selected for this stream.

    Example
    {
        "aggregation-interval": 300,
        "custom-functions": [
            "get_status_code_level_hit_counts",
            "get_traffic_volume",
            [...]
        ]
        [...]
        "bytes": [
            "max",
            "sum"
        ],
        [...]
        "city": [
            "unique_counts"
        ],
        [...]
    }
  2. These functions are triggered to generate output for the selected aggregate functions on each input file.

  3. "aggregation-interval" specifies the time, in seconds, over which to aggregate the data based on the request time (see the sketch after this list).

    • Setting this to -1 disables time-based aggregation.
  4. Sample File is stored in: configs/provision.json

    • This needs to be updated with the stream specific file.
  5. This file can be manually edited or generated using the steps mentioned here
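To show what "aggregation-interval" means in practice, the sketch below buckets records into 300-second windows keyed by the request time before aggregating. The column names and data are made up for illustration, and the SDK's actual grouping logic may differ.

import pandas as pd

AGGREGATION_INTERVAL = 300  # seconds, as set in provision.json

# Hypothetical request timestamps (epoch seconds) and byte counts
df = pd.DataFrame({
    "reqtimesec": [1606768510, 1606768590, 1606768810],
    "bytes": [1000, 2000, 3000],
})

# Floor each timestamp to the start of its aggregation window
df["start_timestamp"] = (df["reqtimesec"] // AGGREGATION_INTERVAL) * AGGREGATION_INTERVAL

print(df.groupby("start_timestamp")["bytes"].agg(["max", "sum"]).reset_index())
#    start_timestamp   max   sum
# 0       1606768500  2000  3000
# 1       1606768800  3000  3000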

Configuration File Setup

Back to Top

Follow this document to create the config files customised for your stream, Config files Setup Reference

Script Usage

Back to Top

% python run_aggregations.py --help
usage: [...]/run_aggregations.py [-h] [--loglevel {critical,error,warn,info,debug}]
                                                                [--input INPUT]

Helps aggregate data

options:
  -h, --help            show this help message and exit
  --loglevel {critical,error,warn,info,debug}
                        logging level.
                        (default: info)
                        
  --input INPUT         specify the input file to aggregate.
                        (default: /[...]/sample-input/test-data-custom.gz)

Testing with an Input file locally

Back to Top

Before setting it up on a cloud server, the package can be tested locally to ensure the aggregated output is as expected, as shown below.

  • Ensure the configs in the configs/ directory are updated as needed
  • A custom input file can be specified via the --input option
  • Ensure the output generated is as expected
Example
% python run_aggregations.py --input sample-input/test-data-custom.gz
Result...
[
  {
    "start_timestamp": 1606768500,
    "bytes_max": 3241.0,
    "bytes_sum": 97230.0,
    "bytes_count": 30.0,
    "objsize_min": 10.0,
    "objsize_max": 10.0,
    "objsize_sum": 300.0,
    "objsize_count": 30.0,
    "uncompressedsize_sum": 3000.0,
    "transfertimemsec_sum": 60.0,
    "totalbytes_min": 3241.0,
    "totalbytes_max": 3241.0,
    "totalbytes_sum": 97230.0,
    "totalbytes_mean": 3241.0,
    "tlsoverheadtimemsec_min": 0.0,
    "tlsoverheadtimemsec_max": 0.0,
    "tlsoverheadtimemsec_sum": 0.0,
    "total_hits": 30,
    "hits_2xx": 25,
    "hits_3xx": 1,
    "hits_4xx": 1,
    "hits_5xx": 2,
    "traffic_volume": 97230,
    "cache_hit": 24,
    "cache_miss": 6,
    "offload_rate": 80.0,
    "origin_response_time": 0,
    "os": {
      "Windows": 30
    },
    "browser": {
      "Chrome": 30
    },
    "platform": {
      "Windows": 30
    }
  }
]

How to set up in AWS

Back to Top

Follow this document for detailed setup instructions on AWS, AWS Setup Reference

How to set up in Azure

Back to Top

Follow this document for detailed setup instructions on Azure, Azure Setup Reference
