
Various parsers used for data privacy detection and shared data between hosts. HAR file parsing tools and accompanying API.


porteron/http-archive-parser


HTTP Archive Parser

Description

The HTTP Archive Parser stands up a server with endpoints to parse HTTP Archive (HAR) files in various ways. Its initial purpose was to detect data privacy violations in a user session. Part of that system has been broken out into a more general-purpose parser, which exposes Shared Strings in a user's session and helps identify dataflow between different host domains.

It looks for matching strings in "cookies", "headers", and "query parameters".
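
As a rough illustration of what a shared string is, the sketch below collects header, cookie, and query parameter values from HAR entries and reports the values that were sent to more than one host. The `request.headers`, `request.cookies`, and `request.queryString` fields come from the HAR format itself; the function name `findSharedStrings` and the type names are hypothetical, and this is not the parser's actual implementation.

```typescript
// Minimal subset of a HAR entry, limited to the fields this sketch reads.
interface NameValue { name: string; value: string; }
interface HarEntry {
  request: {
    url: string;
    headers: NameValue[];
    cookies: NameValue[];
    queryString: NameValue[];
  };
}

// Hypothetical helper: maps each value to the hosts it was sent to,
// then keeps only the values seen on at least two different hosts.
function findSharedStrings(entries: HarEntry[]): Map<string, string[]> {
  const hostsByValue = new Map<string, Set<string>>();
  entries.forEach((entry) => {
    const host = new URL(entry.request.url).host;
    const { headers, cookies, queryString } = entry.request;
    headers.concat(cookies, queryString).forEach(({ value }) => {
      if (!hostsByValue.has(value)) hostsByValue.set(value, new Set<string>());
      hostsByValue.get(value)!.add(host);
    });
  });

  const shared = new Map<string, string[]>();
  hostsByValue.forEach((hosts, value) => {
    if (hosts.size >= 2) shared.set(value, Array.from(hosts).sort());
  });
  return shared;
}
```

A value that appears only in requests to a single host is dropped, which is the property that makes the remaining values useful for spotting dataflow between hosts.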

You can specify various reports to run on the HAR file.

  • Shared String
  • Shared String Entity List
  • Shared String Differential

Currently, reports are read from and stored in S3. You will need to fill out the .env file with the proper credentials. Support for the local filesystem is planned.


Installation

$ npm install

Running the app

# development
$ npm run start

# watch mode
$ npm run start:dev

# production mode
$ npm run start:prod

Test

# unit tests
$ npm run test

# e2e tests
$ npm run test:e2e

# test coverage
$ npm run test:cov

Parser

There are many properties you can customize for the parser. The config is located in the parser/har/parser.config.js file.

Improper modification of these values can lead to unnecessary parsing conditions, which in turn lead to long parsing times.

FIRST_CHAR_MIN_LEN and FIRST_CHAR_MAX_LEN are the most sensitive values. The smaller FIRST_CHAR_MIN_LEN is, the more strings the parser will consider in the file. You should generally keep this value above 6 or 7. Most unique identifiers are longer than 7 characters, so set it higher if identifiers are what you are looking for.

Below are the supported config values:

{
    LEVELS: [
        'request',
        'response'
    ],
    ENTRY_TYPES: [
        'headers',
        'cookies',
        'queryString'
    ],
    FIRST_CHAR_MIN_LEN: 7,
    FIRST_CHAR_MAX_LEN: 200,
    REPORT_KEY_NAME_MAX_LENGTH: 60,
    REPORT_URL_MAX_LENGTH: 120,
    INCLUDE_INITIATOR: true,
    INCLUDE_SERVER_IP: false,
    MATCH_COUNT_MIN: 2,
    IGNORE_LIST: [],
    INCLUDE_LIST: [],
    REPORT_PARAMS: [],
    IGNORE_SAME_REQUESTS: true,
    FILTER_SAME_HOST_URL: true,
    FILTER_TIMESTAMPS: true,
    FILTER_URL_VALUES: false,
}
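
To make the effect of the length settings concrete, here is a small sketch that filters a list of values by length. It assumes the two settings act as inclusive bounds on the length of a candidate string, which is an interpretation of the description above rather than something taken from the parser's code; `filterCandidates` and `LengthBounds` are hypothetical names.

```typescript
// Assumption: both settings are inclusive bounds on a candidate value's length.
interface LengthBounds {
  FIRST_CHAR_MIN_LEN: number;
  FIRST_CHAR_MAX_LEN: number;
}

// Hypothetical helper: keeps only the values whose length is within the bounds.
function filterCandidates(values: string[], bounds: LengthBounds): string[] {
  return values.filter(
    (v) => v.length >= bounds.FIRST_CHAR_MIN_LEN && v.length <= bounds.FIRST_CHAR_MAX_LEN,
  );
}

const sampleValues = ["en-US", "true", "gzip", "1684012345", "a1b2c3d4e5f6a7b8"];
const loose = filterCandidates(sampleValues, { FIRST_CHAR_MIN_LEN: 3, FIRST_CHAR_MAX_LEN: 200 });
const strict = filterCandidates(sampleValues, { FIRST_CHAR_MIN_LEN: 7, FIRST_CHAR_MAX_LEN: 200 });

console.log(loose.length, strict.length); // 5 2
```

With the lower minimum, short generic values such as "gzip" stay in the candidate set; with a minimum of 7, only the two identifier-like values remain.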

How it works

POST to {SERVER_HOST}/collection-event/parse

  • There are two supported ways to pass your file to the parser
    1. Send the entire raw HAR contents in the request body
    2. Send the name of the HAR file stored in S3
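
The two options can be sketched as follows. The endpoint path, the `Content-Type` and `mx-token` headers, and the `files` and `raw` body fields are taken from this README; the helper name `buildParseRequest`, the `http://localhost:3000` host, and the near-empty HAR object are placeholders chosen for illustration.

```typescript
// Shape compatible with the second argument of fetch().
interface ParseRequestInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: assembles the URL and request options for the parse endpoint.
function buildParseRequest(
  serverHost: string,
  token: string,
  payload: object,
): { url: string; init: ParseRequestInit } {
  return {
    url: `${serverHost}/collection-event/parse`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "mx-token": token },
      body: JSON.stringify(payload),
    },
  };
}

// Option 1: send the raw HAR contents (placeholder HAR object) in the request body.
const rawRequest = buildParseRequest("http://localhost:3000", "TEST-KEY-PARSER", {
  report_type: "sharedStrings",
  raw: [{ log: { version: "1.2", entries: [] } }],
});

// Option 2: send the name of a HAR file already stored in S3.
const s3Request = buildParseRequest("http://localhost:3000", "TEST-KEY-PARSER", {
  report_type: "sharedStrings",
  files: ["<S3 HAR FILE NAME>"],
});

// To send either request: await fetch(rawRequest.url, rawRequest.init);
```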

Example Requests for Various Parsing

Header Request Format

Headers
Content-Type	application/json
mx-token	TEST-KEY-PARSER

Supported Request Body

{
  "format": "json", // OPTIONAL - "json" or "csv" - default is "json"
  "save": true, // OPTIONAL - boolean, whether to save the report to the bucket - default is true
  "update": false, // OPTIONAL - boolean, whether to overwrite an existing file - default is false
  "report_type": "sharedStrings", // or "differential" or "entityList"
  "files": ["<S3 HAR FILE NAME>"], // if differential, pass two files
  // OR
  "raw": [{HAR1}] // if differential, pass two raw HAR files as JSON objects
}
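
The request body above could be typed and checked on the client side roughly like this. The field names and allowed values come from the body description; the rules that exactly one of `files` or `raw` is given and that a differential needs exactly two inputs are inferred from the comments, not confirmed against the server. `ParseRequestBody` and `validateParseRequest` are hypothetical names.

```typescript
type ReportType = "sharedStrings" | "differential" | "entityList";

interface ParseRequestBody {
  format?: "json" | "csv"; // defaults to "json"
  save?: boolean;          // defaults to true
  update?: boolean;        // defaults to false
  report_type: ReportType;
  files?: string[];        // names of HAR files stored in S3
  raw?: object[];          // raw HAR contents as JSON objects
}

// Hypothetical client-side check; returns a list of problems (empty when valid).
function validateParseRequest(body: ParseRequestBody): string[] {
  const errors: string[] = [];
  const fileCount = body.files?.length ?? 0;
  const rawCount = body.raw?.length ?? 0;

  // Assumption: "files" and "raw" are mutually exclusive and one is required.
  if ((fileCount > 0) === (rawCount > 0)) {
    errors.push('provide either "files" or "raw", but not both');
  }
  // Assumption: a differential compares exactly two HAR inputs.
  if (body.report_type === "differential" && fileCount + rawCount !== 2) {
    errors.push("a differential report needs exactly two HAR inputs");
  }
  return errors;
}
```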

Shared Strings Parse

Request Body
{
	"report_type": "sharedStrings",
	"format": "json",
	"files": ["<S3 HAR FILE NAME>"]
}

Entity List Parse

Request Body
{
	"report_type": "entityList",
	"format": "json", 
	"files": ["<S3 HAR FILE NAME>"]
}

HAR Differential

Request Body
{
	"report_type": "differential",
	"format": "json",
	"files": ["<S3 HAR FILE NAME>", "<S3 HAR FILE TO DIFF AGAINST>"]
}

Development

Creating a new module

Before starting development, it is important to understand the Nest framework. A couple of basic concepts will help you understand the purpose of each file. The basic structure is that each Module has a Component, Service/Repository, Data Transfer Model, and Interface. Some of these constructs are just fancy names for something with a very simple purpose.

You can run the CreateFullModule.sh script to have the following files autogenerated for you:

(Note that there are occasionally syntax issues when creating a module with a module name that is more than 1 word long)

  • controller
  • module
  • dto
  • repository
  • spec test file

Take a look at the CreateFullModule.sh code to understand what is happening.

After you run the script there are still a few things that need to be manually added.

  • You will need to manually add the fields to the dto file. Look at src/interfaces/entities/<your module> for all the fields that you will need to add. Use existing files for reference.
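
As a sketch of that manual step, a dto typically mirrors the fields of the matching entity interface. Everything named below (`ExampleEntity`, `ExampleDto`, and their fields) is hypothetical and only shows the pattern; the real fields come from the interface file for your module.

```typescript
// Hypothetical entity interface, standing in for a file under src/interfaces/entities.
interface ExampleEntity {
  id: string;
  fileName: string;
  createdAt: string;
}

// Hypothetical dto: declares the same fields so the compiler flags any that are missing.
class ExampleDto implements ExampleEntity {
  readonly id: string;
  readonly fileName: string;
  readonly createdAt: string;

  constructor(fields: ExampleEntity) {
    this.id = fields.id;
    this.fileName = fields.fileName;
    this.createdAt = fields.createdAt;
  }
}
```

Declaring the dto with `implements` means that adding a field to the interface later produces a compile error until the dto is updated too.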

The alternative is to use the Nest CLI and create each file manually.


Resources


HTTP Archive Parser is built with the TypeScript Nest framework.

Maintainers

Nicholas Porter - https://github.com/porteron
