MALVADA - A Windows Malware Execution Traces Dataset generation framework

MALVADA is a software framework that parses one or more CAPE .json reports coming from Windows programs and processes them in different phases to provide various statistics about their contents.

The main objective of MALVADA is to help generate datasets. Specifically, reporting datasets generated with CAPE (although it can be extended to other sandboxing engines format).

Installation

Install the requirements specified in requirements.txt.

$ pip3 install -r requirements.txt

The last requirement specified in the requirements.txt file is AVClass (from malicialab). In case you face any problem during installation, you can try to install it independently with:

$ pip3 install avclass-malicialab

Usage

To use this framework you just need to run the main script malvada.py (/src/malvada.py) and pass it the path to a directory that contains the set of .json reports you want to process:

$ python3 malvada.py directory

NOTE: The phases MALVADA comprises can be invoked individually, calling their respective scripts.

The tool will process all the reports in the directory and move them in their corresponding folders, if appropriate. You can test the tool using the report samples provided in test_reports.

The help message is printed with the -h flag:

$ python3 malvada.py -h
usage: malvada.py [-h] [-w WORKERS] [-s] [-vt VT_POSITIVES_THRESHOLD] [-a ANONIMIZE_TERMS] json_dir

Generates the MALset dataset from CAPE reports. WARNING: This script will modify the reports in the directory provided.

positional arguments:
  json_dir              The directory containing one or more json reports.

options:
  -h, --help            show this help message and exit
  -w WORKERS, --workers WORKERS
                        Number of workers to use (default: 10).
  -s, --silent          Silent mode (default: False).
  -vt VT_POSITIVES_THRESHOLD, --vt-positives-threshold VT_POSITIVES_THRESHOLD
                        Threshold for VirusTotal positives (default: 10).
  -a ANONIMIZE_TERMS, --anonimize-terms ANONIMIZE_TERMS
                        Replace the terms in the file provided with [REDACTED], one by line (default: 'terms_to_anonymize.txt').

Phases of MALVADA

MALVADA processes the reports in the following phases:

Detect incorrect reports. That is, those that are poorly formatted for some reason (samples do not run, they crash, etc...).
Remove duplicate reports (based on the SHA512 of the submitted sample).
Sanitize and anonymize reports. That is, remove sensitive information and the terms specified (by default) in terms_to_anonymize.txt.
Add AVClass result to the report. That is, parse the results from all VT vendors, transform them into valid input for AVClass and invoke AVClass itself. The AVClass consesus result is added in the key avclass_detection.
Generate statistics.

Example

Output after executing MALVADA with the test_reports:

$ python3 src/malvada.py test_reports -w 100 (100 workers, default is 10)

How to cite

If you are using this software, please cite it as follows:

TBD

More info in the "Cite this repository" GitHub contextual menu.

Authors

Razvan Raducu
Alain Villagrasa Labrador
Ricardo J. Rodríguez
Pedro Álvarez

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

MALVADA - A Windows Malware Execution Traces Dataset generation framework

Installation

Usage

Phases of MALVADA

Example

How to cite

Authors

Files

README.md

Latest commit

History

README.md

File metadata and controls

MALVADA - A Windows Malware Execution Traces Dataset generation framework

Installation

Usage

Phases of MALVADA

Example

How to cite

Authors