
How Tos



Prerequisites

  1. Install kg4sec (see installation)
  2. Optional: prepare collected / crawled security reports

How to run the pipeline to extract graphs from collected websites

Step 1. Prepare security reports

Security reports are the source of the information, and you should prepare them before using kg4sec. You can choose to

  1. Download the dataset prepared by our team (see: TODO)
  2. Use our crawler (see: TODO) to collect the latest reports yourself
  3. Implement your own crawlers

No matter which way you choose, you should have a folder of reports ready after this step, which will be the input for the next step.

Step 2. Convert security reports into a single JSON list file

We have a Report abstraction that represents all types of security reports. To convert your collected reports into Reports that our system can recognize in the next step, use the following command:

kg4sec import -i <input_folder> -o <output_filename> -l <source_name>

Here, the source_name is extremely important because it determines which parsers and extractors will be called in the later processing stages. Please refer to the supported sources (TODO) before running the command.
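
For example, assuming you crawled Symantec threat reports into a local folder (the folder and file names below are illustrative):

kg4sec import -i ./crawled/symantec_threat -o symantec.report.json -l symantec_threat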

Notice that if the output filename does not end with .report.json, that suffix will be appended to the specified output filename. To change this behavior, create your own configuration file and override the default configuration with the -c parameter; specifically, change the value of procedures.porter.output_extension_name to whatever you want.
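
For instance, a minimal custom configuration could look like the following. The nesting is assumed to mirror the dotted key path and the extension value is just an example; see config_default.yml for the authoritative structure.

procedures:
  porter:
    output_extension_name: ".mydata.json"

You would then pass it to the command with -c, e.g. kg4sec import -i <input_folder> -o <output_filename> -l <source_name> -c my_config.yml.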

After this step, you should have an output file called something.report.json that looks roughly like this:

[
    {
        "payload": {
            "xxxx_1.html": "PCFkb2......1sPg==",
            "xxxx_2.html": "AgICAg......1sPg=="
        },
        "title": "some title",
        "source": "symantec_threat (should be <source_name>)",
        "create_on": "2020-01-01T00:00:00.000001",
        "remarks": {},
        "id": "8699997317c3480289241d70a656f3fd",
        "type": "HTML"
    },
    {"...": "..."},
    {"...": "..."}
]

Tips:

If a security report spans more than one HTML page and the pages share the same prefix, our system will automatically merge them into one Report object. For instance, the example above has a payload that is a dictionary, which means it was merged from two separate HTML files.
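
If you want to peek at the imported pages, the payload values in the example above look like base64-encoded page content (the PCFkb2... prefix decodes to <!do...). Below is a minimal Python sketch under that assumption, with an illustrative filename:

import base64
import json

# Load the imported reports (the filename is illustrative).
with open("symantec.report.json", encoding="utf-8") as f:
    reports = json.load(f)

# Assuming each payload value is base64-encoded HTML, decode and preview it.
for page_name, encoded in reports[0]["payload"].items():
    html = base64.b64decode(encoded).decode("utf-8", errors="replace")
    print(page_name, html[:80])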

Step 3. Filter the single report file

There is no guarantee that all collected webpages or PDFs are valid security reports. To filter out those that are not, you can use the check procedure with the following command:

kg4sec check -i <input_filename> -o <output_filename>

The result will be in the same format as the original file, and the number of reports in the new file is guaranteed to be less than or equal to the number in the original file.
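
For example (filenames are illustrative), you can run the filter and then compare the report counts before and after:

kg4sec check -i symantec.report.json -o symantec.checked.report.json
python -c "import json; print(len(json.load(open('symantec.report.json'))))"
python -c "import json; print(len(json.load(open('symantec.checked.report.json'))))"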

Step 4. Parse the structure of the reports

Regardless of their original format, the reports are stored as payload after the second step.

In this step, we will use kg4sec to parse the reports and extract structural information. To achieve that, run

kg4sec parse -i <input_filename> -o <output_filename>

As we stressed earlier, different parsers will be used for reports from different sources, but the result will be in a unified format specified in intelligence.py.

Just as the outputs of step 2 carry the .report.json suffix, the default output suffix for parsing results is .intel.json.

The result will look something like the example below. You may notice that some fields are incomplete; in the next step, we will use extractors to fill some of these fields with information extracted from the unstructured text.

[
    {
        "name": "xxx.xxx.xxx",
        "source_report_id": "8699997317c3480289241d70a656f3fd",
        "source": "symantec_threat (should be <source_name>)",
        "source_url": "http://xxx.com",
        "summary": "xxx is a program that performs some malicious behavior.",
        "url": "http://xxxx/xxxx",
        "author": null,
        "publisher": "xxxx",
        "version": "xxxx",
        "discovered_date": null,
        "updated_date": "February 1, 2000 00:00:01 AM",
        "risk_level": null,
        "risk_impact_level": "High",
        "ref_cve": null,
        "remarks": {
            "Symantec Doc Id": "xxx"
        },
        "related_file_names": [
            "some_malicious.so",
            "some_malicious.dll",
            "xxxx.exe",
            "xxxx.bin"
        ],
        "related_file_paths": [
            "C:\\Windows\\System"
        ],
        "related_ips": [],
        "affected_systems": [
            "Windows",
            "Linux",
            "OsX"
        ],
        "alias": [],
        "description": "",
        "extracted": [],
        "id": "4dd42f6b29414f3898eb2f102f96e336",
        "threat_type": "spyware",
        "type": "THREAT",
        "some more fields": "some more values"
    },
    {"...": "..."},
    {"...": "..."}
]

Step 5. Extract unstructured knowledge

As the core functionality of kg4sec, we want to capture the large volume of knowledge buried in unstructured, security-related description text. We provide different NLP-based extractors to extract entities and relations from the parsed reports. The command is

kg4sec extract -i <input_filename> -o <output_filename>

The format will be the same as in step 4, but more fields will be filled in. For example, more IOCs and their relations will be extracted.
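
Putting steps 2 through 5 together, a typical run of the pipeline looks like the following (file and source names are illustrative):

kg4sec import -i ./crawled/symantec_threat -o symantec.report.json -l symantec_threat
kg4sec check -i symantec.report.json -o symantec.checked.report.json
kg4sec parse -i symantec.checked.report.json -o symantec.intel.json
kg4sec extract -i symantec.intel.json -o symantec.extracted.intel.json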

Step 6. Export the knowledge to a storage backend

JSON is a good format for storing information. However, it is not designed for flexible data access, which is critical for downstream applications such as visualizing the extracted knowledge. In kg4sec, we support exporting knowledge from the internal representation to other storage backends such as databases, especially graph databases like neo4j. To use the default neo4j database connector, there are two steps.

Step 6.1. Write a configuration file

To connect to a neo4j database, you need to provide access information such as hostname, port, username, and password. This information can be configured through configuration files, which we write in YAML. An example is shown below (for a more complete example, see config_default.yml):

database:
  neo4j:
    host: "localhost"
    port: 7687
    username: "neo4j"
    password: "neo4j"

Put the above content into a file called my_config.yml, then proceed to the next step.

Step 6.2. Run command to export the intelligence file to neo4j

After getting the configuration file ready, run

kg4sec export -f neo4j -i <input_filename> -c <modified_config>

Here, the <input_filename> should by default be a something.intel.json file, and <modified_config> can be the my_config.yml you just created in the same folder. This might take some time, so you may want to use --use_tqdm to monitor the progress. Once it is finished, you should have the security knowledge graph stored in neo4j. Congratulations!
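
For example, continuing with the illustrative filenames used in the earlier steps:

kg4sec export -f neo4j -i symantec.extracted.intel.json -c my_config.yml --use_tqdm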

How to adapt the system to new report sources

To enable the system to process reports from a new source, you need to

  1. Implement your own crawlers for data collection
  2. Implement a parser for that specific source
    1. Find the corresponding abstraction (e.g., is it a Threat or a Blog? You can find the properties in intelligence.py). If you think the knowledge from the new source belongs to something completely different, feel free to create your own abstraction inheriting from Intelligence.
    2. Add a parser file like this one. You can use BeautifulSoup for easy HTML parsing; see the sketch after this list.
    3. Add an entry in this file.
    4. Put a SMALL batch of instances (< 1 MB per source; please limit the size of this folder and make sure no data regulation is violated) into the test data folder at .../kg4sec/tests/test_data.
    5. Write a test file like this one to test whether your parser works with the rest of the components.
  3. Optional: adjust or add extractors to make the extraction of entities and relations more accurate
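
For step 2.2, here is a hypothetical sketch of the per-source parsing logic only. A real parser must follow the project's own parser interface and registration steps described above; the CSS selectors below are made up, and the output field names simply mirror the step 4 example.

from bs4 import BeautifulSoup

def parse_report_html(html):
    # All selectors below are illustrative; they depend entirely on the
    # page layout of the new source.
    soup = BeautifulSoup(html, "html.parser")

    def text_or_none(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    return {
        "name": text_or_none("h1.threat-title"),
        "summary": text_or_none("div.summary"),
        "risk_impact_level": text_or_none("td.risk-impact"),
        "affected_systems": [li.get_text(strip=True)
                             for li in soup.select("ul.affected-systems li")],
    }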

How to add new storage backend

By default, our system uses neo4j for knowledge storage. However, you can also implement other connectors to support different storage backends. The steps are

  1. Convert our intelligence abstraction into something that can be stored in your backend
  2. Use the corresponding Python package to connect to your backend (see the sketch after this list)
  3. Don't forget to put the environment-related values into the default configuration file. For how we handle configurations in our project, see config_default.yml for an example configuration file, and see how we load it in our util function.
  4. Add related test cases in the tests folder and test it with pytest
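
As an illustration of steps 1 and 2, the following hypothetical sketch dumps a parsed .intel.json file into an SQLite table using only the Python standard library; a real connector should mirror the existing neo4j connector and read its settings from the configuration file instead of hard-coding them. The filename is illustrative, and the field names come from the step 4 example.

import json
import sqlite3

# Load the intelligence items produced by the parse/extract steps.
with open("symantec.extracted.intel.json", encoding="utf-8") as f:
    intel_items = json.load(f)

conn = sqlite3.connect("kg4sec.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS intelligence "
    "(id TEXT PRIMARY KEY, name TEXT, type TEXT, summary TEXT)"
)
for item in intel_items:
    conn.execute(
        "INSERT OR REPLACE INTO intelligence (id, name, type, summary) "
        "VALUES (?, ?, ?, ?)",
        (item.get("id"), item.get("name"), item.get("type"), item.get("summary")),
    )
conn.commit()
conn.close()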