How Tos
- Prerequisites
- How to run the pipeline to extract graphs from collected websites
- How to adapt the system to new report sources
- How to add new storage backend
- Install kg4sec (see installation)
- Optional: prepare collected / crawled security reports
Security reports are the source of the information, and you should prepare them before using kg4sec. You can choose to:
- Download the dataset prepared by our team (see: TODO)
- Use our crawler (see: TODO) to collect the latest reports yourself
- Implement your own crawlers
Whichever way you choose, you should end up with a folder of reports after this step, which will be the input for the next step.
We have a Report abstraction that represents all types of security reports. To convert your collected reports into Report objects that our system can recognize in the next step, use the following command:
kg4sec import -i <input_folder> -o <output_filename> -l <source_name>
Here, the source_name is extremely important because it determines which parsers and extractors will be called in the later processing stages. Please refer to the supported sources (todo) before running the command.
Note that if the output filename does not end with .report.json, that suffix will be appended to it. To change this behavior, create your own configuration file and override the default configuration using the -c parameter. Specifically, set the value of procedures.porter.output_extension_name to whatever you want.
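For example, a minimal override file might look like the following (the key path procedures.porter.output_extension_name is taken from above; check config_default.yml for the exact nesting in your version, and note the value shown here is only an illustration):

```yaml
procedures:
  porter:
    output_extension_name: ".myreport.json"
```

Pass this file via the -c parameter to override the default.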
After this step, you should have an output file called something.report.json that looks roughly like this:
[
  {
    "payload": {
      "xxxx-1.html": "PCFkb2......1sPg==",
      "xxxx-2.html": "AgICAg......1sPg=="
    },
    "title": "some title",
    "source": "symantec_threat (should be <source_name>)",
    "create_on": "2020-01-01T00:00:00.000001",
    "remarks": {},
    "id": "8699997317c3480289241d70a656f3fd",
    "type": "HTML"
  },
  {"...": "..."},
  {"...": "..."}
]
Tips:
If one security report spans more than one HTML page and the pages share the same filename prefix, our system will automatically merge them into one Report object. For instance, the example above has a dictionary payload, which means it was merged from two separate HTML files.
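The payload values are base64-encoded file contents. A minimal sketch for loading a .report.json file and decoding each payload back to text (field names are taken from the example above; error handling is omitted):

```python
import base64
import json


def decode_report_payloads(path):
    """Load a .report.json file and yield (report id, decoded pages) pairs."""
    with open(path, encoding="utf-8") as f:
        reports = json.load(f)
    for report in reports:
        pages = {}
        for filename, b64 in report["payload"].items():
            # Each payload value is the base64-encoded content of one page.
            pages[filename] = base64.b64decode(b64).decode("utf-8", errors="replace")
        yield report["id"], pages
```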
There is no guarantee that all collected webpages or PDFs are valid security reports. To filter out those that are not, use the check procedure:
kg4sec check -i <input_filename> -o <output_filename>
The result will be in the same format as the original file, and the number of reports in the new file is guaranteed to be less than or equal to the number in the original file.
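You can sanity-check that guarantee by comparing the report counts of the two files; a quick sketch (filenames are placeholders):

```python
import json


def report_counts(before_path, after_path):
    """Return (n_before, n_after) report counts for two .report.json files."""
    with open(before_path, encoding="utf-8") as f:
        n_before = len(json.load(f))
    with open(after_path, encoding="utf-8") as f:
        n_after = len(json.load(f))
    # The check procedure only filters; it should never add reports.
    assert n_after <= n_before, "check should never add reports"
    return n_before, n_after
```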
No matter what format the reports are in, they are stored as payload in the second step.
In this step, we will use kg4sec to parse the reports and extract structural information. To achieve that, run
kg4sec parse -i <input_filename> -o <output_filename>
As we stressed earlier, different parsers will be used for reports from different sources, but the result will be in a unified format specified in intelligence.py.
Just as the outputs in step 2 have the suffix .report.json, the default output suffix for parsing results is .intel.json.
The result will look something like the example below. You may notice that some fields are incomplete; in the next step, we will use extractors to fill some of these fields with information extracted from the unstructured text.
[
  {
    "name": "xxx.xxx.xxx",
    "source_report_id": "8699997317c3480289241d70a656f3fd",
    "source": "symantec_threat (some <source_name>)",
    "source_url": "http://xxx.com",
    "summary": "xxx is a program that performs some malicious behavior.",
    "url": "http://xxxx/xxxx",
    "author": null,
    "publisher": "xxxx",
    "version": "xxxx",
    "discovered_date": null,
    "updated_date": "February 1, 2000 00:00:01 AM",
    "risk_level": null,
    "risk_impact_level": "High",
    "ref_cve": null,
    "remarks": {
      "Symantec Doc Id": "xxx"
    },
    "related_file_names": [
      "some_malicious.so",
      "some_malicious.dll",
      "xxxx.exe",
      "xxxx.bin"
    ],
    "related_file_paths": [
      "C:\\Windows\\System"
    ],
    "related_ips": [],
    "affected_systems": [
      "Windows",
      "Linux",
      "OsX"
    ],
    "alias": [],
    "description": "",
    "extracted": [],
    "id": "4dd42f6b29414f3898eb2f102f96e336",
    "threat_type": "spyware",
    "type": "THREAT",
    "some more fields": "some more values"
  },
  {"...": "..."},
  {"...": "..."}
]
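To see which fields are still incomplete before running the extractors, you can scan an .intel.json file for empty values; a small sketch (it treats null, empty strings, and empty collections as unfilled):

```python
import json


def unfilled_fields(intel_path):
    """Map each intelligence id to the names of its empty fields."""
    with open(intel_path, encoding="utf-8") as f:
        entries = json.load(f)
    result = {}
    for entry in entries:
        # Fields equal to null/None, "", [], or {} are considered unfilled.
        empty = [k for k, v in entry.items() if v in (None, "", [], {})]
        result[entry["id"]] = sorted(empty)
    return result
```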
As the core functionality of kg4sec, we want to capture the large volume of knowledge buried in unstructured security-related description text. We provide different NLP-based extractors to extract entities and relations from the parsed reports. The command is
kg4sec extract -i <input_filename> -o <output_filename>
The format is the same as in step 4, but more fields will be filled; for example, more IOCs and their relations will be extracted.
JSON is a good format for storing information. However, it is not designed for flexible data access, which is critical for downstream applications such as visualizing the extracted knowledge. In kg4sec, we support exporting knowledge from the internal representation to other storage backends, especially graph databases such as neo4j. Using the default neo4j database connector takes two steps.
To connect to a neo4j database, you need to provide access information such as hostname, port, username, and password. This information can be set via configuration files, which we write in YAML. An example is shown below (for a more complete example, see config_default.yml):
database:
  neo4j:
    host: "localhost"
    port: 7687
    username: "neo4j"
    password: "neo4j"
Put the above content into a file called my_config.yml. Once the configuration file is ready, run
kg4sec export -f neo4j -i <input_filename> -c <modified_config>
Here, the input_filename should by default be a something.intel.json file, and <modified_config> can be the my_config.yml you just created in the same folder. This might take some time, so you may want to use --use_tqdm to monitor progress. Once it finishes, you will have the security knowledge graph stored in neo4j. Congratulations!
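Under the hood, a connector only needs those four configuration values to open a database session. A rough sketch of how a connection could be built with the official neo4j Python driver (build_bolt_uri is a hypothetical helper for this example, not part of kg4sec):

```python
def build_bolt_uri(cfg):
    """Build a bolt:// URI from the database.neo4j section of the config."""
    neo4j_cfg = cfg["database"]["neo4j"]
    return "bolt://{}:{}".format(neo4j_cfg["host"], neo4j_cfg["port"])


if __name__ == "__main__":
    # Requires the `neo4j` package and a running database; shown for illustration.
    from neo4j import GraphDatabase

    cfg = {"database": {"neo4j": {"host": "localhost", "port": 7687,
                                  "username": "neo4j", "password": "neo4j"}}}
    auth = (cfg["database"]["neo4j"]["username"],
            cfg["database"]["neo4j"]["password"])
    with GraphDatabase.driver(build_bolt_uri(cfg), auth=auth) as driver:
        driver.verify_connectivity()
```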
To enable the system to process reports from your specified sources, you need to
- Implement your own crawlers for data collection
- Implement a parser for that specific source
  - Find the corresponding abstraction (e.g. is it a Threat or a Blog? You can find the properties in intelligence.py). If you think the knowledge from the new source belongs to something completely different, feel free to create your own abstraction inheriting from Intelligence.
  - Add a parser file like this one. You can use BeautifulSoup for easy HTML parsing.
  - Add an entry in this file.
  - Put a SMALL batch of instances (<1 MB for a single source; please limit the size of this folder and make sure no data regulation is violated) into the test data folder at .../kg4sec/tests/test_data.
  - Write a test file like this one to test whether your parser works with the rest of the components.
- Optional: adjust or add extractors to make the extraction of entities and relations more accurate
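To make the parser step above concrete, here is a purely illustrative sketch using BeautifulSoup. The class name, source name, and selectors are invented for this example; the returned field names should follow the actual abstraction in intelligence.py:

```python
from bs4 import BeautifulSoup


class MyNewSourceParser:
    """Hypothetical parser for a new report source (all names are illustrative)."""

    source_name = "my_new_source"

    def parse(self, html):
        # Extract a few structured fields from the raw HTML of one report.
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("h1")
        summary = soup.find("p", class_="summary")
        return {
            "name": title.get_text(strip=True) if title else None,
            "summary": summary.get_text(strip=True) if summary else None,
            "source": self.source_name,
        }
```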
By default, our system uses neo4j for knowledge storage. However, you can also implement other connectors to support different storage backends. The steps are
- Convert our intelligence abstraction into something that can be stored in your backend
- Use the corresponding python package to connect to your backend
- And don't forget to put the environment-related values into the default configuration file. For how we handle configurations in our project, see config_default.yml for an example configuration file, and see how we load it in our util function.
- Add related test cases in the tests folder and test them with pytest
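As an illustration of these steps (not an existing kg4sec connector), a minimal backend built on the stdlib sqlite3 module might look like the sketch below; the class name and table schema are made up, and a real connector would cover all the intelligence fields:

```python
import sqlite3


class SqliteConnector:
    """Illustrative storage backend: flattens intelligence dicts into one table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS intelligence ("
            "id TEXT PRIMARY KEY, name TEXT, source TEXT, summary TEXT)"
        )

    def export(self, entries):
        # Convert each intelligence dict into a row; missing fields become NULL.
        rows = [(e["id"], e.get("name"), e.get("source"), e.get("summary"))
                for e in entries]
        with self.conn:  # commit on success, roll back on error
            self.conn.executemany(
                "INSERT OR REPLACE INTO intelligence VALUES (?, ?, ?, ?)", rows)
        return len(rows)
```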