dynamic: add extractor for VMRay dynamic sandbox traces #2208

Merged
merged 125 commits into from
Aug 27, 2024
3141e94
Add vmray text to JSON parser.
r-sm2024 Jun 10, 2024
bdc94c1
Merge branch 'master' into vmray_extractor
r-sm2024 Jun 11, 2024
a9dafe2
example using pydantic-xml to parse flog.xml
mr-tz Jun 13, 2024
a797405
vmray: add example models for summary_v2.json
mike-hunhoff Jun 13, 2024
ca02b4a
vmray: expand extractor to emit file export features
mike-hunhoff Jun 13, 2024
970b184
vmray: add stubs for file imports
mike-hunhoff Jun 13, 2024
7d0ac71
vmray: cleanup pydantic models and implement file section extraction
mike-hunhoff Jun 13, 2024
8d3f032
vmray: clean up pydantic models and implement base address extraction
mike-hunhoff Jun 13, 2024
346a069
vmray: clean up VMRayAnalysis
mike-hunhoff Jun 13, 2024
7e079d4
vmray: restrict analysis to PE files
mike-hunhoff Jun 13, 2024
00cb792
vmray: clean up pydantic models and add sample hash extraction
mike-hunhoff Jun 13, 2024
8b913e0
vmray: extract global features for PE files
mike-hunhoff Jun 14, 2024
6548048
vmray: clean up global_.py debug output
mike-hunhoff Jun 14, 2024
51656fe
vmray: merge upstream
mike-hunhoff Jun 18, 2024
f3d6952
vmray: invoke VMRay feature extractor from capa.main
mike-hunhoff Jun 18, 2024
8f32b7f
vmray: emit process handles
mike-hunhoff Jun 18, 2024
b3ebf80
vmray: emit process name
mike-hunhoff Jun 18, 2024
be274d1
Merge branch 'mandiant:master' into vmray_extractor
r-sm2024 Jun 18, 2024
e5fa800
vmray: emit empty thread features
mike-hunhoff Jun 18, 2024
d26a806
vmray: update scripts/show-features.py to emit process name from extr…
mike-hunhoff Jun 18, 2024
2b70086
Add VMRayanalysis model and call parser
r-sm2024 Jun 18, 2024
3cca808
Add VMRayanalysis model and call parser
r-sm2024 Jun 18, 2024
574d61a
Add VMRayanalysis model and call parser
r-sm2024 Jun 18, 2024
85a85e9
vmray: emit recorded artifacts as strings
mike-hunhoff Jun 18, 2024
789332e
Merge branch 'vmray-extractor' into vmray_extractor
r-sm2024 Jun 18, 2024
21887d1
vmray: merge upstream
mike-hunhoff Jun 18, 2024
a1a1712
Merge branch 'vmray-extractor' into vmray_extractor
mr-tz Jun 19, 2024
a544aed
add vmray-extractor branch for tests
mr-tz Jun 19, 2024
d10b396
add pydantic-xml dependency
mr-tz Jun 19, 2024
453a640
formatting
mr-tz Jun 19, 2024
fbdfea1
add testing code
mr-tz Jun 19, 2024
d256cc8
update model and re-add summary_v2.json models
mr-tz Jun 19, 2024
740c739
remove file
mr-tz Jun 19, 2024
0c9d3d0
fix ruff
mr-tz Jun 19, 2024
8757dad
Merge pull request #2155 from r-sm2024/vmray_extractor
mr-tz Jun 19, 2024
5be68d0
vmray: remove debug code and update call features entry point
mike-hunhoff Jun 20, 2024
ec21f3b
vmray: use xmltodict instead of pydantic_xml to improve performance
mike-hunhoff Jun 20, 2024
19502ef
vmray: connect process, thread, and call
mike-hunhoff Jun 20, 2024
9ef705a
vmray: remove old comments
mike-hunhoff Jun 20, 2024
544899a
vmray: add os v. monitor id comment
mike-hunhoff Jun 20, 2024
4b08e62
vmray: fix flake8 lints
mike-hunhoff Jun 20, 2024
29fa315
vmray: fix deptry lints
mike-hunhoff Jun 20, 2024
9df611f
vmray: add comments
mike-hunhoff Jun 20, 2024
ec6c9c9
vmray: remove unused fields from summary_v2 pydantic models
mike-hunhoff Jun 20, 2024
9be35f9
vmray: remove unneeded unpacking
mike-hunhoff Jun 20, 2024
d1f6bb3
Merge branch 'master' into vmray-extractor
mr-tz Jul 3, 2024
194017b
vmray: merge upstream
mike-hunhoff Jul 12, 2024
81581fe
vmray: emit string file featureS
mike-hunhoff Jul 12, 2024
cbf6ecb
Merge branch 'vmray-extractor' of github.com:mandiant/capa into vmray…
mike-hunhoff Jul 12, 2024
aad4854
vmray: use process OS PID instead of monitor ID
mike-hunhoff Jul 12, 2024
bcdaa80
vmray: emit file import features
mike-hunhoff Jul 12, 2024
da05457
vmray: emit number call features for input parameters
mike-hunhoff Jul 12, 2024
5b7a0ca
vmray: emit number call features for output parameters
mike-hunhoff Jul 12, 2024
e2f5eb7
vmray: clean up models
mike-hunhoff Jul 12, 2024
4bbe9e1
vmray: emit number and string call features for pointer dereference
mike-hunhoff Jul 13, 2024
06631fc
vmray: remove call feature extraction for out parameters
mike-hunhoff Jul 13, 2024
931a9b9
vmray: clean up models
mike-hunhoff Jul 13, 2024
85632f6
vmray: clean up models
mike-hunhoff Jul 13, 2024
253d70e
vmray: add comments
mike-hunhoff Jul 13, 2024
307b0cc
vmray: add comments
mike-hunhoff Jul 13, 2024
1f5b6ec
vmray: improve comments
mike-hunhoff Jul 13, 2024
26b5870
vmray: improve comments
mike-hunhoff Jul 13, 2024
28c278b
vmray: improve comments
mike-hunhoff Jul 13, 2024
4f2467c
vmray: update CHANGELOG
mike-hunhoff Jul 13, 2024
5214675
vmray: update tests.yml
mike-hunhoff Jul 13, 2024
42fddfb
vmray: improve comments
mike-hunhoff Jul 13, 2024
af26bef
vmray: fix lints
mike-hunhoff Jul 13, 2024
1588974
vmray: merge upstream
mike-hunhoff Jul 17, 2024
b68a91e
vmray: validate supported flog version
mike-hunhoff Jul 17, 2024
ec7e431
vmray: update comment for extract_process_features
mike-hunhoff Jul 17, 2024
cc87ef3
vmray: remove and document extract_call_features comments
mike-hunhoff Jul 17, 2024
100df45
vmray: add logging for skipped deref param types
mike-hunhoff Jul 17, 2024
19a6f3a
vmray: improve supported file type validation
mike-hunhoff Jul 17, 2024
330c77a
vmray: implement get_call_name
mike-hunhoff Jul 17, 2024
fd7bd94
vmray: remove outdated comments
mike-hunhoff Jul 18, 2024
5afea29
vmray: update CHANGELOG release notes with VMRay integration
mike-hunhoff Jul 18, 2024
998537d
vmray: remove outdated comments
mike-hunhoff Jul 18, 2024
64a09d3
vmray: remove broken assert for unique OS PIDs
mike-hunhoff Jul 18, 2024
6f7cc7c
vmray: improve detections for unsupported input files
mike-hunhoff Jul 18, 2024
24a31a8
vmray: add comments to __init__.py
mike-hunhoff Jul 18, 2024
8bf0d16
vmray: add init support for ELF files
mike-hunhoff Jul 18, 2024
6e0dc83
vmray: refactor global_.py
mike-hunhoff Jul 19, 2024
673f7cc
vmray: refactor models.py
mike-hunhoff Jul 19, 2024
658927c
vmray: refactor models.py
mike-hunhoff Jul 19, 2024
28792ec
vmray: add model tests for FunctionCall
mike-hunhoff Jul 19, 2024
2ba2a2b
vmray: remove unneeded json.loads from __init__.py
mike-hunhoff Jul 19, 2024
4490097
vmray: add summary_v2.json model tests
mike-hunhoff Jul 19, 2024
98939f8
vmray: improve FunctionCall model
mike-hunhoff Jul 19, 2024
4dfc53a
vmray: refactor model tests
mike-hunhoff Jul 19, 2024
6ef485f
vmray: refactor model tests
mike-hunhoff Jul 19, 2024
3b94961
vmray: complete pefile model tests
mike-hunhoff Jul 19, 2024
46b68d1
vmray: improve models.py comments
mike-hunhoff Jul 23, 2024
cbdc744
vmray: merge upstream
mike-hunhoff Jul 23, 2024
31e53fa
vmray: improve models.py comments
mike-hunhoff Jul 23, 2024
f471386
vmray: merge upstream and fix conflicts
mike-hunhoff Jul 24, 2024
f6d12bc
vmray: fix lints
mike-hunhoff Jul 24, 2024
85373a7
cape: add explicit check for CAPE report format file extension
mike-hunhoff Jul 24, 2024
6e146bb
vmray: fix lints
mike-hunhoff Jul 24, 2024
9a1364c
vmray: document vmray support in README
mike-hunhoff Jul 24, 2024
b8d3d77
vmray: document vmray support in README
mike-hunhoff Jul 24, 2024
5b7a2be
vmray: remove outdated comments __init__.py
mike-hunhoff Jul 25, 2024
7b3812a
vmray: improve error reporting
mike-hunhoff Jul 25, 2024
05fb8f6
vmray: fix flake8 lints
mike-hunhoff Jul 25, 2024
b967213
vmray: improve comments __init__.py
mike-hunhoff Jul 25, 2024
3043fd6
vmray: merge upstream
mike-hunhoff Jul 29, 2024
51b853d
vmray: remove bad print statements
mike-hunhoff Jul 29, 2024
1a3cf4a
vmray: update extractor.py format_params
mike-hunhoff Jul 29, 2024
8cba23b
vmray: improve extract_import_names
mike-hunhoff Jul 29, 2024
87dfa50
scripts: remove old code from show-features.py
mike-hunhoff Jul 29, 2024
7bf0b39
core: improve error message for vmray
mike-hunhoff Jul 29, 2024
139dcc4
vmray: improve logging
mike-hunhoff Jul 29, 2024
71c515d
vmray: improve comments __init__.py
mike-hunhoff Jul 29, 2024
a8d849e
vmray: improve comments models.py
mike-hunhoff Jul 30, 2024
3982356
load gzipped rd, see capa-testfiles#245
mr-tz Jul 31, 2024
e83f289
add script to minimize vmray archive to only relevant files
mr-tz Jul 31, 2024
e476354
add dynamic vmray feature tests
mr-tz Jul 31, 2024
afb7286
assert sample analysis data is present
mr-tz Aug 1, 2024
c0a7f76
Merge branch 'master' into vmray-extractor
mr-tz Aug 9, 2024
6ff08ae
Merge branch 'master' into vmray-extractor
yelhamer Aug 17, 2024
d98c315
Merge branch 'master' into vmray-extractor
mr-tz Aug 26, 2024
e8550f2
rename using dashes for consistency
mr-tz Aug 26, 2024
9eab7eb
update names
mr-tz Aug 26, 2024
6ce130e
Merge branch 'master' into vmray-extractor
mr-tz Aug 26, 2024
e468116
Merge branch 'vmray-extractor' of github.com:mandiant/capa into vmray…
mr-tz Aug 26, 2024
fa92cfd
Merge branch 'master' into vmray-extractor
mr-tz Aug 26, 2024
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -2,12 +2,15 @@

## master (unreleased)

Unlock powerful malware analysis with capa's new [VMRay sandbox](https://www.vmray.com/) integration! Simply provide a VMRay analysis archive, and capa will automatically extract and match capabilities, streamlining your workflow.

### New Features
- regenerate ruleset cache automatically on source change (only in dev mode) #2133 @s-ff

- add landing page https://mandiant.github.io/capa/ @williballenthin #2310
- add rules website https://mandiant.github.io/capa/rules @DeeyaSingh #2310
- add .justfile @williballenthin #2325
- dynamic: add support for VMRay dynamic sandbox traces #2208 @mike-hunhoff @r-sm2024 @mr-tz

### Breaking Changes

12 changes: 7 additions & 5 deletions README.md
@@ -150,13 +150,15 @@ function @ 0x4011C0
...
```

## analyzing sandbox reports
Additionally, capa also supports analyzing sandbox reports for dynamic capability extraction.
In order to use this, you first submit your sample to one of supported sandboxes for analysis, and then run capa against the generated report file.
capa also supports dynamic capabilities detection for multiple sandboxes including:
* [CAPE](https://github.com/kevoreilly/CAPEv2) (supported report formats: `.json`, `.json_`, `.json.gz`)
* [DRAKVUF](https://github.com/CERT-Polska/drakvuf-sandbox/) (supported report formats: `.log`, `.log.gz`)
* [VMRay](https://www.vmray.com/) (supported report formats: analysis archive `.zip`)

Currently, capa supports the [CAPE sandbox](https://github.com/kevoreilly/CAPEv2) and the [DRAKVUF sandbox](https://github.com/CERT-Polska/drakvuf-sandbox/). In order to use either, simply run capa against the generated file (JSON for CAPE or LOG for DRAKVUF sandbox) and it will automatically detect the sandbox and extract capabilities from it.

Here's an example of running capa against a packed binary, and then running capa against the CAPE report of that binary:
To use this feature, submit your file to a supported sandbox and then download and run capa against the generated report file. This feature enables capa to match capabilities against dynamic and static features that the sandbox captured during execution.
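
The report formats listed above can be sketched as a simple suffix lookup. This is only an illustration under stated assumptions: `REPORT_SUFFIXES` and `guess_sandbox` are hypothetical names, and capa's own detection may inspect file contents rather than rely on the file name.

```python
from pathlib import Path
from typing import Optional

# Suffixes taken from the README list; the mapping itself is a hypothetical helper.
REPORT_SUFFIXES = {
    "cape": (".json", ".json_", ".json.gz"),
    "drakvuf": (".log", ".log.gz"),
    "vmray": (".zip",),
}


def guess_sandbox(path: Path) -> Optional[str]:
    """Guess the sandbox from the report file name, or return None if unknown."""
    name = path.name.lower()
    for sandbox, suffixes in REPORT_SUFFIXES.items():
        # str.endswith accepts a tuple of candidate suffixes
        if name.endswith(suffixes):
            return sandbox
    return None
```

A name-based guess like this is cheap but cannot tell a VMRay archive from any other `.zip` file, so it would only serve as a first filter.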

Here's an example of running capa against a packed file, and then running capa against the CAPE report generated for the same packed file:

```yaml
$ capa 05be49819139a3fdcdbddbdefd298398779521f3d68daa25275cc77508e42310.exe
2 changes: 2 additions & 0 deletions capa/features/common.py
@@ -462,6 +462,7 @@ def evaluate(self, features: "capa.engine.FeatureSet", short_circuit=True):
FORMAT_SC64 = "sc64"
FORMAT_CAPE = "cape"
FORMAT_DRAKVUF = "drakvuf"
FORMAT_VMRAY = "vmray"
FORMAT_FREEZE = "freeze"
FORMAT_RESULT = "result"
STATIC_FORMATS = {
@@ -476,6 +477,7 @@ def evaluate(self, features: "capa.engine.FeatureSet", short_circuit=True):
DYNAMIC_FORMATS = {
    FORMAT_CAPE,
    FORMAT_DRAKVUF,
    FORMAT_VMRAY,
    FORMAT_FREEZE,
    FORMAT_RESULT,
}
161 changes: 161 additions & 0 deletions capa/features/extractors/vmray/__init__.py
@@ -0,0 +1,161 @@
# Copyright (C) 2024 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
import logging
from typing import Dict, List, Tuple, Optional
from pathlib import Path
from zipfile import ZipFile
from collections import defaultdict

from capa.exceptions import UnsupportedFormatError
from capa.features.extractors.vmray.models import File, Flog, SummaryV2, StaticData, FunctionCall, xml_to_dict

logger = logging.getLogger(__name__)

DEFAULT_ARCHIVE_PASSWORD = b"infected"

SUPPORTED_FLOG_VERSIONS = ("2",)


class VMRayAnalysis:
    def __init__(self, zipfile_path: Path):
        self.zipfile = ZipFile(zipfile_path, "r")

        # summary_v2.json is the entry point to the entire VMRay archive and
        # we use its data to find everything else that we need for capa
        self.sv2 = SummaryV2.model_validate_json(
            self.zipfile.read("logs/summary_v2.json", pwd=DEFAULT_ARCHIVE_PASSWORD)
        )
        self.file_type: str = self.sv2.analysis_metadata.sample_type

        # flog.xml contains all of the call information that VMRay captured during execution
        flog_xml = self.zipfile.read("logs/flog.xml", pwd=DEFAULT_ARCHIVE_PASSWORD)
        flog_dict = xml_to_dict(flog_xml)
        self.flog = Flog.model_validate(flog_dict)

        if self.flog.analysis.log_version not in SUPPORTED_FLOG_VERSIONS:
            raise UnsupportedFormatError(
                "VMRay feature extractor does not support flog version %s" % self.flog.analysis.log_version
            )

        self.exports: Dict[int, str] = {}
        self.imports: Dict[int, Tuple[str, str]] = {}
        self.sections: Dict[int, str] = {}
        self.process_ids: Dict[int, int] = {}
        self.process_threads: Dict[int, List[int]] = defaultdict(list)
        self.process_calls: Dict[int, Dict[int, List[FunctionCall]]] = defaultdict(lambda: defaultdict(list))
        self.base_address: int

        self.sample_file_name: Optional[str] = None
        self.sample_file_analysis: Optional[File] = None
        self.sample_file_static_data: Optional[StaticData] = None

        self._find_sample_file()

        # VMRay analysis archives come in various shapes and sizes and file type does not definitively tell us
        # what data we can expect to find in the archive, so to be explicit we check for the various pieces that
        # we need at minimum to run capa analysis
        if self.sample_file_name is None or self.sample_file_analysis is None:
            raise UnsupportedFormatError("VMRay archive does not contain sample file (file_type: %s)" % self.file_type)

        if not self.sample_file_static_data:
            raise UnsupportedFormatError("VMRay archive does not contain static data (file_type: %s)" % self.file_type)

        if not self.sample_file_static_data.pe and not self.sample_file_static_data.elf:
            raise UnsupportedFormatError(
                "VMRay feature extractor only supports PE and ELF at this time (file_type: %s)" % self.file_type
            )

        # VMRay does not store static strings for the sample file so we must use the source file
        # stored in the archive
        sample_sha256: str = self.sample_file_analysis.hash_values.sha256.lower()
        sample_file_path: str = f"internal/static_analyses/{sample_sha256}/objects/files/{sample_sha256}"

        logger.debug("file_type: %s, file_path: %s", self.file_type, sample_file_path)

        self.sample_file_buf: bytes = self.zipfile.read(sample_file_path, pwd=DEFAULT_ARCHIVE_PASSWORD)

        self._compute_base_address()
        self._compute_imports()
        self._compute_exports()
        self._compute_sections()
        self._compute_process_ids()
        self._compute_process_threads()
        self._compute_process_calls()

    def _find_sample_file(self):
        for file_name, file_analysis in self.sv2.files.items():
            if file_analysis.is_sample:
                # target the sample submitted for analysis
                self.sample_file_name = file_name
                self.sample_file_analysis = file_analysis

                if file_analysis.ref_static_data:
                    # like "path": ["static_data","static_data_0"] where "static_data_0" is the summary_v2 static data
                    # key for the file's static data
                    self.sample_file_static_data = self.sv2.static_data[file_analysis.ref_static_data.path[1]]

                break

    def _compute_base_address(self):
        assert self.sample_file_static_data is not None
        if self.sample_file_static_data.pe:
            self.base_address = self.sample_file_static_data.pe.basic_info.image_base

    def _compute_exports(self):
        assert self.sample_file_static_data is not None
        if self.sample_file_static_data.pe:
            for export in self.sample_file_static_data.pe.exports:
                self.exports[export.address] = export.api.name

    def _compute_imports(self):
        assert self.sample_file_static_data is not None
        if self.sample_file_static_data.pe:
            for module in self.sample_file_static_data.pe.imports:
                for api in module.apis:
                    self.imports[api.address] = (module.dll, api.api.name)

    def _compute_sections(self):
        assert self.sample_file_static_data is not None
        if self.sample_file_static_data.pe:
            for pefile_section in self.sample_file_static_data.pe.sections:
                self.sections[pefile_section.virtual_address] = pefile_section.name
        elif self.sample_file_static_data.elf:
            for elffile_section in self.sample_file_static_data.elf.sections:
                self.sections[elffile_section.header.sh_addr] = elffile_section.header.sh_name

    def _compute_process_ids(self):
        for process in self.sv2.processes.values():
            # we expect VMRay's monitor IDs to be unique, but OS PIDs may be reused
            assert process.monitor_id not in self.process_ids.keys()
            self.process_ids[process.monitor_id] = process.os_pid

    def _compute_process_threads(self):
        # logs/flog.xml appears to be the only file that contains thread-related data
        # so we use it here to map processes to threads
        for function_call in self.flog.analysis.function_calls:
            pid: int = self.get_process_os_pid(function_call.process_id)  # flog.xml uses process monitor ID, not OS PID
            tid: int = function_call.thread_id

            assert isinstance(pid, int)
            assert isinstance(tid, int)

            if tid not in self.process_threads[pid]:
                self.process_threads[pid].append(tid)

    def _compute_process_calls(self):
        for function_call in self.flog.analysis.function_calls:
            pid: int = self.get_process_os_pid(function_call.process_id)  # flog.xml uses process monitor ID, not OS PID
            tid: int = function_call.thread_id

            assert isinstance(pid, int)
            assert isinstance(tid, int)

            self.process_calls[pid][tid].append(function_call)

    def get_process_os_pid(self, monitor_id: int) -> int:
        return self.process_ids[monitor_id]
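
The process, thread, and call bookkeeping in `_compute_process_ids` and `_compute_process_calls` can be illustrated in isolation. The following is a minimal sketch, not the PR's code: `Call` is a hypothetical stand-in for the `FunctionCall` model with only the fields needed here, and `group_calls` is an invented helper name.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Call:
    # hypothetical stand-in for FunctionCall; only the fields used below
    process_id: int  # VMRay monitor ID, not the OS PID
    thread_id: int
    name: str


def group_calls(monitor_to_pid: Dict[int, int], calls: List[Call]) -> Dict[int, Dict[int, List[Call]]]:
    """Group calls by OS PID, then by thread ID, preserving call order."""
    grouped: Dict[int, Dict[int, List[Call]]] = defaultdict(lambda: defaultdict(list))
    for call in calls:
        # translate the monitor ID recorded in the log to the OS PID
        pid = monitor_to_pid[call.process_id]
        grouped[pid][call.thread_id].append(call)
    return grouped


monitor_to_pid = {1: 4242, 2: 4343}
calls = [Call(1, 10, "CreateFileW"), Call(1, 10, "WriteFile"), Call(2, 20, "Sleep")]
grouped = group_calls(monitor_to_pid, calls)
```

The nested `defaultdict` avoids explicit existence checks when a new process or thread is first seen, at the cost of silently creating empty entries on lookup.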
53 changes: 53 additions & 0 deletions capa/features/extractors/vmray/call.py
@@ -0,0 +1,53 @@
# Copyright (C) 2024 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
import logging
from typing import Tuple, Iterator

from capa.features.insn import API, Number
from capa.features.common import String, Feature
from capa.features.address import Address
from capa.features.extractors.vmray.models import PARAM_TYPE_INT, PARAM_TYPE_STR, Param, FunctionCall, hexint
from capa.features.extractors.base_extractor import CallHandle, ThreadHandle, ProcessHandle

logger = logging.getLogger(__name__)


def get_call_param_features(param: Param, ch: CallHandle) -> Iterator[Tuple[Feature, Address]]:
    if param.deref is not None:
        # pointer types contain a special "deref" member that stores the deref'd value
        # so we check for this first and ignore Param.value as this always contains the
        # deref'd pointer value
        if param.deref.value is not None:
            if param.deref.type_ in PARAM_TYPE_INT:
                yield Number(hexint(param.deref.value)), ch.address
            elif param.deref.type_ in PARAM_TYPE_STR:
                yield String(param.deref.value), ch.address
            else:
                logger.debug("skipping deref param type %s", param.deref.type_)
    elif param.value is not None:
        if param.type_ in PARAM_TYPE_INT:
            yield Number(hexint(param.value)), ch.address


def extract_call_features(ph: ProcessHandle, th: ThreadHandle, ch: CallHandle) -> Iterator[Tuple[Feature, Address]]:
    call: FunctionCall = ch.inner

    if call.params_in:
        for param in call.params_in.params:
            yield from get_call_param_features(param, ch)

    yield API(call.name), ch.address


def extract_features(ph: ProcessHandle, th: ThreadHandle, ch: CallHandle) -> Iterator[Tuple[Feature, Address]]:
    for handler in CALL_HANDLERS:
        for feature, addr in handler(ph, th, ch):
            yield feature, addr


CALL_HANDLERS = (extract_call_features,)
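
The deref-first decision in `get_call_param_features` can be exercised without capa installed. This is a minimal sketch under stated assumptions: `Param`, `Deref`, `INT_TYPES`, `STR_TYPES`, and `parse_int` are simplified, hypothetical stand-ins for the models and constants imported above, and the type names in the sets are illustrative rather than VMRay's actual values.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple, Union

# assumed type names, for illustration only
INT_TYPES = {"signed_32bit", "unsigned_32bit"}
STR_TYPES = {"str"}


@dataclass
class Deref:
    type_: str
    value: Optional[str] = None


@dataclass
class Param:
    name: str
    type_: str
    value: Optional[str] = None
    deref: Optional[Deref] = None


def parse_int(value: str) -> int:
    """Parse a hex ("0x...") or decimal string into an int."""
    return int(value, 16) if value.startswith("0x") else int(value, 10)


def param_features(param: Param) -> Iterator[Tuple[str, Union[int, str]]]:
    if param.deref is not None:
        # pointer parameter: use only the dereferenced value, never the raw pointer
        if param.deref.value is not None:
            if param.deref.type_ in INT_TYPES:
                yield "number", parse_int(param.deref.value)
            elif param.deref.type_ in STR_TYPES:
                yield "string", param.deref.value
    elif param.value is not None and param.type_ in INT_TYPES:
        yield "number", parse_int(param.value)


ptr = Param("lpFileName", "ptr", "0xaabbcc", Deref("str", "test.txt"))
size = Param("nSize", "unsigned_32bit", "0x10")
```

Here `ptr` yields only a string feature and `size` yields only a number feature; the raw pointer value `0xaabbcc` never becomes a feature, which keeps run-specific addresses out of the results.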
122 changes: 122 additions & 0 deletions capa/features/extractors/vmray/extractor.py
@@ -0,0 +1,122 @@
# Copyright (C) 2024 Mandiant, Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: [package root]/LICENSE.txt
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.


from typing import List, Tuple, Iterator
from pathlib import Path

import capa.helpers
import capa.features.extractors.vmray.call
import capa.features.extractors.vmray.file
import capa.features.extractors.vmray.global_
from capa.features.common import Feature, Characteristic
from capa.features.address import NO_ADDRESS, Address, ThreadAddress, DynamicCallAddress, AbsoluteVirtualAddress
from capa.features.extractors.vmray import VMRayAnalysis
from capa.features.extractors.vmray.models import PARAM_TYPE_STR, Process, ParamList, FunctionCall
from capa.features.extractors.base_extractor import (
    CallHandle,
    SampleHashes,
    ThreadHandle,
    ProcessHandle,
    DynamicFeatureExtractor,
)


def get_formatted_params(params: ParamList) -> List[str]:
    params_list: List[str] = []

    for param in params:
        if param.deref and param.deref.value is not None:
            deref_value: str = f'"{param.deref.value}"' if param.deref.type_ in PARAM_TYPE_STR else param.deref.value
            params_list.append(f"{param.name}: {deref_value}")
        else:
            value: str = "" if param.value is None else param.value
            params_list.append(f"{param.name}: {value}")

    return params_list


class VMRayExtractor(DynamicFeatureExtractor):
    def __init__(self, analysis: VMRayAnalysis):
        assert analysis.sample_file_analysis is not None

        super().__init__(
            hashes=SampleHashes(
                md5=analysis.sample_file_analysis.hash_values.md5.lower(),
                sha1=analysis.sample_file_analysis.hash_values.sha1.lower(),
                sha256=analysis.sample_file_analysis.hash_values.sha256.lower(),
            )
        )

        self.analysis = analysis

        # pre-compute these because we'll yield them at *every* scope.
        self.global_features = list(capa.features.extractors.vmray.global_.extract_features(self.analysis))

    def get_base_address(self) -> Address:
        # value according to the PE header, the actual trace may use a different imagebase
        return AbsoluteVirtualAddress(self.analysis.base_address)

    def extract_file_features(self) -> Iterator[Tuple[Feature, Address]]:
        yield from capa.features.extractors.vmray.file.extract_features(self.analysis)

    def extract_global_features(self) -> Iterator[Tuple[Feature, Address]]:
        yield from self.global_features

    def get_processes(self) -> Iterator[ProcessHandle]:
        yield from capa.features.extractors.vmray.file.get_processes(self.analysis)

    def extract_process_features(self, ph: ProcessHandle) -> Iterator[Tuple[Feature, Address]]:
        # we have not identified process-specific features for VMRay yet
        yield from []

    def get_process_name(self, ph) -> str:
        process: Process = ph.inner
        return process.image_name

    def get_threads(self, ph: ProcessHandle) -> Iterator[ThreadHandle]:
        for thread in self.analysis.process_threads[ph.address.pid]:
            address: ThreadAddress = ThreadAddress(process=ph.address, tid=thread)
            yield ThreadHandle(address=address, inner={})

    def extract_thread_features(self, ph: ProcessHandle, th: ThreadHandle) -> Iterator[Tuple[Feature, Address]]:
        if False:
            # force this routine to be a generator,
            # but we don't actually have any elements to generate.
            yield Characteristic("never"), NO_ADDRESS
        return

    def get_calls(self, ph: ProcessHandle, th: ThreadHandle) -> Iterator[CallHandle]:
        for function_call in self.analysis.process_calls[ph.address.pid][th.address.tid]:
            addr = DynamicCallAddress(thread=th.address, id=function_call.fncall_id)
            yield CallHandle(address=addr, inner=function_call)

    def extract_call_features(
        self, ph: ProcessHandle, th: ThreadHandle, ch: CallHandle
    ) -> Iterator[Tuple[Feature, Address]]:
        yield from capa.features.extractors.vmray.call.extract_features(ph, th, ch)

    def get_call_name(self, ph, th, ch) -> str:
        call: FunctionCall = ch.inner
        call_formatted: str = call.name

        # format input parameters
        if call.params_in:
            call_formatted += f"({', '.join(get_formatted_params(call.params_in.params))})"
        else:
            call_formatted += "()"

        # format output parameters
        if call.params_out:
            call_formatted += f" -> {', '.join(get_formatted_params(call.params_out.params))}"

        return call_formatted

    @classmethod
    def from_zipfile(cls, zipfile_path: Path):
        return cls(VMRayAnalysis(zipfile_path))
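
The string that `get_call_name` builds has the shape `name(in params) -> out params`. A minimal sketch of that formatting follows, assuming parameters are reduced to hypothetical `(name, value, is_string)` triples instead of the `Param` model; `format_params` and `format_call` are illustrative names, not part of the PR.

```python
from typing import List, Optional, Tuple

# (name, value, is_string); a hypothetical simplification of the Param model
ParamT = Tuple[str, Optional[str], bool]


def format_params(params: List[ParamT]) -> List[str]:
    formatted: List[str] = []
    for name, value, is_string in params:
        if value is None:
            # missing values render as an empty string after the name
            formatted.append(f"{name}: ")
        elif is_string:
            formatted.append(f'{name}: "{value}"')
        else:
            formatted.append(f"{name}: {value}")
    return formatted


def format_call(name: str, params_in: List[ParamT], params_out: List[ParamT]) -> str:
    text = f"{name}({', '.join(format_params(params_in))})"
    if params_out:
        text += f" -> {', '.join(format_params(params_out))}"
    return text


rendered = format_call(
    "CreateFileW",
    [("lpFileName", "test.txt", True), ("dwFlags", "0x80", False)],
    [("ret_val", "0x4", False)],
)
```

Quoting only string values makes it easier to tell a string such as `"0x80"` apart from the number `0x80` in rendered output.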