395 add the information about match context to the database #439
base: master
Changes from 14 commits
New file: Alembic database migration (revision f623e1057b00)
@@ -0,0 +1,26 @@
+"""Added context column into match table
+Revision ID: f623e1057b00
+Revises: 6b495d5a4855
+Create Date: 2024-11-13 15:14:14.618258
+"""
+from alembic import op
+import sqlalchemy as sa
+
+
+# revision identifiers, used by Alembic.
+revision = "f623e1057b00"
+down_revision = "6b495d5a4855"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    # ### commands auto generated by Alembic - please adjust! ###
+    op.add_column("match", sa.Column("context", sa.JSON(), nullable=True))
+    # ### end Alembic commands ###
+
+
+def downgrade() -> None:
+    # ### commands auto generated by Alembic - please adjust! ###
+    op.drop_column("match", "context")
+    # ### end Alembic commands ###
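The new column is stored as JSON. Based on the types and keys used in the worker changes below, a value written to match.context would look roughly like the sketch that follows; only the key names and the base64 encoding come from the diff, while the rule name and the sample bytes are invented for illustration:

```python
# Illustrative only: the shape of the JSON written to match.context, inferred
# from get_match_context() further down. Keys and base64 encoding are from the
# diff; the rule name and the sample bytes are made up.
import base64

example_context = {
    # The top-level key comes from str(yara_match), typically the rule name.
    "example_rule": [
        {
            "before": base64.b64encode(b"bytes preceding the match").decode("utf-8"),
            "matching": base64.b64encode(b"the matched bytes").decode("utf-8"),
            "after": base64.b64encode(b"bytes following the match").decode("utf-8"),
        }
    ]
}
```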
Modified file: the worker module containing the Agent class
@@ -1,4 +1,5 @@
-from typing import List, Optional, cast
+import base64
+from typing import List, Optional, cast, Dict
 import logging
 from rq import get_current_job, Queue  # type: ignore
 from redis import Redis
@@ -68,7 +69,12 @@ def get_datasets(self) -> List[str]:
         return list(result["result"]["datasets"].keys())

     def update_metadata(
-        self, job: JobId, orig_name: str, path: str, matches: List[str]
+        self,
+        job: JobId,
+        orig_name: str,
+        path: str,
+        matches: List[str],
+        context: Dict[str, List[Dict[str, str]]],
     ) -> None:
         """Saves matches to the database, and runs appropriate metadata
         plugins.
@@ -93,9 +99,49 @@ def update_metadata(
         del metadata["path"]

         # Update the database.
-        match = Match(file=orig_name, meta=metadata, matches=matches)
+        match = Match(
+            file=orig_name, meta=metadata, matches=matches, context=context
+        )
         self.db.add_match(job, match)

+    @staticmethod
+    def read_file(file_path: str) -> bytes:
+        """Reads the entire file content.
+        Returns:
+            bytes: The content of the file.
+        """
+        with open(file_path, "rb") as file:
+            return file.read()
+
+    @staticmethod
+    def read_bytes_from_offset(
+        data: bytes, matched_length: int, offset: int, byte_range: int = 32
+    ) -> tuple[bytes, bytes, bytes]:
+        """Reads a specific range of bytes from the already loaded file content around a given offset.
+        Args:
+            data (bytes): Data to read.
+            matched_length (int): Number of bytes to read.
+            offset (int): The offset in bytes from which to start reading.
+            byte_range (int): The range in bytes to read around the offset (default is 32).
+        Returns:
+            bytes: A chunk of bytes from the file, starting from the given offset minus bit_range
+            and ending at offset plus matched_length and byte_range.
+        """

Review comment (on the method signature): Not a very good method name. Maybe

Review comment (on the method signature): Also there's no reason to use

Review comment (on the docstring): Generated by chatgpt? 😀 I don't think the docstrings here are very good (some parts are actually plain wrong), and we don't generally use that docstring format for arguments, so I think argument docs can/should be removed. Maybe:

Suggested change
+
+        before = data[max(0, offset - byte_range) : offset]
+        matching = data[offset : offset + matched_length]
+        after = data[
+            offset
+            + matched_length : min(
+                len(data), offset + matched_length + byte_range
+            )
+        ]
+
+        return before, matching, after
+
     def execute_yara(self, job: Job, files: List[str]) -> None:
         rule = yara.compile(source=job.raw_yara)
         num_matches = 0
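As a quick sanity check of the slicing above, here is a small standalone example; the function body is copied from the diff, and the sample data and offsets are invented:

```python
# Standalone check of the context-slicing logic; the body mirrors
# read_bytes_from_offset() from the diff above, the inputs are made up.
def read_bytes_from_offset(data, matched_length, offset, byte_range=32):
    before = data[max(0, offset - byte_range) : offset]
    matching = data[offset : offset + matched_length]
    after = data[
        offset + matched_length : min(len(data), offset + matched_length + byte_range)
    ]
    return before, matching, after


data = b"0123456789ABCDEF"
# A 4-byte match at offset 6 ("6789") with a 4-byte context window:
before, matching, after = read_bytes_from_offset(data, matched_length=4, offset=6, byte_range=4)
assert (before, matching, after) == (b"2345", b"6789", b"ABCD")
```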
@@ -108,10 +154,18 @@ def execute_yara(self, job: Job, files: List[str]) -> None:
                 path = self.plugins.filter(orig_name)
                 if not path:
                     continue

                 matches = rule.match(path)
                 if matches:
+                    data = self.read_file(path)
+                    context = self.get_match_context(data, matches)
+
                     self.update_metadata(
-                        job.id, orig_name, path, [r.rule for r in matches]
+                        job=job.id,
+                        orig_name=orig_name,
+                        path=path,
+                        matches=[r.rule for r in matches],
+                        context=context,
                     )
                     num_matches += 1
             except yara.Error:

Review comment (on the keyword arguments): Why explicit parameter names everywhere? It's a bit verbose, but OK - just wondering.

Reply: I read somewhere that when we use such a notation, we don't have to worry about the order of the arguments, and we already know during the function call which variables are being assigned to (if the naming differs). That was someone's opinion; I'll adapt to the project then. :)
@@ -140,6 +194,36 @@ def execute_yara(self, job: Job, files: List[str]) -> None:
             f"in {scanned_datasets}/{job.total_datasets} ({dataset_percent:.0%}) of datasets.",
         )

+    def get_match_context(
+        self, data: bytes, matches: List[yara.Match]
+    ) -> dict:
+        context = {}

Review comment: for now we use typing classes everywhere (and also misses key/value types)

Suggested change

(value type to be fixed depending on how other comments are resolved)

+
+        for yara_match in matches:
+            match_context = []
+            for string_match in yara_match.strings:
+                expression_keys = []
+                for expression_key in string_match.instances:

Review comment (on expression_keys): Since this is only used for presence checking, this should be a set instead of a list. But I think it's not necessary at all (see next comment)

+
+                    if expression_key in expression_keys:
+                        continue

Review comment: This check is a no-op - for (regular, with no eq override) python object

Review comment: I don't fully get the idea behind this. If the intention was to only add a single match to the context (as originally planned, I think) then you should check if

Or even better, just take the first instance - no loop necessary.
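To illustrate what the no-op remark appears to be getting at: for objects that do not override __eq__, the in operator falls back to identity comparison, so two distinct instances describing the same match never compare equal. A minimal, self-contained sketch follows; the Instance class here is a stand-in, not yara-python's actual type:

```python
# Stand-in class, not yara-python's StringMatchInstance: it only demonstrates
# that membership tests on objects without __eq__ compare by identity.
class Instance:
    def __init__(self, offset: int, matched_length: int) -> None:
        self.offset = offset
        self.matched_length = matched_length


seen = [Instance(offset=10, matched_length=4)]
duplicate = Instance(offset=10, matched_length=4)  # same data, different object

print(duplicate in seen)  # False - the check never filters out "duplicates"
print(seen[0] in seen)    # True only because it is literally the same object
```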
+
+                    (before, matching, after,) = self.read_bytes_from_offset(
+                        data=data,
+                        offset=expression_key.offset,
+                        matched_length=expression_key.matched_length,
+                    )
+                    match_context.append(
+                        {
+                            "before": base64.b64encode(before).decode("utf-8"),
+                            "matching": base64.b64encode(matching).decode(
+                                "utf-8"
+                            ),
+                            "after": base64.b64encode(after).decode("utf-8"),
+                        }
+                    )
+                    context.update({str(yara_match): match_context})
+                    expression_keys.append(expression_key)

Review comment: One thing that's missing is that you don't put the string name anywhere - this makes it impossible to tell which string matched. In most cases there is a single yara rule per query, but multiple strings. It definitely must be stored somewhere. Probably the easiest way is to use a dict for context instead of a list:

Review comment: Finally, this expression is overly complicated - I believe

is equivalent to

Reply: Here, in the match key, we have information about the string we are searching for. But okay, I'll replace the list with a dictionary.
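Pulling the review points together, one possible shape for get_match_context is a dict keyed by rule name and string identifier, using only the first occurrence of each string. The sketch below is my own illustration under those assumptions, not the reviewer's captured suggestion:

```python
# Sketch only: combines the review suggestions (dict keyed by string identifier,
# first occurrence per string, no manual dedup); not the reviewer's literal code.
import base64
from typing import Dict, List

import yara  # yara-python, whose Match objects expose .strings and .instances


def get_match_context(
    data: bytes, matches: List[yara.Match]
) -> Dict[str, Dict[str, Dict[str, str]]]:
    # {rule_name: {string_identifier: {"before": ..., "matching": ..., "after": ...}}}
    context: Dict[str, Dict[str, Dict[str, str]]] = {}
    for yara_match in matches:
        rule_context: Dict[str, Dict[str, str]] = {}
        for string_match in yara_match.strings:
            if not string_match.instances:
                continue
            # One occurrence per string is enough for a context preview.
            instance = string_match.instances[0]
            offset, length, margin = instance.offset, instance.matched_length, 32
            before = data[max(0, offset - margin) : offset]
            matching = data[offset : offset + length]
            after = data[offset + length : offset + length + margin]  # slicing clamps to len(data)
            rule_context[string_match.identifier] = {
                "before": base64.b64encode(before).decode("utf-8"),
                "matching": base64.b64encode(matching).decode("utf-8"),
                "after": base64.b64encode(after).decode("utf-8"),
            }
        context[yara_match.rule] = rule_context
    return context
```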
+
+        return context
+
     def init_search(self, job: Job, tasks: int) -> None:
         self.db.init_jobagent(job, self.db_id, tasks)

Review comment: I don't think this method is necessary - you can just inline this. Even if you prefer to keep it, it doesn't belong in this class (Agent). But IMO it's best to inline it.