Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

395 add the information about match context to the database #439

Draft
wants to merge 22 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from 14 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
"""Added context column into match table
Revision ID: f623e1057b00
Revises: 6b495d5a4855
Create Date: 2024-11-13 15:14:14.618258
"""
from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision = "f623e1057b00"
down_revision = "6b495d5a4855"
branch_labels = None
depends_on = None


def upgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
op.add_column("match", sa.Column("context", sa.JSON(), nullable=True))
# ### end Alembic commands ###


def downgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
op.drop_column("match", "context")
# ### end Alembic commands ###
1 change: 1 addition & 0 deletions src/models/match.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,3 +22,4 @@ class Match(SQLModel, table=True):
)
)
job: Job = Relationship(back_populates="matches")
context: Dict[str, List[Dict[str, str]]] = Field(sa_column=Column(JSON))
92 changes: 88 additions & 4 deletions src/tasks.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
from typing import List, Optional, cast
import base64
from typing import List, Optional, cast, Dict
import logging
from rq import get_current_job, Queue # type: ignore
from redis import Redis
Expand Down Expand Up @@ -68,7 +69,12 @@ def get_datasets(self) -> List[str]:
return list(result["result"]["datasets"].keys())

def update_metadata(
self, job: JobId, orig_name: str, path: str, matches: List[str]
self,
job: JobId,
orig_name: str,
path: str,
matches: List[str],
context: Dict[str, List[Dict[str, str]]],
) -> None:
"""Saves matches to the database, and runs appropriate metadata
plugins.
Expand All @@ -93,9 +99,49 @@ def update_metadata(
del metadata["path"]

# Update the database.
match = Match(file=orig_name, meta=metadata, matches=matches)
match = Match(
file=orig_name, meta=metadata, matches=matches, context=context
)
self.db.add_match(job, match)

@staticmethod
def read_file(file_path: str) -> bytes:
"""Reads the entire file content.
Returns:
bytes: The content of the file.
"""
with open(file_path, "rb") as file:
return file.read()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this method is necessary - you can just inline this.

Even if you prefer to keep it, it doesn't belong in this class (Agent). But IMO it's best to inline it.

@staticmethod
def read_bytes_from_offset(
data: bytes, matched_length: int, offset: int, byte_range: int = 32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a very good method name. Maybe read_bytes_with_context?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also there's no reason to use @staticmethod here - it's generic, and has no relation to the Agent class. Please make it a global method

) -> tuple[bytes, bytes, bytes]:
"""Reads a specific range of bytes from the already loaded file content around a given offset.
Args:
data (bytes): Data to read.
matched_length (int): Number of bytes to read.
offset (int): The offset in bytes from which to start reading.
byte_range (int): The range in bytes to read around the offset (default is 32).
Returns:
bytes: A chunk of bytes from the file, starting from the given offset minus bit_range
and ending at offset plus matched_length and byte_range.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generated by chatgpt? 😀

I don't think the docstrings here are very good (some parts are actually plain wrong), and we don't generally use that docstring format for arguments, so I think argument docs can/should be removed. Maybe:

Suggested change
"""Reads a specific range of bytes from the already loaded file content around a given offset.
Args:
data (bytes): Data to read.
matched_length (int): Number of bytes to read.
offset (int): The offset in bytes from which to start reading.
byte_range (int): The range in bytes to read around the offset (default is 32).
Returns:
bytes: A chunk of bytes from the file, starting from the given offset minus bit_range
and ending at offset plus matched_length and byte_range.
"""Return `matched_length` bytes from `offset`, along with `byte_range` bytes before and after the match.


before = data[max(0, offset - byte_range) : offset]
matching = data[offset : offset + matched_length]
after = data[
offset
+ matched_length : min(
len(data), offset + matched_length + byte_range
)
]

return before, matching, after

def execute_yara(self, job: Job, files: List[str]) -> None:
rule = yara.compile(source=job.raw_yara)
num_matches = 0
Expand All @@ -108,10 +154,18 @@ def execute_yara(self, job: Job, files: List[str]) -> None:
path = self.plugins.filter(orig_name)
if not path:
continue

matches = rule.match(path)
if matches:
data = self.read_file(path)
context = self.get_match_context(data, matches)

self.update_metadata(
job.id, orig_name, path, [r.rule for r in matches]
job=job.id,
orig_name=orig_name,
path=path,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why explicit parameter names everywhere? It's a bit verbose, but OK - just wondering.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read somewhere that when we use such a notation, we don’t have to worry about the order of the arguments, and we already know during the function call what variables are being assigned to (if the naming differs). That was someone’s opinion; I’ll adapt to the project then. :)

matches=[r.rule for r in matches],
context=context,
)
num_matches += 1
except yara.Error:
Expand Down Expand Up @@ -140,6 +194,36 @@ def execute_yara(self, job: Job, files: List[str]) -> None:
f"in {scanned_datasets}/{job.total_datasets} ({dataset_percent:.0%}) of datasets.",
)

def get_match_context(
self, data: bytes, matches: List[yara.Match]
) -> dict:
context = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now we use typing classes everywhere (and also misses key/value types)

Suggested change
) -> dict:
) -> Dict[str, ???]:

(value type to be fixed depending on how other comments are resolved)

for yara_match in matches:
match_context = []
for string_match in yara_match.strings:
expression_keys = []
for expression_key in string_match.instances:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only used for presence checking, this should be a set instead of a list. But I think it's not necessary at all (see next comment)

if expression_key in expression_keys:
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is a no-op - for (regular, with no eq override) python object object in SOMETHING will always be false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully get the idea behind this. If the intention was to only add a single match to the context (as originally planned, I think) then you should check if string_match was already added instead.

Or even better, just take the first instance - no loop necessary


(before, matching, after,) = self.read_bytes_from_offset(
data=data,
offset=expression_key.offset,
matched_length=expression_key.matched_length,
)
match_context.append(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(before, matching, after,) = self.read_bytes_from_offset(
data=data,
offset=expression_key.offset,
matched_length=expression_key.matched_length,
)
(before, matching, after) = self.read_bytes_from_offset(
data,
expression_key.offset,
expression_key.matched_length,
)

{
"before": base64.b64encode(before).decode("utf-8"),
"matching": base64.b64encode(matching).decode(
"utf-8"
),
"after": base64.b64encode(after).decode("utf-8"),
}
)
context.update({str(yara_match): match_context})
expression_keys.append(expression_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str(yara_match)? Probably better to avoid relying on str() representation and to use something like yara_match.rule (or .name? I don't remember)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that's missing is that you don't put string name anywhere - this makes it impossible to tell which string matched. In most cases there is a single yara rule per query, but multiple strings.

It definitely must be stored somewhere. Probably the easiest way is to use a dict for contetx instead of a list:

{
    "rule_name_1": {
        "string_1": {
            "before": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
            "match": "AbcDe9f=",
            "after": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
        },
        "string_2": {
            "before": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
            "match": "AbcDe9f=",
            "after": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
        }
    },
    "rule_name_2": {
        "string_1": {
            "before": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
            "match": "AbcDe9f=",
            "after": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
        },
        "string_2": {
            "before": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
            "match": "AbcDe9f=",
            "after": "CgkJCQkJcHJpbnQgU1RERVJSICRtc2c7CgkJCQkJdW4=",
        }
    }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, this expression is overly complicated - I believe

context.update({A: B})

is equivalent to

context[A] = B

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, in the match key, we have information about the string we are searching for. But okay, I’ll replace the list with a dictionry.

return context

def init_search(self, job: Job, tasks: int) -> None:
self.db.init_jobagent(job, self.db_id, tasks)

Expand Down
Loading