
feat: new JobFile model to store UrsaDB iterator batch files #420

Open · wants to merge 7 commits into master

Conversation

@mickol34 (Collaborator) commented Oct 9, 2024

Your checklist for this pull request

  • I've read the contributing guidelines.
  • I've tested my changes by building and running mquery, and testing changed functionality (if applicable)
  • I've added automated tests for my change (if applicable, optional)
  • I've updated documentation to reflect my change (if applicable)

What is the current behaviour?

Currently, a query to UrsaDB returns an iterator and its length, which is then passed to the batcher and the YARA parser to unpack.

What is the new behaviour?
Now the iterator is popped immediately after querying, and several JobFile objects are created that store the batches of files on which to execute YARA.
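The flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the PR's actual code: JobFile is a plain dataclass standing in for the SQLModel table, store_job_files is a hypothetical helper, and the batch size is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List

BATCH_SIZE = 3  # illustrative; the real batch size would be much larger


@dataclass
class JobFile:
    """Stand-in for the JobFile DB model: one batch of file names per row."""
    job_id: str
    files: List[str]


def store_job_files(job_id: str, results: Iterable[str]) -> List[JobFile]:
    """Drain the query results and group them into fixed-size JobFile batches."""
    batches: List[JobFile] = []
    batch: List[str] = []
    for path in results:
        batch.append(path)
        if len(batch) == BATCH_SIZE:
            batches.append(JobFile(job_id, batch))
            batch = []
    if batch:  # flush the final, possibly partial batch
        batches.append(JobFile(job_id, batch))
    return batches
```

Each JobFile row would then be deleted once YARA finishes successfully on its batch, as the test plan below notes.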

Test plan

The app should behave the same way for various queries and rules. If some other system failure occurs while executing YARA, JobFile instances should remain in the database (since they are deleted only upon successful YARA execution).

Closing issues

fixes #381

src/lib/ursadb.py (outdated review thread, resolved)

class QueryResult(SQLModel, table=True):
    job_id: str = Field(foreign_key="job.internal_id", primary_key=True)
    files: List[str] = Field(sa_column=Column(ARRAY(String)))
Member:

Don't use SQL arrays for this; it should be a regular record (file: str).

Collaborator (author):

So this means I should create multiple objects, one for every file returned from .query()?

src/lib/ursadb.py (outdated review thread, resolved)
src/models/queryresult.py (outdated review thread, resolved)
src/tasks.py (outdated review thread, resolved)
src/tasks.py (outdated review thread, resolved)
@msm-cert (Member) commented:

BTW, this fails because the QueryResult table does not exist in the DB. You need to create an alembic migration for it (see also the recent PR about the enum rework). I didn't re-review the rest of the code yet.
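For context, such an alembic migration could look roughly like the sketch below. This is an assumption, not the PR's actual migration: the revision header is omitted, and the table and column names mirror the QueryResult snippet shown earlier in the thread rather than a final schema.

```python
# Hypothetical alembic migration fragment for the new table.
# Revision identifiers are omitted on purpose; alembic generates them.
from alembic import op
import sqlalchemy as sa


def upgrade() -> None:
    op.create_table(
        "jobfile",
        sa.Column("id", sa.Integer(), primary_key=True),
        sa.Column(
            "job_id", sa.String(), sa.ForeignKey("job.internal_id"), nullable=False
        ),
        sa.Column("file", sa.String(), nullable=False),
    )


def downgrade() -> None:
    op.drop_table("jobfile")
```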

@mickol34 mickol34 force-pushed the fix/store-job-files-in-new-model-381 branch from a3470ba to 173fa53 Compare October 17, 2024 10:16
@mickol34 (Collaborator, author) commented:

Referring to #381: I'm not sure ursadb.pop() is dead code, since it's used in the all_indexed_files and all_indexed_names functions. If there are any other cleanups and/or tests to add, please comment below and I'll fix them.

@mickol34 mickol34 marked this pull request as ready for review October 17, 2024 13:55
@mickol34 mickol34 requested a review from msm-cert October 17, 2024 13:56
@msm-cert msm-cert changed the title Draft: fix: rewrite query_ursadb not to use iterators Draft: rewrite query_ursadb not to use iterators Oct 18, 2024
@msm-cert (Member) commented:

Ugh, this is tough with regards to RAM usage. Nothing wrong with this PR by itself, but I'll have to think what to suggest. Let's keep it open as a draft for now. Sorry!

By the way, in the current form storing the files in the database (QueryResult objects) is a bit pointless, since nobody ever uses them, right? 🙃

@msm-cert (Member) left a review comment:

Ok, I think we can continue with this but with some large changes:

  1. Restore query to the previous version (so it returns an iterator).
  2. When querying, use that iterator to immediately add all the files from the result to the database (call pop in a loop and insert into the database; no need to handle the was_locked state).
  3. I think QueryResult should be something like JobFile (id: int, job_id: int, file: str), so workers don't have to get all the results at once (they can just select with limit and offset).
  4. Alternatively, insert multiple QueryResults into the database, each with a batch of files (this is easier to implement but less flexible; still probably OK).
  5. Importantly, when adding tasks with agent.queue.enqueue(... don't put filenames there. That's because the queue is stored in Redis (== RAM) and could potentially grow very large. Instead, store just an offset or a batch_id (depending on which of the previous options you picked).

The reasoning here is that we prefer to have files in the database instead of in the opaque and weird ursadb iterator, but we have to be careful to avoid keeping the entire set of files at once in memory. In most cases this set will be small, but we can't OOM if someone does a huge query (this was actually a problem in the past).
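Option 3 above can be sketched as follows. This uses stdlib sqlite3 purely as a stand-in for the project's SQLModel/Postgres setup, and the table and column names are illustrative, not the PR's final schema: one row per file, with workers paging through their slice via LIMIT/OFFSET so the full file set is never held in memory at once.

```python
import sqlite3

# In-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_file ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " job_id TEXT NOT NULL,"
    " file TEXT NOT NULL)"
)
# Pretend the query-drain step inserted these rows (one per file).
conn.executemany(
    "INSERT INTO job_file (job_id, file) VALUES (?, ?)",
    [("job-1", f"/samples/file{i}") for i in range(10)],
)


def fetch_batch(conn: sqlite3.Connection, job_id: str, limit: int, offset: int):
    """A worker fetches only its own slice of files, keeping memory bounded."""
    rows = conn.execute(
        "SELECT file FROM job_file WHERE job_id = ?"
        " ORDER BY id LIMIT ? OFFSET ?",
        (job_id, limit, offset),
    ).fetchall()
    return [r[0] for r in rows]
```

Per point 5, the enqueued task payload would then carry only (job_id, offset) or a batch_id, so Redis holds a couple of integers per task instead of potentially millions of filenames.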

I realise this may sound a bit convoluted; feel free to contact me in case of any questions.

@mickol34 mickol34 force-pushed the fix/store-job-files-in-new-model-381 branch from caed157 to e81a04a Compare November 6, 2024 15:49
@mickol34 mickol34 requested a review from msm-cert November 6, 2024 15:58
@mickol34 mickol34 changed the title Draft: rewrite query_ursadb not to use iterators feat: new JobFile model to store UrsaDB iterator batch files Nov 6, 2024
Development

Successfully merging this pull request may close these issues.

Store job files in the database instead of ursadb iterators