kaybee merge takes long even with one single repository configured #48

Open
sumeetpatil opened this issue Nov 6, 2020 · 5 comments
Labels
bug Something isn't working component/kaybee

Comments

@sumeetpatil
Member

sumeetpatil commented Nov 6, 2020

OS: macOS
kaybee merge takes a long time even with a single repository configured
kaybee version 0.6.15

I used a git-cloned copy on the local file system for the tests. kaybeeconf.yaml contains a single repo:
repo: file:///*********

Git cloned file system with tar containing 577 vulnerabilities -

kaybee pull
date && kaybee merge && date   
Fri Nov  6 11:20:33 CET 2020
......
Fri Nov  6 11:45:36 CET 2020

It took ~25 minutes.

Git cloned file system without tar containing 720 vulnerabilities -

kaybee pull
date && kaybee merge && date   
Fri Nov  6 12:04:08 CET 2020
......
Fri Nov  6 12:17:47 CET 2020

It took ~13 minutes.
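For reference, the exact durations can be recomputed from the two pairs of `date` outputs above. This is only a small arithmetic sketch: the timestamps are copied from the report (timezone dropped), and the per-entry figures assume that the reported vulnerability counts (577 and 720) are the number of entries the merge had to process.

```python
from datetime import datetime, timedelta

# Format of the `date` output above, with the timezone (CET) dropped.
FMT = "%a %b %d %H:%M:%S %Y"


def elapsed(start: str, end: str) -> timedelta:
    """Return the time elapsed between two `date`-style timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


with_tar = elapsed("Fri Nov 6 11:20:33 2020", "Fri Nov 6 11:45:36 2020")
without_tar = elapsed("Fri Nov 6 12:04:08 2020", "Fri Nov 6 12:17:47 2020")

print(with_tar)     # 0:25:03
print(without_tar)  # 0:13:39

# Rough cost per entry, assuming the counts reported in this issue.
print(round(with_tar.total_seconds() / 577, 1))     # 2.6 seconds
print(round(without_tar.total_seconds() / 720, 1))  # 1.1 seconds
```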

@sumeetpatil sumeetpatil added the bug Something isn't working label Nov 6, 2020
@copernico
Contributor

copernico commented Nov 6, 2020

I think the issue is that we use Git as the underlying storage: if we accessed the files on the local filesystem (after cloning), this would be much faster. I need to validate this theory, but if it is true, we should adopt a more neutral approach to storing statements. Git should be supported like any other storage, but after the remote sources are copied locally, we must be able to access them as ordinary files, not as commits (from which we extract files).

The consequence is that the signature will not be contained in the commit but in a dedicated file, sibling of statement.yaml (which is a positive side effect). The only downside is that one would have to sign statements before committing (instead of signing on commit), but this is a negligible disadvantage compared to the advantages we would obtain.

@sumeetpatil
Member Author

Hi @copernico,
It is actually fast on Linux: approximately 3-4 minutes for https://github.com/SAP/project-kb/tree/vulnerability-data-with-changed-source-code . I was using a Mac for the numbers above. I will get you the exact numbers from running the same test on Linux.

@copernico
Contributor

Hi @sumeetpatil
thanks for the additional investigation, good to know that on Linux the issue is not as annoying. I still consider 3-4 mins too long for what the merge operation actually does, so getting rid of the unnecessary coupling with git is still on the agenda for me.
Maybe some profiling would put the final word on this?

@henrikplate
Contributor

The "unnecessary coupling" only refers to the merge operation, correct? I think we should not question the use of Git as storage for statements, as it comes with plenty of features, e.g., signatures and version control, that would be expensive to develop on top of other storages.

@copernico
Contributor

copernico commented Nov 12, 2020

> The "unnecessary coupling" only refers to the merge operation, correct? I think we should not question to use Git as storage for statements, as it comes with plenty of features, e.g., signatures and version control, that would be expensive to develop on top of other storages.

Hi @henrikplate, I know we have clarified this in the meantime, but let me write it down as documentation for anyone interested.

The way we use Git is more than just storing statements in a repository; we also use it to sign statements. As a matter of fact, the key entities that kaybee operates on are commits (not statement files). Commits are signed, statements are not. It is not possible to validate the signature of a statement file, only to validate the signature of a commit (which in turn may contain multiple statement files).

I would therefore decouple the two concerns:

  • storage/retrieval of statement files
  • manipulation (validation, merging, exporting) of statement data

I see no issue in using Git for the former (indeed, I do agree that keeping versions of statements is a great idea), but we should also be able to treat statement files per se, without requiring that they be "wrapped" by a commit. When working with the information contained in statements, it should not matter how that information was stored in a remote source. Git or FTP should be the same.

A commit (in Git, SVN, or any other code management system) is just a way to "transport" the data. Once retrieved locally (kaybee pull), the statement files should be accessed as files. The implication is that it should be possible to validate them as files (hence, we need to keep the signature as a file, not as metadata of a commit).

Proposal:

  • we keep supporting git, in the sense that kaybee must keep the capability to pull from git repositories
  • we add the ability to fetch statements over HTTP from a normal web server (I would add the requirement of going through SSL)
  • we also add the ability to fetch statements from other 'convenience' source types (local folders and maybe compressed archives); this would enable us to fetch from git and then treat the local copy as a source of statement (files) -- remember: currently we cannot use the local clone that way, because the files are not enough (we need the commits)
  • we start storing signatures as separate files; anyone will be able to take a statement and the accompanying signature file to validate one against the other, regardless of which repository type was used to store them. This is the decoupling I'm advocating.
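To illustrate the "local copy as a source of statement files" idea, here is a minimal sketch in Python (kaybee itself is not written in Python, and this is not its implementation). It walks a local folder, however it was obtained (git clone, HTTP download, unpacked archive), and pairs every statement.yaml with a sibling signature file. The signature file name `statement.yaml.asc` and the names `find_statements` and `LocalStatement` are assumptions made up for this example.

```python
from pathlib import Path
from typing import Iterator, NamedTuple, Optional

STATEMENT_NAME = "statement.yaml"      # file name mentioned in this thread
SIGNATURE_NAME = "statement.yaml.asc"  # hypothetical name for the sibling signature file


class LocalStatement(NamedTuple):
    statement: Path
    signature: Optional[Path]  # None when no sibling signature file exists


def find_statements(root: Path) -> Iterator[LocalStatement]:
    """Yield every statement file under `root`, at any depth, with its sibling signature."""
    for statement in sorted(root.rglob(STATEMENT_NAME)):
        signature = statement.with_name(SIGNATURE_NAME)
        yield LocalStatement(statement, signature if signature.is_file() else None)
```

Nothing in this sketch touches commits or Git objects: once the files are on disk, the origin of the folder no longer matters, which is the decoupling described in the proposal.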

Consequences:

  • we cannot sign statements by signing commits; this is not a big deal IMHO, because one can still sign with the gpg command or, even better, use kaybee publish --sign to sign statements prior to publishing (note: this command does not exist at this time)
  • operations involving reading and writing statements are (much?) faster because we would be manipulating files instead of Git trees and commits.
  • we can store and transfer statements any way we like; of course, versioning statements remains the recommended (but not required) way
  • no strict requirements on hosting statements in a particular branch of a repo: any branch would do, any folder name, at any depth. Also, as written above, a folder in a webserver would do. Finally, a compressed archive sent via email would be fine too (I'm sure this is useful in certain scenarios).
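As a rough sketch of what "sign with the gpg command" could look like once signatures live in separate files, the Python helpers below build the standard gpg invocations for creating and checking a detached signature. The `.asc` suffix, the helper names, and the idea of shelling out to gpg are assumptions for illustration only; none of this is an existing kaybee feature.

```python
import subprocess
from pathlib import Path
from typing import List


def signature_path(statement: Path) -> Path:
    """Sibling signature file; the `.asc` suffix is an assumption, not a kaybee convention."""
    return statement.with_name(statement.name + ".asc")


def sign_command(statement: Path) -> List[str]:
    """gpg invocation that writes an ASCII-armored detached signature next to the statement."""
    return ["gpg", "--armor", "--output", str(signature_path(statement)),
            "--detach-sign", str(statement)]


def verify_command(statement: Path) -> List[str]:
    """gpg invocation that checks the detached signature against the statement file."""
    return ["gpg", "--verify", str(signature_path(statement)), str(statement)]


def verify(statement: Path) -> bool:
    """Run gpg; requires gpg to be installed and the signer's public key to be imported."""
    result = subprocess.run(verify_command(statement), capture_output=True)
    return result.returncode == 0
```

Because the check only needs the statement and its signature file, it works the same way whether the pair came from a Git clone, a web server, or an archive.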


3 participants