Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lookup_entry and lookup_entry_by_path to TreeRef #1686

Merged
merged 10 commits into from
Nov 24, 2024

Conversation

cruessler
Copy link
Contributor

This PR is related to a comment in my gix blame PR: #1453 (comment).

I wrote the new code in the context of #1453 and then cherry-picked the result into this separate PR. I initially took both methods from

/// Follow a sequence of `path` components starting from this instance, and look them up one by one until the last component
/// is looked up and its tree entry is returned.
///
/// # Performance Notes
///
/// Searching tree entries is currently done in sequence, which allows to the search to be allocation free. It would be possible
/// to reuse a vector and use a binary search instead, which might be able to improve performance over all.
/// However, a benchmark should be created first to have some data and see which trade-off to choose here.
///
pub fn lookup_entry<I, P>(&self, path: I) -> Result<Option<Entry<'repo>>, find::existing::Error>
where
I: IntoIterator<Item = P>,
P: PartialEq<BStr>,
{
let mut buf = self.repo.empty_reusable_buffer();
buf.clear();
let mut path = path.into_iter().peekable();
buf.extend_from_slice(&self.data);
while let Some(component) = path.next() {
match TreeRefIter::from_bytes(&buf)
.filter_map(Result::ok)
.find(|entry| component.eq(entry.filename))
{
Some(entry) => {
if path.peek().is_none() {
return Ok(Some(Entry {
inner: entry.into(),
repo: self.repo,
}));
} else {
let next_id = entry.oid.to_owned();
let obj = self.repo.objects.find(&next_id, &mut buf)?;
if !obj.kind.is_tree() {
return Ok(None);
}
}
}
None => return Ok(None),
}
}
Ok(None)
}
and
/// Like [`Self::lookup_entry()`], but takes a `Path` directly via `relative_path`, a path relative to this tree.
///
/// # Note
///
/// If any path component contains illformed UTF-8 and thus can't be converted to bytes on platforms which can't do so natively,
/// the returned component will be empty which makes the lookup fail.
pub fn lookup_entry_by_path(
&self,
relative_path: impl AsRef<std::path::Path>,
) -> Result<Option<Entry<'repo>>, find::existing::Error> {
use crate::bstr::ByteSlice;
self.lookup_entry(relative_path.as_ref().components().map(|c: std::path::Component<'_>| {
gix_path::os_str_into_bstr(c.as_os_str())
.unwrap_or_else(|_| "".into())
.as_bytes()
}))
}
, respectively. I left the doc comments untouched as they seem adequate in their new context as well. The code was modified to work with the APIs available in gix-object. I needed to add gix-path as a dependency in order to have access to gix_path::os_str_into_bstr.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for making more convenient tree access available on 'the lower levels'.

Something that is sorely missing is tests. With that I think it would quickly become clear that this isn't actually working as the implementation doesn't attempt to load its subtrees.

@cruessler
Copy link
Contributor Author

Thanks a lot for making more convenient tree access available on 'the lower levels'.

Something that is sorely missing is tests. With that I think it would quickly become clear that this isn't actually working as the implementation doesn't attempt to load its subtrees.

Thanks for having a first look! (Sorry if this first attempt is incomplete. I only have limited context yet on how this works, so any feedback is greatly appreciated. I hope that this does not take away too much of your time. I’m a big fan of getting feedback very early as I’m certain one or two hints from your side will help me a lot initially. I’ll mark this PR as draft for the time being.)

@cruessler cruessler marked this pull request as draft November 18, 2024 12:45
@Byron
Copy link
Member

Byron commented Nov 18, 2024

No worries at all. I think this this crate also has just the trait you need: Find, so these subtrees can be loaded into a buffer which probably would have to be passed as parameter as well. And from there it should work very similarly as 'on the upper levels'.

@cruessler
Copy link
Contributor Author

I hope I now have a better understanding of how lookup by path works. Just to confirm: is it correct to recursively load trees and run lookup_entry on them?

The code as I just pushed it, does not compile as the buffer that is passed into lookup_entry cannot be passed to a recursive call to lookup_entry. Before I dig into how to best solve that, I first want to make sure this is indeed the correct approach. If it is, what would be the most idiomatic solution to this problem?

@Byron
Copy link
Member

Byron commented Nov 21, 2024

Just to confirm: is it correct to recursively load trees and run lookup_entry on them?

In abstract terms, yes, but it should really be implemented like in gix so it's a loop. Real recursion is only ever done when it's also limited by some counter to prevent attacks.

If it is, what would be the most idiomatic solution to this problem?

Rust helps :). Like mentioned before, it should be possible to translate the gix level implementation to this one, it remains the same except that everything that's Repository is replaced with something passed as parameter.

@cruessler
Copy link
Contributor Author

I just pushed a non-recursive implementation that passes the 2 tests that I had added before, so it seems to be working. :-)

I’m not really happy that it has self.entries.clone(), even though that was the most straightforward way to work around the ownership issues that I had when reading from both self.entries as well as obj.entries in subsequent iterations of the loop.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much more like it, thank you!

The following patch would fix the problem of clone()

diff --git a/gix-object/src/tree/ref_iter.rs b/gix-object/src/tree/ref_iter.rs
index fda6f7275..76f063a9f 100644
--- a/gix-object/src/tree/ref_iter.rs
+++ b/gix-object/src/tree/ref_iter.rs
@@ -1,4 +1,5 @@
 use bstr::BStr;
+use std::borrow::Cow;
 use winnow::{error::ParserError, prelude::*};
 
 use crate::{tree, tree::EntryRef, TreeRef, TreeRefIter};
@@ -65,7 +66,7 @@ impl<'a> TreeRef<'a> {
         P: PartialEq<BStr>,
     {
         let mut path = path.into_iter().peekable();
-        let mut entries = self.entries.clone();
+        let mut entries = Cow::Borrowed(&self.entries);
 
         while let Some(component) = path.next() {
             match entries.iter().find(|entry| component.eq(entry.filename)) {
@@ -76,7 +77,7 @@ impl<'a> TreeRef<'a> {
                         let next_id = entry.oid.to_owned();
                         let obj = odb.find_tree(&next_id, buffer)?;
 
-                        entries = obj.entries;
+                        entries = Cow::Owned(obj.entries);
                     }
                 }
                 None => return Ok(None),

…but it would still leave the problem of inefficiency as it allocates where it shouldn't.

TreeRefIter should be used instead.

P: PartialEq<BStr>,
{
let mut path = path.into_iter().peekable();
let mut entries = self.entries.clone();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trick should be to stick to what's done in gix and implement this on the TreeRefIter, something I thought I saw in an earlier version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I implemented the method on TreeRef was that, in its current iteration, the gix-blame PR uses odb.find_tree which returns a TreeRef. Since it seems I could switch to odb.find_tree_iter in the gix-blame PR instead, I’ll move lookup_entry to TreeRefIter as you suggested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! This has the chance to be a small but measurable performance boost as well. I still remember comparing two versions of this - what it is now (in gix), and one where it would allocate a Vec and binary-search through it. And it turned out the allocation really kills performance.

///
pub fn lookup_entry<I, P>(
&self,
odb: impl crate::FindExt,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be crate::Find, which always implements FindExt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint! Changed.

.unwrap()
.unwrap();

assert!(matches!(entry, Entry { .. }));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would deduplicate the tests and compare the whole entry, showing that one can get trees and files. I'd also show what happens if the requested entry doesn't exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Changed.

@cruessler
Copy link
Contributor Author

As far as I can see, I’m done with the changes in response to your comments.

@Byron Byron force-pushed the move-lookup-entry-to-gix-object branch from a163909 to 71e9282 Compare November 24, 2024 14:04
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

If you think it's ready, I think it's ready as well :).

This does not compile as there is only one buffer for potentially many
times `lookup_entry` is called. This is intended to check whether the
general approach is correct.
Return `Entry` instead of `EntryRef`. `EntryRef` needs the underlying
data to be kept by `TreeRef`, but `lookup_entry` only temporarily loads
sub-trees, so there is no data to return a reference to.
Adapt functions to new context. This mostly means copying the
implementation from `gix::types::Tree` and applying minor changes
related to `buffer` being passed as an argument.
@Byron Byron force-pushed the move-lookup-entry-to-gix-object branch from 71e9282 to d913bc7 Compare November 24, 2024 14:28
@Byron Byron marked this pull request as ready for review November 24, 2024 14:30
- return `EntryRef`
- more structured tests
@Byron Byron force-pushed the move-lookup-entry-to-gix-object branch from d913bc7 to d7f4991 Compare November 24, 2024 14:31
@Byron Byron enabled auto-merge November 24, 2024 14:31
@Byron Byron merged commit 39227a9 into GitoxideLabs:main Nov 24, 2024
20 checks passed
@cruessler
Copy link
Contributor Author

Thanks a lot!

If you think it's ready, I think it's ready as well :).

Awesome!

One additional thing that I had in mind that didn’t make it into the PR was adding a comment to bisect_entry, providing context on the difference between bisect_entry and lookup_entry_by_path. When I was looking for a function to get an Entry from a Tree, I immediately chose bisect_entry because the name seemed to promise the performance I was looking for. What I didn’t notice was that this was the wrong function for what I wanted to do.

@Byron
Copy link
Member

Byron commented Nov 25, 2024

I see. Probably it would be easiest to make these doc-adjustments in the gix blame PR - I plan to not take it into the next year :D .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants