Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add low-level HashTable API #466

Merged
merged 9 commits into from
Oct 19, 2023
Merged

Add low-level HashTable API #466

merged 9 commits into from
Oct 19, 2023

Conversation

Amanieu
Copy link
Member

@Amanieu Amanieu commented Aug 31, 2023

The primary use case for this type over HashMap or HashSet is to support types that do not implement the Hash and Eq traits, but instead require additional data not contained in the key itself to compute a hash and compare two elements for equality.

HashTable has some similarities with RawTable, but has a completely safe API. It is intended as a replacement for the existing raw entry API, with the intend of deprecating the latter and eventually removing it.

Examples of when this can be useful include:

  • An IndexMap implementation where indices into a Vec are stored as elements in a HashTable<usize>. Hashing and comparing the elements requires indexing the associated Vec to get the actual value referred to by the index.
  • Avoiding re-computing a hash when it is already known.
  • Mutating the key of an element in a way that doesn't affect its hash.

To achieve this, HashTable methods that search for an element in the table require a hash value and equality function to be explicitly passed in as arguments. The method will then iterate over the elements with the given hash and call the equality function on each of them, until a match is found.

@Zoxc
Copy link
Contributor

Zoxc commented Sep 2, 2023

I wonder if it would make sense to have eq and hasher utility methods for when you have Eq or Hash available. Passing different closures to methods is a bit of a code generation trap. It could help in some cases for that.

@JustForFun88
Copy link
Contributor

I wonder if it would make sense to have eq and hasher utility methods for when you have Eq or Hash available. Passing different closures to methods is a bit of a code generation trap. It could help in some cases for that.

I think it's hardly possible to predict what type users will want to store in a table. It can be T, (T1, T2), (T1, T2, ...Tn).

@beviu
Copy link

beviu commented Sep 4, 2023

In both HashMap and HashTable, only one entry "view" (Entry or RawEntry) structure can exist at a time. If you want to manipulate multiple entries at the same time, it seems like you have to use the raw Bucket and RawTable API.

I think that this is not possible to fix though: it is similar to arrays where there can only be one mutable reference to an element at a time (although there is split_off in that case).

The use case I was wondering about from #450 was if you want modify an entry's key (possibly changing its hash), but also check if the new key is already in the map before actually?

  1. If the check is done before looking up the entry for the old key, then the bucket for the new key is looked up twice.
  2. If the check is done after looking up the entry for the old key, then the bucket for the old key is looked up twice.

@bors
Copy link
Contributor

bors commented Sep 5, 2023

☔ The latest upstream changes (presumably #468) made this pull request unmergeable. Please resolve the merge conflicts.

Copy link
Member

@cuviper cuviper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played with actually converting indexmap to this. Apart from get_many_mut noted below, I think I would also need something like Occupied/VacantEntry::into_table(self) -> &mut HashTable, because my OccupiedEntry::remove methods need to adjust other indices in the table.

src/table.rs Outdated Show resolved Hide resolved
src/table.rs Outdated Show resolved Hide resolved
src/table.rs Show resolved Hide resolved
src/table.rs Outdated Show resolved Hide resolved
@Zoxc
Copy link
Contributor

Zoxc commented Sep 12, 2023

It would be useful to have fallible variants of entry and insert_unchecked. They can have better code generation as they don't need to panic.

bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 21, 2023
Optimize hash map operations in the query system

This optimizes hash map operations in the query system by explicitly passing hashes and using more optimal operations. `find_or_find_insert_slot` in particular saves a hash table lookup over `entry`. It's not yet available in a safe API, but will be in rust-lang/hashbrown#466.

<table><tr><td rowspan="2">Benchmark</td><td colspan="1"><b>Before</b></th><td colspan="2"><b>After</b></th></tr><tr><td align="right">Time</td><td align="right">Time</td><td align="right">%</th></tr><tr><td>🟣 <b>clap</b>:check</td><td align="right">1.6189s</td><td align="right">1.6129s</td><td align="right"> -0.37%</td></tr><tr><td>🟣 <b>hyper</b>:check</td><td align="right">0.2353s</td><td align="right">0.2337s</td><td align="right"> -0.67%</td></tr><tr><td>🟣 <b>regex</b>:check</td><td align="right">0.9344s</td><td align="right">0.9289s</td><td align="right"> -0.59%</td></tr><tr><td>🟣 <b>syn</b>:check</td><td align="right">1.4693s</td><td align="right">1.4652s</td><td align="right"> -0.28%</td></tr><tr><td>🟣 <b>syntex_syntax</b>:check</td><td align="right">5.6606s</td><td align="right">5.6439s</td><td align="right"> -0.30%</td></tr><tr><td>Total</td><td align="right">9.9185s</td><td align="right">9.8846s</td><td align="right"> -0.34%</td></tr><tr><td>Summary</td><td align="right">1.0000s</td><td align="right">0.9956s</td><td align="right"> -0.44%</td></tr></table>

r? `@cjgillot`
@Amanieu
Copy link
Member Author

Amanieu commented Sep 22, 2023

It would be useful to have fallible variants of entry and insert_unchecked. They can have better code generation as they don't need to panic.

You mean fallible allocations? Isn't that already addressed by try_reserve?

@Amanieu Amanieu force-pushed the hashtable branch 3 times, most recently from 98e2f78 to c33ce2d Compare September 22, 2023 06:27
src/table.rs Outdated Show resolved Hide resolved
@Zoxc
Copy link
Contributor

Zoxc commented Sep 22, 2023

You mean fallible allocations? Isn't that already addressed by try_reserve?

I mean variants which won't grow, like Vec::push_within_capacity.

@Amanieu
Copy link
Member Author

Amanieu commented Sep 23, 2023

I mean variants which won't grow, like Vec::push_within_capacity.

I could see an insert_unchecked_within_capacity, but this doesn't make sense for entry since the first thing is does is reserve(1). I'll leave it for a future PR though.

@Zoxc
Copy link
Contributor

Zoxc commented Sep 23, 2023

It does make sense for entry_within_capacity too. It would just return None if there isn't capacity for 1 insertion.

src/table.rs Outdated Show resolved Hide resolved
src/table.rs Outdated Show resolved Hide resolved
eq: impl FnMut(&T) -> bool,
hasher: impl Fn(&T) -> u64,
) -> Entry<'_, T, A> {
match self.raw.find_or_find_insert_slot(hash, eq, hasher) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm playing with HashTable and came across a small problem. Can we directly provide some version of the find_or_find_insert_slot (checked insert) function?
It's very annoying that every time I need to insert a value I have to use the entry syntax (which is not that cheap) or use find + insert_unchecked, which is slow.

It can be something like this one:

pub fn insert<V>(
    &mut self,
    hash: u64,
    value: T,
    mut eq: impl FnMut(&T, &T) -> bool,
    hasher: impl Fn(&T) -> u64,
    replace: impl FnOnce(&mut T, T) -> V,
) -> Option<V> {
    match self
        .raw
        .find_or_find_insert_slot(hash, |found_val| eq(found_val, &value), hasher)
    {
        Ok(bucket) => Some(replace(unsafe { &mut bucket.as_mut() }, value)),
        Err(slot) => {
            unsafe {
                self.raw.insert_in_slot(hash, slot, value);
            }
            None
        }
    }
}

Then it can be used like:

pub struct NewMap<K, V, S = DefaultHashBuilder, A: Allocator = Global> {
    pub(crate) hash_builder: S,
    pub(crate) table: HashTable<(K, V), A>,
}

impl<K, V, S, A> NewMap<K, V, S, A>
where
    K: Eq + core::hash::Hash,
    S: core::hash::BuildHasher,
    A: Allocator,
{
    pub fn insert(&mut self, k: K, v: V) -> Option<V> {
        let hash = make_hash::<K, S>(&self.hash_builder, &k);
        let hasher = make_hasher::<_, V, S>(&self.hash_builder);

        self.table.insert(
            hash,
            (k, v),
            |found, new| found.0 == new.0,
            hasher,
            |(_, val_ref), (_, val)| core::mem::replace(val_ref, val),
        )
    }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entry already maps directly to find_or_find_insert_slot. You can then use Entry::insert to unconditionally overwrite an existing value, or Entry::or_insert to only insert a new value if an old one doesn't already exist.

bors added a commit to rust-lang-ci/rust that referenced this pull request Sep 25, 2023
Optimize hash map operations in the query system

This optimizes hash map operations in the query system by explicitly passing hashes and using more optimal operations. `find_or_find_insert_slot` in particular saves a hash table lookup over `entry`. It's not yet available in a safe API, but will be in rust-lang/hashbrown#466.

<table><tr><td rowspan="2">Benchmark</td><td colspan="1"><b>Before</b></th><td colspan="2"><b>After</b></th></tr><tr><td align="right">Time</td><td align="right">Time</td><td align="right">%</th></tr><tr><td>🟣 <b>clap</b>:check</td><td align="right">1.6189s</td><td align="right">1.6129s</td><td align="right"> -0.37%</td></tr><tr><td>🟣 <b>hyper</b>:check</td><td align="right">0.2353s</td><td align="right">0.2337s</td><td align="right"> -0.67%</td></tr><tr><td>🟣 <b>regex</b>:check</td><td align="right">0.9344s</td><td align="right">0.9289s</td><td align="right"> -0.59%</td></tr><tr><td>🟣 <b>syn</b>:check</td><td align="right">1.4693s</td><td align="right">1.4652s</td><td align="right"> -0.28%</td></tr><tr><td>🟣 <b>syntex_syntax</b>:check</td><td align="right">5.6606s</td><td align="right">5.6439s</td><td align="right"> -0.30%</td></tr><tr><td>Total</td><td align="right">9.9185s</td><td align="right">9.8846s</td><td align="right"> -0.34%</td></tr><tr><td>Summary</td><td align="right">1.0000s</td><td align="right">0.9956s</td><td align="right"> -0.44%</td></tr></table>

r? `@cjgillot`
The primary use case for this type over [`HashMap`] or [`HashSet`] is to
support types that do not implement the [`Hash`] and [`Eq`] traits, but
instead require additional data not contained in the key itself to compute a
hash and compare two elements for equality.

`HashTable` has some similarities with `RawTable`, but has a completely
safe API. It is intended as a replacement for the existing raw entry
API, with the intend of deprecating the latter and eventually removing
it.

Examples of when this can be useful include:
- An `IndexMap` implementation where indices into a `Vec` are stored as
  elements in a `HashTable<usize>`. Hashing and comparing the elements
  requires indexing the associated `Vec` to get the actual value referred to
  by the index.
- Avoiding re-computing a hash when it is already known.
- Mutating the key of an element in a way that doesn't affect its hash.

To achieve this, `HashTable` methods that search for an element in the table
require a hash value and equality function to be explicitly passed in as
arguments. The method will then iterate over the elements with the given
hash and call the equality function on each of them, until a match is found.
@Amanieu
Copy link
Member Author

Amanieu commented Oct 19, 2023

@bors r+

@bors
Copy link
Contributor

bors commented Oct 19, 2023

📌 Commit b533626 has been approved by Amanieu

It is now in the queue for this repository.

@Amanieu
Copy link
Member Author

Amanieu commented Oct 19, 2023

@bors r+

@bors
Copy link
Contributor

bors commented Oct 19, 2023

📌 Commit 9556bf4 has been approved by Amanieu

It is now in the queue for this repository.

@bors
Copy link
Contributor

bors commented Oct 19, 2023

⌛ Testing commit 9556bf4 with merge ef84e09...

@bors
Copy link
Contributor

bors commented Oct 19, 2023

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing ef84e09 to master...

@bors bors merged commit ef84e09 into rust-lang:master Oct 19, 2023
25 checks passed
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 18, 2024
Optimize hash map operations in the query system

This optimizes hash map operations in the query system by explicitly passing hashes and using more optimal operations. `find_or_find_insert_slot` in particular saves a hash table lookup over `entry`. It's not yet available in a safe API, but will be in rust-lang/hashbrown#466.

<table><tr><td rowspan="2">Benchmark</td><td colspan="1"><b>Before</b></th><td colspan="2"><b>After</b></th></tr><tr><td align="right">Time</td><td align="right">Time</td><td align="right">%</th></tr><tr><td>🟣 <b>clap</b>:check</td><td align="right">1.6189s</td><td align="right">1.6129s</td><td align="right"> -0.37%</td></tr><tr><td>🟣 <b>hyper</b>:check</td><td align="right">0.2353s</td><td align="right">0.2337s</td><td align="right"> -0.67%</td></tr><tr><td>🟣 <b>regex</b>:check</td><td align="right">0.9344s</td><td align="right">0.9289s</td><td align="right"> -0.59%</td></tr><tr><td>🟣 <b>syn</b>:check</td><td align="right">1.4693s</td><td align="right">1.4652s</td><td align="right"> -0.28%</td></tr><tr><td>🟣 <b>syntex_syntax</b>:check</td><td align="right">5.6606s</td><td align="right">5.6439s</td><td align="right"> -0.30%</td></tr><tr><td>Total</td><td align="right">9.9185s</td><td align="right">9.8846s</td><td align="right"> -0.34%</td></tr><tr><td>Summary</td><td align="right">1.0000s</td><td align="right">0.9956s</td><td align="right"> -0.44%</td></tr></table>

r? `@cjgillot`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants