feat: metadata cache core impl #25552

Merged: 2 commits merged from hiltontj/meta-cache into main, Nov 18, 2024
Conversation

hiltontj (Contributor) commented Nov 13, 2024

Part of #25543

Implement the base MetaCache type that holds the hierarchical structure of the metadata cache, providing methods to (a rough sketch follows the list):

  • create the cache (here),
  • push rows from the WAL into the cache (here),
  • get record batches out of the cache with a given set of predicates (here), and
  • prune the cache based on a max_age and a max_cardinality (here)
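
For orientation, here is a minimal, hypothetical sketch of what that API surface could look like. This is not the PR's code: the field layout, the Row and Predicate stubs, and the new/push signatures are illustrative placeholders; only the method roles listed above and the to_record_batch/prune names (to_record_batch is quoted in full later in this thread) come from the PR.

use std::sync::Arc;

use arrow::datatypes::Schema;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Placeholder for the PR's per-column predicate type.
pub struct Predicate;
/// Placeholder for a row decoded from the WAL.
pub struct Row;

pub struct MetaCache {
    /// Arrow schema of the cached columns.
    schema: Arc<Schema>,
    // ... hierarchical value tree, column ids, max_age, max_cardinality ...
}

impl MetaCache {
    /// Create the cache for a chosen set of columns.
    pub fn new(schema: Arc<Schema>) -> Self {
        Self { schema }
    }

    /// Push a row from the WAL into the hierarchical structure.
    pub fn push(&mut self, _row: &Row) {
        todo!("insert the row's column values into the value tree")
    }

    /// Materialize the cache as a RecordBatch filtered by per-column predicates.
    pub fn to_record_batch(&self, _predicates: &[Predicate]) -> Result<RecordBatch, ArrowError> {
        todo!("walk the tree, building one string-view array per cached column")
    }

    /// Evict entries last seen before `max_age` or beyond `max_cardinality`.
    pub fn prune(&mut self) {
        todo!("drop expired and excess entries")
    }
}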

Tests

  • Verify predicate evaluation when fetching record batches from the cache (here)
  • Verify pruning of the cache (here)

Notes

  • This PR adds an influxdb3_cache crate, which holds only the metadata cache for now, but may well also hold the last-n-value cache and parquet cache to get those out of the influxdb3_write crate
  • Some accessory changes were made, namely exposing the WriteValidator from influxdb3_write so it can be used as a more lightweight way of parsing line protocol and producing Rows for tests
  • To close the related issue, I still need to add a MetaCacheProvider and other pieces necessary to tie this into the system.

@hiltontj hiltontj added the v3 label Nov 13, 2024
@hiltontj hiltontj self-assigned this Nov 13, 2024
@hiltontj hiltontj force-pushed the hiltontj/meta-cache branch 3 times, most recently from 61a655f to a2b4e77 Compare November 15, 2024 20:33
@hiltontj hiltontj marked this pull request as ready for review November 15, 2024 20:38
hiltontj (Contributor, Author) commented Nov 15, 2024

@alamb I tagged you on this as it is using StringView and I thought that may interest you. Starting here:

/// Gather a record batch from a cache given the set of predicates
///
/// This assumes the predicates are well behaved, and validated before being passed in. For example,
/// there cannot be multiple predicates on a single column; the caller needs to validate/transform
/// the incoming predicates from Datafusion before calling.
///
/// Entries in the cache that have not been seen since before the `max_age` of the cache will be
/// filtered out of the result.
pub fn to_record_batch(&self, predicates: &[Predicate]) -> Result<RecordBatch, ArrowError> {
    // predicates may not be provided for all columns in the cache, or not be provided in the
    // order of columns in the cache. This re-orders them to the correct order, and fills in any
    // gaps with None.
    let predicates: Vec<Option<&Predicate>> = self
        .column_ids
        .iter()
        .map(|id| predicates.iter().find(|p| &p.column_id == id))
        .collect();
    // Uses a [`StringViewBuilder`] to compose the set of [`RecordBatch`]es. This is convenient for
    // the sake of nested caches, where a predicate on a higher branch in the cache will need to have
    // its value in the outputted batches duplicated.
    let mut builders: Vec<StringViewBuilder> = (0..self.column_ids.len())
        .map(|_| StringViewBuilder::new())
        .collect();
    let expired_time_ns = self.expired_time_ns();
    let _ = self
        .data
        .evaluate_predicates(expired_time_ns, &predicates, &mut builders);
    RecordBatch::try_new(
        Arc::clone(&self.schema),
        builders
            .into_iter()
            .map(|mut builder| Arc::new(builder.finish()) as ArrayRef)
            .collect(),
    )
}
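
As a standalone illustration of the builder pattern described in the comment above (not code from this PR), the following minimal example appends a parent branch's value once per child row with StringViewBuilder and assembles the result into a RecordBatch; the column names and values are made up.

use std::sync::Arc;

use arrow::array::{ArrayRef, StringViewBuilder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    // Two cached columns: "region" is a higher (parent) branch, "host" a child.
    let schema = Arc::new(Schema::new(vec![
        Field::new("region", DataType::Utf8View, false),
        Field::new("host", DataType::Utf8View, false),
    ]));

    let mut region = StringViewBuilder::new();
    let mut host = StringViewBuilder::new();

    // The parent value is appended once per child row, which is what happens when a
    // predicate selects a higher branch and its value is duplicated in the output.
    for h in ["a01", "a02", "a03"] {
        region.append_value("us-east");
        host.append_value(h);
    }

    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(region.finish()) as ArrayRef,
            Arc::new(host.finish()) as ArrayRef,
        ],
    )?;
    assert_eq!(batch.num_rows(), 3);
    Ok(())
}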

@hiltontj hiltontj changed the title feat: metadata cache initial cache struct feat: metadata cache core impl Nov 15, 2024
Implement the base MetaCache type that holds the hierarchical structure
of the metadata cache providing methods to create and push rows from the
WAL into the cache.

Added a prune method as well as a method for gathering record batches
from a meta cache. A test was added to check the latter against various
predicates and to verify that the former works; however, pruning shows
that we need to modify how record batches are produced so that expired
entries are not emitted.
@hiltontj hiltontj force-pushed the hiltontj/meta-cache branch from a2b4e77 to dd5f816 Compare November 16, 2024 01:17
@hiltontj hiltontj force-pushed the hiltontj/meta-cache branch from dff0b2e to 0ae8f82 Compare November 16, 2024 20:15
mgattozzi (Contributor) left a comment

Looks good to me if this has follow-up PRs (which I'm assuming, since this doesn't seem to be hooked in).

}
}

// TODO: remove this when fully implemented, for now just suppressing compiler warnings

Contributor

This is for a follow-up PR when the metadata cache actually gets hooked up, right?

Contributor Author

Yeah, the type will get used when I implement the DataFusion trait to hook this all up to the query path in #25544

@hiltontj hiltontj removed the request for review from alamb November 18, 2024 16:14
@hiltontj hiltontj merged commit 53f54a6 into main Nov 18, 2024
13 checks passed
@hiltontj hiltontj deleted the hiltontj/meta-cache branch November 18, 2024 17:28