perf(profiling): datadog-alloc based StringTable #404
Conversation
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
+ Coverage   66.17%   66.26%   +0.08%
==========================================
  Files         187      189       +2
  Lines       23171    23650     +479
==========================================
+ Hits        15334    15672     +338
- Misses       7837     7978     +141
```
Oh wow, I have to say, after the PR with the really scary looking things for the allocator, this one is surprisingly clean and straightforward. (I always kinda fear a Rust PR will turn gnarly eventually lol)
I've left a few notes. I'd be curious to see the results of benchmarking this new implementation vs the previous one.
```rust
pub fn small_wordpress_profile(c: &mut Criterion) {
    c.bench_function("benching string interning on wordpress profile", |b| {
        b.iter(|| {
            let mut table = StringTable::new();
            let n_strings = WORDPRESS_STRINGS.len();
            for string in WORDPRESS_STRINGS {
                black_box(table.intern(string));
            }
            assert_eq!(n_strings, table.len());

            // Re-insert; nothing new should be inserted.
            for string in WORDPRESS_STRINGS {
                black_box(table.intern(string));
            }
            assert_eq!(n_strings, table.len())
        })
    });
}
```
I'm curious: did you get a chance to compare this with the previous implementation?
I haven't yet, but intend to!
Alright, some numbers! Changing from the old implementation to the new one resulted in:

```
benching string interning on wordpress profile
    time:   [53.695 µs 53.977 µs 54.279 µs]
    change: [-28.080% -27.464% -26.850%] (p = 0.00 < 0.05)
    Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high severe
```

And then changing it back resulted in:

```
benching string interning on wordpress profile
    time:   [75.224 µs 75.501 µs 75.820 µs]
    change: [+40.269% +41.386% +42.503%] (p = 0.00 < 0.05)
    Performance has regressed.
```

This is unsurprising, because we spend less time in allocators, but I'm glad to have some data!
```rust
// So with a capacity like 3, we end up reallocating a bunch on or
// before the very first sample. The number here is not fine-tuned,
// just skipping some obviously bad, tiny sizes.
strings.reserve(32);
```
To be honest, I'd go even bigger -- 512 or maybe more? E.g. for one minute of data from a non-idle, realistic application, it seems like we'll hit that easily?
(I guess we could even try measuring this in staging or something)
If only we had statistics on string data from libdatadog profiles... I'd love to try to target something like 50%+ of the real-world data we see.
I guess we could gather this if we wanted to -- we know in the backend which runtime a profile comes from, and so could even look at the data per-runtime.
The cost of reserving a bit too much shouldn't be that bad
```rust
// Should be a bit less than 4096, but use this over hard-coding a
// number, to make it more resilient to implementation changes.
let remaining_capacity = allocator.remaining_capacity();
```
Minor: we could maybe turn this comment into an assertion, e.g. `assert!(remaining_capacity < 4096);`
Doing that would make it less resilient to implementation changes. What if the `Global` allocator starts over-sizing allocations too?
Then we learn about it through our tests and update them accordingly? ;)
Just to clarify, is the idea not to merge this PR until we merge #402?
I think it's in a pretty good state, but since it's a draft I'm not sure what plans you have, and whether I should approve as-is or wait for next steps. Let me know! 🙏
I added the old implementation for other people to A/B test if they want. Import whichever
Performance is improved. I used
One GitLab job failed due to an internal error when doing setup. It passed when I restarted it.
```rust
let prev = profile
    .reset_and_return_previous(None)
    .expect("reset to succeed");

// Resolve the string values to check that they match (their string
// table offsets may not match).
let mut strings = Vec::with_capacity(profile.strings.len());
let mut strings_iter = profile.strings.into_lending_iter();
```
Could this become a `.map().collect()`?
The lending iterator doesn't have it implemented. I could do that if you want, but I'd prefer to do it in another PR. Let me know.
Future PR is fine
Doing `map` is possible, but it has usability issues. I'm still working through the compiler errors, but it seems doable. However, `collect` seems even harder, because you need to have a `Map` or something which breaks the borrow of the item (e.g. `String::from`), because otherwise the borrow prevents the lending iterator from moving to the next item (because it's `mut`).
```rust
    }
}

// To benchmark a different implementation, import a different one.
```
Make this a `cfg` rather than commenting out a line.
When you are benchmarking, you generally make tweaks to things and run it again. It's not a static A/B thing we're going to run in CI, where you'd want to automate swapping it out. So it seems weird to have a cfg for this reason.
I think to use the cfg without making code changes, that would mean making a feature for it. Having a feature just for a benchmarking thing seems odd. Plus it breaks the idea of feature unification (features should be additive).
Do these things make sense? Thoughts?
```rust
/// the arena is dropped.
pub trait ArenaAllocator: Allocator {
    /// Copies the str into the arena, and returns a slice to the new str.
    fn allocate(&self, str: &str) -> Result<&str, AllocError> {
```
`allocate` is a funny name for this.
Have any ideas for a better one? If I had a better one, I'd have used it ^_^
```rust
// Always hold the empty string as item 0. Do not insert it via intern
// because that will try to allocate zero-bytes from the storage,
// which is sketchy.
```
Seems like a worthwhile special case to handle in the allocator.
The trouble is, what allocators are supposed to do for zero-sized allocations is not pinned down. In C, you get implementation-defined behavior for `malloc(0)`, and it's generally recommended to avoid it because of this. The debate is still open for Rust.
The nomicon agrees: https://doc.rust-lang.org/nomicon/vec/vec-alloc.html
```rust
/// was originally inserted.
///
/// # Panics
/// This panics if the allocator fails to allocate a new chunk/node.
```
I know errors are annoying to propagate, but panic will bring down the customer process.
Yes, I am aware, but this is just like the `GlobalAlloc` panicking if it runs out of memory. Same fringe error condition.
```rust
// the lifetime of the string table or iterator when
// exposed to the user. The string table and iterator will
// keep the arena alive, making the access safe.
unsafe { core::mem::transmute::<&str, &'static str>(s) }
```
Is it possible to give it the same lifetime as `self`?
There is no such thing as a `'self` lifetime, unfortunately 🙁
If you mean using something like `<'a>(&'a self, ...)` and using `'a`, it will complain it doesn't live long enough, or something about self-referential lifetimes, or maybe even borrowing something as mut while something is borrowed as const (it can manifest a few ways).
As mentioned in #404 (comment), we could use `ouroboros` to do it. It's just ugly and complicated.
Probably not worth it for this PR
Fixing `max()` in `max_level_hint()`. And properly dropping the telemetry worker: when the runtime is shut down with pending apps, it will send Stop, but never actually shut down the telemetry instances. This regressed with d1fb3bc. Now we simply drop the telemetry handle, which implicitly stops the worker and also causes the worker to be joined properly. Signed-off-by: Bob Weinand <[email protected]>
This reverts commit a5d8b29.
LGTM
We should be careful about performance measures while deploying this.
The CI failure is not relevant to this PR.
LGTM
What does this PR do?
Creates a `StringTable` where the individual strings are allocated in an arena allocator. This uses the Chain, Linear, and Virtual Allocators provided by datadog-alloc to create a chain of 4 MiB chunks, each of which is virtually allocated.
Also changes the chain allocator to handle large allocations, in case we encounter a large string. Previously it would error if the allocation size was larger than the node size.
Motivation
This separates the StringTable from the system allocator. Since strings are variable length, separating them from the system allocator yields more benefit than it would for many other items.
This is also a bit faster, probably due to the simpler allocation strategy and better data locality. See this for details:
#404 (comment)
Additional Notes
This has been a long-time coming. I'm excited for it to finally merge!
How to test the change?
As long as you weren't relying on string table implementation details, nothing should change. The APIs for FFI users are unchanged, for instance.