
perf(profiling): datadog-alloc based StringTable #404

Merged · 13 commits · May 20, 2024

Conversation

@morrisonlevi (Contributor) commented Apr 25, 2024

What does this PR do?

Creates a StringTable where the individual strings are allocated in
an arena allocator. This uses the Chain, Linear, and Virtual Allocators
provided by datadog-alloc to create a chain of 4 MiB chunks, each of
which is virtually allocated.

Also changes the chain allocator to handle large allocations, in case
we encounter a large string. Previously it would error if the
allocation size was larger than the node size.
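That fallback can be sketched as follows. This is a toy bookkeeping model, not the actual datadog-alloc code: the names are hypothetical, NODE_SIZE is shrunk from 4 MiB so both paths are easy to exercise, and an allocation is represented as a (node index, offset) handle rather than raw memory.

```rust
// Toy model of a chain allocator with a large-allocation fallback.
const NODE_SIZE: usize = 16; // stand-in for the 4 MiB virtual chunks

struct Chain {
    nodes: Vec<Vec<u8>>, // each inner Vec is one chunk in the chain
    current: usize,      // node currently used for bump allocation
    used: usize,         // bytes bump-allocated in the current node
}

impl Chain {
    fn new() -> Self {
        Chain { nodes: vec![vec![0u8; NODE_SIZE]], current: 0, used: 0 }
    }

    /// Reserves `size` bytes, returning a (node index, offset) handle.
    fn allocate(&mut self, size: usize) -> (usize, usize) {
        if size > NODE_SIZE {
            // Oversized request: give it a dedicated node instead of
            // erroring, which is the behavior change described above.
            self.nodes.push(vec![0u8; size]);
            return (self.nodes.len() - 1, 0);
        }
        if self.used + size > NODE_SIZE {
            // Current node is full: chain on a fresh fixed-size node.
            self.nodes.push(vec![0u8; NODE_SIZE]);
            self.current = self.nodes.len() - 1;
            self.used = 0;
        }
        let offset = self.used;
        self.used += size;
        (self.current, offset)
    }
}
```

Note that an oversized node does not become the current bump node; small requests keep filling the existing chain.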

Motivation

This separates the StringTable from the system allocator. Since strings
are variable length, separating them from the system allocator has more
benefit than it would for many other items.

This is also a bit faster, probably due to the simpler allocation
strategy and better data locality. See this comment for details:
#404 (comment)

Additional Notes

This has been a long-time coming. I'm excited for it to finally merge!

How to test the change?

As long as you weren't relying on string table implementation details,
nothing should change. The APIs for FFI users are unchanged, for
instance.

@github-actions bot added the profiling label (Relates to the profiling* modules.) Apr 25, 2024
@morrisonlevi changed the title from "perf(profiling): use allocator-api2 based StringTable" to "refactor(profiling): use allocator-api2 based StringTable" Apr 25, 2024
@morrisonlevi changed the title from "refactor(profiling): use allocator-api2 based StringTable" to "refactor(profiling): add allocator-api2 based StringTable" Apr 25, 2024
@morrisonlevi force-pushed the levi/alloc-string-table branch from 02d90fe to c30fbc5 on April 28, 2024 04:33
@morrisonlevi force-pushed the levi/sample-types-on-reset branch from 2ae7586 to 31e33c2 on April 28, 2024 04:51
@morrisonlevi force-pushed the levi/alloc-string-table branch from c30fbc5 to 12008dd on April 28, 2024 04:52
@morrisonlevi force-pushed the levi/sample-types-on-reset branch 2 times, most recently from 5a3f760 to e212171 on April 29, 2024 23:12
@morrisonlevi force-pushed the levi/alloc-string-table branch from 12008dd to fcabc50 on April 29, 2024 23:13
@morrisonlevi force-pushed the levi/sample-types-on-reset branch from e212171 to f74b130 on May 1, 2024 17:46
@morrisonlevi force-pushed the levi/alloc-string-table branch 2 times, most recently from c858430 to 5fce6ca on May 1, 2024 18:32
@morrisonlevi force-pushed the levi/sample-types-on-reset branch 2 times, most recently from e102e07 to a4da143 on May 1, 2024 21:43
@morrisonlevi force-pushed the levi/alloc-string-table branch from 5fce6ca to f3d6e74 on May 2, 2024 05:11
@morrisonlevi force-pushed the levi/sample-types-on-reset branch from 6ae45a4 to 83b7fa1 on May 2, 2024 05:12
@morrisonlevi force-pushed the levi/sample-types-on-reset branch from e9e34af to a1f5a62 on May 4, 2024 15:18
@morrisonlevi force-pushed the levi/alloc-string-table branch from f3d6e74 to 4b1d04e on May 4, 2024 15:18
@codecov-commenter commented May 4, 2024

Codecov Report

Attention: Patch coverage is 99.50980%, with 1 line in your changes missing coverage. Please review.

Project coverage is 66.26%. Comparing base (1ecd94c) to head (0d4bfba).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
+ Coverage   66.17%   66.26%   +0.08%     
==========================================
  Files         187      189       +2     
  Lines       23171    23650     +479     
==========================================
+ Hits        15334    15672     +338     
- Misses       7837     7978     +141     
Components Coverage Δ
crashtracker 19.34% <ø> (ø)
datadog-alloc 98.43% <97.05%> (+0.06%) ⬆️
data-pipeline 51.45% <ø> (ø)
data-pipeline-ffi 0.00% <ø> (ø)
ddcommon 81.52% <ø> (+0.29%) ⬆️
ddcommon-ffi 74.93% <ø> (ø)
ddtelemetry 53.72% <ø> (ø)
ipc 81.27% <ø> (ø)
profiling 77.52% <100.00%> (+0.68%) ⬆️
profiling-ffi 60.05% <ø> (ø)
serverless 0.00% <ø> (ø)
sidecar 31.75% <51.32%> (+2.07%) ⬆️
sidecar-ffi 0.00% <0.00%> (ø)
spawn-worker 54.98% <ø> (ø)
trace-mini-agent 69.12% <ø> (ø)
trace-normalization 97.79% <ø> (ø)
trace-obfuscation 95.74% <ø> (ø)
trace-protobuf 25.64% <ø> (ø)
trace-utils 68.85% <ø> (ø)

@morrisonlevi force-pushed the levi/sample-types-on-reset branch from 3883f89 to fa7a21f on May 9, 2024 19:12
@morrisonlevi force-pushed the levi/alloc-string-table branch from 5c8e5e8 to 3ace892 on May 9, 2024 19:32
@morrisonlevi force-pushed the levi/sample-types-on-reset branch from fa7a21f to 724f019 on May 9, 2024 19:47
Base automatically changed from levi/sample-types-on-reset to main May 9, 2024 20:09
@morrisonlevi morrisonlevi changed the base branch from main to levi/alloc-crate May 9, 2024 20:10
@ivoanjo (Member) left a comment:

Oh wow, I have to say, after the PR with the really scary looking things for the allocator, this one is surprisingly clean and straightforward. (I always kinda fear a Rust PR will turn gnarly eventually lol)

I've left a few notes. I'd be curious to see the results of benchmarking this new implementation vs the previous one.

Comment on lines +9 to +26
pub fn small_wordpress_profile(c: &mut Criterion) {
c.bench_function("benching string interning on wordpress profile", |b| {
b.iter(|| {
let mut table = StringTable::new();
let n_strings = WORDPRESS_STRINGS.len();
for string in WORDPRESS_STRINGS {
black_box(table.intern(string));
}
assert_eq!(n_strings, table.len());

            // Re-insert; nothing should be inserted this time.
for string in WORDPRESS_STRINGS {
black_box(table.intern(string));
}
assert_eq!(n_strings, table.len())
})
});
}
Member:

I'm curious: did you get a chance to compare this with the previous implementation?

Contributor Author:

I haven't yet, but intend to!

Contributor Author:

Alright, some numbers! Changing from the old implementation to the new one resulted in:

benching string interning on wordpress profile
                        time:   [53.695 µs 53.977 µs 54.279 µs]
                        change: [-28.080% -27.464% -26.850%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high severe

And then changing it back resulted in:

benching string interning on wordpress profile
                        time:   [75.224 µs 75.501 µs 75.820 µs]
                        change: [+40.269% +41.386% +42.503%] (p = 0.00 < 0.05)
                        Performance has regressed.

This is unsurprising, because the new implementation spends less time in allocators, but I'm glad to have some data!

profiling/src/collections/string_table/mod.rs (outdated; resolved)
profiling/src/collections/string_table/mod.rs (outdated; resolved)
// So with a capacity like 3, we end up reallocating a bunch on or
// before the very first sample. The number here is not fine-tuned,
// just skipping some obviously bad, tiny sizes.
strings.reserve(32);
Member:

To be honest, I'd go even bigger -- 512 or maybe more? E.g. for one minute of data of a non-idle realistic application this seems like we'll hit it easily?

(I guess we could even try measuring this in staging or something)

@morrisonlevi (Contributor Author) commented May 10, 2024:

If only we had statistics on string data from libdatadog profiles... I'd love to try to target something like 50%+ of the real-world data we see.

Member:

I guess we could gather this if we wanted to -- we know in the backend which runtime a profile comes from, and so could even look at the data per-runtime.

Contributor:

The cost of reserving a bit too much shouldn't be that bad

profiling/src/collections/string_table/mod.rs (outdated; resolved)
Comment on lines +310 to +312
// Should be a bit less than 4096, but use this over hard-coding a
// number, to make it more resilient to implementation changes.
let remaining_capacity = allocator.remaining_capacity();
Member:

Minor: We could maybe turn this comment into an assertion, e.g. assert!(remaining_capacity < 4096);

Contributor Author:

Doing that would make it less resilient to implementation changes. What if the Global allocator starts over-sizing allocations too?

Member:

Then we learn about it through our tests and update them accordingly? ;)

profiling/src/collections/string_table/mod.rs (outdated; resolved)
@ivoanjo (Member) left a comment:

Just to clarify, is the idea not to merge this PR until we merge #402?

I think it's in a pretty good state, but since it's a draft I'm not sure what plans you have and whether I should approve as-is or wait for next steps. Let me know! 🙏

Base automatically changed from levi/alloc-crate to main May 13, 2024 15:59
@morrisonlevi (Contributor Author) commented May 15, 2024

> In particular, I wonder about IndexMap vs using a Vec + HashMap.

Vec + HashMap performs worse than using IndexMap, which is good (you would hope a structure meant to solve this does it well).

I added the old implementation for other people to A/B test if they want. Import whichever StringTable you want to use, and then do:

cargo bench -p datadog-profiling

Performance is improved. I used perf to look for low-hanging fruit; there was nothing obvious to me there. We're ready for final review.

@morrisonlevi morrisonlevi marked this pull request as ready for review May 15, 2024 03:58
@morrisonlevi morrisonlevi requested review from a team as code owners May 15, 2024 03:58
@morrisonlevi (Contributor Author) commented:
One GitLab job failed due to an internal error when doing setup. It passed when I restarted it.


let prev = profile
.reset_and_return_previous(None)
.expect("reset to succeed");

// Resolve the string values to check that they match (their string
// table offsets may not match).
let mut strings = Vec::with_capacity(profile.strings.len());
let mut strings_iter = profile.strings.into_lending_iter();
Contributor:

Could this become a .map().collect()?

Contributor Author:

The lending iterator doesn't have it implemented. I could do that if you want, but I'd prefer to do it in another PR. Let me know.

Contributor:

Future PR is fine

@morrisonlevi (Contributor Author) commented May 20, 2024:

Doing map is possible, but it has usability issues. I'm still working through the compiler errors, but it seems doable. However, collect seems even harder: you need a map step or something that breaks the borrow of the item (e.g. String::from), because otherwise the borrow prevents the lending iterator from moving to the next item (since next borrows self as mut).
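A minimal illustration of that borrow issue, using a toy lending iterator (hypothetical Lender/collect_owned names, not the crate used in the PR):

```rust
struct Lender {
    items: Vec<String>,
    pos: usize,
}

impl Lender {
    // A lending iterator's next() borrows from &mut self, so at most one
    // item can be borrowed at a time.
    fn next(&mut self) -> Option<&str> {
        let i = self.pos;
        self.pos += 1;
        self.items.get(i).map(|s| s.as_str())
    }
}

// Collecting works only because each item is turned into an owned value
// (String::from) before the next call; keeping the borrowed &str alive
// would forbid calling next() again.
fn collect_owned(mut it: Lender) -> Vec<String> {
    let mut out = Vec::new();
    while let Some(s) = it.next() {
        out.push(String::from(s)); // owned copy ends the borrow
    }
    out
}
```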

}
}

// To benchmark a different implementation, import a different one.
Contributor:

Make this a cfg rather than commenting out a line

Contributor Author:

When you are benchmarking, you generally make tweaks to things and run it again. It's not a static A/B thing we're going to run in CI where you'd want the swap automated, so it seems weird to have a cfg for this.

I think using a cfg without making code changes would mean adding a feature for it. Having a feature just for a benchmarking thing seems odd, and it breaks the idea of feature unification (features should be additive).

Do these things make sense? Thoughts?

profiling/src/collections/string_table/mod.rs (resolved)
/// the arena is dropped.
pub trait ArenaAllocator: Allocator {
/// Copies the str into the arena, and returns a slice to the new str.
fn allocate(&self, str: &str) -> Result<&str, AllocError> {
Contributor:

allocate is a funny name for this

Contributor Author:

Have any ideas for a better one? If I had a better one, I'd have used it ^_^
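For illustration, here is a toy version of what the trait method does, under the assumption that it both allocates and copies; the TinyArena/copy_str names are hypothetical, not the datadog-alloc API (a name built around the copy, like copy_str, is one way to make the role clearer):

```rust
use std::cell::RefCell;

// Toy arena: copies a str into storage the arena owns and hands back a
// slice that lives as long as the arena does.
struct TinyArena {
    chunks: RefCell<Vec<Vec<u8>>>,
}

impl TinyArena {
    fn new() -> Self {
        TinyArena { chunks: RefCell::new(Vec::new()) }
    }

    /// Copies `s` into the arena and returns a slice to the copy.
    fn copy_str<'a>(&'a self, s: &str) -> &'a str {
        let mut chunks = self.chunks.borrow_mut();
        let buf: Vec<u8> = s.as_bytes().to_vec();
        let (ptr, len) = (buf.as_ptr(), buf.len());
        chunks.push(buf); // the arena now owns the bytes
        // SAFETY: the owning Vec is never mutated, shrunk, or dropped
        // before the arena itself, so the slice stays valid for 'a.
        // Moving the Vec struct when `chunks` grows does not move its
        // heap buffer.
        unsafe { std::str::from_utf8_unchecked(std::slice::from_raw_parts(ptr, len)) }
    }
}
```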


// Always hold the empty string as item 0. Do not insert it via intern
// because that will try to allocate zero-bytes from the storage,
// which is sketchy.
Contributor:

Seems like a worthwhile special case to handle in the allocator.

Contributor Author:

The trouble is, what allocators are supposed to do for zero-sized allocations is not pinned down. In C, you get implementation defined behavior for malloc(0) and it's generally recommended to avoid it because of this. The debate is still open for Rust.
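A sketch of the approach the code comment describes: seed index 0 with "" directly, so the arena is never asked for a zero-byte allocation. The names here are hypothetical, and &'static str keeps the sketch self-contained where the real table copies strings into its arena:

```rust
use std::collections::HashMap;

struct Table {
    map: HashMap<&'static str, usize>,
    strings: Vec<&'static str>,
}

impl Table {
    fn new() -> Self {
        let mut t = Table { map: HashMap::new(), strings: Vec::new() };
        // "" needs no backing storage: seed it without touching the arena.
        t.map.insert("", 0);
        t.strings.push("");
        t
    }

    fn intern(&mut self, s: &'static str) -> usize {
        if let Some(&id) = self.map.get(s) {
            return id; // hits the pre-seeded empty string too
        }
        // Real code would copy `s` into the arena here before storing it.
        let id = self.strings.len();
        self.map.insert(s, id);
        self.strings.push(s);
        id
    }
}
```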

profiling/src/collections/string_table/mod.rs (resolved)
/// was originally inserted.
///
/// # Panics
/// This panics if the allocator fails to allocate a new chunk/node.
Contributor:

I know errors are annoying to propagate, but panic will bring down the customer process.

Contributor Author:

Yes, I am aware, but this is just like GlobalAlloc panicking if it runs out of memory. Same fringe error condition.

// the lifetime of the string table or iterator when
// exposed to the user. The string table and iterator will
// keep the arena alive, making the access safe.
unsafe { core::mem::transmute::<&str, &'static str>(s) }
Contributor:

Is it possible to give it the same lifetime as self?

@morrisonlevi (Contributor Author) commented May 16, 2024:

There is no such thing as a 'self lifetime, unfortunately 🙁
If you mean using something like <'a>(&'a self, ...) and returning 'a, the compiler will complain that it doesn't live long enough, or about self-referential lifetimes, or maybe even about borrowing something as mut while it is borrowed as const (it can manifest a few ways).
As mentioned in #404 (comment), we could use ouroboros to do it. It's just ugly and complicated.

Contributor:

Probably not worth it for this PR

bwoebi and others added 3 commits May 16, 2024 17:08
Fixing max() in max_level_hint().

And properly dropping the telemetry worker:
When the runtime is shut down with pending apps, it will send Stop but never actually shut down the telemetry instances. This regressed with d1fb3bc.
Now we simply drop the telemetry handle, which implicitly stops the worker and also causes it to be joined properly.

Signed-off-by: Bob Weinand <[email protected]>
@github-actions github-actions bot added sidecar and removed sidecar labels May 16, 2024
@github-actions github-actions bot removed the ci-build label May 17, 2024
@r1viollet (Contributor) left a comment:

LGTM
We should be careful about performance measures while deploying this.
The CI failure is not relevant to this PR.

@danielsn (Contributor) left a comment:

LGTM


profiling/src/collections/string_table/mod.rs (resolved)

@morrisonlevi morrisonlevi merged commit 9b531d5 into main May 20, 2024
27 of 29 checks passed
@morrisonlevi morrisonlevi deleted the levi/alloc-string-table branch May 20, 2024 17:38