perf(profiling): datadog-alloc based StringTable #404
Conversation
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
+ Coverage   66.17%   66.26%   +0.08%
==========================================
  Files         187      189       +2
  Lines       23171    23650     +479
==========================================
+ Hits        15334    15672     +338
- Misses       7837     7978     +141
```
Oh wow, I have to say, after the PR with the really scary looking things for the allocator, this one is surprisingly clean and straightforward. (I always kinda fear a Rust PR will turn gnarly eventually lol)
I've left a few notes. I'd be curious to see the results of benchmarking this new implementation vs the previous one.
```rust
pub fn small_wordpress_profile(c: &mut Criterion) {
    c.bench_function("benching string interning on wordpress profile", |b| {
        b.iter(|| {
            let mut table = StringTable::new();
            let n_strings = WORDPRESS_STRINGS.len();
            for string in WORDPRESS_STRINGS {
                black_box(table.intern(string));
            }
            assert_eq!(n_strings, table.len());

            // Re-insert; nothing new should be inserted.
            for string in WORDPRESS_STRINGS {
                black_box(table.intern(string));
            }
            assert_eq!(n_strings, table.len())
        })
    });
}
```
I'm curious: did you get a chance to compare this with the previous implementation?
I haven't yet, but intend to!
Alright, some numbers! Changing from the old implementation to the new one resulted in:

```
benching string interning on wordpress profile
    time:   [53.695 µs 53.977 µs 54.279 µs]
    change: [-28.080% -27.464% -26.850%] (p = 0.00 < 0.05)
    Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high severe
```

And then changing it back resulted in:

```
benching string interning on wordpress profile
    time:   [75.224 µs 75.501 µs 75.820 µs]
    change: [+40.269% +41.386% +42.503%] (p = 0.00 < 0.05)
    Performance has regressed.
```

This is unsurprising, because we spend less time in allocators, but I'm glad to have some data!
```rust
// So with a capacity like 3, we end up reallocating a bunch on or
// before the very first sample. The number here is not fine-tuned,
// just skipping some obviously bad, tiny sizes.
strings.reserve(32);
```
To be honest, I'd go even bigger -- 512 or maybe more? E.g. for one minute of data from a non-idle, realistic application, it seems like we'll hit that easily?
(I guess we could even try measuring this in staging or something)
If only we had statistics on string data from libdatadog profiles... I'd love to try to target something like 50%+ of the real-world data we see.
I guess we could gather this if we wanted to -- we know in the backend which runtime a profile comes from, and so could even look at the data per-runtime.
The cost of reserving a bit too much shouldn't be that bad
```rust
// Should be a bit less than 4096, but use this over hard-coding a
// number, to make it more resilient to implementation changes.
let remaining_capacity = allocator.remaining_capacity();
```
Minor: we could maybe turn this comment into an assertion, e.g. `assert!(remaining_capacity < 4096);`
Doing that would make it less resilient to implementation changes. What if the `Global` allocator starts over-sizing allocations too?
Then we learn about it through our tests and update them accordingly? ;)
Just to clarify, is the idea not to merge this PR until we merge #402?
I think it's in a pretty good state, but since it's a draft I'm not sure what plans you have, and whether I should approve as-is or wait for next steps. Let me know! 🙏
I added the old implementation for other people to A/B test if they want. Import whichever
Performance is improved. I used
One GitLab job failed due to an internal error when doing setup. It passed when I restarted it.
```rust
let prev = profile
    .reset_and_return_previous(None)
    .expect("reset to succeed");

// Resolve the string values to check that they match (their string
// table offsets may not match).
let mut strings = Vec::with_capacity(profile.strings.len());
let mut strings_iter = profile.strings.into_lending_iter();
```
Could this become a `.map().collect()`?
The lending iterator doesn't have it implemented. I could do that if you want, but I'd prefer to do it in another PR. Let me know.
Future PR is fine
Doing `map` is possible, but it has usability issues. I'm still working through the compiler errors, but it seems doable. However, `collect` seems even harder, because you need to have a `Map` or something which breaks the borrow of the item (e.g. `String::from`), because otherwise the borrow prevents the lending iterator from moving to the next item (because it's `mut`).
```rust
    }
}

// To benchmark a different implementation, import a different one.
```
Make this a `cfg` rather than commenting out a line.
When you are benchmarking, you generally make tweaks to things and run it again. It's not a static A/B thing we're going to run in CI, where you'd want to automate swapping it out. So it seems weird to have a cfg for this reason.
I think to use the cfg without making code changes, that would mean making a feature for it. Having a feature just for a benchmarking thing seems odd. Plus it breaks the idea of feature unification (features should be additive).
Do these things make sense? Thoughts?
```rust
/// the arena is dropped.
pub trait ArenaAllocator: Allocator {
    /// Copies the str into the arena, and returns a slice to the new str.
    fn allocate(&self, str: &str) -> Result<&str, AllocError> {
```
`allocate` is a funny name for this.
Have any ideas for a better one? If I had a better one, I'd have used it ^_^
```rust
// Always hold the empty string as item 0. Do not insert it via intern
// because that will try to allocate zero-bytes from the storage,
// which is sketchy.
```
Seems like a worthwhile special case to handle in the allocator.
The trouble is, what allocators are supposed to do for zero-sized allocations is not pinned down. In C, you get implementation-defined behavior for `malloc(0)`, and it's generally recommended to avoid it because of this. The debate is still open for Rust.
The nomicon agrees: https://doc.rust-lang.org/nomicon/vec/vec-alloc.html
```rust
/// was originally inserted.
///
/// # Panics
/// This panics if the allocator fails to allocate a new chunk/node.
```
I know errors are annoying to propagate, but panic will bring down the customer process.
Yes, I am aware, but this is just like the `GlobalAlloc` panicking if it runs out of memory. Same fringe error condition.
```rust
// the lifetime of the string table or iterator when
// exposed to the user. The string table and iterator will
// keep the arena alive, making the access safe.
unsafe { core::mem::transmute::<&str, &'static str>(s) }
```
Is it possible to give it the same lifetime as `self`?
There is no such thing as a `'self` lifetime, unfortunately 🙁
If you mean using something like `<'a>(&'a self, ...)` and using `'a`, it will complain it doesn't live long enough, or something about self-referential lifetimes, or maybe even borrowing something as mut while something is borrowed as const (it can manifest a few ways).
As mentioned in #404 (comment), we could use `ouroboros` to do it. It's just ugly and complicated.
Probably not worth it for this PR
Fixing `max()` in `max_level_hint()`. And properly dropping the telemetry worker: when the runtime is shut down with pending apps, it will send Stop, but never actually shut down the telemetry instances. This regressed with d1fb3bc. Now we simply drop the telemetry handle, which implicitly stops the worker and also causes the worker to be joined properly. Signed-off-by: Bob Weinand <[email protected]>
This reverts commit a5d8b29.
LGTM
We should be careful about performance measures while deploying this.
The CI failure is not relevant to this PR.
LGTM
What does this PR do?
Creates a `StringTable` where the individual strings are allocated in an arena allocator. This uses the Chain, Linear, and Virtual Allocators provided by datadog-alloc to create a chain of 4 MiB chunks, each of which is virtually allocated.
Also changes the chain allocator to handle large allocations, in case we encounter a large string. Previously it would error if the allocation size was larger than the node size.
Motivation
This separates the StringTable from the system allocator. Since strings are variable length, separating them from the system allocator yields more benefit than it would for many other items.
This is also a bit faster, probably due to the simpler allocation strategy and better data locality. See this for details:
#404 (comment)
Additional Notes
This has been a long-time coming. I'm excited for it to finally merge!
How to test the change?
As long as you weren't relying on string table implementation details, nothing should change. The APIs for FFI users are unchanged, for instance.