fix(otel_processor): wait for runner process termination #641

Merged

Conversation

SergeTupchiy
Contributor

As new_export_table(ExportingTable) is called right after killing the runner, it's necessary to wait for the runner's termination.
Otherwise, new_export_table/1 can raise a badarg error because the old table has not been destroyed yet.
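
A minimal sketch of the approach, assuming a helper like the kill_runner/1 mentioned below (the exact shape of the PR code may differ):

%% Sketch only: after killing the runner, block until the process is gone, so
%% the ETS table it owns is destroyed before a replacement table is created.
kill_and_wait(RunnerPid) ->
    Mon = erlang:monitor(process, RunnerPid),
    exit(RunnerPid, kill),
    %% A selective receive on a freshly created reference is optimized by the
    %% emulator, so older messages in the mailbox are not scanned.
    receive
        {'DOWN', Mon, process, RunnerPid, _Reason} -> ok
    end.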

@codecov

codecov bot commented Oct 18, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Files Coverage Δ
apps/opentelemetry/src/otel_batch_processor.erl 90.08% <100.00%> (+4.36%) ⬆️
apps/opentelemetry/src/otel_simple_processor.erl 62.12% <0.00%> (-1.95%) ⬇️


@tsloughter
Member

I'm trying to think of how we could have a test for this. A test exporter that just hangs? I'm not sure what would be checked to verify it is working, though. Maybe just that the processor doesn't crash...

@SergeTupchiy SergeTupchiy force-pushed the fix-processor-kill-runner branch from 920da22 to 286e7c5 on October 19, 2023 11:34
@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 19, 2023

@tsloughter ,

Pushed a test; I also manually verified that it reaches kill_runner/1 by adding some debug printouts (removed afterwards).
There are better ways to be sure certain code is executed during a test, e.g. https://github.com/kafka4beam/snabbkaffe, but introducing it would probably be too much for this fix.

@SergeTupchiy SergeTupchiy force-pushed the fix-processor-kill-runner branch from 286e7c5 to 0b307b9 on October 19, 2023 11:39
@tsloughter
Member

I suppose technically, instead of using receive, we should switch to another state that waits for the DOWN (and postpones any other messages).

@SergeTupchiy
Contributor Author

I suppose technically, instead of using receive, we should switch to another state that waits for the DOWN (and postpones any other messages).

That would probably look more complicated. A selective receive matching a newly created reference (in this case the one created by monitor) should be optimized by the emulator anyway.

Alternatively, I think it can be fixed in a different way: do not give_away the ETS table to the runner process (it only needs to read from it, and the table will always be public because anyone must be able to write to it). Delete and recreate the table in the context of the state machine process, right after killing the runner or receiving the EXIT | completed message. Then there would be no need to wait for the runner to exit after sending it the kill exit signal.
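
A rough sketch of that alternative, assuming the state machine keeps ownership of the table (names and table options are illustrative, not the processor's actual code):

%% Sketch only: the state machine owns the export table, so it can drop and
%% recreate it itself and never has to wait for the runner process to die.
recreate_export_table(Tab) ->
    true = ets:delete(Tab),
    ets:new(Tab, [public, named_table, duplicate_bag, {write_concurrency, true}]).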

@tsloughter
Member

@SergeTupchiy yeah, part of the reason for the give-away was to ensure that if the runner crashed, the ETS table it was using doesn't live on, even if there is a bug or something in the processor.

I'm not sure it's worth it, but now with this fix from you I guess it's not worth changing.

@tsloughter
Member

Wait, I could have sworn we renamed the... hm. Now I vaguely recall maybe it was removed to ensure writes that could come after the table is handed off don't fail.

@SergeTupchiy
Contributor Author

Wait, I could have sworn we renamed the... hm. Now I vaguely recall maybe it was removed to ensure writes that could come after the table is handed off don't fail.

@tsloughter, this discussion makes me think that the whole batching/exporting mechanism is prone to losing writes under some race conditions.
When it's time to export, the table name that accepts writes is switched in the persistent term, but there may be quite a large number of concurrent writes to the handed-off table, with no guarantee that all of them succeed before table traversal starts. It can probably get even worse because of the duplicate_bag table:

to_proto_by_instrumentation_scope(Tab) ->
    Key = ets:first(Tab),
    to_proto_by_instrumentation_scope(Tab, Key).

to_proto_by_instrumentation_scope(_Tab, '$end_of_table') ->
    [];
to_proto_by_instrumentation_scope(Tab, InstrumentationScope) ->
    InstrumentationScopeSpans = lists:foldl(fun(Span, Acc) ->
                                                      [to_proto(Span) | Acc]
                                              end, [], ets:lookup(Tab, InstrumentationScope)),

Once ets:lookup(Tab, ets:first(Tab)) returns, any delayed writes with the same key (instrumentation scope) are lost forever.
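
A toy sequence (not code from this PR) that makes the window concrete:

%% Illustration only: a span inserted under a key after that key has already
%% been read is never exported if the table is simply dropped afterwards.
demo_lost_write() ->
    Tab = ets:new(export_tab, [duplicate_bag, public]),
    true = ets:insert(Tab, {scope_a, span_1}),
    Exported = ets:lookup(Tab, scope_a),       %% exporter reads [{scope_a, span_1}]
    true = ets:insert(Tab, {scope_a, span_2}), %% late concurrent write lands here
    true = ets:delete(Tab),                    %% table is deleted -> span_2 is lost
    Exported.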

@tsloughter
Member

@SergeTupchiy yup, it's possible. I guess that should be in a comment. I couldn't think of a good way around it, only to limit it.

@tsloughter
Member

hmmmmmmm, what if we used ets:take instead of ets:lookup and reused the table?

@tsloughter
Member

Eh, reuse won't work as it is if we add support for concurrent exports.

@SergeTupchiy
Contributor Author

@tsloughter, I also thought about ets:take/2 yesterday 🙂 👍 and I think it can work fine without introducing too many changes or redesigning the whole implementation.
We can ets:take/2 from the handed-off table and not delete the table after exporting (which also implies its ownership must not be transferred to the runner, so that the table survives the runs and no data is lost). The table may then contain some 'residual' records inserted during the export; we could try to traverse it once more (e.g., if we see it's not empty), but that still doesn't give a 100% guarantee, since we can't be sure that all the writes have finished.
Thus, we can simply complete this export cycle, and the processor will export all those 'residual' records on the next-but-one exporter run.
The only problem I see is the case where there are some residual records after the export and then absolutely no more writes, so the exporter is never run again and these records are left unexported. However, we can add some extra table size checks to solve or further mitigate this.
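
A minimal sketch of what a take-based traversal could look like (illustrative names, not the current implementation):

%% Sketch only: consume records key by key with ets:take/2 and keep the table
%% alive, so writes that land mid-export stay in the table and get picked up
%% by a later run instead of being deleted together with the table.
take_all_spans(Tab) ->
    take_all_spans(Tab, []).

take_all_spans(Tab, Acc) ->
    case ets:first(Tab) of
        '$end_of_table' ->
            Acc;
        Key ->
            %% read-and-delete in a single operation
            take_all_spans(Tab, ets:take(Tab, Key) ++ Acc)
    end.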

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 22, 2023

Eh, reuse won't work as it is if we add support for concurrent exports.
I'm not sure that I understand your concern... Also, according to the spec:
Export() will never be called concurrently for the same exporter instance

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#exportbatch
I may also be misunderstanding the spec, though. 😃

@tsloughter
Member

It is a little confusing. Export() won't be called concurrently but that doesn't mean Export can't return before completing the request.

It is a little weird because we didn't want to deal with all the logic of concurrent sending in the processor itself, so Export not being called concurrently guarantees to the exporters that they are in control of how many requests are sent concurrently -- and of stuff like retries.

@tsloughter
Member

And if we were to do this I would want it to be 100%, meaning no chance of losing spans except in the case of a crash. So we'd want to keep using that table for writes after the export, instead of just rechecking after the export to see if there are more records, since technically a late write could come well after that as well.

The original goal was to limit memory usage and make GC easy on the VM.

@SergeTupchiy
Contributor Author

Yeah, rechecking after the export probably makes little sense, as it gives a very weak guarantee. I would just export by taking the records that are present during the traversal and rely on any 'late' writes being exported in the next-but-one exporter run (implying the table is not deleted). It's better to accept the chance of a late export than probable data loss. Do you see significant drawbacks in this idea?

@tsloughter
Member

Right, that's what I was thinking too.

The only drawbacks are how to do concurrent exports, and I worry about the GC performance of dropping individual records instead of just letting the whole table get removed when the process ends. The latter may be unfounded; I haven't done any actual benchmarking.

@SergeTupchiy
Contributor Author

I hope it won't have a dramatic impact on performance, but of course, benchmarking should be done.
I would probably also like to benchmark duplicate_bag vs set. A span id or a span/trace id pair can be used as a unique key in a set table instead of the instrumentation scope. The motivation is that the bag table types are not recommended for storing a high number of objects under the same key. That's a different question, though, sorry for the off-topic. 🙂
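
A small sketch of that key-choice idea (illustrative only; the real processor tables are defined elsewhere and may use different options):

%% Sketch: spans keyed by instrumentation scope pile up under a few keys in a
%% duplicate_bag, while keying by a unique span id in a set spreads inserts
%% across many hash buckets.
new_tables() ->
    ByScope  = ets:new(spans_by_scope, [duplicate_bag, public, {write_concurrency, true}]),
    BySpanId = ets:new(spans_by_id, [set, public, {write_concurrency, true}]),
    {ByScope, BySpanId}.

insert_span({ByScope, BySpanId}, Scope, SpanId, Span) ->
    true = ets:insert(ByScope, {Scope, Span}),          %% many objects under one key
    true = ets:insert(BySpanId, {SpanId, Scope, Span}). %% one object per unique key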

@tsloughter
Member

Oh, I didn't know that about duplicate bag. Is that in the docs somewhere?

@SergeTupchiy
Contributor Author

Yes,

For tables of type bag and duplicate_bag, avoid inserting an extensive amount of objects with the same key. It will hurt insert and lookup performance as well as real time characteristics of the runtime environment (hash bucket linear search do not yield).
https://www.erlang.org/doc/man/ets.html

@tsloughter
Member

Aggggh, how did I miss the no yielding part...

Since we do no lookups besides the export I didn't think that would be an issue and would have expected insert to not be impacted by the size of the list of elements in the bucket... Do you know why insert would be hurt? I'm going to go ask around.

But the yielding definitely concerns me :(

@tsloughter
Member

Ah, this warning is new to the docs.

@SergeTupchiy
Contributor Author

@tsloughter,
After a quick look at the source code, duplicate_bag doesn't look like it hurts inserts:
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_db_hash.c#L1383

And yielding is disabled only for DB_BAG (but not for DB_DUPLICATE_BAG):
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_db.c#L1811

Moreover, yielding seems to only be used for list insertions: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_db.c#L1852

Note: I'm not an in-depth BEAM expert, so please don't take my conclusions for granted.
Meanwhile, I'm preparing some benchmarks and will share the results as soon as they are ready.

@tsloughter
Member

Great! Thanks for the research.

@SergeTupchiy
Contributor Author

@tsloughter,

Some benchmark results:

  • traversal (delete the whole table vs ets:take/2) https://github.com/SergeTupchiy/otel_bench/blob/main/results.md#traverse-results
    ets:take/2 seems only marginally faster.
    The benchmark only traverses the table, meaning there are no concurrent writes during the benchmark. There are two reasons for that:

    • laziness simplicity 🙂
    • in the real world scenario, no high write rate is expected (since writes are switched to the second table).

    I didn't try to measure memory usage, though.
    Also, the benchmark functions are not quite pure: they don't just traverse the table but also convert records with to_proto. The motivation was to run scenarios closer to the real world. I will probably add an extra scenario without the to_proto conversion, but anyway, it doesn't look like there can be a big difference.

  • concurrent inserts (set vs duplicate_bag) https://github.com/SergeTupchiy/otel_bench/blob/main/results.md#insert-results
    All inserts to the duplicate_bag were done to the same key.
    When the table was pre-populated with a large number of records (see: OtelBench.ets_insert_to_large_tab_bench()), duplicate_bag actually seemed to outperform set (though results differed significantly from run to run).
    Note: the benchmark time also includes the random span generation step.

@tsloughter
Member

So duplicate_bag looks good? As well as using take?

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 25, 2023

So duplicate_bag looks good? As well as using take?

Yes, take looks good, and set looks slightly better when the number of concurrent inserts is high (probably because writes go to different buckets and fine-grained locking applies, while all inserts into the duplicate_bag go to the same key).
When writes are being done to a table that already holds many records (100 000 in my bench), duplicate_bag actually looked better (but concurrency was reduced, as the benchmark was reaching my memory limits).

1000 concurrent processes, each generating and writing 1 span record, set is slightly faster:

parallel: 1000
inputs: 1
Estimated total run time: 1.07 min

Benchmarking ets_duplicate_bag with input 1 ...
Benchmarking ets_set with input 1 ...

##### With input 1 #####
Name                        ips        average  deviation         median         99th %
ets_set                  571.79        1.75 ms  ±3026.37%     0.00575 ms       90.74 ms
ets_duplicate_bag        540.94        1.85 ms  ±1093.70%     0.00726 ms      110.31 ms

Comparison: 
ets_set                  571.79
ets_duplicate_bag        540.94 - 1.06x slower +0.0997 ms

100 processes, each generating and inserting 1 span record to a table that already has at least 100 000 records:

parallel: 100
inputs: 1
Estimated total run time: 24 s

Benchmarking ets_duplicate_bag with input 1 ...
Benchmarking ets_set with input 1 ...

##### With input 1 #####
Name                        ips        average  deviation         median         99th %
ets_duplicate_bag        6.19 K      161.52 μs  ±3867.78%        5.70 μs     6352.91 μs
ets_set                  3.94 K      253.68 μs  ±9495.01%        6.07 μs     8046.54 μs

Comparison: 
ets_duplicate_bag        6.19 K
ets_set                  3.94 K - 1.57x slower +92.16 μs

@bryannaegele
Contributor

Did you do your benchmarks with write optimized or just default settings?

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 25, 2023

Did you do your benchmarks with write optimized or just default settings?

Optimized: https://github.com/SergeTupchiy/otel_bench/blob/main/src/otel_ets_insert.erl#L20

@starbelly

So duplicate_bag looks good? As well as using take?

Yes, take looks good, and set looks slightly better when the number of concurrent inserts is high [...]

Hi :) I'm not sure what your constraints are memory-wise, but if you can share your benchmark code I'd be happy to run it on something big 😁

Also, I would be interested in seeing the results from different benchmarking tools.

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 26, 2023

@starbelly ,
bench code: https://github.com/SergeTupchiy/otel_bench/ (please see both lib/ and src/)
I ran it on my rather old laptop 🙂 (the hardware is listed at the beginning of https://github.com/SergeTupchiy/otel_bench/blob/main/results.md#traverse-results). I think triggering GC during the benchmark could help reclaim memory, but I haven't tried that yet.

@starbelly

@starbelly , bench code: https://github.com/SergeTupchiy/otel_bench/ (please see both lib/ and src/) [...]

Thank you 🙇. I did not get a chance on Friday to run these on a large machine, but will do so next week (half a terabyte to a terabyte of memory, 96 cores, etc.).

That said, I ran one benchmark on my MacBook and here is what I got:

 OtelBench.ets_insert_bench()
Operating System: macOS
CPU Information: Apple M2
Number of Available Cores: 8
Available memory: 24 GB
Elixir 1.15.6
Erlang 26.0.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 1 min
memory time: 0 ns
reduction time: 0 ns
parallel: 500
inputs: 1, 100, 1K
Estimated total run time: 6.20 min

Benchmarking ets_duplicate_bag with input 1 ...
Benchmarking ets_duplicate_bag with input 100 ...
Benchmarking ets_duplicate_bag with input 1K ...
Benchmarking ets_set with input 1 ...
Benchmarking ets_set with input 100 ...
Benchmarking ets_set with input 1K ...

##### With input 1 #####
Name                        ips        average  deviation         median         99th %
ets_set                  264.93        3.77 ms     ±5.47%        3.77 ms        4.24 ms
ets_duplicate_bag        230.56        4.34 ms     ±5.79%        4.33 ms        4.95 ms

Comparison:
ets_set                  264.93
ets_duplicate_bag        230.56 - 1.15x slower +0.56 ms

##### With input 100 #####
Name                        ips        average  deviation         median         99th %
ets_set                   41.22       24.26 ms    ±15.98%       23.22 ms       37.35 ms
ets_duplicate_bag          6.23      160.60 ms    ±12.35%      162.71 ms      175.23 ms

Comparison:
ets_set                   41.22
ets_duplicate_bag          6.23 - 6.62x slower +136.34 ms

##### With input 1K #####
Name                        ips        average  deviation         median         99th %
ets_set                    5.24        0.191 s     ±9.46%        0.184 s         0.23 s
ets_duplicate_bag          0.61         1.65 s    ±18.45%         1.67 s         2.53 s

Comparison:
ets_set                    5.24
ets_duplicate_bag          0.61 - 8.63x slower +1.46 s

Benchmarks are fun like that though, right? :) So, I will proceed to run them all on various machines.

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 28, 2023

Thanks @starbelly,
I think using unique keys for records (e.g. trace_id instead of the instrumentation scope) can be expected to have better performance overall, as it can benefit from write_concurrency and decentralized_counters. It's not even necessary to use set, I think: we can keep using duplicate_bag but with some unique key per record (trace_id, span id, or both of them).
I prepared a better benchmark: https://github.com/SergeTupchiy/otel_bench/blob/main/lib/otel_bench.ex#L227
Each process inserts only one record at a time, which is closer to the real-world use case than inputs of 1, 1K, 100K.
It also tests various ETS option combinations (see the sketch after the results below); results on my weak machine:

parallel: 4
Name                                                                ips        average  deviation         median         99th %
set write_concurrency                                          295.45 K        3.38 μs ±20436.64%        2.17 μs        7.66 μs
set write_concurrency decentralized_counters                   259.41 K        3.85 μs ±30856.40%        1.94 μs        7.77 μs
dup bag write_concurrency uniq key                             258.88 K        3.86 μs ±30931.51%        2.17 μs        7.63 μs
dup bag write_concurrency decentralized_counters uniq key      257.72 K        3.88 μs ±31951.56%        1.94 μs        7.74 μs
dup bag                                                        174.51 K        5.73 μs  ±6220.26%        3.40 μs       23.62 μs
dup bag write_concurrency decentralized_counters               173.28 K        5.77 μs  ±8498.78%        3.10 μs       24.04 μs
dup bag write_concurrency                                      166.31 K        6.01 μs  ±8442.25%        2.82 μs       25.54 μs
set                                                            154.04 K        6.49 μs  ±8706.76%        4.80 μs       24.62 μs
dup bag uniq key                                               150.14 K        6.66 μs ±12265.40%        4.69 μs       23.98 μs
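
For reference, a rough sketch of how a couple of the benchmarked option combinations would translate into ets:new/2 calls (table names are illustrative, not taken from the benchmark code):

%% Sketch only: option combinations from the results above.
new_set_wc() ->
    ets:new(spans_set, [set, public, {write_concurrency, true}]).

new_dup_bag_wc_dc() ->
    %% 'uniq key' in the results means each record is inserted under its own
    %% key (e.g. the span id) instead of the shared instrumentation scope.
    ets:new(spans_dup_bag, [duplicate_bag, public,
                            {write_concurrency, true},
                            {decentralized_counters, true}]).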

@bryannaegele
Contributor

I think changing to set is fine, but we should probably do that in a separate PR.

Is this ready for review then?

@SergeTupchiy
Contributor Author

Yes, I think it deserves a separate PR and this one is ready, thanks.

@bryannaegele
Contributor

Cool. @SergeTupchiy if you can rebase and resolve those conflicts, we can get this merged

@SergeTupchiy SergeTupchiy force-pushed the fix-processor-kill-runner branch from 0b307b9 to 3b24398 on October 29, 2023 19:32
@SergeTupchiy
Contributor Author

@bryannaegele , done

@bryannaegele bryannaegele merged commit 0ec0e86 into open-telemetry:main Oct 29, 2023
19 of 20 checks passed
@tsloughter
Member

Note that export will have to change a good bit to work with set.
