fix(otel_processor): wait for runner process termination #641

Merged

Conversation

SergeTupchiy
Contributor

As new_export_table(ExportingTable) is called right after killing the runner, it's necessary to wait for the runner's termination.
Otherwise, new_export_table/1 can raise a badarg error because the old table has not been destroyed yet.
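
A minimal sketch of the approach, assuming a helper like the kill_runner/1 mentioned below (the exact shape of the PR code may differ):

%% Sketch only: after killing the runner, block until the process is gone, so
%% the ETS table it owns is destroyed before a replacement table is created.
kill_and_wait(RunnerPid) ->
    Mon = erlang:monitor(process, RunnerPid),
    exit(RunnerPid, kill),
    %% A selective receive on a freshly created reference is optimized by the
    %% emulator, so older messages in the mailbox are not scanned.
    receive
        {'DOWN', Mon, process, RunnerPid, _Reason} -> ok
    end.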

@codecov

codecov bot commented Oct 18, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Files Coverage Δ
apps/opentelemetry/src/otel_batch_processor.erl 90.08% <100.00%> (+4.36%) ⬆️
apps/opentelemetry/src/otel_simple_processor.erl 62.12% <0.00%> (-1.95%) ⬇️


@tsloughter
Member

I'm trying to think of how we could have a test for this. A test exporter that just hangs? I'm not sure what would be checked to verify it is working, though. Maybe just that the processor doesn't crash...

@SergeTupchiy SergeTupchiy force-pushed the fix-processor-kill-runner branch from 920da22 to 286e7c5 on October 19, 2023 11:34
@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 19, 2023

@tsloughter ,

Pushed a test; I also manually verified that it reaches kill_runner/1 by adding some debug printouts (removed afterwards).
There are better ways to be sure certain code is executed during a test, e.g. https://github.com/kafka4beam/snabbkaffe, but introducing it would probably be too much for this fix.

@SergeTupchiy SergeTupchiy force-pushed the fix-processor-kill-runner branch from 286e7c5 to 0b307b9 on October 19, 2023 11:39
@tsloughter
Member

I suppose technically, instead of using receive, we should switch to another state that waits for the DOWN (and postpones any other messages).

@SergeTupchiy
Contributor Author

I suppose technically, instead of using receive, we should switch to another state that waits for the DOWN (and postpones any other messages).

That would probably look more complicated. A selective receive matching a newly created reference (in this case the one created by monitor) should be optimized by the emulator anyway.

Alternatively, I think it can be fixed in a different way: do not give_away the ETS table to the runner process (it only needs to read from it, and the table will always be public because anyone must be able to write to it). Delete and recreate the table in the context of the state machine process, right after killing the runner or receiving the EXIT | completed message. Then there would be no need to wait for the runner to exit after sending it the kill exit signal.
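
A rough sketch of that alternative, assuming the state machine keeps ownership of the table (names and table options are illustrative, not the processor's actual code):

%% Sketch only: the state machine owns the export table, so it can drop and
%% recreate it itself and never has to wait for the runner process to die.
recreate_export_table(Tab) ->
    true = ets:delete(Tab),
    ets:new(Tab, [public, named_table, duplicate_bag, {write_concurrency, true}]).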

@tsloughter
Member

@SergeTupchiy yeah, part of the reason for the give-away was to ensure that if the runner crashed, the ETS table it was using doesn't live on, even if there is a bug or something in the processor.

I'm not sure it's worth it, but now with this fix from you I guess it's not worth changing.

@tsloughter
Member

Wait, I could have sworn we renamed the... hm. Now I vaguely recall maybe it was removed to ensure writes that could come after the table is handed off don't fail.

@SergeTupchiy
Contributor Author

Wait, I could have sworn we renamed the... hm. Now I vaguely recall maybe it was removed to ensure writes that could come after the table is handed off don't fail.

@tsloughter, this discussion makes me think that the whole batching/exporting mechanism is prone to losing writes under some race conditions.
When it's time to export, the table name that accepts writes is switched in the persistent term, but there may be quite a large number of concurrent writes to the handed-off table, with no guarantee that all of them succeed before table traversal starts. It can probably get even worse because of the duplicate_bag table:

to_proto_by_instrumentation_scope(Tab) ->
    Key = ets:first(Tab),
    to_proto_by_instrumentation_scope(Tab, Key).

to_proto_by_instrumentation_scope(_Tab, '$end_of_table') ->
    [];
to_proto_by_instrumentation_scope(Tab, InstrumentationScope) ->
    InstrumentationScopeSpans = lists:foldl(fun(Span, Acc) ->
                                                      [to_proto(Span) | Acc]
                                              end, [], ets:lookup(Tab, InstrumentationScope)),

Once ets:lookup(Tab, ets:first(Tab)) returns, any delayed writes with the same key (instrumentation scope) are lost forever.
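
A toy sequence (not code from this PR) that makes the window concrete:

%% Illustration only: a span inserted under a key after that key has already
%% been read is never exported if the table is simply dropped afterwards.
demo_lost_write() ->
    Tab = ets:new(export_tab, [duplicate_bag, public]),
    true = ets:insert(Tab, {scope_a, span_1}),
    Exported = ets:lookup(Tab, scope_a),       %% exporter reads [{scope_a, span_1}]
    true = ets:insert(Tab, {scope_a, span_2}), %% late concurrent write lands here
    true = ets:delete(Tab),                    %% table is deleted -> span_2 is lost
    Exported.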

@tsloughter
Member

@SergeTupchiy yup, it's possible. I guess that should be in a comment. I couldn't think of a good way around it, only to limit it.

@tsloughter
Member

hmmmmmmm, what if we used ets:take instead of ets:lookup and reused the table?

@tsloughter
Member

Eh, reuse won't work as it is if we add support for concurrent exports.

@SergeTupchiy
Contributor Author

@tsloughter, I also thought about ets:take/2 yesterday 🙂 👍 and I think it can work fine without introducing too many changes or redesigning the whole implementation.
We can ets:take/2 from the handed-off table and not delete the table after exporting (which also implies its ownership must not be transferred to the runner, so that the table survives the runs and no data is lost). The table may then contain some 'residual' records inserted during the export; we could try to traverse it once more (e.g., if we see it's not empty), but that still doesn't give a 100% guarantee, since we can't be sure that all the writes have finished.
Thus, we can simply complete this export cycle, and the processor will export all those 'residual' records on the next-but-one exporter run.
The only problem I see is the case where there are some residual records after the export and then absolutely no more writes, so the exporter is never run again and these records are left unexported. However, we can add some extra table size checks to solve or further mitigate this.
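
A minimal sketch of what a take-based traversal could look like (illustrative names, not the current implementation):

%% Sketch only: consume records key by key with ets:take/2 and keep the table
%% alive, so writes that land mid-export stay in the table and get picked up
%% by a later run instead of being deleted together with the table.
take_all_spans(Tab) ->
    take_all_spans(Tab, []).

take_all_spans(Tab, Acc) ->
    case ets:first(Tab) of
        '$end_of_table' ->
            Acc;
        Key ->
            %% read-and-delete in a single operation
            take_all_spans(Tab, ets:take(Tab, Key) ++ Acc)
    end.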

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 22, 2023

Eh, reuse won't work as it is if we add support for concurrent exports.
I'm not sure that I understand your concern... Also, according to the spec:
Export() will never be called concurrently for the same exporter instance

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#exportbatch
I may also be misunderstanding the spec, though. 😃

@tsloughter
Member

It is a little confusing. Export() won't be called concurrently but that doesn't mean Export can't return before completing the request.

It is a little weird because we didn't want to deal with all the logic of concurrent sending in the processor itself, so Export not being called concurrently guarantees to the exporters that they are in control of how many requests are sent concurrently -- and of stuff like retries.

@tsloughter
Member

And if we were to do this I would want it to be 100%, meaning no chance of losing spans except in the case of a crash. So we'd want to keep using that table for writes after the export, instead of just rechecking after the export to see if there are more records, since technically a late write could come well after that as well.

The original goal was to limit memory usage and make GC easy on the VM.

@SergeTupchiy
Contributor Author

Yeah, rechecking after the export probably makes little sense, as it gives a very weak guarantee. I would just export by taking the records that are present during the traversal and rely on any 'late' writes being exported in the next-but-one exporter run (implying the table is not deleted). It's better to accept the chance of a late export than probable data loss. Do you see significant drawbacks in this idea?

@tsloughter
Member

Right, that's what I was thinking too.

The only drawbacks are how to do concurrent exports, and I worry about the GC performance of dropping individual records instead of just letting the whole table get removed when the process ends. The latter may be unfounded; I haven't done any actual benchmarking.

@SergeTupchiy
Contributor Author

I hope it won't have a dramatic impact on performance, but of course, benchmarking should be done.
I would probably also like to benchmark duplicate_bag vs set. A span id or a span/trace id pair can be used as a unique key in a set table instead of the instrumentation scope. The motivation is that the bag table types are not recommended for storing a high number of objects under the same key. That's a different question, though, sorry for the off-topic. 🙂
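
A small sketch of that key-choice idea (illustrative only; the real processor tables are defined elsewhere and may use different options):

%% Sketch: spans keyed by instrumentation scope pile up under a few keys in a
%% duplicate_bag, while keying by a unique span id in a set spreads inserts
%% across many hash buckets.
new_tables() ->
    ByScope  = ets:new(spans_by_scope, [duplicate_bag, public, {write_concurrency, true}]),
    BySpanId = ets:new(spans_by_id, [set, public, {write_concurrency, true}]),
    {ByScope, BySpanId}.

insert_span({ByScope, BySpanId}, Scope, SpanId, Span) ->
    true = ets:insert(ByScope, {Scope, Span}),          %% many objects under one key
    true = ets:insert(BySpanId, {SpanId, Scope, Span}). %% one object per unique key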

@tsloughter
Member

Oh, I didn't know that about duplicate bag. Is that in the docs somewhere?

@SergeTupchiy
Contributor Author

Yes,

For tables of type bag and duplicate_bag, avoid inserting an extensive amount of objects with the same key. It will hurt insert and lookup performance as well as real time characteristics of the runtime environment (hash bucket linear search do not yield).
https://www.erlang.org/doc/man/ets.html

@tsloughter
Member

Aggggh, how did I miss the no yielding part...

Since we do no lookups besides the export I didn't think that would be an issue and would have expected insert to not be impacted by the size of the list of elements in the bucket... Do you know why insert would be hurt? I'm going to go ask around.

But the yielding definitely concerns me :(

@tsloughter
Member

Ah, this warning is new to the docs.

@SergeTupchiy
Contributor Author

@tsloughter,
After a quick look at the source code, duplicate_bag doesn't look like it hurts inserts:
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_db_hash.c#L1383

And yielding is disabled only for DB_BAG (but not for DB_DUPLICATE_BAG):
https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_db.c#L1811

Moreover, yielding seems to only be used for list insertions: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_db.c#L1852

Note: I'm not an in-depth BEAM expert, so please don't take my conclusions for granted.
Meanwhile, I'm preparing some benchmarks and will share the results as soon as they are ready.

@tsloughter
Member

Great! Thanks for the research.

@SergeTupchiy
Contributor Author

@tsloughter,

Some benchmark results:

  • traversal (delete the whole table vs ets:take/2) https://github.com/SergeTupchiy/otel_bench/blob/main/results.md#traverse-results
    ets:take/2 seems only marginally faster.
    The benchmark only traverses the table, meaning there are no concurrent writes during the benchmark. There are two reasons for that:

    • laziness simplicity 🙂
    • in the real world scenario, no high write rate is expected (since writes are switched to the second table).

    I didn't try to measure memory usage, though.
    Also, the benchmark functions are not quite pure: they don't just traverse the table but also convert records with to_proto. The motivation was to run scenarios closer to the real world. I will probably add an extra scenario without the to_proto conversion, but anyway, it doesn't look like there can be a big difference.

  • concurrent inserts (set vs duplicate_bag) https://github.com/SergeTupchiy/otel_bench/blob/main/results.md#insert-results
    All inserts to the duplicate_bag were done to the same key.
    When the table was pre-populated with a large number of records (see: OtelBench.ets_insert_to_large_tab_bench()), duplicate_bag actually seemed to outperform set (though results differed significantly from run to run).
    Note: the benchmark time also includes the random span generation step.

@tsloughter
Member

So duplicate_bag looks good? As well as using take?

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 25, 2023

So duplicate_bag looks good? As well as using take?

Yes, take looks good, and set looks slightly better when the number of concurrent inserts is high (probably because writes go to different buckets and fine-grained locking applies, while all inserts into the duplicate_bag go to the same key).
When writes are being done to a table that already holds many records (100 000 in my bench), duplicate_bag actually looked better (but concurrency was reduced, as the benchmark was reaching my memory limits).

1000 concurrent processes, each generating and writing 1 span record, set is slightly faster:

parallel: 1000
inputs: 1
Estimated total run time: 1.07 min

Benchmarking ets_duplicate_bag with input 1 ...
Benchmarking ets_set with input 1 ...

##### With input 1 #####
Name                        ips        average  deviation         median         99th %
ets_set                  571.79        1.75 ms  ±3026.37%     0.00575 ms       90.74 ms
ets_duplicate_bag        540.94        1.85 ms  ±1093.70%     0.00726 ms      110.31 ms

Comparison: 
ets_set                  571.79
ets_duplicate_bag        540.94 - 1.06x slower +0.0997 ms

100 processes, each generating and inserting 1 span record to a table that already has at least 100 000 records:

parallel: 100
inputs: 1
Estimated total run time: 24 s

Benchmarking ets_duplicate_bag with input 1 ...
Benchmarking ets_set with input 1 ...

##### With input 1 #####
Name                        ips        average  deviation         median         99th %
ets_duplicate_bag        6.19 K      161.52 μs  ±3867.78%        5.70 μs     6352.91 μs
ets_set                  3.94 K      253.68 μs  ±9495.01%        6.07 μs     8046.54 μs

Comparison: 
ets_duplicate_bag        6.19 K
ets_set                  3.94 K - 1.57x slower +92.16 μs

@bryannaegele
Contributor

Did you do your benchmarks with write optimized or just default settings?

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 25, 2023

Did you do your benchmarks with write optimized or just default settings?

Optimized: https://github.com/SergeTupchiy/otel_bench/blob/main/src/otel_ets_insert.erl#L20

@starbelly

So duplicate_bag looks good? As well as using take?

Yes, take looks good, and set looks slightly better when the number of concurrent inserts is high [...]

Hi :) I'm not sure what your constraints are memory-wise, but if you can share your benchmark code I'd be happy to run it on something big 😁

Also, I would be interested in seeing the results from different benchmarking tools.

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 26, 2023

@starbelly ,
bench code: https://github.com/SergeTupchiy/otel_bench/ (please see both lib/ and src/)
I ran it on my rather old laptop 🙂 (the hardware is listed at the beginning of https://github.com/SergeTupchiy/otel_bench/blob/main/results.md#traverse-results). I think triggering GC during the benchmark could help reclaim memory, but I haven't tried that yet.

@starbelly

@starbelly , bench code: https://github.com/SergeTupchiy/otel_bench/ (please see both lib/ and src/) [...]

Thank you 🙇. I did not get a chance on Friday to run these on a large machine, but will do so next week (half a terabyte to a terabyte of memory, 96 cores, etc.).

That said, I ran one benchmark on my MacBook and here is what I got:

 OtelBench.ets_insert_bench()
Operating System: macOS
CPU Information: Apple M2
Number of Available Cores: 8
Available memory: 24 GB
Elixir 1.15.6
Erlang 26.0.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 1 min
memory time: 0 ns
reduction time: 0 ns
parallel: 500
inputs: 1, 100, 1K
Estimated total run time: 6.20 min

Benchmarking ets_duplicate_bag with input 1 ...
Benchmarking ets_duplicate_bag with input 100 ...
Benchmarking ets_duplicate_bag with input 1K ...
Benchmarking ets_set with input 1 ...
Benchmarking ets_set with input 100 ...
Benchmarking ets_set with input 1K ...

##### With input 1 #####
Name                        ips        average  deviation         median         99th %
ets_set                  264.93        3.77 ms     ±5.47%        3.77 ms        4.24 ms
ets_duplicate_bag        230.56        4.34 ms     ±5.79%        4.33 ms        4.95 ms

Comparison:
ets_set                  264.93
ets_duplicate_bag        230.56 - 1.15x slower +0.56 ms

##### With input 100 #####
Name                        ips        average  deviation         median         99th %
ets_set                   41.22       24.26 ms    ±15.98%       23.22 ms       37.35 ms
ets_duplicate_bag          6.23      160.60 ms    ±12.35%      162.71 ms      175.23 ms

Comparison:
ets_set                   41.22
ets_duplicate_bag          6.23 - 6.62x slower +136.34 ms

##### With input 1K #####
Name                        ips        average  deviation         median         99th %
ets_set                    5.24        0.191 s     ±9.46%        0.184 s         0.23 s
ets_duplicate_bag          0.61         1.65 s    ±18.45%         1.67 s         2.53 s

Comparison:
ets_set                    5.24
ets_duplicate_bag          0.61 - 8.63x slower +1.46 s

Benchmarks are fun like that though, right? :) So, I will proceed to run them all on various machines.

@SergeTupchiy
Contributor Author

SergeTupchiy commented Oct 28, 2023

Thanks @starbelly,
I think using unique keys for records (e.g. trace_id instead of the instrumentation scope) can be expected to have better performance overall, as it can benefit from write_concurrency and decentralized_counters. It's not even necessary to use set, I think: we can keep using duplicate_bag but with some unique key per record (trace_id, span id, or both of them).
I prepared a better benchmark: https://github.com/SergeTupchiy/otel_bench/blob/main/lib/otel_bench.ex#L227
Each process inserts only one record at a time, which is closer to the real-world use case than inputs of 1, 1K, 100K.
It also tests various ETS option combinations (see the sketch after the results below); results on my weak machine:

parallel: 4
Name                                                                ips        average  deviation         median         99th %
set write_concurrency                                          295.45 K        3.38 μs ±20436.64%        2.17 μs        7.66 μs
set write_concurrency decentralized_counters                   259.41 K        3.85 μs ±30856.40%        1.94 μs        7.77 μs
dup bag write_concurrency uniq key                             258.88 K        3.86 μs ±30931.51%        2.17 μs        7.63 μs
dup bag write_concurrency decentralized_counters uniq key      257.72 K        3.88 μs ±31951.56%        1.94 μs        7.74 μs
dup bag                                                        174.51 K        5.73 μs  ±6220.26%        3.40 μs       23.62 μs
dup bag write_concurrency decentralized_counters               173.28 K        5.77 μs  ±8498.78%        3.10 μs       24.04 μs
dup bag write_concurrency                                      166.31 K        6.01 μs  ±8442.25%        2.82 μs       25.54 μs
set                                                            154.04 K        6.49 μs  ±8706.76%        4.80 μs       24.62 μs
dup bag uniq key                                               150.14 K        6.66 μs ±12265.40%        4.69 μs       23.98 μs
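
For reference, a rough sketch of how a couple of the benchmarked option combinations would translate into ets:new/2 calls (table names are illustrative, not taken from the benchmark code):

%% Sketch only: option combinations from the results above.
new_set_wc() ->
    ets:new(spans_set, [set, public, {write_concurrency, true}]).

new_dup_bag_wc_dc() ->
    %% 'uniq key' in the results means each record is inserted under its own
    %% key (e.g. the span id) instead of the shared instrumentation scope.
    ets:new(spans_dup_bag, [duplicate_bag, public,
                            {write_concurrency, true},
                            {decentralized_counters, true}]).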

@bryannaegele
Contributor

I think changing to set is fine, but we should probably do that in a separate PR.

Is this ready for review then?

@SergeTupchiy
Contributor Author

Yes, I think it deserves a separate PR and this one is ready, thanks.

@bryannaegele
Contributor

Cool. @SergeTupchiy if you can rebase and resolve those conflicts, we can get this merged

@SergeTupchiy SergeTupchiy force-pushed the fix-processor-kill-runner branch from 0b307b9 to 3b24398 on October 29, 2023 19:32
@SergeTupchiy
Contributor Author

@bryannaegele , done

@bryannaegele bryannaegele merged commit 0ec0e86 into open-telemetry:main Oct 29, 2023
19 of 20 checks passed
@tsloughter
Member

Note that export will have to change a good bit to work with set.
