fix(otel_processor): wait for runner process termination #641
Conversation
Trying to think of how we could have a test for this. A test exporter that just hangs? Not sure what would be checked to verify it is working though. Just that the processor doesn't crash maybe...
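For reference, a hanging exporter for such a test could look roughly like this. This is a sketch only: the module name is hypothetical, and the callback shape (init/1, export/4, shutdown/1) is an assumption, so check the repo's otel_exporter behaviour for the real contract.

```erlang
%% Hypothetical test exporter that never returns from export, forcing
%% the processor to kill the runner process at the export timeout.
-module(hanging_exporter).

-export([init/1, export/4, shutdown/1]).

init(Config) ->
    {ok, Config}.

%% Block forever to simulate an exporter stuck on I/O.
export(_Type, _Tab, _Resource, _State) ->
    timer:sleep(infinity).

shutdown(_State) ->
    ok.
```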
Force-pushed from 920da22 to 286e7c5
Pushed a test; also manually verified that it reached …
Force-pushed from 286e7c5 to 0b307b9
I suppose technically, instead of using …
That would probably look more complicated. Alternatively, I think it can be fixed in a different way: do not give_away the ETS table to the runner process (it only needs to read from the table, and the table will always be public because anyone must be able to write to it). Delete/recreate the table in the context of the state machine process, right after killing the runner or receiving …
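A rough sketch of that alternative, with hypothetical names and illustrative table options (the state-machine process keeps ownership and drops/recreates the table itself):

```erlang
%% Hypothetical sketch: since the runner only reads the table and the
%% table is public anyway, the state machine can keep ownership and
%% recreate the table after the runner is killed.
recreate_export_table(Tab, TableName) ->
    true = ets:delete(Tab),
    ets:new(TableName, [public, named_table, {write_concurrency, true}]).
```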
@SergeTupchiy yea, part of the reason for the give_away was to ensure that if the runner crashed, the ETS table it was using doesn't live on, even if there is a bug or something in the processor. I'm not sure it's worth it, but now with this fix from you I guess it's not worth changing.
Wait, I could have sworn we renamed the... hm. Now I vaguely recall maybe it was removed to ensure writes that could come after it is handed off don't fail...
@tsloughter, this discussion makes me think that the whole batching/exporting mechanism is prone to losing writes under some race conditions. Once …
@SergeTupchiy yup, it's possible. I guess that should be in a comment. I couldn't think of a good way around it, only to limit it.
hmmmmmmm, what if we used …
Eh, reuse won't work as it is if we add support for concurrent exports.
@tsloughter, I've also thought about …
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#exportbatch
It is a little confusing, and a little weird, because we didn't want to deal with all the logic of concurrent sending in the processor itself, so …
And if we were to do this I would want it to be 100%, meaning no chance of losing spans except in the case of a crash. So we'd want to keep using that table for writes after the export, instead of just rechecking after the export to see if there are more records, since technically the late write could come well after that as well. The original goal was to limit memory usage and make GC easy on the VM.
Yeah, rechecking after the export probably makes little sense, as it gives a very weak guarantee. I would just export by taking the records that are present during the traverse and rely on any 'late' writes being exported in the next-but-one exporter run (implying the table is not deleted). It's better to accept the chance of a late export than probable data loss. Do you see significant drawbacks in this idea?
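A minimal sketch of that idea (ExportFun and the chunk size are illustrative, not the processor's real API): traverse the table in chunks with ets:select/3 and delete only the records that were actually exported, so late writes survive until a later run:

```erlang
%% Export the records present at traverse time, chunk by chunk.
%% Records inserted mid-export are left in the table for the next run.
export_present(Tab, ExportFun) ->
    export_chunks(ets:select(Tab, [{'$1', [], ['$1']}], 100), Tab, ExportFun).

export_chunks('$end_of_table', _Tab, _ExportFun) ->
    ok;
export_chunks({Records, Cont}, Tab, ExportFun) ->
    ok = ExportFun(Records),
    %% delete_object removes exact matches only; note that in a
    %% duplicate_bag it removes every duplicate of the same object.
    lists:foreach(fun(R) -> ets:delete_object(Tab, R) end, Records),
    export_chunks(ets:select(Cont), Tab, ExportFun).
```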
Right, that's what I was thinking too. The only drawbacks are how to do concurrent exports, and I worry about the GC performance of individual records being dropped instead of just letting the whole table get removed when the process ends. The latter may be unfounded; I haven't done any actual benchmarking.
I hope it won't have a dramatic impact on performance, …
Oh, I didn't know that about duplicate_bag. Is that in the docs somewhere?
Yes, …
Aggggh, how did I miss the no-yielding part... Since we do no lookups besides the export I didn't think that would be an issue, and would have expected insert not to be impacted by the size of the list of elements in the bucket... Do you know why insert would be hurt? I'm going to go ask around. But the yielding definitely concerns me :(
Ah, this warning is new to the docs.
@tsloughter, yielding is disabled only for … Moreover, yielding seems to only be used for list insertions: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_db.c#L1852. Note: I'm not an in-depth BEAM expert, so please don't take my conclusions for granted.
Great! Thanks for the research.
Some benchmark results: …
So duplicate_bag looks good? As well as using …
Yes. 1000 concurrent processes, each generating and writing 1 span record: …
100 processes, each generating and inserting 1 span record to a table that already has at least 100,000 records: …
Did you do your benchmarks with write-optimized or just default settings?
Optimized: https://github.com/SergeTupchiy/otel_bench/blob/main/src/otel_ets_insert.erl#L20
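For context, "write-optimized" refers to table options along these lines (a sketch with an illustrative table name; the exact flags used are in the otel_ets_insert.erl link above):

```erlang
%% Illustrative write-optimized settings. decentralized_counters
%% requires OTP 23+ together with write_concurrency.
new_bench_table() ->
    ets:new(bench_tab, [duplicate_bag, public,
                        {write_concurrency, true},
                        {decentralized_counters, true}]).
```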
Hi :) I'm not sure what your constraints are memory-wise, but if you can share your benchmark code I'd be happy to run it on something big 😁 Also, I would be interested in seeing the results using different benchmarking tools.
@starbelly, …
Thank you 🙇. I did not get a chance on Friday to run these on a large machine, but will do so next week (half a terabyte to a terabyte of memory, 96 cores, etc.). That said, I ran one benchmark on my MacBook and here is what I got: …
Benchmarks are fun like that though, right? :) So, I will proceed to run them all on various machines.
Thanks @starbelly, …
I think changing to set is fine, but we should probably do that in a separate PR. Is this ready for review then?
Yes, I think it deserves a separate PR and this one is ready, thanks.
Cool. @SergeTupchiy, if you can rebase and resolve those conflicts, we can get this merged.
Force-pushed from 0b307b9 to 3b24398
@bryannaegele, done
Note that export will change a good bit to work with …
As `new_export_table(ExportingTable)` is called right after killing the runner, it's necessary to wait for the runner's termination. Otherwise, `new_export_table/1` can raise a `badarg` error because the table is not yet destroyed.
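A minimal sketch of that waiting step, assuming the runner process is monitored (names other than new_export_table/1 are illustrative, not the processor's actual function names):

```erlang
%% Kill the runner, then block until its 'DOWN' message arrives, so the
%% table it owns is actually destroyed before a new one is created.
kill_runner_and_recreate(RunnerPid, ExportingTable) ->
    Mon = erlang:monitor(process, RunnerPid),
    exit(RunnerPid, kill),
    receive
        {'DOWN', Mon, process, RunnerPid, _Reason} ->
            new_export_table(ExportingTable)
    end.
```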