Data reliability and persistence story for SDK #633

Open
reyang opened this issue Apr 29, 2019 · 1 comment
reyang commented Apr 29, 2019

This is a follow-up to #632.

In the SDK, we need a clear story for the following situations. For each one, we can either support it in the core OpenCensus SDK or leave it to the specific exporter.

  1. When the SDK fails to export data to the backend system because of network issues, it must avoid exhausting memory by either discarding excess data (the newest or the oldest, depending on the case) or storing it locally (e.g. file, log, reliable pipe, ETW).
  2. When the application exits, restarts, or crashes, we want to reduce data loss. Some loss is unavoidable because this is not a fully transactional system (e.g. your code writes traces to a queue and the process is killed before the queue item is processed, so the data is lost). Still, being able to store data locally and pick it up later (after a machine or application restart) would be useful in some cases.
  3. Console applications (backend jobs, periodic tasks, command-line tools) might need to store traces locally during the exit grace period, since sending all the data over the network might not be possible within that period.
  4. Some developers need more reliability, for example for audit logs and QoS logs. We might need to provide an alternative path that lets developers trade performance for reliability (e.g. bypassing the queue and synchronously persisting the log to local storage, or even transmitting the data across the network).
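
To make the first situation concrete, here is a minimal sketch of a bounded in-memory buffer with a configurable drop policy. `BoundedBuffer`, `put`, `drain`, and the `drop_oldest` flag are hypothetical names chosen for illustration; they are not part of the OpenCensus API, and a real exporter might spill to local storage instead of discarding.

```python
import threading
from collections import deque


class BoundedBuffer:
    """Hypothetical bounded buffer; not part of the OpenCensus API."""

    def __init__(self, capacity, drop_oldest=True):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._drop_oldest = drop_oldest
        self._items = deque()
        self._lock = threading.Lock()
        self.dropped = 0  # number of items discarded so far

    def put(self, item):
        """Add an item; return False if the new item itself was discarded."""
        with self._lock:
            if len(self._items) >= self._capacity:
                self.dropped += 1
                if not self._drop_oldest:
                    return False  # keep buffered data, discard the newest
                self._items.popleft()  # make room by discarding the oldest
            self._items.append(item)
            return True

    def drain(self):
        """Remove and return everything currently buffered, oldest first."""
        with self._lock:
            items = list(self._items)
            self._items.clear()
            return items
```

Which policy fits depends on the data: dropping the oldest keeps the most recent telemetry, while dropping the newest preserves the events leading up to the outage. Exposing the `dropped` counter lets the SDK report how much was lost.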

The design principles:

  1. Need to work in a multi-threaded environment.
  2. Need to work in a multi-process environment (e.g. one application has multiple process instances running at the same time).
  3. Should leverage existing solutions where possible rather than reinventing the wheel.
  4. Need a solution for both the agent and agentless scenarios.
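
One possible way to satisfy the multi-threading and multi-process principles with local file storage is sketched below. It assumes a shared directory and relies on atomic rename: writers publish a batch by renaming a temporary file into place, and readers claim a batch by renaming it again, so only one thread or process wins. `LocalFileStorage` and its methods are hypothetical names, JSON is an arbitrary serialization choice, and the sketch omits lease expiry for files claimed by a process that later crashed.

```python
import json
import os
import uuid


class LocalFileStorage:
    """Hypothetical directory-backed store; not part of the OpenCensus API."""

    def __init__(self, directory):
        self._dir = directory
        os.makedirs(directory, exist_ok=True)

    def put(self, batch):
        """Persist a JSON-serializable batch so that readers never see a partial file."""
        name = "{}-{}.blob".format(os.getpid(), uuid.uuid4().hex)
        tmp_path = os.path.join(self._dir, name + ".tmp")
        with open(tmp_path, "w") as f:
            json.dump(batch, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before publishing
        os.replace(tmp_path, os.path.join(self._dir, name))  # atomic publish

    def claim(self):
        """Claim one stored batch; return (path, batch) or None if nothing is available."""
        for name in sorted(os.listdir(self._dir)):
            if not name.endswith(".blob"):
                continue  # skip temporary and already-claimed files
            src = os.path.join(self._dir, name)
            dst = src + ".lock-" + uuid.uuid4().hex
            try:
                os.rename(src, dst)  # only one claimer can rename the file
            except OSError:
                continue  # another thread or process claimed it first
            with open(dst) as f:
                return dst, json.load(f)
        return None

    def delete(self, path):
        """Remove a claimed batch after it has been exported successfully."""
        os.remove(path)
```

An exporter could call `put` when an export fails or during the exit grace period, and a later run (or a separate agent process) could call `claim`, retry the export, and `delete` the file on success.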

reyang commented Apr 29, 2019

@bogdandrutu @c24t @songy23
