Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce persistent storage to Azure exporter #632

Merged
merged 23 commits into from
Apr 29, 2019
Merged

Introduce persistent storage to Azure exporter #632

merged 23 commits into from
Apr 29, 2019

Conversation

reyang
Copy link
Contributor

@reyang reyang commented Apr 26, 2019

NOTE: discussion of data reliability and persistence will continue in #633.

@c24t @songy23 I hope to get your early feedback before I finish the docs/tests. And to see if we could take the same approach for agents / other SDKs.

I plan to add local file persistency for Azure trace exporter, to help the following cases:

  1. The application is experiencing networking issue, or the ingestion/backend is not responding. Instead of accumulating traces in memory, user can opt-in and configure the exporter to dump information to local file in order to reduce memory pressure. [Memory increase due to too many traces #590]
  2. In case of application crash / restart, we need a way to pick up the leftover traces.
  3. The storage should provide consistency support under multiple processing / multi-threading.
  4. For console application (e.g. backend job, periodic task, command line tools), there could be unsent traces after the grace period, either due to massive amount of traces, or due to intermittent transmission error.

The LocalFileStorage is designed to take minimum dependency on operating system level synchronization primitives, the only requirement is file rename should be mutual exclusive. This makes it easier to port the logic to other languages/platforms.

from opencensus.ext.azure.common.storage import LocalFileStorage

if __name__ == '__main__':
    using LocalFileStorage('test', maintenance_period=5) as stor:
        for blob in stor.gets():
            print(blob.fullpath)
        blob = stor.put(['Hello, World!'])
        blob = stor.get()
        if blob and blob.lease(10):
            print(blob.get())
            blob.delete()

@c24t
Copy link
Member

c24t commented Apr 26, 2019

The implementation looks good, but if you have the option to export to the agent without buffering and then exporting from the agent to some persistent sink (a file, a message queue, etc.), I think this would be better than building filesystem access into the exporters.

I don't know if this would solve your use cases though. It would solve (3), but not (2), although I'm not sure that this solves (2) either. I agree that we need a better general solution for (1), e.g. letting the user configure whether to drop messages when the backend is down, which messages to drop when we hit the memory limit, etc.

@songy23 can speak to persistence in the agent.

@bogdandrutu
Copy link

Why do you not have a FileExporter, and run a demon on that machine (oc-agent for example) to read from these files and export traces/metrics?

@bogdandrutu
Copy link

Why I think this is better:

  1. You don't need to implement the logic of retrieving in all the languages;
  2. In system like K8S you may not be rescheduled on the same host after a restart;

@reyang
Copy link
Contributor Author

reyang commented Apr 26, 2019

Why do you not have a FileExporter, and run a demon on that machine (oc-agent for example) to read from these files and export traces/metrics?

For Microsoft, there are client scenario (e.g. command line tools) which doesn't allow agent.

@bogdandrutu
Copy link

Don't know what commanline tools you have there but usually they run for a short period of time. I think having a small document where we try to capture all the requirements and possible implementations and analyze tradeoffs will be great for this feature.

I am not against having this capability but I would like to understand what design fits best our requirements.

@reyang
Copy link
Contributor Author

reyang commented Apr 26, 2019

Don't know what commanline tools you have there but usually they run for a short period of time. I think having a small document where we try to capture all the requirements and possible implementations and analyze tradeoffs will be great for this feature.

I've described the requirement in this PR. Let's use this PR for now, if the conversation goes too long to fit here, I'm okay to create a separate document.

I am not against having this capability but I would like to understand what design fits best our requirements.

Yep, totally understand.

@reyang reyang closed this Apr 26, 2019
@reyang reyang reopened this Apr 26, 2019
@bogdandrutu
Copy link

Couple of questions about the design:

  • Does written data have a TTL?
  • Do we have a size limit on the disk?
  • Do we want to support something like file rotations (similar to logs)?
  • Do we want to use the logging system for this?

@reyang
Copy link
Contributor Author

reyang commented Apr 26, 2019

@bogdandrutu these are great questions!

Here goes the definition of the LocalFileStorage class:

class LocalFileStorage(object):
    def __init__(
        self,
        path,
        max_size=100*1024*1024,  # 100MB
        maintenance_period=60,  # 1 minute
        retention_period=7*24*60*60,  # 7 days
        write_timeout=60,  # 1 minute
    ):
  • TTL is controlled by the retention_period.
  • size limit is controlled by max_size (although I haven't yet implemented it at this moment).
  • rotation is done by the blobs API (we have lease, cleanup job).
  • for logging system, I would prefer to write locally and having out-of-proc agent (e.g. FluentBit) to deliver it. This should be more efficient/reliable/performant. What OpenCensus library would do is to 1) integrate with the runtime logging 2) enrich the log data with trace_id/span_id/tags/etc. There are corner cases where we don't have agent, where we might need to provide some alternative option (but we shouldn't optimize for this corner scenario).

@reyang
Copy link
Contributor Author

reyang commented Apr 26, 2019

The implementation looks good, but if you have the option to export to the agent without buffering and then exporting from the agent to some persistent sink (a file, a message queue, etc.), I think this would be better than building filesystem access into the exporters.

Definitely. For Windows we have ETW. If we can have a cross-platform mechanism in OpenCensus, that'll be fantastic!

I don't know if this would solve your use cases though. It would solve (3), but not (2), although I'm not sure that this solves (2) either. I agree that we need a better general solution for (1), e.g. letting the user configure whether to drop messages when the backend is down, which messages to drop when we hit the memory limit, etc.

For (2), it is tricky, given OpenCensus is not designed to be full-transactional (e.g. When we end a span, we put it in the memory queue and wait for the exporter to pick it up later. This means the span data could get lost if application crashed in the middle). Making it fully transactional is way too expensive for the major scenarios, and probably is not what we want to shoot for. (@c24t and I discussed about this, and so far we only see auditing scenario which might require this)

In this PR, I want to provide a certain level of solution to 2), instead of a 100% guarantee.
For example, in case we get "Server Too Busy" from the agent/ingestion, there are three options:

  • Simply drop the data.
  • Put the data back to the queue, which could cause significant memory pressure. Also, the more we put in memory, the more data loss if there is a crash/restart.
  • Persist the data locally, and retry later.

@SergeyKanzhelev
Copy link
Member

@reyang thank you! Persistent storage will definitely help in specific scenarios and is a valuable addition to Azure exporter. It's great that you are working on it in a generic fashion so it may be reused. On this path, you will probably hit the need to introduce SpanData exporting feature census-instrumentation/opencensus-specs#255 that is currently implemented in C# and OpenConsensus project. And then to solve a problem of a potentially writing span to disk again.

Long term we will need to post guidance on how persistence can be achieved via the combination of Flush on exit, agent-based persistency, SDK-based persistency, and going full transactional. Hopefully, the need to SDK-based persistency will be minimal as it's hardest to implement.

Some things to remember while implementing:

  • Have a comment on "path" that specifying temp folder has potential privacy and security concerns as telemetry will be shared with other apps and accounts deployed on the same host.
  • it may be a good idea to allow using environment variables or resource attributes as a part of a file name or path. This way one misbehaving app cannot "attack" another by having too many random files.
  • some way of mutexing may be needed between processes while reading and removing files to avoid double upload issues.
  • some way of limiting the oldest time of span start should be in place. So very old spans eventually will be simply dropped and not written and attempted to be sent over and over.

@reyang
Copy link
Contributor Author

reyang commented Apr 27, 2019

  • some way of mutexing may be needed between processes while reading and removing files to avoid double upload issues.

We're using file rename as the synchronization primitive among processes.
One clarification, we're trying to mitigate double load, while scientifically this is unavoidable from client side (e.g. if client got a connection reset, there is no way to tell whether the server accepted the data or not at that very moment). De-dupe will need help from the backend.

  • some way of limiting the oldest time of span start should be in place. So very old spans eventually will be simply dropped and not written and attempted to be sent over and over.

We have retention_period in LocalFileStorage which is 7 days by default. Anything older than 7 days will be dropped.

@reyang reyang changed the title [WIP] Introduce persistent storage Introduce persistent storage to Azure exporter Apr 29, 2019
Copy link
Member

@c24t c24t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, but otherwise LGTM. We can revisit before moving this out of the azure exporter.

},
timeout=self.options.timeout,
)
except Exception as ex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to catch RequestException specifically here to tell retryable from non-retryable errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO comment for now, will revisit when refactoring the exporter specific context (e.g. instead of having exporter manipulating the blacklist, having a dedicated context flag in exporter logic, so that integrations like requests would not intercept such activities and cause dead loop).

:param args: The kwargs passed in while calling `function`.
"""

def __init__(self, interval, function, args=None, kwargs=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature change here means we lose access to the other Thread constructor args, but that's probably fine.

elapsed_time = time.time() - start_time
wait_time = max(self.interval - elapsed_time, 0)

def cancel(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I called this stop instead of cancel originally because it's still possible to call it after the function has run, but the change makes sense if you want this to mimic Timer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is hard to name things, here we take the easy approach to mimic Timer.

My personal thinking: stop sounds like a synchronous operation, which notifies the thread and wait for it to join. (e.g. I would expect the thread to be stopped after stop returns)

opencensus/metrics/transport.py Outdated Show resolved Hide resolved
start_time = time.time()
self.function(*self.args, **self.kwargs)
elapsed_time = time.time() - start_time
wait_time = max(self.interval - elapsed_time, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better than the original.

else:
yield LocalFileBlob(path)

def get(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is unused, what do you need it for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently not used except as an util for test cases. I was thinking to keep it as symmetrical to put?

@reyang reyang merged commit 992b223 into master Apr 29, 2019
@reyang
Copy link
Contributor Author

reyang commented Apr 29, 2019

Discussion of data reliability and persistence will continue in #633.

@reyang reyang deleted the azure branch April 30, 2019 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants