secondary sampling
Secondary sampling means participants choose trace data from a request even when it is not sampled with B3. This is particularly important with customer support or triage in large deployments. For example:
- I want to see 10% of gateway requests with a path expression /play/*. However, I only want data from the gateway and playback services.
- I want to see 15% of authUser() gRPC requests, but only data between the auth service and its cache.
This design allows multiple participants to perform investigations that possibly overlap, while only incurring overhead at most once. For example, if B3 is sampled, all investigations reuse its data. If B3 is not, investigations only record data at trigger points in the request.
The fundamentals of this design are the following:
- A function of the request creates zero or many "sampling keys". This function can trigger anywhere in the service graph.
- A header, sampling, is co-propagated with B3 and includes these keys and associated metadata.
- A delimited span tag, sampled_keys, is added to all recorded spans. sampled_keys is a subset of all propagated keys relevant for this hop. Notably, it may include a keyword 'b3' if the span was B3 sampled.
- A "trace forwarder" routes data to relevant participants by parsing the sampled_keys tag.
Typically, a Zipkin trace is sampled up front, before any activity is recorded. B3 propagation conveys the sampling decision downstream consistently. In other words, the decision never changes from unsampled to sampled on the same request.
Many large sites use random sampling to ensure that only a small percentage (<5%) of requests result in a trace. While nuanced, it is important to note that even with random sampling, sites often have blacklists which prevent instrumentation from triggering at all. A prime example is health checks, which are usually never recorded even if everything else is randomly sampled.
Many conflate Zipkin and B3 with pure random sampling, because initially that was the only choice. However, times have changed. Sites often use conditions, such as properties of an HTTP request, to choose data. For example, record 100% of traffic at a specific endpoint (while randomly sampling other traffic). Choosing what to record based on context, including request and node-specific state, is called conditional sampling.
In either case, random or conditional, there are other guards as well. For example, decisions are often subject to a rate limit: up to 1000 traces per second for this endpoint means effectively 100% until that cap is reached. Further concepts are available in William Louth's Scaling Distributed Tracing talk.
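The rate-limit guard can be sketched as follows. This is a minimal illustration using fixed one-second windows, not how any particular Zipkin library implements it. The class name RateLimitedSampler and its injectable clock parameter are assumptions made for this example.

```python
import time


class RateLimitedSampler:
    """Accepts up to `traces_per_second` decisions in each one-second window.

    Hypothetical helper for illustration; not part of any Zipkin library.
    """

    def __init__(self, traces_per_second, clock=time.monotonic):
        self.traces_per_second = traces_per_second
        self.clock = clock  # injectable so the behavior can be tested
        self.window_start = clock()
        self.accepted = 0

    def is_sampled(self):
        now = self.clock()
        if now - self.window_start >= 1.0:
            # A new one-second window begins: reset the budget.
            self.window_start = now
            self.accepted = 0
        if self.accepted < self.traces_per_second:
            self.accepted += 1
            return True
        return False
```

With a limit of 1000, every request is accepted (effectively 100%) until the thousandth decision inside a window, after which requests are declined until the next window starts.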
The important takeaway is that existing Zipkin sites select traces based on criteria visible at the beginning of the request. Once selected, this data is expected to be recorded into Zipkin consistently even if the request crosses 300 services.
For the rest of this document, we'll call this up front, consistent decision the "primary sampling decision". We'll understand that this primary decision is propagated in-process in a trace context and across nodes using B3 propagation.
Secondary sampling decisions can happen anywhere in the trace and can trigger recording anywhere also. For example, a gateway could add a sampling key that is triggered only upon reaching a specific service. Sampling keys are human readable labels corresponding to a trace participant. There's no established registry or mechanism for choosing these labels, as it is site-specific. An example might be auth15pct.
The sampling header (or specifically propagated field) carries the sampling keys and any state associated with them (such as TTL values). Keys are semicolon delimited. If a key has associated metadata, the metadata begins at the colon (:) character and is itself comma delimited. This means semicolons, colons and commas are all reserved characters.
For example, here's encoding 100 requests per second and a TTL of 1 for the authcache key:
sampling: authcache:rps=100,ttl=1
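The encoding rules above can be sketched as a small parser and formatter. This is only an illustration of the delimiters described here. The function names parse_sampling and format_sampling are hypothetical, and parameter values are kept as strings rather than interpreted.

```python
def parse_sampling(header):
    """Parses 'key1:a=1,b=2;key2' into {'key1': {'a': '1', 'b': '2'}, 'key2': {}}."""
    keys = {}
    for entry in header.split(";"):  # keys are semicolon delimited
        entry = entry.strip()
        if not entry:
            continue
        # Metadata, if any, begins at the first colon.
        name, _, metadata = entry.partition(":")
        params = {}
        for pair in metadata.split(","):  # metadata is comma delimited
            if pair:
                param, _, value = pair.partition("=")
                params[param] = value
        keys[name] = params
    return keys


def format_sampling(keys):
    """Inverse of parse_sampling; keys without metadata are written bare."""
    entries = []
    for name, params in keys.items():
        if params:
            metadata = ",".join(f"{k}={v}" for k, v in params.items())
            entries.append(f"{name}:{metadata}")
        else:
            entries.append(name)
    return ";".join(entries)


print(parse_sampling("authcache:rps=100,ttl=1"))
# {'authcache': {'rps': '100', 'ttl': '1'}}
```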
The naming convention sampling follows the same design concern as b3 single. Basically, hyphens cause problems across messaging links. By avoiding them, we allow the same system to work with message traces as opposed to just RPC ones, and with no conversion concerns.
The application is unaware of secondary sampling. It is critical that this design and tooling in no way change the API surface of instrumentation libraries, such as what's used by frameworks like Spring Boot. This reduces implementation risk and also allows the feature to be enabled or disabled without affecting production code.
Moreover, the fact that there are multiple participants choosing data differently should not be noticeable by instrumentation. All participants use the same trace and span IDs, which means log correlation is not affected. Sharing instrumentation and reporting means we are not burdening the application with redundant overhead. It also means we are not requiring engineering effort to re-instrument each time a participant triggers recording.
Each participant in the trace could have different capacities, retention rates and billing implications. Responsibility for this lies with a Zipkin-compatible endpoint, which routes the same data to participants associated with a sampling key. We'll call this the trace forwarder. Some examples are PitchFork and Zipkin Forwarder.
If the trace forwarder sees two keys b3 and gateway, it knows to forward the same span to the standard Zipkin backend as well as the API Gateway team's Zipkin.
Trace data is completely out-of-band: it is decoupled from request headers. For example, if the forwarder needs to see sampled keys, they must be encoded into a tag, sampled_keys.
The naming convention sampled_keys has two important facets. One is that it is encoded in lower_snake_case. This is to allow straightforward JSON path expressions, like tags.sampled_keys. The second is the word "sampled", which differentiates this from the sampling header. Keys sampled are a subset of all sampling keys, hence the word "sampled" not "sampling". The value is comma separated as it is easy to tokenize. It isn't a list because Zipkin's data format only allows string values.
The special sampling key b3 ensures secondarily sampled data are not confused with B3 sampled data. Remember, in normal Zipkin installs, the presence of spans at all implies they were B3 sampled. Now that there are multiple destinations, we need to back-fill a tag to indicate the base case. This ensures the standard install doesn't accidentally receive more data than was B3 sampled. b3 should never appear in the sampling header: it is a pointer to the sampling state of B3 headers.
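The forwarder's routing decision can be sketched as below. This is an assumption-laden illustration, not the logic of PitchFork or Zipkin Forwarder. The function route_span, the routing table, and the backend names are all hypothetical, and a span without a sampled_keys tag is routed nowhere.

```python
def route_span(span, destinations):
    """Returns the names of the backends that should receive this span.

    `destinations` maps a sampling key (including the special key 'b3')
    to a backend name. Keys with no configured destination are ignored.
    """
    tag = span.get("tags", {}).get("sampled_keys", "")
    sampled_keys = [key.strip() for key in tag.split(",") if key.strip()]
    targets = []
    for key in sampled_keys:
        backend = destinations.get(key)
        if backend is not None and backend not in targets:
            targets.append(backend)
    return targets


# Hypothetical routing table: backend names are placeholders.
destinations = {"b3": "zipkin-main", "gateway": "zipkin-gateway-team"}

span = {"traceId": "463ac35c9f6413ad", "id": "a2fb4a1d1a96d312",
        "tags": {"sampled_keys": "b3,gateway"}}
print(route_span(span, destinations))  # ['zipkin-main', 'zipkin-gateway-team']
```

A span tagged only with gateway would reach only the gateway team's backend, which is how the standard install avoids receiving more data than was B3 sampled.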
It is possible that some sampling keys skip hops, or services, when recording. When this happens, parent IDs will be wrong, and so will any dependency links. It may be heuristically possible to reconnect the spans, but this will push complexity into the forwarder, at least requiring it to buffer a trace based on a sampling key.
There are a couple of ways to mitigate this. One is to never skip nodes, which is the easiest by far. Another is to use more complex state management which propagates the upstream context in the sampling header, similar to how tracestate was originally designed.
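To make the skipped-hop problem concrete, here is a small sketch of what a participant's backend receives when an intermediate service does not record. The span dictionaries are simplified, the IDs are made up, and missing_parents is a hypothetical helper.

```python
def missing_parents(spans):
    """Returns IDs of spans whose parentId refers to a span not in `spans`."""
    known = {span["id"] for span in spans}
    return [span["id"] for span in spans
            if span.get("parentId") is not None and span["parentId"] not in known]


# What a participant receives when a middle hop is skipped: the downstream
# span still points at the skipped span, which was never forwarded.
received = [
    {"id": "a", "name": "gateway"},
    {"id": "c", "parentId": "b", "name": "playback"},  # "b" was the skipped hop
]
print(missing_parents(received))  # ['c']
```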
Not all tracing libraries have the same features. The following are required for this design to work:
- ability to trigger a "local sampled" decision independent of the B3 decision, which propagates to child contexts
- propagation components must be extensible such that the sampling field can be extracted and injected
- trace context extractors must see request objects, to allow for secondary request sampling decisions.
  - often they can only see headers, but they now need to see the entire request object (ex the http path)
- ability to attach extra data to the trace context, in order to store sampling key state.
- a span finished hook needs to be able to write the sampled_keys tag based on this state.
- the span reporter needs to be able to see all spans, not just B3 sampled ones.
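A few of these requirements can be sketched together: extra state on the trace context, a span-finished hook that writes the tag, and a reporter check that admits locally sampled spans. The types and function names below are hypothetical and do not mirror any specific tracing library. Writing b3 first in the tag is also just a choice made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TraceContext:
    trace_id: str
    span_id: str
    b3_sampled: bool
    # Extra state: sampling keys that triggered at this hop.
    sampled_keys: list = field(default_factory=list)


def should_report(context):
    """The reporter must see spans that are B3 sampled or locally sampled."""
    return context.b3_sampled or bool(context.sampled_keys)


def on_span_finished(context, span):
    """Span-finished hook: writes the sampled_keys tag from context state."""
    keys = (["b3"] if context.b3_sampled else []) + list(context.sampled_keys)
    if keys:
        span.setdefault("tags", {})["sampled_keys"] = ",".join(keys)
    return span
```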
The following fictitious application is used for use case scenarios. It highlights that only sites with 10 or more applications will benefit from the added complexity of secondary sampling. Small sites may be fine just recording everything.
In most scenarios, the gateway provisions sampling keys even if they are triggered downstream.
gateway -> api -> auth -> cache -> authdb
               -> recommendations -> cache -> recodb
               -> playback -> license -> cache -> licensedb
                           -> moviemetadata
                           -> streams
I want to see up to 100 authUser() gRPC requests per second, but only between the auth service and its cache.
This scenario is interesting as the decision happens well past the gateway. As the auth service only interacts with the database via the cache, and the cache is its only downstream, it is easiest to implement this with ttl=1.
The gateway adds the sampling key authcache to the sampling field. It attaches the requests per second parameter and the ttl.
sampling: authcache:rps=100,ttl=1
The complete state authcache:rps=100,ttl=1 is ignored by all nodes until the auth service. Even if they add other sampling keys, they do not drop this one.
When the auth service sees a sampling parameter (rps), it triggers a decision. Assuming the decision is yes, it consumes the rps=100 sampling parameter and passes the TTL value down to the cache service. Otherwise, it redacts the key.
Let's assume the decision was pass. The auth service records the request regardless of B3 headers and appends authcache to span.tags.sampled_keys when reporting the span. Outbound headers include the name of the sampling key and the ttl value.
sampling: authcache:ttl=1
The cache service triggers, decrements the ttl to 0, and appends authcache to span.tags.sampled_keys when reporting the span. The header injection logic knows to redact any ttl=0 fields from further propagation. In other words, no special logic is needed to redact authcache.
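The hop-by-hop handling in this scenario can be simulated as follows. This is a sketch of the described behavior on already-parsed state, not a library implementation. The function handle_hop, its triggers_on argument (standing in for out-of-band configuration), and the accept callback (standing in for a rate-limited decision) are all hypothetical.

```python
def handle_hop(incoming, triggers_on, accept=lambda key, rps: True):
    """Processes the sampling state at one service.

    incoming:    dict of key -> params, e.g. {'authcache': {'rps': 100, 'ttl': 1}}
    triggers_on: keys this service makes decisions for
    accept:      stand-in for a rate-limited decision (hypothetical)
    Returns (sampled_keys, outgoing).
    """
    sampled_keys, outgoing = [], {}
    for key, params in incoming.items():
        params = dict(params)  # copy so the caller's state is not mutated
        if "rps" in params:
            if key not in triggers_on:
                outgoing[key] = params  # not ours: pass through untouched
                continue
            # Consume the rps parameter by making the decision here.
            if not accept(key, params.pop("rps")):
                continue  # decision was "no": redact the key
            sampled_keys.append(key)
            outgoing[key] = params  # the TTL, if any, travels downstream
        elif "ttl" in params:
            # An upstream hop already decided "yes": record and count down.
            sampled_keys.append(key)
            params["ttl"] -= 1
            if params["ttl"] > 0:
                outgoing[key] = params  # ttl=0 entries are redacted on injection
        else:
            outgoing[key] = params  # nothing this sketch acts on: pass through
    return sampled_keys, outgoing


state = {"authcache": {"rps": 100, "ttl": 1}}
_, state = handle_hop(state, triggers_on=set())  # gateway and api ignore the key
auth_keys, state = handle_hop(state, triggers_on={"authcache"})
cache_keys, state_after_cache = handle_hop(state, triggers_on=set())
print(auth_keys, state)               # ['authcache'] {'authcache': {'ttl': 1}}
print(cache_keys, state_after_cache)  # ['authcache'] {}
```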
I want to see 1 gateway request per second with a path expression /play/*. Moreover, I only want data from the gateway and playback services.
This use case is interesting because the trigger occurs at the same node that provisions the sampling key. Also, it involves skipping the api service, which may cause some technical concerns at the forwarding layer.
The gateway adds the sampling key gatewayplay to the sampling field. It attaches the requests per second parameter.
sampling: gatewayplay:rps=1
As the gateway itself is also a participant, internally it triggers a sampling decision. If that decision is accept, then only the key name, gatewayplay, is propagated, as the rps parameter has been consumed. If it was drop, then the key is redacted. In other words, a sampling key could be added and deleted before ever being encoded as a header, when the node provisioning the sampling key is also a participant.
Let's assume the decision was pass. The gateway service records the request regardless of B3 headers and appends gatewayplay to span.tags.sampled_keys when reporting the span. Outbound headers include just the name of the sampling key.
sampling: gatewayplay
The sampling field is unaltered as it passes the api service because out-of-band configuration does not trigger on the key gatewayplay. The api service would only report data if the request was B3 sampled.
The playback service triggers on this key, ensuring data is sampled locally even if B3 is unsampled. When reporting, it appends gatewayplay to span.tags.sampled_keys.
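The per-service outcome of this scenario can be sketched as below. The PARTICIPANTS table stands in for the out-of-band configuration mentioned above, and sampled_keys_for is a hypothetical helper. The sketch assumes the gateway's own decision was to accept, so the bare key gatewayplay is what propagates.

```python
# Hypothetical out-of-band configuration: which services record for which key.
PARTICIPANTS = {"gatewayplay": {"gateway", "playback"}}


def sampled_keys_for(service, propagated_keys, b3_sampled):
    """Returns the sampled_keys tag value for a span at `service`, or None
    if the span should not be reported at all."""
    keys = ["b3"] if b3_sampled else []
    keys += [k for k in propagated_keys if service in PARTICIPANTS.get(k, ())]
    return ",".join(keys) if keys else None


path = ["gateway", "api", "playback"]
propagated = ["gatewayplay"]  # the gateway's decision was to accept
for service in path:
    print(service, sampled_keys_for(service, propagated, b3_sampled=False))
# gateway gatewayplay
# api None
# playback gatewayplay
```

When B3 is sampled, the api span is reported with only b3 in its tag, so it reaches the standard Zipkin backend but not the participant that asked for gatewayplay data.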