Component instancing is complicated. Is it really necessary? #10534
Comments
Q: What are the benefits of using a custom syntax?
I am also interested in why we need the custom syntax. Another question I have is whether it makes sense to consider the changes in isolation. Would it be a net positive if we land the processors change but not the receivers/exporters/connectors one, or vice versa?
This is a good question. It's prompted me to reframe how these changes could be described. I think we could look at it more generally as "instancing by pipeline" and "instancing by data type". Processors are unique in that they are instanced by pipeline, but this effectively means they are also instanced by data type because every pipeline has a data type. Therefore, I think we could "step down" processors to "instancing by data type". If nothing else, this would have the benefit of reducing the overall complexity of the instancing discussion because all pipeline components would instance by data type only. Then, we can look separately at whether instancing by data type is still beneficial.
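To illustrate the difference with a sketch (the component and pipeline names here are assumptions, not taken from this thread): a single `batch` configuration used in two logs pipelines and one traces pipeline yields three instances when instancing by pipeline, but only two (one for logs, one for traces) when instancing by data type.

```yaml
processors:
  batch:
service:
  pipelines:
    logs/1:
      receivers: [ filelog/1 ]
      processors: [ batch ]   # by pipeline: instance 1; by data type: the shared logs instance
      exporters: [ file/1 ]
    logs/2:
      receivers: [ filelog/2 ]
      processors: [ batch ]   # by pipeline: instance 2; by data type: the same logs instance
      exporters: [ file/2 ]
    traces:
      receivers: [ otlp ]
      processors: [ batch ]   # by pipeline: instance 3; by data type: the traces instance
      exporters: [ otlp ]
```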
After discussion on the 2024-07-11 Collector Stability working group, I am going to add this to the self-observability milestone and the Collector 1.0 project board.
@djaglowski I agree that the instancing question is quite complex. I wonder how instancing processors per data type would work across all the existing processors. Thinking through this, I think for most processors it wouldn't make much of a difference. Would this be a problem for the tailsampling processor? An example that comes to mind where it seems like it might be a problem is rate limiting configuration: applied on a per-signal basis instead of per pipeline, it may cause unexpected behaviour changes.
I'm glad to see this as part of Collector: v1. Component instancing was a big challenge when implementing component health, and it will be a challenge for users trying to understand the health of their collectors. I'm not sure what the ideal solution looks like, but I'm definitely supportive of this work. I'd be very interested in how we might be able to clean up the shared component complications.
Yes, this would certainly be a breaking change. I think my original example with the batch processor demonstrates the effect. Basically, if you have two logs pipelines that share an instance of the batch processor, then you are batching together logs from both pipelines. The same pattern would apply to rate limiting w/ the tailsampling processor. I assume if we implement any part of this proposal we will have to message it loud and clear. I'm confident there is a reasonable way to advise users on how to preserve the same behavior we have today, e.g. "If you want the same behavior as before, just declare an instance of the processor for each pipeline."

```yaml
processors:
  tailsampling/1: ...
  # tailsampling/2: ...
service:
  pipelines:
    traces/1:
      receivers: ...
      processors: [ tailsampling/1 ]
      exporters: ...
    traces/2:
      receivers: ...
      processors: [ tailsampling/1 ] # shared with traces/1
      # processors: [ tailsampling/2 ] # distinct from traces/1
      exporters: ...
```

To expand on this, here's a simplified version of my original example:

```yaml
traces/1:
  receivers: [ in/1 ]
  processors: [ foo ]
  exporters: [ out/1 ]
traces/2:
  receivers: [ in/2 ]
  processors: [ foo ]
  exporters: [ out/2 ]
```

Here, if `foo` were a single instance, data would flow from both `in/1` and `in/2` into the same instance of `foo`, and its output would flow to both `out/1` and `out/2`.

There are also situations where sharing processors between pipelines is structurally problematic. Iterating on the above example a bit:

```yaml
traces/1:
  receivers: [ in/1 ]
  processors: [ foo, bar ]
  exporters: [ out/1 ]
traces/2:
  receivers: [ in/2 ]
  processors: [ bar, foo ]
  exporters: [ out/2 ]
```

What happens here? I think the answer would have to be that there is a cycle, so it is an invalid configuration. But again, if we can say clearly that "Each Component Configuration is instantiated exactly once (or once per data type)", then the solution is simply to explicitly declare your instances, e.g. use `foo/1, bar/1` in `traces/1` and `foo/2, bar/2` in `traces/2`. All of this is to say that this proposal is explicitly a breaking change, but if we settle on a model of explicit instancing, it's much easier to reason about precisely what is shared and what is not shared.
If we move to a model where each Component Configuration is instantiated exactly once, then we would no longer need `sharedcomponent` at all; all components would just always work that way. However, we need to determine how factories change, because instead of creating a separate instance per data type, a factory would need to produce a single component that handles every data type its configuration is used for.
This makes sense. From the component health perspective, we will be interested in knowing what pipelines a component belongs to. I'm sure there are solutions to this problem and it's one that should be on our list of requirements.
Is there any way we can make this backwards compatible, or at least minimize user impact and breakage? I think this is something good to do, but I would like to avoid major changes in the configuration format at this point (minor changes or edge cases should be fine with a migration plan).
I think there would have to be a breaking change but we can do a lot to mitigate the pain. First, I think it's likely that many users are not affected at all. Still, I would expect we do at least the following:
As mentioned above, these are the situations where processor configuration reuse would need to be migrated:
Examples to introduce users to YAML anchors can be quite verbose. Given the following configuration:

```yaml
receivers:
  filelog/1: # Because this is a receiver, this represents one *instance*.
    include: local/anchors/1.log
  filelog/2:
    include: local/anchors/2.log
processors:
  # Currently, because this is a processor, this is *not* an instance, but a *reusable configuration*.
  batch:
    send_batch_size: 10000
    timeout: 10s
exporters:
  file/1: # Because this is an exporter, this represents one *instance*.
    path: local/anchors/out/1.log
  file/2:
    path: local/anchors/out/2.log
service:
  pipelines:
    logs/1:
      receivers: [ filelog/1 ]
      processors: [ batch ] # Use the processor configuration once.
      exporters: [ file/1 ]
    logs/2:
      receivers: [ filelog/2 ]
      processors: [ batch ] # Use the same processor configuration again.
      exporters: [ file/2 ]
```

With the changed behavior, processors are instanced the same way as receivers and exporters. This means that what you define in the `processors` section is a distinct instance, not a reusable configuration. However, if you want to reuse a configuration, you can do so with YAML anchors. To migrate the above example, we can declare two separate instances of the `batch` processor:

```yaml
...
processors:
  # Declare a processor instance named 'batch/1' AND a configuration named 'myReusableBatchConfig'.
  batch/1: &myReusableBatchConfig
    send_batch_size: 10000
    timeout: 10s
  # Declare another instance and reference the 'myReusableBatchConfig' configuration.
  batch/2: *myReusableBatchConfig
...
service:
  pipelines:
    logs/1:
      receivers: [ filelog/1 ]
      processors: [ batch/1 ] # updated from 'batch' to 'batch/1'
      exporters: [ file/1 ]
    logs/2:
      receivers: [ filelog/2 ]
      processors: [ batch/2 ] # updated from 'batch' to 'batch/2'
      exporters: [ file/2 ]
```
I've opened #10664 to demonstrate what instancing processors per data type could look like.
Only from reading #10534 (comment), I understand that this is a less-than-ideal change, since we move the "complication" of setting up processor instances to the user, who is then required to know that some components cannot be shared between pipelines, etc. I have never heard any of the customers I've talked to complain about this. Also, in the processing graphs of many systems I know, receivers/exporters/connectors are nodes and pipelines (lists of processors between nodes) are edges. If we do what is proposed here, I would just remove processors and only have connectors, because every processor would become a node in the graph. I personally vote a big -1 on this.
Is your feature request related to a problem? Please describe.
Users define `component.ID`s in their collector configurations using the format `type[/name]`, where `type` is for example `otlp` or `filelog`, and `name` is an additional arbitrary string which helps identify a unique configuration.

When the collector runs with a given configuration, it often will create and run multiple instances of a given component. The rules that govern how instancing works are not well documented, but I have previously described them in detail here.
In short:
- Processors are instanced once per pipeline.
- Receivers and exporters are instanced once per data type.
- Connectors are instanced once per ordered pair of data types.
- `sharedcomponent` is a hacky "best practice" which allows component authors to circumvent instancing.

I believe that instancing solves some problems but introduces others. Ultimately, I believe we may be better off removing the notion of instancing as described below. Although we are working aggressively towards 1.0 of the collector, I think we should seriously consider this change because it would substantially improve one requirement of 1.0 (self-observability).
Terminology
Even discussing this issue is difficult because we do not have clear terminology so I will attempt to introduce some terms here which are useful for the rest of the discussion:
A "Component Configuration" is an element which a user defines within the
receivers
,processors
,exporters
,connectors
, orextensions
section. For example, in the following configuration, there are 3 Component Configurations (otlp/in
,batch
andotlp/out
):A "Component Instance" is a corresponding struct which the collector instantiates internally. The following configuration will result in more than 3 Component Instances:
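The following sketch illustrates the kind of configuration being described; the pipeline layout and the component settings are illustrative assumptions.

```yaml
receivers:
  otlp/in:            # 1 Component Configuration
    protocols:
      grpc:
processors:
  batch:              # 1 Component Configuration
exporters:
  otlp/out:           # 1 Component Configuration
    endpoint: localhost:4317
service:
  pipelines:
    traces:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]
    logs:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]
```

Under the rules described below, `batch` is instanced once per pipeline (two instances here), while `otlp/in` and `otlp/out` are each instanced per data type unless they opt into `sharedcomponent`, so even this small configuration yields more than 3 Component Instances.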
If you are doubting that our current instancing behavior is confusing, please consider whether you can easily answer the following:
(my answer here)
Problem 1 - Observability
Observability is all about understanding the internal state of a system. As an observability tool it is critical that we are a good exemplar of an observable system. However, our current notion of instancing is difficult to understand and therefore makes the collector less observable than it should be.
Users should be able to understand when telemetry describes a specific Component Instance but we do not currently have an externally defined schema for this. In fact, proper identity is dependent on both the class (receiver/processor/exporter/connector) and type (otlp/filelog) of component so it may not even be possible to define a static schema for identity attributes.
Additionally, users should be able to understand the relationship between a Component Configuration and its associated Component Instance(s). Even if the identity problem were reasonably well solved, any attempt to communicate to users about a specific Instance is muddled behind the rules that define these relationships. Even a simple configuration such as the example above may require a deep understanding of collector internals in order to understand the number of Component Instances, how they each relate to a Component Configuration, and which pipeline or pipelines contain which Component Instances.
Problem 2 - Maintainability
Identifying and managing Component Instances alongside Component Configurations is difficult to do even within the collector codebase.
The effort to design and implement a robust component status model was largely complicated by the fact that distinct Component Instances that each correspond to the same Component Configuration may be in different states. For example, for an exporter which pushes logs to one endpoint and traces to another, it should be very clear if one is healthy while the other is not.
Additionally, having worked on some early prototypes of a hot-reload capability, I recently discovered it is very difficult to accomplish something as simple as getting a handle to the set of Component Instances that were instantiated from a given Component Configuration. (Once the collector is started, if we want to apply a new config which only changes a single Component Configuration, we need to get a handle to the corresponding Component Instances before we can take any meaningful action.)
Describe the solution you'd like
I think we should seriously consider whether or not component instancing is actually a net positive, and what it would look like if each Component Configuration were instantiated exactly once. In order to determine whether this model would be an improvement, it is necessary to consider the reasons why we currently create Component Instances.
Processors
Processors are currently instanced in a unique way relative to other components. Every pipeline is given a unique instance of a processor in order to ensure that pipelines do not overlap. For example, consider two pipelines that both list a `batch` processor (see the sketch after this section). If `batch` were a single instance, then the inputs from both pipelines would all flow into the same processor, and the output of that processor would flow to both exporters. Essentially, it would be the same as collapsing the two pipelines into one.

However, users could very easily explicitly configure their pipelines to remain separate by declaring two separate `batch` processors. Importantly, we would lose one clear benefit here, which is that a single processor configuration could not be reused. In other words, the user would have to define `batch/1` and `batch/2` separately in the `processors` section. This may be a small problem in many cases, but for complex processor configurations it could lead to a lot of duplicated configuration. Users can instead rely on YAML anchors if necessary; see Explicitly Instanced Configurations below for an explanation of how this can be resolved.

A notable implementation detail for this design is that we would need the service graph to manage fanouts and capabilities for every consumer. This is certainly doable and might actually be less complicated, because we can have one generic way to calculate these factors vs. the current logic, which has one set of rules immediately after receivers and another immediately before exporters.
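For illustration, here is a sketch of the kind of configuration being discussed (the receiver and exporter names are assumptions). Today the two pipelines each get their own `batch` instance; under the single-instance model, keeping them separate would require declaring explicit instances, as shown in the commented-out lines.

```yaml
processors:
  batch:
  # batch/1:   # explicit instances, if single-instance semantics were adopted
  # batch/2:
service:
  pipelines:
    traces/1:
      receivers: [ in/1 ]
      processors: [ batch ]   # or [ batch/1 ]
      exporters: [ out/1 ]
    traces/2:
      receivers: [ in/2 ]
      processors: [ batch ]   # or [ batch/2 ]
      exporters: [ out/2 ]
```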
Receivers & Exporters
Receivers and exporters are currently instanced per data type. I believe the primary reason for this is to protect against order-of-instantiation bugs. For example, consider the following pseudocode:

In the alternative pattern, there is a chance that during instantiation of the component, each additional type may introduce an incorrect interaction into the state of the struct. In my opinion, this is the biggest open question about the feasibility and impact of this approach. That said, this is already opted into by components using the `sharedcomponent` pattern. I don't believe it has proven to be very difficult to work with in those cases, so perhaps this is not much of a concern. If there is sufficient interest in this proposal I will investigate further.
Connectors
Connectors are currently instanced per ordered pair of data types. I believe the considerations here are inherited from receivers and exporters so I am not aware of any unique considerations at this point.
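As a sketch of what "per ordered pair of data types" means (the `count` connector and the pipeline layout are assumptions for illustration): a connector used as an exporter in traces and logs pipelines and as a receiver in a metrics pipeline is currently instanced once for (traces -> metrics) and once for (logs -> metrics).

```yaml
connectors:
  count:
service:
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ count ]   # instance for (traces -> metrics)
    logs:
      receivers: [ otlp ]
      exporters: [ count ]   # instance for (logs -> metrics)
    metrics:
      receivers: [ count ]
      exporters: [ otlp ]
```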
Extensions
Extensions already follow the one-instance-per-configuration pattern which I am proposing to apply across the board.
Explicitly Instanced Configurations
As mentioned earlier, one of the immediate drawbacks of removing our current instancing logic is that users could not necessarily reuse configurations for processors. This is already the case for other components, but arguably it is useful functionality nonetheless, so I think we should consider how else to provide it. First, there is the notion of YAML anchors, which we could recommend. However, I believe it would not be difficult to provide users a native alternative.

In short, we can make all Component Configurations reusable by allowing users to explicitly instance components within the pipeline definitions. I believe this would require us to designate one additional special character for this syntax. As a placeholder I'll use `#`, but obviously we would need to consider this more carefully. A complete example is sketched below.

In this example, we've defined a Component Configuration for a `transform` processor once, and then used it twice by applying an explicit name to each usage of it. Two separate instances are created because we explicitly indicated to do so. Additional scrutiny of this syntax is necessary, but the point is that some syntax could presumably allow users to indicate whether an additional instance of the configuration is intended.
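As a sketch of the idea (the component names and the `transform` settings are assumptions), the placeholder `#` syntax might look like this, reusing one `transform` configuration as two explicit instances:

```yaml
processors:
  transform:                      # one reusable Component Configuration
    error_mode: ignore
service:
  pipelines:
    logs/1:
      receivers: [ filelog/1 ]
      processors: [ transform#a ]   # explicit instance "a" of the transform configuration
      exporters: [ file/1 ]
    logs/2:
      receivers: [ filelog/2 ]
      processors: [ transform#b ]   # explicit instance "b", a second instance of the same configuration
      exporters: [ file/2 ]
```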