What is an OpenTelemetry Collector, what is a distribution? #8555

Open
5 tasks
jpkrohling opened this issue Sep 27, 2023 · 91 comments

Comments

@jpkrohling
Member

jpkrohling commented Sep 27, 2023

We had a discussion recently around what is an OpenTelemetry Collector and what is a distribution of the Collector. I would like to gather your opinions.

@dyladan proposed that only what the SIG Collector produces can be called an "OpenTelemetry Collector" and that a distribution has to fulfill the following requirements:

  • uses the collector framework (upstream not a fork)
  • includes only plugins/components which are compatible with the collector framework. they don't need to be in the otel repos, but you should be able to point the upstream collector builder at them

I tend to agree with him, but I'm eager to hear your opinions. The GC might have the right to make the final decision if we can't get an agreement, but I think we can indeed reach a consensus, at least between the GC and the Collector maintainers (core and contrib).


Update - 2024-07-17: based on the state of the discussion so far, here are the issues we identified:

@dashpole
Contributor

Here was my take from 2020: https://docs.google.com/document/d/1jHOYTRRI91UdyMEfqV7WNPEAxSQKP13b_jPcQX4oe9I/edit?usp=sharing

TL;DR

Other projects (prometheus, kubernetes) have successfully created conformance programs by testing conformant behavior, rather than requiring the use of certain code packages. An example of "conformant behavior" could be:

  • Must accept a collector configuration yaml file, which includes a set of components (e.g. otlp receiver/exporter, batch processor, healthcheck extension).
  • Must pass basic testbed tests with this configuration.

The easiest way to construct a "conformant" collector distribution would be to simply use collector libraries, or the collector builder, but it wouldn't necessarily require it.
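
A behavior-based check of this kind could be sketched roughly as follows. This is only an illustration, not an agreed conformance program: the baseline simply mirrors the examples listed above, `REQUIRED_BASELINE` and `missing_baseline_components` are hypothetical names, the component type names are assumptions, and the inventory dict is assumed to have been obtained from the distribution by some other means. It covers only the "includes a set of components" part, not the testbed run.

```python
# Hypothetical baseline, mirroring the examples in the comment above.
# The component type names are assumptions made for this sketch.
REQUIRED_BASELINE = {
    "receivers": {"otlp"},
    "processors": {"batch"},
    "exporters": {"otlp"},
    "extensions": {"health_check"},
}


def missing_baseline_components(inventory):
    """Return {kind: sorted missing types} for every kind with a gap."""
    missing = {}
    for kind, required in REQUIRED_BASELINE.items():
        gap = required - set(inventory.get(kind, ()))
        if gap:
            missing[kind] = sorted(gap)
    return missing


# Inventory of an imaginary distribution that lacks the health check extension.
example_inventory = {
    "receivers": ["otlp", "prometheus"],
    "processors": ["batch"],
    "exporters": ["otlp"],
    "extensions": [],
}
print(missing_baseline_components(example_inventory))
```

An empty result would mean the distribution ships every baseline component, regardless of how the binary was built.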

@djaglowski
Member

I like the idea of defining conformance to a standard but it's unclear to me what we are suggesting will be the effect of being conformant. In other words, let's say we define what it means to be an "OpenTelemetry Collector", and someone has a product which meets all the requirements. Isn't it still a trademark issue for them to say that their product is an OpenTelemetry Collector?

IANAL but as I understand it, The Linux Foundation has a trademark on the term OpenTelemetry and their trademark guidelines define how the trademark may and may not be used.

e.g. It would be a trademark violation for a company to name their product "Company OpenTelemetry Collector" because the trademark may not be used in a product name. However, it is ok to use the phrase "Company Distribution for OpenTelemetry Collector" because it is a reference to the trademark and does not imply that the trademark is part of the product name.

I don't mean to nitpick but I can't figure out how one would communicate the fact that they officially have an OpenTelemetry Collector without violating the trademark guidelines.

@atoulme
Contributor

atoulme commented Sep 27, 2023

What does this clarification do and how does it help the project? I am unclear on why this is coming up, is this impacting the OpenTelemetry project's ability to graduate within the CNCF?

@bryan-aguilar
Contributor

I like the idea of defining conformance to a standard but it's unclear to me what we are suggesting will be the effect of being conformant.

I think in this case the effect would be that you cannot call yourself a "Collector distribution" without passing X,Y,Z conformance tests.

I think the trademark issue is separate though and has already been enforced in the past.

includes only plugins/components which are compatible with the collector framework. they don't need to be in the otel repos, but you should be able to point the upstream collector builder at them

I'm not sure I fully understand this one. Would this by proxy mean that "a collector distribution" must be built, or be able to be built, with OCB? I think this may be too limiting. Consider this scenario: contributor X builds a new Collector component type. It is ideal for their specific use case, and they don't plan on contributing it upstream, but they build it on top of the collector framework. OCB does not recognize this component type and thus fails to build it. Would this not qualify as a distribution?

@codeboten
Contributor

Just linking this other issue here that suggests a distribution should be added to the spec: open-telemetry/opentelemetry-specification#2873

As the issue points out, distribution is already in the official documentation: https://opentelemetry.io/docs/concepts/distributions/

@codeboten
Contributor

Note the doc linked above also includes a link to the definition of the collector today: https://opentelemetry.io/docs/concepts/components/#collector

The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported.

@austinlparker
Member

I guess my question would be if Collector SIG disagrees with the definition of distribution that's currently on the website.

@bryan-aguilar
Contributor

One thing that came up today during discussions at the Operator SIG, and also separately in discussions with @Aneurysm9, is command support.

Should collector distributions be required to support both the Collector validate and components commands? Do we need to ensure that any future commands can be supported by distributions that do not use OCB?

cc: @jaronoff97
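
If command support did become a requirement, a probe along these lines could check it. This is a sketch under assumptions: the function name is hypothetical, the binary path and config path are supplied by the caller, and a zero exit code is taken as "supported". The `run` parameter exists only so the probe can be exercised without a real binary.

```python
import subprocess


def supports_collector_commands(binary, config_path, run=subprocess.run):
    """Return {subcommand: bool} for the two subcommands discussed above.

    A subcommand counts as supported when the process exits with code 0.
    """
    commands = {
        "validate": [binary, "validate", f"--config={config_path}"],
        "components": [binary, "components"],
    }
    results = {}
    for name, argv in commands.items():
        try:
            completed = run(argv, capture_output=True, text=True, timeout=30)
            results[name] = completed.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Binary missing, not executable, or hanging: treat as unsupported.
            results[name] = False
    return results


# Example call (paths are placeholders):
# supports_collector_commands("./otelcol", "config.yaml")
```

Note that a failing `validate` can also mean the configuration itself is invalid, so the probe should be run with a configuration already known to be good.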

@jaronoff97
Contributor

My expectation as someone building features on top of the collector is that any collector distribution uses the collector builder, or at least can be marshalled into a struct that matches the collector Go framework. Being able to adhere to that would ensure that how we design Kubernetes features will always work for any distribution.

@trask
Member

trask commented Jun 13, 2024

What does this clarification do and how does it help the project?

I think this is a great question to help anchor this discussion.

Here's one scenario that comes to mind.

Consider if (hypothetically) Google offers an OpenTelemetry Collector Distro for GCP that has lots of great 1st party GCP support.

But their distro doesn't include (hypothetically) the Honeycomb Marker Exporter, because they don't want to be on the hook for supporting that exporter.

This situation seems somewhat unavoidable, as I'm not sure we want to force all distros to include all components, both for size and support reasons.

If the OpenTelemetry Collector could support dynamic linking, then users could just drop the Honeycomb Marker Exporter into their GCP distro, and the problem is solved, but it sounds like dynamic linking is a no go because of Go.

So we would need another way to ensure that OpenTelemetry Collector distros can be extended and don't lock users into the distro's ecosystem.

[just for one example, potentially we could say that anything called an OpenTelemetry Collector distro must be built using the OpenTelemetry Collector Builder and that all the distro components must be publicly available so that users can extend the distro themselves]

@yurishkuro
Member

@trask I don't think your example answers the question, at least for me. And we had an hour-long discussion on the call where we still didn't explicitly enumerate what problems we're trying to address by the discussion. I heard at least two problems, one on the call, another in your answer:

  1. OTEL Collector maintainers are concerned with getting a lot of user questions in the official Slack related to 3rd-party collector distros, all because they are calling themselves "OTEL collector ..."
  2. (from your comment) A user who's running a 3rd party distro needs to add another component to the collector; what do they do?

Some thoughts on (2):

  • if the 3rd party distro is fully open source, then user can just build their own flavor that includes additional components
  • if the 3rd party distro has closed source parts, there is no way to extend it today, nor in the near future given Go's ecosystem
    • WASM-based plugins are possible but likely won't be efficient enough due to data model complexity that cannot be easily transferred across the Go/WASM boundary without additional transformations.
    • The user still has a workaround of adding another oss-only collector in the pipeline (even more inefficient than WASM)
  • whatever the solution, the discussion of "what is collector" seems quite tangential to the problem

@trask
Member

trask commented Jun 14, 2024

if the 3rd party distro is fully open source, then user can just build their own flavor that includes additional components

it's not very user friendly and about 100x more painful than the plugin-based ecosystems I've worked with before where I can just upload a pre-built component into my existing system. I guess I was hoping we could get as close to the convenience that other plugin-based ecosystems offer, within the constraints of Golang.

whatever the solution, the discussion of "what is collector" seems quite tangential to the problem

I think the connection is that we have an opportunity to make requirements on something that wants to call itself an OpenTelemetry Collector distro, and so it's our chance to enforce something like this (if we want)

fwiw, the example I gave

[just for one example, potentially we could say that anything called an OpenTelemetry Collector distro must be built using the OpenTelemetry Collector Builder and that all the distro components must be publicly available so that users can extend the distro themselves]

aligns with the definition proposed by @dyladan and @jpkrohling above:

that a distribution has to fulfill the following requirements:

  • uses the collector framework (upstream not a fork)
  • includes only plugins/components which are compatible with the collector framework. they don't need to be in the otel repos, but you should be able to point the upstream collector builder at them

@yurishkuro
Member

includes only plugins/components which are compatible with the collector framework.

This ^ already excludes existing distros that use proprietary code. More importantly, it doesn't answer the question which problem a definition like this solves. I see no reason to debate the criteria without deciding why we're doing it. To quote a good book:

  • “Would you tell me, please, which way I ought to go from here?”
  • “That depends a good deal on where you want to get to.”
  • “I don't much care where.”
  • “Then it doesn't much matter which way you go.”

@trask
Member

trask commented Jun 14, 2024

I see no reason to debate the criteria without deciding why we're doing it.

I totally agree which is why I tried to provide one possible "why" above. I'm looking forward to seeing what other "whys" people have in mind.

@djaglowski
Member

The primary reason I care about a definition here is that users are advised to limit the collector to contain only the components necessary for an environment. In the absence of a dynamic plugin model (which to my knowledge no collector maintainer believes is feasible), we are recommending that users deploy a "collector" that we have not built ourselves. Since we are not recommending a concrete binary, I believe we need to define precisely what we are recommending. Additionally, we expect that as a user's needs evolve they will migrate to another "collector" that contains a different set of components. Therefore, a definition would serve to establish expectations for what stays the same between "collectors" vs what may be different.

I would like to highlight that the issue asks for two definitions, but there appear to be at least three categories of collectors which have been discussed. Very roughly:

  1. "Official" collectors - those produced by the Collector SIG
  2. "Custom" collectors - those produced by users following our recommendation to limit components for their environment
  3. "Distributions" - those published by vendors or organizations

The conversation so far seems to have blurred (2) and (3), and we might explicitly conclude that this is not an important distinction. However, for now, I'm drawing this distinction because the "whys" I've described above specifically apply to (2).

@tedsuo

tedsuo commented Jun 15, 2024

I have two problems that I would like to see resolved.

Problem one: remove confusion about what a Collector is

The first problem is basic confusion about "what a Collector is." Not a Collector distro, but the term Collector itself.

If someone points to a binary and calls it a Collector, just about everyone in the community would assume that the binary is a build of the collector codebase plus some plugins. Even if a binary was described as some kind of "Vendor Specific Collector Distribution," that core assumption would still be there.

That seems a bit obvious, but we're now starting to see projects pop up which don't match this definition. One example is Grafana Alloy. My understanding is that Alloy is basically the pre-existing Grafana agent, plus some additional components that it shares with the Collector codebase. Which is a totally fine thing to be! But when I first came across it, it was described as a "vendor neutral OpenTelemetry Collector distribution." Like everyone else in the community, that description made me think it was something completely different – that it was the Collector codebase plus some Grafana-specific plugins. I was super confused when I discovered that wasn't the case!

Again, no disrespect to Grafana or the Alloy project; it seems like a totally fine project to me. But the naming threw me for a loop. Imagine if CouchDB started calling itself Redis because it shared some Redis code in order to add a feature. That would be really confusing!

I'm sure the Grafana folks are reasonable, and we can just talk to them about it. But I imagine that there may be more instances of this in the future, so it seems prudent that we provide some kind of official definition of a Collector that roughly matches community expectations, in order to avoid confusion. Namely, that a Collector is a build of the collector framework plus some plugins.

Problem two: who do I talk to for technical support?

At the heart of all the various collector distro discussions is the question "who is responsible for helping me with this thing?"

We have users who come into our Slack channels asking for technical support. What technical support do we want to give? Who do we point them to if we don't want to give them support? Do we just support the core and contrib builds of the Collector? What if a user makes their own build, but it only contains a subset of plugins in the contrib build? What if they add just one plugin that they wrote themselves? What if a vendor provided the build? What if the vendor build only contains contrib plugins? What if it's the contrib build but their configuration file is absolutely insane? Technical support is really important, and telling someone "no we won't help you" is disappointing. So we need a really clear cut definition for what we are willing to support.

Maybe there are additional problems, but those are the two where I am currently seeing real world issues related to a lack of clear definitions around the Collector.

@yurishkuro
Member

Problem one: remove confusion about what a Collector is

I don't think this in itself is a problem. Whatever someone calls their binary doesn't concern me unless I have an actual problem to solve and their naming creates confusion preventing me from solving the problem (like coming to OTEL support group when the actual "collector" is something else entirely). So your #2 is an actual problem, but #1 is not, it's more like a possible root cause for #2. But #2 could be caused by other things too - a distro may actually be a "collector" as you want to define it, yet the question is about a custom or even proprietary plugin.

In other words, if #2 is the only problem you want to solve, it needs a policy of what is appropriate scope for support questions. There may be a definition of collector that helps this policy, but doesn't help other problems, such as one in #8555 (comment). And there may be other approaches to the policy rather than relying on "what is collector" question. Such as: go talk to your vendor who provided the binary, irrespective of whether it matches any definition of collector or not. I would actually be a strong proponent of that exact rule - vendors have paying customers, they can allocate resources for tech support, instead of putting this burden on oss volunteers in OTEL.

@tedsuo

tedsuo commented Jun 16, 2024

@yurishkuro number one is definitely a problem. We are actively addressing an example of it right now. It is related to number two, but it causes other fundamental confusions.

I agree that for most projects, #1 is not an issue – no one is going to name their project Redis. But perhaps because OpenTelemetry is something of a standard, there seems to be a natural inclination to imply that projects which process OTLP are part of OpenTelemetry even when they are maintained outside of the project, with the Collector being the main target. I don't think that defining what a Collector is needs to be difficult or complicated, but we should write it down anyway. We have other problems, but their solutions don't need to be related to the sign on the wall we need to put up declaring that the term Collector only refers to this codebase.

@jpkrohling
Member Author

Thank you all for the renewed interest in defining Collector and Collector distributions. I watched the recording from last Thursday and spoke to several of you on Slack (GC and Collector leads). Here’s a summary of the situation as I understand it.

We already have a few definitions in place, such as:

  • Distribution: "A distribution is a customized version of an OpenTelemetry component. A distribution is a wrapper around an upstream OpenTelemetry repository with some customizations."
  • Collector: "The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported."

Commercial vendors are being asked to support the "OTel Collector" by their customers, as evidenced by the number of commercial vendors listed as having a distribution of the Collector:

  • AWS Distro for OpenTelemetry (ADOT)
  • Grafana Alloy
  • Liatrio Distribution of the OpenTelemetry Collector
  • observIQ BindPlane Agent
  • RedHat RHOSDT OpenTelemetry Collector Distribution
  • Splunk Distribution of OpenTelemetry Collector
  • Sumo Logic Distribution for OpenTelemetry Collector

Each vendor has a different approach to meeting this demand. Some assist customers using a curated list of upstream components, others offer support (with SLAs) for their official binaries with vetted upstream components, and others provide extra features at different levels. These approaches are categorized on the distribution definition page as "Pure," "Plus," and "Minus."

However, not all of these approaches resonate equally within the GC and with Collector maintainers: we accept some approaches as distributions but not others. We can't pinpoint why they are different, making it harder for vendors to comply with the (non-existent) requirements to be called a distribution. The GC has politely asked one of these vendors to stop calling itself a Collector, without providing a clear path forward for the project to regain the right to be called a distribution. Lack of knowledge about these projects adds to the confusion. For instance, I have seen inaccurate claims about ADOT and Alloy.

@atoulme, @bogdandrutu, and @yurishkuro have questioned the actual problem we are aiming to solve. While their question might seem odd, there wasn’t a clear articulation of the problem: we feel that something is off but can't pinpoint why we don't want certain projects to be called a distribution of the Collector. One argument by @djaglowski was well-received: we want users to have a consistent experience and be able to reuse their knowledge when switching between "flavors" of the Collector, whether custom-built, vendor-built, or community-built.

I have also heard a few other arguments, which I'll address here:

  • @codeboten expressed concern that users might come to our GitHub repositories and Slack channels with questions about downstream (custom or vendor) distributions, causing distraction to the already overloaded maintainers. A counter-argument (sorry, I forgot by whom) was made that we want to encourage users (and vendors) to stay close to us. My personal opinion is that we are handling this well for now: when we see an issue related to a specific distribution, we typically tag specific people from the vendor on the GitHub issue or Slack thread. It's in their best interest to handle those questions.
  • @trask mentioned a hypothetical situation where a vendor requires using a custom exporter to send data to them, while their cloud provider requires a custom processor to enrich telemetry with cloud metadata (cluster, region, etc.). I haven't seen this situation before, and I think most relevant vendors (minus one or two) can ingest OTLP natively. As long as distributions provide an OTLP exporter and vendors can ingest OTLP, users won't face this problem. I recognize a general concern about lock-in by offering components that can't be used elsewhere.

To me, it's clear that we need an objective set of rules in addition to our existing subjective definitions, so the ecosystem can thrive with options for our users while retaining their ability to reuse their knowledge and switch between flavors without getting locked-in. If we can agree on this need, here’s what I propose as an initial draft, with the promise to develop it further elsewhere:

  • A build of the Collector is any binary produced by the OpenTelemetry Collector Builder (ocb), or one that can be reproduced with the builder. This is typically the result of end-users picking which components they want to use in production. Builds of the Collector can include proprietary components, but those components should be reusable in other builds (like end-user custom builds).
  • A distribution is a build of the Collector done by the Collector SIG with a set of components available from our repositories (this one and contrib).
  • The OpenTelemetry Collector (following @codeboten's definition from the website) is one specific distribution, produced by the Collector SIG.
  • An OpenTelemetry Collector compatible solution is a binary that "acts like a collector and walks like a collector." For that, we’d define a set of tests that such a binary needs to pass (certification program of sorts), which may include:
    • Ability to use the same configuration format
    • Ability to replace the OpenTelemetry Collector at runtime (e.g., by changing the image property of the Collector's CR on the Operator)
    • Ability to be managed by OpAMP, once ready
    • Ability to be observed by a specific set of metrics

@tedsuo

tedsuo commented Jun 18, 2024

Thanks @jpkrohling that's a great layout. My only suggestion is that I think Collector Build and Collector Distro can be combined. Anything that can be reproduced by the builder can be called a Distro, regardless of who issued it.

@jpkrohling
Member Author

In my previous message, I should have stressed more that we didn't have a consensus on whether we had a problem to solve. Before addressing why I think we need a build and a distribution, I'd like to take a step back and have a consensus.

Community, Collector leads, TC, GC: please vote on this issue. The options are:

❤️ No problem to solve at the moment. Let the ecosystem use our subjective definitions (status quo)
👍🏽 We have a problem with the subjective definitions and need a concrete set of rules

Note that you are NOT voting on my draft proposal.

@yurishkuro
Member

Let me try one last time. You cannot solve a "problem" of "what is collector" without deciding why, i.e. what success criteria you want to meet by "solving" it. The poll above provides exactly zero answers to that question.

@cartermp
Contributor

Not sure how helpful this is, but this is my take from working with several hundred customers adopting OTel:

  • There is a lot of variation on what collectors they use. Contrib versions galore, base collector, ObservIQ distro, Honeycomb's collector config for metrics compression that creates a distro, ADOT distro, AWS distro (these are different?), K8s distro, customer-specific distros, something called a "Local OpenTelemetry Collector binary". Contrib is the biggest category here.
  • The HNY-specific thing can be a pain for people since it requires rebuilding a collector binary and deploying when you need to make updates, which customers forget about
  • A ton of customers configure and "forget" their distro. It works, they never touch it, so it's often months or years out of date
  • Occasionally someone gets confused because they need to use the transformprocessor or something useful like that, but their distro (usually the base distro) doesn't support it
  • ADOT distro sometimes brings some pains but it's very much seen as a problem with Lambda
  • Some customers have asked for support, often expecting a distro (we offer support without use of a distro)
  • Alloy in particular has bubbled up as "interesting" to some folks and they saw it as a Grafana-specific thing

So I guess my experience is that there isn't a terrible problem here to resolve, but there is quite a bit of variation in what people use, and that sometimes leads to confusion or a bad experience depending on what they're using.

I see here echoes of what it means to adopt OTel. If you propose an alternative API, but still emit semantic conventions and OTLP data under the hood, is that OTel? I'd say yes. Is your binary, Acme Corp. Collector, capable of accepting and emitting OTLP, and does it also use the batchprocessor with some different defaults under the hood? I'd call that a collector as well.

@jpkrohling
Member Author

@yurishkuro, please bear with us. Your input has been valuable and I think we are now in a better position because of your questions. I'll try again, starting with what I see as the problems we are trying to solve:

  • Bad user experience (or confusion): Users struggle to understand the differences between various distributions and how they relate to the upstream OpenTelemetry project, as evidenced by the comments from @tedsuo and @cartermp, among others.
  • Vendor uncertainty: vendors are unsure about the requirements and guidelines for their distributions to be recognized officially, leading to potential misalignments and disputes within the community, as evidenced by the current request from the GC for a distribution to not be called as such anymore, without telling them exactly what's wrong.

If we agree that we want to work on those problems, here are the goals for me:

  • Provide clarity: establish clear, objective criteria for what constitutes an OpenTelemetry Collector distribution so that both vendors and the community know what is and what isn't a distribution.
  • Consistent user experience: by establishing objective criteria, we ensure that users can have a consistent experience across different distributions if they stick to the aspects we establish, enabling users to switch between distributions without relearning or facing incompatibilities, while at the same time being able to use distribution-specific components or features.

@austinlparker
Member

I think the simplest way to conceptualize the 'problem' is that the only thing that the project defines as hard requirements for 'what is an OpenTelemetry ' is what's in the specification. This falls apart when you start talking about things like the collector - there's not really a specification for the collector. This can lead to not only user confusion (see above), but also confusion for vendors and integrators building in the ecosystem.

Ultimately, we need to be able to provide some guarantees to both of these groups -- to users, we need to be able to have clear guidance for questions like:

  • If I write a custom receiver, will that work with other collectors?
  • Are configurations portable between different collectors?
  • Do all collectors support a single management protocol?

To builders, we need guidance around:

  • How to name things to avoid user confusion
  • Implementation guidance on creating consistent experiences across variations
    e.g., if I were to rewrite the collector in Rust, what parts should I preserve? How different can I be from upstream before a collector isn't a collector?

@yurishkuro
Member

@jpkrohling

Provide clarity: establish clear, objective criteria for what constitutes an OpenTelemetry Collector distribution so that both vendors and the community know what is and what isn't a distribution.

Don't you see that this is a pure tautology? "We want to know because we want to know". Any definition will match that. E.g. the following definition is clear and objective, and completely beside the point as it does not address the unspoken problems:

  • collector is a desk
  • collector distribution is a desk that is shipped to your home disassembled

Consistent user experience: by establishing objective criteria, we ensure that users can have a consistent experience across different distributions if they stick to the aspects we establish, enabling users to switch between distributions without relearning or facing incompatibilities, while at the same time being able to use distribution-specific components or features.

This is getting closer to the issue, but it's very hand-wavy. @austinlparker 's comment #8555 (comment) is more concrete. Basically, we can approach this as a product requirement spec. Try to phrase everything as a use case:

"as a {user role} I want to {perform an action} so that I can {achieve an outcome}".

For example, with one of Austin's bullet points:

  • Are configurations portable between different collectors?
  • Rewrite: as an end user I want to take my collector config that I use with distro X and use it with distro Y so that I have the same behavior.

Phrased like this, an immediate question from me - is that what we actually want? How is that even possible? It means that the two distros are 100% functionally equivalent (at least on the features I already used with distro X), which defeats the purpose of distros in the first place. Ability to swap implementations is a nice theoretical goal, but there are other goals users may have, like I don't want to run binaries 100s of MBs in size bundling every possible feature.

So rather than keep debating completely arbitrary definitions of collector, let's first

  1. list what use cases we want to satisfy (aka "problems to solve"),
  2. whether we indeed agree that we want to satisfy them,
  3. and whether it's even possible to satisfy many of them at once (as a compromise).

Doing so will implicitly inform the definition of the collector, based on actual problems / goals / user needs, not based on a tautological definition of a problem.

@austinlparker
Member

austinlparker commented Jun 20, 2024

@yurishkuro There is an immediate need for the Collector, as a SIG, to define the requirements that another piece of software calling itself an 'OpenTelemetry Collector' must meet. This is, as you said, a product requirement. I stated my rationale above, but I would like to expand on it with the bigger issue here.

As OpenTelemetry continues to mature and graduates, we (the GC and project leadership more generally) will need to create requirements around certification and compatibility. This is both easy, and hard. For instance, it is relatively easy to set a requirement around something like OTLP. If you write OTLP, then you must write valid OTLP to any compliant OTLP receiver. It is also somewhat easy to say 'Supports OpenTelemetry API' by ensuring that you can get the active span from context and modify it, etc.

The collector, however, is much more difficult to quantify by these standards. I agree, in principle, that it might not be desirable for non-specced config files to be portable. I would generally agree that a receiver written for upstream may not necessarily work with other implementations. With that said, what is the distinction that we are going to use? You can hopefully understand my reluctance to say "Ok, well, you can just call anything that receives OTLP a Collector" because that could be very confusing for users, especially as management tools proliferate. Similarly, it does not benefit users to remove one source of lock-in (the API/SDK) then replace it with another (the pipeline/collector layer).

I would honestly be fine saying 'there is only one thing called an OpenTelemetry Collector, and it is anything that is built with upstream ocb'. Everyone else in the ecosystem can be 'OTLP compatible' or whatever other words we come up with.

edit: By 'non-specced' config files above, I mean configuration files that do not align with a published specification (e.g., the upcoming file-based config options).

@austinlparker
Member

Just to be crystal clear -- I think an entirely acceptable outcome of this is stating the following:

  • An OpenTelemetry Collector is a specific piece of software that is built, maintained, and published by the OpenTelemetry project.
  • An OpenTelemetry Collector is also any distribution of the Collector that is built using the ocb tool.
  • No other software may call itself an 'OpenTelemetry Collector'

@codeboten
Contributor

After discussing this at the specification meeting on 26-Nov-2024, the spec SIG agreed that having a definition in the spec will be valuable for the project and end users. Thank you for all the discussion in this thread, I will open a spec PR with my existing definition and would love to see the discussion continue there. Will close this once that PR is opened

@yurishkuro
Member

yurishkuro commented Nov 26, 2024

@codeboten There is a terminology problem with your proposal #8555 (comment)

OTEL project does not have exclusive right to the use of the words Collector and Distribution. The only "enforceable" part of your definition is the definition of OpenTelemetry Collector, which can be clarified as "for the purpose of that definition the terms collector and distribution are defined as follows...". But that means the two terms do not stand on their own, they are just an implementation detail of the main definition. If a vendor maintains some sort of collector, it is by definition not an OpenTelemetry Collector since it's not maintained by OTel maintainers, and thus the vendors still have no guidelines what to call their binary. They are perfectly within their rights to call it Collector or even Distribution no matter how it's implemented because that's just general terms.

@thampiotr

thampiotr commented Nov 26, 2024

@codeboten There is a terminology problem with your proposal #8555 (comment)
...

@austinlparker this ^ is closely related to my concern which I wrote about in the second point of this comment.
Thanks for taking the time to share some more context, but I think that the specific concerns that @yurishkuro and I have raised have not been fully addressed.

@codeboten
Contributor

They are perfectly within their rights to call it Collector or even Distribution no matter how it's implemented because that's just general terms.

I don't think the goal with this definition should be to enforce the terminology; in fact, I would prefer it wasn't. I don't want to get into the business of chasing down "collectors" in the wild. I just want users to know what they're getting when they're getting the OpenTelemetry Collector and what they should get with any other collector. This is ultimately my goal here.

Ideally, that definition leads to a set of tests or common practices that can be documented on the opentelemetry.io website to give end users the tools they need to do things like bringing their own components or using existing components from the ecosystem with anything that calls itself a collector. And if the thing that calls itself a collector doesn't align with the definition, then users can go and ask whoever publishes it for changes to better align with the definition.

I've added a comment in the spec issue here: open-telemetry/opentelemetry-specification#4309 (comment)

@austinlparker
Member

@codeboten There is a terminology problem with your proposal #8555 (comment)

OTEL project does not have exclusive right to the use of the words Collector and Distribution. The only "enforceable" part of your definition is the definition of OpenTelemetry Collector, which can be clarified as "for the purpose of that definition the terms collector and distribution are defined as follows...". But that means the two terms do not stand on their own, they are just an implementation detail of the main definition. If a vendor maintains some sort of collector, it is by definition not an OpenTelemetry Collector since it's not maintained by OTel maintainers, and thus the vendors still have no guidelines what to call their binary. They are perfectly within their rights to call it Collector or even Distribution no matter how it's implemented because that's just general terms.

We do not have exclusive rights over the English language, that is correct. However, we do have exclusive rights over what we, the project, promulgate. I cannot control every vendor in the world (nor do I wish to), but it is extremely valuable for OpenTelemetry, as a project, to clearly define how we use certain words such as 'collector' and 'distribution' as an inclusion filter for things like the OpenTelemetry website/registry (and other marketing/community activities).

@yurishkuro
Member

I don't want to go to another, mostly empty issue, why can't the discussion continue here? Why split it?

I intentionally put "enforceable" in quotes, that was not my point. My point is that the only definition you're providing is of OpenTelemetry Collector. And there is a choice - do we want that to refer only to artifacts produced by the maintainers, or also to something that vendors can produce? So far the proposed definition is completely internal - let the users know what they can expect from the official OpenTelemetry Collector. Which is not at all what all the problems mentioned in this issue are about - user confusion, vendor lock-in, none of them are addressed. Which always brings me to the very first question I posted on this thread - what is it that you're trying to solve? Just defining what the official collector is and the principles of how it's being built?

@ithompson-gp

ithompson-gp commented Nov 26, 2024

This is ultimately my goal here.

As an end User, your comment seems to describe a User guide, not a specification. And, as an end User, I would prefer specifications to be well defined, use specific terminology, and contain well-formed ideas (if not, then it requires more refinement or should not be a specification at all).

Again, as an end User, I would rather place the expectation of what a User of a Collector might expect - from a purely OTel PoV - within guidelines and User documentation (as this might be helpful to orient understanding). There is absolutely no way to enforce - nor should it be enforced - those picking up the Collector code base and building their own 'Collector' as to what they should be providing (to a User). That seems a reach.

Ultimately, I think the reality of a Collector being a thing that can collect things, process things, and produce things in many permutations of ways may not lend itself to be a specification. Perhaps an oversimplification.

100% please do enhance the User documentation, I can't imagine what routes a specification would bring us down

@fstab
Member

fstab commented Nov 26, 2024

A Distribution is a package that is produced by utilizing open source tooling maintained by the OpenTelemetry project

My 2 ct: I think this is too restrictive.

I understand the desire that users should be able to bring their own components. If users have custom components that are compliant with OTel interfaces they should be able to use these components with every collector distribution. That's ok.

However, the sentence above is more restrictive: It's not enough for a distribution to support all OTel-compliant components, and to provide a way to include them in the build. The sentence above says a distribution must not include any alternative components that are incompatible with the current OTel interfaces. I would appreciate it if distributions were allowed to experiment with alternative approaches, like components for native Prometheus pipelines. If these approaches turn out to be useful, they will eventually be contributed back upstream, which will benefit the community.

@austinlparker
Member

austinlparker commented Nov 26, 2024

defining what the official collector is

Yes, if we define what a collector is, then we have definitionally defined what it is not. Given the proposed definition, here's a quick scorecard of things that are in the community that are, or are not, collectors by this definition.

| Name | Collector | Distribution |
| --- | --- | --- |
| OpenTelemetry Collector | Yes | Yes (Distro for k8s, Contrib, etc.) |
| Grafana Alloy | No | No |
| FluentBit | No | No |
| Datadog Vector | No | No |
| ADOT | No | Yes |
| Dynatrace Collector | No | Yes |
| observIQ bindplane agent | No | No (does not support ocb afaik?) |

@austinlparker
Member

austinlparker commented Nov 26, 2024

Now, do all of those things support OpenTelemetry? Yep. They also support OTLP! That's great! We should have more things that support OpenTelemetry and OTLP. However, I hope you can appreciate that without some normative guidance on what exactly you must do to wind up in one of those buckets, we will invariably see more end-user confusion around how they can extend the collector interfaces, how they can migrate between distributions, etc. This is not hypothetical -- this already happens!

edit: I would also point out that this isn't just an OpenTelemetry problem; you can look at the k8s community to see what happens when the project doesn't provide clear, normative guidance around the names of things.

@yurishkuro
Member

If your 2nd column refers to OpenTelemetry Collector, then it's obvious that everything will be No because those are not OTEL-project artifacts. If you really meant unqualified "collector" by the internal definition, then how can something be a distro without being a collector?

My suggestion:

  • drop the definition of collector altogether
  • define a term OpenTelemetry Collector Distribution that meets the stated goals / restrictions
  • define OpenTelemetry Collector as "official distribution built by OTEL project"

@austinlparker
Member

if your 2nd column refers to OpenTelemetry Collector then it's obvious that everything will be No because those are not OTEL-project artifacts. If you really meant unqualified "collector" by the internal definition, then how can something be a distro without being a collector?

Given the conversation so far, I felt like it was valuable to be explicit rather than implicit about the proposed definition.

My suggestion:

  • drop the definition of collector altogether
  • define a term OpenTelemetry Collector Distribution that meets the stated goals / restrictions
  • define OpenTelemetry Collector as "official distribution built by OTEL project"

Your third point here seems to be, effectively, what the proposed definition is. We're reserving 'OpenTelemetry Collector' to mean the specific artifact that we produce, while providing guidance for how to create a distribution that complies with our goals (no lock-in, etc.) Could you elaborate on what un-defining 'collector' does, other than allow for existing artifacts that define themselves as 'OpenTelemetry Collectors' to continue to do so?

@yurishkuro
Member

Could you elaborate on what un-defining 'collector' does

This goes back to my point about terminology conflict. Say we define what we mean by the word "collector" for the purpose of defining what OpenTelemetry Collector means. How does it help a vendor? What sentence could they construct that could refer to that internal helper term? What they could use are the top-level terms, in this case "OpenTelemetry Collector Distribution".

@austinlparker
Member

Could you elaborate on what un-defining 'collector' does

This goes back to my point about terminology conflict. Say we define what we mean by the word "collector" for the purpose of defining what OpenTelemetry Collector means. How does it help a vendor? What sentence could they construct that could refer to that internal helper term? What they could use are the top-level terms, in this case "OpenTelemetry Collector Distribution".

Vendors aren't the only audience for this, they're a part of it.

A normative definition of the term and product "OpenTelemetry Collector" provides guidance to end-users and third-parties by clearly saying "the artifact and product OpenTelemetry Collector refers to a specific piece of software built and distributed by the OpenTelemetry project". This means that when someone, for example, takes a training course that covers the OpenTelemetry Collector, the subject of that course is unambiguous. When someone does a conference talk about the OpenTelemetry Collector, it continues to be unambiguous. If I have a deployed OpenTelemetry Collector that I need support for, it is clear where I can get that support, etc.

Without defining what an OpenTelemetry Collector is, we cannot effectively define what a Collector Distribution is. A Distribution definition helps me understand other things as an end-user -- for example, the extensibility of that product, the applicability of given configurations and rules for things like transform processors or sampling processors, etc. If we define distributions without also defining collectors, it is a mostly meaningless distinction -- if Alloy, FluentBit, and Vector are all 'OpenTelemetry Collectors', then what possible definition of a 'distribution' can you come up with that satisfies the other requirements of the definition (e.g., component extensibility, config interop, management interop, etc.)?

@yurishkuro
Member

I said do define OpenTelemetry Collector, but don't define collector (leave it to Webster).

@austinlparker
Member

So if the proposed definition prefixed instances of 'Collector' with 'OpenTelemetry', e.g.

A Collector is a mechanism that:

becomes

An OpenTelemetry Collector is a mechanism that:

You'd be fine with it? Because I believe that is the intent, and I agree that making it explicit is fine.

@dehaansa

defining what the official collector is

Yes, if we define what a collector is, then we have definitionally defined what it is not. Given the proposed definition, here's a quick scorecard of things that are in the community that are, or are not, collectors by this definition.

| Name | Collector | Distribution |
| --- | --- | --- |
| OpenTelemetry Collector | Yes | Yes (Distro for k8s, Contrib, etc.) |
| Grafana Alloy | No | No |
| FluentBit | No | No |
| Datadog Vector | No | No |
| ADOT | No | Yes |
| Dynatrace Collector | No | Yes |
| observIQ bindplane agent | No | No (does not support ocb afaik?) |

Does a definition that excludes the majority of the corporate sponsors of community investment seem to benefit the community? Suppose we define OTel Collector & OTel Distribution in this exclusionary way. In that case, it at least feels important to have a third term that encompasses good faith contributors to the community who may not have feasible (business or technological) routes to rebuilding their collectors to comply with the limits enforced specifically by the OCB tooling requirements.

@codeboten
Contributor

Does a definition that excludes the majority of the corporate sponsors of community investment seem to benefit the community? Suppose we define OTel Collector & OTel Distribution in this exclusionary way...

Right, this isn't about excluding existing collectors. How could any of them fit a definition that hasn't existed previously? 😄

This would give all distributions and their publishers an opportunity to align with the project to ensure the goal of avoiding vendor lock-in is achieved. Whatever definition makes sense to achieve this, I'm in favour of.

@yurishkuro
Member

@dehaansa I think it's fine to be exclusionary in that regard. The OpenTelemetry Collector is a specific piece of software with specific design goals, ecosystem, and compatibility guarantees. If you write software that looks like it's doing the same thing but has a completely different implementation and ecosystem (e.g. FluentBit), then it's not an OpenTelemetry Collector, it's just something that supports OTLP. I think the principles @codeboten outlined in #8555 (comment) are quite reasonable because they aim to address user issues like extensibility, compatibility, configuration portability, ecosystem familiarity & continuity, etc.

@tedsuo

tedsuo commented Nov 26, 2024

Apologies if this is long, I'm attempting to find a balance between keeping it short while also explaining some of the failure modes this project can get into if it is not careful. @codeboten I've tried to respond to your main points, and at the bottom I suggest a more streamlined definition.

A Collector MUST allow end users to receive, process, emit telemetry in various formats

This is a vague requirement, I'm not sure that it helps. You can build a Collector that does not emit telemetry. On the other hand, FluentBit can receive, process, and emit telemetry. So it feels like this definition does not really say anything specific about the Collector.

A Collector MUST allow users to bring their own components, to ensure no vendor lock-in can occur.

This is the actual heart of the issue. We can't create a behavior-based definition of a Collector because the Collector is completely pluggable. However! The pluggable nature of the Collector is an incredibly valuable feature. I believe that this pluggability is what we want to preserve: mixing and matching plugins, including arbitrary plugins that end users create themselves. This is the feature that keeps the Otel Collector community from fracturing.

A Distribution is a package that is produced by utilizing open source tooling maintained by the OpenTelemetry project and contains any combination of components.

I assume this means something along the lines of "All Collectors must be compiled using the Collector Builder, and anything the Collector Builder can produce counts as a Collector."

At first glance, it would seem like tying the definition of a Collector to the Builder would make for a clean, strict definition of a Collector. This was certainly my first thought. But I have been convinced that it's actually the opposite, and it's worth explaining why.

The builder isn't a spec; it's just software. You can make it do anything. If we base the definition of the Collector on what the Builder can do, that will create an incentive for organizations to make PRs attempting to extend the builder to build whatever it is they want to call a Collector. Saying yes or no to those requests would be very difficult, because we would not have a spec to point at as justification. Instead we would be back at square one, trying to come up with a set of principles to base our decisions on.

Bendable definitions lead to pressure campaigns

The "builder-based" approach would also create all kinds of political problems for our project. We will be accused of favoritism every time we say "no" to a request to extend the Builder. Then the Builder maintainers will be called out for weaponizing Otel to favor their corporate interests over their competitors. After that, the GC (and possibly the Builder maintainers) will become a target for threats and strong-arm tactics until we relent and accept their change.

This isn't a theoretical problem! There have already been attempts to strongarm the project in various ways. We have very intentionally architected the OTel project to avoid these kinds of failure modes. For example, the GC is elected because that avoids the situation where companies come to the project and say "you will give us three GC seats or else."

I normally don't talk publicly about that side of the biz, but I really can't emphasize enough that having the rules defined in a way where we can't be pressured is extremely important. It's a big part of why the OTel community is so free of politics, compared to other industry-size OSS projects I have worked on.

TL;DR; using the Builder as proof of compliance doesn't help to define what a Collector is. It just transposes the question from "What is a Collector?" to "What is the Builder allowed to do?" So, while it seems logical at first blush, dragging the Builder into this certification would create a lot of social stress without giving us very much in return.

No take backs

Speaking of social stress and political threats to the project, I do want to clarify that any definition of a Collector that would exclude the things we have already declared to be Collector distros – ADOT and the Splunk distro, for example – would need to first get approval from those member organizations. It's against our principles to renege on something that big; it would immediately create a crisis.

It's also not the outcome we want. We want things like ADOT, Splunk, and Alloy to count as collectors; we're trying to find a definition that allows for projects like these while avoiding projects that try to do something nasty with our brand, or unintentionally force our users to choose between the special features in that particular distro and the features they get from the Collector plugin ecosystem. It's more important that the certification process results in the community we want than it is for the process to be some kind of clean, automatically testable conformance test.

Proposal

Based on this I recommend that we stick to a slightly more limited definition of a Collector.

  • An OpenTelemetry Collector MUST accept a Collector Config file.
  • An OpenTelemetry Collector MUST be able to be compiled with any and all additional Collector plugins that the user wishes to include.
  • A compiled instance of an OpenTelemetry Collector – with a specific set of plugins and features – is referred to as a Collector Distro.

This definition focuses on the result, not the process. It is also a concrete and actionable definition: any project that wishes to have an official OpenTelemetry Collector Certification can apply for an audit. The audit is performed as code review, not an automated testing process. Having to come to us for an audit also creates an opportunity for conversation and makes sure that we don't get sideswiped by a new project coming out of the blue.
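
For illustration only (not part of the original comment), here is a minimal Go sketch of the two MUSTs in the proposal above. Everything in it is an assumption made for the example: `Component`, `Factory`, `Distro`, `Include`, and `Run` are hypothetical names, not the real Collector component API, and the `pipeline` slice merely stands in for a Collector Config file.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Component is a hypothetical, simplified stand-in for a Collector plugin.
// It is NOT the real Collector component interface.
type Component interface {
	Process(record string) string
}

// Factory builds a Component; users can supply their own.
type Factory func() Component

// funcComponent adapts a plain function to the Component interface.
type funcComponent func(string) string

func (f funcComponent) Process(record string) string { return f(record) }

// Distro models a "Collector Distro": a specific set of bundled plugins
// plus whatever the user chooses to include.
type Distro struct {
	factories map[string]Factory
}

// NewDistro copies the bundled factories into a new Distro.
func NewDistro(bundled map[string]Factory) *Distro {
	d := &Distro{factories: make(map[string]Factory, len(bundled))}
	for name, f := range bundled {
		d.factories[name] = f
	}
	return d
}

// Include adds a user-supplied plugin. It only rejects name collisions,
// never a plugin merely because the distro did not ship it.
func (d *Distro) Include(name string, f Factory) error {
	if _, exists := d.factories[name]; exists {
		return errors.New("component already registered: " + name)
	}
	d.factories[name] = f
	return nil
}

// Run passes a record through the named components in order. The pipeline
// slice stands in for what a config file would declare.
func (d *Distro) Run(pipeline []string, record string) (string, error) {
	for _, name := range pipeline {
		f, ok := d.factories[name]
		if !ok {
			return "", fmt.Errorf("unknown component %q", name)
		}
		record = f().Process(record)
	}
	return record, nil
}

// bundledFactories is the plugin set this hypothetical distro ships with.
func bundledFactories() map[string]Factory {
	return map[string]Factory{
		"upper": func() Component { return funcComponent(strings.ToUpper) },
	}
}

// tagFactory is a plugin written by the end user, outside the distro.
func tagFactory() Component {
	return funcComponent(func(r string) string { return r + " [tagged]" })
}

func main() {
	d := NewDistro(bundledFactories())
	if err := d.Include("tag", tagFactory); err != nil {
		fmt.Println("include failed:", err)
		return
	}
	out, err := d.Run([]string{"upper", "tag"}, "span")
	fmt.Println(out, err)
}
```

The point of the sketch is that the check is on the result (a user plugin runs next to bundled ones under one config), not on which tool produced the binary.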

@codeboten
Contributor

An OpenTelemetry Collector MUST accept a Collector Config file.

Can you expand on what you consider a Collector Config file? Would the configuration available in the alloy example https://github.com/grafana/alloy/?tab=readme-ov-file#example mean that it is excluded from this definition?

@austinlparker
Member

I'm going to respond to these points individually.

A Collector MUST allow users to bring their own components, to ensure no vendor lock-in can occur.

This is the actual heart of the issue. We can't create a behavior-based definition of a Collector because the Collector is completely pluggable. However! The pluggable nature of the Collector is an incredibly valuable feature. I believe that this pluggability is what we want to preserve: mixing and matching plugins, including arbitrary plugins that end users create themselves. This is the feature that keeps the Otel Collector community from fracturing.

I think everyone is more or less in agreement that this is crucial -- perhaps a better way of framing this debate is "how do we protect against lock-in and enable credible exit for end-users without overly specifying the Collector API/ABI".

A Distribution is a package that is produced by utilizing open source tooling maintained by the OpenTelemetry project and contains any combination of components.

I assume this means something along the lines of "All Collectors must be compiled using the Collector Builder, and anything the Collector Builder can produce counts as a Collector."

At first glance, it would seem like tying the definition of a Collector to the Builder would make for a clean, strict definition of a Collector. This was certainly my first thought. But I have been convinced that it's actually the opposite, and it's worth explaining why.

The builder isn't a spec; it's just software. You can make it do anything. If we base the definition of the Collector on what the Builder can do, that will create an incentive for organizations to make PRs attempting to extend the builder to build whatever it is they want to call a Collector. Saying yes or no to those requests would be very difficult, because we would not have a spec to point at as justification. We would just be back at square one, trying to come up with a set of principles to base our decisions on.

I mean, you could replace "ocb" with "conformance suite" and I don't know what would really be different here. If we created some clean-sheet bash script or AST parser that looked at your interfaces and said "yep, this compiles" or "nope, it doesn't" then I'm not quite sure how we'd avoid similar situations (people would just argue that our definition of 'compliance' was too strict, and they'd point to the lack of a specification as a rationale).

Bendable definitions lead to pressure campaigns

The "builder-based" approach would also create all kinds of political problems for our project. We will be accused of favoritism every time we say "no" to a request to extend the Builder. Then the Builder maintainers will be called out for weaponizing Otel to favor their corporate interests over their competitors. After that, the GC (and possibly the Builder maintainers) will become a target for threats and strong-arm tactics until we relent and accept their change.

This isn't a theoretical problem! There have already been attempts to strongarm the project in various ways. We have very intentionally architected the OTel project to avoid these kinds of failure modes. For example, the GC is elected because that avoids the situation where companies come to the project and say "you will give us three GC seats or else."

I normally don't talk publicly about that side of the biz, but I really can't emphasize enough that having the rules defined in a way where we can't be pressured is extremely important. It's a big part of why the OTel community is so free of politics, compared to other industry-size OSS projects I have worked on.

TL;DR; using the Builder as proof of compliance doesn't help to define what a Collector is. It just transposes the question from "What is a Collector?" to "What is the Builder allowed to do?" So, while it seems logical at first blush, dragging the Builder into this certification would create a lot of social stress without giving us very much in return.

We already have all kinds of political problems, some of them in this very thread. We have to deal with vendors spreading FUD about the Collector, we have to deal with vendors trying to 'land grab' OpenTelemetry names/concepts and tie them to their marketing/product positioning, and we have 'one-way' collector implementations right now where you can get in, but not out, easily. To my earlier point, the builder isn't what's important here, the definition is what's important, and the builder is a mechanism by which we can programmatically enforce/prove that definition.

No take backs

Speaking of social stress and political threats to the project, I do want to clarify that any definition of a Collector that would exclude the things we have already declared to be Collector distros – ADOT and the Splunk distro, for example – would need to first get approval from those member organizations. It's against our principles to renege on something that big; it would immediately create a crisis.

It's also not the outcome we want. We want things like ADOT, Splunk, and Alloy to count as collectors; we're trying to find a definition that allows for projects like these while avoiding projects that try to do something nasty with our brand, or unintentionally force our users to choose between the special features in that particular distro and the features they get from the Collector plugin ecosystem. It's more important that the certification process results in the community we want than it is for the process to be some kind of clean, automatically testable conformance test.

I disagree with this categorically. We never made an express or implicit 'promise' to anyone about the Collector other than what's been written out. The fact that the Collector is deliberately unspecified should be as big a clue to that as any. Perhaps it should be specified, but that's a bit outside the scope of this issue (and I personally don't think it should be -- the thing that matters from an interop perspective is OTLP, which is specified). We cannot possibly craft a definition that is both rigorous and complete while also creating special carve-outs for first movers; doing so would also be unfair to anyone else who might come along and develop for the ecosystem later.

Proposal

Based on this I recommend that we stick to a slightly more limited definition of a Collector.

  • An OpenTelemetry Collector MUST accept a Collector Config file.
  • An OpenTelemetry Collector MUST be able to be compiled with any and all additional Collector plugins that the user wishes to include.
  • A compiled instance of an OpenTelemetry Collector – with a specific set of plugins and features – is referred to as a Collector Distro.

This definition focuses on the result, not the process. It is also a concrete and actionable definition: any project that wishes to have an official OpenTelemetry Collector Certification can apply for an audit. The audit is performed as code review, not an automated testing process. Having to come to us for an audit also creates an opportunity for conversation and makes sure that we don't get sideswiped by a new project coming out of the blue.

I do not functionally see how this is different than the proposal as-written. Points 1 and 2 are self-referential; a Collector that accepts the config file and accepts existing plugins must be written in Go, and thus can be built via ocb. The third point is ultimately just semantic sugar over the constraints of the first two points; obviously, any compiled collector can be a Distribution.

@austinlparker
Member

austinlparker commented Nov 26, 2024

After doing some more thinking, I do think that there's value in trying to avoid a hard requirement on our build tooling. As a compromise, what if we adopted the 'MUST accept a Collector Config file' language while also amending the 'MUST allow you to bring your own components' line to 'MUST allow users to include any component that satisfies the component interface'?

Edit: The benefit here is that we could then offer self-certification by providing a configuration file with expected inputs/outputs, which would be less work to maintain on our part and also avoid having to make our build tooling a load-bearing part of the conformance process for third parties.
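To make the self-certification idea concrete, here is a minimal Go sketch, assuming a test harness that does not exist today. `Case`, `Candidate`, `Certify`, and `passthrough` are hypothetical names, and plain strings stand in for real OTLP payloads. Each case pairs a Collector config with an input and the output we would expect; a distribution passes if it reproduces every expected output.

```go
package main

import (
	"fmt"
	"reflect"
)

// Case is a hypothetical conformance case: a Collector config plus the
// telemetry fed in and the telemetry expected out. Strings are a
// simplified stand-in for OTLP payloads.
type Case struct {
	Name   string
	Config string
	Input  []string
	Want   []string
}

// Candidate abstracts "run this distribution with this config and input".
// A real harness would start the binary; this is an assumption of the sketch.
type Candidate func(config string, input []string) ([]string, error)

// Certify runs every case and returns the names of the cases that failed.
func Certify(c Candidate, cases []Case) []string {
	var failed []string
	for _, tc := range cases {
		got, err := c(tc.Config, tc.Input)
		if err != nil || !reflect.DeepEqual(got, tc.Want) {
			failed = append(failed, tc.Name)
		}
	}
	return failed
}

// passthrough is a toy candidate that forwards its input unchanged.
func passthrough(_ string, input []string) ([]string, error) {
	return input, nil
}

func main() {
	cases := []Case{{
		Name:   "otlp-in-otlp-out",
		Config: "receivers: {otlp: {}}", // illustrative config text only
		Input:  []string{"span-a", "span-b"},
		Want:   []string{"span-a", "span-b"},
	}}
	fmt.Println("failed cases:", Certify(passthrough, cases))
}
```

The point of the shape is that third parties could run the cases themselves, so nothing here depends on our build tooling.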

Edit x2: Ted and I were just agreeing violently with each other I think

@yurishkuro
Member

I interpreted "build with ocb" more as a shorthand for the actual composability requirements, which still need to be spelled out. It's not the ocb that's important, but how the code included in a distro is organized. If the code is proprietary, obviously a user cannot build their own collector. If the code is public and implements Component API but is not structured to be composable by something like ocb, then it's still very difficult for the user to build their own collector. So ocb provides an easy smell test, but ultimately it stands for an underlying set of requirements that we can document.

Case in point: Jaeger implements several custom components which are all compatible with the Collector framework, but for a number of reasons we intentionally didn't make them into standalone modules the way otel-coll-contrib is organized, so building with ocb is currently not possible (not a priority for Jaeger right now).

Similar story with "accepts OTEL config format": intuitively we understand what that means, but the actual PR should spell it out in more detail.

@dashpole
Contributor

I do not functionally see how this is different than the proposal as-written.

I think an important distinction is that it would allow a distribution to add additional functionality which is not necessarily implemented using the component interfaces. For example, it could support an additional configuration provider or an additional configuration file format, so long as it could be re-built by users with additional components.

I wonder if requiring it to support building from an OCB configuration file (but not requiring OCB itself) is the right middle ground. It ensures the "add an additional component" flow is the same across distributions without preventing a distribution from innovating in other ways, or having functionality which is not encapsulated in a component. A distribution could also modify their build process, so long as they can support the config file format. It seems easily testable as well: Build with a custom OCB config file, then run the binary with a custom collector config.

@Aneurysm9
Member

Say I wanted to work on a collector with highly optimized native Prometheus metrics pipelines, i.e. Prometheus metrics are processed and forwarded directly without converting them to OTel protobuf internally. These Prometheus pipelines would not implement the interfaces used by upstream OTel collector components.

Is there a way to build an OpenTelemetry collector distribution that supports standard OTel components, but also includes these optimized Prometheus pipelines?

I don't know that there can be. At least not if it doesn't support having processors that operate on the OTLP data model between the receivers and exporters. Something that collects and produces Prometheus metrics and nothing else is a Prometheus derivative, not an OpenTelemetry Collector derivative. The two things may be able to live in the same binary, but they're fundamentally different systems.

@jpkrohling
Member Author

jpkrohling commented Nov 27, 2024

@codeboten wrote:

The goal with this point is to provide end users with a way to reproduce their builds with tools that are documented in opentelemetry

While it's desirable that downstream distributions use the tooling our users are familiar with, I wouldn't include that as a requirement to call the produced binary an OTel Collector Distribution. I'm not sure everyone in this thread is aware, but not everything in an OCB manifest is a component: we have support for config providers, which are not components. If we say that a distribution is only what an ocb-like tool can build, we'd be including config providers in the mix (and any other similar features).

@TylerHelmuth wrote:

Said another way "if I can't bring my component to your Collector it isn't a Collector".

I believe this is key to this matter.

@austinlparker wrote:

Under the definition as proposed, such a component would need to implement the Component interface, but its actual implementation wouldn't really matter

This statement has it backwards: we require OTel Collector Distributions to be able to include OTel Collector Components, but we should not require OTel Collector Distributions to include only OTel Collector Components.
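A minimal Go sketch of that asymmetry, under stated assumptions: `Component` below is a simplified stand-in for the real interface in go.opentelemetry.io/collector/component (whose Start also receives a Host), and `Distribution`, `Factory`, `Register`, and the names passed to them are hypothetical, used only for illustration. The distribution carries an extra that is not a component at all, yet still accepts any component a user brings.

```go
package main

import (
	"context"
	"fmt"
	"sort"
)

// Component is a simplified, hypothetical stand-in for the Collector's
// component interface. The real Start method also takes a Host.
type Component interface {
	Start(ctx context.Context) error
	Shutdown(ctx context.Context) error
}

// Factory builds a Component. Hypothetical; real factories carry more.
type Factory func() Component

// Distribution models a distro: extras that are NOT components, plus a
// registry that accepts any user-supplied Component factory.
type Distribution struct {
	factories map[string]Factory
	extras    []string
}

// NewDistribution creates a distro with the given non-component extras.
func NewDistribution(extras ...string) *Distribution {
	return &Distribution{factories: map[string]Factory{}, extras: extras}
}

// Register adds a component factory and rejects duplicate names.
func (d *Distribution) Register(name string, f Factory) error {
	if _, exists := d.factories[name]; exists {
		return fmt.Errorf("component %q already registered", name)
	}
	d.factories[name] = f
	return nil
}

// Components lists the registered component names in sorted order.
func (d *Distribution) Components() []string {
	names := make([]string, 0, len(d.factories))
	for name := range d.factories {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// noopComponent is a trivial Component used for the demonstration.
type noopComponent struct{}

func (noopComponent) Start(context.Context) error    { return nil }
func (noopComponent) Shutdown(context.Context) error { return nil }

func main() {
	// The extra is a feature outside the component model (illustrative name).
	d := NewDistribution("vendor-config-format")
	for _, name := range []string{"otlpreceiver", "userprocessor"} {
		if err := d.Register(name, func() Component { return noopComponent{} }); err != nil {
			fmt.Println("register failed:", err)
		}
	}
	fmt.Println("components:", d.Components(), "extras:", d.extras)
}
```

The requirement is on `Register` working for any conforming component, not on `extras` being empty.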

@yurishkuro wrote:

Case in point: Jaeger implements several custom components which are all compatible with the Collector framework, but for a number of reasons we intentionally didn't make them into standalone modules the way otel-coll-contrib is organized, so building with ocb is currently not possible (not a priority for Jaeger right now).

Off-topic, but I think it can: you'll probably have to specify the gomod and import/name/path attributes for each module.

@tedsuo wrote:

  • An OpenTelemetry Collector MUST accept a Collector Config file.
  • An OpenTelemetry Collector MUST be able to be compiled with any and all additional Collector plugins that the user wishes to include.
  • A compiled instance of an OpenTelemetry Collector – with a specific set of plugins and features – is referred to as a Collector Distro.

I think this is clear, short, and captures all the important aspects. I would just replace "plugins" with "components" and qualify the common words "collector", "distro", "components": OTel Collector, OTel Collector Distribution, OTel Collector Component. I would probably also require the config file to use at least one specific config provider.

  • An OpenTelemetry Collector MUST accept an OpenTelemetry Collector Config file.
  • An OpenTelemetry Collector MUST be able to be compiled with any and all additional OpenTelemetry Collector Components that the user wishes to include.
  • A compiled instance of an OpenTelemetry Collector – with a specific set of plugins and features – is referred to as an OpenTelemetry Collector Distro.

@codeboten wrote:

Can you expand on what you consider a Collector Config file?

I would be explicit that we expect our YAML schema to be accepted, while still allowing other formats. We'd have to get our schema sorted out though :-)

@austinlparker
Member

This statement has it backwards: we require OTel Collector Distributions to be able to include OTel Collector Components, but we should not require OTel Collector Distributions to include only OTel Collector Components.

Can you provide an example of a collector distribution that would not be compatible with any collector component, if we define 'collector component' to be a Go module that implements the component interface?

Here's why I think we have to have bidirectional component guarantees. If we say "collectors may include components that do not support the component interface", then we open the door to "Collector Distributions" that import a few parts of the collector as a shim (e.g., OTLP receiver and service definitions), then replace all the 'moving parts' with something else, such as a proprietary agent or sampler. By this definition, the Datadog Agent could be construed as an "OpenTelemetry Collector Distribution".

From a practical perspective, I don't think that it is terribly useful to end-users to allow for such a broad scope; if anything can be a collector distribution, then it's a meaningless definition. Ultimately, I am less concerned by the exact mechanism we use to certify that this bidirectional coupling exists, but I have a strong preference towards something that is self-certifying and does not require us to make judgement calls. In my mind, the following are more-or-less equivalent ways of expressing this preference:

  • A Distribution MUST support the Collector Config Format (defined as ...) AND a Distribution MUST allow users to include any component that satisfies the Component interface.
    This satisfies the requirements by ensuring configuration portability and bidirectional component interfaces without specifying build steps/tools.

  • A Distribution MUST be buildable using a standard manifest (defined as ...)
    This satisfies the requirements by ensuring bidirectional component interfaces without specifying build steps/tools, and implicitly satisfies configuration portability (because a standard manifest means you could add in the standard config parser)

This does not mean that we cannot have a wide ecosystem of tools that fill the role of collectors/pipeline components in OpenTelemetry -- indeed, I expect that more will be created. What I feel is important for those is that we have clear guidance about which features they support -- e.g., "Supports OpenTelemetry", "Supports OTLP", "Supports OpAMP" -- as guidance for end users on how those tools fit into their overall observability deployment. A restrictive definition of Collectors and Distributions does not preclude us from having broader categories; I'd argue that it actually helps in creating a definition of those categories and provides more avenues for ecosystem development and innovation. If you don't have to try and force your collector-shaped idea into a collector-distribution-shaped box, then you can focus on what matters (DX/UX, runtime environment, etc.).

@codeboten
Contributor

Opened a PR to the spec with the definition as proposed by @tedsuo and updated by @jpkrohling: open-telemetry/opentelemetry-specification#4313

I'm open to edits/suggestions/updates. Please make suggestions on the document in that PR, as I think it will be easier to parse there than in this GitHub issue.
