Component(s)
connector/spanmetrics, exporter/prometheusremotewrite
What happened?
Subject: Incorrect Behavior in the OpenTelemetry Collector Spanmetrics Connector
Issue Description:
We're facing a peculiar issue with the OpenTelemetry Collector's Spanmetrics connector and could use some help sorting it out.
Here's a quick rundown:
Problem:
- We've set up an architecture using the Grafana LGTM stack, with Loki, Tempo, and Mimir for logs, traces, and metrics, respectively.
- The goal is to sample traces efficiently but capture 100% of spanmetrics for a comprehensive APM dashboard.
- Our setup uses otel/opentelemetry-collector-contrib as a load balancer, generating trace metrics with the 'spanmetrics' connector and routing traces/metrics based on a resource attribute (attribute_source) to apply our internal tenant distribution across Grafana's services.
- Traces are correctly routed and stored in Grafana Tempo, but the spanmetrics exhibit strange behavior on Grafana Mimir.
Spanmetrics Configuration:
connectors:
spanmetrics:
histogram:
explicit:
buckets: [1ms, 2ms, ... , 10000s]
namespace: traces.spanmetrics
dimensions:
- name: http.status_code
- name: http.method
- name: rpc.grpc.status_code
- name: db.system
- name: external.service
- name: k8s.cluster.name
Issue Details:
- Executing code that generates a specific span 10 times accumulates the counter timeseries correctly.
- However, querying the metric using PromQL functions like increase or rate yields inaccurate results.
- For example, increase(traces_spanmetrics_calls_total{service_name="my-service"}[5m]) shows a continuously increasing line, reaching 600 executions and never returning to 0, even after a trace-free period.
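For reference, these are roughly the queries involved (a minimal sketch; the service_name value and the time windows are only illustrative):

# Raw counter: increases by exactly 10 after the 10 executions, as expected
traces_spanmetrics_calls_total{service_name="my-service"}

# Increase over a 5-minute window: we would expect this to drop back to 0
# once no new traces arrive, but instead it keeps climbing
increase(traces_spanmetrics_calls_total{service_name="my-service"}[5m])

# Per-second rate aggregated per service (a sketch of the kind of query behind the APM dashboard)
sum by (service_name) (rate(traces_spanmetrics_calls_total{service_name="my-service"}[5m]))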
Observations:
- The discrepancy is causing inflated values in application metrics, with rate showing over 100,000,000 spans/minute for an app that generates about 40,000 spans/minute.
- We sought help on the Grafana Mimir Slack channel (link) without success; since we haven't found issues with metrics generated by our own applications, this suggests the problem lies within the OpenTelemetry Collector.
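A couple of sanity checks we can run on the Mimir side (a sketch; it assumes the series only carry the labels produced by the spanmetrics connector plus resource_to_telemetry_conversion, with nothing identifying the individual collector replica):

# Number of distinct series behind one service's counter; with 4 collector replicas
# writing identical label sets, overlapping writers would not be distinguishable here
count(traces_spanmetrics_calls_total{service_name="my-service"})

# Counter resets observed over the last hour; frequent apparent resets inflate rate() and increase()
resets(traces_spanmetrics_calls_total{service_name="my-service"}[1h])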
Screenshots:
In this last example, the metric only stopped because we restarted the opentelemetry-collector instance that was serving these spanmetrics.
Another example of the metric being incorrect after the application no longer generates new spans:
If you need more details or logs, just let us know!
Collector version
0.83.0
Environment information
Environment
Kubernetes using official helm-chart:
image:
# If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
repository: otel/opentelemetry-collector-contrib
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
tag: "0.83.0"
# When digest is set to a non-empty value, images will be pulled by digest (regardless of tag value).
digest: ""
OpenTelemetry Collector configuration
There are two YAML Helm configurations in this section.
The loadbalancer:
# Default values for opentelemetry-collector.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
nameOverride: ""
fullnameOverride: ""
# Valid values are "daemonset", "deployment", and "statefulset".
mode: "deployment"
configMap:
# Specifies whether a configMap should be created (true by default)
create: true
# Base collector configuration.
# Supports templating. To escape existing instances of {{ }}, use {{` <original content> `}}.
# For example, {{ REDACTED_EMAIL }} becomes {{` {{ REDACTED_EMAIL }} `}}.
config:
receivers:
jaeger: null
zipkin: null
prometheus: null
otlp:
protocols:
grpc:
endpoint: ${env:MY_POD_IP}:4317
max_recv_msg_size_mib: 500
http:
endpoint: ${env:MY_POD_IP}:4318
processors:
batch:
send_batch_max_size: 8192
routing:
from_attribute: k8s.cluster.name
attribute_source: resource
table:
- value: a
exporters:
- prometheusremotewrite/mimir-a
- value: b
exporters:
- prometheusremotewrite/mimir-b
- value: c
exporters:
- prometheusremotewrite/mimir-c
- value: d
exporters:
- prometheusremotewrite/mimir-d
- value: e
exporters:
- prometheusremotewrite/mimir-e
- value: e
exporters:
- prometheusremotewrite/mimir-f
- value: f
exporters:
- prometheusremotewrite/mimir-g
- value: g
exporters:
- prometheusremotewrite/mimir-h
- value: h
exporters:
- prometheusremotewrite/mimir-i
- value: J
exporters:
- prometheusremotewrite/mimir-j
# If set to null, will be overridden with values based on k8s resource limits
memory_limiter: null
connectors:
spanmetrics:
histogram:
explicit:
buckets: [1ms, 2ms, 4ms, 6ms, 8ms, 10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms, 2s, 5s, 10s, 15s, 20s, 40s, 100s, 500s, 1000s, 10000s]
namespace: traces.spanmetrics
dimensions:
- name: http.status_code
- name: http.method
- name: rpc.grpc.status_code
- name: db.system
- name: external.service
- name: k8s.cluster.name
exporters:
logging: null
prometheusremotewrite/mimir-a:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanaaMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-b:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanabMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-c:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanacMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-d:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanadMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-e:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanaFirehoseMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-f:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanaeMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-g:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanafMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-h:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanagMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-i:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanahMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-j:
endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanajMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
loadbalancing:
protocol:
otlp:
tls:
insecure: true
resolver:
dns:
hostname: opentelemetry-collector-tail.tempo-system.svc.cluster.local
port: 4317
extensions:
# The health_check extension is mandatory for this chart.
# Without the health_check extension the collector will fail the readiness and liveness probes.
# The health_check extension can be modified, but should never be removed.
health_check: {}
memory_ballast:
size_in_percentage: 33
service:
telemetry:
metrics:
address: 0.0.0.0:8888
logs:
encoding: json
extensions:
- health_check
- memory_ballast
pipelines:
logs: null
metrics:
receivers:
- spanmetrics
processors:
- memory_limiter
- batch
- routing
exporters:
- prometheusremotewrite/mimir-a
- prometheusremotewrite/mimir-b
- prometheusremotewrite/mimir-c
- prometheusremotewrite/mimir-d
- prometheusremotewrite/mimir-e
- prometheusremotewrite/mimir-f
- prometheusremotewrite/mimir-g
- prometheusremotewrite/mimir-h
- prometheusremotewrite/mimir-i
- prometheusremotewrite/mimir-j
traces:
receivers:
- otlp
processors:
- memory_limiter
- batch
exporters:
- loadbalancing
- spanmetrics
image:
# If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
repository: otel/opentelemetry-collector-contrib
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
tag: "0.83.0"
# When digest is set to a non-empty value, images will be pulled by digest (regardless of tag value).
digest: ""
imagePullSecrets: []
# OpenTelemetry Collector executable
command:
name: otelcol-contrib
extraArgs:
- --feature-gates=pkg.translator.prometheus.NormalizeName
nodeSelector:
role: lgtm
tolerations:
- effect: NoSchedule
key: grafana-stack
operator: Exists
# Configuration for ports
# nodePort is also allowed
ports:
otlp:
enabled: true
containerPort: 4317
servicePort: 4317
hostPort: 4317
protocol: TCP
# nodePort: 30317
appProtocol: grpc
otlp-http:
enabled: true
containerPort: 4318
servicePort: 4318
hostPort: 4318
protocol: TCP
jaeger-compact:
enabled: false
containerPort: 6831
servicePort: 6831
hostPort: 6831
protocol: UDP
jaeger-thrift:
enabled: false
containerPort: 14268
servicePort: 14268
hostPort: 14268
protocol: TCP
jaeger-grpc:
enabled: false
containerPort: 14250
servicePort: 14250
hostPort: 14250
protocol: TCP
zipkin:
enabled: false
containerPort: 9411
servicePort: 9411
hostPort: 9411
protocol: TCP
metrics:
# The metrics port is disabled by default. However you need to enable the port
# in order to use the ServiceMonitor (serviceMonitor.enabled) or PodMonitor (podMonitor.enabled).
enabled: true
containerPort: 8888
servicePort: 8888
protocol: TCP
# Resource limits & requests. Update according to your own use case as these values might be too low for a typical deployment.
resources:
limits:
cpu: 1
memory: 1Gi
requests:
cpu: 100m
memory: 100Mi
podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8888"
# only used with deployment mode
replicaCount: 4
# only used with deployment mode
revisionHistoryLimit: 10
service:
type: ClusterIP
# type: LoadBalancer
# loadBalancerIP: 1.2.3.4
# loadBalancerSourceRanges: []
annotations: {}
# PodDisruptionBudget is used only if deployment enabled
podDisruptionBudget:
enabled: true
# minAvailable: 2
maxUnavailable: 1
rollout:
rollingUpdate: {}
# When 'mode: daemonset', maxSurge cannot be used when hostPort is set for any of the ports
# maxSurge: 25%
# maxUnavailable: 0
strategy: RollingUpdate
clusterRole:
# Specifies whether a clusterRole should be created
# Some presets also trigger the creation of a cluster role and cluster role binding.
# If using one of those presets, this field is no-op.
create: false
# Annotations to add to the clusterRole
# Can be used in combination with presets that create a cluster role.
annotations: {}
# The name of the clusterRole to use.
# If not set a name is generated using the fullname template
# Can be used in combination with presets that create a cluster role.
name: ""
# A set of rules as documented here : https://kubernetes.io/docs/reference/access-authn-authz/rbac/
# Can be used in combination with presets that create a cluster role to add additional rules.
rules:
- apiGroups:
- ''
resources:
- 'endpoints'
verbs:
- 'get'
- 'list'
- 'watch'
clusterRoleBinding:
# Annotations to add to the clusterRoleBinding
# Can be used in combination with presets that create a cluster role binding.
annotations: {}
# The name of the clusterRoleBinding to use.
# If not set a name is generated using the fullname template
# Can be used in combination with presets that create a cluster role binding.
name: ""The tail sampler:
# Default values for opentelemetry-collector.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
nameOverride: ""
fullnameOverride: ""
# Valid values are "daemonset", "deployment", and "statefulset".
mode: "deployment"
configMap:
# Specifies whether a configMap should be created (true by default)
create: true
# Base collector configuration.
# Supports templating. To escape existing instances of {{ }}, use {{` <original content> `}}.
# For example, {{ REDACTED_EMAIL }} becomes {{` {{ REDACTED_EMAIL }} `}}.
config:
receivers:
jaeger: null
zipkin: null
prometheus: null
otlp:
protocols:
grpc:
endpoint: ${env:MY_POD_IP}:4317
max_recv_msg_size_mib: 500
http: null
processors:
batch:
send_batch_max_size: 8192
# If set to null, will be overridden with values based on k8s resource limits
memory_limiter: null
tail_sampling:
decision_wait: 60s
policies:
- name: probabilistic
type: probabilistic
probabilistic:
sampling_percentage: 10
routing:
from_attribute: k8s.cluster.name
attribute_source: resource
# default_exporters:
# - otlp/default
table:
- value: a
exporters:
- otlp/tempo-a
- value: b
exporters:
- otlp/tempo-b
- value: c
exporters:
- otlp/tempo-c
- value: d
exporters:
- otlp/tempo-d
- value: e
exporters:
- otlp/tempo-e
- value: f
exporters:
- otlp/tempo-f
- value: g
exporters:
- otlp/tempo-g
- value: h
exporters:
- otlp/tempo-h
- value: i
exporters:
- otlp/tempo-i
- value: j
exporters:
- otlp/tempo-j
exporters:
logging: null
# otlp/default:
# endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
# tls:
# insecure: true
# headers:
# x-scope-orgid: aMimir
otlp/tempo-a:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanaaTempo
otlp/tempo-b:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanabTempo
otlp/tempo-c:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanacTempo
otlp/tempo-d:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanadTempo
otlp/tempo-e:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanaeTempo
otlp/tempo-f:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanafTempo
otlp/tempo-g:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanagTempo
otlp/tempo-h:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanahTempo
otlp/tempo-i:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanaiTempo
otlp/tempo-j:
endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanajTempo
extensions:
# The health_check extension is mandatory for this chart.
# Without the health_check extension the collector will fail the readiness and liveness probes.
# The health_check extension can be modified, but should never be removed.
health_check: {}
memory_ballast:
size_in_percentage: 33
service:
telemetry:
metrics:
address: 0.0.0.0:8888
logs:
encoding: json
extensions:
- health_check
- memory_ballast
pipelines:
logs: null
metrics: null
traces:
receivers:
- otlp
processors:
- memory_limiter
- tail_sampling
- batch
- routing
exporters:
- otlp/tempo-a
- otlp/tempo-b
- otlp/tempo-c
- otlp/tempo-d
- otlp/tempo-e
- otlp/tempo-f
- otlp/tempo-g
- otlp/tempo-h
- otlp/tempo-i
- otlp/tempo-j
image:
# If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
repository: otel/opentelemetry-collector-contrib
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
tag: "0.83.0"
# When digest is set to a non-empty value, images will be pulled by digest (regardless of tag value).
digest: ""
imagePullSecrets: []
# OpenTelemetry Collector executable
command:
name: otelcol-contrib
extraArgs: []
nodeSelector:
role: lgtm
tolerations:
- effect: NoSchedule
key: grafana-stack
operator: Exists
# Configuration for ports
# nodePort is also allowed
ports:
otlp:
enabled: true
containerPort: 4317
servicePort: 4317
hostPort: 4317
protocol: TCP
# nodePort: 30317
appProtocol: grpc
otlp-http:
enabled: false
containerPort: 4318
servicePort: 4318
hostPort: 4318
protocol: TCP
jaeger-compact:
enabled: false
containerPort: 6831
servicePort: 6831
hostPort: 6831
protocol: UDP
jaeger-thrift:
enabled: false
containerPort: 14268
servicePort: 14268
hostPort: 14268
protocol: TCP
jaeger-grpc:
enabled: false
containerPort: 14250
servicePort: 14250
hostPort: 14250
protocol: TCP
zipkin:
enabled: false
containerPort: 9411
servicePort: 9411
hostPort: 9411
protocol: TCP
metrics:
# The metrics port is disabled by default. However you need to enable the port
# in order to use the ServiceMonitor (serviceMonitor.enabled) or PodMonitor (podMonitor.enabled).
enabled: true
containerPort: 8888
servicePort: 8888
protocol: TCP
# Resource limits & requests. Update according to your own use case as these values might be too low for a typical deployment.
resources:
limits:
cpu: 1
memory: 2Gi
requests:
cpu: 100m
memory: 500Mi
podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8888"
# only used with deployment mode
replicaCount: 4
# only used with deployment mode
revisionHistoryLimit: 10
service:
type: ClusterIP
# type: LoadBalancer
# loadBalancerIP: 1.2.3.4
# loadBalancerSourceRanges: []
clusterIP: None
annotations: {}
# PodDisruptionBudget is used only if deployment enabled
podDisruptionBudget:
enabled: true
# minAvailable: 2
maxUnavailable: 1
rollout:
rollingUpdate: {}
# When 'mode: daemonset', maxSurge cannot be used when hostPort is set for any of the ports
# maxSurge: 25%
# maxUnavailable: 0
strategy: RollingUpdate
Ignore the exporters' names; I replaced them with placeholders.
### Log output
_No response_
### Additional context
_No response_


