[Epic]: Vendor Agnostic OpenTelemetry Observability Support #735

@crivetimihai

Executive Summary

Implement a comprehensive, vendor-agnostic observability layer for MCP Gateway using OpenTelemetry standards, enabling distributed tracing, metrics collection, and integration with any OTLP-compatible backend (Phoenix, Jaeger, Datadog, New Relic, etc.).

Type of Feature

  • Core Infrastructure Enhancement
  • Production Readiness Feature
  • Performance Monitoring
  • Enterprise Integration

Background & Motivation

As MCP Gateway scales to production deployments, comprehensive observability becomes critical for:

  • Performance Monitoring: Track latency, throughput, and error rates across all operations
  • Debugging: Trace requests across distributed gateway instances and federated servers
  • Cost Management: Monitor token usage and API costs for LLM operations
  • Compliance: Audit trail for security and regulatory requirements
  • SLA Management: Ensure service level agreements are met

The implementation must be vendor-agnostic to support diverse enterprise monitoring stacks while maintaining zero overhead when disabled.

User Stories

Story 1: DevOps Engineer - Production Monitoring

As a: DevOps Engineer responsible for MCP Gateway in production

I want: Real-time visibility into gateway performance, errors, and resource usage

So that: I can proactively identify issues, optimize performance, and maintain SLAs

Acceptance Criteria

Given I have configured OpenTelemetry with my preferred backend (Datadog/New Relic/etc)
When the gateway processes requests
Then I should see:
  - Request traces with timing breakdowns
  - Error rates and exception details
  - Resource utilization metrics
  - Federation call chains across gateways

Given I notice high latency in tool invocations
When I examine the trace timeline
Then I should see:
  - Which specific operation is slow
  - Database query times
  - External API call durations
  - Plugin processing overhead

Story 2: Platform Administrator - Cost & Usage Analytics

As a: Platform Administrator managing multi-tenant MCP deployments

I want: Detailed usage metrics and cost attribution per user/tenant

So that: I can implement chargeback, optimize costs, and enforce quotas

Acceptance Criteria

Given I have enabled observability with custom resource attributes
When tools are invoked by different users
Then traces should include:
  - User/tenant identification
  - Token counts (prompt, completion, total)
  - Estimated costs based on model pricing
  - Tool usage frequency per user
  - Resource access patterns

Given I need to implement rate limiting
When I analyze usage metrics
Then I should see:
  - Requests per second by user
  - Token consumption rates
  - Peak usage periods
  - Quota utilization percentages

Story 3: Developer - Debugging & Optimization

As a: Developer building applications with MCP Gateway

I want: Detailed traces to debug issues and optimize performance

So that: I can quickly identify bottlenecks and improve application efficiency

Acceptance Criteria

Given I'm debugging a failed tool invocation
When I search for the trace by request ID
Then I should see:
  - Complete request/response payloads
  - Error messages and stack traces
  - Retry attempts and circuit breaker status
  - Plugin execution order and timing

Given I want to optimize prompt templates
When I analyze prompt rendering traces
Then I should see:
  - Template compilation time
  - Variable substitution details
  - Message token counts
  - Cache hit/miss rates

Technical Design

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     MCP Gateway                              │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │            Observability Module                       │  │
│  │                                                       │  │
│  │  ┌─────────────┐  ┌──────────────┐  ┌────────────┐ │  │
│  │  │   Tracer    │  │ Span Manager │  │  Exporter  │ │  │
│  │  │  Provider   │  │              │  │  Registry  │ │  │
│  │  └─────────────┘  └──────────────┘  └────────────┘ │  │
│  └──────────────────────────────────────────────────────┘  │
│                           │                                  │
│  ┌────────────────────────┼────────────────────────────┐   │
│  │                 Instrumented Services                │   │
│  │                                                      │   │
│  │  ┌──────────┐  ┌─────────┐  ┌──────────┐          │   │
│  │  │   Tool   │  │ Prompt  │  │ Resource │  ┌──────┐│   │
│  │  │ Service  │  │ Service │  │ Service  │  │ More ││   │
│  │  └──────────┘  └─────────┘  └──────────┘  └──────┘│   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
            ┌──────────────────────────────────────┐
            │         OTLP Protocol                │
            └──────────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        ▼                      ▼                      ▼
┌──────────────┐     ┌──────────────┐      ┌──────────────┐
│   Phoenix    │     │    Jaeger    │      │   Datadog    │
│  (LLM Focus) │     │ (Distributed │      │ (Enterprise) │
└──────────────┘     │   Tracing)   │      └──────────────┘
                     └──────────────┘

Core Components

1. Observability Module (mcpgateway/observability/)

# observability/__init__.py
from .tracer import init_telemetry, get_tracer
from .decorators import trace_operation, trace_async
from .context import create_span, add_event, set_attribute, set_attributes
from .exporters import ExporterRegistry
from .metrics import MetricsCollector

# observability/tracer.py
class ObservabilityManager:
    """Manages OpenTelemetry initialization and configuration."""
    
    def __init__(self, config: ObservabilityConfig):
        self.config = config
        self.tracer = None
        self.meter = None
        self.logger_provider = None
        
    def initialize(self) -> None:
        """Initialize telemetry with configured backend."""
        if not self.config.enabled:
            return
            
        # Setup resource attributes
        resource = self._create_resource()
        
        # Initialize tracer provider
        self.tracer = self._setup_tracing(resource)
        
        # Initialize metrics (future)
        if self.config.metrics_enabled:
            self.meter = self._setup_metrics(resource)
            
        # Initialize logging (future)
        if self.config.logs_enabled:
            self.logger_provider = self._setup_logging(resource)

2. Service Instrumentation

# services/tool_service.py
class ToolService:
    @trace_async("tool.invoke")
    async def invoke_tool(
        self, 
        name: str, 
        arguments: Dict[str, Any],
        context: Optional[TraceContext] = None
    ) -> ToolResult:
        """Invoke a tool with full observability."""
        
        # Extract trace context for distributed tracing
        if context:
            propagate_context(context)
            
        # Add semantic attributes
        set_attributes({
            "tool.name": name,
            "tool.arguments.count": len(arguments),
            "user.id": get_current_user(),
            "tenant.id": get_current_tenant(),
        })
        
        # Look up the registered tool so its metadata can be attached
        tool = await self._get_tool(name)

        # Add event for tool execution start
        add_event("tool.execution.started", {
            "tool.version": tool.version,
            "tool.type": tool.integration_type
        })
        
        try:
            result = await self._execute_tool(name, arguments)
            
            # Record metrics
            record_metric("tool.invocations", 1, {
                "tool": name,
                "status": "success"
            })
            
            # Token usage for LLM tools
            if result.token_usage:
                record_metric("tokens.used", result.token_usage.total, {
                    "tool": name,
                    "model": result.model
                })
                
            return result
            
        except Exception as e:
            record_exception(e)
            record_metric("tool.errors", 1, {"tool": name})
            raise

Implementation checklist

Phase 1: Core Infrastructure

  • Create observability module structure
  • Implement tracer initialization with multiple backends
  • Add configuration via environment variables
  • Create span management utilities
  • Add decorator for automatic tracing
  • Implement graceful degradation when disabled
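The decorator and graceful-degradation items above could be sketched as follows. This is a rough illustration, not the final API: the `OTEL_ENABLE_OBSERVABILITY` switch comes from the configuration section below, while the timing fallback and helper names are assumptions.

```python
import asyncio
import functools
import os
import time

# Sketch: when observability is disabled, the decorator must degrade
# to a near-zero-cost pass-through (no wrapper at all).
OBSERVABILITY_ENABLED = os.getenv("OTEL_ENABLE_OBSERVABILITY", "false").lower() == "true"


def trace_async(span_name: str):
    """Wrap an async function in a (stubbed) span; no-op when disabled."""
    def decorator(func):
        if not OBSERVABILITY_ENABLED:
            return func  # graceful degradation: zero overhead when disabled

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                duration_ms = (time.perf_counter() - start) * 1000
                # A real implementation would end an OpenTelemetry span here
                # instead of printing.
                print(f"{span_name} took {duration_ms:.2f}ms")
        return wrapper
    return decorator


@trace_async("tool.invoke")
async def invoke_tool(name: str) -> str:
    return f"invoked {name}"
```

Because the disabled path returns the original coroutine function untouched, the decorator itself adds no latency when the master switch is off.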

Phase 2: Service Instrumentation

  • Instrument ToolService with comprehensive tracing
  • Add tracing to PromptService
  • Instrument ResourceService operations
  • Add federation tracing to GatewayService
  • Implement batch operation tracing
  • Add database query instrumentation

Phase 3: Advanced Features

  • Implement trace context propagation for distributed tracing
  • Add sampling strategies (always, probabilistic, rate-limited)
  • Implement custom span processors for data enrichment
  • Add metrics collection (counters, histograms, gauges)
  • Implement span filtering and data sanitization
  • Add trace correlation with logs
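For the context-propagation item, the W3C Trace Context `traceparent` header has a fixed `version-trace_id-parent_id-flags` layout. A minimal builder/parser (header layout per the W3C spec; the helper names are ours) might look like:

```python
from typing import Optional, Tuple


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a W3C traceparent header: version-trace_id-parent_id-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, span_id, sampled), or None if the header is malformed."""
    parts = header.split("-")
    # trace_id is 16 bytes (32 hex chars), parent span id is 8 bytes (16 hex chars)
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return parts[1], parts[2], parts[3] == "01"
```

Forwarding this header on federation calls is what lets a downstream gateway attach its spans to the upstream trace.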

Phase 4: LLM-Specific Instrumentation

  • Token counting and cost calculation
  • Prompt/completion capture (with PII filtering)
  • Model performance metrics
  • Streaming response instrumentation
  • RAG pipeline tracing
  • Vector search operation tracing
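The cost-calculation item above reduces to a lookup against per-model pricing. A sketch, where the prices and the `llm.cost.usd` attribute name are illustrative assumptions (real values would come from configuration):

```python
# Illustrative per-1K-token prices in USD; not real or current pricing.
MODEL_PRICING = {
    "gpt-4o": {"prompt": 0.0025, "completion": 0.01},
    "claude-sonnet": {"prompt": 0.003, "completion": 0.015},
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate USD cost from token counts, e.g. for an llm.cost.usd span attribute."""
    pricing = MODEL_PRICING.get(model)
    if pricing is None:
        return 0.0  # unknown model: still record token counts, skip cost attribution
    return (prompt_tokens / 1000) * pricing["prompt"] + \
           (completion_tokens / 1000) * pricing["completion"]
```

Recording the estimate as a span attribute alongside `tenant.id` is what makes the per-tenant chargeback in Story 2 possible.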

Phase 5: Production Hardening

  • Performance optimization (minimize overhead)
  • Memory management for high-volume tracing
  • Circuit breaker for exporter failures
  • TLS configuration for secure export
  • Authentication for commercial backends
  • Documentation and runbooks
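The exporter circuit breaker from the checklist could look roughly like the sketch below (class and method names are assumptions): after a run of consecutive export failures it stops calling the backend for a cooldown period, during which spans would be buffered locally.

```python
import time
from typing import Optional


class ExporterCircuitBreaker:
    """Open after max_failures consecutive export errors; retry after cooldown."""

    def __init__(self, max_failures: int = 3, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Should the next export attempt hit the backend?"""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # half-open: let one attempt through to probe the backend
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

The key property for "zero crashes due to observability code" is that an open breaker turns export into a cheap local buffer write instead of a blocking network call.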

Configuration

Environment Variables

# Core Settings
OTEL_ENABLE_OBSERVABILITY=true              # Master switch
OTEL_SERVICE_NAME=mcp-gateway               # Service identifier
OTEL_SERVICE_VERSION=${VERSION}             # Version from deployment
OTEL_DEPLOYMENT_ENVIRONMENT=production      # Environment tag

# Exporter Configuration
OTEL_TRACES_EXPORTER=otlp                   # otlp|jaeger|zipkin|console|none
OTEL_EXPORTER_OTLP_ENDPOINT=https://collector.example.com:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc            # grpc|http/protobuf
OTEL_EXPORTER_OTLP_HEADERS=api-key=secret   # Authentication headers
OTEL_EXPORTER_OTLP_INSECURE=false          # TLS verification

# Sampling Configuration
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1                 # Sample 10% of traces

# Resource Attributes
OTEL_RESOURCE_ATTRIBUTES=tenant.id=acme,region=us-east-1,cluster=prod-1

# Performance Tuning
OTEL_BSP_MAX_QUEUE_SIZE=2048               # Span buffer size
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512         # Batch size for export
OTEL_BSP_SCHEDULE_DELAY=5000               # Export interval (ms)

# Data Privacy
OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT=12000  # Truncate large values
OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT=1200        # General attribute limit
OTEL_SANITIZE_PII=true                        # Remove sensitive data
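`parentbased_traceidratio` makes its decision deterministically from the trace ID, so every gateway in a federation samples the same traces. A simplified version of the ratio check (mirroring, as an approximation, how SDKs compare the low 64 bits of the trace ID against `ratio * 2**64`; this is a sketch, not the SDK algorithm verbatim):

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic ratio sampling: the same trace ID always gets the same answer."""
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)  # low 8 bytes of trace ID
    return low_bits < int(ratio * (1 << 64))
```

Determinism is why `OTEL_TRACES_SAMPLER_ARG=0.1` yields coherent 10% traces end to end rather than fragments sampled independently per hop.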

Configuration File (config/observability.yaml)

observability:
  enabled: true
  
  tracing:
    enabled: true
    exporter: 
      type: otlp
      endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT}
      headers:
        api-key: ${OTEL_API_KEY}
    
    sampling:
      strategy: adaptive  # always|probabilistic|adaptive|rate_limited
      rate: 0.1          # For probabilistic sampling
      max_per_second: 100  # For rate-limited sampling
    
    propagators:
      - tracecontext    # W3C Trace Context
      - baggage        # W3C Baggage
      - b3multi        # B3 Multi-header (Zipkin)
    
  metrics:
    enabled: true
    export_interval: 60s
    
    collectors:
      - name: system
        enabled: true
        interval: 30s
      - name: http
        enabled: true
      - name: database
        enabled: true
        
  logs:
    enabled: false  # Future enhancement
    
  data_privacy:
    sanitize_pii: true
    
    sensitive_fields:
      - password
      - api_key
      - token
      - ssn
      - credit_card
    
    mask_patterns:
      - regex: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        replacement: '[EMAIL]'
      - regex: '\b\d{3}-\d{2}-\d{4}\b'
        replacement: '[SSN]'
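The `mask_patterns` and `sensitive_fields` above translate directly into an ordered `re.sub` pass over attribute values before export. A sketch of the sanitizer (the function name and redaction behavior are our assumptions):

```python
import re

# Compiled from the mask_patterns in the YAML config above.
MASK_PATTERNS = [
    (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'), '[EMAIL]'),
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN]'),
]
SENSITIVE_FIELDS = {"password", "api_key", "token", "ssn", "credit_card"}


def sanitize_attribute(key: str, value: str) -> str:
    """Drop sensitive fields outright; mask known PII patterns everywhere else."""
    if key in SENSITIVE_FIELDS:
        return "[REDACTED]"
    for pattern, replacement in MASK_PATTERNS:
        value = pattern.sub(replacement, value)
    return value
```

Running this in a custom span processor, before the batch processor hands spans to the exporter, keeps PII from ever leaving the gateway.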

Testing Strategy

Unit Tests

# tests/unit/observability/test_tracer.py
class TestObservabilityManager:
    def test_initialization_with_otlp(self):
        """Test OTLP exporter initialization."""
        
    def test_initialization_disabled(self):
        """Test graceful no-op when disabled."""
        
    def test_multiple_backend_support(self):
        """Test Jaeger, Zipkin, Console exporters."""
        
    def test_sampling_strategies(self):
        """Test various sampling configurations."""

Integration Tests

# tests/integration/test_observability_integration.py
class TestObservabilityIntegration:
    async def test_tool_invocation_tracing(self):
        """Test complete tool invocation trace."""
        
    async def test_distributed_tracing(self):
        """Test trace context propagation."""
        
    async def test_error_recording(self):
        """Test exception capture and reporting."""

Performance Tests

# tests/performance/test_observability_overhead.py
class TestObservabilityPerformance:
    def test_overhead_when_enabled(self):
        """Measure latency impact with tracing."""
        
    def test_memory_usage(self):
        """Monitor memory consumption."""
        
    def test_high_volume_tracing(self):
        """Test under load conditions."""

Success Metrics

  1. Performance Impact

    • Latency overhead < 1ms per operation
    • Memory overhead < 50MB for typical workload
    • CPU overhead < 2%
  2. Coverage

    • 100% of service methods instrumented
    • All error paths captured
    • Distributed traces working across federation
  3. Adoption

    • Compatible with top 5 APM platforms
    • Used by 80% of production deployments
    • Positive feedback from operations teams
  4. Reliability

    • Zero crashes due to observability code
    • Graceful degradation on exporter failure
    • No data loss with circuit breaker

Migration Plan

For Existing Deployments

  1. Phase 1: Preparation

    # Install observability dependencies
    pip install mcp-contextforge-gateway[observability]
  2. Phase 2: Configuration

    # Start with console exporter for testing
    export OTEL_TRACES_EXPORTER=console
    export OTEL_SERVICE_NAME=mcp-gateway-test
  3. Phase 3: Backend Setup

    # Deploy chosen backend (e.g., Jaeger)
    docker-compose -f docker-compose.observability.yml up -d
    
    # Configure gateway to use it
    export OTEL_TRACES_EXPORTER=jaeger
    export OTEL_EXPORTER_JAEGER_ENDPOINT=http://localhost:14268/api/traces
  4. Phase 4: Production Rollout

    # Enable sampling for production
    export OTEL_TRACES_SAMPLER=parentbased_traceidratio
    export OTEL_TRACES_SAMPLER_ARG=0.01  # Start with 1% sampling

Documentation Requirements

  1. User Guide

    • Quick start guide for each backend
    • Configuration reference
    • Troubleshooting guide
    • Performance tuning guide
  2. Developer Guide

    • How to add instrumentation
    • Custom span processors
    • Testing with traces
    • Best practices
  3. Operations Guide

    • Deployment patterns
    • Monitoring dashboards
    • Alert configuration
    • Capacity planning

Security Considerations

  1. Data Privacy

    • PII sanitization before export
    • Configurable field masking
    • Opt-in for payload capture
  2. Access Control

    • Secure exporter endpoints
    • API key management
    • TLS for data in transit
  3. Compliance

    • GDPR compliance for EU data
    • Audit trail retention policies
    • Data residency configuration

Related Issues

Grouped by theme (specific issue numbers are listed in the Implementation Coverage Matrix below):

  • Primary observability issues
  • Logging & monitoring
  • Performance & scale
  • Security & compliance
  • Related infrastructure
  • Testing & quality
  • Documentation & deployment

Related Architecture Decisions

  • ADR-005: Structured JSON Logging - Logging architecture
  • ADR-010: Observability with Prometheus - Metrics architecture

Dependencies

  • OpenTelemetry SDK and exporters
  • No vendor lock-in to specific backends
  • Optional LLM-specific libraries (OpenLLMetry)

Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Performance degradation | High | Implement sampling, optimize hot paths |
| Memory leaks | High | Implement span limits, regular profiling |
| Exporter failures | Medium | Circuit breaker, local buffering |
| Compliance violations | High | PII sanitization, data governance |
| Complexity for users | Medium | Sensible defaults, clear documentation |

Alternatives Considered

  1. Custom Metrics System

    • Pros: Full control, optimized for MCP
    • Cons: Maintenance burden, no ecosystem
  2. Direct APM Integration

    • Pros: Vendor support, rich features
    • Cons: Vendor lock-in, licensing costs
  3. Prometheus + Grafana Only

    • Pros: Open source, mature
    • Cons: Limited tracing, separate stack

Decision: OpenTelemetry provides the best balance of standardization, flexibility, and ecosystem support.

Appendix: Example Traces

Tool Invocation Trace

{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "operationName": "tool.invoke",
  "startTime": 1614544192123456,
  "duration": 1234567,
  "attributes": {
    "tool.name": "github_search",
    "tool.id": "550e8400-e29b-41d4-a716-446655440000",
    "user.id": "user123",
    "tenant.id": "acme-corp",
    "http.method": "POST",
    "http.url": "/tools/invoke",
    "http.status_code": 200
  },
  "events": [
    {
      "time": 1614544192123500,
      "name": "tool.validation.completed"
    },
    {
      "time": 1614544192124000,
      "name": "plugin.pre_invoke.executed",
      "attributes": {
        "plugin.name": "RateLimiter"
      }
    }
  ],
  "status": {
    "code": "OK"
  }
}

Federated Request Trace

{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spans": [
    {
      "spanId": "parent123",
      "operationName": "gateway.request",
      "serviceName": "mcp-gateway-1"
    },
    {
      "spanId": "child456",
      "parentSpanId": "parent123", 
      "operationName": "gateway.forward",
      "serviceName": "mcp-gateway-2",
      "attributes": {
        "peer.service": "mcp-gateway-1",
        "rpc.method": "tools/invoke"
      }
    }
  ]
}

Implementation Coverage Matrix

This comprehensive observability implementation supports multiple open issues:

| Issue | Title | How This Implementation Addresses It |
|-------|-------|--------------------------------------|
| #175 | OpenLLMetry Integration | ✅ Full OpenTelemetry implementation with LLM-specific attributes (tokens, costs, models) |
| #218 | Prometheus Metrics | ✅ Metrics collection infrastructure ready for prometheus-fastapi-instrumentator |
| #272 | Grafana Dashboards | ✅ OpenTelemetry data exportable to Grafana via Tempo/Prometheus |
| #300 | Structured JSON Logging | ✅ Correlation IDs via trace context propagation |
| #432 | Performance Optimization | ✅ Performance benchmarks and overhead measurement included |
| #535 | Audit Logging | ✅ Trace data provides audit trail with user attribution |
| #683 | Debug Headers | ✅ X-Trace-Id propagation via W3C Trace Context |
| #699 | Metrics Enhancement | ✅ Comprehensive metrics collection for all operations |
| #727 | Phoenix Integration | ✅ Phoenix supported as OTLP-compatible backend |

Summary

The vendor-agnostic approach ensures compatibility with existing monitoring stacks while the OpenTelemetry foundation provides future-proof standardization.

Next Steps

  1. Close the subset of issues that this implementation covers (see the Implementation Coverage Matrix above).

Labels

  • devops: DevOps activities (containers, automation, deployment, makefiles, etc.)
  • enhancement: New feature or request
  • observability: Observability, logging, monitoring
  • python: Python / backend development (FastAPI)
  • triage: Issues / Features awaiting triage
