Epic: Vendor Agnostic OpenTelemetry Observability Support
Executive Summary
Implement a comprehensive, vendor-agnostic observability layer for MCP Gateway using OpenTelemetry standards, enabling distributed tracing, metrics collection, and integration with any OTLP-compatible backend (Phoenix, Jaeger, Datadog, New Relic, etc.).
Type of Feature
- Core Infrastructure Enhancement
- Production Readiness Feature
- Performance Monitoring
- Enterprise Integration
Background & Motivation
As MCP Gateway scales to production deployments, comprehensive observability becomes critical for:
- Performance Monitoring: Track latency, throughput, and error rates across all operations
- Debugging: Trace requests across distributed gateway instances and federated servers
- Cost Management: Monitor token usage and API costs for LLM operations
- Compliance: Audit trail for security and regulatory requirements
- SLA Management: Ensure service level agreements are met
The implementation must be vendor-agnostic to support diverse enterprise monitoring stacks while adding effectively zero overhead when disabled.
User Stories
Story 1: DevOps Engineer - Production Monitoring
As a: DevOps Engineer responsible for MCP Gateway in production
I want: Real-time visibility into gateway performance, errors, and resource usage
So that: I can proactively identify issues, optimize performance, and maintain SLAs
Acceptance Criteria
Given I have configured OpenTelemetry with my preferred backend (Datadog/New Relic/etc)
When the gateway processes requests
Then I should see:
- Request traces with timing breakdowns
- Error rates and exception details
- Resource utilization metrics
- Federation call chains across gateways
Given I notice high latency in tool invocations
When I examine the trace timeline
Then I should see:
- Which specific operation is slow
- Database query times
- External API call durations
- Plugin processing overhead
Story 2: Platform Administrator - Cost & Usage Analytics
As a: Platform Administrator managing multi-tenant MCP deployments
I want: Detailed usage metrics and cost attribution per user/tenant
So that: I can implement chargeback, optimize costs, and enforce quotas
Acceptance Criteria
Given I have enabled observability with custom resource attributes
When tools are invoked by different users
Then traces should include:
- User/tenant identification
- Token counts (prompt, completion, total)
- Estimated costs based on model pricing
- Tool usage frequency per user
- Resource access patterns
Given I need to implement rate limiting
When I analyze usage metrics
Then I should see:
- Requests per second by user
- Token consumption rates
- Peak usage periods
- Quota utilization percentages
Story 3: Developer - Debugging & Optimization
As a: Developer building applications with MCP Gateway
I want: Detailed traces to debug issues and optimize performance
So that: I can quickly identify bottlenecks and improve application efficiency
Acceptance Criteria
Given I'm debugging a failed tool invocation
When I search for the trace by request ID
Then I should see:
- Complete request/response payloads
- Error messages and stack traces
- Retry attempts and circuit breaker status
- Plugin execution order and timing
Given I want to optimize prompt templates
When I analyze prompt rendering traces
Then I should see:
- Template compilation time
- Variable substitution details
- Message token counts
- Cache hit/miss rates
Technical Design
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ MCP Gateway │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Observability Module │ │
│ │ │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │ │
│ │ │ Tracer │ │ Span Manager │ │ Exporter │ │ │
│ │ │ Provider │ │ │ │ Registry │ │ │
│ │ └─────────────┘ └──────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────┼────────────────────────────┐ │
│ │ Instrumented Services │ │
│ │ │ │
│ │ ┌──────────┐ ┌─────────┐ ┌──────────┐ │ │
│ │ │ Tool │ │ Prompt │ │ Resource │ ┌──────┐│ │
│ │ │ Service │ │ Service │ │ Service │ │ More ││ │
│ │ └──────────┘ └─────────┘ └──────────┘ └──────┘│ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ OTLP Protocol │
└──────────────────────────────────────┘
│
┌──────────────────────┼──────────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Phoenix │ │ Jaeger │ │ Datadog │
│ (LLM Focus) │ │ (Distributed │ │ (Enterprise) │
└──────────────┘ │ Tracing) │ └──────────────┘
└──────────────┘
Core Components
1. Observability Module (mcpgateway/observability/)
# observability/__init__.py
from .tracer import init_telemetry, get_tracer
from .decorators import trace_operation, trace_async
from .context import create_span, add_event, set_attribute
from .exporters import ExporterRegistry
from .metrics import MetricsCollector

# observability/tracer.py
class ObservabilityManager:
    """Manages OpenTelemetry initialization and configuration."""

    def __init__(self, config: ObservabilityConfig):
        self.config = config
        self.tracer = None
        self.meter = None
        self.logger_provider = None

    def initialize(self) -> None:
        """Initialize telemetry with configured backend."""
        if not self.config.enabled:
            return

        # Set up resource attributes
        resource = self._create_resource()

        # Initialize tracer provider
        self.tracer = self._setup_tracing(resource)

        # Initialize metrics (future)
        if self.config.metrics_enabled:
            self.meter = self._setup_metrics(resource)

        # Initialize logging (future)
        if self.config.logs_enabled:
            self.logger_provider = self._setup_logging(resource)
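Note that "graceful degradation when disabled" largely falls out of the OpenTelemetry API itself: when no SDK TracerProvider has been installed, get_tracer returns a no-op tracer, so instrumented code needs no enabled-guards. A minimal sketch:

# observability/tracer.py (sketch) — no enabled-check needed here: without
# an SDK TracerProvider installed, the API hands back a no-op tracer whose
# spans are never recorded or exported.
from opentelemetry import trace

def get_tracer(name: str = "mcpgateway") -> trace.Tracer:
    return trace.get_tracer(name)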
2. Service Instrumentation
# services/tool_service.py
from typing import Any, Dict, Optional

class ToolService:
    @trace_async("tool.invoke")
    async def invoke_tool(
        self,
        name: str,
        arguments: Dict[str, Any],
        context: Optional[TraceContext] = None,
    ) -> ToolResult:
        """Invoke a tool with full observability."""
        # Extract trace context for distributed tracing
        if context:
            propagate_context(context)

        # Look up tool metadata for the span (assumes a registry helper)
        tool = await self._get_tool(name)

        # Add semantic attributes
        set_attributes({
            "tool.name": name,
            "tool.arguments.count": len(arguments),
            "user.id": get_current_user(),
            "tenant.id": get_current_tenant(),
        })

        # Add event for tool execution start
        add_event("tool.execution.started", {
            "tool.version": tool.version,
            "tool.type": tool.integration_type,
        })

        try:
            result = await self._execute_tool(name, arguments)

            # Record metrics
            record_metric("tool.invocations", 1, {
                "tool": name,
                "status": "success",
            })

            # Token usage for LLM tools
            if result.token_usage:
                record_metric("tokens.used", result.token_usage.total, {
                    "tool": name,
                    "model": result.model,
                })

            return result
        except Exception as e:
            record_exception(e)
            record_metric("tool.errors", 1, {"tool": name})
            raise
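The trace_async decorator used above is not spelled out in this issue; a minimal sketch of what observability/decorators.py could provide, built only on the standard opentelemetry-api package (the tracer name and the code.function attribute are illustrative choices):

import functools
from opentelemetry import trace

def trace_async(span_name: str):
    """Wrap an async method in a span named span_name."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            tracer = trace.get_tracer("mcpgateway")
            # start_as_current_span records exceptions and marks the span
            # status as ERROR before re-raising, so error paths are captured
            with tracer.start_as_current_span(span_name) as span:
                span.set_attribute("code.function", func.__qualname__)
                return await func(*args, **kwargs)
        return wrapper
    return decorator

Because the span is made current for the duration of the call, helpers such as set_attributes and add_event can attach data to it from inside the method body via trace.get_current_span().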
Implementation Checklist
Phase 1: Core Infrastructure
- Create observability module structure
- Implement tracer initialization with multiple backends
- Add configuration via environment variables
- Create span management utilities
- Add decorator for automatic tracing
- Implement graceful degradation when disabled
Phase 2: Service Instrumentation
- Instrument ToolService with comprehensive tracing
- Add tracing to PromptService
- Instrument ResourceService operations
- Add federation tracing to GatewayService
- Implement batch operation tracing
- Add database query instrumentation
Phase 3: Advanced Features
- Implement trace context propagation for distributed tracing
- Add sampling strategies (always, probabilistic, rate-limited)
- Implement custom span processors for data enrichment
- Add metrics collection (counters, histograms, gauges)
- Implement span filtering and data sanitization
- Add trace correlation with logs
Phase 4: LLM-Specific Instrumentation
- Token counting and cost calculation (a pricing helper is sketched after this list)
- Prompt/completion capture (with PII filtering)
- Model performance metrics
- Streaming response instrumentation
- RAG pipeline tracing
- Vector search operation tracing
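To make the token-cost item above concrete, a minimal attribution helper could look like the following; the per-1K-token prices are placeholders, not authoritative model pricing:

# Placeholder pricing table: (prompt, completion) USD per 1K tokens.
# A real deployment would load current prices from configuration.
PRICE_PER_1K = {
    "gpt-4o": (0.0025, 0.0100),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate USD cost of one LLM call from its token counts."""
    prompt_price, completion_price = PRICE_PER_1K[model]
    return (prompt_tokens / 1000.0) * prompt_price \
        + (completion_tokens / 1000.0) * completion_price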
Phase 5: Production Hardening
- Performance optimization (minimize overhead)
- Memory management for high-volume tracing
- Circuit breaker for exporter failures (a wrapper exporter is sketched after this list)
- TLS configuration for secure export
- Authentication for commercial backends
- Documentation and runbooks
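One possible shape for the circuit-breaker item above: wrap the configured SpanExporter so repeated export failures open the circuit and drop spans fast for a cooldown period. The class name and thresholds are illustrative; only SpanExporter and SpanExportResult come from the OpenTelemetry SDK.

import time
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

class CircuitBreakerExporter(SpanExporter):
    def __init__(self, inner: SpanExporter, failure_threshold: int = 5,
                 cooldown_s: float = 30.0):
        self.inner = inner
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.open_until = 0.0

    def export(self, spans):
        if time.monotonic() < self.open_until:
            return SpanExportResult.FAILURE  # circuit open: drop fast
        result = self.inner.export(spans)
        if result is SpanExportResult.FAILURE:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Too many consecutive failures: stop exporting for a while
                self.open_until = time.monotonic() + self.cooldown_s
                self.failures = 0
        else:
            self.failures = 0
        return result

    def shutdown(self):
        self.inner.shutdown()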
Configuration
Environment Variables
# Core Settings
OTEL_ENABLE_OBSERVABILITY=true # Master switch
OTEL_SERVICE_NAME=mcp-gateway # Service identifier
OTEL_SERVICE_VERSION=${VERSION} # Version from deployment
OTEL_DEPLOYMENT_ENVIRONMENT=production # Environment tag
# Exporter Configuration
OTEL_TRACES_EXPORTER=otlp # otlp|jaeger|zipkin|console|none
OTEL_EXPORTER_OTLP_ENDPOINT=https://collector.example.com:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc # grpc|http/protobuf
OTEL_EXPORTER_OTLP_HEADERS=api-key=secret # Authentication headers
OTEL_EXPORTER_OTLP_INSECURE=false # TLS verification
# Sampling Configuration
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # Sample 10% of traces
# Resource Attributes
OTEL_RESOURCE_ATTRIBUTES=tenant.id=acme,region=us-east-1,cluster=prod-1
# Performance Tuning
OTEL_BSP_MAX_QUEUE_SIZE=2048 # Span buffer size
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512 # Batch size for export
OTEL_BSP_SCHEDULE_DELAY=5000 # Export interval (ms)
# Data Privacy
OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT=12000 # Truncate large values
OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT=1200 # General attribute limit
OTEL_SANITIZE_PII=true # Remove sensitive data
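For reference, a minimal initialization sketch that honors the variables above. The OTLP exporter, sampler, and batch processor read their OTEL_* settings from the environment on their own, so only the master switch needs explicit handling (OTEL_ENABLE_OBSERVABILITY and OTEL_SANITIZE_PII are this epic's own flags, not standard OpenTelemetry variables):

import os
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

if os.getenv("OTEL_ENABLE_OBSERVABILITY", "false").lower() == "true":
    # Resource.create() also merges OTEL_RESOURCE_ATTRIBUTES from the env
    resource = Resource.create(
        {"service.name": os.getenv("OTEL_SERVICE_NAME", "mcp-gateway")}
    )
    provider = TracerProvider(resource=resource)
    # OTLPSpanExporter and BatchSpanProcessor pick up their OTEL_EXPORTER_*
    # and OTEL_BSP_* settings from the environment
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)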
Configuration File (config/observability.yaml)
observability:
  enabled: true

  tracing:
    enabled: true
    exporter:
      type: otlp
      endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT}
      headers:
        api-key: ${OTEL_API_KEY}
    sampling:
      strategy: adaptive        # always|probabilistic|adaptive|rate_limited
      rate: 0.1                 # For probabilistic sampling
      max_per_second: 100       # For rate-limited sampling
    propagators:
      - tracecontext            # W3C Trace Context
      - baggage                 # W3C Baggage
      - b3multi                 # B3 Multi-header (Zipkin)

  metrics:
    enabled: true
    export_interval: 60s
    collectors:
      - name: system
        enabled: true
        interval: 30s
      - name: http
        enabled: true
      - name: database
        enabled: true

  logs:
    enabled: false              # Future enhancement

  data_privacy:
    sanitize_pii: true
    sensitive_fields:
      - password
      - api_key
      - token
      - ssn
      - credit_card
    mask_patterns:
      - regex: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        replacement: '[EMAIL]'
      - regex: '\b\d{3}-\d{2}-\d{4}\b'
        replacement: '[SSN]'
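As an illustration of how mask_patterns could be applied in code — typically to attribute values before they are attached to spans, since ended span data is read-only — a small helper mirroring the two patterns above:

import re

# Mirrors the two mask_patterns above; intended to run on attribute values
# before they are set on a span.
_MASKS = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def sanitize(value: str) -> str:
    """Apply each mask pattern in order, replacing every match."""
    for pattern, replacement in _MASKS:
        value = pattern.sub(replacement, value)
    return value

# Example: sanitize("contact alice@example.com") -> "contact [EMAIL]"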
Testing Strategy
Unit Tests
# tests/unit/observability/test_tracer.py
class TestObservabilityManager:
    def test_initialization_with_otlp(self):
        """Test OTLP exporter initialization."""

    def test_initialization_disabled(self):
        """Test graceful no-op when disabled."""

    def test_multiple_backend_support(self):
        """Test Jaeger, Zipkin, Console exporters."""

    def test_sampling_strategies(self):
        """Test various sampling configurations."""
Integration Tests
# tests/integration/test_observability_integration.py
class TestObservabilityIntegration:
    async def test_tool_invocation_tracing(self):
        """Test complete tool invocation trace."""

    async def test_distributed_tracing(self):
        """Test trace context propagation."""

    async def test_error_recording(self):
        """Test exception capture and reporting."""
Performance Tests
# tests/performance/test_observability_overhead.py
class TestObservabilityPerformance:
    def test_overhead_when_enabled(self):
        """Measure latency impact with tracing."""

    def test_memory_usage(self):
        """Monitor memory consumption."""

    def test_high_volume_tracing(self):
        """Test under load conditions."""
Success Metrics
- Performance Impact
  - Latency overhead < 1ms per operation
  - Memory overhead < 50MB for a typical workload
  - CPU overhead < 2%
- Coverage
  - 100% of service methods instrumented
  - All error paths captured
  - Distributed traces working across federation
- Adoption
  - Compatible with top 5 APM platforms
  - Used by 80% of production deployments
  - Positive feedback from operations teams
- Reliability
  - Zero crashes due to observability code
  - Graceful degradation on exporter failure
  - No data loss with circuit breaker
Migration Plan
For Existing Deployments
Phase 1: Preparation

# Install observability dependencies
pip install mcp-contextforge-gateway[observability]

Phase 2: Configuration

# Start with console exporter for testing
export OTEL_TRACES_EXPORTER=console
export OTEL_SERVICE_NAME=mcp-gateway-test
Phase 3: Backend Setup

# Deploy chosen backend (e.g., Jaeger)
docker-compose -f docker-compose.observability.yml up -d
# Configure gateway to use it
export OTEL_TRACES_EXPORTER=jaeger
export OTEL_EXPORTER_JAEGER_ENDPOINT=http://localhost:14268/api/traces
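The docker-compose.observability.yml referenced above is not included here; a minimal sketch for local testing with Jaeger all-in-one might look like this (ports are the upstream defaults: 16686 for the UI, 4317 for OTLP gRPC, 14268 for the legacy HTTP collector):

# docker-compose.observability.yml (sketch)
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"
      - "4317:4317"
      - "14268:14268"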
Phase 4: Production Rollout

# Enable sampling for production
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.01  # Start with 1% sampling
Documentation Requirements
- User Guide
  - Quick start guide for each backend
  - Configuration reference
  - Troubleshooting guide
  - Performance tuning guide
- Developer Guide
  - How to add instrumentation
  - Custom span processors
  - Testing with traces
  - Best practices
- Operations Guide
  - Deployment patterns
  - Monitoring dashboards
  - Alert configuration
  - Capacity planning
Security Considerations
- Data Privacy
  - PII sanitization before export
  - Configurable field masking
  - Opt-in for payload capture
- Access Control
  - Secure exporter endpoints
  - API key management
  - TLS for data in transit
- Compliance
  - GDPR compliance for EU data
  - Audit trail retention policies
  - Data residency configuration
Related Issues
Primary Observability Issues
- #727: [Feature] Phoenix Observability Integration plugin - Core implementation ticket
- #175: [Feature Request] Add OpenLLMetry Integration for Observability - LLM-specific instrumentation
- #218: [Feature Request] Prometheus Metrics Instrumentation using prometheus-fastapi-instrumentator - Metrics collection
- #272: [Feature Request] Observability - Pre-built Grafana Dashboards & Loki Log Export - Visualization & logs
- #699: [Feature] Metrics Enhancement (export all data, capture all metrics, fix last used timestamps, UI improvements) - Metrics improvements
Logging & Monitoring
- #300: [Feature Request] Structured JSON Logging with Correlation IDs - Request tracing support
- #535: [SECURITY FEATURE] Audit Logging System - Compliance & security audit trails
- #368: [Feature Request] Enhance Metrics Tab UI with Virtual Servers and Top 5 Performance Tables - UI metrics display
- #374: [Bug] Fix "metrics-loading" Element Not Found Console Warning - Metrics UI fix
Performance & Scale
- #432: [PERFORMANCE] Performance Optimization Implementation and Guide for MCP Gateway (baseline) - Performance benchmarking
- #253: [CHORE] Implement chaos engineering tests for fault tolerance validation (network partitions, service failures) - Resilience testing
- #255: [CHORE] Implement comprehensive Playwright test automation for the entire MCP Gateway Admin UI with Makefile targets and GitHub Actions - E2E test observability
Security & Compliance
- #540: [SECURITY FEATURE] Configurable Well-Known URI Handler including security.txt and robots.txt - Security endpoints
- #534: [SECURITY FEATURE] Add Security Configuration Validation and Startup Checks - Config validation
- #543: [SECURITY FEATURE] CSRF Token Protection System - Security tracing
- #541: [SECURITY FEATURE] Enhanced Session Management for Admin UI - Session tracking
Related Infrastructure
- #283: [SECURITY FEATURE] Role-Based Access Control (RBAC) - User/Team/Global Scopes for full multi-tenancy support - User/tenant attribution for traces
- #545: [Feature Request] Hot-Reload Configuration Without Restart (move from .env to configuration database table) (draft) - Dynamic config updates
- #683: [Feature Request] Debug headers and passthrough headers, e.g. X-Tenant-Id, X-Trace-Id, Authorization for time server (go) (draft) - Trace propagation
- #682: [Feature] Add tool hooks (tool_pre_invoke / tool_post_invoke) to plugin system - Plugin instrumentation
Testing & Quality
- #281: [CHORE] Set up contract testing with Pact (pact-python) including Makefile and GitHub Actions targets - API contract observability
- #280: [CHORE] Add mutation testing with mutmut for test quality validation - Test quality metrics
- #261: [CHORE] Implement 90% Test Coverage Quality Gate and automatic badge and coverage html / markdown report publication - Coverage metrics
- #259: [CHORE] SAST (Semgrep) and DAST (OWASP ZAP) automated security testing Makefile targets and GitHub Actions - Security scan metrics
Documentation & Deployment
- #264: [DOCS] GA Documentation Review & End-to-End Validation Audit - Documentation completeness
- #402: [CHORE] Add post-deploy step to helm that configures the Time Server as a Gateway (draft) - Deployment automation
- #383: [Bug] Remove migration step from Helm chart (now automated, no longer needed) - Deployment simplification
- #377: [CHORE] Fix PostgreSQL Volume Name Conflicts in Helm Chart (draft) - Storage management
Related Architecture Decisions
- ADR-005: Structured JSON Logging - Logging architecture
- ADR-010: Observability with Prometheus - Metrics architecture
Dependencies
- OpenTelemetry SDK and exporters
- No vendor lock-in to specific backends
- Optional LLM-specific libraries (OpenLLMetry)
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Performance degradation | High | Implement sampling, optimize hot paths |
| Memory leaks | High | Implement span limits, regular profiling |
| Exporter failures | Medium | Circuit breaker, local buffering |
| Compliance violations | High | PII sanitization, data governance |
| Complexity for users | Medium | Sensible defaults, clear documentation |
Alternatives Considered
- Custom Metrics System
  - Pros: Full control, optimized for MCP
  - Cons: Maintenance burden, no ecosystem
- Direct APM Integration
  - Pros: Vendor support, rich features
  - Cons: Vendor lock-in, licensing costs
- Prometheus + Grafana Only
  - Pros: Open source, mature
  - Cons: Limited tracing, separate stack
Decision: OpenTelemetry provides the best balance of standardization, flexibility, and ecosystem support.
Appendix: Example Traces
Tool Invocation Trace
{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "operationName": "tool.invoke",
  "startTime": 1614544192123456,
  "duration": 1234567,
  "attributes": {
    "tool.name": "github_search",
    "tool.id": "550e8400-e29b-41d4-a716-446655440000",
    "user.id": "user123",
    "tenant.id": "acme-corp",
    "http.method": "POST",
    "http.url": "/tools/invoke",
    "http.status_code": 200
  },
  "events": [
    {
      "time": 1614544192123500,
      "name": "tool.validation.completed"
    },
    {
      "time": 1614544192124000,
      "name": "plugin.pre_invoke.executed",
      "attributes": {
        "plugin.name": "RateLimiter"
      }
    }
  ],
  "status": {
    "code": "OK"
  }
}
Federated Request Trace
{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spans": [
    {
      "spanId": "parent123",
      "operationName": "gateway.request",
      "serviceName": "mcp-gateway-1"
    },
    {
      "spanId": "child456",
      "parentSpanId": "parent123",
      "operationName": "gateway.forward",
      "serviceName": "mcp-gateway-2",
      "attributes": {
        "peer.service": "mcp-gateway-1",
        "rpc.method": "tools/invoke"
      }
    }
  ]
}
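The parent/child relationship above depends on trace context crossing the gateway boundary. A minimal sketch of W3C Trace Context propagation with the OpenTelemetry API — inject() and extract() are the standard opentelemetry-api propagation calls; the request/response shapes are made up for the example:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("mcpgateway")

def forward_request(payload: dict) -> dict:
    # Caller (mcp-gateway-1): open the forwarding span, then copy the
    # current context into outbound headers as a 'traceparent' header
    with tracer.start_as_current_span("gateway.forward"):
        headers: dict = {}
        inject(headers)
        return {"payload": payload, "headers": headers}

def handle_request(request: dict) -> None:
    # Callee (mcp-gateway-2): restore the caller's context so this span
    # becomes a child in the same trace
    ctx = extract(request["headers"])
    with tracer.start_as_current_span("gateway.request", context=ctx):
        ...  # process the forwarded invocation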
Implementation Coverage Matrix
This comprehensive observability implementation supports multiple open issues:
| Issue | Title | How This Implementation Addresses It |
|---|---|---|
| #175 | OpenLLMetry Integration | ✅ Full OpenTelemetry implementation with LLM-specific attributes (tokens, costs, models) |
| #218 | Prometheus Metrics | ✅ Metrics collection infrastructure ready for prometheus-fastapi-instrumentator |
| #272 | Grafana Dashboards | ✅ OpenTelemetry data exportable to Grafana via Tempo/Prometheus |
| #300 | Structured JSON Logging | ✅ Correlation IDs via trace context propagation |
| #432 | Performance Optimization | ✅ Performance benchmarks and overhead measurement included |
| #535 | Audit Logging | ✅ Trace data provides audit trail with user attribution |
| #683 | Debug Headers | ✅ X-Trace-Id propagation via W3C Trace Context |
| #699 | Metrics Enhancement | ✅ Comprehensive metrics collection for all operations |
| #727 | Phoenix Integration | ✅ Phoenix supported as OTLP-compatible backend |
Summary
The vendor-agnostic approach ensures compatibility with existing monitoring stacks while the OpenTelemetry foundation provides future-proof standardization.
Next Steps
- Close subset issues that this implementation covers:
  - Consider closing #727 ([Feature] Phoenix Observability Integration plugin) in favor of this vendor-agnostic approach
  - Mark #175 ([Feature Request] Add OpenLLMetry Integration for Observability) as partially addressed by the LLM instrumentation
  - Link #218 ([Feature Request] Prometheus Metrics Instrumentation) as dependent on this implementation