Priority: Low (Infrastructure/Deployment)
Description:
The Docker container for MCP Gateway fails health checks, getting stuck in "starting" status for extended periods before transitioning to "unhealthy". This prevents proper container orchestration, auto-scaling, and monitoring in production environments.
Steps to Reproduce:
- Build and run the Docker container:
  make docker-stop
  make docker-run
- Monitor container health status:
  # Check container status
  docker ps --filter name=mcpgateway --format "table {{.Names}}\t{{.Status}}"
  # Inspect health status
  docker inspect mcpgateway | jq '.[0].State.Health.Status'
- Check container logs:
  docker logs mcpgateway --tail 50
Expected Behavior:
- Container should pass health checks within 30-60 seconds
- Health status should show "healthy"
- Container should respond to health check endpoints
Actual Behavior:
- Container remains in "health: starting" for ~2 minutes
- Eventually transitions to "unhealthy" status
- Health check appears to be failing or missing
Investigation Findings:
- Missing HEALTHCHECK instruction: Containerfile.lite is missing a HEALTHCHECK instruction
- Application runs on port 4444: The container exposes port 4444, not 8000
- Scratch-based image: The container uses a minimal scratch image without common tools like curl or wget
- Gunicorn startup delay: Health checks fail with "Server disconnected without sending a response" during gunicorn startup
- localhost vs 127.0.0.1: Using localhost may resolve to IPv6 (::1), which the server might not bind to
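The localhost finding can be checked from inside the container. The following is a minimal sketch, assuming Python is available in the image; the helper name resolve_addresses is illustrative and not part of the project. It lists the addresses the resolver offers for a host name, in the order a client would try them.

```python
import socket


def resolve_addresses(host, port=4444):
    """Return (family, address) pairs the resolver offers for host, in order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [
        (socket.AddressFamily(family).name, sockaddr[0])
        for family, _socktype, _proto, _canonname, sockaddr in infos
    ]


# Example use (run inside the container):
#   print(resolve_addresses("localhost"))
#   print(resolve_addresses("127.0.0.1"))
# If "localhost" lists an AF_INET6 entry (::1) first while the server only
# listens on IPv4, a health check against "localhost" can fail even though
# the application is up.
```

If the first call returns an IPv6 entry, addressing the health check by the IPv4 literal 127.0.0.1 sidesteps the ambiguity.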
Impact:
- Container orchestration systems (Kubernetes, Docker Swarm, ECS) cannot properly manage the container
- Load balancers cannot determine if the container is ready to receive traffic
- Auto-scaling and self-healing capabilities are broken
- Production deployments may route traffic to unhealthy instances
- CI/CD pipelines may incorrectly report successful deployments
Root Cause Analysis:
The issue appears to stem from:
- Missing or improperly configured HEALTHCHECK instruction in Containerfile.lite
- Lack of health check tools (curl/wget) in the minimal container image
- Possible loopback connectivity issues within the container (for example, localhost resolving to ::1 while the server listens only on IPv4)
Suggested Fix:
Add a Python-based HEALTHCHECK instruction to the Containerfile.lite:
# Add this before the final CMD instruction
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD ["python3", "-c", "import httpx,sys;sys.exit(0 if httpx.get('http://127.0.0.1:4444/health',timeout=5).status_code==200 else 1)"]
This one-liner health check:
- Uses httpx (which should be available in the project dependencies)
- Connects to the health endpoint on port 4444
- Returns exit code 0 (healthy) for HTTP 200 responses
- Returns exit code 1 (unhealthy) for any other status or connection errors
- Has a 5-second timeout to prevent hanging
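For readability, the same check can be written out as a small script. This is only a sketch under stated assumptions: it swaps httpx for the standard library's urllib so it has no third-party dependency, it addresses the server by the IPv4 literal rather than localhost, and the names HEALTH_URL and check_health are illustrative.

```python
import http.client
import urllib.error
import urllib.request

# Assumption: same endpoint and port as the one-liner, addressed by IPv4
# literal so the result does not depend on how "localhost" resolves.
HEALTH_URL = "http://127.0.0.1:4444/health"


def check_health(url=HEALTH_URL, timeout=5.0):
    """Return 0 if the endpoint answers HTTP 200, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status == 200 else 1
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Refused connections, timeouts, dropped connections during startup,
        # 4xx/5xx responses (HTTPError) and malformed URLs all count as unhealthy.
        return 1


# As a HEALTHCHECK command the script would end with:
#   import sys; sys.exit(check_health())
```

Catching the errors explicitly makes the exit code deliberate. In the one-liner, a connection error also ends with exit code 1, but only because the uncaught exception terminates the interpreter.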
Prerequisites:
- Ensure httpx is included in your project dependencies (pyproject.toml)
- Verify that the /health endpoint doesn't require authentication
- Check that the application is binding to 0.0.0.0, not just 127.0.0.1
Placement in Containerfile.lite:
Add the HEALTHCHECK instruction after the USER 1001 line and before the final CMD instruction.
Additional Debugging Steps:
- Test health endpoint from inside the container:
  docker exec mcpgateway python3 -c "import httpx;print(httpx.get('http://localhost:4444/health').status_code)"
- Check if the app is actually running:
  # ps may be missing from the scratch-based image; docker top reads the process list from the host
  docker top mcpgateway
- Review Docker events and health check logs:
  docker events --filter container=mcpgateway
  docker inspect mcpgateway --format='{{json .State.Health}}'
Testing Requirements:
- Verify health check passes within the start period
- Test with docker-compose to ensure orchestration works
- Test health check behavior during application startup
- Verify health check fails appropriately when app is unhealthy
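The first and last requirements can be automated with a small polling helper. The sketch below is an assumption-laden illustration, not project code: the function names, the 90-second default timeout, and the 2-second poll interval are all hypothetical choices. The status source, sleep and clock are injectable so the logic can be exercised without Docker.

```python
import subprocess
import time


def docker_health_status(container="mcpgateway"):
    """Return the Docker health status, e.g. 'starting', 'healthy', 'unhealthy'.

    Raises subprocess.CalledProcessError if docker inspect exits non-zero
    (for example when the container does not exist).
    """
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_healthy(get_status=docker_health_status, timeout=90.0,
                     poll_interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the status is 'healthy' and return the seconds elapsed.

    Raises RuntimeError if the status becomes 'unhealthy' and TimeoutError
    if it is still not healthy once `timeout` seconds have passed.
    """
    start = clock()
    while True:
        status = get_status()
        elapsed = clock() - start
        if status == "healthy":
            return elapsed
        if status == "unhealthy":
            raise RuntimeError(f"container became unhealthy after {elapsed:.0f}s")
        if elapsed >= timeout:
            raise TimeoutError(f"still {status!r} after {elapsed:.0f}s")
        sleep(poll_interval)
```

In a CI job, calling wait_for_healthy() after make docker-run would fail the build as soon as the container is reported unhealthy instead of reporting a successful deployment.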
Related Configurations:
- May need to adjust health check parameters based on application startup time
- Consider separate readiness and liveness endpoints for Kubernetes
- Health check should not require authentication
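To illustrate the difference between liveness and readiness, here is a minimal standard-library sketch. It is not how the gateway implements its endpoints; the real application would expose these routes through its own web framework. The /ready path, the ready flag and make_probe_server are hypothetical names chosen for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once startup work has finished (hypothetical: connections opened,
# caches warmed). Until then the process is alive but not ready.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status = 200  # liveness: the process is up and answering
        elif self.path == "/ready":
            status = 200 if ready.is_set() else 503  # readiness: safe to route traffic
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the log


def make_probe_server(host="127.0.0.1", port=0):
    """Create (but do not start) the probe server; port 0 picks a free port."""
    return ThreadingHTTPServer((host, port), ProbeHandler)
```

With this split, a Kubernetes livenessProbe would target /health and a readinessProbe would target /ready, so a slow start delays traffic without triggering a restart. Neither route requires authentication.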
Alternative Solutions:
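The docker-compose and Helm deployments already implement health checks. Per the findings above, the application listens on port 4444, so the examples use that port rather than 8000.
- Use docker-compose health check configuration:
  healthcheck:
    # curl must exist inside the image; the scratch-based image does not ship it
    test: ["CMD", "curl", "-f", "http://localhost:4444/health"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 60s
- For Kubernetes, use HTTP probes:
  livenessProbe:
    httpGet:
      path: /health
      port: 4444
    initialDelaySeconds: 60
    periodSeconds: 30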
Related Issues:
- [Security]: Eliminate all lint issues in web stack #338 (Container optimizations)
- Consider implementing separate /ready endpoint for readiness checks
- May need to update Helm charts to reflect proper health check configuration