[CHORE]: Implement Docker HEALTHCHECK #362

@crivetimihai

Priority: Low (Infrastructure/Deployment)

Description:
The Docker container for MCP Gateway fails health checks, getting stuck in "starting" status for extended periods before transitioning to "unhealthy". This prevents proper container orchestration, auto-scaling, and monitoring in production environments.

Steps to Reproduce:

  1. Build and run the Docker container:

    make docker-stop
    make docker-run
  2. Monitor container health status:

    # Check container status
    docker ps --filter name=mcpgateway --format "table {{.Names}}\t{{.Status}}"
    
    # Inspect health status
    docker inspect mcpgateway | jq '.[0].State.Health.Status'
  3. Check container logs:

    docker logs mcpgateway --tail 50

Expected Behavior:

  • Container should pass health checks within 30-60 seconds
  • Health status should show "healthy"
  • Container should respond to health check endpoints

Actual Behavior:

  • Container remains in "health: starting" for ~2 minutes
  • Eventually transitions to "unhealthy" status
  • Health check appears to be failing or missing

Investigation Findings:

  1. Missing HEALTHCHECK instruction: Containerfile.lite does not define a HEALTHCHECK instruction
  2. Application runs on port 4444: The container exposes port 4444, not 8000
  3. Scratch-based image: The container uses a minimal scratch image without common tools like curl or wget
  4. Gunicorn startup delay: Health checks fail with "Server disconnected without sending a response" during gunicorn startup
  5. localhost vs 127.0.0.1: Using localhost may resolve to IPv6 (::1) which the server might not bind to
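
Finding 5 can be checked directly by asking the resolver what `localhost` maps to. The following is a minimal sketch, not part of the project: the helper name `resolved_addresses` is hypothetical, and port 4444 is taken from finding 2.

```python
import socket


def resolved_addresses(host: str, port: int = 4444) -> list[tuple[str, str]]:
    """Return (address family, address) pairs in the order the resolver yields them."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [
        (socket.AddressFamily(family).name, sockaddr[0])
        for family, _type, _proto, _canonname, sockaddr in infos
    ]
```

Calling `resolved_addresses("localhost")` inside the container shows whether an `AF_INET6` entry such as `::1` is listed ahead of `127.0.0.1`. If it is, and the server only listens on IPv4, a client that tries the first address can fail even though the application is up.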

Impact:

  • Container orchestration systems (Kubernetes, Docker Swarm, ECS) cannot properly manage the container
  • Load balancers cannot determine if the container is ready to receive traffic
  • Auto-scaling and self-healing capabilities are broken
  • Production deployments may route traffic to unhealthy instances
  • CI/CD pipelines may incorrectly report successful deployments

Root Cause Analysis:
The issue appears to stem from:

  1. Missing or improperly configured HEALTHCHECK instruction in Containerfile.lite
  2. Lack of health check tools (curl/wget) in the minimal container image
  3. Possible network connectivity issues within the container

Suggested Fix:

Add a Python-based HEALTHCHECK instruction to the Containerfile.lite:

# Add this before the final CMD instruction
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD ["python3", "-c", "import httpx,sys;sys.exit(0 if httpx.get('http://127.0.0.1:4444/health',timeout=5).status_code==200 else 1)"]

This one-liner health check:

  • Uses httpx (which should be available in the project dependencies)
  • Connects to the health endpoint on port 4444
  • Returns exit code 0 (healthy) for HTTP 200 responses
  • Returns exit code 1 (unhealthy) for any other status; connection errors raise an uncaught exception, which also makes Python exit with code 1
  • Has a 5-second timeout to prevent hanging
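
If depending on httpx inside the image turns out to be undesirable, the same behavior can be approximated with the standard library alone. This is a sketch under assumptions, not the project's implementation: the names `probe` and `main` are hypothetical, and the URL assumes the app answers on 127.0.0.1:4444 at /health as described above.

```python
import http.client
import sys
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> int:
    """Return 0 if the URL answers HTTP 200 within the timeout, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status == 200 else 1
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Non-2xx responses, refused connections, timeouts, and malformed
        # replies all count as unhealthy.
        return 1


def main() -> None:
    # Assumed endpoint; adjust if the application listens elsewhere.
    sys.exit(probe("http://127.0.0.1:4444/health"))
```

Saved as a script that ends with a call to `main()`, this could replace the `-c` one-liner in the HEALTHCHECK command. Catching the exceptions explicitly keeps a traceback out of the health log while still reporting exit code 1.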

Prerequisites:

  • Ensure httpx is included in your project dependencies (pyproject.toml)
  • Verify that /health endpoint doesn't require authentication
  • Check that the application binds to 0.0.0.0, not only 127.0.0.1

Placement in Containerfile.lite:
Add the HEALTHCHECK instruction after the USER 1001 line and before the final CMD instruction.

Additional Debugging Steps:

  1. Test health endpoint from inside the container:

    docker exec mcpgateway python3 -c "import httpx;print(httpx.get('http://localhost:4444/health').status_code)"
  2. Check if the app is actually running:

    docker exec mcpgateway ps aux | grep python
  3. Review Docker events and health check logs:

    docker events --filter container=mcpgateway
    docker inspect mcpgateway --format='{{json .State.Health}}'
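
The JSON printed by the last command can be long, since it includes the output of recent probes. A small helper can reduce it to the fields that matter when debugging. This is a sketch: the function name `summarize_health` is hypothetical, and it assumes the usual `Status`, `FailingStreak`, and `Log` fields (with `ExitCode` and `Output` per log entry) in the inspect output, and `null` when no health check is configured.

```python
import json


def summarize_health(raw: str) -> dict:
    """Condense the JSON from `docker inspect --format='{{json .State.Health}}'`."""
    health = json.loads(raw)
    if not health:
        # Assumed to be `null` when the image defines no health check.
        return {"status": "none", "failing_streak": 0,
                "last_exit_code": None, "last_output": ""}
    log = health.get("Log") or []
    last = log[-1] if log else {}
    return {
        "status": health.get("Status", "unknown"),
        "failing_streak": health.get("FailingStreak", 0),
        "last_exit_code": last.get("ExitCode"),
        "last_output": (last.get("Output") or "").strip(),
    }
```

Piping the inspect output into a script that calls `summarize_health(sys.stdin.read())` would show the most recent probe's exit code and message, for example the "Server disconnected without sending a response" error seen during gunicorn startup.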

Testing Requirements:

  • Verify health check passes within the start period
  • Test with docker-compose to ensure orchestration works
  • Test health check behavior during application startup
  • Verify health check fails appropriately when app is unhealthy
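
The first and third requirements could be automated with a small polling loop. The sketch below is an illustration under assumptions: the function names are hypothetical, the container is assumed to be named mcpgateway as in the steps above, and the 90-second default timeout is an arbitrary margin over the 60-second start period.

```python
import subprocess
import time
from typing import Callable


def docker_health_status(container: str = "mcpgateway") -> str:
    """Read the current health status via `docker inspect`."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_status(
    get_status: Callable[[], str],
    timeout: float = 90.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status leaves "starting" or the timeout expires."""
    deadline = clock() + timeout
    status = get_status()
    while status == "starting" and clock() < deadline:
        sleep(interval)
        status = get_status()
    return status
```

A test could call `wait_for_status(docker_health_status)` right after `make docker-run` and assert that the result is "healthy". The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.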

Related Configurations:

  • May need to adjust health check parameters based on application startup time
  • Consider separate readiness and liveness endpoints for Kubernetes
  • Health check should not require authentication

Alternative Solutions:

docker-compose and helm already implement health checks.

  1. Use docker-compose health check configuration:

    healthcheck:
      # Port 4444 per the findings above; curl must be present in the image for this form to work
      test: ["CMD", "curl", "-f", "http://localhost:4444/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
  2. For Kubernetes, use HTTP probes:

    livenessProbe:
      httpGet:
        path: /health
        port: 4444
      initialDelaySeconds: 60
      periodSeconds: 30

Related Issues:

Labels

  • chore: Linting, formatting, dependency hygiene, or project maintenance chores
  • cicd: Issue with CI/CD process (GitHub Actions, scaffolding)
  • devops: DevOps activities (containers, automation, deployment, makefiles, etc)
  • good first issue: Good for newcomers
  • help wanted: Extra attention is needed
  • triage: Issues / Features awaiting triage
