[CHORE]: Implement Docker HEALTHCHECK #362

**Priority:** Low (Infrastructure/Deployment)

**Description:**
The Docker container for MCP Gateway fails health checks, getting stuck in "starting" status for extended periods before transitioning to "unhealthy". This prevents proper container orchestration, auto-scaling, and monitoring in production environments.

**Steps to Reproduce:**
1. Build and run the Docker container:
   ```bash
   make docker-stop
   make docker-run
   ```

2. Monitor container health status:
   ```bash
   # Check container status
   docker ps --filter name=mcpgateway --format "table {{.Names}}\t{{.Status}}"
   
   # Inspect health status
   docker inspect mcpgateway | jq '.[0].State.Health.Status'
   ```

3. Check container logs:
   ```bash
   docker logs mcpgateway --tail 50
   ```

**Expected Behavior:**
- Container should pass health checks within 30-60 seconds
- Health status should show "healthy"
- Container should respond to health check endpoints

**Actual Behavior:**
- Container remains in "health: starting" for ~2 minutes
- Eventually transitions to "unhealthy" status
- Health check appears to be failing or missing

**Investigation Findings:**
1. **Missing HEALTHCHECK instruction**: The `Containerfile.lite` is missing a HEALTHCHECK instruction
2. **Application runs on port 4444**: The container exposes port 4444, not 8000
3. **Scratch-based image**: The container uses a minimal scratch image without common tools like curl or wget
4. **Gunicorn startup delay**: Health checks fail with "Server disconnected without sending a response" during gunicorn startup
5. **localhost vs 127.0.0.1**: Using `localhost` may resolve to IPv6 (::1) which the server might not bind to
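
Finding 5 can be checked directly. The sketch below uses only the Python standard library to list the addresses a host name resolves to; the helper name `resolve_addresses` is illustrative and not part of the project. If `localhost` lists `::1` first while the server listens on IPv4 only, a probe against `localhost` may fail even though the application is up.

```python
import socket


def resolve_addresses(host: str, port: int) -> list[str]:
    """Return the unique IP addresses `host` resolves to, in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses: list[str] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # first element is the address for both IPv4 and IPv6
        if ip not in addresses:
            addresses.append(ip)
    return addresses


if __name__ == "__main__":
    # Port 4444 is taken from finding 2 above.
    for name in ("127.0.0.1", "localhost"):
        try:
            print(name, "->", resolve_addresses(name, 4444))
        except socket.gaierror as exc:
            print(name, "-> resolution failed:", exc)
```

Running this with the container's `python3` shows what the health check would see; results on the host may differ because the container has its own `/etc/hosts`.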

**Impact:**
- Container orchestration systems (Kubernetes, Docker Swarm, ECS) cannot properly manage the container
- Load balancers cannot determine if the container is ready to receive traffic
- Auto-scaling and self-healing capabilities are broken
- Production deployments may route traffic to unhealthy instances
- CI/CD pipelines may incorrectly report successful deployments

**Root Cause Analysis:**
The issue appears to stem from:
1. Missing or improperly configured HEALTHCHECK instruction in `Containerfile.lite`
2. Lack of health check tools (curl/wget) in the minimal container image
3. Possible address resolution issues inside the container (`localhost` resolving to `::1` while the server listens on IPv4 only)

**Suggested Fix:**

Add a Python-based HEALTHCHECK instruction to the Containerfile.lite:

```dockerfile
# Add this before the final CMD instruction
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD ["python3", "-c", "import httpx,sys;sys.exit(0 if httpx.get('http://127.0.0.1:4444/health',timeout=5).status_code==200 else 1)"]
```

This one-liner health check:
- Uses httpx (which should be available in the project dependencies)
- Connects to the health endpoint on port 4444
- Returns exit code 0 (healthy) for HTTP 200 responses
- Returns exit code 1 (unhealthy) for any other status; connection errors raise an uncaught exception, which also makes Python exit with code 1
- Has a 5-second timeout to prevent hanging
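
For readability, the same logic can be written as a small function. This is a sketch, not the project's implementation: it swaps `httpx` for the standard library's `urllib.request` so that it has no third-party dependency, it probes `127.0.0.1` rather than `localhost` in line with finding 5, and the name `check_health` is hypothetical.

```python
import http.client
import sys
import urllib.request

# Assumed values: port and path come from the one-liner above.
HEALTH_URL = "http://127.0.0.1:4444/health"
TIMEOUT_SECONDS = 5


def check_health(url: str = HEALTH_URL, timeout: float = TIMEOUT_SECONDS) -> int:
    """Return 0 if `url` answers with HTTP 200, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status == 200 else 1
    except (OSError, http.client.HTTPException) as exc:
        # OSError covers refused connections, timeouts, and the HTTPError
        # that urllib raises for 4xx/5xx responses; HTTPException covers
        # malformed or truncated responses during server startup.
        print(f"health check failed: {exc}", file=sys.stderr)
        return 1
```

A wrapper that ends with `sys.exit(check_health())` would give Docker the 0/1 exit code it expects from a HEALTHCHECK command, and the message on stderr would show up in the health log.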

**Prerequisites:**
- Ensure `httpx` is included in your project dependencies (pyproject.toml)
- Verify that `/health` endpoint doesn't require authentication
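- Check which address the application binds to: `0.0.0.0` serves both the in-container check on `127.0.0.1` and traffic through the published port, whereas a bind to `127.0.0.1` only would pass the health check but leave the service unreachable from outside the container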

**Placement in Containerfile.lite:**
Add the HEALTHCHECK instruction after the `USER 1001` line and before the final `CMD` instruction.

**Additional Debugging Steps:**
1. Test health endpoint from inside the container:
   ```bash
   docker exec mcpgateway python3 -c "import httpx;print(httpx.get('http://127.0.0.1:4444/health').status_code)"
   ```

2. Check if the app is actually running:
   ```bash
   # `ps` may be missing from the minimal image; `docker top` lists the container's processes from the host
   docker top mcpgateway
   ```

3. Review Docker events and health check logs:
   ```bash
   docker events --filter container=mcpgateway
   docker inspect mcpgateway --format='{{json .State.Health}}'
   ```
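
The JSON printed by the last command can be hard to scan. The sketch below condenses it to one line per recorded probe. It assumes the usual shape of `.State.Health` (`Status`, `FailingStreak`, and a `Log` list whose entries carry `Start`, `ExitCode`, and `Output`), and the function name `summarize_health` is hypothetical.

```python
import json


def summarize_health(health_json: str) -> list[str]:
    """Turn the JSON of `.State.Health` into one line per recorded probe."""
    health = json.loads(health_json)
    if not health:
        # `docker inspect` is expected to print `null` here when the image
        # defines no HEALTHCHECK at all.
        return ["no health check configured"]
    lines = [
        f"status={health.get('Status')} failing_streak={health.get('FailingStreak')}"
    ]
    for entry in health.get("Log") or []:
        output = (entry.get("Output") or "").strip()
        lines.append(f"{entry.get('Start')} exit={entry.get('ExitCode')} {output}")
    return lines
```

Fed the output of `docker inspect mcpgateway --format='{{json .State.Health}}'`, this makes it easier to see whether probes fail with a connection error during gunicorn startup or with a non-200 status afterwards.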

**Testing Requirements:**
- Verify health check passes within the start period
- Test with docker-compose to ensure orchestration works
- Test health check behavior during application startup
- Verify health check fails appropriately when app is unhealthy
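
The first and third requirements could be automated with a small polling helper. This is a minimal sketch under stated assumptions: the names `docker_health_status` and `wait_for_healthy` are hypothetical, the container is called `mcpgateway` as in the steps above, and the 90-second default is a guess based on the 60-second start period plus one 30-second interval.

```python
import subprocess
import time
from typing import Callable


def docker_health_status(container: str) -> str:
    """Return the health status reported by `docker inspect`.

    The command is expected to fail (raising CalledProcessError) if the
    container does not exist or defines no health check.
    """
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_for_healthy(
    get_status: Callable[[], str],
    timeout: float = 90.0,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status is "healthy"; give up on "unhealthy" or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "healthy":
            return True
        if status == "unhealthy" or time.monotonic() >= deadline:
            return False
        sleep(poll_interval)
```

A test run might call `wait_for_healthy(lambda: docker_health_status("mcpgateway"))` right after `make docker-run` and fail the pipeline when it returns `False`. Passing the status source as a callable keeps the polling logic testable without Docker.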

**Related Configurations:**
- May need to adjust health check parameters based on application startup time
- Consider separate readiness and liveness endpoints for Kubernetes
- Health check should not require authentication

**Alternative Solutions:**
> docker-compose and helm already implement health checks.

1. Use docker-compose health check configuration:
   ```yaml
   healthcheck:
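     # The minimal image has no curl, so probe with python3; the app listens on 4444
     test: ["CMD", "python3", "-c", "import httpx,sys;sys.exit(0 if httpx.get('http://127.0.0.1:4444/health',timeout=5).status_code==200 else 1)"]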
     interval: 30s
     timeout: 10s
     retries: 3
     start_period: 60s
   ```

2. For Kubernetes, use HTTP probes:
   ```yaml
   livenessProbe:
     httpGet:
       path: /health
       port: 4444
     initialDelaySeconds: 60
     periodSeconds: 30
   ```

**Related Issues:**
- #338 (Container optimizations)
- Consider implementing separate `/ready` endpoint for readiness checks
- May need to update Helm charts to reflect proper health check configuration
