This repository contains a Docker Compose setup for running TensorZero locally with SGLang and the Qwen3-32B-AWQ model.
TensorZero is a platform for building reliable LLM applications. This setup includes:
- SGLang: Serving the Qwen3-32B-AWQ model locally (using a custom Docker image with vLLM 0.8.4)
- ClickHouse: Database for storing inference data and analytics
- TensorZero Gateway: HTTP API for interacting with models
- TensorZero UI: Web interface for monitoring and analytics
This setup uses a custom SGLang Docker image that includes vLLM 0.8.4 for compatibility. The custom image is built from the Dockerfile in this repository: it is built automatically the first time you run `docker-compose up`, and subsequent runs reuse the cached image unless you rebuild it explicitly.
- Docker and Docker Compose
- NVIDIA GPU with sufficient VRAM (recommended: ~24GB for Qwen3-32B-AWQ)
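As a quick sanity check before starting, you can query the GPU's total VRAM against the ~24GB recommendation above. This is a minimal sketch using `nvidia-smi` (it assumes the NVIDIA driver is installed; it returns `None` rather than failing when it isn't):

```python
import shutil
import subprocess

def gpu_vram_mib():
    """Total VRAM of the first GPU in MiB, or None if it can't be determined."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.splitlines()[0].strip())
    except (subprocess.CalledProcessError, ValueError, IndexError):
        return None

vram = gpu_vram_mib()
if vram is not None and vram < 24 * 1024:
    print(f"Warning: {vram} MiB VRAM may not be enough for Qwen3-32B-AWQ")
```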
- Copy the environment template:
cp .env.example .env
- Edit `.env` and set your API keys:
# Required: OpenAI API key for TensorZero services
OPENAI_API_KEY=your_openai_api_key_here
# Required: HuggingFace token for model access
HF_TOKEN=your_huggingface_token_here
- Clone this repository:
git clone <your-repo-url>
cd tensorzero-sglang
- Set up environment variables (see Environment Setup above)
- Build and start the services:
docker-compose up -d
Note: The first run will take some time as it builds the custom SGLang image with vLLM 0.8.4.
- Wait for all services to be healthy:
docker-compose ps
- Access the TensorZero UI at http://localhost:4000
- Port: 30000
- Model: Qwen3-32B-AWQ
- Health Check: http://localhost:30000/health
- Port: 8123
- Username: chuser
- Password: chpassword
- Database: tensorzero
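With the credentials above you can query the stored inference data directly over ClickHouse's HTTP interface. A minimal standard-library sketch; the `ChatInference` table name is an assumption based on TensorZero's ClickHouse schema:

```python
import base64
import urllib.parse
import urllib.request

def clickhouse_query(sql, host="localhost", port=8123,
                     user="chuser", password="chpassword",
                     database="tensorzero"):
    """Build an authenticated request for ClickHouse's HTTP interface."""
    params = urllib.parse.urlencode({"database": database, "query": sql})
    req = urllib.request.Request(f"http://{host}:{port}/?{params}")
    creds = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {creds}")
    return req

# Count stored chat inferences (table name assumed from TensorZero's schema):
req = clickhouse_query("SELECT count() FROM ChatInference")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode().strip())
```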
- Port: 3020
- API Endpoint: http://localhost:3020
- Port: 4000
- URL: http://localhost:4000
The TensorZero configuration is defined in `config/tensorzero.toml`:
- Model: `qwen3_local`, which points to the SGLang server
- Function: `qwen`, a chat completion function using the local model
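A configuration along these lines would wire the `qwen` function to the local model. This is a hedged sketch, not the exact file contents: the provider name, `api_base`, and `model_name` values are assumptions based on TensorZero's OpenAI-compatible provider and the SGLang port used in this setup.

```toml
[models.qwen3_local]
routing = ["sglang"]

[models.qwen3_local.providers.sglang]
type = "openai"
api_base = "http://sglang:30000/v1"
model_name = "Qwen/Qwen3-32B-AWQ"

[functions.qwen]
type = "chat"

[functions.qwen.variants.default]
type = "chat_completion"
model = "qwen3_local"
```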
from tensorzero import TensorZeroGateway

# Connect to the local TensorZero Gateway
client = TensorZeroGateway("http://localhost:3020")

response = client.inference(
    function_name="qwen",
    input={
        "messages": [
            {"role": "user", "content": "Hello, how are you?"}
        ]
    },
)

# For chat functions, response.content is a list of content blocks
print(response.content)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3020/openai/v1",  # TensorZero's OpenAI-compatible endpoint
    api_key="dummy",  # Not required for local setup
)

response = client.chat.completions.create(
    model="tensorzero::function_name::qwen",  # route through the TensorZero function
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
)

print(response.choices[0].message.content)
curl -X POST http://localhost:3020/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "qwen",
"input": {
"messages": [
{"role": "user", "content": "Hello, how are you?"}
]
}
}'
- TensorZero UI: http://localhost:4000 - View inference analytics, model performance, and system metrics
- SGLang Health: http://localhost:30000/health - Check model server status
- ClickHouse: http://localhost:8123 - Direct database access
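The endpoints above can be polled from a small script, e.g. to block a deployment step until everything is up. A standard-library-only sketch; the gateway's `/health` path and the retry/delay values are assumptions:

```python
import time
import urllib.error
import urllib.request

# URLs for the services above; the gateway's /health path is an assumption.
SERVICES = {
    "sglang": "http://localhost:30000/health",
    "gateway": "http://localhost:3020/health",
    "ui": "http://localhost:4000",
}

def check(url, timeout=5):
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def wait_until_healthy(services=SERVICES, retries=30, delay=10):
    """Poll every service until all report healthy or retries run out."""
    status = {}
    for _ in range(retries):
        status = {name: check(url) for name, url in services.items()}
        if all(status.values()):
            break
        time.sleep(delay)
    return status
```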
- GPU Memory Issues
  - Ensure you have sufficient VRAM (24GB+ recommended)
  - Consider using a smaller model if memory is limited
- Service Startup Failures
  - Check Docker logs: docker-compose logs [service-name]
  - Ensure all environment variables are set correctly
- Model Download Issues
  - Verify your HuggingFace token has access to the model
  - Check internet connectivity for model downloads
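To confirm the token can actually see the model, you can hit the Hugging Face Hub API directly. A standard-library sketch that reads `HF_TOKEN` from the environment; the repo id `Qwen/Qwen3-32B-AWQ` is assumed from this setup:

```python
import json
import os
import urllib.error
import urllib.request

def check_model_access(repo_id="Qwen/Qwen3-32B-AWQ"):
    """Return True if the Hub API reports the repo as visible to HF_TOKEN."""
    req = urllib.request.Request(f"https://huggingface.co/api/models/{repo_id}")
    token = os.environ.get("HF_TOKEN")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp).get("id") == repo_id
    except urllib.error.HTTPError as e:
        print(f"HTTP {e.code}: token may lack access to {repo_id}")
        return False
    except OSError:
        print("Network unreachable; check connectivity")
        return False
```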
# View logs for all services
docker-compose logs -f
# View logs for specific service
docker-compose logs -f sglang
# Rebuild the custom SGLang image (if Dockerfile changes)
docker-compose build sglang
# Rebuild and restart all services
docker-compose up -d --build
# Force rebuild without cache
docker-compose build --no-cache sglang
# Restart services
docker-compose restart
# Stop all services
docker-compose down
# Remove all containers and volumes
docker-compose down -v
- Add model configuration to `config/tensorzero.toml`
- Update the SGLang service in `docker-compose.yml`
- Restart the services

Modify `config/tensorzero.toml` to add:
- New models and providers
- Custom functions and variants
- Routing configurations
- For production use, refer to TensorZero deployment docs
- Use proper authentication in production environments
This project is provided as-is for educational and development purposes.