A Python-based interactive chat interface for Ollama models with advanced model management, error handling, and automatic service recovery capabilities.
- 🤖 Interactive Chat Interface: Clean command-line chat experience with colored output
- 🔄 Dynamic Model Switching: Switch between models on-the-fly with the 's' command
- 🛠️ Advanced Model Management: Load, unload, and manage Ollama models efficiently
- 🚨 Automatic Error Recovery: Intelligent error handling with automatic service restart
- 🎨 Formatted Responses: Color-coded output with special handling for thinking tags
- ⚙️ Configuration Management: Persistent configuration using INI files
- 🔧 Service Monitoring: Automatic Ollama service health checks and recovery
- 📊 Comprehensive Benchmarking: Advanced model performance evaluation with MoE support
- 🚫 Model Blacklisting: Exclude problematic models from benchmarking via configuration
- 🎮 Interactive Benchmark Mode: User-friendly guided benchmark configuration
- 📁 Organized Results: Category-based file output for better data management
- 🎯 GPU Optimization: Multi-GPU configuration tools for optimal performance
- 🧠 Advanced Parameter Control: Fine-tune model behavior with 20+ parameters
- 📷 Vision Model Support: Image input support for multimodal models
- 🤔 Thinking Mode: Automatic detection and support for reasoning models
- 🎛️ Real-time Configuration: Dynamic parameter application from config files
- Python 3.6+
- Ollama installed and configured
requests
library for HTTP communication
- Clone the repository:
git clone https://github.com/flickleafy/ollama_consumer.git
cd ollama_consumer
- Install required dependencies:
pip install requests
- Configure your settings:
cp config.ini.example config.ini
# Edit config.ini with your preferred settings
Run the main script to start the interactive chat:
python main.py
- Chat: Type your message and press Enter
- Switch Model: Type
s
to select a different model - Exit: Type
exit
,quit
, orq
to quit
The application provides comprehensive model management:
- Automatic Loading: Models are loaded on first use
- Smart Unloading: Previous models are unloaded when switching
- Error Recovery: Automatic recovery from model loading errors
- Service Restart: Automatic Ollama service restart if needed
The config.ini
file stores:
- Currently loaded model information
- Service configuration settings
- User preferences
- System prompt: Optional instruction that defines how the model should behave
- Model blacklist: Models to exclude from benchmarking
To set a system prompt that will be applied to all conversations, edit your config.ini
file:
[ollama]
current_loaded_model = llama3.2:latest
system_prompt = You are a helpful AI assistant. Please provide clear, accurate, and concise responses.
The system prompt will be automatically applied to all conversations with any model. Examples of useful system prompts:
You are a helpful coding assistant. Provide clean, well-commented code examples.
You are a creative writing assistant. Help with storytelling and creative ideas.
Respond in a professional and technical manner. Focus on accuracy and detail.
Keep responses brief and to the point. Use bullet points when appropriate.
To remove the system prompt, simply delete the system_prompt
line or leave it empty.
To exclude problematic models from benchmarking, add them to the blacklist section in your config.ini
file:
[blacklist]
# Models to exclude from benchmarking (JSON array format)
models = ["model1:latest", "problematic_model:tag"]
# Alternative formats also supported:
# models = model1:latest, model2:tag, model3:latest
# models = single_model:latest
Blacklisted models will be automatically filtered out from:
- All benchmark operations
- Interactive model selection
- Category-based filtering
This is useful for excluding models that:
- Are known to fail during benchmarking
- Cause system instability
- Take excessively long to load or respond
The Ollama consumer supports a comprehensive set of advanced model parameters that can be configured in your config.ini
file to fine-tune model behavior.
[ollama]
# Generation control
temperature = 0.7 # Creativity/randomness (0.0-2.0)
top_k = 40 # Token sampling pool size
top_p = 0.9 # Nucleus sampling threshold
repeat_penalty = 1.1 # Penalty for repetition (1.0 = no penalty)
seed = -1 # Random seed (-1 = random)
# Response length and context
num_predict = -1 # Max tokens to generate (-1 = model default)
num_ctx = 2048 # Context window size
num_batch = 512 # Batch size for processing
# GPU and memory optimization
num_gpu = -1 # GPU layers to use (-1 = auto)
main_gpu = 0 # Primary GPU index
low_vram = false # Enable low VRAM mode
num_gqa = 1 # Group query attention
num_thread = -1 # CPU threads (-1 = auto)
# Memory management
f16_kv = true # Use 16-bit for key/value cache
use_mmap = true # Use memory mapping
use_mlock = false # Lock memory pages
logits_all = false # Return logits for all tokens
vocab_only = false # Load vocabulary only
# Thinking and reasoning modes
enable_thinking = auto # Enable thinking mode (true/false/auto)
thinking_format = xml # Format for thinking tags (xml/markdown)
reasoning_depth = normal # Reasoning depth (shallow/normal/deep)
# Response format
stream_response = false # Enable streaming responses
raw_response = false # Return raw model output
For models that support image input (like qwen2.5vl, llava, etc.):
# Vision model parameters
enable_vision = auto # Enable vision support (true/false/auto)
image_quality = high # Image processing quality (low/medium/high)
max_image_size = 1024 # Maximum image dimension (pixels)
image_format = auto # Preferred image format (auto/jpeg/png)
For vision-capable models, you can include images in your prompts using the following syntax:
> img:path/to/image.jpg Describe what you see in this image
The system will automatically:
- Detect if the current model supports vision
- Load and encode the image
- Include it in the request to the model
- Show visual indicators when images are processed
Supported image formats: JPEG, PNG, GIF, WebP, BMP
The system automatically detects model capabilities:
Automatically detected by keywords: vision
, visual
, vl
, image
, multimodal
, llava
, moondream
, etc.
Examples: qwen2.5vl:latest
, llava:latest
, moondream:latest
Automatically detected by keywords: reasoning
, think
, thought
, o1
, qwq
, deepseek-r1
, etc.
Examples: phi4-reasoning:latest
, deepseek-r1:latest
, qwq:latest
The chat interface will show capabilities when you select a model:
- 📷 Vision support indicator
- 🧠 Thinking mode indicator
- Auto Values (-1): Let Ollama/model choose optimal settings
- Model-Specific: Parameters are applied intelligently based on model type
- Fallback: Invalid parameters are ignored without breaking functionality
- Override: You can disable config parameters by setting
use_config_params=False
in API calls
[ollama]
temperature = 1.2
top_p = 0.95
repeat_penalty = 1.05
num_predict = 500
Use for creative tasks like story writing, poetry, or brainstorming.
[ollama]
temperature = 0.1
top_k = 10
top_p = 0.5
repeat_penalty = 1.2
Use for factual analysis, code review, or technical documentation.
[ollama]
enable_vision = true
image_quality = high
max_image_size = 2048
temperature = 0.7
Then use in chat:
> img:/path/to/screenshot.png What programming language is shown in this code?
> img:/path/to/chart.jpg Analyze the trends in this graph
[ollama]
enable_thinking = true
thinking_format = xml
reasoning_depth = deep
temperature = 0.8
num_ctx = 4096
For complex reasoning tasks like math problems, logic puzzles, or strategic planning.
[ollama]
# GPU optimization
num_gpu = -1
main_gpu = 0
low_vram = false
f16_kv = true
# Memory efficiency
use_mmap = true
use_mlock = false
num_batch = 1024
num_thread = -1
For maximum performance on high-end hardware.
ollama_consumer/
├── main.py # Main application with chat interface
├── test_ollama.py # Test utilities and functions
├── benchmark.py # Model benchmarking script
├── check_gpu_config.py # GPU configuration checker and optimizer
├── sample_questions.json # Sample benchmark questions
├── config.ini # Configuration file (created from example)
├── config.ini.example # Example configuration
├── LICENSE # GPL v3 License
└── README.md # This file
The application includes robust error handling for common scenarios:
- Connection Issues: Automatic service restart attempts
- Model Loading Errors: Intelligent retry with recovery procedures
- HTTP 500 Errors: Automatic model unloading and service recovery
- Service Down: Multiple restart strategies with fallback options
Run the test suite:
python test_ollama.py
For users with multiple GPUs, this project includes tools and guidance to optimize Ollama's GPU usage for maximum performance.
Use the included GPU configuration checker to analyze your setup and get optimization recommendations:
python check_gpu_config.py
This tool will:
- Detect all NVIDIA GPUs and their capabilities
- Show current GPU utilization and memory usage
- Recommend optimal environment variables for your setup
- Display GPU priority order for best performance
The goal is to prioritize the most capable GPU while using weaker GPUs only when necessary for model splitting:
The check_gpu_config.py
script analyzes your GPUs by:
- Memory capacity (primary factor)
- Current utilization
- Temperature and power consumption
- Available memory
Based on your GPU configuration, set these environment variables:
# Example for dual GPU config (primary 16Gb) + (secondary 8Gb)
# Force Ollama to prioritize GPU 0 first
export CUDA_VISIBLE_DEVICES=0,1
# Use 90% of GPU memory (reserve 10% for system)
export OLLAMA_GPU_MEMORY_FRACTION=0.9
# Limit to one model at a time for optimal GPU usage
export OLLAMA_MAX_LOADED_MODELS=1
# Optional: Force all layers to GPU when possible
export OLLAMA_GPU_LAYERS=-1
With proper configuration:
- ✅ Single-GPU Models: Always load on the most capable GPU
- ✅ Large Models: Automatically split across GPUs when needed
- ✅ Memory Optimization: Reserve 10% GPU memory for system operations
- ✅ Performance Priority: Weaker GPUs used only when primary GPU can't handle the entire model
If you prefer manual configuration:
nvidia-smi --query-gpu=index,name,memory.total --format=csv
Order GPUs by capability in CUDA_VISIBLE_DEVICES
:
# Most capable GPU first, weaker GPUs after
export CUDA_VISIBLE_DEVICES=0,1,2 # Adjust indices based on your setup
# Reserve memory for system operations
export OLLAMA_GPU_MEMORY_FRACTION=0.9
# Limit concurrent models for optimal performance
export OLLAMA_MAX_LOADED_MODELS=1
Add to your shell profile (.bashrc
or .zshrc
):
echo 'export CUDA_VISIBLE_DEVICES=0,1
export OLLAMA_GPU_MEMORY_FRACTION=0.9
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_GPU_LAYERS=-1' >> ~/.zshrc
source ~/.zshrc
# Restart Ollama service
sudo systemctl restart ollama
# Or if using a different service manager:
# sudo service ollama restart
Proper GPU configuration provides:
- 🚀 Faster Model Loading: Models load on the most capable GPU first
- 💾 Efficient Memory Usage: Optimal memory allocation across GPUs
- ⚡ Better Performance: Reduced inference times for most workloads
- 🔄 Smart Fallback: Automatic model splitting only when necessary
- 🎯 Resource Optimization: Weaker GPUs reserved for overflow scenarios
Model Loading Slowly:
- Check if the model is being split unnecessarily
- Verify
CUDA_VISIBLE_DEVICES
order prioritizes your best GPU
GPU Memory Errors:
- Reduce
OLLAMA_GPU_MEMORY_FRACTION
to 0.8 or 0.7 - Ensure no other GPU-intensive processes are running
Models Loading on Wrong GPU:
- Verify environment variables are set correctly
- Restart Ollama service after configuration changes
- Check GPU memory availability with
nvidia-smi
The project includes a comprehensive benchmarking tool to evaluate model performance with advanced features for testing all types of models, including Mixture of Experts (MoE) models.
Run a benchmark with default questions on all available models:
python benchmark.py
Launch the interactive benchmark configuration:
python benchmark.py --interactive
# or simply
python benchmark.py
The interactive mode provides a guided setup with options for:
- Model selection (all, by category, or manual selection)
- Custom questions from files
- Output and logging configuration
- Category filtering
# Quiet mode (less output)
python benchmark.py --quiet
# Custom base filename (creates multiple category files)
python benchmark.py --output my_benchmark
# Filter by model size category
python benchmark.py --category small
python benchmark.py --category medium
python benchmark.py --category large
# Use custom questions from a JSON file
python benchmark.py --custom-questions sample_questions.json
# Show available model categories
python benchmark.py --list-categories
# Combine options
python benchmark.py --output benchmark_2025_01_15 --category medium --quiet
- 🤖 Advanced Model Support: Full support for Mixture of Experts (MoE) models with accurate parameter estimation
- 📊 Model Categorization: Automatic categorization by size (Small ≤16B, Medium 16B-100B, Large >100B)
- 🎯 Smart Parameter Detection: API-driven parameter extraction with pattern-based fallback
- 📁 Organized Output: Separate files for each category to avoid large monolithic files
- ⚡ Load Time Measurement: Records how long each model takes to load
- 🕐 Inference Time Tracking: Measures response time for each question
- 🧹 Response Cleaning: Removes thinking tags to store only final answers
- 🛡️ Robust Error Handling: Graceful handling of keyboard interrupts and model errors
- 📈 Comprehensive Analytics: Detailed performance statistics and comparisons
The benchmark automatically categorizes models:
- Small Models: ≤16B parameters (e.g., Llama 3.2 3B, Gemma 2 9B)
- Medium Models: 16B-100B parameters (e.g., Llama 3.1 70B)
- Large Models: >100B parameters (e.g., Llama 4 405B, MoE models like 128x17B)
- Unknown: Models with undetectable parameter counts
The benchmark includes sophisticated support for Mixture of Experts models:
- Conservative Parameter Estimation: Uses realistic scaling factors instead of linear multiplication
- Pattern Recognition: Detects MoE patterns like.
- API Integration: Prioritizes official API parameter data when available
Create a JSON file with your own questions:
[
"What is artificial intelligence?",
"Explain quantum computing simply.",
"Write a haiku about programming.",
"Solve this problem: What is 15 × 23?",
"Name three advantages of renewable energy."
]
Results are saved as separate JSON files for better organization:
benchmark_2025-01-15-14.30.15_summary.json # Overall results and statistics
benchmark_2025-01-15-14.30.15_small.json # Small models detailed results
benchmark_2025-01-15-14.30.15_medium.json # Medium models detailed results
benchmark_2025-01-15-14.30.15_large.json # Large models detailed results
benchmark_2025-01-15-14.30.15_unknown.json # Unknown size models results
- Summary File: Contains overall benchmark info, category summaries, and performance statistics
- Category Files: Detailed results for each model size category with individual model performance data
- ✅ Manageable File Sizes: Avoid large monolithic files
- ✅ Category-Specific Analysis: Easy to analyze specific model types
- ✅ Faster Loading: Load only the data you need
- ✅ Better Organization: Clear separation by model characteristics
The output includes detailed performance metrics, model comparisons, category statistics, and summary analytics to help you choose the best model for your use case.
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature
) - Commit your changes (
git commit -m 'Add amazing feature'
) - Push to the branch (
git push origin feature/amazing-feature
) - Open a Pull Request
- ModelManager: Handles model loading, unloading, and state management
- Service Management: Automatic Ollama service monitoring and restart
- Error Recovery: Comprehensive error handling and recovery procedures
- UI Components: Color-coded terminal output and user interaction
The application uses configparser
for persistent configuration storage, tracking:
- Current loaded model state
- Service configuration
- User preferences
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
- ✅ Commercial use - You may use this software commercially
- ✅ Modification - You may modify the source code
- ✅ Distribution - You may distribute this software
- ✅ Patent use - You may use any patents that contributors grant
- ✅ Private use - You may use this software privately
Requirements:
- 📋 License and copyright notice - Include license and copyright notice with the software
- 📋 State changes - Document changes made to the software
- 📋 Disclose source - Source code must be made available when distributing the software
- 📋 Same license - Modifications must be released under the same license
Limitations:
- ❌ Liability - This software is provided without warranty
- ❌ Warranty - No warranty is provided
If you encounter any issues or have questions:
- Check the Issues page
- Create a new issue with detailed information
- Include logs and error messages when possible
Enhanced Benchmarking System:
- 🧠 MoE Model Support: Full support for Mixture of Experts models with accurate parameter estimation
- 📊 Model Categorization: Automatic categorization by size (Small/Medium/Large/Unknown)
- 📁 Organized Output: Separate JSON files for each category to avoid large monolithic files
- 🎮 Interactive Mode: Guided benchmark configuration with user-friendly options
- 🎯 Category Filtering: Benchmark specific model size categories
- � Model Blacklisting: Exclude problematic models from benchmarking via configuration
- �🛡️ Enhanced Error Handling: Graceful keyboard interrupt handling and robust error recovery
- 🚀 Performance Improvements: Optimized parameter detection and response processing
Code Quality Improvements:
- 📖 Better Documentation: Comprehensive help text and usage examples
- Initial release with basic chat functionality
- Model management and switching capabilities
- Automatic error recovery and service restart
- Configuration management system
- Comprehensive error handling
- Built for use with Ollama
- Uses the Ollama REST API for model interaction
- Inspired by the need for robust local LLM interaction tools