Evaluate LLM agentic capabilities through combat games.
- Visit: https://llm-fighter.com
- Create Battle: Configure two agents with OpenAI-compatible APIs
- Watch: Real-time strategic combat with detailed visualizations
LLM Fighter is a combat game designed for agentic LLMs. Each battle pits two LLMs against each other, both drawing on a configured set of skills.
Game Mechanics:
- Each skill has programmatically defined effects (damage, healing, etc.) and costs (MP, cooldowns)
- Skills are provided to LLMs as tools, along with a special "thinking" tool for strategic planning
- When an LLM makes a decision (choosing a skill for the current turn), our game engine validates the action
- Invalid moves or insufficient resources result in penalties applied by the engine
- Victory goes to the last LLM standing after multiple rounds of combat
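The mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not LLM Fighter's actual engine: the names `Skill`, `Fighter`, `to_tool`, and `apply_action`, the starting HP/MP values, and the flat `PENALTY_HP` are all assumptions made for the example. The tool description follows the function-tool shape used by OpenAI-compatible chat APIs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Skill:
    """Hypothetical skill definition: effects plus costs."""
    name: str
    damage: int = 0
    heal: int = 0
    mp_cost: int = 0
    cooldown: int = 0  # turns the skill is unavailable after use


@dataclass
class Fighter:
    """Hypothetical per-agent state tracked by the engine."""
    hp: int = 100
    max_hp: int = 100
    mp: int = 50
    cooldowns: dict = field(default_factory=dict)  # skill name -> turns left


PENALTY_HP = 5  # assumed penalty for an invalid action


def to_tool(skill: Skill) -> dict:
    """Describe a skill as an OpenAI-style function tool (no arguments)."""
    return {
        "type": "function",
        "function": {
            "name": skill.name,
            "description": (
                f"Deals {skill.damage} damage, heals {skill.heal} HP, "
                f"costs {skill.mp_cost} MP, cooldown {skill.cooldown} turns."
            ),
            "parameters": {"type": "object", "properties": {}},
        },
    }


def apply_action(actor: Fighter, target: Fighter, skills: dict, chosen: str) -> bool:
    """Validate the chosen skill, then apply it or a penalty. Returns validity."""
    skill = skills.get(chosen)
    valid = (
        skill is not None
        and actor.mp >= skill.mp_cost
        and actor.cooldowns.get(chosen, 0) == 0
    )
    # Tick existing cooldowns once per turn taken by this actor.
    for name, turns in actor.cooldowns.items():
        actor.cooldowns[name] = max(0, turns - 1)
    if not valid:
        actor.hp = max(0, actor.hp - PENALTY_HP)
        return False
    actor.mp -= skill.mp_cost
    actor.hp = min(actor.max_hp, actor.hp + skill.heal)
    target.hp = max(0, target.hp - skill.damage)
    actor.cooldowns[chosen] = skill.cooldown
    return True
```

In this sketch the engine, not the model, owns all state changes: the LLM only names a tool, and an unknown skill, insufficient MP, or an active cooldown each lead to the same penalty path.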
Why This Works: We've found game-based testing to be both engaging and highly effective for evaluating LLM agentic capabilities. Here are key observations:
- Quality Correlation: Well-regarded LLMs typically show higher win rates with logical victory patterns. For example, Claude Sonnet 4 rarely violates game rules.
- Version Comparison: Battles between old and new versions of the same model family reveal clear improvements in agentic capabilities. Gemini 2.5 Flash shows lower violation rates than Gemini 2.0 Flash.
- Beyond Win/Loss: Victory isn't the only metric. Battle intensity (HP margins, combat flow) reveals the magnitude of differences between models.
- Emerging Capabilities: Smaller parameter models are showing impressive performance, such as Mistral's Devstral Small.
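As one way to make the "Beyond Win/Loss" observation concrete, the HP margin of a finished battle could be expressed as a fraction of maximum HP and averaged over several battles. The functions `hp_margin` and `mean_margin` below are hypothetical helpers written for this example; LLM Fighter may compute battle intensity differently.

```python
def hp_margin(winner_hp: int, loser_hp: int, max_hp: int) -> float:
    """Final HP gap as a fraction of max HP (near 0.0 = close fight)."""
    if max_hp <= 0:
        raise ValueError("max_hp must be positive")
    return (winner_hp - loser_hp) / max_hp


def mean_margin(results: list, max_hp: int) -> float:
    """Average margin over (winner_hp, loser_hp) pairs from several battles."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(hp_margin(w, l, max_hp) for w, l in results) / len(results)
```

A model that wins most battles with a small average margin is only slightly ahead of its opponent, while a large average margin points to a wider capability gap.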
- Mobile-optimized game detail display
- Support for private games
- Support for creating games via CLI
- Support for more customizable parameters when creating games