Nexa SDK is an on-device inference framework that runs any model on any device, across any backend. It runs on CPUs, GPUs, and NPUs, with backend support for CUDA, Metal, Vulkan, and Qualcomm, Intel, and AMD NPUs. It handles multiple input modalities, including text 📝, image 🖼️, and audio 🎧. The SDK includes an OpenAI-compatible API server with support for JSON-schema-based function calling and streaming, and it supports model formats such as GGUF, MLX, and Nexa AI's own `.nexa` format, enabling efficient quantized inference across diverse platforms.
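As a quick sketch of how that server can be called once it is running (see `nexa serve` in the command table below; the host, port, and model name here are illustrative assumptions):

```bash
# Minimal request against a locally running `nexa serve` instance.
# The /v1/chat/completions path and request body follow the OpenAI
# API convention; set "stream": true to receive streamed tokens.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/Qwen3-1.7B-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```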
- We support IBM Granite 4.0 with Nexa SDK on Day-0!
  - Try it on AMD / Intel / Qualcomm / Apple GPU with `nexa infer NexaAI/granite-4.0-micro-GGUF`, and on Qualcomm NPU with `nexa infer NexaAI/Granite-4-Micro-NPU`
- Image Generation with SDXL on AMD NPU
- LLM inference with DeepSeek-r1-distill-Qwen-1.5B and Llama3.2-3B on Intel NPU
- Real-time speech recognition with Parakeet v3 model
- First-ever Gemma-3n multimodal inference for GPU & CPU, in GGUF format.
- SDXL image generation from Civitai for GPU
- EmbeddingGemma for Qualcomm NPU
- Phi4-mini turbo and Phi3.5-mini for Qualcomm NPU
- Parakeet V3 model for Qualcomm NPU
- Nexa ML Turbo engine for optimized NPU performance
  - Try Phi4-mini Turbo and Llama3.2-3B-NPU-Turbo
  - 80% faster at shorter contexts (≤2048 tokens) and 33% faster at longer contexts (>2048 tokens) than current NPU solutions
- Unified interface supporting NPU/GPU/CPU backends:
  - Single installer architecture eliminating dependency conflicts
  - Lazy loading and plugin isolation for improved performance
- OmniNeural-4B: the first multimodal AI model built natively for NPUs — handling text, images, and audio in one model
  - Check out the model and demos in the Hugging Face repo
  - Read our OmniNeural-4B technical blog
- Support for Parakeet and Kokoro models in MLX format
- New `/mic` mode to transcribe live speech directly in your terminal (see the sketch below)
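As a rough sketch of what that could look like (assumptions: `/mic` is entered at the prompt of an interactive `nexa infer` session running a speech-recognition model, and the repo name below is a placeholder rather than a confirmed identifier):

```bash
# Hypothetical session; the model repo name is a placeholder.
nexa infer NexaAI/parakeet-v3   # load a speech-recognition model
> /mic                          # start transcribing live microphone input
```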
```bash
curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_x86_64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
```
You can run any compatible GGUF, MLX, or `.nexa` model from 🤗 Hugging Face by using its full repo name, e.g. `NexaAI/OmniNeural-4B`.
> [!TIP]
> You need to download the arm64 build with Qualcomm NPU support and make sure your laptop has a Snapdragon® X Elite chip.
- **Login & Get Access Token (required for Pro Models)**

  - Create an account at sdk.nexa.ai
  - Go to Deployment → Create Token
  - Run this once in your terminal (replace `<your_token_here>` with your token):

    ```bash
    nexa config set license '<your_token_here>'
    ```
- **Run and chat with our multimodal model, OmniNeural-4B, or other models on NPU**

  ```bash
  nexa infer omni-neural
  nexa infer NexaAI/OmniNeural-4B
  nexa infer NexaAI/qwen3-1.7B-npu
  ```
> [!TIP]
> GGUF runs on macOS, Linux, and Windows.

📝 Run and chat with LLMs, e.g. Qwen3:

```bash
nexa infer ggml-org/Qwen3-1.7B-GGUF
```

🖼️ Run and chat with multimodal models, e.g. Qwen2.5-Omni:

```bash
nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF
```
> [!TIP]
> MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably. We recommend starting with models from our curated NexaAI Collection for best results.

📝 Run and chat with LLMs, e.g. Qwen3:

```bash
nexa infer NexaAI/Qwen3-4B-4bit-MLX
```

🖼️ Run and chat with multimodal models, e.g. Gemma3n:

```bash
nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX
```
| Essential Command | What it does |
|---|---|
| `nexa -h` | Show all CLI commands |
| `nexa pull <repo>` | Interactive download & cache of a model |
| `nexa infer <repo>` | Local inference |
| `nexa list` | Show all cached models with sizes |
| `nexa remove <repo>` / `nexa clean` | Delete one / all cached models |
| `nexa serve --host 127.0.0.1:8080` | Launch an OpenAI-compatible REST server |
| `nexa run <repo>` | Chat with a model via an existing server |
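For example, a minimal server workflow might look like this (using the MLX model from the example above; host and port as shown in the table):

```bash
# Start the OpenAI-compatible REST server in one terminal...
nexa serve --host 127.0.0.1:8080

# ...then, from a second terminal, chat with a model through that server.
nexa run NexaAI/Qwen3-4B-4bit-MLX
```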
👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI — you can even drop multiple images at once!
See CLI Reference for full commands.
We would like to thank the following projects: