Nexa SDK is an on-device inference framework that runs any model on any device, across any backend. It runs on CPUs, GPUs, and NPUs, with backend support for CUDA, Metal, Vulkan, and Qualcomm, Intel, and AMD NPUs. It handles multiple input modalities, including text 📝, image 🖼️, and audio 🎧. The SDK includes an OpenAI-compatible API server with support for JSON-schema-based function calling and streaming, and it supports model formats such as GGUF, MLX, and Nexa AI's own `.nexa` format, enabling efficient quantized inference across diverse platforms.
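As a quick sketch of how that server can be called once it is running (see `nexa serve` in the command table below; the host, port, and model name here are illustrative assumptions):

```bash
# Minimal request against a locally running `nexa serve` instance.
# The /v1/chat/completions path and request body follow the OpenAI
# API convention; set "stream": true to receive streamed tokens.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/Qwen3-1.7B-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```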
- We support IBM Granite 4.0 with Nexa SDK on Day-0!
  - Try it on AMD / Intel / Qualcomm / Apple GPU with `nexa infer NexaAI/granite-4.0-micro-GGUF`, and on Qualcomm NPU with `nexa infer NexaAI/Granite-4-Micro-NPU`
- Image Generation with SDXL on AMD NPU
- LLM inference with DeepSeek-r1-distill-Qwen-1.5B and Llama3.2-3B on Intel NPU
- Real-time speech recognition with Parakeet v3 model
- First-ever Gemma-3n multimodal inference for GPU & CPU, in GGUF format.
- SDXL image generation from Civitai for GPU
- EmbeddingGemma for Qualcomm NPU
- Phi4-mini turbo and Phi3.5-mini for Qualcomm NPU
- Parakeet V3 model for Qualcomm NPU
- Nexa ML Turbo engine for optimized NPU performance
  - Try Phi4-mini Turbo and Llama3.2-3B-NPU-Turbo
  - 80% faster at shorter contexts (≤2048 tokens) and 33% faster at longer contexts (>2048 tokens) than current NPU solutions
- Unified interface supporting NPU/GPU/CPU backends:
  - Single installer architecture eliminating dependency conflicts
  - Lazy loading and plugin isolation for improved performance
- OmniNeural-4B: the first multimodal AI model built natively for NPUs — handling text, images, and audio in one model
  - Check out the model and demos in the Hugging Face repo
  - Read our OmniNeural-4B technical blog
- Support for Parakeet and Kokoro models in MLX format
- New `/mic` mode to transcribe live speech directly in your terminal (see the sketch below)
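As a rough sketch of what that could look like (assumptions: `/mic` is entered at the prompt of an interactive `nexa infer` session running a speech-recognition model, and the repo name below is a placeholder rather than a confirmed identifier):

```bash
# Hypothetical session; the model repo name is a placeholder.
nexa infer NexaAI/parakeet-v3   # load a speech-recognition model
> /mic                          # start transcribing live microphone input
```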
```bash
curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_x86_64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
```
You can run any compatible GGUF, MLX, or `.nexa` model from 🤗 Hugging Face by using its full repo name, e.g. `NexaAI/OmniNeural-4B`.
> [!TIP]
> You need to download the arm64 build with Qualcomm NPU support and make sure your laptop has a Snapdragon® X Elite chip.
- **Login & Get Access Token (required for Pro Models)**

  - Create an account at sdk.nexa.ai
  - Go to Deployment → Create Token
  - Run this once in your terminal (replace `<your_token_here>` with your token):

    ```bash
    nexa config set license '<your_token_here>'
    ```
- **Run and chat with our multimodal model, OmniNeural-4B, or other models on NPU**

  ```bash
  nexa infer omni-neural
  nexa infer NexaAI/OmniNeural-4B
  nexa infer NexaAI/qwen3-1.7B-npu
  ```
> [!TIP]
> GGUF runs on macOS, Linux, and Windows.

📝 Run and chat with LLMs, e.g. Qwen3:

```bash
nexa infer ggml-org/Qwen3-1.7B-GGUF
```

🖼️ Run and chat with multimodal models, e.g. Qwen2.5-Omni:

```bash
nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF
```
> [!TIP]
> MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably. We recommend starting with models from our curated NexaAI Collection for best results.

📝 Run and chat with LLMs, e.g. Qwen3:

```bash
nexa infer NexaAI/Qwen3-4B-4bit-MLX
```

🖼️ Run and chat with multimodal models, e.g. Gemma3n:

```bash
nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX
```
| Essential Command | What it does |
|---|---|
| `nexa -h` | Show all CLI commands |
| `nexa pull <repo>` | Interactive download & cache of a model |
| `nexa infer <repo>` | Local inference |
| `nexa list` | Show all cached models with sizes |
| `nexa remove <repo>` / `nexa clean` | Delete one / all cached models |
| `nexa serve --host 127.0.0.1:8080` | Launch an OpenAI-compatible REST server |
| `nexa run <repo>` | Chat with a model via an existing server |
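For example, a minimal server workflow might look like this (using the MLX model from the example above; host and port as shown in the table):

```bash
# Start the OpenAI-compatible REST server in one terminal...
nexa serve --host 127.0.0.1:8080

# ...then, from a second terminal, chat with a model through that server.
nexa run NexaAI/Qwen3-4B-4bit-MLX
```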
👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI — you can even drop multiple images at once!
See CLI Reference for full commands.
We would like to thank the following projects: