mmojo-server

Based on llama.cpp.

Brad Hutchings
[email protected]


Project Goals

The goal of this project is to build a single mmojo-server executable file that can run "anywhere":

  • x86_64 Windows
  • x86_64 Linux
  • ARM Windows
  • ARM Linux
  • ARM macOS

I am inspired by the llamafile project. The main drawback of that project is that it has not kept up to date with llama.cpp, and therefore does not always support the latest models when llama.cpp does. Support for new models in llamafile takes work and time.

I want to use the MIT license as used by llama.cpp.

GPU support is not important to me and can be handled by platform-specific builds of llama.cpp. CPU inference is quite adequate for many private end-user applications. Generic CPU inference is implemented. ARM- and x86-tuned CPU inference is not implemented yet.

The ability to package support files, such as a custom web UI, into the executable file is important to me. This is implemented.

The ability to package default arguments, in an "args" file, into the executable file is important to me. This is implemented.

The ability to read arguments from a file adjacent to the executable file is important to me. This is implemented.
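
As an illustration, an args file might look like the following. This is a hypothetical example: the exact format is an assumption on my part (one argument per line, mirroring llamafile's .args convention), though the flags shown are standard llama-server flags:

    -m
    model.gguf
    --host
    127.0.0.1
    --port
    8080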

The ability to package a gguf model into the executable file is important to me. This is not implemented yet.

I welcome any of my changes being implemented in the official llama.cpp.


Documentation

Follow these guides in order to build, package, and deploy mmojo-server:

  • My start-to-finish guide for building mmojo-server with Cosmo starts here.

Modifications to llama.cpp

To get this from the llama.cpp source base, there are a few files that need to be modified:

  1. common/arg.cpp -- Added a parameter for sleep after each batch (see the sketch after this list).

  2. common/common.cpp -- Location of cache directory for COSMOCC builds.

  3. common/common.h -- Added a parameter for sleep after each batch.

  4. tools/server/server.cpp -- Support an embedded or adjacent "args" file, fix a Cosmo name conflict with the "defer" task member, add additional metadata to model_meta, stream reporting of evaluation progress, and more.

  5. completion-ui -- Default UI is Mmojo Completion.

  6. tools/server/public/loading-mmojo.html -- Loading page matches Mmojo Completion theme.
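
As a rough illustration of the "sleep after each batch" change in items 1 and 3, here is a minimal, self-contained C++ sketch. The field and flag names are my assumptions for illustration, not the actual identifiers used in common/common.h:

    // Minimal sketch of "sleep after each batch". The parameter name
    // sleep_after_batch_ms is hypothetical, not the real field name.
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct common_params {
        int sleep_after_batch_ms = 0; // 0 = disabled
    };

    static void process_all_batches(const common_params & params, int n_batches) {
        for (int i = 0; i < n_batches; i++) {
            // ... decode one batch here ...
            printf("decoded batch %d/%d\n", i + 1, n_batches);
            if (params.sleep_after_batch_ms > 0) {
                // Pause between batches, presumably to trade throughput
                // for lower sustained CPU load on end-user machines.
                std::this_thread::sleep_for(
                    std::chrono::milliseconds(params.sleep_after_batch_ms));
            }
        }
    }

    int main() {
        common_params params;
        params.sleep_after_batch_ms = 5; // e.g. from a --sleep-after-batch flag
        process_all_batches(params, 3);
        return 0;
    }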


Reference

Here are some projects and pages you should be familiar with if you want to get the most out of mmojo-server:

  • llama.cpp - Georgi Gerganov and his team are the rock stars who are making the plumbing so LLMs can be available for developers of all kinds. The llama.cpp project is the industry standard for inference. I only fork it here because I want to make it a little better for my applications while preserving all its goodness.
  • llamafile - Llamafile lets you distribute and run LLMs with a single file. It is a Mozilla Foundation project that brought the Cosmopolitan C Library and llama.cpp together. It has support for some popular GPUs. It is based on an older version of llama.cpp and does not support all of the latest models supported by llama.cpp. Llamafile is an inspiration for this project.
  • Cosmopolitan Libc - Cosmopolitan is a project for building cross-platform binaries that run on x86_64 and ARM architectures, supporting Linux, Windows, macOS, and other operating systems. Like llamafile, I use Cosmo to compile cross-platform executables of llama.cpp targets, including llama-server.
  • Actually Portable Executable (APE) Specification - Within the Cosmopolitan Libc repo is documentation about how the cross CPU, cross platform executable works.
  • Brad's LLMs - I share private local LLMs built with llamafile in a Hugging Face repo.

To Do List

In no particular order of importance, these are the things that bother me:

  • Package the gguf file into the executable file. The zip item needs to be aligned for mmap. There is a zipalign.c tool source in llamafile that seems loosely inspired by the Android zipalign tool. I feel like there should be a more generic solution for this problem (see the alignment sketch after this list).
  • GPU support without a complicated kludge, one that can cover all supported platform / CPU / GPU triads. Perhaps a plugin system with shared library dispatch? Invoking dev tools on Apple Metal like llamafile does is "complicated".
  • Code signing instructions. Might have to sign executables within the zip package, plus the package itself.
  • Clean up remaining build warnings, either by fixing source (i.e. Cosmo) or finding the magical compiler flags.
  • Copy the cosmo_args function into server.cpp so it could potentially be incorporated upstream in non-Cosmo builds. common/arg2.cpp might be a good landing spot. The license in the Cosmo source code appears to be MIT-compatible with attribution.
    • The args mechanism is cute, but it might be easier as a YAML file: key-value pairs, with flags as keys that have null values.
  • The --ctx-size parameter doesn't seem quite right, given that new models carry their training (or maximum) context size in their metadata. That size should be used, subject to a maximum set by a passed parameter, so that, e.g., a 128K model can run comfortably on a smaller device (see the clamping sketch after this list).
  • Write docs for a Deploying step. It should address the args file; removing the extra executable, depending on platform; and models, host, port, and context size.
  • Make a .gitattributes file so we can set the default file to be displayed and keep the README.md from llama.cpp. This will help in syncing changes continually from upstream. Reference: https://git-scm.com/docs/gitattributes -- This doesn't actually work.
  • Cosmo needs libssl and libcrypto. Building these from scratch initially produced an error about Cosmo not liking assembly files. This has been sorted out and is now implemented.
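
On the zip-alignment to-do above: the core computation a zipalign-style tool performs is small, even if the zip bookkeeping around it is not. A minimal sketch, assuming 4 KiB pages (llamafile's zipalign.c is the real reference; this helper is hypothetical):

    // Compute padding so a zip entry's stored data starts page-aligned,
    // which is what mmap requires.
    #include <cstdint>
    #include <cstdio>

    static uint64_t zip_padding(uint64_t data_offset, uint64_t alignment) {
        return (alignment - (data_offset % alignment)) % alignment;
    }

    int main() {
        const uint64_t page = 4096; // typical page size
        const uint64_t offsets[] = {0, 100, 4096, 8191};
        for (uint64_t off : offsets) {
            printf("data at %llu -> pad %llu bytes\n",
                   (unsigned long long) off,
                   (unsigned long long) zip_padding(off, page));
        }
        return 0;
    }

In practice the padding would typically go into the local file header's extra field, as Android's zipalign does, so readers still see a valid zip.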
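
On the --ctx-size item: the proposed policy reduces to clamping the model's trained context size by an optional user-supplied maximum. A sketch with illustrative names (not the actual llama.cpp API):

    // effective context = min(model's trained context, user cap),
    // where cap <= 0 means "no cap". Names here are illustrative only.
    #include <algorithm>
    #include <cstdio>

    static int effective_ctx_size(int n_ctx_train, int user_max_ctx) {
        if (user_max_ctx <= 0) {
            return n_ctx_train; // no cap requested: use trained context
        }
        return std::min(n_ctx_train, user_max_ctx);
    }

    int main() {
        // A 128K-context model capped at 16K for a smaller device.
        printf("%d\n", effective_ctx_size(131072, 16384)); // 16384
        // A model trained below the cap keeps its own size.
        printf("%d\n", effective_ctx_size(8192, 16384));   // 8192
        return 0;
    }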
