Using llama.cpp With llama-wrap

Introduction

llama.cpp is one of the most flexible ways to run local AI models on your own computer. It lets you run GGUF models through tools such as llama-server, which can then be used by chat apps, coding tools, local agents, or anything that can talk to a local API.

The problem is that llama-server commands can get long very quickly. Once you start changing model paths, context size, GPU layers, ports, cache settings, draft models, and extra flags, it becomes easy to forget what command worked last time.

That is why I made llama-wrap. It is not a chat UI. It is a small desktop launcher that helps you build, save, import, and run llama-server commands without typing the full command every time.

In this post, I want to show you how to install llama.cpp, run llama-wrap, launch a local model, and save that setup as a preset.

[!NOTE] ℹ️ Note This tutorial focuses on using llama-wrap as a launcher for llama-server. You still need to bring your own GGUF model file.

What You Need

Before starting, make sure you have:

A Mac or Linux machine
Access to Terminal
Python 3.10 or newer
Tkinter for your Python installation
A GGUF model file
llama.cpp with llama-server built or installed

[!WARNING] ⚠️ For Windows Users You can use llama-wrap on Windows, but this tutorial uses Mac and Linux terminal commands. If you are on Windows, you can either use the Windows release build of llama-wrap, or adapt the commands for PowerShell.

Setting Up llama.cpp

Installing the build tools

First, make sure you have Git, CMake, and a compiler installed.

On Ubuntu or Debian, run:

sudo apt update
sudo apt install git cmake build-essential

On Mac, you can install the command line tools with:

xcode-select --install

If you use Homebrew, you can also install the build tools with:

brew install git cmake
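
If you are not sure whether everything is installed, a small check like the one below can tell you. This is only a sketch: the function name `missing_tools` is made up for this post, and it assumes your compiler is reachable as `cc`, which is common on Mac and Linux but not guaranteed.

```shell
# Print the name of every listed tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: the compiler is available under the generic name "cc".
missing=$(missing_tools git cmake cc)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "Missing tools:" $missing
else
    echo "All build tools found"
fi
```

If a tool is reported as missing, install it with the commands above and run the check again.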

Downloading llama.cpp

Now download llama.cpp:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Then build llama-server:

cmake -B build
cmake --build build --config Release -t llama-server

When the build is done, test that llama-server runs:

./build/bin/llama-server --help

If you see the help output, llama.cpp is ready.

[!TIP] 💡 Tip If you have an Nvidia GPU, AMD GPU, or another acceleration setup, you should read the official llama.cpp build guide for your hardware. The basic build works for testing, but GPU builds can be much faster.

Making llama-server easier to launch

You can use llama-wrap with the full path to llama-server, so this step is optional.

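If you want to make the command available everywhere, create a symbolic link. Run this from inside the llama.cpp folder, because the command uses the current directory to find the build: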

sudo ln -s "$PWD/build/bin/llama-server" /usr/local/bin/llama-server

Then test it:

llama-server --help

If that works, llama-wrap can use the default llama.cpp executable setting.

Getting A GGUF Model

llama-wrap does not download models for you. You need at least one GGUF model file on your computer.

You can get GGUF models from places such as Hugging Face. Look for model files ending in:

.gguf

For a first test, choose a small model so you can make sure the setup works before trying larger models.

For example, you might keep your models in:

mkdir -p ~/models

Then put your downloaded GGUF file inside that folder.
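
To see which models you already have and roughly how big they are, you could list the folder with a helper like this. It is a sketch that assumes a flat folder such as ~/models; the function name `list_gguf` is just for illustration.

```shell
# Print each .gguf file in a folder with its size in MiB.
list_gguf() {
    dir="${1:-$HOME/models}"
    for f in "$dir"/*.gguf; do
        # If nothing matched, the pattern is left as-is; skip it.
        [ -e "$f" ] || continue
        size_bytes=$(wc -c < "$f")
        echo "$(( size_bytes / 1048576 )) MiB  $f"
    done
}

list_gguf "$HOME/models"
```

The size is a quick hint about whether a model is likely to fit in your RAM or VRAM.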

[!TIP] 💡 Tip If you are not sure which quantization to download, start with a Q4 file. It is smaller than the Q8 or F16 version of the same model, so it is usually easier to run on machines with limited RAM or VRAM.

Setting Up llama-wrap

Downloading llama-wrap

Now download llama-wrap:

git clone https://github.com/chelij/llama-wrap
cd llama-wrap

There are no third-party Python dependencies. llama-wrap uses the Python standard library.

To run it from source:

python llamawrap.py

If your system uses python3 instead of python, run:

python3 llamawrap.py

This should open the llama-wrap desktop window.

[!WARNING] ⚠️ Tkinter Missing If the window does not open and Python says tkinter is missing, install Tkinter for your system. On Ubuntu or Debian, try sudo apt install python3-tk.
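
You can also check for Tkinter before launching llama-wrap. The sketch below simply asks Python to import the module; the function name `check_tkinter` is hypothetical and not part of llama-wrap.

```shell
# Report whether the given Python interpreter can import tkinter.
check_tkinter() {
    py="${1:-python3}"
    if "$py" -c "import tkinter" 2>/dev/null; then
        echo "tkinter is available for $py"
    else
        echo "tkinter is missing for $py"
        return 1
    fi
}

check_tkinter python3 || true
```

If it reports that tkinter is missing, install it for the same Python you use to run llamawrap.py.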

Checking the llama-server executable

In llama-wrap, look for the executable field.

If llama-server works in your Terminal, you can leave the executable as:

llama-server

If it does not, use the full path to your built server.

For example:

/home/you/llama.cpp/build/bin/llama-server

On Mac, it may look more like:

/Users/you/llama.cpp/build/bin/llama-server

The important thing is that the path points to the actual llama-server executable.
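
If you want to confirm a value before pasting it into the executable field, a check along these lines may help. It is a sketch with a made-up function name, `check_executable`, and it only tests that the file exists and is executable, not that it is really llama-server.

```shell
# Report whether an executable setting can be run.
# Accepts a bare command name (looked up on PATH) or a full path.
check_executable() {
    exe="$1"
    case "$exe" in
        */*)
            # Contains a slash: treat it as a path to a file.
            if [ -f "$exe" ] && [ -x "$exe" ]; then
                echo "ok: $exe"
            else
                echo "not executable: $exe"
                return 1
            fi
            ;;
        *)
            # Bare name: ask the shell where it would find it on PATH.
            if found=$(command -v "$exe"); then
                echo "ok: $found"
            else
                echo "not on PATH: $exe"
                return 1
            fi
            ;;
    esac
}

check_executable llama-server || true
check_executable "$HOME/llama.cpp/build/bin/llama-server" || true
```

Whichever form prints "ok" is the one to use in llama-wrap.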

Launching Your First Model

Choosing the model file

In llama-wrap, choose your main model GGUF file.

This is the file that would normally be passed to llama-server with:

-m /path/to/model.gguf

You do not need to type that by hand. Use the model browse button and select the file.

Setting basic flags

For a first launch, keep the setup simple.

Good starting values are:

Host: 127.0.0.1
Port: 8080
Context: 4096
GPU layers: auto if your build supports it, or a number that fits your machine

The exact options shown can change depending on the selected inferer. For normal llama.cpp, choose the llama.cpp inferer profile.

[!NOTE] ℹ️ Note If the launch fails, for example because the model runs out of memory, lower the context size or the GPU layers and try again. Local model setup is often a matter of matching the model size, context size, and GPU settings to your computer.
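
For reference, the starting values above correspond roughly to the command that this sketch prints. The model path is a placeholder, the function name `build_basic_command` is made up, and the exact command llama-wrap builds may differ. The GPU layers flag is only added if you pass a value.

```shell
# Print a llama-server command for the basic starting values.
# $1 = model path, $2 = optional GPU layers value.
build_basic_command() {
    model="$1"
    ngl="${2:-}"
    # Build the argument list step by step with `set --`.
    set -- llama-server -m "$model" --host 127.0.0.1 --port 8080 -c 4096
    if [ -n "$ngl" ]; then
        set -- "$@" -ngl "$ngl"
    fi
    # Printed for comparison only; paths with spaces would need quoting.
    printf '%s\n' "$*"
}

# Hypothetical model path and layer count; replace with your own.
build_basic_command "$HOME/models/model.gguf" 20
```

Comparing this with the command preview in llama-wrap is a quick way to see what each field turns into.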

Starting the server

Press Launch in llama-wrap.

The output panel should begin showing llama-server logs. Wait until the model finishes loading.

Once the server is ready, open this in your browser:

http://127.0.0.1:8080

If the llama.cpp web UI opens, your local server is working.

You can also test the health endpoint from Terminal:

curl http://127.0.0.1:8080/health

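If the server is ready, it typically returns an HTTP 200 response with a small JSON body such as {"status":"ok"}. While the model is still loading, it usually answers with a 503 error instead.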
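
Large models can take a while to load, so it can be handy to poll the health endpoint instead of retrying by hand. Here is a minimal sketch, assuming curl is installed and the server uses the default host and port from this post. The function name `wait_for_server` is hypothetical.

```shell
# Poll the health endpoint once per second until it succeeds or we give up.
# $1 = health URL, $2 = maximum number of attempts.
wait_for_server() {
    url="${1:-http://127.0.0.1:8080/health}"
    tries="${2:-60}"
    while [ "$tries" -gt 0 ]; do
        # -f makes curl exit non-zero on HTTP errors, such as a 503 while loading.
        if curl -sf -o /dev/null --max-time 2 "$url"; then
            echo "server is ready"
            return 0
        fi
        tries=$(( tries - 1 ))
        sleep 1
    done
    echo "server did not become ready"
    return 1
}
```

After pressing Launch, run `wait_for_server` in Terminal. If you changed the port, pass the full URL, for example `wait_for_server http://127.0.0.1:8081/health`.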

Saving The Setup As A Preset

Once you have a launch that works, save it as a preset.

Presets are useful because they remember things like:

Model path
Draft model path
MMProj path
Enabled flags and values
Extra args
Selected inferer
Executable path

In llama-wrap, enter a preset name and save it.

For example:

Qwen Local Server 8080

The next time you open llama-wrap, you can load the preset instead of rebuilding the command again.

[!NOTE] ℹ️ Note When running from source, presets are stored in history.json next to llamawrap.py. When running a packaged release, they are stored next to the packaged launcher.
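
Because all presets live in one file, it may be worth copying it before you experiment. This is a sketch that assumes you run it from the folder containing history.json; the function name `backup_presets` is made up, and llama-wrap does not need the backup itself.

```shell
# Copy the preset file to a timestamped backup and print the backup path.
backup_presets() {
    src="${1:-history.json}"
    if [ ! -f "$src" ]; then
        echo "no preset file at $src"
        return 1
    fi
    dest="$src.$(date +%Y%m%d-%H%M%S).bak"
    cp "$src" "$dest" && echo "$dest"
}

backup_presets history.json || true
```

To restore, copy the backup over history.json while llama-wrap is closed.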

Importing An Existing llama-server Command

If you already have a working llama-server command, you can import it into llama-wrap.

For example:

llama-server -m /models/model.gguf -ngl auto -c 32768 --host 127.0.0.1 --port 8081 -fa auto

Use the import button in llama-wrap, paste the command, and confirm.

Recognized values will be loaded into the UI. Unrecognized or advanced flags will be preserved in Extra args.

This is useful when you find a good command online or build one manually first, then want to make it easier to reuse.
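
To picture what an import does, here is a toy version of the idea: known flags are picked out with their values, and everything else is kept untouched. This is not llama-wrap's actual parser. The function name `split_args` and the short list of "recognized" flags are arbitrary choices for illustration, and llama-wrap itself may recognize many more.

```shell
# Split llama-server arguments into flags this sketch knows about
# and everything else, which is kept in order as extra args.
split_args() {
    recognized=""
    extra=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -m|-c|-ngl|--host|--port)
                if [ "$#" -ge 2 ]; then
                    # Known flag followed by its value.
                    recognized="$recognized $1 $2"
                    shift 2
                else
                    # Known flag with no value left: keep it as-is.
                    extra="$extra $1"
                    shift
                fi
                ;;
            *)
                extra="$extra $1"
                shift
                ;;
        esac
    done
    echo "recognized:$recognized"
    echo "extra:$extra"
}

split_args -m /models/model.gguf -ngl auto -c 32768 --host 127.0.0.1 --port 8081 -fa auto
```

With the example command, the sketch treats `-fa auto` as extra and everything else as recognized. The point is that nothing is dropped: flags that are not understood are carried along unchanged.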

Using Draft Models And MMProj Files

Draft models

The Draft field is for speculative decoding.

This lets a smaller model propose tokens while the main model checks them. When it works well, it can make generation faster.

In normal command form, this is the same idea as:

llama-server -m /models/large.gguf -md /models/small-draft.gguf --spec-draft-n-max 16

In llama-wrap, choose the draft GGUF file in the Draft field, then adjust the speculative decoding flags if needed.

[!TIP] 💡 Tip Do not start with speculative decoding on your first run. Get the normal model launch working first, save that preset, then experiment with draft models after.

MMProj files

Some multimodal or vision models need an MMProj file.

If your model requires one, select the MMProj GGUF file in llama-wrap. If your model is text-only, you can leave this empty.

Using ik_llama.cpp Or A Custom Inferer

llama-wrap can also help with ik_llama.cpp and other llama-server compatible executables.

Use the inferer selector:

llama.cpp for normal llama-server
ik_llama.cpp for IK builds and IK-specific flags
Custom for other compatible executables

If your executable is not in your PATH, put the full path in the executable field.

For example:

/home/you/ik_llama.cpp/build/bin/llama-server

For uncommon flags, use Extra args.

Troubleshooting

llama-wrap opens but launch fails

Check that the executable field points to llama-server correctly.

Test it in Terminal:

llama-server --help

If you are using a full path, test that full path:

/home/you/llama.cpp/build/bin/llama-server --help

The model does not load

Check that the model file is really a GGUF file and that the path is correct.

If the model is too large for your machine, try a smaller quantization or lower the context size.
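
A quick way to spot a broken or mislabeled download is to look at the first four bytes of the file, which spell GGUF in a valid model. The sketch below does only that check; the function name `check_model` is hypothetical, and a file that passes can still be incomplete or corrupted further in.

```shell
# Check that a file exists and starts with the 4-byte GGUF marker.
check_model() {
    if [ ! -f "$1" ]; then
        echo "not found: $1"
        return 1
    fi
    magic=$(head -c 4 "$1" 2>/dev/null)
    if [ "$magic" = "GGUF" ]; then
        echo "looks like GGUF: $1"
    else
        echo "not a GGUF file: $1"
        return 1
    fi
}

# Hypothetical path; replace with your own model file.
check_model "$HOME/models/model.gguf" || true
```

If the file is reported as not GGUF, the download may have saved an error page or a different format under the .gguf name.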

The browser cannot connect

Check the host and port.

If llama-wrap launches the server on:

127.0.0.1:8080

then open:

http://127.0.0.1:8080

If you changed the port, use the port you selected.

Another app is using the port

Change the port in llama-wrap.

For example, use:

Port: 8081

Then open:

http://127.0.0.1:8081
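
If you would rather find out which port is free before launching, something like this could work. It is a sketch that assumes lsof is installed. On Linux, lsof may not show listeners owned by other users unless you run it with sudo, so treat the result as a hint. Both function names are made up.

```shell
# Succeed if something is already listening on the given TCP port.
# Without lsof, this always reports the port as not in use.
port_in_use() {
    lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# Print the first port at or above the starting port with no listener.
first_free_port() {
    port="${1:-8080}"
    while port_in_use "$port"; do
        port=$(( port + 1 ))
    done
    echo "$port"
}

first_free_port 8080
```

Put the printed number in the port field in llama-wrap, then open the matching address in your browser.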

The server is slow

Try:

A smaller model
A smaller context size
A smaller quantization
More GPU layers if your hardware supports it
A GPU-enabled llama.cpp build

Do not change too many things at once. Change one setting, launch again, and compare.

Conclusion

With llama.cpp, you get a powerful local model server. With llama-wrap, you get a simpler way to launch it, adjust flags, save presets, and reuse working setups.

The best next step is to get one small GGUF model running first. Once that works, save it as a preset, then experiment with larger models, draft models, MMProj files, and extra llama-server arguments.

I hope that this tutorial has helped you set up llama.cpp with llama-wrap.

Feel free to contact me if you need more help with setting up or using local AI.
