Introducing llamaBench — LLM Inference Benchmark Runner
What is llamaBench?
llamaBench is an LLM inference benchmark runner designed for any OpenAI-compatible server. It works with any backend — ROCm (Lemonade SDK), CUDA (Ollama, vLLM), Vulkan, or CPU — and runs latency and throughput benchmarks against remote servers at multiple context depths, then generates comparison plots automatically.
Key Features
- Multi-server support: per-server config files (`config.<NAME>.sh`) with auto-discovery
- Multi-model runs: benchmark several models in a single invocation
- Variable context depths: test prompt processing and token generation from zero context to hundreds of thousands of tokens
- Atomic results: temp-directory pattern ensures no partial output on failure
- Built-in plotting: matplotlib charts for prompt processing and token generation throughput vs. context depth
- Traceable results: backend versions embedded in every result file
Requirements
| Tool | Purpose |
|---|---|
| bash 4+ | Script runtime |
| curl | HTTP requests to OpenAI-compatible API |
| python3 | JSON parsing, plot generation |
| uvx / llama-benchy | Benchmark execution (install uv) |
| matplotlib | Chart generation (auto-installed if missing) |
To skip the automatic install, install matplotlib ahead of time:
pip3 install matplotlib
Quick Start
# 1. Create a server configuration
cp config.template.sh config.MYSERVER.sh
# Edit config.MYSERVER.sh with your server IP, port, models, and depths
# 2. Run benchmarks
./run_bench.sh MYSERVER
# 3. Check results
ls results/<timestamp>/
Configuration
Each server has its own config.<NAME>.sh file sourced by run_bench.sh. Copy the template and fill in your values:
cp config.template.sh config.MYSERVER.sh
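The feature list mentions auto-discovery of `config.<NAME>.sh` files, but the post does not say how it works. The sketch below is one plausible reading in Python, not the logic in `run_bench.sh`: the function name `discover_servers` is hypothetical, and skipping `config.template.sh` is an assumption.

```python
from pathlib import Path


def discover_servers(directory):
    """Return the <NAME> part of every config.<NAME>.sh file in a directory.

    Hypothetical helper: the template file is skipped on the assumption
    that it is not meant to describe a real server.
    """
    names = []
    for path in Path(directory).glob("config.*.sh"):
        name = path.name[len("config."):-len(".sh")]
        if name and name != "template":
            names.append(name)
    return sorted(names)


if __name__ == "__main__":
    # List the servers that have a config file in the current directory.
    for server in discover_servers("."):
        print(server)
```

Each discovered name could then be passed to the runner the same way as `MYSERVER` in the Quick Start.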
Config Variables
| Variable | Description | Example |
|---|---|---|
| `IP` | Server IP address | `"192.168.2.238"` |
| `PORT` | API port | `"13305"` |
| `PLOT_PREFIX` | Filename prefix for plots | `"combined."` |
| `DEPTHS` | Array of context depths to test | `(0 8192 32768 65535 128000)` |
| `MODELS` | Array of model names on the server | `("user.MyModel-Q4")` |
Benchmark Flow
- Config load: Source `config.<NAME>.sh`, validate arrays, apply CLI overrides
- Temp dir: Create isolated workspace (cleaned up on failure)
- System info: Fetch `/api/v1/system-info`, extract backend versions
- Per-model loop: Unload current model, run `llama-benchy` with all configured depths, save markdown table
- Plot: Generate PNG charts from all result files
- Finalize: Move temp contents to `results/<timestamp>/` only on success
All intermediate files are written to a temporary directory. The final results folder is created only after all benchmarks and plots succeed. If any step fails, the temp directory is removed automatically.
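The temp-directory pattern described above can be sketched in a few lines of Python. This illustrates the idea only and is not the code in `run_bench.sh`, which is a bash script. The function name `run_atomically` and the "step" callables are hypothetical, and the sketch assumes the final directory does not exist yet.

```python
import shutil
import tempfile
from pathlib import Path


def run_atomically(final_dir, steps):
    """Run every step in a temp workspace; publish it only if all succeed.

    Each step is a callable that receives the workspace path and writes
    its output files there. Assumes final_dir does not exist yet.
    """
    final_dir = Path(final_dir)
    work = Path(tempfile.mkdtemp(prefix="bench."))
    try:
        for step in steps:
            step(work)
    except Exception:
        # Failure: remove the workspace so no partial results remain.
        shutil.rmtree(work, ignore_errors=True)
        return False
    # Success: move the complete workspace into place in one operation.
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(work), str(final_dir))
    return True
```

Because the results directory only appears after the final move, anything found under `results/` can be treated as a complete run.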
Result Format
Each run produces a timestamped directory in results/:
- `system-info.json` / `system-info.md`: Server hardware and backend versions
- Per-model markdown tables with columns for tokens/sec, peak tokens/sec, TTFR, estimated PPT, and end-to-end TTFT
- `<prefix>p.png`: Prompt processing throughput vs context depth
- `<prefix>g.png`: Token generation throughput vs context depth
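Since the per-model results are plain markdown tables, they are easy to post-process. Below is a minimal sketch of a generic markdown table reader. The function name `parse_markdown_table` is hypothetical, and the column headers in the usage example are made up, so check a real result file for the exact header names.

```python
def parse_markdown_table(text):
    """Parse the first pipe-delimited markdown table found in text.

    Returns one dict per data row, keyed by the header cells. Lines that
    are not table rows (such as version headers) are ignored.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [c.strip() for c in line[1:-1].split("|")]
        if header is None:
            header = cells
        elif all(c and set(c) <= set("-: ") for c in cells):
            continue  # separator row such as |---|---:|
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


if __name__ == "__main__":
    sample = "| test | t/s |\n|---|---:|\n| pp | 1234.5 |\n"
    print(parse_markdown_table(sample))
```

All values come back as strings; convert the numeric columns with `float()` before comparing or plotting them.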
Test types include:
- pp — Prompt processing (baseline, 2048 tokens)
- tg — Token generation (32 tokens)
- ctx_pp/tg @ d<N> — Full context processing/generation at depth N
- pp/tg @ d<N> — Incremental processing/generation at depth N
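To group results by depth, the test-type labels above need to be split into their parts. The sketch below assumes labels look like `pp`, `tg`, `ctx_pp @ d8192`, or `tg @ d8192`, optionally with a token count after the type (for example `pp2048`). That format is inferred from the list above rather than from the tool's documentation, and `parse_label` and `Label` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed label shape: [ctx_](pp|tg)[<tokens>][ @ d<depth>]
LABEL_RE = re.compile(r"^(ctx_)?(pp|tg)(\d*)(?:\s*@\s*d(\d+))?$")


class Label(NamedTuple):
    kind: str           # "pp" (prompt processing) or "tg" (token generation)
    full_context: bool  # True for the ctx_ variants
    depth: int          # context depth; 0 when the label has no depth suffix


def parse_label(label: str) -> Optional[Label]:
    """Split a test-type label into its parts, or return None if unrecognized."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        return None
    ctx, kind, _tokens, depth = match.groups()
    return Label(kind, ctx is not None, int(depth) if depth else 0)
```

For example, `parse_label("ctx_tg @ d8192")` yields a `tg` label at depth 8192 with `full_context` set, while a baseline `pp` label is reported at depth 0.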
plot.py
plot.py also works as a standalone script:
# Plot specific result files
python plot.py --prefix "output." results/*/model*.md
# Plot from stdin
cat results.md | python plot.py -
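For readers who want a custom chart, here is a rough sketch of a throughput-vs-depth plot built with matplotlib. It is not the implementation of `plot.py`: the function names `build_series` and `plot_series`, the input shape, and the axis labels are all assumptions made for illustration.

```python
from collections import defaultdict


def build_series(points):
    """Group (model, depth, tokens_per_sec) tuples into per-model series.

    Returns {model: ([depths], [throughputs])} with each series sorted by depth.
    """
    grouped = defaultdict(list)
    for model, depth, tps in points:
        grouped[model].append((depth, tps))
    series = {}
    for model, pairs in grouped.items():
        pairs.sort()
        series[model] = ([d for d, _ in pairs], [t for _, t in pairs])
    return series


def plot_series(series, out_path, title):
    """Draw one line per model and save the chart as an image file."""
    import matplotlib
    matplotlib.use("Agg")  # render without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for model, (depths, tps) in series.items():
        ax.plot(depths, tps, marker="o", label=model)
    ax.set_xlabel("Context depth (tokens)")
    ax.set_ylabel("Throughput (tokens/sec)")
    ax.set_title(title)
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
```

Keeping the grouping separate from the drawing means the same series can feed both the prompt processing chart and the token generation chart.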
Notes
- An OpenAI-compatible inference server must be running and reachable before you start a benchmark
- Models are unloaded between runs to ensure clean state
- Prefix caching is enabled by default in all benchmark runs
- Backend version headers are prepended to each result file for traceability
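A quick way to confirm the server is reachable before a run is to list its models. The sketch below assumes the server exposes the standard OpenAI model listing under `/v1/models`. Some servers mount the API under a different prefix (this post's own system-info endpoint lives under `/api/v1`), so the base path is a parameter. `models_url` and `list_models` are hypothetical helper names.

```python
import json
import urllib.request


def models_url(ip, port, base_path="/v1"):
    """Build the model-listing URL for an OpenAI-compatible server."""
    return f"http://{ip}:{port}{base_path.rstrip('/')}/models"


def list_models(ip, port, base_path="/v1", timeout=5.0):
    """Return the model ids reported by the server.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    url = models_url(ip, port, base_path)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # OpenAI-style responses carry the models in a "data" list.
    return [m["id"] for m in payload.get("data", [])]


if __name__ == "__main__":
    # Example values taken from the Config Variables table.
    print(list_models("192.168.2.238", "13305"))
```

The returned ids can be compared against the `MODELS` array in the config file to catch naming mistakes before a long benchmark starts.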
Check out the project on GitHub.