# Train an adapter. Ship it to Ollama. Move on.
Backpropagate is a Python library for fine-tuning large language models on a single GPU. Three lines of code train a 7B model on a 16GB card. One more command exports it to Ollama so you can `ollama run` your finetune. Works first-class on Windows.
```python
from backpropagate import Trainer

trainer = Trainer("Qwen/Qwen2.5-7B-Instruct")
trainer.train("my_data.jsonl", steps=100)
trainer.export("gguf", quantization="q4_k_m")
```

```bash
backprop export ./output/lora --format gguf --quantization q4_k_m --ollama --ollama-name my-model
ollama run my-model
```
That's it. There's no YAML config file. There's no `accelerate launch` ceremony. There's no separate "now convert it to GGUF" tutorial. If you have a CUDA GPU and a JSONL file with your training data, you're three lines away from a working finetune.
## Install
```bash
# Recommended: isolated Python install (no conflicts with system Python or other projects)
pipx install backpropagate

# Or via uv (faster install, same isolation)
uv tool install backpropagate

# Standard pip (if you manage your own virtualenv)
pip install backpropagate
```
If you want the optional features, swap the install for one of these:
```bash
pipx install "backpropagate[standard]"   # adds Unsloth (2x faster training) + the web UI
pipx install "backpropagate[full]"       # adds everything: unsloth, ui, monitoring, export, etc.
```
Prefer Docker? `docker pull ghcr.io/mcp-tool-shop-org/backpropagate:latest` works too. Images ship for both `linux/amd64` and `linux/arm64`, so Apple Silicon and ARM Linux operators get a native image. A canonical `compose.yaml` for "UI in a container" lives at the repo root — `docker compose up` brings the web UI up on `http://localhost:7860` with a persistent `~/.backpropagate` volume mount.
## Where Backpropagate sits in the space
There are several good libraries for fine-tuning LLMs. They're each great at different things:
Backpropagate is the missing option: **a 3-line Python API for solo operators on a single consumer GPU who want to train an adapter and ship it.** No YAML, no GUI, no online RL (PPO/GRPO), no multi-node. Just the loop everyone actually needs and the export step that gets in the way.
If you tried one of the libraries above and bounced off the config-file ceremony, or hit a model-family gap, or wanted Windows-first defaults — Backpropagate is for you.
## What you can fine-tune on a 16GB consumer GPU
Here's the practical envelope on a 16GB card (RTX 4080 / 5080 / 4070 Ti Super):
| Model | Method | Status |
|---|---|---|
| Qwen-3.5-4B / Phi-4-mini-3.8B / SmolLM3-3B | LoRA / QLoRA / DoRA | Comfortable. Full sequence length, room to spare. |
| SmolLM3-3B / Qwen2.5-3B / Llama-3.2-3B / Llama-3.2-1B | `mode="full"` (full fine-tuning) | v1.4 — pass `--mode=full` on `backprop train` or `Trainer(..., mode="full")`. Loads full-precision (bf16) weights — no 4-bit, no adapter; gradient checkpointing + paged 8-bit Adam keep the footprint inside 16GB. |
| Qwen-2.5-7B / Llama-3.1-8B / Mistral-7B | QLoRA | Standard. ~7-8 GB. Backpropagate's default presets. |
| 13B-class (e.g. Llama-2 13B) | QLoRA + sample packing | Tight but works. Use shorter sequences. |
| Mixtral 8x7B (47B total parameters) | — | Out of scope — 2-bit (AQLM / QuIP#) breaks the mergeable-adapter + GGUF-export contract, so it was retired in the [v1.5 trajectory brief](docs/V1_5_BRIEF.md). On a 16GB card, use a ≤8B base. |
`mode="full"` admits models up to **4B parameters**. The four presets in the full-FT row above are ~3B or smaller (the 3B-class presets have true parameter counts of 3.08–3.24B) and fit a 16GB card. The 3.8–4B class (Phi-4-mini-3.8B, Qwen-3.5-4B) also passes the ceiling but needs a **24GB+** card for full FT, because weights + gradients alone approach 16GB before the optimizer and activations. On a 16GB card, use `mode="lora"` for those (they're in the LoRA row). Models >4B exit with `RUNTIME_FULL_FT_MODEL_TOO_LARGE`.
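The "weights + gradients alone approach 16GB" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes bf16 weights (2 bytes per parameter) and gradients stored at the same precision. It ignores the optimizer state and activations, so it is a lower bound, not Backpropagate's own memory estimator.

```python
BYTES_PER_BF16 = 2  # bf16 stores each value in 2 bytes


def weights_plus_grads_gb(num_params: float) -> float:
    """Decimal GB for bf16 weights plus same-sized bf16 gradients (assumption)."""
    return num_params * BYTES_PER_BF16 * 2 / 1e9


# Parameter counts taken from the text above; labels are for display only.
for label, params in [("3.08B preset", 3.08e9), ("Phi-4-mini-3.8B", 3.8e9)]:
    gb = weights_plus_grads_gb(params)
    print(f"{label}: {gb:.1f} GB before optimizer state and activations")
```

A 3.8B model lands just under 16 GB on this estimate, which leaves essentially no room for anything else on a 16GB card.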
2-bit quantization (AQLM / QuIP#) is **out of scope**. It was scoped for v1.4, then retired in the [v1.5 trajectory brief](docs/V1_5_BRIEF.md): a 2-bit base can't be cleanly merged back into full-precision weights, which breaks Backpropagate's mergeable-adapter → GGUF → Ollama export contract (the whole point of the pipeline). The headroom levers Backpropagate ships instead are the v1.5 **FP8 compute path** (`--fp8`, Blackwell/Hopper) and `mode="full"` for ≤4B models — both stay mergeable and exportable.
For models 3B and smaller, full fine-tuning (not just LoRA) is feasible on 16GB and ships in v1.4 as `mode="full"`. Pass `Trainer(..., mode="full")` or `--mode=full` on `backprop train` to enable it. A hard gate refuses the mode for models > 4B with `RUNTIME_FULL_FT_MODEL_TOO_LARGE`, naming LoRA + the sub-4B presets as the recovery options. See [the full fine-tuning handbook page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/full-fine-tuning/) for the configuration math + Biderman 2024 / Thinking Machines 2025 quality comparison. For 7B+ models, full fine-tuning needs a 24GB+ GPU: consider an A100 cloud rental, or stick with LoRA, which recent research shows matches full fine-tuning quality on most post-training tasks anyway (see [the anti-pitch section](#what-backpropagate-is-not-for) for citations).
## What Backpropagate is NOT for
If your use case is below, you'll have a better time with a different library — Backpropagate is not the right pick and trying to make it work would cost more than just reaching for the right tool. Reading this section before you start saves the install-and-bounce cycle:
- **Full-parameter fine-tuning of 7B+ models** — Backpropagate uses LoRA / QLoRA, which trains a small adapter rather than updating every weight. For models 7B and larger, full fine-tuning needs 24GB+ of GPU memory and doesn't fit on a 16GB consumer card. For models 3B and smaller, full fine-tuning IS feasible on 16GB and ships in v1.4 as `mode="full"` (pass `Trainer(..., mode="full")` or `--mode=full` on the CLI; a hard gate raises `RUNTIME_FULL_FT_MODEL_TOO_LARGE` for models > 4B and names LoRA + the sub-4B presets as recoveries). The bigger picture: recent research ([Biderman 2024](https://arxiv.org/abs/2405.09673), [Thinking Machines 2025](https://thinkingmachines.ai/blog/lora/)) shows that LoRA at correct configuration matches full fine-tuning quality on most post-training tasks (instruction-following, domain adaptation, persona/style) at 67% of the compute — so for the work most operators actually want, you don't lose anything by sticking with LoRA. `mode="full"` exists for the cases where you've measured a quality gap and decided to spend the extra compute. If you genuinely need full fine-tuning of a 7B+ model, use HuggingFace `transformers.Trainer` directly on a 24GB+ card.
- **Online RL — PPO / GRPO / RLVR** — Backpropagate does single-stage SFT plus reference-free preference tuning (ORPO ships in v1.5; SimPO/KTO are planned). What it does *not* do is online reinforcement learning — PPO, GRPO, or RLVR — which needs a reward model or a generation-and-scoring loop on top of the training step. For those, use TRL directly or LLaMA-Factory. (Reference-free preference tuning fits the single-stage envelope because there's no separate reference model to hold in memory; see the ORPO note under [Quick Start](#quick-start).)
- **Multi-node training** — single GPU on one machine only. Multi-GPU on one machine works (via `accelerate launch`) but isn't officially supported.
- **macOS training on the CUDA rail** — Apple Silicon doesn't have CUDA, so the CUDA path has to run on a Linux or Windows box with an NVIDIA GPU. You can still run the trained model on a Mac via Ollama. **New in v1.5:** an experimental MLX rail (`--backend mlx`) trains a LoRA adapter natively on Apple Silicon — see [Apple Silicon (MLX)](#apple-silicon-mlx--experimental-v15). It is LoRA-SFT-only and built-but-not-yet-dogfood-verified on real silicon, so for anything beyond a LoRA SFT (ORPO, full fine-tune, FP8, multi-run) you still want the CUDA rail.
- **Anything outside the tested model families** — Qwen 2.5 / 3.5 (7B / 4B), Phi-4-mini-3.8B, SmolLM3-3B, Llama 3.2 (3B / 1B), Mistral 7B. Other models often work but aren't pinned in CI.
If you need any of those things, reach for one of the libraries listed above. They're better at them.
## What Backpropagate gives you
Four things, in one install:
**1. A real 3-line API that runs without a config file.**
The snippet at the top of this README runs end-to-end. No `accelerate config`, no YAML, no Hydra overrides. Just `Trainer(model).train(data)` and you have a finetune.
**2. Windows that actually works.**
Most ML libraries treat Windows like an afterthought. Backpropagate is tested first-class on Windows + RTX 5080. The library handles the runtime quirks for you — it knows how to pre-tokenize your data so Windows multiprocessing doesn't crash, it automatically disables xformers on RTX 40/50 cards where it would break, and it picks dataloader settings that don't blow up. You don't have to know any of this. It just runs.
**3. Built for unattended runs.**
Training takes hours. You don't want to babysit it. Backpropagate is designed to be left running:
- If you run out of GPU memory, it automatically halves the batch size and retries — up to three times. No hand-tuning.
- If your GPU gets too hot, it pauses until things cool down and then continues.
- Every checkpoint is written atomically — if your laptop crashes mid-save, the previous good checkpoint is still intact.
- Every training run gets a unique ID that's stamped onto every log line, every checkpoint, and every Weights & Biases entry. If something goes wrong, one ID lets a maintainer correlate everything.
- Errors come with stable codes (`RUNTIME_GPU_OOM`, `DEP_OLLAMA_REGISTRATION_FAILED`, etc.) so you can grep your logs and the [troubleshooting guide](https://mcp-tool-shop-org.github.io/backpropagate/handbook/troubleshooting/) for the fix. CUDA-specific failures have a dedicated [CUDA troubleshooting page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/troubleshooting-cuda/).
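The halve-and-retry behavior from the first bullet can be sketched in a few lines. This is an illustration of the general pattern, not Backpropagate's internal code: `OutOfMemory`, `train_with_oom_retry`, and `fake_step` are hypothetical names, and a real trainer would catch the framework's own out-of-memory exception instead.

```python
class OutOfMemory(RuntimeError):
    """Stand-in for a GPU out-of-memory error (hypothetical, for illustration)."""


def train_with_oom_retry(train_step, batch_size: int, max_retries: int = 3):
    """Call train_step(batch_size); on OOM, halve the batch size and retry.

    Gives up after max_retries halvings, or when the batch size is already 1.
    Returns (result, batch_size_that_worked).
    """
    retries = 0
    while True:
        try:
            return train_step(batch_size), batch_size
        except OutOfMemory:
            if retries == max_retries or batch_size == 1:
                raise  # nothing left to shrink: surface the error
            retries += 1
            batch_size = max(1, batch_size // 2)


def fake_step(batch_size: int) -> str:
    """Pretend the GPU only fits batches of 2 or fewer."""
    if batch_size > 2:
        raise OutOfMemory(f"batch_size={batch_size} does not fit")
    return "ok"


result, final_batch_size = train_with_oom_retry(fake_step, batch_size=8)
print(result, final_batch_size)
```

Capping the retries matters: a run that still fails after three halvings usually has a different problem (sequence length, model size), and failing loudly is more useful than shrinking forever.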
**4. One command from trained adapter to `ollama run`.**
Lots of libraries train a model. Few of them get out of your way when you want to actually use it. Backpropagate exports to GGUF (the format Ollama uses) and registers an Ollama model in one command. You go from "training done" to "I can chat with my finetune" in about 30 seconds.
## Quick Start
The repo ships a tiny example dataset so the snippet from the top of this README runs on a clean install:
```bash
pipx install "backpropagate[standard]"

python -c "
from backpropagate import Trainer
trainer = Trainer('Qwen/Qwen2.5-7B-Instruct')
trainer.train('examples/quickstart.jsonl', steps=10)
trainer.export('gguf', quantization='q4_k_m')
"
```
This trains a Qwen 2.5 7B adapter on 5 short ShareGPT-format conversations, then exports the result to GGUF. For your own data, format your JSONL one example per line:
```json
{"conversations": [{"from": "human", "value": "What is Python?"}, {"from": "gpt", "value": "A programming language."}]}
{"conversations": [{"from": "human", "value": "Explain recursion."}, {"from": "gpt", "value": "A function that calls itself."}]}
```
Alpaca (`instruction` / `output`), OpenAI chat (`messages`), and raw text formats also work — Backpropagate auto-detects the format.
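To make the idea of format auto-detection concrete, here is a minimal sketch that classifies a JSONL record by its top-level keys. It is not Backpropagate's detector. The function name `detect_format`, the returned labels, and the use of a `text` key for raw text are assumptions made for illustration.

```python
import json


def detect_format(record: dict) -> str:
    """Guess the dataset format of one JSONL record from its keys (sketch)."""
    if "conversations" in record:
        return "sharegpt"
    if "messages" in record:
        return "openai_chat"
    if "instruction" in record and "output" in record:
        return "alpaca"
    if "text" in record:  # assumed key for raw-text rows
        return "raw_text"
    raise ValueError(f"unrecognized record keys: {sorted(record)}")


sample_lines = [
    '{"conversations": [{"from": "human", "value": "What is Python?"}, '
    '{"from": "gpt", "value": "A programming language."}]}',
    '{"instruction": "Explain recursion.", "output": "A function that calls itself."}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
]
formats = [detect_format(json.loads(line)) for line in sample_lines]
print(formats)
```

Running a check like this over the first few lines of a file before training is a cheap way to catch a mixed-format dataset early.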
### Preference tuning (ORPO)
New in v1.5: train on preferences instead of plain demonstrations. ORPO is reference-free and single-stage — it folds the preference signal into the SFT step, so there's no separate reward or reference model and the 3-line shape is unchanged. Pass `--method orpo` (CLI) or `method="orpo"` (Python) and feed it a dataset of `{prompt, chosen, rejected}` (or just `{chosen, rejected}`) rows:
```json
{"prompt": "What is Python?", "chosen": "A high-level programming language known for readability.", "rejected": "idk look it up"}
{"prompt": "Explain recursion.", "chosen": "A function that calls itself with a smaller input until a base case.", "rejected": "when something repeats"}
```

```python
from backpropagate import Trainer

trainer = Trainer("Qwen/Qwen2.5-7B-Instruct", method="orpo")
trainer.train("preferences.jsonl", steps=100)
trainer.export("gguf", quantization="q4_k_m")
```

```bash
backprop train --data preferences.jsonl --method orpo --steps 100
```
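Before a long ORPO run it can help to sanity-check the preference file. The sketch below validates the `{prompt, chosen, rejected}` row shape described above using only the standard library. `validate_preference_row` is a hypothetical helper, not part of Backpropagate, and the identical-pair check is an extra assumption about what counts as a useful row.

```python
import json


def validate_preference_row(row: dict) -> list[str]:
    """Return a list of problems with one preference row (empty list = OK)."""
    problems = []
    for key in ("chosen", "rejected"):
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")
    # 'prompt' is optional, but must be a string when present.
    if "prompt" in row and not isinstance(row["prompt"], str):
        problems.append("'prompt' must be a string")
    # A pair with identical answers carries no preference signal.
    if not problems and row["chosen"] == row["rejected"]:
        problems.append("'chosen' and 'rejected' are identical")
    return problems


good = json.loads(
    '{"prompt": "What is Python?", '
    '"chosen": "A high-level programming language known for readability.", '
    '"rejected": "idk look it up"}'
)
bad = {"prompt": "Explain recursion.", "chosen": "", "rejected": "when something repeats"}
print(validate_preference_row(good))
print(validate_preference_row(bad))
```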
### Reasoning-trace SFT (R1 distillation)
New in v1.5: distill a reasoning model the easy way. Pass `--reasoning-trace` (CLI) or `Trainer(..., reasoning_trace=True)` (Python) and feed it traces that keep a `<think>...</think>` chain-of-thought inside the assistant turn. This is the pure-SFT half of [DeepSeek-R1](https://arxiv.org/abs/2501.12948) distillation, no RL required. Backpropagate keeps the `<think>` block in the training target, drops empty / over-long traces (trace-length filtering), and raises the default `max_seq_length` to 8192 for the longer CoT. Critically, `<think>` stays **plain text** (no special tokens, no embedding resize), so the merged GGUF still exports to Ollama like any other fine-tune. SFT only. See the [reasoning-trace recipe](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/#reasoning-trace-sft-r1-distillation) for the dataset shape and the tunable token band.
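Trace-length filtering can be pictured as a simple predicate over the assistant turn. The sketch below assumes the chain-of-thought is wrapped in `<think>...</think>` tags, as in DeepSeek-R1 style outputs, and uses a whitespace split as a rough stand-in for a real tokenizer. `keep_trace` and its `min_tokens` / `max_tokens` defaults are hypothetical, not Backpropagate's actual token band.

```python
import re

# Non-greedy match so only the first <think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def keep_trace(assistant_text: str, min_tokens: int = 1, max_tokens: int = 4096) -> bool:
    """Keep a sample only if its reasoning trace is non-empty and not over-long."""
    match = THINK_RE.search(assistant_text)
    if match is None:
        return False  # assumption: a turn with no trace counts as empty
    approx_tokens = len(match.group(1).split())  # crude whitespace token count
    return min_tokens <= approx_tokens <= max_tokens


samples = [
    "<think>2 + 2 is 4, so double it to get 8.</think>The answer is 8.",
    "<think>   </think>The answer is 8.",
    "<think>" + "step " * 5000 + "</think>Done.",
]
kept = [keep_trace(text) for text in samples]
print(kept)
```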
### Apple Silicon (MLX) — experimental, v1.5
New in v1.5: **one API, two rails.** CUDA stays the canonical, verified backend; MLX is a second rail that trains on an M-series Mac via Apple's [`mlx_lm.lora`](https://github.com/ml-explore/mlx-lm) toolchain (unified memory, no CUDA). The same 3-line shape picks the rail by hardware: `backend='auto'` (the default) routes to CUDA on NVIDIA and to MLX on Apple Silicon, so existing CUDA setups behave exactly as before:
```python
from backpropagate import Trainer

# On an M-series Mac with `pip install 'backpropagate[mlx]'`:
trainer = Trainer("mlx-community/Qwen2.5-0.5B-Instruct-4bit", backend="mlx")
trainer.train("examples/quickstart.jsonl", steps=100)
```

```bash
backprop train --data my_data.jsonl --backend mlx --steps 100
```
In v1.5 the MLX rail is **LoRA SFT only** — no ORPO, no FP8, no `mode='full'`, no multi-run on MLX yet (each is rejected with `CONFIG_INVALID_SETTING`; use `backend='cuda'`/`'auto'` on an NVIDIA box for those). The resulting adapter is plain safetensors and exports to Ollama through the same path as the CUDA rail.
For more end-to-end workflows (fine-tune-and-push-to-HF-Hub, resume after OOM, multi-run SLAO across a long campaign, etc.) see the [handbook recipes page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/).
### Web UI (optional)
If you'd rather click than type Python, install the UI extra and launch:
```bash
pipx install "backpropagate[ui]"
backprop ui --port 7862
```
A local web interface opens at `http://localhost:7862` for browsing datasets, validating formats, and assembling a training config visually. Training itself runs via `backprop train` (UI-driven training is on the roadmap — the Start button currently surfaces that note). The UI is local-only by default. To expose it to other devices, see [Web UI](#web-ui) below for the `--share` + `--auth` security contract.
## Multi-run training
If you want to fine-tune incrementally across multiple datasets — say you get new training data every week and want to add it without forgetting what you learned before — Backpropagate's `multi_run` mode is for you:
```python
from backpropagate import Trainer

trainer = Trainer("Qwen/Qwen2.5-7B-Instruct")

result = trainer.multi_run(
    dataset="HuggingFaceH4/ultrachat_200k",
    num_runs=5,
    steps_per_run=100,
    samples_per_run=1000,
)
```
This runs five training passes, merging the adapter between runs in a way that preserves earlier knowledge while incorporating new examples. The technique is based on recent continual-learning research — see [References](#references) at the bottom of this README.
The CLI version:
```bash
backprop multi-run --data my_data.jsonl --runs 5 --steps 100 --samples 1000
```
## Resume from checkpoint
A 5-run training that crashes at run 4 is recoverable. Every multi-run session writes its run ID into the on-disk history and checkpoint manifest, so picking up where you left off is one command:
```bash
backprop resume
backprop multi-run --data ... --resume
backprop train --data ... --resume   # single-run resume
```
The default behavior of `backprop multi-run` (no `--resume`) auto-detects an in-progress entry in the same output directory and continues it. To force a clean start, point at a fresh output directory.
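Resuming only works if the last checkpoint on disk is intact, which is what the atomic-write guarantee under unattended runs protects. A common way to get that guarantee is to write to a temporary file and then rename it over the target. The sketch below shows that general pattern with the standard library. `atomic_write_bytes` is a hypothetical helper, not Backpropagate's checkpoint writer.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Write data to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the
    # target, otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a half-written temp file behind
        raise


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "checkpoint.bin")
    atomic_write_bytes(checkpoint, b"step-100")
    atomic_write_bytes(checkpoint, b"step-200")
    with open(checkpoint, "rb") as handle:
        latest = handle.read()
print(latest)
```

If the process dies before `os.replace`, the previous checkpoint file is untouched, so a later resume still has a valid file to load.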
## Training history
Every `backprop train` and `backprop multi-run` invocation records a row in `