# autosentry
**Self-healing supervisor for long-running processes.**
Watch a command, catch the failure, fix it, leave a paper trail.
`autosentry` supervises a long-running command — an ML training run, a data
pipeline, a service that's expected to stay up — watches its log stream for
known failure modes and anomalies, applies deterministic recovery rules when
it knows the answer, and escalates to a Claude Code shell when it doesn't.
Every incident is written into `.autosentry/incidents/` as a folder
containing the exploded source around the failure, a stack trace, snapshots
of the configs that were in effect, and the fix that was applied.
It was generalized from a domain-specific monitor (`rad_monitor.py`) that
auto-healed a multi-stage ML pipeline through weeks of repeated failures.
The shape — file-based state, file-based outbox, synchronous main loop —
is deliberately simple so an operator can `cat`, `grep`, and `kill -9`
their way out of any problem the supervisor can't.
## Contents
- [Why autosentry](#why-autosentry)
- [Install](#install)
- [Quick start](#quick-start)
- [How it works](#how-it-works)
- [Anatomy of an incident](#anatomy-of-an-incident)
- [Configuration reference](#configuration-reference)
- [Detectors](#detectors)
- [Recovery rules and Claude fallback](#recovery-rules-and-claude-fallback)
- [Fix branches and outcome verification](#fix-branches-and-outcome-verification)
- [Notifications](#notifications)
- [Launching from your AI editor](#launching-from-your-ai-editor)
- [Repairing old or broken installs](#repairing-old-or-broken-installs)
- [Update](#update)
- [Status & roadmap](#status--roadmap)
- [Comparison](#comparison)
- [FAQ](#faq)
- [Contributing](#contributing)
- [License](#license)
## Why autosentry
**The main feature is the agent.** autosentry's job is to put a
capable coding agent (Claude Code by default) in the chair when your
long-running process breaks, with enough structured context for it to
fix the actual bug — not just restart the process and hope. YAML
rules exist as a cheap fast lane for the small set of known
transients where a `kill -HUP` or an env-var nudge will do; for
everything else, the agent takes over by default.
Long-running jobs fail in three flavors:
1. **Known transient failures** — NCCL hiccups, connection resets, OOMs that
would clear with a smaller batch. The rule healer handles these in
one shot.
2. **Anomalies** — training stalls, loss spikes, throughput drops. Some
match a rule; most need diagnosis. The agent takes the ones rules
can't cover.
3. **Novel failures** — code bugs, config mistakes, library regressions.
The agent reads the exploded source, the snapshotted configs, and
the stack trace, then proposes a patch on an isolated `autosentry/fix-*`
branch. autosentry watches the fix for the verification window; if
the same detector re-fires, the fix is reverted and the next attempt
gets fresh context. Outcomes (`kept` / `regressed`) are tracked in
the attempts ledger.
autosentry's default posture is **escalate to the agent quickly**. Two
unverified rule-driven restarts and Claude takes over. A rule-based
fix that regresses inside the verify window pivots the next attempt
straight to the agent regardless of count — rules already failed on
that detector, so cycling them again is wasted budget. Both thresholds
are configurable (`healing.escalate_to_claude_after`,
`healing.escalate_on_rule_regression`); the defaults are tuned for
"agent first, rules as accelerator."
## Install
### One-line (recommended)

    curl -fsSL https://raw.githubusercontent.com/ulmentflam/autosentry/main/install.sh | sh

The installer detects `uv`, `pipx`, or `pip` (in that order) and uses the
best one available. Pin a specific version with `AUTOSENTRY_VERSION=0.2.0`.
### Homebrew (macOS / Linux)

    brew install ulmentflam/tap/autosentry

That one-liner taps `ulmentflam/homebrew-tap` and installs autosentry into its
own virtualenv. Already tapped? `brew install autosentry`. Upgrade with
`brew upgrade autosentry`. Each release auto-syncs the formula, so the tap
tracks the latest version.
### From PyPI

    uv add autosentry         # uv
    pipx install autosentry   # pipx (isolated)
    pip install autosentry    # plain pip

### From source

    git clone https://github.com/ulmentflam/autosentry.git
    cd autosentry
    make install

**macOS / iCloud Drive caveat**
If your clone lives under `~/Library/Mobile Documents/`, iCloud sets
`UF_HIDDEN` on `_*.pth` files inside any venv and Python's `site.py`
then skips them, breaking editable installs. The `Makefile` detects
this and points the venv at `~/.cache/autosentry-venv` automatically.
Override with `make install VENV=/path/to/venv`.
### Let an AI agent do it
If you're already in a Claude Code / Cursor / Codex / Aider / OpenCode /
Windsurf / Zed / Continue / Gemini session, paste the prompt block below
and let the agent install and configure autosentry for the repo you're
sitting in. It's a short, declarative brief — the agent runs the right
commands for your stack, asks before destructive actions, and leaves
the repo in a state where `autosentry run` works on the next try.
**Agent install brief (copy/paste into your session)**
Install and set up autosentry in this repo.
Follow this order; stop and ask me before doing anything that would
overwrite an existing file or change tracked code.
1. Verify autosentry isn't already installed. If it isn't, install it
with the one-liner from the README:
curl -fsSL https://raw.githubusercontent.com/ulmentflam/autosentry/main/install.sh | sh
Then confirm with `autosentry --version`.
2. Run `autosentry init --non-interactive` to scaffold the .autosentry/
tree (the config lives at `.autosentry/autosentry.yaml`). (Use
`--upgrade --force` if a config already exists and looks pre-0.6.1, or
to migrate a legacy root-level `autosentry.yaml` into `.autosentry/`.)
3. Inspect this repo to figure out:
- the right `process.command` (the thing I want supervised — read
pyproject.toml / package.json / Cargo.toml / go.mod / the
Makefile / scripts/ to guess; ASK ME before settling on it)
- which files belong in `config_snapshots` (env files, run configs,
pipeline definitions)
- a starting set of detectors and rules tailored to my stack
(OOM / NCCL / connection-reset patterns for ML; HTTP 5xx /
connection-refused for web; stall regex matching whatever
progress format my process emits)
Edit `.autosentry/autosentry.yaml` in place.
4. Install the /autosentry slash command for me with
`autosentry skills install --tool all`. This drops AGENTS.md plus the
per-tool wrappers so future sessions get the playbook automatically.
5. Run `autosentry doctor`. If anything is red, fix it. If it's all
green or only warnings, summarize the warnings.
6. Tell me the exact command to start the monitor in the background
(the `nohup autosentry run …` one-liner), but DO NOT run it
yourself. I'll start it.
Be terse. One or two sentences per step. Point me at
`.autosentry/autosentry.yaml` and `.autosentry/program.md` for context —
don't re-narrate the docs.
## Quick start

    pip install autosentry   # or: uv add autosentry / pipx install autosentry
    cd my-project
    autosentry init          # interactive: detects your stack, asks for process.command
    autosentry doctor        # verifies the env is healthy
    autosentry run           # starts monitoring

`autosentry init` is interactive from a real terminal — it detects
whether your repo is python/node/go/rust, suggests a starter
`process.command`, offers to snapshot config files, and (if you say
yes) installs the `/autosentry` slash command into whichever AI editors
it can find. From scratch to a running monitor is roughly five
minutes; most of that is reading the YAML it wrote.
A healthy `autosentry run` opens with a `starting` line naming your
supervisor and command, hands control to the tick loop, and from then
on only emits log lines on detections, state changes, and verification
outcomes. Silent is healthy. Sanity-check from another shell with
`autosentry status` (live pid + `restarts` counter), `autosentry
watch` for the rich TUI, or `autosentry doctor` if anything looks
off.
### From inside an AI editor
If you're already in a Claude Code / Cursor / Codex / Aider / OpenCode
/ Windsurf / Zed / Continue / Gemini session, you have two options:

    autosentry init --for-agent      # writes .autosentry/AGENT_NOTES.md, a cheat
                                     # sheet the agent reads instead of paraphrasing docs
    autosentry onboard --for-agent   # phase-aware plain-text playbook, no scaffolding

Or paste the [agent install brief](#let-an-ai-agent-do-it) into your
session and let it drive the whole sequence — install → init → detector
proposals → `autosentry skills install` → `autosentry doctor` — asking
before any destructive change.
### After it's running

    tail -F .autosentry/logs/autosentry.log   # structured log
    autosentry watch                          # rich TUI: state, incidents, log tail
    autosentry web                            # browse incidents in your browser
    autosentry status                         # one-shot state dump
    autosentry incidents list                 # CLI incident browser
    autosentry incidents show 2026-05-26T14-32-10Z-error-traceback
    autosentry probe                          # one-shot JSON liveness + pending incidents
    autosentry probe --inject-prompt          # emit a Stop-hook payload (used by session dispatch)
    autosentry doctor --fix                   # auto-repair broken or stale installs

Bidirectional Slack (separate shell — the monitor stays offline-safe):

    SLACK_BOT_TOKEN=xoxb-… autosentry dispatcher run --channel C0A4UK987ND

### The vault
Every time autosentry handles a significant event — an incident, a fix
attempt, a recurring failure pattern, a regression, a run that exhausted
its restart budget — it writes a wikilinked markdown note into
`.autosentry/vault/`. The vault is plain markdown; no tool is required
to read it.
.autosentry/vault/
├── index.md
├── runs/
.md
├── runs//child-.md ← supervised child restarts
├── incidents/.md
├── incidents//attempt-.md ← healer attempts
├── detectors/.md ← per-detector aggregators
├── patterns/.md ← recurring failure modes
├── regressions/.md ← fixes that didn't stick
└── exhaustions/.md ← runs that gave up
`autosentry web` exposes the vault at `/vault` (categorized note index),
`/vault//` (individual notes with wikilinks resolved to
in-app URLs), and `/vault/graph` (a Mermaid `graph TD` of run → child →
incident → attempt → outcome chains — click any node to drill in).
#### Open the vault in Obsidian
Open `.autosentry/vault/` as an Obsidian vault: **File → Open folder as
vault** and select the directory. All frontmatter, `[[wikilinks]]`, and
`#tags` work natively — no community plugins required.
A few things worth knowing once it's open:
- **Graph view** (Cmd-G / Ctrl-G) shows the same DAG that
`autosentry web` renders at `/vault/graph`. Obsidian's graph is live
and interactive; the web one is a static Mermaid snapshot.
- **Obsidian search** understands frontmatter tags directly:
`tag:#detector/oom`, `tag:#pattern`, `tag:#outcome/regressed`, and so
on. Every note autosentry writes has tags set — searching by tag is
usually faster than full-text.
- **The vault is derived data.** `.autosentry/incidents/index.jsonl` is
the source of truth. If the vault directory is deleted or corrupted,
run `autosentry doctor --fix` and it will be rebuilt from the index.
- **LLM narratives** are off by default. Set
`vault.narratives.enabled: true` in `.autosentry/autosentry.yaml` to
turn them on. When enabled, the first occurrence of a pattern,
regression, or exhaustion triggers a single LLM call that replaces the
templated `## Narrative` section in the relevant note with a
context-aware paragraph. Subsequent same-class events reuse the
template — only the first occurrence is narrated.
## How it works
┌─────────────────────────────────────────────────────────────────┐
│ autosentry monitor │
│ start ──► tick loop: read log lines → run detectors → fire │
│ healers → apply action → write incident → notify │
└─────────────────────────────────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
Supervisor Detectors Healers Notifiers
local / slurm / pattern / rules.yaml → log /
docker / attach traceback / Claude (sub- slack outbox /
stall / process or discord outbox /
exit_code interactive) webhook
│ │
▼ ▼
┌──────────────────────────┐ ┌──────────────────────┐
│ Incident store │ │ Slack dispatcher │
│ .autosentry/incidents/ │ │ outbox → Slack │
│ -/ │ │ Slack thread → inbox │
│ report.md │ │ (abort/pause/set…) │
│ trace.txt │ └──────────────────────┘
│ frames/*.md │
│ configs/* │ ┌──────────────────────┐
│ state.json │ ◄──│ autosentry watch │
│ fix/ │ ◄──│ autosentry web │
└──────────────────────────┘ └──────────────────────┘
Layers are deliberately small and pluggable:
| layer | role | built-in implementations |
|----------------|---------------------------------------------------------|----------------------------------------------|
| Supervisor | start, observe, restart the process | `local`, `slurm`, `docker`, `attach` |
| Detector | watch the log stream and process state for anomalies | `pattern`, `traceback`, `stall`, `exit_code` |
| Healer | decide what to do about a detection | YAML rules → Claude CLI fallback |
| Incident store | persist a forensic record of what happened + the fix | folder-per-incident with `index.jsonl` |
| Notifier | broadcast events | `log`, `slack_outbox`, `discord_outbox`, `webhook` |
| Dispatcher | bidirectional chat bridge (outbound + thread inbound) | `stdout`, `webhook`, `slack_api`, `discord_bot` |
| Visualization | operator surfaces | `autosentry watch`, `autosentry web` |
Read it left-to-right as a pipeline. The **supervisor** owns the
process and hands the monitor a log-line queue. **Detectors** each get
every line plus a periodic tick; the first one to fire produces a
`Detection`. A **healer** consumes that detection — first the
deterministic rule engine, then Claude if no rule matches (or if
escalation is active). The healer returns an action; the **monitor**
applies it (restart, env tweak, abort, custom command), captures the
attempt's outcome on an isolated [fix
branch](#fix-branches-and-outcome-verification), and asks the
**incident store** to commit a forensic folder. **Notifiers** broadcast
the event as a side effect. None of these layers know about the others'
guts — they share `Detection`, `HealerOutcome`, and `Incident` and
nothing else.
The monitor's main loop is a single, synchronous Python thread that
pulls log lines off the supervisor's queue. No async, no callbacks
across processes. If you can read
[`monitor.py`](./src/autosentry/monitor.py), you can debug anything
autosentry does.
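As a rough illustration of that synchronous shape, here is a minimal sketch of one pass over the log-line queue. It is not the code in `monitor.py`: the `Detection` fields, the `pattern_detector` helper, and the `drain` function are assumptions made for this example, and only the "first detector to fire wins" behaviour is taken from the description above.

```python
import queue
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Detection:
    """Hypothetical stand-in for autosentry's Detection record."""
    detector: str
    line: str


# In this sketch a detector is just a callable: log line in, Detection or None out.
Detector = Callable[[str], Optional[Detection]]


def pattern_detector(name: str, regex: str) -> Detector:
    """Build a detector that fires when a line matches `regex`."""
    compiled = re.compile(regex)

    def check(line: str) -> Optional[Detection]:
        return Detection(name, line) if compiled.search(line) else None

    return check


def drain(lines: "queue.Queue[str]", detectors: list[Detector]) -> list[Detection]:
    """Pull every queued line; for each, keep the first detector that fires."""
    fired: list[Detection] = []
    while True:
        try:
            line = lines.get_nowait()
        except queue.Empty:
            return fired
        for detect in detectors:
            detection = detect(line)
            if detection is not None:
                fired.append(detection)
                break  # first detector to fire wins for this line
```

Everything happens on the calling thread, so a debugger breakpoint inside `drain` stops the whole pipeline at once.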
## Anatomy of an incident
A `.autosentry/incidents/-/` folder looks like this:
2026-05-26T14-32-10Z-error-traceback/
├── report.md ← human-readable, the marquee artifact
├── trace.txt ← raw stack trace
├── log_excerpt.txt ← ±200 lines around the failure
├── frames/
│ ├── 01-train.py.md ← exploded source for frame 1 (function + ±10 lines)
│ ├── 02-loader.py.md
│ └── 03-_torch_dist.py.md ← library frame, source explode skipped
├── configs/
│ ├── run.yaml ← snapshot of each declared config file
│ └── .env
├── state.json ← monitor state at the moment of the incident
├── rule_match.json ← which YAML rule fired (or "claude")
└── fix/
├── action.json ← {"kind":"restart_with_env","env":{"BATCH_SIZE":"4"}}
├── diff.patch ← if Claude edited files, the diff lives here
└── claude_response.md ← Claude's diagnosis text (if invoked)
A real `report.md` for an OOM:

    # Incident — 2026-05-26 14:32:10 UTC — error / traceback
    **Process:** local · `python train.py --config configs/run.yaml`
    **PID:** 41822 · **Restart #:** 2/5
    **Detector:** `traceback` (Python)
    **Resolution:** rule `oom` → restart_with_env (BATCH_SIZE=4)
    ---
    ## Source — frame 1
    `src/train.py:142` in `TrainLoop.step()`
    ```python
    class TrainLoop:
        def step(self, batch):
            self.optimizer.zero_grad()
    >>>     logits = self.model(batch["input_ids"])     # line 142
            loss = self.criterion(logits, batch["labels"])
            loss.backward()
    ```
    ## Stack trace
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.34 GiB
    ## Fix applied
    Rule `oom` matched. Restarted with `BATCH_SIZE=4` (was `8`).

Anomalies (stall, loss spike, etc.) get the same shape — the trace section
is replaced by a recent-metrics block and the configs section is expanded
into a "decisions that might be relevant" view.
---
## Configuration reference
After `autosentry init` you have a `.autosentry/autosentry.yaml` with every
option commented in place. It lives inside `.autosentry/` alongside the
runtime state, and `init` drops a `.autosentry/.gitignore` so the whole
tree — config included — stays out of git by default (delete that file to
track the config). Relative paths in the config resolve against the
**project root** — the directory that contains `.autosentry/` — not the
config file's own directory. A pre-0.8 root-level `autosentry.yaml` is
still loaded as a fallback; `autosentry init --upgrade` migrates it.
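The path rule above can be sketched in a few lines. This is only an illustration of the stated behaviour (relative paths resolve against the directory containing `.autosentry/`, with a legacy root-level config treated as living in the project root); `project_root` and `resolve_path` are hypothetical helper names, not autosentry's API.

```python
from pathlib import Path


def project_root(config_path: Path) -> Path:
    """Return the directory that relative config paths resolve against."""
    parent = config_path.parent
    # .autosentry/autosentry.yaml -> the project root is one level up.
    # A legacy root-level autosentry.yaml already sits in the project root.
    return parent.parent if parent.name == ".autosentry" else parent


def resolve_path(config_path: Path, value: str) -> Path:
    """Resolve a path from the config; absolute paths pass through unchanged."""
    candidate = Path(value)
    if candidate.is_absolute():
        return candidate
    return project_root(config_path) / candidate
```

So `process.cwd: .` in `.autosentry/autosentry.yaml` means the project root, not `.autosentry/` itself.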
The top-level shape:
| key | type | default | what it does |
|--------------------|---------------|----------------------------------|--------------|
| `process.kind` | enum | `local` | `local`, `slurm`, `docker`, or `attach` (tail an existing PID/log) |
| `process.command` | list[str] | — | argv passed to the process. No shell interpretation. |
| `process.cwd` | str | `.` | working dir, relative to the project root (the dir containing `.autosentry/`) |
| `process.env` | dict[str,str] | `{}` | env vars; values can interpolate `$VAR` / `${VAR}` |
| `process.restart_policy.max_restarts` | int | `10` | when exceeded, monitor gives up |
| `process.restart_policy.cooldown_seconds` | int | `60` | wait before restart |
| `process.lifecycle` | enum | `restart_on_failure` | `restart_on_failure` (clean exit ends the supervisor; default since 0.8.5), `one_shot` (any exit ends it), `restart_always` (both clean and dirty exits route through the healer — pre-0.8.5 behavior) |
| `dispatch.mode` | enum | `builtin` | `builtin` (monitor runs the healer) or `session` (Claude Code session dispatches via `/autosentry` skill — preferred for Claude Code users; free) |
| `monitor.poll_interval_seconds` | int | `30` | tick rate for status checks and tick-driven detectors |
| `monitor.log_dir` | str | `.autosentry/logs` | structured log + supervised process log live here |
| `monitor.log_excerpt_lines` | int | `200` | lines per incident `log_excerpt.txt` |
| `config_snapshots` | list[str] | `[]` | files copied verbatim into every incident folder |
| `source_explode.context_lines` | int | `10` | lines around hot line when AST framing fails |
| `source_explode.languages` | list | `[python, javascript, typescript, go, rust, java]` | tree-sitter grammars to load |
| `source_explode.skip_paths` | list | site-packages / node_modules / etc. | frames in these paths emit "library" stubs |
| `detectors` | list | see below | what to watch for |
| `rules` | list | `[]` | YAML rule engine; first match wins |
| `healing.claude.enabled` | bool/str| `auto` | `true`/`false`/`auto`; `auto` enables when skill or CLI is present |
| `healing.claude.mode` | enum | `auto` | `auto`/`interactive`/`langgraph`/`subprocess`; see [healer modes](#healer-modes). `subprocess` is soft-deprecated since 0.10.0. |
| `healing.claude.command` | list | `["claude", "--print"]` | how to invoke Claude in subprocess mode |
| `healing.claude.timeout_seconds` | int | `600` | Claude's budget per incident |
| `healing.claude.request_path` | str | `.autosentry/recovery_request.md`| interactive handshake: file the monitor writes |
| `healing.claude.response_path` | str | `.autosentry/recovery_response.md`| interactive handshake: file the subagent writes |
| `healing.langgraph.enabled` | bool | `false` | enable the LangGraph headless healer |
| `healing.langgraph.provider` | enum | `anthropic` | `anthropic` (`ANTHROPIC_API_KEY`), `openai` (`OPENAI_API_KEY`), or `google` (`GOOGLE_API_KEY`) |
| `healing.langgraph.model` | str | provider default | model name passed to the provider |
| `healing.langgraph.cross_check` | bool | `false` | run a second LLM on the diagnosis before finalizing; can use a different provider |
| `healing.langgraph.max_steps` | int | `10` | maximum graph steps before the diagnosis is finalized |
| `healing.langgraph.temperature` | float | `0.2` | LLM temperature |
| `healing.langgraph.request_timeout_seconds` | int | `120` | per-LLM-call timeout |
| `healing.escalate_to_claude_after` | int | `max_restarts // 5` (≥ 1) | force-escalate to the agent after N unverified rule restarts (default = 2 with `max_restarts=10`) |
| `healing.escalate_on_rule_regression` | bool | `true` | if a rule-based fix regresses, force the agent on the next attempt for that detector |
| `healing.verify_window_seconds` | int | `600` | window in which a re-fire counts as a regression |
| `healing.budget.max_attempts_per_detector_per_hour` | int | `5` | per-detector heal-attempt rate cap |
| `notifiers` | list | `[{kind: log}]` | event sinks |
| `state_path` | str | `.autosentry/state.json` | persistent state location |
| `incidents_dir` | str | `.autosentry/incidents` | where incident folders go |
| `vault.enabled` | bool | `true` | set `false` to disable all vault writes |
| `vault.path` | str | `.autosentry/vault` | where vault notes are written |
| `vault.pattern_threshold` | int | `3` | number of matching incidents before a `patterns/` note is created |
| `vault.similarity_threshold` | float | `0.2` | Levenshtein distance fraction for message-similarity grouping (0 = exact match only) |
| `vault.narratives.enabled` | bool | `false` | LLM-generated prose for first-occurrence significant events (patterns, regressions, exhaustions). Off by default. |
| `vault.narratives.provider` | enum | `anthropic` | same provider matrix as `healing.langgraph` — `anthropic`, `openai`, or `google` |
| `vault.narratives.model` | str | provider default | model name passed to the narrator |
---
## Detectors
| kind | what it does |
|---------------|-------------------------------------------------------------------------------------------------------|
| `pattern` | Fires when a log line matches a regex. Cheapest, most common. |
| `traceback` | Picks up multi-line stack traces from **Python**, **Node/JS**, **Go**, **Rust**, **Java**. |
| `stall` | With `metric_regex`: progress value doesn't advance for N seconds → anomaly. Without: no log output for N seconds → anomaly. |
| `exit_code` | Process exits non-zero (or zero too, if you flip `nonzero_only: false`). |
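To make the two `stall` behaviours concrete, here is a minimal sketch under stated assumptions: the class name, the `on_line` / `on_tick` methods, and the explicit `now` timestamps are invented for this example, and the metric is assumed to be an integer captured by the first regex group. It is not autosentry's detector implementation.

```python
import re
from typing import Optional


class StallDetector:
    """Reports a stall when nothing has advanced for `no_progress_seconds`."""

    def __init__(self, metric_regex: Optional[str], no_progress_seconds: float,
                 start: float = 0.0) -> None:
        self._regex = re.compile(metric_regex) if metric_regex else None
        self._window = no_progress_seconds
        self._last_value: Optional[int] = None
        self._last_progress = start

    def on_line(self, line: str, now: float) -> None:
        if self._regex is None:
            # No metric_regex: any log output at all counts as progress.
            self._last_progress = now
            return
        match = self._regex.search(line)
        if match is None:
            return
        value = int(match.group(1))
        # Only a strictly larger value counts; a repeated step is not progress.
        if self._last_value is None or value > self._last_value:
            self._last_value = value
            self._last_progress = now

    def on_tick(self, now: float) -> bool:
        """Called periodically; True means the stall anomaly should fire."""
        return now - self._last_progress >= self._window
```

Passing `now` in explicitly keeps the sketch deterministic; a real loop would presumably feed it `time.monotonic()`.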
Example detector block:
```yaml
detectors:
  - kind: pattern
    name: oom
    regex: "(OutOfMemoryError|CUDA out of memory)"
  - kind: pattern
    name: nccl
    regex: "NCCL.*(error|timeout)"
  - kind: traceback
  - kind: stall
    name: training_stall
    metric_regex: "step (\\d+)/"
    no_progress_seconds: 1800
  - kind: exit_code
```
## Recovery rules and Claude fallback
The agent is the main healer. Rules are a cheap fast lane for the
small set of known transients where a deterministic action is
known-good — restart on `OOM`, set `NCCL_P2P_DISABLE=1` and restart on
NCCL hiccups, drop the batch size on `CUDA out of memory`. Anything
that isn't one of those falls through to the agent immediately. Rules
that *do* match but produce a fix that regresses inside the verify
window pivot the next attempt to the agent automatically (see
[Healer-aware restart budget](#healer-aware-restart-budget)).
Rules are tried top-down; the first one whose `match` clause is
satisfied by a detection wins. The Claude healer has several modes;
with `mode: auto`, one is picked automatically based on what's
installed (see [mode resolution](#healer-modes) below).
### Healer modes
Four runtimes exist, split between free (subscription-backed) and per-call billing:
| runtime | billing | when to use |
|--------------------------------------|-------------------------------------------|----------------------------------------------|
| `dispatch.mode: session` | **Free** — runs under Claude Code sub. | **Preferred** for Claude Code users. |
| `healing.claude.mode: interactive` | **Free** — same Claude Code session. | Legacy handshake; superseded by `session`. |
| `healing.claude.mode: subprocess` | **Per-call** — `claude --print` to API. | Superseded; soft-deprecated in 0.10.0. |
| `healing.claude.mode: langgraph` | **Per-call** — your provider key. | Headless deployments (no Claude Code). |
**`dispatch.mode: session` (preferred for Claude Code users)** — the
Claude Code session itself is the healer. Under this mode the monitor
still detects failures, writes incident folders, and maintains the
heartbeat — but it doesn't invoke any healer internally. Instead, it
writes `.autosentry/session_dispatch_request` so the Stop hook (auto-
installed by `autosentry init`) can wake the running session. The
`/autosentry` skill then calls `autosentry probe`, walks each pending
incident, and dispatches the fix via `autosentry session apply`.
Everything runs inside the subscription you're already paying for.
Set `dispatch: { mode: session }` to opt in.

    dispatch:
      mode: session   # default: builtin

The `restart_policy` safety net still runs under session mode — if
the child dies and no session is active to react, the supervisor still
auto-restarts up to `max_restarts` times.
**`healing.claude.mode: langgraph` (headless, BYO provider)** — a
multi-step LangGraph diagnosis graph that runs without any Claude Code
session. The graph: `prepare_context → diagnose (LLM + tools:
read_file / grep_repo / view_log_excerpt) → optional cross_check →
finalize`, bounded by `max_steps`. Supports three providers via BYO API
key:
- `anthropic` (`ANTHROPIC_API_KEY`) — Anthropic Claude models
- `openai` (`OPENAI_API_KEY`) — OpenAI models
- `google` (`GOOGLE_API_KEY`) — Google Gemini models
Cross-check can mix providers (e.g. Claude diagnoses, GPT validates).
Missing API keys surface a clean error and fall back to the
`restart_policy` safety net — no stack traces.

    healing:
      claude:
        mode: langgraph
      langgraph:
        enabled: true
        provider: anthropic          # or openai, google
        model: claude-opus-4-5
        cross_check: false
        max_steps: 10
        temperature: 0.2
        request_timeout_seconds: 120

**`healing.claude.mode: interactive` (legacy file handshake)** — write
a recovery request file with YAML frontmatter (incident id, detector,
recommended subagent type), then block waiting for a response. The
`/autosentry` slash command running in your open Claude Code session
sees the request, **spawns a subagent via the Task tool** with the
incident's full context, and writes the response (typically via
`autosentry healer respond`). Keeps your main session clean; the
subagent owns the diagnosis. Superseded by `dispatch.mode: session` for
new setups.
**`healing.claude.mode: subprocess` (deprecated)** — spawn
`claude --print` as a headless process, pipe the prompt in, capture
stdout. Still functional in 0.10.x but logs a one-line nudge at
config-load directing you to `dispatch.mode: session` (free) or
`healing.claude.mode: langgraph` (multi-step + provider choice).
#### The file handshake (interactive mode)
Two files, one direction each, polled by mtime. No sockets, no daemon.
| file | written by | read by |
|---------------------------------------|-------------------------------------------|-------------------------------|
| `.autosentry/recovery_request.md` | monitor (the blocking healer) | `/autosentry` skill in Claude |
| `.autosentry/recovery_response.md` | a Task-tool subagent (`autosentry healer respond`) | monitor (mtime-gated wait) |
Paths are configurable (`healing.claude.request_path` /
`.response_path`). The healer captures a baseline mtime *before*
writing the request so a stale response file from a previous run is
ignored. The wait is bounded by `healing.claude.timeout_seconds`.
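The baseline-mtime trick is easy to get backwards, so here is a minimal sketch of the waiting side. The function name, the polling interval, and the return convention (`None` on timeout) are assumptions for illustration; only the ordering (baseline first, then request, then a bounded wait) comes from the description above.

```python
import time
from pathlib import Path
from typing import Optional


def _mtime(path: Path) -> float:
    """Modification time, or 0.0 when the file does not exist yet."""
    try:
        return path.stat().st_mtime
    except FileNotFoundError:
        return 0.0


def request_and_wait(request: Path, response: Path, body: str,
                     timeout_seconds: float,
                     poll_seconds: float = 0.5) -> Optional[str]:
    """Write the request, then block until a *fresh* response appears."""
    # Capture the baseline BEFORE writing the request, so a response file
    # left over from a previous run can never satisfy this wait.
    baseline = _mtime(response)
    request.write_text(body, encoding="utf-8")

    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if _mtime(response) > baseline:
            return response.read_text(encoding="utf-8")
        time.sleep(poll_seconds)
    return None  # timed out; the caller falls back to its safety net
```

A writer that wants to avoid exposing a half-written response can write to a temporary file and `os.replace` it into place.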

    healing:
      claude:
        enabled: auto         # auto-detect; never red-light the doctor
        mode: auto            # interactive if skill installed, else subprocess
        subagents:
          default:
            type: general-purpose
            description: "Diagnose an autosentry incident"
          training_stall:
            type: Plan                  # specialize per detector
            description: "Diagnose a stalled training loop"

Mode resolution (when `mode: auto`):
| `/autosentry` skill installed | `claude` on PATH | resolved mode |
|---|---|---|
| yes | — | `interactive` |
| no | yes | `subprocess` |
| no | no | disabled (rule-only — no red doctor row) |
`autosentry doctor` reports the resolved mode so you can see which one
will actually run.
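The resolution table translates directly into a small function. This is a sketch of the table only: `resolve_mode` and `claude_cli_available` are hypothetical names, and how autosentry actually detects the installed skill is not shown here.

```python
import shutil
from typing import Optional


def claude_cli_available() -> bool:
    """True when a `claude` executable is on PATH."""
    return shutil.which("claude") is not None


def resolve_mode(configured: str, skill_installed: bool,
                 claude_on_path: bool) -> Optional[str]:
    """Resolve `healing.claude.mode`; None means rule-only operation."""
    if configured != "auto":
        return configured      # an explicit mode is taken as-is
    if skill_installed:
        return "interactive"   # the skill wins regardless of the CLI
    if claude_on_path:
        return "subprocess"
    return None                # neither present: Claude healer disabled
```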
### Subagents
In interactive mode the healer doesn't talk to Claude directly — it
prompts the `/autosentry` skill (running in the user's open session) to
spawn a **Task-tool subagent** of the type declared in the request
frontmatter. That subagent reads the incident folder, edits the repo
if needed, and writes the response file with one Bash call:

    autosentry healer respond \
      --action restart_with_env \
      --set BATCH_SIZE=4 \
      --diagnosis "OOM at step 8450; halving batch."

Per-detector subagent routing (`healing.claude.subagents`) lets each
failure mode get the right kind of investigator without inflating the
operator's main conversation.
Rule-only operation is a first-class mode, not a degraded one — set
`healing.claude.enabled: false` and autosentry skips both subprocess
and interactive paths.

    rules:
      - name: oom_halve_batch
        match: { detector: oom }
        action:
          kind: restart_with_env
          set:
            BATCH_SIZE: half          # halves the prior overlay
          notify: true
      - name: transient_restart
        match: { detector: nccl }
        action: { kind: restart, notify: true }
      - name: stall_restart
        match: { detector: training_stall }
        action: { kind: restart, notify: true }

Supported actions: `restart`, `restart_with_env` (with `half`/`double`/literal
values in `set:`), `pause`, `abort`, `custom_command`.
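One way to read the `half` / `double` / literal values is as a transform over the current env overlay. The sketch below assumes integer-valued variables, integer halving, and a floor of 1; those details and the `apply_overlay` name are illustrative guesses, not documented autosentry behaviour.

```python
def apply_overlay(env: dict[str, str], changes: dict[str, str]) -> dict[str, str]:
    """Return a new env with `set:` changes applied on top of `env`."""
    result = dict(env)
    for key, spec in changes.items():
        if spec == "half":
            # Assumption: integer halving, never dropping below 1.
            result[key] = str(max(1, int(result[key]) // 2))
        elif spec == "double":
            result[key] = str(int(result[key]) * 2)
        else:
            result[key] = str(spec)  # any other value is used literally
    return result
```

With this reading, `BATCH_SIZE: half` applied to an overlay of `BATCH_SIZE=8` yields `BATCH_SIZE=4`, matching the OOM example earlier in this README.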
When the Claude healer is invoked, it is given:
- the recovery prompt template at `.autosentry/prompts/recovery.md`,
- the current `state.json`,
- the last incident report (so it has the exploded source frames),
- snapshots of every file listed in `config_snapshots`.
It is expected to (a) optionally edit files in place — those edits are
captured into `fix/diff.patch` — and (b) end its response with an `ACTION:`
block:

    ACTION: restart_with_env
    SET: BATCH_SIZE=4

If Claude says `ACTION: abort`, the monitor stops and waits for a human.
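A parser for that trailing block might look like the sketch below. The `Action` dataclass and `parse_action` are hypothetical, and the choice to let the last `ACTION:` line win is an assumption; the real parser may be stricter.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    kind: str
    env: dict[str, str] = field(default_factory=dict)


def parse_action(response: str) -> Optional[Action]:
    """Extract the ACTION:/SET: block from a healer response, if present."""
    action: Optional[Action] = None
    for raw in response.splitlines():
        line = raw.strip()
        if line.startswith("ACTION:"):
            # A later ACTION: line replaces an earlier one (assumption).
            action = Action(kind=line[len("ACTION:"):].strip())
        elif line.startswith("SET:") and action is not None:
            key, sep, value = line[len("SET:"):].strip().partition("=")
            if sep:  # ignore malformed SET lines without '='
                action.env[key.strip()] = value.strip()
    return action
```

A response with no `ACTION:` line yields `None`, which a caller could treat the same way as a timeout.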
## Fix branches and outcome verification
A healer's *fix* is only as good as the next few minutes of runtime. To
keep regressions out of your working tree, every Claude-driven fix runs
on its own branch and isn't kept unless it survives a verification
window. The pattern is borrowed from
[autoresearch](https://github.com/ulmentflam/autoresearch).
1. When the Claude healer fires, autosentry creates
`autosentry/fix-` off the current HEAD.
2. Claude's edits land on that branch. The supervisor is restarted.
3. The monitor watches for the same detector to re-fire within
`healing.verify_window_seconds` (default 600s).
4. **No recurrence →** the attempt is marked `kept` in `attempts.tsv`.
With `healing.git.auto_merge: true` the branch fast-forwards into
your working branch and is deleted; otherwise the branch is left
for you to merge by hand.
5. **Recurrence inside the window →** the attempt is marked
`regressed`. The working tree is restored, you're returned to your
original branch, and the fix branch stays put as a forensic
artifact.
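The verification step in the list above reduces to a small classification. In this sketch, timestamps are plain seconds and the function name is invented; the `pending` state is taken from the `autosentry analyze` output rather than from any documented API.

```python
from typing import Optional


def classify_attempt(applied_at: float, refire_at: Optional[float], now: float,
                     verify_window_seconds: float = 600.0) -> str:
    """Classify a fix attempt as 'regressed', 'kept', or 'pending'."""
    window_end = applied_at + verify_window_seconds
    # The same detector re-fired inside the window: the fix did not stick.
    if refire_at is not None and applied_at <= refire_at <= window_end:
        return "regressed"
    # The window closed with no recurrence: the fix is kept.
    if now >= window_end:
        return "kept"
    return "pending"
```

A re-fire after the window has closed does not count against the earlier fix; it would start a new incident instead.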
Every attempt is recorded in `.autosentry/attempts.tsv` — flat
tab-separated, append-only, grep-friendly. Browse it with
`autosentry analyze`:
$ autosentry analyze --since 24h
attempts — 14 total (last 24h)
kept=9 pending=1 regressed=3 crashed=1
top failing detectors
┃ detector ┃ attempts ┃
┃ training_stall ┃ 6 ┃
┃ oom ┃ 4 ┃
┃ nccl ┃ 3 ┃
per-rule success
┃ source ┃ total ┃ kept ┃ regressed ┃ success ┃
┃ oom_halve_batch ┃ 4 ┃ 3 ┃ 1 ┃ 75% ┃
┃ stall_restart ┃ 6 ┃ 3 ┃ 2 ┃ 60% ┃
┃ claude ┃ 3 ┃ 3 ┃ 0 ┃ 100% ┃
### Healer-aware restart budget
The restart counter is **outcome-aware**, not a dumb tally. Three pieces,
all in service of the same posture: get the agent on it before rules
burn the budget.
- **Kept fixes reset the counter.** When a verification window closes
with no recurrence, `state.restarts` drops back to 0. A run that
survives a real failure mid-week doesn't burn its restart budget
on the next, unrelated incident.
- **Force-escalate after two unverified restarts.** Once
`state.restarts` hits `healing.escalate_to_claude_after`
(default: `max(1, max_restarts // 5)` — so **2** with the default
`max_restarts=10`), the *next* detection skips the rule healer and
goes straight to the agent. Rules clearly aren't sticking; bring in
the heavier diagnosis.
- **Rule regression auto-pivots to the agent.** If a rule-based fix
regresses inside the verify window
(`healing.escalate_on_rule_regression=true` by default), the next
attempt for that detector skips rules entirely and routes to the
agent. Rules already failed on that detector — recycling them is
wasted budget. The marker clears after use, so a *different*
detector still gets the cheap rule path on its first try.
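The routing rules above can be condensed into a short sketch. Only the default formula `max(1, max_restarts // 5)` comes from the text; the function names and the way the regression marker is passed in are hypothetical.

```python
from typing import Optional


def escalation_threshold(max_restarts: int = 10) -> int:
    """Documented default for healing.escalate_to_claude_after."""
    return max(1, max_restarts // 5)


def choose_healer(
    restarts: int,
    max_restarts: int = 10,
    rule_regressed_for_detector: bool = False,
    escalate_after: Optional[int] = None,
) -> str:
    """Return 'agent' or 'rules' for the next detection (illustrative only)."""
    threshold = escalate_after if escalate_after is not None else escalation_threshold(max_restarts)
    # A rule that already regressed on this detector sends the next attempt to the agent.
    if rule_regressed_for_detector:
        return "agent"
    # Too many unverified restarts: skip the rule healer.
    if restarts >= threshold:
        return "agent"
    return "rules"
```

With the defaults, `choose_healer(2)` routes to the agent while `choose_healer(1)` still tries rules. A kept fix resetting `state.restarts` to 0 would bring the next incident back to the rule path.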
The complementary per-detector rate cap
(`healing.budget.max_attempts_per_detector_per_hour`, default 5) keeps a
runaway failure mode from monopolizing the healer. When the cap is
exhausted, the monitor still writes incidents and notifies, but stops
trying fixes for that detector until a manual `approve` lands in the
Slack inbox.
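One way to picture the per-detector cap is a sliding one-hour window of attempt timestamps. This is a sketch under assumptions: the class `DetectorRateCap` is hypothetical, and the text does not say whether the real cap uses a sliding or fixed window, or whether it recovers on its own once the hour has passed.

```python
from collections import defaultdict, deque


class DetectorRateCap:
    """Illustrative sliding-window cap on fix attempts per detector."""

    def __init__(self, max_attempts_per_hour: int = 5, window_seconds: float = 3600.0):
        self.max_attempts = max_attempts_per_hour
        self.window = window_seconds
        self._attempts = defaultdict(deque)   # detector name -> attempt timestamps

    def allow(self, detector: str, now: float) -> bool:
        """Record and permit an attempt unless the detector is over its cap."""
        q = self._attempts[detector]
        # Drop attempts that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_attempts:
            return False
        q.append(now)
        return True

    def approve(self, detector: str) -> None:
        """Model a manual `approve`: clear the detector's history."""
        self._attempts[detector].clear()
```

Each detector has its own queue, so a noisy `oom` detector does not consume the budget of `nccl`.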
## Notifications
Notifier specs are a list under `notifiers:`. Built-ins:
notifiers:
  - kind: log                   # always-on default
  - kind: slack_outbox
    outbox_path: .autosentry/slack_outbox.jsonl
    channel: "C0A4UK987ND"      # Slack channel id
    thread_key: "pipeline"
  - kind: discord_outbox
    outbox_path: .autosentry/discord_outbox.jsonl
    channel: "123456789012345678"   # Discord channel id (snowflake)
    thread_key: "pipeline"
  - kind: webhook
    url: "https://hooks.example.com/autosentry"
The `slack_outbox` and `discord_outbox` notifiers never talk to chat
directly; they append JSON lines to an outbox file. A separate
`autosentry dispatcher run` daemon drains the
outbox and (with the `slack_api` or `discord_bot` backend) also polls
the thread for replies. The dispatcher is *lazy*:
- **Outbox drain is mtime-gated** — when nothing has been queued, the
dispatcher's loop costs one `stat()` call.
- **Inbound polling is trigger-driven** — the monitor `touch()`es
`.autosentry/inbox_poll_request` on every detection fire, which is
what wakes the dispatcher's Slack-thread poll. A long-period sweep
(`--idle-inbound-seconds 300`) catches replies sent during quiet
stretches.
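The mtime gate can be sketched in a few lines. This is not the dispatcher's code: `OutboxDrainer` is a hypothetical class that shows the idea of paying one `os.stat()` per loop and only opening the JSONL outbox when its modification time has changed.

```python
import json
import os


class OutboxDrainer:
    """Illustrative mtime-gated reader for a JSONL outbox file."""

    def __init__(self, outbox_path: str):
        self.outbox_path = outbox_path
        self._last_mtime_ns = None   # mtime seen on the previous drain
        self._offset = 0             # bytes already consumed

    def drain(self) -> list:
        """Return newly appended messages, or [] when the file is unchanged."""
        try:
            st = os.stat(self.outbox_path)
        except FileNotFoundError:
            return []
        if st.st_mtime_ns == self._last_mtime_ns:
            return []                # idle path: one stat() and nothing else
        self._last_mtime_ns = st.st_mtime_ns
        with open(self.outbox_path, "rb") as fh:
            fh.seek(self._offset)
            data = fh.read()
        # Consume only complete lines so a half-written record is left for later.
        end = data.rfind(b"\n") + 1
        self._offset += end
        return [json.loads(line) for line in data[:end].splitlines() if line.strip()]
```

Tracking a byte offset means each message is returned once, even though the file itself is append-only and never truncated in this sketch.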
This indirection mirrors the original `rad_monitor.py` and lets
autosentry run on machines without outbound network. The monitor
consumes `slack_inbox.jsonl` / `discord_inbox.jsonl` on its own tick
and applies recognized commands (`abort`, `pause`, `resume`,
`set max_restarts N`, `approve`, `comment:`) directly to the
supervised process.
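For a sense of what recognizing those commands might involve, here is a minimal parsing sketch. The function name, the returned tuple shape, and the case-insensitive matching are assumptions; only the command vocabulary comes from the list above.

```python
import re
from typing import Optional, Tuple

_SET_MAX = re.compile(r"^set\s+max_restarts\s+(\d+)$", re.IGNORECASE)


def parse_inbox_command(text: str) -> Optional[Tuple[str, object]]:
    """Return (command, argument) for a recognized reply, else None."""
    msg = text.strip()
    lowered = msg.lower()
    # Bare keywords carry no argument.
    if lowered in {"abort", "pause", "resume", "approve"}:
        return (lowered, None)
    # "set max_restarts N" carries an integer argument.
    m = _SET_MAX.match(msg)
    if m:
        return ("set_max_restarts", int(m.group(1)))
    # "comment:" keeps the free text that follows the prefix.
    if lowered.startswith("comment:"):
        return ("comment", msg[len("comment:"):].strip())
    return None
```

Anything that does not match returns `None`, so ordinary chatter in the thread is ignored by this sketch.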
Pick the backend with env vars or `--backend`:
| credentials present | backend chosen | inbound? |
|-----------------------------------------|---------------------|----------|
| `SLACK_BOT_TOKEN` | `slack_api` | yes |
| `DISCORD_BOT_TOKEN` | `discord_bot` | yes |
| `SLACK_WEBHOOK_URL` | `webhook` | no |
| `DISCORD_WEBHOOK_URL` | `discord_webhook` | no |
| (none) | `stdout` | no |
Run a Slack daemon and a Discord daemon side by side — the dispatcher
auto-namespaces its state/inbox/marker files per platform.
## Launching from your AI editor
`autosentry skills install` drops a `/autosentry` slash-command into
your repo for whichever AI editor you use. Once it's there, typing
`/autosentry` asks the agent to bootstrap autosentry, configure it
for your process, or walk you through the last incident — without
leaving your editor.
# install in just this repo (default)
autosentry skills install # all tools, /autosentry skill
autosentry skills install --tool claude # one tool
autosentry skills install --skill init # the focused /autosentry-init slash command
autosentry skills install --skill update # the focused /autosentry-update slash command
autosentry skills install --skill all # /autosentry + /autosentry-init + /autosentry-update
# install once, inherit everywhere
autosentry skills install --scope global # writes ~/.claude/commands/, ~/.codex/prompts/, ...
autosentry skills install --scope global --skill all
autosentry skills list # full destination table (local + global)
Three skills land here:
- **`/autosentry`** — full operator playbook (install → init → run → operate → interactive recovery).
- **`/autosentry-init`** — focused onboarding of a fresh repo (no operator/recovery content). Smaller reading cost for AI agents that only have one job.
- **`/autosentry-update`** — focused check-and-upgrade: runs `autosentry update --check` and applies the right backend (uv / pipx / pip / Homebrew) when you're behind.
Supported tools:
| tool | dropped at | invoke |
|---------------------------------|-----------------------------------------|------------------|
| **Claude Code** | `.claude/commands/autosentry.md` | `/autosentry` |
| **OpenCode** | `.opencode/command/autosentry.md` | `/autosentry` |
| **OpenAI Codex CLI** | `.codex/prompts/autosentry.md` | `/autosentry` |
| **Gemini (Antigravity / CLI)** | `.gemini/commands/autosentry.toml` | `/autosentry` |
| **Cursor** | `.cursor/commands/autosentry.md` | `/autosentry` |
| **Aider** | `.aider.conf.yml` (binds `AGENTS.md`) | ambient context |
| **Continue.dev** | `.continue/config.json` | `/autosentry` |
| **Windsurf (Cascade)** | `.windsurfrules` | ambient context |
| **Zed** | `.zed/prompts/autosentry.md` | `/autosentry` |
| Universal (any AGENTS.md-aware) | `AGENTS.md` at the repo root | auto-loaded |
All wrappers defer to `AGENTS.md` for the full playbook; the
single source of truth for the agent's instructions lives there.
The skill prompt itself is canonical and lives at
`src/autosentry/templates/skills/autosentry.md` inside this repo. All
per-tool wrappers either embed it or reference it.
## Repairing old or broken installs
`autosentry doctor --fix` runs idempotent auto-repairs across 11 checks.
Run it on any install that's misbehaving, was partially initialized, or
was upgraded from a pre-0.8 version without a migration pass:
autosentry doctor --fix
What each check repairs:
| check | what it catches | repair |
|----------------------------|----------------------------------------------|-----------------------------------------|
| legacy config | pre-0.8 root-level `autosentry.yaml` | move to `.autosentry/` |
| `.autosentry` tree | missing `incidents/` / `logs/` / `prompts/` | recreate dirs |
| vault dir | missing `.autosentry/vault/` | rebuild from `incidents/index.jsonl` |
| `state.json` | unparseable JSON | rotate aside as `.broken-` |
| `attempts.tsv` | malformed rows | rotate aside |
| `incidents/index.jsonl` | malformed JSONL lines | rotate aside |
| stop hook | `dispatch.mode: session` without hook | install via `autosentry hooks install` |
| langgraph api keys | provider key missing | (no auto-fix; surfaced loud) |
| recovery request | orphaned `recovery_request.md` blocking runs | rotate to `.stale-` |
| `claude.mode` steering | deprecated `subprocess` mode | (no fix; explains alternatives) |
`--fix` is idempotent — running it twice is a no-op the second time.
Sensitive things (API keys, config edits) are never auto-changed.
You can also manage the Stop hook independently:
autosentry hooks install # install the Stop hook into .claude/settings.local.json
autosentry hooks remove # remove it
## Update
autosentry update # update to latest stable
autosentry update --check # current vs latest; recommends how to upgrade
autosentry update --check --json # machine-readable: {"current","latest","is_outdated"}
autosentry update --pre # allow pre-releases
`autosentry update` auto-detects how it was installed — uv tool, pipx,
`pip --user`, or **Homebrew** — and runs the matching upgrade (`brew upgrade
autosentry` for tap installs). `--check` caches the PyPI lookup for a day
(pass `--no-cache` to force a live query) and always exits 0, so the
`/autosentry` skill can run it on every invocation and nudge you when a newer
release is out.
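A script or skill could consume the `--check --json` output along these lines. The keys `current`, `latest`, and `is_outdated` are the ones listed above; the helper name, the assumption that `is_outdated` is a boolean, and the version strings in the usage note are illustrative.

```python
import json
from typing import Optional


def update_nudge(check_json: str) -> Optional[str]:
    """Return a one-line nudge if the JSON payload reports a newer release."""
    info = json.loads(check_json)
    if info.get("is_outdated"):
        return f"autosentry {info['latest']} is available (installed: {info['current']})"
    return None
```

For example, feeding it `{"current": "1.0.0", "latest": "1.1.0", "is_outdated": true}` (made-up versions) returns a message, while an up-to-date payload returns `None`.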
Or use the standalone updater (works for installs made by `install.sh` even
when the CLI itself is broken):
curl -fsSL https://raw.githubusercontent.com/ulmentflam/autosentry/main/update.sh | sh
## Status & roadmap
The package is **3 - Alpha** on PyPI. Individual subsystems below are
labelled "stable" because the test suite pins their behavior and the
public CLI / YAML schemas are under semver discipline. Alpha applies
to the project shape: defaults may shift, less-trodden combinations
(SLURM + interactive Claude + Discord, say) have had hours of use
rather than weeks. Expect breaking changes in `0.x` minor releases;
the [`CHANGELOG`](./CHANGELOG.md) calls them out.
| capability | status |
|---------------------------------------------------------|---------------|
| Local subprocess supervisor | **stable** |
| Pattern / traceback / stall / exit detectors | **stable** |
| YAML rule engine + Claude CLI fallback | **stable** |
| Tree-sitter source exploder (py/js/ts/go/rust/java) | **stable** |
| Incident store + structured logs + slack file-outbox | **stable** |
| `install.sh` one-liner | **stable** |
| `autosentry update` mechanism | **stable** |
| AI-editor skills (Claude/OpenCode/Codex/Gemini/Cursor) | **stable** |
| SLURM supervisor | **stable** |
| Docker supervisor | **stable** |
| Attach-to-PID supervisor | **stable** |
| Slack dispatcher daemon (outbound + inbound) | **stable** |
| `autosentry watch` status TUI | **stable** |
| `autosentry web` incident viewer | **stable** |
| Fix-branch isolation + outcome verification | **stable** |
| `attempts.tsv` ledger + `autosentry analyze` | **stable** |
| `program.md` operator mission statement | **stable** |
| Session dispatch (`dispatch.mode: session`) | **stable** |
| `autosentry probe` + Stop hook + `hooks install/remove` | **stable** |
| `autosentry session apply` action queue | **stable** |
| LangGraph headless healer (Anthropic/OpenAI/Google) | **stable** |
| Obsidian-compatible markdown vault | **stable** |
| `autosentry web` vault routes + Mermaid graph | **stable** |
| `autosentry doctor --fix` auto-repair | **stable** |
| LLM vault narratives (`vault.narratives`) | **stable** |
| PyPI release automation | planned |
| Slack interactive buttons (approve/abort UI) | planned |
## Comparison
autosentry is **not** trying to replace your supervisor of record. It runs
*above* one (or alongside it) and is designed for the long-running,
high-friction-to-restart job: the multi-day training run, the slow nightly
ETL, the batch service that can't easily be turned into a stateless k8s
deployment.
## FAQ
**Isn't this just systemd + a cron?**
For category (1) failures — yes, a `Restart=on-failure` unit covers
it. autosentry earns its keep when the cost of a bad restart is high
(re-warming a cache, re-loading model weights, re-running an
hour-long preprocessing stage) and when the failure mode isn't
"process exited non-zero" — a stalled training loop, a loss spike, a
silently-degraded throughput. systemd can't read your log stream,
match a regex against it, edit a config file, and verify the fix
didn't regress. autosentry runs *above* a supervisor of record, not
in place of it.
**Why YAML for rules instead of code/Python?**
Operators edit autosentry mid-incident, often from a phone over
Slack. YAML diff-reviews cleanly, survives a copy/paste into a chat
thread, and can't trigger arbitrary import-time side effects. The
escape hatch for genuine logic is `action: { kind: custom_command }`
— run a script, return an action. We took the same trade as
Kubernetes manifests for the same reason.
**What if Claude makes a worse fix?**
Two safeguards. First, every Claude edit lands on an isolated
`autosentry/fix-` branch — your working tree is
untouched until verification passes. Second, the
[verification window](#fix-branches-and-outcome-verification) restores
the working tree and marks the attempt `regressed` if the same
detector re-fires inside `healing.verify_window_seconds`. The diff
stays on disk as a forensic artifact. You can also clamp Claude to
diagnose-only with `healing.claude.command: ["claude", "--print",
"--no-edit"]`.
**Does autosentry need Claude installed?**
You get the most out of autosentry with an agent — that's the
headline feature. With `healing.claude.enabled: false` (or simply
with no `claude` CLI on PATH when `enabled: auto`), autosentry falls
back to rules-only. `doctor` won't redline — rules-only is a
supported mode — but you've turned off the part that makes this
different from systemd. Useful on air-gapped boxes, in CI where the
rule set is well-tested, or when you want a hard "no LLM in the
loop" guarantee.
**What happens to the process when the monitor itself crashes?**
The supervised process keeps running — autosentry uses
`start_new_session=True` so it doesn't share a process group. On restart,
the monitor reads its persisted state and resumes. If the process is gone
by then, it starts a fresh one.
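The effect of `start_new_session=True` can be seen with the standard library alone. The sketch below is POSIX-only and uses a hypothetical helper, `spawn_detached`; it is not autosentry's launcher, just a demonstration of the `subprocess.Popen` flag.

```python
import os
import subprocess
import sys


def spawn_detached(argv):
    """Start `argv` in a new session, so it leaves the parent's process group."""
    # start_new_session=True makes the child call setsid() before exec.
    return subprocess.Popen(argv, start_new_session=True)


def shares_session_with_parent(pid: int) -> bool:
    """True if `pid` is in the same session as the calling process."""
    return os.getsid(pid) == os.getsid(0)
```

A child started this way reports a different session id from the parent, which is why a signal aimed at the monitor's process group, such as Ctrl-C in the monitor's terminal, does not reach the supervised process.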
**Can rules call out to scripts?**
Yes: `action: { kind: custom_command, command: [...] }`. The command
runs in the configured `cwd` with the configured env and returns the
action to apply.
**Will Claude actually edit my code?**
By default, yes. It runs in the configured `cwd`. Any file edits it makes
are captured as a diff in the incident folder so you can review and revert.
Set `healing.claude.command` to something more restrictive (e.g.
`["claude", "--print", "--no-edit"]`) if you'd rather Claude only diagnose.
**Won't this double-restart with my existing supervisor?**
In `attach` mode autosentry will never restart a process it didn't
start — restart actions raise. In `local`/`slurm`/`docker` mode
autosentry *is* the supervisor; don't also configure
`Restart=on-failure` for the same unit. Use one or the other.
**What's the resource overhead?**
The monitor is a single Python process that wakes on the
`poll_interval_seconds` tick (default 30s) and on each log line your
process writes. Steady-state memory is dominated by the loaded
tree-sitter grammars plus a bounded log-tail buffer. The dispatcher
is a separate, optional process and is mtime-gated — when no
notifications are queued, its loop is a single `stat()`. There's no
background polling of remote services and no persistent network
connection.
**My supervisor keeps running after the supervised process exits cleanly (code 0).**
This was the pre-0.8.5 default (`restart_always`). Set `process.lifecycle:
restart_on_failure` (the new default) — a clean exit ends the supervisor; a
non-zero exit still routes through the healer. For batch one-shot jobs, use
`process.lifecycle: one_shot` so any exit (clean or dirty) ends the supervisor.
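The three lifecycle values can be summarized as a small decision table. This is a sketch, not the supervisor's code: `on_exit` is a hypothetical name, and the behavior of `restart_always` on a non-zero exit (routing through the healer) is an assumption, since the text only describes its clean-exit behavior.

```python
def on_exit(lifecycle: str, exit_code: int) -> str:
    """Return 'stop', 'heal', or 'restart' for a process that just exited."""
    if lifecycle == "one_shot":
        return "stop"                                    # any exit ends the supervisor
    if lifecycle == "restart_on_failure":
        return "stop" if exit_code == 0 else "heal"      # clean exit ends the supervisor
    if lifecycle == "restart_always":
        # Assumption: failures still go through the healer in this mode.
        return "restart" if exit_code == 0 else "heal"
    raise ValueError(f"unknown lifecycle: {lifecycle!r}")
```

Read this way, the complaint in the question corresponds to the `restart_always` row with exit code 0.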
**iCloud Drive keeps breaking my venv.**
See the [install](#install) note. Either point the venv outside iCloud or
let the `Makefile` auto-redirect it.
## License
Apache 2.0 — see [LICENSE](./LICENSE) and [NOTICE](./NOTICE).