Keshav0375/Sentinel
# Sentinel
**Autonomous DevOps incident response agent** — multi-agent pipeline that triages alerts, diagnoses root causes, drafts remediation plans, and communicates status, with a mandatory human-approval gate before any destructive action.
Built as a portfolio project demonstrating production-grade agentic engineering: multi-agent orchestration, HITL safety, episodic memory, trajectory evaluation, and real-time observability.
## How It Works
1. An alert fires.
2. The **Orchestrator Agent** coordinates handoffs, enforces the 15-call cap, and writes short-term memory (STM).
3. The orchestrator hands off to five specialists, each with its own tools:
   - **Triage Agent**: `get_service_metadata`, `search_past_incidents`
   - **Log Analyst**: `fetch_logs`
   - **Deploy Correlator**: `list_recent_deploys`
   - **Remediation Agent**: `draft_rollback_pr`, `request_human_approval` (the HITL gate)
   - **Comms Agent**: `draft_slack_summary`
4. A human approves or rejects the proposed remediation.
5. The incident is resolved and stored in episodic memory.

Every tool call is captured as a trajectory. After resolution, an LLM judge scores the trajectory across 6 dimensions and writes a report viewable in the web dashboard.
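
As a rough illustration of trajectory capture combined with the 15-call cap, the sketch below records each tool call and refuses the sixteenth. The `Trajectory`, `ToolCall`, and `ToolCallCapExceeded` names, and the argument values, are hypothetical and not taken from the Sentinel codebase; only the cap of 15 and the tool names come from this README.

```python
from dataclasses import dataclass, field
from typing import Any

MAX_TOOL_CALLS = 15  # cap stated in this README; how Sentinel enforces it is assumed


class ToolCallCapExceeded(RuntimeError):
    """Raised when an incident run tries to exceed the tool-call budget."""


@dataclass
class ToolCall:
    agent: str
    tool: str
    arguments: dict[str, Any]


@dataclass
class Trajectory:
    incident_id: str
    calls: list[ToolCall] = field(default_factory=list)

    def record(self, agent: str, tool: str, **arguments: Any) -> ToolCall:
        # Check the budget before recording, so the stored trajectory never exceeds the cap.
        if len(self.calls) >= MAX_TOOL_CALLS:
            raise ToolCallCapExceeded(
                f"{self.incident_id}: cap of {MAX_TOOL_CALLS} tool calls reached"
            )
        call = ToolCall(agent=agent, tool=tool, arguments=arguments)
        self.calls.append(call)
        return call


# Hypothetical usage: argument names and values are placeholders.
trajectory = Trajectory(incident_id="bad_deploy_01")
trajectory.record("triage", "get_service_metadata", service="api-gateway")
trajectory.record("log_analyst", "fetch_logs", service="api-gateway", minutes=30)
print([call.tool for call in trajectory.calls])
```

Keeping the cap check inside the recorder means the same object that feeds the judge also bounds the run, which is one plausible way to keep the two concerns in sync.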
## Features
- **Multi-agent pipeline** — 5 specialist agents (Triage, Log Analyst, Deploy Correlator, Remediation, Comms) orchestrated via OpenAI Agents SDK handoffs
- **HITL safety gate** — `request_human_approval` is the only path to destructive actions; no agent can bypass it
- **Episodic memory** — past incidents embedded and stored in SQLite; similar incidents surface automatically during triage
- **Semantic memory** — service map, dependency graph, and runbooks loaded from seed data
- **Live incident dashboard** — real-time SSE stream of agent activity at `http://localhost:8000/`; approve/reject HITL from the browser
- **Trajectory eval** — LLM-as-judge scores every incident across 6 dimensions (0–5), with a Markdown and JSON report
- **Eval results dashboard** — browse per-scenario scores at `http://localhost:8000/eval`
- **10 synthetic scenarios** — 5 failure classes (bad deploy, DB pool, downstream outage, memory leak, config regression)
## Quick Start
### Prerequisites
- Python 3.12+
- [Groq API key](https://console.groq.com/keys) (free tier is sufficient)
- Docker + docker-compose (optional but recommended)
### 1. Clone and configure
    git clone https://github.com/Keshav0375/sentinel.git
    cd sentinel
    cp .env.example .env
    # Edit .env and add your GROQ_API_KEY
### 2a. Run with Docker
    docker-compose up --build

Dashboard: [http://localhost:8000](http://localhost:8000)
### 2b. Run without Docker
    python -m venv .venv
    source .venv/bin/activate       # Windows: .venv\Scripts\activate
    pip install -e ".[dev]"
    python -m data.seed             # seed service map + runbooks
    uvicorn sentinel.main:app --reload

Dashboard: [http://localhost:8000](http://localhost:8000)
### 3. Switch providers (optional)
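Model strings use the `provider/model` format. Each of the three model variables (`SENTINEL_TRIAGE_MODEL`, `SENTINEL_ANALYSIS_MODEL`, `SENTINEL_JUDGE_MODEL`) can point at a different provider: set the variable and the matching API key, with no code changes.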
**Switch analysis to Anthropic Claude:**

    SENTINEL_ANALYSIS_MODEL=anthropic/claude-sonnet-4-6
    ANTHROPIC_API_KEY=sk-ant-...

**Switch everything to OpenAI:**

    SENTINEL_TRIAGE_MODEL=openai/gpt-4o-mini
    SENTINEL_ANALYSIS_MODEL=openai/gpt-4o
    SENTINEL_JUDGE_MODEL=openai/gpt-4o-mini
    OPENAI_API_KEY=sk-...

See [available_models.md](available_models.md) for the full model table and provider capability matrix.
### 4. Fire a scenario
**From the dashboard** — select a scenario from the dropdown and click ⚡ Fire Scenario.
**From the CLI**:

    python scripts/run_scenario.py bad_deploy_01
    python scripts/run_scenario.py --list    # see all 10 scenarios

### 5. Run the eval suite
    python scripts/run_eval.py
    # or a subset:
    python scripts/run_eval.py --scenarios bad_deploy_01 db_pool_01

Results → `reports/eval_report.json` + `reports/eval_report.md`
Dashboard → [http://localhost:8000/eval](http://localhost:8000/eval)
## Demo
1. Select `bad_deploy_01` from the dropdown — a critical API gateway alert fires
2. Watch the five agent cards light up in sequence as each specialist completes its analysis
3. An amber HITL overlay appears when the Remediation Agent requests rollback approval
4. Click ✅ Approve — the pipeline continues and the incident resolves
5. Switch to `/eval` to see the LLM judge's per-dimension scores
## Tech Stack
| Layer | Technology |
|-------|-----------|
| Agent framework | [openai-agents](https://github.com/openai/openai-agents-python) (Agents SDK) |
| LLM | LiteLLM multi-provider — `groq/`, `openai/`, `anthropic/`, `azure/` via `provider/model` env var. Default: Groq (free tier). See [available_models.md](available_models.md). |
| API layer | FastAPI + uvicorn |
| Memory — episodic | SQLite via aiosqlite + cosine similarity |
| Memory — semantic | SQLite (service map, runbooks) |
| Embeddings | sentence-transformers `all-MiniLM-L6-v2` (local, 384 dims, zero cost) |
| Real-time | Server-Sent Events (`EventSource` API) |
| Models / validation | Pydantic v2 |
| Config | pydantic-settings |
| Logging | structlog (JSON) |
| HTTP client | httpx |
| Containerisation | Docker + docker-compose |
| Linting | ruff |
| Type checking | pyright |
| Tests | pytest + pytest-asyncio |
## Eval Results
**Eval dimensions (scored 0–5 by the judge):**
| Dimension | What it measures |
|-----------|-----------------|
| `triage_accuracy` | Right service + correct severity? |
| `root_cause_correctness` | Hypothesis matches known root cause? |
| `tool_efficiency` | Right tools in logical order? No redundant calls? |
| `mttr` | Time from alert to remediation proposal (lower is better) |
| `remediation_safety` | HITL gate used? Proposed fix appropriate? |
| `comms_quality` | Slack summary clear, complete, structured? |
Pass threshold: **3.0 / 5.0** (60%)
## Project Structure
    sentinel/
    ├── src/sentinel/
    │   ├── agents/        # 5 specialist agents + orchestrator
    │   ├── tools/         # Tool implementations (fetch_logs, draft_rollback_pr, …)
    │   ├── memory/        # Episodic (SQLite + embeddings), semantic, short-term
    │   ├── eval/          # Rubric, LLM-as-judge, batch runner, report generator
    │   ├── api/           # FastAPI routes (webhooks, SSE, HITL, incidents, scenarios)
    │   ├── infra/         # DB, logging, tracing, event bus, dashboard emitter
    │   ├── models/        # Pydantic domain models
    │   ├── generator/     # Synthetic alert + log + deploy data
    │   └── dashboard/     # index.html (live dashboard) + eval.html
    ├── data/
    │   ├── scenarios/     # 10 JSON scenario files
    │   └── services/      # service_map.json, dependency_graph.json, runbooks.json
    ├── scripts/
    │   ├── run_scenario.py   # Fire a single scenario from CLI
    │   ├── run_eval.py       # Batch eval across all scenarios
    │   └── demo.py           # Interactive demo runner (coming soon)
    ├── tests/             # ~1 000 tests across all layers
    └── reports/           # Trajectory JSON, eval_report.json/md

## Architecture Decisions
**Why multi-agent, not single agent?** Each specialist has a focused system prompt and constrained tool set. This makes each agent independently testable and improvable — changing the Log Analyst's prompt doesn't risk breaking Triage.
**Why HITL at the tool level, not the prompt level?** There is no "execute rollback" tool — only `draft_rollback_pr` + `request_human_approval`. An agent can't bypass approval by ignoring a prompt instruction; it literally has no tool that acts without human sign-off.
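
A simplified sketch of the tool-level gate follows. Only the tool names `draft_rollback_pr` and `request_human_approval` come from this README; the `RollbackDraft` type, the `Decision` enum, `record_human_decision`, and `apply_rollback` are hypothetical stand-ins, and the real signatures in Sentinel may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class RollbackDraft:
    service: str
    target_version: str
    decision: Decision = Decision.PENDING


def draft_rollback_pr(service: str, target_version: str) -> RollbackDraft:
    """Agent-facing tool: produces a draft only; nothing is changed."""
    return RollbackDraft(service=service, target_version=target_version)


def request_human_approval(draft: RollbackDraft) -> RollbackDraft:
    """Agent-facing tool: marks the draft as awaiting a human; it cannot approve."""
    draft.decision = Decision.PENDING
    return draft


# The only tools an agent can be given. There is deliberately no "execute" entry.
AGENT_TOOLS = {
    "draft_rollback_pr": draft_rollback_pr,
    "request_human_approval": request_human_approval,
}


def record_human_decision(draft: RollbackDraft, approved: bool) -> RollbackDraft:
    """Called from the human-facing side (e.g. a dashboard handler), never by an agent."""
    draft.decision = Decision.APPROVED if approved else Decision.REJECTED
    return draft


def apply_rollback(draft: RollbackDraft) -> str:
    """Runs outside the agent tool set and refuses anything not explicitly approved."""
    if draft.decision is not Decision.APPROVED:
        raise PermissionError("rollback requires human approval")
    return f"rolled back {draft.service} to {draft.target_version}"
```

Because `apply_rollback` is absent from `AGENT_TOOLS`, a misbehaving prompt has nothing to call; the safety property rests on the shape of the tool set, not on the model following instructions.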
**Why SQLite for episodic memory?** Zero infrastructure cost for the MVP. The embedding + cosine similarity approach works well at 10–100 incidents. Phase 2 swaps to Cosmos DB with a vector index for scale.
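
The embedding + cosine similarity lookup can be sketched as a brute-force scan, which is adequate at the incident counts mentioned above. The function names and the tiny 3-dimensional vectors are illustrative assumptions; the real embeddings are 384-dimensional and are loaded from SQLite, which this sketch leaves out.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: list[float], stored: dict[str, list[float]], top_k: int = 3
) -> list[tuple[str, float]]:
    """Score every stored incident against the query and return the top_k best matches."""
    scored = [(incident_id, cosine_similarity(query, vec)) for incident_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings; incident IDs are placeholders.
past_incidents = {
    "incident-a": [1.0, 0.0, 0.0],
    "incident-b": [0.0, 1.0, 0.0],
    "incident-c": [0.9, 0.1, 0.0],
}
print(most_similar([1.0, 0.0, 0.0], past_incidents, top_k=2))
```

A linear scan costs one pass over all stored vectors per query, which is why a vector index only becomes worthwhile once the incident history grows well beyond this range.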
**Why a separate judge model?** Using a different model for evaluation reduces self-grading bias. The agents run on `llama-3.3-70b-versatile`; the judge runs on `llama-3.1-8b-instant`. Both are Llama models, so the separation is only partial; in production, you'd use a different model family or provider entirely.
## Phase 2 Roadmap
_Documented here for interview conversations about production readiness:_
| Capability | Phase 2 approach |
|-----------|-----------------|
| Cloud deployment | Azure Container Apps + Bicep IaC |
| Database | Cosmos DB (vector + JSON) replacing SQLite |
| Cache | Redis for short-term memory |
| Model routing | LiteLLM routing is already done (Phase 9); Kong AI Gateway for token budgets is planned for Phase 2 |
| Observability | Self-hosted LangFuse for trace dashboards |
| HITL channel | Slack interactive buttons (real Slack app) |
| PR creation | GitHub App for real PR creation |
| Self-improvement | Nightly ACA Job — re-evals + auto-PR to improve failing prompts |
| Alert source | Azure Service Bus + Datadog webhook for real alerts |
| Adversarial evals | 40 scenarios, 8 failure classes, ambiguous multi-cause cases |
| CI/CD | GHA with eval score gates (fail deploy if avg < 3.5) |
## Running Tests
    pytest                      # run all ~1 000 tests
    pytest tests/test_eval/     # eval layer only
    pytest tests/test_tools/    # tool layer only
    pytest -x -v                # stop on first failure, verbose

## License
MIT