# ai-ops-design
A design document for **AIOps Platform** — an AI-augmented incident
response and observability platform for medium-scale production
infrastructure (10–1000 services).
This repository is **documentation only**. No code lives here. The goal
is to lay out the design of a production-grade AIOps platform end-to-end:
data flow, failure modes, scaling strategy, cost model, security threat
model, and an honest treatment of where LLMs help and where they don't.
The platform itself is implemented across several other repositories;
this repo is the system-level view that ties them together. Where a
subsystem already exists in code, the doc links to it.
## Why this exists
Most published AIOps designs fall into one of two failure modes:
- **Vendor marketing diagrams** — "AI", "ML", and "predictive analytics"
labels on every box; no concrete data flow; no failure model; no
cost discussion.
- **Toy demos** — a notebook that calls an LLM on a log line and
declares the incident-response problem solved.

Real production AIOps platforms are mostly *not* AI. The AI layer
provides roughly 20% of the value at the top of the stack; the other 80%
is unglamorous infrastructure: reliable ingest, structured correlation,
durable storage, careful failure handling, sensible UX.

This document tries to be honest about that ratio, and about which parts
of the system actually benefit from an LLM and which parts a
rule engine handles better.
## The high-level shape

Eight components, in two planes:

**Data plane** (always-on, low-latency, deterministic):

1. **Telemetry collectors** — metrics, logs, traces, hardware events
2. **Storage tier** — time-series DB, log store, trace store
3. **Alert manager** — fires alerts on threshold/anomaly rules
4. **Correlation engine** — rule-based, deterministic; produces
structured findings

**Intelligence plane** (best-effort, async, advisory):

5. **Incident narrator** — LLM rephrases findings into human-readable
incident reports
6. **Past-incident retrieval** — vector similarity over previous
incidents
7. **Deployment risk analyzer** — rule-engine + LLM evaluates each
release for risk
8. **Regression detector** — compares current performance against
baseline; flags drift

The split is deliberate: **the data plane must work without the
intelligence plane**. If the LLM provider is down, incidents still
fire, still correlate, still get routed. The intelligence plane adds
clarity to the report; it is not on the critical path.
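
The degradation rule above can be sketched in a few lines. This is a minimal illustration, not code from any of the linked repositories: the `Finding` type, the `correlate` rule, the alert fields, and the `build_report` and `broken_narrator` helpers are all hypothetical names chosen for the example, and the narrator is called synchronously for brevity even though the design describes it as async.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Finding:
    """Structured output of the deterministic correlation step (hypothetical shape)."""
    service: str
    rule: str
    summary: str


def correlate(alert: dict) -> Finding:
    """Data plane: a toy deterministic rule that maps an alert to a finding."""
    if alert["metric"] == "cpu_utilization" and alert["value"] > 0.9:
        rule, summary = "cpu-saturation", "CPU above 90%"
    else:
        rule, summary = "unclassified", "no rule matched"
    return Finding(alert["service"], rule, summary)


def build_report(
    finding: Finding,
    narrate: Optional[Callable[[Finding], str]] = None,
) -> str:
    """Intelligence plane is advisory: any narrator failure falls back to the plain finding."""
    plain = f"[{finding.rule}] {finding.service}: {finding.summary}"
    if narrate is None:
        return plain
    try:
        return narrate(finding)
    except Exception:
        # The LLM provider being down must never block the incident report.
        return plain


def broken_narrator(finding: Finding) -> str:
    """Stand-in for an LLM call that fails (assumed failure mode: provider timeout)."""
    raise TimeoutError("LLM provider unavailable")


alert = {"service": "checkout", "metric": "cpu_utilization", "value": 0.97}
print(build_report(correlate(alert), narrate=broken_narrator))
```

The point of the shape is that `build_report` always has a deterministic answer in hand before the narrator is consulted, so the narrator can only improve the wording, never gate delivery.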
## How to read this repo
Read in order. Each doc takes about 10 minutes, so the full set takes about 80 minutes.

| # | Doc | What it covers |
|---|-----|----------------|
| 1 | [system-overview](docs/01-system-overview.md) | The whole platform in one tour, who-talks-to-whom, why each piece exists |
| 2 | [incident-pipeline](docs/02-incident-pipeline.md) | One alert's lifecycle: fired → triaged → narrated → routed → resolved |
| 3 | [data-flow](docs/03-data-flow.md) | Where data lives, retention tiers, hot/warm/cold paths |
| 4 | [failure-modes](docs/04-failure-modes.md) | What breaks when, and how the platform degrades gracefully |
| 5 | [scaling-strategy](docs/05-scaling-strategy.md) | Independent scaling axes per component; bottlenecks at each tier |
| 6 | [cost-model](docs/06-cost-model.md) | Back-of-envelope monthly cost at 10 / 100 / 1000 services |
| 7 | [security-threat-model](docs/07-security-threat-model.md) | What an attacker can do; what we defend against; what we don't |
| 8 | [llm-integration](docs/08-llm-integration.md) | Where LLMs add value, where they don't, the failure modes that matter |
## Reference implementations
Several subsystems already exist as standalone repos. The links below
show the design-to-code correspondence:
| Subsystem (in this design) | Reference implementation |
|---|---|
| Hardware-telemetry collector | [redfish-exporter](https://github.com/anshikapundeel/redfish-exporter) — Prometheus exporter for DMTF Redfish BMCs |
| Kernel-scheduler telemetry | [sched-trace](https://github.com/anshikapundeel/sched-trace) — eBPF/bpftrace toolkit for CPU scheduler tail latency |
| Distributed-tracing collector | [flow-trace](https://github.com/anshikapundeel/flow-trace) — OTLP/HTTP-JSON ingest, sharded ring buffer, embedded UI |
| Correlation engine | [perf-advisor](https://github.com/anshikapundeel/perf-advisor) — rule-based Linux performance analyzer with optional LLM hook |
| Performance auditing toolkits | [linux-perf-toolkit](https://github.com/anshikapundeel/linux-perf-toolkit), [pg-perf-toolkit](https://github.com/anshikapundeel/pg-perf-toolkit) |

Three subsystems are documented here but **not yet implemented**:
- **Incident narrator + past-incident retrieval** — see
[docs/02-incident-pipeline.md](docs/02-incident-pipeline.md) for
the design. Planned as `incident-pilot`.
- **Deployment risk analyzer** — see
[docs/08-llm-integration.md](docs/08-llm-integration.md) for the design.
Planned as `deploy-risk`.
- **Regression detector** — see
[docs/05-scaling-strategy.md](docs/05-scaling-strategy.md) for
design. Planned as `regression-radar`.
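
Since the regression detector is not implemented yet, the following is only a sketch of what "compares current performance against baseline; flags drift" could look like. Comparing medians, the 10% `tolerance` default, and the `detect_regression` name are all assumptions made for illustration; the actual design may use a different statistic or threshold.

```python
from statistics import median
from typing import Sequence, Tuple


def detect_regression(
    baseline: Sequence[float],
    current: Sequence[float],
    tolerance: float = 0.10,  # assumed threshold: flag drift above 10%
) -> Tuple[bool, float]:
    """Compare the median of current samples against the baseline median.

    Returns (regressed, relative_drift). Higher values are assumed to be
    worse, as with latency in milliseconds.
    """
    base = median(baseline)
    if base == 0:
        raise ValueError("baseline median is zero; relative drift is undefined")
    drift = (median(current) - base) / base
    return drift > tolerance, drift


# Hypothetical p99 latency samples in milliseconds.
baseline_ms = [100, 102, 98, 101, 99]
current_ms = [118, 121, 120, 119, 122]
regressed, drift = detect_regression(baseline_ms, current_ms)
print(f"regressed={regressed} drift={drift:.0%}")
```

Medians are used here because they are less sensitive to a single outlier sample than means, which matters when the baseline window is short.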
## What this design is and isn't
**Is:**
- A coherent, opinionated design for an AIOps platform at the
10–1000-service scale.
- A frank treatment of which parts genuinely benefit from AI and
which are better off as deterministic systems.
- A reference for what each piece actually costs to run, how it scales,
what its failure modes are, and what its threat surface looks like.

**Isn't:**
- A hyperscale design. Different things matter at 10,000 services
(multi-region everything, lambda architecture for trace storage,
per-tenant isolation) and this doc doesn't pretend to cover them.
- A specific vendor recommendation. Where the design says "time-series
DB" it means Prometheus *or* VictoriaMetrics *or* Mimir; the design
works with any of them.
- A claim of novelty. The patterns here are well-known; the goal is
to document them clearly in one place, with the LLM integration
treated honestly.
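
One reason the time-series DB choice in the list above can stay open is that Prometheus, VictoriaMetrics, and Mimir all accept PromQL instant queries over the Prometheus-style HTTP endpoint `/api/v1/query`. The sketch below only builds the request URL; the host names and ports are hypothetical, the `instant_query_url` helper is not part of any linked repository, and the `/prometheus` path prefix for Mimir reflects its usual default, which a deployment may override.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-compatible instant-query URL for any of the three backends."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical base URLs; only the base differs between backends.
BACKENDS = {
    "prometheus": "http://prometheus:9090",
    "victoriametrics": "http://victoriametrics:8428",
    "mimir": "http://mimir:8080/prometheus",
}

for name, base in BACKENDS.items():
    print(name, instant_query_url(base, "up"))
```

Callers that depend only on this endpoint can switch backends by changing a base URL, which is the property the design relies on.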
## License
MIT (for the prose). Diagrams are CC-BY-4.0.