anshikapundeel/ai-ops-design

# ai-ops-design

A design document for **AIOps Platform** — an AI-augmented incident response and observability platform for medium-scale production infrastructure (10–1000 services).

This repository is **documentation only**. No code lives here.

The goal is to lay out the design of a production-grade AIOps platform end-to-end: data flow, failure modes, scaling strategy, cost model, security threat model, and an honest treatment of where LLMs help and where they don't.

The platform itself is implemented across several other repositories; this repo is the system-level view that ties them together. Where a subsystem already exists in code, the doc links to it.

## Why this exists

Most published AIOps designs fall into one of two failure modes:

- **Vendor marketing diagrams** — "AI", "ML", and "predictive analytics" labels on every box; no concrete data flow; no failure model; no cost discussion.
- **Toy demos** — a notebook that calls an LLM on a log line and declares the incident-response problem solved.

Real production AIOps platforms are mostly *not* AI. The AI layer provides ~20% of the value at the top of the stack; the other 80% is unglamorous infrastructure — reliable ingest, structured correlation, durable storage, careful failure handling, sensible UX. This document tries to be honest about that ratio, and about which parts of the system actually benefit from an LLM versus which parts a rule-engine handles better.

## The high-level shape

![system overview](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/870b5e8161043635.svg)

Eight components, in two planes:

**Data plane** (always-on, low-latency, deterministic):

1. **Telemetry collectors** — metrics, logs, traces, hardware events
2. **Storage tier** — time-series DB, log store, trace store
3. **Alert manager** — fires alerts on threshold/anomaly rules
4. **Correlation engine** — rule-based, deterministic; produces structured findings

**Intelligence plane** (best-effort, async, advisory):

5. **Incident narrator** — LLM rephrases findings into human-readable incident reports
6. **Past-incident retrieval** — vector similarity over previous incidents
7. **Deployment risk analyzer** — rule-engine + LLM evaluates each release for risk
8. **Regression detector** — compares current performance against baseline; flags drift

The split is deliberate: **the data plane must work without the intelligence plane**. If the LLM provider is down, incidents still fire, still correlate, still get routed. The intelligence plane adds clarity to the report; it is not a critical path.

## How to read this repo

Read in order. Each doc is ~10 minutes. Total ~80 minutes.

| # | Doc | What it covers |
|---|-----|----------------|
| 1 | [system-overview](docs/01-system-overview.md) | The whole platform in one tour, who-talks-to-whom, why each piece exists |
| 2 | [incident-pipeline](docs/02-incident-pipeline.md) | One alert's lifecycle: fired → triaged → narrated → routed → resolved |
| 3 | [data-flow](docs/03-data-flow.md) | Where data lives, retention tiers, hot/warm/cold paths |
| 4 | [failure-modes](docs/04-failure-modes.md) | What breaks when, and how the platform degrades gracefully |
| 5 | [scaling-strategy](docs/05-scaling-strategy.md) | Independent scaling axes per component; bottlenecks at each tier |
| 6 | [cost-model](docs/06-cost-model.md) | Back-of-envelope monthly cost at 10 / 100 / 1000 services |
| 7 | [security-threat-model](docs/07-security-threat-model.md) | What an attacker can do; what we defend against; what we don't |
| 8 | [llm-integration](docs/08-llm-integration.md) | Where LLMs add value, where they don't, the failure modes that matter |

## Reference implementations

Several subsystems already exist as standalone repos.
The links below show the design-to-code correspondence:

| Subsystem (in this design) | Reference implementation |
|---|---|
| Hardware-telemetry collector | [redfish-exporter](https://github.com/anshikapundeel/redfish-exporter) — Prometheus exporter for DMTF Redfish BMCs |
| Kernel-scheduler telemetry | [sched-trace](https://github.com/anshikapundeel/sched-trace) — eBPF/bpftrace toolkit for CPU scheduler tail latency |
| Distributed-tracing collector | [flow-trace](https://github.com/anshikapundeel/flow-trace) — OTLP/HTTP-JSON ingest, sharded ring buffer, embedded UI |
| Correlation engine | [perf-advisor](https://github.com/anshikapundeel/perf-advisor) — rule-based Linux performance analyzer with optional LLM hook |
| Performance auditing toolkits | [linux-perf-toolkit](https://github.com/anshikapundeel/linux-perf-toolkit), [pg-perf-toolkit](https://github.com/anshikapundeel/pg-perf-toolkit) |

Three subsystems are documented here but **not yet implemented**:

- **Incident narrator + past-incident retrieval** — see [docs/02-incident-pipeline.md](docs/02-incident-pipeline.md) for the design. Planned as `incident-pilot`.
- **Deployment risk analyzer** — see [docs/08-llm-integration.md](docs/08-llm-integration.md) for the design. Planned as `deploy-risk`.
- **Regression detector** — see [docs/05-scaling-strategy.md](docs/05-scaling-strategy.md) for the design. Planned as `regression-radar`.

## What this design is and isn't

**Is:**

- A coherent, opinionated design for an AIOps platform at the 10–1000-service scale.
- A frank treatment of which parts genuinely benefit from AI and which are better off as deterministic systems.
- A reference for what each piece actually costs to run, how it scales, what its failure modes are, and what its threat surface looks like.

**Isn't:**

- A hyperscale design. Different things matter at 10,000 services (multi-region everything, lambda architecture for trace storage, per-tenant isolation) and this doc doesn't pretend to cover them.
- A specific vendor recommendation. Where the design says "time-series DB" it means Prometheus *or* VictoriaMetrics *or* Mimir; the design works with any of them.
- A claim of novelty. The patterns here are well-known; the goal is to document them clearly in one place, with the LLM integration treated honestly.

## License

MIT (for the prose). Diagrams are CC-BY-4.0.