npmiaman/causa

GitHub: npmiaman/causa

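Causa is a root-cause analysis tool for production incidents, built on structural causal models and LLM agents. It uses counterfactual experiments to automatically prove and quantify the true cause of an alert.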


# Causa

### The on-call agent that proves the actual cause.

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE) [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/) [![TypeScript](https://img.shields.io/badge/typescript-5.6+-blue.svg)](https://www.typescriptlang.org/) [![Next.js 16](https://img.shields.io/badge/next.js-16-black.svg)](https://nextjs.org/) [![Alpha](https://img.shields.io/badge/status-alpha-orange.svg)](#status)

**PagerDuty tells you what's *correlated* during an outage. Causa tells you what's *causal*, and proves it by running a counterfactual experiment, live.**

[Live demo](#-quick-start--5-minutes) · [Install the SDK](#-the-causareact-sdk) · [Architecture](docs/MULTI_TENANT_DESIGN.md) · [Security](docs/SECURITY.md) · [Roadmap](docs/ROADMAP.md)

## Table of contents

- [What is Causa](#what-is-causa)
- [Why it's different](#why-its-different)
- [How it works](#how-it-works)
  - [The three-agent loop](#the-three-agent-loop)
  - [The math (Pearl's hierarchy)](#the-math-pearls-hierarchy)
  - [End-to-end flow](#end-to-end-flow)
- [Architecture](#architecture)
- [🚀 Quick start (5 minutes)](#-quick-start--5-minutes)
- [📦 The `@causa/react` SDK](#-the-causareact-sdk)
- [🔌 Wire up real telemetry](#-wire-up-real-telemetry)
- [🧰 Repository layout](#-repository-layout)
- [⚙️ Configuration reference](#️-configuration-reference)
- [🛠 Development guide](#-development-guide)
- [🧪 Testing](#-testing)
- [🗺 Roadmap](#-roadmap)
- [Status](#status)
- [Engineering rules (`ML_AGENTS.md`)](#engineering-rules-ml_agentsmd)
- [Citations](#citations)
- [Contributing](#contributing)
- [Security disclosure](#security-disclosure)
- [License](#license)

## What is Causa

Causa is an **open-source, self-hostable agent for production incident response**. When an alert fires, three LLM agents (Triage → Investigator → Verifier) work together to **rank the probable root cause by its quantified causal effect on the symptom**, and then **verify the top hypothesis by running a counterfactual experiment**, either in a simulator built from your environment's causal graph or by spinning up a parallel `kind` cluster and replaying your workload with the candidate cause removed.

It's not a correlation engine, and it's not an LLM-on-logs summarizer. It's a structural-causal-model + agentic-loop + verifiable-experiment stack that produces *falsifiable hypotheses with 95% confidence intervals*, every one of them traceable to a learned edge in a Postgres-backed graph.

You can drop it into your project in two minutes:

    import { CausaProvider, CausaTrigger, CausaOverlay } from "@causa/react";

    export default function App() {
      return (
        <CausaProvider>
          <CausaTrigger />
          <CausaOverlay />
        </CausaProvider>
      );
    }

Or run the whole stack on your laptop with `docker compose up`.
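To make "quantified causal effect with a 95% interval" concrete, here is a minimal sketch of a counterfactual comparison on a toy three-node chain (memory → latency → error rate). It is not Causa's implementation: the function names, the linear coefficients, and the noise scales are all assumptions chosen for illustration, and the interval is a plain percentile interval over sampled effects rather than the output of LightGBM quantile regressors.

```python
import random
import statistics


def symptom(memory_mb: float, latency_noise: float, error_noise: float) -> float:
    """Toy structural equations (assumed): memory -> latency -> error rate."""
    latency_ms = 100.0 + 0.5 * memory_mb + latency_noise
    return 0.0005 * max(latency_ms - 150.0, 0.0) + error_noise


def effect_of_removing_cause(observed_mb: float, baseline_mb: float,
                             n_samples: int = 1000, seed: int = 0):
    """Compare the factual world with one where memory is pinned to baseline.

    Both worlds share the same exogenous noise draws, so each sample is a
    "same world, modified intervention" comparison.
    Returns (mean effect, 2.5th percentile, 97.5th percentile).
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_samples):
        latency_noise = rng.gauss(0.0, 5.0)
        error_noise = rng.gauss(0.0, 0.001)
        factual = symptom(observed_mb, latency_noise, error_noise)
        counterfactual = symptom(baseline_mb, latency_noise, error_noise)
        effects.append(factual - counterfactual)
    effects.sort()
    lower = effects[int(0.025 * n_samples)]
    upper = effects[int(0.975 * n_samples) - 1]
    return statistics.fmean(effects), lower, upper


if __name__ == "__main__":
    mean, lower, upper = effect_of_removing_cause(observed_mb=400.0, baseline_mb=100.0)
    print(f"effect on error rate: {mean:.4f} (95% interval {lower:.4f} to {upper:.4f})")
```

A candidate cause whose interval sits clearly above zero would rank high; one whose interval straddles zero would not.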
## Why it's different

Existing tooling answers a different question.

| Capability | Rule-based AIOps (Datadog Watchdog, PagerDuty AIOps) | LLM log summarizer (every YC AI-ops co. 2024) | **Causa** |
|---|:---:|:---:|:---:|
| Causal reasoning (not just correlation) | ❌ | ❌ | **✅** |
| Quantified confidence (CI / p-value) | ✅ | ❌ | **✅** |
| Counterfactual: *"if I hadn't done X…"* | ❌ | ❌ | **✅** |
| Falsifiable Prove-it verification | ❌ | ❌ | **✅** |
| Bounded LLM cost + retry on transients | n/a | ❌ | **✅** |
| Open-source, self-hostable | ❌ | some | **✅** |
| Multi-tenant RLS by default | n/a | n/a | **✅** |

The one question every on-call actually asks, *"would removing this change have fixed the symptom?"*, is a Level-2 do-calculus question. Correlation tools can't answer it. LLM log summarizers can guess, and frequently hallucinate. Causa estimates it, gives you a CI, and lets you verify.

## How it works

### The three-agent loop

Built on a **LangGraph** state machine. Every node is checkpointed to Postgres, so a crashed agent run can be resumed from where it left off. Every LLM call is wrapped in retry-on-transient (`ResourceExhausted` / `ServiceUnavailable`) backoff and tagged with per-call cost attribution.

      alert webhook
      (PagerDuty / Opsgenie / Alertmanager)
           │
           ▼
      ┌─────────┐  IncidentContext     ┌────────────────┐
      │ Triage  │ ───────────────────▶ │  Investigator  │
      │ agent   │                      │     agent      │
      └─────────┘                      └───────┬────────┘
                                               │ ranked
                                               │ Hypothesis[]
                                               ▼
                                       ┌────────────────┐
                 user clicks           │    Verifier    │
                 "Prove it" ─────────▶ │     agent      │
                                       └───────┬────────┘
                                               │
                                               ▼
                              CONFIRMED / REJECTED / INCONCLUSIVE

Each agent has a tightly scoped tool set:

- **Triage**: `get_alert_details`, `get_recent_change_window`, `get_failing_service_metadata`, then **must** call `emit_incident_context`.
- **Investigator**: `query_scg`, `run_counterfactual`, `get_signal_history`, `retrieve_similar_incidents` (pgvector cosine search over the tenant's postmortem corpus), then **must** call `emit_hypotheses`.
- **Verifier**: `query_specific_signal`, `run_shadow_experiment`, then **must** call `emit_verification`.

Prompts are versioned files under [`prompts/`](prompts/). Every change requires a version bump and an eval delta in [`eval/`](eval/) (the `ML_AGENTS.md` rule).

### The math (Pearl's hierarchy)

Causa operates at all three levels.

| Level | Symbol | Where it lives |
|---|---|---|
| **L1 · Association** | `P(Y \| X)` | The System Causal Graph's edge weights, discovered by the PC algorithm on baseline telemetry. |
| **L2 · Intervention** | `P(Y \| do(X = x))` | The simulator forward-propagates from a pinned intervention through the DAG. Each child is sampled from its per-node LightGBM quantile regressor (or, for envs without 24 h of training data yet, a baseline-revert fallback). |
| **L3 · Counterfactual** | `P(Y_x \| X', Y')` | "Prove it" replays the actual incident with the candidate cause removed. Same world, modified intervention. The verifier shadow can run in the simulator OR on a real parallel `kind` cluster (`KindShadowRunner`). |

**Citations**

- Pearl (2009). *Causality: Models, Reasoning, and Inference*. Cambridge.
- Spirtes, Glymour, Scheines (2000). *Causation, Prediction, and Search*. (PC algorithm.)
- Peters, Janzing, Schölkopf (2017). *Elements of Causal Inference*. MIT Press.
- Shimizu et al. (2006). LiNGAM. JMLR.

### End-to-end flow

1. An alert hits `POST /wh/{pagerduty,opsgenie,alertmanager,github}/{secret}`. The webhook payload is parsed into a normalized `IncidentContext`.
2. The orchestrator resolves the right **environment** via the per-tenant URL secret.
3. If the env has `prometheus_url` set, real metrics are pulled via PromQL (range + instant). Otherwise the bundled rehearsed-incident fixtures stand in.
4. **Triage agent** runs (~5-10 s) and emits the candidate set.
5. **Investigator agent** runs (~20-40 s), calls `run_counterfactual` per candidate, ranks by effect magnitude with 95% CI, and emits the `Hypothesis[]`.
6. The Console streams every phase change over WebSocket. Redis persists the last 50 snapshots so a reconnecting client catches up.
7. The on-call clicks **Prove it**. **Verifier agent** runs the shadow (simulator by default, parallel `kind` cluster if configured and the tenant's plan permits it) and returns `CONFIRMED` / `REJECTED` / `INCONCLUSIVE` with the chart.
8. Every step writes to an append-only `audit_events` row. Every LLM call writes a `cost_events` row.

Total typical demo run: **~30 seconds, ~$0.10 of Gemini 2.5 Pro tokens.**

## Architecture

    ┌─────────────────────────────────────────────────────────────────────────┐
    │                          Browser (Next.js 16)                           │
    │  ┌──────────────────┐  ┌─────────────────────────────────────────────┐  │
    │  │ Landing /        │  │ Dashboard /dashboard                        │  │
    │  │ (marketing)      │  │ • Topbar + OrgSwitcher                      │  │
    │  │                  │  │ • Timeline (events + audit)                 │  │
    │  │                  │  │ • 3D Causal Graph (react-three-fiber)       │  │
    │  │                  │  │ • Hypothesis stack + Prove-it               │  │
    │  └──────────────────┘  └─────────────────────────────────────────────┘  │
    └─────────────────────────────────────────────────────────────────────────┘
                                         │ HTTPS + WSS
                                         ▼
    ┌─────────────────────────────────────────────────────────────────────────┐
    │                           FastAPI (causa-api)                           │
    │                                                                         │
    │  /auth/*      magic-link + WorkOS SSO + JWT sessions                    │
    │  /api/*       tenant-scoped: incidents, environments, jobs              │
    │  /wh/*        signature-verified webhooks (PagerDuty / Opsgenie / ...)  │
    │  /ws/*        live incident updates                                     │
    │  /sdk-token   SDK-popup auth handshake                                  │
    │                                                                         │
    │  ┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌──────────────┐  │
    │  │ TenantContext│ │ LangGraph state│ │ Tool layer   │ │ Cost +       │  │
    │  │ middleware   │ │ machine (3     │ │ (typed I/O,  │ │ audit        │  │
    │  │ + RBAC       │ │ agents, Pg-    │ │ schema-      │ │ trackers     │  │
    │  │              │ │ checkpointed)  │ │ validated)   │ │              │  │
    │  └──────────────┘ └────────────────┘ └──────────────┘ └──────────────┘  │
    └─────────────────────────────────────────────────────────────────────────┘
            │                  │                  │                  │
            ▼                  ▼                  ▼                  ▼
      ┌────────────┐     ┌────────────┐     ┌────────────┐     ┌────────────┐
      │ Postgres   │     │ Redis      │     │ Prometheus │     │ Gemini     │
      │ + RLS      │     │ (RQ +      │     │ (your      │     │ 2.5 Pro    │
      │ + pgvector │     │ snapshots) │     │ cluster)   │     │            │
      └────────────┘     └────────────┘     └────────────┘     └────────────┘

### `causa-core`: the math, no LLM

Pure Python library that the agents call as tools. Importable on its own.

| File | What it does |
|---|---|
| [`causa_core/scg.py`](packages/causa-core/causa_core/scg.py) | PC algorithm via `causal-learn`; builds + persists the SCG; shuffle/negative-control diagnostic; precision/recall vs ground-truth eval. |
| [`causa_core/scg_from_topology.py`](packages/causa-core/causa_core/scg_from_topology.py) | Topology-derived stub SCG for new tenants without 24 h of baseline yet. |
| [`causa_core/condmodel.py`](packages/causa-core/causa_core/condmodel.py) | Per-node LightGBM quantile regressors at α ∈ {0.05, 0.5, 0.95}. |
| [`causa_core/cf.py`](packages/causa-core/causa_core/cf.py) | Vectorized counterfactual sampler: forward propagation through the DAG, 1000 samples in parallel. |
| [`causa_core/cf_baseline.py`](packages/causa-core/causa_core/cf_baseline.py) | Baseline-revert counterfactual for envs without trained condmodels. |
| [`causa_core/topology.py`](packages/causa-core/causa_core/topology.py) | Service topology + signal taxonomy (reference; per-env overrides live in DB). |
| [`causa_core/synth.py`](packages/causa-core/causa_core/synth.py) | Synthetic telemetry generator with known ground-truth edges, used by the bundled demo + the agent eval set. |

### `causa-api`: orchestrator + multi-tenant SaaS

| Subdir | What it does |
|---|---|
| [`auth/`](apps/api/causa_api/auth) | JWT, WorkOS OIDC, magic-link, RBAC (`owner` / `admin` / `on_call` / `viewer`), `TenantContext` middleware. |
| [`api/`](apps/api/causa_api/api) | Tenant-scoped HTTP routes: incidents, environments, jobs. |
| [`webhooks/`](apps/api/causa_api/webhooks) | Per-tenant signed webhook intake: PagerDuty / Opsgenie / Alertmanager / GitHub / generic. |
| [`graph/`](apps/api/causa_api/graph) | LangGraph state machine: nodes, tool routing, PostgresSaver checkpointing, streaming. |
| [`agents/`](apps/api/causa_api/agents) | Versioned-prompt loader + per-agent tool schemas. |
| [`providers/`](apps/api/causa_api/providers) | `IncidentDataProvider` implementations: `PrometheusProvider` (real PromQL) + `FixtureProvider` (demo). |
| [`shadow/`](apps/api/causa_api/shadow) | `ShadowRunner` Protocol + `SimulatorShadowRunner` + `KindShadowRunner` (real kind cluster Prove-it). |
| [`workers/`](apps/api/causa_api/workers) | RQ background jobs: condmodel training, SCG rebuild, drift detection, nightly cron. |
| [`repos/`](apps/api/causa_api/repos) | DB access patterns; every call is inside a `tenant_session` (RLS-enforced). |
| [`db/`](apps/api/causa_api/db) | SQLAlchemy 2.0 async models + Alembic migrations + RLS policies. |
| [`broadcaster.py`](apps/api/causa_api/broadcaster.py) | In-process pub/sub + Redis snapshot history for WS reconnect replay. |
| [`rate_limit.py`](apps/api/causa_api/rate_limit.py) | Per-tenant Lua-script token bucket on Redis. |
| [`orchestrator.py`](apps/api/causa_api/orchestrator.py) | The `run_incident()` + `run_verification()` entry points. |

### `causa-console`: Next.js 16 app

Marketing landing at `/`, sign-in at `/login`, the live dashboard at `/dashboard`. The dashboard reuses real components: the 3D causal graph is `@react-three/fiber` (`CausalGraph3D.tsx`), the verification overlay is `recharts` with a split-screen `LineChart`, and the hypothesis cards are shadcn primitives styled with our token system.

### `@causa/react`: embeddable SDK

A single React package: `<CausaProvider>` + `<CausaTrigger>` + `<CausaOverlay>` + `useCausa()` hook. It authenticates the end user in a popup, talks to the API with bearer tokens (no cross-site cookies, so it is Safari ITP safe), and syncs sessions across tabs via `BroadcastChannel`.

## 🚀 Quick start — 5 minutes

### Prerequisites

- **macOS / Linux**
- [Homebrew](https://brew.sh/) (macOS) or apt
- A **Gemini API key**; the free tier works ([create one](https://aistudio.google.com/app/apikey))

### One-time tool install

    brew install uv pnpm postgresql@17 redis
    brew services start postgresql@17
    brew services start redis

### Clone + set up

    git clone https://github.com/your-org/causa.git
    cd causa

    # Python workspace
    uv sync --all-packages

    # JS workspace
    pnpm install

    # Postgres: create db + roles + extensions (one-time, ~3 sec)
    /opt/homebrew/opt/postgresql@17/bin/psql -d postgres -f infra/postgres/init.sql

    # Run migrations (creates schema + RLS policies)
    cd apps/api && uv run alembic upgrade head && cd ../..

### Configure

    cp .env.example .env

Edit `.env`. At minimum set:

    GEMINI_API_KEY=AIza...          # required
    CAUSA_DEPLOYMENT_MODE=demo      # auto-creates a 'demo' tenant on startup
    CAUSA_DATABASE_URL=postgresql+asyncpg://app_user:app_user_pw@localhost:5432/causa
    CAUSA_DATABASE_URL_ADMIN=postgresql+asyncpg://app_admin:app_admin_pw@localhost:5432/causa
    CAUSA_DATABASE_URL_SYNC=postgresql://app_admin:app_admin_pw@localhost:5432/causa

### Generate the demo fixtures (one-time; ~5 sec for the synthetic data, plus ~45 sec to learn the SCG and fit the models)

    uv run python -m causa_core.cli synth --scenario baseline --hours 4 --out fixtures
    uv run python -m causa_core.cli synth --scenario recommendation_cascade --hours 0.1 --out fixtures
    make scg                                        # learns the SCG (~15 s)
    uv run python -m causa_core.cli condmodel-fit   # fits LightGBM (~30 s)

### Run

    # Terminal 1
    make api

    # Terminal 2
    make console

Open **http://localhost:3000**. Click **Get started** in the hero. Sign in with any email; in dev mode the magic link is printed inline on the confirmation card. After sign-in you land on **/dashboard**.

Press **fire demo incident** (top right of the dashboard). Watch the timeline fill, the 3D causal graph light up the path from `recommendationservice.memory` → `frontend.latency_p99` → `checkoutservice.error_rate`, and the ranked hypotheses appear with their 95% CIs. Click **Prove it** on the top hypothesis: the verifier runs and an overlay slides in with the split-screen chart showing CONFIRMED.

**Cost:** ~$0.10. **Wall clock:** ~30 seconds.

### CLI-only smoke test (no UI required)

    make incident   # fires the rehearsed incident, prints the ranked output

## 📦 The `@causa/react` SDK

Install in any React app:

    npm install @causa/react
    # or: pnpm add @causa/react / yarn add @causa/react / bun add @causa/react

Wrap your root:

    import { CausaProvider, CausaTrigger, CausaOverlay } from "@causa/react";

    export default function App() {
      return (
        <CausaProvider>
          <CausaTrigger />   {/* floating button, bottom right by default */}
          <CausaOverlay />   {/* slide-in drawer */}
        </CausaProvider>
      );
    }

Anyone clicking the trigger gets a Causa popup magic-link sign-in.
After they're in, the overlay shows their org's incidents, lets them fire a demo run, and surfaces ranked hypotheses + Prove-it verdicts. No backend changes on your side.

Headless? Use the `useCausa()` hook:

    import { useCausa } from "@causa/react";

    function YourCustomTrigger() {
      const causa = useCausa();
      return (
      );
    }

See [`packages/sdk-react/README.md`](packages/sdk-react/README.md) for the full API.

### Publishing the SDK to npm

    ./scripts/publish-sdk.sh          # dry-run: verify
    ./scripts/publish-sdk.sh --real   # actually publish (requires npm login)

## 🔌 Wire up real telemetry

The bundled demo uses synthetic data so the install path is fast. To run Causa against your real cluster, attach Prometheus + send real alerts.

### 1 · Configure your environment

    # Add your Prometheus URL + (optional) PromQL overrides + auth header
    curl -X PATCH http://localhost:8000/api/environments/$ENV_ID \
      -H 'content-type: application/json' \
      -b /tmp/causa_cookies \
      -d '{
        "prometheus_url": "https://prom.your-cluster.example.com",
        "service_label": "service",
        "extra_label_selectors": ", cluster=\"prod-us\""
      }'

### 2 · Auto-discover services + call graph

    curl -X POST http://localhost:8000/api/environments/$ENV_ID/discover

Causa queries Prometheus in this order:

1. `traces_service_graph_request_total` (Tempo / OTel `metrics_generator`): gives caller→callee edges directly.
2. `istio_requests_total{source_workload, destination_workload}`: Istio / Linkerd service mesh.
3. Fallback: `up{}` enumerates jobs; the SCG learns edges from data.

The discovered topology persists on the environment row, and a topology-stub SCG is activated so the orchestrator can start ranking hypotheses immediately, before you've collected 24 h of baseline.
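The first two discovery sources above both encode a caller and a callee as labels on a request counter, so a Prometheus instant-query response can be folded into an edge list. The sketch below illustrates that idea only; it is not Causa's discovery code. The function name is hypothetical, and the `client` / `server` label names for the Tempo service-graph metric are an assumption about how the metrics generator labels its series.

```python
from collections import defaultdict

# Caller/callee label pairs, tried in the discovery order listed above.
# The ("client", "server") pair is an assumed labelling for the Tempo metric.
LABEL_PAIRS = [
    ("client", "server"),
    ("source_workload", "destination_workload"),
]


def edges_from_instant_query(response: dict) -> dict:
    """Fold a Prometheus instant-query JSON response into caller->callee weights."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    edges = defaultdict(float)
    for series in response["data"]["result"]:
        labels = series["metric"]
        for caller_key, callee_key in LABEL_PAIRS:
            if caller_key in labels and callee_key in labels:
                _timestamp, value = series["value"]  # value arrives as a string
                edges[(labels[caller_key], labels[callee_key])] += float(value)
                break  # series without a usable label pair fall through untouched
    return dict(edges)


if __name__ == "__main__":
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"client": "frontend", "server": "checkoutservice"},
                 "value": [1700000000.0, "12"]},
                {"metric": {"source_workload": "cart", "destination_workload": "redis"},
                 "value": [1700000000.0, "5"]},
            ],
        },
    }
    for (caller, callee), weight in edges_from_instant_query(sample).items():
        print(f"{caller} -> {callee}: {weight}")
```

Series that carry neither label pair (for example the `up{}` fallback) contribute no edges here, which matches the note that in that case the SCG has to learn edges from data.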
### 3 · Train per-node LightGBM models on real data

    # Triggers a background RQ job; returns a job_id
    curl -X POST http://localhost:8000/api/environments/$ENV_ID/train

    # Poll status
    curl http://localhost:8000/api/jobs/$JOB_ID

This pulls 24 h of baseline metrics from your Prometheus, resamples to 5 s buckets, fits LightGBM quantile regressors per node, and persists the joblib artifact + a `CondModel` row. The orchestrator picks up the new artifact on the next incident automatically.

### 4 · Schedule nightly retrain + drift checks

    # Run a worker
    make worker

    # In a separate cron container, every night:
    uv run python -m causa_api.workers.retraining enqueue_nightly_for_all_envs

The retrainer re-runs PC on fresh baseline, backtest-gates promotion at Jaccard ≥ 0.70 vs. the previous SCG, and on promotion cascades a drift check + condmodel refit.

### 5 · Point your alert source at Causa

    # Generate a per-tenant webhook secret (admin only)
    curl -X POST http://localhost:8000/api/webhook_secrets \
      -H 'content-type: application/json' \
      -d '{"source": "pagerduty", "environment_id": "..."}'
    # → {"url": "https://api.causa.dev/wh/pagerduty/"}

Configure that URL in PagerDuty's webhook integration. When an alert fires, the orchestrator parses the payload → derives the symptom signal id → runs the three-agent loop → writes the result to the audit log → broadcasts to any WS subscribers.

Supported sources today: **PagerDuty, Opsgenie, Alertmanager, GitHub (deploys/pushes), generic**.
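The promotion gate from step 4 can be sketched as a Jaccard comparison between two edge sets. This is an illustration under assumptions, not the retrainer's code: it assumes the Jaccard index is taken over directed edges, and the function names and the empty-graph convention are hypothetical.

```python
def jaccard(previous_edges: set, candidate_edges: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over two sets of (cause, effect) edges."""
    union = previous_edges | candidate_edges
    if not union:
        return 1.0  # assumed convention: two empty graphs count as identical
    return len(previous_edges & candidate_edges) / len(union)


def should_promote(previous_edges: set, candidate_edges: set,
                   threshold: float = 0.70) -> bool:
    """Promote the freshly learned graph only if it stays close to the old one."""
    return jaccard(previous_edges, candidate_edges) >= threshold


if __name__ == "__main__":
    previous = {("a", "b"), ("b", "c"), ("c", "d")}
    candidate = {("a", "b"), ("b", "c"), ("c", "e")}
    print(jaccard(previous, candidate), should_promote(previous, candidate))
```

A gate like this trades responsiveness for stability: a genuinely changed topology will be held back until the overlap recovers or the threshold is revisited.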
## 🧰 Repository layout

    causa/
    ├── apps/
    │   ├── api/                        FastAPI orchestrator + LangGraph agents
    │   │   ├── alembic/                Migrations (0001 schema + RLS, 0002 telemetry)
    │   │   ├── causa_api/
    │   │   │   ├── api/                HTTP routes (/api/*)
    │   │   │   ├── auth/               JWT + WorkOS + magic-link + RBAC + middleware
    │   │   │   ├── agents/             Prompt loader, agent runners
    │   │   │   ├── graph/              LangGraph state machine, nodes, tools
    │   │   │   ├── providers/          PrometheusProvider, auto-discovery, resolver
    │   │   │   ├── repos/              Tenant-scoped DB access
    │   │   │   ├── shadow/             ShadowRunner (Simulator + Kind)
    │   │   │   ├── webhooks/           Alert / event intake
    │   │   │   ├── workers/            RQ jobs (training, drift, retrain)
    │   │   │   ├── db/                 SQLAlchemy 2.0 models
    │   │   │   ├── orchestrator.py     Top-level run_incident() + run_verification()
    │   │   │   ├── tools.py            Tool handlers (LLM-callable)
    │   │   │   ├── broadcaster.py      WS fan-out + Redis snapshot history
    │   │   │   ├── rate_limit.py       Per-tenant Lua token bucket
    │   │   │   └── main.py             FastAPI app + lifespan
    │   │   └── pyproject.toml
    │   ├── causa-console/              Next.js 16 app
    │   │   ├── src/
    │   │   │   ├── app/                / /dashboard /login /sdk/login …
    │   │   │   ├── components/
    │   │   │   │   ├── landing/        Nav, Hero3D, CodeBlock, Footer, icons
    │   │   │   │   ├── CausalGraph3D.tsx    3D incident visualization (r3f + drei)
    │   │   │   │   ├── Timeline.tsx         Live event log
    │   │   │   │   ├── HypothesisStack.tsx
    │   │   │   │   ├── VerificationOverlay.tsx
    │   │   │   │   ├── Topbar.tsx + OrgSwitcher.tsx
    │   │   │   │   └── ui/             shadcn primitives (button, dialog, card, …)
    │   │   │   ├── lib/                api client, store (zustand)
    │   │   │   ├── types/causa.ts      TypeScript mirror of Pydantic shapes
    │   │   │   └── middleware.ts       Protects /dashboard
    │   │   └── package.json
    │   ├── deployer/                   Mock GitHub-Actions style deploy event emitter
    │   └── flagsvc/                    Mock feature-flag service
    │
    ├── packages/
    │   ├── shared/                     Single source of truth for Pydantic types
    │   ├── causa-core/                 Math: SCG, simulator, synth
    │   └── sdk-react/                  @causa/react npm package
    │
    ├── infra/
    │   ├── postgres/init.sql           Bootstrap roles + db
    │   ├── prometheus-values.yaml      kube-prometheus-stack overrides
    │   └── helm/causa/                 Helm chart (api + console + postgres + redis)
    │
    ├── docs/
    │   ├── MULTI_TENANT_DESIGN.md      Architecture deep-dive
    │   ├── ROADMAP.md                  Phase 0 → Phase 5
    │   └── SECURITY.md                 SOC 2 control register + sub-processors
    │
    ├── prompts/                        Versioned agent prompts (ML_AGENTS §12.6)
    │   ├── triage.v1.md
    │   ├── investigator.v1.md
    │   └── verifier.v1.md
    │
    ├── eval/incidents/                 Eval set (per ML_AGENTS §12.1)
    ├── failures/                       Regression case library (ML_AGENTS §12.14)
    ├── tests/                          pytest suite (13 tests, all green)
    ├── scripts/                        publish-sdk.sh, run_demo.py
    ├── fixtures/                       Generated synthetic telemetry (gitignored)
    ├── models/                         Trained SCG + condmodels (gitignored)
    │
    ├── DECISIONS.md                    Every non-trivial decision + reversal cost
    ├── PROBLEM.md                      Problem framing (ML_AGENTS §4)
    ├── LIMITATIONS.md                  What Causa does NOT do; failure modes
    ├── LICENSES.md                     Third-party data + model licensing
    ├── ML_AGENTS.md                    Mandatory operating rules for any AI agent
    ├── PRD.md                          Original hackathon spec (historical)
    ├── Makefile                        make help: all dev tasks
    └── pyproject.toml                  uv workspace root

## ⚙️ Configuration reference

All settings come from environment variables. See [`.env.example`](.env.example).

### Required

| Variable | Default | Description |
|---|---|---|
| `GEMINI_API_KEY` | (none) | Google Gemini API key. Required for any agent run. |

### Deployment mode

| Variable | Default | Description |
|---|---|---|
| `CAUSA_DEPLOYMENT_MODE` | `multi_tenant` | `demo` auto-creates a default tenant + env on startup and mounts back-compat legacy endpoints. `multi_tenant` requires real magic-link sign-in. |

### Database + Redis

| Variable | Default | Description |
|---|---|---|
| `CAUSA_DATABASE_URL` | `postgresql+asyncpg://app_user:app_user_pw@localhost:5433/causa` | Async, RLS-enforced; used for all runtime queries. |
| `CAUSA_DATABASE_URL_ADMIN` | `postgresql+asyncpg://app_admin:app_admin_pw@localhost:5433/causa` | BYPASSRLS; used only by Alembic + admin tools. |
| `CAUSA_DATABASE_URL_SYNC` | `postgresql://app_admin:app_admin_pw@localhost:5433/causa` | Sync URL for Alembic. |
| `REDIS_URL` | `redis://localhost:6379/0` | Used by broadcaster snapshot history, rate limiter, RQ. |

### Auth

| Variable | Default | Description |
|---|---|---|
| `CAUSA_JWT_SECRET` | dev placeholder | **Set to a real secret in production.** HMAC SHA-256 signing key for session JWTs. |
| `CAUSA_JWT_TTL_HOURS` | `12` | Session length. |
| `CAUSA_COOKIE_SECURE` | `false` | Set `true` in production behind HTTPS. |
| `CAUSA_COOKIE_DOMAIN` | (unset) | Set if hosting console + api under different subdomains. |
| `CAUSA_PUBLIC_APP_URL` | `http://localhost:3000` | Console URL (used in magic-link emails). |
| `CAUSA_PUBLIC_API_URL` | `http://localhost:8000` | API URL (used in SSO callbacks). |

### WorkOS SSO (optional)

| Variable | Default | Description |
|---|---|---|
| `WORKOS_API_KEY` | (unset) | Set to enable enterprise SSO via WorkOS. |
| `WORKOS_CLIENT_ID` | (unset) | WorkOS OAuth client id. |

### Magic-link email (optional)

| Variable | Default | Description |
|---|---|---|
| `RESEND_API_KEY` | (unset) | If unset, magic links are printed inline in the sign-in response (dev mode). Set to send real emails. |
| `CAUSA_MAGICLINK_FROM` | `login@causa.dev` | From-address for magic-link emails. |

### Agent budgets (ML_AGENTS §12.9)

| Variable | Default | Description |
|---|---|---|
| `CAUSA_MAX_TOKENS_PER_INCIDENT` | `200000` | Hard cap; agent halts on exceed. |
| `CAUSA_MAX_USD_PER_INCIDENT` | `1.00` | Hard cap; agent halts on exceed. |
| `CAUSA_MODEL` | `gemini-2.5-pro` | LLM. Set to `gemini-2.5-flash` for ~4× cheaper / faster runs. |

### Telemetry (when not pointing at real Prometheus)

| Variable | Default | Description |
|---|---|---|
| `CAUSA_SCG_PATH` | `models/scg.json` | Demo SCG location. |
| `CAUSA_CONDMODELS_DIR` | `models/condmodels` | Demo condmodel location. |
| `PROMETHEUS_URL` | `http://localhost:9090` | Default for the demo env's PrometheusProvider. |

### Kill switch (ML_AGENTS §12.19)

| Variable | Default | Description |
|---|---|---|
| `CAUSA_KILL_SWITCH_TOKEN` | dev placeholder | Header required for `POST /halt/{incident_id}`. |

## 🛠 Development guide

### Common tasks

    make help                 # show every available make target

    # Stack
    make demo-up              # full demo stack via docker-compose
    make demo-down            # tear it down

    # Causa pipelines
    make scg                  # rebuild the demo SCG
    make incident             # CLI smoke (no UI): fires the rehearsed incident
    make incident-no-verify   # same, skip Prove-it step

    # Services
    make api                  # FastAPI on :8000
    make console              # Next.js dev on :3000
    make worker               # RQ worker: picks up training / drift / retrain jobs
    make flagsvc              # Mock feature-flag service on :8002

    # Quality gates
    make eval                 # run agent eval suite (≥30 incidents per ML_AGENTS §12.1)
    make test                 # pytest -q across all packages
    make fmt                  # ruff + prettier
    make lint                 # ruff + eslint

### Code style

- **Python:** `ruff` (line length 100, target py312). Format on save.
- **TypeScript:** strict mode, no `any` without justification, `pnpm format` runs prettier.
- **Commit messages:** present tense, imperative. *"add Prometheus auto-discovery"*, not *"added"*.

### Migrations

    # Create a new migration manually (autogenerate is unreliable for RLS policies)
    cd apps/api
    uv run alembic revision -m "describe-the-change"
    # edit alembic/versions/_*.py: write upgrade()/downgrade()
    uv run alembic upgrade head

### Adding a new agent tool

1. Add the JSON schema to `apps/api/causa_api/tools.py` (one of the `*_tool_schemas` functions).
2. Add the Python handler to `ToolHandlers` in the same file.
3. Wire it into the agent's tool map in `apps/api/causa_api/graph/tools.py`.
4. Reference it in the agent's prompt at `prompts/.v.md`, and **bump the version**.
5. Add an eval case in `eval/incidents/` that exercises it.

### Adding a new alert source

1. Add a parser in `apps/api/causa_api/webhooks/parse.py` returning the canonical alert dict shape.
2. Add the route in `apps/api/causa_api/webhooks/routes.py` (verify the source's HMAC signature).
3. Add the source enum value to `causa_api/db/models/enums.py:WebhookSource`.

## 🧪 Testing

    make test                                  # pytest -q, 13 tests
    uv run pytest tests/test_synth.py -v       # synthetic generator suite
    uv run pytest tests/test_shared_types.py   # Pydantic schema tests

### What the test suite verifies

- **Shared type contracts** match between Python (Pydantic) and TypeScript.
- **Synthetic generator** produces stationary baselines, and the rehearsed incident's symptom crosses threshold deterministically.
- **SCG learner** recovers ≥ 75% of ground-truth edges at α=0.001 (precision ≥ 0.95).
- **Shuffle/negative-control test** finds 0 edges under H0 (`ML_AGENTS §5.5`).
- **Counterfactual simulator** propagates an intervention through the DAG and produces a CI.

### Smoke test against the live Gemini API

    GEMINI_API_KEY=AIza... make incident
    # → ranked hypotheses + CONFIRMED verdict, ~$0.10, ~30 s

### Verify the SDK package

    ./scripts/publish-sdk.sh   # dry-run, no upload
    # → 📦 @causa/react@0.1.0 → https://registry.npmjs.org/

## 🗺 Roadmap

| Phase | Scope | Status |
|---|---|---|
| **0** | Multi-tenant foundation: Postgres RLS, auth, RBAC, LangGraph agent loop, SDK | ✅ Shipped |
| **1** | Real Prometheus + OTel + auto-discovery + drift + retraining | ✅ Shipped |
| **2** | Stateful operations: scheduled retraining, drift alerts, champion-challenger | ✅ Shipped |
| **3** | Real parallel shadow cluster (KindShadowRunner local; multi-tenant cloud pools next) | ⚠️ Local works |
| **4** | Trust & compliance: SOC 2 controls, CMEK, private-model option | ⚠️ Controls + docs ready; attestation period TBD |
| **5** | Scale + ecosystem: schema-per-tenant, multi-region, public API + SDKs | ❌ Future |

See [`docs/ROADMAP.md`](docs/ROADMAP.md) for the full Phase 0 → Phase 5 breakdown with effort estimates and risks.
## Status

**Alpha.** Not yet production-ready for a tenant who hasn't read this README end-to-end.

What works today:

- ✅ Three-agent loop on Gemini 2.5 Pro via LangGraph
- ✅ Multi-tenant SaaS foundation (Postgres RLS, JWT auth, RBAC)
- ✅ Real Prometheus ingestion + service auto-discovery
- ✅ LightGBM condmodel training (background jobs)
- ✅ Drift detection + scheduled SCG retraining
- ✅ Real parallel `kind` cluster for Prove-it (self-hosted)
- ✅ `@causa/react` embeddable SDK
- ✅ Magic-link + WorkOS SSO + popup auth for the SDK

What's documented but not yet polished:

- ⚠️ Helm chart deploys but isn't load-tested
- ⚠️ Audit log is unbounded and needs a retention policy
- ⚠️ DR drill not performed
- ⚠️ Pen test not performed

What's not built:

- ❌ Cloud-hosted multi-tenant shadow-cluster pools (Phase 3)
- ❌ Private-model deployment (Vertex-AI in customer GCP) (Phase 4)
- ❌ SOC 2 Type II attestation (Phase 4)
- ❌ Schema-per-tenant migration for enterprise (Phase 5)
- ❌ Vue / Svelte / React-Native SDKs

Be honest with your judges / customers / yourself about which row you're on.

## Engineering rules ([`ML_AGENTS.md`](ML_AGENTS.md))

This repo ships with a **mandatory operating manual** for any AI coding agent working in it. Before touching ML / agent / RAG / fine-tuning / evaluation code, agents must read [`ML_AGENTS.md`](ML_AGENTS.md) in full and cite section numbers when explaining decisions (e.g. *"per §5.5, we ran a shuffle/negative-control test"*).

The most load-bearing rules:

- **§1, Operating principles**: rigor over approval, volunteer skepticism, numbers over narrative.
- **§4, Problem framing**: every project starts with `PROBLEM.md` answering 12 specific questions.
- **§5.5, Leakage checks**: shuffle/negative-control test before claiming any non-trivial result.
- **§6.4, Complexity additions require ablations**: every added layer/feature/tool gets a removal ablation.
- **§12.1, Eval-first**: ≥30 hand-crafted task instances before any agent code.
- **§12.6, Prompts are code**: versioned files, eval delta on every change.
- **§12.9 / §12.29, Cost + token budgets**: per-call enforcement at prompt construction.
- **§12.19, Kill switch**: every long-running agent has a tested stop path.

Every non-trivial decision in this codebase is recorded in [`DECISIONS.md`](DECISIONS.md) with date + reason + alternatives + reversal cost, per `§23.3`.

## Citations

The math is from real causal-inference literature. Cited so you can verify our claims are not buzzwords:

- **Pearl, J.** (2009). *Causality: Models, Reasoning, and Inference.* Cambridge University Press.
- **Spirtes, P., Glymour, C., Scheines, R.** (2000). *Causation, Prediction, and Search.* MIT Press. (PC algorithm.)
- **Peters, J., Janzing, D., Schölkopf, B.** (2017). *Elements of Causal Inference: Foundations and Learning Algorithms.* MIT Press.
- **Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A.** (2006). "A linear non-Gaussian acyclic model for causal discovery." *Journal of Machine Learning Research* 7, 2003–2030. (LiNGAM.)

Reference implementations: [`causal-learn`](https://causal-learn.readthedocs.io), [`LightGBM`](https://lightgbm.readthedocs.io).

## Contributing

Causa is alpha. PRs are welcome, especially:

- Additional `IncidentDataProvider` implementations (Datadog, New Relic, Cloud Monitoring).
- Additional alert-source parsers (`apps/api/causa_api/webhooks/parse.py`).
- Additional shadow-runner implementations (`apps/api/causa_api/shadow/`).
- Eval-set incidents (`eval/incidents/*.yaml`); every new failure mode helps.
- Tightening the SOC 2 open-items list in [`docs/SECURITY.md`](docs/SECURITY.md).

**Before opening a PR:**

1. Read [`ML_AGENTS.md`](ML_AGENTS.md) if your change touches agents, ML, RAG, eval, or prompts.
2. Run `make test`; it must be green.
3. Run `make lint`; it must be green.
4. If your change is non-trivial, add an entry to [`DECISIONS.md`](DECISIONS.md).
5. If you're adding a behavior change, add an eval case in `eval/incidents/`.
6. If you're adding a dependency, justify it in `DECISIONS.md`.

## Security disclosure

Find a vulnerability? Email **security@causa.dev** (don't open a public issue). We aim to respond within 24 hours.

See [`docs/SECURITY.md`](docs/SECURITY.md) for the full security model: sub-processor list, control register mapped to SOC 2 Trust Service Criteria, data classification, and the vendor security questionnaire template.

## License

[Apache License 2.0](LICENSE). Use it, fork it, host it, sell on top of it.

**PagerDuty tells you what's correlated. Causa tells you what's causal — and proves it.** Built with care by people who've stared at red dashboards at 3 AM.

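Tags: DLL hijacking, Python, AI agents, causal inference, large language models, search engine queries, troubleshooting, no backdoor, request interception, operations, reverse-engineering tools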