# Causa
### The on-call agent that proves the actual cause.
**PagerDuty tells you what's *correlated* during an outage. Causa tells you what's *causal* — and proves it by running a counterfactual experiment, live.**
[Live demo](#-quick-start--5-minutes) · [Install the SDK](#-the-causareact-sdk) · [Architecture](docs/MULTI_TENANT_DESIGN.md) · [Security](docs/SECURITY.md) · [Roadmap](docs/ROADMAP.md)
## Table of contents
- [What is Causa](#what-is-causa)
- [Why it's different](#why-its-different)
- [How it works](#how-it-works)
  - [The three-agent loop](#the-three-agent-loop)
  - [The math (Pearl's hierarchy)](#the-math-pearls-hierarchy)
  - [End-to-end flow](#end-to-end-flow)
- [Architecture](#architecture)
- [🚀 Quick start (5 minutes)](#-quick-start--5-minutes)
- [📦 The `@causa/react` SDK](#-the-causareact-sdk)
- [🔌 Wire up real telemetry](#-wire-up-real-telemetry)
- [🧰 Repository layout](#-repository-layout)
- [⚙️ Configuration reference](#️-configuration-reference)
- [🛠 Development guide](#-development-guide)
- [🧪 Testing](#-testing)
- [🗺 Roadmap](#-roadmap)
- [Status](#status)
- [Engineering rules (`ML_AGENTS.md`)](#engineering-rules-ml_agentsmd)
- [Citations](#citations)
- [Contributing](#contributing)
- [Security disclosure](#security-disclosure)
- [License](#license)
## What is Causa
Causa is an **open-source, self-hostable agent for production incident response**. When an alert fires, three LLM agents (Triage → Investigator → Verifier) work together to **rank the probable root cause by its quantified causal effect on the symptom**, and then **verify the top hypothesis by running a counterfactual experiment** — either in a simulator built from your environment's causal graph, or by spinning up a parallel `kind` cluster and replaying your workload with the candidate cause removed.
It's not a correlation engine, and it's not an LLM-on-logs summarizer. It's a structural-causal-model + agentic-loop + verifiable-experiment stack that produces *falsifiable hypotheses with 95% confidence intervals*, every one of them traceable to a learned edge in a Postgres-backed graph.
You can drop it into your project in two minutes:
import { CausaProvider, CausaTrigger, CausaOverlay } from "@causa/react";
export default function App() {
  return (
    <CausaProvider>
      <CausaTrigger />
      <CausaOverlay />
    </CausaProvider>
  );
}
…or run the whole stack on your laptop with `docker compose up`.
## Why it's different
Existing tooling answers a different question.
| Capability | Rule-based AIOps (Datadog Watchdog, PagerDuty AIOps) | LLM log summarizer (every YC AI-ops co. 2024) | **Causa** |
|---|:---:|:---:|:---:|
| Causal reasoning (not just correlation) | ❌ | ❌ | **✅** |
| Quantified confidence (CI / p-value) | ✅ | ❌ | **✅** |
| Counterfactual: *"if I hadn't done X…"* | ❌ | ❌ | **✅** |
| Falsifiable Prove-it verification | ❌ | ❌ | **✅** |
| Bounded LLM cost + retry on transients | — | ❌ | **✅** |
| Open-source, self-hostable | ❌ | some | **✅** |
| Multi-tenant RLS by default | — | — | **✅** |
The one question every on-call actually asks — *"would removing this change have fixed the symptom?"* — is a Level-2 do-calculus question. Correlation tools can't answer it. LLM log summarizers can guess and frequently hallucinate. Causa estimates it, gives you a CI, and lets you verify.
## How it works
### The three-agent loop
Built on a **LangGraph** state machine. Every node is checkpointed to Postgres, so a crashed agent run can be resumed from where it left off. Every LLM call is wrapped in retry-on-transient (`ResourceExhausted` / `ServiceUnavailable`) backoff and tagged with per-call cost attribution.
┌──────────────────────────────────────────────────────────────────────┐
│ │
│ alert webhook │
│ (PagerDuty / │
│ Opsgenie / │
│ Alertmanager) │
│ │ │
│ ▼ │
│ ┌─────────┐ IncidentContext ┌────────────────┐ │
│ │ Triage │ ─────────────────────▶│ Investigator │ │
│ │ agent │ │ agent │ │
│ └─────────┘ └────┬───────────┘ │
│ │ │
│ ranked │ │
│ Hypothesis[]│ │
│ ▼ │
│ ┌────────────────┐ │
│ user clicks │ Verifier │ │
│ "Prove it" ────────▶│ agent │ │
│ └────┬───────────┘ │
│ │ │
│ ▼ │
│ CONFIRMED / REJECTED / INCONCLUSIVE │
│ │
└──────────────────────────────────────────────────────────────────────┘
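The retry-on-transient wrapper mentioned above can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: `TransientLLMError` is a hypothetical stand-in for provider errors such as `ResourceExhausted` / `ServiceUnavailable`, and the attempt count and delays are assumed values. Cost attribution is left out.

```python
import random
import time


class TransientLLMError(Exception):
    """Hypothetical stand-in for transient provider errors (rate limit, 503)."""


def call_with_retry(fn, *, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientLLMError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Nominal delays 0.5 s, 1 s, 2 s, ... scaled by a random factor in [0.5, 1.0).
            delay = base_delay * 2 ** (attempt - 1) * (0.5 + random.random() / 2)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting; any non-transient exception propagates immediately.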
Each agent has a tightly scoped tool set:
- **Triage** — `get_alert_details`, `get_recent_change_window`, `get_failing_service_metadata`, then **must** call `emit_incident_context`.
- **Investigator** — `query_scg`, `run_counterfactual`, `get_signal_history`, `retrieve_similar_incidents` (pgvector cosine search over the tenant's postmortem corpus), then **must** call `emit_hypotheses`.
- **Verifier** — `query_specific_signal`, `run_shadow_experiment`, then **must** call `emit_verification`.
Prompts are versioned files under [`prompts/`](prompts/). Every change requires a version bump and an eval delta in [`eval/`](eval/) (per `ML_AGENTS.md` §12.6).
### The math (Pearl's hierarchy)
Causa operates at all three levels.
| Level | Symbol | Where it lives |
|---|---|---|
| **L1 · Association** | `P(Y \| X)` | The System Causal Graph's edge weights — discovered by the PC algorithm on baseline telemetry. |
| **L2 · Intervention** | `P(Y \| do(X = x))` | The simulator forward-propagates from a pinned intervention through the DAG. Each child is sampled from its per-node LightGBM quantile regressor (or, for envs without 24 h of training data yet, a baseline-revert fallback). |
| **L3 · Counterfactual** | `P(Y_x \| X', Y')` | "Prove it" replays the actual incident with the candidate cause removed. Same world, modified intervention. The verifier shadow can run in the simulator OR on a real parallel `kind` cluster (`KindShadowRunner`). |
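To make the L2 row concrete, here is a toy sketch of forward propagation under a pinned intervention. It is an illustration under loud assumptions: the three-node chain, the linear equations, and every coefficient are invented, and hand-written equations stand in for the per-node LightGBM quantile regressors. Because the two arms use independent noise draws, the contrast is interventional (L2), not a same-world counterfactual (L3).

```python
import random
import statistics


def sample_error_rate(rng, do_memory=None):
    """Sample the downstream symptom from a toy SCM: memory -> latency -> error_rate."""
    # Pinning `do_memory` cuts the node off from its own noise: that is do(X = x).
    memory = rng.gauss(0.9, 0.05) if do_memory is None else do_memory
    latency_ms = 200 + 800 * memory + rng.gauss(0, 20)
    return 0.01 + 0.0001 * latency_ms + rng.gauss(0, 0.002)


def interventional_effect(n=1000, seed=0):
    """Estimate the effect of do(memory = 0.4) on error_rate with a 95% percentile band."""
    rng = random.Random(seed)
    observed = [sample_error_rate(rng) for _ in range(n)]
    intervened = [sample_error_rate(rng, do_memory=0.4) for _ in range(n)]
    diffs = sorted(o - i for o, i in zip(observed, intervened))
    # Integer arithmetic for the 2.5th and 97.5th percentile indices.
    lo, hi = diffs[n * 25 // 1000], diffs[n * 975 // 1000 - 1]
    return statistics.mean(diffs), (lo, hi)
```

With these invented coefficients the true effect is 0.0001 × 800 × (0.9 − 0.4) = 0.04, so the band should sit well away from zero.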
**Citations**
- Pearl (2009). *Causality: Models, Reasoning, and Inference*. Cambridge.
- Spirtes, Glymour, Scheines (2000). *Causation, Prediction, and Search*. (PC algorithm.)
- Peters, Janzing, Schölkopf (2017). *Elements of Causal Inference*. MIT Press.
- Shimizu et al. (2006). LiNGAM. JMLR.
### End-to-end flow
1. An alert hits `POST /wh/{pagerduty,opsgenie,alertmanager,github}/{secret}`. Webhook payload is parsed into a normalized `IncidentContext`.
2. The orchestrator resolves the right **environment** via the per-tenant URL secret.
3. If the env has `prometheus_url` set, real metrics are pulled via PromQL (range + instant). Otherwise the bundled rehearsed-incident fixtures stand in.
4. **Triage agent** runs (~5-10 s), emits the candidate set.
5. **Investigator agent** runs (~20-40 s), calls `run_counterfactual` per candidate, ranks by effect magnitude with 95% CI, emits the `Hypothesis[]`.
6. The Console streams every phase change over WebSocket. Redis persists the last 50 snapshots so a reconnecting client catches up.
7. The on-call clicks **Prove it**. **Verifier agent** runs the shadow (simulator by default, parallel `kind` cluster if configured + the tenant's plan permits it), returns `CONFIRMED` / `REJECTED` / `INCONCLUSIVE` with the chart.
8. Every step writes a row to the append-only `audit_events` table. Every LLM call writes a `cost_events` row.
Total typical demo run: **~30 seconds, ~$0.10 of Gemini 2.5 Pro tokens.**
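The reconnect catch-up in step 6 boils down to a bounded history plus a sequence number. The sketch below uses an in-process `deque` as a stand-in for the Redis-backed history; the class and method names are hypothetical, and only the limit of 50 comes from the text above.

```python
from collections import deque


class SnapshotHistory:
    """Keep the most recent snapshots so a reconnecting client can catch up."""

    def __init__(self, max_snapshots=50):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._buf = deque(maxlen=max_snapshots)
        self._next_seq = 0

    def publish(self, snapshot):
        """Store a snapshot and return the sequence number assigned to it."""
        seq = self._next_seq
        self._next_seq += 1
        self._buf.append((seq, snapshot))
        return seq

    def replay_since(self, last_seen_seq):
        """Return every retained snapshot newer than the client's last seen one."""
        return [snap for seq, snap in self._buf if seq > last_seen_seq]
```

A client that reconnects sends the last sequence number it saw; a brand-new client can pass `-1` to receive everything still retained.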
## Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Browser (Next.js 16) │
│ ┌────────────────────────┐ ┌──────────────────────────────────────┐ │
│ │ Landing / │ │ Dashboard /dashboard │ │
│ │ (marketing) │ │ • Topbar + OrgSwitcher │ │
│ │ │ │ • Timeline (events + audit) │ │
│ │ │ │ • 3D Causal Graph (react-three-fiber)│
│ │ │ │ • Hypothesis stack + Prove-it │ │
│ └────────────────────────┘ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│ HTTPS + WSS │
▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ FastAPI (causa-api) │
│ │
│ /auth/* magic-link + WorkOS SSO + JWT sessions │
│ /api/* tenant-scoped — incidents, environments, jobs │
│ /wh/* signature-verified webhooks (PagerDuty / Opsgenie / ...) │
│ /ws/* live incident updates │
│ /sdk-token SDK-popup auth handshake │
│ │
│ ┌──────────────┐ ┌────────────────┐ ┌─────────────────┐ ┌──────────┐ │
│ │ TenantContext│ │ LangGraph state│ │ Tool layer │ │ Cost + │ │
│ │ middleware │ │ machine (3 │ │ (typed I/O, │ │ audit │ │
│ │ + RBAC │ │ agents, Pg- │ │ schema- │ │ trackers│ │
│ │ │ │ checkpointed) │ │ validated) │ │ │ │
│ └──────────────┘ └────────────────┘ └─────────────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌──────────┐ ┌─────────┐
│Postgres │ │ Redis │ │Prometheus│ │ Gemini │
│ + RLS │ │ (RQ + │ │ (your │ │ 2.5 │
│+pgvector│ │ snapshots│ │ cluster)│ │ Pro │
└─────────┘ └─────────┘ └──────────┘ └─────────┘
### `causa-core` — the math, no LLM
Pure Python library that the agents call as tools. Importable on its own.
| File | What it does |
|---|---|
| [`causa_core/scg.py`](packages/causa-core/causa_core/scg.py) | PC algorithm via `causal-learn`; builds + persists the SCG; shuffle/negative-control diagnostic; precision/recall vs ground-truth eval. |
| [`causa_core/scg_from_topology.py`](packages/causa-core/causa_core/scg_from_topology.py) | Topology-derived stub SCG for new tenants without 24 h of baseline yet. |
| [`causa_core/condmodel.py`](packages/causa-core/causa_core/condmodel.py) | Per-node LightGBM quantile regressors at α ∈ {0.05, 0.5, 0.95}. |
| [`causa_core/cf.py`](packages/causa-core/causa_core/cf.py) | Vectorized counterfactual sampler — forward propagation through the DAG, 1000 samples in parallel. |
| [`causa_core/cf_baseline.py`](packages/causa-core/causa_core/cf_baseline.py) | Baseline-revert counterfactual for envs without trained condmodels. |
| [`causa_core/topology.py`](packages/causa-core/causa_core/topology.py) | Service topology + signal taxonomy (reference; per-env overrides live in DB). |
| [`causa_core/synth.py`](packages/causa-core/causa_core/synth.py) | Synthetic telemetry generator with known ground-truth edges — used by the bundled demo + the agent eval set. |
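The precision/recall evaluation against ground-truth edges listed for `scg.py` reduces to set arithmetic over directed edges. A minimal sketch, assuming edges are represented as `(parent, child)` tuples; the function name and the convention of returning 1.0 for an empty set are choices made here, not taken from the repo.

```python
def edge_precision_recall(learned_edges, true_edges):
    """Compare learned directed edges with ground truth; returns (precision, recall)."""
    learned, truth = set(learned_edges), set(true_edges)
    true_positives = len(learned & truth)
    # Convention chosen for this sketch: an empty set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(learned) if learned else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall
```

For example, learning three of four true edges and nothing spurious gives precision 1.0 and recall 0.75.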
### `causa-api` — orchestrator + multi-tenant SaaS
| Subdir | What it does |
|---|---|
| [`auth/`](apps/api/causa_api/auth) | JWT, WorkOS OIDC, magic-link, RBAC (`owner` / `admin` / `on_call` / `viewer`), `TenantContext` middleware. |
| [`api/`](apps/api/causa_api/api) | Tenant-scoped HTTP routes — incidents, environments, jobs. |
| [`webhooks/`](apps/api/causa_api/webhooks) | Per-tenant signed webhook intake — PagerDuty / Opsgenie / Alertmanager / GitHub / generic. |
| [`graph/`](apps/api/causa_api/graph) | LangGraph state machine: nodes, tool routing, PostgresSaver checkpointing, streaming. |
| [`agents/`](apps/api/causa_api/agents) | Versioned-prompt loader + per-agent tool schemas. |
| [`providers/`](apps/api/causa_api/providers) | `IncidentDataProvider` implementations — `PrometheusProvider` (real PromQL) + `FixtureProvider` (demo). |
| [`shadow/`](apps/api/causa_api/shadow) | `ShadowRunner` Protocol + `SimulatorShadowRunner` + `KindShadowRunner` (real kind cluster Prove-it). |
| [`workers/`](apps/api/causa_api/workers) | RQ background jobs — condmodel training, SCG rebuild, drift detection, nightly cron. |
| [`repos/`](apps/api/causa_api/repos) | DB access patterns; every call is inside a `tenant_session` (RLS-enforced). |
| [`db/`](apps/api/causa_api/db) | SQLAlchemy 2.0 async models + Alembic migrations + RLS policies. |
| [`broadcaster.py`](apps/api/causa_api/broadcaster.py) | In-process pub/sub + Redis snapshot history for WS reconnect replay. |
| [`rate_limit.py`](apps/api/causa_api/rate_limit.py) | Per-tenant Lua-script token bucket on Redis. |
| [`orchestrator.py`](apps/api/causa_api/orchestrator.py) | The `run_incident()` + `run_verification()` entry points. |
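For readers unfamiliar with the token-bucket algorithm behind `rate_limit.py`, here is an in-process Python sketch of the idea. The real limiter is described as a Lua script on Redis; this class, its names, and its parameters are illustrative assumptions only.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled continuously at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self._clock()
        # Refill for the elapsed time, never exceeding capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A per-tenant limiter would keep one bucket per tenant id; doing the refill-and-spend step inside a single Redis script is what makes it atomic across API workers.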
### `causa-console` — Next.js 16 app
Marketing landing at `/`, sign-in at `/login`, the live dashboard at `/dashboard`.
The dashboard reuses real components — the 3D causal graph is `@react-three/fiber` (`CausalGraph3D.tsx`), the verification overlay is `recharts` with a split-screen `LineChart`, the hypothesis cards are shadcn primitives styled with our token system.
### `@causa/react` — embeddable SDK
A single React package: `<CausaProvider>` + `<CausaTrigger>` + `<CausaOverlay>` + the `useCausa()` hook. It authenticates the end user in a popup, talks to the API with bearer tokens (no cross-site cookies, so it is safe under Safari ITP), and syncs sessions across tabs via `BroadcastChannel`.
## 🚀 Quick start — 5 minutes
### Prerequisites
- **macOS / Linux**
- [Homebrew](https://brew.sh/) (macOS) or apt
- A **Gemini API key** — free tier works ([create one](https://aistudio.google.com/app/apikey))
### One-time tool install
brew install uv pnpm postgresql@17 redis
brew services start postgresql@17
brew services start redis
### Clone + set up
git clone https://github.com/your-org/causa.git
cd causa
# Python workspace
uv sync --all-packages
# JS workspace
pnpm install
# Postgres: create db + roles + extensions (one-time, ~3 sec)
/opt/homebrew/opt/postgresql@17/bin/psql -d postgres -f infra/postgres/init.sql
# Run migrations (creates schema + RLS policies)
cd apps/api && uv run alembic upgrade head && cd ../..
### Configure
cp .env.example .env
Edit `.env` — at minimum set:
GEMINI_API_KEY=AIza... # required
CAUSA_DEPLOYMENT_MODE=demo # auto-creates a 'demo' tenant on startup
CAUSA_DATABASE_URL=postgresql+asyncpg://app_user:app_user_pw@localhost:5432/causa
CAUSA_DATABASE_URL_ADMIN=postgresql+asyncpg://app_admin:app_admin_pw@localhost:5432/causa
CAUSA_DATABASE_URL_SYNC=postgresql://app_admin:app_admin_pw@localhost:5432/causa
### Generate the demo fixtures (one-time, ~5 sec)
uv run python -m causa_core.cli synth --scenario baseline --hours 4 --out fixtures
uv run python -m causa_core.cli synth --scenario recommendation_cascade --hours 0.1 --out fixtures
make scg # learns the SCG (~15 s)
uv run python -m causa_core.cli condmodel-fit # fits LightGBM (~30 s)
### Run
# Terminal 1
make api
# Terminal 2
make console
Open **http://localhost:3000**. Click **Get started** in the hero. Sign in with any email — in dev mode the magic link is printed inline on the confirmation card. After sign-in you land on **/dashboard**.
Press **fire demo incident** (top right of the dashboard). Watch the timeline fill, the 3D causal graph light up the path from `recommendationservice.memory` → `frontend.latency_p99` → `checkoutservice.error_rate`, and the ranked hypotheses appear with their 95% CIs. Click **Prove it** on the top hypothesis: the verifier runs and an overlay slides in with the split-screen chart showing CONFIRMED.
**Cost:** ~$0.10. **Wall clock:** ~30 seconds.
### CLI-only smoke test (no UI required)
make incident # fires the rehearsed incident, prints the ranked output
## 📦 The `@causa/react` SDK
Install in any React app:
npm install @causa/react
# or: pnpm add @causa/react / yarn add @causa/react / bun add @causa/react
Wrap your root:
import { CausaProvider, CausaTrigger, CausaOverlay } from "@causa/react";
export default function App() {
  return (
    <CausaProvider>
      <CausaTrigger /> {/* floating button, bottom right by default */}
      <CausaOverlay /> {/* slide-in drawer */}
    </CausaProvider>
  );
}
Anyone who clicks the trigger gets a Causa magic-link sign-in popup. Once they're in, the overlay shows their org's incidents, lets them fire a demo run, and surfaces ranked hypotheses + Prove-it verdicts. No backend changes on your side.
Headless? Use the `useCausa()` hook:
import { useCausa } from "@causa/react";
function YourCustomTrigger() {
const causa = useCausa();
return (
);
}
See [`packages/sdk-react/README.md`](packages/sdk-react/README.md) for the full API.
### Publishing the SDK to npm
./scripts/publish-sdk.sh # dry-run — verify
./scripts/publish-sdk.sh --real # actually publish (requires npm login)
## 🔌 Wire up real telemetry
The bundled demo uses synthetic data so the install path is fast. To run Causa against your real cluster, attach Prometheus + send real alerts.
### 1 · Configure your environment
# Add your Prometheus URL + (optional) PromQL overrides + auth header
curl -X PATCH http://localhost:8000/api/environments/$ENV_ID \
-H 'content-type: application/json' \
-b /tmp/causa_cookies \
-d '{
"prometheus_url": "https://prom.your-cluster.example.com",
"service_label": "service",
"extra_label_selectors": ", cluster=\"prod-us\""
}'
### 2 · Auto-discover services + call graph
curl -X POST http://localhost:8000/api/environments/$ENV_ID/discover
Causa queries Prometheus in this order:
1. `traces_service_graph_request_total` (Tempo / OTel `metrics_generator`) — gives caller→callee edges directly.
2. `istio_requests_total{source_workload, destination_workload}` — Istio / Linkerd service mesh.
3. Fallback: `up{}` — enumerates jobs; SCG learns edges from data.
The discovered topology is persisted on the environment row, and a topology-stub SCG is activated so the orchestrator can start ranking hypotheses immediately, before you've collected 24 h of baseline.
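The ordered fallback above can be sketched as a loop over candidate metrics. In this sketch `run_query` is a hypothetical callable that runs an instant PromQL query and returns a list of label dictionaries; the source names and the function itself are illustrative, and only the metric names come from the list above.

```python
# Tried in order; the first metric that returns any series wins.
DISCOVERY_QUERIES = [
    ("service_graph", "traces_service_graph_request_total"),
    ("mesh", "istio_requests_total"),
    ("jobs_only", "up"),
]


def discover_topology(run_query):
    """Return (source_name, series) from the first discovery query that has data."""
    for source, promql in DISCOVERY_QUERIES:
        series = run_query(promql)
        if series:
            return source, series
    # Nothing matched: the Prometheus instance exposes none of the known metrics.
    return "none", []
```

With the `jobs_only` result there are no caller→callee edges to read off, which is why the text says the SCG then has to learn edges from data.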
### 3 · Train per-node LightGBM models on real data
# Triggers a background RQ job; returns a job_id
curl -X POST http://localhost:8000/api/environments/$ENV_ID/train
# Poll status
curl http://localhost:8000/api/jobs/$JOB_ID
This pulls 24 h of baseline metrics from your Prometheus, resamples to 5 s buckets, fits LightGBM quantile regressors per node, and persists the joblib artifact + a `CondModel` row. The orchestrator picks up the new artifact on the next incident automatically.
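If you would rather script the polling than re-run `curl`, a loop like the one below works. It is a sketch under assumptions: the job response is assumed to be JSON with a `status` field, and `finished` / `failed` are assumed terminal values (borrowed from RQ's job states, not confirmed for this API). `fetch_status` is a hypothetical callable that performs the `GET /api/jobs/$JOB_ID` request and returns the parsed body.

```python
import time

# Assumed terminal states; check the actual job payload before relying on these.
TERMINAL_STATES = {"finished", "failed"}


def poll_job(fetch_status, *, interval_s=5.0, timeout_s=600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal state or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout_s} s")
        sleep(interval_s)
```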
### 4 · Schedule nightly retrain + drift checks
# Run a worker
make worker
# In a separate cron container, every night:
uv run python -m causa_api.workers.retraining enqueue_nightly_for_all_envs
The retrainer re-runs PC on fresh baseline, backtest-gates promotion at Jaccard ≥ 0.70 vs. the previous SCG, and on promotion cascades a drift check + condmodel refit.
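The promotion gate compares the edge set of the freshly learned SCG with the previous one. A minimal sketch of that check, assuming edges are `(parent, child)` tuples; the function names are hypothetical and only the 0.70 threshold comes from the text.

```python
def jaccard(edges_a, edges_b):
    """Jaccard similarity of two edge sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical in this sketch
    return len(a & b) / len(a | b)


def should_promote(new_edges, previous_edges, threshold=0.70):
    """Promote the new SCG only if it stays close enough to the previous one."""
    return jaccard(new_edges, previous_edges) >= threshold
```

Dropping one of four edges gives 3/4 = 0.75 and passes; dropping one and adding an unrelated one gives 3/5 = 0.60 and is held back.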
### 5 · Point your alert source at Causa
# Generate a per-tenant webhook secret (admin only)
curl -X POST http://localhost:8000/api/webhook_secrets \
-H 'content-type: application/json' \
-d '{"source": "pagerduty", "environment_id": "..."}'
# → {"url": "https://api.causa.dev/wh/pagerduty/<secret>"}
Supported sources today: **PagerDuty, Opsgenie, Alertmanager, GitHub (deploys/pushes), generic**.
## 🧰 Repository layout
causa/
├── apps/
│ ├── api/ FastAPI orchestrator + LangGraph agents
│ │ ├── alembic/ Migrations (0001 schema + RLS, 0002 telemetry)
│ │ ├── causa_api/
│ │ │ ├── api/ HTTP routes (/api/*)
│ │ │ ├── auth/ JWT + WorkOS + magic-link + RBAC + middleware
│ │ │ ├── agents/ Prompt loader, agent runners
│ │ │ ├── graph/ LangGraph state machine, nodes, tools
│ │ │ ├── providers/ PrometheusProvider, auto-discovery, resolver
│ │ │ ├── repos/ Tenant-scoped DB access
│ │ │ ├── shadow/ ShadowRunner (Simulator + Kind)
│ │ │ ├── webhooks/ Alert / event intake
│ │ │ ├── workers/ RQ jobs (training, drift, retrain)
│ │ │ ├── db/ SQLAlchemy 2.0 models
│ │ │ ├── orchestrator.py Top-level run_incident() + run_verification()
│ │ │ ├── tools.py Tool handlers (LLM-callable)
│ │ │ ├── broadcaster.py WS fan-out + Redis snapshot history
│ │ │ ├── rate_limit.py Per-tenant Lua token bucket
│ │ │ └── main.py FastAPI app + lifespan
│ │ └── pyproject.toml
│ ├── causa-console/ Next.js 16 app
│ │ ├── src/
│ │ │ ├── app/ / /dashboard /login /sdk/login …
│ │ │ ├── components/
│ │ │ │ ├── landing/ Nav, Hero3D, CodeBlock, Footer, icons
│ │ │ │ ├── CausalGraph3D.tsx 3D incident visualization (r3f + drei)
│ │ │ │ ├── Timeline.tsx Live event log
│ │ │ │ ├── HypothesisStack.tsx
│ │ │ │ ├── VerificationOverlay.tsx
│ │ │ │ ├── Topbar.tsx + OrgSwitcher.tsx
│ │ │ │ └── ui/ shadcn primitives (button, dialog, card, …)
│ │ │ ├── lib/ api client, store (zustand)
│ │ │ ├── types/causa.ts TypeScript mirror of Pydantic shapes
│ │ │ └── middleware.ts Protects /dashboard
│ │ └── package.json
│ ├── deployer/ Mock GitHub-Actions style deploy event emitter
│ └── flagsvc/ Mock feature-flag service
│
├── packages/
│ ├── shared/ Single source of truth for Pydantic types
│ ├── causa-core/ Math: SCG, simulator, synth
│ └── sdk-react/ @causa/react npm package
│
├── infra/
│ ├── postgres/init.sql Bootstrap roles + db
│ ├── prometheus-values.yaml kube-prometheus-stack overrides
│ └── helm/causa/ Helm chart (api + console + postgres + redis)
│
├── docs/
│ ├── MULTI_TENANT_DESIGN.md Architecture deep-dive
│ ├── ROADMAP.md Phase 0 → Phase 5
│ └── SECURITY.md SOC 2 control register + sub-processors
│
├── prompts/ Versioned agent prompts (ML_AGENTS §12.6)
│ ├── triage.v1.md
│ ├── investigator.v1.md
│ └── verifier.v1.md
│
├── eval/incidents/ Eval set (per ML_AGENTS §12.1)
├── failures/ Regression case library (ML_AGENTS §12.14)
├── tests/ pytest suite (13 tests, all green)
├── scripts/ publish-sdk.sh, run_demo.py
├── fixtures/ Generated synthetic telemetry (gitignored)
├── models/ Trained SCG + condmodels (gitignored)
│
├── DECISIONS.md Every non-trivial decision + reversal cost
├── PROBLEM.md Problem framing (ML_AGENTS §4)
├── LIMITATIONS.md What Causa does NOT do; failure modes
├── LICENSES.md Third-party data + model licensing
├── ML_AGENTS.md Mandatory operating rules for any AI agent
├── PRD.md Original hackathon spec (historical)
├── Makefile make help — all dev tasks
└── pyproject.toml uv workspace root
## ⚙️ Configuration reference
All settings come from environment variables. See [`.env.example`](.env.example).
### Required
| Variable | Default | Description |
|---|---|---|
| `GEMINI_API_KEY` | — | Google Gemini API key. Required for any agent run. |
### Deployment mode
| Variable | Default | Description |
|---|---|---|
| `CAUSA_DEPLOYMENT_MODE` | `multi_tenant` | `demo` auto-creates a default tenant + env on startup and mounts back-compat legacy endpoints. `multi_tenant` requires real magic-link sign-in. |
### Database + Redis
| Variable | Default | Description |
|---|---|---|
| `CAUSA_DATABASE_URL` | `postgresql+asyncpg://app_user:app_user_pw@localhost:5433/causa` | Async, RLS-enforced — used for all runtime queries. |
| `CAUSA_DATABASE_URL_ADMIN` | `postgresql+asyncpg://app_admin:app_admin_pw@localhost:5433/causa` | BYPASSRLS — used only by Alembic + admin tools. |
| `CAUSA_DATABASE_URL_SYNC` | `postgresql://app_admin:app_admin_pw@localhost:5433/causa` | Sync URL for Alembic. |
| `REDIS_URL` | `redis://localhost:6379/0` | Used by broadcaster snapshot history, rate limiter, RQ. |
### Auth
| Variable | Default | Description |
|---|---|---|
| `CAUSA_JWT_SECRET` | dev placeholder | **Set to a real secret in production.** HMAC SHA-256 signing key for session JWTs. |
| `CAUSA_JWT_TTL_HOURS` | `12` | Session length. |
| `CAUSA_COOKIE_SECURE` | `false` | Set `true` in production behind HTTPS. |
| `CAUSA_COOKIE_DOMAIN` | (unset) | Set if hosting console + api under different subdomains. |
| `CAUSA_PUBLIC_APP_URL` | `http://localhost:3000` | Console URL (used in magic-link emails). |
| `CAUSA_PUBLIC_API_URL` | `http://localhost:8000` | API URL (used in SSO callbacks). |
### WorkOS SSO (optional)
| Variable | Default | Description |
|---|---|---|
| `WORKOS_API_KEY` | (unset) | Set to enable enterprise SSO via WorkOS. |
| `WORKOS_CLIENT_ID` | (unset) | WorkOS OAuth client id. |
### Magic-link email (optional)
| Variable | Default | Description |
|---|---|---|
| `RESEND_API_KEY` | (unset) | If unset, magic links are printed inline in the sign-in response (dev mode). Set to send real emails. |
| `CAUSA_MAGICLINK_FROM` | `login@causa.dev` | From-address for magic-link emails. |
### Agent budgets (ML_AGENTS §12.9)
| Variable | Default | Description |
|---|---|---|
| `CAUSA_MAX_TOKENS_PER_INCIDENT` | `200000` | Hard cap; agent halts on exceed. |
| `CAUSA_MAX_USD_PER_INCIDENT` | `1.00` | Hard cap; agent halts on exceed. |
| `CAUSA_MODEL` | `gemini-2.5-pro` | LLM. Set to `gemini-2.5-flash` for ~4× cheaper / faster runs. |
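The two hard caps can be pictured as a small accumulator that refuses any call that would cross either limit. This is an illustrative sketch, not the repo's tracker: the class, the exception, and the choice to check before recording the spend are assumptions. Only the default values mirror the table above.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when an LLM call would push the incident past a hard cap."""


@dataclass
class IncidentBudget:
    max_tokens: int = 200_000   # CAUSA_MAX_TOKENS_PER_INCIDENT default
    max_usd: float = 1.00       # CAUSA_MAX_USD_PER_INCIDENT default
    tokens_used: int = 0
    usd_used: float = 0.0

    def charge(self, tokens, usd):
        """Record one LLM call, or raise without recording if a cap would be crossed."""
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        if self.usd_used + usd > self.max_usd:
            raise BudgetExceeded("USD cap reached")
        self.tokens_used += tokens
        self.usd_used += usd
```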
### Telemetry (when not pointing at real Prometheus)
| Variable | Default | Description |
|---|---|---|
| `CAUSA_SCG_PATH` | `models/scg.json` | Demo SCG location. |
| `CAUSA_CONDMODELS_DIR` | `models/condmodels` | Demo condmodel location. |
| `PROMETHEUS_URL` | `http://localhost:9090` | Default for the demo env's PrometheusProvider. |
### Kill switch (ML_AGENTS §12.19)
| Variable | Default | Description |
|---|---|---|
| `CAUSA_KILL_SWITCH_TOKEN` | dev placeholder | Header required for `POST /halt/{incident_id}`. |
## 🛠 Development guide
### Common tasks
make help # show every available make target
# Stack
make demo-up # full demo stack via docker-compose
make demo-down # tear it down
# Causa pipelines
make scg # rebuild the demo SCG
make incident # CLI smoke (no UI) — fires the rehearsed incident
make incident-no-verify # same, skip Prove-it step
# Services
make api # FastAPI on :8000
make console # Next.js dev on :3000
make worker # RQ worker — picks up training / drift / retrain jobs
make flagsvc # Mock feature-flag service on :8002
# Quality gates
make eval # run agent eval suite (≥30 incidents per ML_AGENTS §12.1)
make test # pytest -q across all packages
make fmt # ruff + prettier
make lint # ruff + eslint
### Code style
- **Python:** `ruff` (line length 100, target py312). Format on save.
- **TypeScript:** strict mode, no `any` without justification, `pnpm format` runs prettier.
- **Commit messages:** present tense, imperative. *"add Prometheus auto-discovery"*, not *"added"*.
### Migrations
# Create a new migration manually (autogenerate is unreliable for RLS policies)
cd apps/api
uv run alembic revision -m "describe-the-change"
# edit alembic/versions/<rev>_*.py, then write upgrade()/downgrade()
uv run alembic upgrade head
### Adding a new agent tool
1. Add the JSON schema to `apps/api/causa_api/tools.py` (one of the `*_tool_schemas` functions).
2. Add the Python handler to `ToolHandlers` in the same file.
3. Wire it into the agent's tool map in `apps/api/causa_api/graph/tools.py`.
4. Reference it in the agent's prompt at `prompts/<agent>.v<N>.md`, and **bump the version**.
5. Add an eval case in `eval/incidents/` that exercises it.
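Steps 1 to 3 amount to a schema, a handler, and a map entry. The sketch below shows that shape with a made-up tool; `get_deploy_diff`, its fields, `TOOL_MAP`, and `dispatch` are all hypothetical and do not reflect the actual contents of `tools.py` or `graph/tools.py`.

```python
# Step 1 (hypothetical): a JSON-schema description the LLM sees.
GET_DEPLOY_DIFF_SCHEMA = {
    "name": "get_deploy_diff",
    "description": "Return the files changed by a deploy.",
    "parameters": {
        "type": "object",
        "properties": {"deploy_id": {"type": "string"}},
        "required": ["deploy_id"],
    },
}


# Step 2 (hypothetical): the Python handler behind the schema.
def get_deploy_diff(deploy_id):
    return {"deploy_id": deploy_id, "files_changed": []}


# Step 3 (hypothetical): the agent's tool map, name -> (schema, handler).
TOOL_MAP = {"get_deploy_diff": (GET_DEPLOY_DIFF_SCHEMA, get_deploy_diff)}


def dispatch(tool_name, args):
    """Validate required arguments against the schema, then call the handler."""
    if tool_name not in TOOL_MAP:
        raise KeyError(f"tool not available to this agent: {tool_name}")
    schema, handler = TOOL_MAP[tool_name]
    missing = [k for k in schema["parameters"]["required"] if k not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return handler(**args)
```

Keeping the map per agent is what makes the tool set "tightly scoped": a tool name that is not in the map cannot be called, whatever the model asks for.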
### Adding a new alert source
1. Add a parser in `apps/api/causa_api/webhooks/parse.py` returning the canonical alert dict shape.
2. Add the route in `apps/api/causa_api/webhooks/routes.py` (verify the source's HMAC signature).
3. Add the source enum value to `causa_api/db/models/enums.py:WebhookSource`.
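For step 2, signature verification generally follows the same pattern whatever the source. A minimal sketch using HMAC-SHA256 over the raw request body: header names, encodings, and prefixes differ per provider, so treat the hex-digest format here as an assumption and check the provider's documentation.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected, received_hex)
```

Verify against the exact bytes received, before any JSON parsing; re-serializing the payload can change whitespace and break the signature.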
## 🧪 Testing
make test # pytest -q, 13 tests
uv run pytest tests/test_synth.py -v # synthetic generator suite
uv run pytest tests/test_shared_types.py # Pydantic schema tests
### What the test suite verifies
- **Shared type contracts** match between Python (Pydantic) and TypeScript.
- **Synthetic generator** produces stationary baselines, and the rehearsed incident's symptom crosses its threshold deterministically.
- **SCG learner** recovers ≥ 75% of ground-truth edges at α=0.001 (precision ≥ 0.95).
- **Shuffle/negative-control test** finds 0 edges under H0 (`ML_AGENTS §5.5`).
- **Counterfactual simulator** propagates an intervention through the DAG and produces a CI.
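The idea behind the shuffle/negative-control check can be shown on a single pair of signals. This sketch is a plain permutation test on a Pearson correlation, not the PC algorithm and not the repo's test: shuffling one series destroys any real dependence, so a dependence measure that still fires on shuffled data points to leakage or a bug.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def shuffle_control(xs, ys, n_shuffles=200, seed=0):
    """Return (|r| on real data, fraction of shuffles whose |r| is at least as large)."""
    rng = random.Random(seed)
    real = abs(pearson(xs, ys))
    shuffled = list(ys)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break the pairing between xs and ys
        if abs(pearson(xs, shuffled)) >= real:
            exceed += 1
    return real, exceed / n_shuffles
```

For a genuinely dependent pair the returned fraction should be at or near zero; a large fraction means the observed association is indistinguishable from chance.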
### Smoke test against the live Gemini API
GEMINI_API_KEY=AIza... make incident
# → ranked hypotheses + CONFIRMED verdict, ~$0.10, ~30 s
### Verify the SDK package
./scripts/publish-sdk.sh # dry-run, no upload
# → 📦 @causa/react@0.1.0 → https://registry.npmjs.org/
## 🗺 Roadmap
| Phase | Scope | Status |
|---|---|---|
| **0** | Multi-tenant foundation: Postgres RLS, auth, RBAC, LangGraph agent loop, SDK | ✅ Shipped |
| **1** | Real Prometheus + OTel + auto-discovery + drift + retraining | ✅ Shipped |
| **2** | Stateful operations: scheduled retraining, drift alerts, champion-challenger | ✅ Shipped |
| **3** | Real parallel shadow cluster (KindShadowRunner local; multi-tenant cloud pools next) | ⚠️ Local works |
| **4** | Trust & compliance: SOC 2 controls, CMEK, private-model option | ⚠️ Controls + docs ready; attestation period TBD |
| **5** | Scale + ecosystem: schema-per-tenant, multi-region, public API + SDKs | ❌ Future |
See [`docs/ROADMAP.md`](docs/ROADMAP.md) for the full Phase 0 → Phase 5 breakdown with effort estimates and risks.
## Status
**Alpha.** Not yet production-ready for a tenant who hasn't read this README end-to-end.
What works today:
- ✅ Three-agent loop on Gemini 2.5 Pro via LangGraph
- ✅ Multi-tenant SaaS foundation (Postgres RLS, JWT auth, RBAC)
- ✅ Real Prometheus ingestion + service auto-discovery
- ✅ LightGBM condmodel training (background jobs)
- ✅ Drift detection + scheduled SCG retraining
- ✅ Real parallel `kind` cluster for Prove-it (self-hosted)
- ✅ `@causa/react` embeddable SDK
- ✅ Magic-link + WorkOS SSO + popup auth for the SDK
What's documented but not yet polished:
- ⚠️ Helm chart deploys but isn't load-tested
- ⚠️ Audit log is unbounded — needs a retention policy
- ⚠️ DR drill not performed
- ⚠️ Pen test not performed
What's not built:
- ❌ Cloud-hosted multi-tenant shadow-cluster pools (Phase 3)
- ❌ Private-model deployment (Vertex-AI in customer GCP) (Phase 4)
- ❌ SOC 2 Type II attestation (Phase 4)
- ❌ Schema-per-tenant migration for enterprise (Phase 5)
- ❌ Vue / Svelte / React-Native SDKs
Be honest with your judges / customers / yourself about which row you're on.
## Engineering rules ([`ML_AGENTS.md`](ML_AGENTS.md))
This repo ships with a **mandatory operating manual** for any AI coding agent working in it. Before touching ML / agent / RAG / fine-tuning / evaluation code, agents must read [`ML_AGENTS.md`](ML_AGENTS.md) in full and cite section numbers when explaining decisions (e.g. *"per §5.5, we ran a shuffle/negative-control test"*).
The most load-bearing rules:
- **§1 — Operating principles**: rigor over approval, volunteer skepticism, numbers over narrative.
- **§4 — Problem framing**: every project starts with `PROBLEM.md` answering 12 specific questions.
- **§5.5 — Leakage checks**: shuffle/negative-control test before claiming any non-trivial result.
- **§6.4 — Complexity additions require ablations**: every added layer/feature/tool gets a removal ablation.
- **§12.1 — Eval-first**: ≥30 hand-crafted task instances before any agent code.
- **§12.6 — Prompts are code**: versioned files, eval delta on every change.
- **§12.9 / §12.29 — Cost + token budgets**: per-call enforcement at prompt construction.
- **§12.19 — Kill switch**: every long-running agent has a tested stop path.
Every non-trivial decision in this codebase is recorded in [`DECISIONS.md`](DECISIONS.md) with date + reason + alternatives + reversal cost — per `§23.3`.
## Citations
The math is from real causal-inference literature. Cited so you can verify our claims are not buzzwords:
- **Pearl, J.** (2009). *Causality: Models, Reasoning, and Inference.* Cambridge University Press.
- **Spirtes, P., Glymour, C., Scheines, R.** (2000). *Causation, Prediction, and Search.* MIT Press. (PC algorithm.)
- **Peters, J., Janzing, D., Schölkopf, B.** (2017). *Elements of Causal Inference: Foundations and Learning Algorithms.* MIT Press.
- **Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A.** (2006). "A linear non-Gaussian acyclic model for causal discovery." *Journal of Machine Learning Research* 7, 2003–2030. (LiNGAM.)
Reference implementations: [`causal-learn`](https://causal-learn.readthedocs.io), [`LightGBM`](https://lightgbm.readthedocs.io).
## Security disclosure
Find a vulnerability? Email **security@causa.dev** (don't open a public issue). We aim to respond within 24 hours.
See [`docs/SECURITY.md`](docs/SECURITY.md) for the full security model — sub-processor list, control register mapped to SOC 2 Trust Service Criteria, data classification, and the vendor security questionnaire template.
## License
[Apache License 2.0](LICENSE). Use it, fork it, host it, sell on top of it.
**PagerDuty tells you what's correlated. Causa tells you what's causal — and proves it.**
Built with care by people who've stared at red dashboards at 3 AM.