AnirudhS3110/Sentinel-AI
# SentinelAI
AI-powered incident orchestration for production infrastructure. Submit an incident with logs; six specialized agents plan, classify, analyze, validate, remediate, and report — with live progress streamed to a command-center dashboard.
Tech stack: Next.js, NestJS, TypeScript, PostgreSQL, Prisma, Redis, BullMQ, Socket.IO, LangGraph, LangChain, Google Gemini, Tailwind CSS, Framer Motion, Recharts, Firebase Auth, Railway, Vercel, Neon, Upstash
**Live demo:** [sentinel-ai-v1.vercel.app](https://sentinel-ai-v1.vercel.app/)
## Table of contents
- [Overview](#overview)
- [Architecture](#architecture)
- [Monorepo structure](#monorepo-structure)
- [AI agents](#ai-agents)
- [Workflow pipeline](#workflow-pipeline)
- [Tech stack](#tech-stack)
- [Data model](#data-model)
- [Real-time updates](#real-time-updates)
- [Authentication](#authentication)
- [Prerequisites](#prerequisites)
- [Local setup](#local-setup)
- [Environment variables](#environment-variables)
- [Scripts](#scripts)
- [Deployment](#deployment)
- [App routes](#app-routes)
- [License](#license)
## Overview
SentinelAI automates the incident lifecycle:
1. A user creates an incident (title, description, raw logs) via the web UI.
2. The **API** persists the incident and enqueues the first job on **BullMQ** (Redis).
3. The **worker** runs six **Gemini**-powered agents in sequence, updating PostgreSQL and publishing events.
4. The **API** forwards events over **Socket.IO** so the dashboard updates in real time.
Orchestration is **queue-driven** (BullMQ). **LangGraph** is used only for validation routing (retry analysis vs. continue vs. fail).
## Architecture
flowchart TB
subgraph Client["Browser (Next.js)"]
UI[Dashboard / Incidents]
REST[REST client]
SIO[Socket.IO client]
end
subgraph API["apps/api — NestJS"]
HTTP[REST + Auth]
GW[IncidentsGateway]
SUB[Redis subscriber]
PROD[Queue producer]
end
subgraph Worker["apps/worker — NestJS"]
PROC[BullMQ processors ×6]
AGT[Gemini agents ×6]
LG[LangGraph router]
PUB[Redis publisher]
end
subgraph Data["Infrastructure"]
PG[(PostgreSQL)]
RD[(Redis)]
end
UI --> REST --> HTTP
UI --> SIO --> GW
HTTP --> PG
PROD --> RD
RD --> PROC
PROC --> AGT
AGT --> LG
PROC --> PG
PROC --> PUB --> RD --> SUB --> GW
| Service | Port (local) | Role |
|---------|----------------|------|
| **web** | `3000` | Next.js UI; proxies `/backend/*` → API |
| **api** | `3001` | REST API, Firebase auth, BullMQ **producer**, Socket.IO |
| **worker** | — | BullMQ **consumer**, Gemini LLM calls, workflow routing |
All three processes must run locally (or be deployed separately in production) for end-to-end orchestration.
## Monorepo structure
sentinel-ai/
├── apps/
│ ├── web/ # Next.js 16 dashboard & marketing
│ ├── api/ # NestJS API + WebSockets + queue producer
│ └── worker/ # NestJS worker + agents + queue consumers
├── packages/
│ └── shared/ # Enums, Zod schemas, queue names, events, prompts
├── railway.toml # Railway build/start for API
├── DEPLOYMENT.md # Production deploy checklist
└── package.json # npm workspaces root
| Package | npm name | Description |
|---------|----------|-------------|
| `apps/web` | `web` | React 19, TanStack Query, Firebase client, Recharts |
| `apps/api` | `api` | Prisma, BullMQ producer, Firebase Admin, Socket.IO |
| `apps/worker` | `worker` | LangChain + Gemini, LangGraph, BullMQ consumers |
| `packages/shared` | `@sentinel/shared` | Shared types, constants, agent output schemas |
## AI agents
Six agents run in a fixed pipeline. Each agent uses **Google Gemini** (`gemini-2.5-flash-lite` by default) via LangChain, with **Zod-validated JSON** outputs defined in `packages/shared`.
| # | Agent | Queue | Purpose | Key output |
|---|--------|-------|---------|------------|
| 1 | **Planner** | `planner` | Builds an investigation plan from incident context and logs | `steps`, `focusAreas`, `estimatedDurationMinutes` |
| 2 | **Classification** | `classification` | Assigns severity, category, and incident type | `severity`, `category`, `incidentType`, `confidence` |
| 3 | **Analysis** | `analysis` | Root cause analysis with evidence | `rootCause`, `affectedServices`, `evidence`, `confidence` |
| 4 | **Validation** | `validation` | Checks analysis safety and correctness | `valid`, `issues`, `safetyScore`, `requiresRetry` |
| 5 | **Remediation** | `remediation` | Proposes fix steps and rollback plan | `steps`, `rollbackPlan`, `requiresHumanApproval` |
| 6 | **Report** | `report-generation` | Compiles the final post-mortem | `summary`, `rootCause`, `remediation`, `timeline` |
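The project validates agent outputs with Zod schemas in `packages/shared`. As a dependency-free illustration of what such a check enforces, here is a minimal sketch of a type guard for the Validation agent's output. The field names come from the table above; the field types and the `[0, 1]` range for `safetyScore` are assumptions, and `isValidationOutput` is a hypothetical helper, not a function from the repository.

```typescript
// Hypothetical shape of the Validation agent's output.
// Field names follow the agent table; the types are assumed.
interface ValidationOutput {
  valid: boolean;
  issues: string[];
  safetyScore: number; // assumed to lie in [0, 1]
  requiresRetry: boolean;
}

// Runtime check that an unknown value (e.g. parsed LLM JSON) has the shape.
function isValidationOutput(value: unknown): value is ValidationOutput {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const valid = record["valid"];
  const issues = record["issues"];
  const safetyScore = record["safetyScore"];
  const requiresRetry = record["requiresRetry"];
  return (
    typeof valid === "boolean" &&
    Array.isArray(issues) &&
    issues.every((issue) => typeof issue === "string") &&
    typeof safetyScore === "number" &&
    safetyScore >= 0 &&
    safetyScore <= 1 &&
    typeof requiresRetry === "boolean"
  );
}
```

A guard like this lets downstream code reject malformed model output before it reaches the database or the router.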
**Implementation paths**
| Agent | Worker source |
|-------|----------------|
| Planner | `apps/worker/src/agents/planner/planner.agent.ts` |
| Classification | `apps/worker/src/agents/classification/classification.agent.ts` |
| Analysis | `apps/worker/src/agents/analysis/analysis.agent.ts` |
| Validation | `apps/worker/src/agents/validation/validation.agent.ts` |
| Remediation | `apps/worker/src/agents/remediation/remediation.agent.ts` |
| Report | `apps/worker/src/agents/report/report.agent.ts` |
**LLM layer:** `apps/worker/src/common/llm.service.ts` tries three strategies in order: structured output against a JSON schema, then JSON mode, then manual JSON extraction as a fallback.
**Prompts & schemas:** `packages/shared/src/llm/agent-prompts.ts`, `packages/shared/src/schemas/agent-outputs.ts`
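The last step in the LLM layer's fallback order, manual extraction, can be pictured with a small sketch. This is not the code in `llm.service.ts`; `extractJsonObject` and `firstSuccessful` are hypothetical helpers, and the assumption is that the model sometimes wraps its JSON in prose or a fenced block.

```typescript
// Three backticks, built from an escape so this example stays fence-safe.
const FENCE = "\u0060\u0060\u0060";

// Pull a single JSON object out of free-form model text.
function extractJsonObject(text: string): unknown {
  // Prefer the contents of a fenced block when the model emitted one.
  const fenced = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE).exec(text);
  const candidate = fenced ? fenced[1] : text;
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model output");
  }
  // JSON.parse still throws if the slice is not valid JSON.
  return JSON.parse(candidate.slice(start, end + 1));
}

// Run strategies in order and return the first result that does not throw.
function firstSuccessful<T>(attempts: Array<() => T>): T {
  let lastError: unknown = new Error("No attempts supplied");
  for (let i = 0; i < attempts.length; i++) {
    try {
      return attempts[i]();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

In a real worker each attempt would be an asynchronous model call; the synchronous version here only shows the ordering idea.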
## Workflow pipeline
### Stages
| Stage | `IncidentStatus` | Set by |
|-------|------------------|--------|
| Plan | `PLANNING` | Planner |
| Classify | `CLASSIFICATION` | Classification |
| Analyze | `ROOT_CAUSE_ANALYSIS` | Analysis |
| Validate | `VALIDATION` | Validation |
| Remediate | `REMEDIATION` | Remediation |
| Human approval | `HUMAN_APPROVAL` | Remediation (currently auto-advances without waiting for approval) |
| Report | `REPORT_GENERATION` | Report agent |
| Done | `RESOLVED` | Report stored |
| Failed | `FAILED` | Agent error or validation exhausted |
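The stage table above can be mirrored in a small type for UI code such as a progress bar. This is a sketch under assumptions: the status names come from the table, but the real enum lives in `packages/shared`, and `isTerminal` and `progress` are hypothetical helpers.

```typescript
// Status names copied from the stage table.
type IncidentStatus =
  | "PLANNING"
  | "CLASSIFICATION"
  | "ROOT_CAUSE_ANALYSIS"
  | "VALIDATION"
  | "REMEDIATION"
  | "HUMAN_APPROVAL"
  | "REPORT_GENERATION"
  | "RESOLVED"
  | "FAILED";

// Statuses in the order a successful incident passes through them.
const HAPPY_PATH: IncidentStatus[] = [
  "PLANNING",
  "CLASSIFICATION",
  "ROOT_CAUSE_ANALYSIS",
  "VALIDATION",
  "REMEDIATION",
  "HUMAN_APPROVAL",
  "REPORT_GENERATION",
  "RESOLVED",
];

// True once no further jobs will run for the incident.
function isTerminal(status: IncidentStatus): boolean {
  return status === "RESOLVED" || status === "FAILED";
}

// Fraction of the happy path completed, or null for FAILED.
function progress(status: IncidentStatus): number | null {
  const index = HAPPY_PATH.indexOf(status);
  return index === -1 ? null : index / (HAPPY_PATH.length - 1);
}
```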
### Control flow
POST /incidents (API)
→ enqueue planner
→ Planner → Classification → Analysis → Validation
↑_________________________| (retry, max 3)
→ Remediation → Report → RESOLVED
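The chain drawn above can be expressed as a lookup from each queue to its successor. This is a minimal sketch, not the dispatcher in the repository: the queue names come from the agent table, while `NEXT_QUEUE` and `pipelineFrom` are hypothetical names, and the retry and failure branches after validation are left out.

```typescript
// Queue names as listed in the agent table.
type QueueName =
  | "planner"
  | "classification"
  | "analysis"
  | "validation"
  | "remediation"
  | "report-generation";

// Successor of each queue on the happy path; null marks the end.
const NEXT_QUEUE: Record<QueueName, QueueName | null> = {
  planner: "classification",
  classification: "analysis",
  analysis: "validation",
  validation: "remediation", // a router may redirect to a retry or a failure instead
  remediation: "report-generation",
  "report-generation": null,
};

// List the queues a job would visit when started at `start`.
function pipelineFrom(start: QueueName): QueueName[] {
  const order: QueueName[] = [];
  let current: QueueName | null = start;
  while (current !== null) {
    order.push(current);
    current = NEXT_QUEUE[current];
  }
  return order;
}
```

Keeping the order in one table means a processor only needs to know its own queue name to find where to enqueue next.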
- **BullMQ** chains jobs: each processor completes, then enqueues the next queue (`apps/worker/src/common/queue-dispatcher.service.ts`).
- **API** only enqueues the first job (`apps/api/src/queues/queue-producer.service.ts`).
- **LangGraph** (`apps/worker/src/workflows/workflow-router.service.ts`) decides after validation:
  - `retry_analysis`: re-run analysis (up to `MAX_VALIDATION_RETRIES`, default 3)
  - `remediation`: continue pipeline
  - `failed`: mark incident failed
Post-validation rules also force retry when `safetyScore < 0.5` or analysis `confidence < 0.4`.
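The routing decision described above can be sketched as a plain function. This is not the LangGraph graph from `workflow-router.service.ts`; it only restates the documented rules. Treating `valid: false` as a reason to retry, and returning `failed` once retries are exhausted, are assumptions, as are the names `RoutingInput` and `routeAfterValidation`.

```typescript
type Route = "retry_analysis" | "remediation" | "failed";

// Hypothetical input combining validation output, analysis confidence,
// and the retry counter stored on the workflow execution.
interface RoutingInput {
  valid: boolean;
  requiresRetry: boolean;
  safetyScore: number;
  analysisConfidence: number;
  retryCount: number;
  maxRetries?: number; // mirrors MAX_VALIDATION_RETRIES, default 3
}

function routeAfterValidation(input: RoutingInput): Route {
  const maxRetries = input.maxRetries ?? 3;
  // Thresholds 0.5 and 0.4 are the post-validation rules stated above.
  const needsRetry =
    !input.valid ||
    input.requiresRetry ||
    input.safetyScore < 0.5 ||
    input.analysisConfidence < 0.4;
  if (!needsRetry) return "remediation";
  return input.retryCount < maxRetries ? "retry_analysis" : "failed";
}
```

For example, a validation result with `safetyScore` 0.4 on the first pass routes back to analysis, while the same result after three retries marks the incident as failed.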
## Tech stack
| Layer | Technologies |
|-------|----------------|
| **Frontend** | Next.js 16, React 19, Tailwind CSS 4, TanStack Query, Framer Motion, Socket.IO client, Recharts, XYFlow |
| **API** | NestJS 11, Prisma 7, PostgreSQL, BullMQ, Socket.IO, Firebase Admin |
| **Worker** | NestJS 11, BullMQ, LangChain, `@langchain/google-genai`, LangGraph, Zod |
| **Shared** | TypeScript, Zod 4, shared enums and event types |
| **Infra** | PostgreSQL, Redis (queues + pub/sub) |
| **AI** | Google Gemini API |
| **Auth** | Firebase Authentication |
## Data model
Prisma schema: `apps/api/prisma/schema.prisma`
| Model | Purpose |
|-------|---------|
| `User` | Linked to Firebase UID |
| `Incident` | Title, logs, severity, category, status |
| `WorkflowExecution` | Per-incident run, `currentStage`, `retryCount` |
| `AgentExecution` | Per-agent input/output, duration, status |
| `IncidentReport` | Final generated report |
## Real-time updates
1. Worker/API publish `WorkflowEventPayload` to Redis channel `sentinel:workflow:events`.
2. API `RedisSubscriberService` receives events and broadcasts via Socket.IO.
3. Web clients connect to namespace `/incidents`, join room `incident:{id}`, and listen for `workflow.event`.
| Component | Path |
|-----------|------|
| Event types | `packages/shared/src/events/workflow-events.ts` |
| Gateway | `apps/api/src/websocket/incidents.gateway.ts` |
| Client hook | `apps/web/hooks/use-workflow-socket.ts` |
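The fan-out from a Redis message to a Socket.IO room can be sketched without either library. The channel, namespace, room pattern, and event name come from the steps above; the fields of `WorkflowEventPayload` shown here are hypothetical (the real type is in `packages/shared/src/events/workflow-events.ts`), and `toBroadcast` is an illustrative helper rather than the gateway's actual method.

```typescript
// Hypothetical payload fields; only the type name comes from the README.
interface WorkflowEventPayload {
  incidentId: string;
  stage: string;
  status: string;
  timestamp: string;
}

// Everything a gateway needs to know to emit one event.
interface Broadcast {
  namespace: "/incidents";
  room: string;
  event: "workflow.event";
  payload: WorkflowEventPayload;
}

// Turn a raw message from `sentinel:workflow:events` into a broadcast target.
function toBroadcast(rawMessage: string): Broadcast {
  const parsed: unknown = JSON.parse(rawMessage);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Workflow event must be a JSON object");
  }
  const payload = parsed as WorkflowEventPayload;
  if (typeof payload.incidentId !== "string" || payload.incidentId === "") {
    throw new Error("Workflow event is missing incidentId");
  }
  return {
    namespace: "/incidents",
    room: "incident:" + payload.incidentId, // matches the incident:{id} pattern
    event: "workflow.event",
    payload: payload,
  };
}
```

Scoping each event to a per-incident room means a client viewing one incident does not receive traffic for the others.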
## Authentication
- **Production:** Firebase ID token in `Authorization: Bearer` header; API verifies with Firebase Admin and upserts `User` in PostgreSQL.
- **Local dev (no Firebase):** Web sends a dev bearer token (`apps/web/lib/dev-auth.ts`); the API decodes the JWT payload without verifying the signature.
Configure Firebase web keys on the client (`NEXT_PUBLIC_FIREBASE_*`) and Admin credentials on the API (`FIREBASE_*`).
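To make the local-dev behavior concrete, here is a minimal sketch of what decoding a JWT payload without verification involves. It is not the project's code: `base64UrlToText` and `decodeJwtPayloadUnverified` are hypothetical helpers, written without `Buffer` or `atob` so the snippet has no runtime dependencies. Because the signature is never checked, anyone can forge such a token, which is why this path is described only for local development.

```typescript
// URL-safe base64 alphabet (RFC 4648, section 5).
const BASE64URL_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Decode a base64url string into UTF-8 text.
function base64UrlToText(input: string): string {
  const unpadded = input.replace(/=+$/, "");
  let bits = 0;
  let bitCount = 0;
  let percentEncoded = "";
  for (let i = 0; i < unpadded.length; i++) {
    const value = BASE64URL_ALPHABET.indexOf(unpadded.charAt(i));
    if (value < 0) throw new Error("Invalid base64url character");
    bits = (bits << 6) | value;
    bitCount += 6;
    if (bitCount >= 8) {
      bitCount -= 8;
      const byte = (bits >> bitCount) & 0xff;
      bits &= (1 << bitCount) - 1; // keep only the bits not yet consumed
      percentEncoded += "%" + (byte < 16 ? "0" : "") + byte.toString(16);
    }
  }
  // decodeURIComponent interprets the percent-encoded bytes as UTF-8.
  return decodeURIComponent(percentEncoded);
}

// Read the claims of a JWT WITHOUT checking its signature (dev only).
function decodeJwtPayloadUnverified(token: string): Record<string, unknown> {
  const parts = token.split(".");
  if (parts.length !== 3) {
    throw new Error("Expected a JWT with three dot-separated parts");
  }
  const payload: unknown = JSON.parse(base64UrlToText(parts[1]));
  if (typeof payload !== "object" || payload === null) {
    throw new Error("JWT payload must be a JSON object");
  }
  return payload as Record<string, unknown>;
}
```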
## Prerequisites
- **Node.js** 20+ (22 recommended)
- **npm** 10+
- **PostgreSQL** 14+
- **Redis** 6+
- **Google Gemini API key** (`GEMINI_API_KEY`) — required for the worker
- **Firebase project** (optional for local dev with dev-auth bypass)
## Local setup
### 1. Clone and install
git clone https://github.com/AnirudhS3110/Sentinel-AI.git
cd Sentinel-AI # directory created by git clone; adjust if you cloned under another name
npm install
### 2. Configure environment
**API** — copy and edit:
cp apps/api/.env.example apps/api/.env
**Worker** — use the same `.env` or a dedicated file with at least:
DATABASE_URL=postgresql://user:password@localhost:5432/sentinel
REDIS_URL=redis://localhost:6379
GEMINI_API_KEY=your_gemini_api_key
**Web** — create `apps/web/.env.local`:
NEXT_PUBLIC_API_URL=http://localhost:3000/backend
NEXT_PUBLIC_WS_URL=http://localhost:3001
NEXT_PUBLIC_WEB_URL=http://localhost:3000
# Optional: Firebase (omit to use dev-auth)
NEXT_PUBLIC_FIREBASE_API_KEY=
NEXT_PUBLIC_FIREBASE_AUTH_DOMAIN=
NEXT_PUBLIC_FIREBASE_PROJECT_ID=
NEXT_PUBLIC_FIREBASE_APP_ID=
The Next.js dev server proxies `/backend/*` to the API (`apps/web/next.config.ts`, override with `API_PROXY_TARGET`).
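For readers unfamiliar with Next.js rewrites, the proxy described above might look roughly like the sketch below. This is an assumption about the shape of `apps/web/next.config.ts`, not a copy of it: `buildBackendRewrites` is a hypothetical helper, and stripping the `/backend` prefix before forwarding is assumed rather than confirmed.

```typescript
// Shape of one entry returned from the `rewrites()` option in a Next.js config.
interface Rewrite {
  source: string;
  destination: string;
}

// Build the rewrite list for a given API origin.
// The default matches the documented API_PROXY_TARGET default.
function buildBackendRewrites(apiTarget: string = "http://localhost:3001"): Rewrite[] {
  // Drop trailing slashes so the destination never contains "//".
  const base = apiTarget.replace(/\/+$/, "");
  return [
    {
      source: "/backend/:path*",
      destination: base + "/:path*", // assumption: the API does not expect the /backend prefix
    },
  ];
}
```

In a Next.js config this would be wired in by returning `buildBackendRewrites(process.env.API_PROXY_TARGET)` from the async `rewrites()` option, which is why `NEXT_PUBLIC_API_URL` can point at port `3000` during local development.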
### 3. Database
npm run db:generate
npm run db:push # quick start
# or
npm run db:migrate # with migrations
### 4. Build shared package
npm run build --workspace=@sentinel/shared
### 5. Run all services (three terminals)
npm run dev:api # http://localhost:3001
npm run dev:worker # BullMQ consumers + Gemini
npm run dev:web # http://localhost:3000
### 6. Verify
curl http://localhost:3001/health
# → {"ok":true}
Open [http://localhost:3000](http://localhost:3000), sign in, and create an incident from the dashboard.
## Environment variables
### API (`apps/api`)
| Variable | Required | Description |
|----------|----------|-------------|
| `DATABASE_URL` | Yes | PostgreSQL connection string |
| `REDIS_URL` | Yes | Redis for BullMQ and pub/sub |
| `PORT` | No | HTTP port (default `3001`; Railway sets this) |
| `CORS_ORIGIN` | Prod | Comma-separated allowed origins (e.g. Vercel URL) |
| `FIREBASE_PROJECT_ID` | Prod | Firebase Admin |
| `FIREBASE_CLIENT_EMAIL` | Prod | Firebase Admin |
| `FIREBASE_PRIVATE_KEY` | Prod | Firebase Admin (escape `\n` in PEM) |
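Two of the API variables above need light parsing before use. The sketch below shows one common way to do it; both helpers are hypothetical and the API may already handle these cases differently. It assumes `CORS_ORIGIN` may contain stray spaces and that `FIREBASE_PRIVATE_KEY` is stored on a single line with literal `\n` sequences.

```typescript
// Split a comma-separated origin list, ignoring blanks and whitespace.
function parseCorsOrigins(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((origin) => origin.trim())
    .filter((origin) => origin.length > 0);
}

// Convert escaped "\n" sequences in a PEM key back into real newlines.
function normalizePrivateKey(raw: string): string {
  return raw.replace(/\\n/g, "\n");
}
```

The resulting array can be passed as the allowed-origin list when enabling CORS, and the normalized key is the form that PEM parsers expect.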
### Worker (`apps/worker`)
| Variable | Required | Description |
|----------|----------|-------------|
| `DATABASE_URL` | Yes | Same database as API |
| `REDIS_URL` | Yes | Same Redis as API |
| `GEMINI_API_KEY` | Yes | Google AI API key |
| `GEMINI_MODEL` | No | Default `gemini-2.5-flash-lite` |
| `GEMINI_TEMPERATURE` | No | Default `0` |
| `MAX_VALIDATION_RETRIES` | No | Default `3` |
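The worker table above translates naturally into a startup check that fails fast on missing values and fills in the documented defaults. This is a sketch, not the worker's actual configuration code; `WorkerConfig`, `loadWorkerConfig`, and the helper names are hypothetical.

```typescript
interface WorkerConfig {
  databaseUrl: string;
  redisUrl: string;
  geminiApiKey: string;
  geminiModel: string;
  geminiTemperature: number;
  maxValidationRetries: number;
}

type Env = { [key: string]: string | undefined };

// Return a required variable or throw a descriptive error.
function requiredVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error("Missing required environment variable: " + name);
  }
  return value;
}

// Parse an optional numeric variable, falling back to a default.
function numberVar(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (isNaN(parsed)) throw new Error(name + " must be a number");
  return parsed;
}

// Build the worker configuration using the defaults from the table.
function loadWorkerConfig(env: Env): WorkerConfig {
  return {
    databaseUrl: requiredVar(env, "DATABASE_URL"),
    redisUrl: requiredVar(env, "REDIS_URL"),
    geminiApiKey: requiredVar(env, "GEMINI_API_KEY"),
    geminiModel: env["GEMINI_MODEL"] || "gemini-2.5-flash-lite",
    geminiTemperature: numberVar(env, "GEMINI_TEMPERATURE", 0),
    maxValidationRetries: numberVar(env, "MAX_VALIDATION_RETRIES", 3),
  };
}
```

In a Node process the function would be called with `process.env`; passing the environment as an argument keeps it easy to exercise with plain objects.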
### Web (`apps/web`)
| Variable | Required | Description |
|----------|----------|-------------|
| `NEXT_PUBLIC_API_URL` | Yes | REST base (local: `http://localhost:3000/backend`) |
| `NEXT_PUBLIC_WS_URL` | Yes | Socket.IO server (local: `http://localhost:3001`) |
| `NEXT_PUBLIC_WEB_URL` | No | Public site URL |
| `NEXT_PUBLIC_FIREBASE_*` | Prod | Firebase web SDK config |
| `API_PROXY_TARGET` | No | Next rewrite target (default `http://localhost:3001`) |
## Scripts
| Command | Description |
|---------|-------------|
| `npm run dev:web` | Start Next.js dev server |
| `npm run dev:api` | Start API in watch mode |
| `npm run dev:worker` | Start worker in watch mode |
| `npm run build` | Build shared → api → worker |
| `npm run db:generate` | Generate Prisma client |
| `npm run db:push` | Push schema to database |
| `npm run db:migrate` | Run Prisma migrations |
Per-workspace builds:
npm run build --workspace=@sentinel/shared
npm run build --workspace=api
npm run build --workspace=worker
npm run build --workspace=web
## Deployment
Production requires **three** deploy targets: **Vercel (web)**, **Railway (API)**, and **a worker process** (Railway second service, Render, Fly.io, etc.).
### API (Railway)
See [DEPLOYMENT.md](./DEPLOYMENT.md) for troubleshooting (e.g. "Failed to fetch").
- Service root: **monorepo root**
- Build: `npm ci && npm run build --workspace=@sentinel/shared && npm run build --workspace=api`
- Start: `npm run start:prod --workspace=api`
- Health: `GET /health`
- Public URL: use HTTPS **without** `:3001` (e.g. `https://sentinalai-apiservice.up.railway.app`)
### Worker (separate service)
Same `DATABASE_URL`, `REDIS_URL`, and `GEMINI_API_KEY` as the API:
npm ci
npm run build --workspace=@sentinel/shared
npm run build --workspace=worker
npm run start:prod --workspace=worker
### Web (Vercel)
| Variable | Example |
|----------|---------|
| `NEXT_PUBLIC_WEB_URL` | `https://sentinel-ai-v1.vercel.app` |
| `NEXT_PUBLIC_API_URL` | `https://sentinalai-apiservice.up.railway.app` |
| `NEXT_PUBLIC_WS_URL` | `https://sentinalai-apiservice.up.railway.app` |
| `NEXT_PUBLIC_FIREBASE_*` | Your Firebase web app |
Redeploy after changing environment variables.
## App routes
| Route | Description |
|-------|-------------|
| `/` | Landing page |
| `/login` | Firebase sign-in |
| `/dashboard` | Command center, live activity |
| `/incidents/[id]` | Incident detail, agent grid, timeline |
| `/workflows` | Workflow visualization |
| `/agents` | Agent fleet metrics |
| `/reports` | Incident reports |
| `/architecture` | In-app architecture overview |
## License
Private / unlicensed unless otherwise specified in the repository.