spoorthi01012004m/Agentic_incident_response
# AI Incident Response System
Enterprise-grade multi-agent incident investigation and root-cause analysis platform built using LangGraph, LangChain, and OpenAI.
# Overview
The AI Incident Response System is a production-oriented backend platform designed to automate operational incident investigation workflows using AI agents, forensic correlation, anomaly detection, and hallucination-safe verification pipelines.
The system simulates how modern Site Reliability Engineering (SRE) and Incident Response teams investigate infrastructure failures, correlate operational evidence, generate root-cause hypotheses, validate AI-generated reasoning, and produce investigation reports.
This project follows enterprise backend engineering principles including:
* modular architecture
* centralized configuration management
* workflow orchestration
* schema validation
* production logging
* evaluation pipelines
* operational simulation testing
* evidence-grounded reasoning
# Key Features
## Multi-Agent Investigation Workflow
The system contains specialized AI agents responsible for different phases of incident investigation:
| Agent | Responsibility |
| ---------------- | ----------------------------------------------------------- |
| Triage Agent | Incident severity assessment and impacted service detection |
| Forensics Agent | Evidence correlation and propagation-chain analysis |
| Hypothesis Agent | Root-cause hypothesis generation |
| Verifier Agent | Hallucination detection and evidence validation |
## LangGraph Workflow Orchestration
The platform uses LangGraph to orchestrate:
* state transitions
* workflow execution
* distributed investigation stages
* AI agent coordination
* evaluation lifecycle
## Evidence-Grounded Root Cause Analysis
The system generates conclusions only from the following evidence sources:
* operational logs
* alerts
* metrics
* chat history
* runbook data
This minimizes hallucination risk and improves RCA reliability.
## Hallucination-Safe Verification
The verifier pipeline validates:
* unsupported claims
* invalid evidence references
* hallucinated reasoning
* overconfident hypotheses
These checks are intended to make AI-generated operational analysis safer to act on.
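A minimal sketch of how such a verification step could work, assuming each hypothesis cites evidence by ID. The `Hypothesis` schema, the `verify` function, and the `penalty` factor are hypothetical names for illustration, not the project's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """A root-cause claim plus the evidence IDs it cites (hypothetical schema)."""
    claim: str
    confidence: float
    evidence_ids: list[str] = field(default_factory=list)


def verify(hypothesis: Hypothesis, known_evidence: set[str],
           penalty: float = 0.5) -> dict:
    """Check a hypothesis against the evidence that was actually loaded."""
    cited = set(hypothesis.evidence_ids)
    invalid = sorted(cited - known_evidence)   # references that do not exist
    valid = sorted(cited & known_evidence)     # references that do exist

    if not valid:
        # No real evidence behind the claim: reject it outright.
        return {"status": "rejected", "confidence": 0.0, "invalid_refs": invalid}

    confidence = hypothesis.confidence
    if invalid:
        # Some references are invalid: keep the claim but reduce confidence.
        confidence *= penalty
    return {
        "status": "flagged" if invalid else "verified",
        "confidence": round(confidence, 2),
        "invalid_refs": invalid,
    }
```

A hypothesis that cites one real and one nonexistent evidence ID would come back as `flagged` with its confidence reduced, while one that cites only nonexistent IDs would be `rejected`.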
## Timeline Reconstruction
The platform reconstructs incident progression as a chronological timeline from:
* anomalies
* alerts
* logs
* operational chat signals
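Timeline reconstruction of this kind can be sketched as a merge-and-sort over timestamped events. The example below assumes every event is a dict with an ISO-8601 `timestamp` field; the field names and the sample events are made up for illustration.

```python
from datetime import datetime


def build_timeline(*sources: list[dict]) -> list[dict]:
    """Merge events from several sources and sort them chronologically."""
    merged = [event for source in sources for event in source]
    # Parse the ISO-8601 strings so ordering is by time, not by text.
    return sorted(merged, key=lambda e: datetime.fromisoformat(e["timestamp"]))


# Illustrative sample events (not real project data).
alerts = [{"timestamp": "2024-01-01T10:05:00", "source": "alert",
           "message": "High latency on checkout"}]
logs = [{"timestamp": "2024-01-01T10:02:30", "source": "log",
         "message": "DB connection pool exhausted"}]
chat = [{"timestamp": "2024-01-01T10:07:10", "source": "chat",
         "message": "Seeing 500s in prod"}]

for event in build_timeline(alerts, logs, chat):
    print(event["timestamp"], event["source"], event["message"])
```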
## Enterprise Logging & Observability
Production-grade logging includes:
* structured logs
* rotating log files
* centralized logging configuration
* workflow execution tracing
* error stream separation
## Evaluation Framework
The system evaluates investigation quality using:
* evidence coverage scoring
* hallucination risk scoring
* workflow reliability scoring
* RCA quality scoring
* timeline completeness scoring
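Two of these scores could be computed along the following lines. This is a sketch under assumed definitions: coverage is the share of available evidence that was cited, and hallucination risk is the share of cited references that do not exist. The project may define its metrics differently.

```python
def evidence_coverage(cited_ids: set[str], available_ids: set[str]) -> float:
    """Fraction of available evidence items that the investigation cited."""
    if not available_ids:
        return 0.0
    return len(cited_ids & available_ids) / len(available_ids)


def hallucination_risk(cited_ids: set[str], available_ids: set[str]) -> float:
    """Fraction of cited references that are missing from the evidence set."""
    if not cited_ids:
        return 1.0  # assumption: no citations at all counts as maximum risk
    return len(cited_ids - available_ids) / len(cited_ids)
```

For example, with four available evidence items and three citations of which one is invalid, coverage would be 2/4 and risk 1/3.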
# System Architecture
```text
AI_Incident_Response_System/
│
├── agents/            # AI investigation agents
├── config/            # Centralized configuration
├── data/              # Operational datasets
├── docs/              # Architecture documentation
├── gold/              # Golden evaluation datasets
├── graph/             # LangGraph workflow engine
├── models/            # Pydantic schemas
├── outputs/           # Generated reports
├── services/          # Business logic services
├── tests/             # Automated and simulation tests
├── tools/             # Shared operational tools
├── utils/             # Shared utilities
│
├── main.py            # Workflow entrypoint
├── requirements.txt
├── .env
└── README.md
```
# Workflow Lifecycle
The incident investigation workflow follows these stages:
1. Load Operational Data
2. Detect Anomalies
3. Execute Triage Agent
4. Execute Forensics Agent
5. Generate RCA Hypotheses
6. Verify AI Reasoning
7. Reconstruct Timeline
8. Evaluate Investigation Quality
9. Generate Reports
# Technologies Used
| Technology | Purpose |
| ---------- | ------------------------ |
| Python | Core backend development |
| LangGraph | Workflow orchestration |
| LangChain | AI agent integration |
| OpenAI | LLM-powered reasoning |
| Pandas | Metrics analysis |
| Pydantic | Schema validation |
| Pytest | Automated testing |
# Step-by-Step Backend Development Process
## 1. Project Initialization
The backend project structure was designed using enterprise software engineering principles with clear separation of concerns.
Major architectural layers:
* agents
* services
* tools
* workflow graph
* configuration layer
* schema layer
* evaluation layer
## 2. Workflow State Design
A single workflow state, shared by all agents, was implemented using `TypedDict` to enable:
* shared agent memory
* workflow communication
* incident context propagation
* observability tracking
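A `TypedDict`-based state of this kind might look like the sketch below. The field names and the `record_stage` helper are illustrative assumptions, not the project's actual schema.

```python
from typing import TypedDict


class IncidentState(TypedDict, total=False):
    """Shared state passed between workflow stages (field names are illustrative)."""
    incident_id: str
    alerts: list[dict]
    logs: list[dict]
    anomalies: list[dict]
    severity: str
    hypotheses: list[dict]
    trace: list[str]  # names of the stages that have run, for observability


def record_stage(state: IncidentState, stage: str) -> IncidentState:
    """Return a copy of the state with the stage name appended to the trace."""
    return {**state, "trace": [*state.get("trace", []), stage]}
```

Returning a new dict instead of mutating the input keeps each stage's output easy to inspect and compare when debugging a run.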
## 3. Operational Data Loading
Custom file loaders were implemented to safely ingest:
* alerts
* logs
* metrics
* chat records
* runbooks
The ingestion layer includes:
* validation
* exception handling
* type safety
* logging
## 4. Anomaly Detection Engine
The anomaly engine was built to detect:
* latency spikes
* elevated error rates
* operational degradation patterns
The detector uses configurable enterprise thresholds.
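Threshold-based detection can be sketched as below. The `Thresholds` defaults and the sample field names (`latency_ms`, `error_rate`, `service`) are assumptions for illustration, not the project's configured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Configurable limits; the default values are illustrative assumptions."""
    max_latency_ms: float = 500.0
    max_error_rate: float = 0.05


def detect_anomalies(samples: list[dict],
                     limits: Thresholds = Thresholds()) -> list[dict]:
    """Return one anomaly record per threshold breach in the metric samples."""
    anomalies = []
    for sample in samples:
        if sample["latency_ms"] > limits.max_latency_ms:
            anomalies.append({"service": sample["service"],
                              "type": "latency_spike",
                              "value": sample["latency_ms"]})
        if sample["error_rate"] > limits.max_error_rate:
            anomalies.append({"service": sample["service"],
                              "type": "elevated_error_rate",
                              "value": sample["error_rate"]})
    return anomalies
```

Passing a different `Thresholds` instance, for example `Thresholds(max_latency_ms=200.0)`, tightens or loosens detection without changing the detector itself.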
## 5. AI Agent Development
Four specialized AI agents were created:
### Triage Agent
Responsible for:
* severity estimation
* impacted service detection
* incident classification
### Forensics Agent
Responsible for:
* evidence correlation
* propagation-chain analysis
* forensic investigation
### Hypothesis Agent
Responsible for:
* RCA generation
* confidence estimation
* root-cause ranking
### Verifier Agent
Responsible for:
* hallucination detection
* evidence validation
* unsupported claim rejection
## 6. LangGraph Workflow Orchestration
LangGraph was integrated to:
* coordinate agents
* manage execution state
* orchestrate workflow transitions
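The orchestration idea can be illustrated without LangGraph itself. The sketch below is a plain-Python stand-in, not the LangGraph API: each stage is a function from state to state and a small runner threads the state through them in order. In LangGraph, such stages would typically be registered as nodes on a `StateGraph`. The stage logic shown here is placeholder code; a real agent would call an LLM.

```python
from typing import Callable

State = dict
Stage = Callable[[State], State]


def triage(state: State) -> State:
    # Placeholder rule standing in for an LLM-backed severity assessment.
    severity = "high" if state.get("anomalies") else "low"
    return {**state, "severity": severity}


def forensics(state: State) -> State:
    # Placeholder: list affected services in the order anomalies were seen.
    chain = [a["service"] for a in state.get("anomalies", [])]
    return {**state, "propagation_chain": chain}


def run_workflow(state: State, stages: list[tuple[str, Stage]]) -> State:
    """Run stages in order, threading the state through and recording a trace."""
    for name, stage in stages:
        state = stage(state)
        state = {**state, "trace": [*state.get("trace", []), name]}
    return state


final = run_workflow(
    {"anomalies": [{"service": "db"}, {"service": "checkout"}]},
    [("triage", triage), ("forensics", forensics)],
)
print(final["severity"], final["propagation_chain"], final["trace"])
```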
## 7. Evaluation Framework
A scoring framework was implemented to evaluate:
* evidence quality
* hallucination safety
* RCA reliability
* workflow performance
## 8. Reporting Pipeline
The system generates:
* incident reports
* evaluation summaries
* remediation action items
* operational timelines
## 9. Testing Framework
Two testing layers were implemented:
### Automated Tests
Pytest-based engineering validation.
### Operational Simulation Tests
Scenario-based incident simulation test cases.
## 10. Production Hardening
The backend was hardened using:
* centralized configuration
* environment-based secrets
* rotating logs
* workflow observability
* schema validation
* structured exceptions
# Installation Guide
## 1. Clone Repository
```bash
git clone <repository-url>
cd AI_Incident_Response_System
```
## 2. Create Virtual Environment
### Windows
```bash
python -m venv venv
venv\Scripts\activate
```
### Linux / macOS
```bash
python3 -m venv venv
source venv/bin/activate
```
## 3. Install Dependencies
```bash
pip install -r requirements.txt
```
## 4. Configure Environment Variables
Create a `.env` file in the project root.
Example:
```text
OPENAI_API_KEY=your_openai_api_key
DEFAULT_MODEL=gpt-4o-mini
LOG_LEVEL=INFO
```
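A sketch of how these variables might be read into one settings object. It assumes the `.env` values have already been exported into the process environment (for example by a loader such as python-dotenv). `Settings` and `load_settings` are hypothetical names, not the project's actual configuration module.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    default_model: str
    log_level: str


def load_settings(env=None) -> Settings:
    """Build settings from environment variables, failing fast if the key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return Settings(
        openai_api_key=api_key,
        # Defaults mirror the example values shown above.
        default_model=env.get("DEFAULT_MODEL", "gpt-4o-mini"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Accepting an optional mapping makes the function easy to test without touching the real environment.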
# Running the Backend
## Execute Main Workflow
```bash
python main.py
```
# Expected Workflow Execution
The system will:
1. Load operational datasets
2. Detect anomalies
3. Execute investigation agents
4. Validate AI-generated reasoning
5. Generate incident timeline
6. Produce evaluation reports
7. Save outputs
# Generated Outputs
The following files are generated inside `outputs/`:
| File | Purpose |
| ----------------------- | --------------------------- |
| incident_report.md | Final investigation report |
| action_items.json | Remediation recommendations |
| evaluation_summary.json | Evaluation metrics |
# Running Automated Tests
## Execute Full Test Suite
```bash
pytest tests/
```
## Execute Individual Test File
```bash
pytest tests/test_workflow.py
```
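For readers new to pytest, a test file is a module of functions whose names start with `test_` and that use plain `assert` statements. The example below is a hypothetical illustration with a toy function under test; it does not reproduce the contents of `tests/test_workflow.py`.

```python
def classify_severity(error_rate: float) -> str:
    """Toy stand-in for the logic under test (thresholds are illustrative)."""
    if error_rate >= 0.20:
        return "critical"
    if error_rate >= 0.05:
        return "high"
    return "low"


def test_classify_severity_boundaries():
    # pytest collects this function because its name starts with "test_".
    assert classify_severity(0.0) == "low"
    assert classify_severity(0.05) == "high"
    assert classify_severity(0.20) == "critical"
```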
# Operational Simulation Tests
The project includes enterprise-style operational simulations:
```text
tests/test_case_1.txt
...
tests/test_case_12.txt
```
These simulate:
* latency spikes
* database saturation
* cascading failures
* hallucination detection
* forensic investigations
* distributed degradation
# Hallucination Prevention Strategy
The platform minimizes hallucination risk using:
* evidence-grounded prompts
* verifier pipelines
* evidence reference validation
* confidence reduction logic
* unsupported claim detection
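An evidence-grounded prompt, the first item above, might be assembled as in this sketch. The wording of the instructions and the `build_grounded_prompt` name are assumptions, not the project's actual prompts.

```python
def build_grounded_prompt(question: str, evidence: dict[str, str]) -> str:
    """Build a prompt that lists evidence by ID and restricts the model to it."""
    lines = [f"[{eid}] {text}" for eid, text in sorted(evidence.items())]
    return "\n".join([
        "Answer using ONLY the evidence below.",
        "Cite evidence IDs in square brackets for every claim.",
        "If the evidence is insufficient, reply 'insufficient evidence'.",
        "",
        "Evidence:",
        *lines,
        "",
        f"Question: {question}",
    ])
```

Because every evidence item carries an ID, a downstream verifier can check that each cited ID actually exists in the evidence set.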
# Logging & Observability
Logs are automatically generated inside `logs/`:
| File | Purpose |
| --------------- | ----------------------- |
| application.log | Workflow execution logs |
| errors.log | Error tracking |
# Future Improvements
Potential enterprise extensions:
* real-time streaming ingestion
* vector database integration
* distributed workflow execution
* Kubernetes deployment
* SIEM integration
* Slack/MS Teams integrations
* real-time observability dashboards
# Author
AI Incident Response System
Enterprise Multi-Agent Incident Investigation Platform
# License
This project is intended for educational, research, and backend engineering demonstration purposes.