spoorthi01012004m/Agentic_incident_response

# AI Incident Response System

Enterprise-grade multi-agent incident investigation and root-cause analysis platform built using LangGraph, LangChain, and OpenAI.

# Overview

The AI Incident Response System is a production-oriented backend platform designed to automate operational incident investigation workflows using AI agents, forensic correlation, anomaly detection, and hallucination-safe verification pipelines.

The system simulates how modern Site Reliability Engineering (SRE) and Incident Response teams investigate infrastructure failures, correlate operational evidence, generate root-cause hypotheses, validate AI-generated reasoning, and produce investigation reports.

This project follows enterprise backend engineering principles including:

* modular architecture
* centralized configuration management
* workflow orchestration
* schema validation
* production logging
* evaluation pipelines
* operational simulation testing
* evidence-grounded reasoning

# Key Features

## Multi-Agent Investigation Workflow

The system contains specialized AI agents responsible for different phases of incident investigation:

| Agent            | Responsibility                                              |
| ---------------- | ----------------------------------------------------------- |
| Triage Agent     | Incident severity assessment and impacted service detection |
| Forensics Agent  | Evidence correlation and propagation-chain analysis         |
| Hypothesis Agent | Root-cause hypothesis generation                            |
| Verifier Agent   | Hallucination detection and evidence validation             |

## LangGraph Workflow Orchestration

The platform uses LangGraph to orchestrate:

* state transitions
* workflow execution
* distributed investigation stages
* AI agent coordination
* evaluation lifecycle

## Evidence-Grounded Root Cause Analysis

The system only generates conclusions using:

* operational logs
* alerts
* metrics
* chat history
* runbook data

This minimizes hallucination risk and improves RCA reliability.

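As a rough illustration of what evidence grounding can mean in practice, the sketch below accepts a hypothesis only if it cites at least one evidence item and every cited ID resolves to an item drawn from the five sources listed above. The `Evidence` and `Hypothesis` classes, the ID scheme, and the `build_pool` / `is_grounded` helpers are hypothetical names chosen for this example; they are not taken from the project's code.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Assumption: one label per evidence source named in the list above.
EVIDENCE_SOURCES = {"log", "alert", "metric", "chat", "runbook"}


@dataclass(frozen=True)
class Evidence:
    evidence_id: str  # illustrative ID scheme, e.g. "alert-1"
    source: str       # must be one of EVIDENCE_SOURCES
    summary: str


@dataclass
class Hypothesis:
    statement: str
    cited_evidence_ids: List[str] = field(default_factory=list)


def build_pool(items: Iterable[Evidence]) -> Dict[str, Evidence]:
    """Index evidence by ID, rejecting items from unknown sources."""
    pool: Dict[str, Evidence] = {}
    for item in items:
        if item.source not in EVIDENCE_SOURCES:
            raise ValueError(f"unknown evidence source: {item.source!r}")
        pool[item.evidence_id] = item
    return pool


def is_grounded(hypothesis: Hypothesis, pool: Dict[str, Evidence]) -> bool:
    """True only if the hypothesis cites evidence and every citation exists."""
    if not hypothesis.cited_evidence_ids:
        return False
    return all(eid in pool for eid in hypothesis.cited_evidence_ids)


pool = build_pool([
    Evidence("alert-1", "alert", "p99 latency above threshold on checkout"),
    Evidence("log-3", "log", "connection pool exhausted on db-primary"),
])
grounded = Hypothesis("Database saturation slowed checkout", ["alert-1", "log-3"])
ungrounded = Hypothesis("A bad deploy caused the outage", ["metric-99"])

print(is_grounded(grounded, pool))    # True
print(is_grounded(ungrounded, pool))  # False
```

A check of this kind is purely structural: it confirms that citations exist, not that they actually support the claim, so it would complement rather than replace an LLM-based verifier.
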
## Hallucination-Safe Verification

The verifier pipeline checks for:

* unsupported claims
* invalid evidence references
* hallucinated reasoning
* overconfident hypotheses

This makes AI-generated operational analysis safer.

## Timeline Reconstruction

The platform reconstructs incident progression from:

* anomalies
* alerts
* logs
* operational chat signals

and uses them to generate chronological incident timelines.

## Enterprise Logging & Observability

Production-grade logging includes:

* structured logs
* rotating log files
* centralized logging configuration
* workflow execution tracing
* error stream separation

## Evaluation Framework

The system evaluates investigation quality using:

* evidence coverage scoring
* hallucination risk scoring
* workflow reliability scoring
* RCA quality scoring
* timeline completeness scoring

# System Architecture

    AI_Incident_Response_System/
    │
    ├── agents/            # AI investigation agents
    ├── config/            # Centralized configuration
    ├── data/              # Operational datasets
    ├── docs/              # Architecture documentation
    ├── gold/              # Golden evaluation datasets
    ├── graph/             # LangGraph workflow engine
    ├── models/            # Pydantic schemas
    ├── outputs/           # Generated reports
    ├── services/          # Business logic services
    ├── tests/             # Automated and simulation tests
    ├── tools/             # Shared operational tools
    ├── utils/             # Shared utilities
    │
    ├── main.py            # Workflow entrypoint
    ├── requirements.txt
    ├── .env
    └── README.md

# Workflow Lifecycle

The incident investigation workflow follows these stages:

    1. Load Operational Data
            ↓
    2. Detect Anomalies
            ↓
    3. Execute Triage Agent
            ↓
    4. Execute Forensics Agent
            ↓
    5. Generate RCA Hypotheses
            ↓
    6. Verify AI Reasoning
            ↓
    7. Reconstruct Timeline
            ↓
    8. Evaluate Investigation Quality
            ↓
    9. Generate Reports

# Technologies Used

| Technology | Purpose                  |
| ---------- | ------------------------ |
| Python     | Core backend development |
| LangGraph  | Workflow orchestration   |
| LangChain  | AI agent integration     |
| OpenAI     | LLM-powered reasoning    |
| Pandas     | Metrics analysis         |
| Pydantic   | Schema validation        |
| Pytest     | Automated testing        |

# Step-by-Step Backend Development Process

## 1. Project Initialization

The backend project structure was designed using enterprise software engineering principles with clear separation of concerns.

Major architectural layers:

* agents
* services
* tools
* workflow graph
* configuration layer
* schema layer
* evaluation layer

## 2. Workflow State Design

A centralized, shared workflow state was implemented using `TypedDict` to enable:

* shared agent memory
* workflow communication
* incident context propagation
* observability tracking

## 3. Operational Data Loading

Custom file loaders were implemented to safely ingest:

* alerts
* logs
* metrics
* chat records
* runbooks

The ingestion layer includes:

* validation
* exception handling
* type safety
* logging

## 4. Anomaly Detection Engine

The anomaly engine was built to detect:

* latency spikes
* elevated error rates
* operational degradation patterns

The detector uses configurable thresholds.

## 5. AI Agent Development

Four specialized AI agents were created:

### Triage Agent

Responsible for:

* severity estimation
* impacted service detection
* incident classification

### Forensics Agent

Responsible for:

* evidence correlation
* propagation-chain analysis
* forensic investigation

### Hypothesis Agent

Responsible for:

* RCA generation
* confidence estimation
* root-cause ranking

### Verifier Agent

Responsible for:

* hallucination detection
* evidence validation
* unsupported claim rejection

## 6. LangGraph Workflow Orchestration

LangGraph was integrated to:

* coordinate agents
* manage execution state
* orchestrate workflow transitions

## 7. Evaluation Framework

A scoring framework was implemented to evaluate:

* evidence quality
* hallucination safety
* RCA reliability
* workflow performance

## 8. Reporting Pipeline

The system generates:

* incident reports
* evaluation summaries
* remediation action items
* operational timelines

## 9. Testing Framework

Two testing layers were implemented:

### Automated Tests

Pytest-based engineering validation.

### Operational Simulation Tests

Scenario-based incident simulation test cases.

## 10. Production Hardening

The backend was hardened using:

* centralized configuration
* environment-based secrets
* rotating logs
* workflow observability
* schema validation
* structured exceptions

# Installation Guide

## 1. Clone Repository

    git clone <repository-url>
    cd AI_Incident_Response_System

## 2. Create Virtual Environment

### Windows

    python -m venv venv
    venv\Scripts\activate

### Linux / macOS

    python3 -m venv venv
    source venv/bin/activate

## 3. Install Dependencies

    pip install -r requirements.txt

## 4. Configure Environment Variables

Create a `.env` file in the project root.

Example:

    OPENAI_API_KEY=your_openai_api_key
    DEFAULT_MODEL=gpt-4o-mini
    LOG_LEVEL=INFO

# Running the Backend

## Execute Main Workflow

    python main.py

# Expected Workflow Execution

The system will:

1. Load operational datasets
2. Detect anomalies
3. Execute investigation agents
4. Validate AI-generated reasoning
5. Generate incident timeline
6. Produce evaluation reports
7. Save outputs

# Generated Outputs

The following files are generated inside `outputs/`:

| File                    | Purpose                     |
| ----------------------- | --------------------------- |
| incident_report.md      | Final investigation report  |
| action_items.json       | Remediation recommendations |
| evaluation_summary.json | Evaluation metrics          |

# Running Automated Tests

## Execute Full Test Suite

    pytest tests/

## Execute Individual Test File

    pytest tests/test_workflow.py

# Operational Simulation Tests

The project includes enterprise-style operational simulations:

    tests/test_case_1.txt
    ...
    tests/test_case_12.txt

These simulate:

* latency spikes
* database saturation
* cascading failures
* hallucination detection
* forensic investigations
* distributed degradation

# Hallucination Prevention Strategy

The platform minimizes hallucination risk using:

* evidence-grounded prompts
* verifier pipelines
* evidence reference validation
* confidence reduction logic
* unsupported claim detection

# Logging & Observability

Logs are automatically generated inside `logs/`:

| File            | Purpose                 |
| --------------- | ----------------------- |
| application.log | Workflow execution logs |
| errors.log      | Error tracking          |

# Future Improvements

Potential enterprise extensions:

* real-time streaming ingestion
* vector database integration
* distributed workflow execution
* Kubernetes deployment
* SIEM integration
* Slack/MS Teams integrations
* real-time observability dashboards

# Author

AI Incident Response System
Enterprise Multi-Agent Incident Investigation Platform

# License

This project is intended for educational, research, and backend engineering demonstration purposes.