mohilamin/ai-soc-telemetry-triage-platform


# AI SOC Telemetry Triage Platform

## Why I Built This

I built this to model the data engineering layer behind AI-assisted SOC triage: detection rules, alert correlation, MITRE-style mapping, analyst queues, timelines, and scorecards. The key challenge I wanted to capture was the part that usually gets hidden in simple demos: how data, signals, decisions, constraints, evidence, and operating risk move through a system that someone else could inspect and run locally.

I intentionally kept this version local and synthetic because the goal is to make the architecture and tradeoffs reviewable without external services, private data, paid APIs, or cloud setup.

## Real Business Problem

SOC teams receive too many disconnected alerts; the hard part is correlating telemetry into explainable incidents with evidence and response actions.

This matters because production teams do not only need outputs. They need evidence, ownership, repeatable validation, failure modes, and a path from local prototype to governed production system.

## What This Project Proves

- security telemetry modeling
- detection-as-code
- incident correlation
- risk scoring
- analyst workflow design
- evidence reporting
- production-style data pipeline design
- synthetic but realistic data modeling
- scorecard generation
- API/dashboard serving
- testable architecture
- honest limitation framing

## Architecture In Plain English

Synthetic telemetry is normalized, matched to detection rules, mapped to attacker behavior, deduplicated, correlated into incidents, prioritized, and exported as analyst-ready outputs.

The important pattern is that inputs are not just transformed into outputs. They are turned into scored, documented artifacts that can be reviewed by operators, analysts, engineers, and business stakeholders.

## Key Design Decisions

- Synthetic data keeps the repo safe to run and share publicly.
- Deterministic local logic makes validation repeatable without paid APIs.
- DuckDB or local artifacts provide warehouse-style inspection without cloud setup.
- FastAPI shows how the system could be served as a service layer.
- Streamlit gives reviewers a fast way to inspect the outputs visually.
- Scorecards make quality, risk, reliability, or readiness measurable.
- Tests and Ruff keep the repo from being only documentation.
- Docker/CI files show the intended deployment shape without claiming production readiness.

See [docs/design-decisions.md](docs/design-decisions.md) for the detailed tradeoff record.

## Validation Evidence

Latest validation run: 2026-06-02.

- Pipeline: passed
- Pytest: passed (86 tests)
- Ruff: passed
- Repository quality docs check: passed
- Detailed command output is recorded in [docs/validation-log.md](docs/validation-log.md).

## Generated Artifacts To Inspect

- security telemetry
- detection alerts
- incident records
- analyst queue
- MITRE-style coverage
- runbooks
- SOC scorecards

## How To Review This Repo

Recruiter / hiring manager:

- Read this README first.
- Review [docs/recruiter-summary.md](docs/recruiter-summary.md) if present.
- Check [docs/validation-log.md](docs/validation-log.md).
- Use [docs/repo-review-guide.md](docs/repo-review-guide.md) for the quickest path.

Senior engineer:

- Review the architecture docs.
- Inspect the `src/` modules.
- Inspect tests and generated scorecards.
- Read [docs/design-decisions.md](docs/design-decisions.md) and [docs/tradeoffs-and-simplifications.md](docs/tradeoffs-and-simplifications.md).

Interview path:

- Run the pipeline command from the validation log.
- Launch the dashboard or API if this repo includes them.
- Explain one design decision and one simplification honestly.

## Known Limitations

- Synthetic data only.
- Local prototype rather than deployed production system.
- Deterministic rules or simulations where a production system may use live models, streaming data, or enterprise integrations.
- No real sensitive data is used.
- No authentication, RBAC, secrets management, or production security boundary unless explicitly stated elsewhere in the repo.
- External systems are simulated instead of connected live.

## Production Roadmap

- ingest SIEM/EDR/cloud logs
- integrate SOAR/case management
- add threat intel enrichment
- stream detections
- add auth and analyst workflow controls

See [docs/production-roadmap.md](docs/production-roadmap.md) for the staged roadmap.

## Executive Summary

This project simulates a modern Security Operations Center data platform.

A basic security dashboard asks: **"What alerts fired?"**

This project asks: **"Which alerts belong together, how severe is the incident, what evidence supports it, what is the likely attacker behavior, what systems are impacted, and what should the analyst do next?"**

Security teams receive telemetry from identity systems, endpoints, cloud logs, SaaS apps, email gateways, DNS, firewalls, and AI applications. The challenge is not simply collecting alerts. The challenge is correlating them into meaningful incidents, reducing noise, prioritizing risk, and giving analysts explainable evidence.

This platform generates synthetic SOC telemetry, injects attack scenarios, applies Sigma-style detection rules, maps detections to MITRE-style tactics, correlates alerts into incidents, generates analyst queues, estimates blast radius, and produces SOC scorecards.

**Positioning:** I build AI-assisted SOC data platforms that turn fragmented security telemetry into correlated incidents, analyst-ready investigations, and measurable detection quality.
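To make the "which alerts belong together" question concrete, here is a minimal sketch of entity and time-window correlation. It is not taken from the repository's `src/` modules: the `Alert` and `Incident` classes, their field names, and the 30-minute window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    alert_id: str
    entity: str          # user or host the alert concerns (assumed field)
    timestamp: datetime
    rule_id: str


@dataclass
class Incident:
    entity: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window=timedelta(minutes=30)):
    """Group alerts per entity; open a new incident when the gap exceeds `window`."""
    incidents = []
    latest = {}  # entity -> most recent Incident opened for that entity
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.entity)
        if current and alert.timestamp - current.alerts[-1].timestamp <= window:
            current.alerts.append(alert)
        else:
            current = Incident(entity=alert.entity, alerts=[alert])
            incidents.append(current)
            latest[alert.entity] = current
    return incidents


# Hypothetical sample alerts, not data from the repository.
t0 = datetime(2026, 1, 1, 9, 0)
sample = [
    Alert("a1", "user-17", t0, "impossible_travel"),
    Alert("a2", "user-17", t0 + timedelta(minutes=10), "mfa_fatigue"),
    Alert("a3", "host-04", t0 + timedelta(minutes=12), "dns_beaconing"),
    Alert("a4", "user-17", t0 + timedelta(hours=3), "password_spray"),
]
incidents = correlate(sample)
for inc in incidents:
    print(inc.entity, [a.alert_id for a in inc.alerts])
```

In this sample the two early `user-17` alerts fall into one incident, while the alert three hours later opens a separate one. A fixed gap-based window is the simplest choice; the actual pipeline may well use different keys or windows.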
## Business Problem

Modern SOC teams face alert overload:

- too many low-quality alerts
- duplicate alerts from multiple systems
- identity alerts not linked to endpoint activity
- cloud access anomalies not linked to data movement
- suspicious email activity not linked to authentication events
- AI prompt-injection attempts not connected to data access
- analysts spending time on false positives
- missing blast-radius context
- weak incident timelines
- inconsistent runbooks

The business risk is missed attacks, delayed response, analyst fatigue, and poor visibility into SOC effectiveness.

## Why This Is Not a Basic Security Dashboard

This repo does not only show alert counts. It builds a deterministic SOC triage pipeline: synthetic telemetry generation, attack injection, detection-as-code, MITRE-style mapping, deduplication, incident correlation, severity/confidence scoring, blast-radius analysis, analyst queues, timelines, runbooks, scorecards, API, dashboard, Docker, and CI.

## Architecture

    flowchart LR
    A["Synthetic Assets"] --> B["Synthetic Telemetry"]
    C["Injected Attack Scenarios"] --> B
    B --> D["Normalization"]
    D --> E["Sigma-Style Detection Engine"]
    E --> F["Security Alerts"]
    F --> G["Deduplication"]
    G --> H["Incident Correlation"]
    H --> I["Severity + Confidence Scoring"]
    I --> J["Blast Radius"]
    I --> K["Analyst Queue"]
    H --> L["Timelines + Evidence"]
    K --> M["Runbook Recommendations"]
    M --> N["SOC Scorecards"]
    N --> O["DuckDB"]
    O --> P["FastAPI"]
    O --> Q["Streamlit"]

## Telemetry Flow

    flowchart TD
    A["Identity Logs"] --> J["Unified Telemetry"]
    B["Endpoint Events"] --> J
    C["Cloud Access"] --> J
    D["Network Flow"] --> J
    E["DNS Logs"] --> J
    F["Email Security"] --> J
    G["SaaS Audit"] --> J
    H["Firewall Logs"] --> J
    I["AI App Security"] --> J

## Detection Flow

    flowchart LR
    A["Rules YAML"] --> B["Rule Loader"]
    B --> C["Sigma-Style Engine"]
    C --> D["Detection Results"]
    D --> E["Alerts with Evidence"]
    E --> F["MITRE-Style Mapping"]

## Correlation Flow

    flowchart LR
    A["Alerts"] --> B["Deduplicate"]
    B --> C["Entity + Time Window Correlation"]
    C --> D["Incident Builder"]
    D --> E["Incident Alert Links"]

## Triage Flow

    flowchart TD
    A["Incident"] --> B["Severity Inputs"]
    B --> C["Confidence Score"]
    C --> D["False Positive Estimate"]
    D --> E["Analyst Queue"]
    E --> F["Next Best Action"]

## Incident Response Flow

    flowchart LR
    A["Incident"] --> B["Timeline"]
    A --> C["Evidence"]
    A --> D["Blast Radius"]
    A --> E["Runbook"]
    E --> F["Response Recommendation"]

## Attack Scenario Catalog

The platform injects 20 controlled scenarios: impossible travel, password spray, brute force against privileged user, MFA fatigue, cloud privilege escalation, suspicious service account access, data exfiltration, OAuth consent abuse, phishing click, endpoint malware, ransomware precursor, DNS beaconing, AI prompt injection, AI sensitive data request, insider data access, cloud key exposure, lateral movement, C2 pattern, mass public file sharing, and dormant account reactivation.

## Detection Rule Examples

Rules live under `rules/` by source domain. Each rule includes rule ID, title, log source, detection logic, severity, tactic, technique, false-positive notes, and recommended response. The local rule engine is Sigma-style, not an official Sigma integration.

## MITRE-Style Mapping

Detections map to MITRE-style tactics and techniques such as Initial Access, Credential Access, Privilege Escalation, Defense Evasion, Discovery, Lateral Movement, Command and Control, Exfiltration, and Impact. These mappings are synthetic and local for portfolio demonstration.

## Analyst Workflow

1. Review `/soc-summary` or the dashboard Executive Overview.
2. Open the analyst queue.
3. Inspect correlated incident evidence and timeline.
4. Review blast-radius report.
5. Use the recommended runbook.
6. Mark known false positives or escalate high-severity incidents.
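The rule fields listed under Detection Rule Examples can be illustrated with a small sketch of how a Sigma-style rule might be evaluated against normalized events. The rule is written as a Python dict standing in for a parsed YAML file; the rule ID, the key names, the event fields, and the equality-only matching are assumptions, not the repository's actual rule schema or engine.

```python
# Hypothetical rule, shaped like the fields the README lists for each rule.
RULE = {
    "rule_id": "ID-001",
    "title": "Brute force against privileged user",
    "log_source": "identity",
    "detection": {"event_type": "login_failure", "is_privileged": True},
    "severity": "high",
    "tactic": "Credential Access",
    "technique": "Brute Force",
    "false_positive_notes": "Password rotation or a misconfigured service login.",
    "recommended_response": "Lock the account and review recent sign-ins.",
}


def matches(rule, event):
    """True when the event comes from the rule's log source and every
    field in the rule's detection block equals the event's value."""
    if event.get("log_source") != rule["log_source"]:
        return False
    return all(event.get(key) == value for key, value in rule["detection"].items())


def evaluate(rules, events):
    """Return one alert per (event, rule) match, carrying the event as evidence."""
    alerts = []
    for event in events:
        for rule in rules:
            if matches(rule, event):
                alerts.append({
                    "rule_id": rule["rule_id"],
                    "severity": rule["severity"],
                    "tactic": rule["tactic"],
                    "technique": rule["technique"],
                    "evidence": event,
                })
    return alerts


# Hypothetical normalized events.
events = [
    {"log_source": "identity", "event_type": "login_failure",
     "is_privileged": True, "user": "admin-02"},
    {"log_source": "identity", "event_type": "login_success",
     "is_privileged": True, "user": "admin-02"},
    {"log_source": "dns", "event_type": "login_failure",
     "is_privileged": True, "user": "admin-02"},
]
alerts = evaluate([RULE], events)
print(alerts)
```

Only the first event produces an alert. Keeping the matched event inside the alert is one simple way to get the "alerts with evidence" shown in the Detection Flow; a real brute-force rule would also need a count threshold over a time window, which this sketch leaves out.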
## Scorecards

- `detection_quality_report.json/csv`
- `incident_triage_report.json/csv`
- `mitre_coverage_report.json/csv`
- `soc_performance_report.json/csv`
- `false_positive_report.json/csv`
- `response_readiness_report.json/csv`
- `attack_scenario_detection_report.json/csv`

## Quickstart

    python -m venv .venv
    source .venv/bin/activate
    python -m pip install --upgrade pip
    python -m pip install -r requirements.txt
    python -m src.data_generation.generate_assets
    python -m src.data_generation.generate_telemetry
    python -m src.data_generation.inject_attack_scenarios
    python -m src.data_generation.generate_ground_truth
    python -m src.pipeline.run_all
    python -m pytest
    python -m ruff check .

## API

    uvicorn src.api.main:app --reload

Endpoints include `/health`, `/soc-summary`, `/telemetry-sources`, `/alerts`, `/alerts/{alert_id}`, `/incidents`, `/incidents/{incident_id}`, `/analyst-queue`, `/blast-radius/{incident_id}`, `/mitre-coverage`, `/scorecards`, `/runbooks`, `/evidence/{incident_id}`, `/simulate-attack-scenario`, `/triage-incident`, and `/mark-false-positive`.

## Dashboard

    streamlit run src/dashboard/app.py

Dashboard sections: Executive Overview, Telemetry Sources, Detection Rules, Alerts, Correlated Incidents, Analyst Queue, MITRE-Style Coverage, Incident Timeline, Blast Radius, False Positive Review, AI App Security Events, Runbook Recommendations, and SOC Scorecards.
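The README does not say how the detection quality scorecard is computed. Since the Quickstart generates ground truth before running the pipeline, one plausible approach is precision and recall of alerted events against the injected malicious events. The sketch below assumes exactly that; the function name, the event-ID inputs, and the report keys are hypothetical and may not match `detection_quality_report.json`.

```python
def detection_quality(alerted_event_ids, malicious_event_ids):
    """Compare alerted events with ground-truth malicious events."""
    alerted = set(alerted_event_ids)
    malicious = set(malicious_event_ids)
    true_positives = len(alerted & malicious)
    false_positives = len(alerted - malicious)
    false_negatives = len(malicious - alerted)
    # Guard against empty inputs so the report never divides by zero.
    precision = true_positives / len(alerted) if alerted else 0.0
    recall = true_positives / len(malicious) if malicious else 0.0
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": round(precision, 3),
        "recall": round(recall, 3),
    }


# Hypothetical event IDs: four events alerted, three events truly malicious.
report = detection_quality(
    alerted_event_ids=["e1", "e2", "e3", "e5"],
    malicious_event_ids=["e1", "e2", "e4"],
)
print(report)
```

Here two of the four alerts are correct (precision 0.5) and two of the three injected events are caught (recall 0.667). Because the data is synthetic and the injected scenarios are known, this kind of measurement stays deterministic from run to run.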
## Validation

V0.1 target:

- asset generation passes
- telemetry generation passes
- attack scenario injection passes
- ground truth generation passes
- full pipeline passes
- at least 70 tests pass
- ruff passes
- API and dashboard launch locally

## Known Limitations

- synthetic telemetry only
- local DuckDB instead of SIEM/data lake
- Sigma-style local rules, not official Sigma rule repository integration
- MITRE-style mapping, not official coverage validation
- deterministic rules instead of ML-based detection
- no real threat intel feeds
- no cloud deployment
- no authentication
- no real EDR/SIEM/SOAR integration

## Future Enhancements

- Sigma rule import/export
- MITRE ATT&CK Navigator layer export
- Splunk/Elastic/Sentinel connector simulation
- SOAR playbook execution
- OpenTelemetry log ingestion
- Kafka streaming telemetry
- cloud log adapters
- identity provider integration
- threat intel enrichment
- ML anomaly detection
- case management workflow
- Slack/Jira/PagerDuty escalation
- cloud deployment
- role-based access control

## Project Status

V0.1: Working baseline.