chanadinh/threat-hunting-pipeline

GitHub: chanadinh/threat-hunting-pipeline

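An end-to-end agentic threat hunting pipeline that combines multi-source intelligence, LLM-driven Sigma rule generation, and OpenSearch log search to automate the path from threat intelligence to a detection and hunt report.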

Stars: 0 | Forks: 0

# Threat Hunting Pipeline

End-to-end agentic threat hunting pipeline. Pull threat feeds → normalize → map to MITRE ATT&CK → generate Sigma rules via a RAG-equipped LLM → hunt against OpenSearch logs → produce an analyst report with recommendations.

## Architecture

Maps 1:1 to the AWS architecture in the reference diagram. Each pipeline stage is a Python module that also serves as a Lambda handler. In production, orchestration is handled by EventBridge + Step Functions; locally it is a single Python function.

```
feeds (CISA KEV, NVD, OTX, MISP)
        │
        ▼
[1]  feed-ingest    → S3 raw-feeds               (Lambda)
[2]  normalize      → Finding[] (unified)        (Lambda)
[3]  mitre-map      → ATT&CK technique IDs       (Lambda; STIX lookup + LLM fallback)
[4]  sigma-convert  → Sigma YAML rules           (Lambda → Fargate LLM gateway → SageMaker qwen3-coder)
                      ↑ Qdrant RAG: top-5 Sigma exemplars
[5]  es-query       → hunt hits                  (Lambda → OpenSearch)
[6]  scoring        → hunt_rank                  (Lambda)
[7]  report-gen     → markdown / docx → S3       (Lambda → LLM)
[7b] story-agent    → multi-finding narrative    (Lambda → LLM)
```

## Project layout

```
pipeline/
  schemas.py         Finding, SigmaRule, HuntHit, Severity, FeedSource
  config.py          env-driven config (single source of truth)
  stages/            ingest, normalize, mitre_map, sigma_gen, hunt, score, report
  llm/               LLM gateway (anthropic | sagemaker | mock) + prompts
  rag/               Qdrant retriever + in-memory cosine fallback
  storage/           s3, opensearch, docdb helpers
  orchestrator.py    local Step Functions equivalent
lambdas/             AWS handler shims (each calls a pipeline.stages function)
infra/               Terraform: VPC, S3, OpenSearch, DocDB, Redis, Lambda, Step Functions
data/
  sigma_examples/    seed RAG corpus (8 hand-picked rules)
  samples/           offline test payloads per feed
  mitre/             cache slot for STIX bundle
scripts/run_pipeline.py
tests/test_end_to_end.py
docker-compose.yml   local OpenSearch + Qdrant + Mongo
```

## Quick start (local, no AWS)

```
cd /home/linux/threat-hunting-pipeline
pip install -r requirements.txt
python scripts/run_pipeline.py --offline --source cisa_kev,nvd,otx,misp -v
```

Offline mode uses the bundled sample feeds, an in-memory RAG retriever (seeded from `data/sigma_examples/`), and a mock LLM, and writes its output to `/tmp/thp-*`. The MITRE STIX bundle is read from `/home/linux/attack/enterprise-attack.json`.

## Running locally for real (real LLM, real OpenSearch)

```
docker compose up -d              # opensearch, qdrant, mongo
export LLM_BACKEND=anthropic
export ANTHROPIC_API_KEY=...      # uses claude-opus-4-7 by default
python scripts/run_pipeline.py --source cisa_kev -v
```

## Production (AWS architecture diagram)

```
cd infra/
terraform init && terraform apply
```

Provisions: a VPC across 2 AZs · S3 buckets (raw-feeds, reports, sigma-rules, t3000-corpus, dashboard, access-logs) · OpenSearch (3 data + 3 master nodes, multi-AZ, KMS-encrypted) · DocumentDB · ElastiCache (Redis) · 8 Lambdas behind a Step Functions state machine · an EventBridge daily cron (06:00 UTC) · IAM roles scoped per stage.

Not included in this Terraform (intentionally, since each needs a separate decision):

- **SageMaker endpoint for qwen3-coder 30B:** provision it via SageMaker JumpStart or a custom container, then set the `SAGEMAKER_ENDPOINT` environment variable on the Lambda.
- **Fargate LLM gateway:** a thin service that wraps the SageMaker endpoint with request batching, retries, and cost accounting.
- **Fargate Qdrant:** a persistent vector DB for the Sigma exemplar corpus.
- **Fargate syslog receiver:** a UDP/TCP 514 listener (Lambda cannot bind raw UDP).
- **Cognito user pool + ALB + WAF + CloudFront** for the analyst dashboard.

## How the agentic RAG step works (stage 4)

For each `(finding, technique)` pair:

1. Build a query string: `f"{finding.title} {finding.description[:300]} {technique}"`.
2. The retriever returns the top-`k` (default 5) most similar Sigma rules from the corpus. Rules already tagged with the technique receive a `+0.25` score boost.
3. Prompt the LLM with:
   - the threat context (title, CVE, product, observables)
   - the target ATT&CK technique
   - the 5 exemplar rules as full YAML (as a style guide, not a template)
4. The LLM generates one Sigma YAML rule. Code fences are stripped, then the document is parsed and validated against a minimum schema (`title`, `logsource`, and a `detection` with a `condition`). Invalid rules are dropped and an error is recorded on the `Finding` instead of raising an exception, because partial output is more useful than none.
5. Each accepted rule is written to S3 at `{finding_id}/{rule_id}.yml` together with its provenance (RAG exemplar IDs, confidence score, generated_by).

## Swapping components

- **Different LLM:** add a class in `pipeline/llm/client.py` and select it with `LLM_BACKEND=...`.
- **Different vector DB:** add a class in `pipeline/rag/retriever.py` that implements the same `index() / search()` interface.
- **Different log store:** replace `pipeline/storage/opensearch.py` (Splunk, Elastic, ClickHouse; all of them have existing Sigma backends).

## Cost guardrails

- `orchestrator.run(max_findings=5)` caps the number of LLM calls per run.
- `sigma_gen.run` generates at most 3 rules per finding.
- In production, the Step Functions Map state runs the LLM-heavy stages with a concurrency limit. Set it in `infra/stepfunctions.tf` once a performance baseline has been established.

## Status

- [x] All 7 stages implemented and runnable offline
- [x] Lambda handlers + Terraform skeleton
- [ ] Dashboard (Vite static site → S3 → CloudFront)
- [ ] Real Sigma backend in `hunt.py`, once pysigma is pinned to a tested version
- [ ] Cost telemetry per LLM call
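
## Example: stage 4 retrieval scoring (illustrative)

The stage 4 retrieval and validation steps described above can be sketched in a few lines of standard-library Python. This is a minimal sketch and not the repository's code: the `Exemplar` class, the `embed`, `cosine`, `build_query`, `search`, and `is_valid_rule` names, and the bag-of-words embedding are all assumptions made for illustration. Only the query format, the default `k` of 5, the `+0.25` technique boost, and the minimum schema fields come from this README. YAML parsing is left out, so `is_valid_rule` takes an already-parsed mapping.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field

TECHNIQUE_BOOST = 0.25  # README: boost for rules already tagged with the technique


@dataclass
class Exemplar:
    """Hypothetical stand-in for one Sigma rule in the RAG corpus."""
    rule_id: str
    text: str
    techniques: set = field(default_factory=set)


def embed(text):
    """Toy bag-of-words vector (assumption); a real retriever would use dense embeddings."""
    return Counter(re.findall(r"[a-z0-9_.]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_query(title, description, technique):
    """Query string in the format given in step 1."""
    return f"{title} {description[:300]} {technique}"


def search(query, technique, corpus, k=5):
    """Return the top-k (score, exemplar) pairs, boosting technique-tagged rules."""
    q = embed(query)
    scored = []
    for ex in corpus:
        score = cosine(q, embed(ex.text))
        if technique in ex.techniques:
            score += TECHNIQUE_BOOST
        scored.append((score, ex))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def is_valid_rule(rule):
    """Minimum schema check from step 4: title, logsource, detection with a condition."""
    if not isinstance(rule, dict):
        return False
    if not rule.get("title") or not isinstance(rule.get("logsource"), dict):
        return False
    detection = rule.get("detection")
    return isinstance(detection, dict) and "condition" in detection


# Made-up corpus entries, for demonstration only.
corpus = [
    Exemplar("ps-encoded", "powershell encoded command execution", {"T1059.001"}),
    Exemplar("lsass-dump", "lsass memory dump credential access", {"T1003.001"}),
    Exemplar("ps-download", "powershell download cradle web request"),
]
query = build_query("Suspicious PowerShell", "encoded powershell command observed", "T1059.001")
results = search(query, "T1059.001", corpus, k=2)
print([ex.rule_id for _, ex in results])
```

Adding the boost to the similarity score, instead of filtering on the technique tag, keeps untagged but textually similar rules eligible as style exemplars. A rule that fails `is_valid_rule` would be dropped and logged on the finding, as step 4 describes.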

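Tags: RAG, Sigma rules, threat intelligence, security operations, developer tools, scanning framework, search engine queries, vulnerability exploration, target import, request interception, reverse engineering tools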