Vasco1290/SecRAG

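A domain-specific RAG system for cybersecurity threat intelligence, supporting multi-source security document ingestion, source-attributed question answering, and automatic IOC extraction.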

# SecRAG: Cybersecurity Threat Intelligence RAG

SecRAG ingests MITRE ATT&CK reports, CVE advisories, and threat intelligence PDFs, then serves precise, source-attributed answers through a streaming API, with session memory, IOC extraction, and an evaluation framework.

![SecRAG Next.js SOC dashboard](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/c2768eaca5212628.png)

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/1e8c54adc5212634.svg)](https://github.com/Vasco1290/SecRAG/actions/workflows/ci.yml)
[![Python](https://img.shields.io/badge/Python-3.11+-3776AB?style=flat&logo=python&logoColor=white)](https://python.org)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.111+-009688?style=flat&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
[![Next.js](https://img.shields.io/badge/Next.js-15-000000?style=flat&logo=nextdotjs&logoColor=white)](https://nextjs.org)
[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5+-orange?style=flat)](https://trychroma.com)
[![License: MIT](https://img.shields.io/badge/License-MIT-green?style=flat)](LICENSE)

## Features

| Feature | Details |
|---------|---------|
| **Hybrid retrieval** | Dense (MiniLM) + BM25 sparse, fused with Reciprocal Rank Fusion |
| **Cross-encoder reranking** | `BAAI/bge-reranker-base` rescores the top-20 candidates and keeps the top-5 before the LLM call |
| **Session memory** | Sliding-window conversation history per `session_id` (10-turn deque) |
| **Provider abstraction** | Switch between Groq / OpenAI / Ollama with a single `LLM_PROVIDER` environment variable |
| **Async ingestion** | PDF upload returns a `job_id` immediately; poll `/ingest/status/{job_id}` |
| **Document management** | `GET /documents` · `DELETE /documents/{source}` |
| **Security NER** | Extracts CVE IDs, MITRE T-codes, IPv4 addresses, and SHA-256/MD5 hashes at ingestion time |
| **Streaming SSE** | Token-level streaming with source citations and a latency breakdown |
| **Evaluation framework** | Recall@K, MRR, NDCG@K, Faithfulness, with CI-gated thresholds |
| **CI pipeline** | ruff + pytest + docker build on every push |

## Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                            User Query                            │
│                (session_id → ConversationMemory)                 │
└───────────────────────────┬──────────────────────────────────────┘
                            │
                   ┌────────▼────────┐
                   │  FastAPI /chat  │  X-API-Key auth · SSE stream
                   └────────┬────────┘
                            │
         ┌──────────────────▼──────────────────┐
         │          Hybrid Retrieval           │
         │  ┌─────────────┐  ┌──────────────┐  │
         │  │ Dense (ANN) │  │ Sparse (BM25)│  │
         │  │  ChromaDB   │  │  rank-bm25   │  │
         │  └──────┬──────┘  └──────┬───────┘  │
         │         │   RRF (k=60)   │          │
         │         └───────┬────────┘          │
         │          top-20 candidates          │
         └──────────────────┬──────────────────┘
                            │
         ┌──────────────────▼──────────────────┐
         │       Cross-Encoder Reranking       │
         │       BAAI/bge-reranker-base        │
         │           → top-5 chunks            │
         └──────────────────┬──────────────────┘
                            │
         ┌──────────────────▼──────────────────┐
         │    LLM (Groq / OpenAI / Ollama)     │
         │    context + history + question     │
         │   streaming=True, temperature=0.1   │
         └──────────────────┬──────────────────┘
                            │
         SSE events: chunk · sources · metrics · done
```

## Project Structure

```
SecRAG/
├── core/
│   ├── config.py           # Pydantic Settings (env var loading)
│   ├── parser.py           # PDF parsing + security NER
│   ├── chunker.py          # Sliding-window sentence-boundary chunker
│   ├── embedder.py         # SentenceTransformer wrapper (lazy-loaded)
│   ├── vector_store.py     # ChromaDB + BM25 + RRF hybrid retrieval
│   ├── reranker.py         # Cross-encoder reranking
│   ├── memory.py           # Session-scoped ConversationMemory
│   ├── providers.py        # LLM provider abstraction (Groq/OpenAI/Ollama)
│   └── engine.py           # End-to-end RAG orchestration
├── api/
│   ├── main.py             # App factory, lifespan, CORS
│   ├── dependencies.py     # Engine singleton + API key auth
│   ├── models.py           # Pydantic v2 schemas
│   └── routers/
│       ├── health.py       # GET    /health               [public]
│       ├── ingest.py       # POST   /ingest               [protected, async]
│       │                   # GET    /ingest/status/{id}   [protected]
│       ├── documents.py    # GET    /documents            [protected]
│       │                   # DELETE /documents/{src}      [protected]
│       └── chat.py         # POST   /chat                 [protected, SSE]
├── eval/
│   ├── golden_set.json     # 10 curated cybersecurity QA pairs
│   ├── metrics.py          # Recall@K, MRR, NDCG@K, Faithfulness
│   └── run_eval.py         # CLI eval runner with CI thresholds
├── frontend/               # Next.js 15 SOC dashboard
├── scripts/
│   └── seed_data.py        # Auto-seed MITRE ATT&CK + NVD CVEs
├── tests/
├── .github/workflows/ci.yml
├── CHANGELOG.md
├── CONTRIBUTING.md
└── docker-compose.yml
```

## Quick Start

### 1. Clone and configure

```
git clone https://github.com/Vasco1290/SecRAG.git
cd SecRAG
cp .env.example .env
```

Edit `.env`:

```
LLM_PROVIDER=groq                  # groq | openai | ollama
LLM_MODEL=llama-3.3-70b-versatile
GROQ_API_KEY=your_groq_api_key     # https://console.groq.com (free tier)
SECRAG_API_KEY=your_strong_random_secret
```

### 2. Install and run

```
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Terminal 1: API
uvicorn api.main:app --reload --port 8000

# Terminal 2: Frontend
cd frontend && npm install && npm run dev
```

Open [http://localhost:3000](http://localhost:3000)

### 3. Seed the knowledge base

```
python scripts/seed_data.py
```

Downloads the MITRE ATT&CK Enterprise PDF and the top 50 CRITICAL CVEs from NVD.

## Switching LLM Providers

```
# Groq (default: fastest, free tier)
LLM_PROVIDER=groq
LLM_MODEL=llama-3.3-70b-versatile
GROQ_API_KEY=gsk_...

# OpenAI
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...

# Ollama (local, no API key needed)
LLM_PROVIDER=ollama
LLM_MODEL=llama3.2
OLLAMA_BASE_URL=http://localhost:11434
```

No code changes are needed; just restart the API.

## API Reference

All protected endpoints require the `X-API-Key` request header. Full interactive documentation is available at [http://localhost:8000/docs](http://localhost:8000/docs).

### `POST /ingest`: async PDF ingestion

```
curl -X POST http://localhost:8000/ingest \
  -H "X-API-Key: $KEY" -F "file=@report.pdf"
# → {"job_id": "abc-123", "status": "queued", "filename": "report.pdf"}

curl http://localhost:8000/ingest/status/abc-123 -H "X-API-Key: $KEY"
# → {"status": "done", "chunks_indexed": 87, "metadata": {...}}
```

### `GET /documents`: list ingested sources

```
curl http://localhost:8000/documents -H "X-API-Key: $KEY"
# → {"documents": [{"source": "report.pdf", "chunk_count": 87}], "total_chunks": 87}
```

### `DELETE /documents/{source_name}`: delete a document

```
curl -X DELETE "http://localhost:8000/documents/report.pdf" -H "X-API-Key: $KEY"
# → {"status": "deleted", "source": "report.pdf", "chunks_deleted": 87}
```

### `POST /chat`: streaming RAG query (SSE)

```
curl -X POST http://localhost:8000/chat \
  -H "X-API-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "What are APT29 initial access techniques?", "session_id": "analyst-1"}' \
  --no-buffer
```

```
event: chunk
data: APT29, also known as Cozy Bear...

event: sources
data: [{"doc": "MITRE_ATT&CK.pdf", "page": 42, "excerpt": "...", "iocs": {"cves": [], "mitre_techniques": ["T1566"]}}]

event: metrics
data: {"retrieval_ms": 87, "llm_ms": 1240, "total_ms": 1327, "chunks_used": 5}

event: done
data: [DONE]
```

## Evaluation

```
# With the API running and documents already seeded:
python eval/run_eval.py --api-url http://localhost:8000 --api-key $KEY --k 5

# Write results to JSON:
python eval/run_eval.py --out eval_results.json
```

Sample output:

```
recall_at_k     0.4800  ✓
mrr             0.3500  ✓
ndcg_at_k       0.4200  ✓
faithfulness    0.3600  ✓
```

Metrics are based on keyword overlap (no LLM judge) and are CI-gated: the runner exits with code 1 if any metric falls below its threshold.

## Docker

```
cp .env.example .env   # fill in keys
docker compose up --build
docker compose exec api python scripts/seed_data.py
```

- API + Swagger: [http://localhost:8000/docs](http://localhost:8000/docs)
- Dashboard: [http://localhost:3000](http://localhost:3000)

## Configuration Reference

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_PROVIDER` | `groq` | LLM backend: `groq` \| `openai` \| `ollama` |
| `LLM_MODEL` | `llama-3.3-70b-versatile` | Model name for the selected provider |
| `GROQ_API_KEY` | *(none)* | Required when `LLM_PROVIDER=groq` |
| `OPENAI_API_KEY` | *(none)* | Required when `LLM_PROVIDER=openai` |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Required when `LLM_PROVIDER=ollama` |
| `SECRAG_API_KEY` | **required** | API authentication key (`X-API-Key` header) |
| `EMBEDDING_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | HuggingFace embedding model |
| `RERANKER_MODEL` | `BAAI/bge-reranker-base` | Cross-encoder reranking model |
| `ENABLE_RERANKING` | `true` | Toggles the reranking stage |
| `CHUNK_SIZE` | `512` | Target number of tokens per chunk |
| `CHUNK_OVERLAP` | `64` | Number of tokens shared between adjacent chunks |
| `TOP_K_RETRIEVAL` | `20` | Number of hybrid search candidates |
| `TOP_K_FINAL` | `5` | Number of chunks passed to the LLM after reranking |
| `NVD_API_KEY` | *(optional)* | Raises the NVD rate limit to 50 req/30s |

## License

MIT. See [LICENSE](LICENSE).
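The hybrid retrieval stage in the architecture diagram fuses the dense and BM25 rankings with Reciprocal Rank Fusion (k=60) and keeps the top-20 candidates. The sketch below illustrates that fusion rule in plain Python. It is not the code in `core/vector_store.py`; the function name `rrf_fuse` and the chunk IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=20):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. (Hypothetical helper, for illustration only.)
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


dense_hits = ["c3", "c1", "c7"]   # hypothetical dense (ANN) ranking
sparse_hits = ["c1", "c9", "c3"]  # hypothetical BM25 ranking
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists (`c1`, `c3`) outrank chunks found by only one retriever, which is the reason for fusing ranks rather than raw scores: dense similarity and BM25 scores are not on a comparable scale.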
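The security NER feature extracts CVE IDs, MITRE technique codes, IPv4 addresses, and SHA-256/MD5 hashes at ingestion time. A regex-based extractor along those lines might look like the sketch below. The patterns, the function name `extract_iocs`, and the `ipv4`/`sha256`/`md5` keys are assumptions for illustration (only `cves` and `mitre_techniques` appear in the sample `/chat` output); the actual `core/parser.py` may work differently.

```python
import re

# Assumed patterns, not taken from the SecRAG source.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IOC_PATTERNS = {
    "cves": re.compile(r"\bCVE-\d{4}-\d{4,7}\b"),
    "mitre_techniques": re.compile(r"\bT\d{4}(?:\.\d{3})?\b"),
    "ipv4": re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}


def extract_iocs(text):
    """Return de-duplicated IOC matches per category, in order of appearance."""
    return {
        name: list(dict.fromkeys(pattern.findall(text)))
        for name, pattern in IOC_PATTERNS.items()
    }


sample = (
    "APT29 exploited CVE-2021-44228 via T1566.001 from 203.0.113.7; "
    "dropper MD5 d41d8cd98f00b204e9800998ecf8427e"
)
iocs = extract_iocs(sample)
print(iocs)
```

The word boundaries keep the MD5 pattern from matching the first 32 characters of a SHA-256 hash, so each hash lands in exactly one category.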
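Session memory is described as a sliding-window history per `session_id`, backed by a 10-turn deque. One way such a store could be written is sketched below. The class name `ConversationMemory` comes from the architecture diagram, but the method names `add_turn` and `history` are hypothetical and the real `core/memory.py` may differ.

```python
from collections import defaultdict, deque


class ConversationMemory:
    """Keeps the most recent turns for each session (illustrative sketch)."""

    def __init__(self, max_turns=10):
        # deque(maxlen=...) silently drops the oldest turn once the window is full.
        self._sessions = defaultdict(lambda: deque(maxlen=max_turns))

    def add_turn(self, session_id, question, answer):
        self._sessions[session_id].append((question, answer))

    def history(self, session_id):
        # .get() avoids creating an empty deque for unknown sessions.
        return list(self._sessions.get(session_id, ()))


memory = ConversationMemory(max_turns=10)
for i in range(12):
    memory.add_turn("analyst-1", f"question {i}", f"answer {i}")

recent = memory.history("analyst-1")
print(len(recent), recent[0])  # 10 ('question 2', 'answer 2')
```

Bounding the window keeps the prompt size predictable: only the latest turns are sent to the LLM alongside the retrieved context and the new question.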