sahilkapur1993-sys/cyber-risk-rag-assistant

GitHub: sahilkapur1993-sys/cyber-risk-rag-assistant

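This project is a RAG-based cyber risk question-answering assistant. It runs vector retrieval over data from the NVD vulnerability database and combines the results with a large language model, so users can query real CVE threat intelligence in natural language.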


# Cyber Risk Intelligence Assistant

## Table of Contents

1. [Project Overview](#project-overview)
2. [Architecture](#architecture)
3. [Tech Stack](#tech-stack)
4. [Repository Structure](#repository-structure)
5. [Data Pipeline](#data-pipeline)
6. [RAG Pipeline](#rag-pipeline)
7. [Backend API](#backend-api)
8. [Frontend](#frontend)
9. [AWS Infrastructure](#aws-infrastructure)
10. [Local Development Setup](#local-development-setup)
11. [Deployment Guide](#deployment-guide)
12. [Cost Breakdown](#cost-breakdown)

## Project Overview

The Cyber Risk Intelligence Assistant ingests live CVE (Common Vulnerabilities and Exposures) data from the NVD (National Vulnerability Database), indexes it with vector embeddings, and exposes a natural-language query interface. Users can ask questions such as *"What are the most critical vulnerabilities affecting Microsoft products?"* and receive AI-generated answers grounded in real CVE data.

**Key strengths:**

- Live data: this is not a generic PDF chatbot. It queries real, current CVE threat intelligence.
- Domain-specific: built for cyber risk and security professionals.
- Production-grade AWS infrastructure: an automated weekly data pipeline, persistent storage, and a live deployment.

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                    DATA PIPELINE (Weekly)                    │
│                                                              │
│  AWS Glue Workflow                                           │
│  ┌─────────────┐      ┌────────────┐      ┌──────────┐       │
│  │ Job 1       │  ->  │ Job 2      │  ->  │ Job 3    │       │
│  │ Fetch CVEs  │      │ Chunk Data │      │ Embed    │       │
│  │ (NVD API)   │      │            │      │ (OpenAI) │       │
│  └─────────────┘      └────────────┘      └──────────┘       │
│         │                   │                  │             │
│         ▼                   ▼                  ▼             │
│  S3: raw/             S3: processed/      S3: faiss_index/   │
│  cve_feed.json        cve_chunks.json     embeddings.npy     │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│                       APP SERVER (EC2)                       │
│                                                              │
│  User → nginx (port 80) → FastAPI (port 8000)                │
│                              │                               │
│                       rag/pipeline.py                        │
│                              │                               │
│                   ┌──────────┴──────────┐                    │
│                   │                     │                    │
│              OpenAI API          S3 (load index)             │
│           (embed question)        (at startup)               │
└──────────────────────────────────────────────────────────────┘
```

**Data flow:**

1. Every week, AWS Glue fetches all CVEs published in the last 60 days from the NVD, chunks and embeds them, and stores the resulting embeddings and chunks in S3 under `faiss_index/`.
2. On startup, the FastAPI app downloads those files from S3 and builds the FAISS index in memory.
3. The user submits a question through the HTML frontend.
4. The question is embedded via OpenAI and searched against the FAISS index, and the top-k best-matching CVEs are injected into the GPT-4o-mini prompt.
5. The answer, including its CVE sources, is returned to the user.

## Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Data ingestion | Python, Requests, AWS Glue | Fetch CVEs from the NVD API |
| Data processing | Python, Boto3 | Chunk and clean CVE records |
| Embeddings | OpenAI `text-embedding-3-small` | Generate semantic vectors |
| Vector search | FAISS (`IndexFlatL2`) | Nearest-neighbor retrieval |
| LLM | OpenAI `gpt-4o-mini` | Generate natural-language answers |
| Backend | FastAPI, Uvicorn | REST API serving the RAG pipeline |
| Frontend | HTML, CSS, Vanilla JS | Query interface |
| Web server | nginx | Reverse proxy on port 80 |
| Process manager | systemd | Keeps FastAPI running across reboots |
| Storage | AWS S3 | Raw data, chunks, embeddings |
| Orchestration | AWS Glue Workflow | Automated weekly pipeline |
| Runtime | Python 3.11 (EC2), Python 3.9 (Glue) | |
| Package manager | uv | Fast dependency management |

## Repository Structure

```
cyber-risk-rag-assistant/
│
├── .env.example                 # Template for required env vars
├── .gitignore
├── pyproject.toml               # uv project config and dependencies
├── uv.lock
├── pipeline_run.py              # Local runner: executes all 3 pipeline steps in sequence
│
├── ingestion/
│   ├── fetch_cve_feed.py        # Fetches CVEs from NVD API → saves to S3
│   └── chunk_documents.py       # Reads raw JSON from S3, cleans → saves chunks to S3
│
├── embeddings/
│   └── generate_embeddings.py   # Reads chunks from S3, calls OpenAI → uploads embeddings.npy to S3
│
├── rag/
│   └── pipeline.py              # Core RAG logic: load index, embed question, retrieve, generate answer
│
├── backend/
│   └── main.py                  # FastAPI app: exposes /query, /health, / endpoints
│
└── frontend/
    └── index.html               # Single-page HTML frontend
```

## Data Pipeline

### Step 1: Fetch CVEs (`ingestion/fetch_cve_feed.py`)

Calls the NVD CVE 2.0 API, paginates through the results (up to 2000 per request), observes the mandatory 6-second delay between pages, and uploads the raw JSON to S3.

- **Source:** `https://services.nvd.nist.gov/rest/json/cves/2.0`
- **Output:** `s3://cyber-risk-raw-{name}/raw/cve_feed.json`
- **Typical output size:** about 36 MB for 60 days of CVE data

Key parameters:

```
days_back = 60            # How far back to fetch
results_per_page = 2000   # NVD maximum
```

### Step 2: Chunk and Clean (`ingestion/chunk_documents.py`)

Reads the raw CVE JSON from S3, extracts the useful fields from the deeply nested structure, and formats each CVE as a clean text chunk.

**Fields extracted per CVE:**

- CVE ID
- Published date
- Severity and CVSS score (fallback order: v3.1 → v3.0 → v2.0)
- Affected products (from CPE strings)
- English description

**Output format of each chunk:**
```
CVE ID: CVE-2026-32211
Published: 2026-04-03
Severity: CRITICAL (Score: 9.1)
Affected Products: microsoft azure_mcp_server
Description: Missing authentication for a critical function in Azure MCP Server...
```

- **Input:** `s3://.../raw/cve_feed.json`
- **Output:** `s3://.../processed/cve_chunks.json`

### Step 3: Generate Embeddings (`embeddings/generate_embeddings.py`)

Reads the chunks from S3, calls the OpenAI Embeddings API in batches of 100, and uploads the resulting numpy array and the pickled chunks to S3.

- **Model:** `text-embedding-3-small` (1536 dimensions)
- **Batch size:** 100 (with a 0.5-second sleep between batches to respect rate limits)
- **Output:** `s3://.../faiss_index/embeddings.npy` and `chunks.pkl`

## RAG Pipeline

The core logic lives in `rag/pipeline.py` and runs entirely on the EC2 app server.

### `load_index()`

On FastAPI startup, downloads `embeddings.npy` and `chunks.pkl` from S3 into a temporary directory, builds a `faiss.IndexFlatL2` index in memory, and returns it together with the list of chunks.

```
# embeddings: float32 array of shape (n_chunks, 1536) loaded from embeddings.npy
dimension = embeddings.shape[1]  # 1536
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
```

### `embed_question(question)`

Converts the user's question into a 1536-dimensional vector using the same embedding model as the CVEs (`text-embedding-3-small`). This keeps the question and the documents in the same vector space.

### `retrieve(question, index, chunks, top_k=5)`

Searches the FAISS index for the `top_k` vectors nearest to the question vector. Returns the corresponding CVE chunks along with their distances.

### `build_prompt(question, context_chunks)`

Builds a system prompt that injects the retrieved CVE text as context and instructs the model to answer using only that data and to always cite CVE IDs.

### `generate_answer(prompt)`

Calls `gpt-4o-mini` with `temperature=0.2` (fact-focused, low creativity) and returns the answer string.

## Backend API

The FastAPI app lives in `backend/main.py`. Uvicorn serves it on port 8000, with nginx acting as a reverse proxy on port 80.

### Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Health check: returns status and total CVE count |
| `GET` | `/health` | Simple ping |
| `POST` | `/query` | Main endpoint: accepts a question, returns the answer and sources |
| `GET` | `/docs` | Auto-generated Swagger UI |

### POST `/query`: Request

```
{
  "question": "What are the most critical vulnerabilities affecting Microsoft products?",
  "top_k": 5
}
```

### POST `/query`: Response

```
{
  "question": "...",
  "answer": "Based on the CVE data, the most critical Microsoft vulnerabilities include CVE-2026-32211 (CRITICAL, 9.1)...",
  "sources": [
    { "cve_id": "CVE-2026-32211", "severity": "CRITICAL", "score": 9.1 },
    { "cve_id": "CVE-2026-24303", "severity": "CRITICAL", "score": 9.6 }
  ]
}
```

### CORS

All origins are allowed (`allow_origins=["*"]`). In production, restrict this to your own domain.

## Frontend

A single-file HTML/CSS/JS interface at `frontend/index.html`, served by nginx as a static file.

**Features:**

- Professional dark-themed UI
- Text area for entering natural-language questions
- Example-question buttons that auto-fill the input
- Adjustable `top_k` selector (3, 5, or 10 CVEs)
- Loading animation while a query runs
- Answers rendered as formatted text
- Source cards color-coded by severity (CRITICAL = red, HIGH = orange, MEDIUM = yellow, LOW = green)
- Enter key submits the query

The frontend calls the API on the same host/IP. If the server IP changes, update the `const API` constant at the top of the script.

## AWS Infrastructure

### S3 Buckets

| Bucket | Contents |
|---|---|
| `cyber-risk-raw-{name}` | `raw/cve_feed.json`, `processed/cve_chunks.json` |
| `cyber-risk-processed-{name}` | `faiss_index/embeddings.npy`, `faiss_index/chunks.pkl` |

All buckets are private, with public access blocked. Access goes through an IAM role (Glue) or AWS credentials (EC2).

### AWS Glue Workflow

**Workflow name:** `cyber-risk-daily-pipeline`

**Schedule:** Every Monday at 02:00 UTC

Three Python Shell jobs chained by event triggers:

```
[Schedule: Monday 2AM UTC]
          ↓
[cyber-risk-fetch-cves]
          ↓ (SUCCEEDED)
[cyber-risk-chunk-data]
          ↓ (SUCCEEDED)
[cyber-risk-generate-embeddings]
```

**Job settings:**

- Type: Python Shell
- Glue version: Python 3.9
- Max capacity: 0.0625 DPU (the minimum)
- `--additional-python-modules`: `openai` (Job 3 only)

**Job parameters (set in the Glue console, not hardcoded):**

| Job | Parameter | Value |
|---|---|---|
| All | `S3_BUCKET_RAW` | `cyber-risk-raw-{name}` |
| Job 3 | `S3_BUCKET_PROCESSED` | `cyber-risk-processed-{name}` |
| Job 3 | `OPENAI_API_KEY` | `sk-proj-...` |

### EC2 Instance

| Setting | Value |
|---|---|
| AMI | Ubuntu 26.04 LTS |
| Instance type | t2.micro (free-tier eligible) |
| Storage | 20 GB gp3 |
| Region | ap-south-1 (Mumbai) |
| Open ports | 22 (SSH, my IP only), 80 (HTTP, anywhere), 8000 (API, anywhere) |

**Services running on EC2:**

- `uvicorn`: the FastAPI app on port 8000 (managed by systemd)
- `nginx`: reverse proxy on port 80, also serving the static frontend

### systemd Service

The FastAPI app is registered as a systemd service (`cyber-risk-app.service`) so that it:

- Starts automatically when the EC2 instance boots
- Restarts automatically if the process crashes (`Restart=always`)

Common commands:

```
sudo systemctl status cyber-risk-app    # Check status
sudo systemctl restart cyber-risk-app   # Restart after code update
sudo systemctl stop cyber-risk-app      # Stop
journalctl -u cyber-risk-app -f         # View live logs
```

## Local Development Setup

### Prerequisites

- Python 3.11
- [uv](https://github.com/astral-sh/uv) (`pip install uv`)
- Git
- AWS CLI, configured (`aws configure`)
- OpenAI API key

### Steps

```
# 1. Clone
git clone https://github.com/your-username/cyber-risk-rag-assistant.git
cd cyber-risk-rag-assistant

# 2. Create a virtual environment
uv venv
source .venv/bin/activate   # Linux/Mac
.venv\Scripts\activate      # Windows

# 3. Install dependencies
uv sync

# 4. Set environment variables
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

# 5. Run the full pipeline (requires AWS credentials + NVD API access)
python pipeline_run.py

# 6. Start the API
uvicorn backend.main:app --host 0.0.0.0 --port 8000

# 7. Open http://localhost:8000/docs to test
```

## Deployment Guide

### First-Time EC2 Setup

```
# Connect
ssh -i ~/.ssh/cyber-risk-key.pem ubuntu@

# Update and install dependencies
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv nginx git curl unzip

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Install the AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install

# Configure AWS
aws configure
# Enter access key, secret, region: ap-south-1, output: json

# Clone and set up the project
git clone https://github.com/your-username/cyber-risk-rag-assistant.git
cd cyber-risk-rag-assistant
uv venv && source .venv/bin/activate && uv sync

# Set the OpenAI key
echo "OPENAI_API_KEY=sk-proj-..." > .env

# Start FastAPI
uvicorn backend.main:app --host 0.0.0.0 --port 8000
```

### Configure systemd

```
sudo nano /etc/systemd/system/cyber-risk-app.service
```

Paste:

```
[Unit]
Description=Cyber Risk RAG Assistant
After=network.target

[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/cyber-risk-rag-assistant
Environment="PATH=/home/ubuntu/cyber-risk-rag-assistant/.venv/bin"
ExecStart=/home/ubuntu/cyber-risk-rag-assistant/.venv/bin/uvicorn backend.main:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
```

```
sudo systemctl daemon-reload
sudo systemctl enable cyber-risk-app
sudo systemctl start cyber-risk-app
```

### Configure nginx

```
sudo nano /etc/nginx/sites-available/cyber-risk-app
```

Paste:

```
server {
    listen 80;
    server_name ;

    root /var/www/cyber-risk;
    index index.html;

    location / {
        try_files $uri $uri/ /index.html;
    }

    location /query {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /health {
        proxy_pass http://127.0.0.1:8000;
    }
}
```

```
sudo mkdir -p /var/www/cyber-risk
sudo cp frontend/index.html /var/www/cyber-risk/index.html
sudo ln -s /etc/nginx/sites-available/cyber-risk-app /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx
```

### Updating the App After Code Changes

```
cd ~/cyber-risk-rag-assistant
git pull
uv sync
sudo systemctl restart cyber-risk-app
```

## Cost Breakdown

### Monthly Estimate (Weekly Pipeline Runs)

| Service | Usage | Monthly cost |
|---|---|---|
| AWS Glue (3 jobs, weekly) | ~$0.004/run × 4 runs | ~$0.016 |
| OpenAI Embeddings (13K chunks) | ~$0.026/run × 4 runs | ~$0.10 |
| OpenAI queries (GPT-4o-mini) | ~$0.001–0.002 per query | Variable |
| S3 storage (~100 MB) | Negligible | ~$0.002 |
| EC2 t2.micro | Free tier (750 hours/month) | $0 (first year) |
| **Total (pipeline only)** | | **~$0.12/month** |

Once the free tier expires, EC2 costs roughly $8–10 per month.

### Cost Optimization Opportunity

The pipeline currently re-embeds every CVE on each run. An incremental approach that embeds only the CVEs published since the last run would cut OpenAI embedding costs by about 95%, to roughly $0.005/month.
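
The incremental approach described above could start with a filter like the one below. This is a minimal sketch, not code from the repository: `published_date`, `select_new_chunks`, and `last_run` are hypothetical names, the sketch assumes each chunk is a plain string in the `Published: YYYY-MM-DD` format shown in the Data Pipeline section, and the second sample CVE is a made-up placeholder.

```python
import re
from datetime import date
from typing import List, Optional

# Matches the "Published: YYYY-MM-DD" line of the chunk format used by the pipeline.
PUBLISHED_RE = re.compile(r"Published:\s*(\d{4}-\d{2}-\d{2})")


def published_date(chunk_text: str) -> Optional[date]:
    """Return the publication date found in a chunk, or None if it is missing."""
    match = PUBLISHED_RE.search(chunk_text)
    return date.fromisoformat(match.group(1)) if match else None


def select_new_chunks(chunks: List[str], last_run: date) -> List[str]:
    """Keep chunks published on or after last_run (hypothetical helper).

    Chunks without a parseable date are kept, so they are re-embedded
    rather than silently dropped.
    """
    selected = []
    for text in chunks:
        published = published_date(text)
        if published is None or published >= last_run:
            selected.append(text)
    return selected


if __name__ == "__main__":
    sample = [
        "CVE ID: CVE-2026-32211\nPublished: 2026-04-03\nSeverity: CRITICAL (Score: 9.1)",
        # Made-up placeholder entry, for illustration only.
        "CVE ID: CVE-0000-0001\nPublished: 2026-02-10\nSeverity: HIGH (Score: 7.5)",
    ]
    for chunk in select_new_chunks(sample, last_run=date(2026, 3, 30)):
        print(chunk.splitlines()[0])
```

Only the selected chunks would then be sent to the Embeddings API. The new vectors would still have to be merged into the existing `embeddings.npy` and `chunks.pkl`, and CVEs that fall outside the 60-day window would have to be dropped; neither step is shown here.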

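Tags: AWS, DLL hijacking, DPI, FAISS, GPT, RAG, large language models, threat intelligence, developer tools, vulnerability management, cybersecurity, reverse-engineering tools, privacy protection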