vedika0806/multi-doc-rag-chatbot

GitHub: vedika0806/multi-doc-rag-chatbot

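A fully local RAG threat-intelligence assistant that supports multi-document upload and semantic search, giving security analysts instant answers with citations.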

Stars: 1 | Forks: 0

# 🛡️ CyberRAG: Multi-Document Threat Intelligence Assistant

Security analysts typically spend 2-4 hours per investigation manually searching threat reports. This system cuts that to seconds, with citations, and runs entirely locally: no data leaves your machine.

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/cas/93/934d38eee8989cba060266957752edc68c04fa667a928853a152733f925183b2.svg)](https://github.com/vedika0806/multi-doc-rag-chatbot/actions)
[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://python.org)
[![LangChain](https://img.shields.io/badge/LangChain-0.2-green.svg)](https://langchain.com)
[![ChromaDB](https://img.shields.io/badge/ChromaDB-0.5-orange.svg)](https://trychroma.com)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

A **production-grade Retrieval-Augmented Generation (RAG) chatbot** built for cybersecurity threat intelligence. Upload PDFs or text files, or scrape URLs, then ask precise questions grounded in your documents.

**No API keys required. Fully local. Zero hallucination risk on sourced facts.**

## 🎯 What It Does

| Capability | Details |
|---|---|
| **Multi-source ingestion** | PDF, TXT, Markdown, web URLs |
| **Semantic search** | `all-MiniLM-L6-v2` embeddings via sentence-transformers |
| **Vector store** | ChromaDB with cosine similarity, persisted across sessions |
| **Local LLM** | Ollama, auto-detecting the best available model (llama3 → mistral → phi3) |
| **Source attribution** | Every answer cites the exact document, chunk, and cosine score |
| **Live scoring** | Retrieval similarity chart for every query |
| **Containerized** | Docker + docker-compose, including an Ollama sidecar |

## 🏗 Architecture

```
┌─────────────────────────────────────────────────────────┐
│                   INGESTION PIPELINE                    │
│                                                         │
│  PDF/TXT/MD/URL                                         │
│       │                                                 │
│       ▼                                                 │
│  LangChain Loaders ──► RecursiveTextSplitter            │
│  (PyPDFLoader,         chunk_size=800                   │
│   TextLoader,          chunk_overlap=150                │
│   BeautifulSoup)              │                         │
│                               ▼                         │
│                SentenceTransformer Embeddings           │
│                (all-MiniLM-L6-v2, local)                │
│                               │                         │
│                               ▼                         │
│                ChromaDB (cosine similarity)             │
│                Persistent vector store                  │
└────────────────────────┬────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│                     QUERY PIPELINE                      │
│                                                         │
│  User Query                                             │
│       │                                                 │
│       ▼                                                 │
│  Embed query (MiniLM) ──► ChromaDB top-5 retrieval      │
│                               │                         │
│                               ▼                         │
│                Context assembly + source metadata       │
│                               │                         │
│                               ▼                         │
│                Ollama LLM (llama3/mistral)              │
│                System prompt: cybersec analyst          │
│                               │                         │
│                               ▼                         │
│                Grounded answer + citations +            │
│                similarity scores                        │
└─────────────────────────────────────────────────────────┘
```

## 🚀 Quick Start

### Option 1: Run locally (recommended)

```
# 1. Clone
git clone https://github.com/vedika0806/multi-doc-rag-chatbot.git
cd multi-doc-rag-chatbot

# 2. Install Ollama and pull a model
# Mac/Linux: https://ollama.com/download
ollama pull llama3            # ~4.7GB, best quality
# or: ollama pull mistral     # ~4.1GB, faster

# 3. Create a virtual environment
python -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate

# 4. Install dependencies
pip install -r requirements.txt

# 5. Run
streamlit run app/main.py
# → open http://localhost:8501
```

### Option 2: Docker Compose (one command)

```
git clone https://github.com/vedika0806/multi-doc-rag-chatbot.git
cd multi-doc-rag-chatbot

# Start the app + Ollama sidecar
docker-compose up --build

# In another terminal, pull the model into the container
docker exec cyberrag-ollama ollama pull llama3

# → open http://localhost:8501
```

## 📖 Usage

### 1. Upload documents

- Click **"Upload files"** in the sidebar
- Supported: `.pdf`, `.txt`, `.md`
- Or paste a **URL** to scrape (MITRE ATT&CK pages, CVE advisories, NVD entries)

### 2. Ask questions

Example queries for cybersecurity documents:

```
"What are the primary attack vectors described?"
"List all CVEs mentioned and their CVSS scores"
"What MITRE ATT&CK techniques does this threat actor use?"
"What lateral movement techniques are referenced?"
"What defensive mitigations are recommended?"
"Summarize the threat actor's TTPs"
```

### 3. Interpret results

- **Green score (>0.70)**: high relevance; the answer is well grounded
- **Orange score (0.40–0.70)**: moderate relevance; the answer may be incomplete
- **Red score (<0.40)**: low relevance; consider uploading more relevant documents

## 🛠 Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| **LLM** | Ollama (llama3 / mistral) | Local inference, no API cost |
| **Orchestration** | LangChain 0.2 | Document loading, splitting, chain management |
| **Embeddings** | sentence-transformers / all-MiniLM-L6-v2 | Dense vector embeddings |
| **Vector DB** | ChromaDB 0.5 | Persistent cosine-similarity search |
| **Loaders** | PyPDF, TextLoader, BeautifulSoup | Multi-format ingestion |
| **UI** | Streamlit | Interactive chat interface |
| **Containerization** | Docker + docker-compose | Reproducible deployment |
| **CI** | GitHub Actions | Lint + tests on every push |

## 📁 Project Structure

```
multi-doc-rag-chatbot/
├── app/
│   ├── main.py              # Streamlit UI
│   └── rag_engine.py        # Core RAG pipeline
├── tests/
│   ├── conftest.py
│   └── test_rag_engine.py   # Unit tests
├── data/
│   ├── chroma_db/           # Persistent vector store (gitignored)
│   └── sample_docs/         # Sample cybersec documents
├── .github/
│   └── workflows/ci.yml     # GitHub Actions CI
├── .streamlit/
│   └── config.toml          # Streamlit theme
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md
```

## ⚙️ Configuration

Key constants in `app/rag_engine.py`:

```
CHUNK_SIZE = 800                                  # tokens per chunk
CHUNK_OVERLAP = 150                               # overlap between chunks
TOP_K = 5                                         # retrieved chunks per query
EMBEDDING_MODEL = "all-MiniLM-L6-v2"              # local embedding model
PREFERRED_MODELS = ["llama3", "mistral", "phi3"]  # Ollama priority
```

## 🧪 Running Tests

```
pip install pytest pytest-cov
pytest tests/ -v --tb=short

# With coverage
pytest tests/ --cov=app --cov-report=html
```

## Try It Out of the Box

This repository ships with 3 real cybersecurity intelligence documents. After cloning, you can ask:

```
"What MITRE ATT&CK techniques does APT29 use for initial access?"
"List all CVEs with CVSS score above 9.0 and their patch status"
"What lateral movement techniques are most observed in 2024?"
"What defensive mitigations are recommended against credential dumping?"
"Which threat actor exploited CVE-2024-3400 and what did they deploy?"
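```

## 📊 Performance

| Metric | Value |
|---|---|
| Embedding speed | ~500 chunks/sec (CPU) |
| Query latency | 2–8 s (llama3, M1 Mac / modern CPU) |
| Chunk size | 800 tokens, 150 overlap |
| Retrieval | Top-5 cosine similarity |
| Max document size | ~500 pages (PDF) |

## 🔒 Security & Privacy

- **Fully local**: no data leaves your machine
- No API keys required
- Documents are stored only in the local ChromaDB
- Ollama runs inference locally

## 🗺 Roadmap

- [ ] Hybrid search (BM25 + dense embeddings)
- [ ] Multi-collection support (isolate documents per project)
- [ ] Streaming LLM responses
- [ ] Export chat history as a PDF report
- [ ] Structured MITRE ATT&CK extraction
- [ ] FastAPI backend to decouple the app for production

## 👤 Author

**Vedika Sumbli**
MS in Applied Data Intelligence, San José State University
[LinkedIn](https://linkedin.com/in/vedikasumbli) · [GitHub](https://github.com/vedika0806)

## 📄 License

MIT License. See [LICENSE](LICENSE) for details.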
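
## 🧩 Sketch: chunking and top-k retrieval

The README describes overlapping chunks (`CHUNK_SIZE = 800`, `CHUNK_OVERLAP = 150`) and top-5 retrieval by cosine similarity. The snippet below is a minimal, dependency-free sketch of those two ideas, not the code in `app/rag_engine.py`. The project itself delegates the work to LangChain's splitter and ChromaDB. The function names `split_with_overlap`, `cosine`, and `top_k` are hypothetical, and the sketch counts characters rather than tokens, which is an assumption.

```python
import math

# Values taken from the README's configuration section.
CHUNK_SIZE = 800
CHUNK_OVERLAP = 150
TOP_K = 5


def split_with_overlap(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Fixed-width sliding window over characters (assumption: characters, not tokens)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=TOP_K):
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With these defaults, consecutive chunks share 150 characters, so a sentence cut at a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.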

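## 🧩 Sketch: score bands and model fallback

Two rules in the README are simple enough to show as code: the green / orange / red score bands under "Interpret results", and the `llama3 → mistral → phi3` model priority. The sketch below is one possible reading of them, not the repository's implementation. `score_band` and `pick_model` are hypothetical names, and the README does not show how the list of installed models is obtained, so `pick_model` simply takes it as an argument. Matching on a `name:tag` prefix is also an assumption, based on Ollama model names such as `llama3:latest`.

```python
PREFERRED_MODELS = ["llama3", "mistral", "phi3"]  # priority order from the README


def score_band(score):
    """Map a cosine score to the README's colour bands."""
    if score > 0.70:
        return "green"   # high relevance
    if score >= 0.40:
        return "orange"  # moderate relevance (0.40 to 0.70 inclusive)
    return "red"         # low relevance


def pick_model(available, preferred=PREFERRED_MODELS):
    """Return the first installed model matching the priority list, or None.

    `available` is a list of model names as reported by the local runtime,
    e.g. ["mistral:latest", "phi3:mini"] (assumed format).
    """
    for name in preferred:
        for tag in available:
            if tag == name or tag.startswith(name + ":"):
                return tag
    return None
```

For example, `pick_model(["phi3:mini", "mistral:latest"])` returns `"mistral:latest"`, because `mistral` ranks above `phi3` in the priority list even though `phi3:mini` appears first in the input.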
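Tags: AI risk mitigation, ChromaDB, DLL hijacking, Kubernetes, LangChain, RAG, Splunk, artificial intelligence, large language models, threat intelligence, developer tools, user-mode hook bypass, request interception, lightweight, reverse-engineering tools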