Roastee/cyber-threat-rag-assistant

GitHub: Roastee/cyber-threat-rag-assistant

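A cybersecurity-focused RAG chat assistant that helps analysts efficiently search and analyze threat intelligence documents through PDF extraction, text processing, and vector embedding generation.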


# 🛡️ Cyber Threat RAG Assistant

![Python](https://img.shields.io/badge/Python-3.12-3776AB?logo=python&logoColor=white) ![Streamlit](https://img.shields.io/badge/Streamlit-1.45-FF4B4B?logo=streamlit&logoColor=white) ![Version](https://img.shields.io/badge/Version-0.4.0-00FF88) ![Status](https://img.shields.io/badge/Status-In%20Development-F59E0B) ![License](https://img.shields.io/badge/License-MIT-blue)

## What is this?

A cybersecurity-focused RAG chatbot. Analysts upload threat intelligence PDFs, which are processed automatically and turned into embeddings. In an upcoming sprint, the assistant will also answer questions grounded in those documents and cite its sources.

**Current features:**

- 📄 Upload a PDF → text is automatically extracted, cleaned, chunked, and embedded
- 🧠 384-dimensional vector embeddings generated with `all-MiniLM-L6-v2`
- 📊 Live dashboard (documents, pages, words, chunks, embeddings, size)
- 🎨 Cybersecurity-themed dark UI with custom CSS
- 💬 Chat interface (mock responses; LLM integration is planned for an upcoming sprint)

## Quick start

### Prerequisites

- Python ≥ 3.10
- pip

### Installation

```
# Clone the repo
git clone https://github.com/Roastee/cyber-threat-rag-assistant.git
cd cyber-threat-rag-assistant

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py
```

The app opens at **http://localhost:8501**.

## Project structure

```
cyber-threat-rag-assistant/
│
├── app.py                        # Streamlit entry point (thin orchestrator)
├── .streamlit/config.toml        # Streamlit theme configuration
├── .env.example                  # Environment variable template
├── requirements.txt              # Pinned Python dependencies
├── .gitignore                    # Git ignore rules
├── LICENSE                       # MIT License
├── roadmap.md                    # Sprint development plan
├── tasks.md                      # Task tracker
├── ai_context.md                 # AI assistant context
│
├── src/                          # Source code
│   ├── core/                     # App-wide configuration & state
│   │   ├── config.py             # Frozen dataclass config (app, ingestion, embedding, theme)
│   │   └── state.py              # Centralized session state manager (typed, no raw st.session_state)
│   │
│   ├── models/                   # Data models (shared across all layers)
│   │   ├── document.py           # Document dataclass (extracted PDF content + metadata)
│   │   └── chunk.py              # Chunk dataclass (text segment + source metadata, ChromaDB-ready)
│   │
│   ├── ingestion/                # Document upload & extraction
│   │   ├── pdf_loader.py         # PyPDF text extraction (per-page, with error handling)
│   │   └── service.py            # Ingestion orchestrator: validate → extract → chunk → embed → store
│   │
│   ├── processing/               # Text cleaning & chunking
│   │   ├── cleaner.py            # 8-step text normalization (control chars, unicode, PDF line wraps)
│   │   ├── chunker.py            # Recursive character splitter (paragraph → line → sentence → word)
│   │   └── pipeline.py           # Document → Chunks orchestrator with statistics
│   │
│   ├── embeddings/               # Vector embedding generation
│   │   ├── embedding_service.py  # sentence-transformers wrapper (provider pattern, lazy loading)
│   │   ├── embedding_pipeline.py # Chunks → EmbeddingRecords batch orchestrator
│   │   └── models.py             # EmbeddingRecord + EmbeddingStats data models
│   │
│   ├── ui/                       # Streamlit UI components
│   │   ├── styles.py             # Custom CSS theme (400+ lines, dark SOC-terminal aesthetic)
│   │   ├── header.py             # App header with status indicators
│   │   ├── sidebar.py            # Navigation + quick settings sidebar
│   │   ├── chat.py               # Chat interface with mock responses
│   │   ├── pages.py              # Documents / Settings / About page layouts
│   │   └── components.py         # Reusable widgets (metric cards, badges)
│   │
│   ├── rag/                      # RAG pipeline (not yet implemented)
│   ├── api/                      # FastAPI REST API (not yet implemented)
│   └── utils/                    # Utility functions (not yet implemented)
│
├── tests/                        # Test suite
│   ├── unit/                     # Unit tests (stubs)
│   └── integration/              # Integration tests (stubs)
│
├── data/                         # Data storage
│   ├── raw/feeds/                # Threat intelligence feeds
│   ├── raw/reports/              # Threat reports (PDFs)
│   └── processed/                # Processed chunks
│
├── docs/                         # Documentation
│   ├── architecture.md           # System architecture
│   ├── changelog.md              # Version changelog
│   ├── api/                      # API reference
│   ├── guides/                   # User & developer guides
│   └── assets/                   # Images & diagrams
│
└── scripts/                      # Utility scripts
```

## Architecture

```
User uploads PDF
        │
        ▼
┌──────────────────────────────────────────────────────┐
│ IngestionService (src/ingestion/service.py)          │
│                                                      │
│ Step 1: Validate (type, size, duplicates)            │
│ Step 2: Extract text (PyPDF, per-page)               │
│ Step 3: Store Document in session state              │
│ Step 4: Clean + Chunk (processing pipeline)          │
│ Step 5: Generate embeddings (sentence-transformers)  │
│ Step 6: Store vectors in session state               │
└──────────────────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────┐
│ [Future] ChromaDB → Vector Search → LLM → RAG        │
└──────────────────────────────────────────────────────┘
```

### Layer separation

| Layer | Package | Responsibility |
|---|---|---|
| **UI** | `src/ui/` | Streamlit components, no business logic |
| **Service** | `src/ingestion/` | Orchestration, validation, business rules |
| **Processing** | `src/processing/` | Text cleaning, chunking (no Streamlit dependency) |
| **Embeddings** | `src/embeddings/` | Vector generation (provider pattern, model-agnostic) |
| **Models** | `src/models/` | Typed dataclasses shared across all layers |
| **Core** | `src/core/` | Configuration + session state management |

## Tech stack

| Component | Technology | Status |
|---|---|---|
| Language | Python 3.12 | ✅ In use |
| UI framework | Streamlit 1.45 | ✅ In use |
| PDF extraction | pypdf 5.6 | ✅ In use |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) | ✅ In use |
| State management | Typed `StateManager` (session_state wrapper) | ✅ In use |
| Vector database | ChromaDB | ⏳ Next sprint |
| LLM provider | OpenAI / Ollama | ⏳ Planned |
| LLM orchestration | LangChain | ⏳ Planned |
| API framework | FastAPI | ⏳ Planned |
| Testing | pytest | ⏳ Planned |

## Development progress

| Sprint | Name | Status |
|---|---|---|
| 1 | Project foundation | ✅ Done |
| 2 | Streamlit UI | ✅ Done |
| 3A | PDF ingestion | ✅ Done |
| 3B | Text processing pipeline | ✅ Done |
| 4A | Embedding generation | ✅ Done |
| 4B | ChromaDB integration | 🔄 Next |
| 5 | RAG pipeline + LLM | ⏳ Planned |
| 6 | Intelligence enrichment | ⏳ Planned |
| 7 | Production hardening | ⏳ Planned |

## Key design decisions

- **Clean architecture**: UI → Service → Processing → Embeddings. Each layer can be tested independently.
- **Provider pattern**: only `embedding_service.py` imports `sentence_transformers`. Switching to OpenAI/Cohere means changing a single file.
- **Typed state**: all state access goes through `StateManager` with typed getters/setters. Components never touch raw `st.session_state`.
- **ChromaDB-ready**: the `Chunk.metadata` property returns a dict that maps one-to-one onto `collection.add(metadatas=[...])`.
- **Lazy model loading**: the embedding model (~80 MB) is loaded on first use rather than at import time, so app startup stays near-instant.

## License

MIT. See [LICENSE](LICENSE) for details.
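
To illustrate the recursive character splitter mentioned for `src/processing/chunker.py` (paragraph → line → sentence → word), here is a minimal sketch using only the standard library. The function name `split_recursive`, the `max_chars` parameter, and the separator list are assumptions made for this example; the project's actual chunker may differ in naming, limits, and overlap handling.

```python
from __future__ import annotations

# Separators tried in order, coarsest first:
# paragraph -> line -> sentence -> word. (Assumed for this sketch.)
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_recursive(
    text: str,
    max_chars: int = 500,
    separators: list[str] | None = None,
) -> list[str]:
    """Split text into chunks of at most max_chars characters."""
    separators = SEPARATORS if separators is None else separators
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    buffer = ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while they still fit.
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= max_chars:
            buffer = piece
        else:
            # The piece alone is too long: retry with a finer separator.
            chunks.extend(split_recursive(piece, max_chars, finer))
    if buffer:
        chunks.append(buffer)
    # Note: a separator that falls on a chunk boundary is dropped.
    return [c.strip() for c in chunks if c.strip()]


if __name__ == "__main__":
    report = (
        "Alpha beta gamma.\n\n"
        "Delta epsilon zeta eta theta. Iota kappa lambda mu.\n\n"
        "Nu"
    )
    for chunk in split_recursive(report, max_chars=30):
        print(repr(chunk))
```

Trying coarse separators first keeps paragraphs and sentences intact whenever they fit, and only falls back to word-level or hard cuts for unusually long spans.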

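
The provider pattern and lazy model loading described for `embedding_service.py` could look roughly like the sketch below. The class names `EmbeddingProvider`, `SentenceTransformerProvider`, and `HashingProvider`, and the helper `embed_chunks`, are hypothetical and not taken from the repository; only `SentenceTransformer(...)` and its `encode()` method come from the sentence-transformers library.

```python
from __future__ import annotations

import hashlib
from typing import Protocol, Sequence


class EmbeddingProvider(Protocol):
    """Anything that turns texts into fixed-size vectors (hypothetical interface)."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class SentenceTransformerProvider:
    """Lazy wrapper: the model is loaded on the first embed() call."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2") -> None:
        self.model_name = model_name
        self._model = None  # nothing heavy happens at construction time

    def _load(self):
        if self._model is None:
            # Imported here so that importing this module stays cheap.
            from sentence_transformers import SentenceTransformer

            self._model = SentenceTransformer(self.model_name)
        return self._model

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        # encode() returns a NumPy array by default; convert to plain lists.
        return self._load().encode(list(texts)).tolist()


class HashingProvider:
    """Toy stand-in for tests: deterministic, no model download.

    The vectors carry no semantic meaning; this only demonstrates swapping providers.
    """

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([byte / 255 for byte in digest[: self.dim]])
        return vectors


def embed_chunks(provider: EmbeddingProvider, chunks: Sequence[str]) -> list[list[float]]:
    """Callers depend on the interface only, never on a concrete library."""
    return provider.embed(chunks)


if __name__ == "__main__":
    vectors = embed_chunks(HashingProvider(dim=4), ["phishing campaign", "ransomware"])
    print(len(vectors), len(vectors[0]))
```

Because the rest of the code only sees `embed()`, switching to a hosted embedding API would mean adding one more provider class rather than touching the pipeline.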