RicheByte/cyberSamantha

GitHub: RicheByte/cyberSamantha

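An autonomous cognitive system for cybersecurity practitioners that accelerates threat analysis and security investigations through multi-agent collaboration, persistent memory, and a vector knowledge base.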


# CyberSamantha: Autonomous Cybersecurity Cognitive Ecosystem

```
 ▄▄▄▄  ▄▄▄  ▄▄   ▄▄  ▄▄▄  ▄▄  ▄▄ ▄▄▄▄▄▄ ▄▄ ▄▄  ▄▄▄
███▄▄ ██▀██ ██▀▄▀██ ██▀██ ███▄██   ██   ██▄██ ██▀██
▄▄██▀ ██▀██ ██   ██ ██▀██ ██ ▀██   ██   ██ ██ ██▀██
```

CyberSamantha is a fast, modular, self-improving cybersecurity **cognitive ecosystem**. Rather than acting as a passive chatbot, it simulates intelligence through an interconnected network of multi-agent delegation, a probabilistic world model, retrieval-augmented generation (RAG), and Darwinian capability evolution.

## Table of Contents

1. [System Architecture Overview](#1-system-architecture-overview)
2. [Multi-Agent Swarm Architecture](#2-multi-agent-swarm-architecture)
3. [The 5 Pillars of Cognition](#3-the-5-pillars-of-cognition)
4. [Tri-Layer Memory System](#4-tri-layer-memory-system)
5. [Continuous Ingestion & Vector RAG](#5-continuous-ingestion--vector-rag)
6. [Skill Genome System](#6-skill-genome-system)
7. [Installation & Setup](#7-installation--setup)
8. [CLI Commands & Usage](#8-cli-commands--usage)
9. [Data Sources & Update Pipeline](#9-data-sources--update-pipeline)
10. [Storage Cleanup & Optimization](#10-storage-cleanup--optimization)
11. [Directory Map](#11-directory-map)
12. [Environment Variables & Configuration](#12-environment-variables--configuration)
13. [How a Query Flows Through the System](#13-how-a-query-flows-through-the-system)
14. [Extending the System](#14-extending-the-system)

## 1. System Architecture Overview

CyberSamantha consists of **five major subsystems** that work together:

```
┌─────────────────────────────────────────────────────┐
│               CLI (rich terminal UI)                │
│  /status  /daemon  /wiki  /provider  /skills        │
├─────────────────────────────────────────────────────┤
│  AgentRouter (Manager)                              │
│  ├── LLMProvider (Gemini ↔ Ollama hot-swap)         │
│  ├── HackerAgent → TerminalTool (nmap, curl)        │
│  ├── ResearcherAgent → WebSearchTool (Exa API)      │
│  ├── CoderAgent → FileReaderTool + Terminal         │
│  ├── RAG Pipeline → VectorStore (ChromaDB)          │
│  └── Wiki Pipeline → RealityGraph (NetworkX)        │
├─────────────────────────────────────────────────────┤
│  Memory Layer                                       │
│  ├── EpisodicMemory (sliding window + summarization)│
│  ├── SemanticMemory (persistent user facts, JSON)   │
│  └── MetaMemory (strategy outcome tracking)         │
├─────────────────────────────────────────────────────┤
│  Background Daemons                                 │
│  ├── IngestionDaemon (watchdog → DropBox folder)    │
│  └── CuriosityDaemon (contradiction detection)      │
├─────────────────────────────────────────────────────┤
│  Skill System                                       │
│  ├── GenomeEngine (usage tracking, pruning)         │
│  ├── SkillLoader (MD playbooks + Python tools)      │
│  └── Skills: crypto, osint, pentest, code_review…   │
└─────────────────────────────────────────────────────┘
```

## 2. Multi-Agent Swarm Architecture

The system distributes cognitive load across **specialized agents** instead of relying on a single monolithic prompt.

### AgentRouter (Manager)

- **File:** `src/core/agent.py`
- The central orchestrator. It receives the user's query, classifies the intent, and decides whether to answer directly, delegate to a specialist sub-agent, or trigger a multi-agent debate.
- Implements a **ReAct-style autonomous loop** (Thought → Action → Observation → Final Answer) with at most 5 iterations.
- Has direct access to three built-in tools: `WebSearchTool`, `TerminalTool`, `FileReaderTool`.

### HackerAgent

- **File:** `src/agents/hacker.py`
- Specializes in **offensive security**: reconnaissance, vulnerability assessment, network scanning.
- Tools: `run_network_scan(target_ip, scan_type)` wraps `nmap` with safe argument patterns; `check_headers(url)` wraps `curl -s -I` for HTTP header inspection.
- Loads the `pentest_playbook` markdown skill.

### ResearcherAgent

- **File:** `src/agents/researcher.py`
- Specializes in **threat intelligence**, CVE lookup, and web research synthesis.
- Tools: `search_threat_intel(query)` searches the web through the Exa API; `search_cve(cve_id)` performs a targeted CVE lookup.
- Loads the `threat_intel` markdown skill.

### CoderAgent

- **File:** `src/agents/coder.py`
- Specializes in **code analysis, security review, and local file inspection**.
- Tools: `read_source_file(file_path)`, `analyze_directory(directory_path)`, and `run_analysis_command(command)`, which runs safe terminal commands for code analysis.
- Loads the `code_review` and `log_analysis` markdown skills.

### DebateOrchestrator

- **File:** `src/agents/debate_orchestrator.py`
- Implements **cross-agent thought interference**. Over multiple rounds it pits a primary agent (the thesis defender) against an adversarial agent (the critic), then synthesizes a final conclusion.
- Used when the ThoughtRouter determines that a task needs adversarial reasoning.

### Hybrid LLM Provider

- **File:** `src/core/llm_provider.py`
- Provides transparent hot-swapping between **Google Gemini** (cloud) and **Ollama** (local).
- Priority order:
  1. If `GEMINI_API_KEY` is set → Gemini (`gemini-2.0-flash`)
  2. If Ollama is reachable at `http://localhost:11434` → Ollama (configurable model)
  3. Fallback → no generation capability (RAG still works for source retrieval)
- Supports both `generate()` (single prompt) and `chat()` (message history) modes.
- Environment variables: `LLM_PROVIDER` (`auto`/`gemini`/`ollama`), `OLLAMA_MODEL`, `OLLAMA_BASE_URL`.

## 3. The 5 Pillars of Cognition

### Pillar 1: Skill Genome Engine (Darwinian Capability Evolution)

- **File:** `src/skills/genome_engine.py`
- Tracks the execution of every skill: **usage count**, **success rate** (rolling average), and **average execution time**.
- The registry is stored in `skills/genome_registry.json`.
- Supports `mutate_skill()` (generate a variant) and `recombine_skills()` (merge two skills); LLM-based mutation is planned.
- `prune_weak_skills(threshold)` archives skills whose success rate falls below the threshold once they have been used often enough.

### Pillar 2: Reality Graph Layer (Probabilistic World State)

- **File:** `src/knowledge/reality_graph.py`
- Built on **NetworkX** as a directed graph persisted to `data/reality_graph.json`.
- Unlike a standard knowledge graph, every entity and relationship carries a **confidence score** (0.0–1.0) and timestamps (`discovered_at`, `last_verified`).
- `decay_confidence(drift_factor)` simulates **temporal drift**: the confidence of unverified relationships decreases over time.
- `detect_contradictions()` scans for edges with confidence < 0.5 and flags them for the Curiosity Engine.

### Pillar 3: Cross-Agent Thought Interference (Debate Orchestrator)

- See `DebateOrchestrator` above. It is triggered by the `ThoughtRouter` when Meta-Memory indicates that debate produces better results for the given task type.

### Pillar 4: Meta-Memory (Execution Strategy Tracking)

- **File:** `src/memory/meta_memory.py`
- Persisted to `data/meta_memory.json`.
- Records every execution of a reasoning strategy: `task_type`, `strategy` (for example `debate` or `linear`), `agents_used`, `outcome_score`, and `context`.
- `query_best_strategy(task_type)` returns the highest-scoring strategy for the given task, enabling data-driven routing decisions.

### Pillar 5: Curiosity Engine (Autonomous Research Drive)

- **File:** `src/core/curiosity_daemon.py`
- Runs in a background thread at a configurable interval (default: 60 seconds).
- `evaluate_internal_pressure()` checks the RealityGraph for contradictions. When it finds one, it autonomously dispatches the AgentRouter to research the conflict through web search or document analysis.
- If no contradictions exist, it triggers `decay_confidence()` to simulate temporal drift across the whole graph.

## 4. Tri-Layer Memory System

### Episodic Memory (Short-Term)

- **File:** `src/memory/episodic.py`
- Stores the current conversation session as a list of `{role, content}` messages.
- **Sliding window:** when the history exceeds `max_history * 2` entries, `_summarize_and_decay()` is triggered.
- Summarization uses the LLM to compress the oldest half of the conversation into a persistent summary block, which prevents context overflow while preserving key information.
- If no LLM is available, it falls back to simple truncation.

### Semantic Memory (Long-Term Facts)

- **File:** `src/memory/semantic.py`
- Persisted to `data/memory.json` as a key-value dictionary.
- Users teach facts via `remember that `; these facts are injected into every agent's system prompt.
- Enables permanent adaptation to the user's environment and preferences.

### Meta-Memory (Strategy Tracking)

- See Pillar 4 above.

## 5. Continuous Ingestion & Vector RAG

### Vector Store

- **File:** `src/knowledge/vector_store.py`
- A local ChromaDB instance stored in `chroma_db/`.
- Uses **SentenceTransformers** (`all-MiniLM-L6-v2`) for local embeddings; documents are never sent to an external API for embedding.
- **Lazy loading:** the transformer model is loaded only when it is actually needed, which keeps CLI startup under 1 second.
- `index_documents()` recursively scans the `data/` directory for `.txt`, `.md`, `.json`, `.yaml`, `.pdf`, `.docx`, and `.pptx` files.
- **Smart re-indexing:** uses file hashes to skip unchanged files. Supports `--force` for a full re-index.
- `search(query, n_results)` returns ranked results with cosine similarity scores.

### Background Ingestion Daemon

- **File:** `src/ingest/daemon.py`
- Uses **watchdog** to monitor a drop folder (default: `~/CyberSamantha/DropBox/`, configurable through `CYBERSAMANTHA_DROPBOX`).
- Watches more than 30 file extensions: PDF, DOCX, PPTX, code files, configs, logs, Markdown, JSON, YAML, and others.
- **Debounce logic:** waits 2 seconds after the last file event before processing, which prevents duplicate ingestion caused by editors that write a file several times.
- **Deduplication:** uses MD5 file hashes to avoid reprocessing identical content.
- Processing pipeline for each file:
  1. Parse the document (format-specific extraction)
  2. Chunk the text (smart chunking at sentence/paragraph boundaries)
  3. Vectorize the chunks and store them in ChromaDB with rich metadata
  4. Extract knowledge graph entities via the LLM
  5. Write an ingestion log entry and notify the CLI through a callback
- Can run as a background thread inside the CLI or as a standalone process (`python -m src.ingest.daemon`).

### Document Parser

- **File:** `src/ingest/parsers.py`
- Multi-format text extraction:
  - **PDF:** page-by-page extraction with PyPDF2
  - **DOCX:** paragraph extraction with python-docx
  - **PPTX:** slide/shape text extraction with python-pptx
  - **JSON/YAML:** structured serialization to readable text
  - **Plain text:** read directly using UTF-8 encoding
- `chunk_text(text, chunk_size=1000, chunk_overlap=200)`: smart chunking that prefers to break at sentence boundaries (`. `) or newlines.
- `get_file_hash(file_path)`: MD5 hash used for deduplication.

### Graph Extractor

- **File:** `src/ingest/extractor.py`
- Uses the LLM to extract cybersecurity entities (vulnerabilities, tools, threat actors, techniques, mitigations) and their relationships from text chunks.
- Returns structured JSON containing `source`, `target`, `relation`, `confidence`, `source_type`, and `target_type`.
- Every extracted relationship is added to the RealityGraph together with its confidence score and source file attribution.

## 6. Skill Genome System

CyberSamantha has **two types of skills**:

### Markdown Skills (Instruction Playbooks)

- Located in the `skills/` folder at the project root.
- `.md` files are parsed by `SkillLoader` into `MarkdownSkill` objects.
- Each file must contain a `## Skill Info` block like the following:

  ```
  ## Skill Info
  - **Name:** skill_id
  - **Agent:** AgentName (or "all")
  - **Tags:** tag1, tag2
  ```

- The full Markdown content is **injected into the agent's system prompt** as an instruction playbook.
- Current skills:

| File | Name | Agent | Tags |
|------|------|-------|------|
| `threat_intel.md` | threat_intel | ResearcherAgent | intelligence, analysis, cve |
| `pentest_playbook.md` | pentest_playbook | HackerAgent | offensive, recon, exploitation |
| `incident_response.md` | incident_response | all | dfir, incident, forensics, blue-team |
| `code_review.md` | code_review | CoderAgent | code, security, audit, owasp |
| `log_analysis.md` | log_analysis | CoderAgent | forensics, logs, analysis, blue-team |

### Python Skills (Executable Tools)

- Located in `src/skills/`.
- Subclasses of `BaseSkill` that expose callable tool functions.
- Automatically discovered and registered when an agent is initialized.
- Current skills:
  - **CryptoSkill** (`crypto.py`): hashing (MD5, SHA-1/256/512), hash identification, Base64 encode/decode, hex encode/decode. Available to all agents.
  - **OsintSkill** (`osint.py`): DNS lookup, reverse DNS, port checking, WHOIS lookup. Available only to HackerAgent and ResearcherAgent.

### SkillLoader

- **File:** `src/skills/loader.py`
- Unified loader for Markdown and Python skills.
- `discover()`: scans `skills/` for `.md` files and `src/skills/` for `BaseSkill` subclasses.
- `get_md_skills_for(agent_name)`: returns the compatible Markdown skills.
- `get_all_tools(agent_name)`: returns all callable Python tool functions.
- `enable()` / `disable()`: toggle skills at runtime.
- `get_summary()`: returns a combined summary for display in the CLI.

### BaseAgent

- **File:** `src/agents/base.py`
- All agents inherit from `BaseAgent`.
- Builds the system prompt automatically by combining the base prompt with the instructions of the compatible Markdown skills.
- When Gemini is active, it uses **native function calling**: Python skill tools are registered as callable functions.
- When Ollama is active, it falls back to a **chat-based ReAct loop** with tool descriptions embedded in the system prompt.
- Wraps every skill tool with a Genome Engine tracker to record execution time and success rate.

## 7. Installation & Setup

### Prerequisites

- Python 3.8+
- Google Gemini API key (optional, for the cloud LLM)
- Ollama (optional, for the local LLM fallback)
- Git (for cloning the data sources)

### Step by Step

1. **Clone the repository and create a virtual environment:**

   ```
   git clone https://github.com/RicheByte/cyberSamantha
   cd cyberSamantha
   python -m venv myenv
   ```

2. **Activate the environment:**

   ```
   # Windows
   myenv\Scripts\activate
   # Linux/Mac
   source myenv/bin/activate
   ```

3. **Install the dependencies:**

   ```
   pip install -r requirements.txt
   ```

4. **Configure your API keys** by creating or editing the `.env` file:

   ```
   GEMINI_API_KEY=your_api_key_here
   EXA_API_KEY=your_exa_api_key_here  # Optional, for web search
   CYBERSAMANTHA_DROPBOX=C:/Path/To/Your/Dropbox  # Optional
   LLM_PROVIDER=auto  # auto | gemini | ollama
   OLLAMA_MODEL=llama3  # Optional
   OLLAMA_BASE_URL=http://localhost:11434  # Optional
   ```

5. **Run the setup checker:**

   ```
   python setup_check.py
   ```

   This verifies your operating system, Python version, `.env` configuration, data directories, and vector database status.

6. **Update the data sources (optional but recommended):**

   ```
   python update_data.py --update
   ```

7. **Index the documents into the vector store:**

   ```
   python main.py --index
   ```

8. **Start the interactive shell:**

   ```
   python main.py
   ```

## 8. CLI Commands & Usage

### Launch Modes

| Command | Description |
|---------|-------------|
| `python main.py` | Start the interactive chat shell |
| `python main.py --index` | Index the documents in `data/` into ChromaDB |
| `python main.py --index --force` | Force a full re-index (ignores file hashes) |
| `python main.py --question "What is XSS?"` | Ask a single question and exit |
| `python main.py --daemon` | Start the background ingestion daemon at launch |

### Interactive Shell Commands

| Command | Description |
|---------|-------------|
| *(any text)* | Natural-language query → handled by the AgentRouter |
| `/wiki ` | Query the RealityGraph for a wiki-style entity summary |
| `search ` | Search the web through the Exa API for threat intelligence |
| `run ` | Execute a safe terminal command (allowlist enforced) |
| `read ` | Read a local file into context |
| `think` | Reprocess the previous query and print the Chain-of-Thought |
| `remember that ` | Store a permanent fact in Semantic Memory |
| `/status` | Full system health table (LLM, vector store, graph, memory, daemons, skills) |
| `/daemon start` | Start the background file ingestion watcher |
| `/daemon stop` | Stop the background watcher |
| `/daemon log` | Show recently auto-ingested files |
| `/provider` | Show the currently active LLM provider and model |
| `/skills` | List all loaded skills (type, agent, tags, status) |
| `/skills enable ` | Enable a skill |
| `/skills disable ` | Disable a skill |
| `quit` / `exit` / `q` | Exit the shell |

### Quick-Launch Scripts

| Platform | Command |
|----------|---------|
| Linux/Mac | `./ask.sh "What is XSS?"` |
| PowerShell | `.\ask.ps1 "What is XSS?"` |
| CMD | `ask.bat "What is XSS?"` |
| With banner | `./ask.sh "What is XSS?" --banner` |

## 9. Data Sources & Update Pipeline

### Data Sources (`update_data.py`)

The system can clone and maintain **four cybersecurity data repositories** in the `data/` folder:

| Source | URL | Description | Default |
|--------|-----|-------------|---------|
| **Handbooks** | `github.com/0xsyr0/Awesome-Cybersecurity-Handbooks` | Curated cybersecurity handbooks (recon, exploitation, forensics, and more) | ✅ Enabled |
| **Exploits** | `gitlab.com/exploit-database/exploitdb` | Exploit database | ❌ Disabled |
| **Advisories** | `github.com/github/advisory-database` | GitHub security advisories | ❌ Disabled |
| **NVD CVE** | `github.com/olbat/nvdcve` | National Vulnerability Database CVE feeds | ❌ Disabled |

Sources 2–4 are disabled by default because they are very large. Enable them in `config.yaml`.

### Update Features

- **Shallow clones** (`--depth 1`) to minimize download size.
- **Detached HEAD recovery:** automatically detects and repairs a detached HEAD state.
- **Retry logic:** failed git operations are retried up to 3 times with exponential backoff.
- **Network connectivity check:** verifies access to GitHub/GitLab before starting.
- **Metadata tracking:** records the last update time, file counts, and source URLs in `data/update_metadata.json`.

### Update Commands

| Command | Description |
|---------|-------------|
| `python update_data.py --update` | Update all enabled data sources |
| `python update_data.py --update --cleanup` | Update and clean up git history |
| `python update_data.py --status` | Show update status and file counts |
| `python update_data.py --update --skip-network-check` | Skip the network connectivity check |

## 10. Storage Cleanup & Optimization

**File:** `cleanup_storage.py`

Large git repositories (especially advisories and nvdcve) take up a lot of disk space. This tool provides fine-grained storage management:

| Command | Description |
|---------|-------------|
| `python cleanup_storage.py --status` | Show current storage usage per repository |
| `python cleanup_storage.py --temp` | Delete temporary git pack files (`tmp_pack_*`) |
| `python cleanup_storage.py --remove-git` | Delete the `.git` folders in advisories, nvdcve, and exploits |
| `python cleanup_storage.py --remove-backups` | Delete `.broken` backup folders |
| `python cleanup_storage.py --gc` | Run aggressive git garbage collection on all repositories |
| `python cleanup_storage.py --all` | Full cleanup (temp files + backups + git history) |
| `python cleanup_storage.py --all --keep-handbooks` | Full cleanup, but keep the handbooks `.git` folder for future updates |

After a cleanup, re-index with: `python main.py --index --force`

## 11. Directory Map

```
cyberSamantha/
├── main.py  # Entry point: loads dotenv, starts CLIApp
├── config.yaml  # Central configuration (data sources, RAG settings, LLM provider, daemon)
├── requirements.txt  # Python dependencies
├── .env  # Environment variables (API keys, provider settings)
├── setup_check.py  # Cross-platform setup verification script
├── update_data.py  # Data source updater (git clone/pull for 4 repos)
├── cleanup_storage.py  # Git storage cleanup and optimization tool
├── test_ecosystem.py  # Integration tests for core components
├── ask.sh / ask.ps1 / ask.bat  # Quick-launch scripts for single-question mode
│
├── src/
│   ├── __init__.py
│   ├── agents/
│   │   ├── __init__.py  # Exports BaseAgent, HackerAgent, ResearcherAgent, CoderAgent
│   │   ├── base.py  # BaseAgent: shared logic, tool wrapping, Gemini/Ollama routing
│   │   ├── hacker.py  # HackerAgent: offensive security, nmap, curl
│   │   ├── researcher.py  # ResearcherAgent: web search, CVE lookup
│   │   ├── coder.py  # CoderAgent: file reading, code analysis, safe terminal
│   │   └── debate_orchestrator.py  # Cross-agent adversarial debate + synthesis
│   ├── cli/
│   │   ├── __init__.py
│   │   └── app.py  # CLIApp: Rich-based interactive REPL, command parser
│   ├── core/
│   │   ├── __init__.py
│   │   ├── agent.py  # AgentRouter: central manager, ReAct loop, query routing
│   │   ├── config.py  # ConfigManager: singleton YAML config loader
│   │   ├── llm_provider.py  # LLMProvider: Gemini ↔ Ollama hot-swap
│   │   ├── thought_router.py  # ThoughtRouter: cognitive strategy selection via Meta-Memory
│   │   └── curiosity_daemon.py  # CuriosityDaemon: autonomous contradiction resolution
│   ├── ingest/
│   │   ├── __init__.py
│   │   ├── daemon.py  # IngestionDaemon: watchdog-based file watcher + auto-ingest
│   │   ├── parsers.py  # DocumentParser: PDF/DOCX/PPTX/JSON/YAML/text extraction + chunking
│   │   └── extractor.py  # GraphExtractor: LLM-based entity/relationship extraction
│   ├── knowledge/
│   │   ├── __init__.py
│   │   ├── vector_store.py  # VectorStore: ChromaDB wrapper, indexing, semantic search
│   │   ├── reality_graph.py  # RealityGraph: probabilistic knowledge graph with confidence scores
│   │   └── graph_store.py  # GraphStore: basic NetworkX graph operations (legacy)
│   ├── memory/
│   │   ├── __init__.py
│   │   ├── episodic.py  # EpisodicMemory: sliding window + LLM summarization
│   │   ├── semantic.py  # SemanticMemory: persistent JSON fact storage
│   │   └── meta_memory.py  # MetaMemory: strategy outcome tracking and retrieval
│   ├── skills/
│   │   ├── __init__.py
│   │   ├── base.py  # BaseSkill: abstract interface for pluggable skills
│   │   ├── loader.py  # SkillLoader: discovers MD playbooks + Python skill modules
│   │   ├── genome_engine.py  # GenomeEngine: usage tracking, success rates, pruning
│   │   ├── crypto.py  # CryptoSkill: hashing, encoding, hash identification
│   │   └── osint.py  # OsintSkill: DNS, WHOIS, port checking, reverse DNS
│   └── tools/
│       ├── __init__.py  # Exports BaseTool, WebSearchTool, TerminalTool, FileReaderTool
│       ├── base.py  # BaseTool + ToolResult: abstract tool interface
│       ├── terminal.py  # TerminalTool: safe command execution (allowlist/blocklist)
│       ├── file_reader.py  # FileReaderTool: safe local file reading with extension whitelist
│       └── web_search.py  # WebSearchTool: Exa API web search
│
├── skills/  # Markdown skill playbooks (injected into agent prompts)
│   ├── threat_intel.md  # Threat intelligence analysis framework
│   ├── pentest_playbook.md  # Penetration testing methodology
│   ├── incident_response.md  # NIST SP 800-61 incident response procedure
│   ├── code_review.md  # OWASP Top 10 secure code review checklist
│   ├── log_analysis.md  # Log forensics and IOC pattern detection
│   └── genome_registry.json  # Skill fitness tracking (usage, success rate, lineage)
│
├── data/  # Data repositories and runtime storage
│   ├── handbooks/  # Cloned cybersecurity handbooks (git submodule)
│   ├── memory.json  # Semantic memory (user facts)
│   ├── meta_memory.json  # Meta-memory (strategy outcomes)
│   ├── reality_graph.json  # RealityGraph (persisted knowledge graph)
│   └── update_metadata.json  # Data source update tracking
│
├── chroma_db/  # Local ChromaDB vector database
└── assets/  # Images and media for documentation
```

## 12. Environment Variables & Configuration

### `.env` Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `GEMINI_API_KEY` | *(empty)* | Google Gemini API key for the cloud LLM |
| `EXA_API_KEY` | *(empty)* | Exa AI API key for web search |
| `CYBERSAMANTHA_DROPBOX` | `~/CyberSamantha/DropBox` | Path of the auto-ingestion drop folder |
| `LLM_PROVIDER` | `auto` | Force a provider: `auto`, `gemini`, or `ollama` |
| `OLLAMA_MODEL` | `llama3` | Ollama model name used in local mode |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |

### `config.yaml` Sections

| Section | Purpose |
|---------|---------|
| `data_sources` | Toggle individual data repositories on/off; configure shallow clone depth and git history removal |
| `storage` | Automatic cleanup of git packs and temp files, aggressive gc, maximum data folder size warning |
| `rag_system` | Embedding model (`all-MiniLM-L6-v2`), Gemini model (`gemini-2.0-flash`), context chunks (5), chunk size (1000), overlap (200) |
| `llm_provider` | Provider mode (`auto`/`gemini`/`ollama`), Ollama base URL and model |
| `daemon` | Watch directory, auto-start toggle, watched file extensions |
| `updates` | Retry count (3), git timeout (600s), network check toggle |

## 13. How a Query Flows Through the System

```
User types: "What are the latest vulnerabilities in Apache Log4j?"
            │
            ▼
   ┌───────────────────┐
   │      CLIApp       │  (src/cli/app.py)
   │  Captures input   │
   └────────┬──────────┘
            │
            ▼
   ┌───────────────────┐
   │    AgentRouter    │  (src/core/agent.py)
   │     .query()      │
   └────────┬──────────┘
            │
 ┌──────────┼──────────────────────┐
 │          │                      │
 ▼          ▼                      ▼
"search/  "read/                 Other:
 run" cmd "remember"               ▼
 │          │              ┌───────────────┐
 │          │              │ ThoughtRouter │
 ▼          ▼              │ (check        │
Direct tool Store fact     │ Meta-Memory)  │
execution   in Semantic    └───────┬───────┘
                                   │
                       ┌───────────┼───────────┐
                       ▼           ▼           ▼
                    "debate"    "linear"    fallback
                       │           │
           ┌───────────┴───┐       │
           │DebateOrchestr.│       ▼
           │Coder vs Hacker│ ┌───────────────┐
           │+ Synthesis    │ │  ReAct Loop   │
           └───────────────┘ │ (max 5 iters) │
                             └───────┬───────┘
                                     │
                ┌───────────┬────────┼────────┬──────────┐
                ▼           ▼        ▼        ▼          ▼
           web_search   terminal file_rag wiki_query agent delegation
                │           │        │        │          │
                ▼           ▼        ▼        ▼          ▼
             Exa API   Subprocess ChromaDB NetworkX  HackerAgent
                                            Graph    ResearcherAgent
                                                     CoderAgent
                                     │
                                     ▼
                             ┌───────────────┐
                             │ LLM generates │
                             │ final answer  │
                             │ + sources     │
                             └───────────────┘
                                     │
                                     ▼
                             ┌───────────────┐
                             │ EpisodicMem   │  ← add to history
                             │ (summarize    │
                             │  if too long) │
                             └───────────────┘
                                     │
                                     ▼
                     Rendered via Rich Markdown in CLI
```

## 14. Extending the System

### Adding a New Markdown Skill

1. Create a `.md` file in the `skills/` folder:

   ```
   # My New Skill

   ## Skill Info
   - **Name:** my_new_skill
   - **Agent:** HackerAgent
   - **Tags:** recon, network

   ## Instructions
   Your detailed playbook goes here...
   ```

2. Restart CyberSamantha. The skill is discovered automatically and injected into the prompt of the specified agent.

### Adding a New Python Skill

1. Create a file in `src/skills/` that subclasses `BaseSkill`:

   ```python
   from typing import List, Callable

   from src.skills.base import BaseSkill


   class MySkill(BaseSkill):
       name = "my_skill"
       description = "Does something useful"
       version = "1.0"
       tags = ["category"]
       compatible_agents = set()  # empty = all agents

       def get_tools(self) -> List[Callable]:
           def my_tool(input: str) -> str:
               """Description of what this tool does."""
               return f"Result: {input}"

           return [my_tool]
   ```

2. Restart. The skill is discovered automatically, and its tools become available to compatible agents through Gemini native function calling or Ollama ReAct parsing.

### Adding a New Agent

1. Create a file in `src/agents/` that subclasses `BaseAgent`.
2. Implement `__init__` with a system prompt, and `get_tools()` returning the callable functions.
3. Register it in `AgentRouter.__init__` in `src/core/agent.py`.
4. Add it to the action dispatch of the ReAct loop in `_handle_auto_query()`.

### Adding a New Data Source

1. In `update_data.py`, add a new entry to `repo_configs` with the URL, type, target directory, and shallow clone setting.
2. Add the corresponding `update_()` method for counting and reporting files.
3. Add a configuration entry with an `enabled` toggle under `data_sources` in `config.yaml`.

## Dependencies

| Package | Purpose |
|---------|---------|
| `google-generativeai` | Google Gemini API client |
| `chromadb` | Local vector database |
| `sentence-transformers` | Local embedding model (`all-MiniLM-L6-v2`) |
| `networkx` | Knowledge graph operations |
| `PyPDF2` | PDF text extraction |
| `python-docx` | DOCX text extraction |
| `python-pptx` | PPTX text extraction |
| `watchdog` | File system monitoring for the ingestion daemon |
| `exa-py` | Web search API (requires `EXA_API_KEY`) |
| `requests` | HTTP (Ollama, network checks) |
| `PyYAML` | YAML configuration parsing |
| `python-dotenv` | `.env` file loading |
| `tqdm` | Progress bars |
| `rich` | Terminal UI (panels, tables, Markdown rendering) |
| `tenacity` | Retry logic |

## Testing

Run the integration test suite:

```
python test_ecosystem.py
```

This tests:

- **RealityGraph:** initialization, relationship addition, contradiction detection
- **MetaMemory:** strategy recording, best-strategy retrieval
- **ThoughtRouter:** routing decisions based on Meta-Memory
- **GenomeEngine:** skill usage recording

*CyberSamantha: your cyber second brain.*
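
The Document Parser section describes `chunk_text(text, chunk_size=1000, chunk_overlap=200)` as chunking that prefers to break at sentence boundaries (`. `) or newlines. The following is a minimal sketch of that idea, not the code in `src/ingest/parsers.py`. Only the signature comes from this README; the rule of searching for a break in the second half of each window is an assumption made for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into overlapping chunks, preferring sentence or line breaks."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Assumption: only look for a break in the second half of the
            # window, so that chunks do not become very small.
            window_mid = start + chunk_size // 2
            cut = max(text.rfind(". ", window_mid, end),
                      text.rfind("\n", window_mid, end))
            if cut != -1:
                end = cut + 1  # keep the period (or newline) in this chunk
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Step back by the overlap so neighbouring chunks share context,
        # while always moving forward by at least one character.
        start = max(end - chunk_overlap, start + 1)
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"Sentence number {i}." for i in range(300))
    for piece in chunk_text(sample, chunk_size=200, chunk_overlap=50)[:3]:
        print(len(piece), repr(piece[:40]))
```

The overlap means the end of one chunk reappears at the start of the next, which helps retrieval when a relevant passage straddles a chunk boundary.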

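Pillar 2 describes `decay_confidence(drift_factor)` lowering the confidence of relationships over time and `detect_contradictions()` flagging edges whose confidence is below 0.5. The sketch below illustrates that interplay with plain dictionaries instead of NetworkX. The `Relation` class, the multiplicative decay formula, and the example edges are assumptions for illustration and are not taken from `src/knowledge/reality_graph.py`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source entity, target entity)


@dataclass
class Relation:
    """Hypothetical edge payload: a relation label plus a confidence score."""
    relation: str
    confidence: float  # expected range: 0.0 to 1.0


def decay_confidence(edges: Dict[Edge, Relation], drift_factor: float = 0.05) -> None:
    """Reduce every confidence by a fixed fraction (assumed decay rule)."""
    for rel in edges.values():
        rel.confidence = max(0.0, rel.confidence * (1.0 - drift_factor))


def detect_contradictions(edges: Dict[Edge, Relation], threshold: float = 0.5) -> List[Edge]:
    """Return the edges whose confidence has fallen below the threshold."""
    return [edge for edge, rel in edges.items() if rel.confidence < threshold]


if __name__ == "__main__":
    graph = {
        ("Apache Log4j", "CVE-2021-44228"): Relation("affected_by", 0.90),
        ("ExampleTool", "ExampleHost"): Relation("targets", 0.52),  # made-up edge
    }
    print(detect_contradictions(graph))  # nothing is below 0.5 yet
    decay_confidence(graph, drift_factor=0.05)
    print(detect_contradictions(graph))  # the weaker edge is now flagged
```

In this sketch a flagged edge would be the kind of item the Curiosity Engine picks up for re-verification, after which its confidence could be raised again.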
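Tags: AI security, AI risk mitigation, Chat Copilot, Darwinian capability evolution, PyRIT, RAG, Ruby, vector database, multi-agent collaboration, multi-agent reasoning, multi-agent systems, threat intelligence analysis, threat investigation, security intelligence, skill genome, persistent memory, attack path visualization, agent swarm, retrieval-augmented generation, privilege detection, knowledge base, second brain, cybersecurity, cyber intelligence system, autonomous cognitive system, automated threat analysis, cognitive ecosystem, reverse engineering tools, privacy protection, hacking tools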