ef-pa/CTF_Playbook

GitHub: ef-pa/CTF_Playbook

An automated CTF knowledge pipeline that turns writeups into a technique-organized playbook through scraping, LLM classification, and structured assembly.


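# CTF Playbook Builder

A pipeline that scrapes thousands of CTF (Capture The Flag) writeups from CTFtime, GitHub, Reddit, and curated blog feeds, downloads their content, classifies them with an LLM to extract solving techniques, tools, and patterns, and organizes everything into a technique-based playbook intended to help solve new challenges.

## Architecture

### Stages

1. **Scrape** (`ctf_playbook/scrapers/`): Crawls CTFtime events, tasks, and writeup links; discovers GitHub repositories; searches Reddit CTF subreddits; parses curated blog RSS feeds. All scrapers inherit from a shared `BaseScraper` class. Metadata is stored in SQLite.
2. **Fetch** (`ctf_playbook/services/fetcher.py`): Downloads writeup content (HTML to text via trafilatura, or raw markdown). Filters out junk (link indexes, content that is too short). Saves to `playbook/raw-writeups/`.
3. **Clean** (automatic): Runs automatically before classification. Backfills content hashes, removes writeups from excluded repositories, re-checks content quality, and deduplicates by content hash. This avoids spending LLM tokens on junk or duplicate content.
4. **Classify** (`ctf_playbook/services/classifier.py`): Sends fetched writeups to Gemini for structured analysis. Extracts techniques (with sub-techniques), tools, recognition signals, solve steps, and difficulty. Results are stored in the database as JSON.
5. **Build** (`ctf_playbook/services/builder.py`): A three-step pipeline that assembles structured data from the database, exports `playbook.json` (the single source of truth), and then renders browsable markdown files.

## Key Design Decisions

- **Organized by technique (within categories).** A heap exploit and a format string bug are both "pwn", but their solve paths are completely different. The playbook groups writeups by what you actually do, with optional sub-techniques for finer granularity (for example, XSS splits into reflected, stored, and DOM-based).
- **Data model first, views second.** The builder assembles a structured `playbook.json` from the database, then renders markdown as one view of it. The JSON is the single source of truth, so a GUI or API can consume it directly without parsing markdown.
- **SQLite as the central index.** Every writeup has a `fetch_status` and a `class_status`, so any stage can be resumed. Deduplication by content hash detects identical writeups that arrive from different sources.
- **Self-improving taxonomy.** `ctf_playbook/taxonomy.py` defines the canonical categories and techniques. The classifier can discover technique slugs that are not in the taxonomy; keyword-based inference scores the slug's tokens against category keyword sets derived from the taxonomy and assigns a category automatically. Adding a technique to the taxonomy automatically expands the keyword index.
- **Automatic cleanup.** Running `classify` or `all` deduplicates and removes junk content before any LLM tokens are spent, so no manual `dedup`/`clean` step is needed.
- **Layered architecture.** Configuration, taxonomy data, data access (db), and service logic (classify, build, fetch) are kept separate, so a GUI or API can reuse the same services without importing CLI code.

## Installation

```
uv sync
```

Set the environment variables:

```
export GITHUB_TOKEN="ghp_..."   # GitHub personal access token (optional, raises rate limit)
export GEMINI_API_KEY="..."     # Required for the classify stage (free tier available)
```

## Usage

```
# Run the full pipeline (scrape -> fetch -> clean/dedup -> classify -> build)
uv run ctf-playbook all

# Or run individual stages
uv run ctf-playbook scrape                        # Discover writeups from select sources
uv run ctf-playbook fetch                         # Download writeup content
uv run ctf-playbook classify                      # Extract techniques via LLM (auto-cleans first)
uv run ctf-playbook build                         # Generate playbook.json + markdown files
uv run ctf-playbook export                        # Export playbook.json only (no markdown)
uv run ctf-playbook export -o out.json            # Export to a custom path
uv run ctf-playbook import ./data/playbook.json   # Import a playbook.json for the GUI
uv run ctf-playbook serve                         # Launch the interactive web GUI

# Stage options
uv run ctf-playbook scrape --max-events 100       # Limit CTFtime events
uv run ctf-playbook scrape --source github        # Only scrape GitHub
uv run ctf-playbook fetch --limit 500             # Fetch up to 500 writeups
uv run ctf-playbook classify --limit 100          # Classify up to 100 writeups
uv run ctf-playbook classify --category pwn       # Only classify pwn challenges
uv run ctf-playbook classify -w 10                # Run 10 workers at once

# GUI
uv run ctf-playbook serve                         # Browse the playbook at http://127.0.0.1:8080
uv run ctf-playbook serve --port 3000             # Custom port
uv run ctf-playbook serve --no-browser            # Don't auto-open browser

# Identify techniques from a challenge description
uv run ctf-playbook identify "binary with gets() and no canary"
uv run ctf-playbook identify --file challenge.txt

# Utilities
uv run ctf-playbook stats                         # Database statistics (includes sub-technique breakdown)
uv run ctf-playbook search "heap"                 # Search classified writeups by keyword
uv run ctf-playbook search -t buffer-overflow     # Filter by technique
uv run ctf-playbook search --tool gdb             # Filter by tool
uv run ctf-playbook dedup                         # Remove duplicate writeups (by content hash)
uv run ctf-playbook clean                         # Purge junk content from fetched writeups
uv run ctf-playbook fix-categories                # Backfill challenge categories from technique data

# Sub-technique management
uv run ctf-playbook promote                       # Review and promote discovered sub-techniques
uv run ctf-playbook promote --threshold 5         # Require 5+ occurrences
uv run ctf-playbook soft-reset                    # Reset classifications for re-classification
```

## Testing

```
uv run pytest
```

## Rate Limits

- **CTFtime**: 1.5-second delay between requests (please be polite)
- **GitHub API**: 5,000 requests per hour with a token, 60 per hour without
- **Reddit**: 2-second delay between requests
- **Blog RSS**: 1-second delay between feed fetches

## Interactive GUI

The playbook includes a local web app for browsing techniques, searching writeups, and exploring tools. It reads directly from `playbook.json`, so no external services are required.

### Views

- **Dashboard**: stat cards, category distribution, difficulty distribution
- **Technique Detail**: recognition signals, tools, solve flow, cross-references, sub-technique cards, example writeups
- **Category Overview**: technique table sortable by writeup count, difficulty, and sub-techniques
- **Identify**: paste a challenge description and match it against recognition signals to identify likely techniques
- **Recon Patterns**: per-category recognition patterns as they appear in CTFs, for quick triage
- **Search**: full-text search with filters for technique, tool, and difficulty
- **Tool Reference**: searchable tool table with usage counts and associated techniques
- **Sidebar Tree**: persistent taxonomy navigation grouped by category, with deep links to sub-techniques

The GUI also exposes a JSON API at `/api/stats`, `/api/techniques`, `/api/technique/{slug}`, `/api/search`, and `/api/identify` for programmatic access.

## Taxonomy Design

The playbook is organized by **technique** (what you do to solve the challenge), grouped under top-level categories. A "pwn" challenge that uses heap exploitation and a "pwn" challenge that uses a format string therefore sit in different technique branches, because their solve paths differ.

### Hierarchy

The taxonomy has at most 3 levels: **category / technique / sub-technique**.

```
playbook/techniques/
  cryptography/
    rsa-attacks/
      _pattern.md        # Technique overview + sub-technique table
      _recon.md          # Decision tree: which RSA sub-technique?
      coppersmith.md     # Sub-technique pattern file
      wiener.md
      hastad.md
    padding-oracle/
      _pattern.md        # No sub-techniques, just the overview
  web/
    xss/
      _pattern.md
      _recon.md          # Decision tree: reflected vs stored vs DOM?
      reflected-xss.md
      stored-xss.md
      dom-xss.md
```

- `_pattern.md`: technique overview with recognition signals, tools, solve flow, examples, and cross-references ("See also" related techniques)
- `_recon.md`: decision tree for techniques with 2 or more sub-techniques, which helps pick the applicable sub-technique from observable signals
- `{sub-technique}.md`: same structure as `_pattern.md`, scoped to one sub-technique
- Not every technique has sub-techniques; most have only `_pattern.md`

### Cross-References

The builder mines writeups for co-occurring techniques (challenges that use several techniques together) and adds a "See also" section to the pattern files. For example, if `padding-oracle` and `cbc-bit-flipping` frequently appear together, each technique's pattern file links to the other. This helps you pivot quickly to a related technique when one approach does not work.

### Category Inference

Techniques that the classifier discovers outside the taxonomy are assigned a category automatically through keyword-based inference. The slug's tokens are scored against category keyword sets built from:

1. Existing taxonomy technique slugs (for example, "heap" from `heap-exploitation` scores for binary-exploitation)
2. Domain-specific supplements (for example, "php" and "dom" for web; "ecdsa" and "nonce" for cryptography)

Tokens that appear in 3 or more categories are pruned as too generic. If two categories tie, the result is unknown and the technique stays in misc. The process is self-improving: adding a technique to the taxonomy automatically expands the keyword index.

### Sub-Technique Discovery

Sub-techniques come from two sources:

1. **Seeded**: predefined in the taxonomy (for example, RSA attack variants and XSS types)
2. **Discovered**: new sub-techniques that the classifier identifies during classification

Discovered sub-techniques are recorded in the database with an occurrence count. Once a sub-technique appears in 3 or more writeups, it becomes a **promotion candidate**. Use `ctf-playbook promote` to review and approve candidates.

### Playbook Data Model

The build stage produces `playbook.json` as the single source of truth. Its structure is:

```
{
  "metadata": { "generated_at": "...", "technique_count": 140, ... },
  "techniques": {
    "buffer-overflow": {
      "category": "binary-exploitation",
      "signals": ["stack smashing", "segfault on input"],
      "tools": ["gdb", "pwntools", "checksec"],
      "solve_steps": ["Check protections", "Find offset", ...],
      "examples": [{ "challenge": "...", "event": "...", "year": 2024, "url": "..." }],
      "cross_references": [{ "technique": "rop-chains", "count": 5 }],
      "sub_techniques": { ... }
    }
  },
  "recon_patterns": { ... },
  "tool_reference": [ ... ]
}
```

The markdown files are rendered from this data. A GUI or API can consume `playbook.json` directly.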
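
As a rough illustration of consuming `playbook.json` without going through the markdown, the sketch below ranks techniques by how many of their `signals` occur in a challenge description. It is not the project's own `identify` logic, which this README does not show. The function names `match_signals` and `load_playbook`, the substring-matching rule, and the sample data (including the `xss` entry) are assumptions made for the example; only the `techniques` / `signals` field names come from the structure above.

```python
import json

# Hypothetical sample in the documented shape; the values are illustrative only.
SAMPLE = json.loads("""
{
  "techniques": {
    "buffer-overflow": {
      "category": "binary-exploitation",
      "signals": ["stack smashing", "segfault on input"],
      "tools": ["gdb", "pwntools", "checksec"]
    },
    "xss": {
      "category": "web",
      "signals": ["user input reflected in page"],
      "tools": ["burp"]
    }
  }
}
""")


def load_playbook(path):
    """Read a playbook.json file from disk (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def match_signals(playbook, description):
    """Return (slug, matched_signals) pairs, strongest match first.

    Assumption: a signal matches when it occurs as a case-insensitive
    substring of the description.
    """
    text = description.lower()
    hits = []
    for slug, entry in playbook.get("techniques", {}).items():
        matched = [s for s in entry.get("signals", []) if s.lower() in text]
        if matched:
            hits.append((slug, matched))
    # Most matched signals first; slug name breaks ties for stable output.
    hits.sort(key=lambda item: (-len(item[1]), item[0]))
    return hits


if __name__ == "__main__":
    desc = "Binary prints 'stack smashing detected' and we get a segfault on input"
    for slug, matched in match_signals(SAMPLE, desc):
        print(slug, matched)
```

In real use, `load_playbook("playbook/playbook.json")` (the path is a guess) would replace `SAMPLE`. Substring matching keeps the sketch dependency-free; a production matcher would likely need tokenization or fuzzy matching.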
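
The keyword-based category inference described in the taxonomy design can be sketched as follows. This is a minimal reconstruction from the README's description, not the code in `ctf_playbook/taxonomy.py`: the names `TAXONOMY`, `SUPPLEMENTS`, `build_keyword_index`, and `infer_category` are hypothetical, the taxonomy contents are a tiny made-up subset, and scoring each matching token as one point is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical, heavily reduced taxonomy: category -> technique slugs.
TAXONOMY = {
    "binary-exploitation": ["heap-exploitation", "buffer-overflow", "format-string"],
    "web": ["xss", "sql-injection"],
    "cryptography": ["rsa-attacks", "padding-oracle"],
}

# Domain-specific supplements, using the examples named in the README.
SUPPLEMENTS = {
    "web": {"php", "dom"},
    "cryptography": {"ecdsa", "nonce"},
}


def build_keyword_index(taxonomy, supplements, max_categories=2):
    """Derive per-category keyword sets from slugs plus supplements.

    Tokens found in more than `max_categories` categories (3 or more by
    default) are pruned as too generic.
    """
    keywords = defaultdict(set)
    for category, slugs in taxonomy.items():
        for slug in slugs:
            keywords[category].update(slug.split("-"))
    for category, extra in supplements.items():
        keywords[category].update(extra)

    # Count in how many categories each token occurs (once per category).
    spread = Counter(tok for toks in keywords.values() for tok in toks)
    return {
        category: {tok for tok in toks if spread[tok] <= max_categories}
        for category, toks in keywords.items()
    }


def infer_category(slug, index):
    """Score slug tokens per category; ties and zero scores return 'unknown'."""
    tokens = slug.split("-")
    scores = {cat: sum(tok in kws for tok in tokens) for cat, kws in index.items()}
    ranked = sorted(scores.values(), reverse=True)
    if not ranked or ranked[0] == 0:
        return "unknown"
    if len(ranked) > 1 and ranked[0] == ranked[1]:
        return "unknown"  # tie between categories: leave the technique in misc
    return max(scores, key=scores.get)


if __name__ == "__main__":
    index = build_keyword_index(TAXONOMY, SUPPLEMENTS)
    for slug in ("heap-spray", "dom-clobbering", "ecdsa-nonce-reuse", "heap-dom"):
        print(slug, "->", infer_category(slug, index))
```

Because the index is rebuilt from the taxonomy itself, adding a slug such as `heap-exploitation` to a category immediately makes "heap" a keyword for it, which is the self-improving property the README describes.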
