SeRi0720/LLM-Based-Threat-Intelligence-Gathering

GitHub: SeRi0720/LLM-Based-Threat-Intelligence-Gathering

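An end-to-end system for automated threat intelligence collection and analysis. It uses a large language model to turn security news from multiple sources into structured analysis reports and presents them in an interactive dashboard.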


# 🛡️ LLM-Based Threat Intelligence Gathering

## 📋 Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Architecture](#architecture)
- [Project Structure](#project-structure)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Dashboard](#dashboard)
- [Troubleshooting](#troubleshooting)
- [Known Limitations](#known-limitations)

## Overview

This project implements an end-to-end threat intelligence pipeline that:

1. **Collects** articles from 6 public security sources (RSS feeds + NVD API)
2. **Cleans and normalizes** raw content (HTML decoding, Unicode normalization, whitespace cleanup)
3. **Extracts entities** deterministically: CVEs, IOCs, MITRE ATT&CK TTPs, threat actor keywords
4. **Synthesizes** structured analysis reports with an LLM (Llama 3.1 8B via the Groq API)
5. **Presents** the results in an interactive Streamlit dashboard with filtering

**Data sources:**

| Source | Type | Content |
|---|---|---|
| [NVD](https://nvd.nist.gov) | REST API | Structured CVE data + CVSS scores |
| [Bleeping Computer](https://www.bleepingcomputer.com) | RSS | Security news |
| [Krebs on Security](https://krebsonsecurity.com) | RSS | Investigative security reporting |
| [The Hacker News](https://thehackernews.com) | RSS | Security news and vulnerability alerts |
| [SANS ISC](https://isc.sans.edu) | RSS | Daily threat summaries |
| [Threatpost](https://threatpost.com) | RSS | Security news |

## Features

- **Automated collection** from 6 sources with a configurable time window
- **SHA-256 deduplication**: articles are never processed twice
- **Text cleaning pipeline**: HTML entity decoding, Unicode normalization, tag removal
- **Entity extraction**: CVEs, IPs, domains, URLs, file hashes, emails, MITRE techniques/tactics, threat actor names
- **LLM synthesis**: structured JSON reports with summary, threat actors, affected systems, severity, analyst notes, and tags
- **NVD severity preservation**: authoritative CVSS scores are never overwritten by the LLM
- **Dynamic token throttling**: automatically respects the Groq free-tier TPM limit
- **Graceful shutdown**: Ctrl+C stops safely after the current article finishes processing
- **Parallel processing**: configurable worker count for concurrent LLM calls
- **Interactive dashboard**: filter by severity and source; expandable article cards

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          DATA SOURCES                           │
│  NVD API                       RSS/Atom Feeds (5 sources)       │
└──────────────────────┬──────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────────┐
│                        COLLECTION LAYER                         │
│  nvd_collector.py              rss_collector.py                 │
│  • REST API v2                 • feedparser + trafilatura       │
│  • CVSS severity               • SHA-256 deduplication          │
└──────────────────────┬──────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────────┐
│                        PROCESSING LAYER                         │
│  entity_extractor.py                                            │
│  ├── clean_text()        HTML decode, Unicode NFC, whitespace   │
│  ├── extract_cves()      Regex: CVE-YYYY-NNNNN                  │
│  ├── extract_iocs()      iocextract: IPs, domains, URLs,        │
│  │                       hashes, emails                         │
│  ├── extract_ttps()      MITRE technique IDs + tactic keywords │
│  └── extract_actor_kw()  Known actor patterns + named list     │
└──────────────────────┬──────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────────┐
│                            LLM LAYER                            │
│  analyzer.py                                                    │
│  • Model: llama-3.1-8b-instant (Groq API)                       │
│  • Input: cleaned text + pre-extracted entities                 │
│  • Output: JSON {summary, threat_actors, affected_systems,     │
│                 severity, severity_reason, analyst_notes, tags} │
│  • Dynamic TPM throttling (4800 effective tokens/min)          │
└──────────────────────┬──────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────────┐
│                          STORAGE LAYER                          │
│  SQLite (threat_intel.db) via SQLAlchemy                        │
│  • Raw content, cleaned text, entities, LLM report             │
└──────────────────────┬──────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────────┐
│                       PRESENTATION LAYER                        │
│  Streamlit Dashboard (dashboard/app.py)                         │
│  • KPI metrics, severity/source filters, article cards         │
└─────────────────────────────────────────────────────────────────┘
```

## Project Structure

```
LLM_Project/
├── collectors/
│   ├── __init__.py
│   ├── rss_collector.py        # RSS/Atom feed collector
│   └── nvd_collector.py        # NVD REST API collector
│
├── processors/
│   ├── __init__.py
│   └── entity_extractor.py     # Clean, normalize, extract entities
│
├── llm/
│   ├── __init__.py
│   └── analyzer.py             # Groq API integration + throttling
│
├── database/
│   ├── __init__.py
│   └── models.py               # SQLAlchemy models + session factory
│
├── dashboard/
│   ├── __init__.py
│   └── app.py                  # Streamlit web dashboard
│
├── main.py                     # Pipeline entrypoint
├── .env                        # API keys (not committed to git)
├── .gitignore
├── requirements.txt
└── README.md
```

## Prerequisites

| Requirement | Version | Notes |
|---|---|---|
| Python | 3.11+ | Required |
| [uv](https://github.com/astral-sh/uv) | Latest | Python/package manager used by this project |
| Groq API Key | n/a | Free at [console.groq.com](https://console.groq.com) |
| NVD API Key | n/a | Optional but recommended; [request one here](https://nvd.nist.gov/developers/request-an-api-key) |
| Git | Any | Used to clone the repository |

## Installation

### Step 1: Clone the repository

```
git clone https://github.com/your-username/LLM_Project.git
cd LLM_Project
```

### Step 2: Install uv (if not already installed)

```
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Verify the installation:

```
uv --version
```

### Step 3: Create a virtual environment and install dependencies

```
# Create a venv with Python 3.11+
uv venv --python 3.11

# Install all dependencies from requirements.txt into the venv
# (`uv add -r requirements.txt` only works if the project has a pyproject.toml)
uv pip install -r requirements.txt
```

### Step 4: Create the `__init__.py` files

```
# Windows (Command Prompt)
type nul > collectors\__init__.py
type nul > processors\__init__.py
type nul > llm\__init__.py
type nul > database\__init__.py
type nul > dashboard\__init__.py

# macOS / Linux
touch collectors/__init__.py processors/__init__.py llm/__init__.py database/__init__.py dashboard/__init__.py
```

## Configuration

### Step 1: Create the `.env` file

Create a file named `.env` in the project root:

```
# Required
GROQ_API_KEY=your_groq_api_key_here

# Optional: raises the NVD API rate limit from 5 to 50 requests per 30 seconds
NVD_API_KEY=your_nvd_api_key_here
```

### Step 2: Get a Groq API key (free)

1. Go to **[https://console.groq.com](https://console.groq.com)**
2. Sign up with Google or GitHub
3. Navigate to **API Keys** → **Create API Key**
4. Copy the key and paste it into `.env`

**Groq free-tier limits (llama-3.1-8b-instant):**

| Limit | Value |
|---|---|
| Tokens per minute (TPM) | 6,000 |
| Tokens per day (TPD) | 500,000 |
| Requests per minute | 30 |

### Step 3: (Optional) Get an NVD API key

1. Go to **[https://nvd.nist.gov/developers/request-an-api-key](https://nvd.nist.gov/developers/request-an-api-key)**
2. Enter your email address; the key is sent immediately
3. Add it to `.env` as `NVD_API_KEY`

Without a key, NVD requests are limited to 5 per 30 seconds. That is enough for normal use but slow for large backfills.

### Step 4: Configure pipeline parameters

Open `main.py` and adjust the configuration block at the top:

```
# ── Configuration ────────────────────────────────────────────
MAX_WORKERS = 2   # Concurrent LLM requests
                  # Keep at 2 for Groq free tier
                  # Increase to 3-5 if you upgrade to paid tier

NVD_DAYS = 3      # How many days back to fetch CVEs from NVD
                  # Increase to 7-30 for a larger initial dataset
# ─────────────────────────────────────────────────────────────
```

## Usage

### Quick start

```
# 1. Run the full pipeline (collect + process)
uv run python main.py

# 2. In a separate terminal, start the dashboard
uv run python -m streamlit run dashboard/app.py --server.fileWatcherType none
```

Open **[http://localhost:8501](http://localhost:8501)** in your browser.

### Running individual components

**Collection only (no LLM processing):**

```
# RSS feeds only
uv run python -c "from collectors.rss_collector import collect_rss; collect_rss()"

# NVD CVEs only (last 3 days)
uv run python -c "from collectors.nvd_collector import collect_nvd; collect_nvd(days_back=3)"

# NVD CVEs, wider range (last 7 days)
uv run python -c "from collectors.nvd_collector import collect_nvd; collect_nvd(days_back=7)"
```

**Test entity extraction:**

```
uv run python -c "
from processors.entity_extractor import extract_all_entities
import json

test = '''CVE-2024-1234 is a critical RCE in Apache 2.4.51.
LockBit ransomware group was observed at 185.220.101.45 using T1059.
Contact: admin@evil-c2.ru'''

result = extract_all_entities(test)
print(json.dumps({k: v for k, v in result.items() if k != 'cleaned_text'}, indent=2))
"
```

**Test the LLM analyzer:**

```
uv run python -c "
from processors.entity_extractor import extract_all_entities
from llm.analyzer import analyze_with_llm
import json

content = '''CVE-2024-1234 is a critical RCE vulnerability in Apache HTTP Server
versions 2.4.1 through 2.4.51. LockBit ransomware group has been exploiting
this in the wild since January 2024, targeting financial institutions.
Indicators of compromise include 185.220.101.45 and evil-c2.ru.'''

entities = extract_all_entities(content)
result = analyze_with_llm(entities['cleaned_text'], entities)
print(json.dumps(result, indent=2))
"
```

**Check database status:**

```
uv run python -c "
from database.models import get_session, ThreatArticle

session = get_session()
total = session.query(ThreatArticle).count()
processed = session.query(ThreatArticle).filter_by(is_processed=1).count()
pending = session.query(ThreatArticle).filter_by(is_processed=0).count()
print(f'Total    : {total}')
print(f'Processed: {processed}')
print(f'Pending  : {pending}')
session.close()
"
```

### Database management

**Reset unprocessed articles only (keeps completed work):**

```
uv run python -c "
from database.models import get_session, ThreatArticle

session = get_session()
deleted = session.query(ThreatArticle).filter_by(is_processed=0).delete()
session.commit()
print(f'Deleted {deleted} unprocessed articles')
session.close()
"
```

**Full database reset (start over):**

```
uv run python -c "
from database.models import Base, get_session
from sqlalchemy import create_engine

engine = create_engine('sqlite:///threat_intel.db')
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)
print('Database reset complete')
"
```

**Normalize severity values (run once after upgrading):**

```
uv run python -c "
from database.models import get_session, ThreatArticle
from processors.entity_extractor import normalize_severity

session = get_session()
for article in session.query(ThreatArticle).all():
    if article.severity:
        article.severity = normalize_severity(article.severity)
session.commit()
session.close()
print('Severity normalization complete')
"
```

### Running the program

```
uv run python main.py
```

## Dashboard

Start the dashboard:

```
uv run python -m streamlit run dashboard/app.py --server.fileWatcherType none
```

**URL:** [http://localhost:8501](http://localhost:8501)

### Dashboard features

| Feature | Description |
|---|---|
| **KPI metrics** | Total articles, critical/high counts, CVEs found, active sources |
| **Severity filter** | Filter by Critical, High, Medium, Low, or Info |
| **Source filter** | Filter by individual data source |
| **Article cards** | Expandable cards with full analysis details |
| **URL links** | Direct links to the original source article |
| **LLM summary** | Two-sentence analyst summary |
| **CVE list** | All CVE IDs extracted from the article |
| **IOC details** | IPs, domains, URLs, hashes, and emails grouped by category |
| **Analyst notes** | Actionable recommendations from the LLM |
| **Tags** | Topic tags for quick reference |
| **Auto-refresh** | The dashboard reloads its data every 60 seconds |

## Troubleshooting

### `Error code: 400 - model_decommissioned`

The model name has changed. Update `llm/analyzer.py`:

```
MODEL_NAME = "llama-3.1-8b-instant"  # Current recommended model
```

List the available models:

```
uv run python -c "
from groq import Groq; import os
from dotenv import load_dotenv; load_dotenv()

client = Groq(api_key=os.getenv('GROQ_API_KEY'))
for m in client.models.list().data:
    print(m.id)
"
```

## Known Limitations

| Limitation | Impact | Workaround |
|---|---|---|
| Groq free tier: 6,000 TPM | Processing about 70 articles takes about 14 minutes | Use `MAX_WORKERS=1` for stability, or upgrade to a paid tier |
| Batch processing only | No real-time alerting | Use `scheduler.py` for periodic updates |
| MITRE technique ID coverage | Low; articles rarely use the T-number format | Tactic keyword detection still works |
| LLM hallucination | Occasional incorrect threat actor attribution | Always verify against the linked source URL |
| English-only content | Non-English articles are not supported | Use English-language sources only |
| Articles truncated at 4,000 characters | Long articles lose detail | Raise the limit if you are on a paid API tier |

## Requirements

Full `requirements.txt`:

```
# Data Collection
feedparser==6.0.11
httpx==0.27.0
beautifulsoup4==4.12.3
trafilatura==1.12.0

# Entity Extraction
iocextract==1.16.0

# LLM - Groq
groq==0.9.0

# Database
sqlalchemy==2.0.30

# Dashboard
streamlit==1.35.0
pandas==2.2.2

# Utilities
python-dotenv==1.0.1
tenacity==8.3.0
```
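The architecture diagram describes `extract_cves()` and `extract_ttps()` only as regex matching on `CVE-YYYY-NNNNN` and on MITRE technique IDs. As a rough illustration of that idea, and not the project's actual code, a minimal sketch could look like the following. The function names `find_cve_ids` and `find_technique_ids` and both patterns are assumptions made for this example.

```python
import re

# Assumed pattern: "CVE-", a 4-digit year, then a sequence number of 4+ digits.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)

# Assumed pattern: MITRE ATT&CK technique ID such as T1059, with an optional
# three-digit sub-technique suffix such as T1059.001.
TECHNIQUE_RE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def _unique_in_order(values):
    """Deduplicate while keeping the order of first appearance."""
    return list(dict.fromkeys(values))


def find_cve_ids(text: str) -> list[str]:
    """Hypothetical helper: return unique CVE IDs, normalized to upper case."""
    return _unique_in_order(m.group(0).upper() for m in CVE_RE.finditer(text))


def find_technique_ids(text: str) -> list[str]:
    """Hypothetical helper: return unique ATT&CK technique IDs."""
    return _unique_in_order(m.group(0) for m in TECHNIQUE_RE.finditer(text))


if __name__ == "__main__":
    sample = (
        "CVE-2024-1234 is a critical RCE in Apache 2.4.51. "
        "LockBit was observed at 185.220.101.45 using T1059."
    )
    print(find_cve_ids(sample))
    print(find_technique_ids(sample))
```

Because a pattern like this only fires on the literal `T1059` format, it is consistent with the low technique-ID coverage listed under known limitations: an article that describes a technique in words never matches.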

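The README states that the analyzer throttles itself to roughly 4,800 effective tokens per minute to stay under Groq's 6,000 TPM free-tier limit, but it does not show how. One possible approach, sketched here purely as an assumption, is a sliding-window token budget. `TokenBudget`, its methods, and the injectable `clock` parameter are hypothetical names introduced for this example and do not come from `llm/analyzer.py`.

```python
import time
from collections import deque
from typing import Callable


class TokenBudget:
    """Sliding-window limiter: at most `limit` tokens in any `window` seconds."""

    def __init__(self, limit: int = 4800, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)

    def _used(self, now: float) -> int:
        # Drop entries that have aged out of the window, then sum what remains.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)

    def wait_time(self, tokens: int) -> float:
        """Seconds to wait until `tokens` more tokens fit in the budget."""
        if tokens > self.limit:
            raise ValueError("request is larger than the whole budget")
        now = self.clock()
        used = self._used(now)
        if used + tokens <= self.limit:
            return 0.0
        # Expire the oldest entries one by one until the request would fit.
        for stamp, spent in self.events:
            used -= spent
            if used + tokens <= self.limit:
                return max(0.0, stamp + self.window - now)
        return 0.0  # not reached: an empty window always fits tokens <= limit

    def record(self, tokens: int) -> None:
        """Register tokens that were just spent on a request."""
        self.events.append((self.clock(), tokens))
```

A caller would run `time.sleep(budget.wait_time(n))` before each request and `budget.record(n)` afterwards, where `n` is the estimated token count of the request. With more than one worker, both calls would need to be guarded by a lock.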
**LLM-Based Threat Intelligence Gathering** Penetration Testing Capstone Project, April 2026
