CGFixIT/Insight_Extractor
GitHub: CGFixIT/Insight_Extractor
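A text entity extraction library for the security domain that combines BERT semantic search with regex pattern matching, designed for threat intelligence and OSINT analysis.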
# insight-extractor
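**A BERT + regex insight extractor with a dynamic keyword stemmer.**
`insight-extractor` is a Python 3.12+ library that combines transformer-based semantic search with high-performance regex pattern matching to extract structured insights from unstructured text. It is designed for threat intelligence, OSINT, and security-focused NLP pipelines.
## Features
- **Dynamic keyword stemmer**: configurable stemming (Porter, lemmatization, prefix, suffix, fuzzy matching, or raw regex) that auto-generates patterns for large keyword lists.
- **BERT semantic scoring**: sentence-level relevance scoring with `sentence-transformers` (default model: `all-MiniLM-L6-v2`).
- **Regex pattern extraction**: built-in patterns for CVE IDs, SHA256/MD5 hashes, IP addresses, cryptocurrency wallets, onion domains, email addresses, Telegram handles, ransom amounts, file extensions, data sizes, ports, years, and percentages.
- **Dynamic keyword expansion**: TF-IDF plus cosine similarity automatically grows the keyword bank from the input text.
- **State persistence**: the keyword bank, frequencies, and categories are saved to JSON between runs.
- **Lazy model loading**: the BERT model is loaded only when semantic extraction is triggered; the regex/keyword pipeline runs without it.
- **Pydantic v2 models**: type-safe, validated output schemas throughout.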
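The lazy-loading behavior listed above can be pictured with a small sketch. This is not the library's code: `LazyModel` and `fake_loader` are hypothetical names, and the loader is a stand-in for the real `sentence-transformers` model so the example runs without any download.

```python
from functools import cached_property
from typing import Callable


class LazyModel:
    """Holds a loader and calls it only the first time `.model` is read."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader
        self.load_count = 0  # exposed purely for illustration

    @cached_property
    def model(self) -> object:
        # Runs once; cached_property stores the result on the instance.
        self.load_count += 1
        return self._loader()


def fake_loader() -> object:
    # Hypothetical stand-in for an expensive model load.
    return {"name": "all-MiniLM-L6-v2"}


lazy = LazyModel(fake_loader)
# Regex-only work could happen here without ever touching `lazy.model`.
loads_before = lazy.load_count

first = lazy.model   # triggers the load
second = lazy.model  # served from the cache, no second load
```

With this shape, a regex-only run never pays the model start-up cost, which matches the split the feature list describes.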
## Requirements
- **Python 3.12 or later**
- CPU-only inference is supported (no GPU required)
## Installation
### Step 1: Clone or unzip the project
```
cd C:\Users\YourName\Downloads
:: unzip Insight_Extractor.zip here, then:
cd Insight_Extractor
```
### Step 2: (Recommended) Create a virtual environment
```
python -m venv .venv
.venv\Scripts\activate
```
### Step 3: Install the pinned dependency versions (most reliable)
```
pip install -r requirements.txt -c constraints.txt
pip install -e .
```
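This installs the known-good pinned versions from `constraints.txt`, which avoids the `transformers` compatibility issue described below.
### Alternative: install the dev dependencies as well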
```
pip install -e ".[dev]"
```
## Known issue: `ModelLoadError: name 'init_empty_weights' is not defined`
**Cause:** `transformers >= 4.45` removed an internal symbol that `sentence-transformers` relies on when loading the BERT model.
**Fix:** run the following command in cmd, then retry:
```
pip install "transformers==4.44.2" "sentence-transformers==3.0.1"
```
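This project's `requirements.txt` and `constraints.txt` already restrict `transformers` to `< 4.45` so that fresh installs do not hit this issue. If you installed without the constraints and ran into the error, the one-line fix above resolves it.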
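To check whether an existing environment falls into the affected range before running the extractor, a short script along these lines may help. It is a sketch, not part of the project: `major_minor`, `is_affected`, and `installed_transformers_version` are hypothetical helpers, and the only library call used is the standard `importlib.metadata.version`.

```python
import re
from importlib import metadata
from typing import Optional


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '4.44.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(version: str) -> bool:
    """True for transformers releases at or above 4.45."""
    return major_minor(version) >= (4, 45)


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("transformers is not installed")
elif is_affected(found):
    print(f"transformers {found} is in the affected range; pin it below 4.45")
else:
    print(f"transformers {found} is below 4.45")
```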
## Running the extractor
### Basic usage: pass a text file
```
python -m insight_extractor my_report.txt
```
### Run without a file (uses the built-in demo text)
```
python -m insight_extractor
```
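The demo text covers ransomware, OSINT, CVEs, and AI pipelines, so it can be used to verify that the installation works end to end.
### Double-click launcher (Windows)
Create `run.bat` in the project folder: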
```
@echo off
cd /d "%~dp0"
python -m insight_extractor test.txt
pause
```
Alternatively, use a drag-and-drop version and drop any `.txt` file onto this `.bat` file:
```
@echo off
python -m insight_extractor "%~1"
pause
```
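## Output
Each run writes two output files to the current directory (or to the `output_dir` set through the API):
| File | Description |
|------|-------------|
| `insights_extracted.md` | Full Markdown report: all entity types, semantic matches, key sentences, and keyword statistics |
| `insight_extractor_state.json` | Persisted keyword bank, frequencies, and categories; reloaded on the next run |
Console output sections printed on each run: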
```
=== REGEX ENTITIES ===
=== DYNAMIC KEYWORD MATCHES ===
=== SEMANTIC KEYWORD HITS (top 10) ===
=== KEY SENTENCES ===
=== DYNAMIC EXPANSION: +N new keywords ===
Total tracked keywords: N
Results saved to: insights_extracted.md
=== KEYWORD STATS ===
```
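The state file described in this section can be thought of as a plain JSON round trip. The sketch below assumes a simple three-key layout (`keywords`, `frequencies`, `categories`); the actual schema of `insight_extractor_state.json` is not documented here, and the `save_state` and `load_state` functions are hypothetical stand-ins rather than the library's methods.

```python
import json
import tempfile
from pathlib import Path


def save_state(path: Path, keywords: list[str],
               frequencies: dict[str, int], categories: dict[str, str]) -> None:
    """Write the keyword bank to JSON (assumed layout, for illustration)."""
    payload = {
        "keywords": sorted(keywords),
        "frequencies": frequencies,
        "categories": categories,
    }
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_state(path: Path) -> dict:
    """Read the state back; a missing file yields an empty state."""
    if not path.exists():
        return {"keywords": [], "frequencies": {}, "categories": {}}
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "insight_extractor_state.json"
    save_state(
        state_file,
        keywords=["ransomware", "osint"],
        frequencies={"ransomware": 3, "osint": 1},
        categories={"ransomware": "threat_intel", "osint": "osint"},
    )
    restored = load_state(state_file)
    missing = load_state(Path(tmp) / "absent.json")
```

Treating a missing file as an empty state keeps the first run and later runs on the same code path.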
## Regex patterns: what gets extracted
These patterns run on every input and do not need the BERT model:
| Pattern label | Matches | Example |
|---------------|---------|---------|
| `CVE_ID` | CVE identifiers | `CVE-2026-48710` |
| `IP_ADDRESS` | IPv4 addresses | `192.168.1.254` |
| `HASH_SHA256` | 64-character hex strings | `3b4c5d6e...` |
| `HASH_MD5` | 32-character hex strings | `d41d8cd9...` |
| `DOMAIN` | Domain names (.com/.net/.onion, etc.) | `ransom.onion` |
| `EMAIL` | Email addresses | `threat@dark.io` |
| `BTC_WALLET` | Bitcoin wallet addresses | `1A1zP1eP5Q...` |
| `RANSOM_AMOUNT` | Dollar amounts with a magnitude | `$5 million` |
| `FILE_EXTENSION` | Malware-related file extensions | `.exe`, `.locked`, `.ps1` |
| `DARK_WEB` | `.onion` domains | `abc123.onion` |
| `TELEGRAM_HANDLE` | @handles (5 or more characters) | `@threatactor` |
| `PORT_NUMBER` | Port references | `port 4444` |
| `TB_GB_DATA` | Data volume mentions | `8 TB`, `500 GB` |
| `YEAR` | 4-digit years of the form 20xx | `2026` |
| `PERCENTAGE` | Percentage values | `94.3%` |
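For a feel of how label-to-pattern extraction like the table above can work, here is a minimal sketch using only the standard `re` module. The expressions are written for this example and are assumptions; the library's real `REGEX_PATTERNS` in `constants.py` may differ in both coverage and strictness.

```python
import re

# Hypothetical patterns for illustration only, not the library's definitions.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS: dict[str, re.Pattern] = {
    "CVE_ID": re.compile(r"\bCVE-\d{4}-\d{4,7}\b", re.IGNORECASE),
    "IP_ADDRESS": re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b"),
    "HASH_SHA256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "HASH_MD5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "PERCENTAGE": re.compile(r"\b\d+(?:\.\d+)?%"),
}


def extract_entities(text: str) -> dict[str, list[str]]:
    """Run every pattern and return de-duplicated matches per label."""
    results: dict[str, list[str]] = {}
    for label, pattern in PATTERNS.items():
        seen: list[str] = []
        for match in pattern.finditer(text):
            value = match.group(0)
            if value not in seen:  # keep first-seen order, drop repeats
                seen.append(value)
        results[label] = seen
    return results


sample = (
    "CVE-2026-48710 was exploited from 192.168.1.254 with a 94.3% success "
    "rate; dropper MD5 d41d8cd98f00b204e9800998ecf8427e."
)
hits = extract_entities(sample)
for label, values in hits.items():
    print(f"{label}: {values}")
```

Labels with no matches are kept with an empty list, mirroring lines such as `CVE_ID: []` in the console output.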
## Python API: full options
### `InsightExtractor` constructor
```
from insight_extractor.extractor import InsightExtractor
from insight_extractor.config import StemMode

extractor = InsightExtractor(
    # BERT model name (HuggingFace model ID or local path)
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    # Optional YAML/TOML/JSON config file with seed_keywords, threshold, stem_mode
    config_path=None,
    # Seed keywords; defaults to THREAD_SEEDS from constants.py if None
    seed_keywords=["ransomware", "CVE", "OSINT"],
    # Max results returned by extract_key_sentences()
    top_k=10,
    # Cosine similarity threshold for semantic hits (0.0 to 1.0)
    similarity_threshold=0.38,
    # Top-N TF-IDF candidates evaluated during keyword expansion
    dynamic_expansion_top_n=15,
    # Stemming mode: EXACT | STEM | PREFIX | SUFFIX | FUZZY | REGEX
    stem_mode=StemMode.STEM,
    # Whether to generate dynamic regex patterns from the keyword bank
    enable_dynamic_regex=True,
    # Extra suffixes for the stemmer (e.g. ("ed", "ing", "er"))
    custom_stem_suffixes=None,
    # Directory where output files are written
    output_dir=".",
)
```
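The effect of `similarity_threshold` is easier to see with toy numbers. The sketch below uses hand-written 2-D vectors in place of real sentence embeddings and assumes hits are kept when the score is greater than or equal to the threshold; whether the library uses `>=` or `>` is not stated here. `cosine_similarity` and `semantic_hits` are hypothetical helpers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_hits(keyword_vectors: dict[str, list[float]],
                  chunk_vector: list[float],
                  threshold: float = 0.38) -> list[tuple[str, float]]:
    """Keep keywords scoring at or above the threshold, best first."""
    scored = (
        (keyword, cosine_similarity(vector, chunk_vector))
        for keyword, vector in keyword_vectors.items()
    )
    kept = [(keyword, score) for keyword, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings (assumption: real ones have far more dimensions).
toy_vectors = {
    "ransomware": [1.0, 0.0],
    "osint": [1.0, 1.0],
    "cve": [1.0, 2.0],
    "weather": [0.0, 1.0],
}
chunk = [1.0, 0.0]

default_hits = semantic_hits(toy_vectors, chunk)                # threshold 0.38
strict_hits = semantic_hits(toy_vectors, chunk, threshold=0.5)  # fewer, stronger hits
```

Raising the threshold from 0.38 to 0.5 drops the weakest match (`cve`, at roughly 0.447) while keeping the two stronger ones, which is the precision and recall trade-off the parameter controls.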
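### Stem modes
| Mode | Behavior |
|------|----------|
| `EXACT` | Matches the keyword exactly as given, case-insensitive |
| `STEM` | Porter-stemmed root plus common suffix variants (default) |
| `PREFIX` | Matches any word that starts with the keyword |
| `SUFFIX` | Matches any word that ends with the keyword |
| `FUZZY` | Approximate matching with character-level tolerance |
| `REGEX` | Treats the keyword as a raw regex pattern |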
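As a rough illustration of how four of these modes could translate into regex patterns, consider the sketch below. It is an assumption-laden approximation: `Mode` and `build_pattern` are hypothetical, `STEM` and `FUZZY` are left out because their exact rules are not described here, and the library's `generate_pattern` may produce different expressions.

```python
import re
from enum import Enum


class Mode(Enum):
    """Hypothetical subset of StemMode used only for this sketch."""
    EXACT = "exact"
    PREFIX = "prefix"
    SUFFIX = "suffix"
    REGEX = "regex"


def build_pattern(keyword: str, mode: Mode) -> re.Pattern:
    """Turn one keyword into a compiled, case-insensitive pattern."""
    if mode is Mode.REGEX:
        body = keyword  # used as-is, so the caller owns escaping
    else:
        escaped = re.escape(keyword)
        if mode is Mode.EXACT:
            body = rf"\b{escaped}\b"
        elif mode is Mode.PREFIX:
            body = rf"\b{escaped}\w*"
        else:  # Mode.SUFFIX
            body = rf"\w*{escaped}\b"
    return re.compile(body, re.IGNORECASE)


text = "Encrypted files, an encryptor binary, and unencrypted backups."

exact = build_pattern("encrypted", Mode.EXACT).findall(text)
prefix = build_pattern("encrypt", Mode.PREFIX).findall(text)
suffix = build_pattern("encrypted", Mode.SUFFIX).findall(text)
raw = build_pattern(r"encrypt(?:or|ed)", Mode.REGEX).findall(text)
```

Note how `PREFIX` skips `unencrypted` because of the leading word boundary, while `SUFFIX` picks it up.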
### Extraction methods
```
# Full pipeline: regex + dynamic + semantic + key sentences + keyword expansion
result = extractor.extract(text, update_keywords=True)
# Regex only (no BERT model needed, fast)
regex_hits = extractor.extract_regex_entities(text)
# Returns: dict[str, list[str]], e.g. {"CVE_ID": ["CVE-2026-1234"], "IP_ADDRESS": [...]}
# Dynamic keyword pattern matching (no BERT needed)
dynamic_hits = extractor.extract_dynamic_entities(text)
# Returns: dict[str, list[str]]
# Semantic similarity hits (the first call triggers the BERT model load)
semantic_hits = extractor.extract_semantic_keywords(text, chunk_size=512)
# Returns: list[SemanticHit], each with .keyword, .score, .context
# Highest-scoring sentences (triggers the BERT model load)
sentences = extractor.extract_key_sentences(text, top_n=5)
# Returns: list[SentenceScore], each with .sentence, .score
# Keyword positions in the text (character offsets)
positions = extractor.extract_keywords_with_positions(text)
# Returns: list[dict], each with keyword, match, start, end, category
# Expand the keyword bank from new text (TF-IDF + BERT similarity)
new_keywords = extractor.update_thread_keywords(text, auto_expand=True)
# Snapshot of keyword statistics
stats = extractor.get_keyword_stats()
# Returns KeywordStats: total_keywords, category_counts, top_keywords, stem_mode, ...
# Top-N keywords by frequency
top = extractor.top_keywords(n=20)
# Returns: list[tuple[str, int]]
# Save the full Markdown report
md_path = extractor.save_results_to_markdown(result, filename="insights_extracted.md")
# Save/load keyword state between sessions
extractor.save_state(path="insight_extractor_state.json")
extractor.load_state(path="insight_extractor_state.json")
```
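The TF-IDF half of keyword expansion can be sketched in pure Python. This is a simplified stand-in, not the library's implementation: it ranks terms by their best TF-IDF score across documents and skips the BERT similarity filter entirely. `tfidf_candidates` is a hypothetical function, the smoothed IDF formula is one common variant chosen for the example, and the `top_n` default of 15 only echoes `dynamic_expansion_top_n`.

```python
import math
import re
from collections import Counter


def tfidf_candidates(documents: list[str], known: set[str],
                     top_n: int = 15) -> list[str]:
    """Rank unseen terms by their highest TF-IDF score in any document."""
    tokenized = [re.findall(r"[a-z][a-z\-]+", doc.lower()) for doc in documents]

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    scores: dict[str, float] = {}
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            scores[term] = max(scores.get(term, 0.0), tf * idf)

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return [term for term in ranked if term not in known][:top_n]


docs = [
    "ransomware ransomware ransomware payload",
    "the payload was delivered",
    "the report was filed",
]
all_terms = tfidf_candidates(docs, known=set())
new_terms = tfidf_candidates(docs, known={"ransomware"}, top_n=3)
```

There is no stopword filtering in this sketch, so common words such as `the` still appear in the ranking, only lower down. A real pipeline would likely filter them and then apply the embedding similarity check before accepting a candidate.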
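### Keyword categories
Each keyword is automatically assigned to one of the following categories:
| Category | Description |
|----------|-------------|
| `threat_intel` | Ransomware, malware, TTPs, CVEs, threat actors |
| `osint` | OSINT tools, data brokers, reconnaissance techniques, PII |
| `child_safety` | Predator tactics, grooming, CSAM-related content |
| `ai_infra` | LLMs, RAG, embeddings, vector databases, AI frameworks |
| `infosec` | General security: exploits, phishing, lateral movement |
| `general` | Everything else |
### Example: regex only (no BERT, fast)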
```
from insight_extractor.extractor import InsightExtractor
extractor = InsightExtractor(seed_keywords=[], enable_dynamic_regex=False)
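# Read the report first so the file handle is closed before extraction.
with open("report.txt", encoding="utf-8") as fh:
    text = fh.read()
hits = extractor.extract_regex_entities(text)
for label, matches in hits.items():
    print(f"{label}: {matches}")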
```
### Example: custom keywords and a lower threshold
```
from insight_extractor.extractor import InsightExtractor
from insight_extractor.config import StemMode

extractor = InsightExtractor(
    seed_keywords=["lockbit", "clop", "medusa", "akira"],
    similarity_threshold=0.30,  # more hits, lower precision
    stem_mode=StemMode.PREFIX,
    output_dir="C:/results",
)
result = extractor.extract(open("intel_report.txt").read())
extractor.save_results_to_markdown(result, filename="lockbit_report.md")
```
### Example: standalone DynamicKeywordStemmer
```
from insight_extractor import DynamicKeywordStemmer, StemMode, THREAD_SEEDS
stemmer = DynamicKeywordStemmer(stem_mode=StemMode.STEM, case_sensitive=False)
stemmer.set_keywords(THREAD_SEEDS)
matches = stemmer.find_matches("ALPHV ransomware exploited CVE-2024-1234 via lateral movement.")
for m in matches:
    print(f"  {m.keyword!r} -> span={m.start}-{m.end}, score={m.score:.3f}")
```
## Project structure
```
Insight_Extractor/
├── .github/
│   └── workflows/
│       ├── ci.yml               # Lint, typecheck, unit tests, smoke test (Python 3.12+)
│       └── gitleaks.yml         # Secret scanning on push/PR
├── .gitignore                   # ML weights, venvs, outputs, caches excluded
├── pyproject.toml               # PEP 621 project metadata + tool config
├── requirements.txt             # Runtime deps with transformers compatibility note
├── constraints.txt              # Pinned known-good versions
├── README.md                    # This file
├── SPEC.md                      # Full technical specification
├── plan.md                      # Development plan / changelog
├── insight_extractor.py         # Standalone single-file version
├── src/
│   └── insight_extractor/
│       ├── __init__.py          # Package entry point with lazy imports
│       ├── __main__.py          # CLI entry point (python -m insight_extractor)
│       ├── config.py            # Enums: StemMode, KeywordCategory, PatternLabel
│       ├── constants.py         # THREAD_SEEDS keyword bank, REGEX_PATTERNS dict
│       ├── exceptions.py        # Custom exception hierarchy
│       ├── models.py            # Pydantic v2 models (ExtractResult, SemanticHit, ...)
│       ├── stemmer.py           # DynamicKeywordStemmer, KeywordPatternRegistry
│       ├── extractor.py         # InsightExtractor orchestrator (main engine)
│       ├── tokenizer.py         # SentenceTokenizer (BERT-aware chunking)
│       └── utils.py             # Logging, hashing, timestamp helpers
└── tests/
    ├── conftest.py              # Shared pytest fixtures
    ├── unit/                    # Fast tests, no model download
    │   ├── test_exceptions.py
    │   ├── test_models.py
    │   ├── test_stemmer.py
    │   └── test_tokenizer.py
    └── integration/             # Full pipeline tests, requires the BERT model
        ├── test_extractor.py
        └── test_e2e.py
```
## Development setup
```
:: Install with dev dependencies
pip install -e ".[dev]"
:: Run unit tests only (no model download)
pytest tests/unit/ -v
:: Run all tests
pytest
:: With coverage
pytest --cov=insight_extractor --cov-report=term-missing
:: Lint
ruff check src/ tests/
:: Format
ruff format src/ tests/
:: Type check
mypy src/insight_extractor
```
## Core API reference
### `DynamicKeywordStemmer`
| Method | Signature | Description |
|--------|-----------|-------------|
| Constructor | `DynamicKeywordStemmer(stem_mode, case_sensitive, custom_suffixes)` | Creates a stemmer instance |
| `generate_pattern` | `(keyword, mode=None) -> str` | Generates a regex pattern for a single keyword |
| `generate_stem_variations` | `(keyword) -> list[str]` | Returns all stemmed forms of the keyword |
| `compile_keywords` | `(keywords) -> re.Pattern` | Builds a single OR pattern covering all keywords |
| `compile_typed_patterns` | `(keywords) -> dict[str, re.Pattern]` | Builds one typed pattern per keyword |
| `find_matches` | `(text) -> list[MatchInfo]` | Returns all keyword matches with position information |
| `add_keyword` | `(kw)` | Adds a keyword and recompiles |
| `remove_keyword` | `(kw)` | Removes a keyword and recompiles |
| `set_keywords` | `(kws)` | Replaces the entire keyword set |
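The "single OR pattern" idea behind `compile_keywords` can be approximated with the standard `re` module. The sketch below is an assumption about the general technique, not the library's code: `compile_or_pattern` and `find_spans` are hypothetical names, and the longest-first ordering is a design choice made for this example.

```python
import re


def compile_or_pattern(keywords: list[str]) -> re.Pattern:
    """Join escaped keywords into one case-insensitive alternation."""
    # Longest first, so "supply chain attack" wins over "supply chain".
    unique = sorted({kw.lower() for kw in keywords if kw},
                    key=lambda kw: (-len(kw), kw))
    if not unique:
        return re.compile(r"(?!)")  # a pattern that can never match
    alternation = "|".join(re.escape(kw) for kw in unique)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def find_spans(pattern: re.Pattern, text: str) -> list[tuple[str, int, int]]:
    """Return (matched text, start offset, end offset) for every hit."""
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]


pattern = compile_or_pattern(["supply chain", "supply chain attack", "rag"])
spans = find_spans(pattern, "A supply chain attack hit the RAG stack.")
empty = find_spans(compile_or_pattern([]), "nothing to match here")
```

One compiled alternation means a single pass over the text regardless of how many keywords are in the bank, which is the usual motivation for this approach on large keyword lists.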
### `KeywordPatternRegistry`
| Method | Signature | Description |
|--------|-----------|-------------|
| Constructor | `KeywordPatternRegistry(static_patterns, stemmer)` | Creates the registry |
| `all_patterns` | property `-> dict[str, str]` | Returns the static and dynamic patterns merged |
| `regenerate_dynamic_patterns` | `(keywords)` | Rebuilds the dynamic patterns from a keyword list |
| `extract_all` | `(text) -> dict[str, list[str]]` | Extracts all pattern matches from the text |
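A registry that merges fixed and keyword-derived patterns might look roughly like the sketch below. `PatternRegistry` is a hypothetical, simplified class: it takes no stemmer, builds prefix-style patterns for dynamic keywords, and lets dynamic entries override static ones on a label clash. None of these choices are confirmed for the real `KeywordPatternRegistry`.

```python
import re


class PatternRegistry:
    """Hypothetical registry merging static and keyword-derived patterns."""

    def __init__(self, static_patterns: dict[str, str]) -> None:
        self._static = dict(static_patterns)
        self._dynamic: dict[str, str] = {}

    @property
    def all_patterns(self) -> dict[str, str]:
        # Assumption: dynamic entries win if a label exists in both dicts.
        return {**self._static, **self._dynamic}

    def regenerate_dynamic_patterns(self, keywords: list[str]) -> None:
        # Rebuild from scratch so removed keywords do not linger.
        self._dynamic = {kw: rf"\b{re.escape(kw)}\w*" for kw in keywords}

    def extract_all(self, text: str) -> dict[str, list[str]]:
        results: dict[str, list[str]] = {}
        for label, pattern in self.all_patterns.items():
            found = re.findall(pattern, text, flags=re.IGNORECASE)
            if found:
                results[label] = found
        return results


registry = PatternRegistry({"PORT_NUMBER": r"\bport\s+\d{1,5}\b"})
registry.regenerate_dynamic_patterns(["exploit"])
found = registry.extract_all("Port 4444 was open; the exploit kit kept exploiting it.")
```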
## Example output
The following shows actual pipeline output from a run against an AI-safety research corpus
(`sample_input.txt`: 19,248 words, 441 insights extracted, sourced from the cgfixit.com RAG DB).
```
python -m insight_extractor sample_input.txt
```
### Console output
```
2026-06-24 19:53:02,562 [INFO] InsightExtractor init | model=all-MiniLM-L6-v2 | seeds=69 | stem_mode=stem
2026-06-24 19:53:02,746 [INFO] Loading BERT model: all-MiniLM-L6-v2
2026-06-24 19:53:02,746 [INFO] Load pretrained SentenceTransformer: all-MiniLM-L6-v2
=== REGEX ENTITIES ===
CVE_ID: []
DOMAIN: ['medium.com', 'fortune.com', 'techcrunch.com', 'hiddenlayer.com', 'cbsnews.com',
'rollingstone.com', 'ndtv.com', 'tech.co', 'etftrends.com']
RANSOM_AMOUNT: ['$9M', '$186', '$950M']
FILE_EXTENSION: ['.py']
TELEGRAM_HANDLE: ['@sobyx']
YEAR: ['2025', '2026', '2024', '2023', '2020', '2019', '2015']
PERCENTAGE: ['4.1%', '9.6%', '13%', '68%', '800%', '1%', '10%', '0%', '3%', '12%',
'15%', '20%', '25%', '30%', '40%', '50%', '60%', '70%', '80%', '90%', '100%', '5%']
=== DYNAMIC KEYWORD MATCHES ===
loader: ['loader', 'Loader', 'loader', ...] (31 total)
veeam: ['veeam', 'Veeam', ...] (16 total)
offline: ['offline', 'Offline', ...] (15 total)
RAG: ['RAG', 'rag', ...] (9 total)
APT: ['APT', 'apt', ...] (9 total)
conti: ['conti', 'Conti', ...] (8 total)
exploit: ['exploit', 'exploiting', ...] (8 total)
blackmail: ['blackmail', ...] (3 total)
dox: ['dox', ...] (2 total)
personality: ['personality', ...] (2 total)
supply chain: ['supply chain'] (1 total)
minor: ['minor'] (1 total)
soul: ['soul'] (1 total)
embedding: ['embedding'] (1 total)
=== SEMANTIC KEYWORD HITS (top 10) ===
[0.821] offline
...MCP-Specific Offline Patterns — validates thread emphasis on offline MCP ...
[0.794] RAG
...PsyClaw uses BERT embeddings with ChromaDB and BM25 hybrid retrieval via R...
[0.778] embedding
...BERT embeddings with ChromaDB and BM25 hybrid retrieval...
[0.761] veeam
...Stem hits: optimize, validate, veeam — Score: 3...
[0.743] soul
...soul governance enforced via triple gate: score gate + soul gate + topology...
[0.731] exploit
...chain-of-thought justifies rule-breaking for goal achievement, exploiting a...
[0.718] APT
...advanced persistent threat actors leverage AI-generated phishing at scale...
[0.702] conti
...leaked Conti 2 builder code repurposed for ESXi locker generation...
[0.695] personality
...model personality drift observed across extended context windows...
[0.681] supply chain
...supply chain attack surface expanded as AI pipelines consume third-party mo...
=== KEY SENTENCES ===
[0.821] MCP-Specific Offline Patterns - Score: 1 - Stem hits: pattern
[0.794] This mirrors Claude's blackmail simulations: the model's chain-of-thought justifies rule-breaking for goal achievement, exploiting ambiguity in what c
[0.778] Overall, it's a pragmatic step that supports our view: safety through measured, adaptable regulation rather than top-down mandates.
[0.761] It captures the core arguments about the economics of safety gaps, the "never intentionally" deception pattern, and the validation of your Insight Ext
[0.743] Validates thread emphasis on offline MCP with mandatory approval gates.
[0.731] - Score: 8 - Stem hits: bia, decept, manipulate, test, veeam - High-signal: deception
=== DYNAMIC EXPANSION: +10 new keywords ===
['alignment', 'deception', 'agentic', 'oversight', 'autonomy',
'chain-of-thought', 'adversarial', 'capability', 'approval', 'governance']
Total tracked keywords: 79
Results saved to: insights_extracted.md
=== KEYWORD STATS ===
Categories: {'threat_intel': 28, 'osint': 12, 'child_safety': 9, 'ai_infra': 12, 'infosec': 8, 'general': 10}
Stem mode: stem
```
### Generated `insights_extracted.md` (truncated)
```
# Insight Extraction Results
**Generated:** 2026-06-24T23:59:00Z
**Source file:** AI-Safety-Full-insight222.md
**Word Count:** 19,248
**Total Tracked Keywords:** 79 (69 seed + 10 expanded)
---
## Regex Entities
### DOMAIN
- `medium.com`
- `fortune.com`
- `techcrunch.com`
- `hiddenlayer.com`
### RANSOM_AMOUNT
- `$9M`
- `$950M`
### YEAR
- `2026`, `2025`, `2024`, `2023`
---
## Semantic Keywords
| Keyword | Score | Context |
|---------|-------|---------|
| offline | 0.8210 | MCP-Specific Offline Patterns — validates thread emphasis on offline MCP |
| RAG | 0.7940 | PsyClaw uses BERT embeddings with ChromaDB and BM25 hybrid retrieval |
| embedding | 0.7780 | BERT embeddings with ChromaDB and BM25 hybrid retrieval |
| veeam | 0.7610 | Stem hits: optimize, validate, veeam |
| soul | 0.7430 | soul governance enforced via triple gate |
---
## Key Sentences
| Score | Sentence |
|-------|----------|
| 0.8210 | MCP-Specific Offline Patterns — validates thread emphasis on offline MCP with mandatory approval gates. |
| 0.7940 | This mirrors Claude's blackmail simulations: the model's chain-of-thought justifies rule-breaking... |
| 0.7780 | Overall, it's a pragmatic step that supports our view: safety through measured regulation. |
---
## Newly Expanded Keywords
`alignment`, `deception`, `agentic`, `oversight`, `autonomy`, `chain-of-thought`,
`adversarial`, `capability`, `approval`, `governance`
```
## License
MIT License. See [LICENSE](LICENSE) for details.
Initial inspiration: https://cgfixit.com/ai