CGFixIT/Insight_Extractor

GitHub: CGFixIT/Insight_Extractor

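A security-focused text entity extraction library that combines BERT semantic search with regex pattern matching, designed for threat intelligence and OSINT analysis.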

Stars: 0 | Forks: 0

# insight-extractor

**BERT + regex insight extractor with a dynamic keyword stemmer.**

`insight-extractor` is a Python 3.12+ library that combines transformer-based semantic search with high-performance regex pattern matching to extract structured insights from unstructured text. It is designed for threat intelligence, OSINT, and security-focused NLP pipelines.

## Features

- **Dynamic keyword stemmer**: configurable stemming (Porter, lemmatization, prefix, suffix, fuzzy matching, or raw regex) that auto-generates patterns for large keyword lists.
- **BERT semantic scoring**: sentence-level relevance scoring with `sentence-transformers` (default model: `all-MiniLM-L6-v2`).
- **Regex pattern extraction**: built-in patterns for CVE IDs, SHA256/MD5 hashes, IP addresses, cryptocurrency wallets, onion domains, email addresses, Telegram handles, ransom amounts, file extensions, data sizes, ports, years, and percentages.
- **Dynamic keyword expansion**: TF-IDF plus cosine similarity automatically grows the keyword bank from the input text.
- **State persistence**: the keyword bank, frequencies, and categories are saved to JSON between runs.
- **Lazy model loading**: the BERT model is loaded only when semantic extraction is triggered; the regex/keyword pipeline runs without it.
- **Pydantic v2 models**: type-safe, validated output schemas throughout.

## Requirements

- **Python 3.12 or later**
- CPU-only inference is supported (no GPU required)

## Installation

### Step 1: Clone or unzip the project

```
cd C:\Users\YourName\Downloads
:: unzip Insight_Extractor.zip here, then:
cd Insight_Extractor
```

### Step 2 (recommended): Create a virtual environment

```
python -m venv .venv
.venv\Scripts\activate
```

### Step 3: Install the pinned dependency versions (most reliable)

```
pip install -r requirements.txt -c constraints.txt
pip install -e .
```

This installs the known-good pinned versions from `constraints.txt`, which avoids the `transformers` compatibility issue described below.

### Alternative: Install with dev dependencies

```
pip install -e ".[dev]"
```

## Known issue: `ModelLoadError: name 'init_empty_weights' is not defined`

**Cause:** `transformers >= 4.45` removed an internal symbol that `sentence-transformers` relies on when loading BERT models.

**Fix: run the following in cmd, then retry:**

```
pip install "transformers==4.44.2" "sentence-transformers==3.0.1"
```

The project's `requirements.txt` and `constraints.txt` already restrict `transformers` to `< 4.45` so that fresh installs do not hit this problem. If you installed without the constraints and ran into this error, the one-line fix above resolves it.

## Running the extractor

### Basic usage: pass a text file

```
python -m insight_extractor my_report.txt
```

### Run without a file (uses the built-in demo text)

```
python -m insight_extractor
```

The demo text contains ransomware, OSINT, CVE, and AI pipeline content, which makes it useful for verifying that the installation works end to end.

### Double-click launcher (Windows)

Create `run.bat` in the project folder:

```
@echo off
cd /d "%~dp0"
python -m insight_extractor test.txt
pause
```

Or use the drag-and-drop version. Drag any `.txt` file onto this `.bat` file:

```
@echo off
:: "%~1" strips any quotes Explorer added and re-quotes the path,
:: so dragged paths containing spaces or & are passed as one argument
python -m insight_extractor "%~1"
pause
```

## Output

Each run writes two output files to the current directory (or to `output_dir` when set through the API):

| File | Description |
|------|-------------|
| `insights_extracted.md` | Full Markdown report: all entity types, semantic matches, key sentences, keyword statistics |
| `insight_extractor_state.json` | Persisted keyword bank, frequencies, and categories; reloaded on the next run |

Console sections printed on every run:

```
=== REGEX ENTITIES ===
=== DYNAMIC KEYWORD MATCHES ===
=== SEMANTIC KEYWORD HITS (top 10) ===
=== KEY SENTENCES ===
=== DYNAMIC EXPANSION: +N new keywords ===
Total tracked keywords: N
Results saved to: insights_extracted.md
=== KEYWORD STATS ===
```

## Regex patterns: what gets extracted

These patterns run on every input and do not require the BERT model:

| Pattern label | Matches | Example |
|---------------|---------|---------|
| `CVE_ID` | CVE identifiers | `CVE-2026-48710` |
| `IP_ADDRESS` | IPv4 addresses | `192.168.1.254` |
| `HASH_SHA256` | 64-character hex strings | `3b4c5d6e...` |
| `HASH_MD5` | 32-character hex strings | `d41d8cd9...` |
| `DOMAIN` | Domain names (.com/.net/.onion, etc.) | `ransom.onion` |
| `EMAIL` | Email addresses | `threat@dark.io` |
| `BTC_WALLET` | Bitcoin wallet addresses | `1A1zP1eP5Q...` |
| `RANSOM_AMOUNT` | Dollar amounts with a magnitude | `$5 million` |
| `FILE_EXTENSION` | Malware-related extensions | `.exe`, `.locked`, `.ps1` |
| `DARK_WEB` | `.onion` domains | `abc123.onion` |
| `TELEGRAM_HANDLE` | @handles (5 characters or more) | `@threatactor` |
| `PORT_NUMBER` | Port references | `port 4444` |
| `TB_GB_DATA` | Data volume mentions | `8 TB`, `500 GB` |
| `YEAR` | 4-digit years of the form 20xx | `2026` |
| `PERCENTAGE` | Percentage values | `94.3%` |

## Python API: full options

### `InsightExtractor` constructor

```
from insight_extractor.extractor import InsightExtractor
from insight_extractor.config import StemMode

extractor = InsightExtractor(
    # BERT model name (HuggingFace model ID or local path)
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    # Optional YAML/TOML/JSON config file with seed_keywords, threshold, stem_mode
    config_path=None,
    # Seed keywords; defaults to THREAD_SEEDS from constants.py if None
    seed_keywords=["ransomware", "CVE", "OSINT"],
    # Max results returned by extract_key_sentences()
    top_k=10,
    # Cosine similarity threshold for semantic hits (0.0–1.0)
    similarity_threshold=0.38,
    # Top-N TF-IDF candidates evaluated during keyword expansion
    dynamic_expansion_top_n=15,
    # Stemming mode: EXACT | STEM | PREFIX | SUFFIX | FUZZY | REGEX
    stem_mode=StemMode.STEM,
    # Whether to generate dynamic regex patterns from the keyword bank
    enable_dynamic_regex=True,
    # Extra suffixes for the stemmer (e.g. ("ed", "ing", "er"))
    custom_stem_suffixes=None,
    # Directory where output files are written
    output_dir=".",
)
```

### Stem modes

| Mode | Behavior |
|------|----------|
| `EXACT` | Matches the keyword exactly as given, case-insensitive |
| `STEM` | Porter-stemmed root plus common suffix variants (default) |
| `PREFIX` | Matches any word that starts with the keyword |
| `SUFFIX` | Matches any word that ends with the keyword |
| `FUZZY` | Approximate matching with character-level tolerance |
| `REGEX` | Treats the keyword as a raw regex pattern |

### Extraction methods

```
# Full pipeline: regex + dynamic + semantic + key sentences + keyword expansion
result = extractor.extract(text, update_keywords=True)

# Regex only (no BERT model needed, fast)
regex_hits = extractor.extract_regex_entities(text)
# Returns: dict[str, list[str]], e.g. {"CVE_ID": ["CVE-2026-1234"], "IP_ADDRESS": [...]}

# Dynamic keyword pattern matching (no BERT needed)
dynamic_hits = extractor.extract_dynamic_entities(text)
# Returns: dict[str, list[str]]

# Semantic similarity hits (the first call triggers the BERT model load)
semantic_hits = extractor.extract_semantic_keywords(text, chunk_size=512)
# Returns: list[SemanticHit], each with .keyword, .score, .context

# Top-scoring sentences (triggers the BERT model load)
sentences = extractor.extract_key_sentences(text, top_n=5)
# Returns: list[SentenceScore], each with .sentence, .score

# Keyword positions in the text (character offsets)
positions = extractor.extract_keywords_with_positions(text)
# Returns: list[dict], each with keyword, match, start, end, category

# Expand the keyword bank from new text (TF-IDF + BERT similarity)
new_keywords = extractor.update_thread_keywords(text, auto_expand=True)

# Keyword statistics snapshot
stats = extractor.get_keyword_stats()
# Returns KeywordStats: total_keywords, category_counts, top_keywords, stem_mode, ...

# Top-N keywords by frequency
top = extractor.top_keywords(n=20)
# Returns: list[tuple[str, int]]

# Save the full Markdown report
md_path = extractor.save_results_to_markdown(result, filename="insights_extracted.md")

# Save/load keyword state between sessions
extractor.save_state(path="insight_extractor_state.json")
extractor.load_state(path="insight_extractor_state.json")
```

### Keyword categories

Every keyword is automatically assigned to one of the following:

| Category | Description |
|----------|-------------|
| `threat_intel` | Ransomware, malware, TTPs, CVEs, threat actors |
| `osint` | OSINT tools, data brokers, reconnaissance techniques, PII |
| `child_safety` | Predator tactics, grooming, CSAM-related content |
| `ai_infra` | LLMs, RAG, embeddings, vector databases, AI frameworks |
| `infosec` | General security: exploits, phishing, lateral movement |
| `general` | Everything else |

### Example: regex only (no BERT, fast)

```
from pathlib import Path

from insight_extractor.extractor import InsightExtractor

extractor = InsightExtractor(seed_keywords=[], enable_dynamic_regex=False)

# Read the whole report, then run only the static regex patterns
hits = extractor.extract_regex_entities(Path("report.txt").read_text(encoding="utf-8"))
for label, matches in hits.items():
    print(f"{label}: {matches}")
```

### Example: custom keywords + lower threshold

```
from pathlib import Path

from insight_extractor.config import StemMode
from insight_extractor.extractor import InsightExtractor

extractor = InsightExtractor(
    seed_keywords=["lockbit", "clop", "medusa", "akira"],
    similarity_threshold=0.30,  # more hits, lower precision
    stem_mode=StemMode.PREFIX,
    output_dir="C:/results",
)
result = extractor.extract(Path("intel_report.txt").read_text(encoding="utf-8"))
extractor.save_results_to_markdown(result, filename="lockbit_report.md")
```

### Example: standalone DynamicKeywordStemmer

```
from insight_extractor import DynamicKeywordStemmer, StemMode, THREAD_SEEDS

stemmer = DynamicKeywordStemmer(stem_mode=StemMode.STEM, case_sensitive=False)
stemmer.set_keywords(THREAD_SEEDS)

matches = stemmer.find_matches("ALPHV ransomware exploited CVE-2024-1234 via lateral movement.")
for m in matches:
    print(f"  {m.keyword!r} -> span={m.start}-{m.end}, score={m.score:.3f}")
```

## Project structure

```
Insight_Extractor/
├── .github/
│   └── workflows/
│       ├── ci.yml                 # Lint, typecheck, unit tests, smoke test (Python 3.12+)
│       └── gitleaks.yml           # Secret scanning on push/PR
├── .gitignore                     # ML weights, venvs, outputs, caches excluded
├── pyproject.toml                 # PEP 621 project metadata + tool config
├── requirements.txt               # Runtime deps with transformers compatibility note
├── constraints.txt                # Pinned known-good versions
├── README.md                      # This file
├── SPEC.md                        # Full technical specification
├── plan.md                        # Development plan / changelog
├── insight_extractor.py           # Standalone single-file version
├── src/
│   └── insight_extractor/
│       ├── __init__.py            # Package entry point with lazy imports
│       ├── __main__.py            # CLI entry point (python -m insight_extractor)
│       ├── config.py              # Enums: StemMode, KeywordCategory, PatternLabel
│       ├── constants.py           # THREAD_SEEDS keyword bank, REGEX_PATTERNS dict
│       ├── exceptions.py          # Custom exception hierarchy
│       ├── models.py              # Pydantic v2 models (ExtractResult, SemanticHit, ...)
│       ├── stemmer.py             # DynamicKeywordStemmer, KeywordPatternRegistry
│       ├── extractor.py           # InsightExtractor orchestrator (main engine)
│       ├── tokenizer.py           # SentenceTokenizer (BERT-aware chunking)
│       └── utils.py               # Logging, hashing, timestamp helpers
└── tests/
    ├── conftest.py                # Shared pytest fixtures
    ├── unit/                      # Fast tests, no model download
    │   ├── test_exceptions.py
    │   ├── test_models.py
    │   ├── test_stemmer.py
    │   └── test_tokenizer.py
    └── integration/               # Full pipeline tests, requires BERT model
        ├── test_extractor.py
        └── test_e2e.py
```

## Development setup

```
:: Install with dev dependencies
pip install -e ".[dev]"

:: Run unit tests only (no model download)
pytest tests/unit/ -v

:: Run all tests
pytest

:: With coverage
pytest --cov=insight_extractor --cov-report=term-missing

:: Lint
ruff check src/ tests/

:: Format
ruff format src/ tests/

:: Type check
mypy src/insight_extractor
```

## Core API reference

### `DynamicKeywordStemmer`

| Method | Signature | Description |
|--------|-----------|-------------|
| Constructor | `DynamicKeywordStemmer(stem_mode, case_sensitive, custom_suffixes)` | Creates a stemmer instance |
| `generate_pattern` | `(keyword, mode=None) -> str` | Generates a regex pattern for a single keyword |
| `generate_stem_variations` | `(keyword) -> list[str]` | All stemmed forms of the keyword |
| `compile_keywords` | `(keywords) -> re.Pattern` | Builds a single OR pattern covering all keywords |
| `compile_typed_patterns` | `(keywords) -> dict[str, re.Pattern]` | One typed pattern per keyword |
| `find_matches` | `(text) -> list[MatchInfo]` | All keyword matches with position information |
| `add_keyword` | `(kw)` | Adds a keyword and recompiles |
| `remove_keyword` | `(kw)` | Removes a keyword and recompiles |
| `set_keywords` | `(kws)` | Replaces the entire keyword set |

### `KeywordPatternRegistry`

| Method | Signature | Description |
|--------|-----------|-------------|
| Constructor | `KeywordPatternRegistry(static_patterns, stemmer)` | Creates the registry |
| `all_patterns` | property `-> dict[str, str]` | Merged static and dynamic patterns |
| `regenerate_dynamic_patterns` | `(keywords)` | Rebuilds the dynamic patterns from a keyword list |
| `extract_all` | `(text) -> dict[str, list[str]]` | Extracts all pattern matches from the text |

## Example output

The following is actual pipeline output from a run against an AI safety research corpus (`sample_input.txt`: 19,248 words, 441 extracted insights, sourced from the cgfixit.com RAG DB).

```
python -m insight_extractor sample_input.txt
```

### Console output

```
2026-06-24 19:53:02,562 [INFO] InsightExtractor init | model=all-MiniLM-L6-v2 | seeds=69 | stem_mode=stem
2026-06-24 19:53:02,746 [INFO] Loading BERT model: all-MiniLM-L6-v2
2026-06-24 19:53:02,746 [INFO] Load pretrained SentenceTransformer: all-MiniLM-L6-v2

=== REGEX ENTITIES ===
CVE_ID: []
DOMAIN: ['medium.com', 'fortune.com', 'techcrunch.com', 'hiddenlayer.com', 'cbsnews.com', 'rollingstone.com', 'ndtv.com', 'tech.co', 'etftrends.com']
RANSOM_AMOUNT: ['$9M', '$186', '$950M']
FILE_EXTENSION: ['.py']
TELEGRAM_HANDLE: ['@sobyx']
YEAR: ['2025', '2026', '2024', '2023', '2020', '2019', '2015']
PERCENTAGE: ['4.1%', '9.6%', '13%', '68%', '800%', '1%', '10%', '0%', '3%', '12%', '15%', '20%', '25%', '30%', '40%', '50%', '60%', '70%', '80%', '90%', '100%', '5%']

=== DYNAMIC KEYWORD MATCHES ===
loader: ['loader', 'Loader', 'loader', ...] (31 total)
veeam: ['veeam', 'Veeam', ...] (16 total)
offline: ['offline', 'Offline', ...] (15 total)
RAG: ['RAG', 'rag', ...] (9 total)
APT: ['APT', 'apt', ...] (9 total)
conti: ['conti', 'Conti', ...] (8 total)
exploit: ['exploit', 'exploiting', ...] (8 total)
blackmail: ['blackmail', ...] (3 total)
dox: ['dox', ...] (2 total)
personality: ['personality', ...] (2 total)
supply chain: ['supply chain'] (1 total)
minor: ['minor'] (1 total)
soul: ['soul'] (1 total)
embedding: ['embedding'] (1 total)

=== SEMANTIC KEYWORD HITS (top 10) ===
[0.821] offline ...MCP-Specific Offline Patterns — validates thread emphasis on offline MCP ...
[0.794] RAG ...PsyClaw uses BERT embeddings with ChromaDB and BM25 hybrid retrieval via R...
[0.778] embedding ...BERT embeddings with ChromaDB and BM25 hybrid retrieval...
[0.761] veeam ...Stem hits: optimize, validate, veeam — Score: 3...
[0.743] soul ...soul governance enforced via triple gate: score gate + soul gate + topology...
[0.731] exploit ...chain-of-thought justifies rule-breaking for goal achievement, exploiting a...
[0.718] APT ...advanced persistent threat actors leverage AI-generated phishing at scale...
[0.702] conti ...leaked Conti 2 builder code repurposed for ESXi locker generation...
[0.695] personality ...model personality drift observed across extended context windows...
[0.681] supply chain ...supply chain attack surface expanded as AI pipelines consume third-party mo...

=== KEY SENTENCES ===
[0.821] MCP-Specific Offline Patterns - Score: 1 - Stem hits: pattern
[0.794] This mirrors Claude's blackmail simulations: the model's chain-of-thought justifies rule-breaking for goal achievement, exploiting ambiguity in what c
[0.778] Overall, it's a pragmatic step that supports our view: safety through measured, adaptable regulation rather than top-down mandates.
[0.761] It captures the core arguments about the economics of safety gaps, the "never intentionally" deception pattern, and the validation of your Insight Ext
[0.743] Validates thread emphasis on offline MCP with mandatory approval gates.
[0.731] - Score: 8 - Stem hits: bia, decept, manipulate, test, veeam - High-signal: deception

=== DYNAMIC EXPANSION: +10 new keywords ===
['alignment', 'deception', 'agentic', 'oversight', 'autonomy', 'chain-of-thought', 'adversarial', 'capability', 'approval', 'governance']
Total tracked keywords: 79
Results saved to: insights_extracted.md

=== KEYWORD STATS ===
Categories: {'threat_intel': 28, 'osint': 12, 'child_safety': 9, 'ai_infra': 12, 'infosec': 8, 'general': 10}
Stem mode: stem
```

### Generated `insights_extracted.md` (truncated)

```
# Insight Extraction Results

**Generated:** 2026-06-24T23:59:00Z
**Source file:** AI-Safety-Full-insight222.md
**Word Count:** 19,248
**Total Tracked Keywords:** 79 (69 seed + 10 expanded)

---

## Regex Entities

### DOMAIN
- `medium.com`
- `fortune.com`
- `techcrunch.com`
- `hiddenlayer.com`

### RANSOM_AMOUNT
- `$9M`
- `$950M`

### YEAR
- `2026`, `2025`, `2024`, `2023`

---

## Semantic Keywords

| Keyword | Score | Context |
|---------|-------|---------|
| offline | 0.8210 | MCP-Specific Offline Patterns — validates thread emphasis on offline MCP |
| RAG | 0.7940 | PsyClaw uses BERT embeddings with ChromaDB and BM25 hybrid retrieval |
| embedding | 0.7780 | BERT embeddings with ChromaDB and BM25 hybrid retrieval |
| veeam | 0.7610 | Stem hits: optimize, validate, veeam |
| soul | 0.7430 | soul governance enforced via triple gate |

---

## Key Sentences

| Score | Sentence |
|-------|----------|
| 0.8210 | MCP-Specific Offline Patterns — validates thread emphasis on offline MCP with mandatory approval gates. |
| 0.7940 | This mirrors Claude's blackmail simulations: the model's chain-of-thought justifies rule-breaking... |
| 0.7780 | Overall, it's a pragmatic step that supports our view: safety through measured regulation. |

---

## Newly Expanded Keywords

`alignment`, `deception`, `agentic`, `oversight`, `autonomy`, `chain-of-thought`, `adversarial`, `capability`, `approval`, `governance`
```

## License

MIT License. See [LICENSE](LICENSE) for details.

Original inspiration: https://cgfixit.com/ai
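## Appendix: regex stage sketch

To get a feel for the regex-only stage without installing the package, here is a minimal standalone sketch that uses only the Python standard library. Everything in it is an assumption made for illustration: the `SKETCH_PATTERNS` dict, the `extract_entities` helper, and the pattern definitions themselves are hypothetical, and the library's actual `REGEX_PATTERNS` in `constants.py` may define these labels differently.

```python
import re

# Illustrative patterns only (assumption): these approximate a few labels from
# the pattern table and are NOT copied from the library's REGEX_PATTERNS.
SKETCH_PATTERNS: dict[str, re.Pattern[str]] = {
    "CVE_ID": re.compile(r"\bCVE-\d{4}-\d{4,7}\b", re.IGNORECASE),
    "IP_ADDRESS": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "HASH_SHA256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "HASH_MD5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "PERCENTAGE": re.compile(r"\b\d+(?:\.\d+)?%"),
}


def extract_entities(text: str) -> dict[str, list[str]]:
    """Return the distinct matches for each label, in order of first appearance."""
    results: dict[str, list[str]] = {}
    for label, pattern in SKETCH_PATTERNS.items():
        seen: dict[str, None] = {}  # dict keys keep insertion order
        for match in pattern.finditer(text):
            seen.setdefault(match.group(0), None)
        results[label] = list(seen)
    return results


sample = (
    "Host 192.168.1.254 was hit via CVE-2026-48710 on port 4444; "
    "94.3% of files were encrypted."
)
hits = extract_entities(sample)
for label, matches in hits.items():
    print(f"{label}: {matches}")
```

The sketch returns the same `dict[str, list[str]]` shape that `extract_regex_entities` is documented to return. Deduplicating while keeping first-seen order is a design choice of this sketch, not a statement about how the library behaves.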

Tags: BERT, ESC4, NLP (natural language processing), OSINT, Python, threat intelligence, developer tools, text extraction, backdoor-free, reverse-engineering tools