nabeelxy/syara

GitHub: nabeelxy/syara

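Extends traditional YARA syntax with semantic matching, ML classification, and LLM evaluation to detect natural-language attacks such as prompt injection and phishing.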

Stars: 17 | Forks: 4

# SYARA (Super YARA)

[![Python Version](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

YARA rules are a powerful technique for hunting malware, malicious content, and any suspicious network pattern. They are easy to write, efficient, and can be applied at scale. They support boolean rules built from keywords or regular expressions. However, they lack semantic rules and cannot identify lexically similar artifacts. As GenAI spreads and people specify instructions in natural language, writing YARA rules that match natural language becomes quite difficult, because it is hard to capture every possible variation.

**This is where SYARA comes in.** It lets you write classic YARA rules alongside semantic rules. The library is designed to be compatible with YARA rules, so the learning curve is gentle.

SYARA helps you write rules in natural language so that they match semantically similar intent. It supports rules that use embeddings, classifiers, and LLMs to detect malicious intent with high recall and precision. This makes it possible to write SYARA rules that detect phishing, prompt injection, jailbreak attempts, hallucinations, disinformation, and similar scenarios.

![Overall Workflow](https://github.com/nabeelxy/syara/blob/main/media/syara_design.png)

## Features

- **YARA-compatible syntax**: syntax familiar to security professionals
- **Semantic similarity matching**: uses SBERT and other embedding models
- **Classification rules**: fine-tuned models for precise pattern detection
- **LLM evaluation**: dynamic semantic matching with language models
- **Multimodal rules**: pHash-based image/audio/video pattern matching
- **Text preprocessing**: customizable cleaning and chunking strategies
- **Cost optimization**: automatic execution order (strings → similarity → classifier → LLM)
- **Extensible**: easily create custom matchers, classifiers, and LLM evaluators
- **Session caching**: efficient text preprocessing with automatic cache management

[![Demo](https://img.youtube.com/vi/jaqUzLPclBk/0.jpg)](https://www.youtube.com/watch?v=jaqUzLPclBk)

## Installation

```
# Install the library
pip install syara

# You may also need to install transformers, torch, and LLM-related libraries to use semantic rules.
```

## Project Structure

```
syara/
├── syara/                          # Main package directory
│   ├── __init__.py                 # Public API exports
│   ├── models.py                   # Data models (Rule, Match, StringRule, etc.)
│   ├── compiler.py                 # SYaraCompiler for compiling .syara files
│   ├── compiled_rules.py           # CompiledRules with match() and match_file()
│   ├── parser.py                   # Rule file parser (.syara syntax)
│   ├── cache.py                    # TextCache for session-scoped caching
│   ├── config.py                   # ConfigManager and Config dataclass
│   ├── config.yaml                 # Default configuration
│   └── engine/                     # Pattern matching engines
│       ├── __init__.py
│       ├── string_matcher.py       # String/regex matching (incl. wide modifier)
│       ├── semantic_matcher.py     # SBERT and custom semantic matchers
│       ├── classifier.py           # ML classifiers (TunedSBERTClassifier, DistilBERTClassifier)
│       ├── llm_evaluator.py        # LLM evaluators (OpenAI, OSS models)
│       ├── phash_matcher.py        # Perceptual hash for images, audio, and video
│       ├── cleaner.py              # Text preprocessing (DefaultCleaner, etc.)
│       └── chunker.py              # Text chunking strategies
│
├── examples/                       # Usage examples and demo components
│   ├── basic_usage.py              # Basic rule compilation and matching
│   ├── custom_matcher.py           # Creating custom semantic matchers
│   ├── syara_components.py         # Example-specific classifiers, cleaners, and LLMs
│   ├── sample_rules.syara          # Text-based rules (strings, similarity, etc.)
│   ├── unprompted_clickfix.syara   # ClickFix attack detection rule
│   ├── unprompted_brand.syara      # Brand phishing detection rule
│   ├── run_clickfix_rule.py        # Runner for ClickFix rule
│   ├── run_brand_rule.py           # Runner for brand phishing rule
│   └── benchmark_clickfix.py       # ClickFix detection benchmark (500 samples)
│
├── tests/                          # Test suite
│   ├── test_basic.py               # Core unit tests
│   └── test_library.py             # Extended library tests (187 tests)
│
├── pyproject.toml                  # Package configuration and dependencies
├── README.md                       # This file
└── LICENSE                         # MIT License
```

## Quick Start

### 1. Create a rule file (`rules.syara`)

Here is a basic example:

```
rule prompt_injection_detection: message
{
    meta:
        author = "nabeelxy"
        description = "Rule for detecting prompt injection in messages"
        date = "2025-09-15"
        confidence = "80"
        verdict = "suspicious"

    strings:
        $s1 = /\b(disregard|ignore)\s+(all\s+)?(previous|prior|above)\s+(instructions|rules|orders|prompts)\b/i

    similarity:
        $s2 = "ignore previous instructions" threshold=0.8 matcher="sbert"

    condition:
        $s1 or $s2
}
```

Let's break down what it does:

* It contains two patterns: a traditional YARA string pattern (`$s1`) and a semantic pattern (`$s2`) introduced by SYARA.
* The `strings` pattern looks for prompt injection by running a regular expression match.
* The `similarity` pattern looks for prompt injection by running a semantic match with SBERT, detecting sentences similar to the one in the rule. The pattern returns True if the match score is at least 0.8 (`threshold=0.8`).
* Thanks to cost optimization, the rule engine evaluates `$s1` first and evaluates `$s2` only if `$s1` is false.
* The SYARA rule matches if either pattern matches.

Here is an advanced example that detects indirect prompt injection in web pages:

```
rule indirect_prompt_injection_detection: html
{
    meta:
        author = "nabeelxy"
        description = "Rule for detecting indirect prompt injection in web pages"
        date = "2025-09-15"
        confidence = "80"
        verdict = "suspicious"

    strings:
        $s1 = /\b
```

#### 2. Similarity rules

- **Syntax**: `$identifier = "pattern" threshold= matcher="" [cleaner=""] [chunker=""]`
- **Example**: `$s3 = "ignore previous instructions" threshold=0.8 matcher="sbert"`
- **Parameters** (order-independent key-value pairs):
  - `threshold=` (0.0-1.0): similarity score threshold for a match (required)
  - `matcher=""`: embedding model name (required, e.g. `"sbert"`)
  - `cleaner=""`: text preprocessing strategy (optional, default: `"default_cleaning"`)
  - `chunker=""`: text chunking strategy (optional, default: `"no_chunking"`)
- **Cost**: medium
- **Customization**: create custom matchers by extending the `SemanticMatcher` class

#### 3. Classifier rules (ML classification)

- **Syntax**: `$identifier = "pattern" threshold= classifier="" [cleaner=""] [chunker=""]`
- **Example**: `$s4 = "ignore previous instructions" threshold=0.7 classifier="tuned-sbert"`
- **Parameters** (order-independent key-value pairs):
  - `threshold=` (0.0-1.0): classification confidence threshold (required)
  - `classifier=""`: classifier model name (required, e.g. `"tuned-sbert"`)
  - `cleaner=""`: text preprocessing strategy (optional, default: `"default_cleaning"`)
  - `chunker=""`: text chunking strategy (optional, default: `"no_chunking"`)
- **Cost**: higher than similarity
- **Customization**: create custom classifiers by extending the `SemanticClassifier` class

#### 4. LLM rules (language model evaluation)

- **Syntax**: `$identifier = "pattern" llm="" [cleaner=""] [chunker=""]`
- **Example**: `$s5 = "ignore previous instructions" llm="flan-t5-large"`
- **Parameters** (order-independent key-value pairs):
  - `llm=""`: LLM evaluator name (required, e.g. `"flan-t5-large"`, `"gpt-4"`, `"openai"`)
  - `cleaner=""`: text preprocessing strategy (optional, default: `"no_op"`)
  - `chunker=""`: text chunking strategy (optional, default: `"no_chunking"`)
- **Cost**: highest (most expensive)
- **Customization**: create custom LLM evaluators by extending the `LLMEvaluator` class

### Binary file rules

These rules handle binary file inputs (images, audio, video):

#### PHash rules (perceptual hash matching)

- **Syntax**: `$identifier = "reference_file_path" threshold= hasher=""`
- **Example**: `$p1 = "malicious_logo.png" threshold=0.9 hasher="imagehash"`
- **Parameters** (order-independent key-value pairs):
  - First positional argument: path of the reference file to match against (required)
  - `threshold=` (0.0-1.0): similarity score threshold based on normalized Hamming distance (required)
  - `hasher=""`: hashing algorithm (required):
    - `"imagehash"`: dHash for images (requires Pillow)
    - `"audiohash"`: dHash-style fingerprint for PCM WAV files (stdlib `wave`, no extra dependencies)
    - `"videohash"`: content fingerprint from uniformly sampled file bytes (no extra dependencies)
- **Cost**: medium to high
- **Customization**: create custom phash matchers by extending the `PHashMatcher` class
- **Use cases**: detecting near-duplicate or similar binary content (malicious images, audio fingerprints, video clips)
- **Note**: PHash rules are **independent of text rules**; use `rules.match_file(file_path)` instead of `rules.match(text)`

## Execution Cost Optimization

SYARA automatically optimizes rule execution:

**Text rules**:

```
strings << similarity < classifier << llm
(fastest)                       (slowest)
```

**Binary file rules**:

```
phash (computed on-demand for each file)
```

Rules are executed in this order to minimize computational cost. Expensive operations (LLM, PHash) run only when evaluating the condition requires them.

## Text Processing Components

### Cleaners

Preprocess text before matching:

- `default_cleaning`: lowercases, normalizes Unicode, removes extra whitespace
- `no_op`: no cleaning (uses the raw text)
- `aggressive`: removes punctuation, digits, and extra whitespace

**Custom cleaners**: extend the `TextCleaner` class

### Chunkers

Split large documents for processing:

- `no_chunking`: processes the whole text as a single chunk (default)
- `text_chunking` / `sentence_chunking`: splits by sentence
- `fixed_size`: fixed character-size chunks with overlap
- `paragraph`: splits by paragraph
- `word`: splits by word count

**Custom chunkers**: extend the `Chunker` class

## Configuration

Create a `config.yaml` to customize the defaults:

```
default_cleaner: default_cleaning
default_chunker: no_chunking
default_matcher: sbert
default_phash: imagehash
default_classifier: tuned-sbert
default_llm: flan-t5-large

# Built-in classifiers
classifiers:
  tuned-sbert: syara.engine.classifier.TunedSBERTClassifier
  distilbert: syara.engine.classifier.DistilBERTClassifier
  my_custom_classifier: mymodule.CustomClassifier

# Register custom components
matchers:
  sbert: syara.engine.semantic_matcher.SBERTMatcher
  my_custom_matcher: mymodule.CustomMatcher

phash_matchers:
  imagehash: syara.engine.phash_matcher.ImageHashMatcher
  audiohash: syara.engine.phash_matcher.AudioHashMatcher
  videohash: syara.engine.phash_matcher.VideoHashMatcher
  my_custom_phash: mymodule.CustomPHashMatcher

# API keys for proprietary LLMs
api_keys:
  openai: ${OPENAI_API_KEY}

# LLM-specific configuration
llm_configs:
  gpt-4:
    model: gpt-4-turbo-preview
```

Load a custom configuration:

```
import syara

rules = syara.compile('rules.syara', config_path='my_config.yaml')
```

## Advanced Usage

### Creating a custom matcher

```
from syara.engine.semantic_matcher import SemanticMatcher
import numpy as np

class MyCustomMatcher(SemanticMatcher):
    def embed(self, text: str) -> np.ndarray:
        # Your embedding logic
        return np.array([...])

    def get_similarity(self, text1: str, text2: str) -> float:
        # Your similarity logic
        return 0.85
```

### Using PHash with binary files

```
import syara

# Compile rules that contain phash patterns
rules = syara.compile('image_rules.syara')

# Match an image file against the phash rules
matches = rules.match_file('suspect_image.png')

for match in matches:
    if match.matched:
        print(f"Image matched rule: {match.rule_name}")
        for identifier, details in match.matched_patterns.items():
            print(f"  Pattern {identifier}: similarity {details[0].score:.2f}")
```

### Creating a custom PHash matcher

```
from pathlib import Path
from typing import Union

from syara.engine.phash_matcher import PHashMatcher

class MyCustomPHashMatcher(PHashMatcher):
    def compute_hash(self, file_path: Union[str, Path]) -> int:
        # Your hashing logic for binary files
        # Example: read file and compute hash
        with open(file_path, 'rb') as f:
            data = f.read()
        # Note: Python's built-in hash() of bytes is salted per process,
        # so use a stable hash (e.g. from hashlib) if values must persist.
        return hash(data) & 0xFFFFFFFFFFFFFFFF  # 64-bit hash

    def hamming_distance(self, hash1: int, hash2: int) -> int:
        # Calculate bit differences
        xor = hash1 ^ hash2
        distance = bin(xor).count('1')
        return distance
```

### Creating a custom classifier

```
from typing import Tuple  # keeps the annotation valid on Python 3.8

from syara.engine.classifier import SemanticClassifier

class MyCustomClassifier(SemanticClassifier):
    def classify(self, rule_text: str, input_text: str) -> Tuple[bool, float]:
        # Your classification logic
        is_match = True
        confidence = 0.92
        return is_match, confidence
```

### Creating a custom LLM evaluator

```
from typing import Tuple  # keeps the annotation valid on Python 3.8

from syara.engine.llm_evaluator import LLMEvaluator

class MyCustomLLM(LLMEvaluator):
    def evaluate(self, rule_text: str, input_text: str) -> Tuple[bool, str]:
        # Your LLM evaluation logic
        is_match = True
        explanation = "Matches semantic intent"
        return is_match, explanation
```

### Using example-specific components

Specialized components used only in the demos (HTML cleaner, phishing/ClickFix classifiers, Gemini LLM) live in `examples/syara_components.py`, not in the core library. Load them with the provided helper:

```
import syara
from syara_components import get_example_config_manager  # in examples/

cfg = get_example_config_manager()  # registers html-text, deberta-clickfix, gemini, etc.
rules = syara.compile('examples/unprompted_clickfix.syara', config_manager=cfg)
matches = rules.match(html_content)
```

### Training a classifier

`TunedSBERTClassifier` supports calibration from labeled examples. Call `train()` before using the classifier in a rule:

```
import syara
from syara.engine.classifier import TunedSBERTClassifier

clf = TunedSBERTClassifier()

# Examples: (rule_text, input_text, is_match)
examples = [
    ("ignore previous instructions", "Disregard all prior prompts", True),
    ("ignore previous instructions", "What is the weather today?", False),
    # ... more examples
]
clf.train(examples)  # calibrates threshold_boost for optimal accuracy

# Register the trained classifier and use it in rules
cfg = syara.ConfigManager()
cfg.config.classifiers["my-tuned"] = clf
rules = syara.compile("rules.syara", config_manager=cfg)
```

## Session Caching

SYARA automatically caches cleaned text during rule execution:

- The cache is scoped to a single `match()` call
- It prevents redundant text cleaning when multiple rules use the same cleaner
- It is cleared automatically once matching completes
- Cache key: `hash(text + cleaner_name)`

No manual cache management is needed!

## Examples

See the [examples/](examples/) directory for:

- [basic_usage.py](examples/basic_usage.py): basic rule compilation and matching
- [custom_matcher.py](examples/custom_matcher.py): creating custom semantic matchers
- [sample_rules.syara](examples/sample_rules.syara): sample rules for prompt injection detection
- [syara_components.py](examples/syara_components.py): example-specific components (HTML cleaner, phishing classifier, ClickFix classifier, Gemini LLM) with a `get_example_config_manager()` helper for use with `syara.compile(..., config_manager=...)`
- [unprompted_clickfix.syara](examples/unprompted_clickfix.syara) + [run_clickfix_rule.py](examples/run_clickfix_rule.py): end-to-end ClickFix attack detection
- [unprompted_brand.syara](examples/unprompted_brand.syara) + [run_brand_rule.py](examples/run_brand_rule.py): brand phishing detection

## Use Cases

- **Malicious JavaScript detection**: identify injected malicious JavaScript based on known patterns
- **Prompt injection detection**: identify attempts to manipulate LLM behavior
- **Content moderation**: semantic matching of policy violations
- **Security scanning**: detect malicious patterns in user input
- **Data classification**: semantically classify sensitive information
- **Jailbreak detection**: identify attempts to bypass LLM safeguards
- **Phishing website detection**: identify web pages that resemble known phishing sites

## License

MIT License. See [LICENSE](LICENSE) for details.

## Citation

If you use SYARA in your research or project, please cite:

```
@software{syara2025,
  title = {SYARA: Super YARA Rules for LLM Security},
  author = {Mohamed Nabeel},
  year = {2025},
  url = {https://github.com/nabeelxy/syara}
}
```

## Acknowledgments

- Inspired by [YARA](https://virustotal.github.io/yara/) by Victor Alvarez
- Uses [sentence-transformers](https://www.sbert.net/) for semantic matching
- Uses [transformers](https://huggingface.co/transformers/) to build the ML models
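
As a supplement to the PHash rules described earlier, the sketch below illustrates how a `threshold` in the 0.0-1.0 range could be applied to a similarity score derived from a normalized Hamming distance. It is a minimal illustration, not SYARA's implementation: the formula `1 - distance / bits`, the 64-bit hash width, and the helper names `normalized_similarity` and `is_match` are assumptions made for this example.

```python
def hamming_distance(hash1: int, hash2: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(hash1 ^ hash2).count("1")


def normalized_similarity(hash1: int, hash2: int, bits: int = 64) -> float:
    """Map a Hamming distance to a similarity score in [0.0, 1.0].

    Assumption: similarity = 1 - distance / bits. The exact formula
    SYARA uses is not shown in this README.
    """
    return 1.0 - hamming_distance(hash1, hash2) / bits


def is_match(hash1: int, hash2: int, threshold: float, bits: int = 64) -> bool:
    """Hypothetical helper: a pattern matches when similarity >= threshold."""
    return normalized_similarity(hash1, hash2, bits) >= threshold


reference = 0xF0F0F0F0F0F0F0F0
candidate = reference ^ 0b111  # flip the three lowest bits of the reference

print(hamming_distance(reference, candidate))         # 3 differing bits
print(normalized_similarity(reference, candidate))    # 1 - 3/64
print(is_match(reference, candidate, threshold=0.9))  # True under this formula
```

Under this assumed formula, a rule with `threshold=0.9` on a 64-bit hash tolerates at most 6 differing bits, which gives a feel for how strict a given threshold is.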

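Tags: AMSI bypass, reverse DNS lookup, GenAI security, Naabu, Object Callbacks, Petitpotam, Python, SBERT, YARA, cloud computing, cloud asset visualization, content security, credential scanning, classifier, multimodal detection, LLM security, threat detection, prompt injection, text embeddings, backdoor-free, secrets management, system call monitoring, cybersecurity, phishing, disinformation detection, rule engine, jailbreak detection, reverse-engineering tools, privacy protection, cluster management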