LuanThanh2/Rule_Evasion_Detection

GitHub: LuanThanh2/Rule_Evasion_Detection

A rule evasion detection tool that uses SVM-based misuse classification and attribution analysis to help find and improve SIEM detection rules.


# Rule Evasion Detection (RED)

A reimplementation and extension of the AMIDES pipeline for detecting SIEM rule evasions. It uses SVM-based misuse classification and rule attribution across multiple Windows event types.

## Project Structure

```
rule_evasion_detection/
├── README.md
├── requirements.txt
├── run_all.sh
├── config/
│   ├── process_creation.yaml     # Process Creation experiments
│   ├── registry_event.yaml       # Registry Event experiments
│   ├── powershell.yaml           # PowerShell ScriptBlock experiments
│   └── proxy_web.yaml            # Proxy/Web URL experiments
├── red/                          # Core library
│   ├── __init__.py
│   ├── data.py                   # Data loading (benign, Sigma rules, events)
│   ├── normalize.py              # Text normalization pipeline
│   ├── features.py               # TF-IDF/Count vectorization
│   ├── models.py                 # SVC training (CPU/GPU/Intel acceleration)
│   ├── evaluate.py               # Threshold sweep & MCC scaling
│   ├── attribution.py            # Per-rule attribution evaluation
│   ├── visualize.py              # PR curves & attribution plots
│   └── persist.py                # Save/load models (pickle + zip)
└── scripts/
    ├── train.py                  # Train misuse classifier (C1/C2)
    ├── validate.py               # Validate model with evasions
    ├── evaluate.py               # MCC scaling + threshold sweep
    ├── train_attribution.py      # Train per-rule models (C3)
    ├── eval_attribution.py       # Evaluate rule attribution
    ├── plot.py                   # Generate figures
    ├── run_pipeline.py           # Run all steps end-to-end
    ├── generate_evasions.py      # Generate evasion events from match events
    ├── hayabusa_to_matches.py    # Convert Hayabusa JSONL → AMIDES match events
    ├── otrf_to_matches.py        # Convert OTRF datasets → AMIDES match events
    ├── lmd_to_benign.py          # Convert LMD Collections CSV → benign samples
    ├── mpsd_to_benign.py         # Convert MPSD .ps1 files → benign PowerShell samples
    ├── mpsd_to_malicious.py      # Filter MPSD malicious .ps1 by Sigma patterns
    ├── secrepo_to_benign.py      # Extract URLs from Squid access.log → benign samples
    └── train_all.sh              # Shell script to train all event types
```

## Installation

```
cd rule_evasion_detection
pip install -r requirements.txt
```

Dependencies: `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `pyyaml`, `luqum`, `scikit-learn-intelex` (Intel CPU acceleration), `tqdm`

### Optional GPU Acceleration

```
# NVIDIA GPU (RAPIDS cuML, requires CUDA)
pip install cuml-cu12
```

The training code automatically detects the best available backend: NVIDIA GPU (cuML) > Intel CPU (scikit-learn-intelex) > plain scikit-learn on CPU.

## Supported Event Types

| Event Type | Extracted Field | Config File |
|------------|-----------------|-------------|
| `process_creation` | `process.command_line` | `config/process_creation.yaml` |
| `registry_event` | `winlog.event_data.TargetObject` | `config/registry_event.yaml` |
| `powershell` | `winlog.event_data.ScriptBlockText` | `config/powershell.yaml` |
| `proxy_web` | `url` / `c-uri` / `cs-uri-stem` | `config/proxy_web.yaml` |

## Quick Start

```
# Run the full pipeline for process creation events
python scripts/run_pipeline.py --config config/process_creation.yaml

# Skip attribution (faster)
python scripts/run_pipeline.py --config config/process_creation.yaml --skip-attribution

# Skip plots
python scripts/run_pipeline.py --config config/process_creation.yaml --skip-plots
```

## Pipeline Overview

```
┌──────────────────────────────────────────────────────────────┐
│                       DATA PREPARATION                       │
│                                                              │
│  Benign data sources:                                        │
│    LMD Collections (CSV)   ──→ lmd_to_benign.py              │
│    MPSD PowerShell .ps1    ──→ mpsd_to_benign.py             │
│    SecRepo Squid log       ──→ secrepo_to_benign.py          │
│                                                              │
│  Match event sources:                                        │
│    Hayabusa JSONL          ──→ hayabusa_to_matches.py        │
│    OTRF Security-Datasets  ──→ otrf_to_matches.py            │
│                                                              │
│  Evasion generation:                                         │
│    Match events            ──→ generate_evasions.py          │
│    (remove_exe, double_space, backtick_insert, ...)          │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                      TRAINING PIPELINE                       │
│                                                              │
│  1. Load data (data.py)                                      │
│     ├── Benign: txt / jsonl / json / csv (multi-format)      │
│     ├── Rule filters: AMIDES YAML or raw Sigma detection     │
│     └── Match/Evasion events: per-rule JSON files            │
│                                                              │
│  2. Normalize (normalize.py)                                 │
│     Filter dummy → lowercase → tokenize → filter             │
│     numeric/long tokens → sort → comma-join                  │
│                                                              │
│  3. Vectorize (features.py)                                  │
│     TF-IDF / Count / Binary / Hashing / Scaled Count         │
│                                                              │
│  4A. Misuse Classification C1/C2 (train.py)                  │
│      GridSearchCV over C ∈ logspace(-2,1,50), class_weight   │
│      → best SVC + MCC-based scaler                           │
│                                                              │
│  4B. Rule Attribution C3 (train_attribution.py)              │
│      Per-rule binary SVC (benign vs rule_i) → rank by df     │
│                                                              │
│  5. Evaluate & Plot                                          │
│     Threshold sweep P/R/F1/MCC → PR-Threshold plot           │
│     Top-k hit rates → Attribution distribution + CDF plot    │
└──────────────────────────────────────────────────────────────┘
```

## Core Library (`red/`)

### `red/data.py`: Data Loading

**Benign sample loading** (multi-format, memory-efficient):
- `load_benign_samples(path, field)`: loads all samples into a list at once
- `benign_samples_iter(path, field, max_samples)`: yields samples one at a time (avoids OOM on large datasets)
- `count_benign_samples(path, field, max_samples)`: counts samples without loading them
- Supported formats: `.txt` (plain text), `.jsonl`/`.ndjson`, `.json`, `.csv`
- `field`: a dot-separated path, e.g. `"process.command_line"`, `"winlog.event_data.TargetObject"`

**Rule set loading**:
- `load_rule_set(events_dir, rules_dir, evasions_dir)`: loads rules together with their matches and evasions
- Supports a separate `evasions_dir` (generated evasions are kept apart from match events)
- Falls back to rules-only mode when `events_dir` is missing
- `RuleData` dataclass: `name`, `filters`, `matches`, `evasions`, `sigma_values`

**Sigma filter/detection extraction**:
- `extract_filter_values(filter_str, search_fields)`: parses AMIDES Lucene filters with luqum
- `get_all_filter_values(rule_set, search_fields)`: collects values across all rules (AMIDES + raw Sigma)
- `get_all_yaml_filter_values(rules_dir, search_fields)`: scans all YAML files directly (handles filename mismatches)
- `_extract_sigma_detection_values(detection, search_fields)`: extracts values from raw Sigma detection blocks

**Helper functions**: `extract_commandlines()`, `get_all_matches()`, `get_all_evasions()`, `create_labels()`

### `red/normalize.py`: Text Normalization

The `Normalizer` class runs a 6-step pipeline:
1. `FilterDummyCharacters`: removes `"`, `^`, `` ` ``, `'`
2. `Lowercase`
3. `Tokenize`: splits on `\w+` word boundaries
4. `FilterNumeric`: removes hex/numeric tokens longer than `max_num_len` (default 3)
5. `FilterStrings`: removes tokens longer than `max_str_len` (default 30)
6. Sorts the tokens alphabetically and joins them with commas

`normalize_samples(samples)`: normalizes a batch and drops empty results.

### `red/features.py`: Feature Extraction

`create_vectorizer(method, ngram_range, analyzer)`: creates an sklearn vectorizer:

| Method | Description |
|--------|-------------|
| `tfidf` | TF-IDF (default) |
| `count` | Raw token counts |
| `binary_count` | Binary presence/absence |
| `hashing` | Hashing vectorizer (memory-efficient) |
| `scaled_count` | Count vectorizer + MaxAbsScaler pipeline |

`comma_tokenizer(text)`: splits a normalized sample on commas.

### `red/models.py`: SVM Training

The acceleration backend is detected automatically at import time:
- **NVIDIA GPU**: RAPIDS cuML SVC (fastest for large datasets)
- **Intel CPU**: SVC patched by scikit-learn-intelex
- **Plain CPU**: standard scikit-learn SVC

`train_svc_gridsearch(X, y, param_grid, scoring, cv, n_jobs)`:
- Default grid: `C ∈ logspace(-2, 1, 50)`, `kernel=linear`, `class_weight ∈ [balanced, None]`
- Shows a tqdm progress bar during training
- GPU mode: grid search on the CPU, then the final fit on the GPU
- Returns: `(estimator, best_params, best_score)`

`train_svc_fixed(X, y, C, kernel, class_weight)`: trains with fixed parameters.

### `red/evaluate.py`: Evaluation and MCC Scaling

`create_mcc_scaler(df_values, labels, num_samples, mcc_threshold)`:
- Two-pass sweep: a coarse pass, then a refined pass within the range where MCC > threshold
- Makes the range symmetric around 0
- Computes a shift so that the MCC optimum lands at 0.5
- Returns `(MinMaxScaler, shift)`

`scale_df_values(df_values, scaler, shift)`: applies the shift and scaling, then clips to [0, 1].

`BinaryEvaluation(num_thresholds)`:
- `.evaluate(labels, scaled_scores)`: iterates over `num_thresholds+1` thresholds and computes P/R/F1/MCC/TP/FP/TN/FN
- `.optimal_threshold_idx()`: index of the threshold with the maximum F1
- `.default_threshold_idx()`: index of the threshold 0.5
- `.summary()`: dict of metrics at the optimal and default thresholds

### `red/attribution.py`: Rule Attribution

`RuleAttributionEvaluation(num_rules)`:
- `.evaluate_single(true_rule, ranked_attributions)`: evaluates a single evasion
- `.calculate_hit_rates()`: converts counts into hit rates
- `.summary()`: cumulative Top-1/5/10 hit rates + TP/FP/TN/FN

`score_evasion(sample_vector, rule_models)`: scores a single sample against all rule models and returns a ranked list.

`process_evasions_batch(normalized_samples, evasion_to_rule, rule_models)`: evaluates a batch, applying the per-rule transforms.

### `red/visualize.py`: Plotting

`plot_pr_threshold(evaluations, labels, output_path, title)`:
- 2×2 subplots: precision / recall / F1 score / MCC as a function of the threshold
- Marks the optimal threshold (per evaluation) and the default 0.5

`plot_attribution(top_n_hits, output_path, title)`:
- Bar chart: distribution of attribution ranks
- Line chart: cumulative distribution function (CDF)

### `red/persist.py`: Persistence

`save_result(obj, name, output_dir, info)`: serializes to a ZIP archive (maximum compression) plus a JSON sidecar file.

`load_result(path)`: loads from a ZIP archive.

File naming conventions:
- `train_rslt_.zip`: TrainingResult
- `valid_rslt_.zip`: ValidationResult
- `eval_rslt_.zip`: EvaluationResult
- `_info.json`: human-readable metadata sidecar file

## Scripts (`scripts/`)

### `scripts/train.py`: Train the Misuse Classifier

```
python scripts/train.py --config config/process_creation.yaml

# CLI arguments (override the config):
#   --malicious-samples: rule_filters | matches | both
#   --max-benign-samples: cap on the benign count (avoids OOM)
python scripts/train.py \
    --benign-samples ~/data/benign/process_creation/benign_train.txt \
    --events-dir ~/data/sigma/events_hayabusa/windows/process_creation \
    --rules-dir ~/data/sigma/rules/windows/process_creation \
    --malicious-samples both \
    --vectorization tfidf \
    --search-params \
    --scoring f1 \
    --cv 5 \
    --mcc-scaling \
    --max-benign-samples 50000 \
    --out-dir models/process_creation \
    --result-name misuse_svc_rules_f1
```

**Output**: `models/process_creation/train_rslt_misuse_svc_rules_f1.zip`

### `scripts/validate.py`: Validate the Model

```
python scripts/validate.py \
    --config config/process_creation.yaml \
    --result-path models/process_creation/train_rslt_misuse_svc_rules_f1.zip
```

**Output**: `models/process_creation/valid_rslt_misuse_svc_rules_f1.zip`

### `scripts/evaluate.py`: Threshold Sweep Evaluation

```
python scripts/evaluate.py \
    --config config/process_creation.yaml \
    --result-path models/process_creation/valid_rslt_misuse_svc_rules_f1.zip \
    --num-thresholds 50
```

**Output**: `models/process_creation/eval_rslt_misuse_svc_rules_f1.zip`

### `scripts/train_attribution.py`: Train Per-Rule Models

```
python scripts/train_attribution.py \
    --config config/process_creation.yaml \
    --model-params models/process_creation/train_rslt_misuse_svc_rules_f1.zip
```

**Output**: `models/process_creation/train_rslt_attr_svc_rules.zip`

### `scripts/eval_attribution.py`: Evaluate Rule Attribution

```
python scripts/eval_attribution.py \
    --config config/process_creation.yaml \
    --result-path models/process_creation/train_rslt_attr_svc_rules.zip
```

**Output**: `models/process_creation/eval_attr_attr_svc_rules.zip`

### `scripts/plot.py`: Generate Figures

```
# PR-Threshold plot
python scripts/plot.py pr \
    --result-paths models/process_creation/eval_rslt_misuse_svc_rules_f1.zip \
    --output figures/figure_3_misuse_classification.pdf

# Attribution plot
python scripts/plot.py attr \
    --result-path models/process_creation/eval_attr_attr_svc_rules.zip \
    --output figures/figure_4_rule_attribution.pdf
```

### `scripts/run_pipeline.py`: Run the Full Pipeline

```
python scripts/run_pipeline.py --config config/process_creation.yaml
python scripts/run_pipeline.py --config config/process_creation.yaml --skip-attribution
python scripts/run_pipeline.py --config config/process_creation.yaml --skip-plots
```

Runs all steps in order: train → validate → evaluate → attribution → plot.

## Data Preparation Scripts

### `scripts/generate_evasions.py`: Generate Evasion Events

Generates evasion variants of match events by applying transformation techniques that bypass Sigma rule patterns. Supported transformations per event type:

| Event Type | Techniques |
|------------|------------|
| `process_creation` | `remove_exe`, `double_space`, `quote_wrap_flags`, `case_upper`, `case_lower`, `env_systemroot`, `env_temp`, `long_flag_o` |
| `registry_event` | `hklm_expand`, `hklm_abbrev`, `hku_expand`, `hku_abbrev`, `case_lower`, `case_upper`, `trailing_backslash` |
| `powershell` | `backtick_insert`, `case_mix`, `concat_keywords`, `double_space`, `env_comspec` |

The result of each technique is checked to confirm that it really evades the rule; an evasion that still matches any rule pattern is discarded.

Output: `evasions_dir//_Evasion__.json`

### `scripts/hayabusa_to_matches.py`: Hayabusa JSONL → Match Events

Converts the JSONL output of the Hayabusa security scanner into per-rule match JSON files in AMIDES format.

```
python scripts/hayabusa_to_matches.py \
    --input hayabusa_matches.jsonl \
    --output-dir ~/data/sigma/events_hayabusa/windows/process_creation \
    --event-type process_creation

python scripts/hayabusa_to_matches.py \
    --input hayabusa_registry.jsonl \
    --output-dir ~/data/sigma/events_hayabusa/windows/registry_event \
    --event-type registry_event

python scripts/hayabusa_to_matches.py \
    --input hayabusa_powershell.jsonl \
    --output-dir ~/data/sigma/events_hayabusa/windows/powershell \
    --event-type powershell
```

Enriches events with AMIDES-compatible fields (`process.command_line`, `winlog.event_data.TargetObject`, etc.). Events are grouped by `RuleTitle`, the title is normalized to snake_case, and the result is written to `_Match_.json`.

### `scripts/otrf_to_matches.py`: OTRF Datasets → Match Events

Converts OTRF Security-Datasets JSON files into AMIDES-format match events by matching CommandLine values against Sigma rule patterns.

```
python scripts/otrf_to_matches.py \
    --otrf-dir ~/data/Security-Datasets/datasets/atomic/windows \
    --rules-dir ~/data/sigma/rules/windows/process_creation \
    --output-dir ~/data/sigma/events_otrf/windows/process_creation
```

Supports several OTRF JSON formats (Winlogbeat ECS, raw WEL, simplified Sysmon) and wildcard (`*`) matching in Sigma patterns.

### `scripts/lmd_to_benign.py`: LMD Collections → Benign Samples

Converts the Lateral Movement Dataset (LMD) Collections CSV files into benign sample text files, one per event type.

```
python scripts/lmd_to_benign.py \
    --lmd-dir ~/datasets/benign_data/Lateral-Movement-Dataset--LMD_Collections \
    --output-dir ~/data/benign
```

Maps EventID → event type: `1 → process_creation`, `12/13/14 → registry_event`. Samples are deduplicated automatically, and both the LMD-2022 and LMD-2023 subsets are processed.

### `scripts/mpsd_to_benign.py`: MPSD PowerShell → Benign Samples

Converts the benign PowerShell `.ps1` files from das-lab/mpsd into `benign_train.txt`. Each file becomes one line (newlines are collapsed to spaces).

```
python scripts/mpsd_to_benign.py \
    --mpsd-dir ~/datasets/benign_data/mpsd/powershell_benign_dataset \
    --output-dir ~/data/benign/powershell
```

### `scripts/mpsd_to_malicious.py`: MPSD PowerShell → Malicious Samples

Filters the malicious `.ps1` files from das-lab/mpsd by Sigma rule patterns, keeping only files that would trigger at least one Sigma rule.

```
python scripts/mpsd_to_malicious.py \
    --mpsd-dir ~/datasets/malicious_data/mpsd/malicious_pure \
    --rules-dir ~/data/sigma/rules/windows/powershell \
    --output ~/data/benign/powershell/malicious_extra.txt
```

### `scripts/secrepo_to_benign.py`: SecRepo Squid Log → Benign URLs

Extracts HTTP/HTTPS URLs from a Squid `access.log` for the proxy/web experiments. Skips `CONNECT` (HTTPS tunnel) and non-HTTP entries.

```
python scripts/secrepo_to_benign.py \
    --input ~/datasets/benign_data/access.log/access.log \
    --output-dir ~/data/benign/proxy_web
```

## Config File Format

```
data:
  benign_train: ~/data/benign/process_creation/benign_train.txt
  benign_valid: ~/data/benign/process_creation/benign_train.txt
  benign_field: process.command_line   # dot-path to extract from JSON/CSV
  events_dir: ~/data/sigma/events_hayabusa/windows/process_creation
  evasions_dir: ~/data/sigma/evasions/windows/process_creation   # separate evasions dir
  rules_dir: ~/data/sigma/rules/windows/process_creation
  search_fields:
    - process.command_line
  max_benign_samples: 50000   # optional cap (proxy_web has 1.5M URLs)
  malicious_extra: ~/data/benign/powershell/malicious_extra.txt   # extra malicious samples

training:
  malicious_samples: both     # rule_filters | matches | both
  vectorization: tfidf        # tfidf | count | binary_count | hashing | scaled_count
  ngram_range: [1, 1]
  search_params: true         # GridSearchCV (false = fixed default params)
  scoring: f1                 # f1 | mcc
  cv_folds: 5
  num_jobs: 3

scaling:
  mcc_scaling: true
  mcc_threshold: 0.1
  num_mcc_samples: 50

evaluation:
  num_thresholds: 50

output:
  dir: models/process_creation
  result_name: misuse_svc_rules_f1
  attr_result_name: attr_svc_rules
```

## Data Formats

### Benign Samples

Several formats are supported; the format is detected automatically from the file extension:

```
# .txt: one value per line
C:\Windows\System32\cmd.exe /c ipconfig
powershell.exe -ExecutionPolicy Bypass -File script.ps1

# .jsonl: one JSON object per line
{"process": {"command_line": "cmd.exe /c whoami"}}

# .csv: a header is required; the column is looked up by the benign_field name
CommandLine,User,Host
"cmd.exe /c dir",SYSTEM,WORKSTATION
```

### Match/Evasion Events (JSON)

```
{
  "process": {
    "command_line": "cscript.exe //nologo malicious.vbs"
  },
  "winlog": {
    "event_data": {
      "CommandLine": "cscript.exe //nologo malicious.vbs",
      "Image": "C:\\Windows\\System32\\cscript.exe"
    }
  }
}
```

File name format: `_Match_.json`, `_Evasion__.json`

### Sigma Rules (AMIDES YAML Format)

```
- filter: 'process.command_line:"cscript" AND process.command_line:"malicious"'
  pre_detector:
    title: "Suspicious CScript Execution"
```

The standard Sigma HQ YAML format (with a `detection:` block) is also supported.

## Pipeline Mapping to the AMIDES Paper

| Paper Section | Experiment | Scripts | Output |
|---------------|------------|---------|--------|
| C1 | Misuse classification (rule_filters) | `train.py` → `validate.py` → `evaluate.py` | Figure 3 (PR-Threshold) |
| C2 | Misuse classification (matches) | `train.py --malicious-samples matches` → ... | Figure 3 |
| C3 | Rule attribution | `train_attribution.py` → `eval_attribution.py` | Figure 4 (Distribution+CDF) |
| n/a | Evasion generation (own extension) | `generate_evasions.py` | Evasion JSON files |
| n/a | Data preparation (own extension) | `hayabusa_to_matches.py`, `lmd_to_benign.py`, etc. | Benign/match datasets |
| Visualization | All figures | `plot.py pr` / `plot.py attr` | PDF figures |
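
The six normalization steps listed under `red/normalize.py` can be illustrated with a short standard-library sketch. This is not the code in `red/normalize.py`: the function names `normalize` and `normalize_batch`, the three regular expressions, and the decision to keep duplicate tokens are assumptions made for this example, since the README names the steps but not their exact patterns.

```python
import re

# Assumed patterns; the README only names the steps, not the regexes.
DUMMY_CHARS = re.compile(r"[\"^`']")              # step 1: ", ^, `, '
TOKEN = re.compile(r"\w+")                        # step 3: runs of word characters
HEX_OR_NUMERIC = re.compile(r"(?:0x)?[0-9a-f]+")  # step 4: hex or decimal token


def normalize(sample: str, max_num_len: int = 3, max_str_len: int = 30) -> str:
    """Return the sorted, comma-joined tokens of one sample."""
    cleaned = DUMMY_CHARS.sub("", sample).lower()  # steps 1 and 2
    kept = []
    for token in TOKEN.findall(cleaned):           # step 3
        if len(token) > max_num_len and HEX_OR_NUMERIC.fullmatch(token):
            continue                               # step 4: long hex/numeric token
        if len(token) > max_str_len:
            continue                               # step 5: overly long token
        kept.append(token)
    return ",".join(sorted(kept))                  # step 6


def normalize_batch(samples):
    """Normalize every sample and drop the ones that end up empty."""
    normalized = (normalize(s) for s in samples)
    return [n for n in normalized if n]


if __name__ == "__main__":
    print(normalize('C:\\Windows\\System32\\cmd.exe /c "ipconfig" 12345'))
    # prints: c,c,cmd,exe,ipconfig,system32,windows
```

Because the output is sorted and comma-joined, an evasion that only reorders flags, inserts `^` or backticks, or changes case maps to the same string as the original command. Under the assumed hex pattern, ordinary words made only of hex letters (for example `cafe`) are also dropped once they exceed `max_num_len`.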

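Tags: Apex, CountVectorizer, IPv6, MCC, Mutation, PowerShell, PR curve, RED, Rule Evasion Detection, SVM, TF-IDF, URL, web proxy, Windows events, event sourcing, classification, visualization, security information and event management, anomaly detection, support vector machine, data loading, text vectorization, machine learning, model persistence, model training, model evaluation, registry events, feature engineering, pipelining, script automation, rule attribution, misuse detection, process creation, reverse engineering tools, threshold sweep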