intezer/PyJSClear

GitHub: intezer/PyJSClear

A pure-Python JavaScript deobfuscator with no Node.js dependency. It supports 16 AST transforms for handling obfuscation techniques such as those produced by obfuscator.io.

Stars: 4 | Forks: 0


# PyJSClear

A pure-Python JavaScript deobfuscator. It combines the functionality of [obfuscator-io-deobfuscator](https://github.com/ben-sb/obfuscator-io-deobfuscator) (13 AST transforms targeting obfuscator.io output) and [javascript-deobfuscator](https://github.com/ben-sb/javascript-deobfuscator) (hex escape decoding, static array unpacking, property access cleanup) into a single Python library with no Node.js dependency.

## Installation

```
pip install pyjsclear
```

For development:

```
git clone https://github.com/intezer/PyJSClear.git
cd PyJSClear
pip install -e .
pip install pytest
```

## Usage

### Python API

```
from pyjsclear import deobfuscate, deobfuscate_file

# From a string
cleaned = deobfuscate(obfuscated_code)

# From a file
deobfuscate_file("input.js", "output.js")

# Or get the result as a string
cleaned = deobfuscate_file("input.js")
```

### Command line

```
# File to stdout
pyjsclear input.js

# File to file
pyjsclear input.js -o output.js

# Stdin to stdout
cat input.js | pyjsclear -

# With a custom iteration limit
pyjsclear input.js --max-iterations 20
```

## Features

PyJSClear applies 16 transforms in a multi-pass loop until no further changes occur (at most 50 iterations by default):

| # | Transform | Description |
|---|-----------|-------------|
| 1 | **StringRevealer** | Decodes obfuscator.io string arrays (basic, base64, RC4), including rotation IIFEs, wrapper functions, multiple decoders per file, and rotation patterns wrapped in a SequenceExpression |
| 2 | **HexEscapes** | Normalizes `\xHH`/`\uHHHH` escape sequences in string literal AST nodes |
| 3 | **UnusedVariableRemover** | Removes variables with zero references |
| 4 | **ConstantProp** | Propagates constant literals to every reference site |
| 5 | **ReassignmentRemover** | Eliminates redundant `x = y` reassignment chains |
| 6 | **DeadBranchRemover** | Removes unreachable `if(true)/if(false)` and ternary branches |
| 7 | **ObjectPacker** | Merges sequential `obj.x = ...` assignments into an object literal |
| 8 | **ProxyFunctionInliner** | Inlines single-return proxy functions at every call site |
| 9 | **SequenceSplitter** | Splits comma expressions `(a(), b(), c())` into separate statements; extracts the `(0, fn)(args)` indirect-call prefix; normalizes loop/if bodies into block statements |
| 10 | **ExpressionSimplifier** | Evaluates static expressions: `3 + 5` -> `8`, `![]` -> `false`, `typeof undefined` -> `"undefined"`, `test ? false : true` -> `!test` |
| 11 | **LogicalToIf** | Converts `a && b()` / `a \|\| b()` in statement position into if statements |
| 12 | **ControlFlowRecoverer** | Recovers linear code from the `"1\|0\|3".split("\|")` + `while/switch` dispatch pattern |
| 13 | **PropertySimplifier** | Converts `obj["prop"]` to `obj.prop` where valid |
| 14 | **AntiTamperRemover** | Removes self-defending and anti-debugging IIFEs |
| 15 | **ObjectSimplifier** | Inlines proxy-object property accesses |
| 16 | **StringRevealer** | Second pass to catch strings exposed by the earlier transforms |

### Safety guarantees

- **Never expands output**: if the deobfuscated result is larger than the input, the original code is returned unchanged.
- **Never crashes on valid JS**: parse errors fall back to returning the original source. Transform exceptions are caught per transform, and the failing transform is skipped.

## Testing

```
pytest tests/                     # all tests
pytest tests/test_regression.py   # regression suite (35 tests across 25 samples)
pytest tests/ -n auto             # parallel execution (requires pytest-xdist)
```

Validated against six datasets totaling 47,836 files (full datasets, no sampling):

| Dataset | Files | Crashes | Expanded | Reduced | Source |
|---------|-------|---------|----------|---------|--------|
| E1 technique samples | 20 | 0 | 0 | 13 | [JSIMPLIFIER](https://zenodo.org/records/17531662) |
| Kaggle Obfuscated | 1,477 | 0 | 0 | 1,199 | [Kaggle](https://www.kaggle.com/datasets/fanbyprinciple/obfuscated-javascript-dataset) |
| Kaggle NotObfuscated | 1,898 | 0 | 0 | 217 | [Kaggle](https://www.kaggle.com/datasets/fanbyprinciple/obfuscated-javascript-dataset) |
| MalJS (malware) | 23,212 | 0 | 0 | 3,193 | [JSIMPLIFIER](https://zenodo.org/records/17531662) |
| BenignJS | 21,209 | 0 | 0 | 4,354 | [JSIMPLIFIER](https://zenodo.org/records/17531662) |
| E1 originals (clean) | 20 | 0 | 0 | 15 | [JSIMPLIFIER](https://zenodo.org/records/17531662) |

Files larger than 200KB, or that exceed the 15-second wall-clock timeout, are skipped and counted as unchanged (14,529 in MalJS, 940 in BenignJS). The BenignJS reductions are genuine deobfuscation of obfuscated JS scraped from benign websites. A few Kaggle NotObfuscated files are mislabeled (Angular test specs that are in fact obfuscated). The reductions on the E1 originals come from minor whitespace and formatting cleanup by the code generator.

**Head-to-head comparison with the Node.js tools** (the obfuscator-io-deobfuscator + javascript-deobfuscator pipeline): on the Kaggle Obfuscated dataset (1,477 files), PyJSClear reduced 1,199 files while the Node.js pipeline changed none. That dataset's lightweight obfuscation (hex escapes, basic string arrays without `parseInt` checksums) falls outside obfuscator-io-deobfuscator's detection heuristics. On the E1 and MalJS datasets (heavy obfuscation), PyJSClear produced smaller output on 93.8% of the files where at least one tool changed the output, owing to dead-code removal, proxy function inlining, bracket-to-dot conversion, and control-flow recovery.

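The crash/expanded/reduced columns in the validation table could be tallied along the following lines. This is a minimal sketch, not the project's evaluation harness: `Tally`, `tally`, and `MAX_BYTES` are hypothetical names, the `deobfuscate` argument is any string-to-string callable (for example `pyjsclear.deobfuscate`), 200KB is assumed to mean 200 * 1024 bytes, and the 15-second timeout is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Assumption: "200KB" is read as 200 * 1024 bytes of UTF-8 text.
MAX_BYTES = 200 * 1024


@dataclass
class Tally:
    files: int = 0
    crashes: int = 0
    expanded: int = 0
    reduced: int = 0
    skipped: int = 0


def tally(sources: Iterable[str], deobfuscate: Callable[[str], str]) -> Tally:
    """Classify each source by how its size changes after deobfuscation."""
    result = Tally()
    for source in sources:
        result.files += 1
        if len(source.encode("utf-8")) > MAX_BYTES:
            result.skipped += 1  # oversized files count as unchanged
            continue
        try:
            cleaned = deobfuscate(source)
        except Exception:
            result.crashes += 1
            continue
        if len(cleaned) > len(source):
            result.expanded += 1
        elif len(cleaned) < len(source):
            result.reduced += 1
    return result
```

With PyJSClear installed, passing `deobfuscate` from `pyjsclear` as the second argument would exercise the real library; any other callable with the same shape works for a dry run.
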
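
**Parsing coverage**: PyJSClear uses [esprima2](https://github.com/s0md3v/esprima2), which supports ES2024 syntax, including arrow functions, optional chaining, nullish coalescing, and more.

## Architecture

Built on [esprima2](https://github.com/s0md3v/esprima2) (an ESTree-compatible JS parser with ES2024 support), with a custom code generator, an AST traverser (enter/exit/replace/remove), and scope analysis. Transforms run in a fixed order inside the convergence loop. StringRevealer runs both first and last, so that string arrays are handled before and after the other transforms modify the wrapper-function structure.

## Limitations

- Large files (>100KB) with deep obfuscation may be slow because of the multi-pass architecture. Consider using `max_iterations` to limit the number of passes.
- Not every obfuscator.io configuration is handled: some advanced string-encoding modes may not be fully decoded. Supported encodings: basic (index lookup), base64, RC4, and multi-decoder (multiple encoding types sharing one string array).

## License

Apache License 2.0. See [LICENSE](LICENSE). This project is a derivative work based on [obfuscator-io-deobfuscator](https://github.com/ben-sb/obfuscator-io-deobfuscator) (Apache 2.0) and [javascript-deobfuscator](https://github.com/ben-sb/javascript-deobfuscator) (Apache 2.0). For full attribution, see [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md) and [NOTICE](NOTICE).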
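
## Appendix: convergence loop sketch

The fixed-order convergence loop, the per-transform exception handling, and the never-expand guarantee described in this README can be illustrated with a toy model. This is a sketch under assumptions, not PyJSClear's actual code: `run_pipeline` and the three toy transforms are hypothetical, and transforms are modeled as plain string-to-string functions, whereas the library itself operates on an AST.

```python
from typing import Callable, Sequence

Transform = Callable[[str], str]


def run_pipeline(source: str, transforms: Sequence[Transform],
                 max_iterations: int = 50) -> str:
    """Apply transforms in a fixed order until a pass changes nothing."""
    current = source
    for _ in range(max_iterations):
        before = current
        for transform in transforms:  # same order on every pass
            try:
                current = transform(current)
            except Exception:
                continue  # a failing transform is skipped, not fatal
        if current == before:
            break  # converged: the last pass made no change
    # Never-expand guard: fall back to the input if the result grew.
    return current if len(current) <= len(source) else source


# Toy stand-ins for real AST transforms (hypothetical).
def fold_addition(code: str) -> str:
    return code.replace("3 + 5", "8")


def bracket_to_dot(code: str) -> str:
    return code.replace('obj["prop"]', "obj.prop")


def always_fails(code: str) -> str:
    raise RuntimeError("simulated transform bug")


print(run_pipeline('x = obj["prop"] + (3 + 5);',
                   [fold_addition, always_fails, bracket_to_dot]))
# prints: x = obj.prop + (8);
```

Lowering `max_iterations` caps the number of passes, which is the trade-off the Limitations section suggests for large, deeply obfuscated files.
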
