gonll/algex

GitHub: gonll/algex

基于GF(2⁸)上Berlekamp-Massey算法与LFSR编码的代数压缩器，专门处理统计压缩器因高熵而无法识别的线性递归结构数据。

Stars: 0 | Forks: 0

# pade-compress **GF(2⁸) 线性递归的代数模型提取器 + 残差编码器。** 在 GF(2⁸) 上运行 Berlekamp-Massey / Padé 逼近，以提取生成字节流的最短 LFSR —— 然后仅存储模型和稀疏 XOR 残差。适用于统计压缩器（gzip、brotli、zstd）从根本上无法识别的数据。 ## 这个工具能做到而其他工具做不到的一件事统计压缩器利用的是符号频率和重复的字节模式。它们无法看到**代数结构**。 GF(2⁸) m 序列具有约 8 bits/byte 的 Shannon 熵 —— 根据任何信息论指标，它都与白噪声无法区分。gzip 会将其视为不可压缩并原样输出。该编解码器能将同样的数据压缩到 **1.2%** —— 达到 **85:1 的压缩比** —— 因为它在正确的领域中进行操作。 | 文件 | gzip | pade-compress | 压缩比 | |------|------|---------------|-------| | GF(2⁸) 几何序列 (L=1，完美) | ~100% | **~0.3%** | ~333:1 | | 混合 LFSR (L=1/2/3 + 8% 噪声) 1 MB | ~99% | **1.1%** | **91:1** | | 二进制可执行文件 (`/bin/ls`) | ~60% | **19.3%** | 5.1:1 | | 自然语言文本 | ~35% | 100.1% | — | | WebP 图像 | ~100% | ~100.5% | — | 适用范围是具有 **GF(2⁸) 线性递归结构** 的数据。在此范围之外，它会快速检测到不匹配并直接退回为透传（passthrough），开销可以忽略不计。 ## 先分析 —— 如果合适再压缩 ``` npm run analyze ./your-prbs-stream.bin ``` ``` File: ./test/gf-structured.bin Size: 1,048,576 bytes Entropy: 7.888 bits/byte Structured: 100.0% algebraic Avg L: 1.75 (weighted LFSR order) Verdict: 100% algebraically structured — compresses extremely well (LFSR/PRBS data) ──────────────────────────────────────────────────────────── Segments (11): + 0 [ 262,140 B] PRBS-8 m-sequence (maximal-length, perfect) noise 0.0% coeffs [0x03] + 262140 [ 262,144 B] L=2 LFSR (exact, period unknown) noise 0.0% coeffs [0x1b,0x4e] + 524284 [ 237,249 B] L=3 LFSR (~2.0% noise) noise 2.0% coeffs [0x57,0x2f,0x11] + 786433 [ 131,071 B] order 85 (period 85) noise 0.0% coeffs [0x07] + 917504 [ 131,072 B] PRBS-8 m-sequence (maximal-length, perfect) noise 0.0% coeffs [0xe3] ``` `--analyze` 会告诉你*存在什么样的代数结构*，即使你不打算压缩。该判定结果驱动路由决策：结构化数据 → 在此压缩；非结构化数据 → 回退使用 zstd/brotli。 ## 工作原理每个字节序列都可以被测试：在 GF(2⁸) 上是否存在短的递推关系 `s[i] = c₁·s[i-1] ⊕ c₂·s[i-2] ⊕ … ⊕ cₗ·s[i-L]`？ **Berlekamp-Massey 算法**能在 O(n²) 时间内找到最短的此类递推关系（最小 LFSR）。如果 LFSR 阶数 L 相对于序列长度 n 较小，则存储 `(L 个系数 + L 个种子字节 + 稀疏残差)` 会远小于原始字节。 ### 编码管线 ``` Input bytes │ ├─ Adaptive chunking (splits at entropy discontinuities; boundaries refined ±4 bytes) │ └─ Per chunk — 15 encoding paths tried in order: ├─ 1. Padé [k/L] search (tries offsets 0..32, finds best k + shortest L) ├─ 2. Approx L=1 (brute-forces all 255 GF coefficients for noisy L=1 data) ├─ 3. Approx L=2 (voting over GF quadruples, covers up to ~28% byte noise) ├─ 4. Approx L=3 (voting over quintuple pairs, covers up to ~23% byte noise) ├─ 5. Approx L=4,5 (sub-sequence BM voting for higher-order LFSRs) ├─ 6. Affine L=1 (y[n] = c·y[n-1] ⊕ b via shift normalisation to pure L=1) ├─ 7. Cyclic (exact period detection for lookup tables and repeating patterns) ├─ 8-10. Delta transforms (XOR-diff, ADD-diff, XOR-2nd-diff → re-run paths 1-7) │ Gated: only attempted when entropy is high AND algebraicity score is low. ├─ 11-13. Interleave m=2,3,4 (split byte lanes → encode independently → merge) │ Gated: BM pre-screen on each lane to avoid spending time on random data. ├─ 14. Bit-plane decomposition (each of 8 bit planes encoded independently) │ Gated: BM pre-screen per plane; useful for ADC/DAC samples and firmware images. └─ 15. Raw passthrough (if no representation is smaller) ``` 每条近似路径（2-5）在找到 LFSR 多项式后会应用**种子去噪（seed denoising）**：对每个种子字节遍历所有 256 个候选值，并选择能使总残差最小的值，从而在 O(L×256×N) 时间内消除系统性的初始窗口误差。 **双重前置门控（Dual pre-gate）** 控制转换路径 8-14 是否运行： - 门控 1（熵）：如果数据在统计上已经可以被压缩（H < 原始数据的 60%），则跳过 —— 例如文本、标头、已压缩的字节。 - 门控 2（代数性）：如果数据过于随机（通过滑动 16 字节窗口的 BM 复杂度波动来衡量），则跳过。高波动 = 加密/噪声 → 转换无济于事。只有同时具有高熵和代数结构的数据才会到达路径 8-14。这正是能从基于转换的编码中受益的那类数据。 ### 传输格式 (v4) ``` File header: [4] magic "PAD4" [4] originalSize [4] chunkCount Raw chunk: [1] kind=0 [4] dataLen [N] data [4] CRC32 LFSR chunk: [1] kind=1 [4] origLen [1] prefixLen [P] prefix [2] lfsrLen L [L] coefficients [L] seed bytes [1] residual flag: 0=plain sparse 1=deflate-raw 2=brotli [payload] [4] CRC32 Cyclic chunk: [1] kind=2 [4] origLen [2] period P [P] cycle_bytes [4] CRC32 Delta chunk: [1] kind=3 [4] origLen [1] deltaId [4] innerLen [inner] [4] CRC32 Affine chunk: [1] kind=4 [4] origLen [1] k [4] innerLen [inner] [4] CRC32 Interleave: [1] kind=5 [4] origLen [1] m m × { [4] laneLen [lane bytes] } [4] CRC32 Bitplane: [1] kind=6 [4] origLen [1] planeCount (always 8) 8 × { [4] planeLen [plane bytes] } [4] CRC32 EOF sentinel: [1] 0xFE XDNI index: [4] "XDNI" [4] chunkCount [N×8] entries [4] indexOffset Each entry: [4] chunkOffset [4] origLen ``` 残差是预测字节与实际字节的 XOR。完美的递推会产生空残差。对于带有噪声的数据，残差被稀疏编码为经 delta 压缩的位置-值对，然后可选择进行 deflate/brotli 压缩 —— 哪个最小就采用哪个结果。 ## 谁会处理这类数据该编解码器面向处理使用移位寄存器逻辑数据源的工程师。对于这些有效载荷，它能实现任何通用工具都无法比拟的压缩效果 —— 因为**熵并不是衡量可压缩性的正确指标**： - **硬件测试工程师** —— 用于 PCIe、USB、SONET 和以太网信号完整性测试的 PRBS（伪随机二进制序列）流是由 LFSR 生成的。`--analyze` 能瞬间识别出生成多项式和噪声级别。 - **嵌入式 / 固件团队** —— flash 中的 bootloader 映像、CRC 查找表、DSP 系数表。对于二进制可执行文件，已能实现 19% 的压缩率，而 gzip 约为 60%。 - **电信 / 网络** —— 光纤链路中的线路加扰器使用基于 LFSR 的 XOR 加扰；解扰后的有效载荷带有 GF 结构。 - **汽车 / 航空航天** —— FlexRay 和 MIL-STD-1553 使用 LFSR 成帧；来自移位寄存器硬件的遥测数据。 - **安全 / 密码分析** —— `--analyze` 可以检测“随机”流是否具有意外低的 LFSR 复杂度，这是弱 PRNG 的危险信号。 ## 预构建可执行文件无需 Node.js。从 [GitHub Releases](https://github.com/gonll/algex/releases) 页面下载适用于您平台的压缩包： | 平台 | 压缩包 | |---|---| | Windows x64 | `pade-compress-windows-x64.zip` | | Linux x64 | `pade-compress-linux-x64.tar.gz` | | macOS ARM64 | `pade-compress-macos-arm64.tar.gz` | 每个压缩包包含**必须保持在同一目录下**的两个文件： - `pade-compress`（Windows 上为 `pade-compress.exe`）—— 独立可执行文件 - `pade_compress_addon.node` —— 原生 GF/BM 库，在运行时作为附属组件加载 ``` # Windows .\pade-compress.exe analyze your-prbs-stream.bin # Linux / macOS ./pade-compress analyze your-prbs-stream.bin ``` 版本在每次推送 `v*` 标签时由 GitHub Actions 在所有三个平台上自动构建。 ## 安装 (Node.js / npm) ``` npm install pade-compress ``` 此软件包包含一个原生 C 插件（GF(2⁸) 算术和 Berlekamp-Massey 核心）。`npm install` 会自动编译它。您需要： - **macOS**：Xcode Command Line Tools —— `xcode-select --install` - **Linux**：`build-essential` —— `sudo apt install build-essential` (Debian/Ubuntu) 或等效软件包 - **Windows**：[windows-build-tools](https://github.com/felixrieseberg/windows-build-tools) 或带有 C++ 工作负载的 Visual Studio 需要 Node.js ≥ 18。 ## 使用方法 ``` # 检测代数结构（不执行压缩） npx tsx src/cli.ts analyze # 压缩 npx tsx src/cli.ts compress # 解压 npx tsx src/cli.ts decompress # Benchmark（压缩 + 验证，不写入输出） npx tsx src/cli.ts bench # Shortcuts（analyze 和 bench 为测试文件硬编码 — 对其他文件请使用上述选项） npm run analyze # runs on ./test/gf-structured.bin npm run bench # runs on ./test/gf-structured.bin ``` ### 编程式 API ``` import { compress, decompress, compressAsync, analyzeBuffer, formatAnalysis } from "pade-compress" import type { AnalysisResult, SegmentInfo } from "pade-compress" // Detect structure without committing to compression const result = analyzeBuffer(inputBytes, "my-file.bin") console.log(formatAnalysis(result)) // result.verdict, result.structuredFraction, result.segments[i].recognition … // Synchronous compression const compressed = compress(inputBytes) // Uint8Array → Uint8Array const restored = decompress(compressed) // Uint8Array → Uint8Array // Async (chunks encoded in parallel across worker threads) const compressed = await compressAsync(inputBytes) const compressed = await compressAsync(inputBytes, 4) // explicit worker count // Streaming import { createCompressStream, createDecompressStream } from "pade-compress" readable.pipe(createCompressStream()).pipe(writable) ``` ## 基准测试在 Apple M 系列机器上运行。 ``` GF(2⁸) geometric (L=1, perfect) 4 096 B → ~12 B (~0.3%) encode 2ms decode <1ms Noisy GF (L=1, 5% errors) 4 096 B → ~400 B (~9.8%) encode 3ms decode <1ms Padé offset (16-byte noise prefix) 4 096 B → ~50 B (~1.2%) encode 3ms decode <1ms Mixed LFSR (L=1/2/3 + 8% noise) 1 048 576 B → 11 503 B ( 1.1%) encode 310ms decode 30ms Binary executable (/bin/ls) 154 624 B → 29 875 B (19.3%) encode 136ms decode 4ms Natural language text (/usr/share/dict/words) → ~100.1% (no LFSR structure found) ``` 解码始终几乎是瞬间完成的 —— LFSR 回放是一个没有分支的紧凑算术循环。 ## 项目布局 ``` c/ gf256.c / gf256.h GF(2⁸) arithmetic (AES poly, O(1) log/exp tables) gf_wide.c / gf_wide.h GF(2^16) arithmetic (128KB tables, poly 0x1002D) bm.c / bm.h BM algorithm, LFSR run/errors, approx L=1..5 bm_wide.c / bm_wide.h BM over GF(2^16), BM16_MAX_L=32 analyze.c / analyze.h Buffer analysis, segment classification, PRBS recognition addon.c N-API bridge: exposes all C math to Node.js src/ native/ addon.ts TypeScript interface for the compiled .node addon core/ pade.ts Thin wrappers: findBestPade, findApproxL1..L5, findApproxAffineL1 entropy.ts Shannon entropy + compressibility gate transform.ts Dual pre-gate (entropy + algebraicity score), delta transforms analysis.ts analyzeBuffer(), formatAnalysis() bitplane.ts splitBitplanes / mergeBitplanes (8-plane decomposition) gf-poly.ts GF polynomial utilities: factorRoots, polyFromRoots (+ .test.ts) codec/ encoder.ts 15-path encoding pipeline decoder.ts LFSR replay + residual XOR; cyclic/interleave/bitplane decode format.ts Binary serialization (format v4: PAD4, CRC32, XDNI index) chunker.ts Entropy-adaptive chunking with ±4 boundary refinement (+ .test.ts) stream.ts Node.js streaming interface worker-pool.ts Worker thread pool for parallel chunk encoding worker-entry.ts Worker thread entry point utils/ sparse.ts Sparse residual encoding (empty / pairs / dense) (+ .test.ts) buffer.ts Byte utilities math.ts Misc math helpers scripts/ gen-gf-file.ts Generates a synthetic GF-structured test binary test/ gf-structured.bin 1MB synthetic GF file (L=1/2/3 segments + noise) examples/ demo.ts Synthetic roundtrip demo ``` ## 架构决策 **GF(2⁸) 而非 GF(257)** —— 所有 256 个字节值都是原生域元素。加法是 XOR。乘法使用 AES 不可约多项式 (x⁸+x⁴+x³+x+1) 与 O(1) 对数/指数查找表。 **熵 ≠ 可压缩性** —— GF(2⁸) m 序列具有约 8 bits/byte 的熵（接近最大值），但 LFSR 长度 = 1。压缩门控使用 L/N 比率，而不是熵。 **Padé [k/L] 偏移搜索** —— 在运行完整的 LFSR 搜索之前尝试偏移量 0..32。处理在规则代数体之前具有非代数标头（文件魔术字节、盐值）的数据。如果 offset=0 已经失败则提前中断 —— 不可压缩的数据在一次 BM 遍历中即可被检测到。 **近似 L=1 回退** —— 对于精确 BM 找到极长 LFSR 的含噪数据，在 O(255·N) 时间内暴力破解所有 255 个 GF 系数，并选择残差最稀疏的系数。可处理高达约 17% 的字节噪声。 **近似 L=2 回退** —— 在 O(N) 时间内对连续的 GF 四元组进行投票，以找到主导的 (c1, c2) 对。多数阈值为 25%；验证阈值为 `1−(1−T)^3`。可处理高达约 28% 的字节噪声。 **近似 L=3 回退** —— 两阶段：通过配对的五元组方程对 (c2, c3) 进行投票，然后在给定 (c2, c3) 锚点的情况下对 c1 进行投票。投票阈值为 20%；验证阈值为 `1−(1−T)^4`。可处理高达约 23% 的字节噪声。 **通过子序列 BM 投票实现近似 L=4,5** —— 在许多短的重叠窗口（长度为 2L+4）上运行 BM。即使单个窗口包含错误，真实多项式也能在所有窗口中获得相对多数票。可以推广到任意 L，而不会增加代码复杂性。 **仿射 L=1 检测** —— 通过对连续三元组的比率 `(y[i]⊕y[i+1]) / (y[i-1]⊕y[i])` 进行投票来识别序列 `y[n] = c·y[n-1] ⊕ b` —— 该公式能抵消加法常数 b，并且在有噪声的情况下保持稳定。一旦 c 和 b 已知，位移量 `k = b·inv(1⊕c)` 就会将该序列转换为纯乘性递推。 **种子去噪** —— 在任何近似 LFSR 搜索（L=1..5）之后，前 L 个种子字节本身可能就是噪声。依次对每个种子位置，遍历所有 256 个候选值，并选择在整个序列中使总预测误差最小的值。O(L×256×N) —— 速度快，能将数据块开头原本密集的残差转化为干净的 LFSR 预测。 **噪声初始偏移搜索** —— 如果数据块的前 L 个字节恰好是噪声，则 LFSR 预测会立即发散。近似路径探测偏移量 1..8 并原样存储有噪声的前缀，从而找到干净的种子窗口。 **循环 / 精确周期编码** —— 对于具有精确周期 P 的数据（查找表、计数器数组、重复的测试模式），存储单个周期。在所有 LFSR 路径之后运行，因此几何序列仍使用较小的 LFSR 形式。 **差分（Delta）转换** —— 对通过双重前置门控的数据块尝试 XOR 一阶差分、ADD 一阶差分 (mod 256) 和 XOR 二阶差分。ADD 差分能捕获在整数上呈线性但在 GF(2⁸) 上非线性的计数器序列。每种转换都是完全可逆的；转换 ID 存储在传输格式中。 **交错 m=2,3,4** —— 将字节流拆分为 m 个通道（字节 0,m,2m,… / 1,m+1,2m+1,… / …）并独立编码每个通道。当偶数字节和奇数字节通道承载不同的 LFSR 生成器时很有用。简短的 BM 复杂度预筛选（上限=5，窗口=20）会在提交完整编码之前以低开销跳过非 LFSR 通道。 **位平面分解** —— 将每个字节拆分为其 8 个位平面（每个输入字节的第 b 位 → 平面 b）。每个平面作为 0/1 字节序列独立编码。当不同的位平面承载不同的线性结构时很有用 —— ADC/DAC 样本（MSB 平面承载幅度模式，LSB 噪声较大），固件映像（操作码 MSB 周期性出现，操作数 LSB 随机）。使用与交错相同的 BM 预筛选门控。 **双重前置门控** —— 转换路径 8-14 在执行任何昂贵的工作之前受两个快速检查的门控：(1) 熵门控拒绝已经在统计上可压缩的数据（huffman 估计 < 原始数据的 60%）；(2) 代数性门控通过测量滑动 16 字节窗口中 BM 复杂度表现的一致性来拒绝类随机数据。只有同时具有高熵和代数结构的数据才能到达转换路径。 **传输格式 v4 (PAD4)** —— 添加了逐块 CRC32、一个 EOF 标记字节以及一个 XDNI 索引 Trailer（块偏移量 + 原始长度），用于实现对任何块的 O(1) 随机访问寻址。所有块类型使用相同的 CRC 放置位置（在块有效载荷之后的 4 个字节）。 **多项式分解** `gf-poly.ts` 提供完整的往返操作：`factorRoots` 将 LFSR 最小多项式分解为其 GF(2⁸) 根（当它们全都不同时）；`polyFromRoots` 通过 ∏(x + αᵢ) 从根重建多项式。它们结合在一起，可以检查高阶 LFSR 是否是独立几何序列的总和。 **边界细化** —— 在检测到熵不连续性之后，会尝试将每个候选分割点在 ±4 字节的偏移量内移动，并保留具有最锐利熵对比度的那一个。使数据块边界与真实的代数转换对齐，而不是与最近的扫描位置对齐。 **Worker 线程池** —— `compressAsync` 将数据块分配给 worker 线程池（默认为 `availableParallelism()`）。如果 worker 无法初始化，则回退到同步编码。 **稀疏 + deflate 残差** —— 非零残差字节被编码为经 delta 压缩的位置-值对。对于在 4096 个字节中散布着 350 个错误的数据块，平均间隔约为 12，这意味着 75% 的 uint32 间隔字节为零 —— 生成的打包块在 deflate-9 下从约 1 KB 压缩到约 50 字节。 **PRBS 识别** —— `--analyze` 输出会识别系数具有 255 乘法阶次的 1 阶 LFSR（本原元 → PRBS-8 m 序列，周期 255）。其他周期（85、51、17、15、5、3、1）由它们在 GF(2⁸)* 中的阶数命名。 ## 局限性 - 不是通用压缩器。对于文本、图像和已压缩格式，请使用 zstd/brotli。 - 最适合作为**流水线中的专用阶段**：首先使用 `analyzeBuffer()`；如果 `structuredFraction > 0.5` 则在此压缩，否则回退到统计编解码器。 - O(n²) 的 Berlekamp-Massey 是具有高 LFSR 阶数的大型数据块的瓶颈。 ## 开发 ``` npm run build # TypeScript compile + native addon npm test # 121 unit tests across 7 test files npm run demo # Synthetic roundtrip demo npx tsx src/cli.ts bench # Benchmark any file npx tsx src/cli.ts analyze # Algebraic structure report npm run build:exe # Build a self-contained executable for the current platform ``` ### 本地构建可执行文件 `npm run build:exe` 会编译 TypeScript，重建原生插件，并通过 [`@yao-pkg/pkg`](https://github.com/yao-pkg/pkg) 将 CLI 打包成单个可执行文件。生成的二进制文件 + `pade_compress_addon.node` 附属组件会输出到 `executables/` 目录中。这两个文件必须放在一起使用。要在单台机器上生成所有三个平台的目标文件，需要交叉编译原生插件（这并非易事）。推荐的方法是推送一个 `v*` 标签，让 GitHub Actions 原生构建每个平台： ``` git tag v0.1.2 && git push origin v0.1.2 ``` 这会触发 `.github/workflows/release.yml`，它会在 `windows-latest`、`ubuntu-latest` 和 `macos-latest` 上进行构建，并自动将所有三个压缩包发布到 GitHub release。 7 个测试文件中包含 121 个测试，覆盖了稀疏编码、Padé 搜索（通过 C 插件）、GF 多项式往返操作、近似 LFSR 检测 (L=1..5)、带边界细化的分块、双重前置门控（熵 + 代数性）以及端到端往返操作。

标签：Berlekamp-Massey算法, DNS 反向解析, GF(2^8)有限域, LFSR, MITM代理, m序列, Padé逼近, PRBS伪随机二进制序列, 代数压缩, 信息熵, 信道编码, 固件压缩, 密码学, 异或残差编码, 手动系统调用, 数据压缩, 无统计学结构数据压缩, 流密码, 算法库, 纠错码, 线性反馈移位寄存器, 编码理论, 自动化攻击, 通信有效载荷