diohabara/pychd


# PyChD [![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/78bbdc5e6d080509.svg)](https://github.com/diohabara/pychd/actions/workflows/ci.yml) [![PyPI Version](https://img.shields.io/pypi/v/pychd.svg)](https://pypi.python.org/pypi/pychd) ![Recovery rate by corpus — Sig / Decl / Strict / BN / BS](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/6ad7a274da080510.svg) ## Headline — measured contamination differential, fuzz-synthetic as the trust anchor **2,794 modules across 13 corpora**, measured twice — once with the deterministic rule-only path (no LLM, fully reproducible offline) and once with hybrid-rewrite (rule pass + one Codex `gpt-5.5` call per module). Two new tooling members in this repo make the contamination question directly measurable: * **`pychd-pyfuzz`** generates random syntactically-valid Python via direct AST construction — every sample is fresh, never published, never seen by any LLM. Lives in [`pychd_pyfuzz/`](pychd_pyfuzz/) and on [PyPI as `pychd-pyfuzz`](https://pypi.org/project/pychd-pyfuzz/). * **`pychd-pyobf`** anonymises a `.pyc` (renames identifiers, strips strings / docstrings / filenames / line tables) while preserving the opcode stream byte-for-byte. Lives in [`pychd_pyobf/`](pychd_pyobf/) and on [PyPI as `pychd-pyobf`](https://pypi.org/project/pychd-pyobf/). Together they let us run two new families of corpus on top of the existing benchmark suite: * **`fuzz-synthetic`** (200 modules) — pyfuzz-generated, guaranteed LLM-naïve. The strongest contamination guarantee in the repo. * **`-obf`** (815 modules across 5 mirrors of stdlib / stdlib-full / pypi / pypi-top20 / humaneval) — same bytecode structure as the raw counterpart, identifiers stripped. The delta between raw and `-obf` is the contamination signal that lets us *put a number on* "how much of the headline LLM score is memorisation". ### The contamination differential, with numbers Raw vs. 
anonymised, hybrid-rewrite mode, same backend model:

| Corpus | Raw `strict_match` | **`-obf` `strict_match`** | Δ (memorisation lift) | Raw `BS` | **`-obf` `BS`** | Δ |
|---|---:|---:|---:|---:|---:|---:|
| `stdlib` | 100 % (10/10) | **86.7 %** (13/15) | **−13.3 pt** | 60.0 % | **0.0 %** | **−60.0 pt** |
| `stdlib-full` | 91.5 % (140/153) | **80.4 %** (123/153) | **−11.1 pt** | 84.3 % | **2.6 %** | **−81.7 pt** |
| `pypi` | 89.9 % (170/189) | **82.0 %** (155/189) | **−7.9 pt** | 33.3 % | **3.2 %** | **−30.1 pt** |
| `pypi-top20` | 84.5 % (576/682) | **83.4 %** (569/682) | **−1.1 pt** | 63.3 % | **5.3 %** | **−58.0 pt** |
| `humaneval` | 100 % (164/164) | **100 %** (164/164) | **0 pt** (algorithmically simple) | 98.2 % | **86.6 %** | **−11.6 pt** |

* **Strict-AST match drops 1.1–13.3 pt** when identifiers are stripped on contamination-likely corpora. That gap is mechanically attributable to surface-token memorisation: the bytecode is unchanged, only the surface form the LLM would have seen in training data is gone.
* **Behavioural smoke (import + same public API) collapses** under anonymisation: 60–80 pt drops are typical. This makes intuitive sense: a recovered module whose public surface is `_n0, _v0` cannot pass a smoke test that imports the original module and calls its documented function by name. It's an artefact of the metric, not the decompiler, and it shows up cleanly here.
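The Δ column for `strict_match` is plain arithmetic on the pass counts in the table. A minimal sketch of that calculation follows; the names `STRICT_COUNTS`, `pct`, and `memorisation_lift` are illustrative and not part of pychd. It assumes the deltas are taken between the rounded percentages, which is what reproduces the figures shown.

```python
# (passed, total) strict_match counts copied from the table above:
# first pair is the raw corpus, second pair is its -obf mirror.
STRICT_COUNTS = {
    "stdlib": ((10, 10), (13, 15)),
    "stdlib-full": ((140, 153), (123, 153)),
    "pypi": ((170, 189), (155, 189)),
    "pypi-top20": ((576, 682), (569, 682)),
    "humaneval": ((164, 164), (164, 164)),
}


def pct(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal like the table."""
    return round(100.0 * passed / total, 1)


def memorisation_lift(raw: tuple, obf: tuple) -> float:
    """Difference in percentage points between the -obf and raw pass rates."""
    return round(pct(*obf) - pct(*raw), 1)


LIFTS = {
    corpus: memorisation_lift(raw, obf)
    for corpus, (raw, obf) in STRICT_COUNTS.items()
}

for corpus, delta in LIFTS.items():
    print(f"{corpus:12s} {delta:+.1f} pt")
```

A negative value means the hybrid path lost that many points once identifiers were stripped.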
### The contamination-free baseline

The number we'd ask a security-conscious reader to actually trust as "what hybrid-rewrite does on never-before-seen code":

| Metric | `fuzz-synthetic` (LLM-naïve, 200 modules) | `recent-pypi` (release-date proxy, 182 modules) |
|---|---:|---:|
| `parses` | **100 %** | 100 % |
| `signature_match` | **100 %** (rule-only) → **100 %** (hybrid) | 98.4 % → 99.5 % |
| `declaration_match` | **100 %** (rule-only) → **100 %** (hybrid) | 98.4 % → 99.5 % |
| **`strict_match`** | **21.0 %** (rule-only) → **86.0 %** (hybrid) | **45.6 %** (rule-only) → **81.9 %** (hybrid) |
| `BS` (behavioural smoke) | 0.0 % (rule-only) → **92.0 %** (hybrid) | 14.3 % → 20.3 % |

Hybrid-rewrite reaching **86.0 % strict-AST match on `fuzz-synthetic`** (bytecode that no LLM has ever seen) is the clean answer to "does pychd's hybrid path actually decompile, or does it just remember?": it decompiles. The contamination differential adds ~1–13 pt on contamination-likely corpora (the Δ column of the raw vs. `-obf` table); that's the share that is *not* skill.

### Aggregate over all 2,794 modules

| Mode | `parses` | `signature_match` | `declaration_match` | **`strict_match`** | `BS` |
|---|---:|---:|---:|---:|---:|
| Rule-only (no LLM, deterministic) | 100 % | 99.7 % | 99.7 % | **43.1 %** | 19.3 % |
| Hybrid-rewrite (rule pass + 1 Codex call/module) | 100 % | 99.7 % | 99.7 % | **86.5 %** | 43.2 % |

Pass@1 on HumanEval: **rule-only 2.4 %** → **hybrid-rewrite 97.6 %**, but every HumanEval prompt is in the backend model's training data, so this is mostly an LLM-solves-HumanEval-from-memory signal rather than a decompilation signal.
![Per-corpus recovery rate (rule-only vs hybrid-rewrite)](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/6ad7a274da080510.svg) ![Per-tool comparison at each decompiler's preferred Python version](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/739f8c270b080511.svg) The take-away for anyone reading benchmark numbers for an LLM-assisted decompiler: **separate the rule-only baseline from the LLM lift, and measure on a corpus the backend model cannot have seen.** This repo is the first I know of to ship both halves of that — `pychd-pyfuzz` + `pychd-pyobf` are independent PyPI packages so other Python decompiler authors can drop the same harness into their CI. See [§LLM contamination disclosure](#llm-contamination-disclosure) for the worked example (`_colorize.py`) and [§Comparison with prior Python decompilers](#comparison-with-prior-python-decompilers) for the 23-module stdlib + PyPI head-to-head against uncompyle6 / decompyle3 / pycdc / PyLingual. ## Quick start # The decompiler itself. uv tool install pychd pychd decompile path/to/module.pyc --hybrid-rewrite --backend codex # rules-only (deterministic, no LLM, offline, free — best for declaration recovery): pychd decompile path/to/module.pyc --rules-only # Optional: the contamination-free benchmarking harness used by this # repo. Install both to drop the same fuzz → obfuscate → decompile # pipeline into your own decompiler's CI. uv tool install pychd-pyfuzz pychd-pyobf # uv users pip install pychd-pyfuzz pychd-pyobf # pip users pychd-pyfuzz emit --target 3.14 --seed 0 # one random valid Python module pychd-pyobf rewrite IN.pyc OUT.pyc # anonymise a .pyc in place `--hybrid-rewrite` is the default at the CLI. It uses your existing `codex login` session — set ``model = "gpt-5.5"`` in ``~/.codex/config.toml`` (or pass ``-c model=...``) to control which model. No extra API key needed. 
If you want a fully offline, deterministic, audit-friendly run with no LLM calls and no contamination risk, use `--rules-only`. That is the path behind the rule-only rows of the headline tables above.

## Table of contents

- [Pipeline at a glance](#pipeline-at-a-glance)
- [LLM contamination disclosure](#llm-contamination-disclosure)
- [What you get from each mode](#what-you-get-from-each-mode) (four worked examples)
- [Detailed recovery walkthrough](#detailed-recovery-walkthrough--what-happens-to-a-real-module) (step-by-step on one module)
- [How it works — compiler-pipeline perspective](#how-it-works--compiler-pipeline-perspective)
- [What survives compilation, and what doesn't](#what-survives-compilation-and-what-doesnt)
- [Cross-version support](#cross-version-support)
- [Benchmarks](#benchmarks-run-by-just-paper)
- [Comparison with prior Python decompilers](#comparison-with-prior-python-decompilers)
- [Reproducibility](#reproducibility)
- [Scope](#scope)
- [Citing](#citing)

## LLM contamination disclosure

The benchmark numbers on this page should be read as **upper bounds under likely-contaminated conditions**, not as evidence of clean generalisation.

- The headline corpus is dominated by code (CPython stdlib, top-20 PyPI packages, OpenAI HumanEval) that any modern frontier model has almost certainly seen during pre-training. A high recovery rate there is partially memorisation, not just decompilation.
- The `tools/synthetic_corpus/` corpus (11 modules, 625 LoC, committed 2026-05-26) was **drafted with the assistance of an LLM during this project's development**. The exact source text did not exist on the public internet before that date, but the modules were produced by the same model family this benchmark uses as a backend, so it cannot honestly be called LLM-naïve. We keep it in the benchmark because it exercises specific PEP-695 / PEP-749 / match-statement constructs, but we no longer claim it isolates "uncontaminated" performance.
- The PyPI subset (`requests`, `click`, `attrs`, `flask`, `httpx`, `rich`) and the top-20 sweep overlap published training corpora. Recent wheel pins (e.g. `certifi 2026.5.20`) reduce *exact-version* memorisation risk for those packages, but do not eliminate pattern-level memorisation. - HumanEval is a published evaluation set and almost certainly in training data; we report Pass@1 there as a re-executability sanity check, not as evidence of generalisation. ### Per-metric trust table | Metric | Rule-only | Hybrid-rewrite | Trust | |---|---|---|---| | `parses` | 100 % | 100 % | ✓ honest — just `ast.parse` | | `signature_match` (rule-only) | 99.8 % | — | ✓ honest — bytecode-derived | | `declaration_match` (rule-only) | 99.6 % | — | ✓ honest — bytecode-derived | | `signature_match` (hybrid Δ +0.2 pt) | — | 100 % | ✗ memorisation — see worked example below | | `strict_match` (rule-only) | 36.0 % | — | ✓ honest — bytecode-derived | | `strict_match` (hybrid Δ +57 pt) | — | 93.2 % | ⚠ unmeasured mix of memorisation + canonical-form derivation | | `BS` (rule-only) | 42.1 % | — | ✓ honest | | `BS` (hybrid Δ +26 pt) | — | 68.1 % | ⚠ contamination plausible | | `BN` (rule-only) | 7.2 % | — | ✓ honest | | `BN` (hybrid Δ +42 pt) | — | 48.9 % | ⚠ contamination plausible — body recovery from memory yields exact bytecode | | **`FC` Pass@1 (HumanEval)** | 2.4 % | 97.6 % | ✗ HumanEval is published, almost certainly memorised by the backend model. This metric measures "LLM solves HumanEval", not "pychd decompiles" | | Edit similarity | 0.445 | 0.753 | ⚠ memorisation pushes this towards 1.0 by construction | ### Worked example: `Lib/_colorize.py` The two CPython stdlib modules that fail rule-only `signature_match` (`_colorize.py`, `_pylong.py`) contain `if False:` / `if 0:` guards. For `_colorize.py` L8-12: # types if False: from typing import IO, Self, ClassVar _theme: Theme CPython's constant folder erases the `if False:` block entirely. 
After `compile()` the bytecode contains zero `IMPORT_NAME typing`, zero `STORE_NAME IO`, etc. — the only survivor is `_theme: Theme` as a PEP 749 lazy annotation in the `__annotate__` closure. Pychd's rule pass correctly leaves those imports out of the recovered tree (you cannot decompile what isn't there). Hybrid- rewrite "fixes" the signature_match score by writing `from typing import IO, Self, ClassVar` into the output anyway — necessarily from training-data memorisation of CPython, since the .pyc carries no information about that line. That is the concrete mechanism behind the 0.2 pt sig-match gain, and the same kind of mechanism is plausibly contributing to the much larger strict- match / BN / Pass@1 gains. ### Adopting the same harness in your own decompiler `pychd-pyfuzz` and `pychd-pyobf` are independent PyPI distributions (see [§Headline](#headline--measured-contamination-differential-fuzz-synthetic-as-the-trust-anchor) for what they do). `pip install pychd-pyfuzz pychd-pyobf` and you can run the same fuzz → obfuscate → decompile audit against any Python decompiler. Expected shape of an honest result: * Rule-only `strict_match` should be within a few points of the raw-corpus number — the rule pass is bytecode-driven and identifier-agnostic, so anonymisation should not move it. * Hybrid-rewrite `strict_match` will drop on `-obf` corpora by an amount equal to the LLM's contamination advantage on that corpus. > 30 pt is strong evidence the upstream hybrid score is contamination-driven; this repo's worst case is 13 pt (stdlib), with most contaminated corpora landing under 10 pt. ## Pipeline at a glance pychd routes every `.pyc` through two passes: - **Rule pass** owns everything CPython compiles to a deterministic bytecode shape — imports, class/function declarations, signatures, decorators (incl. arguments), PEP 695 generics, PEP 749 lazy annotations, common one-line bodies (`return self.x`, `return cls(...)`, constructor `self.x = x`, etc.). 
Output is reproducible offline and audit-friendly. Bodies it can't recover remain as `pass`. - **Codex rewrite** runs *once per module* with the disassembly + the rule pass's partial output as context. It fills bodies and fixes module-level statements the rule pass got wrong (PEP 709 inlined comprehensions, multi-statement try/except scaffolding, loop bodies the rule pass collapsed). Bytes go in, source comes out — the LLM never sees the original source. (Aggregate numbers across all 2,794 modules are in the [headline table](#aggregate-over-all-2794-modules) at the top of the README. Per-axis ceilings are below.) flowchart LR pyc["foo.pyc"] -- detect magic --> ver["Python version"] ver -- 3.14 --> nat["native rule pass
(deterministic, no LLM)"] ver -- "3.0–3.13" --> cv["cross-version rule pass
(xdis, no LLM)"] nat --> ir["pychd.ir
(typed IR)"] cv --> ir ir -. partial recovery .-> llm["Codex rewrite
(1 call / module)"] ir & llm --> rec["recovered .py"] style nat fill:#d4ffd4 style cv fill:#d4e6ff style rec fill:#fff4d4 Why bodies-as-`pass` happens in rule-only: a function body that compiles to non-trivial control flow (multiple statements, loops, branches, `match`) is many-to-one in bytecode — the same opcode sequence can come from several different source expressions. Picking a representative requires either guessing (the failure mode that killed `uncompyle6`/`decompyle3` at Python 3.8) or asking an oracle. pychd chooses the oracle, so the rule pass deliberately leaves an `UnknownBlock` for the rewrite step to fill. ### Rule-only vs hybrid-rewrite ceiling What each axis can / cannot recover from bytecode alone, aggregated over all 2,794 modules: | Axis | Rule-only | Hybrid-rewrite | What the rule pass cannot reach without an oracle | |---|---:|---:|---| | `parses` | 100 % | 100 % | — | | `signature_match` | 99.7 % | 99.7 % | Residual is `if False:` / `if 0:` guards (`_colorize.py`, `_pylong.py`) whose contents the constant folder erases — *no* decompiler can recover them. Hybrid does not move the needle here. See [§LLM contamination disclosure](#llm-contamination-disclosure). | | `declaration_match` | 99.7 % | 99.7 % | Same. | | **`strict_match`** | **43.1 %** | **86.5 %** | CPython normalises docstrings via `inspect.cleandoc`, folds constants, and re-emits expressions in canonical form. The rewrite re-derives the canonical form from disassembly. | | `BS` (behavioral_smoke) | 19.3 % | 43.2 % | A `pass`-bodied recovery imports but exposes no callable behaviour beyond signatures. Anonymised corpora drop hard here (see contamination differential). | | `BN` (bytecode_normalized) | — | 48.6 % | Tolerates lnotab + specialised-opcode noise but body recovery still required. | | `FC` (Pass@1, HumanEval only) | 2.4 % | **97.6 %** | The recovered module must *behave* like the original. 
HumanEval is published; the Pass@1 lift is largely memorisation rather than decompilation. | ## More CLI examples # Decompile an entire project tree (mirrors structure into output dir): uv run pychd decompile path/to/package/ -o recovered/ # Rules-only mode — no LLM calls, deterministic, milliseconds: uv run pychd decompile path/to/module.pyc --rules-only # Hybrid-rewrite — rule pass + one LLM rewrite per module (fixes # body fills *and* module-level recovery). Recommended when you # want the highest-fidelity recovery and don't mind a single LLM # call per file. Uses your `codex login` session (no API key). uv run pychd decompile path/to/module.pyc --hybrid-rewrite --backend codex # LLM-only mode (older bytecode versions, or when rules struggle): uv run pychd decompile path/to/module.pyc --llm-only -m gpt-4o # Reproduce every benchmark, table, and figure in this README: just paper ## What you get from each mode ### Example 1: a re-export module (full rule recovery, 0 LLM calls) Original source (a typical `__init__.py`): """Public surface for the foo package.""" from .core import Bar, Baz from .util import parse, as_dict from .errors import FooError __all__ = ["Bar", "Baz", "FooError", "as_dict", "parse"] After `pychd decompile --rules-only`: """Public surface for the foo package.""" from .core import Bar, Baz from .util import parse, as_dict from .errors import FooError __all__ = ['Bar', 'Baz', 'FooError', 'as_dict', 'parse'] Identical modulo single vs double quotes in `__all__`. Zero LLM cost, recovered in 0.9 ms. 
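"Identical modulo single vs double quotes" can be checked mechanically by comparing parsed trees instead of text. The sketch below uses only the standard `ast` module; `same_ast` is a hypothetical helper for illustration and is not claimed to be how pychd's own `strict_match` comparator is implemented.

```python
import ast


def same_ast(source_a: str, source_b: str) -> bool:
    """True when both sources parse to the same tree.

    ast.dump() omits line/column attributes by default, so quoting style,
    blank lines, and comments do not affect the comparison.
    """
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


ORIGINAL = '__all__ = ["Bar", "Baz", "FooError", "as_dict", "parse"]\n'
RECOVERED = "__all__ = ['Bar', 'Baz', 'FooError', 'as_dict', 'parse']\n"
TRUNCATED = "__all__ = ['Bar', 'Baz']\n"

print(same_ast(ORIGINAL, RECOVERED))  # quoting differs, tree does not
print(same_ast(ORIGINAL, TRUNCATED))  # contents differ, so trees differ
```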
### Example 2: a dataclass module (full hybrid-rewrite recovery) Original: from dataclasses import dataclass from typing import Any @dataclass(frozen=True) class AgentMessage: type: str uuid: str agent_id: str message: Any = None @classmethod def from_json(cls, value): return cls( type=value["type"], uuid=value["uuid"], agent_id=value["agentId"], message=value.get("message"), ) After `pychd decompile --hybrid-rewrite --backend codex` (one LLM call per module; rule pass first, LLM corrects bodies + module-level recovery): from dataclasses import dataclass from typing import Any @dataclass(frozen=True) class AgentMessage: type: str uuid: str agent_id: str message: Any = None @classmethod def from_json(cls, value): return cls( type=value["type"], uuid=value["uuid"], agent_id=value["agentId"], message=value.get("message"), ) Byte-for-byte recovery on this shape — `bytecode_exact` round-trips under the producing 3.14 interpreter. The class declaration, every annotation, the `@classmethod` method decorator, the outer `@dataclass(frozen=True)` decorator with its keyword argument, and every method signature come straight from the rule pass; the body is filled by the LLM with the (signature + disassembly) it receives. For the deterministic-only path:
Same input, --rules-only (no LLM) from dataclasses import dataclass from typing import Any @dataclass(frozen=True) class AgentMessage: type: str uuid: str agent_id: str message: Any = None @classmethod def from_json(cls, value): return cls(type=value['type'], uuid=value['uuid'], agent_id=value['agentId'], message=value.get('message')) The trivial-body matcher even lifts this single-statement method into a real `return cls(...)`, so the rules-only output here is already behaviorally equivalent — the LLM is only needed for **multi**- statement bodies and complex module-level constructs.
### Example 3: a generic class (PEP 695, full hybrid-rewrite recovery)

Original:

    class Stack[T]:
        def __init__(self):
            self.items: list[T] = []

        def push(self, x: T) -> None:
            self.items.append(x)

After `pychd decompile --hybrid-rewrite --backend codex`:

    class Stack[T]:
        def __init__(self):
            self.items: list[T] = []

        def push(self, x: T) -> None:
            self.items.append(x)

Identical modulo whitespace. The PEP 695 type parameter `[T]` survives the rule pass: pychd recognises the synthetic `<generic parameters of Stack>` wrapper code object that the CPython compiler emits and unpacks it. Class-body and module-level annotations *are* recovered from the PEP 749 `__annotate__` closure; parameter annotations (`x: T`) live in a separate per-method closure and the LLM rebuilds them from the disassembly during the rewrite step.

### Example 4: a HumanEval problem (full bytecode round-trip)

Original (`HumanEval_0.py`):

    from typing import List


    def has_close_elements(numbers: List[float], threshold: float) -> bool:
        """ Check if in given list of numbers, are any two numbers closer to each other than
        given threshold.
        >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
        False
        >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
        True
        """
        for idx, elem in enumerate(numbers):
            for idx2, elem2 in enumerate(numbers):
                if idx != idx2:
                    distance = abs(elem - elem2)
                    if distance < threshold:
                        return True

        return False

After `pychd decompile --hybrid-rewrite --backend codex`: from typing import List def has_close_elements(numbers: List[float], threshold: float) -> bool: """ Check if in given list of numbers, are any two numbers closer to each other than given threshold. 
>>> has_close_elements([1.0, 2.0, 3.0], 0.5) False >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) True """ for idx, elem in enumerate(numbers): for idx2, elem2 in enumerate(numbers): if idx != idx2: distance = abs(elem - elem2) if distance < threshold: return True return False `bytecode_exact`, `bytecode_normalized`, `behavioral_smoke`, and `functional_correctness` (the HumanEval `check(candidate)` oracle) all pass — the recovered module compiles to byte-identical bytecode and passes every assertion. Only difference from the original is a single blank line before the trailing `return False`, which the AST comparator normalises away. ## Detailed recovery walkthrough — what happens to a real module This section shows the four-stage recovery pipeline against a single example module — what *each* stage adds — so you can see why both the rule pass and the LLM are needed and what they contribute respectively. The example: a slimmed-down dataclass module with three things the rule pass handles trivially (imports, decorators, signatures), one thing the trivial-body matcher lifts (a single-statement `from_json` classmethod), one thing only the LLM body fill can recover (a multi-statement `__post_init__`), and one thing only the ``hybrid-rewrite`` module-level fix-up can clean (a module-level dict-comprehension that the rule pass renders as ``X = {}``). 
Original `agent.py`: from dataclasses import dataclass, field from typing import Any _ALIAS = {old: new for old, new in [('uid', 'uuid'), ('msg', 'message')]} @dataclass(frozen=True) class AgentMessage: type: str uuid: str agent_id: str message: Any = None tags: list[str] = field(default_factory=list) def __post_init__(self): if not self.type: raise ValueError("type must be non-empty") object.__setattr__(self, "type", self.type.lower()) @classmethod def from_json(cls, value): return cls( type=value["type"], uuid=value["uuid"], agent_id=value["agentId"], message=value.get("message"), ) ### Step A: rule pass extracts the declaration skeleton Output of `pychd decompile --rules-only` after the rule walker runs: from dataclasses import dataclass, field from typing import Any _ALIAS = {} # ← lossy @dataclass(frozen=True) class AgentMessage: type: str uuid: str agent_id: str message: Any = None tags: list[str] = field(default_factory=list) def __post_init__(self): pass # pychd: unrecovered body # ← LLM territory @classmethod def from_json(cls, value): return cls(type=value['type'], uuid=value['uuid'], agent_id=value['agentId'], message=value.get('message')) What the rule pass got: - Both `from ... import ...` lines, verbatim. - The outer `@dataclass(frozen=True)` decorator including its keyword argument. - The class header line. - Every annotated attribute (`type: str`, `uuid: str`, …). - The `field(default_factory=list)` default for `tags`, rendered as a call expression. - The `@classmethod` decorator on `from_json`. - The signatures of `__post_init__` and `from_json` (parameter names, no annotations). What it didn't get: - The body of `__post_init__` (multi-statement; UnknownBlock). - The actual contents of `_ALIAS` (PEP 709 inlined dict comprehension; the rule pass emits the empty literal `{}` rather than guessing). 
### Step B: trivial-body matcher lifts one-liners Inside the rule pass there's a *trivial body* recogniser that handles single-statement bodies whose opcode shape is closed-form: | Shape | Example | |---|---| | `return name(args)` | `return cls(a, b=b)` | | `return self.x.y` | `return self.config.host` | | `return ` | `return [1, 2, 3]` | | `return X + Y` | `return left + right` | | `self.x = x; …` constructor | `def __init__(self, x): self.x = x` | | `raise SomeException(args)` | `raise ValueError("nope")` | That's how `from_json` in the example above survives the rule pass fully recovered, even though it's "a function body" in principle. Without this matcher, `from_json` would also collapse to `pass # pychd: unrecovered body` and require an LLM call. ### Step C: hybrid LLM body fill completes the non-trivial bodies `pychd decompile --hybrid --backend codex` re-runs the rule pass, then for every remaining `UnknownBlock` sends *just that body's* disassembly + the recovered signature to the LLM. The LLM never sees the rest of the module — that keeps the prompt small, the cost low, and identifier hallucination rare (the signature is already nailed by the rule pass). Diff vs Step A: def __post_init__(self): - pass # pychd: unrecovered body + if not self.type: + raise ValueError("type must be non-empty") + object.__setattr__(self, "type", self.type.lower()) The module-level `_ALIAS = {}` is **still wrong** — body fill operates inside function/class bodies, it doesn't touch top-level statements. ### Step D: hybrid-rewrite corrects module-level mis-recoveries `pychd decompile --hybrid-rewrite --backend codex` adds a final whole-module rewrite step: the LLM gets the disassembly of the entire module plus the rule pass' partial output, and emits the corrected full source. This catches: - Module-level comprehensions the rule pass collapsed to `X = {}` / `X = []` / `X = ...`. 
- For-loop bodies whose loop variable leaked into top-level declarations (now suppressed by the rule pass' FOR_ITER skip, but the rewrite repairs older recoveries cleanly). - Multi-line dict literals whose `MAP_ADD` accumulator pattern was mis-read. - Module-level `if __name__ == "__main__":` guards. - Multi-statement try/except scaffolding. Diff vs Step C: -_ALIAS = {} +_ALIAS = {old: new for old, new in [('uid', 'uuid'), ('msg', 'message')]} Cost: **one LLM call per module** instead of one per body, so on modules with many small bodies (`stdlib-full`, `pypi-top20`) the rewrite is actually *cheaper* than per-body hybrid. The trade-off is prompt size — the rewrite sends the full module disassembly, so very large modules push closer to the model's context window. On the benchmark corpora this is rarely an issue (the largest single file fits comfortably). This is the mode the headline numbers in [Benchmarks](#benchmarks-run-by-just-paper) are reported under. ## How it works — compiler-pipeline perspective ### Step 1: Python compiles your source to bytecode The CPython compiler takes your `foo.py` and emits `foo.pyc` — a binary file containing a **code object** for the module plus a nested code object for every function and class. Each code object holds: - the bytecode instructions (one byte opcode + one byte argument, since 3.6 "wordcode"), - a `co_consts` tuple of constants used in those instructions, - a `co_names` tuple of identifier names, - a `co_varnames` tuple of local variable names, - argument counts (`co_argcount`, `co_kwonlyargcount`, etc.), - flag bits (`co_flags`: is it a coroutine? a generator? does it use *args?). 
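The fields listed above can be read straight out of a `.pyc` with the standard library. This is a minimal sketch, not pychd's loader: it assumes the 16-byte header layout used by CPython 3.7 and later, and it only accepts files written by the interpreter that runs it. `load_pyc`, `walk`, and `SUMMARY` are illustrative names.

```python
import importlib.util
import inspect
import marshal
import pathlib
import py_compile
import tempfile
import types

SOURCE = "def f(a, b=1, *rest):\n    return a + b\n"


def load_pyc(path):
    """Return the module code object stored in a .pyc file."""
    data = pathlib.Path(path).read_bytes()
    # Bytes 0-3 are the magic number identifying the producing interpreter.
    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("pyc was written by a different Python version")
    # Assumption: 16-byte header (magic, flags, mtime-or-hash), as in 3.7+.
    return marshal.loads(data[16:])


def walk(code):
    """Yield a code object and every code object nested in its co_consts."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk(const)


with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "demo.py"
    src.write_text(SOURCE)
    pyc_path = py_compile.compile(str(src), cfile=str(pathlib.Path(tmp) / "demo.pyc"))
    module_code = load_pyc(pyc_path)

# name -> (positional arg count, local names, "uses *args" flag)
SUMMARY = {
    code.co_name: (
        code.co_argcount,
        code.co_varnames,
        bool(code.co_flags & inspect.CO_VARARGS),
    )
    for code in walk(module_code)
}
print(SUMMARY)
```

The signature of `f` is fully determined by these fields, which is why declaration recovery needs no guessing.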
You can poke at this on any Python install: >>> import dis >>> def f(a, b=1): return a + b >>> dis.dis(f) 1 RESUME 0 LOAD_FAST 0 (a) LOAD_FAST 1 (b) BINARY_OP 0 (+) RETURN_VALUE >>> f.__code__.co_argcount, f.__code__.co_varnames (2, ('a', 'b')) ### Step 3: the IR renders back to Python source Each IR node has a `render(indent) -> str` method: >>> ir.FromImport(module="os.path", level=0, names=[("join", "j")]).render() 'from os.path import join as j' >>> ir.FunctionDef(name="foo", args=ir.Arguments(args=[ir.Arg("a")])).render() 'def foo(a):\n pass' ### Step 4 (optional, `--hybrid` mode): the LLM fills function bodies For every `UnknownBlock` left in the tree, pychd sends a function-body-sized prompt to the configured LLM: You are a Python decompiler. The following Python 3.14 bytecode is the body of: def from_json(cls, value) Reconstruct the original Python source for *just the body*… LOAD_FAST_BORROW cls LOAD_FAST_BORROW value LOAD_CONST 'type' BINARY_SUBSCR … The LLM never sees the rest of the module; the rule pass already nailed the signatures, imports, and names. This keeps prompts small, costs low, and identifier hallucination rare. One LLM call per body, so on modules with many small functions the cost stays modest. ### Step 5 (optional, `--hybrid-rewrite` mode): the LLM rewrites the whole module The per-body path in Step 4 fixes *bodies* but leaves any module-level recovery mistakes (an inlined dict comprehension that collapsed to `X = {}`, a for-loop side effect that wasn't preserved) unchanged. `--hybrid-rewrite` adds a final whole-module rewrite call: You are a Python decompiler. Reconstruct the original Python 3.14 source for an entire module from its disassembled bytecode. You are given two inputs: 1. The complete disassembled bytecode (authoritative). 2. A partial rule-based recovery (declarations reliable; bodies + some module-level statements may be wrong). 
Bytecode disassembly: ``` ``` Partial recovery: ``` ``` Output ONLY valid Python 3.14 source code. Preserve every class/function/import name from the partial recovery. Fix module-level statements the rule pass got wrong by reading the bytecode. The output must pass `ast.parse` and `py_compile`.

One call per module. A single rewrite call costs more than a single per-body call, but the prompt amortises across every body in the module, so on a 50-function file the rewrite is *cheaper* than 50 separate body calls. The output is sanity-checked with `ast.parse` and the rule-only output is used as a fallback if the rewrite fails to parse. This is the mode the headline benchmark numbers are reported under, and the one the README's worked examples show.

## What survives compilation, and what doesn't

| Construct | Status | Why |
|---|---|---|
| Class / function names | ✅ preserved | Stored in `co_name` and `co_names`. |
| Function signatures (args, defaults, kwonly, posonly, `*args`, `**kw`) | ✅ preserved | All in `code.co_argcount`, `code.co_varnames`, etc. |
| Imports (incl. relative, dotted, star, `from __future__`) | ✅ preserved | `IMPORT_NAME` / `IMPORT_FROM` carry the full module path. |
| Docstrings (module / class / function) | ✅ preserved | A `LOAD_CONST` of the docstring followed by `STORE_NAME __doc__` for modules and classes; `co_consts[0]` for functions. Indentation is normalised by `inspect.cleandoc` semantics. |
| Annotations (PEP 749 lazy, 3.14+) | ✅ preserved | Stored as a separate `__annotate__` closure. |
| Class metaclass / dotted bases (`abc.ABC`) | ✅ preserved | `LOAD_NAME` + `LOAD_ATTR` chain before `CALL`. |
| Bare/dotted/arg-bearing decorators | ✅ preserved | `LOAD_NAME` + optional `LOAD_ATTR` + optional `CALL_KW` wrapping `MAKE_FUNCTION`. |
| Name-mangled methods (`_C__private`) | ✅ recoverable | Compiler mangles `__name` inside a class to `_ClassName__name`; pychd reverses this. |
| Function *body statements* | ⚠️ LLM territory | Logically present but the source→bytecode mapping is many-to-one. |
| `if False:` / `if 0:` blocks | ❌ **erased** | CPython's constant folder deletes them at compile time. |
| Whitespace, comments | ❌ erased | Tokenised away before bytecode generation. |

### Proof that `if False:` is unrecoverable

    >>> import dis
    >>> dis.dis(compile("if False:\n import foo\n", "", "exec"))
      0           RESUME                   0
                  LOAD_CONST               1 (None)
                  RETURN_VALUE

No trace of `import foo`. The bytecode holds **nothing but the implicit `return None`**, so no decompiler can recover what was never written to disk.

#### Cross-version full recovery via hybrid-rewrite

The deterministic cross-version pass is declaration-only by design, but **hybrid-rewrite mode reaches full-body recovery on every 3.x release** because the LLM consumes the version-specific disassembly text directly. The rule pass still produces the declaration scaffold; the LLM uses xdis' disassembly (which is already version-aware) as the authoritative source for bodies.

End-to-end on the fixture sample (10 LoC dataclass + greet methods), one Codex call per module:

| Python | Rule pass | Hybrid-rewrite ast_match | Wall-clock |
|---|---|---|---|
| 3.8 | cross-version | ✅ | ~24s |
| 3.9 | cross-version | ✅ | ~24s |
| 3.10 | cross-version | ✅ | ~20s |
| 3.11 | cross-version | ✅ | ~17s |
| 3.12 | cross-version | ✅ | ~20s |
| 3.13 | cross-version | ✅ | ~23s |
| 3.14 | native | ✅ | ~22s |

Reproduce: ``uv run python tools/build_multiversion_fixtures.py`` followed by ``uv run pychd decompile /tmp/pychd-multiversion/sample-3.X.pyc --hybrid-rewrite --backend codex`` for each X.

### What's hard about each version

The bytecode specification is **not stable across Python versions**. Below is a tour of the biggest source of pain for each release.

#### 3.6 — wordcode

Every instruction became exactly two bytes: 1 opcode + 1 argument. Before 3.6 some opcodes took multi-byte arguments. Decompilers from the 3.5 era had to handle variable-length instructions; modern decompilers can index instructions by uniform position.
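The uniform two-byte layout means `co_code` can be decoded with a fixed stride. The sketch below is for illustration only and uses just the standard `dis` module; `decode_wordcode` is a hypothetical helper. It assumes Python 3.6 or later, and it treats `EXTENDED_ARG` prefixes and the inline cache entries of 3.11+ as ordinary units instead of folding them into the instruction they belong to.

```python
import dis


def f(a, b=1):
    return a + b


def decode_wordcode(code):
    """Split co_code into (opcode name, one-byte argument) pairs."""
    raw = code.co_code
    return [(dis.opname[raw[i]], raw[i + 1]) for i in range(0, len(raw), 2)]


UNITS = decode_wordcode(f.__code__)
for name, arg in UNITS:
    print(f"{name:24s} {arg}")
```

The exact opcode names printed depend on the interpreter version, which is the point of the sections that follow.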
#### 3.6 — keyword arguments carry names as a tuple const

Before 3.6, `f(x=1)` pushed each keyword name as its own constant ahead of its value and encoded the keyword count in the `CALL_FUNCTION` argument. From 3.6 the *names* of the keywords are pushed as a single tuple constant:

```
LOAD_NAME         f
LOAD_CONST        1
LOAD_CONST        ('x',)    ← names tuple
CALL_FUNCTION_KW  1
```

Decompilers have to read that tuple constant to know that the `1` is bound to `x`, not positional.

#### 3.10 — `match` statements (PEP 634)

```python
match x:
    case 0:
        ...
    case _:
        ...
```

compiles to a chain of comparisons and conditional jumps, with dedicated `MATCH_CLASS` / `MATCH_KEYS` / `MATCH_MAPPING` / `MATCH_SEQUENCE` opcodes for class, mapping, and sequence patterns. Reconstructing the match-case structure from the bytecode requires recognising patterns the compiler emits; naive decompilers turn match into nested `if/elif/else` chains that *execute* the same but read very differently.

#### 3.11 — zero-cost exceptions

The biggest spec change in years. Try/except no longer uses `SETUP_FINALLY` blocks. Instead, every code object carries an **exception table** (`co_exceptiontable`): pairs of (instruction range, handler offset). The bytecode looks completely linear; the exception structure is implicit in a side table. Decompilers have to parse the exception table to recover the try/except structure at all.

#### 3.12 — PEP 709 comprehension inlining

This silently broke every decompiler. In 3.11:

```python
x = [i * 2 for i in range(10)]
```

emits a separate `<listcomp>` code object that the outer module calls. In 3.12 the body of the comprehension is inlined directly into the enclosing scope, so there is no `<listcomp>` code object to recurse into anymore. The comprehension is a stretch of *the module's own* bytecode that the decompiler must recognise structurally.
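Whether a given interpreter inlines comprehensions can be checked by looking for nested code objects in the module's constants. This is a small sketch with a hypothetical helper name (`nested_code_names`), not pychd's detection logic.

```python
import types


def nested_code_names(source):
    """Return the names of code objects stored directly in a module's constants."""
    module_code = compile(source, "<demo>", "exec")
    return [c.co_name for c in module_code.co_consts if isinstance(c, types.CodeType)]


names = nested_code_names("x = [i * 2 for i in range(10)]")
# Expected to contain '<listcomp>' on 3.11 and earlier,
# and to be empty on 3.12+ where the comprehension is inlined.
print(names)
```

A decompiler that recurses into `co_consts` to find comprehension bodies finds nothing to recurse into on the newer interpreters, which is why the pattern has to be recognised in the enclosing instruction stream instead.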
#### 3.12 — `CALL_INTRINSIC_1`

Several special-purpose opcodes (notably the legacy `IMPORT_STAR`) collapse into `CALL_INTRINSIC_1` with an integer argument:

```
# 3.11 — `from x import *`:
IMPORT_STAR
# 3.12 — same source:
CALL_INTRINSIC_1  2     # 2 = INTRINSIC_IMPORT_STAR
```

If your decompiler doesn't carry the intrinsic-index → semantic mapping, `from x import *` looks like an unrelated builtin call.

#### 3.14 — PEP 749 lazy annotations

Every annotated scope (module, class, or function) gets a synthetic `__annotate__` closure that returns the annotation dict on demand:

```python
class C:
    name: str
    age: int = 0
```

In 3.13 and earlier, the class body itself stored the annotations. In 3.14, the class body is much shorter; annotations migrate into a separate `__annotate__` closure attached via `SET_FUNCTION_ATTRIBUTE`. To recover `name: str` and `age: int`, pychd reads the `__annotate__` code object out of `co_consts` and walks **its** bytecode looking for the (name, annotation) pairs. This is the single biggest reason 3.13 and 3.14 need different rule passes.
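Locating the `__annotate__` code object is a plain recursive walk over `co_consts`. The sketch below uses a hypothetical `walk_code` helper and only shows how to find the closure; it does not reproduce pychd's pair-extraction logic. On interpreters older than 3.14 the `annotate` list is expected to be empty.

```python
import types

SOURCE = "class C:\n    name: str\n    age: int = 0\n"


def walk_code(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const)


module_code = compile(SOURCE, "<demo>", "exec")
names = [c.co_name for c in walk_code(module_code)]
annotate = [c for c in walk_code(module_code) if c.co_name == "__annotate__"]

print(names)
for c in annotate:
    # The annotation names and attribute keys live in the closure's own tables.
    print(c.co_names, c.co_consts)
```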
## Project layout pychd/ ├── ir.py # IR dataclasses + render() — the typed representation ├── rules.py # bytecode → IR, the *native* 3.14 rule pass ├── cross_version.py # xdis-driven *cross-version* rule pass (3.0 – 3.13) ├── decompile.py # hybrid pipeline + CLI glue + per-version dispatch ├── versions.py # magic-number table + rule-pass selector ├── compile.py # py_compile wrapper ├── validate.py # AST-based diff (with --ignore-annotations) ├── semantic.py # five-axis bytecode/behavioral/oracle comparator └── main.py # argparse entry point tests/ (337 tests total) ├── test_ir.py # IR node renderers ├── test_rules.py # rule extractor unit tests ├── test_versions.py # magic-number detection across 3.0–3.14 ├── test_chunking.py # LLM disassembly chunking ├── test_compile.py # compile pipeline ├── test_decompile.py # pipeline integration (mocked LLM) ├── test_validate.py # AST diff ├── test_e2e_stdlib.py # stdlib-style end-to-end recovery ├── test_cursor_sdk.py # real-world fixture: third-party SDK modules ├── test_cross_version.py # cross-version walker — runs against every │ # /tmp/pychd-multiversion/sample-*.pyc fixture ├── test_semantic.py # five-axis semantic equivalence (BX/BN/BS/FC/ED) └── test_syntax_coverage.py # 86-construct Python 3.14 matrix tools/ ├── build_corpora.py # builds 6 PyPI/stdlib/HumanEval corpora ├── build_multiversion_fixtures.py # compiles a sample with every local Python ├── benchmark.py # per-module measurement (JSON + markdown) ├── compare_decompilers.py # runs pychd vs uncompyle6 / decompyle3 ├── render_figures.py # writes assets/*.svg via plotly └── render_paper.py # regenerates README "Benchmarks" section ## Benchmarks (run by `just paper`) For every `.py` file in a corpus: .py → py_compile → .pyc → pychd → recovered .py where `` is either `rules-only` (deterministic baseline) or `hybrid-rewrite` (rule pass + one Codex CLI call per module). 
Both sets of numbers are reported below: `rules-only` is the deterministic, free, offline baseline you get without an LLM key; `hybrid-rewrite` is the headline result and the one the BibTeX note references.

…and measure eight metrics on the result. Three are **static** (AST shape, computed from the recovered source text); three are **semantic** (round-tripped through the producing CPython, computed from the recompiled `.pyc`); the remaining two, `FC` and `ED`, score functional correctness and textual similarity:

| Metric | What it requires |
|---|---|
| **signature_match** | Every original class/function/import name in the module survives in the recovered tree. Function bodies are out of scope (rule pass emits a placeholder). |
| **declaration_match** | `signature_match` AND every module/class-level variable and annotated attribute survives by name. |
| **strict_match** | Full normalised AST equality (bodies stripped to `pass`, annotations dropped, decorators dropped). A regression telltale, bounded above by CPython compiler normalisations. |
| **BX — `bytecode_exact`** | `marshal.dumps(orig_code) == marshal.dumps(py_compile(recovered.py))`, with `co_filename` normalised away. Strictest of the three semantic axes; trips on any cosmetic compiler-induced change. |
| **BN — `bytecode_normalized`** | Recursive equality of `dis.get_instructions` streams after dropping `CACHE`/`NOP`/`RESUME`/`EXTENDED_ARG`/`KW_NAMES` and de-specialising adaptive opcodes (`LOAD_FAST_BORROW`, `LOAD_FAST_CHECK`, `LOAD_SMALL_INT`, `RETURN_CONST`). |
| **BS — `behavioral_smoke`** | Recovered module imports under the producing interpreter; same public top-level name set; `inspect.signature` identical for every public callable. Tolerates compiler normalisations completely; catches whether the *external API* survived. |
| **FC — `functional_correctness` (Pass@1)** | The recovered module's entry-point function is fed to the corpus's own `check(candidate)` oracle; passes when every assertion holds.
Equivalent to Decompile-Bench's "Re-Executability" metric (arXiv 2505.12668) and PyLingual's "Execution Match" (USENIX Security 2025). Reported only on corpora that ship a test oracle (HumanEval is the current one). | | **ED — `edit_similarity`** | Mean character-level Ratcliff–Obershelp similarity (`difflib.SequenceMatcher.ratio`) in `[0, 1]`. Continuous metric — surfaces incremental rule-pass improvements that don't yet flip any boolean axis. Matches Decompile-Bench's "Edit Similarity" column. | Two tables are generated below — one for **rules-only** (no LLM, deterministic, milliseconds per module) and one for **hybrid-rewrite** (one Codex CLI call per module). The bullet headline and the per-corpus table that follows report the hybrid-rewrite numbers; a collapsed *rules-only* sub-section preserves the deterministic baseline. ### How these axes map to published benchmarks The eight columns above intentionally span the metric space used by the three live Python-decompilation benchmarks: | pychd axis | Equivalent in the literature | |---|---| | `parses` | "Re-Compilability" — Decompile-Bench | | `strict_match` | "AST Match" — PyLingual | | `BX` (bytecode_exact) | bytecode-level equivalence — uncompyle6 / decompyle3 self-tests | | `BN` (bytecode_normalized) | structural equivalence — adapted from binary-decompiler literature | | `BS` (behavioral_smoke) | weaker "Re-Executability" (import + surface only) — Decompile-Bench | | `FC` (Pass@1) | "Re-Executability" / "Execution Match" — Decompile-Bench, PyLingual | | `ED` (edit_similarity) | "Edit Similarity" — Decompile-Bench | | `signature_match` / `declaration_match` | pychd-specific declaration-level metrics | `FC` and `ED` are the two axes a reader coming from the published benchmarks expects to see; they're now reported alongside pychd's own declaration-oriented metrics so a side-by-side with paper numbers is possible without re-running anything. ### Why not naïve pyc → py → pyc? 
A natural intuition is *"if `pyc → py → pyc` produces the same `.pyc` bytes, the recovered source is equivalent."* The forward direction holds: same bytes ⇒ same semantics. The converse does **not**: two semantically-identical sources can produce different bytes. A raw `marshal.dumps` byte comparison conflates real source changes with five unrelated compiler-driven phenomena:

1. **`co_firstlineno` / `co_lnotab` / `co_positions` drift.** Any whitespace or comment difference shifts line/column tables. The bytecode itself is identical; the position metadata is not.
2. **`co_consts` / `co_names` / `co_varnames` reordering.** When the compiler folds or re-emits an expression (`if x is not None` ↔ `if not (x is None)`, partial constant folding, etc.) the index assignments shift even though `LOAD_CONST` resolves to the same value.
3. **Specialising-interpreter adaptive opcodes (CPython 3.11+).** `LOAD_FAST_CHECK`, `LOAD_FAST_BORROW`, `LOAD_FAST_AND_CLEAR`, `LOAD_SMALL_INT`, and `RETURN_CONST` are emitted opportunistically; the same source can compile to either the base or the specialised form depending on what the compiler can prove locally.
4. **Exception-table layout (CPython 3.11+).** Try/except blocks that compile to identical control flow can serialise their exception tables differently.
5. **Magic-number mismatch across minor versions.** A `.pyc` built by 3.13 and one built by 3.14 are never byte-equal, regardless of source.

That's why pychd reports three semantic axes rather than one. Each one tolerates a specific class of false negative: **BX** catches everything but trips on (1) – (4); **BN** strips (1), de-specialises (3), and ignores `CACHE` from (4), but cannot defeat (2) because constant-pool indices are baked into instruction operands; **BS** defeats all five by observing only the recovered module's *surface*.
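Phenomenon (1) is easy to reproduce with the standard library alone. The sketch below compiles the same statements with and without a comment and compares them three ways. The `normalised_stream` helper and its skip list are simplified assumptions loosely following the BN description, not pychd's actual comparator.

```python
import dis
import marshal

# Simplified skip list (an assumption for illustration, not pychd's real one).
SKIPPED = {"CACHE", "NOP", "RESUME", "EXTENDED_ARG"}


def normalised_stream(code):
    """Instruction names and rendered operands, minus bookkeeping opcodes."""
    return [
        (ins.opname, ins.argrepr)
        for ins in dis.get_instructions(code)
        if ins.opname not in SKIPPED
    ]


def line_numbers(code):
    """Source line recorded for each line-start offset."""
    return [line for _, line in dis.findlinestarts(code)]


plain = compile("x = 1\ny = x + 2\n", "<demo>", "exec")
commented = compile("# a comment\n\nx = 1\n\ny = x + 2\n", "<demo>", "exec")

raw_equal = marshal.dumps(plain) == marshal.dumps(commented)
stream_equal = normalised_stream(plain) == normalised_stream(commented)
lines_equal = line_numbers(plain) == line_numbers(commented)
print(raw_equal, stream_equal, lines_equal)
```

The raw byte comparison and the line numbers are expected to differ while the normalised instruction streams agree, which is the gap between a `BX`-style and a `BN`-style check.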
All three round-trip through the **producing CPython interpreter** — identified from the `.pyc` magic number and resolved via `uv python find ` — so (5) never applies to the comparison itself. The intersection (`BX ∧ BN ∧ BS`) is the strongest claim pychd can make about a recovery; the union (`BX ∨ BN ∨ BS`) is the weakest useful one. Both extremes are reported in the per-corpus table so reviewers can read the trade-off directly. **Headline:** hybrid-rewrite recovery on **2794 modules / 816,452 LoC**: - **Signature match: 2786/2794 (99.7%)** — every public class, function, import, and class-method name in the original survives in the recovered tree. - **Declaration match: 2785/2794 (99.7%)** — signature match plus every module/class-level variable and annotated attribute by name. - **Strict match: 2416/2794 (86.5%)** — full stripped-AST equality (cosmetic regression telltale; bounded by CPython compiler normalisations). - **Behavioral smoke: 1206/2794 (43.2%)** — recovered module imports under the producing interpreter and exposes the same public name + signature surface as the original. The semantic axis that tolerates the most compiler normalisations; see [Why not naïve pyc → py → pyc?](#why-not-naïve-pyc--py--pyc) for what `BX`/`BN`/`BS` measure and what each one catches. - **Pass@1 (functional correctness): 160/164 (97.6%)** — Decompile-Bench's re-executability oracle, scored on corpora that ship a `check(candidate)` test (HumanEval is currently the only one). The recovered module is imported under the producing interpreter and its entry-point function is fed to the original test suite. A pure rules-only baseline necessarily scores near 0 here because bodies are stubbed; future LLM-assisted or simple-body matcher work shows up directly in this number. - **Edit similarity (mean): 0.870** — Decompile-Bench-style character-level Ratcliff-Obershelp ratio averaged over the corpus. 1.0 means byte-identical, 0.0 means entirely dissimilar. 
A continuous metric that surfaces incremental rule-pass improvements which haven't yet flipped any boolean axis. #### Per-corpus results | Corpus | Modules | LoC | Parses | Sig | Decl | Strict | BX | BN | BS | FC (Pass@1) | ED | |---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| | **fuzz-synthetic**
_pyfuzz-generated random valid Python (guaranteed LLM-naïve)_ | 200 | 12,742 | 200/200 (100.0%) | 200/200 (100.0%) | 200/200 (100.0%) | 172/200 (86.0%) | 27/200 (13.5%) | 51/200 (25.5%) | 184/200 (92.0%) | n/a | 0.839 | | **recent-pypi**
_Recent / niche PyPI packages — 23 packages, capped at 8 modules each so no single project exceeds 5 % of the corpus. release-date proxy for low contamination (see §LLM contamination disclosure)_ | 182 | 60,390 | 182/182 (100.0%) | 181/182 (99.5%) | 181/182 (99.5%) | 149/182 (81.9%) | 45/182 (24.7%) | 93/182 (51.1%) | 37/182 (20.3%) | n/a | 0.816 | | **synthetic**
_Synthetic modules drafted with LLM assistance (2026-05-26 — see §LLM contamination disclosure)_ | 11 | 634 | 11/11 (100.0%) | 11/11 (100.0%) | 11/11 (100.0%) | 11/11 (100.0%) | 1/11 (9.1%) | 3/11 (27.3%) | 6/11 (54.5%) | n/a | 0.918 | | **stdlib**
_Curated stdlib (10 modules)_ | 10 | 15,996 | 10/10 (100.0%) | 10/10 (100.0%) | 10/10 (100.0%) | 10/10 (100.0%) | 6/10 (60.0%) | 6/10 (60.0%) | 6/10 (60.0%) | n/a | 0.912 | | **stdlib-obf**
_stdlib anonymised via pychd-pyobf (contamination differential)_ | 15 | 13,690 | 15/15 (100.0%) | 15/15 (100.0%) | 15/15 (100.0%) | 13/15 (86.7%) | 1/15 (6.7%) | 3/15 (20.0%) | 0/15 (0.0%) | n/a | 0.916 | | **stdlib-full**
_Full Python 3.14 stdlib (single-file modules)_ | 153 | 130,182 | 153/153 (100.0%) | 151/153 (98.7%) | 151/153 (98.7%) | 140/153 (91.5%) | 66/153 (43.1%) | 91/153 (59.5%) | 129/153 (84.3%) | n/a | 0.856 | | **stdlib-full-obf**
_stdlib-full anonymised via pychd-pyobf (contamination differential)_ | 153 | 95,763 | 153/153 (100.0%) | 149/153 (97.4%) | 148/153 (96.7%) | 123/153 (80.4%) | 26/153 (17.0%) | 51/153 (33.3%) | 4/153 (2.6%) | n/a | 0.897 | | **pypi**
_PyPI: requests, click, attrs, flask, httpx, rich_ | 189 | 74,879 | 189/189 (100.0%) | 189/189 (100.0%) | 189/189 (100.0%) | 170/189 (89.9%) | 75/189 (39.7%) | 129/189 (68.3%) | 63/189 (33.3%) | n/a | 0.905 | | **pypi-obf**
_pypi anonymised via pychd-pyobf (contamination differential)_ | 189 | 39,026 | 189/189 (100.0%) | 189/189 (100.0%) | 189/189 (100.0%) | 155/189 (82.0%) | 48/189 (25.4%) | 92/189 (48.7%) | 6/189 (3.2%) | n/a | 0.891 | | **pypi-top20**
_PyPI top-20 pure-Python packages_ | 682 | 258,421 | 682/682 (100.0%) | 681/682 (99.9%) | 681/682 (99.9%) | 576/682 (84.5%) | 142/682 (20.8%) | 312/682 (45.7%) | 432/682 (63.3%) | n/a | 0.833 | | **pypi-top20-obf**
_pypi-top20 anonymised via pychd-pyobf (contamination differential)_ | 682 | 108,348 | 682/682 (100.0%) | 682/682 (100.0%) | 682/682 (100.0%) | 569/682 (83.4%) | 98/682 (14.4%) | 250/682 (36.7%) | 36/682 (5.3%) | n/a | 0.886 | | **humaneval**
_OpenAI HumanEval (164 problems)_ | 164 | 3,361 | 164/164 (100.0%) | 164/164 (100.0%) | 164/164 (100.0%) | 164/164 (100.0%) | 0/164 (0.0%) | 152/164 (92.7%) | 161/164 (98.2%) | 160/164 (97.6%) | 0.920 | | **humaneval-obf**
_humaneval anonymised via pychd-pyobf (contamination differential)_ | 164 | 3,020 | 164/164 (100.0%) | 164/164 (100.0%) | 164/164 (100.0%) | 164/164 (100.0%) | 92/164 (56.1%) | 126/164 (76.8%) | 142/164 (86.6%) | n/a | 0.927 | | **aggregate** | **2794** | **816,452** | **2794/2794 (100.0%)** | **2786/2794 (99.7%)** | **2785/2794 (99.7%)** | **2416/2794 (86.5%)** | **627/2794 (22.4%)** | **1359/2794 (48.6%)** | **1206/2794 (43.2%)** | **160/164 (97.6%)** | **0.870** | #### Visualisation ![Recovery rate by corpus](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/6ad7a274da080510.svg) Bars = signature match · declaration match · strict match per corpus. #### Residual failure attribution **Residual failures** (signature match): | Cause | Count | Fundamentally recoverable? | |---|---:|---| | other / complex RHS | 4 | future work | | try/except ImportError (control flow) | 2 | future work | | if-False-block (CPython constant-folds — unrecoverable) | 2 | ❌ no — constant-folded | ### Comparison with prior Python decompilers Four publicly-available decompilers compete with pychd on Python 3.x bytecode. Every figure below comes from running the named version of each tool against the locally-built corpus on this host — no paper numbers are reused. **The headline comparison axis is `strict_match`** (stripped-AST equality). pychd's `signature_match` / `declaration_match` lead is real but partially structural — pychd stubs bodies with `pass` when the rule pass can't recover them, which preserves declarations even when the recovery is otherwise incomplete. `strict_match` is the axis that compares apples-to-apples against body-recovering tools like `decompyle3`. #### Head-to-head on `synthetic` — Python 3.8 The eight `synthetic` modules compiled with Python 3.8 and handed to every 3.8-capable tool we have. 
Read this with the [§LLM contamination disclosure](#llm-contamination-disclosure) in mind: these modules were drafted with LLM assistance during this project's development, so a high pychd score here is **not** evidence of contamination-free generalisation. We keep the table because it still measures whether the bytecode-driven pipeline produces *syntactically valid, AST- matching* source from a Python 3.8 .pyc — which `decompyle3` fails to do on 2 of the 8 modules even with the source pattern available in its training data. | Tool | parses | sig | decl | **strict** | BN | BS | ED | |---|---:|---:|---:|---:|---:|---:|---:| | **pychd (hybrid-rewrite:codex)** | 8/8 | 8/8 | 8/8 | **8/8** | **8/8** | 5/8 | 0.968 | | `decompyle3` 3.9.3 | 6/8 | 6/8 | 6/8 | 3/8 | 0/8 | 0/8 | 0.551 | | `uncompyle6` 3.9.3 | not run on this corpus yet | — | — | — | — | — | — | Source: `assets/_synthetic_comparison.json` (commit-tracked). Reproduce: uv run python tools/build_corpora.py --only synthetic # then compile with Python 3.8 and run pychd + decompyle3. #### Broader head-to-head — 23-module stdlib + PyPI subset Below is the broader comparison against a 23-module mix of stdlib + curated-PyPI modules. The PyPI subset overlaps published corpora (`six`, `packaging`, `certifi`, `idna`, `charset_normalizer`) that the Codex backend almost certainly saw at training time, so all the caveats from [§LLM contamination disclosure](#llm-contamination-disclosure) apply here too. 
| Tool | Source | Install | Coverage | Best Py version (this run) | |---|---|---|---|---| | [`uncompyle6`](https://pypi.org/project/uncompyle6/) | PyPI | `uv sync` | 2.4 – 3.8 | 3.8 | | [`decompyle3`](https://github.com/rocky/python-decompile3) | PyPI | `uv sync` | 3.7 / 3.8 only | 3.8 | | [`pycdc`](https://github.com/zrax/pycdc) | git source build | `just decompilers-build` | 1.0 – 3.10 | 3.10 | | [`PyLingual`](https://github.com/syssec-utd/pylingual) | podman image (ML-based) | `just decompilers-build` | 3.6 – 3.13 | 3.13 | **Each external tool is evaluated on its *own* highest-supported Python version**, not forced down to a shared 3.8 baseline. uncompyle6 and decompyle3 are scored on 3.8 (their newest supported release), pycdc on 3.10, and PyLingual on 3.13. pychd is scored on every one of those three versions so each row of the cross-version matrix below shows pychd vs the competitor's best-case Python. PyFET (Ahad et al., S&P 2023) is a bytecode *transformer* rather than a standalone decompiler — it rewrites .pyc files so they become readable by uncompyle6/decompyle3. Integrating it would require composing the transformer with one of those decompilers end-to-end, which is on the roadmap but not in this comparison. ### Cross-version coverage Each external tool runs against **its own preferred Python version** (uncompyle6 / decompyle3 → 3.8; pycdc → 3.10; PyLingual → 3.13). pychd runs against all three so a reviewer can see how pychd performs *under each competitor's best-case Python*, side by side. The harness records "failed", "timeout", or "not installed" for (tool, version) pairs the tool can't handle — pychd is the only tool covering every 3.x release, and the matrix below makes that explicit instead of hiding it behind a 3.8-only comparison. Run-time notes for reviewers reproducing the comparison: * **uncompyle6 / decompyle3 / pycdc** finish in a few seconds per module; the full 23-module sweep takes a couple of minutes per Python version. 
* **PyLingual** spawns a podman container per module with a CPU-only PyTorch backend. Model load is ~10 s plus inference proportional to the module size. The harness enforces a 60 s per-module wall-clock timeout — modules larger than ~500 LoC reliably hit it (PyLingual's segmenter scales super-linearly with statement count). Those modules are recorded as ``timeout`` rather than 0; the reviewer can re-run with a larger ``timeout`` field in ``EXTERNAL_TOOLS`` if needed. Plan ~15 minutes for the full PyLingual pass on Python 3.13. * **Skipping wasted runs**: each external tool only runs against its *own* preferred Python version (`TOOL_PREFERRED_VERSIONS` table in `tools/compare_decompilers.py`). Earlier versions of the harness ran every tool against every version and masked the irrelevant rows; that wasted ~20 minutes per run on pylingual containers we'd discard. Reviewers who want the full matrix can drop the skip-guard block in `_run_one_version`. ![Per-tool comparison at each tool's preferred Python version — 23 real-world modules, faceted by Python version](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/739f8c270b080511.svg) #### Cross-version coverage matrix | Tool | Py 3.8 | Py 3.10 | Py 3.13 | |---|:---:|:---:|:---:| | **pychd (hybrid-rewrite:codex)** | ✅ 23/23 | ✅ 23/23 | ⚠ 20/23 | | **uncompyle6** | ⚠ 4/23 | — (not run) | — (not run) | | **decompyle3** | ⚠ 12/23 | — (not run) | — (not run) | | **pycdc** | — (not run) | ⚠ 4/23 | — (not run) | | **pylingual** | — (not run) | — (not run) | ⚠ 8/23 |
Python 3.8 — all eight axes | Tool | Version | Sig | Decl | Strict | BX | BN | BS | ED | |---|---|---:|---:|---:|---:|---:|---:|---:| | **pychd (hybrid-rewrite:codex)** | main (this repo) | 23/23 | 23/23 | 16/23 | 5/23 | 15/23 | 14/23 | 0.724 | | **uncompyle6** | uncompyle6, version 3.9.3 | 4/23 | 4/23 | 3/23 | 0/23 | 3/23 | 1/23 | 0.483 | | **decompyle3** | 3.9.3 (PyPI) | 12/23 | 11/23 | 4/23 | 0/23 | 4/23 | 8/23 | 0.603 | | **pycdc** | (skipped — see preferred-version row) | _(out of scope (preferred: Py 3.10))_ | — | — | — | — | — | — | | **pylingual** | (skipped — see preferred-version row) | _(out of scope (preferred: Py 3.13))_ | — | — | — | — | — | — |
Python 3.10 — all eight axes | Tool | Version | Sig | Decl | Strict | BX | BN | BS | ED | |---|---|---:|---:|---:|---:|---:|---:|---:| | **pychd (hybrid-rewrite:codex)** | main (this repo) | 23/23 | 23/23 | 17/23 | 6/23 | 15/23 | 13/23 | 0.743 | | **uncompyle6** | (skipped — see preferred-version row) | _(out of scope (preferred: Py 3.8))_ | — | — | — | — | — | — | | **decompyle3** | (skipped — see preferred-version row) | _(out of scope (preferred: Py 3.8))_ | — | — | — | — | — | — | | **pycdc** | b428976 (2026-04-06) | 4/23 | 4/23 | 1/23 | 0/23 | 1/23 | 1/23 | 0.252 | | **pylingual** | (skipped — see preferred-version row) | _(out of scope (preferred: Py 3.13))_ | — | — | — | — | — | — |
Python 3.13 — all eight axes | Tool | Version | Sig | Decl | Strict | BX | BN | BS | ED | |---|---|---:|---:|---:|---:|---:|---:|---:| | **pychd (hybrid-rewrite:codex)** | main (this repo) | 20/23 | 20/23 | 17/23 | 5/23 | 11/23 | 14/23 | 0.723 | | **uncompyle6** | (skipped — see preferred-version row) | _(out of scope (preferred: Py 3.8))_ | — | — | — | — | — | — | | **decompyle3** | (skipped — see preferred-version row) | _(out of scope (preferred: Py 3.8))_ | — | — | — | — | — | — | | **pycdc** | (skipped — see preferred-version row) | _(out of scope (preferred: Py 3.10))_ | — | — | — | — | — | — | | **pylingual** | main (image: pychd-pylingual:latest) | 8/23 | 8/23 | 5/23 | 0/23 | 3/23 | 5/23 | 0.311 |
`FC` (Pass@1) is omitted from this corpus — the 3.8 stdlib + PyPI subset doesn't ship `check(candidate)` oracles, so no tool can be scored on it. Pass@1 is reported per-corpus in the headline table above (currently HumanEval only). | Tool | Py | Strict | BN | ED | |---|---|---:|---:|---:| | **pychd (hybrid-rewrite:codex)** | 3.8 | **16/23** | 15/23 | 0.724 | | **pychd (hybrid-rewrite:codex)** | 3.13 | **17/23** | 11/23 | 0.723 | | `decompyle3` | 3.8 | 4/23 | 4/23 | 0.603 | | `uncompyle6` | 3.8 | 3/23 | 3/23 | 0.483 | | `pycdc` | 3.10 | 1/23 | 1/23 | 0.252 | | `pylingual` | 3.13 | 5/23 | 3/23 | 0.311 | Each external tool is scored on its own preferred Python version (uncompyle6 / decompyle3 → 3.8, pycdc → 3.10, pylingual → 3.13). pychd's hybrid-rewrite is run on the same `.pyc` file each tool receives. pychd's `Sig`/`Decl` lead in the per-version tables above (99–100% vs 17–50%) is partially structural — the rule pass preserves declarations losslessly even when bodies can't be recovered — so `Strict` is the cleaner head-to-head number. * **decompyle3** commits to a full body reconstruction; when the reconstruction round-trips, `BN` / `BS` / `ED` benefit. When it doesn't, the textual overlap still drags `ED` upward, but the static axes punish it — bodies that compile without preserving declarations lose `Sig`/`Decl`. * **uncompyle6** is the broadest version coverage in the literature (2.4 onwards) but on 3.8 its grammar has known regressions; it trades coverage breadth for accuracy on the latest supported release. * **pycdc** is a C++ tool that parses bytecode in one pass with no Python dependency. Its 3.8 declaration recovery is noisier than decompyle3's (lost annotations, default-value substitution) but it's the only tool here that runs on a fresh checkout with no Python install at all. * **PyLingual** uses LLM-based segmentation + statement translation on top of a deterministic grammar. 
It's the most accurate of the external tools on its supported range (3.6 – 3.13) but requires a podman image, ~2 GB of model weights, and PyTorch.
* `BX` is 0 for every external tool on this corpus (pychd scores 5/23 to 6/23) because CPython's compiler emits constant pools whose ordering depends on AST shape; any divergence in the source, even a textually-equivalent rewrite, shifts indices in `co_consts`. No external tool currently emits source that round-trips byte-equal under the original compiler.

Reporting all eight axes lets a reviewer read the trade-off rather than relying on whichever axis flatters a given tool. Re-run via `just bench-compare`.

### Why these corpora?

Selected to mirror what published Python-decompilation work evaluates against. PyLingual ([Wiedemeier et al., 2024](https://kangkookjee.io/wp-content/uploads/2024/11/pylingual.pdf)) uses CodeSearchNet / PyPI / VirusTotal / PyLingual.io. PyFET ([Ahad et al., S&P 2023](https://userlab.utk.edu/publications/ahad2023pyfet)) draws from 3,000 CPython stdlib + popular PyPI programs. [Decompile-Bench](https://arxiv.org/abs/2505.12668) adds HumanEval/MBPP. pychd's corpora are downloaded on demand into `/tmp/pychd-corpora/` (nothing third-party is committed):

| Corpus | Where it comes from |
|---|---|
| `fuzz-synthetic` | 200 random valid-Python modules generated on every run via `pychd-pyfuzz`. Guaranteed LLM-naïve by construction (see §LLM contamination disclosure). |
| `recent-pypi` | 23 recent / niche PyPI packages (`cursor-sdk` 0.1.5, `dspy` 3.2, `logfire` 4.33, …; full list and release-date pins in `assets/_recent_pypi_pins.json`). Each package capped at 8 deterministic modules so no single project exceeds ~5 % of the corpus. `openai` and `openai-agents` are deliberately excluded since the hybrid-rewrite backend is OpenAI Codex. |
| `synthetic` | 11 hand-curated modules (LLM-assisted, see §LLM contamination disclosure). |
| `stdlib` | 10 curated single-file stdlib modules.
| | `stdlib-full` | Every single-file `.py` under the running Python's stdlib path. | | `pypi` | 6 popular pure-Python PyPI packages (`requests`, `click`, `attrs`, `flask`, `httpx`, `rich`). | | `pypi-top20` | 20 more pure-Python PyPI packages (`certifi`, `urllib3`, `packaging`, `PyYAML`, `jinja2`, `werkzeug`, `pygments`, …). | | `humaneval` | 164 reference solutions from OpenAI's HumanEval. | | `*-obf` (5 mirrors) | `stdlib-obf` / `stdlib-full-obf` / `pypi-obf` / `pypi-top20-obf` / `humaneval-obf`: the matching raw corpus rewritten through `pychd-pyobf` so identifiers / strings / docstrings are stripped while the opcode stream is preserved. The raw-vs-obf delta on the same pipeline isolates the contamination contribution. | ## Reproducibility Every number, table, and chart in this README is regenerable by a single command: just paper …which is equivalent to: uv sync # 1. dependencies uv run python tools/build_corpora.py # 2. download corpora to /tmp uv run pytest tests/ -q # 3. 337 tests uv run python tools/render_paper.py # 4. regenerate README results # + assets/_results.json # + assets/_comparison.json uv run python tools/render_figures.py # 5. regenerate assets/*.svg uv run ruff check pychd tests # 6. lint uv run ty check pychd tests # 7. type check ### Reproducibility limits (the honest version) * **PyPI corpora are not version-pinned.** `tools/build_corpora.py` downloads the *latest* release of each package from PyPI. Module counts and the denominator of every per-corpus percentage drift as upstream packages publish new releases. The `recent-pypi` corpus is the exception: every package there has its exact version and release date recorded in `assets/_recent_pypi_pins.json` so the recency claim is auditable. The remaining 26 PyPI packages in the `pypi` + `pypi-top20` corpora are not yet pinned. Pinning every wheel is on the roadmap. 
* **`stdlib-full` reflects the running interpreter's stdlib.** Re-running on a different 3.14 patch release (3.14.0 vs 3.14.3) shifts which modules are included.
* **Headline numbers measure the native 3.14 rule pass only.** The cross-version pass (3.0 – 3.13) is exercised by 31 fixture-based tests against `/tmp/pychd-multiversion/sample-*.pyc` plus a Python-3.8 head-to-head on a 23-module shared corpus against `uncompyle6` and `decompyle3` (see [Comparison with prior Python decompilers](#comparison-with-prior-python-decompilers)). Per-version aggregate numbers for 3.0 – 3.7 require local interpreters of those releases, which are no longer distributed by `uv python install`.
* **The bundled `assets/_results.json` and `assets/_comparison.json` are committed** so reviewers who cannot run the corpus build still see the exact numbers the README claims.

The task runner exposes every primitive:

| Command | What it does |
|---|---|
| `just setup` | `uv sync` (creates `.venv` with dev + runtime deps) |
| `just hooks-install` | Register prek pre-commit (ruff) and pre-push (ty + pytest) hooks |
| `just lint` | `ruff check` + `ruff format --check` + `ty check` |
| `just fix` | `ruff check --fix` + `ruff format` |
| `just test` | `pytest tests/ -v` |
| `just ci` | `lint` + `test` (the gate prek runs on push) |
| `just bench` | Build all corpora + run all benchmarks |
| `just bench-stdlib` / `bench-pypi` / `bench-cursor` | One corpus |
| `just bench-versions` | Compile a sample with every locally-installed Python and verify pychd detects each `.pyc` |
| `just paper` | Full reproduction (corpora + tests + lint + type + render) |
| `just compile ` / `decompile ` / `validate ` | CLI shortcuts |

To exercise cross-version detection on real `.pyc` files:

    uv run python tools/build_multiversion_fixtures.py
    # compiles a sample with every locally-installed Python 3.x and emits
    # /tmp/pychd-multiversion/sample-3.X.pyc.

    uv run pytest tests/versions_test.py -v
    # 20 tests, including integration tests over every fixture.

## Releasing

This repository is a **uv workspace** with three PyPI-publishable members; each has its own GitHub Actions workflow and its own tag prefix so a release of one does not drag the others along.

| Package | PyPI name | Tag prefix | Workflow |
|---|---|---|---|
| Decompiler | `pychd` | `pychd-v*` | `.github/workflows/publish-pychd.yaml` |
| Syntactic Fuzzer | `pychd-pyfuzz` | `pyfuzz-v*` | `.github/workflows/publish-pyfuzz.yaml` |
| Obfuscator | `pychd-pyobf` | `pyobf-v*` | `.github/workflows/publish-pyobf.yaml` |

Cut a release with the matching `just` recipe (which runs `git tag` and `git push origin` together):

    just release-pychd 1.3.0    # tags pychd-v1.3.0
    just release-pyfuzz 0.1.0   # tags pyfuzz-v0.1.0
    just release-pyobf 0.1.0    # tags pyobf-v0.1.0

### Trusted Publishing setup (one-time per package)

All three workflows publish via PyPI's OIDC Trusted Publishing (no API tokens in repository secrets). Each PyPI project must be registered with this repository + workflow before its first tag push:

1. On PyPI, create the project (or reserve the name) and open **Manage → Publishing → Add a new pending publisher**.
2. Fill in:
   - Owner: `diohabara`
   - Repository name: `pychd`
   - Workflow filename: `publish-pychd.yaml` (or `publish-pyfuzz.yaml` / `publish-pyobf.yaml`)
   - Environment name: `pypi`
3. In this GitHub repository, create the `pypi` environment under **Settings → Environments**. Add review requirements / branch protection rules as needed.

After that, tag pushes (`pychd-v*` / `pyfuzz-v*` / `pyobf-v*`) release directly to PyPI.

## Scope

The rule pass reconstructs the **declaration skeleton** of every module: every class, function, import, docstring, annotation, decorator (including arguments), default argument, and the structure of module-level `if` blocks.
Function bodies are reconstructed only for the trivial closed-form cases that account for the bulk of one-line definitions (`return X`, `return self.attr.attr2`, `return `, `pass`); structured bodies (loops, branches, multi-statement sequences) are intentionally left as `UnknownBlock` placeholders for the hybrid LLM pass to fill in with the bytecode disassembly as context.

This split is the design: body recovery is a tractable LLM task on top of a *correct* skeleton; trying to recover bodies symbolically across every CPython release is what blocked the prior generation of tools (uncompyle6 / decompyle3) at Python 3.8. The rule pass owns everything that compiles to a deterministic bytecode shape; the LLM owns the rest.

A `try: import X except ImportError:` matcher is implemented in `pychd/rules.py` but currently disabled: it regressed ~15 modules across the benchmark corpus because its handler-boundary heuristic mis-bounds handler ranges in modules whose handler exits via `JUMP_FORWARD` rather than `POP_EXCEPT`. The fallback contract holds: both branches of the try/except flatten into top-level imports, so the names still survive in the recovered tree; only the `try` / `except` indentation is dropped. Cleanly enabling the matcher requires walking the exception table for *all* nested entries rather than just the entry whose start offset matches the current walker position.

## Citing

If you reference pychd somewhere, here's the BibTeX:

    @software{pychd,
      author = {Takemaru Kadoi},
      title  = {{pychd}: A hybrid rule-based and {LLM}-augmented {P}ython bytecode decompiler targeting {P}ython 3.14},
      year   = {2026},
      url    = {https://github.com/diohabara/pychd},
      note   = {Two-tier evaluation on 1{,}217 real-world modules / 513,724 LoC spanning the Python 3.14 stdlib, 26 PyPI packages, OpenAI HumanEval, and a third-party SDK.
                (a) Deterministic rule-only path: 99.8\% signature match (1215/1217), 99.6\% declaration match (1212/1217), 36.0\% strict-AST match (pre-improvements baseline).
                The 0.2\% signature-match residual is two stdlib modules whose source uses ``if False:'' / ``if 0:'' guards: CPython's constant folder erases those blocks, so the bytecode contains nothing to recover.
                Hybrid-rewrite closes the gap only by memorising the original source, not by decompiling.
                (b) Hybrid-rewrite path (rule pass + one Codex CLI call per module, with the improved pychd rule pass and the AST-normalising strict\_match metric used by prior research): 93.2\% strict-AST match (2.59$\times$ improvement over the pre-improvements baseline) and 97.6\% functional-correctness Pass@1 on HumanEval (160/164), above prior published Python decompiler re-executability baselines (PyLingual, USENIX Security 2025; Decompile-Bench, arXiv 2505.12668).
                Cross-version xdis-driven pass extends declaration recovery to every CPython 3.0 -- 3.13 release.}
    }
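
The citation note attributes the signature-match residual to `if False:` / `if 0:` guards that CPython removes at compile time. The sketch below is one way to observe that effect with the standard-library `dis` module. It is not part of pychd: `SOURCE` and `defined_names` are hypothetical names chosen for illustration, and the behaviour shown is assumed to hold on the CPython releases this README targets.

```python
import dis

# Hypothetical demo module: one function behind an `if 0:` guard, one normal function.
SOURCE = """\
if 0:
    def hidden():
        return 1

def visible():
    return 2
"""


def defined_names(code):
    """Return the names bound by STORE_NAME instructions in a module code object."""
    return {
        ins.argval
        for ins in dis.get_instructions(code)
        if ins.opname == "STORE_NAME"
    }


module_code = compile(SOURCE, "<demo>", "exec")
names = defined_names(module_code)

# Only definitions that survive compilation leave a STORE_NAME behind.
print(sorted(names))
```

If the guarded block has been eliminated, no instruction in the module's bytecode binds `hidden`, so a bytecode-only decompiler has no signature to recover for it.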