Aoko7/TriGma-code

GitHub: Aoko7/TriGma-code


# TriGma

[中文说明](README.zh-CN.md)

TriGma is a preprint-safe public code release for a tri-branch static malware family classification pipeline. It combines:

- micro-level byte and opcode representations
- meso-level function-call graph representations
- macro-level PE metadata representations

## Overview

This repository releases the code and workflow structure behind TriGma before paper publication. It is designed to help readers understand:

- how PE samples are curated and normalized
- how Ghidra-based static extraction feeds the pipeline
- how `micro`, `meso`, and `macro` branches are constructed
- how pretraining and downstream tri-modal fusion are organized
- how the robustness workflow is implemented in code

This is a code-first release. It is not a public malware dataset release.

## Why This Repository Is Worth Reading

- End-to-end workflow is preserved, from sample preparation to downstream training and robustness evaluation code paths.
- Reusable implementation is separated from paper-facing experiment mirrors, so you can read either the engineering layer or the paper workflow layer first.
- The public boundary is conservative: unpublished result bundles, prediction packets, sample-level split exports, raw malware binaries, packed shards, and checkpoints are intentionally excluded.

## Repository Structure

- `src/`
  Reusable implementation for preparation, pre-processing, pretraining, fine-tuning, and visualization.
- `scripts/`
  Orchestration, preprocessing, evaluation, and utility entrypoints.
- `experiments_trigma_mainline/`
  Paper-facing workflow mirror for pipeline, pretraining, clean experiments, and robustness experiments.
- `tests/`
  Lightweight public validation coverage.
- `plotting/`
  Figure-generation pipeline and plotting config.
- `docs/`
  Codebase map, data pipeline guide, reproducibility notes, and release guidance.

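
The tri-branch idea from the Overview (separate `micro`, `meso`, and `macro` representations that are later fused) can be pictured with a small sketch. This is only an illustration under stated assumptions: the `BranchEmbeddings` container, the helper names, and the choice of normalize-then-concatenate late fusion are all hypothetical and are not taken from this repository. The actual fusion code lives in `src/` and may work quite differently.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BranchEmbeddings:
    """Hypothetical container for one sample's per-branch feature vectors."""
    micro: Sequence[float]  # byte and opcode representation
    meso: Sequence[float]   # function-call graph representation
    macro: Sequence[float]  # PE metadata representation


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector stays all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def fuse_by_concatenation(sample: BranchEmbeddings) -> List[float]:
    """Assumed late fusion: normalize each branch, then concatenate
    in micro, meso, macro order. Not TriGma's confirmed method."""
    fused: List[float] = []
    for branch in (sample.micro, sample.meso, sample.macro):
        fused.extend(l2_normalize(branch))
    return fused


if __name__ == "__main__":
    sample = BranchEmbeddings(micro=[3.0, 4.0], meso=[0.0, 2.0, 0.0], macro=[1.0])
    print(fuse_by_concatenation(sample))
```

Normalizing each branch before concatenation keeps a branch with large raw magnitudes from dominating the fused vector, which is one common reason to prefer this over plain concatenation in a sketch like this.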
## Quick Start

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    python3 scripts/run_finetune_repro.py --dry-run
    pytest tests -q

More setup details:

- installation guide: `INSTALL.md`
- runtime asset layout: `RUNTIME_LAYOUT.md`
- quickstart notes: `docs/quickstart.md`

## Choose Your Reading Path

- End-to-end paper workflow: `experiments_trigma_mainline/workflow.md`
- Reusable downstream training code: `src/finetune/README.md`
- Repository map: `docs/codebase_map.md`
- Data pipeline guide: `docs/data_pipeline_guide.md`
- Reproducibility guide: `docs/reproducibility_guide.md`

## Release Scope

This repository includes:

- reusable code
- workflow mirrors
- plotting support
- public-facing documentation

This repository does not include:

- raw malware samples
- attack-generated binaries
- packed dataset shards
- checkpoints
- unpublished result bundles
- prediction packets and evaluation manifests
- sample-level split exports

See:

- release boundary: `docs/OPEN_SOURCE_SCOPE.md`
- included vs excluded summary: `INCLUDED_AND_EXCLUDED.md`
- public doc index: `docs/PUBLIC_DOC_INDEX.md`

## Runtime Note

Some scripts still expect external assets under `runtime/` and write local outputs under `outputs/`. Those paths are placeholders for assets from the original research environment and are intentionally not distributed here.

See:

- `RUNTIME_LAYOUT.md`

## Project Status

This directory exists as the safer publication surface before paper release. Compared with the internal release candidate, it intentionally omits result-oriented materials that are valuable to preserve internally but are not suitable to publish yet.

## Citation And Release

- license: `LICENSE`
- citation template: `CITATION.md`
- license record: `LICENSE_STATUS.md`
- publication gate: `docs/OPEN_SOURCE_RELEASE_GATE.md`
- GitHub launch checklist: `docs/GITHUB_RELEASE_CHECKLIST.md`
- short public positioning note: `docs/PREPRINT_PUBLIC_POSITIONING.md`
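
Because the Runtime Note says some scripts expect assets under `runtime/` and write to `outputs/`, it can help to check those two directories before running anything beyond the dry run. The following is a minimal sketch, not part of the repository: the function names `check_runtime_layout` and `ensure_outputs_dir` are hypothetical, and only the two top-level directory names come from this README. The expected contents of `runtime/` are described in `RUNTIME_LAYOUT.md` and are not checked here.

```python
from pathlib import Path
from typing import Dict


def check_runtime_layout(repo_root: Path) -> Dict[str, bool]:
    """Report whether the top-level `runtime/` and `outputs/` directories exist."""
    return {name: (repo_root / name).is_dir() for name in ("runtime", "outputs")}


def ensure_outputs_dir(repo_root: Path) -> Path:
    """Create `outputs/` if missing; `runtime/` is left alone because its
    assets are external and cannot be recreated locally."""
    outputs = repo_root / "outputs"
    outputs.mkdir(parents=True, exist_ok=True)
    return outputs


if __name__ == "__main__":
    root = Path.cwd()
    for name, present in check_runtime_layout(root).items():
        print(f"{name}/: {'found' if present else 'missing'}")
```

A missing `runtime/` directory is expected in a fresh clone, since those assets are intentionally not distributed.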