dungkkk7/python_mal

GitHub: dungkkk7/python_mal

An end-to-end pipeline for detecting malicious Python packages, using CodeQL and graph learning for automated analysis and reporting.

Stars: 1 | Forks: 0

# Python Malware Pipeline

A pipeline for detecting malicious Python packages, following the flow:

`CodeQL queries -> Observations -> Knowledge Graph -> Embedding -> Classifier -> Evaluation Report`

## Current runtime

- Main runtime: `tools/`
- Old runtime `python_malware_pipeline/`: deprecated (no longer used)
- Query pack: `codeql-queries/` (47 queries)
- Standard entry point: the `Makefile` at the repository root
- Default benchmark split: `family_aware`
- Model tuning uses an internal validation set; the holdout test set is used only for final scoring, and only when the split is valid

## Directory structure

```
python_mal/
├── Makefile                 # standard commands for dev/test/run/clean
├── codeql-queries/          # 47 CodeQL queries + qlpack
├── document/                # main docs + archive + runbook
├── tools/
│   ├── src/                 # extraction/graph/learning modules
│   ├── tests/               # unit tests
│   ├── scripts/             # helper scripts (cleanup)
│   ├── pipeline.py          # single full pipeline
│   ├── visualize_e2e.py     # generate figures/tables/report
│   └── output/              # standard output (e2e)
└── output/                  # legacy/local logs
```

## Standard usage

### 1) Run the unit tests

```
make test
```

### 2) Smoke test

```
make e2e-smoke
```

### 3) E2E benchmark on 100 samples + visualization

```
make e2e-full100
make viz
```

### 4) Full dataset run (sample count detected automatically from the existing dataset)

By default, `make e2e-fullall` will:

- use `tools/data_4000` if it exists,
- use a `70/30` holdout split (`TEST_SIZE=0.3`),
- choose `SCAN_WORKERS/THREADS/RAM` automatically from the machine's resources to increase throughput.

To resume an interrupted run, use `tools/scripts/run_full_dataset.sh --resume` with the same `output-dir`.

Example of forcing a manual configuration:

```
make e2e-fullall DATA_DIR=tools/data_4000 TEST_SIZE=0.3 SCAN_WORKERS=6 THREADS=2 RAM=1400
```

### 4.1) Run ablation tests from existing artifacts

```
python3 tools/ablation_bench.py \
  --artifact-dir tools/output/e2e/full4000_escalated \
  --output-dir tools/output/e2e/full4000_escalated/ablations
```

The two current tabular baselines are intentionally kept minimal:

- `Indicator-bag + LR`: a 47-dimensional bag-of-indicators log-count vector.
- `Flat-SARIF + MLP`: a flat vector built from indicator histograms, severity/context/result-kind/baseline-state histograms, and a small set of summary statistics taken directly from SARIF.

### 5) Clean outputs/caches

```
make clean-outputs   # keeps tools/output/e2e/full100 + fullall
make clean-all       # also deletes benchmark/fullall
make clean-cache
```

## Key outputs

- E2E metrics JSON: `tools/output/e2e/full100/reports/eval_report.json`
- E2E package-level CSV: `tools/output/e2e/full100/e2e_package_results.csv`
- Visual report markdown: `tools/output/e2e/full100/reports/visual_report.md`
- Full dataset report: `tools/output/e2e/fullall/reports/eval_report.json`
- Ablation summary: `tools/output/e2e/full4000_escalated/ablations/reports/ablation_summary.json`

## Current evaluation notes

- Only one benchmark mode remains: `family_aware`.
- If there are not enough families to build a clean holdout or validation set, the pipeline fails immediately instead of falling back.
- Query compile coverage must be 100%; if any query does not compile, the run fails before scanning.
- Packages with hard ontology violations are always excluded from the valid set (`scan_status=ontology_error`).
- If you read only one benchmark result, use `evaluation`.
- Rerunning with the same `output-dir` reuses the CodeQL DB if `databases/db_*` still exists, but still re-runs `database analyze` to produce fresh SARIF before evaluation.

## Tests

```
python3 -m unittest tools/tests/test_query_manifest.py \
  tools/tests/test_sarif_parser.py \
  tools/tests/test_graph_learning.py \
  tools/tests/test_e2e_split.py
```

## Documentation

- Documentation index: `document/README.md`
- Pipeline overview: `document/pipeline.md`
- Ontology: `document/ontology.md`
- Operations runbook: `document/OPERATIONS.md`
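As an illustration of the `Indicator-bag + LR` baseline from section 4.1, the sketch below shows one way a 47-dimensional log-count vector could be built from per-package findings. It is not taken from the repository: the indicator names in `INDICATOR_IDS`, the `indicator_bag` helper, and the choice of `log1p` as the log-count transform are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical indicator IDs, one per query. The real pack has 47 queries,
# but their names are not listed in this README.
INDICATOR_IDS = [f"indicator_{i:02d}" for i in range(47)]


def indicator_bag(hits):
    """Turn a list of indicator IDs (one entry per finding) into a
    fixed-length vector of log(1 + count) values.

    IDs that are not in INDICATOR_IDS are ignored.
    """
    counts = Counter(hits)
    return [math.log1p(counts.get(ind, 0)) for ind in INDICATOR_IDS]


# Example: two findings for indicator_03 and one for indicator_10.
vec = indicator_bag(["indicator_03", "indicator_03", "indicator_10"])
```

Using `log1p` keeps indicators with zero findings at exactly `0.0` and dampens packages that trigger the same query many times, which is a common reason to prefer log counts over raw counts for a linear model.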

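The `family_aware` split and its fail-fast rule can be illustrated with a minimal sketch, assuming each package carries a family label. The function name `family_aware_split`, its parameters, and the greedy assignment strategy are hypothetical; the actual implementation in `tools/` may differ. The property shown is that whole families go to one side only, and that the split raises an error instead of falling back when there are too few families.

```python
import random
from collections import defaultdict


def family_aware_split(packages, test_size=0.3, seed=0):
    """Split (package_name, family) pairs so that no family appears in
    both the train and the test set.

    Whole families are moved to the test set until it holds roughly
    `test_size` of all packages; the last family is always kept for training.
    """
    by_family = defaultdict(list)
    for name, family in packages:
        by_family[family].append(name)

    # Fail fast instead of falling back to a random split.
    if len(by_family) < 2:
        raise ValueError("need at least two families for a family-aware holdout")

    families = sorted(by_family)
    random.Random(seed).shuffle(families)

    target = test_size * len(packages)
    train, test = [], []
    for i, family in enumerate(families):
        remaining = len(families) - i  # families not yet assigned, including this one
        if len(test) < target and remaining > 1:
            test.extend(by_family[family])
        else:
            train.extend(by_family[family])
    return train, test


packages = [("a1", "A"), ("a2", "A"), ("b1", "B"),
            ("c1", "C"), ("c2", "C"), ("c3", "C")]
train, test = family_aware_split(packages)
```

Because families are assigned whole, the realized test fraction only approximates `test_size`; with few or very uneven families it can be noticeably off.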