# GhostLM
GhostLM is a decoder-only transformer language model trained on CVE vulnerability descriptions, CTF writeups, and cybersecurity research. Built from scratch — no pretrained weights, no wrappers, every component written by hand.
## Why GhostLM?
Security researchers currently rely on generic models (GPT-4, Llama) that weren't trained with security context. GhostLM is purpose-built for:
- CVE analysis and vulnerability explanation
- CTF challenge reasoning
- Penetration testing assistance
- Exploit and attack pattern understanding
- Security concept explanation
### Why train from scratch instead of fine-tuning?
Two reasons. **First**, most offensive-security content that the best general models have seen was filtered or RLHF-nudged away during alignment — a fine-tune on top fights that prior. Training the tokenizer and weights from zero with security text in the mix lets the model treat CVE IDs, shell one-liners, and exploit technique names as first-class tokens rather than something to refuse. **Second**, GhostLM is also a study project. Every layer — attention, positional encoding, LR schedule, BPE — is hand-written so the codebase doubles as a readable reference for how a transformer is actually put together. A fine-tune hides that behind `AutoModel.from_pretrained`.
It is explicitly *not* trying to beat Llama on general benchmarks. It's trying to be the right tool for one narrow job, and a transparent one.
## Architecture
| Parameter | Value |
|---|---|
| Architecture | Decoder-only Transformer |
| Parameters (ghost-small) | ~55M |
| Context Length | 1024 tokens |
| Layers | 6 |
| Attention Heads | 8 |
| Embedding Dim | 512 |
| Tokenizer | GPT-2 BPE (50,261 tokens) |
Built with:
- Multi-head causal self-attention (manual implementation)
- **RoPE** (Rotary Position Embeddings) — opt-in via `use_rope=True`, replaces learned positional embeddings with the relative-position encoding used by LLaMA / Mistral
- **Flash Attention** — opt-in via `use_flash_attention=True`, routes through PyTorch 2.0+ `scaled_dot_product_attention` for `O(n)` memory
- Pre-norm transformer blocks with residual connections
- Cosine LR schedule with linear warmup
- Weight-tied output projection
- AdamW with weight decay separation
- **Safetensors** export for safe, arbitrary-code-free weight distribution (see `scripts/export.py`)
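The components listed above fit together roughly as follows — a minimal pre-norm decoder block with the RoPE and SDPA toggles sketched in. This is an illustrative sketch, not the actual `ghostlm/model.py`; the toggle names mirror the README but the internals are assumptions (GPT-NeoX-style half-split rotation, 4× GELU MLP):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of (B, H, T, Dh) by position-dependent angles."""
    B, H, T, Dh = x.shape
    half = Dh // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=x.dtype) / half))
    angles = torch.arange(T, dtype=x.dtype)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class DecoderBlock(nn.Module):
    def __init__(self, dim=512, heads=8, use_rope=True, use_flash_attention=True):
        super().__init__()
        self.h, self.dh = heads, dim // heads
        self.use_rope, self.use_flash = use_rope, use_flash_attention
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)  # pre-norm: LN before attention
        shape = (B, T, self.h, self.dh)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        if self.use_rope:
            q, k = rope(q), rope(k)                        # relative positions via rotation
        if self.use_flash:
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            att = (q @ k.transpose(-2, -1)) / math.sqrt(self.dh)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
            att = att.masked_fill(mask, float("-inf")).softmax(dim=-1)
            out = att @ v
        x = x + self.proj(out.transpose(1, 2).reshape(B, T, C))  # residual 1
        return x + self.mlp(self.ln2(x))                          # residual 2
```

Both attention paths produce identical shapes, so the toggles can be flipped per-config without touching the rest of the model.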
## Model Variants
| Variant | Layers | Dim | Params | Status |
|---|---|---|---|---|
| ghost-tiny | 2 | 256 | ~14.5M | Phase 1 complete (10K steps) |
| ghost-small | 6 | 512 | ~55M | Planned |
| ghost-medium | 12 | 768 | ~160M | Future |
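The parameter counts above can be sanity-checked with a back-of-the-envelope formula. This assumes a standard GPT-style layout (learned position embeddings, 4× MLP, weight-tied output head) and ignores biases and LayerNorms — the exact figures depend on `ghostlm/config.py`, but the estimate lands close to ghost-tiny's ~14.5M:

```python
def param_count(vocab, dim, layers, ctx=1024, mlp_ratio=4, tied=True):
    """Rough transformer parameter count (weights only, no biases/LayerNorms)."""
    emb = vocab * dim + ctx * dim            # token + learned position embeddings
    attn = 4 * dim * dim                     # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * dim * dim          # up- and down-projection
    head = 0 if tied else vocab * dim        # weight tying makes the head free
    return emb + layers * (attn + mlp) + head

print(f"ghost-tiny: {param_count(50_261, 256, 2) / 1e6:.1f}M")   # ≈ 14.7M
```

The same function applied to the larger presets shows where the parameters go: at these vocab sizes the embedding table dominates until the layer stack deepens.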
## Quick Start
### Installation
```
git clone https://github.com/joemunene-by/GhostLM.git
cd GhostLM
make install
```
### Prepare Training Data
```
make data
```
### Training
```
# CPU-friendly (ghost-tiny)
make train-tiny
# GPU (ghost-small)
make train-small
```
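The cosine LR schedule with linear warmup used during training follows the usual shape; the hyperparameter values below are illustrative defaults, not GhostLM's actual config:

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=500, total=10_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * step / warmup                      # linear ramp: 0 -> max_lr
    t = (step - warmup) / (total - warmup)                 # decay progress: 0 -> 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
```

Plugged into the optimizer each step (e.g. by setting `param_group["lr"]`), this gives the smooth warmup-then-decay curve that `make plot` renders from the training logs.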
### Generate Text
```
make generate
```
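`make generate` samples tokens autoregressively; one typical temperature + top-k sampling step looks like the following. This is a generic sketch — the actual knobs and defaults in `scripts/generate.py` may differ:

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 40) -> int:
    """Sample one token id from a 1-D logits vector over the vocabulary."""
    logits = logits / temperature                   # <1 sharpens, >1 flattens
    k = min(top_k, logits.size(-1))
    v, _ = torch.topk(logits, k)                    # v[-1] is the k-th largest logit
    logits = logits.masked_fill(logits < v[-1], float("-inf"))  # drop the tail
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

Lower temperatures suit CVE-style factual completion; higher ones help exploratory CTF-style brainstorming.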
### Interactive Chat
```
make chat
```
### Run the Web Demo
```
pip install gradio
python demo/app.py
```
### Benchmark Against GPT-2
```
make benchmark
```
### Export Weights (safetensors or PyTorch)
```
# Safe, pickle-free weights for HuggingFace Hub distribution
python scripts/export.py --format safetensors
# Classic PyTorch checkpoint
python scripts/export.py --format pt
```
```
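The export writes a SHA-256 checksum alongside the weights so downloads can be verified. The sidecar step can be sketched with the standard library alone — file names here are illustrative, not the actual `scripts/export.py` logic:

```python
import hashlib
import json
import pathlib
import tempfile

def write_checksum(weights_path: pathlib.Path) -> str:
    """Write a <weights>.sha256.json sidecar and return the hex digest."""
    digest = hashlib.sha256(weights_path.read_bytes()).hexdigest()
    sidecar = weights_path.parent / (weights_path.name + ".sha256.json")
    sidecar.write_text(json.dumps({"sha256": digest}))
    return digest

# Demo against a stand-in file (a real run would hash the exported .safetensors):
tmp = pathlib.Path(tempfile.mkdtemp())
weights = tmp / "ghost.safetensors"
weights.write_bytes(b"demo-weights")
digest = write_checksum(weights)
```

Consumers re-hash the downloaded file and compare against the sidecar before loading — cheap insurance on top of safetensors' pickle-free format.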
### Plot Training Curves
```
make plot
```
## Training Data
| Source | Records | Type |
|---|---|---|
| NVD CVE Database | 9,925 | Real |
| Security Research Papers | 500 | Synthetic |
| CTF Writeups | 500 | Synthetic |
| **Total** | **10,925** | |
## Training Progress
| Run | Steps | Train Loss | Val Loss | Status |
|---|---|---|---|---|
| ghost-tiny Phase 1 | 10,000 | 1.97 | 2.74 | Complete |
| ghost-tiny Phase 2 | 100,000 | — | — | Next (Mac Mini M4) |
## Evaluation Results (Phase 1)
| Metric | Score |
|---|---|
| Cybersecurity Perplexity (ghost-tiny) | 2,183.94 |
| Cybersecurity Perplexity (GPT-2 117M baseline) | 26.76 |
| CVE Severity Classification | 20.0% |
| Vulnerability Type Detection | 10.0% |
| Attack Technique ID | 10.0% |
| **Overall Security Score** | **13.3%** |
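For orientation, perplexity is just the exponential of the mean cross-entropy loss (in nats), so the Phase-1 numbers relate directly to the loss table above:

```python
import math

# Perplexity = exp(mean cross-entropy in nats).
# Phase-1 ghost-tiny validation loss from the training-progress table:
val_loss = 2.74
val_ppl = math.exp(val_loss)     # ≈ 15.5 on the validation split
```

Note that the ~2,184 "Cybersecurity Perplexity" above is measured on a separate security-specific evaluation set (presumably via `scripts/eval_security.py`), which is why it differs so sharply from the validation figure — the two are not directly comparable.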
## Project Structure
```
GhostLM/
├── ghostlm/ # Core library
│ ├── model.py # Transformer architecture (RoPE + Flash Attention toggles)
│ ├── config.py # Hyperparameters + ghost-tiny/small/medium presets
│ ├── tokenizer.py # GPT-2 BPE wrapper
│ ├── dataset.py # PyTorch dataset
│ └── trainer.py # Training loop
├── scripts/ # CLI tools
│ ├── train.py # Training entry point
│ ├── generate.py # Text generation
│ ├── chat.py # Interactive chat
│ ├── evaluate.py # Evaluation
│ ├── eval_security.py # Security-specific evaluation
│ ├── benchmark.py # GPT-2 comparison
│ ├── export.py # Weights export (safetensors / pt) + SHA-256 + config.json
│ ├── api.py # REST API server
│ ├── data_stats.py # Training-data statistics
│ ├── plot_training.py # Loss-curve plotter
│ ├── push_to_hub.py # HuggingFace Hub publisher
│ └── resume_train.sh # Resume an interrupted training run
├── data/ # Data pipeline
├── demo/ # Gradio web demo (demo/app.py)
├── tests/ # 16 unit tests
└── Makefile # One-command workflow
```
## Roadmap
### v0.1.0 — Architecture Complete
- Full transformer from scratch
- Training pipeline verified
- 10,925 cybersecurity records
### v0.2.0 — Phase 1 Training Complete
- ghost-tiny trained to 10,000 steps on CPU
- Full evaluation suite with benchmark vs GPT-2
- MODEL_CARD with detailed results
### v0.2.1 — Phase 2 Ready
- RoPE (Rotary Position Embeddings) — config-toggled
- Flash Attention via `scaled_dot_product_attention` — config-toggled
- Safetensors export with config.json sidecar and SHA-256 checksum
- Pinned dependency versions + PEP 639 license metadata
- Test suite grown from 10 → 16 tests
### v0.3.0 — Phase 2 Training In Progress
- 100K steps on Mac Mini M4 with RoPE + Flash Attention enabled
- HuggingFace Hub weights release (safetensors)
- Gradio web demo
### v1.0.0 — Release (Planned)
- Public weights + REST API
- Fine-tuning scripts
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.
## License
MIT — see [LICENSE](LICENSE)
## Author
**Joe Munene** — [Complex Developers](https://github.com/joemunene-by)
Built in Nairobi, Kenya.