GitHub: joemunene-by/GhostLM

A decoder-only Transformer language model for cybersecurity, built from scratch to address the knowledge gaps that safety alignment leaves in general-purpose models.


![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/3d7cd98389153457.svg) ![License](https://img.shields.io/badge/license-MIT-blue.svg) ![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg) ![PyTorch](https://img.shields.io/badge/PyTorch-2.0%2B-orange.svg) ![Status](https://img.shields.io/badge/status-Phase%201%20Complete-green.svg)

# GhostLM

GhostLM is a decoder-only transformer language model trained on CVE vulnerability descriptions, CTF writeups, and cybersecurity research. Built from scratch — no pretrained weights, no wrappers, every component written by hand.

## Why GhostLM?

Security researchers currently rely on generic models (GPT-4, Llama) that weren't trained with security context. GhostLM is purpose-built for:

- CVE analysis and vulnerability explanation
- CTF challenge reasoning
- Penetration testing assistance
- Exploit and attack pattern understanding
- Security concept explanation

### Why from scratch instead of fine-tuning?

Two reasons. **First**, most offensive-security content that the best general models have seen was filtered or RLHF-nudged away during alignment — a fine-tune on top fights that prior. Training the tokenizer and weights from zero with security text in the mix lets the model treat CVE IDs, shell one-liners, and exploit technique names as first-class tokens rather than something to refuse.

**Second**, GhostLM is also a study project. Every layer — attention, positional encoding, LR schedule, BPE — is hand-written so the codebase doubles as a readable reference for how a transformer is actually put together. A fine-tune hides that behind `AutoModel.from_pretrained`.

GhostLM is explicitly *not* trying to beat Llama on general benchmarks. It's trying to be the right tool for one narrow job, and a transparent one.
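To make "hand-written" concrete, here is a minimal sketch of a manual multi-head causal self-attention block with an optional `scaled_dot_product_attention` fast path. This is an illustrative stand-in, not the actual code in `ghostlm/model.py`: the class name, fused QKV projection, and defaults are assumptions.

```python
import math

import torch
import torch.nn.functional as F
from torch import nn


class CausalSelfAttention(nn.Module):
    """Hand-written multi-head causal self-attention (illustrative sketch).

    With use_flash_attention=True it routes through PyTorch 2.0+'s fused
    scaled_dot_product_attention; otherwise it spells the math out by hand.
    """

    def __init__(self, dim: int = 512, n_heads: int = 8,
                 use_flash_attention: bool = False):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.use_flash_attention = use_flash_attention
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)  # fused Q/K/V projection
        self.proj = nn.Linear(dim, dim, bias=False)     # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        if self.use_flash_attention:
            # Fused kernel; is_causal applies the triangular mask internally.
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            # Manual path: scaled scores, causal mask, softmax, weighted sum.
            scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
            future = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                           device=x.device), diagonal=1)
            scores = scores.masked_fill(future, float("-inf"))
            y = scores.softmax(dim=-1) @ v
        return self.proj(y.transpose(1, 2).reshape(B, T, C))
```

Both paths compute the same function, so toggling the flag is a pure performance choice: the fused kernel avoids materialising the full `T × T` score matrix.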
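The cosine learning-rate schedule with linear warmup is another of those hand-written pieces. Below is a minimal sketch of how such a schedule is typically computed per step; the function name, signature, and boundary behaviour are illustrative assumptions, not the actual API of `ghostlm/trainer.py`.

```python
import math


def lr_at_step(step: int, max_steps: int, warmup_steps: int,
               peak_lr: float, min_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (sketch)."""
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over warmup_steps.
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # Hold the floor after training ends.
        return min_lr
    # Cosine decay: progress goes 0 -> 1, cosine term goes 1 -> -1.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a training loop this would be called once per step and written into each optimizer param group's `lr` before `optimizer.step()`.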
## Architecture

| Parameter | Value |
|---|---|
| Architecture | Decoder-only Transformer |
| Parameters (ghost-small) | ~55M |
| Context Length | 1024 tokens |
| Layers | 6 |
| Attention Heads | 8 |
| Embedding Dim | 512 |
| Tokenizer | GPT-2 BPE (50,261 tokens) |

Built with:

- Multi-head causal self-attention (manual implementation)
- **RoPE** (Rotary Position Embeddings) — opt-in via `use_rope=True`, replaces learned positional embeddings with the relative-position encoding used by LLaMA / Mistral
- **Flash Attention** — opt-in via `use_flash_attention=True`, routes through PyTorch 2.0+ `scaled_dot_product_attention` for `O(n)` memory
- Pre-norm transformer blocks with residual connections
- Cosine LR schedule with linear warmup
- Weight-tied output projection
- AdamW with weight decay separation
- **Safetensors** export for safe, arbitrary-code-free weight distribution (see `scripts/export.py`)

## Model Variants

| Variant | Layers | Dim | Params | Status |
|---|---|---|---|---|
| ghost-tiny | 2 | 256 | ~14.5M | Phase 1 complete (10K steps) |
| ghost-small | 6 | 512 | ~55M | Planned |
| ghost-medium | 12 | 768 | ~160M | Future |

## Quick Start

### Installation

```
git clone https://github.com/joemunene-by/GhostLM.git
cd GhostLM
make install
```

### Prepare training data

```
make data
```

### Train

```
# CPU-friendly (ghost-tiny)
make train-tiny

# GPU (ghost-small)
make train-small
```

### Generate text

```
make generate
```

### Interactive chat

```
make chat
```

### Run the web demo

```
pip install gradio
python demo/app.py
```

### Benchmark against GPT-2

```
make benchmark
```

### Export weights (safetensors or PyTorch)

```
# Safe, pickle-free weights for HuggingFace Hub distribution
python scripts/export.py --format safetensors

# Classic PyTorch checkpoint
python scripts/export.py --format pt
```

### Plot training curves

```
make plot
```

## Training Data

| Source | Records | Type |
|---|---|---|
| NVD CVE Database | 9,925 | Real |
| Security Research Papers | 500 | Synthetic |
| CTF Writeups | 500 | Synthetic |
| **Total** | **10,925** | |

## Training Progress

| Run | Steps | Train Loss | Val Loss | Status |
|---|---|---|---|---|
| ghost-tiny Phase 1 | 10,000 | 1.97 | 2.74 | Complete |
| ghost-tiny Phase 2 | 100,000 | — | — | Next (Mac Mini M4) |

## Evaluation Results (Phase 1)

| Metric | Score |
|---|---|
| Cybersecurity Perplexity | 2,183.94 |
| GPT-2 Baseline (117M) | 26.76 |
| CVE Severity Classification | 20.0% |
| Vulnerability Type Detection | 10.0% |
| Attack Technique ID | 10.0% |
| **Overall Security Score** | **13.3%** |

## Project Structure

```
GhostLM/
├── ghostlm/               # Core library
│   ├── model.py           # Transformer architecture (RoPE + Flash Attention toggles)
│   ├── config.py          # Hyperparameters + ghost-tiny/small/medium presets
│   ├── tokenizer.py       # GPT-2 BPE wrapper
│   ├── dataset.py         # PyTorch dataset
│   └── trainer.py         # Training loop
├── scripts/               # CLI tools
│   ├── train.py           # Training entry point
│   ├── generate.py        # Text generation
│   ├── chat.py            # Interactive chat
│   ├── evaluate.py        # Evaluation
│   ├── eval_security.py   # Security-specific evaluation
│   ├── benchmark.py       # GPT-2 comparison
│   ├── export.py          # Weights export (safetensors / pt) + SHA-256 + config.json
│   ├── api.py             # REST API server
│   ├── data_stats.py      # Training-data statistics
│   ├── plot_training.py   # Loss-curve plotter
│   ├── push_to_hub.py     # HuggingFace Hub publisher
│   └── resume_train.sh    # Resume an interrupted training run
├── data/                  # Data pipeline
├── demo/                  # Gradio web demo (demo/app.py)
├── tests/                 # 16 unit tests
└── Makefile               # One-command workflow
```

## Roadmap

### v0.1.0 — Architecture complete

- Full transformer from scratch
- Training pipeline verified
- 10,925 cybersecurity records

### v0.2.0 — Phase 1 training complete

- ghost-tiny trained to 10,000 steps on CPU
- Full evaluation suite with benchmark vs GPT-2
- MODEL_CARD with detailed results

### v0.2.1 — Phase 2 ready

- RoPE (Rotary Position Embeddings) — config-toggled
- Flash Attention via `scaled_dot_product_attention` — config-toggled
- Safetensors export with config.json sidecar and SHA-256 checksum
- Pinned dependency versions + PEP 639 license metadata
- Test suite grown from 10 → 16 tests

### v0.3.0 — Phase 2 training in progress

- 100K steps on Mac Mini M4 with RoPE + Flash Attention enabled
- HuggingFace Hub weights release (safetensors)
- Gradio web demo

### v1.0.0 — Release (planned)

- Public weights + REST API
- Fine-tuning scripts

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.

## License

MIT — see [LICENSE](LICENSE)

## Author

**Joe Munene** — [Complex Developers](https://github.com/joemunene-by)

Built in Nairobi, Kenya.