# GhostLM
GhostLM is a decoder-only transformer language model trained on CVE vulnerability descriptions, CTF writeups, and cybersecurity research. Built from scratch — no pretrained weights, no wrappers, every component written by hand.
## Why GhostLM?
Security researchers currently rely on generic models (GPT-4, Llama) that weren't trained with security context. GhostLM is purpose-built for:
- CVE analysis and vulnerability explanation
- CTF challenge reasoning
- Penetration testing assistance
- Exploit and attack pattern understanding
- Security concept explanation
### Why train from scratch instead of fine-tuning?
Two reasons. **First**, most offensive-security content that the best general models have seen was filtered or RLHF-nudged away during alignment — a fine-tune on top fights that prior. Training the tokenizer and weights from zero with security text in the mix lets the model treat CVE IDs, shell one-liners, and exploit technique names as first-class tokens rather than something to refuse. **Second**, GhostLM is also a study project. Every layer — attention, positional encoding, LR schedule, BPE — is hand-written so the codebase doubles as a readable reference for how a transformer is actually put together. A fine-tune hides that behind `AutoModel.from_pretrained`.
It is explicitly *not* trying to beat Llama on general benchmarks. It's trying to be the right tool for one narrow job, and a transparent one.
## Architecture
| Parameter | Value |
|---|---|
| Architecture | Decoder-only Transformer |
| Parameters (ghost-small) | ~55M |
| Context Length | 1024 tokens |
| Layers | 6 |
| Attention Heads | 8 |
| Embedding Dim | 512 |
| Tokenizer | GPT-2 BPE (50,261 tokens) |
Built with:
- Multi-head causal self-attention (manual implementation)
- **RoPE** (Rotary Position Embeddings) — opt-in via `use_rope=True`, replaces learned positional embeddings with the relative-position encoding used by LLaMA / Mistral
- **Flash Attention** — opt-in via `use_flash_attention=True`, routes through PyTorch 2.0+ `scaled_dot_product_attention` for `O(n)` memory
- Pre-norm transformer blocks with residual connections
- Cosine LR schedule with linear warmup
- Weight-tied output projection
- AdamW with weight decay separation
- **Safetensors** export for safe, arbitrary-code-free weight distribution (see `scripts/export.py`)
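The components listed above fit together roughly as follows — a minimal pre-norm decoder block with the RoPE and SDPA toggles sketched in. This is an illustrative sketch, not the actual `ghostlm/model.py`; the toggle names mirror the README but the internals are assumptions (GPT-NeoX-style half-split rotation, 4× GELU MLP):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of (B, H, T, Dh) by position-dependent angles."""
    B, H, T, Dh = x.shape
    half = Dh // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=x.dtype) / half))
    angles = torch.arange(T, dtype=x.dtype)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class DecoderBlock(nn.Module):
    def __init__(self, dim=512, heads=8, use_rope=True, use_flash_attention=True):
        super().__init__()
        self.h, self.dh = heads, dim // heads
        self.use_rope, self.use_flash = use_rope, use_flash_attention
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)  # pre-norm: LN before attention
        shape = (B, T, self.h, self.dh)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        if self.use_rope:
            q, k = rope(q), rope(k)                        # relative positions via rotation
        if self.use_flash:
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            att = (q @ k.transpose(-2, -1)) / math.sqrt(self.dh)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
            att = att.masked_fill(mask, float("-inf")).softmax(dim=-1)
            out = att @ v
        x = x + self.proj(out.transpose(1, 2).reshape(B, T, C))  # residual 1
        return x + self.mlp(self.ln2(x))                          # residual 2
```

Both attention paths produce identical shapes, so the toggles can be flipped per-config without touching the rest of the model.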
## Model Variants
| Variant | Layers | Dim | Params | Status |
|---|---|---|---|---|
| ghost-tiny | 2 | 256 | ~14.5M | Phase 1 complete (10K steps) |
| ghost-small | 6 | 512 | ~55M | Planned |
| ghost-medium | 12 | 768 | ~160M | Future |
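The parameter counts above can be sanity-checked with a back-of-the-envelope formula. This assumes a standard GPT-style layout (learned position embeddings, 4× MLP, weight-tied output head) and ignores biases and LayerNorms — the exact figures depend on `ghostlm/config.py`, but the estimate lands close to ghost-tiny's ~14.5M:

```python
def param_count(vocab, dim, layers, ctx=1024, mlp_ratio=4, tied=True):
    """Rough transformer parameter count (weights only, no biases/LayerNorms)."""
    emb = vocab * dim + ctx * dim            # token + learned position embeddings
    attn = 4 * dim * dim                     # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * dim * dim          # up- and down-projection
    head = 0 if tied else vocab * dim        # weight tying makes the head free
    return emb + layers * (attn + mlp) + head

print(f"ghost-tiny: {param_count(50_261, 256, 2) / 1e6:.1f}M")   # ≈ 14.7M
```

The same function applied to the larger presets shows where the parameters go: at these vocab sizes the embedding table dominates until the layer stack deepens.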
## Quick Start
### Installation
```
git clone https://github.com/joemunene-by/GhostLM.git
cd GhostLM
make install
```
### Prepare Training Data
```
make data
```
### Training
```
# CPU-friendly (ghost-tiny)
make train-tiny
# GPU (ghost-small)
make train-small
```
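The cosine LR schedule with linear warmup used during training follows the usual shape; the hyperparameter values below are illustrative defaults, not GhostLM's actual config:

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=500, total=10_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * step / warmup                      # linear ramp: 0 -> max_lr
    t = (step - warmup) / (total - warmup)                 # decay progress: 0 -> 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
```

Plugged into the optimizer each step (e.g. by setting `param_group["lr"]`), this gives the smooth warmup-then-decay curve that `make plot` renders from the training logs.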
### Generate Text
```
make generate
```
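`make generate` samples tokens autoregressively; one typical temperature + top-k sampling step looks like the following. This is a generic sketch — the actual knobs and defaults in `scripts/generate.py` may differ:

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 40) -> int:
    """Sample one token id from a 1-D logits vector over the vocabulary."""
    logits = logits / temperature                   # <1 sharpens, >1 flattens
    k = min(top_k, logits.size(-1))
    v, _ = torch.topk(logits, k)                    # v[-1] is the k-th largest logit
    logits = logits.masked_fill(logits < v[-1], float("-inf"))  # drop the tail
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

Lower temperatures suit CVE-style factual completion; higher ones help exploratory CTF-style brainstorming.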
### Interactive Chat
```
make chat
```
### Run the Web Demo
```
pip install gradio
python demo/app.py
```
### Benchmark Against GPT-2
```
make benchmark
```
### Export Weights (safetensors or PyTorch)
```
# Safe, pickle-free weights for HuggingFace Hub distribution
python scripts/export.py --format safetensors
# Classic PyTorch checkpoint
python scripts/export.py --format pt
```
```
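The export writes a SHA-256 checksum alongside the weights so downloads can be verified. The sidecar step can be sketched with the standard library alone — file names here are illustrative, not the actual `scripts/export.py` logic:

```python
import hashlib
import json
import pathlib
import tempfile

def write_checksum(weights_path: pathlib.Path) -> str:
    """Write a <weights>.sha256.json sidecar and return the hex digest."""
    digest = hashlib.sha256(weights_path.read_bytes()).hexdigest()
    sidecar = weights_path.parent / (weights_path.name + ".sha256.json")
    sidecar.write_text(json.dumps({"sha256": digest}))
    return digest

# Demo against a stand-in file (a real run would hash the exported .safetensors):
tmp = pathlib.Path(tempfile.mkdtemp())
weights = tmp / "ghost.safetensors"
weights.write_bytes(b"demo-weights")
digest = write_checksum(weights)
```

Consumers re-hash the downloaded file and compare against the sidecar before loading — cheap insurance on top of safetensors' pickle-free format.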
### Plot Training Curves
```
make plot
```
## Training Data
| Source | Records | Type |
|---|---|---|
| NVD CVE Database | 9,925 | Real |
| Security Research Papers | 500 | Synthetic |
| CTF Writeups | 500 | Synthetic |
| **Total** | **10,925** | |
## Training Progress
| Run | Steps | Train Loss | Val Loss | Status |
|---|---|---|---|---|
| ghost-tiny Phase 1 | 10,000 | 1.97 | 2.74 | Complete |
| ghost-tiny Phase 2 | 100,000 | — | — | Next (Mac Mini M4) |
## Evaluation Results (Phase 1)
| Metric | Score |
|---|---|
| Cybersecurity Perplexity (ghost-tiny) | 2,183.94 |
| Cybersecurity Perplexity (GPT-2 117M baseline) | 26.76 |
| CVE Severity Classification | 20.0% |
| Vulnerability Type Detection | 10.0% |
| Attack Technique ID | 10.0% |
| **Overall Security Score** | **13.3%** |
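For orientation, perplexity is just the exponential of the mean cross-entropy loss (in nats), so the Phase-1 numbers relate directly to the loss table above:

```python
import math

# Perplexity = exp(mean cross-entropy in nats).
# Phase-1 ghost-tiny validation loss from the training-progress table:
val_loss = 2.74
val_ppl = math.exp(val_loss)     # ≈ 15.5 on the validation split
```

Note that the ~2,184 "Cybersecurity Perplexity" above is measured on a separate security-specific evaluation set (presumably via `scripts/eval_security.py`), which is why it differs so sharply from the validation figure — the two are not directly comparable.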
## Project Structure
```
GhostLM/
├── ghostlm/ # Core library
│ ├── model.py # Transformer architecture (RoPE + Flash Attention toggles)
│ ├── config.py # Hyperparameters + ghost-tiny/small/medium presets
│ ├── tokenizer.py # GPT-2 BPE wrapper
│ ├── dataset.py # PyTorch dataset
│ └── trainer.py # Training loop
├── scripts/ # CLI tools
│ ├── train.py # Training entry point
│ ├── generate.py # Text generation
│ ├── chat.py # Interactive chat
│ ├── evaluate.py # Evaluation
│ ├── eval_security.py # Security-specific evaluation
│ ├── benchmark.py # GPT-2 comparison
│ ├── export.py # Weights export (safetensors / pt) + SHA-256 + config.json
│ ├── api.py # REST API server
│ ├── data_stats.py # Training-data statistics
│ ├── plot_training.py # Loss-curve plotter
│ ├── push_to_hub.py # HuggingFace Hub publisher
│ └── resume_train.sh # Resume an interrupted training run
├── data/ # Data pipeline
├── demo/ # Gradio web demo (demo/app.py)
├── tests/ # 16 unit tests
└── Makefile # One-command workflow
```
## Roadmap
### v0.1.0 — Architecture Complete
- Full transformer from scratch
- Training pipeline verified
- 10,925 cybersecurity records
### v0.2.0 — Phase 1 Training Complete
- ghost-tiny trained to 10,000 steps on CPU
- Full evaluation suite with benchmark vs GPT-2
- MODEL_CARD with detailed results
### v0.2.1 — Phase 2 Ready
- RoPE (Rotary Position Embeddings) — config-toggled
- Flash Attention via `scaled_dot_product_attention` — config-toggled
- Safetensors export with config.json sidecar and SHA-256 checksum
- Pinned dependency versions + PEP 639 license metadata
- Test suite grown from 10 → 16 tests
### v0.3.0 — Phase 2 Training In Progress
- 100K steps on Mac Mini M4 with RoPE + Flash Attention enabled
- HuggingFace Hub weights release (safetensors)
- Gradio web demo
### v1.0.0 — Release (Planned)
- Public weights + REST API
- Fine-tuning scripts
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.
## License
MIT — see [LICENSE](LICENSE)
## Author
**Joe Munene** — [Complex Developers](https://github.com/joemunene-by)
Built in Nairobi, Kenya.