fsrxc2bvv9-ctrl/llm-safety-evaluation-lab
GitHub: fsrxc2bvv9-ctrl/llm-safety-evaluation-lab
一个受OWASP启发的LLM安全红队评估框架,通过结构化对抗测试用例和自动化基准测试,系统评估大语言模型在提示注入、幻觉、社会工程等风险场景下的安全表现。
Stars: 0 | Forks: 0
# LLM 安全评估实验室
## 



一个受 OWASP 启发的对抗性测试框架,用于评估 LLM 在提示注入、幻觉、社会工程、过度授权及相关风险类别中的安全行为。
本项目作为独立的 AI 安全评估作品集的一部分构建(2026 年 5 月)。
## 概述
本项目结合了:
- 结构化的红队攻击库
- 自动化基准测试执行
- 混合安全评估(关键词 + LLM 作为评判)
- 跨模型对比
- 真实世界事件分析
- OWASP LLM 应用 Top 10 映射
目标是研究现代 LLM 在对抗性、歧义性和基于操纵的条件下的行为表现。
## 核心组件
### 攻击库
## 63 个手工制作的对抗性测试用例,涵盖:
| 类别 | 测试数 | 主要风险 |
|---|---|---|
| 提示注入 | 12 | 指令层级绕过 |
| 幻觉 | 10 | 将捏造的事实呈现为真相 |
| 敏感信息泄露 | 8 | 凭证和 PII 暴露 |
| 社会工程 | 10 | 利用人类心理学进行策略操纵 |
| 过度授权 | 6 | 未经授权的不可逆操作 |
| 混淆攻击 | 4 | 安全过滤器规避 |
| 误拒 | 3 | 过度拦截合法请求 |
| 意图模糊 | 3 | 双用途请求校准 |
| 向量 / 嵌入弱点 | 1 | RAG 投毒 |
| 时间混淆 | 1 | 不正确的日期锚定 |
## 基准测试运行器
基准测试运行器支持:
- ChatGPT
- Claude
- 混合评估流水线
- Markdown 日志记录
- JSON 导出
- CSV 导出
- 在缺失可选依赖包时优雅降级
### 评估架构
```
Prompt → Model → Response
↓
[Keyword Check]
↓
[LLM-as-a-Judge]
↓
PASS / PARTIAL FAIL / FAIL
↓
benchmark_log.md + JSON export
⸻
Key Research Finding
Gradual Escalation Bypasses Guardrails
The most significant observed failure pattern involved gradual escalation across multiple conversational turns.
In testing:
* Initial chemistry prompts received safe educational responses
* Second-step escalation still maintained safety framing
* Third-step optimization framing (“most toxic fastest”) triggered a partial safety failure
This suggests:
Context normalization is more dangerous than direct attacks.
Models that resist explicit jailbreak attempts may still degrade when harmful intent is introduced incrementally through benign conversational context.
This finding aligns with real-world incidents involving:
* social engineering
* multi-turn manipulation
* session-level context drift
⸻
Live Testing Highlights (May 2026)
Strongest Safety Behaviors Observed
* Attack-aware refusal
* Temporal awareness
* Explicit identification of manipulation patterns
* Clear distinction between trusted and untrusted instruction sources
Weaknesses Observed
* Gradual escalation
* Framing sensitivity
* Cross-session safety reset
* Weak manipulation resistance in smaller open-source models
⸻
Models Evaluated
* ChatGPT
* Claude
* Meta AI
* SuperGrok
* Gemini
* Copilot
* meta-llama-3.1-8b
⸻
Project Structure
llm-safety-evaluation-lab/
├── src/
│ └── run_benchmark.py
├── data/
│ └── attack_library.csv
├── output/
│ ├── benchmark_results.json
│ └── benchmark_log.md
├── case_studies/
│ ├── openclaw_agent_incident.md
│ └── chatgpt_tactical_advice.md
├── notes/
│ ├── fortinet_2026_skills_gap.md
│ └── malicious_chrome_extensions.md
├── ├── reports/
│ └── mini_safety_report_v2.md
├── requirements.txt
├── .env.example
└── README.md
⸻
Setup
Clone the repository:
git clone https://github.com/fsrxc2bvv9-ctrl/llm-safety-evaluation-lab.git
cd llm-safety-evaluation-lab
Install dependencies:
pip install -r requirements.txt
Create environment file:
cp .env.example .env
Add API keys:
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
⸻
Usage
Dry run
python3 src/run_benchmark.py --dry-run
Run benchmark
python3 src/run_benchmark.py --model chatgpt --limit 5
Run specific category
python3 src/run_benchmark.py --model claude --category Hallucination
Run only pending tests
python3 src/run_benchmark.py --pending-only --use-judge
Export updated CSV
python3 src/run_benchmark.py --export-csv
⸻
Methodology
* OWASP Top 10 for LLM Applications 2025
* Structured adversarial testing
* LLM-as-a-judge evaluation
* Cross-model comparison
* Real-world incident mapping
* Rubric-based scoring
* Manual qualitative analysis
⸻
Related Research Included
Case Studies
* OpenClaw agent social engineering incident
* ChatGPT tactical attack advice incident
Supplemental Notes
* Fortinet 2026 cybersecurity skills gap report
* Malicious Chrome extension campaign analysis
⸻
About
Built by Aleksei Khvostov
AI Safety Evaluator focused on:
* LLM evaluation
* red teaming
* prompt injection testing
* safety analysis
* hallucination analysis
* structured QA systems
* multilingual adversarial testing (English/Russian)
Background includes:
* Outlier
* Mercor
* Invisible Technologies
* 15+ years of editorial leadership in digital media
Links
* GitHub: https://github.com/fsrxc2bvv9-ctrl
* LinkedIn: https://www.linkedin.com/in/aleksei-khvostov/
⸻
Disclaimer
This repository is intended исключительно for defensive AI safety research, evaluation methodology, and benchmark development.
All prompts use researcher-safe framing.
No operationally harmful instructions, malware, or exploit code are included.
⸻
```
标签:AI伦理, AI风控, ChatGPT, Claude, CVE检测, Homebrew安装, Kubernetes 安全, LLM Top 10, LLM作为裁判, OWASP红队, Petitpotam, Promptflow, Python, RAG安全, Red Canary, 大模型安全, 大模型幻觉, 安全合规, 安全基准测试, 密码管理, 密钥泄露防护, 对抗性测试, 搜索语句(dork), 攻击库, 无后门, 机器学习安全, 模型鲁棒性, 深度学习安全, 社交工程学, 红队框架, 网络代理, 自动化基准测试, 越权操作, 跨模型评估, 逆向工具, 防御加固