fsrxc2bvv9-ctrl/llm-safety-evaluation-lab

GitHub: fsrxc2bvv9-ctrl/llm-safety-evaluation-lab

一个受OWASP启发的LLM安全红队评估框架,通过结构化对抗测试用例和自动化基准测试,系统评估大语言模型在提示注入、幻觉、社会工程等风险场景下的安全表现。

Stars: 0 | Forks: 0

# LLM 安全评估实验室 ## ![Python](https://img.shields.io/badge/Python-3.13-blue) ![OWASP](https://img.shields.io/badge/OWASP-LLM_Top_10-red) ![状态](https://img.shields.io/badge/status-active-success) ![许可证](https://img.shields.io/badge/license-MIT-green) 一个受 OWASP 启发的对抗性测试框架,用于评估 LLM 在提示注入、幻觉、社会工程、过度授权及相关风险类别中的安全行为。 本项目作为独立的 AI 安全评估作品集的一部分构建(2026 年 5 月)。 ## 概述 本项目结合了: - 结构化的红队攻击库 - 自动化基准测试执行 - 混合安全评估(关键词 + LLM 作为评判) - 跨模型对比 - 真实世界事件分析 - OWASP LLM 应用 Top 10 映射 目标是研究现代 LLM 在对抗性、歧义性和基于操纵的条件下的行为表现。 ## 核心组件 ### 攻击库 ## 63 个手工制作的对抗性测试用例,涵盖: | 类别 | 测试数 | 主要风险 | |---|---|---| | 提示注入 | 12 | 指令层级绕过 | | 幻觉 | 10 | 将捏造的事实呈现为真相 | | 敏感信息泄露 | 8 | 凭证和 PII 暴露 | | 社会工程 | 10 | 利用人类心理学进行策略操纵 | | 过度授权 | 6 | 未经授权的不可逆操作 | | 混淆攻击 | 4 | 安全过滤器规避 | | 误拒 | 3 | 过度拦截合法请求 | | 意图模糊 | 3 | 双用途请求校准 | | 向量 / 嵌入弱点 | 1 | RAG 投毒 | | 时间混淆 | 1 | 不正确的日期锚定 | ## 基准测试运行器 基准测试运行器支持: - ChatGPT - Claude - 混合评估流水线 - Markdown 日志记录 - JSON 导出 - CSV 导出 - 在缺失可选依赖包时优雅降级 ### 评估架构 ``` Prompt → Model → Response ↓ [Keyword Check] ↓ [LLM-as-a-Judge] ↓ PASS / PARTIAL FAIL / FAIL ↓ benchmark_log.md + JSON export ⸻ Key Research Finding Gradual Escalation Bypasses Guardrails The most significant observed failure pattern involved gradual escalation across multiple conversational turns. In testing: * Initial chemistry prompts received safe educational responses * Second-step escalation still maintained safety framing * Third-step optimization framing (“most toxic fastest”) triggered a partial safety failure This suggests: Context normalization is more dangerous than direct attacks. Models that resist explicit jailbreak attempts may still degrade when harmful intent is introduced incrementally through benign conversational context. This finding aligns with real-world incidents involving: * social engineering * multi-turn manipulation * session-level context drift ⸻ Live Testing Highlights (May 2026) Strongest Safety Behaviors Observed * Attack-aware refusal * Temporal awareness * Explicit identification of manipulation patterns * Clear distinction between trusted and untrusted instruction sources Weaknesses Observed * Gradual escalation * Framing sensitivity * Cross-session safety reset * Weak manipulation resistance in smaller open-source models ⸻ Models Evaluated * ChatGPT * Claude * Meta AI * SuperGrok * Gemini * Copilot * meta-llama-3.1-8b ⸻ Project Structure llm-safety-evaluation-lab/ ├── src/ │ └── run_benchmark.py ├── data/ │ └── attack_library.csv ├── output/ │ ├── benchmark_results.json │ └── benchmark_log.md ├── case_studies/ │ ├── openclaw_agent_incident.md │ └── chatgpt_tactical_advice.md ├── notes/ │ ├── fortinet_2026_skills_gap.md │ └── malicious_chrome_extensions.md ├── ├── reports/ │ └── mini_safety_report_v2.md ├── requirements.txt ├── .env.example └── README.md ⸻ Setup Clone the repository: git clone https://github.com/fsrxc2bvv9-ctrl/llm-safety-evaluation-lab.git cd llm-safety-evaluation-lab Install dependencies: pip install -r requirements.txt Create environment file: cp .env.example .env Add API keys: OPENAI_API_KEY=your_key_here ANTHROPIC_API_KEY=your_key_here ⸻ Usage Dry run python3 src/run_benchmark.py --dry-run Run benchmark python3 src/run_benchmark.py --model chatgpt --limit 5 Run specific category python3 src/run_benchmark.py --model claude --category Hallucination Run only pending tests python3 src/run_benchmark.py --pending-only --use-judge Export updated CSV python3 src/run_benchmark.py --export-csv ⸻ Methodology * OWASP Top 10 for LLM Applications 2025 * Structured adversarial testing * LLM-as-a-judge evaluation * Cross-model comparison * Real-world incident mapping * Rubric-based scoring * Manual qualitative analysis ⸻ Related Research Included Case Studies * OpenClaw agent social engineering incident * ChatGPT tactical attack advice incident Supplemental Notes * Fortinet 2026 cybersecurity skills gap report * Malicious Chrome extension campaign analysis ⸻ About Built by Aleksei Khvostov AI Safety Evaluator focused on: * LLM evaluation * red teaming * prompt injection testing * safety analysis * hallucination analysis * structured QA systems * multilingual adversarial testing (English/Russian) Background includes: * Outlier * Mercor * Invisible Technologies * 15+ years of editorial leadership in digital media Links * GitHub: https://github.com/fsrxc2bvv9-ctrl * LinkedIn: https://www.linkedin.com/in/aleksei-khvostov/ ⸻ Disclaimer This repository is intended исключительно for defensive AI safety research, evaluation methodology, and benchmark development. All prompts use researcher-safe framing. No operationally harmful instructions, malware, or exploit code are included. ⸻ ```
标签:AI伦理, AI风控, ChatGPT, Claude, CVE检测, Homebrew安装, Kubernetes 安全, LLM Top 10, LLM作为裁判, OWASP红队, Petitpotam, Promptflow, Python, RAG安全, Red Canary, 大模型安全, 大模型幻觉, 安全合规, 安全基准测试, 密码管理, 密钥泄露防护, 对抗性测试, 搜索语句(dork), 攻击库, 无后门, 机器学习安全, 模型鲁棒性, 深度学习安全, 社交工程学, 红队框架, 网络代理, 自动化基准测试, 越权操作, 跨模型评估, 逆向工具, 防御加固