brandonhimpfen/awesome-ai-safety-alignment

GitHub: brandonhimpfen/awesome-ai-safety-alignment

A curated list of resources on AI safety, alignment, and red teaming, to help teams quickly find frameworks, evaluations, and governance practices.


# Awesome AI Safety & Alignment [![Awesome Lists](https://srv-cdn.himpfen.io/badges/awesome-lists/awesomelists-flat.svg)](https://github.com/awesomelistsio/awesome) [![DOI](https://zenodo.org/badge/1106325936.svg)](https://doi.org/10.5281/zenodo.19673174) [![GitHub Sponsor](https://srv-cdn.himpfen.io/badges/github/github-flat.svg)](https://github.com/sponsors/brandonhimpfen) [![Buy Me a Coffee](https://srv-cdn.himpfen.io/badges/buymeacoffee/buymeacoffee-flat.svg)](https://buymeacoffee.com/brandonhimpfen) [![Ko-Fi](https://srv-cdn.himpfen.io/badges/kofi/kofi-flat.svg)](https://ko-fi.com/brandonhimpfen) [![PayPal](https://srv-cdn.himpfen.io/badges/paypal/paypal-flat.svg)](https://paypal.me/brandonhimpfen)

## Contents

- [Research Organizations](#research-organizations)
- [Safety Frameworks](#safety-frameworks)
- [Red Teaming & Threat Modeling](#red-teaming--threat-modeling)
- [Evaluation & Benchmarks](#evaluation--benchmarks)
- [Model Governance & Policy](#model-governance--policy)
- [Datasets](#datasets)
- [Learning Resources](#learning-resources)
- [Related Awesome Lists](#related-awesome-lists)

## Research Organizations

- [Alignment Research Center (ARC)](https://alignment.org/) – Research on scalable oversight and model evaluation.
- [AI Safety Center (UK)](https://www.aisc.gov.uk/) – Government-backed program for AI safety and model evaluation.
- [OpenAI Safety](https://openai.com/safety) – Research on robustness, red teaming, and alignment.
- [Anthropic Safety](https://www.anthropic.com/safety) – Safety team focused on interpretability and frontier model evaluation.
- [DeepMind Safety Research](https://deepmind.google/discover/blog) – Research on scalable oversight, alignment, and robustness.
- [Center for AI Safety (CAIS)](https://www.safe.ai/) – Public education on AI safety, benchmarks, and policy guidance.
- [EleutherAI](https://www.eleuther.ai/) – Open-source AI research with a safety focus.

## Safety Frameworks

- [OpenAI Model Spec](https://openai.com/model-spec) – Specification defining the intended safe behavior of models.
- [Anthropic Constitutional AI](https://www.anthropic.com/news/constitutional-ai) – Framework for training models against a rule-based constitution.
- [Google Responsible AI Practices](https://ai.google/responsibility/) – Guidelines and frameworks for safe AI development.
- [OECD AI Principles](https://oecd.ai/en/ai-principles) – International standard for trustworthy AI.
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework) – US national framework for assessing and managing AI risk.
- [EU AI Act Summary](https://artificialintelligenceact.eu/) – Regulatory framework for high-risk and general-purpose AI systems.

## Red Teaming & Threat Modeling

- [OpenAI Red Teaming Network](https://openai.com/red-teaming-network) – Global research collaboration on model evaluation.
- [Anthropic Red Teaming Resources](https://www.anthropic.com/) – Safety-oriented adversarial testing methods.
- [Microsoft AI Red Team](https://www.microsoft.com/en-us/security/blog/) – Safety and security testing methods for AI systems.
- [AI Safety Threat Modeling](https://github.com/topics/ai-safety) – Tools and documentation for threat analysis.
- [LLM Jailbreak Prompts Datasets](https://github.com/topics/jailbreak-prompts) – Collections of adversarial prompts for robustness testing.

## Evaluation & Benchmarks

- [HELM](https://crfm.stanford.edu/helm/latest/) – Holistic evaluation of language models, including safety and risk domains.
- [Anthropic Evaluations](https://github.com/anthropics/evals) – Safety evaluations for frontier models.
- [OpenAI Evals](https://github.com/openai/evals) – Framework for testing model safety, reasoning, and reliability.
- [Red Teaming Benchmarks](https://github.com/topics/llm-evaluation) – Community-driven safety evaluations.
- [ToxiGen](https://github.com/microsoft/Counterfit/) – Dataset for evaluating harmful or toxic outputs.
- [SafetyBench](https://github.com/centerforaisafety/SafetyBench) – Benchmark framework for AI safety scenarios.

## Model Governance & Policy

- [AI Safety Institute (UK)](https://www.aisi.gov.uk/) – International coordination on safety testing of frontier models.
- [AI Safety Institute (US)](https://www.ai.gov/) – US policy, evaluation, and governance efforts.
- [OECD AI Governance Hub](https://oecd.ai/en/) – Regulatory and policy resources for AI alignment.
- [UNESCO AI Ethics Framework](https://www.unesco.org/en/artificial-intelligence/ethics) – Global normative framework for ethical AI.
- [Global AI Safety Summits](https://www.gov.uk/government/publications) – Declarations and charters from the global AI safety summits.

## Datasets

- [JailbreakBench](https://github.com/verazuo/jailbreakbench) – Dataset for evaluating susceptibility to jailbreaks.
- [HarmBench](https://github.com/centerforaisafety/HarmBench) – Multi-domain dataset for AI harm classification and safety testing.
- [RealToxicityPrompts](https://allenai.org/data/real-toxicity-prompts) – Adversarial or harmful prompts for robustness evaluation.
- [AdvBench](https://github.com/safety-ai/AdvBench) – Dataset for adversarial attacks and safety testing.

## Learning Resources

- [AI Alignment Fundamentals (BlueDot)](https://www.alignmentfundamentals.com/) – Introductory course on alignment.
- [AGI Safety Fundamentals](https://agi-safety-fundamentals.com/) – Structured curriculum covering alignment, safety, and governance.
- [OpenAI Safety Papers](https://openai.com/research) – Research papers on alignment and model evaluation.
- [Anthropic Interpretability Research](https://www.anthropic.com/research) – Papers and findings on model internals.
- [DeepMind Safety Papers](https://deepmind.google/research) – Research on oversight, alignment, and robustness.
- [CAIS Safety Curriculum](https://www.safe.ai/) – Introductory and advanced learning paths.

## Related Awesome Lists

- [Awesome AI](https://github.com/awesomelistsio/awesome-ai)
- [Awesome Machine Learning](https://github.com/awesomelistsio/awesome-machine-learning)
- [Awesome AI Research Papers](https://github.com/awesomelistsio/awesome-ai-research-papers)
- [Awesome AI Ethics](https://github.com/awesomelistsio/awesome-ai-ethics)
- [Awesome Open Governance](https://github.com/awesomelistsio/awesome-open-governance)

## Contributing

Contributions are welcome. Please make sure your submission fully follows the requirements outlined in [`CONTRIBUTING.md`](CONTRIBUTING.md), including formatting, scope alignment, and category placement. Pull requests that do not follow the contribution guidelines may be closed.

## License

[![CC BY-SA 4.0](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/by-sa.svg)](http://creativecommons.org/licenses/by-sa/4.0/)
