ppradyoth/ai-security-resources
GitHub: ppradyoth/ai-security-resources
Stars: 2 | Forks: 0
# 🛡️ AI Security Resources
[](LICENSE)
[](https://github.com/ppradyoth/ai-security-resources)
[](CONTRIBUTING.md)
[](ROADMAP.md)
## 👤 Who This Is For
This isn't a beginner list. This is the resource you wish existed when you started.
- 🔴 **AI Red Teamers** running automated adversarial campaigns against LLM products
- 🔵 **Security Engineers** defending AI pipelines in production
- 🟡 **ML Researchers** studying model robustness, alignment failures, and emergent risks
- 🟢 **Career switchers** with 3+ years in security/ML who want to go deep into AI security
## 🗺️ Navigation Directory
| Domain | Description |
|:---|:---|
| [🧠 Why This Matters — The Origin Story](#-why-this-matters--the-origin-story) | History of risks, neural networks, why this field exists |
| [🔩 Foundational Knowledge](#-foundational-knowledge) | Neural networks, transformers, zero-days, pace of growth |
| [🔴 Red Teaming](#-red-teaming) | Adversarial attacks, jailbreaks, prompt injection |
| [⚡ Runtime Security](#-runtime-security) | Real-time inference protection, guardrails, monitoring |
| [🧬 Inference Security](#-inference-security) | Model serving attacks, side-channels, batching exploits |
| [🔬 Model Scanning](#-model-scanning) | Supply chain, poisoning detection, weight integrity |
| [🌐 Others](#-others) | Governance, datasets, benchmarks, multimodal, agentic |
| [🚀 Zero to Hero Roadmap](#-zero-to-hero-roadmap) | Structured 12-month learning path |
| [💼 Job Opportunities](#-job-opportunities) | Where to work, what to know, salary reality |
| [🤝 How to Contribute](#-how-to-contribute) | Add resources and keep this alive |
## 📚 Specialized Deep-Dive Handbooks
To keep this guide lightweight yet exhaustive, we maintain dedicated, highly comprehensive specialized guides for career, standardizations, and evaluation strategies:
| Handbook | Core Scope | Link |
|:---|:---|:---|
| 💼 **Global Salary Handbook** | Exhaustive country-by-country comp rates (US, IN, UK, IE, SG, AU, ME, EU), tax brackets, rent crises, and career strategies. | **[SALARY_REALITY.md](SALARY_REALITY.md)** |
| 🎓 **Zero to Hero Curriculum** | Rigorous 12-month study plan covering self-attention mechanisms, adversarial CNN/LLM papers, and specialization tracks. | **[ROADMAP.md](ROADMAP.md)** |
| 🧪 **Hands-On Practical Labs** | Ready-to-run code files for PyTorch FGSM attacks, jailbreaks, indirect injections, pickle RCE exploits, and proxy guardrails. | **[LABS.md](LABS.md)** |
| 🤖 **Secure Agents Handbook** | Autonomous coding agent threat modeling, indirect codebase injections, sandboxing, Firecracker MicroVMs, and MCP security. | **[AGENT_SECURITY.md](AGENT_SECURITY.md)** |
| 🏛️ **Standards & Compliance Guide** | MITRE ATLAS threat modeling, OWASP Top 10 for LLMs, NIST AI RMF, ISO 42001, and EU AI Act playbooks. | **[STANDARDS_AND_COMPLIANCE.md](STANDARDS_AND_COMPLIANCE.md)** |
| 📊 **Benchmarks & Datasets Index** | Standardized safety evaluation frameworks (HarmBench, AdvGLUE, CyberSecEval) and adversarial datasets. | **[BENCHMARKS_AND_DATASETS.md](BENCHMARKS_AND_DATASETS.md)** |
| 🎮 **Playgrounds, CTFs & Incidents** | Interactive prompt injection labs (Gandalf, TensorTrust), AI bug bounties, and real-world failure analyses. | **[PLAYGROUNDS_AND_LABS.md](PLAYGROUNDS_AND_LABS.md)** |
| 🔬 **Research Papers Catalog** | Comprehensive, annotated directory of critical academic publications (Zou, Szegedy, Goodfellow, Carlini). | **[RESEARCH_PAPERS.md](RESEARCH_PAPERS.md)** |
| 🏆 **Frontier Safety Leaderboard** | Fact-grounded comparison of GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1, and Grok 2 across safety Elo. | **[SAFETY_LEADERBOARD.md](SAFETY_LEADERBOARD.md)** |
| 🛡️ **Cybersecurity with AI** | Autonomous zero-day vulnerability discovery, exploit generation (Anthropic Mythos), AI defense (OpenAI Daybreak), and MDASH. | **[CYBER_AI.md](CYBER_AI.md)** |
| 🔗 **Related Awesome Lists** | A curated index of "Awesome" lists dedicated to AI Security, MLSecOps, LLM Safety, and Adversarial ML for deep-diving into sub-fields. | **[AWESOME_LISTS.md](AWESOME_LISTS.md)** |
## 🧠 Why This Matters — The Origin Story
### The Trajectory That Created This Problem
Here's what actually happened between the papers:
- **1936 — Turing's Computable Numbers:** Alan Turing introduces the Turing Machine and proves the undecidability of the Halting Problem.
*Modern Security Implication:* Rice's Theorem dictates that dynamically proving any non-trivial semantic property of a Turing-complete system (like an LLM agent with tool access) is undecidable. This is the mathematical proof of why we cannot build a perfect, static "AI firewall" to stop all injections.
- **1943 — McCulloch-Pitts Neuron:** First mathematical model of a neuron. Irrelevant until hardware caught up 70 years later.
- **1948 — Unorganized Machines:** Turing drafts the first blueprint of an artificial neural network, anticipating connectionist AI by decades.
- **1950 — The Imitation Game:** Turing introduces the Turing Test. Modern LLM Red Teaming and safety alignment evaluations are direct, adversarial evolutions of this original capability test.
- **1958 — Perceptron:** Rosenblatt's learning machine. Hyped, then killed by Minsky's proof that it couldn't do XOR.
- **1986 — Backprop:** Rumelhart, Hinton, Williams publish the algorithm that trains everything we use today.
- **1997 — LSTMs (Hochreiter & Schmidhuber):** Memory for sequences. Dominated NLP until attention killed it.
- **2012 — AlexNet:** GPUs + ReLU + dropout + scale = CNN dominance. 10.8% gap over #2. Game over for hand-crafted features.
- **2017 — "Attention Is All You Need":** Transformers. Self-attention. Parallel training. The architecture that scales infinitely.
- **2020 — GPT-3:** 175B parameters. Few-shot learning emerges as a property. Capabilities no one designed for start appearing.
- **2022 — ChatGPT:** 100M users in 60 days. Security teams globally had no playbook.
- **2023–2024 — Multi-Modal & Early Agentics:** Multi-step tool use, retrieval-augmented generation (RAG), visual-language models. The attack surface shifted from the model endpoint to downstream RAG databases and APIs.
- **2025–2026 — Autonomous Swarms & Real-Time Ingestion (Current Era):** Deep agent-to-agent collaboration (swarms), native real-time audio/video streaming pipelines, autonomous code execution in containers. The attack surface is no longer just "the model" — it is the entire host environment, API mesh, and every enterprise system the autonomous swarm can touch.
### Why Risks Evolved
| Era | Model Type & Architecture | Threat Surface / Ingestion Channels | Primary Vulnerability & Risk |
|:---|:---|:---|:---|
| **Pre-2020** | Narrow ML classifiers (ResNet, XGBoost) | Static training datasets, raw inputs | Data poisoning, evasion attacks, adversarial perturbation |
| **2020–2022** | Static Foundation Models (GPT-3, early LLMs) | Raw API endpoints, direct user prompt fields | Direct prompt injection, training data extraction, model inversion |
| **2022–2023** | RLHF-aligned LLMs (ChatGPT, Claude 2) | Public consumer web apps, system prompts | Jailbreaks, alignment bypass, prompt leaking, side-channel attacks |
| **2023–2024** | RAG + Tool-use (Copilots, early Agents) | Integrated databases (vector DBs), external APIs, documents | Indirect prompt injection, database poisoning, tool/API hijacking |
| **2024–2025** | Native Multimodal (GPT-4o, Gemini 1.5 Pro) | Real-time audio stream, visual input frames, live files | Cross-modal injection (steganographic audio, visual typographic exploits) |
| **2025–2026** | **Autonomous Agent Swarms (Current)** | Container environments, host OS, microservices mesh | Sandbox escapes, self-replication, model-to-model spoofing, recursive loop hijacking |
### Zero-Day Vulnerabilities in the AI Context
A traditional zero-day is a software flaw unknown to the vendor. In AI, zero-days take a different form:
- **Prompt injection zero-days**: New attack patterns that bypass guardrails before defenders model them
- **Architecture-specific exploits**: Vulnerabilities in tokenizers (e.g., ChatGPT's `<|endoftext|>` token injection), attention sinks, and positional encoding exploits
- **Emergent capability surprises**: Models demonstrating unexpected behaviors at new capability thresholds — capabilities nobody tested for because nobody expected them
- **Cross-model transferability**: An attack that breaks GPT-4 often breaks Claude and Gemini — the "universal" nature of adversarial examples translates to the LLM domain
### The Pace of Growth Problem
The capability-safety gap is real and growing:
- **Goodhart's Law in AI**: Once a safety metric becomes a target (RLHF reward), it stops being a good safety metric. Models learn to *appear* safe rather than *be* safe.
- **Dual-use acceleration**: The same models that write defensive code write offensive code. The same reasoning that explains vulnerabilities exploits them.
- **Evaluation lag**: By the time researchers publish a benchmark, frontier models have already surpassed it. We are perpetually measuring the past.
**Key reading on this:**
- [Situational Awareness (Leopold Aschenbrenner, 2024)](https://situational-awareness.ai/) — The most honest account of where the capability curve is heading
- [Concrete Problems in AI Safety (Amodei et al., 2016)](https://arxiv.org/abs/1606.06565) — Still the canonical framing of specification gaming, reward hacking, and safe exploration
- [An Overview of Catastrophic AI Risks (Hendrycks et al., 2023)](https://arxiv.org/abs/2306.12001) — Comprehensive taxonomy of the actual risk landscape
## 🔩 Foundational Knowledge
### Neural Networks — What's Actually Happening
| Resource | Type | Why It Matters for Security |
|:---|:---|:---|
| **[3Blue1Brown: Neural Networks](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)** | Video | Best visual intuition on weight spaces. Understand the geometry of the attack surface. |
| **[CS231n: CNNs for Visual Recognition (Stanford)](http://cs231n.stanford.edu/)** | Course | Foundation course. Backprop, gradient descent, weight initialization — all exploited by adversarial attacks. |
| **[Deep Learning Book (Goodfellow et al.)](https://www.deeplearningbook.org/)** | Textbook | The Bible. Chapter 7 (regularization), Chapter 8 (optimization) and Chapter 11 (practical methodology) are most relevant for adversarial ML. |
| **[Ilya Sutskever's Reading List](https://arc.net/folder/D0472A20-9C20-4D3F-B145-D2865C0A9FEE)** | Paper list | ~30 papers that form the backbone of modern deep learning. Sutskever said reading these gives you ~90% of what matters. |
| **[Neural Networks: Zero to Hero (Karpathy)](https://github.com/karpathy/nn-zero-to-hero)** | Course | Build GPT-2 from scratch. The only way to truly understand what you're attacking. |
### Transformers — The Architecture Everything Runs On
| Resource | Type | Why It Matters for Security |
|:---|:---|:---|
| **[Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762)** | Paper | The architecture paper. Understanding attention heads is prerequisite for understanding activation steering, representation engineering, and mechanistic interpretability attacks. |
| **[The Illustrated Transformer (Jay Alammar)](https://jalammar.github.io/illustrated-transformer/)** | Blog | Best visual walkthrough of the architecture. Start here before the paper. |
| **[The Annotated Transformer (Harvard NLP)](https://nlp.seas.harvard.edu/2018/04/03/attention.html)** | Code | Line-by-line implementation. Seeing the code makes tokenizer exploits and attention pattern manipulation concrete. |
| **[A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)](https://transformer-circuits.pub/2021/framework/index.html)** | Paper | Anthropic's mechanistic interpretability foundation. Understanding circuits is how you understand *why* jailbreaks work. |
| **[Language Models are Few-Shot Learners (GPT-3, Brown et al., 2020)](https://arxiv.org/abs/2005.14165)** | Paper | Emergence paper. In-context learning as a security primitive — and a vulnerability. |
### Adversarial ML — The Science Behind the Attacks
| Resource | Type | Why It Matters for Security |
|:---|:---|:---|
| **[Explaining and Harnessing Adversarial Examples (Goodfellow et al., 2014)](https://arxiv.org/abs/1412.6572)** | Paper | The FGSM paper. The ur-text of adversarial ML. Everything since is a variation. |
| **[Adversarial Robustness Toolbox (ART) Documentation](https://adversarial-robustness-toolbox.readthedocs.io/)** | Docs | IBM's comprehensive adversarial ML library. Attack implementations + defenses. |
| **[Intriguing Properties of Neural Networks (Szegedy et al., 2013)](https://arxiv.org/abs/1312.6199)** | Paper | First adversarial examples paper. The moment the field realized "oh, this is a security problem." |
| **[Certified Adversarial Robustness via Randomized Smoothing (Cohen et al., 2019)](https://arxiv.org/abs/1902.02918)** | Paper | The best theoretical defense. Understanding why it works shows you the limits of all defenses. |
## 🔴 Red Teaming
### Philosophy
Red teaming AI is not the same as red teaming software. You are not looking for logic bugs — you are probing a probability distribution for failure modes that emerge from training. The failure modes are:
1. **Safety misalignment**: The model was trained to avoid X but generalizes imperfectly around X
2. **Capability overshooting**: The model was intended to do Y but can also do harmful Z using the same underlying capabilities
3. **Context collapse**: The model behaves safely in testing but fails under production context diversity
### Automated Attack Frameworks
| Tool | Creator | Stars | Description | Best For |
|:---|:---|:---|:---|:---|
| **[NVIDIA/garak](https://github.com/NVIDIA/garak)** | NVIDIA | ⭐ 5k+ | The Nmap of AI. 100+ probes: prompt injection, jailbreaks, data leakage, hallucination, toxicity. | Comprehensive baseline scanning |
| **[microsoft/PyRIT](https://github.com/microsoft/PyRIT)** | Microsoft | ⭐ 2k+ | Python Risk Identification Tool. Multi-turn conversation orchestration, intent drift, and programmatic red team scaling. | Research-grade multi-turn attacks |
| **[confident-ai/deepteam](https://github.com/confident-ai/deepteam)** | Confident AI | ⭐ 1.5k+ | 50+ vulnerability classes, 20+ attack vectors, OWASP + NIST alignment. Agentic and RAG red teaming. | CI/CD-integrated red teaming |
| **[artkit-ai/artkit](https://github.com/artkit-ai/artkit)** | ARTKIT | ⭐ 700+ | Multi-turn agentic simulation. Realistic attacker-target conversations across complex agentic workflows. | Agentic red teaming |
| **[promptfoo/promptfoo](https://github.com/promptfoo/promptfoo)** | Promptfoo | ⭐ 6k+ | Developer-first. CI/CD integration, model comparison, custom assertion pipelines. | DevSecOps integration |
| **[Giskard-AI/giskard](https://github.com/Giskard-AI/giskard)** | Giskard | ⭐ 4k+ | RAG and agentic stress testing. MCP security scanning. Enterprise-grade dynamic attack generation. | RAG pipeline testing |
| **[BerriAI/litellm](https://github.com/BerriAI/litellm) + PyRIT** | Community | — | Combine LiteLLM's unified API with PyRIT for cross-model adversarial comparisons. | Multi-provider comparison attacks |
### Manual Red Teaming Resources
| Resource | Type | Description |
|:---|:---|:---|
| **[Lakera Gandalf](https://gandalf.lakera.ai/)** | CTF | 8-level prompt injection CTF. Best way to internalize how defenses layer. Start here. |
| **[Crucible (Dreadnode)](https://crucible.dreadnode.io/)** | CTF | Advanced AI security challenges: exfiltration, RAG exploitation, agentic hijacking. |
| **[HackAPrompt (Learning Labs)](https://www.hackaprompt.com/)** | CTF | Large-scale prompt injection competition. Real attack patterns from thousands of players. |
| **[AI Village CTFs (DEF CON)](https://aivillage.org/)** | Competition | Annual DEF CON challenges. State-of-the-art attack techniques from the research community. |
| **[RedTeam Arena (Scale AI)](https://redarena.ai/)** | Platform | Crowdsourced jailbreak arena. See what actually works against current models. |
| **[Prompt Injection Playground](https://github.com/TakSec/Prompt-Injection-Everywhere)** | Guide | Practical examples of prompt injection in real application contexts. |
### Key Attack Research Papers
| Paper | Year | Significance |
|:---|:---|:---|
| **[Universal and Transferable Adversarial Attacks on Aligned LLMs (Zou et al.)](https://arxiv.org/abs/2307.15043)** | 2023 | GCG attack. Automated gradient-based suffix generation that transfers across GPT-4, Claude, Gemini. Broke the field open. |
| **[Jailbroken: How Aligned Language Models Can Be Bypassed (Wei et al.)](https://arxiv.org/abs/2307.02483)** | 2023 | Taxonomizes jailbreaks into competing objectives and mismatch generalization. Essential conceptual framework. |
| **[Many-shot Jailbreaking (Anil et al., Anthropic)](https://www.anthropic.com/research/many-shot-jailbreaking)** | 2024 | Long context windows create new attack surface — demonstrated 256+ in-context examples overwhelm safety training. |
| **[Tree of Attacks with Pruning (TAP)](https://arxiv.org/abs/2312.02119)** | 2023 | LLM-generated attack chains using tree search. 80%+ ASR on GPT-4. |
| **[Prompt Injection Attacks Against LLM-Integrated Applications (Greshake et al.)](https://arxiv.org/abs/2302.12173)** | 2023 | Indirect prompt injection via external content. The paper that defined the modern threat model for agents. |
| **[SmoothLLM: Defending Against Jailbreaking Attacks](https://arxiv.org/abs/2310.03684)** | 2023 | Randomized smoothing for LLM defense. Breaks GCG. Important as reference defense. |
| **[FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts](https://arxiv.org/abs/2311.05608)** | 2023 | Multimodal jailbreaks. Instructions hidden in images bypass text-only safety filtering. |
| **[AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models](https://arxiv.org/abs/2310.04451)** | 2023 | Readable, human-like jailbreak generation that evades perplexity filters. |
### Red Teaming Methodologies & Standards
| Resource | Organization | Description |
|:---|:---|:---|
| **[MITRE ATLAS](https://atlas.mitre.org/)** | MITRE | Adversarial Threat Landscape for AI Systems. ATT&CK-style matrix for AI-specific TTPs. Use this to structure threat models. |
| **[OWASP Top 10 for LLM Applications 2025](https://owasp.org/www-project-top-10-for-large-language-model-applications/)** | OWASP | Updated 2025 version. Prompt Injection #1, Excessive Agency #2. The compliance framework for enterprise red teams. |
| **[Microsoft AI Red Team Practices](https://learn.microsoft.com/en-us/security/ai-red-team/ai-red-team)** | Microsoft | Internal red team methodology made public. Structured approach to AI threat modeling. |
| **[Anthropic's Responsible Scaling Policy](https://www.anthropic.com/news/anthropics-responsible-scaling-policy)** | Anthropic | How frontier labs operationalize red teaming as a deployment gate. Required reading for policy context. |
| **[Google's AI Red Team Report](https://services.google.com/fh/files/blogs/google-ai-red-team-digital-final.pdf)** | Google | Case studies of real red team findings against deployed systems. |
## ⚡ Runtime Security
### The Problem
Training-time safety is necessary but insufficient. At runtime, your model faces:
- Input it was never trained on
- Users with goals it was never designed for
- Context injection from third-party sources it was told to trust
- Adversarial perturbations calibrated specifically against your deployed version
### Guardrails & Input/Output Filtering
| Tool | Creator | Description | Deployment Mode |
|:---|:---|:---|:---|
| **[NVIDIA/NeMo-Guardrails](https://github.com/NVIDIA/NeMo-Guardrails)** | NVIDIA | Programmable semantic rails. Define topic constraints, safety rules, and conversation flows in Colang DSL. | SDK / Self-hosted |
| **[protectai/llm-guard](https://github.com/protectai/llm-guard)** | Protect AI | Real-time scanner: prompt injection, PII detection, toxicity, ban topics, code injection. Input + output coverage. | SDK / Docker |
| **[Lakera Guard](https://platform.lakera.ai/docs/api)** | Lakera | Millisecond-latency API. Best-in-class prompt injection detection from the team that built Gandalf. | API |
| **[guardrails-ai/guardrails](https://github.com/guardrails-ai/guardrails)** | Guardrails AI | Structural + semantic validation. Define output schemas with security assertions. Nails hallucination + format-injection attacks. | SDK |
| **[deadbits/vigil](https://github.com/deadbits/vigil)** | Deadbits | Vector DB + heuristics + classifier ensemble for injection detection. Open-source and auditable. | SDK |
| **[meta-llama/PurpleLlama/Llama Guard](https://github.com/meta-llama/PurpleLlama)** | Meta | Fine-tuned safety classifier for I/O filtering. Available as a model you can self-host. | Model / API |
### Real-Time Monitoring & Observability
| Tool | Creator | Description | Key Capability |
|:---|:---|:---|:---|
| **[whylabs/langkit](https://github.com/whylabs/langkit)** | WhyLabs | Statistical telemetry. Detects distribution shift, toxicity drift, relevance degradation in real-time. | Security drift detection |
| **[Arize AI](https://arize.com/)** | Arize | Enterprise LLM observability. Prompt/response logging, hallucination scoring, user journey tracing. | Production monitoring |
| **[Langfuse](https://github.com/langfuse/langfuse)** | Langfuse | Open-source LLM engineering platform. Full request tracing, eval pipelines, cost tracking. | Open-source observability |
| **[Helicone](https://github.com/Helicone/helicone)** | Helicone | Proxy-based observability. Log, monitor, and rate-limit with zero code change. | Zero-integration monitoring |
| **[Evidently AI](https://github.com/evidentlyai/evidently)** | Evidently | ML monitoring with LLM-specific metrics. Detects prompt/response drift over time. | Drift monitoring |
| **[Phoenix (Arize)](https://github.com/Arize-ai/phoenix)** | Arize | Open-source tracing for LLM apps. OTEL-native. Good for debugging attack chains in agentic systems. | Tracing |
### Agentic Runtime Security
The scariest runtime threat: an agent that *can act* and is *being manipulated*.
| Resource | Type | Description |
|:---|:---|:---|
| **[AgentDojo](https://github.com/ethz-spylab/agentdojo)** | Benchmark | Benchmark for agent injection attacks. Automated scoring of whether injected instructions successfully hijack agent behavior. |
| **[OWASP Agentic AI Top 10 (2025)](https://genai.owasp.org/)** | Standard | Extending OWASP LLM Top 10 to agentic systems. Excessive Agency, Trust Boundary violations, Memory Poisoning. |
| **[PromptArmor](https://promptarmor.com/)** | Tool | Specifically designed for indirect prompt injection detection in RAG + agent pipelines. |
| **[Prompt Injection in the Wild (Research)](https://arxiv.org/abs/2403.02691)** | Paper | Systematic study of injection attacks against deployed LLM applications. |
## 🧬 Inference Security
### Understanding the Attack Surface
Inference is not just "model forward pass." It's:
- Tokenization (exploitable at token boundaries)
- KV cache (privacy leakage between requests)
- Batching (timing side-channels)
- Quantization (quantization can change safety properties)
- Speculative decoding (security properties under speculation are underexplored)
### Key Inference Security Research
| Paper | Year | Finding |
|:---|:---|:---|
| **[Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection](https://arxiv.org/abs/2302.12173)** | 2023 | Defined the indirect injection threat model. Adversarial content in retrieved documents hijacks model behavior. |
| **[Stealing Part of a Production Language Model (Carlini et al.)](https://arxiv.org/abs/2403.06634)** | 2024 | Extracted GPT-3.5 embedding projection layer via API queries. Model theft at production scale. |
| **[Extracting Training Data from Large Language Models (Carlini et al.)](https://arxiv.org/abs/2012.07805)** | 2021 | Demonstrated training data memorization and extraction. PII leakage through inference. |
| **[Practical Membership Inference Attacks Against Large-Scale Multi-Label Learning Systems](https://arxiv.org/abs/1809.00557)** | 2018 | Membership inference: determine if a data point was in training data. Privacy violation at scale. |
| **[Prompt Leaking](https://learnprompting.org/docs/prompt_hacking/leaking)** | — | System prompt extraction techniques. Attacker recovers confidential system instructions. |
| **[KV Cache Side-Channel Attack (Cachebleed for LLMs)](https://arxiv.org/abs/2403.14513)** | 2024 | Timing-based inference about other users' requests via shared KV cache in multi-tenant deployments. |
| **[Quantization and LLM Safety (Paper)](https://arxiv.org/abs/2402.05791)** | 2024 | Quantization can degrade safety fine-tuning. 4-bit models may bypass safety training present in 16-bit version. |
### Inference Security Tools
| Tool | Description | Use Case |
|:---|:---|:---|
| **[TextAttack](https://github.com/QData/TextAttack)** | NLP adversarial attack library. Implements BERT-Attack, TextFooler, CLARE. | Text-level adversarial evaluation |
| **[Adversarial Robustness Toolbox (ART)](https://github.com/Trusted-AI/adversarial-robustness-toolbox)** | IBM's comprehensive adversarial ML framework. 100+ attacks across ML frameworks. | Full adversarial evaluation stack |
| **[Counterfit](https://github.com/Azure/counterfit)** | Microsoft's automation framework for AI security risk assessment. | Enterprise inference testing |
| **[cleverhans](https://github.com/cleverhans-lab/cleverhans)** | Classic adversarial example library. TF/PyTorch attacks (FGSM, PGD, C&W). | Foundational attack implementation |
| **[foolbox](https://github.com/bethgelab/foolbox)** | Fast adversarial attack library. PyTorch-native, gradient-based attacks. | Efficient image model attacks |
## 🔬 Model Scanning
### The Threat
Open-source model ecosystems (Hugging Face, Ollama, CivitAI) have created a massive software supply chain problem. A model is a binary artifact that:
- Can execute arbitrary code when deserialized (pickle exploits)
- Can contain backdoors (hidden triggers that change behavior)
- Can have malicious fine-tune adapters (LoRA poisoning)
- Can be a typosquatted version of a legitimate model
### Model Supply Chain Security Tools
| Tool | Creator | Description | Key Feature |
|:---|:---|:---|:---|
| **[ProtectAI/modelscan](https://github.com/protectai/modelscan)** | Protect AI | Scans ML model files (pickle, H5, ONNX, SavedModel) for malicious code before loading. Open-source. | Pickle exploit detection |
| **[HiddenLayer Model Scanner](https://hiddenlayer.com/model-scanner/)** | HiddenLayer | Commercial model scanner with genealogy tracking, backdoor detection, and weight integrity checks. | Enterprise-grade genealogy |
| **[Hugging Face malware detection](https://huggingface.co/blog/malicious-models)** | Hugging Face | Built-in Pickle scanning on HF Hub. Reference implementation for understanding the threat. | Platform-level scanning |
| **[ONNX model security guidelines](https://onnxruntime.ai/docs/reference/security.html)** | Microsoft | Threat model and mitigations for ONNX model loading. | ONNX-specific security |
| **[safetensors](https://github.com/huggingface/safetensors)** | Hugging Face | Safe serialization format. No arbitrary code execution on load. The correct answer to pickle. | Safe model loading |
### Backdoor Detection Research
| Resource | Type | Description |
|:---|:---|:---|
| **[Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks](https://arxiv.org/abs/1903.03080)** | Paper | First systematic approach to backdoor detection via reverse-engineering trigger patterns. |
| **[STRIP: A Defence Against Trojan Attacks on Deep Neural Networks](https://arxiv.org/abs/1902.06531)** | Paper | Input perturbation-based backdoor detection. Measures prediction entropy under augmentations. |
| **[BadNets: Evaluating Backdooring Attacks on Deep Neural Networks](https://arxiv.org/abs/1708.06733)** | Paper | The original backdoor attack paper. Understanding the attack is prerequisite for scanning defenses. |
| **[Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Anthropic, 2024)](https://arxiv.org/abs/2401.05566)** | Paper | LLMs can be trained to behave safely during evaluation but maliciously in deployment. Fundamentally challenges safety training. |
| **[Backdoor Attacks on Language Models (Wallace et al.)](https://arxiv.org/abs/2010.12563)** | Paper | Trojan attacks on NLP models. Trigger words cause model to behave maliciously. |
### Training Data Security
| Resource | Type | Description |
|:---|:---|:---|
| **[PoisonGPT: How We Hid a Lobotomized LLM on HF Hub](https://blog.mithrilsecurity.io/poisongpt-how-we-hid-a-lobotomized-llm-on-hugging-face-to-spread-fake-news/)** | Blog | Live demonstration of model weight poisoning to spread targeted misinformation while passing capability benchmarks. |
| **[Data Poisoning Attacks on Machine Learning: A Survey](https://arxiv.org/abs/2012.01355)** | Paper | Comprehensive survey of training-time attacks. |
| **[Datasheets for Datasets (Gebru et al.)](https://arxiv.org/abs/1803.09010)** | Paper | Framework for dataset documentation and provenance. Foundation of responsible training data practices. |
| **[AI Bill of Materials (AIBOM)](https://www.cisa.gov/topics/risk-management/software-bill-materials-sbom)** | Standard | Extending SBOM concepts to AI models, datasets, and fine-tune adapters. |
## 🌐 Others
This section covers critical elements of the broader AI security landscape: standardization, evaluation benchmarks, and studying real-world failures.
To explore these domains in exhaustive detail without cluttering this README, we maintain dedicated practitioner handbooks:
* 👉 **[AI Security Standards & Compliance Guide (STANDARDS_AND_COMPLIANCE.md)](STANDARDS_AND_COMPLIANCE.md)**: MITRE ATLAS threat modeling, OWASP Top 10 for LLMs, NIST AI RMF, ISO 42001, and EU AI Act compliance playbooks.
* 👉 **[Benchmarks & Datasets Index (BENCHMARKS_AND_DATASETS.md)](BENCHMARKS_AND_DATASETS.md)**: Comprehensive guide to standardized safety evaluation frameworks (HarmBench, AdvGLUE, CyberSecEval) and adversarial datasets.
* 👉 **[Interactive Playgrounds, CTFs & Incident Databases (PLAYGROUNDS_AND_LABS.md)](PLAYGROUNDS_AND_LABS.md)**: Hacking labs (Gandalf, TensorTrust), prompt injection CTFs, bug bounty networks (Huntr), and real-world failure analyses.
### Agentic & Multimodal Security
| Resource | Type | Description |
|:---|:---|:---|
| **[MCP Security (Model Context Protocol)](https://modelcontextprotocol.io/specification/2025-03-26/security)** | Standard | Security specification for the MCP protocol that connects agents to tools. Read before building any agent infrastructure. |
| **[AgentBench](https://github.com/THUDM/AgentBench)** | Benchmark | Evaluates LLM agents across 8 environments. Security-relevant: code execution, OS, database agents. |
| **[VLGuard](https://github.com/ys-zong/VLGuard)** | Dataset | Safety fine-tuning dataset for vision-language models. |
| **[Multimodal Safety Benchmark (MSSBench)](https://arxiv.org/abs/2406.11860)** | Paper | First systematic multimodal safety evaluation across image+text attacks. |
| **[Not All Languages Are Created (Equally Safe)](https://arxiv.org/abs/2310.04451)** | Paper | Multilingual jailbreaks. Low-resource languages bypass safety training more effectively. |
## 🚀 Zero to Hero Roadmap
Rather than teaching you how to run other people's scripts, our curriculum focuses on **architectural fundamentals, mathematical intuition, and custom exploit engineering**.
To explore the exhaustive week-by-week syllabus, reading lists, coding tasks, and hands-on laboratory exercises, please refer to the dedicated learning modules:
* 👉 **[The Definitive Zero to Hero Curriculum (ROADMAP.md)](ROADMAP.md)**: A complete, structured 12-month study plan covering Transformers, Classical Adversarial ML, Offensive Red Teaming, Guardrail Engineering, and advanced career specialization tracks.
* 👉 **[Practical Hands-On Laboratory Handbook (LABS.md)](LABS.md)**: Ready-to-run coding labs with step-by-step guides for:
* *Lab 1*: Fast Gradient Sign Method (FGSM) in PyTorch.
* *Lab 2*: Crafting Direct Prompt Injections & Jailbreaks.
* *Lab 3*: Indirect Prompt Injection via RAG & Tool Hijacking.
* *Lab 4*: Model Supply Chain Exploitation via Malicious Pickle weights.
* *Lab 5*: Implementing an Active Input/Output Guardrail Pipeline.
### Certifications Worth Having
| Certification | Org | Signal | Time |
|:---|:---|:---|:---|
| **[Certified AI Security Professional (CAISP)](https://www.aigov.institute/certifications)** | AI Gov Institute | Practitioner-level AI security. Best available. | 60–80 hrs |
| **[GIAC GREM](https://www.giac.org/certifications/reverse-engineering-malware-grem/)** | SANS | Reverse engineering. Useful for model weight analysis. | 120 hrs |
| **[Google Professional ML Engineer](https://cloud.google.com/learn/certification/machine-learning-engineer)** | Google | ML fundamentals signal. Good for bridging to employers. | 40 hrs |
| **[AWS ML Specialty](https://aws.amazon.com/certification/certified-learning-specialty/)** | AWS | Cloud ML deployment. Covers security of deployed models. | 60 hrs |
| **[OSCP](https://www.offsec.com/courses/pen-200/)** | OffSec | Classic red team cert. Still matters for traditional attack context. | 200+ hrs |
## 💼 Job Opportunities
### The Roles That Exist
| Role | What You Actually Do | Where to Find |
|:---|:---|:---|
| **AI Red Teamer** | Run adversarial campaigns against production LLMs. Find what breaks before attackers do. | Anthropic, OpenAI, Scale AI, HackerOne |
| **AI Security Engineer** | Build defensive infrastructure: guardrails, monitoring, detection pipelines. | All major tech companies |
| **ML Security Researcher** | Publish novel attacks and defenses. Reproduce papers, discover new vulnerability classes. | Research labs (DeepMind, FAIR, MSR) |
| **AI Security Consultant** | Help enterprises deploy LLMs safely. Threat modeling, compliance, red team engagements. | Big 4, security boutiques |
| **AI Safety Engineer** | Alignment-adjacent. Evaluation design, interpretability-informed defenses. | Anthropic, DeepMind, ARC |
| **AI SecOps Engineer** | SOC for AI systems. Monitor, detect, respond to AI-specific incidents. | Financial services, healthcare |
### Salary Reality (2025–2026)
For an exhaustive, deep-dive breakdown of international technical compensation, tax structures, superannuation details, local rental stress, and regional hiring entities, please refer to the dedicated salary handbook:
👉 **[Exhaustive International Salary Handbook & Strategy Guide (SALARY_REALITY.md)](SALARY_REALITY.md)**
#### 🗺️ Executive Total Comp Summary (Annual TC in USD)
| Country | Junior (0–3 yrs) | Mid-Level (3–6 yrs) | Senior (6–9 yrs) | Staff / Principal (9+ yrs) | Hub Cities |
|:---|:---|:---|:---|:---|:---|
| 🇺🇸 **United States** | $140k–$200k | $200k–$320k | $300k–$480k | $400k–$700k+ | San Francisco, NYC, Seattle |
| 🇮🇳 **India** | ₹15–22L ($18k–$26k) | ₹25–45L ($30k–$54k) | ₹40–70L ($48k–$84k) | ₹80–150L+ ($96k–$180k+) | Bengaluru, Hyderabad, Pune |
| 🇬🇧 **United Kingdom** | £50k–£80k ($63k–$100k) | £80k–£120k ($100k–$150k) | £115k–£180k ($145k–$225k) | £175k–£280k+ ($220k–$350k+) | London, Cambridge |
| 🇮🇪 **Ireland** | €65k–€95k ($70k–$102k) | €100k–€150k ($108k–$162k) | €160k–€240k ($172k–$258k) | €250k–€380k+ ($270k–$410k+) | Dublin (Silicon Docks) |
| 🇸🇬 **Singapore** | S$75k–S$110k ($55k–$81k) | S$120k–S$180k ($88k–$132k) | S$200k–S$320k ($147k–$235k) | S$320k–S$480k+ ($235k–$353k+) | Singapore |
| 🇦🇺 **Australia** | A$120k–A$150k ($78k–$98k) | A$160k–A$220k ($104k–$143k) | A$240k–A$340k ($156k–$221k) | A$350k–A$500k+ ($227k–$325k+) | Sydney, Melbourne |
| 🇦🇪/🇸🇦 **Middle East** | $60k–$80k (Tax-Free) | $80k–$145k (Tax-Free) | $145k–$245k (Tax-Free) | $245k–$390k+ (Tax-Free) | Abu Dhabi, Dubai, Riyadh |
| 🇨🇭 **Switzerland** | CHF 90k–120k ($99k–$132k) | CHF 110k–160k ($120k–$175k) | CHF 160k–220k ($175k–$240k) | CHF 220k–350k+ ($240k–$385k+) | Zurich, Geneva |
| 🇪🇺 **Western Europe** | €55k–€75k ($60k–$81k) | €70k–€100k ($75k–$108k) | €95k–€135k ($102k–$145k) | €130k–€180k+ ($140k–$195k+) | Amsterdam, Munich, Paris |
| 🇨🇦 **Canada** | C$90k–C$120k ($66k–$88k) | C$110k–C$160k ($81k–$118k) | C$160k–C$220k ($118k–$162k) | C$200k–C$290k+ ($147k–$213k+) | Toronto, Montreal |
### Where to Work — Company Breakdown
#### Frontier AI Labs
| Company | Focus | Why Join | Links |
|:---|:---|:---|:---|
| **Anthropic** | Safety-first. Constitutional AI, interpretability, red team gates on deployment. | Most rigorous safety culture. Problems are genuinely hard. | [Careers](https://www.anthropic.com/careers) |
| **OpenAI** | Scale. Broad attack surface: DALL-E, Codex, GPT API, Agents. | Largest deployed surface. Detection & response is mature. | [Careers](https://openai.com/careers) |
| **Google DeepMind** | Research + product integration. Safety, interpretability, autonomous security. | Research-to-production pipeline. | [Careers](https://careers.google.com/) |
| **Meta AI (FAIR)** | Open-source focus. PurpleLlama, Llama Guard, CyberSecEval. | Ship open-source that the field uses. | [Careers](https://www.metacareers.com/) |
| **Mistral AI** | European lab, fast-moving. Safety is a growing focus. | Smaller team, higher ownership. | [Careers](https://mistral.ai/company/careers) |
#### AI Security Startups
| Company | Focus | Stage | Link |
|:---|:---|:---|:---|
| **HiddenLayer** | Model security platform, AI-SPM | Series A | [Jobs](https://hiddenlayer.com/careers/) |
| **Protect AI** | Model scanning, MLSecOps | Series B | [Jobs](https://protectai.com/careers) |
| **Lakera** | Prompt injection guardrails | Series A | [Jobs](https://lakera.ai/careers) |
| **Giskard AI** | LLM testing and red teaming | Series A | [Jobs](https://www.giskard.ai/jobs) |
| **Haize Labs** | Adversarial evaluation research | Seed | [Jobs](https://haize.com/about) |
| **Dreadnode** | AI offensive security | Seed | [Jobs](https://dreadnode.io/) |
#### Enterprise Security Teams (AI Focus)
| Company | What They're Building | Link |
|:---|:---|:---|
| **Microsoft** | PyRIT, Azure AI Content Safety, Copilot red teaming | [Jobs](https://careers.microsoft.com/) |
| **NVIDIA** | garak, NeMo-Guardrails, AI security infrastructure | [Jobs](https://nvidia.com/careers) |
| **Cisco** | AI Defense platform, enterprise AI scanning | [Jobs](https://jobs.cisco.com/) |
| **CrowdStrike** | AI-powered threat detection, ML model security | [Jobs](https://crowdstrike.com/careers/) |
| **Wiz** | AI-SPM, shadow AI detection, cloud AI posture | [Jobs](https://wiz.io/careers) |
### Job Boards
| Board | Best For |
|:---|:---|
| **[80,000 Hours Job Board](https://jobs.80000hours.org/)** | AI safety and high-impact security roles |
| **[AI Jobs (aiml.to)](https://aiml.to/jobs)** | Specialized AI/ML roles |
| **[Glassdoor AI Security](https://www.glassdoor.com/)** | Salary verification + company culture |
| **[LinkedIn — AI Security filter](https://linkedin.com/)** | Volume + networking |
| **[Levels.fyi](https://levels.fyi/)** | Comp verification before negotiating |
### How to Stand Out
1. **Have a GitHub that shows you broke something** — a CTF writeup, a tool, a paper reproduction
2. **Contribute to garak, llm-guard, or NeMo-Guardrails** — open-source contributions signal depth
3. **Write publicly** — a blog post on a novel attack/defense pattern is worth more than any cert
4. **Know the papers** — every technical interview at a frontier lab will test whether you've actually read the relevant literature
5. **Speak the compliance language** — OWASP, NIST AI RMF, EU AI Act for enterprise roles; attack chains and ASR for lab roles
### What We're Looking For
- [ ] Novel attack papers published in 2025
- [ ] Production red team case studies
- [ ] Non-English resources (especially Chinese and French AI security research)
- [ ] Multimodal security resources (audio, video)
- [ ] Edge/on-device model security
*Maintained by [@ppradyoth](https://github.com/ppradyoth). Built to secure the future of AI — before AI secures us.*