tejassinghbhati/kaavish
GitHub: tejassinghbhati/kaavish
Stars: 0 | Forks: 0
# Kaavish
### Automated Adversarial Red Teaming for Large Language Model Systems
*Systematic. Reproducible. Evidence-based.*





## Overview
Kaavish is a backend-first, API-driven automated red teaming platform for AI systems deployed in production. It operationalises the adversarial attack taxonomy defined by **OWASP LLM Top 10 (2025)**, **MITRE ATLAS**, and peer-reviewed research from NeurIPS, USENIX Security, and ICLR — translating theoretical vulnerability classes into concrete, verifiable, reproducible exploit chains.
Traditional penetration testing pipelines (Nmap, Metasploit, Burp Suite) operate on deterministic software artefacts. LLMs and agent systems introduce a fundamentally non-deterministic, context-sensitive attack surface: adversarial inputs do not exploit memory boundaries or packet fields — they exploit the model's trained probability distributions, instruction-following behaviour, and tool-calling logic. Kaavish was built specifically for this surface.
## The Threat Landscape
The deployment of LLM-based systems has outpaced the development of security tooling designed for them. Recent empirical measurements illustrate the scale of the problem:
- **74% of LLM-integrated applications** are susceptible to at least one form of prompt injection in default configurations (Greshake et al., 2023)
- **Indirect prompt injection** — where malicious instructions are embedded in documents, web pages, or tool outputs processed by an agent — represents an attack surface with no equivalent in traditional software security
- **Jailbreak transferability** is high: adversarial suffixes found on open-weight models transfer to closed-weight production systems (Zou et al., 2023) with non-trivial success rates
- **Training data extraction** has been demonstrated empirically on GPT-2 and GPT-3.5, recovering verbatim memorised sequences including PII (Carlini et al., 2021, 2023)
- The **OWASP LLM Top 10 (2025)** explicitly names prompt injection as the highest-severity vulnerability class for LLM applications
The security industry has not yet produced a standardised automated toolchain for testing these properties. Kaavish fills that gap.
## Architecture
┌────────────────────────────────────────────────────────────────┐
│ Kaavish API (FastAPI) │
│ │
│ POST /scans → Enqueue scan, return scan_id │
│ GET /scans/{id}/status → Poll scan state │
│ GET /scans/{id}/results → Full findings JSON │
│ GET /scans/{id}/report.pdf → Evidence report │
└──────────────────────────┬─────────────────────────────────────┘
│
┌────────────▼────────────┐
│ Target Profiler │
│ (core/scanner.py) │
│ │
│ • Input schema probe │
│ • Model fingerprint │
│ • Framework detection │
│ • Tool/RAG capability │
└────────────┬────────────┘
│ TargetProfile
┌────────────▼────────────┐
│ Attack Executor │
│ (core/executor.py) │
│ │
│ asyncio.gather() │
│ All attacks concurrent │
└──┬──────┬──────┬────────┘
│ │ │
┌────────────▼┐ ┌───▼────┐ ┌▼──────────────┐ ┌────────────┐
│ Prompt │ │ Jail- │ │ Data │ │ Agent │
│ Injection │ │ break │ │ Extraction │ │ Hijack │
│ (10 vars) │ │(6 vars)│ │ (6 vars) │ │ (7 vars) │
└─────────────┘ └────────┘ └───────────────┘ └────────────┘
│
┌────────────▼────────────┐
│ Report Generator │
│ (core/reporter.py) │
│ │
│ Markdown + PDF │
│ Severity scoring │
│ Remediation guidance │
└─────────────────────────┘
### Design Decisions
**Concurrent execution via `asyncio.gather`**
All attack classes execute concurrently against the target. This is intentional: sequential execution would allow rate-limiting or session-state changes between tests to mask vulnerabilities. Concurrency reflects realistic adversarial conditions.
**Profile-driven attack selection**
The profiler runs before any attacks. Agent-specific attacks (tool enumeration, SSRF via agent, cross-tool exfiltration chains) only execute if the profiler detects tool-calling capability. This prevents false positives and reduces noise on non-agent targets.
**Verify-before-report**
Every attack class verifies the exploit is real before recording a finding. A response containing restricted content is only classified as a finding if it matches specific, pre-defined exploitation indicators. This minimises false positive rates.
**Pluggable attack interface**
Every attack inherits from `BaseAttack` and exposes a uniform `async execute() -> AttackResult` interface. Adding a new attack class requires only implementing `execute()` — the executor, reporter, and API require no modification.
## Attack Coverage
Kaavish implements attacks across four primary vulnerability classes, aligned to the OWASP LLM Top 10 (2025) taxonomy.
### LLM01 — Prompt Injection (`attacks/prompt_injection.py`)
Direct prompt injection exploits the fundamental architectural property that LLMs process system instructions and user inputs within a shared token sequence. When the model cannot reliably distinguish between instruction-carrying and data-carrying tokens, an adversarial user can override, extend, or replace the application's system prompt.
| Variant | Technique | Reference |
|---------|-----------|-----------|
| System Prompt Extraction | Direct verbatim extraction request | Perez & Ribeiro (2022) |
| System Instruction Override | Role replacement via DAN-class injection | Wei et al. (2023) |
| Delimiter Confusion | Injecting synthetic end-of-prompt markers | Greshake et al. (2023) |
| Base64-Encoded Smuggling | Encoding bypasses lexical input filters | Branch et al. (2022) |
| Social Engineering Injection | False authority claims from within user turn | — |
| Fake System Message | Injecting `[SYSTEM UPDATE]` tags | Greshake et al. (2023) |
| Context Summarisation Leakage | Requesting self-summary of instructions | — |
| Template Injection (Jinja2) | `{{7*7}}` evaluation in LangChain pipelines | — |
| HTML/XML Tag Injection | `