pardamike/pytest-wardenbot
GitHub: pardamike/pytest-wardenbot
Stars: 0 | Forks: 0
# pytest-wardenbot
[](https://github.com/pardamike/pytest-wardenbot/actions/workflows/ci.yml)
[](https://codecov.io/gh/pardamike/pytest-wardenbot)
[](https://pypi.org/project/pytest-wardenbot/)
[](./LICENSE.md)
[](https://github.com/astral-sh/ruff)
[](https://github.com/pre-commit/pre-commit)
Pytest plugin for testing chatbots and LLM apps — prompt injection, jailbreaks, system-prompt leaks, hallucinations, brand drift.
📖 **Documentation:** [pytest-wardenbot.wardenbot.ai](https://pytest-wardenbot.wardenbot.ai/)
## What it does
Run pytest against your chatbot and find out if it leaks its system prompt, complies with known jailbreaks, hallucinates business facts, or drifts from your brand voice.
- **Black-box.** Tests run against your live chatbot via HTTP, OpenAI API, Anthropic API, or any object you write a small adapter for.
- **Deterministic-first.** v0.1 ships 29 tests that need zero LLM API spend — regex, substring, and schema checks. Optional LLM-judge tests (DeepEval) ship as an extra for semantic checks.
- **Agent-ready failures.** When a test fails, the failure message includes a structured Markdown remediation prompt you can paste into Cursor or Claude Code.
- **Verified adapters.** The bundled OpenAI and Anthropic adapters are smoke-tested weekly against the live vendor APIs in CI ([live-api-smoke](https://github.com/pardamike/pytest-wardenbot/actions/workflows/live-api-smoke.yml)) — a real round-trip stays known-good, not just mocked.
### What "passing" means (and doesn't)
A green run means your chatbot didn't fail any of the bundled 29 attacks in the most overt way. It's a useful smoke test and a regression detector — if a deploy turns a green test red, that's a real signal to investigate.
A green run does **not** mean your chatbot is secure. Frontier-grade attacks are multi-turn, novel, and adapted to your specific bot — no fixed corpus catches all of them. Treat the shipped suite as a starter set: pair it with periodic red-team exercises (or our [Continuous Monitoring](https://wardenbot.ai/intake/) service) for the always-on adversarial coverage CI alone can't provide.
## Install
pip install pytest-wardenbot
Optional extras for LLM-judge tests or vendor-native adapters:
pip install "pytest-wardenbot[judge]" # adds DeepEval for semantic checks
pip install "pytest-wardenbot[openai]" # adds OpenAI Chat + Assistants adapters (sync + async)
pip install "pytest-wardenbot[anthropic]" # adds Anthropic Messages adapter (sync + async)
pip install "pytest-wardenbot[langchain]" # adds LangChainAdapter for any Runnable (sync + async)
pip install "pytest-wardenbot[async]" # adds pytest-asyncio for parallel async probing (run_probes)
## Quickstart (under 60 seconds)
pip install pytest-wardenbot
pytest --wardenbot-quickstart # generates conftest.py + test_my_bot.py
export CHATBOT_URL=https://your-chatbot.example.com/chat
export CHATBOT_TOKEN=sk-... # optional
pytest # runs all shipped tests against your bot
`--wardenbot-quickstart` accepts an industry template:
pytest --wardenbot-quickstart=ecommerce # adds refund/shipping fact placeholders
pytest --wardenbot-quickstart=saas-support # adds plan/trial fact placeholders
pytest --wardenbot-quickstart=generic # default; minimal placeholders
Then edit `conftest.py` to replace the TODO placeholders with your real
business facts and re-run `pytest`. Worked examples in [`examples/`](./examples/)
cover the basic HTTP setup, a custom OpenAI adapter, and a GitHub Actions
workflow.
### Manual setup (if you prefer)
Add this to your project's `conftest.py`:
import os
import pytest
from pytest_wardenbot.adapters.http import HTTPChatbotAdapter
@pytest.fixture
def chatbot():
return HTTPChatbotAdapter(
url="https://your-chatbot.example.com/chat",
headers={"Authorization": f"Bearer {os.environ['CHATBOT_TOKEN']}"},
request_field="message", # the JSON key your bot reads the prompt from
response_field="response", # the JSON key your bot returns the text in
)
Then run the shipped tests with `pytest --pyargs pytest_wardenbot.tests`.
When a test fails, read the failure message, paste the agent-ready Markdown
into Cursor / Claude Code, ship the fix.
## What's in v0.1
| Category | Count | Grading | Requires API key? |
|---|---|---|---|
| Prompt-injection / jailbreak resistance | 5 prompts × 2 checks = 10 | deterministic | no |
| System-prompt leak elicitation (dedicated extraction prompts) | 3 | deterministic | no |
| Refusal-bypass (roleplay / pretext / hypothetical framings) | 3 | deterministic | no |
| Off-topic deflection (scoped bots) | 2 | deterministic | no |
| Indirect / cross-prompt injection (XPIA) | 4 | deterministic | no |
| Encoded-payload jailbreak (Base64 / ROT13 / leet / hex) | 4 | deterministic | no |
| Multi-turn jailbreak (priming + payload, needs session-aware adapter) | 3 | deterministic | no |
| Canary-token leak (opt-in; you plant the token) | 1 | deterministic | no |
| Business-truth verification (parametrized over your facts) | user-supplied | deterministic | no |
| Semantic checks via DeepEval (5 factories: equivalence, brand, hallucination, off-policy, refusal quality) | user-supplied | LLM-judge | yes, with `[judge]` extra |
That's **29 deterministic tests** out-of-the-box (plus the opt-in canary leak test, plus your business-truth and judge lists). Tests run in under a second against a real chatbot with zero LLM API spend unless you've opted into the `[judge]` extra.
The v0.2 roadmap (RAMPART for tool-using agents, LangChain/MCP adapters, ensemble judging, and more) is tracked in [GitHub Issues](https://github.com/pardamike/pytest-wardenbot/issues).
## How it's different from related tools
- **vs Promptfoo ([acquired by OpenAI in Feb 2026](https://openai.com/index/openai-to-acquire-promptfoo/)):** Promptfoo is a developer testing CLI. We're a pytest plugin — same tool your existing test suite uses, same CI integration you already have.
- **vs DeepEval:** DeepEval focuses on evaluation metrics (faithfulness, relevancy). We focus on adversarial security probes (jailbreak, system-prompt leak, refusal-bypass) — different problem, complementary tool. (We use DeepEval under the hood for our optional semantic checks.)
- **vs Garak / PyRIT:** Garak and PyRIT are research-grade attack libraries. We package a curated subset as everyday pytest tests with clear failure messages.
## License
Apache 2.0. See [LICENSE.md](./LICENSE.md).
## Powered by
[WardenBot AI](https://wardenbot.ai) — continuous external monitoring for AI chatbots. 
The pytest plugin is the free, open-source slice of our test corpus. Want continuous monitoring across all your bots with daily probes and a dashboard? [Tell us about your setup](https://wardenbot.ai/intake/) — we open invites in small batches.