
# ⛓️‍💥 Claudini ⛓️‍💥

**Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs**

[![arXiv](https://img.shields.io/badge/arXiv-2603.24511-b31b1b.svg?logo=arxiv)](https://arxiv.org/abs/2603.24511)

*Figure: Pareto front with evolution annotations.*

We show that an *[autoresearch](https://github.com/karpathy/autoresearch)*-style pipeline powered by Claude Code discovers novel white-box adversarial attack *algorithms* that **significantly outperform** all existing [methods](claudini/methods/original/README.md) in jailbreaking and prompt injection evaluations. This official code repository contains a demo autoresearch pipeline, the Claude-discovered methods from the paper, baseline implementations, and the evaluation benchmark. Read our [paper](https://arxiv.org/abs/2603.24511) and consider [citing us](#citation) if you find this useful.

## Setup

```
git clone https://github.com/romovpa/claudini.git
cd claudini
uv sync
```

Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/).

## Discover Your Own SOTA Attack

*Figure: Autoresearch loop — seeding, analysis-experiment cycle, evaluation.*

To run autoresearch, open [Claude Code](https://docs.anthropic.com/en/docs/claude-code) and start the `/claudini` skill in a loop:

```
claude
> /loop /claudini my_run break Qwen2.5-7B on random strings under 1e15 FLOPs
```

Each iteration, Claude studies existing methods and results, designs a new optimizer, benchmarks it, and commits, maintaining an agent log across iterations. The run code (`my_run` above) isolates the method chain, git branch, and log. See the full [skill prompt](.claude/skills/claudini/SKILL.md) for details.

Use `tmux` or `screen` so sessions survive disconnection. Track progress via `git log`.

## Evaluate

All experiments are run via the `claudini.run_bench` CLI:

```
uv run -m claudini.run_bench --help
```

It takes a preset name (from [`configs/`](configs/)) or a path to a YAML file. Config settings can be overridden with CLI options. For example, to evaluate methods on the random-targets track with an overridden FLOPs budget:

```
uv run -m claudini.run_bench random_valid --method gcg,acg --max-flops 1e15
```

Results are saved to `results////sample__seed_.json`. Existing results are auto-skipped.

Precomputed results from the paper are available as a [GitHub release](https://github.com/romovpa/claudini/releases). Download and unzip `claudini-results.zip` into the repo root.

## Attack Methods

We consider white-box GCG-style attacks that search directly over the model's vocabulary using gradients. Each method ([`TokenOptimizer`](claudini/base.py#L429)) optimizes a short discrete token *suffix* that, when appended to an input prompt, causes the model to produce a desired target sequence.
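To give a feel for what such an optimizer does, here is a toy sketch of the greedy-coordinate-search idea behind GCG. Everything in it is hypothetical: the names (`loss`, `gcg_step`), the stand-in loss, and the uniform candidate sampling are illustration only, not the repository's `TokenOptimizer` API; real GCG shortlists candidate tokens using gradients of the target loss with respect to one-hot token embeddings rather than sampling them at random.

```python
import random

VOCAB_SIZE = 100   # toy vocabulary
SUFFIX_LEN = 8     # length of the adversarial suffix
TARGET = [7, 42, 13]  # toy stand-in for the desired target sequence

def loss(suffix):
    # Placeholder objective: distance between suffix tokens and the target.
    # A real attack scores -log p(target | prompt + suffix) under the model.
    return sum((t - suffix[i % SUFFIX_LEN]) ** 2 for i, t in enumerate(TARGET))

def gcg_step(suffix, k=8):
    # Pick one suffix position, try k candidate replacement tokens,
    # and keep the best-scoring suffix (never accept a worse one).
    pos = random.randrange(len(suffix))
    best = list(suffix)
    for tok in random.sample(range(VOCAB_SIZE), k):
        cand = list(suffix)
        cand[pos] = tok
        if loss(cand) < loss(best):
            best = cand
    return best

random.seed(0)
start = [random.randrange(VOCAB_SIZE) for _ in range(SUFFIX_LEN)]
suffix = start
for _ in range(200):
    suffix = gcg_step(suffix)
```

Because a step only accepts strictly better candidates, the loss is monotonically non-increasing over iterations; the methods in this repo differ mainly in how candidates are proposed and how the search budget (e.g. FLOPs) is spent.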
- **Baselines** (existing methods): [`claudini/methods/original/`](claudini/methods/original/)
- **Claude-designed methods** (each run code produces a separate chain):
  - Generalizable attacks (random targets): [`claudini/methods/claude_random/`](claudini/methods/claude_random/)
  - Attacks on a safeguard model: [`claudini/methods/claude_safeguard/`](claudini/methods/claude_safeguard/)

See [`CLAUDE.md`](CLAUDE.md) for how to implement a new method.

**Leaderboard.** Run `uv run -m claudini.leaderboard results/` to generate per-track, per-model leaderboards ranking all methods by average loss. Results are saved to `results/loss_leaderboard//.json`.

## Citation

```
@article{panfilov2026claudini,
  title         = {Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs},
  author        = {Alexander Panfilov and Peter Romov and Igor Shilov and Yves-Alexandre de Montjoye and Jonas Geiping and Maksym Andriushchenko},
  journal       = {arXiv preprint},
  eprint        = {2603.24511},
  archivePrefix = {arXiv},
  year          = {2026},
  url           = {https://arxiv.org/abs/2603.24511},
}
```