code-sensei/artemiskit
An open-source LLM evaluation toolkit that helps teams systematically validate the reliability and safety of AI applications through scenario tests, red teaming, and stress tests.
Stars: 4 | Forks: 1
# ArtemisKit

**Open-source LLM evaluation toolkit** - Test, evaluate, stress-test, and red-team your AI applications with scenario-based testing and multi-provider support.
📚 **[Documentation](https://artemiskit.vercel.app)** | 🚀 **[Quick Start](https://artemiskit.vercel.app/docs/cli/getting-started/)**
## Features
- **Scenario-based testing** - Define test cases in YAML, with multi-turn conversation support
- **Security red teaming** - Automatically test for prompt injection, jailbreaks, and data extraction
- **Stress testing** - Measure latency, throughput, and reliability under load
- **Multi-provider support** - OpenAI, Azure OpenAI, Vercel AI SDK (20+ providers)
- **Rich reports** - Interactive HTML reports with config traceability
- **CI/CD ready** - Exit codes and JSON output for automation
## Installation
```
npm install -g @artemiskit/cli
# or
pnpm add -g @artemiskit/cli
# or
bun add -g @artemiskit/cli
```
## Quick Start (Basic Example)
This is the simplest way to get started with ArtemisKit.
### 1. Set your API key
```
export OPENAI_API_KEY="your-api-key"
```
### 2. Create a simple scenario
```
# scenarios/hello.yaml
name: hello-test
description: My first ArtemisKit test
cases:
  - id: greeting-test
    prompt: "Say hello"
    expected:
      type: contains
      values:
        - "hello"
      mode: any
```
### 3. Run it
```
artemiskit run scenarios/hello.yaml
# or use the short alias
akit run scenarios/hello.yaml
```
That's it! ArtemisKit uses OpenAI by default. See below for the full configuration options.
## Configuration
### Config File (Full Reference)
Create `artemis.config.yaml` in your project root. All available options:
```
# artemis.config.yaml - full reference
# =====================================

# Project identifier (used for run storage and reports)
project: my-project

# Default provider used when none is specified in the scenario or on the CLI
# Options: openai, azure-openai, vercel-ai
provider: openai

# Default model to use
# Note: for azure-openai this is display-only - the actual model is
# determined by your Azure deployment, not this value.
# See docs/providers/azure-openai.md for details.
model: gpt-4o

# Directory containing scenario files
scenariosDir: ./scenarios

# Provider-specific configuration
providers:
  openai:
    # API key (can use environment variable reference)
    apiKey: ${OPENAI_API_KEY}
  azure-openai:
    # API key for Azure OpenAI
    apiKey: ${AZURE_OPENAI_API_KEY}
    # Your Azure resource name (the subdomain in your endpoint URL)
    resourceName: ${AZURE_OPENAI_RESOURCE_NAME}
    # The deployment name you created in Azure Portal
    deploymentName: ${AZURE_OPENAI_DEPLOYMENT_NAME}
    # API version (optional, has a sensible default)
    apiVersion: "2024-02-15-preview"
  vercel-ai:
    # Underlying provider for the Vercel AI SDK
    underlyingProvider: openai
    apiKey: ${OPENAI_API_KEY}

# Storage configuration for run history
storage:
  # Storage type: "local" or "supabase"
  type: local
  # Base path for local storage (relative to project root)
  basePath: ./artemis-runs

# Output configuration for reports
output:
  # Output format: "json", "html", or "both"
  format: html
  # Directory for generated reports
  dir: ./artemis-output

# CI-specific settings (optional)
ci:
  # Fail if regression exceeds threshold
  failOnRegression: true
  # Regression threshold (0-1)
  regressionThreshold: 0.05
```
### Minimal Config File
If you only want to set defaults, a minimal config works too:
```
# artemis.config.yaml - minimal config
project: my-project
provider: openai
model: gpt-4o
```
## Scenario Format
### Basic Scenario (Simple Prompts)
```
# scenarios/basic.yaml
name: basic-test
description: Simple prompt-response tests

# Optional: override the provider/model for this scenario
provider: openai
model: gpt-4o

cases:
  - id: greeting
    prompt: "Say hello"
    expected:
      type: contains
      values:
        - "hello"
      mode: any
```
### Full Scenario Reference
All available options for a scenario:
```
# scenarios/full-reference.yaml - complete example
# =================================================

# Required: unique name for this scenario
name: customer-support-eval

# Optional: human-readable description
description: Evaluate customer support bot responses

# Optional: scenario version
version: "1.0"

# Optional: tags for filtering (use the --tags flag)
tags:
  - support
  - production

# Optional: provider override (defaults to the config file, then "openai")
# Options: openai, azure-openai, vercel-ai
provider: openai

# Optional: model override
# Note: for azure-openai this is display-only - the actual model is
# determined by your Azure deployment. See docs/providers/azure-openai.md
model: gpt-4o

# Optional: model parameters
temperature: 0.7
maxTokens: 1024
seed: 42

# Optional: system prompt prepended to all cases
setup:
  systemPrompt: |
    You are a helpful customer support assistant.
    Always be polite and professional.

# Optional: scenario-level variables (available to all cases)
# Case-level variables override these. Use {{var_name}} syntax.
variables:
  company_name: "Acme Corp"
  default_greeting: "Hello"

# Required: test cases to run
cases:
  # ---- Simple prompt/response case ----
  - id: simple-greeting
    name: Simple greeting test
    description: Test basic greeting response
    # The prompt to send to the model
    prompt: "Hello, I need help"
    # Expected result validation
    expected:
      type: contains
      values:
        - "help"
        - "assist"
      mode: any
    # Optional: tags for this case
    tags:
      - basic

  # ---- Case with regex matching ----
  - id: order-number-check
    name: Order number extraction
    prompt: "My order number is #12345"
    expected:
      type: regex
      pattern: "12345"
      flags: "i"

  # ---- Case with exact match ----
  - id: yes-no-response
    name: Binary response test
    prompt: "Reply with only 'Yes' or 'No': Is the sky blue?"
    expected:
      type: exact
      value: "Yes"
      caseSensitive: false

  # ---- Case with fuzzy matching ----
  - id: fuzzy-match-test
    name: Fuzzy similarity test
    prompt: "What color is grass?"
    expected:
      type: fuzzy
      value: "green"
      threshold: 0.8

  # ---- Case with LLM grading ----
  - id: quality-check
    name: Response quality evaluation
    prompt: "Explain quantum computing in simple terms"
    expected:
      type: llm_grader
      rubric: |
        Score 1.0 if the explanation is clear and accurate.
        Score 0.5 if partially correct but confusing.
        Score 0.0 if incorrect or overly technical.
      threshold: 0.7

  # ---- Case with JSON schema validation ----
  - id: json-output-test
    name: Structured output test
    prompt: "Return a JSON object with name and age fields"
    expected:
      type: json_schema
      schema:
        type: object
        properties:
          name:
            type: string
          age:
            type: number
        required:
          - name
          - age

  # ---- Multi-turn conversation ----
  - id: multi-turn-support
    name: Multi-turn conversation
    # Use an array of messages for multi-turn
    prompt:
      - role: user
        content: "I have a problem with my order"
      - role: assistant
        content: "I'd be happy to help. What's your order number?"
      - role: user
        content: "Order number is #99999"
    expected:
      type: contains
      values:
        - "99999"
      mode: any

  # ---- Case with variables ----
  - id: dynamic-content
    name: Variable substitution test
    # Case-level variables override scenario-level
    variables:
      product_name: "Widget Pro"
      order_id: "ORD-789"
    prompt: "What's the status of my {{product_name}} order {{order_id}}?"
    expected:
      type: contains
      values:
        - "ORD-789"
      mode: any

  # ---- Case with timeout and retries ----
  - id: slow-response-test
    name: Timeout handling test
    prompt: "Generate a detailed report"
    expected:
      type: contains
      values:
        - "report"
      mode: any
    timeout: 30000
    retries: 2
```
### Variables
Variables let you build dynamic, reusable scenarios. Use the `{{variable_name}}` syntax in prompts.
```
name: customer-support
description: Test with dynamic content

# Scenario-level variables - available to all cases
variables:
  company_name: "Acme Corp"
  support_email: "support@acme.com"

cases:
  # Uses scenario-level variables
  - id: contact-info
    prompt: "What is the email for {{company_name}}?"
    expected:
      type: contains
      values:
        - "support@acme.com"
      mode: any

  # Case-level variables override scenario-level
  - id: different-company
    variables:
      company_name: "TechCorp" # Overrides "Acme Corp"
      product: "Widget"
    prompt: "Tell me about {{product}} from {{company_name}}"
    expected:
      type: contains
      values:
        - "TechCorp"
      mode: any
```
Variable precedence: **case level > scenario level**
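This merge-then-substitute behavior can be sketched as follows. This is an illustrative model of the documented semantics only; `renderPrompt` and its signature are hypothetical, not ArtemisKit's API:

```typescript
// Hypothetical sketch of {{variable}} substitution with case-level override.
// Not ArtemisKit's actual implementation - for illustration only.
type Vars = Record<string, string>;

function renderPrompt(template: string, scenarioVars: Vars, caseVars: Vars): string {
  // Case-level variables take precedence over scenario-level ones
  const merged: Vars = { ...scenarioVars, ...caseVars };
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, name: string) =>
    name in merged ? merged[name] : match // unknown variables are left as-is
  );
}

const prompt = renderPrompt(
  "Tell me about {{product}} from {{company_name}}",
  { company_name: "Acme Corp", product: "Widget" }, // scenario level
  { company_name: "TechCorp" },                     // case level wins
);
console.log(prompt); // -> Tell me about Widget from TechCorp
```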
### Expectation Types
| Type | Description | Key Fields |
|------|-------------|------------|
| `contains` | Response contains the given string(s) | `values: [...]`, `mode: all\|any` |
| `exact` | Response exactly equals the value | `value: "..."`, `caseSensitive: bool` |
| `regex` | Response matches a regex pattern | `pattern: "..."`, `flags: "i"` |
| `fuzzy` | Fuzzy string similarity | `value: "..."`, `threshold: 0.8` |
| `llm_grader` | LLM-based evaluation | `rubric: "..."`, `threshold: 0.7` |
| `json_schema` | Validates JSON structure | `schema: {...}` |
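To make the `mode` and `caseSensitive` fields concrete, here is a sketch of how the `contains` and `exact` types behave. This models the documented semantics (assuming case-insensitive substring matching for `contains`); it is not ArtemisKit's actual source:

```typescript
// Illustrative check functions for two expectation types.
// Hypothetical model of the documented behavior; the real implementation may differ.
interface ContainsExpected { type: "contains"; values: string[]; mode: "all" | "any" }
interface ExactExpected { type: "exact"; value: string; caseSensitive?: boolean }

// contains: pass if all/any of the values appear in the response
// (substring matching assumed case-insensitive here)
function checkContains(response: string, exp: ContainsExpected): boolean {
  const hits = exp.values.map((v) => response.toLowerCase().includes(v.toLowerCase()));
  return exp.mode === "all" ? hits.every(Boolean) : hits.some(Boolean);
}

// exact: pass if the response equals the value, optionally ignoring case
function checkExact(response: string, exp: ExactExpected): boolean {
  return exp.caseSensitive
    ? response === exp.value
    : response.toLowerCase() === exp.value.toLowerCase();
}
```

With `mode: any`, a response like "Happy to help you" passes `values: ["help", "assist"]`; with `mode: all` it fails, because "assist" never appears.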
## CLI Commands
| Command | Description |
|---------|-------------|
| `artemiskit run <scenarios...>` | Run scenario-based evaluations |
| `artemiskit validate <scenarios...>` | Validate scenarios without running them |
| `artemiskit redteam <scenario>` | Run security red-team tests |
| `artemiskit stress <scenario>` | Run load/stress tests |
| `artemiskit report <run-id>` | Regenerate a report from a saved run |
| `artemiskit history` | View run history |
| `artemiskit compare <run-a> <run-b>` | Compare two runs |
| `artemiskit baseline` | Manage baselines for regression detection |
| `artemiskit init` | Initialize configuration |
Use `akit` as a shorter alias for `artemiskit`.
### Run Command Options
```
artemiskit run <scenarios...> [options]

Options:
  -p, --provider <provider>   Provider: openai, azure-openai, vercel-ai
  -m, --model <model>         Model to use
  -o, --output <dir>          Output directory for results
  -v, --verbose               Verbose output
  -t, --tags <tags>           Filter test cases by tags
  -c, --concurrency <n>       Number of concurrent test cases (default: 1)
  --parallel <n>              Number of scenarios to run in parallel
  --timeout <ms>              Timeout per test case in milliseconds
  --retries <n>               Number of retries per test case
  --config <path>             Path to config file
  --save                      Save results to storage (default: true)
  --ci                        CI mode: machine-readable output
  --baseline                  Compare against baseline for regression
  --budget <usd>              Maximum budget in USD
  --export <format>           Export format: markdown or junit
```
### Validate Command Options
```
artemiskit validate <scenarios...> [options]

Options:
  --json            Output results as JSON
  --strict          Treat warnings as errors
  -q, --quiet       Only output errors
  --export junit    Export to JUnit XML for CI
```
### CI/CD Integration
ArtemisKit supports CI/CD pipelines with machine-readable output and JUnit export:
```
# Machine-readable output for CI
akit run scenarios/ --ci

# Export JUnit XML for CI platforms
akit run scenarios/ --export junit --export-output ./test-results

# Validate scenarios before running them
akit validate scenarios/ --strict --export junit
```
Example GitHub Actions workflow:
```
- name: Validate scenarios
  run: akit validate scenarios/ --strict

- name: Run tests
  run: akit run scenarios/ --export junit --export-output ./test-results

- name: Publish Test Results
  uses: EnricoMi/publish-unit-test-result-action@v2
  if: always()
  with:
    files: test-results/*.xml
```
## Providers
ArtemisKit supports multiple LLM providers. See the [provider docs](docs/providers/) for detailed setup guides.
| Provider | Use Case | Docs |
|----------|----------|------|
| `openai` | Direct OpenAI API | [docs/providers/openai.md](docs/providers/openai.md) |
| `azure-openai` | Azure OpenAI Service | [docs/providers/azure-openai.md](docs/providers/azure-openai.md) |
| `vercel-ai` | 20+ providers via the Vercel AI SDK | [docs/providers/vercel-ai.md](docs/providers/vercel-ai.md) |
### Quick Setup
**OpenAI:**
```
export OPENAI_API_KEY="sk-..."
akit run scenario.yaml --provider openai --model gpt-4o
```
**Azure OpenAI:**
```
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_RESOURCE_NAME="my-resource"
export AZURE_OPENAI_DEPLOYMENT_NAME="gpt-4o-deployment"
akit run scenario.yaml --provider azure-openai --model gpt-4o
# Note: --model is display-only; the actual model is your deployment
```
**Vercel AI (any provider):**
```
export ANTHROPIC_API_KEY="sk-ant-..."
akit run scenario.yaml --provider vercel-ai --model anthropic:claude-3-5-sonnet-20241022
```
## Security Testing (Red Teaming)
Test your LLM for vulnerabilities:
```
akit redteam scenarios/my-bot.yaml --mutations typo,role-spoof,cot-injection
```
### Available Mutations
| Mutation | Description |
|----------|-------------|
| `typo` | Introduce typos to bypass filters |
| `role-spoof` | Attempt role/identity spoofing |
| `instruction-flip` | Invert or negate instructions |
| `cot-injection` | Chain-of-thought injection attacks |
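To illustrate the idea behind the `typo` mutation, here is a hypothetical sketch (`typoMutate` is not the actual strategy from `@artemiskit/redteam`): swapping adjacent characters inside longer words defeats exact-match keyword filters while the model usually still understands the prompt.

```typescript
// Hypothetical "typo" mutation sketch: swap the two characters around the
// midpoint of every word with 4+ characters. Illustrative only.
function typoMutate(prompt: string): string {
  return prompt
    .split(" ")
    .map((word) => {
      if (word.length < 4) return word; // leave short words intact
      const i = Math.floor(word.length / 2);
      // swap word[i] and word[i + 1]
      return word.slice(0, i) + word[i + 1] + word[i] + word.slice(i + 2);
    })
    .join(" ");
}

console.log(typoMutate("ignore previous instructions"));
// -> ignroe prevoius instrutcions
```

A real strategy would randomize the swap position and preserve the original prompt alongside each mutant so the report can attribute failures.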
## Packages
ArtemisKit is a monorepo containing the following packages:
| Package | Description |
|---------|-------------|
| `@artemiskit/cli` | Command-line interface |
| `@artemiskit/core` | Core runner, types, and storage (internal) |
| `@artemiskit/sdk` | Programmatic SDK for TypeScript/JavaScript (coming soon) |
| `@artemiskit/reports` | HTML and JSON report generation |
| `@artemiskit/redteam` | Red-team mutation strategies |
| `@artemiskit/adapter-openai` | OpenAI/Azure provider adapter |
| `@artemiskit/adapter-vercel-ai` | Vercel AI SDK adapter |
| `@artemiskit/adapter-anthropic` | Anthropic provider adapter |
## Development
```
# Clone the repository
git clone https://github.com/artemiskit/artemiskit.git
cd artemiskit

# Install dependencies
bun install

# Build all packages
bun run build

# Run tests
bun test

# Type-check
bun run typecheck

# Lint
bun run lint
```
## Roadmap
See [ROADMAP.md](ROADMAP.md) for the full development roadmap.
## Contributing
Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) before submitting a pull request.
## License
Apache-2.0 - see [LICENSE](LICENSE) for details.