GitHub: kogunlowo123/ai-evaluation-prompts
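A cross-dimensional framework for evaluating LLM outputs, providing standardized scoring rubrics, a red-teaming methodology, and automated batch-processing tools.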
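# AI Evaluation Prompts
A comprehensive framework for evaluating language model outputs across multiple quality dimensions. It includes standardized evaluation prompts, scoring rubrics, and automation tools.
## Architecture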
```mermaid
flowchart TD
A[Input: Prompt + Response] --> B[Evaluation Framework]
B --> C[Accuracy]
B --> D[Coherence]
B --> E[Safety]
B --> F[Helpfulness]
C --> G[Scoring Rubric]
D --> G
E --> G
F --> G
G --> H[Weighted Aggregation]
H --> I[Pass/Fail Decision]
H --> J[Detailed Report]
K[Red Teaming] --> B
L[Benchmark Suite] --> B
style A fill:#3498DB,stroke:#2476AB,color:#FFFFFF
style B fill:#2ECC71,stroke:#1A9B52,color:#FFFFFF
style C fill:#E74C3C,stroke:#B83A2F,color:#FFFFFF
style D fill:#F39C12,stroke:#C47F0E,color:#FFFFFF
style E fill:#9B59B6,stroke:#7A3D94,color:#FFFFFF
style F fill:#1ABC9C,stroke:#148F77,color:#FFFFFF
style G fill:#34495E,stroke:#2C3E50,color:#FFFFFF
style H fill:#D4AC0D,stroke:#B7950B,color:#FFFFFF
style I fill:#27AE60,stroke:#1E8449,color:#FFFFFF
style J fill:#2980B9,stroke:#1F6D9E,color:#FFFFFF
style K fill:#C0392B,stroke:#96281B,color:#FFFFFF
style L fill:#8E44AD,stroke:#6C3483,color:#FFFFFF
```
## Structure
```
ai-evaluation-prompts/
  evaluations/          # Evaluation dimension definitions
    accuracy.md         # Factual correctness evaluation
    coherence.md        # Logical structure and flow
    safety.md           # Harm, bias, and safety assessment
    helpfulness.md      # Actionability and user value
  rubrics/              # Scoring frameworks
    scoring.md          # Universal scoring scale and weights
    examples.md         # Annotated evaluation examples
  frameworks/           # Testing frameworks
    red-teaming.md      # Adversarial testing methodology
    benchmark.md        # Standardized benchmark design
  tools/                # Automation scripts
    evaluate.py         # Batch evaluation tool
    compare.py          # Cross-run comparison tool
```
## Quick Start
### Evaluate a batch of responses
```
python tools/evaluate.py data/responses.jsonl -o results.jsonl -p technical
```
### Compare multiple runs
```
python tools/compare.py \
model-a:results_a.jsonl \
model-b:results_b.jsonl \
-o comparison.json
```
### Input format (JSONL)
```
{"prompt": "What is Docker?", "response": "Docker is a platform...", "reference": "Docker is..."}
```
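Before running a batch, it can help to check that every line of the input file parses and carries the expected keys. The following is a minimal sketch, not part of `tools/evaluate.py`: `load_records` is a hypothetical helper, and treating `prompt` and `response` as required while `reference` is optional is an assumption, since the README does not say which fields are mandatory.

```python
import json

# Assumption: "reference" is optional; the README does not specify this.
REQUIRED_FIELDS = ("prompt", "response")


def load_records(lines):
    """Parse JSONL lines into dicts, skipping blank lines.

    Raises ValueError naming the 1-based line number of a bad record.
    """
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        records.append(record)
    return records
```

Any iterable of strings works as input, so an open file handle can be passed directly, for example `load_records(open("data/responses.jsonl", encoding="utf-8"))`.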
## Weight Profiles
| Profile | Accuracy | Coherence | Safety | Helpfulness |
|---------|----------|-----------|--------|-------------|
| Default | 0.30 | 0.20 | 0.25 | 0.25 |
| Medical | 0.35 | 0.15 | 0.35 | 0.15 |
| Creative | 0.15 | 0.35 | 0.20 | 0.30 |
| Technical | 0.35 | 0.25 | 0.15 | 0.25 |
| Support | 0.20 | 0.20 | 0.25 | 0.35 |
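
The weights in each profile sum to 1.0, so the aggregate is a weighted mean of the four dimension scores. The sketch below illustrates that arithmetic using the table's values. It is not the code in `tools/evaluate.py`: the function names, the 1 to 5 score scale, and the pass threshold of 3.5 are all assumptions made for illustration (the actual scale is defined in `rubrics/scoring.md`).

```python
# Weights copied from the table above; keys are lowercase to match `-p technical`.
WEIGHT_PROFILES = {
    "default":   {"accuracy": 0.30, "coherence": 0.20, "safety": 0.25, "helpfulness": 0.25},
    "medical":   {"accuracy": 0.35, "coherence": 0.15, "safety": 0.35, "helpfulness": 0.15},
    "creative":  {"accuracy": 0.15, "coherence": 0.35, "safety": 0.20, "helpfulness": 0.30},
    "technical": {"accuracy": 0.35, "coherence": 0.25, "safety": 0.15, "helpfulness": 0.25},
    "support":   {"accuracy": 0.20, "coherence": 0.20, "safety": 0.25, "helpfulness": 0.35},
}


def aggregate(scores, profile="default"):
    """Return the weighted mean of per-dimension scores for a profile."""
    weights = WEIGHT_PROFILES[profile]
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return sum(weights[dim] * scores[dim] for dim in weights)


def passes(scores, profile="default", threshold=3.5):
    """Hypothetical pass/fail rule: aggregate score at or above threshold."""
    return aggregate(scores, profile) >= threshold
```

For example, with assumed scores of accuracy 4, coherence 5, safety 3, and helpfulness 4, the aggregate is about 4.10 under `technical` and about 3.80 under `medical`, because the medical profile places more weight on the weaker safety score.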
## License
MIT License - see [LICENSE](LICENSE) for details.