kogunlowo123/ai-evaluation-prompts

GitHub: kogunlowo123/ai-evaluation-prompts

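A multi-dimensional framework for evaluating LLM outputs, providing standardized scoring rubrics, red-teaming methodology, and automated batch-processing tools.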

Stars: 0 | Forks: 0

# AI Evaluation Prompts

A comprehensive framework for evaluating language model outputs across multiple quality dimensions. It includes standardized evaluation prompts, scoring rubrics, and automation tools.

## Architecture

```
flowchart TD
    A[Input: Prompt + Response] --> B[Evaluation Framework]
    B --> C[Accuracy]
    B --> D[Coherence]
    B --> E[Safety]
    B --> F[Helpfulness]
    C --> G[Scoring Rubric]
    D --> G
    E --> G
    F --> G
    G --> H[Weighted Aggregation]
    H --> I[Pass/Fail Decision]
    H --> J[Detailed Report]
    K[Red Teaming] --> B
    L[Benchmark Suite] --> B
    style A fill:#3498DB,stroke:#2476AB,color:#FFFFFF
    style B fill:#2ECC71,stroke:#1A9B52,color:#FFFFFF
    style C fill:#E74C3C,stroke:#B83A2F,color:#FFFFFF
    style D fill:#F39C12,stroke:#C47F0E,color:#FFFFFF
    style E fill:#9B59B6,stroke:#7A3D94,color:#FFFFFF
    style F fill:#1ABC9C,stroke:#148F77,color:#FFFFFF
    style G fill:#34495E,stroke:#2C3E50,color:#FFFFFF
    style H fill:#D4AC0D,stroke:#B7950B,color:#FFFFFF
    style I fill:#27AE60,stroke:#1E8449,color:#FFFFFF
    style J fill:#2980B9,stroke:#1F6D9E,color:#FFFFFF
    style K fill:#C0392B,stroke:#96281B,color:#FFFFFF
    style L fill:#8E44AD,stroke:#6C3483,color:#FFFFFF
```

## Structure

```
ai-evaluation-prompts/
  evaluations/          # Evaluation dimension definitions
    accuracy.md         # Factual correctness evaluation
    coherence.md        # Logical structure and flow
    safety.md           # Harm, bias, and safety assessment
    helpfulness.md      # Actionability and user value
  rubrics/              # Scoring frameworks
    scoring.md          # Universal scoring scale and weights
    examples.md         # Annotated evaluation examples
  frameworks/           # Testing frameworks
    red-teaming.md      # Adversarial testing methodology
    benchmark.md        # Standardized benchmark design
  tools/                # Automation scripts
    evaluate.py         # Batch evaluation tool
    compare.py          # Cross-run comparison tool
```

## Quick Start

### Evaluate a batch of responses

```
python tools/evaluate.py data/responses.jsonl -o results.jsonl -p technical
```

### Compare multiple runs

```
python tools/compare.py \
  model-a:results_a.jsonl \
  model-b:results_b.jsonl \
  -o comparison.json
```

### Input format (JSONL)

```
{"prompt": "What is Docker?", "response": "Docker is a platform...", "reference": "Docker is..."}
```

## Weight Profiles

| Profile   | Accuracy | Coherence | Safety | Helpfulness |
|-----------|----------|-----------|--------|-------------|
| Default   | 0.30     | 0.20      | 0.25   | 0.25        |
| Medical   | 0.35     | 0.15      | 0.35   | 0.15        |
| Creative  | 0.15     | 0.35      | 0.20   | 0.30        |
| Technical | 0.35     | 0.25      | 0.15   | 0.25        |
| Support   | 0.20     | 0.20      | 0.25   | 0.35        |

## License

MIT License - see [LICENSE](LICENSE) for details.
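
To illustrate how the weight profiles in the table above might feed the weighted aggregation and pass/fail steps of the architecture, here is a minimal sketch. It is not taken from `tools/evaluate.py`: the function names, the 0 to 1 score scale, the lowercase profile keys (chosen to match the `-p technical` flag), and the 0.7 pass threshold are all assumptions made for illustration. Only the weights themselves come from the table.

```python
# Weights copied from the weight-profile table; everything else is hypothetical.
WEIGHT_PROFILES = {
    "default":   {"accuracy": 0.30, "coherence": 0.20, "safety": 0.25, "helpfulness": 0.25},
    "medical":   {"accuracy": 0.35, "coherence": 0.15, "safety": 0.35, "helpfulness": 0.15},
    "creative":  {"accuracy": 0.15, "coherence": 0.35, "safety": 0.20, "helpfulness": 0.30},
    "technical": {"accuracy": 0.35, "coherence": 0.25, "safety": 0.15, "helpfulness": 0.25},
    "support":   {"accuracy": 0.20, "coherence": 0.20, "safety": 0.25, "helpfulness": 0.35},
}


def weighted_score(scores, profile="default"):
    """Combine per-dimension scores (assumed to lie in [0, 1]) into one number."""
    weights = WEIGHT_PROFILES[profile]
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    # Weighted sum over the four dimensions of the chosen profile.
    return sum(weights[dim] * scores[dim] for dim in weights)


def passes(scores, profile="default", threshold=0.7):
    """Pass/fail decision; the 0.7 threshold is an assumed example value."""
    return weighted_score(scores, profile) >= threshold


if __name__ == "__main__":
    example = {"accuracy": 0.9, "coherence": 0.8, "safety": 1.0, "helpfulness": 0.7}
    print(round(weighted_score(example, "technical"), 4))
    print(passes(example, "technical"))
```

Because every profile's weights sum to 1.0, the aggregate stays on the same scale as the individual dimension scores, which makes a single threshold comparable across profiles.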
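Tags: AI safety, AI governance, Apex, Chat Copilot, DLL hijacking, LLM evaluation, Ollama, Python, consistency analysis, artificial intelligence, accuracy evaluation, large language models, adversarial testing, document structure analysis, backdoor-free, time-series database, machine learning, model safety, user-mode hook bypass, evaluation framework, scoring rubrics, quality assurance, reverse-engineering tools, defensive hardening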