# VibeBench
[DOI](https://doi.org/10.5281/zenodo.18758578)
[License: MIT](https://opensource.org/licenses/MIT)
[JOSS](https://joss.theoj.org/papers/e2f0068b712c24134f3c43abdb394eb4)
[Tests](https://github.com/umayer16/VIBEBENCH/actions/workflows/tests.yml)
**VibeBench** is an automated, extensible Python framework for the holistic evaluation of LLM-generated code. It goes beyond functional correctness, combining static quality heuristics with sandboxed dynamic execution to measure the true production readiness of AI-generated software.
## Why VibeBench?
Existing benchmarks such as HumanEval and MBPP only check whether code *runs correctly*.
VibeBench additionally checks whether code is *maintainable, secure, and efficient*,
qualities that matter in real-world software engineering.
| Metric | HumanEval | MBPP | VibeBench |
|---|---|---|---|
| Functional correctness | ✅ | ✅ | ✅ |
| Halstead complexity | ❌ | ❌ | ✅ |
| Cyclomatic complexity | ❌ | ❌ | ✅ |
| Docstring coverage | ❌ | ❌ | ✅ |
| Hardcoded credential detection | ❌ | ❌ | ✅ |
| Ghost comment detection | ❌ | ❌ | ✅ |
| Sandboxed execution with resource limits | ❌ | ❌ | ✅ |
| Operational parity with human baselines | ❌ | ❌ | ✅ |
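Docstring coverage, one of the static checks listed above, can be approximated with Python's standard `ast` module. The sketch below is illustrative only and is not VibeBench's actual implementation:

```python
import ast

def docstring_coverage(source: str) -> float:
    """Fraction of documentable nodes (module, functions, classes) with a docstring."""
    tree = ast.parse(source)
    nodes = [tree] + [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    documented = sum(1 for n in nodes if ast.get_docstring(n) is not None)
    return documented / len(nodes)

code = 'def add(a, b):\n    """Return the sum."""\n    return a + b\n'
print(docstring_coverage(code))  # the function is documented, the module is not -> 0.5
```

Cyclomatic and Halstead metrics can be computed over the same AST walk, which is why an AST-based analyzer covers all of the static rows in one pass.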
## Installation
**Prerequisites:** Python 3.9+ and a Unix-based operating system
(required for sandboxed execution).
### 标准安装
```bash
git clone https://github.com/umayer16/VIBEBENCH.git
cd VIBEBENCH
pip install .
```
Once installed, the `vibebench` command is available globally:
```bash
vibebench --help
```
### Development Install
For contributors who want their changes to take effect immediately:
```bash
pip install -e ".[dev]"
```
### Optional LLM Generator Dependencies
To use the model code generators (Gemini, Groq, OpenAI):
```bash
pip install ".[llm]"
```
## Quick Start
### Analyze a Single Code Snippet
```python
from core.analyzer import CodeAnalyzer
code = """
def add(a, b):
    return a + b
"""
analyzer = CodeAnalyzer(code)
print(analyzer.calculate_halstead_metrics())
# {'vocabulary': 4, 'volume': 8.0}
print(analyzer.get_docstring_coverage())
# 0.0
print(analyzer.detect_bad_practices())
# []
```
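To illustrate the kind of issue the bad-practice check is meant to surface, here is a minimal, self-contained scan for hardcoded credentials. The pattern list and return format are assumptions for illustration; the framework's real rules may differ:

```python
import re

# Illustrative patterns only; VibeBench's actual detector may use different rules.
CREDENTIAL_PATTERN = re.compile(
    r'(?i)\b(password|passwd|secret|api_key|token)\s*=\s*["\'][^"\']+["\']'
)

def find_hardcoded_credentials(source: str) -> list[str]:
    """Return the lines that appear to hardcode a secret."""
    return [
        line.strip()
        for line in source.splitlines()
        if CREDENTIAL_PATTERN.search(line)
    ]

snippet = 'api_key = "sk-123456"\ntimeout = 30\n'
print(find_hardcoded_credentials(snippet))  # ['api_key = "sk-123456"']
```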
### Run the Full Benchmark
```bash
python vibebench.py
```
Results are saved as a timestamped JSON file (e.g. `vibebench_multimodel_20260224_1912.json`)
and a leaderboard is generated at `VibeBench_Leaderboard.md`.
---
## Output Format
VibeBench produces a JSON results file with the following structure per model:
```json
{
"model": "gpt-4o",
"task": "fibonacci",
"halstead_volume": 42.5,
"cyclomatic_complexity": 3,
"docstring_coverage": 100.0,
"bad_practices": [],
"execution_success": true,
"execution_time_ms": 12.4,
"operational_parity": 0.95
}
```
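A results file in this shape can be post-processed with the standard library alone. The sketch below ranks two made-up records by `operational_parity`; the field names come from the schema above, while the data and the ranking rule are illustrative assumptions:

```python
import json

# Two records following the per-model schema above (values are made up).
results = json.loads("""
[
  {"model": "gpt-4o", "task": "fibonacci", "execution_success": true,
   "docstring_coverage": 100.0, "operational_parity": 0.95},
  {"model": "llama-3", "task": "fibonacci", "execution_success": false,
   "docstring_coverage": 40.0, "operational_parity": 0.60}
]
""")

# Rank by operational parity, breaking ties by execution success.
ranked = sorted(
    results,
    key=lambda r: (r["operational_parity"], r["execution_success"]),
    reverse=True,
)
for row in ranked:
    print(f'{row["model"]}: parity={row["operational_parity"]}')
```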
---
## Leaderboard
Current benchmark results across evaluated models:
See [VibeBench_Leaderboard.md](VibeBench_Leaderboard.md) for full results.
---
## Project Structure
```
VIBEBENCH/
├── core/
│ ├── analyzer.py # Static analysis engine (AST-based)
│ ├── executor.py # Sandboxed dynamic execution
│ └── reporter.py # Leaderboard and visualization
├── datasets/ # Benchmark task definitions
├── figures/ # Architecture and leaderboard figures
├── tests/ # pytest test suite
├── vibebench.py # Main entry point
├── paper.md # JOSS paper
└── requirements.txt
```
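The idea behind `core/executor.py` (sandboxed execution with resource limits) can be sketched with the standard library on a Unix system. This is a simplified illustration, not the project's actual executor:

```python
import resource
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, cpu_seconds: int = 2, mem_bytes: int = 2**30):
    """Run untrusted code in a child process under CPU/memory rlimits (Unix only).

    A minimal sketch of the idea behind core/executor.py, not its actual code.
    """
    def limit_resources():
        # Applied in the child between fork and exec.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    with tempfile.NamedTemporaryFile("w", suffix=".py") as f:
        f.write(code)
        f.flush()
        proc = subprocess.run(
            [sys.executable, f.name],
            capture_output=True, text=True,
            preexec_fn=limit_resources, timeout=cpu_seconds + 3,
        )
    return proc.returncode, proc.stdout

rc, out = run_sandboxed('print("hello from the sandbox")')
print(rc, out.strip())
```

Running in a separate process means an infinite loop or memory bomb in generated code kills only the child, which is why the prerequisites above call for a Unix-based OS.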
---
## Running Tests
```bash
pip install pytest
pytest tests/
```
---
## Reproducing Benchmark Results
To reproduce the findings from our v1.2.0 release:
1. Ensure your API keys are set in a `.env` file (see `.env.example`).
2. Run the full suite:
```bash
python vibebench.py benchmark --tasks datasets/prompts.json --verbose
```
## Citation
If you use VibeBench in your research, please cite:
```bibtex
@software{arif2026vibebench,
author = {Arif, Muktadir},
title = {VibeBench: An Automated Framework for the Holistic Evaluation of LLM-Generated Code},
year = {2026},
doi = {10.5281/zenodo.18758578},
url = {https://github.com/umayer16/VIBEBENCH}
}
```
---
## Contributing
Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) before
opening a pull request.
---
## License
This project is licensed under the MIT License — see [LICENSE](LICENSE) for details.