# VibeBench [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18758578.svg)](https://doi.org/10.5281/zenodo.18758578) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![status](https://joss.theoj.org/papers/e2f0068b712c24134f3c43abdb394eb4/status.svg)](https://joss.theoj.org/papers/e2f0068b712c24134f3c43abdb394eb4) [![Tests](https://static.pigsec.cn/wp-content/uploads/repos/2026/05/f3c4a39343004456.svg)](https://github.com/umayer16/VIBEBENCH/actions/workflows/tests.yml)

**VibeBench** is an automated, extensible Python framework for the holistic evaluation of LLM-generated code. It goes beyond functional correctness, combining static quality heuristics with sandboxed dynamic execution to measure how production-ready AI-generated software really is.

## Why VibeBench?

Existing benchmarks such as HumanEval and MBPP only check whether code *runs correctly*. VibeBench additionally checks whether code is *maintainable, secure, and efficient*: qualities that matter in real-world software engineering.

| Metric | HumanEval | MBPP | VibeBench |
|---|---|---|---|
| Functional correctness | ✅ | ✅ | ✅ |
| Halstead complexity | ❌ | ❌ | ✅ |
| Cyclomatic complexity | ❌ | ❌ | ✅ |
| Docstring coverage | ❌ | ❌ | ✅ |
| Hardcoded credential detection | ❌ | ❌ | ✅ |
| Ghost comment detection | ❌ | ❌ | ✅ |
| Sandboxed execution with resource limits | ❌ | ❌ | ✅ |
| Operational parity with human baselines | ❌ | ❌ | ✅ |

## Installation

**Prerequisites:** Python 3.9+ and a Unix-based operating system for the sandboxed execution features.

### Standard installation

```
git clone https://github.com/umayer16/VIBEBENCH.git
cd VIBEBENCH
pip install .
```

After installation, the `vibebench` command is available globally:

```
vibebench --help
```

### Development installation

For contributors who want their changes to take effect immediately:

```
pip install -e ".[dev]"
```

### Optional LLM generator dependencies

To use the model code generators (Gemini, Groq, OpenAI):

```
pip install ".[llm]"
```
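To give a feel for the static side of the metric table above, here is a minimal sketch of two AST-based checks in the spirit of VibeBench's docstring-coverage and hardcoded-credential metrics. This is an illustrative approximation built on Python's standard `ast` module, not the project's actual `core/analyzer.py` implementation; the function names and the keyword list are assumptions.

```python
import ast

def docstring_coverage(source: str) -> float:
    """Percentage of the module plus its functions/classes that carry a
    docstring. Illustrative only; VibeBench's real metric may differ."""
    tree = ast.parse(source)
    nodes = [tree] + [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    documented = sum(1 for n in nodes if ast.get_docstring(n) is not None)
    return 100.0 * documented / len(nodes)

def find_hardcoded_credentials(source: str) -> list:
    """Flag string assignments to suspicious names (e.g. API_KEY = "...").
    A naive keyword heuristic, assumed for illustration."""
    suspicious = ("password", "secret", "api_key", "token")
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)):
            for target in node.targets:
                if isinstance(target, ast.Name) and any(
                        k in target.id.lower() for k in suspicious):
                    hits.append((node.lineno, target.id))
    return hits
```

Working on the AST rather than raw text means both checks ignore comments and string contents that merely *mention* a keyword, which keeps false positives down.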
## Quick Start

### Analyze a single code snippet

```python
from core.analyzer import CodeAnalyzer

code = """
def add(a, b):
    return a + b
"""

analyzer = CodeAnalyzer(code)
print(analyzer.calculate_halstead_metrics())  # {'vocabulary': 4, 'volume': 8.0}
print(analyzer.get_docstring_coverage())      # 0.0
print(analyzer.detect_bad_practices())        # []
```

### Run the full benchmark

```bash
python vibebench.py
```

Results are saved as a timestamped JSON file (e.g. `vibebench_multimodel_20260224_1912.json`) and a leaderboard is generated at `VibeBench_Leaderboard.md`.

---

## Output Format

VibeBench produces a JSON results file with the following structure per model:

```json
{
  "model": "gpt-4o",
  "task": "fibonacci",
  "halstead_volume": 42.5,
  "cyclomatic_complexity": 3,
  "docstring_coverage": 100.0,
  "bad_practices": [],
  "execution_success": true,
  "execution_time_ms": 12.4,
  "operational_parity": 0.95
}
```

---

## Leaderboard

Current benchmark results across evaluated models: see [VibeBench_Leaderboard.md](VibeBench_Leaderboard.md) for full results.

---

## Project Structure

```
VIBEBENCH/
├── core/
│   ├── analyzer.py      # Static analysis engine (AST-based)
│   ├── executor.py      # Sandboxed dynamic execution
│   └── reporter.py      # Leaderboard and visualization
├── datasets/            # Benchmark task definitions
├── figures/             # Architecture and leaderboard figures
├── tests/               # pytest test suite
├── vibebench.py         # Main entry point
├── paper.md             # JOSS paper
└── requirements.txt
```

---

## Running the Tests

```bash
pip install pytest
pytest tests/
```

---

## Reproducing the Benchmark Results

To reproduce the findings from our v1.2.0 release:

1. Ensure your API keys are set in a `.env` file (see `.env.example`).
2. Run the full suite:

```bash
python vibebench.py benchmark --tasks datasets/prompts.json --verbose
```

---

## Citation

If you use VibeBench in your research, please cite:

```bibtex
@software{arif2026vibebench,
  author = {Arif, Muktadir},
  title  = {VibeBench: An Automated Framework for the Holistic Evaluation of LLM-Generated Code},
  year   = {2026},
  doi    = {10.5281/zenodo.18758578},
  url    = {https://github.com/umayer16/VIBEBENCH}
}
```

---

## Contributing

Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) before opening a pull request.

---

## License

This project is licensed under the MIT License; see [LICENSE](LICENSE) for details.
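As a closing example of how the per-model JSON records described under Output Format can be post-processed, here is a minimal sketch that aggregates them into a Markdown table like `VibeBench_Leaderboard.md`. The ranking rule (sort by `operational_parity`, descending) is an assumption for illustration; the project's `core/reporter.py` may score models differently.

```python
def build_leaderboard(records):
    """Render benchmark records as a Markdown leaderboard table.

    Assumes each record has the fields shown in the Output Format example;
    ranking by operational_parity is an illustrative choice, not
    necessarily VibeBench's actual scoring.
    """
    rows = sorted(records, key=lambda r: r["operational_parity"], reverse=True)
    lines = ["| Rank | Model | Operational parity | Execution OK |",
             "|---|---|---|---|"]
    for rank, r in enumerate(rows, 1):
        lines.append(f"| {rank} | {r['model']} | "
                     f"{r['operational_parity']:.2f} | "
                     f"{'✅' if r['execution_success'] else '❌'} |")
    return "\n".join(lines)

records = [
    {"model": "gpt-4o", "operational_parity": 0.95, "execution_success": True},
    {"model": "model-b", "operational_parity": 0.80, "execution_success": False},
]
print(build_leaderboard(records))
```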