dongtingshuo/researchflow-agent

GitHub: dongtingshuo/researchflow-agent

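An AI Agent system for research workflows built around paper reading, code analysis, and experiment reproduction. It provides paper RAG question answering, code repository analysis, experiment plan generation, and evidence verification in one place.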

Stars: 0 | Forks: 0

# ResearchFlow-Agent

**A multi-tool AI Agent system for research paper reading, repository analysis, experiment reproduction planning, and evidence-aware reporting**

ResearchFlow-Agent is a professional AI Agent system for research paper reading, code repository analysis, experiment reproduction planning, and evidence verification. It supports PDF paper ingestion, paper-grounded RAG, GitHub repository analysis, experiment planning, and Markdown technical report generation, and its Verifier labels evidence sources and uncertainty.

![ResearchFlow-Agent Walkthrough](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/7bff10de90220829.gif)

The GIF above shows the four core views: paper QA, code analysis, full Agent workflow, and experiment evaluation.

## Project Positioning

ResearchFlow-Agent is not a generic chatbot. It is a research workflow agent centered on "paper + code + reproducible experiment". It emphasizes inspectable steps, traceable evidence, saved artifacts, and human-reviewable evaluation records.

Use cases:

- Research paper reading and method understanding
- Open-source research repository analysis
- Experiment reproduction planning
- Markdown project report generation
- Evaluation of RAG, Agent, and Agent + Verifier workflows

## Core Features

| Module | Description |
| --- | --- |
| Paper RAG | Parse PDFs, preserve page numbers, chunk text, embed content, retrieve evidence, and answer paper questions |
| Code Analyzer | Clone GitHub repositories or extract zip archives, generate a directory tree, and detect key files: README, dependency files, and training, inference, model, dataset, and config files |
| Experiment Planner | Generate experiment goals, environment setup, data preparation, training/testing steps, and reproduction risks from the paper and code analysis results |
| Report Writer | Generate Markdown reports with background, related work, methods, system design, experiment steps, and result templates |
| Agent Workflow | Run the full pipeline with one click: paper parsing, RAG indexing, summary, code analysis, planning, reporting, and verification |
| Verifier | Separate paper evidence, code evidence, model inference, missing evidence, human-review items, and possible hallucinations |
| Evaluation | Generate manual scoring sheets for three modes (ordinary RAG, step-by-step Agent, and Agent + Verifier) and provide a fixed evaluation benchmark |

## Current Quality Status

The current version is a runnable MVP whose core pipeline has been validated with real-paper smoke tests:

- CLIP quantity question: answers `400 million`, with Page 2 as the top-1 citation and a cited snippet containing `400 million (image, text) pairs`.
- ReAct benchmark question: preserves the original benchmark names, including HotPotQA, Fever, ALFWorld, and WebShop.
- RAG formulation question: distinguishes RAG-Sequence and RAG-Token and cites the same-document / different-document evidence.
- Unit tests: `42 passed`.

## Screenshots

### Paper QA

After a PDF is uploaded, the system extracts text, preserves page numbers, builds a local retrieval index, and returns answers that show the cited page numbers and source snippets.

![Paper QA Tab](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/53e1d613af220835.jpg)

### Code Analysis

The code analysis tab accepts a GitHub repository URL or a zip archive, generates a directory tree, detects key files, and summarizes the codebase structure.

![Code Analysis Tab](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/5672d05a3b220841.jpg)

### Full Agent Workflow

The full workflow starts from a paper PDF and a GitHub repository URL, then generates a paper summary, code analysis, experiment plan, project report, and Verifier output in one click.

![Full Agent Workflow Tab](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/daf39c9771220847.jpg)

### Experiment Evaluation

The evaluation module compares ordinary RAG, step-by-step Agent, and Agent + Verifier outputs, then exports Markdown and CSV manual scoring sheets.

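To make the idea of a manual scoring sheet concrete, here is a minimal sketch of how one blank row per question and mode could be rendered as both CSV and Markdown. The column names, the `build_scoring_rows` / `to_csv` / `to_markdown` helpers, and the layout are assumptions for illustration; the project's real sheets may be structured differently. The metric columns mirror the evaluation metrics listed in the Usage section.

```python
import csv
import io

# Hypothetical layout: the project's real sheet columns may differ.
MODES = ["Ordinary RAG", "Step-by-step Agent", "Agent + Verifier"]
COLUMNS = [
    "question",
    "mode",
    "answer_completeness",
    "citation_correctness",
    "plan_executability",
    "unsupported_conclusions",
    "notes",
]


def build_scoring_rows(questions):
    """Return one blank scoring row per (question, mode) pair."""
    rows = []
    for question in questions:
        for mode in MODES:
            row = {column: "" for column in COLUMNS}
            row["question"] = question
            row["mode"] = mode
            rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_markdown(rows):
    """Render the same rows as a Markdown table for manual review."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "| " + " | ".join("---" for _ in COLUMNS) + " |"
    body = ["| " + " | ".join(row[c] for c in COLUMNS) + " |" for row in rows]
    return "\n".join([header, divider, *body])


if __name__ == "__main__":
    sheet = build_scoring_rows(["How many image-text pairs was CLIP trained on?"])
    print(to_csv(sheet))
    print(to_markdown(sheet))
```

Leaving the score cells empty is deliberate in this sketch: the sheets are meant to be filled in by a human reviewer, not scored automatically.
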
![Experiment Evaluation Tab](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/9a26cfb0e8220852.jpg)

## System Architecture

```mermaid
flowchart LR
    A["PDF Paper"] --> B["Paper Parser"]
    B --> C["Chunking + Embedding"]
    C --> D["Local Vector Store"]
    D --> E["Paper RAG QA"]
    F["GitHub URL / Zip"] --> G["Code Analyzer"]
    G --> H["Directory Tree + Key Files"]
    E --> I["Agent Workflow"]
    H --> I
    I --> J["Experiment Planner"]
    J --> K["Report Writer"]
    K --> L["Verifier"]
    L --> M["Evaluation Sheet"]
```

The system uses a modular Python architecture. Paper parsing, RAG, code analysis, agent orchestration, report generation, verification, and evaluation are independent modules, which keeps them easy to test and extend.

## Tech Stack

- Python 3.10+
- Gradio
- PyMuPDF / pdfplumber
- sentence-transformers
- Hybrid dense + lexical retrieval
- Optional cross-encoder reranker
- Local JSON vector store; Chroma / FAISS dependencies are included for extension
- OpenAI-compatible LLM API
- GitPython / subprocess
- Markdown and CSV artifact export
- pytest

## Project Structure

```text
researchflow-agent/
  README.md
  AGENTS.md
  requirements.txt
  .env.example
  app.py
  config.py
  docs/
    images/
  data/
    uploads/
    vectorstores/
    workspaces/
    outputs/
  src/
    llm/
    paper/
    rag/
    code_analyzer/
    agent/
    report/
    evaluation/
    storage/
    utils/
  tests/
  examples/
```

Key directories:

- `src/paper`: PDF parsing and paper text models
- `src/rag`: chunking, embeddings, local vector retrieval, and paper QA
- `src/code_analyzer`: GitHub / zip code loading, directory tree generation, and key-file detection
- `src/agent`: experiment planning and the complete Agent Workflow
- `src/report`: Markdown project report generation
- `src/evaluation`: Verifier and three-mode evaluation sheets
- `data/outputs`: generated plans, reports, evaluation sheets, and workflow summaries
- `docs/images`: README screenshot assets

## Installation

Use a dedicated conda environment instead of installing dependencies into `base`.

```bash
conda create -n researchflow python=3.11
conda activate researchflow
pip install -r requirements.txt
cp .env.example .env
python app.py
```

Run the application:

```bash
python app.py
```

After starting the app, open the local Gradio URL printed in the terminal.

## Configuration

Copy `.env.example` to `.env`, then configure the model and runtime settings.

```
OPENAI_API_KEY=your_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-mini
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
ALLOW_HASH_EMBEDDING_FALLBACK=true
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
ENABLE_CROSS_ENCODER_RERANKER=true
MAX_PAPER_CHUNK_TOKENS=220
CHUNK_OVERLAP_TOKENS=40
TOP_K_RETRIEVAL=8
RERANKER_CANDIDATE_MULTIPLIER=4
GIT_CLONE_TIMEOUT_SECONDS=120
MAX_ZIP_MEMBERS=4000
MAX_ZIP_TOTAL_BYTES=150000000
```

If no LLM API key is configured, the system can still run local retrieval and the offline deterministic templates, but the quality of LLM-based summaries will be limited.

## Usage

### 1. Paper QA

1. Open the **论文问答** (Paper QA) tab.
2. Upload a PDF.
3. Click **Parse and Index**.
4. Ask a question and inspect the answer with its page-grounded citations.

### 2. Code Analysis

1. Open the **代码分析** (Code Analysis) tab.
2. Enter a GitHub repository URL or upload a zip archive.
3. Review the directory tree, key-file table, and codebase summary.

### 3. Full Agent Workflow

1. Open the **完整 Agent 工作流** (Full Agent Workflow) tab.
2. Upload a paper PDF.
3. Enter a GitHub repository URL.
4. Enter the task goal.
5. Click **一键运行** (one-click run).
6. Review the status logs, paper summary, experiment plan, project report, and Verifier output.

### 4. Experiment Evaluation

The evaluation workflow compares three modes:

1. Ordinary RAG answer
2. Step-by-step Agent answer
3. Agent + Verifier answer

Evaluation metrics:

- Answer completeness
- Citation correctness
- Reproduction-plan executability
- Unsupported conclusions
- Human scoring notes

Generated files:

- `data/outputs/evaluation-*.md`
- `data/outputs/evaluation-*.csv`

### 5. Evaluation Benchmark

The project includes a fixed evaluation benchmark for repeatable validation of paper QA, evidence citation, and method-distinction tasks:

- `examples/evaluation_benchmark.json`: three fixed questions covering CLIP, ReAct, and RAG.
- `examples/validation_workflows.md`: four recommended validation workflows.
- `examples/validation_results.md`: current local validation record and human-review suggestions.
- `docs/technical_overview.md`: technical overview covering architecture, core modules, validation results, and limitations.

Click **Generate Evaluation Benchmark** in the **实验评测** (Experiment Evaluation) tab of the Gradio app to export:

- `data/outputs/benchmark-evaluation-*.md`
- `data/outputs/benchmark-evaluation-*.csv`

You can also generate the evaluation benchmark results with a one-step script:

```bash
conda activate researchflow
python scripts/run_evaluation_benchmark.py
```

By default the script makes no LLM calls, which suits quick local validation and CI environments. To use the OpenAI-compatible API configured in `.env`, add `--use-llm`:

```bash
python scripts/run_evaluation_benchmark.py --use-llm
```

## Evaluation and Validation

ResearchFlow-Agent provides manual evaluation templates and fixed validation questions. These artifacts are intended for human-reviewable evaluation, not for replacing manual judgment.

Related files:

- `docs/evaluation_report.md`: evaluation goals, modes, metrics, table template, sample result, and limitations.
- `docs/demo_guide.md`: professional walkthrough script for explaining the main workflow.
- `docs/project_summary.md`: concise technical summary and current limitations.
- `docs/technical_overview.md`: architecture, modules, engineering notes, validation status, and roadmap.
- `examples/paper_eval_questions.json`: 5 paper samples with 25 evaluation questions.
- `examples/evaluation_benchmark.json`: compact fixed benchmark for CLIP, ReAct, and RAG questions.

Generate a manual Markdown/CSV evaluation template:

```bash
python scripts/run_manual_evaluation_template.py
```

Generate the compact evaluation benchmark result template:

```bash
python scripts/run_evaluation_benchmark.py
```

## Security Notes

ResearchFlow-Agent is intended for local research workflows, but it still includes basic safety controls:

- GitHub cloning accepts only public HTTPS URLs in the standard `https://github.com/owner/repo` form.
- SSH, `git@`, non-GitHub hosts, URLs with a query or fragment, and spoofed domains are rejected.
- Cloning is shallow and subject to a configured timeout.
- Zip uploads are checked for path traversal, absolute paths, symlinks, member count, and total extracted size.
- `.env`, uploads, vector stores, workspaces, and outputs are ignored by Git by default.

## Verifier Design

The Verifier does not claim that generated content is 100% correct. Its purpose is to help users separate evidence, inference, and risk.

The Verifier outputs seven categories:

1. Content from the paper
2. Content from the code repository
3. Model-inferred content
4. Content with missing evidence
5. Items requiring human confirmation
6. Possible hallucinations
7. Improvement suggestions

## Known Limitations

- ResearchFlow-Agent does not automatically run real training experiments.
- The Verifier provides evidence attribution and uncertainty classification, but it does not guarantee factual correctness.
- Evaluation artifacts are designed for human review and require manual scoring.
- PDF page numbers come from the parser's page order and should be checked against the original PDF viewer when exact page mapping matters.
- If the hashing fallback is used, retrieval quality may be weaker than with real semantic embeddings.
- Chroma / FAISS are available as extension directions, while the current main implementation uses a local JSON vector store.

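The hashing-fallback limitation is easier to see with an example. The README does not describe how the project's fallback is implemented, so the following is only a minimal sketch of one common approach, the hashing trick, under stated assumptions: the `hash_embed` and `cosine` names, the 256-dimension size, and the use of SHA-256 are all hypothetical choices made for illustration.

```python
import hashlib
import math
import re


def hash_embed(text, dim=256):
    """Map text to a unit-length vector by hashing each token to a signed bucket.

    Assumption: this illustrates the general hashing trick, not the project's code.
    """
    vector = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        # hashlib gives a stable digest; the built-in hash() is salted per process.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[index] += sign
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def cosine(a, b):
    """Cosine similarity; inputs are unit-length (or all zero), so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    query = hash_embed("image text pairs")
    overlapping = hash_embed("pairs of image and text")
    paraphrase = hash_embed("picture caption dataset")
    # Shared tokens land in the same buckets, so overlap usually scores high.
    print(cosine(query, overlapping))
    # No shared tokens: any non-zero score here is only bucket-collision noise.
    print(cosine(query, paraphrase))
```

A vector built this way only reflects which tokens two texts share. It cannot recognize that "picture" and "image" are related, which is why retrieval with such a fallback may be weaker than with a sentence-transformers model.
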
## Testing

```bash
conda activate researchflow
pytest tests
```

The repository includes GitHub Actions CI in `.github/workflows/tests.yml`, which installs dependencies and runs the tests on push and pull request events.

Current tests cover:

- PDF parser error handling
- Chunking
- Embedding fallback and local vector retrieval
- Hybrid retrieval, cross-encoder reranking, and query-aware citation snippets
- Code repository analysis
- GitHub URL and zip-upload security boundaries
- Experiment planning and report writing
- Full Agent Workflow success and failure paths
- Verifier evidence attribution and uncertainty reporting
- Evaluation sheet and evaluation benchmark Markdown / CSV export

## Current Status

ResearchFlow-Agent currently provides a runnable research workflow MVP: paper RAG, code analysis, experiment planning, report writing, the Verifier, evaluation sheets, and the Gradio UI all have basic functionality in place.

## Roadmap

Planned improvements:

- SQLite session history and project-level persistence
- Stricter citation-level fact checking
- More complete Chroma / FAISS backend adapters
- Automatic extraction of paper title, authors, abstract, and section structure
- Evaluation result visualization
- Automatic benchmark execution with summary charts

## License

This project is released under the MIT License. See [LICENSE](LICENSE) for details.

## Notes

This project is intended to support research workflows. It does not replace human research judgment. Paper facts, experiment metrics, reproduction results, and report conclusions should all be manually verified by the user.

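Tags: RAG, code analysis, credential management, security rule engine, experiment reproduction, research workflow, paper reading, reverse-engineering tools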