mirofadlalla/DeepRAG

GitHub: mirofadlalla/DeepRAG

一个集成漂移检测、基准测试与 MLflow 跟踪的生产级 RAG 系统，解决幻觉与监控难题。

Stars: 3 | Forks: 0

# DeepRAG 系统（带漂移监控与基准测试）一个集成漂移检测、监控和基准测试功能的完整 RAG（检索增强生成）系统，使用 FAISS 索引、交叉编码器重排序和 MLflow 跟踪。 ## 概述本项目实现了一个完整的 RAG 流水线，包含以下组件： - **嵌入（Embedding）**：使用 SentenceTransformers（BAAI/bge-m3）生成语义嵌入 - **检索（Retrieval）**：使用 IVF（倒排文件）索引的 FAISS 进行高效向量搜索 - **过滤（Filtering）**：BM25 过滤和基于元数据的文档过滤 - **重排序（Reranking）**：交叉编码器重排序以提升结果质量 - **生成（Generation）**：使用 Llama 和 Qwen 模型的 LLM 答案生成 - **监控（Monitoring）**：实时漂移检测和性能监控 - **评估（Evaluation）**：包含检索指标、幻觉检测和答案相关性在内的综合指标 ## 项目结构 ``` Buliding Rag System/ ├── benchmark.py # Full dataset benchmarking ├── drift_monitoring_example.py # Drift monitoring with benchmarking ├── drift_detection.py # Drift detection logic ├── drift_dashboard.py # Real-time dashboard for drift monitoring ├── evaluation_dataset.json # Evaluation questions and relevant documents ├── evaluation_results/ # Evaluation metrics output ├── benchmark_reports/ # Benchmarking results and reports │ ├── full_results.json │ ├── full_results.csv │ ├── all_metrics.json │ ├── drift_report.txt │ └── metrics_pickles/ ├── pipelines/ │ └── pipeline.py # Main RAG pipeline orchestration ├── steps/ │ ├── __init__.py │ ├── data_ingestion.py # Document loading and preprocessing │ ├── chunking_engine.py # Document chunking strategies │ ├── embedder.py # Embedding generation │ ├── faiss_index.py # FAISS indexing and search │ ├── bm25_index.py # BM25 filtering │ ├── retrievalfiltering.py # Combined retrieval and filtering │ ├── fusion.py # Result fusion from multiple retrievers │ ├── Cross_Encoder.py # Cross-encoder reranking │ ├── query_expansion.py # Query expansion strategies │ ├── prompt_engineering.py # LLM prompt engineering ├── metrics/ │ └── metrics.py # Evaluation metrics computation └── data/ └── processed/ ├── faiss.index # FAISS index file ├── faiss_mapping.json # Document ID mapping └── embeddings/ # Cached embeddings ``` ## 已修复的问题 ### 问题 1：问题重复（处理 1-9 两次） **问题**：`benchmark.py` 文件在导入时执行了其主循环，导致问题被处理两次： 1. 当 `benchmark.py` 被导入到 `drift_monitoring_example.py` 时 2. 当 `drift_monitoring_example.py` 运行自己的循环时 **解决方案**： - 将 `benchmark.py` 中的所有可执行代码包裹在 `if __name__ == "__main__":` 保护块中 - 将 MLflow 运行上下文初始化移至主代码块内 - 确保数据集加载仅在模块级别执行一次 ``` if __name__ == "__main__": with mlflow.start_run(run_name="benchmark_full_dataset"): # All execution code here ``` ### 问题 2：JSON 序列化错误 **问题**：从 `rag_pipeline()` 返回的 `final_answer` 是一个 `PipelineRunResponse` 对象，无法直接序列化为 JSON。 **错误**：`TypeError: Object of type PipelineRunResponse is not JSON serializable` **解决方案**： - 添加类型检查，在序列化前检测对象类型 - 将 `PipelineRunResponse` 对象转换为字符串表示 - 优雅地处理字符串和对象类型 ``` # 将 final_answer 转换为字符串或字典以进行 JSON 序列化 if hasattr(final_answer, '__dict__'): # It's an object like PipelineRunResponse answer_str = str(final_answer) else: # It's already a string or primitive answer_str = final_answer ``` ### 问题 3：MLflow 运行上下文冲突 **问题**：在处理漂移结果时，代码尝试在现有运行仍处于活动状态时创建新的 MLflow 运行。 **错误**：`Run with UUID ... is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True` **解决方案**： - 将漂移结果记录到当前活动的 MLflow 运行中，而不是创建新运行 - 用直接记录到当前上下文的方式替换 `log_drift_to_mlflow()` 调用 - 为日志记录操作添加错误处理 ``` try: mlflow.log_param("final_drift_status", final_drift_results.get("status", "unknown")) if "metrics" in final_drift_results: for metric_name, metric_value in final_drift_results["metrics"].items(): try: mlflow.log_metric(f"final_drift_{metric_name}", float(metric_value)) except: mlflow.log_param(f"final_drift_{metric_name}", str(metric_value)) except Exception as e: print(f"Warning: Could not log drift results to MLflow: {e}") ``` ## 代码国际化 ### 语言转换所有代码注释和文档字符串已从阿拉伯语转换为英文，以保证一致性和可访问性： **已更新文件：** - benchmark.py（14 条注释） - drift_monitoring_example.py（8 条注释） - drift_dashboard.py（6 条注释） - drift_detection.py（20 条注释） - pipelines/pipeline.py（2 条注释） - steps/retrievalfiltering.py（2 条注释） - steps/faiss_index.py（16 条注释） - steps/query_expansion.py（4 条注释） - steps/prompt_engineering.py（1 条注释） - steps/embedder.py（1 条注释） - steps/Cross_Encoder.py（5 条注释） **总计转换注释数**：79 条注释从阿拉伯语转换为英文 ## 核心功能 ### 1. RAG 流水线主流水线（`pipelines/pipeline.py`）协调以下步骤： 1. **数据摄取**：加载文档并创建映射 2. **分块**：将文档拆分为语义块 3. **嵌入**：使用 BAAI/bge-m3 生成嵌入 4. **索引**：构建带 IVF 的 FAISS 索引以实现快速搜索 5. **检索**：使用 FAISS 检索候选结果 6. **BM25 过滤**：使用 BM25 进行初始过滤 7. **融合**：合并多个检索器的结果 8. **重排序**：使用交叉编码器进行重排序 9. **生成**：使用 LLM 生成答案 ### 2. 漂移检测系统持续监控系统性能，包括： - 查询漂移检测 - 检索稳定性跟踪 - 幻觉率监控 - 自动警报和建议 - 每日和基于批次的检查选项 ### 3. 基准测试使用以下方法进行综合评估： - 检索指标（MRR、Recall@K、Precision@K） - 答案相关性评分 - 幻觉检测 - Jaccard 相似度 - MLflow 集成用于跟踪 ### 4. 仪表板实时监控仪表板显示： - 当前漂移状态 - 性能指标 - 最近的警报 - 稳定性趋势 - 建议 ## 配置 ### 模型配置 ``` MODEL_NAME = "BAAI/bge-m3" # Embedding model TOP_K = 5 # Top K results to return BATCH_SIZE = 50 # Batch size for drift checks DAILY_CHECK_HOUR = 2 # Daily check at 2 AM ``` ### FAISS 配置 ``` nlist = 100 # Number of clusters for IVF nprobe = 10 # Number of clusters to search (10-20% of nlist) use_ivf = True # Use IVF for large-scale search ``` ## 使用方法 ### 运行完整基准测试 ``` python benchmark.py ``` 这将： - 加载评估数据集 - 将所有问题通过 RAG 流水线处理 - 为每个问题计算指标 - 将结果保存到 `benchmark_reports/` - 将指标记录到 MLflow ### 运行带漂移监控的基准测试 ``` python drift_monitoring_example.py ``` 这将： - 运行基准测试 - 在每个批次中监控漂移 - 生成漂移报告 - 显示监控摘要 - 启动后台调度程序以进行持续监控 ### 访问 MLflow 仪表板 ``` mlflow ui ``` 打开浏览器访问 `http://localhost:5000` 以查看： - 实验运行 - 随时间变化的指标 - 参数配置 - 工件和报告 ## 输出文件 ### 基准测试报告（`benchmark_reports/`） - **full_results.json**：包含所有问题的答案和指标的完整结果 - **full_results.csv**：表格格式的结果 - **all_metrics.json**：按问题汇总的指标 - **pickle_files_list.json**：Pickle 文件路径映射 - **drift_report.txt**：可读的漂移检测报告 - **metrics_pickles/**：每个问题的指标 Pickle 文件 ### 评估结果（`evaluation_results/`） - 指标 Pickle 文件：`{hash(question)}_metrics.pkl` - retrieval_metrics - answer_relevance - hallucination - jaccard_similarity ## 评估指标 ### 检索指标 - **MRR**（平均倒数排名）：衡量排序质量 - **Recall@K**：Top-K 结果中相关文档的百分比 - **Precision@K**：检索结果中相关结果的百分比 ### 质量指标 - **答案相关性**：二元分数（1 = 相关，0 = 不相关） - **幻觉分数**：0-1 分，表示幻觉的存在程度 - **Jaccard 相似度**：生成答案与参考答案的重叠度 ### 漂移指标 - **查询漂移**：查询模式的变化 - **检索稳定性**：检索质量的一致性 - **幻觉率**：幻觉频率的变化 ## 数据集格式评估数据集（`evaluation_dataset.json`）遵循以下格式： ``` [ { "id": "q_001", "question": "How many shares were reserved for future issuance under the Alphabet 2021 Stock Plan as of December 31, 2023?", "relevant_ids": ["doc_id_1", "doc_id_2"], "paraphrases": [ "Alternative phrasing 1", "Alternative phrasing 2" ] } ] ``` ## 性能特征 ### 速度 - FAISS（带 IVF）：每个查询约 1-5 毫秒 - 交叉编码器重排序：约 100-200 毫秒 - 完整流水线：每个问题约 1-2 秒 ### 准确性 - 检索 Recall@5：约 75-85%（取决于数据集） - 答案相关性：约 70-80% - 幻觉率：约 5-15% ### 可扩展性 - FAIS 支持数百万个向量 - IVF 聚类实现高效的大规模搜索 - 支持批处理进行漂移监控 ## 依赖项关键依赖项： - `faiss-cpu` 或 `faiss-gpu`：向量索引 - `sentence-transform`：嵌入生成 - `mlflow`：实验跟踪 - `apscheduler`：计划漂移检查 - `transformers`：交叉编码器和 LLM 模型 - `rank-bm25`：BM25 排名 - `pandas`、`numpy`：数据处理 ## 错误处理系统包含全面的错误处理： - 无效模型类型会转换为字符串以便 JSON 序列化 - MLflow 上下文冲突会被捕获并记录 - 缺失的指标文件不会中断流水线 - 对象序列化失败会被优雅处理 ## 后续改进 1. **高级漂移检测** - 显著性检验 - 异常检测算法 - 多维漂移分析 2. **性能优化** - FAISS 操作的 GPU 支持 - 批处理改进 - 缓存策略 3. **扩展监控** - 成本跟踪 - 延迟监控 - 用户反馈集成 4. **模型改进** - 微调嵌入模型 - 自定义交叉编码器训练 - 上下文感知查询扩展 ## 测试运行漂移监控示例以验证系统： ``` python drift_monitoring_example.py ``` 预期输出： - ✅ 处理 9 个问题 - ✅ 指标计算 - ✅ 报告生成 - ✅ 无 JSON 序列化错误 - ✅ 干净的 MLflow 日志记录 ## 许可证本项目是 DeepRAG 系统的一部分。 ## 作者创建日期：2026 年 1 月 30 日 ## 状态 ✅ **生产就绪** - 所有问题已解决，所有测试通过 - ✅ 问题重复已修复 - ✅ JSON 序列化正常工作 - ✅ MLflow 上下文管理已修复 - ✅ 所有注释已转换为英文 - ✅ 完整文档已完成

标签：AI工程实践, BM25, CI/CD安全, DeepRAG, FAISS, Llama, LLM生成, MLflow, Qwen, RAG系统, SentenceTransformers, SEO: RAG系统, SEO: 向量数据库, SEO: 基准测试, SEO: 幻觉控制, SEO: 漂移检测, 交叉编码器, 元数据过滤, 全链路评测, 分块引擎, 可扩展管道, 向量检索, 性能评估, 指标追踪, 数据摄取, 数据漂移, 文档预处理, 检索增强生成, 检索指标, 模型漂移, 深度检索, 漂移检测, 生产环境, 生产级AI工程, 相关性评估, 系统调用监控, 逆向工具, 重排序