QwenLM/Qwen3-VL-Embedding

GitHub: QwenLM/Qwen3-VL-Embedding

Multimodal Embedding and Reranker models built on Qwen3-VL, addressing cross-modal information retrieval and precise reranking.

Stars: 1342 | Forks: 115

# Qwen3-VL-Embedding & Qwen3-VL-Reranker

[![GitHub](https://img.shields.io/badge/GitHub-black?logo=github)](https://github.com/QwenLM/Qwen3-VL-Embedding)
[![Hugging Face - Embedding](https://img.shields.io/badge/🤗-Embedding-yellow)](https://huggingface.co/collections/Qwen/qwen3-vl-embedding)
[![Hugging Face - Reranker](https://img.shields.io/badge/🤗-Reranker-yellow)](https://huggingface.co/collections/Qwen/qwen3-vl-reranker)
[![ModelScope - Embedding](https://img.shields.io/badge/ModelScope-Embedding-blue)](https://modelscope.cn/organization/qwen/qwen3-vl-embedding)
[![ModelScope - Reranker](https://img.shields.io/badge/ModelScope-Reranker-blue)](https://modelscope.cn/organization/qwen/qwen3-vl-reranker)
[![Technical Report](https://img.shields.io/badge/📄-Technical%20Report-red)](assets/qwen3vlembedding_technical_report.pdf)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

**State-of-the-art multimodal Embedding and Reranker models built on Qwen3-VL, supporting text, images, screenshots, videos, and mixed-modality inputs for advanced information retrieval and cross-modal understanding.**

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Model Architecture](#model-architecture)
- [Installation](#installation)
- [Usage](#usage)
- [Examples](#examples)
- [Model Performance](#model-performance)
- [Citation](#citation)

## Overview

The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built on the recently open-sourced [Qwen3-VL](https://huggingface.co/collections/Qwen/qwen3-vl) foundation model. Designed for multimodal information retrieval and cross-modal understanding, the series accepts diverse inputs including text, images, screenshots, and videos, as well as inputs that mix these modalities.

Building on the success of our text-oriented [Qwen3-Embedding](https://huggingface.co/collections/Qwen/qwen3-embedding) and [Qwen3-Reranker](https://huggingface.co/collections/Qwen/qwen3-reranker) series, these multimodal models extend state-of-the-art performance to visual and video understanding tasks. The two models work in tandem: the Embedding model handles the initial recall stage by generating semantically rich vectors, while the Reranker model handles the reranking stage with precise relevance scoring, significantly improving final retrieval accuracy.

## Features

- **🎨 Multimodal Versatility**: Handles inputs containing text, images, screenshots, and videos within a unified framework. Achieves state-of-the-art performance on tasks such as image-text retrieval, video-text matching, visual question answering (VQA), and multimodal content clustering.
- **🔄 Unified Representation Space**: Uses the Qwen3-VL architecture to generate semantically rich vectors that capture visual and textual information in a shared space, enabling efficient similarity estimation and retrieval across modalities.
- **🎯 High-Precision Reranking**: The reranking model takes an input pair (Query, Document), where either side may consist of any single or mixed modality, and outputs a precise relevance score for higher retrieval accuracy.
- **🌍 Exceptional Practicality**:
  - Supports over 30 languages, making it well suited to global applications
  - Supports custom instructions for task-specific optimization
  - Supports flexible vector dimensions through Matryoshka Representation Learning (MRL)
  - Retains strong performance with quantized embeddings for efficient deployment
  - Integrates easily into existing retrieval pipelines

## Model Architecture

### Model Specifications

| Model | Size | Layers | Sequence Length | Embedding Dimension | Quantization Support | MRL Support | Instruction Aware |
|---|---|---|---|---|---|---|---|
| **Qwen3-VL-Embedding-2B** | 2B | 28 | 32K | 2048 | ✅ | ✅ | ✅ |
| **Qwen3-VL-Embedding-8B** | 8B | 36 | 32K | 4096 | ✅ | ✅ | ✅ |
| **Qwen3-VL-Reranker-2B** | 2B | 28 | 32K | - | - | - | ✅ |
| **Qwen3-VL-Reranker-8B** | 8B | 36 | 32K | - | - | - | ✅ |

### LoRA Configuration

| Model | rank | alpha | target_modules |
|------|------|-------|----------------|
| Qwen3-VL-Embedding | 32 | 32 | q_proj v_proj k_proj up_proj down_proj gate_proj |
| Qwen3-VL-Reranker | 32 | 32 | q_proj v_proj k_proj up_proj down_proj gate_proj |

### Architecture Design

**Qwen3-VL-Embedding: Dual-Tower Architecture**

- Accepts single-modality or mixed-modality input and maps it to a high-dimensional semantic vector
- Extracts the hidden state vector corresponding to the `[EOS]` token from the last layer of the base model as the final semantic representation
- Encodes each input independently, which gives the efficiency that large-scale retrieval requires

**Qwen3-VL-Reranker: Single-Tower Architecture**

- Accepts an input pair `(Query, Document)` and performs pointwise reranking
- Uses a Cross-Attention mechanism for deeper, finer-grained cross-modal interaction and information fusion
- Expresses the relevance score by predicting the generation probabilities of the special tokens `yes` and `no`

### Feature Comparison

| | Qwen3-VL-Embedding | Qwen3-VL-Reranker |
|---------|-------------------|-------------------|
| **Core Function** | Semantic representation, embedding generation | Relevance scoring, pointwise reranking |
| **Input** | Single modality or mixed modalities | (Query, Document) pair with single- or mixed-modality inputs |
| **Architecture** | Dual-tower | Single-tower |
| **Mechanism** | Efficient retrieval | Deep cross-modal interaction, precise alignment |
| **Output** | Semantic vector | Relevance score |

Both models are built through a multi-stage training paradigm that draws on the general multimodal semantic understanding of Qwen3-VL, providing high-quality semantic representations and a precise reranking mechanism for complex, large-scale multimodal retrieval tasks.

## Installation

### Environment Setup

```
# Clone the repository
git clone https://github.com/QwenLM/Qwen3-VL-Embedding.git
cd Qwen3-VL-Embedding

# Run the script to set up the environment
bash scripts/setup_environment.sh
```

The setup script will automatically:

- Install `uv` (if not already installed)
- Install all project dependencies

After setup completes, activate the environment:

```
source .venv/bin/activate
```

### Model Download

Our models are available on both Hugging Face and ModelScope.

| Model | Hugging Face | ModelScope |
|-------|--------------|------------|
| Qwen3-VL-Embedding-2B | [Link](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B) | [Link](https://modelscope.cn/models/qwen/Qwen3-VL-Embedding-2B) |
| Qwen3-VL-Embedding-8B | [Link](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B) | [Link](https://modelscope.cn/models/qwen/Qwen3-VL-Embedding-8B) |
| Qwen3-VL-Reranker-2B | [Link](https://huggingface.co/Qwen/Qwen3-VL-Reranker-2B) | [Link](https://modelscope.cn/models/qwen/Qwen3-VL-Reranker-2B) |
| Qwen3-VL-Reranker-8B | [Link](https://huggingface.co/Qwen/Qwen3-VL-Reranker-8B) | [Link](https://modelscope.cn/models/qwen/Qwen3-VL-Reranker-8B) |

Each command block below first installs the download client it needs.

**Download from Hugging Face:**

```
uv pip install huggingface-hub
huggingface-cli download Qwen/Qwen3-VL-Embedding-2B --local-dir ./models/Qwen3-VL-Embedding-2B
```

**Download from ModelScope:**

```
uv pip install modelscope
modelscope download --model qwen/Qwen3-VL-Embedding-2B --local_dir ./models/Qwen3-VL-Embedding-2B
```

## Usage

### Quick Start

#### Embedding Model

##### Transformers Usage

```
import torch
from src.models.qwen3_vl_embedding import Qwen3VLEmbedder

model = Qwen3VLEmbedder(
    model_name_or_path="./models/Qwen3-VL-Embedding-2B",
    # flash_attention_2 for better acceleration and memory saving
    # torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2"
)

inputs = [{
    "text": "A woman playing with her dog on a beach at sunset.",
    "instruction": "Retrieve images or text relevant to the user's query.",
}, {
    "text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."
}, {
    "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
}, {
    "text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.",
    "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
}]

embeddings = model.process(inputs)
print(embeddings @ embeddings.T)
```

##### vLLM Usage

For vLLM usage examples of the Embedding model, see [examples/embedding_vllm.ipynb](examples/embedding_vllm.ipynb).

#### Reranker Model

##### Transformers Usage

```
import torch
from src.models.qwen3_vl_reranker import Qwen3VLReranker

model = Qwen3VLReranker(
    model_name_or_path="./models/Qwen3-VL-Reranker-2B",
    # flash_attention_2 for better acceleration and memory saving
    # torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2"
)

inputs = {
    "instruction": "Retrieve images or text relevant to the user's query.",
    "query": {"text": "A woman playing with her dog on a beach at sunset."},
    "documents": [
        {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust."},
        {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
        {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
    ],
    "fps": 1.0,
    "max_frames": 64
}

scores = model.process(inputs)
print(scores)
```

##### vLLM Usage

For vLLM usage examples of the Reranker model, see [examples/reranker_vllm.ipynb](examples/reranker_vllm.ipynb).

### Model Input Specification

#### Multimodal Object

A dictionary that may contain the following keys:

- **text**: Text input, as a string or a list of strings
- **image**: Image input, which supports:
  - A local file path
  - A URL (network path)
  - A `PIL.Image.Image` instance
  - A list of any combination of the above (multiple images)
- **video**: Video input, which supports:
  - A local file path
  - A URL (network path)
  - A sequence of video frames (a list of image paths or `PIL.Image.Image` instances)
  - A list of any combination of the above (multiple videos)

**Note**: All input types (text, image, video) now accept either a single object or a list of objects, so you can provide several inputs of the same type in a single request. For example, you can pass multiple images as a list, multiple text strings as a list, or multiple videos as a list.

#### Instruction

A task description used for relevance evaluation (default: "Represent the user's input").

#### Video Sampling Settings

These settings take effect only when the video input is a video file:

- **fps**: Frame sampling rate (frames per second)
- **max_frames**: Maximum number of frames to sample

#### Input Format

**Embedding model**: A list of dictionaries, where each dictionary contains:

- Instruction (optional)
- Video sampling settings (optional)
- Multimodal object keys (text, image, and/or video)

**Reranker model**: A dictionary containing:

- **query**: A multimodal object
- **documents**: A list of multimodal objects
- **instruction**: Task description (optional)
- **fps**: Video sampling rate (optional)
- **max_frames**: Maximum number of frames (optional)

### Embedding Model

#### Model Initialization Parameters

```
Qwen3VLEmbedder(
    model_name_or_path="./models/Qwen3-VL-Embedding-2B",
    max_length=8192,       # Default context length
    min_pixels=4096,       # Minimum pixels for input images
    max_pixels=1843200,    # Maximum pixels for input images (equivalent to 1280×1440 resolution)
    total_pixels=7864320,  # Maximum total pixels for input videos (multiplied by 2 in model)
                           # For a 16-frame video, each frame can have up to 983040 pixels (1280×768 resolution)
    fps=1.0,               # Default sampling frame rate for video files (frames per second)
    max_frames=64,         # Maximum number of frames for video input
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
```

## Examples

### Embedding Model

We provide comprehensive [examples](examples/embedding.ipynb) covering a range of tasks across modalities:

**Text tasks:**

- Text classification (AG News)
- Text question answering (SQuAD)
- Text retrieval (MS MARCO)

**Image tasks:**

- Image classification (CIFAR-10)
- Image question answering (VQAv2)
- Image retrieval (MS COCO)

Examples for video and visual document tasks are given in the appendix of the [technical report](assets/qwen3vlembedding_technical_report.pdf).

We also provide an end-to-end multimodal RAG [example](examples/Qwen3VL_Multimodal_RAG.ipynb) that uses Qwen3-VL-Embedding, Qwen3-VL-Reranker, and Qwen3-VL.

### Reranker Model

We provide comprehensive [examples](examples/reranker.ipynb) covering a range of tasks across modalities:

**Text tasks:**

- Text retrieval (MS MARCO)

**Image tasks:**

- Image retrieval (MS COCO)

## Model Performance

### Embedding Model

#### Evaluation Results on [MMEB-V2](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard)

Results on the MMEB-V2 benchmark. All models except IFM-TTE were re-evaluated on the updated VisDoc OOD split. CLS: classification, QA: question answering, RET: retrieval, GD: grounding, MRET: moment retrieval, VDR: ViDoRe, VR: VisRAG, OOD: out-of-distribution.

| Model | Model Size | Image CLS | Image QA | Image RET | Image GD | Image Overall | Video CLS | Video QA | Video RET | Video MRET | Video Overall | VisDoc VDRv1 | VisDoc VDRv2 | VisDoc VR | VisDoc OOD | VisDoc Overall | All |
|----------------------------|---------|-------|------|------|------|-----------|------|------|------|------|------|-------|------|--------|------|------|--------|
| **# of Datasets →** | | 10 | 10 | 12 | 4 | 36 | 5 | 5 | 5 | 3 | 18 | 10 | 4 | 6 | 4 | 24 | 78 |
| VLM2Vec | 2B | 58.7 | 49.3 | 65.0 | 72.9 | 59.7 | 33.4 | 30.5 | 20.6 | 30.7 | 28.6 | 49.8 | 13.5 | 51.8 | 48.2 | 44.0 | 47.7 |
| VLM2Vec-V2 | 2B | 62.9 | 56.3 | 69.5 | 77.3 | 64.9 | 39.3 | 34.3 | 28.8 | 36.8 | 34.6 | 75.5 | 44.9 | 79.4 | 62.2 | 69.2 | 59.2 |
| GME-2B | 2B | 54.4 | 29.9 | 66.9 | 55.5 | 51.9 | 34.9 | 42.0 | 25.6 | 31.1 | 33.6 | 86.1 | 54.0 | 82.5 | 67.5 | 76.8 | 55.3 |
| GME-7B | 7B | 57.7 | 34.7 | 71.2 | 59.3 | 56.0 | 37.4 | 50.4 | 28.4 | 37.0 | 38.4 | 89.4 | 55.6 | 85.0 | 68.3 | 79.3 | 59.1 |
| Ops-MM-embedding-v1 | 8B | 69.7 | 69.6 | 73.1 | 87.2 | 72.7 | 59.7 | 62.2 | 45.7 | 43.2 | 53.8 | 80.1 | 59.6 | 79.3 | 67.8 | 74.4 | 68.9 |
| IFM-TTE | 8B | **76.7** | 78.5 | 74.6 | 89.3 | 77.9 | 60.5 | 67.9 | 51.7 | 54.9 | 59.2 | 85.2 | 71.5 | **92.7** | 53.3 | 79.5 | 74.1 |
| RzenEmbed | 8B | 70.6 | 71.7 | 78.5 | 92.1 | 75.9 | 58.8 | 63.5 | 51.0 | 45.5 | 55.7 | 89.7 | 60.7 | 88.7 | 69.9 | 81.3 | 72.9 |
| Seed-1.6-embedding-1215 | unknown | 75.0 | 74.9 | 79.3 | 89.0 | 78.0 | **85.2** | 66.7 | **59.1** | 54.8 | **67.7** | **90.0** | 60.3 | 90.0 | 70.7 | 82.2 | 76.9 |
| **Qwen3-VL-Embedding-2B** | 2B | 70.3 | 74.3 | 74.8 | 88.5 | 75.0 | 71.9 | 64.9 | 53.9 | 53.3 | 61.9 | 84.4 | 65.3 | 86.4 | 69.4 | 79.2 | 73.2 |
| **Qwen3-VL-Embedding-8B** | 8B | 74.2 | **81.1** | **80.0** | **92.2** | **80.1** | 78.4 | **71.0** | 58.7 | **56.1** | 67.1 | 87.2 | **69.9** | 88.7 | **73.3** | **82.4** | **77.8** |

#### Evaluation Results on [MMTEB](https://huggingface.co/spaces/mteb/leaderboard)

Results on the MMTEB benchmark.

| Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS |
|----------------------------------|:-------:|:-------------:|:-------------:|:------------:|:------:|:------:|:------------:|:-------------:|:------------:|:------:|:------:|:----:|
| NV-Embed-v2 | 7B | 56.3 | 49.6 | 57.8 | 57.3 | 40.8 | 1.0 | 18.6 | 78.9 | 63.8 | 56.7 | 71.1 |
| GritLM-7B | 7B | 60.9 | 53.7 | 70.5 | 61.8 | 49.8 | 3.5 | 22.8 | 79.9 | 63.8 | 58.3 | 73.3 |
| BGE-M3 | 0.6B | 59.6 | 52.2 | 79.1 | 60.4 | 40.9 | -3.1 | 20.1 | 80.8 | 62.8 | 54.6 | 74.1 |
| multilingual-e5-large-instruct | 0.6B | 63.2 | 55.1 | 80.1 | 64.9 | 50.8 | -0.4 | 22.9 | 80.9 | 62.6 | 57.1 | 76.8 |
| gte-Qwen2-1.5B-instruct | 1.5B | 59.5 | 52.7 | 62.5 | 58.3 | 52.1 | 0.7 | 24.0 | 81.6 | 62.6 | 60.8 | 71.6 |
| gte-Qwen2-7b-Instruct | 7B | 62.5 | 55.9 | 73.9 | 61.6 | 52.8 | 4.9 | 25.5 | 85.1 | 65.6 | 60.1 | 74.0 |
| text-embedding-3-large | - | 58.9 | 51.4 | 62.2 | 60.3 | 46.9 | -2.7 | 22.0 | 79.2 | 63.9 | 59.3 | 71.7 |
| Cohere-embed-multilingual-v3.0 | - | 61.1 | 53.2 | 70.5 | 63.0 | 46.9 | -1.9 | 22.7 | 79.9 | 64.1 | 59.2 | 74.8 |
| Gemini Embedding | - | 68.4 | 59.6 | 79.3 | 71.8 | 54.6 | 5.2 | **29.2** | 83.6 | 65.6 | 67.7 | 79.4 |
| Qwen3-Embedding-0.6B | 0.6B | 64.3 | 56.0 | 72.2 | 66.8 | 52.3 | 5.1 | 24.6 | 80.8 | 61.4 | 64.6 | 76.2 |
| Qwen3-Embedding-4B | 4B | 69.5 | 60.9 | 79.4 | 72.3 | 57.2 | **11.6** | 26.8 | 85.1 | 65.1 | 69.6 | 80.9 |
| Qwen3-Embedding-8B | 8B | **70.6** | **61.7** | **80.9** | **74.0** | **57.7** | 10.1 | 28.7 | **86.4** | **65.6** | **70.9** | **81.1** |
| Qwen3-VL-Embedding-2B | 2B | 63.9 | 55.8 | 69.5 | 65.9 | 52.5 | 3.9 | 26.1 | 78.5 | 64.8 | 67.1 | 74.3 |
| Qwen3-VL-Embedding-8B | 8B | 67.9 | 58.9 | 77.5 | 72.0 | 55.8 | 4.5 | 28.6 | 81.1 | 65.7 | 69.4 | 75.4 |

### Reranker Model

We evaluate on retrieval datasets drawn from various subtasks of the [MMEB-v2](https://huggingface.co/spaces/TIGER-Lab/MMEB-Leaderboard) and [MMTEB](https://huggingface.co/spaces/mteb/leaderboard) retrieval benchmarks. For visual document retrieval, we use the [JinaVDR](https://huggingface.co/collections/jinaai/jinavdr-visual-document-retrieval) and [ViDoRe v3](https://huggingface.co/blog/QuentinJG/introducing-vidore-v3) datasets. Our results show that the Qwen3-VL-Reranker models outperform the base Embedding model and the baseline reranker on most benchmarks, with the 8B variant achieving the best performance on most tasks.

| Model | Size | MMEB-v2(Retrieval) - Avg | MMEB-v2(Retrieval) - Image | MMEB-v2(Retrieval) - Video | MMEB-v2(Retrieval) - VisDoc | MMTEB(Retrieval) | JinaVDR | ViDoRe(v3) |
|-------|------|--------------------------|----------------------------|----------------------------|------------------------------|------------------|---------|------------|
| Qwen3-VL-Embedding-2B | 2B | 73.4 | 74.8 | 53.6 | 79.2 | 68.1 | 71.0 | 52.9 |
| jina-reranker-m0 | 2B | - | 68.2 | - | **85.2** | - | 82.2 | 57.8 |
| Qwen3-VL-Reranker-2B | 2B | 75.2 | 74.0 | 53.2 | 83.2 | 70.0 | 80.9 | 60.8 |
| Qwen3-VL-Reranker-8B | 8B | **79.2** | **78.2** | **61.0** | 85.8 | **74.9** | **83.6** | **66.7** |

### Reproducing the Evaluation

#### Embedding Model

We provide reproducible evaluation code for the **MMEB v2** benchmark, based on [VLM2Vec](https://github.com/TIGER-AI-Lab/VLM2Vec). To reproduce the results:

1. **Download the evaluation data:** `bash data/evaluation/mmeb_v2/download_data.sh`
2. **Run the evaluation:** `bash scripts/evaluation/mmeb_v2/eval_embedding.sh`

Run the script without arguments to see the required parameters. The script evaluates the tasks and collects the results automatically.

#### Reranker Model

We provide reproducible evaluation code for the retrieval split of **MMEB v2**. To reproduce the results:

1. **Download the evaluation data:** `bash data/evaluation/mmeb_v2/download_data.sh`
2. **Run the evaluation:** `bash scripts/evaluation/mmeb_v2/eval_reranker.sh`

Run the script without arguments to see the required parameters. The script evaluates the tasks and collects the results automatically.

## Citation

```
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv},
  year={2026}
}
```
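
As a supplementary note on the MRL support listed under the features: embeddings trained with Matryoshka Representation Learning are commonly shortened by keeping a prefix of the vector and rescaling it to unit length. The sketch below illustrates that idea in plain Python on toy 8-dimensional vectors. The helper names `truncate_and_normalize` and `cosine` are hypothetical and are not part of this repository, and whether `Qwen3VLEmbedder` exposes its own dimension option is not stated here, so treat this as post-processing under that assumption.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """For unit vectors, the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for real model outputs (hypothetical values).
query = [0.9, 0.1, 0.3, 0.0, 0.2, -0.1, 0.05, 0.4]
doc = [0.8, 0.2, 0.25, 0.1, -0.3, 0.0, 0.1, 0.35]

# Compare the similarity at the full size and at two shorter prefixes.
for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(doc, dim)
    print(dim, round(cosine(q, d), 4))
```

Shorter prefixes reduce index size and search cost, usually at some loss of retrieval quality, so the dimension is best chosen by measuring on your own data.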

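As a supplementary note on the reranker design, which expresses relevance through the generation probabilities of the `yes` and `no` tokens: one simple way to turn two such logits into a score is a two-way softmax, which reduces to a sigmoid of their difference. The sketch below assumes that formulation. The exact computation used by `Qwen3VLReranker` is not stated in this document, and the function names and logit values are hypothetical.

```python
import math


def relevance_score(yes_logit, no_logit):
    """Return P(yes) under a two-way softmax over the yes/no logits.

    softmax([no, yes])[1] equals sigmoid(yes - no); the two branches
    avoid overflow for large positive or negative differences.
    """
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def rerank(candidates):
    """Sort (doc_id, yes_logit, no_logit) triples by descending score."""
    scored = [(doc_id, relevance_score(y, n)) for doc_id, y, n in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for three candidate documents.
candidates = [("doc_a", 2.0, -1.0), ("doc_b", -0.5, 0.5), ("doc_c", 0.0, 0.0)]

for doc_id, score in rerank(candidates):
    print(doc_id, round(score, 4))
```

Because each (Query, Document) pair is scored independently, this is pointwise reranking: the scores can be computed in any order or batch size and sorted afterwards.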