QwenLM/Qwen3

GitHub: QwenLM/Qwen3

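A new generation of large language models from Alibaba's Tongyi Qianwen (Qwen) team, with seamless switching between thinking and non-thinking modes and strong logical reasoning, code generation, and agent capabilities.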

Stars: 26733 | Forks: 1904

# Qwen3

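Visit our Hugging Face or ModelScope organization (click the links above), search for checkpoints with names starting with `Qwen3-`, or visit the [Qwen3 collection](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f), and you will find all you need! Enjoy!

To learn more about Qwen3, feel free to read our documentation \[[EN](https://qwen.readthedocs.io/en/latest/)|[ZH](https://qwen.readthedocs.io/zh-cn/latest/)\]. Our documentation consists of the following sections:

- Quickstart: basic usage and demonstrations;
- Inference: guidance for inference with Transformers, including batch inference, streaming, etc.;
- Run Locally: instructions for running LLMs locally on CPU and GPU, with frameworks such as llama.cpp, Ollama, and LM Studio;
- Deployment: demonstrations of how to deploy Qwen for large-scale inference with frameworks such as SGLang, vLLM, and TGI;
- Quantization: the practice of quantizing LLMs with GPTQ and AWQ, as well as guidance on making high-quality quantized GGUF files;
- Training: instructions for post-training, including SFT and RLHF (TODO) with frameworks such as Axolotl and LLaMA-Factory;
- Framework: usage of Qwen with application frameworks, e.g., RAG, Agent, etc.

## Introduction

### Qwen3-2507

Over the past three months, we have continued to explore the potential of the Qwen3 series, and we are excited to introduce the updated **Qwen3-2507**. It comes in two variants, Qwen3-Instruct-2507 and Qwen3-Thinking-2507, and three sizes: 235B-A22B, 30B-A3B, and 4B.

**Qwen3-Instruct-2507** is the updated version of the previous Qwen3 non-thinking mode, featuring the following key enhancements:

- **Significant improvements** in general capabilities, including **instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage**.
- **Substantial gains** in long-tail knowledge coverage across **multiple languages**.
- **Markedly better alignment** with user preferences in **subjective and open-ended tasks**, enabling more helpful responses and higher-quality text generation.
- **Enhanced capabilities** in **256K-token long-context understanding**, extendable up to **1 million tokens**.

**Qwen3-Thinking-2507** is the continuation of the Qwen3 thinking mode, with improved quality and depth of reasoning, featuring the following key enhancements:

- **Significantly improved performance** on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise, achieving **state-of-the-art results among open-weight thinking models**.
- **Markedly better general capabilities**, such as instruction following, tool usage, text generation, and alignment with human preferences.
- **Enhanced 256K long-context understanding** capabilities, extendable up to **1 million tokens**.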

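Previous Qwen3 Release

### Qwen3 (aka Qwen3-2504)

We are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. These models represent our most advanced and intelligent systems to date, building on our experience with QwQ and Qwen2.5. We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Experts (MoE) models.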

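The highlights of Qwen3 include:

- Dense and Mixture-of-Experts (MoE) models in various sizes: 0.6B, 1.7B, 4B, 8B, 14B, and 32B, as well as 30B-A3B and 235B-A22B.
- Seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose chat), ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing the previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogue, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects, with strong capabilities for multilingual instruction following and translation.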

## News

- 2025.08.08: You can now use Qwen3-2507 to handle ultra-long inputs of **1 million tokens**! See the updated model cards ([235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507), [235B-A22B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507), [30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507), [30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507)) for how to enable this feature.
- 2025.08.06: The final open release of Qwen3-2507, [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) and [Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507), is out!
- 2025.07.31: Qwen3-30B-A3B-Thinking-2507 is released. Check out the [model card](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) for more details!
- 2025.07.30: Qwen3-30B-A3B-Instruct-2507 is released. Check out the [model card](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) for more details!
- 2025.07.25: We released the updated version of the Qwen3-235B-A22B thinking mode, named Qwen3-235B-A22B-Thinking-2507. Check out the [model card](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) for more details!
- 2025.07.21: We released the updated version of the Qwen3-235B-A22B non-thinking mode, named Qwen3-235B-A22B-Instruct-2507, featuring significant enhancements over the previous version and supporting 256K-token long-context understanding. Check out our [model card](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for more details!
- 2025.04.29: We released the Qwen3 series. Check out our [blog](https://qwenlm.github.io/blog/qwen3) for more details!
- 2024.09.19: We released the Qwen2.5 series. This time there are 3 extra model sizes: 3B, 14B, and 32B, for more possibilities. Check out our [blog](https://qwenlm.github.io/blog/qwen2.5) for more!
- 2024.06.06: We released the Qwen2 series. Check out our [blog](https://qwenlm.github.io/blog/qwen2/)!
- 2024.03.28: We released the first MoE model of Qwen: Qwen1.5-MoE-A2.7B! For now, only HF transformers and vLLM support the model. We will soon add support for llama.cpp, mlx-lm, etc. Check out our [blog](https://qwenlm.github.io/blog/qwen-moe/) for more information!
- 2024.02.05: We released the Qwen1.5 series.

## Performance

Detailed evaluation results are reported in this [📑 blog (Qwen3-2504)](https://qwenlm.github.io/blog/qwen3/) and this [📑 blog (Qwen3-2507) \[coming soon\]]().

For GPU memory requirements and the corresponding throughput, see the results [here](https://qwen.readthedocs.io/en/latest/getting_started/speed_benchmark.html).

## Run Qwen3

### 🤗 Transformers

Transformers is a library of pretrained natural language processing models for inference and training. The latest version of `transformers` is recommended, and `transformers>=4.51.0` is required.

#### Qwen3-Instruct-2507

The following code snippet illustrates how to use Qwen3-30B-A3B-Instruct-2507 to generate content based on given inputs.

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```

#### Qwen3-Thinking-2507

The following code snippet illustrates how to use Qwen3-30B-A3B-Thinking-2507 to generate content based on given inputs.

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```
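The reverse-index step in the Thinking-2507 snippet can be easier to follow in isolation. Below is a minimal sketch that wraps the same slicing logic in a helper and runs it on toy token IDs. The helper name `split_thinking` and the toy IDs are illustrative assumptions, not part of the Qwen3 codebase; only the marker ID 151668 comes from the snippet above.

```python
from typing import List, Tuple

# Token ID that the snippet above treats as the end-of-thinking marker.
END_OF_THINKING_ID = 151668


def split_thinking(
    output_ids: List[int], end_id: int = END_OF_THINKING_ID
) -> Tuple[List[int], List[int]]:
    """Split generated IDs into (thinking_ids, answer_ids) at the LAST marker.

    Hypothetical helper for illustration only.
    """
    try:
        # Searching the reversed list finds the last occurrence of the marker;
        # the resulting index points just past it in the original list.
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        # No marker found: treat the whole output as the final answer.
        index = 0
    return output_ids[:index], output_ids[index:]


# Toy IDs standing in for real tokens (made-up values).
thinking_ids, answer_ids = split_thinking([11, 12, 151668, 21, 22])
print(thinking_ids)  # [11, 12, 151668]
print(answer_ids)    # [21, 22]
```

Note that the first slice includes the marker itself, and that an output without the marker yields an empty thinking part, mirroring the `except ValueError` branch above.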

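Switching Thinking/Non-thinking Modes for Previous Qwen3 Models

By default, Qwen3 models think before responding. This can be controlled in the following ways:

- `enable_thinking=False`: passing `enable_thinking=False` to `tokenizer.apply_chat_template` strictly prevents the model from generating thinking content.
- `/think` and `/no_think` instructions: use these words in the system or user message to indicate whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.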
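The two controls described above can be sketched as follows, for the previous (Qwen3-2504) models. The helper names `build_prompt` and `with_soft_switch` are assumptions made for illustration; the only library call involved is `tokenizer.apply_chat_template`, and the tokenizer is passed in as an argument so the sketch itself does not need to download a model.

```python
from typing import Dict, List


def build_prompt(tokenizer, messages: List[Dict[str, str]], enable_thinking: bool = True) -> str:
    """Render a chat prompt, forwarding the hard switch to the chat template.

    Hypothetical helper; `tokenizer` is expected to be a Qwen3 tokenizer.
    """
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,  # False prevents thinking content
    )


def with_soft_switch(content: str, think: bool) -> str:
    """Append /think or /no_think to a system or user message (hypothetical helper)."""
    return f"{content} {'/think' if think else '/no_think'}"


messages = [
    {"role": "user", "content": with_soft_switch("How many r's are in 'strawberry'?", think=True)},
    {"role": "assistant", "content": "There are three."},
    # In a multi-turn conversation, the latest instruction is the one followed.
    {"role": "user", "content": with_soft_switch("And in 'blueberry'?", think=False)},
]
print(messages[-1]["content"])  # And in 'blueberry'? /no_think
```

With a real tokenizer, for example one loaded via `AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")`, calling `build_prompt(tokenizer, messages, enable_thinking=False)` would produce the prompt text to tokenize and pass to `model.generate`.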

### ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. ModelScope adopts a Python API similar to that of Transformers. The CLI tool `modelscope download` can help you solve issues with downloading checkpoints. For vLLM and SGLang, the environment variables `VLLM_USE_MODELSCOPE=true` and `SGLANG_USE_MODELSCOPE=true` can be used, respectively.

### llama.cpp

[`llama.cpp`](https://github.com/ggml-org/llama.cpp) enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware. `llama.cpp>=b5401` is recommended for full support of Qwen3.

To use the CLI, run the following in a terminal:

```
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift
# CTRL+C to exit
```

To use the API server, run the following in a terminal:

```
./llama-server -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --reasoning-format deepseek -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift --port 8080
```

A simple web front end will be available at `http://localhost:8080` and an OpenAI-compatible API at `http://localhost:8080/v1`.

For additional guides, please refer to [our documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html).

### Ollama

After [installing Ollama](https://ollama.com/), you can start the Ollama service with the following command (Ollama v0.9.0 or higher is recommended):

```
ollama serve
# You need to keep this service running whenever you are using ollama
```

To pull a model checkpoint and run the model, use the `ollama run` command. You can specify a model size by adding a suffix to `qwen3`, such as `:8b` or `:30b-a3b`:

```
ollama run qwen3:8b
# To set parameters, type "/set parameter num_ctx 40960" and "/set parameter num_predict 32768"
# To exit, type "/bye" and press ENTER
# For Qwen3-2504 models,
# - To enable thinking, which is the default, type "/set think"
# - To disable thinking, type "/set nothink"
```

You can also access the Ollama service via its OpenAI-compatible API. Please note that you need to (1) keep `ollama serve` running while using the API, and (2) execute `ollama run qwen3:8b` before using this API to ensure that the model checkpoint is prepared. The API is at `http://localhost:11434/v1/` by default.

For additional details, please visit [ollama.ai](https://ollama.com/).

### LMStudio

Qwen3 is now supported by [lmstudio.ai](https://lmstudio.ai/). You can use LMStudio directly with our GGUF files.

### ExecuTorch

To export and run on ExecuTorch (iOS, Android, Mac, Linux, and more), please follow this [example](https://github.com/pytorch/executorch/blob/main/examples/models/qwen3/README.md).

### MNN

To export and run on MNN, which supports Qwen3 on mobile devices, please visit [Alibaba MNN](https://github.com/alibaba/MNN).

### MLX LM

If you are running on Apple Silicon, [`mlx-lm`](https://github.com/ml-explore/mlx-lm) also supports Qwen3 (`mlx-lm>=0.24.0`). Look for models ending with MLX on the Hugging Face Hub.

### OpenVINO

If you are running on an Intel CPU or GPU, the [OpenVINO toolkit](https://github.com/openvinotoolkit) supports Qwen3. You can follow this [chatbot example](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/llm-chatbot/llm-chatbot.ipynb).

## Deploy Qwen3

Qwen3 is supported by multiple inference frameworks. Here we demonstrate the usage of `SGLang`, `vLLM`, and `TensorRT-LLM`. You can also find Qwen3 models from various inference providers, e.g., [Alibaba Cloud Model Studio](https://www.alibabacloud.com/en/product/modelstudio).

### SGLang

[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models. SGLang can be used to launch a server with an OpenAI-compatible API service. `sglang>=0.4.6.post1` is required.

For Qwen3-Instruct-2507,

```
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 --port 30000 --context-length 262144
```

For Qwen3-Thinking-2507,

```
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Thinking-2507 --port 30000 --context-length 262144 --reasoning-parser deepseek-r1
```

For Qwen3,

```
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000 --context-length 131072 --reasoning-parser qwen3
```

An OpenAI-compatible API will be available at `http://localhost:30000/v1`.

### vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs. `vllm>=0.9.0` is recommended.

For Qwen3-Instruct-2507,

```
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --port 8000 --max-model-len 262144
```

For Qwen3-Thinking-2507,

```
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --port 8000 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

For Qwen3,

```
vllm serve Qwen/Qwen3-8B --port 8000 --max-model-len 131072 --enable-reasoning --reasoning-parser qwen3
```

An OpenAI-compatible API will be available at `http://localhost:8000/v1`.

### TensorRT-LLM

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is an open-source LLM inference engine from NVIDIA that provides optimizations on NVIDIA GPUs, including custom attention kernels, quantization, and more. Qwen3 is supported in its re-architected [PyTorch backend](https://nvidia.github.io/TensorRT-LLM/torch.html). `tensorrt_llm>=0.20.0rc3` is recommended. Please refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/qwen/README.md#qwen3) page for more details.

```
trtllm-serve Qwen/Qwen3-8B --host localhost --port 8000 --backend pytorch
```

An OpenAI-compatible API will be available at `http://localhost:8000/v1`.

### MindIE

For deployment on Ascend NPUs, please visit [Modelers](https://modelers.cn/) and search for Qwen3.
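Several of the servers above (llama.cpp, Ollama, SGLang, vLLM, TensorRT-LLM) expose an OpenAI-compatible API. As a rough sketch of how a client might talk to one of them using only the Python standard library, the example below builds and sends a chat request. It assumes the server follows the usual OpenAI chat completions conventions (a `/chat/completions` route under the base URL and a `choices[0].message.content` field in the response); the helper names `build_chat_request` and `chat` are hypothetical, and the model name must match whatever the server is actually serving.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the chat completions route (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    with urllib.request.urlopen(build_chat_request(base_url, model, prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not contact the server, so this runs offline.
request = build_chat_request("http://localhost:8000/v1", "Qwen/Qwen3-8B", "Hello!")
print(request.full_url)  # http://localhost:8000/v1/chat/completions
```

With a server from one of the sections above running, `chat("http://localhost:8000/v1", "Qwen/Qwen3-8B", "Hello!")` would return the reply text. For Ollama, the base URL would be `http://localhost:11434/v1/`, and the model name would presumably be the Ollama tag such as `qwen3:8b`.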