NVIDIA/Model-Optimizer

GitHub: NVIDIA/Model-Optimizer

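A unified model optimization library from NVIDIA that combines state-of-the-art techniques such as quantization, pruning, and distillation to shrink large models and speed up inference on frameworks such as TensorRT-LLM.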

Stars: 3262 | Forks: 504

![Banner image](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/9d67416bb0091443.png)

# NVIDIA Model Optimizer

[![Documentation](https://img.shields.io/badge/Documentation-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/Model-Optimizer) [![Version](https://img.shields.io/pypi/v/nvidia-modelopt?label=Release)](https://pypi.org/project/nvidia-modelopt/) [![License](https://img.shields.io/badge/License-Apache%202.0-blue)](./LICENSE)

[Documentation](https://nvidia.github.io/Model-Optimizer) | [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)

**NVIDIA Model Optimizer** (referred to as **Model Optimizer** or **ModelOpt**) is a library of state-of-the-art model optimization [techniques](#techniques), including quantization, distillation, pruning, speculative decoding, and sparsity, for accelerating models.

**[Input]** Model Optimizer currently accepts [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch), or [ONNX](https://github.com/onnx/onnx) models as input.

**[Optimize]** Model Optimizer provides Python APIs that let users compose the model optimization techniques above and export an optimized, quantized checkpoint. Model Optimizer also integrates with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for inference optimization techniques that require training.

**[Export for deployment]** Model Optimizer integrates with the NVIDIA AI software ecosystem: the quantized checkpoints it generates can be deployed directly in downstream inference frameworks such as [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm). The unified Hugging Face export API now supports both transformers and diffusers models.

## Latest News

- [2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available for download on Hugging Face: [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). Learn more in the [Nemotron 3 Super release blog](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/). See [here](./examples/llm_ptq/README.md) for how to quantize Nemotron 3 models for accelerated deployment.
- [2026/03/11] [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the [Quantization (PTQ and QAT) guide](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/docs/models/llm/nemotron3-super.md#quantization-ptq-and-qat) for FP8/NVFP4 quantization and HF export instructions.
- [2025/12/11] [Blog: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)
- [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
- [2025/10/07] [Blog: Pruning and Distilling LLMs Using NVIDIA Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
- [2025/09/17] [Blog: An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
- [2025/09/11] [Blog: How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
- [2025/08/29] [Blog: Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)
- [2025/08/01] [Blog: Optimizing LLMs for Performance and Accuracy with Post-Training Quantization](https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/)
- [2025/06/24] [Blog: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
- [2025/05/14] [NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs](https://developer.nvidia.com/blog/nvidia-tensorrt-unlocks-fp4-image-generation-for-nvidia-blackwell-geforce-rtx-50-series-gpus/)
- [2025/04/21] [Adobe optimized deployment using Model-Optimizer + TensorRT, reducing diffusion latency by 60% and total cost of ownership by 40%](https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/)
- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). See [here](./examples/llm_ptq/README.md#llama-4) for how to quantize Llama4 for accelerated deployment.
- [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 and Increased Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
- [2025/02/25] Model Optimizer quantized NVFP4 models are available for download on Hugging Face: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
- [2025/01/28] Model Optimizer has added support for NVFP4. See the NVFP4 PTQ example [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).
- [2025/01/28] Model Optimizer is now open source!

Previous news

- [2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models are available for download on Hugging Face: [8B](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8), [70B](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8), [405B](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8).
- [2024/09/10] [Post-Training Quantization of LLMs with NVIDIA NeMo and Model Optimizer](https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/).
- [2024/08/28] [Boosting Llama 3.1 405B Performance by up to 44% with Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
- [2024/08/28] [Up to 1.9x Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
- [2024/08/15] New features in recent releases: [Cache Diffusion](./examples/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/24.09/sft_peft/qlora.html), and more. See [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
- [2024/06/03] As part of the effort to support popular deployment frameworks, Model Optimizer now has an experimental feature for deploying to vLLM. See the workflow [here](./examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm).
- [2024/05/08] [Announcement: Model Optimizer Now Publicly Available to Further Accelerate GenAI Inference Performance](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
- [2024/03/27] [Model Optimizer helps TensorRT-LLM set MLPerf LLM inference records](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)
- [2024/03/18] [GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT](https://www.nvidia.com/en-us/on-demand/session/gtc24-s63213/)
- [2024/03/07] [Model Optimizer's 8-bit post-training quantization enables TensorRT to accelerate Stable Diffusion by nearly 2x](https://developer.nvidia.com/blog/tensorrt-accelerates-stable-diffusion-nearly-2x-faster-with-8-bit-post-training-quantization/)
- [2024/02/01] [Speed up inference with Model Optimizer quantization techniques in TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md)

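## Installation

To install the stable release of Model Optimizer from [PyPI](https://pypi.org/project/nvidia-modelopt/) with `pip`:

```
pip install -U "nvidia-modelopt[all]"
```

To install from source in editable mode with all development dependencies, or to use the latest features, run:

```
# Clone the Model Optimizer repository
git clone git@github.com:NVIDIA/Model-Optimizer.git
cd Model-Optimizer

pip install -e ".[dev]"
```

You can also use the [TensorRT-LLM docker images](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) directly (for example `nvcr.io/nvidia/tensorrt-llm/release:`), which come with Model Optimizer pre-installed. Make sure to upgrade Model Optimizer to the latest version using the instructions above.

Visit our [installation guide](https://nvidia.github.io/Model-Optimizer/getting_started/2_installation.html) for finer-grained control over the installed dependencies, or for alternative docker images and environment variables to set up.

## Techniques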

| **Technique** | **Description** | **Examples** | **Docs** |
| :------------: | :------------: | :------------: | :------------: |
| Post-Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/vlm_ptq/)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Quantization-Aware Training | Refine accuracy further with a few training steps! | \[[Hugging Face](./examples/llm_qat/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Pruning | Reduce model size and accelerate inference by removing unnecessary weights! | \[[General](./examples/pruning/)\] \[[Megatron-Bridge](./examples/megatron_bridge/README.md#pruning)\] | |
| Distillation | Reduce deployment model size by teaching a small model to mimic the behavior of a larger one! | \[[Megatron-Bridge](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-bridge-framework)\] \[[Megatron-LM](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-lm-framework)\] \[[Hugging Face](./examples/llm_distill/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
| Speculative Decoding | Train draft modules to predict extra tokens during inference! | \[[Megatron](./examples/speculative_decoding#mlm-example)\] \[[Hugging Face](./examples/speculative_decoding/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/5_speculative_decoding.html)\] |
| Sparsity | Compress a model efficiently by storing only its non-zero parameter values and their locations | \[[PyTorch](./examples/llm_sparsity/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/6_sparsity.html)\] |
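To make the 2x-4x figure in the post-training quantization row concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It does not use the Model Optimizer API; the helper names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical and chosen only for illustration. Going from 16-bit floats to 8-bit integers halves the storage; a 4-bit format would give roughly 4x, ignoring the small overhead of storing scale factors.

```python
import struct


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


# Hypothetical weights, for illustration only.
weights = [0.02, -0.51, 0.33, 1.27, -1.0, 0.0, 0.75, -0.08]

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Storage cost: half-precision floats take 2 bytes each, int8 values take 1.
fp16_bytes = len(struct.pack(f"{len(weights)}e", *weights))
int8_bytes = len(struct.pack(f"{len(q)}b", *q))
compression = fp16_bytes / int8_bytes  # 2.0 for FP16 -> INT8

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"compression: {compression}x, scale: {scale:.4f}, max error: {max_error:.4f}")
```

Real quantizers, including the formats named in this document such as FP8 and NVFP4, use calibration data and finer-grained scaling to keep accuracy; this sketch shows only the size arithmetic and the round-trip error bound.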
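The sparsity row in the table above describes storing only the non-zero parameter values and their locations. The following plain-Python sketch shows that idea on a small list; it is not Model Optimizer's storage format, and the function names `compress` and `decompress` as well as the sample weights (two non-zeros in every group of four) are assumptions made for illustration.

```python
def compress(dense):
    """Keep only the non-zero values and the index of each one."""
    positions = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in positions]
    return values, positions, len(dense)


def decompress(values, positions, length):
    """Rebuild the dense list by writing each value back at its index."""
    dense = [0.0] * length
    for value, index in zip(values, positions):
        dense[index] = value
    return dense


# Hypothetical weights with two non-zeros in every group of four.
dense = [0.0, 0.7, 0.0, -0.2, 1.1, 0.0, 0.0, 0.4]

values, positions, length = compress(dense)
restored = decompress(values, positions, length)

print(f"stored {len(values)} of {length} values at positions {positions}")
```

The saving depends on how compactly the positions can be encoded: with a fixed pattern such as two non-zeros per group of four, each position needs only a couple of bits, so the index overhead stays small relative to the values dropped.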

## Pre-Quantized Checkpoints

- Ready-to-deploy checkpoints \[[🤗 Hugging Face - Nvidia Model Optimizer Collection](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer)\]
- Deployable on [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm), and [SGLang](https://github.com/sgl-project/sglang)
- More models coming soon!

## Resources

- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
- 📖 [Documentation](https://nvidia.github.io/Model-Optimizer)
- 🎯 [Benchmarks](./examples/benchmark.md)
- 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)
- 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)
- ✨ [File a feature request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)

## Model Support Matrix

| Model Type | Support Matrix |
|------------|----------------|
| LLM Quantization | [View Support Matrix](./examples/llm_ptq/README.md#support-matrix) |
| Diffusers Quantization | [View Support Matrix](./examples/diffusers/README.md#support-matrix) |
| VLM Quantization | [View Support Matrix](./examples/vlm_ptq/README.md#support-matrix) |
| ONNX Quantization | [View Support Matrix](./examples/torch_onnx/README.md#onnx-export-supported-llm-models) |
| Windows Quantization | [View Support Matrix](./examples/windows/README.md#support-matrix) |
| Quantization-Aware Training | [View Support Matrix](./examples/llm_qat/README.md#support-matrix) |
| Pruning | [View Support Matrix](./examples/pruning/README.md#support-matrix) |
| Distillation | [View Support Matrix](./examples/llm_distill/README.md#support-matrix) |
| Speculative Decoding | [View Support Matrix](./examples/speculative_decoding/README.md#support-matrix) |

## Contributing

Model Optimizer is now open source! We welcome any feedback, feature requests, and PRs. Please read our [Contributing](./CONTRIBUTING.md) guidelines for details on how to contribute to this project.

### Top Contributors

[![Contributors](https://contrib.rocks/image?repo=NVIDIA/Model-Optimizer)](https://github.com/NVIDIA/Model-Optimizer/graphs/contributors)

Happy optimizing!
