huggingface/pytorch-image-models

GitHub: huggingface/pytorch-image-models

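A comprehensive library that collects hundreds of PyTorch image model architectures and pretrained weights, with end-to-end support from training and validation through inference and export.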

Stars: 37000 | Forks: 5174

# PyTorch Image Models

- [What's New](#whats-new)
- [Introduction](#introduction)
- [Models](#models)
- [Features](#features)
- [Results](#results)
- [Getting Started (Documentation)](#getting-started-documentation)
- [Train, Validation, Inference Scripts](#train-validation-inference-scripts)
- [Awesome PyTorch Resources](#awesome-pytorch-resources)
- [Licenses](#licenses)
- [Citing](#citing)

## What's New

## May 8, 2026
* Release 1.0.27

## April 23, 2026
* Add Gemma4 ViT encoder with support for the NaFlex pipeline (variable aspect ratio/size per image). Thanks [Yonghye Kwon](https://github.com/developer0hye)
* Support DINOv3 weights in NaFlexVit. Thanks [Yonghye Kwon](https://github.com/developer0hye)
* Improve some learning-rate behaviour when Muon falls back to AdamW/NadamW

## March 23, 2026
* Improve the safety of pickle checkpoint handling. Default all loading to `weights_only=True` and add a safe_global for ArgParse.
* Improve attention mask handling for core ViT/EVA models and layers. Resolve boolean masks and pass `is_causal` through for SSL tasks.
* Fix class & register token use in ViT when position embedding (pos embed) is not enabled.
* Add Patch Representation Refinement (PRR) as a pooling option in ViT. Thanks Sina (https://github.com/sinahmr).
* Improve consistency of output projection / MLP dimensions for attention pooling layers.
* F.SDPA optimization for the Hiera model to allow use of Flash Attention kernels.
* Add a warning for the SGDP optimizer.
* Release 1.0.26. First maintenance release since my departure from Hugging Face.

## Feb 23, 2026
* Add token distillation training support to the distillation task wrappers
* Remove some `torch.jit` usage in preparation for its official deprecation
* Add a warning for the AdamP optimizer
* Call `reset_parameters()` even with `meta` device init so that buffers are initialized when using tricks such as `init_empty_weights`
* Tweak the Muon optimizer to work with DTensor/FSDP2 (`clamp_` instead of `clamp_min_`, and a separate NS branch for DTensor)
* Release 1.0.25

## Jan 21, 2026
* **Compat Break**: Fix an oversight with QKV vs MLP bias in `ParallelScalingBlock` (and `DiffParallelScalingBlock`)
  * Does not impact any trained `timm` models, but could impact downstream use.

## Jan 5 & 6, 2026
* Release 1.0.24
* Add new benchmark result csv files with inference timing for all models on RTX Pro 6000, 5090, and 4090 cards with PyTorch 2.9.1
* Fix a module-move error in the deprecated `timm.models.layers` import path that affected legacy imports
* Release 1.0.23

## Dec 30, 2025
* Add better NAdaMuon-trained `dpwee`, `dwee`, `dlittle` (differential) ViTs with a small improvement over the previous runs
  * https://huggingface.co/timm/vit_dlittle_patch16_reg1_gap_256.sbb_nadamuon_in1k (83.24% top-1)
  * https://huggingface.co/timm/vit_dwee_patch16_reg1_gap_256.sbb_nadamuon_in1k (81.80% top-1)
  * https://huggingface.co/timm/vit_dpwee_patch16_reg1_gap_256.sbb_nadamuon_in1k (81.67% top-1)
* Add ~21M param `timm` variants of the CSATv2 model at 512x512 and 640x640 resolution
  * https://huggingface.co/timm/csatv2_21m.sw_r640_in1k (83.13% top-1)
  * https://huggingface.co/timm/csatv2_21m.sw_r512_in1k (82.58% top-1)
* Factor non-persistent param init out of `__init__` into a common method that can be called externally via `init_non_persistent_buffers()` after `meta` device init.

## Dec 12, 2025
* Add CSATV2 model (thanks https://github.com/gusdlf93), a lightweight but high-resolution model with a DIT stem and spatial attention. https://huggingface.co/Hyunil/CSATv2
* Add AdaMuon and NAdaMuon optimizer support to the existing `timm` Muon implementation. Appears more competitive vs AdamW with familiar hyperparameters for image tasks.
* End-of-year PR cleanup, merging aspects of several long-open PRs
  * Merge differential attention (`DiffAttention`), add the corresponding `DiffParallelScalingBlock` (for ViT), train some small ViTs
    * https://huggingface.co/timm/vit_dwee_patch16_reg1_gap_256.sbb_in1k
    * https://huggingface.co/timm/vit_dpwee_patch16_reg1_gap_256.sbb_in1k
  * Add a few pooling modules, `LsePlus` and `SimPool`
  * Clean up and optimize `DropBlock2d` (also add support for ByobNet-based models)
* Bump unit tests to PyTorch 2.9.1 + Python 3.13 on the upper end; the lower end remains PyTorch 1.13 + Python 3.10

## Dec 1, 2025
* Add a lightweight task abstraction, and add logits and feature distillation support to the train script via new tasks.
* Remove old APEX AMP support

## Nov 4, 2025
* Fix LayerScale / LayerScale2d init bug (init values ignored), introduced in 1.0.21. Thanks https://github.com/Ilya-Fradlin
* Release 1.0.22

## Oct 31, 2025 🎃
* Update ImageNet & OOD variant result csv files to include a few new models and verify correctness over several torch & timm versions
* Add EfficientNet-X and EfficientNet-H B5 model weights as part of a hyperparameter search for AdamW vs Muon (still iterating on Muon runs)

## Oct 16-20, 2025
* Add an implementation of the Muon optimizer (based on https://github.com/KellerJordan/Muon) with customizations
  * extra flexibility, improved handling of conv weights, and fallbacks for weight shapes not suited to orthogonalization
  * small speedup for NS iterations by reducing allocations and using fused (b)add(b)mm ops
  * by default uses AdamW (or NAdamW if `nesterov=True`) updates if Muon is not suitable for the parameter shape (or is excluded via a param group flag)
  * like the torch implementation, select from several LR scale adjustment functions via `adjust_lr_fn`
  * select from several NS coefficient presets or specify your own via `ns_coefficients`
* First 2 steps of 'meta' device model initialization supported
  * Fix several ops that were breaking creation under the 'meta' device context
  * Add device & dtype factory kwarg support to all models and modules (anything inheriting from `nn.Module`) in `timm`
* License fields added to pretrained cfgs in code
* Release 1.0.21

## Sept 21, 2025
* Remap DINOv3 ViT weight tags from `lvd_1689m` to `lvd1689m` for consistency (same for `sat_493m` -> `sat493m`)
* Release 1.0.20

## Sept 17, 2025
* DINOv3 (https://arxiv.org/abs/2508.10104) ConvNeXt and ViT models added. ConvNeXt models were mapped to existing `timm` models. ViT support is done via the EVA base model with a new `RotaryEmbeddingDinoV3` to match the DINOv3-specific RoPE implementation.
  * HuggingFace Hub: https://huggingface.co/collections/timm/timm-dinov3-68cb08bb0bee365973d52a4d
* MobileCLIP-2 (https://arxiv.org/abs/2508.20691) vision encoders. New MCI3/MCI4 FastViT variants added, with weights mapped to existing FastViT and B, L/14 ViTs.
* MetaCLIP-2 Worldwide (https://arxiv.org/abs/2507.22062) ViT encoder weights added.
* SigLIP-2 (https://arxiv.org/abs/2502.14786) NaFlex ViT encoder weights added via the timm NaFlexViT model.
* Misc fixes and contributions

## July 23, 2025
* Add `set_input_size()` method to EVA models, used by OpenCLIP 3.0.0 to allow resizing of timm-based encoder models.
* Release 1.0.18, needed for the PE-Core S & T models in OpenCLIP 3.0.0.
* Fix a small typing issue that broke Python 3.9 compatibility. 1.0.19 patch release.

## July 21, 2025
* ROPE support added to NaFlexViT. All models covered by the EVA base (`eva.py`), including EVA, EVA02, Meta PE ViT, `timm` SBB ViT with ROPE, and Naver ROPE-ViT, can now be loaded into NaFlexViT when `use_naflex=True` is passed at model creation time.
* More Meta PE ViT encoders added, including small/tiny variants, lang variants with tiling, and more spatial variants.
* PatchDropout fixed for NaFlexViT and EVA models (a regression after adding Naver ROPE-ViT).
* Fix XY order with `grid_indexing='xy'`, which impacted non-square image use in 'xy' mode (only ROPE-ViT and PE were affected).

## July 7, 2025
* MobileNet-v5 backbone tweaks for improved Google Gemma 3n behaviour (to pair with updated official weights)
  * Add stem bias (zeroed in the updated weights, compat break with old weights)
  * GELU -> GELU (tanh approx). A minor change to be closer to JAX
* Add two arguments to layer-decay support, a min scale clamp and a 'no optimization' scale threshold
* Add 'Fp32' LayerNorm, RMSNorm, SimpleNorm variants that can be enabled to force computation of the norm in float32
* Some typing and argument cleanup for norm, norm+act layers done alongside the above
* Support Naver ROPE-ViT (https://github.com/naver-ai/rope-vit) in `eva.py`, add a RotaryEmbeddingMixed module for mixed mode, weights on HuggingFace Hub

|model                                             |img_size|top1  |top5  |param_count|
|--------------------------------------------------|--------|------|------|-----------|
|vit_large_patch16_rope_mixed_ape_224.naver_in1k   |224     |84.84 |97.122|304.4      |
|vit_large_patch16_rope_mixed_224.naver_in1k       |224     |84.828|97.116|304.2      |
|vit_large_patch16_rope_ape_224.naver_in1k         |224     |84.65 |97.154|304.37     |
|vit_large_patch16_rope_224.naver_in1k             |224     |84.648|97.122|304.17     |
|vit_base_patch16_rope_mixed_ape_224.naver_in1k    |224     |83.894|96.754|86.59      |
|vit_base_patch16_rope_mixed_224.naver_in1k        |224     |83.804|96.712|86.44      |
|vit_base_patch16_rope_ape_224.naver_in1k          |224     |83.782|96.61 |86.59      |
|vit_base_patch16_rope_224.naver_in1k              |224     |83.718|96.672|86.43      |
|vit_small_patch16_rope_224.naver_in1k             |224     |81.23 |95.022|21.98      |
|vit_small_patch16_rope_mixed_224.naver_in1k       |224     |81.216|95.022|21.99      |
|vit_small_patch16_rope_ape_224.naver_in1k         |224     |81.004|95.016|22.06      |
|vit_small_patch16_rope_mixed_ape_224.naver_in1k   |224     |80.986|94.976|22.06      |

* Some cleanup of ROPE modules, helpers, and FX tracing leaf registration
* Preparing version 1.0.17 release

## June 26, 2025
* MobileNetV5 backbone (with encoder-only variant) for the [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n#parameters) image encoder
* Version 1.0.16 released

## June 23, 2025
* Add F.grid_sample based 2D and factorized position embedding resize to NaFlexViT. Faster when there are many different sizes (based on an example by https://github.com/stas-sl).
* Further speed up patch embed resampling by replacing vmap with matmul (based on a snippet by https://github.com/stas-sl).
* Add 3 initial native-aspect NaFlexViT checkpoints created while testing, trained on ImageNet-1k with the same hyperparameters and 3 different position embedding configs.

| Model | Top-1 Acc | Top-5 Acc | Params (M) | Eval Seq Len |
|:---|:---:|:---:|:---:|:---:|
| [naflexvit_base_patch16_par_gap.e300_s576_in1k](https://hf.co/timm/naflexvit_base_patch16_par_gap.e300_s576_in1k) | 83.67 | 96.45 | 86.63 | 576 |
| [naflexvit_base_patch16_parfac_gap.e300_s576_in1k](https://hf.co/timm/naflexvit_base_patch16_parfac_gap.e300_s576_in1k) | 83.63 | 96.41 | 86.46 | 576 |
| [naflexvit_base_patch16_gap.e300_s576_in1k](https://hf.co/timm/naflexvit_base_patch16_gap.e300_s576_in1k) | 83.50 | 96.46 | 86.63 | 576 |

* Support gradient checkpointing for `forward_intermediates` and fix some checkpointing bugs. Thanks https://github.com/brianhou0208
* Add 'corrected weight decay' (https://arxiv.org/abs/2506.02285) as an option to the AdamW (legacy), Adopt, Kron, Adafactor (BV), Lamb, LaProp, Lion, NadamW, RmsPropTF, SGDW optimizers
* Switch PE (perception encoder) ViT models to use native timm weights instead of remapping on the fly
* Fix cuda stream bug in the prefetch loader

## June 5, 2025
* Initial NaFlexVit model code. NaFlexVit is a Vision Transformer with:
  1. Embedding and position encoding encapsulated in a single module
  2. Support for `nn.Linear` patch embedding on pre-patchified (dictionary) inputs
  3. Support for NaFlex variable aspect, variable resolution (SigLip-2: https://arxiv.org/abs/2502.14786)
  4. Support for FlexiViT variable patch size (https://arxiv.org/abs/2212.08013)
  5. Support for NaViT fractional/factorized position embedding (https://arxiv.org/abs/2307.06304)
* Existing vit models in `vision_transformer.py` can be loaded into the NaFlexVit model by adding the `use_naflex=True` flag to `create_model`
  * Some native weights coming soon
* A full NaFlex data pipeline is available that allows training / fine-tuning / evaluating with variable aspect / size images
  * To enable it in `train.py` and `validate.py`, add the `--naflex-loader` arg; it must be used with a NaFlexVit
* To evaluate an existing (classic) ViT loaded in a NaFlexVit model with the NaFlex data pipeline:
  * `python validate.py /imagenet --amp -j 8 --model vit_base_patch16_224 --model-kwargs use_naflex=True --naflex-loader --naflex-max-seq-len 256`
* Training has some extra args worth noting
  * The `--naflex-train-seq-lens` argument specifies which sequence lengths to randomly pick from per batch during training
  * The `--naflex-max-seq-len` argument sets the target sequence length for validation
  * Adding `--model-kwargs enable_patch_interpolator=True --naflex-patch-sizes 12 16 24` enables random patch size selection per batch (with interpolation)
  * The `--naflex-loss-scale` arg changes the loss scaling mode per batch relative to the batch size; `timm` NaFlex loading changes the batch size for each sequence length

## May 28, 2025
* Add a number of small/fast models thanks to https://github.com/brianhou0208
  * SwiftFormer - [(ICCV2023) SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://github.com/Amshaker/SwiftFormer)
  * FasterNet - [(CVPR2023) Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks](https://github.com/JierunChen/FasterNet)
  * SHViT - [(CVPR2024) SHViT: Single-Head Vision Transformer with Memory Efficient](https://github.com/ysj9909/SHViT)
  * StarNet - [(CVPR2024) Rewrite the Stars](https://github.com/ma-xu/Rewrite-the-Stars)
  * GhostNet-V3 [GhostNetV3: Exploring the Training Strategies for Compact Models](https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/ghostnetv3_pytorch)
* Update EVA ViT (closest match) to support Perception Encoder models (https://arxiv.org/abs/2504.13181) from Meta; Hub weights load, but I still need to push dedicated `timm` weights
  * Add some flexibility to the ROPE implementation
* Big increase in the number of models supporting feature extraction via `forward_intermediates()`, plus some additional fixes, thanks to https://github.com/brianhou0208
  * DaViT, EdgeNeXt, EfficientFormerV2, EfficientViT(MIT), EfficientViT(MSRA), FocalNet, GCViT, HGNet /V2, InceptionNeXt, Inception-V4, MambaOut, MetaFormer, NesT, Next-ViT, PiT, PVT V2, RepGhostNet, RepViT, ResNetV2, ReXNet, TinyViT, TResNet, VoV
* TNT model updated with new weights and `forward_intermediates()` thanks to https://github.com/brianhou0208
* Add `local-dir:` pretrained schema. Use `local-dir:/path/to/model/folder` as the model name to source the model / pretrained cfg and weights of a Hugging Face Hub model (config.json + weights file) from a local folder.
* Fixes and improvements for onnx export

## Feb 21, 2025
* SigLIP 2 ViT image encoders added (https://huggingface.co/collections/timm/siglip-2-67b8e72ba08b09dd97aecaf9)
  * Variable resolution / aspect NaFlex versions are a work in progress
* Add 'SO150M2' ViT weights trained with SBB recipes. Great results, better on ImageNet than the previous attempt with less training.
  * `vit_so150m2_patch16_reg1_gap_448.sbb_e200_in12k_ft_in1k` - 88.1% top-1
  * `vit_so150m2_patch16_reg1_gap_384.sbb_e200_in12k_ft_in1k` - 87.9% top-1
  * `vit_so150m2_patch16_reg1_gap_256.sbb_e200_in12k_ft_in1k` - 87.3% top-1
  * `vit_so150m2_patch16_reg4_gap_256.sbb_e200_in12k`
* Updated InternViT-300M '2.5' weights
* Release 1.0.15

## Feb 1, 2025
* FYI, PyTorch 2.6 & Python 3.13 are tested and working with current main and the released version of `timm`

## Jan 27, 2025
* Add Kron optimizer (PSGD with Kronecker-factored preconditioner)
  * Code from https://github.com/evanatyourservice/kron_torch
  * See also https://sites.google.com/site/lixilinx/home/psgd

## Jan 19, 2025
* Fix loading of LeViT safetensor weights, remove conversion code that should have been deactivated
* Add 'SO150M' ViT weights trained with SBB recipes. Decent results, but not an optimal shape for ImageNet-12k/1k pretrain/fine-tune
  * `vit_so150m_patch16_reg4_gap_256.sbb_e250_in12k_ft_in1k` - 86.7% top-1
  * `vit_so150m_patch16_reg4_gap_384.sbb_e250_in12k_ft_in1k` - 87.4% top-1
  * `vit_so150m_patch16_reg4_gap_256.sbb_e250_in12k`
* Misc typing, typo, etc. cleanup
* 1.0.14 release to get the above LeViT fix out

## Jan 9, 2025
* Add support for training and validating in pure `bfloat16` or `float16`
* `wandb` project name arg added by https://github.com/caojiaolong, uses arg.experiment for the name
* Fix an old issue with checkpoint saving not working on filesystems without hard-link support (e.g. FUSE fs mounts)
* 1.0.13 release

## Jan 6, 2025
* Add a `torch.utils.checkpoint.checkpoint()` wrapper in `timm.models` that defaults to `use_reentrant=False`, unless `TIMM_REENTRANT_CKPT=1` is set in the environment.

## Dec 31, 2024
* `convnext_nano` 384x384 ImageNet-12k pretrain & fine-tune. https://huggingface.co/models?search=convnext_nano%20r384
* Add AIM-v2 encoders from https://github.com/apple/ml-aim, see on Hub: https://huggingface.co/models?search=timm%20aimv2
* Add PaliGemma2 encoders from https://github.com/google-research/big_vision to the existing PaliGemma, see on Hub: https://huggingface.co/models?search=timm%20pali2
* Add missing L/14 DFN2B 39B CLIP ViT, `vit_large_patch14_clip_224.dfn2b_s39b`
* Fix the existing `RmsNorm` layer & function to match the standard formulation, using the PT 2.5 implementation when possible. Move the old implementation to the `Norm` layer, which is LN without centering or bias. Only two `timm` models were using it, and they have been updated.
* Allow override of the `cache_dir` arg for model creation
* Pass through `trust_remote_code` for the HF datasets wrapper
* `inception_next_atto` model added by its creator
* Adan optimizer caution, and Lamb decoupled weight decay options
* Some feature_info metadata fixed by https://github.com/brianhou0208
* All OpenCLIP and JAX (CLIP, SigLIP, Pali, etc.) model weights that used load-time remapping were given their own HF Hub instances so that they work with `hf-hub:` based loading, and thus also work with the new Transformers `TimmWrapperModel`

## Introduction

Py**T**orch **Im**age **M**odels (`timm`) is a collection of image models, layers, utilities, optimizers, schedulers, data-loaders / augmentations, and reference training / validation scripts that aims to pull together a wide variety of SOTA models with the ability to reproduce ImageNet training results.

The work of many others is present here. I've tried to make sure all source material is acknowledged via links to GitHub, arXiv papers, etc. in the README, documentation, and code docstrings. Please let me know if I missed anything.

## Features

### Models

All model architecture families include variants with pretrained weights. Some specific model variants have no weights at all; this is NOT a bug. Help training new or better weights is always appreciated.

* Aggregating Nested Transformers - https://arxiv.org/abs/2105.12723
* BEiT - https://arxiv.org/abs/2106.08254
* BEiT-V2 - https://arxiv.org/abs/2208.06366
* BEiT3 - https://arxiv.org/abs/2208.10442
* Big Transfer ResNetV2 (BiT) - https://arxiv.org/abs/1912.11370
* Bottleneck Transformers - https://arxiv.org/abs/2101.11605
* CaiT (Class-Attention in Image Transformers) - https://arxiv.org/abs/2103.17239
* CoaT (Co-Scale Conv-Attentional Image Transformers) - https://arxiv.org/abs/2104.06399
* CoAtNet (Convolution and Attention) - https://arxiv.org/abs/2106.04803
* ConvNeXt - https://arxiv.org/abs/2201.03545
* ConvNeXt-V2 - http://arxiv.org/abs/2301.00808
* ConViT (Soft Convolutional Inductive Biases Vision Transformers) - https://arxiv.org/abs/2103.10697
* CspNet (Cross-Stage Partial Networks) - https://arxiv.org/abs/1911.11929
* DeiT - https://arxiv.org/abs/2012.12877
* DeiT-III - https://arxiv.org/pdf/2204.07118.pdf
* DenseNet - https://arxiv.org/abs/1608.06993
* DLA - https://arxiv.org/abs/1707.06484
* DPN (Dual-Path Network) - https://arxiv.org/abs/1707.01629
* EdgeNeXt - https://arxiv.org/abs/2206.10589
* EfficientFormer - https://arxiv.org/abs/2206.01191
* EfficientFormer-V2 - https://arxiv.org/abs/2212.08059
* EfficientNet (MBConvNet Family)
  * EfficientNet NoisyStudent (B0-B7, L2) - https://arxiv.org/abs/1911.04252
  * EfficientNet AdvProp (B0-B8) - https://arxiv.org/abs/1911.09665
  * EfficientNet (B0-B7) - https://arxiv.org/abs/1905.11946
  * EfficientNet-EdgeTPU (S, M, L) - https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html
  * EfficientNet V2 - https://arxiv.org/abs/2104.00298
  * FBNet-C - https://arxiv.org/abs/1812.03443
  * MixNet - https://arxiv.org/abs/1907.09595
  * MNASNet B1, A1 (Squeeze-Excite), and Small - https://arxiv.org/abs/1807.11626
  * MobileNet-V2 - https://arxiv.org/abs/1801.04381
  * Single-Path NAS - https://arxiv.org/abs/1904.02877
  * TinyNet - https://arxiv.org/abs/2010.14819
* EfficientViT (MIT) - https://arxiv.org/abs/2205.14756
* EfficientViT (MSRA) - https://arxiv.org/abs/2305.07027
* EVA - https://arxiv.org/abs/2211.07636
* EVA-02 - https://arxiv.org/abs/2303.11331
* FasterNet - https://arxiv.org/abs/2303.03667
* FastViT - https://arxiv.org/abs/2303.14189
* FlexiViT - https://arxiv.org/abs/2212.08013
* FocalNet (Focal Modulation Networks) - https://arxiv.org/abs/2203.11926
* GCViT (Global Context Vision Transformer) - https://arxiv.org/abs/2206.09959
* GhostNet - https://arxiv.org/abs/1911.11907
* GhostNet-V2 - https://arxiv.org/abs/2211.12905
* GhostNet-V3 - https://arxiv.org/abs/2404.11202
* gMLP - https://arxiv.org/abs/2105.08050
* GPU-Efficient Networks - https://arxiv.org/abs/2006.14090
* Halo Nets - https://arxiv.org/abs/2103.12731
* HGNet / HGNet-V2 - TBD
* HRNet - https://arxiv.org/abs/1908.07919
* InceptionNeXt - https://arxiv.org/abs/2303.16900
* Inception-V3 - https://arxiv.org/abs/1512.00567
* Inception-ResNet-V2 and Inception-V4 - https://arxiv.org/abs/1602.07261
* Lambda Networks - https://arxiv.org/abs/2102.08602
* LeViT (Vision Transformer in ConvNet's Clothing) - https://arxiv.org/abs/2104.01136
* MambaOut - https://arxiv.org/abs/2405.07992
* MaxViT (Multi-Axis Vision Transformer) - https://arxiv.org/abs/2204.01697
* MetaFormer (PoolFormer-v2, ConvFormer, CAFormer) - https://arxiv.org/abs/2210.13452
* MLP-Mixer - https://arxiv.org/abs/2105.01601
* MobileCLIP - https://arxiv.org/abs/2311.17049
* MobileNet-V3 (MBConvNet w/ Efficient Head) - https://arxiv.org/abs/1905.02244
  * FBNet-V3 - https://arxiv.org/abs/2006.02049
  * HardCoRe-NAS - https://arxiv.org/abs/2102.11646
  * LCNet - https://arxiv.org/abs/2109.15099
* MobileNetV4 - https://arxiv.org/abs/2404.10518
* MobileOne - https://arxiv.org/abs/2206.04040
* MobileViT - https://arxiv.org/abs/2110.02178
* MobileViT-V2 - https://arxiv.org/abs/2206.02680
* MViT-V2 (Improved Multiscale Vision Transformer) - https://arxiv.org/abs/2112.01526
* NASNet-A - https://arxiv.org/abs/1707.07012
* NesT - https://arxiv.org/abs/2105.12723
* Next-ViT - https://arxiv.org/abs/2207.05501
* NFNet-F - https://arxiv.org/abs/2102.06171
* NF-RegNet / NF-ResNet - https://arxiv.org/abs/2101.08692
* PE (Perception Encoder) - https://arxiv.org/abs/2504.13181
* PNasNet - https://arxiv.org/abs/1712.00559
* PoolFormer (MetaFormer) - https://arxiv.org/abs/2111.11418
* Pooling-based Vision Transformer (PiT) - https://arxiv.org/abs/2103.16302
* PVT-V2 (Improved Pyramid Vision Transformer) - https://arxiv.org/abs/2106.13797
* RDNet (DenseNets Reloaded) - https://arxiv.org/abs/2403.19588
* RegNet - https://arxiv.org/abs/2003.13678
* RegNetZ - https://arxiv.org/abs/2103.06877
* RepVGG - https://arxiv.org/abs/2101.03697
* RepGhostNet - https://arxiv.org/abs/2211.06088
* RepViT - https://arxiv.org/abs/2307.09283
* ResMLP - https://arxiv.org/abs/2105.03404
* ResNet/ResNeXt
  * ResNet (v1b/v1.5) - https://arxiv.org/abs/1512.03385
  * ResNeXt - https://arxiv.org/abs/1611.05431
  * 'Bag of Tricks' / Gluon C, D, E, S variations - https://arxiv.org/abs/1812.01187
  * Weakly-supervised (WSL) Instagram pretrained / ImageNet tuned ResNeXt101 - https://arxiv.org/abs/1805.00932
  * Semi-supervised (SSL) / Semi-weakly Supervised (SWSL) ResNet/ResNeXts - https://arxiv.org/abs/1905.00546
  * ECA-Net (ECAResNet) - https://arxiv.org/abs/1910.03151v4
  * Squeeze-and-Excitation Networks (SEResNet) - https://arxiv.org/abs/1709.01507
  * ResNet-RS - https://arxiv.org/abs/2103.07579
* Res2Net - https://arxiv.org/abs/1904.01169
* ResNeSt - https://arxiv.org/abs/2004.08955
* ReXNet - https://arxiv.org/abs/2007.00992
* ROPE-ViT - https://arxiv.org/abs/2403.13298
* SelecSLS - https://arxiv.org/abs/1907.00837
* Selective Kernel Networks - https://arxiv.org/abs/1903.06586
* Sequencer2D - https://arxiv.org/abs/2205.01972
* SHViT - https://arxiv.org/abs/2401.16456
* SigLIP (image encoder) - https://arxiv.org/abs/2303.15343
* SigLIP 2 (image encoder) - https://arxiv.org/abs/2502.14786
* StarNet - https://arxiv.org/abs/2403.19967
* SwiftFormer - https://arxiv.org/pdf/2303.15446
* Swin S3 (AutoFormerV2) - https://arxiv.org/abs/2111.14725
* Swin Transformer - https://arxiv.org/abs/2103.14030
* Swin Transformer V2 - https://arxiv.org/abs/2111.09883
* TinyViT - https://arxiv.org/abs/2207.10666
* Transformer-iN-Transformer (TNT) - https://arxiv.org/abs/2103.00112
* TResNet - https://arxiv.org/abs/2003.13630
* Twins (Spatial Attention in Vision Transformers) - https://arxiv.org/pdf/2104.13840.pdf
* VGG - https://arxiv.org/abs/1409.1556
* Visformer - https://arxiv.org/abs/2104.12533
* Vision Transformer - https://arxiv.org/abs/2010.11929
* ViTamin - https://arxiv.org/abs/2404.02132
* VOLO (Vision Outlooker) - https://arxiv.org/abs/2106.13112
* VovNet V2 and V1 - https://arxiv.org/abs/1911.06667
* Xception - https://arxiv.org/abs/1610.02357
* Xception (Modified Aligned, Gluon) - https://arxiv.org/abs/1802.02611
* Xception (Modified Aligned, TF) - https://arxiv.org/abs/1802.02611
* XCiT (Cross-Covariance Image Transformers) - https://arxiv.org/abs/2106.09681

### Optimizers

To see the full list of optimizers with descriptions: `timm.optim.list_optimizers(with_description=True)`

Optimizers available via the `timm.optim.create_optimizer_v2` factory method include:
* `adabelief` an implementation of AdaBelief adapted from https://github.com/juntang-zhuang/Adabelief-Optimizer - https://arxiv.org/abs/2010.07468
* `adafactor` adapted from the [FAIRSeq impl](https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py) - https://arxiv.org/abs/1804.04235
* `adafactorbv` adapted from [Big Vision](https://github.com/google-research/big_vision/blob/main/big_vision/optax.py) - https://arxiv.org/abs/2106.04560
* `adahessian` by [David Samuel](https://github.com/davda54/ada-hessian) - https://arxiv.org/abs/2006.00719
* `adamp` and `sgdp` by [Naver ClovAI](https://github.com/clovaai) - https://arxiv.org/abs/2006.08217
* `adamuon` and `nadamuon` following https://github.com/Chongjie-Si/AdaMuon - https://arxiv.org/abs/2507.11005
* `adan` an implementation of Adan adapted from https://github.com/sail-sg/Adan - https://arxiv.org/abs/2208.06677
* `adopt` ADOPT adapted from https://github.com/iShohei220/adopt - https://arxiv.org/abs/2411.02853
* `kron` PSGD with Kronecker-factored preconditioner from https://github.com/evanatyourservice/kron_torch - https://sites.google.com/site/lixilinx/home/psgd
* `lamb` an implementation of Lamb and LambC (with trust-clipping), cleaned up and modified to support use with XLA - https://arxiv.org/abs/1904.00962
* `laprop` optimizer from https://github.com/Z-T-WANG/LaProp-Optimizer - https://arxiv.org/abs/2002.04839
* `lars` an implementation of LARS and LARC (with trust-clipping) - https://arxiv.org/abs/1708.03888
* `lion` an implementation of Lion adapted from https://github.com/google/automl/tree/master/lion - https://arxiv.org/abs/2302.06675
* `lookahead` adapted from an implementation by [Liam](https://github.com/alphadl/lookahead.pytorch) - https://arxiv.org/abs/1907.08610
* `madgrad` an implementation of MADGRAD adapted from https://github.com/facebookresearch/madgrad - https://arxiv.org/abs/2101.11075
* `mars` MARS optimizer from https://github.com/AGI-Arena/MARS - https://arxiv.org/abs/2411.10438
* `muon` MUON optimizer from https://github.com/KellerJordan/Muon with numerous additions and improved non-transformer behaviour
* `nadam` an implementation of Adam with Nesterov momentum
* `nadamw` an implementation of AdamW (Adam with decoupled weight decay) with Nesterov momentum. A simplified implementation based on https://github.com/mlcommons/algorithmic-efficiency
* `novograd` by [Masashi Kimura](https://github.com/convergence-lab/novograd) - https://arxiv.org/abs/1905.11286
* `radam` by [Liyuan Liu](https://github.com/LiyuanLucasLiu/RAdam) - https://arxiv.org/abs/1908.03265
* `rmsprop_tf` adapted from PyTorch RMSProp by myself. Reproduces the much improved Tensorflow RMSProp behaviour
* `sgdw` an implementation of SGD with decoupled weight decay
* `fused` optimizers by name, requires [NVIDIA Apex](https://github.com/NVIDIA/apex/tree/master/apex/optimizers) to be installed
* `bnb` optimizers by name, requires [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) to be installed
* `cadamw`, `clion`, and more 'Cautious' optimizers from https://github.com/kyleliang919/C-Optim - https://arxiv.org/abs/2411.16085
* `adam`, `adamw`, `rmsprop`, `adadelta`, `adagrad`, and `sgd` pass through to the `torch.optim` implementations
* `c` suffix (e.g. `adamc`, `nadamc`) to use the 'corrected weight decay' proposed in https://arxiv.org/abs/2506.02285

### Augmentations
* Random Erasing from [Zhun Zhong](https://github.com/zhunzhong07/Random-Erasing/blob/master/transforms.py) - https://arxiv.org/abs/1708.04896
* Mixup - https://arxiv.org/abs/1710.09412
* CutMix - https://arxiv.org/abs/1905.04899
* AutoAugment (https://arxiv.org/abs/1805.09501) and RandAugment (https://arxiv.org/abs/1909.13719) ImageNet configurations modeled after the implementation for EfficientNet training (https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py)
* AugMix with JSD loss; JSD with clean + augmented mixing support also works with AutoAugment and RandAugment - https://arxiv.org/abs/1912.02781
* SplitBatchNorm - allows splitting batch norm layers between clean and augmented (auxiliary batch norm) data

### Regularization
* DropPath aka "Stochastic Depth" - https://arxiv.org/abs/1603.09382
* DropBlock - https://arxiv.org/abs/1810.12890
* Blur Pooling - https://arxiv.org/abs/1904.11486

### Other

Several (less common) features that I often use in my projects are included. Many of these additions are the reason why I maintain my own set of models instead of using others' via PIP:

* All models have a common default configuration interface and API for
  * accessing/changing the classifier - `get_classifier` and `reset_classifier`
  * doing a forward pass on just the features - `forward_features` (see [documentation](https://huggingface.co/docs/timm/feature_extraction))
  * these make it easy to write consistent network wrappers that work with any of the models
* All models support multi-scale feature map extraction (feature pyramids) via `create_model` (see [documentation](https://huggingface.co/docs/timm/feature_extraction))
  * `create_model(name, features_only=True, out_indices=..., output_stride=...)`
  * the `out_indices` creation arg specifies which feature maps to return; these indices are 0 based and generally correspond to the `C(i + 1)` feature level.
  * the `output_stride` creation arg controls the output stride of the network by using dilated convolutions. Most networks are stride 32 by default. Not all networks support this.
  * feature map channel counts and reduction level (stride) can be queried AFTER model creation via the `.feature_info` member
* All models have a consistent pretrained weight loader that adapts the last linear layer if necessary, and from 3 to 1 channel input if desired
* High performance [reference training, validation, and inference scripts](https://huggingface.co/docs/timm/training_script) that work in several process/GPU modes:
  * NVIDIA DDP with a single GPU per process, multiple processes with APEX present (AMP mixed-precision optional)
  * PyTorch DistributedDataParallel with multi-GPU, single process (AMP disabled as it crashes when enabled)
  * PyTorch with single GPU, single process (AMP optional)
* A dynamic global pool implementation that allows selecting from average pooling, max pooling, average + max, or concat([average, max]) at model creation. All global pooling is adaptive average by default and compatible with pretrained weights.
* A 'Test Time Pool' wrapper that can wrap any of the included models and usually provides improved performance when doing inference with input images larger than the training size. Idea adapted from the original DPN implementation when I ported it (https://github.com/cypw/DPNs)
* Learning rate schedulers
  * Ideas adopted from
    * [AllenNLP schedulers](https://github.com/allenai/allennlp/tree/master/allennlp/training/learning_rate_schedulers)
    * [FAIRseq lr_scheduler](https://github.com/pytorch/fairseq/tree/master/fairseq/optim/lr_scheduler)
    * SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983)
  * Schedulers include `step`, `cosine` with restarts, `tanh` with restarts, `plateau`
* Space-to-Depth by [mrT23](https://github.com/mrT23/TResNet/blob/master/src/models/tresnet/layers/space_to_depth.py) (https://arxiv.org/abs/1801.04590)
* Adaptive Gradient Clipping (https://arxiv.org/abs/2102.06171, https://github.com/deepmind/deepmind-research/tree/master/nfnets)
* An extensive selection of channel and/or spatial attention modules:
  * Bottleneck Transformer - https://arxiv.org/abs/2101.11605
  * CBAM - https://arxiv.org/abs/1807.06521
  * Effective Squeeze-Excitation (ESE) - https://arxiv.org/abs/1911.06667
  * Efficient Channel Attention (ECA) - https://arxiv.org/abs/1910.03151
  * Gather-Excite (GE) - https://arxiv.org/abs/1810.12348
  * Global Context (GC) - https://arxiv.org/abs/1904.11492
  * Halo - https://arxiv.org/abs/2103.12731
  * Involution - https://arxiv.org/abs/2103.06255
  * Lambda Layer - https://arxiv.org/abs/2102.08602
  * Non-Local (NL) - https://arxiv.org/abs/1711.07971
  * Squeeze-and-Excitation (SE) - https://arxiv.org/abs/1709.01507
  * Selective Kernel (SK) - https://arxiv.org/abs/1903.06586
  * Split (SPLAT) - https://arxiv.org/abs/2004.08955
  * Shifted Window (SWIN) - https://arxiv.org/abs/2103.14030

## Results

Model validation results can be found in the [results tables](results/README.md)

## Getting Started (Documentation)

The official documentation can be found at https://huggingface.co/docs/hub/timm. Documentation contributions are welcome.

[Getting Started with PyTorch Image Models (timm): A Practitioner’s Guide](https://towardsdatascience.com/getting-started-with-pytorch-image-models-timm-a-practitioners-guide-4e77b4bf9055-2/) by [Chris Hughes](https://github.com/Chris-hughes10) is an extensive blog post covering many aspects of `timm` in detail.

[timmdocs](http://timm.fast.ai/) is an alternate set of documentation for `timm`. A big thanks to [Aman Arora](https://github.com/amaarora) for his efforts creating timmdocs.

[paperswithcode](https://paperswithcode.com/lib/timm) is a good resource for browsing the models within `timm`.

## Train, Validation, Inference Scripts

The root folder of the repository contains reference train, validation, and inference scripts that work with the included models and other features of this repository. They are adaptable to other datasets and use cases with a little hacking. See [documentation](https://huggingface.co/docs/timm/training_script).

## Awesome PyTorch Resources

One of the greatest assets of PyTorch is the community and their contributions. A few of my favourite resources that pair well with the models and components here are listed below.

### Object Detection, Instance and Semantic Segmentation
* Detectron2 - https://github.com/facebookresearch/detectron2
* Segmentation Models (Semantic) - https://github.com/qubvel/segmentation_models.pytorch
* EfficientDet (Obj Det, Semantic soon) - https://github.com/rwightman/efficientdet-pytorch

### Computer Vision / Image Augmentation
* Albumentations - https://github.com/albumentations-team/albumentations
* Kornia - https://github.com/kornia/kornia

### Knowledge Distillation
* RepDistiller - https://github.com/HobbitLong/RepDistiller
* torchdistill - https://github.com/yoshitomo-matsubara/torchdistill

### Metric Learning
* PyTorch Metric Learning - https://github.com/KevinMusgrave/pytorch-metric-learning

### Training / Frameworks
* fastai - https://github.com/fastai/fastai
* lightly_train - https://github.com/lightly-ai/lightly-train

### Deployment
* timmx (export timm models to ONNX, CoreML, LiteRT, TensorRT, and more) - https://github.com/Boulaouaney/timmx

## Licenses

### Code

The code here is licensed Apache 2.0. I've taken care to make sure any third-party code included or adapted has compatible (permissive) licenses such as MIT, BSD, etc. I've made an effort to avoid any GPL / LGPL conflicts. That said, it is your responsibility to ensure you comply with the licenses here and the conditions of any dependent licenses. Where applicable, I've linked the sources/references for various components in docstrings. If you think I've missed anything, please create an issue.

### Pretrained Weights

So far all of the pretrained weights available here are pretrained on ImageNet, with a select few that have some additional pretraining (see the extra note below). ImageNet was released for non-commercial research purposes only (https://image-net.org/download). It's not clear what the implications of that are for the use of pretrained weights from that dataset. Any models I have trained with ImageNet are done for research purposes, and one should assume that the original dataset license applies to the weights. It's best to seek legal advice if you intend to use the pretrained weights in a commercial product.

#### Pretrained on more than ImageNet

Several weights included or referenced here were pretrained with proprietary datasets that I do not have access to. These include the Facebook WSL, SSL, SWSL ResNe(Xt) and the Google Noisy Student EfficientNet models. The Facebook models have an explicit non-commercial license (CC-BY-NC 4.0, https://github.com/facebookresearch/semi-supervised-ImageNet1K-models, https://github.com/facebookresearch/WSL-Images). The Google models do not appear to have any restriction beyond the Apache 2.0 license (and ImageNet concerns). In either case, you should contact Facebook or Google with any questions.

## Citing

### BibTeX

```
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}
```

### Latest DOI

[![DOI](https://zenodo.org/badge/168799526.svg)](https://zenodo.org/badge/latestdoi/168799526)
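
The `cosine` with restarts schedule listed under the learning rate schedulers follows the SGDR idea (https://arxiv.org/abs/1608.03983). As a rough illustration only, here is a minimal sketch of that schedule in plain Python, assuming fixed-length cycles with an optional length multiplier. `cosine_restart_lr` and its parameters are hypothetical names chosen for this example; they are not part of the `timm` API and this is not the `timm` implementation.

```python
import math


def cosine_restart_lr(step, cycle_len, lr_max, lr_min=0.0, cycle_mul=1.0):
    """Return an SGDR-style learning rate for a 0-based step.

    Hypothetical helper for illustration; not the timm scheduler API.
    """
    if step < 0 or cycle_len <= 0:
        raise ValueError("step must be >= 0 and cycle_len must be > 0")
    # Walk forward through the cycles until `step` falls inside one.
    t_cur, t_i = step, cycle_len
    while t_cur >= t_i:
        t_cur -= t_i
        t_i = max(1, round(t_i * cycle_mul))
    # SGDR: lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t_cur / t_i))
    cosine = 1.0 + math.cos(math.pi * t_cur / t_i)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine


if __name__ == "__main__":
    # Print the rate at a few steps of two 10-step cycles.
    for step in (0, 5, 9, 10, 15):
        print(step, cosine_restart_lr(step, cycle_len=10, lr_max=0.1))
```

The rate starts at `lr_max`, decays along a half cosine towards `lr_min`, and jumps back to `lr_max` at the start of each new cycle. With `cycle_mul=2.0`, each cycle is twice as long as the one before it.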
