ikawrakow/ik_llama.cpp

GitHub: ikawrakow/ik_llama.cpp

llama.cpp 的高性能优化分支，专注于 CPU 和混合推理性能提升，提供先进的量化方案和 DeepSeek MLA 加速支持。

Stars: 2740 | Forks: 351

# ik_llama.cpp: 具有更好 CPU 性能的 llama.cpp 分支 [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT) ## 太长不看本仓库是 [llama.cpp](https://github.com/ggerganov/llama.cpp) 的一个分支，具有更好的 CPU 和混合 GPU/CPU 性能、新的 SOTA 量化类型、一流的 Bitnet 支持、通过 MLA 实现的更好 DeepSeek 性能、FlashMLA、融合 MoE 操作和用于混合 GPU/CPU 推理的张量覆盖、行交错量化打包等。 ## 快速开始 ### 前置条件 ``` git clone https://github.com/ikawrakow/ik_llama.cpp cd ik_llama.cpp ``` 在 Debian/Ubuntu Linux 上，安装所需的软件包（如果使用其他 Linux 发行版，您需要找到相应的软件包并进行调整）： ``` apt-get update && apt-get install build-essential git libcurl4-openssl-dev curl libgomp1 cmake ``` ### 为 CPU 构建 ``` cmake -B build -DGGML_NATIVE=ON cmake --build build --config Release -j$(nproc) ``` ### 为 GPU 构建安装 Nvidia 驱动程序和 [CUDA Toolkit](https://developer.nvidia.com/cuda/toolkit)。 ``` cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON cmake --build build --config Release -j$(nproc) ``` ### 运行将 `.gguf` 模型文件（例如 [bartowski/Qwen_Qwen3-0.6B-IQ4_NL.gguf](https://huggingface.co/bartowski/Qwen_Qwen3-0.6B-GGUF/blob/main/Qwen_Qwen3-0.6B-IQ4_NL.gguf)）下载到您喜欢的目录（例如 `/my_local_files/gguf`）。使用以下命令之一（CPU 或 GPU）启动服务器： ``` ./build/bin/llama-server --model /my_local_files/gguf/Qwen_Qwen3-0.6B-IQ4_NL.gguf --ctx-size 4096 ``` ``` ./build/bin/llama-server --model /my_local_files/gguf/Qwen_Qwen3-0.6B-IQ4_NL.gguf --ctx-size 4096 -ngl 999 ``` 完成！在浏览器中打开 [http://127.0.0.1:8080](http://127.0.0.1:8080) 开始聊天。 ### 包含 llama-swap 的 podman/docker 容器中 ik_llama.cpp 的[分步指南](./docker/README.md) ### [常用参数和选项](./docs/parameters.md) ## 最新消息 ### 模型支持 LlaMA-3-Nemotron [PR 377](https://github.com/ikawrakow/ik_llama.cpp/pull/377), Qwen3 [PR 355](https://github.com/ikawrakow/ik_llama.cpp/pull/355), GLM-4 [PR 344](https://github.com/ikawrakow/ik_llama.cpp/pull/344), Command-A [PR 341](https://github.com/ikawrakow/ik_llama.cpp/pull/341), bitnet-b1.58-2B-4T [PR 337](https://github.com/ikawrakow/ik_llama.cpp/pull/337), LLaMA-4 [PR 321](https://github.com/ikawrakow/ik_llama.cpp/pull/321), Gemma3 [PR 276](https://github.com/ikawrakow/ik_llama.cpp/pull/276), DeepSeek-V3 [PR 176](https://github.com/ikawrakow/ik_llama.cpp/pull/176), Kimi-2 [PR 609](https://github.com/ikawrakow/ik_llama.cpp/pull/609), dots.llm1 [PR 573](https://github.com/ikawrakow/ik_llama.cpp/pull/573), Hunyuan [PR 565](https://github.com/ikawrakow/ik_llama.cpp/pull/565), GLM-4.5 [PR 668](https://github.com/ikawrakow/ik_llama.cpp/pull/668) (4.5/4.6/4.7/AIR), Ernie 4.5 MOE and 0.3B [PR 759](https://github.com/ikawrakow/ik_llama.cpp/pull/759), grok-2 [PR 782](https://github.com/ikawrakow/ik_llama.cpp/pull/782), Ling/Ring (Bailing-MoE2) [PR 833](https://github.com/ikawrakow/ik_llama.cpp/pull/833), Qwen3-VL [PR 883](https://github.com/ikawrakow/ik_llama.cpp/pull/883), SmolLM3 [PR 934](https://github.com/ikawrakow/ik_llama.cpp/pull/934), GigaChat3 [PR 995](https://github.com/ikawrakow/ik_llama.cpp/pull/995), ministral3 [PR 1030](https://github.com/ikawrakow/ik_llama.cpp/pull/1030), Mimo-V2-Flash [PR 1096](https://github.com/ikawrakow/ik_llama.cpp/pull/1096), GLM-4.7-Flash [PR 1168](https://github.com/ikawrakow/ik_llama.cpp/pull/1168), Seed-OSS [PR 1218](https://github.com/ikawrakow/ik_llama.cpp/pull/1218), Step-3.5-Flash [PR 1231](https://github.com/ikawrakow/ik_llama.cpp/pull/1231), GLM-5 [PR 1268](https://github.com/ikawrakow/ik_llama.cpp/pull/1268), Qwen3-Next [PR 1266](https://github.com/ikawrakow/ik_llama.cpp/pull/1266) ### 量化 #### 量化新增内容 ##### Trellis 量化 (`IQ1_KT`, `IQ2_KT`, `IQ3_KT`, `IQ4_KT`) 信息及原始 CUDA 实现见 [PR 113](https://github.com/ikawrakow/ik_llama.cpp/pull/113)。其他实现：Metal [PR 475](https://github.com/ikawrakow/ik_llama.cpp/pull/475)，Neon [PR 471](https://github.com/ikawrakow/ik_llama.cpp/pull/471)，CPU [PR 441](https://github.com/ikawrakow/ik_llama.cpp/pull/441)。`IQ1_KT` 是在 [PR 616](https://github.com/ikawrakow/ik_llama.cpp/pull/616) 中较新添加的。注意：这些基于一种新颖的、基于整数的 trellis，可以实现合理的 CPU 性能，详情见 [PR 529](https://github.com/ikawrakow/ik_llama.cpp/pull/529) 及其中引用的 PR。 ##### IQK 量化信息可在 [Discussion 8](https://github.com/ikawrakow/ik_llama.cpp/discussions/8) 中找到。初始实现（Zen4, AVX2, NEON）：`IQ5_KS_R4` [PR 426](https://github.com/ikawrakow/ik_llama.cpp/pull/426), `IQ5_KS` [PR 422](https://github.com/ikawrakow/ik_llama.cpp/pull/422), `IQ4_KS_R4` [PR 150](https://github.com/ikawrakow/ik_llama.cpp/pull/150), `IQ5_K_R4` [PR 149](https://github.com/ikawrakow/ik_llama.cpp/pull/149), `IQ2_K_R4` [PR 146](https://github.com/ikawrakow/ik_llama.cpp/pull/146), `IQ3_K_R4` [PR 145](https://github.com/ikawrakow/ik_llama.cpp/pull/145), `IQ4_K_R4` [PR 138](https://github.com/ikawrakow/ik_llama.cpp/pull/138), `IQ4_KSS` [PR 89](https://github.com/ikawrakow/ik_llama.cpp/pull/89), `IQ2_KS` [PR 85](https://github.com/ikawrakow/ik_llama.cpp/pull/85), `IQ4_KS` [PR 83](https://github.com/ikawrakow/ik_llama.cpp/pull/83), `IQ6_K` [PR 14](https://github.com/ikawrakow/ik_llama.cpp/pull/14), `IQ2_K, IQ3_K and IQ5_K` [PR 7](https://github.com/ikawrakow/ik_llama.cpp/pull/7), `IQ4_K` [PR 6](https://github.com/ikawrakow/ik_llama.cpp/pull/6) CUDA 实现：`IQ4_KS_R4` and `IQ5_KS_R4` [PR 493](https://github.com/ikawrakow/ik_llama.cpp/pull/493), `IQ1_S_R4` [PR 492](https://github.com/ikawrakow/ik_llama.cpp/pull/492), `IQ1_M_R4` [PR 494](https://github.com/ikawrakow/ik_llama.cpp/pull/494). `IQ4_KS_R4` and `IQ5_KS_R4` [PR 462](https://github.com/ikawrakow/ik_llama.cpp/pull/462), `IQ2_K_R4`, `IQ3_K_R4`, `IQ4_K_R4`, `IQ5_K_R4` [PR 461](https://github.com/ikawrakow/ik_llama.cpp/pull/461), `IQ4_K, IQ5_K, IQ6_K` [PR 417](https://github.com/ikawrakow/ik_llama.cpp/pull/417), `IQ2_KS, IQ2_K, IQ3_K` [PR 418](https://github.com/ikawrakow/ik_llama.cpp/pull/417) `IQ2_KL` 是在 [PR 602](https://github.com/ikawrakow/ik_llama.cpp/pull/602) 中较新的添加 ##### K-cache 的 Hadamard 变换 CPU [PR 1033](https://github.com/ikawrakow/ik_llama.cpp/pull/1033) 和 CUDA [PR 1034](https://github.com/ikawrakow/ik_llama.cpp/pull/1034) ##### gpt-oss 模型中使用的 MXFP4 为 Zen4, AVX2, ARM_NEON, Metal, CUDA 实现 [PR 682](https://github.com/ikawrakow/ik_llama.cpp/pull/682) #### 量化改进 `IQ1_M` [PR 327](https://github.com/ikawrakow/ik_llama.cpp/pull/327), `IQ2_XS` [PR 312](https://github.com/ikawrakow/ik_llama.cpp/pull/312), `Q2_K, Q4_K, Q5_K, Q4_1, Q5_1` [PR 302](https://github.com/ikawrakow/ik_llama.cpp/pull/302), `Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL` [PR 295](https://github.com/ikawrakow/ik_llama.cpp/pull/295) #### 量化性能改进 * 所有非交错量化的 CPU prompt 处理速度大幅提升。初始想法见 [PR 515](https://github.com/ikawrakow/ik_llama.cpp/pull/515) 和 [PR 531](https://github.com/ikawrakow/ik_llama.cpp/pull/531)，随后有许多 PR 将其应用于 3 个支持的 CPU 平台的所有量化类型。 * 所有量化类型现在都有了量化矩阵乘法 CUDA 内核，见 [PR 557](https://github.com/ikawrakow/ik_llama.cpp/pull/515) 及其他几个。 * Trellis 量化和 MoE 模型的 CPU prompt 处理更快。[PR 488](https://github.com/ikawrakow/ik_llama.cpp/pull/488) * Trellis 量化：更快的 CPU prompt 处理 [PR 482](https://github.com/ikawrakow/ik_llama.cpp/pull/482)。 * CUDA 上 `iq2_ks` TG 性能小幅提升（~2%）[PR 468](https://github.com/ikawrakow/ik_llama.cpp/pull/468) * 更快的 `IQ3_KT` 和 `IQ4_KT` [PR 453](https://github.com/ikawrakow/ik_llama.cpp/pull/453) * Zen4：`IQ2_KS, IQ4_KS, IQ5_KS` 的 PP 更快 [PR 428](https://github.com/ikawrakow/ik_llama.cpp/pull/428) * `IQ1_S` 的快速 GEMM/GEMV [PR 212](https://github.com/ikawrakow/ik_llama.cpp/pull/212) ### 功能 * 多 GPU 设置的新分割模式 "graph" [PR 1022](https://github.com/ikawrakow/ik_llama.cpp/pull/1022) * 所有补全的字符串禁止功能 [PR 1185](https://github.com/ikawrakow/ik_llama.cpp/pull/1185) * OpenAI `/v1/responses` API 端点 [PR 1184](https://github.com/ikawrakow/ik_llama.cpp/pull/1184) * 函数调用支持 [PR 628](https://github.com/ikawrakow/ik_llama.cpp/pull/628) * jinja 模板支持 [PR 677](https://github.com/ikawrakow/ik_llama.cpp/pull/677) * Webui：对话、设置和聊天消息的新功能 [PR 618](https://github.com/ikawrakow/ik_llama.cpp/pull/618) * `convert_hf_to_gguf.py` 中的旧版量化转换方案 [PR 449](https://github.com/ikawrakow/ik_llama.cpp/pull/449)，`Q6_0` 在 [PR 483](https://github.com/ikawrakow/ik_llama.cpp/pull/483) * Adaptive-P 采样器 [PR 1100](https://github.com/ikawrakow/ik_llama.cpp/pull/1100) 按其作者的设计实现；在 Webui 上支持 * `llama-mtmd-cli` 中的多模态视觉支持 [PR 798](https://github.com/ikawrakow/ik_llama.cpp/pull/798) 和 `llama-server` 中的支持 [PR 901](https://github.com/ikawrakow/ik_llama.cpp/pull/901) * mikupad 作为替代 WebUI [PR 558](https://github.com/ikawrakow/ik_llama.cpp/pull/558) * 2025年6月8日：Webui 更新（传递 `--path ./examples/server/public_legacy` 时仍可使用旧版）[PR 481](https://github.com/ikawrakow/ik_llama.cpp/pull/481) * 2025年6月8日：RPC 改进 [PR 480](https://github.com/ikawrakow/ik_llama.cpp/pull/480) * 2025年6月7日：向服务器添加列出所有已保存 prompt cache 的端点 [PR 502](https://github.com/ikawrakow/ik_llama.cpp/pull/502) * 2025年6月6日：使 prompt cache 保存和恢复感知 MLA [PR 497](https://github.com/ikawrakow/ik_llama.cpp/pull/497) * 2025年6月3日：添加了采样器，XTC [PR 486](https://github.com/ikawrakow/ik_llama.cpp/pull/486), top-n σ [PR 489](https://github.com/ikawrakow/ik_llama.cpp/pull/489). * 2025年5月22日：重构 `iqk_mul_mat.cpp`，显著加快编译时间。[PR 435](https://github.com/ikawrakow/ik_llama.cpp/pull/435) * 2025年5月17日：启用或禁用 CPU FA 内核的选项 [PR 429](https://github.com/ikawrakow/ik_llama.cpp/pull/429)。 * 2025年5月12日：用户现在可以控制是否/哪些涉及 RAM 中张量的操作被卸载到 GPU。见 [PR 405](https://github.com/ikawrakow/ik_llama.cpp/pull/405) * 2025年5月12日：与启用了 MLA 的 DeepSeek 模型的主线 `llama.cpp` GGUF 的兼容性问题已在 [PR 394](https://github.com/ikawrakow/ik_llama.cpp/pull/394) 中解决。使用 `llama.cpp` 风格的 MLA GGUF 导致的较低 prompt 处理性能已在 [PR 409](https://github.com/ikawrakow/ik_llama.cpp/pull/409) 中恢复。 * 2025年4月21日：ik_llama.cpp 在 Android（使用 termux）上成功构建和运行，见 [PR 336](https://github.com/ikawrakow/ik_llama.cpp/pull/336) * 2025年3月1日：智能专家缩减以实现更快的 DeepSeek 推理 [PR 239](https://github.com/ikawrakow/ik_llama.cpp/pull/239) * 2025年2月25日：张量覆盖，以更好地控制模型权重的存储位置（GPU 或 CPU）[PR 232](https://github.com/ikawrakow/ik_llama.cpp/pull/232) * 2025年2月23日：`sweep-bench` - 更好的性能基准测试 [PR 225](https://github.com/ikawrakow/ik_llama.cpp/pull/225) * 2025年2月19日：`Q8_KV` - 8位 KV-cache 量化的新类型 [PR 208](https://github.com/ikawrakow/ik_llama.cpp/pull/208) * 2025年3月7日：使用正则表达式的自定义量化组合 [PR 244](https://github.com/ikawrakow/ik_llama.cpp/pull/244) ### 性能改进 * 使用混合 HPU/CPU 推理时 MoE 模型更好的 GPU 卸载策略，见 [PR 520](https://github.com/ikawrakow/ik_llama.cpp/pull/520) * 更快的 rng 采样 [PR 1187](https://github.com/ikawrakow/ik_llama.cpp/pull/1187) * 2025年5月13日：DeepSeek-Lite 更好的 CPU FA 性能。[PR 410](https://github.com/ikawrakow/ik_llama.cpp/pull/410) * 2025年5月11日：CUDA 上 Deep 模型的 Flash Attention 稍快，并扩展了对 Touring 或更新 GPU 的兼容性。[PR 408](https://github.com/ikawrakow/ik_llama.cpp/pull/408) * 2025年5月4日：带有 Flash Attention 的 GQA 模型在 CUDA 上的 token 生成性能显著提升。详情和基准测试见 [PR 370](https://github.com/ikawrakow/ik_llama.cpp/pull/370) * 2025年4月17日：更好的 CPU Flash Attention token 生成性能。[PR 332](https://github.com/ikawrakow/ik_llama.cpp/pull/332) * 2025年4月3日：Metal 上快得多的 MoE 实现。[PR 307](https://github.com/ikawrakow/ik_llama.cpp/pull/307) * 2025年3月25日：CUDA 上更好的 MoE 性能 [PR 283](https://github.com/ikawrakow/ik_llama.cpp/pull/283) * 2025年3月23日：DeepSeek 模型更好的批处理速度 [PR 282](https://github.com/ikawrakow/ik_llama.cpp/pull/282) * 2025年3月18日：减少计算缓冲区大小 [PR 237](https://github.com/ikawrakow/ik_llama.cpp/pull/237) * 2025年3月10日：CUDA 上 MoE 模型更好的 TG 性能 [PR 248](https://github.com/ikawrakow/ik_llama.cpp/pull/248) * 2025年2月23日：融合 FFN 操作以实现更快的 MoE 推理 [PR 229](https://github.com/ikawrakow/ik_llama.cpp/pull/229) ### Flash-MLA * 2025年5月7日：🚀 CUDA 上 DeepSeek 模型的 FlashMLA-3。[PR 386](https://github.com/ikawrakow/ik_llama.cpp/pull/386)。注意：需要 Ampere 或更新的 Nvidia GPU * 2025年3月21日：🚀 FlashMLA-3：DeepSeek 模型最快的纯 CPU 推理 [PR 273](https://github.com/ikawrakow/ik_llama.cpp/pull/273) * 2025年3月17日：🚀 FlashMLA-2 性能改进 [PR 253](https://github.com/ikawrakow/ik_llama.cpp/pull/253) * 2025年3月12日：允许在 CUDA 上的 FlashMLA-2 中使用 `Q8_0` KV cache [PR 265](https://github.com/ikawrakow/ik_llama.cpp/pull/265) * 2025年3月9日：🚀 CUDA 上的 FlashMLA [PR 247](https://github.com/ikawrakow/ik_llama.cpp/pull/247) * 2025年3月8日：🚀 更快的 FlashMLA CPU 实现 [PR 243](https://github.com/ikawrakow/ik_llama.cpp/pull/243) * 2025年3月3日：🚀 推出 FlashMLA - 带 Flash Attention 的 MLA [PR 240](https://github.com/ikawrakow/ik_llama.cpp/pull/240) * 2025年2月27日：无转置 cache 的 MLA [PR 235](https://github.com/ikawrakow/ik_llama.cpp/pull/235) * 2025年2月13日：允许 MLA 使用 `Q8_0` 量化 cache [PR 206](https://github.com/ikawrakow/ik_llama.cpp/pull/206) * 2025年2月11日：🚀 DeepSeek 模型的 Flash Attention 支持 [PR 200](https://github.com/ikawrakow/ik_llama.cpp/pull/200) * 2025年2月9日：🚀 DeepSeek 模型的 MLA [PR 188](https://github.com/ikawrakow/ik_llama.cpp/pull/188) ### 修复 * 修复 MMVQ 内核中的 bug [PR 446](https://github.com/ikawrakow/ik_llama.cpp/pull/446) * 修复 `IQ4_K, IQ4_KS, IQ5_K, IQ6_K` 的 AVX2 实现 [PR 427](https://github.com/ikawrakow/ik_llama.cpp/pull/427) * 修复 CPU 上的标准 attention [PR 421](https://github.com/ikawrakow/ik_llama.cpp/pull/421) * 修复 MLA 模型的 imatrix 计算 [PR 411](https://github.com/ikawrakow/ik_llama.cpp/pull/411) * 修复 Touring 上新的 CUDA FA [PR 413](https://github.com/ikawrakow/ik_llama.cpp/pull/413) * 修复 SER。CPU：[PR 415](https://github.com/ikawrakow/ik_llama.cpp/pull/415) CUDA：[PR 416](https://github.com/ikawrakow/ik_llama.cpp/pull/416) ## 资源没有单一的参考点描述所有新的 `ik_llama.cpp` 功能。Pull requests 通常包含详细信息，因此浏览 PR 通常是了解新功能及其使用方法的最佳方式。此外： * [Wiki 页面](https://github.com/ikawrakow/ik_llama.cpp/wiki) 有与主线 `llama.cpp` 的性能比较 * [本指南](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) 是如果您是因为 DeepSeek 模型而来这里的一个好起点 * [此讨论](https://github.com/ikawrakow/ik_llama.cpp/discussions/266) 是关于在 16 x 3090 设置上运行 DeepSeek-V3/R1 * [此讨论](https://github.com/ikawrakow/ik_llama.cpp/discussions/8) 描述了 `ik_llama.cpp` 中可用的新的量化类型 ## 测试 ### 函数调用测试运行函数调用测试套件： ``` cd build cmake --build . --target test-function-calls ./bin/test-function-calls ``` 测试套件涵盖解析器功能、流式传输、错误处理、内容清理和服务器集成。所有测试都应通过以确保生产就绪。 ## 贡献欢迎以 pull requests、问题提交（错误报告、功能请求）或一般讨论的形式进行贡献。 ## 许可证 MIT

标签：Bash脚本, Bitnet, C++, CPU优化, CUDA, DeepSeek, DLL 劫持, DNS解析, FlashMLA, GGUF, llama.cpp, LLM, MLA, MoE, SOTA, Unmanaged PE, Vectored Exception Handling, 人工智能, 大语言模型, 实时告警, 开源项目, 性能加速, 推理框架, 数据擦除, 本地推理, 模型部署, 混合推理, 熵值分析, 用户模式Hook绕过, 逆向工具, 量化