flashinfer-ai/flashinfer

GitHub: flashinfer-ai/flashinfer

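FlashInfer is a high-performance GPU kernel library for LLM inference serving. It provides a unified API and multiple backend implementations for key operations such as attention, GEMM, and MoE.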

Stars: 5995 | Forks: 1177

High-Performance GPU Kernels for Inference

[![Build Status](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/badge/icon)](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/) [![Documentation](https://static.pigsec.cn/wp-content/uploads/repos/cas/2e/2e6605448da1aba1e51a19b950d8f67f57eca3ff1197120a14567fff1b1f0a8b.svg)](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml)

**FlashInfer** is a library and kernel generator for inference that delivers state-of-the-art performance across a range of GPU architectures. It provides a unified API for attention, GEMM, and MoE operations, with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

## Why FlashInfer?

- **State-of-the-art performance**: Kernels optimized for prefill, decode, and mixed batching scenarios
- **Multiple backends**: Automatically selects the best backend for your hardware and workload
- **Modern architecture support**: SM75 (Turing) and newer, up to Blackwell
- **Low-precision compute**: FP8 and FP4 quantization for attention, GEMM, and MoE operations
- **Production-ready**: Compatible with CUDAGraph and torch.compile for low-latency serving

## Core Features

### Attention Kernels

- **Paged and Ragged KV-Cache**: Efficient memory management for dynamic batch serving
- **Decode, Prefill, and Append**: Optimized kernels for all attention phases
- **MLA Attention**: Native support for DeepSeek's Multi-head Latent Attention
- **Cascade Attention**: Memory-efficient hierarchical KV-Cache for shared prefixes
- **Sparse Attention**: Block-sparse and variable block-sparse patterns
- **POD-Attention**: Fused prefill+decode for mixed batching

### GEMM and Linear Operations

- **BF16 GEMM**: BF16 matrix multiplication for SM10.0+ GPUs
- **FP8 GEMM**: Per-tensor and groupwise scaling
- **FP4 GEMM**: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
- **Grouped GEMM**: Efficient batched matrix operations for LoRA and multi-expert routing

### Mixture of Experts (MoE)

- **Fused MoE kernels**
- **Multiple routing methods**: DeepSeek-V3, Llama-4, and standard top-k routing
- **Quantized MoE**: FP8 and FP4 expert weights with block-wise scaling

### Sampling and Decoding

- **Sorting-free sampling**: Efficient Top-K, Top-P, and Min-P without sorting
- **Speculative decoding**: Chain speculative sampling

### Communication

- **AllReduce**: Custom implementations
- **Multi-node NVLink**: MNNVL support for multi-node inference
- **NVSHMEM integration**: For distributed memory operations

### Other Operators

- **RoPE**: LLaMA-style rotary position embeddings (including LLaMA 3.1)
- **Normalization**: RMSNorm, LayerNorm, and Gemma-style fused operations
- **Activations**: SiLU and GELU with fused gating

## GPU Support

| Architecture | Compute Capability | Example GPUs |
|--------------|--------------------|--------------|
| Turing | SM 7.5 | T4, RTX 20 series |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |
| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.0, 10.3 | B200, B300 |
| Blackwell | SM 11.0 | Jetson Thor |
| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark |

## News

Latest release: [![GitHub Release](https://img.shields.io/github/v/release/flashinfer-ai/flashinfer)](https://github.com/flashinfer-ai/flashinfer/releases/latest)

Notable updates:

- [2025-10-08] Blackwell support added in [v0.4.0](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.4.0)
- [2025-03-10] [Blog post](https://flashinfer.ai/2025/03/10/sampling.html) "Sorting-Free GPU Kernels for LLM Sampling", which explains the design of the sampling kernels in FlashInfer.

## Getting Started

### Installation

**Quickstart:**

```bash
pip install flashinfer-python
```

**Package options:**

- **flashinfer-python**: Core package; compiles or downloads kernels on first use
- **flashinfer-cubin**: Precompiled kernel binaries for all supported GPU architectures
- **flashinfer-jit-cache**: Prebuilt kernel cache for a specific CUDA version

**For faster initialization and offline use**, install the optional packages so that most kernels are compiled ahead of time:

```bash
pip install flashinfer-python flashinfer-cubin
# JIT cache (replace cu129 with your CUDA version)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129
```

**For Blackwell (SM100+) CuTe DSL kernels**, install with the CUDA 13 extra to enable the Blackwell-optimized kernels:

```bash
pip install "flashinfer-python[cu13]"
```

### Verify the Installation

```bash
flashinfer show-config
```

### Basic Usage

```python
import torch
import flashinfer

# Single-request decode attention
q = torch.randn(32, 128, device="cuda", dtype=torch.float16)  # [num_qo_heads, head_dim]
k = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)  # [kv_len, num_kv_heads, head_dim]
v = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)  # same layout as k
output = flashinfer.single_decode_with_kv_cache(q, k, v)
```

See the [documentation](https://docs.flashinfer.ai/) for the full API reference and tutorials.

### Install from Source

```bash
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
python -m pip install -v .
```

**For development**, install in editable mode:

```bash
python -m pip install --no-build-isolation -e . -v
```

Build the optional packages:

```bash
# flashinfer-cubin
cd flashinfer-cubin
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
```

```bash
# flashinfer-jit-cache (customize for your target GPUs)
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
```

For more details, see the [Install from Source documentation](https://docs.flashinfer.ai/installation.html#install-from-source).

### Nightly Builds

```bash
pip install -U --pre flashinfer-python --index-url https://flashinfer.ai/whl/nightly/ --no-deps
pip install flashinfer-python  # Install dependencies from PyPI
pip install -U --pre flashinfer-cubin --index-url https://flashinfer.ai/whl/nightly/
# JIT cache (replace cu129 with your CUDA version)
pip install -U --pre flashinfer-jit-cache --index-url https://flashinfer.ai/whl/nightly/cu129
```

### CLI Tools

FlashInfer provides several CLI commands for configuration, module management, and development:

```bash
# Verify the installation and view the configuration
flashinfer show-config

# List and inspect modules
flashinfer list-modules
flashinfer module-status

# Manage artifacts and cache
flashinfer download-cubin
flashinfer clear-cache

# For developers: generate compile_commands.json for IDE integration
flashinfer export-compile-commands [output_path]
```

For complete documentation, see the [CLI reference](https://docs.flashinfer.ai/cli.html).

## API Logging

FlashInfer provides comprehensive API logging for debugging. Enable it with environment variables:

```bash
# Enable logging (levels: 0=off (default), 1=basic, 3=detailed, 5=statistics)
export FLASHINFER_LOGLEVEL=3

# Set the log destination (stdout (default), stderr, or a file path)
export FLASHINFER_LOGDEST=stdout
```

For details on log levels, configuration, and advanced features, see [Logging](https://docs.flashinfer.ai/logging.html) in our documentation.

## Custom Attention Variants

Users can define their own attention variants with additional parameters. For more details, see our [JIT examples](https://github.com/flashinfer-ai/flashinfer/blob/main/tests/utils/test_jit_example.py).

## CUDA Support

**Supported CUDA versions:** 12.6, 12.8, 13.0, 13.1

## Adoption

FlashInfer powers inference in the following projects:

- [SGLang](https://github.com/sgl-project/sglang)
- [vLLM](https://github.com/vllm-project/vllm)
- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
- [TGI (Text Generation Inference)](https://github.com/huggingface/text-generation-inference)
- [MLC-LLM](https://github.com/mlc-ai/mlc-llm)
- [LightLLM](https://github.com/ModelTC/lightllm)
- [lorax](https://github.com/predibase/lorax)
- [ScaleLLM](https://github.com/vectorch-ai/ScaleLLM)

## Acknowledgements

FlashInfer is inspired by [FlashAttention](https://github.com/dao-AILab/flash-attention/), [vLLM](https://github.com/vllm-project/vllm), [stream-K](https://arxiv.org/abs/2301.03598), [CUTLASS](https://github.com/nvidia/cutlass), and [AITemplate](https://github.com/facebookincubator/AITemplate).

## Citation

If you find FlashInfer helpful in your project or research, please consider citing our [paper](https://arxiv.org/abs/2501.01005):

```bibtex
@article{ye2025flashinfer,
  title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},
  author = {
    Ye, Zihao and Chen, Lequn and Lai, Ruihang and Lin, Wuwei and
    Zhang, Yineng and Wang, Stephanie and Chen, Tianqi and
    Kasikci, Baris and Grover, Vinod and Krishnamurthy, Arvind and
    Ceze, Luis
  },
  journal = {arXiv preprint arXiv:2501.01005},
  year = {2025},
  url = {https://arxiv.org/abs/2501.01005}
}
```
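
For readers who want to sanity-check the basic usage example (`flashinfer.single_decode_with_kv_cache`), the sketch below shows what single-request decode attention computes, written in dependency-free Python. It is not FlashInfer code. The function name `reference_single_decode` is hypothetical, and the sketch assumes the `[kv_len, num_kv_heads, head_dim]` layout from the example, an equal number of query and KV heads, the conventional `1/sqrt(head_dim)` softmax scale, and no positional encoding.

```python
import math


def reference_single_decode(q, k, v):
    """Reference decode attention on nested Python lists.

    q: [num_heads][head_dim]
    k, v: [kv_len][num_heads][head_dim]
    Returns: [num_heads][head_dim]
    Hypothetical helper for illustration; not part of FlashInfer.
    """
    num_heads, head_dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(head_dim)  # assumed softmax scale
    out = []
    for h in range(num_heads):
        # Scaled dot product between the single query and every cached key.
        scores = [scale * sum(a * b for a, b in zip(q[h], k_t[h])) for k_t in k]
        # Numerically stable softmax over the kv_len axis.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Softmax-weighted sum of the cached values for this head.
        out.append([
            sum(w * v_t[h][d] for w, v_t in zip(weights, v)) / total
            for d in range(head_dim)
        ])
    return out


# Two cached tokens with identical keys receive equal weight,
# so the output is the mean of the two value vectors.
q = [[1.0, 0.0]]
k = [[[0.0, 0.0]], [[0.0, 0.0]]]
v = [[[1.0, 2.0]], [[3.0, 4.0]]]
result = reference_single_decode(q, k, v)
```

On small inputs, the result can be compared with a FlashInfer output converted via `output.float().cpu().tolist()`, allowing for float16 rounding error.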
