google/sentencepiece

GitHub: google/sentencepiece

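An open-source text tokenizer from Google for neural-network text systems. It supports the BPE and unigram algorithms and trains subword models directly from raw text.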

Stars: 11691 | Forks: 1334

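# SentencePiece

[![Build C++](https://static.pigsec.cn/wp-content/uploads/repos/2026/03/2f619d0006194112.svg)](https://github.com/google/sentencepiece/actions/workflows/cmake.yml)
[![Build Wheels](https://static.pigsec.cn/wp-content/uploads/repos/2026/03/9deb5c0002194113.svg)](https://github.com/google/sentencepiece/actions/workflows/wheel.yml)
[![GitHub Issues](https://img.shields.io/github/issues/google/sentencepiece.svg)](https://github.com/google/sentencepiece/issues)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sentencepiece)
[![PyPI version](https://badge.fury.io/py/sentencepiece.svg)](https://badge.fury.io/py/sentencepiece)
[![PyPi downloads](https://img.shields.io/pypi/dm/sentencepiece?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/sentencepiece/)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
[![SLSA 3](https://slsa.dev/images/gh-badge-level3.svg)](https://slsa.dev)

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems in which the vocabulary size is fixed before the neural model is trained. SentencePiece implements **subword units** (e.g., **byte-pair encoding (BPE)** [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)] and the **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)]) and extends them with direct training from raw sentences. SentencePiece lets us build a purely end-to-end system that does not depend on language-specific pre/postprocessing.

**This is not an official Google product.**

## Technical highlights

- **Purely data driven**: SentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is not always required.
- **Language independent**: SentencePiece treats sentences purely as sequences of Unicode characters. There is no language-dependent logic.
- **Multiple subword algorithms**: **BPE** [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)] and the **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)] are supported.
- **Subword regularization**: SentencePiece implements subword sampling for [subword regularization](https://arxiv.org/abs/1804.10959) and [BPE-dropout](https://arxiv.org/abs/1910.13267), which helps improve the robustness and accuracy of NMT models.
- **Fast and lightweight**: Segmentation speed is around 50k sentences/sec, and the memory footprint is around 6MB.
- **Self-contained**: The same tokenization/detokenization is obtained as long as the same model file is used.
- **Direct vocabulary ID generation**: SentencePiece manages the vocabulary-to-ID mapping and can generate vocabulary ID sequences directly from raw sentences.
- **NFKC-based normalization**: SentencePiece performs NFKC-based text normalization.

For those unfamiliar with SentencePiece as a software/algorithm, a gentle introduction is available [here](https://medium.com/@jacky2wong/understanding-sentencepiece-under-standing-sentence-piece-ac8da59f6b08).

## Comparisons with other implementations

| Feature | SentencePiece | [subword-nmt](https://github.com/rsennrich/subword-nmt) | [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) |
| :-------------------------------------- | :--------------------------------------------: | :-----------------------------------------------------: | :-----------------------------------------------: |
| Supported algorithm | BPE, unigram, char, word | BPE | BPE\* |
| OSS? | Yes | Yes | Google internal |
| Subword regularization | [Yes](#subword-regularization-and-bpe-dropout) | No | No |
| Python library | [Yes](python/README.md) | No | N/A |
| C++ library | [Yes](doc/api.md) | No | N/A |
| Pre-segmentation required? | [No](#whitespace-is-treated-as-a-basic-symbol) | Yes | Yes |
| Customizable normalization (e.g., NFKC) | [Yes](doc/normalization.md) | No | N/A |
| Direct ID generation | [Yes](#end-to-end-example) | No | N/A |

\* Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

## Overview

### What is SentencePiece?

SentencePiece is a re-implementation of **subword units**, an effective way to alleviate the open-vocabulary problem in neural machine translation. SentencePiece supports two segmentation algorithms, **byte-pair encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and the **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)]. The high-level differences from other implementations are described below.

#### The number of unique tokens is predetermined

Neural machine translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model so that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece takes the final vocabulary size as its training parameter, which differs from [subword-nmt](https://github.com/rsennrich/subword-nmt), which takes the number of merge operations. The number of merge operations is a BPE-specific parameter and does not apply to other segmentation algorithms, including unigram, word, and character.

#### Trains from raw sentences

Previous subword implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it complicates preprocessing because language-dependent tokenizers must be run in advance. The SentencePiece implementation is fast enough to train the model from raw sentences. This is useful for training tokenizers and detokenizers for Chinese and Japanese, where no explicit spaces exist between words.

#### Whitespace is treated as a basic symbol

The first step of natural language processing is text tokenization. For example, a standard English tokenizer would segment the text "Hello world." into the following three tokens: `[Hello] [World] [.]`

One observation is that the original input and the tokenized sequence are **not reversibly convertible**. For instance, the information that there is no space between “World” and “.” is dropped from the tokenized sequence, since e.g., `Tokenize(“World.”) == Tokenize(“World .”)`

SentencePiece treats the input text purely as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle whitespace explicitly as a basic token, SentencePiece first escapes it with the meta symbol "▁" (U+2581), as in `Hello▁World.`

Then this text is segmented into small pieces, for example: `[Hello] [▁Wor] [ld] [.]`

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguity.

```
detokenized = ''.join(pieces).replace('▁', ' ')
```

This feature makes it possible to perform detokenization without relying on language-specific resources.

Note that the same lossless conversion cannot be applied when splitting the sentence with standard word segmenters, since they treat whitespace as a special symbol. Tokenized sequences do not preserve the information needed to restore the original sentence.

- (en) Hello world. → [Hello] [World] [.] (a space between Hello and World)
- (ja) こんにちは世界。 → [こんにちは] [世界] [。] (no space between こんにちは and 世界)

#### Subword regularization and BPE-dropout

Subword regularization [[Kudo.](https://arxiv.org/abs/1804.10959)] and BPE-dropout [[Provilkov et al.](https://arxiv.org/abs/1910.13267)] are simple regularization methods that virtually augment the training data with on-the-fly subword sampling, which helps improve both the accuracy and the robustness of NMT models.

To enable subword regularization, you need to integrate the SentencePiece library ([C++](doc/api.md#sampling-subword-regularization)/[Python](python/README.md)) into the NMT system so that one segmentation is sampled for each parameter update, which differs from standard offline data preparation. Here is an example using the [Python library](python/README.md). You can see that 'New York' is segmented differently on each call to `SampleEncode` (C++) or `encode` with `enable_sampling=True` (Python). Details of the sampling parameters are found in [sentencepiece_processor.h](src/sentencepiece_processor.h).

```
>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
```

## Installation

### Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

```
pip install sentencepiece
```

For more detail, see [Python module](python/README.md).

### Build and install SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

- [cmake](https://cmake.org/)
- C++11 compiler
- [gperftools](https://github.com/gperftools/gperftools) library (optional; a 10-40% performance improvement can be obtained)

On Ubuntu, the build tools can be installed with apt-get:

```
% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
```

Then you can build and install the command line tools as follows.

```
% git clone https://github.com/google/sentencepiece.git
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v
```

On OSX/macOS, replace the last command with `sudo update_dyld_shared_cache`.

### Build and install using vcpkg

You can download and install sentencepiece using the [vcpkg](https://github.com/Microsoft/vcpkg) dependency manager:

```
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece
```

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.

### Download and install SentencePiece from signed release wheels

You can download the wheel from the [GitHub releases page](https://github.com/google/sentencepiece/releases/latest). We generate [SLSA3 signatures](https://slsa.dev) using the OpenSSF's [slsa-framework/slsa-github-generator](https://github.com/slsa-framework/slsa-github-generator) during the release process. To verify a release binary:

1. Install the verification tool from [slsa-framework/slsa-verifier#installation](https://github.com/slsa-framework/slsa-verifier#installation).
2. Download the provenance file `attestation.intoto.jsonl` from the [GitHub releases page](https://github.com/google/sentencepiece/releases/latest).
3. Run the verifier:

```
slsa-verifier -artifact-path <the-wheel> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>
```

Once the wheel is verified, install it with `pip install wheel_file.whl`.

## Usage instructions

### Train SentencePiece model

```
% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
```

- `--input`: one-sentence-per-line **raw** corpus file. There is no need to run a tokenizer, normalizer, or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
- `--model_prefix`: output model name prefix. `<model_name>.model` and `<model_name>.vocab` are generated.
- `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000.
- `--character_coverage`: fraction of characters covered by the model. Good defaults are `0.9995` for languages with a rich character set, such as Japanese or Chinese, and `1.0` for other languages with a small character set.
- `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentences must be pre-tokenized when using the `word` type.

Use the `--help` flag to display all training parameters, or see [here](doc/options.md) for an overview.

### Encode raw text into sentence pieces/ids

```
% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output
```

Use the `--extra_options` flag to insert the BOS/EOS markers or to reverse the input sequence.

```
% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
```

SentencePiece supports nbest segmentation and segmentation sampling with the `--output_format=(nbest|sample)_(piece|id)` flags.

```
% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
```

### Decode sentence pieces/ids into raw text

```
% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output
```

Use the `--extra_options` flag to decode the text in reverse order.

```
% spm_decode --extra_options=reverse < input > output
```

### End-to-end example

```
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
...
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
```

You can see that the original input sentence is restored from the vocabulary ID sequence.

### Export vocabulary list

```
% spm_export_vocab --model=<model_file> --output=<output file>
```

`<output file>` stores the vocabulary list and the emission log probabilities. The vocabulary ID corresponds to the line number in this file.

### Redefine special meta tokens

By default, SentencePiece uses the unknown (`<unk>`), BOS (`<s>`), and EOS (`</s>`) tokens, which have the IDs 0, 1, and 2 respectively. This mapping can be redefined in the training phase as follows.

```
% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...
```

Setting an ID to -1, e.g., `bos_id=-1`, disables that special token. Note that the unknown ID cannot be disabled. An ID for padding (`<pad>`) can be defined with `--pad_id=3`.

If you want to assign other special tokens, see [Use custom symbols](doc/special_symbols.md).

### Vocabulary restriction

`spm_encode` accepts the `--vocabulary` and `--vocabulary_threshold` options so that it only produces symbols that also appear in the vocabulary (with at least some frequency). The background of this feature is described on the [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt).

The usage is basically the same as that of `subword-nmt`. Assuming that L1 and L2 are the two languages (source/target), train the shared spm model and get the resulting vocabulary for each language:

```
% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2
```

The `shuffle` command is used just in case, because `spm_train` loads the first 10M lines of the corpus by default.

Then segment the train/test corpus with the `--vocabulary` option:

```
% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
```

## Advanced topics

- [SentencePiece Experiments](doc/experiments.md)
- [SentencePieceProcessor C++ API](doc/api.md)
- [Use custom text normalization rules](doc/normalization.md)
- [Use custom symbols](doc/special_symbols.md)
- [Python Module](python/README.md)
- [Segmentation and training algorithms in detail]

## Related projects

These are projects related to SentencePiece. They are managed independently. Please send a pull request (PR) if you would like to add one.

- [Java tools/bindings for SentencePiece](https://mvnrepository.com/artifact/io.github.eix128/sentencepiece4j)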
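
As a footnote to the whitespace-escaping scheme described earlier (the "▁" meta symbol, U+2581), here is a minimal pure-Python sketch of why the round trip is lossless. The helper names `escape_whitespace`, `toy_segment`, and `detokenize` and the hand-picked cut points are hypothetical and are not part of the SentencePiece API. A real model learns where to cut and also normalizes the text, which this sketch ignores.

```python
from __future__ import annotations

# Illustrative sketch only: toy helpers, not the SentencePiece API.
META = "\u2581"  # "▁" (U+2581), the meta symbol used to escape whitespace


def escape_whitespace(text: str) -> str:
    """Replace each space with the meta symbol so it survives segmentation."""
    return text.replace(" ", META)


def toy_segment(escaped: str, boundaries: list[int]) -> list[str]:
    """Cut the escaped text at hand-picked character offsets (hypothetical)."""
    cuts = [0, *boundaries, len(escaped)]
    return [escaped[a:b] for a, b in zip(cuts, cuts[1:])]


def detokenize(pieces: list[str]) -> str:
    """Join the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(META, " ")


text = "Hello World."
escaped = escape_whitespace(text)
pieces = toy_segment(escaped, [5, 9, 11])  # assumed cut points, chosen by hand
print(pieces)
print(detokenize(pieces) == text)
```

Because the space travels inside a piece instead of being discarded, any choice of cut points gives back the original string. The sketch assumes the input does not already contain "▁".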
