clipperhouse/uax29

GitHub: clipperhouse/uax29

A Go text tokenization library based on the Unicode UAX #29 standard, providing Unicode text segmentation at the sentence, phrase, word, and grapheme level.

Stars: 119 | Forks: 7

This package tokenizes (splits) words, sentences, and graphemes, based on [Unicode text segmentation](https://unicode.org/reports/tr29/) (UAX #29), for Unicode 17. Details and usage are in the respective packages:

[uax29/graphemes](https://github.com/clipperhouse/uax29/tree/master/graphemes)

[uax29/words](https://github.com/clipperhouse/uax29/tree/master/words)

[uax29/phrases](https://github.com/clipperhouse/uax29/tree/master/phrases)

[uax29/sentences](https://github.com/clipperhouse/uax29/tree/master/sentences)

### Why tokenize?

### Uses

The uax29 module has 4 tokenizers. From coarsest to finest: sentences → phrases → words → graphemes. Words and graphemes are the most common uses.

You might use `words` for inverted indexes, full-text search, TF-IDF, BM25, embeddings, and so on.

If you are doing embeddings, the definition of a "meaningful unit" will depend on your application. You might choose sentences, phrases, words, or a combination of them.

### Conformance

We use the official [Unicode test suite](https://unicode.org/reports/tr41/tr41-36.html#Tests29). Status:

![Go](https://static.pigsec.cn/wp-content/uploads/repos/cas/ce/ce733292a922c08274cf5a2096f8fa4cf01023bfa51a36ef6beecaaef371a9d9.svg)

## Quick start

```
go get "github.com/clipperhouse/uax29/v2/words"
```

```
package main

import (
	"fmt"

	"github.com/clipperhouse/uax29/v2/words"
)

func main() {
	text := "Hello, 世界. Nice dog! 👍🐶"

	tokens := words.FromString(text)

	for tokens.Next() { // Next() returns true until end of data
		fmt.Println(tokens.Value()) // Do something with the current token
	}
}
```

### See also

[jargon](https://github.com/clipperhouse/jargon), a text pipelines package for CLI and Go, which uses this package.

### Prior art

[blevesearch/segment](https://github.com/blevesearch/segment)

[rivo/uniseg](https://github.com/rivo/uniseg)

### Other language implementations

[C#](https://github.com/clipperhouse/uax29.net) (also by me)

[JavaScript](https://github.com/tc39/proposal-intl-segmenter)

[Rust](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/trait.UnicodeSegmentation.html)

[Java](https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html)

[Python](https://uniseg-python.readthedocs.io/en/latest/)
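### Sketch: building an inverted index from tokens

As a rough illustration of the inverted-index use case mentioned above, the sketch below maps each term to the documents that contain it. It uses only the standard library so it runs without dependencies. The `tokenize` helper is a hypothetical stand-in that splits on non-letter, non-digit runes; it is not the UAX #29 algorithm and its output may differ from what `words` produces. In a real program you would presumably collect tokens from `words.FromString` instead. The name `buildIndex` is likewise an assumption made for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// tokenize is a hypothetical stand-in for a UAX #29 word tokenizer.
// It splits on any rune that is not a letter or digit, which is NOT
// the UAX #29 algorithm; a real program would collect tokens from
// words.FromString instead.
func tokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// buildIndex maps each lowercased term to the IDs of the documents
// that contain it. IDs are slice positions, appended in ascending order.
func buildIndex(docs []string) map[string][]int {
	index := make(map[string][]int)
	for id, doc := range docs {
		seen := make(map[string]bool) // record each term once per document
		for _, tok := range tokenize(doc) {
			term := strings.ToLower(tok)
			if seen[term] {
				continue
			}
			seen[term] = true
			index[term] = append(index[term], id)
		}
	}
	return index
}

func main() {
	docs := []string{
		"Hello, 世界. Nice dog!",
		"Nice day, hello again.",
	}
	index := buildIndex(docs)

	// Map iteration order is random, so sort the terms before printing.
	terms := make([]string, 0, len(index))
	for term := range index {
		terms = append(terms, term)
	}
	sort.Strings(terms)
	for _, term := range terms {
		fmt.Println(term, index[term])
	}
}
```

Lowercasing is one simple normalization choice for the example. A search application would likely need more, such as Unicode case folding or stemming, depending on its requirements.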

Tags: EVTX analysis, Go, NLP, Ruby tools, Unicode, tokenization, text processing, log auditing