# sentencesplit
sentencesplit is a rule-based sentence boundary detection library that works out of the box for many languages.
It is a direct port of the Ruby gem [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter), providing rule-based sentence boundary detection.
## Installation
**Requires Python 3.11+.**
```
pip install sentencesplit
```
## Usage
- sentencesplit currently supports 23 languages.
```
import sentencesplit
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = sentencesplit.Segmenter(language="en", clean=False)
print(seg.segment(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
```
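For contrast, a naive regex split on sentence-ending punctuation mishandles exactly these cases. The snippet below is illustrative only and is not part of the library:

```python
import re

text = "My name is Jonas E. Smith. Please turn to p. 55."
# Naively split after ., ! or ? followed by whitespace.
naive = re.split(r'(?<=[.!?])\s+', text)
print(naive)
# The middle initial "E." and the abbreviation "p." are wrongly
# treated as sentence boundaries, yielding four fragments instead of two.
```

Handling abbreviations, initials, and similar ambiguities is what the rule-based segmenter adds on top of this baseline.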
- Use `sentencesplit` as a [spaCy](https://spacy.io/usage/processing-pipelines) pipeline component (recommended). See the example [sentencesplit\_as\_spacy\_component.py](examples/sentencesplit_as_spacy_component.py).
- Use sentencesplit via [entry points](https://spacy.io/usage/saving-loading#entry-points-components):
```
import spacy
nlp = spacy.blank('en')
# Add the sentencesplit component registered via package entry points
nlp.add_pipe("sentencesplit")
doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
print(list(doc.sents))
# [My name is Jonas E. Smith., Please turn to p. 55.]
```
## Multilingual Segmentation
Languages with similar writing systems (e.g. English, Spanish, French) can be combined into a single segmenter by merging their abbreviation lists. This avoids having to detect each sentence's language before segmenting.
```
import sentencesplit
from sentencesplit.abbreviation_replacer import AbbreviationReplacer
from sentencesplit.lang.common import Common, Standard
from sentencesplit.lang.english import English
from sentencesplit.lang.spanish import Spanish
from sentencesplit.lang.french import French
from sentencesplit.languages import LANGUAGE_CODES
class MultiLang(Common, Standard):
    iso_code = 'multi'

    class Abbreviation(Standard.Abbreviation):
        ABBREVIATIONS = sorted(set(
            Standard.Abbreviation.ABBREVIATIONS +
            Spanish.Abbreviation.ABBREVIATIONS +
            French.Abbreviation.ABBREVIATIONS
        ))
        PREPOSITIVE_ABBREVIATIONS = sorted(set(
            Standard.Abbreviation.PREPOSITIVE_ABBREVIATIONS +
            Spanish.Abbreviation.PREPOSITIVE_ABBREVIATIONS +
            French.Abbreviation.PREPOSITIVE_ABBREVIATIONS
        ))
        NUMBER_ABBREVIATIONS = sorted(set(
            Standard.Abbreviation.NUMBER_ABBREVIATIONS +
            Spanish.Abbreviation.NUMBER_ABBREVIATIONS +
            French.Abbreviation.NUMBER_ABBREVIATIONS
        ))

    class AbbreviationReplacer(AbbreviationReplacer):
        SENTENCE_STARTERS = English.AbbreviationReplacer.SENTENCE_STARTERS

LANGUAGE_CODES['multi'] = MultiLang
seg = sentencesplit.Segmenter(language="multi", clean=False)
print(seg.segment("Hola Srta. Ledesma. How are you?"))
# ['Hola Srta. Ledesma.', 'How are you?']
```
This works for languages that share the `Common` and `Standard` base classes and use the same sentence-ending punctuation (`.`, `!`, `?`). The same pattern can be extended to other similar languages such as Italian, Dutch, or Danish. Languages with different writing systems or punctuation (e.g. Japanese, Arabic) require a different approach.
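The intuition behind merging abbreviation lists can be sketched with a toy splitter. This is hypothetical illustration code, not the library's actual algorithm; the `toy_segment` helper and the abbreviation set are invented here:

```python
import re

def toy_segment(text, abbreviations):
    """Naively split after ./!/? plus whitespace, then re-join pieces
    whose preceding piece ends in a known abbreviation."""
    pieces = re.split(r'(?<=[.!?])\s+', text)
    sentences = []
    for piece in pieces:
        if sentences:
            last_word = sentences[-1].rstrip('.!?').split()[-1].lower()
            if last_word in abbreviations:
                # The previous piece ended in an abbreviation,
                # not a real sentence boundary: merge.
                sentences[-1] += ' ' + piece
                continue
        sentences.append(piece)
    return sentences

# A merged English + Spanish toy abbreviation list catches "Srta."
merged = {'mr', 'dr', 'sr', 'srta'}
print(toy_segment("Hola Srta. Ledesma. How are you?", merged))
# ['Hola Srta. Ledesma.', 'How are you?']
```

With only an English abbreviation list, "Srta." would be treated as a sentence boundary; merging the lists lets one segmenter handle mixed-language input without per-sentence language detection.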
## Citation
This project is derived from pySBD. If you use it in a project or in research, please cite the original [PySBD: Pragmatic Sentence Boundary Disambiguation](https://www.aclweb.org/anthology/2020.nlposs-1.15) paper:
```
@inproceedings{sadvilkar-neumann-2020-pysbd,
title = "{P}y{SBD}: Pragmatic Sentence Boundary Disambiguation",
author = "Sadvilkar, Nipun and
Neumann, Mark",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.15",
pages = "110--114",
abstract = "We present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text is unknown. In our work, we adapt the Golden Rules Set (a language specific set of sentence boundary exemplars) originally implemented as a ruby gem pragmatic segmenter which we ported to Python with additional improvements and functionality. PySBD passes 97.92{\%} of the Golden Rule Set examplars for English, an improvement of 25{\%} over the next best open source Python tool.",
}
```
## Acknowledgements
This project is derived from [pySBD](https://github.com/nipunsadvilkar/pySBD) by Nipun Sadvilkar, which in turn would not have been possible without the excellent work of the [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter) team.