jawah/charset_normalizer

GitHub: jawah/charset_normalizer

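A universal character encoding detection library implemented in pure Python. It is an alternative to chardet that accurately identifies the encoding of text files and normalizes them to Unicode.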

Stars: 782 | Forks: 66

Charset Detection, for Everyone 👋

The Real First Universal Charset Detector

This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
|--------------------------------------------------|:---------------------------------------------:|:-----------------------------------------------------------------------------------------------:|:-----------------------------------------------:|
| `Fast` | ✅ | ✅ | ✅ |
| `Universal`[^1] | ❌ | ✅ | ❌ |
| `Reliable` **without** distinguishable standards | ✅ | ✅ | ✅ |
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
| `License` | _Disputed_[^2]<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
| `Native Python` | ✅ | ✅ | ❌ |
| `Detect spoken language` | ✅ | ✅ | N/A |
| `UnicodeDecodeError Safety` | ✅ | ✅ | ❌ |
| `Whl Size (min)` | 500 kB | 150 kB | ~200 kB |
| `Supported Encoding` | 99 | [99](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |
| `Can register custom encoding` | ❌ | ✅ | ❌ |

[^1]: They clearly use encoding-specific code for specific encodings, even if that covers most of the commonly used ones.

[^2]: Chardet 7.0+ was relicensed from LGPL-2.1 to MIT following an AI-assisted rewrite. This relicensing is disputed on two independent grounds: **(a)** the original author [contests](https://github.com/chardet/chardet/issues/327) that the maintainer had the right to relicense, arguing that the rewrite is a derivative work of the LGPL-licensed codebase because it was not a clean-room implementation; **(b)** the copyright claim itself is [questionable](https://github.com/chardet/chardet/issues/334) given that the code was primarily generated by an LLM, and AI-generated output may not be copyrightable in most jurisdictions. Either issue alone could invalidate the MIT license. Beyond licensing, the rewrite raises questions about the responsible use of AI in open source: key architectural ideas pioneered by charset-normalizer, notably decode-first validity filtering (our foundational approach since v1) and encoding pairwise similarity with the same algorithm and threshold, appeared in chardet 7 without acknowledgment. The project also imported test files from charset-normalizer to train and benchmark against it, then claimed superior accuracy on those very files. Charset-normalizer has always been MIT-licensed, encoding-agnostic by design, and built on a verifiable human-authored history.

## ⚡ Performance

This package offers better performance than Chardet at the 99th and 95th percentiles. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | File per sec (est) |
|---------------------------------------------------|:--------:|:------------------:|:------------------:|
| [chardet 7.1](https://github.com/chardet/chardet) | 89 % | 3 ms | 333 files/sec |
| charset-normalizer | **97 %** | 3 ms | 333 files/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
|---------------------------------------------------|:---------------:|:---------------:|:---------------:|
| [chardet 7.1](https://github.com/chardet/chardet) | 32 ms | 17 ms | < 1 ms |
| charset-normalizer | 16 ms | 10 ms | 1 ms |

_Updated in March 2026 using CPython 3.12, Charset-Normalizer 3.4.6, and Chardet 7.1.0_

~Chardet performs very poorly on large files (1MB+). Expect a huge difference on large payloads.~ This is no longer the case since Chardet 7.0+.

## ✨ Installation

Using pip:

```
pip install charset-normalizer -U
```

## 🚀 Basic Usage

### CLI

This package comes with a CLI.

```
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                  file [file ...]

The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.

positional arguments:
  files                 File(s) to be analysed

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display complementary information about file if any.
                        Stdout will contain logs about the detection process.
  -a, --with-alternative
                        Output complementary possibilities if any. Top-level
                        JSON WILL be a list.
  -n, --normalize       Permit to normalize input file. If not set, program
                        does not write anything.
  -m, --minimal         Only output the charset detected to STDOUT. Disabling
                        JSON output.
  -r, --replace         Replace file when trying to normalize it instead of
                        creating a new one.
  -f, --force           Replace file without asking if you are sure, use this
                        flag with caution.
  -t THRESHOLD, --threshold THRESHOLD
                        Define a custom maximum amount of chaos allowed in
                        decoded content. 0. <= chaos <= 1.
  --version             Show version information and exit.
```

```
normalizer ./data/sample.1.fr.srt
```

or

```
python -m charset_normalizer ./data/sample.1.fr.srt
```

🎉 Since version 1.4.0, the CLI writes easy-to-use results to stdout in JSON format.

```
{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}
```

### Python

*Just print out normalized text*

```
from charset_normalizer import from_path

# Analyse the file and collect every plausible match.
results = from_path('./my_subtitle.srt')

# best() is the most probable match; str() gives its decoded text.
print(str(results.best()))
```

*Upgrade your code without effort*

```
from charset_normalizer import detect
```

The above code will behave the same as **chardet**. We make sure to offer the best (reasonable) backward-compatible (BC) result possible.

See the docs for advanced usage: [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)

## 😇 Why

When I started using Chardet, I noticed that it did not meet my expectations, so I wanted to propose a reliable alternative built on a completely different method. Also! I never back down from a good challenge!

I **don't care** about the **originating** charset encoding, because **two different tables** can produce **two identical rendered strings.** What I want is to get readable text, the best I can.

In a way, **I'm brute forcing text decoding.** How cool is that? 😎

Don't confuse the **ftfy** package with charset-normalizer or chardet. ftfy's goal is to repair Unicode strings, whereas charset-normalizer converts a raw file in an unknown encoding to unicode.

## 🍰 How

- Discard every charset encoding table that cannot fit the binary content.
- Measure the noise, or mess, once the content is opened (by chunks) with a corresponding charset encoding.
- Extract the matches with the lowest mess detected.
- Additionally, measure coherence / probe for a language.

**Wait a minute**, what are noise/mess and coherence according to **YOU**?

*Noise:*
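I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then **I established** some ground rules about what is **obvious** when the result **looks like** a mess (that is, rules defining noise in rendered text). I know that my interpretation of noise is probably incomplete, so feel free to contribute in order to improve or rewrite it.

*Coherence:*
For each language on Earth, we have computed ranked letter-occurrence frequencies (as best we can). I figured that this information is valuable here, so I use those records against the decoded text to check whether I can detect intelligent design.

## ⚡ Known limitations

- Language detection is unreliable when the text contains two or more languages sharing identical letters (e.g. HTML with English tags plus Turkish content, both using Latin characters).
- Every charset detector relies heavily on having sufficient content. In common cases, do not bother running detection on very tiny content.

## ⚠️ About Python EOLs

**If you are running:**

- Python >=2.7,<3.5: Unsupported
- Python 3.5: charset-normalizer < 2.1
- Python 3.6: charset-normalizer < 3.1

Upgrade your Python interpreter as soon as possible.

## 📝 License

Copyright © [Ahmed TAHRI @Ousret](https://github.com/Ousret).
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.

Character frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)

## 💼 For Enterprise

Professional support for charset-normalizer is available as part of the [Tidelift Subscription][1]. Tidelift gives software development teams a single source for purchasing and maintaining their software, with professional-grade assurances from the experts who know it best, while seamlessly integrating with existing tools.

[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/7297/badge)](https://www.bestpractices.dev/projects/7297)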
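
As a rough illustration of the steps listed in the "How" section, here is a minimal sketch that uses only the Python standard library. It is not charset-normalizer's implementation. The candidate list and the names `mess_ratio`, `script_coherence` and `rank_encodings` are hypothetical, the two scores are deliberately crude stand-ins for the project's real noise rules and letter-frequency records, and the sketch decodes the whole payload at once instead of working by chunks.

```python
import unicodedata
from collections import Counter

# Hypothetical short candidate list; a real detector tries many more tables.
CANDIDATES = ["ascii", "utf_8", "cp1252", "latin_1", "cp1251"]


def mess_ratio(text: str) -> float:
    """Crude stand-in for 'mess': share of non-ASCII symbol, punctuation
    and control characters in the decoded text."""
    if not text:
        return 0.0
    suspicious = sum(
        1 for ch in text
        if ord(ch) > 127 and unicodedata.category(ch)[0] in "SPC"
    )
    return suspicious / len(text)


def script_coherence(text: str) -> float:
    """Crude stand-in for 'coherence': share of letters that belong to the
    dominant script (first word of the Unicode name, e.g. LATIN)."""
    scripts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    total = sum(scripts.values())
    return max(scripts.values()) / total if total else 0.0


def rank_encodings(payload: bytes) -> list[tuple[float, float, str, str]]:
    """Return (mess, coherence, encoding, text) tuples, most plausible first."""
    matches = []
    for encoding in CANDIDATES:
        try:
            text = payload.decode(encoding)
        except UnicodeDecodeError:
            continue  # Step 1: this table cannot fit the binary content.
        # Step 2 and step 4: score the decoded text.
        matches.append((mess_ratio(text), script_coherence(text), encoding, text))
    # Step 3: lowest mess first, then highest coherence.
    # sorted() is stable, so exact ties keep the CANDIDATES order.
    return sorted(matches, key=lambda m: (m[0], -m[1]))


payload = "Déjà vu, ça va très bien".encode("cp1252")
for mess, coherence, encoding, text in rank_encodings(payload):
    print(f"{encoding:8} mess={mess:.3f} coherence={coherence:.3f} {text!r}")
```

With this payload, `ascii` and `utf_8` are discarded because they fail to decode, while `cp1252` and `latin_1` render exactly the same string, which is the "two different tables, two identical strings" situation described under "Why". `cp1251` also decodes without any mess by this crude measure, and only the coherence score pushes it to the bottom of the ranking.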

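Tags: Chardet alternative, charset-normalizer, NLP basic tools, Python, Rust, character encoding, character encoding normalization, charset, open-source library, search engine crawler, data cleaning, text processing, text parsing, no backdoor, pure Python, encoding detection, network traffic auditing, reverse engineering tools, universal encoding detection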