andrewheiss/quarto-wordcount

GitHub: andrewheiss/quarto-wordcount

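A Quarto extension that addresses inaccurate word counts and format-compatibility problems in academic writing by counting the words in the body text, abstract, references, and appendix after the document has been rendered.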

Stars: 144 | Forks: 8

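# Quarto word count

- [Experimental new feature!](#experimental-new-feature)
- [Why counting words is hard](#why-counting-words-is-hard)
- [Using the word count script](#using-the-word-count-script)
  - [Installing](#installing)
  - [Usage](#usage)
  - [Terminal output](#terminal-output)
  - [Shortcodes](#shortcodes)
  - [No counting](#no-counting)
  - [Code blocks](#code-blocks)
  - [Appendices](#appendices)
- [Example](#example)
- [Credits](#credits)
- [How this all works](#how-this-all-works)

## Experimental new feature!

## Why counting words is hard

In academic writing and publishing, word counts matter because many journals impose word limits on submitted manuscripts. Counting words in a Quarto Markdown file is tricky for several reasons:

1. **Compatibility with Word**: Academic publishing portals tend to care about Microsoft Word-like counts, but many R and Python functions for counting words in a document treat word boundaries differently. For instance, Word counts a hyphenated word as one word (e.g., "A super-neat kick-in-the-pants example" is 4 words), while `stringi::stri_count_words()` counts it as several (e.g., "A super-neat kick-in-the-pants example" is 8 words with {stringi}). Making matters worse, {stringi} treats "/" as a word boundary, so URLs can severely inflate the actual word count.

2. **Extra text elements**: Academic writing typically does not count the title, abstract, table text, table and figure captions, or equations toward the manuscript word count. In computational documents like Quarto Markdown, tables and figures often do not exist until the document is rendered, so simply running a word-counting function on the `.qmd` file counts the code that generates them, again inflating the count.

3. **Citations and bibliography**: Academic writing typically counts references toward the word count (even though it really shouldn't). However, in Quarto Markdown (and other flavors of pandoc-based Markdown), citations are not counted until the bibliography has been generated. Simply running a word-counting function on the `.qmd` file (or using something like the super neat [{wordcountaddin}](https://github.com/benmarwick/wordcountaddin)) will only see the citekeys in the document, such as `@Lovelace1842`, and will count each as a single word (not as the in-text form "(Lovelace 1842)" or the footnote form "Ada Augusta Lovelace, "Sketch of the Analytical Engine…," *Taylor's Scientific Memoirs* 3 (1842): 666–731."). More importantly, it will not count any of the entries that are automatically generated in the final reference list.

This extension solves all three problems by relying on a [Lua filter](_extensions/wordcount/wordcount.lua) that counts words after the document has been rendered and before it is converted to its final output format.

[Frederik Aust (@crsh)](https://github.com/crsh) uses the same Lua filter to count words in R Markdown documents with the [{rmdfiltr}](https://github.com/crsh/rmdfiltr) package (I actually just copied and slightly expanded [that package's `inst/wordcount.lua`](https://github.com/crsh/rmdfiltr/blob/master/inst/wordcount.lua)). The filter works well and is [generally comparable to Word's word count](https://cran.r-project.org/web/packages/rmdfiltr/vignettes/wordcount.html).

You should glance through the ["How this all works"](#how-this-all-works) section to see… well… how it all works.

## Using the word count script

### Installing

```
quarto add andrewheiss/quarto-wordcount
```

{quarto-wordcount} requires Quarto version >= 1.4.551.

The `quarto add` command installs the extension under the `_extensions` subdirectory. If you use version control, check this directory in.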
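To make the word-boundary problem from "Why counting words is hard" concrete, here is a minimal Python sketch. It is not Word's algorithm and not what {stringi} does internally (which follows Unicode word-break rules); the two helper functions are hypothetical stand-ins that happen to reproduce the 4-versus-8 difference for the example phrase and show how a URL inflates a boundary-splitting count.

```python
import re


def count_whitespace_tokens(text: str) -> int:
    """Count runs of non-whitespace characters, so hyphenated words stay whole."""
    return len(text.split())


def count_alphanumeric_runs(text: str) -> int:
    """Count runs of ASCII letters and digits, so hyphens and slashes split words."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


phrase = "A super-neat kick-in-the-pants example"
url = "https://github.com/andrewheiss/quarto-wordcount"

for label, text in [("phrase", phrase), ("url", url)]:
    print(label, count_whitespace_tokens(text), count_alphanumeric_runs(text))
# prints:
# phrase 4 8
# url 1 6
```

The whitespace-based count is the Word-like one in this example; the boundary-splitting count turns a single URL into six "words".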
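### Usage

There are two ways to enable the extension: (1) as an output format and (2) as a filter.

#### Output format

You can specify one of four output formats in the YAML settings: `wordcount-html`, `wordcount-pdf`, `wordcount-docx`, and `wordcount-markdown`:

```
title: Something
format:
  wordcount-html: default
```

The `wordcount-FORMAT` format type is really just a wrapper around each base format (HTML, PDF, Word, and Markdown), so all other HTML-, PDF-, Word-, and Markdown-specific options work as usual:

```
title: Something
format:
  wordcount-html:
    toc: true
    fig-align: center
    cap-location: margin
```

#### Filter

If you are using a [custom output format](https://quarto.org/docs/extensions/listing-formats.html) (such as [{hikmah-academic-quarto}](https://github.com/andrewheiss/hikmah-academic-quarto) or a [journal article format](https://quarto.org/docs/extensions/listing-journals.html) like [{jss}](https://github.com/quarto-journals/jss)), you cannot use the `wordcount-html` format, because output formats cannot be combined.

To enable word counting for *any* format, including custom formats, add the extension's Lua scripts as filters. Three settings are needed:

1. `citeproc: false` must be set so that Quarto does not try to process citations itself;
2. The path to `citeproc.lua`, which handles citation processing and must therefore run [before `wordcount.lua`](#how-this-all-works);
3. The path to `wordcount.lua`, which does the counting.

```
title: Something

format:
  html:   # Regular built-in format
    citeproc: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua
  jss-pdf:   # Custom third-party format
    citeproc: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua
```

### Terminal output

The word count appears in the terminal output when you render the document. It reports several values in three sections:

- **Manuscript totals**: (1) the total words in the text, notes, and references, (2) the words in the text and notes only, and (if an appendix exists) (3) the total words in the text, notes, references, and appendix. The journals I typically work with count text + notes + references toward the total. When trimming a manuscript to fit a word limit, separating the references from text + notes helps me see more clearly where editing will be most effective (for example, rewording sentences versus removing references).
- **Specific totals**: Separate counts for the text body, notes, references, and appendix.
- **Overall totals**: Counts for (1) everything, including the abstract, and (2) the abstract alone. The abstract total is reported here because abstracts usually do not count toward the manuscript word limit but still need to be counted, since they often have a limit of their own.

```
Manuscript totals:
---------------------------------------------------
- 458 words (text + notes + references)
- 405 words (text + notes)
- 478 words (text + notes + appendix + references)

Specific totals:
---------------------------------------------------
- 315 words in text body
- 90 words in notes
- 53 words in references
- 20 words in appendix

Overall totals:
---------------------------------------------------
- 484 words in entire document
- 6 words in abstract
```

### Shortcodes

You can also use several shortcodes to include word counts directly in the document:

- Use `{{< words-total >}}` to include a count of all words;
- Use `{{< words-body >}}` to include a count of the words in the text body only, excluding references, notes, and the appendix;
- Use `{{< words-ref >}}` to include a count of the words in the reference section;
- Use `{{< words-append >}}` to include a count of the words in the appendix, which must be wrapped in a fenced div with `id="appendix-count"` ([see below for more details](#appendices));
- Use `{{< words-note >}}` to include a count of the words in the notes;
- Use `{{< words-abstract >}}` to include a count of the words in the abstract;
- Use `{{< words-sum ARG >}}`, where `ARG` is some combination of the five countable areas (`body`, `ref`, `append`, `note`, `abstract`). For example, `{{< words-sum body-note >}}` includes a count of the words in the body and notes; `{{< words-sum ref-append >}}` includes a count of the words in the references and appendix.

You can also use shortcodes in the YAML metadata:

```
title: Something
subtitle: "{{< words-sum body-note-ref >}} words"
```

### No counting

If you want to exclude some text from the word count, wrap it in a [fenced div](https://quarto.org/docs/authoring/markdown-basics.html#sec-divs-and-spans) with the `{.no-count}` class:

```
::: {.no-count}

These words don't count.

:::
```

### Code blocks

By default, text in code blocks ***is counted***. For example, this document:

````
---
title: "Code counting"
format: wordcount-html
---

This sentence has seven words in it.

```{r}
# Here is some code
numbers <- 1:10
mean(numbers)
```
````

…produces this word count:

```
Overall totals:
-----------------------------
- 16 total words
- 16 words in body and notes

Section totals:
-----------------------------
- 16 words in text body
```

…which is 7 words from the sentence plus 9 words from the code.

You can disable code block counting with the `count-code-blocks` YAML option:

````
---
title: "Code counting"
format:
  wordcount-html:
    count-code-blocks: false
---

This sentence has seven words in it.

```{r}
# Here is some code
numbers <- 1:10
mean(numbers)
```
````

…which produces this count:

```
Overall totals:
----------------------------
- 7 total words
- 7 words in body and notes

Section totals:
----------------------------
- 7 words in text body
```

### Appendices

In academic writing, it is often useful to count the words in the appendix separately, since appendix content typically does not count toward journal word limits. Quarto has a neat feature that automatically creates an appendix section and moves content there as needed. Quarto does this with a Lua filter.

However, Quarto's appendix generation happens after any custom Lua filters have run. Even though the final rendered document contains a div with `id="appendix"`, that div is not yet accessible when the words are counted, so the appendix words cannot easily be extracted.

As a (temporary?) workaround (until I figure out how to make this Lua filter run after the appendix div is created), you can get a separate appendix word count by creating your own fenced div with `id="appendix-count"`:

```
# Introduction

Regular text goes here.

::: {#appendix-count}

# Appendix {.appendix}

More words here

:::
```

## Example

You can see a minimal example document in [`template.qmd`](template.qmd).

## Credits

The original [`wordcount.lua`](_extensions/wordcount/wordcount.lua) filter comes from [Frederik Aust's (@crsh)](https://github.com/crsh) [{rmdfiltr}](https://github.com/crsh/rmdfiltr) package.

## How this all works

Behind the scenes, pandoc typically converts a Markdown document into an abstract syntax tree (AST), an output-agnostic representation of the document's elements. Once the document is in AST form, it is easy to use the [Lua language](https://pandoc.org/lua-filters.html) to extract or exclude specific elements (e.g., exclude captions or look only at the references).

Quarto is designed to be language-agnostic, so {rmdfiltr}'s approach of using R to set the Lua filter path dynamically in the YAML front matter does not work with Quarto files. ([See this comment from the Quarto team](https://github.com/quarto-dev/quarto-cli/issues/1391#issuecomment-1185348644), which notes that you cannot use R output in the Quarto YAML header.)

It is still possible to use the {rmdfiltr} Lua filter, though, with a little trick.

To include citations in the word count, the word count filter has to be fed a version of the document that has already been processed with the [`--citeproc` option](https://pandoc.org/MANUAL.html#citation-rendering). However, in both R Markdown/knitr and Quarto, the `--citeproc` flag is designed to be the last option, resulting in a pandoc command like this:

```
pandoc whatever.md --output whatever.html --lua-filter wordcount.lua --citeproc
```

The order of the arguments matters: because `--lua-filter wordcount.lua` comes before `--citeproc`, the words are counted before the bibliography is generated, which is not ideal.

{rmdfiltr} gets around this ordering issue by editing the YAML front matter to (1) disable citeproc in general and (2) specify the `--citeproc` flag before running the filter:

```
output:
  html_document:
    citeproc: false
    pandoc_args:
      - '--citeproc'
      - '--lua-filter'
      - '/path/to/rmdfiltr/wordcount.lua'
```

That generates a pandoc command like the following, with `--citeproc` first, so the generated references are counted:

```
pandoc whatever.md --output whatever.html --citeproc --lua-filter wordcount.lua
```

Quarto, however, has no `pandoc_args` option. Instead, it has a `filters` YAML key that lets you specify a list of Lua filters to apply at specific steps of the rendering process:

```
format:
  html:
    citeproc: false
    filters:
      - "/path/to/wordcount.lua"
```

There is no obvious way to reposition the `--citeproc` argument, though. It automatically appears last, so the generated references are not counted.

Fortunately, [this GitHub comment](https://github.com/quarto-dev/quarto-cli/issues/2294#issuecomment-1238954661) shows that it is possible to create a Lua filter that behaves like `--citeproc` by passing the whole document to `pandoc.utils.citeproc()`. That means we can create a small Lua script like `citeproc.lua`:

```
-- Lua filter that behaves like `--citeproc`
function Pandoc (doc)
  return pandoc.utils.citeproc(doc)
end
```

…and then include it as a filter:

```
format:
  html:
    citeproc: false
    filters:
      - at: pre-quarto
        path: "path/to/citeproc.lua"
      - at: pre-quarto
        path: "path/to/wordcount.lua"
```

That generates a pandoc command like the following, which passes the document to the citeproc "filter" first and then to the word count script:

```
pandoc whatever.md --output whatever.html --lua-filter citeproc.lua --lua-filter wordcount.lua
```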
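The ordering issue in this section can be illustrated with a toy simulation in Python. This is only a sketch under loud assumptions: `toy_citeproc`, `count_words`, the one-entry `BIBLIOGRAPHY`, and its abbreviated reference text are all hypothetical stand-ins, not pandoc's citeproc and not the extension's Lua filter. It shows only why counting after citation processing yields a larger, more complete total.

```python
import re

# Hypothetical one-entry bibliography:
# citekey -> (in-text form, abbreviated reference-list entry)
BIBLIOGRAPHY = {
    "Lovelace1842": (
        "(Lovelace 1842)",
        "Lovelace, Ada Augusta. 1842. Sketch of the Analytical Engine.",
    ),
}


def count_words(text: str) -> int:
    """Stand-in for the word count filter: count whitespace-separated tokens."""
    return len(text.split())


def toy_citeproc(text: str) -> str:
    """Stand-in for citeproc: expand @citekeys and append a reference list."""
    cited = []

    def expand(match):
        key = match.group(1)
        cited.append(key)
        return BIBLIOGRAPHY[key][0]

    body = re.sub(r"@(\w+)", expand, text)
    # dict.fromkeys drops duplicate keys while keeping first-seen order.
    references = [BIBLIOGRAPHY[key][1] for key in dict.fromkeys(cited)]
    return "\n".join([body, *references])


doc = "Early programs are discussed in @Lovelace1842."

count_then_cite = count_words(doc)                # word count runs first
cite_then_count = count_words(toy_citeproc(doc))  # citeproc runs first
print(count_then_cite, cite_then_count)
# prints: 6 16
```

Counting first sees the citekey as a single token and never sees the reference entry; processing citations first adds both the expanded in-text citation and the generated reference to the count, which mirrors the reason `citeproc.lua` is listed before `wordcount.lua`.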
