pirovc/genome_updater

GitHub: pirovc/genome_updater

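A Bash script for downloading and incrementally updating non-redundant snapshots of the NCBI genomes repository (RefSeq/GenBank), with advanced filtering, taxonomy integration, and parallel downloads.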

Stars: 173 | Forks: 18

# genome_updater [![Build Status](https://app.travis-ci.com/pirovc/genome_updater.svg?branch=main)](https://app.travis-ci.com/pirovc/genome_updater) [![codecov](https://codecov.io/gh/pirovc/genome_updater/branch/master/graph/badge.svg)](https://codecov.io/gh/pirovc/genome_updater) [![Anaconda-Server Badge](https://anaconda.org/bioconda/genome_updater/badges/downloads.svg)](https://anaconda.org/bioconda/genome_updater) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/genome_updater/README.html)

genome_updater is a bash script that downloads and updates non-redundant snapshots of the NCBI genomes repository (RefSeq/GenBank) [[1](https://ftp.ncbi.nlm.nih.gov/genomes/)]. It offers advanced filters, detailed logs and reports, file integrity checks (MD5, gzip), integration with the NCBI taxonomy and GTDB [[2](https://gtdb.ecogenomic.org/)], and parallel downloads using parallel [[3](https://doi.org/10.5281/zenodo.1146014)].

genome_updater uses [assembly_summary.txt](https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt) to retrieve data.

## Quick usage guide

### Download

```
# Download the script and make it executable
wget --quiet --show-progress https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh
chmod +x genome_updater.sh

# Download archaeal complete genome sequences from RefSeq (-t parallel downloads, -G check gz integrity)
./genome_updater.sh -o "arc_refseq_cg" -d "refseq" -g "archaea" -l "complete genome" -f "genomic.fna.gz" -t 12 -G
```

### Update

Some time later, fetch the latest version from NCBI, downloading only the new files:

```
# Check for available updates with a dry-run (-k)
./genome_updater.sh -o "arc_refseq_cg" -k

# Perform the update
./genome_updater.sh -o "arc_refseq_cg" -G
```

- Boolean flags (e.g. `-G`) are not stored between versions and must be repeated in the update command to take effect.

## Installation

### conda/mamba

```
conda install -c bioconda genome_updater
```

### Direct file download

```
wget https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh
chmod +x genome_updater.sh
```

- Requires `bash` version 4 or newer.
- The script aims to be portable and depends only on the GNU Core Utilities plus a few other common tools (`awk` `bc` `find` `fmt` `gzip` `join` `md5sum` `parallel` `sed` `tar` `wget` and, optionally, `curl`), which are available and installed by default in most distributions.
- `genome_updater.sh -Z` shows more information about the dependencies on your system and reports anything missing.
- To test whether all genome_updater functions work properly on your system:

```
git clone --recurse-submodules https://github.com/pirovc/genome_updater.git
cd genome_updater
tests/test.sh
```

## Parameters

```
./genome_updater.sh -h

┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐    ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐
│ ┬├┤ ││││ ││││├┤     │ │├─┘ ││├─┤ │ ├┤ ├┬┘
└─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴  ─┴┘┴ ┴ ┴ └─┘┴└─
                                     v0.8.0

Source:
 -d Database(s) (comma-separated, mandatory)
        Options: "genbank, refseq"
        Default: ""
 -f File type(s) to download (comma-separated, mandatory)
        Options: "genomic.fna.gz, assembly_report.txt, protein.faa.gz, genomic.gbff.gz, ..."
        all available formats are described at https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt
        Default: "assembly_report.txt"

Organism/Taxa:
 -g Organism group(s) (comma-separated, empty for all)
        Options: "archaea, bacteria, fungi, human, invertebrate, metagenomes,
        other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral"
        Default: ""
 -T Taxonomic group(s) (comma-separated, empty for all)
        Optional negation using the ^ prefix.
        Example: "543,^562" (for -M ncbi) or "f__Enterobacteriaceae,^s__Escherichia coli" (for -M gtdb)
        Default: ""

Filter:
 -c RefSeq category (comma-separated, empty for all)
        Options: "reference genome, na"
        Default: ""
 -l Assembly level (comma-separated, empty for all)
        Options: "Complete Genome, Chromosome, Scaffold, Contig"
        Default: ""
 -D Start date (empty for no filter)
        Keep assemblies with sequence release date greater then or equal (>=) to value.
        Format YYYYMMDD.
        Default: ""
 -E End date (empty for no filter)
        Keep assemblies with sequence release date less then or equal (<=) to value.
        Format YYYYMMDD.
        Default: ""
 -F Custom assembly summary filter (empty for no filter)
        Use awk syntax, e.g.: $ for column index, || "or", && "and", ! "not", parentheses for nesting.
        Case sensitive.
        Columns info at https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt
        Examples:
        Single:     -F '$14 == "Full"'
        Multi:      -F '($2 == "PRJNA12377" || $2 == "PRJNA670754") && $4 != "Partial"'
        Regex:      -F '$8 ~ /bacterium/'
        Whole-file: -F '$0 ~ "plasmid"'
        Default: ""

Taxonomy:
 -M Taxonomy
        "gtdb[-*]" filters assemblies present in the GTDB version, which contains archaea and bacteria only.
        "gtdb" uses latest GTDB release.
        "ncbi" filters latest assemblies (version_status=latest).
        This option changes the behavior of -T -A -a.
        Options: "ncbi, gtdb, gtdb-80, gtdb-83, gtdb-89, gtdb-232, gtdb-214.1, gtdb-207, gtdb-202, gtdb-86.2, gtdb-95, gtdb-226, gtdb-220"
        Default: "ncbi"
 -A Top assemblies (0 for all)
        Option to keep a limited number of assemblies for each taxa leaf nodes.
        Selection by tax. ranks are supported in the format "rank:number", e.g.: "genus:3" to keep only 3 assemblies for each genus.
        Top choice based on sorted fields: RefSeq Category, Assembly level, Relation to type material, Date (most recent).
        Options (ranks): "species, genus, family, order, class, phylum, domain"
        Default: 0
 -a (boolean flag) Download and keep taxonomy database files in the output folder

Run:
 -k Dry-run mode
        Only checks for possible actions, no real data is downloaded, deleted or updated
 -i Fix mode
        Re-download incomplete or failed data from a previous run. Can also be used to change files (-f).
 -t Threads
        Number of processes to parallelize downloads and some file operations
        Default: 1
 -L Downloader program
        Options: "wget, curl"
        Default: "wget"
 -G gzip check (boolean flag)
        Check integrity of downloaded gzipped files with "gzip -t". Downloaded files are removed if test fail.
 -m MD5 check (boolean flag)
        Download, compute and check the MD5 checksum for all downloaded files.
        Downloaded files are removed if checksum can be downloaded but does not match.

Output:
 -o Output directory
        Default: "./tmp.XXXXXXXXXX" (random folder)
 -b Version label
        Name for the downloaded version. Will generate a directory inside the output directory (-o).
        Default: "YYYY-MM-DD_HH-MM-SS" (current timestamp)
 -N Files directory structure
        The "split" structure store files in sub-directories based on the assembly accession, e.g.: files/GCF/000/499/605/GCF_000499605.1_genomic.fna.gz.
        The "flat" will store everything under one dir, e.g.: files/GCF_000499605.1_genomic.fna.gz
        Options: "split, flat"
        Default: "split"

Report:
 -u Assembly accession report (boolean flag)
        Generate a report (*_assembly_accession.txt) with updated assembly accessions
        with the fields (tab-separated): Added/Removed, assembly accession, url
 -r Sequence accession report (boolean flag)
        Generate a report (*_sequence_accession.txt) with updated sequence accessions
        with the fields (tab-separated): Added/Removed, assembly accession, genbank accession, refseq accession, sequence length, taxid.
        Only available when file format (-f) "assembly_report.txt" is selected and successfully downloaded.
 -p URL report (boolean flag)
        Generate two files with successful and failed URLs (url_downloaded.txt, url_failed.txt)

Misc.:
 -e Local assembly_summary.txt
        Use provided "assembly_summary.txt" instead of downloading. Mutually exclusive with -d and -g
        Default: ""
 -B Alternative version label
        Use a previous version label instead of the latest as base version.
        Can be also used to rollback to an older version or to create multiple branches from a base version.
        Mutually exclusive with -i.
        Default: ""
 -H Link mode
        Change link type for files kept between versions.
        Hard links save inodes (useful on HPC systems) and allow version deletion.
        Options: "hard, soft"
        Default: "hard"
 -R Retry batches
        Number of attempts to retry failed downloads in batches.
        Default: "5"
 -n Conditional exit status
        Change exit code based on number of failures accepted, otherwise will Exit Code = 1.
        For example: -n 10 will exit code 1 if 10 or more files failed to download
        Options: integer for file number, float for percentage, 0 = off
        Default: "0"
 -x Delete extra files (boolean flag)
        Search and delete files that do not belong to the current version inside "files/" directory.
 -s Silent output
 -w Silent output with download progress
 -V Verbose log
 -Z Print debug information and run in debug mode
```

## Details on local updates

When updating an existing local repository:

- Newly added sequences are downloaded, creating a new version.
  - The version name can be changed with `-b`; otherwise the current timestamp is used.
- Removed or old sequences are kept but not carried over to the new version.
- Repeated or unchanged files are linked to the new version.
  - Hard links are used by default; `-H soft` forces soft links.
- Arguments can be added or changed.
  - Based on the quick usage example, use `./genome_updater.sh -o "arc_refseq_cg" -t 2` to set a different number of threads, or `./genome_updater.sh -o "arc_refseq_cg" -l ""` to remove the `complete genome` filter.
- A `history.tsv` file is created in the output folder (`-o`) to track versions and the arguments used. Note that boolean flags/arguments (e.g. `-m`) are not tracked.
  - `history.tsv` has the following tab-separated columns:
    - `current_label`: empty for a new download, otherwise the previous version it refers to
    - `new_label`: the current label. Empty for fix runs (`-i`)
    - `timestamp`: time and date of execution
    - `assembly_summary_entries`: number of assemblies in the `new_label` version
    - `arguments`: arguments used (boolean flags not included)

## Full examples

### Archaea, Bacteria, Fungi and Viral complete genome sequences (RefSeq)

```
# Download (-m to check the integrity of downloaded files)
./genome_updater.sh -d "refseq" -g "archaea,bacteria,fungi,viral" -f "genomic.fna.gz" -o "arc_bac_fun_vir_refseq_cg" -t 12 -m

# Update (e.g. some days later)
./genome_updater.sh -o "arc_bac_fun_vir_refseq_cg" -m
```

### All RNA viruses (Riboviria, txid:2559587)

```
# Use -t 12 to download in parallel with 12 threads
./genome_updater.sh -d "refseq" -T "2559587" -f "genomic.fna.gz" -o "all_rna_virus" -t 12 -m
```

### One genome assembly for each bacterial taxonomic leaf node

```
./genome_updater.sh -d "genbank" -g "bacteria" -f "genomic.fna.gz" -o "top1_bacteria_genbank" -A 1 -t 12 -m
```

### One genome assembly for each bacterial species

```
./genome_updater.sh -d "genbank" -g "bacteria" -f "genomic.fna.gz" -o "top1species_bacteria_genbank" -A "species:1" -t 12 -m
```

### All genomes in the latest GTDB release

```
./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_complete" -M "gtdb" -t 12 -m
```

### All genomes in a specific GTDB release

```
./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_R220_complete" -M "gtdb-220" -t 12 -m
```

### Two genome assemblies for each genus in GTDB

```
./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_top2genus" -M "gtdb" -A "genus:2" -t 12 -m
```

### All assemblies from a specific family in GTDB

```
./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_family_Gastranaerophilaceae" -M "gtdb" -T "f__Gastranaerophilaceae" -t 12 -m
```

### All assemblies from a specific family in GTDB, excluding one genus

```
./genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_Mycobacteriacea_minus_Mycobacterium" -M "gtdb" -T "f__Mycobacteriacea,^g__Mycobacterium" -t 12 -m
```

## Advanced examples

### Download, change and update a repository

```
# Check the available files with a dry-run
./genome_updater.sh -d "refseq" -g "archaea,bacteria" -l "complete genome" -f "genomic.fna.gz" -k

# Download (-o output folder, -t threads, -m check md5, -u extended assembly accession report)
./genome_updater.sh -d "refseq" -g "archaea,bacteria" -l "complete genome" -f "genomic.fna.gz" -o "arc_bac_refseq_cg" -t 12 -u -m

# Download additional .gbff files for the current snapshot (add genomic.gbff.gz to -f; -i only adds files without updating)
./genome_updater.sh -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -i

# Some days later, only check for updates without applying them
./genome_updater.sh -o "arc_bac_refseq_cg" -k

# Perform the update
./genome_updater.sh -o "arc_bac_refseq_cg" -u -m
```

### Branch a base version with specific filters

```
# Download the complete bacterial refseq
./genome_updater.sh -d "refseq" -g "bacteria" -f "genomic.fna.gz" -o "bac_refseq" -t 12 -m -b "all"

# Branch the main files into two sub-versions (no new files are downloaded or copied)
./genome_updater.sh -o "bac_refseq" -B "all" -b "complete" -l "complete genome"
./genome_updater.sh -o "bac_refseq" -B "all" -b "reference" -c "reference genome"
```

### Generate sequence reports and URLs

```
./genome_updater.sh -d "refseq" -g "fungi" -f "assembly_report.txt" -o "fungi" -t 12 -rpu
```

### Recover genome assemblies from an external assembly_summary.txt

```
./genome_updater.sh -e /my/path/assembly_summary.txt -f "genomic.fna.gz" -o "recovered_sequences"
```

### Use curl instead of wget, change the download timeout and retries, and increase the retry batches

```
# Default values: retries=3 timeout=120
retries=10 timeout=600 ./genome_updater.sh -g "fungi" -o fungi -t 12 -f "genomic.fna.gz,assembly_report.txt" -L curl -R 10
```

### Use a local taxdump file

```
new_taxdump_file="my/local/new_taxdump.tar.gz" ./genome_updater.sh -T 562 -o 562assemblies -t 12
```

- Note that this requires the [new_taxdump](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) file, not the more common `taxdump.tar.gz`.

### Alternative download URLs

```
# NCBI
ncbi_base_url="https://ftp.ncbi.nih.gov/" ./genome_updater.sh -d refseq -g bacteria

# GTDB
gtdb_base_url="https://data.ace.uq.edu.au/public/gtdb/data/releases/" ./genome_updater.sh -d refseq,genbank -g bacteria,archaea
```

## Reports

### Assembly accessions (-u)

The `-u` parameter activates the output of a list of updated assembly accessions for the entries whose files were all successfully downloaded. The file `{timestamp}_assembly_accession.txt` has the following tab-separated fields:

`Added/Removed [A/R]`, `assembly accession`, `url`

Example:

```
A	GCF_000146045.2	ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64
A	GCF_000002515.2	ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/515/GCF_000002515.2_ASM251v1
R	GCF_000091025.4	ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/091/025/GCF_000091025.4_ASM9102v4
```

### Sequence accessions (-r)

The `-r` parameter activates the output of a list of updated sequence accessions for the entries whose files were all successfully downloaded. This option is only available when the file types include `assembly_report.txt`. The file `{timestamp}_sequence_accession.txt` has the following tab-separated fields:

`Added/Removed [A/R]`, `assembly accession`, `genbank accession`, `refseq accession`, `sequence length`, `taxonomic id`

Example:

```
A	GCA_000243255.1	CM001436.1	NZ_CM001436.1	3200946	937775
R	GCA_000275865.1	CM001555.1	NZ_CM001555.1	2475100	28892
```

- Note: if a run is interrupted or does not finish successfully, some files may be missing from the assembly and sequence accession reports.

### URLs (-p)

The `-p` parameter activates the output of the successful and failed URLs to the files `{timestamp}_url_downloaded.txt` and `{timestamp}_url_failed.txt`, respectively. The failed list is only complete if the command ran to the end without errors or interruptions.

## Top assemblies (-A)

The `-A` option selects the "best" assemblies for each taxonomic node (leaf nodes or a specific rank) based on four categories (A-D), in order of importance:

```
A) refseq Category:
    1) reference genome
    2) na
B) Assembly level:
    3) Complete Genome
    4) Chromosome
    5) Scaffold
    6) Contig
C) Relation to type material:
    7) assembly from type material
    8) assembly from synonym type material
    9) assembly from pathotype material
    10) assembly designated as neotype
    11) assembly designated as reftype
    12) ICTV species exemplar
    13) ICTV additional isolate
D) Date:
    14) Most recent first
```

## References:

[1] https://ftp.ncbi.nlm.nih.gov/genomes/

[2] https://gtdb.ecogenomic.org/

[3] O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.
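As a follow-up to the assembly accession report (`-u`) described above, the sketch below shows one way to count how many assemblies were added or removed in a given version. It assumes only the documented tab-separated layout (`A`/`R`, assembly accession, url). The function name `summarize_report` and the temporary sample file are hypothetical and not part of genome_updater.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of genome_updater): count Added/Removed rows
# in a *_assembly_accession.txt report generated with -u.
# Assumes the documented tab-separated layout: A|R, assembly accession, url.
summarize_report() {
  awk -F '\t' '
    $1 == "A" { added++ }
    $1 == "R" { removed++ }
    END { printf "added=%d removed=%d\n", added, removed }
  ' "$1"
}

# Usage with a small sample built from the example rows in this README
sample="$(mktemp)"
printf 'A\tGCF_000146045.2\tftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64\n' > "$sample"
printf 'A\tGCF_000002515.2\tftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/515/GCF_000002515.2_ASM251v1\n' >> "$sample"
printf 'R\tGCF_000091025.4\tftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/091/025/GCF_000091025.4_ASM9102v4\n' >> "$sample"
summarize_report "$sample"
rm -f "$sample"
```

On a real run, the function would be pointed at the `{timestamp}_assembly_accession.txt` file written by genome_updater; where that file sits inside the output directory should be checked locally rather than taken from this sketch.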

Tags: genomics, application security, data synchronization, bioinformatics