richarddurbin/syng

GitHub: richarddurbin/syng

syng is a tool for building and manipulating syncmer-based sequence graphs, helping to represent and analyze DNA sequence data efficiently.

Stars: 42 | Forks: 6

# syng

Syncmer graphs, and perhaps other types of sequence graphs.

The main product here is **syng**. In essence it reads sequence files of various types and represents them as *paths* on a notional graph of *syncmers*, which are a subset of all possible kmers that are guaranteed to provide a sparse but complete cover of any DNA sequence (to an average depth of around two-fold). See [Edgar, 2021](https://peerj.com/articles/10805/) for more about syncmers: what we call syncmers here are "closed syncmers" in Edgar's terminology. The resulting syncmer graph can be used as an assembly graph or a pangenome graph.

syng makes heavy use of the ONEcode package for compact, fast and efficient file representation. Sets of syncmers, for example the set needed to represent a collection of sequences, are typically stored in **.1khash** files, but can also be exported as gzipped fasta files. As well as exporting paths explicitly as strings of kmer indices in **.1path** files, syng can build a [GBWT](https://academic.oup.com/bioinformatics/article/36/2/400/5538990) to represent the paths implicitly, stored in a **.1gbwt** file. We plan for syng also to export, and perhaps import, [GFA](https://gfa-spec.github.io/GFA-spec/GFA1.html) files.

Update March 2026: I have reimplemented the GBWT code using doubly linked run-length-compressed skip lists (in **rskip.[ch]**), which makes the core GBWT operations O(logN) and gives a significant speedup. The best way to make a **.1gbwt** file is now to create a **.1path** file first and then convert it with **syngpath2gbwt**.

To give some idea of performance, on a MacBook Pro syng converts a 20x PacBio HiFi read data set (935Mb cichlid genome, about 19Gbp) into a 1.05GB .1khash file and a 493Mb .1gbwt file of (1023,32)-syncmers in 62 seconds. On a Linux HPC cluster it converts 92 human genomes (277Gbp) from [HPRC](https://humanpangenome.org/) release 1 (leaving out HG002 for evaluation) into a 4GB .1khash file and a 5.8GB .1gbwt file of (63,8)-syncmers in about 1.5 hours.

The .1khash, .1path and .1gbwt files are all examples of **ONEcode** files. This project also includes **ONEview** from the [ONEcode](https://github.com/thegenemyers/ONEcode) repository. Further utilities **seqconvert**, **seqstat** and **seqextract**, which summarise, manipulate and interconvert between fasta/fastq[.gz] and **.1seq** ONEcode sequence files, are available from the [SEQUENCE_UTILITIES](https://github.com/thegenemyers/ONEcode/tree/main/SEQUENCE_UTILITIES) subdirectory of ONEcode.

## Building

```
git clone https://github.com/richarddurbin/syng.git
cd syng
make
```

If you want to be able to read SAM/BAM/CRAM files, you need to install [htslib](https://github.com/samtools/htslib) in a parallel directory and build with `BAMIO=1`:

```
cd ..
git clone https://github.com/samtools/htslib.git
cd htslib
autoreconf -i  # Build the configure script and install files it uses
./configure    # Optional but recommended, for choosing extra functionality
make
cd ../syng
make clean
make BAMIO=1
```

## Libraries

There are various useful libraries, each with a header file:

- **ONElib.[hc]** supports reading and writing ONEcode files. Implemented in the single file ONElib.c, with no dependencies. See the [ONEcode](https://github.com/thegenemyers/ONEcode) repository for more information.
- **seqio.[hc]** supports reading and writing DNA files, plus a few other basic operations. Implemented in seqio.c, with dependencies on utils.[hc], libz, ONElib and htslib according to the compile-time options.
- **seqhash.[hc]** processes sequences to generate various types of kmers, including extraction of syncmers, minimizers and modimizers through an iterator interface.
- **kmerhash.[hc]** efficient code for building, searching, writing and reading (to and from .1khash files) tables of fixed-length kmers, for example as returned by seqhash.
- **utils.[hc]** some very low-level type definitions (e.g. I8 to I64 and U8 to U64), die(), warn(), new(), new0(), and a timing package. No dependencies beyond the normal C run-time library. Note that there is a handy fzopen() which silently opens .gz files as standard files, but it depends on funopen(), which is not available on all systems. If this does not compile/link, you will need to undefine WITH_ZLIB to link.
- **array.[ch], dict.[ch], hash.[ch]** respectively implement high-level-language-style extendable arrays, dictionaries (hashes of strings) and general hashes of basic types (up to 64 bits).

## Synopsis

An example usage pattern is given below.

```
> syng
Usage: syng * *
possible operations are:
  -w : [55] syncmer length = w + k
  -k : [8] must be under 32
  -seed : [7] for the hashing function
  -T : [8] number of threads
  -o : [syngOut] applies to all following write* options
  -readK <.1khash file> : read and start from this syncmer (khash) file
  -zeroK : zero the kmer counts
  -noAddK : do not create new syncmers - convert unmatched syncmers to 0
  -histK : output quadratic histogram of kmer counts (after sequence processsing)
  -writeK : write the syncmers as a .1khash file
  -writeKfa : write the syncmers as a fasta file, with ending .kmer.fa.gz
  -writeNewK : write new syncmers as a .1khash file; implies -noAddK
  -writePath : write a .1path file (paths of nodes)
  -writeGBWT : write a .1gbwt file (nodes, edges and paths in GBWT form)
  -writeSeq : write a .1seq file (paths converted back to sequences)
  -outputEnds : write the non-syncmer ends of path sequences as X,Y lines
possible inputs are:
  : any of fasta[.gz], fastq[.gz], BAM/CRAM/SAM, .1seq
  <.1path file> : sequences as lists of kmers, with optional non-syncmer DNA ends
  <.1gbwt file> : graph BWT with paths, with optional ends
Operations are carried out in order as they are parsed, with some setting up future actions,
e.g. changing the outfile prefix affects following lower case options for file opening
Some output files, e.g. .1gbwt will be output at the end, after all inputs are processed,
whereas others, e.g. .1path are written as inputs are processed.

> syng -o cichlid -writeK -writePath -outputEnds *.fa.gz
k, w, seed are 8 55 7
sequence file 1 fAstCal68.FINAL.fa.gz type fasta: had 31 sequences 935785828 bp, yielding 36820971 syncs with 29445488 extra syncmers
user 18.041796 system 0.918415 elapsed 13.309440 alloc_max 3752 max_RSS 5911035904
sequence file 2 fAulStu2.FINAL.fa.gz type fasta: had 23 sequences 927632324 bp, yielding 36196076 syncs with 6870638 extra syncmers
user 17.979181 system 0.289314 elapsed 7.157738 alloc_max 4128 max_RSS 112132096
sequence file 3 fDipLim2.FINAL.fa.gz type fasta: had 437 sequences 932767327 bp, yielding 36681142 syncs with 4535666 extra syncmers
user 18.628796 system 0.458877 elapsed 7.594654 alloc_max 6516 max_RSS 2246590464
sequence file 4 fLabFue1.FINAL.fa.gz type fasta: had 31 sequences 936110624 bp, yielding 36736333 syncs with 3748552 extra syncmers
user 17.172514 system 0.225017 elapsed 6.408912 alloc_max 6526 max_RSS 0
sequence file 5 fLabTre1.FINAL.fa.gz type fasta: had 26 sequences 936863659 bp, yielding 36778443 syncs with 1973510 extra syncmers
user 16.720750 system 0.158108 elapsed 5.891383 alloc_max 6811 max_RSS 86949888
sequence file 6 fMayPea1.FINAL.fa.gz type fasta: had 30 sequences 927122434 bp, yielding 36548424 syncs with 2765274 extra syncmers
user 17.186198 system 0.197489 elapsed 6.093247 alloc_max 7113 max_RSS 827047936
Total for this run 578 sequences, total length 5596282196
Overall total 219761389 instances of 49339130 syncmers, average 4.45 coverage
wrote 49339130 syncmers to file cichlid.1khash
user 2.358794 system 0.312918 elapsed 2.805761 alloc_max 7113 max_RSS 1245184
total: user 108.088040 system 2.594965 elapsed 49.295976 alloc_max 7113 max_RSS 9185001472

> syng -o cichlid -readK cichlid.1khash -writeGBWT -outputEnds cichlid.1path
read syncmer parameters k 8 w 55 (size 63) seed 7
read 49339130 syncmers from cichlid.1khash with total count 219761392
user 2.078966 system 0.367622 elapsed 2.459572 alloc_max 3972 max_RSS 3435151360
k, w, seed are 8 55 7
path file 1 cichlid.1path: had 578 sequences containing 219761397 syncmers
user 98.485414 system 28.079270 elapsed 106.574848 alloc_max 12455 max_RSS 23292624896
Total for this run 578 sequences, total length 0
Overall total 439522789 instances of 49339130 syncmers, average 8.91 coverage
wrote gbwt to file cichlid.1gbwt
user 19.222173 system 1.010528 elapsed 20.482916 alloc_max 12455 max_RSS 0
total: user 120.153096 system 29.556817 elapsed 130.028740 alloc_max 12455 max_RSS 26727776256

> syng -o cichlid -readK cichlid.1khash -writeSeq cichlid.1gbwt
read syncmer parameters k 8 w 55 (size 63) seed 7
read 49339130 syncmers from cichlid.1khash with total count 219761392
user 2.076341 system 0.382973 elapsed 2.460453 alloc_max 3972 max_RSS 3435397120
k, w, seed are 8 55 7
path file 1 cichlid.1gbwt: read GBWT with 49339130 vertices and 115529066 edges
user 33.337580 system 0.797333 elapsed 34.188887 alloc_max 7022 max_RSS 3624419328
had 578 sequences containing 219761397 syncmers
user 159.932554 system 21.517852 elapsed 96.478368 alloc_max 15660 max_RSS 20926840832
Total for this run 578 sequences, total length 0
Overall total 439522789 instances of 49339130 syncmers, average 8.91 coverage
total: user 195.346545 system 22.837188 elapsed 133.267362 alloc_max 15660 max_RSS 27986657280

> ls -l cichlid*
-rw-r--r-- 1 rd staff 1012108820 Mar 16 12:34 cichlid.1khash  // the kmer sequences
-rw-r--r-- 1 rd staff  776147166 Mar 16 12:34 cichlid.1path   // the sequences as lists of kmers
-rw-r--r-- 1 rd staff 1290427468 Mar 16 12:38 cichlid.1gbwt   // the sequences stored as a GBWT over kmers
-rw-r--r-- 1 rd staff 1399079525 Mar 16 13:21 cichlid.1seq    // the regenerated sequences 2-bit compressed with indices
// as the level of replication increases the gbwt representation becomes more efficient

> seqstat -b cichlid.1seq
onecode file, 578 sequences >= 0, 5596282196 total, 9682149.13 average, 1000 min, 85450508 max
bases
  a 1646170104 29.4 %
  c 1152062443 20.6 %
  g 1152334894 20.6 %
  t 1645714755 29.4 %

> ONEview -h cichlid.1gbwt | head  // -h does not show the header, just the body
> ONEview -H cichlid.1gbwt         // -H shows only the header, with the schema and object statistics
```
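
To make the "sparse but complete cover" property of closed syncmers more concrete, here is a minimal Python sketch of closed syncmer selection as defined by Edgar: a substring of length `w + k` is kept when the smallest hash among its `w + 1` constituent k-mers falls at the first or last position. This is an illustration only, not syng's implementation. The function names `smer_hash` and `closed_syncmers` are hypothetical, the BLAKE2 hash is a stand-in for whatever hash seqhash actually uses, and reverse-complement handling is ignored. The defaults `k=8`, `w=55`, `seed=7` simply mirror the defaults printed in the usage message.

```python
import hashlib
import random


def smer_hash(smer: str, seed: int = 7) -> int:
    """Stand-in hash giving a pseudo-random order on k-mers (NOT syng's hash)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(smer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def closed_syncmers(seq: str, k: int = 8, w: int = 55, seed: int = 7) -> list:
    """Return start positions of closed syncmers of length w + k in seq."""
    length = w + k
    # Hash every k-mer of the sequence once.
    hashes = [smer_hash(seq[i:i + k], seed) for i in range(len(seq) - k + 1)]
    starts = []
    for i in range(len(seq) - length + 1):
        window = hashes[i:i + w + 1]  # the w + 1 k-mers inside this candidate
        lowest = min(window)
        # Closed syncmer: the minimum sits at the first or the last k-mer.
        if window[0] == lowest or window[-1] == lowest:
            starts.append(i)
    return starts


# Small demonstration on a random sequence (hypothetical data).
random.seed(1)
seq = "".join(random.choice("acgt") for _ in range(5000))
starts = closed_syncmers(seq)
gaps = [b - a for a, b in zip(starts, starts[1:])]
print(len(starts), "syncmers, largest gap between starts:", max(gaps))
```

Under this definition the gap between consecutive syncmer start positions should never exceed `w`, so neighbouring syncmers of length `w + k` always overlap by at least `k` bases. That overlap is what lets a sequence be written as a path through syncmers without losing any interior bases, leaving only the short non-syncmer ends to be stored separately.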
