Summary: Towards complete and error-free genome assemblies of all vertebrate species

linnil1

Jun 23, 2021

I skimmed this paper and wrote up a short summary of its key points (honestly, it doubles as a report I had to hand in).

The paper was published in Nature on 2021/03/12. Link: https://www.nature.com/articles/s41586-021-03451-0

Target

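This paper comes out of the Vertebrate Genomes Project (VGP), a project under the International Genome 10K (G10K) consortium.

Its aim is to build complete, error-free reference genomes (so far this has only been done really well for microbial species).

The goal is for the assembled sequences to have at least the following properties (in other words, a nearly perfect genome):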

Error-free
Gapless
Phased
Annotated
Chromosome-level (that is, telomere-to-telomere)

The project hopes to complete the following four phases within ten years:

Phase 1: representatives of approximately 260 vertebrate orders, defined here as lineages separated by 50 million or more years of divergence from each other
Phase 2: species that represent all approximately 1,000 vertebrate families
Phase 3: all roughly 10,000 genera
Phase 4: nearly all 71,657 extant named vertebrate species

The final phase means assembling a high-quality reference for the genome of every one of the 70,000-plus vertebrate species.

Assembly metrics

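To define what counts as a high-quality, complete reference, the paper uses 14 metrics in 6 broad categories. The most important ones include:

Coverage (how much of the real genome the assembly captures, i.e. its completeness)
Base accuracy (map short reads onto the assembly and score it base by base)
Contig NG50 (contigs are the long sequences assembled from reads)
Scaffold NG50 (scaffolds are long contigs joined together across gaps)
Haplotype phase NG50

(NG50 is not the median length. Sort the assembled sequences from longest to shortest and add up their lengths in that order; NG50 is the length of the sequence at which the running total first reaches 50% of the estimated genome size.)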
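To make the NG50 definition concrete, here is a minimal sketch in Python. The function name `ng50` and the toy contig lengths are made up for illustration; they are not from the paper. It also shows how NG50 differs from N50 (which uses the assembly's own total length instead of the estimated genome size) and from the median.

```python
def ng50(lengths, genome_size):
    """Length L such that sequences of length >= L together cover
    at least half of the (estimated) genome size."""
    half = genome_size / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0  # the assembly covers less than half of the genome


# Toy contig lengths (illustrative only), total = 290
contigs = [80, 70, 50, 40, 30, 20]

# NG50 against an assumed genome size of 400: 80 + 70 + 50 = 200 >= 200
print(ng50(contigs, genome_size=400))           # 50

# N50 is the same calculation against the assembly's own total length
print(ng50(contigs, genome_size=sum(contigs)))  # 70
```

The median of these toy lengths would be 45, which is why "median length" is not a good way to think about NG50: the statistic is weighted by how many bases each sequence contributes.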

The paper then sets minimum targets; an assembly has to meet them to count as high-quality:

1 Mb contig NG50
10 Mb scaffold NG50
At least 90% of the sequence assigned to chromosomes
Structurally validated by at least two independent lines of evidence
Q40 average base quality
Haplotypes assembled as completely and correctly as possible
Most genes assembled with gapless exon and intron structures; fewer than 3% had frame-shift base errors identified in annotation
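The numeric targets in the list above can be written down as a simple checklist. This is only a sketch: the `AssemblyStats` class, its field names and the `failed_minimums` helper are hypothetical, and the non-numeric criteria (structural validation, haplotype completeness) are left out because they cannot be reduced to a single threshold.

```python
from dataclasses import dataclass


@dataclass
class AssemblyStats:
    # Hypothetical container; field names are illustrative, not from the paper.
    contig_ng50_bp: int
    scaffold_ng50_bp: int
    chromosome_assigned_fraction: float  # 0.0 to 1.0
    average_base_qv: float               # Phred-scaled quality value
    frameshift_gene_fraction: float      # 0.0 to 1.0


def failed_minimums(stats):
    """Return the names of the numeric minimum targets that are not met."""
    checks = {
        "contig NG50 >= 1 Mb": stats.contig_ng50_bp >= 1_000_000,
        "scaffold NG50 >= 10 Mb": stats.scaffold_ng50_bp >= 10_000_000,
        ">= 90% assigned to chromosomes": stats.chromosome_assigned_fraction >= 0.90,
        "average base quality >= Q40": stats.average_base_qv >= 40,
        "< 3% genes with frameshift errors": stats.frameshift_gene_fraction < 0.03,
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up numbers for a fictional assembly
example = AssemblyStats(
    contig_ng50_bp=500_000,
    scaffold_ng50_bp=25_000_000,
    chromosome_assigned_fraction=0.95,
    average_base_qv=41.2,
    frameshift_gene_fraction=0.01,
)
print(failed_minimums(example))  # ['contig NG50 >= 1 Mb']
```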

Pipeline

The paper aims to provide a complete pipeline that anyone can follow to assemble a genome to a high level of completeness.

VGP assembly pipeline (v1.6)

Haplotype-separated CLR contigs(+ fixing function in the PacBio FALCON)
Scaffolding with linked reads, optical maps, and Hi-C
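Gap filling (PacBio's tool Arrow can also do this step, which fills in as many scaffold gaps as possible)
Base call polishing (a step that long-read assemblies very often need: PacBio reads have sequencing errors that cause frameshifts, so polishing is run for at least two rounds, raising quality from Q30 to Q40)
Manual curation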
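For a sense of what going from Q30 to Q40 in the polishing step means, the standard Phred scale defines Q = -10 · log10(p), where p is the per-base error probability. The helper names below are my own; only the formula is standard.

```python
import math


def phred_to_error_rate(q):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p):
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(p)


for q in (30, 40):
    p = phred_to_error_rate(q)
    # Q30 is about 1 error per 1,000 bases, Q40 about 1 per 10,000
    print(f"Q{q}: about 1 error per {round(1 / p):,} bases")
```

So every 10 points on the Q scale is a tenfold drop in error rate, and the two rounds of polishing buy roughly one order of magnitude.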

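The paper also includes flowcharts for the trio and mitochondrial pipelines.

Manual curation

One last key point is that the assemblies are curated by hand. The errors being corrected are mainly:

False duplication, which comes in two kinds:

Heterotype duplication: contigs from different haplotypes of the same region end up in the same assembly
Homotype duplication: sequence from a single haplotype is duplicated, because a nearby similar-looking sequence is wrongly treated as a different haplotype

These two programs are usually used for these cases and can clean up some of the errors: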

Purge_haplotigs
Purge_Dups

Even so, in the end 7262 regions across the 19 genomes were inspected by hand.

That works out to 236 manually reviewed regions per Gb.
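As a quick sanity check on those two numbers, dividing one by the other gives the total amount of sequence they imply. This is just back-of-the-envelope arithmetic on the figures quoted above, not a number taken from the paper.

```python
curated_regions = 7262  # regions inspected by hand
regions_per_gb = 236    # regions per Gb, as quoted above

total_gb = curated_regions / regions_per_gb
print(f"{total_gb:.1f} Gb of assembled sequence in total")  # 30.8 Gb
```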

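Experiments

The pipeline above is tested on 16 species (each from a different vertebrate lineage).

Why vertebrates? Because they typically have long satellite sequences near the centromere, which makes them hard to assemble.

Among them, Anna’s hummingbird (Calypte anna; genome size: 1 Gb) is used for a full comparison of the tools.

First come the sequencing technologies, 13 in total. The important ones are: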

Oxford Nanopore 1D and 2D (one long-read technology, max 120 kb)
PacBio continuous long reads (CLR) (another long-read technology, max 84 kb)
Hi-C (captures pairs of sequences that are close together in 3D space in the real genome; the farthest pair is 197 Mb apart)
Bionano optical maps (the genome is stretched out in nanochannels, which shows which sequences lie on the same molecule, max 3 Mb) (https://youtu.be/7e5v3b4NB6I?t=400)
10XG linked reads (droplets are used to tag each long molecule with a barcode before short-read sequencing, avg 74 kb) (https://www.10xgenomics.com/products/linked-reads)
Illumina (short reads)

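The figure below illustrates these nicely.

The assemblies can then be compared by NG50 length and number of gaps.

Roughly speaking, short reads alone generally assemble poorly, while linked reads, optical maps and Hi-C work best when used together (at least for now).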

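There are also a few smaller findings (Figure 2):

The NG50 of an assembly is related to genome size
The higher the repeat content, the smaller the NG50
The higher the GC content of a region, the more gaps it has
The more the two haplotypes differ, the longer the assembled sequence
Most assembly errors come from false duplication (which is exactly what the manual curation targets)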

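Finally, they even compare against the existing reference (built piece by piece with Sanger sequencing; figure below). Besides recovering parts of the karyotype that had not been found before, they show that some gaps and inversions in the old reference were actually misassemblies, and telomere completeness is also higher.

On the annotation side, as in the protein-coding gene shown in the figure below, the GC-rich regions (red) are where the old reference was poorly assembled. DRD1B was therefore thought to have two exons when it actually has only one, which in turn affects how we interpret its promoter.

The regulatory region of DRD1B in the zebra finch brain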

Related Research

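Assembling complete sequences is a very active research area right now, so I also have to mention two recent papers that assembled highly complete human sequences: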

Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).
Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature https://doi.org/10.1038/s41586-021-03420-7 (2021).

Both use many of the same technologies as this paper. The exception is PacBio HiFi reads, which this paper does not use.

Telomere-to-telomere

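This paper assembles with ONT and HiFi to reach a contig NG50 of 70 Mb, then uses linked reads, optical maps and short reads to correct base and indel errors. One thing to keep in mind is that it uses the CHM13 genome, which has only a single haplotype, so phasing is not an issue. The paper also notes at the end that this was only validated on chrX; harder chromosomes with longer satellites (chr1, chr9, chr16) remain a challenge.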

Chromosome 8

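This April 2021 paper also uses both long-read types, HiFi + ONT, to assemble human chr8, but linked reads, optical maps and short reads are used only for validation. It also assembles CHM13, so like the T2T paper above, it does not deal with haplotype phasing.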

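Conclusion

The paper has no conclusion section, so here is mine. This is a huge project; manually curating 7262 regions alone is hard to believe. They also tested many tools and kinds of sequencing data to arrive at the pipeline, and they plan to assemble complete genomes for this many species in the future.

A big part of this is that long-read technology keeps maturing and error rates keep dropping, so compared with traditional approaches, a sufficiently complete reference can be produced with less labor and money.

With these high-quality references, read mapping for these species will improve, and downstream analyses such as gene function and regulation will be more accurate (the earlier references may be wrong in GC-rich regions, which affects promoters). Evolutionary comparisons between lineages also get better (this paper includes some).

If you think my summary is decent, remember to clap to support me!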

Reference

This paper (it has a lot of authors)

Rhie, A., McCarthy, S.A., Fedrigo, O. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021). https://doi.org/10.1038/s41586-021-03451-0