Differential Expression Gene (DEG) Analysis 基因差異化表現的分析原理

linnil1

22 min readJan 15, 2021

這篇我想要試圖把 Deseq2 會用到的想法跟統計方式寫下來避免忘記

其實在 deep learning 時代我連統計都快忘光了，趕緊複習…

而且蠻複雜的，拖了很久…

如果不想看太過理論的話，這個 workshop 其實很不錯 https://hbctraining.github.io/DGE_workshop/schedule/1.5-day.html

Intro

接續前一篇的 Quantification，我們有了每個基因的 RNA 的 read count (數量) 或者說表現量 (abundance) 之後

RNA Quantification, RNA 定量

簡單介紹 RNA-seq 表現量(Abundance) 的基本概念

linnil1.medium.com

我們想做的事情是看看不同實驗之間有哪些基因有表現量差異。e.g. 比如說同一個人吃藥前，吃藥後，藉由看表現量差異，我們可以期待看到某些gene 機制被抑制了。

釐清問題

t-test

既然有兩群 samples A, B 那我們就做 Student’s t-test 檢測 gene 在 A 的 abundance 跟在 B 的 abundance 一不一樣，那麼 Null hypothesis 就是 expression A = Expression B

沒錯，但是通常你的 sample 數會不太夠

不只會估的不準，dof (degree of freedom) 不夠大所以 p-value ≥ 0.05 會很寬，所以通常會 reject 不了 null hypothesis

就像吃藥一樣，有些人好得特別快，有些人特別慢，只有蒐集更多 samples 你才能估計出比較準確的藥效，但通常 samples 數要用大筆的錢去換就是ㄌ。

LFC (log fold change)

如果不能用 t-test ，那麼挑變化大的 gene 出來，而變化大是指 A/B (fold change) 取極大或極小

A/B = 2 但是 B/A = 0.5 ，同樣是兩倍但是這樣不好比較，所以我們會多加個 log (通常用 log_2) ，在 log 下會變成相減，變多變少 (up-regulate down-regulate) 的值就會一樣了

接下來可以用 |log(A/B)| > θ 來判斷哪些是 differential expression了

但是會遇到跟剛剛一樣的問題：

如果本來的 variance 很大，相除並沒有意義
當 a or b read count 很小的時候，相除可能造成反效果

Overdispersion

我說的 variance 很大，在這裡的意思是指資料的 variance 比一般 model 估計出來的還要大，所以後面都會估的不準，這種情形叫做 overdispersion

這裡的一般 model 指的就是 Poisson Distribution ，因為 Poisson 只有一個參數並不能調整 variance

variance 在這裡很重要的，比如說

如果你的某個 gene 抽樣出3個 read count [100,112,115] 請問他真實的數字是多少

然後套 distribution 帶入後會得到 -> 108: 20%, 109: 30%, 110: 20%

然後你就能回答最有可能是 109 這樣

如果 distribution 的 variance 能調整成比剛剛的小，代表著剛剛的 100 比較像是 outlier，結果可能會改變成 108: 20%, 109: 27%, 110: 30% ，回答就變成 110

基本概念

Negative Binomial Distribution (NB)

用 Negative Binomial Distribution 取代只有一個參數的 Poisson

K = read count, u = mean of read count, σ=dispersion

NB 有兩個參數(μ, σ) μ 代表平均, σ 代表 dispersion 也就是分散的程度

這裡我是用 variance 而不是 dispersion 不過其實就是簡單的轉換

dispersion 跟統計出來資料的 variance(ν) 有這樣的關係

可以這樣想，兩個 gene read count 平均一樣，但其中一個天生的 expression 比較不穩，另一個則比較穩，我們就要用大的 dispersion 的 model 套在第一個，小的給另一個，才會比較精確。

Pipeline

回顧這類的文章，有一個特點是為了要以準確的得到 dispersion ，會多做一些手續，以下為簡單的 pipeline，最特別的是第三點

1.Normalization 把 read count 做個標準化

2. Dispersion怎麼得到呢，第一點從資料計算出來(maximum-likelihood)，缺點是一個 gene 做一次，sample 不夠，所以不夠準

3. 大家就會去想要怎麼樣去算得更準，當然會用到其他 gene 的資訊做為參考，然後把原本估出的 dispersion 做調整。把原本誤差很大的資料整理成統計上誤差比較小的資料，又叫 shrinkage

4. 已經有準確的 NB 了，我們可以 testing 看看有那些是 differiential gene

Normalization

跟 quantification 不一樣，我們要看的是 samples 間的差異，而不是 sample 內的差異，所以 TPM 並不適合(可以看上一篇)

我們使用 read count 來計算，一樣，必須先 normalize 才不會因為有些 sample 因為 reads 比較多而產生偏差

這個方法是 median-of-ratios: 先對每個基因找出一個 reference 然後相除，得到一個比例，比例的中位數就是整個 sample 的 scale

Equation for calculating sample-scale. Copied from DESeq

稱做 normalization constants, size factor 我在這裡叫他 sample-scale

而分母又叫做 pseudo-reference, pseudo-count ，他是個幾何平均 (geometric mean)，可以當成同個基因的 read count 的一個標準

所以寫在一起變成

K 是 normalized 的 read count ，q 就是我們想知道的 abundance

null hypothesis 變成 q_A = q_B

Maximum Likelihood

我們可以用 normalized read count 的平均當作 μ，然後找個 σ 使得 NB(μ,σ) 最符合資料，用的方式叫 Maximum Likelihood 可以直接 closed form 的解出來

詳細計算可以看以下 paper(Normalization, Max-likelihood, Testing 都是這篇寫的) https://academic.oup.com/biostatistics/article/9/2/321/353777

Testing

有了可以描述 read count 的 Negative Binomial model 後我們就可以使用 small-sample Fisher’s Exact Test 去看這個 gene 是否有差異。Null hypothesis 是 read count ~ NB(μ, σ) ，觀察到的 read count 能否被 NB 說明 (很像 F-test)

我們弄出 pseudo-sum 也就是 NB_A + NB_B 這個 distribution 後(右下圖)，去計算有多少機率比觀察到的組合(A+B 基因表現量) 還要極端(左下公式) (有頭有尾是 two-sided)

這個機率就是 p-value 最後可以去做 thresholding

False Discovery Rate (FDR) & P-adjust

注意剛剛是 p-value 是 per-gene 的，但是 gene 數量一多，即使 p-value 機率很小，還是可以 by-chance 篩出不少 gene，但很有機會是 false-positive. e.g. 你有 40000 gene, p-value 設 0.01 ，那麼即使什麼都沒做就有 400 gene 會被找出來

所以要設定一個 FDR 的值 alpha ，會根據 testing 的次數跟原本的 p-value ，調整 p-value 變成 p-adj

計算這裡不討論，詳情請看這篇 70k citations 的 paper

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1), 289–300.

MA-plot (minus over average)

MA plot 就是 LFC vs log(abundance) 的圖

不難發現，我們可以 p-value 有顯著的(紅色) 會跟 LFC 有一定的關聯性，說明這個方式是好的。

當然你也可以把 p-value 畫在 y 軸上，x 軸是 LFC 就變成

Volcano plot

Paper

edgeR

應該算是第一篇

除了上面整段寫的流程大部分都是 edgeR purposed 的，除了 sample-scale 的計算是直接幾何平均沒有用 median (以前)

edgeR 多了一個假設: 所有 gene 的 dispersion 都一樣(下圖紅色線)

方法是蒐集所有 gene 的 dispersion ，量比較多，估計起來會比較準

CPM = count per million. Copied from https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

注: Maximum Likelihood 就是找出 per gene 的 dispersion (上圖黑點)

注: 在 DESeq function 其中一個參數 fit_type: mean 應該就是這樣算。

有了 common dispersion 後再用 Bayes 調整(shrinkage) 成實際的 dispersion(下圖黑色點 to 紅色點)，當然不是拿那條線上的點直接取代。實際計算如下列公式，找 Cox–Reid adjusted likelihood(LCR) 中的最大值，這個 LCR 是由 Negative Binomial 和 Maximum likelihood 所組成的。簡單的說，要找個數字同時符合真實資料點跟 negative binomial 在指定的 α 下做出來的分布。

α = common dispersion. Equation Copied from DESeq2

而現在的 edgeR 我看了看 document ，現在會先找到藍色的趨勢線(Trend)，再用 Bayes 調整 dispersion，下圖。

DESeq

DESeq 是 edgeR 的擴充的 model (←paper 自己寫的)

在 dispersion 的計算之前，我們要先估計 gene 的 variance (這點我們在 Negative Binomial 那個部分已經說明)

差別在 edgeR 估計 variance 時是 per-gene 計算，而 DESeq 多一個步驟是把 per-gene variance 收集起來後，smooth 一下，之後在 per-gene 計算 dispersion，因此 dispersion 會更準

Equation4 in DESeq

smooth 的計算是用 generalized linear models (GLM) of gamma family ，簡單講是 regression 求出最適合的線當作該 gene 的 variance，如下圖

這張圖順便說明 poisson 只有一個參數，會無法描述這麼高 variation 的資料。 Copied from DESeq

驗證的方式就是像 QQplot 一樣，把所有 p-value 攤開來應該是一條直線

DESeq2

就是 DESeq 進階

跟之前一樣，dispersion(從 Maximum likelihood 求出) 太 variance 。(待補所以之前的方法不用了ㄇ)在這裡的解決方法是用一個 read count 有關的函數去 smooth 他

上面的式子又叫做 parametric (DESeq 其中一個 fit_type)，fit 的方式用 GLM gamma function(對跟上面一樣)，然後 iteratively 求出 a0, a1

如下圖紅色線(Trend, parametric fit 的結果，這裡叫做 prior) 就是最合適的線對於黑色點(黑點來自 maximum likelihood)，可以發現 DESeq2 在 read count 小的時候，dispersion 會估得比較大，這是個合理的設計，DESeq 當時已有發現到這個問題

DESeq2 還有一個 fit_type 是 local ，式子大概長這樣

local fit in DESeq

有了 prior 後，我們要調整(shrinkage)原始的 dispersion。在這裡要算出 posterior ，然後找 posterior 中的最大值(maximum a posteriori, MAP)，即是最有可能的 dispersion。

這個 posterior 考慮這三項:

Negative Binomial(edgeR已用) 和
Maximum likelihood(edgeR已用) 和
log-normal prior ，就是 dispersion 取 log 是個 normal distribution(paper 有證) ，這個 normal distribution 的 mean 會是 prior dispersion 的值。

因為多了這一項，所以跟 edgeR 的式子有一點差異，而 paper 有提到之前的方式容易 overestimate。

最後靠這條式子把上上圖的黑色點轉成藍色點(MAP)了

要注意的是如果是 outlier (> 2 standard deviation)，就不用拿進來 fit ，而他的 MAP 直接拿 prior 上的數字就好

DESeq2 for LFC

有時候 fold change 比較重要，可以想像變化比較大的那個 gene 通常才是關鍵。DESeq2 也有 purpose 方法去求比較準的 LFC，概念大致相同，共三個步驟

per-gene LFC

這跟上面一樣，從原本是要找 abundance 的值，現在是找 log(abundance) 也就是 β (下面的式子)，又稱之作 effect size。計算方法是 iteratively reweighted least-squares algorithm

2. gene-wise Prior

我們假設 β 符合 normal distribution 且 mean=0，也就是 log(abundance) 是常態分布，這其實部分合理，因為通常大部分的 gene abundance 不會因此而改變，有變化的只是那幾個 gene 而已

收集全部的 β，然後使用 cook’s distance 去掉 outlier 之後，我們可以估計出一個比較準的 prior width (normal distribution 的寬度)，注意下式的 Q 會把比較極端的點去掉(p = 0.05)

3. Shrinkage

找出一個最佳的 β，其實找 posterior 的 likelihood 的最大值，也就是找最有可能的 β 符合 Negative binomial 跟 normal distribution，如果是 outlier 就直接使用 prior

效果如下 (A -> B)，你可以發現到原本 mean 小的地方會造成 LFC 非常的大，但是我們今天有假設 β 的 distribution mean =0，所以圖 B 就 mean 小的地方被 shrink 到 0 了

下圖是圖解 prior 所造成的 posterior 的影響，原本兩個 gene 平均表現量一樣(實線)，但是因為 prior(黑線) 的分布，他的最佳 β 變成虛線的位置。還可以發現，因為紫色 gene 的 read count 比較分散，所以 posterior 的 LFC 會比原本預期的低，這很合理，比較分散證據力會不太夠。

這張圖也是，說明做過 shrinkage 的估計會比較穩(Stable)，因為有考慮全部的 gene 的資訊，所以不會因為小 samples 所以每次做出來都不一樣。

RMSE = root mean square error. Copied from DESeq2

4. LFC testing

DESeq2 使用 Wald test

檢測 null hypothesis: |β| ≤ θ，然後計算 p-value 值

Result 當然會跟你說 DESeq2 好棒棒，不過最重要的一點就是大家都用這個工具

Sensitivity 越高越好，FDR 越低越好。Figure 6 in DESeq2

apeglm

DESeq 團隊 2019 新作品，因為 DESeq2 可以直接使用，所以 citation 意外的還蠻高的

之前是假設 β 是 normal distribution mean = 0(lfcShrink 的 type normal)，這裡變成 Cauchy distribution mean=0 (等價於 T distribution 在 dof=1 的情況)，更能處理 effect size 比較大的狀況。

Cauchy distribution in wiki. https://en.wikipedia.org/wiki/Cauchy_distribution#/media/File:Cauchy_pdf.svg

1.前面到 likelihood 的部分都一樣

2. prior 的話就根據 β 跟 SE(β) 還有假設 Maximum-likelihood 是 normal distribution，就能 closed form 計算出來 (公式 paper 有)，解出來的 S，也就是 Cauchy distribution 的 S(Scale)，是根據所有 gene 的資料一起統計出來ㄉ。注意這裡並沒有用任何 threshold 把 outlier 拿掉，減少了人為的操作。

3. posterior 是找 Cauchy distribution 跟 Negative Binomial 混合起來的最大值，詳細計算蠻複雜的，最後就能準確的 β

4. Testing 方式比較有趣，利用 posterior 的 probability 算出

local false sign rate (FSR): effect size 的正負(變多變少) 相反的機率
local false-sign-or-smaller rate (FSOS): FSR 再加上他小於某個值的機率

然後就可以找出每個 gene 的 s-value，可以想像 s-value 是某種根據 effect-size (β) ranking 的機制，s-value 把 FSR 比較前面的 gene 的 FSR 算的平均當作這個 gene 的 ranking。 s-value 有幾個優點

排名越前面的代表 effect-size 越重要，i.e. s-value 的大小是有意義的，這是跟 p-value 的觀念不一樣的地方
Threshold 的意義就變成根據 s-value (effect size 的強度) 去決定要那些 gene list
而且因為少了每個 gene 的 testing 所以也沒有 p-adj
也沒有因為 read count 少就先篩掉，全憑 effect size 決定

Result (apeglm 好棒棒)

Conclusion

以上四篇 paper ，概念上其實沒有很複雜，

我們假設 read count 資料符合 negative binomial distribtuion ，

然後利用其他 gene 的資訊，比如說用 regression fit 出一個比較平滑的 prior 後，

再把這個 prior 拿來 tune 預測值

最後在 testing 出 differential expression gene

就這樣而已

剩下的內容看看要不要下次寫

Other DESeq2 and apeglm Result (直接去看 paper 這部分不難懂)
rlog in DESeq2 (做 cluster 用的)
cutdiff2 (有考慮 ambiguous mapping in isoforms)
limma
ashr(在 FDR 的步驟中，把 effect size 跟 standard error 考慮進來)

edgeR, DESeq2, cutdiff 就佔了 citation 的 90% 了，真猛

統計至 2018/2 Copied from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6954399/

Medium 有個 claps 功能，拜託大家使用鼓勵作者阿，也可以留言回應文章的問題

Reference

Robinson, M. D., & Smyth, G. K. (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9(2), 321–332.
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140.
Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Nature Precedings, 1–1.
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology, 15(12), 550.
Zhu, A., Ibrahim, J. G., & Love, M. I. (2019). Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics, 35(12), 2084–2092.
edgeR document https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
DESeq document https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html