tokyo.r #44 lt.pptx

Ten statistics papers that made the top 100 most-cited list

@siero5335 Tokyo.R #44 20141101

11. Nonparametric estimation from incomplete observations
24. Regression models and life-tables
29. Statistical methods for assessing agreement between two methods of clinical measurement
57. Maximum likelihood from incomplete data via EM algorithm
58. Equation of state calculations by fast computing machines
59. Controlling the false discovery rate: a practical and powerful approach to multiple testing
64. Multiple range and multiple F tests
68. The measurement of observer agreement for categorical data
73. A new look at statistical-model identification
88. An algorithm for least-squares estimation of nonlinear parameters

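About me Twitter ID: @siero5335

Work: at a university, analyzing the effects of chemical exposure and developing measurement methods. Specialty: environmental chemistry, analytical chemistry

R → used to summarize measurement results

Nature recently published the top 100 most-cited papers

The top 100 papers http://linkis.com/www.nature.com/news/VOuNk

Ten of them were statistics papers! What kind of papers are they? Can they be used in R? → I reviewed them.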

No. 11: Nonparametric estimation from incomplete observations

Kaplan, E. L. & Meier, P. J. Am. Stat. Assoc. 53 457–481: 1958 http://dx.doi.org/10.1080/01621459.1958.10501452

Kaplan-Meier method
A way to compare survival rates with and without an exposure
Analyzes how the risk of a binary outcome changes, taking the passage of time into account
Used not only for disease survival rates but also for comparing things like machine failure rates

See Example
library(survival)
leukemia.surv <- survfit(Surv(time, status) ~ x, data = aml)
plot(leukemia.surv, lty = 2:3)
legend(100, .9, c("Maintenance", "No Maintenance"), lty = 2:3)
title("Kaplan-Meier Curves\nfor AML Maintenance Study")
lsurv2 <- survfit(Surv(time, status) ~ x, aml, type = 'fleming')
plot(lsurv2, lty = 2:3, fun = "cumhaz",
     xlab = "Months", ylab = "Cumulative Hazard")

No. 24: Regression models and life-tables

Cox, D. R. J. R. Stat. Soc., B 34 187–220: 1972 http://www.jstor.org/discover/10.2307/2985181?uid=3739256&uid=2&uid=4&sid=21104904748827

Cox proportional hazards model
A multivariable regression model for examining the effects of factors related to survival
Survival time analysis

Example
library(survival)
bladder1 <- bladder[bladder$enum < 5, ]
coxph(Surv(stop, event) ~ (rx + size + number) * strata(enum) + cluster(id), bladder1)

Result

                            coef     exp(coef)  se(coef)  robust se  z       p
rx                          -0.526   0.591      0.3158    0.3152     -1.669  0.095
size                         0.0696  1.072      0.1016    0.0886      0.785  0.43
number                       0.2382  1.269      0.0759    0.0746      3.193  0.0014
rx:strata(enum)enum=2       -0.1063  0.899      0.5042    0.334      -0.318  0.75
size:strata(enum)enum=2     -0.1474  0.863      0.168     0.1141     -1.292  0.2
number:strata(enum)enum=2   -0.1013  0.904      0.119     0.1176     -0.861  0.39
number:strata(enum)enum=3   -0.0647  0.937      0.1293    0.1203     -0.537  0.59
number:strata(enum)enum=4    0.0943  1.099      0.1459    0.1197      0.788  0.43

No. 29: Statistical methods for assessing agreement between two methods of clinical measurement

Bland, J. M. & Altman, D. G. Lancet 327 307–310: 1986 http://dx.doi.org/10.1016/S0140-6736(86)90837-8

Values obtained from blood tests and the like are not the true values
→ When comparing a newly developed measurement method with an earlier one, keep in mind that both sets of values contain bias

Package 'mcr' (Method Comparison Regression): weighted Deming regression or Passing-Bablok regression will do the job

Example
library("mcr")
data(creatinine, package = "mcr")
x <- creatinine$serum.crea
y <- creatinine$plasma.crea
m1 <- mcreg(x, y, method.reg = "Deming", mref.name = "serum.crea", mtest.name = "plasma.crea", na.rm = TRUE)
plot(m1)
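The 1986 paper is best known for plotting the difference between two methods against their mean, with "limits of agreement" drawn around the average difference. The following is only a minimal base-R sketch of that idea. The data are simulated and the names (method_a, method_b, loa) are hypothetical, not taken from the slides or from the mcr package.

```r
# Simulated paired measurements (hypothetical data, not from the slides)
set.seed(1)
truth    <- runif(50, 50, 150)
method_a <- truth + rnorm(50, sd = 5)
method_b <- truth + 2 + rnorm(50, sd = 5)  # method_b is simulated to read about 2 units higher

d    <- method_b - method_a                # per-sample difference
m    <- (method_a + method_b) / 2          # per-sample mean of the two methods
bias <- mean(d)                            # average disagreement between methods
loa  <- bias + c(-1.96, 1.96) * sd(d)      # approximate 95% limits of agreement

# Difference-versus-mean plot with the bias (solid) and limits (dashed)
plot(m, d, xlab = "Mean of the two methods", ylab = "Difference (B - A)")
abline(h = c(bias, loa), lty = c(1, 2, 2))
```

Because neither method is treated as the truth, the x-axis uses the mean of the two readings rather than either reading alone.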

No. 57: Maximum likelihood from incomplete data via EM algorithm

Dempster, A. P., Laird, N. M. & Rubin, D. B. J. R. Stat. Soc., B 39 1–38: 1977 http://www.jstor.org/discover/10.2307/2984875?uid=3739256&uid=2&uid=4&sid=21104904748827

Example
library(mclust)
model <- Mclust(faithful)
# predict cluster for the observed data
pred <- predict(model)
str(pred)
pred$z # equal to model$z
pred$classification # equal to
plot(faithful, col = pred$classification, pch = pred$classification)
# predict cluster over a grid
grid <- apply(faithful, 2, function(x) seq(min(x), max(x), length = 50))
grid <- expand.grid(eruptions = grid[,1], waiting = grid[,2])
pred <- predict(model, grid)
plot(grid, col = mclust.options()$classPlotColors[pred$classification], pch = 15, cex = 0.5)
points(faithful, pch = model$classification)

EM algorithm

The legendary Rubin: a form of unsupervised learning

A method for finding maximum likelihood estimates of a probabilistic model's parameters when there are hidden variables (see はじパタ, p. 169)

No. 58: Equation of state calculations by fast computing machines

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. J. Chem. Phys. 21 1087–1092: 1953

http://dx.doi.org/10.1063/1.1699114

http://www.slideshare.net/hoxo_m/ss-38848635

The Metropolis method, one of the Markov chain Monte Carlo methods
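The slides give no example for this one, so here is a rough sketch of a random-walk Metropolis sampler in base R. The target (a standard normal known only up to a constant), the function names log_target and metropolis, and the step size are all assumptions chosen for illustration, not something from the talk.

```r
# Log density of the target, up to an additive constant (standard normal here)
log_target <- function(x) -0.5 * x^2

# Random-walk Metropolis: propose a symmetric move, accept with prob min(1, ratio)
metropolis <- function(n_iter, init = 0, step = 1) {
  x <- numeric(n_iter)
  current <- init
  for (i in seq_len(n_iter)) {
    proposal  <- current + rnorm(1, sd = step)            # symmetric proposal
    log_ratio <- log_target(proposal) - log_target(current)
    if (log(runif(1)) < log_ratio) current <- proposal    # accept, otherwise stay put
    x[i] <- current
  }
  x
}

set.seed(44)
draws <- metropolis(20000)
mean(draws); sd(draws)   # should land near 0 and 1 for this target

hist(draws, breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE)  # overlay the true density for comparison
```

Only the ratio of target densities is needed, which is why the normalizing constant can be dropped.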

No. 59: Controlling the false discovery rate: a practical and powerful approach to multiple testing

Benjamini, Y. & Hochberg, Y. J. R. Stat. Soc. B 57 289–300: 1995 http://www.jstor.org/discover/10.2307/2346101?uid=3739256&uid=2129&uid=2&uid=70&uid=4&sid=21104904748827

False discovery rate (FDR)

One of the multiple-comparison methods
Used when there are so many variables that nothing would survive a Bonferroni correction
Widely used in fields such as genetic analysis

Example
# biocLite() was the installer at the time of this talk; newer Bioconductor releases use BiocManager::install("qvalue")
source("http://bioconductor.org/biocLite.R")
biocLite("qvalue")
library(qvalue)
data(hedenfalk)
length(hedenfalk)
[1] 3170
qobj <- qvalue(hedenfalk)

Result: qsummary(qobj) shows the values adjusted for multiple comparisons

         <1e-04  <0.001  <0.01  <0.025  <0.05  <0.1    <1
p-value      15      76    265     424    605   868  3170
q-value       0       0      1      73    162   319  3170
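The Benjamini-Hochberg adjustment itself is also available in base R through p.adjust, so it can be tried without Bioconductor. The sketch below uses made-up p-values (not the hedenfalk data) and reproduces the adjustment by hand to show what the procedure does. The vector p and the name manual are hypothetical.

```r
# Hypothetical p-values for illustration (not the hedenfalk data)
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)

bh   <- p.adjust(p, method = "BH")          # Benjamini-Hochberg adjusted p-values
bonf <- p.adjust(p, method = "bonferroni")  # Bonferroni, for comparison

# Manual BH: sort ascending, scale each p by m / rank,
# then take running minima from the largest p downwards to keep the values monotone
m <- length(p)
o <- order(p)
manual <- numeric(m)
manual[o] <- pmin(1, rev(cummin(rev(p[o] * m / seq_len(m)))))

all.equal(bh, manual)  # checks the manual version against p.adjust
sum(bh < 0.05)         # number of discoveries under FDR control
sum(bonf < 0.05)       # number of discoveries under Bonferroni
```

Bonferroni multiplies every p-value by m, while BH multiplies the i-th smallest by m / i, which is why it is less conservative when many tests are run.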

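No. 64: Multiple range and multiple F tests

Duncan, D. B. Biometrics 11 1–42: 1955 http://dx.doi.org/10.2307/3001478

Duncan's multiple range test
Robust against false negatives, at the cost of an increased risk of false positives

See Wikipedia

Via @monotropastrum's blog http://ito-hi.blog.so-net.ne.jp/2008-08-04: "Problems in applying mathematical statistics methods in soil science and plant nutrition: 3. Why Duncan's multiple range test cannot be used"

http://ci.nii.ac.jp/els/110001747309.pdf?id=ART0001883271&type=pdf&lang=jp&host=cinii&order_no=&ppv_type=0&lang_sw=&no=1414828068&cp=
It has been criticized as not being a comparison method that accounts for multiplicity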

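No. 68: The measurement of observer agreement for categorical data

Landis, J. R. & Koch, G. G. Biometrics 33 159–174: 1977 http://dx.doi.org/10.2307/2529310

A general statistical methodology for analyzing multivariate categorical data arising from observer reliability studies

I could not get hold of the paper... If you can download papers through a university or similar, please look it up...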

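No. 73: A new look at statistical-model identification

Akaike, H. IEEE Trans. Automat. Contr. 19 716–723: 1974 http://dx.doi.org/10.1109/TAC.1974.1100705

AIC = -2 ln(L) + 2k
L = maximum likelihood, k = degrees of freedom (number of free parameters)

Akaike information criterion (AIC)

Compute the AIC each time a factor is added to or removed from the regression equation
Adding factors incurs a penalty
This keeps the number of variables from growing too large

Example
lm1 <- lm(Y ~ X1 + X2, data = data)
AIC(lm1)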
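To make the formula concrete, the sketch below computes the AIC by hand from the log-likelihood and checks it against AIC(). It uses the mtcars data that ships with R. The choice of variables and the helper name manual_aic are assumptions for illustration only.

```r
# mtcars ships with R; the choice of variables is only for illustration
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# AIC by hand: -2 * log-likelihood + 2 * (number of estimated parameters)
manual_aic <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")        # regression coefficients plus the residual variance
  -2 * as.numeric(ll) + 2 * k
}

manual_aic(fit_small); AIC(fit_small)  # the two values should agree
AIC(fit_small, fit_large)              # the model with the smaller AIC is preferred
```

Note that for lm fits k counts the residual variance as well as the coefficients, so it is one more than the number of terms in the formula plus the intercept.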

No. 88: An algorithm for least-squares estimation of nonlinear parameters

Marquardt, D. W. J. Soc. Ind. Appl. Math. 11 431–441: 1963 http://dx.doi.org/10.1137/0111030

Least squares for nonlinear data
nls will do the job (note: nls uses Gauss-Newton by default; a Levenberg-Marquardt fit is available as nlsLM in the minpack.lm package)

Example
# nls() is in the stats package, which is loaded by default, so no library(nls) call is needed
require(graphics)
DNase1 <- subset(DNase, Run == 1)
## using a selfStart model
fm1DNase1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1)
summary(fm1DNase1)

Result

      Estimate  Std. Error  t value  Pr(>|t|)
Asym  2.34518   0.07815     30.01    2.17E-13 ***
xmid  1.48309   0.08135     18.23    1.22E-10 ***
scal  1.04146   0.03227     32.27    8.51E-14 ***

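Summary

Survival analysis is strong
The legendary Rubin (propensity scores look likely to make the list in the future)
Multiple comparisons
Dr. Akaike
Nonlinear data
Most of this can be run in R, so I want to keep trying things out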
