22 r data manipulation 2 pt 20140404

[원] Statistical Consulting 2014-1

A.2 Data Manipulation in R: the reshape, plyr, and data.table packages

Myung-Hoe Huh (허명회; Professor of Statistics, Korea University)

2014.04.04

[Photo: Hadley Wickham at useR! 2013, author of reshape, plyr, ggplot2, ...]

Applied Data Analysis: Using R


Overview

Efficient ways to process data in R without writing explicit loops

1. reshape: changing the shape of a data set

2. plyr: split-apply-combine

3. data.table: fast lookup


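reshape: Changing the shape of a data set

Example: french_fries {reshape}

Structure:
  time              10 time points
  treatment         3 treatments
  subject           12 subjects
  rep               2 replicates
  y1 y2 y3 y4 y5    5 responses

* Many values are missing
* The row order is randomized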


reshape: french_fries example (continued)

Goal: for each subject, compute the mean of the five y responses under each treatment

subject = :
  treatment   y1   y2   y3   y4   y5
      1       □    □    □    □    □
      2       □    □    □    □    □
      3       □    □    □    □    □


reshape: melting (french_fries example, continued)

> library(reshape)
> ff.melted <- melt(french_fries, id=c("time","subject","treatment","rep"), na.rm=TRUE)
> head(ff.melted, 10)
   time subject treatment rep variable value
1     1       3         1   1       y1   2.9
2     1       3         1   2       y1  14.0
3     1      10         1   1       y1  11.0
4     1      10         1   2       y1   9.9
5     1      15         1   1       y1   1.2
6     1      15         1   2       y1   8.8
7     1      16         1   1       y1   9.0
8     1      16         1   2       y1   8.2
9     1      19         1   1       y1   7.0
10    1      19         1   2       y1  13.0

⋮

Identifier (id) variables: time, subject, treatment, rep
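To make the melting step concrete, here is a minimal base-R sketch of the same idea on a toy data frame. The data values and the names `wide`, `id_vars`, `measure_vars`, and `long` are hypothetical and chosen only for illustration; this is not how `melt()` is implemented, only a picture of the long format it produces.

```r
# Toy wide data: two id columns and two measured columns (hypothetical values).
wide <- data.frame(
  subject   = c(3, 3, 10),
  treatment = c(1, 2, 1),
  y1        = c(2.9, 14.0, NA),
  y2        = c(0.0, 1.1, 5.5)
)

id_vars      <- c("subject", "treatment")
measure_vars <- setdiff(names(wide), id_vars)

# Stack each measured column under a copy of the id columns.
pieces <- lapply(measure_vars, function(v) {
  data.frame(wide[id_vars], variable = v, value = wide[[v]])
})
long <- do.call(rbind, pieces)

# Counterpart of na.rm=TRUE: drop the rows whose value is missing.
long <- long[!is.na(long$value), ]
rownames(long) <- NULL
print(long)
```

Each row of `long` holds one measurement, identified by the id columns plus the name of the variable it came from, which is the shape that a casting step can then aggregate.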


reshape: casting (french_fries example, continued)

> cast(ff.melted, subject+treatment ~ variable, length)
   subject treatment y1 y2 y3 y4 y5
1        3         1 18 18 18 18 18
2        3         2 18 18 18 18 18
3        3         3 18 18 18 18 18   ... fewer than 20
4       10         1 20 20 20 20 20
5       10         2 20 20 20 20 20
6       10         3 20 20 20 20 20
7       15         1 20 20 20 20 20
8       15         2 20 20 20 20 20
9       15         3 19 19 19 19 19
10      16         1 20 20 20 20 20
11      16         2 20 19 20 20 20   ... fewer than 20
12      16         3 20 20 20 20 20

⋮



> cast(ff.melted, subject+treatment ~ variable, function(x) 20-length(x))
   subject treatment y1 y2 y3 y4 y5
1        3         1  2  2  2  2  2
2        3         2  2  2  2  2  2
3        3         3  2  2  2  2  2   ... 2 missing
4       10         1  0  0  0  0  0
5       10         2  0  0  0  0  0
6       10         3  0  0  0  0  0
7       15         1  0  0  0  0  0
8       15         2  0  0  0  0  0
9       15         3  1  1  1  1  1
10      16         1  0  0  0  0  0
11      16         2  0  1  0  0  0   ... 1 missing
12      16         3  0  0  0  0  0

⋮



> options(digits=3)
> cast(ff.melted, treatment+time ~ variable, mean)
   treatment time   y1    y2    y3   y4    y5
1          1    1 7.92 1.796 0.904 2.76 2.150
2          1    2 7.59 2.525 1.004 3.90 1.975
3          1    3 7.77 2.296 0.817 4.65 1.117
4          1    4 8.40 1.979 1.025 2.08 0.467
5          1    5 7.74 1.367 0.771 4.28 3.008
6          1    6 6.08 1.825 0.467 4.34 2.554
7          1    7 6.28 1.242 0.163 3.20 2.196
8          1    8 5.17 0.987 0.633 5.39 4.588
9          1    9 6.07 1.830 0.135 3.95 2.905
10         1   10 5.46 1.960 0.455 6.50 5.400
11         2    1 8.78 2.492 0.996 1.72 0.808
12         2    2 8.54 3.125 0.950 2.14 0.662

⋮



> options(digits=3)
> cast(ff.melted, subject+treatment ~ variable, mean, margins="grand_col")
   subject treatment    y1    y2     y3    y4     y5 (all)
1        3         1  6.22 0.372 0.1889 2.106 3.1111  2.40
2        3         2  6.74 0.589 0.1056 3.139 2.4778  2.61
3        3         3  5.29 0.767 0.0944 2.856 2.8667  2.38
4       10         1  9.96 6.750 0.5850 4.020 1.3750  4.54
5       10         2  9.99 6.980 0.4750 2.150 0.8200  4.08
6       10         3 10.03 6.450 0.1450 3.110 0.6900  4.08
7       15         1  3.36 0.720 0.4200 3.965 3.2600  2.35
8       15         2  4.41 1.315 0.3400 2.285 2.0600  2.08
9       15         3  3.96 0.989 0.4421 2.547 2.3684  2.06
10      16         1  6.50 3.260 0.7550 4.120 1.2300  3.17
11      16         2  6.45 3.374 1.0550 3.400 0.4550  2.94
12      16         3  6.86 2.700 1.1250 3.200 0.5550  2.89

⋮


reshape: array output (french_fries example, continued)

> options(digits=3)
> cast(ff.melted, subject ~ treatment ~ variable, mean)

, , variable = y1

       treatment
subject     1    2     3
     3   6.22 6.74  5.29
     10  9.96 9.99 10.03
     15  3.36 4.41  3.96
     16  6.50 6.45  6.86
     19  9.38 8.64  8.74
     31  8.84 8.03  9.03
     51 10.68 9.98 10.22
     52  5.06 5.51  5.47
     63  6.78 8.41  8.06
     78  3.62 3.78  4.00
     79  8.06 7.94  7.73
     86  4.18 3.99  3.87

, , variable = y2

       treatment
subject     1     2     3
     3  0.372 0.589 0.767
     10 6.750 6.980 6.450
     15 0.720 1.315 0.989
     16 3.260 3.374 2.700
     19 3.055 2.450 1.725
     31 0.444 0.617 0.650
     51 2.640 3.795 3.130
     52 0.805 1.025 0.865
     63 0.025 0.105 0.065
     78 0.735 0.295 0.705
     79 0.282 0.694 0.572
     86 1.772 2.061 1.633

, , variable = y3

⋯



> options(digits=3)
> apply(cast(ff.melted, subject ~ treatment ~ variable, mean), c(2,3), mean)
         variable
treatment   y1   y2    y3   y4   y5
        1 6.89 1.74 0.639 4.05 2.58
        2 6.99 1.94 0.652 3.63 2.45
        3 6.94 1.69 0.668 3.85 2.53
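The call above keeps margins 2 and 3 of a three-way array and averages over the remaining margin. The following base-R sketch shows that mechanism on a small array; the array `arr`, its dimension names, and its values are hypothetical stand-ins, not the french_fries results.

```r
# Toy 3-d array: 4 subjects x 3 treatments x 2 variables (hypothetical values).
arr <- array(1:24, dim = c(4, 3, 2),
             dimnames = list(subject   = c("s1", "s2", "s3", "s4"),
                             treatment = c("1", "2", "3"),
                             variable  = c("y1", "y2")))

# Keep margins 2 (treatment) and 3 (variable); mean() is applied over
# the remaining margin (subject).
trt_means <- apply(arr, c(2, 3), mean)

# The same number for a single cell, written out explicitly.
manual <- mean(arr[, "2", "y1"])

print(trt_means)
```

Note that the result is an unweighted mean of the per-subject cell values, so when cells are built from unequal numbers of observations it need not equal the mean over all raw observations.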



> options(digits=3)
> cast(ff.melted, subject+treatment ~ ., quantile, c(0,0.25,0.5,0.75,1))
   subject treatment X0.  X25. X50. X75. X100.
1        3         1   0 0.000 0.40 3.22  14.0
2        3         2   0 0.000 0.50 3.38  14.1
3        3         3   0 0.000 0.60 3.80  14.1
4       10         1   0 0.000 3.85 8.40  13.2
5       10         2   0 0.000 2.55 8.25  11.4
6       10         3   0 0.000 3.35 8.40  11.5
7       15         1   0 0.175 1.25 3.65  10.8
8       15         2   0 0.200 1.05 3.12  12.7
9       15         3   0 0.200 0.80 3.40  10.4
10      16         1   0 0.300 2.15 4.95  11.0
11      16         2   0 0.200 1.50 4.65  13.4
12      16         3   0 0.500 1.35 4.58  12.7

⋮


plyr: Split-Apply-Combine

[Diagram: data -> split -> apply (a function) -> combine -> outputs]

Hadoop: MapReduce


plyr: baseball {plyr} example

Structure:
data.frame: 21699 obs. of 22 variables
 $ id   : chr "ansonca01" "forceda01" "mathebo01" "startjo01" ...
 $ year : int 1871 1871 1871 1871 1871 1871 1871 1872 ...
 $ rbi  : int 16 29 10 34 23 21 23 50 15 16 ...

Goal: for each id (player), find the career year of the career-high RBI season
  - c.year (= year-min(year)+1)
  - max.rbi

Procedure
  1. split: divide the full data set by id (player)
  2. apply: within each id subset, compute c.year, max.rbi, and the c.year in which max.rbi occurred
  3. combine: put the per-player results back together
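The three steps can be sketched in base R before turning to plyr. The toy data frame `bb` and the helper `summarise_player` below are hypothetical and only mimic the id, year, and rbi columns of baseball; plyr's `ddply` performs the same split, apply, and combine in a single call.

```r
# Toy stand-in for the baseball data (hypothetical players and values).
bb <- data.frame(
  id   = c("a", "a", "a", "b", "b"),
  year = c(1871, 1872, 1873, 1900, 1901),
  rbi  = c(16, 50, 29, 10, 34)
)

# 1. split: one data frame per player
pieces <- split(bb, bb$id)

# 2. apply: career year of the best RBI season, the best RBI, career length
summarise_player <- function(df) {
  cyear <- df$year - min(df$year) + 1
  data.frame(id          = df$id[1],
             best.year   = cyear[which.max(df$rbi)],
             best.rbi    = max(df$rbi),
             career.year = max(cyear))
}
results <- lapply(pieces, summarise_player)

# 3. combine: stack the one-row results into a single data frame
out <- do.call(rbind, results)
rownames(out) <- NULL
print(out)
```

Writing the per-group work as a function of one data frame is the design point: the same function can be handed unchanged to `lapply` here or to a plyr function.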


plyr: baseball example (continued)

# plyr for baseball data
library(plyr)
str(baseball)

calculate_c.year <- function(df) mutate(df, cyear = year - min(year)+1)
baseball.1 <- ddply(baseball, .(id), calculate_c.year)
  ## cyear is appended as a new rightmost column of the baseball data frame.

calculate_c.rbi <- function(df) c(best.year=df$cyear[which.max(df$rbi)], best.rbi=max(df$rbi), career.year=max(df$cyear))
bb.2 <- ddply(baseball.1, .(id), calculate_c.rbi)
str(bb.2)
  ## The data frame bb.2 has four variables: id, best.year, best.rbi, career.year
  ## bb.2 has 1,228 rows (= the number of players).


plyr: baseball example (continued)

# histograms of best.year and career.year
max(bb.2$career.year)
hist(bb.2$best.year, breaks=seq(0.5,40.5,1), xlab="best.year", main="")
hist(bb.2$career.year, breaks=seq(0.5,40.5,1), xlab="career.year", main="")

## The career year in which max.rbi is reached (best.year) has its mode at year 7.


plyr: the **ply functions

                   Output
  Input     array    df       list
  array     aaply    adply    alply
  df        daply    ddply    dlply
  list      laply    ldply    llply


plyr: summarise( )

> library(plyr)
> ddply(baseball, "id", summarise, duration = max(year)-min(year)+1,
+       nteams = length(unique(team)))

         id duration nteams
1 aaronha01       23      3
2 abernte02       18      7
3 adairje01       13      4
4 adamsba01       21      2
5 adamsbo03       14      4
6 adcocjo01       17      5


data.table: Efficient lookup

1. Creating a data table: a 10,000,000 x 7 data frame made up of six random-digit columns and one numeric column

library(data.table)
n <- 10000000
digits <- as.factor(0:9)
x1 <- sample(digits, n, replace=TRUE)
x2 <- sample(digits, n, replace=TRUE)
x3 <- sample(digits, n, replace=TRUE)
x4 <- sample(digits, n, replace=TRUE)
x5 <- sample(digits, n, replace=TRUE)
x6 <- sample(digits, n, replace=TRUE)
DT <- data.table(x1, x2, x3, x4, x5, x6, y=rnorm(n))


data.table: Creating a data table (continued)

> head(DT, 10)
    x1 x2 x3 x4 x5 x6          y
 1:  3  7  0  2  1  0 -2.1384800
 2:  9  1  6  1  6  0  2.1295443
 3:  9  6  2  9  6  3 -1.0069040
 4:  8  8  5  5  6  4  0.1813213
 5:  2  9  9  5  3  3 -0.5683664
 6:  2  8  3  0  8  4  0.1869398
 7:  0  8  9  8  5  6 -0.1080321
 8:  4  7  5  3  7  1  2.1213928
 9:  1  9  4  9  1  6  1.3338342
10:  9  7  9  7  6  4 -0.6250066

> class(DT)
[1] "data.table" "data.frame"


data.table: Setting the lookup key

> setkey(DT, x1, x2, x3, x4, x5, x6)
> head(DT, 10)
    x1 x2 x3 x4 x5 x6          y
 1:  0  0  0  0  0  0  1.7554923
 2:  0  0  0  0  0  0  1.4160151
 3:  0  0  0  0  0  0  0.3351744
 4:  0  0  0  0  0  0 -0.4342841
 5:  0  0  0  0  0  1 -1.4443813
 6:  0  0  0  0  0  1  0.8493174
 7:  0  0  0  0  0  1  1.2504767
 8:  0  0  0  0  0  1 -1.4396524
 9:  0  0  0  0  0  1 -0.9762352
10:  0  0  0  0  0  1  0.9889054

* The rows are sorted by the key variables, in the order the keys were given.


data.table: Searching the data set

> DT[J("1","2","3","4","5","6")]

Two example outputs (shown side by side on the slide):

   x1 x2 x3 x4 x5 x6          y
1:  1  2  3  4  5  6  0.4442011
2:  1  2  3  4  5  6 -0.4213922
3:  1  2  3  4  5  6  0.9358654
4:  1  2  3  4  5  6  0.1211770
5:  1  2  3  4  5  6  0.2052872
6:  1  2  3  4  5  6 -1.4889960
7:  1  2  3  4  5  6 -0.8041964

    x1 x2 x3 x4 x5 x6            y
 1:  1  2  3  4  5  6  0.018513680
 2:  1  2  3  4  5  6  0.632281586
 3:  1  2  3  4  5  6 -0.169242317
 4:  1  2  3  4  5  6  0.003417459
 5:  1  2  3  4  5  6 -1.290678412
 6:  1  2  3  4  5  6  0.420696995
 7:  1  2  3  4  5  6  1.484245923
 8:  1  2  3  4  5  6  0.050544004
 9:  1  2  3  4  5  6  0.151274821
10:  1  2  3  4  5  6  0.308839374
11:  1  2  3  4  5  6  0.076483702

* The expected number of matching records is 10^7 x 10^-6 = 10.
* The number of matching records approximately follows a Poisson distribution with mean 10.
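The record-count arithmetic can be checked with a few lines of base R. This is a sketch under the assumption that the six digit columns are independent and uniform on 0 to 9, as in the simulation that builds DT; the variable names are illustrative only.

```r
n <- 1e7            # number of rows in DT
p <- (1 / 10)^6     # chance that six independent digits equal one given pattern
lambda <- n * p     # expected number of matching rows

# Poisson(lambda) approximation to Binomial(n, p): probability of k matches
p7  <- dpois(7,  lambda)
p11 <- dpois(11, lambda)

# Probability that the match count falls between 4 and 16 inclusive
p_mid <- ppois(16, lambda) - ppois(3, lambda)

print(c(lambda = lambda, p7 = p7, p11 = p11, p_mid = p_mid))
```

Counts such as 7 or 11 matching rows are therefore unremarkable draws from a distribution centered at 10.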


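data.table: Searching by another method

> p.time <- proc.time()
> DT[x1=="1" & x2=="2" & x3=="3" & x4=="4" & x5=="5" & x6=="6",]
⋮
> proc.time() - p.time
   user  system elapsed
   8.47    1.06    9.63

Comparison: processing time of the previous (keyed) method

> proc.time() - p.time
   user  system elapsed
   0.08    0.03    0.11

* In elapsed time, the keyed lookup takes only about 1.1% of the time of the logical-condition search (0.11 s versus 9.63 s),
* because data.table performs a binary search on the key.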
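Why a sorted key helps can be illustrated with a hand-written binary search in base R. This is a generic sketch, not data.table's internal code; the function `first_match` and the vector `keys` are hypothetical names used only here.

```r
# Binary search on a sorted vector: returns the index of the first element
# equal to `target`, or NA if there is none.
first_match <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  found <- NA_integer_
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (sorted[mid] < target) {
      lo <- mid + 1L
    } else {
      if (sorted[mid] == target) found <- mid
      hi <- mid - 1L   # keep looking to the left for an earlier match
    }
  }
  found
}

set.seed(1)
keys <- sort(sample(0:999999, 1e6, replace = TRUE))  # plays the role of a sorted key
target <- keys[500000]

idx_binary <- first_match(keys, target)   # roughly log2(1e6), about 20 probes
idx_scan   <- which(keys == target)[1]    # compares every one of the 1e6 elements
```

Both approaches find the same position, but the scan touches every element while the binary search halves the candidate range at each step, which is the kind of saving that sorting by the key makes possible.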


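Summary

Data manipulation: the fundamentals of "big" data analysis

A weak point of statistics majors
A wall that must be climbed in order to evolve toward data science

References: R Manuals, Vignettes, ...
Journal of Statistical Software papers

Jeon, Heewon (전희원) (2013). R로 하는 데이터 시각화 [Data Visualization with R]. Hanbit Media (한빛미디어).

Practice files: reshape_ff.r, plyr_bb.r, datatale_sim.r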
