Big Analytics without Big Hassles
Bryan Lewis, Chief Data Scientist
Alex Poliakov, Solutions Architect
© Paradigm4 Inc. 2
Paradigm4’s SciDB
SciDB is an open-source, scalable array database with native complex math analytics, integrated with R and Python.
Paradigm4’s SciDB
SciDB helps data scientists, bioinformaticians, quants, analysts, and scientists tackle their toughest “Big Data” management and complex analytics challenges.
Webinar Replay
These slides are from a Paradigm4 webinar held on 11/12/13
You can find this webinar, and additional webinars, at:
http://www.paradigm4.com/video/
www.paradigm4.com
Agenda
1. Brief Introduction to SciDB
2. Demos
3. Q & A
Open-source high-performance database
Data organized in multi-dimensional sparse arrays
Horizontally scalable
Excels at parallel linear algebra
ACID, data replication, versioned data
Developed by Paradigm4
About Paradigm4
Paradigm4 develops & supports SciDB
CTO is MIT database researcher Mike Stonebraker
• Force behind many major advances in commercial database products (Postgres, Illustra, Streambase, Vertica, VoltDB, …)
Commercial applications
• Computational Genomics
• E-commerce
• Imaging
• Industrial Analytics
• Quantitative Finance
• Internet of Things
Community edition
• Open Source
• Unrestricted
• Fully scalable
Enterprise edition
• More math
• Fault tolerance
• System management tools
Developed by Paradigm4
SciDB Powers NIH NCBI’s 1000 Genomes Project
Running 24 x 7 since Fall 2012
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/
SciDB Builds ARCA NBBO Book
• 186 million quotes for one day
• 80 seconds on a 32-instance cluster
• Runs in about half the time on a cluster twice as large
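The figures above can be turned into a rough throughput estimate. The sketch below only redoes the slide's arithmetic; the 64-instance projection assumes ideal linear scaling, which the slide describes only as "about half the time".

```python
# Back-of-the-envelope arithmetic using the figures quoted on the slide.
QUOTES = 186_000_000   # quotes for one day (from the slide)
SECONDS_32 = 80.0      # run time on a 32-instance cluster (from the slide)


def throughput(quotes: int, seconds: float) -> float:
    """Quotes processed per second across the whole cluster."""
    return quotes / seconds


def projected_seconds(base_seconds: float, base_instances: int, instances: int) -> float:
    """Run time on a different cluster size, ASSUMING ideal linear scaling."""
    return base_seconds * base_instances / instances


cluster_rate = throughput(QUOTES, SECONDS_32)      # quotes per second, whole cluster
per_instance_rate = cluster_rate / 32              # quotes per second, per instance
seconds_on_64 = projected_seconds(SECONDS_32, 32, 64)

print(f"{cluster_rate:,.0f} quotes/s overall, {per_instance_rate:,.0f} per instance")
print(f"projected time on 64 instances: {seconds_on_64:.0f} s")
```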
SciDB Powers Recommendation Engines
• Fast truncated SVD
• Sparse 50M x 50M matrix with 4 billion nonzero values
• Minutes per singular value on a four-node Linux cluster
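To illustrate why a truncated SVD is feasible on a very sparse matrix, here is a minimal power-iteration sketch for the largest singular value. It is not SciDB's algorithm (the slides do not say which method is used); it only shows that each iteration touches the stored nonzero cells and nothing else. The function name and the dictionary layout are assumptions made for this example.

```python
import math
import random


def top_singular_value(entries, n_rows, n_cols, iters=200, seed=0):
    """Largest singular value of a sparse matrix by power iteration on A^T A.

    entries: dict mapping (row, col) -> value, holding only the nonzero cells.
    """
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(n_cols)]   # strictly positive start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    sigma = 0.0
    for _ in range(iters):
        # u = A v, visiting only the nonzero cells
        u = [0.0] * n_rows
        for (i, j), a in entries.items():
            u[i] += a * v[j]
        # w = A^T u
        w = [0.0] * n_cols
        for (i, j), a in entries.items():
            w[j] += a * u[i]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        v = [x / norm_w for x in w]
        # For a unit vector v, ||A^T A v|| converges to the square of the top singular value.
        sigma = math.sqrt(norm_w)
    return sigma


# Tiny diagonal example: singular values are 3, 2 and 1.
example = {(0, 0): 3.0, (1, 1): 2.0, (2, 2): 1.0}
sigma_max = top_singular_value(example, 3, 3)
```

The cost per iteration is proportional to the number of nonzero values, which is why sparsity matters far more than the nominal 50M x 50M shape.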
SciDB System Architecture
“Shared Nothing” cluster of commodity hardware nodes
Interconnected with standard Ethernet and TCP/IP
SciDB Arrays
Each cell in a SciDB array consists of a fixed number of typed attributes (variables).
Here is an example cell with four attributes
Price Volume Symbol usec
450.61 150 “AAPL” 36013008713
SciDB Arrays
A 1-D array looks like a spreadsheet
This picture shows five cells, each with four attributes
Dimension i    Attributes
i    Price     Volume    Symbol    usec
1    450.61    150       “AAPL”    36013008713
2    450.73    200       “AAPL”    36013008915
3    450.84    10        “AAPL”    36013208113
4    36.57     75        “MSFT”    36019008713
5    36.20     100       “MSFT”    36003200113
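The cell-and-attribute model can be mimicked in a few lines of Python. This is only an analogy for the data model, not SciDB code; the `Cell` type and the `quotes` dictionary are names invented for the illustration, and the values are copied from the table above.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """One array cell: a fixed set of typed attributes."""
    price: float
    volume: int
    symbol: str
    usec: int


# A 1-D array: the dimension i maps to exactly one cell.
quotes = {
    1: Cell(450.61, 150, "AAPL", 36013008713),
    2: Cell(450.73, 200, "AAPL", 36013008915),
    3: Cell(450.84, 10, "AAPL", 36013208113),
    4: Cell(36.57, 75, "MSFT", 36019008713),
    5: Cell(36.20, 100, "MSFT", 36003200113),
}

# Attributes are addressed by name, cells by their dimension value.
total_volume = sum(cell.volume for cell in quotes.values())
```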
SciDB Arrays
The same data “redimensioned” into a 2D array
Dimensions: usec (rows) and Symbol (columns)
               “AAPL”              “MSFT”
usec           Price     Volume    Price    Volume
36003200113                        36.20    100
36013008713    450.61    150
36013008915    450.73    200
36013208113    450.84    10
36019008713                        36.57    75
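The redimension step can be pictured as promoting two attributes (usec and Symbol) to coordinates. The sketch below is a plain-Python analogy, not the SciDB operator; the function name `redimension_rows` is made up for this example, and empty cells are simply absent from the result, which is how a sparse array behaves.

```python
# Rows of the 1-D array: (i, price, volume, symbol, usec), copied from the slides.
rows = [
    (1, 450.61, 150, "AAPL", 36013008713),
    (2, 450.73, 200, "AAPL", 36013008915),
    (3, 450.84, 10, "AAPL", 36013208113),
    (4, 36.57, 75, "MSFT", 36019008713),
    (5, 36.20, 100, "MSFT", 36003200113),
]


def redimension_rows(rows):
    """Re-key cells by (usec, symbol); price and volume remain attributes."""
    grid = {}
    for _i, price, volume, symbol, usec in rows:
        key = (usec, symbol)
        if key in grid:
            # Two cells cannot share the same coordinates.
            raise ValueError(f"collision at {key}")
        grid[key] = (price, volume)
    return grid


grid = redimension_rows(rows)

# Only 5 of the 5 x 2 = 10 possible cells exist; the rest are empty.
msft_first = grid[(36003200113, "MSFT")]
```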
Access multi-dimensional subsets in constant time
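One way to see how such lookups can avoid scanning is regular chunking: if every dimension is cut into fixed-size chunks, the chunk holding a cell follows from integer arithmetic on its coordinates. The sketch below rests on that assumption; the slides do not describe SciDB's internal lookup, and the function names are hypothetical.

```python
from itertools import product


def chunk_origin(coords, starts, chunk_sizes):
    """Origin of the chunk containing a cell, found without any search."""
    return tuple(
        s + ((c - s) // n) * n
        for c, s, n in zip(coords, starts, chunk_sizes)
    )


def chunks_for_box(low, high, starts, chunk_sizes):
    """Origins of all chunks overlapping the inclusive box low..high."""
    ranges = []
    for lo, hi, s, n in zip(low, high, starts, chunk_sizes):
        first = s + ((lo - s) // n) * n
        last = s + ((hi - s) // n) * n
        ranges.append(range(first, last + 1, n))
    return list(product(*ranges))


# A 2-D array chunked 1000 x 8, with both dimensions starting at 0.
origin = chunk_origin((2500, 3), (0, 0), (1000, 8))
touched = chunks_for_box((900, 0), (1100, 7), (0, 0), (1000, 8))
```

The work depends on the size of the requested box, not on the size of the whole array.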
High Performance Windowing
Simple running median outlier filter
Fast, one-pass, running stats over arbitrary time or data windows
Even when time intervals cross over internal storage shards
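A running median outlier filter can be sketched in a few lines. This version is a simple reference implementation that recomputes the median for every window, so it is not the one-pass approach the slide refers to; the function name, window size, and threshold are illustrative assumptions.

```python
from statistics import median


def running_median_outliers(values, window=5, threshold=3.0, floor=1e-9):
    """Indices whose value is far from the median of its centered window.

    "Far" means more than `threshold` times the window's median absolute
    deviation (MAD); `floor` guards against a zero MAD.
    """
    half = window // 2
    flagged = []
    for k, x in enumerate(values):
        neighbors = values[max(0, k - half): k + half + 1]
        med = median(neighbors)
        mad = median([abs(y - med) for y in neighbors])
        if abs(x - med) > threshold * max(mad, floor):
            flagged.append(k)
    return flagged


# Hypothetical price series with one obvious spike at index 4.
prices = [10.0, 10.1, 9.9, 10.0, 50.0, 10.2, 10.0, 9.8, 10.1]
spikes = running_median_outliers(prices)
```

Windows near the ends of the series are simply truncated here; a window that spans storage shards would need the neighboring shard's values, which is the case the slide says SciDB handles.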
SciDB Arrays
Arrays can be joined along dimensions or subsets of dimensions
Values can be aggregated along dimensions and over windows
Functions can be applied over values in arrays
Arrays can be sparse
Linear algebra operations, matrix decompositions, and other interesting operations are defined for matrices and vectors
• Work in familiar IDE
• Data persisted in SciDB
• Offload large computations to cluster
Demos
Quantitative Finance example
• Regularized correlation
• Relevance network graph
Remote Sensing application
• NASA MODIS satellite images
• Regrid with spatial interpolation
• Visualize (multiple resolutions)
Survival Analysis on Healthcare Data
• Estimate Cox proportional hazards model with the big data bootstrap
Live demos
Two modes for using R & Python
• SciDB-R/Py (global): Program SciDB naturally from R or Python
• R/Py-exec (local): Invoke R or Python from within SciDB queries
Rationale
Provide a simple, robust way to run R or Python from inside SciDB queries, in parallel
Extend SciDB's powerful native analysis capabilities
Really simple example
avg(
r_exec(
build(<z:double>[i=1:1000,1,0],0),
'expr=x<-runif(1000);y<-runif(1000);list(sum(x^2+y^2<1)/250)') )
{i} x_avg
{0} 3.14119
Instance-parallel Monte Carlo estimate of π
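The query above runs the same small R expression once per chunk and averages the results. Each run draws 1000 points in the unit square and counts those inside the quarter circle; dividing the count by 250 is the same as multiplying the fraction by 4. The sketch below reproduces that logic serially in Python; the function names and the seed are arbitrary choices for this example.

```python
import random


def estimate_pi_chunk(n, rng):
    """One chunk's estimate: 4 * (points inside the quarter circle) / n."""
    inside = sum(
        1 for _ in range(n)
        if rng.random() ** 2 + rng.random() ** 2 < 1.0
    )
    return 4.0 * inside / n   # equals inside / 250 when n == 1000


def estimate_pi(chunks=1000, n=1000, seed=42):
    """Average the per-chunk estimates, as avg() does in the query."""
    rng = random.Random(seed)
    return sum(estimate_pi_chunk(n, rng) for _ in range(chunks)) / chunks


pi_hat = estimate_pi()
print(pi_hat)
```

In SciDB the 1000 chunks are evaluated in parallel across instances; here they run one after another, which changes the run time but not the estimate's statistical behavior.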
Big data bootstrap example
Consider a matrix named "events" with 8 columns:
ID (numeric) Race (categorical)
SES (numeric) Age (numeric)
Days_to_event (numeric) Group (categorical)
Event (binary) Gender (categorical)
Apply the bag of little bootstraps to estimate confidence intervals for coefficients of a Cox proportional hazards survival model.
Big data bootstrap example
Randomly partition rows of the events matrix into blocks of at most 1000 rows (the "bag" part of the BLB method).
store(
  redimension(
    cross_join(
      events as A,
      redimension(
        apply(
          project(
            sort(
              apply(
                build(<v:int64>[k=0:9999,1000,0],random()),
                p,k)),
            p),
          m,n),
        <p:int64> [m=0:*,1000,0]) as B,
      A.i, B.m),
    <val:double>[p=0:9999,1000,0,j=0:7,8,0]),
  P)
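The idea behind the query is a random permutation of the row indices followed by cutting the permuted rows into blocks. A minimal Python sketch of that idea, assuming 10,000 rows as in the array bounds above (the function name is hypothetical):

```python
import random


def random_blocks(n_rows, block_size=1000, seed=0):
    """Shuffle row indices, then cut them into blocks of at most block_size."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return [order[k:k + block_size] for k in range(0, n_rows, block_size)]


blocks = random_blocks(10_000)
print(len(blocks), "blocks; first rows of block 0:", blocks[0][:5])
```

Every row lands in exactly one block, and only the last block can be smaller than the block size.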
Big data bootstrap example
store(
  redimension(
    apply(
      r_exec(P, "expr=
        require(survival);
        D <- as.data.frame(matrix(val,ncol=8,byrow=TRUE));
        names(D) <- c('ID','Race','SES','Age','Days','Event','Group','Gender');
        D[,'Race'] <- factor(D[,'Race'], levels=1:13);
        D[,'Group'] <- factor(D[,'Group'], levels=1:2);
        D[,'Gender'] <- factor(D[,'Gender'], levels=1:2);
        ans <- sapply(1:500, function(x) {
          M <- coxph(Surv(Days, Event) ~ Age + Race + Group + Gender + SES + cluster(ID),
                     data=D[sample(nrow(D),nrow(D),replace=1),]);
          c(coef(M), sqrt(diag(M[['var']])))});
        list(apply(ans, 1, mean));"),
      m, n%32),
    <ans:double null>[m=0:31,32,0], avg(val) as ans),
  coefs)
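The structure of the query is: within each block, resample the rows with replacement many times, fit the model on each resample, and average the estimates and their standard errors; then average those per-block results across blocks. The sketch below mirrors that structure in Python with the sample mean standing in for the Cox model, so it is an illustration of the resampling pattern, not of survival analysis. All names are invented for the example.

```python
import random
from statistics import mean, stdev


def block_estimates(block, resamples, rng):
    """Average estimate and standard error over bootstrap resamples of one block."""
    n = len(block)
    estimates, std_errors = [], []
    for _ in range(resamples):
        # Resample the block's rows with replacement, same size as the block.
        sample = [block[rng.randrange(n)] for _ in range(n)]
        estimates.append(mean(sample))
        std_errors.append(stdev(sample) / n ** 0.5)
    return mean(estimates), mean(std_errors)


def blockwise_bootstrap(blocks, resamples=50, seed=0):
    """Average the per-block results, as the outer avg() does in the query."""
    rng = random.Random(seed)
    per_block = [block_estimates(b, resamples, rng) for b in blocks]
    estimate = mean([e for e, _ in per_block])
    std_error = mean([s for _, s in per_block])
    return estimate, std_error


# Synthetic data (assumption for the demo): 2000 draws from N(5, 2), in two blocks.
_rng = random.Random(1)
data = [_rng.gauss(5.0, 2.0) for _ in range(2000)]
est, se = blockwise_bootstrap([data[:1000], data[1000:]], resamples=20)
```

In the query each block is processed by a separate `r_exec` call, so the per-block work runs in parallel; the loop over blocks here is serial.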
Big data bootstrap result
Group 2 exhibits significantly lower relative risk of an event than Group 1 in this example.
library("scidb")
cf = scidb("coefs")[c(0,13:15)][]
se = scidb("coefs")[c(16,29:31)][]
plot(exp(cf))
lapply(1:4,function(j){lines(c(j,j),c(exp(cf[j]-1.96*se[j]),exp(cf[j]+1.96*se[j])))})
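The plot shows exp(coefficient) with a bar from exp(coefficient - 1.96 × se) to exp(coefficient + 1.96 × se), that is, a relative risk with an approximate 95% interval. The same calculation in Python, using hypothetical numbers rather than the webinar's results:

```python
import math


def relative_risk_interval(coef, se, z=1.96):
    """Relative risk exp(coef) with an approximate 95% interval."""
    return (
        math.exp(coef),
        math.exp(coef - z * se),
        math.exp(coef + z * se),
    )


# Hypothetical coefficient and standard error, NOT values from the demo.
rr, low, high = relative_risk_interval(-0.5, 0.1)

# The effect is read as significant when the interval excludes 1.
significant = high < 1.0 or low > 1.0
print(f"relative risk {rr:.3f}, interval ({low:.3f}, {high:.3f}), significant: {significant}")
```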
Take Away
• In-database, scalable, complex math
• Less coding, more analysis
• Transparent scale-up & speed-up
• Interactive exploratory analytics
• Seamless R and Python integration
www.paradigm4.com
Questions?
Tell us about your application
• [email protected]
Try our Quick Start
• scidb.org/forum
• Download a VM or EC2 AMI
www.paradigm4.com
Thanks for your interest!