2013-05, NY
A Distributed Parallel Logistic Regression & GLM
Cliff Click, CTO, 0xdata
[email protected]
http://0xdata.com
http://cliffc.org/blog
0xdata.com 2
H2O – A Platform for Big Math
● In-memory distributed & parallel vector math
● Pure Java; runs in cloud, server, laptop
● Open source: http://0xdata.github.com/h2o
●   java -jar h2o.jar -name meetup
● Will auto-cluster in this room
● Best with default GC, largest heap
● Inner loops: near-FORTRAN speeds & Java ease
●   for( int i=0; i<N; i++ ) ...do_something...  // auto-distribute & parallelize
GLM & Logistic Regression
● Vector Math (for non-math majors):
● At the core, we compute a Gram Matrix
● i.e., we touch all the data
● Logistic Regression – solve with Iterative Reweighted Least Squares (IRLS)
● Iterative: multiple passes, multiple Grams

    ηk   = X·βk
    μk   = link⁻¹(ηk)
    z    = ηk + (y − μk)·link′(μk)
    βk+1 = (Xᵀ·w·X)⁻¹·(Xᵀ·w·z)
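The IRLS update above can be sketched as a plain single-node Java loop for the logistic (sigmoid) link. This is a minimal illustration, not H2O's distributed solver: class and method names here are hypothetical, and a tiny Gaussian-elimination `solve` stands in for the Cholesky step on the small p×p system.

```java
// Hypothetical single-node IRLS sketch for logistic regression.
public class Irls {
  // X: n rows of p features (include a 1-column for the intercept); y in {0,1}
  static double[] irls(double[][] X, double[] y, int iters) {
    int n = X.length, p = X[0].length;
    double[] beta = new double[p];
    for (int it = 0; it < iters; it++) {
      double[][] A = new double[p][p];   // X^T * w * X  (the Gram, weighted)
      double[] b = new double[p];        // X^T * w * z
      for (int r = 0; r < n; r++) {
        double eta = 0;
        for (int i = 0; i < p; i++) eta += X[r][i] * beta[i];
        double mu = 1.0 / (1.0 + Math.exp(-eta));    // inverse link (sigmoid)
        double w = Math.max(mu * (1 - mu), 1e-10);   // IRLS weight, clamped
        double z = eta + (y[r] - mu) / w;            // working response
        for (int i = 0; i < p; i++) {
          for (int j = 0; j < p; j++) A[i][j] += X[r][i] * w * X[r][j];
          b[i] += X[r][i] * w * z;
        }
      }
      beta = solve(A, b);                // H2O uses Cholesky here instead
    }
    return beta;
  }

  // Tiny dense solver: Gaussian elimination with partial pivoting
  static double[] solve(double[][] A, double[] b) {
    int p = b.length;
    for (int i = 0; i < p; i++) {
      int piv = i;
      for (int k = i + 1; k < p; k++)
        if (Math.abs(A[k][i]) > Math.abs(A[piv][i])) piv = k;
      double[] tr = A[i]; A[i] = A[piv]; A[piv] = tr;
      double tb = b[i]; b[i] = b[piv]; b[piv] = tb;
      for (int k = i + 1; k < p; k++) {
        double f = A[k][i] / A[i][i];
        for (int j = i; j < p; j++) A[k][j] -= f * A[i][j];
        b[k] -= f * b[i];
      }
    }
    double[] x = new double[p];
    for (int i = p - 1; i >= 0; i--) {
      double s = b[i];
      for (int j = i + 1; j < p; j++) s -= A[i][j] * x[j];
      x[i] = s / A[i][i];
    }
    return x;
  }

  public static void main(String[] args) {
    // Toy, non-separable data: intercept column plus one feature
    double[][] X = {{1,-2},{1,-1},{1,-0.5},{1,0.5},{1,1},{1,2}};
    double[] y = {0, 0, 1, 0, 1, 1};
    double[] beta = irls(X, y, 10);
    System.out.println(beta[0] + " " + beta[1]);  // slope should be positive
  }
}
```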
● Inverse solved with Cholesky Decomposition
GLM Running Time
● n – number of rows or observations
● p – number of features
● Gram Matrix: O(n·p²) / #cpus
● n can be billions; the constant is really small
● Data is distributed across machines
● Cholesky Decomp: O(p³)
● Real limit: memory is O(p²), on a single node
● Times a small number of iterations (5–50)
Gram Matrix
● Requires computing Xᵀ·X
● A single observation:

    double x[], y;
    for( int i=0; i<P; i++ ) {
      for( int j=0; j<=i; j++ )
        _xx[i][j] += x[i]*x[j];
      _xy[i] += y*x[i];
    }
    _yy += y*y;

● Computed per-row
● Millions to billions of rows
● Parallelize / distribute per-row
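The per-row update can be packaged as a small runnable class — a single-node sketch, with the class name `GramDemo` purely illustrative (only the lower triangle of the symmetric Xᵀ·X is kept, as on the slide):

```java
// Minimal single-node sketch of the per-row Gram accumulation.
public class GramDemo {
  final double[][] _xx;  // lower triangle of X^T X
  final double[] _xy;    // X^T y
  double _yy;            // y^T y
  GramDemo(int p) { _xx = new double[p][p]; _xy = new double[p]; }

  // One observation: fold the row into the running sums
  void addRow(double[] x, double y) {
    for (int i = 0; i < x.length; i++) {
      for (int j = 0; j <= i; j++)
        _xx[i][j] += x[i] * x[j];  // lower triangle only; matrix is symmetric
      _xy[i] += y * x[i];
    }
    _yy += y * y;
  }

  public static void main(String[] args) {
    GramDemo g = new GramDemo(2);
    g.addRow(new double[]{1, 2}, 3);
    g.addRow(new double[]{4, 5}, 6);
    // _xx[1][0] = 2*1 + 5*4 = 22; _xy[0] = 3*1 + 6*4 = 27; _yy = 9 + 36 = 45
    System.out.println(g._xx[1][0] + " " + g._xy[0] + " " + g._yy);
  }
}
```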
Distributed Vector Coding
● Map-Reduce Style
● Start with a Plain Olde Java Object
● Private clone per-Map
● Shallow-copy within a JVM; deep-copy across JVMs
● Map a "chunk" of data into the private clone
● "chunk" == all the rows that fit in 4Meg
● Reduce: combine pairs of cloned objects
Plain Old Java Object
● Using the POJO:

    Gram G = new Gram();
    G.invoke(A);      // Compute the Gram of A
    ...G._xx[][]...   // Use the Gram for more math

● Defining the POJO:

    class Gram extends MRTask {
      Key _data;                   // Input variable(s)
      // Output variables
      double _xx[][], _xy[], _yy;
      void map( Key chunk ) { ... }
      void reduce( Gram other ) { ... }
    }
Gram.map
● Define the map:

    void map( Key chunk ) {
      // Pull in 4M chunk of data
      ...boiler plate...
      for( int r=0; r<rows; r++ ) {
        double y, x[] = decompress(r);
        for( int i=0; i<P; i++ ) {
          for( int j=0; j<=i; j++ )
            _xx[i][j] += x[i]*x[j];
          _xy[i] += y*x[i];
        }
        _yy += y*y;
      }
    }
Gram.reduce
● Define the reduce:

    // Fold 'other' into 'this'
    void reduce( Gram other ) {
      for( int i=0; i<P; i++ ) {
        for( int j=0; j<=i; j++ )
          _xx[i][j] += other._xx[i][j];
        _xy[i] += other._xy[i];
      }
      _yy += other._yy;
    }
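The map/reduce pair above can be simulated locally — one map per "chunk" on a private copy, then reduce folds the partial results together. This is a sketch of the coding style only, using `java.util.stream` for parallelism; the class and method names are illustrative, not H2O's actual MRTask API.

```java
import java.util.stream.IntStream;

// Local simulation of the H2O map/reduce style over row chunks.
public class GramMR {
  final double[][] _xx; final double[] _xy; double _yy;
  GramMR(int p) { _xx = new double[p][p]; _xy = new double[p]; }

  // map: accumulate one chunk of rows into this private copy
  void map(double[][] xs, double[] ys) {
    for (int r = 0; r < xs.length; r++) {
      double[] x = xs[r]; double y = ys[r];
      for (int i = 0; i < x.length; i++) {
        for (int j = 0; j <= i; j++) _xx[i][j] += x[i] * x[j];
        _xy[i] += y * x[i];
      }
      _yy += y * y;
    }
  }

  // reduce: fold 'other' into 'this' and return it
  GramMR reduce(GramMR other) {
    for (int i = 0; i < _xy.length; i++) {
      for (int j = 0; j <= i; j++) _xx[i][j] += other._xx[i][j];
      _xy[i] += other._xy[i];
    }
    _yy += other._yy;
    return this;
  }

  // One map per chunk (in parallel), then reduce pairs of partial Grams
  static GramMR gram(double[][][] chunkX, double[][] chunkY, int p) {
    return IntStream.range(0, chunkX.length).parallel()
        .mapToObj(c -> { GramMR g = new GramMR(p); g.map(chunkX[c], chunkY[c]); return g; })
        .reduce(GramMR::reduce).orElse(new GramMR(p));
  }

  public static void main(String[] args) {
    double[][][] cx = { {{1,2},{4,5}}, {{7,8},{2,1}} };  // two "chunks" of rows
    double[][] cy = { {3,6}, {9,0} };
    GramMR g = gram(cx, cy, 2);
    // Same sums as a single pass over all four rows
    System.out.println(g._xx[1][0] + " " + g._xy[0] + " " + g._yy);
  }
}
```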
Distributed Vector Coding 2
● Gram Matrix computed in parallel & distributed
● Excellent CPU & load-balancing
● About 1 sec per Gig for 32 medium EC2 instances
● The whole Logistic Regression: about 10 sec/Gig
  – Varies by #features (e.g., a billion rows, 1000 features)
● Distribution & parallelization handled by H2O
● Data is pre-split by rows during parse/ingest
● map(chunk) is run where the chunk is local
● reduce runs both locally & distributed
  – Gram object auto-serialized, auto-cloned
Other Inner-Loop Considerations
● Real inner loop has more cruft
● Some columns excluded by user
● Some rows excluded by sampling, or missing data
● Data is normalized & centered
● Categorical column expansion
  – Math is straightforward, but needs another indirection
● Iterative Reweighted Least Squares
  – Adds a weight to each row
GLM + GLMGrid
● Gram matrix is computed in parallel & distributed
● Rest of GLM is all single-threaded pure Java
● Includes JAMA for Cholesky Decomposition
● Default 10-fold cross-validation runs in parallel
● Warm-start all models for faster solving
● GLMGrid: parameter search for GLM
● In parallel, try all combos of λ & α
Meta Considerations: Math @ Scale
● Easy coding style is key:
● 1st-cut GLM ready in 2 weeks, but
● Code was changing for months
● Incremental evolution of a number of features
● Distributed/parallel borders kept clean & simple
● Java
● Runs fine in a single JVM, in a debugger + Eclipse
● Well-understood programming model
H2O: Memory Considerations
● Runs best with default GC, largest -Xmx
● Data cached in the Java heap
● Cache size vs heap monitored; spill-to-disk
● Full GC typically <1 sec, even for >30G heaps
● If data fits – math runs at memory speeds
● Else disk-bound
● Ingest: typically need 4x to 6x more memory
● Depends on GZIP ratios & column-compression ratios
H2O: Reliable Network I/O
● Uses both UDP & TCP
● UDP for fast point-to-point control logic
● Reliable UDP via timeout & retry
● TCP, under load, reliably fails silently
  – No data at receiver, no errors at sender
  – 100% fail, <5 mins in our labs or on EC2
    ● (so not a fault of virtualization)
● TCP uses the same reliable comm layer as UDP
  – Only use TCP for congestion control of large xfers
H2O: S3 Ingest
● H2O can inhale from S3 (and many others)
● S3, under load, reliably fails
● Unlike TCP, appears to throw an exception every time
● Again, wrap in a reliability retry layer
● HDFS backed by S3 (jets3t)
● New failure mode: reports premature EOF
● Again, wrap in a reliability retry layer
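The "reliability retry layer" idea can be sketched as a generic wrapper: bound the retries, back off between attempts, rethrow only when exhausted. Class and method names here are hypothetical, not H2O's actual classes.

```java
import java.util.concurrent.Callable;

// Illustrative retry wrapper for flaky I/O (S3 reads, TCP transfers, ...).
public class Retry {
  static <T> T withRetries(Callable<T> op, int maxTries, long backoffMs) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxTries; attempt++) {
      try {
        return op.call();                    // e.g., an S3 read or large xfer
      } catch (Exception e) {
        last = e;                            // failure: remember & retry
        Thread.sleep(backoffMs * attempt);   // linear backoff between tries
      }
    }
    throw last;                              // retries exhausted: give up
  }

  public static void main(String[] args) throws Exception {
    // Simulate an op that fails twice ("premature EOF") then succeeds
    int[] calls = {0};
    String r = withRetries(() -> {
      if (++calls[0] < 3) throw new java.io.IOException("premature EOF");
      return "ok";
    }, 5, 1);
    System.out.println(r + " after " + calls[0] + " tries");
  }
}
```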