
Page 1: 2013 05 ny

A Distributed Parallel Logistic Regression & GLM

Cliff Click, CTO, 0xdata
[email protected]
http://0xdata.com
http://cliffc.org/blog

Page 2: 2013 05 ny


H2O – A Platform for Big Math

● In-memory distributed & parallel vector math
● Pure Java, runs in cloud, server, laptop
● Open source: http://0xdata.github.com/h2o
  ● java -jar h2o.jar -name meetup
● Will auto-cluster in this room
  ● Best with default GC, largest heap
● Inner loops: near FORTRAN speeds & Java ease
  ● for( int i=0; i<N; i++ ) ...do_something...  // auto-distribute & parallelize

Page 3: 2013 05 ny


GLM & Logistic Regression

● Vector Math (for non-math majors):
  ● At the core, we compute a Gram Matrix
  ● i.e., we touch all the data
● Logistic Regression – solve with Iterative Reweighted Least Squares (IRLS)
  ● Iterative: multiple passes, multiple Grams

ηₖ = X·βₖ
μₖ = link⁻¹(ηₖ)
z = ηₖ + (y − μₖ)·link'(μₖ)
βₖ₊₁ = (Xᵀ·w·X)⁻¹·(Xᵀ·z)

Page 4: 2013 05 ny


GLM & Logistic Regression

● Vector Math (for non-math majors):
  ● At the core, we compute a Gram Matrix
  ● i.e., we touch all the data
● Logistic Regression – solve with Iterative Reweighted Least Squares (IRLS)
  ● Iterative: multiple passes, multiple Grams

ηₖ = X·βₖ
μₖ = link⁻¹(ηₖ)
z = ηₖ + (y − μₖ)·link'(μₖ)
βₖ₊₁ = (Xᵀ·w·X)⁻¹·(Xᵀ·z)

Inverse solved with Cholesky Decomposition
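To make the iteration concrete, here is a minimal single-node IRLS sketch for the logistic (logit) link, using plain dense arrays. choleskySolve() is an assumed helper (JAMA's Cholesky decomposition could stand in); none of this is the H2O API, and the distributed version replaces the inner Gram accumulation with the MRTask shown on the following slides.

  // Minimal single-node IRLS sketch for logistic regression (logit link).
  static double[] irls( double[][] X, double[] y, int iters ) {
    int N = X.length, P = X[0].length;
    double[] beta = new double[P];                    // beta_0 = 0
    for( int k = 0; k < iters; k++ ) {
      double[][] xtwx = new double[P][P];             // X^T·w·X
      double[]   xtwz = new double[P];                // X^T·w·z
      for( int r = 0; r < N; r++ ) {
        double eta = 0;                               // eta_k = X·beta_k
        for( int i = 0; i < P; i++ ) eta += X[r][i] * beta[i];
        double mu    = 1.0 / (1.0 + Math.exp(-eta));  // mu_k = link^-1(eta_k)
        double dlink = 1.0 / (mu * (1.0 - mu));       // link'(mu_k) for the logit link
        double z     = eta + (y[r] - mu) * dlink;     // working response
        double w     = mu * (1.0 - mu);               // IRLS row weight
        for( int i = 0; i < P; i++ ) {
          for( int j = 0; j < P; j++ )
            xtwx[i][j] += w * X[r][i] * X[r][j];
          xtwz[i] += w * X[r][i] * z;
        }
      }
      beta = choleskySolve(xtwx, xtwz);               // beta_{k+1}, assumed helper
    }
    return beta;
  }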

Page 5: 2013 05 ny


GLM Running Time

● n – number of rows or observations
● p – number of features
● Gram Matrix: O(n·p²) / #cpus
  ● n can be billions; constant is really small
  ● Data is distributed across machines
● Cholesky Decomp: O(p³)
  ● Real limit: memory is O(p²), on a single node
● Times a small number of iterations (5-50)
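A rough back-of-the-envelope sizing, with illustrative numbers (not quoted from the deck), shows why the data pass dominates:

  long n = 1_000_000_000L;         // a billion observations
  long p = 1_000;                  // a thousand features
  long gramBytes = p * p * 8;      // ~8 MB: the Gram itself easily fits on one node
  long gramFlops = n * p * p;      // ~1e15 multiply-adds: the distributed data pass
  long cholFlops = p * p * p / 3;  // ~3e8: the single-node Cholesky is negligible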

Page 6: 2013 05 ny


Gram Matrix

● Requires computing Xᵀ·X
● A single observation:

  double x[], y;
  for( int i=0; i<P; i++ ) {
    for( int j=0; j<=i; j++ )
      _xx[i][j] += x[i]*x[j];
    _xy[i] += y*x[i];
  }
  _yy += y*y;

● Computed per-row
● Millions to billions of rows
● Parallelize / distribute per-row
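Because Xᵀ·X is symmetric, the loop above fills only the lower triangle (j <= i); a small illustrative step (not verbatim H2O code) mirrors it into the upper triangle before handing the Gram to the solver:

  for( int i=0; i<P; i++ )
    for( int j=0; j<i; j++ )
      _xx[j][i] = _xx[i][j];   // copy the lower triangle into the upper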

Page 7: 2013 05 ny


Distributed Vector Coding

● Map-Reduce Style
● Start with a Plain Olde Java Object
  ● Private clone per-Map
  ● Shallow-copy within a JVM; deep-copy across JVMs
● Map a “chunk” of data into the private clone
  ● "chunk" == all the rows that fit in 4Meg

● Reduce: combine pairs of cloned objects

Page 8: 2013 05 ny


Plain Old Java Object

● Using the POJO:

  Gram G = new Gram();
  G.invoke(A);      // Compute the Gram of A
  ...G._xx[][]...   // Use the Gram for more math

● Defining the POJO:

  class Gram extends MRTask {
    Key _data;                   // Input variable(s)
    double _xx[][], _xy[], _yy;  // Output variables
    void map( Key chunk ) { … }
    void reduce( Gram other ) { … }
  }

Page 9: 2013 05 ny


Gram.map

● Define the map:

  void map( Key chunk ) {
    // Pull in 4M chunk of data
    ...boiler plate...
    for( int r=0; r<rows; r++ ) {
      double y, x[] = decompress(r);
      for( int i=0; i<P; i++ ) {
        for( int j=0; j<=i; j++ )
          _xx[i][j] += x[i]*x[j];
        _xy[i] += y*x[i];
      }
      _yy += y*y;
    }
  }

Page 10: 2013 05 ny


Gram.reduce

● Define the reduce:

  // Fold 'other' into 'this'
  void reduce( Gram other ) {
    for( int i=0; i<P; i++ ) {
      for( int j=0; j<=i; j++ )
        _xx[i][j] += other._xx[i][j];
      _xy[i] += other._xy[i];
    }
    _yy += other._yy;
  }
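Conceptually, invoking the task amounts to something like the following single-JVM sketch; localChunks and template are illustrative names, the cast assumes Gram supports clone(), and this is not the real H2O scheduler:

  Gram total = null;
  for( Key chunk : localChunks ) {            // chunks owned by this node
    Gram g = (Gram) template.clone();         // private clone per map
    g.map(chunk);                             // accumulate this chunk's rows
    if( total == null ) total = g;            // first partial result
    else total.reduce(g);                     // fold pairs of partials together
  }
  // Across JVMs the same pattern applies: partial Grams are serialized,
  // shipped, and reduced into one final Gram.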

Page 11: 2013 05 ny


Distributed Vector Coding 2

● Gram Matrix computed in parallel & distributed
  ● Excellent CPU & load-balancing
  ● About 1 sec per Gig for 32 medium EC2 instances
  ● The whole Logistic Regression: about 10 sec/Gig
    – Varies by #features (e.g., a billion rows, 1000 features)
● Distribution & Parallelization handled by H2O
  ● Data is pre-split by rows during parse/ingest
  ● map(chunk) is run where the chunk is local
  ● reduce runs both local & distributed
    – Gram object auto-serialized, auto-cloned

Page 12: 2013 05 ny


Other Inner-Loop Considerations

● Real inner loop has more cruft:
  ● Some columns excluded by the user
  ● Some rows excluded by sampling, or missing data
  ● Data is normalized & centered
  ● Categorical column expansion
    – Math is straightforward, but needs another indirection
● Iterative Reweighted Least Squares
  – Adds a weight to each row (sketched below)
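A hedged sketch of what the weighted, filtered inner loop might look like – skipRow(), _rowWeight, decompressY() and decompressAndNormalize() are illustrative names, not actual H2O fields:

  for( int r = 0; r < rows; r++ ) {
    if( skipRow(r) ) continue;               // sampled out, or missing data
    double w = _rowWeight[r];                // IRLS weight for this row
    double y = decompressY(r);
    double[] x = decompressAndNormalize(r);  // centered/scaled, categoricals expanded
    for( int i = 0; i < P; i++ ) {
      for( int j = 0; j <= i; j++ )
        _xx[i][j] += w * x[i] * x[j];        // weighted Gram update
      _xy[i] += w * y * x[i];
    }
    _yy += w * y * y;
  }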

Page 13: 2013 05 ny


GLM + GLMGrid

● Gram matrix is computed in parallel & distributed
  ● Rest of GLM is all single-threaded pure Java
  ● Includes JAMA for the Cholesky Decomposition
● Default 10-fold cross-validation runs in parallel
  ● Warm-start all models for faster solving
● GLMGrid: parameter search for GLM
  ● In parallel, try all combos of λ & α (see the sketch below)
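In spirit the grid search is just a nested loop over the regularization parameters, with each (λ, α) pair solved as its own parallel GLM job; the values and submitGLM() below are illustrative, not the H2O API:

  double[] lambdas = { 1e-1, 1e-2, 1e-3, 1e-4 };    // regularization strengths
  double[] alphas  = { 0.0, 0.25, 0.5, 0.75, 1.0 }; // L1/L2 mixing
  for( double lambda : lambdas )
    for( double alpha : alphas )
      submitGLM(data, lambda, alpha);   // hypothetical: queue one GLM per combo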

Page 14: 2013 05 ny


Meta Considerations: Math @ Scale

● Easy coding style is key:
  ● 1st-cut GLM ready in 2 weeks, but...
  ● Code was changing for months
  ● Incremental evolution of a number of features
  ● Distributed/parallel borders kept clean & simple
● Java
  ● Runs fine in a single JVM, in the debugger, and in Eclipse
  ● Well-understood programming model

Page 15: 2013 05 ny


H2O: Memory Considerations

● Runs best with the default GC and the largest -Xmx
  ● Data is cached in the Java heap
  ● Cache size vs. heap is monitored, with spill-to-disk
  ● Full GC typically <1 sec even for >30G heaps
● If the data fits – math runs at memory speeds
  ● Else disk-bound
● Ingest: typically needs 4x to 6x more memory
  ● Depends on GZIP ratios & column-compress ratios
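For example, an illustrative launch line (the heap value is not from the deck – pick the largest the machine allows):

  java -Xmx30g -jar h2o.jar -name meetup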

Page 16: 2013 05 ny


H2O: Reliable Network I/O

● Uses both UDP & TCP
  ● UDP for fast point-to-point control logic
  ● Reliable UDP via timeout & retry
  ● TCP, under load, reliably fails silently
    – No data at the receiver, no errors at the sender
    – 100% failure within <5 min in our labs or on EC2
      ● (so not a fault of virtualization)
● TCP uses the same reliable comm layer as UDP
  – Only use TCP for congestion control of large transfers

Page 17: 2013 05 ny


H2O: S3 Ingest

● H2O can inhale from S3 (and many others)
  ● S3, under load, reliably fails
  ● Unlike TCP, it appears to throw an exception every time
  ● Again, wrap it in a reliability retry layer
● HDFS backed by S3 (JetS3t)
  ● New failure mode: reports premature EOF
  ● Again, wrap it in a reliability retry layer (sketched below)
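A minimal sketch of the kind of retry wrapper described above (generic Java, not the actual H2O persistence layer):

  import java.io.IOException;

  // Retry an I/O action a few times before giving up (illustrative only).
  interface IOAction<T> { T run() throws IOException; }

  class Retry {
    static <T> T withRetries( IOAction<T> action, int maxTries ) throws IOException {
      IOException last = null;
      for( int t = 0; t < maxTries; t++ ) {
        try { return action.run(); }           // e.g. open an S3 / HDFS stream
        catch( IOException e ) { last = e; }   // S3 error, premature EOF, ...
      }
      throw last;                              // still failing after maxTries
    }
  }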