©MapR Technologies 2013 - Confidential
Me, Us
Ted Dunning, Chief Application Architect, MapR
– Committer, PMC member: Mahout, ZooKeeper, Drill
– Bought the beer at the first HUG
MapR
– Distributes more open source components for Hadoop
– Adds major technology for performance, HA, industry standard APIs
Tonight
– Hash tag: #tchug
– See also: @ApacheMahout @ApacheDrill
– @ted_dunning and @mapR
Sidebar on Drill
Apache Drill
– SQL on Hadoop (and other things)
– Intended to solve problems for 1-5 years from now
• Not the problems from 1-10 years ago
– Multiple levels of API supported
• SQL-2003
• Logical plan language (DAG in JSON)
• Physical plan language (DAG with push-down, exchange markers)
• Execution plan language (many DAGs)
Current state
– SQL 2003 support in place
– Logical plan interpreter useful for testing
– Value vectors near completion
– High performance RPC working
More on Drill
Just completed OSCON workshop
Workshop materials available shortly
– Extracted technology demonstrators
– Sample queries
Send me email or tweet for more info
What’s Up?
What is Mahout?
– Math library
– Clustering, classifiers, other stuff
Recommendation
– Generalities
– Algorithm Specifics
– System Design
– Important things never mentioned
Final thoughts
What is Mahout?
“Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.
Components
– math library
– clustering
– classification
– decompositions
– recommendations
Mahout Math
Goals are
– basic linear algebra,
– and statistical sampling,
– and good clustering,
– decent speed,
– extensibility,
– especially for sparse data
But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature
Matrices and Vectors
At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix
Highly composable API
Important ideas:
– view*, assign and aggregate
– iteration
m.viewDiagonal().assign(v)
Assign? View?
Why assign?
– Copying is the major cost for naïve matrix packages
– In-place operations critical to reasonable performance
– Many kinds of updates required, so functional style very helpful
Why view?
– In-place operations often required for blocks, rows, columns or diagonals
– With views, we need #assign + #views methods
– Without views, we need #assign × #views methods
Synergies
– With both views and assign, many loops become a single line
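To make the "#assign + #views" point concrete, here is a minimal sketch in plain Java. It does not use the Mahout classes: `ViewAssignSketch`, `VectorView`, `viewDiagonal` and `viewRow` are hypothetical stand-ins invented for illustration. One `assign` method is written once and works for every kind of view, and each view writes straight through to the underlying array without copying.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical stand-ins for the view/assign idea; these are not the Mahout classes. */
final class ViewAssignSketch {
    /** A writable window onto storage owned by something else. */
    interface VectorView {
        int size();
        double get(int i);
        void set(int i, double value);

        /** In-place update: every element x becomes f(x). Returns this for chaining. */
        default VectorView assign(DoubleUnaryOperator f) {
            for (int i = 0; i < size(); i++) {
                set(i, f.applyAsDouble(get(i)));
            }
            return this;
        }

        /** Sum of all elements, named after zSum() on the slides. */
        default double zSum() {
            double sum = 0;
            for (int i = 0; i < size(); i++) {
                sum += get(i);
            }
            return sum;
        }
    }

    /** View of the main diagonal of a square matrix; no data is copied. */
    static VectorView viewDiagonal(double[][] m) {
        return new VectorView() {
            public int size() { return m.length; }
            public double get(int i) { return m[i][i]; }
            public void set(int i, double value) { m[i][i] = value; }
        };
    }

    /** View of one row; writes go straight through to the matrix. */
    static VectorView viewRow(double[][] m, int row) {
        return new VectorView() {
            public int size() { return m[row].length; }
            public double get(int i) { return m[row][i]; }
            public void set(int i, double value) { m[row][i] = value; }
        };
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        System.out.println(viewDiagonal(m).zSum());   // trace of m
        viewDiagonal(m).assign(x -> 0.0);             // zero the diagonal in place
        viewRow(m, 1).assign(x -> 2 * x);             // scale one row in place
        System.out.println(java.util.Arrays.deepToString(m));
    }
}
```

Adding a new kind of view here costs one small class, and adding a new kind of update costs one lambda, which is the additive rather than multiplicative growth the slide describes.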
Assign
Matrices
Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);
Vectors
Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);
Views
Matrices
Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();
Vectors
Vector viewPart(int offset, int length);
Aggregates
Matrices
double zSum();
Vector aggregateRows(VectorFunction f);
Vector aggregateColumns(VectorFunction f);
double aggregate(DoubleDoubleFunction combiner,
    DoubleFunction mapper);
Vectors
double zSum();
double aggregate(
    DoubleDoubleFunction reduce, DoubleFunction map);
double aggregate(Vector other,
    DoubleDoubleFunction aggregator,
    DoubleDoubleFunction combiner);
Predefined Functions
Many handy functions
ABS LOG2
ACOS NEGATE
ASIN RINT
ATAN SIGN
CEIL SIN
COS SQRT
EXP SQUARE
FLOOR SIGMOID
IDENTITY SIGMOIDGRADIENT
INV TAN
LOGARITHM
Examples
A = α
double alpha; a.assign(alpha);
A = αB + β
a.assign(b, Functions.chain(Functions.plus(beta), Functions.times(alpha)));
Sparse Optimizations
DoubleDoubleFunction abstract properties
public boolean isLikeRightPlus();
public boolean isLikeLeftMult();
public boolean isLikeRightMult();
public boolean isLikeMult();
public boolean isCommutative();
public boolean isAssociative();
public boolean isAssociativeAndCommutative();
public boolean isDensifying();
And Vector properties
public boolean isDense();
public boolean isSequentialAccess();
public double getLookupCost();
public double getIteratorAdvanceCost();
public boolean isAddConstantTime();
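A rough sketch of why such properties matter, in plain Java rather than Mahout itself. The assumption here is that `isLikeRightPlus()` means f(x, 0) = x for every x; if that holds, zeros in the second vector cannot change the first, so an assign only needs to visit the second vector's non-zeros. `SparseAssignSketch`, `BinaryFunction` and the map-based vectors are hypothetical names and representations chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of property-driven sparse assign; not Mahout's implementation. */
final class SparseAssignSketch {
    /** A binary function that can declare an algebraic property about itself. */
    interface BinaryFunction {
        double apply(double x, double y);

        /** Assumed meaning: true iff f(x, 0) == x for every x. */
        default boolean isLikeRightPlus() { return false; }
    }

    static final BinaryFunction PLUS = new BinaryFunction() {
        @Override public double apply(double x, double y) { return x + y; }
        @Override public boolean isLikeRightPlus() { return true; }
    };

    /** Multiplication declares nothing here, so it takes the slow path. */
    static final BinaryFunction MULT = (x, y) -> x * y;

    /**
     * Sets target[i] = f(target[i], other[i]) for sparse vectors stored as index -> value maps.
     * Returns how many indexes were visited, to show the saving.
     */
    static int assign(Map<Integer, Double> target, Map<Integer, Double> other,
                      int size, BinaryFunction f) {
        int visited = 0;
        if (f.isLikeRightPlus()) {
            // Only non-zeros of `other` can change the target.
            for (Map.Entry<Integer, Double> e : other.entrySet()) {
                int i = e.getKey();
                put(target, i, f.apply(target.getOrDefault(i, 0.0), e.getValue()));
                visited++;
            }
        } else {
            // No usable property declared: touch every index.
            for (int i = 0; i < size; i++) {
                put(target, i, f.apply(target.getOrDefault(i, 0.0), other.getOrDefault(i, 0.0)));
                visited++;
            }
        }
        return visited;
    }

    /** Keeps the map sparse by never storing explicit zeros. */
    private static void put(Map<Integer, Double> v, int index, double value) {
        if (value == 0.0) {
            v.remove(index);
        } else {
            v.put(index, value);
        }
    }

    public static void main(String[] args) {
        Map<Integer, Double> a = new HashMap<>();
        a.put(0, 1.0);
        a.put(5, 2.0);
        Map<Integer, Double> b = new HashMap<>();
        b.put(5, 3.0);
        System.out.println(assign(a, b, 1_000_000, PLUS) + " index visited, a = " + a);
    }
}
```

With a million-element vector and one non-zero in `b`, the PLUS path visits a single index, while a function that declares no property has to walk the whole range.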
More Examples
The trace of a matrix
m.viewDiagonal().zSum()
Set diagonal to zero
m.viewDiagonal().assign(0)
Set diagonal to negative of row sums excluding the diagonal
Vector diag = m.viewDiagonal().assign(0);
diag.assign(m.rowSums().assign(Functions.NEGATE));
Iteration
Matrices are Iterable in Mahout
Vectors are densely or sparsely iterable
// compute both row and column sums in one pass
for (MatrixSlice row: m) {
  rSums.set(row.index(), row.zSum());
  cSums.assign(row, Functions.PLUS);
}
// entropy is -sum(p * log(p)) over the non-zero elements
double entropy = 0;
for (Vector.Element e: v.nonZeroes()) {
  entropy -= e.get() * Math.log(e.get());
}
Random Sampling
Samples from some type
Lots of kinds
ChineseRestaurant Missing Normal
Empirical Multinomial PoissonSampler
IndianBuffet MultiNormal Sampler
public interface Sampler<T> {
  T sample();
}
public abstract class AbstractSamplerFunction
    extends DoubleFunction
    implements Sampler<Double>
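The point of making a sampler a function is that it can be handed to an assign-style method to fill a vector with random draws. Here is a minimal plain-Java sketch of that idea. It assumes the function form simply ignores its argument and returns a fresh sample; `SamplerSketch`, `SamplerFunction`, `Normal` and the array-based `assign` are hypothetical stand-ins, not the Mahout classes.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of "samplers as functions"; not the Mahout classes. */
final class SamplerSketch {
    interface Sampler<T> {
        T sample();
    }

    /** A sampler that doubles as a unary function by ignoring its argument. */
    abstract static class SamplerFunction implements Sampler<Double>, DoubleUnaryOperator {
        @Override
        public double applyAsDouble(double ignored) {
            return sample();
        }
    }

    /** Normal distribution with a fixed seed so runs are repeatable. */
    static final class Normal extends SamplerFunction {
        private final Random rng;
        private final double mean;
        private final double sd;

        Normal(double mean, double sd, long seed) {
            this.mean = mean;
            this.sd = sd;
            this.rng = new Random(seed);
        }

        @Override
        public Double sample() {
            return mean + sd * rng.nextGaussian();
        }
    }

    /** Stand-in for a vector assign: replaces every element x with f(x), in place. */
    static double[] assign(double[] v, DoubleUnaryOperator f) {
        for (int i = 0; i < v.length; i++) {
            v[i] = f.applyAsDouble(v[i]);
        }
        return v;
    }

    public static void main(String[] args) {
        // Fill a vector with draws from N(5, 0.1) in a single call.
        double[] v = assign(new double[5], new Normal(5.0, 0.1, 42L));
        System.out.println(java.util.Arrays.toString(v));
    }
}
```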
Clustering and Such
Streaming k-means and ball k-means
– streaming reduces very large data to a cluster sketch
– ball k-means is a high quality k-means implementation
– the cluster sketch is also usable for other applications
– single machine threaded and map-reduce versions available
SVD and friends
– stochastic SVD has in-memory, single machine out-of-core and map-reduce versions
– good for reducing very large sparse matrices to tall skinny dense ones
Spectral clustering
– based on SVD, allows massive dimensional clustering
Mahout Math Summary
Matrices, Vectors
– views
– in-place assignment
– aggregations
– iterations
Functions
– lots built-in
– cooperate with sparse vector optimizations
Sampling
– abstract samplers
– samplers as functions
Other stuff … clustering, SVD
Recommendations
Often known as collaborative filtering
Actors interact with items
– observe successful interaction
We want to suggest additional successful interactions
Observations inherently very sparse
The Big Ideas
Cooccurrence is the core operation (and it is pretty simple)
Cooccurrence can be extended to handle important new capabilities
Recommendation systems can be deployed ideally using search technology
Examples of Recommendations
Customers buying books (Linden et al)
Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s (Veoh)
Visibility in a map UI (new Google maps)
A simple recommendation architecture
Look at the history of interactions
Find significant item cooccurrence in user histories
Use these cooccurring items as “indicators”
For all indicators in user history, accumulate scores for related items
Recommendation Basics
History:
User Thing
1 3
2 4
3 4
2 3
3 2
1 1
2 1
Recommendation Basics
History as matrix:
      t1  t2  t3  t4
u1     1   0   1   0
u2     1   0   1   1
u3     0   1   0   1
(t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once
A Quick Simplification
Users who do $h$: $A h$
Also do $r$:
– $r = A^T (A h)$ (user-centric recommendations)
– $r = (A^T A) h$ (item-centric recommendations)
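The two groupings give the same result, which can be checked on the three-user history matrix from the previous slide. The sketch below uses plain arrays rather than Mahout matrices; `CooccurrenceSketch` and its helper methods are hypothetical names, and the history vector `h` (a user who has only interacted with t1) is an assumed example.

```java
/** Hypothetical sketch comparing A^T (A h) with (A^T A) h on the example history. */
final class CooccurrenceSketch {
    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                t[j][i] = a[i][j];
            }
        }
        return t;
    }

    static double[][] multiply(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < b.length; k++) {
                for (int j = 0; j < b[0].length; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    static double[] multiply(double[][] a, double[] x) {
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < x.length; j++) {
                y[i] += a[i][j] * x[j];
            }
        }
        return y;
    }

    public static void main(String[] args) {
        // Rows are users u1..u3, columns are things t1..t4.
        double[][] a = {
            {1, 0, 1, 0},
            {1, 0, 1, 1},
            {0, 1, 0, 1}
        };
        double[] h = {1, 0, 0, 0};            // assumed history: this user did t1 only
        double[][] at = transpose(a);

        // User-centric: find similar users first, then what they did.
        double[] userCentric = multiply(at, multiply(a, h));
        // Item-centric: precompute item cooccurrence once, then apply it to any history.
        double[] itemCentric = multiply(multiply(at, a), h);

        System.out.println(java.util.Arrays.toString(userCentric));
        System.out.println(java.util.Arrays.toString(itemCentric));
    }
}
```

The item-centric form is the practical one because $A^T A$ can be computed offline and reused for every user.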
Recommendation Basics
Cooccurrence
      t1  t2  t3  t4
t1     2   0   2   1
t2     0   1   0   1
t3     2   0   2   1
t4     1   1   1   2
Problems with Raw Cooccurrence
Very popular items co-occur with everything
– Welcome document
– Elevator music
That isn’t interesting
– We want anomalous cooccurrence
Recommendation Basics
Cooccurrence
      t1  t2  t3  t4
t1     2   0   2   1
t2     0   1   0   1
t3     2   0   2   1
t4     1   1   1   2
Contingency table for t1 and t3
          t3  not t3
t1         2     1
not t1     1     1
Spot the Anomaly
Root LLR is roughly like standard deviations
          A      not A
B         13     1000
not B     1000   100,000
Root LLR: 0.44
          A      not A
B         1      0
not B     0      2
Root LLR: 0.98
          A      not A
B         1      0
not B     0      10,000
Root LLR: 2.26
          A      not A
B         10     0
not B     0      100,000
Root LLR: 7.15
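For reference, a minimal sketch of one common way to compute the log-likelihood ratio (the G² statistic) for a 2x2 contingency table, using unnormalized entropies. `LlrSketch` and its method names are hypothetical. The scores printed on the slide come out at roughly half of what this particular formula yields (about 1.95 here against 0.98 for the second table), so the slide presumably used a different scaling; the ranking of the four tables is the same.

```java
/** Hypothetical sketch of a log-likelihood ratio test for 2x2 count tables. */
final class LlrSketch {
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a set of counts: xLogX(sum) - sum of xLogX(count). */
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long k : counts) {
            sum += k;
            parts += xLogX(k);
        }
        return xLogX(sum) - parts;
    }

    /**
     * G^2 for the table [[k11, k12], [k21, k22]]:
     * 2 * (rowEntropy + columnEntropy - matrixEntropy), clamped at zero against round-off.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    /** Signed square root: negative when k11 is smaller than independence would predict. */
    static double rootLogLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double root = Math.sqrt(logLikelihoodRatio(k11, k12, k21, k22));
        boolean lessThanExpected = (double) k11 / (k11 + k12) < (double) k21 / (k21 + k22);
        return lessThanExpected ? -root : root;
    }

    public static void main(String[] args) {
        System.out.println(rootLogLikelihoodRatio(13, 1000, 1000, 100000));
        System.out.println(rootLogLikelihoodRatio(1, 0, 0, 2));
        System.out.println(rootLogLikelihoodRatio(1, 0, 0, 10000));
        System.out.println(rootLogLikelihoodRatio(10, 0, 0, 100000));
    }
}
```

The first table has the largest raw cooccurrence count (13) but the smallest score, because that count is close to what the marginal totals predict by chance.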
Threshold by Score
Cooccurrence
      t1  t2  t3  t4
t1     2   0   2   1
t2     0   1   0   1
t3     2   0   2   1
t4     1   1   1   2
Threshold by Score
Significant cooccurrence => Indicators
      t1  t2  t3  t4
t1     1   0   0   1
t2     0   1   0   1
t3     0   0   1   1
t4     1   0   0   1
So Far, So Good
Classic recommendation systems based on these approaches
– Musicmatch (ca 2000)
– Veoh Networks (ca 2005)
Currently available in Mahout
– See RowSimilarityJob
Very simple to deploy
– Compute indicators
– Store in search engine
– Works very well with enough data
Virtues of Current State of the Art
Lots of well publicized history
– Musicmatch, Veoh, Netflix, Amazon, Overstock
Lots of support
– Mahout, commercial offerings like Myrrix
Lots of existing code
– Mahout, commercial codes
Proven track record
Well socialized solution
Problems for Recommenders
Cold start
Disjoint populations
Long tail
Multiple kinds of evidence (multi-modal recommendations)
– unstructured add-on data
– other transaction streams
– textual descriptions
What is this multi-modal stuff?
People don’t just do one thing
One kind of behavior is useful for predicting other kinds
Having a complete picture is important for accuracy
What has the user said, viewed, clicked, closed, bought lately?
Example Multi-modal Inputs
Overlap in restaurant visits is useful
Big spender cues
Cuisine as an indicator
Review text as an indicator
Too Limited
People do more than one kind of thing
Different kinds of behaviors give different quality, quantity and kind of information
We don’t have to do co-occurrence
We can do cross-occurrence
Result is cross-recommendation
For example
Users enter queries (A)
– (actor = user, item = query)
Users view videos (B)
– (actor = user, item = video)
$A^T A$ gives query recommendation
– “did you mean to ask for”
$B^T B$ gives video recommendation
– “you might like these videos”
The punch-line
$B^T A$ recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
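A small sketch of the cross-occurrence product on made-up data: three users, two queries and three videos, all hypothetical. `CrossOccurrenceSketch` and its methods are illustrative names. The result $B^T A$ has one row per video and one column per query, so multiplying it by a query history yields a score for each video based purely on behavior, with no use of video content or meta-data.

```java
/** Hypothetical sketch of cross-occurrence (B^T A) between videos and queries. */
final class CrossOccurrenceSketch {
    /** Computes B^T A, where rows of both inputs are users. Result is items x indicators. */
    static int[][] crossOccurrence(int[][] b, int[][] a) {
        int users = a.length;
        int items = b[0].length;
        int indicators = a[0].length;
        int[][] result = new int[items][indicators];
        for (int u = 0; u < users; u++) {
            for (int i = 0; i < items; i++) {
                for (int q = 0; q < indicators; q++) {
                    result[i][q] += b[u][i] * a[u][q];
                }
            }
        }
        return result;
    }

    /** Scores every item against a history expressed over the indicator columns. */
    static int[] score(int[][] bta, int[] history) {
        int[] scores = new int[bta.length];
        for (int i = 0; i < bta.length; i++) {
            for (int q = 0; q < history.length; q++) {
                scores[i] += bta[i][q] * history[q];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // A: users x queries (made-up data). Users 0 and 1 entered query 0, user 2 entered query 1.
        int[][] a = {{1, 0}, {1, 0}, {0, 1}};
        // B: users x videos (made-up data).
        int[][] b = {{1, 1, 0}, {1, 0, 0}, {0, 0, 1}};
        int[][] bta = crossOccurrence(b, a);
        // Someone who has just typed query 0:
        System.out.println(java.util.Arrays.toString(score(bta, new int[] {1, 0})));
    }
}
```

In practice the raw counts would be filtered for anomalous cross-occurrence before use, in the same way as the cooccurrence counts earlier in the talk.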
Real-life example
Query: “Paco de Lucia”
Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
Hypothetical Example
Want a navigational ontology?
Just put labels on a web page with traffic
– This gives A = users x label clicks
Remember viewing history
– This gives B = users x items
Cross recommend
– $B^T A$ = label to item mapping
After several users click, results are whatever users think they should be
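\[
\begin{bmatrix} A_1 & A_2 \end{bmatrix}^T \begin{bmatrix} A_1 & A_2 \end{bmatrix}
= \begin{bmatrix} A_1^T \\ A_2^T \end{bmatrix} \begin{bmatrix} A_1 & A_2 \end{bmatrix}
= \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \\ A_2^T A_1 & A_2^T A_2 \end{bmatrix}
\]
\[
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix}
= \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \\ A_2^T A_1 & A_2^T A_2 \end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}
\]
\[
r_1 = \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}
\]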
Summary
Input: Multiple kinds of behavior on one set of things
Output: Recommendations for one kind of behavior with a different set of things
Cross recommendation is a special case
Input Data
User transactions
– user id, merchant id
– SIC code, amount
– Descriptions, cuisine, …
Offer transactions
– user id, offer id
– vendor id, merchant ids
– offers, views, accepts
Derived user data
– merchant ids
– anomalous descriptor terms
– offer & vendor ids
Derived merchant data
– local top40
– SIC code
– vendor code
– amount distribution
Cross-recommendation
Per merchant indicators
– merchant ids
– chain ids
– SIC codes
– indicator terms from text
– offer vendor ids
Computed by finding anomalous (indicator => merchant) rates
Search-based Recommendations
Sample document
– Original data and meta-data
• Merchant id
• Field for text description
• Phone
• Address
• Location
– Derived from cooccurrence and cross-occurrence analysis
• Indicator merchant ids
• Indicator industry (SIC) ids
• Indicator offers
• Indicator text
• Local top40
Sample query (the recommendation query)
– Current location
– Recent merchant descriptions
– Recent merchant ids
– Recent SIC codes
– Recent accepted offers
– Local top40
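A toy sketch of the idea that a recommendation is just a search: each merchant document carries indicator tokens, and the query is the user's recent history. Everything here is hypothetical, including `IndicatorSearchSketch`, the `field:value` token format and the sample merchants. A real search engine would apply its own relevance weighting; this sketch simply counts matching terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of recommendation as a search over indicator fields. */
final class IndicatorSearchSketch {
    /** An indexed merchant: its id plus indicator tokens such as "merchant:cafe". */
    static final class Doc {
        final String id;
        final Set<String> indicators;

        Doc(String id, Set<String> indicators) {
            this.id = id;
            this.indicators = indicators;
        }
    }

    static Doc doc(String id, String... indicators) {
        return new Doc(id, new HashSet<>(Arrays.asList(indicators)));
    }

    /**
     * Returns up to `limit` document ids, best first. The score is the number of query
     * terms found among a document's indicators; documents with no match are dropped.
     */
    static List<String> search(List<Doc> index, int limit, String... queryTerms) {
        Map<String, Integer> scores = new HashMap<>();
        for (Doc d : index) {
            int score = 0;
            for (String term : queryTerms) {
                if (d.indicators.contains(term)) {
                    score++;
                }
            }
            if (score > 0) {
                scores.put(d.id, score);
            }
        }
        List<String> ids = new ArrayList<>(scores.keySet());
        // Highest score first; ties broken by id so the order is deterministic.
        ids.sort((x, y) -> {
            int byScore = Integer.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return new ArrayList<>(ids.subList(0, Math.min(limit, ids.size())));
    }

    public static void main(String[] args) {
        List<Doc> index = Arrays.asList(
            doc("pizza", "merchant:cafe", "sic:5812"),
            doc("shoes", "merchant:gym"),
            doc("bar", "merchant:cafe", "merchant:cinema", "sic:5812"));
        // Query built from the user's recent merchants.
        System.out.println(search(index, 10, "merchant:cafe", "merchant:cinema"));
    }
}
```

The appeal of the design is operational: the indicator fields are computed offline, and serving recommendations then needs nothing beyond an ordinary search cluster.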
Analyze with Map-Reduce
[Figure: components shown are complete history, cooccurrence analysis (Mahout), item meta-data, Solr indexing, Solr indexers, index shards]
Deploy with Conventional Search System
[Figure: components shown are user history, web tier, Solr search, item meta-data, Solr indexers, index shards]
Objective Results
At a very large credit card company
History is all transactions
Development time to minimal viable product about 4 months
General release 2-3 months later
Search-based recs at or equal in quality to other techniques
Contact:
– [email protected]
– @ted_dunning
– @apachemahout
Slides and such: http://www.slideshare.net/tdunning
Hash tags: #mapr #apachemahout #recommendations