recommendations with hadoop streaming and python

Recommendations with Python and Hadoop Streaming Andrew Look Senior Engineer Shopzilla


Post on 25-May-2015


Page 1: Recommendations with hadoop streaming and python

Recommendations with Python and Hadoop Streaming

Andrew Look

Senior Engineer, Shopzilla

Getting started

● Slides
  ○ http://bit.ly/J7vmx7
● Python/NumPy Installed
  ○ http://bit.ly/JWNWbq
● Sample code
  ○ http://aws-hadoop.s3.amazonaws.com/similarity.zip

Outline

● Problem
● Recommendation basics
● MapReduce review and conventions
● Python + Hadoop Streaming basics
● MapReduce jobs (data, code, data-flow)
● Recommendation algorithm

Problem - Music Recommendations

● We want to recommend similar artists
● We have data from Last.fm
● Which Last.fm users liked which artists?
● How can we decide which artists are similar?

[Figure: example artists Toby Keith, Tupac, De La Soul, Garth Brooks]

Solution - Find Artist Similarities

● We'll follow along with a tutorial from AWS
● By Data Wrangling blogger/AWS developer Peter Skomoroch
● Uses publicly available data from Last.fm
● A user's rating of an artist is the number of plays

Solution - Find Artist Similarities

● We can look at co-ratings
● One user played artist A's songs X times
● The same user played artist B's songs Y times

co-rating = ((A,X),(B,Y))
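
A minimal Python sketch of the co-rating idea. The function name `co_ratings` and the sample play counts are hypothetical, chosen only for illustration; they do not come from the tutorial code.

```python
from itertools import combinations


def co_ratings(plays_by_artist):
    """Return every co-rating ((A, X), (B, Y)) for one user's play counts."""
    # Sort by artist so each unordered pair is produced once, in a stable order.
    items = sorted(plays_by_artist.items())
    return list(combinations(items, 2))


# Hypothetical user: 20 plays of artist "A", 3 plays of artist "B".
print(co_ratings({"A": 20, "B": 3}))
```

A user with n rated artists contributes n * (n - 1) / 2 co-ratings, which is why very active users can dominate the pairwise output.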

Recommendation Basics

● User Based
  ○ Given a user, recommend the artists that are favored by users with similar artist preferences
● Item Based
  ○ Given an item (artist), recommend the artists that were most commonly favored by users that also liked the input artist

Recommendation Basics

● Types of data
  ○ Explicit - user rates a movie on Netflix
  ○ Implicit - user watches a YouTube video
● Types of ratings
  ○ Multivalued - bounded, e.g. star rating (1-5)
  ○ Multivalued - unbounded, e.g. number of plays (>0)
  ○ Binary - did a user play a movie or not?

Last.fm Recommendations

● Data was implicitly collected (as users play songs)
● Transform binary data (did user listen to artist?) ...
● ... into multivalued data (how many times?)
● We'll use item-based recommendations


Mapper Input


Map Output - Reduce Input


Chaining MapReduce Jobs


Distributed Cache

Python Shell and Hadoop Streaming

Streaming API requires shell commands
● Mapper
● Reducer

For mapper / reducer commands, the Streaming API will
● Partition the input
● Distribute across mappers and reducers
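
A hedged sketch of what a streaming mapper and reducer look like in Python. Under Hadoop Streaming each stage would read lines from standard input and write tab-separated key/value lines to standard output; here the stages are plain generator functions run on an in-memory sample, so the example stays self-contained. The names `mapper`, `reducer`, and `run_local` are illustrative, not taken from similarity.py.

```python
def mapper(lines):
    """Emit one 'key<TAB>value' line per well-formed input record."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed rows
        user_id, artist_id, plays = fields
        yield f"{artist_id}\t{plays}"


def reducer(sorted_lines):
    """Sum values per key; assumes the input is already sorted by key."""
    current_key, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                yield f"{current_key}\t{total}"
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield f"{current_key}\t{total}"


def run_local(lines):
    """Emulate `cat input | mapper | sort | reducer` in a single process."""
    return list(reducer(sorted(mapper(lines))))


sample = ["1000020 1001820 20", "1000036 1001820 34", "1000021 700 1"]
for out in run_local(sample):
    print(out)
```

The reducer only compares each key with the previous one, which works because the shuffle/sort phase (or a local `sort`) delivers all lines for a key contiguously.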


Full Recommendation Job Overview

Example - Working Data Set

○ Inspect your working data set ...
○ Each row is one "rating"
○ Each "number of plays" is the "rating value"

Code

cat input/sample_user_artist_data.txt | head

Example - Working Data Set

User ID    Artist ID    Number of Plays
1000020    1001820      20
1000020    1003557      1
1000021    700          1
1000029    1011819      1
1000036    1001820      34
1000036    1011819      2
1000036    700          2
1000040    1001820      1
1000057    1011819      37
1000060    700          17

Mapper 1 - Count Ratings per Artist

○ Prepend LongValueSum: to the artist ID to form the key
○ More on this later
○ Use a value of "1"

Code

cat input/sample_user_artist_data.txt | ./similarity.py mapper1

Mapper 1 - Count Ratings per Artist

Artist ID               Number of Ratings
LongValueSum:1001820    1
LongValueSum:1003557    1
LongValueSum:700        1
LongValueSum:1011819    1
LongValueSum:1001820    1
LongValueSum:1011819    1
LongValueSum:700        1
LongValueSum:1001820    1
LongValueSum:1011819    1
LongValueSum:700        1
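
The first mapper can be sketched as below. This is an illustration under the assumption that each input row is whitespace-separated `user ID, artist ID, plays`; the real `mapper1` in similarity.py may differ in detail.

```python
def mapper1(lines):
    """Emit 'LongValueSum:<artist ID><TAB>1' for every rating row."""
    for line in lines:
        user_id, artist_id, plays = line.split()
        # The play count is ignored here: each row counts as one rating.
        yield f"LongValueSum:{artist_id}\t1"


rows = ["1000020 1001820 20", "1000020 1003557 1", "1000021 700 1"]
for out in mapper1(rows):
    print(out)
```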

Mapper 1 - Count Ratings per Artist

○ We use the sort command locally
○ We sort by artist ID
○ Emulates shuffle/sort in Hadoop

Code

cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort

Mapper 1 - Count Ratings per Artist

Artist ID               Number of Ratings
LongValueSum:1001820    1
LongValueSum:1001820    1
LongValueSum:1001820    1
LongValueSum:1003557    1
LongValueSum:1011819    1
LongValueSum:1011819    1
LongValueSum:1011819    1
LongValueSum:700        1
LongValueSum:700        1
LongValueSum:700        1

Reducer 1 - Count Ratings by Artist

○ The LongValueSum prefix tells the 'aggregate' reducer to sum the values for each key
○ Group by artist ID
○ Sum up the 1's
○ Emit artist ID as Key, count(ratings) as Value

Code

cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort \
  | ./similarity.py reducer1 > input/artist_playcounts.txt

Reducer 1 - Count Ratings by Artist

Artist ID    Number of Ratings
1000143      1905
1000418      184
1001820      12950
700          7243
1003557      2976
1011819      7601
1012511      1881
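
For local testing, the summing behaviour of the aggregate reducer can be emulated in a few lines of Python. This is a sketch, not the tutorial's `reducer1`: it assumes sorted `LongValueSum:<artist ID><TAB>1` lines as input and simply strips the prefix after summing.

```python
from itertools import groupby

PREFIX = "LongValueSum:"


def reducer1(sorted_lines):
    """Sum the values of each key group and drop the LongValueSum prefix."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        artist_id = key[len(PREFIX):] if key.startswith(PREFIX) else key
        yield f"{artist_id}\t{total}"


sorted_lines = (
    ["LongValueSum:1001820\t1"] * 3
    + ["LongValueSum:1003557\t1"]
    + ["LongValueSum:700\t1"] * 3
)
for out in reducer1(sorted_lines):
    print(out)
```

`itertools.groupby` only merges adjacent equal keys, so it relies on the same sorted-input guarantee that Hadoop's shuffle/sort provides.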

Mapper 2 - User Artist Preferences

○ Mapper2 emits "user ID,artist ID" as the key
○ Mapper2 emits the rating (# plays) as the value

Code

cat input/sample_user_artist_data.txt | ./similarity.py mapper2 int

Mapper 2 - User Artist Preferences

User ID,Artist ID    Number of Plays
1000020,1001820      20
1000020,1003557      1
1000021,700          1
1000029,1011819      1
1000036,1001820      34
1000036,1011819      2
1000036,700          2
1000040,1001820      1
1000057,1011819      37
1000060,700          17

Mapper 2 - User Artist Preferences

○ Can large counts skew our results?
○ Apply a log function to dampen large play counts

Code

cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort

Mapper 2 - Logarithmic Smoothing

User ID,Artist ID    Smoothing    Smoothed Count
1000020,1001820      log(20)      3
1000020,1003557      log(1)       1
1000021,700          log(1)       1
1000029,1011819      log(1)       1
1000036,1001820      log(34)      4
1000036,1011819      log(2)       1
1000036,700          log(2)       1
1000040,1001820      log(1)       1
1000057,1011819      log(37)      4
1000060,700          log(17)      3
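
The smoothed counts in the table are consistent with taking the natural log, truncating, and adding one. That formula is inferred from the table values rather than confirmed from similarity.py, so treat the sketch below as an assumption; `smooth` is a hypothetical helper name.

```python
from math import floor, log


def smooth(plays):
    """Compress a raw play count to floor(ln(plays)) + 1 (inferred formula)."""
    return int(floor(log(plays))) + 1


# Play counts taken from the sample data set above.
for plays in (1, 2, 17, 20, 34, 37):
    print(plays, smooth(plays))
```

Adding one keeps a single play from collapsing to a rating of zero, since ln(1) = 0.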

Reducer 2 - Aggregate User Prefs

○ Reduce for each user
○ Key - user ID
○ Value is complex
  ○ Count(ratings)
  ○ Sum(rating values)
  ○ Space-delimited list of "artist ID,rating value" pairs

Code

cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort \
  | ./similarity.py reducer2

Reducer 2 - Aggregated User Prefs

User ID    Count | Sum | Artist ID,Rating pairs (smoothed)
1000020    2 | 4 | 1001820,3 1003557,1
1000021    1 | 1 | 700,1
1000029    1 | 1 | 1011819,1
1000036    3 | 6 | 1001820,4 1011819,1 700,1
1000040    1 | 1 | 1001820,1
1000057    1 | 4 | 1011819,4
1000060    1 | 3 | 700,3
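
A sketch of the per-user aggregation, assuming sorted `user ID,artist ID<TAB>rating` lines as input. The exact delimiter layout of the output value (here `count|sum|pairs` with no padding) is an assumption for illustration.

```python
from itertools import groupby


def reducer2(sorted_lines):
    """Emit one 'user<TAB>count|sum|artist,rating ...' line per user."""
    def parse(line):
        key, rating = line.rstrip("\n").split("\t")
        user_id, artist_id = key.split(",")
        return user_id, artist_id, int(rating)

    parsed = (parse(line) for line in sorted_lines)
    for user_id, group in groupby(parsed, key=lambda row: row[0]):
        ratings = [(artist_id, rating) for _, artist_id, rating in group]
        count = len(ratings)
        total = sum(rating for _, rating in ratings)
        pairs = " ".join(f"{artist_id},{rating}" for artist_id, rating in ratings)
        yield f"{user_id}\t{count}|{total}|{pairs}"


lines = ["1000020,1001820\t3", "1000020,1003557\t1", "1000021,700\t1"]
for out in reducer2(lines):
    print(out)
```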

Mapper 3 - User Co-Ratings

○ Mapper3 culls users via a cutoff
○ Drops the user ID and emits co-ratings pairwise

Code

cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort \
  | ./similarity.py reducer2 \
  | ./similarity.py mapper3 100 input/artist_playcounts.txt | sort

Mapper 3 - User Co-Ratings

Artist X    Artist Y    Rating X    Rating Y
1000143     1003577     2           3
1000143     1011819     2           3
1001820     700         1           2
1001820     700         1           3
1011819     700         3           2
1011819     700         3           3
1011819     700         4           2
1011819     700         4           2
1011819     700         5           5
1012511     700         1           1
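
A sketch of the pairwise emission step. Two things here are assumptions, since the slides do not spell them out: the cutoff is interpreted as "skip users with more than `cutoff` ratings", and the artist play-count file that the real `mapper3` also receives is not used at all.

```python
from itertools import combinations


def mapper3(lines, cutoff):
    """Emit 'artistX artistY<TAB>ratingX ratingY' for each pair a user rated."""
    for line in lines:
        user_id, summary = line.rstrip("\n").split("\t")
        count, total, pairs = summary.split("|")
        if int(count) > cutoff:
            continue  # assumption: overly active users are dropped entirely
        ratings = sorted(tuple(p.split(",")) for p in pairs.split())
        # The user ID is discarded; only the artist pair and its ratings remain.
        for (a, x), (b, y) in combinations(ratings, 2):
            yield f"{a} {b}\t{x} {y}"


line = "1000036\t3|6|1001820,4 1011819,1 700,1"
for out in mapper3([line], cutoff=100):
    print(out)
```

Sorting each user's ratings by artist ID before pairing means a given artist pair always appears in the same order, so all of its co-ratings share one key.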

Reducer 3 - Artist Similarities

○ Given the number of artists, computes similarities
○ Each pair of artists is emitted with its similarity

Code

cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort \
  | ./similarity.py reducer2 \
  | ./similarity.py mapper3 100 input/artist_playcounts.txt | sort \
  | ./similarity.py reducer3 147160 > artist_similarities.txt

Reducer 3 - Artist Similarities

Artist ID    Similarity         Artist ID    Co-Ratings
1003557      0.121659425105     1012511      360
1012511      0.121659425105     1003557      360
1003557      0.0197107349416    700          212
700          0.0197107349416    1003557      212
1011819      0.0128808637553    1012511      259
1012511      0.0128808637553    1011819      259
1011819      0.297222927702     700          3050
700          0.297222927702     1011819      3050
1012511      0.0426446192482    700          270
700          0.0426446192482    1012511      270
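
One way a reducer can compute a correlation per artist pair is to keep running sums over that pair's co-ratings. The sketch below computes a plain Pearson correlation this way. It is not claimed to reproduce the tutorial's `reducer3`: it ignores the num-artists argument, whose exact role the slides do not describe, and returning 0.0 for a zero denominator is a choice made here.

```python
from itertools import groupby
from math import sqrt


def pearson_from_sums(n, sx, sy, sxx, syy, sxy):
    """Pearson correlation computed from running sums over n (x, y) pairs."""
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return numerator / denominator if denominator else 0.0


def reducer3(sorted_lines):
    """Yield (pair key, correlation, number of co-ratings) per artist pair."""
    def parse(line):
        key, value = line.rstrip("\n").split("\t")
        x, y = value.split()
        return key, float(x), float(y)

    parsed = (parse(line) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda row: row[0]):
        n = sx = sy = sxx = syy = sxy = 0.0
        for _, x, y in group:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
        yield key, pearson_from_sums(n, sx, sy, sxx, syy, sxy), int(n)


lines = ["A B\t1 2", "A B\t2 4", "A B\t3 6", "A C\t1 3", "A C\t2 2", "A C\t3 1"]
for result in reducer3(lines):
    print(result)
```

Running sums let the reducer process each pair in a single pass without holding all of its co-ratings in memory.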

Mapper 4 - Sort by Artist Correlation

○ Emit artist ID and similarity concatenated as the key
○ Sort by similarity = recommendation

Code

cat artist_similarities.txt | ./similarity.py mapper4 20 | sort

Mapper 4 - Sort by Artist Correlation

Artist X-ID,Similarity    Artist Y-ID    Num Co-Ratings
1012511,0.924219271937    1000143        237
1012511,0.945653412649    1001820        468
1012511,0.957355380752    700            270
1012511,0.961454917198    1000418        50
1012511,0.987119136245    1011819        259
700,0.702777072298        1011819        3050
700,0.898811337303        1001820        2250
700,0.95212801312         1000143        114
700,0.957355380752        1012511        270
700,0.980289265058        1003557        212
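
Comparing this table with the Reducer 3 output, the number in the key appears to be 1 - similarity (for example, 1 - 0.297222927702 = 0.702777072298 for the pair 700 / 1011819), so an ascending sort lists the most similar artists first. The sketch below follows that reading. Treating the `20` argument as a minimum number of co-ratings is an assumption, as is the fixed six-decimal formatting.

```python
def mapper4(lines, min_coratings):
    """Emit 'artistX,(1 - similarity)<TAB>artistY<TAB>co-ratings' lines."""
    for line in lines:
        artist_x, similarity, artist_y, n = line.split()
        if int(n) < min_coratings:
            continue  # assumption: pairs with too little evidence are dropped
        distance = 1.0 - float(similarity)
        # Fixed-width formatting keeps string sort order equal to numeric order.
        yield f"{artist_x},{distance:.6f}\t{artist_y}\t{n}"


lines = ["700 0.25 1011819 3050", "700 0.5 1001820 2250", "700 0.9 1000418 5"]
for out in sorted(mapper4(lines, min_coratings=20)):
    print(out)
```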

Reducer 4 - Cosmetic Results

○ Reducer attaches artist names

Code

cat artist_similarities.txt | ./similarity.py mapper4 20 | sort \
  | ./similarity.py reducer4 3 lastfm/artist_data.txt > related_artists.tsv

Reducer 4 - Cosmetic Results

Artist ID    Related Artist ID    Similarity    Number of Co-Ratings    Artist Name
1000143      1000143              1             0                       Toby Keith
1000143      1003557              0.2434        809                     Garth Brooks
1000143      1000418              0.1068        120                     Mark Chesnutt
1000143      1012511              0.0758        237                     Kenny Rogers
1000418      1000418              1             0                       Mark Chesnutt
1000418      1000143              0.1068        120                     Toby Keith
1000418      1003557              0.056         114                     Garth Brooks
1000418      1012511              0.0385        50                      Kenny Rogers
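
A hedged sketch of the final step: keep the top N related artists per artist and attach names. It assumes the sorted Mapper 4 lines as input, reads the `3` argument as "top 3 related artists", and takes the names as an in-memory dict instead of parsing lastfm/artist_data.txt. The self-similarity row (similarity 1, 0 co-ratings) shown in the table is not generated here.

```python
from itertools import groupby, islice


def reducer4(sorted_lines, top_n, names):
    """Yield (artist, related artist, similarity, co-ratings, related name)."""
    def parse(line):
        key, artist_y, n = line.rstrip("\n").split("\t")
        artist_x, distance = key.split(",")
        return artist_x, float(distance), artist_y, int(n)

    parsed = (parse(line) for line in sorted_lines)
    for artist_x, group in groupby(parsed, key=lambda row: row[0]):
        # Lines arrive sorted by distance, so the first top_n are the closest.
        for _, distance, artist_y, n in islice(group, top_n):
            similarity = round(1.0 - distance, 4)
            yield artist_x, artist_y, similarity, n, names.get(artist_y, "unknown")


names = {"1000143": "Toby Keith", "1003557": "Garth Brooks", "1000418": "Mark Chesnutt"}
lines = [
    "1000143,0.756600\t1003557\t809",
    "1000143,0.893200\t1000418\t120",
    "1000143,0.924200\t1012511\t237",
    "1000418,0.893200\t1000143\t120",
]
for row in reducer4(lines, top_n=2, names=names):
    print("\t".join(str(field) for field in row))
```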

Pearson Similarity - Visualization

covariance(A, B) = 2.44
covariance(C, D) = -2.36

Pearson Similarity - Equation

pearson(x, y) = covariance(x, y) / (stddev(x) * stddev(y))

pearson(A, B) = 0.772
pearson(C, D) = -0.746
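
The equation translates directly into Python. This sketch uses population covariance and standard deviation; the choice of population versus sample statistics does not change the result as long as both use the same divisor. The sample vectors are made up for illustration and are not the A, B, C, D series from the slides.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def stddev(x):
    """Population standard deviation, i.e. sqrt(covariance(x, x))."""
    return sqrt(covariance(x, x))


def pearson(x, y):
    """covariance(x, y) / (stddev(x) * stddev(y))."""
    return covariance(x, y) / (stddev(x) * stddev(y))


print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly linear, increasing
print(pearson([1, 2, 3], [3, 2, 1]))  # perfectly linear, decreasing
```

If either input is constant its standard deviation is zero and `pearson` raises `ZeroDivisionError`; the correlation is undefined in that case.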

Pearson Similarity - Summary

○ Pearson similarity normalizes covariance
○ Measures linear dependence between two variables
○ Normalized ...

-1 <= pearson(x, y) <= 1

(for any x, y with nonzero standard deviation)


Questions?

Appendix

● Hadoop Streaming
  ○ http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
● Explanation of LongValueSum
  ○ http://stackoverflow.com/questions/1946953/availiable-reducers-in-elastic-mapreduce
● Pearson Correlation
  ○ http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
● Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming
  ○ http://aws.amazon.com/articles/2294

Appendix

● Anscombe's Quartet
  ○ http://en.wikipedia.org/wiki/Anscombe's_quartet
● Tau Coefficient
  ○ http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient
● Jaccard Index
  ○ http://en.wikipedia.org/wiki/Jaccard_index
● Quality of Recommendations
  ○ http://en.wikipedia.org/wiki/Mean_squared_error