Recommendations with Python and Hadoop Streaming

Andrew Look

Senior Engineer, Shopzilla

Getting started

● Slides
  ○ http://bit.ly/J7vmx7
● Python/NumPy installed
  ○ http://bit.ly/JWNWbq
● Sample code
  ○ http://aws-hadoop.s3.amazonaws.com/similarity.zip

Outline

● Problem
● Recommendation basics
● MapReduce review and conventions
● Python + Hadoop Streaming basics
● MapReduce jobs (data, code, data-flow)
● Recommendation algorithm

Problem - Music Recommendations

● We want to recommend similar artists
● We have data from Last.fm
● Which Last.fm users liked which artists?
● How can we decide which artists are similar?

[Figure: example artists Toby Keith, Tupac, De La Soul, Garth Brooks]

Solution - Find Artist Similarities

● We'll follow along with a tutorial from AWS
● By Data Wrangling blogger/AWS developer Peter Skomoroch
● Uses publicly available data from Last.fm
● A user's rating of an artist is the number of plays

Solution - Find Artist Similarities

● We can look at co-ratings
● One user played artist A songs X times
● Same user played artist B songs Y times

co-rating = ((A,X),(B,Y))
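A minimal sketch of how co-ratings could be built for a single user, assuming that user's play counts are held in a dictionary. The artist labels and counts are made up for illustration, and `co_ratings` is a hypothetical helper, not a function from the sample code.

```python
from itertools import combinations

# Hypothetical play counts for one user: artist -> number of plays.
plays = {"A": 12, "B": 3, "C": 7}

def co_ratings(user_plays):
    """Return every unordered pair of (artist, plays) entries for one user."""
    ratings = sorted(user_plays.items())  # sort so pairs come out in a stable order
    return list(combinations(ratings, 2))

pairs = co_ratings(plays)
for pair in pairs:
    print(pair)
```

A user with n rated artists contributes n * (n - 1) / 2 co-ratings, which is why heavy users dominate the volume of pairs.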

Recommendation Basics

● User Based
  ○ Given a user, recommend the artists that are favored by users with similar artist preferences
● Item Based
  ○ Given an item (artist), recommend the artists that were most commonly favored by users who also liked the input artist

Recommendation Basics

● Types of data
  ○ Explicit - user rates a movie on Netflix
  ○ Implicit - user watches a YouTube video
● Types of ratings
  ○ Multivalued - bounded, e.g. star rating (1-5)
  ○ Multivalued - unbounded, e.g. number of plays (>0)
  ○ Binary - did a user play a movie or not?

Last.fm Recommendations

● Data was implicitly collected (as users play songs)
● Transform binary data (did user listen to artist?) ...
● ... into multivalued data (how many times?)
● We'll use item-based recommendations

Mapper Input

Map Output - Reduce Input

Chaining MapReduce Jobs

Distributed Cache

Python Shell and Hadoop Streaming

The Streaming API requires shell commands for the
● Mapper
● Reducer

For the mapper and reducer commands, the Streaming API will
● Partition the input
● Distribute it across mappers and reducers
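A rough sketch of the contract those two commands follow: each reads lines and writes "key<TAB>value" lines, and the reducer sees its input sorted by key. The function names and the "total plays per artist" task are illustrative assumptions, not the sample code's behavior. In a real streaming script the functions would be fed `sys.stdin` and their output printed to stdout.

```python
def mapper(lines):
    """Turn each raw 'user artist plays' line into 'artist<TAB>plays'."""
    for line in lines:
        _user_id, artist_id, plays = line.split()
        yield f"{artist_id}\t{plays}"

def reducer(sorted_lines):
    """Consume key-sorted 'key<TAB>value' lines and emit one total per key."""
    current_key, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                yield f"{current_key}\t{total}"
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield f"{current_key}\t{total}"

# Local stand-in for the partition/sort step Hadoop performs between the commands.
raw = ["1000020 1001820 20", "1000036 1001820 34", "1000060 700 17"]
shuffled = sorted(mapper(raw))
totals = list(reducer(shuffled))
print(totals)
```

Because the reducer only compares each key with the previous one, it never needs to hold more than one key's state in memory.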

Full Recommendation Job Overview

Example - Working Data Set

○ Inspect your working data set ...
○ Each row is one "rating"
○ Each "number of plays" is the "rating value"

Code:

cat input/sample_user_artist_data.txt | head

Example - Working Data Set

User ID   Artist ID   Number of Plays
1000020   1001820     20
1000020   1003557     1
1000021   700         1
1000029   1011819     1
1000036   1001820     34
1000036   1011819     2
1000036   700         2
1000040   1001820     1
1000057   1011819     37
1000060   700         17

Mapper 1 - Count Ratings per Artist

○ Prepend LongValueSum:<artist ID>
○ More on this later
○ Use a value of "1"

cat input/sample_user_artist_data.txt | ./similarity.py mapper1
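A minimal sketch of what such a mapper could look like, assuming whitespace-separated "user artist plays" rows as in the working data set. This `mapper1` is an illustrative stand-in, not the code shipped in similarity.py.

```python
def mapper1(lines):
    """Emit 'LongValueSum:<artist ID><TAB>1' for every rating row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        _user_id, artist_id, _plays = fields
        yield f"LongValueSum:{artist_id}\t1"

sample = ["1000020 1001820 20", "1000020 1003557 1", "1000021 700 1"]
emitted = list(mapper1(sample))
for line in emitted:
    print(line)
```

The play count is deliberately ignored here: this job counts how many users rated each artist, not how often they played it.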

Mapper 1 - Count Ratings per Artist

Artist ID              Number of Ratings
LongValueSum:1001820   1
LongValueSum:700       1

○ We use the sort command locally
○ We sort by artist ID
○ Emulates shuffle/sort in Hadoop

Code:

cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort

Artist ID          Number of Ratings
LongValueSum:700   1

Reducer 1 - Count Ratings by Artist

○ LongValueSum tells the 'aggregate' reducer to sum the values
○ Group by artist ID
○ Sum up the 1's
○ Emit artist ID as key, count(ratings) as value

cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort | ./similarity.py reducer1 > input/artist_playcounts.txt
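A sketch of a local reducer that mimics this summing behavior, assuming its input is already sorted by key. The `reducer1` function and the `PREFIX` constant are illustrative names, and stripping the prefix from the output key is an assumption based on the result table that follows.

```python
from itertools import groupby

PREFIX = "LongValueSum:"

def reducer1(sorted_lines):
    """Sum the values of consecutive lines that share a key."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _key, value in group)
        # Drop the aggregation hint so only the artist ID remains as the key.
        artist_id = key[len(PREFIX):] if key.startswith(PREFIX) else key
        yield f"{artist_id}\t{total}"

mapped = ["LongValueSum:700\t1", "LongValueSum:1001820\t1", "LongValueSum:700\t1"]
counts = list(reducer1(sorted(mapped)))
print(counts)
```

`groupby` only merges adjacent equal keys, so the sort step is what makes the per-artist totals correct.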

Reducer 1 - Count Ratings by Artist

Artist ID   Number of Ratings
1000143     1905
1000418     184
1001820     12950
700         7243
1003557     2976
1011819     7601
1012511     1881

Mapper 2 - User Artist Preferences

○ Mapper2 outputs "user ID,artist ID" as the key
○ Mapper2 outputs the rating (# plays) as the value

Code:

cat input/sample_user_artist_data.txt | ./similarity.py mapper2 int

User ID,Artist ID   Number of Plays
1000020,1001820     20
1000020,1003557     1
1000021,700         1
1000029,1011819     1
1000036,1001820     34
1000036,1011819     2
1000036,700         2
1000040,1001820     1
1000057,1011819     37
1000060,700         17

○ Can large counts skew our results?
○ Apply a log function to dampen outliers

Code:

cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort

Mapper 2 - Logarithmic Smoothing

User ID,Artist ID   Smoothing   Smoothed Count
1000020,1001820     log(20)     3
1000020,1003557     log(1)      1
1000021,700         log(1)      1
1000029,1011819     log(1)      1
1000036,1001820     log(34)     4
1000036,1011819     log(2)      1
1000036,700         log(2)      1
1000040,1001820     log(1)      1
1000057,1011819     log(37)     4
1000060,700         log(17)     3
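The slides do not state the exact smoothing formula. One formula that reproduces every smoothed count in the table above is floor(ln(plays)) + 1, so the sketch below assumes it. `smooth` and this `mapper2` are illustrative names and may differ from what similarity.py actually does.

```python
import math

def smooth(plays):
    """Compress a raw play count (>= 1) with the assumed formula floor(ln(plays)) + 1."""
    return int(math.log(plays)) + 1

def mapper2(lines, transform="log"):
    """Emit 'user,artist<TAB>rating', optionally log-smoothing the play count."""
    for line in lines:
        user_id, artist_id, plays = line.split()
        rating = smooth(int(plays)) if transform == "log" else int(plays)
        yield f"{user_id},{artist_id}\t{rating}"

smoothed = [smooth(n) for n in (1, 2, 17, 20, 34, 37)]
print(smoothed)
```

Adding 1 keeps a single play from collapsing to a rating of zero.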

Reducer 2 - Aggregate User Prefs

○ Reduce for each user
○ Key - user ID
○ Value is complex
  ○ Count(ratings)
  ○ Sum(rating values)
  ○ Space-delimited list of "artist ID,rating value"

cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort | ./similarity.py reducer2

Reducer 2 - Aggregated User Prefs

User ID   Count(ratings) | Sum(rating values) | Artist ID,Rating list
1000020   2 | 4 | 1001820,3 1003557,1
1000021   1 | 1 | 700,1
1000029   1 | 1 | 1011819,1
1000036   3 | 6 | 1001820,4 1011819,1 700,1
1000040   1 | 1 | 1001820,1
1000057   1 | 4 | 1011819,4
1000060   1 | 3 | 700,3
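A sketch of how the per-user aggregation could be written, assuming sorted "user,artist<TAB>rating" input. This `reducer2` is illustrative, and the exact delimiters in its output (a tab after the user ID, "|" without surrounding spaces) are assumptions.

```python
from itertools import groupby

def reducer2(sorted_lines):
    """Group 'user,artist<TAB>rating' lines by user and summarize each user."""
    def parse(line):
        key, rating = line.rstrip("\n").split("\t")
        user_id, artist_id = key.split(",")
        return user_id, artist_id, int(rating)

    rows = (parse(line) for line in sorted_lines)
    for user_id, group in groupby(rows, key=lambda row: row[0]):
        prefs = [(artist_id, rating) for _user, artist_id, rating in group]
        count = len(prefs)                                # Count(ratings)
        total = sum(rating for _artist, rating in prefs)  # Sum(rating values)
        listing = " ".join(f"{artist_id},{rating}" for artist_id, rating in prefs)
        yield f"{user_id}\t{count}|{total}|{listing}"

lines = ["1000020,1001820\t3", "1000020,1003557\t1", "1000021,700\t1"]
summaries = list(reducer2(lines))
for summary in summaries:
    print(summary)
```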

Mapper 3 - User Co-Ratings

○ Mapper3 culls users via a cutoff
○ Drop the user ID, emit ratings pairwise

Mapper 3 - User Co-Ratings

Artist X   Artist Y   Rating X   Rating Y
1000143    1003577    2          3
1000143    1011819    2          3
1001820    700        1          2
1001820    700        1          3
1011819    700        3          2
1011819    700        3          3
1011819    700        4          2
1011819    700        5          5
1012511    700        1          1
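A sketch of the pairwise emission step, assuming input lines shaped like "user<TAB>count|sum|artist,rating ...". The slides do not say how the cutoff works. Here it is assumed to skip users with more than `max_ratings` ratings, and both that parameter and this `mapper3` are hypothetical.

```python
from itertools import combinations

def mapper3(lines, max_ratings=100):
    """Emit one 'artistX artistY<TAB>ratingX ratingY' line per co-rated pair."""
    for line in lines:
        _user_id, payload = line.rstrip("\n").split("\t")
        count, _total, listing = payload.split("|")
        if int(count) > max_ratings:
            continue  # assumed cutoff: skip users with very long rating lists
        # Sort by artist ID string so each pair is always emitted in the same order.
        prefs = sorted(item.split(",") for item in listing.split())
        for (artist_x, rating_x), (artist_y, rating_y) in combinations(prefs, 2):
            yield f"{artist_x} {artist_y}\t{rating_x} {rating_y}"

user_row = "1000036\t3|6|1001820,4 1011819,1 700,1"
pairs = list(mapper3([user_row]))
for pair in pairs:
    print(pair)
```

Emitting each pair in a fixed order means the shuffle brings all co-ratings of the same two artists to the same reducer key.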

Reducer 3 - Artist Similarities

○ Given the number of artists, computes similarities
○ Each pair of artists is emitted with its similarities

Reducer 3 - Artist Similarities

Artist ID   Similarity        Artist ID   Co-Ratings
1003557     0.121659425105    1012511     360
1012511     0.121659425105    1003557     360
1003557     0.0197107349416   700         212
700         0.0197107349416   1003557     212
1011819     0.0128808637553   1012511     259
1012511     0.0128808637553   1011819     259
1011819     0.297222927702    700         3050
700         0.297222927702    1011819     3050
1012511     0.0426446192482   700         270
700         0.0426446192482   1012511     270
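A sketch of a reducer that groups co-rating lines by artist pair and emits the similarity in both directions, as the table above does. It assumes plain Pearson correlation and ignores the "number of artists" argument. The similarity.py in the sample code may use a different or regularized measure, so this code is not expected to reproduce the values shown.

```python
import math
from itertools import groupby

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists; 0.0 if undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # one artist's ratings are constant, so correlation is undefined
    return cov / math.sqrt(var_x * var_y)

def reducer3(sorted_lines):
    """Group 'artistX artistY<TAB>ratingX ratingY' lines by pair; emit both directions."""
    def parse(line):
        pair, ratings = line.rstrip("\n").split("\t")
        rating_x, rating_y = ratings.split()
        return pair, int(rating_x), int(rating_y)

    rows = (parse(line) for line in sorted_lines)
    for pair, group in groupby(rows, key=lambda row: row[0]):
        xs, ys = zip(*[(x, y) for _pair, x, y in group])
        artist_x, artist_y = pair.split()
        sim = pearson(xs, ys)
        yield f"{artist_x}\t{sim}\t{artist_y}\t{len(xs)}"
        yield f"{artist_y}\t{sim}\t{artist_x}\t{len(xs)}"

co_rated = ["a b\t1 2", "a b\t2 4", "a b\t3 6"]
similarities = list(reducer3(co_rated))
for row in similarities:
    print(row)
```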

Mapper 4 - Sort by Artist Correlation

○ Emit artist ID and similarity concatenated as the key
○ Sort by similarity = recommendation

Code:

cat artist_similarities.txt | ./similarity.py mapper4 20 | sort

Mapper 4 - Sort by Artist Correlation

Artist X ID,Similarity   Artist Y ID   Num Co-Ratings
1012511,0.924219271937   1000143       237
1012511,0.945653412649   1001820       468
1012511,0.957355380752   700           270
1012511,0.961454917198   1000418       50
1012511,0.987119136245   1011819       259
700,0.702777072298       1011819       3050
700,0.898811337303       1001820       2250
700,0.95212801312        1000143       114
700,0.957355380752       1012511       270
700,0.980289265058       1003557       212
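Comparing the two tables, the value in the Mapper 4 key equals 1 minus the Reducer 3 similarity (for example, 1 - 0.297222927702 = 0.702777072298), so an ascending sort puts the most similar artist first. The sketch below assumes that, and assumes the numeric argument (20 in the command above) is a minimum number of co-ratings. Both `mapper4` and `min_coratings` are illustrative names.

```python
def mapper4(lines, min_coratings=20):
    """Re-key each similarity row as 'artist,1-similarity' so sorting ranks it."""
    for line in lines:
        artist_x, sim, artist_y, n = line.split()
        if int(n) < min_coratings:
            continue  # assumed meaning of the argument: too few co-ratings to trust
        distance = 1.0 - float(sim)
        # Fixed-width formatting keeps the text sort consistent with numeric order.
        yield f"{artist_x},{distance:.12f}\t{artist_y}\t{n}"

rows = [
    "700 0.0197107349416 1003557 212",
    "700 0.297222927702 1011819 3050",
    "700 0.5 999 5",  # made-up row that falls under the co-rating threshold
]
ranked = sorted(mapper4(rows))
for row in ranked:
    print(row)
```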

Reducer 4 - Cosmetic Results

○ Reducer attaches artist names

cat artist_similarities.txt | ./similarity.py mapper4 20 | sort | ./similarity.py reducer4 3 lastfm/artist_data.txt > related_artists.tsv

Reducer 4 - Cosmetic Results

Artist ID   Related Artist ID   Similarity   Number of Co-Ratings   Artist Name
1000143     1000143             1            0                      Toby Keith
1000143     1003557             0.2434       809                    Garth Brooks
1000143     1000418             0.1068       120                    Mark Chestnutt
1000143     1012511             0.0758       237                    Kenny Rogers
1000418     1000418             1            0                      Mark Chestnutt
1000418     1000143             0.1068       120                    Toby Keith
1000418     1003557             0.056        114                    Garth Brooks
1000418     1012511             0.0385       50                     Kenny Rogers
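A sketch of the final formatting step, assuming input keyed as "artist,1-similarity" and sorted, a name lookup already loaded into a dictionary, and that the numeric argument (3 in the command above) is the number of related artists to keep. This `reducer4`, `names`, and `top_n` are illustrative, not the sample code's interface.

```python
from itertools import groupby, islice

def reducer4(sorted_lines, names, top_n=3):
    """Keep the top_n closest artists per artist and attach readable names."""
    def parse(line):
        key, artist_y, n = line.rstrip("\n").split("\t")
        artist_x, distance = key.split(",")
        return artist_x, 1.0 - float(distance), artist_y, int(n)

    rows = (parse(line) for line in sorted_lines)
    for artist_x, group in groupby(rows, key=lambda row: row[0]):
        # First row: the artist paired with itself, as in the table above.
        yield f"{artist_x}\t{artist_x}\t1\t0\t{names.get(artist_x, '?')}"
        for _x, sim, artist_y, n in islice(group, top_n):
            yield f"{artist_x}\t{artist_y}\t{sim:.4f}\t{n}\t{names.get(artist_y, '?')}"

names = {"1000143": "Toby Keith", "1003557": "Garth Brooks",
         "1000418": "Mark Chestnutt", "1012511": "Kenny Rogers"}
ranked = [
    "1000143,0.7566\t1003557\t809",
    "1000143,0.8932\t1000418\t120",
    "1000143,0.9242\t1012511\t237",
    "1000143,0.9500\t700\t114",  # fourth-ranked row, dropped when top_n=3
]
related = list(reducer4(ranked, names, top_n=3))
for row in related:
    print(row)
```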

Pearson Similarity - Visualization

covariance(A, B) = 2.44
covariance(C, D) = -2.36

Pearson Similarity - Equation

pearson(x, y) = covariance(x, y) / (stddev(x) * stddev(y))

pearson(A, B) = 0.772
pearson(C, D) = -0.746
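Written out over n co-ratings (x_i, y_i), with means x̄ and ȳ, the equation above expands as follows. The slides do not say whether population or sample statistics are used. The choice does not matter here, because the 1/n (or 1/(n-1)) factors cancel between numerator and denominator.

```latex
\[
\operatorname{pearson}(x, y)
  = \frac{\operatorname{covariance}(x, y)}
         {\operatorname{stddev}(x)\,\operatorname{stddev}(y)}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```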

Pearson Similarity - Summary

○ Pearson similarity normalizes covariance
○ Measures linear dependence between two variables
○ Normalized ...

-1 ≤ pearson(x, y) ≤ 1

(for any x, y with nonzero standard deviation)

Questions?

● Hadoop Streaming
  ○ http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
● Explanation of LongValueSum
  ○ http://stackoverflow.com/questions/1946953/availiable-reducers-in-elastic-mapreduce
● Pearson Correlation
  ○ http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
● Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming
  ○ http://aws.amazon.com/articles/2294

Appendix

● Anscombe's Quartet
  ○ http://en.wikipedia.org/wiki/Anscombe's_quartet
● Tau Coefficient
  ○ http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient
● Jaccard Index
  ○ http://en.wikipedia.org/wiki/Jaccard_index
● Quality of Recommendations
  ○ http://en.wikipedia.org/wiki/Mean_squared_error
