Scientific Article Recommendation with Mahout

Post on 28-May-2015

DESCRIPTION

I gave this presentation at the Data Science meetup in London on 23rd May, 2012. It describes how I've been using Mahout's item-based collaborative filtering recommender to produce personalised scientific article recommendations for researchers. I discuss how well Mahout performs out of the box and how I managed to reduce processing time by 95% by tuning it to our data set.

TRANSCRIPT

Scientific Article Recommendation with Mahout

Kris Jack, PhD

Senior Data Mining Engineer

Use Case

➔ Good researchers are on top of their game
➔ Large amount of research produced
➔ Takes time to get at what you need

➔ Help researchers by recommending relevant research

1.5 million+ users; the 20 largest user bases:

University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina

50m research articles

We need a recommender that scales up, coping with our data and future growth

Questions

➔ How does Mahout's recommender work?
➔ How well does it perform out of the box?
➔ How well does it perform after some tuning?

Mahout's Recommender

Generating recommendations through matrix multiplication

This is item-based recommendation, as similarity is computed between items, not users

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

[Figure: Input (all user preferences). A matrix with research articles as rows (Comp Sci 1, Physics 1, Physics 2, Comp Sci 2) and researchers as columns (Turing, Babbage, Einstein, Newton). At full scale: 1.5M researchers, 50M research articles, 300M prefs.]

item.RecommenderJob

1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user, research articles by researchers) are used to generate Item Similarity (item x item, research articles by research articles). Then: Item Similarity (item x item) X A User's Preferences (item x user, shown for Turing) = Recommendations (item x user, for Turing).]
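The three stages above can be illustrated on a toy matrix. The following is a minimal in-memory sketch, not Mahout's MapReduce implementation: the preference values are invented for illustration, plain co-occurrence counts stand in for whichever similarity measure the job is configured with, and the class and method names are hypothetical.

```java
import java.util.Arrays;

public class ItemBasedSketch {

    /** Item similarity (item x item): number of users who hold both item i and item j. */
    static int[][] itemSimilarity(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int u = 0; u < users; u++) {
                    sim[i][j] += prefs[i][u] * prefs[j][u];
                }
            }
        }
        return sim;
    }

    /** Recommendation scores for one user: similarity matrix times that user's preference column. */
    static int[] recommend(int[][] sim, int[][] prefs, int user) {
        int items = prefs.length;
        int[] scores = new int[items];
        for (int i = 0; i < items; i++) {
            if (prefs[i][user] != 0) {
                continue; // the user already has this article, so leave its score at 0
            }
            for (int j = 0; j < items; j++) {
                scores[i] += sim[i][j] * prefs[j][user];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Invented toy data (item x user).
        // Rows: Comp Sci 1, Physics 1, Physics 2, Comp Sci 2.
        // Columns: Turing, Babbage, Einstein, Newton.
        int[][] prefs = {
            {1, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
            {0, 1, 0, 0},
        };
        int[][] sim = itemSimilarity(prefs);
        // Scores for Turing (column 0); only Comp Sci 2 gets a non-zero score.
        System.out.println(Arrays.toString(recommend(sim, prefs, 0)));
    }
}
```

In this toy data Turing holds only Comp Sci 1, which co-occurs with Comp Sci 2 in Babbage's library, so Comp Sci 2 is the only article that scores above zero for Turing.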

How well does it work?

Mendeley Suggest

Running on Amazon's Elastic MapReduce

On-demand use and easy to cost

Mahout's Performance

[Figure: chart of Normalised Amazon Hours (y-axis, 0 to 7K) against No. Good Recommendations/10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K, 1.5

Let's tune it!

1. Reduce processing time

2. Improve quality

1. Reduce processing time

➔ Mahout's recommender is already efficient
➔ But your data may have unusual properties
➔ Hadoop may need a helping hand
➔ Let's see what's going on...

Task Allocation

37 hours to complete

1 reducer allocated, despite having 48 available...

Task Allocation

Allocating more mappers on a per-job basis (a smaller maximum split size yields more input splits, and so more map tasks):

job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));

Allocating more reducers on a per-job basis:

job.getConfiguration().setInt("mapred.reduce.tasks", numReducers);
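To see why lowering the maximum split size allocates more mappers, here is a small stand-alone sketch of the split arithmetic. It mirrors, in simplified form, how Hadoop's FileInputFormat sizes splits for a single splittable file (split size = max(minSize, min(maxSize, blockSize)), with a 10% slop on the last split). The file and block sizes are hypothetical, and the class is an illustration rather than Hadoop code.

```java
public class SplitCountSketch {

    // The last split may be up to 10% larger than the others.
    static final double SPLIT_SLOP = 1.1;

    /** Split size as bounded by the configured minimum and maximum. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of input splits (and so map tasks) for one splittable file. */
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // final, possibly slightly oversized, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 640 * mb; // hypothetical input file
        long blockSize = 64 * mb;   // hypothetical block size

        long defaultSplit = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        long smallerSplit = computeSplitSize(blockSize, 1, 16 * mb);

        System.out.println(countSplits(fileLength, defaultSplit)); // 10 map tasks
        System.out.println(countSplits(fileLength, smallerSplit)); // 40 map tasks
    }
}
```

The number of reducers, by contrast, is not derived from the input: it is simply whatever "mapred.reduce.tasks" is set to for the job.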

Task Allocation

37 hours to complete → 14 hours

From 1 → 40 reducers

Partitioners

14 hours to complete

[Figure: partition sizes vary widely, from ~50KB to ~500MB]

InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(...);

InputSampler.writePartitionFile(conf, sampler);
conf.setPartitionerClass(TotalOrderPartitioner.class);

http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
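The idea behind sampling keys and partitioning by key range can be shown without a cluster. The sketch below is plain Java, not the Hadoop API: it compares a hash-style partitioner with range partitioning over split points taken from a sample. The key distribution (every key a multiple of 4) is an invented worst case chosen to make the skew obvious; the talk does not say what caused the imbalance in the real data.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

public class RangePartitionSketch {

    /** Hash-style partitioning: key hash modulo the number of partitions. */
    static int hashPartition(int key, int numPartitions) {
        return (key & Integer.MAX_VALUE) % numPartitions;
    }

    /** Pick numPartitions - 1 split points at evenly spaced positions in the sorted sample. */
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] cuts = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            cuts[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return cuts;
    }

    /** Range partitioning: the partition index is the number of split points <= key. */
    static int rangePartition(int key, int[] cuts) {
        int p = 0;
        while (p < cuts.length && key >= cuts[p]) {
            p++;
        }
        return p;
    }

    /** Count how many keys each partition receives. */
    static int[] load(int[] keys, int numPartitions, IntUnaryOperator partitioner) {
        int[] counts = new int[numPartitions];
        for (int key : keys) {
            counts[partitioner.applyAsInt(key)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        int[] keys = new int[100];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = 4 * i; // invented skew: every key is a multiple of 4
        }
        int[] cuts = splitPoints(keys, numPartitions); // here the "sample" is the full key set

        // Hash partitioning sends every key to partition 0.
        System.out.println(Arrays.toString(load(keys, numPartitions, k -> hashPartition(k, numPartitions))));
        // Range partitioning over sampled split points spreads them evenly.
        System.out.println(Arrays.toString(load(keys, numPartitions, k -> rangePartition(k, cuts))));
    }
}
```

With these keys the hash-style partitioner puts all 100 records on one partition, while the range partitioner gives each of the four partitions 25. Note that this balances record counts per partition; records of very different sizes could still leave partitions uneven in bytes.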

Partitioners

14 hours to complete → 2 hours

Evenly distributed

Mahout's Performance

[Figure: chart of Normalised Amazon Hours (y-axis, 0 to 7K) against No. Good Recommendations/10 (x-axis, 0 to 3), with quadrants Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K, 1.5
➔ Cust. item-based: 2.4K, 1.5
➔ Orig. item-based → cust. item-based: -4.1K (63%)

2. Improve quality

➔ Mahout provides item-based CF
➔ We have many more items than users
➔ Typically, user-based is more appropriate

➔ So let's make one!

[Figure: the item.RecommenderJob pipeline (1. Prep. pref. matrix (1-3), 2. Gen. sim. matrix (4-6), 3. Multiply matrices (7-10)) shown again with "item" replaced by "user": a User Similarity (user x user) matrix, researchers by researchers, takes the place of Item Similarity (item x item).]
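A user-based variant can be sketched in the same toy setting. This is an illustration only, not the code from the talk or from MAHOUT-1004: similarity is computed between researchers (here as the number of articles two researchers share, an assumed measure), and a researcher's scores come from multiplying the preference matrix by their column of the user similarity matrix. The data and names are invented.

```java
import java.util.Arrays;

public class UserBasedSketch {

    /** User similarity (user x user): number of items that users u and v both hold. */
    static int[][] userSimilarity(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int u = 0; u < users; u++) {
            for (int v = 0; v < users; v++) {
                if (u == v) {
                    continue; // ignore self-similarity
                }
                for (int i = 0; i < items; i++) {
                    sim[u][v] += prefs[i][u] * prefs[i][v];
                }
            }
        }
        return sim;
    }

    /** Scores for one user: preference matrix times that user's similarity column. */
    static int[] recommend(int[][] prefs, int[][] sim, int user) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[] scores = new int[items];
        for (int i = 0; i < items; i++) {
            if (prefs[i][user] != 0) {
                continue; // the user already has this article
            }
            for (int v = 0; v < users; v++) {
                scores[i] += prefs[i][v] * sim[v][user];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Invented toy data (item x user).
        // Rows: Comp Sci 1, Physics 1, Physics 2, Comp Sci 2.
        // Columns: Turing, Babbage, Einstein, Newton.
        int[][] prefs = {
            {1, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
            {0, 1, 0, 0},
        };
        int[][] sim = userSimilarity(prefs);
        // Scores for Turing (column 0); only Comp Sci 2 gets a non-zero score.
        System.out.println(Arrays.toString(recommend(prefs, sim, 0)));
    }
}
```

One reason this shape suits the data described in the deck: with 1.5M researchers and 50M articles, a user x user similarity matrix has far fewer rows and columns than an item x item one.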

Mahout's Performance

[Figure: chart of Normalised Amazon Hours (y-axis, 0 to 7K) against No. Good Recommendations/10 (x-axis, 0 to 3), with quadrants Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

➔ Orig. item-based: 6.5K, 1.5
➔ Cust. item-based: 2.4K, 1.5
➔ Orig. user-based: 1K, 2.5
➔ Cust. user-based: 0.3K, 2.5

➔ Orig. item-based → cust. item-based: -4.1K (63%)
➔ Cust. item-based → orig. user-based: -1.4K (58%), +1 (67%)
➔ Orig. user-based → cust. user-based: -0.7K (70%)
➔ Orig. item-based → cust. user-based: -6.2K (95%), +1 (67%)

Conclusions

➔ Mahout is doing a great job of powering Mendeley Suggest
  ➔ Large scale data set
  ➔ Good quality recommendations

➔ Tuning helps
  ➔ Help Hadoop with task allocation if necessary
  ➔ Partition your data appropriately
  ➔ We save 95% of resources

➔ Use an appropriate algorithm
  ➔ Item- vs user-based (MAHOUT-1004)
  ➔ We increase precision by 66.6%

http://www.mendeley.com/profiles/kris-jack/
