Scientific Article Recommendation with Mahout

Kris Jack, PhD, Senior Data Mining Engineer

Upload: kris-jack

Post on 28-May-2015


DESCRIPTION

I gave this presentation at the Data Science meetup in London on 23rd May 2012. It describes how I've been using Mahout's item-based collaborative filtering recommender to produce personalised scientific article recommendations for researchers. I discuss how well Mahout performs out of the box and how I reduced processing time by 95% by tuning it to our data set.

TRANSCRIPT

Page 1: Scientific Article Recommendation with Mahout

Scientific Article Recommendation with Mahout

Kris Jack, PhD

Senior Data Mining Engineer

Page 2: Scientific Article Recommendation with Mahout

Use Case

➔ Good researchers are on top of their game
➔ Large amount of research produced
➔ Takes time to get at what you need
➔ Help researchers by recommending relevant research

Page 3: Scientific Article Recommendation with Mahout

1.5 million+ users; the 20 largest user bases:

University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina

50m research articles

Page 4: Scientific Article Recommendation with Mahout


We need a recommender that scales up, coping with our data and future growth

Page 5: Scientific Article Recommendation with Mahout
Page 6: Scientific Article Recommendation with Mahout

Questions

➔ How does Mahout's recommender work?
➔ How well does it perform out of the box?
➔ How well does it perform after some tuning?

Page 7: Scientific Article Recommendation with Mahout

Mahout's Recommender

Page 8: Scientific Article Recommendation with Mahout

Generating recommendations through matrix multiplication

These are item-based recommendations, because similarity is computed between items, not users

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

Page 9: Scientific Article Recommendation with Mahout

Input (all user preferences)

[Figure: preference matrix with research articles as rows (Comp Sci 1, Physics 1, Physics 2, Comp Sci 2) and researchers as columns (Turing, Babbage, Einstein, Newton)]

Page 10: Scientific Article Recommendation with Mahout

Input (all user preferences)

[Figure: preference matrix with research articles as rows (Comp Sci 1, Physics 1, Physics 2, Comp Sci 2) and researchers as columns (Turing, Babbage, Einstein, Newton)]

1.5M researchers, 50M research articles, 300M prefs

Page 11: Scientific Article Recommendation with Mahout

item.RecommenderJob

1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user) matrix, research articles by researchers]

Page 12: Scientific Article Recommendation with Mahout

item.RecommenderJob

1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user) matrix, research articles by researchers; A User's Preferences (item x user) column for Turing]

Page 13: Scientific Article Recommendation with Mahout

item.RecommenderJob

1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user) matrix, research articles by researchers; A User's Preferences (item x user) column for Turing; Item Similarity (item x item) matrix, research articles by research articles]

Page 14: Scientific Article Recommendation with Mahout

item.RecommenderJob

1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user) matrix, research articles by researchers]

Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), shown for Turing
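The three steps on this slide can be sketched on a toy scale. The following is a minimal, in-memory illustration only: it assumes a plain co-occurrence count as the item similarity and uses made-up preference values, and the class and method names (ItemBasedSketch, itemSimilarity, scores) are hypothetical. It is not how RecommenderJob is implemented on Hadoop.

```java
import java.util.Arrays;

// Toy illustration of item-based recommendation as matrix multiplication.
// Assumptions: binary preferences and co-occurrence counts as similarity.
class ItemBasedSketch {

    // prefs[item][user] is 1 if the user has the article, else 0.
    // Returns the item x item matrix of co-occurrence counts.
    static int[][] itemSimilarity(int[][] prefs) {
        int items = prefs.length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int u = 0; u < prefs[i].length; u++) {
                    sim[i][j] += prefs[i][u] * prefs[j][u];
                }
            }
        }
        return sim;
    }

    // Multiplies the similarity matrix by one user's preference column.
    // Articles the user already has keep a score of 0.
    static int[] scores(int[][] sim, int[][] prefs, int user) {
        int items = prefs.length;
        int[] out = new int[items];
        for (int i = 0; i < items; i++) {
            if (prefs[i][user] != 0) {
                continue;
            }
            for (int j = 0; j < items; j++) {
                out[i] += sim[i][j] * prefs[j][user];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data. Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        // Columns: Turing, Babbage, Einstein, Newton.
        int[][] prefs = {
            {1, 1, 0, 0},
            {0, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 0},
        };
        int[][] sim = itemSimilarity(prefs);
        // Scores for Turing (column 0); the highest-scoring unseen article is recommended.
        System.out.println(Arrays.toString(scores(sim, prefs, 0)));
    }
}
```

In this toy data Turing shares Comp Sci 1 with Babbage, so Comp Sci 2 is the only article that receives a non-zero score for Turing.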

Page 15: Scientific Article Recommendation with Mahout

How well does it work?

Page 16: Scientific Article Recommendation with Mahout

Mendeley Suggest

Page 17: Scientific Article Recommendation with Mahout

Running on Amazon's Elastic MapReduce

On-demand use and easy to cost

Page 18: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis) vs. No. Good Recommendations/10 (x-axis)]

Page 19: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis) vs. No. Good Recommendations/10 (x-axis); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Page 20: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis) vs. No. Good Recommendations/10 (x-axis); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Page 21: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis) vs. No. Good Recommendations/10 (x-axis); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Page 22: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Page 23: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5

Page 24: Scientific Article Recommendation with Mahout

Let's tune it!

Page 25: Scientific Article Recommendation with Mahout

1. Reduce processing time

2. Improve quality

Page 26: Scientific Article Recommendation with Mahout

1. Reduce processing time

➔ Mahout's recommender is already efficient
➔ But your data may have unusual properties
➔ Hadoop may need a helping hand
➔ Let's see what's going on...

Page 27: Scientific Article Recommendation with Mahout

Task Allocation

37 hours to complete

1 reducer allocated, despite having 48 available...

Page 28: Scientific Article Recommendation with Mahout

Task Allocation

// Allocating more mappers on a per-job basis: a smaller maximum split size produces more input splits
job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));

// Allocating more reducers on a per-job basis
job.getConfiguration().setInt("mapred.reduce.tasks", numReducers);
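To see why lowering the maximum split size allocates more mappers, here is a rough sketch of the arithmetic. It assumes the split-size rule used by Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)), and ignores the small slack Hadoop allows on the last split. The class name and the file and block sizes are hypothetical.

```java
// Rough arithmetic for how the maximum split size affects the number of map tasks.
// The sizes below are illustrative, not taken from the talk.
class SplitSizeSketch {

    // Split-size rule assumed here: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits (one map task each) for a single file.
    static long approxSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024 * mb; // hypothetical 1 GB input file
        long blockSize = 64 * mb;  // hypothetical block size

        // No maximum set: the split size falls back to the block size.
        long defaultSplit = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        // Maximum split size lowered to 16 MB: four times as many splits.
        long smallerSplit = computeSplitSize(blockSize, 1, 16 * mb);

        System.out.println("default splits: " + approxSplits(fileSize, defaultSplit));
        System.out.println("smaller splits: " + approxSplits(fileSize, smallerSplit));
    }
}
```

The number of reducers, by contrast, is not derived from the input size, which is why it has to be set explicitly.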

Page 29: Scientific Article Recommendation with Mahout

Task Allocation

37 hours → 14 hours to complete

From 1 → 40 reducers

Page 30: Scientific Article Recommendation with Mahout

Partitioners

14 hours to complete

Page 31: Scientific Article Recommendation with Mahout

Partitioners

14 hours to complete

~50KB

~500MB

Page 32: Scientific Article Recommendation with Mahout

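// Sample the input keys to estimate how they are distributed
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(...);

// Write split points derived from the sample to the partition file
InputSampler.writePartitionFile(conf, sampler);

// Partition by those split points instead of the default partitioner
conf.setPartitionerClass(TotalOrderPartitioner.class);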

http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
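The idea behind sampling before partitioning can be shown without Hadoop. The sketch below is an assumption-laden illustration, not Hadoop's code: it takes split points from the quantiles of a sample and routes each key by binary search, which is the general technique a sampled total-order partitioner relies on. The class name, key distribution and sizes are all hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Sample-based range partitioning on plain int keys; an illustration only.
class RangePartitionSketch {

    // Picks numPartitions - 1 split points at evenly spaced quantiles of the sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return points;
    }

    // Keys below the first split point go to partition 0, and so on upwards.
    static int partition(int key, int[] points) {
        int pos = Arrays.binarySearch(points, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Counts how many keys land in each partition.
    static int[] counts(int[] keys, int[] points) {
        int[] c = new int[points.length + 1];
        for (int key : keys) {
            c[partition(key, points)]++;
        }
        return c;
    }

    public static void main(String[] args) {
        // Hypothetical skewed keys: most values are small, a few are large.
        Random rnd = new Random(42);
        int[] keys = new int[100_000];
        for (int i = 0; i < keys.length; i++) {
            double r = rnd.nextDouble();
            keys[i] = (int) (r * r * r * 1_000_000);
        }

        // Equal-width ranges ignore the skew, so the first partition is overloaded.
        int[] equalWidth = {250_000, 500_000, 750_000};
        System.out.println("equal-width: " + Arrays.toString(counts(keys, equalWidth)));

        // Split points taken from a random sample follow the key distribution.
        int[] sample = new int[1_000];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = keys[rnd.nextInt(keys.length)];
        }
        int[] sampled = splitPoints(sample, 4);
        System.out.println("sampled:     " + Arrays.toString(counts(keys, sampled)));
    }
}
```

With split points that follow the data, each reducer receives a similar share of the keys, which is the effect the partition-size figures on these slides point at.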

Page 33: Scientific Article Recommendation with Mahout

Partitioners

14 hours → 2 hours to complete

Evenly distributed

Page 34: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5

Page 35: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5
Cust. item-based: 2.4K, 1.5

Page 36: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5
Cust. item-based: 2.4K, 1.5

Orig. item-based → Cust. item-based: -4.1K (63%)

Page 37: Scientific Article Recommendation with Mahout

2. Improve quality

➔ Mahout provides item-based CF
➔ We have many more items than users
➔ Typically, user-based is more appropriate
➔ So let's make one!

Page 38: Scientific Article Recommendation with Mahout

item.RecommenderJob

1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user) matrix, research articles by researchers]

Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), shown for Turing

Page 39: Scientific Article Recommendation with Mahout

user.RecommenderJob (replacing item.RecommenderJob)

1. Prep. pref. matrix (1-3)
2. Gen. sim. matrix (4-6)
3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user) matrix, research articles by researchers; A User's Preferences (item x user) column for Turing; Item Similarity (item x item) matrix replaced by User Similarity (user x user) matrix, researchers by researchers; Recommendations (item x user) column for Turing]
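A user-based variant can be sketched by swapping the similarity matrix from item x item to user x user. The following is a minimal in-memory illustration under stated assumptions: co-occurrence counts as the user similarity, made-up preference values, and hypothetical names (UserBasedSketch, userSimilarity, scores). It is not the implementation built for the talk.

```java
import java.util.Arrays;

// Toy illustration of user-based recommendation.
// Assumptions: binary preferences and shared-article counts as user similarity.
class UserBasedSketch {

    // prefs[item][user] is 1 if the user has the article, else 0.
    // Returns the user x user matrix of shared-article counts.
    static int[][] userSimilarity(int[][] prefs) {
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int[] row : prefs) {
            for (int u = 0; u < users; u++) {
                for (int v = 0; v < users; v++) {
                    sim[u][v] += row[u] * row[v];
                }
            }
        }
        return sim;
    }

    // Scores each unseen article by summing the similarity of the other users who have it.
    static int[] scores(int[][] sim, int[][] prefs, int user) {
        int[] out = new int[prefs.length];
        for (int i = 0; i < prefs.length; i++) {
            if (prefs[i][user] != 0) {
                continue; // already in the user's library
            }
            for (int v = 0; v < sim.length; v++) {
                if (v != user) {
                    out[i] += prefs[i][v] * sim[user][v];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data. Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        // Columns: Turing, Babbage, Einstein, Newton.
        int[][] prefs = {
            {1, 1, 0, 0},
            {0, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 0},
        };
        int[][] sim = userSimilarity(prefs);
        System.out.println("Turing: " + Arrays.toString(scores(sim, prefs, 0)));
        System.out.println("Newton: " + Arrays.toString(scores(sim, prefs, 3)));
    }
}
```

With many more items than users, the user x user matrix is far smaller than the item x item one, which is one reason a user-based approach can suit this kind of data set.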

Page 40: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5
Cust. item-based: 2.4K, 1.5

Page 41: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5
Cust. item-based: 2.4K, 1.5
Orig. user-based: 1K, 2.5

Page 42: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5
Cust. item-based: 2.4K, 1.5
Orig. user-based: 1K, 2.5

Cust. item-based → Orig. user-based: -1.4K (58%), +1 (67%)

Page 43: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5
Cust. item-based: 2.4K, 1.5
Orig. user-based: 1K, 2.5
Cust. user-based: 0.3K, 2.5

Page 44: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5
Cust. item-based: 2.4K, 1.5
Orig. user-based: 1K, 2.5
Cust. user-based: 0.3K, 2.5

Orig. item-based → Cust. item-based: -4.1K (63%)
Orig. user-based → Cust. user-based: -0.7K (70%)

Page 45: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5
Cust. item-based: 2.4K, 1.5
Orig. user-based: 1K, 2.5
Cust. user-based: 0.3K, 2.5

Orig. item-based → Cust. user-based: -6.2K (95%), +1 (67%)

Page 46: Scientific Article Recommendation with Mahout

Conclusions

Page 47: Scientific Article Recommendation with Mahout

Conclusions

➔ Mahout is doing a great job of powering Mendeley Suggest
    ➔ Large scale data set
    ➔ Good quality recommendations
➔ Tuning helps
    ➔ Help Hadoop with task allocation if necessary
    ➔ Partition your data appropriately
    ➔ We save 95% of resources
➔ Use an appropriate algorithm
    ➔ Item- vs user-based (MAHOUT-1004)
    ➔ We increase precision by 66.7%
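The percentages quoted in the deck follow directly from the chart's data points. As a worked check, with costs in thousands of normalised Amazon hours and quality in good recommendations out of 10:

```latex
\begin{align*}
\text{Orig. item-based} \to \text{Cust. item-based:} \quad & \frac{6.5 - 2.4}{6.5} = \frac{4.1}{6.5} \approx 63\% \text{ fewer hours} \\
\text{Cust. item-based} \to \text{Orig. user-based:} \quad & \frac{2.4 - 1.0}{2.4} = \frac{1.4}{2.4} \approx 58\% \text{ fewer hours} \\
\text{Orig. user-based} \to \text{Cust. user-based:} \quad & \frac{1.0 - 0.3}{1.0} = 70\% \text{ fewer hours} \\
\text{Orig. item-based} \to \text{Cust. user-based:} \quad & \frac{6.5 - 0.3}{6.5} = \frac{6.2}{6.5} \approx 95\% \text{ fewer hours} \\
\text{Good recommendations out of 10:} \quad & \frac{2.5 - 1.5}{1.5} = \frac{1}{1.5} \approx 67\% \text{ more}
\end{align*}
```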

Page 48: Scientific Article Recommendation with Mahout

Mahout's Performance

[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) vs. No. Good Recommendations/10 (x-axis, 0 to 3); quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based: 6.5K, 1.5
Cust. item-based: 2.4K, 1.5
Orig. user-based: 1K, 2.5
Cust. user-based: 0.3K, 2.5

Orig. item-based → Cust. user-based: -6.2K (95%), +1 (67%)

http://www.mendeley.com/profiles/kris-jack/