Scientific Article Recommendation with Mahout

Kris Jack, PhD

Senior Data Mining Engineer

Use Case

➔ Good researchers are on top of their game
➔ Large amount of research produced
➔ Takes time to get at what you need

➔ Help researchers by recommending relevant research

1.5 million+ users; the 20 largest user bases:

University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina

50m research articles

We need a recommender that scales up, coping with our data and future growth

Questions

➔ How does Mahout's recommender work?
➔ How well does it perform out of the box?
➔ How well does it perform after some tuning?

Mahout's Recommender

Generating recommendations through matrix multiplication

These are item-based recommendations: similarity is computed between items, not users

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

Input (all user preferences)

[Figure: preference matrix of Researchers (Turing, Babbage, Einstein, Newton) against Research Articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2)]

All User Preferences (item x user): 300M prefs

item.RecommenderJob

1. Prepare preference matrix (1-3)
2. Generate similarity matrix (4-6)
3. Multiply matrices (7-10)

[Figure: item.RecommenderJob for one researcher, Turing. A User's Preferences (item x user) is multiplied by Item Similarity (item x item) over Research Articles to give Recommendations R (item x user)]
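The matrix multiplication behind item-based recommendation can be sketched in memory. The following is a minimal illustration, not Mahout's implementation: it assumes 0/1 preferences and uses raw co-occurrence counts as the item similarity, whereas the real job runs on Hadoop and may use a different similarity measure. The class name, method names and preference data are hypothetical.

```java
import java.util.Arrays;

/** In-memory sketch of item-based recommendation by matrix multiplication. */
class ItemBasedSketch {

    /** Item x item co-occurrence counts from a user x item 0/1 preference matrix. */
    static int[][] cooccurrence(int[][] prefs) {
        int items = prefs[0].length;
        int[][] sim = new int[items][items];
        for (int[] userRow : prefs) {
            for (int i = 0; i < items; i++) {
                if (userRow[i] == 0) continue;
                for (int j = 0; j < items; j++) {
                    // Count each pair of distinct items that one user holds together.
                    if (i != j && userRow[j] != 0) sim[i][j]++;
                }
            }
        }
        return sim;
    }

    /** Scores = similarity matrix multiplied by one user's preference vector. */
    static int[] scores(int[][] sim, int[] userPrefs) {
        int[] out = new int[sim.length];
        for (int i = 0; i < sim.length; i++) {
            if (userPrefs[i] != 0) continue; // do not recommend items the user already has
            for (int j = 0; j < sim.length; j++) {
                out[i] += sim[i][j] * userPrefs[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data. Rows: Turing, Babbage, Einstein, Newton.
        // Columns: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        int[][] prefs = {
            {1, 0, 0, 0},
            {1, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 1, 1, 1},
        };
        int[][] sim = cooccurrence(prefs);
        System.out.println("Scores for Turing: " + Arrays.toString(scores(sim, prefs[0])));
    }
}
```

The highest-scoring unseen items are the recommendations for that researcher.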

How well does it work?

Mendeley Suggest

Running on Amazon's Elastic MapReduce

On-demand use and easy to cost

Mahout's Performance

[Figure: cost against No. Good Recommendations/10 (axis ticks 0 to 2.5), split into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

Orig. item-based ➔ 6.5K, 1.5

Let's tune it!

1. Reduce processing time

2. Improve quality

1. Reduce processing time

➔ Mahout's recommender is already efficient
➔ But your data may have unusual properties
➔ Hadoop may need a helping hand
➔ Let's see what's going on...

Task Allocation: 37 hours to complete

1 reducer allocated, despite having 48 available...

Task Allocation

Allocating more mappers on a per-job basis:

job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));

Allocating more reducers on a per-job basis:

job.getConfiguration().setInt("mapred.reduce.tasks", numReducers);
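To see why a smaller maximum split size yields more mappers, here is a minimal arithmetic sketch. It assumes the rule that Hadoop's FileInputFormat is generally understood to apply, split size = max(minSize, min(maxSize, blockSize)), and ignores details such as how the final partial split is handled. The class and method names are hypothetical.

```java
/** Sketch: how a maximum split size relates to the number of map tasks. */
class SplitSizeSketch {

    /** Split size after applying the min/max bounds to the block size. */
    static long effectiveSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Approximate number of splits (and so mappers) for one input file. */
    static long numSplits(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    /** A maximum split size that yields at least the wanted number of splits. */
    static long splitSizeFor(long fileBytes, int wantedSplits) {
        return Math.max(1, fileBytes / wantedSplits);
    }

    public static void main(String[] args) {
        long fileBytes = 1L << 30;   // hypothetical 1 GiB input file
        long blockSize = 64L << 20;  // hypothetical 64 MiB block size
        long defaultSplit = effectiveSplitSize(blockSize, 1, Long.MAX_VALUE);
        System.out.println("Default splits: " + numSplits(fileBytes, defaultSplit));

        long tuned = splitSizeFor(fileBytes, 40);
        long tunedSplit = effectiveSplitSize(blockSize, 1, tuned);
        System.out.println("Max split size for 40+ mappers: " + tuned + " bytes");
        System.out.println("Tuned splits: " + numSplits(fileBytes, tunedSplit));
    }
}
```

The value returned by `splitSizeFor` is what would be passed as `splitSize` in the configuration call; the number of reducers is set directly and needs no such calculation.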

Task Allocation: 37 hours to complete → 14 hours

From 1 → 40 reducers

Partitioners: 14 hours to complete

~500MB

InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(...);
InputSampler.writePartitionFile(conf, sampler);
conf.setPartitionerClass(TotalOrderPartitioner.class);

http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
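The idea behind sampling keys and partitioning by range can be sketched without Hadoop. This is a minimal illustration under stated assumptions, not the InputSampler or TotalOrderPartitioner code: keys are plain integers, split points are taken as quantiles of a random sample, and each key goes to the range it falls in. The class name, method names and the skewed key distribution are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch: range partitioning driven by split points taken from a key sample. */
class RangePartitionSketch {

    /** Picks numPartitions - 1 split points at evenly spaced ranks of the sorted sample. */
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return points;
    }

    /** Partition index for a key: the number of split points that are <= key. */
    static int partition(int key, int[] points) {
        int p = 0;
        while (p < points.length && key >= points[p]) p++;
        return p;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] keys = new int[100_000];
        for (int i = 0; i < keys.length; i++) {
            int r = rnd.nextInt(1000);
            keys[i] = r * r / 1000; // skewed towards small keys
        }
        // Draw a random sample of the keys, as a sampler would.
        int[] sample = new int[1000];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = keys[rnd.nextInt(keys.length)];
        }
        int reducers = 4;
        int[] points = splitPoints(sample, reducers);
        int[] load = new int[reducers];
        for (int k : keys) load[partition(k, points)]++;
        System.out.println("Split points: " + Arrays.toString(points));
        System.out.println("Records per reducer: " + Arrays.toString(load));
    }
}
```

Because the split points follow the sampled key distribution, dense key ranges get narrower partitions. A single very frequent key would still land on one reducer, so this balances load only as far as the keys themselves allow.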

Partitioners: 14 hours to complete → 2 hours

Evenly distributed

Mahout's Performance

Cust. item-based ➔ 2.4K, 1.5: -4.1K (63%) against Orig. item-based

2. Improve quality

➔ Mahout provides item-based CF
➔ We have many more items than users
➔ Typically, user-based is more appropriate

➔ So let's make one!

[Figure: item.RecommenderJob for Turing, with the similarity matrix over Research Articles replaced by User Similarity (user x user) over Researchers]
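A user-based variant can be sketched in the same in-memory style. This is a minimal illustration and not the code used for Mendeley Suggest: it assumes 0/1 preferences and takes the number of shared articles as the user similarity. The class name, method names and data are hypothetical.

```java
import java.util.Arrays;

/** In-memory sketch of user-based recommendation. */
class UserBasedSketch {

    /** User x user similarity: the number of articles two users share. */
    static int[][] userSimilarity(int[][] prefs) {
        int users = prefs.length;
        int[][] sim = new int[users][users];
        for (int u = 0; u < users; u++) {
            for (int v = 0; v < users; v++) {
                if (u == v) continue; // a user is not their own neighbour
                for (int i = 0; i < prefs[u].length; i++) {
                    sim[u][v] += prefs[u][i] * prefs[v][i];
                }
            }
        }
        return sim;
    }

    /** Scores each unseen item by summing the similarities of the users who have it. */
    static int[] scores(int[][] prefs, int[][] sim, int user) {
        int[] out = new int[prefs[user].length];
        for (int i = 0; i < out.length; i++) {
            if (prefs[user][i] != 0) continue; // skip items the user already has
            for (int v = 0; v < prefs.length; v++) {
                out[i] += sim[user][v] * prefs[v][i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data. Rows: Turing, Babbage, Einstein, Newton.
        // Columns: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        int[][] prefs = {
            {1, 0, 0, 0},
            {1, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 1, 1, 1},
        };
        int[][] sim = userSimilarity(prefs);
        System.out.println("Scores for Babbage: " + Arrays.toString(scores(prefs, sim, 1)));
    }
}
```

The similarity matrix here has one row and column per user rather than per item, so with many more items than users it is the smaller of the two to compute and store.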

Mahout's Performance

Orig. user-based ➔ 1K, 2.5: -1.4K (58%) and +1 (67%) against Cust. item-based
Cust. user-based ➔ 0.3K, 2.5: -0.7K (70%) against Orig. user-based
Overall, Orig. item-based to Cust. user-based: -6.2K (95%) and +1 (67%)

Conclusions

➔ Mahout is doing a great job of powering Mendeley Suggest
  ➔ Large scale data set
  ➔ Good quality recommendations

➔ Tuning helps
  ➔ Help Hadoop with task allocation if necessary
  ➔ Partition your data appropriately
  ➔ We save 95% of resources

➔ Use an appropriate algorithm
  ➔ Item- vs user-based (MAHOUT-1004)
  ➔ We increase precision by 66.6%
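The percentages quoted in the talk follow from the cost and quality pairs on the performance slides. As a small check, the sketch below recomputes them as relative changes. The class and method names are hypothetical, and the inputs are the figures reported on the slides.

```java
/** Sketch: recomputing the reported savings and precision gain as relative changes. */
class SavingsSketch {

    /** Size of the change from before to after, as a whole percentage of before. */
    static long percentChange(double before, double after) {
        return Math.round(Math.abs(after - before) / before * 100.0);
    }

    public static void main(String[] args) {
        // Cost figures (in K) and good recommendations out of 10, as reported.
        System.out.println("Orig. item-based to Cust. item-based: " + percentChange(6.5, 2.4) + "%");
        System.out.println("Cust. item-based to Orig. user-based: " + percentChange(2.4, 1.0) + "%");
        System.out.println("Orig. user-based to Cust. user-based: " + percentChange(1.0, 0.3) + "%");
        System.out.println("Overall cost saving: " + percentChange(6.5, 0.3) + "%");
        System.out.println("Precision gain (1.5 to 2.5): " + percentChange(1.5, 2.5) + "%");
    }
}
```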

http://www.mendeley.com/profiles/kris-jack/
