Crab: A Python Framework for Building Recommender Systems


DESCRIPTION

Keynote at VII PythonBrasil (the Brazilian Python users meeting), São Paulo, SP, 01/10/2011.

TRANSCRIPT

Crab: A Python Framework for Building Recommendation Engines

Marcel Caraciolo (@marcelcaraciolo)

Bruno Melo (@brunomelo)

Ricardo Caspirro (@ricardocaspirro)

PythonBrasil 2011, São Paulo, SP

What is Crab ?

A Python framework for building recommendation engines

A scikit module for collaborative, content-based and hybrid filtering

A Mahout alternative for Python developers :D

Open source under the BSD license

https://github.com/muricoca/crab

When did it start ?

It began one year ago

Community-driven, with 4 members

Since April 2011 it has been part of the Muriçoca open-source lab

https://github.com/muricoca/

Since April 2011 we have been rewriting it as a scikit

Knowing Scikits

Scikits are SciPy Toolkits: independent projects hosted under a common namespace.

scikits.image, scikits.mlabwrap, scikits.audiolab, scikit-learn

....

http://scikits.appspot.com/scikits

Knowing Scikits

Scikit-Learn

Machine learning algorithms + the scientific Python packages (NumPy, SciPy and Matplotlib)

http://scikit-learn.sourceforge.net/

Our goal: turn Crab into a scikit and contribute parts of it to scikit-learn

Why Recommendations ?

!"#$%&'()$*+$,-$&.#'/0'&%)#)$1(,0#The world is an over-crowded place

Why Recommendations ?

We are overloaded!

Thousands of news articles and blog posts each day

Millions of movies, books and music tracks online

In Hanoi, ~75 TV channels and thousands of programs each day

In New York, several thousand ad messages sent to us per day

Even friends can overload us sometimes!

Several places, offers and events

Why Recommendations ?

We really need and consume only a few of them!

“A lot of times, people don’t know what they want until you show it to them.”

Steve Jobs

“We are leaving the Information age, and entering into the Recommendation age.”

Chris Anderson, author of The Long Tail

Why Recommendations ?

Can Google help ? Yes, but only when we really know what we are looking for.

Can Facebook help ? Yes, I tend to find my friends' stuff interesting.

But what does "interesting" mean?

And what if I have only a few friends, and what they like doesn't always appeal to me?

Can experts help ? Yes, but it doesn't scale well.

And it's what they like, not what I like. Everyone gets exactly the same advice!

Why Recommendations ?

Recommendation Systems

Systems designed to recommend things I may like

Why Recommendations ?

Recommendation Systems

Graph Representation

[Slide diagram: users and items as graph nodes, linked by "likes" and "recommends" edges]

The current Crab

Collaborative Filtering algorithms: User-Based, Item-Based and Matrix Factorization (SVD)

Evaluation of the recommender algorithms: Precision, Recall, F1-Score, RMSE

Precision-recall charts


Collaborative Filtering

[Slide diagram: users Marcel, Rafael and Amanda like items ("O Vento Levou", "Thor", "Armagedon", "Toy Store"); items liked by similar users become the recommendations]

The current Crab

>>> # load the dataset
>>> from crab.datasets import load_sample_movies
>>> data = load_sample_movies()
>>> data
{'DESCR': 'sample_movies data set was collected by the book called \nProgramming the Collective Intelligence by Toby Segaran \n\nNotes\n----- \nThis data set consists of\n\t* n ratings with (1-5) from n users to n movies.',
 'data': {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
  2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},
  3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},
  4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},
  5: {2: 4.5, 3: 1.0, 4: 4.0},
  6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
  7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}},
 'item_ids': {1: 'Lady in the Water',
  2: 'Snakes on a Planet',
  3: 'You, Me and Dupree',
  4: 'Superman Returns',
  5: 'The Night Listener',
  6: 'Just My Luck'},
 'user_ids': {1: 'Jack Matthews',
  2: 'Mick LaSalle',
  3: 'Claudia Puig',
  4: 'Lisa Rose',
  5: 'Toby',
  6: 'Gene Seymour',
  7: 'Michael Phillips'}}

The current Crab

>>> from crab.models import MatrixPreferenceDataModel
>>> m = MatrixPreferenceDataModel(data.data)
>>> print m
MatrixPreferenceDataModel (7 by 6)
        1         2         3         4         5         ...
1       3.000000  4.000000  3.500000  5.000000  3.000000
2       3.000000  4.000000  2.000000  3.000000  3.000000
3       ---       3.500000  2.500000  4.000000  4.500000
4       2.500000  3.500000  2.500000  3.500000  3.000000
5       ---       4.500000  1.000000  4.000000  ---
6       3.000000  3.500000  3.500000  5.000000  3.000000
7       2.500000  3.000000  ---       3.500000  4.000000

The current Crab

>>> # import a pairwise distance
>>> from crab.metrics.pairwise import euclidean_distances
>>> # import a similarity
>>> from crab.similarities import UserSimilarity
>>> similarity = UserSimilarity(m, euclidean_distances)
>>> similarity[1]
[(1, 1.0), (6, 0.66666666666666663), (4, 0.34054242658316669), (3, 0.32037724101704074), (7, 0.32037724101704074), (2, 0.2857142857142857), (5, 0.2674788903885893)]

The current Crab

>>> from crab.recommenders.knn import UserBasedRecommender
>>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True, with_preference=True)
>>> recsys.recommend(5)
array([[ 5.        ,  3.45712869],
       [ 1.        ,  2.78857832],
       [ 6.        ,  2.38193068]])
>>> recsys.recommended_because(user_id=5, item_id=1)
array([[ 2. ,  3. ],
       [ 1. ,  3. ],
       [ 6. ,  3. ],
       [ 7. ,  2.5],
       [ 4. ,  2.5]])

The current Crab

Using REST APIs to deploy the recommender: django-piston, django-rest, django-tastypie
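As a rough sketch of that deployment idea (assumed wiring, not the production code from the talk; the handler name and module path are hypothetical), a django-piston handler exposing top-N recommendations could look like this:

from piston.handler import BaseHandler

from myapp.engine import recsys  # hypothetical module holding the trained recommender

class RecommendationHandler(BaseHandler):
    allowed_methods = ('GET',)

    def read(self, request, user_id):
        # recsys is assumed to be a UserBasedRecommender built as in the
        # earlier slides; each row is (item_id, estimated_preference)
        recommendations = recsys.recommend(int(user_id))
        return {'user_id': int(user_id),
                'recommendations': [list(row) for row in recommendations]}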

Crab is already in production

News recommendations for the Abril publisher: collecting from over 10 magazines, 20 books and 100+ articles

Running on Python + SciPy + Django

Easy-to-use interface

Content-based filtering

Still in development

Content-Based Filtering

[Slide diagram: Marcel likes the items "O Vento Levou" and "Duro de Matar"; similar items ("Armagedon", "Toy Store") are recommended to him]

Crab is already in production

PythonBrasil keynote recommender: recommending keynotes based on a hybrid approach

Running on Python + SciPy + Django

Schedule your keynotes

Content-based filtering + collaborative filtering

Still in development

...source, the recommendation architecture that we propose will aggregate the results of such filtering techniques. We aim at integrating the previously mentioned hybrid product recommendation approach into a mobile application so that users can benefit from useful and logical recommendations. Moreover, we aim at providing a suitable explanation for each recommendation, since current approaches only deliver product recommendations with an overall score, without pointing out the appropriateness of the recommendation [13]. Besides the basic information provided by the suppliers, the system will deliver an explanation with relevant reviews from similar users; we believe this will increase confidence in the buying decision process and the product acceptance rate. In the mobile context this approach can support users in that process, and showing other users' opinions can contribute to this task.

!"#$%&'%($)

!"*+#,$+'-)

!".,"/#)

!"*+#,$+'-)

0+($"($)1%#"2)

3,4$"',(5)

!"#$%&"'()*+,#&-,.)

/$%,0"12()*3$4%)3""5.)

0+44%6+'%$,.")1%#"2)

3,4$"',(5)

)))67,8,#%)+,4%$91$'%4)-1":))))

))))1,;&,<4)<1&%%,')=2)4&:&8$1))

)))))))))))%$4%,5)94,14>?)

7"$%)

!"8+99"(2"'))

!"8+99"(2%$,+(#)

Fig. 1. Meta Recommender Architecture

Since one of the goals of this work is to incorporate different data sources of user opinions and descriptions, we have adopted a meta recommendation architecture. With a meta recommender architecture, the system can provide personalized control over the generated recommendation list formed by the combination of rich data [16]. The influence of the specific data sources can be explicitly controlled by evaluating past user interaction with the recommender to decide how to balance the different knowledge sources. For instance, if the product or service to be recommended has a rich structured description (e.g. a restaurant), the system tends to rely more on the content-based filtering approach. Otherwise, if the product is poorly described (few attributes), the system relies more on collaborative filtering techniques, that is, on the reviews from similar users.

Figure 1 shows an overview of our meta recommender approach. Combining content-based and collaborative-based filtering into a hybrid recommender system, it uses the services/products repositories, which catalogue the services to be recommended, and the review repository, which contains the user opinions about those services. All this data can be extracted from data source containers on the web, such as the location-based social network Foursquare [17], displayed in Figure 2, and Google's location recommendation engine, Google HotPot [18].

Fig. 2. User Reviews from the Foursquare Social Network

The content-based filtering approach will be used to filter the product/service repository, while the collaborative-based approach will derive the product review recommendations. In addition, we will use text mining techniques to classify the polarity of each user review as positive or negative; this summarized information contributes to the computation of the product recommendation score. The final product recommendation score is computed by integrating the results of both recommenders. For now we are considering different options for this integration, one in particular being the symbolic data analysis (SDA) approach [19], in which each product description and the user ratings/reviews are modeled as sets of modal symbolic descriptions that summarize the information provided by the corresponding data sources. It is a novel approach in hybrid recommender systems which, in our domain, can encapsulate in entities the levels of influence of both user reviews and product descriptions.

B. Symbolic Recommendation Approach

Symbolic Data Analysis (SDA) is a research field that provides suitable tools to manage aggregated data described by multi-valued variables, where data table entries are sets of categories, ordered lists of categories, intervals or weight histograms [19]. It also provides approaches for information filtering algorithms such as content-based, collaborative-based and hybrid ones. The main idea is to represent the user profile, in our domain the product to be recommended, through symbolic data structures; the user and item correlations are then computed through dissimilarity functions adapted from the SDA domain. The advantage of using SDA-based information filtering methods in the context of recommendations is that the user description synthesizes the entire body of information taken from the item descriptions belonging to the user profile. Items are therefore described by histogram-valued symbolic data, so they can be compared through a dissimilarity function. Modeling both the user reviews and the product descriptions as histogram-valued symbolic data meets our requirements, since our recommendations would be balanced by both structures. Bezerra and Carvalho proposed approaches along these lines whose results proved very promising [19].

III. SYSTEM DESIGN

The application data of our mobile recommender system can be divided into two parts: the product descriptions (such as location, description and attributes) and the user reviews or ratings (such as rating, comments, tags, etc.). Figure 3 shows the system's architecture and its components.

!"#$"%&'$

!(#$()&'*&%$+,-*.&$

/01&'234&$

5&-$

!6#$6,00&41&7$

8&4,99&0731*,0$:0;*0&$

<',7)41$

8&=,%*1,'>$

8&?*&@$

8&=,%*1,'>$

8&%).1%$

!<#$<'&2&'&04&%A$B,431*,0A$&14C$

!B#$B*%1$,2$D4,'&7$<',7)41%$

!8#$830E&7$<',7)41%$

!(#$()&'*&%$

Fig. 3. Mobile Recommender System Architecture

In our mobile product/service recommender, the user can filter products or services and get a list of recommendations. The user can also enter his preferences or give feedback on an offered product recommendation. Other functionalities are the retrieval of the next five best recommendations and the search for reviews satisfying given constraints (text length, date, review attitude) or containing certain keywords. Let us illustrate a typical use scenario with a restaurant query example. A user wants to know good restaurants serving Chinese food around his current location (the system can also integrate location-based services). Refining the query, the user also searches for places with highly positive reviews. Using the coordinate information to calculate the distance from his current location, the system provides a list of restaurants ranked by highly positive reviews, joined with summarized reviews written by the most similar users who reviewed those restaurants. According to his personal preferences, the user goes through the recommended restaurant list and selects the places he wishes to go. After visiting the selected places, the user can enter his feedback, in the form of wishes or critiques, on the recommended service. This information is added to the reviews repository for those places, where it is processed and summarized by our recommender, extracting useful information such as the polarity and adding new keywords to the product feature vector.

IV. METHODOLOGY AND EXPECTED RESULTS

A. Methodology

Our research focuses on methodologies and techniques for improving the user acceptance of product and service recommendations and for explaining these suggestions in the mobile context. To achieve this, we use a new approach where both product descriptions and user reviews are incorporated into the mobile recommendation process. We believe this approach will give the user more confidence in the recommendations and a better understanding of the products, especially supporting him in the buying decision process. To accomplish this task, we will give an overview of the context in which mobile recommender systems are included and the main research concerns about the topic. We will also analyze and investigate approaches for filtering algorithms, basing our design and implementation choices on previous failures or successes and reusing and adapting successful solutions. We will then propose our meta-recommender system, providing a detailed explanation of the recommender architecture, implementation, main processes and crucial features. Finally, to validate it, we will use standard recommendation system measures, comparing our suggestions to mobile users on a real dataset extracted from the web and discussing the results.

B. Expected Results

We believe that our mobile product recommender incorporating user reviews will increase the user's trust in the recommended products, since he can read opinions, both positive and negative, from a group of similar users. Moreover, viewing other reviews, the user may feel more stimulated to share his own experiences and to obtain recognition from other users. Reviews also provide a better product understanding, since there will be more information for the user to decide whether the product is (or is not) suitable for him. Finally, presenting a list of recommendations alongside explanations may provide 'local hidden knowledge'. Local hidden knowledge can be described as knowledge that you gain only after purchasing a product or visiting a place. It cannot be found using...

Crab is already in production

Hybrid Meta Approach

Crab is already in production

Brazilian social network Atepassar.com: an educational network with more than 60,000 students and 120 video classes

Running on Python + NumPy + SciPy and Django

Backend for recommendations: MongoDB (mongoengine)

Daily recommendations with explanations
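A minimal sketch of what such a backend could look like with mongoengine (the schema, field names and database name here are assumptions for illustration, not Atepassar's actual code):

from mongoengine import Document, IntField, ListField, StringField, connect

connect('atepassar_recommendations')   # hypothetical database name

class DailyRecommendation(Document):
    """One day's recommendations for a user, stored with an explanation."""
    user_id = IntField(required=True)
    item_ids = ListField(IntField())   # recommended items, best first
    explanation = StringField()        # e.g. "because your friends follow X"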

Evaluating your recommender

Crab implements the most-used recommender metrics: Precision, Recall, F1-Score, RMSE

Using matplotlib for a plotter utility

Implementing new metrics

Maybe simulation support (??)

Evaluating your recommender

>>> from crab.metrics.classes import CfEvaluator
>>> evaluator = CfEvaluator()
>>> evaluator.evaluate(recommender=recsys, metric='rmse')
{'rmse': 0.69467177857026907}
>>> evaluator.evaluate_on_split(recommender=recsys, at=2)
({'error': [{'mae': 0.345, 'nmae': 0.4567, 'rmse': 0.568},
            {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788},
            {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}],
  'ir': [{'f1score': 0.456, 'precision': 0.78557, 'recall': 0.55677},
         {'f1score': 0.64567, 'precision': 0.67865, 'recall': 0.785955},
         {'f1score': 0.45070, 'precision': 0.74744, 'recall': 0.858585}]},
 {'final_score': {'avg': {'f1score': 0.495955, 'mae': 0.429292,
                          'nmae': 0.373739, 'precision': 0.63932929,
                          'recall': 0.729939393, 'rmse': 0.3466868},
                  'stdev': {'f1score': 0.09938383, 'mae': 0.0593933,
                            'nmae': 0.03393939, 'precision': 0.0192929,
                            'recall': 0.031293939, 'rmse': 0.234949494}}})

Distributing the recommendation computations

Use Hadoop and MapReduce intensively: investigating the Yelp mrjob framework (https://github.com/pfig/mrjob)

Develop the Netflix-prize and other novel state-of-the-art techniques: Matrix Factorization, Singular Value Decomposition (SVD), Boltzmann machines

The most commonly used is the Slope One technique: simple algebra, y = a*x + b (a minimal sketch follows)
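To make the y = a*x + b remark concrete, here is a minimal, self-contained Slope One sketch (an illustration of the technique, not Crab's implementation): the prediction is the user's own rating x shifted by the average deviation b observed between item pairs.

from collections import defaultdict

def slope_one_deviations(ratings):
    """ratings: {user: {item: rating}} -> average deviation and support
    for every ordered item pair."""
    freq = defaultdict(lambda: defaultdict(int))
    dev = defaultdict(lambda: defaultdict(float))
    for prefs in ratings.values():
        for i, r_i in prefs.items():
            for j, r_j in prefs.items():
                if i == j:
                    continue
                freq[i][j] += 1
                dev[i][j] += r_i - r_j          # accumulate rating differences
    for i in dev:
        for j in dev[i]:
            dev[i][j] /= freq[i][j]             # b: average deviation of i over j
    return dev, freq

def slope_one_predict(ratings, dev, freq, user, target):
    """Predict user's rating for target: each co-rated item j votes
    dev[target][j] + rating(user, j), weighted by the pair's support."""
    num = den = 0.0
    for j, r_j in ratings[user].items():
        if j != target and freq[target].get(j):
            num += (dev[target][j] + r_j) * freq[target][j]
            den += freq[target][j]
    return num / den if den else None

With the sample data from the earlier slides, dev, freq = slope_one_deviations(data.data) followed by slope_one_predict(data.data, dev, freq, 5, 1) would estimate Toby's (user 5) rating for 'Lady in the Water' (item 1).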

Cache/Parallelism with joblib

http://packages.python.org/joblib/index.html

from joblib import Memory
memory = Memory(cachedir='', verbose=0)

class UserSimilarity(BaseSimilarity):
    ...
    @memory.cache
    def get_similarity(self, source_id, target_id):
        source_preferences = self.model.preferences_from_user(source_id)
        target_preferences = self.model.preferences_from_user(target_id)
        return self.distance(source_preferences, target_preferences) \
            if not source_preferences.shape[1] == 0 \
                and not target_preferences.shape[1] == 0 else np.array([[np.nan]])
    ...
    def get_similarities(self, source_id):
        return [(other_id, self.get_similarity(source_id, other_id))
                for other_id, v in self.model]

>>> # Without memory.cache
>>> timeit similarity.get_similarities('marcel_caraciolo')
100 loops, best of 3: 978 ms per loop

>>> # With memory.cache
>>> timeit similarity.get_similarities('marcel_caraciolo')
100 loops, best of 3: 434 ms per loop

Cache/Parallelism with joblib

http://packages.python.org/joblib/index.html

Investigate how to use multiprocessing and parallel packages for the similarity computations:

from joblib import Parallel, delayed
...
def get_similarities(self, source_id):
    # Parallel consumes an iterable of delayed calls; pair the results
    # back up with their ids afterwards
    other_ids = [other_id for other_id, v in self.model]
    sims = Parallel(n_jobs=3)(
        delayed(self.get_similarity)(source_id, other_id)
        for other_id in other_ids)
    return zip(other_ids, sims)

Distributed Computing with mrjob

https://github.com/Yelp/mrjob

It supports Amazon's Elastic MapReduce (EMR) service, your own Hadoop cluster, or local mode (for testing).

"""The classic MapReduce job: count the frequency of words."""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()

Distributed Computing with mrjob

https://github.com/Yelp/mrjob

Elsayed et al.: Pairwise Document Similarity in Large Collections with MapReduce (an mrjob sketch of the idea follows)
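A rough two-step mrjob sketch of the Elsayed et al. idea (this is an illustration, not Crab code; it assumes today's mrjob MRStep API and an input format of one doc_id<TAB>term<TAB>weight line per posting): an inverted index groups weights by term, and each term's posting list contributes partial products to every document pair's dot product.

"""Pairwise document similarity via an inverted index, sketched in mrjob."""
from itertools import combinations

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRPairwiseSimilarity(MRJob):

    def mapper_postings(self, _, line):
        # build the inverted index: term -> (document, weight)
        doc_id, term, weight = line.split('\t')
        yield term, (doc_id, float(weight))

    def reducer_partial_products(self, term, postings):
        # every pair of documents sharing this term contributes w1*w2
        for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
            yield (d1, d2), w1 * w2

    def reducer_sum(self, doc_pair, partial_products):
        # unnormalized dot-product similarity between the two documents
        yield doc_pair, sum(partial_products)

    def steps(self):
        return [MRStep(mapper=self.mapper_postings,
                       reducer=self.reducer_partial_products),
                MRStep(reducer=self.reducer_sum)]

if __name__ == '__main__':
    MRPairwiseSimilarity.run()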

Future studies with Sparse Matrices

Real datasets come with lots of empty values

Apontador Reviews Dataset

http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html

Solutions:

scipy.sparse package (a small illustration follows this list)

Sharding operations

Matrix Factorization techniques (SVD)

Crab implements a Matrix Factorization with Expectation Maximization algorithm: the scikits.crab.svd package
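As a quick illustration of why scipy.sparse helps with those empty values (a toy example, not Crab's internal representation):

import numpy as np
from scipy.sparse import csr_matrix

# a 7-user x 6-item ratings matrix with only six known ratings:
# CSR stores just the non-empty entries instead of all 42 cells
users = np.array([0, 0, 1, 2, 4, 6])
items = np.array([0, 3, 1, 4, 1, 4])
values = np.array([3.0, 5.0, 4.0, 4.5, 4.5, 4.0])
ratings = csr_matrix((values, (users, items)), shape=(7, 6))

print(ratings.nnz)      # 6 stored values
print(ratings[0, 3])    # 5.0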

Optimizations with Cython

http://cython.org/

Cython is a Python extension that lets developers annotate functions so they can be compiled to C.

http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

# for notes on compiler flags see:
# http://docs.python.org/install/index.html

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("spearman_correlation_cython",
                           ["spearman_correlation_cython.pyx"])],
)
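For flavor, here is a minimal spearman_correlation_cython.pyx, a hedged sketch of the kind of typed hot loop such a module might contain (not Crab's actual source):

# spearman_correlation_cython.pyx
import numpy as np
cimport numpy as cnp

def sum_squared_rank_diffs(cnp.ndarray[cnp.float64_t, ndim=1] rank_x,
                           cnp.ndarray[cnp.float64_t, ndim=1] rank_y):
    # typed locals let Cython compile this loop straight down to C;
    # Spearman's rho is then 1 - 6 * d2 / (n * (n*n - 1))
    cdef Py_ssize_t i, n = rank_x.shape[0]
    cdef double d2 = 0.0, d
    for i in range(n):
        d = rank_x[i] - rank_y[i]
        d2 += d * d
    return d2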

Benchmarks

Dataset          Old Crab (pure Python w/ dicts)   New Crab (Python w/ SciPy and NumPy)
MovieLens 100k   15.32 s                           9.56 s

http://www.grouplens.org/node/73

[Bar chart: time elapsed to recommend 5 items, Old Crab vs. New Crab, 0-16 s scale]

Why migrate ?

The old Crab ran on pure Python only

Recommendations demand heavy math and lots of processing

Compatibility with the NumPy and SciPy libraries: high-standard, popular scientific libraries optimized for scientific calculations in Python

Scikits projects are amazing! Active communities, scientific conferences and actively maintained projects (e.g. scikit-learn)

Make the Crab framework visible to the community: join the scientific researchers and machine learning developers around the globe coding in Python to help us with this project

Be Fast and Furious

Why migrate ?

NumPy optimized with PyPy: 2x to 48x faster

http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html

How are we working ?

Sprints, Online Discussions and Issues

https://github.com/muricoca/crab/wiki/UpcomingEvents

How are we working ?

Our Project’s Home Page

http://muricoca.github.com/crab

Future Releases

Planned release 0.1: Collaborative Filtering algorithms working, sample datasets to load and test

Planned release 0.11: sparse matrix and database model support

Planned release 0.12: Slope One algorithm, new factorization techniques implemented

....

Join us!

1. Read our wiki page: https://github.com/muricoca/crab/wiki/Developer-Resources

2. Check out our current sprints and open issues: https://github.com/muricoca/crab/issues

3. Forks and pull requests are mandatory

4. Join us at irc.freenode.net #muricoca or at our discussion list: http://groups.google.com/group/scikit-crab

Recommended Books

Satnam Alag, Collective Intelligence in Action, Manning Publications, 2009

Toby Segaran, Programming Collective Intelligence, O'Reilly, 2007

ACM RecSys, KDD, SBSC...

Crab: A Python Framework for Building Recommendation Engines

Marcel Caraciolo (@marcelcaraciolo)

Bruno Melo (@brunomelo)

Ricardo Caspirro (@ricardocaspirro)

{marcel, ricardo, bruno}@muricoca.com

https://github.com/muricoca/crab
