
Summary of the 15th Discovery Challenge

Recommending Given Names

Folke Mitzlaff1, Stephan Doerfel1, Andreas Hotho2,3, Robert Jäschke3, and Juergen Mueller1,3

1 University of Kassel, Knowledge Discovery and Data Engineering Group, Wilhelmshöher Allee 73, 34121 Kassel, Germany
{mitzlaff, doerfel, mueller}@cs.uni-kassel.de
2 University of Würzburg, Data Mining and Information Retrieval Group, Am Hubland, 97074 Würzburg, Germany
[email protected]
3 L3S Research Center, Appelstraße 4, 30167 Hannover, Germany
{hotho, juergen.mueller, jaeschke}@l3s.de

The 15th ECML PKDD Discovery Challenge centered around the recommendation of given names. Participants of the challenge implemented algorithms that were tested both offline – on data collected by the name search engine Nameling – and online within Nameling. Here, we describe both tasks in detail and discuss the publicly available datasets. We motivate and explain the chosen evaluation of the challenge, and we summarize the different approaches applied to the name recommendation tasks. Finally, we present the rankings and winners of the offline and the online phase.

1 Introduction

The choice of a given name is typically accompanied by an extensive search for the most suitable alternatives, to which many constraints apply. First of all, the social and cultural background determines what a common name is and may additionally imply certain conventions, such as, e. g., the patronym. Additionally, most names bear a certain meaning or associations which also depend on the cultural context. Whoever makes the decision is strongly influenced by personal taste and current trends within the social context, either by preferring names which are currently popular or by avoiding names which will most likely be common in the neighborhood.

Future parents are often aided by huge collections of given names which list several thousand names, ordered alphabetically or by popularity. To simplify and shorten this extensive approach, the name search engine Nameling (see Section 2) allows its users to query names and returns similar names. To determine similarity, Nameling utilizes Wikipedia's text corpus for interlinking names and the microblogging service Twitter for capturing current trends and popularity of given names. Nevertheless, the underlying rankings and thus the search results are statically bound to the underlying co-occurrence graph obtained from Wikipedia and thus not personalized. Since naming a child is a very personal


task, a name search engine can certainly profit from personalized name recommendation.

The task of the 15th ECML PKDD Discovery Challenge was to create successful recommendation algorithms that would suggest suitable given names to future parents. The challenge relied on data gathered by Nameling and consisted of two phases, i. e., an offline and an online challenge. In both phases, participants were asked to provide a name recommendation algorithm to solve the task.

Task 1: The Offline Challenge. In the first phase, recommenders were evaluated in an offline setting. To train an algorithm, the participants were provided with a public training dataset from Nameling's access logs, representing user search activities. Given a set of names in which a user had shown interest, the recommender should provide suggestions for further names for that user. A second, private dataset from Nameling contained further search events from users of Nameling. To win the challenge, participants had to predict these search events. Details of the offline phase are discussed in Section 3, where we also summarize the different approaches to solve the challenge as well as the ranking of the participating teams and the challenge's winners.

Task 2: The Online Challenge. The online phase took place after Task 1 had been completed. The participants implemented a recommendation service that could be actively queried via HTTP and would provide names according to the participant's algorithm. These recommendations were shown to actual users of Nameling and their quality was measured by counting users' clicks on recommended names. We elaborate on the online challenge in Section 4.

2 The Name Search Engine Nameling

Nameling is designed as a search engine and recommendation system for given names. The basic principle is simple: The user enters a given name and gets a browsable list of relevant, related names, called "namelings". As an example, Figure 1a shows the namelings for the classical masculine German given name "Oskar". The list of namelings in this example ("Rudolf", "Hermann", "Egon", etc.) exclusively contains classical German masculine given names as well. Whenever a corresponding article in Wikipedia exists, categories for the respective given name are displayed, as, e. g., "Masculine given names" and "Place names" for the given name "Egon". Via hyperlinks, the user can browse for namelings of each listed name or get a list of all names linked to a certain category in Wikipedia. Further background information for the query name is summarized in a corresponding details view, where, among other things, the popularity of the name in different language editions of Wikipedia as well as in Twitter is shown. As depicted in Figure 1b, the user may also explore the "neighborhood" of a given name, i. e., names which often co-occur with the query name.

From a user's perspective, Nameling is a tool for finding a suitable given name. Accordingly, names can easily be added to a personal list of favorite names. The list of favorite names is shown on every page of Nameling and can be shared with a friend for collaboratively finding a given name.


(a) Namelings (b) Co-occurring names

Fig. 1: A user query for the classical German given name “Oskar”.

2.1 Computing Related Names

To generate the lists of related names, Nameling makes use of techniques that have become popular in the so-called "Web 2.0". With the rise of the latter, various social applications for different domains – offering a huge source of information and giving insight into social interaction and personal attitudes – have emerged that make use of user-generated data (e. g., user profiles and friendships in social networks or tagging data in bookmarking systems).

The basic idea behind Nameling was to discover relations among given names, based on such user-generated data. In this section, we briefly summarize how data is collected and how relations among given names are established. Nameling is based on a comprehensive list of given names, which was initially manually collected, but then populated by user suggestions. Information about names and relations between them is gathered from three different data sources, as depicted in Figure 2:

Wikipedia: As a basis for discovering relations among given names, co-occurrence graphs are generated for each language edition of Wikipedia separately. That is, for each language, a corresponding data set is downloaded from the Wikimedia Foundation4. Afterwards, for any pair of given names in our dataset, the number of sentences in which they jointly occur is determined. Thus, an undirected graph is obtained for every language, where two names are adjacent if they occur together in at least one sentence within any of the articles and the edge's weight is given by the number of such sentences.

4 "Database dump progress.", Wikimedia, 2012. http://dumps.wikimedia.org/backup-index.html (May 1, 2013)
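To make the graph construction concrete, the following minimal sketch (in Python; the pipeline actually used by Nameling is not described in this summary) counts sentence-level co-occurrences of known names and stores them as edge weights. The sentence extraction from the Wikipedia dump and the exact tokenization are assumptions here.

from collections import Counter
from itertools import combinations

def build_cooccurrence_graph(sentences, known_names):
    """Count sentence-level co-occurrences of known given names.

    sentences   -- iterable of strings (assumed to be extracted from a
                   Wikipedia dump beforehand; the extraction is not shown)
    known_names -- set of lower-cased given names known to the system
    Returns a dict mapping an unordered name pair to its edge weight,
    i.e. the number of sentences in which both names occur.
    """
    edge_weights = Counter()
    for sentence in sentences:
        tokens = {token.strip(".,;:!?()\"'").lower()
                  for token in sentence.split()}
        names_in_sentence = sorted(tokens & known_names)
        # every unordered pair of names in this sentence adds weight 1
        for pair in combinations(names_in_sentence, 2):
            edge_weights[pair] += 1
    return edge_weights

# toy usage
names = {"oskar", "rudolf", "hermann", "egon"}
sents = ["Oskar and Rudolf were brothers.",
         "Rudolf, Hermann and Oskar met Egon."]
print(build_cooccurrence_graph(sents, names))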


Fig. 2: Nameling determines similarities among given names based on co-occurrence networks from Wikipedia, popularity of given names via Twitter, and social context of the querying user via Facebook.

Relations among given names are established by calculating a vertex similarity score between the corresponding nodes in the co-occurrence graph. Currently, namelings are calculated based on the cosine similarity (cf. [2]).
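A minimal sketch of this similarity step, assuming the edge weights produced above: each name is represented by the vector of its co-occurrence weights, and two names are compared by the textbook cosine similarity. Whether Nameling applies additional normalization is not specified in this summary.

import math
from collections import defaultdict

def cosine_similarity(edge_weights, name_a, name_b):
    """Cosine similarity of two names in a weighted co-occurrence graph.

    edge_weights -- dict mapping unordered name pairs to co-occurrence counts
    Each name is represented by the vector of its edge weights to all
    neighbours; the result is 0.0 if either name has no neighbours.
    """
    vectors = defaultdict(dict)
    for (u, v), w in edge_weights.items():
        vectors[u][v] = w
        vectors[v][u] = w
    vec_a, vec_b = vectors[name_a], vectors[name_b]
    dot = sum(w * vec_b[n] for n, w in vec_a.items() if n in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)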

Twitter: For assessing up-to-date popularity of given names, a random sample of status messages in Twitter is constantly processed via the Twitter streaming API5. For each name, the number of tweets mentioning it is counted.

Facebook: Optionally, a user may connect Nameling with Facebook6. If the user allows Nameling to access his or her profile information, the given names of all contacts in Facebook are collected anonymously. Thus, a "social context" for the user's given name is recorded. Currently, the social context graph is too small for implementing features based on it, but it will be a valuable source for discovering and evaluating relations among given names.

2.2 Research around Nameling

Besides serving as a tool for parents-to-be, Nameling is also a research platform. The choice of a given name is influenced by many factors, ranging from cultural background and social environment to personal preference. Accordingly, the task of recommending given names is per se subject to interdisciplinary considerations.

Within Nameling, users are anonymously identified via a cookie, that is, a small identification fragment which uniquely identifies a user's web browser. Although a single user might use several browsers or computers, Nameling uses the simple heuristic of treating cookies as identifying users.

5 Twitter Developers. https://dev.twitter.com/docs/api/1/get/statuses/sample (May 3, 2013)

6 https://facebook.com/


The Nameling dataset arises from the requests that users make to Nameling. More than 60,000 users conducted more than 500,000 activities within the time range of consideration (March 6th, 2012 until February 12th, 2013). For every user, Nameling tracks the search history, favorite names and geographical location based on the user's IP address and the GeoIP7 database. All these footprints together constitute a multi-mode network with multiple edge types. Analyzing this graph (or one of its projections) can reveal communities of users with similar search characteristics or cohesive groups of names, among others. In the context of the Discovery Challenge, the data of Nameling is used to train and to test given name recommendation algorithms.

3 Offline Challenge

The first task of the Discovery Challenge was to create an algorithm that produces given name recommendations – given a list of users with their previous history of name search requests to Nameling. The evaluation of these algorithms was conducted in a classical offline prediction scenario. A large dataset from Nameling was split into a public training dataset and a secret test dataset. In the following we describe details of the task and the dataset. In the second part of this section we summarize the participants' approaches and their results in the challenge.

3.1 Task

In the challenge, we deal with a standard binary item recommendation task. Users of the name search engine Nameling have expressed interest in certain names by searching for them or requesting their details. These interactions with the system are interpreted as (binary) positive feedback to these names, while there is no explicit negative feedback – only names towards which we do not know the user's attitude. A recommender algorithm must determine which of these names will be of interest to the user.

Participants were given a public dataset to train their algorithms on. For the overall evaluation a second dataset containing only a list of users was given to them. The task in the offline challenge then was to produce for each user in that second dataset a list of 1,000 name recommendations, ordered by their relevance to the user at hand.

For the challenge no further restrictions were made regarding the choice of methodology or additional data. On the contrary, participants were encouraged to make use of any kind of data they might find suitable, e. g., family trees, location information, data from social networks, etc.

7 "GeoIP databases and web services.", MaxMind. http://www.maxmind.com/en/geolocation_landing (May 3, 2013)


3.2 Challenge Data

For the challenge, we provided data from the name search engine Nameling, containing users together with their (partial) interaction history in Nameling. A user interaction hereby is always one of the following:

ENTER_SEARCH The user entered a name directly into Nameling's search field.
LINK_SEARCH The user followed a link on some result page (including pagination links in longer result lists).
LINK_CATEGORY_SEARCH Wherever available, names are categorized according to the corresponding Wikipedia articles. The user clicked on such a category link to obtain all accordingly categorized names.
NAME_DETAILS The user requested detailed information for a name using the respective button.
ADD_FAVORITE The user added a name to their list of favorite names.

The full dataset contains interactions from Nameling's query logs, ranging from March 6th, 2012 to February 12th, 2013. It contains profile data for 60,922 users with 515,848 activities. This dataset was split into a public training dataset and a secret test dataset. For this purpose, a subset of users (in the following called test users) was selected for the evaluation. For each such test user, we withheld some of their most recent activities for testing according to the following rules:

– For each user, we selected the chronologically last two names for evaluation which had directly been entered into Nameling's search field (i. e., ENTER_SEARCH activity) and which are also contained in the list of known names. We thereby considered the respective time stamp of a name's first occurrence within the user's activities. We restricted the evaluation to ENTER_SEARCH activities, because all other user activities are biased towards the lists of names which were displayed by Nameling (see our corresponding analysis of the ranking performance in [2]).

– We considered only those names for evaluation which had not previously been added as a favorite name by the user.

– All remaining user activity after the (chronologically) first evaluation name has been discarded.

– We required at least three activities per user to remain in the data set.
– For previous publications, we already published part of Nameling's usage data. Only users not contained in this previously published data set have been selected as test users.

With the above procedure we obtained two data sets8: the secret evaluation data set containing for each test user the two left-out (i. e., ENTER_SEARCH) names, and the public challenge data set containing the remaining user activities of the test users and the full lists of activities from all other users. The only

8 Both datasets are available from the challenge's website: http://www.kde.cs.uni-kassel.de/ws/dc13/downloads/


applied preprocessing on our part was a conversion of all names to lower case. In addition to the public training dataset, participants were provided with the list of all users in the test dataset to be used as input for their algorithms. Furthermore, we published a list of all names known to Nameling, which thus included all names occurring in the training or the test data (roughly 36,000 names).
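As an illustration, the following sketch restates the splitting rules above as code for a single user, assuming the activities are available as (timestamp, activity type, name) tuples; the checks against the list of known names and against previously published data are omitted, so it is not the exact preprocessing pipeline.

def split_user_activities(activities, min_activities=3, n_holdout=2):
    """Select hold-out names for one user, following the rules sketched above.

    activities -- list of (timestamp, activity_type, name) tuples for one user
    Returns (training_activities, holdout_names) or None if the user does
    not qualify as a test user.
    """
    activities = sorted(activities)                       # chronological order
    favorites = {name for _, act, name in activities if act == "ADD_FAVORITE"}
    # first occurrence time of each name within the user's activities
    first_seen = {}
    for ts, _, name in activities:
        first_seen.setdefault(name, ts)
    # candidate evaluation names: entered directly, never added as favorite
    entered = [name for _, act, name in activities
               if act == "ENTER_SEARCH" and name not in favorites]
    # keep the chronologically last two distinct candidates (by first occurrence)
    distinct = sorted(set(entered), key=lambda n: first_seen[n])
    holdout = distinct[-n_holdout:]
    if len(holdout) < n_holdout:
        return None
    # discard everything after the first evaluation name
    cutoff = min(first_seen[name] for name in holdout)
    training = [a for a in activities if a[0] < cutoff]
    if len(training) < min_activities:
        return None
    return training, holdout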

3.3 Evaluation

Given the list of test users (see above), each participant produced a list of 1,000 recommended names9 for each such user. These lists were then used to evaluate the quality of the algorithm by comparing for each test user the 1,000 names to the two left-out names from the secret test dataset. As usual, it is assumed that good algorithms will rank the left-out names high, since they represent the actual measurable interests of the user at hand.

The chosen assessment metric to compare the lists of recommendations is mean average precision (MAP@1000). MAP means to compute for each test user the precision at exactly the ranking positions of the two left-out names. These precision values are then first averaged per test user and finally in total to yield the score for the recommender at hand. While MAP usually can handle arbitrarily long lists of recommendations, for the challenge we restricted it to MAP@1000, meaning that only the first 1,000 positions of a list are considered. If one or both of the left-out names were not among the top 1,000 names in the list, they were treated as if they were ranked at positions 1,001 and 1,002, respectively. More formally, the score assigned to a participant's handed-in list is

MAP@1000 := \frac{1}{|U|} \sum_{u=1}^{|U|} \left( \frac{1}{r_{1,u}} + \frac{2}{r_{2,u}} \right)

where U is the set of all test users, r_{1,u} and r_{2,u} are the ranks of the two left-out names for user u from the secret evaluation dataset, and r_{1,u} < r_{2,u}.

The choice of the evaluation measure is crucial in the comparison of recommender algorithms. It is well known that different measures often lead to different evaluation results, and the choice of the metric must therefore be motivated by the use case at hand. In the case of Nameling, we had already seen that recommending given names is a difficult task [3]. For many users, many recommenders did not produce recommendation rankings with the test names among the top positions. Thus, measures like precision@k – with k typically obtaining low values like 5 or 10 – make it hard to distinguish between results, especially for lower cut-off thresholds k. MAP (Mean Average Precision) is a measure that is suitable for (arbitrarily long) ordered lists of recommendations. Like NDCG (Normalized Discounted Cumulative Gain) or AUC (Area Under the Curve) it evaluates the recommendations based on the positions of the

9 1,000 names sounds like a large number, but given that parents currently read much longer and badly sorted name lists, the number is reasonable (details below).


left-out items within the list of recommendations. It yields good scores when test items appear on the top positions of the recommendations and lower scores if they are ranked further below (unlike precision@k, where lower-ranked items are cut off and thus do not contribute to the score).

The main difference to AUC and NDCG is how the ranks of the left-out names are incorporated into the score. While AUC yields a linear combination of the two ranks, MAP takes a linear combination of the reciprocal ranks and NDCG a linear combination of the (logarithmically) smoothed reciprocal ranks. Among these measures, MAP is the one that discriminates most strongly between higher and lower ranks and therefore was most suitable for the challenge.
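Schematically, and ignoring normalization constants, a single left-out name at rank r contributes to the three measures roughly as

\mathrm{AUC}: \;\propto\; N - r, \qquad \mathrm{MAP}: \;\propto\; \frac{1}{r}, \qquad \mathrm{NDCG}: \;\propto\; \frac{1}{\log_2(1 + r)},

where N denotes the length of the recommendation list; the reciprocal-rank term of MAP decays the fastest over r, which is the discrimination property referred to above.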

Although we have just argued against cut-off measures like precision@k, it is reasonable to cut off lists at some point. In contrast to many other areas where recommendations are used (e. g., friend, movie, book, or website recommenders), in Nameling the time needed to evaluate a recommendation is very short: if you like a name, just click on it. Additionally, the cost in terms of money or time spent for following a recommendation that turns out bad is very low. At the same time, finding the perfect name for a child is often a process of months rather than minutes (like finding the next book to read or deciding which movie to watch) or seconds (deciding which tag to use or which website to visit on the net). Thus it is reasonable to assume that parents-to-be are willing to scroll through lists of names longer than the usual top ten – especially considering that one of the traditional ways of searching for names is to go through first-name dictionaries where names are listed unpersonalized, in alphabetical order. In such books there are usually a lot more than 1,000 names that have to be read, and therefore it seems reasonable that readers of such books won't mind studying longer name lists on the web.

3.4 Summary of the Offline Challenge

Registered participants (or teams of participants) were given access to the public training dataset and the dataset containing the names of all test users. The offline challenge ran for about 17 weeks, beginning March 1st and ending July 1st, 2013. Every week, participants were invited to hand in lists of recommended names for the test users. These lists were evaluated (using the secret test dataset and the evaluation described above) to produce a weekly updated leaderboard. The leaderboard allowed the participants to check on the success of their methods and to compare themselves with the other participants. Since frequently updated results would constitute an opportunity for the participants to optimize their algorithms towards this feedback or even to reverse-engineer the correct test names, the leaderboard was not updated more often than once a week.

Participants and Winners More than 40 teams registered for the challenge, of which 17 handed in lists of recommended names. Of these 17, six teams submitted papers, which are summarized below in Section 3.5. Table 1 shows the final scores of the 17 teams and reveals the winners of the offline phase:


1st place goes to team uefs.br for their approach using features of names.
2nd place is won by team ibayer using a syntactically enriched tensor factorization model.
3rd place goes to team all your base for their algorithm exploiting a generalized form of association rules.

Table 1: The final results of the 17 teams in the offline challenge, showing all the participants' team names together with the achieved MAP@1000 score.

Pos. Team Name MAP@1000

1   uefs.br             0.0491
2   ibayer              0.0472
3   all your base       0.0423
4   Labic               0.0379
5   cadejo              0.0367
6   disc                0.0340
7   Context             0.0321
8   TomFu               0.0309
9   Cibal               0.0262
10  thalesfc            0.0253
11  Prefix              0.0203
12  Gut und Guenstig    0.0169
13  TeamUFCG            0.0156
14  PwrInfZC            0.0130
15  persona-non-data    0.0043
16  erick oliv          0.0021
17  Chanjo              0.0016

Figure 3 shows the scores of those six teams that handed in papers in time describing their approach. Only described approaches can be judged and presented at the workshop; therefore, all other results are not considered in the remaining discussion. Additionally, two baselines (NameRank from [3] and the simple most-popular recommender) are presented in Figure 3. The latter simply suggests to any user those names that have been queried most often in the past. It is thus unpersonalized and rather ad hoc. NameRank is a variant of the popular personalized PageRank algorithm [1]. From the baseline results we can already tell that the recommendation problem is indeed hard, as the scores are rather low (between 0.025 and 0.030). On the other hand, we can observe that the simple most-popular recommender is not that much worse than the much more sophisticated PageRank-like approach. The first approaches of almost all participants yielded scores lower than or comparable to those of the baselines. However, over the course of the challenge the scores improved significantly, and by the end of the challenge all teams had produced algorithms that outperformed both baselines.
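For reference, the unpersonalized most-popular baseline can be sketched in a few lines, assuming the training data is available as a list of (user, name) interaction pairs:

from collections import Counter

def most_popular_recommender(interactions, test_users, k=1000):
    """Unpersonalized baseline: recommend the globally most queried names.

    interactions -- list of (user, name) pairs from the training data
    test_users   -- users for whom a ranked list is required
    Every test user receives the same list of the k most frequently
    queried names, ordered by descending query count.
    """
    counts = Counter(name for _, name in interactions)
    top_k = [name for name, _ in counts.most_common(k)]
    return {user: list(top_k) for user in test_users}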

To compare the recommenders' performances in greater detail, Figure 4 shows the cumulative distribution of the different ranking positions (1, . . . , 1000) for the



Fig. 3: The scores of the six teams that handed in papers, plotted over the 17 weeks' runtime of the offline challenge. For comparison, two baselines have been added (the two constant lines): NameRank and Most Popular.

top three algorithms. For each recommender system and every ranking position k, displayed is the number of hold-out names – out of the 8,280 hold-out names from the 4,140 test users with two hold-out names each – that had a rank smaller than or equal to k on the list of recommended names. We can observe that the three curves are very close together, indicating that the distributions of ranks are similar. Comparing the results of the two top-placed algorithms (team uefs.br and team ibayer), we see that the former has predicted more of the hold-out names in its top 1,000 lists in total. However, the latter succeeded in placing more hold-out names among the top ranking positions (k ≤ 400). The distribution of the third algorithm (team all your base) is almost identical with that of team uefs.br over the first 300 ranks, but then falls behind.

3.5 Approaches

In the offline phase of the challenge, six teams documented their approaches in the papers that are included in the challenge's proceedings. In the following, the key idea of each approach is summarized. Using the respective team name of each paper's authors, their scores can be identified in Figure 3.

A mixed hybrid recommender system for given names
The paper by Rafael Glauber, Angelo Loula, and Joao B. Rocha-Junior (team uefs.br) presents a hybrid recommender which combines collaborative filtering, most-popular, and content-based recommendations. In particular, the latter contributes two interesting approaches (Soundex and splitting


Fig. 4: Cumulative distribution of the ranking positions for the top three recommendation systems described in Section 3.5.

of invalid names) that are fitted to the problem at hand. The combination of the three approaches as a concatenation is likely the reason for the success in the challenge, yielding the highest score on the test dataset and thus winning the offline phase.

Collaborative Filtering Ensemble for Personalized Name Recommendation
Bernat Coma-Puig, Ernesto Diaz-Aviles, and Wolfgang Nejdl (team cadejo) present an algorithm based on the weighted rank ensemble of various collaborative filtering recommender systems. In particular, the classic item-to-item collaborative filtering is considered in different, specifically adapted variants (considering only ENTER_SEARCH activities vs. all activities, frequency-biased vs. recency-biased name sampling), item- and user-based CF, as well as PageRank weights for filling up recommendation lists with fewer than 1,000 elements. The weights of the ensemble are determined in experiments and yield a recommender that outperforms each of its individual components.

Nameling Discovery Challenge – Collaborative Neighborhoods
The paper by Dirk Schafer and Robin Senge (team disc) describes an algorithm which combines user-based collaborative filtering with information about the geographical location of the users and their preference for male/female and long/short names. The paper further explores the use of two different similarity measures, namely Dunning and Jaccard, and finds that the rather exotic Dunning measure yields better recommendations than the Jaccard measure.

Improving the Recommendation of Given Names by Using Contextual Information


The paper by Marcos Aurelio Domingues, Ricardo Marcondes Marcacini, Solange Oliveira Rezende and Gustavo E. A. P. A. Batista (team Labic) presents two approaches to tackle the challenge of name recommendation: item-based collaborative filtering and association rules. In addition, the weight post-filtering approach is leveraged to weight these two baseline recommenders by contextual information about time and location. To this end, for each user-item pair the probability that the user accessed it in a certain context (i. e., time or location) is computed and used to weight the baseline results.

Similarity-weighted association rules for a name recommender system
The paper by Benjamin Letham (team all your base) considers association rules for recommendation. The key idea here is the introduction of an adjusted confidence value for association rules, capturing the idea of inducing a bias towards observations which stem from like-minded users (with respect to the querying user). This generalized definition of confidence is additionally combined with a previous approach of the author [4] which accounts for association rules with low support values by adding in a Beta prior distribution. This recommender system achieved the third place in the challenge's offline task.
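The precise formulas are given in the team's paper and in [4]; the sketch below only illustrates the general shape of such a similarity-weighted, smoothed confidence: support counts are weighted by a similarity between each training user and the querying user, and a smoothing constant in the denominator damps rules with low support. The similarity function and the constant k are placeholders, not the author's actual choices.

def adjusted_confidence(antecedent, candidate, user_histories, query_history,
                        similarity, k=5.0):
    """Illustrative similarity-weighted, smoothed confidence of the rule
    antecedent -> candidate.

    antecedent     -- set of names forming the rule's left-hand side
    candidate      -- name to be recommended (right-hand side)
    user_histories -- dict: user -> set of names that user interacted with
    query_history  -- set of names of the querying user
    similarity     -- function(set, set) -> weight of a training user
    k              -- smoothing constant damping low-support rules
    """
    support_a, support_ab = 0.0, 0.0
    for history in user_histories.values():
        weight = similarity(history, query_history)
        if antecedent <= history:              # user supports the antecedent
            support_a += weight
            if candidate in history:
                support_ab += weight
    return support_ab / (support_a + k)

def jaccard(a, b):
    """Example user similarity (an assumption here, not the paper's choice)."""
    return len(a & b) / len(a | b) if a or b else 0.0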

Factor Models for Recommending Given Names
The paper by Immanuel Bayer and Steffen Rendle (team ibayer) presents an approach using a sequential factor model that is enriched with syntactical name similarity – a prefix equality, called "prefix smoothing". The model is trained with a slight adaptation of the standard Bayesian Personalized Ranking algorithm, while the final recommendation is obtained by averaging the rankings of different, independently trained models. This recommender system achieved the second place in the challenge's offline task.

4 Online Challenge

Conducting offline experiments is usually the first step to estimate the performance of different recommender systems. However, in this way recommender systems are trained to predict those items that users found interesting without being assisted by a recommender system in the first place. In an online evaluation, different recommenders are implemented in the running productive system. Their performance is compared in a test where different users are assigned to different recommenders and the impact of the recommendations is estimated by analyzing the users' responses to the recommendations. As noted in [5], online experiments provide "the strongest evidence as to the true value of the system [. . . ]", but they also have an influence on the users, as the recommendations are displayed before users decide where to navigate (click) next. In the challenge, the online phase gave the participants the opportunity to test the algorithms they had created during the offline phase in Nameling. In the following we describe the setup that allowed the teams to integrate their algorithms with Nameling before we discuss this phase's results and winners.


(a) interactive recommendation (b) sidebar recommendation

Fig. 5: Implemented recommendation use cases in Nameling: A user interactively queries for suitable name recommendations (a) or gets recommended names displayed in the sidebar of some regular Nameling page (b).

4.1 Recommendations in Nameling

Recommendations in Nameling were introduced with the beginning of the challenge's online phase. A personalized list of recommended names was added to every page of the system. This list automatically adapts to the user's activities (e. g., clicks or the marking of favorites). Users may use this list to further search for names by clicking on one, add a name to their favorite names, or ban a name which they do not want to be recommended again. Additionally, users can visit an interactive recommendation page in Nameling, where they can enter names and get personalized recommendations related to those names. The latter functionality is very similar to the usual results Nameling shows, the difference being that regular search results are non-personalized. Figure 5 shows how recommendations are displayed in Nameling's user interface.

To integrate their algorithms into Nameling, the participants had to implement a simple interface. The search engine itself provides a framework for the easy integration of recommender systems based on lightweight REST (HTTP + XML / JSON) interaction. Participants could choose to implement their recommender in Java or Python and either to run their recommender as a web service on their own or to provide a JAR file to be deployed by the challenge organizers.
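The concrete REST interface (URL layout, parameter names, response schema) is defined by the challenge framework and not reproduced in this summary; the sketch below only shows the general shape of such a service using Python's standard library – an endpoint that receives a user ID and answers with a JSON list of recommended names. All paths and field names are placeholders.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Placeholder recommendation logic; a real participant would plug in the
# model trained during the offline phase here.
def recommend(user_id, k=20):
    return ["anna", "oskar", "egon"][:k]

class RecommenderHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        user_id = query.get("user", ["unknown"])[0]
        body = json.dumps({"user": user_id,
                           "names": recommend(user_id)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Answers must arrive within the 500 ms budget, so recommend() should be
    # fast (e.g., precomputed lists looked up by user ID).
    HTTPServer(("0.0.0.0", 8080), RecommenderHandler).serve_forever()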

The recommender framework that integrates the different third-party recommender systems into Nameling is sketched in Figure 6. When a user of Nameling sends a request, a call for recommendations including the current user's ID is sent to each participating recommender. Recommendations are collected from each such recommender within a time frame of 500 ms, i. e., recommendations produced after that time are discarded. Using an equally distributed random


Fig. 6: Schematic representation of Nameling’s online recommender framework.

function, one of the recommendation responses is selected and the recommended names are displayed to the current user in the response to their request. Once a recommender has been assigned to a user, this assignment is fixed for the duration of the current session unless the user specifically requests new recommendations.

4.2 Evaluation

Assessing the success of recommendations in a live system requires some quantity that can be measured and that represents the value of the recommendations to the user or to the system. Often-used measures include a rise in revenue (e. g., for product recommendations), click counts on the recommended items, or comparisons to ratings that users assigned to recommended items. In the case of name recommendations, no particular revenue is created for the system, as there are no products to be sold. Thus, to evaluate different recommenders we focused on the interest that users showed in the recommended names. For the challenge we estimated this interest by the combined number of requests users made in response to the recommendations. More precisely, we counted all interactions that could be made on the recommender user interface (i. e., LINK_SEARCH, LINK_CATEGORY_SEARCH, and ADD_FAVORITE events). Here, we excluded the previously mentioned option to ban names, as its interpretation is unclear. On the one hand, banning a name is certainly a negative response to that particular recommendation. On the other hand, since users are not bound to react to the recommendations at all, it is a clear sign of interest in the recommendations and could well be interpreted as a deselection of one uninteresting name from a set of otherwise interesting names. Since the recommenders were assigned to different users using an equally distributed random function, the final measure


DE 5193, AU 1759, US 681, AT 161, GB 110, CH 104, FR 81, CA 77, EU 69, RU 43, UA 32, NL 30, NZ 25, SE 24, BR 22, Others 319

Fig. 7: The countries (according to IP address) of Nameling’s visitors during theonline phase of the challenge.

was simply the sum of all considered requests in one of the three mentioned categories.

4.3 Summary of the Online Challenge

The online phase ran from August 1st to September 24th, 2013. During that time, more than 8,000 users visited Nameling, engaging in more than 200,000 activities there. Figure 7 shows the distribution of the users over their countries (as is to be expected, the home country is an important influence on the choice of names). While most of the requests came from Germany, followed by Austria, the largest number of visitors from an English-speaking country came from the US.

Participants and Winners Of the teams that contributed to the challenge proceedings, five entered their algorithms in the online challenge. Of those five teams, four managed to produce recommendations within the time window of 500 ms: "all your base", "Context", "ibayer", and "uefs.br". Figure 8 shows for each of these four teams the number of responses – in terms of clicks to one of three categories of links related to the recommended names (see Section 4.2) – to their recommendations. The clear winner of the online phase is team "ibayer": Immanuel Bayer and Steffen Rendle (Paper: Factor Models for Recommending Given Names). Ranks two and three go to teams "all your base" and "uefs.br", respectively. Compared to the offline challenge, we find that the top three teams of the offline phase were the same that constituted the top three of the online


Fig. 8: The number of responses per team in the online challenge, broken down by interaction type (PICK_NAME, LINK_CATEGORY_SEARCH, LINK_SEARCH).

phase. It is, however, interesting to note that the order of the teams changed – team "uefs.br" fell from rank one to rank three. It is also worth noting that although team "Context" yielded a MAP@1000 score of only 0.0321 in the offline challenge, compared to team "uefs.br" with 0.0491, both teams were almost equally successful in the online phase. It thus seems that the offline testing has indeed been a reasonable precursor, yet also that the offline scenario does not fully capture the actual use case.

5 Conclusion

The 15th Discovery Challenge of the ECML PKDD posed the task of recommending given names to users of a name search engine. In the two parts of the challenge, the offline and the online phase, several teams of scientists implemented and augmented recommendation algorithms to tackle that problem. In their approaches, participants mainly chose to use well-established techniques like collaborative filtering, tensor factorization, popularity measures, or association rules and hybridizations thereof. Participants adapted such algorithms to the particular domain of given names, exploiting name features like gender, a name's prefix, a name's string length, or phonetic similarity. In the offline challenge, six teams entered their approaches and, by the end of the phase, each team had produced a new algorithm outperforming the previously most successful recommender NameRank. The achieved scores of the individual recommenders were still rather low (compared to other domains where recommenders are applied). This shows that there is still much to be explored to better understand and predict the attitude of users towards different names. Through the challenge, a multitude of


ideas and approaches has been proposed, and a straightforward next step will be to explore their value in hybrid recommender algorithms. Hybridization has already been used by several participants with great success.

The online challenge opened the productively running name search engine Nameling to the scientific community, offering the possibility to implement and test name recommendation algorithms in a live system. Results showed that the actual performance varied from that measured in the offline challenge. However, it could also be observed that despite the low scores in the offline phase, the recommendations were perceived by users and were able to attract their attention.

As organizers, we would like to thank all participants for their valuable contributions and ideas.

References

1. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.

2. F. Mitzlaff and G. Stumme. Relatedness of given names. Human Journal, 1(4):205–217, 2012.

3. F. Mitzlaff and G. Stumme. Recommending given names. arXiv:1302.4412, 2013. Baseline results for the ECML PKDD Discovery Challenge 2013.

4. C. Rudin, B. Letham, A. Salleb-Aouissi, E. Kogan, and D. Madigan. Sequential event prediction with association rules. In COLT 2011 – 24th Annual Conference on Learning Theory, 2011.

5. G. Shani and A. Gunawardana. Evaluating recommendation systems. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, pages 257–297. Springer US, 2011.