Enlister: Baidu's Recommender System for the Biggest Chinese Q&A Website

JasonFu

2012-12-10, Monday

We are leaving the age of information and entering the age of recommendation.

Chris Anderson, author of The Long Tail

ABOUT THE PAPER

BACKGROUND

Baidu Knows offers its users a Q&A platform for sharing knowledge and experience.

The Baidu Knows website today:

Over 400 million questions have been posted

More than 170 million questions have been answered by users

Around 100 million users search for answers every day

More than 12 new questions are posted every second

Baidu Knows is the biggest Q&A website in China.

An ecosystem of knowledge sharing among the website's users.

As more questions are answered, the quality of search results for ordinary users improves.

Answering is contribution, not consumption.

APPLICATION DESCRIPTION

Baidu Knows built an intelligent recommender system (RS), Enlister, to show Baidu Knows users the questions they may be willing to answer.

It is based on machine learning technology.

It applies machine-learning-based click-through rate (CTR) prediction to improve recommendation accuracy.

Enlister: content-based, CTR prediction, stream computing

Previous Enlister: typical content-based, cosine similarity
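
As a rough illustration of the typical content-based approach, the sketch below ranks questions by the cosine similarity between a user's term vector and each question's term vector. The sparse dictionary representation, the function names, and the toy data are assumptions made for this example, not details of the previous Enlister.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_questions(user_vector, questions):
    """Return question ids sorted by descending similarity to the user."""
    scored = [(cosine_similarity(user_vector, vec), qid)
              for qid, vec in questions.items()]
    # Sort by score (high to low), breaking ties by question id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [qid for _, qid in scored]

# Toy data (hypothetical).
user = {"python": 2.0, "list": 1.0}
questions = {
    "q1": {"python": 1.0, "list": 1.0},
    "q2": {"cooking": 1.0, "rice": 1.0},
    "q3": {"python": 1.0},
}
ranking = rank_questions(user, questions)
```

A pure similarity ranking like this uses no click feedback, which is the gap that CTR prediction is meant to close.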

ALGORITHM

We expect the user models to reveal what drives our users' choices of which questions to answer.

The models have to be simple enough to support large-scale computation and industrial deployment.

1. User Model

2. Click Prediction

3. Diversity Adjustment

1. User Model

The user model has two parts:

Attributes: age, gender, education, tags, …

Interest: interest term vector, related questions, abstract interest vector

Interest Term Vector

A vector of term weights, each indicating the correlation between the user and that term at the semantic level.

Related Questions

Questions that the user has browsed or answered.

Abstract Interest Vector

A vector of weights, each indicating the correlation between the user and an abstract concept.

The concepts come from a topic model obtained by applying PLSA (probabilistic latent semantic analysis) to millions of question-answer pairs on the Baidu Knows website.
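
The sketch below shows one way the two interest vectors could be derived from a user's related questions. It rests on several assumptions: answered questions count twice as much as browsed ones, weights are normalized to sum to 1, and a small hand-written `topic_given_term` table stands in for a trained PLSA model. None of these details come from the slides.

```python
from collections import Counter

def interest_term_vector(related_questions, answered_weight=2.0, browsed_weight=1.0):
    """Build a normalized term-weight vector from (terms, action) pairs.

    The action weights are assumptions chosen for illustration.
    """
    weights = Counter()
    for terms, action in related_questions:
        w = answered_weight if action == "answered" else browsed_weight
        for term in terms:
            weights[term] += w
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()} if total else {}

def abstract_interest_vector(term_vector, topic_given_term):
    """Project term weights onto topics using a P(topic | term) table."""
    topics = Counter()
    for term, weight in term_vector.items():
        for topic, p in topic_given_term.get(term, {}).items():
            topics[topic] += weight * p
    return dict(topics)

# Hypothetical browsing and answering history.
history = [
    (["python", "list"], "answered"),
    (["python", "loop"], "browsed"),
]
terms = interest_term_vector(history)

# Hypothetical stand-in for the output of a trained PLSA topic model.
topic_table = {
    "python": {"programming": 1.0},
    "list": {"programming": 0.5, "shopping": 0.5},
    "loop": {"programming": 1.0},
}
topics = abstract_interest_vector(terms, topic_table)
```

The abstract vector lets an ambiguous term such as "list" contribute to more than one concept, which a plain term vector cannot express.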

2. Click Prediction

The click model we built is a probabilistic classification model: a binary classifier that calculates the probability that a sample belongs to a class.

Sample Collection

Positive samples: questions that the user examined and clicked.

Negative samples: questions chosen at random from the question pool.

Feature Selection

User attributes: (1) basic user attributes; (2) other features derived from statistics.

Correlation degree: the number of matched terms, the cosine similarity, and the BM25 score between the user's interest term vector and the question vector.

Classification algorithm: two principles, (1) probabilistic classification and (2) a linear classifier.
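
The three correlation features can be sketched as follows. This is a minimal illustration under stated assumptions: the user's interest terms play the role of the BM25 query, the question's term list plays the role of the document, `k1=1.2` and `b=0.75` are common default values, and the document frequencies are made up.

```python
import math

def matched_terms(user_vector, question_terms):
    """Number of distinct terms shared by the user and the question."""
    return len(set(user_vector) & set(question_terms))

def cosine(user_vector, question_terms):
    """Cosine similarity between user weights and question term counts."""
    counts = {}
    for term in question_terms:
        counts[term] = counts.get(term, 0) + 1
    dot = sum(w * counts.get(t, 0) for t, w in user_vector.items())
    norm_u = math.sqrt(sum(w * w for w in user_vector.values()))
    norm_q = math.sqrt(sum(c * c for c in counts.values()))
    return dot / (norm_u * norm_q) if norm_u and norm_q else 0.0

def bm25(user_vector, question_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """BM25 score of the question, using the user's terms as the query."""
    length_norm = k1 * (1 - b + b * len(question_terms) / avg_len)
    score = 0.0
    for term in user_vector:
        tf = question_terms.count(term)
        if tf == 0:
            continue
        n_t = doc_freq.get(term, 0)
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score

def correlation_features(user_vector, question_terms, doc_freq, num_docs, avg_len):
    """Feature list: [matched term count, cosine similarity, BM25 score]."""
    return [
        matched_terms(user_vector, question_terms),
        cosine(user_vector, question_terms),
        bm25(user_vector, question_terms, doc_freq, num_docs, avg_len),
    ]

# Hypothetical inputs.
user = {"python": 0.5, "list": 0.5}
question = ["python", "list", "sort"]
doc_freq = {"python": 10, "list": 20, "sort": 5}
features = correlation_features(user, question, doc_freq, num_docs=100, avg_len=3)
unrelated = correlation_features(user, ["rice", "cooking"], doc_freq, num_docs=100, avg_len=3)
```

These values would be concatenated with the user attribute features to form the input of the linear classifier.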

Classification Algorithm

Maximum entropy classifier

The probability P that a user u will click the question q can be calculated as follows:
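
A maximum entropy model is commonly written in the form below. The notation is an assumption for illustration rather than the slide's own: $f_i$ are feature functions of the user $u$, the question $q$, and the class $c$ (click or no click), and $\lambda_i$ are the learned weights.

```latex
P(c \mid u, q) =
  \frac{\exp\left(\sum_i \lambda_i f_i(u, q, c)\right)}
       {\sum_{c'} \exp\left(\sum_i \lambda_i f_i(u, q, c')\right)}
```

With two classes, and with the features set to zero for the no-click class, this reduces to the logistic form:

```latex
P(\text{click} \mid u, q) =
  \frac{1}{1 + \exp\left(-\sum_i \lambda_i f_i(u, q)\right)}
```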

Global optimization methods: limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS); stochastic gradient descent (SGD)
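
To make the training step concrete, here is a minimal sketch of fitting a binary maximum entropy model (equivalent to logistic regression) with SGD. The feature layout (a bias term plus one similarity feature), the learning rate, the epoch count, and the toy samples are all assumptions; L-BFGS would minimize the same log-loss objective using full-batch gradients and is not shown.

```python
import math
import random

def predict(weights, features):
    """Probability of the positive (click) class for one sample."""
    score = sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def train_sgd(samples, num_features, epochs=200, learning_rate=0.1, l2=0.0, seed=0):
    """Fit weights by stochastic gradient descent on the log-loss."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, label in data:
            error = predict(weights, features) - label
            for i, x in enumerate(features):
                weights[i] -= learning_rate * (error * x + l2 * weights[i])
    return weights

# Toy samples (hypothetical): [bias, similarity feature] -> clicked or not.
samples = [
    ([1.0, 0.9], 1), ([1.0, 0.8], 1), ([1.0, 0.7], 1),
    ([1.0, 0.2], 0), ([1.0, 0.1], 0), ([1.0, 0.3], 0),
]
weights = train_sgd(samples, num_features=2)
high = predict(weights, [1.0, 0.85])
low = predict(weights, [1.0, 0.15])
```

Because the model is linear, scoring a candidate question at serving time is a single dot product followed by a sigmoid.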

3. Diversity Adjustment

The head and the tail of a recommendation list attract the most attention from users.

For the head, we apply a loose filtering algorithm, which deletes only obvious duplicates in the list.

For the tail, we apply a strict filtering algorithm, which removes any questions that have noticeable semantic similarity to each other in the list.
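
The head and tail filtering described above can be sketched as a single pass over the ranked list. Everything specific here is an assumption: Jaccard overlap of term sets stands in for semantic similarity, the head and tail are three items each, the thresholds are 0.9 (loose) and 0.5 (strict), and the middle of the list is left unfiltered.

```python
def jaccard(a, b):
    """Jaccard overlap of two term lists, used as a stand-in for semantic similarity."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def diversify(ranked, head_size=3, tail_size=3, loose=0.9, strict=0.5):
    """Filter a ranked list of (question_id, terms) pairs.

    Head positions drop only near-exact duplicates; tail positions drop
    anything noticeably similar to a question that was already kept.
    """
    kept = []
    n = len(ranked)
    for position, (qid, terms) in enumerate(ranked):
        if position < head_size:
            threshold = loose
        elif position >= n - tail_size:
            threshold = strict
        else:
            threshold = None  # middle of the list: no filtering in this sketch
        if threshold is not None and any(
            jaccard(terms, other) >= threshold for _, other in kept
        ):
            continue
        kept.append((qid, terms))
    return [qid for qid, _ in kept]

# Hypothetical ranked list.
ranked = [
    ("q1", ["python", "list", "sort"]),
    ("q2", ["python", "list", "sort"]),             # exact duplicate of q1
    ("q3", ["python", "list", "reverse"]),          # similar, but not a duplicate
    ("q4", ["java", "thread"]),
    ("q5", ["java", "thread", "pool"]),             # tail: too close to q4
    ("q6", ["rice", "cooking"]),
    ("q7", ["python", "list", "sort", "reverse"]),  # tail: too close to q1
]
result = diversify(ranked)
```

In this example the head keeps q3 even though it overlaps with q1, while the tail drops q5 and q7 for a comparable degree of overlap.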

SYSTEM SETUP

The most important concept in the Enlister system design is real-time CTR prediction. The main data processing flow can be described as follows:

To build the data processing flow, we construct multiple logical queues between processing nodes.

The processing nodes are grouped into several node groups. Each group represents a simple logical section.
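
A single-machine analogy of this layout is sketched below: each node group is a set of worker threads reading from one queue and writing to the next. The two stages (`extract_terms` and `score`), the group sizes, and the shutdown protocol are hypothetical; the slides do not describe how the real nodes or queues are implemented.

```python
import queue
import threading

STOP = object()  # sentinel telling a node to shut down

def run_node(process, inbox, outbox):
    """One processing node: read from inbox, process, write to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            break
        outbox.put(process(item))

def start_group(process, inbox, outbox, size):
    """A node group: several nodes running the same logic on a shared queue."""
    threads = [
        threading.Thread(target=run_node, args=(process, inbox, outbox))
        for _ in range(size)
    ]
    for t in threads:
        t.start()
    return threads

def stop_group(threads, inbox):
    """Send one sentinel per node, then wait for the group to drain."""
    for _ in threads:
        inbox.put(STOP)
    for t in threads:
        t.join()

# Hypothetical stage logic.
def extract_terms(question):
    return {"id": question["id"], "terms": question["text"].lower().split()}

def score(item):
    return (item["id"], len(item["terms"]))  # placeholder for CTR prediction

raw, parsed, scored = queue.Queue(), queue.Queue(), queue.Queue()
parsers = start_group(extract_terms, raw, parsed, size=2)
scorers = start_group(score, parsed, scored, size=2)

for i, text in enumerate(["How to sort a list", "What is CTR"]):
    raw.put({"id": i, "text": text})

stop_group(parsers, raw)     # finish the first stage before stopping the second
stop_group(scorers, parsed)
results = sorted(scored.get() for _ in range(scored.qsize()))
```

Decoupling stages through queues lets each group be scaled independently, which is one plausible reason for grouping nodes by logical section.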

[Figure panels: Before login, After login]

EXPERIMENT & EVALUATION

1. Evaluation Metrics

2. Experiment

3. Online Evaluation

1. Evaluation Metrics

Confusion Matrix

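Precision: tp/(tp+fp), the number of correctly identified positive opinions divided by the number of all items identified as positive

Recall: tp/(tp+fn), the number of correctly identified positive opinions divided by the number of all truly positive opinions in the sample

Accuracy: (tp + tn)/(tp + fn + fp + tn), the number of correctly identified opinions divided by the total number of opinions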
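
The three metrics follow directly from the confusion matrix counts. The short sketch below computes them for a made-up set of labels and predictions; the function names and the data are illustrative only.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, predictions):
        if p == 1 and y == 1:
            tp += 1
        elif p == 1 and y == 0:
            fp += 1
        elif p == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute the three metrics, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy

# Hypothetical labels (1 = clicked) and model predictions.
labels      = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(labels, predictions)
precision, recall, accuracy = precision_recall_accuracy(*counts)
```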

2. Experiment

Sample Selection

Positive samples: 100,000 questions that users had viewed and clicked, selected from user logs, covering 10,000 users with 10 records per user on average.

Negative samples: selected at random.
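
A training set of this kind could be assembled roughly as follows. This is a sketch under assumptions: click logs are `(user_id, question_id)` pairs, the positive-to-negative proportion is controlled by a hypothetical `negatives_per_positive` parameter, and questions the user actually clicked are excluded from that user's negatives, which the slides do not state.

```python
import random

def build_samples(clicked, question_pool, negatives_per_positive=1, seed=0):
    """Return (user_id, question_id, label) triples.

    clicked: list of (user_id, question_id) pairs taken from click logs.
    Negatives are drawn at random from the pool for each positive sample.
    """
    rng = random.Random(seed)
    clicked_set = set(clicked)
    pool = list(question_pool)
    samples = [(user_id, qid, 1) for user_id, qid in clicked]
    for user_id, _ in clicked:
        # Assumption: never label a question the user clicked as negative.
        candidates = [q for q in pool if (user_id, q) not in clicked_set]
        if not candidates:
            continue
        for _ in range(negatives_per_positive):
            samples.append((user_id, rng.choice(candidates), 0))
    return samples

# Hypothetical logs and question pool.
clicked = [("u1", "q1"), ("u1", "q2"), ("u2", "q3")]
pool = ["q1", "q2", "q3", "q4", "q5", "q6"]
samples = build_samples(clicked, pool, negatives_per_positive=2)
positives = [s for s in samples if s[2] == 1]
negatives = [s for s in samples if s[2] == 0]
```

Varying `negatives_per_positive` is one simple way to test different sample proportions.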

Sample Proportion

Optimization Algorithm

L-BFGS is chosen as the optimization algorithm for training the maximum entropy model.

3. Online Evaluation

Enlister was released to Baidu Knows users, and an online evaluation was conducted starting February 11, 2012.

CONCLUSION

① We have successfully built a real-time RS that serves millions of users every day.

② The algorithm and system design fit the recommendation scenario well.

③ Great improvements have been made in accuracy and time sensitivity.

④ The number of active users grew substantially after the system was officially launched.

Future work: the timing of recommendations and the use of relationships between users.

Thank You!