Baidu's Recommender System for
the Biggest Chinese Q&A Website
Enlister
JasonFu | [email protected]
2012-12-10, Monday
We are leaving the age of information and entering the age of recommendation.
Chris Anderson, author of The Long Tail
1/36
ABOUT THE PAPER
2/36
BACKGROUND
3/36
BACKGROUND
4/36
BACKGROUND
5/36
BACKGROUND
Baidu Knows offers a Q&A platform to its users for knowledge
and experience sharing.
The Baidu Knows website today:
There are over 400 million questions
More than 170 million questions were answered by users
Around 100 million users search for answers every day
Over 12 new questions are posted online every second
Baidu Knows is the biggest Q&A website in China.
6/36
BACKGROUND
An ecosystem of knowledge sharing among the website's users.
As more questions are answered, the quality of the search results
that ordinary users see improves.
“Answer” is contribution, not consumption.
7/36
APPLICATION DESCRIPTION
Baidu Knows built an intelligent recommender system (RS), Enlister, to offer
Baidu Knows users questions that they may be willing to answer.
It is based on machine learning.
It applies a machine-learning-based CTR prediction method to
improve recommendation accuracy.
8/36
APPLICATION DESCRIPTION
Content-based
CTR prediction
Stream computing
Previous Enlister
Typical content-based
Cosine similarity
9/36
ALGORITHM
We expect the user models to capture why users choose to answer
particular questions.
The models have to be simple enough to support massive-scale
computation and industrial adoption.
10/36
ALGORITHM
1. User Model
2. Click Prediction
3. Diversity Adjustment
11/36
1. User Model
User model
  Attributes
    Age, Gender, Education
    Tags
    …
  Interest
    Interest Term Vector
    Related Questions
    Abstract Interest Vector
12/36
1. User Model
Interest Term Vector
A vector of term weights that indicate the correlation between the
user and each term at the semantic level.
Related Questions
Questions that the user has browsed or answered.
Abstract Interest Vector
A vector of term weights that indicate the correlation between
a user and an abstract concept.
PLSA is used to learn a conceptual topic model from millions of
question-answer pairs on the Baidu Knows website.
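As an illustration of the interest side of the user model, the sketch below builds an interest term vector from a user's related questions. It is a minimal sketch, not the Enlister implementation: the `UserModel` class, the `build_interest_term_vector` helper, and the choice of L2-normalized term counts as weights are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass, field
import math


@dataclass
class UserModel:
    # Field names follow the slide; the concrete types are assumptions.
    attributes: dict = field(default_factory=dict)         # age, gender, education, tags
    related_questions: list = field(default_factory=list)  # tokenized questions browsed or answered
    interest_terms: dict = field(default_factory=dict)     # term -> weight


def build_interest_term_vector(questions):
    """Turn tokenized questions into an L2-normalized term-weight vector."""
    counts = Counter(term for question in questions for term in question)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


user = UserModel(
    attributes={"education": "bachelor"},
    related_questions=[["python", "list", "sort"], ["python", "dict"]],
)
user.interest_terms = build_interest_term_vector(user.related_questions)
```

Terms that recur across the user's questions ("python" here) end up with the largest weights.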
13/36
ALGORITHM
1. User Model
2. Click Prediction
3. Diversity Adjustment
14/36
2. Click Prediction
The click model that we created is a probabilistic binary
classifier: it estimates the probability that a sample belongs
to the positive class, that is, that the user clicks the question.
15/36
2. Click Prediction
Sample Collection
Positive samples: questions that the user examined and clicked.
Negative samples: questions chosen at random from the question pool.
Feature Selection
User Attributes: (1) basic user attributes (2) other features derived from statistics
Correlation Degree: the number of matched terms, cosine similarity, and BM25 score
between the user interest term vectors and question vectors
Classification Algorithm: two principles (1) probabilistic classification (2) linear
classifier
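The three correlation features can be sketched as follows. This is a hedged illustration: the sparse dict representation, the function names, the BM25 parameters `k1` and `b`, and the document-frequency inputs are assumptions, and the slides do not say which BM25 variant Enlister uses.

```python
import math


def matched_terms(user_vec, question_vec):
    """Number of terms that appear in both vectors."""
    return len(set(user_vec) & set(question_vec))


def cosine_similarity(user_vec, question_vec):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * question_vec[t] for t, w in user_vec.items() if t in question_vec)
    norm_u = math.sqrt(sum(w * w for w in user_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in question_vec.values()))
    if norm_u == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_u * norm_q)


def bm25(user_vec, question_tf, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """BM25 score of a question, treating the user's interest terms as the query."""
    length = sum(question_tf.values())
    score = 0.0
    for term in user_vec:
        tf = question_tf.get(term, 0)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
    return score


# Hypothetical user and question.
user_vec = {"python": 0.8, "sort": 0.6}
question_tf = {"python": 1, "list": 1, "sort": 1}
features = [
    matched_terms(user_vec, question_tf),
    cosine_similarity(user_vec, question_tf),
    bm25(user_vec, question_tf, {"python": 50, "sort": 10}, n_docs=1000, avg_len=3.0),
]
```

The resulting `features` list is the kind of input a linear click model could consume alongside the user attributes.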
16/36
2. Click Prediction
Classification Algorithm
maximum entropy classifier
The probability P that a user u will click a question q can be calculated as
follows:
Global optimization solvers: limited-memory Broyden-Fletcher-Goldfarb-Shanno
(L-BFGS); Stochastic Gradient Descent (SGD)
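For two classes, a maximum entropy classifier has the same form as logistic regression: P(click | u, q) = 1 / (1 + exp(-w · f(u, q))), where f(u, q) is the feature vector and w the learned weights. The sketch below shows that standard form trained with plain SGD. The symbols w and f, the learning rate, the epoch count, and the toy data are assumptions for illustration, not the slide's own notation or settings.

```python
import math
import random


def predict_ctr(weights, features):
    """P(click | u, q) under a binary maximum entropy (logistic) model."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(samples, n_features, lr=0.1, epochs=200, seed=0):
    """Fit the weights by stochastic gradient descent on the log loss."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for features, label in samples:
            error = predict_ctr(weights, features) - label
            for i, x in enumerate(features):
                weights[i] -= lr * error * x
    return weights


# Toy data: features are [bias, cosine similarity]; label 1 means clicked.
toy = [([1.0, 0.9], 1), ([1.0, 0.8], 1), ([1.0, 0.2], 0), ([1.0, 0.1], 0)]
w = train_sgd(toy, n_features=2)
```

A linear model of this kind keeps scoring cheap: predicting one user-question pair costs a single dot product.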
17/36
ALGORITHM
1. User Model
2. Click Prediction
3. Diversity Adjustment
18/36
3. Diversity Adjustment
The head and the tail of a list attract the most attention from
users.
For the head, we apply a loose filtering algorithm, which deletes only
obvious duplicates in the list.
For the tail, we use a strict filtering algorithm that removes any
questions with noticeable semantic similarity to each other in
the list.
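The two filtering levels can be sketched as one pass over the ranked list with a position-dependent threshold. This is only an illustration: Jaccard overlap of terms stands in for the semantic similarity measure, and `head_size` and the two thresholds are hypothetical values; everything after the head is treated as the tail here.

```python
def jaccard(a, b):
    """Jaccard similarity between two term collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def diversify(ranked, head_size=3, loose=0.9, strict=0.5):
    """Filter a ranked list of tokenized questions.

    Head slots drop only near-duplicates (similarity >= loose);
    later slots drop anything with similarity >= strict to a kept question.
    """
    kept = []
    for question in ranked:
        threshold = loose if len(kept) < head_size else strict
        if all(jaccard(question, other) < threshold for other in kept):
            kept.append(question)
    return kept


ranked = [
    ["python", "sort", "list"],
    ["python", "sort", "list"],    # exact duplicate of the first item
    ["python", "sort", "dict"],    # similar, but tolerated in the head
    ["java", "string", "split"],
    ["java", "string", "join"],    # too similar for the strict tail filter
    ["travel", "visa"],
]
result = diversify(ranked)
```

With these inputs the exact duplicate is removed from the head, while the second Java question is removed only because it falls in the strictly filtered part of the list.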
19/36
SYSTEM SETUP
The most important concept in the Enlister system design is real-time CTR
prediction. The main data processing flow is as follows:
20/36
SYSTEM SETUP
To build the data processing flow, we construct multiple logical queues between
processing nodes.
The processing nodes are grouped into several node groups. Each group
represents a simple logical section.
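The idea of queues between processing nodes can be sketched in a single process with threads and in-memory queues. This is a toy model under stated assumptions: the real system is distributed, and the `run_node` and `run_pipeline` helpers and the two example stages are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells a node to shut down


def run_node(work, inbox, outbox):
    """Consume items from inbox, apply work, and push results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            break
        outbox.put(work(item))


def run_pipeline(items, stages):
    """Chain stages with one queue between each pair of neighbouring nodes."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_node, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)
    results = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stages: tokenize a new question, then attach a dummy score.
stages = [str.split, lambda terms: (terms, len(terms))]
scored = run_pipeline(["how to sort a list", "visa for japan"], stages)
```

Because each node only reads from its inbox and writes to its outbox, a stage can be replaced or scaled without touching its neighbours.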
21/36
22/36
Figure: Before login / After login
23/36
24/36
25/36
EXPERIMENT & EVALUATION
1. Evaluation Metrics
2. Experiment
3. Online Evaluation
26/36
1. Evaluation Metrics
Confusion Matrix
Precision: tp/(tp+fp), correctly identified positives / all items identified as positive
Recall: tp/(tp+fn), correctly identified positives / all actual positives in the sample
Accuracy: (tp + tn)/(tp + fn + fp + tn), correctly identified items / all items
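The three metrics follow directly from the confusion-matrix counts. The helper below is a small sketch; the function name and the example counts are made up for illustration.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
precision, recall, accuracy = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
```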
27/36
EXPERIMENT & EVALUATION
1. Evaluation Metrics
2. Experiment
3. Online Evaluation
28/36
2. Experiment
Sample Selection
Positive samples: 100,000 questions that users had viewed and clicked, selected
from the user logs. They cover 10,000 users, with 10
records per user on average.
Negative samples: questions chosen at random.
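Assembling such a training set could look like the sketch below, which pairs every clicked question with a configurable number of random negatives. The `build_training_set` helper, the `neg_per_pos` ratio, and the toy identifiers are assumptions. As a simplification, a negative is only guaranteed to differ from the positive it accompanies.

```python
import random


def build_training_set(clicked, question_pool, neg_per_pos=1, seed=0):
    """Pair each (user, clicked question) with random negatives from the pool."""
    rng = random.Random(seed)
    samples = []
    for user, question in clicked:
        samples.append((user, question, 1))
        candidates = [q for q in question_pool if q != question]
        for negative in rng.sample(candidates, min(neg_per_pos, len(candidates))):
            samples.append((user, negative, 0))
    return samples


# Hypothetical click log and question pool.
clicked = [("u1", "q1"), ("u2", "q3")]
pool = ["q1", "q2", "q3", "q4", "q5"]
training = build_training_set(clicked, pool, neg_per_pos=2)
```

Changing `neg_per_pos` changes the proportion of positive to negative samples in the training data.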
29/36
2. Experiment
Sample Proportion
30/36
2. Experiment
Optimization Algorithm
The L-BFGS algorithm is chosen as the optimization algorithm for training the
maximum entropy model.
31/36
EXPERIMENT & EVALUATION
1. Evaluation Metrics
2. Experiment
3. Online Evaluation
32/36
3. Online Evaluation
Enlister was released to Baidu Knows users, and an online evaluation
was conducted starting Feb. 11th, 2012.
33/36
3. Online Evaluation
34/36
CONCLUSION
① We have successfully built a real-time RS that serves millions of users every
day.
② The algorithm and system design fit the recommendation scenario quite
well.
③ Accuracy and timeliness have improved greatly.
④ The number of active users has grown substantially since the system was
officially launched.
Future work: the timing of recommendations and the use of
relationships between users
35/36
Thank You!
36/36