Transcript
Deep Reinforcement Learning for Page-wise Recommendations

Xiangyu Zhao (1), Long Xia (2), Liang Zhang (2), Zhuoye Ding (2), Dawei Yin (2) and Jiliang Tang (1)
1: Data Science and Engineering Lab, Michigan State University; 2: Data Science Lab, JD.com

Motivation

Recommender systems can mitigate the information overload problem by suggesting personalized items to users.

• Interactions between recommender systems and users
  - Each time, the system recommends a page of items to the user
  - Next, the user browses these items and provides real-time feedback
  - Then the system recommends a new page of items ...

• Two key problems
  - How to efficiently capture the user's dynamic preference and update the recommendation strategy according to the user's real-time feedback
  - How to generate a page of items with a proper display based on the user's preferences

We model the recommendation task as an MDP and leverage reinforcement learning to automatically learn the optimal recommendation strategies.

DeepPage Framework

1. Framework Architecture Selection

[Figure: three candidate architectures — (a) takes the state s as input and outputs one Q-value Q(s, a_1), Q(s, a_2), ..., Q(s, a_i) per action; (b) takes the state s and an action a as input and outputs a single Q(s, a); (c) is the Actor-Critic architecture]

✓ Actor-Critic framework (c)
  ✓ Actor: current state -> a deterministic action (recommending a deterministic page of items)
  ✓ Critic: this state-action pair -> Q-value
✕ (a) cannot handle large and dynamic action space scenarios
✕ (b) incurs a high computational cost

Q(s, a) = E_{s'}[ r + γ Q(s', a') | s, a ]
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]
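To make the temporal-difference learning behind these equations concrete, here is a minimal PyTorch-style sketch of one Critic update in an Actor-Critic setup like (c); the `critic`, `target_critic`, and `target_actor` modules, the replay-buffer batch, and the discount value are illustrative assumptions, not the poster's implementation.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, optimizer, batch, gamma=0.95):
    """One temporal-difference step: regress Q(s, a) toward r + gamma * Q(s', a')."""
    s, a, r, s_next, done = batch            # tensors sampled from a replay buffer (assumed)

    with torch.no_grad():
        a_next = target_actor(s_next)        # deterministic next action a' from the Actor
        q_next = target_critic(s_next, a_next)
        target = r + gamma * (1.0 - done) * q_next   # Bellman target

    loss = F.mse_loss(critic(s, a), target)  # fit Q(s, a) to the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```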

2. Markov Decision Process

The recommendation task is modeled as an MDP: at each step the system (agent) recommends a page of items to the user (environment), observes the user's real-time feedback (the reward), and updates its recommendation strategy to maximize the long-term reward.
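A minimal sketch of the page-wise interaction loop this MDP describes; the `env`, `actor`, and `replay_buffer` objects and their methods are hypothetical stand-ins, not the authors' code.

```python
# Hypothetical page-wise interaction loop: the agent recommends a page of items,
# the user (environment) browses it and returns feedback, and the transition is
# stored so the Actor-Critic can be trained from replayed experience.
def run_session(env, actor, replay_buffer, max_steps=20):
    state = env.reset()                           # initial preference from browsing history
    for _ in range(max_steps):
        page = actor.act(state)                   # action: a page of recommended items
        next_state, reward, done = env.step(page)  # user feedback becomes the reward
        replay_buffer.add(state, page, reward, next_state, done)
        state = next_state                        # real-time preference is updated
        if done:                                  # e.g., the user leaves the session
            break
```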

1. Architecture of Actor Framework

[Figure: Encoder for initial state generation — Input Layer: the user's last clicked/ordered items e_1, ..., e_N; Embedding Layer: E_1, ..., E_N; GRU Layer: hidden states h_1, ..., h_N; Output Layer: the initial state s^{ini}]

z_t = σ(W_z E_t + U_z h_{t-1})
r_t = σ(W_r E_t + U_r h_{t-1})
ĥ_t = tanh[ W E_t + U(r_t · h_{t-1}) ]
h_t = (1 − z_t) h_{t-1} + z_t ĥ_t
s^{ini} = h_t
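A minimal PyTorch sketch of this encoder, assuming illustrative layer sizes: embed the user's last clicked/ordered items and take the final GRU hidden state as s^{ini}.

```python
import torch
import torch.nn as nn

class InitialStateEncoder(nn.Module):
    """Encode the user's last N clicked/ordered items into the initial state s_ini."""
    def __init__(self, num_items, embed_dim=32, state_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(num_items, embed_dim)   # e_i -> E_i
        self.gru = nn.GRU(embed_dim, state_dim, batch_first=True)

    def forward(self, item_ids):          # item_ids: (batch, N) integer item indices
        E = self.embedding(item_ids)      # (batch, N, embed_dim)
        _, h_last = self.gru(E)           # h_last: (1, batch, state_dim)
        return h_last.squeeze(0)          # s_ini = final hidden state h_N
```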

[Figure: Encoder for real-time state generation — Input Layer: for each item i, its representation e_i, its category c_i, and the user's feedback f_i; Embedding Layer: X_i; Page Layer: each page P_t of items X_1, ..., X_M; CNN Layer: page representation p_t; GRU Layer: hidden states h_1, ..., h_T; Attention Layer: weights α_1, ..., α_T; Output Layer: the current state s^{cur}]

e_i: item representation,  c_i: item's category,  f_i: user's feedback

X_i = concat(E_i, C_i, F_i) = tanh[ concat(W_E e_i + b_E, W_C c_i + b_C, W_F f_i + b_F) ]
p_t = conv2d(P_t)
s^{cur} = Σ_{t=1}^{T} α_t h_t,   α_t = exp(W_α h_t + b_α) / Σ_j exp(W_α h_j + b_α)
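A rough PyTorch sketch of the real-time state encoder defined by these equations, under assumed dimensions and a 2 x 5 page layout (the actual sizes are not given here): project and concatenate e_i, c_i, f_i into X_i, convolve each page into p_t, run a GRU over the page sequence, and attend over its hidden states to obtain s^{cur}.

```python
import torch
import torch.nn as nn

class RealTimeStateEncoder(nn.Module):
    """s_cur = attention over the GRU hidden states of CNN-encoded pages."""
    def __init__(self, e_dim=16, c_dim=8, f_dim=4, proj_dim=8, rows=2, cols=5, state_dim=64):
        super().__init__()
        self.proj_e = nn.Linear(e_dim, proj_dim)     # W_E e_i + b_E
        self.proj_c = nn.Linear(c_dim, proj_dim)     # W_C c_i + b_C
        self.proj_f = nn.Linear(f_dim, proj_dim)     # W_F f_i + b_F
        self.conv = nn.Conv2d(3 * proj_dim, state_dim, kernel_size=(rows, cols))  # p_t = conv2d(P_t)
        self.gru = nn.GRU(state_dim, state_dim, batch_first=True)
        self.attn = nn.Linear(state_dim, 1)          # attention scores W_a h_t + b_a
        self.rows, self.cols = rows, cols

    def forward(self, e, c, f):
        # e, c, f: (batch, T, rows*cols, dim) for T pages of rows*cols items each
        X = torch.tanh(torch.cat([self.proj_e(e), self.proj_c(c), self.proj_f(f)], dim=-1))
        B, T, M, D = X.shape
        # lay each page out as a (channels=D, rows, cols) grid and apply the CNN
        P = X.view(B * T, self.rows, self.cols, D).permute(0, 3, 1, 2)
        p = self.conv(P).flatten(1).view(B, T, -1)   # page representations p_1..p_T
        h, _ = self.gru(p)                           # hidden states h_1..h_T
        alpha = torch.softmax(self.attn(h), dim=1)   # attention weights alpha_t
        return (alpha * h).sum(dim=1)                # s_cur = sum_t alpha_t h_t
```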

• The Actor is designed to generate a page of recommendations according to the user's preference, which requires tackling three challenges:
  - Setting an initial preference at the beginning of a new recommendation session
  - Learning the real-time preference in the current session
  - Jointly generating a set of recommendations and displaying them in a 2-D page

1.1. Encoder for Initial State Generation Process

• Setting an initial preference at the beginning of a new session
  - Inputs: the user's last clicked/ordered items before the current session
  - Output: a vector representing the user's initial preference

1.2. Encoder for Real-time State Generation Process

• Learning the real-time preference in the current session
  - Inputs: item representations, item categories (to generate complementary recommendations), and the user's feedback (to capture the user's interests in the current page)
  - Output: a vector representing the user's real-time preference

1.3. Decoder for Action Generation Process

• Jointly generating a set of recommendations and displaying them in a page
  - Inputs: the user's real-time preference
  - Output: the representation of a page of recommended items with a proper display

2. The Architecture of Critic Framework

• The Critic aims to learn an action-value function Q(s, a), which judges whether the action "a" matches the current state "s"
  - Inputs: the user's real-time preference "s" and a page of recommendations (items) "a"
  - Output: the Q-value, i.e., Q(s, a)

[Figure: Framework Overview — the Actor encodes the user's items e_1, ..., e_M into the current state s, decodes it through a deconvolution neural network into a proto-action page a^{cur}_{pro}, which is mapped to a valid-item page a^{cur}_{val} shown to the user; the Critic applies a convolution neural network to the recommended page to obtain the action a and outputs Q(s, a)]

Decoder: a^{cur} = deconv2d(s^{cur}),   a = conv2d(a^{cur})
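A minimal PyTorch sketch of the decoder and Critic structure these equations suggest: a transposed convolution expands s^{cur} into a page of item embeddings, and the Critic convolves the page into a low-dimensional action a before scoring Q(s, a); the kernel sizes, page shape, and Q-value head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PageDecoder(nn.Module):
    """a_cur = deconv2d(s_cur): expand the state into a rows x cols page of item embeddings."""
    def __init__(self, state_dim=64, item_dim=16, rows=2, cols=5):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(state_dim, item_dim, kernel_size=(rows, cols))

    def forward(self, s_cur):                     # s_cur: (batch, state_dim)
        x = s_cur.unsqueeze(-1).unsqueeze(-1)     # (batch, state_dim, 1, 1)
        return self.deconv(x)                     # (batch, item_dim, rows, cols)


class Critic(nn.Module):
    """a = conv2d(a_cur), then Q(s, a) from the concatenated state and action vectors."""
    def __init__(self, state_dim=64, item_dim=16, action_dim=64, rows=2, cols=5):
        super().__init__()
        self.conv = nn.Conv2d(item_dim, action_dim, kernel_size=(rows, cols))
        self.q_head = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, s_cur, page):               # page: (batch, item_dim, rows, cols)
        a = self.conv(page).flatten(1)            # low-dimensional action vector a
        return self.q_head(torch.cat([s_cur, a], dim=-1))   # Q(s, a)
```

In the full framework the proto-action page (a^{cur}_{pro} in the overview figure) would additionally be mapped to embeddings of valid items (a^{cur}_{val}) before being recommended to the user; that mapping step is omitted in this sketch.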

Experiments

1. How does DeepPage perform compared to representative baselines?

2. How do the components in the Actor and Critic contribute to the performance?

• Dataset from JD.com
• Baselines
  - Collaborative Filtering
  - Factorization Machines
  - GRU for Rec
  - DQN
  - DDPG

[Figure: performance comparison during the training process]

• DeepPage variants
  - DeepPage-1: no embedding layers
  - DeepPage-2: no category and feedback
  - DeepPage-3: no GRU for initial state
  - DeepPage-4: no CNNs
  - DeepPage-5: no attention layers
  - DeepPage-6: no GRU for real-time state
  - DeepPage-7: no DeCNN

• We make the following observations:
  - DeepPage-1 and DeepPage-2 validate that incorporating category/feedback information and the embedding layer can boost the quality of recommendations
  - DeepPage-3 and DeepPage-6 suggest that setting the user's initial preference and capturing the user's real-time preference are helpful for accurate recommendations
  - DeepPage outperforms DeepPage-4 and DeepPage-7, which indicates that the item display strategy can influence the decision-making process of users

• It can be observed that:
  - CF and FM perform worse than the other baselines; these two baselines ignore the temporal sequence of the users' browsing history
  - DQN and DDPG outperform GRU; the GRU is designed to maximize the immediate reward for recommendations, while DQN and DDPG are designed to achieve a trade-off between short-term and long-term rewards
  - DeepPage performs better than conventional DDPG; compared to DDPG, DeepPage jointly optimizes a page of items and uses a GRU to learn the user's real-time preference

Conclusion and Future Work

• Conclusion
  - We propose a novel Actor-Critic framework that can be applied in scenarios with a large and dynamic item space and can significantly reduce computational cost
  - The DeepPage framework is able to capture the user's real-time preference and to optimize a page of items jointly, which can boost recommendation performance
• Future Work
  - We would like to handle multiple tasks such as search, bidding, advertisement and recommendation collaboratively in one reinforcement learning framework
  - We would like to validate with more agent-user interaction patterns, e.g., adding items to the shopping cart
• Acknowledgements
  - This research is supported by the National Science Foundation (IIS-1714741, IIS-1715940) and a Criteo Faculty Research Award
