
Deep Reinforcement Learning for Page-wise Recommendations

Xiangyu Zhao (1), Long Xia (2), Liang Zhang (2), Zhuoye Ding (2), Dawei Yin (2) and Jiliang Tang (1)

1: Data Science and Engineering Lab, Michigan State University    2: Data Science Lab, JD.com

Motivation

Recommender systems can mitigate the information overload problem by suggesting personalized items to users.

- Interactions between recommender systems and users:
  - Each time, the system recommends a page of items to the user.
  - Next, the user browses these items and provides real-time feedback.
  - Then the system recommends a new page of items, and so on.
- Two key problems:
  - How to efficiently capture the user's dynamic preference and update the recommendation strategy according to the user's real-time feedback.
  - How to generate a page of items with a proper display, based on the user's preferences.

We model the recommendation task as a Markov Decision Process (MDP) and leverage reinforcement learning to automatically learn the optimal recommendation strategy.
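To make the interaction loop concrete, here is a minimal, purely illustrative Python sketch of the page-wise setting as a standard RL loop. The `env` and `agent` objects and their methods are hypothetical placeholders introduced for illustration, not part of the poster.

```python
# Illustrative page-wise recommendation loop: the agent recommends a page,
# the user (environment) browses it and returns feedback, and the agent adapts.
def run_session(env, agent, max_pages: int = 20):
    state = env.reset()                           # user's initial preference
    for _ in range(max_pages):
        page = agent.act(state)                   # a page of items (the action)
        feedback, reward, done = env.step(page)   # clicks/orders on the page
        agent.observe(state, page, reward, feedback)
        state = agent.update_state(state, page, feedback)  # real-time preference update
        if done:                                  # the user leaves the session
            break
```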

Framework Overview

1. Framework Architecture Selection

[Figure: comparison of three deep RL architectures: (a) a Q-network with one output Q(s, a_i) per candidate action; (b) a Q-network that takes a state-action pair (s, a) as input and outputs Q(s, a); (c) the Actor-Critic framework adopted here.]

- Actor-Critic framework (architecture (c)):
  - Actor: maps the current state to a deterministic action (i.e., recommends a deterministic page of items).
  - Critic: maps this state-action pair to a Q-value.
- Architecture (a) cannot handle large and dynamic action spaces.
- Architecture (b) incurs a high computational cost.

Q(s, a) = E_{s'}[ r + \gamma Q(s', a') \mid s, a ]

Q^{*}(s, a) = E_{s'}[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a ]
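The Actor-Critic update implied by these equations can be written in a few lines. Below is a minimal, generic DDPG-style training step in PyTorch, assuming pre-built `actor` and `critic` networks plus target copies (`actor_target`, `critic_target`); the names and procedure are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def actor_critic_step(batch, actor, critic, actor_target, critic_target,
                      actor_opt, critic_opt, gamma=0.99):
    """One update: the critic regresses toward r + gamma * Q'(s', mu'(s'));
    the actor follows the deterministic policy gradient."""
    s, a, r, s_next = batch  # tensors: states, page-actions, rewards, next states

    # Critic update: minimize (Q(s, a) - y)^2 with y from the target networks
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: push the actor toward actions the critic scores highly
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```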

2. Markov Decision Process

We model the sequential interactions as an MDP: at each step, the recommender system (agent) recommends a page of items (the action) based on the user's preference (the state), and the user's real-time feedback determines the reward.

DeepPage Framework

1. Architecture of the Actor Framework

1.1. Encoder for Initial State Generation Process

[Figure: the user's last clicked/ordered items e_1, ..., e_N are embedded (E_1, ..., E_N) and fed through a GRU layer (hidden states h_1, ..., h_N); the output layer returns the final hidden state as the initial state s^{ini}.]

z_t = \sigma(W_z E_t + U_z h_{t-1})
r_t = \sigma(W_r E_t + U_r h_{t-1})
\hat{h}_t = \tanh[ W E_t + U (r_t \cdot h_{t-1}) ]
h_t = (1 - z_t) h_{t-1} + z_t \hat{h}_t
s^{ini} = h_t  (the final GRU hidden state)
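A minimal PyTorch-style sketch of this initial-state encoder is shown below; it assumes item embeddings are looked up from an `nn.Embedding` table, and the module and dimension names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class InitialStateEncoder(nn.Module):
    """Encode the user's last clicked/ordered items into the initial state s_ini."""

    def __init__(self, num_items: int, emb_dim: int = 32, state_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(num_items, emb_dim)             # e_i -> E_i
        self.gru = nn.GRU(emb_dim, state_dim, batch_first=True)   # implements the gate equations above

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # item_ids: (batch, N) indices of the last N clicked/ordered items
        E = self.embed(item_ids)        # (batch, N, emb_dim)
        _, h_last = self.gru(E)         # h_last: (1, batch, state_dim)
        return h_last.squeeze(0)        # s_ini = final hidden state
```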

1.2. Encoder for Real-time State Generation Process

[Figure: for each item i on a page, the item representation e_i, its category c_i, and the user's feedback f_i are embedded and concatenated into X_i; a CNN compresses each page P_t into a page representation p_t; a GRU (initialized with s^{ini}) processes the page sequence into hidden states h_1, ..., h_T; an attention layer with weights \alpha_1, ..., \alpha_T produces the current state s^{cur}.]

e_i: item representation    c_i: the item's category    f_i: the user's feedback

X_i = concat(E_i, C_i, F_i)
    = \tanh\big( concat(W_E e_i + b_E,\; W_C c_i + b_C,\; W_F f_i + b_F) \big)

p_t = conv2d(P_t)

s^{cur} = \sum_{t=1}^{T} \alpha_t h_t, \quad
\alpha_t = \frac{\exp(W_\alpha h_t + b_\alpha)}{\sum_j \exp(W_\alpha h_j + b_\alpha)}
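The page-level attention that produces s^{cur} is easy to sketch. The snippet below is illustrative only: it assumes the page representations p_t have already been produced by the CNN, and the module names and dimensions are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class RealTimeStateEncoder(nn.Module):
    """Attention over page-level GRU hidden states, producing the current state s_cur."""

    def __init__(self, page_dim: int = 128, state_dim: int = 64):
        super().__init__()
        self.gru = nn.GRU(page_dim, state_dim, batch_first=True)
        self.attn = nn.Linear(state_dim, 1)   # W_alpha, b_alpha

    def forward(self, page_reprs: torch.Tensor, s_ini: torch.Tensor) -> torch.Tensor:
        # page_reprs: (batch, T, page_dim) CNN outputs p_1..p_T for the browsed pages
        # s_ini:      (batch, state_dim)   initial state from the previous encoder
        h, _ = self.gru(page_reprs, s_ini.unsqueeze(0))   # h: (batch, T, state_dim)
        alpha = torch.softmax(self.attn(h), dim=1)        # attention weights alpha_t
        return (alpha * h).sum(dim=1)                     # s_cur = sum_t alpha_t * h_t
```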

The Actor is designed to generate a page of recommendations according to the user's preference, and needs to tackle three challenges:

- Setting an initial preference at the beginning of a new recommendation session (the encoder in 1.1)
  - Inputs: the user's last clicked/ordered items before the current session
  - Output: a vector representing the user's initial preference
- Learning the real-time preference in the current session (the encoder in 1.2)
  - Inputs: item representations, item categories (to generate complementary recommendations), and the user's feedback (to capture the user's interests in the current page)
  - Output: a vector representing the user's real-time preference
- Jointly generating a set of recommendations and displaying them in a 2-D page (the decoder in 1.3)
  - Inputs: the user's real-time preference
  - Output: the representation of a page of recommendations (items) with a proper display

2. The Architecture of the Critic Framework

The Critic aims to learn an action-value function Q(s, a) that judges whether the action "a" matches the current state "s".
- Inputs: the user's real-time preference "s" and a page of recommended items "a"
- Output: the Q-value, i.e., Q(s, a)
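In the poster, the Critic first compresses the recommended page with a CNN before scoring it; the sketch below assumes that compression has already produced a flat action vector `a`. Module names and dimensions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Score how well a page-action a matches the current state s: Q(s, a)."""

    def __init__(self, state_dim: int = 64, action_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # s: (batch, state_dim) real-time preference; a: (batch, action_dim) page representation
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)
```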

[Figure: interaction among the Actor, the Critic and the user: the Actor decodes the current state s into a proto-page a^{pro}_{cur} and a page of recommended items e_1, ..., e_M shown to the user; the Critic compresses the recommended page with a CNN into a^{val}_{cur} and outputs Q(s, a); the user's feedback yields the reward r and the next state.]

1.3. Decoder for Action Generation Process

a_{cur} = deconv2d(s^{cur})    (Actor: generate the proto-page from the current state)
a = conv2d(a_{cur})            (Critic side: compress the page action before computing Q(s, a))
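A minimal sketch of this decoder step, using PyTorch's ConvTranspose2d for the deconvolution; the page size, channel counts and layer shapes are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Decoder: deconvolve the current state into a proto-page of item embeddings (Actor side),
# then compress a page back into a compact action vector with a CNN (Critic side).
state_dim, emb_dim, rows, cols = 64, 16, 2, 5   # e.g. a 2 x 5 page, assumed sizes

deconv = nn.Sequential(
    nn.Linear(state_dim, emb_dim * rows * cols),
    nn.Unflatten(1, (emb_dim, rows, cols)),                            # reshape to a page "image"
    nn.ConvTranspose2d(emb_dim, emb_dim, kernel_size=3, padding=1),    # a_cur = deconv2d(s_cur)
)
compress = nn.Sequential(
    nn.Conv2d(emb_dim, 8, kernel_size=(rows, cols)),                   # a = conv2d(a_cur)
    nn.Flatten(),
)

s_cur = torch.randn(4, state_dim)   # batch of current states
a_cur = deconv(s_cur)               # proto-page: (4, emb_dim, rows, cols)
a = compress(a_cur)                 # compact action vector for Q(s, a): (4, 8)
```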

Experiments

1. How does DeepPage perform compared with representative baselines?

2. How do the components in the Actor and Critic contribute to the performance?

- Dataset: from JD.com
- Baselines:
  - Collaborative Filtering (CF)
  - Factorization Machines (FM)
  - GRU for Rec (GRU)
  - DQN
  - DDPG

[Figure: training process of the compared methods.]

- DeepPage variants (each removes one component):
  - DeepPage-1: no embedding layers
  - DeepPage-2: no category and feedback information
  - DeepPage-3: no GRU for the initial state
  - DeepPage-4: no CNNs
  - DeepPage-5: no attention layers
  - DeepPage-6: no GRU for the real-time state
  - DeepPage-7: no DeCNN

We make the following observations:

- DeepPage-1 and DeepPage-2 validate that incorporating category/feedback information and the embedding layer can boost recommendation quality.
- DeepPage-3 and DeepPage-6 suggest that setting the user's initial preference and capturing the user's real-time preference are helpful for accurate recommendations.
- DeepPage outperforms DeepPage-4 and DeepPage-7, which indicates that the item display strategy can influence users' decision-making process.

It can be observed that:

- CF and FM perform worse than the other baselines; these two baselines ignore the temporal sequence of the users' browsing history.
- DQN and DDPG outperform GRU. The GRU baseline is designed to maximize the immediate reward of recommendations, while DQN and DDPG are designed to trade off short-term and long-term rewards.
- DeepPage performs better than conventional DDPG. Compared to DDPG, DeepPage jointly optimizes a page of items and uses a GRU to learn the user's real-time preference.

Conclusion and Future Work

Conclusion
- We propose a novel Actor-Critic framework that can be applied in scenarios with a large and dynamic item space and can significantly reduce computational cost.
- The DeepPage framework is able to capture the user's real-time preference and optimize a page of items jointly, which can boost recommendation performance.

Future Work
- We would like to handle multiple tasks, such as search, bidding, advertising and recommendation, collaboratively in one reinforcement learning framework.
- We would like to validate the framework with more agent-user interaction patterns, e.g., adding items to the shopping cart.

Acknowledgements
- This research is supported by the National Science Foundation (IIS-1714741, IIS-1715940) and a Criteo Faculty Research Award.
