
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1873–1883, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Inducing Semantic Micro-Clusters from Deep Multi-View Representations of Novels

Lea Frermann, University of Edinburgh∗

    [email protected]

György Szarvas, Amazon Development Center Germany GmbH

    [email protected]

    Abstract

Automatically understanding the plot of novels is important both for informing literary scholarship and applications such as summarization or recommendation. Various models have addressed this task, but their evaluation has remained largely intrinsic and qualitative. Here, we propose a principled and scalable framework leveraging expert-provided semantic tags (e.g., mystery, pirates) to evaluate plot representations in an extrinsic fashion, assessing their ability to produce locally coherent groupings of novels (micro-clusters) in model space. We present a deep recurrent autoencoder model that learns richly structured multi-view plot representations, and show that they i) yield better micro-clusters than less structured representations; and ii) are interpretable, and thus useful for further literary analysis or labelling of the emerging micro-clusters.

    1 Introduction

For the literature aficionado, the quest for the next novel to read can be daunting: the sheer number of novels of different styles, topics and genres is difficult to navigate. It is intuitively clear that readers select novels based on specific but potentially diverse and structured preferences (e.g., they might prefer novels of a particular theme (small-town romance), mood (dark) or based on character types (grumpy boss), character relations (love, enmity) and their development). These preferences also manifest in the organization of online book stores or recommendation platforms.1

∗ Work done while the first author was an intern at Amazon (ADC Germany GmbH, Berlin).

1 E.g., www.amazon.com or www.goodreads.com

For example, the Amazon book catalog contains semantic tags provided by experts (publishers), including labels of character types (pirates) or theme (secret baby romance) to aid focused search for novels of interest.

Although these tags are already fairly granular, many cover large sets of novels (e.g., the tag secret baby romance covers almost 4,000 novels), limiting their utility for exhaustive exploration and calling for even finer grained micro-groupings. Can we instead automatically induce fine-grained novel clusters in an unsupervised, data-driven way?

We propose a framework to learn structured, interpretable book representations that capture different aspects of the plot, and verify that such representations are rich enough to support downstream tasks like generating interpretable book groupings. A real-world application of this work is content-based book recommendation based on diverse and interpretable book characteristics. Content-based recommendation has been criticized for the limited complexity of typically employed features (limited content analysis; Lops et al. (2011); Adomavicius and Tuzhilin (2005)). This work addresses this problem by inducing complex, structured and interpretable representations. Our contributions are two-fold.

First, assuming that richly structured book tags call for rich content representations (which expert taggers arguably possess), we describe a deep unsupervised model for learning multi-view representations of novel plots. We use the term view to refer to specific types of plot characteristics (e.g., pertaining to events, characters or mood), and multi-view to refer to combinations of these views. We use multi-view book representations to construct meaningful and locally coherent neighbourhoods in model space, which we will refer to as micro-clusters. To this end, we extend a recent autoencoder model (Iyyer et al., 2016) to learn multi-view representations of books.


Our model encodes properties of characters (view v1), relations between characters (view v2), and their respective trajectories over the plot.2 View-specific encodings are learnt in an unsupervised way from raw text as separate sets of word clusters which are jointly optimized to encode relevant and distinct information. These properties are crucial for applications such as book recommendation, because they allow us to i) explain why particular books are similar based on the inferred latent structure and ii) find similarities based on important and distinct aspects of a novel (character types or interactions). Our framework of unsupervised multi-view learning is very flexible and can straightforwardly be applied to learn arbitrary kinds and numbers of views from raw text.

Secondly, we propose an empirical evaluation framework. Before we can use models to extend existing categories as discussed above, it must be shown that the representations capture existing associations. To this end, we investigate whether micro-clusters derived from induced representations resemble reference clusters defined as groups of novels sharing tags in the Amazon catalog. While automatic induction of plot representations has attracted considerable attention (see Jockers (2013)), evaluation has remained largely qualitative and intrinsic. To the best of our knowledge, we are the first to investigate the utility of automatically induced plot representations on an extrinsic task at scale. We evaluate micro-clusters as local neighbourhoods in model space containing 10,000 novels under 50 reference tags.

We show that rich multi-view representations produce better micro-clusters compared to competitive but simpler models, and that interpretability of the learnt representations is not compromised despite the more complex objective. We also qualitatively demonstrate that high-quality micro-clusters emerge from a smaller, more homogeneous data set of Gutenberg3 novels.

    2 Related Work

Automatically learning representations of book plots, as structured summaries of their content, has attracted much attention (cf. Jockers (2013) for a review).

2 We argue that both characters and their relations evolve throughout the plot: heroes pick up new attitudes or skills, and utilize those to different extents; relations change and develop over time (hate to love, friendship to enmity and back).

3 https://www.gutenberg.org/

Unsupervised models have been proposed which, given raw text, extract prototypical event structure (McIntyre and Lapata, 2010; Chambers and Jurafsky, 2009), prototypical characters (Bamman et al., 2013, 2014; Elsner, 2012) and their social networks (Elson et al., 2010).

Other work focused on the dynamics of a plot, learning trajectories of relations between two characters (Iyyer et al., 2016; Chaturvedi et al., 2017). Iyyer et al. (2016) combine dictionary learning (Olshausen and Field, 1997) with deep recurrent autoencoders to learn interpretable character relationship descriptors. They show that their deep model learns better representations than conceptually similar topic models (Gruber et al., 2007; Chang et al., 2009). Here, we extend the model of Iyyer et al. (2016) to simultaneously induce multiple views on the plot.

Methodologically, our work falls into the class of multi-view learning, and we propose a novel formulation of the model objective which encourages encoding of distinct information in the views. Our objective function is inspired by prior work in multi-task learning and deep domain adaptation for classification (Ganin and Lempitsky, 2015; Ganin et al., 2016). They train neural networks to simultaneously learn classifiers which are accurate on their target task and are agnostic about feature fluctuation pertaining to domain shift. We adapt this idea to unsupervised models with a reconstruction objective and learn multi-view representations which efficiently encode the input data and, at the same time, learn to only encode information relevant for the particular view.

Evaluating induced plot representations is notoriously difficult. Most evaluation has resorted to manual inspection, or crowd-sourced human judgments of the coherence and interpretability of the representations (Iyyer et al., 2016; Chaturvedi et al., 2017). While such evaluations demonstrated that the induced representations are qualitatively valuable, it is not clear whether they are rich and general enough to be used for downstream tasks and applications. Others have used automatically created gold-standards of re-occurring character names across scripts (‘gang member’) (Bamman et al., 2013), prototypical plot templates (tropes, e.g., ‘corrupt corporate executive’) or manually created gold-standards of character types (Vala et al., 2016) or their relations (Massey et al., 2015; Chaturvedi et al., 2017) to automatically measure the intrinsic value of learnt representations.


Here, we investigate how these results extend to extrinsic tasks, and use structured plot representations for the task of inducing micro-clusters of novels.

Elsner (2012) departs from the above pattern, suggesting an extrinsic, albeit artificial, evaluation paradigm. Approaching plot understanding from the angle of its utility for summarization, they use kernel methods to learn character-centric plot representations. They evaluate their trained models on their ability to differentiate between real and artificially distorted novels (e.g., with shuffled chapters). While this evaluation is extrinsic and quantitative, it leverages artificial data and it is not clear how the results extend to real-world summaries.

Language features were previously used in content-based book recommendation, e.g., as bags-of-words (Mooney and Roy, 1999) or semantic frames (Clercq et al., 2014). Both works use structured databases and plot summaries rather than the raw book text. Other work used topic models to augment a recommender system of scientific articles (Wang and Blei, 2011). Similar to our work, these works emphasize the added value of interpretable representations and recommendations; however, they do not leverage the raw content of entire novels and the richness of information encoded in those.

    3 Multi-View Novel Representations

We first provide an intuitive description of Relationship Modeling Networks (RMN; Iyyer et al. 2016), and our extension (henceforth MVPlot), which jointly induces temporally aware multi-view representations of novel plots. Afterwards we describe the MVPlot model in technical detail.

    3.1 Intuition

Iyyer et al. (2016) introduce the RMN, an unsupervised model which learns interpretable plot representations in terms of types of relations between pairs of book characters, and their development over time. Given a book and a character pair, the model learns relation types as word clusters (not unlike topics in a topic model (Blei et al., 2003)) from local contexts mentioning both characters. In addition, the RMN learns for each character pair how these relations vary over time as a trajectory of relations. Methodologically, the RMN combines a deep recurrent autoencoder with dictionary learning, where terms in the dictionary are relationship descriptors.

View  Descriptor
v1    laugh scream laughing yell joke cringe disgrace embarrassment hate cursing
v1    snug fleece warm comfortable wet blanket flannel cozy comfort roomy
v1    excellency mademoiselle monsieur majesty duchess empress madame countess madam
v2    love loving lovely dear sweetest dearest thank darling congratulation hello
v2    associate assistant senior chairman executive leadership vice director liaison vice-president

Table 1: Example property (v1) and relation (v2) descriptors induced by MVPlot on the Gutenberg corpus, as their nearest neighbours in GloVe space.

The RMN learns to efficiently encode local text spans as a linear combination of these relation descriptors.

We extend RMNs to induce temporally aware multi-view representations of novel plots. Multiple interpretable views are induced jointly within one process in an unsupervised way. The core of our model closely corresponds to the structure of the RMN (as technically described in Section 3.2). However, we provide the model with distinct types of informative input for each view, and reformulate the objective in a way that jointly optimizes parameterizations of all views to encode distinct information (cf. Section 3.3).

Our MVPlot model learns two views: properties associated with individual characters (v1), relations between character pairs (v2, as in the RMN), and their respective development over the course of the plot (examples of descriptors learnt by MVPlot for both views are shown in Table 1). Our modeling framework, however, is very general in the sense that any number and type of views can be learnt jointly as long as input with relevant signals can be provided for each view. For example, we could naturally extend the model described here with a ‘plot’ view to capture properties of the story which are not related to any character.

    3.2 The MVPlot Model

We now formally describe the MVPlot model for learning multi-view plot representations encoding individual character properties (v1), character pair relationships (v2), and their respective trajectories. The full model is shown in Figure 1.

The input to our model consists of two corpora of text spans, one for each view, S_{v1} and S_{v2}. The corpora consist of different sets of relevant view-specific local contexts, as described in Section 5.

    1875

[Figure 1: Visualization of the MVPlot model. The diagram shows, from bottom to top: the input span embedding e^t and book embedding e_b; the hidden input representation h^t; for each view (pair view v2 and single view v1) the descriptor weights d^t and d^{t-1}, the descriptor matrices R_{v2} and R_{v1}, and the reconstructed span embeddings r^t_{v2} and r^t_{v1}; and the multi-view loss.]

Given a book b and a character c, S^{c,b}_{v1} contains linearly ordered (with respect to their occurrence in the novel) sequences of text spans s^t at times t = {1, ..., T} in which character c is mentioned, but no other character,

S^{c,b}_{v1} = \{s^1, s^2, \dots, s^T\} \;\; \text{s.th. } \forall t: c \in s^t.

Similarly, given a book b and a pair of characters c_1 and c_2, S^{c_1,c_2,b}_{v2} contains linearly ordered text spans which mention both c_1 and c_2, but no other character,

S^{c_1,c_2,b}_{v2} = \{s^1, s^2, \dots, s^T\} \;\; \text{s.th. } \forall t: c_1 \in s^t, c_2 \in s^t.

The rest of the input preparation follows Iyyer et al. (2016). We map text spans into word embedding space, by mapping each word w to its 300-dimensional GloVe embedding e_w (Pennington et al., 2014) pre-trained on CommonCrawl, and averaging the word embeddings,

e^t = \frac{1}{|s^t|} \sum_{w \in s^t} e_w.    (1)

We provide MVPlot with a trainable matrix B of dimensions b × n, where b is the number of books in our data set, and each row e_b is an n-dimensional book embedding, encoding background information (e.g., about its general setting or style) which is relevant to neither view of MVPlot.5 Finally, the span embedding and the corresponding book embedding are concatenated, and passed through a ReLU non-linearity (cf. Figure 1, bottom),

h^t = \mathrm{ReLU}(W^h [e^t; e_b]).    (2)

5 The RMN learns background encodings for characters in addition to the book embeddings. We omit this for MVPlot as this information is explicitly learned in the views.
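To make this input encoding concrete, here is a minimal NumPy sketch of equations (1) and (2); it is our own illustration, not the authors' released code, and the names glove, e_b and W_h are hypothetical.

```python
import numpy as np

def span_embedding(span_tokens, glove):
    """Eqn (1): average the 300-dim GloVe vectors of the words in a text span."""
    vecs = [glove[w] for w in span_tokens if w in glove]
    return np.mean(vecs, axis=0)                # e^t

def hidden_representation(e_t, e_b, W_h):
    """Eqn (2): concatenate span and book embedding, project, apply ReLU."""
    x = np.concatenate([e_t, e_b])              # [e^t; e_b]
    return np.maximum(0.0, W_h @ x)             # h^t = ReLU(W^h [e^t; e_b])
```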

Model architecture  MVPlot uses the architecture of the RMN autoencoder, but replicates it for each input view, v1 and v2 (cf. Figure 1, center). Each part will induce an encoding of view-specific information. The feed-forward pass, described below, is identical for both parts; however, the loss and backpropagation will differ (cf. Section 3.3).

We describe the feed-forward pass for v2, noting that it works analogously for v1. The latent input representation h^t (eqn (2)) is passed through a softmax layer which returns a weight vector over descriptors, d^t_{v2} = \mathrm{softmax}(W^d_{v2} h^t). Descriptors are rows in the k × d-dimensional descriptor matrix R_{v2}, with each of the k rows corresponding to one d-dimensional descriptor (similar to a topic in a topic model). The input e^t is reconstructed through the dot product of d^t_{v2} and the descriptor matrix R_{v2},

r^t = d^t_{v2} R_{v2}.    (3)

Like in the original RMN, we want to capture the temporal development of character relations or properties. Intuitively, we assume that the relations between (or properties of) characters at time t depend on their relations (or properties) at time t − 1. As in the RMN, we factor the descriptor weights of the previous time step d^{t−1} into the representation at time t, such that

d^t_{v2} = \alpha \, \mathrm{softmax}(W^d_{v2}[h^t; d^{t-1}_{v2}]) + (1 - \alpha) \, d^{t-1}_{v2}.    (4)
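As a concrete illustration of this feed-forward step, the following NumPy sketch computes equations (3) and (4) for one time step of one view. It is a simplification under assumed shapes and argument names, not the original implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def feed_forward_step(h_t, d_prev, W_d, R, alpha=0.5):
    """One time step of a view-specific part of MVPlot.

    h_t    : hidden input representation (eqn (2))
    d_prev : descriptor weights d^{t-1} from the previous time step
    W_d    : view-specific projection onto the k descriptors
    R      : k x d descriptor matrix (rows live in GloVe space)
    """
    # Eqn (4): mix new evidence with the previous descriptor weights.
    d_t = alpha * softmax(W_d @ np.concatenate([h_t, d_prev])) \
        + (1.0 - alpha) * d_prev
    # Eqn (3): reconstruct the span embedding as a weighted sum of descriptors.
    r_t = d_t @ R
    return d_t, r_t
```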

Output  First, the model induces property descriptors (rows in R_{v1}) and relationship descriptors (rows in R_{v2}). Both sets of descriptors are optimized to reconstruct model input in GloVe embedding space (cf. Section 3.3 for details). They consequently themselves live in GloVe word embedding space, and can be visualized through their nearest neighbours in this space. In addition, for each book b, character c and character pair {c_1, c_2}, sequences of weight vectors over relations,

T^{c_1,c_2,b}_{v2} = d^1_{v2} \dots d^T_{v2},

and over properties,

T^{c,b}_{v1} = d^1_{v1} \dots d^T_{v1},

are induced, which encode their trajectory of relations and properties, respectively. We will utilize these trajectories for inducing micro-clusters of novels (Section 6.1).


3.3 The Multi-View Loss

We formulate our loss as a hinge loss within the contrastive max-margin framework. Our objective is to learn parameters for each view ∈ {v1, v2} which efficiently encode view-specific input in a low-dimensional space from which the original input can be reconstructed with high accuracy. In addition, we want to learn view-specific parameters which encode distinct information such that, when utilized together, they provide an improved embedding of the data. Intuitively, we achieve this by discouraging parameters of view v1 from accurately reconstructing input spans from view v2, and vice versa.

Our loss combines these two objectives as follows. The first part of the loss corresponds to the loss of the RMN. We use negative sampling to induce parameters for each view which reconstruct their respective view-specific input well. Formally, assuming model input from view v1, e^t_{v1}, we construct a set of 10 ‘negative inputs’ \{e^{n_1}_{v1}, ..., e^{n_I}_{v1}\} which are sampled at random from the v1 input corpus. We want to learn parameters encoding view v1 to reconstruct the input such that the inner product between the true input e^t_{v1} and its reconstruction r^t_{v1} is higher than the inner product between r^t_{v1} and any of the negative samples e^{n_i}_{v1} by a margin of at least 1,

J(\theta) = \sum_t \sum_i \max(0,\; 1 - r^t_{v1} e^t_{v1} + r^t_{v1} e^{n_i}_{v1}),    (5)

where θ refers to the set of all model parameters. We add an orthogonality-encouraging regularizing term to this objective in order to obtain view-specific descriptors which are distant from each other (Hyvärinen and Oja, 2000),

J(\theta) = \sum_t \sum_i \max(0,\; 1 - r^t_{v1} e^t_{v1} + r^t_{v1} e^{n_i}_{v1}) + \lambda \|R_{v1} R^T_{v1} - I\|.    (6)

The loss is defined analogously for input of view v2. Note that, so far, the loss is defined in an entirely view-specific way (e.g., the v1 loss in equation (6) is independent of the v2 parameters).

We break this independence by adding a second term to our loss function, which ensures that view-specific parameters encode only relevant information. That is, we want v2-specific parameters to only encode v2-specific information, and vice versa.

Genre     Example Tags
Mystery   British Detectives; FBI Agents; Female Protagonists; Private Investigators
Romance   Cowboys; Criminals & Outlaws; Doctors; Royalty & Aristocrats; Spies; Wealthy
SciFi     AIs; Aliens; Clones; Corporations; Mutants; Pirates; Psychics; Robots & Androids

Table 2: Example tags from the Amazon book catalog for the refinement character type.

Assuming model input from view v1, e^t_{v1}, we learn parameters for view v2 that reconstruct this input poorly. Again, we use the max-margin framework, maximizing the margin between the (high) quality of the reconstruction of e^t_{v1} from the v1 parameters, r^t_{v1}, and the (poor) quality of the reconstruction from the v2 parameters, r^t_{v2},

K(\theta) = \max(0,\; 1 - e^t_{v1} r^t_{v1} + e^t_{v1} r^t_{v2}).    (7)

The update is defined analogously, swapping v1 and v2 subscripts, when the true input stems from v2. The full loss is defined as a weighted linear combination of its terms (eqns (6) and (7)),

L(\theta) = \beta J(\theta) + (1 - \beta) K(\theta).    (8)
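For a single v1 input at time t, the combined objective can be sketched as follows. This is an illustrative NumPy snippet under our own argument names, not the released code; the reconstructions r_v1 and r_v2 are assumed to come from the feed-forward pass above.

```python
import numpy as np

def multi_view_loss(e_t, neg_samples, r_v1, r_v2, R_v1, lam=1e-5, beta=0.99):
    """Loss contribution of one time step whose true input comes from view v1.

    e_t         : true input span embedding e^t_{v1}
    neg_samples : list of negative span embeddings sampled from the v1 corpus
    r_v1, r_v2  : reconstructions of e_t produced by the v1 and v2 parameters
    R_v1        : v1 descriptor matrix (used for the orthogonality penalty)
    """
    # Eqn (6): hinge reconstruction loss with negative sampling,
    # plus the orthogonality-encouraging regularizer.
    J = sum(max(0.0, 1.0 - r_v1 @ e_t + r_v1 @ e_n) for e_n in neg_samples)
    J += lam * np.linalg.norm(R_v1 @ R_v1.T - np.eye(R_v1.shape[0]))

    # Eqn (7): penalize the v2 parameters for reconstructing v1 input well.
    K = max(0.0, 1.0 - e_t @ r_v1 + e_t @ r_v2)

    # Eqn (8): weighted combination of the two terms.
    return beta * J + (1.0 - beta) * K
```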

    4 Semantic Micro-Cluster Evaluation

MVPlot induces structured representations of a novel b as relation trajectories between character pairs in b, and property trajectories of individual characters in b. Are those representations rich and informative enough to produce meaningful and interpretable micro-clusters of novels? In Section 6.1 we evaluate the quality of such micro-clusters, i.e., local novel neighbourhoods in model space. We propose an objective and empirical evaluation employing expert-provided semantic novel tags in the Amazon catalog.

Novels listed in the Amazon catalog are tagged with respect to their genre (e.g., mystery, romance). They are further labelled with refinements pertaining to diverse information like character types or mood, which take different sets of values depending on the genre, and are as such well suited as an objective reference for evaluating the diverse information captured by our model. Table 2 lists example tags for the refinement character type.

All tags are provided by publishers and can consequently be taken as a reliable source of information. In our evaluation we assume that novels which share a tag are related to each other. We use this tag-overlap metric to evaluate local neighbourhoods of book representations in model space.


            # novels   # v1 sequences   # v2 sequences
Gutenberg      3,500           45,182           60,493
Amazon        10,000           91,511           70,156

Table 3: The number of novels and of property (v1) and relation (v2) input sequences for the Gutenberg and the Amazon corpus.

We selected a set of 50 representative tags from the Amazon catalog and did not tune this set for our evaluation. The full tag set is included in the supplementary material.

Note that while this scheme provides an empirical way of evaluating plot representations, it may not capture their full potential: our models are not explicitly tuned towards producing micro-clusters which are coherent with respect to our gold-standard tags, and may encode additional structure which is not probed in this evaluation. That said, we consider this evaluation a good procedure for assessing the relative quality of different models, in the sense that a better model should produce micro-clusters that better correspond to reference clusters derived from gold-standard tags.

    5 Data

We evaluate our model on two data sets. First, we create a diverse data set by sampling 10,000 digital novels under our 50 gold-standard tags (cf. Section 4) of the Amazon catalog (Amazon). Our second data set consists of 3,500 novels from Project Gutenberg, a large digital collection of freely available novels consisting primarily of classic literature (Gutenberg). The Amazon novels are already labelled with genre and refinement tags, such that evaluation using our gold-standard is straightforward. While Gutenberg novels come with the advantage of being freely available, they are unlabelled, and not fully covered by our 50 gold-standard tags. We therefore restrict our quantitative analysis to the Amazon data set. However, we also report qualitative results on the Gutenberg corpus, demonstrating that our model induces meaningful novel representations for corpora of varying size and diversity.

Both data sets were pre-processed with the BookNLP pipeline (Bamman et al., 2014) for coreference resolution of character mentions. We filtered stop-words and low-frequency words by discarding the 500 most frequent words and those which occur in fewer than 100 novels, and discarded novels less than 100 sentences long or containing fewer than 5 characters from our data set.

We created view-specific input corpora as follows: (1) a relation corpus of chronologically ordered sequences of 20-word text spans for character pairs {c_1, c_2} in a book b, S^{c_1,c_2,b}_{v2}, which mention only c_1 and c_2 (with a margin of 10 words for the Amazon corpus, 1 word for the smaller Gutenberg corpus) but no other character; and (2) a property corpus of chronologically ordered sequences of 20-word text spans for individual characters c in book b, S^{c,b}_{v1}, which mention only c, using the same margins as above.

We keep only sequences of length n time steps s.th. 5 ≤ n ≤ 250, and we only keep pair sequences if we also obtain sequences for each individual character conforming to the above criteria. Table 3 summarizes statistics on our input corpora.
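The span extraction can be sketched roughly as follows. This is our own simplified illustration; the mention format, window stepping, and margin handling are assumptions and will differ in detail from the actual preprocessing.

```python
from collections import defaultdict

def build_view_corpora(tokens, mentions, span_len=20, margin=10):
    """Split one book into view-specific text spans.

    tokens   : list of tokens of the book
    mentions : list of (token_index, character_id) pairs, e.g. derived
               from coreference resolution (format is an assumption)
    Returns a property corpus {character: [span, ...]} and a relation
    corpus {(c1, c2): [span, ...]}, both in chronological order.
    """
    v1 = defaultdict(list)   # single-character (property) spans
    v2 = defaultdict(list)   # character-pair (relation) spans
    for start in range(0, max(len(tokens) - span_len, 0), span_len):
        window = range(start - margin, start + span_len + margin)
        chars = {c for i, c in mentions if i in window}
        span = tokens[start:start + span_len]
        if len(chars) == 1:
            v1[chars.pop()].append(span)
        elif len(chars) == 2:
            v2[tuple(sorted(chars))].append(span)
    return v1, v2
```

Sequences with fewer than 5 or more than 250 spans would then be filtered out, as described above.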

    6 Evaluation

Section 6.1 quantitatively evaluates the quality of local neighbourhoods in model space induced from the Amazon corpus against our proposed gold-standard. Section 6.2 evaluates the quality of the descriptors induced from both the Amazon and the Gutenberg corpus, through crowd-sourcing and illustrative examples.

Models  We put the MVPlot performance into perspective by comparing it against the RMN.6 MVPlot induces both character properties and relations, and is trained on both the relation-view and property-view input, while the RMN only induces pair relationships and can only utilize relation-view input. In addition, we report a frequency baseline, which is trained on both property- and relation-view input: we concatenate all input spans of a given view for a particular novel, construct its term frequency vector, and use cosine similarity to compute the nearest neighbours of each novel.
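The frequency baseline is straightforward; a minimal sketch using scikit-learn could look like the following. This is our own illustration, with the hypothetical variable view_docs holding one concatenated string of view-specific spans per novel.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def baseline_neighbours(view_docs, n_neighbours=10):
    """Return the indices of the nearest neighbours of every novel."""
    tf = CountVectorizer().fit_transform(view_docs)   # term-frequency vectors
    sims = cosine_similarity(tf)                      # novel-by-novel similarities
    # Sort each row by descending similarity and drop the novel itself.
    return [row.argsort()[::-1][1:n_neighbours + 1] for row in sims]
```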

Parameter settings  Across all experiments and corpus-specific models, we set β = 0.99 for MVPlot, and for both MVPlot and RMN we set α = 0.5, λ = 10^{-5}, k = 50.7 We train both RMN and MVPlot for 15 epochs using SGD and ADAM (Kingma and Ba, 2014).8

6 We do not compare against topic model baselines because they were outperformed by the RMN (Iyyer et al., 2016).

7 Parameters were tuned on a small subset of books used in the nearest neighbourhood evaluation (Section 6.1).

8 Our implementation builds on the available RMN code: https://github.com/miyyer/rmn.


6.1 Nearest Neighbours Evaluation

We evaluate local neighbourhoods in model space using the 500 most popular novels (by their number of Amazon reviews) from the Amazon corpus as reference novels. For each reference novel we compute the 10 nearest neighbours as described below. We measure neighbourhood quality using the gold-standard tags from Section 4, regarding neighbours as relevant if at least one tag is shared with the reference novel. We report precision at rank 10 (P@10) and mean average precision (MAP).
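With tag-overlap relevance, the two scores can be computed as in the following sketch; this is our own illustration (one common formulation of average precision over the retrieved list, the exact normalization used in the paper is not specified). Here ranked is a ranked list of neighbour ids for a reference novel ref, and tags maps each novel to its set of gold-standard tags.

```python
def precision_at_k(ref, ranked, tags, k=10):
    """Fraction of the top-k neighbours that share at least one tag with ref."""
    return sum(bool(tags[ref] & tags[n]) for n in ranked[:k]) / k

def average_precision(ref, ranked, tags):
    """Average of the precision values at the ranks of the relevant neighbours."""
    hits, score = 0, 0.0
    for rank, n in enumerate(ranked, start=1):
        if tags[ref] & tags[n]:          # neighbour shares at least one tag
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0
```

MAP is then the mean of average_precision over all 500 reference novels.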

Method  MVPlot represents a book b in terms of trajectories of weight vectors over relation descriptors, T^b_{v2}, and property descriptors, T^b_{v1}. RMN only learns relation descriptors and their trajectories. For both models, we map each induced trajectory for book b to a fixed-sized k-dimensional vector representation by averaging the time-specific weight vectors, for example for a v2 trajectory,

\bar{T}^{c_1,c_2,b}_{v2} = \frac{1}{|T^{c_1,c_2,b}_{v2}|} \sum_{t=1}^{|T^{c_1,c_2,b}_{v2}|} d^t_{v2},    (9)

and equivalently for v1 trajectories, \bar{T}^{c,b}_{v1}. We compute the similarity between two books {b_r, b_c} as follows. We align the v2 trajectory of each character pair {c_1, c_2} in b_r, \bar{T}^{c_1,c_2,b_r}_{v2}, to its closest neighbouring character pair vector in b_c, \bar{T}^{c'_1,c'_2,b_c}_{v2}, by Euclidean distance, and compute the overall book similarity in terms of character relations between b_r and b_c as the average over all distances,

\mathrm{sim}^{b_r,b_c}_{v2} = \frac{1}{|T^{b_r}_{v2}|} \sum_{\bar{T} \in T^{b_r}_{v2}} \min_{\bar{T}' \in T^{b_c}_{v2}} \mathrm{dist}(\bar{T}, \bar{T}').    (10)

We obtain \mathrm{sim}^{b_r,b_c}_{v1} in an analogous process. For cosine and MVPlot we obtain a final, multi-view similarity by averaging the similarity scores obtained in each view's space,

\mathrm{sim}^{b_r,b_c}_{both} = \frac{1}{2} \left( \mathrm{sim}^{b_r,b_c}_{v1} + \mathrm{sim}^{b_r,b_c}_{v2} \right).    (11)

For RMN we compute similarity only in character relation space.
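Equations (9)–(11) amount to a few lines of NumPy; the sketch below is our own illustration (weight_vectors, trajs_ref and trajs_cand are assumed names for lists of arrays), not the authors' evaluation script.

```python
import numpy as np

def trajectory_vector(weight_vectors):
    """Eqn (9): average the time-specific descriptor weight vectors."""
    return np.mean(weight_vectors, axis=0)

def view_similarity(trajs_ref, trajs_cand):
    """Eqn (10): align each trajectory vector of the reference book to its
    closest counterpart in the candidate book and average the distances."""
    dists = [min(np.linalg.norm(t - t2) for t2 in trajs_cand) for t in trajs_ref]
    return float(np.mean(dists))

def multi_view_similarity(sim_v1, sim_v2):
    """Eqn (11): average the two view-specific scores."""
    return 0.5 * (sim_v1 + sim_v2)
```

Since equation (10) is an averaged distance, lower values correspond to closer neighbours.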

Results  Table 4 presents micro-cluster quality in terms of precision@10 and MAP. The full MVPlot model statistically significantly outperforms all other models.9

9 Also, intra-view comparisons, except for MVPlot v1 and cosine v1, are statistically significant.

Model    View   P@10      MAP
cosine   v1     0.516 ‡   0.392 †
cosine   v2     0.468 ‡   0.339 ‡
cosine   both   0.512 ‡   0.390 ‡
RMN      v2     0.479 ‡   0.347 ‡
MVPlot   v1     0.529 †   0.401 †
MVPlot   v2     0.496 ‡   0.367 ‡
MVPlot   both   0.546     0.421

Table 4: Micro-cluster quality results (Amazon corpus). Differences of cosine and RMN compared to the best MVPlot result are significant with p < 0.05 (†) or p < 0.01 (‡) (paired t-test).

The same pattern emerges when comparing models with the same underlying views: MVPlot v2 outperforms both cosine v2 and RMN v2 (similarly for MVPlot v1 and cosine v1), indicating that the MVPlot character relation representations are most informative for micro-cluster induction.

In order to shed light on the contribution of individual model components, we compare the full MVPlot model (both) to model versions with access to only v1 or v2 (Table 4, bottom). Combining information from both views boosts performance compared to the single-view versions. This confirms that MVPlot indeed encodes distinct and relevant information in the respective views.

While cosine is a strong baseline, its representations are not structured or interpretable. It consequently does not provide sufficient information for applications like book tagging or recommendation with respect to specific aspects or criteria. Similarly, RMN cannot learn representations of multiple, distinct views of the plot.

To advance our understanding of the information encoded in the individual views of MVPlot, we took a closer look at the refinement tags for which the single view MVPlot model (v1) has the clearest advantage over the pair view MVPlot model (v2), and vice versa. We computed tag-wise F1-scores for the two MVPlot variants. Table 5 lists the book tags for which the scores of the two views diverge the most.

In terms of types of refinements, view v2 suffers most for predicting book categories or topical tags (‘sports’, ‘second chances’), while view v1 is particularly deficient for predicting character types. While this seems counterintuitive, we hypothesize that character types are to a large extent defined by their interactions with, or relations to, other characters.


F1 v2 >> F1 v1                       F1 v1 >> F1 v2
Tag                 RefType          Tag               RefType
Robots & Androids   Character        Hard SciFi        Category
Corporations        Character        Sports            Category
International       Theme            Horror            Theme
Aliens              Character        Second Chances    Theme
Cowboys             Character        Crime             Category

Table 5: The tags (Tag) and their refinement types (RefType) for which MVPlot v2 most clearly outperforms MVPlot v1 (left) and vice versa (right), in terms of tag-specific F1-measure.

Topical information, however, is encoded robustly in the properties of individual characters.

    6.2 Evaluating Induced Descriptors

This evaluation investigates whether the induced relation descriptors indeed capture relational information. We evaluate the interpretability of the induced descriptors, comparing the v2 (relation) descriptors induced by RMN and MVPlot. We apply both models to both the Amazon and the Gutenberg corpus, and report results on both data sets.

Method  We collect crowdsourced judgments on Amazon Mechanical Turk to qualitatively evaluate the learnt descriptors, following Chaturvedi et al. (2017). In each task a worker is shown one induced descriptor as a set of its 10 closest words in GloVe space (as in Table 1), and is asked to indicate whether “the words in the group describe a relation, event or interaction between people”. We compare the proportion of positive answers, i.e., the number of descriptors considered relevant, for RMN descriptors and MVPlot pair descriptors. We collect 30 judgments for each of the k = 50 descriptors induced by the respective models.

Results  Figure 2 displays our results. We observe a similar pattern of ratings across models and corpora; e.g., around 50% of the descriptors are labelled as relevant by at least 50% of the annotators. None of the differences are statistically significant, which lets us conclude that the interpretability of induced descriptors is comparable for the RMN and MVPlot. This is encouraging, as it confirms that representation interpretability is not compromised by MVPlot's more complex objective.

Table 1 displays examples of property and relation descriptors induced by MVPlot from the Gutenberg corpus. We can see that the different views indeed capture differing information (e.g., a v1 descriptor refers to individuals' titles, while a v2 descriptor refers to a love relation).

[Figure 2: Results of descriptor interpretability: the % of descriptors marked as ‘relevant descriptors of relations’ plotted against the % of annotators who marked the descriptor relevant, for RMN Amazon, MVPlot Amazon, RMN Gutenberg, and MVPlot Gutenberg.]

Despite the smaller size and more homogeneous nature of the Gutenberg corpus, we show that MVPlot learns meaningful representations from it, demonstrating the flexibility of our model.

Figure 3 further illustrates this, displaying example local neighbourhoods of four reference novels (left) with their eight nearest neighbours ordered by proximity (left to right). The neighbourhoods are intuitively meaningful, and particularly impressive bearing in mind that the full model space contains 3,500 novels. While most neighbourhoods are dominated by novels of the same author, some exceptions emerge. Row two, for example, contains novels by Thomas Hardy and Charles Dickens, who are both known for biographical 19th century novels focusing on class and social changes.

    7 Conclusions

Content-based micro-clustering of novels is a complex but interesting task. In order to eventually augment the diverse associations humans have, models must be able to pick up rich and structured signals from raw text. This paper presented a deep recurrent autoencoder which learns multi-view representations of plots, and introduced a principled evaluation framework using clusters based on expert-provided book tags.

Our evaluation showed that rich multi-view representations are better suited to recover such reference clusters compared to each individual view, as well as compared to simpler, but competitive, models which induce less structured representations. Our view-specific representations are interpretable, which allows us to analyse and explain the emerging micro-clusters.


Figure 3: Nearest neighbours for four classic stories from the Gutenberg Corpus. Target novels on the left (with red border), and NNs are presented in the same row, ordered by their distance to the target novel.

Such analysis might reveal previously unnoticed parallels between novels and may be useful for literary analysis or content-based recommendation. This is an exciting avenue for future work.

Our method is general and scalable both in terms of its input, utilizing raw text with only automatic pre-processing, and in terms of the number of distinct views it can learn. We described an objective function which encourages the views to encode distinct information. In future work we plan to explore joint learning of more and different views.

Our approach relies strongly on the assumption that text spans mentioning two characters contain information about character relations, and that text spans mentioning one character contain information about the character's properties. While our results suggest that these assumptions are valid, they are arguably crude. In the future we plan to define more targeted input, e.g., by using semantic and syntactic information from dependency parses.

In this work we induced dual-view representations of book content; however, we emphasize that the proposed method is very general. The number and kinds of views, as well as the underlying data, are in no way constrained, as long as relevant view-specific input can be defined.

In the context of novel representation it would be interesting to induce additional views, for example one that captures the mood of a novel. Another interesting avenue for future work would be to apply the framework to questions arising in the digital humanities, e.g., to extract different views from news articles.

The presented model and evaluation are designed with the objective of detecting different kinds of similarity between novels, with the ultimate goal of enriching human-provided genres and tags. We described a first step in this direction, verifying that the learnt information is meaningful and can reproduce human-created semantic book tags. Expert book tags exist for a wide variety of information (mood, theme, characters), and provide a rich evaluation environment for learnt representations. We invite the community to join us in exploring the full space of opportunities and evaluating induced representations holistically in the future.

    Acknowledgements

We would like to thank Alex Klementiev, Kevin Small, Joon Hao Chuah and Mohammad Kanso for their valuable insights, feedback and technical help on the work presented in this paper. We also thank the anonymous reviewers for their valuable feedback and suggestions.


References

Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowl. and Data Eng., 17(6):734–749.

David Bamman, Brendan O'Connor, and Noah A. Smith. 2013. Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 352–361, Sofia, Bulgaria. Association for Computational Linguistics.

David Bamman, Ted Underwood, and Noah A. Smith. 2014. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 370–379, Baltimore, Maryland. Association for Computational Linguistics.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602–610, Suntec, Singapore. Association for Computational Linguistics.

Jonathan Chang, Jordan L. Boyd-Graber, and David M. Blei. 2009. Connections between the lines: augmenting social networks with text. In KDD, pages 169–178. ACM.

Snigdha Chaturvedi, Mohit Iyyer, and Hal Daumé III. 2017. Unsupervised learning of evolving relationships between literary characters. In Association for the Advancement of Artificial Intelligence.

Orphée De Clercq, Michael Schuhmacher, Simone Paolo Ponzetto, and Véronique Hoste. 2014. Exploiting FrameNet for content-based book recommendation. In Proceedings of the 1st Workshop on New Trends in Content-based Recommender Systems co-located with the 8th ACM Conference on Recommender Systems, CBRecSys@RecSys 2014, Foster City, Silicon Valley, California, USA, October 6, 2014, pages 14–21.

Micha Elsner. 2012. Character-based kernels for novelistic plot structure. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 634–644, Avignon, France. Association for Computational Linguistics.

David Elson, Nicholas Dames, and Kathleen McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 138–147, Uppsala, Sweden. Association for Computational Linguistics.

Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1180–1189. JMLR Workshop and Conference Proceedings.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030.

Amit Gruber, Yair Weiss, and Michal Rosen-Zvi. 2007. Hidden topic Markov models. In AISTATS, volume 2 of JMLR Proceedings, pages 163–170. JMLR.org.

A. Hyvärinen and E. Oja. 2000. Independent component analysis: Algorithms and applications. Neural Netw., (4-5):411–430.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In North American Association for Computational Linguistics.

Matthew L. Jockers. 2013. Macroanalysis: Digital Methods and Literary History. University of Illinois Press.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Pasquale Lops, Marco de Gemmis, and Giovanni Semeraro. 2011. Content-based Recommender Systems: State of the Art and Trends. Springer US, Boston, MA.

Philip Massey, Patrick Xia, David Bamman, and Noah A. Smith. 2015. Annotating character relationships in literary texts. CoRR, abs/1512.00728.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1562–1572, Uppsala, Sweden. Association for Computational Linguistics.

Raymond J. Mooney and Loriene Roy. 1999. Content-based book recommending using learning for text categorization. In Proceedings of the SIGIR-99 Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, CA.

Bruno A. Olshausen and David J. Field. 1997. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Hardik Vala, Stefan Dimitrov, David Jurgens, Andrew Piper, and Derek Ruths. 2016. Annotating characters in literary corpora: A scheme, the CHARLES tool, and an annotated novel. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 448–456, New York, NY, USA. ACM.

    1883