text-to-text pre-training model with plan selection for

8
Text-to-Text Pre-Training Model with Plan Selection for RDF-to-Text Generation Natthawut Kertkeidkachorn Hiroya Takamura Artificial Intelligence Research Center (AIRC) National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan {n.kertkeidkachorn, takamura.hiroya}@aist.go.jp Abstract We report our system description for the RDF- to-Text task in English on the WebNLG 2020 Challenge. Our approach consists of two parts: 1) RDF-to-Text Generation Pipeline and 2) Plan Selection. RDF-to-Text Generation Pipeline is built on the state-of-the-art pre- training model, while Plan Selection helps de- cide the proper plan into the pipeline. 1 Introduction Natural language generation from data, or data-to- text, aims to generate natural language text that describes the input data such as tables and graphs. WebNLG (Gardent et al., 2017; Ferreira et al., 2018) is one of the data-to-text tasks, where the given input data is a set of triples (a tree or a graph). A triple consists of entities and a relationship be- tween them in the form of (subject, predicate, ob- ject) describing one fact. For example, (Donald Trump, birthplace, New York) describes the facts “Donald Trump was born in New York.” In the WebNLG task, up to 7 triples are given as the input data. Figure 1 presents an instance of the WebNLG dataset 1 , where 3 triples are given as the RDF input graph and the output text is the natural language de- scription describing the fact from triples. Generat- ing text from triples is useful for many applications such as question and answering (He et al., 2017), where a set of triples retrieved as an answer can be organized and translated into natural language text. Many approaches have been proposed to deal with the WebNLG task. Neural Machine Transla- tion (NMT) approach is one of the popular methods for this task (Gardent et al., 2017). In WebNLG 2017, a neural machine translation-based approach, achieved the highest score in the automatic eval- uation (Gardent et al., 2017). In this approach, 1 An example is from https://webnlg-challenge. loria.fr Figure 1: Illustration of the WebNLG 2020 task on RDF-to-text (English). the delexicalization, where entities are replaced by placeholders, with N-gram search and the lin- earization using DBpedia type enrichment are used together with the standard encoder-decoder with attention model. Due to the simple linearization, the relationship between entities as the graph in triples might lessen. To preserve such information, a graph-based triple encoder (GTR-LSTM) (Disti- awan et al., 2018), is introduced. Later, Moryossef et al. (2019) argued that splitting the generation pro- cess into plan selection and text realization could help generate the description text that is faithful to the triples. PlanEnc uses a graph convolution network-based model to predict the order of the triples and then an LSTM with attention and the copy mechanism is employed to generate the text corresponding to the order of triples (Zhao et al., 2020). Recently, Kale (2020) investigated Text- to-Text Transfer Transformer (T5) model for the data-to-text task. With transfer learning ability, the T5 model is the current state-of-the-art for RDF-to- text generation. To further investigate the T5 model from the study (Kale, 2020), we conduct experiments with the T5 model on WebNLG 2020 challenge 2 . Based 2 https://webnlg-challenge.loria.fr/ challenge_2020/ 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), Dublin, Ireland (Virtual), 18 December 2020, pages 159–166, c 2020 Association for Computational Linguistics Attribution 4.0 International.

Upload: others

Post on 18-Dec-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Text-to-Text Pre-Training Model with Plan Selectionfor RDF-to-Text Generation

Natthawut Kertkeidkachorn Hiroya TakamuraArtificial Intelligence Research Center (AIRC)

National Institute of Advanced Industrial Science and Technology (AIST)2-4-7 Aomi Koto-ku Tokyo 135-0064 Japan

nkertkeidkachorn takamurahiroyaaistgojp

Abstract

We report our system description for the RDF-to-Text task in English on the WebNLG 2020Challenge Our approach consists of twoparts 1) RDF-to-Text Generation Pipeline and2) Plan Selection RDF-to-Text GenerationPipeline is built on the state-of-the-art pre-training model while Plan Selection helps de-cide the proper plan into the pipeline

1 Introduction

Natural language generation from data or data-to-text aims to generate natural language text thatdescribes the input data such as tables and graphsWebNLG (Gardent et al 2017 Ferreira et al2018) is one of the data-to-text tasks where thegiven input data is a set of triples (a tree or a graph)A triple consists of entities and a relationship be-tween them in the form of (subject predicate ob-ject) describing one fact For example (DonaldTrump birthplace New York) describes the factsldquoDonald Trump was born in New Yorkrdquo In theWebNLG task up to 7 triples are given as the inputdata Figure 1 presents an instance of the WebNLGdataset1 where 3 triples are given as the RDF inputgraph and the output text is the natural language de-scription describing the fact from triples Generat-ing text from triples is useful for many applicationssuch as question and answering (He et al 2017)where a set of triples retrieved as an answer can beorganized and translated into natural language text

Many approaches have been proposed to dealwith the WebNLG task Neural Machine Transla-tion (NMT) approach is one of the popular methodsfor this task (Gardent et al 2017) In WebNLG2017 a neural machine translation-based approachachieved the highest score in the automatic eval-uation (Gardent et al 2017) In this approach

1An example is from httpswebnlg-challengeloriafr

Figure 1 Illustration of the WebNLG 2020 task onRDF-to-text (English)

the delexicalization where entities are replacedby placeholders with N-gram search and the lin-earization using DBpedia type enrichment are usedtogether with the standard encoder-decoder withattention model Due to the simple linearizationthe relationship between entities as the graph intriples might lessen To preserve such informationa graph-based triple encoder (GTR-LSTM) (Disti-awan et al 2018) is introduced Later Moryossefet al (2019) argued that splitting the generation pro-cess into plan selection and text realization couldhelp generate the description text that is faithfulto the triples PlanEnc uses a graph convolutionnetwork-based model to predict the order of thetriples and then an LSTM with attention and thecopy mechanism is employed to generate the textcorresponding to the order of triples (Zhao et al2020) Recently Kale (2020) investigated Text-to-Text Transfer Transformer (T5) model for thedata-to-text task With transfer learning ability theT5 model is the current state-of-the-art for RDF-to-text generation

To further investigate the T5 model from thestudy (Kale 2020) we conduct experiments withthe T5 model on WebNLG 2020 challenge 2 Based

2httpswebnlg-challengeloriafrchallenge_2020

3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)Dublin Ireland (Virtual) 18 December 2020 pages 159ndash166 ccopy2020 Association for Computational Linguistics

Attribution 40 International

Figure 2 An Overview Pipeline of RDF-to-Text Gen-eration (a) RDF graph as input (b) Linearization formof RDF graph and (c) Generated Text

on our preliminary results we found that chang-ing the order of the triples before the linearizationaffected the BLEU score In this paper we there-fore present a Text-to-text pre-training model forthe data-to-text generation with plan selection Inour approach we follow the study (Kale 2020) tobuild the RDF-to-Text Generation pipeline Thenwe develop the graph neural network (GNN) basedmodel to score the plan associating with the tripleorder The triple order with the highest plan scoreis linearized and feed to Generator in the RDF-to-Text Generation pipeline

2 Approach

In our approach there are two steps 1) RDF-to-Text Generation and 2) Plan Selection RDF-to-Text generation aims to transform input triples intonatural language text describing them while PlanSelection is the pre-processing process to select theorder of the triples for the RDF-to-Text Generationpipeline Note that Plan Selection is performed be-fore RDF-to-Text Generation in the testing phase

21 RDF-to-Text Generation

In the RDF-to-text generation step we follow asimilar approach with the study (Kale 2020) tobuild the pipeline As illustrated in 2 there are twosteps in the pipeline of RDF-to-Text Generation 1)Linearization and 2) Generator

211 LinearizationLinearization is to convert RDF graph into the se-quence so that the conventional seq-2-seq modelcan be applied Specifically given a set of triples

T = (s1 p1 o1) (s2 p2 o2) (sn pn on)the linearization maps the triples into the wordsequence as follows lt s gt s1 lt p gt p1 lt o gto1 lt s gt s2 lt s gt sn lt p gt pn lt o gt on Anexample of the linearization is shown in Figure 2 (a)and (b) In Figure 2 (a) the input triples are givenas the RDF graph and the linearization form of thegraph is converted as in Figure 2 (b) Note thatusually in an RDF graph rsquoDonald Trumprsquo is con-catenated into one single token rsquoDonald TrumprsquoIn the conversion process we therefore removeunderscore ( ) in the entity name to recover theoriginal words denoting the entity

212 GeneratorGenerator is to generate natural language text cor-responding to the given word sequence from thelinearization step as shown in Figure 2 (b) and(c) In our study we employ Text-To-Text Trans-fer Transformer (T5) pre-trained models releasedby the study (Raffel et al 2020) as the generatormodel The T5 model is built on a transformer-based encoder-decoder architecture The modelwas trained through an unsupervised multi-tasking(span masking) in the Colossal Clean Crawled Cor-pus (C4) dataset as well as through supervisedlearning tasks including translation summariza-tion classification and question answering In thetraining phase we fine-tune the T5 model by usingthe linearization form of the RDF graph as the inputto target the text description as the output Duringthe fine-tuning all parameters in the models areupdated

22 Plan Selection

The objective of Plan Selection is to computescores of linearization forms of triples with dif-ferent orders of triple and then find the order oftriples providing the highest score as the selectedplan

The intuition behind this idea is that permutingthe order of triples during the linearization mayyield different texts To concretely elaborate thissensitivity of Generator in the pipeline we illus-trate the example of the permutation of the order oftriples in Figure 3 and discuss the intuition usingthis example Since there are 3 triples in the exam-ple we can linearize RDF triples into 6 patternsowing to the order of triples When we input these6 linearization forms of triples into Generator inthe pipeline Generator returns the different gener-ated texts For example with the linearization 1 it

160

Figure 3 The example of the permuting the triple or-ders and their Linearization forms

may generate ldquoJohn E Blaha (1942-08-26) born inSan Antonio worked as a fighter pilotrdquo while withthe linearization 3 it may generate rdquoJohn E Blahaborn in San Antonio on 1942-08-26 worked as afighter pilotrdquo We then evaluate generated textsand found out that the BLEU score of generatedtexts on the same RDF graph may differ When wemanually select the order of triples providing thehighest BLEU score to examine the upperboundthe BLEU score improved by around 10 points onthe development set of WebNLG 2020 Based onthis preliminary experiment we therefore aim tobuild Plan Selection which helps select the order oftriples for Generator in the RDF-to-Text pipeline

In Plan Selection there are 2 steps 1) Plan Gen-eration and 2) Plan Evaluation

221 Plan GenerationPlan Generation is to generate all possible ordersof triples given the input triples as plans In thePlan Generation step we create all combinationsof the triple orders The concrete example is shownin Figure 3 In the example we can create 6 ordersof triples Since RDF graphs in WenNLG 2020Challange are small exhaustive plan generation isfeasible

222 Plan EvaluationPlan Evaluation is to estimate the score of a planie an order of triples We design the Plan Eval-uation architecture into three components 1) Lin-earization Representation 2) RDF Graph Repre-sentation and 3) Plan Score as shown in Figure4 Linearization Representation aims to learn therepresentation of the Linearization form of triplesby the given order RDF Graph representation isto learn the representation of the RDF graph in thelow dimensional vector space Plan Score is tocompute the score for the given triple order corre-

Figure 4 The overview design of Plan Selection

sponding to the RDF graph by using LinearizationRepresentation and RDF Graph Representation

Linearization Representation is to project theword sequence that linearizes from one order oftriples into a low dimensional vector space In Lin-ear Representation we learn the representation byfeeding the word sequence into the contextualizedlanguage representation Recently contextualizedlanguage representations (Devlin et al 2019 Pe-ters et al 2018 Radford et al 2019 Lewis et al2019) gain wide attention from the NLP commu-nity Bidirectional Encoder Representations fromTransformers (BERT) (Devlin et al 2019) is oneof popular pre-trained models based on the trans-former architecture (Vaswani et al 2017) Dueto its performance it has been used in many NLPtasks We therefore utilize BERT to learn Lin-earization representations Given the RDF graphG and the plan p we can generate the lineariza-tion form as the sequence L = w1w2w3wn bymapping the plan to sequence as demonstrated inthe example of Figure 3 After obtained the se-quence L we always add a special classificationtoken [CLS] as the beginning of the sequence LThen the sequence L with [CLS] token is fed to thepre-trained BERT model After feeding the inputsequence to the pre-train model the final vectorrepresentation C corresponding to [CLS] token isused as the Linearization Representation

RDF Graph Representation is to embed theentire graph into the low dimensional vector spaceIn RDF Graph Representation we first employNode2Vec (Grover and Leskovec 2016) to gen-erate the node attributes We do not represent thenode attributes with one-hot vector encoding due

161

to the sparseness of the RDF graph Then we useGraphConv (Morris et al 2019) with the globalaverage pool to compute the representation of theRDF graph We formally define the computation inthis component as follows Given the RDF GraphG = (XA) where X is the matrix representingthe node attributes and A is the adjacency matrixof RDF graph we learn the RDF Graph Represen-tation by the following equation

xprimei = W1xi +

sum

jisinN(i)

W2xj (1)

where xprimei is the representation of the i-th node xi

is the attribute vector of the i-th node xj is theattribute vector of the j-th node which is the neigh-bor node of the i-th node N(i) is a set of nodesconnecting to the i-th node W1 and W2 are thelearnable weight matrix trained by propagating theloss in Eq 4 We then use the calculated representa-tions to obtain the RDF representation of graph G

XG =1

|V |sum

visinVxprimev (2)

where xprimei is the representation of the i-th node and

V is the set of the nodes in GPlan Score computes the score by using Lin-

earization Representation and RDF Graph Repre-sentation We concatenate Linearization Represen-tation C and RDF Graph Representation XG intoa single vector representation and feed it to thefeed-forward neural network (FFNN) as shown inEquation (3) The output of FFNN is the score ofthe plan subjected to the RDF graph

y = [CXG]WTFFNN + b (3)

where WFFNN and b are trainable parameters forFFNN

In the training phase we need to build the scorefor each plan as the target for Plan Selection tolearn To obtain the score we feed all possiblelinearization of triple orders into the RDF-to-Textgeneration pipeline Then we compute the BLEUscore for each order of triples from the corpusWith this strategy we can get the plan associatedwith the BLEU score based on the model of thepipeline We then use the following objective func-tion to optimize all trainable parameters in PlanSelection

L =sum

sisinS

sum

pisinP (s)

|ys minus ysp| (4)

where S is a WebNLG 2020 corpus P (s) is a setof plans for sample s in the corpus

In the testing phase we compute the scores ofall plans for the given RDF graph and then selectthe plan with the highest score The triple orderassociating with that plan is linearized as the wordsequence and input to Generator in the pipeline togenerate text description

3 Experiments and Results

31 Experimental SetupThe experimental setup is as follows

Datasets The dataset used in the experimentis WebNLG 2020 Challenge on the RDF-to-Text(English) task In the task there are 35426 4464and 1779 RDF Graph-Text pairs for training val-idating and testing respectively RDF Graph-textpairs are generated from 16 distinct DBpedia (Aueret al 2007) categories Athlete Artist Celestial-Body MeanOfTransportation Politician AirportAstronaut Building City ComicsCharacter FoodMonument SportsTeam University WrittenWorkand Company

Settings In our pipeline we use the pre-trainedmodel t5-base from the huggingface repository3We set the hyperparameters in the fine-tuning pro-cess as follows batch 8 learning rate 20times10minus5epochs 20 maximum tokens 512 and optimizerAdam (Kingma and Ba 2014) The default valuesare used for the other hyperparameters

For Plan Selection we set the following hyperpa-rameters batch 8 learning rate 0001 epoch 10and optimizer Adam In each component of PlanSelection we set the hyperparameters as followsIn Linearization representation we select the bert-base-uncased model as the BERT model with thedefault setting In RDF Graph Representation weset the following hyperparameters for Node2Vec4vector dimension 128 window size 10 batch 32walk length 80 walk step 10 and min count 1 tocreate node attributes We stack GraphConv5 into 3layers and use Global Average Pool as the poolingmethod In FF Neural Network we use 2 hiddenlayers with 410 neurons and the ReLU activation

Baseline The baseline in the experiment isprovided by WebNLG 2020 Challenge6

3httpshuggingfacecomodels4httpsgithubcomeliorcnode2vec5httpspytorch-geometricreadthedocs

ioenlatestmodulesnnhtmltorch_geometricnnconvGraphConv

6httpswebnlg-challengeloriafr

162

Table 1 Overall Results of Our Approach on RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5093 5043 4057BLEU NLTK 0482 0476 0396METEOR 0384 0382 0373CHRF++ 0636 0637 0621TER 0454 0439 0517BERT P 0952 0955 0946BERT R 0947 0949 0941BERT F1 0949 0951 0943BLEURT 054 057 047

Table 2 Results on Seen Category of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 6053 6099 4295BLEU NLTK 0522 0518 0415METEOR 0380 0381 0387CHRF++ 0638 0641 0650TER 0458 0432 0563BERT P 0955 0961 0945BERT R 0946 0948 0942BERT F1 0950 0954 0943BLEURT 051 055 041

Evaluation Settings There are two evaluationsettings 1) Automatic Evaluation and 2) HumanEvaluation

In Automatic Evaluation the evaluationmetrics are BLEU (Papineni et al 2002)BLEU NLTK7 METEOR (Banerjee and Lavie2005) CHRF++ (Popovic 2017) TER (Snoveret al 2006) BERT Precision (BERT P) (Zhanget al 2019) BERT Recall (BERT R) (Zhanget al 2019) BERT F1 (Zhang et al 2019) andBLEURT (Sellam et al 2020)

In Human Evaluation the evaluation metrics are

bull Data Coverage this metric assesses howmuch information from the data has been cov-ered in the text

challenge_20207httpswwwnltkorg_modulesnltk

translatebleu_scorehtml

Table 3 Results on Unseen Category of Our Approachon RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 4382 4307 3756BLEU NLTK 0436 0430 0370METEOR 0379 0376 0357CHRF++ 0618 0618 0584TER 0472 0456 0510BERT P 0947 0949 0944BERT R 0944 0945 0936BERT F1 0945 0946 0940BLEURT 050 054 044

Table 4 Results on Unseen Entity of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5074 4940 4022BLEU NLTK 0504 0490 0393METEOR 0405 0398 0384CHRF++ 0672 0669 0648TER 0410 0410 0476BERT P 0958 0961 0949BERT R 0956 0958 0950BERT F1 0957 0959 0949BLEURT 061 064 055

bull Relevance this metric assesses whether thetext contains any non-presented predicates

bull Correctness this metric assesses whether thetext describes predicates with correct objects

bull Text Structure this metric assesses whetherthe text is grammatical and well-structuredwritten in good English

bull Fluency this metric assesses the naturalnessof generated texts

32 Results321 Automatic EvaluationIn Automatic Evaluation the results are listed inTables 1-4 These results are provided by WebNLG2020 (Castro-Ferreira et al 2020 Moussalemet al 2020)

163

Figure 5 Overall Results of Human Evaluation onRDF-to-Text of WebNLG 2020 (English)

Figure 6 Seen Category Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

In our approaches the pipeline with Plan Selec-tion slightly outperforms the pipeline without PlanSelection on both BLEU score metrics (BLEU andBLEU NLTK) as shown in Table 1 In the seencategory as presented in Table 2 it turns out thatour approach with Plan Selection slightly yieldsbetter performance than the approach without PlanSelection on only BLEU NLTK With this obser-vation the Plan Selection model therefore doesnot contribute much to the seen category and evendegrades the performance in some cases Neverthe-less on the unseen category and unseen entity eval-uation in Tables3-4 we found our approach withPlan Selection improves both BLEU score metricswhen comparing with our approach without PlanSelection

Based on these results in Tables 2-4 the ap-proach building on the T5 model works well onthe seen category and it may not require the planselection process Still on the unseen categoryand unseen entity there is a room where the planselection could play a role in improving the perfor-mances

So far we are interested in the BLEU metric be-cause we optimize the Plan Selection model using

Figure 7 Unseen Category Results of Human Evalua-tion on RDF-to-Text of WebNLG 2020 (English)

Figure 8 Unseen Entity Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

the BLEU metric To further investigate the planselection in this approach we might optimize theplan selection using other metrics

Overall our approach with and without the planselection outperforms the baseline in most metricsexcept for CHRF++ as shown in Table 1 Althoughour approach with Plan Selection could slightlyimprove some metrics the results do not explic-itly surpass our approach without Plan SelectionTherefore the current contribution of Plan Selec-tion is very limited

322 Human EvaluationIn Human Evaluation the results are illustratedin Figure 5-8 These results are also provided byWebNLG 2020 (Castro-Ferreira et al 2020 Mous-salem et al 2020) Due to the limitation of theWebNLG 2020 submission only our approach withPlan Selection is evaluated in the Human Evalua-tion setting Note that the reference texts used inthe automatic evaluation are also evaluated in thissetting

Overall results in Figure 5 show our approachoutperforms the baseline in the relevance metricthe text structure metric and the fluency metric

164

however the baseline provides the better results inthe data coverage metric and the correctness metricAccording to the statistical testing we found statis-tically significant differences in the data coverageresult the correctness result and the fluency re-sult This indicates our approach provides fluencytext while it lacks the coverage and the correctnesscorresponding to RDF inputs

To further investigate the results we analyze theresults based on the seen category the unseen cat-egory and the unseen entity As shown in Figure6-8 the data coverage result the text structure andthe fluency follow the overall resultrsquos trend Nev-ertheless the relevance result and the correctnessresult are varied It turns out that on the unseenentity our approach could not outperform the base-line in the relevance result however our approachachieves even better relevance result than the ref-erence texts in the unseen category In the seencategory our approach yields higher correctness re-sult than the baseline while in the unseen categoryand unseen entity the correctness results follow theoverall trend

In Human Evaluation we notice that our ap-proach severely suffers from the data coverage asshown in Figure 6-8 The gap between our ap-proach and the reference (even the baseline) is largeThis is the limitation of our approach where we donot force the model to use all inputs

4 Conclusion

In this paper we introduce RDF-to-Text Generationwith Plan Selection for WebNLG 2020 challengeThe approach comprises two parts 1) the pipelinefor RDF-to-Text Generation and 2) Plan SelectionIn the pipeline we follow the state-of-the-art modelfor RDF-to-Text Generation (Kale 2020) while inPlan Selection we propose a method for selectingthe order of triples for the pipeline Overall weobserved a slight improvement in the BLEU scorewhen applying Plan Selection Nevertheless PlanSelection only optimized the BLEU score in thisstudy In the future we aim to investigate optimizePlan Selection in other metrics

Acknowledgments

This paper is based on results obtained from aproject commissioned by the New Energy andIndustrial Technology Development Organization(NEDO) JPNP20006

ReferencesSoren Auer Christian Bizer Georgi Kobilarov Jens

Lehmann Richard Cyganiak and Zachary Ives2007 Dbpedia A nucleus for a web of open dataIn The semantic web pages 722ndash735 Springer

Satanjeev Banerjee and Alon Lavie 2005 Meteor Anautomatic metric for mt evaluation with improvedcorrelation with human judgments In Proceedingsof the acl workshop on intrinsic and extrinsic evalu-ation measures for machine translation andor sum-marization pages 65ndash72

Thiago Castro-Ferreira Claire Gardent NikolaiIlinykh Chris van der Lee Simon Mille DiegoMoussalem and Anastasia Shimorina 2020 The2020 bilingual bi-directional webnlg+ shared taskOverview and evaluation results (webnlg+ 2020) InProceedings of the 3rd WebNLG Workshop on Nat-ural Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Associ-ation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conference ofNAACL-HLT pages 4171ndash4186

Bayu Distiawan Jianzhong Qi Rui Zhang and WeiWang 2018 Gtr-lstm A triple encoder for sentencegeneration from rdf data In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages1627ndash1637

Thiago Castro Ferreira Diego Moussallem EmielKrahmer and Sander Wubben 2018 Enriching thewebnlg corpus In Proceedings of the 11th Interna-tional Conference on Natural Language Generationpages 171ndash176

Claire Gardent Anastasia Shimorina Shashi Narayanand Laura Perez-Beltrachini 2017 The webnlgchallenge Generating text from rdf data In Pro-ceedings of the 10th International Conference onNatural Language Generation pages 124ndash133

Aditya Grover and Jure Leskovec 2016 node2vecScalable feature learning for networks In Proceed-ings of the 22nd ACM SIGKDD international con-ference on Knowledge discovery and data miningpages 855ndash864

Shizhu He Cao Liu Kang Liu and Jun Zhao2017 Generating natural answers by incorporatingcopying and retrieving mechanisms in sequence-to-sequence learning In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1 Long Papers) pages 199ndash208

Mihir Kale 2020 Text-to-text pre-training for data-to-text tasks arXiv preprint arXiv200510433

165

Diederik P Kingma and Jimmy Ba 2014 Adam Amethod for stochastic optimization arXiv preprintarXiv14126980

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Ves Stoyanov and Luke Zettlemoyer 2019Bart Denoising sequence-to-sequence pre-trainingfor natural language generation translation andcomprehension arXiv preprint arXiv191013461

Christopher Morris Martin Ritzert Matthias FeyWilliam L Hamilton Jan Eric Lenssen Gaurav Rat-tan and Martin Grohe 2019 Weisfeiler and lemango neural Higher-order graph neural networks InProceedings of the AAAI Conference on Artificial In-telligence volume 33 pages 4602ndash4609

Amit Moryossef Yoav Goldberg and Ido Dagan 2019Step-by-step Separating planning from realizationin neural data-to-text generation In Proceedings ofthe 2019 Conference of the North American Chap-ter of the Association for Computational LinguisticsHuman Language Technologies Volume 1 (Longand Short Papers) pages 2267ndash2277

Diego Moussalem Paramjot Kaur Thiago Castro-Ferreira Chris van der Lee Conrads Felix Shimo-rina Anastasia Michael Roder Rene Speck ClaireGardent Simon Mille Nikolai Ilinykh and Axel-Cyrille Ngonga Ngomo 2020 A general bench-marking framework for text generation In Pro-ceedings of the 3rd WebNLG Workshop on Natu-ral Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Asso-ciation for Computational Linguistics

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics pages 311ndash318

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word repre-sentations arXiv preprint arXiv180205365

Maja Popovic 2017 chrf++ words helping charactern-grams In Proceedings of the second conferenceon machine translation pages 612ndash618

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the lim-its of transfer learning with a unified text-to-texttransformer Journal of Machine Learning Research21(140)1ndash67

Thibault Sellam Dipanjan Das and Ankur P Parikh2020 Bleurt Learning robust metrics for text gen-eration arXiv preprint arXiv200404696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A study oftranslation edit rate with targeted human annotationIn Proceedings of association for machine transla-tion in the Americas volume 200 Cambridge MA

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in neural information pro-cessing systems pages 5998ndash6008

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2019 Bertscore Eval-uating text generation with bert arXiv preprintarXiv190409675

Chao Zhao Marilyn Walker and Snigdha Chaturvedi2020 Bridging the structural gap between encod-ing and decoding for data-to-text generation In Pro-ceedings of the 58th Annual Meeting of the Associa-tion for Computational Linguistics volume 1

166

Figure 2 An Overview Pipeline of RDF-to-Text Gen-eration (a) RDF graph as input (b) Linearization formof RDF graph and (c) Generated Text

on our preliminary results we found that chang-ing the order of the triples before the linearizationaffected the BLEU score In this paper we there-fore present a Text-to-text pre-training model forthe data-to-text generation with plan selection Inour approach we follow the study (Kale 2020) tobuild the RDF-to-Text Generation pipeline Thenwe develop the graph neural network (GNN) basedmodel to score the plan associating with the tripleorder The triple order with the highest plan scoreis linearized and feed to Generator in the RDF-to-Text Generation pipeline

2 Approach

In our approach there are two steps 1) RDF-to-Text Generation and 2) Plan Selection RDF-to-Text generation aims to transform input triples intonatural language text describing them while PlanSelection is the pre-processing process to select theorder of the triples for the RDF-to-Text Generationpipeline Note that Plan Selection is performed be-fore RDF-to-Text Generation in the testing phase

21 RDF-to-Text Generation

In the RDF-to-text generation step we follow asimilar approach with the study (Kale 2020) tobuild the pipeline As illustrated in 2 there are twosteps in the pipeline of RDF-to-Text Generation 1)Linearization and 2) Generator

211 LinearizationLinearization is to convert RDF graph into the se-quence so that the conventional seq-2-seq modelcan be applied Specifically given a set of triples

T = (s1 p1 o1) (s2 p2 o2) (sn pn on)the linearization maps the triples into the wordsequence as follows lt s gt s1 lt p gt p1 lt o gto1 lt s gt s2 lt s gt sn lt p gt pn lt o gt on Anexample of the linearization is shown in Figure 2 (a)and (b) In Figure 2 (a) the input triples are givenas the RDF graph and the linearization form of thegraph is converted as in Figure 2 (b) Note thatusually in an RDF graph rsquoDonald Trumprsquo is con-catenated into one single token rsquoDonald TrumprsquoIn the conversion process we therefore removeunderscore ( ) in the entity name to recover theoriginal words denoting the entity

212 GeneratorGenerator is to generate natural language text cor-responding to the given word sequence from thelinearization step as shown in Figure 2 (b) and(c) In our study we employ Text-To-Text Trans-fer Transformer (T5) pre-trained models releasedby the study (Raffel et al 2020) as the generatormodel The T5 model is built on a transformer-based encoder-decoder architecture The modelwas trained through an unsupervised multi-tasking(span masking) in the Colossal Clean Crawled Cor-pus (C4) dataset as well as through supervisedlearning tasks including translation summariza-tion classification and question answering In thetraining phase we fine-tune the T5 model by usingthe linearization form of the RDF graph as the inputto target the text description as the output Duringthe fine-tuning all parameters in the models areupdated

22 Plan Selection

The objective of Plan Selection is to computescores of linearization forms of triples with dif-ferent orders of triple and then find the order oftriples providing the highest score as the selectedplan

The intuition behind this idea is that permutingthe order of triples during the linearization mayyield different texts To concretely elaborate thissensitivity of Generator in the pipeline we illus-trate the example of the permutation of the order oftriples in Figure 3 and discuss the intuition usingthis example Since there are 3 triples in the exam-ple we can linearize RDF triples into 6 patternsowing to the order of triples When we input these6 linearization forms of triples into Generator inthe pipeline Generator returns the different gener-ated texts For example with the linearization 1 it

160

Figure 3 The example of the permuting the triple or-ders and their Linearization forms

may generate ldquoJohn E Blaha (1942-08-26) born inSan Antonio worked as a fighter pilotrdquo while withthe linearization 3 it may generate rdquoJohn E Blahaborn in San Antonio on 1942-08-26 worked as afighter pilotrdquo We then evaluate generated textsand found out that the BLEU score of generatedtexts on the same RDF graph may differ When wemanually select the order of triples providing thehighest BLEU score to examine the upperboundthe BLEU score improved by around 10 points onthe development set of WebNLG 2020 Based onthis preliminary experiment we therefore aim tobuild Plan Selection which helps select the order oftriples for Generator in the RDF-to-Text pipeline

In Plan Selection there are 2 steps 1) Plan Gen-eration and 2) Plan Evaluation

221 Plan GenerationPlan Generation is to generate all possible ordersof triples given the input triples as plans In thePlan Generation step we create all combinationsof the triple orders The concrete example is shownin Figure 3 In the example we can create 6 ordersof triples Since RDF graphs in WenNLG 2020Challange are small exhaustive plan generation isfeasible

222 Plan EvaluationPlan Evaluation is to estimate the score of a planie an order of triples We design the Plan Eval-uation architecture into three components 1) Lin-earization Representation 2) RDF Graph Repre-sentation and 3) Plan Score as shown in Figure4 Linearization Representation aims to learn therepresentation of the Linearization form of triplesby the given order RDF Graph representation isto learn the representation of the RDF graph in thelow dimensional vector space Plan Score is tocompute the score for the given triple order corre-

Figure 4 The overview design of Plan Selection

sponding to the RDF graph by using LinearizationRepresentation and RDF Graph Representation

Linearization Representation is to project theword sequence that linearizes from one order oftriples into a low dimensional vector space In Lin-ear Representation we learn the representation byfeeding the word sequence into the contextualizedlanguage representation Recently contextualizedlanguage representations (Devlin et al 2019 Pe-ters et al 2018 Radford et al 2019 Lewis et al2019) gain wide attention from the NLP commu-nity Bidirectional Encoder Representations fromTransformers (BERT) (Devlin et al 2019) is oneof popular pre-trained models based on the trans-former architecture (Vaswani et al 2017) Dueto its performance it has been used in many NLPtasks We therefore utilize BERT to learn Lin-earization representations Given the RDF graphG and the plan p we can generate the lineariza-tion form as the sequence L = w1w2w3wn bymapping the plan to sequence as demonstrated inthe example of Figure 3 After obtained the se-quence L we always add a special classificationtoken [CLS] as the beginning of the sequence LThen the sequence L with [CLS] token is fed to thepre-trained BERT model After feeding the inputsequence to the pre-train model the final vectorrepresentation C corresponding to [CLS] token isused as the Linearization Representation

RDF Graph Representation is to embed theentire graph into the low dimensional vector spaceIn RDF Graph Representation we first employNode2Vec (Grover and Leskovec 2016) to gen-erate the node attributes We do not represent thenode attributes with one-hot vector encoding due

161

to the sparseness of the RDF graph Then we useGraphConv (Morris et al 2019) with the globalaverage pool to compute the representation of theRDF graph We formally define the computation inthis component as follows Given the RDF GraphG = (XA) where X is the matrix representingthe node attributes and A is the adjacency matrixof RDF graph we learn the RDF Graph Represen-tation by the following equation

xprimei = W1xi +

sum

jisinN(i)

W2xj (1)

where xprimei is the representation of the i-th node xi

is the attribute vector of the i-th node xj is theattribute vector of the j-th node which is the neigh-bor node of the i-th node N(i) is a set of nodesconnecting to the i-th node W1 and W2 are thelearnable weight matrix trained by propagating theloss in Eq 4 We then use the calculated representa-tions to obtain the RDF representation of graph G

XG =1

|V |sum

visinVxprimev (2)

where xprimei is the representation of the i-th node and

V is the set of the nodes in GPlan Score computes the score by using Lin-

earization Representation and RDF Graph Repre-sentation We concatenate Linearization Represen-tation C and RDF Graph Representation XG intoa single vector representation and feed it to thefeed-forward neural network (FFNN) as shown inEquation (3) The output of FFNN is the score ofthe plan subjected to the RDF graph

y = [CXG]WTFFNN + b (3)

where WFFNN and b are trainable parameters forFFNN

In the training phase we need to build the scorefor each plan as the target for Plan Selection tolearn To obtain the score we feed all possiblelinearization of triple orders into the RDF-to-Textgeneration pipeline Then we compute the BLEUscore for each order of triples from the corpusWith this strategy we can get the plan associatedwith the BLEU score based on the model of thepipeline We then use the following objective func-tion to optimize all trainable parameters in PlanSelection

L =sum

sisinS

sum

pisinP (s)

|ys minus ysp| (4)

where S is a WebNLG 2020 corpus P (s) is a setof plans for sample s in the corpus

In the testing phase we compute the scores ofall plans for the given RDF graph and then selectthe plan with the highest score The triple orderassociating with that plan is linearized as the wordsequence and input to Generator in the pipeline togenerate text description

3 Experiments and Results

31 Experimental SetupThe experimental setup is as follows

Datasets The dataset used in the experimentis WebNLG 2020 Challenge on the RDF-to-Text(English) task In the task there are 35426 4464and 1779 RDF Graph-Text pairs for training val-idating and testing respectively RDF Graph-textpairs are generated from 16 distinct DBpedia (Aueret al 2007) categories Athlete Artist Celestial-Body MeanOfTransportation Politician AirportAstronaut Building City ComicsCharacter FoodMonument SportsTeam University WrittenWorkand Company

Settings In our pipeline we use the pre-trainedmodel t5-base from the huggingface repository3We set the hyperparameters in the fine-tuning pro-cess as follows batch 8 learning rate 20times10minus5epochs 20 maximum tokens 512 and optimizerAdam (Kingma and Ba 2014) The default valuesare used for the other hyperparameters

For Plan Selection we set the following hyperpa-rameters batch 8 learning rate 0001 epoch 10and optimizer Adam In each component of PlanSelection we set the hyperparameters as followsIn Linearization representation we select the bert-base-uncased model as the BERT model with thedefault setting In RDF Graph Representation weset the following hyperparameters for Node2Vec4vector dimension 128 window size 10 batch 32walk length 80 walk step 10 and min count 1 tocreate node attributes We stack GraphConv5 into 3layers and use Global Average Pool as the poolingmethod In FF Neural Network we use 2 hiddenlayers with 410 neurons and the ReLU activation

Baseline The baseline in the experiment isprovided by WebNLG 2020 Challenge6

3httpshuggingfacecomodels4httpsgithubcomeliorcnode2vec5httpspytorch-geometricreadthedocs

ioenlatestmodulesnnhtmltorch_geometricnnconvGraphConv

6httpswebnlg-challengeloriafr

162

Table 1 Overall Results of Our Approach on RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5093 5043 4057BLEU NLTK 0482 0476 0396METEOR 0384 0382 0373CHRF++ 0636 0637 0621TER 0454 0439 0517BERT P 0952 0955 0946BERT R 0947 0949 0941BERT F1 0949 0951 0943BLEURT 054 057 047

Table 2 Results on Seen Category of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 6053 6099 4295BLEU NLTK 0522 0518 0415METEOR 0380 0381 0387CHRF++ 0638 0641 0650TER 0458 0432 0563BERT P 0955 0961 0945BERT R 0946 0948 0942BERT F1 0950 0954 0943BLEURT 051 055 041

Evaluation Settings There are two evaluationsettings 1) Automatic Evaluation and 2) HumanEvaluation

In Automatic Evaluation the evaluationmetrics are BLEU (Papineni et al 2002)BLEU NLTK7 METEOR (Banerjee and Lavie2005) CHRF++ (Popovic 2017) TER (Snoveret al 2006) BERT Precision (BERT P) (Zhanget al 2019) BERT Recall (BERT R) (Zhanget al 2019) BERT F1 (Zhang et al 2019) andBLEURT (Sellam et al 2020)

In Human Evaluation the evaluation metrics are

bull Data Coverage this metric assesses howmuch information from the data has been cov-ered in the text

challenge_20207httpswwwnltkorg_modulesnltk

translatebleu_scorehtml

Table 3 Results on Unseen Category of Our Approachon RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 4382 4307 3756BLEU NLTK 0436 0430 0370METEOR 0379 0376 0357CHRF++ 0618 0618 0584TER 0472 0456 0510BERT P 0947 0949 0944BERT R 0944 0945 0936BERT F1 0945 0946 0940BLEURT 050 054 044

Table 4 Results on Unseen Entity of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5074 4940 4022BLEU NLTK 0504 0490 0393METEOR 0405 0398 0384CHRF++ 0672 0669 0648TER 0410 0410 0476BERT P 0958 0961 0949BERT R 0956 0958 0950BERT F1 0957 0959 0949BLEURT 061 064 055

bull Relevance this metric assesses whether thetext contains any non-presented predicates

bull Correctness this metric assesses whether thetext describes predicates with correct objects

bull Text Structure this metric assesses whetherthe text is grammatical and well-structuredwritten in good English

bull Fluency this metric assesses the naturalnessof generated texts

32 Results321 Automatic EvaluationIn Automatic Evaluation the results are listed inTables 1-4 These results are provided by WebNLG2020 (Castro-Ferreira et al 2020 Moussalemet al 2020)

163

Figure 5 Overall Results of Human Evaluation onRDF-to-Text of WebNLG 2020 (English)

Figure 6 Seen Category Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

In our approaches the pipeline with Plan Selec-tion slightly outperforms the pipeline without PlanSelection on both BLEU score metrics (BLEU andBLEU NLTK) as shown in Table 1 In the seencategory as presented in Table 2 it turns out thatour approach with Plan Selection slightly yieldsbetter performance than the approach without PlanSelection on only BLEU NLTK With this obser-vation the Plan Selection model therefore doesnot contribute much to the seen category and evendegrades the performance in some cases Neverthe-less on the unseen category and unseen entity eval-uation in Tables3-4 we found our approach withPlan Selection improves both BLEU score metricswhen comparing with our approach without PlanSelection

Based on these results in Tables 2-4 the ap-proach building on the T5 model works well onthe seen category and it may not require the planselection process Still on the unseen categoryand unseen entity there is a room where the planselection could play a role in improving the perfor-mances

So far we are interested in the BLEU metric be-cause we optimize the Plan Selection model using

Figure 7 Unseen Category Results of Human Evalua-tion on RDF-to-Text of WebNLG 2020 (English)

Figure 8 Unseen Entity Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

the BLEU metric To further investigate the planselection in this approach we might optimize theplan selection using other metrics

Overall our approach with and without the planselection outperforms the baseline in most metricsexcept for CHRF++ as shown in Table 1 Althoughour approach with Plan Selection could slightlyimprove some metrics the results do not explic-itly surpass our approach without Plan SelectionTherefore the current contribution of Plan Selec-tion is very limited

322 Human EvaluationIn Human Evaluation the results are illustratedin Figure 5-8 These results are also provided byWebNLG 2020 (Castro-Ferreira et al 2020 Mous-salem et al 2020) Due to the limitation of theWebNLG 2020 submission only our approach withPlan Selection is evaluated in the Human Evalua-tion setting Note that the reference texts used inthe automatic evaluation are also evaluated in thissetting

Overall results in Figure 5 show our approachoutperforms the baseline in the relevance metricthe text structure metric and the fluency metric

164

however the baseline provides the better results inthe data coverage metric and the correctness metricAccording to the statistical testing we found statis-tically significant differences in the data coverageresult the correctness result and the fluency re-sult This indicates our approach provides fluencytext while it lacks the coverage and the correctnesscorresponding to RDF inputs

To further investigate the results we analyze theresults based on the seen category the unseen cat-egory and the unseen entity As shown in Figure6-8 the data coverage result the text structure andthe fluency follow the overall resultrsquos trend Nev-ertheless the relevance result and the correctnessresult are varied It turns out that on the unseenentity our approach could not outperform the base-line in the relevance result however our approachachieves even better relevance result than the ref-erence texts in the unseen category In the seencategory our approach yields higher correctness re-sult than the baseline while in the unseen categoryand unseen entity the correctness results follow theoverall trend

In Human Evaluation we notice that our ap-proach severely suffers from the data coverage asshown in Figure 6-8 The gap between our ap-proach and the reference (even the baseline) is largeThis is the limitation of our approach where we donot force the model to use all inputs

4 Conclusion

In this paper we introduce RDF-to-Text Generationwith Plan Selection for WebNLG 2020 challengeThe approach comprises two parts 1) the pipelinefor RDF-to-Text Generation and 2) Plan SelectionIn the pipeline we follow the state-of-the-art modelfor RDF-to-Text Generation (Kale 2020) while inPlan Selection we propose a method for selectingthe order of triples for the pipeline Overall weobserved a slight improvement in the BLEU scorewhen applying Plan Selection Nevertheless PlanSelection only optimized the BLEU score in thisstudy In the future we aim to investigate optimizePlan Selection in other metrics

Acknowledgments

This paper is based on results obtained from aproject commissioned by the New Energy andIndustrial Technology Development Organization(NEDO) JPNP20006

ReferencesSoren Auer Christian Bizer Georgi Kobilarov Jens

Lehmann Richard Cyganiak and Zachary Ives2007 Dbpedia A nucleus for a web of open dataIn The semantic web pages 722ndash735 Springer

Satanjeev Banerjee and Alon Lavie 2005 Meteor Anautomatic metric for mt evaluation with improvedcorrelation with human judgments In Proceedingsof the acl workshop on intrinsic and extrinsic evalu-ation measures for machine translation andor sum-marization pages 65ndash72

Thiago Castro-Ferreira Claire Gardent NikolaiIlinykh Chris van der Lee Simon Mille DiegoMoussalem and Anastasia Shimorina 2020 The2020 bilingual bi-directional webnlg+ shared taskOverview and evaluation results (webnlg+ 2020) InProceedings of the 3rd WebNLG Workshop on Nat-ural Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Associ-ation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conference ofNAACL-HLT pages 4171ndash4186

Bayu Distiawan Jianzhong Qi Rui Zhang and WeiWang 2018 Gtr-lstm A triple encoder for sentencegeneration from rdf data In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages1627ndash1637

Thiago Castro Ferreira Diego Moussallem EmielKrahmer and Sander Wubben 2018 Enriching thewebnlg corpus In Proceedings of the 11th Interna-tional Conference on Natural Language Generationpages 171ndash176

Claire Gardent Anastasia Shimorina Shashi Narayanand Laura Perez-Beltrachini 2017 The webnlgchallenge Generating text from rdf data In Pro-ceedings of the 10th International Conference onNatural Language Generation pages 124ndash133

Aditya Grover and Jure Leskovec 2016 node2vecScalable feature learning for networks In Proceed-ings of the 22nd ACM SIGKDD international con-ference on Knowledge discovery and data miningpages 855ndash864

Shizhu He Cao Liu Kang Liu and Jun Zhao2017 Generating natural answers by incorporatingcopying and retrieving mechanisms in sequence-to-sequence learning In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1 Long Papers) pages 199ndash208

Mihir Kale 2020 Text-to-text pre-training for data-to-text tasks arXiv preprint arXiv200510433

165

Diederik P Kingma and Jimmy Ba 2014 Adam Amethod for stochastic optimization arXiv preprintarXiv14126980

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Ves Stoyanov and Luke Zettlemoyer 2019Bart Denoising sequence-to-sequence pre-trainingfor natural language generation translation andcomprehension arXiv preprint arXiv191013461

Christopher Morris Martin Ritzert Matthias FeyWilliam L Hamilton Jan Eric Lenssen Gaurav Rat-tan and Martin Grohe 2019 Weisfeiler and lemango neural Higher-order graph neural networks InProceedings of the AAAI Conference on Artificial In-telligence volume 33 pages 4602ndash4609

Amit Moryossef Yoav Goldberg and Ido Dagan 2019Step-by-step Separating planning from realizationin neural data-to-text generation In Proceedings ofthe 2019 Conference of the North American Chap-ter of the Association for Computational LinguisticsHuman Language Technologies Volume 1 (Longand Short Papers) pages 2267ndash2277

Diego Moussalem Paramjot Kaur Thiago Castro-Ferreira Chris van der Lee Conrads Felix Shimo-rina Anastasia Michael Roder Rene Speck ClaireGardent Simon Mille Nikolai Ilinykh and Axel-Cyrille Ngonga Ngomo 2020 A general bench-marking framework for text generation In Pro-ceedings of the 3rd WebNLG Workshop on Natu-ral Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Asso-ciation for Computational Linguistics

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics pages 311ndash318

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word repre-sentations arXiv preprint arXiv180205365

Maja Popovic 2017 chrf++ words helping charactern-grams In Proceedings of the second conferenceon machine translation pages 612ndash618

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the lim-its of transfer learning with a unified text-to-texttransformer Journal of Machine Learning Research21(140)1ndash67

Thibault Sellam Dipanjan Das and Ankur P Parikh2020 Bleurt Learning robust metrics for text gen-eration arXiv preprint arXiv200404696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A study oftranslation edit rate with targeted human annotationIn Proceedings of association for machine transla-tion in the Americas volume 200 Cambridge MA

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in neural information pro-cessing systems pages 5998ndash6008

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2019 Bertscore Eval-uating text generation with bert arXiv preprintarXiv190409675

Chao Zhao Marilyn Walker and Snigdha Chaturvedi2020 Bridging the structural gap between encod-ing and decoding for data-to-text generation In Pro-ceedings of the 58th Annual Meeting of the Associa-tion for Computational Linguistics volume 1

166

Figure 3 The example of the permuting the triple or-ders and their Linearization forms

may generate ldquoJohn E Blaha (1942-08-26) born inSan Antonio worked as a fighter pilotrdquo while withthe linearization 3 it may generate rdquoJohn E Blahaborn in San Antonio on 1942-08-26 worked as afighter pilotrdquo We then evaluate generated textsand found out that the BLEU score of generatedtexts on the same RDF graph may differ When wemanually select the order of triples providing thehighest BLEU score to examine the upperboundthe BLEU score improved by around 10 points onthe development set of WebNLG 2020 Based onthis preliminary experiment we therefore aim tobuild Plan Selection which helps select the order oftriples for Generator in the RDF-to-Text pipeline

In Plan Selection there are 2 steps 1) Plan Gen-eration and 2) Plan Evaluation

221 Plan GenerationPlan Generation is to generate all possible ordersof triples given the input triples as plans In thePlan Generation step we create all combinationsof the triple orders The concrete example is shownin Figure 3 In the example we can create 6 ordersof triples Since RDF graphs in WenNLG 2020Challange are small exhaustive plan generation isfeasible

222 Plan EvaluationPlan Evaluation is to estimate the score of a planie an order of triples We design the Plan Eval-uation architecture into three components 1) Lin-earization Representation 2) RDF Graph Repre-sentation and 3) Plan Score as shown in Figure4 Linearization Representation aims to learn therepresentation of the Linearization form of triplesby the given order RDF Graph representation isto learn the representation of the RDF graph in thelow dimensional vector space Plan Score is tocompute the score for the given triple order corre-

Figure 4 The overview design of Plan Selection

sponding to the RDF graph by using LinearizationRepresentation and RDF Graph Representation

Linearization Representation is to project theword sequence that linearizes from one order oftriples into a low dimensional vector space In Lin-ear Representation we learn the representation byfeeding the word sequence into the contextualizedlanguage representation Recently contextualizedlanguage representations (Devlin et al 2019 Pe-ters et al 2018 Radford et al 2019 Lewis et al2019) gain wide attention from the NLP commu-nity Bidirectional Encoder Representations fromTransformers (BERT) (Devlin et al 2019) is oneof popular pre-trained models based on the trans-former architecture (Vaswani et al 2017) Dueto its performance it has been used in many NLPtasks We therefore utilize BERT to learn Lin-earization representations Given the RDF graphG and the plan p we can generate the lineariza-tion form as the sequence L = w1w2w3wn bymapping the plan to sequence as demonstrated inthe example of Figure 3 After obtained the se-quence L we always add a special classificationtoken [CLS] as the beginning of the sequence LThen the sequence L with [CLS] token is fed to thepre-trained BERT model After feeding the inputsequence to the pre-train model the final vectorrepresentation C corresponding to [CLS] token isused as the Linearization Representation

RDF Graph Representation is to embed theentire graph into the low dimensional vector spaceIn RDF Graph Representation we first employNode2Vec (Grover and Leskovec 2016) to gen-erate the node attributes We do not represent thenode attributes with one-hot vector encoding due

161

to the sparseness of the RDF graph Then we useGraphConv (Morris et al 2019) with the globalaverage pool to compute the representation of theRDF graph We formally define the computation inthis component as follows Given the RDF GraphG = (XA) where X is the matrix representingthe node attributes and A is the adjacency matrixof RDF graph we learn the RDF Graph Represen-tation by the following equation

xprimei = W1xi +

sum

jisinN(i)

W2xj (1)

where xprimei is the representation of the i-th node xi

is the attribute vector of the i-th node xj is theattribute vector of the j-th node which is the neigh-bor node of the i-th node N(i) is a set of nodesconnecting to the i-th node W1 and W2 are thelearnable weight matrix trained by propagating theloss in Eq 4 We then use the calculated representa-tions to obtain the RDF representation of graph G

XG =1

|V |sum

visinVxprimev (2)

where xprimei is the representation of the i-th node and

V is the set of the nodes in GPlan Score computes the score by using Lin-

earization Representation and RDF Graph Repre-sentation We concatenate Linearization Represen-tation C and RDF Graph Representation XG intoa single vector representation and feed it to thefeed-forward neural network (FFNN) as shown inEquation (3) The output of FFNN is the score ofthe plan subjected to the RDF graph

y = [CXG]WTFFNN + b (3)

where WFFNN and b are trainable parameters forFFNN

In the training phase we need to build the scorefor each plan as the target for Plan Selection tolearn To obtain the score we feed all possiblelinearization of triple orders into the RDF-to-Textgeneration pipeline Then we compute the BLEUscore for each order of triples from the corpusWith this strategy we can get the plan associatedwith the BLEU score based on the model of thepipeline We then use the following objective func-tion to optimize all trainable parameters in PlanSelection

L =sum

sisinS

sum

pisinP (s)

|ys minus ysp| (4)

where S is a WebNLG 2020 corpus P (s) is a setof plans for sample s in the corpus

In the testing phase we compute the scores ofall plans for the given RDF graph and then selectthe plan with the highest score The triple orderassociating with that plan is linearized as the wordsequence and input to Generator in the pipeline togenerate text description

3 Experiments and Results

31 Experimental SetupThe experimental setup is as follows

Datasets The dataset used in the experimentis WebNLG 2020 Challenge on the RDF-to-Text(English) task In the task there are 35426 4464and 1779 RDF Graph-Text pairs for training val-idating and testing respectively RDF Graph-textpairs are generated from 16 distinct DBpedia (Aueret al 2007) categories Athlete Artist Celestial-Body MeanOfTransportation Politician AirportAstronaut Building City ComicsCharacter FoodMonument SportsTeam University WrittenWorkand Company

Settings In our pipeline we use the pre-trainedmodel t5-base from the huggingface repository3We set the hyperparameters in the fine-tuning pro-cess as follows batch 8 learning rate 20times10minus5epochs 20 maximum tokens 512 and optimizerAdam (Kingma and Ba 2014) The default valuesare used for the other hyperparameters

For Plan Selection we set the following hyperpa-rameters batch 8 learning rate 0001 epoch 10and optimizer Adam In each component of PlanSelection we set the hyperparameters as followsIn Linearization representation we select the bert-base-uncased model as the BERT model with thedefault setting In RDF Graph Representation weset the following hyperparameters for Node2Vec4vector dimension 128 window size 10 batch 32walk length 80 walk step 10 and min count 1 tocreate node attributes We stack GraphConv5 into 3layers and use Global Average Pool as the poolingmethod In FF Neural Network we use 2 hiddenlayers with 410 neurons and the ReLU activation

Baseline The baseline in the experiment isprovided by WebNLG 2020 Challenge6

3httpshuggingfacecomodels4httpsgithubcomeliorcnode2vec5httpspytorch-geometricreadthedocs

ioenlatestmodulesnnhtmltorch_geometricnnconvGraphConv

6httpswebnlg-challengeloriafr

162

Table 1 Overall Results of Our Approach on RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5093 5043 4057BLEU NLTK 0482 0476 0396METEOR 0384 0382 0373CHRF++ 0636 0637 0621TER 0454 0439 0517BERT P 0952 0955 0946BERT R 0947 0949 0941BERT F1 0949 0951 0943BLEURT 054 057 047

Table 2 Results on Seen Category of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 6053 6099 4295BLEU NLTK 0522 0518 0415METEOR 0380 0381 0387CHRF++ 0638 0641 0650TER 0458 0432 0563BERT P 0955 0961 0945BERT R 0946 0948 0942BERT F1 0950 0954 0943BLEURT 051 055 041

Evaluation Settings There are two evaluationsettings 1) Automatic Evaluation and 2) HumanEvaluation

In Automatic Evaluation the evaluationmetrics are BLEU (Papineni et al 2002)BLEU NLTK7 METEOR (Banerjee and Lavie2005) CHRF++ (Popovic 2017) TER (Snoveret al 2006) BERT Precision (BERT P) (Zhanget al 2019) BERT Recall (BERT R) (Zhanget al 2019) BERT F1 (Zhang et al 2019) andBLEURT (Sellam et al 2020)

In Human Evaluation the evaluation metrics are

bull Data Coverage this metric assesses howmuch information from the data has been cov-ered in the text

challenge_20207httpswwwnltkorg_modulesnltk

translatebleu_scorehtml

Table 3 Results on Unseen Category of Our Approachon RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 4382 4307 3756BLEU NLTK 0436 0430 0370METEOR 0379 0376 0357CHRF++ 0618 0618 0584TER 0472 0456 0510BERT P 0947 0949 0944BERT R 0944 0945 0936BERT F1 0945 0946 0940BLEURT 050 054 044

Table 4 Results on Unseen Entity of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5074 4940 4022BLEU NLTK 0504 0490 0393METEOR 0405 0398 0384CHRF++ 0672 0669 0648TER 0410 0410 0476BERT P 0958 0961 0949BERT R 0956 0958 0950BERT F1 0957 0959 0949BLEURT 061 064 055

bull Relevance this metric assesses whether thetext contains any non-presented predicates

bull Correctness this metric assesses whether thetext describes predicates with correct objects

bull Text Structure this metric assesses whetherthe text is grammatical and well-structuredwritten in good English

bull Fluency this metric assesses the naturalnessof generated texts

32 Results321 Automatic EvaluationIn Automatic Evaluation the results are listed inTables 1-4 These results are provided by WebNLG2020 (Castro-Ferreira et al 2020 Moussalemet al 2020)

163

Figure 5 Overall Results of Human Evaluation onRDF-to-Text of WebNLG 2020 (English)

Figure 6 Seen Category Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

In our approaches the pipeline with Plan Selec-tion slightly outperforms the pipeline without PlanSelection on both BLEU score metrics (BLEU andBLEU NLTK) as shown in Table 1 In the seencategory as presented in Table 2 it turns out thatour approach with Plan Selection slightly yieldsbetter performance than the approach without PlanSelection on only BLEU NLTK With this obser-vation the Plan Selection model therefore doesnot contribute much to the seen category and evendegrades the performance in some cases Neverthe-less on the unseen category and unseen entity eval-uation in Tables3-4 we found our approach withPlan Selection improves both BLEU score metricswhen comparing with our approach without PlanSelection

Based on these results in Tables 2-4 the ap-proach building on the T5 model works well onthe seen category and it may not require the planselection process Still on the unseen categoryand unseen entity there is a room where the planselection could play a role in improving the perfor-mances

So far we are interested in the BLEU metric be-cause we optimize the Plan Selection model using

Figure 7 Unseen Category Results of Human Evalua-tion on RDF-to-Text of WebNLG 2020 (English)

Figure 8 Unseen Entity Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

the BLEU metric To further investigate the planselection in this approach we might optimize theplan selection using other metrics

Overall our approach with and without the planselection outperforms the baseline in most metricsexcept for CHRF++ as shown in Table 1 Althoughour approach with Plan Selection could slightlyimprove some metrics the results do not explic-itly surpass our approach without Plan SelectionTherefore the current contribution of Plan Selec-tion is very limited

322 Human EvaluationIn Human Evaluation the results are illustratedin Figure 5-8 These results are also provided byWebNLG 2020 (Castro-Ferreira et al 2020 Mous-salem et al 2020) Due to the limitation of theWebNLG 2020 submission only our approach withPlan Selection is evaluated in the Human Evalua-tion setting Note that the reference texts used inthe automatic evaluation are also evaluated in thissetting

Overall results in Figure 5 show our approachoutperforms the baseline in the relevance metricthe text structure metric and the fluency metric

164

however the baseline provides the better results inthe data coverage metric and the correctness metricAccording to the statistical testing we found statis-tically significant differences in the data coverageresult the correctness result and the fluency re-sult This indicates our approach provides fluencytext while it lacks the coverage and the correctnesscorresponding to RDF inputs

To further investigate the results we analyze theresults based on the seen category the unseen cat-egory and the unseen entity As shown in Figure6-8 the data coverage result the text structure andthe fluency follow the overall resultrsquos trend Nev-ertheless the relevance result and the correctnessresult are varied It turns out that on the unseenentity our approach could not outperform the base-line in the relevance result however our approachachieves even better relevance result than the ref-erence texts in the unseen category In the seencategory our approach yields higher correctness re-sult than the baseline while in the unseen categoryand unseen entity the correctness results follow theoverall trend

In Human Evaluation we notice that our ap-proach severely suffers from the data coverage asshown in Figure 6-8 The gap between our ap-proach and the reference (even the baseline) is largeThis is the limitation of our approach where we donot force the model to use all inputs

4 Conclusion

In this paper we introduce RDF-to-Text Generationwith Plan Selection for WebNLG 2020 challengeThe approach comprises two parts 1) the pipelinefor RDF-to-Text Generation and 2) Plan SelectionIn the pipeline we follow the state-of-the-art modelfor RDF-to-Text Generation (Kale 2020) while inPlan Selection we propose a method for selectingthe order of triples for the pipeline Overall weobserved a slight improvement in the BLEU scorewhen applying Plan Selection Nevertheless PlanSelection only optimized the BLEU score in thisstudy In the future we aim to investigate optimizePlan Selection in other metrics

Acknowledgments

This paper is based on results obtained from aproject commissioned by the New Energy andIndustrial Technology Development Organization(NEDO) JPNP20006

ReferencesSoren Auer Christian Bizer Georgi Kobilarov Jens

Lehmann Richard Cyganiak and Zachary Ives2007 Dbpedia A nucleus for a web of open dataIn The semantic web pages 722ndash735 Springer

Satanjeev Banerjee and Alon Lavie 2005 Meteor Anautomatic metric for mt evaluation with improvedcorrelation with human judgments In Proceedingsof the acl workshop on intrinsic and extrinsic evalu-ation measures for machine translation andor sum-marization pages 65ndash72

Thiago Castro-Ferreira Claire Gardent NikolaiIlinykh Chris van der Lee Simon Mille DiegoMoussalem and Anastasia Shimorina 2020 The2020 bilingual bi-directional webnlg+ shared taskOverview and evaluation results (webnlg+ 2020) InProceedings of the 3rd WebNLG Workshop on Nat-ural Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Associ-ation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conference ofNAACL-HLT pages 4171ndash4186

Bayu Distiawan Jianzhong Qi Rui Zhang and WeiWang 2018 Gtr-lstm A triple encoder for sentencegeneration from rdf data In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages1627ndash1637

Thiago Castro Ferreira Diego Moussallem EmielKrahmer and Sander Wubben 2018 Enriching thewebnlg corpus In Proceedings of the 11th Interna-tional Conference on Natural Language Generationpages 171ndash176

Claire Gardent Anastasia Shimorina Shashi Narayanand Laura Perez-Beltrachini 2017 The webnlgchallenge Generating text from rdf data In Pro-ceedings of the 10th International Conference onNatural Language Generation pages 124ndash133

Aditya Grover and Jure Leskovec 2016 node2vecScalable feature learning for networks In Proceed-ings of the 22nd ACM SIGKDD international con-ference on Knowledge discovery and data miningpages 855ndash864

Shizhu He Cao Liu Kang Liu and Jun Zhao2017 Generating natural answers by incorporatingcopying and retrieving mechanisms in sequence-to-sequence learning In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1 Long Papers) pages 199ndash208

Mihir Kale 2020 Text-to-text pre-training for data-to-text tasks arXiv preprint arXiv200510433

165

Diederik P Kingma and Jimmy Ba 2014 Adam Amethod for stochastic optimization arXiv preprintarXiv14126980

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Ves Stoyanov and Luke Zettlemoyer 2019Bart Denoising sequence-to-sequence pre-trainingfor natural language generation translation andcomprehension arXiv preprint arXiv191013461

Christopher Morris Martin Ritzert Matthias FeyWilliam L Hamilton Jan Eric Lenssen Gaurav Rat-tan and Martin Grohe 2019 Weisfeiler and lemango neural Higher-order graph neural networks InProceedings of the AAAI Conference on Artificial In-telligence volume 33 pages 4602ndash4609

Amit Moryossef Yoav Goldberg and Ido Dagan 2019Step-by-step Separating planning from realizationin neural data-to-text generation In Proceedings ofthe 2019 Conference of the North American Chap-ter of the Association for Computational LinguisticsHuman Language Technologies Volume 1 (Longand Short Papers) pages 2267ndash2277

Diego Moussalem Paramjot Kaur Thiago Castro-Ferreira Chris van der Lee Conrads Felix Shimo-rina Anastasia Michael Roder Rene Speck ClaireGardent Simon Mille Nikolai Ilinykh and Axel-Cyrille Ngonga Ngomo 2020 A general bench-marking framework for text generation In Pro-ceedings of the 3rd WebNLG Workshop on Natu-ral Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Asso-ciation for Computational Linguistics

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics pages 311ndash318

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word repre-sentations arXiv preprint arXiv180205365

Maja Popovic 2017 chrf++ words helping charactern-grams In Proceedings of the second conferenceon machine translation pages 612ndash618

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the lim-its of transfer learning with a unified text-to-texttransformer Journal of Machine Learning Research21(140)1ndash67

Thibault Sellam Dipanjan Das and Ankur P Parikh2020 Bleurt Learning robust metrics for text gen-eration arXiv preprint arXiv200404696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A study oftranslation edit rate with targeted human annotationIn Proceedings of association for machine transla-tion in the Americas volume 200 Cambridge MA

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in neural information pro-cessing systems pages 5998ndash6008

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2019 Bertscore Eval-uating text generation with bert arXiv preprintarXiv190409675

Chao Zhao Marilyn Walker and Snigdha Chaturvedi2020 Bridging the structural gap between encod-ing and decoding for data-to-text generation In Pro-ceedings of the 58th Annual Meeting of the Associa-tion for Computational Linguistics volume 1

166

to the sparseness of the RDF graph Then we useGraphConv (Morris et al 2019) with the globalaverage pool to compute the representation of theRDF graph We formally define the computation inthis component as follows Given the RDF GraphG = (XA) where X is the matrix representingthe node attributes and A is the adjacency matrixof RDF graph we learn the RDF Graph Represen-tation by the following equation

xprimei = W1xi +

sum

jisinN(i)

W2xj (1)

where xprimei is the representation of the i-th node xi

is the attribute vector of the i-th node xj is theattribute vector of the j-th node which is the neigh-bor node of the i-th node N(i) is a set of nodesconnecting to the i-th node W1 and W2 are thelearnable weight matrix trained by propagating theloss in Eq 4 We then use the calculated representa-tions to obtain the RDF representation of graph G

XG =1

|V |sum

visinVxprimev (2)

where xprimei is the representation of the i-th node and

V is the set of the nodes in GPlan Score computes the score by using Lin-

earization Representation and RDF Graph Repre-sentation We concatenate Linearization Represen-tation C and RDF Graph Representation XG intoa single vector representation and feed it to thefeed-forward neural network (FFNN) as shown inEquation (3) The output of FFNN is the score ofthe plan subjected to the RDF graph

y = [CXG]WTFFNN + b (3)

where WFFNN and b are trainable parameters forFFNN

In the training phase we need to build the scorefor each plan as the target for Plan Selection tolearn To obtain the score we feed all possiblelinearization of triple orders into the RDF-to-Textgeneration pipeline Then we compute the BLEUscore for each order of triples from the corpusWith this strategy we can get the plan associatedwith the BLEU score based on the model of thepipeline We then use the following objective func-tion to optimize all trainable parameters in PlanSelection

L =sum

sisinS

sum

pisinP (s)

|ys minus ysp| (4)

where S is a WebNLG 2020 corpus P (s) is a setof plans for sample s in the corpus

In the testing phase we compute the scores ofall plans for the given RDF graph and then selectthe plan with the highest score The triple orderassociating with that plan is linearized as the wordsequence and input to Generator in the pipeline togenerate text description

3 Experiments and Results

31 Experimental SetupThe experimental setup is as follows

Datasets The dataset used in the experimentis WebNLG 2020 Challenge on the RDF-to-Text(English) task In the task there are 35426 4464and 1779 RDF Graph-Text pairs for training val-idating and testing respectively RDF Graph-textpairs are generated from 16 distinct DBpedia (Aueret al 2007) categories Athlete Artist Celestial-Body MeanOfTransportation Politician AirportAstronaut Building City ComicsCharacter FoodMonument SportsTeam University WrittenWorkand Company

Settings In our pipeline we use the pre-trainedmodel t5-base from the huggingface repository3We set the hyperparameters in the fine-tuning pro-cess as follows batch 8 learning rate 20times10minus5epochs 20 maximum tokens 512 and optimizerAdam (Kingma and Ba 2014) The default valuesare used for the other hyperparameters

For Plan Selection we set the following hyperpa-rameters batch 8 learning rate 0001 epoch 10and optimizer Adam In each component of PlanSelection we set the hyperparameters as followsIn Linearization representation we select the bert-base-uncased model as the BERT model with thedefault setting In RDF Graph Representation weset the following hyperparameters for Node2Vec4vector dimension 128 window size 10 batch 32walk length 80 walk step 10 and min count 1 tocreate node attributes We stack GraphConv5 into 3layers and use Global Average Pool as the poolingmethod In FF Neural Network we use 2 hiddenlayers with 410 neurons and the ReLU activation

Baseline The baseline in the experiment isprovided by WebNLG 2020 Challenge6

3httpshuggingfacecomodels4httpsgithubcomeliorcnode2vec5httpspytorch-geometricreadthedocs

ioenlatestmodulesnnhtmltorch_geometricnnconvGraphConv

6httpswebnlg-challengeloriafr

162

Table 1 Overall Results of Our Approach on RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5093 5043 4057BLEU NLTK 0482 0476 0396METEOR 0384 0382 0373CHRF++ 0636 0637 0621TER 0454 0439 0517BERT P 0952 0955 0946BERT R 0947 0949 0941BERT F1 0949 0951 0943BLEURT 054 057 047

Table 2 Results on Seen Category of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 6053 6099 4295BLEU NLTK 0522 0518 0415METEOR 0380 0381 0387CHRF++ 0638 0641 0650TER 0458 0432 0563BERT P 0955 0961 0945BERT R 0946 0948 0942BERT F1 0950 0954 0943BLEURT 051 055 041

Evaluation Settings There are two evaluationsettings 1) Automatic Evaluation and 2) HumanEvaluation

In Automatic Evaluation the evaluationmetrics are BLEU (Papineni et al 2002)BLEU NLTK7 METEOR (Banerjee and Lavie2005) CHRF++ (Popovic 2017) TER (Snoveret al 2006) BERT Precision (BERT P) (Zhanget al 2019) BERT Recall (BERT R) (Zhanget al 2019) BERT F1 (Zhang et al 2019) andBLEURT (Sellam et al 2020)

In Human Evaluation the evaluation metrics are

bull Data Coverage this metric assesses howmuch information from the data has been cov-ered in the text

challenge_20207httpswwwnltkorg_modulesnltk

translatebleu_scorehtml

Table 3 Results on Unseen Category of Our Approachon RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 4382 4307 3756BLEU NLTK 0436 0430 0370METEOR 0379 0376 0357CHRF++ 0618 0618 0584TER 0472 0456 0510BERT P 0947 0949 0944BERT R 0944 0945 0936BERT F1 0945 0946 0940BLEURT 050 054 044

Table 4 Results on Unseen Entity of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5074 4940 4022BLEU NLTK 0504 0490 0393METEOR 0405 0398 0384CHRF++ 0672 0669 0648TER 0410 0410 0476BERT P 0958 0961 0949BERT R 0956 0958 0950BERT F1 0957 0959 0949BLEURT 061 064 055

bull Relevance this metric assesses whether thetext contains any non-presented predicates

bull Correctness this metric assesses whether thetext describes predicates with correct objects

bull Text Structure this metric assesses whetherthe text is grammatical and well-structuredwritten in good English

bull Fluency this metric assesses the naturalnessof generated texts

32 Results321 Automatic EvaluationIn Automatic Evaluation the results are listed inTables 1-4 These results are provided by WebNLG2020 (Castro-Ferreira et al 2020 Moussalemet al 2020)

163

Figure 5 Overall Results of Human Evaluation onRDF-to-Text of WebNLG 2020 (English)

Figure 6 Seen Category Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

In our approaches the pipeline with Plan Selec-tion slightly outperforms the pipeline without PlanSelection on both BLEU score metrics (BLEU andBLEU NLTK) as shown in Table 1 In the seencategory as presented in Table 2 it turns out thatour approach with Plan Selection slightly yieldsbetter performance than the approach without PlanSelection on only BLEU NLTK With this obser-vation the Plan Selection model therefore doesnot contribute much to the seen category and evendegrades the performance in some cases Neverthe-less on the unseen category and unseen entity eval-uation in Tables3-4 we found our approach withPlan Selection improves both BLEU score metricswhen comparing with our approach without PlanSelection

Based on these results in Tables 2-4 the ap-proach building on the T5 model works well onthe seen category and it may not require the planselection process Still on the unseen categoryand unseen entity there is a room where the planselection could play a role in improving the perfor-mances

So far we are interested in the BLEU metric be-cause we optimize the Plan Selection model using

Figure 7 Unseen Category Results of Human Evalua-tion on RDF-to-Text of WebNLG 2020 (English)

Figure 8 Unseen Entity Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

the BLEU metric To further investigate the planselection in this approach we might optimize theplan selection using other metrics

Overall our approach with and without the planselection outperforms the baseline in most metricsexcept for CHRF++ as shown in Table 1 Althoughour approach with Plan Selection could slightlyimprove some metrics the results do not explic-itly surpass our approach without Plan SelectionTherefore the current contribution of Plan Selec-tion is very limited

322 Human EvaluationIn Human Evaluation the results are illustratedin Figure 5-8 These results are also provided byWebNLG 2020 (Castro-Ferreira et al 2020 Mous-salem et al 2020) Due to the limitation of theWebNLG 2020 submission only our approach withPlan Selection is evaluated in the Human Evalua-tion setting Note that the reference texts used inthe automatic evaluation are also evaluated in thissetting

Overall results in Figure 5 show our approachoutperforms the baseline in the relevance metricthe text structure metric and the fluency metric

164

however the baseline provides the better results inthe data coverage metric and the correctness metricAccording to the statistical testing we found statis-tically significant differences in the data coverageresult the correctness result and the fluency re-sult This indicates our approach provides fluencytext while it lacks the coverage and the correctnesscorresponding to RDF inputs

To further investigate the results we analyze theresults based on the seen category the unseen cat-egory and the unseen entity As shown in Figure6-8 the data coverage result the text structure andthe fluency follow the overall resultrsquos trend Nev-ertheless the relevance result and the correctnessresult are varied It turns out that on the unseenentity our approach could not outperform the base-line in the relevance result however our approachachieves even better relevance result than the ref-erence texts in the unseen category In the seencategory our approach yields higher correctness re-sult than the baseline while in the unseen categoryand unseen entity the correctness results follow theoverall trend

In Human Evaluation we notice that our ap-proach severely suffers from the data coverage asshown in Figure 6-8 The gap between our ap-proach and the reference (even the baseline) is largeThis is the limitation of our approach where we donot force the model to use all inputs

4 Conclusion

In this paper we introduce RDF-to-Text Generationwith Plan Selection for WebNLG 2020 challengeThe approach comprises two parts 1) the pipelinefor RDF-to-Text Generation and 2) Plan SelectionIn the pipeline we follow the state-of-the-art modelfor RDF-to-Text Generation (Kale 2020) while inPlan Selection we propose a method for selectingthe order of triples for the pipeline Overall weobserved a slight improvement in the BLEU scorewhen applying Plan Selection Nevertheless PlanSelection only optimized the BLEU score in thisstudy In the future we aim to investigate optimizePlan Selection in other metrics

Acknowledgments

This paper is based on results obtained from aproject commissioned by the New Energy andIndustrial Technology Development Organization(NEDO) JPNP20006

ReferencesSoren Auer Christian Bizer Georgi Kobilarov Jens

Lehmann Richard Cyganiak and Zachary Ives2007 Dbpedia A nucleus for a web of open dataIn The semantic web pages 722ndash735 Springer

Satanjeev Banerjee and Alon Lavie 2005 Meteor Anautomatic metric for mt evaluation with improvedcorrelation with human judgments In Proceedingsof the acl workshop on intrinsic and extrinsic evalu-ation measures for machine translation andor sum-marization pages 65ndash72

Thiago Castro-Ferreira Claire Gardent NikolaiIlinykh Chris van der Lee Simon Mille DiegoMoussalem and Anastasia Shimorina 2020 The2020 bilingual bi-directional webnlg+ shared taskOverview and evaluation results (webnlg+ 2020) InProceedings of the 3rd WebNLG Workshop on Nat-ural Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Associ-ation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conference ofNAACL-HLT pages 4171ndash4186

Bayu Distiawan Jianzhong Qi Rui Zhang and WeiWang 2018 Gtr-lstm A triple encoder for sentencegeneration from rdf data In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages1627ndash1637

Thiago Castro Ferreira Diego Moussallem EmielKrahmer and Sander Wubben 2018 Enriching thewebnlg corpus In Proceedings of the 11th Interna-tional Conference on Natural Language Generationpages 171ndash176

Claire Gardent Anastasia Shimorina Shashi Narayanand Laura Perez-Beltrachini 2017 The webnlgchallenge Generating text from rdf data In Pro-ceedings of the 10th International Conference onNatural Language Generation pages 124ndash133

Aditya Grover and Jure Leskovec 2016 node2vecScalable feature learning for networks In Proceed-ings of the 22nd ACM SIGKDD international con-ference on Knowledge discovery and data miningpages 855ndash864

Shizhu He Cao Liu Kang Liu and Jun Zhao2017 Generating natural answers by incorporatingcopying and retrieving mechanisms in sequence-to-sequence learning In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1 Long Papers) pages 199ndash208

Mihir Kale 2020 Text-to-text pre-training for data-to-text tasks arXiv preprint arXiv200510433

165

Diederik P Kingma and Jimmy Ba 2014 Adam Amethod for stochastic optimization arXiv preprintarXiv14126980

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Ves Stoyanov and Luke Zettlemoyer 2019Bart Denoising sequence-to-sequence pre-trainingfor natural language generation translation andcomprehension arXiv preprint arXiv191013461

Christopher Morris Martin Ritzert Matthias FeyWilliam L Hamilton Jan Eric Lenssen Gaurav Rat-tan and Martin Grohe 2019 Weisfeiler and lemango neural Higher-order graph neural networks InProceedings of the AAAI Conference on Artificial In-telligence volume 33 pages 4602ndash4609

Amit Moryossef Yoav Goldberg and Ido Dagan 2019Step-by-step Separating planning from realizationin neural data-to-text generation In Proceedings ofthe 2019 Conference of the North American Chap-ter of the Association for Computational LinguisticsHuman Language Technologies Volume 1 (Longand Short Papers) pages 2267ndash2277

Diego Moussalem Paramjot Kaur Thiago Castro-Ferreira Chris van der Lee Conrads Felix Shimo-rina Anastasia Michael Roder Rene Speck ClaireGardent Simon Mille Nikolai Ilinykh and Axel-Cyrille Ngonga Ngomo 2020 A general bench-marking framework for text generation In Pro-ceedings of the 3rd WebNLG Workshop on Natu-ral Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Asso-ciation for Computational Linguistics

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics pages 311ndash318

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word repre-sentations arXiv preprint arXiv180205365

Maja Popovic 2017 chrf++ words helping charactern-grams In Proceedings of the second conferenceon machine translation pages 612ndash618

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the lim-its of transfer learning with a unified text-to-texttransformer Journal of Machine Learning Research21(140)1ndash67

Thibault Sellam Dipanjan Das and Ankur P Parikh2020 Bleurt Learning robust metrics for text gen-eration arXiv preprint arXiv200404696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A study oftranslation edit rate with targeted human annotationIn Proceedings of association for machine transla-tion in the Americas volume 200 Cambridge MA

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in neural information pro-cessing systems pages 5998ndash6008

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2019 Bertscore Eval-uating text generation with bert arXiv preprintarXiv190409675

Chao Zhao Marilyn Walker and Snigdha Chaturvedi2020 Bridging the structural gap between encod-ing and decoding for data-to-text generation In Pro-ceedings of the 58th Annual Meeting of the Associa-tion for Computational Linguistics volume 1

166

Table 1 Overall Results of Our Approach on RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5093 5043 4057BLEU NLTK 0482 0476 0396METEOR 0384 0382 0373CHRF++ 0636 0637 0621TER 0454 0439 0517BERT P 0952 0955 0946BERT R 0947 0949 0941BERT F1 0949 0951 0943BLEURT 054 057 047

Table 2 Results on Seen Category of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 6053 6099 4295BLEU NLTK 0522 0518 0415METEOR 0380 0381 0387CHRF++ 0638 0641 0650TER 0458 0432 0563BERT P 0955 0961 0945BERT R 0946 0948 0942BERT F1 0950 0954 0943BLEURT 051 055 041

Evaluation Settings There are two evaluationsettings 1) Automatic Evaluation and 2) HumanEvaluation

In Automatic Evaluation the evaluationmetrics are BLEU (Papineni et al 2002)BLEU NLTK7 METEOR (Banerjee and Lavie2005) CHRF++ (Popovic 2017) TER (Snoveret al 2006) BERT Precision (BERT P) (Zhanget al 2019) BERT Recall (BERT R) (Zhanget al 2019) BERT F1 (Zhang et al 2019) andBLEURT (Sellam et al 2020)

In Human Evaluation the evaluation metrics are

bull Data Coverage this metric assesses howmuch information from the data has been cov-ered in the text

challenge_20207httpswwwnltkorg_modulesnltk

translatebleu_scorehtml

Table 3 Results on Unseen Category of Our Approachon RDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 4382 4307 3756BLEU NLTK 0436 0430 0370METEOR 0379 0376 0357CHRF++ 0618 0618 0584TER 0472 0456 0510BERT P 0947 0949 0944BERT R 0944 0945 0936BERT F1 0945 0946 0940BLEURT 050 054 044

Table 4 Results on Unseen Entity of Our Approach onRDF-to-Text of WebNLG 2020 (English)

MetricsOur Approach

BaselineSelect

NoSelect

BLEU 5074 4940 4022BLEU NLTK 0504 0490 0393METEOR 0405 0398 0384CHRF++ 0672 0669 0648TER 0410 0410 0476BERT P 0958 0961 0949BERT R 0956 0958 0950BERT F1 0957 0959 0949BLEURT 061 064 055

bull Relevance this metric assesses whether thetext contains any non-presented predicates

bull Correctness this metric assesses whether thetext describes predicates with correct objects

bull Text Structure this metric assesses whetherthe text is grammatical and well-structuredwritten in good English

bull Fluency this metric assesses the naturalnessof generated texts

32 Results321 Automatic EvaluationIn Automatic Evaluation the results are listed inTables 1-4 These results are provided by WebNLG2020 (Castro-Ferreira et al 2020 Moussalemet al 2020)

163

Figure 5 Overall Results of Human Evaluation onRDF-to-Text of WebNLG 2020 (English)

Figure 6 Seen Category Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

In our approaches the pipeline with Plan Selec-tion slightly outperforms the pipeline without PlanSelection on both BLEU score metrics (BLEU andBLEU NLTK) as shown in Table 1 In the seencategory as presented in Table 2 it turns out thatour approach with Plan Selection slightly yieldsbetter performance than the approach without PlanSelection on only BLEU NLTK With this obser-vation the Plan Selection model therefore doesnot contribute much to the seen category and evendegrades the performance in some cases Neverthe-less on the unseen category and unseen entity eval-uation in Tables3-4 we found our approach withPlan Selection improves both BLEU score metricswhen comparing with our approach without PlanSelection

Based on these results in Tables 2-4 the ap-proach building on the T5 model works well onthe seen category and it may not require the planselection process Still on the unseen categoryand unseen entity there is a room where the planselection could play a role in improving the perfor-mances

So far we are interested in the BLEU metric be-cause we optimize the Plan Selection model using

Figure 7 Unseen Category Results of Human Evalua-tion on RDF-to-Text of WebNLG 2020 (English)

Figure 8 Unseen Entity Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

the BLEU metric To further investigate the planselection in this approach we might optimize theplan selection using other metrics

Overall our approach with and without the planselection outperforms the baseline in most metricsexcept for CHRF++ as shown in Table 1 Althoughour approach with Plan Selection could slightlyimprove some metrics the results do not explic-itly surpass our approach without Plan SelectionTherefore the current contribution of Plan Selec-tion is very limited

322 Human EvaluationIn Human Evaluation the results are illustratedin Figure 5-8 These results are also provided byWebNLG 2020 (Castro-Ferreira et al 2020 Mous-salem et al 2020) Due to the limitation of theWebNLG 2020 submission only our approach withPlan Selection is evaluated in the Human Evalua-tion setting Note that the reference texts used inthe automatic evaluation are also evaluated in thissetting

Overall results in Figure 5 show our approachoutperforms the baseline in the relevance metricthe text structure metric and the fluency metric

164

however the baseline provides the better results inthe data coverage metric and the correctness metricAccording to the statistical testing we found statis-tically significant differences in the data coverageresult the correctness result and the fluency re-sult This indicates our approach provides fluencytext while it lacks the coverage and the correctnesscorresponding to RDF inputs

To further investigate the results we analyze theresults based on the seen category the unseen cat-egory and the unseen entity As shown in Figure6-8 the data coverage result the text structure andthe fluency follow the overall resultrsquos trend Nev-ertheless the relevance result and the correctnessresult are varied It turns out that on the unseenentity our approach could not outperform the base-line in the relevance result however our approachachieves even better relevance result than the ref-erence texts in the unseen category In the seencategory our approach yields higher correctness re-sult than the baseline while in the unseen categoryand unseen entity the correctness results follow theoverall trend

In Human Evaluation we notice that our ap-proach severely suffers from the data coverage asshown in Figure 6-8 The gap between our ap-proach and the reference (even the baseline) is largeThis is the limitation of our approach where we donot force the model to use all inputs

4 Conclusion

In this paper we introduce RDF-to-Text Generationwith Plan Selection for WebNLG 2020 challengeThe approach comprises two parts 1) the pipelinefor RDF-to-Text Generation and 2) Plan SelectionIn the pipeline we follow the state-of-the-art modelfor RDF-to-Text Generation (Kale 2020) while inPlan Selection we propose a method for selectingthe order of triples for the pipeline Overall weobserved a slight improvement in the BLEU scorewhen applying Plan Selection Nevertheless PlanSelection only optimized the BLEU score in thisstudy In the future we aim to investigate optimizePlan Selection in other metrics

Acknowledgments

This paper is based on results obtained from aproject commissioned by the New Energy andIndustrial Technology Development Organization(NEDO) JPNP20006

ReferencesSoren Auer Christian Bizer Georgi Kobilarov Jens

Lehmann Richard Cyganiak and Zachary Ives2007 Dbpedia A nucleus for a web of open dataIn The semantic web pages 722ndash735 Springer

Satanjeev Banerjee and Alon Lavie 2005 Meteor Anautomatic metric for mt evaluation with improvedcorrelation with human judgments In Proceedingsof the acl workshop on intrinsic and extrinsic evalu-ation measures for machine translation andor sum-marization pages 65ndash72

Thiago Castro-Ferreira Claire Gardent NikolaiIlinykh Chris van der Lee Simon Mille DiegoMoussalem and Anastasia Shimorina 2020 The2020 bilingual bi-directional webnlg+ shared taskOverview and evaluation results (webnlg+ 2020) InProceedings of the 3rd WebNLG Workshop on Nat-ural Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Associ-ation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conference ofNAACL-HLT pages 4171ndash4186

Bayu Distiawan Jianzhong Qi Rui Zhang and WeiWang 2018 Gtr-lstm A triple encoder for sentencegeneration from rdf data In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages1627ndash1637

Thiago Castro Ferreira Diego Moussallem EmielKrahmer and Sander Wubben 2018 Enriching thewebnlg corpus In Proceedings of the 11th Interna-tional Conference on Natural Language Generationpages 171ndash176

Claire Gardent Anastasia Shimorina Shashi Narayanand Laura Perez-Beltrachini 2017 The webnlgchallenge Generating text from rdf data In Pro-ceedings of the 10th International Conference onNatural Language Generation pages 124ndash133

Aditya Grover and Jure Leskovec 2016 node2vecScalable feature learning for networks In Proceed-ings of the 22nd ACM SIGKDD international con-ference on Knowledge discovery and data miningpages 855ndash864

Shizhu He Cao Liu Kang Liu and Jun Zhao2017 Generating natural answers by incorporatingcopying and retrieving mechanisms in sequence-to-sequence learning In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1 Long Papers) pages 199ndash208

Mihir Kale 2020 Text-to-text pre-training for data-to-text tasks arXiv preprint arXiv200510433

165

Diederik P Kingma and Jimmy Ba 2014 Adam Amethod for stochastic optimization arXiv preprintarXiv14126980

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Ves Stoyanov and Luke Zettlemoyer 2019Bart Denoising sequence-to-sequence pre-trainingfor natural language generation translation andcomprehension arXiv preprint arXiv191013461

Christopher Morris Martin Ritzert Matthias FeyWilliam L Hamilton Jan Eric Lenssen Gaurav Rat-tan and Martin Grohe 2019 Weisfeiler and lemango neural Higher-order graph neural networks InProceedings of the AAAI Conference on Artificial In-telligence volume 33 pages 4602ndash4609

Amit Moryossef Yoav Goldberg and Ido Dagan 2019Step-by-step Separating planning from realizationin neural data-to-text generation In Proceedings ofthe 2019 Conference of the North American Chap-ter of the Association for Computational LinguisticsHuman Language Technologies Volume 1 (Longand Short Papers) pages 2267ndash2277

Diego Moussalem Paramjot Kaur Thiago Castro-Ferreira Chris van der Lee Conrads Felix Shimo-rina Anastasia Michael Roder Rene Speck ClaireGardent Simon Mille Nikolai Ilinykh and Axel-Cyrille Ngonga Ngomo 2020 A general bench-marking framework for text generation In Pro-ceedings of the 3rd WebNLG Workshop on Natu-ral Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Asso-ciation for Computational Linguistics

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics pages 311ndash318

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word repre-sentations arXiv preprint arXiv180205365

Maja Popovic 2017 chrf++ words helping charactern-grams In Proceedings of the second conferenceon machine translation pages 612ndash618

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the lim-its of transfer learning with a unified text-to-texttransformer Journal of Machine Learning Research21(140)1ndash67

Thibault Sellam Dipanjan Das and Ankur P Parikh2020 Bleurt Learning robust metrics for text gen-eration arXiv preprint arXiv200404696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A study oftranslation edit rate with targeted human annotationIn Proceedings of association for machine transla-tion in the Americas volume 200 Cambridge MA

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in neural information pro-cessing systems pages 5998ndash6008

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2019 Bertscore Eval-uating text generation with bert arXiv preprintarXiv190409675

Chao Zhao Marilyn Walker and Snigdha Chaturvedi2020 Bridging the structural gap between encod-ing and decoding for data-to-text generation In Pro-ceedings of the 58th Annual Meeting of the Associa-tion for Computational Linguistics volume 1

166

Figure 5 Overall Results of Human Evaluation onRDF-to-Text of WebNLG 2020 (English)

Figure 6 Seen Category Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

In our approaches the pipeline with Plan Selec-tion slightly outperforms the pipeline without PlanSelection on both BLEU score metrics (BLEU andBLEU NLTK) as shown in Table 1 In the seencategory as presented in Table 2 it turns out thatour approach with Plan Selection slightly yieldsbetter performance than the approach without PlanSelection on only BLEU NLTK With this obser-vation the Plan Selection model therefore doesnot contribute much to the seen category and evendegrades the performance in some cases Neverthe-less on the unseen category and unseen entity eval-uation in Tables3-4 we found our approach withPlan Selection improves both BLEU score metricswhen comparing with our approach without PlanSelection

Based on these results in Tables 2-4 the ap-proach building on the T5 model works well onthe seen category and it may not require the planselection process Still on the unseen categoryand unseen entity there is a room where the planselection could play a role in improving the perfor-mances

So far we are interested in the BLEU metric be-cause we optimize the Plan Selection model using

Figure 7 Unseen Category Results of Human Evalua-tion on RDF-to-Text of WebNLG 2020 (English)

Figure 8 Unseen Entity Results of Human Evaluationon RDF-to-Text of WebNLG 2020 (English)

the BLEU metric To further investigate the planselection in this approach we might optimize theplan selection using other metrics

Overall our approach with and without the planselection outperforms the baseline in most metricsexcept for CHRF++ as shown in Table 1 Althoughour approach with Plan Selection could slightlyimprove some metrics the results do not explic-itly surpass our approach without Plan SelectionTherefore the current contribution of Plan Selec-tion is very limited

322 Human EvaluationIn Human Evaluation the results are illustratedin Figure 5-8 These results are also provided byWebNLG 2020 (Castro-Ferreira et al 2020 Mous-salem et al 2020) Due to the limitation of theWebNLG 2020 submission only our approach withPlan Selection is evaluated in the Human Evalua-tion setting Note that the reference texts used inthe automatic evaluation are also evaluated in thissetting

Overall results in Figure 5 show our approachoutperforms the baseline in the relevance metricthe text structure metric and the fluency metric

164

however the baseline provides the better results inthe data coverage metric and the correctness metricAccording to the statistical testing we found statis-tically significant differences in the data coverageresult the correctness result and the fluency re-sult This indicates our approach provides fluencytext while it lacks the coverage and the correctnesscorresponding to RDF inputs

To further investigate the results we analyze theresults based on the seen category the unseen cat-egory and the unseen entity As shown in Figure6-8 the data coverage result the text structure andthe fluency follow the overall resultrsquos trend Nev-ertheless the relevance result and the correctnessresult are varied It turns out that on the unseenentity our approach could not outperform the base-line in the relevance result however our approachachieves even better relevance result than the ref-erence texts in the unseen category In the seencategory our approach yields higher correctness re-sult than the baseline while in the unseen categoryand unseen entity the correctness results follow theoverall trend

In Human Evaluation we notice that our ap-proach severely suffers from the data coverage asshown in Figure 6-8 The gap between our ap-proach and the reference (even the baseline) is largeThis is the limitation of our approach where we donot force the model to use all inputs

4 Conclusion

In this paper we introduce RDF-to-Text Generationwith Plan Selection for WebNLG 2020 challengeThe approach comprises two parts 1) the pipelinefor RDF-to-Text Generation and 2) Plan SelectionIn the pipeline we follow the state-of-the-art modelfor RDF-to-Text Generation (Kale 2020) while inPlan Selection we propose a method for selectingthe order of triples for the pipeline Overall weobserved a slight improvement in the BLEU scorewhen applying Plan Selection Nevertheless PlanSelection only optimized the BLEU score in thisstudy In the future we aim to investigate optimizePlan Selection in other metrics

Acknowledgments

This paper is based on results obtained from aproject commissioned by the New Energy andIndustrial Technology Development Organization(NEDO) JPNP20006

ReferencesSoren Auer Christian Bizer Georgi Kobilarov Jens

Lehmann Richard Cyganiak and Zachary Ives2007 Dbpedia A nucleus for a web of open dataIn The semantic web pages 722ndash735 Springer

Satanjeev Banerjee and Alon Lavie 2005 Meteor Anautomatic metric for mt evaluation with improvedcorrelation with human judgments In Proceedingsof the acl workshop on intrinsic and extrinsic evalu-ation measures for machine translation andor sum-marization pages 65ndash72

Thiago Castro-Ferreira Claire Gardent NikolaiIlinykh Chris van der Lee Simon Mille DiegoMoussalem and Anastasia Shimorina 2020 The2020 bilingual bi-directional webnlg+ shared taskOverview and evaluation results (webnlg+ 2020) InProceedings of the 3rd WebNLG Workshop on Nat-ural Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Associ-ation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conference ofNAACL-HLT pages 4171ndash4186

Bayu Distiawan Jianzhong Qi Rui Zhang and WeiWang 2018 Gtr-lstm A triple encoder for sentencegeneration from rdf data In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages1627ndash1637

Thiago Castro Ferreira Diego Moussallem EmielKrahmer and Sander Wubben 2018 Enriching thewebnlg corpus In Proceedings of the 11th Interna-tional Conference on Natural Language Generationpages 171ndash176

Claire Gardent Anastasia Shimorina Shashi Narayanand Laura Perez-Beltrachini 2017 The webnlgchallenge Generating text from rdf data In Pro-ceedings of the 10th International Conference onNatural Language Generation pages 124ndash133

Aditya Grover and Jure Leskovec 2016 node2vecScalable feature learning for networks In Proceed-ings of the 22nd ACM SIGKDD international con-ference on Knowledge discovery and data miningpages 855ndash864

Shizhu He Cao Liu Kang Liu and Jun Zhao2017 Generating natural answers by incorporatingcopying and retrieving mechanisms in sequence-to-sequence learning In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1 Long Papers) pages 199ndash208

Mihir Kale 2020 Text-to-text pre-training for data-to-text tasks arXiv preprint arXiv200510433

165

Diederik P Kingma and Jimmy Ba 2014 Adam Amethod for stochastic optimization arXiv preprintarXiv14126980

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Ves Stoyanov and Luke Zettlemoyer 2019Bart Denoising sequence-to-sequence pre-trainingfor natural language generation translation andcomprehension arXiv preprint arXiv191013461

Christopher Morris Martin Ritzert Matthias FeyWilliam L Hamilton Jan Eric Lenssen Gaurav Rat-tan and Martin Grohe 2019 Weisfeiler and lemango neural Higher-order graph neural networks InProceedings of the AAAI Conference on Artificial In-telligence volume 33 pages 4602ndash4609

Amit Moryossef Yoav Goldberg and Ido Dagan 2019Step-by-step Separating planning from realizationin neural data-to-text generation In Proceedings ofthe 2019 Conference of the North American Chap-ter of the Association for Computational LinguisticsHuman Language Technologies Volume 1 (Longand Short Papers) pages 2267ndash2277

Diego Moussalem Paramjot Kaur Thiago Castro-Ferreira Chris van der Lee Conrads Felix Shimo-rina Anastasia Michael Roder Rene Speck ClaireGardent Simon Mille Nikolai Ilinykh and Axel-Cyrille Ngonga Ngomo 2020 A general bench-marking framework for text generation In Pro-ceedings of the 3rd WebNLG Workshop on Natu-ral Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Asso-ciation for Computational Linguistics

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics pages 311ndash318

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word repre-sentations arXiv preprint arXiv180205365

Maja Popovic 2017 chrf++ words helping charactern-grams In Proceedings of the second conferenceon machine translation pages 612ndash618

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the lim-its of transfer learning with a unified text-to-texttransformer Journal of Machine Learning Research21(140)1ndash67

Thibault Sellam Dipanjan Das and Ankur P Parikh2020 Bleurt Learning robust metrics for text gen-eration arXiv preprint arXiv200404696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A study oftranslation edit rate with targeted human annotationIn Proceedings of association for machine transla-tion in the Americas volume 200 Cambridge MA

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in neural information pro-cessing systems pages 5998ndash6008

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2019 Bertscore Eval-uating text generation with bert arXiv preprintarXiv190409675

Chao Zhao Marilyn Walker and Snigdha Chaturvedi2020 Bridging the structural gap between encod-ing and decoding for data-to-text generation In Pro-ceedings of the 58th Annual Meeting of the Associa-tion for Computational Linguistics volume 1

166

however the baseline provides the better results inthe data coverage metric and the correctness metricAccording to the statistical testing we found statis-tically significant differences in the data coverageresult the correctness result and the fluency re-sult This indicates our approach provides fluencytext while it lacks the coverage and the correctnesscorresponding to RDF inputs

To further investigate the results we analyze theresults based on the seen category the unseen cat-egory and the unseen entity As shown in Figure6-8 the data coverage result the text structure andthe fluency follow the overall resultrsquos trend Nev-ertheless the relevance result and the correctnessresult are varied It turns out that on the unseenentity our approach could not outperform the base-line in the relevance result however our approachachieves even better relevance result than the ref-erence texts in the unseen category In the seencategory our approach yields higher correctness re-sult than the baseline while in the unseen categoryand unseen entity the correctness results follow theoverall trend

In Human Evaluation we notice that our ap-proach severely suffers from the data coverage asshown in Figure 6-8 The gap between our ap-proach and the reference (even the baseline) is largeThis is the limitation of our approach where we donot force the model to use all inputs

4 Conclusion

In this paper we introduce RDF-to-Text Generationwith Plan Selection for WebNLG 2020 challengeThe approach comprises two parts 1) the pipelinefor RDF-to-Text Generation and 2) Plan SelectionIn the pipeline we follow the state-of-the-art modelfor RDF-to-Text Generation (Kale 2020) while inPlan Selection we propose a method for selectingthe order of triples for the pipeline Overall weobserved a slight improvement in the BLEU scorewhen applying Plan Selection Nevertheless PlanSelection only optimized the BLEU score in thisstudy In the future we aim to investigate optimizePlan Selection in other metrics

Acknowledgments

This paper is based on results obtained from aproject commissioned by the New Energy andIndustrial Technology Development Organization(NEDO) JPNP20006

ReferencesSoren Auer Christian Bizer Georgi Kobilarov Jens

Lehmann Richard Cyganiak and Zachary Ives2007 Dbpedia A nucleus for a web of open dataIn The semantic web pages 722ndash735 Springer

Satanjeev Banerjee and Alon Lavie 2005 Meteor Anautomatic metric for mt evaluation with improvedcorrelation with human judgments In Proceedingsof the acl workshop on intrinsic and extrinsic evalu-ation measures for machine translation andor sum-marization pages 65ndash72

Thiago Castro-Ferreira Claire Gardent NikolaiIlinykh Chris van der Lee Simon Mille DiegoMoussalem and Anastasia Shimorina 2020 The2020 bilingual bi-directional webnlg+ shared taskOverview and evaluation results (webnlg+ 2020) InProceedings of the 3rd WebNLG Workshop on Nat-ural Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Associ-ation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conference ofNAACL-HLT pages 4171ndash4186

Bayu Distiawan Jianzhong Qi Rui Zhang and WeiWang 2018 Gtr-lstm A triple encoder for sentencegeneration from rdf data In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages1627ndash1637

Thiago Castro Ferreira Diego Moussallem EmielKrahmer and Sander Wubben 2018 Enriching thewebnlg corpus In Proceedings of the 11th Interna-tional Conference on Natural Language Generationpages 171ndash176

Claire Gardent Anastasia Shimorina Shashi Narayanand Laura Perez-Beltrachini 2017 The webnlgchallenge Generating text from rdf data In Pro-ceedings of the 10th International Conference onNatural Language Generation pages 124ndash133

Aditya Grover and Jure Leskovec 2016 node2vecScalable feature learning for networks In Proceed-ings of the 22nd ACM SIGKDD international con-ference on Knowledge discovery and data miningpages 855ndash864

Shizhu He Cao Liu Kang Liu and Jun Zhao2017 Generating natural answers by incorporatingcopying and retrieving mechanisms in sequence-to-sequence learning In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1 Long Papers) pages 199ndash208

Mihir Kale 2020 Text-to-text pre-training for data-to-text tasks arXiv preprint arXiv200510433

165

Diederik P Kingma and Jimmy Ba 2014 Adam Amethod for stochastic optimization arXiv preprintarXiv14126980

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Ves Stoyanov and Luke Zettlemoyer 2019Bart Denoising sequence-to-sequence pre-trainingfor natural language generation translation andcomprehension arXiv preprint arXiv191013461

Christopher Morris Martin Ritzert Matthias FeyWilliam L Hamilton Jan Eric Lenssen Gaurav Rat-tan and Martin Grohe 2019 Weisfeiler and lemango neural Higher-order graph neural networks InProceedings of the AAAI Conference on Artificial In-telligence volume 33 pages 4602ndash4609

Amit Moryossef Yoav Goldberg and Ido Dagan 2019Step-by-step Separating planning from realizationin neural data-to-text generation In Proceedings ofthe 2019 Conference of the North American Chap-ter of the Association for Computational LinguisticsHuman Language Technologies Volume 1 (Longand Short Papers) pages 2267ndash2277

Diego Moussalem Paramjot Kaur Thiago Castro-Ferreira Chris van der Lee Conrads Felix Shimo-rina Anastasia Michael Roder Rene Speck ClaireGardent Simon Mille Nikolai Ilinykh and Axel-Cyrille Ngonga Ngomo 2020 A general bench-marking framework for text generation In Pro-ceedings of the 3rd WebNLG Workshop on Natu-ral Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Asso-ciation for Computational Linguistics

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics pages 311ndash318

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word repre-sentations arXiv preprint arXiv180205365

Maja Popovic 2017 chrf++ words helping charactern-grams In Proceedings of the second conferenceon machine translation pages 612ndash618

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the lim-its of transfer learning with a unified text-to-texttransformer Journal of Machine Learning Research21(140)1ndash67

Thibault Sellam Dipanjan Das and Ankur P Parikh2020 Bleurt Learning robust metrics for text gen-eration arXiv preprint arXiv200404696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A study oftranslation edit rate with targeted human annotationIn Proceedings of association for machine transla-tion in the Americas volume 200 Cambridge MA

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in neural information pro-cessing systems pages 5998ndash6008

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2019 Bertscore Eval-uating text generation with bert arXiv preprintarXiv190409675

Chao Zhao Marilyn Walker and Snigdha Chaturvedi2020 Bridging the structural gap between encod-ing and decoding for data-to-text generation In Pro-ceedings of the 58th Annual Meeting of the Associa-tion for Computational Linguistics volume 1

166

Diederik P Kingma and Jimmy Ba 2014 Adam Amethod for stochastic optimization arXiv preprintarXiv14126980

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Ves Stoyanov and Luke Zettlemoyer 2019Bart Denoising sequence-to-sequence pre-trainingfor natural language generation translation andcomprehension arXiv preprint arXiv191013461

Christopher Morris Martin Ritzert Matthias FeyWilliam L Hamilton Jan Eric Lenssen Gaurav Rat-tan and Martin Grohe 2019 Weisfeiler and lemango neural Higher-order graph neural networks InProceedings of the AAAI Conference on Artificial In-telligence volume 33 pages 4602ndash4609

Amit Moryossef Yoav Goldberg and Ido Dagan 2019Step-by-step Separating planning from realizationin neural data-to-text generation In Proceedings ofthe 2019 Conference of the North American Chap-ter of the Association for Computational LinguisticsHuman Language Technologies Volume 1 (Longand Short Papers) pages 2267ndash2277

Diego Moussalem Paramjot Kaur Thiago Castro-Ferreira Chris van der Lee Conrads Felix Shimo-rina Anastasia Michael Roder Rene Speck ClaireGardent Simon Mille Nikolai Ilinykh and Axel-Cyrille Ngonga Ngomo 2020 A general bench-marking framework for text generation In Pro-ceedings of the 3rd WebNLG Workshop on Natu-ral Language Generation from the Semantic Web(WebNLG+ 2020) Dublin Ireland (Virtual) Asso-ciation for Computational Linguistics

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics pages 311ndash318

Matthew E Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word repre-sentations arXiv preprint arXiv180205365

Maja Popovic 2017 chrf++ words helping charactern-grams In Proceedings of the second conferenceon machine translation pages 612ndash618

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners

Colin Raffel Noam Shazeer Adam Roberts KatherineLee Sharan Narang Michael Matena Yanqi ZhouWei Li and Peter J Liu 2020 Exploring the lim-its of transfer learning with a unified text-to-texttransformer Journal of Machine Learning Research21(140)1ndash67

Thibault Sellam Dipanjan Das and Ankur P Parikh2020 Bleurt Learning robust metrics for text gen-eration arXiv preprint arXiv200404696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A study oftranslation edit rate with targeted human annotationIn Proceedings of association for machine transla-tion in the Americas volume 200 Cambridge MA

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in neural information pro-cessing systems pages 5998ndash6008

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2019 Bertscore Eval-uating text generation with bert arXiv preprintarXiv190409675

Chao Zhao Marilyn Walker and Snigdha Chaturvedi2020 Bridging the structural gap between encod-ing and decoding for data-to-text generation In Pro-ceedings of the 58th Annual Meeting of the Associa-tion for Computational Linguistics volume 1

166