Text Analytics in ITS 2.0: Annotation of Named Entities

Tadej Štajner, Jožef Stefan Institute

The MultilingualWeb-LT Working Group receives funding by the European Commission (project name LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies. Grant Agreement No. 287815.


Motivation

• Translating proper names can be problematic for statistical MT systems

Motivation (2)

• Translation depends on the source and target language:
  – There are specific rules for translating (or transliterating) particular proper names or concepts
  – Sometimes they should not be translated at all
• Solution: figure out what is actually being mentioned and check whether a translated expression already exists for that entity
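The lookup described in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular tool: the `KNOWN_LABELS` table, its contents, and the `translate_mention` and `render` helpers are hypothetical stand-ins for a real knowledge base of entity labels.

```python
from typing import Optional

# Hypothetical label table: entity IRI -> {language code: established name}.
# A real system would query a knowledge base instead of a hard-coded dict.
KNOWN_LABELS = {
    "http://dbpedia.org/resource/Dublin": {
        "en": "Dublin",
        "ga": "Baile Átha Cliath",
    },
    "http://dbpedia.org/resource/Ireland": {
        "en": "Ireland",
        "de": "Irland",
        "ga": "Éire",
    },
}


def translate_mention(entity_iri: str, target_lang: str) -> Optional[str]:
    """Return the established target-language name for an entity, if one is known."""
    return KNOWN_LABELS.get(entity_iri, {}).get(target_lang)


def render(mention: str, entity_iri: str, target_lang: str) -> str:
    """Prefer an existing translated expression; otherwise keep the mention unchanged."""
    known = translate_mention(entity_iri, target_lang)
    return known if known is not None else mention


if __name__ == "__main__":
    print(render("Dublin", "http://dbpedia.org/resource/Dublin", "ga"))
    print(render("Dublin", "http://dbpedia.org/resource/Dublin", "de"))
```

Falling back to the untouched mention reflects the point above that some names should not be translated at all; a production system would need a more careful policy.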

Motivation (3)

• Localization of proper names:
  – personal names, product names, geographic names, chemical compounds, protein names
• Names can appear without sufficient context:
  – we can use ITS 2.0 Text Analysis annotations to provide context for ambiguous content

ITS 2.0 Text Analysis

• Supports text analysis agents that enhance content by suggesting or identifying concepts and entities, each identified by an IRI
• The data category provides three pieces of information:
  – confidence
  – entity type / concept class
  – entity / concept identifier
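As a rough sketch of how these three pieces of information fit together, the Python snippet below models one annotation and maps it to the local HTML attribute names defined by ITS 2.0 (`its-ta-ident-ref`, `its-ta-class-ref`, `its-ta-confidence`). The `TextAnalysisAnnotation` class and the example confidence value of 0.7 are illustrative assumptions, not part of the specification or of any tool.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TextAnalysisAnnotation:
    """The three pieces of information carried by the Text Analysis data category."""

    ident_ref: Optional[str] = None     # entity / concept identifier (an IRI)
    class_ref: Optional[str] = None     # entity type / concept class (an IRI)
    confidence: Optional[float] = None  # annotator confidence, between 0 and 1

    def to_html_attributes(self) -> Dict[str, str]:
        """Map the fields that are set to ITS 2.0 local HTML attribute names."""
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in the interval [0, 1]")
        attrs: Dict[str, str] = {}
        if self.ident_ref is not None:
            attrs["its-ta-ident-ref"] = self.ident_ref
        if self.class_ref is not None:
            attrs["its-ta-class-ref"] = self.class_ref
        if self.confidence is not None:
            attrs["its-ta-confidence"] = str(self.confidence)
        return attrs


if __name__ == "__main__":
    # Illustrative values only; 0.7 is an assumed confidence score.
    annotation = TextAnalysisAnnotation(
        ident_ref="http://dbpedia.org/resource/Dublin",
        class_ref="http://schema.org/Place",
        confidence=0.7,
    )
    print(annotation.to_html_attributes())
```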

ITS 2.0 Text Analysis

<!DOCTYPE html>
<div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher">
  <span
    its-ta-ident-ref="http://dbpedia.org/resource/Dublin"
    its-ta-class-ref="http://schema.org/Place">Dublin</span> is the <span
    its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">capital</span> of <span
    its-ta-ident-ref="http://dbpedia.org/resource/Ireland"
    its-ta-class-ref="http://schema.org/Place">Ireland</span>.
</div>
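To show how a consumer might read such markup, here is a minimal Python sketch that collects the annotated spans using only the standard library's `html.parser`. The `TextAnalysisExtractor` class and the `extract_annotations` function are hypothetical names. The sketch only looks at `<span>` elements carrying local `its-ta-ident-ref` or `its-ta-class-ref` attributes and ignores global rules and every other part of ITS 2.0.

```python
from html.parser import HTMLParser

SAMPLE = """<div its-annotators-ref="text-analysis|http://enrycher.ijs.si/mlw/toolinfo.xml#enrycher">
<span its-ta-ident-ref="http://dbpedia.org/resource/Dublin"
      its-ta-class-ref="http://schema.org/Place">Dublin</span> is the
<span its-ta-ident-ref="http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf">capital</span> of
<span its-ta-ident-ref="http://dbpedia.org/resource/Ireland"
      its-ta-class-ref="http://schema.org/Place">Ireland</span>.
</div>"""


class TextAnalysisExtractor(HTMLParser):
    """Collect text, identifier and class for <span>s with its-ta-* attributes."""

    def __init__(self):
        super().__init__()
        self.annotations = []  # completed annotations, in closing order
        self._open = []        # one entry per open <span>; None if unannotated

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr_map = dict(attrs)
        if "its-ta-ident-ref" in attr_map or "its-ta-class-ref" in attr_map:
            self._open.append({
                "text": "",
                "ident_ref": attr_map.get("its-ta-ident-ref"),
                "class_ref": attr_map.get("its-ta-class-ref"),
            })
        else:
            self._open.append(None)

    def handle_data(self, data):
        # Text belongs to every annotated span that is currently open.
        for entry in self._open:
            if entry is not None:
                entry["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._open:
            entry = self._open.pop()
            if entry is not None:
                self.annotations.append(entry)


def extract_annotations(html_text):
    """Parse an HTML fragment and return its Text Analysis annotations."""
    parser = TextAnalysisExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.annotations


if __name__ == "__main__":
    for annotation in extract_annotations(SAMPLE):
        print(annotation)
```

Note that "capital" carries only an identifier (a WordNet synset) and no class, so its `class_ref` comes back as `None`.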

Producing these annotations

• NLP techniques
  – Named entity extraction & disambiguation
  – Word sense disambiguation
• Manual annotation

Named entity disambiguation

[Figure: diagram relating Document, Label, Entity and Mention]
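One reading of these four terms, assumed here, is that a mention is an occurrence of a label in a document, and disambiguation picks the entity that the mention refers to. The Python sketch below illustrates this with a simple context-overlap heuristic. It is a toy example and not a description of how any particular tool works: the `CANDIDATES` table, its context words and the `disambiguate` function are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Hypothetical candidate table: surface label -> candidate entity IRIs, each
# with a few context words. Real systems derive these from a knowledge base.
CANDIDATES: Dict[str, Dict[str, List[str]]] = {
    "Dublin": {
        "http://dbpedia.org/resource/Dublin": ["ireland", "capital", "liffey"],
        "http://dbpedia.org/resource/Dublin,_Ohio": ["ohio", "columbus", "suburb"],
    },
}


def disambiguate(label: str, document: str) -> Optional[str]:
    """Pick the candidate entity whose context words overlap most with the document."""
    candidates = CANDIDATES.get(label)
    if not candidates:
        return None  # unknown label: nothing to link to
    words = set(re.findall(r"\w+", document.lower()))
    # On a tie, max() keeps the first candidate in table order.
    return max(candidates, key=lambda iri: len(words & set(candidates[iri])))


if __name__ == "__main__":
    print(disambiguate("Dublin", "Dublin is the capital of Ireland."))
    print(disambiguate("Dublin", "Dublin is a suburb of Columbus, Ohio."))
```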

Use cases

• Informing a human agent (e.g. a translator) that a certain fragment of text is subject to specific translation rules [this is taken up in OKAPI and the XLIFF generation]:
  – proper names
  – officially regulated translations
• Informing a software agent (e.g. a CMS) about the conceptual type of a textual entity in order to enable special processing or indexing:
  – geographic names
  – personal names
  – product names
  – chemical compounds
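Both use cases can be sketched over a list of already extracted annotations. The Python below is an assumption-laden illustration: the `ANNOTATIONS` sample, the dictionary layout, and the `index_by_class` and `translator_notes` helpers are hypothetical and do not reflect the API of OKAPI, XLIFF tooling or any CMS.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Hypothetical sample of annotations, one dict per annotated text fragment.
ANNOTATIONS = [
    {"text": "Dublin",
     "ident_ref": "http://dbpedia.org/resource/Dublin",
     "class_ref": "http://schema.org/Place"},
    {"text": "capital",
     "ident_ref": "http://purl.org/vocabularies/princeton/wn30/synset-capital-noun-3.rdf",
     "class_ref": None},
    {"text": "Ireland",
     "ident_ref": "http://dbpedia.org/resource/Ireland",
     "class_ref": "http://schema.org/Place"},
]


def index_by_class(annotations: Iterable[dict]) -> Dict[str, List[str]]:
    """Software-agent use case: group fragments by entity type / concept class."""
    index = defaultdict(list)
    for ann in annotations:
        if ann.get("class_ref"):
            index[ann["class_ref"]].append(ann["text"])
    return dict(index)


def translator_notes(annotations: Iterable[dict], special_classes: Set[str]) -> List[str]:
    """Human-agent use case: flag fragments whose class needs special handling."""
    return [
        f'"{ann["text"]}" refers to {ann["ident_ref"]}; '
        "check for an established translation"
        for ann in annotations
        if ann.get("class_ref") in special_classes
    ]


if __name__ == "__main__":
    print(index_by_class(ANNOTATIONS))
    for note in translator_notes(ANNOTATIONS, {"http://schema.org/Place"}):
        print(note)
```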

Open issues

• The Text Analysis data category can't represent stand-off annotations, so only one layer of annotation can be expressed
• Support for the Domain data category via text analysis tools

Business case

• By itself, it's infrastructure that indirectly supports business cases by supporting other technical scenarios
• [Clemens:] Having metadata saves time
  – producing it automatically can compound the savings
• [XLIFF roundtrip:] Human as well as machine consumption of this metadata

Demo

• http://enrycher.ijs.si/mlw/