
Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Semantic Word Clouds

Marina Santini
santinim@stp.lingfil.uu.se

Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden

Spring 2016

   

Previous  lecture:  Ontologies  

2  

Semantic Web & Ontologies

• The goal of the Semantic Web is to allow web information and services to be more effectively exploited by humans and automated tools.

• Essentially, the focus of the Semantic Web is to share data instead of documents.

• This data must be "meaningful" both for humans and for machines (i.e. automated tools and web applications).

• Q: How are we going to represent meaning and knowledge on the web?

• A: … via annotation.

• Knowledge is represented in the form of rich conceptual schemas/formalisms called ontologies.

• Therefore, ontologies are the backbone of the Semantic Web.

• Ontologies give formally defined meanings to the terms used in annotations, transforming them into semantic annotations.

3

Ontologies are…

• … concepts that are hierarchically organized

4  

Tree of Porphyry, 3rd century AD

WordNet, 21st century AD (see Lect 5, e.g. similarity measures)

Reasoning: RDF/OWL vs Databases (and other data structures)

OWL axioms behave like inference rules rather than database constraints.

Class: Phoenix
    SubClassOf: isPetOf only Wizard

Individual: Fawkes
    Types: Phoenix
    Facts: isPetOf Dumbledore

• Fawkes is said to be a Phoenix and to be the pet of Dumbledore, and it is also stated that only a Wizard can have a pet Phoenix.

• In OWL, this leads to the implication that Dumbledore is a Wizard. That is, if we were to query the ontology for instances of Wizard, then Dumbledore would be part of the answer.

• In a database setting the schema could include a similar statement about the Phoenix class, but in this case it would be interpreted as a constraint on the data: adding the fact that Fawkes isPetOf Dumbledore without Dumbledore already being known to be a Wizard would lead to an invalid database state, and such an update would therefore be rejected by a database management system as a constraint violation.

5  

So, what is an ontology for us?

6

"An ontology is a FORMAL, EXPLICIT specification of a SHARED conceptualization"

Studer, Benjamins, Fensel. Knowledge Engineering: Principles and Methods. Data and Knowledge Engineering 25 (1998) 161-197

"An ontology is an explicit specification of a conceptualization"

Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, Vol. 5, 1993, 199-220

 

Abstract model and simplified view of some phenomenon in the world that we want to represent

Machine-readable

Concepts, properties, relations, functions, constraints, and axioms are explicitly defined

Consensual Knowledge

How to build an ontology

Generally speaking (and roughly put), when designing an ontology, four main components are used:
1. Classes
2. Relations
3. Axioms
4. Instances

7

Practical Activity: emotions

8  

Your remarks:

• Emotions are ambiguous: e.g. happiness can also be ill-directed

• The polarity of some emotions cannot be assessed…

• etc.

Classes, Relations, Axioms, Instances, etc.

Occupational psychology (Wikipedia)

• Industrial and organizational psychology (also known as I–O psychology, occupational psychology, work psychology, WO psychology, IWO psychology and business psychology) is the scientific study of human behavior in the workplace; it applies psychological theories and principles to organizations and to individuals in their workplace.

• I–O psychologists are trained in the scientist–practitioner model. I–O psychologists contribute to an organization's success by improving the performance, motivation, job satisfaction, occupational safety and health, as well as the overall health and well-being of its employees. An I–O psychologist conducts research on employee behaviors and attitudes, and on how these can be improved through hiring practices, training programs, feedback, and management systems.

9  

In  summary…  

Why build an ontology?

• To share common understanding of the structure of information among people or machines
• To make domain assumptions explicit
• Often based on a controlled vocabulary
• To analyze domain knowledge
• To enable reuse of domain knowledge

10  

Ontologies  and  Tags  

• Ontologies and tagging systems are two different ways to organize the knowledge present on the Web.

• The first has a formal foundation deriving from description logic and artificial intelligence; domain experts decide the terms.

• The second is simpler and integrates heterogeneous content; it is based on the collaboration of users in Web 2.0: user-generated annotation.

11  

Folksonomies  

• Tagging facilities within Web 2.0 applications have shown how it might be possible for user communities to collaboratively annotate web content, and create simple forms of ontology via the development of loosely-hierarchically organised sets of tags, often called folksonomies…

12  

Folksonomy = Social Tagging

• Folksonomies (also known as social tagging) are user-defined metadata collections.

• Users do not deliberately create folksonomies, and there is rarely a prescribed purpose; rather, a folksonomy evolves when many users create or store content at particular sites and identify what they think the content is about.

• "Tag clouds" pinpoint the frequency of certain tags.

13  

•  A  common  way  to  organize  tags  is  in  tag  clouds…  

14  

Automatic folksonomy construction

• The collective knowledge expressed through user-generated tags has great potential.

• However, we need tools to efficiently aggregate data from large numbers of users with highly idiosyncratic vocabularies and invented words or expressions.

• Many approaches to automatic folksonomy construction combine tags using statistical methods…

• Ample space for improvement…

15  

Ontology, taxonomy, folksonomy, etc.

• Many different definitions…

• A good summary and interpretation is here: http://www.ideaeng.com/taxonomies-ontologies-0602

16  

Today…  

•  We  will  talk  more  generally  about  word  clouds…  

17  

Further Reading

Semantic Similarity from Natural Language and Ontology Analysis, by Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. Synthesis Lectures on Human Language Technologies, May 2015, Vol. 8, No. 1

• The two state-of-the-art approaches for estimating and quantifying semantic similarity/relatedness of semantic entities are presented in detail: the first relies on corpus analysis and is based on Natural Language Processing techniques and semantic models, while the second is based on more or less formal, computer-readable and workable forms of knowledge such as semantic networks, thesauri or ontologies.

18  

Previous  lecture:  the  end  

19  

Acknowledgements

This presentation is based on the following paper:

• Barth et al. (2014). Experimental Comparison of Semantic Word Clouds. In Experimental Algorithms, Volume 8504 of the series Lecture Notes in Computer Science, pp. 247-258
  – Link: https://www.cs.arizona.edu/~kobourov/wordle2.pdf

Some slides have been borrowed from Sergey Pupyrev.

20  

Today  

• Experiments on semantics-preserving word clouds, in which semantically related words are close to each other.

21  

Outline  

• What is a Word Cloud?
• 3 early algorithms
• 3 new algorithms
• Metrics & Quantitative Evaluation

22  

Word  Clouds  

• Word clouds have become a standard tool for abstracting, visualizing and comparing texts…

• We could apply the same or similar techniques to the huge amounts of tags produced by users interacting in social networks

23  

Comparison & Conceptualization Tool

24  

• Word clouds as a tool for "conceptualizing" documents. Cf. ontologies.

• Ex: 2008, comparison of speeches: Obama vs McCain

Cf. Lect 10: Extractive summarization & abstractive summarization

Word  Clouds  and  Tag  Clouds…  

• … are often used to represent importance among terms (e.g. band popularity) or serve as a navigation tool (e.g. Google search results).

25  

The  Problem…  

• How to compute semantics-preserving word clouds in which semantically related words are close to each other?

26  

Wordle  http://www.wordle.net

• Practical tools, like Wordle, make word cloud visualization easy.

They offer an appealing way to SUMMARIZE text…

Shortcoming: they do not capture the relationships between words in any way, since word placement is independent of context.

27  

Many word clouds are arranged randomly (look also at the scattered colours)

28  

Patterns and Vicinity/Adjacency

Humans are spontaneously pattern-seekers: if they see two words close to each other in a word cloud, they spontaneously think the words are related…

29  

In Linguistics and NLP…

• This natural tendency to link spatial vicinity to semantic relatedness is exploited as evidence that words are semantically related or semantically similar…

Remember? "You shall know a word by the company it keeps" (Firth, J. R. 1957:11)

30  

So, it makes sense to place such related words close to each other (look also at the color distribution)

31  

Semantic word clouds have higher user satisfaction compared to other layouts…

32  

All recent word cloud visualization tools aim to incorporate semantics in the layout…

33  

… but none of them provides any guarantee about the quality of the layout in terms of semantics

34  

Early algorithms: Force-Directed Graph

• Most of the existing algorithms are based on force-directed graph layout.

• Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way:

– Attractive forces between pairs of words reduce empty space

– Repulsive forces ensure that words do not overlap

– A final force preserves semantic relations between words.

35

Some of the most flexible algorithms for calculating layouts of simple undirected graphs belong to a class known as force-directed algorithms. Such algorithms calculate the layout of a graph using only information contained within the structure of the graph itself, rather than relying on domain-specific knowledge. Graphs drawn with these algorithms tend to be aesthetically pleasing, exhibit symmetries, and tend to produce crossing-free layouts for planar graphs.
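The interplay of attractive and repulsive forces can be sketched in plain Python. This is a toy illustration, not the algorithm from the paper: the constants `k_attr` and `k_rep`, the 0.05 step cap, and the example graph are all arbitrary assumptions, and the attraction along edges here stands in for both the space-reducing and the semantics-preserving force.

```python
import random

def force_directed(n, edges, iters=500, k_attr=0.05, k_rep=0.5, seed=42):
    """Toy force-directed layout over n nodes.
    edges is a list of (i, j, weight) triples of semantically related words."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    for _ in range(iters):
        disp = [[0.0, 0.0] for _ in range(n)]
        # repulsive forces: keep every pair of words apart (no overlaps)
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d2 = dx * dx + dy * dy + 1e-9
                f = k_rep / d2
                disp[i][0] += f * dx; disp[i][1] += f * dy
                disp[j][0] -= f * dx; disp[j][1] -= f * dy
        # attractive forces: pull related words together (less empty space)
        for i, j, w in edges:
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            disp[i][0] += k_attr * w * dx; disp[i][1] += k_attr * w * dy
            disp[j][0] -= k_attr * w * dx; disp[j][1] -= k_attr * w * dy
        # cap the step size; this simple "cooling" keeps the layout stable
        for i in range(n):
            dx, dy = disp[i]
            m = (dx * dx + dy * dy) ** 0.5
            if m > 0.05:
                dx, dy = dx / m * 0.05, dy / m * 0.05
            pos[i][0] += dx
            pos[i][1] += dy
    return pos

# four words, of which only 0 and 1 are semantically related
layout = force_directed(4, [(0, 1, 1.0)])
```

After enough iterations the connected pair settles where attraction balances repulsion, while unrelated words drift apart.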

Newer Algorithms: rectangle representation of graphs

• Vertex-weighted and edge-weighted graph:

– The vertices of the graph are the words
• Their weights correspond to some measure of importance (e.g. word frequencies)

– The edges capture the semantic relatedness of pairs of words (e.g. co-occurrence)
• Their weights correspond to the strength of the relation

– Each vertex can be drawn as a box (rectangle) with dimensions determined by its weight

– The realized adjacency is the sum of the edge weights over all pairs of touching boxes.

– The goal is to maximize the realized adjacency.

36
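The realized-adjacency objective is easy to state in code. A minimal sketch, assuming boxes are axis-aligned `(x, y, w, h)` tuples with exact integer coordinates and `weights` maps word-index pairs to edge weights (both names are illustrative):

```python
def touching(a, b):
    """Two axis-aligned boxes (x, y, w, h) share a boundary segment
    if they are flush along one axis and overlap along the other."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    overlap_x = min(ax + aw, bx + bw) - max(ax, bx)
    overlap_y = min(ay + ah, by + bh) - max(ay, by)
    flush_x = (ax + aw == bx) or (bx + bw == ax)
    flush_y = (ay + ah == by) or (by + bh == ay)
    return (flush_x and overlap_y > 0) or (flush_y and overlap_x > 0)

def realized_adjacency(boxes, weights):
    """Sum of edge weights over all pairs of touching boxes."""
    total = 0.0
    for (i, j), w in weights.items():
        if touching(boxes[i], boxes[j]):
            total += w
    return total
```

With floating-point positions the flush test would need a tolerance; exact equality is enough for this illustration.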

Purpose of the experiments shown here:

• Semantics preservation in terms of closeness/vicinity/adjacency

37  

Example

• A contact of two boxes is a common boundary.
• The contact of two boxes is interpreted as semantic relatedness.
• The contact of two boxes can be calculated, so the adjacency can be computed and evaluated.

38  

Preprocessing:
1) Term Extraction
2) Ranking
3) Similarity/Dissimilarity Computation

39  
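The three preprocessing steps can be prototyped in a few lines. Everything here is a simplifying assumption: the tiny stopword list, frequency as the ranking criterion, and Jaccard similarity over sentence-level co-occurrence (the paper computes term ranks and similarities in more elaborate ways).

```python
from collections import Counter
from itertools import combinations

STOP = {"the", "a", "of", "and", "to", "in", "is"}  # tiny illustrative list

def preprocess(sentences, top_k=5):
    """1) term extraction, 2) frequency ranking, 3) pairwise similarity
    from sentence-level co-occurrence (Jaccard index)."""
    docs = [[w for w in s.lower().split() if w not in STOP] for s in sentences]
    freq = Counter(w for d in docs for w in d)
    terms = [w for w, _ in freq.most_common(top_k)]  # step 2: ranking
    # which sentences each term occurs in
    occ = {t: {i for i, d in enumerate(docs) if t in d} for t in terms}
    sim = {}
    for a, b in combinations(terms, 2):              # step 3: similarity
        union = occ[a] | occ[b]
        sim[(a, b)] = len(occ[a] & occ[b]) / len(union) if union else 0.0
    return terms, freq, sim

terms, freq, sim = preprocess(["the cat sat", "the cat ran", "a dog ran"])
```

Here `terms` is the ranked vocabulary and `sim` a sparse upper-triangular similarity matrix keyed by term pairs.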

•  Similarity/dissimilarity  matrix  

40  

Lect 6: Repetition

              large   data   computer
apricot         1       0       0
digital         0       1       2
information     1       6       1

41

Which pair of words is more similar?

cos(v, w) = (v · w) / (|v| |w|) = Σᵢ vᵢwᵢ / (√(Σᵢ vᵢ²) √(Σᵢ wᵢ²))

cosine(apricot, information) = (1·1 + 0·6 + 0·1) / (√(1+0+0) · √(1+36+1)) = 1/√38 ≈ .16

cosine(digital, information) = (0·1 + 1·6 + 2·1) / (√(0+1+4) · √(1+36+1)) = 8/(√5 · √38) ≈ .58

cosine(apricot, digital) = (1·0 + 0·1 + 0·2) / (√(1+0+0) · √(0+1+4)) = 0
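The worked example above can be checked directly; a minimal sketch of the cosine measure over the three context vectors from the table:

```python
import math

def cosine(v, w):
    """cos(v, w) = (v . w) / (|v| |w|); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(v, w))
    nv = math.sqrt(sum(a * a for a in v))
    nw = math.sqrt(sum(b * b for b in w))
    return dot / (nv * nw) if nv and nw else 0.0

# context-count vectors over (large, data, computer), as in the table
apricot     = [1, 0, 0]
digital     = [0, 1, 2]
information = [1, 6, 1]
```

Evaluating the three pairs reproduces the values above: digital is the word most similar to information, while apricot and digital share no contexts at all.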

Lect  06:  Other  possible  similarity  measures  

42  

Input - Output

• The input for all algorithms is
– a collection of n rectangles, each with a fixed width and height proportional to the rank of the word
– a similarity/dissimilarity matrix

• The output is a set of non-overlapping positions for the rectangles.

43  

Early Algorithms

1. Wordle (Random)
2. Context-Preserving Word Cloud Visualization (CPWCV)
3. Seam Carving

44  

Wordle → Random

• The Wordle algorithm places one word at a time in a greedy fashion, i.e. aiming to use space as efficiently as possible.

• First the words are sorted by weight/rank in decreasing order.

• Then, for each word in that order, a position is picked at random.
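The greedy loop just described can be sketched as follows; the `(text, width, height)` word triples, box area as a stand-in for weight/rank, and the retry budget are all illustrative assumptions rather than the actual Wordle implementation.

```python
import random

def overlaps(a, b):
    """Axis-aligned rectangle (x, y, w, h) intersection test."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def wordle_layout(words, canvas=(100, 100), tries=1000, seed=0):
    """Greedy Wordle-style placement: sort words by size, then pick
    random non-overlapping positions one word at a time."""
    rng = random.Random(seed)
    placed = []
    for text, w, h in sorted(words, key=lambda t: t[1] * t[2], reverse=True):
        for _ in range(tries):
            box = (rng.uniform(0, canvas[0] - w),
                   rng.uniform(0, canvas[1] - h), w, h)
            if not any(overlaps(box, b) for _, b in placed):
                placed.append((text, box))
                break
    return placed

cloud = wordle_layout([("cloud", 20, 10), ("word", 15, 8), ("tag", 10, 5)])
```

Because positions are sampled independently of any similarity information, this layout cannot preserve semantics, which is exactly the shortcoming noted earlier.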

1:  Random  

46  

2:  Random  

47  

3:  Random  

48  

4:  Random  

49  

5:  Random  

50  

6:  Random  

51  

Context-Preserving Word Cloud Visualization (CPWCV)

• First, a dissimilarity matrix is computed and Multidimensional Scaling (MDS) is performed

• Second, an effort is made to create a compact layout

52

Multidimensional Scaling (MDS) aims at detecting meaningful underlying dimensions in the data.
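One way to see what MDS does is a tiny metric-MDS sketch that gradient-descends on the stress function; this iterative variant is only an illustration (classical MDS, as typically used in such pipelines, is eigendecomposition-based), and the learning rate and iteration count are arbitrary assumptions.

```python
import random

def mds(dissim, dim=2, iters=2000, lr=0.01, seed=1):
    """Metric MDS by gradient descent on the stress
    sum_{i<j} (||x_i - x_j|| - d_ij)^2."""
    n = len(dissim)
    rng = random.Random(seed)
    x = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(iters):
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                diff = [x[i][k] - x[j][k] for k in range(dim)]
                dist = sum(c * c for c in diff) ** 0.5 or 1e-9
                g = 2 * (dist - dissim[i][j]) / dist
                for k in range(dim):
                    grad[i][k] += g * diff[k]
                    grad[j][k] -= g * diff[k]
        for i in range(n):
            for k in range(dim):
                x[i][k] -= lr * grad[i][k]
    return x

def stress(x, dissim):
    """Residual mismatch between embedded and target distances."""
    n, s = len(dissim), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dist = sum((x[i][k] - x[j][k]) ** 2
                       for k in range(len(x[0]))) ** 0.5
            s += (dist - dissim[i][j]) ** 2
    return s
```

Given the pairwise dissimilarities of three equally dissimilar words, the routine recovers an (approximately) equilateral triangle in the plane: positions whose mutual distances reproduce the input dissimilarities.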

1: Context-Preserving

53

2: Context-Preserving: repulsive force

54

3: Context-Preserving: attractive force

55

Seam Carving

• Basically, an algorithm for image resizing

• It was invented at Mitsubishi Electric Research Laboratories (MERL)

56  

1:  Seam  Carving  

57  

2:  Seam  Carving  :  space  is  divided  into  regions  

58  

3: Seam Carving: empty paths trimmed out iteratively

59  

4:  Seam  Carving  

60  

5:  Seam  Carving  

61  

6:  Seam  Carving:  space  divided  into  regions  

62  

7:  Seam  Carving  

63  

3  New  Algorithms  

1. Inflate and Push
2. Star Forest
3. Cycle Cover

64  

Inflate-and-Push

• Simple heuristic method for word layout, which aims to preserve semantic relations between pairs of words.

• Based on:
1. Heuristics: scaling down all word rectangles by some constant;
2. Computing MDS (multidimensional scaling) on the dissimilarity matrix;
3. Iteratively increasing the size of the rectangles by 5% (i.e. "inflating" words);
4. When words overlap, applying a force-directed algorithm to "push" words away.

65  

Inflate: starting point

66  

Inflate: scaling down

67

Inflate: semantically-related words are placed close to each other. Apply "inflate words" (5%) iteratively.

68  

Inflate: "push words": repulsive force to resolve overlaps

69  

Inflate:  final  stage  

70  

Star Forest

• A star is a tree in which one central vertex is connected to all the other vertices (the leaves).
• A star forest is a forest whose connected components are all stars.

71  

Repetition: trees and graphs

• A tree is a special form of graph: a minimally connected graph, with exactly one path between any two vertices.

• In a graph there can be more than one path, i.e. a graph can have uni-directional or bi-directional paths (edges) between nodes.

72  

Three steps

1. Extracting the star forest: partition the graph into disjoint stars

2. Realising a star: build a word cloud for every star

3. Packing all the stars together

73  

Star Forest: star = tree
1. Extract stars greedily from a dissimilarity matrix → disjoint stars = star forest
2. Compute the optimal stars, i.e. the best set of words to be adjacent
3. Attractive force to get a compact layout

74  
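Step 1 above, the greedy star extraction, can be sketched as follows. The selection rule used here, "take the remaining word with the largest summed similarity as the next center, and attach every remaining positively-similar word as a leaf", is a simplification of the paper's procedure, assumed for illustration.

```python
def extract_star_forest(sim):
    """Greedily partition word indices 0..n-1 into disjoint stars,
    given a symmetric similarity matrix sim."""
    n = len(sim)
    remaining = set(range(n))
    stars = []
    while remaining:
        # center = remaining word with largest total similarity
        center = max(remaining,
                     key=lambda v: sum(sim[v][u] for u in remaining if u != v))
        # leaves = all remaining words with positive similarity to it
        leaves = sorted(u for u in remaining
                        if u != center and sim[center][u] > 0)
        stars.append((center, leaves))
        remaining -= {center, *leaves}
    return stars
```

Each `(center, leaves)` pair is one star; the stars are vertex-disjoint, so together they form a star forest covering all words.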

Cycle Cover

• This algorithm is based on a similarity matrix.
• First, a similarity path is created.
• Then, the optimal level of compactness is computed.

75  

Quantitative Metrics

76  

1. Realized Adjacencies
– how close are similar words to each other?

2. Distortion
– how distant are dissimilar words?

3. Uniform Area Utilization
– uniformity of the distribution (overpopulated vs sparse areas in the word cloud)

4. Compactness
– how well utilized is the drawing area?

5. Aspect Ratio
– width and height of the bounding box

6. Running Time
– execution time
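Two of these metrics, Compactness and Aspect Ratio, reduce to simple bounding-box arithmetic. A sketch, assuming non-overlapping `(x, y, w, h)` word boxes (the exact normalizations in the paper may differ):

```python
def layout_metrics(boxes):
    """Compactness = used area / bounding-box area;
    aspect ratio = width / height of the bounding box."""
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    bb_w, bb_h = x1 - x0, y1 - y0
    used = sum(w * h for x, y, w, h in boxes)  # valid only without overlaps
    return {"compactness": used / (bb_w * bb_h),
            "aspect_ratio": bb_w / bb_h}
```

Two flush unit-height boxes tiling a 4x1 strip, for instance, give perfect compactness 1.0 and an aspect ratio of 4.0.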

2 datasets

(1) WIKI, a set of 112 plain-text articles extracted from the English Wikipedia, each consisting of at least 200 distinct words

(2) PAPERS, a set of 56 research papers published in conferences on experimental algorithms (SEA and ALENEX) in 2011-2012.

77  

Cycle  Cover  wins  

78  

Seam  Carving  wins  

79  

Random  wins  

80  

Inflate  wins  

81  

Random  and  Seam  Carving  win  

82  

All  ok  except  Seam  Carving    

83  

Demo  

84  

The  end  

85  
