lecture: semantic word clouds

Upload: marina-santini

Post on 12-Jul-2015

Page 1: Lecture: Semantic Word Clouds

Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2014/sais_2014.htm

Semantic Word Clouds

Marina Santini [email protected]

Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden

Autumn 2014

1 Lect 10: Semantic Word Clouds

Page 2: Lecture: Semantic Word Clouds

Acknowledgements  

•  Some  slides  borrowed  from  Sergey  Pupyrev.  

Page 3: Lecture: Semantic Word Clouds

Outline

• Word Clouds

• 3 early algorithms

• 3 new algorithms

• Metrics & Quantitative Evaluation

Page 4: Lecture: Semantic Word Clouds

Word Clouds

• Word clouds have become a standard tool for abstracting, visualizing and comparing texts…

• We could apply the same or similar techniques to the huge amounts of tags produced by users interacting in social networks.

Page 5: Lecture: Semantic Word Clouds

Comparison & Conceptualization Tool

• Word clouds as a tool for "conceptualizing" documents (cf. ontologies)

• Example (2008): comparison of speeches, Obama vs McCain

Page 6: Lecture: Semantic Word Clouds

Word Clouds and Tag Clouds…

• … are often used to represent importance among terms (e.g. band popularity) or serve as a navigation tool (e.g. Google search results).

Page 7: Lecture: Semantic Word Clouds

The Problem…

• How to compute semantic-preserving word clouds in which semantically related words are close to each other.

Page 8: Lecture: Semantic Word Clouds

Wordle http://www.wordle.net

• Practical tools, like Wordle, make word cloud visualization easy.

• Shortcoming: they do not capture the relationships between words in any way.

Page 9: Lecture: Semantic Word Clouds

Many word clouds are arranged randomly (look also at the scattered colours)

Page 10: Lecture: Semantic Word Clouds

Semantic Patterns

• Humans instinctively tend to pick up patterns

• Instinctively, one could say that two words that are close to each other in a word cloud are semantically related.

Page 11: Lecture: Semantic Word Clouds

So, it makes sense to place such related words close to each other (look also at the color distribution)

Page 12: Lecture: Semantic Word Clouds

In linguistics and in LT…

• … if a pair of words often appears together in a sentence, then we can assume that this pair of words is semantically related.
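
The co-occurrence assumption can be illustrated with a minimal sketch that counts how often two words share a sentence. The function name `sentence_cooccurrence`, the naive sentence splitting on punctuation and the example text are assumptions made for illustration; they are not taken from the lecture.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrence(text):
    """Count how often each unordered pair of words shares a sentence."""
    counts = Counter()
    # Naive sentence split on ., ! and ? (an assumption; real pipelines use a tokenizer).
    for sentence in re.split(r"[.!?]+", text):
        # Distinct lowercase words of the sentence, sorted so pairs have a fixed order.
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        # Each distinct pair is counted once per sentence.
        counts.update(combinations(words, 2))
    return counts


pairs = sentence_cooccurrence(
    "Word clouds show words. Semantic clouds place related words together."
)
print(pairs[("clouds", "words")])
```

Pairs with a high count would be treated as semantically related; pairs that never share a sentence get a count of zero.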

Page 13: Lecture: Semantic Word Clouds

Semantic word clouds have higher user satisfaction compared to other layouts…

Page 14: Lecture: Semantic Word Clouds

All recent word cloud visualization tools aim to incorporate semantics in the layout…

Page 15: Lecture: Semantic Word Clouds

… but none of them provide any guarantee about the quality of the layout in terms of semantics

Page 16: Lecture: Semantic Word Clouds

Early algorithms: Force-Directed Graph

• Most of the existing algorithms are based on force-directed graph layout.

• Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way

– Attractive forces between pairs of words reduce empty space

– Repulsive forces ensure that words do not overlap

– A final force preserves semantic relations between words.

Force-directed graph drawing algorithms assign forces among the set of edges and the set of nodes of a graph drawing. Typically, spring-like attractive forces based on Hooke's law are used to attract pairs of endpoints of the graph's edges towards each other, while simultaneously repulsive forces like those of electrically charged particles based on Coulomb's law are used to separate all pairs of nodes.
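
The Hooke/Coulomb description can be sketched as a single iteration of a basic force-directed layout. This is a generic illustration, not the layout code of any of the tools discussed here; the function name `force_step` and the constants `rest_len`, `k_spring`, `k_repulse` and `step` are assumptions chosen for the example.

```python
import math


def force_step(pos, edges, rest_len=1.0, k_spring=0.1, k_repulse=0.5, step=0.1):
    """One iteration of a basic force-directed layout.

    pos   -- dict mapping node -> (x, y)
    edges -- iterable of (u, v) pairs
    Returns a new dict with the updated positions.
    """
    force = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)
    # Coulomb-like repulsion between every pair of nodes: magnitude k / d^2.
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            f = k_repulse / (d * d)
            force[u][0] += f * dx / d
            force[u][1] += f * dy / d
            force[v][0] -= f * dx / d
            force[v][1] -= f * dy / d
    # Hooke-like spring along each edge: magnitude k * (d - rest_len).
    for u, v in edges:
        dx = pos[v][0] - pos[u][0]
        dy = pos[v][1] - pos[u][1]
        d = math.hypot(dx, dy) or 1e-9
        f = k_spring * (d - rest_len)
        force[u][0] += f * dx / d
        force[u][1] += f * dy / d
        force[v][0] -= f * dx / d
        force[v][1] -= f * dy / d
    return {
        n: (pos[n][0] + step * force[n][0], pos[n][1] + step * force[n][1])
        for n in pos
    }


far_apart = force_step({"a": (0.0, 0.0), "b": (5.0, 0.0)}, [("a", "b")])
too_close = force_step({"a": (0.0, 0.0), "b": (1.0, 0.0)}, [])
```

In `far_apart` the spring dominates and the two connected nodes move towards each other; in `too_close` there is no edge, so only repulsion acts and the nodes drift apart. A full layout would repeat the step until the positions stabilize.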

Page 17: Lecture: Semantic Word Clouds

Newer Algorithms: rectangle representation of graphs

• Vertex-weighted and edge-weighted graph:

– The vertices of the graph are the words

• Their weights correspond to some measure of importance (e.g. word frequencies)

– The edges capture the semantic relatedness of pairs of words (e.g. co-occurrence)

• Their weights correspond to the strength of the relation

– Each vertex can be drawn as a box (rectangle) with dimensions determined by its weight

– The realized adjacencies are the sum of the edge weights for all pairs of touching boxes.

– The goal is to maximize the realized adjacencies.
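
The realized-adjacencies objective can be written down directly. The sketch below assumes boxes given as `(x, y, width, height)` tuples and treats two boxes as touching when they share a boundary segment of positive length; both the representation and the tolerance `eps` are assumptions for illustration.

```python
def touching(a, b, eps=1e-9):
    """True if two axis-aligned boxes share a boundary segment of positive length."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap extent along each axis (a negative value means a gap).
    ox = min(ax + aw, bx + bw) - max(ax, bx)
    oy = min(ay + ah, by + bh) - max(ay, by)
    # Touching: zero gap on one axis and a positive shared length on the other.
    return (abs(ox) <= eps and oy > eps) or (abs(oy) <= eps and ox > eps)


def realized_adjacencies(boxes, weights):
    """Sum the edge weights over all pairs of words whose boxes touch.

    boxes   -- dict mapping word -> (x, y, width, height)
    weights -- dict mapping (word_u, word_v) -> edge weight
    """
    return sum(w for (u, v), w in weights.items() if touching(boxes[u], boxes[v]))


boxes = {"word": (0, 0, 2, 1), "cloud": (2, 0, 3, 1), "graph": (10, 10, 1, 1)}
weights = {("word", "cloud"): 0.8, ("word", "graph"): 0.5, ("cloud", "graph"): 0.3}
score = realized_adjacencies(boxes, weights)
```

Only the boxes of "word" and "cloud" touch in this example, so only their edge weight contributes to the score.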

Page 18: Lecture: Semantic Word Clouds

Experimental Setup: 1) Term Extraction 2) Ranking 3) Similarity Computation
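
The three steps can be sketched end to end. The slide does not say which ranking or similarity function is used, so the choices below (ranking by term frequency, cosine similarity over sentence-occurrence vectors, a tiny stopword list) are assumptions for illustration only, as are all function names.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {"the", "a", "of", "and", "in", "to", "are", "is"}


def extract_terms(sentence):
    """Step 1: lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def rank_terms(sentences, top_n):
    """Step 2: rank terms by raw frequency and keep the top_n."""
    freq = Counter(w for s in sentences for w in extract_terms(s))
    return [w for w, _ in freq.most_common(top_n)]


def cosine_similarity(term_a, term_b, sentences):
    """Step 3: cosine of the 0/1 sentence-occurrence vectors of two terms."""
    va = [1 if term_a in extract_terms(s) else 0 for s in sentences]
    vb = [1 if term_b in extract_terms(s) else 0 for s in sentences]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(va)) * math.sqrt(sum(vb))
    return dot / norm if norm else 0.0


sentences = [
    "Word clouds visualize text",
    "Semantic word clouds group related words",
    "Graphs have vertices",
]
top = rank_terms(sentences, 2)
sim = cosine_similarity("word", "clouds", sentences)
```

The ranked terms would become the vertices of the graph and the pairwise similarities its edge weights.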

Page 19: Lecture: Semantic Word Clouds

Early Algorithms

1. Wordle (Random)

2. Context-Preserving Word Cloud Visualization (CPWCV)

3. Seam Carving

Page 20: Lecture: Semantic Word Clouds

Wordle → Random

• The Wordle algorithm places one word at a time in a greedy fashion, aiming to use space as efficiently as possible.

• First the words are sorted by weight in decreasing order.

• Then for each word in the order, a position is picked at random.
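
The sort-then-place-at-random idea can be sketched as follows. This is a simplification and not Wordle's actual implementation: it retries uniformly random positions until the word's box overlaps nothing already placed. The canvas size, box sizes, `seed` and `max_tries` are assumptions for the example.

```python
import random


def overlaps(a, b):
    """True if two (x, y, w, h) boxes overlap with positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def random_layout(words, width, height, seed=0, max_tries=10000):
    """Greedy random placement.

    words -- dict mapping word -> (weight, box_width, box_height)
    Returns a dict mapping word -> (x, y, box_width, box_height).
    """
    rng = random.Random(seed)
    placed = {}
    # Heaviest words first, as described above.
    for word, (_, w, h) in sorted(words.items(), key=lambda kv: -kv[1][0]):
        for _ in range(max_tries):
            box = (rng.uniform(0, width - w), rng.uniform(0, height - h), w, h)
            if not any(overlaps(box, other) for other in placed.values()):
                placed[word] = box
                break
        else:
            raise RuntimeError(f"no free position found for {word!r}")
    return placed


words = {"tag": (3, 20, 10), "cloud": (10, 40, 20), "word": (7, 30, 15)}
layout = random_layout(words, width=200, height=100)
```

Because positions ignore word similarity, related words end up near each other only by chance, which is the shortcoming the semantic algorithms address.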

Page 21: Lecture: Semantic Word Clouds

1:  Random  

Page 22: Lecture: Semantic Word Clouds

2:  Random  

Page 23: Lecture: Semantic Word Clouds

3:  Random  

Page 24: Lecture: Semantic Word Clouds

4:  Random  

Page 25: Lecture: Semantic Word Clouds

5:  Random  

Page 26: Lecture: Semantic Word Clouds

6:  Random  

Page 27: Lecture: Semantic Word Clouds

Context-Preserving Word Cloud Visualization (CPWCV)

• First, a dissimilarity matrix is computed and Multidimensional Scaling (MDS) is performed

• Second, an effort is made to create a compact layout

Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset.
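
To make the MDS step concrete, here is a minimal sketch that places points in the plane so that their distances approximate a given dissimilarity matrix. It uses plain gradient descent on the squared-error stress, which is one simple way to do metric MDS and not necessarily the variant used in CPWCV. The function name `mds_2d`, the circular starting positions, the learning rate and the iteration count are all assumptions.

```python
import math


def mds_2d(dissim, iterations=1000, lr=0.05):
    """Place n points in the plane so that their Euclidean distances
    approximate the symmetric n x n dissimilarity matrix `dissim`."""
    n = len(dissim)
    # Deterministic start: points spread evenly on the unit circle.
    pts = [
        [math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
        for i in range(n)
    ]
    for _ in range(iterations):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pts[i][0] - pts[j][0]
                dy = pts[i][1] - pts[j][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                # Gradient of (d - dissim[i][j])**2 with respect to point i.
                g = 2.0 * (d - dissim[i][j]) / d
                grads[i][0] += g * dx
                grads[i][1] += g * dy
        # Move every point a small step against its gradient.
        for i in range(n):
            pts[i][0] -= lr * grads[i][0]
            pts[i][1] -= lr * grads[i][1]
    return [tuple(p) for p in pts]


# Three words whose pairwise dissimilarities form a 3-4-5 triangle.
points = mds_2d([[0, 3, 4], [3, 0, 5], [4, 5, 0]])
```

Words with small dissimilarity end up close together, which gives the initial, semantics-aware positions that the second step then compacts.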

Page 28: Lecture: Semantic Word Clouds

1:  Context-­‐Preserving    

Page 29: Lecture: Semantic Word Clouds

2: Context-Preserving: repulsive force

Page 30: Lecture: Semantic Word Clouds

3: Context-Preserving: attractive force

Page 31: Lecture: Semantic Word Clouds

Seam  Carving  

• Seam carving is a content-aware image resizing technique

• Basically, an algorithm for image resizing

• It was invented at Mitsubishi’s

Page 32: Lecture: Semantic Word Clouds

1:  Seam  Carving  

Page 33: Lecture: Semantic Word Clouds

2: Seam Carving: space is divided into regions

Page 34: Lecture: Semantic Word Clouds

3: Seam Carving: empty paths trimmed out iteratively

Page 35: Lecture: Semantic Word Clouds

4:  Seam  Carving  

Page 36: Lecture: Semantic Word Clouds

5:  Seam  Carving  

Page 37: Lecture: Semantic Word Clouds

6:  Seam  Carving:  space  divided  into  regions  

Page 38: Lecture: Semantic Word Clouds

7:  Seam  Carving  

Page 39: Lecture: Semantic Word Clouds

3 New Algorithms

1. Inflate and Push

2. Star Forest

3. Cycle Cover

Page 40: Lecture: Semantic Word Clouds

Inflate-and-Push

• Simple heuristic method for word layout, which aims to preserve semantic relations between pairs of words.

Page 41: Lecture: Semantic Word Clouds

1:  Inflate  

Page 42: Lecture: Semantic Word Clouds

2: Inflate: scaling down

Page 43: Lecture: Semantic Word Clouds

3: Inflate: semantically related words are placed close to each other

Page 44: Lecture: Semantic Word Clouds

4: Inflate: repulsive force to resolve overlaps

Page 45: Lecture: Semantic Word Clouds

5:  Inflate  

Page 46: Lecture: Semantic Word Clouds

Star Forest

• A star is a tree in which one centre vertex is adjacent to every other vertex, and a star forest is a forest whose connected components are all stars.
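
The definition can be checked mechanically. The sketch below tests whether a graph is a star forest by looking at the degree sequence of each connected component; it only illustrates the definition and is not the Star Forest layout algorithm itself. Treating a single vertex and a single edge as stars is an assumption, and the graph is assumed to be simple (no self-loops).

```python
from collections import defaultdict


def is_star_forest(vertices, edges):
    """True if every connected component of the graph is a star."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    for start in vertices:
        if start in seen:
            continue
        # Collect the connected component that contains `start`.
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in component:
                    component.add(nb)
                    stack.append(nb)
        seen |= component
        n = len(component)
        degrees = sorted(len(adj[v]) for v in component)
        # A star on n vertices has n - 1 leaves of degree 1 and one centre
        # of degree n - 1 (this also covers a lone vertex and a single edge).
        if degrees != [1] * (n - 1) + [n - 1]:
            return False
    return True


two_stars = is_star_forest("abcde", [("a", "b"), ("a", "c"), ("d", "e")])
path = is_star_forest("abcd", [("a", "b"), ("b", "c"), ("c", "d")])
```

In the word cloud setting, each centre would be a word and its leaves the words most strongly related to it.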

Page 47: Lecture: Semantic Word Clouds

Star Forest: star = graph

• Dissimilarity matrix → disjoint stars = star forest

• Attractive force to get a compact layout

Page 48: Lecture: Semantic Word Clouds

Cycle Cover

• This algorithm is based on a similarity matrix.

• First, a similarity path (= cycle) is created

• Then, the optimal level of compactness is computed

Page 49: Lecture: Semantic Word Clouds

Quan(ta(ve  Metrics  

Page 50: Lecture: Semantic Word Clouds

Criteria

1. Realized Adjacencies – how close are similar words to each other?

2. Distortion – how distant are dissimilar words?

3. Compactness – how well utilized is the drawing area?

4. Uniform Area Utilization – uniformity of the distribution (overpopulated vs sparse areas in the word cloud)

5. Aspect Ratio – width and height of the bounding box

6. Running Time – execution time
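
Two of these criteria are easy to compute from a finished layout. The sketch below assumes non-overlapping boxes given as `(x, y, width, height)` tuples; defining compactness as the covered fraction of the bounding box and aspect ratio as bounding-box width divided by height is an assumption for illustration, since the slide does not give exact formulas.

```python
def bounding_box(boxes):
    """Smallest axis-aligned rectangle (x, y, width, height) containing all boxes."""
    x0 = min(x for x, _, _, _ in boxes)
    y0 = min(y for _, y, _, _ in boxes)
    x1 = max(x + w for x, _, w, _ in boxes)
    y1 = max(y + h for _, y, _, h in boxes)
    return (x0, y0, x1 - x0, y1 - y0)


def compactness(boxes):
    """Fraction of the bounding box covered by word boxes (assumes no overlaps)."""
    _, _, bw, bh = bounding_box(boxes)
    used = sum(w * h for _, _, w, h in boxes)
    return used / (bw * bh)


def aspect_ratio(boxes):
    """Width of the bounding box divided by its height."""
    _, _, bw, bh = bounding_box(boxes)
    return bw / bh


tight = [(0, 0, 4, 2), (4, 0, 2, 2), (0, 2, 6, 1)]
sparse = [(0, 0, 1, 1), (3, 3, 1, 1)]
```

The `tight` layout fills its bounding box completely, while the `sparse` layout leaves most of the drawing area unused.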

Page 51: Lecture: Semantic Word Clouds

2 datasets

(1) WIKI, a set of 112 plain-text articles extracted from the English Wikipedia, each consisting of at least 200 distinct words

(2) PAPERS, a set of 56 research papers published in conferences on experimental algorithms (SEA and ALENEX) in 2011-2012.

Page 52: Lecture: Semantic Word Clouds

Cycle  Cover  wins  

Page 53: Lecture: Semantic Word Clouds

Seam  Carving  wins  

Page 54: Lecture: Semantic Word Clouds

Random  wins  

Page 55: Lecture: Semantic Word Clouds

Inflate  wins  

Page 56: Lecture: Semantic Word Clouds

Random  and  Seam  Carving  win  

Page 57: Lecture: Semantic Word Clouds

All  ok  except  Seam  Carving    

Page 58: Lecture: Semantic Word Clouds

Demo  

Page 59: Lecture: Semantic Word Clouds

Final  Words  

Page 60: Lecture: Semantic Word Clouds

The  end  
