explicit semantic analysis

Upload: badr

Post on 20-Mar-2017


Page 1: Explicit Semantic Analysis
Page 2: Explicit Semantic Analysis

Explicit Semantic Analysis

Page 3: Explicit Semantic Analysis

Semantic Analysis Intuition

● Bag-of-words (BOW) techniques

⚪ Text is represented as a vector in a high-dimensional space whose orthogonal unit vectors correspond to the keywords of the language.

⚪ Vector weights in BOW depend on the frequencies of keywords in the represented text, using a scoring scheme such as the popular TF-IDF.

⚪ Similarity between entities in BOW is computed by comparing their vectors and measuring the distance between them (e.g. cosine similarity).
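
As an illustration of the BOW pipeline described on this slide, here is a minimal Python sketch. It assumes whitespace tokenization and one common TF-IDF variant (relative term frequency times log inverse document frequency); the function names `tfidf_vectors` and `cosine` and the three toy documents are made up for this example and do not come from the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus (hypothetical data, for illustration only).
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_cats = cosine(vecs[0], vecs[1])       # documents share "the" and "cat"
sim_unrelated = cosine(vecs[0], vecs[2])  # documents share no terms
```

Documents that share keywords get a positive score, while documents with no keyword in common score exactly zero, however related their topics may be.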

Page 4: Explicit Semantic Analysis

Semantic Analysis Intuition

● A challenging problem [1] for BOW

⚪ Query: software

⚪ Pool: application, program, package, freeware, shareware

⚪ Result: no match!

● Although a human interpreter could find matches, a program based merely on syntactic analysis, which treats words as independent orthogonal unit vectors, would see no relation between “software” and “program”!
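
The failure above can be reproduced in a few lines. The sketch below assumes each word is a one-hot unit vector over a vocabulary made of the query plus the pool; the helper names `one_hot` and `dot` are hypothetical.

```python
def one_hot(word, vocabulary):
    """Represent a single word as a unit vector over the vocabulary."""
    return [1.0 if word == term else 0.0 for term in vocabulary]

def dot(u, v):
    """Dot product of two equal-length dense vectors."""
    return sum(a * b for a, b in zip(u, v))

query = "software"
pool = ["application", "program", "package", "freeware", "shareware"]
vocabulary = [query] + pool

query_vec = one_hot(query, vocabulary)
# Distinct words map to orthogonal axes, so every dot product is zero.
scores = {word: dot(query_vec, one_hot(word, vocabulary)) for word in pool}
matches = [word for word, score in scores.items() if score > 0]
```

Here `matches` stays empty: because every distinct word has its own axis, no pool word overlaps with the query, regardless of meaning.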

Page 5: Explicit Semantic Analysis

Semantic Analysis Intuition

● For humans, natural language text acts as an index into a rich knowledge base naturally maintained by the human mind.

● The challenge is to come up with a model that captures an adequate level of detail of the semantics of natural language.

● Semantic analysis focuses on the meaning of the words in their context, rather than just lexical interpretation.

Page 6: Explicit Semantic Analysis

Explicit Semantic Analysis (ESA)

● A variation of semantic analysis based on “explicit concepts”.

● “Our approach is inspired by the desire to augment text representation with massive amounts of world knowledge.”

● Introduced the idea of an explicit concept: a bag of words that together form a real-life concept, explicitly defined by humans.

● Moved text representation from keyword space to explicit-concept space.

● Defined a mapping between keyword space and explicit-concept space.

Page 7: Explicit Semantic Analysis

ESA Using Wikipedia

● Wikipedia was chosen because it is the largest knowledge repository on the web and is available in dozens of languages. Its open editing approach also enhances its quality [2].

● Concepts are derived from Wikipedia articles. Each article maps to a real-life concept and is represented as a bag of words.

● The choice of an encyclopedia is based on the fact that each article is mainly about a single topic, which supports the orthogonality of concept vectors.

Page 8: Explicit Semantic Analysis

ESA Technical Details

● Concepts are derived from Wikipedia articles as vectors of TF-IDF scores of keywords.

● An inverted index is generated from derived concepts, where each keyword is represented as a weighted vector of concepts.

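● Text T is represented in concept space as a vector whose weight for each concept c_j is

$$\sum_{w_i \in T} v_i \cdot k_j$$

where T = {w_i}, v_i is the TF-IDF weight of w_i in T, and k_j is the inverted index entry for w_i, i.e. the strength of association between w_i and concept c_j.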
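
The steps on this slide can be sketched in Python as follows. This is a simplified illustration, not the original ESA implementation: it assumes that the weight of each concept is the sum over the words of T of v_i times k_j, it uses whitespace tokenization with a plain TF-IDF variant, and it replaces Wikipedia with three tiny made-up "articles". All function and variable names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf(tokens, df, n_docs):
    """Sparse TF-IDF vector for one token list; terms unseen in the corpus are dropped."""
    tf = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
        if df.get(term, 0) > 0
    }

def build_inverted_index(articles):
    """Map each keyword to its weighted vector of concepts: word -> {concept: k_j}."""
    tokenized = {name: text.lower().split() for name, text in articles.items()}
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    index = defaultdict(dict)
    for concept, tokens in tokenized.items():
        for term, weight in tfidf(tokens, df, len(articles)).items():
            if weight > 0:
                index[term][concept] = weight
    return index, df

def interpret(text, index, df, n_docs):
    """Concept-space vector of a text: weight(c_j) = sum over w_i in T of v_i * k_j."""
    concept_vector = defaultdict(float)
    for word, v_i in tfidf(text.lower().split(), df, n_docs).items():
        for concept, k_j in index.get(word, {}).items():
            concept_vector[concept] += v_i * k_j
    return dict(concept_vector)

# Toy stand-ins for Wikipedia articles (hypothetical data).
articles = {
    "Planet": "planet orbit star solar system planet",
    "Jaguar (car)": "jaguar car model engine car",
    "Jaguar (animal)": "jaguar cat animal jungle predator",
}
index, df = build_inverted_index(articles)
vector = interpret("new jaguar model unveiled", index, df, len(articles))
top_concept = max(vector, key=vector.get)
```

In this toy setup the word "jaguar" alone is ambiguous between two concepts, and the co-occurring word "model" tips the ranking toward the car concept.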

Page 9: Explicit Semantic Analysis

Semantic Relatedness Application

Page 10: Explicit Semantic Analysis

Semantic Relatedness Application

Page 11: Explicit Semantic Analysis

Semantic Relatedness Application

Text: “A group of European-led astronomers has made a photograph of what appears to be a planet orbiting another star. If so, it would be the first confirmed picture of a world beyond our solar system.”

Top generated concepts: (1) Planet; (2) Planetary orbit; (3) Solar system; (4) Extrasolar planet; (5) Jupiter; (6) Astronomy; (7) Definition of planet; (8) Pluto; (9) Minor planet; (10) PSR 1257+12

All concepts are highly relevant and describe or relate to the subject of the text, with the fourth concept (Extrasolar planet) being the exact topic, despite the fact that these words were not explicitly mentioned in the text. PSR 1257+12 is the name of a pulsar around which the first extrasolar planets were discovered orbiting.

Page 12: Explicit Semantic Analysis

Semantic Relatedness Application

Text: “New Jaguar model unveiled by firm”

Top generated concepts: (1) Jaguar XJ; (2) Jaguar (car); (3) Ford Motor Company; (4) Jaguar XK; (5) Land Rover Range Rover; (6) Jaguar S-Type; (7) Jaguar X-Type; (8) Nissan Micra; (9) V8 engine; (10) Jaguar E-type

The disambiguation power of ESA is obvious, as the top concepts all refer to Jaguar the car maker rather than to the namesake animal or American football team (e.g. the ESA concept Jacksonville Jaguars). Despite the text containing no explicit car-related terms, words such as “model” and “unveil” were more related to the industry meaning and helped trigger the correct concepts. The concepts generated also hint at rich world knowledge, such as the business relations to Ford Motor Company and Land Rover Range Rover and the use of a V8 engine on Jaguar models. The Nissan Micra concept was triggered by a Micra variant that was inspired by a Jaguar model.

Page 13: Explicit Semantic Analysis

Semantic Relatedness Application

Page 14: Explicit Semantic Analysis

ESA Semantic Relatedness Evaluation

Computing word relatedness
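
As a rough sketch of how relatedness can be computed once words are mapped to concept space, the snippet below compares sparse concept vectors with the cosine metric. The concept names and weights are invented for illustration and are not output of a real ESA index; the function name `relatedness` is hypothetical.

```python
import math

def relatedness(u, v):
    """Cosine similarity between two sparse concept vectors (dict: concept -> weight)."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up concept vectors; a real system would derive them from the inverted index.
software = {"Computer software": 0.9, "Computer program": 0.6}
program = {"Computer program": 0.8, "Computer software": 0.5, "Television program": 0.3}
planet = {"Planet": 1.0}

score_related = relatedness(software, program)
score_unrelated = relatedness(software, planet)
```

Unlike the keyword-space comparison, "software" and "program" now receive a high score because they activate overlapping concepts, while "software" and "planet" share none.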

Page 15: Explicit Semantic Analysis

References

1. Saswat Padhi. “Information Retrieval using Semantic Similarity” (slides). http://www.slideshare.net/SaswatPadhi/information-retrieval-using-semantic-similarity

2. Jim Giles. “Internet encyclopaedias go head to head.” Nature, 438:900–901, 2005.