Explicit vs. Latent Concept Models for Cross-Language Information Retrieval
TRANSCRIPT
Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Enabling Networked Knowledge
Explicit vs. Latent Concept Models for Cross-Language Information Retrieval
Nitish Aggarwal, DERI, NUI Galway
Tuesday, 26th June 2012, DERI Reading Group
Based On:
Title: “Explicit vs. Latent Concept Models for Cross-Language Information Retrieval”
Authors: Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, Steffen Staab
Published: International Joint Conference on Artificial Intelligence (IJCAI), 2009
Overview
– Introduction: cross-lingual information retrieval (CLIR)
– Concept models: explicit semantics and latent semantics
– Evaluation
– Conclusion
Introduction: CLIR
Cross-lingual information retrieval
– Many documents and web sites are written in different languages
– Goal: retrieve all of this information without a language barrier
– The query and the documents may be in different languages
Introduction: CLIR
CLIR based on machine translation
– Translate the queries or the documents, reducing the problem to monolingual retrieval
Issues
– MT is not available for all language pairs
– Translation increases vocabulary mismatch
Introduction: CLIR
Interlingua- or concept-based CLIR
– Use a language-independent representation
– Define a concept space and a relevance function
– Map all queries and documents, regardless of language, into the concept space
Concept Model
Document in concept space: D_i = {t_1, t_2, t_3, ..., t_n}
– Each token t_i has an association weight with every concept in the space
– Composite semantics of all tokens: Σ t_i or Π t_i (illustrated in the sketch below)
Types of concept model
– Explicit
– Latent/implicit
[Figure: a token t_i linked to concepts C1, C2, C3]
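A minimal sketch of this composite representation; the tokens, concept names and weights below are hypothetical, chosen only to illustrate the idea:

```python
# Composite concept model sketch: a document vector is the sum of its
# tokens' concept vectors (Sigma t_i). All values here are made up.
import numpy as np

# Per-token association weights over a toy concept space (C1, C2, C3)
token_vectors = {
    "student": np.array([0.8, 0.1, 0.0]),
    "lecture": np.array([0.5, 0.3, 0.1]),
    "campus":  np.array([0.6, 0.0, 0.2]),
}

def document_vector(tokens):
    """Composite semantics: sum the concept vectors of all known tokens."""
    return sum(token_vectors[t] for t in tokens if t in token_vectors)

print(document_vector(["student", "lecture", "campus"]))  # -> [1.9 0.4 0.3]
```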
Concept Model: Explicit
Intuition: define concepts from external resources
– Concepts are defined by Wikipedia articles or tagged web pages
– These resources cover a broad range of vocabulary and languages
Example: Wikipedia-based Explicit Semantic Analysis (ESA)
Concept Model: ESA
Explicit concept space: D_i = {t_1, t_2, t_3, ..., t_n}
– Each token maps to a weighted combination of concepts: t_i = w_1·a_1 + w_2·a_2 + ... + w_n·a_n, where each a_j is a Wikipedia article (sketched below)
– Composite semantics of all tokens: Σ t_i
[Figure: a query and documents mapped to concepts such as University, Student, Education]
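A small ESA sketch, assuming a toy three-article “Wikipedia” and using TF-IDF similarity as the association weight; the article texts and the query are illustrative placeholders, not the paper's data:

```python
# ESA sketch: concepts are Wikipedia articles; a text's weight for concept
# a_j is its TF-IDF similarity to that article's text.
from sklearn.feature_extraction.text import TfidfVectorizer

articles = {
    "University": "university campus faculty degree research students",
    "Student": "student study exam lecture university learning",
    "Education": "education school teaching learning curriculum",
}

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(articles.values())  # one row per article/concept

def esa_vector(text):
    """Map a text into the explicit concept space spanned by the articles."""
    return (A @ vectorizer.transform([text]).T).toarray().ravel()

print(dict(zip(articles, esa_vector("students attend lectures at university"))))
```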
Cross-Lingual ESA (CL-ESA)
Extension of ESA
– Uses Wikipedia cross-language links
– Linked articles define the same concept in different languages
– Every word in a language (EN, DE, ES, ...) maps to a weighted vector of article URIs, w_1·URI_1 + w_2·URI_2 + ... + w_n·URI_n, and the URIs are shared across languages
– An inverted index stores these concept vectors per language
– Semantic relatedness of a term@en and a term@de = cosine of their concept vectors (see the sketch below)
[Figure: English, German and Spanish words mapped via an inverted index to shared URI-weighted concept vectors, compared by vector cosine]
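A sketch of the cross-lingual comparison step, assuming the ESA vectors have already been computed; the URIs and weights below are invented for illustration:

```python
# CL-ESA sketch: cross-language links give an English article and its German
# counterpart the same URI, so ESA vectors from different languages share
# one concept space and can be compared directly.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Shared, language-independent concept axes (hypothetical URIs)
uris = ["URI:University", "URI:Student", "URI:Education"]

query_en = np.array([0.7, 0.5, 0.1])  # ESA vector of an English query
doc_de   = np.array([0.6, 0.4, 0.3])  # ESA vector of a German document

# Relevance = cosine similarity in the shared concept space
print(f"relatedness = {cosine(query_en, doc_de):.3f}")
```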
Concept Model: Latent
Intuition: a semantic space of latent concepts
– A cluster of similar things defines a latent concept
Example
– Latent Concept 1 (food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching
– Latent Concept 2 (animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster
– “Look at this cute hamster munching on a piece of broccoli” → 40% Latent Concept 1 and 60% Latent Concept 2
Concept Model: Latent
[Figure: latent concepts LC1, LC2, LC3 are derived from a training corpus; queries and documents are then mapped onto them]
Latent Semantic Analysis (LSA)
Definition
– Dimensionality reduction to find latent concepts
Approach (see the sketch below)
– Build a term-document matrix M
– Perform singular value decomposition (SVD) on M
– Approximate M by keeping the top N singular values; the N singular values correspond to N latent concepts
– U defines the term-concept correlations, V the document-concept correlations
Cross-lingual LSA
– Use a parallel corpus
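A minimal LSA sketch with a toy term-document matrix; the matrix values and the choice of N are arbitrary illustrations:

```python
# LSA sketch: truncated SVD of a toy term-document matrix M. For CL-LSA,
# M would be built from a parallel corpus (each column concatenating a
# document with its translations).
import numpy as np

M = np.array([[2., 0., 1.],   # rows: terms
              [1., 1., 0.],   # columns: documents
              [0., 2., 1.]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

N = 2  # keep the top-N singular values = N latent concepts
M_N = U[:, :N] @ np.diag(s[:N]) @ Vt[:N, :]  # rank-N approximation of M
# U[:, :N]: term-concept correlations; Vt[:N, :].T: document-concept correlations
print(np.round(M_N, 2))
```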
Latent Dirichlet Allocation (LDA)
Definition
– Generative model: each document is a mixture of latent concepts (topics), and each topic generates words; the parameters are learned from the corpus
Approach (see the sketch below)
– The topic distribution is assumed to have a Dirichlet prior
– Corpus-level and document-level parameters are fitted with a variational Expectation Maximization (EM) procedure
Cross-lingual LDA
– Use a parallel corpus
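A hedged LDA sketch using the gensim library (which fits LDA with variational Bayes, in the spirit of the variational EM above); the toy corpus stands in for the parallel training corpus, and for CL-LDA each training document would concatenate its translations:

```python
# LDA sketch: learn latent topics from a toy corpus, then represent a new
# document as a mixture over those topics.
from gensim import corpora
from gensim.models import LdaModel

texts = [["broccoli", "bananas", "breakfast"],
         ["kittens", "chinchillas", "cute", "hamster"],
         ["hamster", "munching", "broccoli"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# num_topics fixes the number of latent concepts in advance
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# A new document becomes a mixture over the latent topics
new_doc = dictionary.doc2bow(["cute", "hamster", "broccoli"])
print(lda[new_doc])  # e.g. [(0, 0.4), (1, 0.6)]
```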
Evaluation
Parallel corpora
– All documents are translated into many languages
Relevance assessment
– Use a document in one language as a query to retrieve documents in another language
– The translated document counts as the relevant document, so no manual relevance assessment is needed
Measures used
– Mean reciprocal rank (MRR), computed as sketched below
– Scores averaged over all language pairs
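A small sketch of the MRR computation under this protocol; the document IDs and rankings are invented for illustration:

```python
# MRR sketch: per query, the single relevant document is the query's
# translation; its rank in the result list gives the reciprocal rank.
def mean_reciprocal_rank(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_id) pairs."""
    total = 0.0
    for ranked, relevant in runs:
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)  # rank is 1-based
    return total / len(runs)

runs = [(["d1", "d2"], "d1"),        # translation ranked 1st -> 1
        (["d4", "d5", "d3"], "d3"),  # ranked 3rd -> 1/3
        (["d7", "d6"], "d6")]        # ranked 2nd -> 1/2
print(mean_reciprocal_rank(runs))    # (1 + 1/3 + 1/2) / 3 ~= 0.611
```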
Evaluation: Datasets
Multilingual corpora
– Multext corpus: 3,066 question/answer pairs from the Official Journal of the European Community
– JRC-Acquis corpus: 21,000 legislative documents of the European Union; 3,000 randomly selected documents were used as queries
Setup
– English, German and French documents were used
– The dataset was split for latent topic extraction: 60% for learning, 40% for testing
Evaluation: Datasets
Wikipedia snapshots
– 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
– Collection of 166,484 articles
CL-ESA: use the cross-language links to align concepts across languages
LSA/LDA: use Wikipedia as a parallel corpus, i.e. as the training corpus for latent concept extraction
Evaluation: Parameter
Cross-lingual ESA
– Problem: too many concepts
– Solution: keep only the m highest concept weights (see the sketch after this list)
LSI/LDA
– Problem: computational cost increases with the number of topics
– Solution: use a fixed number of latent topics
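A minimal sketch of the top-m truncation for CL-ESA vectors; the vector values and m are arbitrary:

```python
# Truncation sketch: zero out all but the m largest concept weights
# of an ESA vector, keeping the index positions intact.
import numpy as np

def truncate_top_m(vec, m):
    """Keep only the m highest-weighted concepts of an ESA vector."""
    out = np.zeros_like(vec)
    top = np.argsort(vec)[-m:]  # indices of the m largest weights
    out[top] = vec[top]
    return out

v = np.array([0.1, 0.9, 0.05, 0.4, 0.3])
print(truncate_top_m(v, m=2))  # -> [0.  0.9 0.  0.4 0. ]
```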
Evaluation: Results
Multext Dataset
Evaluation: Results
JRC-Acquis Dataset
Conclusion
Parameter tuning
– ESA performs well for m = 10,000
– A maximum of 500 topics was tested for LSI; performance is not yet maximal there, but seems to converge
Results
– LSA performs better than LDA
– CL-ESA and LSA produce comparable results
Explicit vs. latent
– The explicit model performs better than the latent model