Explicit vs. Latent Concept Models for Cross-Language Information Retrieval
TRANSCRIPT
Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Enabling Networked Knowledge
Explicit vs. Latent Concept Models for Cross-Language Information Retrieval
Nitish Aggarwal, DERI, NUI Galway
Tuesday, 26th June 2012, DERI Reading Group
Based On:
Title: “Explicit vs. Latent Concept Models for Cross-Language Information Retrieval”
Authors: Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, Steffen Staab
Published: International Joint Conference on Artificial Intelligence (IJCAI), 2009
Overview
– Introduction: cross-lingual information retrieval (CLIR)
– Concept models: explicit semantics and latent semantics
– Evaluation
– Conclusion
Introduction: CLIR
Cross-lingual information retrieval
– Many documents and web sites are written in different languages
– Goal: retrieve all of this information without a language barrier
– The query and the documents may be in different languages
Introduction: CLIR
CLIR based on machine translation
– Translate the queries or the documents, reducing the problem to monolingual retrieval
Issues
– MT is not available for all language pairs
– Translation increases vocabulary mismatch
Introduction: CLIR
Interlingua- or concept-based CLIR
– Use a language-independent representation
– Define a concept space and a relevance function
– Map all queries and documents, regardless of language, into the concept space
Concept Model
Document in concept space: D_i = {t_1, t_2, t_3, ..., t_n}
– Each token t_i has an association weight with every concept in the space
– Composite semantics of all tokens: Σ t_i or Π t_i (illustrated in the sketch below)
Types of concept model
– Explicit
– Latent/implicit
[Figure: a token t_i linked to concepts C1, C2, C3]
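A minimal sketch of this composite representation; the tokens, concept names and weights below are hypothetical, chosen only to illustrate the idea:

```python
# Composite concept model sketch: a document vector is the sum of its
# tokens' concept vectors (Sigma t_i). All values here are made up.
import numpy as np

# Per-token association weights over a toy concept space (C1, C2, C3)
token_vectors = {
    "student": np.array([0.8, 0.1, 0.0]),
    "lecture": np.array([0.5, 0.3, 0.1]),
    "campus":  np.array([0.6, 0.0, 0.2]),
}

def document_vector(tokens):
    """Composite semantics: sum the concept vectors of all known tokens."""
    return sum(token_vectors[t] for t in tokens if t in token_vectors)

print(document_vector(["student", "lecture", "campus"]))  # -> [1.9 0.4 0.3]
```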
Concept Model: Explicit
Intuition: define concepts from external resources
– Concepts are defined by Wikipedia articles or tagged web pages
– These resources cover a broad range of vocabulary and languages
Example: Wikipedia-based Explicit Semantic Analysis (ESA)
Concept Model: ESA
Explicit concept space: D_i = {t_1, t_2, t_3, ..., t_n}
– Each token maps to a weighted combination of concepts: t_i = w_1·a_1 + w_2·a_2 + ... + w_n·a_n, where each a_j is a Wikipedia article (sketched below)
– Composite semantics of all tokens: Σ t_i
[Figure: a query and documents mapped to concepts such as University, Student, Education]
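A small ESA sketch, assuming a toy three-article “Wikipedia” and using TF-IDF similarity as the association weight; the article texts and the query are illustrative placeholders, not the paper's data:

```python
# ESA sketch: concepts are Wikipedia articles; a text's weight for concept
# a_j is its TF-IDF similarity to that article's text.
from sklearn.feature_extraction.text import TfidfVectorizer

articles = {
    "University": "university campus faculty degree research students",
    "Student": "student study exam lecture university learning",
    "Education": "education school teaching learning curriculum",
}

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(articles.values())  # one row per article/concept

def esa_vector(text):
    """Map a text into the explicit concept space spanned by the articles."""
    return (A @ vectorizer.transform([text]).T).toarray().ravel()

print(dict(zip(articles, esa_vector("students attend lectures at university"))))
```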
Cross-Lingual ESA (CL-ESA)
Extension of ESA
– Uses Wikipedia cross-language links
– Linked articles define the same concept in different languages
– Every word in a language (EN, DE, ES, ...) maps to a weighted vector of article URIs, w_1·URI_1 + w_2·URI_2 + ... + w_n·URI_n, and the URIs are shared across languages
– An inverted index stores these concept vectors per language
– Semantic relatedness of a term@en and a term@de = cosine of their concept vectors (see the sketch below)
[Figure: English, German and Spanish words mapped via an inverted index to shared URI-weighted concept vectors, compared by vector cosine]
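A sketch of the cross-lingual comparison step, assuming the ESA vectors have already been computed; the URIs and weights below are invented for illustration:

```python
# CL-ESA sketch: cross-language links give an English article and its German
# counterpart the same URI, so ESA vectors from different languages share
# one concept space and can be compared directly.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Shared, language-independent concept axes (hypothetical URIs)
uris = ["URI:University", "URI:Student", "URI:Education"]

query_en = np.array([0.7, 0.5, 0.1])  # ESA vector of an English query
doc_de   = np.array([0.6, 0.4, 0.3])  # ESA vector of a German document

# Relevance = cosine similarity in the shared concept space
print(f"relatedness = {cosine(query_en, doc_de):.3f}")
```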
Concept Model: Latent
Intuition: a semantic space of latent concepts
– A cluster of similar things defines a latent concept
Example
– Latent Concept 1 (food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching
– Latent Concept 2 (animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster
– “Look at this cute hamster munching on a piece of broccoli” → 40% Latent Concept 1 and 60% Latent Concept 2
Concept Model: Latent
[Figure: latent concepts LC1, LC2, LC3 are derived from a training corpus; queries and documents are then mapped onto them]
Latent Semantic Analysis (LSA)
Definition
– Dimensionality reduction to find latent concepts
Approach (see the sketch below)
– Build a term-document matrix M
– Perform singular value decomposition (SVD) on M
– Approximate M by keeping the top N singular values; the N singular values correspond to N latent concepts
– U defines the term-concept correlations, V the document-concept correlations
Cross-lingual LSA
– Use a parallel corpus
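A minimal LSA sketch with a toy term-document matrix; the matrix values and the choice of N are arbitrary illustrations:

```python
# LSA sketch: truncated SVD of a toy term-document matrix M. For CL-LSA,
# M would be built from a parallel corpus (each column concatenating a
# document with its translations).
import numpy as np

M = np.array([[2., 0., 1.],   # rows: terms
              [1., 1., 0.],   # columns: documents
              [0., 2., 1.]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

N = 2  # keep the top-N singular values = N latent concepts
M_N = U[:, :N] @ np.diag(s[:N]) @ Vt[:N, :]  # rank-N approximation of M
# U[:, :N]: term-concept correlations; Vt[:N, :].T: document-concept correlations
print(np.round(M_N, 2))
```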
Latent Dirichlet Allocation (LDA)
Definition
– Generative model: each document is a mixture of latent concepts (topics), and each topic generates words; the parameters are learned from the corpus
Approach (see the sketch below)
– The topic distribution is assumed to have a Dirichlet prior
– Corpus-level and document-level parameters are fitted with a variational Expectation Maximization (EM) procedure
Cross-lingual LDA
– Use a parallel corpus
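A hedged LDA sketch using the gensim library (which fits LDA with variational Bayes, in the spirit of the variational EM above); the toy corpus stands in for the parallel training corpus, and for CL-LDA each training document would concatenate its translations:

```python
# LDA sketch: learn latent topics from a toy corpus, then represent a new
# document as a mixture over those topics.
from gensim import corpora
from gensim.models import LdaModel

texts = [["broccoli", "bananas", "breakfast"],
         ["kittens", "chinchillas", "cute", "hamster"],
         ["hamster", "munching", "broccoli"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# num_topics fixes the number of latent concepts in advance
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# A new document becomes a mixture over the latent topics
new_doc = dictionary.doc2bow(["cute", "hamster", "broccoli"])
print(lda[new_doc])  # e.g. [(0, 0.4), (1, 0.6)]
```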
Evaluation
Parallel corpora
– All documents are translated into many languages
Relevance assessment
– Use a document in one language as a query to retrieve documents in another language
– The translated document counts as the relevant document, so no manual relevance assessment is needed
Measures used
– Mean reciprocal rank (MRR), computed as sketched below
– Scores averaged over all language pairs
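A small sketch of the MRR computation under this protocol; the document IDs and rankings are invented for illustration:

```python
# MRR sketch: per query, the single relevant document is the query's
# translation; its rank in the result list gives the reciprocal rank.
def mean_reciprocal_rank(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_id) pairs."""
    total = 0.0
    for ranked, relevant in runs:
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)  # rank is 1-based
    return total / len(runs)

runs = [(["d1", "d2"], "d1"),        # translation ranked 1st -> 1
        (["d4", "d5", "d3"], "d3"),  # ranked 3rd -> 1/3
        (["d7", "d6"], "d6")]        # ranked 2nd -> 1/2
print(mean_reciprocal_rank(runs))    # (1 + 1/3 + 1/2) / 3 ~= 0.611
```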
Evaluation: Datasets
Multilingual corpora
– Multext corpus: 3,066 question/answer pairs from the Official Journal of the European Community
– JRC-Acquis corpus: 21,000 legislative documents of the European Union; 3,000 randomly selected documents were used as queries
Setup
– English, German and French documents were used
– The dataset was split for latent topic extraction: 60% for learning, 40% for testing
Evaluation: Datasets
Wikipedia snapshots
– 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
– Collection of 166,484 articles
CL-ESA: use the cross-language links to align concepts across languages
LSA/LDA: use Wikipedia as a parallel corpus, i.e. as the training corpus for latent concept extraction
Evaluation: Parameter
Cross-lingual ESA
– Problem: too many concepts
– Solution: keep only the m highest concept weights (see the sketch after this list)
LSI/LDA
– Problem: computational cost increases with the number of topics
– Solution: use a fixed number of latent topics
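A minimal sketch of the top-m truncation for CL-ESA vectors; the vector values and m are arbitrary:

```python
# Truncation sketch: zero out all but the m largest concept weights
# of an ESA vector, keeping the index positions intact.
import numpy as np

def truncate_top_m(vec, m):
    """Keep only the m highest-weighted concepts of an ESA vector."""
    out = np.zeros_like(vec)
    top = np.argsort(vec)[-m:]  # indices of the m largest weights
    out[top] = vec[top]
    return out

v = np.array([0.1, 0.9, 0.05, 0.4, 0.3])
print(truncate_top_m(v, m=2))  # -> [0.  0.9 0.  0.4 0. ]
```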
Evaluation: Results
Multext Dataset
Evaluation: Results
JRC-Acquis Dataset
Conclusion
Parameter tuning
– ESA performs well for m = 10,000
– A maximum of 500 topics was tested for LSI; performance is not yet maximal there, but seems to converge
Results
– LSA performs better than LDA
– CL-ESA and LSA produce comparable results
Explicit vs. latent
– The explicit model performs better than the latent model