Explicit vs. Latent Concept Models for Cross-Language Information Retrieval

Page 1: Explicit vs. latent concept models for cross language information retrieval

Copyright 2011 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Enabling Networked Knowledge

Explicit vs. Latent Concept Models for Cross-Language Information Retrieval

Nitish Aggarwal
DERI, NUI Galway

[email protected]

Tuesday, 26th June, 2012
DERI, Reading Group

Page 2: Explicit vs. latent concept models for cross language information retrieval

Based On:

Title: “Explicit vs. Latent Concept Models for Cross-Language Information Retrieval”
Authors: Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, Steffen Staab
Published: International Joint Conference on Artificial Intelligence, 2009

Page 3: Explicit vs. latent concept models for cross language information retrieval

Overview

Introduction
– Cross-lingual information retrieval (CLIR)
Concept Model
– Explicit Semantics
– Latent Semantics
Evaluation
Conclusion

Page 4: Explicit vs. latent concept models for cross language information retrieval

Introduction: CLIR

Cross-Lingual Information Retrieval
– Many documents and web sites are written in different languages
– Retrieve all information without a language barrier
– Query and documents are in different languages

Page 5: Explicit vs. latent concept models for cross language information retrieval

Introduction: CLIR

CLIR based on Machine Translation
– Translate either the queries or the documents
– Reduces the problem to monolingual retrieval
Issues:
– MT is not available for all language pairs
– Translation increases vocabulary mismatch

Page 6: Explicit vs. latent concept models for cross language information retrieval

Introduction: CLIR

Interlingua or concept based
Use a language-independent representation
– Map all queries and documents, whatever their language, to a concept space
– Define a concept space and a relevance function
[Figure: language-independent representation]

Page 7: Explicit vs. latent concept models for cross language information retrieval

Concept Model

Document in concept space
– D_i = {t_1, t_2, t_3, …, t_n}
– t_i in the space: association with every concept
– Composite semantics of all tokens: Σ t_i or Π t_i
Types of concept model
– Explicit
– Latent/implicit
[Figure: token t_i in a concept space with axes C1, C2, C3]
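The two composition options on this slide (Σ t_i and Π t_i) can be sketched as component-wise operations over token vectors. This is only an illustration: the function names and the association values below are made up, and the slides do not say which composition the authors prefer in general.

```python
from math import prod

def compose_sum(token_vectors):
    """Component-wise sum of token vectors (the Σ t_i option)."""
    return [sum(components) for components in zip(*token_vectors)]

def compose_product(token_vectors):
    """Component-wise product of token vectors (the Π t_i option)."""
    return [prod(components) for components in zip(*token_vectors)]

# Hypothetical association strengths of three tokens with concepts C1, C2, C3.
tokens = [
    [0.5, 0.25, 0.0],
    [0.5, 0.5, 0.25],
    [0.0, 0.25, 0.5],
]

doc_sum = compose_sum(tokens)       # document vector under summation
doc_prod = compose_product(tokens)  # document vector under multiplication
```

Note that the product zeroes out any concept that a single token is not associated with, while the sum keeps it.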

Page 8: Explicit vs. latent concept models for cross language information retrieval

Concept Model: Explicit

Intuition: define concepts from external resources
Definition of concepts
– Wikipedia articles, tagged web pages
Cover a broad range of vocabulary and languages
Example
– Wikipedia-based Explicit Semantic Analysis (ESA)

Page 9: Explicit vs. latent concept models for cross language information retrieval

Concept Model: ESA

Explicit concept space
– D_i = {t_1, t_2, t_3, …, t_n}
– t_i = w_1 a_1 + w_2 a_2 + … + w_n a_n
Composite semantics of all tokens
– Σ t_i
[Figure: query and docs in a concept space with axes University, Student, Education]
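A minimal sketch of the ESA representation above, assuming the a_j are Wikipedia articles and the weights w_j are tf-idf scores of the token in each article. The three "articles" are tiny made-up stand-ins, and the tf-idf variant is an assumption, since the slide does not specify the weighting.

```python
import math
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles; each article is one concept.
articles = {
    "University": "university campus student professor research",
    "Student": "student exam university course",
    "Education": "education school teaching student learning",
}

def build_inverted_index(articles):
    """Map each term to {concept: tf-idf weight} (the token vectors t_i)."""
    n = len(articles)
    tf = {name: Counter(text.split()) for name, text in articles.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    index = {}
    for name, counts in tf.items():
        for term, freq in counts.items():
            idf = math.log(n / df[term])
            index.setdefault(term, {})[name] = freq * idf
    return index

def esa_vector(text, index):
    """Sum the token vectors of a text into one concept vector (Σ t_i)."""
    vec = Counter()
    for token in text.lower().split():
        for concept, weight in index.get(token, {}).items():
            vec[concept] += weight
    return dict(vec)

index = build_inverted_index(articles)
query_vec = esa_vector("university research", index)
best_concept = max(query_vec, key=query_vec.get)
```

Tokens that occur in every article (here "student") get an idf of zero and so contribute nothing, which is the usual tf-idf behaviour.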

Page 10: Explicit vs. latent concept models for cross language information retrieval

Cross lingual - ESA

Extension of ESA
– Use Wikipedia cross-language links
– Linked articles define the same concepts in different languages
[Figure: for each language (EN, DE, ES), an inverted index maps every word (Word1 … Wordn) to a weighted concept vector w1*URI1 + w2*URI2 + … + wn*URIn; the vectors of Term@en and Term@de are compared by vector cosine to obtain their semantic relatedness]
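The comparison step in the diagram can be sketched as follows, assuming concept vectors are sparse dictionaries and that cross-language links are available as a simple title-to-title mapping. The link table, the weights, and the helper names are all hypothetical.

```python
import math

# Hypothetical cross-language links: German article title -> English article title.
cross_links = {"Universität": "University", "Bildung": "Education"}

def to_shared_space(vector, links):
    """Rename concepts through cross-language links; unlinked concepts are dropped."""
    return {links[c]: w for c, w in vector.items() if c in links}

def cosine(u, v):
    """Cosine similarity of two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up concept vectors for an English term and a German term.
term_en = {"University": 3.0, "Education": 4.0}
term_de = {"Universität": 3.0, "Bildung": 4.0, "Hochschule": 1.0}

relatedness = cosine(term_en, to_shared_space(term_de, cross_links))
```

Dropping concepts without a cross-language link is one possible design choice; it restricts the shared space to articles that exist in both languages.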

Page 11: Explicit vs. latent concept models for cross language information retrieval

Concept Model: Latent

Intuition: semantic space of latent concepts
Definition of latent concepts
– A cluster of similar things defines a latent concept
Latent Concept 1 (food): 30% broccoli, 15% bananas, 10% breakfast, 10% munching
Latent Concept 2 (animals): 20% chinchillas, 20% kittens, 20% cute, 15% hamster
Example: "Look at this cute hamster munching on a piece of broccoli" (40% Latent Concept 1 and 60% Latent Concept 2)

Page 12: Explicit vs. latent concept models for cross language information retrieval

Concept Model: Latent

[Figure: latent concepts LC1, LC2, LC3 are derived from a training corpus; query and docs are then placed in the space spanned by LC1, LC2, LC3]

Page 13: Explicit vs. latent concept models for cross language information retrieval

Latent Semantic Analysis (LSA)

Definition
– Dimensionality reduction to find latent concepts
Approach
– Build the term-document matrix M
– Perform singular value decomposition (SVD) on M
– Approximate M by keeping the top N singular values
  – N singular values reflect N different latent concepts
  – U defines the term-concept correlation
  – V defines the document-concept correlation
Cross-lingual LSA
– Use a parallel corpus
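In standard notation, the approach above amounts to a truncated SVD. The symbols Σ, the subscript N for truncation, and the query projection are the usual LSA conventions rather than something stated on the slide; in particular, the folding-in formula for a query term vector q is a common practice assumed here, not necessarily the authors' exact procedure.

```latex
M = U \Sigma V^{\top}, \qquad
M \approx M_N = U_N \Sigma_N V_N^{\top}

% Folding a query term vector q into the N-dimensional latent space:
\hat{q} = \Sigma_N^{-1} U_N^{\top} q
```

Here U_N holds the first N columns of U (one row per term), V_N the first N columns of V (one row per document), and Σ_N the N largest singular values on its diagonal.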

Page 14: Explicit vs. latent concept models for cross language information retrieval

Latent Dirichlet Allocation (LDA)

Definition
– Generative model
  – Latent concepts (topics) generate words
  – Documents are generated from mixtures of topics; the parameters are learned from the corpus
Approach
– The topic distribution is assumed to have a Dirichlet prior
– Fit corpus-level and document-level parameters using a variational Expectation Maximization (EM) procedure
Cross-lingual LDA
– Use a parallel corpus
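For reference, the standard LDA generative process can be written as below. The symbols α, β, θ_d, z and w follow the usual LDA notation and do not appear on the slide.

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha)
  \quad \text{(topic mixture of document } d\text{)}

z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)
  \quad \text{(topic of the } n\text{-th word)}

w_{d,n} \mid z_{d,n}, \beta \sim \mathrm{Multinomial}(\beta_{z_{d,n}})
  \quad \text{(word drawn from that topic)}
```

The corpus-level parameters are α and β; the document-level variable is θ_d, which serves as the document's representation in the latent concept space.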

Page 15: Explicit vs. latent concept models for cross language information retrieval

Evaluation

Parallel corpora
– All documents are translated into many languages
Relevance assessment
– Use documents in one language as queries to retrieve documents in another language
– Translated document = relevant document
  – No manual relevance assessment is needed
Measures used
– Mean reciprocal rank (MRR)
– Average score over all language pairs
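A minimal sketch of mean reciprocal rank for this setup, where each query has exactly one relevant document (its translation). The document ids and rankings are made up; a query whose translation is not retrieved at all contributes zero, which is an assumption about how misses are scored.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the single relevant document per query.

    rankings: query id -> ranked list of retrieved document ids
    relevant: query id -> id of the translated (relevant) document
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_docs in rankings.items():
        target = relevant[query_id]
        if target in ranked_docs:
            total += 1.0 / (ranked_docs.index(target) + 1)
    return total / len(rankings)

# Hypothetical English queries with ranked German results.
rankings = {
    "en-1": ["de-1", "de-7", "de-3"],          # translation at rank 1
    "en-2": ["de-5", "de-2", "de-9"],          # translation at rank 2
    "en-3": ["de-4", "de-8", "de-6", "de-3"],  # translation at rank 4
}
relevant = {"en-1": "de-1", "en-2": "de-2", "en-3": "de-3"}

mrr = mean_reciprocal_rank(rankings, relevant)  # (1 + 1/2 + 1/4) / 3
```

Averaging over all language pairs would then mean computing this value once per (query language, document language) pair and taking the mean.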

Page 16: Explicit vs. latent concept models for cross language information retrieval

Evaluation: Datasets

Multilingual corpora
– Multext Corpus
  – 3066 Q/A pairs from the Official Journal of the European Community
– JRC-Acquis Corpus
  – 21,000 legislative documents of the European Union
  – We randomly selected 3,000 documents as queries
Set up
– English, German and French documents were used
– Split dataset for latent topic extraction
  – 60% learning, 40% testing

Page 17: Explicit vs. latent concept models for cross language information retrieval

Evaluation: Datasets

Wikipedia
– Snapshot
  – 03/12/2008 (English), 06/25/2008 (French), 06/29/2008 (German)
  – Collection of 166,484 articles
– CL-ESA: use cross-language links for concepts in different languages
– LSA/LDA: Wikipedia as parallel corpus
  – Use it as training corpus for latent concept extraction

Page 18: Explicit vs. latent concept models for cross language information retrieval

Evaluation: Parameter

Cross-lingual ESA
– Problem
  – Too many concepts
– Solution
  – Only use the highest m values
LSI/LDA
– Problem
  – Computational costs increase with the number of topics
– Solution
  – Use a fixed number of latent topics
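The CL-ESA pruning step ("only use the highest m values") can be sketched as keeping the m largest entries of a sparse concept vector. The function name and the example weights are hypothetical.

```python
import heapq

def truncate_top_m(vector, m):
    """Keep only the m highest-weighted concepts of a sparse concept vector."""
    return dict(heapq.nlargest(m, vector.items(), key=lambda item: item[1]))

# Made-up ESA vector: concept -> association weight.
esa_vec = {"University": 0.9, "Student": 0.7, "Education": 0.4, "Banana": 0.01}

pruned = truncate_top_m(esa_vec, 2)
```

With real Wikipedia-sized vectors, this keeps storage and cosine computation bounded by m rather than by the number of articles.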

Page 19: Explicit vs. latent concept models for cross language information retrieval

Evaluation: Results

Multext Dataset

Page 20: Explicit vs. latent concept models for cross language information retrieval

Evaluation: Results

JRC-Acquis Dataset

Page 21: Explicit vs. latent concept models for cross language information retrieval

Conclusion

Parameter tuning
– ESA performs well for m = 10,000
– A maximum of 500 topics was tested for LSI
  – Not maximal performance, but it seems to converge
Results
– LSA performs better than LDA
– Comparable results for CL-ESA and LSA
Explicit vs. implicit
– The explicit model performs better than the latent models