comad tutorial on - cse, iit bombaycomad/2006/tutorial_jayant.pdf · descartes rené les...

December 2006 Slide 1 COMAD 2006 Tutorial COMAD Tutorial on Multi-lingual Database Systems Jayant Haritsa Indian Institute of Science Bangalore, India Database Systems Laboratory


December 2006 Slide 1COMAD 2006 Tutorial

COMAD Tutorial on Multi-lingual Database Systems

Jayant Haritsa

Indian Institute of Science Bangalore, India

Database Systems Laboratory

Tutorial Contents

• Motivation
• Multilingual Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research

Organization

• Motivation
• Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research

Multilingual Portal

• Tracks History (User History)
• Generates Recommendations (User Pref. & Mined Data)
• Generates Incentives (User Pref. & Mined Data)
• Product Categories (Meta-Data)

In Database-speak…
• Data: 1TB Live; 25TB Warehouse
• DBMS: 26
• Systems: 1500 Servers
• Deployment: A set of Monolingual DBMS
• Sub-second Response time

What is needed in DBMS to make such a Portal Multilingual?

Example Multilingual Database

• Books.com

Author | Author_FN | Title | Price | Category | Language
Durant | Will/Ariel | History of Civilization | $ 49.95 | History | English
நேரு | ஜவஹர்லால் | ஆசிய ஜோதி | INR 250 | சரித்திரம் | Tamil
Nero | Bicci | Il Coronation del Virgin | € 75.00 | Arti Fini | Italian
Lebrun | François | L'Histoire De La France | € 19.95 | Histoire | French
[Arabic text not recoverable] | [not recoverable] | [not recoverable] | SAR 95 | [not recoverable] | Arabic
नेहरू | जवाहरलाल | भारत एक खोज | INR 175 | [Hindi text not recoverable] | Hindi
Gilderhus | Mark T. | History and Historians | £ 35.00 | Historiography | English
Σαρρη | Κατερινα | Παιχνίδια στο Πιάνο | € 12.00 | Μουσική | Greek
Nehru | Jawaharlal | Letters to My Daughter | £ 15.00 | Autobiography | English
Descartes | René | Les Méditations Metaphysiques | € 99.95 | Philosophie | French
無門 | 慧開 | 無門關 | ¥ 7500 | 禅 | Japanese
காந்தி | மோகன்தாஸ் | சத்திய சோதனை | € 99.95 | சுயசரிதம் | Tamil

Granularity of Multi-lingualism

• Uni-lingual rows, multi-lingual columns
• Uni-lingual columns, multi-lingual rows
• Multi-lingual rows, multi-lingual columns
• Multi-lingual attribute values

Why Worry about Multilingual Data?

• Growing Multilingual On-line Users and Data
  – By 2010, most of the web-pages will be multilingual
    • Today English down to 35% from 90% in 1995
  – Non-native English speaking population has grown rapidly from about one-third in mid-90’s to about two-thirds in mid-00’s
• E-commerce Implications
  – Opens up enormous new markets
  – Users are four times more likely to buy a product or a service if the information is presented in their native language [Aberdeen]
• E-governance Implications
  – Opens up communication to native communities

Sample Applications

• Vidyanidhi
  – Portal for all Indian research theses, hosted at Univ. of Mysore, Karnataka, India
  – Contains close to 100000 records in English, Hindi, Kannada
• Bhoomi
  – Computerized Land Record System in State of Karnataka, India
    • storing 20 million records with composite information in Kannada and English
  – to be followed in all other states as well

MLDB Research Questions (1)

• Are today’s database systems equally
  (a) functional
  (b) efficient
  across all human languages?
  – i.e. is the DBMS “natural-language-neutral”?
  – Specifically, is there a preference for Latin-script based languages (English, French, German, …) as compared to those based on other scripts (Arabic, Cyrillic, CJK, Indic, …)?

MLDB Research Questions (2)

• Are new functionalities desired from a multilingual database system?
  – i.e. multi-lingual SQL operators?

Organization

• Motivation
• Multilingual Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research


Multilingual Functionality

To assess the support offered by current database systems and standards for multilingual data

Background – Character Encoding

• A Character is the smallest component of a written language that has a semantic value. The set of all the characters in a language is called a repertoire.
• A Character Encoding assigns a unique numerical value to each of the characters in a repertoire.
  – Several encodings available

Multilingual Character Encoding

• ASCII [& ISO 8859] Encoding
  – 7-bit [8-bit] for English [Western European]
• ISCII Encoding
  – 8-bit (proprietary) encoding for Indic Languages
• ISO 10646
  – Universal Character Set (UCS-2)
  – Uniform 2-Byte encoding for all languages
• Unicode Encoding
  – 2-byte encoding along the lines of UCS-2 (UTF-16)
    • The default standard for Multilingual Data Storage in DBMS
  – Has a variable-byte encoding (UTF-8) that favors ASCII (Western European Languages)

Unicode

• Unicode is a uniform encoding standard (2 bytes per character for the commonly used scripts of the Basic Multilingual Plane) that allows storage of characters from any known alphabet or ideographic system, irrespective of platform or programming environment.
• Unicode codes are arranged in Character Blocks, which encode contiguously the characters of a given Script (usually a single language).
• Unicode has 3 different byte encodings, UTF-8, UTF-16 and UTF-32, which store the same character using 8-bit, 16-bit or 32-bit code units respectively.

Sample Encodings

Lang. | Encoding | Multilingual String | Representation (Hexadecimal)
English | ASCII | Narayan | 4E.61.72.61.79.61.6E
English | Unicode (UTF-16) | Narayan | 00.4E.00.61.00.72.00.61.00.79.00.61.00.6E
English | Unicode (UTF-8) | Narayan | 4E.61.72.61.79.61.6E
Tamil | ISCII | நாராயன் | A8.BE.B0.BE.AF.A9.CD
Tamil | Unicode (UTF-16) | நாராயன் | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD
Tamil | Unicode (UTF-8) | நாராயன் | E0.AE.A8.E0.AE.BE.E0.AE.B0.E0.AE.BE.E0.AE.AF.E0.AE.A9.E0.AF.8D
Kannada | ISCII | ನಾರಾಯಣ್ | A8.BE.B0.BE.AF.A3.CD
Kannada | Unicode (UTF-16) | ನಾರಾಯಣ್ | 0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD
Kanji | Unicode (UTF-16) | 寺井正博 | 5B.FA.4E.95.6B.63.53.5A
Kanji | Unicode (UTF-8) | 寺井正博 | E5.AF.BA.E4.BA.95.E6.AD.A3.E5.8D.9A
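The Unicode byte sequences for such strings can be checked with Python's built-in codecs. This is a minimal sketch; the helper name `hex_bytes` is an assumption made for illustration, and the big-endian codec variants are used so that no byte-order mark is emitted.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as dot-separated uppercase hex."""
    return ".".join(f"{b:02X}" for b in text.encode(encoding))

# "Narayan" written in Tamil script, spelled out as code points.
tamil = "\u0BA8\u0BBE\u0BB0\u0BBE\u0BAF\u0BA9\u0BCD"

for label, text in [("English", "Narayan"), ("Tamil", tamil)]:
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        size = len(text.encode(enc))
        print(f"{label:8} {enc:10} {size:3} bytes  {hex_bytes(text, enc)}")
```

The Latin string takes 7 bytes in UTF-8 and 14 in UTF-16, while the 7-character Tamil string takes 21 bytes in UTF-8 and 14 in UTF-16, which illustrates how UTF-8 favors ASCII text.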

Database Systems

Commercial Systems: Oracle, Microsoft SQL Server, IBM DB2

Open-source: MySQL, PostgreSQL

For legal reasons, will randomly refer to them as Systems A, B, C, D, E

Standards and Systems

Database | SQL:1999 Standard | System A | System B | System C | System D | System E
Storage Format | National Char | Unicode UTF-8/16 | UCS-2 | Unicode UTF-8/16 | Unicode UTF-8 | Binary
Collations | System Defined & User Definable | Pre-defined | OS Specified (Pre-defined) | Pre-defined | User Definable (source changes) | User Definable (source changes)
Indexing | Any Collations | System Collations | System Collations | System Collations | Any Collations | Any Collations
Query Processing | Similar to Character | Similar to Character | Similar to Character | Similar to Character | Similar to Character | Binary
Locales | ~25 | ~50 | OS Defined | ~40 | ~30 | ~25
Support Level | No Spec | Data | Data + Meta-data | Data | Data | Data
Cross-Language Query Support | No Spec | None | None | None | None | None

Remarks

Database systems are generally equivalent in their storage and querying of multilingual data, and offer similar SQL querying power ...

However,
- Uniformly, no cross-language support
- Unknown differential performance

10,000’ View …

TEXT DATA
• Text Strings: Proper Names, Documents, String Values, Category Attributes, Other Attributes
• Examples: Nehru, நேரு, नेहरू, Amazon

Representation (Visual)
• Grapheme Image
• Encoding (ASCII, Unicode …)
• Scripts (English, Hindi …)
• unicode ⇔ cuniform

Phonemic Transformation (Aural)
• Phoneme Strings, e.g. æməzɑn, /N/ /ae/ /R/ /oo/
• Encoding (ITrans, Unicode …)
• Normalized Representation (IPA, Arpabet …)

Semantic Transformation (Semantics, Concepts)
• Multilingual Synsets (WordNet)
• Abstracted Synsets (WordNet Taxonomies)

Organization

• Motivation
• Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research


Multilingual Performance

Are the DBMS’s natural-language neutral, wrt performance?

Database Setup

• Generated a 1 GB TPC-H Database
• Tables modified to hold both original CHAR [ASCII] Data and equivalent NCHAR [Unicode] Data
• Experiments conducted with
  – separate CHAR and NCHAR tables
  – common tables (to eliminate the impact of disk I/O)
• Example PartSuppCommon table with equivalent Tamil data

Part ID | PartName (ASCII) | PartName_NChar (Unicode) | Supplier ID | SuppName (ASCII) | SuppName_NChar (Unicode)
18 | Part Name #000018 | பாகம் பெயர் #000018 | 2503 | Supplier #2503 | தயாரிப்பவர் #2503

System and DBMS Setup

• Stand-alone Pentium-IV running Windows 2000
  – Cold-start ensured before each experiment
• DBMS: Oracle, DB2, SQL Server, Postgres
  – Installed with Default Configurations
• Display time nullified through aggregate functions in the select clause

Database Operators Measured

• Table-Scan
  – Time for scanning for a specific key
• Index-Create and Index-Scan
  – Time for creating index
  – Time for retrieving 20% of search keys in index
• Sort
  – Time for sorting the attribute
• Join [Nested-Loop, Hash, Sort-Merge]
  – Join types forced by Optimizer Hints, Setting Optimization Levels, etc.
  – Plan pictures verified the use of appropriate join type in a query

Sample Queries

(The NChar variant of each predicate is shown in braces.)

Join Operator

select count(*)
from PartSuppCommon PS1, PartSuppCommon PS2
where PS1.SuppName = PS2.SuppName
      { PS1.SuppName_NChar = PS2.SuppName_NChar }
  and PS1.PartName <> PS2.PartName
      { PS1.PartName_NChar <> PS2.PartName_NChar }

Table-Scan Operator

select count(*)
from PartSuppCommon PS1
where PS1.SuppName = 'Supplier #2503'
      { PS1.SuppName_NChar = 'தயாரிப்பவர் #2503' }
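The paired Char/NChar queries can be tried out on a toy table. The sketch below uses Python's bundled `sqlite3` module purely as a stand-in (it is not one of the systems measured in the tutorial); the column sizes, the three rows and the Tamil strings are illustrative assumptions, not TPC-H data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE PartSuppCommon (
        PartID INTEGER, PartName CHAR(55), PartName_NChar NCHAR(55),
        SupplierID INTEGER, SuppName CHAR(25), SuppName_NChar NCHAR(25)
    )""")

# Hypothetical rows: two parts from supplier 2503, one part from supplier 2504.
rows = [
    (part, f"Part Name #{part:06d}", f"பாகம் பெயர் #{part:06d}",
     supp, f"Supplier #{supp}", f"தயாரிப்பவர் #{supp}")
    for part, supp in [(18, 2503), (19, 2503), (20, 2504)]
]
conn.executemany("INSERT INTO PartSuppCommon VALUES (?, ?, ?, ?, ?, ?)", rows)

def count(sql: str, *params) -> int:
    """Run a count(*) query and return its single value."""
    return conn.execute(sql, params).fetchone()[0]

scan = "select count(*) from PartSuppCommon PS1 where PS1.{col} = ?"
join = ("select count(*) from PartSuppCommon PS1, PartSuppCommon PS2 "
        "where PS1.{s} = PS2.{s} and PS1.{p} <> PS2.{p}")

scan_char = count(scan.format(col="SuppName"), "Supplier #2503")
scan_nchar = count(scan.format(col="SuppName_NChar"), "தயாரிப்பவர் #2503")
join_char = count(join.format(s="SuppName", p="PartName"))
join_nchar = count(join.format(s="SuppName_NChar", p="PartName_NChar"))
print(scan_char, scan_nchar, join_char, join_nchar)
```

Both variants of each query return the same count, so any timing difference between them isolates the cost of the data type rather than a difference in result size.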

Performance Metrics

• Operator Performance
  – Measures Operator Differential Performance between Char and NChar
  – $T_{NChar} / T_{Char}$ (Ideal Value: 1)
• Database Relative Efficiency
  – Measures Database Differential Performance between Char and NChar
  – $GM_{NChar} / GM_{Char}$ (Ideal Value: 1)
• Optimizer Prediction Equity
  – Optimizer Prediction [In]Equity between Char and NChar
  – $(O_{NChar} / O_{Char}) / (T_{NChar} / T_{Char})$ (Ideal Value: 1)
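A small sketch of how the three ratios could be computed. It assumes that T denotes a measured execution time, GM the geometric mean of the operator times, and O the optimizer's estimated cost; these readings of the symbols, the function names and all timings below are assumptions made for illustration.

```python
from math import prod

def geometric_mean(values) -> float:
    """Geometric mean of a collection of positive numbers."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))

def operator_ratio(t_nchar: float, t_char: float) -> float:
    """Operator differential performance, T_NChar / T_Char."""
    return t_nchar / t_char

def prediction_equity(o_nchar, o_char, t_nchar, t_char) -> float:
    """Predicted cost ratio divided by the measured time ratio."""
    return (o_nchar / o_char) / (t_nchar / t_char)

# Hypothetical timings in seconds for three operators.
char_times = {"table_scan": 2.0, "sort": 4.0, "hash_join": 8.0}
nchar_times = {"table_scan": 3.0, "sort": 6.0, "hash_join": 12.0}

ratios = {op: operator_ratio(nchar_times[op], char_times[op]) for op in char_times}
db_ratio = geometric_mean(nchar_times.values()) / geometric_mean(char_times.values())
equity = prediction_equity(o_nchar=150.0, o_char=100.0, t_nchar=3.0, t_char=2.0)
print(ratios, db_ratio, equity)
```

In this made-up case every operator is 1.5 times slower on NChar, so the database-level ratio is also 1.5, and an optimizer that predicts a 1.5 times higher cost scores an equity of 1.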

Operator Performance (Common Table)

• MRO_Oper Metric (Ideal Value = 1)

Operator | Database System A | Database System B | Database System C
Table-Scan | 2.72 | 1.33 | 1.06
Index-Create | 1.21 | 1.25 | 1.39
Index-Scan | 2.75 | 1.35 | 1.97
Sort | 1.81 | 1.48 | 1.23
Join (Nested-Loop) | 1.03 | 1.03 | 1.59
Join (Hash) | 2.60 | 1.35 | 1.29
Join (Sort-Merge) | 1.92 | 1.55 | 1.34

Database System D: 2.70, 1.79, 1.01, 1.24, 1.36, 1.12, 1.32 (operator order not recoverable from the slide)

• In a Nutshell:
  – Wide variation in System Performance
  – Slowdown can be as much as 200%
  – Generally, 30-100% Slower for NChar

All operators are slow on Multilingual Data

Overall DBMS Performance

• ME_DBMS Metric (Ideal Value = 1)

Database | Efficiency
Database System A | 0.57
Database System B | 0.80
Database System C | 0.70
Database System D | 0.69

• Databases are about 50% to 100% inefficient in multilingual world
  – Note, conservative estimate since only considering in-memory differentials because of common table
  – With separate tables, the inefficiency jumps to several hundred percent (e.g. slowdown was up to 475% for Scans and up to 275% for Joins)

Query Optimizer Performance

• MPE_Oper Metric (Ideal Value = 1)

Operator | Database System A | Database System B | Database System C
Table-Scan | 0.37 | 0.75 | 0.94
Index-Scan | 0.38 | 1.55 | 0.31
Join (Nested-Loop) | 0.97 | 0.97 | 1.16
Join (Hash) | 1.26 | 0.75 | 1.22
Join (Sort-Merge) | 0.89 | 1.20 | 0.95

Database System D: 0.74, 0.55, 0.37, 0.99, 0.89 (operator order not recoverable from the slide)

• In a Nutshell:
  – Generally, 5-100% mis-prediction
  – Could be due to erroneous cost-models between Char and NChar.

Could lead to grossly inefficient plans.

Analysis of Performance

• Experiments on DBMS system A
  – System A exhibited worst differential performance
• Our Objective:
  – What are the components of the slowdown?
  – How can these be addressed in improving performance?

Slowdown vs String Size

• How does the slowdown vary with (logical) String Length?
• High Differential in Scan at small string sizes indicates:
  – high fixed-cost overheads (such as call overhead)
• Increasing differential cost indicates:
  – higher variable-cost overheads (such as string handling for function calls)

Slowdown w.r.t. Data Type

• We created Common Table with:
  – Char Attributes of size 110; NChar Attributes of size 55
  – Attributes have same physical size, but are of different types
• We ran the same queries
  – Call overheads are the same
  – Only difference is in Datatype specific code-segments in common operator implementation
• The observed differential performance is ~10-15% of corresponding Operator Slowdowns
  – Small, but not insignificant.

Slowdown w.r.t. String Processing

• Created Common Table (as before)
  – All NChar attributes replaced with Char attributes of twice the size
• Ran the same queries
  – Disk I/O & Call Overheads are the same, except the data being passed as parameter to the operator functions is different in size, same in type
  – Measures any differences in in-memory handling of different sized strings in a common operator implementation
• The observed differential performance is ~80-90% of the corresponding Operator Slowdowns
  – This component contributes primarily to the operator slowdown

Overall Performance Analysis

• The slowdown of NChar over Char (including Disk I/O) is very large (several hundred percent)
• The slowdown of NChar over Char (considering only in-memory processing) is still large (50 to 100%):
  – Primarily, due to size of the NChar Strings (~85%)
  – Secondarily, due to the type-specific implementation (~10%)
• Hence, to improve performance the size of the NChar storage must be tackled …


Cuniform Storage Format

Two Observations…

#1: Unicode = Character Block + Offset (about half the bits for character block)

Lang | Encoding | String | Representation (Hexadecimal)
Tamil | ISCII | நாராயன் | A8.BE.B0.BE.AF.A9.CD
Tamil | Unicode | நாராயன் | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD

#2: An Attribute Value is likely to be in ONE script.

Author | ISBN | Title | Price
Nehru | 992 | Discovery of India | $12.95
नेहरू | 127 | भारत एक खोज | Rs 975
नेहरू | 590 | [Hindi title not recoverable] (Vol 1) | Rs 175

Cuniform – Compressed UNIcode FORMat

• Store each string data item as an ordered pair
  – Common Character Block
  – Offset of each character, in the common Char Block

• Original Unicode Strings …

String | Unicode Representation
Narayan | 00.4E.00.61.00.72.00.61.00.79.00.61.00.6E
நாராயன் | 0B.A8.0B.BE.0B.B0.0B.BE.0B.AF.0B.A9.0B.CD
ನಾರಾಯಣ್ | 0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD
RKನಾರಾಯಣ್ | 00.52.00.4B.0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD

• … After skinning into Cuniform pair

String | Character Block | Offsets
Narayan | 00 | 4E.61.72.61.79.61.6E
நாராயன் | 0B | A8.BE.B0.BE.AF.A9.CD
ನಾರಾಯಣ್ | 0C | A8.BE.B0.BE.AF.A3.CD
RKನಾರಾಯಣ್ | NULL | 00.52.00.4B.0C.A8.0C.BE.0C.B0.0C.BE.0C.AF.0C.A3.0C.CD
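The ordered-pair idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tutorial's implementation: the "character block" is approximated by the high byte of each UTF-16 code unit, a string spanning several blocks falls back to a NULL block with the full UTF-16 bytes kept, and the names `to_cuniform` and `from_cuniform` are hypothetical.

```python
from typing import Optional, Tuple

def to_cuniform(text: str) -> Tuple[Optional[int], bytes]:
    """Split a string into (common block, per-character offsets).

    Returns (None, full UTF-16 bytes) when the characters do not share
    a single high byte, mirroring the NULL-block case.
    """
    units = text.encode("utf-16-be")
    high_bytes = set(units[0::2])
    if len(high_bytes) == 1:
        return units[0], units[1::2]   # one byte per character
    return None, units                 # mixed scripts (or empty): keep as-is

def from_cuniform(block: Optional[int], payload: bytes) -> str:
    """Rebuild the original string from a Cuniform-style pair."""
    if block is None:
        return payload.decode("utf-16-be")
    return bytes(b for off in payload for b in (block, off)).decode("utf-16-be")

tamil = "\u0BA8\u0BBE\u0BB0\u0BBE\u0BAF\u0BA9\u0BCD"
kannada = "\u0CA8\u0CBE\u0CB0\u0CBE\u0CAF\u0CA3\u0CCD"
for s in ("Narayan", tamil, kannada, "RK" + kannada):
    block, payload = to_cuniform(s)
    print(block, payload.hex("."), from_cuniform(block, payload) == s)
```

For single-script values the payload is half the size of the UTF-16 form, which is where the space saving comes from; the conversion back to Unicode is a simple interleave.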


Implementation of Cuniform

• Transparently remap all the SQL queries to work on the Cuniform Pairs

• For presentation of results, conversion from Cuniform to Unicode is trivial

Operator Performance on Cuniform

• Outside-the-engine Implementation
• Space Occupancy
  – Only 2% larger than Char; compare this with NChar’s 100% overhead

Operator | Unicode Slow-down | Cuniform Slow-down
TableScan | 2.56 | 1.05
IndexScan | 1.88 | 1.99
Join (Sort-Merge) | 1.99 | 1.15
Join (Hash) | 2.74 | 1.22
Join (Nested-Loop) | 1.04 | 1.03

Largely the Differential Performance is eliminated (Except, Index Scan)

Cuniform Performance Summary

• Overall,
  – Generally, Better Performance
    • Index Tree is built on a pair of attributes, resulting in worse performance
  – Multilingual Efficiency up to 0.81 from 0.57
    • With inside-engine implementation, can be made even better
  – The performance is made almost natural-language-neutral …


Remarks

There is a performance barrier separating languages in Latin script (e.g., English) from those in other scripts (e.g., Indic languages), but this barrier can be largely broken down with the Cuniform format …

Organization

• Motivation
• Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research

New Multilingual Operators

Objective: To assess the new operators desired from a multilingual database system


MLNameJoin Operator

† Referred to as LexEQUAL in published work

Multilingual Books.com (with Language Information)

Author | Author_FN | Title | Price | Category | Language
Durant | Will/Ariel | History of Civilization | $ 49.95 | History | English
நேரு | ஜவஹர்லால் | ஆசிய ஜோதி | INR 250 | சரித்திரம் | Tamil
Nero | Bicci | Il Coronation del Virgin | € 75.00 | Arti Fini | Italian
Lebrun | François | L'Histoire De La France | € 19.95 | Histoire | French
[Arabic text not recoverable] | [not recoverable] | [not recoverable] | SAR 95 | [not recoverable] | Arabic
नेहरू | जवाहरलाल | भारत एक खोज | INR 175 | [Hindi text not recoverable] | Hindi
Gilderhus | Mark T. | History and Historians | £ 35.00 | Historiography | English
Σαρρη | Κατερινα | Παιχνίδια στο Πιάνο | € 12.00 | Μουσική | Greek
Nehru | Jawaharlal | Letters to My Daughter | £ 15.00 | Autobiography | English
Descartes | René | Les Méditations Metaphysiques | € 99.95 | Philosophie | French
無門 | 慧開 | 無門關 | ¥ 7500 | 禅 | Japanese
காந்தி | மோகன்தாஸ் | சத்திய சோதனை | € 99.95 | சுயசரிதம் | Tamil

Multilingual Selection

Suppose a User wants the books of “Nehru” in English, Tamil and Hindi …

Author | Author_FN | Title | Price | Language
நேரு | ஜவஹர்லால் | ஆசிய ஜோதி | INR 250 | Tamil
Nehru | Jawaharlal | Letters to My Daughter | £ 15.00 | English
नेहरू | जवाहरलाल | भारत एक खोज | INR 175 | Hindi

Current SQL Approach

Select Author, Title, ... From Books
Where Author = 'Nehru'
   or Author = 'நேரு'
   or Author = 'नेहरू' ...

• Problems with this approach
  – User needs to be fluent in all the target languages
  – Need specialized lexical resources (fonts, keyboard mappings, etc.) for input
  – Input prone to Error, due to the lack of Directory support
• Further, CANNOT BE USED TO EXPRESS JOIN ACROSS MULTILINGUAL COLUMNS

Proposed MLNameJoin Query

Select Author, Title, ... From Books
where Author MLNameJoin “Nehru” Threshold 0.2
InLanguages { English, Tamil, Hindi }

• Input in a convenient language, with Multilingual output
• Equivalence based on intuitive Phonetic correspondence
  – restricted to proper names (form about 20 percent of text corpora)
• Customizable Fuzzy Matching
  – Threshold Parameter
• Most Importantly, extensible…
  – “Retrieve in All Languages” ( InLanguages { * } )
  – Join ( Author MLNameJoin Faculty )

Matching Strategy

• Store Multilexical Strings in Database
  – In Unicode (or Cuniform)
• To match, transform to equivalent phonemic strings in IPA alphabet using standard Text-to-Phoneme (TTP) converters …
• … and compare transformed strings using Approximate Matching Techniques
  – Incompatibilities in Phonemes of different languages

MLNameJoin transforms matching from textual space to phonemic space

Example MLNameJoin Operation

Books Table (the Author (Phonemes) column is generated):

Author | Author (Phonemes) | Title | Language
நேரு | næru | ஆசிய ஜோதி | Tamil
Nehru | næhru | Discovery of India | English
नेहरू | næhru | भारत एक खोज | Hindi
Nero | nerou | The Coronation of the Virgin | English
Franklin | frAŋklın | Ein Amerikanischer Autobiographie | German

The Query:

Select Author ... From Books
where Author MLNameJoin “Nehru” Threshold 0.2
InLanguages { English, Tamil, Hindi }

Will be executed as:
• Transform “Nehru” to Phonemic string (in English TTP) as “næhru”
• Retrieve all records whose Language is one of (English, Tamil or Hindi) and whose phoneme strings are within edit distance of 1 from “næhru”
  – 1 = 0.2 * 5 = Threshold * phoneme length of “næhru”


MLNameJoin Implementation Goals

• Accurate & Efficient Matching across languages

• Minimum changes to the DB Server

State of the Art

• No Support for Multilexical Matching in Commercial DBMS
  – Soundex approximations for Latin-based scripts
• Approx. Matching Algos
  – No Approximate Matching supported in DBMS
    • However, UDF’s can solve this problem, partially
• Phonetic Matching in IR & Speech Processing Community
  – [Zobel-SIGIR96] In English using Soundex type algorithms
  – Speech Processing research in online-speech processing
• Proprietary Solutions
  – LASA (look-alike-Sound-alike) system for FDA

MLNameJoin Function

Steps:
• Convert input strings to phonemes and find edit-distance between the phonemic equivalents
• If (edit-distance ≤ threshold * Size of Query String) return TRUE
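A minimal sketch of this matching test, operating directly on phoneme strings (the text-to-phoneme step is outside its scope). The function names are hypothetical, unit edit costs are assumed, and the comparison is taken as "less than or equal" so that a distance of exactly threshold × length still matches, as in the worked “Nehru” example of the tutorial.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def ml_name_match(query_ph: str, cand_ph: str, threshold: float) -> bool:
    """True if the candidate is within threshold * len(query) edits."""
    return edit_distance(query_ph, cand_ph) <= threshold * len(query_ph)

query = "næhru"  # phonemic form of "Nehru"
for cand in ("næhru", "næru", "nerou", "frAŋklın"):
    print(cand, edit_distance(query, cand), ml_name_match(query, cand, 0.2))
```

With a threshold of 0.2 and a 5-phoneme query, one edit is tolerated: “næru” matches, while “nerou” and “frAŋklın” do not.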

MLNameJoin Parameters

• Match Threshold
  – Specifies the level of tolerance for mismatch between the phonemic strings
  – Tunable for Matching
    • User-Settable (per Query), or Global (for an application)
  – Threshold varied in [0, 1]
    • 0 => only perfect matches are accepted, 1 => anything can be matched
• Set of Output Languages
  – Those languages of interest to the multilingual user


EditDistance Function

Steps:
Basic Dynamic Programming Algorithm to find Edit-Distance
InsCost, DelCost and SubCost are parameterized costs for the edit operations
Clusters of Phonemes and Intra-Cluster Substitution Cost


EditDistance Parameters

• Cost Functions
  – InsCost, DelCost and SubCost may be set
  – Simulates different types of edit-distances
• Intra-Cluster Substitution Cost
  – Phonemes may be clustered based on their likeness
    • Clusters to be formed with linguist’s input
  – Matching phonemes within a cluster may be more acceptable than from outside a cluster
  – Cost varied in [0, 1]
    • 0 implies all phonemes are equivalent within a cluster
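A parameterized edit distance of this kind might look like the sketch below. The cluster table `CLUSTERS` and the function name `clustered_edit_distance` are hypothetical; the tutorial states that real clusters should be formed with a linguist's input, so the grouping here is only a placeholder.

```python
# Hypothetical phoneme clusters (phoneme -> cluster id); illustration only.
CLUSTERS = {"p": 0, "b": 0, "t": 1, "d": 1, "k": 2, "g": 2}


def clustered_edit_distance(a, b, cluster_of, ins_cost=1.0, del_cost=1.0,
                            sub_cost=1.0, intra_cluster_cost=0.5):
    """Edit distance where substitution inside a cluster is cheaper."""
    def substitution(x, y):
        if x == y:
            return 0.0
        cx = cluster_of.get(x)
        if cx is not None and cx == cluster_of.get(y):
            return intra_cluster_cost      # like-sounding phonemes
        return sub_cost                    # unrelated phonemes

    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + del_cost,
                            curr[j - 1] + ins_cost,
                            prev[j - 1] + substitution(x, y)))
        prev = curr
    return prev[-1]
```

Setting `intra_cluster_cost=0.0` makes all phonemes of a cluster interchangeable, while `1.0` falls back to the ordinary substitution cost.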


Implementation Architecture

[Figure: architecture diagram. Inputs: Query String, Match Threshold. Output: Matched Strings. Components: Server Manager, Query Proc. Engine, Database, Approximate Matching, TTSq … TTSn, Unicode, Cost Fn, Clusters.]


Performance Experiments


Data Sets

Three Data Sets:

Set | Description | # of Strings
1 | Indian Names (Bangalore Telephone Directory) | 300
2 | Occidental Names (San Francisco Physicians Directory) | 300
3 | Generic Names (Places/Chemicals/Objects) | 198
All | Combined Set | 798

Equivalent Phonemic Strings were stored in IPA Alphabet, Unicode Format
Hand-tagged each set of Matchable strings with a Group ID


Data Sets (Continued…)

Generated Data Sets:
Concatenated each string with every other string in the same language
Each set of matchable Strings has the same Group ID
Generated about 200,000 Strings


Phonetic Transformation

• Used standard Text to Phoneme Conversion to IPA alphabet
• For English:
  – A web-based TTP Converter (http://www.foreignword.com)
    • Dictionary: Oxford English Dictionary
• For Indic:
  – Dhvani TTP (http://dhvani.sourceforge.com) after source modifications
    • All Indic Languages are Phonetic; hence almost a 1-to-1 mapping exists.

Character Name | Language | Phoneme Name (IPA Alphabets)
ï£ó£ò¡ | Tamil | nArAIQn
Hydrogen | English | haIdr´dZ´n
Amazon | English | Qm´zAn
University | English | jun´vŒrsIti
‚Kvìð˜ | Tamil | krIst´p´r
Computer | English | k´mpjut´r


Metrics Measured

• Parameters varied
  – User Match Threshold
  – Intra-Cluster Substitution Cost
• Metrics Measured
  – Run Time: as Wall-clock Time
  – Precision: Fraction of the returned results that are correct
  – Recall: Fraction of the correct results that are returned

With
  $M_1$ : # of Correctly Reported Matches
  $M_2$ : # of Reported Matches
  $n$ : number of groups of equivalent names, and $n_i$ : # of elements in the $i$-th group

$$\text{Precision} = \frac{M_1}{M_2}, \qquad \text{Recall} = \frac{M_1}{\sum_{i=1}^{n} \binom{n_i}{2}}$$
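As a worked illustration of these two metrics, the sketch below computes them from the counts defined above. The function name `precision_recall` is hypothetical, and the numbers in the usage note are made up for the example, not taken from the experiments.

```python
from math import comb


def precision_recall(correct_matches, reported_matches, group_sizes):
    """Precision = M1 / M2; Recall = M1 / sum over groups of C(n_i, 2)."""
    ideal_matches = sum(comb(n_i, 2) for n_i in group_sizes)
    precision = correct_matches / reported_matches if reported_matches else 0.0
    recall = correct_matches / ideal_matches if ideal_matches else 0.0
    return precision, recall
```

For two hand-tagged groups of sizes 3 and 2 there are 3 + 1 = 4 matchable pairs; reporting 5 pairs of which 3 are correct gives a precision of 3/5 and a recall of 3/4.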


Correctness Experiments: Recall & Precision Metrics

• Desired Quality of Match: (Recall 1, Precision 1)

Analysis of observed results:
Recall Rate is reasonable (≥ 0.90) only with e ≥ 0.20
Precision Rate is reasonable (≥ 0.90) only with e ≤ 0.30


Correctness Experiments: Tuning the Matching

• Best Matching Point (Empirically, with Precision-Recall curves)

The best possible point of matching is reached with:
Match Threshold ∈ [0.25, 0.3] and Intra-Cluster Subs Cost ∈ [0.25, 0.5]

For this dataset Recall is 95% and Precision is 85%


Performance Experiments: Baseline

• Edit-Distance function was implemented as a UDF

Query | Matching Methodology | Time (Sec)
Scan | Approximate (using UDF) | 1418
Join | Approximate (using UDF) | 4004

Reasons:
UDF incurs large costs and runs in an Interpreted Environment
Edit-Distance using O(mn) algorithm
All the strings in the Database were compared – O(N)

How to Improve the Performance?
Q-Gram Technique
Approximate Phonetic Indexing


Performance Experiments: Q-Grams

• Key Idea:
  – Generate and store all q-grams
  – Use q-gram filter properties to generate a candidate set (cheap) and prune false positives using UDF (expensive)
• Three different filters are used
  – Length Filter: Matching strings cannot differ in length by more than k
  – Count Filter: Matching q-grams $\geq \max(|a|, |b|) - 1 - (k-1)\,q$
  – Position Filter: Matching q-grams cannot be more than k positions apart
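The length and count filters can be sketched as below. This is an illustrative assumption-laden version, not the tutorial's code: strings are padded with `#` and `$` (assumed not to occur in the data) so that a string of length n yields n + q − 1 q-grams, the position filter is omitted for brevity, and the names `qgrams` and `passes_filters` are hypothetical.

```python
from collections import Counter


def qgrams(s, q=2):
    """Multiset of q-grams of s, padded so that len(s) + q - 1 grams result."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_filters(a, b, k, q=2):
    """Cheap pre-check: can a and b possibly be within edit distance k?"""
    if abs(len(a) - len(b)) > k:                        # length filter
        return False
    shared = sum((qgrams(a, q) & qgrams(b, q)).values())
    return shared >= max(len(a), len(b)) - 1 - (k - 1) * q   # count filter
```

Pairs that pass are only candidates; the expensive edit-distance UDF still has to confirm them, which is why the filters produce false positives but no false dismissals.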


Performance Experiments: Q-Grams (continued…)

Query | Matching Methodology | Time (Sec)
Scan | Approximate (using q-grams) | 13.5
Join | Approximate (using q-grams) | 856

• From baseline, improved two orders of magnitude in Scan and five times in Join
  – # of Calls to UDF reduced tremendously
• Caveats:
  – No false-dismissals, only false-positives
  – At the cost of 15X storage space, for storing the Q-Grams


Performance Experiments: Phonetic Index

• Key Idea:
  – Use the “Phoneme Clusters” to generate a string of integers (like Soundex)
  – Convert to a single number
  – Index the phonemic strings in standard B+ Tree index, on Number
• For searching:
  – Transform search string to its Index String, and search in index
• Returned set is a Candidate Set for approximate matching strings
• Use UDF to weed out false positives
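A rough sketch of this idea follows. Everything specific in it is an assumption for illustration: the cluster ids in `CLUSTER_ID` are invented, phonemes outside the table are simply skipped (as Soundex skips vowels), and a Python dictionary stands in for the B+ tree index.

```python
from collections import defaultdict

# Hypothetical cluster ids (1-9) per phoneme; illustration only.
CLUSTER_ID = {"p": 1, "b": 1, "t": 2, "d": 2, "k": 3, "g": 3, "m": 4, "n": 4}


def phonetic_key(phonemes):
    """Map a phoneme string to one integer built from its cluster ids."""
    key = 0
    for ph in phonemes:
        cluster = CLUSTER_ID.get(ph)
        if cluster is not None:        # unclustered phonemes are skipped
            key = key * 10 + cluster
    return key


def build_index(phoneme_strings):
    """Group strings by key; a dict stands in for the B+ tree."""
    index = defaultdict(list)
    for s in phoneme_strings:
        index[phonetic_key(s)].append(s)
    return index


def candidates(index, query):
    """Candidate set for the query; the edit-distance UDF must still verify it."""
    return index.get(phonetic_key(query), [])
```

Because only strings with an identical key are returned, a near match whose differing phoneme falls in another cluster is missed, which illustrates how false dismissals can arise with this technique.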


Performance Experiments: Phonetic Index (Continued…)

Query | Matching Methodology | Time (Sec)
Scan | Approximate (using Phon. Index) | 0.71
Join | Approximate (using Phon. Index) | 15.2

From Baseline, three orders of magnitude improvement
Index speeds up look-up of “like-strings” tremendously

Caveats:
False-dismissals possible now.
Indexing cost is added


Performance Improvements (Three Alternative Techniques)

• Technique #1: Metric Distance Index
  – Pre-compute edit-distance of Phonetic Strings from a given Key String.
  – Use Properties of edit-distances to reduce calls to UDF
• Technique #2: Q-Grams
  – Generate and Store all Q-Grams of Phonetic Strings
  – Use Properties of Q-Grams to reduce Calls to UDF
• Technique #3: Phonemic Indexes
  – Convert Phoneme Strings to Numbers (corresponding to Phoneme Clusters)
  – Index resulting numbers, using B+ Trees

Matching Methodology | Scan-Time (Sec) | Join-Time (Sec) | Remarks
Matching using Metric Distance Index | 356 | 1728 | Not much improvement on baseline
Matching using Q-Grams | 13.5 | 856 | 15X Storage Overheads for Q-Grams
Matching using Phonemic Indexes | 0.71 | 15.2 | ~5-8% False-Dismissals


Remarks

The MLNameJoin operator employing phonetic matching can complement the standard lexicographic operators, for cross-language name searches…

…Further, it may be implemented efficiently on existing systems.


MLSemJoin Operator

† Referred to as MLSemJoin in published work


Multilingual Books.com (with Category Information)

Author_FN Author | Title | Price | Category | Language
René Descartes | Les Méditations Metaphysiques | € 99.95 | Philosophie | French
慧開 無門 | 無門關 | ¥ 7500 | 禅 | Japanese
üõý˜ô£™«ï¼ | ÝCò «ü£F | INR 250 | êKˆFó‹ | Tamil
François Lebrun | L'Histoire De La France | € 19.95 | Histoire | French
Bicci Nero | Il Coronation del Virgin | € 75.00 | Arti Fini | Italian
… | … | SAR 95 | … | Arabic
Will/Ariel Durant | History of Civilization | $ 49.95 | History | English
… | ªddT£d HI¶ šddy¡d¡d®ddUµT¬dd¬d¦dyUµè | INR 175 | Be£d²d±d | Hindi
Mark T. Gilderhus | History and Historians | £ 35.00 | Historiography | English
Κατερινα Σαρρη | Παιχνίδια στο Πιάνο | € 12.00 | Μουσική | Greek
Jawaharlal Nehru | Letters to My Daughter | £ 15.00 | Autobiography | English
«ñ£è¡î£v裉F | êˆFò «ê£î¬ù | € 99.95 | ²òêKî‹ | Tamil


Multilingual Semantic Selection

Author_FN Author | Title | Price | Category
üõý˜ô£™«ï¼ | ÝCò «ü£F | INR 250 | êKˆFó‹
Will/Ariel Durant | History of Civilization | $ 49.95 | History
François Lebrun | L'Histoire De La France | € 19.95 | Histoire

Suppose a user wants to retrieve all “History” Books in English, Tamil and French ...

Currently, an equivalent SQL expression is as follows…

Select Author, Title, Category ... From Books
where Category = “History” or Category = “êKˆFó‹” or Category = “Histoire” ...

We propose a simpler syntax as follows…

Select Author, Title, Category ... From Books
where Category MLSemJoin “History” InLanguages {English,Tamil,French}


Multilingual Semantic Selection (Adding Expressive Power)

Suppose a User wants to retrieve all “History-type” Books in English, Tamil and French

Select Author, Title, ... From Books
where Category MLSemJoin All “History” InLanguages {English,Tamil,French}

Currently, no equivalent SQL expression is available for this query…

Author_FN Author | Title | Price | Category
üõý˜ô£™«ï¼ | ÝCò «ü£F | INR 250 | êKˆFó‹
Will/Ariel Durant | History of Civilization | $ 49.95 | History
François Lebrun | L'Histoire De La France | € 19.95 | Histoire
Jawaharlal Nehru | Letters to my Daughter | £ 15.00 | Autobiography
Mark T. Gilderhus | History and Historians | £ 35.00 | Historiography


MLSemJoin Features

• Simpler Query Specification
  – Input in any convenient language, with Multilingual output
    • When Appropriate Linguistic Resources are not available
    • Specially Suited for PDA, Cell Phone Interfaces
• Robustness of Query Processing
  – Query String is more robust with respect to meaning, spelling etc., as the matching relies not just on Lexicographic Matching
  – Equivalence based on intuitive Semantic correspondence
    • Semantic ⇒ A Specified Ontology Based
• Restrictions
  – Restricted to Specific types of Attributes (Categorical)


Is MLSemJoin just Syntactic Sugar?

• Extends SQL Expressive Power
  – “Retrieve in All Languages”
    • To Retrieve books irrespective of language of Publication ( InLanguages { * } )
  – Join Functionality based on Semantics
    • To Retrieve “Books published by Publishers in their Specialty” ( Book.Category MLSemJoin Publisher.Specialty )
• Query Processing with Domain Specific Ontologies
  – The same Mechanism may be extended to any Ontological Query Processing
    • Use domain-specific ontologies in specific domains


Background: WordNet Linguistic Resource


WordNet Basics: Words vs. Meaning

• WordNet is a Psycholinguistic Dictionary
  – It organizes concepts, similar to the human mind
  – CS-speak: Semantic Network
• A word is an association between a concept and a string
  – Given by a Lexical Matrix, as follows:

[Figure: Lexical Matrix relating word forms (Aeroplane, Automobile, Car, Flight, Leo) to synsets Sy:1 to Sy:4, each with a gloss (Gloss 1 to Gloss 4); annotated with Synonymy and Polysemy.]

English has ~110K Noun Words and ~75K Noun Synsets
~150K Associations between them
Need about 5 MB to Store on-line


WordNet Basics: Noun Hierarchy

• Nouns are Grouped into 25 “Semantic Primes”
• Under each, concepts are arranged in a Taxonomic Hierarchy
  – Can Specialize / Generalize a “Synset”

[Figure: three example noun hierarchies. Fauna: Bird, Mammal, Land, Water-Based, Mouse1, Whale, Dolphin. Artifact: Computer Peripherals, Pointing Devices, Mouse2. Knowledge: Philosophy, History, Personal History, Subject History, Biography, Autobiography, Historiography. Label: Synsets Corresponding to Synonyms.]


WordNet Basics: Interlinked Synsets

• With Multilingual WordNets, the English WN Hierarchy is taken as the base, and modified for Target Languages
  – Interlinking provided between the Synsets

[Figure: English hierarchy (Knowledge, Philosophy, History, Personal History, Subject History, Biography, Autobiography, Historiography) linked to a German hierarchy (Wissen, Philosophie, Geschichte, Persönliche Geschichte, Biographie, Autobiography). Legend: Inter-language Semantic Links, Intra-language Is-A Links.]


Multilingual WordNet Initiatives

• There are several WordNet initiatives around the globe, coordinated by Global WordNet Organization
  – Euro WordNet: Covers all major European Languages
  – Indo WordNet: Covers ~15 Official Indic Languages
  – CJK WordNet: Between CJK Languages
  – …
• Most of them take English WordNet as the Base
  – Maintain a structural similarity with English WordNet and specialize for their specific languages
  – Provide Inter-lingual Index between “Equivalent” Synsets


Implementation Derived Operator Approach


Semantic Matching Strategy

• Integrate WordNet Linguistic Ontological Resources to the DBMS

• Map [Multilingual] Words to Canonical Semantic Primitives
  – WordNet provides rich Ontological Hierarchies for nouns
  – Inter-linked WordNets between languages provide cross-lingual mappings
• Match on Semantic Primitives
  – Directly as a not-null intersection of Primitives
  – Or as not-null intersection on Transitive Closures for Matching on Specializations


MLSemJoin Example

• Database:
  – All books are tagged with Category
  – The following Hierarchy is given as a “Resource”

[Figure: interlinked category hierarchies in three languages. English: Knowledge, Philosophy, History, Personal History, Subject History, Biography, Autobiography, Historiography. German: Wissen, Philosophie, Geschichte, Persönliche Geschichte, Biographie, Autobiography. Tamil: ÜP¾Þò™, õ‹ êKˆFó‹, üùêKî‹, õ£›¬è êKî‹, ²òêKî‹.]

Query: Retrieve all History-Type Books in English, Tamil & German


MLSemJoin Algorithm

Steps:
Convert Query String to a Synset
Find Transitive Closure (TCQ) of Query Synset in Interlinked WordNet Hierarchies
If ( Synset of DataString ∈ TCQ ) then return TRUE else FALSE
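The algorithm above can be sketched in Python as follows. The toy hierarchy is loosely modelled on the History example and is an assumption, as are the synset identifiers and the names `transitive_closure` and `ml_sem_match`. For brevity each word maps to a single synset, whereas real WordNet words can be polysemous; the inter-language link to the German synset is stored as an ordinary edge.

```python
# Hypothetical word -> synset map and child edges (illustration only).
SYNSET_OF = {"History": "en:history", "Autobiography": "en:autobiography",
             "Philosophy": "en:philosophy", "Geschichte": "de:geschichte"}
CHILDREN = {
    "en:history": ["en:personal_history", "en:subject_history", "de:geschichte"],
    "en:personal_history": ["en:autobiography"],
    "en:subject_history": ["en:historiography"],
}


def transitive_closure(root, children):
    """All synsets reachable from root, including root itself."""
    seen, stack = {root}, [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ml_sem_match(query_word, data_word, synset_of, children):
    """True when the data word's synset lies in the closure of the query's synset."""
    q, d = synset_of.get(query_word), synset_of.get(data_word)
    if q is None or d is None:
        return False
    return d in transitive_closure(q, children)
```

The match is directional: "Autobiography" matches a query for "History", but "History" does not match a query for "Autobiography".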


Implementation Details

• In MLSemJoin Algorithm
  – Computing Recursive Closure (Line#3) in Relational Systems is expensive
    • Takes about 98% of the time of Query
  – Can implement a UDF, but is very expensive
• We took a “Derived-Operator” approach, in an unmodified RDBMS using Recursive SQL feature
  – Transparently re-write the MLSemJoin query into one that uses WITH and IN clauses of SQL:1999
    • Caveat: System should support SQL:1999
  – The query may be optimized using standard Relational Query Optimizer


Derived Operator Implementation of MLSemJoin

• The query is transformed from MLSemJoin query to a standard SQL:1999 query as follows:

Select Author, Title, ... From Books
where Category MLSemJoin ALL “History” InLanguages { English, Tamil, German }

is rewritten as

Select Author, Title, ... From Books
where Category in ( ‘History’,‘Biography’,‘Autobiography’, ‘êKˆFó‹’,‘²òêKî‹’,…,‘Geschichte’... )

The data for IN Clause is the Recursive Closure of “History” across target Languages
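A rewritten query of this shape can be tried out with Python's built-in `sqlite3` module, whose `WITH RECURSIVE` support is used here as a stand-in for the SQL:1999 feature. The table names `taxonomy` and `books`, their columns, and the sample rows are assumptions for illustration; the inter-language link from 'History' to 'Histoire' is stored as an ordinary parent/child row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxonomy (parent TEXT, child TEXT);
    CREATE TABLE books (author TEXT, title TEXT, category TEXT);
    INSERT INTO taxonomy VALUES
        ('History', 'Personal History'), ('History', 'Subject History'),
        ('Personal History', 'Autobiography'),
        ('Subject History', 'Historiography'),
        ('History', 'Histoire');
    INSERT INTO books VALUES
        ('Durant', 'History of Civilization', 'History'),
        ('Nehru', 'Letters to My Daughter', 'Autobiography'),
        ('Lebrun', 'L''Histoire De La France', 'Histoire'),
        ('Descartes', 'Les Meditations Metaphysiques', 'Philosophie');
""")

# The WITH clause computes the recursive closure of the query category;
# the IN clause then filters the books against that closure.
REWRITTEN_QUERY = """
    WITH RECURSIVE closure(category) AS (
        SELECT ?
        UNION
        SELECT t.child FROM taxonomy AS t
        JOIN closure AS c ON t.parent = c.category
    )
    SELECT author, title FROM books
    WHERE category IN (SELECT category FROM closure)
    ORDER BY author
"""
rows = conn.execute(REWRITTEN_QUERY, ("History",)).fetchall()
```

Using `UNION` rather than `UNION ALL` removes duplicates during the recursion, which also keeps the closure finite if a category is reachable along more than one path.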


MLSemJoin Performance


Data

• WordNet (Ver 1.5) Stored in DBMS
  – ~110,000 Word-Forms and ~80,000 Word-Senses and ~140,000 Relationships between them
• Stored in DBMS in Plain Taxonomy Tables
  – Plain vanilla <Parent,Child> Relationships
  – Occupies 4 MB for ASCII and ~8 MB for Unicode
• Multilingual WordNets were simulated by copies of English WordNet in Unicode
  – Inter-Language-Links created between all pairs (p:0.95)
    • For Performance experiments, this approach gives a good approximation


WordNet Profiles

Structural Characteristics of WordNets

Characteristics | English | French | German | Spanish | Hindi
Word Form (Words) | 114,648 | 32,809 | 20,453 | 50,526 | 22,522
Word Sense (Synsets) | 80,000 | 22,745 | 15,132 | 23,378 | 7,868
Avg. (Synset / Word Form) | 2.236 | 2.176 | 2.301 | 2.360 | 3.889
Avg. (Word Form / Synset) | 1.985 | 1.442 | 1.352 | 2.162 | 2.286
Equivalence Links to English | 1.000 | 0.999 | 1.080 | 0.908 | NA

Different WordNets are at different stages of development
They are highly correlated
More so in Euro-WordNets, than in Indo-WordNets
Confirms their design goal to “keep the basic taxonomies as much as possible”


Queries Run

• Three Commercial Database Systems were studied
  – Identified only as Systems A, B and C to protect identities
  – WordNet stored in ASCII (English) and in Unicode (others)
• MLSemJoin queries that compute closures of various sizes
  – Measured the Wall-Clock time for queries
  – In MLSemJoin queries, the TC Computation takes ~98% of time
• A Typical Query requires a Closure size of ~2,000
  – Average of Top-100 Query Nouns from popular Web-Search Engine, on English WordNet
  – Assuming user is interested in 3 languages


Performance: Baseline MLSemJoin

Highlights:
Runtime proportional to the closure cardinality
Runtime for a typical query (TC of size ~2,000) is in tens of seconds (no index) and close to a second (with index)!


Optimization #1: Pre-Computed Closures

• Pre-Compute the Closures for all nodes in WordNet W, and store in a WTC Table
  – Closure of x in W can be computed by a scan of WTC
  – Index WTC for performance
• Positives:
  – Expected to have much better performance
  – More importantly, linear scale-up wrt Closure Size
• Negatives:
  – Space overheads are substantial
  – For WordNet, the size goes from 4 MB/language to ~120 MB/Language
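The trade-off can be seen on a toy taxonomy with the sketch below. It is an assumption-based illustration: `precompute_closures` is a hypothetical name, the closure table is a Python set of (ancestor, descendant) pairs rather than a database table, and each node is counted as part of its own closure.

```python
def precompute_closures(edges):
    """Build the WTC pairs (x, y) with y in the closure of x, for every node x."""
    children, nodes = {}, set()
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        nodes.update((parent, child))
    wtc = set()
    for root in nodes:
        seen, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for c in children.get(node, ()):
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        wtc.update((root, descendant) for descendant in seen)
    return wtc


# Toy taxonomy W as (parent, child) pairs; illustration only.
W = [("History", "Personal History"), ("History", "Subject History"),
     ("Personal History", "Autobiography"), ("Subject History", "Historiography")]
WTC = precompute_closures(W)
```

Here 4 taxonomy rows expand into 11 closure rows; a membership test becomes a single lookup, at the price of the storage growth noted under Negatives.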


Performance: Pre-Computed Closures

Highlights:
Runtime near-Constant irrespective of the magnitude of Closure Cardinality
Runtime for all queries is sub-second (~700 mSec) with Index


Optimization #2: Reversed Traversals

• Traverse the Taxonomic Hierarchy in Reverse
  – Use the Same Taxonomic Table W
  – Instead of Checking if (Data ∈ Descendants of Query), check if (Query ∈ Ancestors of Data).
• Positives:
  – Expected Closure Cardinalities are smaller
    • Though WordNet is a DAG, its Average In-degree << Average Out-degree
• Negatives:
  – Computation of Closure for every Data String
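The reversed check can be sketched as below. The `PARENTS` map is a hypothetical child-to-parents view of a toy taxonomy, and the function names are illustrative; a list of parents is used because the hierarchy is a DAG, so a node may have more than one parent.

```python
# Hypothetical child -> parents map (illustration only).
PARENTS = {
    "autobiography": ["personal_history"],
    "personal_history": ["history"],
    "historiography": ["subject_history"],
    "subject_history": ["history"],
    "history": ["knowledge"],
    "philosophy": ["knowledge"],
}


def ancestors(node, parents):
    """Upward closure of node (including node itself)."""
    seen, stack = {node}, [node]
    while stack:
        current = stack.pop()
        for p in parents.get(current, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def matches_reversed(query_synset, data_synset, parents):
    """Check Query ∈ Ancestors(Data) instead of Data ∈ Descendants(Query)."""
    return query_synset in ancestors(data_synset, parents)
```

Each upward closure is small, but it has to be recomputed for every data value, which matches the Negatives bullet above.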


Performance: Reversed Traversals

Highlights:
Clearly much better performance for a single TC computation
The query is very expensive since Closure needs to be computed for every record


Optimization #3: Reorganized Schema

• Leveraging the Structural Characteristics of the WordNet
  – A large number of nodes have a few children, obeying Power Law

Inline up to 16 Children
Covers ~90% of them

TC Computation is modified to look into both tables
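One way to picture such a layout is sketched below. The slides do not give the actual schema, so this is only a guess at its general shape: up to 16 children are kept inline with the parent, the remainder go to a separate overflow structure, and a lookup consults both. The names `reorganize` and `children_of` are hypothetical.

```python
INLINE_LIMIT = 16  # "Inline up to 16 Children"


def reorganize(children):
    """Split each child list into an inline part and an overflow part."""
    inline, overflow = {}, {}
    for parent, kids in children.items():
        inline[parent] = list(kids[:INLINE_LIMIT])
        if len(kids) > INLINE_LIMIT:
            overflow[parent] = list(kids[INLINE_LIMIT:])
    return inline, overflow


def children_of(node, inline, overflow):
    """Closure computation must look into both structures."""
    return inline.get(node, []) + overflow.get(node, [])
```

Nodes with few children are resolved from the inline part alone, so only the minority of high fan-out nodes need the second lookup.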


Performance: Re-Organized Schema

Highlights:
Runtime is ~3 orders of magnitude better than Baseline Performance and ~1 order of magnitude better than Pre-computed Closures
No Space Overhead!
Runtime for typical query (TC of size ~2,000) is ~25 mSec


Scaling up wrt Languages

Highlights:
Increased number of [simulated] languages up to 8
The performance of the Optimized Versions remains efficient


Implementation Architecture

[Figure: architecture diagram. Inputs: Query String, Match Parameters. Output: Result Set. Components: Server Manager, Query Proc. Engine, Database, Semantic Equality Function, Unicode, Ontology, Rec SQL.]


MLSemJoin: Take Away

The MLSemJoin operator employing WordNet-based matching can complement the standard lexicographic operators, for multilingual semantic searches …

… Further, the performance may be tuned to a level acceptable for online user interaction


Organization

• Motivation
• Multilingual Standards and Systems
• Multilingual Performance
• New Multilingual Query Operators
• Multilingual Relational Query Algebra
• Multilingual Implementation
• Conclusions & Future Research


Multilingual Relational Algebra (Mural)


Why a Query Algebra?

• For expressing complex queries declaratively
• For evaluating alternative query execution plans
  – Critical for leveraging the Query Optimizer
  – For a core implementation of the multilingual operators
• What is Needed?
  – Functionality defined as Operators
  – Composition Rules, Cost Models and Selectivity Estimates


Multilingual Datatype and Operators

• Uniform (Unicode Format) Datatype
  – A Representation that is tagged with language
  – E.g.: <“Sample String”, English>, <“àõñ£ù êóñ¢”, Tamil> or <“Corde Témoin”, French>
• Operators on Uniform datatype
  – Composing: : <Text, ID> → Uniform
  – Decomposing: : <Uniform> → <Text, ID>
  – Uniform Equality: Ξ : <Uniform, Uniform> → <Boolean>
  – MLNameJoin: Ψ : <Uniform, Uniform> → <Uniform, Uniform, Integer>
    • Edit-Distance between the phonemic-equivalents of input Uniform strings
  – MLSemJoin: Φ : <Uniform, Uniform> → <Uniform, Uniform, Boolean>
    • Boolean indicates if LHS is a sub-class of RHS


Composition Rules


MLNameJoin Operator

• Simplified version of Earlier Definition:

Ψ is Commutative and Associative with all relational operators

Cost of Ψ Scan of a Table:
  $O(R_L\, l_L\, k / \sqrt{\Sigma})$ and $P_L$ Disk I/O (Without Index)
  $O(R_L\, l_L\, k^2 / \sqrt{\Sigma})$ and $A_L$ Disk I/O (With Index)

Cost of Ψ Join of a pair of Tables:
  $O(R_L R_R\, l_L\, k / \sqrt{\Sigma})$ and $(P_L + P_R)$ Disk I/O (Without Index)
  $O(R_L R_R\, l_L\, k^2 / \sqrt{\Sigma})$ and $(A_L + A_R)$ Disk I/O (With Index)

Selectivity estimates based on End-Serial histograms and relaxation for approximate matching


MLSemJoin Operator

• Simplified version of Earlier Definition:

Φ is not a commutative operator; but, associates with others

Cost of Φ Scan of a Table:
  $O(R_L + R_H (h+1))$ and $P_H (h+1)$ Disk I/O (Without Index)
  $O(R_L\, l_L\, k^2 / \sqrt{\Sigma})$ and $A_L$ Disk I/O (With Index)

Cost of Φ Join of a pair of Tables:
  $O(R_L + R_R + U_R R_H (h+1))$ and $(P_L + P_R)$ Disk I/O (Without Index)
  $O(R_L + R_R + U_R \log E_H\, (h+1))$ and $3 (P_L + P_R) E_H$ Disk I/O (With Index)

Selectivity estimates based on structural characteristics of the hierarchy


Relational Completeness of Mural

• Lemma: There exists a mapping Scheme ΩSch between a DB in Mural Schema and Standard Relational Schema
  – Sketch of the Proof:
    • Using Composing and Decomposing Operators of Uniform, ΩSch can be defined.
• Theorem: There exists a mapping scheme Ω that maps a relational algebra database D to a Mural database Ω(D) such that, for every query Q on D, there is a corresponding expression Q’, such that Q’(Ω(D)) = Ω(Q(D))
  – Sketch of the Proof:
    • ΩSch is known from Lemma.
    • Since only a mapping of queries from Normal Schema to Mural Schema needs to be derived, we can map queries in Normal Schema to appropriate component of Uniform, in Mural Schema.




Multilingual Architecture & Implementation Experience


Design Goals

• Relational Systems Oriented

• Attribute Data Oriented
  – Primarily for OLTP Environments

• DB Transparent to Language
  – Linguistic Resources are only “plugged-in”

• Modular, Dynamic & Configurable …


Implementation Architecture

[Figure: implementation architecture. Labels: Server Manager (Query String, Parameters, Result Set); Query Proc. Engine; Database; MURAL; Unicode; Cuniform; Approximate Matching; TTPn; Cost; Cluster; MLNameJoin; Transitive Closure; Cost; Ontology; MLSemJoin.]


Outside-the-Server Implementation

• On Commercial Systems
  – Using a UDF (in PL/SQL) for MLNameJoin
  – Using Recursive SQL and an IN Clause for MLSemJoin

• Can be packaged as a PL/SQL procedure

• Advantages:
  – Implementation with existing features, though slow
  – Optimization using techniques outlined here

• Disadvantages:
  – Slow Performance
  – No leveraging of the Optimizer for Better Plan Selection
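The outside-the-server approach can be illustrated with any engine that supports user-defined functions and recursive SQL. The sketch below uses Python's built-in sqlite3 module as a stand-in for a commercial system with PL/SQL; that substitution, the table and column names, the fixed distance bound of 1, and the sample data are all assumptions made for illustration.

```python
import sqlite3

def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance, exposed as a UDF."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

conn = sqlite3.connect(":memory:")
conn.create_function("edit_distance", 2, edit_distance)  # register the UDF
conn.executescript("""
    CREATE TABLE author(id INTEGER, phonemes TEXT);
    CREATE TABLE publisher(id INTEGER, phonemes TEXT);
    CREATE TABLE book(id INTEGER, category TEXT);
    CREATE TABLE hierarchy(parent TEXT, child TEXT);
    INSERT INTO author VALUES (1, 'nehru'), (2, 'gandi');
    INSERT INTO publisher VALUES (10, 'neru'), (11, 'gandhi'), (12, 'tagor');
    INSERT INTO book VALUES (1, 'physics'), (2, 'philosophy'),
                            (3, 'cooking'), (4, 'science');
    INSERT INTO hierarchy VALUES ('knowledge', 'science'),
                                 ('knowledge', 'philosophy'),
                                 ('science', 'physics'),
                                 ('science', 'chemistry');
""")

# MLNameJoin-style matching: the UDF is called once per candidate pair.
name_matches = conn.execute("""
    SELECT a.id, p.id
    FROM author a, publisher p
    WHERE edit_distance(a.phonemes, p.phonemes) <= 1
    ORDER BY a.id, p.id
""").fetchall()

# MLSemJoin-style matching: recursive SQL builds the closure, IN tests membership.
sem_matches = conn.execute("""
    WITH RECURSIVE closure(concept) AS (
        SELECT ?
        UNION
        SELECT h.child FROM hierarchy h JOIN closure c ON h.parent = c.concept
    )
    SELECT id FROM book
    WHERE category IN (SELECT concept FROM closure)
    ORDER BY id
""", ("science",)).fetchall()
```

The engine treats the UDF as an opaque predicate, so it cannot choose a better plan around it, which is the optimizer limitation listed under the disadvantages.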


Native Implementation†

• Implemented Natively…
  – Uniform Datatype as a Derived Datatype
  – MLNameJoin and MLSemJoin as First-Class Operators
    • Added TTP Converters in specific languages
    • Added WordNet (English Only) for Semantic Matching
  – Added Metric M-Tree Index Structure using the GiST API
  – Added All Components of the Mural Algebra
    • Cost Models, Composition Rules and Selectivity Estimations
  – Optimizer made “aware” of Mural components

† Implemented on PostgreSQL Open-Source Database System


Performance of Native Implementation [MLNameJoin]

• Outside-the-Server UDF Performance
• Native Implementation with M-Tree

Matching Methodology                        Scan-Time (Outside)   Join-Time (Outside)   Scan-Time (Native)   Join-Time (Native)
MLNameJoin                                  3618                  453                   5.20                 1.96
MLNameJoin (w/ MetricDist or M Tree Idx)    169                   498                   1.92                 4.24


Performance of Native Implementation [MLSemJoin]

• Outside-the-Server Performance using SQL:1999
• Native Implementation, with WordNet pinned in memory


Optimizer Evaluation Possible…

Query: Find the books whose Author name sounds like the Publisher’s name

Optimizer Predictions of Query Executions are accurate

Optimizer Cost: 2,439,370; Runtime: 82.15 s
Optimizer Cost: 7,513,852; Runtime: 2338 s

Optimizer Evaluation and Runtimes for two alternatives:
Plan 1: π AuthorID,BookID,PubID (σ (Threshold≤0.25) (Ψ Aname,Pname (A × B)))
Plan 2: π AuthorID,BookID,PubID (B × (σ (Threshold≤0.25) (Ψ Aname,Pname (P, A))))


Tutorial - Take Away

This tutorial explored research to make Information Systems Natural-Language-Neutral in functionality and performance

Cuniform storage format nearly nullifies the performance differential

The MLNameJoin and MLSemJoin operators enhance the standard lexicographic operators for cross-lingual querying

The Mural Algebra is critical for a Native implementation of the functionality in Relational Systems


More Information [http://dsl.serc.iisc.ernet.in/~publications]

• PhD Thesis of A. Kumaran [Microsoft Research India]
• On Database Support for Multilingual Environments. RIDE/MLIM Workshop, part of ICDE ’03, March 2003.
• On the Costs of Multilingualism in Database Systems. VLDB ’03, September 2003.
• Supporting Multilexical Queries in Database Systems. ICDE ’04, March 2004.
• Supporting Multiscript Query Processing in Database Systems. EDBT ’04, March 2004.
• LexEQUAL: Multilexical Operator in SQL. SIGMOD ’04, June 2004.
• MIRA – Multilingual Information-processing on Relational Architecture. Springer LNCS 3268, November 2004.
• On Semantic Matching of Multilingual Attributes in Relational Systems. CIKM ’04, November 2004.
• MLSemJoin: Multilingual Semantic Matching in Relational Systems. DASFAA ’05, April 2005.
• On Pushing Multilingual Query Operators into Database Engines. ICDE ’06, April 2006.


Future Research Avenues

• HomoGlyphic Operator
• Extensions to MLNameJoin Operator
  – Better Index Structures in the Phonetic Domains
  – Automatic Tuning of Optimal Match Parameters Based on a Training Set provided by the User
• Extensions to MLSemJoin Operator
  – Domain-Specific Ontological Matching
• Multilingual Performance Suites based on a Standard Application
  – Multilingual Benchmarks


Thank you!

http://dsl.serc.iisc.ernet.in/~projects/MIRA

Database Systems Laboratory
Indian Institute of Science