Presentation of the CLIA Project

by Pushpak Bhattacharyya, IIT Bombay,
on behalf of the CLIA Consortium

12 Dec 2008

On the occasion of FIRE at Kolkata


Motivation

CLIA is a real need

Great language diversity in India
Low comfort level with English: less than 5% of the total population of about 700 million can use English effectively
Need for critical information in large quantity and high quality, especially in agriculture, health, tourism, education and other sectors
CLIA project started in 2006; domains: tourism and health

Geographically speaking

[Figure: map labels: Telugu, Tamil, Bengali, Marathi, Punjabi]

World rank in terms of number of speakers:
Hindi-Urdu: 5th
Bengali: 7th
Marathi: 14th
…


CLIA: basic information


Defining Diagram

CLIA Consortium Members

Name of Institute: Assigned Language(s)
IIT Bombay (consortium leader): Marathi, Hindi
IIT-Kharagpur (consortium co-leader): Bengali
IIIT Hyderabad: Telugu, Hindi
Anna University-KBC: Tamil
Anna University-College of Engg: Tamil
ISI Kol: Bengali
Jadavpur University Kolkata: Bengali
CDAC-Pune: Marathi, Hindi, Tamil
CDAC-Noida: Punjabi
Utkal University: --

Principal Investigators

Name of Institute: Names
IITB: Prof. Pushpak Bhattacharyya
IIT-Kgp: Prof. Sudeshna Sarkar
IIITH: Prof. Vasudev Verma
AU-KBC: Prof. Sobha L.
AU-CEG: Prof. Ranjani Parthasarthy
ISI Kol: Prof. Mandar Mitra
JU Kol: Prof. Sivaji Bandyopadhya
CDAC-P: Dr. Ajai Kumar
CDAC-N: Dr. Karunesh Arora
Utkal University: Prof. Sanghamitra Mohanty

Some prominent research members

Name of Institute: Names
IITB: Manoj, Vishal, Vishaal, Ashish
IIT-Kgp: Nimesh, Dr. Rajendra
IIITH: Bhupal, Praneet
AU-KBC: Pattavi, Vijay, Vijay
AU-CEG: Kaviha, Subha Lalitha
ISI Kol: Prasenjt, Deepashri, Ayan
JU Kol: Asif, Pinaki
CDAC-P: Swati, Abhishek
CDAC-N: Gaur Mohan, Ankur
Utkal University: Balbant Rai

Prior expertise brought to the project (horizontal, i.e., language independent)

Name of Institute: Areas of prior expertise/experience
IITB: NLP (LR, WSD, MT), Semantic Search
IIT-Kgp: Search and Ranking, Shallow Parsing
IIITH: Commercial level search engine building, query processing
AU-KBC: NER, Information Extraction, Summarization, Anaphora
AU-CEG: Morphology, Interlingua
ISI Kol: IR Evaluation, large scale IR system building (SMART)
JU Kol: Example based MT, Summarization, NER
CDAC-P: Converters, File format processors, MT
CDAC-N: Parallel corpora, Query processing
Utkal University: Machine Translation, Lexical Resources

Prior expertise brought to the project (vertical, i.e., language specific)

Name of Institute: Areas of prior expertise/experience
IITB: Hindi Marathi wordnet building, Hindi Marathi shallow parsing
IIT-Kgp: Bengali shallow parsing including MA
IIITH: Telugu-Eng CLIR, Telugu query processing
AU-KBC: Tamil NER, Tamil IE, Tamil Morph
AU-CEG: Tamil Morph, Eng-Tamil MT
ISI Kol: Bengali statistical stemming, large scale corpora for Bengali
JU Kol: Bengali NER, EBMT involving Bengali
CDAC-P: Various Indian language converters
CDAC-N: Aligned parallel corpora for Indian languages
Utkal University: --

Horizontal tasks of CLIA and the organizations responsible

Input query processing: IIIT Hyderabad
Crawling, indexing: IIT KGP, IIITH, IITB
Searching, ranking: IIT KGP, IIITH, IITB
User interface: CDAC Noida
File format processing: CDAC Pune

Horizontal tasks of CLIA and the organizations responsible (contd.)

Document processing (index time NER, IE): AU KBC
Document processing (post retrieval: snippet, summary): Jadavpur University
Distributed search: IIT KGP, Utkal, CDACP
Evaluation, relevance judgement: ISI Kolkata
UNL based semantic search (for Tamil): AU CEG

Languages and the organizations responsible

Language: Organization(s)
Bengali: IIT KGP (c), JU, ISI
Hindi: IIITH (c), IITB, CDAC Noida
Marathi: IITB (c), CDAC Pune
Punjabi: CDAC Noida
Tamil: AUKBC (c), AUCEG
Telugu: IIITH

CLIA Important Dates

Project start date: 29th Aug 06 (effectively Jan 2007)
First meeting of the Project Review and Steering Group (PRSG): 2nd March 2007
Second PRSG: 30th Aug 2007
Third PRSG: 08th March 2008
Fourth PRSG: 15th July 2008
Alpha version released: 15th July, 2008
Beta version to be released (along with the 5th PRSG): January, 2009

Related consortium: E-IL MT project

English to Indian Language MT
Indian languages: Hindi, Marathi, Bengali, Urdu, Oriya, Telugu, Tamil
Approaches: Statistical MT, Example Based MT
Members: CDAC Pune (c), IIT Bombay, JU, UU, IIITH, IIITA

Related consortium: IL-IL MT project

Indian Language to Indian Language MT
Indian languages: Hindi, Marathi, Bengali, Punjabi, Tamil, Telugu, Kannada
Approach: Transfer Based
Members: IIITH (c), CDAC Pune, IIT Bombay, JU, University of Hyderabad, AU KBC

All three projects are time bound and result oriented

2-year time frame (extension granted for 1 year)
Strict deliverables
For each project the budget outlay is about Rs 80 million (USD 2 million)


CLIA: Top level technological information


Process Flow


CLIA: achievements in 2 years (Jan 2007 to Dec 2008)

Tools and resources (copyrightable code and data)

Steps towards overall evaluation

Yet to be completed: Precision, Recall, MAP, F-score etc.
Large relevance judgment base under construction
  50 queries per language (6 languages)
  About 5000 documents per language (6 languages)
Crawled and indexed document base of English: approx 600,000 pages
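The set-based measures named above can be illustrated for a single query. This is only a minimal sketch using the standard textbook definitions, not the consortium's evaluation code; the function name and the document ids are hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall and balanced F-score for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical document ids: four retrieved, three judged relevant.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r, f)
```

Precision and recall here ignore ranking; a rank-aware measure such as MAP needs the ordered result list as well.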

Copyright for CLIA (code)

Code: Details

Input Processing
  Soft Keyboard (Hindi, Bengali, Tamil, Telugu, Punjabi, Marathi languages) (CDAC-P)
  Algorithm for transliteration of Devanagari words to English using Segment Based Transliteration (IIITH, IITB)
  Implementation of Multilingual Sense Dictionary (MSD) along with API for accessing MSD during lexical substitution (IITB)
  Implementation of automatic multi-word extraction algorithm for populating the multi-word field of index (IITB)
Bengali
  Bengali stemmer (IITKGP)
  Bengali Hindi transliteration (IITKGP)
Marathi
  Implementation of Language Analyzers (Morphological Analyzer) for Marathi (IITB)

Copyright for CLIA (code) contd.

Code: Details

Punjabi
  Punjabi Spell Normalizer (CDAC-N)
  Punjabi Stemmer (CDAC-N)
  Font transcoders (Unicode - proprietary fonts), map files etc. (CDAC-N)
Tamil
  Stemmer for Tamil (AUKBC)
  Named Entity Recognition engine (AUKBC)
  Information Extraction (AUKBC)
  Font transcoders (Tamil proprietary fonts) (AUKBC)
  IE template translation (AUKBC)

Copyright for CLIA (code) contd.

Code: Details

Telugu
  Language Analyzer for Telugu (IIITH)
  Query translation for Telugu and Hindi (IIITH)
  Query transliteration for all languages (IIITH)
  Transcoder (IIITH)
Indexing
  CML converter (IITKGP)
  Focused Crawler (IIITH)
  Language Identifier (IIITH)
  File Format Processors (CDACP)

Copyright for CLIA (code) contd.

Code: Details

Ranking
  Ranker implementation (IITKGP)
Output Processing
  Snippet Generation (JU)
  Summary Generation (JU)
  Snippet Translation (JU)
UNL
  Sentence constituent UNL enconverter (AUCEG)
  UNL indexer (AUCEG)
  UNL template based information extractor (AUCEG)
  UNL template based summarizer (AUCEG)
  UNL based search and ranking (ranking module under development) (AUCEG)

Copyright for CLIA (data)

Data: Details

Input Processing
Bengali
  Synset dictionary entries for Bengali (shared with JU and CDAC Pune)
  English to Bengali transliteration of NE list (shared with JU and IIT KGP)
  NE annotated corpora (IITKGP)
  NE list transliterated (IITKGP)
Telugu
  Telugu to English dictionary (IIITH)
  Telugu to English transliteration list (IIITH)
  NE annotated corpora for Telugu and Hindi (IIITH)
  Telugu corpus developed for IE module (IIITH)

Copyright for CLIA (data) contd.

Data: Details

Input Processing
Tamil
  English - Tamil parallel named entity list (AUKBC)
  Tamil - English dictionary (AUKBC)
  Synset dictionary entries for Tamil (AUKBC)
  Tamil named entity annotated corpus (AUKBC)
  English named entity annotated corpus (AUKBC)
  Named entity tagset (AUKBC)

Copyright for CLIA (data) contd.

Data: Details

Punjabi
  Punjabi translations (for parallel corpora) (CDAC-N)
  English - Hindi - Punjabi parallel named entity list (CDAC-N)
  Punjabi named entity tagged corpus (under development) (CDAC-N)
  Database for Punjabi stemmer (prior development) (CDAC-N)
Marathi
  English to Marathi transliteration of NE list (IITB and CDAC Pune)
  Marathi-English parallel corpora in tourism domain used for training the snippet translation SMT system (IITB)
  List of multi-word expressions in Marathi and Hindi (IITB)
  English-Marathi parallel list of named entities used for IE template translation (shared with C-DAC Pune)
Hindi
  Hindi to English dictionary (IIITH)
  Hindi to English transliteration list (IIITH)
  Hindi MW list (IITB)

Copyright for CLIA (data) contd.

Data: Details

Evaluation of the IR system
  Set of test topics (general domain, tourism domain) (ISIK)
  Relevance judgments for the above pair (ISIK)
UNL
  UW list - Tourism domain (AUCEG)

Conclusion

Large scale national level activity
Large number of tools and resources developed under the consortium
Alpha release done in July, 2008
Beta release to take place in Jan, 2009
Look forward to more detailed interactions and suggestions from the international audience


Introducing people…

Principal Investigators

Name of Institute: Names
IITB: Prof. Pushpak Bhattacharyya
IIT-Kgp: Prof. Sudeshna Sarkar
IIITH: Prof. Vasudev Verma
AU-KBC: Prof. Sobha Nair
AU-CEG: Prof. Ranjani Parthasarthy
ISI Kol: Prof. Mandar Mitra
JU Kol: Prof. Sivaji Bandyopadhya
CDAC-P: Dr. Ajai Kumar
CDAC-N: Dr. Karunesh Arora
Utkal University: Prof. Sanghamitra Mohanty

Some prominent research members

Name of Institute: Names
IITB: Manoj, Vishal, Vishaal, Ashish
IIT-Kgp: Nimesh, Dr. Rajendra
IIITH: Bhupal, Praneet
AU-KBC: Pattavi, Vijay, Vijay
AU-CEG: Kaviha, Subha Lalitha
ISI Kol: Prasenjt, Deepashri, Ayan
JU Kol: Asif, Pinaki
CDAC-P: Swati, Abhishek
CDAC-N: Gaur Mohan, Ankur
Utkal University: Balbant Rai

Overview

Technical status of the project
Technical documentation
Shared resources
Testing methodology
Software documentation
Alpha and Beta versions


Technical Summary

Work Flow

[Diagram: Input Query in IL, Input Query Processing, Search, Document Processing, Output Generation, Evaluation]

Project Status

[Diagram: Input Query in IL, Input Query Processing, Search, Document Processing, Output Generation, Evaluation]

Status - Input Processing

Stemmer
  All language stemmers developed
  Integrated with Nutch through plug-ins
  Monolingual retrievals are working
MWE
  Guidelines are under discussion (IITB)
  Marathi: ~2000 MWE
  Bangla: ~600 MWE
  Tamil: ~600 MWE
  Punjabi: ~4000 MWE

Status – Input Processing: NER

Language: NE-tagged corpus size; Accuracy; NE list details
Hindi (IIITH): 50K words; 68%; 31,177 entries
English: 50K (AUKBC); 88.5% (Precision), 73.7% (Recall), F-score 80.44%; 7,500 entries (AUKBC), Gazetteer list size (IITKgp): Health 39,819 entries, Tourism 90,848 entries, General 4,79,427 entries
Punjabi (CDACN): Not started; NA; Person 10,004 | City 500 | Company 500 | Hospital 20,603
Marathi (IITB): 50K; 61.43% (F-score); Total 4763 | Time 361 | Numerical 706 | Names 3666
Bengali (IITKgp): 125K (all domains); ~75-78%; Bangla: 90,000 names (all domains), gazetteer list is being transliterated to Bangla
Tamil (AUKBC): 94K; 88.5% (Precision), 73.7% (Recall), F-score 80.44%; NE 23,000 entries, dictionary of personal names 70,000 (tagged corpus + dictionary used for NER)
Telugu (IIITH): 60K; 74%; 38,000 entries

Status - Input Processing

WSD (IITB)
  2nd version WSD interface for sense-marking of corpus developed by IITB
Dictionary
  IITB working on E-Hin linkage
  All LVs working on IL-IL linking and E-IL linking
  ~10,000 synsets generated from Tourism corpora

Status: Dictionary

Eng-Hin linkage: ~2500 synsets linked (IITB)

IL-IL Dictionary Status (as on 30 Sept 07)

Language: #Synsets linked (without cross-linking)
Bengali: 2005
Marathi: 4298 (all cross-linked)
Punjabi: 559
Tamil: 1890
Telugu: 461

Sample Input Screen

[Screenshot: input screen]

Sample Input Screen

[Screenshot: advanced search option]

Project Status

[Diagram: Input Query in IL, Input Query Processing, Search, Document Processing, Output Generation, Evaluation]

Status – Search

Size of indexed corpus

Language: No of pages; No of URLs
English: 10,000; 115
Hindi: 21,000; 25
Bangla: 3,000; 25
Tamil: 20,000; 25
Punjabi: 17,000; 25
Marathi: 3,300; 42

Status – Search

CML-Text Converter (IIT-Kgp)
  First version of the engine is ready
  Software extracts the fields and body, but does not identify paragraphs and blocks in this version
  Has been tested for Bengali
  Ready to be integrated with Nutch

Project Status

[Diagram: Input Query in IL, Input Query Processing, Search, Document Processing, Output Generation, Evaluation]

Status – Document Processing

Basic IE engine and eleven IE templates are ready (AUKBC)
Has been tested with sample documents (EILMT corpus)
First template “How to reach the place” is being translated to Tamil and Telugu
For other languages, the inflectional markers are being provided

Project Status

[Diagram: Input Query in IL, Input Query Processing, Search, Document Processing, Output Generation, Evaluation]

Sample Output Screen

[Screenshot: output screen if input language is Hindi]

Sample Output Screen

[Screenshot: output screen if input language is Hindi, and English tab is selected]

Sample Output Screen

[Screenshot: output screen of translation of snippet (English to Bengali)]

Sample Output Screen

[Screenshot: advanced output screen with Hindi summary]

Sample Output Screen

[Screenshot: advanced output screen with Hindi summary]

Sample Output Screen

[Screenshot: sample screen with Information Extraction]

Status – Output Generation

Snippet Generation (JU)
  Working for monolingual retrieval
  Integrated with Nutch
  Has been tested for Bengali

Project Status

[Diagram: Input Query in IL, Input Query Processing, Search, Document Processing, Output Generation, Evaluation]

Status - Evaluation

Corpora
  Tourism and Health corpora being collected for all languages
  News corpora also being collected; period of news corpora ranges from 2002 to 2007
  For news corpora, ISI Kol is in dialogue with TOI and Hindustan Times for permission to use their multilingual corpora

Details of Corpora (crawled)

Assumption in SRS: each language corpus has at least 50,000 documents from General / News + all available documents in Tourism and Health

Evaluation: Topics

Topics (ISI Kol)
  A set of 95 topics is ready for evaluation: 30 topics for training, 50 topics for testing and 15 topics as stand-by
  Each topic = Title + Narration + Description
  Translation of these 95 topics has been completed by all six language verticals
Sample topic
<title> Euro Inflation</title>
<desc> Find documents about rises in prices after the introduction of the Euro</desc>
<narr> Any document is relevant that provides information on the rise of prices in any country that introduced the common European currency.</narr>
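A topic in this tagged form can be split into its three fields with a few lines of code. This is a minimal sketch, assuming each field appears once and tags are not nested; `parse_topic` is a hypothetical helper, not part of the CLIA software.

```python
import re

SAMPLE_TOPIC = """<title> Euro Inflation</title>
<desc> Find documents about rises in prices after
the introduction of the Euro</desc>
<narr> Any document is relevant that provides information on the rise
of prices in any country that introduced the common European currency.</narr>"""


def parse_topic(text):
    """Return a dict with the title, desc and narr fields of one topic."""
    fields = {}
    for tag in ("title", "desc", "narr"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        # Collapse whitespace so that wrapped lines become a single string.
        fields[tag] = " ".join(match.group(1).split()) if match else ""
    return fields


topic = parse_topic(SAMPLE_TOPIC)
# Assumption: the title alone is used as a short query; the slides do not
# say which fields the retrieval engines actually consume.
short_query = topic["title"]
print(short_query)
```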

Evaluation Methodology: benchmark data creation

[Diagram components: Corpus, Queries, IR engine 1, IR engine 2, …, IR engine n, Pool, Human judges, Relevance Judgements]

Evaluation Methodology: benchmark data creation

Sample documents (corpus)
Sample queries / topics (95)
Relevance judgement
  No of relevance-judged Bangla documents: ~4,500
  Independently judged against 23 topics by each of two judges
Pooling
  Pooling strategies adopted by TREC
  Lists of the top ~100 documents are taken
  Pool = union of these lists
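The pooling step can be sketched as a set union over the top-ranked documents of each engine. This is an illustrative sketch only; the engine names, document ids and the `build_pool` helper are hypothetical, and the default depth of 100 follows the "top ~100" figure above.

```python
def build_pool(ranked_runs, depth=100):
    """Union of the top `depth` documents from every engine's ranked list."""
    pool = set()
    for ranking in ranked_runs.values():
        pool.update(ranking[:depth])
    return pool


# Hypothetical ranked lists for one topic from three IR engines.
runs = {
    "engine_1": ["d3", "d1", "d7", "d2"],
    "engine_2": ["d1", "d9", "d3", "d4"],
    "engine_3": ["d5", "d3", "d1", "d8"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Only documents in the pool are shown to the human judges, so documents that no engine ranks highly stay unjudged.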

Evaluation Methodology: evaluation engine

[Diagram components: 30 Topics/Queries, Corpus > 50,000 docs, Retrieval Engine, Top 100 Docs, Relevance Judgments, Evaluation Engine, Metrics]
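One metric such an evaluation engine could report is mean average precision (MAP) over the top 100 documents per topic. This is a minimal sketch using the standard definition of average precision; the topic ids, document ids and function names are hypothetical, and the actual engine is not described in the slides.

```python
def average_precision(ranking, relevant, cutoff=100):
    """Average precision of one ranked list, truncated at `cutoff`."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def mean_average_precision(runs, qrels, cutoff=100):
    """Mean of the per-topic average precision values."""
    scores = [average_precision(runs[t], qrels.get(t, ()), cutoff) for t in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical retrieval output and relevance judgments for two topics.
runs = {"t1": ["d1", "d2", "d3"], "t2": ["d4", "d5"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d5", "d6"}}
map_score = mean_average_precision(runs, qrels)
print(round(map_score, 4))
```

Dividing by the number of judged relevant documents, rather than by the number retrieved, penalises relevant documents that never appear in the top 100.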

UNL

Monolingual retrieval is working for Tamil documents
6500 words in UNL dictionary
Words + MWE indexed
Documents indexed
  No. of documents processed in Tourism: 564
  No. of Concept-Relation-Concept indexed: 11,754
  No. of Concept-Relation indexed: 11,754
  No. of Concepts indexed: 17,650
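The three index levels counted above (Concept-Relation-Concept, Concept-Relation, Concept) can be pictured as three inverted indexes built from the same triples. The sketch below is only an assumption about how such an index might be organised; it is not the AU-CEG implementation, and the class, the concepts and the relation label `plc` are illustrative.

```python
from collections import defaultdict


class ConceptIndex:
    """Inverted indexes keyed by C-R-C triples, C-R pairs and single concepts."""

    def __init__(self):
        self.crc = defaultdict(set)
        self.cr = defaultdict(set)
        self.c = defaultdict(set)

    def add(self, doc_id, triples):
        """Index one document given its (concept, relation, concept) triples."""
        for head, relation, tail in triples:
            self.crc[(head, relation, tail)].add(doc_id)
            self.cr[(head, relation)].add(doc_id)
            self.c[head].add(doc_id)
            self.c[tail].add(doc_id)

    def search(self, head, relation=None, tail=None):
        """Look up the most specific index that the query allows."""
        if relation is not None and tail is not None:
            return self.crc.get((head, relation, tail), set())
        if relation is not None:
            return self.cr.get((head, relation), set())
        return self.c.get(head, set())


index = ConceptIndex()
# Hypothetical triples: "visit" linked to a place concept by relation "plc".
index.add("doc1", [("visit", "plc", "Madurai")])
index.add("doc2", [("visit", "plc", "Chennai")])
print(index.search("visit", "plc", "Madurai"))
```

A query that matches a full triple is the most precise; falling back to the pair or single-concept index trades precision for recall.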

Testing Methodology

Black box testing based on SRS and design documents
Unit testing by each sub-system
  Test cases (format) and test reports
Integration testing
  Top down / bottom-up based on dependencies
  Stubs and drivers
Sub-system wise testing (module-wise)
  Input processing
  Search and Retrieval
  Document processing
  Output Generation
  Evaluation
  UNL
System testing
Performance testing
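The idea of stubs and drivers in integration testing can be shown with a toy example: a stub replaces a dependency that is not yet integrated, and a driver feeds test cases to the module under test. All class names, queries and the lower-casing stand-in for stemming are hypothetical and serve only to illustrate the pattern.

```python
class StubStemmer:
    """Stub for a language-specific stemmer that is not integrated yet."""

    def stem(self, token):
        # Placeholder behaviour: lower-case only, no real stemming.
        return token.lower()


class QueryProcessor:
    """Module under test: splits a query into tokens and stems each one."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def process(self, query):
        return [self.stemmer.stem(token) for token in query.split()]


def run_driver():
    """Driver: runs each test case and records whether it passed."""
    processor = QueryProcessor(StubStemmer())
    cases = {
        "Taj Mahal": ["taj", "mahal"],
        "  Goa  beaches ": ["goa", "beaches"],
    }
    return {query: processor.process(query) == expected
            for query, expected in cases.items()}


report = run_driver()
print(report)
```

When the real stemmer is ready, it replaces the stub without changes to the module under test or to the driver.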

Integration

Use of controlled corpora for integration
Use of EILMT English and Hindi parallel corpus
ISI generates the queries for the corpus
Translation of queries by all LVs
English and Hindi synsets identified for building multilingual dictionary by each LV
Each language vertical will be tested for its respective cross-lingual retrieval
Information Extraction and output generation will be done on the same corpora
Integration of each LV into Nutch at IITKgp

Test and Integration (contd.)

Bug tracking system (Bugzilla) to be installed
  Currently planned for installation at IITB on the same server as CVS
Bugzilla
  Web-based general-purpose bug tracker tool
  Tracks not only software bugs but also all other user-submitted tickets
  Eases communication between team members
  Can be integrated with CVS and wiki

Bugzilla

Requirements
  A compatible database management system: MySQL, PostgreSQL
  A suitable release of Perl 5
  A compatible web server
  A suitable mail transfer agent, or any SMTP server
Bugzilla demo
  https://landfill.bugzilla.org/bugzilla-tip/index.cgi

Bugzilla - Design

Bugs can be submitted by anybody, and will be assigned to a particular developer


Deployment Diagram for Nutch-based Search Subsystem

The real-life scenario would have four more such index servers, one for every Indian language, and (maybe) more search servers to support a greater number of searches per unit time

Quoted from Mike Cafarella and Doug Cutting, "Building Nutch: Open Source Search", ACM Queue, vol. 2, no. 2, April 2004


Hosting of Alpha and Beta versions

Alpha Version
  ~10,000 documents in each language
  Low-complexity system
  Hence a simple hardware configuration is sufficient
  Does not include summary generation and output translation
  Planned for Dec 2008

Beta Version
  ~10,00,000 documents in each language
  Hardware configuration being worked out, based on disk space requirements, system throughput, response times, simultaneous users, etc.
  Following details are being worked out:
    Connectivity
    Where to host
    Support for hosting
  Planned for July 2008


Elitex08: Demo of Alpha Version

Plan to demonstrate the following:
  Cross-lingual information retrieval for all languages
  Information extraction and translation of at least one template to Tamil / Telugu
  Snippet generation (monolingual)
Hardware integration: IITKgp
Publicity management / poster design: JU
Funds: participation fees to be shared

Demonstrate the same at the IJCNLP08 exhibition (Hyderabad, Jan 2008)


Gantt chart (as on Aug 30)



Software documentation

SRS (based on IEEE)
Design document v2.0 (based on RUP)
User Requirements Document (Ver 5.0)
Java docs
Test case template
File naming conventions
Testing and integration guidelines
Code review guidelines


Software documentation: SRS

SRS
  Introduction
  Overall description
  External interface requirements
  System features (module-wise)
  Advanced search system for Tamil using UNL


Software documentation: DD

Design document (v 2.0)
  Has been simplified to suit project needs
  Introduction
  System Architecture
    Solution Architecture (brief description of systems, subsystems)
    Software Architecture (block diagrams)
  System Design
    Logical Design (class diagrams)
    Component Design (component diagrams)
  Appendix: other details


Software documentation: URD

URD
  Introduction
  Objective
  Scope of the project
  Product perspective
  Capabilities of the product
  User characteristics
  Assumptions and dependencies
  Operational environment
  Input / output scenarios
  Definitions, acronyms and abbreviations
  References


Software documentation: Test

Test case template (for all tests), with the following fields:
  Test case
  Test data
  Expected result
  Actual result
  Remarks
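The five template fields can be captured as a small record type. The following is only a minimal sketch, not the consortium's actual tooling: the class name `CaseRecord`, the helper `to_csv`, the exact-match pass rule, and the CSV layout are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass, fields
from typing import Iterable


@dataclass
class CaseRecord:
    """One row of the test case template (hypothetical representation)."""
    test_case: str
    test_data: str
    expected_result: str
    actual_result: str = ""   # filled in after the test is run
    remarks: str = ""

    def passed(self) -> bool:
        # Assumption: a test passes when actual and expected results match exactly.
        return self.actual_result == self.expected_result


def to_csv(records: Iterable[CaseRecord]) -> str:
    """Render records as CSV, with the template's field names as the header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Test case", "Test data", "Expected result",
                     "Actual result", "Remarks"])
    for rec in records:
        writer.writerow([getattr(rec, f.name) for f in fields(rec)])
    return buf.getvalue()
```

A filled-in sheet could then be produced with `to_csv([...])` and shared alongside the other project documents.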


Software documentation: File naming

File naming convention captures the following:
  Subject & domain of document
  Content type (ppt / doc / rpt / Tr / etc.)
  Name of institute (IITB / ISI / IIITH etc.)
  Date of creation of doc (dd-mon-yy)
  Version no.

Format:
  <Subject>_<Content_type>_<Institute>_<date>_<ver.no>.<file ext>

E.g. PRSG_Pres_IITB_08dec07_v1.ppt
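A naming convention of this kind can be checked mechanically. The sketch below parses and rebuilds names in the format shown on the slide. It rests on several assumptions that the slide does not state: fields contain no underscores, the date is written without separators as in the example (08dec07), two-digit years mean 20yy, and the version is written as "v" followed by digits. The names `DocName`, `parse_doc_name` and `build_doc_name` are hypothetical.

```python
import re
from datetime import date
from typing import NamedTuple

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


class DocName(NamedTuple):
    subject: str
    content_type: str
    institute: str
    created: date
    version: str
    extension: str


# <Subject>_<Content_type>_<Institute>_<date>_<ver.no>.<file ext>
_PATTERN = re.compile(
    r"^(?P<subject>[^_]+)_(?P<ctype>[^_]+)_(?P<inst>[^_]+)"
    r"_(?P<day>\d{2})(?P<mon>[A-Za-z]{3})(?P<yy>\d{2})"
    r"_v(?P<ver>\d+(?:\.\d+)*)\.(?P<ext>\w+)$"
)


def parse_doc_name(filename: str) -> DocName:
    """Split a file name into its fields, or raise ValueError if it does not conform."""
    m = _PATTERN.match(filename)
    if m is None or m["mon"].lower() not in MONTHS:
        raise ValueError(f"not a conforming file name: {filename!r}")
    # Assumption: two-digit years fall in 2000-2099.
    created = date(2000 + int(m["yy"]),
                   MONTHS.index(m["mon"].lower()) + 1,
                   int(m["day"]))
    return DocName(m["subject"], m["ctype"], m["inst"],
                   created, m["ver"], m["ext"])


def build_doc_name(subject: str, content_type: str, institute: str,
                   created: date, version: str, extension: str) -> str:
    """Assemble a file name from its fields, using the same layout as the parser."""
    stamp = f"{created.day:02d}{MONTHS[created.month - 1]}{created.year % 100:02d}"
    return f"{subject}_{content_type}_{institute}_{stamp}_v{version}.{extension}"
```

Under these assumptions, parsing the slide's example and rebuilding it from the parsed fields gives back the same name, which makes the pair usable as a simple check when documents are uploaded.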


Shareable Resources and Tools

Shared resources across projects

From ILILMT to CLIA:
  Morph Analyzer
  POS Tagger
  Chunker
  Dictionary Standardization
  IL-IL Synsets

From EILMT to CLIA:
  Synsets E-IL

From CLIA to other projects:
  NER engine
  NE list
  MWE


Collaborative tools used - CLIA

Tool            Purpose
Google Groups   Group e-mailing
Wiki            Project documents, member contact details, minutes of meetings, presentations, timelines, progress reports, fund details, etc.
CVS             Source code
Google Docs     Sharing and editing of documents
Webex           Audio conferencing

Weekly teleconferences


CLIA Wiki site
http://www.cfilt.iitb.ac.in/~consortia/dokuwiki

CLIA Wiki contents
  Project team contact details
  Project documentation (SRS, Design doc, URD, etc.)
  Meeting minutes and presentations
  Project fund details
  Progress reports and timelines
  Project resources
  Corpus
  Collaborative platform for audio conferences


CLIA Wiki site


Wiki – Upload notification


Thank You