Presentation of the CLIA Project
by Pushpak Bhattacharyya, IIT Bombay,
on behalf of the CLIA Consortium
12 Dec 2008
On the occasion of FIRE at Kolkata
Motivation
12 Dec 08 FIRE– Kolkata - CLIA Project 3
CLIA is a real need
Great language diversity in India
Low comfort level with English: less than 5% of the total population of about 700 million can use English effectively
Need for critical information in large quantity and high quality, especially in agriculture, health, tourism, education and other sectors
CLIA project started in 2006; domains: tourism and health
Geographically speaking
Telugu
Tamil
Bengali
Marathi
Punjabi
World rank in terms of number of speakers:
Hindi-Urdu: 5th
Bengali: 7th
Marathi: 14th
…
CLIA: basic information
Defining Diagram
CLIA Consortium Members
Name of Institute: Assigned Language(s)
IIT Bombay (Consortium Leader): Marathi, Hindi
IIT-Kharagpur (Consortium Co-leader): Bengali
IIIT Hyderabad: Telugu, Hindi
Anna University-KBC: Tamil
Anna University-College of Engg: Tamil
ISI Kol: Bengali
Jadavpur University Kolkata: Bengali
CDAC-Pune: Marathi, Hindi, Tamil
CDAC-Noida: Punjabi
Utkal University: --
Principal Investigators
Name of Institute: Names
IITB: Prof. Pushpak Bhattacharyya
IIT-Kgp: Prof. Sudeshna Sarkar
IIITH: Prof. Vasudeva Varma
AU-KBC: Prof. Sobha L.
AU-CEG: Prof. Ranjani Parthasarathi
ISI Kol: Prof. Mandar Mitra
JU Kol: Prof. Sivaji Bandyopadhyay
CDAC-P: Dr. Ajai Kumar
CDAC-N: Dr. Karunesh Arora
Utkal University: Prof. Sanghamitra Mohanty
Some prominent research members
Name of Institute: Names
IITB: Manoj, Vishal, Vishaal, Ashish
IIT-Kgp: Nimesh, Dr. Rajendra
IIITH: Bhupal, Praneet
AU-KBC: Pattavi, Vijay, Vijay
AU-CEG: Kaviha, Subha Lalitha
ISI Kol: Prasenjt, Deepashri, Ayan
JU Kol: Asif, Pinaki
CDAC-P: Swati, Abhishek
CDAC-N: Gaur Mohan, Ankur
Utkal University: Balbant Rai
Prior expertise brought to the project (Horizontal, i.e., language independent)
Name of Institute: Areas of prior expertise/experience
IITB: NLP (LR, WSD, MT), Semantic Search
IIT-Kgp: Search and Ranking, Shallow Parsing
IIITH: Commercial level search engine building, query processing
AU-KBC: NER, Information Extraction, Summarization, Anaphora
AU-CEG: Morphology, Interlingua
ISI Kol: IR Evaluation, large scale IR system building (SMART)
JU Kol: Example based MT, Summarization, NER
CDAC-P: Converters, File format processors, MT
CDAC-N: Parallel corpora, Query processing
Utkal University: Machine Translation, Lexical Resources
Prior expertise brought to the project (vertical, i.e., language specific)
Name of Institute: Areas of prior expertise/experience
IITB: Hindi Marathi wordnet building, Hindi Marathi shallow parsing
IIT-Kgp: Bengali shallow parsing including MA
IIITH: Telugu-Eng CLIR, Telugu query processing
AU-KBC: Tamil NER, Tamil IE, Tamil Morph
AU-CEG: Tamil Morph, Eng-Tamil MT
ISI Kol: Bengali statistical stemming, large scale corpora for Bengali
JU Kol: Bengali NER, EBMT involving Bengali
CDAC-P: Various Indian language converters
CDAC-N: Aligned parallel corpora for Indian languages
Utkal University: --
Horizontal tasks of CLIA and the organizations responsible
Input Query processing: IIIT Hyderabad
Crawling, Indexing: IIT KGP, IIITH, IITB
Searching, Ranking: IIT KGP, IIITH, IITB
User Interface: CDAC Noida
File format processing: CDAC Pune
Horizontal tasks of CLIA and the organizations responsible (contd)
Document Processing (index time NER, IE): AU KBC
Document Processing (Post Retrieval: Snippet, Summary): Jadavpur University
Distributed Search: IIT KGP, Utkal, CDACP
Evaluation, Relevance Judgement: ISI Kolkata
UNL based semantic search (for Tamil): AU CEG
Languages and the organizations responsible
Language: Organization(s)
Bengali: IIT KGP (c), JU, ISI
Hindi: IIITH (c), IITB, CDAC Noida
Marathi: IITB (c), CDAC Pune
Punjabi: CDAC Noida
Tamil: AUKBC (c), AUCEG
Telugu: IIITH
CLIA Important Dates
Project Start Date: 29th Aug 06 (effectively Jan 2007)
First meeting of the Project Review and Steering Group (PRSG): 2nd March 2007
Second PRSG: 30th Aug 2007
Third PRSG: 08th March 2008
Fourth PRSG: 15th July 2008
Alpha version released: 15th July, 2008
Beta version to be released (along with the 5th PRSG): January, 2009
Related consortium: E-IL MT project
English to Indian Language MT
Indian Languages: Hindi, Marathi, Bengali, Urdu, Oriya, Telugu, Tamil
Approaches: Statistical MT, Example Based MT
Members: CDAC Pune (c), IIT Bombay, JU, UU, IIITH, IIITA
Related consortium: IL-IL MT project
Indian Language to Indian Language MT
Indian Languages: Hindi, Marathi, Bengali, Punjabi, Tamil, Telugu, Kannada
Approach: Transfer Based
Members: IIITH (c), CDAC Pune, IIT Bombay, JU, University of Hyderabad, AU KBC
All three projects are time bound and result oriented
2 years time frame (extension granted for 1 year)
Strict deliverables
For each project the budget outlay is about Rs 80 million (USD 2 million)
CLIA: Top level technological information
Process Flow
CLIA: achievements in 2 years (Jan 2007 to Dec 2008)
Tools and resources (Copyrightable code and data)
Steps towards overall evaluation
Yet to be completed: Precision, Recall, MAP, F-score etc.
Large Relevance judgment base under construction
  50 queries per language (6 languages)
  About 5000 documents per language (6 languages)
Crawled and indexed document base of English: approx 600,000 pages
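The metrics named above can be computed from a ranked result list and a set of relevance judgments. The following is a minimal sketch of the standard definitions, not the project's evaluation code; the function names and document IDs are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_f1(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and balanced F-score for one query."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Sum of precision at each relevant hit, divided by the number of relevant docs."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """MAP: average precision averaged over all judged queries."""
    if not qrels:
        return 0.0
    return sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items()) / len(qrels)


# Hypothetical example: one query, four retrieved documents, three relevant ones.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
p, r, f1 = precision_recall_f1(ranked, relevant)
ap = average_precision(ranked, relevant)
```

The sketch assumes each document appears at most once in a ranked list; queries without any judged relevant documents contribute zero.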
Copyright for CLIA (code)
Code: Details
Input Processing
  Soft Keyboard (Hindi, Bengali, Tamil, Telugu, Punjabi, Marathi Languages) (CDAC - P)
  Algorithm for transliteration of Devanagari words to English using Segment Based Transliteration (IIITH, IITB)
  Implementation of Multilingual Sense Dictionary along with API for accessing MSD during lexical substitution (IITB)
  Implementation of automatic Multi-word extraction algorithm for populating the multi-word field of index (IITB)
Bengali
  Bengali stemmer (IITKGP)
  Bengali Hindi transliteration (IITKGP)
Marathi
  Implementation of Language Analyzers (Morphological Analyzer) for Marathi (IITB)
Copyright for CLIA (code) contd.
Code: Details
Punjabi
  Punjabi Spell Normalizer (CDAC-N)
  Punjabi Stemmer (CDAC-N)
  Font transcoders (Unicode - Proprietary fonts) - map files etc. (CDAC-N)
Tamil
  Stemmer for Tamil (AUKBC)
  Named Entity Recognition engine (AUKBC)
  Information Extraction (AUKBC)
  Font transcoders (Tamil Proprietary fonts) (AUKBC)
  IE template Translation (AUKBC)
Copyright for CLIA (code) contd.
Code: Details
Telugu
  Language Analyzer for Telugu (IIITH)
  Query Translation for Telugu and Hindi (IIITH)
  Query Transliteration for all languages (IIITH)
  Transcoder (IIITH)
Indexing
  CML converter (IITKGP)
  Focused Crawler (IIITH)
  Language Identifier (IIITH)
  File Format Processors (CDACP)
Copyright for CLIA (code) contd.
Code: Details
Ranking
  Ranker implementation (IITKGP)
Output Processing
  Snippet Generation (JU)
  Summary Generation (JU)
  Snippet Translation (JU)
UNL
  Sentence constituent UNL enconverter (AUCEG)
  UNL indexer (AUCEG)
  UNL Template based Information extractor (AUCEG)
  UNL Template based Summarizer (AUCEG)
  UNL based Search and ranking (ranking module under development) (AUCEG)
Copyright for CLIA (data)
Data: Details
Input Processing
Bengali
  Synset dictionary entries for Bengali (shared with JU and CDAC Pune)
  English to Bengali Transliteration of NE list (shared with JU and IIT KGP)
  NE annotated corpora (IITKGP)
  NE list transliterated (IITKGP)
Telugu
  Telugu to English Dictionary (IIITH)
  Telugu to English Transliteration list (IIITH)
  NE annotated corpora for Telugu and Hindi (IIITH)
  Telugu corpus developed for IE module (IIITH)
Copyright for CLIA (data) contd.
Data: Details
Input Processing
Tamil
  English - Tamil Parallel Named Entity List (AUKBC)
  Tamil - English Dictionary (AUKBC)
  Synset dictionary entries for Tamil (AUKBC)
  Tamil Named Entity annotated corpus (AUKBC)
  English Named Entity annotated corpus (AUKBC)
  Named Entity Tagset (AUKBC)
Copyright for CLIA (data) contd.
Data: Details
Punjabi
  Punjabi translations (for parallel corpora) (CDAC-N)
  English - Hindi - Punjabi parallel named entity list (CDAC-N)
  Punjabi Named Entity Tagged Corpus (under development) (CDAC-N)
  Database for Punjabi stemmer (prior development) (CDAC-N)
Marathi
  English to Marathi Transliteration of NE list (IITB and CDAC Pune)
  Marathi-English parallel corpora in tourism domain used for training the snippet translation SMT system (IITB)
  List of Multi-Word Expressions in Marathi and Hindi (IITB)
  English-Marathi Parallel list of Named-entities used for IE Template translation (Shared with C-DAC Pune)
Hindi
  Hindi to English Dictionary (IIITH)
  Hindi to English transliteration list (IIITH)
  Hindi MW list (IITB)
Copyright for CLIA (data) contd.
Data: Details
Evaluation of the IR system
  Set of test topics (general domain, tourism domain) (ISIK)
  Relevance judgments for the above pair (ISIK)
UNL
  UW list - Tourism domain (AUCEG)
Conclusion
Large scale national level activity
Large number of tools and resources developed under the consortium
Alpha release done in July, 2008
Beta release to take place in Jan, 2009
Look forward to more detailed interactions and suggestions from the international audience
Introducing people…
Principal Investigators
Name of Institute: Names
IITB: Prof. Pushpak Bhattacharyya
IIT-Kgp: Prof. Sudeshna Sarkar
IIITH: Prof. Vasudeva Varma
AU-KBC: Prof. Sobha Nair
AU-CEG: Prof. Ranjani Parthasarathi
ISI Kol: Prof. Mandar Mitra
JU Kol: Prof. Sivaji Bandyopadhyay
CDAC-P: Dr. Ajai Kumar
CDAC-N: Dr. Karunesh Arora
Utkal University: Prof. Sanghamitra Mohanty
Overview
Technical Status of the Project
Technical Documentation
Shared resources
Testing methodology
Software Documentation
Alpha and Beta versions
Technical Summary
Work Flow
Input Query Processing
Search
Output Generation
Document Processing
Evaluation
Input Query in IL
Project Status
Input Query Processing
Search
Output Generation
Document Processing
Evaluation
Input Query in IL
Status - Input Processing
Stemmer
  All Language stemmers developed
  Integrated with Nutch through plug-ins
  Monolingual retrievals are working
MWE
  Guidelines are under discussion (IITB)
  Marathi ~ 2000 MWE
  Bangla ~ 600 MWE
  Tamil ~ 600 MWE
  Punjabi ~ 4000 MWE
Status – Input Processing: NER
Language: NE-tagged corpus size; Accuracy; NE list details
Hindi (IIITH): 50K words; 68%; 31,177 entries
English: 50K (AUKBC); 88.5% (Precision), 73.7% (Recall), F-Score-80.44%; 7,500 entries (AUKBC); Gazetteer list size (IITKgp): Health-39,819 entries, Tourism-90,848 entries, General-4,79,427 entries
Punjabi (CDACN): Not started; NA; Person-10,004 | City-500 | Company-500 | Hospital-20,603
Marathi (IITB): 50K; 61.43% (F-score); Total-4763 | Time-361 | Numerical-706 | Names-3666
Bengali (IITKgp): 125K (all domains); ~ 75-78%; Bangla: 90,000 names (all domains), Gazetteer list is being transliterated to Bangla
Tamil (AUKBC): 94K; 88.5% (Precision), 73.7% (Recall), F-Score-80.44%; NE-23,000 entries, Dictionary of Personal names-70,000 (Tagged corpus + Dictionary used for NER)
Telugu (IIITH): 60K; 74%; 38,000 entries
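As a worked check of how the F-score in the table relates to precision and recall, the balanced F-score is their harmonic mean. The slides do not state which F-score variant was used; the balanced form is assumed here. With the rounded figures reported for Tamil and English:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 88.5 \times 73.7}{88.5 + 73.7}
  = \frac{13044.9}{162.2}
  \approx 80.4
```

This is close to the reported 80.44%; the unrounded precision and recall values are not given in the slides.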
Status - Input Processing
WSD (IITB)
  2nd version WSD
  Interface for Sense-marking of corpus developed by IITB
Dictionary
  IITB working on E-Hin linkage
  All LVs working on IL-IL linking and E-IL linking
  ~10,000 synsets generated from Tourism corpora
Status: Dictionary
Eng-Hin Linkage: ~ 2500 synsets linked (IITB)
IL-IL Dictionary Status (as on 30 Sept 07)
Language: #Synsets linked (without cross-linking)
Bengali: 2005
Marathi: 4298 (all cross-linked)
Punjabi: 559
Tamil: 1890
Telugu: 461
Sample Input Screen: Input screen
Sample Input Screen: Advanced search option
Status – Search
Size of Indexed corpus
Language: No of pages; No of URLs
English: 10,000; 115
Hindi: 21,000; 25
Bangla: 3,000; 25
Tamil: 20,000; 25
Punjabi: 17,000; 25
Marathi: 3,300; 42
Status – Search
cML-Text Converter (IIT-Kgp)
  First version of the engine is ready
  Software extracts the fields and body, but does not identify paragraphs and blocks in this version
  Has been tested for Bengali
  Ready to be integrated with Nutch
Status – Document Processing
Basic IE Engine and eleven IE Templates are ready (AUKBC)
Has been tested with sample documents (EILMT corpus)
First template “How to reach the place” is being translated into Tamil and Telugu
For other languages, the inflectional markers are being provided
Sample Output Screen: Output screen if Input language is Hindi
Sample Output Screen: Output screen if Input language is Hindi, and English tab is selected
Sample Output Screen: Output screen of translation of Snippet (English to Bengali)
Sample Output Screen: Advanced output screen with Hindi Summary
Sample Output Screen: Sample screen with Information Extraction
Status – Output Generation
Snippet Generation (JU)
  Working for monolingual retrieval
  Integrated with Nutch
  Has been tested for Bengali
Status - Evaluation
Corpora
  Tourism and Health corpora being collected for all languages
  News corpora also being collected. Period of news corpora ranges from 2002 to 2007
  For news corpora, ISI Kol is in dialogue with TOI and Hindustan Times for permission to use their multilingual corpora
Details of Corpora (crawled)
Assumption in SRS: Each language corpus has at least 50,000 documents from General / News + all available documents in Tourism and Health
Evaluation: Topics
Topics (ISI Kol)
  A set of 95 topics is ready for evaluation
  30 topics for training, 50 topics for testing and 15 topics as stand-by
  Each topic = Title + Narration + Description
  Translation of these 95 topics has been completed by all six language verticals
Sample Topic
<title> Euro Inflation</title>
<desc> Find documents about rises in prices after the introduction of the Euro</desc>
<narr> Any document is relevant that provides information on the rise of prices in any country that introduced the common European currency.</narr>
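A topic in the format shown above can be read into its three fields with a few lines of code. This is a minimal sketch, assuming each topic carries exactly the title, desc and narr tags of the sample; the function name is hypothetical and the project's actual topic files may contain further markup.

```python
import re
from typing import Dict

TOPIC_FIELDS = ("title", "desc", "narr")


def parse_topic(text: str) -> Dict[str, str]:
    """Extract the title, desc and narr fields from one topic string."""
    topic = {}
    for field in TOPIC_FIELDS:
        match = re.search(rf"<{field}>(.*?)</{field}>", text, flags=re.DOTALL)
        # Collapse internal whitespace so wrapped lines read as one sentence.
        topic[field] = " ".join(match.group(1).split()) if match else ""
    return topic


sample = """<title> Euro Inflation</title>
<desc> Find documents about rises in prices after
the introduction of the Euro</desc>
<narr> Any document is relevant that provides information on the rise of
prices in any country that introduced the common European currency.</narr>"""

topic = parse_topic(sample)
print(topic["title"])  # Euro Inflation
```

A missing field yields an empty string rather than an error, which keeps partially translated topics usable.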
Evaluation Methodology: Benchmark data creation
Diagram labels: Corpus, Queries, IR engine 1, IR engine 2, ..., IR engine n, Pool, Human judges, Relevance Judgements
Evaluation Methodology: Benchmark data creation
Sample documents (corpus)
Sample Queries / Topics (95)
Relevance judgement
  No of relevance judged Bangla documents ~ 4,500
  Independently judged against 23 topics by each of two judges
Pooling
  Pooling strategies adopted by TREC
  List of top ~100 documents is taken
  Pool = union of these
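The pooling step described above (take the top documents from each engine and form their union) can be sketched as follows. This is an illustration only: the engine names and document IDs are made up, and the default depth of 100 mirrors the "top ~100" figure on the slide.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Union of the top-`depth` documents from every engine's ranked list."""
    pool: Set[str] = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool


# Hypothetical ranked lists for one topic from three engines.
runs = {
    "engine_1": ["d3", "d1", "d7"],
    "engine_2": ["d1", "d9", "d4"],
    "engine_n": ["d7", "d2", "d3"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))  # ['d1', 'd2', 'd3', 'd7', 'd9']
```

Only the documents in the pool are then shown to the human judges, so the judging effort grows with the pool size rather than with the size of the corpus.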
Evaluation methodology: Evaluation engine
Diagram labels: 30 Topics/Queries, Corpus > 50,000 docs, Retrieval Engine, Top 100 Docs, Evaluation Engine, Relevance Judgments, Metrics
UNL
Monolingual retrieval is working for Tamil documents
6500 words in UNL Dictionary
Words + MWE indexed
Documents indexed
  No. of documents processed in Tourism - 564
  No of Concept-Relation-Concept indexed - 11,754
  No of Concept-Relation indexed - 11,754
  No of Concepts indexed - 17,650
Testing Methodology
Black box testing based on SRS and design documents
Unit testing by each sub-system
  Test cases (format) and test reports
Integration testing
  Top down / Bottom-up based on dependencies
  Stubs and drivers
  Sub-system wise testing (module-wise): Input processing, Search and Retrieval, Document processing, Output Generation, Evaluation, UNL
System Testing
Performance testing
Integration
Use of controlled corpora for Integration
Use of EILMT English and Hindi parallel corpus
ISI generates the queries for corpus
Translation of queries by all LVs
English and Hindi synsets identified for building multilingual dictionary by each LV
Each language vertical will be tested for their respective cross-lingual retrieval
Information Extraction and output generation will be done on the same corpora
Integration of each LV into Nutch at IITKgp
Test and Integration (contd.)
Bug tracking system (Bugzilla) to be installed
Currently planned for installation at IITB on the same server as CVS
Bugzilla
  Web-based general-purpose bug tracker tool
  Tracks not only software bugs but also all other user-submitted tracking tickets
  Eases communication between team members
  Can be integrated with CVS and WIKI
Bugzilla Requirements
A compatible database management system – MySQL, PostgreSQL
A suitable release of Perl 5
A compatible web server
A suitable mail transfer agent, or any SMTP server
Bugzilla Demo
https://landfill.bugzilla.org/bugzilla-tip/index.cgi
Bugzilla - Design
Bugs can be submitted by anybody, and will be assigned to a particular developer
Deployment Diagram for Nutch-based Search Subsystem
The real life scenario would have four more such index servers, one for every Indian language and (maybe) more search servers to ensure greater number of searches per unit time
Quoted from Mike Cafarella, Doug Cutting, Building Nutch: Open Source Search, Queue, v.2 n.2, April 2004
Hosting of Alpha and Beta versions
Alpha Version
  ~10,000 documents in each language
  Low complexity system
  Hence simple hardware configuration sufficient
  Does not include Summary generation and Output translation
  Planned for Dec 2008
Beta Version
  ~10,00,000 (1 million) documents in each language
  Hardware configuration being worked out - based on disk space requirements, throughput of system, response times, simultaneous users etc.
  Following details are being worked out: Connectivity, Where to host, Support for hosting
  Planned for July 2008
Elitex08: Demo of Alpha Version
Plan to demonstrate the following:
  Cross-lingual information retrieval for all languages
  Information Extraction and translation of at least one template to Tamil / Telugu
  Snippet Generation (monolingual)
Hardware integration – IITKgp
Publicity management / Poster design - JU
Funds: Participation fees to be shared
Demonstrate the same at IJCNLP08 exhibition (in Hyderabad - Jan 2008)
Gantt chart (as on Aug 30)
Software documentation
SRS (Based on IEEE)
Design document v2.0 (based on RUP)
User Requirements Document (Ver 5.0)
Java docs
Test cases template
File naming conventions
Testing and integration guidelines
Code review guidelines
Software documentation : SRS
SRS
  Introduction
  Overall description
  External interface requirements
  System features (module-wise)
  Advanced Search system for Tamil using UNL
Software documentation: DD
Design document (v 2.0)
  Has been simplified to suit project needs
  Introduction
  System Architecture
    Solution Architecture (brief description of systems, subsystems)
    Software Architecture (block diagrams)
  System Design
    Logical Design (Class Diagrams)
    Component Design (Component Diagrams)
  Appendix - other details
Software documentation: URD
URD
  Introduction
  Objective
  Scope of the project
  Product perspective
  Capabilities of the Product
  User Characteristics
  Assumptions and dependencies
  Operational environment
  Input / Output scenarios
  Definitions, acronyms and abbreviations
  References
Software documentation: Test
Test case template: for all tests
Fields: Test case, Test data, Expected result, Actual result, Remarks
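The test case template can be modelled as a small record type. This is a sketch under assumptions: only the five field names come from the slide, while the class name, the `passed` helper and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseRecord:
    """One row of the test case template (field names from the slide)."""
    test_case: str
    test_data: str
    expected_result: str
    actual_result: str = ""
    remarks: str = ""

    def passed(self) -> bool:
        # Assumption: a test passes when actual and expected results match exactly.
        return self.actual_result == self.expected_result


# Hypothetical entry, filled in after running the test.
record = CaseRecord(
    test_case="Monolingual Bengali query returns snippets",
    test_data="sample Bengali query",
    expected_result="snippets shown",
)
record.actual_result = "snippets shown"
print(record.passed())  # True
```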
Software documentation: File naming
File naming convention captures the following:
  Subject & domain of document
  Content Type (ppt / doc / rpt / Tr / etc)
  Name of Institute (IITB / ISI / IIITH etc.)
  Date of creation of doc (dd-mon-yy)
  Version no.
Format
<Subject>_<Content_type>_<Institute>_<date>_<ver.no>.<file ext>
E.g. PRSG_Pres_IITB_08dec07_v1.ppt
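A file name following this convention can be assembled programmatically. The sketch below is illustrative, with a hypothetical function name; it writes the date without separators and prefixes the version with "v" because that is how the slide's example (08dec07, v1) is written.

```python
from datetime import date

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def clia_filename(subject: str, content_type: str, institute: str,
                  created: date, version: int, ext: str) -> str:
    """Build <Subject>_<Content_type>_<Institute>_<date>_<ver.no>.<file ext>."""
    # Locale-independent ddmonyy stamp, e.g. 08dec07.
    stamp = f"{created.day:02d}{MONTHS[created.month - 1]}{created.year % 100:02d}"
    return f"{subject}_{content_type}_{institute}_{stamp}_v{version}.{ext}"


name = clia_filename("PRSG", "Pres", "IITB", date(2007, 12, 8), 1, "ppt")
print(name)  # PRSG_Pres_IITB_08dec07_v1.ppt
```

The month names are kept in a tuple rather than taken from `strftime`, so the result does not depend on the machine's locale.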
Shareable Resources and Tools
Shared Resources across projects
From ILILMT to CLIA:
  Morph Analyzer
  POS Tagger
  Chunker
  Dictionary Standardization
  IL-IL Synsets
From EILMT to CLIA:
  Synsets E-IL
From CLIA to other projects:
  NER engine
  NE list
  MWE
Collaborative tools used - CLIA
Tool: Purpose
Googlegroups: Group e-Mailing
Wiki: Project Documents, Member Contact details, Minutes of meeting, Presentations, Timelines, progress reports, fund details etc
CVS: Source code
Google docs: Sharing and editing of documents
Webex: Audioconferencing; Weekly teleconferences
CLIA Wiki site
http://www.cfilt.iitb.ac.in/~consortia/dokuwiki
CLIA Wiki contents
  Project Team Contact details
  Project documentation (SRS, Design doc, URD..)
  Meeting minutes and presentations
  Project fund details
  Progress reports and timelines
  Project resources
  Corpus
  Collaborative platform for audio conferences
Wiki – Upload notification
Thank You