The Development of E2T and T2E Active Reading via Web
Asanee Kawtrakul and Teams
Kasetsart University, Bangkok, Thailand
Fifth Agricultural Ontology Service (AOS) Workshop
29 April 2004, Beijing, China
4
Collaboration
Library Institute of Kasetsart University: providing the thesaurus and agricultural corpus
5
Motivation
Valuable data is scattered throughout the organization in multiple languages
Good information is collected by many individuals in unstructured formats
Digested information enables quicker decision-making
6
Proposed project
Summarization
– From unstructured to structured format
– Only the gist of the information
Translation
– From English to Thai (E2T)
– Thai to English (T2E)
7
Objectives
To develop a system for summarizing and translating agricultural information from English to Thai using statistical and frame-based approaches (E2T)
To support the development of information discovery and web-based information exchange in the agricultural domain (T2E)
9
Summarization (Input)
Let us focus on Canada’s agricultural products. In 1998, there were 1,216 registered commercial egg producers in Canada. Ontario produced 39.8% of all eggs in Canada; Quebec was second with 16.6%. The western provinces have a combined egg production of 35.6% and the eastern provinces have a combined production of 8.0%.
Courtesy of Agriculture and Agri-Food Canada, http://www.agr.ca/cb
10
Summarization (Cube)
Products | Country | Region | Year | Quantity
Egg | Canada | State: Ontario | 1998 | 39.8%
Egg | Canada | State: Quebec | 1998 | 16.6%
Egg | Canada | Western Provinces | 1998 | 35.6%
Egg | Canada | Eastern Provinces | 1998 | 8.0%
11
Other Output
[Chart: monthly values, January to December; vertical axis 0 to 8; series: corn, soybean, rice]
12
Some related work
Frame
– Knowledge representation in the form of slots and fillers
– Consisting of attributes and their values
Attributes | Values
Category | Paddy
Exporter | Thailand
Price | 300
Unit | Dollars/Ton
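A frame of this kind maps naturally onto a small record type. The sketch below is illustrative only: the `Frame` class and its field names are hypothetical, chosen to mirror the slots shown on the slide, and are not taken from the system itself.

```python
from dataclasses import dataclass, asdict


@dataclass
class Frame:
    """A slot-and-filler record; field names mirror the slide's attributes (assumed)."""
    category: str
    exporter: str
    price: float
    unit: str


frame = Frame(category="Paddy", exporter="Thailand", price=300, unit="Dollars/Ton")

# Print each attribute (slot) with its value (filler).
for slot, filler in asdict(frame).items():
    print(f"{slot}: {filler}")
```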
13
Methodologies
Integration of NLP techniques and data cube structure
– The gist of the information is extracted and summarized by frames and then translated into the target language
– The data cube structure supports efficient data access management and powerful decision making
Focusing on the case
– Agricultural summary articles, which share a largely similar structure
14
Why Are NLP Techniques Needed?
NP analysis
– To extract named entities for activating a frame
– To enhance the performance of indexing
Word sense disambiguation
– Pound: 1) the basic monetary unit of the United Kingdom; 2) a unit of mass and weight
15
System Overview
[Diagram components]
Internet
Gathering Module
Document Database
Indexing and Clustering Module
Summarization Module
Translation Module
Data Cube
Graphical User Interface
16
Gathering Module
[Diagram components]
Internet
Web Robot
Preprocessing
Document Database
Agricultural Papers’ Abstracts
17
Indexing and Clustering Module
[Diagram components]
Documents
Lexical Token Identification
Weight Computation
Phrase Extraction
Multi-level Indexing (Word, Phrase, and Concept)
Document Classification (Statistical Method)
Clusters of Documents
18
Summarization Module
[Diagram components]
Document
Sentence Filtering
Shallow Parsing
Sentence Structures
Frame Generation
Frames
Translation
Templates (Depending on Content’s Domain)
Data Cube
Knowledge Base: Frame, Thesaurus
19
Summarization (Input)
Let us focus on Canada’s agricultural products. In 1998, there were 1,216 registered commercial egg producers in Canada. Ontario produced 39.8% of all eggs in Canada; Quebec was second with 16.6%. The western provinces have a combined egg production of 35.6% and the eastern provinces have a combined production of 8.0%.
Courtesy of Agriculture and Agri-Food Canada, http://www.agr.ca/cb
20
Summarization (Filtering)
Indices | Weight
Number | Very High
Egg- | High
% | High
Produc- | Medium
In 1998, there were 1,216 registered commercial egg producers in Canada.
Ontario produced 39.8% of all eggs in Canada.
Quebec was second with 16.6%.
The western provinces have a combined egg production of 35.6%.
The eastern provinces have a combined production of 8.0%.
Let us focus on Canada’s agricultural products.
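One way to read the filtering step is as weighted index matching: each sentence is scored by the indices it contains, and low-scoring sentences are dropped. The sketch below assumes numeric values for the qualitative weights (Very High = 3, High = 2, Medium = 1) and a hypothetical cut-off of 3. Neither figure comes from the slides, and the actual system may compute weights differently.

```python
import re

# Numeric values for the slide's qualitative weights are assumptions.
WEIGHT = {"Very High": 3, "High": 2, "Medium": 1}

# Each index is paired with a test for its presence in a sentence.
INDICES = [
    ("Number", "Very High", lambda s: re.search(r"\d", s) is not None),
    ("Egg-", "High", lambda s: "egg" in s.lower()),
    ("%", "High", lambda s: "%" in s),
    ("Produc-", "Medium", lambda s: "produc" in s.lower()),
]


def score(sentence: str) -> int:
    """Sum the weights of all indices found in the sentence."""
    return sum(WEIGHT[level] for _, level, found in INDICES if found(sentence))


def filter_sentences(sentences, threshold=3):
    """Keep sentences scoring at least `threshold` (a hypothetical cut-off)."""
    return [s for s in sentences if score(s) >= threshold]


sentences = [
    "In 1998, there were 1,216 registered commercial egg producers in Canada.",
    "Ontario produced 39.8% of all eggs in Canada.",
    "Quebec was second with 16.6%.",
    "The western provinces have a combined egg production of 35.6%.",
    "The eastern provinces have a combined production of 8.0%.",
    "Let us focus on Canada's agricultural products.",
]
kept = filter_sentences(sentences)
```

Under these assumed weights the introductory sentence matches only the "Produc-" index, so it scores lowest and is the one filtered out.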
21
Summarization (Templates)
Product | X
Country | X
Region | X
Year | X
Quantity | X
22
Summarization (Frames)
Product | Egg
Country | Canada
Region | State: Ontario
Year | 1998
Quantity | 39.8%

Product | Egg
Country | Canada
Region | Eastern Provinces
Year | 1998
Quantity | 8.0%

Product | Egg
Country | Canada
Region | State: Quebec
Year | 1998
Quantity | 16.6%

Product | Egg
Country | Canada
Region | Western Provinces
Year | 1998
Quantity | 35.6%
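Filling the template can be sketched as slot extraction over the filtered sentences. The following is a minimal illustration, not the system's method: the region cues, the percentage pattern, and the idea of carrying Product, Country, and Year over from context are all assumptions made for this example. The slides describe shallow parsing and a thesaurus-backed knowledge base instead.

```python
import re

# Hypothetical region cues; the real system would draw on its knowledge base.
REGIONS = {
    "ontario": "State: Ontario",
    "quebec": "State: Quebec",
    "western provinces": "Western Provinces",
    "eastern provinces": "Eastern Provinces",
}
PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def fill_template(sentence, context):
    """Return a filled frame, or None if the Region or Quantity slot stays empty."""
    lowered = sentence.lower()
    region = next((name for cue, name in REGIONS.items() if cue in lowered), None)
    quantity = PERCENT.search(sentence)
    if region is None or quantity is None:
        return None
    return {
        "Product": context["Product"],
        "Country": context["Country"],
        "Region": region,
        "Year": context["Year"],
        "Quantity": quantity.group(),
    }


# Slots shared by every sentence in the passage (assumed to come from context).
context = {"Product": "Egg", "Country": "Canada", "Year": 1998}
sentences = [
    "In 1998, there were 1,216 registered commercial egg producers in Canada.",
    "Ontario produced 39.8% of all eggs in Canada.",
    "Quebec was second with 16.6%.",
    "The western provinces have a combined egg production of 35.6%.",
    "The eastern provinces have a combined production of 8.0%.",
]
frames = []
for sentence in sentences:
    frame = fill_template(sentence, context)
    if frame is not None:
        frames.append(frame)
```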
23
Summarization (Cube)
Products | Country | Region | Year | Quantity
Egg | Canada | State: Ontario | 1998 | 39.8%
Egg | Canada | State: Quebec | 1998 | 16.6%
Egg | Canada | Western Provinces | 1998 | 35.6%
Egg | Canada | Eastern Provinces | 1998 | 8.0%
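A data cube of this shape can be approximated by a mapping from dimension coordinates to a measure. The sketch below is one possible in-memory layout under that assumption. The function names `slice_cube` and `roll_up` are hypothetical, and the slides do not say how the cube is actually stored.

```python
from collections import defaultdict

DIMENSIONS = ("Products", "Country", "Region", "Year")

# Each cell is addressed by one value per dimension; the measure is Quantity (%).
cube = {
    ("Egg", "Canada", "State: Ontario", 1998): 39.8,
    ("Egg", "Canada", "State: Quebec", 1998): 16.6,
    ("Egg", "Canada", "Western Provinces", 1998): 35.6,
    ("Egg", "Canada", "Eastern Provinces", 1998): 8.0,
}


def slice_cube(cube, **fixed):
    """Return the cells whose coordinates match every fixed dimension value."""
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        key: measure
        for key, measure in cube.items()
        if all(key[i] == value for i, value in positions.items())
    }


def roll_up(cube, dimension):
    """Sum the measure per value of `dimension`, collapsing the other dimensions."""
    index = DIMENSIONS.index(dimension)
    totals = defaultdict(float)
    for key, measure in cube.items():
        totals[key[index]] += measure
    return dict(totals)


quebec = slice_cube(cube, Region="State: Quebec")
by_country = roll_up(cube, "Country")
```

Slicing on a region answers a single-cell query, while rolling up to the country level sums the four regional shares.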
24
Translation Module
[Diagram components]
User’s Query
Query Processing
Data Cube
Translation and Measurement Unit Conversion
Bilingual Dictionary and Thesaurus
Visualization Tool
25
Translation Result
Category | Exporter | Year | Month | Price | Unit
Paddy | Thailand | 2002 | January | 300 | Dollars/Ton
Paddy | Thailand | 2002 | February | 285 | Dollars/Ton

ประเภท | ผู้ส่งออก | ปี | เดือน | ราคา | หน่วย
ข้าวเปลือก | ประเทศไทย | 2545 | มกราคม | 14,340 | บาทต่อเกวียน
ข้าวเปลือก | ประเทศไทย | 2545 | กุมภาพันธ์ | 13,625 | บาทต่อเกวียน
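The two tables suggest three operations per row: dictionary lookup for text cells, calendar conversion for the year (Buddhist Era = Common Era + 543), and measurement-unit conversion for the price. The sketch below illustrates that combination. The `price_factor` parameter is hypothetical and bundles the exchange rate with the ton-to-kwian conversion. The value 47.8 is simply the ratio implied by the January row (14,340 / 300). The February row implies a slightly different ratio, so the real conversion is evidently not a single constant.

```python
# Bilingual entries taken from the two tables on the slide.
EN_TH = {
    "Paddy": "ข้าวเปลือก",
    "Thailand": "ประเทศไทย",
    "January": "มกราคม",
    "February": "กุมภาพันธ์",
}
TARGET_UNIT = {"Dollars/Ton": "บาทต่อเกวียน"}  # baht per kwian
BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543


def translate_row(row, price_factor):
    """Translate one English row to Thai.

    `price_factor` (baht per kwian for each dollar per ton) is a hypothetical
    parameter standing in for the system's unit-conversion table.
    """
    category, exporter, year, month, price, unit = row
    return (
        EN_TH[category],
        EN_TH[exporter],
        year + BE_OFFSET,
        EN_TH[month],
        round(price * price_factor),
        TARGET_UNIT[unit],
    )


january = translate_row(
    ("Paddy", "Thailand", 2002, "January", 300, "Dollars/Ton"), 47.8
)
```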
26
Web-based User Interface
To make inquiries about the price history of agricultural products, including chronological and statistical data
27
Output
[Chart: monthly values, January to December; vertical axis 0 to 8; series: corn, soybean, rice]
28
Current State E2T: the system
Parser: shallow parsing
English to Thai
Summarization and translation: frame-based
Text to relational database
29
Parser
Big dog loves small cat.
SL Analysis: [S [np [adj big] [n dog]] [vp [v loves] [np [adj small] [n cat]]]]
Transfer
TL Generation: [S [np [n สุนัข] [adj ใหญ่]] [vp [v รัก] [np [n แมว] [adj เล็ก]]]]
สุนัข ใหญ่ รัก แมว เล็ก
/sulnakh yail rakh määwm lekh/
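The analysis, transfer, and generation stages for this example can be sketched on a small tuple-based tree. This is a toy illustration with an assumed tree encoding and a lexicon covering only the example sentence. The single reordering rule (adjectives follow the noun inside an np) is inferred from the two trees and is not a statement of the system's full rule set.

```python
# Toy lexicon covering only the example sentence.
LEXICON = {"big": "ใหญ่", "dog": "สุนัข", "loves": "รัก", "small": "เล็ก", "cat": "แมว"}

# SL analysis result: (label, child, ...) for phrases, (tag, word) for leaves.
source_tree = (
    "S",
    ("np", ("adj", "big"), ("n", "dog")),
    ("vp", ("v", "loves"), ("np", ("adj", "small"), ("n", "cat"))),
)


def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)


def transfer(node):
    """Translate leaves and move adjectives after the noun inside each np."""
    if is_leaf(node):
        return (node[0], LEXICON[node[1]])
    label, *children = node
    children = [transfer(child) for child in children]
    if label == "np":
        # Stable sort: non-adjectives keep their order, adjectives move last.
        children.sort(key=lambda child: child[0] == "adj")
    return (label, *children)


def generate(node):
    """TL generation: read the leaves from left to right."""
    if is_leaf(node):
        return node[1]
    return " ".join(generate(child) for child in node[1:])


thai = generate(transfer(source_tree))
```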
31
Input and Output
Input characteristics (SL)
– Web pages must be HTML files only
– Web pages displayed in Thai
Output characteristics (TL)
– The system displays output in English by popping up a new window
32
Why Translate Only Tables?
From the survey, agricultural web pages can be divided into 3 types:
– Full text
– Tables with contexts
– Tables only (approx. 50%)
33
Table Characteristics (cont.)
Numeric
Heading (outside table)
Pure texts
Unit
35
Input Format Example
Input as frame format
Department of Internal Trade (DIT)
Office of Agricultural Economics (OAE)
37
System overview
[Diagram components]
Pages
Table Analysis
Chunk-level Translation
Unit Conversion
Output Generation
Output
Dictionary & Grammar Rules
Conversion Table
39
Table Analysis
[Diagram components]
HTML File
HTML Parser
Tag with position anchor
Text with position anchor
40
Position Anchor (Table Analysis)
Letters stand for the position of the data in each cell of a table:
T stands for ‘table’
R stands for ‘row’
C stands for ‘column’
41
Keyword Definition Example (Table Analysis)
(ข้าว = rice, ข้าวโพด = corn, ประเทศไทย = Thailand)
Table 1:
ข้าว | 1999 | 2000
ประเทศไทย | 24,245 | 28,356
Table 2:
ข้าวโพด | 1999
ประเทศไทย | 2,172,000
The result will be:
T1R1C1 ^ ข้าว
T1R1C2 ^ 1999
T1R1C3 ^ 2000
T1R2C1 ^ ประเทศไทย
T1R2C2 ^ 24,245
T1R2C3 ^ 28,356
T2R1C1 ^ ข้าวโพด
T2R1C2 ^ 1999
T2R2C1 ^ ประเทศไทย
T2R2C2 ^ 2,172,000
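The anchoring scheme can be sketched with Python's standard `html.parser`: count tables, rows, and cells as their start tags arrive, and attach the resulting `T…R…C…` label to each cell's text. The class name `AnchorParser` and the sample markup are assumptions made for this illustration. The sketch ignores nested tables and `rowspan`/`colspan`, which a production table analyzer would have to handle.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Label each table cell's text with a T<table>R<row>C<column> anchor."""

    def __init__(self):
        super().__init__()
        self.table = self.row = self.col = 0
        self.in_cell = False
        self.anchors = []  # list of (anchor, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table += 1
            self.row = 0
        elif tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            self.col += 1
            self.in_cell = True
            self.anchors.append((f"T{self.table}R{self.row}C{self.col}", ""))

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            anchor, text = self.anchors[-1]
            self.anchors[-1] = (anchor, (text + " " + data.strip()).strip())


# Sample markup reconstructed from the example tables (an assumption).
page = """
<table><tr><td>ข้าว</td><td>1999</td><td>2000</td></tr>
<tr><td>ประเทศไทย</td><td>24,245</td><td>28,356</td></tr></table>
<table><tr><td>ข้าวโพด</td><td>1999</td></tr>
<tr><td>ประเทศไทย</td><td>2,172,000</td></tr></table>
"""
parser = AnchorParser()
parser.feed(page)
for anchor, text in parser.anchors:
    print(f"{anchor} ^ {text}")
```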
42
Chunk-level Translation
[Diagram components]
Text with Keyword
Phrase Chunker & NE Extraction
Dictionary & Grammar Rules
Translated File
43
Phrase Chunker (cont.) (Chunk-level Translation)
Rules:
np → n+ vp
vp → aux? v n
Example:
ราคา นำเข้า สินค้า
n v n
[np ราคา [vp นำเข้า สินค้า]]
45
Chunk-level Translation (cont.)
Handling named entities (NE)
An NE cannot be translated word by word,
e.g. กองควบคุมพืชและวัสดุการเกษตร
Chunker: AGRICULTURAL PLANT AND MATERIAL CONTROL DIVISION
NE Extraction: AGRICULTURAL REGULATORY DIVISION
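One simple way to give named entities priority is to consult a gazetteer before falling back to ordinary chunk translation. The sketch below assumes such a gazetteer exists. Both `NE_GAZETTEER` and the `fallback` callable are hypothetical stand-ins, and the slides do not describe how NE extraction is actually implemented.

```python
# Hypothetical gazetteer mapping Thai named entities to their established English names.
NE_GAZETTEER = {
    "กองควบคุมพืชและวัสดุการเกษตร": "AGRICULTURAL REGULATORY DIVISION",
}


def translate_chunk(chunk, fallback):
    """Prefer a gazetteer entry; otherwise use the ordinary chunk translator."""
    if chunk in NE_GAZETTEER:
        return NE_GAZETTEER[chunk]
    return fallback(chunk)


# `str.upper` is only a placeholder for the dictionary-and-grammar translator.
official = translate_chunk("กองควบคุมพืชและวัสดุการเกษตร", str.upper)
```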
46
Table Characteristics (Unit Conversion)
Unit outside table
Unit inside table
48
Sentence Generation
Rules:
np → n+ vp
vp → aux? v+ n
Example:
ราคา นำเข้า สินค้า
n v n
[np ราคา [vp นำเข้า สินค้า]]
49
Sentence Generation (cont.)
Transfer rules (Thai ⇒ English):
np → n+ vp ⇒ np → adjp n+
vp → v+ n ⇒ adjp → adj* | np
Example:
[NP ราคา [vp นำเข้า สินค้า]]
[NP [np สินค้า นำเข้า] ราคา]
[NP [np goods importing] ราคา]
[NP [np goods importing] price]
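For this example, the effect of the transfer rules is that the order of constituents is reversed at every level before the words are replaced. The sketch below reproduces only that behaviour on a nested-list encoding. Treating transfer as plain recursive reversal is an assumed simplification, and the three-word lexicon is limited to the example.

```python
# Toy lexicon for the example phrase only.
TH_EN = {"ราคา": "price", "นำเข้า": "importing", "สินค้า": "goods"}


def transfer_np(phrase):
    """Reverse constituent order at every level and translate the leaves.

    A phrase is either a Thai word (str) or a list of sub-phrases.
    """
    if isinstance(phrase, str):
        return TH_EN[phrase]
    return [transfer_np(part) for part in reversed(phrase)]


def flatten(phrase):
    """Join the leaves of a nested phrase into a space-separated string."""
    if isinstance(phrase, str):
        return phrase
    return " ".join(flatten(part) for part in phrase)


thai_np = ["ราคา", ["นำเข้า", "สินค้า"]]  # [NP ราคา [vp นำเข้า สินค้า]]
english = flatten(transfer_np(thai_np))
```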
51
Available Web sites
Department of Internal Trade: http://www.dit.go.th/
Office of the Rubber Replanting Aid Fund: http://www.thailandrubber.thaigov.net/menu5.php
http://www.talaadthai.com/pricebase/default.asp
http://www.rubberthai.com/price/price_index.htm
http://www.thaifruitnews.com/
54
Structure of ML-Dictionary (New version)
Main language: English (vocabulary and POS).
Separate table for each language.
Vocabulary entries that have the same meaning are linked together by an ID attribute.
10 supported languages: Bahasa Indonesia, Chinese, English, French, Italian, Japanese, Korean, Tagalog, Thai and Vietnamese.
UTF-8 character encoding.
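The layout described here (one table per language, entries joined by a shared ID, English as the pivot) can be sketched with SQLite. Table names, column names, and the sample entries (rice, ข้าว, riz) are assumptions for illustration, not the actual ML-Dictionary schema or its contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stores text as UTF-8 by default
conn.executescript("""
    CREATE TABLE english (id INTEGER PRIMARY KEY, word TEXT, pos TEXT);
    CREATE TABLE thai    (id INTEGER REFERENCES english(id), word TEXT);
    CREATE TABLE french  (id INTEGER REFERENCES english(id), word TEXT);
""")
# Illustrative entries only.
conn.execute("INSERT INTO english VALUES (1, 'rice', 'n')")
conn.execute("INSERT INTO thai VALUES (1, 'ข้าว')")
conn.execute("INSERT INTO french VALUES (1, 'riz')")

LANGUAGE_TABLES = {"thai", "french"}


def lookup(word, language):
    """Find the `language` equivalent of an English word via the shared ID."""
    if language not in LANGUAGE_TABLES:  # table names cannot be bound as parameters
        raise ValueError(f"unsupported language: {language}")
    row = conn.execute(
        f"SELECT t.word FROM english e JOIN {language} t ON t.id = e.id "
        "WHERE e.word = ?",
        (word,),
    ).fetchone()
    return row[0] if row else None
```

Keeping each language in its own table means a new language can be added without altering existing tables, and a missing translation shows up as an absent row.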
57
Current result based on FAO stat
English – 23,207 vocabulary entries.
French – 1,482 vocabulary entries.
Thai – 23,097 vocabulary entries.
Vietnamese – 175 vocabulary entries.
Japanese – 108 vocabulary entries.
Bahasa Indonesia – 13 vocabulary entries.
Chinese, Italian, Korean and Tagalog – 0 vocabulary entries.
58
Future work
Web-based Multilingual Active Reading System for Information Exchange
– Language configuration
– Active reading assistant
– Table translator with more multilingual dictionaries