A Method for Enhancing Search Using Transliteration of Mandarin Chinese
Vijay John...
Post on 21-Dec-2015
Transliterated Mandarin Search
Google suggests spelling correction
Alternate Transliterations?
Want to say “Did you mean Peiching?”
Transliteration Problems
• “Beijing” provides many results
• Google doesn’t find “Peiching,” “Peking,” “Bukgyeong,” etc.
• Many pages using variety of transliterations
• Transliterations unorganized
• This paper organizes for Mandarin Chinese
The Problem (Cont’d)
• Why variety of transliterations?
• Web content: 82% Romanized
• Majority’s native languages: other scripts
• Standard keyboards
• Non-Romanized sources normally transliterated (esp. on Web)
• Transliteration variations
Example 1: Tibetan
• Four languages: transliteration problems
• Hello in Tibetan
• Wylie (bkra shis bde legs)
• Tibetan Pinyin
• Several unofficial systems based on pronunciation
• Spelled/transcribed in several ways (with some guidelines)
Example 2: Malayalam
• No official transliteration system
• Transliteration based on personal preference (many unorganized variations)
• Script conversion programs: more consistent systems
• /maleja:m/ usu. transcribed “Malayalam”
• malayaaLam (Maya), Malajal- (Slavic)
Example 3: Romani
• Vlax Romani standard
• Literacy → few adopt standard
• Different countries, different official languages → different spellings
• No official systems (government)
• Several transliteration systems exist (often inconsistent)—as in last 2 languages
Example 4: Mandarin
• Hànyŭ Pīnyīn
• Tōngyòng Pīnyīn
• Wade-Giles
• Gwoyeu Romatzyh
• (Yóuzhèngshì Pīnyīn) (etc.)
Prior Work
• In Mandarin: geared towards Chinese users searching for information from West
• Western names-Hànzĭ-Hànyŭ Pīnyīn-Hànzĭ
• Algorithms designed for Arabic & Japanese transliteration
• Google
• This method designed for Western users searching for Chinese information
Initial Effort on Mandarin
• Practical first step: increased trade with China
• Simple transliteration problem (relatively)
• Modifications for Tibetan, Romani, Hindustani, etc.
• Intact for some other languages? (e.g. Russian, Arabic, Japanese, Korean)
• Input = Hànyŭ Pīnyīn; output = other systems
Initial Program
• Combined many systems
• Ying – yink – yenk – yenk’ – yemk’ – yermk’ – yarmk’
• Instead of “victory,” searched for “Yarmuk” River in Middle East
• Transliteration systems organized by row but not by column
Organize into Transliteration Table
Entries for “beijing” in two systems
(Purpose is to go from one column to another)
   Hanyu Pinyin   Wade-Giles
1  b              p
2  ei             ei
3  j              ch
4  ing            ing
Part of Patterns Table (8 systems)
HP   TP1   TP2   MHP1  MHP2  MHP3  WG1    WG2
ci   cih   cih   ci    ci    ci    tz'u   tz'u
si   sih   sih   si    si    si    szu    ssu
zi   zih   zih   zi    zi    zi    tzu    tzu
ju   jyu   jyu   ju    ju    ju    chü    chü
qu   cyu   cyu   qu    qu    qu    ch'ü   ch'ü
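Going "from one column to another" in the patterns table can be sketched as a simple array lookup. The rows and column labels below are copied from the slide; the class name `PatternsTable` and the method `convert` are hypothetical names chosen for illustration, not taken from the author's program.

```java
import java.util.Arrays;
import java.util.List;

public class PatternsTable {
    // Column labels as shown on the slide (8 systems).
    static final List<String> SYSTEMS =
        Arrays.asList("HP", "TP1", "TP2", "MHP1", "MHP2", "MHP3", "WG1", "WG2");

    // Rows copied from the slide; column 0 is Hanyu Pinyin.
    static final String[][] ROWS = {
        {"ci", "cih", "cih", "ci", "ci", "ci", "tz'u", "tz'u"},
        {"si", "sih", "sih", "si", "si", "si", "szu", "ssu"},
        {"zi", "zih", "zih", "zi", "zi", "zi", "tzu", "tzu"},
        {"ju", "jyu", "jyu", "ju", "ju", "ju", "chü", "chü"},
        {"qu", "cyu", "cyu", "qu", "qu", "qu", "ch'ü", "ch'ü"},
    };

    // Hypothetical helper: find the row by its Hanyu Pinyin entry,
    // then read the cell in the requested system's column.
    public static String convert(String pinyin, String system) {
        int column = SYSTEMS.indexOf(system);
        if (column < 0) {
            throw new IllegalArgumentException("Unknown system: " + system);
        }
        for (String[] row : ROWS) {
            if (row[0].equals(pinyin)) {
                return row[column];
            }
        }
        throw new IllegalArgumentException("No row for: " + pinyin);
    }

    public static void main(String[] args) {
        System.out.println(convert("si", "WG1")); // szu
        System.out.println(convert("si", "WG2")); // ssu
    }
}
```

Keeping Hanyu Pinyin in column 0 matches the slides, where the input is always Hanyu Pinyin and the output is one of the other systems.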
Decomposition
• Search for “Beijing” in table
• Delete one letter; search for “Beijin”
• Beiji, Beij…B
• Search for “eijing” (beijing – b) similarly
• Ei found, search for “jing”
• “J” found, search for “ing”
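The decomposition steps above amount to a greedy longest-prefix match: try the whole remaining string, drop one trailing letter at a time until a table entry matches, then repeat on what is left. The following is a minimal sketch under that reading; the class `Decomposer`, the method `decompose`, and the four-entry `COMPONENTS` set are hypothetical and stand in for the full first column of the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class Decomposer {
    // Assumed stand-in for the table's Hanyu Pinyin column; only the
    // entries needed for "beijing" are listed here.
    static final Set<String> COMPONENTS = Set.of("b", "ei", "j", "ing");

    // Greedy longest-prefix decomposition of a Hanyu Pinyin word.
    public static List<String> decompose(String word, Set<String> components) {
        List<String> parts = new ArrayList<>();
        String rest = word.toLowerCase(Locale.ROOT);
        while (!rest.isEmpty()) {
            int end = rest.length();
            // Delete one trailing letter at a time until the prefix is in the table.
            while (end > 0 && !components.contains(rest.substring(0, end))) {
                end--;
            }
            if (end == 0) {
                throw new IllegalArgumentException("No table entry matches: " + rest);
            }
            parts.add(rest.substring(0, end));
            // Continue with the remainder, e.g. "eijing" after "b" is found.
            rest = rest.substring(end);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(decompose("Beijing", COMPONENTS)); // [b, ei, j, ing]
    }
}
```

With a fuller table the greedy choice matters: a longer entry such as "jing", if present, would be matched before "j". The slides do not say how the original program handles such cases.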
Composing new search terms
• Components: b, ei, j, ing
• b → b, p
• ei → ei
• j → j, ch
• ing → ing
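Composition can be sketched as a Cartesian product over the alternatives for each component. This is an illustration under that assumption, not the author's code; `Composer` and `compose` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class Composer {
    // Pick one alternative per component, in order, and join them
    // into every possible candidate spelling.
    public static List<String> compose(List<List<String>> alternatives) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (List<String> options : alternatives) {
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String option : options) {
                    next.add(prefix + option);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // Alternatives for b, ei, j, ing as listed on the slide.
        List<List<String>> beijing = List.of(
            List.of("b", "p"),
            List.of("ei"),
            List.of("j", "ch"),
            List.of("ing"));
        System.out.println(compose(beijing));
        // [beijing, beiching, peijing, peiching]
    }
}
```

A plain product also yields mixed forms such as "beiching" and "peijing" that belong to no single system. Whether the original program keeps or filters such forms is not stated in the slides; selecting all components from the same table column would avoid them.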
Implementation
• Java program
• After composition, how does algorithm search?
• Connects to Google via Google API (Application Programming Interface)
• Google searches
• 1-2 second delay (due to Google)
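The slides do not show how the composed terms are passed to Google. One plausible approach, offered here purely as an assumption, is to join the candidate spellings into a single query using the search engine's `OR` operator. The sketch only builds the query string and makes no API call; `QueryBuilder` and `buildQuery` are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;

public class QueryBuilder {
    // Quote each spelling so it is matched as written, then join with OR.
    public static String buildQuery(List<String> spellings) {
        return spellings.stream()
            .map(s -> "\"" + s + "\"")
            .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(List.of("beijing", "peiching")));
        // "beijing" OR "peiching"
    }
}
```

Sending one combined query rather than one query per spelling would keep the number of round trips low, which matters given the 1-2 second delay per search noted above.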
Transliteration Patterns
• Transliterations organized into table
• {"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"}
• lüe, lyue, lue, lve, lüeh
• 3 transliteration systems; at most 5 patterns
• First column Hànyŭ Pīnyīn like “ing” “b” “ei”
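The "at most 5 patterns" figure follows from removing duplicates within a row: the eight-column row for "üe" holds only five distinct spellings. A minimal sketch, with the row copied from the slide and the hypothetical names `PatternRow` and `distinctSpellings`:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PatternRow {
    // Row copied from the slide; column 0 is Hanyu Pinyin.
    static final String[] UE_ROW =
        {"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"};

    // Attach an initial to every pattern in the row and drop duplicates,
    // keeping the first occurrence of each spelling in column order.
    public static List<String> distinctSpellings(String initial, String[] row) {
        Set<String> seen = new LinkedHashSet<>();
        for (String pattern : row) {
            seen.add(initial + pattern);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(distinctSpellings("l", UE_ROW));
        // [lüe, lyue, lue, lve, lüeh]
    }
}
```

A `LinkedHashSet` is used so that the Hanyu Pinyin form stays first and the remaining spellings keep the table's column order.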
Transliteration Systems By Column
• Only 3 systems (in effect)
• Hànyŭ Pīnyīn (HP)
• Tōngyòng Pīnyīn #1 (TP1) & Tōngyòng Pīnyīn #2 (TP2)
• Modified Hànyŭ Pīnyīn #1 (MHP1) & Modified Hànyŭ Pīnyīn #2 (MHP2)
• Wade-Giles #1 (WG1), Wade-Giles #2 (WG2), & Wade-Giles #3 (WG3)
Differences Between Transliteration System Variants
• TP1- iu, ui, ‘
• TP2- iou, uei, -
• WG2- h’ung (not hung)
• WG3- ts’u (not tz’u)
• WG1- szu (not ssu)
Web version
http://www.translitsearch.com/demos/demos.htm
Web search
What is the effect?
• Search for 130 Pinyin cities/regions
• 16 – no other transliteration
• 60 – at least two others
• 6 – three or more
• How much did Xiaozhi find? (8% more)
• 5 min. 12 sec. – entire search
Further work 1
• Include Yale, GR (Gwoyeu Romatzyh), &c.
• YZSPY (Yóuzhèngshì Pīnyīn)
• Accents
• Hanja- and Kanji-based transliterations
• Application to research archives
Further Work 2
• Improvements in accuracy of transliteration
• Search in other transliterations
• Japanese version of current paper
• Hindustani version
• Romani with Indic cognates
• Extension to translation (transliterated Mandarin-Cantonese characters)
Solutions for Tibetan
• Start with Wylie
• Xiaozhi with adjustments
• Dzongkha
• Dzongkha-based variations?
• Analysis of common transliteration patterns (usu. based on closest pronunciation)
Solutions for Malayalam
• Start with Maya (script conversion program)
• Include minor variations from other script conversion programs
• Analysis of transliterations used
Solutions for Romani
• Start with Vlax Romani Standard
• Regional variations
• Some transliterations easier to use on computers
• e.g. chh, sh to omit hacek
Conclusions
• Enhances search by finding alternate transliterations
  – Applied to Mandarin
  – Applicable to other languages
• Applicable to lesser-studied (& other) languages
• Language- (or script-) specific