locale-aware sorting and text handling in the open toolkit

12
Locale-Aware Sorting and Text Handling in the Open Toolkit Eliot Kimber Contrext DITA OT Day 2017 美丽的话 All translations via Google Translate—may be total nonsense

Upload: contrext-solutions

Post on 28-Jan-2018

76 views

Category:

Software


5 download

TRANSCRIPT

Page 1: Locale-Aware Sorting and Text Handling in the Open Toolkit

Locale-Aware Sorting and Text

Handling in the

Open Toolkit

Eliot KimberContrext

DITA OT Day 2017

大话

功夫

美丽的话

All translations via Google Translate—may be total nonsense

Page 2: Locale-Aware Sorting and Text Handling in the Open Toolkit

About the Author

关于作者• Independent consultant focusing on DITA

analysis, design, and implementation• Doing SGML and XML for cough 30 years cough• Founding member of the DITA Technical

Committee• Founding member of the XML Working Group• Co-editor of HyTime standard (ISO/IEC 10744)• Primary developer and founder of the DITA for

Publishers project• Author of DITA for Practitioners, Vol 1 (XML Press)

DITA OT Day 201710/29/17 2

Page 3: Locale-Aware Sorting and Text Handling in the Open Toolkit

Requirements 要求

• Sort and group Simplified Chinese (and other languages)

• Do rough layout of lines– Requires finding work boundaries

– Requires finding line-break opportunities

• Target applications– Sort and group glossaries, indexes, etc.

– Fixed-layout EPUBs

• Free, open-source solution

10/29/17 DITA OT Day 2017 3

Page 4: Locale-Aware Sorting and Text Handling in the Open Toolkit

Challenge: Simplified Chinese

挑战: 简体中文

• Simplified Chinese collates using pinyin transliteration of words

• All other languages collate based on individual characters

– Alphabetic and syllabic languages

– Traditional Chinese: Radical and stroke count

– Can be done with static per-character configuration

10/29/17 DITA OT Day 2017 4

Page 5: Locale-Aware Sorting and Text Handling in the Open Toolkit

Simplified Chinese Requires a

Dictionary

简体中文需要一个词典• Pinyin transliteration depends on whole word

– Same character may have different transliteration in different words (多音字)

– As if “c” was pronounced “see” by itself but “tee” in “cat”

– E.g.: 行 – xíng or hāng

• Requires a dictionary with pinyin transliteration – Requires ability to find words

• Need word frequency as secondary sort parameter or to disambiguate words– Requires frequency dictionary

10/29/17 DITA OT Day 2017 5

Page 6: Locale-Aware Sorting and Text Handling in the Open Toolkit

Solution 解

• Use CC-CEDICT open-source dictionary– ~155,000.00 entries

• ICU4J for word and line breaking

• Java 2D for text length approximation

• ICU4J for grouping and sorting of other languages

• Integrated with Saxon’s collator Java API

• XSLT functions for use from XSLT

• Configured as DITA OT plugin for use in OT

• Java and XSLT library not OT-specific

10/29/17 DITA OT Day 2017 6

Page 7: Locale-Aware Sorting and Text Handling in the Open Toolkit

Result: DITA Community i18n

Plugin

结果:DITA社区i18n插件• Java: Custom collator for Simplified Chinese

– Implements Saxon RuleBasedCollator– Can be used with Saxon 9.1+ or any Java code– Locale-aware text analysis and metrics

• XSLT function library– Grouping and sorting keys– Word and line breaking – Rendered text length given font details

• General grouping and sorting configuration file mechanism

10/29/17 DITA OT Day 2017 7

Page 8: Locale-Aware Sorting and Text Handling in the Open Toolkit

Under Active Development

积极发展

• More accurate pinyin determination

• Use of word frequency data as secondary sort key

10/29/17 DITA OT Day 2017 8

Page 9: Locale-Aware Sorting and Text Handling in the Open Toolkit

Not Yet Implemented

尚未落实

• Grouping for glossaries

– And other things you might want to group

• Extension/replacement for built-in OT index grouping and sorting

10/29/17 DITA OT Day 2017 9

Page 10: Locale-Aware Sorting and Text Handling in the Open Toolkit

Demo 演示

If time permits

DITA OT Day 201710/29/17 10

Page 11: Locale-Aware Sorting and Text Handling in the Open Toolkit

Questions? 问题

DITA OT Day 201710/29/17 11

Page 12: Locale-Aware Sorting and Text Handling in the Open Toolkit

Resources 资源

• Me: [email protected]

• DITA Community i18n project: https://github.com/dita-community/org.dita-community.i18n

• http://blog.tutorming.com/mandarin-chinese-learning-tips/chinese-characters-with-various-pronunciations

DITA OT Day 201710/29/17 12