CJK Character Validation: Impact from EACC to Unicode Migration. 2006 CEAL Conference, Committee on Technical Processing. Ai-lin Yang, East Asian Library, UC Berkeley. April 5, 2006

TRANSCRIPT

  • Slide 1

CJK Character Validation: Impact from EACC to Unicode Migration
2006 CEAL Conference, Committee on Technical Processing
Ai-lin Yang, East Asian Library, UC Berkeley
April 5, 2006

  • Slide 2

EACC/MARC21 and Unicode
East Asian Character Code (EACC) is MARC-8 CJK in MARC21
Migration to Unicode:
Library of Congress database
RLG's Union Catalog database
OCLC's WorldCat database
CJK bibliographic records are restricted to EACC characters

  • Slide 3

Microsoft IME Variants
Non-MARC21 characters
Duplicate CJK characters (e.g., U+F937 and U+8DEF)
Close variants (e.g., U+6B65 and U+6B69)
Typically one of these variants is a MARC21 character

CJK character validation errors in OCLC
OCLC XWC (Extended WorldCat) in Oracle database is built on Unicode
OCLC online cataloging follows MARC21 standards
CJK scripts are input by using Microsoft Global Input Method Editors (IMEs)
Non-MARC21 characters cause CJK character validation errors

  • Slide 4

OCLC Connexion / IME Online Cataloging Examples

Title: (simplified)
245 (non-Latin) occurrence 1, $a occurrence 1, position 2 - invalid character - data must be valid non-Latin characters
Valid when changed to: (traditional)

Title: (simplified)
245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
Valid when changed to: (traditional)

Title: (traditional)
245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
Valid when changed to: (traditional)

  • Slide 5

OCLC Connexion / IME Online Cataloging Examples

Title: (simplified)
245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
(traditional)
245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
Valid when changed to: (traditional)

Title: only can be found in the traditional list; this character does not exist in the simplified list

  • Slide 6

Solutions
Unihan Database
CJK Compatibility Database
OCLC CJK E-dictionary

  • Slide 7

Unihan Database
http://www.unicode.org/charts/unihan.html
Unihan database index
Unihan grid index
Unihan radical-stroke index

Unihan database information
(I) Several different glyphs for the character
(N) Different representations of the character's scalar value
(N) Mappings to the IRG sources for the character
(I) Mappings to major industrial and national standards and other character collections
(N) Positions in the four dictionaries used by the IRG
(I) Positions in other commonly-used dictionaries
(I) Radical-stroke counts as derived from different sources
(I) Phonetic data derived from various sources
(I) Other dictionary data
(I) Variants (with links to the variant forms)
Compounds containing the character
(I) Other information contained in the Unihan database

  • Slide 8

Unihan Database Search (U+6237)

  • Slide 9

Unihan Database Search (U+6236)

  • Slide 10

CJK Compatibility Database
http://www.loc.gov/ils/cjk_search/cjk_cpso.html
Replace a non-MARC21 character with its MARC21 equivalent

Steps for using the CJK compatibility database
1) Copy the invalid character from your bibliographic record
2) Open the CJK Compatibility Page
3) Paste the invalid character in the white box and use the index "Invalid character"
4) Click "Submit"
5) Copy and paste the valid alternative into your bibliographic record

  • Slide 11

CJK Compatibility Database Search

  • Slide 12

OCLC CJK E-Dictionary

  • Slide 13

OCLC CJK E-Dictionary Search

  • Slide 14

  • Slide 15

CJK Character Validation
Thank you!
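
Editor's sketch (not part of the original slides): the "duplicate CJK characters" case from Slide 3 can be spotted locally before a record is submitted. The example below assumes Python's standard `unicodedata` module and uses Unicode NFC normalization as a stand-in; it is not OCLC's validator and does not consult the MARC21/EACC repertoire. The function name `find_compatibility_ideographs` is hypothetical. It reports 1-based positions, in the same style as the "position 1" wording in the Connexion error messages.

```python
import unicodedata


def find_compatibility_ideographs(text):
    """Return (position, found, suggested) for each CJK compatibility ideograph.

    Assumption: a compatibility ideograph such as U+F937 is a "duplicate"
    whose unified form (here U+8DEF) is what NFC normalization produces.
    Whether the suggested character is valid in MARC21 is NOT checked here.
    """
    findings = []
    for position, char in enumerate(text, start=1):
        normalized = unicodedata.normalize("NFC", char)
        is_compat = "CJK COMPATIBILITY IDEOGRAPH" in unicodedata.name(char, "")
        if normalized != char and is_compat:
            findings.append(
                (position, f"U+{ord(char):04X}", f"U+{ord(normalized):04X}")
            )
    return findings


# U+F937 is the duplicate cited on Slide 3; U+6B69 is one of the close variants.
print(find_compatibility_ideographs("\uF937\u6B69"))
```

Normalization catches only the duplicate pair. Close variants such as U+6B65 and U+6B69 are distinct unified ideographs, so they pass through unchanged and still need a lookup in the Unihan variant data or the CJK Compatibility Database described above.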