Voyager® with Unicode™: A Cataloger’s Session
Connie Braun
Training Consultant
Agenda
Introduction
Your Work Environment
Conversion
New Features
Learning More
Q&A
Release Update
General release occurred October 6, 2004!
4 production partners
• 1 Windows Server, 3 Solaris 8
Test server partners
• 4 Task Force members (large non-Roman collections)
• 1 large consortium with Universal Borrowing & Universal Catalog
• 2 European customers
As of 01/20/05, 71 customers have upgraded and are functioning in a production environment with Voyager with Unicode. Approximately 50 upgrades are scheduled between now and May 2005.
Why Unicode™ in Voyager?
Brings Voyager up to current IT standards
Finds and displays records in the native language
Creates and edits any MARC record using UTF-8
Imports and exports records in any supported character set
Lets operators select a Unicode-compliant font of their choice
Displays Unicode characters in the OPAC without proprietary software
Implementing Voyager with Unicode
For our customers, it’s business as usual, but with some interesting changes and improvements, especially in Cataloging.
Helping everyone to implement a Unicode-compliant system is Endeavor’s aim.
The Unicode standard is an important step towards realizing that goal.
Implementing the Unicode standard is an extension of Endeavor’s original mission: access to information regardless of location or format.
Following Standards
Follows standards (not proprietary)
See http://www.unicode.org for much more detail on these standards.
See http://lcweb.loc.gov/marc/specifications/speccharucs.html for details on LC’s format of MARC records that use Unicode. Voyager follows this specification.
Specifics on the Code Tables may be viewed at http://www.loc.gov/marc/specifications/specchartables.html
The Voyager implementation of the Unicode standard gives libraries and their users greater flexibility when accessing collection materials that contain both Roman and non-Roman text.
Multilingual Input and Display
With improved multilingual input and display capabilities in Voyager, characters now display correctly according to the Unicode and MARC standards.
Greater script coverage for cataloging items in your collections, published in languages around the world.
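How many? The Unicode code space holds 1,114,112 code points (U+0000 through U+10FFFF). The often-quoted figure of 2,147,483,648 comes from the original UTF-8 design, which could address 31-bit values; current UTF-8 is limited to the Unicode range.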
Preview Server
• Anyone interested in trying out Voyager with Unicode before your upgrade? You can!
• http://support.endinfosys.com/cust/voy/upgrade/unicode/testwv_pre.html provides all the details necessary to get you started
• Preview Server uses the Voyager training database that has been augmented with numerous records in both Roman and non-Roman languages
• Try keyword searches: “non roman script japanese”, “non roman script arabic”, “roman script french”, “roman script italian”
Agenda
Introduction
Your Work Environment
• Workstation Requirements
• Setting Up for Languages Other Than English
• Tag Tables
• Session Defaults and Preferences
Conversion
New Features
Q&A
Workstation Requirements
In order to enjoy the full range of benefits, PCs must have up-to-date operating systems and productivity software.
This means that staff PCs will need:
• Windows 2000 or XP operating system
• Unicode standard compliant Internet browser: IE 6+ or Netscape 6+
• Unicode-compliant font: Lucida Sans Unicode or Arial Unicode MS
MS Windows™
Voyager is more integrated with Windows in terms of:
Standard Windows 2000/XP Unicode support
Standard Unicode fonts
Standard input using Input Method Editors (IMEs)
Standard browser support
Setting Up for Languages Other Than English
• Workstations need to be specifically configured to work with languages other than English
• Likely will require technical IT assistance to install needed languages on staff PCs
• Best to install all languages so that cataloger may easily include new ones as necessary
Adding Languages to PCs
• Regional and language options are specific to each PC
• Among options available via Start – Settings – Control Panel
• Details button on Languages tab lets operator view or change languages and methods to enter text
• Can include supplemental language support, too
Choosing Languages
• Languages added to PCs will match languages for items found in your collections
• Add and remove according to your needs; as few or many as necessary
• May also set preferences for language bar and key settings
Tag Tables
MARC Tag Tables have been completely revised and rewritten for Voyager with Unicode
• Ability to modify tag table configuration remains the same as in earlier releases
• But, may not specify anything for Leader position 9 since that byte is now hard-coded to identify records that have been converted to UTF-8
• May want to consider whether or not library will need or want to revise Tag Tables for local use
• See Appendix A of Cataloging User’s Guide for full details on revising, maintaining and updating the Tag Tables
Record Validation
MARC validation
MARC21 character set validation
Authority control validation
Decomposition of accented characters for MARC21
Session Defaults and Preferences: Record Validation
Bypass MARC21 Character set validation
• Uses MARC21 Repertoire.cfg to control validation of the MARC21 character set
• Helps to enforce MARC21 standard
Bypass Decomposition of accented characters for MARC21
• Allows records to be saved to the database without decomposing the characters
• IMPORTANT: If you select this option, MARC21 rules are ignored. We strongly recommend that this check box be un-checked, in order to comply with the MARC21 standard.
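As an illustration of what "decomposition" means here, the sketch below uses Python's standard unicodedata module. It is not Voyager code: Unicode NFD normalization is used as a stand-in for the decomposition Voyager performs, and the function names are hypothetical. MARC21-specific exceptions in Repertoire.cfg are not modeled.

```python
import unicodedata


def decompose(text: str) -> str:
    """Split precomposed characters into base letter + combining mark(s) (Unicode NFD)."""
    return unicodedata.normalize("NFD", text)


def has_precomposed(text: str) -> bool:
    """True if NFD normalization would change the text, i.e. something is still composed."""
    return decompose(text) != text


if __name__ == "__main__":
    word = "Espa\u00f1a"  # precomposed n-tilde (U+00F1)
    print([f"U+{ord(c):04X}" for c in decompose(word)])
```

A record saved with the bypass option checked would be the kind of record for which has_precomposed returns True.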
Session Defaults and Preferences: Mapping Tab
Expected Character Set of Imported Records now has six options
Session Defaults and Preferences: Colors/Fonts Tab
Agenda
Introduction
Your Work Environment
Conversion
• Data Conversion
• Conversion Error Logging
• Conversion Details
• Identifying Non-Unicode Data
• The Rest of Voyager
New Features
Learning More
Q&A
Data Conversion
Conversion process during upgrade treats data differently than when importing records through Cataloging client or via BulkImport
• MARC records are converted from VRLIN (Voyager legacy encoding) to MARC21 compliant UTF-8 encoding
• Leader position 9 becomes an ‘a’
• Conversion log is created
• UTF-8 allows for variable length characters. The majority of characters in the database occupy the same amount of space as before conversion.
Note: All indexes and database columns with MARC data are regenerated after conversion.
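To see why most data keeps its size after conversion, here is a small Python sketch (not part of Voyager; the sample characters are chosen only for illustration) that reports how many bytes individual characters occupy in UTF-8. Plain ASCII letters stay at one byte, while accented letters, combining marks and CJK characters take more.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # ASCII letter, precomposed n-tilde, combining tilde, CJK ideograph.
    for ch in ["a", "\u00f1", "\u0303", "\u5b89"]:
        print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s)")
```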
Conversion Details
IMPORTANT! NO RECORDS ARE LOST
Each field in the record handled individually.
As each field is processed, it may change length, requiring adjustments to the leader and directory of the record.
Records are saved to the database with a leader position 9 = ‘a’.
Both record-level and field-level checking are performed. In rare cases an entire record might fail conversion; it is more likely that an individual field fails to be converted.
Records may not convert if they contain text that cannot be mapped into Unicode according to the standard MARC-8 to Unicode mappings.
Records that do not convert are stored in the database as is, without being converted to Unicode.
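The leader and directory adjustments mentioned above follow from the MARC exchange format: each directory entry stores a field's length and starting position in bytes, so any field whose byte length changes shifts everything after it. The sketch below rebuilds a record from already-converted field text. It is a simplified illustration under stated assumptions, not Voyager's conversion routine; build_record and its argument layout are hypothetical.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"


def build_record(leader: str, fields: list) -> bytes:
    """Assemble a MARC record from a 24-character leader and (tag, content) pairs.

    content holds indicators, subfield delimiters (\\x1f) and data as a str.
    Lengths and offsets are counted in UTF-8 bytes, not characters.
    """
    directory = b""
    body = b""
    for tag, content in fields:
        data = content.encode("utf-8") + FIELD_TERMINATOR
        # Directory entry: 3-char tag, 4-digit field length, 5-digit start offset.
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    base_address = 24 + len(directory) + 1  # leader + directory + its terminator
    record_length = base_address + len(body) + 1  # + record terminator
    new_leader = (
        f"{record_length:05d}"   # positions 0-4: record length
        f"{leader[5:9]}"         # positions 5-8: unchanged
        "a"                      # position 9: 'a' marks UTF-8
        f"{leader[10:12]}"       # positions 10-11: unchanged
        f"{base_address:05d}"    # positions 12-16: base address of data
        f"{leader[17:24]}"       # positions 17-23: unchanged
    )
    return new_leader.encode("ascii") + directory + FIELD_TERMINATOR + body + RECORD_TERMINATOR
```

For example, a 245 field containing "Espan" + combining tilde + "a" is 11 characters long but occupies 12 bytes, and that byte count is what goes into the directory.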
Conversion Error Logging
Libraries need to know the details about the results of the conversion process.
Full error checking and logging is included as part of the upgrade
Technical User’s Guide, Chapter 4
Cataloging User’s Guide, Appendix C
Designated library staff should review the conversion log file to plan for correcting any records that have errors
Sample from Conversion Log File
Conversion Log Details 1
1 2 3 4 5 6 7
# 11 secs read=982 changed=791 880=0 okay=982 errors=0 written=982
# 21 secs read=1931 changed=1558 880=0 okay=1931 errors=0 written=1931
# 29 secs read=2848 changed=2087 880=0 okay=2848 errors=0 written=2848
# 36 secs read=3699 changed=2533 880=0 okay=3699 errors=0 written=3699
# 43 secs read=4607 changed=3076 880=0 okay=4607 errors=0 written=4607
# 51 secs read=5519 changed=3610 880=0 okay=5519 errors=0 written=5519
Legend
1 number of seconds used by job so far
2 read = number of records processed
3 changed = number of records changed
4 880 = how many records contain 880s
5 okay = # records processed successfully
6 errors = # records not processed due to errors
7 written = # records written to the database
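For a long log, the progress lines can be read with a short script. The following Python sketch assumes the lines look exactly like the sample above; the function name parse_progress is hypothetical and the script is not an Endeavor utility.

```python
import re

# One progress line: "# 11 secs read=982 changed=791 880=0 okay=982 errors=0 written=982"
PROGRESS = re.compile(
    r"#\s*(?P<secs>\d+) secs read=(?P<read>\d+) changed=(?P<changed>\d+) "
    r"880=(?P<with_880>\d+) okay=(?P<okay>\d+) errors=(?P<errors>\d+) "
    r"written=(?P<written>\d+)"
)


def parse_progress(line: str):
    """Return the seven counters of a progress line as integers, or None if no match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}
```

A quick check on the last progress line of a log is whether read equals written and errors is 0.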
Conversion Log Details 2
1 2 3 4 5 6 7 8
=bib 6213: [17](700): c->8 loose char page=0 at 20 '091e ..'
=bib 35322: [14](856): c->8 undefined char page=0 at 61 'fc7220486973746f .r Histo'
=bib 35516: [23](856): c->8 no char to combine to page=0 at 82 '1e .'
=================================================================
Legend
1 record type and id
2 index within record of field that generated error
3 tag that generated error
4 c->8 indicates conversion to UTF-8 encoding
5 description of error
6 page = subset to which source character belongs
7 at # = position of source character that caused error
8 hex dump of source character
9, 10 description of error (callouts on the second and third sample lines)
Conversion Log Details 3
loose char: a warning message indicating that a character not strictly part of Voyager encoding has been converted (e.g. unexpected carriage return)
no char to combine to: a warning message indicating that a combining character appeared but it lacks a base character with which to combine (e.g. umlaut but no a, o, u base letter)
undefined char: an error message indicating that there is a single character that cannot be mapped to UTF-8
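To plan cleanup work, the error entries can be sorted into warnings and errors using the three messages just described. This Python sketch assumes the entry layout shown in the sample log; parse_error_line and the SEVERITY table are hypothetical names, and any message not listed above is reported as "unknown" rather than guessed.

```python
import re

# Entry layout: =bib 6213: [17](700): c->8 loose char page=0 at 20 '091e ..'
ERROR_LINE = re.compile(
    r"=(?P<record_type>\w+) (?P<record_id>\d+): "
    r"\[(?P<field_index>\d+)\]\((?P<tag>\d{3})\): "
    r"c->8 (?P<description>.+?) page=(?P<page>\d+) at (?P<position>\d+)"
)

# Severity as described for the three messages in the log.
SEVERITY = {
    "loose char": "warning",
    "no char to combine to": "warning",
    "undefined char": "error",
}


def parse_error_line(line: str):
    """Return the parts of one log entry plus a severity label, or None if no match."""
    match = ERROR_LINE.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["severity"] = SEVERITY.get(entry["description"], "unknown")
    return entry
```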
Identifying non-Unicode data
• To identify a non-Unicode record in the Cataloging client, select a color for Conversion records in Session Defaults and Preferences > Colors-Fonts tab.
• Any non-converted record displays in the color selected in Options/Preferences.
There are other ways to identify records that have conversion errors.
Records that cannot be converted to Unicode are viewable in the Cataloging module with nc (not converted) displayed in the Title Bar.
Any characters that cannot be matched or recognized are replaced with a Unicode substitution character.
Fonts and Unicode
• A MARC record may contain non-Roman characters even though you cannot see them. Records are sure to display correctly if a Unicode-compliant font has been selected.
• Lucida Sans Unicode: installed by default with Windows
• Arial Unicode MS: good choice for libraries with mixed cataloging; included with Microsoft Office and other Microsoft products
The Rest of Voyager
• Non-MARC data is not converted: Acquisitions data, Circulation data (patron info, etc.), Item data
• Reporter: not Unicode standard compliant; translates data to LATIN1; dots appear where you used to see squares
Agenda
AgendaAgenda
Introduction
Your Work Environment
Conversion
New Features
• Cataloging: Diacritics & Special Characters, Importing Records, New Record Views, Search URIs
• WebVoyáge: Browsers, Searching, Displaying
• Interacting with Other Systems
Learning More
Q&A
Diacritic and Special Character Entry
• Cataloging practices: then and now
Pre-Unicode input in Cataloging = accent character (diacritic) precedes the base character. Example: Espa~na
Post-Unicode input in Cataloging = accent character (diacritic) follows the base character. Example: Espan~a
Ability to display combined characters is an improvement over past versions and a way to ensure accurate entry. Example: España
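The change in ordering can be sketched in a few lines of Python. This is an illustration only: it assumes the text is already held as Unicode code points with the combining mark placed before its base letter (real MARC-8 data uses different byte values), and the function name is hypothetical.

```python
import unicodedata


def marks_after_base(text: str) -> str:
    """Move each combining mark from before its base character to after it."""
    result = []
    pending = []  # combining marks waiting for their base character
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            result.append(ch)
            result.extend(pending)
            pending.clear()
    result.extend(pending)  # marks with no following base are kept at the end
    return "".join(result)


if __name__ == "__main__":
    old_order = "Espa\u0303na"           # tilde typed before the n
    new_order = marks_after_base(old_order)
    print(unicodedata.normalize("NFC", new_order))  # combined form for display
```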
SpecialCharacters.cfg
SpecialCharacters.cfg, located in the C:\Voyager\Catalog folder, defines the content of the special character entry dialog box.
Operators may define their most frequently used characters here.
Special Character Entry
This is what the dialog box in Cataloging looks like.
The key press column identifies the keyboard equivalent that may be used instead of turning on Special Character Mode in Cataloging.
Finding Little Used Characters
• For situations where a character not part of the Special Characters list is needed, operator can use Character Map from MS Windows
• Start – Programs – Accessories – System Tools – Character Map
• Locate character or perform search
• Select and Copy character, then paste into position in bib record
Cataloging: Input of Non-Roman Text
Voyager® with Unicode allows Cataloging operators to use all of the standard Microsoft Windows keyboard and input method editors (IMEs).
With this functionality in place, operators may search for, display, and edit the contents of all MARC records using the full range of UTF-8 characters.
The entire JACKPHY group is covered by Unicode and can be encoded in UTF-8, including the right-to-left input needed for Arabic, Persian, Hebrew and Yiddish.
Reminder: JACKPHY = Japanese, Arabic, Chinese, Korean, Persian, Hebrew, Yiddish
Linking in a MARC21 Record
Tag I1 I2 Subfield Data
100 1 ‡6 880-01 ‡a An, Zhen.
245 1 0 ‡6 880-02 ‡a Ri yue yun yan / ‡c An Zhen zhu.
250 ‡6 880-03 ‡a Di 1 ban.
260 ‡6 880-04 ‡a Changchun Shi : ‡b Changchun chu ban she, ‡c 1997.
300 ‡a 4, 2, 291 p. ; ‡c 21 cm.
440 0 ‡6 880-05 ‡a Zhongguo li dai wang chao xing shuai qu shi lu
500 ‡a Non-Roman script – Chinese
651 0 ‡a China ‡x History ‡y Ming dynasty, 1368-1644.
880 1 ‡6 100-01/$1 ‡a 安 震 .
880 1 0 ‡6 245-02/$1 ‡a 日月 云烟 / ‡c 安 震 著 .
880 ‡6 250-03/$1 ‡a 第 1 版 .
880 ‡6 260-04/$1 ‡a 长春市 : ‡b 长春 出版社 ,‡c 1997.
880 0 ‡6 440-05/$1 ‡a 中国 历代 王朝 兴衰 启示录
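The pairing between a regular field and its 880 field is carried entirely by subfield ‡6: the linking tag, a two-digit occurrence number and, in the 880, a script code such as $1 for Chinese, Japanese and Korean. The Python sketch below pairs fields on that basis. The (tag, subfield 6, text) input layout and the function names are hypothetical simplifications for illustration.

```python
import re

# Subfield 6 layout: linking tag, hyphen, two-digit occurrence number,
# then optional "/script code" and optional "/r" (right-to-left orientation).
LINKAGE = re.compile(
    r"^(?P<tag>\d{3})-(?P<occurrence>\d{2})"
    r"(?:/(?P<script>[^/]+))?(?:/(?P<orientation>r))?$"
)


def parse_linkage(subfield_6: str):
    """Split a subfield 6 value into its parts, or return None if it is malformed."""
    match = LINKAGE.match(subfield_6.strip())
    return match.groupdict() if match else None


def pair_fields(fields):
    """Map (regular tag, occurrence) to (romanized text, original-script text)."""
    regular, alternate = {}, {}
    for tag, subfield_6, text in fields:
        link = parse_linkage(subfield_6)
        if link is None:
            continue
        if tag == "880":
            # In an 880, the linkage names the regular field it belongs to.
            alternate[(link["tag"], link["occurrence"])] = text
        else:
            regular[(tag, link["occurrence"])] = text
    return {key: (text, alternate.get(key)) for key, text in regular.items()}
```

With the record above, the 100 field with ‡6 880-01 pairs with the 880 field whose ‡6 is 100-01/$1.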
Using On-Screen Keyboard
Typically, the path is Start—Programs—Accessories—Accessibility—On-Screen Keyboard
Importing Records
• Conversion process is separate and distinct from the process of importing records
• Important distinction for operators who import records through the Cataloging client or via BulkImport
• Expected character set needs to be accurately identified if records are to be imported correctly
• Some experimentation may be necessary to determine the correct character set
• Let’s look at some details to help everyone understand what is happening
Record Exchange Scenarios
Voyager 2001.2 and earlier
• In Voyager 2001.2 and earlier, there were several options from which to choose regarding the character set:
• Latin1
• OCLC
• RLIN legacy
• MARC21 MARC8
• Until now it has been quite simple to choose the correct option when importing records through the Cataloging client or processing large numbers of records through BulkImport.
After Upgrade to Voyager 2003.1
• From Voyager 2003.1 forward, there are numerous options from which to choose regarding the character set:
• Latin1 (non-Unicode)
• MARC21 MARC8 (non-Unicode)
• MARC21 UTF8
• OCLC (non-Unicode)
• RLIN legacy (non-Unicode)
• Voyager legacy (non-Unicode)
• With Voyager 2003.1 and beyond, it is very important to determine the character set of records before importing records through the Cataloging client or processing large numbers of records through BulkImport. Some experimentation may be necessary.
• * transition to MARC21 UTF8 occurs as Unicode standard becomes pervasive
One Year From Now
• In Voyager 2003.1 and beyond, numerous options for character sets will continue to be needed:
• Latin1 (non-Unicode)
• MARC21 MARC8 (non-Unicode)
• MARC21 UTF8
• OCLC (non-Unicode)
• RLIN legacy (non-Unicode)
• Voyager legacy (non-Unicode)
• But, the Unicode standard will be much more pervasive, having been adopted and deployed by bibliographic utilities, vendors who massage records, vendors who supply records, and others.
• This means that selecting the correct option will again be simpler, even though knowing the character sets will continue to be very important.
Bulk Import
• Bulk Import of MARC Records
Fundamentally the same as before
Leader byte 9 is checked against the incoming character set identified in the import rule.
• Blank = non-Unicode™; converted & imported
• ‘a’ = Unicode™; imported
• Neither Blank nor ‘a’; errors out – not imported
See log.imp.yyyymmdd for details on import success
Records that cannot be converted are not imported; found in err.imp.yyyymmdd
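The three outcomes for leader byte 9 can be expressed as a simple check. This Python sketch only mirrors the rule as stated on the slide; the function name and the returned labels are hypothetical, and it does not reproduce the comparison against the import rule's expected character set.

```python
def classify_leader_byte_9(record: bytes) -> str:
    """Decide how a raw MARC record would be treated based on leader position 9."""
    coding = record[9:10]
    if coding == b" ":
        return "convert-then-import"  # blank: non-Unicode record
    if coding == b"a":
        return "import"               # 'a': already Unicode (UTF-8)
    return "reject"                   # anything else errors out
```

Running a check like this over a file before loading it gives an early indication of how many records would end up in the error file.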
Bulk Import and Expected Character Set
Character set mapping for Bulk Import is designated in the Bulk Import rule in SysAdmin > Cataloging > Bulk Import Rules.
MARC Export
Default export character set is MARC21 UTF-8
Use the -a option on the command line to choose a different character set. See page 10-8 in Technical User’s Guide for more detail
LATIN1 records will get a dot exported for characters outside the LATIN1 character set
If mapping for a composed character is not found, it decomposes and Voyager® attempts to find a match for each part.
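The fallback described in the last two points can be pictured with the following Python sketch. It is an approximation under stated assumptions, not Voyager's export code: Unicode NFD stands in for the decomposition step, and the function name is hypothetical.

```python
import unicodedata


def to_latin1_with_dots(text: str) -> bytes:
    """Encode text as Latin-1, writing a dot for anything that cannot be mapped."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("latin-1")  # direct mapping exists
            continue
        except UnicodeEncodeError:
            pass
        # No direct mapping: decompose and try each part separately.
        for part in unicodedata.normalize("NFD", ch):
            try:
                out += part.encode("latin-1")
            except UnicodeEncodeError:
                out += b"."
    return bytes(out)
```

In this sketch a precomposed n-tilde maps directly, a CJK character becomes a single dot, and a letter such as n with acute (not in Latin-1) becomes the base letter followed by a dot for the unmapped accent.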
New ISBN Indexes
For improved duplicate detection:
New ISBN Index
• 020N: 020a, number only
• 020R: 020z, number only
020 |a 1234567890 (Knopf)
020 |a 1234567890
Check Bibliographic and Authority duplicate detection profiles in System Administration!
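The idea behind a number-only index is that the two 020 fields shown above produce the same index entry, so they match during duplicate detection. A minimal Python sketch of such a normalization follows; how Voyager actually builds the 020N and 020R indexes is not described in this session, so the function and its hyphen handling are assumptions.

```python
import re


def isbn_number_only(subfield_value: str) -> str:
    """Keep only the leading ISBN characters, dropping qualifiers such as (Knopf)."""
    match = re.match(r"\s*([0-9Xx-]+)", subfield_value)
    if match is None:
        return ""
    # Assumption: hyphens are ignored and a check digit 'x' is uppercased.
    return match.group(1).replace("-", "").upper()
```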
HTTP Posting
Much easier access to WebVoyáge display from clients
Available in Cataloging, Acquisitions & Circulation
Toggle record view from staff client to WebVoyáge
• Record menu in Cataloging contains a Send Record to option
• Send Record To: WebVoyáge
LinkFinderPlus available in Cataloging, Acquisitions & Circulation
• Record menu in Cataloging contains a Send Record to option
• Send Record To: LinkFinderPlus
Configured in voyager.ini file [MARC POSTing] stanza
Enabling HTTP Posting
To enable HTTP posting, a stanza is added to the voyager.ini file. An example is shown below.
[MARC POSTing]
WebVoyage="http://train20031-c1db.comet.endinfosys.com/cgi-bin/Pbibredirect.cgi"
LinkfinderPlus="http://207.56.64.116/cgi-bin/Phttplinkresolver.cgi"
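When rolling the stanza out to many PCs, it can help to check that it reads back as intended. The Python sketch below parses the text of such a stanza with the standard configparser module. The example.edu URL and the function name are placeholders, and a complete voyager.ini may not parse this way (for instance, repeated keys in other stanzas would be rejected), so treat it as a check on the stanza text only.

```python
import configparser

# Placeholder stanza for illustration; substitute your own server address.
SAMPLE_STANZA = """
[MARC POSTing]
WebVoyage="http://opac.example.edu/cgi-bin/Pbibredirect.cgi"
"""


def read_posting_targets(ini_text: str) -> dict:
    """Return the targets defined in the [MARC POSTing] stanza, without quotes."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(ini_text)
    if not parser.has_section("MARC POSTing"):
        return {}
    return {key: value.strip('"') for key, value in parser.items("MARC POSTing")}
```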
Easier Access to OPAC Display
• Send Record To… in Cataloging
• Send Record To… in Acquisitions
Search URI
• Staff Client Search URI in Cataloging, Circulation and Acquisitions
Drive searches to resources on the web
Add new button to search interface in staff clients
Click button…a browser is opened & search is executed
This is PC specific (voyager.ini)
Possible applications
• Link to another OPAC
• Link to one of your vendors
• Link to an online book seller
Presenting Search URI
Staff client search URI
Available in Cataloging, Circulation, and Acquisitions
Adding Search URIs
Clipped from voyager.ini:
[SearchURI]
Name=Google
URI=http://www.google.com
Copy=Y
SearchSyntax=/search?&q=<searchtext>
#Name=Barnes&Noble
#URI=http://search.barnesandnoble.com
#Copy=Y
#SearchSyntax=/booksearch/results.asp?WRD=<searchtext>
#Name=Gale Group
#URI=http://www.galegroup.com
#Copy=Y
#SearchSyntax=/servlet/SearchPageServlet?region=9&imprint=<searchtext>
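When testing a new entry, it helps to know what address the browser should receive: the URI value followed by the SearchSyntax value, with <searchtext> replaced by the search terms. The Python sketch below builds that address. The URL encoding of the search terms (UTF-8, spaces as +) is an assumption, since this session does not say how the staff clients encode them, and the function name is hypothetical.

```python
from urllib.parse import quote_plus


def build_search_url(uri: str, search_syntax: str, search_text: str) -> str:
    """Join URI and SearchSyntax, substituting the encoded search terms."""
    return uri + search_syntax.replace("<searchtext>", quote_plus(search_text))


if __name__ == "__main__":
    print(build_search_url("http://www.google.com", "/search?&q=<searchtext>", "ri yue yun yan"))
```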
WebVoyáge and Unicode
• MARC data supplied to the browser in UTF-8
IE 6+ generally displays Unicode characters correctly. Some characters do not display correctly unless a Unicode-compliant font is selected.
Netscape 6+ detects that it needs to display Unicode characters without any special settings
Consider new help text in your OPAC to help patrons understand about language options, especially if there are records using different languages in your database
• New UTF-8 download/save format
Searching in WebVoyáge
Search and display in native languages for staff and users.
WebVoyáge and Cataloging allow Unicode character input; you can search for and retrieve records in native languages.
Record display includes non-Latin scripts, including right-to-left scripts like Arabic and Hebrew. Voyager takes advantage of the web browser’s native rendering support.
Records with Other Languages in the OPAC
Displaying Records in WebVoyáge
Interacting with Other Systems
• Incoming Z39.50 Connections
Records in Unicode databases are UTF8 encoded
z3950svr may send MARC8-encoded records, UTF8-encoded records, or both
Default is set to send MARC8 encoded records
But, two different z3950svr ports can be configured to provide records in both formats, thereby accommodating all sites connecting to database
Interacting with Other Systems
• Outgoing Z39.50 Connections
Retrieves and displays records of any type in UTF-8
Converts incoming records based on new Database Definitions setting in System Administration called ‘Source Character Set’:
• Latin1 (non Unicode™)
• MARC21 MARC8 (non Unicode™)
• MARC21 UTF8
• OCLC (non Unicode™)
• RLIN legacy (non Unicode™)
• Voyager® legacy (non Unicode™)
Agenda
Introduction
Your Work Environment
Conversion
New Features
Learning More
Final Q&A
If you want to know more about…
Coded Character Sets - EndUser 2004: Session 29
Title: Coded Character Sets: A Technical Primer for Librarians
Presenters: Michael Doran, Systems Librarian, University of Texas at Arlington; Dan Sweeney, Business Analyst II, Endeavor Information Systems
Great Website: http://rocky.uta.edu/doran/charsets/
Strategies and Tools for Cleaning Up Your Data - EndUser 2004: Session 45
Title: Transitioning To Unicode: Strategies for Tidying Your Data
Presenters: Fran Budde, Acquisitions & Cataloging Specialist, Pacific Lutheran University; Francesca Lane Rasmus, Director, Technical Services, Pacific Lutheran University; Layne Nordgren, Director of Instructional Technologies/Library Systems, Pacific Lutheran University
Special Character Input/Issues – EndUser 2004: Session 65
Title: Why Unicode?
Presenter: Martin Heijdra, Chinese Bibliographer/Head of Public Services, East Asian Library, Princeton University
Preparing for Unicode Conversion & Cataloging Issues – EndUser 2004: Session 74
Title: Unicode Conversion at the Library of Congress
Presenter: Ann Della Porta, Assistant Coordinator, Integrated Systems Office, Library of Congress
SupportWeb: KnowledgeBase, EndUser archives
http://support.endinfosys.com/cust/index.html
880 – Alternate Graphic Representation (R)
http://www.loc.gov/marc/bibliographic/ecbdhold.html#mrcb880
OCLC Character Sets
http://www.oclc.org/support/documentation/worldcat/records/subscription/5/5.pdf
Original Scripts in RLG Databases
http://www.rlg.org/origscripts.html
MARC 21 Concise Bibliographic: Control Subfields
http://www.loc.gov/marc/bibliographic/ecbdcntf.html
MARC 21 Concise Bibliographic: Multiscript Records
http://www.loc.gov/marc/bibliographic/ecbdmulti.html
Thank you!