From Code to XLIFF: Bridging the Chasm
Dr. Stephen Flinter, Connect Global Solutions
LRC Conference – 19 November 2003
Agenda
• The XLIFF Transformation Problem
• Current approaches
• Grammar based approach – XPG
• XPG & XML
• Summary
The Problem
• XLIFF has made the representation of resources translation/localisation friendly
• Non-trivial to convert existing files to XLIFF
• Adding new file formats can be painful
XLIFF Transformation
Definition: XLIFF transformation is the process by which native file formats are transformed into XLIFF and, after translation, from XLIFF back into their native format.
File formats include: Java, .properties, XML, HTML, custom.
Architecture
[Diagram: Original Material -> Extract -> Non-Localization Data (Skeleton) + Localization Data (Translation Units) -> Merge -> Translated Material]
.com Business Model
• Parody of the .com business model that has been floating around the web:
– Get lots of users
– ???
– Profit
XLIFF Transformation Model
• The XLIFF transformation model could be described in similar terms:
– Native file format
– ???
– XLIFF
Architecture
[Diagram: Original Material -> Extract -> Non-Localization Data (Skeleton) + Localization Data (Translation Units) -> Merge -> Translated Material, annotated with "Magic happens here" and "A little magic happens here"]
Current Approaches
Current Approaches to XLIFF
• Use XLIFF as native format
• Use commercial tools
• Use regular expressions & scripts
XLIFF as Native Format
• Use XLIFF from software development onwards
• No transformation required
• Preferred approach in the long term
Disadvantages
• Requires significant changes to the software development process
• How to handle legacy resources?
– Back to the original problem
Commercial Tools
• Tool support for XLIFF is improving all the time.
• Advantages of support and expertise of tool developer.
Disadvantages
• However, many tools still only read XLIFF, and won’t generate XLIFF from native formats
• Won’t necessarily support all formats required
• Can be difficult to identify in-line tags
Scripts and Regular Expressions
• Use a scripting language (e.g. perl, python, WordBasic)
• Encode rules to extract translatable resources using regular expressions
Examples
String                      Regular Expression
"Translatable text"         /"([^"]*)"/
id1 = Translatable text     /.* = (.*)/
Advantages
• Superficially simple to develop
• Plenty of powerful RE languages (especially perl) available
• Full control and ownership of how the formats are managed
Disadvantages
• Error prone – difficult to cover all situations
• To remove all errors, often have to add many parsing rules
• Has to be redone for every new file type
• REs have to change to handle inline tags
Other Examples
print("First string");
print("Second" + " string");
print("Third \"string\"");
print("Fourth {0} string");
Summary
This approach is doomed to failure because of the disconnect between the grammar of the language and the regular expressions used to identify strings.
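The disconnect can be made concrete with a small sketch that runs the first pattern from the Examples slide, /"([^"]*)"/, over the four print statements. The class and method names (NaiveRegexExtractor, extract) are hypothetical and exist only for this illustration; they are not part of any tool described in the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of XPG.
public class NaiveRegexExtractor {
    // The slide's pattern /"([^"]*)"/: everything between two double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns every capture of the pattern found in one line of source code.
    static List<String> extract(String sourceLine) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(sourceLine);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] lines = {
            "print(\"First string\");",
            "print(\"Second\" + \" string\");",
            "print(\"Third \\\"string\\\"\");",
            "print(\"Fourth {0} string\");"
        };
        for (String line : lines) {
            System.out.println(line + "  =>  " + extract(line));
        }
    }
}
```

Only the first and fourth lines come back as a single string. The concatenation yields two separate fragments, and the escaped quotes cut the third string short (the pattern captures `Third \` followed by an empty string), because the pattern knows nothing about the language's escape rules.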
Grammar Based Approach
A New Approach
• With this approach, we look at the language grammar (EBNF)
• Identify grammar productions that can hold translatable text
• Generate a parser that accepts instances of the grammar and emits XLIFF
Grammar-based Architecture
[Diagram: Original Material -> Extract -> Non-Localization Data (Skeleton) + Localization Data (Translation Units) -> Merge -> Translated Material]
Grammar-based Architecture
[Diagram: Original Material -> Extract -> Non-Localization Data (Skeleton) + Localization Data (Translation Units) -> Merge -> Translated Material; additional components: Original Grammar, XLIFF Parser Generator]
Architecture
• New component: XLIFF parser generator (XPG)
• Accepts a JavaCC grammar
• Allows one or more productions to be marked as translatable
• Generates the "extract" and "merge" programs
JavaCC
• JavaCC: Java Compiler Compiler
• Modelled after lex & yacc
• Works on EBNF-type grammars rendered as JavaCC .jj files
• JavaCC grammars available for most modern programming languages
Big Win
Direct, one-to-one correspondence between the grammar and the mechanism for identifying strings.
Advantages
• Consistent high quality
– Guaranteed to work in every case, for all instances of the grammar
• Painless
– No scripting/regular expressions required
– Extractor and merger generated automatically
• Fast
– Just need to identify the strings in the grammar
Example
• Extract from Java BNF

<literal> ::= <integer literal> |
              <floating-point literal> |
              <boolean literal> |
              <character literal> |
              <string literal> |
              <null literal>

<string literal> ::= " <string characters>? "

<string characters> ::= <string character> |
                        <string characters> <string character>

<string character> ::= <input character> except " and \ |
                       <escape character>
JavaCC Extract
void Literal() :
{}
{
    <INTEGER_LITERAL> |
    <FLOATING_POINT_LITERAL> |
    <CHARACTER_LITERAL> |
    <STRING_LITERAL> |
    BooleanLiteral() |
    NullLiteral()
}
<STRING_LITERAL>
< STRING_LITERAL:
    "\""
    (   (~["\"","\\","\n","\r"])
      | ("\\"
          ( ["n","t","b","r","f","\\","'","\""]
          | ["0"-"7"] ( ["0"-"7"] )?
          | ["0"-"3"] ["0"-"7"] ["0"-"7"]
          )
        )
    )*
    "\""
>
Identifying <STRING_LITERAL>
• We identify the <STRING_LITERAL> as a language item that may contain strings
• XPG then generates a new grammar, which compiles to the extractor.
• The extractor then generates XLIFF.
Modified JavaCC Grammar
void Literal() :
{}
{
    <INTEGER_LITERAL> |
    <FLOATING_POINT_LITERAL> |
    <CHARACTER_LITERAL> |
    StringLiteral() |
    BooleanLiteral() |
    NullLiteral()
}
StringLiteral()
void StringLiteral() :
{ Token t; }
{
    t = <STRING_LITERAL>
    {
        String s = t.image.substring(1, t.image.length() - 1);
        pw.println("<trans-unit id=\"" + id++ + "\">");
        pw.println("<source>" + s + "</source>");
        pw.println("</trans-unit>");
    }
}
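The action attached to StringLiteral() can also be read as ordinary Java. The following is a minimal sketch of that step, not the code XPG generates: the class TransUnitWriter and its methods are hypothetical names, and the XML escaping is an addition that the slide does not show, included on the assumption that string literals may contain characters such as & or <.

```java
// Hypothetical illustration class; the names are not taken from XPG.
public class TransUnitWriter {
    // Escapes the characters that would break the XML element content.
    // This step is an assumption; the slide prints the string unescaped.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Removes the surrounding double quotes from a string-literal token image,
    // as the slide does with substring(1, length - 1).
    static String stripQuotes(String tokenImage) {
        return tokenImage.substring(1, tokenImage.length() - 1);
    }

    // Builds one trans-unit for a string literal token.
    static String transUnit(int id, String tokenImage) {
        String source = escapeXml(stripQuotes(tokenImage));
        return "<trans-unit id=\"" + id + "\">\n"
             + "<source>" + source + "</source>\n"
             + "</trans-unit>";
    }

    public static void main(String[] args) {
        System.out.println(transUnit(1, "\"Save & exit\""));
    }
}
```

Keeping the formatting in a plain method like this would let the grammar action shrink to a single call, which may make the generated grammar easier to read.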
Other XPG Tasks
• Create XLIFF surrounding tags
• Create skeleton file
• Embed code for handling inline tags
Inline Tags
• Example:
– "Click on the {0} button to start the {1} job"
• The {0} and {1} constitute inline tags
• Not part of grammar itself
• Can vary from application to application
• We must be able to extract these based on regular expressions:
– {[0-9]+}
XPG and Inline Tags
• Embeds code to read a set of regular expressions from a file.
• When the extractor identifies a string:
– Executes RE on string
– Moves matches to XLIFF inline tag
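The two steps above might look like the following sketch. It assumes the {[0-9]+} pattern from the Inline Tags slide (with the braces escaped, as Java's regex syntax requires) and wraps each match in an XLIFF <ph> placeholder element; the talk does not say which inline element XPG emits, so that choice, along with the class and method names, is an assumption made for illustration. The pattern is hard-coded here, whereas XPG is described as reading its patterns from a file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not the code XPG embeds.
public class InlineTagWrapper {
    // The slide's {[0-9]+} pattern, with braces escaped for java.util.regex.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{[0-9]+\\}");

    // Wraps every match in a <ph> element, numbering the elements from 1.
    // XML escaping of the surrounding text is left out to keep the sketch short.
    static String wrapInlineTags(String text) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        int phId = 1;
        while (m.find()) {
            out.append(text, last, m.start());   // text before the match
            out.append("<ph id=\"").append(phId++).append("\">")
               .append(m.group())
               .append("</ph>");
            last = m.end();
        }
        out.append(text, last, text.length());   // text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            wrapInlineTags("Click on the {0} button to start the {1} job"));
    }
}
```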
Final Architecture
[Diagram: Original Material -> Extract -> Non-Localization Data (Skeleton) + Localization Data (Translation Units) -> Merge -> Translated Material; additional components: Original Grammar, XLIFF Parser Generator, Inline Tags Regular Expressions]
XPG & XML
XPG and XML Applications
• A similar approach can be applied to XML Schemas
• Uses XSLT & DOM rather than JavaCC
• Can identify XML tags and attributes that may contain text