Automating Content Analysis with Trang and Simple XSLT Scripts. Bob DuCharme, XML 2008, December 9, 2008.
Improve the way you create, manage and distribute information
www.innodata-isogen.com
INNOVATION INSPIRATION
Automating Content Analysis with Trang and Simple XSLT Scripts
Bob DuCharme
XML 2008
December 9, 2008
What We Do
We help companies lower the cost of
creating and managing information.
About me
• Solutions Architect, Innodata Isogen
• weblog:
http://www.snee.com/bobdc.blog
• other writing:
See http://www.snee.com/bob
• URLs referenced today:
http://www.snee.com/xml/xml2008
Single source publishing and “editorial” XML
[Diagram: Input1 through Input5, ProcessA through ProcessF, Editorial Master (XML), and Output1 through Output3]
Content analysis: why?
• You’ve “inherited” some content
• Convert to your current editorial format
• Convert it to new output formats
• Efficient development of efficient conversion routines
Handy tool 1 before we get to the XML parts: sort
• colors.txt:
red
green
blue
green
blue
blue
red
$ sort colors.txt
blue
blue
blue
green
green
red
red
Handy tool 2 before we get to the XML parts: uniq
$ sort colors.txt | uniq -c
3 blue
2 green
2 red
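The two tools combine into a small reusable tally. The sketch below is an illustration only, not part of the talk: the function name `tally` is hypothetical, and the trailing `sort -n` and `awk` steps are added just to order the counts numerically and to normalize the padding that `uniq -c` puts in front of each count.

```shell
#!/bin/sh
# tally: count identical lines on stdin, least frequent first.
# (hypothetical helper name; not from the original slides)
tally() {
  sort | uniq -c | sort -n | awk '{ print $1 "\t" $2 }'
}

# Feed it the same seven colors as colors.txt on the slide.
printf '%s\n' red green blue green blue blue red | tally
```

The last line of the output is the most frequent value, here `blue` with a count of 3.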
Sample data
trang
From http://www.thaiopensource.com/relaxng/trang.html:
Trang converts between different schema languages for XML. It supports the following languages:
• RELAX NG (XML syntax)
• RELAX NG compact syntax
• XML 1.0 DTDs
• W3C XML Schema
A schema written in any of the supported schema languages can be converted into any of the other supported schema languages, except that W3C XML Schema is supported for output only, not for input.
Trang can also infer a schema from one or more example XML documents.
trang
Trang can also infer a schema from one or more example XML documents!!!!!
Analyzing content with trang
<?xml version="1.0" encoding="UTF-8" ?>
<whatever>
  <somedoc>Here is one document</somedoc>
  <somedoc>Here is another</somedoc>
  <somedoc>Here is another</somedoc>
  <somedoc>Here is another</somedoc>
</whatever>
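One way to build such a combined file from the shell is sketched below. It is only an illustration under stated assumptions: the wrapper element name `whatever` follows the slide, the helper name `wrap_docs` and the sample file names are hypothetical, and the `sed` expression assumes each XML declaration sits at the start of a line in its file.

```shell
#!/bin/sh
# wrap_docs: concatenate XML files inside one wrapper element so that
# they can be handed to a tool as a single example document.
# Each file's own XML declaration is dropped, because a declaration
# is only allowed at the very start of a document.
wrap_docs() {
  printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>'
  printf '%s\n' '<whatever>'
  for f in "$@"; do
    sed 's/^<?xml [^>]*?>//' "$f"
  done
  printf '%s\n' '</whatever>'
}

# Two tiny sample documents (hypothetical names and content).
dir=$(mktemp -d)
printf '%s\n' '<?xml version="1.0" encoding="UTF-8" ?>' \
  '<somedoc>Here is one document</somedoc>' > "$dir/doc1.xml"
printf '%s\n' '<somedoc>Here is another</somedoc>' > "$dir/doc2.xml"

wrap_docs "$dir/doc1.xml" "$dir/doc2.xml" > "$dir/combined.xml"
cat "$dir/combined.xml"
```

The combined file keeps exactly one declaration, at the top, and every original root element becomes a child of the wrapper.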
Create RELAX NG versions of …
• Elsevier article DTD:
trang art510.dtd art510.rng
• Combined sample content:
trang issueContents.xml issueContents.rng
• Compare results:
saxon art510.rng compareElsRNG.xsl | sort > compareElsRNG.out
compareElsRNG.xsl (1 of 2)
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://relaxng.org/ns/structure/1.0">
  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>
  <xsl:variable name="schema" select="document('issueContents.rng')"/>
  <xsl:template match="text()"/>
compareElsRNG.xsl (2 of 2)
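  <xsl:template match="r:element">
    <xsl:variable name="name" select="@name"/>
    <xsl:choose>
      <!-- Is an element of this name also in the schema inferred from the content? -->
      <xsl:when test="$schema/r:grammar//r:element/@name[. = $name]">
        <xsl:text>Yes: </xsl:text>
        <xsl:value-of select="$name"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>No: </xsl:text>
        <xsl:value-of select="$name"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:apply-templates/>
  </xsl:template>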
</xsl:stylesheet>
compareElsRNG.xsl: some sample output
No: tb:colspec
No: tb:left-border
No: tb:right-border
No: tb:top-border
Yes: aid
Yes: article
Yes: body
Yes: ce:abstract
Yes: ce:abstract-sec
Yes: ce:acknowledgment
Yes: ce:affiliation
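A quick tally of the two prefixes shows how much of the DTD the sample content exercises. The following sketch assumes only that each line of compareElsRNG.out starts with `Yes:` or `No:`, as in the excerpt above; the helper name `coverage` is hypothetical, and the input here is just that excerpt, not the full file.

```shell
#!/bin/sh
# coverage: count how many declared elements were (Yes) or were not (No)
# found in the schema inferred from the sample content.
coverage() {
  cut -d: -f1 | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# The excerpt from the slide stands in for compareElsRNG.out.
printf '%s\n' \
  'No: tb:colspec' 'No: tb:left-border' 'No: tb:right-border' \
  'No: tb:top-border' 'Yes: aid' 'Yes: article' 'Yes: body' \
  'Yes: ce:abstract' 'Yes: ce:abstract-sec' 'Yes: ce:acknowledgment' \
  'Yes: ce:affiliation' | coverage
```

For the eleven excerpt lines this reports 4 under `No` and 7 under `Yes`.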
Analyzing the XML itself
• Or SGML, after using James Clark’s sx:
sx -f err.out -x lower myfile.sgm > myfile.xml
Counting elements: countElements.xsl
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>
  <xsl:template match="text()"/>
  <!-- Output each element's name on its own line -->
  <xsl:template match="*">
    <xsl:value-of select="name()"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
Using countElements.xsl to count elements
saxon issueContents.xml countElements.xsl | sort | uniq -c | sort
Result of counting elements
Start of list:
1 ce:chem
1 ce:displayed-quote
1 ce:inline-figure
1 ce:nomenclature
1 ce:textbox
1 ce:textbox-body
1 ce:underline
1 ce:vsp
1 doc
1 sb:e-host
2 small-caps
3 display
3 formula
End of list:
5726 ce:cross-ref
6916 entry
7225 mml:mo
7760 sb:maintitle
7760 sb:title
7929 ce:label
8458 ce:hsp
9326 mml:mi
10331 mml:mrow
12438 ce:italic
16453 sb:author
17082 ce:given-name
17095 ce:surname
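One possible follow-up is to pull out the elements that occur only a handful of times. As a sketch only, the hypothetical `rare` helper below filters "count name" lines for counts at or below a threshold; the threshold value and the function name are assumptions, and the input is a few lines copied from the result above.

```shell
#!/bin/sh
# rare MAX: from "count name" lines on stdin, print the names
# whose count is at most MAX.
rare() {
  awk -v max="$1" '$1 <= max { print $2 }'
}

printf '%s\n' '1 ce:chem' '1 doc' '2 small-caps' '3 formula' \
  '5726 ce:cross-ref' '17095 ce:surname' | rare 2
```

With a threshold of 2, only `ce:chem`, `doc`, and `small-caps` pass the filter.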
Count element/parent combinations
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>
  <xsl:template match="text()"/>
  <!-- Output parentName/elementName on its own line -->
  <xsl:template match="*">
    <xsl:value-of select="name(..)"/>/<xsl:value-of select="name()"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
Some parent/child counts
1 ce:displayed-quote/ce:simple-para
59 ce:biography/ce:simple-para
107 ce:legend/ce:simple-para
115 ce:abstract-sec/ce:simple-para
859 ce:caption/ce:simple-para
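Summing such counts per child element gives a figure that can be checked against the plain element counts. A minimal sketch, assuming input lines of the form `count parent/child` as shown above; the helper name `per_child` is hypothetical, and because the slide lists only some of the parents, the total below covers just those five rows.

```shell
#!/bin/sh
# per_child: total the parent/child counts for each child element.
per_child() {
  awk '{
         split($2, parts, "/")        # parts[1] = parent, parts[2] = child
         total[parts[2]] += $1
       }
       END { for (child in total) print total[child] "\t" child }' | sort -n
}

printf '%s\n' \
  '1 ce:displayed-quote/ce:simple-para' \
  '59 ce:biography/ce:simple-para' \
  '107 ce:legend/ce:simple-para' \
  '115 ce:abstract-sec/ce:simple-para' \
  '859 ce:caption/ce:simple-para' | per_child
```

For these five rows the sum is 1 + 59 + 107 + 115 + 859 = 1141 occurrences of `ce:simple-para`.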
countAttributes.xsl
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>
  <xsl:template match="text()"/>
  <!-- Output elementName/@attributeName on its own line -->
  <xsl:template match="@*">
    <xsl:value-of select="name(..)"/>
    <xsl:text>/@</xsl:text>
    <xsl:value-of select="name()"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <xsl:template match="*">
    <xsl:apply-templates select="*|@*"/>
  </xsl:template>
</xsl:stylesheet>
Counting the attributes: an excerpt
1 ce:textbox/@id
28 ce:enunciation/@id
44 ce:table-footnote/@id
50 ce:biography/@id
79 ce:footnote/@id
104 ce:correspondence/@id
142 ce:table/@id
175 ce:affiliation/@id
180 ce:formula/@id
182 ce:section/@id
713 ce:figure/@id
4224 ce:bib-reference/@id
Count formula elements with/without ID values
<xsl:stylesheet version="1.0"
    xmlns:ce="http://www.elsevier.com/xml/common/dtd"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    Yes: <!-- finds 180 -->
    <xsl:value-of select="count(//ce:formula[@id])"/>
    No: <!-- finds 208 -->
    <xsl:value-of select="count(//ce:formula[not(@id)])"/>
  </xsl:template>
</xsl:stylesheet>
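The two counts noted in the stylesheet comments turn into a proportion with a little arithmetic: 180 of 180 + 208 = 388 formulas carry an id. The sketch below just performs that calculation; the helper name `share` is hypothetical.

```shell
#!/bin/sh
# share WITH WITHOUT: percentage of items that have the attribute.
share() {
  awk -v yes="$1" -v no="$2" \
    'BEGIN { printf "%.1f%%\n", 100 * yes / (yes + no) }'
}

share 180 208   # counts taken from the stylesheet comments
```

This prints 46.4%, so slightly fewer than half of the formula elements in the sample have an id.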
Find all values of a particular attribute
<xsl:stylesheet version="1.0"
    xmlns:ce="http://www.elsevier.com/xml/common/dtd"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="*">
    <xsl:apply-templates select="*|@*"/>
  </xsl:template>
  <xsl:template match="text()|@*"/>
  <!-- Output each locator value on its own line -->
  <xsl:template match="ce:link/@locator">
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
Running OneAttValue.xsl
xsltproc OneAttValue.xsl issueContents.xml | sort | uniq -c | sort
• Output ending like this:
10 gr12
11 gr11
14 gr10
17 fx1
17 fx2
18 gr9
24 gr8
37 gr7
55 gr6
67 gr5
91 gr4
99 gr3
103 gr1
103 gr2
Output just the comments in a document
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="text()"/>
<xsl:template match="comment()"> <xsl:copy/> </xsl:template>
</xsl:stylesheet>
Output just the processing instructions in a document
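<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>
  <!-- Suppress text nodes so that only processing instructions are output -->
  <xsl:template match="text()"/>
  <xsl:template match="processing-instruction()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>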
elAttList.xsl goal
• Go through rng schema
• For each element, output
  dtdname.dtd\telementName
• For each attribute, output
  dtdname.dtd\telementName\tattributeName
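Tab-delimited lines in this shape are easy to post-process with standard tools. As an illustration only, the sketch below lists element names that occur in more than one DTD; the sample rows, the DTD names `a.dtd` and `b.dtd`, and the helper name `shared_elements` are all hypothetical.

```shell
#!/bin/sh
# shared_elements: from "dtd<TAB>element[<TAB>attribute]" lines,
# print the element names that appear under more than one DTD.
# Step 1 keeps the two-field element rows, step 2 removes duplicate
# DTD/element pairs, step 3 reports names that still repeat.
shared_elements() {
  awk -F '\t' 'NF == 2 { print $1 "\t" $2 }' |
    sort -u |
    cut -f2 | sort | uniq -d
}

printf 'a.dtd\tpara\na.dtd\tpara\tid\na.dtd\ttitle\nb.dtd\tpara\nb.dtd\tnote\n' |
  shared_elements
```

In this made-up sample only `para` is declared in both DTDs, so it is the only name printed.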
elAttList.xsl part 1 of 2
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:r="http://relaxng.org/ns/structure/1.0"
version="1.0">
<xsl:param name="dtdname">no dtdname parameter supplied</xsl:param>
<xsl:strip-space elements="*"/> <xsl:output method="text"/>
<xsl:template match="r:files | r:attribute | r:value"/>
elAttList.xsl part 2 of 2
<xsl:template match="r:element">
  <xsl:variable name="elName" select="@name"/>
  <!-- Element row: dtdname, tab, element name, newline -->
  <xsl:value-of select="$dtdname"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="@name"/>
  <xsl:text>&#10;</xsl:text>
  <!-- Attribute rows: dtdname, tab, element name, tab, attribute name, newline -->
  <xsl:for-each select="r:attribute | r:optional/r:attribute">
    <xsl:value-of select="$dtdname"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="$elName"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="@name"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:for-each>
  <xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
normalizeRNG.xsl
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://relaxng.org/ns/structure/1.0">
  <xsl:output indent="yes"/>
  <xsl:template match="r:element/r:ref | r:optional/r:ref">
    <xsl:variable name="referent" select="@name"/>
    <xsl:apply-templates select="//r:define[@name = $referent]" mode="copying"/>
  </xsl:template>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="r:define" mode="copying">
    <xsl:apply-templates select="node()"/>
  </xsl:template>
</xsl:stylesheet>
Analyzing an SGML DTD
• Why? When migrating away from it
• RNG or W3C XSD both XML, but not SGML
• Using Earl Hood’s perlSGML DTD analysis tools
XML-based analysis of SGML DTD
1. Run Earl Hood’s dtd2html utility
2. Run tagsoup or HTML Tidy on output files
3. Now you’ve got XML where you can pull out element information with XSLT
XML-based analysis of SGML DTD (revised)
1. Tweak dtd2html to add <div class="whatever"></div> elements
2. Run Earl Hood’s dtd2html utility
3. Run tagsoup or HTML Tidy on output files
4. Now you’ve got XML where you can pull out element information with XSLT
Summary
• This is not an integrated report generator. It’s Legos.
• Pipelining data between existing tools, re-usable scripts, and quick hacks.
• Document your command lines, e.g.
saxon temp1.xml temp3.xsl > temp1a.xml
• Clients like reports, especially in spreadsheets.
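Since the counting pipelines all end in `uniq -c`, one small extra step can turn their output into something a spreadsheet opens directly. A sketch, assuming the names contain no commas or quotes; the helper name `counts_to_csv` and the header row are illustrative choices, not from the talk.

```shell
#!/bin/sh
# counts_to_csv: turn "count name" lines (uniq -c output) into CSV
# with a header row, name first.
counts_to_csv() {
  awk 'BEGIN { print "name,count" } { print $2 "," $1 }'
}

printf '%s\n' '      1 ce:textbox/@id' '     28 ce:enunciation/@id' \
  '   4224 ce:bib-reference/@id' | counts_to_csv
```

Appending `| counts_to_csv > report.csv` (file name hypothetical) to any of the counting command lines would then produce a file ready to hand over.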
Thank you!
• Referenced resources: http://www.snee.com/xml/xml2008