Improve the way you create, manage and distribute information www.innodata-isogen.com INNOVATION INSPIRATION Automating Content Analysis with Trang and Simple XSLT Scripts Bob DuCharme XML 2008 December 9, 2008

Page 1: What We Do

Improve the way you create, manage and distribute information

www.innodata-isogen.com

INNOVATION INSPIRATION

Automating Content Analysis with Trang and Simple XSLT Scripts

Bob DuCharme

XML 2008

December 9, 2008

Page 2: What We Do

What We Do

We help companies lower the cost of creating and managing information.

Page 3: What We Do

About me

• Solutions Architect, Innodata Isogen

• weblog: http://www.snee.com/bobdc.blog

• other writing: see http://www.snee.com/bob

• URLs referenced today: http://www.snee.com/xml/xml2008

Page 4: What We Do

Single source publishing and “editorial” XML

[Diagram: inputs (Input1 to Input5), processes (ProcessA to ProcessF), a central Editorial Master (XML), and outputs (Output1 to Output3).]

Page 5: What We Do

Content analysis: why?

• You’ve “inherited” some content

• Convert to your current editorial format

• Convert it to new output formats

• Efficient development of efficient conversion routines

Page 6: What We Do

Handy tool 1 before we get to the XML parts: sort

• colors.txt:

red
green
blue
green
blue
blue
red

$ sort colors.txt

blue
blue
blue
green
green
red
red

Page 7: What We Do

Handy tool 2 before we get to the XML parts: uniq

$ sort colors.txt | uniq -c

3 blue
2 green
2 red
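
The two tools combine into the frequency table that the later slides rely on. A minimal sketch, assuming a POSIX shell: the trailing `sort -rn` is an addition that does not appear on the slides, included only to rank the counts numerically with the most frequent value first.

```shell
# Recreate the colors.txt sample from the slide in a temporary file.
colors=$(mktemp)
printf '%s\n' red green blue green blue blue red > "$colors"

# sort groups identical lines together, uniq -c collapses each group
# into one line prefixed by its count, and sort -rn (an addition here)
# ranks the result numerically, highest count first.
ranked=$(sort "$colors" | uniq -c | sort -rn)
printf '%s\n' "$ranked"

rm -f "$colors"
```

The first `sort` is required because `uniq` only collapses lines that are adjacent.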

Page 8: What We Do

Sample data

Page 9: What We Do

trang

From http://www.thaiopensource.com/relaxng/trang.html:

Trang converts between different schema languages for XML. It supports the following languages:

• RELAX NG (XML syntax)

• RELAX NG compact syntax

• XML 1.0 DTDs

• W3C XML Schema

A schema written in any of the supported schema languages can be converted into any of the other supported schema languages, except that W3C XML Schema is supported for output only, not for input.

Trang can also infer a schema from one or more example XML documents.

Page 10: What We Do

trang

Trang can also infer a schema from one or more example XML documents!!!!!

Page 11: What We Do

Analyzing content with trang

<whatever>

<?xml version="1.0" encoding="UTF-8" ?>

<somedoc>Here is one document</somedoc>

<somedoc>Here is another</somedoc>

<somedoc>Here is another</somedoc>

<somedoc>Here is another</somedoc>

</whatever>
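
One way to build such a combined file, sketched under assumptions: `combine_xml` is a hypothetical helper name, the sample file names are invented, and the `sed` expression handles only an XML declaration that sits on a single line. The declaration is removed because XML permits it only at the very start of a document, so it cannot survive inside the wrapper element.

```shell
# combine_xml WRAPPER FILE...  (hypothetical helper, not from the slides)
# Writes all FILEs inside a single <WRAPPER> root element on stdout.
combine_xml() {
  wrapper=$1
  shift
  printf '<%s>\n' "$wrapper"
  for f in "$@"; do
    # Strip a single-line XML declaration from each input file.
    sed 's/<?xml [^?]*?>//' "$f"
  done
  printf '</%s>\n' "$wrapper"
}

# Demo with two invented sample documents.
tmp=$(mktemp -d)
printf '%s\n' '<?xml version="1.0" encoding="UTF-8" ?>' \
  '<somedoc>Here is one document</somedoc>' > "$tmp/doc1.xml"
printf '%s\n' '<?xml version="1.0" encoding="UTF-8" ?>' \
  '<somedoc>Here is another</somedoc>' > "$tmp/doc2.xml"

combined=$(combine_xml whatever "$tmp/doc1.xml" "$tmp/doc2.xml")
printf '%s\n' "$combined"
rm -rf "$tmp"
```

Redirecting the helper's output to a file gives a single document for trang to infer a schema from. Files that carry a DOCTYPE declaration would need further cleanup that this sketch does not attempt.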

Page 12: What We Do

Create RELAX NG versions of …

• Elsevier article DTD:

trang art510.dtd art510.rng

• Combined sample content:

trang issueContents.xml issueContents.rng

• Compare results:

saxon art510.rng compareElsRNG.xsl | sort > compareElsRNG.out

Page 13: What We Do

compareElsRNG.xsl (1 of 2)

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://relaxng.org/ns/structure/1.0">

<xsl:strip-space elements="*"/>
<xsl:output method="text"/>

<xsl:variable name="schema" select="document('issueContents.rng')"/>

<xsl:template match="text()"/>

Page 14: What We Do

compareElsRNG.xsl (2 of 2)

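<xsl:template match="r:element">

<xsl:variable name="name" select="@name"/>

<!-- "Yes" if issueContents.rng also declares an element with this name -->
<xsl:choose>
  <xsl:when test="$schema/r:grammar//r:element/@name[. = $name]">
    <xsl:text>Yes: </xsl:text>
    <xsl:value-of select="$name"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:when>
  <xsl:otherwise>
    <xsl:text>No: </xsl:text>
    <xsl:value-of select="$name"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:otherwise>
</xsl:choose>

<xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>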

Page 15: What We Do

compareElsRNG.xsl: some sample output

No: tb:colspec
No: tb:left-border
No: tb:right-border
No: tb:top-border
Yes: aid
Yes: article
Yes: body
Yes: ce:abstract
Yes: ce:abstract-sec
Yes: ce:acknowledgment
Yes: ce:affiliation

Page 16: What We Do

Analyzing the XML itself

• Or SGML, after using James Clark’s sx:

sx -f err.out -x lower myfile.sgm > myfile.xml

Page 17: What We Do

Counting elements: countElements.xsl

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:strip-space elements="*"/>
<xsl:output method="text"/>

<xsl:template match="text()"/>

<!-- Output each element's name on its own line -->
<xsl:template match="*">
  <xsl:value-of select="name()"/>
  <xsl:text>&#10;</xsl:text>
  <xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>

Page 18: What We Do

Using countElements.xsl to count elements

saxon issueContents.xml countElements.xsl | sort | uniq -c | sort

Page 19: What We Do

Result of counting elements

Start of list:

1 ce:chem
1 ce:displayed-quote
1 ce:inline-figure
1 ce:nomenclature
1 ce:textbox
1 ce:textbox-body
1 ce:underline
1 ce:vsp
1 doc
1 sb:e-host
2 small-caps
3 display
3 formula

End of list:

5726 ce:cross-ref
6916 entry
7225 mml:mo
7760 sb:maintitle
7760 sb:title
7929 ce:label
8458 ce:hsp
9326 mml:mi
10331 mml:mrow
12438 ce:italic
16453 sb:author
17082 ce:given-name
17095 ce:surname

Page 20: What We Do

Count element/parent combinations

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:strip-space elements="*"/>
<xsl:output method="text"/>

<xsl:template match="text()"/>

<!-- Output parentName/elementName, one pair per line -->
<xsl:template match="*">
  <xsl:value-of select="name(..)"/>/<xsl:value-of select="name()"/>
  <xsl:text>&#10;</xsl:text>
  <xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>

Page 21: What We Do

Some parent/child counts

1 ce:displayed-quote/ce:simple-para

59 ce:biography/ce:simple-para

107 ce:legend/ce:simple-para

115 ce:abstract-sec/ce:simple-para

859 ce:caption/ce:simple-para
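
A listing like this, restricted to one child element, can be cut from the full parent/child report with `grep`. A sketch under assumptions: `pairs` stands in for the stylesheet's output (the sample lines are invented for illustration, not taken from the real data), and the `$` anchor keeps the match at the end of the line so that only the child name is tested.

```shell
# Invented stand-in for the stylesheet output: one parent/child pair per line.
pairs='ce:caption/ce:simple-para
ce:legend/ce:simple-para
ce:caption/ce:simple-para
ce:section/ce:para
ce:simple-para/ce:italic'

# Count each distinct pair, then keep only pairs whose child is ce:simple-para.
report=$(printf '%s\n' "$pairs" | sort | uniq -c | grep '/ce:simple-para$' | sort -n)
printf '%s\n' "$report"
```

Without the anchor, the line whose parent is ce:simple-para would also match.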

Page 22: What We Do

countAttributes.xsl

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:strip-space elements="*"/>
<xsl:output method="text"/>

<xsl:template match="text()"/>

<!-- Output elementName/@attributeName, one pair per line -->
<xsl:template match="@*">
  <xsl:value-of select="name(..)"/>
  <xsl:text>/@</xsl:text>
  <xsl:value-of select="name()"/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>

<!-- Visit child elements and attributes -->
<xsl:template match="*">
  <xsl:apply-templates select="*|@*"/>
</xsl:template>

</xsl:stylesheet>

Page 23: What We Do

Counting the attributes: an excerpt

1 ce:textbox/@id
28 ce:enunciation/@id
44 ce:table-footnote/@id
50 ce:biography/@id
79 ce:footnote/@id
104 ce:correspondence/@id
142 ce:table/@id
175 ce:affiliation/@id
180 ce:formula/@id
182 ce:section/@id
713 ce:figure/@id
4224 ce:bib-reference/@id

Page 24: What We Do

Count formula elements with/without ID values

<xsl:stylesheet version="1.0"
    xmlns:ce="http://www.elsevier.com/xml/common/dtd"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="/">
  Yes: <!-- finds 180 -->
  <xsl:value-of select="count(//ce:formula[@id])"/>
  No: <!-- finds 208 -->
  <xsl:value-of select="count(//ce:formula[not(@id)])"/>
</xsl:template>

</xsl:stylesheet>

Page 25: What We Do

Find all values of a particular attribute

<xsl:stylesheet version="1.0"
    xmlns:ce="http://www.elsevier.com/xml/common/dtd"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<!-- Visit child elements and attributes -->
<xsl:template match="*">
  <xsl:apply-templates select="*|@*"/>
</xsl:template>

<!-- Suppress text and every attribute not matched more specifically below -->
<xsl:template match="text()|@*"/>

<!-- Output each ce:link/@locator value on its own line -->
<xsl:template match="ce:link/@locator">
  <xsl:value-of select="."/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>

</xsl:stylesheet>

Page 26: What We Do

Running OneAttValue.xsl

xsltproc OneAttValue.xsl issueContents.xml | sort | uniq -c | sort

• Output ending like this:

10 gr12
11 gr11
14 gr10
17 fx1
17 fx2
18 gr9
24 gr8
37 gr7
55 gr6
67 gr5
91 gr4
99 gr3
103 gr1
103 gr2

Page 27: What We Do

Output just the comments in a document

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="text()"/>

<xsl:template match="comment()">
  <xsl:copy/>
</xsl:template>

</xsl:stylesheet>

Page 28: What We Do

Output just the processing instructions in a document

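<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml"/>

<!-- Suppress text nodes, which the built-in rules would otherwise copy to the output -->
<xsl:template match="text()"/>

<xsl:template match="processing-instruction()">
  <xsl:copy/>
</xsl:template>

</xsl:stylesheet>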

Page 29: What We Do

elAttList.xsl goal

• Go through rng schema

• For each element, output

dtdname.dtd\telementName

• For each attribute, output

dtdname.dtd\telementName\tattributeName
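
The tab-separated layout is easy to post-process with standard tools. A sketch with invented sample rows (the element and attribute names are placeholders, not real elAttList.xsl output): rows with two fields describe elements and rows with three describe attributes, so `awk` can tell them apart by field count.

```shell
# Invented sample in the target format: dtd<TAB>element[<TAB>attribute]
rows=$(printf 'art510.dtd\tce:figure\nart510.dtd\tce:figure\tid\nart510.dtd\tce:figure\tview\nart510.dtd\tce:para\n')

# Rows with three tab-separated fields are attribute rows;
# tally how many attributes each element declares.
attr_counts=$(printf '%s\n' "$rows" |
  awk -F '\t' 'NF == 3 { n[$2]++ } END { for (e in n) print e "\t" n[e] }' |
  sort)
printf '%s\n' "$attr_counts"
```

Elements with no attribute rows, such as the placeholder ce:para here, do not appear in the tally.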

Page 30: What We Do

elAttList.xsl part 1 of 2

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://relaxng.org/ns/structure/1.0"
    version="1.0">

<xsl:param name="dtdname">no dtdname parameter supplied</xsl:param>

<xsl:strip-space elements="*"/>
<xsl:output method="text"/>

<xsl:template match="r:files|r:attribute|r:value"/>

Page 31: What We Do

elAttList.xsl part 2 of 2

<xsl:template match="r:element">
  <xsl:variable name="elName" select="@name"/>

  <xsl:value-of select="$dtdname"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="@name"/>
  <xsl:text>&#10;</xsl:text>

  <xsl:for-each select="r:attribute | r:optional/r:attribute">
    <xsl:value-of select="$dtdname"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="$elName"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="@name"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:for-each>

  <xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>

Page 32: What We Do

normalizeRNG.xsl

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://relaxng.org/ns/structure/1.0">

<xsl:output indent="yes"/>

<xsl:template match="r:element/r:ref | r:optional/r:ref">
  <xsl:variable name="referent" select="@name"/>
  <xsl:apply-templates select="//r:define[@name = $referent]" mode="copying"/>
</xsl:template>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="r:define" mode="copying">
  <xsl:apply-templates select="node()"/>
</xsl:template>

</xsl:stylesheet>

Page 33: What We Do

Analyzing an SGML DTD

• Why? When migrating away from it

• RNG or W3C XSD both XML, but not SGML

• Using Earl Hood’s perlSGML DTD analysis tools

Page 34: What We Do

XML-based analysis of SGML DTD

1. Run Earl Hood’s dtd2html utility

2. Run tagsoup or HTML Tidy on output files

3. Now you’ve got XML where you can pull out element information with XSLT

Page 35: What We Do

XML-based analysis of SGML DTD (revised)

1. Tweak dtd2html to add <div class="whatever"></div> elements

2. Run Earl Hood’s dtd2html utility

3. Run tagsoup or HTML Tidy on output files

4. Now you’ve got XML where you can pull out element information with XSLT

Page 36: What We Do

Summary

• This is not an integrated report generator. It’s Legos.

• Pipelining data between existing tools, re-usable scripts, and quick hacks.

• Document your command lines, e.g.

saxon temp1.xml temp3.xsl > temp1a.xml

• Clients like reports, especially in spreadsheets.
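
Since the counts produced throughout this deck come out of `uniq -c` as space-padded text, a small `awk` step can turn them into CSV for a spreadsheet. A minimal sketch, assuming the counted values contain no spaces, commas, or quotes (which holds for XML element names); the header row and the sample input are illustrative, not from the talk.

```shell
# Illustrative stand-in for stylesheet output: one element name per line.
names='ce:italic
ce:label
ce:italic
ce:italic
ce:label
doc'

# uniq -c emits "<padding><count> <name>"; awk splits on whitespace,
# so $1 is the count and $2 the name regardless of the padding width.
csv=$(printf '%s\n' "$names" | sort | uniq -c | sort -rn |
  awk 'BEGIN { print "name,count" } { print $2 "," $1 }')
printf '%s\n' "$csv"
```

Values that may contain commas or spaces, such as arbitrary attribute values, would need proper CSV quoting, which this sketch omits.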

Page 37: What We Do

Thank you!

• Referenced resources: http://www.snee.com/xml/xml2008