
phloc-schematron - an introduction - © 2013 by Philip Helger – version as of May 28, 2013


phloc-schematron

Version 2.6.1, 28-05-2013, by Philip Helger

Table of contents

1 Introduction
  1.1 Prerequisites
2 XML document validation
  2.1 Validation via XSLT
  2.2 Validation via Pure Schematron
3 Technical details
  3.1 Usage with Maven
  3.2 Common API
    3.2.1 Validation via XSLT
    3.2.2 Validation via Pure Schematron
  3.3 Extensibility of Pure Schematron
    3.3.1 Reading
    3.3.2 New Query Binding
    3.3.3 Modify Existing Query Binding
  3.4 Maven plugin schematron2xslt
4 Benchmarks

1 Introduction

phloc-schematron is a Java library that validates XML documents via ISO Schematron

(http://www.schematron.com). It offers several different possibilities to perform this task where

each solution offers its own advantages and disadvantages that are outlined below in more detail.

phloc-schematron only supports ISO Schematron and no other Schematron version.

The most common way is to convert the source Schematron file to an XSLT script and apply this XSLT

on the XML document to be validated. Alternatively phloc-schematron offers a native

implementation for the Schematron XPath binding which offers superior performance over the XSLT

approach but has some other minor limitations.

1.1 Prerequisites

It is assumed that you have basic knowledge of what Schematron is and what it can do for you. A good introduction can be found in Dave Pawson's Schematron tutorial at http://www.dpawson.co.uk/schematron/.

It is also assumed that you have basic knowledge of the Java language, so that you can understand the code examples, at least a basic understanding of XSLT (Extensible Stylesheet Language Transformations), and good knowledge of XML itself.

2 XML document validation

The goal of Schematron is to provide validation mechanisms for XML documents that are beyond

DTD and XML Schema. DTD and XML Schema both purely test the structure and the data types of the

content of an XML document whereas Schematron can check relations and structure of an XML

document.

The most basic type of validation is to check whether an XML document conforms to a set of Schematron

rules or not. So the output of the basic check is either "true" - meaning the XML document conforms

to the Schematron rules - or "false" - meaning that the XML document does not conform to the

Schematron rules. Additionally Schematron defines a result document type called "SVRL" which is

short for "Schematron Validation Report Language". It is a more complex, XML-based result that

outlines exactly what assertions failed and what reports succeeded. phloc-schematron is capable of

performing both types of validation.
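To make the idea of a "relation" check concrete, the following is a minimal sketch that uses only the XPath API of the JDK, not phloc-schematron itself. The document structure (order, item, total) and the class name RelationCheckSketch are assumptions made up for illustration. The XPath expression plays the role of the test of a Schematron assertion, and the boolean result corresponds to the basic "true"/"false" outcome described above.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

final class RelationCheckSketch
{
  // Hypothetical assertion test: the declared total must equal the sum of
  // all item prices. Neither a DTD nor XML Schema 1.0 can express this.
  static final String TEST = "number(/order/total) = sum(/order/item/@price)";

  private RelationCheckSketch ()
  {}

  // Returns true if the XML document satisfies the assertion test
  static boolean isValid (final String sXML) throws XPathExpressionException
  {
    final XPath aXPath = XPathFactory.newInstance ().newXPath ();
    final Object aResult = aXPath.evaluate (TEST,
                                            new InputSource (new StringReader (sXML)),
                                            XPathConstants.BOOLEAN);
    return ((Boolean) aResult).booleanValue ();
  }
}
```

An SVRL result would additionally record which assertion failed and at which location, instead of only the boolean outcome.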

2.1 Validation via XSLT

The proposed way to perform a Schematron validation is to apply a set of three pre-defined XSLT

scripts onto a Schematron file. After these transformations the original Schematron rule set has been

transformed into an XSLT script itself, which can then be applied onto XML documents for validation.

The output of this validation is an SVRL document. Because the pre-compilation step from

Schematron to XSLT is very time consuming (it can take many minutes for a mid-sized Schematron

rule set), it is strongly suggested to cache the resulting XSLT script, as it can be applied to all XML

documents to be validated. Please note that the created Schematron XSLT scripts differ when you

choose a special Schematron phase!
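The caching advice can be sketched with plain JDK classes. This is not the cache that phloc-schematron uses internally; the class name TemplatesCacheSketch and the key format are assumptions for illustration. The sketch keeps compiled javax.xml.transform.Templates objects in a map and includes the phase in the key, because the generated XSLT differs per phase.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class TemplatesCacheSketch
{
  private static final ConcurrentMap <String, Templates> CACHE = new ConcurrentHashMap <String, Templates> ();

  private TemplatesCacheSketch ()
  {}

  // Compile the XSLT only if no entry exists for this id/phase combination
  static Templates getOrCompile (final String sID, final String sPhase, final String sXSLT) throws TransformerException
  {
    final String sKey = sID + "#" + sPhase;
    Templates aTemplates = CACHE.get (sKey);
    if (aTemplates == null)
    {
      aTemplates = TransformerFactory.newInstance ().newTemplates (new StreamSource (new StringReader (sXSLT)));
      // Another thread may have been faster; keep the first entry
      final Templates aOld = CACHE.putIfAbsent (sKey, aTemplates);
      if (aOld != null)
        aTemplates = aOld;
    }
    return aTemplates;
  }

  // Templates are thread-safe; a new Transformer is created per document
  static String apply (final Templates aTemplates, final String sXML) throws TransformerException
  {
    final StringWriter aWriter = new StringWriter ();
    aTemplates.newTransformer ().transform (new StreamSource (new StringReader (sXML)), new StreamResult (aWriter));
    return aWriter.toString ();
  }
}
```

The expensive step (compilation) happens once per key, while applying the compiled script to each XML document stays cheap.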

phloc-schematron ships with a special Apache Maven plugin called "schematron2xslt-maven-plugin"

that can be used to create the XSLT scripts from Schematron files during build time. It is described in

more detail below.

2.2 Validation via Pure Schematron

As an alternative to the XSLT-based approach, phloc-schematron provides a pure Java

implementation which will be referred to as "Pure Schematron" within this document. Pure

Schematron is available since phloc-schematron 2.6.0. With Pure Schematron the same results can

be achieved as with the XSLT approach: basic validity checks and SVRL output documents.

The advantage of Pure Schematron is that you don't need to apply the time-consuming conversion to XSLT

before you can start validating. The internal steps for validating an XML document with Pure

Schematron are the following:

1. Read the Schematron resource from a file or a URL or create it manually. When reading an

existing Schematron resource, all Schematron includes are resolved, so that one large

Schematron document is created.

2. Determine the query binding to be used. phloc-schematron ships with a standard XPath

binding that will be used if none is specified.


3. Now the Schematron needs to be pre-processed, to resolve abstract patterns, abstract rules

and perform variable replacement.

4. Finally the pre-processed Schematron must be "bound". In this step a Schematron phase can

be selected which should be used. When the default query binding is used, all XPath

expressions are pre-compiled so that they can be evaluated faster. When you supply your

own query binding, you need to make sure to create an efficient representation to use as a

bound schema.

5. This created bound schema can now be used to validate arbitrary XML documents. Ideally it

should also be cached like the XSLT script from above, because the XPath compilation is kind

of costly, but by far not as costly as the XSLT creation.
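The idea behind steps 4 and 5 can be illustrated with the XPath API of the JDK. This is only a sketch of the concept and not the actual bound schema implementation of phloc-schematron; the class BoundAssertionsSketch and its methods are hypothetical. The XPath expressions are compiled once in the constructor ("binding") and the resulting object is reused for any number of XML documents.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class BoundAssertionsSketch
{
  private final List <XPathExpression> m_aTests = new ArrayList <XPathExpression> ();

  // "Binding": every assertion test is compiled exactly once
  BoundAssertionsSketch (final String... aTests) throws XPathExpressionException
  {
    final XPath aXPath = XPathFactory.newInstance ().newXPath ();
    for (final String sTest : aTests)
      m_aTests.add (aXPath.compile (sTest));
  }

  private static boolean holds (final XPathExpression aTest, final Document aDoc) throws XPathExpressionException
  {
    return ((Boolean) aTest.evaluate (aDoc, XPathConstants.BOOLEAN)).booleanValue ();
  }

  // Basic validity check: stops at the first failed assertion
  boolean isValid (final Document aDoc) throws XPathExpressionException
  {
    for (final XPathExpression aTest : m_aTests)
      if (!holds (aTest, aDoc))
        return false;
    return true;
  }

  // Full check: evaluates every assertion, as needed for a complete report
  int countFailed (final Document aDoc) throws XPathExpressionException
  {
    int nFailed = 0;
    for (final XPathExpression aTest : m_aTests)
      if (!holds (aTest, aDoc))
        nFailed++;
    return nFailed;
  }

  static Document parse (final String sXML) throws Exception
  {
    return DocumentBuilderFactory.newInstance ().newDocumentBuilder ().parse (new InputSource (new StringReader (sXML)));
  }
}
```

Keeping one such object per Schematron (and phase) avoids recompiling the expressions for every document.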

Pure Schematron is designed for maximum extensibility, meaning that you can create your own

query binding, configure the reading and pre-processing of Schematron objects etc. The drawbacks

of Pure Schematron are currently:

- Include handling: it works only when you read a Schematron from a resource and not if you create your Schematron from scratch. If you have this in mind when creating your Schematron files it should not affect you much.

- XML attributes and elements from other namespaces are read from an existing Schematron resource but they have no impact on the validation process itself when the default query binding is used. If you have an idea how this can be solved in a proper way, please drop me an email.

Additionally, phloc-schematron lets you write a Schematron rule set to disk and check whether a Schematron is minified, preprocessed and valid. It also supports validating a Schematron resource against the RELAX NG Compact schema with the additional library called “phloc-schematron-validator”. This library was externalized because it is not used in any regular workflow and brings a lot of additional dependencies.

3 Technical details

phloc-schematron is an operating system independent Java 1.6 library. Saxon HE 9.5.0.2 (http://saxon.sourceforge.net/) is used as the underlying XPath engine. Compared to Apache Xalan 2.7.1 (http://xml.apache.org/xalan-j/) it offers more XPath functions out of the box. phloc-schematron also depends on our OSS library phloc-commons (http://code.google.com/p/phloc-commons/).

For usage with Maven please look at the Wiki page http://code.google.com/p/phloc-schematron/wiki/FirstSteps for details. phloc-schematron is built as an OSGi bundle via the org.apache.felix:maven-bundle-plugin.

The full code of the examples used in this document can be found in the file

src/test/java/com/phloc/schematron/docs/DocumentationExamples.java.

3.1 Usage with Maven

phloc-schematron is built with Apache Maven. If you want to build it from source, at least Maven 3.0.4 is required. As phloc-schematron is not yet in Maven Central you need to add the following repository to your pom.xml:

    <repositories>
      <repository>
        <id>phloc.com</id>
        <url>http://repo.phloc.com/maven2</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>true</enabled>
        </snapshots>
      </repository>
    </repositories>

The dependency for phloc-schematron looks like this:

    <dependency>
      <groupId>com.phloc</groupId>
      <artifactId>phloc-schematron</artifactId>
      <version>2.6.1</version>
    </dependency>

It transitively contains phloc-commons, SLF4J and Saxon HE.

3.2 Common API

A common API for both the XSLT and the Pure Schematron approach is available via the

com.phloc.schematron.ISchematronResource interface. It is meant for Schematron that is

read from a file or URL. It offers the possibility to check if the read Schematron is valid itself via the

boolean isValidSchematron () method.

To check if an XML document simply matches a Schematron rule set the methods

com.phloc.commons.state.EValidity getSchematronValidity(…) are provided. These

methods deliver either EValidity.VALID if the XML document matches the Schematron or

EValidity.INVALID if the XML document does not match at least one Schematron rule. With this

method you have no possibility to determine what the error exactly was. When using an XSLT based

implementation this method does not offer any performance improvement, as the SVRL is fully

created and analyzed afterwards. When using a Pure Schematron based implementation, the

validation stops after the first error and does not continue to validate the supplied XML document.

Alternatively to the basic validation the interface also offers the possibility to create an SVRL result

via the methods org.w3c.dom.Document applySchematronValidation(…) and org.oclc.purl.dsdl.svrl.SchematronOutputType

applySchematronValidationToSVRL(…). The first method type creates the SVRL only as an XML

document node, where the second method type applies a JAXB binding, so that it is easier to access

the information inside the SVRL. Internally these methods call each other depending on the concrete

implementation, so they are ensured to deliver exactly the same result. The XSLT implementation is

natively done in applySchematronValidation and then converted to a

SchematronOutputType using the com.phloc.schematron.svrl.SVRLReader class. With

Pure Schematron a SchematronOutputType object is directly created and then converted to an

XML document node via the class com.phloc.schematron.svrl.SVRLWriter.

The classes SVRLReader and SVRLWriter can generically be used to read and write SVRL files in a

structured way. Both classes validate the SVRL based on SVRL XML Schema contained in the library.
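For readers who want to inspect an SVRL document without the JAXB binding, the following sketch counts the relevant result elements with the DOM API of the JDK. It does not use SVRLReader or SVRLWriter, and the class name SVRLSketch is an assumption for illustration. The namespace URI and the element names failed-assert and successful-report are those of SVRL itself.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class SVRLSketch
{
  // Namespace URI of the Schematron Validation Report Language
  static final String SVRL_NS = "http://purl.oclc.org/dsdl/svrl";

  private SVRLSketch ()
  {}

  // Parse namespace-aware, otherwise the NS-based lookups below find nothing
  static Document parse (final String sXML) throws Exception
  {
    final DocumentBuilderFactory aFactory = DocumentBuilderFactory.newInstance ();
    aFactory.setNamespaceAware (true);
    return aFactory.newDocumentBuilder ().parse (new InputSource (new StringReader (sXML)));
  }

  // Number of assertions that did not hold
  static int countFailedAsserts (final Document aSVRL)
  {
    return aSVRL.getElementsByTagNameNS (SVRL_NS, "failed-assert").getLength ();
  }

  // Number of reports whose test matched
  static int countSuccessfulReports (final Document aSVRL)
  {
    return aSVRL.getElementsByTagNameNS (SVRL_NS, "successful-report").getLength ();
  }
}
```

The JAXB-based SchematronOutputType gives typed access to the same information, which is usually more convenient than walking the DOM.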


3.2.1 Validation via XSLT

As described above it is highly recommended to cache the XSLT script that is created from the source

Schematron rule set. Nevertheless, phloc-schematron supports both starting from a Schematron file and starting from a precompiled XSLT script.

The easiest way to start working is by starting from a Schematron file.

com.phloc.schematron.xslt.SchematronResourceSCH is the implementation of the

ISchematronResource interface to be used for this. The constructor takes at least the

Schematron resource that contains the rules. When using this class it is possible to specify an

optional Schematron phase to be used for validation. Additionally some static factory methods are

present that allow creating SchematronResourceSCH objects from a String path or a

java.io.File object.

If a precompiled XSLT script is present (e.g. via the schematron2xslt Maven plugin or via manual pre-

processing) the implementation class

com.phloc.schematron.xslt.SchematronResourceXSLT should be instantiated. It offers the

same constructors and factory methods as the SchematronResourceSCH class. Please recall that

the chosen phase already affected the created XSLT script, so it is not possible to specify a phase

when using this implementation.

Both implementations use an internal cache that keeps the created precompiled

javax.xml.transform.Templates objects in memory while the application is running. The cache

for SchematronResourceSCH is located in the class

com.phloc.schematron.xslt.SchematronResourceSCHCache whereas the cache for

SchematronResourceXSLT is located in the class

com.phloc.schematron.xslt.SchematronResourceXSLTCache – big surprise

A simple example to validate an XML file based on Schematron rules from a file looks like this:

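    import java.io.File;
    import javax.annotation.Nonnull;
    import javax.xml.transform.stream.StreamSource;
    import com.phloc.schematron.ISchematronResource;
    import com.phloc.schematron.xslt.SchematronResourceSCH;

    public static boolean validateXMLViaXSLTSchematron (@Nonnull final File aSchematronFile, @Nonnull final File aXMLFile) throws Exception
    {
      // The Schematron file is converted to XSLT internally
      final ISchematronResource aResSCH = SchematronResourceSCH.fromFile (aSchematronFile);
      if (!aResSCH.isValidSchematron ())
        throw new IllegalArgumentException ("Invalid Schematron!");
      return aResSCH.getSchematronValidity (new StreamSource (aXMLFile)).isValid ();
    }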

3.2.2 Validation via Pure Schematron

For Pure Schematron the implementation of the ISchematronResource interface resides in the

class com.phloc.schematron.pure.SchematronResourcePure. The constructor also takes at

least the resource where to read the Schematron rules from. Additionally, a Schematron phase and a

custom error handler can be supplied.

Be careful when using the validation methods that take a javax.xml.transform.Source object

as parameter. Only DOMSource and StreamSource objects are supported at the moment!
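If your XML arrives as another kind of Source (for example a SAXSource), one possible workaround is to convert it to a DOMSource first. The sketch below uses the identity transformation of the JDK for that; the class and method names are hypothetical and not part of phloc-schematron.

```java
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;

final class SourceNormalizerSketch
{
  private SourceNormalizerSketch ()
  {}

  // Returns the source unchanged if it is already a supported type,
  // otherwise copies its content into a DOM tree
  static Source toSupportedSource (final Source aSource) throws TransformerException
  {
    if (aSource instanceof DOMSource || aSource instanceof StreamSource)
      return aSource;

    // A Transformer created without a stylesheet performs an identity copy
    final DOMResult aResult = new DOMResult ();
    TransformerFactory.newInstance ().newTransformer ().transform (aSource, aResult);
    return new DOMSource (aResult.getNode ());
  }
}
```

The conversion builds the whole document in memory, which is acceptable here because the validation works on a DOM tree anyway.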

A simple example to validate an XML file based on Schematron rules from a file looks like this:

    import java.io.File;
    import javax.annotation.Nonnull;
    import javax.xml.transform.stream.StreamSource;
    import com.phloc.schematron.ISchematronResource;
    import com.phloc.schematron.pure.SchematronResourcePure;

    public static boolean validateXMLViaPureSchematron (@Nonnull final File aSchematronFile, @Nonnull final File aXMLFile) throws Exception
    {
      // No conversion to XSLT is involved here
      final ISchematronResource aResPure = SchematronResourcePure.fromFile (aSchematronFile);
      if (!aResPure.isValidSchematron ())
        throw new IllegalArgumentException ("Invalid Schematron!");
      return aResPure.getSchematronValidity (new StreamSource (aXMLFile)).isValid ();
    }

Alternatively, you can validate via the internal API, in which case the code can look like this:

    01 public static boolean validateXMLViaPureSchematron2 (@Nonnull final File aSchematronFile, @Nonnull final File aXMLFile) throws Exception
    02 {
    03   // Read the schematron from file
    04   final PSSchema aSchema = new PSReader (new FileSystemResource (aSchematronFile)).readSchema ();
    05   if (!aSchema.isValid ())
    06     throw new IllegalArgumentException ("Invalid Schematron!");
    07   // Resolve the query binding to use
    08   final IPSQueryBinding aQueryBinding = PSQueryBindingRegistry.getQueryBindingOfNameOrThrow (aSchema.getQueryBinding ());
    09   // Pre-process schema
    10   final PSPreprocessor aPreprocessor = new PSPreprocessor (aQueryBinding);
    11   aPreprocessor.setKeepTitles (true);
    12   final PSSchema aPreprocessedSchema = aPreprocessor.getAsPreprocessedSchema (aSchema);
    13   // Bind the pre-processed schema
    14   final IPSBoundSchema aBoundSchema = aQueryBinding.bind (aPreprocessedSchema, null, null);
    15   // Read the XML file
    16   final Document aXMLNode = XMLReader.readXMLDOM (aXMLFile);
    17   if (aXMLNode == null)
    18     return false;
    19   // Perform the validation
    20   return aBoundSchema.validatePartially (aXMLNode).isValid ();
    21 }

The code is clearly separated into the following steps:

- Read the Schematron file from a File (lines 04-06). This part contains the Schematron include resolution.

- Determine the Schematron query binding to be used (line 08). The query binding is required to correctly pre-process the Schematron afterwards.

- Pre-process the read Schematron file (lines 10-12). This resolves all abstract rules and patterns.

- Create the bound Schematron (line 14). This is the pre-compilation step, depending on the selected query binding. The second parameter, which is null in the example, is the name of the phase to use. When no phase is passed, the defaultPhase attribute of the Schematron schema is checked and used. If no defaultPhase is present, all patterns are active.

- Read the XML file to be validated via DOM (lines 16-18). Technical note: this is the class com.phloc.commons.xml.serialize.XMLReader, which offers a simplified API to read XML files and is not to be confused with org.xml.sax.XMLReader.

- Perform the Schematron validation of the read XML file (line 20).


It is important to note that in the second case no caching is performed and that the Schematron file is interpreted each time the method is called, which may not be as efficient as possible.

Most customization can be done in the pre-processor. The Schematron ISO standard defines a “Minimal Syntax” that is still compliant Schematron but, among other things, has all includes, abstract patterns and abstract rules resolved. Because a minified Schematron has implications on the created SVRL document, the class was named “PSPreprocessor” and not “PSMinifier”. For example, if all <report> elements are converted to <assert> elements, the SVRL would contain a <failed-assert> element instead of a <successful-report> element. By default the pre-processor creates a minimal Schematron, but it offers the possibility to avoid certain minimizations.
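The effect of rewriting a report as an assert can be sketched with plain XPath evaluation. The sketch assumes that the rewritten assert carries the negated test, which is what keeps the two forms logically equivalent; the class ReportAsAssertSketch is hypothetical and does not show how PSPreprocessor is implemented.

```java
import java.io.StringReader;

import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

final class ReportAsAssertSketch
{
  private ReportAsAssertSketch ()
  {}

  private static boolean eval (final String sTest, final String sXML) throws XPathExpressionException
  {
    final Object aResult = XPathFactory.newInstance ().newXPath ().evaluate (sTest,
                                                                              new InputSource (new StringReader (sXML)),
                                                                              XPathConstants.BOOLEAN);
    return ((Boolean) aResult).booleanValue ();
  }

  // A report fires when its test is true (SVRL: successful-report)
  static boolean reportFires (final String sTest, final String sXML) throws XPathExpressionException
  {
    return eval (sTest, sXML);
  }

  // The rewritten assert uses not(test) and fails when that is false
  // (SVRL: failed-assert)
  static boolean rewrittenAssertFails (final String sTest, final String sXML) throws XPathExpressionException
  {
    return !eval ("not(" + sTest + ")", sXML);
  }
}
```

Both methods always return the same value for the same input, so the validation outcome is unchanged; only the kind of SVRL element that gets written differs.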

3.3 Extensibility of Pure Schematron

3.3.1 Reading

Pure Schematron can be extended in several ways. The reading itself is not very customizable, but after reading you have the possibility to modify the created Schematron object of type com.phloc.schematron.pure.model.PSSchema. It offers getters and setters for all elements. The object hierarchy of PSSchema is very similar to the XML hierarchy of Schematron in general, so it should not be too hard to handle. E.g. a PSSchema has a list of PSPattern objects, each of which has a list of PSRule elements. Via the method IMicroElement getAsMicroElement () each Schematron object can easily be converted to an XML structure, which can then easily be serialized to disk. The following example reads a Schematron file from disk, sets a <title> element and writes the document back to the source file:

    public static boolean readModifyAndWrite (@Nonnull final File aSchematronFile) throws Exception
    {
      final PSSchema aSchema = new PSReader (new FileSystemResource (aSchematronFile)).readSchema ();
      final PSTitle aTitle = new PSTitle ();
      aTitle.addText ("Created by phloc-schematron");
      aSchema.setTitle (aTitle);
      return MicroWriter.writeToFile (aSchema.getAsMicroElement (), aSchematronFile).isSuccess ();
    }

3.3.2 New Query Binding

It is also possible to implement your own query binding that is different from the default XPath-based query binding. To do so, a class implementing the interface com.phloc.schematron.pure.binding.IPSQueryBinding must be present. This implementation class must then be registered in the com.phloc.schematron.pure.binding.PSQueryBindingRegistry via the static method registerQueryBinding. It is not possible to replace an existing query binding. The predefined XPath-based query binding is registered to the names “xslt” and “xslt2” as well as to the default (meaning unspecified) query binding. Implementing your own query binding is fairly time consuming, as you need to implement at least the interfaces com.phloc.schematron.pure.binding.IPSQueryBinding and com.phloc.schematron.pure.bound.IPSBoundSchema.
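The "register once, never replace" behaviour can be sketched generically as follows. This is not the code of PSQueryBindingRegistry; the class BindingRegistrySketch and its method names are hypothetical, and the unspecified (default) binding name is not modelled here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class BindingRegistrySketch <T>
{
  private final ConcurrentMap <String, T> m_aMap = new ConcurrentHashMap <String, T> ();

  // Registers a binding under a name; an existing entry is never replaced
  void register (final String sName, final T aBinding)
  {
    if (m_aMap.putIfAbsent (sName, aBinding) != null)
      throw new IllegalArgumentException ("A query binding is already registered for '" + sName + "'");
  }

  // Looks up a binding and fails loudly if the name is unknown
  T getOrThrow (final String sName)
  {
    final T aBinding = m_aMap.get (sName);
    if (aBinding == null)
      throw new IllegalArgumentException ("No query binding registered for '" + sName + "'");
    return aBinding;
  }
}
```

Refusing replacement keeps the meaning of a binding name stable for all Schematron resources in the application; a custom binding therefore needs a name of its own.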

3.3.3 Modify Existing Query Binding

Additionally, you may alter the existing Schematron processing either by using the Pure Schematron API as outlined in the example above or by subclassing com.phloc.schematron.pure.bound.PSBoundSchemaCacheKey, which offers a set of protected methods for easy customization when using SchematronResourcePure. In case you have a customized implementation, you need to use the special SchematronResourcePure constructor taking the Schematron IReadableResource and the PSBoundSchemaCacheKey implementation. See the documentation in the code for details on overriding PSBoundSchemaCacheKey.

3.4 Maven plugin schematron2xslt

The conversion of Schematron to XSLT is quite costly. That’s why phloc-schematron offers a Maven plugin that does the conversion at build time. Because the plugin is not yet contained in Maven Central, you need to add a custom repository and a custom pluginRepository to your pom.xml like this:

    <repositories>
      <repository>
        <id>phloc.com</id>
        <url>http://repo.phloc.com/maven2</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>true</enabled>
        </snapshots>
      </repository>
    </repositories>
    <pluginRepositories>
      <pluginRepository>
        <id>phloc.com</id>
        <url>http://repo.phloc.com/maven2</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>false</enabled>
        </snapshots>
      </pluginRepository>
    </pluginRepositories>

By default the plugin is run in the Maven lifecycle phase “generate-resources”. The basic

configuration of the plugin in the pom.xml looks like this (inside the <build> + <plugins>

elements):

    <plugin>
      <groupId>com.phloc.maven</groupId>
      <artifactId>schematron2xslt-maven-plugin</artifactId>
      <version>2.6.1</version>
      <executions>
        <execution>
          <goals>
            <goal>convert</goal>
          </goals>
        </execution>
      </executions>
      <configuration>
        <schematronDirectory>${basedir}/src/main/schematron</schematronDirectory>
        <xsltDirectory>${basedir}/src/main/resources/xslt</xsltDirectory>
        <xsltExtension>.xsl</xsltExtension>
      </configuration>
    </plugin>

The possible configuration parameters are:

schematronDirectory - The directory where the Schematron files reside.

schematronPattern - A pattern for the Schematron files. Can contain Ant-style wildcards and

double wildcards. All files that match the pattern will be converted. Files in the

schematronDirectory and its subdirectories will be considered.

xsltDirectory - The directory where the XSLT files will be saved.

xsltExtension - The file extension of the created XSLT files.

overwriteWithoutQuestion - Overwrite existing XSLT files without notice? If this is set

to “false” then existing XSLT files are not overwritten.

phaseName - Define the phase to be used for XSLT creation. By default the “defaultPhase”

attribute of the Schematron file is used.

languageCode - Define the language code for the XSLT creation. Default is English. Supported

language codes are: cs, de, en, fr, nl.

An example project that uses schematron2xslt-maven-plugin can be found in the Google Code

repository at https://phloc-schematron.googlecode.com/svn/trunk/schematron2xslt-demo.

4 Benchmarks

To do