phloc-schematron - an introduction - © 2013 by Philip Helger – version as of May 28, 2013
phloc-schematron
Version 2.6.1, 28-05-2013, by Philip Helger - [email protected]
Table of contents
1 Introduction
  1.1 Prerequisites
2 XML document validation
  2.1 Validation via XSLT
  2.2 Validation via Pure Schematron
3 Technical details
  3.1 Usage with Maven
  3.2 Common API
    3.2.1 Validation via XSLT
    3.2.2 Validation via Pure Schematron
  3.3 Extensibility of Pure Schematron
    3.3.1 Reading
    3.3.2 New Query Binding
    3.3.3 Modify Existing Query Binding
  3.4 Maven plugin schematron2xslt
4 Benchmarks
1 Introduction
phloc-schematron is a Java library that validates XML documents via ISO Schematron
(http://www.schematron.com). It offers several ways to perform this task, each with its own
advantages and disadvantages, which are outlined in more detail below.
phloc-schematron supports only ISO Schematron and no other Schematron version.
The most common way is to convert the source Schematron file to an XSLT script and apply this XSLT
to the XML document to be validated. Alternatively, phloc-schematron offers a native
implementation of the Schematron XPath binding, which performs better than the XSLT
approach but has some minor limitations.
1.1 Prerequisites
It is assumed that you have a basic knowledge of what Schematron is and what it can do for
you. A good introduction can be found in Dave Pawson's Schematron tutorial at
http://www.dpawson.co.uk/schematron/.
It is also assumed that you have basic knowledge of the Java language, so that you can understand
the code examples, that you have at least basic understanding of XSLT (Extensible Stylesheet
Language Transformations) and that you have good knowledge of XML itself.
2 XML document validation
The goal of Schematron is to provide validation mechanisms for XML documents that go beyond
DTD and XML Schema. DTD and XML Schema both test only the structure and the data types of the
content of an XML document, whereas Schematron can check both the relations within and the
structure of an XML document.
The most basic type of validation is to check whether an XML document conforms to a set of
Schematron rules. The output of this basic check is either "true", meaning that the XML document
conforms to the Schematron rules, or "false", meaning that it does not.
Additionally, Schematron defines a result document type called "SVRL", which is
short for "Schematron Validation Report Language". It is a more complex, XML-based result that
outlines exactly which assertions failed and which reports succeeded. phloc-schematron is capable of
performing both types of validation.
2.1 Validation via XSLT
The proposed way to perform a Schematron validation is to apply a set of three pre-defined XSLT
scripts to a Schematron file. These transformations turn the original Schematron rule set
into an XSLT script, which can then be applied to XML documents for validation.
The output of this validation is an SVRL document. Because the pre-compilation step from
Schematron to XSLT is very time consuming (it can take many minutes for a mid-sized Schematron
rule set), it is strongly suggested to cache the resulting XSLT script, as it can be applied to all XML
documents to be validated. Please note that the created XSLT script depends on the selected
Schematron phase: choosing a different phase yields a different script!
phloc-schematron ships with a special Apache Maven plugin called "schematron2xslt-maven-plugin"
that can be used to create the XSLT scripts from Schematron files during build time. It is described in
more detail below.
2.2 Validation via Pure Schematron
As an alternative to the XSLT-based approach, phloc-schematron provides a pure Java
implementation, which is referred to as "Pure Schematron" within this document. Pure
Schematron is available since phloc-schematron 2.6.0. It achieves the same results
as the XSLT approach: basic validity checks and SVRL output documents.
The advantage of Pure Schematron is that you don't need to perform the time-consuming conversion
to XSLT before you can start validating. The internal steps for validating an XML document with Pure
Schematron are the following:
1. Read the Schematron resource from a file or a URL or create it manually. When reading an
existing Schematron resource, all Schematron includes are resolved, so that one large
Schematron document is created.
2. Determine the query binding to be used. phloc-schematron ships with a standard XPath
binding that will be used if none is specified.
3. Now the Schematron needs to be pre-processed, to resolve abstract patterns, abstract rules
and perform variable replacement.
4. Finally the pre-processed Schematron must be "bound". In this step a Schematron phase can
be selected which should be used. When the default query binding is used, all XPath
expressions are pre-compiled so that they can be evaluated faster. When you supply your
own query binding, you need to make sure to create an efficient representation to use as a
bound schema.
5. The resulting bound schema can now be used to validate arbitrary XML documents. Ideally it
should also be cached, like the XSLT script above, because the XPath compilation is somewhat
costly, although far less costly than the XSLT creation.
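The benefit of step 4 (compiling the XPath expressions once and reusing them for every document) can be sketched with the JDK's own javax.xml.xpath API. This is only an illustration under stated assumptions: the class PrecompiledRuleSketch and its methods are hypothetical and not part of phloc-schematron, the JDK's XPath 1.0 engine is used instead of Saxon, and a single rule with one assertion stands in for a complete bound schema.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical illustration only; NOT a phloc-schematron class. */
final class PrecompiledRuleSketch {
  private final XPathExpression context; // selects the nodes the rule applies to
  private final XPathExpression test;    // must evaluate to true for each of those nodes
  private final String message;

  PrecompiledRuleSketch(String contextPath, String testPath, String message) {
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      // "Binding": compile both expressions once, up front.
      this.context = xpath.compile(contextPath);
      this.test = xpath.compile(testPath);
    } catch (XPathExpressionException e) {
      throw new IllegalArgumentException("Invalid XPath expression", e);
    }
    this.message = message;
  }

  /** Evaluates the compiled expressions; returns one message per failing context node. */
  List<String> validate(Document doc) {
    List<String> failures = new ArrayList<String>();
    try {
      NodeList nodes = (NodeList) context.evaluate(doc, XPathConstants.NODESET);
      for (int i = 0; i < nodes.getLength(); i++) {
        Boolean ok = (Boolean) test.evaluate(nodes.item(i), XPathConstants.BOOLEAN);
        if (!ok.booleanValue()) {
          failures.add(message);
        }
      }
    } catch (XPathExpressionException e) {
      throw new IllegalStateException("XPath evaluation failed", e);
    }
    return failures;
  }

  /** Parses an XML string into a namespace-aware DOM document. */
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Cannot parse XML", e);
    }
  }

  public static void main(String[] args) {
    // The rule object is created once and reused for several documents.
    PrecompiledRuleSketch rule =
        new PrecompiledRuleSketch("//order", "count(item) >= 1", "order without item");
    System.out.println(rule.validate(parse("<orders><order><item/></order></orders>")));
    System.out.println(rule.validate(parse("<orders><order/></orders>")));
  }
}
```

The compiled expressions are held by the rule object, so repeated calls to validate do not parse the XPath strings again. Compiled JDK XPathExpression objects are not guaranteed to be thread-safe, so a cache built this way would need one instance per thread or external synchronization.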
Pure Schematron is designed for maximum extensibility, meaning that you can create your own
query binding, configure the reading and pre-processing of Schematron objects etc. The drawbacks
of Pure Schematron are currently:
• Include handling works only when you read a Schematron from a resource and not when
you create your Schematron from scratch. If you keep this in mind when creating your
Schematron files, it should not affect you much.
• XML attributes and elements from other namespaces are read from an existing Schematron
resource, but they have no impact on the validation process itself when the default query
binding is used. If you have an idea how this can be solved in a proper way, please drop me
an email.
Additionally, phloc-schematron lets you write a Schematron rule set to disk easily and
check whether a Schematron is minified, preprocessed and valid. It also
supports validating a Schematron resource against the RELAX NG Compact schema with the additional
library called “phloc-schematron-validator”. This library was externalized because it is not used in any
regular workflow and brings a lot of additional dependencies.
3 Technical details
phloc-schematron is an operating system independent Java 1.6 library. Saxon HE 9.5.0.2
(http://saxon.sourceforge.net/) is used as the underlying XPath engine. Compared to Apache Xalan 2.7.1
(http://xml.apache.org/xalan-j/) it offers more XPath functions out of the box. phloc-schematron also
depends on our OSS library phloc-commons (http://code.google.com/p/phloc-commons/).
For usage with Maven please look at the Wiki page
http://code.google.com/p/phloc-schematron/wiki/FirstSteps for details. phloc-schematron is built as
an OSGi bundle via the org.apache.felix:maven-bundle-plugin.
The full code of the examples used in this document can be found in the file
src/test/java/com/phloc/schematron/docs/DocumentationExamples.java.
3.1 Usage with Maven
phloc-schematron is built with Apache Maven. If you want to build it from source, at least Maven
3.0.4 is required. As phloc-schematron is not yet in Maven Central, you need to add the following
repository to your pom.xml:
    <repositories>
      <repository>
        <id>phloc.com</id>
        <url>http://repo.phloc.com/maven2</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>true</enabled>
        </snapshots>
      </repository>
    </repositories>
The dependency for phloc-schematron looks like this:
    <dependency>
      <groupId>com.phloc</groupId>
      <artifactId>phloc-schematron</artifactId>
      <version>2.6.1</version>
    </dependency>
It transitively contains phloc-commons, SLF4J and Saxon HE.
3.2 Common API
A common API for both the XSLT and the Pure Schematron approach is available via the
com.phloc.schematron.ISchematronResource interface. It is meant for Schematron that is
read from a file or URL. It offers the possibility to check whether the read Schematron itself is valid
via the boolean isValidSchematron () method.
To check if an XML document simply matches a Schematron rule set the methods
com.phloc.commons.state.EValidity getSchematronValidity(…) are provided. These
methods deliver either EValidity.VALID if the XML document matches the Schematron or
EValidity.INVALID if the XML document does not match at least one Schematron rule. With this
method you have no possibility to determine what the error exactly was. When using an XSLT based
implementation this method does not offer any performance improvement, as the SVRL is fully
created and analyzed afterwards. When using a Pure Schematron based implementation, the
validation stops after the first error and does not continue to validate the supplied XML document.
As an alternative to the basic validation, the interface also offers the possibility to create an SVRL
result via the methods org.w3c.dom.Document applySchematronValidation(…) and
org.oclc.purl.dsdl.svrl.SchematronOutputType applySchematronValidationToSVRL(…).
The first method type creates the SVRL only as an XML
document node, whereas the second method type applies a JAXB binding, so that it is easier to access
the information inside the SVRL. Internally these methods call each other depending on the concrete
implementation, so they are guaranteed to deliver exactly the same result. The XSLT implementation
natively produces its result in applySchematronValidation, which is then converted to a
SchematronOutputType using the com.phloc.schematron.svrl.SVRLReader class. With
Pure Schematron a SchematronOutputType object is created directly and then converted to an
XML document node via the class com.phloc.schematron.svrl.SVRLWriter.
The classes SVRLReader and SVRLWriter can generically be used to read and write SVRL files in a
structured way. Both classes validate the SVRL against the SVRL XML Schema contained in the library.
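When only the plain org.w3c.dom.Document form of an SVRL result is at hand, it can be inspected with standard DOM calls. The following is a minimal sketch using only the JDK; the class SvrlInspectionSketch and its methods are hypothetical and not part of phloc-schematron, and the SVRL snippet in main is a hand-written sample rather than real library output. How the counts map to a validity decision is left open here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical illustration only; NOT a phloc-schematron class. */
final class SvrlInspectionSketch {
  /** The SVRL namespace URI as defined by ISO Schematron. */
  static final String SVRL_NS = "http://purl.oclc.org/dsdl/svrl";

  /** Hand-written sample SVRL document used for demonstration. */
  static final String SAMPLE =
      "<svrl:schematron-output xmlns:svrl=\"" + SVRL_NS + "\">"
      + "<svrl:failed-assert test=\"item\" location=\"/orders/order[1]\">"
      + "<svrl:text>order without item</svrl:text></svrl:failed-assert>"
      + "<svrl:successful-report test=\"@draft\" location=\"/orders\">"
      + "<svrl:text>draft order list</svrl:text></svrl:successful-report>"
      + "</svrl:schematron-output>";

  /** Parses an XML string into a namespace-aware DOM document. */
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // required for getElementsByTagNameNS
      return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Cannot parse SVRL", e);
    }
  }

  /** Counts SVRL elements with the given local name, e.g. "failed-assert". */
  static int count(Document svrl, String localName) {
    return svrl.getElementsByTagNameNS(SVRL_NS, localName).getLength();
  }

  public static void main(String[] args) {
    Document svrl = parse(SAMPLE);
    System.out.println("failed asserts:     " + count(svrl, "failed-assert"));
    System.out.println("successful reports: " + count(svrl, "successful-report"));
  }
}
```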
3.2.1 Validation via XSLT
As described above, it is highly recommended to cache the XSLT script that is created from the source
Schematron rule set. Nevertheless, phloc-schematron supports both starting from a Schematron file
and starting from a precompiled XSLT script.
The easiest way to start working is by starting from a Schematron file.
com.phloc.schematron.xslt.SchematronResourceSCH is the implementation of the
ISchematronResource interface to be used for this. The constructor takes at least the
Schematron resource that contains the rules. When using this class it is possible to specify an
optional Schematron phase to be used for validation. Additionally, some static factory methods are
present that allow creating SchematronResourceSCH objects from a String path or a
java.io.File object.
If a precompiled XSLT script is present (e.g. via the schematron2xslt Maven plugin or via manual
pre-processing), the implementation class
com.phloc.schematron.xslt.SchematronResourceXSLT should be instantiated. It offers the
same constructors and factory methods as the SchematronResourceSCH class. Please recall that
the chosen phase already affected the created XSLT script, so it is not possible to specify a phase
when using this implementation.
Both implementations use an internal cache that keeps the precompiled
javax.xml.transform.Templates objects in memory while the application is running. The cache
for SchematronResourceSCH is located in the class
com.phloc.schematron.xslt.SchematronResourceSCHCache, whereas the cache for
SchematronResourceXSLT is located in the class
com.phloc.schematron.xslt.SchematronResourceXSLTCache.
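The general idea of keeping compiled javax.xml.transform.Templates objects in memory can be sketched with the JDK alone. This is not the code of SchematronResourceSCHCache or SchematronResourceXSLTCache: the class TemplatesCacheSketch, its cache key format and the tiny item-counting stylesheet are assumptions made purely for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical illustration only; NOT a phloc-schematron class. */
final class TemplatesCacheSketch {
  /** Tiny demo stylesheet: outputs the number of item elements as text. */
  static final String COUNT_ITEMS_XSLT =
      "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\"><xsl:value-of select=\"count(//item)\"/></xsl:template>"
      + "</xsl:stylesheet>";

  private final Map<String, Templates> cache = new HashMap<String, Templates>();
  private final TransformerFactory factory = TransformerFactory.newInstance();
  private int compileCount;

  /** Returns the compiled stylesheet for the key, compiling it on first use only. */
  synchronized Templates get(String key, String xslt) {
    Templates templates = cache.get(key);
    if (templates == null) {
      try {
        templates = factory.newTemplates(new StreamSource(new StringReader(xslt)));
      } catch (TransformerException e) {
        throw new IllegalArgumentException("Cannot compile XSLT", e);
      }
      compileCount++;
      cache.put(key, templates);
    }
    return templates;
  }

  /** Applies the (cached) stylesheet to an XML string and returns the output. */
  String apply(String key, String xslt, String xml) {
    try {
      StringWriter out = new StringWriter();
      // Templates is thread-safe; each call gets its own Transformer.
      get(key, xslt).newTransformer()
          .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
      return out.toString();
    } catch (TransformerException e) {
      throw new IllegalStateException("Transformation failed", e);
    }
  }

  synchronized int getCompileCount() {
    return compileCount;
  }

  public static void main(String[] args) {
    TemplatesCacheSketch cache = new TemplatesCacheSketch();
    // The key is assumed to combine resource name and phase.
    System.out.println(cache.apply("rules.sch|phase1", COUNT_ITEMS_XSLT, "<a><item/><item/></a>"));
    System.out.println(cache.apply("rules.sch|phase1", COUNT_ITEMS_XSLT, "<a><item/></a>"));
    System.out.println("compilations: " + cache.getCompileCount());
  }
}
```

Because the generated XSLT differs per Schematron phase, the sketch assumes that a cache key would have to include the phase as well as the resource.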
A simple example to validate an XML file based on Schematron rules from a file looks like this:
    public static boolean validateXMLViaXSLTSchematron (@Nonnull final File aSchematronFile,
                                                        @Nonnull final File aXMLFile) throws Exception
    {
      // Create the XSLT-based Schematron resource from the Schematron file
      final ISchematronResource aResSCH = SchematronResourceSCH.fromFile (aSchematronFile);
      if (!aResSCH.isValidSchematron ())
        throw new IllegalArgumentException ("Invalid Schematron!");
      // Basic validity check only: the result is valid or invalid, without error details
      return aResSCH.getSchematronValidity (new StreamSource (aXMLFile)).isValid ();
    }
3.2.2 Validation via Pure Schematron
For Pure Schematron the implementation of the ISchematronResource interface resides in the
class com.phloc.schematron.pure.SchematronResourcePure. The constructor also takes at
least the resource to read the Schematron rules from. Additionally, a Schematron phase and a
custom error handler can be supplied.
Be careful when using the validation methods that take a javax.xml.transform.Source object
as parameter. Only DOMSource and StreamSource objects are supported at the moment!
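Both supported Source kinds can be created with the JDK alone, and a caller can guard against other kinds before invoking the validation methods. The following sketch assumes nothing about phloc-schematron internals: SourceKindsSketch and its methods are hypothetical helper names, and the guard merely mirrors the restriction stated above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical illustration only; NOT a phloc-schematron class. */
final class SourceKindsSketch {
  /** Wraps an XML string in a StreamSource (a File or InputStream works similarly). */
  static StreamSource streamSourceOf(String xml) {
    return new StreamSource(new StringReader(xml));
  }

  /** Parses an XML string to DOM and wraps the document in a DOMSource. */
  static DOMSource domSourceOf(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
      return new DOMSource(doc);
    } catch (Exception e) {
      throw new IllegalArgumentException("Cannot parse XML", e);
    }
  }

  /** Caller-side guard: true only for the two Source kinds named in the text. */
  static boolean isSupportedKind(Source source) {
    return source instanceof DOMSource || source instanceof StreamSource;
  }

  public static void main(String[] args) {
    System.out.println(isSupportedKind(streamSourceOf("<a/>"))); // StreamSource
    System.out.println(isSupportedKind(domSourceOf("<a/>")));    // DOMSource
    System.out.println(isSupportedKind(new SAXSource()));        // neither of the two
  }
}
```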
A simple example to validate an XML file based on Schematron rules from a file looks like this:
    public static boolean validateXMLViaPureSchematron (@Nonnull final File aSchematronFile,
                                                        @Nonnull final File aXMLFile) throws Exception
    {
      final ISchematronResource aResPure = SchematronResourcePure.fromFile (aSchematronFile);
      if (!aResPure.isValidSchematron ())
        throw new IllegalArgumentException ("Invalid Schematron!");
      return aResPure.getSchematronValidity (new StreamSource (aXMLFile)).isValid ();
    }
Alternatively, you can validate via the internal API, in which case the code can look
like this:
    01 public static boolean validateXMLViaPureSchematron2 (@Nonnull final File aSchematronFile, @Nonnull final File aXMLFile) throws Exception
    02 {
    03   // Read the schematron from file
    04   final PSSchema aSchema = new PSReader (new FileSystemResource (aSchematronFile)).readSchema ();
    05   if (!aSchema.isValid ())
    06     throw new IllegalArgumentException ("Invalid Schematron!");
    07   // Resolve the query binding to use
    08   final IPSQueryBinding aQueryBinding = PSQueryBindingRegistry.getQueryBindingOfNameOrThrow (aSchema.getQueryBinding ());
    09   // Pre-process schema
    10   final PSPreprocessor aPreprocessor = new PSPreprocessor (aQueryBinding);
    11   aPreprocessor.setKeepTitles (true);
    12   final PSSchema aPreprocessedSchema = aPreprocessor.getAsPreprocessedSchema (aSchema);
    13   // Bind the pre-processed schema
    14   final IPSBoundSchema aBoundSchema = aQueryBinding.bind (aPreprocessedSchema, null, null);
    15   // Read the XML file
    16   final Document aXMLNode = XMLReader.readXMLDOM (aXMLFile);
    17   if (aXMLNode == null)
    18     return false;
    19   // Perform the validation
    20   return aBoundSchema.validatePartially (aXMLNode).isValid ();
    21 }
The code is clearly separated into the following steps:
• Read the Schematron file from a File (lines 04-06). This part contains the Schematron
include resolution.
• Determine the Schematron query binding to be used (line 08). The query binding is required
to correctly pre-process the Schematron afterwards.
• Pre-process the read Schematron file (lines 10-12). This resolves all abstract rules and
patterns.
• Create the bound Schematron (line 14). This is the pre-compilation step, depending on the
selected query binding. The second parameter, which is null in the example, is the name of the
phase to use. When no phase is passed, the defaultPhase attribute of the Schematron
schema is checked and used. If no defaultPhase is present, all patterns are active.
• Read the XML file to be validated via DOM (lines 16-18). Technical note: this is the class
com.phloc.commons.xml.serialize.XMLReader, which offers a simplified API to read
XML files and is not to be confused with org.xml.sax.XMLReader.
• Perform the Schematron validation of the read XML file (line 20).
It is important to note that in the second case no caching is performed and that the Schematron
file is interpreted each time the method is called, which may not be as efficient as possible.
Most customization can be done in the pre-processor. The Schematron ISO standard defines a
“Minimal Syntax” that is still compliant Schematron but has, among other things, all includes,
all abstract patterns and all abstract rules resolved. Because a minified Schematron has
implications for the created SVRL document, the class was named “PSPreprocessor” and not
“PSMinifier”. For example, if all <report> elements are converted to <assert> elements, the SVRL
would contain a <failed-assert> element instead of a <successful-report> element. By default the
pre-processor creates a minimal Schematron, but it offers the possibility to avoid certain
minimizations.
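The relation between <report> and <assert> can be made concrete with plain XPath: a report fires when its test is true, an assert fails when its test is false, so a report with test X behaves like an assert with test not(X), only the name of the resulting SVRL element changes. The sketch below uses the JDK's XPath engine; ReportVsAssertSketch and its methods are hypothetical names, and the negation via not(...) is an assumption about how such a conversion is typically expressed.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical illustration only; NOT a phloc-schematron class. */
final class ReportVsAssertSketch {
  /** Evaluates an XPath expression as boolean against an XML string. */
  static boolean eval(String xpath, String xml) {
    try {
      Object result = XPathFactory.newInstance().newXPath()
          .evaluate(xpath, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
      return ((Boolean) result).booleanValue();
    } catch (XPathExpressionException e) {
      throw new IllegalArgumentException("Invalid XPath or XML", e);
    }
  }

  /** A report fires (successful-report) when its test is true. */
  static boolean reportFires(String test, String xml) {
    return eval(test, xml);
  }

  /** An assert fails (failed-assert) when its test is false. */
  static boolean assertFails(String test, String xml) {
    return !eval(test, xml);
  }

  public static void main(String[] args) {
    String xml = "<orders><order draft='true'/></orders>";
    // Same situation detected, but reported under a different SVRL element name.
    System.out.println("report fires: " + reportFires("//order[@draft]", xml));
    System.out.println("assert fails: " + assertFails("not(//order[@draft])", xml));
  }
}
```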
3.3 Extensibility of Pure Schematron
3.3.1 Reading
Pure Schematron can be extended in several ways. The reading itself is not very customizable, but
after reading you can modify the created Schematron object of type
com.phloc.schematron.pure.model.PSSchema. It offers getters and setters for all elements. The
object hierarchy of PSSchema is very similar to the XML hierarchy of Schematron in general, so it
should not be too hard to handle. E.g. a PSSchema has a list of PSPattern objects, each of which
has a list of PSRule elements. Via the method IMicroElement getAsMicroElement ()
each Schematron object can easily be converted to an XML structure, which can then easily be
serialized to disk. The following example reads a Schematron file from disk, sets a <title> element and
writes the document back to the source file:
    public static boolean readModifyAndWrite (@Nonnull final File aSchematronFile) throws Exception
    {
      final PSSchema aSchema = new PSReader (new FileSystemResource (aSchematronFile)).readSchema ();
      final PSTitle aTitle = new PSTitle ();
      aTitle.addText ("Created by phloc-schematron");
      aSchema.setTitle (aTitle);
      return MicroWriter.writeToFile (aSchema.getAsMicroElement (), aSchematronFile).isSuccess ();
    }
3.3.2 New Query Binding
It is also possible to implement your own query binding that is different from the default
XPath-based query binding. To do so, you must provide a class implementing the interface
com.phloc.schematron.pure.binding.IPSQueryBinding. This
implementation class must then be registered in the
com.phloc.schematron.pure.binding.PSQueryBindingRegistry via the static method
registerQueryBinding. It is not possible to replace an existing query binding. The predefined
XPath-based query binding is registered under the names “xslt” and “xslt2” as well as for the default
(meaning unspecified) query binding. Implementing your own query binding is fairly time
consuming, as you need to implement at least the interfaces
com.phloc.schematron.pure.binding.IPSQueryBinding and
com.phloc.schematron.pure.bound.IPSBoundSchema.
3.3.3 Modify Existing Query Binding
Additionally, you may alter the existing Schematron processing either by using the Pure Schematron
API as outlined in the example above or by subclassing
com.phloc.schematron.pure.bound.PSBoundSchemaCacheKey, which offers a set of
protected methods for easy customization when using SchematronResourcePure. If you
have a customized implementation, you need to use the special SchematronResourcePure
constructor taking the Schematron IReadableResource and the PSBoundSchemaCacheKey
implementation. See the documentation in the code for details on overriding
PSBoundSchemaCacheKey.
3.4 Maven plugin schematron2xslt
The conversion of Schematron to XSLT is quite costly. That’s why phloc-schematron offers a Maven
plugin that does the conversion at build time. Because the plugin is not yet contained in Maven
Central, you need to add a custom repository and a custom pluginRepository to your pom.xml
like this:
    <repositories>
      <repository>
        <id>phloc.com</id>
        <url>http://repo.phloc.com/maven2</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>true</enabled>
        </snapshots>
      </repository>
    </repositories>
    <pluginRepositories>
      <pluginRepository>
        <id>phloc.com</id>
        <url>http://repo.phloc.com/maven2</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>false</enabled>
        </snapshots>
      </pluginRepository>
    </pluginRepositories>
By default the plugin is run in the Maven lifecycle phase “generate-resources”. The basic
configuration of the plugin in the pom.xml looks like this (inside the <build> + <plugins>
elements):
    <plugin>
      <groupId>com.phloc.maven</groupId>
      <artifactId>schematron2xslt-maven-plugin</artifactId>
      <version>2.6.1</version>
      <executions>
        <execution>
          <goals>
            <goal>convert</goal>
          </goals>
        </execution>
      </executions>
      <configuration>
        <schematronDirectory>${basedir}/src/main/schematron</schematronDirectory>
        <xsltDirectory>${basedir}/src/main/resources/xslt</xsltDirectory>
        <xsltExtension>.xsl</xsltExtension>
      </configuration>
    </plugin>
The possible configuration parameters are:
• schematronDirectory - The directory where the Schematron files reside.
• schematronPattern - A pattern for the Schematron files. Can contain Ant-style wildcards and
double wildcards. All files that match the pattern will be converted. Files in the
schematronDirectory and its subdirectories will be considered.
• xsltDirectory - The directory where the XSLT files will be saved.
• xsltExtension - The file extension of the created XSLT files.
• overwriteWithoutQuestion - Overwrite existing XSLT files without notice? If this is set
to “false” then existing XSLT files are not overwritten.
• phaseName - Define the phase to be used for XSLT creation. By default the “defaultPhase”
attribute of the Schematron file is used.
• languageCode - Define the language code for the XSLT creation. Default is English. Supported
language codes are: cs, de, en, fr, nl.
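To get a feeling for how single and double wildcards in a schematronPattern select files, the JDK's glob matcher can serve as a rough stand-in. Note the assumptions: this requires Java 7 or later, SchematronPatternSketch is a hypothetical class, and the JDK glob syntax resembles Ant-style patterns but is not identical to them, so the plugin's actual matching may differ in edge cases.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Hypothetical illustration only; NOT part of schematron2xslt-maven-plugin. */
final class SchematronPatternSketch {
  /**
   * Matches a path relative to the Schematron directory against a glob pattern.
   * JDK glob semantics are used as an approximation of Ant-style patterns.
   */
  static boolean matches(String globPattern, String relativePath) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + globPattern);
    return matcher.matches(Paths.get(relativePath));
  }

  public static void main(String[] args) {
    // A single wildcard stays within one directory level.
    System.out.println(matches("*.sch", "rules.sch"));
    System.out.println(matches("*.sch", "sub/rules.sch"));
    // A double wildcard also descends into subdirectories.
    System.out.println(matches("**/*.sch", "sub/dir/rules.sch"));
  }
}
```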
An example project that uses schematron2xslt-maven-plugin can be found in the Google Code
repository at https://phloc-schematron.googlecode.com/svn/trunk/schematron2xslt-demo.
4 Benchmarks
To do