IBM T. J. Watson Research Center
24 May 2006 © 2006 IBM Corporation
XML Screamer: Integrated, High-Performance XML Parsing, Validation and Deserialization
Margaret Kostoulas, Morris Matsa, Noah Mendelsohn, Eric Perkins, Abraham Heifets, Martha Mercaldi
Outline
Introduction
Why is XML Parsing Slow
XML Screamer: Design
XML Screamer: Performance
Conclusion
XML Performance
XML is everywhere
Increasingly, XML is being used in processes that demand high performance
XML is widely seen as underperforming
Validation is even worse
Often, though, XML is used for exactly the kinds of things you want to validate
Why are traditional XML parsers slow?
How fast is your computer?
An XML Parser must read through its input (and update the bytes of its output)
A 1 GHz Pentium can read through an input buffer at approximately 100 MBytes/sec, which is approximately 10 cycles/byte.
How fast are traditional XML processors?
Xerces-C 2.6, non-validating, using SAX: approximately 6 MBytes/sec/GHz, which is ~165 cycles/byte
Expat version 1.95.8 for Windows, using its UTF-8 API: about 12 MBytes/sec/GHz (~80 cycles/byte)
Remember: processor character scan rate = 100 MBytes/sec/GHz (10 cycles/byte)
What in the world are these XML processors doing for 80-160 cycles with each byte they pick up? That may be on the order of 300+ instructions per byte!
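The cycles/byte figures above follow directly from the normalized throughput. A minimal sketch of that conversion, assuming 1 MByte is taken as 10^6 bytes and 1 GHz as 10^9 cycles/sec (the function name is hypothetical, not from the talk):

```c
/* Cycles spent per input byte for a throughput already normalized to
   MBytes/sec/GHz. Assumption: 1 MByte = 10^6 bytes, 1 GHz = 10^9 cycles/sec,
   so cycles/byte = 10^9 / (throughput * 10^6) = 1000 / throughput. */
double cycles_per_byte(double mbytes_per_sec_per_ghz)
{
    return 1000.0 / mbytes_per_sec_per_ghz;
}
```

Under these assumptions, 100 MBytes/sec/GHz gives 10 cycles/byte, 12 gives roughly 83, and 6 gives roughly 167, which is consistent with the rounded figures on the slide.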
An Example
Input:
<inventoryItem> <quantity>10</quantity> ...</inventoryItem>
Schema (abbreviated syntax):
<element name="inventoryItem">
  <sequence>
    <element name="quantity">
      <simpleType base="xsd:integer">
        <maxInclusive="10000"/>
        <minInclusive="0"/>
      </simpleType>
    </element>
  </sequence>
</element>
Output:
struct inventoryItem {
    int quantity;
};
A traditional XML Parser/Deserializer
Processing the <inventoryItem> start tag:
Input:
<inventoryItem> <quantity>10</quantity> ...</inventoryItem>
UTF-8: 3c 69 6e 76 65 6e 74 6f 72 79 49 74 65 6d 3e
Convert to UTF-16: 003c 0069 006e 0076 .... 0065 006d 003e
Validate against "inventoryItem"
Throw SAX event
Match against "inventoryItem" in deserializer
Discard SAX event
Processing <quantity>10:
Input:
<inventoryItem> <quantity>10</quantity> ...</inventoryItem>
UTF-8: 3c 71 75 61 6e 74 69 74 79 3e 31 30
Convert to UTF-16: 003c 0071 0075 ... 0074 0079 003e 0031 0030
Validate against “quantity”
Throw SAX event for “quantity”
Match against “quantity” in deserializer
Convert “10” to UTF-16 and integer
Validate as 0 <= quantity <= 10000
Discard integer
SAX event for UTF-16 “10”
Convert “10” to integer (deserializer)
Discard SAX event
Copy integer to “quantity” field in structure
Traditional parsers: performance issues
Lots of expensive UTF-8 to UTF-16 conversions
String compares done in UTF-16 (typically larger)
Work duplicated between validator and deserializer
Repeated data conversions (e.g. string/integer)
Data copying
Possible object & memory management overhead for SAX events
Incremental reporting even when documents are small
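The first two items in this list can be illustrated with a small C sketch that contrasts the two styles of name matching. All function names here are hypothetical and are not taken from Xerces, Expat, or XML Screamer; the widening helper is a simplification that handles only ASCII-range bytes, whereas a real transcoder must handle multi-byte UTF-8 sequences.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layered style, step 1: widen ASCII-range UTF-8 bytes to UTF-16 code units.
   Returns the number of units written; stops at the first byte >= 0x80,
   which this simplified sketch does not handle. */
size_t widen_ascii_to_utf16(const char *src, size_t n, uint16_t *dst)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c >= 0x80)
            break;
        dst[i] = (uint16_t)c;
    }
    return i;
}

/* Layered style, step 2: compare names as UTF-16, touching twice as many
   bytes as the ASCII-range UTF-8 original. */
int name_equals_utf16(const uint16_t *a, const uint16_t *b, size_t n)
{
    return memcmp(a, b, n * sizeof(uint16_t)) == 0;
}

/* Integrated style: compare the input bytes against a UTF-8 literal
   directly, with no conversion and no intermediate buffer. */
int name_equals_utf8(const char *input, const char *literal, size_t n)
{
    return memcmp(input, literal, n) == 0;
}
```

The layered path writes an intermediate buffer and then reads it back for the comparison; the direct path reads the input once.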
XML Screamer
An Integrated Approach to High Performance XML Parsing, Validation & Deserialization
The XML Screamer Project
Goal: show how fast XML and XML Schema Processing can be
Approach: an XML Schema compiler that optimizes across layers that are traditionally separate -- scanning, parsing, validation & deserialization are integrated
What XML Screamer is…
A compiler for XML Schemas
Given a schema and a desired output API, generates a parser that:
– Parses and validates against the schema
– Populates the desired API
Screamer compiler is in Java, produces parsers in C or Java
C output much better tuned…much easier to study. All results reported here are for C.
Screamer Parser APIs
NoAPI:
– just reports well-formedness and root validity
Business object
– like gSOAP, or JAX-RPC, SDO, etc.
SAX
– knowledge of schema allows pre-computation of some SAX output
An XML Screamer Parser/Deserializer
Input:
<inventoryItem> <quantity>10</quantity> ...</inventoryItem>
UTF-8: 3c 69 6e 76 65 6e 74 6f 72 79 49 74 65 6d 3e
Validate UTF-8 against "inventoryItem"
UTF-8: 3c 71 75 61 6e 74 69 74 79 3e 31 30
Validate UTF-8 against "quantity"
Convert "1" "0" from UTF-8 to integer
Make sure 0 <= integer <= 10000
Copy integer to deserialized structure
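The steps on this slide can be sketched in C for the quantity element alone. This is an illustrative sketch, not the code XML Screamer generates: the function names are hypothetical, and it simplifies xsd:integer by accepting only unsigned decimal digits with no surrounding whitespace.

```c
#include <stddef.h>
#include <string.h>

/* Output structure from the example slide. */
struct inventoryItem {
    int quantity;
};

/* Hypothetical helper: if the input at *p starts with the UTF-8 literal,
   advance past it and return 1; otherwise leave *p unchanged and return 0. */
static int match_literal(const char **p, const char *end, const char *lit)
{
    size_t n = strlen(lit);
    if ((size_t)(end - *p) < n || memcmp(*p, lit, n) != 0)
        return 0;
    *p += n;
    return 1;
}

/* Parses "<quantity>DIGITS</quantity>" in a single pass over the UTF-8
   input: tag matching, integer conversion, range validation, and the store
   into the output structure all happen without intermediate events.
   Returns 1 on success, 0 if the input is malformed or invalid. */
int parse_quantity(const char *buf, size_t len, struct inventoryItem *out)
{
    const char *p = buf;
    const char *end = buf + len;
    int value = 0;
    int digits = 0;

    if (!match_literal(&p, end, "<quantity>"))
        return 0;

    /* Convert digits straight from UTF-8 bytes. The lower bound of 0 holds
       because no sign is accepted; the upper bound is checked per digit,
       which also keeps the arithmetic from overflowing. */
    while (p < end && *p >= '0' && *p <= '9') {
        value = value * 10 + (*p - '0');
        if (value > 10000)
            return 0;
        p++;
        digits++;
    }
    if (digits == 0)
        return 0;

    if (!match_literal(&p, end, "</quantity>"))
        return 0;

    out->quantity = value;
    return 1;
}
```

Calling parse_quantity on the bytes of <quantity>10</quantity> stores 10 in the structure, while a value such as 10001 is rejected by the range check.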
What makes XML Screamer fast?
Optimizing across layers
Avoid intermediate forms
– Don’t use SAX if you don’t need it
Avoid format conversions
– Work in input encoding wherever possible
Attention to detail
In short: think like a compiler writer!
How fast is XML Screamer?
Benchmark Reporting
Test environment
– IBM eServer xSeries Model 235, 3.2 GHz Intel Xeon, 2 GB RAM, Microsoft Windows Server 2003 Service Pack 1. Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077.
– Other machines checked for consistency – see paper
Results normalized to 1 GHz Pentium Processor
– Therefore: throughput of 3.2 MBytes/sec would be reported as 1 MByte/sec/GHz
– Scales well across Pentiums, Xeons, etc. of various clock speeds (Centrino is a bit faster per cycle)
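The normalization described above is a simple division by clock speed. A one-function sketch (the function name is hypothetical):

```c
/* Normalizes a measured throughput to a 1 GHz processor by dividing by the
   clock speed of the test machine, giving MBytes/sec/GHz. */
double normalize_throughput(double mbytes_per_sec, double clock_ghz)
{
    return mbytes_per_sec / clock_ghz;
}
```

For the 3.2 GHz test machine, a measured 3.2 MBytes/sec normalizes to 1 MByte/sec/GHz, matching the example on the slide.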
Test cases
Report on 10 separate tests over 6 schemas
All UTF-8 instances, single buffer, fits in memory.
Sizes range from 990 bytes through 116.5 KBytes.
Comparison to Non-validating Parsers
[Chart: throughput in MB/sec/GHz (axis 0 to 30) by test ID (1 to 10) for Xerces, Expat, and Screamer]
Screamer is validating and reports SAX events
Median performance improvement: 1.9x Expat, 3.8x Xerces
Business Object Creation
[Chart: throughput in MB/sec/GHz (axis 0 to 50) by test ID (1 to 5 and 8 to 10) for Xerces, Expat, and Screamer-BO]
Xerces and Expat included for reference
Median performance improvement: 2.9x Expat, 5.9x Xerces
* Business Objects not supported for tests 6 & 7
Validation Comparison
[Chart: throughput in MB/sec/GHz (axis 0 to 50) by test ID (1 to 10) for Xerces, Screamer-SAX, and Screamer-Bus.Obj]
Median performance improvement: 5.5x for SAX, 11.6x for Business Objects
* Business Objects not supported for tests 6 & 7
XML Performance Summary
XML can be parsed, validated, and deserialized into a high-performance API at a median throughput of about 35 MBytes/sec on a 1 GHz Pentium.
– That is 35% of the raw character scan rate.
On a modern 4 GHz processor, that’s 140 MBytes/sec, or 14,000 10 KByte messages/sec.
– If you want to devote only 10% of CPU to parsing, you can still do 14 MBytes/sec, or 1,400 10 KByte messages per second
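The message-rate arithmetic on this slide can be checked with a short C function. This sketch assumes 1 MByte = 10^6 bytes and a 10 KByte message = 10,000 bytes, and assumes throughput scales linearly with clock speed and CPU share; the function and parameter names are hypothetical.

```c
/* Messages per second for a parser with the given normalized throughput
   (MBytes/sec/GHz), running on a processor of clock_ghz, with cpu_fraction
   (0.0 to 1.0) of the CPU devoted to parsing. Assumes 1 MByte = 10^6 bytes
   and linear scaling with clock speed and CPU share. */
double messages_per_sec(double mbytes_per_sec_per_ghz, double clock_ghz,
                        double cpu_fraction, double message_bytes)
{
    double bytes_per_sec =
        mbytes_per_sec_per_ghz * 1e6 * clock_ghz * cpu_fraction;
    return bytes_per_sec / message_bytes;
}
```

With 35 MBytes/sec/GHz, a 4 GHz clock, and 10,000-byte messages, this gives 14,000 messages/sec at full CPU and about 1,400 at a 10% CPU share, in line with the figures above.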
Conclusions
Parsing, validation, and deserialization can run at 20-40% of the raw character scan rate.
– Probably close to the true limits of XML performance.
Validation can mean a net gain in performance, if you have the option to compile in advance.
XML Stack is designed in layers, but in implementations, layers disrupt performance.
– This has implications beyond just parsing and validation.
API Choice can make a significant difference in performance.
Related publications
Perkins, E., Kostoulas, M., Heifets, A., Matsa, M., Mendelsohn, N. Performance Analysis of XML APIs, XML 2005 Conference (http://www.idealliance.org/proceedings/xml05/abstracts/paper246.HTML)
Perkins, E., Matsa, M., Kostoulas, M., Heifets, A., Mendelsohn, N. Generation of Efficient Parsers through Direct Compilation of XML Schema. IBM Systems Journal, 45, No. 2, (May 2006).
END
Backup 1: Performance Measurements
Throughput columns are in MBytes/sec/processor GHz; all XML Screamer columns are validating. Comparison columns are throughput ratios. Size is in bytes.

| ID | Schema Filename | Size | Xerces SAX, non-val | Xerces SAX, val | Expat, non-val | Screamer NoAPI | Screamer Business Object API | Screamer SAX | Screamer anyType SAX | Screamer SAX vs. Expat | Screamer Business Object vs. Expat | Screamer SAX vs. Xerces Val SAX | Screamer: Schema vs. anyType SAX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | po | 990 | 4.41 | 2.65 | 6.85 | 35.08 | 33.88 | 16.12 | 12.75 | 2.4x | 4.9x | 6.1x | 1.3x |
| 2 | ipo | 1,406 | 4.24 | 2.51 | 6.81 | 23.23 | 23.56 | 14.76 | 14.25 | 2.2x | 3.5x | 5.9x | 1.0x |
| 3 | MI_AUS_RESPONSE2_1 | 1,572 | 3.21 | 2.98 | 5.21 | 25.95 | 23.21 | 17.00 | 7.51 | 3.3x | 4.5x | 5.7x | 2.3x |
| 4 | po | 8,062 | 6.79 | 3.01 | 13.68 | 48.00 | 44.67 | 24.73 | 16.08 | 1.8x | 3.3x | 8.2x | 1.5x |
| 5 | ipo | 8,077 | 6.08 | 2.90 | 10.31 | 38.40 | 34.21 | 21.44 | 16.46 | 2.1x | 3.3x | 7.4x | 1.3x |
| 6 | bibteXML | 8,609 | 8.28 | 5.58 | 15.54 | 47.49 | NA | 26.25 | 19.77 | 1.7x | NA | 4.7x | 1.3x |
| 7 | MI_AUS_REQUEST2_1 | 9,429 | 4.06 | 3.16 | 6.74 | 23.15 | NA | 17.79 | 8.52 | 2.6x | NA | 5.6x | 2.1x |
| 8 | po | 63,754 | 6.88 | 3.02 | 15.65 | 49.87 | 46.63 | 26.58 | 16.37 | 1.7x | 3.0x | 8.8x | 1.6x |
| 9 | ipo | 64,233 | 5.68 | 2.85 | 13.75 | 44.00 | 36.15 | 24.65 | 16.67 | 1.8x | 2.6x | 8.6x | 1.5x |
| 10 | periodic_table | 116,506 | 6.03 | 3.99 | 15.25 | 34.61 | 35.28 | 23.47 | 14.16 | 1.5x | 2.3x | 5.9x | 1.7x |
Backup 2: SAX Pre-computation
[Chart: Precomputation of SAX Events. Throughput in MB/sec/GHz (axis 0 to 30) by test ID (1 to 10) for Screamer: No precomputation (anyType) and Screamer: Events precomputed]