xerces2: the sequel with no equal

27
20 November 2002 ApacheCon US - Las Vegas, N evada 1 Xerces2: The Sequel With No Equal Andy Clark

Upload: jeff

Post on 06-Jan-2016

18 views

Category:

Documents


0 download

DESCRIPTION

Xerces2: The Sequel With No Equal. Andy Clark. Introduction. Speaker Worked for IBM Currently unemployed  Parser First developed in IBM’s Tokyo research lab Maintained and expanded in California Donated to Apache Work continues in Toronto. Agenda. Xerces1 Overview - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Xerces2: The Sequel With No Equal

20 November 2002 ApacheCon US - Las Vegas, Nevada 1

Xerces2:The Sequel With No Equal

Andy Clark

Page 2: Xerces2: The Sequel With No Equal

ApacheCon US - Las Vegas, Nevada 220 November 2002

Introduction

SpeakerWorked for IBMCurrently unemployed

ParserFirst developed in IBM’s Tokyo research labMaintained and expanded in CaliforniaDonated to ApacheWork continues in Toronto

Page 3: Xerces2: The Sequel With No Equal

ApacheCon US - Las Vegas, Nevada 320 November 2002

Agenda

Xerces1 OverviewDesign and problems

Xerces2 OverviewChallenges and design

Q & A

Page 4: Xerces2: The Sequel With No Equal

4ApacheCon US - Las Vegas, Nevada20 November 2002

Xerces1 Overview:Design and Problems

Andy Clark

Page 5: Xerces2: The Sequel With No Equal

5ApacheCon US - Las Vegas, Nevada20 November 2002

Design

XML4J/Xerces1 designed for performance Parser Implementation

Parsing pipelineCustom reader implementationsStringPool

Defers transcoding of byte buffers until needed Symbol table for common document strings

Page 6: Xerces2: The Sequel With No Equal

6ApacheCon US - Las Vegas, Nevada20 November 2002

Scanner Validator Parser

Intended to be generic

XML API

Pipeline Configuration

Page 7: Xerces2: The Sequel With No Equal

7ApacheCon US - Las Vegas, Nevada20 November 2002

Scanner Validator Parser

Pipeline Configuration Problems

Hard-coded dependencies on implementation Inconsistent Interfaces

XML API

Page 8: Xerces2: The Sequel With No Equal

8ApacheCon US - Las Vegas, Nevada20 November 2002

Custom Readers

ScannerEntity

Handler

ReaderStack

UTF-8Reader

UCSReader

EBCDICReader

GenericReader

scanNamescanAttValuescanContent

Page 9: Xerces2: The Sequel With No Equal

9ApacheCon US - Las Vegas, Nevada20 November 2002

Custom Readers Problems

Duplicated codeAllows more bugs to appearBugs are different based on encoding

because code is not shared More complicated

Page 10: Xerces2: The Sequel With No Equal

10ApacheCon US - Las Vegas, Nevada20 November 2002

Deferred Transcoding

XML

StringPool

ParserComponent

StringProducer

Reader

DataBuffer

DataBuffer

addString

(String):i

nt

toString(int):String

addString

(StringPr

oducer,in

t,int):int

Page 11: Xerces2: The Sequel With No Equal

11ApacheCon US - Las Vegas, Nevada20 November 2002

Deferred Transcoding Problems

All components need reference to StringPoolStrings not immediately available to methodsMust make call to StringPool to query String

Memory management is complicatedResponsibility of callee to free resourcesUses more memory

Page 12: Xerces2: The Sequel With No Equal

12ApacheCon US - Las Vegas, Nevada20 November 2002

Xerces2 Overview:Challenges and Design

Andy Clark

Page 13: Xerces2: The Sequel With No Equal

13ApacheCon US - Las Vegas, Nevada20 November 2002

Challenges

Requirements Simple design and implementation Easy to maintain More modularity and configurability Support current and future features

Design Decisions Always transcode bytes into Unicode characters

Removes StringPool and dependencies

Clean architecture

Page 14: Xerces2: The Sequel With No Equal

14ApacheCon US - Las Vegas, Nevada20 November 2002

Xerces Native Interface (XNI)

“Streaming” Information SetSimilar to SAXNo loss of document information*

Parser configuration and layering Future extensions

Native pull-parser, tree model, etc.

* Does not preserve all document information but communicates more information to the application than DOM or SAX.

Page 15: Xerces2: The Sequel With No Equal

15ApacheCon US - Las Vegas, Nevada20 November 2002

org.apache.xerces.xni org.apache.xerces.xni.parser

XMLDTDHandler

XMLDTDContentModelHandler

XMLDocumentFragmentHandler

XMLLocator

XMLDocumentHandler

NamespaceContext

XMLAttributesAugmentations

QNameXMLString

XNIException

RuntimeException

XMLPullParserConfiguration

XMLErrorHandler XMLEntityResolver

XMLDTDScanner

XMLDocumentScanner

XMLDTDContentModelSourceXMLDTDContentModelFilter

XMLDTDSourceXMLDTDFilter

XMLDocumentSourceXMLDocumentFilter

XMLComponentManager XMLComponent

XMLConfigurationException

XMLParseExceptionXMLInputSource

XMLParserConfiguration

java.lang Interface

Class

Package

Extends

XMLResourceIdentifier

Page 16: Xerces2: The Sequel With No Equal

16ApacheCon US - Las Vegas, Nevada20 November 2002

Parsing Pipeline

Handlers communicate information between parser components

Scanner Validator ParserXML API

Page 17: Xerces2: The Sequel With No Equal

17ApacheCon US - Las Vegas, Nevada20 November 2002

Handler Overview

XML

API

Document

Scanner

Validator Parser

DTD

Scanner

XMLDocumentHandler

XMLDTDHandlerXMLDTDContentModelHandler

Page 18: Xerces2: The Sequel With No Equal

18ApacheCon US - Las Vegas, Nevada20 November 2002

Parser Layout

Components and Manager

Component Manager

SymbolTable

GrammarPool

DatatypeFactory

Regular Components

Scanner ValidatorEntity

ManagerError

Reporter

Configurable Components

Page 19: Xerces2: The Sequel With No Equal

19ApacheCon US - Las Vegas, Nevada20 November 2002

Reader Management

EntityScanner

Scanner

EntityManager

ReaderStack

scanNamescanAttValuescanContent

UTF-8Reader

UCSReader

EBCDICReader

GenericReader

Page 20: Xerces2: The Sequel With No Equal

20ApacheCon US - Las Vegas, Nevada20 November 2002

Parser Configuration

Before

* Parser pipeline is part of the document parser base class.

* Required duplication to re-configure parser and still take advantage of API generator code.

XML

SAX ParserDOM Parser

Document Parser

Scanner Validator

Page 21: Xerces2: The Sequel With No Equal

21ApacheCon US - Las Vegas, Nevada20 November 2002

Parser Configuration

After

* Parser pipeline and settings are specified in a separate parser configuration object.

* Allows re-use of framework without rewriting existing code.

SAX ParserDOM Parser

Document Parser

Parser Configuration

Scanner ValidatorXML

Page 22: Xerces2: The Sequel With No Equal

22ApacheCon US - Las Vegas, Nevada20 November 2002

API Generators

Different APIs can be generated from same document parser

XNISAX ParserDOM Parser …

Document Parser

JavaBean Parser

Page 23: Xerces2: The Sequel With No Equal

23ApacheCon US - Las Vegas, Nevada20 November 2002

Sample Parser Configuration #1

HTML parserAvailable as NekoHTML download

SAX ParserDOM Parser

Document Parser

HTML Parser Configuration

HTML ScannerHTML Tag Balancer

Page 24: Xerces2: The Sequel With No Equal

24ApacheCon US - Las Vegas, Nevada20 November 2002

Non-validating parser (for performance)Available with Xerces download

SAX ParserDOM Parser

Document Parser

Non-Validating Parser Configuration

Scanner / Namespace BinderXML

Sample Parser Configuration #2

Page 25: Xerces2: The Sequel With No Equal

25ApacheCon US - Las Vegas, Nevada20 November 2002

Sample Parser Configuration #3

XInclude processingNot yet implemented

SAX ParserDOM Parser

Document Parser

XInclude Parser Configuration

ScannerXML XInclude Validator

Page 26: Xerces2: The Sequel With No Equal

26ApacheCon US - Las Vegas, Nevada20 November 2002

Sample Parser Configuration #4

Database result set converted to XMLNot yet implemented

SAX ParserDOM Parser

Document Parser

Database Parser Configuration

Database Query ValidatorDB

Page 27: Xerces2: The Sequel With No Equal

ApacheCon US - Las Vegas, Nevada 2720 November 2002

That’s All, Folks!

Question and AnswersAny questions?

Linkshttp://www.apache.org/~andyc/xml/present/

http://xml.apache.org/xerces2-j/http://www.apache.org/~andyc/neko/