
Parsing for XML Developers

Roger L. Costello, 28 September 2014


Flat XML Document

You might receive an XML document that has no structure. For example, this XML document contains a flat (linear) list of Book data:

<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
  <Title>Introduction to Formal Languages</Title>
  <Author>Gyorgy E. Revesz</Author>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>


Give it structure to facilitate processing


<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Graph Theory</Title>
    <Authors>
      <Author>Richard J. Trudeau</Author>
    </Authors>
    <Date>1993</Date>
    <ISBN>0-486-67870-9</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>


That’s parsing!

Parsing is taking a flat (linear) sequence of items and adding structure so that the result conforms to a grammar.
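To make the idea concrete, here is a minimal Python sketch of the flat-to-structured transformation. It is not the technique developed in these slides: it is a hand-written grouping that assumes every Book starts with a Title and that Author elements belong inside Authors. The function name add_structure and the shortened sample document are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened sample: the first two books of the flat document.
FLAT = """<Books>
  <Title>Parsing Techniques</Title>
  <Author>Dick Grune</Author>
  <Author>Ceriel J.H. Jacobs</Author>
  <Date>2007</Date>
  <ISBN>978-0-387-20248-8</ISBN>
  <Publisher>Springer</Publisher>
  <Title>Introduction to Graph Theory</Title>
  <Author>Richard J. Trudeau</Author>
  <Date>1993</Date>
  <ISBN>0-486-67870-9</ISBN>
  <Publisher>Dover Publications</Publisher>
</Books>"""


def add_structure(flat_xml):
    """Group a flat list of book data into Book and Authors elements."""
    books = ET.Element("Books")
    book = None
    authors = None
    for item in ET.fromstring(flat_xml):
        if item.tag == "Title":
            # Assumption: a Title always opens a new Book.
            book = ET.SubElement(books, "Book")
            authors = None
        if book is None:
            raise ValueError("expected a Title before " + item.tag)
        if item.tag == "Author":
            if authors is None:
                authors = ET.SubElement(book, "Authors")
            parent = authors
        else:
            parent = book
        child = ET.SubElement(parent, item.tag)
        child.text = item.text
    return books


structured = add_structure(FLAT)
print(ET.tostring(structured, encoding="unicode"))
```

This sketch does not check that the elements arrive in the order a grammar demands; it only shows what "adding structure" means for the Books data.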


Parsing

Flat (linear) Books document → parse → structured Books document (with Book and Authors elements)


From the book: “Parsing Techniques”

• Parsing is the process of structuring a linear representation in accordance with a given grammar.

• The “linear representation” may be:
  • a flat sequence of XML elements
  • a sentence
  • a computer program
  • a knitting pattern
  • a sequence of geological strata
  • a piece of music
  • actions of ritual behavior


Grammar

• A grammar is a succinct description of the structure.
• Here is a grammar for Books (+ means “one or more”):

Books → Book+
Book → Title Authors Date ISBN Publisher
Authors → Author+
Title → text
Author → text
Date → text
ISBN → text
Publisher → text
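As an illustration, the Books grammar can be written down as data and used to check whether a document conforms to it. The sketch below is one possible encoding, not something taken from the slides: each right-hand side becomes a regular expression over child tag names, None stands for "text", and the names GRAMMAR and conforms are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Each right-hand side is a regular expression over child tag names,
# where every tag name is followed by one space. None means "text".
GRAMMAR = {
    "Books": r"(Book )+",
    "Book": r"Title Authors Date ISBN Publisher ",
    "Authors": r"(Author )+",
    "Title": None,
    "Author": None,
    "Date": None,
    "ISBN": None,
    "Publisher": None,
}


def conforms(elem):
    """Return True if elem and all of its descendants follow GRAMMAR."""
    if elem.tag not in GRAMMAR:
        return False
    rhs = GRAMMAR[elem.tag]
    children = list(elem)
    if rhs is None:
        return not children  # a text rule allows no child elements
    child_tags = "".join(child.tag + " " for child in children)
    if re.fullmatch(rhs, child_tags) is None:
        return False
    return all(conforms(child) for child in children)


STRUCTURED = ET.fromstring(
    "<Books><Book><Title>Introduction to Graph Theory</Title>"
    "<Authors><Author>Richard J. Trudeau</Author></Authors>"
    "<Date>1993</Date><ISBN>0-486-67870-9</ISBN>"
    "<Publisher>Dover Publications</Publisher></Book></Books>"
)
FLAT = ET.fromstring(
    "<Books><Title>Introduction to Graph Theory</Title>"
    "<Author>Richard J. Trudeau</Author>"
    "<Date>1993</Date><ISBN>0-486-67870-9</ISBN>"
    "<Publisher>Dover Publications</Publisher></Books>"
)

print(conforms(STRUCTURED))  # True
print(conforms(FLAT))        # False
```

The flat document fails the check because the children of Books are not Book elements, which is exactly the structure that parsing has to add.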


Parsing

The parser takes two inputs, the grammar and the linear representation (the flat Books document), and produces the structured representation (the Books document with Book and Authors elements).


Parsing Techniques

• Over the last 50 years many parsing techniques have been created.
• Some parsing techniques work from the starting grammar rule down to the bottom grammar rules. These are called top-down parsing techniques.
• Other parsing techniques work from the bottom grammar rules up to the starting grammar rule. These are called bottom-up parsing techniques.
• The following slides show how to apply a powerful bottom-up parsing technique to the Books example.


What does “powerful” mean?

• The previous slide said, “… following slides show how to apply a powerful bottom-up parsing technique …”

• “Powerful” means the technique can be used with many grammars, i.e., it can be used to generate many different structures.


Suppose we were to structure the XML from scratch. We might follow these steps:

<Books> </Books>

<Books> <Book> </Book> </Books>

<Books> <Book> <Title>Parsing Techniques</Title> </Book> </Books>

<Books> <Book> <Title>Parsing Techniques</Title> <Authors> </Authors> </Book> </Books>


Follow these steps (cont.):

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
    </Authors>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
  </Book>
</Books>


Follow these steps (cont.):

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
</Books>

<Books>
  <Book>
    <Title>Parsing Techniques</Title>
    <Authors>
      <Author>Dick Grune</Author>
      <Author>Ceriel J.H. Jacobs</Author>
    </Authors>
    <Date>2007</Date>
    <ISBN>978-0-387-20248-8</ISBN>
    <Publisher>Springer</Publisher>
  </Book>
  <Book>
  </Book>
</Books>

and so forth, filling in the second Book and then the third Book.


Last step: add the last Book’s Publisher

Before the last step, the first two Book elements are complete and the third Book still lacks its Publisher:

<Book>
  <Title>Introduction to Formal Languages</Title>
  <Authors>
    <Author>Gyorgy E. Revesz</Author>
  </Authors>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
</Book>

The last step adds the Publisher element to the third Book, which completes the structured document:

<Book>
  <Title>Introduction to Formal Languages</Title>
  <Authors>
    <Author>Gyorgy E. Revesz</Author>
  </Authors>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Book>


Alternate view of the steps (a tree view)

Books

Books
  Book

Books
  Book
    Title

Books
  Book
    Title
    Authors

Books
  Book
    Title
    Authors
      Author

Books
  Book
    Title
    Authors
      Author
      Author


Alternate view (cont.)

Books
  Book
    Title
    Authors
      Author
      Author
    Date

Books
  Book
    Title
    Authors
      Author
      Author
    Date
    ISBN

Books
  Book
    Title
    Authors
      Author
      Author
    Date
    ISBN
    Publisher


Alternate view (cont.)

Books
  Book
    Title
    Authors
      Author
      Author
    Date
    ISBN
    Publisher
  Book

and so forth, filling in the second Book then the third Book


Last step: add the last Book’s Publisher

Books
  Book
    Title
    Authors
      Author
      Author
    Date
    ISBN
    Publisher
  Book
    Title
    Authors
      Author
    Date
    ISBN
    Publisher
  Book
    Title
    Authors
      Author
    Date
    ISBN
    Publisher   (the last step adds this)


Terminology: Production Step

<Books> </Books>

<Books> <Book> </Book> </Books>

<Books> <Book> <Title>Parsing Techniques</Title> </Book> </Books>

<Books> <Book> <Title>Parsing Techniques</Title> <Authors> </Authors> </Book> </Books>

Each step is called a production step


Top down

The previous slides showed the generation of the structured XML by starting from the top (root element) down to the bottom (leaf nodes).


Bottom-up parsing

In bottom-up parsing we work backward: from the last step to the first step.
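A short Python sketch can show what "working backward" means. It assumes the production steps add elements in document order, as the top-down slides do, and uses a one-Book document (the last Book of the example) so the lists stay short. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The last Book of the example, wrapped in Books.
STRUCTURED = """<Books>
  <Book>
    <Title>Introduction to Formal Languages</Title>
    <Authors>
      <Author>Gyorgy E. Revesz</Author>
    </Authors>
    <Date>2012</Date>
    <ISBN>0-486-66697-2</ISBN>
    <Publisher>Dover Publications</Publisher>
  </Book>
</Books>"""

root = ET.fromstring(STRUCTURED)

# Element.iter() visits elements in document order, which is the order
# in which the top-down production steps add them.
production_order = [elem.tag for elem in root.iter()]

# Bottom-up parsing runs through the same steps from last to first.
reduction_order = list(reversed(production_order))

print(production_order)
# ['Books', 'Book', 'Title', 'Authors', 'Author', 'Date', 'ISBN', 'Publisher']
print(reduction_order)
# ['Publisher', 'ISBN', 'Date', 'Author', 'Authors', 'Title', 'Book', 'Books']
```

The reversed list starts with Publisher, which is why the bottom-up walk-through begins by recognizing the rule Publisher → text.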


Let’s begin …

• One production step must have been the last, and its result must be visible in the linear representation.
• We recognize the rule Publisher → text in the last element, <Publisher>Dover Publications</Publisher>.

This gives us the final step in the production process (and the first step in bottom-up parsing):

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Author>Gyorgy E. Revesz</Author>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>


Next

We recognize the rule ISBN → text in the last ISBN element:

<ISBN>0-486-66697-2</ISBN>

This gives us the next-to-last step in the production process (and the second step in bottom-up parsing). The rest of the linear representation is unchanged.


Next

We recognize the rule Date → text in the last Date element:

<Date>2012</Date>

This gives us the third step in bottom-up parsing. The rest of the linear representation is unchanged.


Next

We recognize the rule Author → text in the last Author element:

<Author>Gyorgy E. Revesz</Author>

This gives us the fourth step in bottom-up parsing. The rest of the linear representation is unchanged.


Next

We recognize the rule Authors → Author+ in the last run of Author elements. This gives us the fifth step in bottom-up parsing, which wraps that run in an Authors element:

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Title>Introduction to Formal Languages</Title>
<Authors>
  <Author>Gyorgy E. Revesz</Author>
</Authors>
<Date>2012</Date>
<ISBN>0-486-66697-2</ISBN>
<Publisher>Dover Publications</Publisher>


Next

We recognize the rule Title → text in the last Title element:

<Title>Introduction to Formal Languages</Title>

This gives us the sixth step in bottom-up parsing. The rest of the representation is unchanged.


Next

We recognize the rule Book → Title Authors Date ISBN Publisher in the last five items. This gives us the seventh step in bottom-up parsing, which wraps them in a Book element:

<Title>Parsing Techniques</Title>
<Author>Dick Grune</Author>
<Author>Ceriel J.H. Jacobs</Author>
<Date>2007</Date>
<ISBN>978-0-387-20248-8</ISBN>
<Publisher>Springer</Publisher>
<Title>Introduction to Graph Theory</Title>
<Author>Richard J. Trudeau</Author>
<Date>1993</Date>
<ISBN>0-486-67870-9</ISBN>
<Publisher>Dover Publications</Publisher>
<Book>
  <Title>Introduction to Formal Languages</Title>
  <Authors>
    <Author>Gyorgy E. Revesz</Author>
  </Authors>
  <Date>2012</Date>
  <ISBN>0-486-66697-2</ISBN>
  <Publisher>Dover Publications</Publisher>
</Book>


See the algorithm?

See how we are working backwards, from the bottom grammar rules up to the starting grammar rule? In the process we are adding structure to the flat (linear) XML – neat!
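The walk-through can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the XSLT implementation mentioned at the end of the slides: it hard-codes the Books grammar, represents each item as a (symbol, value) pair, and scans the flat list from right to left. The names parse_bottom_up, BOOK_RHS, and FLAT are made up for this example. One small difference from the slides: the run of Author items is reduced to Authors at the moment the Title to its left is read.

```python
BOOK_RHS = ["Title", "Authors", "Date", "ISBN", "Publisher"]


def parse_bottom_up(flat):
    """Reduce a flat list of (tag, text) pairs to a Books tree.

    Nodes are (symbol, value) pairs, where value is text for leaves
    and a list of child nodes otherwise.
    """
    seq = []  # partially reduced items to the right of the scan position
    for tag, text in reversed(flat):
        if tag != "Author":
            # Any run of Author nodes at the front is now complete:
            # reduce it with the rule Authors -> Author+.
            n = 0
            while n < len(seq) and seq[n][0] == "Author":
                n += 1
            if n:
                seq[:n] = [("Authors", seq[:n])]
        seq.insert(0, (tag, text))  # leaf rules such as Title -> text
        if [node[0] for node in seq[:5]] == BOOK_RHS:
            # Reduce with Book -> Title Authors Date ISBN Publisher.
            seq[:5] = [("Book", seq[:5])]
    if not seq or any(node[0] != "Book" for node in seq):
        raise ValueError("input does not conform to the Books grammar")
    return ("Books", seq)  # final reduction: Books -> Book+


FLAT = [
    ("Title", "Parsing Techniques"),
    ("Author", "Dick Grune"),
    ("Author", "Ceriel J.H. Jacobs"),
    ("Date", "2007"),
    ("ISBN", "978-0-387-20248-8"),
    ("Publisher", "Springer"),
    ("Title", "Introduction to Formal Languages"),
    ("Author", "Gyorgy E. Revesz"),
    ("Date", "2012"),
    ("ISBN", "0-486-66697-2"),
    ("Publisher", "Dover Publications"),
]

tree = parse_bottom_up(FLAT)
print(tree[0], len(tree[1]))  # Books 2
```

A general bottom-up parser would read its rules from the grammar instead of hard-coding them; the sketch only mirrors the reductions shown on the preceding slides.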


Terminology: Reduction

• In bottom-up parsing a sequence of symbols is recognized as having been derived from a single symbol. For example, Title, Authors, Date, ISBN, Publisher is derived from Book:

Book
  Title  Authors  Date  ISBN  Publisher

• Title, Authors, Date, ISBN, Publisher is reduced to Book.
• So the bottom-up parsing process is a reduction process.
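A single reduction can be sketched on a plain list of symbols. The helper below is a hypothetical illustration (reduce_rightmost is not a name from the slides): it replaces the rightmost occurrence of a rule's right-hand side with the rule's left-hand symbol.

```python
def reduce_rightmost(symbols, lhs, rhs):
    """Replace the rightmost occurrence of rhs in symbols with lhs.

    Returns the new list, or None if rhs does not occur.
    """
    n = len(rhs)
    for start in range(len(symbols) - n, -1, -1):
        if symbols[start:start + n] == rhs:
            return symbols[:start] + [lhs] + symbols[start + n:]
    return None


form = ["Book", "Book", "Title", "Authors", "Date", "ISBN", "Publisher"]
reduced = reduce_rightmost(
    form, "Book", ["Title", "Authors", "Date", "ISBN", "Publisher"]
)
print(reduced)  # ['Book', 'Book', 'Book']
```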


Build your own bottom-up parser!

You now know enough to go off and build your own bottom-up parser.


I implemented a bottom-up parser

• I used XSLT to implement a bottom-up parser.
• If you would like to give my implementation a go, here is the XSLT program and a sample flat (linear) input XML document:
  • http://www.xfront.com/parsing-techniques/bottom-up-parser/bottom-up-parser-for-Books.xsl
  • http://www.xfront.com/parsing-techniques/bottom-up-parser/Books.xml