structured -document processing languages

20
Structured Structured -Document -Document Processing Languages Processing Languages (5 cp), Spring 2011 (5 cp), Spring 2011 Pekka Kilpeläinen Pekka Kilpeläinen University of Eastern Finland University of Eastern Finland School of Computing School of Computing [email protected] [email protected] SDPL 2011 1 Notes 1: Introduction

Upload: tricia

Post on 04-Jan-2016

36 views

Category:

Documents


1 download

DESCRIPTION

Structured -Document Processing Languages. (5 cp), Spring 2011 Pekka Kilpeläinen University of Eastern Finland School of Computing [email protected]. Introduction. First: Overview and Arrangements - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Structured -Document  Processing Languages

StructuredStructured-Document -Document Processing LanguagesProcessing Languages

(5 cp), Spring 2011(5 cp), Spring 2011Pekka KilpeläinenPekka Kilpeläinen

University of Eastern FinlandUniversity of Eastern Finland

School of ComputingSchool of [email protected]@uef.fi

SDPL 2011 1Notes 1: Introduction

Page 2: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 2

IntroductionIntroduction

First: Overview and ArrangementsFirst: Overview and Arrangements

What is this course about?What is this course about?

1.1 Review of Structured-Documents 1.1 Review of Structured-Documents 2.1 XML & XML Documents2.1 XML & XML Documents

Page 3: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 3

Goals of the CourseGoals of the Course

To get familiar with central models and To get familiar with central models and languages for languages for – ManipulatingManipulating– transforming and transforming and – querying querying

structured documents (or XML)structured documents (or XML) ““Generic XML processing technology”Generic XML processing technology”

– very little about specific XML applications, or very little about specific XML applications, or commercial systemscommercial systems

Page 4: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 4

NOT an Exhaustive SurveyNOT an Exhaustive Survey

Bias in selecting course topics:Bias in selecting course topics:– estimated usefulness/valueestimated usefulness/value

» centrality (centrality (→→ longer-lasting value) longer-lasting value)» maturity: Standards, or stable specifications? maturity: Standards, or stable specifications?

Robust implementations? Robust implementations?

– Lecturer up-to-date?Lecturer up-to-date?

Emphasis on programmatic access and Emphasis on programmatic access and manipulation of data in the form of documents, manipulation of data in the form of documents, rather than describing/modelling itrather than describing/modelling it

Page 5: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 5

Motivation?Motivation?

Practical relevance of documents and XMLPractical relevance of documents and XML

Interest in models of information processingInterest in models of information processing

XMLXML

Internet / intranetInternet / intranet

e.g. ordere.g. order

e.g. invoicee.g. invoice

““Programming for XML is Programming for XML is veryvery different from programming for different from programming for other data models”other data models”

- D. Florescu & D. Kossmann at - D. Florescu & D. Kossmann at SIGMOD’06SIGMOD’06

Page 6: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 6

Course OutlineCourse Outline

1 Introduction1 IntroductionOverview and ArrangementsOverview and Arrangements1.1 Structured Documents1.1 Structured Documents

2 XML Basics 2 XML Basics - XML documents, DTDs, Namespaces- XML documents, DTDs, Namespaces

3 Programmatic Manipulation of Structured Documents 3 Programmatic Manipulation of Structured Documents (XML APIs)(XML APIs)- SAX, DOM, StAX, JAXP- SAX, DOM, StAX, JAXP

Page 7: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 7

Course Outline (2)Course Outline (2)

4 Transforming Structured Documents4 Transforming Structured Documents4.1 Addressing: XPath 1.04.1 Addressing: XPath 1.04.2 XSLT 1.04.2 XSLT 1.0

5 Querying Structured Documents5 Querying Structured Documents- W3C XML Query Language XQuery- W3C XML Query Language XQuery

6 Review of the Course6 Review of the Course

Page 8: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 8

Methodological GoalsMethodological Goals

Central professional skillsCentral professional skills– consulting technical specificationsconsulting technical specifications– experimenting with SW implementationsexperimenting with SW implementations

Ability to think…?Ability to think…?– to find out relationships, reason, explain, ...to find out relationships, reason, explain, ...– to apply knowledge in new situationsto apply knowledge in new situations

("Pidgin English" for scientific communication)("Pidgin English" for scientific communication)

Page 9: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 9

AdministrationAdministration

An elective Master-level special courseAn elective Master-level special course– suitable for all specialisation lines suitable for all specialisation lines

(esp. media technology and SWE) (esp. media technology and SWE)

5 cp (5 cp ( 133 h 20 min of work!) 133 h 20 min of work!) LecturesLectures March 9 – May 5, class MT3 March 9 – May 5, class MT3

– Video broadcast to TB180 in JoensuuVideo broadcast to TB180 in Joensuu– Lecturer: Lecturer: [email protected]

NBNB!

Page 10: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 10

Administration: ExercisesAdministration: Exercises

Exercises,Exercises, March 14 – May 9 March 14 – May 9– essential for learning the technologyessential for learning the technology– normal homework assignments, hands-on practice; normal homework assignments, hands-on practice;

Solutions discussed in classSolutions discussed in class– Grading: one prepared solution -> one activity point Grading: one prepared solution -> one activity point

Plan: Plan: – some some solutions shall be returned to Moodle, solutions shall be returned to Moodle,

for potential feedback and scaling of activity points by solution for potential feedback and scaling of activity points by solution quality quality

– These will be announced on a case-by-case basisThese will be announced on a case-by-case basis

Page 11: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 11

Administration: GradingAdministration: Grading

Final Final examexam on Friday, May 13 on Friday, May 13– ≥≥ 50% of exam points required to pass50% of exam points required to pass

Grade = Grade = round(6*Exam/MaxExam + round(6*Exam/MaxExam +

2*HomeWork/MaxHomeWork - 2.5)2*HomeWork/MaxHomeWork - 2.5)

Retake exam on Friday June 10Retake exam on Friday June 10– again again 50% to pass 50% to pass– better of grades with/without homework creditsbetter of grades with/without homework credits

Page 12: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 12

MaterialMaterial

No single textbookNo single textbook Reports, specifications, articlesReports, specifications, articles Course home pageCourse home page

– http://www.cs.uku.fi/~kilpelai/RDK11/– slides, exercises, references, announcementsslides, exercises, references, announcements

Possible background texts (See home page):Possible background texts (See home page):– Deitel et al; MDeitel et al; Møller & Schwartzbach; Key; Walmsleyøller & Schwartzbach; Key; Walmsley

Page 13: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 13

Background CheckBackground Check

Basic knowledge of structured documents and document Basic knowledge of structured documents and document standardsstandards– Course ”Introduction to Document standards"?Course ”Introduction to Document standards"?

Programming languages and conceptsProgramming languages and concepts– Java? OO programming?Java? OO programming?– Functional programming? SQL?Functional programming? SQL?– Unix/Linux vs. Windows?Unix/Linux vs. Windows?

Formal language theory Formal language theory – Theory of ComputationTheory of Computation– regular expressions, context-free grammars, parse trees?regular expressions, context-free grammars, parse trees?

Page 14: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 14

Students' Expectations?Students' Expectations?

Page 15: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 15

1.1. Structured Documents1.1. Structured Documents

DocumentDocument: : – a structured representation of information on some a structured representation of information on some

medium (medium ( message) message)

– normally for a human readernormally for a human reader» memos, manuals, articles, books, …memos, manuals, articles, books, …

– also application-to-application messagesalso application-to-application messages» e.g., btw client and server in e.g., btw client and server in Web ServicesWeb Services

– "prose-oriented XML" vs "data-oriented XML""prose-oriented XML" vs "data-oriented XML"– can be treated as a unit can be treated as a unit

» e.g., a web page vs a web site?e.g., a web page vs a web site?

Page 16: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 16

Text-Based DocumentsText-Based Documents

We concentrate on textual or text-based We concentrate on textual or text-based documentsdocuments– character data major constituent of information character data major constituent of information

contentcontent– as opposed to, say multimedia documents as opposed to, say multimedia documents

Next: Presentation vs Structure Next: Presentation vs Structure

Page 17: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 17

Presentation vs StructurePresentation vs Structure

Presentation informs the Presentation informs the human readerhuman reader about the about the meaning of text and the role of its partsmeaning of text and the role of its parts

Markup (Markup (merkkausmerkkaus)) indicates the presentation or the indicates the presentation or the meaning of different parts of text meaning of different parts of text – originally hand-written annotations for the typesetter originally hand-written annotations for the typesetter – nowadays primarily codes embedded in digital nowadays primarily codes embedded in digital

documents; documents; <Tags><Tags>

Page 18: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 18

MarkupMarkup

Procedural markup Procedural markup – formatting commands (start boldface, produce an formatting commands (start boldface, produce an

empty line, indent 5 mm, ...)empty line, indent 5 mm, ...)– proprietary word processor formats, nroff, TeX, ...proprietary word processor formats, nroff, TeX, ...

Descriptive Descriptive oror generic markup generic markup– indicating the logical structure of text using chosen indicating the logical structure of text using chosen

namesnames– LaTeX: LaTeX: \begin{abstract} ... \end{abstract} \begin{abstract} ... \end{abstract} – HTML: HTML: <TITLE> .... </TITLE><TITLE> .... </TITLE>

Markup language (Markup language (merkkauskielimerkkauskieli))– a fixed set of markup notations (e.g. nroff, TeX, HTML, a fixed set of markup notations (e.g. nroff, TeX, HTML,

SVG, …) SVG, …)

Page 19: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 19

Structured Documents?Structured Documents?

Most liberally, Most liberally, anyany document is structured document is structured (punctuation, words, sentences, fields, …)(punctuation, words, sentences, fields, …)

but especially descriptively marked-up but especially descriptively marked-up documents ... (e.g. well-formed XML)documents ... (e.g. well-formed XML)

especially if they adhere to a rigorous especially if they adhere to a rigorous specification of structure (e.g. XML+DTD)specification of structure (e.g. XML+DTD)

Page 20: Structured -Document  Processing Languages

SDPL 2011 Notes 1: Introduction 20

Structure in DocumentsStructure in Documents

HierarchyHierarchy or or nestingnesting is ubiquitous is ubiquitous– chapters of books, warnings in maintenance chapters of books, warnings in maintenance

manuals, ...manuals, ... Linear orderLinear order essential in prose documents essential in prose documents

– less important in representations of data objectsless important in representations of data objects HypertextHypertext and and cross-referencescross-references

We'll be mainly dealing with manipulation of We'll be mainly dealing with manipulation of hierarchical, or tree-like document structureshierarchical, or tree-like document structures

Next: How are these modelled?Next: How are these modelled?