1 CS 502: Computing Methods for Digital Libraries Lecture 4 Text

Upload: august-kelly

Post on 27-Dec-2015


CS 502: Computing Methods for Digital Libraries

Lecture 4

Text


Administration

• Assignment 1 submission problems:

Due date postponed to Thursday 12:20

Demonstration by Dean Eckstrom

• Wednesday discussion classes:

Olin 155, 7:30-8:25 and 8:35 to 9:00

Check Notices for sections


Digital Libraries and Checking Information

Email to Teaching Assistants:

"I have heard that ..."

"There is a rumor that ..."

Authoritative source(s):

Course web site -- Notices


Text

The richness of text

• Elements: letters, scripts, symbols

• Structure: words, sentences, paragraphs, headings, tables

• Appearance: fonts, layout, design, materials

• Special: mathematics, music

Digital libraries must represent every variant!


Markup and Page Description

Mark-up languages represent the structure of text

e.g., SGML, XML

The mark-up must be combined with a style sheet for rendering.

Page description languages represent the appearance of text

e.g., PostScript, PDF


Markup and Style Sheets

[Diagram: the document content and structure, together with a style sheet, are passed to rendering software, which produces the formatted document.]


Alternative Renderings

[Diagram: the same document content and structure is rendered twice. With a style sheet for display, rendering software produces a computer display. With a style sheet for print, rendering software produces a printed document.]
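The separation of structure from appearance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document, the two style sheets, and the `render` function are all hypothetical, and a real style sheet language is far richer than a table of functions.

```python
# Hypothetical document: structure and content only, no appearance information.
document = [
    ("heading", "Text"),
    ("paragraph", "Digital libraries must represent every variant."),
]

# Hypothetical style sheets: one formatting rule per structural element.
display_style = {
    "heading": lambda s: s.upper(),
    "paragraph": lambda s: s,
}
print_style = {
    "heading": lambda s: s + "\n" + "=" * len(s),
    "paragraph": lambda s: "    " + s,
}


def render(doc, style_sheet):
    """Apply the style rule for each element and join the results."""
    return "\n".join(style_sheet[kind](text) for kind, text in doc)


# The same structured document yields two different renderings.
display_output = render(document, display_style)
print_output = render(document, print_style)
print(display_output)
print(print_output)
```

The document itself never changes; only the style sheet passed to the renderer does.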


Example: the Oxford English Dictionary

• Typography of printed text represented semantic information.

• Keyboard the text, capturing all typographic information.

• Automatic parser to extract semantics (e.g., date, quotation, phonetics, etc.).

• Markup in SGML to tag semantic information.

• Separate style sheets for the various editions: print, CD-ROM, online.

• The work predates the web, yet the marked-up text is used with the web.


Character

Distinguish between

• the abstract character as a structural element,

"A"

• representations of the character

A A A A 1000001 A A "capital a"


ASCII

A binary encoding of a character as an 8-bit byte, e.g., 01000001 is the encoding for "A"

[Diagram: code values from 0 to 255. Standard (7-bit) ASCII covers 0 to 127, with printable ASCII beginning at 32; extended (8-bit) ASCII extends the range to 255.]
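The encoding of "A" quoted on this slide can be checked directly. The following is a minimal sketch using only Python built-ins; the variable names are illustrative.

```python
ch = "A"

code = ord(ch)                 # numeric code of the abstract character
bits = format(code, "08b")     # the same code written as an 8-bit byte
encoded = ch.encode("ascii")   # the byte actually stored or transmitted

# Standard ASCII uses only 7 bits, so the leading bit of the byte is 0.
is_standard_ascii = code < 128

# Going the other way: the bit pattern decodes back to the character.
decoded = chr(int(bits, 2))

print(code, bits, encoded, is_standard_ascii, decoded)
```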


Unicode

Unicode

• 16-bit codes that represent distinct characters (the original design; later versions of Unicode extend the code space to U+10FFFF)

• organized by scripts, not languages

• compatible with Unihan (Chinese, Japanese, Korean)
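The organization by script shows up in the official character names, which Python exposes through its standard `unicodedata` module. The sample characters below are an arbitrary choice for illustration, and `describe` is a hypothetical helper.

```python
import unicodedata


def describe(ch):
    """Return the code point (as U+XXXX) and the official Unicode name."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


# One character each from the Latin, Greek, Cyrillic, and Han scripts.
samples = ["A", "\u03b1", "\u0416", "\u4e2d"]
described = [describe(ch) for ch in samples]

for code_point, name in described:
    print(code_point, name)
```

Each name begins with the script (or, for Han, the unified ideograph block), not with a language.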


Scripts

Scripts supported by Unicode 2.0

Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Latin, Lao, Malayalam, Oriya, Phonetic, Tamil, Telugu, Thai, Tibetan


More Scripts

Numbers; General Diacritics; General Punctuation; General Symbols; Mathematical Symbols; Technical Symbols; Dingbats; Arrows, Blocks, Box Drawing Forms & Geometric Shapes; Miscellaneous Symbols; Presentation Forms


Unicode and UTF-8

UTF-8

• a stream encoding of Unicode characters.

• one to six bytes to represent each Unicode character in the original definition (RFC 3629 later limited this to four); the number of leading ones in the first byte gives the length of the sequence.

• single-byte characters are identical to 7-bit ASCII, e.g., 01000001 has no leading one, therefore it is a single-byte code.
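The leading-ones rule can be sketched as follows. The function names `leading_ones` and `sequence_length` are hypothetical, and the sketch only reads the first byte; it does not validate the continuation bytes that follow.

```python
def leading_ones(byte):
    """Count the consecutive 1 bits at the high end of an 8-bit value."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def sequence_length(first_byte):
    """Length in bytes of the UTF-8 sequence that starts with first_byte."""
    n = leading_ones(first_byte)
    if n == 0:
        return 1  # no leading one: a single-byte (ASCII) code
    if n == 1:
        raise ValueError("continuation byte, not the start of a sequence")
    return n      # n leading ones: an n-byte sequence


for ch in ["A", "\u00e9", "\u4e2d"]:
    data = ch.encode("utf-8")
    print(ch, [format(b, "08b") for b in data], sequence_length(data[0]))
```

In each case the length computed from the first byte matches the number of bytes Python's own encoder produces.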


Markup Languages

SGML (Standard Generalized Markup Language)

A system for creating markup languages that represent the structure of a document

XML (eXtensible Markup Language)

A simplified version of SGML intended for use with online information

DTD (Document Type Definition)

A markup specification for a class of documents, defined within the SGML framework

HTML (Hypertext Markup Language)

A markup and formatting language with links to other objects


XML Example (Metadata)

<?xml version="1.0"?>
<!DOCTYPE dlib-meta0.1 SYSTEM "http://www.dlib.org/dlib/dlib-meta01.dtd">
<dlib-meta0.1>
  <title>Digital Libraries and the Problem of Purpose</title>
  <creator>David M. Levy</creator>
  <publisher>Corporation for National Research Initiatives</publisher>
  <date date-type = "publication">January 2000</date>
  <type resource-type = "work">article</type>
  <identifier uri-type = "DOI">10.1045/january2000-levy</identifier>
  <identifier uri-type = "URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
  <language>English</language>
  <relation rel-type = "InSerial">
    <serial-name>D-Lib Magazine</serial-name>
    <issn>1082-9873</issn>
    <volume>6</volume>
    <issue>1</issue>
  </relation>
  <rights>Copyright (c) David M. Levy</rights>
</dlib-meta0.1>
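Because the record is well-formed XML, a generic parser can read it without knowing anything about D-Lib metadata. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the record; the variable names are illustrative, and the parser reads the DOCTYPE line but does not fetch or validate against the external DTD.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the metadata record from the slides.
record = """<?xml version="1.0"?>
<!DOCTYPE dlib-meta0.1 SYSTEM "http://www.dlib.org/dlib/dlib-meta01.dtd">
<dlib-meta0.1>
  <title>Digital Libraries and the Problem of Purpose</title>
  <creator>David M. Levy</creator>
  <date date-type = "publication">January 2000</date>
  <identifier uri-type = "DOI">10.1045/january2000-levy</identifier>
  <identifier uri-type = "URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
  <relation rel-type = "InSerial">
    <serial-name>D-Lib Magazine</serial-name>
    <issn>1082-9873</issn>
  </relation>
</dlib-meta0.1>"""

root = ET.fromstring(record)

# Element content is reached by tag name, attributes by attribute name.
title = root.findtext("title")
date_type = root.find("date").get("date-type")
identifiers = {e.get("uri-type"): e.text for e in root.findall("identifier")}
issn = root.findtext("relation/issn")  # nested element inside <relation>

print(title, date_type, identifiers["DOI"], issn)
```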


Constructing a DTD: Entities

Entities are basic units of information:

• Character entities

a b ... z 0 1 ... 9 ! ? ...

&lt; &alpha;

• Any other entities

&logo; &square-root;


Entities

• The name of an entity is purely mnemonic. It makes no assertions about the context in which the entity is used or its appearance when rendered.

• The DTD used by a scientific publisher will have about 4,000 entities to represent all the special symbols and the variants used in scientific disciplines.
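A small sketch of entity expansion, assuming a document that declares its own entity in an internal DTD subset. The entity name `logo` comes from the slide, but its replacement text ("Example Press") and the `note` element are hypothetical. `&lt;` is one of the entities predefined by XML itself.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares one entity; the name is purely mnemonic.
doc = """<!DOCTYPE note [
  <!ENTITY logo "Example Press">
]>
<note>Published by &logo;. Note that 1 &lt; 2.</note>"""

note = ET.fromstring(doc)

# The parser replaces each entity reference with its declared value.
text = note.text
print(text)
```

The application sees only the expanded text; how the entity's value is eventually displayed is left to the rendering step.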


Constructing a DTD: Elements

Elements define the structure.

An element is a string of entities, bracketed by tags:

<p>This is a paragraph.</p>

<heading1>Some heading</heading1>

<author>Jane Austen</author>

<manuscript>John Hancock</manuscript>


Constructing a DTD: Grammar

Every DTD has a grammar that defines:

• allowable relationships between entities and elements

• hierarchies and nesting

• etc.

The grammar is expressed as a set of rules that can be processed automatically.
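To illustrate what "processed automatically" can mean, here is a deliberately simplified sketch: the grammar is reduced to a table saying which elements may appear inside which. The `article` element, the `GRAMMAR` table, and the `violations` function are hypothetical. A real DTD states such rules in its own syntax, for example `<!ELEMENT article (heading1, author, p+)>`, and also constrains order and repetition, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: the child elements each element is allowed to contain.
GRAMMAR = {
    "article": {"heading1", "author", "p"},
    "heading1": set(),
    "author": set(),
    "p": set(),
}


def violations(element, grammar=GRAMMAR):
    """Return a list of nesting rules that the element tree breaks."""
    if element.tag not in grammar:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in grammar[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(violations(child, grammar))
    return problems


good = ET.fromstring(
    "<article><heading1>Some heading</heading1>"
    "<author>Jane Austen</author><p>This is a paragraph.</p></article>"
)
bad = ET.fromstring("<article><p><heading1>Some heading</heading1></p></article>")

good_result = violations(good)
bad_result = violations(bad)
print(good_result)
print(bad_result)
```

The first document satisfies every rule in the table; the second nests a heading inside a paragraph, which the table does not allow.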