Hypertext (1)


Page 1: Hypertext (1)

Hypertext (1)

• Historically, text is sequential: read from beginning to end

• Hypertext is non-sequential, with internal links from one part to another

• Hypertext, the word, coined by Ted Nelson in 1966.

• First hypertext system, Xanadu, named for Coleridge’s magical world.

Page 2: Hypertext (1)

Hypertext (2)

Links in hypertext give access to:

• topics or information directly related to the current idea

• notes, such as footnotes or endnotes

• explanations of special words or phrases

• biographical information about people behind the current idea

Page 3: Hypertext (1)

Claims about Hypertext

• Represents large body of information organized into numerous fragments

• Fragments relate to one another

• User needs only a small fraction of the fragments at any time

• Exists only in cooperation with the reader

• Is a legitimate literary concept

Page 4: Hypertext (1)

Claims about Hypertext (2)

• Integrates three technologies
  – Publishing (as a book publisher would)
  – Computing (as the infrastructure)
  – Broadcasting (over a computer network)

• Depends on computer environment for high-speed transitions between nodes

• Modelled by network ADT

Page 5: Hypertext (1)

Using Hypertext

• Browser, or hypertext engine: a computer-based system that allows links to be followed easily

• Navigation aids: parts of the user interface that provide a sense of location and direction

• Notation: a convenient way of specifying links as a hypertext author

Page 6: Hypertext (1)

WWW as a Hypertext System

• Browser: Netscape, for example

• Navigational aids:
  – Forward, back, home
  – History list
  – Colored anchors
  – Consistent titles

• Notation: HTML

Page 7: Hypertext (1)

Network ADT

• Model of hypertext

• Similar to tree ADT, but allows cycles

• Links have an explicit direction, capturing the idea of going forward and going back

Page 8: Hypertext (1)

Network ADT (2)

• Definition: A network is a collection of nodes and links between pairs of nodes such that
  – Each link has a direction.
  – Each node is reachable from any other node. However, the path is not necessarily unique.
  – No node is linked to itself.
  – There are no duplicate links in the same direction.
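
The definition above maps directly onto a small data structure. Below is a minimal Python sketch (not from the slides; the class and method names are illustrative) that stores directed links, rejects self-links and duplicate links, and checks reachability while allowing reverse travel along a link.

```python
# Minimal sketch of the network ADT described above (illustrative, not from the slides).
# Nodes are labels; links are directed pairs. add_link enforces the rules: no self-links
# and no duplicate links in the same direction. Reachability follows links in either
# direction, matching the "reverse travel is possible" observation.

from collections import defaultdict

class Network:
    def __init__(self):
        self.links = set()            # set of (source, target) pairs
        self.out = defaultdict(set)   # forward adjacency
        self.inc = defaultdict(set)   # backward adjacency

    def add_link(self, a, b):
        if a == b:
            raise ValueError("no node may be linked to itself")
        if (a, b) in self.links:
            raise ValueError("no duplicate links in the same direction")
        self.links.add((a, b))
        self.out[a].add(b)
        self.inc[b].add(a)

    def reachable(self, start, goal):
        # Simple graph search; links may be traversed forwards or backwards.
        seen, frontier = {start}, [start]
        while frontier:
            node = frontier.pop()
            if node == goal:
                return True
            for nxt in self.out[node] | self.inc[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return False

net = Network()
net.add_link("home", "page1")
net.add_link("page1", "page2")
net.add_link("page2", "home")           # cycles are allowed
print(net.reachable("page2", "page1"))  # True
```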

Page 9: Hypertext (1)

Network ADT (3)

• Observations:
  – There is no hierarchy; all nodes are considered the same. (In a tree, the root is special.)
  – Links have direction, but reverse travel is possible. (One can go backwards on a link, or forwards on a link that goes in the opposite direction.)
  – Cycles are allowed.

Page 10: Hypertext (1)

Directed Graphs

• Both networks and rooted trees are examples of a connected directed graph, sometimes called a digraph.

• Formally, a digraph is a set of nodes and a set of links joining ordered pairs of nodes. The link (A,B) that joins A to B is different from the link (B,A) that joins B to A.

Page 11: Hypertext (1)

Navigation in Sequential Text

• Low level:
  – Punctuation
  – Fonts
  – Separation into sentences and paragraphs

• High level:
  – Chapters, sections, subsections
  – Table of contents
  – Index

Page 12: Hypertext (1)

Navigation in Sequential Text (2)

• Page layout
  – Page numbers
  – Running heads
  – Displayed text

Page 13: Hypertext (1)

Navigating in Hypertext

• Issues:
  – Where am I? Have I been here before? When?
  – How did I get here?
  – Where can I go?
    • Anchors (or links)
    • Implicit anchors (or links): clipboard, glossary, calculator
    • Computed links: next train
    • Back
    • Forward
    • Home

Page 14: Hypertext (1)

Navigating in Hypertext (2)

• Within a node:
  – Save to disk
  – Print
  – Annotate
  – Scroll
  – Zoom

Page 15: Hypertext (1)

Navigating in Hypertext (3)

• User interface support
  – Give power to the users through
    • short response time
    • low cognitive load
    • path clues, perhaps decaying over time
  – Follow a path forward or backward
  – Return to a node

Page 16: Hypertext (1)

Text Markup

• Unified view of text and hypertext presentation

• Foundation of all word processors

• Describes all electronic manuscripts by
  – separating logical elements
  – specifying processing functions for these elements

Page 17: Hypertext (1)

Text Markup (2)

• Originated by William Tunnicliffe (Sept. 1967), in a talk advocating separating a document's information content from its format

• Control formatting with embedded codes

Page 18: Hypertext (1)

Generalized Markup

• Goal: allow editing, formatting, and retrieval systems to share documents

• Devised by Goldfarb, Mosher, Lorie at IBM, 1969

• Formally defined
  – document types
  – explicit nested element structure
  – generic identifier associated with each element

Page 19: Hypertext (1)

SGML

• Standard Generalized Markup Language

• First draft standard, 1980

• ISO 8879, 1986

• Based on the tree ADT

• Allows the description of a document, considered as a tree, to be embedded in the file containing the document

Page 20: Hypertext (1)

Functions of SGML

• Tags documents in a formal language

• Describes internal logical structures

• Links files with an addressing scheme

• Acts as a database language for text

• Accommodates multimedia and hypertext

• Provides a grammar for style sheets

• Allows coded text reuse in surprising ways

Page 21: Hypertext (1)

Functions of SGML (2)

• Represents documents independent of computing platform

• Provides a standard for transferring documents among platforms and applications

• Acts as a metalanguage for document types

• Represents hierarchies

• Extends to accommodate new document types

Page 22: Hypertext (1)

Generic Identifiers

• Tagging vs. formatting
  – Tagging shows document structure
  – Formatting describes document display
  – Example: A paragraph is a sequence of closely connected sentences and can be delimited by a tag. A paragraph can be displayed with either
    • initial indenting or not
    • extra separation or not

Page 23: Hypertext (1)

Generic Identifiers (2)

• Syntax
  – Beginning: <identifier>
  – End: </identifier>

• Attribute list, with assigned values, may follow identifier

Page 24: Hypertext (1)

Generic Identifiers (3)

• Typical identifiers:
  – p   paragraph
  – q   quotation
  – ol  numbered (ordered) list
  – ul  unnumbered list
  – li  list item
  – b   bold face
  – i   italics

Page 25: Hypertext (1)

Display of Text

• ASCII codes for printing characters carry no information about display

• Printed or displayed characters are described by their font.

Page 26: Hypertext (1)

Fonts

• Fonts come in families, which are groups of fonts with similar design characteristics.

• A font is a set of displayed characters in a particular design. To describe a font, we specify:
  – The font face, or typeface, which is the design of the font.
  – The size, measured in points, which is the height of representative characters.
  – The appearance: bold, italic, underline, outline, shadow, small cap, redline, strikeout, etc.

Page 27: Hypertext (1)

Fonts (2)

• Font families include standard modifications of a base font, such as italics and bold, to change the appearance. (This family is Times New Roman.)

• Some families are sans serif, without the cross strokes accentuating the ends of the main strokes.

Page 28: Hypertext (1)

Fonts (3)

• Typical examples of fonts are
  – Times New Roman
  – Arial
  – Century Schoolbook
  – Lucida Calligraphy
  – Verdana

Page 29: Hypertext (1)

Fonts (4)

• The size of this font is 32 points

• This is 54 points

• This is 24 points

• There are 72.27 traditional printer's points per inch (desktop publishing software rounds this to exactly 72)

Page 30: Hypertext (1)

Fonts (5)

To render a character in a font, one must

• Know the computer code (ASCII) of the character

• Know the font name and properties

Then the computer creates the glyph that represents the character in the specified font.

Page 31: Hypertext (1)

Fonts (6)

In the process, the computer uses the following to form and locate the glyph:

• Baseline: the invisible line on which characters are aligned.

• x-height: the actual height of the character x.

• Kerning: spacing between two letters. Note that in printing “wo” the “o” slides under the “w”.

Page 32: Hypertext (1)

Input devices for text

• Keyboard

• Scanning with optical character recognition
  – Hand printed
  – Hand written (cursive)
  – Machine printed

• Voice recognition

• Pen-based

Page 33: Hypertext (1)

Input errors

• Human-based, e.g.
  – Typographic
  – Poor writing

• Machine dependent
  – Small typeface differences: O vs. D

• Limits of technology

• Pre-existing errors

Page 34: Hypertext (1)

Automatic error correction

• 98% OCR accuracy plus automatic correction gives an error rate comparable to keyboard input

• Automatic correction also helpful in:
  – Computer-aided authoring
  – Communication enhancement for the disabled
  – Natural language responses
  – Database interaction

• Example: MS Word AutoCorrect

Page 35: Hypertext (1)

Automatic spelling correction

• Three increasingly difficult tasks:
  – Non-word detection: string in text not in dictionary
  – Isolated word correction: thier automatically becomes their
  – Context-dependent correction: here automatically becomes hear

Page 36: Hypertext (1)

MS Word AutoCorrect

Page 37: Hypertext (1)

General spelling correction

• Can allow human intervention, e.g. choose the correct spelling from a list of candidates

• No context dependent general purpose correction tool exists yet.

Page 38: Hypertext (1)

Issues for spelling correction

• Type of input device
  – Focus on adjacent keys: b vs. n
  – Focus on similar shapes: O vs. D

• Interactive vs. automatic correction
  – How many choices are reasonable? (One for automatic correction.)
  – How accurate should guesses be?

• Proper choice of dictionary

Page 39: Hypertext (1)

Proper Dictionary

Page 40: Hypertext (1)

Word list choice

• Use a lexicon: a word list appropriate to a particular topic

• As opposed to a dictionary: a comprehensive list of words

• Include provision for adding new words

Page 41: Hypertext (1)

Word list choice: Example 1

• Compare NY Times news wire text with Webster’s 7th Collegiate Dictionary

• 8 million words in news wire text:
  – only 36% in dictionary
  – only 39% of dictionary words used in text

Page 42: Hypertext (1)

Example 1 (continued)

• Of text words not in dictionary
  – 1/4 inflected forms (change in case, gender, tense)
  – 1/4 proper names
  – 1/6 hyphenated forms
  – 1/12 misspellings
  – 1/4 unresolved by investigators (new words, etc.)

• How to handle proper names?

Page 43: Hypertext (1)

Example 2

• Corpus of 22 million words from a variety of genres

• Effect of changing lexicon from 50,000 to 60,000 words?
  – Eliminated 1348 false rejections (words are now included in lexicon)
  – Created 23 false acceptances (originally misspelled, now occur in lexicon and are therefore treated as correctly spelled)

Page 44: Hypertext (1)

Unintentionally correct spellings

• Misuse of word: there for their, to for too

• Typo: from for form

• Quote from Mozart: I’ll see you in five minuets

Page 45: Hypertext (1)

Issues in detection

• Given document as a sequence of words, lexicon as ordered list of words, report all document words not in lexicon, but:

• How to handle upper case letters?

• How to handle suffixes and prefixes?

• What definition of word to use?

Page 46: Hypertext (1)

Issues in detection (2)

• Upper case: Change all to lower case
  – Handles first word of sentence and proper names that are words: Bob Brown
  – Confuses: DEC (ok), Dec (abbreviation), dec (misspelling)
  – Must put back capitalization

Page 47: Hypertext (1)

Types of errors

• From keyboard input, 80% of misspellings are one of
  – Insertion
  – Deletion
  – Substitution, especially nearby keys
  – Transposition

• Few errors occur in first letter

• Mostly, length is same or changes by 1

Page 48: Hypertext (1)

Suggestion Strategies

• Words with same first letter first

• Order rest by change in length

Page 49: Hypertext (1)

Types of errors (2)

• Improper spacing: run-ons or splits
  – Significant unsolved problem

• Cognitive
  – recieve for receive; procede for proceed
  – conspiricy for conspiracy; mispell for misspell

• Phonetic
  – abiss for abyss; nacherly for naturally

Page 50: Hypertext (1)

Spelling Rules

• I before E except after C

• Only exceed, succeed, and proceed end in -ceed; all others end in -cede, except supersede (which ends in -sede)

Page 51: Hypertext (1)

Suggestion Strategies (2)

• Words with same first letter first

• Order rest by change in length

• Use standard spelling rules

Page 52: Hypertext (1)

Suggestion Principles

• Edit distance: The minimum number of insertions, deletions, or substitutions needed to change one string to another, defined by Levenshtein in 1966

• Provide suggestions in increasing order of edit distance
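
As an illustration of the definition, here is the standard dynamic-programming computation of Levenshtein edit distance; it is a generic sketch, not code from the course.

```python
# Sketch of Levenshtein edit distance (insertions, deletions, substitutions),
# computed with the usual dynamic-programming table.

def edit_distance(a: str, b: str) -> int:
    m, n = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i           # delete all of a[:i]
    for j in range(n + 1):
        dist[0][j] = j           # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution (or match)
    return dist[m][n]

print(edit_distance("thier", "their"))   # 2 (a transposition counts as two substitutions here)
```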

Page 53: Hypertext (1)

Detection Algorithms

• For each word in text, search for word in dictionary. If not found, report spelling error.

• Issues:
  – Efficiency when text or dictionary is large
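
A minimal sketch of this lookup-based detector follows; storing the lexicon in a hash-based set keeps each lookup near constant time, which addresses the efficiency issue above. The tiny word list and the regular expression used to split words are illustrative assumptions.

```python
# Sketch of the dictionary-lookup detector described above. A set (hash table) makes
# each membership test roughly constant time, so the pass is linear in the word count.

import re

lexicon = {"call", "me", "ishmael", "some", "years", "ago"}   # illustrative tiny lexicon

def misspellings(text, lexicon):
    words = re.findall(r"[a-z']+", text.lower())   # crude definition of "word"
    return [w for w in words if w not in lexicon]

print(misspellings("Call me Ishmale, some years ago", lexicon))  # ['ishmale']
```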

Page 54: Hypertext (1)

Detection Algorithms (2)

• n-gram analysis

• Issues:
  – Requires preprocessing of dictionary
  – Extremely fast if misspelling creates unusual n-gram

Page 55: Hypertext (1)

n-gram Fundamentals

• Definition: an n-gram is a substring of length n of a given word.

• Examples:
  – The word weasel contains 5 digrams (2-grams), namely we, ea, as, se, el.
  – The word monkey contains 4 trigrams (3-grams), namely mon, onk, nke, key.
  – The word turkey contains 6 monograms (1-grams), namely t, u, r, k, e, y.
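
A one-line extraction function makes the definition concrete and reproduces the examples above (an illustrative sketch, not the course's code).

```python
# Sketch of n-gram extraction matching the definition above.

def ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams("weasel", 2))  # ['we', 'ea', 'as', 'se', 'el']
print(ngrams("monkey", 3))  # ['mon', 'onk', 'nke', 'key']
print(ngrams("turkey", 1))  # ['t', 'u', 'r', 'k', 'e', 'y']
```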

Page 56: Hypertext (1)

n-gram Strategy

• Preprocess the dictionary to create a list of all the n-grams contained in words in the dictionary.
  – Eliminate duplicates from the list
  – Perhaps record the position within the word of the n-gram.

• Detect a spelling error by discovering an n-gram in the target word that is not in the n-gram list.

Page 57: Hypertext (1)

Arrays

• Definition: A data structure is a particular way of storing data in a computer.

• Definition: An array is an indexed set of values. Informally, an array can be viewed as a table.

• Example (of a data structure): An array is a data structure.

Page 58: Hypertext (1)

Arrays (2)

• Array index:
  – Usually positive integers to some maximum size, e.g. 1 to 500.
  – Can also be another ordered set, e.g. the alphabet, the characters in ASCII order

• Values: Whatever one wants to store: numbers, letters, strings, other arrays.

Page 59: Hypertext (1)

Array Examples

• Table of hex and binary numbers corresponding to base 10 numbers. The index set is the base 10 numbers, the array values (table entries) are the corresponding hex and binary numbers

• List of words for searching. The index is the position in the list, the array values are the words viewed as strings.

Page 60: Hypertext (1)

Array Examples (2)

• Shift table for Boyer-Moore searching. The index is the set of characters. The array value is the number representing the shift amount for that index character.

• List of ASCII codes. The index is the ASCII code, 00 to FF in hex numbers. The array value is the character represented by the index hex number.
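
The Boyer-Moore shift table mentioned above can be realized in several ways; one common concrete form is the bad-character table of the Horspool variant, sketched below. Which exact table the slide intends is an assumption here.

```python
# One concrete reading of the "shift table for Boyer-Moore searching" mentioned above:
# the bad-character table of the Horspool variant. The index set is the characters in
# the pattern; the value is how far the pattern may slide when that character sits
# under the pattern's last position. Characters not in the table shift by the full
# pattern length (use table.get(c, len(pattern)) at search time).

def horspool_shift_table(pattern):
    m = len(pattern)
    table = dict.fromkeys(pattern, m)        # default: shift by the full pattern length
    for i, c in enumerate(pattern[:-1]):     # all but the last character
        table[c] = m - 1 - i                 # distance from rightmost occurrence to the end
    return table

print(horspool_shift_table("weasel"))
# {'w': 5, 'e': 1, 'a': 3, 's': 2, 'l': 6}
```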

Page 61: Hypertext (1)

Digram Arrays

• A digram array is an array indexed by the letters a through z. Each value is, in turn, an array indexed by the letters a through z.

• A digram array can be viewed as a table whose rows and columns are indexed by the 26 lower case letters.

• Typically, we use binary digits as the values in a digram array, creating a binary digram array, or BDA.

Page 62: Hypertext (1)

Digram Arrays (2)

• Assume that a dictionary is given.

• Preprocess the dictionary by setting the value in a digram array for each digram that appears in each word in the dictionary.

• Notes:
  – The digram array depends on the dictionary
  – Typically 42% of entries are 0
  – Trigram arrays may be constructed in the same way.
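
The preprocessing step can be sketched as follows; the tiny dictionary is illustrative. A 0 entry proves that any word containing that digram is absent from the dictionary, while a 1 entry is inconclusive, as the next slides spell out.

```python
# Sketch of a nonpositional binary digram array: a 26x26 table of 0/1 values indexed
# by letter pairs, built from a small illustrative dictionary.

import string

def build_bda(dictionary):
    index = {c: i for i, c in enumerate(string.ascii_lowercase)}
    bda = [[0] * 26 for _ in range(26)]
    for word in dictionary:
        for a, b in zip(word, word[1:]):            # every digram in the word
            bda[index[a]][index[b]] = 1
    return bda, index

def word_definitely_absent(word, bda, index):
    return any(bda[index[a]][index[b]] == 0 for a, b in zip(word, word[1:]))

dictionary = ["cuckoo", "weasel", "monkey", "turkey", "light"]
bda, index = build_bda(dictionary)
print(word_definitely_absent("mvore", bda, index))   # True: the digram 'mv' never occurs
print(word_definitely_absent("turkel", bda, index))  # False: every digram occurs, so the test is inconclusive
```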

Page 63: Hypertext (1)

Nonpositional BDA

• Each value, or cell, in a BDA is associated with the digram represented by the row and column index of the cell.

• Example: The digram ck is associated to the value in the cell in row c, column k.

• The value in a nonpositional BDA associated to a digram is 1 if that digram appears in some word in the dictionary and is 0 otherwise.

Page 64: Hypertext (1)

Nonpositional BDA (2)

• Example: The value associated with the digram ck is 1 if some word containing ck appears in the dictionary (e.g. cuckoo). The value is 0 if no word in the dictionary contains ck.

• Example: If the word whose spelling is being checked contains the digram mv and the value associated with this digram is 0, then the word does not appear in the dictionary.

Page 65: Hypertext (1)

Nonpositional BDA (3)

• Example: If the word whose spelling is being checked contains the digram gh and the value associated with this digram in the array is 1, then one cannot say whether the word is spelled correctly, based just on this information.

Page 66: Hypertext (1)

Example: Moby Dick

• Class examined Chapters 31-93• Summary file contains

– 284,591 characters– 63,851 words– 63,853 sentences– 63,585 lines– 63,583 paragraphs– 1413 pages

Page 67: Hypertext (1)

Example: Moby Dick (2)

• After processing (removing numbers, upper case letters, and punctuation), file contains– 70039 characters– 9578 words– 9577 sentences– 9577 lines– 9577 paragraphs– 213 pages

Page 68: Hypertext (1)

Example: Moby Dick (3)

• Checking digrams, we find

Page 69: Hypertext (1)

Positional BDA

• Assume that the longest word in the dictionary has length M.

• Denote the position of a digram by k. Then k has value 1, 2, ... , M-1.

• For each digram, create an array of length M-1, where the value at index k is 1 if the digram appears in a word in the dictionary in position k. The value is 0 if no word in the dictionary has this digram at position k.
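
A sketch of the same idea with positions recorded, again over a small illustrative dictionary: the table maps each digram to a 0/1 vector indexed by the starting position k.

```python
# Sketch of a positional BDA: for each digram, a 0/1 vector indexed by the position k
# (1..M-1) at which the digram starts in some dictionary word.

def build_positional_bda(dictionary):
    max_len = max(len(w) for w in dictionary)          # M
    table = {}                                         # digram -> list of length M-1
    for word in dictionary:
        for k, (a, b) in enumerate(zip(word, word[1:]), start=1):
            vec = table.setdefault(a + b, [0] * (max_len - 1))
            vec[k - 1] = 1
    return table, max_len

dictionary = ["cat", "category", "sparrow"]
table, M = build_positional_bda(dictionary)
print(table["at"])      # [0, 1, 0, 0, 0, 0, 0]: 'at' starts only at position k=2
print(table["sp"])      # [1, 0, 0, 0, 0, 0, 0]: 'sp' starts only at position k=1, so k=7 is 0
```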

Page 70: Hypertext (1)

Positional BDA (2)

• Example: In the positional BDA for the digram at, the value indexed by k=3 is equal to 1 if some word in the dictionary has the form ??at*

• Example: In the positional BDA for the digram sp, the value indexed by k=7 is equal to 0 if no word in the dictionary is of the form ??????sp*

Page 71: Hypertext (1)

Effectiveness

• Typically about 42% of entries in a non-positional BDA are 0

• Randomly changing one letter in a word will produce a digram with value 0 in NP BDA about 70% of time

• In a study of handprinted 6-letter words, of 7662 words with a single substitution error, 7561 were detected by positional trigram analysis

Page 72: Hypertext (1)

Encryption

• Goal: provide privacy and security for text transmitted by computer network.
  – Confidentiality of contents
  – Authenticity of sender and receiver
  – Integrity of contents

• Interested parties
  – Military and diplomatic officers
  – Mathematicians and computer scientists
  – E-commerce providers

Page 73: Hypertext (1)

Encryption History

• Early work
  – Cryptography book by George Fisher published by Benjamin Franklin

• Present day
  – Text transmitted by computer network
  – Techniques regulated by federal government

Page 74: Hypertext (1)

Encryption on Networks

• Situation: no transmission on any computer network can be considered absolutely private
  – Network tap is not physically difficult
  – Legitimate use for monitoring traffic to detect problems and potential bottlenecks

Page 75: Hypertext (1)

Intruders

• Passive: listens, gathers information

• Active: captures and (perhaps) replaces
  – Changes amount in a financial transaction
  – Uses a stolen credit card number

Page 76: Hypertext (1)

Encryption Model

Page 77: Hypertext (1)

Encryption Techniques

• Character-based
  – Shift (Caesar cipher)
  – Monoalphabetic substitution (cryptograms)
  – Polyalphabetic cipher

• Numeric
  – Each character is represented by 8 bits
  – Four characters form a 32-bit number
  – Encode these numbers

Page 78: Hypertext (1)

Shift Encryption

• Encryption: Each letter is encoded with the letter k positions from it in the alphabet

• Key: The integer k, in the range –25..25

Page 79: Hypertext (1)

Shift Encryption (2)

• Example 1: Shift

Replace each letter by the one three positions forward in the alphabet, k=+3

WILDCATS ---> ZLOGFDWV

• Example 2: Shift, k = +5

CATS ---> HFYX

Decrypt using k = –5

Page 80: Hypertext (1)

Shift Encryption (3)

Notes on shift encryption

• Only 26 different strategies are possible, and one of those is the null strategy (no encrypting is done).

• If encryption uses the key k, then decryption uses the key –k (or the key 26 – k)
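
A short sketch of shift encryption that reproduces the WILDCATS example above; it is illustrative, not the course's program.

```python
# Sketch of shift (Caesar) encryption: each letter moves k positions forward in the
# alphabet, wrapping around; decryption uses -k (equivalently 26 - k).

import string

ALPHA = string.ascii_uppercase

def shift(text, k):
    return "".join(ALPHA[(ALPHA.index(c) + k) % 26] if c in ALPHA else c for c in text)

print(shift("WILDCATS", 3))   # ZLOGFDWV
print(shift("HFYX", -5))      # CATS
```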

Page 81: Hypertext (1)

Monoalphabetic Substitution

• Encrypt by using a random permutation of the alphabet.

• Key is the permutation, 26! choices are available.

• Decryption by checking all permutations is impossible.

• However, this is the Daily Cryptogram in the newspaper.
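
A sketch of monoalphabetic substitution with a randomly permuted alphabet as the key; the seed and plaintext are illustrative.

```python
# Sketch of monoalphabetic substitution: the key is a random permutation of the
# alphabet (26! possibilities). Encryption substitutes letter for letter;
# decryption inverts the permutation.

import random
import string

ALPHA = string.ascii_uppercase

def make_key(seed=None):
    letters = list(ALPHA)
    random.Random(seed).shuffle(letters)
    return "".join(letters)                  # key[i] replaces ALPHA[i]

def encrypt(text, key):
    return text.upper().translate(str.maketrans(ALPHA, key))

def decrypt(text, key):
    return text.translate(str.maketrans(key, ALPHA))

key = make_key(seed=42)                      # illustrative fixed seed
cipher = encrypt("It is easier to talk than to hold one's tongue", key)
print(cipher)
print(decrypt(cipher, key))                  # recovers the plaintext (in upper case)
```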

Page 82: Hypertext (1)

Monoalphabetic Substitution (2)

• Example:

  Ciphertext: XE XU BRUXBF EM EROJ EARI EM
  Plaintext:  IT IS EASIER TO TALK THAN TO

  Ciphertext: AMOP MIB’S EMIWCB
  Plaintext:  HOLD ONE’S TONGUE

Page 83: Hypertext (1)

Monoalphabetic Substitution (3)

• Notes on monoalphabetic substitution
  – Decryption strategy uses letter patterns, e.g. common digrams and trigrams
  – Heuristics, as opposed to an algorithm

Page 84: Hypertext (1)

Polyalphabetic Substitution

• Caesar cipher has too few keys

• Monoalphabetic substitution has enough keys, but word patterns (digrams and trigrams) allow easy code breaking

• Develop strategy with
  – large number of keys
  – disrupted word patterns

Page 85: Hypertext (1)

Polyalphabetic Substitution (2)

• Start with a 26 x 26 array of letters, shifted by one letter in each row

• Choose a string as a key

• Example: key = springforward, repeated letter by letter above the plaintext (spaces are skipped):

  spr ingforwardsp ringf or ward springforw ardsprin gfo rw ardspri
  The confidential terms of your employment contract are as follows

Page 86: Hypertext (1)

Polyalphabetic Substitution (3)

• The ith character in the text, denoted by c, is replaced by m(d,c), where d is the corresponding character in the key, and the replacement is the character m, appearing in the dth row and cth column of the array.

• Example:
  – d = s, c = t, m(d,c) = l
  – d = p, c = h, m(d,c) = w
  – d = r, c = e, m(d,c) = v

Page 87: Hypertext (1)

Polyalphabetic Substitution (4)

• The encoded message starts

lwvkb tkwua nklsa kmesx cwuol uwbgt …

where the letters have been written in groups of five.
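
With the rows arranged as on the slide, m(d,c) works out to adding the two letter positions mod 26; this tableau scheme is the classical Vigenère cipher. The sketch below (illustrative, not from the course) reproduces the ciphertext groups above.

```python
# Sketch of the polyalphabetic encryption described above: each plaintext letter is
# shifted by the corresponding key letter, with the key repeated over the letters of
# the message (spaces and punctuation skipped).

def vigenere_encrypt(plaintext, key):
    letters = [c for c in plaintext.lower() if c.isalpha()]
    out = []
    for i, p in enumerate(letters):
        d = key[i % len(key)]
        out.append(chr((ord(p) - 97 + ord(d) - 97) % 26 + 97))
    return "".join(out)

message = "The confidential terms of your employment contract are as follows"
cipher = vigenere_encrypt(message, "springforward")
print(" ".join(cipher[i:i + 5] for i in range(0, 30, 5)))
# lwvkb tkwua nklsa kmesx cwuol uwbgt
```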

Page 88: Hypertext (1)

Polyalphabetic Substitution (5)

• To decode a message, knowing the key, match the key with the message.

• Example: key = declaration

  Key:        decla ratio ndecl arati ionde
  Ciphertext: zlgyi etamq bxvup owhnu cahzg

Page 89: Hypertext (1)

Polyalphabetic Substitution (6)

• The ith character p of the plaintext message is the character such that m(d,p) = e, where d is the character of the key corresponding to the ith character e of the encrypted message.

• Operationally,
  – Go to the dth row of the array
  – Find e in this row by scanning across
  – Record p, the column index of e

Page 90: Hypertext (1)

Polyalphabetic Substitution (7)

• Example:
  – d = d, e = z, appears in column w, p = w
  – d = e, e = l, appears in column h, p = h
  – d = c, e = g, appears in column e, p = e
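
Arithmetically, decoding is the reverse shift, (e - d) mod 26. The companion sketch below (again illustrative) decodes the springforward ciphertext from the earlier example.

```python
# Sketch of polyalphabetic decoding: the plaintext letter p satisfies m(d, p) = e,
# which is the reverse shift (e - d) mod 26 with the key repeated over the letters.

def vigenere_decrypt(ciphertext, key):
    letters = [c for c in ciphertext.lower() if c.isalpha()]
    out = []
    for i, e in enumerate(letters):
        d = key[i % len(key)]
        out.append(chr((ord(e) - ord(d)) % 26 + 97))
    return "".join(out)

cipher = "lwvkb tkwua nklsa kmesx cwuol uwbgt"   # the six groups given earlier
print(vigenere_decrypt(cipher, "springforward"))
# theconfidentialtermsofyourempl  (the first 30 letters of the original message)
```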

Page 91: Hypertext (1)
Page 92: Hypertext (1)

Text Compression

• Text is represented as a long string of binary digits, 8 digits per character.

• A 2000-word essay has about
  – 10,000 characters
  – 2000 spaces
  – 96,000 bits

• Question: Can we represent this essay in substantially fewer bits?

Page 93: Hypertext (1)

Text Compression (2)

• Answer: Most likely, since we really only need 7 bits per character for the 94 printing characters plus white space characters.

Page 94: Hypertext (1)

Techniques

• Represent fixed text with a short symbol string, e.g.
  – Stock exchange symbols for company names
  – ISBN numbers for book title and author

• Shorter symbol strings for more frequently occurring text strings
  – Use one bit for the most frequent character, etc.

Page 95: Hypertext (1)

Techniques (2)

• Context-dependent strings
  – Represent common combinations with their own codes
  – Represent constant bit strings

Page 96: Hypertext (1)

Huffman Coding

• Frequency dependent coding

• Uses frequency distribution of characters in text
  – Most commonly occurring letter is E, 13.05%
  – Next most is T, 9.02%
  – Rarest is Z, 0.09%

Page 97: Hypertext (1)

Huffman Coding (2)

• Creating a Huffman code for a set of characters
  – List the characters and their relative frequencies
  – Sort the list in order of least frequent to most frequent
  – Build a coding tree, which is a binary tree, as described below

Page 98: Hypertext (1)

Binary Tree

• A binary tree is a tree in which
  – Each interior node has degree 2 (i.e. two children)
  – The child nodes are ordered

Page 99: Hypertext (1)

Huffman Coding (3)

• To build a Huffman tree
  – List the characters in order of frequency from most to least
  – Make the two least frequent characters leaf nodes and join them to a new node.
  – Label the new node with the sum of the frequencies of the two child nodes
  – Label the link to the least frequent with 0 and the other link with 1

Page 100: Hypertext (1)

Huffman Coding (4)

  – Join the newly created node with the next least frequent character.
  – Again add the frequencies, label the new node, and label the link to the least frequent node with 0, the other link with 1. Caution: compare the character frequency with the new node frequency
  – Continue until all characters have been joined.
  – The last node (the root of the tree) will be labeled with frequency 1.00. (Why?)

Page 101: Hypertext (1)

Huffman Coding (5)

To compress text with a Huffman code:

• Follow the tree from the root to the leaf labeled by a character to find the code of the character, the code being the sequence of link labels on the (unique) path to the character

Page 102: Hypertext (1)

Huffman Coding (6)

Example: Assume only 4 characters (so that the tree doesn’t get too large) with relative frequencies:

A = .40

B = .20

C = .15

D = .25

Total = 1.00

Page 103: Hypertext (1)

Huffman Coding (7)

Sort the characters by frequency, smallest first:

C = .15

B = .20

D = .25

A = .40

Join B and C to get a node labeled .35 = .15 + .20, with link C→.35 labeled 0 and link B→.35 labeled 1.

Page 104: Hypertext (1)

Huffman Coding (8)

Join the next least frequent character node (D) to the new node (.35) and create a node labeled .60 = .35 + .25

Label link D→.60 with 0 and link .35→.60 with 1

Page 105: Hypertext (1)

Huffman Coding (9)

Join the next least frequent character node (A) to the new node (.60) and create a node labeled 1.00 = .60 + .40

Label link A→1.00 with 0 and link .60→1.00 with 1

Page 106: Hypertext (1)

Huffman Coding (10)

Follow the tree from the root to the leaves to find the codes:

A = 0

B = 111

C = 110

D = 10

Without compression, BAD takes 24 bits

With compression, BAD = 111010, 6 bits
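
The whole construction fits in a short program. The sketch below (illustrative, not the course's code) repeatedly joins the two least frequent nodes, labels the link to the less frequent child 0 and the other 1, and reads the codes off the paths from the root; it reproduces A = 0, B = 111, C = 110, D = 10 and the 6-bit encoding of BAD.

```python
# Sketch of Huffman coding that rebuilds the 4-character example above.

import heapq
from itertools import count

def huffman_codes(freqs):
    tick = count()                                    # tie-breaker so heap tuples always compare
    heap = [(f, next(tick), {"char": c}) for c, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, n0 = heapq.heappop(heap)               # least frequent: its link gets label 0
        f1, _, n1 = heapq.heappop(heap)               # next least:     its link gets label 1
        heapq.heappush(heap, (f0 + f1, next(tick), {"0": n0, "1": n1}))
    codes = {}
    def walk(node, path):
        if "char" in node:
            codes[node["char"]] = path or "0"         # a lone character would get code "0"
        else:
            walk(node["0"], path + "0")
            walk(node["1"], path + "1")
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"A": 0.40, "B": 0.20, "C": 0.15, "D": 0.25})
print(codes)                                          # {'A': '0', 'D': '10', 'C': '110', 'B': '111'}
print("".join(codes[c] for c in "BAD"))               # 111010 (6 bits vs. 24 uncompressed)
```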