representing and storing structured data

89
Representing and Storing Structured Data LBSC 690: Jordan Boyd-Graber October 15, 2012 Adapted from Jimmy Lin’s Slides LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 1 / 75

Upload: others

Post on 30-Jun-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Representing and Storing Structured Data

Representing and Storing Structured Data

LBSC 690: Jordan Boyd-Graber

October 15, 2012

Adapted from Jimmy Lin’s Slides

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 1 / 75

Page 2: Representing and Storing Structured Data

Goals

Metadata: XML

Databases

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 2 / 75

Page 3: Representing and Storing Structured Data

Outline

1 The joys and sorrows of metadata

2 XML: a framework for data representation

3 New and interesting things

4 Relational Databases

5 Relational Algebra

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 3 / 75

Page 4: Representing and Storing Structured Data

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 4 / 75

Page 5: Representing and Storing Structured Data

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 4 / 75

Page 6: Representing and Storing Structured Data

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 4 / 75

Page 7: Representing and Storing Structured Data

Take-Away Messages

Metadata makes data useful

XML is a way to encode data and metadata

XML allows computers to exchange information in new andinteresting ways

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 5 / 75

Page 8: Representing and Storing Structured Data

What’s going on here? How do I use this?

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 6 / 75

Page 9: Representing and Storing Structured Data

Metadata

Literally “data about data”

“a set of data that describes and gives information about other data” –Oxford English Dictionary

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 7 / 75

Page 10: Representing and Storing Structured Data

Dublin Core

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 8 / 75

Page 11: Representing and Storing Structured Data

What is the Dublin Core?

A metadata standard for describing digital resources

An initiative to create a library card catalog for the Web

Dublin Core fields:

Title Creator SubjectDescription Publisher Contributor

Date Type FormatIdentifier Source LanguageRelation Coverage Rights

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 9 / 75

Page 12: Representing and Storing Structured Data

Encoding Metadata

Language for encoding metadata should be:I Universal - so all can understandI Flexible - to incorporate di↵erent typesI Extensible - flexible to custom typesI Simple - to encourage adoptionI Modular - so that schemes can be mixed, extended

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 10 / 75

Page 13: Representing and Storing Structured Data

How do we encode data for interoperability?

Challenges

January 31, 200131 janvier 20012001-01-3101-31-2000980942400

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 11 / 75

Page 14: Representing and Storing Structured Data

Outline

1 The joys and sorrows of metadata

2 XML: a framework for data representation

3 New and interesting things

4 Relational Databases

5 Relational Algebra

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 12 / 75

Page 15: Representing and Storing Structured Data

What is XML?

XML = eXtensible Markup Language

XML is a standard for exchanging structured dataI Provides standardization at the syntactic levelI Does not provide “meaning” for the tagsI XML is a standard recommended by the W3C

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 13 / 75

Page 16: Representing and Storing Structured Data

Goals of XML

Easy to use

Easy to extend and adapt

Easy to write programs that use XML

Support a wide variety of applications

Should be human legible

Formal and concise

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 14 / 75

Page 17: Representing and Storing Structured Data

Refresher: Elements and Attributes

Attribute

<pe r son age=”28” />

Element

<pe r son><age>28</age></ pe r son>

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 15 / 75

Page 18: Representing and Storing Structured Data

The Basic Rules

XML is case sensitive

All start tags must have end tags

Elements must be properly nested

XML declaration is the first statement

<?xml ve r s i on=” 1 .0 ”?>

Every document must contain a root element

Attribute values must have quotation marks

<i t em i d=”33905”>

Certain characters are reserved for parsing

\& l t ; = ‘ ‘< ’ ’

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 16 / 75

Page 19: Representing and Storing Structured Data

<rdf :RDFxm l n s : r d f=” h t t p : //www.w3 . org /1999/02/22� rd f�syntax�ns#”xmln s : dc=” h t t p : // p u r l . o rg /dc/ e l ement s /1 .1/ ”>

< r d f : D e s c r i p t i o nr d f : a b o u t=” h t t p : //media . example . com/ aud io / gu ide . r a ”>

<d c : c r e a t o r>Rose Bush</ d c : c r e a t o r><d c : t i t l e>A Guide to Growing Roses</ d c : t i t l e><d c : d e s c r i p t i o n>De s c r i b e s p r o c e s s f o r p l a n t i n g and n u r t u r i n g

d i f f e r e n t k i n d s o f r o s e bushes .</ d c : d e s c r i p t i o n><d c : d a t e>2001�01�20</ d c : d a t e>

</ r d f : D e s c r i p t i o n>

</ rdf :RDF>

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 17 / 75

Page 20: Representing and Storing Structured Data

What does XML do?

. . .nothing

Syntax vs. semantics

XML vs HTML

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 18 / 75

Page 21: Representing and Storing Structured Data

What does XML do? . . .nothing

Syntax vs. semantics

XML vs HTML

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 18 / 75

Page 22: Representing and Storing Structured Data

Historic Perspective: Three Core Technologies

HTTP - HyperText Transfer ProtocolI A protocol for transferring data between machines on the Internet

URL - Uniform Resource LocatorI A scheme for referencing the specific location of a resource

HTML - HyperText Markup LanguageI A markup language for encoding information to be read by humans

HTTP and URLs have stood the test of time.But by 1996, HTML was already showing signs of age . . .

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 19 / 75

Page 23: Representing and Storing Structured Data

HTML

Started with very few tags

Language evolved as more tags were added:I FormsI TablesI FontsI FramesI . . .

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 20 / 75

Page 24: Representing and Storing Structured Data

Problems with HTML

I want personalized tagsI HTML can’t be extended

I want to incorporate other types of dataI Mathematics, database entries, literary text, poems, purchase orders

. . .I HTML can’t accommodate other types of data

I want to process pages automatically with softwareI HTML is too messy and inconsistentI Browsers are too forgiving

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 21 / 75

Page 25: Representing and Storing Structured Data

Back to Basics

HTML was defined using SGMLI Standard Generalized Markup LanguageI A meta-language for defining languagesI Complex, sophisticated, powerful . . .

too di�cult to useI Idea: create a simpler version of SGML . . . the birth of XML!

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 22 / 75

Page 26: Representing and Storing Structured Data

Back to Basics

HTML was defined using SGMLI Standard Generalized Markup LanguageI A meta-language for defining languagesI Complex, sophisticated, powerful . . . too di�cult to useI Idea: create a simpler version of SGML . . .

the birth of XML!

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 22 / 75

Page 27: Representing and Storing Structured Data

Back to Basics

HTML was defined using SGMLI Standard Generalized Markup LanguageI A meta-language for defining languagesI Complex, sophisticated, powerful . . . too di�cult to useI Idea: create a simpler version of SGML . . . the birth of XML!

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 22 / 75

Page 28: Representing and Storing Structured Data

XML Languages

XML can be used to define other languages

Many XML languages, optimized for di↵erent rolesI XHTML: HTML by XML rulesI MathML: for mathematicsI EPUB: for creating eBooksI RSS: for news feedsI Civ IV: Create your own gameI SVG: Create graphics

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 23 / 75

Page 29: Representing and Storing Structured Data

XHTML: Cleaning up HTML

<?xml ve r s i on=” 1 .0 ” encod ing=” i s o �8859�1”?><html xmlns=” h t t p : //www.w3 . org /TR/ xhtml1 ” ><head>

< t i t l e> T i t l e o f t e x t XHTML Document </ t i t l e></head><body><d i v c l a s s=”myDiv”>

<h1> Heading o f Page </h1><p>A parag raph t h i s one wi th an

<img s r c=” image . g i f ” a l t=”waste o f t ime ” />image , and a <br /> l i n e b reak . </p>

</ d i v></body></html>

What’s new?New preamble to tell us what’s here, and tags must have explicit ends.

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 24 / 75

Page 30: Representing and Storing Structured Data

MathML

An XML language for defining mathematic formulas

x

2 + 4x + 4 = 0

<mrow><mrow>

<msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mrow>

<mn>4</mn><mo>&I n v i s i b l e T im e s ;</mo><mi>x</mi>

</mrow><mo>+</mo><mn>4</mn>

</mrow><mo>=</mo><mn>0</mn>

</mrow>

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 25 / 75

Page 31: Representing and Storing Structured Data

EPUB

Format for putting books on mobile readers (except Kindles)

Divide up a book into XHTML files

Create two additional XML filesI opf (open packaging format)

F Metadata (using Dublin Core)F All the files neededF Linear reading order

I ncx (navigation control file for XML)F Hierarchical organization of content (for easy navigation)

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 26 / 75

Page 32: Representing and Storing Structured Data

RSS

RSS = Really Simple Syndication or Rich Site Summary

An XML format for distributing news headlines on the Web

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 27 / 75

Page 33: Representing and Storing Structured Data

And Others . . .

CML: chemical Markup Lang

CellML: biological models

BSML: bioinformatic sequences

MAGE-ML: Microarray Gene Expression

XSTAR: for archaeological research

MARCXML: MARC in XML

AML: astronomy markup language

SportsML: for sharing sports data

List goes on and on and on . . .

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 28 / 75

Page 34: Representing and Storing Structured Data

The XML Family Tree

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 29 / 75

Page 35: Representing and Storing Structured Data

Mixing XML Dialects

XML is designed to support the integration of multiple standards

Allows users to mix elements from di↵erent standardsI Snapping together XML dialects like Lego piecesI Based on the notion of “namespaces”

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 30 / 75

Page 36: Representing and Storing Structured Data

Example

<?xml ve r s i on=” 1 .0 ”?><rdf :RDF

xm l n s : r d f=” h t t p : //www.w3 . org /1999/02/22� rd f�syntax�ns#”xm l n s : r s s=” h t t p : // p u r l . o rg / r s s /1 .0/ ”xm ln s :dc=” h t t p : // p u r l . o rg /dc/ e l ement s /1 .1/ ”>< r s s : c h a n n e l r d f : a b o u t=” h t t p : //www. xml . com/xml/news . r s s ”>

< r s s : t i t l e>XML. com</ r s s : t i t l e>< r s s : l i n k>h t t p : // xml . com/pub</ r s s : l i n k><d c : d e s c r i p t i o n>

XML. com f e a t u r e s a r i c h mix o fi n f o rma t i o n and s e r v i c e s f o r the XML community .

</ d c : d e s c r i p t i o n><d c : s u b j e c t>XML, RDF, metadata , i n f o rma t i o n

s y n d i c a t i o n s e r v i c e s</ d c : s u b j e c t><d c : i d e n t i f i e r>h t t p : //www. xml . com</ d c : i d e n t i f i e r><d c : p u b l i s h e r>O’ R e i l l y & As s o c i a t e s , I n c .</ d c : p u b l i s h e r><d c : r i g h t s >Copy r i gh t 2000 , O ’ R e i l l y &

As s o c i a t e s , I n c .</ d c : r i g h t s></ r s s : c h a n n e l>

</ rdf :RDF>

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 31 / 75

Page 37: Representing and Storing Structured Data

Another Example

<?xml ve r s i on=” 1 .0 ” encod ing=” i s o �8859�1”?><html xmlns=” h t t p : //www.w3 . org /TR/ xhtml1 ” ><head>

< t i t l e> T i t l e o f XHTML Document </ t i t l e></head><body><d i v c l a s s=”myDiv”>

<h1> Heading o f Page </h1><math xmlns=” h t t p : //www.w3 . org /1998/Math/MathML”>

. . . MathML markup . . .</math><p> more html s t u f f goes he r e </p>

<sm i l xmlns=” h t t p : //www.w3 . org /TR/ sm i l 1 ”>. . . SMIL markup . . .</ sm i l>

</ d i v></body></html>

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 32 / 75

Page 38: Representing and Storing Structured Data

Outline

1 The joys and sorrows of metadata

2 XML: a framework for data representation

3 New and interesting things

4 Relational Databases

5 Relational Algebra

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 33 / 75

Page 39: Representing and Storing Structured Data

Interoperability

What does it mean and what’s the role of XML?

XML: universal format for data interchange

Software exchanges data as XML-format messages

Advantages?I Eliminates proprietary data formatsI Promotes interoperabilityI Encourages cooperationI Leverages lots of existing XML processing software

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 34 / 75

Page 40: Representing and Storing Structured Data

Interoperability

What does it mean and what’s the role of XML?

XML: universal format for data interchange

Software exchanges data as XML-format messages

Advantages?I Eliminates proprietary data formatsI Promotes interoperabilityI Encourages cooperationI Leverages lots of existing XML processing software

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 34 / 75

Page 41: Representing and Storing Structured Data

XML Messaging

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 35 / 75

Page 42: Representing and Storing Structured Data

XML Messaging

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 35 / 75

Page 43: Representing and Storing Structured Data

What’s in it for me?

WebappsI Lower overheadI Richer dataI More portability

Mashups

Syntax vs. Semantics

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 36 / 75

Page 44: Representing and Storing Structured Data

Mashups

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 37 / 75

Page 45: Representing and Storing Structured Data

Mashups

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 37 / 75

Page 46: Representing and Storing Structured Data

XML Schema

Defines what a valid XML document should look likeI FieldsI AttributesI Number of entries

Has filename extension “xsd”

There are plenty of XML validators out there

Won’t go into details . . . think of it like a rulebook

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 38 / 75

Page 47: Representing and Storing Structured Data

Extensible Stylesheet Language Transformations

XSLT transforms one XML document into another

Often used to display XML to a userI WebpageI Graphics

Syntax varies, semantics are fixed

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 39 / 75

Page 48: Representing and Storing Structured Data

Business Card: Source Data

<?xml ve r s i on=” 1 .0 ”?>

<ca rd type=” s imp l e ”><name>John Doe</name>< t i t l e>CEO, Widget I n c .</ t i t l e><ema i l>j ohn . doe@widget . com</ ema i l><phone>(202) 456�1414</phone>

</ ca rd>

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 40 / 75

Page 49: Representing and Storing Structured Data

Business Card: XSLT Transformation

< x s l : s t y l e s h e e txm l n s : x s l=” h t t p : //www.w3 . org /1999/XSL/Transform” ve r s i on=” 1 .0 ”xmlns=” h t t p : //www.w3 . org /1999/ xhtml ”>

<x s l : t em p l a t e match=” card ”><html>

<head>< t i t l e>b u s i n e s s ca rd</ t i t l e></head><body>

<x s l : a p p l y �t emp l a t e s s e l e c t=”name”/><x s l : a p p l y �t emp l a t e s s e l e c t=” t i t l e ”/><x s l : a p p l y �t emp l a t e s s e l e c t=” ema i l ”/><x s l : a p p l y �t emp l a t e s s e l e c t=”phone”/>

</body></html>

</ x s l : t em p l a t e>

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 41 / 75

Page 50: Representing and Storing Structured Data

Business Card: XSLT Transformation (cont.)

<x s l : t em p l a t e match=”name”><h1><x s l : v a l u e �o f s e l e c t=” t e x t ( ) ”/></h1>

</ x s l : t em p l a t e>

<x s l : t em p l a t e match=” t i t l e ”><b>T i t l e :</b> <x s l : v a l u e �o f s e l e c t=” t e x t ( ) ”/> <br />

</ x s l : t em p l a t e>

<x s l : t em p l a t e match=” ema i l ”><b>Ema i l :</b> <a h r e f=” ma i l t o : { t e x t ( )} ”><t t>

<x s l : v a l u e �o f s e l e c t=” t e x t ( ) ”/></ t t></a><br />

</ x s l : t em p l a t e>

<x s l : t em p l a t e match=”phone”><b>Phone:</b> <x s l : v a l u e �o f s e l e c t=” t e x t ( ) ”/> <br />

</ x s l : t em p l a t e>

</ x s l : s t y l e s h e e t>

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 42 / 75

Page 51: Representing and Storing Structured Data

Card with Style

<?xml ve r s i on=” 1 .0 ”?><?xml�s t y l e s h e e t type=” t e x t / x s l ” h r e f=” c a r d s t y l e 1 . xml”?>

<ca rd type=” s imp l e ”><name>John Doe</name>< t i t l e>CEO, Widget I n c .</ t i t l e><ema i l>j ohn . doe@widget . com</ ema i l><phone>(202) 456�1414</phone>

</ ca rd>

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 43 / 75

Page 52: Representing and Storing Structured Data

In a browser . . .

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 44 / 75

Page 53: Representing and Storing Structured Data

XML isn’t all there is

S-ExpressionsI Based on logical statementsI Not used outside academia (not in it too much, either)

Protocol Bu↵ersI Blazingly fastI More constrained than XML (have to specify data types, ranges)

JSONI Designed specifically for web applicationsI Lighter weight than XML

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 45 / 75

Page 54: Representing and Storing Structured Data

Take-Away Messages

Metadata makes data useful

XML is a way to encode data and metadata

XML allows computers to exchange information in new andinteresting ways

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 46 / 75

Page 55: Representing and Storing Structured Data

Outline

1 The joys and sorrows of metadata

2 XML: a framework for data representation

3 New and interesting things

4 Relational Databases

5 Relational Algebra

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 47 / 75

Page 56: Representing and Storing Structured Data

Take-Away Messages

Databases are suitable for storing structured information

Databases are important tools to organize, manipulate, and accessstructured information

Databases are integral components of modern Web applications

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 48 / 75

Page 57: Representing and Storing Structured Data

Definitions

Structured Information

What you put in a database (e.g. from XML)

DatabaseWhat you put structured information in.

Database Management System (DBMS)

Software system designed to store, manage, and facilitate access todatabases

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 49 / 75

Page 58: Representing and Storing Structured Data

What’s a database?

An integrated collection of data organized according to some model . . .

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 50 / 75

Page 59: Representing and Storing Structured Data

What’s a relational database?

An integrated collection of data organized according to a relational model. . .

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 50 / 75

Page 60: Representing and Storing Structured Data

Databases (try to) model reality. . .

Entities: things in the world (Example: airlines, tickets, passengers)

Relationships: how di↵erent things are related (Example: the ticketseach passenger bought)

“Business Logic”: rules about the world (Example: fare rules)

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 51 / 75

Page 61: Representing and Storing Structured Data

Components of a Relational Database

Field: an “atomic” unit of data

Record: a collection of related fields

Table: a collection of related recordsI Each record is a row in the tableI Each field is a column in the table

Database: a collection of tables

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 52 / 75

Page 62: Representing and Storing Structured Data

A Simple Example

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 53 / 75

Page 63: Representing and Storing Structured Data

Components of a Relational Database

Why “Relational?”

View of the world in terms of entities and relations between them

Tables represent “relations”

Each row in the table is sometimes called a “tuple”

Each tuple is “about” an entity

Fields can be interpreted as “attributes” or “properties” of the entity

Data is manipulated by “relational algebra”:

Defines things you can do with tuples

Expressed in SQL (Structured Query Language, next week)

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 54 / 75

Page 64: Representing and Storing Structured Data

The Registrar Example

What do we need to know?I Something about the students ?(e.g., first name, last name, email,

department)I Something about the courses ?(e.g., course ID, description, enrolled

students, grades)I Which students are in which courses

How do we capture these things?

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 55 / 75

Page 65: Representing and Storing Structured Data

A first stab . . .

What’s wrong with this?

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 56 / 75

Page 66: Representing and Storing Structured Data

A first stab . . .

What’s wrong with this?

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 56 / 75

Page 67: Representing and Storing Structured Data

Goals of “Normalization”

Save spaceI Save each fact only once

More rapid updatesI Every fact only needs to be updated once

More rapid searchI Finding something once is good enough

Avoid inconsistencyI Changing data once changes it everywhere

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 57 / 75

Page 68: Representing and Storing Structured Data

Updated Organization

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 58 / 75

Page 69: Representing and Storing Structured Data

Updated Organization

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 58 / 75

Page 70: Representing and Storing Structured Data

Keys

“Primary Key” uniquely identifies a recordI e.g., student ID in the student table

“Foreign Key” is primary key in the other tableI It need not be unique in this table

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 59 / 75

Page 71: Representing and Storing Structured Data

Approaches to Normalization

For simple problems:I Start with the entities you’re trying to modelI Group together fields that “belong together”I Add keys where necessary to connect entities in di↵erent tables

For more complicated problems:I Entity-relationship modeling (LBSC 670)

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 60 / 75

Page 72: Representing and Storing Structured Data

Entity Relationship Modeling

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 61 / 75

Page 73: Representing and Storing Structured Data

Entity Relationship Modeling

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 61 / 75

Page 74: Representing and Storing Structured Data

Database Integrity

Registrar database must be internally consistentI All enrolled students must have an entry in the student tableI All courses must have a nameI Grades can’t be negative

What happens:I When a student withdraws from the university?I When a course is taken o↵ the books?

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 62 / 75

Page 75: Representing and Storing Structured Data

Integrity Constraints

Conditions that must be true of the database at any timeI Specified when the database is designedI Checked when the database is modified

RDBMS ensures that integrity constraints are always keptI So that database contents remain faithful to the real worldI Helps avoid data entry errors

Where do integrity constraints come from?

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 63 / 75

Page 76: Representing and Storing Structured Data

Outline

1 The joys and sorrows of metadata

2 XML: a framework for data representation

3 New and interesting things

4 Relational Databases

5 Relational Algebra

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 64 / 75

Page 77: Representing and Storing Structured Data

Relational Operations

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 65 / 75

Page 78: Representing and Storing Structured Data

Relational Operations

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 65 / 75

Page 79: Representing and Storing Structured Data

Relational Operations

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 65 / 75

Page 80: Representing and Storing Structured Data

Relational Operations

Joining tables: JOIN

Choosing columns: SELECT

Based on their labels (field names)

Choosing rows: WHERE

Based on their contents

These can be specified together

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 66 / 75

Page 81: Representing and Storing Structured Data

How is a database more than aspreadsheet?

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 67 / 75

Page 82: Representing and Storing Structured Data

Database in the “Real World”

Typical database applications:I Banking (e.g., saving/checking accounts)I Trading (e.g., stocks)I Traveling (e.g., airline reservations)I Networking (e.g., Facebook)

Characteristics:I Lots of dataI Lots of concurrent operationsI Must be fastI “Mission critical” (well . . . sometimes)

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 68 / 75

Page 83: Representing and Storing Structured Data

Operational Requirements

Must hold a lot of dataI Use lots of computers, each with a small sliceI So which machine has your data?

Must be reliableI Use lots of computers with duplicate copiesI How do you keep copies consistent

Must be fastI Use lots of computersI Share the load

Must support concurrent operationsI This is hardI But often not needed

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 69 / 75

Page 84: Representing and Storing Structured Data

Database Transactions

Transaction = sequence of database actions grouped togetherI e.g., transfer $500 from checking to savings

ACID properties:I

Atomicity: all-or-nothingI

Consistency: each transaction must take the DB between consistentstates

IIsolation: concurrent transactions must appear to run in isolation

IDurability: results of transactions must survive even if systems crash

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 70 / 75

Page 85: Representing and Storing Structured Data

Making Transactions

Idea: keep a log (history) of all actions carried out while executingtransactions

I Before a change is made to the database, the corresponding log entryis forced to a safe location

Recovering from a crash:I E↵ects of partially executed transactions are undoneI E↵ects of committed transactions are redoneI Trickier than it sounds!

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 71 / 75

Page 86: Representing and Storing Structured Data

Discussion Question

RideFinder

Design a database to match drivers with passengers (e.g., for road trips)

Drivers post available seats; they want to know about interestedpassengers

Passengers call up looking for rides: they want to know aboutavailable rides (they don’t get to post “rides wanted” ads)

These things happen in no particular order

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 72 / 75

Page 87: Representing and Storing Structured Data

Discussion Goals

Design the tables you will needI First decide what information you need to keep track ofI Then design tables to capture this information

Design queries (using join, project, and restrict)I What happens when a passenger comes looking for a ride?I What happens when a driver comes to find out who his passengers are?

Role play!

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 73 / 75

Page 88: Representing and Storing Structured Data

Exercise solution: tables

Ride: Ride ID, Driver ID, Origin, Destination, Departure Time,Arrival Time, Available Seats

Passenger: Passenger ID, Name, Address, Phone Number

Driver: Driver ID, Name, Address, Phone Number

Booking: Ride ID, Passenger ID

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 74 / 75

Page 89: Representing and Storing Structured Data

Exercise solution: queries

Passenger calls: Can I get a ride?I Join: Ride, DriverI Project: Departure Time, Name, Phone NumberI Restrict: Origin, Destination, Available Seats > 0

Driver calls: Who are my passengers?I Join: Ride, Passenger, BookingI Project: Name, Phone NumberI Restrict: (Driver) Name, Origin, Destination, Departure Time

LBSC 690: Jordan Boyd-Graber () Representing and Storing Structured Data October 15, 2012 75 / 75