the next 700 data description languages yitzhak mandelbaum princeton university computer science...

26
The Next 700 Data The Next 700 Data Description Languages Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Upload: myra-garrett

Post on 06-Jan-2018

220 views

Category:

Documents


0 download

DESCRIPTION

What Data Needs Describing? There's much data in databases and common formats like XML; there’s much data that’s ad hoc. Ad hoc data lacks readily available parsing, querying, analysis or transformation tools It’s all over the place: financial, telecomm, chemistry, physics, biology, etc.

TRANSCRIPT

Page 1: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

The Next 700 Data The Next 700 Data Description LanguagesDescription Languages

Yitzhak MandelbaumPrinceton UniversityComputer Science

Collaborators: Kathleen Fisher and David Walker

Page 2: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

““The Next 700 …”The Next 700 …”

Program(s) Data Format(s)

Programming Language(s)

PL Semantics

Data Description Language(s)

DDL Semantics

Page 3: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

What Data Needs Describing?What Data Needs Describing?• There's much data in databases and common

formats like XML; there’s much data that’s ad hoc.

• Ad hoc data lacks readily available parsing, querying, analysis or transformation tools

• It’s all over the place: financial, telecomm, chemistry, physics, biology, etc.

Page 4: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Ad Hoc Data in BiologyAd Hoc Data in Biology!autogenerated-by: DAG-Edit version 1.419 rev 3 !saved-by: gocvs !date: Fri Mar 18 21:00:28 PST 2005 !version: $Revision: 3.223 $ !type: % is_a is a !type: < part_of part of !type: ^ inverse_of inverse of !type: | disjoint_from disjoint from $Gene_Ontology ; GO:0003673 <biological_process ; GO:0008150 %behavior ; GO:0007610 ; synonym:behaviour %adult behavior ; GO:0030534 ; synonym:adult behaviour %adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour % feeding behavior ; GO:0007631 %adult locomotory behavior ; GO:0008344 ;

...

from www.geneontology.orgfrom www.geneontology.org

Page 5: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Ad Hoc Data in ChemistryAd Hoc Data in ChemistryO=C([C@@H]2OC(C)=O)[C@@]3(C)[C@]([C@](CO4)(OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H](OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C)[C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O)C5=CC=CC=C5)=O)C1

O O

O

OH

AcO

H

O

O

O

HO

NH

O

O

OHO

Page 6: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Ad Hoc Data from Web Server Logs Ad Hoc Data from Web Server Logs (CLF)(CLF)

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941

Page 7: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Ad Hoc Data: DNS packetsAd Hoc Data: DNS packets

00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...000000c0: 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys000000d0: 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co

Page 8: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

DData ata Description LanguagesDescription Languages• Data description languages describe many ad hoc

formats and provide the following features:– Descriptions serves as documentation, including semantic of

data– Compiler generates tools from description: parser, printer,

query engine, converter to XML, statistical profiler, etc.– Parser includes robust error detection and recovery.– Parsers can handle high data volume.

• > 1GB/second Netflow traffic from Cisco routers.

Page 9: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Many Data Description Many Data Description LanguagesLanguages

• Logical Descriptions– ASN.1– ASDL

• Physical Descriptions– PacketTypes (SIGCOMM ‘00)

– DataScript (GPCE ‘02)

– PADS (PLDI ‘05)

• Basis for current work

001010101101001001111001010001010100

Logical

Physical

Page 10: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

ContributionsContributions• A core data description calculus (DDC)

– Based on dependent type theory– Simple, orthogonal, composable types– Types are transducers from external data source to internal

data representation.

• Encodings of high-level DDLs in low-level DDC– Explain semantics of PADS language in particular.

PacketTypesPADS

Datascript

DDC

Page 11: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Base Types and SequencesBase Types and Sequences• C(e): base type can be parameterized by

expression e.• x:T.T’: dependent product describes sequence of

values.– Variable x gives name to first value in sequence.

• Examples:

“123hello|” int * string(‘|’) * char (123, “hello”, ‘|’)

“3513” width:int_fw(1). int_fw(width) (3,513)

“:hello:” term:char.string(term) * char (‘:’,“hello”,‘:’)

Page 12: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

ConstraintsConstraints• {x:T | e}: set types allow you to constrain the type T

and express relationships between elements of the data.

• Examples:

‘a’ {c:char | c = ‘a’} (abbrev: Sc(‘a’)) inl ‘a’

“101”,“82”

{x:int | x > 100}inl 101,inr error(82)

“43|105|67”min:int.Sc(‘|’) *max:{m:int | min ≤ m}.Sc(‘|’) * {avg:int | min ≤ avg & avg ≤ max}

(43, inl ‘|’, inl 105, inl ‘|’, inl 67)

Page 13: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Unions and the Empty StringUnions and the Empty String• true: matches the empty string.• T + T’ : deterministic, exclusive or: try T; on failure, try

T’.• Examples:

“54”, “n/a” int + Ss(“n/a”) inl 54, inr “n/a”

“2341”, “” int + true inl 2341, inr ()

Page 14: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Array FeaturesArray Features• What features do we need to handle data

sequences?– Elements– Separator between elements– Termination condition (“are we done yet?”)– Terminator after sequence

• Examples:“192.168.1.1”“Bill|Cathy|Jane|Bob;”

Page 15: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

False and ArraysFalse and Arrays• T seq(Ts; e, Tt) specifies:

– Element type T– Separator types Ts.– Termination condition e.– Terminator type Tt.

• false: reads nothing, flagging an error.• Example: IP address.

“192.168.1.1” int seq(Sc(‘.’); len 4, false) [192,168,1,1]

Page 16: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Abstraction and ApplicationAbstraction and Application• Can parameterize types over values: x.T• Correspondingly, can apply types to values: T e• Example: IP address with terminator

none term.int seq(Sc(‘.’); len 4, Sc(term)) none

“1.2.3.4|” IP_addr ‘|’ * Sc(‘|’) ([1,2,3,4],inl ‘|’)

Page 17: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Absorb, Compute and ScanAbsorb, Compute and Scan• Absorb, Compute and Scan are active types.

– absorb(T) : consume data from source; produce nothing.– compute(e:) : consume nothing; output result of

computation e.– scan(T) : scan data source for type T.

• Examples:

“|” absorb(Sc(‘|’)) ()

“10|12”width:int.Sc(‘|’) *length:int. area:compute(width length:int)

(10,12,120)

“^%$!&_|” scan(Sc(‘|’)) (6,inl ‘|’)

Page 18: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Type KindingType Kinding• Kinding ensures types are well formed.

|- T : s k |- e : s |- T e: k

|- T : type |- T’ : type |- T + T’: type

|- T : type ,x:s |- e : bool (s = …) |- {x:T | e}: type

Page 19: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Parsing Semantics of TypesParsing Semantics of Types• Semantics expressed as parsing functions

written in the polymorphic -calculus.– Sem(T) : DDC Type Function– Input data and offset, output new offset, value and

parse descriptor.– For specifics, see upcoming technical report.

Page 20: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Types of Parser Output Types of Parser Output • Parsers produce values with following type in the host language:

DDC Host Language

[C(e)]rep I(C) + noval

[true]rep unit

[x:T.T’]rep [T]rep * [T’]rep

[x.T]rep, [T e]rep [T]rep

[T + T’]rep [T]rep + [T’]rep + noval

[{x:T | e}]rep [T]rep + ([T]rep error)

unrecoverable error

semantic error

dependency erased

Base Types

Products

Union

Abs. and App.

Set types

Page 21: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Properties of the CalculusProperties of the Calculus

• Theorem: If |- T : k then – [T] = F well formed types yield parsers |- F : bits * offset offset * [T]rep * [T]pd

a T-Parser returns values with types that correspond to T.

• Theorem: Parsers report errors accurately.– Errors in parse descriptor correspond to actual errors in data.– Parsers check all semantic constraints.– More …

Page 22: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Making Use of the CalculusMaking Use of the Calculus

IPADS

DDC

|- t T

IPADSt ::= C(e) | Pfun(x:s) = t | t e | Pstruct{fields} | Punion{fields} | Pswitch e of {alts tdef;} | Popt t | t Pwhere x.e | Palt{fields} | t [t; e,t] | Pcompute e | Plit cfields ::= | fields x : t;alts ::= | alts e => t;

Page 23: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Example: Example: PoptPopt and and PlitPlit

|- Popt t T + true

|- t T

|- Plit c scan(absorb({x:char | x = c}))

|- c : char

trueT1 + T2

C(e){x:T | e}absorb(T)scan(T)

Page 24: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Example: Example: PswitchPswitch

|- Pswitch e of {e1 => t1; e2 => t2; … tdef}

(c.{x:T1 | c = e1} + {x:T2 | c = e2} + …+ Tdef) e

|- ti Ti (i = 1…n) T + T’x.T{x:T|e}

|- tdef Tdef

Page 25: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

Future workFuture work• What are the set of languages recognized by the

DDC?• How does the expressive power of the DDC relate to

CFGs and regular expressions?• Implement recursive types in PADS system based on

the recursive types of the DDC.• Add polymorphism to DDC and PADS.

Page 26: The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker

SummarySummary• Data description languages are well-suited to

describing ad hoc data.• No one DDL will ever be right - different domains and

applications will demand different languages with differing levels of expressiveness and abstraction.

• Our work defines the first semantics for data description languages.

• For more information, visit www.padsproj.org.