
PADS: Processing Arbitrary Data Streams

Kathleen Fisher, Robert Gruber


Data Binding, June 2003 2

The big picture

• Plethora of high-volume data streams, from which valuable information can be extracted.

– Call-detail data, web logs, provisioning streams, tcpdump data, etc.

• Desired operations:

– Programmatic manipulation

– Format translation (into XML, relational database, etc.)

– Declarative interaction

• Filtering, querying, aggregation, statistical profiling


Technical challenges

• Data arrives “as is.”

– Format determined by data source, not consumers.

– Often has little documentation.

– Some percentage of data is “buggy.”

• Often streams have high volume.

– Detect relevant errors (without necessarily halting program)

– Control how data is read (e.g. read header but skip body vs. read entire record).

• Parsing routines must be written to support any of the desired operations.


Why not use C / Perl / shell scripts…?

Problems with hand-coded parsers:

• Writing them is time consuming and error prone.

• Reading them a few months later is difficult.

• Maintaining them in the face of even small format changes can be difficult.

• Programs break in subtle and machine-specific ways (endianness, word sizes).

• Such programs are often incomplete, particularly with respect to errors.
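The machine-specific breakage mentioned above can be sketched in C. Copying the bytes of a binary field straight into a native integer gives different values on little-endian and big-endian hosts, while assembling the value byte by byte does not. The function names and the choice of a 32-bit big-endian field are assumptions made for illustration; they do not come from PADS.

```c
#include <stdint.h>
#include <string.h>

/* Fragile: the result depends on the byte order of the host. */
uint32_t read_u32_native(const unsigned char *buf) {
    uint32_t v;
    memcpy(&v, buf, sizeof v);   /* copies the bytes in host order */
    return v;
}

/* Portable: interprets buf as big-endian on every host. */
uint32_t read_u32_be(const unsigned char *buf) {
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

A hand-coded parser that uses the first form may pass every test on the machine where it was written and silently misread the same file elsewhere.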


Solution: PADS System (In Progress)

One person writes declarative description of data source:

– Physical format information

– Semantic constraints.

Many people use PADS data description and generated library.

PADS system generates

– C library interface for processing data.

• Reading (original / binary / XML / …)

• Writing (original / binary / XML / …)

• Accumulators

• …

– Application for querying stream.
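The "Accumulators" entry in the list above refers to statistical profiling of the values read from a stream. As a rough sketch of what such an accumulator might track for one unsigned 32-bit field, consider the following. The type `u32_acc` and its functions are hypothetical names chosen for this example and are not the generated PADS interface.

```c
#include <stdint.h>

/* Hypothetical accumulator for one uint32 field. */
typedef struct {
    unsigned long good, bad;   /* values read without / with errors */
    uint32_t min, max;         /* range of the good values */
    double sum;                /* running sum of the good values */
} u32_acc;

void u32_acc_init(u32_acc *a) {
    a->good = a->bad = 0;
    a->min = UINT32_MAX;
    a->max = 0;
    a->sum = 0.0;
}

/* nerr is the error count reported when the value was read. */
void u32_acc_add(u32_acc *a, uint32_t v, int nerr) {
    if (nerr != 0) { a->bad++; return; }
    a->good++;
    if (v < a->min) a->min = v;
    if (v > a->max) a->max = v;
    a->sum += v;
}

double u32_acc_mean(const u32_acc *a) {
    return a->good ? a->sum / (double)a->good : 0.0;
}
```

Counting bad values separately keeps erroneous records from skewing the summary while still reporting how common they are.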


PADS language

• Can describe ASCII, EBCDIC (Cobol), binary, and mixed data formats.

• Allows arbitrary boolean constraint expressions to describe expected properties of data.

• Type-based model: each type indicates how to read associated data.

• Provides rich and extensible set of base types.

– Pa_uint8, Pa_int8, Pa_uint16, …, Pe_uint8, …, Pb_int8, …, Pint8

– Pstring(:term-char:), Pstring_FW(:size:), Pstring_RE(:reg_exp:)

• Supports user-defined compound types to describe file structure:

– Pstruct, Parray, Punion, Ptypedef, Penum


PADS compiler

• Converts description to C header and implementation files.

• For each built-in/user-defined type:

– Functions (read, accumulate, write, test data generation)

– In-memory representation

– Error description

– Mask (check constraints, set representation, suppress printing)

• Reading invariant: If mask is check and set and error description reports no errors, then in-memory representation satisfies all constraints in data description.
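The relationship between mask, error description, and in-memory representation can be sketched with a hand-written reader for a single field. Everything here is an assumption made for illustration: the names `M_CHECK`, `M_SET`, `rec_t`, `rec_mask_t`, `rec_ed_t`, and `rec_read`, and the range constraint on the response code. The code PADS generates may look quite different.

```c
/* Hypothetical mask bits; the generated PADS names may differ. */
enum { M_CHECK = 1, M_SET = 2 };

typedef struct { unsigned response; } rec_t;       /* in-memory representation */
typedef struct { unsigned response; } rec_mask_t;  /* one mask per field */
typedef struct { int nerr; } rec_ed_t;             /* error description */

/* Reads a 3-digit response code from s.
 * Returns 0 if no errors were recorded, -1 otherwise. */
int rec_read(const char *s, const rec_mask_t *m, rec_ed_t *ed, rec_t *rep) {
    unsigned v = 0;
    ed->nerr = 0;
    for (int i = 0; i < 3; i++) {
        if (s[i] < '0' || s[i] > '9') { ed->nerr++; return -1; }  /* syntax error */
        v = v * 10 + (unsigned)(s[i] - '0');
    }
    /* Semantic constraint assumed for this example: 100 <= response < 600. */
    if ((m->response & M_CHECK) && (v < 100 || v >= 600)) ed->nerr++;
    if (m->response & M_SET) rep->response = v;
    return ed->nerr ? -1 : 0;
}
```

With the mask set to `M_CHECK | M_SET`, a zero `nerr` implies that `rep->response` holds a value satisfying the constraint, which mirrors the reading invariant. Dropping `M_CHECK` skips the constraint test, so a zero `nerr` then says only that the field was syntactically well formed.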


Example: CLF web log

• Common Log Format from Web Protocols and Practice.

• Fields:

– IP address of remote host, either resolved (as in the sample line below) or symbolic

– Remote identity (usually ‘-’ to indicate name not collected)

– Authenticated user (usually ‘-’ to indicate name not collected)

– Time associated with request

– Request (request method, request-uri, and protocol version)

– Response code

– Content length

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
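For comparison with the PADS description that follows in the talk, here is a minimal hand-coded C parser for the sample line, written with `sscanf`. The struct `clf_entry`, the function `clf_parse`, and the buffer sizes are assumptions made for this sketch.

```c
#include <stdio.h>

typedef struct {
    char host[64];
    char remote_id[32];
    char auth[32];
    char date[32];
    char method[16];
    char uri[256];
    char version[16];
    unsigned response;
    unsigned long length;
} clf_entry;

/* Returns 1 if all nine fields were parsed, 0 otherwise. */
int clf_parse(const char *line, clf_entry *e) {
    int n = sscanf(line,
                   "%63s %31s %31s [%31[^]]] \"%15s %255s %15[^\"]\" %u %lu",
                   e->host, e->remote_id, e->auth, e->date,
                   e->method, e->uri, e->version,
                   &e->response, &e->length);
    return n == 9;
}
```

The sketch shows the kind of weakness listed earlier for hand-coded parsers: it reports only success or failure, says nothing about which field was malformed, and checks no semantic constraints such as the range of the response code.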


Example: CLF web log in PADS

Precord Pstruct http_weblog {
  host client;                 /- Client requesting service
  ' ';
  auth_id remoteID;            /- Remote identity
  ' ';
  auth_id auth;                /- Name of authenticated user
  " [";
  Pdate(:']':) date;           /- Timestamp of request
  "] ";
  http_request request;        /- Request
  ' ';
  Puint16_FW(:3:) response;    /- 3-digit response code
  ' ';
  Puint32 contentLength;       /- Bytes in response
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013


PADSL example: user constraint

int checkVersion(http_v version, method_t meth) {
  if ((version.major == 1) && (version.minor == 1)) return 1;
  if ((meth == LINK) || (meth == UNLINK)) return 0;
  return 1;
}

Pstruct http_request {
  '\"';
  method_t meth;               /- Request method
  ' ';
  Pstring(:' ':) req_uri;      /- Requested uri.
  ' ';
  http_v version : checkVersion(version, meth);
                               /- HTTP version number of request
  '\"';
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
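To see which requests the constraint accepts, `checkVersion` can be compiled as ordinary C once its argument types exist. The definitions of `method_t` and `http_v` below are hypothetical stand-ins for what PADS would generate from the description (the slides do not show them); the function body is copied from the slide.

```c
/* Hypothetical stand-ins for the PADS-generated types. */
typedef enum { GET, PUT, POST, HEAD, DELETE, LINK, UNLINK } method_t;
typedef struct { unsigned major; unsigned minor; } http_v;

/* Constraint from the slide: returns 1 if the pair is acceptable. */
int checkVersion(http_v version, method_t meth) {
    if ((version.major == 1) && (version.minor == 1)) return 1;
    if ((meth == LINK) || (meth == UNLINK)) return 0;
    return 1;
}
```

Under these definitions the sample line (GET with HTTP/1.0) satisfies the constraint, a LINK or UNLINK request with any version other than 1.1 violates it, and every method is accepted with version 1.1.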


PADSL example: arrays and unions

Parray nIP {
  Puint8 [4] : Psep == '.';
};

Parray sIP {
  Pstring(:"[. ]":) [] : Psep == '.' && Pterm == ' ';
};

Punion host {
  nIP resolved;                /- 135.207.23.32
  sIP symbolic;                /- www.research.att.com
};

Punion auth_id {
  Pchar unauthorized : unauthorized == '-';
                               /- non-authenticated http session
  Pstring(:' ':) id;           /- login supplied during authentication
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
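One way to read the `host` union is as a list of alternatives tried in order: first the numeric form, then the symbolic one. The hand-written C below sketches that reading for an already isolated host field. The ordered-choice interpretation, the tagged-union layout, and all names (`host_t`, `host_tag`, `host_parse`) are assumptions of this example, not the code PADS generates.

```c
#include <stdio.h>
#include <string.h>

typedef enum { HOST_RESOLVED, HOST_SYMBOLIC, HOST_ERROR } host_tag;

typedef struct {
    host_tag tag;                    /* which branch matched */
    union {
        unsigned char resolved[4];   /* e.g. 135.207.23.32 */
        char symbolic[64];           /* e.g. www.research.att.com */
    } val;
} host_t;

/* s is the host field alone, without the terminating space. */
host_tag host_parse(const char *s, host_t *h) {
    unsigned a, b, c, d;
    char extra;
    /* Branch 1: four values separated by '.', each fitting in 8 bits. */
    if (sscanf(s, "%3u.%3u.%3u.%3u%c", &a, &b, &c, &d, &extra) == 4 &&
        a < 256 && b < 256 && c < 256 && d < 256) {
        h->val.resolved[0] = (unsigned char)a;
        h->val.resolved[1] = (unsigned char)b;
        h->val.resolved[2] = (unsigned char)c;
        h->val.resolved[3] = (unsigned char)d;
        return h->tag = HOST_RESOLVED;
    }
    /* Branch 2: any non-empty name without spaces that fits the buffer. */
    size_t n = strlen(s);
    if (n > 0 && n < sizeof h->val.symbolic && strchr(s, ' ') == NULL) {
        memcpy(h->val.symbolic, s, n + 1);
        return h->tag = HOST_SYMBOLIC;
    }
    return h->tag = HOST_ERROR;
}
```

The tag records which branch matched, so later code can tell a resolved address from a symbolic name without re-parsing the field.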


Generated type declarations

typedef struct {
  host client;                 /* Client requesting service */
  auth_id remoteID;            /* Remote identity */
  …
} http_weblog;

typedef struct {
  host_m client;
  auth_id_m remoteID;
  …
} http_weblog_m;

typedef struct {
  int nerr;
  int errCode;
  PDC_loc loc;
  int panic;
  host_ed client;
  auth_id_ed remoteID;
  …;
} http_weblog_ed;


Sample use

PDC_t *pdc;
http_weblog entry;
http_weblog_m mask;
http_weblog_ed ed;

PDC_open(&pdc, 0 /* PADS disc */, 0 /* PADS IO disc */);
PDC_IO_fopen(pdc, fileName);
... call init functions ...
http_weblog_mask(&mask, PCheck & PSet);
while (!PDC_IO_at_EOF(pdc)) {
  http_weblog_read(pdc, &mask, &ed, &entry);
  if (ed.nerr != 0) {
    ... Error handling ...
  }
  ... Process/query entry ...
};
... call cleanup functions ...
PDC_IO_fclose(pdc);
PDC_close(pdc);


Related work

• ASN.1, ASDL

– Describe logical representation, generate physical.

• DataScript [Back: GPCE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000]

– Binary only

– Stop on first error


PADS to do

• Allow library generation to be customized with application-specific information:

– Repair errors, ignore certain fields, customize in-memory representation, etc.

• Explore declarative querying via integration with XQuery (joint work with Mary Fernandez and Ricardo Medel).

• Support data translation

– Requires mapping from one in-memory representation to another.

• Develop user-base and integrate feedback.

– What would you want in such a tool?


Getting PADS

PADS will be available shortly for download with a non-commercial-use license.

http://www.research.att.com/projects/pads