PADS: Processing Arbitrary Data Streams
Kathleen Fisher, Robert Gruber
Data Binding, June 2003 2
The big picture
• Plethora of high-volume data streams, from which valuable information can be extracted.
– Call-detail data, web logs, provisioning streams, tcpdump data, etc.
• Desired operations:
– Programmatic manipulation
– Format translation (into XML, relational database, etc.)
– Declarative interaction
• Filtering, querying, aggregation, statistical profiling
Technical challenges
• Data arrives “as is.”
– Format determined by data source, not consumers.
– Often has little documentation.
– Some percentage of data is “buggy.”
• Often streams have high volume.
– Detect relevant errors (without necessarily halting program)
– Control how data is read (e.g. read header but skip body vs. read entire record).
• Parsing routines must be written to support any of the desired operations.
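The requirement to detect errors without halting can be sketched in C. This is a hand-written illustration, not PADS output: the record layout (a 3-digit code, a space, a byte count, one record per line) and the names `scan_stats` and `scan_records` are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    unsigned long good;   /* records that parsed cleanly */
    unsigned long bad;    /* records that were rejected and skipped */
    unsigned long bytes;  /* sum of the length field over good records */
} scan_stats;

/* Walks newline-separated records, counting bad ones instead of stopping. */
scan_stats scan_records(const char *data) {
    scan_stats st = {0, 0, 0};
    const char *p = data;
    assert(data != NULL);
    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        size_t n = eol ? (size_t)(eol - p) : strlen(p);
        char line[64];
        unsigned code;
        unsigned long len;
        if (n < sizeof line) {
            memcpy(line, p, n);
            line[n] = '\0';
            if (sscanf(line, "%3u %lu", &code, &len) == 2) {
                st.good++;
                st.bytes += len;
            } else {
                st.bad++;  /* malformed record: note it and keep going */
            }
        } else {
            st.bad++;      /* over-long record: count it and move on */
        }
        p += n;
        if (*p == '\n') p++;
    }
    return st;
}
```

Calling `scan_records("200 3013\noops\n404 10\n")` yields two good records, one bad record, and a byte total of 3023; the bad line does not stop the scan.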
Why not use C / Perl / shell scripts…?
Problems with hand-coded parsers:
• Writing them is time consuming and error prone.
• Reading them a few months later is difficult.
• Maintaining them in the face of even small format changes can be difficult.
• Programs break in subtle and machine-specific ways (endianness, word sizes).
• Such programs are often incomplete, particularly with respect to errors.
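The machine-specific breakage can be made concrete with a short C sketch. It assumes a 4-byte big-endian integer field; the function names `read_u32_native` and `read_u32_be` are hypothetical and are not part of PADS.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fragile: copies the bytes straight into a uint32_t, so the result
   depends on the byte order of the machine running the parser. */
uint32_t read_u32_native(const unsigned char *buf) {
    uint32_t v;
    assert(buf != NULL);
    memcpy(&v, buf, sizeof v);
    return v;
}

/* Portable: assembles the value byte by byte, so a big-endian field
   decodes to the same number on every host. */
uint32_t read_u32_be(const unsigned char *buf) {
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}
```

For the bytes `00 00 0B C5`, `read_u32_be` returns 3013 everywhere, while `read_u32_native` returns 3013 only on a big-endian host.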
Solution: PADS System (In Progress)
One person writes declarative description of data source:
– Physical format information
– Semantic constraints.
Many people use PADS data description and generated library.
PADS system generates
– C library interface for processing data.
• Reading (original / binary / XML / …)
• Writing (original / binary / XML / …)
• Accumulators
• …
– Application for querying stream.
PADS language
• Can describe ASCII, EBCDIC (Cobol), binary, and mixed data formats.
• Allows arbitrary boolean constraint expressions to describe expected properties of data.
• Type-based model: each type indicates how to read associated data.
• Provides rich and extensible set of base types.
– Pa_uint8, Pa_int8, Pa_uint16, …, Pe_uint8, …, Pb_int8, …, Pint8
– Pstring(:term-char:), Pstring_FW(:size:), Pstring_RE(:reg_exp:)
• Supports user-defined compound types to describe file structure:
– Pstruct, Parray, Punion, Ptypedef, Penum
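To illustrate what "each type indicates how to read associated data" means for the two string flavors, here is a rough C analogue of a terminator-delimited string and a fixed-width string. The function names, signatures, and the convention of returning -1 on failure are assumptions for this sketch; they are not the routines PADS generates.

```c
#include <assert.h>
#include <stddef.h>

/* Analogue of Pstring(:term-char:): copy up to, but not including, term.
   Returns the string length, or -1 if term is missing or out is too small. */
long read_string_term(const char *s, char term, char *out, size_t cap) {
    size_t n = 0;
    assert(s != NULL && out != NULL && cap > 0);
    while (s[n] != term) {
        if (s[n] == '\0' || n + 1 >= cap) return -1;
        out[n] = s[n];
        n++;
    }
    out[n] = '\0';
    return (long)n;
}

/* Analogue of Pstring_FW(:size:): copy exactly width characters.
   Returns width, or -1 if the input is too short or out is too small. */
long read_string_fw(const char *s, size_t width, char *out, size_t cap) {
    size_t n;
    assert(s != NULL && out != NULL && cap > 0);
    if (width + 1 > cap) return -1;
    for (n = 0; n < width; n++) {
        if (s[n] == '\0') return -1;
        out[n] = s[n];
    }
    out[width] = '\0';
    return (long)width;
}
```

The type alone decides where the field ends: a terminator character in the first case, a character count in the second.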
PADS compiler
• Converts description to C header and implementation files.
• For each built-in/user-defined type:
– Functions (read, accumulate, write, test data generation)
– In-memory representation
– Error description
– Mask (check constraints, set representation, suppress printing)
• Reading invariant: If mask is check and set and error description reports no errors, then in-memory representation satisfies all constraints in data description.
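The interplay of mask, error description, and in-memory representation can be sketched by hand for a single field. Everything here is hypothetical: the mask bits `M_CHECK` and `M_SET`, the `resp_ed` struct, the `resp_read` function, and the constraint that a response code lies in 100..599 are assumptions chosen for illustration, not the generated PADS interface.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical mask bits, loosely modeled on "check" and "set". */
enum { M_CHECK = 1, M_SET = 2 };

/* Hypothetical error description for one field. */
typedef struct {
    int nerr;     /* number of errors detected */
    int errCode;  /* 0 = none, 1 = syntax error, 2 = constraint violated */
} resp_ed;

/* Reads a 3-digit response code from s.
   Returns the number of characters consumed (3), or 0 on a syntax error. */
size_t resp_read(const char *s, unsigned mask, resp_ed *ed, unsigned *rep) {
    unsigned v = 0;
    size_t i;
    assert(s != NULL && ed != NULL && rep != NULL);
    ed->nerr = 0;
    ed->errCode = 0;
    for (i = 0; i < 3; i++) {
        if (!isdigit((unsigned char)s[i])) {
            ed->nerr = 1;
            ed->errCode = 1;
            return 0;
        }
        v = v * 10 + (unsigned)(s[i] - '0');
    }
    /* Semantic constraint, checked only when the mask asks for it. */
    if ((mask & M_CHECK) && (v < 100 || v > 599)) {
        ed->nerr = 1;
        ed->errCode = 2;
    }
    /* The representation is filled in only when the mask asks for it. */
    if (mask & M_SET)
        *rep = v;
    return 3;
}
```

With `M_CHECK | M_SET` and `ed.nerr == 0` after the call, `*rep` is known to satisfy the constraint, which mirrors the reading invariant for this one field.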
Example: CLF web log
• Common Log Format from Web Protocols and Practice.
• Fields:
– IP address of remote host, either resolved (as in the sample record below) or symbolic
– Remote identity (usually ‘-’ to indicate name not collected)
– Authenticated user (usually ‘-’ to indicate name not collected)
– Time associated with request
– Request (request method, request-uri, and protocol version)
– Response code
– Content length
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
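For comparison, the field list above can be pulled out of the sample record with a hand-coded `sscanf` parse. This is a sketch of the ad hoc approach, not of PADS: the `clf_entry` struct, its buffer sizes, and `clf_parse` are hypothetical, and the parser makes no attempt to handle variations such as a `-` content length.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of one CLF record. */
typedef struct {
    char host[64];
    char remote_id[32];
    char auth[32];
    char date[40];
    char method[16];
    char uri[256];
    char version[16];
    unsigned response;
    unsigned long content_length;
} clf_entry;

/* Returns 1 if all nine fields were read, 0 otherwise. */
int clf_parse(const char *line, clf_entry *e) {
    int n;
    assert(line != NULL && e != NULL);
    n = sscanf(line,
               "%63s %31s %31s [%39[^]]] \"%15s %255s %15[^\"]\" %3u %lu",
               e->host, e->remote_id, e->auth, e->date,
               e->method, e->uri, e->version,
               &e->response, &e->content_length);
    return n == 9;
}
```

The whole format lives inside one format string, and an error is reported only as "did not match", which is the kind of opacity a declarative description is meant to avoid.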
Example: CLF web log in PADS
Precord Pstruct http_weblog {
  host client;                 /- Client requesting service
  ' ';
  auth_id remoteID;            /- Remote identity
  ' ';
  auth_id auth;                /- Name of authenticated user
  " [";
  Pdate(:']':) date;           /- Timestamp of request
  "] ";
  http_request request;        /- Request
  ' ';
  Puint16_FW(:3:) response;    /- 3-digit response code
  ' ';
  Puint32 contentLength;       /- Bytes in response
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
PADSL example: user constraint
int checkVersion(http_v version, method_t meth) {
  if ((version.major == 1) && (version.minor == 1)) return 1;
  if ((meth == LINK) || (meth == UNLINK)) return 0;
  return 1;
}

Pstruct http_request {
  '\"';
  method_t meth;               /- Request method
  ' ';
  Pstring(:' ':) req_uri;      /- Requested uri.
  ' ';
  http_v version : checkVersion(version, meth);  /- HTTP version number of request
  '\"';
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
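The constraint function can be exercised on its own in plain C. The slide does not show the definitions of `method_t` and `http_v`, so the enum members and struct fields below are assumed stand-ins; only `checkVersion` itself is taken from the slide. The helper `checkVersion_examples` is likewise hypothetical.

```c
#include <assert.h>

/* Assumed stand-ins for types that the PADS description would define. */
typedef enum { GET, PUT, POST, HEAD, LINK, UNLINK } method_t;
typedef struct { unsigned major; unsigned minor; } http_v;

/* From the slide: any method is accepted with version 1.1; with any
   other version, LINK and UNLINK are rejected. */
int checkVersion(http_v version, method_t meth) {
    if ((version.major == 1) && (version.minor == 1)) return 1;
    if ((meth == LINK) || (meth == UNLINK)) return 0;
    return 1;
}

/* Hypothetical helper: spells out the cases as executable checks. */
void checkVersion_examples(void) {
    http_v v10 = {1, 0};
    http_v v11 = {1, 1};
    assert(checkVersion(v10, GET) == 1);   /* the sample record's request */
    assert(checkVersion(v11, LINK) == 1);
    assert(checkVersion(v10, LINK) == 0);
    assert(checkVersion(v10, UNLINK) == 0);
}
```

Because the constraint is an ordinary C function of already-parsed fields, it can relate two fields (`version` and `meth`) rather than validating each in isolation.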
PADSL example: arrays and unions
Parray nIP {
  Puint8 [4] : Psep == '.';
};

Parray sIP {
  Pstring(:"[. ]":) [] : Psep == '.' && Pterm == ' ';
};

Punion host {
  nIP resolved;    /- 135.207.23.32
  sIP symbolic;    /- www.research.att.com
};

Punion auth_id {
  Pchar unauthorized : unauthorized == '-';  /- non-authenticated http session
  Pstring(:' ':) id;                         /- login supplied during authentication
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
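One way to picture the `host` union is as a tagged union in C whose branches are tried in order, taking the first that reads without error. The sketch below is hand-written under several assumptions: the tag names, the 64-byte name buffer, storing the symbolic name as one dotted string rather than an array of labels, and requiring a space or end of input after a numeric address are all simplifications, not PADS behavior.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { HOST_RESOLVED, HOST_SYMBOLIC, HOST_ERROR } host_tag;

typedef struct {
    host_tag tag;
    union {
        unsigned char resolved[4];  /* nIP: four values separated by '.' */
        char symbolic[64];          /* sIP: kept as one dotted name here */
    } val;
} host;

/* Four decimal values in 0..255 separated by '.'; returns 1 on success. */
static int read_nIP(const char *s, unsigned char out[4]) {
    int i;
    for (i = 0; i < 4; i++) {
        unsigned v = 0;
        int digits = 0;
        while (isdigit((unsigned char)*s) && digits < 3) {
            v = v * 10 + (unsigned)(*s - '0');
            s++;
            digits++;
        }
        if (digits == 0 || v > 255) return 0;
        out[i] = (unsigned char)v;
        if (i < 3) {
            if (*s != '.') return 0;  /* separator between elements */
            s++;
        }
    }
    return *s == ' ' || *s == '\0';
}

/* Non-empty run of characters up to a space; returns 1 on success. */
static int read_sIP(const char *s, char *out, size_t cap) {
    size_t n = 0;
    while (s[n] != '\0' && s[n] != ' ') {
        if (n + 1 >= cap) return 0;
        out[n] = s[n];
        n++;
    }
    out[n] = '\0';
    return n > 0;
}

/* Tries the branches in declaration order, as the union lists them. */
host host_read(const char *s) {
    host h;
    assert(s != NULL);
    if (read_nIP(s, h.val.resolved))
        h.tag = HOST_RESOLVED;
    else if (read_sIP(s, h.val.symbolic, sizeof h.val.symbolic))
        h.tag = HOST_SYMBOLIC;
    else
        h.tag = HOST_ERROR;
    return h;
}
```

For the sample record, `host_read` selects the resolved branch with the bytes 207, 136, 97, 50; for `www.research.att.com` it falls through to the symbolic branch.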
Generated type declarations
typedef struct {
  host client;        /* Client requesting service */
  auth_id remoteID;   /* Remote identity */
  …
} http_weblog;

typedef struct {
  host_m client;
  auth_id_m remoteID;
  …
} http_weblog_m;

typedef struct {
  int nerr;
  int errCode;
  PDC_loc loc;
  int panic;
  host_ed client;
  auth_id_ed remoteID;
  …;
} http_weblog_ed;
Sample use
PDC_t *pdc;
http_weblog entry;
http_weblog_m mask;
http_weblog_ed ed;

PDC_open(&pdc, 0 /* PADS disc */, 0 /* PADS IO disc */);
PDC_IO_fopen(pdc, fileName);
... call init functions ...
http_weblog_mask(&mask, PCheck & PSet);
while (!PDC_IO_at_EOF(pdc)) {
  http_weblog_read(pdc, &mask, &ed, &entry);
  if (ed.nerr != 0) {
    ... Error handling ...
  }
  ... Process/query entry ...
}
... call cleanup functions ...
PDC_IO_fclose(pdc);
PDC_close(pdc);
Related work
• ASN.1, ASDL
– Describe logical representation, generate physical.
• DataScript [Back: CGSE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000]
– Binary only
– Stop on first error
PADS to do
• Allow library generation to be customized with application-specific information:
– Repair errors, ignore certain fields, customize in-memory representation, etc.
• Explore declarative querying via integration with XQuery (joint work with Mary Fernandez and Ricardo Medel).
• Support data translation
– Requires mapping from one in-memory representation to another.
• Develop user-base and integrate feedback.
– What would you want in such a tool?
Getting PADS
PADS will be available shortly for download with a non-commercial-use license.
http://www.research.att.com/projects/pads