bubbles – virtual data objects

Bubbles Virtual Data Objects June 2013 Stefan Urbanek data brewery

Upload: stefan-urbanek

Post on 18-Dec-2014

DESCRIPTION

Bubbles is a data framework for creating data processing and monitoring pipelines.

TRANSCRIPT

Page 1: Bubbles – Virtual Data Objects

Bubbles: Virtual Data Objects

June 2013, Stefan Urbanek

data brewery

Page 2: Bubbles – Virtual Data Objects

Contents

■ Data Objects

■ Operations

■ Context

■ Stores

■ Pipeline

Page 3: Bubbles – Virtual Data Objects

Brewery 1 Issues

■ based on streaming data by records: buffering in Python lists as Python objects

■ stream networks were using threads: hard to debug, performance penalty (GIL)

■ no use of native data operations

■ difficult to extend

Page 4: Bubbles – Virtual Data Objects

About

Python framework for data processing and quality probing

v3.3

Page 5: Bubbles – Virtual Data Objects

Objective

focus on the process, not data technology

Page 6: Bubbles – Virtual Data Objects

Data

■ keep data in their original form

■ use native operations if possible

■ performance provided by technology

■ have other options

Page 7: Bubbles – Virtual Data Objects

for categorical data*

* you can do numerical too, but there are plenty of other, better tools for that

Page 8: Bubbles – Virtual Data Objects

Data Objects

Page 9: Bubbles – Virtual Data Objects

A data object represents structured data.

Data do not have to be in their final form, nor do they have to exist. A promise of providing data in the future is just fine.

Data are virtual.

Page 10: Bubbles – Virtual Data Objects

virtual data object

fields: id, product, category, amount, unit price

virtual data

representations: SQL statement, iterator

Page 11: Bubbles – Virtual Data Objects

Data Object

■ is defined by fields

■ has one or more representations

■ might be consumable: one-time use objects such as streamed data

SQL statement

iterator

Page 12: Bubbles – Virtual Data Objects

Fields

■ define structure of data object

■ storage metadata: generalized storage type, concrete storage type

■ usage metadata: purpose (analytical point of view), missing values, ...

Page 13: Bubbles – Virtual Data Objects

Field List (sample metadata)

name         storage type   analytical type (purpose)   sample value
id           integer        typeless                    100
product      string         nominal                     Atari 1040ST
category     string         nominal                     computer
amount       integer        discrete                    10
unit price   float          measure                     400.0
year         integer        ordinal                     1985
shipped      string         flag                        no
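
The field list above can be modeled as plain metadata records. This is a minimal sketch, assuming a hypothetical `Field` dataclass and helper function (neither is taken from the Bubbles API); the names and types are the ones shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical field description: a name plus storage and usage metadata."""
    name: str
    storage_type: str      # generalized storage type, e.g. "integer"
    analytical_type: str   # purpose from the analytical point of view


# Field list from the slide
fields = [
    Field("id", "integer", "typeless"),
    Field("product", "string", "nominal"),
    Field("category", "string", "nominal"),
    Field("amount", "integer", "discrete"),
    Field("unit price", "float", "measure"),
    Field("year", "integer", "ordinal"),
    Field("shipped", "string", "flag"),
]


def names_by_analytical_type(field_list, analytical_type):
    """Return the names of fields that have the given analytical type."""
    return [f.name for f in field_list if f.analytical_type == analytical_type]


nominal = names_by_analytical_type(fields, "nominal")
```

Keeping the analytical type next to the storage type lets an operation pick fields by purpose rather than by how they are stored.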

Page 14: Bubbles – Virtual Data Objects

Representations

SQL statement: SELECT * FROM products WHERE price < 100
(SQL statement that can be composed)

iterator: engine.execute(statement)
(actual rows fetched from database)

Page 15: Bubbles – Virtual Data Objects

Representations

■ represent actual data in some way: SQL statement, CSV file, API query, iterator, ...

■ decided at runtime: list might be dynamic, based on metadata, availability, …

■ used for data object operations: filtering, composition, transformation, …

Page 16: Bubbles – Virtual Data Objects

Representations

SQL statement: natural, most efficient for operations

iterator: default, all-purpose, might be very expensive

Page 17: Bubbles – Virtual Data Objects

Representations

>>> object.representations()
["sql_table", "postgres+sql", "sql", "rows"]

sql_table: data might have been cached in a table

postgres+sql: we might use PostgreSQL dialect specific features...

sql: … or fall back to generic SQL

rows: for all other operations

Page 18: Bubbles – Virtual Data Objects

Data Object Role

■ source: provides data; various source representations such as rows()

■ target: consumes data; append(row), append_from(object), ...

target.append_from(source)
(implementation might depend on source)

for row in source.rows():
    print(row)

Page 19: Bubbles – Virtual Data Objects

Append From ...

target.append_from(source)

Iterator to SQL:

for row in source.rows():
    INSERT INTO target (...)

SQL to SQL (same engine):

INSERT INTO target SELECT … FROM source
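
The two append strategies can be sketched with the standard library `sqlite3` module. This is an illustration under stated assumptions, not the Bubbles implementation: `SQLTable` and `RowList` are hypothetical classes, "same engine" is approximated as "same connection object", and the second product row is made-up sample data.

```python
import sqlite3


class RowList:
    """Hypothetical iterator-only source."""
    def __init__(self, data):
        self.data = data

    def rows(self):
        return iter(self.data)


class SQLTable:
    """Hypothetical SQL-backed data object (illustration only)."""
    def __init__(self, connection, name):
        self.connection = connection
        self.name = name

    def rows(self):
        # Iterator representation: actual rows fetched from the database
        return iter(self.connection.execute(f"SELECT * FROM {self.name}"))

    def append_from(self, source):
        same_engine = (isinstance(source, SQLTable)
                       and source.connection is self.connection)
        if same_engine:
            # SQL to SQL: the data never leaves the database
            self.connection.execute(
                f"INSERT INTO {self.name} SELECT * FROM {source.name}")
            return "sql"
        # Fallback: consume the source iterator row by row
        for row in source.rows():
            marks = ", ".join("?" * len(row))
            self.connection.execute(
                f"INSERT INTO {self.name} VALUES ({marks})", row)
        return "rows"


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE source (id INTEGER, product TEXT)")
db.execute("CREATE TABLE target (id INTEGER, product TEXT)")
db.execute("INSERT INTO source VALUES (100, 'Atari 1040ST')")

source = SQLTable(db, "source")
target = SQLTable(db, "target")

used_for_sql = target.append_from(source)
used_for_rows = target.append_from(RowList([(101, "Atari 520ST")]))
copied = sorted(target.rows())
```

The caller writes `target.append_from(source)` in both cases; only the target decides which strategy is used.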

Page 20: Bubbles – Virtual Data Objects

Operations

Page 21: Bubbles – Virtual Data Objects

Operation

does something useful with data object and produces another data object

or something else, also useful

Page 22: Bubbles – Virtual Data Objects

Signature

@operation("sql")
def sample(context, object, limit): ...

signature: "sql" (accepted representation)

Page 23: Bubbles – Virtual Data Objects

@operation

unary:

@operation("sql")
def sample(context, object, limit): ...

binary:

@operation("sql", "sql")
def new_rows(context, target, source): ...

binary with same name but different signature:

@operation("sql", "rows", name="new_rows")
def new_rows_iter(context, target, source): ...

Page 24: Bubbles – Virtual Data Objects

List of Objects

@operation("sql[]")
def append(context, objects): ...

@operation("rows[]")
def append(context, objects): ...

matches one of common representations of all objects in the list

Page 25: Bubbles – Virtual Data Objects

Any / Default

@operation("*")
def do_something(context, object): ...

default operation – if no signature matches
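
One way to picture what the decorator does is a registry keyed by operation name and signature. The sketch below is an assumption about the mechanism, not the Bubbles internals: `registry` and this `operation` function are hypothetical stand-ins, and the function bodies are placeholders.

```python
from collections import defaultdict

# name -> {signature tuple -> function}; a toy registry for illustration
registry = defaultdict(dict)


def operation(*signature, name=None):
    """Register the decorated function under its name and signature."""
    def decorator(func):
        registry[name or func.__name__][signature] = func
        return func
    return decorator


@operation("sql")
def sample(context, obj, limit):
    return f"sample in SQL, limit {limit}"


@operation("sql", "sql")
def new_rows(context, target, source):
    return "new rows composed in SQL"


# Same operation name, different signature
@operation("sql", "rows", name="new_rows")
def new_rows_iter(context, target, source):
    return "new rows by iterating over the source"


# Default: used if no other signature matches
@operation("*")
def do_something(context, obj):
    return "default implementation"


signatures = sorted(registry["new_rows"])
```

Here `new_rows` ends up with two implementations, one per signature, which is what makes a later choice by representation possible.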

Page 26: Bubbles – Virtual Data Objects

Context

Page 27: Bubbles – Virtual Data Objects

Context

collection of operations

Page 28: Bubbles – Virtual Data Objects

Operation Call

context = Context()
context.operation("sample")(source, 10)

callable reference: context.operation("sample")

runtime dispatch: the sample implementation (SQL or iterator) is chosen when called

Page 29: Bubbles – Virtual Data Objects

Simplified Call

context.operation("sample")(source, 10)

context.o.sample(source, 10)

Page 30: Bubbles – Virtual Data Objects

Dispatch

operation is chosen based on signature

Example: we do not have this kind of operation for MongoDB, so we use default iterator instead

Page 31: Bubbles – Virtual Data Objects

Dispatch

dynamic dispatch of operations based on representations of argument objects

Page 32: Bubbles – Virtual Data Objects

Priority

order of representations matters; might be decided during runtime

same representations, different order
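
Dispatch and priority can be sketched for a unary operation as "take the first representation of the object that some implementation accepts". The names below (`DataObject`, `dispatch`, `implementations`) are hypothetical, and the rule itself is an assumption drawn from the slides rather than the actual Bubbles dispatcher.

```python
class DataObject:
    """Hypothetical object that lists its representations in priority order."""
    def __init__(self, *representations):
        self._representations = list(representations)

    def representations(self):
        return self._representations


def sample_sql(obj, limit):
    return "SQL"


def sample_rows(obj, limit):
    return "iterator"


# Implementations of one operation, keyed by accepted representation
implementations = {"sql": sample_sql, "rows": sample_rows}


def dispatch(implementations, obj):
    """Pick the implementation for the first matching representation."""
    for representation in obj.representations():
        if representation in implementations:
            return implementations[representation]
    # No signature matches: use the "*" default if one is registered
    if "*" in implementations:
        return implementations["*"]
    raise LookupError("no matching signature")


sql_first = dispatch(implementations, DataObject("sql", "rows"))
rows_first = dispatch(implementations, DataObject("rows", "sql"))
# No Mongo-specific implementation exists, so the iterator one is used
mongo = dispatch(implementations, DataObject("mongo", "rows"))
```

The first two objects have the same representations in a different order and therefore get different implementations.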

Page 33: Bubbles – Virtual Data Objects

Incapable?

join details of two SQL objects:

same connection (A and A): use the SQL join details

different connection (A and B): this fails

Page 34: Bubbles – Virtual Data Objects

Retry!

If objects are not composable as expected, an operation might gently fail and request a retry with another signature:

raise RetryOperation("rows", "rows")
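
The retry request can be sketched as an exception that carries the signature to try next. Everything here is a hypothetical stand-in written for illustration: this `RetryOperation` class, the `Table` class, the `call` helper and the comparison of connection names are assumptions, not the Bubbles code.

```python
class RetryOperation(Exception):
    """Raised by an operation to request a retry with another signature."""
    def __init__(self, *signature):
        super().__init__(signature)
        self.signature = signature


class Table:
    """Hypothetical SQL object bound to a named connection."""
    def __init__(self, connection, data):
        self.connection = connection
        self.data = data


def join_details_sql(master, detail):
    if master.connection != detail.connection:
        # The two objects cannot be composed into one SQL statement
        raise RetryOperation("rows", "rows")
    return "joined in SQL"


def join_details_rows(master, detail):
    return "joined by iterating"


implementations = {
    ("sql", "sql"): join_details_sql,
    ("rows", "rows"): join_details_rows,
}


def call(signature, *args):
    """Call the implementation and follow one retry request if raised."""
    try:
        return implementations[signature](*args)
    except RetryOperation as retry:
        return implementations[retry.signature](*args)


same = call(("sql", "sql"), Table("A", []), Table("A", []))
different = call(("sql", "sql"), Table("A", []), Table("B", []))
```

The SQL implementation stays simple: it only handles the case it is good at and hands everything else back to the dispatcher.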

Page 35: Bubbles – Virtual Data Objects

Retry when...

■ not able to compose objects because of different connections or other reasons

■ not able to use representation as expected

■ any other reason

Page 36: Bubbles – Virtual Data Objects

Modules

collection of operations

SQL, Iterator, MongoDB*

* just an example

Page 37: Bubbles – Virtual Data Objects

Extend Context

context.add_operations_from(obj)

obj: any object that has operations as attributes, such as a module
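
Collecting operations from an object could work by scanning its attributes for decorated functions. This is a sketch under assumptions: the marker attribute `signature`, the toy `operation` decorator and this `Context` class are hypothetical, and a `types.SimpleNamespace` stands in for a module.

```python
import types


def operation(*signature):
    """Toy decorator that only marks a function as an operation."""
    def decorator(func):
        func.signature = signature
        return func
    return decorator


class Context:
    """Hypothetical context: a collection of operations."""
    def __init__(self):
        self.operations = {}

    def add_operations_from(self, obj):
        # Collect every attribute that was marked by the decorator
        for name in dir(obj):
            attribute = getattr(obj, name)
            if callable(attribute) and hasattr(attribute, "signature"):
                self.operations.setdefault(name, []).append(attribute)


@operation("rows")
def sample(context, obj, limit):
    return list(obj)[:limit]


# A stand-in for a module that defines operations
my_ops = types.SimpleNamespace()
my_ops.sample = sample
my_ops.helper = lambda: None  # not an operation, so it is ignored

context = Context()
context.add_operations_from(my_ops)
result = context.operations["sample"][0](context, [1, 2, 3], 2)
```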

Page 38: Bubbles – Virtual Data Objects

Stores

Page 39: Bubbles – Virtual Data Objects

Object Store

■ contains objects: tables, files, collections, ...

■ objects are named: get_object(name)

■ might create objects: create(name, replace, ...)

Page 40: Bubbles – Virtual Data Objects

Object Store

store = open_store("sql", "postgres://localhost/data")

"sql" selects the store factory

Factories: sql, csv (directory), memory, ...

Page 41: Bubbles – Virtual Data Objects

Stores and Objects

source = open_store("sql", "postgres://localhost/data")
target = open_store("csv", "./data/")

source_obj = source.get_object("products")
target_obj = target.create("products", fields=source_obj.fields)

for row in source_obj.rows():
    target_obj.append(row)

target_obj.flush()

copy data from SQL table to CSV

Page 42: Bubbles – Virtual Data Objects

Pipeline

Page 43: Bubbles – Virtual Data Objects

Pipeline

sequence of operations on "trunk"

Page 44: Bubbles – Virtual Data Objects

Pipeline Operations

stores = {
    "source": open_store("sql", "postgres://localhost/data"),
    "target": open_store("csv", "./data/")
}

p = Pipeline(stores=stores)
p.source("source", "products")
p.distinct("color")
p.create("target", "product_colors")

operations: first argument is result from previous step

extract product colors to CSV
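
The "result from previous step" idea can be sketched with a toy pipeline that records steps and threads one result through them. This is not the Bubbles `Pipeline`: the class below is a hypothetical stand-in, stores are plain dictionaries of row lists, the `run()` method is an assumption (the slide does not show how a pipeline is executed), and the product rows are made-up sample data.

```python
class Pipeline:
    """Toy pipeline: a recorded sequence of steps run on a single trunk."""
    def __init__(self, stores):
        self.stores = stores
        self.steps = []

    def source(self, store, name):
        self.steps.append(lambda _: list(self.stores[store][name]))
        return self

    def distinct(self, key):
        def step(rows):
            seen = []
            for row in rows:
                if row[key] not in seen:
                    seen.append(row[key])
            return [{key: value} for value in seen]
        self.steps.append(step)
        return self

    def create(self, store, name):
        def step(rows):
            self.stores[store][name] = rows
            return rows
        self.steps.append(step)
        return self

    def run(self):
        result = None
        for step in self.steps:
            # Each step receives the result of the previous step
            result = step(result)
        return result


stores = {
    "source": {"products": [{"name": "a", "color": "red"},
                            {"name": "b", "color": "blue"},
                            {"name": "c", "color": "red"}]},
    "target": {},
}

p = Pipeline(stores=stores)
p.source("source", "products")
p.distinct("color")
p.create("target", "product_colors")
p.run()
```

Nothing touches the data until `run()` is called, so the same recorded steps could in principle be executed by a different backend.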

Page 45: Bubbles – Virtual Data Objects

Pipeline

p.source(store, object_name, ...)
    store.get_object(...)

p.create(store, object_name, ...)
    store.create(...)
    store.append_from(...)

Page 46: Bubbles – Virtual Data Objects

Operation Library

Page 47: Bubbles – Virtual Data Objects

Filtering

■ row filters: filter_by_value, filter_by_set, filter_by_range

■ field_filter (ctx, obj, keep=[], drop=[], rename={})

keep, drop, rename fields

■ sample (ctx, obj, value, mode)

first N, every Nth, random, …
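
For the iterator representation, the field filter and sampling could look roughly like this. It is a sketch only: rows are assumed to be dictionaries, the `ctx` argument is left out, the mode names `"first"` and `"nth"` are assumptions, and the random mode is omitted.

```python
from itertools import islice


def field_filter(rows, keep=None, drop=None, rename=None):
    """Keep, drop and rename fields of dict rows (iterator version)."""
    drop = set(drop or [])
    rename = rename or {}
    for row in rows:
        names = keep if keep else [n for n in row if n not in drop]
        yield {rename.get(n, n): row[n] for n in names}


def sample(rows, value, mode="first"):
    """Return the first N rows or every Nth row."""
    if mode == "first":
        return islice(rows, value)
    if mode == "nth":
        return islice(rows, 0, None, value)
    raise ValueError(f"unsupported mode: {mode}")


rows = [{"id": i, "product": f"p{i}", "year": 1985} for i in range(6)]

filtered = list(field_filter(rows[:1], drop=["year"], rename={"product": "name"}))
first_two = [row["id"] for row in sample(iter(rows), 2)]
every_third = [row["id"] for row in sample(iter(rows), 3, mode="nth")]
```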

Page 48: Bubbles – Virtual Data Objects

Uniqueness

■ distinct (ctx, obj, key)

distinct values for key

■ distinct_rows (ctx, obj, key)

distinct whole rows (first occurrence of a row) for key

■ count_duplicates (ctx, obj, key)

count number of duplicates for key
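
Iterator versions of the three uniqueness operations might be sketched as follows. Assumptions: rows are dictionaries, `key` is a single field name, the `ctx` argument is left out, and `count_duplicates` reports how many times each repeated key value occurs (the slide does not say what exactly is returned).

```python
from collections import Counter


def distinct(rows, key):
    """Yield each distinct value of key, in order of first occurrence."""
    seen = set()
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            yield row[key]


def distinct_rows(rows, key):
    """Yield the first whole row for each distinct value of key."""
    seen = set()
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            yield row


def count_duplicates(rows, key):
    """Map each key value that occurs more than once to its row count."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


rows = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "red"},
]

colors = list(distinct(rows, "color"))
first_rows = list(distinct_rows(rows, "color"))
duplicates = count_duplicates(rows, "color")
```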

Page 49: Bubbles – Virtual Data Objects

Master-detail

■ join_detail(ctx, master, detail, master_key, detail_key)

Joins detail table, such as a dimension, on a specified key. Detail key field will be dropped from the result.

Note: other join-based operations will be implemented

later, as they need some usability decisions to be made
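
On plain row iterators, the master-detail join described above could be sketched as a dictionary lookup. Assumptions: rows are dictionaries, every master row has a matching detail row, the `ctx` argument is left out, and the sample rows are made up.

```python
def join_detail(master, detail, master_key, detail_key):
    """Join detail rows to master rows; the detail key field is dropped."""
    lookup = {row[detail_key]: row for row in detail}
    for row in master:
        match = lookup[row[master_key]]
        joined = dict(row)
        # Copy every detail field except the detail key itself
        joined.update({k: v for k, v in match.items() if k != detail_key})
        yield joined


master = [{"id": 100, "category_id": 1}]
detail = [{"key": 1, "category": "computer"}]

joined = list(join_detail(master, detail, "category_id", "key"))
```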

Page 50: Bubbles – Virtual Data Objects

Dimension Loading

■ added_keys (ctx, dim, source, dim_key, source_key)

which keys in the source are new?

■ added_rows (ctx, dim, source, dim_key, source_key)

which rows in the source are new?

■ changed_rows (ctx, target, source, dim_key, source_key, fields, version_field)

which rows in the source have changed?
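
The three questions above can be answered on row iterators with a set or dictionary of known dimension keys. This is a sketch under assumptions: rows are dictionaries, the `ctx` argument is left out, `version_field` handling is omitted, and the sample rows are made up.

```python
def added_keys(dim, source, dim_key, source_key):
    """Which keys in the source are new?"""
    known = {row[dim_key] for row in dim}
    return [row[source_key] for row in source if row[source_key] not in known]


def added_rows(dim, source, dim_key, source_key):
    """Which rows in the source are new?"""
    known = {row[dim_key] for row in dim}
    return [row for row in source if row[source_key] not in known]


def changed_rows(target, source, dim_key, source_key, fields):
    """Which rows in the source differ from the target in the given fields?"""
    current = {row[dim_key]: row for row in target}
    changed = []
    for row in source:
        old = current.get(row[source_key])
        if old is not None and any(old[f] != row[f] for f in fields):
            changed.append(row)
    return changed


dim = [{"key": 1, "name": "a"}, {"key": 2, "name": "b"}]
source = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]

new_keys = added_keys(dim, source, "key", "id")
new_rows = added_rows(dim, source, "key", "id")
changed = changed_rows(dim, source, "key", "id", ["name"])
```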

Page 51: Bubbles – Virtual Data Objects

more to come…

Page 52: Bubbles – Virtual Data Objects

Conclusion

Page 53: Bubbles – Virtual Data Objects

To Do

■ consolidate representations API

■ define basic set of operations

■ temporaries and garbage collection

■ sequence objects for surrogate keys

Page 54: Bubbles – Virtual Data Objects

Version 0.2

■ processing graph: connected nodes, like in Brewery

■ more basic backends: at least Mongo

■ bubbles command line tool

already in progress

Page 55: Bubbles – Virtual Data Objects

Future

■ separate operation dispatcher: will allow custom dispatch policies

Page 56: Bubbles – Virtual Data Objects

Contact: @Stiivi

[email protected]

Page 57: Bubbles – Virtual Data Objects

databrewery.org