ep2017 - bonobo etl · pdf file• cloveretl (ide/java) talend open studio . data...

bonobo Simple ETL in Python 3.5+

Upload: vuongkien

Post on 25-Mar-2018


TRANSCRIPT

Page 1: EP2017 - Bonobo ETL · PDF file• CloverETL (IDE/Java) Talend Open Studio . Data Integration Tools • Java + IDE based, for most of them • Data

bonobo Simple ETL in Python 3.5+


Romain Dorgueil @rdorgueil

CTO/Hacker in Residence Technical Co-founder

(Solo) Founder Eng. Manager

Developer

L’Atelier BNP Paribas WeAreTheShops RDC Dist. Agency Sensio/SensioLabs AffiliationWizard

Felt too young in a Linux Cauldron Dismantler of Atari computers

Basic literacy using a Minitel Guitars & accordions

Off by one baby Inception


STARTUP ACCELERATION PROGRAMS

NO HYPE, JUST BUSINESS

launchpad.atelier.net


Plan

• History of Extract Transform Load

    • Concept; Existing tools; Related tools; Ignition

• Practical Bonobo

    • Tutorial; Under the hood; Demo; Plugins & Extensions; More demos

• Wrap up

    • Present & future; Resources; Sprint; Feedback


Once upon a time…

Extract Transform Load

• Not new. Popular concept in the 1970s [1] [2]

• Everywhere. Commerce, websites, marketing, finance, …

[1] https://en.wikipedia.org/wiki/Extract,_transform,_load

[2] https://www.sas.com/en_us/insights/data-management/what-is-etl.html

Extract Transform Load

[Diagram: the rows foo, bar and baz pass through three stages labelled Extract, Transform, Load]

Extract Transform Load

[Diagram: the same rows foo, bar and baz, now in a larger graph whose nodes are labelled Extract, Transform, Load, Transform more, Join, DB, HTTP POST and log?]

Data Integration Tools

• Pentaho Data Integration (IDE/Java)

• Talend Open Studio (IDE/Java)

• CloverETL (IDE/Java)


Talend Open Studio

Data Integration Tools

• Java + IDE based, for most of them

• Data transformations are blocks

• IO flow managed by connections

• Execution

GUI first, eventually code :-(

In the Python world …

• Bubbles (https://github.com/stiivi/bubbles)

• PETL (https://github.com/alimanfoo/petl)

• (insert a few more here)

• and now… Bonobo (https://www.bonobo-project.org/)

You can also use amazing libraries including Joblib, Dask, Pandas, Toolz, but ETL is not their main focus.


Other scales…


Small Automation Tools

• Mostly aimed at simple recurring tasks.

• Cloud / SaaS only.


Big Data Tools

• Can do anything. And probably more. Fast.

• Either needs an infrastructure, or cloud based.


Story time


Partner 1 Data Integration


WE GOT DEALS !!!


Partner 1 Partner 2 Partner 3 Partner 4 Partner 5

Partner 6 Partner 7 Partner 8 Partner 9 …


Tiny bug there… Can you fix it ?

My need

• A data integration / ETL tool using code as configuration.

• Preferably Python code.

• Something that can be tested (I mean, by a machine).

• Something that can use inheritance.

• Fast & cheap install on a laptop, designed for servers too.


And that’s Bonobo

It is …

• A framework to write ETL jobs in Python 3 (3.5+)

• Using the same concepts as the old ETLs.

• You can use OOP!

Code first. Eventually a GUI will come.

It is NOT …

• Pandas / R Dataframes

• Dask (but will probably implement a dask.distributed strategy someday)

• Luigi / Airflow

• Hadoop / Big Data / Big Query / …

• A monkey (spoiler : it’s an ape, damnit french language…)


Let’s see…


Create a project

~ $ pip install bonobo

~ $ bonobo init europython/tutorial

~ $ bonobo run europython/tutorial


TEMPLATE

~ $ bonobo run .

…demo

Write our own

    import bonobo

    # Extract: a generator that yields the input rows one by one.
    def extract():
        yield 'euro'
        yield 'python'
        yield '2017'

    # Transform: receives one row, returns the transformed row.
    def transform(s):
        return s.title()

    # Load: receives one transformed row and outputs it.
    def load(s):
        print(s)

    graph = bonobo.Graph(
        extract,
        transform,
        load,
    )
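To make the data flow of that chain concrete, here is a minimal plain-Python sketch of what the three functions compute when run one after another. It does not use Bonobo at all: `run_linear` is a hypothetical helper written for this illustration, and it runs sequentially rather than with one thread per node.

```python
def extract():
    yield 'euro'
    yield 'python'
    yield '2017'


def transform(s):
    return s.title()


def run_linear(extract, transform, load):
    """Hypothetical helper: feed every extracted row through transform, then load."""
    for item in extract():
        load(transform(item))


# Collect the loaded rows in a list instead of printing them.
printed = []
run_linear(extract, transform, printed.append)
print(printed)  # ['Euro', 'Python', '2017']
```

Each value yielded by `extract` becomes one call to `transform`, whose return value becomes one call to `load`.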


EXAMPLE_1

~ $ bonobo run .

…demo


EXAMPLE_1

~ $ bonobo run first.py

…demo


Under the hood…


graph = bonobo.Graph(…)


BEGIN

CsvReader( 'clients.csv' )

InsertOrUpdate( 'db.site', 'clients', key='guid' )

update_crm

retrieve_orders

Graph…

    class Graph:
        def __init__(self, *chain):
            self.edges = {}
            self.nodes = []
            self.add_chain(*chain)

        def add_chain(self, *nodes, _input=None, _output=None):
            ...  # body elided on the slide
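The slide elides the body of `add_chain`. As a rough illustration of how a chain could be recorded as nodes plus edges, here is a toy sketch. It is an assumption made for this note, not Bonobo's actual implementation: `ToyGraph`, `BEGIN` and `add_node` are hypothetical names, and the `_output` parameter from the slide is left out.

```python
BEGIN = '__begin__'  # hypothetical marker for the entry point of the graph


class ToyGraph:
    """Toy graph: nodes live in a list, edges map a node index to its successors."""

    def __init__(self, *chain):
        self.edges = {BEGIN: set()}
        self.nodes = []
        self.add_chain(*chain)

    def add_node(self, node):
        """Store the node and return its index."""
        self.nodes.append(node)
        index = len(self.nodes) - 1
        self.edges[index] = set()
        return index

    def add_chain(self, *nodes, _input=BEGIN):
        """Link the nodes one after another, starting from _input."""
        for node in nodes:
            index = self.add_node(node)
            self.edges[_input].add(index)
            _input = index
        return self


graph = ToyGraph(str.strip, str.title)
graph.add_chain(str.upper, _input=0)  # second branch starting at node 0
print(graph.edges)
```

Calling `add_chain` again with an explicit `_input` is what lets one node feed several branches, as in the `InsertOrUpdate` / `update_crm` / `retrieve_orders` diagram.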

bonobo.run(graph)

or in a shell…

$ bonobo run main.py


BEGIN

CsvReader( 'clients.csv' ) : Context + Thread

InsertOrUpdate( 'db.site', 'clients', key='guid' ) : Context + Thread

update_crm : Context + Thread

retrieve_orders : Context + Thread

Context…

    class GraphExecutionContext:
        def __init__(self, graph, plugins, services):
            self.graph = graph
            # One execution context per node in the graph.
            self.nodes = [
                NodeExecutionContext(node, parent=self)
                for node in self.graph
            ]
            # One execution context per plugin.
            self.plugins = [
                PluginExecutionContext(plugin, parent=self)
                for plugin in plugins
            ]
            self.services = services

Strategy…

    class ThreadPoolExecutorStrategy(Strategy):
        def execute(self, graph, plugins, services):
            context = self.create_context(graph, plugins, services)
            executor = self.create_executor()

            # Submit one runner per node context.
            for node_context in context.nodes:
                executor.submit(
                    self.create_runner(node_context)
                )

            # Wait until every node has finished.
            while context.alive:
                self.sleep()

            executor.shutdown()

            return context
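The strategy submits one runner per node to an executor. A minimal standard-library sketch of that idea follows, assuming nodes are connected by `queue.Queue` objects and a sentinel closes the stream. The names `END`, `run_node` and `run_chain` are illustrative and are not Bonobo APIs; there is no error handling, so a node that raises would stall the chain.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

END = object()  # sentinel marking the end of the stream


def run_node(func, inbox, outbox):
    """Apply func to every item read from inbox and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is END:
            outbox.put(END)  # propagate end-of-stream downstream
            return
        outbox.put(func(item))


def run_chain(source, *funcs):
    """Run each function in its own worker thread, linked by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    with ThreadPoolExecutor(max_workers=max(1, len(funcs))) as executor:
        for func, inbox, outbox in zip(funcs, queues, queues[1:]):
            executor.submit(run_node, func, inbox, outbox)
        for item in source:
            queues[0].put(item)
        queues[0].put(END)
    # Leaving the with-block waits for all workers, so the last queue is complete.
    results = []
    while True:
        item = queues[-1].get()
        if item is END:
            break
        results.append(item)
    return results


print(run_chain(['euro', 'python'], str.title, lambda s: s + '!'))
```

Because every node runs in a single thread and the queues are FIFO, row order is preserved along the chain in this sketch.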


</ implementation details >


Transformations

a.k.a nodes in the graph

Functions

    def get_more_infos(api, **row):
        more = api.query(row.get('id'))

        # Merge the extra fields (if any) into the row.
        return {
            **row,
            **(more or {}),
        }

Generators

    def join_orders(order_api, **row):
        # One output row per order of the customer.
        for order in order_api.get(row.get('customer_id')):
            yield {
                **row,
                **order,
            }

Iterators

    extract = (
        'foo',
        'bar',
        'baz',
    )

    extract = range(0, 1001, 7)

Classes

    class RiminizeThis:
        def __call__(self, **row):
            return {
                **row,
                'Rimini': 'Woo-hou-wo...',
            }

Anything, as long as it’s callable().
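To show why plain functions, generators and callable classes can be used interchangeably as nodes, here is a small sketch of a dispatcher that normalises their outputs. `apply_node`, `add_flag`, `explode` and `Rename` are hypothetical names made up for this example; how Bonobo itself handles return values is not shown on the slides.

```python
import inspect


def apply_node(node, row):
    """Call node with a row dict and always return a list of output rows."""
    result = node(**row)
    if inspect.isgenerator(result):
        return list(result)  # a generator may yield zero, one or many rows
    return [] if result is None else [result]


def add_flag(**row):
    """Plain function: one row in, one row out."""
    return {**row, 'seen': True}


def explode(**row):
    """Generator: one row in, one row out per tag."""
    for tag in row.get('tags', ()):
        yield {'id': row['id'], 'tag': tag}


class Rename:
    """Callable class: configuration is kept on the instance."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def __call__(self, **row):
        row = dict(row)
        row[self.new] = row.pop(self.old)
        return row


print(apply_node(add_flag, {'id': 1}))
print(apply_node(explode, {'id': 1, 'tags': ['a', 'b']}))
print(apply_node(Rename('id', 'pk'), {'id': 1}))
```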

Configurable classes

    from bonobo.config import Configurable, Option, Service

    class QueryDatabase(Configurable):
        # Option: a plain value that can be overridden per instance.
        table_name = Option(str, default='customers')
        # Service: a named dependency, resolved at run time.
        database = Service('database.default')

        def call(self, database, **row):
            customer = database.query(self.table_name, customer_id=row['clientId'])
            return {
                **row,
                'is_customer': bool(customer),
            }


Configurable classes

    query_database = QueryDatabase(
        table_name='test_customers',
        database='database.testing',
    )
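The idea of class-level options that can be overridden per instance can be sketched in plain Python. This is a toy model only, assuming nothing about how `bonobo.config` is really implemented: `ToyOption`, `ToyConfigurable` and `ToyQuery` are hypothetical names, and the sketch only looks at options declared directly on the class (not inherited ones).

```python
class ToyOption:
    """Hypothetical stand-in for an option: a typed attribute with a default."""

    def __init__(self, type_=str, default=None):
        self.type = type_
        self.default = default


class ToyConfigurable:
    """Copies declared options onto the instance, applying keyword overrides."""

    def __init__(self, **overrides):
        for name, attr in vars(type(self)).items():
            if isinstance(attr, ToyOption):
                value = overrides.pop(name, attr.default)
                if value is not None:
                    value = attr.type(value)  # coerce to the declared type
                setattr(self, name, value)
        if overrides:
            raise TypeError('Unknown options: {}'.format(sorted(overrides)))


class ToyQuery(ToyConfigurable):
    table_name = ToyOption(str, default='customers')
    limit = ToyOption(int, default=10)


default_query = ToyQuery()
test_query = ToyQuery(table_name='test_customers', limit='5')
print(default_query.table_name, default_query.limit)
print(test_query.table_name, test_query.limit)
```

Declaring options on the class keeps the defaults in one place, while each instance can still be configured differently for production and testing.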


Services

Define as names

    class QueryDatabase(Configurable):
        database = Service('database.default')

        def call(self, database, **row):
            return { … }

Runtime injection

    import bonobo

    graph = bonobo.Graph(...)

    # Maps service names to the implementations used for this run.
    def get_services():
        return {
            'database.default': MyDatabaseImpl()
        }
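The pattern here is dependency injection by name: a node declares which service name it needs, and the runner looks that name up in the dictionary returned by `get_services()`. The sketch below illustrates the lookup in plain Python. `FakeDatabase`, `inject` and `is_customer` are hypothetical names for this example, and the real resolution mechanism inside Bonobo may differ.

```python
class FakeDatabase:
    """Hypothetical in-memory stand-in for a real database client."""

    def __init__(self, customer_ids):
        self.customer_ids = set(customer_ids)

    def query(self, table_name, customer_id):
        return customer_id in self.customer_ids


def get_services():
    return {'database.default': FakeDatabase({1, 2})}


def inject(func, service_names, services):
    """Bind named services to keyword arguments of func."""
    resolved = {arg: services[name] for arg, name in service_names.items()}

    def bound(**row):
        return func(**resolved, **row)

    return bound


def is_customer(database, **row):
    found = database.query('customers', row['clientId'])
    return {**row, 'is_customer': found}


# The node asks for 'database.default'; the implementation is chosen at run time.
node = inject(is_customer, {'database': 'database.default'}, get_services())
print(node(clientId=1))
print(node(clientId=9))
```

Swapping the dictionary returned by `get_services()` is enough to run the same node against a test double instead of a production database.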


Bananas!

Library

    bonobo.FileReader(…)
    bonobo.CsvReader(…)
    bonobo.JsonReader(…)
    bonobo.PickleReader(…)

    bonobo.ExcelReader(…)
    bonobo.XMLReader(…)

    … more to come

    bonobo.FileWriter(…)
    bonobo.CsvWriter(…)
    bonobo.JsonWriter(…)
    bonobo.PickleWriter(…)

    bonobo.ExcelWriter(…)
    bonobo.XMLWriter(…)

    … more to come

Library

    bonobo.Limit(limit)
    bonobo.PrettyPrinter()
    bonobo.Filter(…)

… more to come


Extensions & Plugins


Console Plugin


Jupyter Plugin

SQLAlchemy Extension

    bonobo_sqlalchemy.Select(
        query, *, pack_size=1000, limit=None
    )

    bonobo_sqlalchemy.InsertOrUpdate(
        table_name, *, fetch_columns, insert_only_fields, discriminant, …
    )

PREVIEW

Docker Extension

$ pip install bonobo[docker]

$ bonobo runc myjob.py

PREVIEW

Dev Kit

PREVIEW

https://github.com/python-bonobo/bonobo-devkit


More examples

?


EXAMPLE_1 -> EXAMPLE_2

…demo

• Use filesystem service.

• Write to a CSV

• Also write to JSON


EXAMPLE_3

Rimini open data


~/bdk/demos/europython2017

Europython attendees

featuring… jupyter notebook, selenium & firefox


~/bdk/demos/sirene

French companies registry

featuring… docker, postgresql, sqlalchemy


Wrap up

Young

• First commit : December 2016

• 23 releases, ~420 commits, 4 contributors

• Current « stable » 0.4.3

• Target : 1.0 early 2018

Python 3.5+

• {**}

• async/await

• (…, *, …)

• GIL :(
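Two of the bullets above can be shown in a few lines, reading `{**}` as dictionary unpacking inside a dict literal (new in Python 3.5) and `(…, *, …)` as keyword-only parameters. The function `enrich` is a hypothetical example and is not part of Bonobo.

```python
def enrich(row, *, source='unknown', **extra):
    """Return a new row; 'source' can only be passed by keyword."""
    # Later entries win, so 'extra' overrides matching keys of 'row'.
    return {**row, **extra, 'source': source}


merged = enrich({'id': 1, 'name': 'foo'}, source='csv', name='Foo')
print(merged)  # {'id': 1, 'name': 'Foo', 'source': 'csv'}

try:
    enrich({'id': 1}, 'csv')  # positional use of 'source' is rejected
except TypeError as error:
    print('TypeError:', error)
```

The bare `*` in the signature is what forces callers to name `source`, which keeps call sites readable when a node takes several options.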

1.0

• 100% Open-Source.

• Light & Focused.

• Very few dependencies.

• Comprehensive standard library.

• The rest goes to plugins and extensions.

Small scale

• 1 minute to install

• Easy to deploy

• NOT : Big Data, Statistics, Analytics …

• IS : Lean manufacturing for data


Interwebs are crazy


Data Processing for Humans


www.bonobo-project.org

docs.bonobo-project.org

bonobo-slack.herokuapp.com

github.com/python-bonobo

Let me know what you think!

Sprint

• Sprints at Europython are amazing

• Nice place to learn about Bonobo, basics, etc.

• Nice place to contribute while learning.

• You’re amazing.


Thank you!

@monkcage @rdorgueil

https://goo.gl/e25eoa


bonobo@monkcage