Python business intelligence (PyData 2012 talk)

Python for Business Intelligence Štefan Urbánek @Stiivi [email protected] PyData NYC, October 2012

Upload: stefan-urbanek

Post on 21-Nov-2014


DESCRIPTION

What is the state of business intelligence tools in Python in 2012? How is Python used for data processing and analysis? Different approaches for business data and scientific data. Video: https://vimeo.com/53063944

TRANSCRIPT

Page 1: Python business intelligence (PyData 2012 talk)

Python for Business Intelligence

Štefan Urbánek ■ @Stiivi ■ [email protected] ■ PyData NYC, October 2012

Page 2: Python business intelligence (PyData 2012 talk)

python business intelligence


Page 3: Python business intelligence (PyData 2012 talk)

Q/A and articles with Java solution references

(not listed here)

Results

Page 4: Python business intelligence (PyData 2012 talk)
Page 5: Python business intelligence (PyData 2012 talk)

Why?

Page 6: Python business intelligence (PyData 2012 talk)

Overview

■ Traditional Data Warehouse

■ Python and Data

■ Is Python Capable?

■ Conclusion

Page 7: Python business intelligence (PyData 2012 talk)

Business Intelligence

Page 8: Python business intelligence (PyData 2012 talk)

people

technology processes

Page 9: Python business intelligence (PyData 2012 talk)

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 10: Python business intelligence (PyData 2012 talk)

Traditional Data Warehouse

Page 11: Python business intelligence (PyData 2012 talk)

■ Extracting data from the original sources

■ Quality assuring and cleaning data

■ Conforming the labels and measures in the data to achieve consistency across the original sources

■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards.

Source: Ralph Kimball – The Data Warehouse ETL Toolkit
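
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not Kimball's method or the API of any ETL tool: the sample rows, the `REGION_LABELS` mapping and the `contracts` table are made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real pipeline would read files, databases or APIs.
RAW = """region,amount
 EU ,100
europe,250
,75
US,n/a
"""

# Hypothetical mapping used to conform differing source labels to one code.
REGION_LABELS = {"eu": "EU", "europe": "EU", "us": "US"}


def extract(text):
    """Extract: read rows from the original source."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: trim whitespace, drop rows with a missing region or a bad amount."""
    for row in rows:
        region = row["region"].strip()
        amount = row["amount"].strip()
        if region and amount.isdigit():
            yield {"region": region, "amount": int(amount)}


def conform(rows):
    """Conform: map source-specific labels to shared ones."""
    for row in rows:
        yield {"region": REGION_LABELS[row["region"].lower()], "amount": row["amount"]}


def deliver(rows, connection):
    """Deliver: store the rows in a physical format that query tools can use."""
    connection.execute("CREATE TABLE contracts (region TEXT, amount INTEGER)")
    connection.executemany(
        "INSERT INTO contracts VALUES (:region, :amount)", list(rows)
    )


connection = sqlite3.connect(":memory:")
deliver(conform(clean(extract(RAW))), connection)
totals = connection.execute(
    "SELECT region, SUM(amount) FROM contracts GROUP BY region"
).fetchall()
```

Writing each step as a generator keeps the stages independent, so a stage can be replaced or tested without touching the others.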

Page 12: Python business intelligence (PyData 2012 talk)

Source Systems

Staging Area Operational Data Store Datamarts

structured documents

databases

APIs

Temporary Staging Area

staging relational dimensional

L0 L1 L2

Page 13: Python business intelligence (PyData 2012 talk)

real time = daily

Page 14: Python business intelligence (PyData 2012 talk)

Multi-dimensional Modeling

Page 16: Python business intelligence (PyData 2012 talk)

aggregation browsing

slicing and dicing

Page 17: Python business intelligence (PyData 2012 talk)

business / analyst’s point of view

regardless of physical schema implementation

Page 18: Python business intelligence (PyData 2012 talk)

Facts

fact

most detailed information

measurable

fact data cell

Page 19: Python business intelligence (PyData 2012 talk)

dimensions

location

type

time

Page 20: Python business intelligence (PyData 2012 talk)

■ provide context for facts

■ used to filter queries or reports

■ control scope of aggregation of facts

Dimension
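
A minimal sketch of how facts and dimensions relate, in plain Python. The data and the `aggregate` helper are hypothetical and only illustrate the three roles of a dimension listed above: context for facts, filtering, and the scope of aggregation.

```python
from collections import defaultdict

# Each fact is one measurable event; "amount" is the measure and the
# remaining keys are dimension values. All data is made up.
facts = [
    {"location": "Bratislava", "type": "retail", "year": 2011, "amount": 120},
    {"location": "Bratislava", "type": "online", "year": 2012, "amount": 80},
    {"location": "New York", "type": "retail", "year": 2012, "amount": 200},
    {"location": "New York", "type": "online", "year": 2012, "amount": 50},
]


def aggregate(facts, by, where=None):
    """Sum the measure per value of dimension `by`, optionally filtered."""
    where = where or {}
    totals = defaultdict(int)
    for fact in facts:
        # Dimensions filter: keep only facts matching every condition.
        if all(fact[dim] == value for dim, value in where.items()):
            # Dimensions control scope: totals are split by `by`.
            totals[fact[by]] += fact["amount"]
    return dict(totals)


by_location = aggregate(facts, by="location")
by_type_2012 = aggregate(facts, by="type", where={"year": 2012})
```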

Page 21: Python business intelligence (PyData 2012 talk)

Pentaho

Page 22: Python business intelligence (PyData 2012 talk)

Python and Data

community perception*

*as of Oct 2012

Page 23: Python business intelligence (PyData 2012 talk)

Scientific & Financial

Page 24: Python business intelligence (PyData 2012 talk)

Python

Page 25: Python business intelligence (PyData 2012 talk)

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 26: Python business intelligence (PyData 2012 talk)

     T1[s]    T2[s]    T3[s]    T4[s]
P1   112,68   941,67   171,01   660,48
P2    96,15   306,51   725,88   877,82
P3   313,39   189,31    41,81   428,68
P4   760,62   983,48   371,21   281,19
P5   838,56    39,27   389,42   231,12

n-dimensional array of numbers

Scientific Data
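
To illustrate what "n-dimensional array of numbers" buys you, here is a small sketch using the table from this slide (decimal commas written as points). It uses nested lists and the standard library only; an array library would express the same thing more compactly. The variable names are assumptions made for the example.

```python
from statistics import mean

# Rows are P1..P5, columns are T1..T4 (seconds), copied from the slide.
labels = ["P1", "P2", "P3", "P4", "P5"]
data = [
    [112.68, 941.67, 171.01, 660.48],
    [96.15, 306.51, 725.88, 877.82],
    [313.39, 189.31, 41.81, 428.68],
    [760.62, 983.48, 371.21, 281.19],
    [838.56, 39.27, 389.42, 231.12],
]

# Because every cell is a number and the layout is regular,
# operations along either axis are one-liners.
row_means = {label: mean(row) for label, row in zip(labels, data)}
column_means = [mean(column) for column in zip(*data)]
```

Business data rarely fits this shape directly, which is the contrast the following slides draw.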

Page 27: Python business intelligence (PyData 2012 talk)

Assumptions

■ data is mostly numbers

■ data is neatly organized...

■ … in one multi-dimensional array

Page 28: Python business intelligence (PyData 2012 talk)

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 29: Python business intelligence (PyData 2012 talk)

Business Data

Page 30: Python business intelligence (PyData 2012 talk)

multiple representations

of same data

multiple snapshots of one source

categories are

changing

Page 31: Python business intelligence (PyData 2012 talk)

Page 32: Python business intelligence (PyData 2012 talk)

Is Python Capable?

very basic examples

Page 33: Python business intelligence (PyData 2012 talk)

Data Pipes with SQLAlchemy

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 34: Python business intelligence (PyData 2012 talk)

■ connection: create_engine

■ schema reflection: MetaData, Table

■ expressions: select(), insert()

Page 35: Python business intelligence (PyData 2012 talk)

from sqlalchemy import create_engine, MetaData, Table

src_engine = create_engine("sqlite:///data.sqlite")
src_metadata = MetaData(bind=src_engine)
src_table = Table('data', src_metadata, autoload=True)

target_engine = create_engine("postgresql://localhost/sandbox")
target_metadata = MetaData(bind=target_engine)
target_table = Table('data', target_metadata)

Page 36: Python business intelligence (PyData 2012 talk)

clone schema:

for column in src_table.columns:
    target_table.append_column(column.copy())

target_table.create()

copy data:

insert = target_table.insert()

for row in src_table.select().execute():
    insert.execute(row)

Page 37: Python business intelligence (PyData 2012 talk)

magic used:

metadata reflection
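
The SQLAlchemy idioms on the previous slides (metadata bound to an engine, `autoload=True`, calling `.execute()` directly on a statement) reflect the API as of 2012 and were removed in SQLAlchemy 2.0. As a version-independent illustration of the same idea, the sketch below reflects a table's columns and clones it using only the standard `sqlite3` module. It is a stand-in for the slide's code, not a SQLAlchemy example; the sample table and the `quote` helper are assumptions.

```python
import sqlite3

# Source database with a hypothetical table, standing in for data.sqlite.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE data (id INTEGER, name TEXT, amount REAL)")
source.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(1, "alpha", 10.5), (2, "beta", 20.0)],
)

target = sqlite3.connect(":memory:")


def quote(identifier):
    """Quote an SQL identifier so that reflected names are safe to embed."""
    return '"' + identifier.replace('"', '""') + '"'


# Reflect the schema: PRAGMA table_info yields (cid, name, type, ...) per column.
columns = [(row[1], row[2]) for row in source.execute("PRAGMA table_info(data)")]

# Clone the schema in the target database.
column_defs = ", ".join(f"{quote(name)} {ctype}" for name, ctype in columns)
target.execute(f"CREATE TABLE data ({column_defs})")

# Copy the data row by row, as the slide does.
placeholders = ", ".join("?" for _ in columns)
for row in source.execute("SELECT * FROM data"):
    target.execute(f"INSERT INTO data VALUES ({placeholders})", row)
target.commit()

copied = target.execute("SELECT * FROM data ORDER BY id").fetchall()
```

Nothing in the copy step names a column explicitly; the structure comes entirely from the reflected metadata, which is the "magic" the slide refers to.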

Page 38: Python business intelligence (PyData 2012 talk)

text file (CSV) to table:

reader = csv.reader(file_stream)

columns = next(reader)

for column in columns:
    table.append_column(Column(column, String))

table.create()

insert = table.insert()

for row in reader:
    insert.execute(row)
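
The same CSV-to-table step can be tried without a database server. The sketch below uses the standard `csv` and `sqlite3` modules instead of SQLAlchemy; the sample content and the `imported` table name are made up. As on the slide, every column is created as text and the header row supplies the column names.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for file_stream.
file_stream = io.StringIO("name,city\nAnna,Bratislava\nBob,New York\n")

reader = csv.reader(file_stream)
columns = next(reader)  # the first row carries the column names

connection = sqlite3.connect(":memory:")

# One TEXT column per header field; double quotes in names are escaped.
column_defs = ", ".join('"%s" TEXT' % name.replace('"', '""') for name in columns)
connection.execute("CREATE TABLE imported (%s)" % column_defs)

# The reader is consumed lazily, one row per INSERT.
placeholders = ", ".join("?" * len(columns))
connection.executemany("INSERT INTO imported VALUES (%s)" % placeholders, reader)

rows = connection.execute("SELECT name, city FROM imported ORDER BY name").fetchall()
```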

Page 39: Python business intelligence (PyData 2012 talk)

Simple T from ETL

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 40: Python business intelligence (PyData 2012 talk)

transformation = [
    # (target field, source transformation)
    ('fiscal_year', {"function": int, "field": "fiscal_year"}),
    ('region_code', {"mapping": region_map, "field": "region"}),
    ('borrower_country', None),
    ('project_name', None),
    ('procurement_type', None),
    ('major_sector_code', {"mapping": sector_code_map, "field": "major_sector"}),
    ('major_sector', None),
    ('supplier', None),
    ('contract_amount', {"function": currency_to_number, "field": 'total_contract_amount'}),
]

Page 41: Python business intelligence (PyData 2012 talk)

for row in source:
    result = transform(row, transformation)
    table.insert(result).execute()

Transformation
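
The slides do not show `transform` itself. One plausible reading of the rule list is sketched below; the function body, the `currency_to_number` helper, the sample `region_map` and the sample row are all assumptions made for illustration, not the speaker's code.

```python
def transform(row, transformation):
    """Build a target record from a source row.

    Hypothetical reading of the slide's rule list:
    - None: copy the source field with the same name,
    - {"function": f, "field": name}: apply f to the source field,
    - {"mapping": m, "field": name}: look the source value up in m.
    """
    result = {}
    for target_field, rule in transformation:
        if rule is None:
            result[target_field] = row[target_field]
            continue
        value = row[rule["field"]]
        if "function" in rule:
            value = rule["function"](value)
        elif "mapping" in rule:
            value = rule["mapping"][value]
        result[target_field] = value
    return result


def currency_to_number(text):
    """Hypothetical helper: strip the currency sign and thousands separators."""
    return float(text.replace("$", "").replace(",", ""))


# Made-up mapping and a shortened rule list for the example.
region_map = {"Africa": "AFR", "Europe and Central Asia": "ECA"}

transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("supplier", None),
    ("contract_amount", {"function": currency_to_number,
                         "field": "total_contract_amount"}),
]

row = {
    "fiscal_year": "2012",
    "region": "Africa",
    "supplier": "ACME",
    "total_contract_amount": "$1,250.50",
}
result = transform(row, transformation)
```

Keeping the rules as data rather than code means the list can double as documentation of where each target field comes from.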

Page 42: Python business intelligence (PyData 2012 talk)

OLAP with Cubes

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 43: Python business intelligence (PyData 2012 talk)

cubes: measures

dimensions: levels, attributes, hierarchy

Model
{
    "name": "My Model",
    "description": ....,

    "cubes": [...],
    "dimensions": [...]
}

Page 44: Python business intelligence (PyData 2012 talk)

logical

physical

Page 45: Python business intelligence (PyData 2012 talk)

Application

cubes

Aggregation Browser backend

1 load_model("model.json")

2 create_workspace("sql", model, url="sqlite:///data.sqlite")

3 model.cube("sales")

4 workspace.browser(cube)

Page 46: Python business intelligence (PyData 2012 talk)

browser.aggregate(cell, drilldown=["sector"])

drill-down

Page 47: Python business intelligence (PyData 2012 talk)

for row in result.table_rows("sector"):
    row.label
    row.key
    row.record["amount_sum"]

Page 48: Python business intelligence (PyData 2012 talk)

whole cube:

cell = Cell(cube)
browser.aggregate(cell)

Total

browser.aggregate(cell, drilldown=["date"])

2006 2007 2008 2009 2010

cut = PointCut("date", [2010])
cell = cell.slice(cut)

browser.aggregate(cell, drilldown=["date"])

Jan Feb Mar Apr May ...
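
What the three calls on this slide compute can be mimicked in a few lines of plain Python. This is not the Cubes implementation or its API: the `aggregate` function, its `cuts` argument and the sample facts are hypothetical and only mirror the behaviour shown here (whole cube, drill-down by year, then a point cut at 2010 with a drill-down to months).

```python
from collections import defaultdict

# Made-up facts; the "date" dimension has two levels: year and month.
facts = [
    {"date": (2009, 12), "amount": 10},
    {"date": (2010, 1), "amount": 20},
    {"date": (2010, 1), "amount": 5},
    {"date": (2010, 2), "amount": 30},
]


def aggregate(facts, cuts=None, drilldown=None):
    """Sum `amount` within the cell described by `cuts`.

    `cuts` maps a dimension to a path prefix, e.g. {"date": (2010,)}.
    With `drilldown`, totals are split by the next level below the cut.
    """
    cuts = cuts or {}
    summary = 0
    rows = defaultdict(int)
    for fact in facts:
        if any(fact[dim][:len(path)] != tuple(path) for dim, path in cuts.items()):
            continue  # the fact lies outside the cell
        summary += fact["amount"]
        if drilldown:
            depth = len(cuts.get(drilldown, ())) + 1
            rows[fact[drilldown][:depth]] += fact["amount"]
    return summary, dict(rows)


total, _ = aggregate(facts)                      # whole cube
_, by_year = aggregate(facts, drilldown="date")  # one row per year
_, by_month = aggregate(facts, cuts={"date": (2010,)}, drilldown="date")
```

The drill-down depth follows from the cut: with no cut the rows are years, and after slicing at a year the same call returns months.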

Page 49: Python business intelligence (PyData 2012 talk)

How can Python be Useful

Page 50: Python business intelligence (PyData 2012 talk)

■ saves maintenance resources

■ shortens development time

■ saves you from going insane

just the Language

Page 51: Python business intelligence (PyData 2012 talk)

Source Systems

Staging Area Operational Data Store Datamarts

structured documents

databases

APIs

Temporary Staging Area

staging relational dimensional

L0 L1 L2

faster

Page 52: Python business intelligence (PyData 2012 talk)

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

faster advanced

understandable, maintainable

Page 53: Python business intelligence (PyData 2012 talk)

Conclusion

Page 54: Python business intelligence (PyData 2012 talk)

people

technology processes

BI is about…

Page 55: Python business intelligence (PyData 2012 talk)

don’t forget metadata

Page 56: Python business intelligence (PyData 2012 talk)

who is going to fix your COBOL Java tool if you have only Python guys around?

Future

Page 57: Python business intelligence (PyData 2012 talk)

Python is capable, let’s start