python business intelligence (pydata 2012 talk)

Python for Business Intelligence

Štefan Urbánek ■ @Stiivi ■ PyData NYC, October 2012

https://twitter.com/@stiivi


Search: "python business intelligence"

Results: Q/A and articles with Java solution references (not listed here)

Overview

■ Traditional Data Warehouse

■ Python and Data

■ Is Python Capable?

■ Conclusion

Business Intelligence

people

technology

processes

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Traditional Data Warehouse

■ Extracting data from the original sources

■ Quality assuring and cleaning data

■ Conforming the labels and measures in the data to achieve consistency across the original sources

■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards.

Source: Ralph Kimball – The Data Warehouse ETL Toolkit
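
The four steps above can be sketched end to end with nothing but the standard library. This is a minimal illustration, not a production pipeline: the sample data, the `REGION_LABELS` mapping, the `contracts` table and all function names are assumptions made up for this example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (assumption for illustration).
RAW = """region,amount
 africa ,100
AFRICA,250
Europe,n/a
europe,75
"""

# Hypothetical conformed labels shared by all sources.
REGION_LABELS = {"africa": "AFR", "europe": "EUR"}


def extract(text):
    """Extract: read rows from the original source."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop rows whose amount is not a number."""
    good = []
    for row in rows:
        try:
            good.append({"region": row["region"], "amount": float(row["amount"])})
        except ValueError:
            continue
    return good


def conform(rows):
    """Conform: map source-specific labels to one shared code."""
    return [
        {"region": REGION_LABELS[row["region"].strip().lower()],
         "amount": row["amount"]}
        for row in rows
    ]


def deliver(rows, connection):
    """Deliver: store rows in a physical format that query tools can read."""
    connection.execute("CREATE TABLE contracts (region TEXT, amount REAL)")
    connection.executemany(
        "INSERT INTO contracts VALUES (:region, :amount)", rows
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
deliver(conform(clean(extract(RAW))), connection)

# Any SQL-speaking report writer or dashboard can now query the result.
totals = connection.execute(
    "SELECT region, SUM(amount) FROM contracts GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

Keeping each step a separate function makes it possible to test and replace the steps independently.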

Source Systems: structured documents, databases, APIs

Temporary Staging Area

L0: Staging Area (staging)

L1: Operational Data Store (relational)

L2: Datamarts (dimensional)

real time = daily

Multi-dimensional Modeling

http://tendre.sme.sk






aggregation browsing

slicing and dicing

business / analyst’s point of view

regardless of physical schema implementation

Facts

■ most detailed information

■ measurable

fact: data cell

dimensions: location, type, time

Dimension

■ provide context for facts

■ used to filter queries or reports

■ control scope of aggregation of facts
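
A small sketch of how facts and dimensions relate, in plain Python. The dimension members, fact rows and the `total_by` helper are hypothetical and exist only to illustrate the three roles of a dimension listed above.

```python
# Hypothetical dimension tables: key -> descriptive attributes.
locations = {1: {"country": "Slovakia"}, 2: {"country": "Kenya"}}
types = {"G": {"label": "Goods"}, "W": {"label": "Works"}}

# Facts: the most detailed, measurable records. Each fact points to its
# dimensions by key and carries one measure ("amount").
facts = [
    {"location": 1, "type": "G", "year": 2009, "amount": 100.0},
    {"location": 1, "type": "W", "year": 2010, "amount": 40.0},
    {"location": 2, "type": "G", "year": 2010, "amount": 60.0},
    {"location": 2, "type": "G", "year": 2010, "amount": 25.0},
]


def total_by(facts, dimension, members, attribute):
    """Sum the measure per dimension member, labelled by one attribute."""
    totals = {}
    for fact in facts:
        label = members[fact[dimension]][attribute]  # dimension gives context
        totals[label] = totals.get(label, 0.0) + fact["amount"]
    return totals


# A dimension used as a filter: keep only facts from 2010.
facts_2010 = [fact for fact in facts if fact["year"] == 2010]

# Dimensions control the scope of aggregation.
by_type = total_by(facts_2010, "type", types, "label")
by_country = total_by(facts_2010, "location", locations, "country")
print(by_type)
print(by_country)
```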

Pentaho

Python and Data

community perception*

*as of Oct 2012

Scientific & Financial

Python



      T1[s]    T2[s]    T3[s]    T4[s]
P1   112.68   941.67   171.01   660.48
P2    96.15   306.51   725.88   877.82
P3   313.39   189.31    41.81   428.68
P4   760.62   983.48   371.21   281.19
P5   838.56    39.27   389.42   231.12

n-dimensional array of numbers

Scientific Data

Assumptions

■ data is mostly numbers

■ data is neatly organized...

■ … in one multi-dimensional array
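
When these assumptions hold, analysis reduces to arithmetic along the axes of one array. A minimal sketch using the timing table from the previous slide as a plain nested list; the variable names are illustrative, and a real project would more likely use an array library.

```python
# The timing table as a 2-dimensional array: rows P1..P5, columns T1..T4.
timings = [
    [112.68, 941.67, 171.01, 660.48],
    [96.15, 306.51, 725.88, 877.82],
    [313.39, 189.31, 41.81, 428.68],
    [760.62, 983.48, 371.21, 281.19],
    [838.56, 39.27, 389.42, 231.12],
]

# Aggregate along one axis: total time per row (P1..P5).
row_totals = [sum(row) for row in timings]

# Aggregate along the other axis: mean time per column (T1..T4).
column_means = [sum(column) / len(column) for column in zip(*timings)]

print([round(total, 2) for total in row_totals])
print([round(mean, 2) for mean in column_means])
```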



Business Data

■ multiple representations of same data

■ multiple snapshots of one source

■ categories are changing

Is Python Capable?

very basic examples

Data Pipes with SQLAlchemy



■ connection: create_engine

■ schema reflection: MetaData, Table

■ expressions: select(), insert()

clone schema (magic used: metadata reflection):

from sqlalchemy import create_engine, MetaData, Table

# Written against the SQLAlchemy API of 2012 (bound metadata, implicit
# execution); SQLAlchemy 2.0 removed these patterns.
src_engine = create_engine("sqlite:///data.sqlite")
src_metadata = MetaData(bind=src_engine)
src_table = Table('data', src_metadata, autoload=True)  # reflect columns from the database

target_engine = create_engine("postgresql://localhost/sandbox")
target_metadata = MetaData(bind=target_engine)
target_table = Table('data', target_metadata)

for column in src_table.columns:
    target_table.append_column(column.copy())

target_table.create()

copy data:

insert = target_table.insert()

for row in src_table.select().execute():
    insert.execute(row)
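
To show what reflection does underneath, here is the same clone-and-copy idea without SQLAlchemy, using only the standard `sqlite3` module. This is a swapped-in technique for illustration, not the code from the talk: both databases are in-memory SQLite, and the table contents are made up.

```python
import sqlite3

# Hypothetical source database with one table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE data (id INTEGER, name TEXT, amount REAL)")
source.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(1, "roads", 10.5), (2, "schools", 20.0)],
)

target = sqlite3.connect(":memory:")

# "Reflection": ask the database which columns the source table has.
# PRAGMA table_info yields (cid, name, type, notnull, default, pk) per column.
columns = [(row[1], row[2]) for row in source.execute("PRAGMA table_info(data)")]

# Clone schema: build CREATE TABLE from the reflected names and types.
column_sql = ", ".join('"%s" %s' % (name, type_) for name, type_ in columns)
target.execute("CREATE TABLE data (%s)" % column_sql)

# Copy data: stream rows from the source cursor into the target.
placeholders = ", ".join("?" for _ in columns)
target.executemany(
    "INSERT INTO data VALUES (%s)" % placeholders,
    source.execute("SELECT * FROM data"),
)
target.commit()

copied = target.execute("SELECT * FROM data ORDER BY id").fetchall()
print(copied)
```

SQLAlchemy adds value on top of this by translating column types between different database engines, which the sketch does not attempt.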

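text file (CSV) to table:

import csv
from sqlalchemy import Column, String

# file_stream is an open CSV file; table is an empty Table bound to the target engine
reader = csv.reader(file_stream)

columns = next(reader)  # first row holds the column names

for column in columns:
    table.append_column(Column(column, String))

table.create()

insert = table.insert()

for row in reader:
    insert.execute(row)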

Simple T from ETL



transformation = [
    # (target field, source transformation)
    ('fiscal_year', {"function": int, "field": "fiscal_year"}),
    ('region_code', {"mapping": region_map, "field": "region"}),
    ('borrower_country', None),
    ('project_name', None),
    ('procurement_type', None),
    ('major_sector_code', {"mapping": sector_code_map, "field": "major_sector"}),
    ('major_sector', None),
    ('supplier', None),
    ('contract_amount', {"function": currency_to_number, "field": 'total_contract_amount'}),
]

for row in source:
    result = transform(row, transformation)
    table.insert(result).execute()

Transformation
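
The slide does not show `transform()` itself. One possible reading of the specification format is sketched below: `None` copies the field of the same name, `"field"` names the source field, and `"function"` or `"mapping"` converts the value. The body of `transform`, the `currency_to_number` helper, the `region_map` values and the sample row are all assumptions for illustration.

```python
def transform(row, transformation):
    """Build a target record from a source row.

    Each entry is (target_field, spec). A spec of None copies the source
    field with the same name. Otherwise "field" names the source field and
    "function" or "mapping" says how to convert its value.
    """
    result = {}
    for target_field, spec in transformation:
        spec = spec or {}
        value = row[spec.get("field", target_field)]
        if "function" in spec:
            value = spec["function"](value)
        elif "mapping" in spec:
            value = spec["mapping"][value]
        result[target_field] = value
    return result


def currency_to_number(text):
    """Hypothetical helper: turn '$1,250.50' into 1250.5."""
    return float(text.replace("$", "").replace(",", ""))


region_map = {"AFRICA": "AFR"}  # hypothetical mapping values

transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("supplier", None),
    ("contract_amount", {"function": currency_to_number,
                         "field": "total_contract_amount"}),
]

row = {"fiscal_year": "2010", "region": "AFRICA", "supplier": "ACME",
       "total_contract_amount": "$1,250.50"}
result = transform(row, transformation)
print(result)
```

Because the transformation is plain data, it can be stored, reviewed and changed without touching the loop that applies it.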

OLAP with Cubes



cubes: measures

dimensions: levels, attributes, hierarchy

Model

{
    "name": "My Model",
    "description": ....,
    "cubes": [...],
    "dimensions": [...]
}

❄

logical

physical

model = load_model("model.json")

workspace = create_workspace("sql", model, url="sqlite:///data.sqlite")

cube = model.cube("sales")

browser = workspace.browser(cube)

Aggregation Browser backend

cubes

Application

∑


drill-down:

result = browser.aggregate(cell, drilldown=["sector"])

for row in result.table_rows("sector"):
    row.label, row.key          # label and key of the drilled-down member
    row.record["amount_sum"]    # aggregated measure for that member

whole cube:

cell = Cell(cube)
browser.aggregate(cell)                        # Total
browser.aggregate(cell, drilldown=["date"])    # 2006 2007 2008 2009 2010

slice to year 2010:

cut = PointCut("date", [2010])
cell = cell.slice(cut)
browser.aggregate(cell, drilldown=["date"])    # Jan Feb Mar Apr May ...
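
To make the cell, cut and drill-down vocabulary concrete, here is a tiny in-memory stand-in for an aggregation browser. It is not the Cubes implementation or its API: the `aggregate` function, its parameters and the fact rows are hypothetical, and only mirror the behaviour the slides describe (cutting at a year, then drilling down one level to months).

```python
# Hypothetical fact rows: one value per dimension plus a measure.
# The date dimension is a path through its hierarchy: (year, month).
facts = [
    {"date": (2009, 12), "sector": "health", "amount": 10.0},
    {"date": (2010, 1), "sector": "health", "amount": 20.0},
    {"date": (2010, 1), "sector": "energy", "amount": 5.0},
    {"date": (2010, 2), "sector": "energy", "amount": 7.0},
]


def aggregate(facts, cuts=None, drilldown=None):
    """Sum "amount" over the facts inside the cell described by `cuts`.

    cuts:      dict of dimension -> path, e.g. {"date": (2010,)}; a fact
               matches when its dimension path starts with that path.
    drilldown: dimension to group by, one level below the cut path.
    Returns a total, or a dict of member path -> total when drilling down.
    """
    cuts = cuts or {}

    def path_of(fact, dimension):
        value = fact[dimension]
        return value if isinstance(value, tuple) else (value,)

    selected = [
        fact for fact in facts
        if all(path_of(fact, dim)[:len(path)] == tuple(path)
               for dim, path in cuts.items())
    ]
    if drilldown is None:
        return sum(fact["amount"] for fact in selected)

    depth = len(cuts.get(drilldown, ())) + 1  # next level of the hierarchy
    totals = {}
    for fact in selected:
        member = path_of(fact, drilldown)[:depth]
        totals[member] = totals.get(member, 0.0) + fact["amount"]
    return totals


total = aggregate(facts)                                      # whole cube
by_year = aggregate(facts, drilldown="date")                  # per year
by_month_2010 = aggregate(facts, {"date": (2010,)}, "date")   # months of 2010
by_sector = aggregate(facts, drilldown="sector")
print(total, by_year, by_month_2010, by_sector)
```

A real backend would push the same filtering and grouping down to SQL instead of looping over rows in Python.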

How can Python be Useful

■ saves maintenance resources

■ shortens development time

■ saves you from going insane

just the language

Source Systems: structured documents, databases, APIs

Temporary Staging Area

L0: Staging Area (staging)

L1: Operational Data Store (relational)

L2: Datamarts (dimensional)

faster

Data Governance



Sources


faster advanced

understandable, maintainable

Conclusion

BI is about…

people

technology

processes

don’t forget metadata

who is going to fix your COBOL Java tool if you have only Python guys around?

Future

Python is capable, let’s start

Thank You

Twitter: @Stiivi

DataBrewery blog: blog.databrewery.org

Github: github.com/Stiivi


http://github.com/Stiivi/cubes
