python business intelligence (pydata 2012 talk)
DESCRIPTION
What is the state of business intelligence tools in Python in 2012? How is Python used for data processing and analysis? Different approaches for business data and scientific data. Video: https://vimeo.com/53063944
TRANSCRIPT
Python for Business Intelligence
Štefan Urbánek ■ @Stiivi ■ PyData NYC, October 2012
python business intelligence
Q/A and articles with Java solution references
(not listed here)
Results
Why?
Overview
■ Traditional Data Warehouse
■ Python and Data
■ Is Python Capable?
■ Conclusion
Business Intelligence
people
technology processes
Data Governance
Analysis and Presentation
Extraction, Transformation, Loading
Data Sources
Technologies and Utilities
Traditional Data Warehouse
■ Extracting data from the original sources
■ Quality assuring and cleaning data
■ Conforming the labels and measures in the data to achieve consistency across the original sources
■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards.
Source: Ralph Kimball – The Data Warehouse ETL Toolkit
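The four steps listed above can be sketched as a chain of small Python functions. This is only an illustration under stated assumptions: the in-memory sources, the `REGION_LABELS` mapping and all function names are hypothetical and do not come from the talk or from Kimball's book.

```python
# Hypothetical in-memory "sources"; real ones would be files, databases or APIs.
SOURCE_A = [{"region": "eu", "amount": "100"}, {"region": "EU", "amount": ""}]
SOURCE_B = [{"region": "Europe", "amount": "250"}]

# Assumed mapping used to conform region labels across the sources.
REGION_LABELS = {"eu": "Europe", "europe": "Europe"}


def extract():
    """Extract: read rows from every original source."""
    for source in (SOURCE_A, SOURCE_B):
        for row in source:
            yield dict(row)


def clean(rows):
    """Quality-assure: drop rows that have no amount."""
    return (row for row in rows if row["amount"].strip())


def conform(rows):
    """Conform labels and measures so that all sources agree."""
    for row in rows:
        yield {
            "region": REGION_LABELS[row["region"].lower()],
            "amount": int(row["amount"]),
        }


def deliver(rows):
    """Deliver: materialize the rows in a form that query tools can read."""
    return list(rows)


delivered = deliver(conform(clean(extract())))
print(delivered)
```

Each step is a generator over rows, so the stages can be swapped or tested separately.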
Source Systems: structured documents, databases, APIs
Temporary Staging Area
Staging Area (L0, staging)
Operational Data Store (L1, relational)
Datamarts (L2, dimensional)
real time = daily
Multi-dimensional Modeling
aggregation browsing
slicing and dicing
business / analyst’s point of view
regardless of physical schema implementation
Facts
fact
most detailed information
measurable
fact data cell
dimensions
location
type
time
■ provide context for facts
■ used to filter queries or reports
■ control scope of aggregation of facts
Dimension
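A minimal sketch of how dimensions give context to facts, filter them, and control the scope of aggregation. The fact records, the `amount` measure and the `aggregate` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical facts: each has one measure ("amount") and three dimensions.
facts = [
    {"location": "NYC", "type": "retail", "year": 2011, "amount": 10},
    {"location": "NYC", "type": "online", "year": 2012, "amount": 20},
    {"location": "SF", "type": "retail", "year": 2012, "amount": 5},
]


def aggregate(facts, group_by, **filters):
    """Sum the measure per member of `group_by` for facts matching `filters`."""
    totals = defaultdict(int)
    for fact in facts:
        # Dimensions used as filters restrict which facts are counted.
        if all(fact[dim] == value for dim, value in filters.items()):
            # The grouping dimension controls the scope of each total.
            totals[fact[group_by]] += fact["amount"]
    return dict(totals)


by_location = aggregate(facts, "location")
by_location_2012 = aggregate(facts, "location", year=2012)
print(by_location)
print(by_location_2012)
```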
Pentaho
Python and Data
community perception*
*as of Oct 2012
Scientific & Financial
Python
T1[s] T2[s] T3[s] T4[s]
P1 112,68 941,67 171,01 660,48
P2 96,15 306,51 725,88 877,82
P3 313,39 189,31 41,81 428,68
P4 760,62 983,48 371,21 281,19
P5 838,56 39,27 389,42 231,12
n-dimensional array of numbers
Scientific Data
Assumptions
■ data is mostly numbers
■ data is neatly organized...
■ … in one multi-dimensional array
Business Data
multiple representations of same data
multiple snapshots of one source
categories are changing
❄
Is Python Capable?
very basic examples
Data Pipes with SQLAlchemy
■ connection: create_engine
■ schema reflection: MetaData, Table
■ expressions: select(), insert()
    # SQLAlchemy API as presented in 2012; newer SQLAlchemy versions differ
    from sqlalchemy import create_engine, MetaData, Table

    src_engine = create_engine("sqlite:///data.sqlite")
    src_metadata = MetaData(bind=src_engine)
    # autoload=True reflects the table definition from the database
    src_table = Table('data', src_metadata, autoload=True)

    target_engine = create_engine("postgresql://localhost/sandbox")
    target_metadata = MetaData(bind=target_engine)
    target_table = Table('data', target_metadata)

clone schema:
    for column in src_table.columns:
        target_table.append_column(column.copy())
    target_table.create()

copy data:
    insert = target_table.insert()
    for row in src_table.select().execute():
        insert.execute(row)

magic used: metadata reflection
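The same clone-and-copy idea can be tried without any third-party package. The sketch below swaps SQLAlchemy for the standard-library `sqlite3` module and uses two in-memory databases, so it illustrates the technique rather than the code from the slides. The table layout and sample rows are made up.

```python
import sqlite3

# Two in-memory databases stand in for the source and the target.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE data (id INTEGER, name TEXT)")
src.executemany("INSERT INTO data VALUES (?, ?)", [(1, "a"), (2, "b")])

target = sqlite3.connect(":memory:")

# Reflect the schema: PRAGMA table_info returns one row per column
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in src.execute("PRAGMA table_info(data)")]

# Clone schema: rebuild CREATE TABLE from the reflected columns.
column_defs = ", ".join('"%s" %s' % (name, ctype) for name, ctype in columns)
target.execute("CREATE TABLE data (%s)" % column_defs)

# Copy data: stream rows from the source cursor into the target.
placeholders = ", ".join("?" for _ in columns)
target.executemany(
    "INSERT INTO data VALUES (%s)" % placeholders,
    src.execute("SELECT * FROM data"),
)
target.commit()

copied = target.execute("SELECT * FROM data ORDER BY id").fetchall()
print(copied)
```

Only column names and declared types are cloned here. Constraints, defaults and indexes would need extra handling.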
text file (CSV) to table:
    import csv
    from sqlalchemy import Column, String

    reader = csv.reader(file_stream)
    # the first row holds the column names
    columns = next(reader)
    for column in columns:
        table.append_column(Column(column, String))
    table.create()

    for row in reader:
        insert.execute(row)
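A self-contained variant of the CSV load, again using the standard-library `sqlite3` module in place of SQLAlchemy. The CSV content and the table name `data` are assumptions made for the example, and every column is stored as text, mirroring the `String` columns above.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; io.StringIO stands in for an open file.
file_stream = io.StringIO("name,city\nAlice,Bratislava\nBob,New York\n")

reader = csv.reader(file_stream)
columns = next(reader)  # the first row holds the column names

connection = sqlite3.connect(":memory:")

# Create one TEXT column per CSV header field.
column_defs = ", ".join('"%s" TEXT' % name for name in columns)
connection.execute("CREATE TABLE data (%s)" % column_defs)

# Insert the remaining rows straight from the CSV reader.
placeholders = ", ".join("?" for _ in columns)
connection.executemany("INSERT INTO data VALUES (%s)" % placeholders, reader)
connection.commit()

rows = connection.execute("SELECT name, city FROM data ORDER BY name").fetchall()
print(rows)
```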
Simple T from ETL
    # (target field, source transformation)
    transformation = [
        ('fiscal_year', {"function": int, "field": "fiscal_year"}),
        ('region_code', {"mapping": region_map, "field": "region"}),
        ('borrower_country', None),
        ('project_name', None),
        ('procurement_type', None),
        ('major_sector_code', {"mapping": sector_code_map, "field": "major_sector"}),
        ('major_sector', None),
        ('supplier', None),
        ('contract_amount', {"function": currency_to_number, "field": 'total_contract_amount'}),
    ]

    for row in source:
        result = transform(row, transformation)
        table.insert(result).execute()
Transformation
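The slides do not show `transform` itself. One possible implementation is sketched below, assuming that `None` means "copy the field of the same name" and that a mapping is applied before a function. The helper `currency_to_number`, the region codes and the sample row are hypothetical.

```python
def transform(row, transformation):
    """Build a target record from a source row (hypothetical helper)."""
    result = {}
    for target_field, spec in transformation:
        spec = spec or {}
        # Without an explicit source field, read the same-named one.
        value = row[spec.get("field", target_field)]
        if "mapping" in spec:
            value = spec["mapping"][value]
        if "function" in spec:
            value = spec["function"](value)
        result[target_field] = value
    return result


def currency_to_number(text):
    """Assumed converter for strings such as '$1,200.50'."""
    return float(text.replace("$", "").replace(",", ""))


region_map = {"Africa": "AFR", "Europe and Central Asia": "ECA"}  # made-up codes

transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("supplier", None),
    ("contract_amount", {"function": currency_to_number,
                         "field": "total_contract_amount"}),
]

source_row = {
    "fiscal_year": "2012",
    "region": "Africa",
    "supplier": "ACME",
    "total_contract_amount": "$1,200.50",
}
result = transform(source_row, transformation)
print(result)
```

Keeping the rules in a plain list means the transformation can be stored, reviewed and changed as data, separately from the loop that applies it.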
OLAP with Cubes
cubes: measures
dimensions: levels, attributes, hierarchy
Model
    {
        "name": "My Model",
        "description": ....,
        "cubes": [...],
        "dimensions": [...]
    }
❄
logical
physical
    model = load_model("model.json")
    workspace = create_workspace("sql", model, url="sqlite:///data.sqlite")
    cube = model.cube("sales")
    browser = workspace.browser(cube)
Aggregation Browser backend
cubes
Application
drill-down
    result = browser.aggregate(cell, drilldown=["sector"])
    for row in result.table_rows("sector"):
        row.label
        row.key
        row.record["amount_sum"]
whole cube
    cell = Cell(cube)
    browser.aggregate(cell)                        # Total
    browser.aggregate(cell, drilldown=["date"])    # 2006 2007 2008 2009 2010

    cut = PointCut("date", [2010])
    cell = cell.slice(cut)
    browser.aggregate(cell, drilldown=["date"])    # Jan Feb Mar Apr May ...
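To make the cell, cut and drill-down idea concrete without the Cubes library, here is a plain-Python sketch. It is not the Cubes API: the `aggregate` function, the fact records and the `(year, month)` date path are assumptions chosen to mimic the behaviour shown on the slides.

```python
from collections import defaultdict

# Made-up facts; "date" is a (year, month) path through the date hierarchy.
FACTS = [
    {"date": (2009, 12), "amount": 10},
    {"date": (2010, 1), "amount": 20},
    {"date": (2010, 1), "amount": 5},
    {"date": (2010, 2), "amount": 7},
]


def aggregate(facts, cut=(), drilldown=False):
    """Sum amounts of facts whose date path starts with `cut`.

    With drilldown=True, return one total per member of the next
    hierarchy level below the cut instead of a single total.
    """
    selected = [f for f in facts if f["date"][:len(cut)] == tuple(cut)]
    if not drilldown:
        return sum(f["amount"] for f in selected)
    totals = defaultdict(int)
    for fact in selected:
        totals[fact["date"][len(cut)]] += fact["amount"]
    return dict(totals)


total = aggregate(FACTS)                                      # whole cube
by_year = aggregate(FACTS, drilldown=True)                    # years
by_month_2010 = aggregate(FACTS, cut=[2010], drilldown=True)  # months of 2010
print(total, by_year, by_month_2010)
```

Slicing by a year moves the drill-down one level deeper, which is why the same call returns years for the whole cube and months after the cut. The sketch does not handle a drill-down below the deepest level.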
How can Python be Useful
■ saves maintenance resources
■ shortens development time
■ saves you from going insane
just the Language
Source Systems: structured documents, databases, APIs
Temporary Staging Area
Staging Area (L0, staging)
Operational Data Store (L1, relational)
Datamarts (L2, dimensional)
faster
Data Governance
Analysis and Presentation
Extraction, Transformation, Loading
Data Sources
Technologies and Utilities
faster advanced
understandable, maintainable
Conclusion
people
technology processes
BI is about…
don’t forget metadata
who is going to fix your COBOL Java tool if you have only Python guys around?
Future
Python is capable, let’s start
Thank You
Twitter: @Stiivi
DataBrewery blog: blog.databrewery.org
Github: github.com/Stiivi