Addressing Data Chaos: Using MySQL and Kettle to Deliver World-Class Data Warehouses Matt Casters: Chief Architect, Data Integration and Kettle Project Founder MySQL User Conference, Wednesday April 25, 2007


Page 1: Addressing Data Chaos: Using MySQL and Kettle to Deliver World-Class Data Warehouses

Addressing Data Chaos:Using MySQL and Kettle to Deliver World-Class Data Warehouses

Matt Casters: Chief Architect, Data Integration and Kettle Project Founder

MySQL User Conference, Wednesday April 25, 2007

Page 2

Agenda

Big News

Data Integration challenges and open source BI adoption

Pentaho company overview

Pentaho Data Integration fundamentals: schema design, Kettle basics, demonstration

Resources and links

Page 3

Announcing Pentaho Data Integration 2.5.0

Again we offer big improvements over smash-hit version 2.4.0:

Advanced error handling

Tight Apache VFS integration: allows us to directly load and save files from any location: file systems, web servers, FTP sites, ZIP files, tar files, etc.

Dimension key caching, dramatically improving speed

A slew of new job entries and steps (including MySQL bulk operations)

Hundreds of bug fixes

Page 4

Managing Data Chaos: Data Integration Challenges

Data is everywhere: customer order information in one system, customer service information in another

Data is inconsistent: the record of the customer is different in each system

Performance is an issue: running queries to summarize 3 years of data in the operational system takes forever, AND it brings the operational system to its knees

The data is never ALL in the data warehouse: acquisitions, Excel spreadsheets, new applications

[Diagram: source systems (Customer Service History, Customer Order History, Marketing Campaigns, XML Acquired System) feeding the Data Warehouse]

Page 5

How Pentaho Extends MySQL with ETL

MySQL Provides

Data storage

SQL query execution

Heavy-duty sorting, correlation, aggregation

Integration point for all BI tools

Kettle Provides

Data extraction, transformation, and loading

Dimensional modeling

SQL generation

Aggregate creation

Data enrichment and calculations

Data migration

Page 6

Sample Companies that Use MySQL and Kettle from Pentaho

“With professional support and world-class ETL from Pentaho, we've been able to simplify our IT environment and lower our costs. We were also surprised at how much faster Pentaho Data Integration was than our prior solution.”

“We selected Pentaho for its ease-of-use. Pentaho addressed many of our requirements -- from reporting and analysis to dashboards, OLAP and ETL, and offered our business users the Excel-based access that they wanted.”

Page 7

“We chose Pentaho because it has a full range of functionality, exceptional flexibility, and a low total cost of ownership because of its open source business model. We can start delivering value to our business users quickly with embedded, web-based reporting, while integrating our disparate data sources for more strategic benefits down the road.”

Other Kettle Users

…and thousands more

Page 8

Pentaho Introduction

World’s most popular enterprise open source BI suite

2 million lifetime downloads, averaging 100K / month

Founded in 2004: pioneer in professional open source BI

Key projects: JFreeReport (Reporting), Kettle (Data Integration), Mondrian (OLAP), Pentaho BI Platform, Weka (Data Mining)

Management and board: proven BI veterans from Business Objects, Cognos, Hyperion, SAS, Oracle; open source leaders Larry Augustin, New Enterprise Associates, Index Ventures

MySQL Gold Partner

Page 9

Overview: Data Warehouse Data Flow

From source systems …

to the data warehouse …

to reports …

to analyses …

to dashboard reports …

to better information

Page 10

Pentaho Introduction

[Diagram: operational data (Sales, Marketing, Inventory, Financial, Production) flowing up through Departmental reports, Aggregates, and Analysis to Scorecards at the Strategic level]

Page 11

The star schema: a new data model is needed

Because data from various sources is “mixed” we need to design a new data model: a star schema.

A star schema is designed based on the requirements and populated by the ETL engine.

During modeling we split the requirements into Facts and Dimensions:

Category     2001        2002        2003        Total
Laptop       2.800.726   5.272.243   2.295.147   10.368.116
Monitor        138.681     297.037     145.263      580.981
PC           2.260.053   3.893.171   1.784.220    7.937.444
Peripheral   3.028.527   5.966.100   2.857.026   11.851.653
Printer      2.795.736   5.566.608   2.285.188   10.647.532
Server       2.210.015   3.591.230   2.044.897    7.846.142
Total       13.233.738  24.586.389  11.411.741   49.231.868

The Category column and the years are Dimensions; the amounts are Facts.

Page 12

The star schema: a new data model is needed

After grouping the dimension attributes by subject we get our data model. For example:

[Diagram: star schema with the Order Line fact table at the center, linked to the Customer, Product, Order, and Date dimensions]

Page 13

Overview: A new data model is needed

The fact table contains ONLY facts and dimension technical keys

Column              Type           Data type
date_tk             Technical key  Bigint
customer_tk         Technical key  Bigint
order_tk            Technical key  Bigint
product_tk          Technical key  Bigint
number_of_products  Fact           Smallint
Turnover            Fact           Float
Pct_discount        Fact           Tinyint
Discount            Fact           Float
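The column list above can be turned into DDL directly. Below is a minimal, self-contained sketch in Python, with SQLite standing in for MySQL so it runs anywhere; the fact-table columns and types follow the slide, while the product_dim table and the sample rows are illustrative additions to show why the schema works for reporting:

```python
import sqlite3

# Build the order-line fact table from the slide's column list.
# SQLite accepts the MySQL-style type names; on MySQL the same DDL applies.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_line_fact (
    date_tk            BIGINT,   -- technical key -> date dimension
    customer_tk        BIGINT,   -- technical key -> customer dimension
    order_tk           BIGINT,   -- technical key -> order dimension
    product_tk         BIGINT,   -- technical key -> product dimension
    number_of_products SMALLINT, -- fact
    turnover           FLOAT,    -- fact
    pct_discount       TINYINT,  -- fact
    discount           FLOAT     -- fact
);
CREATE TABLE product_dim (       -- illustrative dimension
    product_tk BIGINT PRIMARY KEY,
    category   TEXT
);
""")
conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                 [(1, "Laptop"), (2, "Printer")])
conn.executemany("INSERT INTO order_line_fact VALUES (?,?,?,?,?,?,?,?)",
                 [(20070425, 10, 1, 1, 2, 1500.0, 0, 0.0),
                  (20070425, 10, 1, 2, 1, 200.0, 5, 10.0)])

# A report query only joins the fact table to the dimensions it needs:
rows = conn.execute("""
    SELECT p.category, SUM(f.turnover) AS total_turnover
    FROM order_line_fact f
    JOIN product_dim p ON p.product_tk = f.product_tk
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)  # [('Laptop', 1500.0), ('Printer', 200.0)]
```

Because the fact table holds only keys and numeric facts, aggregations like this stay simple and fast regardless of how many dimensions exist.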

Page 14

Overview: A new data model is needed

The dimensions contain technical fields, typically like in this customer dimension entry for customer_id = 100 (NAL = Name, Address & Location):

TK  Version  date_from  date_to  cust_id  name     NAL*       Birth_date
10  1        -          -        100      Matt C.  Address 1  1900-01-01

If the address changes (at time T1) we get a new entry in the dimension; an empty date_to marks the current version. This is called a Ralph Kimball type II dimension update:

TK  Version  date_from  date_to  cust_id  name     NAL*       Birth_date
10  1        -          T1       100      Matt C.  Address 1  1900-01-01
54  2        T1         -        100      Matt C.  Address 2  1900-01-01
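A type II update can be sketched in a few lines. The code below is an illustrative Python/SQLite reimplementation of what an ETL dimension-maintenance step does (it is not Kettle's own code); table and column names follow the slides, with a single address column standing in for the NAL fields:

```python
import sqlite3

# Customer dimension with technical key, version, and validity range.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customer_dim (
    tk         INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate (technical) key
    version    INTEGER,
    date_from  TEXT,
    date_to    TEXT,       -- NULL marks the current version
    cust_id    INTEGER,    -- natural key from the source system
    name       TEXT,
    address    TEXT,
    birth_date TEXT
)""")

def type2_update(conn, cust_id, name, address, birth_date, change_date):
    """Type II: close the current version and insert a new one."""
    row = conn.execute(
        "SELECT tk, version, name, address FROM customer_dim "
        "WHERE cust_id = ? AND date_to IS NULL", (cust_id,)).fetchone()
    if row is None:  # first time we see this customer: version 1
        conn.execute(
            "INSERT INTO customer_dim(version, date_from, date_to, cust_id, "
            "name, address, birth_date) VALUES (1, ?, NULL, ?, ?, ?, ?)",
            (change_date, cust_id, name, address, birth_date))
    elif (row[2], row[3]) != (name, address):  # a tracked attribute changed
        conn.execute("UPDATE customer_dim SET date_to = ? WHERE tk = ?",
                     (change_date, row[0]))
        conn.execute(
            "INSERT INTO customer_dim(version, date_from, date_to, cust_id, "
            "name, address, birth_date) VALUES (?, ?, NULL, ?, ?, ?, ?)",
            (row[1] + 1, change_date, cust_id, name, address, birth_date))

type2_update(conn, 100, "Matt C.", "Address 1", "1900-01-01", "2007-01-01")
type2_update(conn, 100, "Matt C.", "Address 2", "1900-01-01", "2007-04-25")  # T1
versions = conn.execute("SELECT version, address, date_to FROM customer_dim "
                        "ORDER BY version").fetchall()
print(versions)  # [(1, 'Address 1', '2007-04-25'), (2, 'Address 2', None)]
```

Fact rows loaded before T1 keep pointing at the old technical key, so history is preserved automatically.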

Page 15

Overview: A new data model is needed

If the birth_date changes we update all entries in the dimension. This is called a Ralph Kimball type I dimension update (NAL = Name, Address & Location):

TK  Version  date_from  date_to  cust_id  name     NAL*       Birth_date
10  1        -          T1       100      Matt C.  Address 1  1969-02-14
54  2        T1         -        100      Matt C.  Address 2  1969-02-14
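A type I change, by contrast, is a correction: a single UPDATE overwrites the attribute in every version of the row, and no history is kept. A minimal sketch (illustrative names, SQLite standing in for MySQL):

```python
import sqlite3

# Two versions of customer 100 already exist (from a type II address change).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    tk INTEGER, version INTEGER, cust_id INTEGER, birth_date TEXT)""")
conn.executemany("INSERT INTO customer_dim VALUES (?,?,?,?)",
                 [(10, 1, 100, "1900-01-01"),
                  (54, 2, 100, "1900-01-01")])

# Type I: one UPDATE on the natural key touches all versions at once.
conn.execute("UPDATE customer_dim SET birth_date = ? WHERE cust_id = ?",
             ("1969-02-14", 100))

dates = [r[0] for r in conn.execute(
    "SELECT birth_date FROM customer_dim WHERE cust_id = 100")]
print(dates)  # ['1969-02-14', '1969-02-14']
```

Which attributes get type I versus type II treatment is a modeling decision: track changes you want to analyze over time, overwrite plain corrections.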

Page 16

Implications

We are making it easier to create reports by using star schemas

We are shifting work from the reporting side to the ETL side

We need a good toolset to do ETL because of the complexities

We need to turn everything upside down

… and this is where Pentaho Data Integration comes in.

Page 17

Data Transformation and Integration Examples

Data filtering: is not null, greater than, less than, includes

Field manipulation: trimming, padding, upper- and lowercase conversion

Data calculations: + - × /, average, absolute value, arctangent, natural logarithm

Date manipulation: first day of month, last day of month, add months, week of year, day of year

Data type conversion: string to number, number to string, date to number

Merging fields & splitting fields

Looking up data: look up in a database, in a text file, an Excel sheet, …
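For illustration, here are plain-Python equivalents of a few of the operations above; in Kettle each would be a dedicated step in a transformation (a filter step, a string-manipulation step, a calculator step, and so on), and the sample rows are invented:

```python
import calendar
from datetime import date

# A stream of rows, as an ETL engine would see them.
rows = [
    {"name": "  alice ", "amount": "120.5", "when": date(2007, 4, 25)},
    {"name": None,       "amount": "80",    "when": date(2007, 1, 3)},
]

# Data filtering: keep rows where name is not null.
kept = [r for r in rows if r["name"] is not None]

for r in kept:
    # Field manipulation: trimming and uppercase conversion.
    r["name"] = r["name"].strip().upper()
    # Data type conversion: string to number.
    r["amount"] = float(r["amount"])
    # Date manipulation: first and last day of the month.
    d = r["when"]
    r["month_start"] = d.replace(day=1)
    r["month_end"] = d.replace(day=calendar.monthrange(d.year, d.month)[1])

print(kept)
# [{'name': 'ALICE', 'amount': 120.5, 'when': datetime.date(2007, 4, 25),
#   'month_start': datetime.date(2007, 4, 1),
#   'month_end': datetime.date(2007, 4, 30)}]
```

The value of a tool like Kettle is that these operations are configured graphically and composed into pipelines rather than hand-coded for every source.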

Page 18

Pentaho Data Integration (Kettle) Components

Spoon: connect to data sources; define transformation rules and design target schema(s); graphical job execution workflow engine for defining multi-stage and conditional transformation jobs

Pan: command-line execution of single, pre-defined transformation jobs

Kitchen: scheduler for multi-stage jobs

Pentaho BI Platform: integrated scheduling of transformations or jobs; ability to call real-time transformations and use output in reports and dashboards

Page 19

Demonstration

- create a MySQL database + repository
- create dimensions
- create facts
- auditing & incremental loading
- jobs
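The auditing and incremental-loading part of the demo can be sketched as follows: record the highest key (or timestamp) already loaded in an audit table, then pull only newer source rows on each run. Table names are illustrative and SQLite stands in for the MySQL source and warehouse:

```python
import sqlite3

# Source system with three orders.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?,?)",
                [(1, 10.0), (2, 20.0), (3, 30.0)])

# Warehouse with a fact table and an audit table remembering progress.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE orders_fact (order_id INTEGER, amount REAL)")
dwh.execute("CREATE TABLE load_audit (last_order_id INTEGER)")
dwh.execute("INSERT INTO load_audit VALUES (0)")

def incremental_load():
    """Load only rows newer than the last loaded key; return row count."""
    last = dwh.execute("SELECT last_order_id FROM load_audit").fetchone()[0]
    new_rows = src.execute(
        "SELECT order_id, amount FROM orders WHERE order_id > ? "
        "ORDER BY order_id", (last,)).fetchall()
    if new_rows:
        dwh.executemany("INSERT INTO orders_fact VALUES (?,?)", new_rows)
        dwh.execute("UPDATE load_audit SET last_order_id = ?",
                    (new_rows[-1][0],))
    return len(new_rows)

first = incremental_load()   # initial run loads all 3 rows
second = incremental_load()  # nothing new: 0 rows
print(first, second)  # 3 0
```

This keeps nightly loads proportional to the volume of new data instead of re-reading the whole source, which matters at billions of rows.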

Page 20

Case Study: Pentaho Data Integration

Organization: Flemish Government Traffic Centre

Use case: Monitoring the state of the road network

Application requirement: Integrate minute-by-minute data from

570 highway locations for analysis

Technical challenges: Large volume of data, more than 2.5

billion rows

Business Usage: Users can now compare traffic speeds based on

weather conditions, time of day, date, season

Best practices:Clearly understand business user requirements firstThere are often multiple ways to solve data integration problems, so consider the long-term need when choosing the right way

Page 21

Case Study: Replacement of Proprietary Data Integration

Organization: large, public, North American genetics and pharmaceutical research firm

Application requirement: data warehouse for analysis of patient trials and research spending

Incumbent BI vendor: Oracle (Oracle Warehouse Builder)

Decision criteria: ease of use, openness, cost of ownership. “It was so much quicker and easier to do the things we wanted to do, and so much easier to maintain when our users’ business requirements change.”

Best practices: evaluate replacement costs holistically; treat migrations as an opportunity to improve a deployment, not just move it; good deployments are iterative and evolve regularly, and if users like what you give them, they will probably ask for more

Page 22

Summary and Resources

Pentaho and MySQL can help you manage your data infrastructure: Extraction, Transformation and Loading for data warehousing and data migration

kettle.pentaho.org: Kettle project homepage

kettle.javaforge.com: Kettle community website: forum, source, documentation, tech tips, samples, …

www.pentaho.org/download/: all Pentaho modules, pre-configured with sample data; developer forums and documentation; Ventana Research Open Source BI Survey

www.mysql.com:
White paper: http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
Kettle webinar: http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
Roland Bouman’s blog on Pentaho Data Integration and MySQL: http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html