Building BI Solutions with SQL Server PDW AU3
Ruwen Hess
Senior Program Manager
Microsoft Corporation
DBI321
Agenda
• Trends in the DW space
• How does SQL Server PDW fit in?
• SQL Server PDW AU3 – What’s new?
• Building BI Solutions with SQL Server PDW
  • Customer Successes
  • Using SQL Server PDW with Microsoft BI solutions
  • Using SQL Server PDW with third party BI solutions
  • BI solutions leveraging Hadoop integration
• What’s coming next in SQL Server PDW?
Trends in the Data Warehousing Space
Understanding the Opportunity
[Figure: bar chart "Approximate data volume managed by DW", comparing "Today" with "In 3 years" for the size bands Less than 1 TB, 1 - 3 TB, 3 - 10 TB, More than 10 TB and Don't Know; percentage axis from 0% to 45%. Source: TDWI Report – Next Generation DW]
• Performance at scale: the ability to analyze massive amounts of data
• DW systems continue to grow at a fast pace, so scalability is a key concern: systems grow from 10s of TB to 100s of TB to PBs
• “Data Warehousing has shifted almost entirely towards the appliance model due to speed of the balanced appliance and scalability of scale out (MPP) solutions.” (Jim Kobielus, Forrester Research)
• Appliances are the key trend in the next 4 years (4 billion market by ‘15)
• Cloud DW is longer-term
• Box is in slow decline
[Figure: chart "DW Software License Revenue, US$ Billions" for FY10 to FY15, with the series Traditional, Appliances/RA, Private Cloud and Public Cloud, each annotated with its CAGR and its Share (‘15). Source: MS internal analysis, DBSMIT Cloud Market Opportunity Forecast]
What is Parallel Data Warehouse (PDW)?
SQL Server Data Warehousing in Appliance Model
• Scalable
• Standards Based
• Flexible
• Cost Effective
SQL Server PDW Hardware Architecture
[Diagram labeled "Scale out": a CONTROL RACK holding the Control Node (query submitted here), Management Node, Landing Zone and Backup Node, and a DATA RACK holding the SQL Server compute nodes]
• Query is executed on all nodes
• Multiple queries are simultaneously executed across all nodes
• PDW supports querying while data is loading
PDW Data Example
[Diagram: a star schema next to four PDW Compute Nodes, each running SQL Server]
• Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
• Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
• Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
• Mktg Campaign Dim: Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
• Sales Facts: Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp Id, Qty Sold, Dollars Sold
PDW Data Example
Smaller Dimension Tables are Replicated on Every Compute Node
[Diagram: the same star schema (Time Dim, Store Dim, Product Dim, Mktg Campaign Dim, Sales Facts); each of the four SQL Server compute nodes is marked PD, TD, MD and SD, one copy of each dimension table]
PDW Data Example
Larger Fact Table is Hash Distributed Across All Compute Nodes
[Diagram: the same star schema; each of the four SQL Server compute nodes is marked PD, TD, MD and SD and additionally holds one slice of Sales Facts: SF-1, SF-2, SF-3 and SF-4]
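The placement shown in these diagrams (dimension tables copied to every node, fact rows hashed to exactly one node) can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the crc32 hash, the choice of the product key as the distribution column, and all sample rows are hypothetical and are not taken from PDW itself.

```python
import zlib

NODE_COUNT = 4  # the diagrams show four compute nodes

def node_for(key, node_count=NODE_COUNT):
    # Assumption: crc32 stands in for whatever hash the appliance really uses.
    return zlib.crc32(str(key).encode("utf-8")) % node_count

def replicate(rows, node_count=NODE_COUNT):
    """Small dimension table: every node gets a full copy."""
    return [list(rows) for _ in range(node_count)]

def hash_distribute(rows, key_index, node_count=NODE_COUNT):
    """Large fact table: each row goes to exactly one node, chosen by hashing one column."""
    slices = [[] for _ in range(node_count)]
    for row in rows:
        slices[node_for(row[key_index], node_count)].append(row)
    return slices

# Hypothetical sample data: (prod_dim_id, prod_desc) and (sale_id, prod_dim_id, qty_sold).
product_dim = [(1, "Widget"), (2, "Gadget"), (3, "Gizmo")]
sales_facts = [(sale_id, 1 + sale_id % 3, sale_id * 2) for sale_id in range(1, 13)]

product_copies = replicate(product_dim)
fact_slices = hash_distribute(sales_facts, key_index=1)

def local_join(node):
    """Join a node's fact slice to its local dimension copy; no other node is consulted."""
    names = dict(product_copies[node])
    return [(sale_id, names[prod_id], qty) for sale_id, prod_id, qty in fact_slices[node]]

joined = [row for node in range(NODE_COUNT) for row in local_join(node)]
```

Because every node holds the whole dimension table, the join in `local_join` never needs rows from another node, which is the point of replicating the small tables.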
SQL Server Parallel Data Warehouse
A quick look at MPP query execution
[Diagram: a Client connects to the Control Node of the SQL Server PDW Appliance, which sits in front of Compute Node 1, Compute Node 2, ... Compute Node N]
• The user connects to ‘the appliance’ as they would to a ‘normal’ SQL Server and sends a request
• The control node handles global query execution and generates a distributed execution plan
• The actual user data resides on the compute nodes, and the steps of the global execution plan are executed on each compute node
• SQL Server PDW is a shared-nothing MPP system, meaning user data is distributed across the nodes*. The Data Movement Service (DMS) is responsible for moving data around so that individual nodes can satisfy queries that need data from other nodes.
Dealing with Distributions - Shuffling
Example: SELECT [color], SUM([qty]) FROM [Store Sales] GROUP BY [color];
Distributed Table Store Sales (Ss_id, color, qty)
• Compute Node 1: (1, Red, 5), (3, Blue, 11), (5, Red, 12), (7, Green, 7)
• Compute Node 2: (2, Red, 8), (4, Blue, 10), (6, Yellow, 12)
Shuffle Movement: DMS redistributes the data by hashing the color values, in parallel, into Temp_1 (color, qty)
• Temp_1 on one node: (Red, 5), (Red, 12), (Red, 8), (Green, 7)
• Temp_1 on the other node: (Blue, 11), (Yellow, 12), (Blue, 10)
Parallel Merge and Aggregate, then Return (color, qty)
• (Blue, 21), (Red, 25), (Green, 7), (Yellow, 12)
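The shuffle example can be replayed in Python with the same seven rows. This is a minimal sketch, not the DMS implementation: crc32 is an assumed stand-in for the real hash, so the node a given color lands on may differ from the slide, while the final totals do not depend on that choice.

```python
import zlib
from collections import defaultdict

# Rows of the distributed [Store Sales] table as (ss_id, color, qty), one list per compute node.
NODE_ROWS = [
    [(1, "Red", 5), (3, "Blue", 11), (5, "Red", 12), (7, "Green", 7)],
    [(2, "Red", 8), (4, "Blue", 10), (6, "Yellow", 12)],
]

def target_node(color, node_count):
    # Assumption: any deterministic hash works for the illustration.
    return zlib.crc32(color.encode("utf-8")) % node_count

def shuffle(node_rows):
    """Redistribute (color, qty) pairs so that equal colors end up on the same node."""
    temp = [[] for _ in node_rows]
    for rows in node_rows:
        for _ss_id, color, qty in rows:
            temp[target_node(color, len(node_rows))].append((color, qty))
    return temp

def aggregate_locally(temp_rows):
    """SUM(qty) GROUP BY color over the rows held by one node."""
    totals = defaultdict(int)
    for color, qty in temp_rows:
        totals[color] += qty
    return dict(totals)

def run_query(node_rows):
    partials = [aggregate_locally(temp) for temp in shuffle(node_rows)]
    result = {}
    for partial in partials:
        # After the shuffle no color lives on two nodes, so the partial results just concatenate.
        result.update(partial)
    return result

result = run_query(NODE_ROWS)
```

The design point is that the shuffle makes each group complete on a single node, so the final step is a plain merge of per-node results rather than a second aggregation.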
SQL Server Parallel Data Warehouse
Overall Architecture
Control Rack
• Control Node: Client Interface (JDBC, ODBC, OLE-DB, ADO.NET), PDW Engine, DMS Manager, PDW Agent
• Landing Zone Node: ETL Interface, Bulk Data Loader, PDW Agent
• Management Node: Active Directory, PDW Agent
Data Rack (up to 4)
• Compute Node 1, Compute Node 2, … Compute Node 10: each with DMS Core and PDW Agent
Legend: PDW service; DMS = Data Movement Service; PDW = Parallel Data Warehouse
SQL Server Parallel Data Warehouse AU3
Release Themes
• Performance At Scale: less work for the same results; do the same work more efficiently
• SQL Server Compatibility: broader functionality; full alignment
• BI, Analytics, & ETL Integration: native support for Analysis Services, Reporting Services and PowerPivot; lay the foundation for broad connectivity support
SQL Server PDW Architecture
How did it work before?
• Problem: basic RDBMS functionality that already exists in SQL Server was re-built in PDW
• Challenge for the PDW AU3 release: can we leverage SQL Server and focus on MPP-related challenges?
SQL Server PDW AU3 Architecture
PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer
[Diagram: a SELECT arrives at the Control Node, which hosts the Engine Service and the Shell Appliance (SQL Server); plan steps are sent to each Compute Node (SQL Server)]
• SQL Server runs a ‘Shell Appliance’
• Every database exists as an empty ‘shell’: all objects, no user data
• DDL executes against both the shell and the compute nodes
• Large parts of basic RDBMS functionality are now provided by the shell: authentication and authorization, schema binding, metadata catalog
SQL Server PDW AU3 Architecture
PDW AU3 Architecture with Shell Appliance and Cost-Based Query Optimizer
1. User issues a query
2. Query is sent to the Shell through the sp_showmemo_xml stored procedure: SQL Server performs parsing, binding and authorization, and the SQL optimizer generates execution alternatives
3. MEMO containing candidate plans, histograms and data types is generated
4. Parallel execution plan is generated
5. Parallel plan executes on the compute nodes
6. Result is returned to the user
[Diagram: on the Control Node, the SELECT goes to the Shell Appliance (SQL Server), the MEMO goes to the Engine Service, plan steps go to each Compute Node (SQL Server), and the result is returned]
PDW Cost-Based Optimizer
Optimizer lifecycle…
1. Simplification and space exploration
• Query standardization and simplification (e.g. column reduction, predicate push-down)
• Logical space exploration (e.g. join re-ordering, local/global aggregation)
• Space expansion (e.g. bushy trees – dealing with intermediate result sets)
• Physical space exploration
• Serializing the MEMO into binary XML (logical plans)
• De-serializing binary XML into the PDW Memo
2. Parallel optimization and pruning
• Injecting data move operations (expansion)
• Costing different alternatives
• Pruning and selecting the lowest-cost distributed plan
3. SQL generation
• Generating SQL statements to be executed
PDW Cost-Based Optimizer
… And Cost Model Details
PDW cost model assumptions:
• Only data movement operations are costed (relational operations are excluded)
• Sequential step execution (no pipelined or independent parallelism)
• Data movement operation costs are modeled in detail: each movement consists of multiple tasks, and each task has a fixed and a variable overhead
• Uniform data distribution is assumed (no data skew)
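The four assumptions translate naturally into a small additive cost function. The sketch below is illustrative only: the task names, the constants, and the choice to charge variable overhead per row on one node's share are hypothetical and are not PDW's actual formulas.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass(frozen=True)
class Task:
    name: str
    fixed_cost: float    # paid once per task, regardless of volume
    cost_per_row: float  # variable overhead, paid per row handled

@dataclass(frozen=True)
class DataMove:
    name: str
    total_rows: int
    tasks: Tuple[Task, ...]

def move_cost(move: DataMove, node_count: int) -> float:
    # Uniform distribution assumption: every node handles an equal share of the rows.
    rows_per_node = move.total_rows / node_count
    return sum(t.fixed_cost + t.cost_per_row * rows_per_node for t in move.tasks)

def plan_cost(moves: Sequence[DataMove], node_count: int) -> float:
    # Sequential step execution: the costs of the steps simply add up.
    # Relational operators (scans, joins) are deliberately not costed.
    return sum(move_cost(m, node_count) for m in moves)

# Hypothetical task breakdown and constants, for illustration only.
TASKS = (Task("read", 1.0, 0.5), Task("send", 2.0, 0.25), Task("write", 1.0, 0.25))
move_1 = DataMove("move_1", total_rows=4000, tasks=TASKS)
move_2 = DataMove("move_2", total_rows=400, tasks=TASKS)

total = plan_cost([move_1, move_2], node_count=4)
```

With a model of this shape, an optimizer can compare candidate distributed plans by evaluating `plan_cost` for each and keeping the lowest value.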
Distributed Query Cost Based Optimizer
Operator trees for a sample query
SELECT * FROM orders JOIN lineitem ON (o_orderkey = l_orderkey) JOIN part ON (l_partkey = p_partkey) WHERE p_name LIKE '%smoke%';
[Diagram: a PDW AU2 operator tree and a PDW AU3 operator tree over the inputs O (o_o), LI (l_o) and P (p_pk), with the join predicates (l_o = o_o) and (l_pk = p_pk) and the data movement operators shuffle (l_pk) and broadcast]
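The operator trees involve two kinds of data movement, shuffle and broadcast. One rough way to see why an optimizer might pick one over the other is to compare how many rows each would move. The sketch below assumes uniform hashing and counts rows only; the row counts are hypothetical, and this is not PDW's actual costing.

```python
def rows_moved_by_shuffle(table_rows: int, node_count: int) -> float:
    # With a uniform hash, a row already sits on its target node with probability 1/N,
    # so about (N - 1) / N of the rows have to travel.
    return table_rows * (node_count - 1) / node_count

def rows_moved_by_broadcast(table_rows: int, node_count: int) -> float:
    # Every row must reach each of the other N - 1 nodes.
    return table_rows * (node_count - 1)

def cheaper_move(shuffle_rows: int, broadcast_rows: int, node_count: int) -> str:
    """Pick between shuffling the large input and broadcasting the small one."""
    shuffle_cost = rows_moved_by_shuffle(shuffle_rows, node_count)
    broadcast_cost = rows_moved_by_broadcast(broadcast_rows, node_count)
    return "broadcast" if broadcast_cost < shuffle_cost else "shuffle"

# Hypothetical cardinalities: a large join input versus a part table
# that is either heavily filtered or barely filtered.
choice_selective = cheaper_move(shuffle_rows=6_000_000, broadcast_rows=2_000, node_count=10)
choice_unselective = cheaper_move(shuffle_rows=6_000_000, broadcast_rows=2_000_000, node_count=10)
```

Under these assumptions a selective predicate such as the LIKE filter on `p_name` makes broadcasting the filtered rows far cheaper than shuffling the large input, while an unselective filter tips the choice back to a shuffle.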
PDW Sales Test Workload
AU2 to AU3
[Figure: bar chart of elapsed time in seconds (axis 0 to 80) for queries 1 to 39, AU2 vs. AU3]
• 5x improvement in terms of total elapsed time out of the box
Theme: Performance at Scale
Zero data conversions in data movement
[Figure: "DMS CPU Utilization - TPCH", CPU (%) on an axis from 0 to 60 for queries Q1 to Q21, AU2 vs. AU3]
[Figure: "Throughput improvement for data movements", on an axis from 0% to 600%, for Broadcast, Trim, Replicate, Shuffle and Repl Table Load]
Goal
• Eliminate CPU utilization spent on data conversions
• Further parallelize operations during data moves
Functionality
• Using ODBC instead of ADO.NET for reading and writing data
• Minimizing appliance resource utilization for data moves
Benefits
• Better resource (CPU) utilization
• 6x or more faster move operations
• Increased concurrency
• Mixed workload (loads + queries)
Theme: SQL Server Compatibility
SQL Server Security and Metadata
Security
• SQL Server security syntax and semantics
• Same underlying authorization model and code
• Supporting users, roles and logins
• Fixed database roles
• Allows script re-use
• Allows well-known security procedures/processes
Metadata
• PDW metadata stored in SQL Server
• Existing SQL Server metadata tables/views (e.g. security views)
• PDW distribution info as extended properties in SQL Server metadata
• Existing means and technology for persisting metadata
• Improved 3rd party tool compatibility (BI, ETL)
Theme: SQL Server Compatibility
Support for SQL Server (Native) Client
[Diagram: SQL Server Clients (ADO.NET, ODBC, OLE-DB, JDBC) connect via TDS to Server: 10.217.165.13, 17001; SQL PDW Clients (ODBC, OLE-DB, ADO.NET) connect via SequeLink to Server: 10.217.165.13, 17000]
Goal
• ‘Look’ just like a normal SQL Server
• Better integration with other BI tools
Functionality
• Use existing SQL Server drivers to connect to SQL Server PDW
• Implement SQL Server TDS protocol
• Named Parameter support
• SQLCMD connectivity to PDW
Benefits
• Use known tools and proven technology stack
• Existing SQL Server ‘eco-system’
• 2x performance improvement for return operations
• 5x reduction of connection time
Theme: SQL Server Compatibility
Stored Procedure Support (Subset)
Goal
• Support common scenarios of code encapsulation and reuse in Reporting and ETL
Functionality
• System and user-defined stored procedures
• Invocation using RPC or EXECUTE
• Control flow logic, input parameters
Benefits
• Enables common logic re-use
• Big impact for Reporting Services scenarios
• Allows porting existing scripts
• Increases compatibility with SQL Server
Syntax
CREATE { PROC | PROCEDURE } [dbo.]procedure_name [ { @parameter data_type } [ = default ] ] [ ,...n ] AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;]
ALTER { PROC | PROCEDURE } [dbo.]procedure_name [ { @parameter data_type } [ = default ] ] [ ,...n ] AS { [ BEGIN ] sql_statement [;] [ ...n ] [ END ] } [;]
DROP { PROC | PROCEDURE } { [dbo.]procedure_name } [;]
[ { EXEC | EXECUTE } ] { { [database_name.][schema_name.]procedure_name } [{ value | @variable }] [ ,...n ] } [;]
{ EXEC | EXECUTE } ( { @string_variable | [ N ]'tsql_string' } [ + ...n ] ) [;]
Unsupported Functionality
• Stored proc nesting
• Output params
• Return
• Try-Catch
Theme: SQL Server Compatibility
Collations
Goal
• Support local and international data
Functionality
• Fixed server level collation
• User-defined column level collation
• Supporting all Windows collations
• Allow COLLATE clauses in queries and DML
Benefits
• Store all the data in PDW with additional querying flexibility
• Existing T-SQL DDL and query scripts
• SQL Server alignment and functionality
Syntax
CREATE TABLE T ( c1 varchar(3) COLLATE traditional_Spanish_ci_ai, c2 varchar(10) COLLATE … )
SELECT c1 COLLATE Latin1_General_Bin2 FROM T
SELECT * FROM T ORDER BY c1 COLLATE Latin1_General_Bin2
Unsupported Functionality
• Cannot specify DB collation during DB creation
• Cannot alter column collations for existing tables
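To make the difference between a `_ci_ai` collation and a `_Bin2` collation concrete, the following Python sketch contrasts a case-insensitive, accent-insensitive comparison key with a plain binary ordering. It is a rough approximation for illustration: real Windows collations apply many more rules (for example language-specific letter ordering), and both helper functions are hypothetical.

```python
import unicodedata

def ci_ai_key(text: str) -> str:
    """Rough case-insensitive, accent-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.casefold()

def binary_key(text: str) -> bytes:
    """Byte-wise ordering: uppercase ASCII sorts before lowercase, accented letters last."""
    return text.encode("utf-8")

names = ["b", "a", "B", "\u00e1"]  # "\u00e1" is a precomposed a-acute

binary_order = sorted(names, key=binary_key)
ci_ai_order = sorted(names, key=ci_ai_key)  # ties keep their original relative order
```

Under the `ci_ai_key` ordering, "a" and the accented "a" compare as equal, as do "b" and "B", whereas the binary ordering separates all four values.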
Theme: Improved Integration
SQL Server PDW Connectors
Connector for Hadoop
• Bi-directional (import/export) interface between MSFT Hadoop and PDW
• Delimited file support
• Adapter uses existing PDW tools (bulk loader, dwsql)
• Low cost solution that handles all the data: structured and unstructured
• Additional agility, flexibility and choice
Connector for Informatica
• Connector providing PDW source and target (mappings, transformations)
• Informatica uses PDW bulk loader for fast loads
Leverage existing toolset and knowledge
Connector for Business Objects
PDW Retail POS Workload
Original Customer SMP solution vs. PDW AU3 (with cost-based query optimizer)
[Figure: bar chart of elapsed time in seconds (axis 0 to 1600) for queries Q1 to Q7, comparing the old SMP POS ODS with AU3]
Customer Successes
How are customers using PDW & BI?
CUSTOMER EXAMPLE: Stock Exchange in the US
Data Volume
• 80 TB data warehouse analyzing data from exchanges
• Existing system based on SQL SMP farm: 2 different clusters of 6 servers each
Requirement
• Linear scalability with additional hardware
• Support hourly loads with SSIS – 300 GB/day
• BI Integration: SSRS, SSAS and PowerPivot
AU3 Feedback
• SP and increased T-SQL support was great
• Migrating SMP SSRS to PDW was painless
• 142x for scan heavy queries & no summary tables
• Enabled queries that do not run on existing system
[Diagram labels: Operational DB’s, ETL, PDW, Portal, Reports, Dashboards, Scorecards]
Customer Successes – cont’d
How are customers using PDW & BI?
CUSTOMER EXAMPLE: Major Retailer in the US
Data Volume
• 36 TB data warehouse analyzing data from transactional and clickstream sources
• Business need to expand to a 7 year data window (currently 1 year of data)
Requirements
• Scalability: growing data volume does not affect performance
• Performance and ad-hoc analysis for interactive querying by users
• BI Integration with Microsoft BI stack: SSAS and SSRS
AU3 Feedback
• SSAS cubes worked ‘out-of-box’
• Performance an order of magnitude faster than existing system (~30x on an expanded data set)
[Diagram labels: PDW, Nielsen, OLTP system, SSAS, Click-Stream]
Role of PDW within the BI stack
[Diagram labels: PDW; three data marts (DM), each with SSAS / SSRS; 3rd party BI; links labeled Infiniband and GBit link]
PDW role as fast ‘data hub’
• Fast and parallel feeding of data marts (DMs) via Infiniband: CREATE REMOTE TABLE AS SELECT
• Aggregation abilities avoid ETL overhead in existing systems
• No need for indexes
• No need to maintain indexed/materialized views (summary tables)
SSAS with SQL Server PDW
Understanding the differences compared to ‘SMP world’
Specific to the nature of large data
• Parallel cube processing/deployment has its limits
• Be cautious about parallel loads of SSAS: check query timeout settings
• Query design is crucial: only include required data
• BI tools were traditionally not designed for handling huge amounts of data
Specific to PDW
• PDW does not support foreign key constraints
• Shared nothing model requires careful data design and retrieval planning
• Design cubes for parallel processing – via MOLAP & ROLAP storage model
Demo: PowerPivot with SQL Server PDW
… just like any other SQL Server
Supported Third Party BI Solutions
AU3 T-SQL compatibility allows for common access for multiple tools
Current support on PDW drivers includes
• MicroStrategy
• SAP BusinessObjects
• Informatica
Other tools have ‘mixed experience’
• Cognos support requires: CURRENT_TIMESTAMP, @@DATEFIRST, SET OPTION …
• Core connectivity enhancements planned for the next 2 releases
New Challenges for Business Analytics
• Huge amount of data born ‘unstructured’
• Increasing demand for (near) real-time business analytics
• Pre-filtering of important from less relevant raw data required
Applications
• Sensor networks & RFID
• Social networks & Mobile Apps
• Biological & Genomics
[Diagram: Sensor/RFID Data; Blogs, Docs; and Web Data flow into HADOOP, which is labeled Fast ETL processing, Active Archive, Fast Refinery and Cost-Optimal storage]
Hadoop as a Platform Solution
In the context of ETL, BI, and DW
Platform to accelerate ETL processes (not competing with current ETL software tools!)
Flexible and fast development of ‘hand-written’ refining requests of raw data
Active & cost effective data archive to let (historical) data ‘live forever’
Co-existence with a relational DW
Importing HDFS data into PDW for advanced BI
[Diagram labels: Sensor/RFID Data; Blogs, Docs; Web Data; HADOOP; SQOOP; SQL Server PDW; Interactive BI/Data Visualization; user roles Application Programmers, DBMS Admin and Power BI Users]
Hadoop - PDW Integration via SQOOP (export)
[Diagram: on the Linux/Hadoop side, HDFS, the PDW Hadoop Connector and a PDW-configuration file; on the Windows/PDW side, the Landing Zone with an FTP Server and a Telnet Server, the Control Node, and Compute Nodes 1 … 8]
1. SQOOP export with source (HDFS path) & target (PDW DB & table)
2. Read HDFS data via mappers
3. Copies incoming data on Landing Zone
4. Invokes ‘DWLoader’
5. (marked in the diagram; no caption survives in this transcript)
Demo: Hadoop Sqoop Connector with SQL Server PDW
… integrating unstructured data into your end-to-end DW/BI solution
SQL Server PDW Roadmap
What is coming next?
[Timeline: CALENDAR YEAR 2011 and CALENDAR YEAR 2012, by quarter]
Appliance Update 1 (Shipped)
• Improved node manageability
• Better performance and reduced overhead
• OEM requests
Appliance Update 2 (Shipped)
• Programmability: batches, control flow, variables
• Temp tables
• QDR infiniband switch
• Onboard Dell
Appliance Update 3 (Shipped)
• Cost based optimizer
• Native SQL Server drivers, including JDBC
• Collations
• More expressive query language
• Data Movement Services performance
• SCOM pack
• Stored procedures (subset)
• Half-rack
• 3rd party integration (Informatica, MicroStrategy, Business Objects, HADOOP)
V-Next (In Review)
• Columnar store index
• Stored procedures
• Integrated Authentication
• PowerView integration
• Workload management
• LZ/BU redundancy
• Windows 8
• SQL Server 2012
• Hardware refresh
Session Objectives
• Provide an overview of SQL Server PDW
• Introduce PDW AU3 and share details regarding the new features and their impact on BI scenarios
Key Takeaways
• PDW is the SQL Server DW Appliance for 10s to 100s of TB
• AU3 enables you to use your existing BI solutions on Microsoft & 3rd Party BI Tools
• Expect at least 5x performance improvements over PDW AU2; specific workloads can see much more
Related Content
DBI209 – Big Data, Big Deal
Lots of BI Tool Specific Related Sessions (PowerPivot, Analysis services, Etc.)
Breakthrough Insights: Big Data Analytics & Data Warehousing Demo Station
PDW Deep Dive Session Online from TechEd 2010
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.