TRANSCRIPT
Data Integration at Microsoft: Technologies and Solution Patterns
Jeff Bernhardt
SSIS Product Unit Manager
Microsoft Corporation
SESSION CODE: BIE202
An Ancient Story from India …
A Modern Story from Redmond…
Integration Services
BizTalk Server
SQL Server Replication
SQL Server Service Broker
Others …
Data Integration: Two Globs of Ideas, Why and How
Problem/Solution types
Technology Choices
Business Intelligence
Data Warehousing
Operational Consistency
Migration
Consolidation
Master Data
B2B
ETL – Bulk Move
Federated Views
Message Oriented
CDC
Replication / Synchronization
Streams - CEP
Data Management
Data Quality
SOA
DI Problem Types: Building a DW and Business Intelligence
My ODS or OLTP System
My DW
ETL
Reports
• Alter the shape
  • Create a Star Schema (denormalized for analysis queries)
  • Surrogate Keys (in place of business keys)
  • Pre-Aggregations (to support some types of reporting)
• Track History
  • Slowly Changing Dimensions (history of entities)
  • Manage Partitions (once a month, roll up details and archive)
• Take changes from the store
  • React to Inserts/Updates/Deletes
  • Could be a “full refresh” or incremental
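The "Surrogate Keys" and "Slowly Changing Dimensions" bullets can be illustrated with a small sketch. This is only one common way to track history (often called a Type 2 change), not necessarily what SSIS does internally. The names DimRow, apply_change, and the city attribute are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DimRow:
    surrogate_key: int          # warehouse-generated key, independent of the source
    business_key: str           # key as it appears in the source system
    city: str                   # the attribute whose history is tracked
    valid_from: date
    valid_to: Optional[date]    # None marks the current version


def apply_change(dim: list[DimRow], business_key: str, city: str, as_of: date) -> None:
    """Apply one source change: expire the current row, then append a new version."""
    current = next(
        (r for r in dim if r.business_key == business_key and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.city == city:
            return              # nothing changed, history stays as it is
        current.valid_to = as_of
    next_key = max((r.surrogate_key for r in dim), default=0) + 1
    dim.append(DimRow(next_key, business_key, city, as_of, None))


dim: list[DimRow] = []
apply_change(dim, "CUST-7", "Redmond", date(2010, 1, 1))
apply_change(dim, "CUST-7", "Seattle", date(2010, 6, 1))
apply_change(dim, "CUST-7", "Seattle", date(2010, 7, 1))  # no change, so no new row
```

Fact rows would reference surrogate_key rather than business_key, so a report can join to the version of the customer that was current when the fact occurred.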
Text Files
DI Problem Types: Data Consistency Between Applications
Old Accounts Receivable on SAP
New Custom AR system in SQL
Create Consistency
• A long running ‘bridge’
  • Existing systems will be left in place and kept in sync.
  • Reacts to changes in either system.
  • Needs a way to react to changes or messages to minimize the tax on the application systems
• The systems are different
  • Often different back ends.
  • Match schemas, tables, columns
  • Consistent data domains (like keys)
  • Detect and resolve duplicates
  • Create a consistent level of granularity
    • Aggregate
    • Allocate
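To make "Aggregate" and "Allocate" concrete: when two systems store the same measure at different grains, the bridge either rolls fine-grained rows up or spreads a coarse-grained amount down. The sketch below assumes date strings in YYYY-MM-DD form and a proportional allocation rule. Both are illustrative assumptions, and the function names are hypothetical.

```python
from collections import defaultdict


def aggregate_to_month(daily: dict[str, float]) -> dict[str, float]:
    """Roll daily amounts (keyed 'YYYY-MM-DD') up to monthly totals (keyed 'YYYY-MM')."""
    monthly: dict[str, float] = defaultdict(float)
    for day, amount in daily.items():
        monthly[day[:7]] += amount
    return dict(monthly)


def allocate_by_weight(total: float, weights: dict[str, float]) -> dict[str, float]:
    """Spread one coarse-grained amount across finer-grained keys in proportion to weights."""
    weight_sum = sum(weights.values())
    return {key: total * w / weight_sum for key, w in weights.items()}


daily = {"2010-06-01": 100.0, "2010-06-02": 50.0, "2010-07-01": 25.0}
monthly = aggregate_to_month(daily)

# A single 300.0 amount split across two regions, weighted 1 to 2.
shares = allocate_by_weight(300.0, {"north": 1.0, "south": 2.0})
```

Aggregation loses detail but is exact. Allocation invents detail, so the weighting rule is a business decision that both systems have to agree on.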
DI Problem Types: Migration or Consolidation
Old Accounts Receivable on SAP
New Custom AR system in SQL
Transfer all data and map the shape
• Systems or companies are merged or acquired.
  • Bring the data together into the “new” place.
• An integration system is designed, built, and tested to minimize downtime for the old system and make one smooth transition.
  • Match schemas
  • Consistent data domains (like keys)
  • Detect and resolve duplicates
• May create a long running ‘bridge’ while the systems settle.
Once the design is set and tested, execute this
DI Problem Types: Master Data Management
Customers (Accounting)
Customers (Sales)
• Creating ‘One Version of the Truth’
  • Data residing in many sources where each source schema is fixed but different. Combined into one store with a consistent schema
  • Pivot / Unpivot
  • Type and domain mapping
  • Key generation
• Ensure quality
  • Remove duplicates
  • Provide missing data
  • Hard matching to find duplicates
• Bulk update and trickle changes
  • Changes to central store delivered back to operational system
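A minimal sketch of "Key generation", "Hard matching to find duplicates", and "Provide missing data", assuming customers are matched on a normalized name plus postal code. The matching rule, the field names, and the sample companies are illustrative assumptions. A real master data store would use richer rules.

```python
import re


def match_key(name: str, postal_code: str) -> tuple[str, str]:
    """Normalize the fields used for hard (exact) matching."""
    clean_name = re.sub(r"[^a-z0-9]", "", name.lower())
    return clean_name, postal_code.strip()


def consolidate(sources: dict[str, list[dict]]) -> list[dict]:
    """Merge customer rows from several sources into one list with generated master keys."""
    master: dict[tuple[str, str], dict] = {}
    for source_name, rows in sources.items():
        for row in rows:
            key = match_key(row["name"], row["postal_code"])
            if key not in master:
                master[key] = {
                    "master_id": len(master) + 1,      # generated key for the central store
                    "name": row["name"],
                    "postal_code": row["postal_code"].strip(),
                    "phone": None,
                    "seen_in": [],
                }
            record = master[key]
            record["seen_in"].append(source_name)
            # Provide missing data: keep the first non-empty phone any source offers.
            if record["phone"] is None and row.get("phone"):
                record["phone"] = row["phone"]
    return list(master.values())


sources = {
    "Accounting": [{"name": "Contoso Ltd.", "postal_code": "98052", "phone": None}],
    "Sales": [
        {"name": "CONTOSO LTD", "postal_code": " 98052", "phone": "555-0100"},
        {"name": "Fabrikam", "postal_code": "10001", "phone": None},
    ],
}
customers = consolidate(sources)
```

The seen_in list records which operational systems hold a copy, which is what a later "trickle changes back" step would need in order to know where to deliver updates.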
Customers (Marketing)
Customers (Support)
Customers
WeShip
PartsAreUs
EZ Buy
DI Problem Types: B2B (Inter-Enterprise Data Sharing)
Orders
Supplier’s System
• Contracts
• SLAs
• Standardized formats
• Long running transactions or business process
• Loosely coupled
• Coordination, message passing
• A very specific perspective on Application Integration.
Order Fulfillment System
Shipper’s System
Internet / WAN
Organizing the Problem/Solution Ideas
Data Warehouse and Business Intelligence
Data Consistency Between Applications
Data System Migration and Consolidation
Master Data Management
Inter Enterprise Data Acquisition and Sharing
Point A
Point B
Technology Types: Bulk Movement
RDBMS
ETL
RDBMS
• Move a sizeable set of rows from point A to point B
• Often
  • Part of a scheduled process
  • Transform the shape of the data being moved
  • Combine many sources or split into many destinations
• Two flavors
  • ETL (Extract Transform Load)
    • SSIS
    • Ascential DataStage (IBM)
  • ELT (Extract Load Transform)
    • Oracle Warehouse Builder
    • Bulk Insert
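The difference between the two flavors is where the transform runs. The sketch below uses Python's built-in sqlite3 module as a stand-in for both RDBMS endpoints. The table and column names are hypothetical. In the ETL path the integration layer reshapes rows before loading. In the ELT path raw rows are loaded into a staging table and the destination engine reshapes them with SQL.

```python
import sqlite3

source_rows = [("widget", "12.50"), ("gadget", "7.25")]   # prices arrive as text

# ETL: transform in the integration layer, then load the finished shape.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE product (name TEXT, price_cents INTEGER)")
transformed = [(name.upper(), round(float(price) * 100)) for name, price in source_rows]
etl_db.executemany("INSERT INTO product VALUES (?, ?)", transformed)

# ELT: load the raw rows first, then let the destination engine transform them.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE staging (name TEXT, price TEXT)")
elt_db.executemany("INSERT INTO staging VALUES (?, ?)", source_rows)
elt_db.execute(
    "CREATE TABLE product AS "
    "SELECT UPPER(name) AS name, "
    "CAST(ROUND(CAST(price AS REAL) * 100) AS INTEGER) AS price_cents "
    "FROM staging"
)

query = "SELECT name, price_cents FROM product ORDER BY name"
etl_result = etl_db.execute(query).fetchall()
elt_result = elt_db.execute(query).fetchall()
```

Both paths end with the same table. The choice is mostly about where the compute sits and whether the raw data should land in the destination first.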
Text Files
XML
ELT
Text Files
XML
C
A
Technology Types: Message Oriented Movement
RDBMS
• Central ‘Coordinator’
  • Guarantees receipt and delivery of messages.
  • Components are ‘at rest’ until activated by the coordinator or an external event.
  • Data delivered in packets along with the message.
• Terms that might fit in this category:
  • CDC
  • Trickle Feed
  • SOA
  • Message Bus
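A toy sketch of the coordinator idea, assuming "guaranteed delivery" means a message stays queued and is retried until the receiving component acknowledges it. The Coordinator class, its methods, and the retry limit are hypothetical and do not represent the API of any product named in this session.

```python
from collections import deque
from typing import Callable


class Coordinator:
    """Toy coordinator: holds each message until the target handler acknowledges it."""

    def __init__(self) -> None:
        self.handlers: dict[str, Callable[[dict], bool]] = {}
        self.pending: deque[dict] = deque()

    def register(self, name: str, handler: Callable[[dict], bool]) -> None:
        self.handlers[name] = handler

    def send(self, sender: str, to: str, message: str, data: dict) -> None:
        # The data travels in the same packet as the message.
        self.pending.append({"from": sender, "to": to, "message": message, "data": data})

    def dispatch(self, max_attempts: int = 3) -> list[dict]:
        """Deliver pending messages, requeue failures, return those that never succeeded."""
        dead: list[dict] = []
        attempts: dict[int, int] = {}
        while self.pending:
            msg = self.pending.popleft()
            attempts[id(msg)] = attempts.get(id(msg), 0) + 1
            acknowledged = self.handlers[msg["to"]](msg)   # the component wakes up only now
            if not acknowledged:
                if attempts[id(msg)] < max_attempts:
                    self.pending.append(msg)
                else:
                    dead.append(msg)
        return dead


received: list[str] = []
calls = {"B": 0}


def component_b(msg: dict) -> bool:
    calls["B"] += 1
    if calls["B"] == 1:
        return False            # simulate a transient failure on the first attempt
    received.append(msg["message"])
    return True


bus = Coordinator()
bus.register("B", component_b)
bus.send("A", "B", "Purchase", {"order_id": 1})
undelivered = bus.dispatch()
```

Component B does nothing until the coordinator calls it, which is the "at rest until activated" behavior from the slide.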
Coordinator
B
D
XML Text Files
Line Of Business
Application
Event
From To Message
D C File Date
C A Insert
A B Purchase
Technology Types: Replication and Synchronization
• Maintaining equivalent copies of data in different locations
  • One master, many slaves
  • Multi-master
  • High Availability (live backups)
• Similarity between systems
  • Most often table copies on the same brand of RDBMS
  • Heterogeneous possible
    • Attunity, GoldenGate, etc.
  • Transformations: Little to none
• Terms that might fit in this category:
  • CDC, Log mining
  • Merge Replication
  • Checksum tables
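"Checksum tables" can be sketched as follows: compute a checksum per row on each side, compare them, and move only the rows that differ. This is a simplified one-master illustration with in-memory dictionaries standing in for tables. The function names are hypothetical, and it is not how SQL Server replication is implemented.

```python
import hashlib


def row_checksum(row: dict) -> str:
    """Stable checksum over a row's column values (columns sorted by name)."""
    payload = "|".join(f"{col}={row[col]}" for col in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def sync(master: dict[int, dict], replica: dict[int, dict]) -> dict[str, list[int]]:
    """Make the replica equal to the master, touching only rows whose checksums differ."""
    changes: dict[str, list[int]] = {"inserted": [], "updated": [], "deleted": []}
    replica_sums = {key: row_checksum(row) for key, row in replica.items()}
    for key, row in master.items():
        if key not in replica_sums:
            changes["inserted"].append(key)
            replica[key] = dict(row)
        elif replica_sums[key] != row_checksum(row):
            changes["updated"].append(key)
            replica[key] = dict(row)
    for key in list(replica):
        if key not in master:
            changes["deleted"].append(key)
            del replica[key]
    return changes


master = {
    1: {"name": "bolt", "qty": 10},
    2: {"name": "nut", "qty": 5},
    3: {"name": "gear", "qty": 1},
}
replica = {
    1: {"name": "bolt", "qty": 10},
    2: {"name": "nut", "qty": 4},
    9: {"name": "old", "qty": 0},
}
changes = sync(master, replica)
```

Comparing checksums instead of full rows keeps the comparison traffic small, which matters when the copies sit in different locations.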
Repl / Sync Agent
From To Message
D C File Date
C A Insert
A B Purchase
Technology Types: Federated Views
• Answers queries directly from many source systems
• View Provider may:
  • Optimize and execute the combined query (Joins, etc.)
  • Push query parts down to the source
  • Provide unified security model
  • Provide unified metadata
  • Cache source data
  • Support heterogeneous sources
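A small sketch of a view provider that pushes query parts down to the sources and performs only the final join itself. Two in-memory SQLite databases stand in for the source systems. The schemas, the sample data, and the revenue_by_customer function are hypothetical.

```python
import sqlite3

# Two independent "source systems", each with its own schema and data.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customer (id INTEGER, name TEXT, region TEXT)")
crm.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Contoso", "West"), (2, "Fabrikam", "East"), (3, "Adatum", "West")],
)

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoice (customer_id INTEGER, amount REAL)")
billing.executemany(
    "INSERT INTO invoice VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0), (3, 20.0)],
)


def revenue_by_customer(region: str) -> list[tuple[str, float]]:
    """Answer one logical query from two sources without copying either table."""
    # Push the region filter down to the CRM source.
    customers = crm.execute(
        "SELECT id, name FROM customer WHERE region = ?", (region,)
    ).fetchall()
    if not customers:
        return []
    ids = [cid for cid, _ in customers]
    # Push the key filter and the aggregation down to the billing source.
    marks = ",".join("?" for _ in ids)
    totals = dict(
        billing.execute(
            f"SELECT customer_id, SUM(amount) FROM invoice "
            f"WHERE customer_id IN ({marks}) GROUP BY customer_id",
            ids,
        ).fetchall()
    )
    # The provider itself performs only the final join.
    return sorted((name, totals.get(cid, 0.0)) for cid, name in customers)


west = revenue_by_customer("West")
```

Only the filtered customer keys and the per-customer totals cross the wire, so neither source ships a whole table to the provider.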
View Provider
Reports
Event Processing
Technology Types: Stream Processing
Source
• Monitor a stream of data, create an event when
  • Temporal (time based) events occur
  • Running average or aggregate hits a limit
  • Interesting sequence of records is detected
• Also called CEP (Complex Event Processing)
• Different from the other Technology Types??? I can’t tell yet.
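The "running average hits a limit" case can be sketched with a fixed-size sliding window over the incoming readings. The window size, the limit, and the function name are illustrative assumptions. A real CEP engine would express this as a standing query rather than a loop.

```python
from collections import deque
from typing import Iterable, Iterator


def running_average_alerts(
    readings: Iterable[float], window: int, limit: float
) -> Iterator[tuple[int, float]]:
    """Yield (index, average) whenever the average of the last `window` readings exceeds `limit`."""
    recent: deque[float] = deque(maxlen=window)
    for index, value in enumerate(readings):
        recent.append(value)          # the oldest reading drops out automatically
        if len(recent) == window:
            average = sum(recent) / window
            if average > limit:
                yield index, average


alerts = list(running_average_alerts([10, 12, 11, 30, 40, 12, 9], window=3, limit=20))
```

The function is a generator, so it can consume an unbounded stream and emit events as they occur instead of waiting for the input to end.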
CEP Engine Destination
EventLogEvent
Technology Types: Data Management and Quality
• A collection of services common to most Data Integration solutions
• Shared semantic model
  • Metadata library
  • Manage hierarchies
  • Data artifact level security model
• Data Quality
  • Profile to understand
  • Merge to resolve duplicates
  • Find approximate matches
  • Test and monitor quality.
• Version management for data.
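Two of the Data Quality bullets, "Profile to understand" and "Find approximate matches", can be sketched with the Python standard library. The 0.85 similarity threshold and the function names are assumptions for illustration. Production matching usually combines several fields and tuned thresholds.

```python
from difflib import SequenceMatcher


def profile(values: list) -> dict:
    """Basic column profile: row count, missing values, distinct non-missing values."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "missing": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_approximate_duplicates(
    names: list[str], threshold: float = 0.85
) -> list[tuple[str, str]]:
    """Return every pair of names whose similarity meets the threshold."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if similarity(first, second) >= threshold:
                pairs.append((first, second))
    return pairs


column_profile = profile(["a", None, "a", "", "b"])
pairs = find_approximate_duplicates(["Jon Smith", "John Smith", "Jane Doe"])
```

The pairwise comparison is quadratic in the number of names, so larger data sets typically group candidates first (for example by postal code) before comparing.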
Organizing the Technology Choice Ideas
Bulk Movement
Message Oriented Movement
Replication and Synchronization
Federated Views
Data Management and Quality
Stream Processing (CEP)
Problems and Technologies
Data Warehouse and Business Intelligence
Data Consistency Between Applications
Data System Migration and Consolidation
Master Data Management
Inter Enterprise Data Acquisition and Sharing
Bulk Movement
Message Oriented Movement
Replication and Synchronization
Federated Views
Data Management and Quality
Stream Processing (CEP)
60%
10%
15%
15%
Microsoft’s Offerings
Data Warehouse and Business Intelligence
Data Consistency Between Applications
Data System Migration and Consolidation
Master Data Management
Inter Enterprise Data Acquisition and Sharing
Bulk Movement: SSIS
Message Oriented Movement: BizTalk, Service Broker
Replication and Synchronization: SQL Replication
Federated Views: Distributed Query
Data Management and Quality: Master Data Services
Stream Processing (CEP): StreamInsights
What Should You Use and When? It Depends On The Artifacts / Facets / Attributes / “the flavor”:
Artifacts that Select Data Integration Technologies
Artifact: Developer’s Mindset
Description: How does a developer approach building a solution or modeling their application?
Examples:
• “I just know SQL”
• Message Oriented vs. Sequential

Artifact: Application Pattern
Description: Which canonical application does the solution most resemble?
Examples:
• DW Fundamentals (SSIS)
• Business Orchestration (BizTalk)

Artifact: Latency
Description: The integrated data has some amount of “staleness” when compared to the sources.
Examples:
• Monthly / Weekly / Daily (SSIS)
• Hourly / Near real time (SI, DQ)

Artifact: Data Size
Description: Expected amount of data that will be processed in one transaction or integration event.
Examples:
• One record at a time
• 1 million records

Artifact: Data push or pull
Description: Is data pulled from sources (sources must respond to queries) by the integration process and then pushed to destinations? Is data “made available” by a source on its own schedule and pushed through the integration? Or is data pulled into a destination through the integration when the destination wants it?
Examples:
• Push
• Pull
Artifacts that Select Data Integration Technologies
Artifact: Topology
Description: Hub-spoke, etc. Middle-tier or other locations for integration engine. Availability (determines hub-spoke). Authority: Who is in charge? (who is master)
Examples:
• One machine drives a process
• Many masters
• Message orchestrator (BizTalk)

Artifact: Data Heterogeneity
Description: Need for heterogeneity of Sources / Destination.
Examples:
• SQL Server to SQL Server
• Oracle, SAP, Teradata, XML

Artifact: Conflict detection and resolution
Description: Integration problem has a need for detecting and resolving conflicting versions of the same records in different systems
Examples:
• None (SSIS)
• Merge Replication

Artifact: Data Integration or Movement
Description: Before data is delivered to its final destination, must it be combined with other data that comes from a different source, versus a need to simply move, transform and react to data from mostly one source?

Artifact: Data access patterns
Description: Ad hoc vs. known-in-advance. Are the access patterns hard coded into the solution and fixed at “development” time or are the access patterns determined at runtime via some flexible specification?
Examples:
• SQL is very flexible
• SSIS hard codes metadata
• BizTalk can change sources on the fly

Artifact: Data Shape
Description: “Point” (data about a single entity) vs. table-valued data access patterns vs. message content or event data.
Examples:
• Tables
• XML hierarchies
Artifacts that Select Data Integration Technologies
Artifact: Known vs. variable data formats
Description: Need for flexibility to changes in data shape. Should the mainline non-error case behavior expect to handle variant data formats?
Examples:
• SSIS Fixed structure
• BizTalk ‘Promoted’ properties
• SQL just adapts

Artifact: Complexity of transformation
Description: Need for complex transformation of data shape versus the simple data type conversions required by heterogeneity
Examples:
• Minor transform (Replication)
• XSL (BizTalk)

Artifact: Structured or unstructured data
Description: Working with unstructured documents, blob data, semi-structured XML / rigid XML, flexible / rigid file formats that must be parsed, or rectangular table data?
Examples:
• Structured Tables (SSIS)
• XML Messages (BizTalk)

Artifact: Supports per-user security
Description: If returning results to a user as if it were a data server, do the end user's credentials become part of a request and enable enforcement of heterogeneous security policies?
Examples:
• Dist Query enforces user.
• SSIS batch runs with job’s context.

Artifact: Recovery SLA
Description: What happens when nodes are down or disconnected, and what kind of recovery is required when connection is re-established? Are business processes “stopped” or “failed” when integration is delayed or incomplete?
Examples:
• SSIS has error handling
• Dist Queries just ‘Fail’
• BizTalk has long running transactions and auto-retry.

Artifact: Stream Processing
Description: Need to react to temporal or localized changes in a stream of records
Examples:
• The ‘Point’ of StreamInsights
• User built script in SSIS
A Brief Tour of the Microsoft Offerings
Integration Services
Service Broker
Replication
Distributed Query
BizTalk Server
StreamInsights
Integration Services (SSIS) - Overview
Move, Conform, Combine Data
• Text files, Oracle, SQL, SAP BW, Excel, etc.
• Merge, look-up, union
• Pivot, calculate, filter
Build a Data Warehouse
• Create slowly changing dimensions
• Pre aggregate
• Partition data
Coordinate Activities
• Send mail
• Loop over files
• Connect to FTP
Tool for ETL Developers
• Departmental and IT pros
• Special class of developer, might be able to write C# script.
• In BIDS (Visual Studio). Graphical Editor, Debugging
An Execution Environment
• Heads free automation of jobs
• Object model for embedded applications.
The I/E Wizard
• 1 time utility
• Load or export a file
• Movement of tables from one place to another
Integration Services – Problems / Technologies / Artifacts
• Constructing a Data Warehouse
• Migration / Consolidation
• Bulk Movement
• ETL
Artifacts that Select Data Integration Technologies
Artifact: Integration Services
Developer’s mindset: Sequential, some scripting
Application pattern: DW fundamentals
Latency: Hourly
Data size: Millions of rows
Topology: 1 machine drives
Heterogeneity: Files, XML, Access, Oracle, Teradata, etc.
Shape / Access: Rigid schema and access
Conflict resolution: None
Complex transform: Complex business logic, reshaping
Recovery: Custom error handling logic.
My ODS or OLTP System
My DW
SSIS
Text Files
Integration Services – Customer Solution
Data Warehouse (SQL Server)
Inventory Management (Oracle)
Staging DB
CRM (SQL Server)
Manufacturing Data (Flat files)
SSIS Package: Lookups, load facts and dimensions, surrogate key generation, …
SSIS Package: Lookups, slowly changing dimensions, address cleansing, …
SSIS Package: Data conversions, parsing, data quality, aggregations, …
Attunity CDC for Oracle
SQL Server Source
Flat File Source
Data Mart (Reporting and Analytics)
Operational Database (Shop Floor Application)
SSIS Package
SSIS Package
Service Broker - Overview
Distributed Applications
• Run asynchronously
• Communicate reliably
• Communicate securely
Loosely Coupled
• Every system has its own data managed and administered independently
• Only communicate via messages
• Transactions do not span
Messaging
• Specify message types and contracts
• A queue looks like a SQL Table. Routes connect queues
• Conversation is a persistent 2 way session of communication between two services
Part of SQL Server
• Single Install
• Unified programming, administration and security. Great if you love SQL.
• SQL Server benefits: Transactions, Backup, Mirroring
Database 2
Database 1
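The ideas "a queue looks like a SQL table", "message types and contracts", and transactional processing can be sketched without Service Broker itself. The example below is not Service Broker syntax. It is a Python and SQLite analogy in which the queue table, the CONTRACT dictionary, and the send and receive functions are all hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE queue (
           seq INTEGER PRIMARY KEY AUTOINCREMENT,
           conversation_id TEXT,
           message_type TEXT,
           body TEXT)"""
)

# Hypothetical contract: the message types each side of a conversation may send.
CONTRACT = {"initiator": {"OrderRequest"}, "target": {"OrderReply"}}


def send(role: str, conversation_id: str, message_type: str, body: str) -> None:
    """Enqueue a message if the contract allows this role to send this type."""
    if message_type not in CONTRACT[role]:
        raise ValueError(f"{role} may not send {message_type}")
    with db:   # commits on success, rolls back on error
        db.execute(
            "INSERT INTO queue (conversation_id, message_type, body) VALUES (?, ?, ?)",
            (conversation_id, message_type, body),
        )


def receive(process) -> bool:
    """Dequeue the oldest message and process it inside one transaction."""
    row = db.execute(
        "SELECT seq, conversation_id, message_type, body FROM queue ORDER BY seq LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    try:
        with db:
            db.execute("DELETE FROM queue WHERE seq = ?", (row[0],))
            process(row[1:])     # an exception here rolls the delete back
    except RuntimeError:
        return False
    return True


handled = []


def failing(message):
    raise RuntimeError("downstream unavailable")


send("initiator", "conv-1", "OrderRequest", "<order id='1'/>")
first_try = receive(failing)             # rolled back, the message stays queued
second_try = receive(handled.append)     # succeeds, the message is removed
remaining = db.execute("SELECT COUNT(*) FROM queue").fetchone()[0]
```

Because dequeue and processing share a transaction, a failure in processing puts the message back, which is the kind of benefit the "SQL Server benefits: Transactions" bullet points at.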
Service Broker – Problems / Technologies / Artifacts
• Consistency Between Applications
• Master Data Management (?)
• Message Oriented Movement
Artifacts that Select Data Integration Technologies
Artifact: Service Broker
Developer’s mindset: “I Love SQL!”
Application pattern: Data tier, loosely coupled
Latency: Near real time
Data size: Many small messages
Recovery: SQL Transactions
Heterogeneity: SQL Server to SQL Server
Shape / Access: Flexible
Complex transform: Minimal, data carried in messages
Queue A
Service A
conversation
Queue B
Service B
SSIS
Service Broker – Customer Solution
Bank, lost loan provisioning
Requires very fast processing and analysis of up to date data.
SourceTable
subset 1 (x rows)
subset 2 (x rows)
subset n (x rows)
server 1
server 2
server n
…… SSIS
Service Broker
ResultTable
32 cores
sproc
SQL Replication - Overview
Synchronized Tables
• SQL Tables
• Many copies in different databases
• Changes may originate in any database.
Key Scenarios
• Read Scale
• Reporting and Staging
• Geo Data Locality
• Branch Office
• Offline Sync
• EIM
Part of SQL Server
• Tables
• Stored Procedures
• Build a custom data tier application
Management and Configuration
• SQL Server Management Studio
SQL Replication – Problems / Technologies / Artifacts
• Data Warehousing
• Data consistency between Applications
• Migration / Consolidation
• Replication and Synchronization
Artifacts that Select Data Integration Technologies
Artifact: Replication
Developer’s mindset: SQL centric
Latency: Minutes
Data size: Changed records
Topology: Bi Directional, Many masters
Heterogeneity: Mostly SQL to SQL. Some support
Shape / Access: Rigid schema and access
Conflict resolution: Merge
Complex transform: Slight in the heterogeneous case
Reporting and Staging
Enterprise Information Management
SQL Replication – Customer Solution
Edcon
1000 branch offices
One way replication of catalog data from hub to spoke
Catalog downloads are partitioned with complex Dynamic and Join filtering – Catalog/Pricing data per store
Uses merge Replication for downloads
Subscribers located in each store - multi-user server databases
Uses Service Broker for uploads of transacted data – requires guaranteed in order delivery
Branch Office
Corporate Offices
LOB Systems
‘Central’
SSIS (daily)
‘Branch’
Online Terminal
Merge Replication
Transactional Replication
Service Broker
Distributed Query - Overview
Unified View of Data sources
• One SQL Query that joins/combines data from n remote servers.
• Consistent type system
• Consistent query grammar
Gateway to Remote Data
• Cannot move data
• Healthcare, Finance
• Privacy restrictions. Stored Procedures only access to data
• Augment restricted data
Federated Databases
• Ad-hoc BI. One time or infrequent use.
• Combine data from Microsoft eco-system (Access, Excel, SSAS)
SQL Features
• Linked Servers
• OPENROWSET, OPENQUERY, OPENDATASOURCE
Data Sources
• OLE DB as protocol
Query Optimizer
• Rowset Remoting
• Query Expression Remoting
Distributed Query – Problems / Technologies / Artifacts
• Business Intelligence
• Master Data Management
• Federated Views
Artifacts that Select Data Integration Technologies
Artifact: Distributed Query
Developer’s mindset: SQL
Application pattern: Ad-Hoc or infrequent reports
Latency: none
Data size: Quickly remotable tables
Topology: Hub and spoke
Heterogeneity: Some via OLEDB. Mostly SQL to SQL
Complex transform: Through SQL operators
SQL
SQL Access
BizTalk Server - Overview
Messaging
• Connecting Disparate Systems Across Various Boundaries
Orchestration
• Automating Business Processes
Heterogeneous Data
• LOB, Legacy, Technologies, RDBMS
Business Activity Monitoring
• Providing Process Visibility and Analytics
B2B
• Connecting Business Partners
Server
• Manage Business Rules
• Hosts and runs ‘Orchestrations’
• Message Delivery
• Long Running Transactions
Messages
• XML, XPath, XSLT
BizTalk Server – Problems / Technologies / Artifacts
• B2B
• Data Consistency Between Applications
• Message Oriented Movement
Artifacts that Select Data Integration Technologies
Artifact: BizTalk Server
Developer’s mindset: Message Oriented. SOA
Application pattern: Orchestrated Business Process
Latency: Minutes / Seconds
Data size: Message Contents. 100KB
Heterogeneity: Highly mixed sources of messages
Shape: XML
Complex transform: XSLT on Message content.
LOB App
OLTP
XML Docs
Orchestration Logic
Bus
BizTalk Server – Customer Solution
LOB App
OLTP
XML Docs
Orchestration Logic
Bus
Emerging Practice: Loose Coupled Batch
BizTalk coordinates process
Some traditional data flow through messages
Many SSIS Packages with complicated relationships and dependencies.
Messages control activation of SSIS pieces.
Messages deliver intermediate results or pointers to batch data.
Scale out SSIS execution
SSIS Package
DW
SSIS Package
StreamInsights - Overview
CEP Engine
• Monitor stream of data from database query, hardware device, internet feed, etc.
Captures Events
• Point in time event
• Fixed duration events with a sliding window
• Interesting sequence of events
Rich Query Semantics
• Grouping and aggregation with windows
• Correlate event streams
• Absence of activity or too much activity.
• Calculations, filters, top-K
.Net integration
• Ideal for custom applications
• LINQ Syntax for stream semantics
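"Grouping and aggregation with windows", "top-K", and "absence of activity" can be sketched in plain Python. This is not the LINQ syntax the product exposes. It assumes integer timestamps and fixed, non-overlapping (tumbling) windows, and the function names are hypothetical.

```python
from collections import Counter


def tumbling_counts(events: list[tuple[int, str]], width: int) -> dict[int, Counter]:
    """Group (timestamp, key) events into fixed, non-overlapping windows and count per key."""
    windows: dict[int, Counter] = {}
    for timestamp, key in events:
        start = (timestamp // width) * width     # start time of the window this event falls in
        windows.setdefault(start, Counter())[key] += 1
    return windows


def silent_windows(
    events: list[tuple[int, str]], width: int, start: int, end: int
) -> list[int]:
    """Return window start times in [start, end) that contain no events."""
    active = set(tumbling_counts(events, width))
    return [t for t in range(start, end, width) if t not in active]


events = [(1, "sensor-a"), (3, "sensor-b"), (4, "sensor-a"), (12, "sensor-a"), (37, "sensor-b")]
counts = tumbling_counts(events, width=10)
top = counts[0].most_common(1)                   # top-K with K = 1 for the first window
gaps = silent_windows(events, width=10, start=0, end=40)
```

Detecting absence needs the expected time range as an input, because a window with no events leaves no trace in the stream itself.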
StreamInsights – Problems / Technologies / Artifacts
• Business Intelligence
• Data Warehousing
• Message Oriented Movement
• Stream Processing
Artifacts that Select Data Integration Technologies
Artifact: StreamInsights
Developer’s mindset: SQL and .Net
Application pattern: Stream Processing
Data Sources, Operations, Assets, Feeds, Sensors, Devices
Operational Data Store & Archive
CEP Engine
f(x) g(y) f'(x) h(x,y)
Results
Input Data Streams
SSIS
StreamInsights – Customer Solution
Fraud
Switch Logs
Fact Processing
Telco
Detect Fraud callbacks
During regular processing of data warehouse facts
Custom SSIS Component encapsulates the CEP engine.
Events detected and sent for follow-up
DW
Switch Logs
Switch Logs
StreamInsight Component
A Bigger Picture?
© 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.