big data & data management - glasspaper data and data management...using analytic engines like...
TRANSCRIPT
Big Data & Data Management
A Monday morning chat about
Azure Data Lake, Azure SQL Data Warehouse, Azure HDInsight, and Azure Data Factory
https://azure.microsoft.com/en-us/services/data-factory/
A managed cloud service for building & operating data pipelines (a.k.a. data flows)
1. Orchestrate, monitor & schedule
• compose data processing, storage & movement services (on premises & cloud)
2. Automatic infrastructure mgmt
• combine pipeline intent w/ resource allocation & mgmt
• data movement as a service (global footprint & on premises)
3. Single pane of glass
• one place to manage your network of data flows
[Diagram: example pipeline in four stages: Data Sources, Ingest, Transform & Analyze, Publish. Labels: Call Log Files, Customer Table, Customer Call Details, Customers Likely to Churn, Customer Churn Table.]
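The call-log example above can be sketched as a small chain of dependent steps. This is a plain-Python illustration of the ingest, transform, publish idea only; it does not use the Data Factory API, and the function names, the sample records, and the churn rule (more than two support calls) are all hypothetical.

```python
from collections import Counter


def ingest_call_logs():
    # Hypothetical raw call-log records (customer id, call type).
    return [
        {"customer_id": 1, "type": "support"},
        {"customer_id": 1, "type": "support"},
        {"customer_id": 1, "type": "support"},
        {"customer_id": 2, "type": "sales"},
    ]


def ingest_customers():
    # Hypothetical customer table: id -> name.
    return {1: "Alice", 2: "Bob"}


def transform(call_logs, customers):
    # Join call logs with customers and count support calls per customer.
    support_calls = Counter(
        row["customer_id"] for row in call_logs if row["type"] == "support"
    )
    return [
        {"customer": name, "support_calls": support_calls.get(cid, 0)}
        for cid, name in customers.items()
    ]


def publish(call_details, threshold=2):
    # Made-up rule: flag customers with more than `threshold` support calls.
    return [
        row["customer"]
        for row in call_details
        if row["support_calls"] > threshold
    ]


def run_pipeline():
    # Each stage consumes the output of the previous one.
    call_details = transform(ingest_call_logs(), ingest_customers())
    return publish(call_details)


likely_to_churn = run_pipeline()
print(likely_to_churn)
```

In a managed service the scheduling, retries, and data movement between stages would be handled for you; the sketch only shows how the stages depend on each other.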
https://azure.microsoft.com/en-us/services/sql-data-warehouse/
Broad SQL Server Partner Ecosystem
+ Leverage Azure ML, HDInsight, PowerBI, ADF, and more.
+ Industry’s broadest ecosystem of DW partners, including Tableau, Informatica, Attunity, and SAP.
Streamlined deployment with Azure Portal.
Deep tool integration with top partners including:
• Single-click configuration
• Optimized data movement
• Logical pushdown
Azure SQL DW
Azure ML
Azure Event Hub
Azure HDInsight
https://azure.microsoft.com/en-us/services/data-lake-store/
Implement Data Warehouse
Physical Design
ETL Development
Reporting & Analytics Development
Install and Tune
Reporting & Analytics Design
Dimension Modelling
ETL Design
Setup Infrastructure
Understand Corporate Strategy
Data sources
ETL
BI and analytic
Data warehouse
Gather Requirements
Business Requirements
Technical Requirements
Ingest: regardless of requirements
Store: in native format without schema definition
Analyze: using analytic engines like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
Azure Data Lake: Store
Distributed, parallel file system in the cloud
Performance-tuned and optimized for analytics
No fixed size limits
Stores all data types
Highly available with local & geo-redundant storage
WebHDFS REST API
Supported by leading Hadoop distros
Role-based security
Low latency and high throughput workloads
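As a rough illustration of the WebHDFS REST API listed on this slide: WebHDFS addresses files under a `/webhdfs/v1/` path prefix and selects the operation with an `op` query parameter (for example `LISTSTATUS` or `OPEN`). The sketch below only builds such URLs; the helper name `webhdfs_url` and the account name `example` are hypothetical, and a real request would also need authentication, which is omitted here.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, **params):
    """Build a WebHDFS-style URL for `path` on `host` (no request is sent)."""
    # The operation goes in the `op` query parameter; extra parameters
    # such as offset/length for OPEN are appended after it.
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact and escapes other unsafe characters.
    return f"https://{host}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# Hypothetical account and file path, for illustration only.
list_url = webhdfs_url("example.azuredatalakestore.net", "/logs/2015", "LISTSTATUS")
read_url = webhdfs_url(
    "example.azuredatalakestore.net",
    "/logs/2015/calls.csv",
    "OPEN",
    offset=0,
    length=1024,
)
print(list_url)
print(read_url)
```

Because the interface is the same one HDFS exposes, tools that already speak WebHDFS can in principle point at the store without a custom client.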
YARN
HDFS
HDInsight
Analytics Service
Store
U-SQL
Clickstream
Sensors
Video
Social
Web
Devices
Relational
Applications
Gather data from all sources
Store indefinitely
Analyze
See results
Iterate
New big data thinking: All data has value
All data has potential value
Data hoarding
No defined schema—stored in native format
Schema is imposed and transformations are done at query time (schema-on-read).
Apps and users interpret the data as they see fit
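Schema-on-read, as described above, can be sketched in a few lines. The raw records are kept exactly as they arrived, and each consumer applies its own schema when it reads them. The sample events, field names, and the `read_with_schema` helper are hypothetical and only illustrate the idea.

```python
import json

# Raw events kept exactly as they arrived (native format, no schema enforced).
raw_lines = [
    '{"device": "d1", "temp": "21.5", "ts": "2015-09-10T08:00:00"}',
    '{"device": "d2", "temp": "19.0"}',
    '{"user": "u7", "page": "/home"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    Keeps only records that contain every field named in `schema`
    and casts each value with the schema's type function.
    """
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two consumers interpret the same stored data in different ways.
temperature_schema = {"device": str, "temp": float}
clickstream_schema = {"user": str, "page": str}

temperatures = list(read_with_schema(raw_lines, temperature_schema))
clicks = list(read_with_schema(raw_lines, clickstream_schema))
print(temperatures)
print(clicks)
```

Nothing was rejected or reshaped at write time; records that do not fit one reader's schema are simply skipped by that reader and remain available to others.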
Data Lake Store: Technical Requirements
Secure: Must be highly secure to prevent unauthorized access (especially as all data is in one place).
Native format: Must permit data to be stored in its ‘native format’ to track lineage & for data provenance.
Low latency: Must have low latency for high-frequency operations.
Multiple analytic frameworks: Must support multiple analytic frameworks (batch, real-time, streaming, ML, etc.). No one analytic framework can work for all data and all types of analysis.
Details: Must be able to store data with all details; aggregation may lead to loss of details.
Throughput: Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark.
Reliable: Must be highly available and reliable (no permanent loss of data).
Scalable: Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up.
All sources: Must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks, etc.
What is Azure Data Lake Store?
A highly scalable, distributed, parallel file system in the cloud, specifically designed to work with multiple analytic frameworks
LOB Applications
Social
Devices
Clickstream
Sensors
Video
Web
Relational
HDInsight
ADL Analytics
Machine Learning
Spark
R
ADL Store
https://azure.microsoft.com/en-us/services/data-lake-analytics/
https://channel9.msdn.com/Series/AzureDataLake
Analytics
Storage
HDInsight (“managed clusters”)
Azure Data Lake Analytics
Azure Data Lake Storage
Azure Data Lake
Azure Data Lake Analytics Service
A new distributed analytics service
Built on Apache YARN
Scales dynamically with the turn of a dial
Pay by the query
Supports Azure AD for access control, roles, and integration with on-premises identity systems
Built with U-SQL to unify the benefits of SQL with the power of C#
Processes data across Azure
Work across all cloud data
Azure Data Lake Analytics
Azure SQL DW
Azure SQL DB
Azure Storage Blobs
Azure Data Lake Store
SQL DB in an Azure VM
Analytics: Two form factors
HDInsight: Managed Hadoop clusters
ADLA: Analytics service
HDInsight Cluster
n1 n2 n3 n4
Hive/Pig/etc. job
Lots of containers
U-SQL/Hive/Pig job
ADLA Account
YARN Layer
Storage: Blobs or ADLS (Input, Output)
ADLA complements HDInsight
Target the same scenarios, tools, and customers
HDInsight
For developers familiar with open source tools: Java, Eclipse, Hive, etc.
Clusters offer customization, control, and flexibility in a managed Hadoop cluster
ADLA
Enables customers to leverage existing experience with C#, SQL & PowerShell
Offers convenience, efficiency, automatic scale, and management in a “job service” form factor
What is U-SQL?
A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data
Allows users to focus on the what, not the how, of business problems
Built on familiar languages (SQL and C#) and supported by a fully integrated development environment
Built for data developers & scientists
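The idea of pairing a declarative query language with a general-purpose one can be illustrated by analogy. The sketch below is not U-SQL: it uses Python's standard `sqlite3` module, where a SQL query calls a user-defined Python function, standing in for the way U-SQL lets SQL statements use C# expressions. The table, column names, and sample values are hypothetical.

```python
import sqlite3


def normalize_region(raw):
    # Imperative helper, standing in for a C# expression in U-SQL.
    return raw.strip().upper()


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("normalize_region", 1, normalize_region)

conn.execute("CREATE TABLE searches (region TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO searches VALUES (?, ?)",
    [(" en-us", 10), ("EN-US ", 20), ("en-gb", 5)],
)

# The query states *what* is wanted; the custom function handles the
# row-level cleanup that plain SQL would express awkwardly.
rows = conn.execute(
    "SELECT normalize_region(region) AS norm, SUM(duration) AS total "
    "FROM searches GROUP BY norm ORDER BY norm"
).fetchall()
print(rows)
conn.close()
```

The split mirrors the slide's point: set-based logic stays in the query, while custom per-row logic lives in ordinary code.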
Developing big data apps
Author, debug, & optimize big data apps in Visual Studio
Multiple languages: U-SQL, Hive, & Pig
Seamlessly integrate .NET
https://channel9.msdn.com/Events/Cortana-Analytics-Suite/CA-Suite-Workshop-10-11SEP15
http://www.microsoft.com/en-us/server-cloud/cortana-analytics-suite/overview.aspx