02 data lake - wug Česká republika
TRANSCRIPT
Azure Data LakeMarek Chmel | MVP: Data Platform, Principal Architect
[email protected]@MarekChmel
Key Message
With the increase of computing power, electronic devices and accessibility to the Internet, more data than ever is being produced, collected and transmitted.
Organizations have recognized the power of data analysis, but are struggling to manage the massive amounts of information they have.
Facebook collects 250 TB a dayData production a day ±10 ZB35-40 ZB expected in 2020 Currently 0.5% used for analysis
Introducing Data Lakes
Two approaches to information management for analytics
Bottom-up(inductive)
Observation
Pattern
Theory
Hypothesis
What will happen?
How can we make it happen?
Predictive analytics
Prescriptive analytics
What happened?
Why did it happen?
Descriptive analytics
INFORMATION
Diagnostic analytics
OPTIMIZATION
Top-down(deductive)
Confirmation
Theory
Hypothesis
Observation
The data lake uses a bottom-up approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysisusing analytic engines like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
Data warehousing uses a top-downapproach
Implement data warehouse
Physical design
ETL development
Reporting and analytics development
Install and tune
Reporting and analytics design
Dimension modeling
ETL design
Set up infrastructure
Understand corporate strategy
Data sources
Gather requirements
Business requirements
Technical requirements
Options for Big Data in Azure
HDInsightAzure-managed VM cluster
running Hadoop
Data LakeBig Data as a Service
Virtual MachinesUser-managed VM cluster
running Hadoop
Clusters Serverless
Data Lake Analytics
Data Lake Storage Gen1
Challenges involved in implementing a data lake
Performance and Scale
Storage bottlenecks IoT sources – small writesPrice-performanceData grows independently
Security
Compliance challengesEffectively control accessCorporate policies
Data Silos
Data spans sourcesInefficiency in colocation
Analytics
Open interfaces to dataVariety of analytics tools
Azure Data Lake (ADL)
Analytics
Storage
Azure Data Lake Analytics
Azure Data Lake Analytics
Azure Data Lake Store
Azure Data Lake demystified
Azure Data Lake
Data Lake Storage Gen1Storage optimized for analytics
Data Lake AnalyticsAnalytics job service
• Petabyte size files and trillions of objects
• Provide I/O capacity to bandwidth hungry apps like HDI,
Cloudera and Hortonworks
• Encryption on disk
• File & Folder ACLs
• Start in seconds, scale instantly, pay per job
• Develop massively parallel programs with simplicity
• Leverage open source (Hadoop, Spark, Python, R…)
• Debug & optimize big data programs with ease
• Virtualize your analytics
• Always encrypted
• Integrated Azure Active Directory
• Role-based access Control
• Enterprise-grade support
Built on open source
Analytics
Storage
YARN
ADL Analytics ADL HDInsight
WebHDFS
ADL Store
HiveU/SQL
Data lakes in Azure Cloud
Hadoop cluster
HDFS/WebHDFS API
Azure Data LakeWebHDFS API
Azure HDInsight
Bigger picture
• Cortana Analytics is a fully managed big data and advanced analytics suite that enables you to transform your data into intelligent action
Business scenariosRecommendations,customer churn, forecasting
Perceptual intelligenceFace, vision
Speech, text
Personal digital assistant
Cortana
Dashboards and visualizations
Power BI
Machine Learning and Analytics
Azure Machine Learning
Azure Stream Analytics
DATA
Business apps
Custom apps
Sensors and devices
INTELLIGENCE ACTION
People
Automatedsystems
Big data stores
AzureSQL Data Warehouse
Information management
Azure Data Factory
Azure Data Catalog
Azure Event Hub
Azure Data Lake Store
Azure HDInsight (Hadoop)
Azure Data Lake Analytics
Big data made easy
Understanding ADL Store
What is Azure Data Lake (ADL) Store?
ADL Store HDInsight
ADL Analytics
Machine Learning
Spark
R
Devices
Comparing ADLS with Azure Blob Storage
Optimized for analytics Bulk storage of filesCold data storage
Azure Blob storage
Azure Data Lake Store
How do you start using ADL?
ADLAnalytics
Data
Azure Storage blobs
ADLStore
… and so on
Log in to the Azure portal
Write a U-SQL script and submit it to the ADL Analytics account
Create an ADL Analytics
account (90 seconds, free)
The U-SQL job reads and writes
data
DEMO
Azure Data Lake Store
Understanding ADLA
Azure Data Lake Analytics service
§ Built on Apache YARN
§ Scales dynamically with the turn of a dial
§ Pay by the query
§ Supports Azure Active Directory for access control, roles, and integration with on-premises identity systems
§ Built with U-SQL to unify the benefits of SQL with the power of C#
§ Processes data across Azure
A new distributed analytics service
Key Benefits of ADLA
Supports Azure Active Directory
Includes U-SQL
Processes data across multiple Azure data sources
Simplified management and administration
§ Web-based management in Azure portal
§ Automate tasks using PowerShell
§ Role-based Access Control with Azure Active Directory
§ Monitor service operations and activity
DEMO
Azure Data Lake Analytics
What isU-SQL?
A hyper-scalable, highly extensible language for preparing, transforming, and analyzing all data
Allows users to focus on the what—not the how—of business problems
Built on familiar languages (SQL and C#, SCOPE) and supported by a fully integrated development environment
Built for data developers and scientists
U-SQL fundamentals
All the familiar SQL clauses
SELECT | FROM | WHERE
GROUP BY | JOIN | OVER
Operate on unstructured and structured data
Relational metadata objects
StoreProcessRead
U-SQL queries: General pattern
INSERT
OUTPUT
OUTPUT
SELECT… FROM… WHERE…
EXTRACT
EXTRACT
SELECT
SELECT
Azure SQL DB
Azure Storage blobs
AzureStorage
blobs
RowSet
Azure Data Lake
RowSet
Azure Data Lake
.NET integration and extensibility
U-SQL expressions are full C# expressions
Reuse .NET code in your own assemblies
Use C# to define your own:
Types | Functions | Joins | Aggregators | IO (Extractors, Outputters)
U/SQL .NET
Query across Azure data sources
READ WRITE
Azure Data Lake Store
Azure Storage blobs
Azure SQL database
Azure SQL data warehouse
Visual Studio and U-SQL integration
Visualize and replay progress of
job
Fine-tune query performance
Visualize physical plan of U-SQL
query
Browse metadata catalog
Author U-SQL scripts (with
C# code)
Create metadata objects
Submit and cancel U-SQL jobs
Debug U-SQL and C# code
@rows = EXTRACT
OrderId int, Customer string, Date DateTime, Amount float
FROM “/input/orders.txt"USING Extractor.Tsv();
OUTPUT @rowsTO “adl://mylake/orders_copy.txt"USING Outputters.Tsv();
Apply schema on read
From a file in a data lake
Easy delimited text handling
Write out
Rowset
Read the input, write it directly to output (just a simple copy)
Follow up
§ Course 20775A:Performing Data Engineering on Microsoft HD Insight
§ Course 20774A:Perform Cloud Data Science with Azure Machine Learning
§ Course 20776A:Performing Big Data Engineering on Microsoft Cloud Services