02 data lake - wug Česká republika

34
Azure Data Lake Marek Chmel | MVP: Data Platform, Principal Architect [email protected] @MarekChmel

Upload: others

Post on 28-Feb-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 02 data lake - WUG Česká republika

Azure Data LakeMarek Chmel | MVP: Data Platform, Principal Architect

[email protected]@MarekChmel

Page 2: 02 data lake - WUG Česká republika

Key Message

With the increase of computing power, electronic devices and accessibility to the Internet, more data than ever is being produced, collected and transmitted.

Organizations have recognized the power of data analysis, but are struggling to manage the massive amounts of information they have.

Facebook collects 250 TB a dayData production a day ±10 ZB35-40 ZB expected in 2020 Currently 0.5% used for analysis

Page 3: 02 data lake - WUG Česká republika
Page 4: 02 data lake - WUG Česká republika
Page 5: 02 data lake - WUG Česká republika

Introducing Data Lakes

Page 6: 02 data lake - WUG Česká republika

Two approaches to information management for analytics

Bottom-up(inductive)

Observation

Pattern

Theory

Hypothesis

What will happen?

How can we make it happen?

Predictive analytics

Prescriptive analytics

What happened?

Why did it happen?

Descriptive analytics

INFORMATION

Diagnostic analytics

OPTIMIZATION

Top-down(deductive)

Confirmation

Theory

Hypothesis

Observation

Page 7: 02 data lake - WUG Česká republika

The data lake uses a bottom-up approach

Ingest all data regardless of requirements

Store all data in native format without schema definition

Do analysisusing analytic engines like Hadoop

Interactive queries

Batch queries

Machine Learning

Data warehouse

Real-time analytics

Devices

Page 8: 02 data lake - WUG Česká republika

Data warehousing uses a top-downapproach

Implement data warehouse

Physical design

ETL development

Reporting and analytics development

Install and tune

Reporting and analytics design

Dimension modeling

ETL design

Set up infrastructure

Understand corporate strategy

Data sources

Gather requirements

Business requirements

Technical requirements

Page 9: 02 data lake - WUG Česká republika

Options for Big Data in Azure

HDInsightAzure-managed VM cluster

running Hadoop

Data LakeBig Data as a Service

Virtual MachinesUser-managed VM cluster

running Hadoop

Clusters Serverless

Data Lake Analytics

Data Lake Storage Gen1

Page 10: 02 data lake - WUG Česká republika

Challenges involved in implementing a data lake

Performance and Scale

Storage bottlenecks IoT sources – small writesPrice-performanceData grows independently

Security

Compliance challengesEffectively control accessCorporate policies

Data Silos

Data spans sourcesInefficiency in colocation

Analytics

Open interfaces to dataVariety of analytics tools

Page 11: 02 data lake - WUG Česká republika

Azure Data Lake (ADL)

Analytics

Storage

Azure Data Lake Analytics

Azure Data Lake Analytics

Azure Data Lake Store

Page 12: 02 data lake - WUG Česká republika

Azure Data Lake demystified

Azure Data Lake

Data Lake Storage Gen1Storage optimized for analytics

Data Lake AnalyticsAnalytics job service

• Petabyte size files and trillions of objects

• Provide I/O capacity to bandwidth hungry apps like HDI,

Cloudera and Hortonworks

• Encryption on disk

• File & Folder ACLs

• Start in seconds, scale instantly, pay per job

• Develop massively parallel programs with simplicity

• Leverage open source (Hadoop, Spark, Python, R…)

• Debug & optimize big data programs with ease

• Virtualize your analytics

• Always encrypted

• Integrated Azure Active Directory

• Role-based access Control

• Enterprise-grade support

Page 13: 02 data lake - WUG Česká republika

Built on open source

Analytics

Storage

YARN

ADL Analytics ADL HDInsight

WebHDFS

ADL Store

HiveU/SQL

Page 14: 02 data lake - WUG Česká republika

Data lakes in Azure Cloud

Hadoop cluster

HDFS/WebHDFS API

Azure Data LakeWebHDFS API

Azure HDInsight

Page 15: 02 data lake - WUG Česká republika

Bigger picture

• Cortana Analytics is a fully managed big data and advanced analytics suite that enables you to transform your data into intelligent action

Business scenariosRecommendations,customer churn, forecasting

Perceptual intelligenceFace, vision

Speech, text

Personal digital assistant

Cortana

Dashboards and visualizations

Power BI

Machine Learning and Analytics

Azure Machine Learning

Azure Stream Analytics

DATA

Business apps

Custom apps

Sensors and devices

INTELLIGENCE ACTION

People

Automatedsystems

Big data stores

AzureSQL Data Warehouse

Information management

Azure Data Factory

Azure Data Catalog

Azure Event Hub

Azure Data Lake Store

Azure HDInsight (Hadoop)

Azure Data Lake Analytics

Page 16: 02 data lake - WUG Česká republika

Big data made easy

Page 17: 02 data lake - WUG Česká republika

Understanding ADL Store

Page 18: 02 data lake - WUG Česká republika

What is Azure Data Lake (ADL) Store?

ADL Store HDInsight

ADL Analytics

Machine Learning

Spark

R

Devices

Page 19: 02 data lake - WUG Česká republika

Comparing ADLS with Azure Blob Storage

Optimized for analytics Bulk storage of filesCold data storage

Azure Blob storage

Azure Data Lake Store

Page 20: 02 data lake - WUG Česká republika

How do you start using ADL?

ADLAnalytics

Data

Azure Storage blobs

ADLStore

… and so on

Log in to the Azure portal

Write a U-SQL script and submit it to the ADL Analytics account

Create an ADL Analytics

account (90 seconds, free)

The U-SQL job reads and writes

data

Page 21: 02 data lake - WUG Česká republika

DEMO

Azure Data Lake Store

Page 22: 02 data lake - WUG Česká republika

Understanding ADLA

Page 23: 02 data lake - WUG Česká republika

Azure Data Lake Analytics service

§ Built on Apache YARN

§ Scales dynamically with the turn of a dial

§ Pay by the query

§ Supports Azure Active Directory for access control, roles, and integration with on-premises identity systems

§ Built with U-SQL to unify the benefits of SQL with the power of C#

§ Processes data across Azure

A new distributed analytics service

Page 24: 02 data lake - WUG Česká republika

Key Benefits of ADLA

Supports Azure Active Directory

Includes U-SQL

Processes data across multiple Azure data sources

Page 25: 02 data lake - WUG Česká republika

Simplified management and administration

§ Web-based management in Azure portal

§ Automate tasks using PowerShell

§ Role-based Access Control with Azure Active Directory

§ Monitor service operations and activity

Page 26: 02 data lake - WUG Česká republika

DEMO

Azure Data Lake Analytics

Page 27: 02 data lake - WUG Česká republika

What isU-SQL?

A hyper-scalable, highly extensible language for preparing, transforming, and analyzing all data

Allows users to focus on the what—not the how—of business problems

Built on familiar languages (SQL and C#, SCOPE) and supported by a fully integrated development environment

Built for data developers and scientists

Page 28: 02 data lake - WUG Česká republika

U-SQL fundamentals

All the familiar SQL clauses

SELECT | FROM | WHERE

GROUP BY | JOIN | OVER

Operate on unstructured and structured data

Relational metadata objects

Page 29: 02 data lake - WUG Česká republika

StoreProcessRead

U-SQL queries: General pattern

INSERT

OUTPUT

OUTPUT

SELECT… FROM… WHERE…

EXTRACT

EXTRACT

SELECT

SELECT

Azure SQL DB

Azure Storage blobs

AzureStorage

blobs

RowSet

Azure Data Lake

RowSet

Azure Data Lake

Page 30: 02 data lake - WUG Česká republika

.NET integration and extensibility

U-SQL expressions are full C# expressions

Reuse .NET code in your own assemblies

Use C# to define your own:

Types | Functions | Joins | Aggregators | IO (Extractors, Outputters)

U/SQL .NET

Page 31: 02 data lake - WUG Česká republika

Query across Azure data sources

READ WRITE

Azure Data Lake Store

Azure Storage blobs

Azure SQL database

Azure SQL data warehouse

Page 32: 02 data lake - WUG Česká republika

Visual Studio and U-SQL integration

Visualize and replay progress of

job

Fine-tune query performance

Visualize physical plan of U-SQL

query

Browse metadata catalog

Author U-SQL scripts (with

C# code)

Create metadata objects

Submit and cancel U-SQL jobs

Debug U-SQL and C# code

Page 33: 02 data lake - WUG Česká republika

@rows = EXTRACT

OrderId int, Customer string, Date DateTime, Amount float

FROM “/input/orders.txt"USING Extractor.Tsv();

OUTPUT @rowsTO “adl://mylake/orders_copy.txt"USING Outputters.Tsv();

Apply schema on read

From a file in a data lake

Easy delimited text handling

Write out

Rowset

Read the input, write it directly to output (just a simple copy)

Page 34: 02 data lake - WUG Česká republika

Follow up

§ Course 20775A:Performing Data Engineering on Microsoft HD Insight

§ Course 20774A:Perform Cloud Data Science with Azure Machine Learning

§ Course 20776A:Performing Big Data Engineering on Microsoft Cloud Services