u-sql - azure data lake analytics for developers

U-SQL - Azure Data Lake Analytics for DevelopersMichael Rys, Microsoft@MikeDoesBigData, {mrys, usql}@microsoft.com

Why U-SQL?Why U-SQL?

Characteristics of Big Data analytics

•Sample Use Cases• Digital Crime Forensics – Analyze complex

attack patterns to understand BotNets and to predict and mitigate future attacks, by analyzing log records with complex custom algorithms

• Image Processing – Large-scale image feature extraction and classification using custom code

• Shopping Recommendations – Complex pattern analysis and prediction over shopping records using proprietary algorithms

Requires processing of any type of data

Allows use of custom algorithms

Scales efficiently to any size

Status quo:SQL for Big Data

Declarativity does scaling and parallelization for you

Extensibility is bolted on and not “native” Difficult to work with anything

other than structured data Difficult to extend with custom

code

Status quo:Programming languages for Big Data

Extensibility through custom code is “native”

Declarativity is bolted on and not “native” User often has to care about scale

and performance SQL is second class within string Often no code reuse/sharing

across queries

Why U-SQL?

Get benefits of both!•Makes it easy for you by unifying:• Unstructured and structured data processing• Declarative SQL and custom imperative code (C#)• Local and remote queries

• Increase productivity and agility from Day 1 and at Day 100 for YOU!

Declarativity and extensibility are equally native to the language

The origins of U-SQL

U-SQLSCOPE

Hive

T-SQL/ ANSI SQL

SCOPE – Microsoft’s internal Big Data language• SQL and C# integration model• Optimization and scaling model• Runs 100,000s of jobs daily

Hive• Complex data types (Maps, Arrays)• Data format alignment for text files

T-SQL/ANSI SQL• Many of the SQL capabilities (windowing functions,

meta data model etc.)

Show me U-SQL!

U-SQL language philosophyDeclarative query and transformation language• Uses SQL’s SELECT FROM WHERE with GROUP

BY/aggregation, joins, SQL analytics functions• Optimizable, scalableExpression-flow programming style• Easy-to-use functional lambda composition • Composable, globally optimizableOperates on unstructured and structured data• Schema on read over files• Relational metadata objects (e.g. database, table)Extensible from ground up• Type system is based on C#• Expression language IS C#• User-defined functions (U-SQL and C#)• User-defined aggregators (C#)• User-defined operators (UDO) (C#)

U-SQL provides the Parallelization and Scale-out Framework for Usercode• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER,

COMBINER, APPLIERFederated query across distributed data sources

REFERENCE MyDB.MyAssembly;

CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float );

@o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt" USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string FROM "/input/customers.txt" USING Extractors.Csv();

@j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt , AGG<MyAgg.MySum>(c.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid;

OUTPUT @j TO "/output/result.txt"USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;

Expression-flow programming style• Automatic "in-lining" of U-SQL

expressions – whole script leads to a single execution model.

• Execution plan that is optimized out-of-the-box and without user intervention.

• Per-job and user-driven level of parallelization.

• Detailed visibility into execution steps, for debugging.

• Heatmap-like functionality to identify performance bottlenecks.

Query data where it lives• Easily query data in multiple

Azure data stores without moving it to a single store

• Avoid moving large amounts of data across the network between stores

• Single view of data irrespective of physical location

• Minimize data proliferation issues caused by maintaining multiple copies

• Single query language for all data• Each data store maintains its own sovereignty• Design choices based on the need• Push SQL expressions to remote SQL sources

• Filters• Joins

U-SQL Query

Query

Query

Query

WriteAzure Storage Blobs

Azure SQL in VMs

Azure SQL DB

Azure Data Lake Analytics

Query

Azure SQL Data Warehouse

Quer

y

Writ

e

Azure Data Lake Storage

Unstructured files• Schema on read• Write to file• Built-in and custom extractors

and outputters• ADL Storage and Azure Blob

Storage

EXTRACT Expression@s = EXTRACT a string, b int FROM "filepath/file.csv"USING Extractors.Csv(encoding: Encoding.Unicode);

• Built-in extractors: Csv, Tsv, Text with lots of options• Custom extractors: e.g., JSON, XML, and so on

OUTPUT ExpressionOUTPUT @sTO "filepath/file.csv"USING Outputters.Csv();

• Built-in outputters: Csv, Tsv, Text• Custom outputters: JSON, XML, and so on

Filepath URIs• Relative URI to default ADL Storage account: "/filepath/file.csv"• Absolute URIs:

• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"

• WASB: "wasb://container@account/filepath/file.csv"

U-SQL extensibilityExtend U-SQL with C#/.NET

Built-in operators, function, aggregates

C# expressions (in SELECT expressions)

User-defined aggregates (UDAGGs)

User-defined functions (UDFs)

User-defined operators (UDOs)

Extending U-SQL

Managing AssembliesCreate assembliesReference assembliesEnumerate assembliesDrop assemblies

CREATE ASSEMBLY db.assembly FROM @path;CREATE ASSEMBLY db.assembly FROM byte[];

• Can also include additional resource files

REFERENCE ASSEMBLY db.assembly;

• Referencing .Net Framework Assemblies• Always accessible system namespaces:

• U-SQL specific (e.g., for SQL.MAP)• All provided by system.dll system.core.dll system.data.dll,

System.Runtime.Serialization.dll, mscorelib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)

• Add all other .Net Framework Assemblies with:REFERENCE SYSTEM ASSEMBLY [System.XML];

• Enumerating Assemblies• Powershell command• U-SQL Studio Server Explorer

DROP ASSEMBLY db.assembly;

Assembly Dependencies

• Assembly must be registered to be referenced• All Assemblies needed for compilation

must be referenced in script• All Assemblies needed at runtime

either• Need to be referenced in script, or• Need to be registered with the

assembly as additional files• Metadata Service does NOT enforce

dependencies• Visual Studio Extension provides

support for dependency management

Show Me File Sets!

File sets

• Simple patterns• Virtual columns• Only on EXTRACT for now

Simple pattern language on filename and path@pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";

• Binds two columns date and suffix• Wildcards the filename• Limits on number of files (between 800 and 3000)

Virtual columnsEXTRACT

name string, suffix string // virtual column, date DateTime // virtual column

FROM @patternUSING Extractors.Csv();

• Refer to virtual columns in query to get partition elimination• Virtual columns need to be referenced for DateTime columns

and if no wildcard has been given

Let’s do some SQL with U-SQL

@t = EXTRACT date string, time string, author string, tweet string FROM "/Samples/Data/MyTwitterHistory.csv“ USING Extractors.Csv();

@m = SELECT new ARRAY<string>(tweet.Split(' ').Where(x => x.StartsWith("@"))) AS refs FROM @t;

@t = SELECT author, "authored" AS category FROM @t UNION ALL SELECT r.Substring(1) AS r, "referenced" AS category FROM @m CROSS APPLY EXPLODE(refs) AS Refs(r);

@res = SELECT author, category, COUNT( * ) AS tweetcount FROM @t GROUP BY author, category;

OUTPUT @res TO "/Samples/Data/Output/MyTwitterAnalysis.csv"ORDER BY tweetcount DESCUSING Outputters.Csv();

@m(refs)@me, @you

@him, @her

Refs(r)@me

@you

@him

@her

CROSS APPLY EXPLODE

@me, @you

@me

@you

U-SQL JoinsJoin operators• INNER JOIN• LEFT or RIGHT or FULL OUTER JOIN• CROSS JOIN• SEMIJOIN

• equivalent to IN subquery• ANTISEMIJOIN

• Equivalent to NOT IN subqueryNotes

• ON clause comparisons need to be of the simple form: rowset.column == rowset.columnor AND conjunctions of the simple equality comparison

• If a comparand is not a column, wrap it into a column in a previous SELECT

• If the comparison operation is not ==, put it into the WHERE clause

• turn the join into a CROSS JOIN if no equality comparisonReason: Syntax calls out which joins are efficient

U-SQL Analytics Windowing Expression

Window_Function_Call 'OVER' '(' [ Over_Partition_By_Clause ]

[ Order_By_Clause ] [ Row _Clause ]')'.

Window_Function_Call :=Aggregate_Function_Call

| Analytic_Function_Call| Ranking_Function_Call.

Windowing Aggregate Functions

ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP

Analytics Functions

CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK, LEAD, LAG

Ranking Functions

DENSE_RANK, NTILE, RANK, ROW_NUMBER

“Top 5”sSurprises for SQL Users

AS is not as• C# keywords and SQL keywords overlap• Costly to make case-insensitive -> Better

build capabilities than tinker with syntax

= != ==• Remember: C# expression language

null IS NOT NULL• C# nulls are two-valued

PROCEDURES but no WHILE

No UPDATE nor MERGE• Transform/Recook instead

Meta Data Object ModelADLA Catalog

Database

Schema

[1,n]

[1,n]

[0,n]

tables views TVFs

C# Fns C# UDAgg

ClusteredIndex

partitions

C# Assemblies

C# Extractors

Data Source

C# ReducersC# Processors

C# CombinersC# Outputters

Ext. tables Procedures

Creden-tials

C# Applier

Table Types

Statistics

C# UDTs

Abstract

objectsUser

objects

Refers toContains

Implemented and named by

MD Name

C# Name

Legend

U-SQL Catalog • Naming• Default database and schema context: master.dbo• Quote identifiers with []: [my table]• Stores data in ADL Storage /catalog folder

• Discovery• Visual Studio Server Explorer• Azure Data Lake Analytics Portal• SDKs and Azure PowerShell commands

• Sharing• Within an Azure Data Lake Analytics account

• Securing• Secured with AAD principals at catalog level (inherited from

ADL Storage)

• Naming• Discovery• Sharing• Securing

Create shareable data and code

Views and TVFs

• Views for simple cases• TVFs for parameterization and

most cases

ViewsCREATE VIEW V AS EXTRACT…CREATE VIEW V AS SELECT …

• Cannot contain user-defined objects (such as UDFs or UDOs)

• Will be inlined

Table-Valued Functions (TVFs)CREATE FUNCTION F (@arg string = "default") RETURNS @res [TABLE ( … )] AS BEGIN … @res = … END;

• Provides parameterization• One or more results• Can contain multiple statements• Can contain user-code (needs assembly reference)• Will always be inlined • Infers schema or checks against specified return

schema

ProceduresAllows encapsulation of non-DDL scripts

CREATE PROCEDURE P (@arg string = "default“) ASBEGIN …; OUTPUT @res TO …; INSERT INTO T …;END;

• Provides parameterization• No result but writes into file or table• Can contain multiple statements• Can contain user code (needs assembly reference)• Will always be inlined • Cannot contain DDL (no CREATE, DROP)

Tables• CREATE TABLE• CREATE TABLE AS SELECT

CREATE TABLE T (col1 int , col2 string , col3 SQL.MAP<string,string> , INDEX idx CLUSTERED (col1 ASC) PARTITIONED BY HASH (driver_id) );

• Structured Data• Built-in Data types only (no UDTs)• Clustered index (must be specified): row-oriented• Fine-grained partitioning (must be specified):

• HASH, DIRECT HASH, RANGE, ROUND ROBIN

CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);

• Infer the schema from the query• Still requires index and partitioning

Inserting New Data

INSERT

• INSERT constant values• INSERT from queries• Multiple INSERTs

INSERT constant values

INSERT INTO T VALUES (1, "text", new SQL.MAP<string,string>("key","value"));

INSERT from queries

INSERT INTO T SELECT col1, col2, col3 FROM @rowset;

Multiple INSERTs into same table

• Is supported• Generates separate file per insert in physical storage:

• Can lead to performance degradation• Recommendations:

• Try to avoid small inserts• Rebuild table after frequent insertions with:

ALTER TABLE T REBUILD;

Additional capabilities and resources

Additional U-SQL Capabilities• User-defined Operators, Aggregators, Types, Table types• Partitioned Tables• Federated Queries against SQL Server in Azure• Deploying file resources• Multi-rowset table-valued functions

Additional U-SQL Resources• Tools: http://aka.ms/adltoolsVS• Blogs and community page:

• http://usql.io• http://blogs.msdn.com/b/visualstudio/• http://azure.microsoft.com/en-us/blog/topics/big-data/• https://channel9.msdn.com/Search?term=U-SQL#ch9Search

• Documentation and articles and slides:• http://aka.ms/usql_reference• https://azure.microsoft.com/en-us/documentation/services/data-lake-analyti

cs/• https://msdn.microsoft.com/en-us/magazine/mt614251 • http://www.slideshare.net/MichaelRys

• ADL forums and feedback• http://aka.ms/adlfeedback• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=Azure

DataLake

• http://stackoverflow.com/questions/tagged/u-sql

http://aka.ms/adltoolsVS

http://aka.ms/adltoolsVS

http://usql.io/

http://blogs.msdn.com/b/visualstudio/

http://azure.microsoft.com/en-us/blog/topics/big-data/

https://channel9.msdn.com/Search?term=U-SQL#ch9Search

http://aka.ms/usql_reference

https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/

https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/

https://msdn.microsoft.com/en-us/magazine/mt614251

http://www.slideshare.net/MichaelRys

http://aka.ms/adlfeedback

https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake

https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake

http://stackoverflow.com/questions/tagged/u-sql

Unifies SQL declarativity and C# extensibilityUnifies querying structured and unstructured dataUnifies local and remote queriesIncrease productivity and agility from Day 1 forward for YOU!

Sign up for an Azure Data Lake account and join the Public Preview http://www.azure.com/datalake, download the VS tools, and give us feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!

This is why U-SQL!

http://www.azure.com/datalake

http://aka.ms/adlfeedback

http://aka.ms/u-sql-survey

Friendly CompetitionWin ADL and U-SQL SWAG

1. Contribute a cool U-SQL project/sample to the Azure/usql Github repo (via http://usql.io) by Apr 30, 2016

2. Tweet your submission to @MikeDoesBigData with #USQLComp

3. We will review the submissions and send some cool swag (U-SQL T-Shirts, ADL Poloshirts etc) to the top 5 submissions

http://usql.io/

u-sql - azure data lake analytics for developers

Data & Analytics