u-sql - azure data lake analytics for developers
TRANSCRIPT
U-SQL - Azure Data Lake Analytics for DevelopersMichael Rys, Microsoft@MikeDoesBigData, {mrys, usql}@microsoft.com
Why U-SQL?Why U-SQL?
Characteristics of Big Data analytics
•Sample Use Cases• Digital Crime Forensics – Analyze complex
attack patterns to understand BotNets and to predict and mitigate future attacks, by analyzing log records with complex custom algorithms
• Image Processing – Large-scale image feature extraction and classification using custom code
• Shopping Recommendations – Complex pattern analysis and prediction over shopping records using proprietary algorithms
Requires processing of any type of data
Allows use of custom algorithms
Scales efficiently to any size
Status quo:SQL for Big Data
Declarativity does scaling and parallelization for you
Extensibility is bolted on and not “native” Difficult to work with anything
other than structured data Difficult to extend with custom
code
Status quo:Programming languages for Big Data
Extensibility through custom code is “native”
Declarativity is bolted on and not “native” User often has to care about scale
and performance SQL is second class within string Often no code reuse/sharing
across queries
Why U-SQL?
Get benefits of both!•Makes it easy for you by unifying:• Unstructured and structured data processing• Declarative SQL and custom imperative code (C#)• Local and remote queries
• Increase productivity and agility from Day 1 and at Day 100 for YOU!
Declarativity and extensibility are equally native to the language
The origins of U-SQL
U-SQLSCOPE
Hive
T-SQL/ ANSI SQL
SCOPE – Microsoft’s internal Big Data language• SQL and C# integration model• Optimization and scaling model• Runs 100,000s of jobs daily
Hive• Complex data types (Maps, Arrays)• Data format alignment for text files
T-SQL/ANSI SQL• Many of the SQL capabilities (windowing functions,
meta data model etc.)
Show me U-SQL!
U-SQL language philosophyDeclarative query and transformation language• Uses SQL’s SELECT FROM WHERE with GROUP
BY/aggregation, joins, SQL analytics functions• Optimizable, scalableExpression-flow programming style• Easy-to-use functional lambda composition • Composable, globally optimizableOperates on unstructured and structured data• Schema on read over files• Relational metadata objects (e.g. database, table)Extensible from ground up• Type system is based on C#• Expression language IS C#• User-defined functions (U-SQL and C#)• User-defined aggregators (C#)• User-defined operators (UDO) (C#)
U-SQL provides the Parallelization and Scale-out Framework for Usercode• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER,
COMBINER, APPLIERFederated query across distributed data sources
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt" USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string FROM "/input/customers.txt" USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt , AGG<MyAgg.MySum>(c.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
Expression-flow programming style• Automatic "in-lining" of U-SQL
expressions – whole script leads to a single execution model.
• Execution plan that is optimized out-of-the-box and without user intervention.
• Per-job and user-driven level of parallelization.
• Detailed visibility into execution steps, for debugging.
• Heatmap-like functionality to identify performance bottlenecks.
Query data where it lives• Easily query data in multiple
Azure data stores without moving it to a single store
• Avoid moving large amounts of data across the network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining multiple copies
• Single query language for all data• Each data store maintains its own sovereignty• Design choices based on the need• Push SQL expressions to remote SQL sources
• Filters• Joins
U-SQL Query
Query
Query
Query
WriteAzure Storage Blobs
Azure SQL in VMs
Azure SQL DB
Azure Data Lake Analytics
Query
Azure SQL Data Warehouse
Quer
y
Writ
e
Azure Data Lake Storage
Unstructured files• Schema on read• Write to file• Built-in and custom extractors
and outputters• ADL Storage and Azure Blob
Storage
EXTRACT Expression@s = EXTRACT a string, b int FROM "filepath/file.csv"USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in extractors: Csv, Tsv, Text with lots of options• Custom extractors: e.g., JSON, XML, and so on
OUTPUT ExpressionOUTPUT @sTO "filepath/file.csv"USING Outputters.Csv();
• Built-in outputters: Csv, Tsv, Text• Custom outputters: JSON, XML, and so on
Filepath URIs• Relative URI to default ADL Storage account: "/filepath/file.csv"• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
U-SQL extensibilityExtend U-SQL with C#/.NET
Built-in operators, function, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
Extending U-SQL
Managing AssembliesCreate assembliesReference assembliesEnumerate assembliesDrop assemblies
CREATE ASSEMBLY db.assembly FROM @path;CREATE ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
REFERENCE ASSEMBLY db.assembly;
• Referencing .Net Framework Assemblies• Always accessible system namespaces:
• U-SQL specific (e.g., for SQL.MAP)• All provided by system.dll system.core.dll system.data.dll,
System.Runtime.Serialization.dll, mscorelib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
• Add all other .Net Framework Assemblies with:REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating Assemblies• Powershell command• U-SQL Studio Server Explorer
DROP ASSEMBLY db.assembly;
Assembly Dependencies
• Assembly must be registered to be referenced• All Assemblies needed for compilation
must be referenced in script• All Assemblies needed at runtime
either• Need to be referenced in script, or• Need to be registered with the
assembly as additional files• Metadata Service does NOT enforce
dependencies• Visual Studio Extension provides
support for dependency management
Show Me File Sets!
File sets
• Simple patterns• Virtual columns• Only on EXTRACT for now
Simple pattern language on filename and path@pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns date and suffix• Wildcards the filename• Limits on number of files (between 800 and 3000)
Virtual columnsEXTRACT
name string, suffix string // virtual column, date DateTime // virtual column
FROM @patternUSING Extractors.Csv();
• Refer to virtual columns in query to get partition elimination• Virtual columns need to be referenced for DateTime columns
and if no wildcard has been given
Let’s do some SQL with U-SQL
@t = EXTRACT date string, time string, author string, tweet string FROM "/Samples/Data/MyTwitterHistory.csv“ USING Extractors.Csv();
@m = SELECT new ARRAY<string>(tweet.Split(' ').Where(x => x.StartsWith("@"))) AS refs FROM @t;
@t = SELECT author, "authored" AS category FROM @t UNION ALL SELECT r.Substring(1) AS r, "referenced" AS category FROM @m CROSS APPLY EXPLODE(refs) AS Refs(r);
@res = SELECT author, category, COUNT( * ) AS tweetcount FROM @t GROUP BY author, category;
OUTPUT @res TO "/Samples/Data/Output/MyTwitterAnalysis.csv"ORDER BY tweetcount DESCUSING Outputters.Csv();
@m(refs)@me, @you
@him, @her
Refs(r)@me
@you
@him
@her
CROSS APPLY EXPLODE
@me, @you
@me
@you
U-SQL JoinsJoin operators• INNER JOIN• LEFT or RIGHT or FULL OUTER JOIN• CROSS JOIN• SEMIJOIN
• equivalent to IN subquery• ANTISEMIJOIN
• Equivalent to NOT IN subqueryNotes
• ON clause comparisons need to be of the simple form: rowset.column == rowset.columnor AND conjunctions of the simple equality comparison
• If a comparand is not a column, wrap it into a column in a previous SELECT
• If the comparison operation is not ==, put it into the WHERE clause
• turn the join into a CROSS JOIN if no equality comparisonReason: Syntax calls out which joins are efficient
U-SQL Analytics Windowing Expression
Window_Function_Call 'OVER' '(' [ Over_Partition_By_Clause ]
[ Order_By_Clause ] [ Row _Clause ]')'.
Window_Function_Call :=Aggregate_Function_Call
| Analytic_Function_Call| Ranking_Function_Call.
Windowing Aggregate Functions
ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP
Analytics Functions
CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK, LEAD, LAG
Ranking Functions
DENSE_RANK, NTILE, RANK, ROW_NUMBER
“Top 5”sSurprises for SQL Users
AS is not as• C# keywords and SQL keywords overlap• Costly to make case-insensitive -> Better
build capabilities than tinker with syntax
= != ==• Remember: C# expression language
null IS NOT NULL• C# nulls are two-valued
PROCEDURES but no WHILE
No UPDATE nor MERGE• Transform/Recook instead
Meta Data Object ModelADLA Catalog
Database
Schema
[1,n]
[1,n]
[0,n]
tables views TVFs
C# Fns C# UDAgg
ClusteredIndex
partitions
C# Assemblies
C# Extractors
Data Source
C# ReducersC# Processors
C# CombinersC# Outputters
Ext. tables Procedures
Creden-tials
C# Applier
Table Types
Statistics
C# UDTs
Abstract
objectsUser
objects
Refers toContains
Implemented and named by
MD Name
C# Name
Legend
U-SQL Catalog • Naming• Default database and schema context: master.dbo• Quote identifiers with []: [my table]• Stores data in ADL Storage /catalog folder
• Discovery• Visual Studio Server Explorer• Azure Data Lake Analytics Portal• SDKs and Azure PowerShell commands
• Sharing• Within an Azure Data Lake Analytics account
• Securing• Secured with AAD principals at catalog level (inherited from
ADL Storage)
• Naming• Discovery• Sharing• Securing
Create shareable data and code
Views and TVFs
• Views for simple cases• TVFs for parameterization and
most cases
ViewsCREATE VIEW V AS EXTRACT…CREATE VIEW V AS SELECT …
• Cannot contain user-defined objects (such as UDFs or UDOs)
• Will be inlined
Table-Valued Functions (TVFs)CREATE FUNCTION F (@arg string = "default") RETURNS @res [TABLE ( … )] AS BEGIN … @res = … END;
• Provides parameterization• One or more results• Can contain multiple statements• Can contain user-code (needs assembly reference)• Will always be inlined • Infers schema or checks against specified return
schema
ProceduresAllows encapsulation of non-DDL scripts
CREATE PROCEDURE P (@arg string = "default“) ASBEGIN …; OUTPUT @res TO …; INSERT INTO T …;END;
• Provides parameterization• No result but writes into file or table• Can contain multiple statements• Can contain user code (needs assembly reference)• Will always be inlined • Cannot contain DDL (no CREATE, DROP)
Tables• CREATE TABLE• CREATE TABLE AS SELECT
CREATE TABLE T (col1 int , col2 string , col3 SQL.MAP<string,string> , INDEX idx CLUSTERED (col1 ASC) PARTITIONED BY HASH (driver_id) );
• Structured Data• Built-in Data types only (no UDTs)• Clustered index (must be specified): row-oriented• Fine-grained partitioning (must be specified):
• HASH, DIRECT HASH, RANGE, ROUND ROBIN
CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
• Infer the schema from the query• Still requires index and partitioning
Inserting New Data
INSERT
• INSERT constant values• INSERT from queries• Multiple INSERTs
INSERT constant values
INSERT INTO T VALUES (1, "text", new SQL.MAP<string,string>("key","value"));
INSERT from queries
INSERT INTO T SELECT col1, col2, col3 FROM @rowset;
Multiple INSERTs into same table
• Is supported• Generates separate file per insert in physical storage:
• Can lead to performance degradation• Recommendations:
• Try to avoid small inserts• Rebuild table after frequent insertions with:
ALTER TABLE T REBUILD;
Additional capabilities and resources
Additional U-SQL Capabilities• User-defined Operators, Aggregators, Types, Table types• Partitioned Tables• Federated Queries against SQL Server in Azure• Deploying file resources• Multi-rowset table-valued functions
Additional U-SQL Resources• Tools: http://aka.ms/adltoolsVS• Blogs and community page:
• http://usql.io• http://blogs.msdn.com/b/visualstudio/• http://azure.microsoft.com/en-us/blog/topics/big-data/• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation and articles and slides:• http://aka.ms/usql_reference• https://azure.microsoft.com/en-us/documentation/services/data-lake-analyti
cs/• https://msdn.microsoft.com/en-us/magazine/mt614251 • http://www.slideshare.net/MichaelRys
• ADL forums and feedback• http://aka.ms/adlfeedback• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=Azure
DataLake
• http://stackoverflow.com/questions/tagged/u-sql
Unifies SQL declarativity and C# extensibilityUnifies querying structured and unstructured dataUnifies local and remote queriesIncrease productivity and agility from Day 1 forward for YOU!
Sign up for an Azure Data Lake account and join the Public Preview http://www.azure.com/datalake, download the VS tools, and give us feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
This is why U-SQL!
Friendly CompetitionWin ADL and U-SQL SWAG
1. Contribute a cool U-SQL project/sample to the Azure/usql Github repo (via http://usql.io) by Apr 30, 2016
2. Tweet your submission to @MikeDoesBigData with #USQLComp
3. We will review the submissions and send some cool swag (U-SQL T-Shirts, ADL Poloshirts etc) to the top 5 submissions