Microsoft's Big Play for Big Data

Upload: andrew-brust

Post on 10-May-2015

Page 1: Microsoft's Big Play for Big Data

SQL Server Live! Orlando 2012

SQTH8 - Microsoft's Big Play for Big Data - Andrew Brust © 2012 SQL Server Live! All rights reserved. 1

Microsoft's Big Play for Big Data

Level: Intermediate

Andrew J. Brust

CEO and Founder

Blue Badge Insights

Meet Andrew

• CEO and Founder, Blue Badge Insights

• Big Data blogger for ZDNet

• Microsoft Regional Director, MVP

• Co-chair VSLive! and 17 years as a speaker

• Founder, Microsoft BI User Group of NYC

– http://www.msbinyc.com

• Co-moderator, NYC .NET Developers Group

– http://www.nycdotnetdev.com

• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News

• brustblog.com, Twitter: @andrewbrust

My New Blog (bit.ly/bigondata)

Read all about it!

What is Big Data?

• 100s of TB into PB and higher

• Involving data from: financial data, sensors, web logs, social media, etc.

• Parallel processing often involved

– Hadoop is emblematic, but other technologies are Big Data too

• Processing of data sets too large for transactional databases

– Analyzing interactions, rather than transactions

– The three V’s: Volume, Velocity, Variety

• Big Data tech sometimes imposed on small data problems

What’s MapReduce?

• “Big” input data as key-value pair series

• Partition the data and send to mappers (nodes in cluster)

• Mappers pre-aggregate by key, then all output for (a) given key(s) goes to a reducer

• Reducer completes aggregations; one output per key, with value

• Map and Reduce code natively written as Java functions
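The flow described above can be sketched in a single process. This is only an illustration, not Hadoop's API: the function names (`partition`, `mapper`, `shuffle`, `reducer`, `word_count`) and the sample records are assumptions made for the example, and plain Python lists stand in for the cluster.

```python
from collections import defaultdict


def partition(records, n_mappers):
    """Split the input key-value pairs into one chunk per mapper."""
    return [records[i::n_mappers] for i in range(n_mappers)]


def mapper(chunk):
    """Pre-aggregate by key within one partition (local word counts)."""
    local = defaultdict(int)
    for _line_no, text in chunk:
        for word in text.lower().split():
            local[word] += 1
    return list(local.items())


def shuffle(mapper_outputs):
    """Send all output for a given key to the same reducer bucket."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Complete the aggregation: one output per key, with its value."""
    return key, sum(values)


def word_count(records, n_mappers=3):
    chunks = partition(records, n_mappers)
    mapped = [mapper(chunk) for chunk in chunks]  # would run in parallel on a cluster
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Input as a series of key-value pairs: (line number, line text).
records = [(1, "big data is big"), (2, "data about data"), (3, "big play")]
counts = word_count(records)
```

Here each mapper sees only its own partition, and the final totals are correct only because the shuffle step routes every partial count for a key to one reducer.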

MapReduce, in a Diagram

[Diagram: input is partitioned across mappers; mapper output is routed by key (K1, K2, K3) to reducers, and each reducer produces output]

What’s a Distributed File System?

• One where data gets distributed over commodity drives on commodity servers

• Data is replicated

• If one box goes down, no data lost

– “Shared Nothing”

• BUT: Immutable

– Files can only be written to once

– So updates require drop + re-write (slow)

– You can append though

– Like a DVD/CD-ROM
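A toy model may help make the replication and write-once points concrete. It is a sketch under stated assumptions, not how HDFS is implemented: `ToyDFS`, its methods, the node names, and the placement rule are all hypothetical, and real systems split files into blocks and place replicas far more carefully.

```python
class ToyDFS:
    """In-memory stand-in for a replicated, write-once file system."""

    def __init__(self, node_names, replication=3):
        self.nodes = {name: {} for name in node_names}  # node -> {path: bytes}
        self.replication = replication

    def replicas(self, path):
        """Names of the nodes that currently hold a copy of the file."""
        return [name for name, files in self.nodes.items() if path in files]

    def create(self, path, data):
        """Write a new file; an existing file cannot be overwritten."""
        if self.replicas(path):
            raise FileExistsError(path)
        names = list(self.nodes)
        start = sum(path.encode()) % len(names)  # arbitrary, deterministic placement
        for i in range(min(self.replication, len(names))):
            self.nodes[names[(start + i) % len(names)]][path] = bytes(data)

    def append(self, path, data):
        """Appending is allowed; every replica receives the new bytes."""
        holders = self.replicas(path)
        if not holders:
            raise FileNotFoundError(path)
        for name in holders:
            self.nodes[name][path] += bytes(data)

    def read(self, path):
        for files in self.nodes.values():
            if path in files:
                return files[path]
        raise FileNotFoundError(path)

    def delete(self, path):
        for files in self.nodes.values():
            files.pop(path, None)

    def update(self, path, transform):
        """An 'update' is really read + drop + full re-write."""
        new_data = transform(self.read(path))
        self.delete(path)
        self.create(path, new_data)

    def fail_node(self, name):
        """Simulate one box going down."""
        del self.nodes[name]


dfs = ToyDFS(["node1", "node2", "node3", "node4"], replication=3)
dfs.create("/logs/day1.txt", b"first line\n")
dfs.append("/logs/day1.txt", b"second line\n")
dfs.fail_node(dfs.replicas("/logs/day1.txt")[0])  # lose one replica
survived = dfs.read("/logs/day1.txt")             # still readable
dfs.update("/logs/day1.txt", lambda old: old.upper())
```

The `update` method shows why in-place changes are slow in this model: the whole file is read, dropped, and written again, whereas `append` only adds bytes to the existing replicas.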

Hadoop = MapReduce + HDFS

• Modeled after Google MapReduce + GFS

• Have more data? Just add more nodes to cluster.

– Mappers execute in parallel

– Hardware is commodity

– “Scaling out”

• Use of HDFS means data may well be local to mapper processing

• So, not just parallel, but minimal data movement, which avoids network bottlenecks

What’s NoSQL?

• Databases that are non-relational (don’t let name fool you, some actually use SQL)

• Four kinds:

– Key-Value Store: schema-free (FYI: Azure Table Storage is an example)

– Document Store: all data stored in JSON objects

– Wide-Column Store: define column families, but not columns

– Graph database: manage relationships between objects
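The four kinds differ mainly in the shape of a record. The sketch below mimics each shape with plain Python structures; it does not use any real database client, and every key, field, and relation name in it is made up for illustration.

```python
import json

# Key-value store: an opaque value looked up by key, with no schema at all.
key_value = {"user:42": b"any bytes the application chooses"}

# Document store: each record is a self-describing JSON object.
document = json.dumps({"_id": 42, "name": "Ada", "tags": ["bi", "hadoop"]})

# Wide-column store: the column families ("profile", "activity") are defined
# up front, but the columns inside each family can differ from row to row.
wide_column = {
    "row-42": {
        "profile": {"name": "Ada"},
        "activity": {"2012-12-10": "login", "2012-12-11": "query"},
    },
    "row-43": {
        "profile": {"name": "Bob", "city": "NYC"},
        "activity": {},
    },
}

# Graph database: objects plus the relationships between them.
graph_edges = [("Ada", "FOLLOWS", "Bob"), ("Bob", "FOLLOWS", "Cy")]


def neighbors(edges, node, relation):
    """Follow one kind of relationship outward from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


ada_follows = neighbors(graph_edges, "Ada", "FOLLOWS")
```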

What’s HBase?

• A Wide-Column Store

• Modeled after Google BigTable

• Uses HDFS

– Therefore, Hadoop-compatible

• Hadoop often used with HBase

– But you can use either without the other

The Hadoop Stack

MapReduce, HDFS

Database

RDBMS Import/Export

Query: HiveQL and Pig Latin

Machine Learning/Data Mining

Log file integration

What’s Hive?

• Began as Hadoop sub-project

– Now top-level Apache project

• Provides a SQL-like (“HiveQL”) abstraction over MapReduce

• Has its own HDFS table file format (and it’s fully schema-bound)

• Can also work over HBase

• Acts as a bridge to many BI products which expect tabular data

Hadoop Distributions

• Cloudera

• Hortonworks

– HCatalog: Hive/Pig/MR Interop

• MapR

– Network File System replaces HDFS

• IBM InfoSphere BigInsights

– HDFS<->DB2 integration

• And now Microsoft…

Microsoft HDInsight

• Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows

• Windows Azure HDInsight and Microsoft HDInsight (for Windows Server)

– Single node preview runs on Windows client

• Includes ODBC Driver for Hive

– And Excel Add-In that uses it

• JavaScript MapReduce framework

• Contribute it all back to open source Apache Project

Azure HDInsight Provisioning

• Give cluster a name

– Hostname will be name.cloudapp.net

• Create credentials

– Used for ODBC connections and RDP sessions

• Elect whether to use SQL Azure for Hive metastore

• [Choose number of nodes and storage size in cluster]

• Wait for cluster to provision

• Click link to go to portal

Submitting, Running and Monitoring Jobs

• Upload a JAR

• Use Streaming

– Use other languages (i.e. other than Java) to write MapReduce code

– Python is popular option

– Any executable works, even C# console apps

– On HDInsight, JavaScript works too

– Still uses a JAR file: streaming.jar

• Run at command line (passing JAR name and params) or use GUI
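With Streaming, the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. The sketch below keeps both steps as Python generator functions so they can be tried locally; the names `map_lines` and `reduce_lines` are made up for the example. In a real job each would be wrapped in a small script that iterates over `sys.stdin`, and the path to the streaming JAR depends on the installation.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one 'word<TAB>1' line per word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer step: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _word, count in group)}"


# Local dry run: sorted() stands in for the sort Hadoop performs
# between the map and reduce phases.
sample = ["Big data big", "data"]
mapped = sorted(map_lines(sample))
reduced = list(reduce_lines(mapped))
```

Because the contract is only lines in and lines out, the same structure carries over to any executable, which is why console applications work as mappers and reducers too.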

Amenities for Visual Studio/.NET

• Hortonworks Data Platform for Windows

• MRLib (NuGet Package)

• MR code in C#: HadoopJob, MapperBase, ReducerBase

• LINQ to Hive

• OdbcClient + Hive ODBC Driver

• Deployment

• Debugging

Running MapReduce Jobs

HDInsight Data Sources

• Files in HDFS

• Azure Blob Storage (Azure HDInsight only)

• Hive Tables

• HBase?

Review: ODBC Connection Types

• Registry-based

– User Data Source Name (DSN)

– System DSN

• File-based

– File DSN

• String-based

– DSN-less connection

• We need file-based

• Wizard obfuscates how to do this

• Don’t forget to open the ODBC port!
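The connection types differ in what the connection string points at. The sketch below only builds the strings and does not connect to anything. `DSN`, `FILEDSN`, and `DRIVER` are standard ODBC connection-string keywords; the helper name, the DSN name, the file path, the driver name, and the `HOST`/`PORT` attributes are assumptions for illustration, since the attributes a DSN-less string needs are defined by the specific driver.

```python
def odbc_connection_string(**attrs):
    """Join keyword=value pairs into an ODBC-style connection string."""
    return ";".join(f"{key}={value}" for key, value in attrs.items())


# Registry-based: user and system DSNs are both referenced by name;
# they differ only in where the DSN is registered.
registry_based = odbc_connection_string(DSN="HiveOnAzure")

# File-based: the settings live in a .dsn file (hypothetical path).
file_based = odbc_connection_string(FILEDSN=r"C:\dsn\hive.dsn")

# String-based (DSN-less): everything is spelled out inline.
# The driver name and attribute names here are placeholders.
dsn_less = odbc_connection_string(
    DRIVER="{Hive}", HOST="name.cloudapp.net", PORT=10000
)
```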

Hive ODBC Setup, Excel Add-In

ODBC Driver’s Untold Story

• Works with any Hive install/Hadoop cluster, not just Windows-based ones.

• Simba driver available too

How Does SQL Server Fit In?

• RDBMS + PDW: Sqoop connectors

• RDBMS: Columnstore Indexes

– Enterprise Edition only

• Analysis Services: Tabular Mode

– Compatible with ODBC Driver

– Multidimensional mode is not

• RDBMS + SSAS Tabular: DirectQuery

• PowerPivot (as with SSAS Tabular)

• Power View

– Works against PowerPivot and SSAS Tabular

Querying Hadoop from SQL Server BI

The “Data-Refinery” Idea

• Use Hadoop to “on-board” unstructured data, then extract manageable subsets

• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them

• This is the current rationalization of Hadoop + BI tools’ coexistence

• Will it stay this way?
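The refinery pattern can be sketched end to end on a laptop. Everything in the example is a stand-in: a Python function plays the part of the Hadoop job, an in-memory SQLite database plays the part of the DW/BI server, and the log format, table name, and column names are invented for illustration.

```python
import sqlite3
from collections import Counter

# Raw, loosely structured input (hypothetical web log lines).
raw_logs = [
    "2012-12-10 GET /products/1 200",
    "2012-12-10 GET /products/2 404",
    "malformed line",
    "2012-12-11 GET /products/1 200",
]


def refine(lines):
    """Stand-in for the Hadoop step: parse, filter, and pre-aggregate."""
    hits = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not parse
        day, _method, path, status = parts
        if status == "200":
            hits[(day, path)] += 1
    return [(day, path, count) for (day, path), count in sorted(hits.items())]


# Load the manageable subset into a conventional SQL store.
subset = refine(raw_logs)
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE page_hits (day TEXT, path TEXT, hits INTEGER)")
dw.executemany("INSERT INTO page_hits VALUES (?, ?, ?)", subset)

# Examine it with familiar SQL.
totals = dw.execute(
    "SELECT path, SUM(hits) FROM page_hits GROUP BY path ORDER BY path"
).fetchall()
```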

Usability Impact

• PowerPivot makes analysis much easier, self-service

• Power View is great for discovery and visualization; also self-service

• Combine with the Hive ODBC driver and suddenly Hadoop is accessible to business users

• Caveats

– Someone has to write the HiveQL

– Can query Big Data, but the result set must be much smaller

Other Relevant MS Technologies

• SQL Server Components:

– SQL Server Parallel Data Warehouse

– StreamInsight

• Azure Components:

– Data Explorer

– DataMarket

• Deprecated MSR Project

– Dryad

Resources

• Big On Data blog

– http://www.zdnet.com/blog/big-data

• Apache Hadoop home page

– http://hadoop.apache.org/

• Hive & Pig home pages

– http://hive.apache.org/

– http://pig.apache.org/

• Hadoop on Azure home page

– https://www.hadooponazure.com/

• SQL Server 2012 Big Data

– http://bit.ly/sql2012bigdata

Thank you

[email protected]

• @andrewbrust on twitter

• Want to get the free “Redmond Roundup Plus”?

– Text “bluebadge” to 22828