platform for big data, nosql and relational data. what makes sense for me? (+azure)

PLATFORM FOR BIG DATA, NOSQL AND RELATIONAL DATA. WHAT MAKES SENSE FOR ME? (+AZURE) Michael Epprecht Technology Evangelist [email protected] @fastflame

Upload: octavio-wheetley

Post on 29-Mar-2015


TRANSCRIPT

Page 1

PLATFORM FOR BIG DATA, NOSQL AND RELATIONAL DATA. WHAT MAKES SENSE FOR ME? (+AZURE)

Michael Epprecht, Technology Evangelist

[email protected] @fastflame

Page 2

Agenda

• Big Data
• AllSQL, NoSQL, NewSQL, SomeSQL
• Windows Azure

Page 3

Big Data

Page 4

WHAT IS BIG DATA?

[Figure: data volume (Megabytes, Gigabytes, Terabytes, Petabytes) plotted against data complexity (variety and velocity), with three groups of data sources.]

• ERP/CRM: Payables, Payroll, Inventory, Contacts, Deal Tracking, Sales Pipeline
• Web 2.0: Web Logs, Digital Marketing, Search Marketing, Recommendations, Advertising, Mobile, Collaboration, eCommerce
• Big Data: Log files, Spatial & GPS coordinates, Data market feeds, eGov feeds, Weather, Text/image, Click stream, Wikis/blogs, Sensors/RFID/devices, Social sentiment, Audio/video

Page 5

Original Gartner three V’s, Feb 2001: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

• Volume (think data tiering): size of the data; manageability
• Velocity (think CEP): speed at which data is received; latency to deliver data analysis
• Variety (think ETL, ODS, email, social networks): differing formats of data; disparate source systems

Page 6

Big Data to Data Analytics

• Variety: dealing with un/semi-structured and structured data
  • How do you mix oranges and apples? Compare textual data with relational data
  • Tooling: accessing the “variety” of different data sources
• Determining “Value”
  • Big Data = proxy for doing more with existing data
• Perspective
  • What you are doing
  • Hardware innovations over time: spinning disk vs. flash, GPGPU vs. CPU

Page 7

Replacing BI?

• Single version of the truth?
  • Conformed dimensions (standardised data reporting)
  • Four different operational systems ETL’d into a single dimension
• Does Big Data change that? NO! YES!
  • Unstructured data is unstructured: can it be conformed?
  • Report on detail or aggregations?
• No: with analytics we are data mining
• Still needs standardisation and thought: a formal design process

Page 8

All data has Structure - not All data has Context

• Data stored [in structure]
  • Image -> png, jpg, bmp etc.
  • Free-text -> ascii, unicode, .docx, xls etc.
  • Sound -> mp3, mpeg
• Data queried
  • Image -> (?) face recognition, Kinect
  • Free-text -> grammar
  • Sound -> pitch, note etc.
• Context?
  • Image -> polygon
  • Free-text -> ??
  • Sound -> bars in the music??

Page 9

has Structure?

A1 difficulties

Page 10

has Context?

• Stored in Unicode
  • A1: could mean anything
  • Difficulties: the word itself has meaning
• Stored in Normal Form (Relational)

  RoadDesignator   DrivingStatus
  A1               Difficulties

Notes:
• Using Normal Form (Relational), context is provided by the schema
• New term time: uncontexted data (115 Bing references)
• Context gives data structure only when applied
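The contrast between the raw text "A1 difficulties" and the relational row can be sketched in a few lines of Python. This is only an illustration: the `RoadReport` class and the `apply_context` helper are hypothetical names, and the rule "first token is the road, the rest is the status" is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class RoadReport:
    """Hypothetical schema mirroring the RoadDesignator / DrivingStatus columns."""
    road_designator: str
    driving_status: str


def apply_context(raw: str) -> RoadReport:
    # Assumption for this sketch: the first token names the road,
    # everything after it describes the driving status.
    designator, status = raw.split(maxsplit=1)
    return RoadReport(road_designator=designator,
                      driving_status=status.capitalize())


raw_text = "A1 difficulties"       # plain Unicode: structure, but no context
report = apply_context(raw_text)   # the schema supplies the context
print(report)
```

The string itself never changes; it only becomes queryable as "driving status on road A1" once a schema is applied to it.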

Page 11

Big Data Processing

                      Batch Processing    Interactive Analysis             Stream Processing
Query runtime         Minutes to hours    Milliseconds to minutes          Never-ending
Data volume           TBs to PBs          GBs to PBs                       Continuous stream
Programming model     MapReduce           Queries                          DAG
Users                 Developers          Analysts and developers          Developers
Originating project   Google MapReduce    Google Dremel                    Twitter Storm
Open source project   Hadoop / Spark      Drill / Shark / Impala / HBase   Storm / Apache S4 / Kafka

Page 12

We’ve been Hyped

• The bandwagon is rolling
• If you hear a new term, research it; it is probably nothing new

Page 13

Finally: What is Big Data (really)?

• Data Analytics (stuff we already do)
• What is new?
  • New toolsets to help with the variety of data
  • Industry waking up to the power of commodity kit
  • Data Science as a field (a combination of BI Analyst, Business Analyst and BI Developer)
  • It’s still all about insights into our data
  • Hadoop: the platform of the next generation?
• Look out for the name change: Big Data will become Data Analytics

Page 14

A NEW SET OF QUESTIONS

• SOCIAL & WEB ANALYTICS: What’s the social sentiment for my brand or products?
• LIVE DATA FEEDS: How do I optimize my fleet based on weather and traffic patterns?
• ADVANCED ANALYTICS: How do I better predict future outcomes?

Page 15

COMMON BIG DATA CUSTOMER SCENARIOS

GAIN COMPETITIVE ADVANTAGE BY MOVING FIRST AND FAST IN YOUR INDUSTRY

Web app optimization

Smart meter monitoring

Equipment monitoring

Advertising analysis

Life sciences research

Fraud detection

Healthcare outcomes

Weather forecasting

Natural resource exploration

Social network analysis

Churn analysis

Traffic flow optimization

IT infrastructure optimization

Legal discovery

Page 16

What is Hadoop?

Page 17

• Massively Parallel Processing (MPP): chop a task up across multiple physical machines
• High Performance Clustering (HPC)
• Distributed Data Processing (DDP): processing done locally on the data
• MapReduce is based on something we know already

Page 18

Why MPP?

• Because enterprise kit for this performance is way too expensive
• 100 machines with cheap DAS cost a fraction of a scale-up machine with expensive SAN infrastructure
• Most NoSQL and NewSQL products are built with MPP and commodity kit as a design feature
• So is the cloud computing model
• Network connectivity is a key component (oh, hence take the processing to the data!)
• Follows the design paradigm that processing should move to the data and not the data to the processing

Page 19

What is Hadoop?

• Open source project coordinated by Apache
• Analogous to an OS; core components:
  • Utilities
  • HDFS
  • MapReduce
• Lots of other projects that sit within the ecosystem:
  • Mahout, Sqoop, Flume, Scribe, Oozie, Jaql, Hue, Hiho, Hive, Pig, HBase, … and more and more…
• V1.0.0 and V2.0.0 code branches

Page 20

HBase: persistent | distributed

• In Memory
• Efficient at Random Reads/Writes
• Distributed, large scale data store
• Utilizes Hadoop for persistence
• Both HBase and Hadoop are distributed

Page 21

In Hadoop MapReduce speak

• Map
  • Parse each input line to get the data you want
  • Output: a key (presented to a single reducer) and value pair (what we will likely aggregate)
• Shuffle
  • Sort and move the same “keys” to the same node for reduction (can be expensive: plan your data partitions properly)
• Reduce
  • Aggregate values
  • Output

http://developer.yahoo.com/hadoop/tutorial/module4.html
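The three phases can be sketched as a single-process word count in Python. This is a toy simulation of the flow, not the Hadoop API: the function names `map_phase`, `shuffle_phase` and `reduce_phase` and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    # Parse the input line and emit one (key, value) pair per word.
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    # Group all values that share a key, sorted by key. On a cluster this
    # is the step that moves identical keys to the same node.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    # Aggregate the values collected for a single key.
    return key, sum(values)


lines = [
    "big data big insight",
    "data moves to processing",
    "processing moves to data",
]

mapped = [pair for line in lines for pair in map_phase(line)]
shuffled = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in shuffled.items())
print(word_counts)
```

In a real cluster each phase runs in parallel on many machines, and the shuffle is the only phase that has to move data across the network, which is why the slide calls it out as expensive.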

Page 22

MapReduce as SQL

• Map = SELECT FROM WHERE
• Reduce = GROUP BY
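The analogy can be checked with a small Python sketch that runs the same aggregation twice: once as SQL in an in-memory SQLite database and once as a hand-written map and reduce. The `sales` table, its columns and its rows are invented for this example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (category, model_year, amount).
rows = [
    ("bikes", 2009, 120.0),
    ("bikes", 2007, 80.0),
    ("canoes", 2009, 300.0),
    ("canoes", 2006, 150.0),
    ("tents", 2008, 60.0),
]

# SQL version: SELECT ... FROM ... WHERE ... GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, model_year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
sql_result = dict(conn.execute(
    "SELECT category, SUM(amount) FROM sales "
    "WHERE model_year >= 2007 GROUP BY category"
))
conn.close()


def mapper(row):
    # Map = SELECT FROM WHERE: pick the columns and filter the rows.
    category, model_year, amount = row
    if model_year >= 2007:
        yield category, amount


# Shuffle: collect the values for each key.
grouped = defaultdict(list)
for row in rows:
    for key, value in mapper(row):
        grouped[key].append(value)

# Reduce = GROUP BY: aggregate per key.
mr_result = {key: sum(values) for key, values in grouped.items()}

print(sql_result == mr_result)
```

Both paths produce the same totals per category; the difference is that the MapReduce form spells out the projection, filter and grouping by hand.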

Page 23

AllSQL, NoSQL, NewSQL and SomeSQL

Page 24

AllSQL

• Data stored in Normal Form
• ACID for consistency and durability
• Queries done using ANSI SQL
• Basically what the majority of folk do
• The majority of reporting products use SQL as an interface
• Everybody knows SQL (despite its sins)
• Easy to understand and get going with

Page 25

NoSQL (Not Only SQL)

• Led by developers wanting:
  • More flexible data structures (dynamic schema)
  • Ability to store non-tabular data
  • Higher scalability: scale out
  • Lower hardware cost: build on commodity kit
  • Durability and consistency not a primary concern
  • Open source: move away from proprietary products
  • Data resilience built into the product through replicas rather than expensive hardware and software solutions
• Examples (see http://nosql-database.org/ - there are hundreds!):
  • Azure Table Store
  • Google’s BigTable
  • Hadoop MapReduce
  • Cassandra
  • RavenDB
  • CouchDB
  • MongoDB

Page 26

NoSQL momentum

• RDBMS cannot scale because of ACID (Atomicity, Consistency, Isolation, Durability)
• Swathe of new open source products
• Data captured has value but is not readily accessible
• NewSQL: will it “cure” the NoSQL problem?

Page 27

NewSQL

• Existing AllSQL products
  • Do not scale out well
  • Single machine design
  • Design is several decades old
  • Expensive to create a DR/HA environment
• Realisation
  • Folk do not want to learn Java in order to report off their data
  • Most toolsets use SQL as a method for reporting
• Examples
  • VoltDB
  • NuoDB
  • Azure DB

Page 28

AllSQL, NoSQL, NewSQL and SomeSQL

• The days when everything lived in SQL Server are going
• BI/BA/DA {whatever you want to call it} is done across different data sources: semi/un/fully structured
• Understand the non-relational world
• The SQL language isn’t going anywhere
• This isn’t about enterprise only: this affects us all

Page 29

Windows Azure

Page 30

MANAGE ANY DATA, ANY SIZE, ANYWHERE

[Figure: data platform overview]

• Relational
• Non-Relational
• Streaming
• Data Movement
• Unified Monitoring, Management & Security

Page 31

HADOOP INTEGRATED INTO THE DATA PLATFORM

Non-Relational

• Enterprise class security, HA & management
• Seamlessly integrated with Microsoft BI tools
• Windows simplicity and manageability
• Provisioned in minutes on Windows Azure

• Microsoft HDInsight Server for on-premises
• Windows Azure HDInsight Service for cloud

BUILT ON HORTONWORKS DATA PLATFORM (HDP)

Page 32

Hadoop architecture.

[Figure: block diagram of the Hadoop stack and the components around it]

• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Query (Hive)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / SQOOP / REST)
• Business Intelligence (Excel, Power View, …)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Pipeline / workflow (Oozie)
• Log file aggregation (Flume)
• Active Directory (Security)
• System Center

Page 33

INSIGHTS FOR ALL USERS THROUGH FAMILIAR TOOLS

• Advanced Analytics from Microsoft and 3rd parties
• Self Service Analysis with PowerPivot & Power View
• Interactivity & exploration with Hadoop data in Excel

Data sizes: PB, TB, GB

Users: BI Professionals, Business Analysts, Data Scientists

Page 34

Azure SQL Database

Page 35

SQL Database Architecture

Page 36

Architecture

• Federation
  • An object contained within a user database
  • Defines the scheme for the federation
  • Represents the database being sharded
• Federation Root
  • Database that houses the federation object
• Federation Member
  • System-managed SQL databases
  • Contain part, or “slices”, of the data

[Figure: SalesDB (Federation Root) containing federations such as Orders_Fed, each with its Federation Members]

CREATE FEDERATION fed_name(fed_key_label fed_key_type distribution_type)

Page 37

Architecture Cont.

• Federation Key
  • The key used for data distribution
  • int, bigint, guid, varbinary
• Atomic Unit
  • Represents a single instance of a federation key
  • All rows in all federated tables with the same federation key value

[Figure: SalesDB (Federation Root) with federation Orders_Fed and its Federation Members. Atomic Units shown: AU PK=5, AU PK=25, AU PK=35 and, in the member with range [1000, 2000), AU PK=1005, AU PK=1025, AU PK=1035.]

Page 38

Architecture Cont.

• Federated Table: contains only atomic units for the member’s key range
• Reference Table: non-federated table

Page 39

Repartitioning

[Figure: SalesDB with federation Orders_Fed; member [5000, 10000) becomes [5000, 7500) & [7500, 10000)]

ALTER FEDERATION Orders_Fed SPLIT AT (tenant_id=7500)

Dynamic Partitioning

• SPLIT members to spread workloads over more nodes
• DROP members to shrink back to fewer nodes

Page 40

Reliable Routing

[Figure: SalesDB with federation Orders_Fed; members [5000, 7500) & [7500, 10000)]

USE FEDERATION Orders_Fed (tenant_id=7509)

Built-in Data-Dependent Routing (DDR)

• Ensures apps can discover where the data is just-in-time
• No “Shard Map” caching
• Guaranteed member routing
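The SPLIT and routing behaviour on the last two slides can be modelled as a toy in Python. This is a sketch of range partitioning with half-open ranges, not the SQL Database implementation: the `RangeFederation` class and its method names are hypothetical.

```python
import bisect


class RangeFederation:
    """Toy model of a range-partitioned federation (not the SQL Database API)."""

    def __init__(self, low, high):
        # Sorted boundaries; member i covers [boundaries[i], boundaries[i + 1]).
        self.boundaries = [low, high]

    def split_at(self, key):
        # Mirrors the idea of ALTER FEDERATION ... SPLIT AT (key).
        if not self.boundaries[0] < key < self.boundaries[-1]:
            raise ValueError("split point is outside the federation range")
        if key in self.boundaries:
            raise ValueError("split point is already a boundary")
        bisect.insort(self.boundaries, key)

    def members(self):
        return list(zip(self.boundaries[:-1], self.boundaries[1:]))

    def route(self, key):
        # Mirrors the idea of USE FEDERATION ... (key): find the member
        # whose half-open range contains the key.
        if not self.boundaries[0] <= key < self.boundaries[-1]:
            raise KeyError(key)
        index = bisect.bisect_right(self.boundaries, key) - 1
        return self.boundaries[index], self.boundaries[index + 1]


orders_fed = RangeFederation(5000, 10000)
orders_fed.split_at(7500)
print(orders_fed.members())
print(orders_fed.route(7509))
```

Because the routing decision is made against the current boundaries on every call, the caller never holds a stale copy of the shard map, which is the point of the "No Shard Map caching" bullet.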

Page 41

Azure NoSQL (Azure Table Storage)

Page 42

Table Storage Concepts

[Figure: Account -> Table -> Entity hierarchy]

• Account: contoso
  • Table: customers
    • Entity: Name =…, Email = …
    • Entity: Name =…, EMailAdd=
  • Table: photos
    • Entity: Photo ID =…, Date =…
    • Entity: Photo ID =…, Date =…

Page 43

Table Details

Not an RDBMS!

• Table
  • Create, Query, Delete
  • Tables can have metadata
• Entities
  • Insert
  • Update
    • Merge: partial update
    • Replace: update entire entity
  • Upsert
  • Delete
  • Query
  • Entity Group Transactions: multiple CUD operations in a single atomic transaction

Page 44

Entity Properties

• Entity can have up to 255 properties
  • Up to 1MB per entity
• Mandatory properties for every entity
  • PartitionKey & RowKey (only indexed properties)
    • Uniquely identify an entity
    • Define the sort order
  • Timestamp
    • Optimistic concurrency
    • Exposed as an HTTP ETag
• No fixed schema for other properties
  • Each property is stored as a <name, typed value> pair
  • No schema stored for a table
  • Properties can be the standard .NET types: String, binary, bool, DateTime, GUID, int, int64, and double
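The optimistic concurrency mentioned for the Timestamp property can be illustrated with an in-memory Python sketch. It is not the Azure storage client: `ToyTable`, `ConcurrencyError` and the `if_match` parameter are hypothetical names, and a random token stands in for the ETag, which the real service derives itself.

```python
import uuid


class ConcurrencyError(Exception):
    """Raised when an entity changed after the caller read it."""


class ToyTable:
    """In-memory stand-in for a table; not the Azure storage client."""

    def __init__(self):
        self._rows = {}  # (PartitionKey, RowKey) -> (etag, properties)

    def insert(self, partition_key, row_key, properties):
        key = (partition_key, row_key)
        if key in self._rows:
            raise KeyError(f"entity already exists: {key}")
        etag = uuid.uuid4().hex  # assumption: random token as the ETag
        self._rows[key] = (etag, dict(properties))
        return etag

    def get(self, partition_key, row_key):
        etag, properties = self._rows[(partition_key, row_key)]
        return etag, dict(properties)

    def replace(self, partition_key, row_key, properties, if_match):
        key = (partition_key, row_key)
        current_etag, _ = self._rows[key]
        if if_match != current_etag:
            raise ConcurrencyError("entity changed since it was read")
        new_etag = uuid.uuid4().hex
        self._rows[key] = (new_etag, dict(properties))
        return new_etag


table = ToyTable()
table.insert("Bikes", "Super Duper Cycle", {"ModelYear": 2009})

# Two clients read the same entity and receive the same ETag.
etag_a, entity_a = table.get("Bikes", "Super Duper Cycle")
etag_b, entity_b = table.get("Bikes", "Super Duper Cycle")

entity_a["ModelYear"] = 2010
table.replace("Bikes", "Super Duper Cycle", entity_a, if_match=etag_a)

entity_b["ModelYear"] = 2011
try:
    table.replace("Bikes", "Super Duper Cycle", entity_b, if_match=etag_b)
    conflict = False
except ConcurrencyError:
    conflict = True
print(conflict)
```

The second writer loses because its ETag is stale; it has to re-read the entity and retry rather than silently overwrite the first writer's change.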

Page 45

No Fixed Schema

FIRST    LAST     BIRTHDATE
Wade     Wegner   2/2/1981
Nathan   Totten   3/15/1965
Nick     Harris   May 1, 1976

FAV SPORT
Canoeing

Page 46

Querying

FIRST    LAST     BIRTHDATE
Wade     Wegner   2/2/1981
Nathan   Totten   3/15/1965
Nick     Harris   May 1, 1976

?$filter=Last eq 'Wegner'
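What the filter does against schema-less entities can be mimicked in Python with plain dictionaries. This is only an in-memory sketch: the `filter_eq` helper, the PartitionKey and RowKey values, and the choice of which entity carries the extra `FavSport` property are assumptions, not taken from the slides.

```python
# Each entity is a bag of properties; apart from the mandatory keys,
# no two entities need to share the same set of properties.
entities = [
    {"PartitionKey": "people", "RowKey": "1",
     "First": "Wade", "Last": "Wegner", "Birthdate": "2/2/1981"},
    {"PartitionKey": "people", "RowKey": "2",
     "First": "Nathan", "Last": "Totten", "Birthdate": "3/15/1965"},
    # Assumption: FavSport is attached to this entity purely for illustration.
    {"PartitionKey": "people", "RowKey": "3",
     "First": "Nick", "Last": "Harris", "Birthdate": "May 1, 1976",
     "FavSport": "Canoeing"},
]


def filter_eq(rows, prop, value):
    # In-memory equivalent of "$filter=<prop> eq '<value>'". Entities that
    # lack the property simply do not match.
    return [row for row in rows if row.get(prop) == value]


matches = filter_eq(entities, "Last", "Wegner")
print([row["First"] for row in matches])

canoeists = filter_eq(entities, "FavSport", "Canoeing")
print(len(canoeists))
```

Since `Last` is not one of the indexed properties (only PartitionKey and RowKey are), a filter like this one implies a scan, which is what the list comprehension does here.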

Page 47

Purpose of the PartitionKey

• Entity Locality
  • Entities in the same partition will be stored together
  • Efficient querying and cache locality
  • Endeavour to include the partition key in all queries
• Entity Group Transactions
  • Atomic multiple Insert/Update/Delete in the same partition in a single transaction
• Table Scalability
  • Target throughput: 500 tps/partition, several thousand tps/account
  • Windows Azure monitors the usage patterns of partitions
  • Automatically load balances partitions
  • Each partition can be served by a different storage node
  • Scales to meet the traffic needs of your table
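The rule that an entity group transaction is atomic and confined to one partition can be sketched in Python against an in-memory dictionary. The `apply_batch` function and the operation format are hypothetical and chosen only for this illustration; the real service defines its own batch protocol.

```python
import copy


def apply_batch(table, operations):
    """Apply every operation or none; `table` maps (PartitionKey, RowKey) to an entity."""
    partitions = {op["entity"]["PartitionKey"] for op in operations}
    if len(partitions) != 1:
        raise ValueError("all operations in a batch must share one PartitionKey")

    working = copy.deepcopy(table)  # stage the changes on a copy
    for op in operations:
        entity = op["entity"]
        key = (entity["PartitionKey"], entity["RowKey"])
        if op["action"] == "insert":
            if key in working:
                raise KeyError(f"entity already exists: {key}")
            working[key] = dict(entity)
        elif op["action"] == "update":
            if key not in working:
                raise KeyError(f"entity not found: {key}")
            working[key] = dict(entity)
        elif op["action"] == "delete":
            if key not in working:
                raise KeyError(f"entity not found: {key}")
            del working[key]
        else:
            raise ValueError(f"unknown action: {op['action']}")

    # Commit only after every operation has succeeded.
    table.clear()
    table.update(working)


products = {
    ("Bikes", "Super Duper Cycle"): {
        "PartitionKey": "Bikes", "RowKey": "Super Duper Cycle", "ModelYear": 2009},
}

apply_batch(products, [
    {"action": "insert", "entity": {
        "PartitionKey": "Bikes", "RowKey": "Quick Cycle 200 Deluxe", "ModelYear": 2007}},
    {"action": "update", "entity": {
        "PartitionKey": "Bikes", "RowKey": "Super Duper Cycle", "ModelYear": 2010}},
])
print(sorted(products))
```

Staging the changes on a copy is what gives the all-or-nothing behaviour in this sketch: if any operation raises, the original dictionary is left untouched.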

Page 48

Partitions and Partition Ranges

Products table on a single server (Server A, Table = Products):

PARTITIONKEY (CATEGORY)   ROWKEY (TITLE)            TIMESTAMP   MODELYEAR
Bikes                     Super Duper Cycle         …           2009
Bikes                     Quick Cycle 200 Deluxe    …           2007
…                         …                         …           …
Canoes                    Whitewater                …           2009
Canoes                    Flatwater                 …           2006
Rafts                     14ft Super Tourer         …           1999
…                         …                         …           …
Skis                      Fabrikam Back Trackers    …           2009
…                         …                         …           …
Tents                     Super Palace              …           2008

The same table spread over two servers by partition range (figure labels: Server A, Table = Products, [MinKey - Canoes); Server B, Table = Products, [Canoes - MaxKey)):

PARTITIONKEY (CATEGORY)   ROWKEY (TITLE)            TIMESTAMP   MODELYEAR
Bikes                     Super Duper Cycle         …           2009
Bikes                     Quick Cycle 200 Deluxe    …           2007
…                         …                         …           …
Canoes                    Whitewater                …           2009
Canoes                    Flatwater                 …           2006

PARTITIONKEY (CATEGORY)   ROWKEY (TITLE)            TIMESTAMP   MODELYEAR
Rafts                     14ft Super Tourer         …           1999
…                         …                         …           …
Skis                      Fabrikam Back Trackers    …           2009
…                         …                         …           …
Tents                     Super Palace              …           2008

Page 49

MANAGE ANY DATA, ANY SIZE, ANYWHERE

[Figure: data platform overview]

• Relational: SQL Server Database & Parallel Data Warehouse
• Non-Relational: Hadoop on Windows, Hadoop on Azure
• Streaming: StreamInsight
• Data Movement: Hadoop Connectors & ETL
• Unified Monitoring, Management & Security

Page 50

[Figure: Windows Azure platform layers, labelled Frameworks, Services, Fabric and Infrastructure]

• Global Physical Infrastructure: servers / network / datacenters
• Regions: N Central US, S Central US, N Europe, W Europe, E Asia, SE Asia + 24 Edge CDN Locations
• compute: virtual machines, web sites, cloud services
• storage: SQL database, noSQL database, blob storage
• networking: connect, virtual network, traffic manager
• Other services: caching, identity, service bus, media, cdn, big data, commerce, integration, analytics, hpc, mobile, ...
• Characteristics: Automated, Managed, Resources, Elastic, Usage Based

Page 51

www.microsoft.ch/shape

Page 52

Questions?