platform for big data, nosql and relational data. what makes sense for me? (+azure)

Michael Epprecht, Technology Evangelist

@fastflame

Agenda

Big Data

AllSQL, NoSQL, NewSQL, SomeSQL

Windows Azure

Big Data

WHAT IS BIG DATA?

[Figure: data volume (Megabytes, Gigabytes, Terabytes, Petabytes) plotted against Data Complexity: Variety and Velocity]

ERP/CRM: Payables, Payroll, Inventory, Contacts, Deal Tracking, Sales Pipeline

Web 2.0: Web Logs, Digital Marketing, Search Marketing, Recommendations, Advertising, Mobile, Collaboration, eCommerce

Big Data: Log files, Spatial & GPS coordinates, Data market feeds, eGov feeds, Weather, Text/image, Click stream, Wikis/blogs, Sensors/RFID/devices, Social sentiment, Audio/video

Original Gartner three V’s, Feb 2001: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

Volume (think data tiering): size of the data; manageability

Velocity (think CEP): speed at which data is received; latency to deliver data analysis

Variety (think ETL, ODS, email, social networks): differing formats of data; disparate source systems




Big Data to Data Analytics

Variety: Dealing with Un/Semi-structured and Structured

How do you mix Oranges and Apples? Compare Textual data with Relational

Tooling – accessing the “Variety” of different data sources

Determining “Value”: Big Data = proxy for doing more with existing data

Perspective: what you are doing
Hardware innovations over time: spinning disk vs. flash, GPGPU vs. CPU

Replacing BI?
Single version of the truth?
Conformed dimensions (standardised data reporting)
Four different operational systems ETL’d into a single dimension

Does Big Data change that? NO! YES!
Unstructured data is unstructured – can it be conformed?
Report on detail or aggregations?
No – analytics – we are data mining
Still needs standardisation and thought – formal design process

All data has Structure - not All data has Context
Data stored [in structure]: Image -> png, jpg, bmp etc.; Free-text -> ascii, unicode, .docx, xls etc.; Sound -> mp3, mpeg
Data queried: Image -> (?) face recognition, Kinect; Free-text -> grammar; Sound -> pitch, note etc.
Context? Image -> polygon; Free-text -> ??; Sound -> bars in the music??

Example values: “A1” and “Difficulties”
Has structure? Stored in Unicode
Has context? “A1” could mean anything; “Difficulties” – the word itself has meaning
Stored in Normal Form (Relational): RoadDesignator = A1, DrivingStatus = Difficulties

Notes:
Using Normal Form (Relational), context is provided by the schema
New term time – “uncontexted data” (115 Bing references)
Context gives data structure only when applied

Big Data Processing

                      Batch Processing   Interactive Analysis             Stream Processing
Query runtime         Minutes to hours   Milliseconds to minutes          Never-ending
Data volume           TBs to PBs         GBs to PBs                       Continuous stream
Programming model     MapReduce          Queries                          DAG
Users                 Developers         Analysts and developers          Developers
Originating project   Google MapReduce   Google Dremel                    Twitter Storm
Open source project   Hadoop / Spark     Drill / Shark / Impala / HBase   Storm / Apache S4 / Kafka

We’ve been hyped
Bandwagon is rolling
If you hear a new term – research it; probably nothing new

Finally: what is Big Data (really)?
Data analytics (stuff we already do)
What is new?
New toolsets to help with the variety of data
Industry waking up to the power of commodity kit
Data Science as a field (combination of a BI Analyst, Business Analyst and BI Developer)
It’s still all about insights into our data
Hadoop – the platform of the next generation?

Look out for the name change: Big Data will become Data Analytics

A NEW SET OF QUESTIONS

SOCIAL & WEB ANALYTICS: What’s the social sentiment for my brand or products?

LIVE DATA FEEDS: How do I optimize my fleet based on weather and traffic patterns?

ADVANCED ANALYTICS: How do I better predict future outcomes?

COMMON BIG DATA CUSTOMER SCENARIOS

GAIN COMPETITIVE ADVANTAGE BY MOVING FIRST AND FAST IN YOUR INDUSTRY

Web app optimization

Smart meter monitoring

Equipment monitoring

Advertising analysis

Life sciences research

Fraud detection

Healthcare outcomes

Weather forecasting

Natural resource exploration

Social network analysis

Churn analysis

Traffic flow optimization

IT infrastructure optimization

Legal discovery

What is Hadoop?

Massively Parallel Processing (MPP)
Chop a task up across multiple physical machines
High Performance Clustering (HPC)
Distributed Data Processing (DDP)
Processing done locally on data
MapReduce is based on something we know already

Why MPP?
Because enterprise kit for this performance is way too expensive
100 machines with cheap DAS cost a fraction of a scale-up machine with expensive SAN infrastructure
Most NoSQL and NewSQL products are built with MPP and commodity kit as a design feature. Cloud computing model also
Network connectivity is a key component (oh, hence take the processing to the data!)
Follows the design paradigm that processing should move to the data and not the data to the processing

What is Hadoop?
Open source project coordinated by Apache
Analogous to an OS; core components: Utilities, HDFS, MapReduce
Lots of other projects that sit within the ecosphere: Mahout, Sqoop, Flume, Scribe, Oozie, Jaql, Hue, Hiho, Hive, Pig, HBase, … and more and more…

• V1.0.0 and V2.0.0 code branches

HBase: persistent | distributed

• In Memory

• Efficient at Random Reads/Writes

• Distributed, large scale data store

• Utilizes Hadoop for persistence

• Both HBase and Hadoop are distributed

In Hadoop MapReduce speak

Map: parse the input line to get the data you want; output: key (presented to a single reducer), value pair (what we will likely aggregate)

Shuffle: sort and move same “keys” to the same node for reduction (can be expensive – plan your data partitions properly)

Reduce: aggregate values; output

http://developer.yahoo.com/hadoop/tutorial/module4.html
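The three phases above can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: parse one input line and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: bring all values for the same key together.

    On a real cluster this is the step that moves data between nodes.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for a single key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


word_counts = run_job(["big data big insights", "data analytics"])
```

Each key ends up with exactly one reducer call, which is why a skewed key distribution makes the shuffle expensive.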




MapReduce as SQL

Map = SELECT FROM WHERE

Reduce = GROUP BY
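To make the analogy concrete, the sketch below runs the same aggregation twice: once as a SQL query (using Python's built-in `sqlite3` as a stand-in for any relational engine) and once as a hand-written map and reduce. The `sales` table, its columns and its rows are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (category, model_year, qty)
rows = [
    ("bikes", 2009, 3),
    ("bikes", 2007, 1),
    ("canoes", 2009, 2),
    ("canoes", 2006, 5),
]

# SQL version: SELECT ... FROM ... WHERE ... GROUP BY
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, model_year INTEGER, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
sql_result = dict(conn.execute(
    "SELECT category, SUM(qty) FROM sales WHERE model_year >= 2007 GROUP BY category"
))
conn.close()

# Map: the SELECT / FROM / WHERE part (project the columns, filter the rows)
mapped = [(category, qty) for category, year, qty in rows if year >= 2007]

# Reduce: the GROUP BY part (aggregate the values per key)
totals = defaultdict(int)
for category, qty in mapped:
    totals[category] += qty
mapreduce_result = dict(totals)
```

Both paths should produce the same per-category totals; the difference is only in who writes the plumbing.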

AllSQL, NoSQL, NewSQL and SomeSQL

AllSQL
Data stored in Normal Form
ACID for consistency and durability
Queries done using ANSI SQL
Basically what the majority of folk do
The majority of reporting products use SQL as an interface
Everybody knows SQL (despite its sins)
Easy to understand and get going with

NoSQL (Not Only SQL)
Led by developers wanting:
More flexible data structures (dynamic schema)
Ability to store non-tabular data
Higher scalability – scale out
Hardware cost – build on commodity kit
Durability and consistency not a primary concern
Open source – move away from proprietary products
Data resilience built into the product through replicas rather than expensive hardware and software solutions

Examples
See http://nosql-database.org/ - there are 100s!
Azure Table Store, Google’s BigTable, Hadoop MapReduce, Cassandra, RavenDB, CouchDB, MongoDB

NoSQL momentum
The argument: an RDBMS cannot scale out because of ACID (Atomicity, Consistency, Isolation, Durability)
Swathe of new open source products
Data captured has value but is not readily accessible

NewSQL – will it “cure” the NoSQL problem?

NewSQL
Existing AllSQL products:
Do not scale out well
Single machine design
Design is several decades old
Expensive to create a DR/HA environment

Realisation
Folk do not want to learn Java in order to report off their data
Most toolsets use SQL as a method for reporting

Examples: VoltDB, NuoDB, Azure DB

AllSQL, NoSQL, NewSQL and SomeSQL
The days when everything lives in SQL Server are going
BI/BA/DA {whatever you want to call it} is done across different data sources – semi/un/fully structured
Understand the non-relational world
The SQL language isn’t going anywhere
This isn’t about enterprise only – this affects us all

Windows Azure

Relational, Non-Relational, Streaming

MANAGE ANY DATA, ANY SIZE, ANYWHERE


Unified Monitoring, Management & Security

Data Movement

HADOOP INTEGRATED INTO THE DATA PLATFORM

Non-Relational

Enterprise class security, HA & management
Seamlessly integrated with Microsoft BI tools
Windows simplicity and manageability
Provisioned in minutes on Windows Azure

Microsoft HDInsight Server for on-premises
Windows Azure HDInsight Service for cloud

BUILT ON HORTONWORKS DATA PLATFORM (HDP)

[Figure: Hadoop architecture. Components shown:]
Distributed Storage (HDFS)
Distributed Processing (MapReduce)
Query (Hive)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Business Intelligence (Excel, Power View …)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Pipeline / workflow (Oozie)
Log file aggregation (Flume)
Active Directory (Security)
System Center

INSIGHTS FOR ALL USERS THROUGH FAMILIAR TOOLS

Advanced Analytics from Microsoft and 3rd parties

Self Service Analysis with PowerPivot & Power View

Interactivity & exploration with Hadoop data in Excel

PB TB GB

BI Professionals, Business Analysts, Data Scientists

Azure SQL Database

SQL Database Architecture

Architecture
Federation: an object contained within a user database; defines the scheme for the federation; represents the database being sharded
Federation Root: database that houses the federation object
Federation Member: system-managed SQL databases; contain part, or “slices”, of the data

[Figure: SalesDB as the Federation Root, holding the federations (Orders_federation, Orders_Fed), each made up of Federation Members]

CREATE FEDERATION fed_name(fed_key_label fed_key_type distribution_type)


Architecture Cont.
Federation Key: the key used for data distribution (int, bigint, guid, varbinary)
Atomic Unit: represents a single instance of a federation key, that is, all rows in all federated tables with the same federation key value

[Figure: Atomic Units AU PK=5, AU PK=25, AU PK=35, AU PK=1005, AU PK=1025, AU PK=1035; Member: range [1000, 2000)]

Architecture Cont.
Federated Table: contains only atomic units for the member’s key range
Reference Table: non-federated table

Repartitioning

SalesDB member: [5000, 10000)

ALTER FEDERATION Orders_Fed SPLIT AT (tenant_id=7500)

Result: [5000, 7500) & [7500, 10000)

Dynamic Partitioning
SPLIT members to spread workloads over to more nodes
DROP members to shrink back to fewer nodes

Reliable Routing

SalesDB members: [5000, 7500) & [7500, 10000)

USE FEDERATION Orders_Fed (tenant_id=7509)

Built-in Data-Dependent Routing (DDR)
Ensure apps can discover where the data is just-in-time

No “Shard Map” caching

Guaranteed member routing
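The range logic behind SPLIT and data-dependent routing can be modelled in a few lines. The sketch below is a toy, not the SQL Database implementation: the `Federation` class and its `split_at`, `members` and `route` methods are hypothetical names, and the only behaviour it mirrors from the slides is that members own half-open key ranges [low, high).

```python
import bisect


class Federation:
    """Toy model of range-based sharding with half-open ranges [low, high)."""

    def __init__(self, low, high):
        # Sorted list of range boundaries; two boundaries mean one member.
        self.boundaries = [low, high]

    def split_at(self, key):
        """Split the member that owns `key` into two members at `key`."""
        if not self.boundaries[0] < key < self.boundaries[-1]:
            raise ValueError("split point must lie inside the federation range")
        if key in self.boundaries:
            raise ValueError("a member already starts at this key")
        bisect.insort(self.boundaries, key)

    def members(self):
        """Return the current members as (low, high) tuples."""
        return list(zip(self.boundaries, self.boundaries[1:]))

    def route(self, key):
        """Return the (low, high) range of the member that owns `key`."""
        if not self.boundaries[0] <= key < self.boundaries[-1]:
            raise KeyError(key)
        index = bisect.bisect_right(self.boundaries, key) - 1
        return self.boundaries[index], self.boundaries[index + 1]


fed = Federation(5000, 10000)   # one member: [5000, 10000)
fed.split_at(7500)              # like SPLIT AT (tenant_id=7500)
member = fed.route(7509)        # like USE FEDERATION ... (tenant_id=7509)
```

Because the lower bound is inclusive and the upper bound exclusive, a key equal to the split point always lands in the upper member.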

Azure NoSQL (Azure Table Storage)

Table Storage Concepts

Account, Table, Entity

[Figure: account “contoso” contains the tables “customers” (entities with Name = …, Email = … and Name = …, EMailAdd = …) and “photos” (entities with Photo ID = …, Date = …)]

Table Details

Not an RDBMS!

Table: Create, Query, Delete; tables can have metadata

Entities:
Insert
Update: Merge – partial update; Replace – update entire entity
Upsert
Delete
Query
Entity Group Transactions: multiple CUD operations in a single atomic transaction

Entity Properties
Entity can have up to 255 properties
Up to 1MB per entity
Mandatory properties for every entity:
PartitionKey & RowKey (the only indexed properties): uniquely identify an entity; define the sort order
Timestamp: optimistic concurrency; exposed as an HTTP ETag
No fixed schema for other properties
Each property is stored as a <name, typed value> pair
No schema stored for a table
Properties can be the standard .NET types: String, binary, bool, DateTime, GUID, int, int64, and double
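Optimistic concurrency via an ETag can be illustrated with a small in-memory model. This is a sketch under assumptions, not the storage service's API: `ToyTable`, `PreconditionFailed` and the integer version counter are invented for the example (the slide only says the Timestamp is exposed as an HTTP ETag), and the property values are placeholders.

```python
import itertools


class PreconditionFailed(Exception):
    """Raised when the caller's ETag no longer matches the stored entity."""


class ToyTable:
    """In-memory stand-in for a table keyed by (PartitionKey, RowKey)."""

    def __init__(self):
        self._entities = {}
        self._versions = itertools.count(1)  # assumption: a counter plays the ETag role

    def insert(self, partition_key, row_key, **properties):
        key = (partition_key, row_key)
        if key in self._entities:
            raise KeyError("entity already exists")
        etag = next(self._versions)
        self._entities[key] = (etag, dict(properties))
        return etag

    def get(self, partition_key, row_key):
        etag, properties = self._entities[(partition_key, row_key)]
        return etag, dict(properties)

    def replace(self, partition_key, row_key, etag, **properties):
        """Replace the entity only if it is unchanged since `etag` was read."""
        current_etag, _ = self._entities[(partition_key, row_key)]
        if etag != current_etag:
            raise PreconditionFailed("entity changed since it was read")
        new_etag = next(self._versions)
        self._entities[(partition_key, row_key)] = (new_etag, dict(properties))
        return new_etag


table = ToyTable()
table.insert("customers", "c1", Name="first", Email="old")

# Two clients read the same entity and both try to write it back.
etag_a, _ = table.get("customers", "c1")
etag_b, _ = table.get("customers", "c1")
table.replace("customers", "c1", etag_a, Name="first", Email="new")
try:
    table.replace("customers", "c1", etag_b, Name="first", Email="other")
    conflict_detected = False
except PreconditionFailed:
    conflict_detected = True
```

The second writer loses without any lock being held; it has to re-read the entity and retry.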

No Fixed Schema

FIRST LAST BIRTHDATE

Wade Wegner 2/2/1981

Nathan Totten 3/15/1965

Nick Harris May 1, 1976

FAV SPORT

Canoeing

Querying

FIRST LAST BIRTHDATE

Wade Wegner 2/2/1981

Nathan Totten 3/15/1965

Nick Harris May 1, 1976

?$filter=Last eq 'Wegner'
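A rough Python sketch of what such a filter does against schema-less entities follows. The entities are plain dictionaries, the `filter_eq` helper is a hypothetical name, the PartitionKey and RowKey values are invented, and which entity carries the extra `FavSport` property is an arbitrary choice for the example. The last lines build the percent-encoded form of the query string with the standard library.

```python
from urllib.parse import quote

# Entities in one table need not share the same set of properties.
entities = [
    {"PartitionKey": "people", "RowKey": "1",
     "First": "Wade", "Last": "Wegner", "Birthdate": "2/2/1981"},
    {"PartitionKey": "people", "RowKey": "2",
     "First": "Nathan", "Last": "Totten", "Birthdate": "3/15/1965"},
    {"PartitionKey": "people", "RowKey": "3",
     "First": "Nick", "Last": "Harris", "Birthdate": "May 1, 1976",
     "FavSport": "Canoeing"},  # assumption: extra property placed on this entity
]


def filter_eq(items, prop, value):
    """Return the entities whose property `prop` equals `value`.

    Entities that lack the property simply do not match.
    """
    return [item for item in items if item.get(prop) == value]


matches = filter_eq(entities, "Last", "Wegner")
canoeists = filter_eq(entities, "FavSport", "Canoeing")

# The same filter as a percent-encoded query string.
query_string = "?$filter=" + quote("Last eq 'Wegner'")
```

Since only PartitionKey and RowKey are indexed, a filter on `Last` alone implies a scan, which is why the slides recommend including the partition key in queries.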

Purpose of the PartitionKey

Entity Locality
Entities in the same partition will be stored together
Efficient querying and cache locality
Endeavour to include the partition key in all queries

Entity Group Transactions
Atomic multiple Insert/Update/Delete in the same partition in a single transaction

Table Scalability
Target throughput – 500 tps/partition, several thousand tps/account
Windows Azure monitors the usage patterns of partitions
Automatically load balances partitions
Each partition can be served by a different storage node
Scale to meet the traffic needs of your table
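The same-partition rule for entity group transactions can be sketched as follows. This is an illustrative model only: `apply_entity_group_transaction`, the operation tuples and the `Colour` property are assumptions, and a dictionary keyed by (PartitionKey, RowKey) stands in for the table.

```python
import copy


def apply_entity_group_transaction(table, operations):
    """Apply all operations atomically, or none of them.

    `table` maps (PartitionKey, RowKey) to a dict of properties.
    `operations` is a list of (action, key, properties) tuples where
    action is "insert", "merge" or "delete".
    """
    partition_keys = {key[0] for _action, key, _props in operations}
    if len(partition_keys) != 1:
        raise ValueError("all operations must target exactly one partition")

    staged = copy.deepcopy(table)  # work on a copy so a failure changes nothing
    for action, key, properties in operations:
        if action == "insert":
            if key in staged:
                raise KeyError(f"entity already exists: {key}")
            staged[key] = dict(properties)
        elif action == "merge":
            staged[key].update(properties)  # partial update; KeyError if missing
        elif action == "delete":
            del staged[key]                 # KeyError if missing
        else:
            raise ValueError(f"unknown action: {action}")

    table.clear()
    table.update(staged)  # commit only after every operation succeeded


products = {
    ("Bikes", "Super Duper Cycle"): {"ModelYear": 2009},
    ("Canoes", "Whitewater"): {"ModelYear": 2009},
}

apply_entity_group_transaction(products, [
    ("insert", ("Bikes", "Quick Cycle 200 Deluxe"), {"ModelYear": 2007}),
    ("merge", ("Bikes", "Super Duper Cycle"), {"Colour": "red"}),  # hypothetical property
])
```

A batch that touches both "Bikes" and "Canoes" is rejected up front, which reflects why the choice of partition key also decides what can be updated atomically.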

[Figure: the Products table, shown whole and split across partitions]

PARTITIONKEY (CATEGORY)   ROWKEY (TITLE)            TIMESTAMP   MODELYEAR
Bikes                     Super Duper Cycle         …           2009
Bikes                     Quick Cycle 200 Deluxe    …           2007
…                         …                         …           …
Canoes                    Whitewater                …           2009
Canoes                    Flatwater                 …           2006
Rafts                     14ft Super Tourer         …           1999
…                         …                         …           …
Skis                      Fabrikam Back Trackers    …           2009
…                         …                         …           …
Tents                     Super Palace              …           2008

Partitions and Partition Ranges

Server A, Table = Products

Server B, Table = Products: [Canoes - MaxKey)

Server A, Table = Products: [MinKey - Canoes)

MANAGE ANY DATA, ANY SIZE ANYWHERE

Relational: SQL Server Database & Parallel Data Warehouse

Non-Relational: Hadoop on Windows, Hadoop on Azure

Streaming: StreamInsight

Data Movement: Hadoop Connectors & ETL

Unified Monitoring, Management & Security

[Figure: Windows Azure platform layers: Frameworks, Services, Fabric, Infrastructure]

Global Physical Infrastructure: servers / network / datacenters
N Central US, S Central US, N Europe, W Europe, E Asia, SE Asia + 24 Edge CDN Locations

Services: caching, identity, service bus, media, CDN, big data, commerce, integration, analytics, HPC, mobile

Compute: virtual machines, web sites, cloud services

Storage: SQL database, NoSQL database, blob storage

Networking: connect, virtual network, traffic manager

Automated

Managed

Resources

Elastic

Usage Based

www.microsoft.ch/shape

Questions?
