platform for big data, nosql and relational data. what makes sense for me? (+azure)
PLATFORM FOR BIG DATA, NOSQL AND RELATIONAL DATA. WHAT MAKES SENSE FOR ME? (+AZURE)
Michael Epprecht
Technology Evangelist
[email protected] @fastflame
Agenda
Big Data
AllSQL, NoSQL, NewSQL, SomeSQL
Windows Azure
Big Data
WHAT IS BIG DATA?
Data Complexity: Variety and Velocity
Terabytes
Gigabytes
Megabytes
Petabytes
Big Data
Log files
Spatial & GPS coordinates
Data market feeds
eGov feeds
Weather
Text/image
Click stream
Wikis/blogs
Sensors/RFID/devices
Social sentiment
Audio/video
Web 2.0
Web Logs
Digital Marketing
Search Marketing
Recommendations
Advertising
Mobile
Collaboration
eCommerce
ERP/CRM
Payables
Payroll
Inventory
Contacts
Deal Tracking
Sales Pipeline
Original Gartner three V’s, Feb 2001: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Volume (think data tiering) Size of the data Manageability
Velocity (think CEP) Speed at which data is received Latency to deliver data analysis
Variety (think ETL, ODS, Email, Social Networks) Differing formats of data Disparate source systems
Big Data to Data Analytics
Variety: Dealing with Un/Semi-structured and Structured
How do you mix Oranges and Apples? Compare Textual data with Relational
Tooling – accessing the “Variety” of different data sources
Determining “Value”
Big Data = Proxy for doing more with existing data
Perspective
What you are doing
Hardware innovations over time
Spinning disk vs Flash
GPGPU vs CPU
Replacing BI? Single Version of the Truth? Conformed dimensions (standardised data reporting) Four different operational systems ETL’d into single dimension
Does Big Data change that? NO! YES! Unstructured data is unstructured – can it be conformed? Report on Detail or Aggregations?
No – Analytics – we are data mining
Still needs standardisation and thought – formal design process
All data has Structure - not All data has Context Data stored [in structure]
Image -> png, jpg, bmp etc. Free-text -> ascii, unicode, .docx, xls etc. Sound -> mp3, mpeg
Data queried Image -> (?) face recognition, Kinect Free-text -> grammar Sound -> Pitch, Note etc.
Context? Image -> Polygon Free-text -> ?? Sound -> Bars in the Music??
has Structure?
Stored in Unicode: A1, difficulties
A1 – could mean anything
Difficulties – the word itself has meaning
has Context?
Stored in Normal Form (Relational):
RoadDesignator   DrivingStatus
A1               Difficulties
Notes:
Using Normal Form (Relational), context is provided by the schema
New term time – Uncontexted data (115 Bing references)
Context gives data structure only when applied
Big Data Processing
                      Batch Processing    Interactive Analysis             Stream Processing
Query runtime         Minutes to hours    Milliseconds to minutes          Never-ending
Data volume           TBs to PBs          GBs to PBs                       Continuous stream
Programming model     MapReduce           Queries                          DAG
Users                 Developers          Analysts and developers          Developers
Originating project   Google MapReduce    Google Dremel                    Twitter Storm
Open source project   Hadoop / Spark      Drill / Shark / Impala / Hbase   Storm / Apache S4 / Kafka
We’ve been Hyped
Bandwagon is rolling
If you hear a new term – research it; it is probably nothing new
Finally: What is Big Data (really)? Data Analytics (stuff we already do) What is new?
New toolsets to help with variety of data
Industry waking up to the power of commodity kit
Data Science as a field (combination of a BI Analyst, Business Analyst and BI Developer)
It’s still all about Insights into our data
Hadoop – the platform of the next generation?
Look out for the name change Big Data will become Data Analytics
A NEW SET OF QUESTIONS
SOCIAL & WEB ANALYTICS: What’s the social sentiment for my brand or products?
LIVE DATA FEEDS: How do I optimize my fleet based on weather and traffic patterns?
ADVANCED ANALYTICS: How do I better predict future outcomes?
COMMON BIG DATA CUSTOMER SCENARIOS
GAIN COMPETITIVE ADVANTAGE BY MOVING FIRST AND FAST IN YOUR INDUSTRY
Web app optimization
Smart meter monitoring
Equipment monitoring
Advertising analysis
Life sciences research
Fraud detection
Healthcare outcomes
Weather forecasting
Natural resource exploration
Social network analysis
Churn analysis
Traffic flow optimization
IT infrastructure optimization
Legal discovery
What is Hadoop?
Massively Parallel Processing (MPP)
Chop a task up across multiple physical machines
High Performance Clustering (HPC)
Distributed Data Processing (DDP)
Processing done locally on Data
MapReduce is based on something we know already
Why MPP?
Because Enterprise kit for this performance is way too expensive. 100 machines with cheap DAS cost a fraction of a scale-up machine with expensive SAN infrastructure
Most NoSQL and NewSQL products are built with MPP and commodity kit as a design feature. Cloud computing model also
Network connectivity is a key component (oh, hence take the processing to the data!)
Follows the design paradigm that processing should move to the data and not the data to the processing
What is Hadoop? Open source project coordinated by Apache Analogous to an OS; core components:
Utilities HDFS MapReduce
Lots of other projects that sit within the ecosphere:
Mahout, Sqoop, Flume, Scribe, Oozie, Jaql, Hue, Hiho, Hive, Pig, Hbase, … and more and more…
• V1.0.0 and V2.0.0 code branches
HBase: persistent | distributed
• In Memory
• Efficient at Random Reads/Writes
• Distributed, large scale data store
• Utilizes Hadoop for persistence
• Both HBase and Hadoop are distributed
In Hadoop MapReduce speak
Map
Parse input line to get the data you want; output: key (presented to a single reducer), value pair (what we will likely aggregate)
Shuffle
Sort and move same “keys” to the same node for reduction (can be expensive – plan your data partitions properly)
Reduce
Aggregate values
Output
http://developer.yahoo.com/hadoop/tutorial/module4.html
MapReduce as SQL Map = SELECT FROM WHERE
Reduce = GROUP BY
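The three phases can be sketched in a few lines of plain Python. This is a single-process illustration only, not the Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the sample "page hits" input format are assumptions made for the example. It follows the SQL analogy above: map selects and filters, reduce groups and aggregates.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: parse each input line and emit (key, value) pairs."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # plays the role of WHERE: drop malformed lines
        page, hits = parts
        yield page, int(hits)


def shuffle_phase(pairs):
    """Shuffle: bring all values that share a key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key (GROUP BY with SUM)."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["/home 3", "/about 1", "/home 2", "broken line here"]
    print(run_job(sample))
```

On a real cluster the shuffle moves data between machines, which is why the slide warns that it can be expensive.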
AllSQL, NoSQL, NewSQL and SomeSQL
AllSQL Data stored in Normal Form ACID for consistency and durability Queries done using ANSI SQL Basically what the majority of folk do The majority of reporting products use SQL as an interface
Everybody knows SQL (despite its sins) Easy to understand and get going with
NoSQL (Not Only SQL)
Led by Developers wanting:
More flexible data structures (dynamic schema)
Ability to store non-tabular data
Higher Scalability – scale out
Hardware cost – build on commodity kit
Durability and consistency not a primary concern
Open source – move away from proprietary products
Data resilience built into the product through replicas rather than expensive hardware and software solutions
Examples See http://nosql-database.org/ - there are 100’s! Azure Table Store Google’s BigTable HADOOP MapReduce Cassandra RavenDB CouchDB MongoDB
NoSQL momentum RDBMS cannot scale because of ACID (Atomicity, Consistency, Isolation, Durability)
Swathe of new open source products Data captured has value but not readily accessible
NewSQL – will it “cure” the NoSQL problem?
NewSQL Existing AllSQL
Products do not scale out well Single machine design Design is several decades old Expensive to create a DR/HA environment
Realisation Folk do not want to learn Java in order to report off their data Most toolsets use SQL as a method for reporting
Examples VoltDB NuoDB Azure DB
AllSQL, NoSQL, NewSQL and SomeSQL
The days when everything lives in SQL Server are going
BI/BA/DA {whatever you want to call it} done across different data sources – semi/un/fully structured
Understand the non-relational world The SQL language isn’t going anywhere This isn’t about enterprise only – this affects us all
Windows Azure
Relational
Non-Relational
Streaming
MANAGE ANY DATA, ANY SIZE, ANYWHERE
Unified Monitoring, Management & Security
Data Movement
HADOOP INTEGRATED INTO THE DATA PLATFORM
Non-Relational
Enterprise class security, HA & management
Seamlessly integrated with Microsoft BI tools
Windows Simplicity and Manageability
Provisioned in minutes on Windows Azure
Microsoft HDInsight Server for on-premises
Windows Azure HDInsight Service for cloud
BUILT ON HORTONWORKS DATA PLATFORM (HDP)
Hadoop architecture:
Distributed Storage (HDFS)
Distributed Processing (MapReduce)
Query (Hive)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Business Intelligence (Excel, Power View, …)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Pipeline / workflow (Oozie)
Log file aggregation (Flume)
Active Directory (Security)
System Center
INSIGHTS FOR ALL USERS THROUGH FAMILIAR TOOLS
Advanced Analytics from Microsoft and 3rd parties
Self Service Analysis with PowerPivot & Power View
Interactivity & exploration with Hadoop data in Excel
PB TB GB
BI Professionals
Business Analysts
Data Scientists
Azure SQL Database
SQL Database Architecture
Architecture
Federation
An object contained within a user database
Defines the scheme for the federation
Represents the database being sharded
Federation Root
Database that houses the federation object
Federation Member
System managed SQL databases
Contain part, or “slices”, of data
SalesDB (Federation Root)
Federations: Orders_Fed
Federation Members
CREATE FEDERATION fed_name(fed_key_label fed_key_type distribution_type)
Architecture Cont.
Federation Key
The key used for data distribution
int, bigint, guid, varbinary
Atomic Unit
Represents a single instance of a federation key: all rows in all federated tables with the same federation key value
Member: range [1000, 2000)
Atomic Units: AU PK=5, AU PK=25, AU PK=35, AU PK=1005, AU PK=1025, AU PK=1035
Architecture Cont.
Federated Table
Contains only atomic units for member’s key range
Reference Table
Non-Federated table
Repartitioning
SalesDB
Orders_Fed
[5000, 10000)
ALTER FEDERATION Orders_Fed SPLIT AT (tenant_id=7500)
[5000, 7500) & [7500, 10000)
Dynamic Partitioning
SPLIT members to spread workloads over more nodes
DROP members to shrink back to fewer nodes
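What a SPLIT does to the member ranges can be sketched in Python. This is only a model of the range bookkeeping, under the assumption that members are half-open ranges `[low, high)` as shown on the slide. The function name `split_at` is hypothetical, and the real operation also moves the data online, which is not modelled here.

```python
def split_at(members, boundary):
    """Return a new member list where the range containing `boundary` is split in two.

    `members` is a list of half-open (low, high) ranges. A boundary that
    already sits on an existing range edge changes nothing.
    """
    result = []
    for low, high in members:
        if low < boundary < high:
            result.append((low, boundary))
            result.append((boundary, high))
        else:
            result.append((low, high))
    return result


if __name__ == "__main__":
    members = [(5000, 10000)]
    print(split_at(members, 7500))  # [(5000, 7500), (7500, 10000)]
```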
Reliable Routing
SalesDB
Orders_Fed
[5000, 7500) & [7500, 10000)
USE FEDERATION Orders_Fed (tenant_id=7509)
Built-in Data-Dependent Routing (DDR)
Ensure apps can discover where the data is just-in-time
No “Shard Map” caching
Guaranteed member routing
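The idea behind data-dependent routing, finding the member whose range holds a given federation key, can be illustrated with a small Python lookup. This is a sketch of the concept only. In SQL Database the routing happens server-side through `USE FEDERATION`, so the `FederationRouter` class and its method names are hypothetical.

```python
from bisect import bisect_right


class FederationRouter:
    """Maps a federation key to the member range [low, high) that holds it."""

    def __init__(self, lower_bounds, upper_bound):
        # lower_bounds: sorted inclusive lower bound of each member
        # upper_bound: exclusive upper bound of the last member
        self.lower_bounds = list(lower_bounds)
        self.upper_bound = upper_bound

    def member_range(self, key):
        if (not self.lower_bounds
                or key < self.lower_bounds[0]
                or key >= self.upper_bound):
            raise KeyError(f"no member covers key {key}")
        # Index of the last lower bound that is <= key
        index = bisect_right(self.lower_bounds, key) - 1
        low = self.lower_bounds[index]
        if index + 1 < len(self.lower_bounds):
            high = self.lower_bounds[index + 1]
        else:
            high = self.upper_bound
        return (low, high)


router = FederationRouter([5000, 7500], 10000)

if __name__ == "__main__":
    print(router.member_range(7509))  # (7500, 10000)
```

Because the service resolves the member at connection time, the application does not have to cache a shard map like this one. That is the point of the "No Shard Map caching" bullet.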
Azure NoSQL (Azure Table Storage)
Table Storage Concepts
Account, Table, Entity
Account: contoso
Table: customers
Entity: Name = …, Email = …
Entity: Name = …, EMailAdd = …
Table: photos
Entity: Photo ID = …, Date = …
Entity: Photo ID = …, Date = …
Table Details
Not an RDBMS!
Table
Create, Query, Delete
Tables can have metadata
Entities
Insert
Update
Merge – Partial update
Replace – Update entire entity
Upsert
Delete
Query
Entity Group Transactions
Multiple CUD Operations in a single atomic transaction
Entity Properties
Entity can have up to 255 properties
Up to 1MB per entity
Mandatory Properties for every entity
PartitionKey & RowKey (only indexed properties)
Uniquely identifies an entity
Defines the sort order
Timestamp
Optimistic Concurrency
Exposed as an HTTP ETag
No fixed schema for other properties
Each property is stored as a <name, typed value> pair
No schema stored for a table
Properties can be the standard .NET types
String, binary, bool, DateTime, GUID, int, int64, and double
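Optimistic concurrency via an ETag can be illustrated with a small in-memory model in Python. This is not the Azure storage client: `InMemoryTable`, `PreconditionFailed` and the method names are hypothetical, and the sketch uses a random token as the ETag where the service derives it from the Timestamp.

```python
import uuid


class PreconditionFailed(Exception):
    """Raised when the caller's ETag no longer matches the stored entity."""


class InMemoryTable:
    def __init__(self):
        # (PartitionKey, RowKey) -> (etag, properties)
        self._rows = {}

    def insert(self, partition_key, row_key, properties):
        key = (partition_key, row_key)
        if key in self._rows:
            raise KeyError(f"entity {key} already exists")
        etag = uuid.uuid4().hex
        self._rows[key] = (etag, dict(properties))
        return etag

    def get(self, partition_key, row_key):
        etag, properties = self._rows[(partition_key, row_key)]
        return etag, dict(properties)

    def replace(self, partition_key, row_key, properties, if_match):
        """Replace the whole entity, but only if nobody changed it since it was read."""
        key = (partition_key, row_key)
        current_etag, _ = self._rows[key]
        if if_match != current_etag:
            raise PreconditionFailed(f"entity {key} was modified by someone else")
        new_etag = uuid.uuid4().hex
        self._rows[key] = (new_etag, dict(properties))
        return new_etag
```

A writer reads the entity together with its ETag, changes it, and sends the ETag back with the update. If another writer got in first the ETag no longer matches and the update is rejected, so the first writer re-reads and retries instead of silently overwriting.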
No Fixed Schema
FIRST LAST BIRTHDATE
Wade Wegner 2/2/1981
Nathan Totten 3/15/1965
Nick Harris May 1, 1976
FAV SPORT
Canoeing
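A schema-less table like the one above can be modelled in Python as a list of dictionaries where only PartitionKey and RowKey are mandatory. The key values, the property name `FavSport` and the helper `validate_entity` are assumptions for illustration. Which person the FAV SPORT value belongs to is not recoverable from the slide, so attaching it to one entity is also an assumption.

```python
MAX_PROPERTIES = 255  # limit quoted on the Entity Properties slide
MANDATORY = ("PartitionKey", "RowKey")  # Timestamp is maintained by the service


def validate_entity(entity):
    """Check the mandatory keys and the property limit; return the entity unchanged."""
    missing = [name for name in MANDATORY if name not in entity]
    if missing:
        raise ValueError(f"missing mandatory properties: {missing}")
    if len(entity) > MAX_PROPERTIES:
        raise ValueError(f"entity has more than {MAX_PROPERTIES} properties")
    return entity


people = [
    validate_entity({"PartitionKey": "people", "RowKey": "wegner-wade",
                     "First": "Wade", "Last": "Wegner", "Birthdate": "2/2/1981"}),
    validate_entity({"PartitionKey": "people", "RowKey": "totten-nathan",
                     "First": "Nathan", "Last": "Totten", "Birthdate": "3/15/1965"}),
    # Same table, different shape: an extra property and a differently formatted date
    validate_entity({"PartitionKey": "people", "RowKey": "harris-nick",
                     "First": "Nick", "Last": "Harris", "Birthdate": "May 1, 1976",
                     "FavSport": "Canoeing"}),
]

if __name__ == "__main__":
    all_names = sorted({name for entity in people for name in entity})
    print(all_names)
```

Nothing stops two entities in the same table from carrying different properties, or the same property in different formats. Any consistency has to be enforced by the application.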
Querying
FIRST LAST BIRTHDATE
Wade Wegner 2/2/1981
Nathan Totten 3/15/1965
Nick Harris May 1, 1976
?$filter=Last eq 'Wegner'
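Building such a filter from application code might look like the Python sketch below. It only constructs the query string with the standard library. The helper names `eq_filter` and `query_string` are hypothetical, and no request is sent. Doubling embedded single quotes follows the OData string-literal convention.

```python
from urllib.parse import parse_qs, quote, urlencode


def eq_filter(prop, value):
    """Build an OData equality filter, doubling any single quotes in the value."""
    escaped = value.replace("'", "''")
    return f"{prop} eq '{escaped}'"


def query_string(filter_expr):
    """URL-encode the filter as the $filter query parameter."""
    return urlencode({"$filter": filter_expr}, quote_via=quote)


if __name__ == "__main__":
    qs = query_string(eq_filter("Last", "Wegner"))
    print(qs)
    print(parse_qs(qs))  # round-trips back to the readable filter
```

A filter on a non-key property such as Last cannot use an index, since only PartitionKey and RowKey are indexed.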
Purpose of the PartitionKey
Entity Locality
Entities in the same partition will be stored together
Efficient querying and cache locality
Endeavour to include partition key in all queries
Entity Group Transactions
Atomic multiple Insert/Update/Delete in same partition in a single transaction
Table Scalability
Target throughput – 500 tps/partition, several thousand tps/account
Windows Azure monitors the usage patterns of partitions
Automatically load balance partitions
Each partition can be served by a different storage node
Scale to meet the traffic needs of your table
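Because an entity group transaction is limited to a single partition, a client that wants to write many entities first has to group them by PartitionKey. A minimal Python sketch of that grouping follows. The function name `batches_by_partition` and the sample entities are assumptions, and submitting the batches to the service is left out.

```python
from collections import defaultdict


def batches_by_partition(entities):
    """Group entities so that each batch touches exactly one partition."""
    batches = defaultdict(list)
    for entity in entities:
        batches[entity["PartitionKey"]].append(entity)
    return dict(batches)


if __name__ == "__main__":
    products = [
        {"PartitionKey": "Bikes", "RowKey": "Super Duper Cycle", "ModelYear": 2009},
        {"PartitionKey": "Canoes", "RowKey": "Whitewater", "ModelYear": 2009},
        {"PartitionKey": "Bikes", "RowKey": "Quick Cycle 200 Deluxe", "ModelYear": 2007},
    ]
    for partition, batch in batches_by_partition(products).items():
        print(partition, [entity["RowKey"] for entity in batch])
```

Each resulting batch can be committed atomically. Changes that span two partitions cannot, which is one more reason to choose the partition key with the write patterns in mind.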
PARTITIONKEY (CATEGORY)   ROWKEY (TITLE)            TIMESTAMP   MODELYEAR
Bikes                     Super Duper Cycle         …           2009
Bikes                     Quick Cycle 200 Deluxe    …           2007
…                         …                         …           …
Canoes                    Whitewater                …           2009
Canoes                    Flatwater                 …           2006
Rafts                     14ft Super Tourer         …           1999
…                         …                         …           …
Skis                      Fabrikam Back Trackers    …           2009
…                         …                         …           …
Tents                     Super Palace              …           2008
Partitions and Partition Ranges
Server A, Table = Products: [MinKey - Canoes)
Server B, Table = Products: [Canoes - MaxKey)
MANAGE ANY DATA, ANY SIZE, ANYWHERE
Relational: SQL Server Database & Parallel Data Warehouse
Non-Relational: Hadoop on Windows, Hadoop on Azure
Streaming: StreamInsight
Data Movement: Hadoop Connectors & ETL
Unified Monitoring, Management & Security
Windows Azure platform layers: Frameworks, Services, Fabric, Infrastructure
caching, identity, service bus, media, cdn, big data, commerce, integration, analytics, hpc, mobile
compute: virtual machines, web sites, cloud services
storage: SQL database, noSQL database, blob storage
networking: connect, virtual network, traffic manager
Global Physical Infrastructure: servers / network / datacenters
N Central US, S Central US, N Europe, W Europe, E Asia, SE Asia + 24 Edge CDN Locations
Automated
Managed
Resources
Elastic
Usage Based
www.microsoft.ch/shape
Questions?