Big Data (NJ SQL Server User Group)
![Page 1: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/1.jpg)
Introduction to Big Data and NoSQL
NJ SQL Server User Group
May 15, 2012
Don Demsak
Advisory Solutions Architect
EMC Consulting
www.donxml.com
Melissa Demsak
SQL Architect
Realogy
www.sqldiva.com
![Page 3: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/3.jpg)
Meet Don
• Advisory Solutions Architect, EMC Consulting
• Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• Email: [email protected]
• SlideShare: http://www.slideshare.net/dondemsak
![Page 4: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/4.jpg)
The era of Big Data
![Page 5: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/5.jpg)
How did we get here?
• Expensive
  o Processors
  o Disk space
  o Memory
  o Operating Systems
  o Software
  o Programmers
• Culture of Limitations
  o Limit CPU cycles
  o Limit disk space
  o Limit memory
  o Limited OS Development
  o Limited Software
  o Programmers
• One language
• One persistence store
![Page 6: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/6.jpg)
Typical RDBMS Implementations
• Fixed table schemas
• Small but frequent reads/writes
• Large batch transactions
• Focus on ACID
  o Atomicity
  o Consistency
  o Isolation
  o Durability
![Page 7: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/7.jpg)
How we scale RDBMS implementations
![Page 8: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/8.jpg)
1st Step – Build a relational database
Relational Database
![Page 9: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/9.jpg)
2nd Step – Table Partitioning
Relational Database
p1 p2 p3
![Page 10: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/10.jpg)
3rd Step – Database Partitioning
Customer #1: Browser, Web Tier, B/L Tier, Relational Database
Customer #2: Browser, Web Tier, B/L Tier, Relational Database
Customer #3: Browser, Web Tier, B/L Tier, Relational Database
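
The per-customer split on this slide could be sketched as a simple routing lookup. This is only an illustration, not something shown in the talk: the connection strings and the `routeCustomer` helper are hypothetical.

```javascript
// Hypothetical catalog mapping each customer to its own database.
// The server names are made up for illustration.
const customerDatabases = new Map([
  [1, "Server=db-customer1;Database=App"],
  [2, "Server=db-customer2;Database=App"],
  [3, "Server=db-customer3;Database=App"],
]);

// Return the connection string for a customer, or fail loudly
// when the customer has not been assigned a database yet.
function routeCustomer(customerId) {
  const connection = customerDatabases.get(customerId);
  if (connection === undefined) {
    throw new Error(`No database registered for customer ${customerId}`);
  }
  return connection;
}

console.log(routeCustomer(2));
```

Every request is served by exactly one database, which is what makes this step easy to reason about, at the cost of making cross-customer queries awkward.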
![Page 11: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/11.jpg)
4th Step – Move to the cloud?
Customer #1: Browser, Web Tier, B/L Tier, SQL Azure Federation
Customer #2: Browser, Web Tier, B/L Tier, SQL Azure Federation
Customer #3: Browser, Web Tier, B/L Tier, SQL Azure Federation
![Page 12: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/12.jpg)
Problems created by too much data
• Where to store
• How to store
• How to process
• Organization, searching, and metadata
• How to manage access
• How to copy, move, and backup
• Lifecycle
![Page 13: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/13.jpg)
![Page 14: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/14.jpg)
Polyglot Programmer
![Page 15: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/15.jpg)
Polyglot Persistence
(how to store)
![Page 16: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/16.jpg)
• Atlanta 2009 - No:sql(east) conference
select fun, profit from real_world where relational=false
• Billed as “conference of no-rel datastores”
(loose) Definition
• (often) Open source
• Non-relational
• Distributed
• (often) does not guarantee ACID
![Page 17: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/17.jpg)
Types Of NoSQL Data Stores
![Page 18: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/18.jpg)
5 Groups of Data Models
Relational
Document
Key Value
Graph
Column Family
![Page 19: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/19.jpg)
Document?
• Think of a web page...
  o Relational model requires column/tag
  o Lots of empty columns
  o Wasted space and processing time
• Document model just stores the pages as is
  o Saves on space
  o Very flexible
• Document Databases
  o Apache Jackrabbit
  o CouchDB
  o MongoDB
  o SimpleDB
  o XML Databases
    • MarkLogic Server
    • eXist
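
The "lots of empty columns" point can be made concrete with a small sketch. The field names below are invented for illustration; the idea is only that a fixed schema carries a slot for every possible tag, while a document carries just what the page contains.

```javascript
// A relational-style row needs a column for every possible attribute,
// so most cells stay empty for any single page.
const relationalRow = {
  id: 1, title: "Home", h1: "Welcome", h2: null, table: null,
  video: null, form: null, footer: null,
};

// A document stores only what the page actually contains.
const pageDocument = { id: 1, title: "Home", h1: "Welcome" };

// Count the attributes that carry no data.
function emptyFields(record) {
  return Object.values(record).filter((value) => value === null).length;
}

console.log(emptyFields(relationalRow), emptyFields(pageDocument));
```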
![Page 20: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/20.jpg)
Key/Value Stores
• Simple Index on Key
• Value can be any serialized form of data
• Lots of different implementations
  o Eventually Consistent
    • “If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent”
  o Cached in RAM
  o Cached on disk
  o Distributed Hash Tables
• Examples
  o Azure AppFabric Cache
  o memcached
  o VMware vFabric GemFire
![Page 21: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/21.jpg)
Graph?
• Graph consists of
  o Nodes (‘stations’ of the graph)
  o Edges (lines between them)
• Graph Stores
  o AllegroGraph
  o Core Data
  o Neo4j
  o DEX
  o FlockDB
    • Created by the Twitter folks
    • Nodes = Users
    • Edges = Nature of relationship between nodes
  o Microsoft Trinity (research project)
    • http://research.microsoft.com/en-us/projects/trinity/
![Page 22: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/22.jpg)
Column Family?
• Lots of variants
  o Object Stores
    • db4o
    • GemStone/S
    • InterSystems Caché
    • Objectivity/DB
    • ZODB
  o Tabular
    • BigTable
    • Mnesia
    • HBase
    • Hypertable
    • Azure Table Storage
  o Column-oriented
    • Greenplum
    • Microsoft SQL Server 2012
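
For the column-oriented variant, a rough sketch of the difference in layout may help. The sample records and the `toColumns` helper are assumptions made for this example and do not reflect how any listed product stores data internally.

```javascript
// Row-oriented: each record is stored together.
const rows = [
  { id: 1, region: "NJ", sales: 120 },
  { id: 2, region: "NY", sales: 300 },
  { id: 3, region: "NJ", sales: 80 },
];

// Column-oriented: each column is stored together.
function toColumns(records) {
  const columns = {};
  for (const record of records) {
    for (const [name, value] of Object.entries(record)) {
      if (!(name in columns)) columns[name] = [];
      columns[name].push(value);
    }
  }
  return columns;
}

const columns = toColumns(rows);
// An aggregate over one column touches only that column's values.
const totalSales = columns.sales.reduce((sum, value) => sum + value, 0);
console.log(totalSales);
```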
![Page 23: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/23.jpg)
Okay, got it. Now let’s compare some real-world scenarios
![Page 24: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/24.jpg)
You Need Constant Consistency
• You’re dealing with financial transactions
• You’re dealing with medical records
• You’re dealing with bonded goods
• Best to use an RDBMS
![Page 25: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/25.jpg)
You Need Horizontal Scalability
• You’re working across defined timezones
• You’re aggregating large quantities of data
• Maintaining a chat server (Facebook chat)
• Use Column Family Storage
![Page 26: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/26.jpg)
Frequently Written Rarely Read
• Think web counters and the like
• Every time a user comes to a page = ctr++
• But it’s only read when the report is run
• Use Key-Value Storage
![Page 27: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/27.jpg)
Here Today Gone Tomorrow
• Transient data like...
  o Web Sessions
  o Locks
  o Short Term Stats
• Shopping cart contents
• Use Key-Value Storage
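
A minimal sketch of why key/value storage suits transient data: entries can simply carry an expiry. The `ExpiringStore` class is hypothetical and in-memory only; real caches handle eviction, memory limits, and distribution very differently.

```javascript
// Minimal in-memory key/value store with per-key expiry.
// The clock is injected so the behaviour is easy to test.
class ExpiringStore {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map();
  }

  set(key, value, ttlMs) {
    // Values are kept in serialized form, as a key/value store would do.
    this.entries.set(key, {
      payload: JSON.stringify(value),
      expiresAt: this.now() + ttlMs,
    });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // lazily evict expired data
      return undefined;
    }
    return JSON.parse(entry.payload);
  }
}

let fakeTime = 0;
const sessions = new ExpiringStore(() => fakeTime);
sessions.set("session:42", { cart: ["book"] }, 1000);
console.log(sessions.get("session:42")); // still present
fakeTime = 1000;
console.log(sessions.get("session:42")); // expired by now
```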
![Page 28: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/28.jpg)
Where to store
• RAM
  o Fast
  o Expensive
  o Volatile
• Parallel File System
  o HDFS (Hadoop)
  o Auto-replicated for parallel decentralized I/O
• Local Disk
  o SSD – super fast
  o Fast spinning disks (7200+ RPM)
  o High Bandwidth possible
  o Persistent
• SAN
  o Storage Area Network
  o Fully managed
  o Expensive
• Cloud
  o Amazon
  o Box.net
  o Dropbox
![Page 29: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/29.jpg)
Big Data
![Page 30: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/30.jpg)
Big Data Definition
Volume
• Beyond what traditional environments can handle
Velocity
• Need decisions fast
Variety
• Many formats
![Page 31: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/31.jpg)
Additional Big Data Concepts
• Volumes & volumes of data
• Unstructured
• Semi-structured
• Not suited for Relational Databases
• Often utilizes MapReduce frameworks
![Page 32: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/32.jpg)
Big Data Examples
• Cassandra
• Hadoop
• Greenplum
• Azure Storage
• EMC Atmos
• Amazon S3
• SQL Azure (with Federations support)?
![Page 33: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/33.jpg)
Real World Example
• Twitter
  o The challenges
    • Needs to store many graphs
      o Who you are following
      o Who’s following you
      o Who you receive phone notifications from, etc.
    • To deliver a tweet requires rapid paging of followers
    • Heavy write load as followers are added and removed
    • Set arithmetic for @mentions (intersection of users)
![Page 34: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/34.jpg)
What did they try?
• Started with Relational Databases
• Tried Key-Value storage of denormalized lists
• Did it work?
  o Nope
• Either good at
  o Handling the write load
  o Or paging large amounts of data
  o But not both
![Page 35: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/35.jpg)
What did they need?
• Simplest possible thing that would work
• Allow for horizontal partitioning
• Allow write operations to
  o Arrive out of order
  o Or be processed more than once
• Failures should result in redundant work
  o Not lost work!
![Page 36: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/36.jpg)
The Result was FlockDB
• Stores graph data
• Not optimized for graph traversal operations
• Optimized for large adjacency lists
  o List of all edges in a graph
    • Key is the edge; value is a set of the node end points
• Optimized for fast read and write
• Optimized for page-able set arithmetic
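
"Page-able set arithmetic" over adjacency lists could look roughly like the following. This is a sketch under assumptions, not FlockDB's API: the follower ids, the `intersectSorted` merge, and the cursor-based `page` helper are all invented for illustration.

```javascript
// Adjacency lists keyed by node: who follows each user (sorted ids).
const followers = new Map([
  ["alice", [2, 3, 5, 8, 13]],
  ["bob", [3, 5, 7, 13, 21]],
]);

// Intersect two sorted lists with a linear merge.
function intersectSorted(a, b) {
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i]);
      i++;
      j++;
    } else if (a[i] < b[j]) {
      i++;
    } else {
      j++;
    }
  }
  return out;
}

// Return one page of results plus the cursor for the next page
// (null when there is nothing left to read).
function page(list, cursor, size) {
  const items = list.slice(cursor, cursor + size);
  const next = cursor + size < list.length ? cursor + size : null;
  return { items, next };
}

const common = intersectSorted(followers.get("alice"), followers.get("bob"));
console.log(page(common, 0, 2)); // first page of shared followers
```

Keeping each list sorted is the design choice that makes both the intersection and the paging cheap: neither needs to look at the whole graph.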
![Page 37: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/37.jpg)
How Does it Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
  o All queries can be answered by a single partition
• Write operations are idempotent
  o Can be applied multiple times without changing the result
• And commutative
  o Changing the order of operands doesn’t change the result
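
One common way to get writes that are both idempotent and commutative is to timestamp each write and let the newest one win. The sketch below assumes that approach for illustration; the `applyWrite` function, the edge states, and the tie-break rule are hypothetical, not FlockDB's actual implementation.

```javascript
// Each write carries a timestamp; the newest write per edge wins.
// Ties are broken deterministically ("removed" beats anything else)
// so that arrival order never matters.
function applyWrite(edges, op) {
  const key = `${op.source}->${op.destination}`;
  const current = edges.get(key);
  const newer =
    current === undefined ||
    op.timestamp > current.timestamp ||
    (op.timestamp === current.timestamp &&
      op.state === "removed" &&
      current.state !== "removed");
  if (newer) edges.set(key, { state: op.state, timestamp: op.timestamp });
  return edges;
}

const follow = { source: 1, destination: 2, state: "added", timestamp: 10 };
const unfollow = { source: 1, destination: 2, state: "removed", timestamp: 20 };

// Same writes, delivered in order versus out of order with a duplicate.
const inOrder = [follow, unfollow].reduce(applyWrite, new Map());
const reordered = [unfollow, follow, follow].reduce(applyWrite, new Map());
console.log(inOrder.get("1->2").state, reordered.get("1->2").state);
```

Because replaying or reordering writes cannot change the outcome, a failed batch can simply be retried, which is the "redundant work, not lost work" property the earlier slide asks for.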
![Page 38: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/38.jpg)
How to Process Big Data
![Page 39: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/39.jpg)
ACID
• Atomicity
  o All or Nothing
• Consistency
  o Valid according to all defined rules
• Isolation
  o No transaction should be able to interfere with another transaction
• Durability
  o Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
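
The "all or nothing" and "valid according to all defined rules" ideas can be illustrated with a toy in-memory transfer. This sketch covers only atomicity and consistency; isolation and durability need a real database engine. The `transfer` function and the no-negative-balance rule are assumptions made for the example.

```javascript
// Apply a transfer atomically: work on a copy and only hand back
// the new state if every rule still holds afterwards.
function transfer(accounts, from, to, amount) {
  const draft = new Map(accounts); // private working copy
  draft.set(from, draft.get(from) - amount);
  draft.set(to, draft.get(to) + amount);
  // Consistency rule (an assumption for this sketch): no negative balances.
  for (const balance of draft.values()) {
    if (!(balance >= 0)) throw new Error("rule violated, nothing applied");
  }
  return draft; // commit: the caller swaps in the new state
}

let accounts = new Map([["checking", 100], ["savings", 50]]);
accounts = transfer(accounts, "checking", "savings", 30);
try {
  accounts = transfer(accounts, "checking", "savings", 500);
} catch (err) {
  // The failed transfer left the committed state untouched.
}
console.log([...accounts]);
```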
![Page 40: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/40.jpg)
BASE
• Basically Available
  o High availability but not always consistent
• Soft state
  o Background cleanup mechanism
• Eventual consistency
  o Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent.
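
Eventual consistency can be sketched with two replicas that disagree until a background sync runs. The `Replica` class, the version numbers, and the `sync` function are hypothetical; real systems use richer conflict resolution than "highest version wins".

```javascript
// Each replica holds { value, version } per key; the higher version wins.
class Replica {
  constructor() {
    this.data = new Map();
  }

  write(key, value, version) {
    const current = this.data.get(key);
    if (current === undefined || version > current.version) {
      this.data.set(key, { value, version });
    }
  }

  read(key) {
    const entry = this.data.get(key);
    return entry === undefined ? undefined : entry.value;
  }
}

// Background propagation: push every entry of one replica to another.
function sync(from, to) {
  for (const [key, { value, version }] of from.data) {
    to.write(key, value, version);
  }
}

const replicaA = new Replica();
const replicaB = new Replica();
replicaA.write("profile:1", "v1", 1);
replicaB.write("profile:1", "v2", 2); // the replicas now disagree
console.log(replicaA.read("profile:1"), replicaB.read("profile:1"));

sync(replicaA, replicaB); // no new writes, only propagation
sync(replicaB, replicaA);
console.log(replicaA.read("profile:1"), replicaB.read("profile:1"));
```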
![Page 41: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/41.jpg)
Traditional (relational) Approach
Extract
Transform
Load
Transactional Data Store
Data Warehouse
![Page 42: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/42.jpg)
Big Data Approach
• MapReduce Pattern/Framework
  o an Input Reader
  o Map Function – To transform to a common shape (format)
  o a partition function
  o a compare function
  o Reduce Function
  o an Output Writer
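
The six pieces listed above can be wired together in a few lines. This is a single-process sketch for illustration only: the function names, the word-count job, and the first-character partitioning scheme are assumptions, and a real framework would run the map and reduce stages on different machines.

```javascript
// Input reader: split raw text into records (one per line).
const inputReader = (text) => text.split("\n").filter((line) => line.length > 0);

// Map: turn each record into key/value pairs of a common shape.
const mapFn = (line) =>
  line.toLowerCase().split(/\s+/).filter(Boolean).map((word) => [word, 1]);

// Partition: pick which reducer handles a key.
const partitionFn = (key, reducers) => key.charCodeAt(0) % reducers;

// Compare: order keys inside each partition before reducing.
const compareFn = (x, y) => (x < y ? -1 : x > y ? 1 : 0);

// Reduce: fold all values for one key into a single result.
const reduceFn = (key, values) => [key, values.reduce((sum, v) => sum + v, 0)];

// Output writer: collect the results into a plain object.
const outputWriter = (pairs) => Object.fromEntries(pairs);

function runJob(text, reducers = 2) {
  const partitions = Array.from({ length: reducers }, () => new Map());
  for (const record of inputReader(text)) {
    for (const [key, value] of mapFn(record)) {
      const bucket = partitions[partitionFn(key, reducers)];
      if (!bucket.has(key)) bucket.set(key, []);
      bucket.get(key).push(value);
    }
  }
  const results = [];
  for (const bucket of partitions) {
    const keys = [...bucket.keys()].sort(compareFn);
    for (const key of keys) results.push(reduceFn(key, bucket.get(key)));
  }
  return outputWriter(results);
}

console.log(runJob("big data\nbig ideas"));
```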
![Page 43: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/43.jpg)
MongoDB Example
> // map function
> m = function(){
...     this.tags.forEach(
...         function(z){
...             emit( z , { count : 1 } );
...         }
...     );
... };

> // reduce function
> r = function( key , values ){
...     var total = 0;
...     for ( var i=0; i<values.length; i++ )
...         total += values[i].count;
...     return { count : total };
... };

> // execute
> res = db.things.mapReduce(m, r, { out : "myoutput" } );
![Page 44: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/44.jpg)
What is Hadoop?
• A scalable fault-tolerant grid operating system for data storage and processing
• Its scalability comes from the marriage of:
  o HDFS: Self-Healing High-Bandwidth Clustered Storage
  o MapReduce: Fault-Tolerant Distributed Processing
• Operates on unstructured and structured data
• A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
• Open source under the friendly Apache License
• http://wiki.apache.org/hadoop/
![Page 45: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/45.jpg)
Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
![Page 46: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/46.jpg)
Hadoop Core Components
Store
HDFS
Self-healing, high-bandwidth clustered storage
Process
Map/Reduce
Fault-tolerant distributed processing
![Page 47: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/47.jpg)
HDFS: Hadoop Distributed File System
Block Size = 64MB
Replication Factor = 3
Cost/GB is a few ¢/month vs $/month
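
A quick back-of-the-envelope calculation shows what these two settings mean for a single file. The `hdfsFootprint` helper is hypothetical; it uses only the block size and replication factor from the slide and ignores metadata overhead.

```javascript
const MB = 1024 * 1024;
const BLOCK_SIZE = 64 * MB; // from the slide
const REPLICATION = 3;      // from the slide

// Estimate how a file is laid out across the cluster.
function hdfsFootprint(fileBytes) {
  const blocks = Math.ceil(fileBytes / BLOCK_SIZE);
  return {
    blocks,
    blockReplicas: blocks * REPLICATION,
    // The last block only occupies as much disk as it actually holds.
    rawBytes: fileBytes * REPLICATION,
  };
}

console.log(hdfsFootprint(1024 * MB)); // a 1 GB file
```

With these settings a 1 GB file becomes 16 blocks stored as 48 block replicas, so the raw capacity consumed is three times the file size.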
![Page 48: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/48.jpg)
Hadoop Map/Reduce
![Page 49: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/49.jpg)
Hadoop Job Architecture
![Page 50: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/50.jpg)
Microsoft embraces Hadoop
Good for enterprises & developers
Great for end users!
![Page 51: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/51.jpg)
A SEAMLESS OCEAN OF INFORMATION PROCESSING AND ANALYTICS
EIS / ERP
RDBMS
File System
OData [RSS]
Azure Storage
HADOOP [Azure and Enterprise]
OCEAN OF DATA [unstructured, semi-structured, structured]
Java OM, Streaming OM, HiveQL, PigLatin, (T)SQL, .NET/C#/F#
HDFS
NOSQL ETL
![Page 52: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/52.jpg)
Hive Plug-in for Excel
![Page 53: Big Data (NJ SQL Server User Group)](https://reader035.vdocuments.site/reader035/viewer/2022062617/54b7b1c24a7959f3728b458d/html5/thumbnails/53.jpg)
THANK YOU