SnappyData overview slide deck for Big Data Bellevue
TRANSCRIPT
SnappyData: Getting Spark ready for real-time, operational analytics
www.snappydata.io
Suds Menon Co-Founder SnappyData
Jan 2016
Last Week Tonight in Big Data
IoT is what makes the big data challenge very real
A 10 Trillion Device World [1]
[1] http://cacm.acm.org/news/191847-get-ready-to-live-in-a-trillion-device-world/fulltext
Because Insights are like people. Useful for a short period of time
The New Arms Race
● Sift through data to get insights to improve your business
● What is your time to insights?
● What is your time to operationalizing insights?
Can we use the past to accurately predict the future?
The Holy Grail of Analytics
The faster you go, the bigger your business advantage
Speeding Up Insights
Exploding data volumes fuel the search for distributed solutions
How We Got Here
Teradata Cognos
GreenPlum Netezza, ParAccel
Hadoop (SQL on Hadoop)
Spark (Spark SQL)
Every enterprise today deals with these 4 kinds of data interactions
The Four Horsemen Of Data
OLTP, OLAP, Streaming, Machine Learning
Who Are We?
● An EMC-Pivotal spinout focused on real time operational analytics
● New Spark-based open source project started by Pivotal GemFire founders+engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database
SnappyData At Cruising Altitude
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
[Diagram labels: Batch design, high throughput; Real time operational analytics, TBs in memory; RDB; Rows; Txn; Columnar; Index; API; Stream processing; ODBC, JDBC, REST; Spark (Scala, Java, Python, R); HDFS; AQP; MPP DB]
First commercial project on Approximate Query Processing (AQP)
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
Batch design, high throughput
Real-time design center: low latency, HA, concurrent
Vision: Drastically reduce the cost and complexity in modern big data
Huge community adoption, slip streaming into Hadoop momentum, great data integration platform
Why Spark?
• Most events in life can be analyzed as micro batches
• Blends streaming, interactive, and batch analytics
• Appeals to Java, R, Python, Scala programmers
• Rich set of transformations and libraries
• RDD and fault tolerance without replication
• Offers Spark SQL as a key capability
Spark is a compute framework that processes data, not an analytics database
Clearing Up Some Spark Myths
● It is NOT a distributed in-memory database
  ○ It’s a computational framework with immutable caching
● It is NOT Highly Available
  ○ Fault tolerance is not the same as HA
● NOT well suited for real time, operational environments
  ○ Does not handle concurrency well
  ○ Does not share data very well either
SnappyData & Lambda
SnappyData Focus
Perspective on Lambda for real time
[Diagram labels: Application; Streams; Transform; Data-in-motion Analytics; Alerts; In-Memory DB (interactive queries, updates); MPP DB (deep scale, high volume)]
RELEVANT USE CASES
Market Surveillance
INGEST: Partitioned, HA stream ingestion
ANALYZE: SQL queries & Stream Analytics on microbatches
DETECT: Identify patterns based on query results
FLAG: Prevent settlement, investigate further
Contextual Marketing
INGEST: Transactional request for Ad placement
ANALYZE: Join with history, join with user profile, join with location
DECIDE: Pick Ad based on variety of reference data parameters
RESPOND: Deliver in real time
Location Based Telco Services
Geo Fencing Mobile Marketing Network Analytics
● INGEST, CORRELATE, JOIN WITH HISTORICAL DATA, RESPOND
Spark Architecture
[Diagram labels: Driver; Cluster Manager (YARN, Mesos, Standalone); Workers; Executor]
Snappy Infused Spark Architecture
[Diagram labels: JDBC Clients; ODBC Clients; REST API for Job Submission; Job Server; Lead Nodes; Cluster Manager (YARN, Mesos, Standalone); Workers; Data Server; Executor]
Core Components Of SnappyData
Colocated row/column Tables in Spark
[Diagram: three nodes, each labeled Row Table, Column Table, Spark Executor TASK, Spark Block Manager, Stream processing]
● Spark Executors are long lived and shared across multiple apps
● Gem Memory Mgr and Spark Block Mgr integrated
Table can be partitioned or replicated
[Diagram labels: Replicated Table (on each node); Partitioned Table (Buckets A-H); Partitioned Table (Buckets I-P); Partitioned Table (Buckets Q-W); Partition Replica (Buckets A-H); Partition Replica (Buckets I-P)]
Replicated table: consistent replica on each node
Partitioned table: data partitioned with one or more replicas
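The idea of buckets with one or more replicas can be sketched in a few lines of Python. This is only an illustration: the round-robin placement, the node names, and the helper names `bucket_for` and `place_buckets` are assumptions made for the example, not SnappyData's actual placement logic.

```python
from zlib import crc32


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket using a hash that is stable across runs."""
    return crc32(key.encode("utf-8")) % num_buckets


def place_buckets(num_buckets: int, nodes: list[str], redundancy: int) -> dict[int, list[str]]:
    """Give each bucket a primary node plus `redundancy` replicas on distinct nodes.

    Hypothetical round-robin placement, chosen only to keep the sketch short.
    """
    if redundancy >= len(nodes):
        raise ValueError("need more nodes than replicas")
    placement = {}
    for b in range(num_buckets):
        # The first owner is the primary; the rest hold replicas of the bucket.
        placement[b] = [nodes[(b + i) % len(nodes)] for i in range(redundancy + 1)]
    return placement


nodes = ["node1", "node2", "node3"]
placement = place_buckets(num_buckets=6, nodes=nodes, redundancy=1)
for bucket, owners in placement.items():
    print(bucket, "primary:", owners[0], "replicas:", owners[1:])
```

With one replica per bucket, losing any single node still leaves a copy of every bucket on another node.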
Linearly scale with shared partitions
[Diagram labels: Kafka queue; Spark Executor (x2); Subscriber A-M; Subscriber N-Z; Subscriber A-M Ref data]
Linearly scale with partition pruning
Input queue, Stream, IMDB, Output queue all share the same partitioning strategy
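A rough sketch of why a shared partitioning strategy helps: if stream events and reference data are partitioned by the same function on the same key, each partition can join its events against local reference rows without moving data. The partition count, field names, and function names below are assumptions for illustration only.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 4  # assumed partition count for the sketch


def partition_of(subscriber_id: str) -> int:
    """Single partitioning function shared by the stream and the reference table."""
    return crc32(subscriber_id.encode("utf-8")) % NUM_PARTITIONS


def partition_rows(rows, key):
    """Group rows by the partition their key hashes to."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[key])].append(row)
    return parts


def colocated_join(events, ref_rows, key="subscriber"):
    """Join events to reference rows one partition at a time.

    Because both sides use partition_of, matching rows always land in the
    same partition, so no row has to cross partitions.
    """
    event_parts = partition_rows(events, key)
    ref_parts = partition_rows(ref_rows, key)
    joined = []
    for p, part_events in event_parts.items():
        local_ref = {r[key]: r for r in ref_parts.get(p, [])}
        for e in part_events:
            if e[key] in local_ref:
                joined.append({**local_ref[e[key]], **e})
    return joined


events = [{"subscriber": "alice", "bytes": 120}, {"subscriber": "zoe", "bytes": 75}]
ref = [{"subscriber": "alice", "plan": "gold"}, {"subscriber": "zoe", "plan": "basic"}]
result = colocated_join(events, ref)
print(result)
```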
Point access, updates, fast writes
● Row tables with PKs are distributed HashMaps
  ○ with secondary indexes
● Support for transactional semantics
  ○ read_committed, repeatable_read
● Support for scalable high write rates
  ○ streaming data goes through stages
  ○ queue streams, intermediate storage (Delta row buffer), immutable compressed columns
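The staged write path (row buffer first, compressed columns later) might look roughly like the following. This is a toy model under stated assumptions: the class name `ColumnTableSketch`, the batch size, and the JSON-plus-zlib encoding are all invented for the example and say nothing about SnappyData's real storage format.

```python
import json
import zlib


class ColumnTableSketch:
    """Toy column table: recent writes sit in a row buffer, older data in column batches."""

    def __init__(self, columns, batch_size):
        self.columns = list(columns)
        self.batch_size = batch_size
        self.delta_rows = []      # mutable row buffer that absorbs fast writes
        self.column_batches = []  # compressed column batches, never modified after creation

    def insert(self, row: dict):
        self.delta_rows.append(row)
        if len(self.delta_rows) >= self.batch_size:
            self._flush()

    def _flush(self):
        """Pivot buffered rows into one compressed blob per column."""
        batch = {}
        for col in self.columns:
            values = [r[col] for r in self.delta_rows]
            batch[col] = zlib.compress(json.dumps(values).encode("utf-8"))
        self.column_batches.append(batch)
        self.delta_rows = []

    def scan(self, col):
        """Read one column across flushed batches, then the unflushed row buffer."""
        for batch in self.column_batches:
            yield from json.loads(zlib.decompress(batch[col]))
        for row in self.delta_rows:
            yield row[col]


table = ColumnTableSketch(["symbol", "qty"], batch_size=3)
for i, qty in enumerate([10, 20, 30, 40]):
    table.insert({"symbol": f"S{i}", "qty": qty})
print(list(table.scan("qty")))
```

Reads see both stages, so a row is queryable as soon as it is inserted, while scans over older data touch only the columns they need.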
Full Spark Compatibility
● Any table is also visible as a DataFrame
● Any RDD[T]/DataFrame can be stored in SnappyData tables
● Tables appear like any JDBC sourced table
  ○ But, in executor memory by default
● Additional API for updates, inserts, deletes

// Save a DataFrame using the Spark context
…
context.createExternalTable("T1", "ROW", myDataFrame.schema, props);

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1");
Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name (
    <column definition>
) USING 'JDBC | ROW | COLUMN'
OPTIONS (
    COLOCATE_WITH 'table_name',  // Default none
    PARTITION_BY 'PRIMARY KEY | column name',  // will be a replicated table, by default
    REDUNDANCY '1',  // Manage HA
    PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",  // Empty string will map to default disk store.
    OFFHEAP "true | false",
    EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
    …..
)
[AS select_statement];
Key feature: Synopses Data
● Maintain stratified samples
  ○ Intelligent sampling to keep error bounds low
● Probabilistic data
  ○ TopK for time series (using time aggregation CMS, item aggregation)
  ○ Histograms, HyperLogLog, Bloom Filters, Wavelets
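To make the probabilistic structures more concrete, here is a minimal Count-Min Sketch, assuming "CMS" on the slide refers to that structure. The width, depth, and hashing scheme are arbitrary choices for the example; the sketch only shows the general technique of answering frequency queries from a small fixed-size summary, with estimates that can overcount but never undercount.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary: estimates are upper bounds on true counts."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One independent hash per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))


cms = CountMinSketch()
for symbol in ["AAPL"] * 5 + ["MSFT"] * 2:
    cms.add(symbol)
print(cms.estimate("AAPL"), cms.estimate("MSFT"))
```

A TopK structure could keep a small heap of candidate items alongside such a sketch, using the estimates as ranking scores.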
CREATE SAMPLE TABLE sample-table-name USING columnar
OPTIONS (
    BASETABLE 'table_name'  // source column table or stream table
    [ SAMPLINGMETHOD "stratified | uniform" ]
    STRATA name (
        QCS ("comma-separated-column-names")
        [ FRACTION "frac" ]
    ),+  // one or more QCS
)
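A sketch of what stratified sampling over a set of QCS columns buys you: rare groups keep at least one representative, and each sampled row carries a weight so aggregates can be scaled back up. The function names, the `_weight` field, and the "at least one row per stratum" rule are assumptions for this illustration, not the behavior of the CREATE SAMPLE TABLE statement above.

```python
import random
from collections import defaultdict


def stratified_sample(rows, qcs, fraction, seed=0):
    """Sample `fraction` of each stratum, where a stratum is one combination of QCS values."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in qcs)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row so small strata are never dropped entirely.
        k = max(1, round(len(members) * fraction))
        for row in rng.sample(members, k):
            # _weight = number of base rows this sampled row stands for.
            sample.append({**row, "_weight": len(members) / k})
    return sample


def estimate_sum(sample, column):
    """Approximate SUM(column) over the base table from the weighted sample."""
    return sum(row[column] * row["_weight"] for row in sample)


rows = [{"region": "west", "amount": 1} for _ in range(1000)]
rows += [{"region": "east", "amount": 100} for _ in range(10)]
sample = stratified_sample(rows, qcs=["region"], fraction=0.1)
print(len(sample), estimate_sum(sample, "amount"))
```

A uniform 10% sample of the same data could easily miss the small "east" group, which contributes half of the total here; stratifying on `region` guarantees it is represented.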
AQP Architecture
Spot The Differences
SnappyData is Open Source
● Beta will be on GitHub in January. We are looking for contributors!
● Learn more & register for beta: www.snappydata.io
● Connect:
  ○ twitter: www.twitter.com/snappydata
  ○ facebook: www.facebook.com/snappydata
  ○ linkedin: www.linkedin.com/snappydata
  ○ slack: http://snappydata-slackin.herokuapp.com
  ○ IRC: irc.freenode.net #snappydata
Q&A
THANK YOU