Demystifying Systems for Interactive and Real-time Analytics
The BigFrame Team
Duke University, Hong Kong Polytechnic University, and HP Labs
Analytics System Landscape
MPP DB
Columnar
MapReduce
Mixed
Dataflow
Streaming
Text Analytics
Array DB
Graph
Multi-tenant
Gamma, Aster, Netezza, DB2 PE, Teradata, SQL Server Parallel Data Warehouse, Greenplum
HP Vertica, ParAccel, Redshift, Vectorwise
Hadoop, Tenzing, Hive, Mahout, HadoopDB, Pig
Dremel, Drill, Stinger, Impala, Spark, Dryad, SCOPE
Cassandra, HBase, Bigtable, Druid, HANA, Spanner, Megastore, Splunk
Storm, GraphLab, Streambase, Cassovary, GraphX, Solr, ElasticSearch, SciDB, Cloudera Search, MadLINQ, Pregel, HAMA
Mesos, YARN, Serengeti, Cloud platforms
What does this mean for Big Data Practitioners?
Gives them a lot of power!
From: http://animeonly.org/Digital-Wallpapers/Digital-renders/Spiderman-95061p.html
Even the mighty may need a little help
Challenges for Practitioners
App Developers, Data Scientists:
Which system to use for the app that I am developing?
• Features (e.g., graph data)
• Performance (e.g., claims like System A is 50x faster than B)
• Resource efficiency
• Growth and scalability
• Multi-tenancy
Different parts of my app have different requirements
Compose “best of breed” systems OR use a “one size fits all” system?
System Admins:
Managing many systems is hard!
CIO:
Total Cost of Ownership (TCO)?
Numbers make decisions easier
Need benchmarks
One Approach
• Categorize systems
• Develop a benchmark per system category
Useful, but …
Star Schema Benchmark, TPC-H / TPC-DS
Counting triangles
Terasort, GridMix, SWIM, HiBench, DFSIO
MapReduce vs. Parallel DB / Hive Benchmark (in HiBench) / Berkeley Big Data Benchmark
Yahoo Cloud Serving Benchmark (YCSB), YCSB variants
CH-benCHmark
MulTe
Graph 500, PageRank
RDF benchmarks
Information Extraction Benchmark
Linear Road
SS-DB
Problem #1: May Miss the Big Picture
Cannot capture the complexities and end-to-end behavior of big data applications and deployments:
(i) Bottlenecks
(ii) Data conversion, transfer, & loading overheads
(iii) Storage costs & other parts of the data life-cycle
(iv) Resource management challenges
(v) Total Cost of Ownership (TCO)
Give a man a fish and you will feed him for a day.
Give him fishing gear and you will feed him for life.
-- Anonymous
Problem #2
Benchmark
Benchmark Generator
BigFrame: A Benchmark Generator for Big Data Analytics
How a user uses BigFrame
BigFrame Interface -> bigif (benchmark input format) -> Benchmark Generator -> bspec (benchmark specification)
Benchmark Driver for System Under Test: run the benchmark on the System Under Test (HBase, Hive, MapReduce) -> results
bspec: Benchmark Specification
1. Data for initial load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics
[Figure: the four parts shown along a Time axis against the System Under Test (HBase, Hive, MapReduce)]
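The four parts of a bspec could be held in a small record type. What follows is a minimal Python sketch under that assumption. The slides do not show BigFrame's actual bspec format, so every name here (`BenchmarkSpec`, `QueryStream`, `refresh_times_s`, the metric names) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QueryStream:
    """One stream of queries issued against the system under test."""
    name: str
    queries: List[str]  # query texts or workflow identifiers (hypothetical)


@dataclass
class BenchmarkSpec:
    """Hypothetical container for the four bspec parts named on the slide."""
    initial_load: List[str]           # 1. data sets for the initial load
    refresh_times_s: List[float]      # 2. data refresh pattern: arrival times in seconds
    query_streams: List[QueryStream]  # 3. query streams
    metrics: List[str] = field(default_factory=list)  # 4. evaluation metrics

    def summary(self) -> Dict[str, int]:
        """Count the entries in each of the four parts."""
        return {
            "initial_load": len(self.initial_load),
            "refreshes": len(self.refresh_times_s),
            "query_streams": len(self.query_streams),
            "metrics": len(self.metrics),
        }


# Illustrative values only; the metric names are assumptions.
spec = BenchmarkSpec(
    initial_load=["web_sales", "tweets"],
    refresh_times_s=[60.0, 120.0],
    query_streams=[QueryStream("dashboard", ["q1", "q2"])],
    metrics=["latency", "throughput"],
)
print(spec.summary())
```

Keeping the refresh pattern as a list of arrival times is one simple choice; a real specification would likely also say what data arrives at each point.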
What does the user (want to) specify?
BigFrame Interface: bigif (benchmark input format)
The 3Vs
Volume, Variety, Velocity
bigif: BigFrame’s Input Format
Data Variety: relational, text, array, graph
Data Volume: small, medium, large
Data Velocity: at rest, slow, fast
Query Variety: micro, macro
Query Volume: query concurrency & classes
Query Velocity: exploratory, continuous
Benchmark Generation
bigif (benchmark input format) -> Benchmark Generator -> bspec (benchmark specification)
bigif describes points in a discrete space of {Data, Query} X {Variety, Volume, Velocity}
bspec: 1. Initial data to load 2. Data refresh pattern 3. Query streams 4. Evaluation metrics
Benchmark generation can be addressed as a search problem within a rich application domain
Application Domain Modeled Currently
E-commerce sales, promotions, recommendations
Social media sentiment & influence
Application Domain Modeled Currently
Item
Customer
Web_sales
Promotion
Tweets
Relationships
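To make the "search problem" framing concrete, the sketch below treats generation as a filter over a catalog of candidate queries defined on the domain tables. It is a deliberately simple stand-in and is not BigFrame's algorithm: the catalog entries, the query names, and the mapping from tables to data varieties are all assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    name: str               # hypothetical query or workflow name
    tables: FrozenSet[str]  # domain tables the query reads
    query_variety: str      # "micro" or "macro"


# Assumed data variety of each table in the modeled domain.
TABLE_VARIETY = {
    "Item": "relational",
    "Customer": "relational",
    "Web_sales": "relational",
    "Promotion": "relational",
    "Tweets": "text",
    "Relationships": "graph",
}

# Hypothetical catalog of queries the domain could offer.
CATALOG = [
    Candidate("sales_per_item", frozenset({"Item", "Web_sales"}), "micro"),
    Candidate("promotion_sentiment", frozenset({"Promotion", "Web_sales", "Tweets"}), "macro"),
    Candidate("influence_on_sales", frozenset({"Web_sales", "Tweets", "Relationships"}), "macro"),
]


def search(data_variety: Iterable[str], query_variety: str) -> List[str]:
    """Return catalog queries that fit the requested data and query variety."""
    allowed = frozenset(data_variety)
    hits = []
    for cand in CATALOG:
        needed = {TABLE_VARIETY[t] for t in cand.tables}
        # Keep a query only if every variety it touches was requested.
        if needed <= allowed and cand.query_variety == query_variety:
            hits.append(cand.name)
    return hits


print(search({"relational"}, "micro"))
print(search({"relational", "text", "graph"}, "macro"))
```

A fuller generator would also search over data volume and velocity; the filter here covers only the two variety dimensions.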
Benchmark Generation
BigFrame can generate Data, Queries, and Arrival Patterns with the user-specified {Variety, Volume, Velocity} requirements from the application domain
Use Cases of BigFrame
Use Case I: Exploratory BI
• Large volumes of relational data
• Mostly aggregation and few joins
• Can Spark’s performance match that of an MPP DB?
Data Variety = {Relational}
Query Variety = Micro
BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries
Use Case II: Complex BI
• Large volumes of relational data
• Even larger volumes of text data
• Combined analytics
Data Variety = {Relational, Text}
Query Variety = Macro (application-focused instead of micro-benchmarking)
BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets
Use Case III: Dashboards
• Large volume and velocity of relational and text data
• Continuously-updated Dashboards
Data Velocity = Fast
Query Velocity = Continuous (as opposed to Exploratory)
BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh
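A continuous query of this kind can be pictured as an aggregate that is re-evaluated after every refresh batch. The following is a minimal Python sketch of that idea. The `(item, quantity)` row schema, the function name, and the sample batches are hypothetical and do not come from BigFrame.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, int]  # (item, quantity sold): hypothetical schema


def running_sales(batches: Iterable[List[Row]]) -> Iterator[Dict[str, int]]:
    """Yield the dashboard state after each data refresh batch."""
    totals: Dict[str, int] = {}
    for batch in batches:
        for item, qty in batch:
            totals[item] = totals.get(item, 0) + qty
        # Copy so that later batches do not mutate earlier snapshots.
        yield dict(totals)


# Three illustrative refresh batches arriving over time.
refreshes = [
    [("book", 2), ("phone", 1)],
    [("book", 1)],
    [("phone", 3), ("lamp", 1)],
]
snapshots = list(running_sales(refreshes))
for i, snap in enumerate(snapshots, start=1):
    print(f"after refresh {i}: {snap}")
```

Each snapshot differs from the previous one, which is the property the benchmark specification relies on: the query result changes whenever new data arrives.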
Use Case IV: Does One Size Fit All?
• Growing set of applications have to process relational, text, & graph data
• Compose “best of breed” systems or use a “one size fits all” system?
Data Variety = {Relational, Text, Graph}
Query Variety = Macro
BigFrame will generate a benchmark specification that includes composite workflows with relational, text, and graph analytics
Use Case V: Multi-tenancy and SLAs
• Big data deployments are increasingly multi-tenant and need to meet SLAs
Specified through the Query Volume dimension
BigFrame can generate a benchmark specification containing a specified number of concurrent query streams with class labels for queries (e.g., Batch, Interactive, or Streaming)
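As a rough illustration of concurrent query streams with class labels, the sketch below builds a requested number of streams and labels them round-robin. The three labels come from the slide. The function name, the dictionary layout, and the round-robin and rotation policies are assumptions for this example only.

```python
from itertools import cycle
from typing import Dict, List

QUERY_CLASSES = ("Batch", "Interactive", "Streaming")  # labels named on the slide


def make_streams(num_streams: int, queries: List[str]) -> List[Dict[str, object]]:
    """Build `num_streams` concurrent streams with round-robin class labels."""
    if not queries:
        raise ValueError("at least one query is required")
    labels = cycle(QUERY_CLASSES)
    streams = []
    for i in range(num_streams):
        # Rotate the query list so streams do not all start with the same query.
        offset = i % len(queries)
        streams.append({
            "stream_id": i,
            "class": next(labels),
            "queries": queries[offset:] + queries[:offset],
        })
    return streams


streams = make_streams(4, ["q1", "q2", "q3"])
for s in streams:
    print(s["stream_id"], s["class"], s["queries"])
```

A benchmark driver could then run each stream in its own thread or process and report metrics per class, which is where SLA checks would attach.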
Working with the Community
• First release of BigFrame planned for August 2013
• With feedback from benchmark developers (BigBench)
• Open-source with extensibility APIs
• Benchmark Drivers for more systems
• Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking)
• Instantiate the BigFrame pipeline for more app domains
Take Away
• “Benchmarks shape a field (for better or worse) …” -- David Patterson, Univ. of California, Berkeley
• Benchmarks meet different needs for different people
• End customers, application developers, system designers, system administrators, researchers, CIOs
• BigFrame helps users generate benchmarks that best meet their needs