TRANSCRIPT
Determine the Right Analytic Database: A Survey of New Data Technologies
O’Reilly Strata Conference, February 1, 2011
Mark R. Madsen | http://ThirdNature.net | Twitter: @markmadsen
Atomic Avenue #1 by Glen Orbik
Key Questions
▪ What technologies are available?
▪ What are they good for?
▪ How do you decide which to use?
But first: why are analytic databases available now?
Consequences of Commoditization: Data Volume

[Chart: data generated over time, climbing through chipping, GPS, RFID, sensors, and spimes. "You are here" is marked partway up the curve, with regions labeled "H", "Lots of H", and "Really lots of H".]

“More” can become a qualitative rather than quantitative difference.
“Databases are dead!” – famous last words
An Unexpected Consequence of Data Volumes
Sums, counts and sorted results only get you so far.
Our ability to collect data is still outpacing our ability to derive meaning from it.
Don’t worry about it. We’ll just buy more hardware.
CPUs, memory and storage track to very similar curves
RIP Moore’s Law: it nearly ground to a halt for silicon integrated circuits about four years ago.
Technology Has Changed (a lot) But We Haven’t
[Chart: calculations per second per $1,000, 1900–2000, log scale from 10^-6 to 10^10, spanning mechanical, relay, vacuum tube, transistor, and integrated circuit technologies. Data: Ray Kurzweil, 2001. Annotation: a 10,000x improvement since the mid-1980s, which is where current DW architecture and methods start.]
Multicore performance is not a linear extrapolation.
New Technology Evolution Means New Problems

[Chart: technology maturity (time + engineering effort), 1970–2020. Three eras: the uniprocessor and custom CPU era, the symmetric multi-processing era, and the massively parallel era. Each passes through an early engineering phase (exploring, learning, inventing) and an investment phase (improving, perfecting, applying) before its core problems are solved.]
What’s different?
Parallelism
We’re not getting more CPU power, but more CPUs.
There are too many CPUs relative to other resources, creating an imbalance in hardware platforms.
Most software is designed for a single worker, not high degrees of parallelism, and won’t scale well.
Core problem: software is not designed for parallel work
Databases must be designed to permit local work with minimal global coordination and data redistribution.
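The local-work principle can be sketched as a toy hash-distributed aggregation. This is purely illustrative, not any vendor's implementation; `hash_partition`, `local_sum`, and `parallel_sum` are invented names:

```python
# Sketch of why shared-nothing databases hash-distribute rows: each node
# can aggregate its own partition locally, and only the small per-node
# results need any global coordination.
from collections import defaultdict

def hash_partition(rows, key, n_nodes):
    """Assign each row to a node by hashing its distribution key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[hash(row[key]) % n_nodes].append(row)
    return nodes

def local_sum(partition, key, value):
    """Each node computes a partial aggregate on its own data only."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[key]] += row[value]
    return totals

def parallel_sum(rows, key, value, n_nodes=4):
    partials = [local_sum(p, key, value)
                for p in hash_partition(rows, key, n_nodes)]
    # Because rows were hash-distributed on the grouping key, each group
    # lives on exactly one node -- the merge is a cheap dictionary union,
    # with no data redistribution.
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged
```

A query that groups on a different key than the distribution key would force a redistribution step between nodes, which is exactly the global coordination the slide warns about.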
SOME TECHNOLOGY INNOVATIONS
Storage Improvements
For data workloads, disk throughput is still key.

Improvements:
▪ Spinning disks at $.05/GB
▪ Solid state disks remove some latencies; read speed of ~250 MB/sec
▪ SSD capacity still rising
▪ Card storage (PCI), e.g. FusionIO at 1.5 GB/sec
▪ SSD is still costly, at $2/GB up to $30/GB
Compression Applied to Stored Data
10x compression means 1 disk I/O can read 10x as much data, stretching your current hardware investment
But it eats CPU and memory.
YMMV
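The I/O arithmetic can be shown with a rough sketch using zlib. This is not what any analytic database uses internally, just an illustration of how repetitive column data compresses:

```python
# Repetitive values (dates, status codes, low-cardinality columns)
# compress heavily, so one disk read returns many rows' worth of data.
import zlib

column = ("2011-02-01," * 100_000).encode()   # highly repetitive values
packed = zlib.compress(column)

ratio = len(column) / len(packed)
# At a 10x ratio, a disk streaming 100 MB/s of compressed blocks is
# effectively delivering 1 GB/s of logical data -- paid for with the
# CPU and memory spent on decompression.
assert zlib.decompress(packed) == column      # lossless round trip
```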
Scale‐up vs. Scale‐out Parallelism
Uniprocessor environments required chip upgrades.
SMP servers can grow to a point, then it’s a forklift upgrade to a bigger box.
MPP servers grow by adding more nodes.
Database and Hardware Deployment Models
Three levels of software‐hardware integration:
▪ Database appliance (specialized hardware and software)
▪ Preconfigured (commodity) hardware with software
▪ Software on generic hardware
Then there are the hardware‐database parallel models:
[Diagram: shared everything (one database over one OS), shared disk (several database/OS nodes over common storage), and shared nothing (several database/OS nodes, each with its own storage).]
In‐Memory Processing
1. Maybe not as fast as you think. Depends entirely on the database (e.g. VectorWise)
2. So far, applied mainly to shared‐nothing models
3. Very large memories are more applicable to shared‐nothing than shared‐memory systems: shared memory is box‐limited (e.g. 2 TB max), while shared nothing is limited only by node scaling (e.g. 16 nodes at 512 GB each = 8 TB)
4. Still an expensive way to get performance
Columnar Databases
ID | Name          | Salary
1  | Marge Inovera | $50,000
2  | Anita Bath    | $120,000
3  | Nadia Geddit  | $36,000

Stored column by column:
IDs: 1, 2, 3
Names: Marge Inovera, Anita Bath, Nadia Geddit
Salaries: $50,000, $120,000, $36,000
In a row-store database these three rows would be stored in sequential order as shown here, packed into a block. In a column-store database they would be divided by column and stored in different blocks.
Not just changing the storage layout. Also involves changes to the execution engine and query optimizer.
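The two layouts can be contrasted with a toy sketch (no real engine works this literally) using the table above:

```python
# A row store packs whole rows together; a column store packs each
# column's values together.
rows = [
    (1, "Marge Inovera",  50_000),
    (2, "Anita Bath",    120_000),
    (3, "Nadia Geddit",   36_000),
]

# Row-store layout: one sequence of complete rows.
row_store = list(rows)

# Column-store layout: one sequence per column.
ids, names, salaries = (list(col) for col in zip(*rows))

# SELECT sum(salary): the column store touches only the salary blocks;
# the row store must read past every id and name to reach each salary.
assert sum(salaries) == sum(r[2] for r in row_store)
```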
Column Stores Rule the TPC‐H Benchmark
Columnar Advantages and Disadvantages
+ Reduced I/O for queries not reading all columns
+ Better compression characteristics, meaning database size < raw data size (unlike row store) and less I/O
+ Ability to operate on compressed data, improving overall system performance
+ Less manual tuning
‐ Slower inserts and updates (causing ELT and trickle‐feed problems*)
‐ Worse for small retrievals and random I/O
‐ Uses more system memory and CPU
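The "operate on compressed data" advantage can be sketched with run-length encoding, one scheme column stores use (this toy version is illustrative; real engines combine several encodings):

```python
# With RLE, a sum is computed directly from (value, run_length) pairs
# without ever expanding the runs back into individual values.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_sum(runs):
    # Never materializes the decompressed column.
    return sum(v * n for v, n in runs)

salaries = [50_000] * 3 + [36_000] * 2
assert rle_sum(rle_encode(salaries)) == sum(salaries)
```

Sorted, low-cardinality columns give long runs, which is why column stores both compress better and can sometimes aggregate faster than the raw data could be scanned.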
Advanced Analytic Methods

Explosion of Analytic Techniques:
▪ Machine learning
▪ Statistics
▪ Numerical methods
▪ Text mining & text analytics
▪ Rules engines & constraint programming
▪ Information theory & IR
▪ Visualization
▪ GIS
Map‐Reduce is a parallel programming framework that makes it easier to write code for a distributed computing environment; it is not a database.
“So how do I query the database?”
“It’s not a database, it’s a key-value store!”
“OK, it’s not a database. How do I query it?”
“You write a distributed map-reduce function in Erlang.”
“Did you just tell me to go to hell?”
“I believe I did, Bob.”
What’s Different
No database
No schema
No metadata
No query language*
Good for:
▪ Processing lots of complex or non‐relational data
▪ Batch processing for very large amounts of data
* Hive, Hbase, Pig, others
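The programming model itself is simple enough to sketch in plain Python with the canonical word-count example (a real framework distributes the map and reduce tasks across many machines and handles the shuffle, scheduling, and failures for you):

```python
# Map emits (key, value) pairs; the shuffle groups pairs by key;
# reduce combines each key's values.
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    # map: emit (word, 1) for every word in the record
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # reduce: combine all values that share a key
    return (key, sum(values))

def mapreduce(records):
    pairs = [kv for r in records for kv in map_phase(r)]
    pairs.sort(key=itemgetter(0))              # the "shuffle" step
    return dict(reduce_phase(k, (v for _, v in g))
                for k, g in groupby(pairs, key=itemgetter(0)))
```

Because each map call sees one record and each reduce call sees one key, both phases parallelize trivially across nodes, which is the whole point of the model.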
Using MapReduce / Hadoop
Hadoop is one implementation of MapReduce. There are different variations with different performance and resource characteristics, e.g. Dryad, CGL‐MR, MPI variants.
Hadoop is only part of the solution. You need more for enterprise deployment. Cloudera’s distribution for Hadoop shows what a complete environment could look like.
Image: Cloudera
How Hadoop fits into a traditional BI environment
[Diagram: source environments (databases, documents, flat files, XML, queues, ERP, applications) feed the data warehouse via file loads and ETL; developers, analysts, and end users work against it through development tools and IDEs, analysis tools, BI, and applications.]
Data stores that augment or replace relational access and storage models with other methods.

Different storage models:
• Key‐value stores
• Column families
• Object / document stores
• Graphs

Different access models:
• SQL (rarely)
• Programming API
• get/put

Reality: mostly suck for BI & analytics.

Analytic DB vendors are coming from the other direction:
• Aster Data: SQL wrapped around MR
• EMC (Greenplum): MR on top of the database

NoSQL theoretically = “not only SQL”, in reality…
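Why a bare get/put API is painful for BI can be shown with a toy sketch (the dict stands in for any key-value store; the `order:` key scheme is invented for illustration):

```python
# With only key lookups, an aggregate "query" means enumerating keys
# and summing client-side -- work a SQL engine would do for you.
store = {}                        # stand-in for a key-value store

def put(key, value):
    store[key] = value

def get(key):
    return store[key]

put("order:1", {"customer": "a", "amount": 50})
put("order:2", {"customer": "b", "amount": 70})

# SQL equivalent: SELECT sum(amount) FROM orders
# KV equivalent: fetch every order yourself and aggregate in your code.
total = sum(get(k)["amount"] for k in store if k.startswith("order:"))
```

At analytic scale this pattern drags all the data across the network to the client, which is the opposite of what an analytic database tries to do.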
Some realities to consider
Cheap performance?
▪ Do you have 20 blades lying around unused?
▪ How much concurrency?
▪ How much effort to write queries? Debug them?
▪ Performance comparisons: 10x slower on the same hardware?
The key is the workload type and the scale of it.
Do you really need a rack of blades for computing?
Graphics co‐processors have been used for certain problems for years.
They offer a single‐system solution for offloading very large compute‐intensive problems.
An order‐of‐magnitude cost reduction and an order‐of‐magnitude performance increase with current technology (for compute‐intensive problems).
We’ve barely started with this.
Other Options for analytic software deployment
The basic models.
1. Separate tools and systems (MapReduce and NoSQL are a simple variation on this theme)
2. Integrated with a database
3. Embedded in a database
The primary arguments about deployment models center on whether to take data to the code or code to the data.
Leveraging the Database
Levels of database integration:
▪ Native DB connector
▪ External integration
▪ Internal integration
▪ Embedded

+ Less data movement
+ Possible dev process support
+ Hardware / environment savings
+ Possible “sandboxing” support
‐ Limitations on techniques
In‐database Execution
You can do a lot with standards‐compliant SQL
If the database has UDFs, you can code too (but it’s harder)
Parallel support for UDFs varies
Some vendors build functions directly into the database (usually scalar)
Iterative algorithms (ones that converge on a solution) are problematic, more so in MPP
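As a toy illustration of in-database execution, SQLite's Python binding can register a scalar UDF via `sqlite3.Connection.create_function` (a real API; the logistic-scoring example around it is invented, and analytic databases expose richer, parallel UDF interfaces):

```python
# The function runs inside the query, next to the data, instead of
# pulling rows out to an application.
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (customer TEXT, raw REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 0.0), ("b", 2.0)])

# Register a scalar Python function as a SQL function named "logistic".
conn.create_function("logistic", 1, lambda x: 1 / (1 + math.exp(-x)))

rows = conn.execute(
    "SELECT customer, logistic(raw) FROM scores ORDER BY customer"
).fetchall()
```

A scalar UDF like this maps one row to one value, so it parallelizes cleanly; the iterative, converging algorithms mentioned above need multiple passes over the data, which is what makes them awkward in SQL and harder still in MPP.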
What are factors in the decision?
User concurrency: one job or many.

Repetition is a key element:
▪ Execute once and apply (build a response or mortality model)
▪ Many executions daily (web cross‐sells)

In‐process or batch?
▪ Batch and use results: segment, score
▪ In‐process reacts on demand: detect fraud, recommend
In‐process requires thinking about how it integrates with the calling application. (SQL sometimes not your friend)
MATCHING THE PROBLEMS TO TECHNOLOGIES
The problem of size is really three problems of volume:
▪ Number of users
▪ Computations
▪ Amount of data
Hardware Architectures and Deployment
Compute and data sizes are the key requirements
[Chart: computations (megaflops to petaflops) vs. data volume (<10s GB, 100s GB, 1s TB, 10s TB, 100s TB, PB). Zones: PC at the low end; shared everything or shared disk in the middle; shared nothing and MR and related at large scale.]
Hardware Architectures and Deployment
[The same computations vs. data volume chart, with the low-data, low-compute region highlighted.]
Today’s reality, and true for a while in most businesses.
The bulk of the market resides here!
Hardware Architectures and Deployment
[The same computations vs. data volume chart again.]
Today’s reality, and true for a while in most businesses.
The bulk of the market resides here!
…but analytics pushes many things into the MPP zone.
The real question: why do you want a new platform?
Trouble doing what you already do today:
▪ Poor response times
▪ Not meeting availability deadlines

Doing more of what you do today:
▪ Adding users, mining more data

Doing something new with your data:
▪ Data mining, recommendations, embedded real‐time process support
What’s desired is possible but limited by the cost of supporting or growing the existing environment.
The World According to Gartner: One Magical Quadrant
▪ SQL Server 2008 R2 (PDW): official production customers?
▪ EMC / Greenplum: SQL limitations; memory / concurrency issues
▪ Ingres: OLTP database
▪ Illuminate: SQL limitations; very limited scalability
▪ Sun: MySQL for a DW? Is this a joke?
Magic Quadrant for Data Warehouse Database Management Systems
The assumption of the warehouse as a database is gone
[Diagram:
                                                Data at rest                    Data in motion
 Traditional tabular or structured data         Databases                       Streaming DBs/engines
 Non-traditional data (logs, audio, documents)  Parallel programming platforms  Message streams]
Data Access Differences
Basic data access styles:
▪ Standard BI and reporting
▪ Dashboards / scorecards
▪ Operational BI
▪ Ad‐hoc query and analysis
▪ Batch analytics
▪ Embedded analytics

Data loading styles:
▪ Refresh
▪ Incremental
▪ Constant
Evaluating ADB Options
Storage style:
▪ Files, tables, columns, cubes, KV
Storage type:
▪ Memory, disk, hybrid, compressed
Scaling model:
▪ SMP, clustered, MPP, distributed
Deployment model:
▪ Appliance, cloud, SaaS, on‐premise
Data access model:
▪ SQL, MapReduce, R, languages, etc.
License options:
▪ CPU, data size, subscription
What’s it going to cost? A small sample at list:

Solution                   | Pricing model      | Price/unit             | 1 TB solution | Remarks
DatAupia                   | Node               | $19,500/2TB            | $19,500       | You can’t buy a 1 TB Satori server
Kickfire (out of business) | Data volume (raw)  | $50,000/TB             | $50,000       | Includes MySQL 5.1 Enterprise
Vertica                    | Data volume (raw)  | $100,000/TB            | $200,000      | Based on 5 nodes, $20,000 each
ParAccel                   | Data volume (raw)  | $100,000/TB            | $200,000      | Based on 5 nodes, $20,000 each
EXASOL                     | Data volume (active) | $1,350/GB (€1,000/GB) | $350,000*    | Based on 4 nodes, $20,000 each
Teradata                   | Node               | $99,000/TB             | $99,000**     | Based on 2550 base configuration

* 1 TB raw ≈ 200 GB active  ** realistic configuration likely 2x this price
Factors and Tradeoffs

The core tradeoff is not always money for performance. What else do you trade?
• Load time
• Trickle feeds
• New ETL tools
• New BI tools
• Operational complexity:
  • Data integration and management
  • Backups
  • Hardware maintenance
The Path to Performance
1. Laborware – tuning
2. Upgrade – try to solve the problem without changing out the database
3. Extend – add an ADB or Hadoop cluster to the environment to offload a specific workload
4. Replace – out with the old, in with the new
One Word: PoC!
The Future
Assuming the database market embraces MPP, you have compute power that exceeds what the DB itself needs.
Why not execute the code at the data?
Even without MPP, moving to in‐database analytic processing is a future direction and is workable for a large number of people.
Thank you!
Image Attributions
Thanks to the people who supplied the images used in this presentation:
Atomic Avenue #1 by Glen Orbik ‐ http://www.orbikart.com/gallery/displayimage.php?album=4&pos=5
spices.jpg ‐ http://flickr.com/photos/oberazzi/387992959/
Black hole galaxy ‐ http://www.flickr.com/photos/badastronomy/3176565627/
weaver peru.jpg ‐ http://flickr.com/photos/slack12/442373910/
rc toy truck.jpg ‐ http://flickr.com/photos/texas_hillsurfer/2683650363/
automat purple2.jpg ‐ http://flickr.com/photos/alaina/288199169/
open_air_market_bologna ‐ http://flickr.com/photos/pattchi/181259150/
bored_girl.jpg ‐ http://www.flickr.com/photos/alejandrosandoval/280691168/
path_vecchia.jpg ‐ http://www.flickr.com/photos/funadium/2320388358/
fast kids truck peru.jpg ‐ http://flickr.com/photos/zerega/1029076197/
What’s best for which types of problems?*
Shared nothing will be best for solving large data problems, regardless of workload or concurrency.
Column‐stores will improve query response time problems for most traditional query and aggregation workloads.
Row‐stores will be better for operational BI or embedded BI.
Fast storage always makes things better, but is only cost‐effective for medium scale or smaller data.
Compression will help everyone, but column‐stores more than row stores because of how the engines work.
Map‐Reduce and distributed filesystems offer the advantages of a schema‐less storage and analytic layer that can process data into relational databases.
SMP and in‐memory will be better for high complexity problems under moderate data scale, shared‐nothing and MR for large data scale.
*The answer is always “it depends”
About the PresenterMark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, analytics and performance management. Mark is an award-winning author, architect and former CTO whose work has been featured in numerous industry publications. During his career Mark received awards from the American Productivity & Quality Center, TDWI, Computerworld and the Smithsonian Institute. He is an international speaker, contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://ThirdNature.net.
About Third Nature
Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, data integration and information management. If your question is related to BI, open source, web 2.0 or data integration then you‘re at the right place.
Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating the products rather than vendor market positions.