The New Analytical DB for the Hadoop Platform
Sept 2012
Agenda
● Where is the (Big) data?
● How “big” is Big Data?
● Approaches to working with data
  ● Transactional/operational systems
  ● Analytical systems
● Hadapt
● Hadapt compared to HBase
● Who we are and where we come from
● Hadapt in Poland
● What's next in Hadapt
Big Data: Volume | Variety | Velocity
Source: wikibon.org
• 2,500 exabytes of new information in 2012
• “Digital universe” grew by 62% last year to 800K petabytes & will grow to 1.2 zettabytes this year
• 80% of data is typically not in data warehouses
Data Beats Algorithms
“I’m at Google because that’s where the data is.”
-- Peter Norvig, on why he left NASA for Google in 2001
Databases
Datastores
Where did Hadapt come from and Why?
“Digital universe” grew by 62% last year to 800K petabytes & will grow to 1.2 zettabytes this year
“How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did”
“Why Netflix produces BBC remake starring Kevin Spacey, directed by David Fincher”
Differences of Purpose: Transaction Processing
Operational systems
● Optimized for small, short random accesses (reads and writes)
● e.g. record that a person bought 20 shares of a company on the stock market, or record that a user posted something on another user's “wall”
Traditional DB examples
● Oracle
● MySQL
NoSQL examples
● HBase
● MongoDB
● Cassandra
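A minimal sketch of the transactional access pattern described above, using Python's built-in sqlite3 module as a stand-in for an operational database. The `trades` table, its columns, and the helper names are assumptions made for illustration; they do not come from the slides.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades ("
    "id INTEGER PRIMARY KEY, account TEXT, symbol TEXT, shares INTEGER)"
)


def record_trade(conn, account, symbol, shares):
    """Write one small record as a short transaction and return its row id."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO trades (account, symbol, shares) VALUES (?, ?, ?)",
            (account, symbol, shares),
        )
    return cur.lastrowid


def lookup_trade(conn, trade_id):
    """Point lookup by primary key: reads a single row."""
    return conn.execute(
        "SELECT account, symbol, shares FROM trades WHERE id = ?",
        (trade_id,),
    ).fetchone()


trade_id = record_trade(conn, "alice", "ACME", 20)
row = lookup_trade(conn, trade_id)
```

Each call touches one row, which is the kind of small random read or write that operational systems are tuned for.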
Differences of Purpose: Analytics
Analytics
● Optimized for read-only computations about large amounts of data
● e.g. compute the average amount invested in bond funds and stock funds for all employees at all employers over the last 5 years
DB examples
● Netezza
● Vertica
NoSQL examples
● Hive
● Pig
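The bond/stock fund example above can be sketched as a single read-only aggregate query. This uses Python's built-in sqlite3 module purely for illustration; the `holdings` table, its columns, and the sample rows are assumptions, not data from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE holdings ("
    "employee TEXT, employer TEXT, fund_type TEXT, year INTEGER, amount REAL)"
)
# Made-up sample rows for illustration.
sample_rows = [
    ("ann", "Acme", "bond", 2010, 1000.0),
    ("ann", "Acme", "stock", 2011, 3000.0),
    ("bob", "Newco", "bond", 2012, 2000.0),
    ("bob", "Newco", "stock", 2012, 5000.0),
    ("cat", "Bigcorp", "stock", 2005, 9000.0),  # outside the 5-year window
]
conn.executemany("INSERT INTO holdings VALUES (?, ?, ?, ?, ?)", sample_rows)


def average_by_fund_type(conn, first_year, last_year):
    """Scan all rows in the year range and average the amount per fund type."""
    cur = conn.execute(
        "SELECT fund_type, AVG(amount) FROM holdings "
        "WHERE year BETWEEN ? AND ? "
        "GROUP BY fund_type ORDER BY fund_type",
        (first_year, last_year),
    )
    return dict(cur.fetchall())


averages = average_by_fund_type(conn, 2008, 2012)
```

Unlike a point lookup, this query has to read every row in the time window, which is why analytical systems are optimized for large scans rather than single-record access.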
[Figure: sample charts. Legible labels: months Oct to Mar; series “Actual”, “Option 1”, “Option 2”; companies Acme, GM, Newco, Oldco, Bigcorp, Foo]
The evolution of analytics – where are we today?
The early stages of analytics
• Market Basket Analysis
• Trend Analysis
• Cyclical Analysis
• Customer Segmentation
New Analytical Models
• Pattern Detection, Discovery, Matching
• A/B Testing and Behavioral Analysis
• Sessionization
• Social Correlation Analysis
• Fractional Attribution
• Sentiment Analysis
• Personalization
Hadapt – The Adaptive Analytical Platform for Big Data
● Company started in early 2011; it is commercializing HadoopDB, a Yale University research project by Kamil Bajda-Pawlikowski, led by Dr. Daniel Abadi
● Combines the benefits of Apache Hadoop and relational DBMS technology into a single system for applications that rely on multi-structured data analytics
● Designed for the cloud and optimized for virtualized environments
● Architected to leverage clusters of industry-standard (commodity) machines
● Provides the full power of MapReduce as well as SQL support, and the ability to work with data within a single platform
● Based on findings from the HadoopDB project, it aims to achieve:
– Performance and efficiency of MPP databases
– Scalability, fault tolerance, and flexibility of MapReduce-based systems
Hadapt Analysis Process
[Diagram labels: Raw Data, Hadapt Bulk loader, load, enrich, query, analyze, predict, BI Tools, Applications, Predictive Analytics]
Multi-Structured Big Data Analytics Across Industries – Use Cases
Need for deep data analysis… on TBs to PBs of data… with minutes-to-seconds response times
Internet Use Cases
Financial Services & Insurance Use Cases
Retail Use Cases
Communications, Media & Information Services Use Cases
• Recommendation Engines
• Cross-channel Analysis
• Clickstream/Golden Path Analysis
• Right Offer at the Right Time
• Social networking graph analysis
• Ad Revenue Optimization
• Risk Warehousing
• e-Discovery
• Tick data back testing
• Anti-Money Laundering/Fraud Detection
• Customer Behavior Analysis
• Customer Behavior Analytics
• Market & Consumer Segmentation
• Event and Behavior-based Targeting
• Affinity/Market Basket Analysis
• Loyalty Analytics
• Price Optimization
• CDR Analysis
• Customer Churn Prevention
• Network Optimization
• Ad optimization
Common requirements across these applications:
● Ad hoc analysis
● Structured & unstructured data
● Rapid iteration
● Elastic scale-out, cloud deployments
Hadapt Architecture
[Diagram]
Master Node
● HDFS: Namenode
● MapReduce Framework: JobTracker
Node 1 … Node n (each)
● TaskTracker
● Database
● DataNode
Other components and flow labels
● Hadapt SQL Engine
● SQL Query, MapReduce Job
● Load & Query Tasks
Hadapt – Key components – Query Engine
Flexible Query Interface
● Data can be queried using both SQL and MapReduce
● SQL can be embedded within MapReduce or vice versa
● JDBC/ODBC drivers for connectivity with customer-facing BI tools
Query Planner
● Queries are analyzed to consider data partitioning and distribution, indexes, and statistics to determine a query plan
● Split query execution ensures optimal use of the DBMS layer before pushing operations into Hadoop
Adaptive Query Execution
● In MPP databases the time to complete the query will be approximately equal to the time it takes the slowest compute node to complete its assigned task
● This dynamic is especially problematic in a cloud environment
● Query plans are adjusted dynamically based on cloud worker node performance
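The slowest-node effect under Adaptive Query Execution can be illustrated with a small simulation. This is a sketch under stated assumptions, not Hadapt's actual scheduler: it compares a fixed up-front split of work against a simple pull-based scheme in which a node takes the next task as soon as it is free. The function names, task counts, and node speeds are all made up.

```python
import heapq


def static_makespan(num_tasks, task_cost, node_speeds):
    """Equal share per node up front; the query ends when the slowest node ends."""
    per_node = num_tasks / len(node_speeds)
    return max(per_node * task_cost / speed for speed in node_speeds)


def dynamic_makespan(num_tasks, task_cost, node_speeds):
    """Nodes pull the next task when free, so faster nodes end up doing more work."""
    # Min-heap of (time at which the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(node_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        start, node = heapq.heappop(free_at)
        done = start + task_cost / node_speeds[node]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, node))
    return finish


# Three normal nodes and one node running at a quarter of the speed.
speeds = [1.0, 1.0, 1.0, 0.25]
static_time = static_makespan(40, 1.0, speeds)
dynamic_time = dynamic_makespan(40, 1.0, speeds)
```

With these made-up numbers the fixed split takes 40 time units because everyone waits for the slow node, while the pull-based scheme finishes in 13. Small tasks assigned at run time are one way to adapt to uneven node performance.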
Hadapt – Key components – Data Engine
Data Loader
● Data is loaded using all machines in parallel
● Data is partitioned into small chunks and replicated across the cluster
● Optimizes query performance and fault tolerance
Data Manager
● Stores metadata about the schema, data, and chunk distribution
● Handles data replication, backups, recovery, and rebalancing of chunks across the cluster
Hybrid Storage Engine
● A DBMS engine is stored on each node in addition to a standard distributed file system (HDFS)
● DBMS layer is optimized for structured data and HDFS handles unstructured data
More insight into the underlying technology: http://www.HadoopDB.net
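A minimal sketch of the idea behind partitioned loading plus a node-local DBMS, assuming a toy three-node cluster. Each "node" is an in-memory SQLite database standing in for the per-node DBMS engine; rows are hash-partitioned at load time, the aggregation is pushed into each local database, and the partial results are merged in a reduce step. All names (`node_for`, `load`, `map_phase`, `reduce_phase`, the `sales` table) are hypothetical, and replication is left out for brevity.

```python
import sqlite3
from zlib import crc32

NUM_NODES = 3  # assumption: a tiny cluster for illustration


def node_for(key, num_nodes=NUM_NODES):
    """Stable hash partitioning: the same key always maps to the same node."""
    return crc32(key.encode("utf-8")) % num_nodes


def load(rows, num_nodes=NUM_NODES):
    """Partition rows by customer key into one local database per node."""
    nodes = [sqlite3.connect(":memory:") for _ in range(num_nodes)]
    for db in nodes:
        db.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    for customer, amount in rows:
        nodes[node_for(customer, num_nodes)].execute(
            "INSERT INTO sales VALUES (?, ?)", (customer, amount)
        )
    return nodes


def map_phase(db):
    """Push the aggregation into the node-local DBMS; return (sum, count)."""
    return db.execute(
        "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM sales"
    ).fetchone()


def reduce_phase(partials):
    """Merge per-node partial aggregates into the global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


rows = [("acme", 10.0), ("newco", 20.0), ("bigcorp", 30.0), ("acme", 40.0)]
nodes = load(rows)
average = reduce_phase([map_phase(db) for db in nodes])
```

Only two numbers per node cross the network in this sketch, which is the appeal of doing as much work as possible in the local database before handing results to the MapReduce layer.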
HBase Data Model: Conceptual
From the BigTable paper: “a sparse, distributed, persistent multi-dimensional sorted map”
(row: bytestring, column family: bytestring, column: bytestring, time: int64) ---> byte string
HBase Map
{
  "key_1" : {
    "columnfamily_a" : {
      "column_i" : { 15 : "y", 4 : "m" },
      "column_ii" : { 15 : "d" }
    },
    "columnfamily_b" : {
      "column_other" : { 6 : "w", 3 : "o", 1 : "w" }
    }
  }
}
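The conceptual map above can be mimicked with nested Python dictionaries. This is only an illustration of the data model, not the HBase client API; the `get` helper and its "newest version at or before a timestamp" rule are simplifying assumptions.

```python
# Nested map: row -> column family -> column -> timestamp -> value.
table = {
    "key_1": {
        "columnfamily_a": {
            "column_i": {15: "y", 4: "m"},
            "column_ii": {15: "d"},
        },
        "columnfamily_b": {
            "column_other": {6: "w", 3: "o", 1: "w"},
        },
    },
}


def get(table, row, family, column, timestamp=None):
    """Return the newest value at or before `timestamp` (newest overall if None)."""
    versions = table.get(row, {}).get(family, {}).get(column, {})
    candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
    return versions[max(candidates)] if candidates else None


latest = get(table, "key_1", "columnfamily_a", "column_i")
older = get(table, "key_1", "columnfamily_b", "column_other", timestamp=5)
missing = get(table, "key_2", "columnfamily_a", "column_i")
```

Rows that have no value for a column simply lack the key, which is what "sparse" means in this model.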
Hadapt Data Model: Conceptual
Traditional Relational Tables
CUSTKEY  NAME     ADDRESS         NATIONKEY  PHONE         ACCTBAL     COMMENT
451234   NEWCORP  196 Broadway…   1          111-555-1212  $1,231,285  NULL
887765   ACME     1 Main st. …    2          222-555-1212  $46,945     “Top customer”
HBase Data Model: Physical
Every cell stored with row, family, column and timestamp
Allows fast lookup with low copy overhead
BUT
Space inefficient (optional compression available) and inefficient to scan
ROW      FAMILY  COLUMN  TIMESTAMP  VALUE
“key_1”  “cf_a”  “c_i”   15         “foo”
“key_1”  “cf_a”  “c_ii”  15         “bar”
“key_2”  “cf_a”  “c_ii”  4          “baz”
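To make the space overhead concrete, the sketch below flattens a nested map into one tuple per cell, as in the rows above, and compares the characters spent on repeated keys with those spent on values. It is a rough illustration with a hypothetical `flatten` helper, not HBase's on-disk format.

```python
def flatten(table):
    """Return one (row, family, column, timestamp, value) tuple per cell."""
    cells = []
    for row, families in table.items():
        for family, columns in families.items():
            for column, versions in columns.items():
                for ts, value in versions.items():
                    cells.append((row, family, column, ts, value))
    # Sort by row, family, column ascending, then newest timestamp first.
    cells.sort(key=lambda c: (c[0], c[1], c[2], -c[3]))
    return cells


table = {
    "key_1": {"cf_a": {"c_i": {15: "foo"}, "c_ii": {15: "bar"}}},
    "key_2": {"cf_a": {"c_ii": {4: "baz"}}},
}
cells = flatten(table)

# Characters spent on repeated row/family/column names versus actual values.
key_chars = sum(len(r) + len(f) + len(c) for r, f, c, _, _ in cells)
value_chars = sum(len(v) for *_, v in cells)
```

In this tiny example the key material is about four times the size of the values, which is the kind of overhead that compression is meant to reduce. A scan still has to step over the repeated keys of every cell.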
Hadapt Data Model: Physical
Leverages RDBMS
Supports normalized or denormalized data models
Data Model / Workload Comparison
              Hadapt                            HBase
Conceptual    Relational tables                 Sparse sorted map
Schema        Structured                        Fluid
Data density  Dense                             Sparse
Workload      Large scans, joins, aggregations  Point lookup, short range lookup, updates
Interface     SQL                               Custom API
Informal Performance Comparison
                        Hadapt                 HBase
Load / ingest           Batch                  Fast!
Lookup speed            Few seconds            Fast!
Data warehouse queries  50x faster than HBase  Uh oh
Hadapt is NOT
● OLTP
● NoSQL Key/Value store
● CEP – streaming analysis
● Web Server
● File System
(but we do integrate with all of them)
Example HBase+Hadapt Application
Social Graph Input into a Communication Monitoring System
● HBase provides real-time lookup and update of connected entities and their risk profiles for monitoring and alerting: incremental data capture, real-time detection
● Hadapt periodically recalculates the rich entity connectivity model that is deployed to the HBase real-time persistence layer; it calculates the patterns that should be detected in real time
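The division of labor in this example can be sketched as a batch step that rebuilds a model and a serving layer that answers point lookups. Everything here is a made-up stand-in: the scoring rule (count of directly connected flagged entities), the `ServingStore` class, and the sample graph are assumptions for illustration, not the actual system.

```python
from collections import defaultdict


def recompute_risk_profiles(edges, flagged):
    """Batch step: score each entity by its number of flagged direct neighbors."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {entity: len(peers & flagged) for entity, peers in neighbors.items()}


class ServingStore:
    """Stand-in for the real-time layer: a dict offering fast point lookups."""

    def __init__(self):
        self._profiles = {}

    def deploy(self, profiles):
        """Swap in a freshly computed model."""
        self._profiles = dict(profiles)

    def risk(self, entity):
        """Real-time lookup; unknown entities get a score of 0."""
        return self._profiles.get(entity, 0)


edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
store = ServingStore()
store.deploy(recompute_risk_profiles(edges, flagged={"cat"}))
```

The expensive graph-wide computation runs periodically in the analytical system, while the serving store only answers single-key reads between refreshes.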
Hadapt Board
Chris Lynch, Chairman of the Board – Previously CEO of Vertica
Sharmila Shahani-Mulligan – Previously CMO of Aster Data
Felda Hardymon – Staples, Endeca, PTC, Vertica, Gartner, BladeLogic, Skype, LinkedIn, and many others
Matthew Howard - Avere Systems, Blue Jeans Network, ConteXtream, MobileIron, Pertino Networks, Retrevo, and many others
Daniel Abadi, Chief Scientist – Yale, MIT, known for C-Store (Vertica), and HadoopDB (Hadapt)
Hadapt Management
Justin Borgman - Chief Executive Officer and Co-Founder
Dr. Daniel Abadi - Chief Scientist and Co-founder
Philip Wickline - Chief Technology Officer
Kamil Bajda-Pawlikowski - Chief Software Architect and Co-Founder
Kelly Stirman - Vice President of Customer Solutions
Scott Howser - Vice President of Marketing
Hadapt in Poland
● Kamil Bajda-Pawlikowski (HadoopDB is part of his PhD work) graduated from Wrocław University of Technology
● Hadapt has had contributors in Poland since its inception, and most of them are still with us
● Hadapt now has a permanent location, an office in Warsaw
● Hadapt in Poland is now a legitimate company: Hadapt Polska sp. z o.o.
● Hadapt Polska is now hiring!
– We're looking for a couple of bright, excellent senior/principal software engineers with great OOP/system skills and experience in developing enterprise products
Hadapt in the Future
● In late 2011 Hadapt raised $9.5 million in funding and has been growing rapidly since then; headcount is ca. 40 employees
● We already have several big customers in the USA, and are gaining more market attention every month
● A big announcement is coming in October at the Strata/HadoopWorld 2012 conference in New York
THANK YOU
Wojciech [email protected]