TRANSCRIPT
Big Data Landscape for Databases
Bob Baran, Senior Sales Engineer
May 12, 2015
Typical Database Workloads
The five workload types below are ordered from operational to analytical.

OLTP Applications
• Typical databases: MySQL, Oracle
• Use cases: ERP, CRM, supply chain
• Workload strengths: real-time updates; ACID transactions; high concurrency of small reads/writes; range queries

Real-Time Web, Mobile, and IoT Applications
• Typical databases: MongoDB, Cassandra, MySQL, Oracle
• Use cases: web, mobile, social; IoT
• Workload strengths: real-time updates; high ingest rates; high concurrency of small reads/writes; range queries

Real-Time, Operational Reporting
• Typical databases: MySQL, Oracle
• Use cases: operational datastores; Crystal Reports
• Workload strengths: real-time updates; canned, parameterized reports; range queries

Ad-Hoc Analytics
• Typical databases: Greenplum, ParAccel, Netezza
• Use cases: exploratory analytics; data mining
• Workload strengths: complex queries requiring full table scans; append only

Enterprise Data Warehouses
• Typical databases: Teradata, Oracle, Sybase IQ
• Use cases: enterprise reporting
• Workload strengths: parameterized reports against historical data
Recent History of RDBMSs
▪ RDBMS definition
  ▪ Relational with joins
  ▪ ACID transactions
  ▪ Secondary indexes
  ▪ Typically row-oriented
  ▪ Operational and/or analytical workloads
▪ By the early 2000s
  ▪ Limited innovation
  ▪ Looked like Oracle and Teradata had won…
Hadoop Shakes Up Batch Analytics
▪ Data processing framework
▪ Cheap distributed file system
▪ Brute-force batch processing through MapReduce
▪ Great for batch analytics
▪ Great place to dump data to look at later
NoSQL Shakes Up Operational DBs
▪ NoSQL wave
  ▪ Companies like Google, Amazon, and LinkedIn needed greater scalability & schema flexibility
  ▪ New databases developed by developers, not database people
  ▪ Provided scale-out, but lost SQL
  ▪ Worked well at web startups because:
    ▪ In some cases, use cases did not need ACID
    ▪ Willing to handle exceptions at the app level
Convoluted Evolution of Databases
[Chart plotting functionality (scale up) against scalability (scale out): Indexed Files (ISAM), 1960s; Hierarchical/Network Databases, 1970s; Traditional RDBMSs, 1980s-2000s; Hadoop, 2005; NoSQL Databases, 2010; Scale-out SQL Databases, 2013]
Mainstream user changes
▪ Driven by web, social, mobile, and the Internet of Things
▪ Major increases in scale: 30% annual data growth
▪ Significant requirements for semi-structured data, though relatively little unstructured
▪ Technology adoption continuum, from “What is it?” to “Should I use it?” to “Why wouldn’t I use it?”: Cloud; NoSQL for web apps; scale-out SQL DBs for operational apps; Hadoop technologies for analytics
Schema on Ingest vs. Schema on Read
▪ Even “schemaless” MongoDB requires a “schema”; see “10 Things You Should Know About Running MongoDB At Scale” by Asya Kamsky, Principal Solutions Architect at MongoDB; item #1 is “have a good schema and indexing strategy”
▪ Use schema on read if you only use the data a few times a year
▪ Structured data should always remain structured
▪ Add schema if the data is used regularly
[Diagram: data stream flowing into an application]
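To make the two approaches concrete, here is a minimal SQL sketch. The table names, file layout, and the Hive-style external-table syntax are illustrative assumptions, not from the slides:

    -- Schema on ingest: declare the schema up front, then load into it;
    -- data is validated and structured from the moment it lands.
    CREATE TABLE orders (
        order_id    BIGINT NOT NULL PRIMARY KEY,
        customer_id BIGINT NOT NULL,
        order_date  DATE,
        amount      DECIMAL(10, 2)
    );

    -- Schema on read: leave the data as flat files and project a schema
    -- onto them only at query time (Hive-style external table shown as
    -- one common example; exact syntax varies by engine).
    CREATE EXTERNAL TABLE raw_orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_date  STRING,
        amount      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/orders';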
Scale-out is the future of databases
[Diagram: “How do I scale?” Scale up (traditional RDBMS) vs. scale out (NoSQL, NewSQL, MPP, SQL-on-Hadoop analytic engines, Hadoop)]
NoSQL
Pros
▪ Easy scale-out
▪ Flexible schema
▪ Easier web development with hierarchical data structures (MongoDB)
▪ Cross-data center replication (Cassandra)
Cons
▪ No SQL: requires retraining and app rewrites
▪ No joins, i.e., no cross-row/document dependencies
▪ No reliable updates through transactions across rows/tables
▪ Eventual consistency (Cassandra)
▪ Not designed to do the aggregations required for analytics
NewSQL
Pros
▪ Easy scale-out
▪ ANSI SQL: eliminates retraining and app rewrites
▪ Reliable updates through ACID transactions
▪ RDBMS functionality
▪ Strong cross-data center replication (NuoDB)
Cons
▪ Proprietary scale-out, unproven into petabytes
▪ Must manage another distributed infrastructure beyond Hadoop
▪ Cannot leverage the Hadoop ecosystem of tools
NewSQL – In-Memory
Pros
▪ Easy scale-out
▪ High performance because everything is in memory
▪ ACID transactions within nodes
Cons
▪ Memory is 10-20x more expensive
▪ Limited SQL
▪ Limited cross-node transactions
▪ Proprietary scale-out, unproven into petabytes
▪ Must manage another distributed infrastructure beyond Hadoop
▪ Cannot leverage the Hadoop ecosystem
Operational RDBMS on Hadoop
Pros
▪ Easy scale-out
▪ Scale-out infrastructure proven into petabytes
▪ ANSI SQL: eliminates retraining and app rewrites
▪ Reliable updates through ACID transactions
▪ Leverages the Hadoop distributed infrastructure and tool ecosystem
Cons
▪ Full table scans slower than MPP DBs, but faster than traditional RDBMSs
▪ Existing HDFS data must be re-loaded through the SQL interface
MPP Analytical Databases
Pros
▪ Easy scale-out
▪ Very fast performance for full table scans
▪ Highly parallelized, shared-nothing architectures
▪ May have columnar storage (Vertica)
▪ No maintenance of indexes (Netezza)
Cons
▪ Poor concurrency models prevent support of real-time apps
▪ Poor performance for range queries
▪ Need to redistribute all data to add nodes (hash partitioning)
▪ May require specialized hardware (Netezza)
▪ Proprietary scale-out: cannot leverage the Hadoop ecosystem of tools
SQL-on-Hadoop – Analytical Engines
Pros
▪ Easy scale-out
▪ Scale-out proven into petabytes
▪ Leverages the Hadoop distributed infrastructure
▪ Can leverage the Hadoop ecosystem of tools
Cons
▪ Relatively immature, especially compared to MPP DBs
▪ Limited SQL
▪ Poor concurrency models prevent support of real-time apps
▪ No reliable updates through transactions
▪ Intermediate results must fit in memory (Presto)
Future: Hybrid In-Memory Architectures
▪ Memory cache with disk
▪ Pure in-memory: very expensive; unsophisticated memory management
▪ Hybrid in-memory: flexible, cost-effective; controlled by the optimizer; in-memory materialized views?
Summary – Future of Databases
▪ Predicted trends
  ▪ Scale-out dominates databases
  ▪ Developers stop worrying about data size and develop new data-driven apps
  ▪ Hybrid in-memory architecture becomes mainstream
▪ Predicted winners
  ▪ Hadoop becomes the de facto distributed file system
  ▪ NoSQL used for simple web apps
  ▪ Scale-out SQL RDBMSs replace traditional RDBMSs
Questions?
Bob Baran Senior Sales Engineer
May 12, 2015
Powering Real-Time Apps on Hadoop
Bob Baran Senior Sales Engineer
May 12, 2015
Who Are We?
THE ONLY HADOOP RDBMS
Power operational applications on Hadoop
▪ Affordable, scale-out: commodity hardware
▪ Elastic: easy to expand or scale back
▪ Transactional: real-time updates & ACID transactions
▪ ANSI SQL: leverage existing SQL code, tools, & skills
▪ Flexible: support operational and analytical workloads
10x better price/performance
What People are Saying…
Recognized as a key innovator in databases.

Quotes:
• “Scaling out on Splice Machine presented some major benefits over Oracle … automatic balancing between clusters … avoiding the costly licensing issues.”
• “An alternative to today’s RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop.”
• “The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop.”

Awards: [award logos]
Advisory Board
Advisory Board includes luminaries in databases and technology:
• Roger Bamford: former Principal Architect at Oracle; father of Oracle RAC
• Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab (birthplace of Apache Spark)
• Marie-Anne Neimat: co-founder, TimesTen Database; former VP, Database Eng. at Oracle
• Ken Rudin: Head of Analytics at Facebook; former GM of Oracle Data Warehousing
Combines the Best of Both Worlds
Hadoop:
▪ Scale-out on commodity servers
▪ Proven to 100s of petabytes
▪ Efficiently handles sparse data
▪ Extensive ecosystem

RDBMS:
▪ ANSI SQL
▪ Real-time, concurrent updates
▪ ACID transactions
▪ ODBC/JDBC support
Focused on OLTP and Real-Time Workloads
(This slide repeats the typical database workloads table shown earlier; the focus here is the operational side: OLTP applications; real-time web, mobile, and IoT applications; and real-time, operational reporting.)
OLTP Campaign Management: Harte-Hanks
Overview:
▪ Digital marketing services provider
▪ Unified Customer Profile
▪ Real-time campaign management
▪ OLTP environment with BI reports
Challenges:
▪ Oracle RAC too expensive to scale
▪ Queries too slow, sometimes up to ½ hour
▪ Getting worse: expecting 30-50% data growth
▪ Looked for 9 months for a cost-effective solution
Initial Results:
▪ ¼ the cost with commodity scale-out
▪ 3-7x faster through parallelized queries
▪ 10-20x price/perf with no application, BI, or ETL rewrites
[Solution diagram: cross-channel campaigns, real-time personalization, real-time actions]
Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systems.
[Diagram: OLTP systems (ERP, CRM, Supply Chain, HR, …) feed the operational data lake via stream or batch updates; ETL carries data on to the data warehouse and datamarts; the data lake serves ad-hoc analytics, executive business reports, operational reports & analytics, and real-time, event-driven apps]
Streamlining the Structured Data Pipeline in Hadoop
Traditional Hadoop pipeline: source systems (ERP, CRM, …) → Sqoop → stored as flat files → inferred schema applied → SQL query engines → BI tools

vs.

Streamlined Hadoop pipeline: source systems (ERP, CRM, …) → existing ETL tool → stored in the same schema → BI tools

Advantages:
• Reduced operational costs with less complexity
• Reduced processing time and errors with fewer translations
• Real-time updates for data cleansing (see the sketch below)
• Better SQL support
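Because the streamlined pipeline lands data in a transactional SQL store rather than immutable flat files, cleansing can be an in-place update. A minimal sketch, with a hypothetical staging table:

    -- Normalize inconsistent country values in place; in a flat-file
    -- pipeline this would mean rewriting and re-publishing the files.
    UPDATE customer_staging
    SET country = 'US'
    WHERE country IN ('USA', 'U.S.', 'United States');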
Complementing Existing Hadoop-Based Data Lakes
Optimize the storage and querying of structured data as part of ELT or Hadoop query engines.
1. Schema on ingest: streamlined, structured-to-structured integration from OLTP systems (ERP, CRM, Supply Chain, HR, …)
2. Schema before read: a repository for structured data, or for metadata from the ELT process on unstructured data (HCatalog)
3. Schema on read: ad-hoc Hadoop queries (e.g., Pig) across structured and unstructured data
Proven Building Blocks: Hadoop and Derby
APACHE DERBY
▪ ANSI SQL-99 RDBMS
▪ Java-based
▪ ODBC/JDBC compliant

APACHE HBASE/HDFS
▪ Auto-sharding
▪ Real-time updates
▪ Fault tolerance
▪ Scalability to 100s of PBs
▪ Data replication
HBase: Proven Scale-Out
▪ Auto-sharding
▪ Scales with commodity hardware
▪ Cost-effective from GBs to PBs
▪ High availability through failover and replication
▪ LSM-trees
Splice Optimizations to HBase
▪ Splice storage is optimized over raw HBase
  ▪ Bitmap indexes store data in packed byte arrays
  ▪ This yields a much smaller footprint than traditional HBase; with a TPC-H schema, we found a 10x reduction in data size
  ▪ Requires far less hardware and resources to perform the same workload
▪ Asynchronous write pipeline
  ▪ Raw HBase writes (puts) are not pipelined and block while the call is being made
  ▪ Splice's write pipeline reaches speeds of over 100K writes/second per HBase node
  ▪ This allows extremely high ingest speeds without requiring more hardware or custom code
▪ Transactions
  ▪ As scalability increases, the likelihood of failures increases
  ▪ We use snapshot isolation to ensure a failure does not corrupt existing data
▪ RDBMS capabilities
  ▪ SQL instead of custom scans, with an optimizer choosing the best access path to the data
  ▪ Core data management functions (indexes, constraints, typed columns, etc.)
Distributed, Parallelized Query Execution
▪ Parallelized computation across the cluster
▪ Moves computation to the data
▪ Utilizes HBase co-processors
▪ No MapReduce
[Diagram: HBase co-processor running inside the HBase server memory space]
ANSI SQL-99 Coverage
▪ Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
▪ DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DROP TABLE
▪ Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
▪ DML – e.g., INSERT, DELETE, UPDATE, SELECT
▪ Query specification – e.g., SELECT DISTINCT, GROUP BY, HAVING
▪ Set operations – e.g., UNION, UNION ALL; numeric functions – e.g., ABS, MOD; constraints – e.g., CHECK
▪ Aggregation functions – e.g., AVG, MAX, COUNT
▪ String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, POSITION, TRIM, LENGTH
▪ Conditional functions – e.g., CASE, searched CASE
▪ Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
▪ Cursors – e.g., updatable, read-only, positioned DELETE/UPDATE
▪ Joins – e.g., INNER JOIN, LEFT OUTER JOIN
▪ Transactions – e.g., COMMIT, ROLLBACK, READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, Snapshot Isolation
▪ Sub-queries
▪ Triggers
▪ User-defined functions (UDFs)
▪ Views – including grouped views
▪ Window functions – e.g., RANK, ROW_NUMBER
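As a quick illustration of the coverage listed above, the sketch below exercises DDL, DML, joins, predicates, and aggregation. The schema and data are hypothetical (it assumes a customers table with customer_id and name columns):

    -- DDL: typed columns and constraints.
    CREATE TABLE orders (
        order_id    INTEGER NOT NULL PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  DATE,
        amount      DECIMAL(10, 2)
    );

    -- DML: insert a row.
    INSERT INTO orders VALUES (1, 100, DATE('2015-05-12'), 19.99);

    -- Join, predicate, aggregation, GROUP BY, and HAVING together.
    SELECT c.name, COUNT(*) AS num_orders, AVG(o.amount) AS avg_amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.order_date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
    GROUP BY c.name
    HAVING COUNT(*) > 1;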
Window Functions (Advanced Analytics Functions)
▪ Analytics such as running totals, moving averages, and top-N queries
▪ Performs calculations across a set of table rows related to the current row in the window
▪ Similar to aggregate functions, with two significant differences:
  ▪ Outputs one row for each input row it operates on
  ▪ Groups rows with window partitioning and frame clauses rather than GROUP BY
▪ Splice Machine currently supports RANK, DENSE_RANK, ROW_NUMBER, AVG, SUM, COUNT, MAX, and MIN (see the sketch below)
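A short sketch of two of the supported functions against the hypothetical orders table from the previous slide: a per-customer running total (SUM as a window function) and a per-customer ranking by amount:

    -- PARTITION BY defines each window; ORDER BY defines row order
    -- within it. One output row per input row, per the notes above.
    SELECT customer_id,
           order_date,
           amount,
           SUM(amount) OVER (PARTITION BY customer_id
                             ORDER BY order_date) AS running_total,
           RANK()      OVER (PARTITION BY customer_id
                             ORDER BY amount DESC) AS amount_rank
    FROM orders;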
Lockless, ACID transactions
• Adds multi-row, multi-table transactions to HBase, with rollback
• Fast, lockless, high concurrency
• Extends research from Google Percolator, Yahoo Labs, and the University of Waterloo
• Patent-pending technology
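What this buys an application, sketched in plain SQL. The account and log tables are hypothetical, and the sketch assumes autocommit is off so the statements form one transaction:

    -- Multi-row, multi-table transaction: either all three statements
    -- become visible together, or ROLLBACK undoes them all.
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
    INSERT INTO transfer_log (from_id, to_id, amount) VALUES (1, 2, 100);
    COMMIT;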
Customer Performance Benchmarks
Typically a 10x price/performance improvement.
[Chart: customer benchmarks showing 3-7x to 30x faster speed and 7x to 20x better price/performance vs. incumbent databases]
Applications, BI / SQL tool support via ODBC/JDBC
Splice Machine Safe Journey Process

1. Initial Overview (1 day)
   • Splice Machine overview
   • Set the stage for the Rapid Assessment
2. Rapid Assessment (5 days, including prep)
   • Half-day workshop
   • Assess Splice Machine fit
   • Identify target use cases
   • Risk assessment of use cases
   • Agree upon success criteria
3. Proof of Concept (2 weeks)
   • Prove the client use case in a Splice Machine hosted environment
   • Benchmark using customer queries and schema
   • Run on customer data, or on generated data that resembles it
4. Pilot Project (3-6 weeks)
   • Identify a paid pilot use case with limited change-management impact
   • Install Splice Machine in the client environment
   • Deploy the use case/application on client data
   • Prove Splice Machine against key requirements
5. Enterprise Implementation (3-10 months)
   • Kickstart, Requirements, Design/Dev, QA Test, Cutover, Hypercare
Safe Journey Enterprise Implementation Stages
Kickstart: A packaged 2-week program to get a new client off to a strong start on a solid foundation. Incorporates the Splice Architecture & Development courses, a Risk Assessment Workshop, and an Implementation Blueprint.

Requirements: Establish a clear functional and performance requirements document. Can be a "refresh only" if the project is a port of an existing app to Splice.

Design/Dev: Based on Agile methods; the phase is divided into 2-week sprints. Stories covering a set of required capabilities are assigned to each developer. A design doc is created, code is written, and unit tests are written and executed until they pass.

QA Test: Includes a Performance Test, an End-to-End System Integration Test, and a User Acceptance Test. Depending on the scale of the project, there may be multiple iterations of each test with break/fix cycles in between.

Cutover: A formal period in which the Splice-based solution goes live and the pre-existing system is deprecated.

Parallel Ops (optional): Used when an existing system is being ported to Splice Machine from another database. The new Splice Machine-based system runs side by side with the old system for a period of time.

Hypercare (optional): A period of on-site support during cutover and immediately following go-live.
Common Risks and Mitigation Strategies
Data migration
• Risk: Clients are typically migrating very large data sets to Splice Machine. Issues with migrating certain data types, such as dates, can waste a lot of time reloading large amounts of data.
• Solution: First migrate a small subset of tables that contains all required data types (see the sketch below). Ensure these migrate successfully before migrating the entire database.
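One way to apply this, sketched with hypothetical names and types: build a small probe table that exercises every data type used in the source system, migrate it first, and spot-check the results before committing to the full migration:

    -- Probe table covering the source system's data types; dates and
    -- timestamps are common trouble spots in large migrations.
    CREATE TABLE migration_probe (
        id      INTEGER NOT NULL PRIMARY KEY,
        created DATE,
        updated TIMESTAMP,
        price   DECIMAL(10, 2),
        ratio   DOUBLE,
        name    VARCHAR(100)
    );
    -- Load a handful of representative rows, then verify them here and
    -- in the source system before migrating the entire database.
    SELECT * FROM migration_probe;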
Changes to source schema during implementation
• Risk: Changes to the schema of the source database during the course of the implementation will lead to a significant amount of rework and reloading of data, adding unplanned time to the project.
• Solution: All stakeholders agree up front to freeze the schema as of an agreed-upon date prior to the Design/Dev stage.

Stored procedure conversion
• Risk: Stored procedures need to be converted from their original language (e.g., PL/SQL) to Java. Complex stored procedures may include significant amounts of procedural code as well as multiple SQL statements.
• Solution: Carefully review the function and design of the stored procedures to be converted. Leverage an automated conversion tool where appropriate.
Common Risks and Mitigation Strategies
SQL compatibility
• Risk: Even though Splice Machine conforms to the ANSI SQL-99+ standard, virtually every database has unique syntax, and some queries may need to be modified. Additionally, SQL generated by packaged applications may not be modifiable.
• Solution: Formal review of SQL syntax during the Requirements phase. Modify relevant queries during the Design/Dev phase. If a query is not modifiable, an enhancement request for Splice Machine to support the required syntax out of the box may be needed.
Indexing
• Risk: Proper indexing is usually important to maximize the performance of Splice Machine. Splice Machine indexes are likely to differ from the indexes required by a traditional RDBMS.
• Solution: Ensure that query performance SLAs are clearly defined in the Requirements phase. Incorporate proper index design early in the Design/Dev phase (see the sketch below). Assume some iteration will be required to achieve the optimal indexes.
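For example (index and table names hypothetical), standard CREATE INDEX syntax covers this; the point is that the columns worth indexing for Splice Machine access paths may differ from those indexed in the source RDBMS:

    -- Composite index chosen for the new access paths; expect to
    -- iterate against the query-performance SLAs.
    CREATE INDEX idx_orders_cust_date
        ON orders (customer_id, order_date);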
Hadoop knowledge
• Risk: Project stakeholders often have limited knowledge of Hadoop and the distributed computing paradigm. This can lead to confusion about the Splice Machine value proposition and the advantages of moving to a scale-out architecture.
• Solution: Include the Splice Machine Kickstart program at the beginning of the implementation project. It provides essential training on Hadoop and related fundamental concepts critical to realizing value from a Splice Machine deployment.
Summary
THE ONLY HADOOP RDBMS
Power operational applications on Hadoop
▪ Affordable, scale-out: commodity hardware
▪ Elastic: easy to expand or scale back
▪ Transactional: real-time updates & ACID transactions
▪ ANSI SQL: leverage existing SQL code, tools, & skills
▪ Flexible: support operational and analytical workloads
10x better price/performance
Questions?
Bob Baran Senior Sales Engineer
May 12, 2015