TRANSCRIPT
Big Data Landscape for Databases
Bob Baran, Senior Sales Engineer
May 12, 2015
Typical Database Workloads
The five workload types below are ordered from operational to analytical.

OLTP Applications
• Typical databases: MySQL, Oracle
• Use cases: ERP, CRM, supply chain
• Workload strengths: real-time updates; ACID transactions; high concurrency of small reads/writes; range queries

Real-Time Web, Mobile, and IoT Applications
• Typical databases: MongoDB, Cassandra, MySQL, Oracle
• Use cases: web, mobile, social; IoT
• Workload strengths: real-time updates; high ingest rates; high concurrency of small reads/writes; range queries

Real-Time, Operational Reporting
• Typical databases: MySQL, Oracle
• Use cases: operational datastores; Crystal Reports
• Workload strengths: real-time updates; canned, parameterized reports; range queries

Ad-Hoc Analytics
• Typical databases: Greenplum, ParAccel, Netezza
• Use cases: exploratory analytics; data mining
• Workload strengths: complex queries requiring full table scans; append only

Enterprise Data Warehouses
• Typical databases: Teradata, Oracle, Sybase IQ
• Use cases: enterprise reporting
• Workload strengths: parameterized reports against historical data
Recent History of RDBMSs
▪ RDBMS definition
  ▪ Relational with joins
  ▪ ACID transactions
  ▪ Secondary indexes
  ▪ Typically row-oriented
  ▪ Operational and/or analytical workloads
▪ By the early 2000s
  ▪ Limited innovation
  ▪ Looked like Oracle and Teradata had won…
Hadoop Shakes Up Batch Analytics
▪ Data processing framework
▪ Cheap distributed file system
▪ Brute-force batch processing through MapReduce
▪ Great for batch analytics
▪ Great place to dump data to look at later
NoSQL Shakes Up Operational DBs
▪ NoSQL wave
  ▪ Companies like Google, Amazon, and LinkedIn needed greater scalability & schema flexibility
  ▪ New databases developed by developers, not database people
  ▪ Provided scale-out, but lost SQL
  ▪ Worked well at web startups because:
    ▪ In some cases, use cases did not need ACID
    ▪ Willing to handle exceptions at the app level
Convoluted Evolution of Databases
[Chart plotting functionality (scale up) against scalability (scale out): Indexed Files (ISAM), 1960s; Hierarchical/Network Databases, 1970s; Traditional RDBMSs, 1980s-2000s; Hadoop, 2005; NoSQL Databases, 2010; Scale-out SQL Databases, 2013]
Mainstream user changes
▪ Driven by web, social, mobile, and the Internet of Things
▪ Major increases in scale: 30% annual data growth
▪ Significant requirements for semi-structured data, though relatively little unstructured
▪ Technology adoption continuum, from “What is it?” to “Should I use it?” to “Why wouldn’t I use it?”: Cloud; NoSQL for web apps; scale-out SQL DBs for operational apps; Hadoop technologies for analytics
Schema on Ingest vs. Schema on Read
▪ Even “schemaless” MongoDB requires a “schema”; see “10 Things You Should Know About Running MongoDB At Scale” by Asya Kamsky, Principal Solutions Architect at MongoDB; item #1 is “have a good schema and indexing strategy”
▪ Use schema on read if you only use the data a few times a year
▪ Structured data should always remain structured
▪ Add schema if the data is used regularly
[Diagram: data stream flowing into an application]
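To make the two approaches concrete, here is a minimal SQL sketch. The table names, file layout, and the Hive-style external-table syntax are illustrative assumptions, not from the slides:

    -- Schema on ingest: declare the schema up front, then load into it;
    -- data is validated and structured from the moment it lands.
    CREATE TABLE orders (
        order_id    BIGINT NOT NULL PRIMARY KEY,
        customer_id BIGINT NOT NULL,
        order_date  DATE,
        amount      DECIMAL(10, 2)
    );

    -- Schema on read: leave the data as flat files and project a schema
    -- onto them only at query time (Hive-style external table shown as
    -- one common example; exact syntax varies by engine).
    CREATE EXTERNAL TABLE raw_orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_date  STRING,
        amount      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/orders';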
Scale-out is the future of databases
[Diagram: “How do I scale?” Scale up (traditional RDBMS) vs. scale out (NoSQL, NewSQL, MPP, SQL-on-Hadoop analytic engines, Hadoop)]
NoSQL
Pros
▪ Easy scale-out
▪ Flexible schema
▪ Easier web development with hierarchical data structures (MongoDB)
▪ Cross-data center replication (Cassandra)
Cons
▪ No SQL: requires retraining and app rewrites
▪ No joins, i.e., no cross-row/document dependencies
▪ No reliable updates through transactions across rows/tables
▪ Eventual consistency (Cassandra)
▪ Not designed to do the aggregations required for analytics
NewSQL
Pros
▪ Easy scale-out
▪ ANSI SQL: eliminates retraining and app rewrites
▪ Reliable updates through ACID transactions
▪ RDBMS functionality
▪ Strong cross-data center replication (NuoDB)
Cons
▪ Proprietary scale-out, unproven into petabytes
▪ Must manage another distributed infrastructure beyond Hadoop
▪ Cannot leverage the Hadoop ecosystem of tools
NewSQL – In-Memory
Pros
▪ Easy scale-out
▪ High performance because everything is in memory
▪ ACID transactions within nodes
Cons
▪ Memory is 10-20x more expensive
▪ Limited SQL
▪ Limited cross-node transactions
▪ Proprietary scale-out, unproven into petabytes
▪ Must manage another distributed infrastructure beyond Hadoop
▪ Cannot leverage the Hadoop ecosystem
Operational RDBMS on Hadoop
Pros
▪ Easy scale-out
▪ Scale-out infrastructure proven into petabytes
▪ ANSI SQL: eliminates retraining and app rewrites
▪ Reliable updates through ACID transactions
▪ Leverages the Hadoop distributed infrastructure and tool ecosystem
Cons
▪ Full table scans slower than MPP DBs, but faster than traditional RDBMSs
▪ Existing HDFS data must be re-loaded through the SQL interface
MPP Analytical Databases
Pros
▪ Easy scale-out
▪ Very fast performance for full table scans
▪ Highly parallelized, shared-nothing architectures
▪ May have columnar storage (Vertica)
▪ No maintenance of indexes (Netezza)
Cons
▪ Poor concurrency models prevent support of real-time apps
▪ Poor performance for range queries
▪ Need to redistribute all data to add nodes (hash partitioning)
▪ May require specialized hardware (Netezza)
▪ Proprietary scale-out: cannot leverage the Hadoop ecosystem of tools
SQL-on-Hadoop – Analytical Engines
Pros
▪ Easy scale-out
▪ Scale-out proven into petabytes
▪ Leverages the Hadoop distributed infrastructure
▪ Can leverage the Hadoop ecosystem of tools
Cons
▪ Relatively immature, especially compared to MPP DBs
▪ Limited SQL
▪ Poor concurrency models prevent support of real-time apps
▪ No reliable updates through transactions
▪ Intermediate results must fit in memory (Presto)
Future: Hybrid In-Memory Architectures
▪ Memory cache with disk
▪ Pure in-memory: very expensive; unsophisticated memory management
▪ Hybrid in-memory: flexible, cost-effective; controlled by the optimizer; in-memory materialized views?
Summary – Future of Databases
▪ Predicted trends
  ▪ Scale-out dominates databases
  ▪ Developers stop worrying about data size and develop new data-driven apps
  ▪ Hybrid in-memory architecture becomes mainstream
▪ Predicted winners
  ▪ Hadoop becomes the de facto distributed file system
  ▪ NoSQL used for simple web apps
  ▪ Scale-out SQL RDBMSs replace traditional RDBMSs
Questions?
Bob Baran Senior Sales Engineer
May 12, 2015
Powering Real-Time Apps on Hadoop
Bob Baran Senior Sales Engineer
May 12, 2015
Who Are We?
THE ONLY HADOOP RDBMS
Power operational applications on Hadoop
▪ Affordable, scale-out: commodity hardware
▪ Elastic: easy to expand or scale back
▪ Transactional: real-time updates & ACID transactions
▪ ANSI SQL: leverage existing SQL code, tools, & skills
▪ Flexible: support operational and analytical workloads
10x better price/performance
What People are Saying…
Recognized as a key innovator in databases.

Quotes:
• “Scaling out on Splice Machine presented some major benefits over Oracle … automatic balancing between clusters … avoiding the costly licensing issues.”
• “An alternative to today’s RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop.”
• “The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop.”

Awards: [award logos]
Advisory Board
Advisory Board includes luminaries in databases and technology:
• Roger Bamford: former Principal Architect at Oracle; father of Oracle RAC
• Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab (birthplace of Apache Spark)
• Marie-Anne Neimat: co-founder, TimesTen Database; former VP, Database Eng. at Oracle
• Ken Rudin: Head of Analytics at Facebook; former GM of Oracle Data Warehousing
Combines the Best of Both Worlds
Hadoop:
▪ Scale-out on commodity servers
▪ Proven to 100s of petabytes
▪ Efficiently handles sparse data
▪ Extensive ecosystem

RDBMS:
▪ ANSI SQL
▪ Real-time, concurrent updates
▪ ACID transactions
▪ ODBC/JDBC support
Focused on OLTP and Real-Time Workloads
(This slide repeats the typical database workloads table shown earlier; the focus here is the operational side: OLTP applications; real-time web, mobile, and IoT applications; and real-time, operational reporting.)
OLTP Campaign Management: Harte-Hanks
Overview:
▪ Digital marketing services provider
▪ Unified Customer Profile
▪ Real-time campaign management
▪ OLTP environment with BI reports
Challenges:
▪ Oracle RAC too expensive to scale
▪ Queries too slow, sometimes up to ½ hour
▪ Getting worse: expecting 30-50% data growth
▪ Looked for 9 months for a cost-effective solution
Initial Results:
▪ ¼ the cost with commodity scale-out
▪ 3-7x faster through parallelized queries
▪ 10-20x price/perf with no application, BI, or ETL rewrites
[Solution diagram: cross-channel campaigns, real-time personalization, real-time actions]
Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systems.
[Diagram: OLTP systems (ERP, CRM, Supply Chain, HR, …) feed the operational data lake via stream or batch updates; ETL carries data on to the data warehouse and datamarts; the data lake serves ad-hoc analytics, executive business reports, operational reports & analytics, and real-time, event-driven apps]
Streamlining the Structured Data Pipeline in Hadoop
Traditional Hadoop pipeline: source systems (ERP, CRM, …) → Sqoop → stored as flat files → inferred schema applied → SQL query engines → BI tools

vs.

Streamlined Hadoop pipeline: source systems (ERP, CRM, …) → existing ETL tool → stored in the same schema → BI tools

Advantages:
• Reduced operational costs with less complexity
• Reduced processing time and errors with fewer translations
• Real-time updates for data cleansing (see the sketch below)
• Better SQL support
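Because the streamlined pipeline lands data in a transactional SQL store rather than immutable flat files, cleansing can be an in-place update. A minimal sketch, with a hypothetical staging table:

    -- Normalize inconsistent country values in place; in a flat-file
    -- pipeline this would mean rewriting and re-publishing the files.
    UPDATE customer_staging
    SET country = 'US'
    WHERE country IN ('USA', 'U.S.', 'United States');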
Complementing Existing Hadoop-Based Data Lakes
Optimize the storage and querying of structured data as part of ELT or Hadoop query engines.
1. Schema on ingest: streamlined, structured-to-structured integration from OLTP systems (ERP, CRM, Supply Chain, HR, …)
2. Schema before read: a repository for structured data, or for metadata from the ELT process on unstructured data (HCatalog)
3. Schema on read: ad-hoc Hadoop queries (e.g., Pig) across structured and unstructured data
Proven Building Blocks: Hadoop and Derby
APACHE DERBY
▪ ANSI SQL-99 RDBMS
▪ Java-based
▪ ODBC/JDBC compliant

APACHE HBASE/HDFS
▪ Auto-sharding
▪ Real-time updates
▪ Fault tolerance
▪ Scalability to 100s of PBs
▪ Data replication
HBase: Proven Scale-Out
▪ Auto-sharding
▪ Scales with commodity hardware
▪ Cost-effective from GBs to PBs
▪ High availability through failover and replication
▪ LSM-trees
Splice Optimizations to HBase
▪ Splice storage is optimized over raw HBase
  ▪ Bitmap indexes store data in packed byte arrays
  ▪ This yields a much smaller footprint than traditional HBase; with a TPC-H schema, we found a 10x reduction in data size
  ▪ Requires far less hardware and resources to perform the same workload
▪ Asynchronous write pipeline
  ▪ Raw HBase writes (puts) are not pipelined and block while the call is being made
  ▪ Splice's write pipeline reaches speeds of over 100K writes/second per HBase node
  ▪ This allows extremely high ingest speeds without requiring more hardware or custom code
▪ Transactions
  ▪ As scalability increases, the likelihood of failures increases
  ▪ We use snapshot isolation to ensure a failure does not corrupt existing data
▪ RDBMS capabilities
  ▪ SQL instead of custom scans, with an optimizer choosing the best access path to the data
  ▪ Core data management functions (indexes, constraints, typed columns, etc.)
Distributed, Parallelized Query Execution
▪ Parallelized computation across the cluster
▪ Moves computation to the data
▪ Utilizes HBase co-processors
▪ No MapReduce
[Diagram: HBase co-processor running inside the HBase server memory space]
ANSI SQL-99 Coverage
▪ Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
▪ DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DROP TABLE
▪ Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
▪ DML – e.g., INSERT, DELETE, UPDATE, SELECT
▪ Query specification – e.g., SELECT DISTINCT, GROUP BY, HAVING
▪ Set operations – e.g., UNION, UNION ALL; numeric functions – e.g., ABS, MOD; constraints – e.g., CHECK
▪ Aggregation functions – e.g., AVG, MAX, COUNT
▪ String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, POSITION, TRIM, LENGTH
▪ Conditional functions – e.g., CASE, searched CASE
▪ Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
▪ Cursors – e.g., updatable, read-only, positioned DELETE/UPDATE
▪ Joins – e.g., INNER JOIN, LEFT OUTER JOIN
▪ Transactions – e.g., COMMIT, ROLLBACK, READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, Snapshot Isolation
▪ Sub-queries
▪ Triggers
▪ User-defined functions (UDFs)
▪ Views – including grouped views
▪ Window functions – e.g., RANK, ROW_NUMBER
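As a quick illustration of the coverage listed above, the sketch below exercises DDL, DML, joins, predicates, and aggregation. The schema and data are hypothetical (it assumes a customers table with customer_id and name columns):

    -- DDL: typed columns and constraints.
    CREATE TABLE orders (
        order_id    INTEGER NOT NULL PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  DATE,
        amount      DECIMAL(10, 2)
    );

    -- DML: insert a row.
    INSERT INTO orders VALUES (1, 100, DATE('2015-05-12'), 19.99);

    -- Join, predicate, aggregation, GROUP BY, and HAVING together.
    SELECT c.name, COUNT(*) AS num_orders, AVG(o.amount) AS avg_amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.order_date BETWEEN DATE('2015-01-01') AND DATE('2015-12-31')
    GROUP BY c.name
    HAVING COUNT(*) > 1;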
Window Functions (Advanced Analytics Functions)
▪ Analytics such as running totals, moving averages, and top-N queries
▪ Performs calculations across a set of table rows related to the current row in the window
▪ Similar to aggregate functions, with two significant differences:
  ▪ Outputs one row for each input row it operates on
  ▪ Groups rows with window partitioning and frame clauses rather than GROUP BY
▪ Splice Machine currently supports RANK, DENSE_RANK, ROW_NUMBER, AVG, SUM, COUNT, MAX, and MIN (see the sketch below)
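A short sketch of two of the supported functions against the hypothetical orders table from the previous slide: a per-customer running total (SUM as a window function) and a per-customer ranking by amount:

    -- PARTITION BY defines each window; ORDER BY defines row order
    -- within it. One output row per input row, per the notes above.
    SELECT customer_id,
           order_date,
           amount,
           SUM(amount) OVER (PARTITION BY customer_id
                             ORDER BY order_date) AS running_total,
           RANK()      OVER (PARTITION BY customer_id
                             ORDER BY amount DESC) AS amount_rank
    FROM orders;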
Lockless, ACID transactions
• Adds multi-row, multi-table transactions to HBase, with rollback
• Fast, lockless, high concurrency
• Extends research from Google Percolator, Yahoo Labs, and the University of Waterloo
• Patent-pending technology
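What this buys an application, sketched in plain SQL. The account and log tables are hypothetical, and the sketch assumes autocommit is off so the statements form one transaction:

    -- Multi-row, multi-table transaction: either all three statements
    -- become visible together, or ROLLBACK undoes them all.
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
    INSERT INTO transfer_log (from_id, to_id, amount) VALUES (1, 2, 100);
    COMMIT;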
Customer Performance Benchmarks
Typically a 10x price/performance improvement.
[Chart: customer benchmarks showing 3-7x to 30x faster speed and 7x to 20x better price/performance vs. incumbent databases]
Applications, BI / SQL tool support via ODBC/JDBC
Splice Machine Safe Journey Process

1. Initial Overview (1 day)
   • Splice Machine overview
   • Set the stage for the Rapid Assessment
2. Rapid Assessment (5 days, including prep)
   • Half-day workshop
   • Assess Splice Machine fit
   • Identify target use cases
   • Risk assessment of use cases
   • Agree upon success criteria
3. Proof of Concept (2 weeks)
   • Prove the client use case in a Splice Machine hosted environment
   • Benchmark using customer queries and schema
   • Run on customer data, or on generated data that resembles it
4. Pilot Project (3-6 weeks)
   • Identify a paid pilot use case with limited change-management impact
   • Install Splice Machine in the client environment
   • Deploy the use case/application on client data
   • Prove Splice Machine against key requirements
5. Enterprise Implementation (3-10 months)
   • Kickstart, Requirements, Design/Dev, QA Test, Cutover, Hypercare
Safe Journey Enterprise Implementation Stages
Kickstart: A packaged 2-week program to get a new client off to a strong start on a solid foundation. Incorporates the Splice Architecture & Development courses, a Risk Assessment Workshop, and an Implementation Blueprint.

Requirements: Establish a clear functional and performance requirements document. Can be a "refresh only" if the project is a port of an existing app to Splice.

Design/Dev: Based on Agile methods; the phase is divided into 2-week sprints. Stories covering a set of required capabilities are assigned to each developer. A design doc is created, code is written, and unit tests are written and executed until they pass.

QA Test: Includes a Performance Test, an End-to-End System Integration Test, and a User Acceptance Test. Depending on the scale of the project, there may be multiple iterations of each test with break/fix cycles in between.

Cutover: A formal period in which the Splice-based solution goes live and the pre-existing system is deprecated.

Parallel Ops (optional): Used when an existing system is being ported to Splice Machine from another database. The new Splice Machine-based system runs side by side with the old system for a period of time.

Hypercare (optional): A period of on-site support during cutover and immediately following go-live.
Common Risks and Mitigation Strategies
Data migration
• Risk: Clients are typically migrating very large data sets to Splice Machine. Issues with migrating certain data types, such as dates, can waste a lot of time reloading large amounts of data.
• Solution: First migrate a small subset of tables that contains all required data types (see the sketch below). Ensure these migrate successfully before migrating the entire database.
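One way to apply this, sketched with hypothetical names and types: build a small probe table that exercises every data type used in the source system, migrate it first, and spot-check the results before committing to the full migration:

    -- Probe table covering the source system's data types; dates and
    -- timestamps are common trouble spots in large migrations.
    CREATE TABLE migration_probe (
        id      INTEGER NOT NULL PRIMARY KEY,
        created DATE,
        updated TIMESTAMP,
        price   DECIMAL(10, 2),
        ratio   DOUBLE,
        name    VARCHAR(100)
    );
    -- Load a handful of representative rows, then verify them here and
    -- in the source system before migrating the entire database.
    SELECT * FROM migration_probe;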
Changes to source schema during implementation
• Risk: Changes to the schema of the source database during the course of the implementation will lead to a significant amount of rework and reloading of data, adding unplanned time to the project.
• Solution: All stakeholders agree up front to freeze the schema as of an agreed-upon date prior to the Design/Dev stage.

Stored procedure conversion
• Risk: Stored procedures need to be converted from their original language (e.g., PL/SQL) to Java. Complex stored procedures may include significant amounts of procedural code as well as multiple SQL statements.
• Solution: Carefully review the function and design of the stored procedures to be converted. Leverage an automated conversion tool where appropriate.
Common Risks and Mitigation Strategies
SQL compatibility
• Risk: Even though Splice Machine conforms to the ANSI SQL-99+ standard, virtually every database has unique syntax, and some queries may need to be modified. Additionally, SQL generated by packaged applications may not be modifiable.
• Solution: Formal review of SQL syntax during the Requirements phase. Modify relevant queries during the Design/Dev phase. If a query is not modifiable, an enhancement request for Splice Machine to support the required syntax out of the box may be needed.
Indexing
• Risk: Proper indexing is usually important to maximize the performance of Splice Machine. Splice Machine indexes are likely to differ from the indexes required by a traditional RDBMS.
• Solution: Ensure that query performance SLAs are clearly defined in the Requirements phase. Incorporate proper index design early in the Design/Dev phase (see the sketch below). Assume some iteration will be required to achieve the optimal indexes.
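For example (index and table names hypothetical), standard CREATE INDEX syntax covers this; the point is that the columns worth indexing for Splice Machine access paths may differ from those indexed in the source RDBMS:

    -- Composite index chosen for the new access paths; expect to
    -- iterate against the query-performance SLAs.
    CREATE INDEX idx_orders_cust_date
        ON orders (customer_id, order_date);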
Hadoop knowledge
• Risk: Project stakeholders often have limited knowledge of Hadoop and the distributed computing paradigm. This can lead to confusion about the Splice Machine value proposition and the advantages of moving to a scale-out architecture.
• Solution: Include the Splice Machine Kickstart program at the beginning of the implementation project. It provides essential training on Hadoop and related fundamental concepts critical to realizing value from a Splice Machine deployment.
Summary
THE ONLY HADOOP RDBMS
Power operational applications on Hadoop
▪ Affordable, scale-out: commodity hardware
▪ Elastic: easy to expand or scale back
▪ Transactional: real-time updates & ACID transactions
▪ ANSI SQL: leverage existing SQL code, tools, & skills
▪ Flexible: support operational and analytical workloads
10x better price/performance
Questions?
Bob Baran Senior Sales Engineer
May 12, 2015