Splice Machine: SQL on Hadoop Relational Database Management System

Uploaded by kunal-gupta, posted on 18-Jul-2015


Page 1: Splice Machine Overview

Splice Machine: SQL on Hadoop Relational Database Management System

Page 2: Splice Machine Overview

How Is an RDBMS on Hadoop Different from a Traditional RDBMS and NoSQL?

NoSQL databases have no SQL interface, no joins, and no transactions across multiple rows and tables.

Existing database applications must be rewritten to use a NoSQL database.

A traditional RDBMS cannot automatically scale out on commodity hardware and must be manually sharded across servers.

A Hadoop RDBMS eliminates the cost and scaling issues of a traditional RDBMS while providing a SQL interface on top of a NoSQL database.

Existing applications can be migrated easily.

Page 3: Splice Machine Overview

Splice Machine Overview

Splice Machine is a SQL on Hadoop RDBMS.

Splice Machine provides database technology for real-time applications, including these features:

A. Standard ANSI SQL

B. Horizontal Scale Out

C. Real-Time Updates With Transaction

D. Massively Parallel Architecture

Page 4: Splice Machine Overview

Splice Machine Becoming Real Time

Many companies are experiencing an explosion of data generated by applications, websites, users, and devices such as smartphones.

Companies recognize that insights contained with this data can be a source of real competitive advantage, compelling them to act quickly before those insights become obsolete.

However, traditional relational databases, NoSQL alternatives, and other SQL-on-Hadoop solutions don't allow companies to collect, analyze, and react to massive amounts of data in real-time.

Page 5: Splice Machine Overview

Standard ANSI SQL-99

Splice Machine is an ANSI SQL-compliant database on Hadoop that enables companies to leverage their existing SQL-trained staff and tools.

Page 6: Splice Machine Overview

Horizontal Scale-Out

HBase supports auto-sharding, which gives it massive scalability.

A traditional RDBMS scales up instead, which is costly compared with commodity hardware.

Splice Machine uses HBase to scale out rather than up, providing massive scalability across commodity hardware, even up to dozens of petabytes.
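The auto-sharding idea behind this scale-out can be sketched as follows. This is a minimal illustration of range-based sharding, not Splice Machine or HBase code: each region owns a key range and splits at its median key once it exceeds a size threshold (real HBase splits by bytes, e.g. 10 GB per region, not row counts).

```python
# Toy sketch of range-based auto-sharding with region splits.
MAX_REGION_SIZE = 4  # rows per region; a stand-in for HBase's byte threshold

class Region:
    def __init__(self, start, end):
        self.start, self.end = start, end   # key range [start, end)
        self.rows = {}

    def contains(self, key):
        return self.start <= key and (self.end is None or key < self.end)

class ShardedTable:
    def __init__(self):
        self.regions = [Region("", None)]   # one region covers all keys

    def put(self, key, value):
        region = next(r for r in self.regions if r.contains(key))
        region.rows[key] = value
        if len(region.rows) > MAX_REGION_SIZE:
            self._split(region)             # auto-shard: split at median key

    def _split(self, region):
        keys = sorted(region.rows)
        mid = keys[len(keys) // 2]
        left, right = Region(region.start, mid), Region(mid, region.end)
        for k, v in region.rows.items():
            (left if k < mid else right).rows[k] = v
        i = self.regions.index(region)
        self.regions[i:i + 1] = [left, right]

table = ShardedTable()
for k in "abcdefghij":
    table.put(k, k.upper())
print(len(table.regions))  # → 4 regions after repeated automatic splits
```

Adding capacity then means spreading the regions across more commodity servers; no manual sharding decisions are needed.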

Page 7: Splice Machine Overview

Real-Time Updates with Transactions

Splice Machine supports a SQL interface, so it can perform transactions across multiple rows and tables.

How can this happen in real time? Because HBase, a distributed database over Hadoop, allows real-time read/write access through HBase co-processors rather than batch-oriented MapReduce.

Transactional consistency is maintained by Multi-Version Concurrency Control (MVCC).

Page 8: Splice Machine Overview

Massively Parallel Architecture

Splice Machine delivers massive parallelization by placing its parser, planner, and optimizer on each HBase RegionServer (each of which hosts multiple regions) and an executor on each HBase region, pushing computation down to each distributed data shard.

Splice Machine achieves high performance through massively parallel processing, pushing predicates, joins, aggregations, and complex queries down to the data shards.

For parallelized query execution, Splice Machine uses HBase co-processors for distributed computation on data stored in the Hadoop Distributed File System (HDFS).

Page 9: Splice Machine Overview

How Splice Machine is Different From Other SQL on Hadoop?

Splice Machine is a fully operational database on Hadoop that supports:

A. Real-Time Updates

B. Transactions

C. Analytics

D. Rich SQL support via ANSI SQL-99

Other SQL-on-Hadoop solutions, such as Hortonworks Stinger, Apache Drill, and Cloudera Impala, are query/analytics engines with limited SQL support, no transactions, and no real-time updates.

Page 10: Splice Machine Overview

Splice Machine Architecture

Page 11: Splice Machine Overview

Proven Building Blocks: HBase/Hadoop and Derby

Splice Machine marries two proven technology stacks: Apache Derby and HBase/Hadoop.

A. Apache Derby: Java-based, ANSI SQL database

• Java-based

• ANSI SQL-99

• Lightweight 2.6 MB footprint

B. Apache HBase/HDFS

• Auto-sharding

• Data replication

• Scalability to 100s of PB

• Real-time updates

Page 12: Splice Machine Overview

Apache Derby: 100% Java ANSI SQL RDBMS (client and embedded modes)

• Java stored procedures

• Full transaction isolation

• 2.6 MB footprint

• Custom functions

• Authentication and authorization

• Lock-based concurrency

Page 13: Splice Machine Overview

Splice Modifications to Derby

Component   | Derby                                  | Splice Machine
Store       | Block-file based                       | HBase
Index       | B-tree                                 | Dense index in HBase
Concurrency | Lock based                             | MVCC
Join plans  | Centralized hash and nested loop join  | Sort merge, merge, nested loop, distributed broadcast

Page 14: Splice Machine Overview

How Do Derby and HBase Work Together in Splice Machine?

Splice Machine replaces Apache Derby's block-file-based storage engine with HBase.

Splice Machine keeps Apache Derby's parser but redesigns the planner, optimizer, and executor so they can take advantage of distributed HBase computation.

This redesign enables the Splice Machine database to achieve massively parallel processing by pushing computation down to each HBase region on its RegionServer and using HBase co-processors for data computation in HDFS.

A client sends a SQL query to the Apache Derby parser; it then flows to the redesigned planner, optimizer, and executor, which reside in the HBase regions.

Because Apache Derby is Java-based, each RegionServer references local JAR files for the parser, planner, and optimizer, and each region on the RegionServer references a local JAR file for the executor.

Page 15: Splice Machine Overview

Splice SQL Processing

This pipeline is the same as Apache Derby's; Splice Machine does not redesign the parser.

PreparedStatement ps = conn.prepareStatement("SELECT * FROM T WHERE ID=?");

1. Look up the statement in the cache using a text match.

• If found, skip the remaining five steps.

• Otherwise, perform steps 2-6.

2. Parse with a JavaCC-generated (Java Compiler Compiler) parser, converting the statement into an abstract syntax tree.

3. Bind all tables associated with the query.

4. Optimize the plan based on I/O cost, communication cost, disk usage, and feasible join strategies.

5. Generate code to represent the statement plan.

6. Load the generated class and create an instance representing the query's state for that connection.
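Step 1 above is worth a small sketch: a statement cache keyed by the exact SQL text lets repeated statements skip parsing, binding, optimizing, and code generation entirely. This is an illustrative model, not Derby's implementation; `compile_plan` stands in for steps 2-6.

```python
# Toy statement cache: exact-text match skips recompilation.
class StatementCache:
    def __init__(self):
        self.plans = {}
        self.compilations = 0

    def compile_plan(self, sql):
        self.compilations += 1          # stands in for parse/bind/optimize/codegen
        return f"PLAN({sql})"

    def prepare(self, sql):
        if sql not in self.plans:       # step 1: text-match cache lookup
            self.plans[sql] = self.compile_plan(sql)
        return self.plans[sql]

cache = StatementCache()
cache.prepare("SELECT * FROM T WHERE ID=?")
cache.prepare("SELECT * FROM T WHERE ID=?")   # cache hit: no recompile
print(cache.compilations)  # → 1
```

This is also why parameterized statements (`ID=?`) cache better than statements with literals baked in: the text matches across executions.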

Page 16: Splice Machine Overview

Distributed, Parallelized Query Execution

Parallel computation across the cluster

Move computation to the data shards

Utilize HBase co-processors

No MapReduce

Queries use a special "exchange operator" for parallelism
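The exchange-operator pattern above can be sketched in a few lines: each shard computes a partial result in parallel, then an exchange step gathers the partials for a final merge. This is a generic illustration of the pattern, not Splice Machine's operator.

```python
# Toy exchange pattern: parallel per-shard partial aggregation, then merge.
from concurrent.futures import ThreadPoolExecutor

shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # rows partitioned by region

def partial_sum(shard):
    return sum(shard)                        # computation pushed to the shard

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum, shards))  # parallel per-shard work

total = sum(partials)                        # exchange + final merge
print(partials, total)  # → [6, 9, 30] 45
```

The same shape works for counts, min/max, and other aggregates whose partial results can be merged.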

Page 17: Splice Machine Overview

HBase Co-Processors Versus MapReduce for Distributed Computation on Data Stored in HDFS

HBase accesses HDFS directly while maintaining its own metadata, so it can quickly locate a single record within HDFS files.

MapReduce is designed for batch data access and is therefore not appropriate for real-time data access.

MapReduce starts a Java Virtual Machine for each query, which can take up to 30 seconds even to retrieve a single record from HDFS files.

Without metadata, MapReduce scans all the data, even if a query needs to access only a few records.

HBase co-processors run on each RegionServer, and each region holds a reference to its co-processors.

Co-processors hook into region life-cycle management: open, close, split, flush, and compact operations.
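The metadata point above is the crux, and a toy comparison makes it concrete. This is an illustration of access-path cost, not HBase code: a full scan (MapReduce-style, no index metadata) touches every row, while a metadata lookup touches one.

```python
# Toy cost comparison: full scan vs. metadata-guided point lookup.
rows = {f"row{i:05d}": f"value{i}" for i in range(10_000)}

def full_scan(target):
    touched = 0
    for key, value in rows.items():      # no metadata: examine everything
        touched += 1
        if key == target:
            return value, touched
    return None, touched

index = {key: key for key in rows}       # metadata: key -> location

def indexed_lookup(target):
    return rows.get(index.get(target)), 1  # one direct read

_, scan_cost = full_scan("row09999")
_, lookup_cost = indexed_lookup("row09999")
print(scan_cost, lookup_cost)  # → 10000 1
```

Add per-query JVM startup on top of the scan and the roughly 30-second batch latency quoted above becomes plausible, while the metadata path stays interactive.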

Page 18: Splice Machine Overview

HBase: Proven to Scale Out

Auto-sharding.

Scale with commodity hardware.

Cost effective from GBs to PBs.

High availability through replication.

Page 19: Splice Machine Overview

Support of Secondary Index

Often data is organized along one dimension for fast updating (such as a customer number) but later must be looked up by other dimensions (such as zip code). Secondary indexes enable databases to lookup data across many dimensions efficiently.

Splice Machine uses HBase tables to store each index along with any required data.
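The structure described above can be sketched as a separate key-value table mapping the indexed dimension (zip code) back to primary keys. This is an illustrative model of a secondary index, not Splice Machine's on-disk format; the customer fields are hypothetical.

```python
# Toy secondary index: a second table maps zip code -> primary keys.
customers = {}          # primary table: customer_id -> record
zip_index = {}          # index table: zip -> set of customer_ids

def insert_customer(cust_id, name, zip_code):
    customers[cust_id] = {"name": name, "zip": zip_code}
    zip_index.setdefault(zip_code, set()).add(cust_id)   # keep index in sync

def lookup_by_zip(zip_code):
    # index probe + primary-key fetch, instead of scanning every customer
    return [customers[cid] for cid in sorted(zip_index.get(zip_code, ()))]

insert_customer(1, "Ada", "94105")
insert_customer(2, "Grace", "10001")
insert_customer(3, "Alan", "94105")
print([c["name"] for c in lookup_by_zip("94105")])  # → ['Ada', 'Alan']
```

The cost of this design is that every write must update the index table as well, which is why index maintenance belongs inside the transaction.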

Page 20: Splice Machine Overview

Splice Transactions

Splice Machine is a fully transactional database. This allows you to perform actions such as commit and rollback; in a transactional context, the database does not make changes visible to others until a commit has been issued.

Here is a simple example. Enter the following commands to see commit and rollback in action:

splice> create table a (i int);

splice> autocommit off; -- puts current shell into a transactional context

splice> insert into a values 1,2,3; -- inserted but not visible to others

splice> commit; -- now committed to the database

splice> select * from a;

splice> insert into a values 4,5;

splice> rollback; -- 4 and 5 rolled back

splice> select * from a;
...

Page 21: Splice Machine Overview

Snapshot Isolation in Transactions

Snapshot isolation is a guarantee that all reads made in a transaction will see a consistent snapshot of the database (in practice, it reads the last committed values that existed at the time it started), and the transaction itself will successfully commit only if none of its updates conflict with any concurrent updates made since that snapshot. Such a write-write conflict causes the transaction to abort.

Snapshot isolation is implemented within multiversion concurrency control (MVCC).

• MVCC is a common way to increase concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions to read the last several relevant versions of each object.

In a write skew anomaly, two transactions (T1 and T2) concurrently read an overlapping data set (e.g. values V1 and V2), concurrently make disjoint updates (e.g. T1 updates V1, T2 updates V2), and finally concurrently commit, neither having seen the update performed by the other. Were the system serializable, such an anomaly would be impossible, as either T1 or T2 would have to occur "first", and be visible to the other. In contrast, snapshot isolation permits write skew anomalies.
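The commit rule and the write-skew permissiveness described above can both be shown in one small model. This is an illustrative sketch of first-committer-wins conflict detection, not Splice Machine's transaction engine: a transaction aborts only if another transaction committed a write to one of the *same* keys after its snapshot, so disjoint writes (write skew) both succeed.

```python
# Toy snapshot-isolation commit check: abort only on write-write overlap.
class SnapshotDB:
    def __init__(self, data):
        self.data = dict(data)
        self.commit_log = []          # (commit_ts, keys_written)
        self.clock = 0

    def begin(self):
        self.clock += 1
        return {"start": self.clock, "writes": {}}

    def commit(self, txn):
        for ts, keys in self.commit_log:
            if ts > txn["start"] and keys & txn["writes"].keys():
                return False          # write-write conflict: abort
        self.clock += 1
        self.commit_log.append((self.clock, set(txn["writes"])))
        self.data.update(txn["writes"])
        return True

db = SnapshotDB({"V1": 1, "V2": 1})
t1, t2 = db.begin(), db.begin()
t1["writes"]["V1"] = 0                # T1 and T2 update disjoint values
t2["writes"]["V2"] = 0
print(db.commit(t1), db.commit(t2))   # → True True  (write skew permitted)

t3, t4 = db.begin(), db.begin()
t3["writes"]["V1"] = 5
t4["writes"]["V1"] = 9                # overlapping write on V1
print(db.commit(t3), db.commit(t4))   # → True False (conflict aborts T4)
```

A serializable system would have to reject one of T1/T2 as well; snapshot isolation deliberately trades that away for concurrency.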

Page 22: Splice Machine Overview

Example of Snapshot Isolation

Page 23: Splice Machine Overview

Splice Machine Supports Distributed Transactions

Splice Machine has added an asynchronous write pipeline to HBase.

Splice Machine also has nested sub-transactions, so a region-level failure does not force a restart of the whole transaction.

• Example: a 10 TB update acts as a single parent transaction; when it is divided among the shards, each shard runs a nested transaction, so a region-level failure typically restarts only a few GB instead of 10 TB.

Page 24: Splice Machine Overview

Splice Machine Efficiency

Can it efficiently handle sparse data?

• In many large data sets, each attribute or column may be sparsely populated. In traditional databases, an empty value must still be stored as a null, which consumes storage. Modern databases should not require nulls for empty values.

Can you add a column without table scans?

• Data requirements change frequently and often require schema changes. Adding a column should not require full table scans.

Page 25: Splice Machine Overview

Splice Machine Performance

Does it support secondary indexes?

• Often data is organized along one dimension for fast updating (such as a customer number) but later must be looked up by other dimensions (such as zip code). Secondary indexes enable databases to look up data across many dimensions efficiently.

Does it provide multiple join strategies?

• Joins combine data from multiple tables. With a distributed infrastructure like Hadoop that handles very large data sets, multiple join strategies such as nested loop, sort-merge, and broadcast joins are needed to ensure fast join performance.

Is there a cost-based optimizer?

• Performance on large data sets depends greatly on choosing the right execution strategy. Simple rule-based optimizers are not enough. Cost-based optimizers, which estimate the actual cost of executing a query, are critical to optimal query performance.

Page 26: Splice Machine Overview

Splice Machine Features in an Upcoming Release

In many applications, certain attributes on a record may be visible to one user, but not to another. For instance in an HR application, a CEO may get to see the salary field, while most employees would not. Many applications control data access directly, but column level security is an advanced database feature that enables the database to control which fields a user can view. Splice Machine will be adding this feature in an upcoming release.
