sql vs nosql: why you’ll never dump your relations - dave shuttleworth, exasol

SQL vs NoSQL: Why you’ll never dump your relations

17th March 2015

© 2015 EXASOL AG

BCS Data Management Specialist Group

Dave Shuttleworth – Principal Consultant, Exasol UK

email: [email protected]

Twitter: @EXA_DaveS


Agenda

Introduction & background

SQL vs NoSQL - observations

Case study King – online gaming

What’s hot?

Q & A


My background

2014 - 2015 – EXASOL UK – Principal Consultant

Introducing EXASOL DBMS technology into UK

2003 - 2014 – Intelligent Edge Group – Principal Consultant

Data Warehouse design and migration from older technologies to new MPP DBMS

Business Intelligence infrastructure architect

New DBMS technology assessment

1992 - 2003 – WhiteCross Systems (now Kognitio) – Principal Consultant

Pre-sales and post-sales technical support

1989 - 1992 – Teradata – Consultant

1980 - 1989 – Data General (now part of EMC) – Systems engineer

1975 - 1980 – UK retailer – Analyst programmer

Applications design and implementation, system management and tuning


What is Exasol?

The World’s Fastest Analytic Database

A column store, in-memory, massively parallel processing (MPP) database

Modern software designed for analytics

Runs on standard x86 hardware

Uses standard SQL language (with optional extensions)

Suitable for any scale of data & any number of users

Mature, proven & very cost effective

Quick to implement & easy to operate


We are the benchmark leader

[Bar chart: TPC-H results in queries per hour (QphH@1000 GB). Result dates shown on the chart: Sept ´14, April ´14, June ´12, Feb ´14, Dec ´13, Aug ´11, Sept ´11, Oct ´11, Dec ´11]

EXASOL 5,246,338

Microsoft 134,117

Oracle 201,487

Oracle 209,533

Microsoft 219,887

Sybase IQ 258,474

Oracle 326,454

Vectorwise 445,529

Microsoft 519,976

Source: www.tpc.org / Sept 22, 2015

On 1 Terabyte of data - an order of magnitude faster than its closest rival
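The "order of magnitude" claim can be checked from the figures quoted on the slide. This is a minimal sketch, assuming the unlabelled 5,246,338 figure is the EXASOL result and that the closest rival is the highest of the other listed results; the variable names are illustrative only.

```python
# QphH@1000 GB figures as quoted on the slide
leader_qphh = 5_246_338         # assumed to be the EXASOL result
closest_rival_qphh = 519_976    # highest of the other listed results

# How many times more queries per hour the leader achieved
ratio = leader_qphh / closest_rival_qphh
print(f"{ratio:.1f}x the closest rival")
```

On these numbers the ratio comes out just above ten.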


SQL vs NoSQL - background

• Databases and Data Warehouses have evolved to meet the needs of business (over many years…!)

• Generally using some form of Relational Database (SQL based)

• Originally tightly structured data, now expanding to include unstructured data

• Ever increasing data volumes and complexity

• New technologies have emerged to address (and extend) the storage and management requirements

• Fast cheap network connectivity

• Cloud services for cheaper and more flexible implementation

• Wider acceptance of open source software for production systems

• Hadoop parallel processing platform – often in a ‘hybrid’ environment

• Alternative database technologies (e.g. document stores, graph databases)

• Publicly accessible data sources (e.g. weather history, flight data, Google searches, Twitter feeds, census data, mapping data)

• More complex analytics needed to stay competitive


• Proliferation of NoSQL (‘not only SQL’) databases – over 150 listed on nosql-database.org – classified by type:

• Wide column stores: e.g. Hadoop, MapR, Cassandra, MonetDB

• Document stores: Elasticsearch, MongoDB, Couchbase, MarkLogic

• Key value/tuple stores: DynamoDB, Azure Table Storage, Oracle NoSQL, MemcacheDB

• Graph databases: Neo4j, YarcData, Graphbase

• Multimodal databases

• Object databases

• etc.



• The inherent restrictions of relational databases are addressed by NoSQL implementations :

• More flexible data model – ‘schemaless’ or ‘schema on read’

• ‘Schemaless’ can mean very fast write performance – useful for streaming data

• Simplifies handling of unstructured and semi-structured data such as logfiles, other machine generated data and text

• Designed for easy scale-up (and scale down) to handle seasonal workloads

• High levels of concurrency can be achieved via distributed processing

• High availability via replication is built in to some NoSQL databases

• Maps well to cloud based infrastructure and capabilities (if done well!)
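The ‘schema on read’ idea in the list above can be sketched in a few lines. This is only an illustration, not how any particular NoSQL product works: the log records, field names and the `read_with_schema` helper are all hypothetical.

```python
import json

# Hypothetical semi-structured log lines: each record carries whatever fields it has.
raw_lines = [
    '{"user": "alice", "event": "login", "ts": 1426550400}',
    '{"user": "bob", "event": "purchase", "ts": 1426550460, "amount": 4.99}',
    '{"user": "carol", "event": "login", "ts": 1426550520, "device": "ios"}',
]

# "Write" side: lines are stored exactly as they arrive, with no schema check.
store = list(raw_lines)


def read_with_schema(lines, fields):
    """Apply a schema at read time: pick the requested fields, default to None."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two readers can impose different schemas on the same stored data.
purchases_view = list(read_with_schema(store, ["user", "amount"]))
devices_view = list(read_with_schema(store, ["user", "device"]))

print(purchases_view)
print(devices_view)
```

The write path does no validation, which is where the fast write performance comes from; the cost moves to every reader, who has to cope with missing fields.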



Hadoop today is …

Still Open Source !

Began with HDFS and Map/Reduce

Now comprises a number of additional technologies

File systems (e.g. Tachyon)

Cluster Managers (e.g. YARN + Mesos)

Execution Engines (e.g. Tez, Spark etc.)

Analytical Layer and Applications (e.g. Hive, Pig, various SQL on Hadoop)


Hadoop With Everything?

Hadoop was invented to more easily distribute the Nutch web search engine across a cluster of machines.

Map/Reduce – distributed processing

HDFS – distributed file system

Began to be used for …. just about everything.

But not all processing tasks are like indexing the Internet

Hadoop started to attract criticism

But usually when it was being used for something it wasn’t designed for


Definitely NOT jobs for Hadoop

Word processing

Payroll system

Anything on a single computer

Anything with “small” data


Analytical Queries

“GROUP BY” logic, i.e. not concerned with individual data items

Analytical Functions

MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …

Table joins, nested subqueries

Usually short-running, ad-hoc and submitted many at a time.


Map/Reduce and HDFS : the wrong tools for Analytics ?

Queries tend to be short : fault tolerance is less important

If the chance of failure in a 5 hour batch is 1 in 300,

then the chance of failure in a 5 second query is about 1 in 1,000,000

Queries tend to be short : start-up time is significant

a 20 second start-up time is NOT OK on a 5 second query

A number of projects started to address these issues

e.g. “Hot containers” in Hive on Tez to reduce start-up time

Also Pushdown via Hive partitions or ORC predicate pushdown
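The two numbers on this slide can be reproduced with a little arithmetic. This is a sketch under one stated assumption that is not on the slide: failures are independent and equally likely in every 5 second interval of the batch.

```python
batch_seconds = 5 * 60 * 60      # 5 hour batch
query_seconds = 5                # 5 second query
p_batch_failure = 1 / 300        # figure quoted on the slide

# Number of query-length intervals in one batch run.
intervals = batch_seconds // query_seconds   # 3600

# Assumption: each interval fails independently with the same probability p,
# so P(batch survives) = (1 - p) ** intervals. Solve for p.
p_query_failure = 1 - (1 - p_batch_failure) ** (1 / intervals)
one_in = round(1 / p_query_failure)

# Start-up overhead: 20 s of start-up in front of 5 s of useful work.
startup_seconds = 20
startup_share = startup_seconds / (startup_seconds + query_seconds)

print(f"query failure chance: about 1 in {one_in:,}")
print(f"start-up share of elapsed time: {startup_share:.0%}")
```

Under that assumption the per-query failure chance comes out a little over 1 in a million, and the 20 second start-up accounts for 80% of the elapsed time of a 5 second query.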


Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation

Map/Reduce: the wrong language for Analytics ?

Stage 0: Map-Shuffle-Reduce

Mapper(row) {

fields = row.split("\t")

emit(fields[0], fields[1]);

}

Reducer(key, values) {

sum = 0;

for (value in values) {

sum += value;

}

emit(key, sum);

}

Stage 1: Map-Shuffle

Mapper(row) {

...

emit(page_views, page_name);

}

... shuffle

Stage 2: Local

data = open("stage1.out")

for (i in 0 to 10) {

print(data.getNext())

}


Equivalent in SQL

SELECT

page_name,

SUM(page_views) views

FROM wikistats

GROUP BY page_name

ORDER BY views DESC

LIMIT 10;
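To see the two styles side by side on real rows, here is a small runnable sketch. It is not the code from the Shark presentation: the sample rows are made up, the hand-written mapper and reducer are simplified stand-ins for the stages above, and SQLite is used only as a convenient SQL engine.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample of (page_name, page_views) rows standing in for wikistats.
rows = [("Main_Page", 120), ("Hadoop", 30), ("SQL", 45),
        ("Main_Page", 80), ("SQL", 15), ("Hadoop", 5)]


# --- Map/Reduce style, written out by hand ---
def mapper(row):
    page_name, page_views = row
    yield page_name, page_views


def reducer(key, values):
    return key, sum(values)


shuffled = defaultdict(list)          # the "shuffle": group mapped values by key
for row in rows:
    for key, value in mapper(row):
        shuffled[key].append(value)

reduced = [reducer(key, values) for key, values in shuffled.items()]
top_mr = sorted(reduced, key=lambda kv: kv[1], reverse=True)[:10]

# --- The same question as one SQL statement ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
conn.executemany("INSERT INTO wikistats VALUES (?, ?)", rows)
top_sql = conn.execute(
    """
    SELECT page_name, SUM(page_views) AS views
    FROM wikistats
    GROUP BY page_name
    ORDER BY views DESC
    LIMIT 10
    """
).fetchall()
conn.close()

print(top_mr)
print(top_sql)
```

Both halves produce the same ranked list; the SQL half leaves grouping, sorting and limiting to the engine.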


The SQL language

Portable

Well-defined standards exist

No detailed knowledge of the platform required

e.g. you don’t need to manage memory

SQL is assumed by a lot of reporting tools

Widely used and understood even by non-technical people


I’m not saying that SQL is perfect

• Try writing the simple Hadoop “Word Count” example in pure SQL

• Or try to “sessionise” weblog data

• Or anything with data that is not structured

  • “Which part of STRUCTURED Query Language don’t you understand …?!”

• All I’m saying is that it is an excellent language for analytical queries.
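Sessionising is a good example of row-by-row logic that reads naturally in a procedural language. The sketch below is illustrative only: the 30 minute inactivity threshold, the sample hits and the `sessionise` function are assumptions, not anything taken from the talk.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60   # assumption: 30 minutes of inactivity ends a session

# Hypothetical weblog hits: (user_id, unix_timestamp)
hits = [
    ("u1", 1000), ("u1", 1200), ("u1", 5000),
    ("u2", 1100), ("u2", 1150),
]


def sessionise(hits, gap=SESSION_GAP_SECONDS):
    """Return (user_id, session_number, timestamp) for every hit."""
    out = []
    ordered = sorted(hits)                      # by user, then by time
    for user, user_hits in groupby(ordered, key=itemgetter(0)):
        session, previous = 0, None
        for _, ts in user_hits:
            if previous is None or ts - previous > gap:
                session += 1                    # gap too long: start a new session
            out.append((user, session, ts))
            previous = ts
    return out


sessions = sessionise(hits)
for row in sessions:
    print(row)
```

Each hit depends on the previous hit for the same user, which is exactly the kind of ordering-dependent state that a plain GROUP BY query does not express.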


Hadoop could handle SQL (via Hive), but historically …

High Latency

Restricted SQL options

All but simple table joins were difficult

Little support for compression & indexing

Merv Adrian (Gartner Research - 2014)

“What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”

Result : EVERYTHING looked good compared to Hive


Everyone still likes to compare themselves to Hive


EXASOL being no exception !


Hive continues to be improved …

Completed

Views (HIVE-1143)

Partitioned Views (HIVE-1941)

Storage Handlers (HIVE-705)

HBase Integration

HBase Bulk Load

Locking (HIVE-1293)

Indexes (HIVE-417)

Bitmap Indexes (HIVE-1803)

Filter Pushdown (HIVE-279)

Table-level Statistics (HIVE-1361)

Dynamic Partitions

Binary Data Type (HIVE-2380)

Decimal Precision and Scale Support

HCatalog

HiveServer2 (HIVE-2935)

Column Statistics in Hive (HIVE-1362)

List Bucketing (HIVE-3026)

Group By With Rollup (HIVE-2397)

Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433)

Optimizing Skewed Joins (HIVE-3086)

Correlation Optimizer (HIVE-2206)

Hive on Tez (HIVE-4660)

Vectorized Query Execution (HIVE-4160)

In Progress

Atomic Insert/Update/Delete (HIVE-5317)

Transaction Manager (HIVE-5843)

Cost Based Optimizer in Hive (HIVE-5775)

Proposed

Spatial Queries

Theta Join (HIVE-556)

JDBC Storage Handler

MapJoin Optimization

Proposal to standardize and expand Authorization in Hive

Dependent Tables (HIVE-3466)

AccessServer

Type Qualifiers in Hive

MapJoin & Partition Pruning (HIVE-5119)

SQL Standard based secure authorization (HIVE-5837)

Updatable Views (HIVE-1143)

Hive on Spark (HIVE-7292)


The dream data architecture for analytics …

Based on the SQL language

but leverages Hadoop’s extreme scalability

and Hadoop’s fault tolerance

while not compromising on speed.

Could it please also have some maturity ?

And be easy to use ?


The current reality

SQL on SQL, which is arguably

Less scalable

Less fault tolerant

Less good with unstructured data

SQL on Hadoop, which is arguably

Less mature

Less easy to use

Slower


Choices for SQL and Hadoop

SQL AND HADOOP

A Connector

HADOOP ON SQL

User Defined Functions

SQL ON HADOOP

Something like Hive, but better


Option 1 – SQL AND HADOOP

Run SQL on SQL and Hadoop on Hadoop and use a connector to join the two systems

Pros

Minimal impact (SQL and Hadoop worlds can function as before)

Easier to implement

Cons

Network !

Challenge of optimising across two technologies


Option 2 – HADOOP ON SQL

Bring Map/Reduce into the Parallel database

For example using Java User Defined Functions

select my_java_map_function(words) a_word,

count(*) word_count

from DOCUMENTS

group by 1

Doesn’t benefit from Hadoop’s storage advantages
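The same pattern can be tried on a laptop. This is only a sketch of the idea: SQLite and a Python function stand in for the parallel database and the Java UDF, `normalise_word` is a hypothetical function, and unlike a true map function it returns exactly one value per input row.

```python
import sqlite3


def normalise_word(raw):
    """Hypothetical 'map' step: lower-case a token and strip punctuation."""
    return raw.strip(".,!?;:\"'").lower()


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("normalise_word", 1, normalise_word)

conn.execute("CREATE TABLE documents (words TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?)",
    [("Hadoop",), ("hadoop,",), ("SQL",), ("sql!",), ("SQL.",)],
)

# The user-defined function runs inside the query; the engine does the grouping.
word_counts = conn.execute(
    """
    SELECT normalise_word(words) AS a_word, COUNT(*) AS word_count
    FROM documents
    GROUP BY a_word
    ORDER BY word_count DESC
    """
).fetchall()
conn.close()

print(word_counts)
```

The custom logic travels to the data rather than the data being pulled out to the code, which is the point of this option.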


Option 3 - SQL ON HADOOP

Build a relational database on Hadoop storage

Impala (Cloudera)

Stinger (Hortonworks)

Presto (Facebook)

SparkSQL (UC Berkeley)

HAWQ (Pivotal)

BigSQL (IBM)

Apache Phoenix (for HBase)

Apache Tajo

Apache Drill

etc etc etc ….

AND DON’T FORGET HIVE!


Four possible market outcomes…

Hadoop and SQL databases are on a collision course – only one will survive

No sign of that so far

They are complementary – both will survive

Probably - the challenge is how to make them work together

They will merge and become one

Some indications this is already starting to happen

Something even more amazing will come along and replace them both

Sometimes this happens – Spark ?


What do the pundits say?

Martin Fowler – Thoughtworks

The rise of NoSQL databases marks the end of the era of relational database dominance

But NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice.

The era of Polyglot Persistence has begun - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data

Emil Eifrem – Neo Technology

When evaluating a NoSQL database, it is critical to demand enterprise-readiness. An enterprise delivering modern applications needs a NoSQL database that can manage today's complex and connected data while still delivering the enterprise strength, transactions and durability that IT departments have relied on for years.

http://martinfowler.com/bliki/PolyglotPersistence.html


Case Study - King

King in numbers

• 100 million daily active users

• 1 billion game plays per day

• 8 offices

And lots and lots of data...

• 14 billion rows per day

• 500 GB of new data per day

• 700 TB stored


King - Getting to know 500 million players

Objectives in game analytics

• Metrics and KPIs

• Measure and understand player behaviour

• Player segmentation

• Improve player experience

• Forecasting

• Predictive modelling


King - Getting to know 500 million players

Challenges at King

• Extreme scale

• Rate of growth

• Speed of innovation

• Cross platform

• Virtual economies


King - Getting to know 500 million players

The King formula

• Data driven culture

• Engaged business

• Talented embedded data scientists

• A/B testing

• Right technology platform

• Right data model


How King does data

System architecture

[Architecture diagram. Components shown: game servers, log server, TSV log files, ETL, data warehouse, raw data, dimensional model, reports, data scientists]


How King does data

Our data keeps growing...

[Growth chart, annotated: King launches on mobile...]


How King does data

…our technology has to keep up

[Timeline chart annotations: Qlikview says no; Infobright CE says no; Hadoop at 10, 20, 40 and 80 nodes; InfiniDB; Exasol]


How King does data

Why ExaSolution?

• Speed

• Efficiency

• Tuning free

• Scaling (150 TB and counting...)

• ExaDudes


Where next?

Future challenges

• Keep on scaling

• Closer Hadoop integration

• Evolving data model

• Microbatch ETL

• Real(er) time…


Internet of Things

• A definition:

  • The Internet of Things (IoT) is a scenario in which objects, animals or people are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction

• Basic concept has been around for decades – now accepted into the mainstream

• Wide range of potential uses:

  • Environmental monitoring

  • Infrastructure management

  • Manufacturing

  • Energy management

  • Medical and healthcare systems

  • Building and home automation

  • Transport systems


Other emerging technologies which produce data

• Wearable technologies – e.g. smart watches, Google Glass

  • Bio sensors for humans (and other animals)

  • Health monitoring

  • Already in use on some dairy farms – optimise milk yields and give early warning for possible disease

• Location based data

  • All modern phones provide location data (either GPS or cell based)

  • ‘crowd sourcing’ – e.g. traffic flow based on cellphone signals

  • Beacons – e.g. Regent Street in London

  • Location-based special offers and advertisement

• Facial recognition

  • To drive targeted advertisements


Other emerging trends

• Cloud being used for evaluation of new technologies and also as a platform for dev/test (and even DR) environments

• In-database analytics using UDFs in languages such as R, Lua and Python

• Move the processing closer to the data

• Run analytics on full data volumes (no sampling/extract required)

• Get improved performance due to parallelism (where possible)

• Lots of freely available R code on the web

• Automated conversion of analytical results to text (NLG) is emerging

• AI rule-based generation of natural language output

• Readable summaries and recommendations

• Yseop, NarrativeScience, Automated Insights, Arria NLG
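In-database analytics with a Python UDF can be sketched using SQLite's user-defined aggregates. This is an illustration of the idea only: the `game_plays` table, its rows and the `py_median` aggregate are hypothetical, and a single-process SQLite database shows none of the parallelism mentioned above.

```python
import sqlite3
import statistics


class Median:
    """User-defined aggregate: sqlite3 calls step() per row and finalize() per group."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
# Register the Python class as an aggregate callable from SQL.
conn.create_aggregate("py_median", 1, Median)

conn.execute("CREATE TABLE game_plays (game TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO game_plays VALUES (?, ?)",
    [("saga", 10), ("saga", 30), ("saga", 20), ("quest", 5), ("quest", 7)],
)

# The analytic runs over every stored row, with no sampling or extract step.
medians = conn.execute(
    "SELECT game, py_median(score) FROM game_plays GROUP BY game ORDER BY game"
).fetchall()
conn.close()

print(medians)
```

The analytic code is registered once and then called like any built-in SQL function, so the data never has to leave the database.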


Summary

• Data and database technology isn’t going away!

• New database approaches are being developed to address the requirements of flexibility, scalability etc.

• These technologies drive an increasing need for more analysts, database designers, data scientists

• Hybrid systems are becoming the norm, with companies mixing ‘best of breed’ technologies (possibly open source) to get the best and most cost-effective results – use ‘the right tool for the job’

• SQL databases will continue to be widely utilised, but alongside other technologies, and integration between them will become tighter


Dave Shuttleworth

Twitter: @EXA_Daves

Email: [email protected]

Any questions?
