Batch processing and stream processing by SQL

Post on 26-Jan-2015

Batch processing and Stream processing by SQL

@tagomoris (TAGOMORI Satoshi)

2014/07/08

Hadoop Conference Japan 2014 #hcj2014

Tuesday, July 8, 2014

TAGOMORI Satoshi (@tagomoris)

LINE Corporation

Analytics Platform Team


SQL


BATCH and/or STREAM

Analytics data flow overview

[Diagram components: servers, Fluentd Cluster, archive, Hadoop / Hive, Presto, Fluentd, Norikra, visualization, notifications, application metrics]

“Log analysis systems and its designs in LINE corp. 2014 early”

http://www.slideshare.net/tagomoris/log-analysis-system-and-its-designs-in-line-corp-2014-early

[Diagram: the analytics data flow overview again, with STREAM and BATCH labels added]

[Diagram: the analytics data flow overview again, with STREAM, BATCH, and SQL labels added]

SQL is NOT the best.

But, SQL is better than NONE.

What supports SQL:

RDBMS

Apache Hive (on MR/Spark/Tez)

Facebook Presto, Cloudera Impala, Apache Drill

Google BigQuery, ......


          DB      Batch            Short Batch

non-SQL   NoSQL   Hadoop MR, Pig   ----

SQL       RDBMS   Hive             Presto, Impala, Drill

Batch processing.

OR Stream processing?

Batch processing

Hadoop/Hive

Target window: hours - weeks (or more)

Total throughput: HIGHEST

Query latency: LARGEST (20 sec - mins - hours)

Short Batch processing

Presto, Impala, Drill

Target window: seconds - hours (- days)

Total throughput: Normal

Query latency: Small (seconds - mins)


Stream processing

Storm, Kafka, Esper, Norikra, Fluentd, ....

Spark streaming(?)

Target window: seconds - hours

Total throughput: Normal

Query latency: SMALLEST (milliseconds)

Queries must be written BEFORE DATA

Once registered, runs forever
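
The last two points can be illustrated with a toy sketch. It is not how Storm, Esper, or Norikra are implemented; the class and method names (`StandingQueryEngine`, `register`, `send`) are made up for this example. It only shows that a stream query sees events that arrive after it was registered, and keeps running until it is removed.

```python
from typing import Callable, Dict, List


class StandingQueryEngine:
    """Toy engine (hypothetical): queries are registered first, then run per event."""

    def __init__(self) -> None:
        self._queries: Dict[str, Callable[[dict], bool]] = {}
        self.matches: Dict[str, List[dict]] = {}

    def register(self, name: str, predicate: Callable[[dict], bool]) -> None:
        # A registered query stays active until it is explicitly removed.
        self._queries[name] = predicate
        self.matches[name] = []

    def deregister(self, name: str) -> None:
        self._queries.pop(name, None)

    def send(self, event: dict) -> None:
        # Each incoming event is evaluated once against every active query.
        # Events are not stored, so later queries cannot look back at them.
        for name, predicate in self._queries.items():
            if predicate(event):
                self.matches[name].append(event)


engine = StandingQueryEngine()
engine.send({"age": 45})                      # arrives before any query exists
engine.register("over_30", lambda e: e["age"] > 30)
engine.send({"age": 31})                      # matched
engine.send({"age": 12})                      # filtered out
```

After this run, `engine.matches["over_30"]` holds only the event with age 31: the age-45 event arrived before the query was written, so it is lost to that query.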


Data flow and latency

[Diagram labels: data window, query execution, Batch, Short Batch, Stream, incremental query execution]

Data window

Target time (or size) range of queries

Batch (or short-batch)

FROM-TO: WHERE dt >= '2014-07-07 00:00:00' AND dt <= '2014-07-08 23:59:59'

Stream

“Calculate this query every 3 minutes”

Extended SQL required
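
The two kinds of data window can be contrasted in a small Python sketch. The function names and sample timestamps are assumptions made for illustration, and the "stream" side runs over a list here only to show how events are assigned to 3-minute windows; a real engine would do this as events arrive.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def batch_count(events, start, end):
    """Batch style: count stored events with start <= dt <= end (FROM-TO)."""
    return sum(1 for dt in events if start <= dt <= end)


def tumbling_counts(events, width):
    """Stream style: put each event into a fixed-width window, count per window."""
    epoch = datetime(1970, 1, 1)
    counts = defaultdict(int)
    for dt in events:
        index = (dt - epoch) // width           # timedelta // timedelta -> int
        counts[epoch + index * width] += 1      # key is the window start time
    return dict(counts)


# Hypothetical event timestamps.
events = [
    datetime(2014, 7, 7, 0, 0, 30),
    datetime(2014, 7, 7, 0, 2, 59),
    datetime(2014, 7, 7, 0, 3, 0),
    datetime(2014, 7, 9, 0, 0, 0),   # outside the batch FROM-TO range
]

in_range = batch_count(
    events, datetime(2014, 7, 7, 0, 0, 0), datetime(2014, 7, 8, 23, 59, 59)
)
per_window = tumbling_counts(events, timedelta(minutes=3))
```

The batch query names its range explicitly, while the windowed version has no FROM-TO at all: the window width is part of the query, which is the piece standard SQL has no syntax for.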


Stream processing with SQL

Esper: Java library to process Stream

With schema

Stream processing with SQL

Esper: Java library to process Stream

Esper EPL

SELECT param1, param2
FROM tbl
WHERE age > 30

Stream processing with SQL

Esper EPL

SELECT param, COUNT(*) AS c
FROM tbl
WHERE age > 30
GROUP BY param

Stream processing with SQL

Esper EPL

SELECT param, COUNT(*) AS c
FROM tbl.win:time_batch(1 hour)
WHERE age > 30
GROUP BY param
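
To make the semantics of the windowed query concrete, here is a rough Python emulation. It is a sketch under simplifying assumptions, not Esper's implementation: the class name `TimeBatchCount` is hypothetical, windows are driven by the event timestamps rather than by the engine clock, and a window is only closed when a later event arrives.

```python
from collections import Counter


class TimeBatchCount:
    """Rough emulation of:
    SELECT param, COUNT(*) AS c FROM tbl.win:time_batch(<width>)
    WHERE age > 30 GROUP BY param
    """

    def __init__(self, width_sec, on_emit):
        self.width = width_sec
        self.on_emit = on_emit          # callback(window_start, {param: count})
        self.window_start = None
        self.counts = Counter()

    def send(self, ts, event):
        if self.window_start is None:
            self.window_start = ts      # first event opens the first window
        # Emit and reset every window that ended at or before this timestamp.
        while ts >= self.window_start + self.width:
            self.on_emit(self.window_start, dict(self.counts))
            self.counts.clear()
            self.window_start += self.width
        if event["age"] > 30:           # WHERE age > 30
            self.counts[event["param"]] += 1   # GROUP BY param, COUNT(*)


results = []
query = TimeBatchCount(3600, lambda start, counts: results.append((start, counts)))
query.send(0, {"param": "a", "age": 40})
query.send(10, {"param": "a", "age": 20})      # filtered out by WHERE
query.send(3599, {"param": "b", "age": 31})
query.send(3600, {"param": "a", "age": 50})    # closes the first window
query.send(7200, {"param": "b", "age": 35})    # closes the second window
```

Each closed window yields one batch of grouped counts, which is the "calculate this query every hour" behaviour that the `win:time_batch(1 hour)` view adds to the plain GROUP BY query.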


Norikra: Schema-less Stream Processing with SQL

OSS, based on Esper EPL, GPLv2

Without pre-defined schema

Complex event processing (w/ nested hash/array) w/ SQL

HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)

Dynamic query registration/removing

Ultra fast bootstrap (in 3 minutes!)

UDF plugins by Java/Ruby

http://norikra.github.io/

Distributed processing OR NOT?

Norikra is NOT a distributed processing platform.

Of course, SCALE OUT IS FANTASTIC.

Is non-distributed software useless?

MySQL

MySQL Cluster

Norikra can handle 10k events/sec on a 2-CPU (8-core) server

          DB      Batch            Short Batch             Stream

non-SQL   NoSQL   Hadoop MR, Pig   ----                    Storm, Kafka, Dataflow (G)

SQL       RDBMS   Hive             Presto, Impala, Drill   Norikra

Lambda architecture

Just the same 2 processes on:

Stream processing

Batch processing

http://lambda-architecture.net/

Replayable processing

Stream processing MUST NOT be replayable

Queries on stream processing SHOULD be replayable

Hybrid processing: for fault-tolerance

Stream processing: executes queries in normal operation

Batch processing: executes recovery queries

Hybrid processing: for latency-reduction + accuracy

Stream processing: for prompt reports (preliminary figures)

Batch processing: for fixed reports (final figures)
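
One possible way to combine the two kinds of report is sketched below. This is an assumption for illustration, not a description of the system used at LINE: the `ReportStore` class, its method names, and the sample values are all hypothetical. The stream side writes a prompt value as soon as a window closes, and the batch side later replaces it with the fixed value.

```python
class ReportStore:
    """Hypothetical store holding one value per report key, tagged by origin."""

    def __init__(self):
        self._rows = {}   # key -> (value, "prompt" | "fixed")

    def put_prompt(self, key, value):
        # A late prompt value must never overwrite an already fixed one.
        current = self._rows.get(key)
        if current is None or current[1] == "prompt":
            self._rows[key] = (value, "prompt")

    def put_fixed(self, key, value):
        # The batch result is authoritative and always wins.
        self._rows[key] = (value, "fixed")

    def get(self, key):
        return self._rows[key]


store = ReportStore()
key = ("20140708", "00", "campaign-1")        # made-up report key

store.put_prompt(key, 980)                    # from the stream query, available early
early = store.get(key)

store.put_fixed(key, 1000)                    # from the batch query, available later
store.put_prompt(key, 985)                    # late stream output is ignored
final = store.get(key)
```

Readers of the report get a number quickly and can tell from the tag whether it is still preliminary.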


Hybrid stream processing: against complexity

Non-SQL stream processing: for simple, fixed, high-traffic events

SQL stream processing: for complex, fragile events

Case study in LINE

Prompt-report & fixed-report

Norikra + Hive Hybrid

Error detection from application and access logs

Norikra + Fluentd Hybrid

Realtime aggregation for complex and simple (fixed) objects

Norikra + Fluentd Hybrid


Hive: fixed-reports

SELECT
  yyyymmdd, hh, campaign_id, region, lang,
  COUNT(*) AS click,
  COUNT(DISTINCT member_id) AS uu
FROM (
  SELECT
    yyyymmdd, hh,
    get_json_object(log, '$.campaign.id') AS campaign_id,
    get_json_object(log, '$.member.region') AS region,
    get_json_object(log, '$.member.lang') AS lang,
    get_json_object(log, '$.member.id') AS member_id
  FROM applog
  WHERE service='myservice'
    AND yyyymmdd='20140708'
    AND hh='00'
    AND get_json_object(log, '$.type')='click'
) x
GROUP BY yyyymmdd, hh, campaign_id, region, lang

Norikra: prompt-reports

SELECT
  campaign.id AS campaign_id,
  member.region AS region,
  member.lang AS lang,
  COUNT(*) AS click,
  COUNT(DISTINCT member.id) AS uu
FROM myservice.win:time_batch(1 hours)
WHERE type="click"
GROUP BY campaign.id, member.region, member.lang
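
Both the Hive query and the Norikra query compute the same aggregate: clicks and unique users per campaign, region, and language. The Python sketch below spells out that logic for one window of JSON events. The function name and the sample records are assumptions for illustration; only the field names come from the queries above.

```python
import json
from collections import defaultdict


def aggregate_clicks(lines):
    """Count clicks and distinct member ids per (campaign.id, region, lang)."""
    clicks = defaultdict(int)
    users = defaultdict(set)
    for line in lines:
        event = json.loads(line)
        if event.get("type") != "click":          # WHERE type = "click"
            continue
        key = (                                   # GROUP BY on nested fields
            event["campaign"]["id"],
            event["member"]["region"],
            event["member"]["lang"],
        )
        clicks[key] += 1                          # COUNT(*) AS click
        users[key].add(event["member"]["id"])     # COUNT(DISTINCT member.id) AS uu
    return {k: {"click": clicks[k], "uu": len(users[k])} for k in clicks}


def make_event(kind, member_id):
    # Made-up sample record with the nested shape the queries expect.
    return json.dumps({
        "type": kind,
        "campaign": {"id": "c1"},
        "member": {"id": member_id, "region": "JP", "lang": "ja"},
    })


window = [
    make_event("click", "u1"),
    make_event("click", "u1"),   # same user clicks twice
    make_event("click", "u2"),
    make_event("view", "u3"),    # not a click, ignored
]
report = aggregate_clicks(window)
```

The Norikra version addresses nested fields directly (`campaign.id`), which is why it needs neither the subquery nor `get_json_object`.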


More queries, more simplicity and less latency.

Thanks!
