Lambda Architecture Using SQL
Keynote of HadoopCon 2014 Taiwan:
* Data analytics platform architecture & designs
* Lambda architecture overview
* Using SQL as DSL for stream processing
* Lambda architecture using SQL
Lambda Architecture Platform Using SQL
Sep 13 2014, HadoopCon 2014 Taiwan
TAGOMORI Satoshi (@tagomoris)
About Me & LINE
Data analytics workloads
* Batch processing
* Stream processing
Lambda architecture
Lambda architecture using SQL
Topics
Norikra: Stream processing with SQL (13:30-14:20, 4F)
Reports
* Monthly/Daily reports
* Hourly (or shorter) news
Real-time metrics
* Automatically updated reports/graphs
* Alerts for abuse of services, overload, ...
Various Data Analytics Workloads
Hadoop
* MapReduce (or Spark, Tez) & DSLs (Hive, Pig, ...)
* For reports
MPP Engines
* Cloudera Impala, Apache Drill, Facebook Presto, ...
* For interactive analysis
* For reports over shorter windows
Batch Processing
Apache Storm
* Incubator project
* “Distributed and fault-tolerant realtime computation”
Norikra
* by tagomoris
* Non-distributed “Stream processing with SQL”
Stream Processing
Lower latency
* Realtime metrics
* Short-term prompt reports
Less computing power
* 10 Mbps for batch processing: 100 GB/day
* 10 Mbps for stream processing: 1 server
No query schedule management
* Once a query is registered, it runs forever
Why Stream Processing?
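The "10 Mbps is about 100 GB/day" figure can be checked with a little arithmetic. The following is a minimal sketch, assuming decimal units (1 Mbps = 10^6 bits/s, 1 GB = 10^9 bytes); the function name is illustrative and not from the talk.

```python
def daily_volume_gb(mbps: float) -> float:
    """Convert a sustained rate in megabits per second to gigabytes per day."""
    bytes_per_second = mbps * 1_000_000 / 8   # bits -> bytes
    seconds_per_day = 24 * 60 * 60            # 86,400 seconds
    return bytes_per_second * seconds_per_day / 1_000_000_000


volume = daily_volume_gb(10)
print(f"{volume:.0f} GB/day")
```

Under these assumptions the result is 108 GB/day, which the slide rounds to 100 GB/day: a batch system has to store and rescan that volume, while a stream processor only has to keep up with 10 Mbps as it arrives.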
Queries must be written before data arrives
* There should be another way to query past data
Queries cannot be run twice
* All results are lost when any error occurs
* All data is gone by the time bugs are found
Out-of-order events break results
* Recorded-time-based queries? Or arrival-time-based queries?
Disadvantages of Stream Processing
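The recorded-time versus arrival-time question can be made concrete with a small sketch. The events, field names, and 10-minute window below are hypothetical and chosen only to show how one delayed event changes the per-window counts.

```python
from collections import Counter

WINDOW_SECONDS = 600  # 10-minute tumbling windows (assumed for illustration)


def window_start(ts: int) -> int:
    """Return the start (in seconds) of the window that contains ts."""
    return ts - ts % WINDOW_SECONDS


def count_per_window(events, time_field):
    """Count events per tumbling window, keyed by the chosen time field."""
    return Counter(window_start(e[time_field]) for e in events)


# Hypothetical events: the third was recorded at 590 s but arrived at 610 s.
events = [
    {"recorded": 100, "arrived": 101},
    {"recorded": 300, "arrived": 302},
    {"recorded": 590, "arrived": 610},  # delayed in transit
    {"recorded": 700, "arrived": 701},
]

by_recorded = count_per_window(events, "recorded")
by_arrival = count_per_window(events, "arrived")
print(dict(by_recorded))  # {0: 3, 600: 1}
print(dict(by_arrival))   # {0: 2, 600: 2}
```

A stream processor that closes its window on arrival time reports 2 events for the first window, while a later batch query on recorded time reports 3. Neither is wrong; they answer different questions.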
“The Lambda-architecture aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.”
Lambda Architecture
http://lambda-architecture.net/
Lambda Architecture: Overview
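[Diagram: new data flows into both the batch layer (master dataset) and the speed layer (real-time view); the batch layer feeds the serving layer (view); a query reads from the serving layer's view and the speed layer's real-time view.]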
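The data flow in the overview can be sketched as a toy in-memory counter. This is a simplified illustration, not how any real Lambda platform is implemented: the class and method names are hypothetical, and a real system would not recompute the batch view and expire the real-time view in one atomic step.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter with a batch view and a real-time view."""

    def __init__(self):
        self.master_dataset = []        # append-only raw events
        self.batch_view = Counter()     # recomputed from the master dataset
        self.realtime_view = Counter()  # increments since the last batch run

    def new_data(self, path):
        # New data goes to both the batch layer and the speed layer.
        self.master_dataset.append(path)
        self.realtime_view[path] += 1

    def run_batch(self):
        # Batch layer: recompute the view from scratch, then drop the
        # real-time increments that the new batch view already covers.
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, path):
        # Query: merge the batch view with the real-time view.
        return self.batch_view[path] + self.realtime_view[path]


counter = LambdaCounter()
counter.new_data("/api/a")
counter.new_data("/api/a")
counter.run_batch()          # batch view now holds 2
counter.new_data("/api/a")   # only the real-time view sees this one so far
print(counter.query("/api/a"))  # 3
```

Because the master dataset is kept, a broken real-time view can always be discarded and rebuilt by the next batch run.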
Twitter Summingbird
Lambda architecture library
* Batch mode: Scalding on Hadoop MapReduce
* Realtime mode: Storm
Word counting by Summingbird (Scala):
    def wordCount[P <: Platform[P]]
      (source: Producer[P, String], store: P#Store[String, Long]) =
        source.flatMap { sentence =>
          toWords(sentence).map(_ -> 1L)
        }.sumByKey(store)
https://github.com/twitter/summingbird
https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
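The idea behind the word count above, one piece of logic that gives the same totals in batch and in realtime mode, can be sketched in plain Python. This is an analogue, not Summingbird's API: `to_words` is a stand-in for `toWords` (its lowercasing and whitespace splitting are assumptions), and a `Counter` stands in for the store.

```python
from collections import Counter


def to_words(sentence):
    """Split a sentence into lowercase words (assumed behaviour of toWords)."""
    return sentence.lower().split()


def word_count(sentences, store=None):
    """Map each sentence to (word, 1) pairs and sum them by key into store."""
    store = Counter() if store is None else store
    for sentence in sentences:
        for word in to_words(sentence):
            store[word] += 1
    return store


sentences = ["a b a", "b c"]

# Batch mode: process the whole dataset in one call.
batch = word_count(sentences)

# Realtime mode: feed one sentence at a time into a long-lived store.
stream = Counter()
for sentence in sentences:
    word_count([sentence], stream)

print(batch == stream)  # True
```

The two modes agree because summing counts is associative, so it does not matter how the input is split into chunks.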
What Lambda Architecture Provides
Replayable queries
* Redo queries anytime if results of speed layer are broken
Accurate results on demand
* Prompt reports in speed layer with arrival time
* Fixed reports in batch layer with recorded time
... And many more benefits of stream processing
Why Don’t We All Use It?
Storm doesn’t fit well with many use cases
* Storm requires too many computing resources to deploy
* Summingbird requires many steps to deploy
Many directors/analysts don’t write Scala/Java
* Summingbird DSL is not easy enough for non-professionals
Norikra
Schema-less stream processing with SQL
“Norikra is a open source server software provides "Stream Processing" with SQL, written in JRuby, runs on JVM, licensed under GPLv2.”
    SELECT
      path,
      COUNT(1, status=200) AS success_count,
      COUNT(1, status=500) AS server_error_count,
      COUNT(*) AS count
    FROM AccessLog.win:time_batch(10 min, 0L)
    WHERE service='myservice' AND path LIKE '/api/%'
    GROUP BY path
http://norikra.github.io/
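To make the query's semantics concrete, here is a rough Python equivalent of what it computes for one 10-minute window. It is a sketch of the aggregation only, not of Norikra itself; the event dictionaries and the function name are assumptions.

```python
from collections import defaultdict


def summarize(events, service="myservice", prefix="/api/"):
    """Aggregate one window of access-log events per path."""
    result = defaultdict(
        lambda: {"success_count": 0, "server_error_count": 0, "count": 0}
    )
    for event in events:
        # WHERE service='myservice' AND path LIKE '/api/%'
        if event["service"] != service or not event["path"].startswith(prefix):
            continue
        row = result[event["path"]]            # GROUP BY path
        row["count"] += 1                      # COUNT(*)
        if event["status"] == 200:             # COUNT(1, status=200)
            row["success_count"] += 1
        elif event["status"] == 500:           # COUNT(1, status=500)
            row["server_error_count"] += 1
    return dict(result)


# Hypothetical events that arrived within a single window.
window_events = [
    {"service": "myservice", "path": "/api/a", "status": 200},
    {"service": "myservice", "path": "/api/a", "status": 500},
    {"service": "myservice", "path": "/api/a", "status": 404},
    {"service": "other", "path": "/api/a", "status": 200},
    {"service": "myservice", "path": "/index", "status": 200},
]

print(summarize(window_events))
```

With `time_batch(10 min, 0L)` the stream processor emits one such summary at the end of every 10-minute window, without anyone scheduling the query.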
Lambda architecture platform with almost the same queries
Batch (Hive-style SQL):
    SELECT
      path,
      COUNT(IF(status=200,1,NULL)) AS success_count,
      COUNT(IF(status=500,1,NULL)) AS server_error_count,
      COUNT(*) AS count
    FROM AccessLog
    WHERE service='myservice' AND path LIKE '/api/%'
      AND timestamp >= '2014-09-13 10:40:00'
      AND timestamp < '2014-09-13 10:50:00'
    GROUP BY path
Stream (Norikra):
    SELECT
      path,
      COUNT(1, status=200) AS success_count,
      COUNT(1, status=500) AS server_error_count,
      COUNT(*) AS count
    FROM AccessLog.win:time_batch(10 min, 0L)
    WHERE service='myservice' AND path LIKE '/api/%'
    GROUP BY path
“Pseudo Lambda” Architecture Using SQL
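How little differs between the two queries above can be shown by generating both from the same parts. The helper functions below are hypothetical and for illustration only; plain string formatting like this should not be used with untrusted input.

```python
# Parts shared by both dialects, taken from the two queries above.
WHERE = "service='myservice' AND path LIKE '/api/%'"
GROUP_BY = "path"
STATUSES = {"success_count": 200, "server_error_count": 500}


def norikra_query(window="10 min"):
    """Stream version: filtered COUNT over a time_batch window."""
    counts = ", ".join(
        f"COUNT(1, status={code}) AS {name}" for name, code in STATUSES.items()
    )
    return (
        f"SELECT path, {counts}, COUNT(*) AS count "
        f"FROM AccessLog.win:time_batch({window}, 0L) "
        f"WHERE {WHERE} GROUP BY {GROUP_BY}"
    )


def hive_query(start, end):
    """Batch version: COUNT(IF(...)) plus an explicit timestamp range."""
    counts = ", ".join(
        f"COUNT(IF(status={code},1,NULL)) AS {name}"
        for name, code in STATUSES.items()
    )
    return (
        f"SELECT path, {counts}, COUNT(*) AS count "
        f"FROM AccessLog "
        f"WHERE {WHERE} AND timestamp >= '{start}' AND timestamp < '{end}' "
        f"GROUP BY {GROUP_BY}"
    )


print(norikra_query())
print(hive_query("2014-09-13 10:40:00", "2014-09-13 10:50:00"))
```

Only three things change between the dialects: the conditional-count syntax, the window clause in `FROM`, and the explicit time range in `WHERE`.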
SQL dialects are easy to learn!
* Standard SQL, Hive, Presto, Impala, Drill, ...
* + Norikra
For non-professional people too!
SQL queries are very easy to write twice!
Use Cases in LINE
Prompt reports for Ads service
* Short-term prompt reports by Norikra
* Daily fixed reports by Hive
Summary of application server error log
* Aggregate error log for alerting by Norikra
* Check details with Hive, Presto (or grep!)
See you later for details!