JRuby with Java Code in Data Processing World
![Page 1: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/1.jpg)
JRuby with Java Code in Data Processing World
JRubyConf.EU, 31 Jul 2015
Satoshi Tagomori (@tagomoris)
![Page 2: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/2.jpg)
Satoshi "Moris" Tagomori (@tagomoris)
Fluentd, Norikra, MessagePack-Ruby, ...
Docker logging driver for Fluentd (docker v1.8)
Treasure Data, Inc.
![Page 3: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/3.jpg)
https://jobs.lever.co/treasure-data
We're hiring!
OSS team (developer / community manager)
Distributed system engineer (Hadoop, queue/workers)
Front-end engineer (RoR)
![Page 4: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/4.jpg)
Data Processing World
![Page 5: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/5.jpg)
Data Processing World
![Page 6: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/6.jpg)
Java
Data Processing World
![Page 7: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/7.jpg)
Data Processing World
Hadoop, Spark, Tez, Flink, Storm, Kafka, ...
Hive, Pig, Drill, Impala, Presto, ....
![Page 8: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/8.jpg)
Java + Scala, Clojure + C++, ....
Data Processing World
on JVM
![Page 9: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/9.jpg)
Data Processing World
Many CPU cores, Large memory, High rate Disk I/O, ...
High throughput data processing
Hadoop YARN/MapReduce/HDFS API compatibility
![Page 10: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/10.jpg)
Two OSS projects using Java & JRuby
![Page 11: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/11.jpg)
Norikra: Stream Processing with SQL for everybody
Server software, written in JRuby, runs on JVM
Open source software (GPLv2)
http://norikra.github.io/
https://github.com/norikra/norikra
Distributed on rubygems.org
"gem i norikra"
![Page 12: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/12.jpg)
What Norikra does:
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
![Page 13: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/13.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/", "status":200, "bytes":301, "duration":0.03, "referer":"...", "user-agent":"...."}
path:"/", s:301
1
![Page 14: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/14.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/download/a", "status":200, "bytes":10240, "duration":0.53, "referer":"...", "user-agent":"...."}
path:"/", s:301
path:"/download/a", s:10240
2
![Page 15: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/15.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/", "status":404, "bytes":0, "duration":0.08, "referer":"...", "user-agent":"...."}
path:"/", s:301
path:"/download/a", s:10240
3
![Page 16: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/16.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/", "status":200, "bytes":301, "duration":0.01, "referer":"...", "user-agent":"...."}
path:"/", s:602
path:"/download/a", s:10240
4
![Page 17: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/17.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/download/b", "status":200, "bytes":678, "duration":0.11, "referer":"...", "user-agent":"...."}
path:"/", s:602
path:"/download/a", s:10240
path:"/download/b", s:678
5
![Page 18: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/18.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/download/b", "status":200, "bytes":678, "duration":0.13, "referer":"...", "user-agent":"...."}
path:"/", s:602
path:"/download/a", s:10240
path:"/download/b", s:1356
6
![Page 19: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/19.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/", "status":200, "bytes":301, "duration":0.02, "referer":"...", "user-agent":"...."}
path:"/", s:903
path:"/download/a", s:10240
path:"/download/b", s:1356
7
![Page 20: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/20.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/", "status":200, "bytes":301, "duration":0.09, "referer":"...", "user-agent":"...."}
path:"/", s:1204
path:"/download/a", s:10240
path:"/download/b", s:1356
8
![Page 21: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/21.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/download/a", "status":200, "bytes":10240, "duration":1.1, "referer":"...", "user-agent":"...."}
path:"/", s:1204
path:"/download/a", s:20480
path:"/download/b", s:1356
9
![Page 22: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/22.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
{"path":"/", "status":200, "bytes":301, "duration":0.05, "referer":"...", "user-agent":"...."}
path:"/", s:1505
path:"/download/a", s:20480
path:"/download/b", s:1356
10
![Page 23: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/23.jpg)
SELECT path, SUM(bytes) AS s FROM www_access_logs.win:length_batch(10)
WHERE status=200 GROUP BY path ORDER BY s DESC
10
{"path":"/download/a", "s":20480}
{"path":"/", "s":1505}
{"path":"/download/b", "s":1356}
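The walkthrough above can be approximated in plain Ruby. This is only a sketch of the windowing behavior, not Norikra or Esper code: `LengthBatchSum` is a hypothetical name, and the events keep only the fields the query reads. It assumes every `/` request with status 200 carries 301 bytes, which is what the running totals on the slides imply. Note that the window counts all ten events, including the 404, and the `WHERE` filter applies when the batch is aggregated.

```ruby
# Plain-Ruby sketch of the query on the slides. Not Norikra or Esper code.
# LengthBatchSum is a hypothetical name used only for this illustration.
class LengthBatchSum
  def initialize(size)
    @size = size
    @window = []
  end

  # Adds one event. Returns nil until the window holds `size` events,
  # then returns the aggregated rows and starts a new, empty window.
  def send_event(event)
    @window << event
    return nil if @window.size < @size

    batch = @window
    @window = []
    aggregate(batch)
  end

  private

  # WHERE status=200 GROUP BY path, SUM(bytes) AS s, ORDER BY s DESC
  def aggregate(batch)
    sums = Hash.new(0)
    batch.each do |e|
      next unless e["status"] == 200
      sums[e["path"]] += e["bytes"]
    end
    sums.map { |path, s| { "path" => path, "s" => s } }
        .sort_by { |row| -row["s"] }
  end
end

# The ten events from the slides as [path, status, bytes].
# Assumption: each "/" request with status 200 is 301 bytes.
events = [
  ["/", 200, 301], ["/download/a", 200, 10240], ["/", 404, 0],
  ["/", 200, 301], ["/download/b", 200, 678], ["/download/b", 200, 678],
  ["/", 200, 301], ["/", 200, 301], ["/download/a", 200, 10240],
  ["/", 200, 301]
].map { |path, status, bytes| { "path" => path, "status" => status, "bytes" => bytes } }

query = LengthBatchSum.new(10)
results = events.map { |event| query.send_event(event) }
results.last.each { |row| puts "#{row['path']} #{row['s']}" }
```

Nothing is emitted for the first nine events; the tenth fills the window and produces the three rows in the same order as the final slide of the walkthrough.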
![Page 24: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/24.jpg)
Norikra and Java
Norikra is written in JRuby and uses Esper
Key factor: productivity (33 days until first release)
Esper: a Java library that provides Complex Event Processing
SQL parser, executor
Many features and good performance
Licensed under GPLv2
![Page 25: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/25.jpg)
Plugins as rubygems
Norikra Server (on JVM)
Esper (Query Engine)
Type Definition Manager
Output Event Pool
Norikra Engine
RPC Server: mizuno (Jetty + Rack)
Rack RPC Handler
Listener
UDF
UDF: User-Defined Functions, "gem i norikra-udf-xxx"
written in Java, or JRuby (compiled to Java)
works in the Esper instance: must be a Java class
Listener: handler for output data of queries, written in JRuby, "gem i norikra-listener-xxx"
![Page 26: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/26.jpg)
Embulk
"Embulk is a open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services."
http://www.embulk.org/docs/
![Page 27: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/27.jpg)
Embulk: makes painful data integration work relaxed
Plugin-based parallel bulk data loader
Open source software (Apache License v2.0)
http://www.embulk.org/
https://github.com/embulk/embulk
Distributed as .jar or on rubygems.org
Plugins are on rubygems.org
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
http://www.slideshare.net/HiroshiNakamura/embulk-20150411
![Page 28: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/28.jpg)
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Idempotent retrying
Plugins Plugins
bulk load
![Page 29: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/29.jpg)
#ccc_cd4 / #embulk
InputPlugin -> records -> Filter plugins (convert, …) -> records -> OutputPlugin
Executor plugin: Threads, MapReduce
input, … output.
config
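The record flow on this slide (input, filters, output, driven by an executor that can use threads) can be sketched in plain Ruby. This is not Embulk's API: `run_pipeline` and the lambdas below are hypothetical stand-ins, and the thread-per-task model is an assumption made for the illustration.

```ruby
# Plain-Ruby sketch of the record flow; not Embulk's API.
# run_pipeline and the lambdas are hypothetical names for illustration.

# Runs `task_count` tasks in parallel threads. Each task reads records from
# the input, passes them through every filter in order, and hands them to
# the output. Returns the per-task record counts, ordered by task index.
def run_pipeline(task_count, input:, filters:, output:)
  threads = Array.new(task_count) do |index|
    Thread.new do
      records = input.call(index)
      filters.each { |filter| records = filter.call(records) }
      output.call(index, records)
      records.size
    end
  end
  threads.map(&:value)
end

# Demo plugins: an input producing three records per task, a filter that
# doubles one field, and an output that collects into a thread-safe queue.
input  = ->(index) { Array.new(3) { |i| { "task" => index, "n" => i } } }
double = ->(records) { records.map { |r| r.merge("n" => r["n"] * 2) } }
sink   = Queue.new
output = ->(_index, records) { records.each { |r| sink << r } }

counts = run_pipeline(2, input: input, filters: [double], output: output)
puts "records per task: #{counts.inspect}, collected: #{sink.size}"
```

Keeping the executor separate from the plugins is what lets the same input, filter, and output code run on local threads or, as the slide lists, on MapReduce.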
![Page 30: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/30.jpg)
InputPlugin: FileInput plugin -> buffer -> Decoder plugin -> buffer -> Parser plugin -> records
OutputPlugin: records -> Formatter plugin -> buffer -> Encoder plugin -> buffer -> FileOutput plugin
FileInput / FileOutput plugin: HDFS, S3, Riak CS, …
Decoder / Encoder plugin: gzip, bzip2, aes, …
Parser / Formatter plugin: CSV, JSON, pcap, …
Filter plugins: records -> records
Executor plugin
config
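The file-input side of this slide splits one job into three stages: fetch bytes, decode them, parse them into records. A minimal plain-Ruby sketch of that chain follows. None of these method names come from Embulk; they are hypothetical stand-ins, and the CSV parsing is a naive comma split that does not handle quoted fields.

```ruby
require "zlib"
require "stringio"

# Plain-Ruby sketch of FileInput -> Decoder -> Parser; not Embulk's API.
# All method names here are hypothetical.

# Stand-in for a FileInput plugin: returns the raw bytes of a "file".
def file_input(bytes)
  bytes
end

# Stand-in for a gzip Decoder plugin: compressed buffer -> plain buffer.
def gzip_decoder(buffer)
  Zlib::GzipReader.new(StringIO.new(buffer)).read
end

# Stand-in for a CSV Parser plugin: plain buffer -> records.
# The first line is the header; quoted fields are not supported.
def csv_parser(buffer)
  header, *rows = buffer.lines.map { |line| line.chomp.split(",") }
  rows.map { |row| header.zip(row).to_h }
end

# Demo helper: gzip a string in memory.
def gzip(text)
  io = StringIO.new(String.new)
  writer = Zlib::GzipWriter.new(io)
  writer.write(text)
  writer.close
  io.string
end

compressed = gzip("path,bytes\n/,301\n/download/a,10240\n")
records = csv_parser(gzip_decoder(file_input(compressed)))
records.each { |r| puts "#{r['path']} #{r['bytes']}" }
```

Because each stage only sees a buffer or a list of records, a gzip decoder written once can be combined with any storage and any file format, which is the point of the decomposition on the slide.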
![Page 31: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/31.jpg)
Embulk and Java
Embulk core is written in Java, mainly for performance
Embulk plugins:
are loaded through a JRuby-based API
are written in JRuby or Java
JRuby for early release
Java for performance
![Page 32: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/32.jpg)
InputPlugin
module Embulk
  class InputExample < InputPlugin
    Plugin.register_input('example', self)

    def self.transaction(config, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: nil)
      }
      threads = config.param('threads', :int, default: 2)

      columns = [
        Column.new(0, 'col0', :long),
        Column.new(1, 'col1', :double),
        Column.new(2, 'col2', :string),
      ]

      # BEGIN here

      commit_reports = yield(task, columns, threads)

      # COMMIT here
      puts "Example input finished"

      return {}
    end

    def run(task, schema, index, page_builder)
      puts "Example input thread #{@index}…"

      10.times do |i|
        @page_builder.add([i, 10.0, "example"])
      end
      @page_builder.finish

      commit_report = { }
      return commit_report
    end
  end
end
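The `yield` in `transaction` is where the plugin hands control to the framework: everything before it is setup ("BEGIN"), everything after it runs once all tasks have reported back ("COMMIT"). The following plain-Ruby stand-in shows that control flow without Embulk. `MiniInput` and `run_input` are hypothetical names, and running one thread per task is an assumption of this sketch, not a statement about Embulk's executor.

```ruby
# Plain-Ruby stand-in for the control flow around `yield`; not Embulk's code.
# MiniInput and run_input are hypothetical names.
class MiniInput
  # Same shape as `transaction` on the slide: prepare the task, hand
  # control to the framework with `yield`, then finish up.
  def self.transaction(config)
    task = { "message" => config.fetch("message", nil) }
    columns = %w[col0 col1 col2]
    threads = config.fetch("threads", 2)
    # BEGIN: no task has run yet
    commit_reports = yield(task, columns, threads)
    # COMMIT: every task has returned its report
    { "reports" => commit_reports }
  end

  def initialize(task, index, rows)
    @task = task
    @index = index
    @rows = rows
  end

  # One task: emits ten rows and returns its commit report.
  def run
    10.times { |i| @rows << [i, 10.0, "example"] }
    { "index" => @index, "rows" => 10 }
  end
end

# Stand-in for the framework side: the block passed to `transaction`
# starts one thread per task and collects the commit reports in order.
def run_input(plugin_class, config)
  rows = Queue.new
  result = plugin_class.transaction(config) do |task, _columns, threads|
    Array.new(threads) { |index|
      Thread.new { plugin_class.new(task, index, rows).run }
    }.map(&:value)
  end
  [result, rows.size]
end

result, row_count = run_input(MiniInput, { "threads" => 3 })
puts "tasks: #{result['reports'].size}, rows: #{row_count}"
```

The block form means the plugin never has to know how tasks are scheduled; it only sees the list of commit reports that comes back from `yield`.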
![Page 33: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/33.jpg)
OutputPlugin
module Embulk
  class OutputExample < OutputPlugin
    Plugin.register_output('example', self)

    def self.transaction(config, schema, processor_count, &control)
      # read config
      task = {
        'message' => config.param('message', :string, default: "record")
      }

      puts "Example output started."
      commit_reports = yield(task)
      puts "Example output finished. Commit reports = #{commit_reports.to_json}"

      return {}
    end

    def initialize(task, schema, index)
      puts "Example output thread #{index}..."
      super
      @message = task.prop('message', :string)
      @records = 0
    end

    def add(page)
      page.each do |record|
        hash = Hash[schema.names.zip(record)]
        puts "#{@message}: #{hash.to_json}"
        @records += 1
      end
    end

    def finish
    end

    def abort
    end

    def commit
      commit_report = {
        "records" => @records
      }
      return commit_report
    end
  end
end
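The output plugin exposes `add`, `finish`, `abort`, and `commit`, which suggests a per-task lifecycle. The sketch below shows one plausible call order in plain Ruby: pages go to `add`, then `finish` and `commit` on success, or `abort` on error. The order is an assumption of this sketch rather than something the slides state, and `CountingOutput` and `drive_output` are hypothetical names, not Embulk classes.

```ruby
# Plain-Ruby stand-in for an output task lifecycle; not Embulk's code.
# CountingOutput and drive_output are hypothetical names, and the call
# order (add*, finish, commit; abort on error) is an assumption.
class CountingOutput
  attr_reader :aborted

  def initialize
    @records = 0
    @aborted = false
  end

  # Receives one page (a list of records) at a time.
  def add(page)
    page.each { |_record| @records += 1 }
  end

  def finish; end

  def abort
    @aborted = true
  end

  # The commit report goes back to `transaction` via the yield result.
  def commit
    { "records" => @records }
  end
end

# Feeds pages to one output task. On success: finish, then commit.
# On any error: abort, then re-raise so the caller can retry the task.
def drive_output(output, pages)
  pages.each { |page| output.add(page) }
  output.finish
  output.commit
rescue StandardError
  output.abort
  raise
end

report = drive_output(CountingOutput.new, [[[0, 10.0, "example"]], [[1, 10.0, "example"]]])
puts "committed records: #{report['records']}"
```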
![Page 34: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/34.jpg)
Plugin management: Norikra
Esper instance
Engine
Plugin management
UDF Listener
plugins as gems
plugin loader written in JRuby
Java JRuby
![Page 35: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/35.jpg)
Plugin management: Embulk
Embulk core
Plugin management
input/output/filter parser/formatter
Java JRuby
decoder/encoder file-input/output executor
plugins as gems
plugin loader written in JRuby
![Page 36: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/36.jpg)
Pluggable software on JVM & Java API
Java? Scala? Clojure? JRuby?: JRuby
Plugin packaging: jar? gem?: gem
rubygems.org >>> Maven Central (or others)
especially for plugin authors
Plugin loader: Class Loader? "require"?: require
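Choosing `require` as the plugin loader means a plugin is just a Ruby file at a conventional path that registers itself when loaded. A minimal sketch of that idea follows. `MiniApp`, its registry, and the `miniapp/plugin/<type>` path convention are all hypothetical, not the loaders Norikra or Embulk ship, and an installed plugin gem is simulated by a temporary directory on the load path.

```ruby
require "tmpdir"
require "fileutils"

# Plain-Ruby sketch of "plugin loader = require". MiniApp and its
# methods are hypothetical; this is not Norikra's or Embulk's loader.
module MiniApp
  REGISTRY = {}

  # Plugins call this when their file is loaded.
  def self.register(type, klass)
    REGISTRY[type] = klass
  end

  # Looks up a plugin, requiring "miniapp/plugin/<type>" on first use.
  # With RubyGems, `require` also searches installed gems, so a plugin
  # gem only has to ship a file at that path under its lib directory.
  def self.plugin(type)
    require "miniapp/plugin/#{type}" unless REGISTRY.key?(type)
    REGISTRY.fetch(type)
  end
end

# Simulate an installed plugin gem: write its file into a temporary
# directory and put that directory on the load path.
plugin_root = Dir.mktmpdir("miniapp-plugins")
plugin_dir = File.join(plugin_root, "miniapp", "plugin")
FileUtils.mkdir_p(plugin_dir)
File.write(File.join(plugin_dir, "upcase.rb"), <<~RUBY)
  class UpcaseFilter
    def call(text)
      text.upcase
    end
  end
  MiniApp.register("upcase", UpcaseFilter)
RUBY
$LOAD_PATH.unshift(plugin_root)

puts MiniApp.plugin("upcase").new.call("loaded via require")
# prints "LOADED VIA REQUIRE"
```

A plugin author then needs nothing beyond `gem build` and `gem push`: no class-loader configuration and no separate repository setup.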
![Page 37: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/37.jpg)
JRuby in Japan
Not so many users :(
CRuby is hugely popular in Japan
Java -> Ruby -> Scala? Golang?
![Page 38: JRuby with Java Code in Data Processing World](https://reader030.vdocuments.site/reader030/viewer/2022032507/55cf0a34bb61eb1b628b4621/html5/thumbnails/38.jpg)
Make your software pluggable.
Make eco-system & community.
with JRuby!
Thanks!