How to collect Big Data into Hadoop
DESCRIPTION
Big Data processing to collect Big Data

TRANSCRIPT
Sadayuki Furuhashi
fluentd.org

How to collect Big Data into Hadoop
Big Data processing to collect Big Data
Self-introduction
> Sadayuki Furuhashi
> Treasure Data, Inc. - Founder & Software Architect
> Open source projects:
  - MessagePack - efficient serializer (original author)
  - Fluentd - event collector (original author)
Today’s topic
(diagram: Big Data -> Report & Monitor)
(diagram: Big Data -> Collect -> Store -> Process -> Visualize -> Report & Monitor)
Collect -> Store -> Process -> Visualize
Store & Process: Cloudera, Hortonworks, MapR
Visualize: Tableau, Excel, R
-> easier & shorter time
Collect -> Store -> Process -> Visualize
Store & Process (Cloudera, Hortonworks, MapR) and Visualize (Tableau, Excel, R) have become easier & shorter. How to shorten the Collect step?
Problems to collect data
Poor man’s data collection
1. Copy files from servers using rsync
2. Create a RegExp to parse the files
3. Parse the files and generate a 10GB CSV file
4. Put it into HDFS
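Steps 2 and 3 above can be sketched in a few lines of Ruby; the regexp and the sample line assume an Apache combined log format, which real logs may not follow:

```ruby
require 'csv'

# Illustrative regexp for one Apache combined-log line (assumption: the
# logs really follow this format -- in practice many lines will not).
LOG_RE = /^(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) \S+" (?<code>\d+) (?<size>\S+)/

# Convert one log line to a CSV row; returns nil for broken lines,
# which is exactly the error handling that ends up hand-written.
def line_to_csv(line)
  m = LOG_RE.match(line)
  return nil unless m
  [m[:host], m[:time], m[:method], m[:path], m[:code], m[:size]].to_csv
end

line = '127.0.0.1 - - [01/Jan/2013:01:02:03 +0900] "GET /index.html HTTP/1.1" 200 777'
line_to_csv(line)  # => "127.0.0.1,01/Jan/2013:01:02:03 +0900,GET,/index.html,200,777\n"
```

Broken lines return nil and are silently dropped here; at 10GB scale that drop, plus re-parsing everything after each regexp tweak, is where the time goes.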
Problems to collect “big data”
> Includes broken values -> needs error handling & retrying
> Time-series data are changing and unclear -> parse logs before storing
> Takes time to read/write -> tools have to be optimized and parallelized
> Takes time for trial & error
> Causes network traffic spikes
Problems of poor man's data collection

> Wastes time to implement error handling
> Wastes time to maintain a parser
> Wastes time to debug the tool
> Not reliable
> Not efficient
Basic theories to collect big data
Divide & Conquer
(diagram: data split into chunks; an error affects only one chunk)
Divide & Conquer & Retry
(diagram: failed chunks are retried independently until they succeed)
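The retry idea sketched on these slides can be written down directly; chunking, the retry limit, the wait times, and the upload block are all assumptions for illustration, not Fluentd's actual implementation:

```ruby
# Divide & conquer & retry: each chunk is sent independently, so one
# failure re-sends only that chunk, with exponentially growing waits.
def upload_chunks(chunks, max_retries: 3, base_wait: 1)
  chunks.each do |chunk|
    tries = 0
    begin
      yield chunk                        # the actual upload step
    rescue
      tries += 1
      raise if tries > max_retries       # give up: surface the error
      sleep(base_wait * 2**(tries - 1))  # exponential wait: 1s, 2s, 4s
      retry
    end
  end
end
```

Contrast with the poor man's approach, where one bad byte in a 10GB file restarts the whole job.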
Streaming
(diagram: don't handle big files on the collecting side - stream small pieces and do the heavy work on the storage side)
Apache Flume and Fluentd
Apache Flume
Apache Flume

(diagram: access logs, app logs, system logs, ... -> Agents -> Collectors)
Apache Flume - network topology

Flume OG: Agents -> Collectors, coordinated by a central Master (send and ack travel on separate paths)
Flume NG: Agents -> Collectors with direct send/ack
Apache Flume - pipeline

Flume OG: Source -> Sink
Flume NG: Source -> Channel -> Sink (each stage is a plugin)
Apache Flume - configuration

(diagram: in Flume NG, a Master node manages all configuration for Agents and Collectors; the Master is optional)
Apache Flume - configuration

```
# source
host1.sources = avro-source1
host1.sources.avro-source1.type = avro
host1.sources.avro-source1.bind = 0.0.0.0
host1.sources.avro-source1.port = 41414
host1.sources.avro-source1.channels = ch1

# channel
host1.channels = ch_avro_log
host1.channels.ch_avro_log.type = memory

# sink
host1.sinks = log-sink1
host1.sinks.log-sink1.type = logger
host1.sinks.log-sink1.channel = ch1
```
Fluentd
Fluentd - network topology

Fluentd: fluentd -> fluentd, with send/ack (every node runs the same fluentd daemon)
Flume NG: Agents -> Collectors, with send/ack
Fluentd - pipeline

Fluentd: Input -> Buffer -> Output (each stage is a plugin)
Flume NG: Source -> Channel -> Sink
Fluentd - configuration

No central node - keep things simple.
Use chef, puppet, etc. for configuration (they do things better).
Fluentd - configuration

```
<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>
```
Fluentd - configuration (compared with Flume NG)

```
<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>
```

The equivalent Flume NG configuration:

```
# source
host1.sources = avro-source1
host1.sources.avro-source1.type = avro
host1.sources.avro-source1.bind = 0.0.0.0
host1.sources.avro-source1.port = 41414
host1.sources.avro-source1.channels = ch1

# channel
host1.channels = ch_avro_log
host1.channels.ch_avro_log.type = memory

# sink
host1.sinks = log-sink1
host1.sinks.log-sink1.type = logger
host1.sinks.log-sink1.channel = ch1
```
Fluentd - Users
Fluentd - plugin distribution platform
$ fluent-gem search -rd fluent-plugin
$ fluent-gem install fluent-plugin-mongo
94 plugins!
Concept of Fluentd
Customization is essential
> small core + many plugins

Fluentd core helps to implement plugins
> common features are already implemented
Fluentd core: divide & conquer, retrying, parallelization, error handling, message routing
Plugins: read / receive data, write / send data
Fluentd plugins
in_tail

(diagram: apache writes access.log; fluentd reads it with in_tail)

✓ read a log file
✓ custom regexp
✓ custom parser in Ruby
out_mongo

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> out_mongo)
out_mongo

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> MongoDB)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
out_s3

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> Amazon S3)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time:
  2013-01-01/01/access.log.gz
  2013-01-01/02/access.log.gz
  2013-01-01/03/access.log.gz
  ...
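The time-sliced file names above come from formatting each event's timestamp; a minimal sketch of hourly slicing (the path template is copied from the example):

```ruby
# Map an event timestamp to its hourly slice, matching the
# "2013-01-01/01/access.log.gz" layout shown above.
def slice_path(time)
  time.strftime("%Y-%m-%d/%H/access.log.gz")
end

slice_path(Time.utc(2013, 1, 1, 1))  # => "2013-01-01/01/access.log.gz"
```

Because the slice key depends only on the timestamp, buffered events land in the right file no matter when they are flushed.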
out_hdfs

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> HDFS)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time:
  2013-01-01/01/access.log.gz
  2013-01-01/02/access.log.gz
  2013-01-01/03/access.log.gz
  ...
✓ custom text formatter
out_hdfs

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> multiple fluentd nodes -> HDFS)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time
✓ automatic fail-over
✓ load balancing
Fluentd examples
Fluentd at Treasure Data - REST API logs

(diagram: Rails apps on the API servers log via fluent-logger-ruby + in_forward to a local fluentd; each fluentd uses out_forward to send to the watch server)
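An event traveling through in_forward/out_forward is a (tag, time, record) triple. A toy encoder, using JSON in place of the MessagePack framing that fluentd actually uses, so the sketch runs with the standard library only:

```ruby
require 'json'

# A fluentd event is a (tag, time, record) triple. fluent-logger-ruby
# ships these to in_forward as MessagePack; JSON stands in here so the
# sketch needs only the standard library.
def encode_event(tag, time, record)
  [tag, time.to_i, record].to_json
end

encode_event("myapp.api", Time.utc(2013, 1, 1), "user" => 1)
# => "[\"myapp.api\",1356998400,{\"user\":1}]"
```

The tag ("myapp.api" here is an invented example) is what `<match>` patterns route on.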
Fluentd at Treasure Data - backend logs

(diagram: Rails apps on the API servers and Ruby apps on the worker servers log via fluent-logger-ruby + in_forward to local fluentd processes; all use out_forward to send to a fluentd on the watch server)
Fluentd at Treasure Data - monitoring

(diagram: API servers, worker servers, and the queue (PerfectQueue) are monitored; a script read via in_exec feeds fluentd, which sends the results with out_forward to the watch server)
Fluentd at Treasure Data - Hadoop logs

(diagram: on the watch server, a script run via in_exec calls the Hadoop JobTracker's thrift API)

✓ resource consumption statistics for each user
✓ capacity monitoring
Fluentd at Treasure Data - store & analyze

(diagram: the watch server's fluentd writes with out_tdlog to Treasure Data for historical analysis, and with out_metricsense to Librato Metrics for realtime analysis)

✓ streaming aggregation
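Streaming aggregation means folding events into running counters as they arrive instead of re-scanning stored logs later; a toy per-minute counter (the window size and the (unix_time, record) event shape are assumptions):

```ruby
# Toy streaming aggregation: fold events into per-minute counters as
# they arrive instead of re-scanning stored logs.
def aggregate(events)
  counts = Hash.new(0)
  events.each do |time, _record|
    window = Time.at(time).utc.strftime("%Y-%m-%d %H:%M")
    counts[window] += 1
  end
  counts
end
```

Three events at t, t+10s, and t+70s yield two counts in the first minute window and one in the second.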
Plugin development
```ruby
class SomeInput < Fluent::Input
  Fluent::Plugin.register_input('myin', self)

  config_param :tag, :string

  def start
    Thread.new {
      while true
        time = Engine.now
        record = {"user" => 1, "size" => 1}
        Engine.emit(@tag, time, record)
      end
    }
  end

  def shutdown
    ...
  end
end
```

```
<source>
  type myin
  tag myapp.api.heartbeat
</source>
```
```ruby
class SomeOutput < Fluent::BufferedOutput
  Fluent::Plugin.register_output('myout', self)

  config_param :myparam, :string

  def format(tag, time, record)
    [tag, time, record].to_json + "\n"
  end

  def write(chunk)
    puts chunk.read
  end
end
```

```
<match **>
  type myout
  myparam foobar
</match>
```
```ruby
class MyTailInput < Fluent::TailInput
  Fluent::Plugin.register_input('mytail', self)

  def configure_parser(conf)
    ...
  end

  def parse_line(line)
    array = line.split("\t")
    time = Engine.now
    record = {"user" => array[0], "item" => array[1]}
    return time, record
  end
end
```

```
<source>
  type mytail
</source>
```
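The parse_line logic above can be tried on its own outside fluentd; a standalone version with Engine.now replaced by an explicit timestamp argument (the sample line is invented):

```ruby
# Standalone version of the mytail parse_line above: tab-separated
# user/item fields, with Engine.now replaced by an explicit argument.
def parse_line(line, now = Time.now.to_i)
  array = line.chomp.split("\t")
  record = {"user" => array[0], "item" => array[1]}
  [now, record]
end

parse_line("alice\tbook-42\n", 1356998400)
# => [1356998400, {"user"=>"alice", "item"=>"book-42"}]
```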
Fluentd v11
Error stream
Streaming processing
Better DSL
Multiprocess