How to collect Big Data into Hadoop
DESCRIPTION
Big Data processing to collect Big Data

TRANSCRIPT
Sadayuki Furuhashi
fluentd.org

How to collect Big Data into Hadoop
Big Data processing to collect Big Data
Self-introduction
> Sadayuki Furuhashi
> Treasure Data, Inc. - Founder & Software Architect
> Open source projects:
  - MessagePack - efficient serializer (original author)
  - Fluentd - event collector (original author)
Today’s topic
(diagram: Big Data -> Report & Monitor)
(diagram: Big Data -> Collect -> Store -> Process -> Visualize -> Report & Monitor)
Collect -> Store -> Process -> Visualize
Store & Process: Cloudera, Hortonworks, MapR
Visualize: Tableau, Excel, R
-> easier & shorter time
Collect -> Store -> Process -> Visualize
Store & Process (Cloudera, Hortonworks, MapR) and Visualize (Tableau, Excel, R) have become easier & shorter. How to shorten the Collect step?
Problems to collect data
Poor man’s data collection
1. Copy files from servers using rsync
2. Create a RegExp to parse the files
3. Parse the files and generate a 10GB CSV file
4. Put it into HDFS
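Steps 2 and 3 above can be sketched in a few lines of Ruby; the regexp and the sample line assume an Apache combined log format, which real logs may not follow:

```ruby
require 'csv'

# Illustrative regexp for one Apache combined-log line (assumption: the
# logs really follow this format -- in practice many lines will not).
LOG_RE = /^(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) \S+" (?<code>\d+) (?<size>\S+)/

# Convert one log line to a CSV row; returns nil for broken lines,
# which is exactly the error handling that ends up hand-written.
def line_to_csv(line)
  m = LOG_RE.match(line)
  return nil unless m
  [m[:host], m[:time], m[:method], m[:path], m[:code], m[:size]].to_csv
end

line = '127.0.0.1 - - [01/Jan/2013:01:02:03 +0900] "GET /index.html HTTP/1.1" 200 777'
line_to_csv(line)  # => "127.0.0.1,01/Jan/2013:01:02:03 +0900,GET,/index.html,200,777\n"
```

Broken lines return nil and are silently dropped here; at 10GB scale that drop, plus re-parsing everything after each regexp tweak, is where the time goes.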
Problems to collect “big data”
> Includes broken values -> needs error handling & retrying
> Time-series data are changing and unclear -> parse logs before storing
> Takes time to read/write -> tools have to be optimized and parallelized
> Takes time for trial & error
> Causes network traffic spikes
Problems of poor man's data collection

> Wastes time to implement error handling
> Wastes time to maintain a parser
> Wastes time to debug the tool
> Not reliable
> Not efficient
Basic theories to collect big data
Divide & Conquer
(diagram: data split into chunks; an error affects only one chunk)
Divide & Conquer & Retry
(diagram: failed chunks are retried independently until they succeed)
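The retry idea sketched on these slides can be written down directly; chunking, the retry limit, the wait times, and the upload block are all assumptions for illustration, not Fluentd's actual implementation:

```ruby
# Divide & conquer & retry: each chunk is sent independently, so one
# failure re-sends only that chunk, with exponentially growing waits.
def upload_chunks(chunks, max_retries: 3, base_wait: 1)
  chunks.each do |chunk|
    tries = 0
    begin
      yield chunk                        # the actual upload step
    rescue
      tries += 1
      raise if tries > max_retries       # give up: surface the error
      sleep(base_wait * 2**(tries - 1))  # exponential wait: 1s, 2s, 4s
      retry
    end
  end
end
```

Contrast with the poor man's approach, where one bad byte in a 10GB file restarts the whole job.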
Streaming
(diagram: don't handle big files on the collecting side - stream small pieces and do the heavy work on the storage side)
Apache Flume and Fluentd
Apache Flume
Apache Flume

(diagram: access logs, app logs, system logs, ... -> Agents -> Collectors)
Apache Flume - network topology

Flume OG: Agents -> Collectors, coordinated by a central Master (send and ack travel on separate paths)
Flume NG: Agents -> Collectors with direct send/ack
Apache Flume - pipeline

Flume OG: Source -> Sink
Flume NG: Source -> Channel -> Sink (each stage is a plugin)
Apache Flume - configuration

(diagram: in Flume NG, a Master node manages all configuration for Agents and Collectors; the Master is optional)
Apache Flume - configuration

```
# source
host1.sources = avro-source1
host1.sources.avro-source1.type = avro
host1.sources.avro-source1.bind = 0.0.0.0
host1.sources.avro-source1.port = 41414
host1.sources.avro-source1.channels = ch1

# channel
host1.channels = ch_avro_log
host1.channels.ch_avro_log.type = memory

# sink
host1.sinks = log-sink1
host1.sinks.log-sink1.type = logger
host1.sinks.log-sink1.channel = ch1
```
Fluentd
Fluentd - network topology

Fluentd: fluentd -> fluentd, with send/ack (every node runs the same fluentd daemon)
Flume NG: Agents -> Collectors, with send/ack
Fluentd - pipeline

Fluentd: Input -> Buffer -> Output (each stage is a plugin)
Flume NG: Source -> Channel -> Sink
Fluentd - configuration

No central node - keep things simple.
Use chef, puppet, etc. for configuration (they do things better).
Fluentd - configuration

```
<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>
```
Fluentd - configuration (compared with Flume NG)

```
<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>
```

The equivalent Flume NG configuration:

```
# source
host1.sources = avro-source1
host1.sources.avro-source1.type = avro
host1.sources.avro-source1.bind = 0.0.0.0
host1.sources.avro-source1.port = 41414
host1.sources.avro-source1.channels = ch1

# channel
host1.channels = ch_avro_log
host1.channels.ch_avro_log.type = memory

# sink
host1.sinks = log-sink1
host1.sinks.log-sink1.type = logger
host1.sinks.log-sink1.channel = ch1
```
Fluentd - Users
Fluentd - plugin distribution platform
$ fluent-gem search -rd fluent-plugin
$ fluent-gem install fluent-plugin-mongo
94 plugins!
Concept of Fluentd
Customization is essential
> small core + many plugins

Fluentd core helps to implement plugins
> common features are already implemented
Fluentd core: divide & conquer, retrying, parallelization, error handling, message routing
Plugins: read / receive data, write / send data
Fluentd plugins
in_tail

(diagram: apache writes access.log; fluentd reads it with in_tail)

✓ read a log file
✓ custom regexp
✓ custom parser in Ruby
out_mongo

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> out_mongo)
out_mongo

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> MongoDB)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
out_s3

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> Amazon S3)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time:
  2013-01-01/01/access.log.gz
  2013-01-01/02/access.log.gz
  2013-01-01/03/access.log.gz
  ...
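The time-sliced file names above come from formatting each event's timestamp; a minimal sketch of hourly slicing (the path template is copied from the example):

```ruby
# Map an event timestamp to its hourly slice, matching the
# "2013-01-01/01/access.log.gz" layout shown above.
def slice_path(time)
  time.strftime("%Y-%m-%d/%H/access.log.gz")
end

slice_path(Time.utc(2013, 1, 1, 1))  # => "2013-01-01/01/access.log.gz"
```

Because the slice key depends only on the timestamp, buffered events land in the right file no matter when they are flushed.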
out_hdfs

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> HDFS)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time:
  2013-01-01/01/access.log.gz
  2013-01-01/02/access.log.gz
  2013-01-01/03/access.log.gz
  ...
✓ custom text formatter
out_hdfs

(diagram: apache -> access.log -> in_tail -> fluentd buffer -> multiple fluentd nodes -> HDFS)

✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time
✓ automatic fail-over
✓ load balancing
Fluentd examples
Fluentd at Treasure Data - REST API logs

(diagram: Rails apps on the API servers log via fluent-logger-ruby + in_forward to a local fluentd; each fluentd uses out_forward to send to the watch server)
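An event traveling through in_forward/out_forward is a (tag, time, record) triple. A toy encoder, using JSON in place of the MessagePack framing that fluentd actually uses, so the sketch runs with the standard library only:

```ruby
require 'json'

# A fluentd event is a (tag, time, record) triple. fluent-logger-ruby
# ships these to in_forward as MessagePack; JSON stands in here so the
# sketch needs only the standard library.
def encode_event(tag, time, record)
  [tag, time.to_i, record].to_json
end

encode_event("myapp.api", Time.utc(2013, 1, 1), "user" => 1)
# => "[\"myapp.api\",1356998400,{\"user\":1}]"
```

The tag ("myapp.api" here is an invented example) is what `<match>` patterns route on.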
Fluentd at Treasure Data - backend logs

(diagram: Rails apps on the API servers and Ruby apps on the worker servers log via fluent-logger-ruby + in_forward to local fluentd processes; all use out_forward to send to a fluentd on the watch server)
Fluentd at Treasure Data - monitoring

(diagram: API servers, worker servers, and the queue (PerfectQueue) are monitored; a script read via in_exec feeds fluentd, which sends the results with out_forward to the watch server)
Fluentd at Treasure Data - Hadoop logs

(diagram: on the watch server, a script run via in_exec calls the Hadoop JobTracker's thrift API)

✓ resource consumption statistics for each user
✓ capacity monitoring
Fluentd at Treasure Data - store & analyze

(diagram: the watch server's fluentd writes with out_tdlog to Treasure Data for historical analysis, and with out_metricsense to Librato Metrics for realtime analysis)

✓ streaming aggregation
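Streaming aggregation means folding events into running counters as they arrive instead of re-scanning stored logs later; a toy per-minute counter (the window size and the (unix_time, record) event shape are assumptions):

```ruby
# Toy streaming aggregation: fold events into per-minute counters as
# they arrive instead of re-scanning stored logs.
def aggregate(events)
  counts = Hash.new(0)
  events.each do |time, _record|
    window = Time.at(time).utc.strftime("%Y-%m-%d %H:%M")
    counts[window] += 1
  end
  counts
end
```

Three events at t, t+10s, and t+70s yield two counts in the first minute window and one in the second.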
Plugin development
```ruby
class SomeInput < Fluent::Input
  Fluent::Plugin.register_input('myin', self)

  config_param :tag, :string

  def start
    Thread.new {
      while true
        time = Engine.now
        record = {"user" => 1, "size" => 1}
        Engine.emit(@tag, time, record)
      end
    }
  end

  def shutdown
    ...
  end
end
```

```
<source>
  type myin
  tag myapp.api.heartbeat
</source>
```
```ruby
class SomeOutput < Fluent::BufferedOutput
  Fluent::Plugin.register_output('myout', self)

  config_param :myparam, :string

  def format(tag, time, record)
    [tag, time, record].to_json + "\n"
  end

  def write(chunk)
    puts chunk.read
  end
end
```

```
<match **>
  type myout
  myparam foobar
</match>
```
```ruby
class MyTailInput < Fluent::TailInput
  Fluent::Plugin.register_input('mytail', self)

  def configure_parser(conf)
    ...
  end

  def parse_line(line)
    array = line.split("\t")
    time = Engine.now
    record = {"user" => array[0], "item" => array[1]}
    return time, record
  end
end
```

```
<source>
  type mytail
</source>
```
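The parse_line logic above can be tried on its own outside fluentd; a standalone version with Engine.now replaced by an explicit timestamp argument (the sample line is invented):

```ruby
# Standalone version of the mytail parse_line above: tab-separated
# user/item fields, with Engine.now replaced by an explicit argument.
def parse_line(line, now = Time.now.to_i)
  array = line.chomp.split("\t")
  record = {"user" => array[0], "item" => array[1]}
  [now, record]
end

parse_line("alice\tbook-42\n", 1356998400)
# => [1356998400, {"user"=>"alice", "item"=>"book-42"}]
```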
Fluentd v11
Error stream
Streaming processing
Better DSL
Multiprocess