hydra - getting started

Hydra - A Practical Introduction

Big Data DC - @bigdatadc Matt Abrams - @abramsm

March 4th 2013

Agenda

• What is Hydra?

• Sample Data and Analysis Questions

• Getting started with a local Hydra dev environment

• Hydra’s Key Concepts

• Creating your first Hydra job

• Putting it all together

Hydra’s Goals

• Support Streaming and Batch Processing

• Massive Scalability

• Fault tolerant by design (bend but do not break)

• Incremental Data Processing

• Full stack operational support

• Command and Control

• Alerting

• Resource Management

• Data/Task Rebalancing

• Data replication and Backup

What Exactly is Hydra?

• File System

• Data Processing

• Query System

• Job/Cluster Management

• Operational Alerting

• Open Source

Hydra - Terms

• Job: a process for processing data

• Task: a processing component of a job. A job can have one to n tasks

• Node: a logical unit of processing capacity available to a cluster

• Minion: management process that runs on cluster nodes. Acts as gatekeeper for controlling task processes

• Spawn: Cluster management controller and UI

Hydra Cluster

Our Sample Data (Log-Synth)

3.535, 5214d63bab95687d, 166.144.203.186, "the then good"

3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"

4.206, 5dbd9451948ad895, 88.120.153.226, "to"

4.673, b967d99cad0b3e60, 88.120.153.226, "seven"

4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
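
The five sample rows above can be loaded for local experimentation with a few lines of Python. This is only a sketch: the field names (time, uid, ip, terms) are assumptions chosen for illustration, since the slides do not name the columns.

```python
import csv
from io import StringIO
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Field names are hypothetical; the slides show the values only.
    time: float
    uid: str
    ip: str
    terms: str


SAMPLE = """\
3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
"""


def parse(text: str) -> list[LogRecord]:
    """Parse comma-separated rows, allowing a space after each comma."""
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    records = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        time, uid, ip, terms = row
        records.append(LogRecord(float(time), uid, ip, terms))
    return records


records = parse(SAMPLE)
```

Each row becomes a typed record, so later steps can refer to `record.ip` or `record.uid` instead of positional indexes.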

What do we want to know?

• What are the top IP addresses by request count?

• What are the top IP addresses by unique user count?

• What are the most common search terms?

• What are the most common search terms in the slowest 5% of queries?

• What are the daily numbers of unique searches, unique users, and unique IP addresses, and the distribution of response times (all approximate)?

Setting up Hydra’s Local Stack

Vagrant

• $ vagrant init precise32 http://files.vagrantup.com/precise32.box

• // add: config.vm.network :forwarded_port, guest: 5052, host: 5052 to your Vagrantfile

• $ vagrant up

• $ vagrant ssh


Java 7

• $ sudo apt-get update

• $ sudo apt-get install python-software-properties

• $ sudo add-apt-repository ppa:webupd8team/java

• $ sudo apt-get update

• $ sudo apt-get install oracle-java7-installer

RabbitMQ, Maven, Git, Make

• $ sudo apt-get install rabbitmq-server

• $ sudo apt-get install maven

• $ sudo apt-get install git

• $ sudo apt-get install make

Copy on Write

• $ wget http://xmailserver.org/fl-cow-0.10.tar.gz

• $ tar zxvf fl-cow-0.10.tar.gz

• $ cd fl-cow-0.10

• $ ./configure --prefix=/usr

• $ make; make check

• $ sudo make install

• $ export LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD

Hydra

• $ git clone https://github.com/addthis/hydra.git

• $ cd hydra; mvn clean -Pbdbje package

• $ ./hydra-uber/bin/local-stack.sh start

• $ ./hydra-uber/bin/local-stack.sh start

• $ ./hydra-uber/bin/local-stack.sh seed


Stage Sample Data in Stream Directory

• $ mkdir ~/hydra/hydra-local/streams/log-synth

• $ cp $YOUR_SAMPLE_DATA_DIR ~/hydra/hydra-local/streams/log-synth

Pipes and Filters

BundleFilters

• Return true or false

• Operate on entire rows

• Add/Remove columns

• Edit column values

• May include a call to a ValueFilter

ValueFilters

• Operate on single column values

• Return a value or null

• No visibility to the full row

• Often take input from a BundleFilter

BundleFilter - Chain

// chain of bundle filters
{"op":"chain", "filter":[
  // LIST OF BUNDLE FILTERS ….
]}

BundleFilter - Existence

// false if UID column is null
{"op":"field", "from":"UID"},

BundleFilter - Concatenation

// joins FOO and BAR
// Stores output in new column "OUTPUT"
{"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},

BundleFilter - Equality Testing

// FIELD_ONE == FIELD_TWO
{"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
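
The four bundle filters above can be modelled in plain Python to make their row-level behaviour concrete. This is a sketch under stated assumptions, not Hydra's implementation: a row is modelled as a dict, concat joins with an empty separator, and chain stops at the first filter that returns false. All function names here are illustrative.

```python
from typing import Callable

Bundle = dict
BundleFilter = Callable[[Bundle], bool]


def field(column: str) -> BundleFilter:
    """Existence test: false when the column is missing or null."""
    return lambda bundle: bundle.get(column) is not None


def concat(inputs: list[str], out: str) -> BundleFilter:
    """Join the input columns and store the result in a new column."""
    def apply(bundle: Bundle) -> bool:
        if any(bundle.get(c) is None for c in inputs):
            return False  # assumption: a missing input fails the filter
        bundle[out] = "".join(str(bundle[c]) for c in inputs)
        return True
    return apply


def equals(left: str, right: str) -> BundleFilter:
    """True when both columns hold the same value."""
    return lambda bundle: bundle.get(left) == bundle.get(right)


def chain(filters: list[BundleFilter]) -> BundleFilter:
    """Run filters in order; all() short-circuits on the first false."""
    return lambda bundle: all(f(bundle) for f in filters)


row = {"UID": "u1", "FOO": "a", "BAR": "b"}
pipeline = chain([field("UID"), concat(["FOO", "BAR"], "OUTPUT")])
kept = pipeline(row)
```

Because every bundle filter returns true or false, a chain doubles as a row filter: a row without a UID is rejected before concat ever runs.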

BundleFilter - Math!

// DUR = Math.round((end-start)/1000)
{"op":"num", "columns":["END", "START", "DUR"], "define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}

Stack Math - Sample Data

C0, START_TIME    C1, END_TIME

100,234           200,468

Stack Math

c0,c1,sub,v1000,ddiv,toint,v2,set

• sub: 200,468 - 100,234 = 100,234

• v1000, ddiv: 100,234 / 1000 = 100.234

• toint: 100.234 becomes 100

• v2, set: the result is stored in column 2 (DURATION)

Stack Math - Sample Result

C0, START_TIME    C1, END_TIME    C2, DURATION

100,234           200,468         100
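
The stack walk-through can be reproduced with a small postfix evaluator. This is a sketch of the idea rather than Hydra's num filter: it assumes cN pushes column N, vN pushes the literal N, sub and ddiv compute "second from top, then top", toint truncates, and set pops a column index followed by the value to store. Columns are passed in the order the filter lists them (END, START, DUR), and the values 200,468 and 100,234 are written as plain integers.

```python
def stack_math(define: str, columns: list) -> list:
    """Evaluate a comma-separated postfix program against one row."""
    row = list(columns)
    stack = []
    for op in define.split(","):
        if op[0] == "c" and op[1:].isdigit():
            stack.append(row[int(op[1:])])   # push a column value
        elif op[0] == "v" and op[1:].isdigit():
            stack.append(int(op[1:]))        # push a literal
        elif op == "sub":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        elif op == "ddiv":
            b, a = stack.pop(), stack.pop()
            stack.append(a / b)              # floating-point division
        elif op == "toint":
            stack.append(int(stack.pop()))   # assumption: truncation
        elif op == "set":
            index = stack.pop()
            row[index] = stack.pop()         # store result in a column
        else:
            raise ValueError(f"unsupported op: {op}")
    return row


result = stack_math("c0,c1,sub,v1000,ddiv,toint,v2,set", [200468, 100234, None])
```

The slide's comment mentions Math.round while the walk-through shows 100.234 becoming 100; for this input truncation and rounding agree, so the sketch does not settle which one Hydra applies.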

ValueFilter - Glob

{from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}}

• BundleFilter: the outer {from: ...} wrapper

• ValueFilter: the inner glob filter

ValueFilter - Chain, Split, Index

{op:"field", from:"LIST", filter:
  {op:"chain", filter:[
    {op:"split", split:"="},
    {op:"index", index:0}
  ]}},

• ValueFilter: the chain

• ValueFilter(s): split and index inside the chain
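
A rough Python model of the value filters above shows how a bundle filter hands a single column value to a value filter and writes the result back. The assumptions are loud ones: glob is approximated with Python's fnmatch rules, a null result from any step ends the chain, and the field wrapper returns false when the value filter yields null. None of this is taken from Hydra's source.

```python
from fnmatch import fnmatchcase
from typing import Any, Callable, Optional

ValueFilter = Callable[[Any], Optional[Any]]


def glob(pattern: str) -> ValueFilter:
    """Pass the value through when it matches the pattern, else null."""
    return lambda v: v if isinstance(v, str) and fnmatchcase(v, pattern) else None


def split(sep: str) -> ValueFilter:
    """Turn a string into a list of parts."""
    return lambda v: None if v is None else v.split(sep)


def index(i: int) -> ValueFilter:
    """Pick one element of a list, or null when out of range."""
    return lambda v: v[i] if v is not None and -len(v) <= i < len(v) else None


def chain(filters: list[ValueFilter]) -> ValueFilter:
    """Feed the output of each value filter into the next."""
    def apply(value: Any) -> Optional[Any]:
        for f in filters:
            value = f(value)
            if value is None:
                return None
        return value
    return apply


def field(source: str, value_filter: ValueFilter) -> Callable[[dict], bool]:
    """Bundle filter: apply a value filter to one column of the row."""
    def apply(bundle: dict) -> bool:
        result = value_filter(bundle.get(source))
        if result is None:
            return False
        bundle[source] = result
        return True
    return apply


row = {"LIST": "key=value", "SOURCE": "Log_42"}
took_key = field("LIST", chain([split("="), index(0)]))(row)
is_log = field("SOURCE", glob("Log_[0-9]*"))(row)
```

The value filters never see the row; only the field wrapper knows which column the value came from.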

Data Attachments

Data Attachments are Hydra’s Secret Weapon

• Top-K Estimator

• Cardinality Estimation (HyperLogLog Plus)

• Quantile Estimation (Q-Digest, T-Digest)

• Bloom Filters

• Multiset streaming summarization (CountMin Sketch)

Data Attachment Example

A single node that tracks the top 1000 unique search terms, the distinct count of UIDs, and provides quantile estimation for the query time
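
To show what such a node summarises, here is a sketch that keeps the same three answers using exact Python structures. It is a stand-in, not how Hydra stores attachments: a Counter replaces the Top-K estimator, a set replaces HyperLogLog Plus, and a sorted list with nearest-rank lookup replaces the digest. The class and method names are hypothetical.

```python
import math
from collections import Counter


class NodeSummary:
    """Exact stand-ins for top-K, distinct count, and quantiles."""

    def __init__(self, top_k: int = 1000) -> None:
        self.top_k = top_k
        self.term_counts = Counter()
        self.uids = set()
        self.times = []

    def add(self, uid: str, term: str, query_time: float) -> None:
        self.term_counts[term] += 1
        self.uids.add(uid)
        self.times.append(query_time)

    def top_terms(self) -> list:
        return self.term_counts.most_common(self.top_k)

    def uid_count(self) -> int:
        return len(self.uids)

    def quantile(self, q: float) -> float:
        """Nearest-rank quantile over everything seen so far."""
        if not self.times:
            raise ValueError("no data")
        ordered = sorted(self.times)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


node = NodeSummary()
for time, uid, term in [
    (3.535, "5214d63bab95687d", "the then good"),
    (3.568, "5dbd9451948ad895", "know boys"),
    (4.206, "5dbd9451948ad895", "to"),
    (4.673, "b967d99cad0b3e60", "seven"),
    (4.900, "bd0d760fbb338955", "did local it"),
]:
    node.add(uid, term, time)
```

The exact versions grow with the data, which is the cost the estimators listed earlier are designed to avoid: they trade a small error for bounded memory.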

Putting it All Together

Job Structure

• Jobs have three sections

• Source

• Map

• Output

Source

• Defines the properties of the input data set

• Several built-in source types:

• Mesh

• Local File System

• Kafka

Map

• Select fields from input record to process

• Apply filters to rows and columns

• Drop or expand rows

Output - Tree

• Output(s) can be trees or data files

• Trees represent data aggregations that can be queried

• File Output Targets

• File System

• Cassandra

• HDFS
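
The three job sections can be pictured as a small pipeline. The sketch below is plain Python rather than a Hydra job definition: source reads raw lines, the map step names the fields and drops rows without a UID, and the output builds a nested count tree shaped like root/byip/ip/uid. The field names and tree shape are assumptions made for illustration.

```python
from collections import defaultdict

LINES = [
    '3.535, 5214d63bab95687d, 166.144.203.186, "the then good"',
    '3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"',
    '4.206, 5dbd9451948ad895, 88.120.153.226, "to"',
    '4.673, b967d99cad0b3e60, 88.120.153.226, "seven"',
    '4.900, bd0d760fbb338955, 166.144.203.186, "did local it"',
]

FIELDS = ("TIME", "UID", "IP", "TERMS")  # hypothetical column names


def source(lines):
    """Source: turn each input line into a list of raw values."""
    for line in lines:
        yield [part.strip().strip('"') for part in line.split(",")]


def map_rows(rows):
    """Map: name the fields and drop rows that have no UID."""
    for row in rows:
        bundle = dict(zip(FIELDS, row))
        if bundle.get("UID"):
            yield bundle


def build_tree(bundles):
    """Output: aggregate hit counts under ip, then uid."""
    tree = defaultdict(lambda: defaultdict(int))
    for bundle in bundles:
        tree[bundle["IP"]][bundle["UID"]] += 1
    return tree


tree = build_tree(map_rows(source(LINES)))
```

Once the tree exists, a question such as "how many distinct users per IP" becomes a walk over its second level.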

Let's Put It All Together

Create Hydra Job

Run Job

What are the top IP Addresses By Record Count?

• Exact

• path: root/byip/+:+hits

• ops: gather=ks;sort=1:n:d;limit=100

• Approximate

• path: root/byip/+$+uidcount
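
The exact query's ops string, gather=ks;sort=1:n:d;limit=100, can be read as three table operations. The sketch below encodes one reading of it, which is an assumption rather than a statement of Hydra's semantics: k groups by column 0, s sums column 1, sort=1:n:d orders by column 1 numerically in descending order, and limit keeps the first 100 rows.

```python
from collections import defaultdict


def gather_ks(rows):
    """Group by the key column and sum the hit counts."""
    totals = defaultdict(int)
    for key, hits in rows:
        totals[key] += hits
    return list(totals.items())


def sort_numeric_desc(rows, column):
    """Sort by one column, numerically, largest first."""
    return sorted(rows, key=lambda r: float(r[column]), reverse=True)


def limit(rows, n):
    """Keep only the first n rows."""
    return rows[:n]


# One (ip, hits) row per sample log line.
rows = [
    ("166.144.203.186", 1),
    ("88.120.153.226", 1),
    ("88.120.153.226", 1),
    ("88.120.153.226", 1),
    ("166.144.203.186", 1),
]

top_ips = limit(sort_numeric_desc(gather_ks(rows), 1), 100)
```

On the five sample rows this ranks 88.120.153.226 first with three requests.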


What are the top IPs by unique user count?

• Exact

• path: root/byip/+/+

• ops: gather=kk;sort=0;gather=ku;sort=1:n:d

• Approximate

• path: root/byip/+$+uidcount


What are the search terms for the slowest 5%?

• First get the 95th percentile query time

• path: /root$+timeDigest=quantile(.95)

• ops: num=c0,toint,v0,set;gather=a

• Now find all queries slower than the 95th percentile

• path: /root/bytime/+/+:+hits

• ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
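
The two-step query can be mirrored in Python: first compute the 95th percentile of the query time, then count the search terms at or above it. This is a sketch on the five sample rows, assuming the first column is the query time and using an exact nearest-rank percentile in place of the digest estimate.

```python
import math
from collections import Counter

ROWS = [
    (3.535, "the then good"),
    (3.568, "know boys"),
    (4.206, "to"),
    (4.673, "seven"),
    (4.900, "did local it"),
]


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Step 1: the threshold that separates the slowest 5%.
threshold = percentile([time for time, _ in ROWS], 0.95)

# Step 2: count terms for queries at or above the threshold.
slow_terms = Counter(terms for time, terms in ROWS if time >= threshold)
```

Splitting the work this way matches the slide: the threshold from the first query (950 in the slide's ops) is pasted into the second query's filter.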

Daily Unique Searches, Users, IPs and distribution of response times?

• Query Path:

• root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits

• Ops:

• gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999

• Remote Ops:

• num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;

But yeah, I could do that with CLI!

Related Open Source Projects

• Meshy - https://github.com/addthis/meshy

• Codec - https://github.com/addthis/codec

• Muxy - https://github.com/addthis/muxy

• Bundle - https://github.com/addthis/bundle

• Basis - https://github.com/addthis/basis

• Column Compressor - https://github.com/addthis/columncompressor

• Cluster Boot Service - https://github.com/stewartoallen/cbs


Helpful Resources

• Hydra - https://github.com/addthis/hydra

• Hydra User Reference - http://oss-docs.addthiscode.net/hydra/latest/user-reference/

• Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/

• IRC - #hydra

• Mailing List - https://groups.google.com/forum/#!forum/hydra-oss

