hydra - getting started

Hydra - A Practical Introduction Big Data DC - @bigdatadc Matt Abrams - @abramsm March 4th 2013

Upload: abramsm

Post on 25-Dec-2014


DESCRIPTION

Getting started with Hydra

TRANSCRIPT

Page 1: Hydra - Getting Started

Hydra - A Practical Introduction

Big Data DC - @bigdatadc Matt Abrams - @abramsm

March 4th 2013

Page 2: Hydra - Getting Started
Page 3: Hydra - Getting Started

Agenda

• What is Hydra?

• Sample Data and Analysis Questions

• Getting started with a local Hydra dev environment

• Hydra’s Key Concepts

• Creating your first Hydra job

• Putting it all together

Page 4: Hydra - Getting Started

Hydra’s Goals

• Support Streaming and Batch Processing

• Massive Scalability

• Fault tolerant by design (bend but do not break)

• Incremental Data Processing

• Full stack operational support

• Command and Control

• Alerting

• Resource Management

• Data/Task Rebalancing

• Data replication and Backup

Page 5: Hydra - Getting Started

What Exactly is Hydra?

• File System

• Data Processing

• Query System

• Job/Cluster Management

• Operational Alerting

• Open Source

Page 6: Hydra - Getting Started

Hydra - Terms

• Job: a process for processing data

• Task: a processing component of a job. A job can have one to n tasks

• Node: a logical unit of processing capacity available to a cluster

• Minion: management process that runs on cluster nodes. Acts as gatekeeper for controlling task processes

• Spawn: Cluster management controller and UI

Page 7: Hydra - Getting Started

Hydra Cluster

Page 8: Hydra - Getting Started

Our Sample Data (Log-Synth)

3.535,  5214d63bab95687d,  166.144.203.186,  "the  then  good"  

3.568,  5dbd9451948ad895,  88.120.153.226,  "know  boys"  

4.206,  5dbd9451948ad895,  88.120.153.226,  "to"  

4.673,  b967d99cad0b3e60,  88.120.153.226,  "seven"  

4.900,  bd0d760fbb338955,  166.144.203.186,  "did  local  it"
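As a minimal sketch, the sample rows above can be parsed into records with Python's standard csv module. The field names TIME, UID, IP and TERMS are assumptions made for illustration; the slide does not name the columns, and the first column is treated here as the time value the deck later queries on.

```python
import csv
import io

# The five sample rows from the slide.
SAMPLE = '''3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
'''


def parse(text):
    """Parse log-synth style lines into dicts (field names are assumptions)."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    records = []
    for time, uid, ip, terms in reader:
        records.append({
            "TIME": float(time),
            "UID": uid,
            "IP": ip,
            "TERMS": terms.split(),  # search terms are whitespace separated
        })
    return records


records = parse(SAMPLE)
```

Each record then carries the four values the analysis questions refer to: a time, a user ID, an IP address and a list of search terms.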

Page 9: Hydra - Getting Started

What do we want to know?

• What are the top IP addresses by request count?

• What are the top IP addresses by unique user count?

• What are the most common search terms?

• What are the most common search terms in the slowest 5% of queries?

• What are the daily numbers of unique searches, unique users, and unique IP addresses, and the daily distribution of response times (all approximate)?

Page 10: Hydra - Getting Started

Setting up Hydra’s Local Stack

Page 11: Hydra - Getting Started

Vagrant

• $ vagrant init precise32 http://files.vagrantup.com/precise32.box

• // add: config.vm.network :forwarded_port, guest: 5052, host: 5052 to your Vagrantfile

• $ vagrant up

• $ vagrant ssh

Page 12: Hydra - Getting Started

Java7

• $ sudo apt-get update

• $ sudo apt-get install python-software-properties

• $ sudo add-apt-repository ppa:webupd8team/java

• $ sudo apt-get update

• $ sudo apt-get install oracle-java7-installer

Page 13: Hydra - Getting Started

RabbitMQ, Maven, Git, Make

• $ sudo apt-get install rabbitmq-server

• $ sudo apt-get install maven

• $ sudo apt-get install git

• $ sudo apt-get install make

Page 14: Hydra - Getting Started

Copy on Write

• $ wget http://xmailserver.org/fl-cow-0.10.tar.gz

• $ tar zxvf fl-cow-0.10.tar.gz

• $ cd fl-cow-0.10

• $ ./configure --prefix=/usr

• $ make; make check

• $ sudo make install

• $ export LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD

Page 15: Hydra - Getting Started

Hydra

• $ git clone https://github.com/addthis/hydra.git

• $ cd hydra; mvn clean -Pbdbje package

• $ ./hydra-uber/bin/local-stack.sh start

• $ ./hydra-uber/bin/local-stack.sh start

• $ ./hydra-uber/bin/local-stack.sh seed

Page 16: Hydra - Getting Started

Stage Sample Data in Stream Directory

• $ mkdir ~/hydra/hydra-local/streams/log-synth

• $ cp $YOUR_SAMPLE_DATA_DIR/* ~/hydra/hydra-local/streams/log-synth/

Page 17: Hydra - Getting Started

Pipes and Filters

Page 18: Hydra - Getting Started

BundleFilters

• Return true or false

• Operate on entire rows

• Add/Remove columns

• Edit Column Values

• May include a call to ValueFilter

ValueFilters

• Operate on single column values

• Return a value or null

• No visibility to full row

• Often take input from BundleFilter
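The split between the two filter kinds can be sketched in Python as two function shapes. This is an illustration of the contract described on the slide, not Hydra's actual Java interfaces; the names trim and on_column are hypothetical.

```python
from typing import Callable, Dict, Optional

Bundle = Dict[str, Optional[str]]                      # one row: column -> value
ValueFilter = Callable[[Optional[str]], Optional[str]]
BundleFilter = Callable[[Bundle], bool]


def trim(value: Optional[str]) -> Optional[str]:
    """ValueFilter: sees a single value and returns a value or None."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def on_column(column: str, value_filter: ValueFilter) -> BundleFilter:
    """BundleFilter: sees the whole row and hands one column to a ValueFilter."""
    def apply(bundle: Bundle) -> bool:
        bundle[column] = value_filter(bundle.get(column))
        return bundle[column] is not None
    return apply


row = {"UID": "  5214d63bab95687d ", "TERMS": "   "}
keep_uid = on_column("UID", trim)(row)      # UID is trimmed in place
keep_terms = on_column("TERMS", trim)(row)  # TERMS becomes None
```

The value filter never sees the row, only the value it is handed, while the bundle filter decides which column to read and write and whether the row passes.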

Page 19: Hydra - Getting Started

BundleFilter - Chain

// chain of bundle filters
{"op":"chain", "filter":[
  // LIST OF BUNDLE
  // FILTERS ….
]}

Page 20: Hydra - Getting Started

BundleFilter - Existence

// false if UID column is null
{"op":"field", "from":"UID"},

Page 21: Hydra - Getting Started

BundleFilter - Concatenation

// joins FOO and BAR
// Stores output in new column "OUTPUT"
{"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},

Page 22: Hydra - Getting Started

BundleFilter - Equality Testing

// FIELD_ONE == FIELD_TWO
{"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
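To make the behaviour of the chain, field, concat and equals filters concrete, here is a small Python interpreter for specs shaped like the ones on these slides. It is a sketch under assumptions: the short-circuit behaviour of a chain and the handling of missing columns are guesses, and apply_filter is a hypothetical helper, not part of Hydra.

```python
def apply_filter(spec, bundle):
    """Apply a bundle-filter spec (a dict) to a row (a dict).

    Returns True when the row passes the filter.
    """
    op = spec["op"]
    if op == "chain":
        # Assumption: a chain stops at the first filter that returns False.
        return all(apply_filter(child, bundle) for child in spec["filter"])
    if op == "field":
        # False if the column is missing or null.
        return bundle.get(spec["from"]) is not None
    if op == "concat":
        # Assumption: missing input columns contribute an empty string.
        bundle[spec["out"]] = "".join(str(bundle.get(name, "")) for name in spec["in"])
        return True
    if op == "equals":
        return bundle.get(spec["left"]) == bundle.get(spec["right"])
    raise ValueError(f"unsupported op: {op}")


job_filter = {"op": "chain", "filter": [
    {"op": "field", "from": "UID"},
    {"op": "concat", "in": ["FOO", "BAR"], "out": "OUTPUT"},
    {"op": "equals", "left": "FIELD_ONE", "right": "FIELD_TWO"},
]}

row = {"UID": "u1", "FOO": "a", "BAR": "b", "FIELD_ONE": "x", "FIELD_TWO": "x"}
kept = apply_filter(job_filter, row)

no_uid = {"FOO": "a", "BAR": "b"}
dropped = apply_filter(job_filter, no_uid)
```

In this sketch the row without a UID fails the first filter, so the concat step never runs and no OUTPUT column is added to it.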

Page 23: Hydra - Getting Started

BundleFilter - Math

// DUR = Math.round((end-start)/1000)
{"op":"num", "columns":["END", "START", "DUR"], "define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}

Page 24: Hydra - Getting Started

Stack Math - Sample Data

C0, START_TIME    C1, END_TIME

100,234           200,468

Page 25: Hydra - Getting Started

Stack Math

c0,c1,sub,v1000,ddiv,toint,v2,set

Stack: 100,234 and 200,468

sub: 200,468 - 100,234 = 100,234

Page 26: Hydra - Getting Started

Stack Math

c0,c1,sub,v1000,ddiv,toint,v2,set

Stack: 100,234 and 1000

ddiv: 100,234 / 1000 = 100.234

Page 27: Hydra - Getting Started

Stack Math

c0,c1,sub,v1000,ddiv,toint,v2,set

Stack: 100.234

toint: 100.234 becomes 100

Page 28: Hydra - Getting Started

Stack Math - Sample Result

C0, START_TIME    C1, END_TIME    C2, DURATION

100,234           200,468         100
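The stack walk-through can be replayed with a tiny Python evaluator. This is a sketch of the idea, not Hydra's implementation. It assumes that cN pushes column N, vN pushes the literal N, binary operators compute "value below the top, operator, top", toint truncates, and set pops a column index and then the value to store. The row follows the column order in the filter definition (END, START, DUR); the sample table labels C0 as START_TIME, so treat the operand order as an assumption.

```python
def run_num(define, row):
    """Evaluate a comma-separated stack program against a list-based row."""
    stack = []
    for token in define.split(","):
        if token == "sub":
            top = stack.pop()
            below = stack.pop()
            stack.append(below - top)
        elif token == "ddiv":
            top = stack.pop()
            below = stack.pop()
            stack.append(below / top)      # floating-point division
        elif token == "toint":
            stack.append(int(stack.pop()))  # assumption: truncates
        elif token == "set":
            index = int(stack.pop())        # column index was pushed last
            row[index] = stack.pop()
        elif token.startswith("c"):
            stack.append(row[int(token[1:])])
        elif token.startswith("v"):
            stack.append(float(token[1:]))
        else:
            raise ValueError(f"unsupported token: {token}")
    return row


# Columns: END, START, DUR (values from the sample slide, without separators).
result = run_num("c0,c1,sub,v1000,ddiv,toint,v2,set", [200468, 100234, None])
```

Tracing the program by hand gives the same intermediate values as the slides: 100234 after sub, 100.234 after ddiv, and 100 after toint, which set then stores in column 2.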

Page 29: Hydra - Getting Started

ValueFilter - Glob

// outer object: BundleFilter, inner "filter" object: ValueFilter
{from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}}

Page 30: Hydra - Getting Started

ValueFilter - Chain, Split, Index

// the chain is a ValueFilter made up of other ValueFilter(s)
{op:"field", from:"LIST", filter:
  {op:"chain", filter:[
    {op:"split", split:"="},
    {op:"index", index:0}
  ]}},
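The glob, split, index and chain value filters can be approximated in Python as functions that take one value and return a value or None. This is a sketch under assumptions: Hydra's glob syntax is taken to behave like shell-style matching (Python's fnmatch), and a None result is assumed to end a chain early.

```python
from fnmatch import fnmatchcase


def glob(pattern):
    """Keep the value only if it matches a shell-style pattern."""
    def apply(value):
        return value if value is not None and fnmatchcase(value, pattern) else None
    return apply


def split(separator):
    """Turn a string into a list of parts."""
    def apply(value):
        return value.split(separator) if value is not None else None
    return apply


def index(position):
    """Pick one element from a list, or None if it is out of range."""
    def apply(values):
        if values is None or not -len(values) <= position < len(values):
            return None
        return values[position]
    return apply


def chain(*filters):
    """Feed the output of each value filter into the next."""
    def apply(value):
        for value_filter in filters:
            value = value_filter(value)
            if value is None:  # assumption: a null short-circuits the chain
                return None
        return value
    return apply


first_part = chain(split("="), index(0))
is_log = glob("Log_[0-9]*")
```

For example, first_part("color=red") yields "color", and is_log("Log_42") returns the value unchanged while is_log("Log_x") returns None.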

Page 31: Hydra - Getting Started

Data Attachments

Page 32: Hydra - Getting Started

Data Attachments are Hydra’s Secret Weapon

• Top-K Estimator

• Cardinality Estimation (HyperLogLog Plus)

• Quantile Estimation (Q-Digest, T-Digest)

• Bloom Filters

• Multiset streaming summarization (CountMin Sketch)
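To give a feel for how these structures trade accuracy for memory, here is a minimal Count-Min Sketch in Python. It is a teaching sketch, not the implementation Hydra uses; the width and depth defaults and the SHA-256 based hashing are arbitrary choices for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table that can over-count but never under-count."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by hashing the row number with the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate cells, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for term in ["seven", "seven", "seven", "to"]:
    sketch.add(term)
```

Memory stays at width times depth counters no matter how many distinct terms are added, and estimate never returns less than the true count.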

Page 33: Hydra - Getting Started

Data Attachment Example

A single node that tracks the top 1000 unique search terms, the distinct count of UIDs, and provides quantile estimation for the query time
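The shape of such a node can be sketched in Python with exact stand-ins for the three attachments. This is an assumption-laden illustration: a Counter replaces the top-K estimator, a set replaces HyperLogLog Plus, and a sorted list replaces the quantile digest, so it shows what the node answers but not how Hydra keeps the memory bounded. The class and method names are hypothetical.

```python
from collections import Counter


class Node:
    """One tree node with three attachments (exact stand-ins, not sketches)."""

    def __init__(self, top_k=1000):
        self.top_k = top_k
        self.hits = 0
        self.terms = Counter()  # stand-in for a top-K estimator
        self.uids = set()       # stand-in for a cardinality estimator
        self.times = []         # stand-in for a quantile digest

    def update(self, time, uid, terms):
        self.hits += 1
        self.terms.update(terms)
        self.uids.add(uid)
        self.times.append(time)

    def top_terms(self):
        return self.terms.most_common(self.top_k)

    def uid_count(self):
        return len(self.uids)

    def quantile(self, q):
        # Requires at least one update; picks the nearest rank at or above q.
        ordered = sorted(self.times)
        return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


node = Node()
node.update(3.535, "5214d63bab95687d", ["the", "then", "good"])
node.update(3.568, "5dbd9451948ad895", ["know", "boys"])
node.update(4.206, "5dbd9451948ad895", ["to"])
node.update(4.673, "b967d99cad0b3e60", ["seven"])
node.update(4.900, "bd0d760fbb338955", ["did", "local", "it"])
```

With the five sample rows, the node reports four distinct users and a 95th percentile time of 4.9.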

Page 34: Hydra - Getting Started

Putting it All Together

Page 35: Hydra - Getting Started

Job Structure

• Jobs have three sections

• Source

• Map

• Output

Page 36: Hydra - Getting Started

Source

• Defines the properties of the input data set

• Several built in source types:

• Mesh

• Local File System

• Kafka

Page 37: Hydra - Getting Started

Map

• Select fields from input record to process

• Apply filters to rows and columns

• Drop or expand rows

Page 38: Hydra - Getting Started

Output - Tree

• Output(s) can be trees or data files

• Trees represent data aggregations that can be queried

• File Output Targets

• File System

• Cassandra

• HDFS
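The three job sections can be sketched as one small Python function that reads rows (source), selects and filters fields (map), and aggregates into a tree (output). The tree layout root/byip/IP/UID is inferred from the query paths later in the deck, and the helper names are hypothetical; a real Hydra job is a configuration document, not Python code.

```python
def new_node():
    return {"hits": 0, "children": {}}


def touch(node, key):
    """Find or create a child node and count one hit on it."""
    child = node["children"].setdefault(key, new_node())
    child["hits"] += 1
    return child


def run_job(raw_rows):
    root = new_node()
    for raw in raw_rows:                     # source: iterate input records
        row = {"UID": raw[1], "IP": raw[2]}  # map: select fields
        if not row["UID"]:                   # map: drop rows without a UID
            continue
        root["hits"] += 1
        byip = touch(root, "byip")           # output: root/byip/IP/UID
        touch(touch(byip, row["IP"]), row["UID"])
    return root


SAMPLE_ROWS = [
    (3.535, "5214d63bab95687d", "166.144.203.186", "the then good"),
    (3.568, "5dbd9451948ad895", "88.120.153.226", "know boys"),
    (4.206, "5dbd9451948ad895", "88.120.153.226", "to"),
    (4.673, "b967d99cad0b3e60", "88.120.153.226", "seven"),
    (4.900, "bd0d760fbb338955", "166.144.203.186", "did local it"),
]

tree = run_job(SAMPLE_ROWS)
by_ip = tree["children"]["byip"]["children"]
```

Every level of the resulting tree carries a hit count, which is the kind of pre-aggregated value a tree query can read directly.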

Page 39: Hydra - Getting Started

Let’s Put It All Together

Page 40: Hydra - Getting Started

Create Hydra Job

Page 41: Hydra - Getting Started

Run Job

Page 42: Hydra - Getting Started

Query

Page 43: Hydra - Getting Started

What are the top IP Addresses By Record Count?

• Exact

• path: root/byip/+:+hits

• ops: gather=ks;sort=1:n:d;limit=100

• Approximate

• path: root/byip/+$+uidcount

• ops: gather=ks;sort=1:n:d;limit=100
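One plausible reading of the ops string gather=ks;sort=1:n:d;limit=100 is: group rows by the first column and sum the second, sort by column 1 numerically in descending order, and keep the first 100 rows. The Python below sketches that reading; the semantics are an assumption, and the per-task partial results are made up for illustration.

```python
from collections import defaultdict


def gather_ks(rows):
    """gather=ks (assumed): column 0 is the key, column 1 is summed."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return list(totals.items())


def sort_numeric_desc(rows, column):
    """sort=1:n:d (assumed): numeric, descending sort on one column."""
    return sorted(rows, key=lambda row: row[column], reverse=True)


def limit(rows, count):
    return rows[:count]


# Hypothetical partial results (ip, hits) returned by two tasks.
task_a = [("88.120.153.226", 2), ("166.144.203.186", 1)]
task_b = [("88.120.153.226", 1), ("166.144.203.186", 1)]

top_ips = limit(sort_numeric_desc(gather_ks(task_a + task_b), 1), 100)
```

Under this reading, the gather step is what merges partial counts for the same IP before the sort and limit are applied.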

Page 44: Hydra - Getting Started

What are the top IPs by unique user count?

• Exact

• path: root/byip/+/+

• ops: gather=kk;sort=0;gather=ku;sort=1:n:d

• Approximate

• path: root/byip/+$+uidcount

• ops: gather=ks;sort=1:n:d;limit=100

Page 45: Hydra - Getting Started

What are the search terms for the slowest 5%?

• First get the 95th percentile query time

• path: /root$+timeDigest=quantile(.95)

• ops: num=c0,toint,v0,set;gather=a

• Now find all queries slower than the 95th percentile

• path: /root/bytime/+/+:+hits

• ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
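The two-step approach can be sketched in Python: first compute the 95th percentile of the query times, then count search terms only for queries at or above that threshold. This uses an exact percentile on the five sample rows rather than a quantile digest, and the function names are hypothetical.

```python
from collections import Counter


def percentile(values, q):
    """Nearest-rank percentile on an exact, sorted copy of the values."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def slow_terms(records, q=0.95):
    """Return the threshold and term counts for queries at or above it."""
    threshold = percentile([time for time, _ in records], q)
    counts = Counter()
    for time, terms in records:
        if time >= threshold:
            counts.update(terms)
    return threshold, counts.most_common()


RECORDS = [
    (3.535, ["the", "then", "good"]),
    (3.568, ["know", "boys"]),
    (4.206, ["to"]),
    (4.673, ["seven"]),
    (4.900, ["did", "local", "it"]),
]

threshold, terms = slow_terms(RECORDS)
```

In the slide's version the threshold found by the first query (950 in the example ops) is pasted into the second query by hand; the sketch simply passes it along as a variable.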

Page 46: Hydra - Getting Started

Daily Unique Searches, Users, IPs and distribution of response times?

• Query Path:

• root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits

• Ops:

• gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999

• Remote Ops:

• num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;

Page 47: Hydra - Getting Started

But yeah, I could do that with CLI!

Page 48: Hydra - Getting Started

Related Open Source Projects

• Meshy - https://github.com/addthis/meshy

• Codec - https://github.com/addthis/codec

• Muxy - https://github.com/addthis/muxy

• Bundle - https://github.com/addthis/bundle

• Basis - https://github.com/addthis/basis

• Column Compressor - https://github.com/addthis/columncompressor

• Cluster Boot Service - https://github.com/stewartoallen/cbs

Page 49: Hydra - Getting Started

Helpful Resources

• Hydra - https://github.com/addthis/hydra

• Hydra User Reference - http://oss-docs.addthiscode.net/hydra/latest/user-reference/

• Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/

• IRC - #hydra

• Mailing List - https://groups.google.com/forum/#!forum/hydra-oss