Post on 27-Jan-2015
Large Scale Web Analytics with Accumulo (and Nutch/Gora, Pig, and Storm)
Jason Trost jtrost@endgames.us @jason_trost
Introductions
• Jason Trost (jtrost@endgames.us)
• Senior Software Engineer at Endgame Systems
• Former Accumulo Trainer
• Apache Accumulo Committer
– Apache Pig integration with Accumulo
– some minor bug fixes
Agenda
• Technologies Introduction
  – Apache Accumulo
  – Apache Gora
  – Apache Nutch/Gora
  – Storm
• Accumulo at Endgame
  – Web Crawl Analytics
  – Real-time DNS Processing
  – Operations
Apache Accumulo
• Accumulo is an implementation of Google's BigTable design with cell-level security
• It is conceptually very similar to HBase, but it has some nice features that HBase is currently lacking.
• Some of these features are:
  – Cell-level security
  – No fat row problem
  – No limitation on column families or on when column families can be created
  – A server-side, data-local programming abstraction called Iterators
  – Iterators enable fast aggregation, searching, filtering, and a streaming Reduce
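The iterator concept above can be sketched conceptually: a server-side transform applied lazily to a sorted stream of key/value pairs. This is a minimal Python simulation of the idea, not Accumulo's actual Java `SortedKeyValueIterator` API; all names here are illustrative.

```python
# Conceptual sketch of Accumulo-style iterators: transforms over a sorted
# stream of (key, value) pairs, applied server-side without materializing
# the whole stream. Illustrative only -- not the real Accumulo API.

def filtering_iterator(kv_stream, predicate):
    """Yield only the entries that match the predicate."""
    for key, value in kv_stream:
        if predicate(key, value):
            yield key, value

def summing_iterator(kv_stream):
    """Combine consecutive entries sharing a key by summing their values,
    the way a combining iterator performs a streaming reduce."""
    current_key, total = None, 0
    for key, value in kv_stream:
        if key != current_key:
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield current_key, total

entries = [("a", "1"), ("a", "1"), ("b", "1")]
print(list(summing_iterator(entries)))  # [('a', 2), ('b', 1)]
```

Because the input arrives sorted by key, the combine needs only one pass and constant memory, which is what makes this viable at the tablet-server level.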
Apache Gora
• Gora is an object-to-datastore mapping framework for arbitrary data stores, including both relational (MySQL) and non-relational data stores (HBase, Cassandra, Accumulo, Redis, Voldemort, etc.).
• It was designed for Big Data applications and has support (interfaces) for Apache Pig, Apache Hive, Cascading, and generic MapReduce.
Apache Nutch/Gora
• Nutch is a highly scalable web crawler built over Hadoop MapReduce.
• It was designed from the ground up to be an Internet-scale web crawler and to enable large-scale search applications.
• Gora enables storing the web crawl data and metadata in Accumulo
Storm
• Highly scalable streaming event processing system
• Conceptually similar to MapReduce, but operates on streaming data in real-time
• Released by Twitter after they acquired BackType
• Development led by Nathan Marz
Twitter Storm
• At-least-once-processing of events
• Spouts and Bolts are wired together to form computation Topologies
• Topologies run until killed
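The spout/bolt wiring above can be simulated in miniature. This is an illustrative Python sketch of the dataflow idea only; real Storm topologies are built in Java with `TopologyBuilder`, and the function names below are hypothetical.

```python
# Minimal simulation of Storm's dataflow model: a spout emits a stream of
# tuples, and bolts consume one stream and emit another. Wiring a topology
# is, conceptually, composing these streams.

def sentence_spout():
    """A spout: the source of a tuple stream."""
    for sentence in ["the quick fox", "the lazy dog"]:
        yield sentence

def split_bolt(stream):
    """A bolt: consumes sentences, emits words."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """A terminal bolt that accumulates word counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# "Wiring" the topology: spout -> split bolt -> count bolt.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["the"])  # 2
```

The real system adds what this sketch omits: parallelism across workers, stream groupings, and the at-least-once replay that gives Storm its delivery guarantee.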
Web Crawl Analytics
• Formerly used Heritrix with a Cassandra backend for collection and storage
• We now use Nutch/Gora to perform large-scale web crawling
• All pages and HTTP headers are stored in Accumulo
• We run Pig scripts to pull data out of Accumulo, perform rollups, do pattern matching (using regular expressions), and process the pages with Python scripts
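As an illustration of the regex-based pattern matching step, here is a hypothetical Python sketch of the kind of rollup such a script might perform over crawled pages. The field layout, pattern, and example URLs are assumptions for illustration, not Endgame's actual pipeline.

```python
import re

# Hypothetical pattern-matching rollup over (url, page body) records, of
# the sort a Pig-driven Python script might apply to crawled pages.

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def extract_emails(pages):
    """Roll up e-mail addresses found across a batch of page bodies,
    keeping the set of pages each address appeared on."""
    hits = {}
    for url, body in pages:
        for match in EMAIL_RE.findall(body):
            hits.setdefault(match, set()).add(url)
    return hits

pages = [("http://a.example", "contact admin@example.com"),
         ("http://b.example", "mail admin@example.com or sales@example.com")]
hits = extract_emails(pages)
print(sorted(hits))  # ['admin@example.com', 'sales@example.com']
```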
Real-time DNS Processing
• We used to use MapReduce/Pig to generate daily reports on all DNS event data from files in HDFS; this took several hours
• Now, we use an internally developed framework called Velocity that was built over Storm
• In real time, we enrich DNS and security events with IP geo data (country, city, company, vertical) and correlate them with internally developed and maintained DNS blacklists
• Store the events in Accumulo & use custom Accumulo iterators to perform rollups
• At report generation time, Accumulo aggregates records server side
• This process now takes minutes, not hours, and we can query for partial results instead of having to wait until the end of the day
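The enrichment step described above can be sketched as a simple join of each event against lookup tables. The field names, addresses, and table contents below are hypothetical stand-ins for the IP-geo data and internal blacklists; the real system does this inside Storm bolts.

```python
# Illustrative sketch of real-time event enrichment: annotate each DNS
# event with geo data for its resolved IP and a blacklist verdict for its
# domain. Lookup tables here are toy stand-ins.

GEO = {"203.0.113.7": {"country": "US", "city": "Atlanta"}}
BLACKLIST = {"bad.example.com"}

def enrich(event):
    """Return a copy of the event with geo fields and a blacklist flag."""
    enriched = dict(event)
    enriched.update(GEO.get(event["resolved_ip"], {}))
    enriched["blacklisted"] = event["domain"] in BLACKLIST
    return enriched

event = {"domain": "bad.example.com", "resolved_ip": "203.0.113.7"}
print(enrich(event)["blacklisted"])  # True
```

Doing this per event as it streams through, rather than in a nightly batch, is what moves the reporting latency from hours to minutes.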
Custom Iterators & Aggregation
Ingest Format
• Row: GROUP BY fields
• Col Fam: constant string
• Col Qual: event UUID
• Val: (empty)
Format After Custom Iterator
• Row: GROUP BY fields
• Col Fam: constant string
• Col Qual: “”
• Val: “1”
At Ingest
• RowID contains a CSV record that represents the fields used to perform a GROUP BY
• Col Qual contains the event UUID
At Scan Time
• Strip off the event UUID
• Set the value to “1”
• This prepares the Key/Value for input into the SummingCombiner
• Output from the SummingCombiner is an accurate count of aggregated records
• This is, in essence, a streaming Reduce
Operations with Accumulo
• Hadoop Streaming jobs tend to kill tablet servers
– Streaming jobs use more memory than Hadoop allows
– This can make service memory allocations challenging
– Reducing number of Map tasks helped
• Running tablet servers under supervision is critical
– Tablet servers fail fast
– Supervisord or daemontools restart failed processes
– Has improved our cluster’s stability dramatically
• Pre-splitting tables is very important for throughput
– Our rows lead with a day, e.g. “20120101”
• Locality Groups are your friend for Nutch/Gora
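Given day-leading row IDs, pre-split points are easy to generate ahead of time. A minimal sketch (the split strings would then be handed to Accumulo, e.g. via the shell's addsplits; the date range here is illustrative):

```python
from datetime import date, timedelta

# Generate table split points for day-leading row IDs like "20120101",
# so ingest spreads across tablet servers from the start instead of
# hammering a single tablet.

def day_splits(start, days):
    """Return one YYYYMMDD split string per day in the range."""
    return [(start + timedelta(n)).strftime("%Y%m%d") for n in range(days)]

splits = day_splits(date(2012, 1, 1), 3)
print(splits)  # ['20120101', '20120102', '20120103']
```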
We’re Hiring
• Like to work on hard problems with Big Data?
• Are you familiar with or interested in these technologies?
  – Hadoop, Storm, Django, Nutch/Gora
  – Accumulo, Solr/ElasticSearch, Redis
  – Python, Java, Pig, Node.js, GitHub
• Want to contribute to Open Source?
• We have offices in Atlanta, Washington DC, Baltimore, and San Antonio
• www.linkedin.com/jobs/at-Endgame-Systems
Questions?
Contact Info
• Jason Trost
• Email: jtrost@endgames.us
• Twitter: @jason_trost
• Blog: www.covert.io