Post on 27-Jan-2015
Large Scale Web Analytics with Accumulo (and Nutch/Gora, Pig, and Storm)
Jason Trost jtrost@endgames.us @jason_trost
Introductions
• Jason Trost (jtrost@endgames.us)
• Senior Software Engineer at Endgame Systems
• Former Accumulo Trainer
• Apache Accumulo Committer
– Apache Pig integration with Accumulo
– some minor bug fixes
Agenda
• Technologies Introduction
  – Apache Accumulo
  – Apache Gora
  – Apache Nutch/Gora
  – Storm
• Accumulo at Endgame
  – Web Crawl Analytics
  – Real-time DNS Processing
  – Operations
Apache Accumulo
• Accumulo is an implementation of Google's BigTable design with cell-level security
• It is conceptually very similar to HBase, but it has some nice features that HBase is currently lacking.
• Some of these features are:
  – Cell-level security
  – No fat row problem
  – No limitation on column families or on when column families can be created
  – A server-side, data-local programming abstraction called Iterators
  – Iterators enable fast aggregation, searching, filtering, and a streaming Reduce
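The iterator concept above can be sketched conceptually: a server-side transform applied lazily to a sorted stream of key/value pairs. This is a minimal Python simulation of the idea, not Accumulo's actual Java `SortedKeyValueIterator` API; all names here are illustrative.

```python
# Conceptual sketch of Accumulo-style iterators: transforms over a sorted
# stream of (key, value) pairs, applied server-side without materializing
# the whole stream. Illustrative only -- not the real Accumulo API.

def filtering_iterator(kv_stream, predicate):
    """Yield only the entries that match the predicate."""
    for key, value in kv_stream:
        if predicate(key, value):
            yield key, value

def summing_iterator(kv_stream):
    """Combine consecutive entries sharing a key by summing their values,
    the way a combining iterator performs a streaming reduce."""
    current_key, total = None, 0
    for key, value in kv_stream:
        if key != current_key:
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield current_key, total

entries = [("a", "1"), ("a", "1"), ("b", "1")]
print(list(summing_iterator(entries)))  # [('a', 2), ('b', 1)]
```

Because the input arrives sorted by key, the combine needs only one pass and constant memory, which is what makes this viable at the tablet-server level.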
Apache Gora
• Gora is an object-to-datastore mapping framework for arbitrary data stores, including both relational (MySQL) and non-relational data stores (HBase, Cassandra, Accumulo, Redis, Voldemort, etc.).
• It was designed for Big Data applications and has support (interfaces) for Apache Pig, Apache Hive, Cascading, and generic MapReduce.
Apache Nutch/Gora
• Nutch is a highly scalable web crawler built over Hadoop MapReduce.
• It was designed from the ground up to be an Internet-scale web crawler and to enable large-scale search applications.
• Gora enables storing the web crawl data and metadata in Accumulo
Storm
• Highly scalable streaming event processing system
• Conceptually similar to MapReduce, but operates on streaming data in real-time
• Released by Twitter after they acquired BackType
• Development led by Nathan Marz
Twitter Storm
• At-least-once-processing of events
• Spouts and Bolts are wired together to form computation Topologies
• Topologies run until killed
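The spout/bolt wiring above can be simulated in miniature. This is an illustrative Python sketch of the dataflow idea only; real Storm topologies are built in Java with `TopologyBuilder`, and the function names below are hypothetical.

```python
# Minimal simulation of Storm's dataflow model: a spout emits a stream of
# tuples, and bolts consume one stream and emit another. Wiring a topology
# is, conceptually, composing these streams.

def sentence_spout():
    """A spout: the source of a tuple stream."""
    for sentence in ["the quick fox", "the lazy dog"]:
        yield sentence

def split_bolt(stream):
    """A bolt: consumes sentences, emits words."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """A terminal bolt that accumulates word counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# "Wiring" the topology: spout -> split bolt -> count bolt.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["the"])  # 2
```

The real system adds what this sketch omits: parallelism across workers, stream groupings, and the at-least-once replay that gives Storm its delivery guarantee.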
Web Crawl Analytics
• Formerly used Heritrix with a Cassandra backend for collection and storage
• We now use Nutch/Gora to perform large-scale web crawling
• All pages and HTTP headers are stored in Accumulo
• We run Pig scripts to pull data out of Accumulo, perform rollups, do pattern matching (using regular expressions), and process the pages with Python scripts
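As an illustration of the regex-based pattern matching step, here is a hypothetical Python sketch of the kind of rollup such a script might perform over crawled pages. The field layout, pattern, and example URLs are assumptions for illustration, not Endgame's actual pipeline.

```python
import re

# Hypothetical pattern-matching rollup over (url, page body) records, of
# the sort a Pig-driven Python script might apply to crawled pages.

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def extract_emails(pages):
    """Roll up e-mail addresses found across a batch of page bodies,
    keeping the set of pages each address appeared on."""
    hits = {}
    for url, body in pages:
        for match in EMAIL_RE.findall(body):
            hits.setdefault(match, set()).add(url)
    return hits

pages = [("http://a.example", "contact admin@example.com"),
         ("http://b.example", "mail admin@example.com or sales@example.com")]
hits = extract_emails(pages)
print(sorted(hits))  # ['admin@example.com', 'sales@example.com']
```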
Real-time DNS Processing
• We used to use MapReduce/Pig to generate daily reports on all DNS event data from files in HDFS; this took several hours
• Now, we use an internally developed framework called Velocity that was built over Storm
• In real time, we enrich DNS and security events with IP geo data (country, city, company, vertical) and correlate them with internally developed and maintained DNS blacklists
• Store the events in Accumulo & use custom Accumulo iterators to perform rollups
• At report generation time, Accumulo aggregates records server side
• This process now takes minutes, not hours, and we can query for partial results instead of having to wait until the end of the day
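The enrichment step described above can be sketched as a simple join of each event against lookup tables. The field names, addresses, and table contents below are hypothetical stand-ins for the IP-geo data and internal blacklists; the real system does this inside Storm bolts.

```python
# Illustrative sketch of real-time event enrichment: annotate each DNS
# event with geo data for its resolved IP and a blacklist verdict for its
# domain. Lookup tables here are toy stand-ins.

GEO = {"203.0.113.7": {"country": "US", "city": "Atlanta"}}
BLACKLIST = {"bad.example.com"}

def enrich(event):
    """Return a copy of the event with geo fields and a blacklist flag."""
    enriched = dict(event)
    enriched.update(GEO.get(event["resolved_ip"], {}))
    enriched["blacklisted"] = event["domain"] in BLACKLIST
    return enriched

event = {"domain": "bad.example.com", "resolved_ip": "203.0.113.7"}
print(enrich(event)["blacklisted"])  # True
```

Doing this per event as it streams through, rather than in a nightly batch, is what moves the reporting latency from hours to minutes.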
Custom Iterators & Aggregation
Ingest Format
• Row: GROUP BY fields
• Col Fam: constant string
• Col Qual: event UUID
• Val: (empty)
Format After Custom Iterator
• Row: GROUP BY fields
• Col Fam: constant string
• Col Qual: “”
• Val: “1”
At Ingest
• RowID contains a CSV record that represents the fields used to perform a GROUP BY
• Col Qual contains the event UUID
At Scan Time
• Strip off the event UUID
• Set the value to “1”
• This prepares the Key/Value for input into the SummingCombiner
• Output from the SummingCombiner is an accurate count of aggregated records
• This is, in essence, a streaming Reduce
Operations with Accumulo
• Hadoop Streaming jobs tend to kill tablet servers
– Streaming jobs use more memory than Hadoop allows
– This can make service memory allocations challenging
– Reducing number of Map tasks helped
• Running tablet servers under supervision is critical
– Tablet servers fail fast
– Supervisord or daemontools restart failed processes
– Has improved our cluster’s stability dramatically
• Pre-splitting tables is very important for throughput
– Our rows lead with a day, e.g. “20120101”
• Locality Groups are your friend for Nutch/Gora
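Given day-leading row IDs, pre-split points are easy to generate ahead of time. A minimal sketch (the split strings would then be handed to Accumulo, e.g. via the shell's addsplits; the date range here is illustrative):

```python
from datetime import date, timedelta

# Generate table split points for day-leading row IDs like "20120101",
# so ingest spreads across tablet servers from the start instead of
# hammering a single tablet.

def day_splits(start, days):
    """Return one YYYYMMDD split string per day in the range."""
    return [(start + timedelta(n)).strftime("%Y%m%d") for n in range(days)]

splits = day_splits(date(2012, 1, 1), 3)
print(splits)  # ['20120101', '20120102', '20120103']
```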
We’re Hiring
• Like to work on hard problems with Big Data?
• Are you familiar with or interested in these technologies?
  – Hadoop, Storm, Django, Nutch/Gora
  – Accumulo, Solr/ElasticSearch, Redis
  – Python, Java, Pig, Node.js, GitHub
• Want to contribute to Open Source?
• We have offices in Atlanta, Washington DC, Baltimore, and San Antonio
• www.linkedin.com/jobs/at-Endgame-Systems
Questions?
Contact Info
• Jason Trost
• Email: jtrost@endgames.us
• Twitter: @jason_trost
• Blog: www.covert.io