DC Spark Bake-off: Real-time TCP Packet Analysis Using Spark and Azure Event Hubs
Washington DC Area Apache Spark Interactive
Spark Bake-off
Team Name: Silvio Fiorito
Solution Title: Real-time Packet Analysis using Spark
Team Introductions
Silvio Fiorito
– Background in development and app security
– Started working with Hadoop in 2012
– Started using Spark at v0.6 in early 2013
– Built a few prototypes for low-latency query services with Spark/Shark and then SparkSQL
– Twitter: @granturing
Solution Overview
Real-time TCP packet analysis of geographically distributed hosts
– Must support high throughput from many hosts
– 3 demo VMs (2 x Azure and 1 x AWS)
Local Flume agent pushes events to Azure Event Hub
Events are partitioned and persisted for up to 7 days
Spark Streaming app ingests the streams
– Reconstructs packets
– Performs lookups for geo-IP and port description
– Clusters using a pre-trained k-means model
– Saves data to Azure Table Storage and publishes on a Service Bus Topic
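
The per-packet enrichment steps listed above (port description lookup and assignment to a pre-trained k-means cluster) can be illustrated with a minimal plain-Python sketch. This is not the code from the talk, which ran inside a Spark Streaming app; the port table, the centroid values, the two-feature layout (packet size, destination port), and the function names `describe_port`, `nearest_cluster`, and `enrich` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical port descriptions; a real lookup table would be far larger.
PORT_DESCRIPTIONS = {22: "ssh", 53: "dns", 80: "http", 443: "https"}

# Hypothetical centroids standing in for a k-means model trained offline.
# Assumed feature order: [packet size in bytes, destination port].
CENTROIDS = [
    [60.0, 53.0],
    [1400.0, 443.0],
]


def describe_port(port):
    """Return a service name for the port, or 'unknown' if it is not listed."""
    return PORT_DESCRIPTIONS.get(port, "unknown")


def nearest_cluster(features, centroids=CENTROIDS):
    """Return the index of the centroid closest to `features` (Euclidean distance)."""
    distances = [math.dist(features, c) for c in centroids]
    return distances.index(min(distances))


def enrich(packet):
    """Return a copy of the packet dict with a port description and cluster id added."""
    features = [float(packet["size"]), float(packet["dst_port"])]
    return {
        **packet,
        "port_description": describe_port(packet["dst_port"]),
        "cluster": nearest_cluster(features),
    }
```

In a streaming job, a function like `enrich` would typically be applied to each reconstructed packet in a map step, with the centroids loaded once per worker rather than per record. Raw port numbers and byte counts are on very different scales, so a real model would likely normalize features before clustering.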