Big Data @ Hootsuite Analytics
DESCRIPTION
An overview of what the Hootsuite Bucharest office does to aggregate data into analytics and insights.
TRANSCRIPT
Big Data @ Hootsuite Analytics
Claudiu Coman
About us
● You might have heard of uberVU
● Acquired by Hootsuite to develop their new analytics
● New scalability challenges => millions of customers
● Everything is in the cloud => Amazon
● 10 M social media posts per day
● 100+ Amazon EC2 instances of various sizes
● 650 GB media posts/month
● 400 GB worth of analytics data
● 10+ MongoDB clusters
What this presentation will be about
● What is the big data we’re working with
● Infrastructure
● Technologies
● What we do on top of the data
● How we currently display the data
● What’s in store for the future
Our data
● Our data is made up of media posts
● Revolves around search queries
● You can also connect accounts and pages
● Sources: Twitter, Facebook, Google+, Wordpress, Flickr, Picasa and others
● To acquire data
  ○ a lot of REST
  ○ some streaming
  ○ very little scraping
Mentions
● Every piece of data that we process
  ○ gets normalized
  ○ annotated
  ○ converted to a standard JSON format
● We call the end result a mention
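As a rough illustration of the normalization step, the sketch below maps two differently shaped raw posts onto one JSON-serializable dictionary. The field names (`source`, `author`, `text`, `published_at`, `annotations`) and the raw payload shapes are assumptions made for this example, not the actual mention schema.

```python
import json
from datetime import datetime, timezone


def normalize_tweet(raw):
    """Map a hypothetical Twitter-like payload to the common shape."""
    return {
        "source": "twitter",
        "author": raw["user"]["screen_name"],
        "text": raw["text"].strip(),
        "published_at": datetime.fromtimestamp(
            raw["timestamp"], tz=timezone.utc
        ).isoformat(),
        "annotations": {},  # filled in later by the annotation steps
    }


def normalize_blog_post(raw):
    """Map a hypothetical blog payload to the same shape."""
    return {
        "source": "blog",
        "author": raw["author_name"],
        "text": raw["body"].strip(),
        "published_at": raw["date"],  # assumed to already be ISO 8601
        "annotations": {},
    }


# One normalizer per source; every one returns the same set of keys.
NORMALIZERS = {"twitter": normalize_tweet, "blog": normalize_blog_post}


def to_mention(source, raw):
    return NORMALIZERS[source](raw)


if __name__ == "__main__":
    raw = {"user": {"screen_name": "alice"}, "text": " Hello! ", "timestamp": 0}
    print(json.dumps(to_mention("twitter", raw), indent=2))
```

Because every normalizer returns the same keys, downstream consumers never need to know which network a mention came from.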
Mentions (2)
Annotations
● Language detection
  ○ written in C++, wrapped in Python
  ○ can process ~300 mentions/second
● Sentiment detection
  ○ external provider
  ○ piggybacking for efficiency
● Location detection
  ○ in-house algorithm
  ○ text tokenization and matching against a locations database
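The tokenize-and-match idea behind location detection can be sketched as follows. This is not the in-house algorithm: the tiny `LOCATIONS` dictionary stands in for the locations database, the ASCII-only tokenizer is a simplification, and the longest-match-first rule is an assumption.

```python
import re

# Hypothetical stand-in for the locations database:
# token tuples mapped to a canonical location label.
LOCATIONS = {
    ("bucharest",): "Bucharest, RO",
    ("new", "york"): "New York, US",
    ("vancouver",): "Vancouver, CA",
}
MAX_TOKENS = max(len(key) for key in LOCATIONS)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (ASCII only)."""
    return re.findall(r"[a-z]+", text.lower())


def detect_locations(text):
    """Return canonical labels for every known location found in the text."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "new york" wins over "york".
        for size in range(min(MAX_TOKENS, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in LOCATIONS:
                found.append(LOCATIONS[candidate])
                i += size
                break
        else:
            i += 1  # no location starts at this token
    return found


if __name__ == "__main__":
    print(detect_locations("Flying from New York to Bucharest!"))
```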
The Pipeline
● Our processing infrastructure is a pipeline
● Producer-Consumer pattern
● Enables us to scale parts of the infrastructure separately
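A minimal, single-process sketch of the producer-consumer pattern is shown below. An in-memory `queue.Queue` stands in for the distributed message queue, and the "processing" step (upper-casing a string) is a placeholder; both are assumptions for illustration only.

```python
import queue
import threading


def producer(out_queue, posts):
    """Push raw posts onto the queue."""
    for post in posts:
        out_queue.put(post)


def consumer(in_queue, results, lock):
    """Pull posts until the sentinel arrives; 'processing' is a placeholder."""
    while True:
        post = in_queue.get()
        if post is None:  # sentinel: no more work for this consumer
            break
        with lock:
            results.append(post.upper())


def run_pipeline(posts, num_consumers=3):
    work = queue.Queue()
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(work, results, lock))
        for _ in range(num_consumers)
    ]
    for worker in workers:
        worker.start()
    producer(work, posts)
    for _ in workers:
        work.put(None)  # one sentinel per consumer
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(sorted(run_pipeline(["a", "b", "c"])))
```

Changing `num_consumers` scales the consuming side without touching the producer, which is the property the pattern is chosen for.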
The pipeline (2)
The pipeline (3)
● 100+ consumer types
● 450 consumer instances
● Automatic scaling algorithm developed by us
  ○ whenever a consumer falls behind, the system deploys new consumer instances
  ○ automatically adjusts cluster size
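The slides do not describe how the scaling algorithm decides, so the following is only one plausible rule, sketched under stated assumptions: size a consumer type so that its current backlog drains within a target time. The function name, parameters, and limits are all hypothetical.

```python
import math


def desired_instances(backlog, per_instance_rate, target_drain_seconds,
                      min_instances=1, max_instances=50):
    """Instances needed to drain `backlog` items within the target time.

    backlog              -- items currently waiting in the queue
    per_instance_rate    -- items one consumer instance handles per second
    target_drain_seconds -- how quickly the backlog should be cleared
    """
    if per_instance_rate <= 0 or target_drain_seconds <= 0:
        raise ValueError("rate and target time must be positive")
    needed = math.ceil(backlog / (per_instance_rate * target_drain_seconds))
    # Clamp so the system never scales to zero or grows without bound.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    print(desired_instances(backlog=12000, per_instance_rate=10,
                            target_drain_seconds=60))
```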
MongoDB
● 10+ clusters
● Our biggest cluster
  ○ 1500 operations/second
  ○ m2.xlarge instances (17 GB RAM, 6.5 ECU)
  ○ 8x80 GB RAID10
● Hard to manage databases in multiple clusters
  ○ we wrote mongo-pool https://github.com/uberVU/mongo-pool
● Cluster pyramid structure for cost efficiency
● Communication between clusters through our own mongo-oplogreplay https://github.com/uberVU/mongo-oplogreplay
MongoDB mention clusters pyramid
Kestrel
● Distributed message queue developed by Twitter
● Uses Memcache protocol
● Disk persistence
● 400 consumer operations/second
● Part of our pipeline core
● Extremely reliable
● Sending gzipped content to save I/O costs
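The gzip step can be sketched with the Python standard library alone. The helper names `pack` and `unpack` are hypothetical, and the queue client call that would actually send the payload is deliberately left out.

```python
import gzip
import json


def pack(mention):
    """Serialize a mention to JSON and gzip it before enqueuing."""
    return gzip.compress(json.dumps(mention).encode("utf-8"))


def unpack(payload):
    """Reverse of pack(): gunzip and parse the JSON payload."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))


if __name__ == "__main__":
    mention = {"source": "blog", "text": "big data " * 200}
    payload = pack(mention)
    print(len(json.dumps(mention)), "bytes raw,", len(payload), "bytes gzipped")
    assert unpack(payload) == mention
```

Long, repetitive text such as blog content compresses well; very short messages may not shrink at all because of the gzip header overhead.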
Redis, Memcache
● Gradually replacing Memcache with Redis
● Used for high-access temporary information
● 60 GB worth of data
Other technologies
● RabbitMQ for asynchronous tasks
● DynamoDB
  ○ analytics
  ○ auxiliary permanent storage use cases that don’t take a lot of space
● S3 for data with low access rates
● Glacier for archived data
System metrics & monitoring
● Graphite
● 150K system metrics
● Alerts are generated based on Graphite
● In-house alert detection
● Nagios/Nagstamon
● PagerDuty for on-call
Analytics overview
● Currently in DynamoDB
● MongoDB still runs in parallel, considering a move back to MongoDB
● Breakdowns on language, location, sentiment, gender
● Support for several resolutions (day, hour, 15min)
● Optimized for language and location filtering (95% of our queries)
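Supporting several resolutions usually comes down to flooring each timestamp to the start of its bucket. The sketch below shows that arithmetic for the three resolutions listed; the key layout in `bucket_key`, which puts language and location in the key so those filters become direct lookups, is an assumption and not the actual storage schema.

```python
# Bucket sizes in seconds for the resolutions named on the slide.
RESOLUTIONS = {"day": 86400, "hour": 3600, "15min": 900}


def bucket_start(ts, resolution):
    """Floor a Unix timestamp (UTC seconds) to the start of its bucket."""
    size = RESOLUTIONS[resolution]
    return ts - ts % size


def bucket_key(ts, resolution, language, location):
    """Hypothetical key layout: one counter per bucket, language and location."""
    return f"{resolution}:{bucket_start(ts, resolution)}:{language}:{location}"


if __name__ == "__main__":
    ts = 86400 * 10 + 3600 * 5 + 900 * 2 + 7
    for resolution in RESOLUTIONS:
        print(bucket_key(ts, resolution, "en", "RO"))
```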
Aggregation pipeline
● We aggregate analytics to reduce writes
● Efficient but simple concurrency through Redis primitives
● Got a 5x improvement!
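The core idea of aggregating before writing can be sketched in a single process: combine many increments in memory and emit one batched write. Here a `threading.Lock` stands in for the Redis primitives mentioned above, and `WriteAggregator`, `flush_fn`, and `max_pending` are hypothetical names.

```python
import threading
from collections import Counter


class WriteAggregator:
    """Combine many counter increments into one batched write."""

    def __init__(self, flush_fn, max_pending=1000):
        self._flush_fn = flush_fn        # called with a dict of key -> total
        self._max_pending = max_pending  # increments buffered before a flush
        self._pending = Counter()
        self._count = 0
        self._lock = threading.Lock()

    def increment(self, key, amount=1):
        with self._lock:
            self._pending[key] += amount
            self._count += 1
            should_flush = self._count >= self._max_pending
        if should_flush:
            self.flush()

    def flush(self):
        # Swap the buffer out under the lock, then write outside of it.
        with self._lock:
            batch, self._pending = self._pending, Counter()
            self._count = 0
        if batch:
            self._flush_fn(dict(batch))


if __name__ == "__main__":
    writes = []
    aggregator = WriteAggregator(writes.append, max_pending=4)
    for language in ["en", "en", "ro", "en"]:
        aggregator.increment(language)
    print(writes)  # four increments collapsed into a single write
```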
Aggregation pipeline (2)
Tagcloud
● Tagcloud algorithm that can detect n-grams of all lengths
● Some of the data we analyze is blog content, which can be very big
● Needed something fast
● In-house algorithm
  ○ linear complexity (doesn’t go up with max n-gram size)
  ○ based on statistical correlations
Signals
● Need to synthesize all the data we’re collecting
● Top Stories
  ○ O(1) algorithm
● Influencers
  ○ dependency graph
  ○ edges are interactions between users
● Spikes & Bursts
  ○ code written in C++ to reduce time
  ○ statistical algorithms on top of our analytics timeseries
  ○ adapt reads from analytics based on data size
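The slides do not say which statistical algorithms are used for Spikes & Bursts. As one common possibility, the sketch below flags a point when it sits several standard deviations above a trailing window (a z-score test). It is written in Python for readability rather than C++, and `window`, `threshold`, and `min_sigma` are assumed parameters.

```python
from statistics import mean, pstdev


def find_spikes(series, window=5, threshold=3.0, min_sigma=1.0):
    """Return indexes whose value is far above the trailing window.

    A point is a spike when (value - mean) / stddev of the previous
    `window` points exceeds `threshold`. `min_sigma` keeps a perfectly
    flat history from producing a division by zero.
    """
    spikes = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = max(pstdev(history), min_sigma)
        if (series[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    mentions_per_hour = [10, 11, 9, 10, 10, 50, 10]
    print(find_spikes(mentions_per_hour))
```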
Boards
● Needed a good way to display all this data
● Designed Boards
  ○ released this year
  ○ allows you to create a dashboard with metric visualizations
  ○ drag-and-drop widgets to arrange them your way
Boards Demo
Future plans
● Hootsuite has millions of users
● Our analytics infrastructure will have to scale
  ○ Transitioning to streaming services
  ○ Larger MongoDB clusters
    ■ more shards for write throughput
    ■ more secondaries for reads
● Add more metrics
Questions?