Big Data @ Hootsuite Analytics

Claudiu Coman

About us

● You might have heard of uberVU
● Acquired by Hootsuite to develop their new analytics

● New scalability challenges => millions of customers
● Everything is in the cloud => Amazon
● 10 M social media posts per day
● 100+ Amazon EC2 instances of various sizes
● 650 GB media posts/month
● 400 GB worth of analytics data
● 10+ MongoDB clusters

What this presentation will be about

● What is the big data we’re working with
● Infrastructure
● Technologies
● What we do on top of the data
● How we currently display the data
● What’s in store for the future

Our data

● Our data is made up of media posts
● Revolves around search queries
● You can also connect accounts and pages
● Sources: Twitter, Facebook, Google+, WordPress, Flickr, Picasa and others
● To acquire data
  ○ a lot of REST
  ○ some streaming
  ○ very little scraping

Mentions

● Every piece of data that we process
  ○ gets normalized
  ○ gets annotated
  ○ gets converted to a standard JSON format
● We call the end result a mention
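
To make the normalization step concrete, here is a minimal sketch of turning one raw source payload into a common JSON shape. The schema (`source`, `id`, `author`, `text`, `published`, `annotations`) and the layout of the raw payload are assumptions made for illustration; the slides do not show the actual mention format.

```python
import json
from datetime import datetime, timezone


def normalize_tweet(raw: dict) -> dict:
    """Map a hypothetical raw Twitter-like payload onto a common mention schema."""
    return {
        "source": "twitter",
        "id": str(raw["id"]),
        "author": raw["user"]["screen_name"],
        "text": raw["text"].strip(),
        # Store timestamps as ISO 8601 UTC strings so every source looks alike.
        "published": datetime.fromtimestamp(raw["created_at"], tz=timezone.utc).isoformat(),
        # Placeholders that later annotation stages would fill in.
        "annotations": {"language": None, "sentiment": None, "location": None},
    }


raw = {"id": 42, "user": {"screen_name": "alice"}, "text": " Hello world ", "created_at": 0}
mention = normalize_tweet(raw)
print(json.dumps(mention, sort_keys=True))
```

Each source would get its own small mapping function of this kind, so everything downstream only ever sees one format.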

Mentions (2)

Annotations

● Language detection
  ○ written in C++, wrapped in Python
  ○ can process ~300 mentions/second
● Sentiment detection
  ○ external provider
  ○ piggybacking for efficiency
● Location detection
  ○ in-house algorithm
  ○ text tokenization and matching against a locations database
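
The location step (tokenize, then match against a locations database) can be sketched roughly as follows. This is not the in-house algorithm, only an illustration of the general idea; the tiny `LOCATIONS` table and the longest-match-first rule are assumptions.

```python
import re

# Hypothetical miniature locations database: lowercase name -> canonical place.
LOCATIONS = {
    "london": "London, GB",
    "new york": "New York, US",
    "bucharest": "Bucharest, RO",
}
MAX_WORDS = max(len(name.split()) for name in LOCATIONS)


def detect_locations(text: str) -> list:
    """Return canonical places whose names appear as token sequences in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "new york" wins over "york".
        for size in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in LOCATIONS:
                found.append(LOCATIONS[candidate])
                i += size
                break
        else:
            i += 1  # no location starts at this token
    return found


print(detect_locations("Flying from New York to London!"))
```

A production matcher would also need to handle ambiguous names and non-Latin scripts, which this sketch ignores.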

The Pipeline

● Our processing infrastructure is a pipeline
● Producer-Consumer pattern
● Enables us to scale parts of the infrastructure separately
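
As a reminder of what the Producer-Consumer pattern looks like in code, here is a minimal single-process sketch. It uses Python's in-memory `queue.Queue` and threads as a stand-in for the distributed queue and separate consumer processes of a real pipeline; the `upper()` call is a placeholder for actual processing.

```python
import queue
import threading

SENTINEL = None  # tells a consumer to stop


def producer(out_q: queue.Queue, items) -> None:
    for item in items:
        out_q.put(item)


def consumer(in_q: queue.Queue, results: list, lock: threading.Lock) -> None:
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        with lock:
            results.append(item.upper())  # stand-in for real processing


def run_pipeline(items, n_consumers: int = 3) -> list:
    q = queue.Queue()
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(q, results, lock))
        for _ in range(n_consumers)
    ]
    for w in workers:
        w.start()
    producer(q, items)
    for _ in workers:
        q.put(SENTINEL)  # one stop signal per consumer
    for w in workers:
        w.join()
    return results


print(sorted(run_pipeline(["a", "b", "c"])))
```

Because producers and consumers only share the queue, the number of consumers can be changed without touching the producer, which is what makes it possible to scale each stage separately.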

The pipeline (2)

The pipeline (3)

● 100+ consumer types
● 450 consumer instances
● Automatic scaling algorithm developed by us
  ○ whenever a consumer falls behind, the system deploys new consumer instances
  ○ automatically adjusts cluster size
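
The slides do not describe how the scaling algorithm decides when a consumer has fallen behind. One plausible rule, shown here purely as an assumption, is to size each consumer type by its queue backlog: pick enough instances to drain the backlog within a target time. All names and parameters below are hypothetical.

```python
import math


def desired_instances(backlog: int, per_instance_rate: float,
                      target_drain_seconds: float,
                      min_instances: int = 1, max_instances: int = 50) -> int:
    """Number of consumers needed to drain `backlog` items within the target time.

    per_instance_rate is the items/second one consumer instance can process.
    """
    needed = math.ceil(backlog / (per_instance_rate * target_drain_seconds))
    # Clamp so we never scale to zero or beyond the allowed cluster size.
    return max(min_instances, min(max_instances, needed))


print(desired_instances(backlog=6000, per_instance_rate=10, target_drain_seconds=60))
```

A controller would evaluate this periodically per consumer type and start or stop instances to match the result.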

MongoDB

● 10+ clusters
● Our biggest cluster
  ○ 1500 operations/second
  ○ m2.xlarge instances (17GB RAM, 6.5 ECU)
  ○ 8x80 GB RAID10
● Hard to manage databases in multiple clusters
  ○ we wrote mongo-pool: https://github.com/uberVU/mongo-pool
● Cluster pyramid structure for cost efficiency
● Communication between clusters through our own mongo-oplogreplay: https://github.com/uberVU/mongo-oplogreplay

MongoDB mention clusters pyramid

Kestrel

● Distributed message queue developed by Twitter
● Uses Memcache protocol
● Disk persistence
● 400 consumer operations/second
● Part of our pipeline core
● Extremely reliable
● Sending gzipped content to save I/O costs
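
The gzip point is easy to illustrate: serialize the mention to JSON, compress it before it goes on the queue, and reverse the steps on the consumer side. This sketch only covers encoding and decoding with the standard library; the queue client itself is left out, and the function names are made up for the example.

```python
import gzip
import json


def encode_message(mention: dict) -> bytes:
    """Serialize to JSON and gzip it before enqueueing."""
    return gzip.compress(json.dumps(mention).encode("utf-8"))


def decode_message(payload: bytes) -> dict:
    """Reverse of encode_message, run by the consumer after dequeueing."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))


mention = {"source": "blog", "text": "big data " * 200}
payload = encode_message(mention)
raw_size = len(json.dumps(mention).encode("utf-8"))
print(raw_size, len(payload))
```

Long, repetitive text such as blog content compresses well, so the queue moves and persists far fewer bytes per message.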

Redis, Memcache

● Gradually replacing Memcache with Redis
● Used for high-access temporary information
● 60 GB worth of data

Other technologies

● RabbitMQ for asynchronous tasks
● DynamoDB
  ○ analytics
  ○ auxiliary permanent storage use cases that don’t take a lot of space
● S3 for data with low access rates
● Glacier for archived data

System metrics & monitoring

● Graphite
● 150K system metrics
● Alerts are generated from Graphite metrics
● In-house alert detection
● Nagios/Nagstamon
● PagerDuty for on-call

Analytics overview

● Currently in DynamoDB
● MongoDB still runs in parallel, considering a move back to MongoDB
● Breakdowns on language, location, sentiment, gender
● Support for several resolutions (day, hour, 15min)
● Optimized for language and location filtering (95% of our queries)
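
Supporting several resolutions usually means mapping each mention timestamp to the start of its day, hour and 15-minute bucket, then incrementing a counter per bucket. The sketch below shows only the bucketing step; aligning buckets to UTC and the `RESOLUTIONS` table are assumptions, not details from the talk.

```python
from datetime import datetime, timezone

# Bucket width in seconds for each supported resolution.
RESOLUTIONS = {"day": 86400, "hour": 3600, "15min": 900}


def bucket_start(ts: datetime, resolution: str) -> datetime:
    """Round a timestamp down to the start of its bucket (UTC-aligned)."""
    step = RESOLUTIONS[resolution]
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % step, tz=timezone.utc)


ts = datetime(2014, 5, 6, 13, 47, 12, tzinfo=timezone.utc)
for name in RESOLUTIONS:
    print(name, bucket_start(ts, name).isoformat())
```

A storage key could then combine the stream, the breakdown value (for example a language code), the resolution and the bucket start.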

Aggregation pipeline

● We aggregate analytics to reduce writes
● Efficient but simple concurrency through Redis primitives
● Got a 5x improvement!
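
The idea of aggregating to reduce writes can be sketched as a buffer that sums increments per key and writes each key once per flush. This version is in-memory and single-process for simplicity, so it does not show the Redis-based concurrency mentioned above; with Redis, the counters could instead be bumped with atomic increments such as `HINCRBY`. The class and parameter names are hypothetical.

```python
from collections import Counter


class WriteAggregator:
    """Buffer counter increments and write one total per key on flush."""

    def __init__(self, write_fn, flush_every: int = 1000):
        self._counts = Counter()
        self._pending = 0
        self._write_fn = write_fn      # called as write_fn(key, total)
        self._flush_every = flush_every

    def record(self, key: tuple, amount: int = 1) -> None:
        self._counts[key] += amount
        self._pending += 1
        if self._pending >= self._flush_every:
            self.flush()

    def flush(self) -> None:
        for key, total in self._counts.items():
            self._write_fn(key, total)
        self._counts.clear()
        self._pending = 0


writes = []
agg = WriteAggregator(lambda key, total: writes.append((key, total)), flush_every=4)
for lang in ["en", "en", "ro", "en"]:
    agg.record(("language", lang))
print(writes)
```

Here four recorded events turn into two database writes; the saving grows with how often the same key repeats inside a flush window.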

Aggregation pipeline (2)

Tagcloud

● Tagcloud algorithm that can detect n-grams of all lengths
● Some of the data we analyze is blog content, which can be very big
● Needed something fast
● In-house algorithm
  ○ linear complexity (doesn’t go up with max n-gram size)
  ○ based on statistical correlations

Signals

● Need to synthesize all the data we’re collecting
● Top Stories
  ○ O(1) algorithm
● Influencers
  ○ dependency graph
  ○ edges are interactions between users
● Spikes & Bursts
  ○ code written in C++ to reduce time
  ○ statistical algorithms on top of our analytics timeseries
  ○ adapt reads from analytics based on data size
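
The slides do not say which statistical algorithms are used for spikes. As a generic illustration only, a rolling z-score flags a point when it sits far above the mean of the preceding window. The window size, threshold and the `min_sigma` floor (which avoids division by zero on flat series) are all assumptions, and this is written in Python rather than C++ for brevity.

```python
from statistics import mean, pstdev


def find_spikes(series: list, window: int = 5, threshold: float = 3.0,
                min_sigma: float = 1.0) -> list:
    """Return indexes whose value exceeds the trailing mean by `threshold` sigmas."""
    spikes = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = max(pstdev(history), min_sigma)  # floor keeps flat windows stable
        if (series[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


mentions_per_hour = [10, 11, 9, 10, 10, 50, 10, 11]
print(find_spikes(mentions_per_hour))
```

Run over an analytics timeseries at a given resolution, the returned indexes map back to the time buckets where activity jumped.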

Boards

● Needed a good way to display all this data
● Designed Boards
  ○ released this year
  ○ allows you to create a dashboard with metric visualizations
  ○ drag-drop widgets to arrange them your way

Boards Demo

Future plans

● Hootsuite has millions of users
● Our analytics infrastructure will have to scale
  ○ Transitioning to streaming services
  ○ Larger MongoDB clusters
    ■ more shards for write throughput
    ■ more secondaries for reads
● Add more metrics

Questions?

