Big Data @ Hootsuite Analytics

Big Data @Hootsuite Analytics Claudiu Coman

Upload: claudiu-coman

Post on 12-Jun-2015


Category:

Engineering



DESCRIPTION

An overview of how the Hootsuite Bucharest office aggregates data into analytics and insights.

TRANSCRIPT

Page 1: Big data @ Hootsuite Analytics

Big Data @ Hootsuite Analytics
Claudiu Coman

Page 2: Big data @ Hootsuite Analytics

About us

● You might have heard of uberVU
● Acquired by Hootsuite to develop their new analytics
● New scalability challenges => millions of customers
● Everything is in the cloud => Amazon
● 10 M social media posts per day
● 100+ Amazon EC2 instances of various sizes
● 650 GB of media posts per month
● 400 GB worth of analytics data
● 10+ MongoDB clusters

Page 3: Big data @ Hootsuite Analytics

What this presentation will be about

● What is the big data we’re working with
● Infrastructure
● Technologies
● What we do on top of the data
● How we currently display the data
● What’s in store for the future

Page 4: Big data @ Hootsuite Analytics

Our data

● Our data is made up of media posts
● Revolves around search queries
● You can also connect accounts and pages
● Sources: Twitter, Facebook, Google+, WordPress, Flickr, Picasa and others
● To acquire data
  ○ a lot of REST
  ○ some streaming
  ○ very little scraping

Page 5: Big data @ Hootsuite Analytics

Mentions

● Every piece of data that we process
  ○ gets normalized
  ○ gets annotated
  ○ gets converted to a standard JSON format
● We call the end result a mention
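The normalize, annotate, and convert steps can be pictured with a small Python sketch. The slides do not show the actual mention schema, so every field name here (`source`, `author`, `annotations`, and the shape of the raw payload) is a hypothetical stand-in chosen for illustration.

```python
import json
from datetime import datetime, timezone


def normalize_tweet(raw):
    """Map a hypothetical Twitter-like payload onto one common mention shape."""
    published = datetime.fromtimestamp(raw["created_ts"], tz=timezone.utc)
    return {
        "source": "twitter",                  # assumed field name
        "id": str(raw["id"]),
        "author": raw["user"]["screen_name"],
        "text": raw["text"].strip(),
        "published": published.isoformat(),
        "annotations": {},                    # filled in by later steps
    }


def annotate(mention, **annotations):
    """Return a copy of the mention with extra annotations merged in."""
    merged = dict(mention["annotations"], **annotations)
    return dict(mention, annotations=merged)


def to_json(mention):
    """Serialize a mention to JSON with a stable key order."""
    return json.dumps(mention, sort_keys=True)
```

Each source would get its own `normalize_*` function, while everything downstream only ever sees the common shape.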

Page 6: Big data @ Hootsuite Analytics

Mentions (2)

Page 7: Big data @ Hootsuite Analytics

Annotations

● Language detection
  ○ written in C++, wrapped in Python
  ○ can process ~300 mentions/second
● Sentiment detection
  ○ external provider
  ○ piggybacking for efficiency
● Location detection
  ○ in-house algorithm
  ○ text tokenization and matching against a locations database
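To make "text tokenization and matching against a locations database" concrete, here is a minimal Python sketch. It is not the in-house algorithm, which the slides do not describe. The `LOCATIONS` dictionary is a tiny hypothetical stand-in for a real locations database, and the longest-match rule is an assumption.

```python
import re

# Hypothetical stand-in for a locations database: normalized name -> country code.
LOCATIONS = {
    "bucharest": "RO",
    "vancouver": "CA",
    "new york": "US",
    "london": "GB",
}
MAX_WORDS = max(len(name.split()) for name in LOCATIONS)


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def detect_locations(text):
    """Return (name, country) pairs for every known location found in the text."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word names are matched whole.
        for size in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in LOCATIONS:
                found.append((candidate, LOCATIONS[candidate]))
                i += size
                break
        else:
            i += 1
    return found
```

Because each lookup is a dictionary hit and the candidate length is bounded by `MAX_WORDS`, the scan stays roughly linear in the number of tokens.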

Page 8: Big data @ Hootsuite Analytics

The Pipeline

● Our processing infrastructure is a pipeline
● Producer-Consumer pattern
● Enables us to scale parts of the infrastructure separately
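The producer-consumer pattern can be sketched in a single process with Python's standard `queue` and `threading` modules. The real pipeline is described later in the deck as using Kestrel queues between separate consumer instances. The function names and the `upper()` placeholder for "processing" are assumptions made for this example.

```python
import queue
import threading

SENTINEL = None  # tells one consumer that the stream has ended


def producer(posts, out_queue, n_consumers):
    """Push every post onto the queue, then one stop marker per consumer."""
    for post in posts:
        out_queue.put(post)
    for _ in range(n_consumers):
        out_queue.put(SENTINEL)


def consumer(in_queue, results, lock):
    """Pull posts until the stop marker arrives, processing each one."""
    while True:
        post = in_queue.get()
        if post is SENTINEL:
            break
        processed = post.upper()  # placeholder for real processing
        with lock:
            results.append(processed)


def run_pipeline(posts, n_consumers=3):
    """Run one producer and n consumers, returning the processed posts."""
    q = queue.Queue(maxsize=100)  # bounded, so a slow stage applies back-pressure
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(q, results, lock))
        for _ in range(n_consumers)
    ]
    for worker in workers:
        worker.start()
    producer(posts, q, n_consumers)
    for worker in workers:
        worker.join()
    return results
```

Changing `n_consumers` scales only the consuming stage, which is the property the slide highlights: each part of the pipeline can be sized independently of the others.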

Page 9: Big data @ Hootsuite Analytics

The pipeline (2)

Page 10: Big data @ Hootsuite Analytics

The pipeline (3)

● 100+ consumer types
● 450 consumer instances
● Automatic scaling algorithm developed by us
  ○ whenever a consumer falls behind, the system deploys new consumer instances
  ○ automatically adjusts cluster size
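The slides do not describe the scaling algorithm itself. As one plausible illustration of "deploy new instances when a consumer falls behind", the sketch below sizes a consumer type from its queue backlog. The function name, its parameters, and the drain-time target are all hypothetical.

```python
import math


def desired_instances(backlog, per_instance_rate, target_drain_seconds,
                      min_instances=1, max_instances=50):
    """Return how many instances would drain `backlog` within the target time.

    backlog: messages currently waiting in the queue
    per_instance_rate: messages one instance processes per second
    target_drain_seconds: how quickly the backlog should be cleared
    """
    if per_instance_rate <= 0 or target_drain_seconds <= 0:
        raise ValueError("rate and drain target must be positive")
    needed = math.ceil(backlog / (per_instance_rate * target_drain_seconds))
    # Clamp so the pool never disappears and never grows without bound.
    return max(min_instances, min(max_instances, needed))
```

For example, a backlog of 6000 messages, with instances that handle 10 messages per second and a 60 second drain target, calls for `desired_instances(6000, 10, 60)` instances, which evaluates to 10.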

Page 11: Big data @ Hootsuite Analytics

MongoDB

● 10+ clusters
● Our biggest cluster
  ○ 1500 operations/second
  ○ m2.xlarge instances (17 GB RAM, 6.5 ECU)
  ○ 8x80 GB RAID10
● Hard to manage databases in multiple clusters
  ○ we wrote mongo-pool https://github.com/uberVU/mongo-pool
● Cluster pyramid structure for cost efficiency
● Communication between clusters through our own mongo-oplogreplay https://github.com/uberVU/mongo-oplogreplay

Page 12: Big data @ Hootsuite Analytics

MongoDB mention clusters pyramid

Page 13: Big data @ Hootsuite Analytics

Kestrel

● Distributed message queue developed by Twitter
● Uses the Memcache protocol
● Disk persistence
● 400 consumer operations/second
● Part of our pipeline core
● Extremely reliable
● Sending gzipped content to save I/O costs
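The gzip step can be shown on its own with Python's standard library. This sketch covers only the packing and unpacking of a message body and does not talk to Kestrel. The `pack` and `unpack` names and the mention fields are assumptions.

```python
import gzip
import json


def pack(mention):
    """Serialize a mention to compact JSON and gzip it before enqueueing."""
    body = json.dumps(mention, separators=(",", ":")).encode("utf-8")
    return gzip.compress(body)


def unpack(payload):
    """Reverse of pack: gunzip the payload and parse the JSON."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

Social media and blog text tends to be repetitive, so the compressed payload is usually much smaller than the raw JSON, at the cost of some CPU time on both the producer and the consumer side.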

Page 14: Big data @ Hootsuite Analytics

Redis, Memcache

● Gradually replacing Memcache with Redis
● Used for high-access temporary information
● 60 GB worth of data

Page 15: Big data @ Hootsuite Analytics

Other technologies

● RabbitMQ for asynchronous tasks
● DynamoDB
  ○ analytics
  ○ auxiliary permanent storage use cases that don’t take a lot of space
● S3 for data with low access rates
● Glacier for archived data

Page 16: Big data @ Hootsuite Analytics

System metrics & monitoring

● Graphite
● 150K system metrics
● Alerts are generated from Graphite data
● In-house alert detection
● Nagios/Nagstamon
● PagerDuty for on-call

Page 17: Big data @ Hootsuite Analytics

Analytics overview

● Currently in DynamoDB
● MongoDB still runs in parallel, considering a move back to MongoDB
● Breakdowns on language, location, sentiment, gender
● Support for several resolutions (day, hour, 15min)
● Optimized for language and location filtering (95% of our queries)
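Supporting several resolutions usually means flooring each mention's timestamp to the start of a 15-minute, hourly, and daily bucket, and keeping one counter per bucket. The Python sketch below illustrates that idea only. The key layout, with language and location as part of the key, is an assumption and not the actual DynamoDB or MongoDB schema.

```python
# Bucket sizes in seconds for the three resolutions named on the slide.
RESOLUTIONS = {"15min": 15 * 60, "hour": 60 * 60, "day": 24 * 60 * 60}


def bucket_start(ts, resolution):
    """Floor a Unix timestamp (seconds, UTC) to the start of its bucket."""
    size = RESOLUTIONS[resolution]
    return ts - ts % size


def bucket_keys(ts, language, location):
    """Build one hypothetical counter key per resolution for a mention."""
    return {
        res: (res, bucket_start(ts, res), language, location)
        for res in RESOLUTIONS
    }
```

Putting language and location into the key is one way to make the most common filters cheap, since a filtered query becomes a direct key lookup instead of a scan.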

Page 18: Big data @ Hootsuite Analytics

Aggregation pipeline

● We aggregate analytics to reduce writes
● Efficient but simple concurrency through Redis primitives
● Got a 5x improvement!
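The idea of aggregating before writing can be sketched as follows: increments are accumulated per key and then flushed as one batch, so many events collapse into a few writes. The slides mention Redis primitives without saying which ones, so this example deliberately uses an in-process `Counter` instead. The `WriteAggregator` class and its `write_batch` callback are hypothetical.

```python
from collections import Counter


class WriteAggregator:
    """Accumulate counter increments and flush them in batches."""

    def __init__(self, flush_threshold, write_batch):
        self.flush_threshold = flush_threshold
        self.write_batch = write_batch  # callable that persists {key: delta}
        self.pending = Counter()
        self.events_seen = 0

    def record(self, key, amount=1):
        """Add one event; flush once enough events have been buffered."""
        self.pending[key] += amount
        self.events_seen += 1
        if self.events_seen >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write all pending deltas as one batch and reset the buffer."""
        if self.pending:
            self.write_batch(dict(self.pending))
        self.pending.clear()
        self.events_seen = 0
```

With a threshold of 4, the events `en, en, ro, en` result in a single batch `{"en": 3, "ro": 1}` instead of four separate writes. In a multi-process setup the buffer would have to live in a shared store, which is presumably where Redis comes in.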

Page 19: Big data @ Hootsuite Analytics

Aggregation pipeline (2)

Page 20: Big data @ Hootsuite Analytics

Tagcloud

● Tagcloud algorithm that can detect n-grams of all lengths
● Some of the data we analyze is blog content, which can be very big
● Needed something fast
● In-house algorithm
  ○ linear complexity (doesn’t go up with max n-gram size)
  ○ based on statistical correlations
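The in-house tagcloud algorithm is not described in the slides, so the following Python sketch is only one way to get n-grams of arbitrary length in linear time. It is a swapped-in technique: adjacent words are chained into a phrase whenever their bigram is frequent relative to the rarer of the two words. The scoring rule and both thresholds are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extract_phrases(text, min_pair_count=2, min_score=0.6):
    """Count phrases of any length using two linear passes over the tokens."""
    tokens = tokenize(text)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def linked(a, b):
        # Two words belong together if they co-occur often enough,
        # relative to how often the rarer word appears at all.
        pair = bigrams[(a, b)]
        return (pair >= min_pair_count
                and pair / min(unigrams[a], unigrams[b]) >= min_score)

    phrases = Counter()
    current = tokens[:1]
    for prev, word in zip(tokens, tokens[1:]):
        if linked(prev, word):
            current.append(word)  # extend the phrase; no max length needed
        else:
            phrases[" ".join(current)] += 1
            current = [word]
    if current:
        phrases[" ".join(current)] += 1
    return phrases
```

Each token is visited a constant number of times, so the cost does not depend on the longest phrase found, which mirrors the linear-complexity property claimed on the slide.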

Page 21: Big data @ Hootsuite Analytics

Signals

● Need to synthesize all the data we’re collecting
● Top Stories
  ○ O(1) algorithm
● Influencers
  ○ dependency graph
  ○ edges are interactions between users
● Spikes & Bursts
  ○ code written in C++ to reduce running time
  ○ statistical algorithms on top of our analytics timeseries
  ○ adapt reads from analytics based on data size
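The slides only say that spike detection uses statistical algorithms over the analytics timeseries, implemented in C++. As an illustration of the general idea, here is a rolling z-score detector in Python. The technique, the window size, the threshold, and the `min_sigma` floor are all assumptions and not the production algorithm.

```python
from statistics import mean, pstdev


def find_spikes(series, window=6, threshold=3.0, min_sigma=1.0):
    """Return indexes whose value is far above the preceding window's mean.

    series: mention counts per time bucket, oldest first
    window: how many previous buckets form the baseline
    threshold: how many standard deviations count as a spike
    min_sigma: floor for the deviation, so flat baselines do not divide by zero
    """
    spikes = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        baseline = mean(history)
        sigma = max(pstdev(history), min_sigma)
        if (series[i] - baseline) / sigma >= threshold:
            spikes.append(i)
    return spikes
```

For the series `[10, 12, 11, 9, 10, 11, 50, 10]`, only index 6 (the value 50) is reported. The bucket after it is not, because it falls below its baseline.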

Page 22: Big data @ Hootsuite Analytics

Boards

● Needed a good way to display all this data
● Designed Boards
  ○ released this year
  ○ allows you to create a dashboard with metric visualizations
  ○ drag and drop widgets to arrange them your way

Page 23: Big data @ Hootsuite Analytics

Boards Demo

Page 24: Big data @ Hootsuite Analytics

Future plans

● Hootsuite has millions of users
● Our analytics infrastructure will have to scale
  ○ Transitioning to streaming services
  ○ Larger MongoDB clusters
    ■ more shards for write throughput
    ■ more secondaries for reads
● Add more metrics

Page 25: Big data @ Hootsuite Analytics

Questions?

[email protected]