aws re:invent 2014 | (arc202) real-world real-time analytics

38
SaaS Company – since 2008 Social Media Analytics track and measure activity of brands and personality, providing information to market research & brand comparison Multi Language Technology (English, Portuguese and Spanish) Leader in Latin America, with operations in 5 countries, customers in LatAm and US 1 out of 34 Twitter Certified Program Worldwide

Upload: socialmetrix

Post on 19-Jul-2015

116 views

Category:

Technology


1 download

TRANSCRIPT

•  SaaS Company – since 2008

•  Social Media Analytics track and measure activity of brands and personality, providing information to market research & brand comparison

•  Multi Language Technology (English, Portuguese and Spanish)

•  Leader in Latin America, with operations in 5 countries, customers in LatAm and US

•  1 out of 34 Twitter Certified Program Worldwide

Our customers

Ranking Brand 1 Brand 2 Brand 3

Q2 Q3 Q2 Q3 Q2 Q3

1° Flavor Breakfast Flavor Flavor Advertising Flavor

2° Healthy Flavor Packaging Brand I love Flavor Breakfast

3° Components Components Healthy Packaging Healthy Healthy

4° Advertising Healthy Components Addiction Components Advertising

5° Enquires Desire Prices Consumption Prices Components

TOTAL 1.401 8.189 463 5.519 1.081 2.445

Share of Topics

Which conversation my brand and my competitors are driving?

smx.io/reinvent #reinvent

Challenges

Challenges: Variety • Different data sources

• Different API

• SLA

• Method (Pull or Push)

• Rate-Limit, Backoff strategy

Challenges: Velocity • Updates every second

•  Top users, top hashtags each minute

• After event analysis are made with batch over complete dataset

• Spikes of 20,000+ tweets per minute

Last TV Debate

Results Announced

Challenges: Meaning • Disambiguation • Data Enrichment

–  Demographics –  Sentiment –  Influencers

• Human Analysis

PAN

Orange Telecom

Oi Telecom Hi!

Challenges: Alert & Report • Clear & Understandable UI

• Slice-dice for business (not BI experts)

• Real-time Alerts for Anomalies

Architecture Evolution

Drivers for Architecture Evolution

•  More customers, bigger customers

•  Add new features

•  Keep costs under control

Architecture Evolution

0

20

40

60

80

100

120

#1 #2 #3 #4

Act

ive

Cus

tom

ers

Architecture – 1st iteration What we needed: •  Complete data isolation •  Trying different solutions/offerings

Architecture – 1st iteration

What we did:

•  All-in-one approach

•  Multi instance architecture

•  Simple vertical scalability

•  MySQL performance tunning

Architecture – 1st iteration

What we've learned: •  Multi-instance is harder to administrate, but

minimize instability impact on customers

•  Vertical scalability: poor resource management

•  MySQL schema changes translates into downtime

Architecture – 2nd iteration

What we needed: •  Separation of Responsabilities (crawling,

processing)

•  Horizontal Scalability

•  Fast Provisioning

•  Costs reduction

Architecture – 2nd iteration

What we changed: •  Migrated to AWS

•  RabbitMQ (Single Node)

•  Replace MySQL for RDS

•  Cloud Formation

•  Auto Scaling Groups

Architecture – 2nd iteration

What we've learned: •  PIOPs à

•  Tuning the auto scaling policies can be hard

•  Cloud Formation: great for migration, not enough

for daily ops

Architecture – 3rd iteration

What we needed:

•  Deliver new features (NRT, more complex analytics)

•  Scale Fast

•  Be resilient against failure

•  Adding and improving data-sources

•  Keep costs under control (always)

Architecture – 3rd iteration

What we changed:

•  Apache Storm

•  RabbitMQ HA

•  EMR (Hadoop/Hive)

•  CloudFormation + Chef

•  Glacier + S3 lifecycles policies

Architecture – 3rd iteration

What we've learned: •  Spot instances + Reserved instances

•  Hive = SQL à SQL scripts are hard to test

•  Bulk upserts on RDS can be expensive (PIOPS)

•  DynamoDB is great, but expensive (for our use-case)

Dashboard

Architecture – 4th iteration

What we needed: •  Monitor millions of social media profiles

•  Make data accessible (exploration, PoC)

•  Improve UI response times

•  Testing our data pipelines

•  Reprocessing (faster)

Architecture – 4th iteration

What we changed:

•  Cassandra (DSE)

•  MongoDB MMS

•  Apache Spark

What we've learned: •  Leverage on AWS ecosystem

•  Datastax AMI + Opscenter integration •  MongoDB MMS: automation magic! •  Apache Spark unit testing + ec2 launch scripts

•  EMR doesn’t have the latest stable versions

Architecture – 4th iteration

Architecture Evolution

-

20

40

60

80

100

120

140

160

0

20

40

60

80

100

120

#1 #2 #3 #4

Act

ive

Cus

tom

ers

Costs Customers

Lessons Learned

Lessons Learned •  Automate since day 1 (cloudformation + chef)

•  Monitor systems activity, understand your data patterns. eg: LogStash (ELK)

•  Always have a Source of Truth (S3 + Glacier)

•  Make your Source of Truth Searchable

Lessons Learned (II) • Approximation is a good thing: HLL, CMS, Bloom

• Write your pipelines considering reprocessing needs

•  Avoid at all costs framework explosion • AWS ecosystem allows rapid prototype

Socialmetrix NextGen 2015

Architecture Evolution

0

20

40

60

80

100

120

#1 #2 #3 #4

Act

ive

Cus

tom

ers

Architecture NextGen •  Reduce moving parts •  Apache Spark as central processing framework

–  Realtime (Micro-batch) –  Batch-processing

•  Kafka (Message Broker) •  Cassandra (Time-series storage) •  ElasticSearch (Content Indexer)

To infinity … and beyond! Architecture

Evolution

0

20

40

60

80

100

120

#1 #2 #3 #4 NextGen

Act

ive

Cus

tom

ers

Gustavo Arjones, CTO @arjones | [email protected] Sebastian Montini, Solutions Architect @sebamontini | [email protected] Let’s talk at Venetian-Titian Hallway

Feedback and Q&A

Please give us your feedback on this presentation

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Join the conversation on Twitter with #reinvent

ARC202 Thank you!