© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Michael Muckel, Head of Data Platform
Markus Schmidberger, Data Platform Architect
Glomex GmbH – A ProSiebenSat.1 Media SE company
Berlin, April 12th 2016
Big Data is Dead, Long Live Business Intelligence?
Glomex: A ProSiebenSat.1 company
Glomex – The Global Media Exchange
Publishers
Content providers
Video Value Platform
Media Delivery Platform
Media Exchange Platform
Glomex
External broadcasters
Web-only content owners
Non-P7S1 publishers
Glomex – Data Platform
Video Value Platform, Media Delivery Platform, Media Exchange Platform
Data Platform
Real-time-Monitoring, Batch Analytics, Machine Learning
Key Components of our New Data Platform
Content Discovery: Find the most relevant content for our customers and their users.
Real-Time Monitoring: Enable our development teams to serve our content to our users in the best quality possible.
Analytics: Provide our teams access to the data to enable data-driven development of new features and products.
Lambda Architecture
Graphic provided by http://lambda-architecture.net
≠ AWS Lambda
Simplify Data Processing
data → ingest / collect → store → process / analyze → visualize / serve → answers
Constraints: Time to Answer (Latency), Throughput, Cost
more concrete numbers at the end
Data Processing in Big Data World
Diagram stages: Collect, Store, Analyze, Consume
Sources and data types: iOS, Android, Web Apps, Mobile Apps, Applications, IoT, Transactional Data, File Data, Stream Data, Search Data
Services and tools: Logstash, Amazon RDS, Amazon DynamoDB, Amazon ES, Amazon S3, Apache Kafka, Amazon Glacier, Amazon Kinesis, Amazon ElastiCache, Amazon Redshift, Impala, Pig, Amazon ML, AWS Lambda, Amazon Elastic MapReduce, Amazon QuickSight
Categories: Logging, Search, SQL, NoSQL, Cache, Stream Storage, File Storage, Streaming, Stream Processing, Batch, Interactive, ML, ETL, Analysis & Visualization, Notebooks, Predictions, Apps & APIs, IDE
Data temperature and speed labels: Hot, Warm, Cold, Fast, Slow
Our Data Platform Architecture
Data Platform - MicroService Layout
Stages: INGEST, STORE, PROCESS & ANALYSE, VISUALIZE & SERVE
Import services: AdProxy Log Import Service, Player Feedback Import Service, VAS Log Import Service, CDN Log Import Service, External Data Import Service, Content Import Service
Other services: Data Platform Access, Data Science Analytics Service, Technical Monitoring Service, Dev / Ops Analytics Service, Content Discovery Service, KPI & Analytics Service, Metadata Service, Data Platform Monitoring Service, Data Quality Service, Data Management Service
Data components: Data Layer, Data API, Data Lake
Other labels: Portal, Team, CDN files, data stream (three times), other modules, Real-Time Dashboards, Content API, Data Science UI
Real-Time Player Monitoring
Monitoring Video-Streaming Experience
Focus on Metrics from the User's Perspective
From Server-Uptime To (anonymized) Real-User Monitoring
1. Analyze
2. Take Actions
3. Automate
Our Ingest Process
Kinesis Firehose is doing its job
Next session: “Streaming Data: The Opportunity and
How to Work With It”
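The slides do not show the ingest code, so the following is only a minimal sketch of how a producer might push click-stream events into Kinesis Firehose. It assumes Python with boto3; the delivery stream name `clickstream` and the function names are hypothetical. Since `PutRecordBatch` accepts at most 500 records per call, the events are chunked first.

```python
import json
from typing import Iterable, Iterator, List

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_batches(events: Iterable[dict], size: int = MAX_BATCH) -> Iterator[List[dict]]:
    """Yield Firehose-ready record lists of at most `size` entries."""
    batch: List[dict] = []
    for event in events:
        # One JSON document per line keeps the delivered files line-delimited.
        batch.append({"Data": (json.dumps(event) + "\n").encode("utf-8")})
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send(events: Iterable[dict], client, stream: str = "clickstream") -> int:
    """Send events via `client` (a boto3 Firehose client); return the failed-record count."""
    failed = 0
    for batch in to_batches(events):
        response = client.put_record_batch(DeliveryStreamName=stream, Records=batch)
        failed += response.get("FailedPutCount", 0)
    return failed
```

A caller would pass `boto3.client("firehose")` as `client`. Records counted as failed are not retried in this sketch; a production producer would resend them.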
Data Facts
20 GB per day of click-stream data in Kinesis Firehose
5 billion records processed per day
~100 ms data freshness to S3
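To put the daily record count into perspective, it can be converted into an average rate. This is plain arithmetic on the figure from the slide; peak rates are not stated in the source and are not estimated here.

```python
RECORDS_PER_DAY = 5_000_000_000  # "5 Billion" records per day, from the slide
SECONDS_PER_DAY = 24 * 60 * 60

avg_records_per_second = RECORDS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_records_per_second:,.0f} records/s on average")  # 57,870 records/s on average
```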
ElasticSearch + Grafana for real-time analyses
Not AWS managed!
ElasticSearch on Spot Instances
CDN 'Batch Processing'
Processing CDN-Logs
25 GB per day as zipped log files
300 million records processed per day
+
Normal challenges with external data sources: out-of-order delivery, data quality issues, varying file sizes, etc.
Requirements for our Data Processing Pipeline
Monitor Complete Pipeline
Enable Reprocessing of Historical Datasets
Be Ready to Scale
Our CDN Pipeline
• How to process an 800 MB gzipped log file?
• How to split compressed gzip files?
• Splitter using Amazon SQS and Amazon EC2 Spot Instances
AWS Lambda Limits
5 min: AWS Lambda timeout
512 MB: AWS Lambda temp disk
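A gzip stream cannot be cut at arbitrary byte offsets, so one way to split a large file is to decompress it as a stream and write line-bounded chunks that are each small enough for a single function run. The sketch below covers only that splitting step in Python. The function name and the chunk size are hypothetical, and the SQS and Spot Instance parts of the splitter mentioned above are not shown in the slides.

```python
import gzip
from pathlib import Path
from typing import List


def split_gzip(src: Path, out_dir: Path, lines_per_chunk: int = 100_000) -> List[Path]:
    """Stream `src` and write gzip chunks of at most `lines_per_chunk` lines each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    out = None
    try:
        # Reading line by line keeps memory use flat, whatever the input size.
        with gzip.open(src, "rb") as reader:
            for index, line in enumerate(reader):
                if index % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"{src.stem}.part{len(chunks):05d}.gz"
                    out = gzip.open(path, "wb")
                    chunks.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunks
```

Each returned chunk could then be announced as one queue message, so that workers process the chunks in parallel.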
Our Meta Data Store
AWS Big Data Blog:
https://blogs.aws.amazon.com/bigdata/post/Tx2YRX3Y16CVQFZ/Building-and-Maintaining-an-Amazon-S3-Metadata-Index-without-Servers
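The referenced blog post describes keeping an index of S3 object metadata up to date from a function that reacts to S3 events. As a rough sketch of that idea in Python, the code below turns an S3 event notification into index entries. The attribute names of the entries are hypothetical, and writing them to a table is left out.

```python
from typing import Dict, List
from urllib.parse import unquote_plus


def index_items(event: dict) -> List[Dict[str, object]]:
    """Turn an S3 event notification into one metadata item per object."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        items.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size_bytes": s3["object"].get("size", 0),
            "event_time": record["eventTime"],
        })
    return items


def handler(event: dict, context: object) -> int:
    """Lambda entry point; a real function would persist the items."""
    return len(index_items(event))
```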
Be serverless and serve data
AWS Lambda, AWS Lambda, Amazon API Gateway, Amazon Kinesis
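A minimal sketch of what serving data without servers could look like: a Lambda function behind Amazon API Gateway that answers a request with JSON. It assumes the Lambda proxy integration response shape (`statusCode`, `headers`, `body`) and a path parameter named `video_id`. The in-memory `KPIS` dictionary and its values are hypothetical placeholders for the real data store.

```python
import json

# Hypothetical stand-in for the real data store behind the API.
KPIS = {"video-123": {"plays": 42, "avg_bitrate_kbps": 1800}}


def handler(event: dict, context: object) -> dict:
    """Return the KPIs for the video id given in the request path."""
    video_id = (event.get("pathParameters") or {}).get("video_id")
    kpis = KPIS.get(video_id)
    if kpis is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown video"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(kpis),
    }
```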
CDN Batch Facts
Processing time: 600 rec/sec
Cost for 25 GB/day CDN processing: 1 $ / hour
6 parallel AWS Lambda functions
2.3 min average run-time of AWS Lambda
Charts: AWS Lambda duration, Redshift CPU
Data Science Environment
Project Jupyter: http://jupyter.org/
Data Science Environment - Architecture
Diagram layers: Data Sources, Cluster, Technology, Development
Components shown: Amazon Redshift, Amazon S3, Elasticsearch, Amazon EMR, Amazon Kinesis, Github
Two components are marked "In development".
Our Lambda Architecture on AWS
Data Platform - Lambda Architecture
Layers: Batch Layer, Speed Layer, Serving Layer, Applications
Components shown: Instance with Kinesis Agent, Amazon Kinesis Firehose, AWS Lambda (three times), S3, EC2 with ElasticSearch, Amazon Redshift, Amazon Elastic MapReduce + Spark, Amazon API Gateway, EC2 with Jupyter, EC2 with Grafana, EC2 with Caravel
Other labels: data stream, CDN files, Portal, Team, other player modules
Key Takeaways
Lambda Architecture
Enrich your traditional, batch-driven BI-workflow with real-time analytics
Use Lambda-Architecture as a guiding principle and adapt it to your needs
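The core idea of the Lambda Architecture is that a query combines a precomputed batch view with a real-time view covering the events that arrived since the last batch run. The toy Python sketch below illustrates that merge with play counts per video. All names are hypothetical, and it is not meant to mirror the actual Glomex implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # (video_id, unix_timestamp)


def batch_view(events: Iterable[Event], cutoff: int) -> Counter:
    """Batch layer: plays per video for all events before `cutoff`."""
    return Counter(video for video, ts in events if ts < cutoff)


def speed_view(events: Iterable[Event], cutoff: int) -> Counter:
    """Speed layer: plays per video for events at or after `cutoff`."""
    return Counter(video for video, ts in events if ts >= cutoff)


def serve(batch: Counter, speed: Counter) -> Dict[str, int]:
    """Serving layer: merge both views into the answer for a query."""
    return dict(batch + speed)
```

In practice the batch view would be recomputed periodically over the full history, while the speed view is kept small and discarded once the next batch run has caught up.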
Key Takeaways
AWS managed services provide a robust way to run complex big data infrastructures
Follow best practices provided by AWS and the community
Focus on feature development and robust pipelines, not on infrastructure management
Key Takeaways
Provide an open data environment
Structure your data so that it can be accessed in processed and raw form
Trust the creativity of your engineering teams to find insights in your datasets
Notebooks provide easy access to even large distributed datasets
Michael Muckel, Head of Data Platform
Markus Schmidberger, Data Platform Architect
Glomex GmbH – A ProSiebenSat.1 Media SE company
We are hiring …
• Data Scientists
• Data Engineers
• Project Managers