20141021 aws cloud taekwon - big data on aws
DESCRIPTION
A presentation by Markku Lepistö, AWS APAC Principal Technology Evangelist.
TRANSCRIPT
Big Data on AWS Markku Lepistö Principal Technology Evangelist @markkulepisto
Does this Data make me look big?
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Getting your Data into AWS
Amazon S3
Corporate Data Center
• Console Upload
• FTP
• AWS Import/Export
• S3 API
• Direct Connect
• Storage Gateway
• 3rd Party Commercial Apps
• Tsunami UDP
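The "S3 API" option above can be sketched in Python with boto3. This is a minimal sketch, not the speaker's code: it assumes boto3 is installed and AWS credentials are configured, and the bucket name, the date-partitioned key layout, and the helper names build_key and upload_log are hypothetical.

```python
import os
from datetime import date


def build_key(source: str, day: date, filename: str) -> str:
    """Build a date-partitioned object key (a hypothetical layout)."""
    return f"raw/{source}/{day:%Y/%m/%d}/{filename}"


def upload_log(s3_client, bucket: str, local_path: str, source: str, day: date) -> str:
    """Upload one local file to S3 and return the key it was stored under.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    key = build_key(source, day, os.path.basename(local_path))
    # upload_file(Filename, Bucket, Key) handles multipart upload for large files.
    s3_client.upload_file(local_path, bucket, key)
    return key


# Usage (assumed names; needs real credentials and an existing bucket):
#   import boto3
#   upload_log(boto3.client("s3"), "example-bucket",
#              "/var/log/app/access.log", "weblogs", date.today())
```

Passing the client in as a parameter keeps the helper easy to exercise with a stand-in object before pointing it at a real bucket.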
1
Write directly to a data source
Your application Amazon S3
DynamoDB
Any other data store
Amazon S3
Amazon EC2
2
Zero Admin NoSQL Service
Unlimited Storage
Provisioned Throughput
Consistent <10ms response
Durable on SSD
Services: Database: Amazon DynamoDB
Compute Storage
AWS Global Infrastructure
Database
Networking
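"Provisioned Throughput" means capacity is reserved up front in read and write capacity units. The arithmetic below is a rough sizing sketch based on the documented unit sizes (1 KB per write unit, 4 KB per read unit); the function names and the example workload are assumptions for illustration.

```python
import math


def write_capacity_units(item_size_bytes: int, writes_per_second: float) -> int:
    """One write capacity unit covers one write per second of an item up to 1 KB."""
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return math.ceil(units_per_write * writes_per_second)


def read_capacity_units(item_size_bytes: int, reads_per_second: float,
                        strongly_consistent: bool = True) -> int:
    """One read capacity unit covers one strongly consistent read per second
    (or two eventually consistent reads) of an item up to 4 KB."""
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total /= 2  # eventually consistent reads cost half as much
    return math.ceil(total)


# Example workload (hypothetical): 1,500-byte items, 100 writes/s, 50 reads/s
# of 6,000-byte items.
print(write_capacity_units(1500, 100))
print(read_capacity_units(6000, 50))
print(read_capacity_units(6000, 50, strongly_consistent=False))
```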
Queue, pre-process and then write to data source
Amazon Simple Queue Service
(SQS)
Amazon S3
DynamoDB
Any other data store
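The queue, pre-process, write pattern above can be sketched as a small SQS worker. This is an illustrative sketch assuming a boto3 SQS client; the queue URL, the JSON message shape, and the preprocess and write_record functions are hypothetical.

```python
import json


def preprocess(raw_body: str) -> dict:
    """Hypothetical pre-processing step: parse JSON and normalise the event name."""
    event = json.loads(raw_body)
    event["event"] = event["event"].strip().lower()
    return event


def drain_once(sqs_client, queue_url: str, write_record) -> int:
    """Receive one batch, pre-process each message, write it, then delete it.

    Returns the number of messages handled. write_record is any callable that
    stores one record (S3, DynamoDB, or another data store).
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,      # long polling reduces empty responses
    )
    handled = 0
    for message in response.get("Messages", []):
        write_record(preprocess(message["Body"]))
        # Delete only after the write succeeded, so a crash causes redelivery
        # rather than data loss.
        sqs_client.delete_message(QueueUrl=queue_url,
                                  ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

Because a message is deleted only after it has been written, the same message may be delivered more than once, so the write step should tolerate duplicates.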
3
Aggregate and write to data source
Flume running on EC2
Amazon S3
Any other data store
HDFS
4
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Choose depending upon design
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3 as a “single source of truth”
S3
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Hadoop based Analysis Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Amazon EMR
EMR is Hadoop in the Cloud
What is Amazon Elastic MapReduce (EMR)?
EMR Cluster
S3
Put the data into S3
Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
Get the output from S3
Launch the cluster using the EMR console, CLI, SDK, or APIs
You can also store everything in HDFS
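The launch choices listed above (distribution, number and type of nodes, steps) map onto the request passed to the EMR API. The sketch below only builds that request as a dictionary; it assumes boto3's run_job_flow call and a Hadoop streaming step, and the release label, instance type, S3 URIs, and default IAM role names are placeholder assumptions, not values from the talk.

```python
def build_cluster_request(name, input_uri, output_uri, mapper_uri, log_uri,
                          node_count=3, instance_type="m5.xlarge",
                          release_label="emr-6.15.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    mapper_file = mapper_uri.rsplit("/", 1)[-1]
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,            # which Hadoop distribution
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": instance_type,  # type of nodes
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,          # number of nodes, master included
            "KeepJobFlowAliveWhenNoSteps": False, # terminate when the step is done
        },
        "Steps": [{
            "Name": "streaming-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # The built-in "aggregate" reducer expects mapper output keys
                # such as "LongValueSum:<key>".
                "Args": ["hadoop-streaming",
                         "-files", mapper_uri,
                         "-mapper", mapper_file,
                         "-reducer", "aggregate",
                         "-input", input_uri,     # data already put into S3
                         "-output", output_uri],  # results read back from S3
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Usage (needs credentials and the default EMR roles in the account):
#   import boto3
#   boto3.client("emr").run_job_flow(**build_cluster_request(...))
```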
How does EMR work?
S3
What can you run on EMR…
EMR Cluster
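As one example of what can run on an EMR cluster, here is a word count written in the map and reduce style used by Hadoop streaming. It is a local, single-process sketch for illustration only: the function names are hypothetical, and sorted() stands in for the shuffle and sort phase that Hadoop performs between the two stages.

```python
from itertools import groupby


def mapper(lines):
    """Map stage: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce stage: sum the counts of each word; input must be sorted by word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Run map, sort, and reduce in one process."""
    return dict(reducer(sorted(mapper(lines))))


print(run_local(["big data on AWS", "Big data"]))
```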
SQL based processing Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Amazon EMR
Amazon Redshift
Pre-processing framework
Petabyte-scale columnar data warehouse
Amazon Redshift
• Easily and rapidly analyze petabytes of data
• Fully managed data warehouse service
• Automated deployment and administration
• 1/10th the cost of traditional data warehouses
• < $1000 / Terabyte / year
• Compatible with popular BI tools
Services: Database: Amazon Redshift
Compute Storage
AWS Global Infrastructure
Database
App Services
Deployment & Administration
Networking
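Pre-processed data in S3 is typically loaded into Redshift with the SQL COPY command. The helper below only builds the statement text; it is a sketch in which the table name, S3 prefix, and IAM role ARN are hypothetical, and the CSV and GZIP options are assumptions about the file format.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str,
                         gzip: bool = True) -> str:
    """Build a Redshift COPY statement that loads CSV files from an S3 prefix.

    The arguments are inserted as-is, so only pass trusted values.
    """
    options = ["FORMAT AS CSV"]
    if gzip:
        options.append("GZIP")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        + " ".join(options) + ";"
    )


print(build_copy_statement(
    "clickstream",
    "s3://example-bucket/processed/2014/10/21/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
))
```

The resulting statement can be run through any PostgreSQL-compatible client connected to the cluster.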
Your choice of BI Tools on the cloud Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Amazon EMR
Amazon Redshift
Pre-processing framework
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing insights
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Amazon EMR
Amazon Redshift
Sharing results and visualizations at scale
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Amazon EMR
Amazon Redshift
Web App Server Visualization tools
Rinse and Repeat every day or hour
Rinse and Repeat
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Amazon EMR
Amazon Redshift
Visualization tools
Business Intelligence Tools
Business Intelligence Tools
GIS tools on Hadoop
GIS tools
AWS Data Pipeline
The complete architecture
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL Store
Log Aggregation tools
Amazon EMR
Amazon Redshift
Visualization tools
Business Intelligence Tools
Business Intelligence Tools
GIS tools on Hadoop
GIS tools
AWS Data Pipeline
No, it isn’t!
What about Real-Time?
Faster data is better data
HAPPENING NOW! real-time == stream analytics
Ingest data streams Store durably
Distribute Scale out
Process as packets flow in
Realtime Analytics in the Cloud
Amazon Kinesis Streaming Data Service
Kinesis architecture
Clash of Clans
In-game activity
Amazon Kinesis
Kinesis-enabled apps on EC2
Real-time clickstream processing app
EC2: In-game engagement trends dashboard
S3: Aggregate statistics
Glacier: Clickstream archive
EC2 Data Warehouse
Business-intelligence user
Kinesis: Real-time data stream of in-game activity
Multiple Kinesis applications: Dashboards, analytics and storage
S3 and Glacier: Data storage and long term archival
Data Warehouse: BI reporting and interactive queries
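The producer side of this architecture, where in-game activity flows into a Kinesis stream, can be sketched as follows. It assumes a boto3 Kinesis client; the stream name, the event fields, and the send_game_event helper are hypothetical and not taken from the game's actual implementation.

```python
import json


def send_game_event(kinesis_client, stream_name: str, player_id: str, action: str):
    """Put one in-game event on a Kinesis stream.

    Using the player ID as the partition key sends all events from one player
    to the same shard, which preserves their order for consumers.
    """
    payload = json.dumps({"player_id": player_id, "action": action}).encode("utf-8")
    response = kinesis_client.put_record(
        StreamName=stream_name,
        Data=payload,
        PartitionKey=player_id,
    )
    return response["ShardId"], response["SequenceNumber"]


# Usage (needs credentials and an existing stream):
#   import boto3
#   send_game_event(boto3.client("kinesis"), "example-game-events",
#                   "player-42", "attack")
```

Each Kinesis-enabled app (dashboard, analytics, storage) then reads the same stream independently.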
Demo
Sliding Window Analytics Live Dashboard
S3 Storage Redshift Data Warehouse
Kinesis
Website Clickstream logs
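The sliding window analytics in this demo can be illustrated with a small in-memory counter. This is a sketch of the general technique rather than the demo's code: the class name, the (timestamp, page) event shape, and the window length are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count page views over the most recent window_seconds of clickstream."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, page), oldest first

    def add(self, timestamp: float, page: str) -> None:
        """Record one click and drop events that fell out of the window."""
        self._events.append((timestamp, page))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def top_pages(self, now: float, n: int = 3):
        """Return the n most viewed pages inside the current window."""
        self._evict(now)
        return Counter(page for _, page in self._events).most_common(n)


window = SlidingWindowCounter(window_seconds=10)
for ts, page in [(0, "/a"), (5, "/b"), (9, "/a"), (12, "/b")]:
    window.add(ts, page)
print(window.top_pages(now=12))
```

A live dashboard would call top_pages on a timer while a Kinesis consumer feeds add.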
AWS Cloud Taekwon
Bonus
Internet of Things
Smart Devices
Powered by the Cloud
Smart Devices
Powered by the Cloud

          Arduino Uno    Raspberry Pi
CPU       20MHz 8bit     700MHz 32bit
Memory    2 KB           512 MB
Storage   32 KB          SD card
Smart Devices
Powered by the Cloud
Camera Microphone
Thermometer
Distance
GPS
Gyroscope
Actuator
Relay
Motor
Manipulator
Switch Pressure
Accelerometer
Wheel Propeller
Rotor
Challenges
Thousands – Millions of Devices / Producers
Thousands – Millions of Users / Consumers
Distributed
At scale
Smart Devices
Powered by the Cloud
Smart Devices
Powered by the Cloud
Unlimited Storage – Memory
Unlimited Compute – Logic
Camera Microphone
Thermometer
Distance
GPS
Actuator
Relay
Motor
Manipulator
Switch Pressure
Wheel Propeller
Rotor
Gyroscope Accelerometer
Smart Devices
Powered by the Cloud
Demo
Arduino Yún
云 ("cloud", pronounced yún)
Raspberry Pi
Spark Core
Accelerometer
MQTT
Mosquitto MQTT Broker MQTT-Kinesis Bridge
AWS SDK
Amazon Kinesis Real-time Streaming Data Service
AWS APIs
AWS Elastic Beanstalk
Dashboard
Amazon SNS Earthquake Alerts
Mobile Push
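The last hop of this demo, turning accelerometer readings into an earthquake alert pushed through Amazon SNS, might look like the sketch below. It assumes a boto3 SNS client; the threshold, the topic ARN, the message wording, and the check_and_alert helper are hypothetical, and readings are assumed to be in units of g.

```python
import math


def magnitude_g(x: float, y: float, z: float) -> float:
    """Length of the acceleration vector, in g."""
    return math.sqrt(x * x + y * y + z * z)


def check_and_alert(sns_client, topic_arn: str, device_id: str, reading,
                    threshold_g: float = 0.5) -> bool:
    """Publish an SNS alert if the reading deviates enough from rest.

    A device at rest measures about 1 g (gravity), so the deviation from 1 g
    is used as a crude measure of shaking.
    """
    shake = abs(magnitude_g(*reading) - 1.0)
    if shake < threshold_g:
        return False
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="Earthquake alert",
        Message=f"Device {device_id} measured {shake:.2f} g of shaking",
    )
    return True
```

Subscribers to the topic (mobile push, email, SMS) receive the alert without the device knowing who they are.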
Demo
COLLECT | STORE | ANALYZE | SHARE
Import/Export
Glacier
S3 EC2
Redshift DynamoDB
EMR
Data Pipeline
S3 Direct Connect
Kinesis
The AWS Big Data Portfolio
CloudFront
AWS Cloud Taekwon