![Page 1: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/1.jpg)
Large Scale Data Analytics on AWS
Ian Meyers, David Elliott, Denis Batalov
Solution Architects, EMEA
![Page 2: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/2.jpg)
Agenda
2:00pm – 3:00pm - AWS & Analytics Services Overview
3:00pm – 4:30pm - Machine Learning with AWS Demonstration
4:30pm – 5:00pm - Break
5:00pm – 6:00pm - Data Analytics Platform Demonstration
![Page 3: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/3.jpg)
WHY BUILD LARGE SCALE ANALYTICS
APPLICATIONS ON AWS?
![Page 4: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/4.jpg)
It’s never been easier and less expensive to
collect, store, analyse & share data
![Page 5: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/5.jpg)
We are constantly producing more data
![Page 6: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/6.jpg)
From all types of industries
![Page 7: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/7.jpg)
From a diverse range of sources
![Page 8: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/8.jpg)
Discovery Development Delivery
Risk Marketing Reporting Trade
Sales
Broad Analytics Use In The AWS Cloud
![Page 9: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/9.jpg)
CLOUD COMPUTING?
![Page 10: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/10.jpg)
A broad and deep platform that helps customers
build sophisticated, scalable applications
What is Cloud Computing?
Cloud Computing
![Page 11: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/11.jpg)
On demand Pay as you go
Uniform Available
Utility
Cloud Computing
![Page 12: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/12.jpg)
Infrastructure
Cloud Computing
![Page 13: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/13.jpg)
Compute
Database
Load Balancing
Networking
Storage
Analytics
Messaging
Monitoring
Content Distribution
Security
DNS
Cloud Computing
![Page 14: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/14.jpg)
Availability Zones
Global Infrastructure
![Page 15: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/15.jpg)
US-WEST (Oregon)
EU-WEST (Ireland)
ASIA PAC (Tokyo)
US-WEST (N. California)
SOUTH AMERICA (Sao Paulo)
US-EAST (Virginia)
AWS GovCloud (US)
ASIA PAC (Sydney)
ASIA PAC (Singapore)
ASIA PAC (Beijing)
EU-CENTRAL (Frankfurt)
Availability Zones
Global Infrastructure
![Page 16: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/16.jpg)
Accessible via API endpoints
Global Infrastructure
![Page 17: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/17.jpg)
aws ec2 run-instances \
  --image-id ami-a813fadf \
  --count 3 \
  --placement AvailabilityZone=eu-west-1a \
  --instance-type m3.medium
aws ec2 run-instances \
  --image-id ami-a813fadf \
  --count 5 \
  --placement AvailabilityZone=eu-west-1c \
  --instance-type m3.large
Global Infrastructure
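The same two launch requests can be expressed through an SDK instead of the CLI. This is a minimal sketch in Python, assuming the boto3 `run_instances` parameters (`ImageId`, `MinCount`, `MaxCount`, `InstanceType`, `Placement`). The helper name `run_instances_kwargs` is hypothetical; it only builds the request, so nothing is launched.

```python
def run_instances_kwargs(image_id, count, availability_zone, instance_type):
    """Build keyword arguments mirroring the `aws ec2 run-instances` flags."""
    return {
        "ImageId": image_id,
        # `--count N` corresponds to asking for exactly N instances.
        "MinCount": count,
        "MaxCount": count,
        "InstanceType": instance_type,
        "Placement": {"AvailabilityZone": availability_zone},
    }


# The two requests from the slide: different zones, counts and sizes.
launch_requests = [
    run_instances_kwargs("ami-a813fadf", 3, "eu-west-1a", "m3.medium"),
    run_instances_kwargs("ami-a813fadf", 5, "eu-west-1c", "m3.large"),
]

for kwargs in launch_requests:
    print(kwargs)
    # With AWS credentials configured, each request could be sent with:
    #   import boto3
    #   boto3.client("ec2", region_name="eu-west-1").run_instances(**kwargs)
```

Keeping the request as plain data makes it easy to review or log before anything is provisioned.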
![Page 18: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/18.jpg)
[Chart: capacity over time, comparing traditional IT capacity with your actual capacity needs]
Elastic Capacity (or lack of in this case)
Elasticity
![Page 19: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/19.jpg)
On and Off Fast Growth
Variable peaks Predictable peaks
Elastic Capacity (or lack of in this case)
Elasticity
![Page 20: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/20.jpg)
On and Off Fast Growth
Predictable peaks Variable peaks
Waste
Customer Dissatisfaction
Elastic Capacity (or lack of in this case)
Elasticity
![Page 21: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/21.jpg)
On and Off Fast Growth
Predictable peaks Variable peaks
Elastic Capacity
Elasticity
![Page 22: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/22.jpg)
From One Instance
Elasticity
![Page 23: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/23.jpg)
To Thousands
Elasticity
![Page 24: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/24.jpg)
And Back Again
Elasticity
![Page 25: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/25.jpg)
Compute: EC2, Lambda, EC2 Container Service, Elastic Beanstalk
Storage & Content Delivery: S3, CloudFront, EFS, Glacier, Storage Gateway
Database: RDS, DynamoDB, ElastiCache, Redshift
Networking: VPC, Direct Connect, Route 53
Analytics: EMR, Data Pipeline, Kinesis, Machine Learning
Developer Tools: CodeCommit, CodeDeploy, CodePipeline
Management Tools: CloudWatch, CloudFormation, CloudTrail, Config, OpsWorks, Service Catalog, Trusted Advisor
Security & Identity: Identity & Access Management, Directory Service
Application Services: API Gateway, AppStream, CloudSearch, Elastic Transcoder, SES, SQS, SWF
Mobile Services: Cognito, Device Farm, Mobile Analytics, SNS
Enterprise Applications: WorkSpaces, WorkDocs, WorkMail
Broad Range Of Services
![Page 26: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/26.jpg)
https://aws.amazon.com/compliance/
Broadest Certification & Accreditations
![Page 27: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/27.jpg)
DATA INGESTION & STORAGE
![Page 28: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/28.jpg)
Makes it easy to establish a dedicated network connection from your premises to AWS
Establish private connectivity between AWS & your datacenter, office, or colocation environment
Reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience
The dedicated connection can be partitioned into multiple virtual interfaces using 802.1q VLANs
aws.amazon.com/directconnect
AWS Direct Connect
Data Ingestion & Storage
![Page 29: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/29.jpg)
Amazon S3
Secure, durable, highly-scalable object storage
Accessible via a simple web services interface
Store & retrieve any amount of data
Use alone or together with other AWS services
Different Tiers: Standard, Infrequent Access, Reduced Redundancy, Glacier
Data Ingestion & Storage
![Page 30: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/30.jpg)
Elastic Block Store: High performance block storage device
1GB to 1TB in size
Mount as drives to instances with snapshot/cloning functionalities
Simple Storage Service: Highly scalable object storage for the internet
1 byte to 5TB in size
99.999999999% durability
Availability: 99.99%
Durability: 99.999999999%
Is a Web Store: Not a file system
No Single Points of Failure: Eventually consistent
Paradigm: Object store
Performance: Very Fast
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.03/GB/month
Typical use case: Write once, read many
Limits: 100 Buckets, Unlimited Storage, 5TB Objects
![Page 31: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/31.jpg)
Amazon S3 Multipart Upload
Large file (Size < 5TB)
Split file into parts -> Send parts to S3 -> S3 rejoins the parts
Large object (Size < 5TB)
Data Ingestion & Storage
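The split, send and rejoin steps can be sketched locally without touching S3. This is a minimal illustration in Python, assuming the S3 multipart limits of a 5 MiB minimum part size (for every part except the last) and at most 10,000 parts. The function names `plan_parts` and `rejoin` are hypothetical, and `rejoin` only stands in for what S3 does when the upload completes.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # minimum size for every part except the last
MAX_PARTS = 10_000               # maximum number of parts in one upload


def plan_parts(total_size, part_size):
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    parts = []
    offset = 0
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((len(parts) + 1, offset, length))  # part numbers start at 1
        offset += length
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


def rejoin(data, parts):
    """Concatenate the parts in part-number order, as S3 does on completion."""
    ordered = sorted(parts)
    return b"".join(data[offset:offset + length] for _, offset, length in ordered)


# Example: a payload slightly larger than two 5 MiB parts.
payload = bytes(12 * 1024 * 1024 + 7)
plan = plan_parts(len(payload), MIN_PART_SIZE)
print([(number, length) for number, _, length in plan])
```

In a real upload each part would be sent independently, which allows parallel transfer and retrying a single failed part rather than the whole file.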
![Page 32: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/32.jpg)
Simple Storage Service: Highly scalable object storage
Glacier: Long term object archive
Data Ingestion & Storage
Lifecycle Management
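A lifecycle rule that moves objects from S3 to Glacier is plain configuration. This is a hedged sketch in Python of one such rule, assuming the rule layout accepted by the boto3 `put_bucket_lifecycle_configuration` call. The bucket name, prefix and 30-day threshold are illustrative values and do not come from the slides.

```python
def glacier_transition_rule(prefix, days):
    """Build a lifecycle configuration that archives objects under `prefix`."""
    return {
        "Rules": [
            {
                "ID": "archive-to-glacier",          # illustrative rule name
                "Filter": {"Prefix": prefix},        # which objects the rule covers
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


lifecycle = glacier_transition_rule("logs/", 30)
print(lifecycle)
# With AWS credentials configured, the rule could be applied with:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle,
#   )
```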
![Page 33: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/33.jpg)
Persistent block level storage volumes
For use with Amazon EC2 instances
Automatically replicated within Availability Zones
Offer consistent and low-latency performance
[Diagram: EC2 Instance, EBS Volume, EBS Snapshot (stored on S3)]
aws.amazon.com/ebs
Data Ingestion & Storage
Amazon Elastic Block Store
![Page 34: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/34.jpg)
AWS Import/Export
Move large amounts of data into and out of the AWS cloud using portable storage devices
Transfer your data directly onto and off of storage devices using Amazon’s high-speed internal network
For significant data sets, AWS Import/Export is often faster than Internet transfer and more cost effective than upgrading your connectivity
Supports upload & download from S3 & upload to Amazon EBS snapshots & Amazon Glacier Vaults
aws.amazon.com/importexport/
Data Ingestion & Storage
![Page 35: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/35.jpg)
An on-premises software appliance connecting with cloud-based storage
Supports industry-standard storage protocols that work with your existing applications and workflows
Provides low-latency performance by maintaining frequently accessed data on-premises while securely storing all of your data encrypted in Amazon S3 or Amazon Glacier
aws.amazon.com/storagegateway/
AWS Storage Gateway
Data Ingestion & Storage
![Page 36: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/36.jpg)
A fully managed, cloud-based service for real-time data processing over large, distributed data streams
Continuously capture and store terabytes of data per hour from hundreds of thousands of sources
Emit data to other streams and other AWS services such as Amazon S3, Amazon Redshift, Amazon Elastic MapReduce (Amazon EMR) and Amazon DynamoDB
Elastically Add and Remove Shards for Performance
Use the Kinesis Client Library to Process Data
aws.amazon.com/kinesis
Amazon Kinesis
Data Ingestion & Storage
![Page 37: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/37.jpg)
Millions of sources producing 100s of TB per hour
Front End: Authentication, Authorization
Durable, consistent replicas across three AWS Availability Zones (AZ) within an Amazon Web Services Region
Inexpensive: $0.0165 per million PUT Payload Units (in EU Ireland)
Aggregate and archive to S3
Real-time dashboards and alarms
Machine learning algorithms
Aggregate analysis in Hadoop or a data warehouse
Ordered stream of events supporting multiple readers
Data Ingestion & Storage
Amazon Kinesis Architecture
![Page 38: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/38.jpg)
"As a startup, using AWS has allowed us to scale nicely and use resources without spending a lot of capital."
Brian Langel, CTO, Dash
• Needed to scale IT resources to create an app that would offer real-time information to drivers
• Developed and deployed the Dash application on the AWS Cloud
• Streams more than 1 TB of real-time data per day using Amazon Kinesis and processes billions of entries using Amazon DynamoDB
• Scaled up to support large traffic spikes in app usage (several thousand updates per second)
• Reduced operating costs by $200,000 per year
Using AWS, Dash Streams More Than 1 TB of Real-Time Data Per Day
Find out more here: aws.amazon.com/solutions/case-studies/dash/
![Page 39: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/39.jpg)
Data Ingestion Ecosystem
![Page 40: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/40.jpg)
Log Analysis
Compute Storage
AWS Global Infrastructure
Database
App Services
Deployment & Administration
Networking
Analytics
CloudWatch Logging
Automated Log Ingestion from Amazon Linux Agents
Create Log Streams, Groups of Logs, and Log Event Types
Analyze Log Data using Search Patterns
Alarms on Application Log Events
Integration with RSysLog
![Page 41: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/41.jpg)
STRUCTURED DATA MANAGEMENT
![Page 42: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/42.jpg)
Database
Relational Database Service: Managed Oracle, MySQL, SQL Server & Aurora
DynamoDB: Managed NoSQL Database
ElastiCache: Managed In-Memory Caching
Amazon Redshift: Massively Parallel Petabyte Scale Data Warehouse
![Page 43: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/43.jpg)
Database
Relational Database Service: Database-as-a-Service
No need to install or manage database instances
Scalable and fault tolerant configurations
Integration with Data Pipeline
![Page 44: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/44.jpg)
Database
DynamoDB: Provisioned throughput NoSQL database; single-digit millisecond latency at any scale
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Supports document, key-value and graph data models
Integration with EMR & Hive
![Page 45: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/45.jpg)
Dynamo Consistency
Writes
• Writes are acknowledged (committed) once they exist in at least two physical data centers
• Writes are persisted to SSD
Reads
• Tunable for Application Requirements
• No reduction in durability or consistency in order to achieve throughput
Eventually Consistent Read: Stale values possible; Highest Throughput
Strongly Consistent Read: No stale values; Lower Potential Throughput
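The read-side tuning is chosen per request. This is a minimal sketch in Python, assuming the low-level DynamoDB `GetItem` request shape used by the boto3 client (`TableName`, `Key`, `ConsistentRead`). The table name `SensorReadings` and the key attribute `sensor_id` are hypothetical.

```python
def get_item_request(table_name, key, strongly_consistent=False):
    """Build a GetItem request; reads are eventually consistent by default."""
    return {
        "TableName": table_name,
        "Key": key,
        # True requests a strongly consistent read (no stale values,
        # lower potential throughput); False allows a possibly stale read.
        "ConsistentRead": strongly_consistent,
    }


key = {"sensor_id": {"S": "sensor-0001"}}  # hypothetical key attribute
eventual = get_item_request("SensorReadings", key)
strong = get_item_request("SensorReadings", key, strongly_consistent=True)
print(eventual["ConsistentRead"], strong["ConsistentRead"])
# With AWS credentials configured, either request could be sent with:
#   import boto3
#   boto3.client("dynamodb").get_item(**strong)
```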
![Page 46: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/46.jpg)
Database
ElastiCache: In-Memory Caching
Memcached or Redis
Automatic Node Failover / Replacement
Multi-AZ Standby
![Page 47: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/47.jpg)
Database
Redshift: Managed Massively Parallel Petabyte Scale Data Warehouse
Streaming Backup/Restore to S3
Load data from S3, DynamoDB and EMR
Extensive Security Features
Scale from 160 GB -> 2 PB Online
![Page 48: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/48.jpg)
Amazon Redshift parallelizes and distributes everything
Query, Load, Backup, Restore, Resize
[Diagram: Common BI Tools connect over JDBC/ODBC to the Leader Node; three Compute Nodes; 10GigE Mesh]
![Page 49: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/49.jpg)
Redshift lets you start small and grow big
Small Nodes: (dc1.l & ds2.xl)
3 spindles, 15-30GiB RAM, 2 or 4 virtual cores, 10GigE
Single Node (160GB SSD or 2TB Magnetic)
Cluster 2-32 Nodes (320GB SSD – 64TB Magnetic)
Large Nodes: (dc1.8xl & ds2.8xl)
24 spindles, 120-244GiB RAM, 2.56TB SSD or 16TB Magnetic, 16 or 32 virtual cores, 10GigE
Cluster 2-100 Nodes (5TB SSD – 1.6PB Magnetic)
[Diagram: grids of 8XL and L node icons]
![Page 50: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/50.jpg)
COMPLEX ANALYTICS
![Page 51: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/51.jpg)
Elastic MapReduce: Managed, elastic Hadoop (1.x & 2.x) cluster
Integrates with S3, DynamoDB and Redshift
Install Storm, Spark & Shark, Hive, Pig, Impala & End User Tools Automatically
Support for Spot Instances
Integrated HBase NoSQL Database
Analytics
Elastic MapReduce
![Page 52: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/52.jpg)
Analytics
![Page 53: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/53.jpg)
Analytics languages/engines
Data management
Amazon Redshift
Amazon Kinesis
Amazon S3
Amazon DynamoDB
Amazon RDS
EMR
Data Sources
AWS Data Pipeline
Ecosystem
![Page 54: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/54.jpg)
S&P Capital IQ Uses AWS for Big Data Processing
Provides data to 4200+ top global investment firms
Launched Hadoop faster, Learned Hadoop faster
S3 Hadoop Cluster
http://aws.amazon.com/solutions/case-studies/sp-capital-iq
![Page 55: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/55.jpg)
Event Processing
AWS Lambda: Fully Managed Event Processor
Node.js, Integrated AWS SDK & ImageMagick
Natively Compile & Install any Node.js modules
Specify Runtime RAM & Timeout
Automatically Scaled to support Event Volume
Events from S3, Dynamo DB, Kinesis & Lambda
Integrated CloudWatch Logging
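An event processor of this kind is a function that receives the event and a context object. The sketch below is written in Python rather than the Node.js runtime the slide lists, which is an assumption made for illustration. It relies on the S3 notification event layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`); the bucket and key in the sample event are made up.

```python
def handler(event, context):
    """Collect (bucket, key) pairs from the S3 records in an incoming event."""
    objects = []
    for record in event.get("Records", []):
        if record.get("eventSource") != "aws:s3":
            continue  # ignore records from other event sources
        s3 = record["s3"]
        objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    # Anything printed by a Lambda function ends up in CloudWatch Logs.
    print("received %d S3 object event(s)" % len(objects))
    return objects


# Local invocation with a hand-written sample event (hypothetical names).
sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/readings.csv", "size": 1024},
            },
        },
        {"eventSource": "aws:kinesis", "kinesis": {}},
    ]
}
result = handler(sample_event, None)
```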
![Page 56: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/56.jpg)
Analytics of the Internet of Things
![Page 57: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/57.jpg)
Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.
Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.
Output Datanode: This supports all the same datasources as the input datanode, but they don’t have to be the same type.
Analytics Orchestration
Data Pipeline: Automatically Provision EC2 & EMR Resources
Manage Dependencies & Scheduling
Automatically Retry and Notify of Success & Failure
![Page 58: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/58.jpg)
Output: S3 file
Path: s3://trend-data/#{year-month-day}.csv
Activity: EMR Transform
Hive Query: user-metrics.hql
Frequency: Daily
Input: RDS Table
Table: User-Demographics
SQL Precondition: “Select last_update from table“ > #{YY-MM-DD}
Input: DynamoDB Table
Table: User-Event-Data-#{year-month}
Success Notification: [email protected] Notification: [email protected] Notification: : [email protected]
Sample Use Case
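The date placeholders in this use case are resolved for each scheduled run. The sketch below shows that substitution in Python for a single daily run. It treats `#{year-month-day}` and `#{year-month}` as the slide's own shorthand, which is an assumption; it is not Data Pipeline's actual expression syntax. The `resolve` helper and the dictionary layout are hypothetical.

```python
from datetime import date

# Slide shorthand mapped to strftime formats (an assumed interpretation).
PLACEHOLDERS = {
    "#{year-month-day}": "%Y-%m-%d",
    "#{year-month}": "%Y-%m",
}


def resolve(template, run_date):
    """Replace each slide placeholder with the value for run_date."""
    for token, fmt in PLACEHOLDERS.items():
        template = template.replace(token, run_date.strftime(fmt))
    return template


pipeline = {
    "inputs": ["User-Demographics", "User-Event-Data-#{year-month}"],
    "activity": {"hive_query": "user-metrics.hql", "frequency": "daily"},
    "output": "s3://trend-data/#{year-month-day}.csv",
}

run_date = date(2015, 10, 28)
resolved_inputs = [resolve(name, run_date) for name in pipeline["inputs"]]
resolved_output = resolve(pipeline["output"], run_date)
print(resolved_inputs, resolved_output)
```

Each daily run therefore reads the current month's event table and writes a new dated file, so reruns for a given day overwrite only that day's output.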
![Page 59: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/59.jpg)
Train and optimize models on GBs of data
Batch process predictions
Real-time prediction API in one-click
No servers to provision or manage
Amazon Machine Learning
![Page 60: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/60.jpg)
END USER REPORTING
![Page 61: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/61.jpg)
End User Reporting
Redshift
S3
EMR
Dynamo DB
![Page 62: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/62.jpg)
End User Reporting – Customer Issues
Realizing the “Virtual Desktop Dream”
BYOD is increasingly popular
Workforces are increasingly diverse
Tablet adoption significant
Keeping all these desktops secure
![Page 63: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/63.jpg)
End User Reporting - Workspaces
WorkSpaces
Fully Managed
Support Multiple Devices
Keep Data Secure and Available
Choose Software & Hardware
Pay as You Go
Corporate Directory Integration
No data stored on end-user device
Only Pixels delivered to users (PCoIP)
User volume backed by Amazon S3
![Page 64: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/64.jpg)
INTEGRATED ANALYTICS
![Page 65: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/65.jpg)
Integrated Analytics
![Page 66: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/66.jpg)
Integrated Analytics
TBs of logs sent daily
Logs stored in Amazon S3
Amazon EMR clusters
Hive Metastore on Amazon EMR
Interactive query
![Page 67: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/67.jpg)
Integrated Analytics
Batch Processing
GBs of logs pushed to Amazon S3 hourly
Daily Amazon EMR cluster using Hive to process data
Input and output stored in Amazon S3
Load subset into Amazon Redshift
![Page 68: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/68.jpg)
Integrated Analytics
Streaming Data Processing
Clickstream logs streamed to Kinesis
Logs stored in Amazon Kinesis
Amazon Kinesis Client Library
AWS Lambda
Amazon EMR
Amazon EC2
![Page 69: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/69.jpg)
Integrated Analytics
Real Time Predictions
Your application
Amazon DynamoDB + Trigger event with Lambda + Query for predictions with the Amazon Machine Learning real-time API
![Page 70: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/70.jpg)
Integrated Analytics
Batch Predictions
Structured data in Amazon Redshift
Query for predictions with Amazon ML batch API
Predictions in Amazon S3
Load predictions into Amazon Redshift -or- Read prediction results directly from S3
Your application
![Page 71: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/71.jpg)
aws.amazon.com/architecture/
![Page 72: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/72.jpg)
AWS Training & Certification
Training (aws.amazon.com/training): Build technical expertise to design and operate scalable, efficient applications on AWS
Certification (aws.amazon.com/certification): Validate your proven skills and expertise with the AWS platform
Self-Paced Labs (aws.amazon.com/training/self-paced-labs): Try products, gain new skills, and get hands-on practice working with AWS technologies
![Page 73: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/73.jpg)
Large Scale Data Analytics with Amazon Web Services
Ian Meyers, Principal Solution Architect
October 28th, 2015
![Page 74: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/74.jpg)
A customer has built a new Oil Pipeline, the North Sea Anglian
System (the Flying Scotsman) which ships Crude Oil from the North Sea to
London.
Built on Next Generation Sensor Technology, this Pipeline emits
operational metrics from every Sensor using Internet of Things technology.
With every measurement, each sensor can track the ambient
temperature, corrosivity, Pressure and Flow Rate, as well as physical
orientation of the segment of Pipeline being monitored.
Provide an Operational Analytics Pipeline which allows for real time
monitoring of the Pipeline, as well as historical analysis of all data.
![Page 75: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/75.jpg)
Getting the Data In
![Page 76: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/76.jpg)
Amazon EC2
Amazon Kinesis
MQTT
HTTPS
![Page 77: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/77.jpg)
Application Services
Amazon Kinesis: Managed Service for Real Time Big Data Processing
Create Streams to Produce & Consume Data
Elastically Add and Remove Shards for Performance
Use the Kinesis Client Library to Process Data
Integration with S3, Redshift and DynamoDB
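Shards matter to producers because each record's partition key decides which shard receives it. Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The sketch below reproduces the mapping locally in Python, assuming the hash space is split evenly across shards, which may not hold after resharding. The function names and sensor keys are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # size of the 128-bit hash key space


def shard_ranges(shard_count):
    """Split the hash key space into shard_count contiguous, even ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for index in range(shard_count):
        start = index * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if index == shard_count - 1 else start + step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key fell outside every range")


ranges = shard_ranges(4)
assignments = {key: shard_for(key, ranges) for key in ("sensor-1", "sensor-2", "sensor-3")}
print(assignments)
```

Because the mapping is deterministic, all records for one sensor land on the same shard and keep their order, while adding shards spreads distinct keys over more capacity.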
![Page 78: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/78.jpg)
[Diagram: Data Sources send records through the AWS Endpoint into Amazon Kinesis (Shard 1, Shard 2, ... Shard N), spanning three Availability Zones. Consuming applications: App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Sliding Window Analysis], App.4 [Machine Learning]. Destinations: S3, DynamoDB, Redshift]
![Page 79: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/79.jpg)
Native Code Module to perform efficient writes to Multiple Kinesis Streams
C++/Boost
Asynchronous Execution
Configurable Aggregation of Events
Introducing the Kinesis Producer Library
[Diagram: My Application -> KPL Daemon -> PutRecord(s) (Async) -> Kinesis Streams]
![Page 80: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/80.jpg)
KPL Aggregation
[Diagram: My Application -> KPL Daemon -> PutRecord(s) (Async) -> Kinesis Streams]
1MB Max Event Size
Aggregate: events of 100k, 20k, 500k, 200k, 40k, 20k and 40k are packed into a single record between a Protobuf Header and a Protobuf Footer
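The packing idea can be illustrated with a simple greedy pass: keep adding events to the current record until the next one would exceed the 1MB limit. This Python sketch is not the KPL's actual algorithm; it ignores the Protobuf header and footer overhead and treats the slide's "k" as 1,000 bytes. The function name `aggregate` is hypothetical.

```python
MAX_RECORD_BYTES = 1024 * 1024  # the slide's 1MB maximum event size


def aggregate(event_sizes, limit=MAX_RECORD_BYTES):
    """Greedily group event sizes, in arrival order, into records <= limit."""
    records = []
    current, used = [], 0
    for size in event_sizes:
        if size > limit:
            raise ValueError("a single event exceeds the record limit")
        if used + size > limit:
            records.append(current)  # close the full record, start a new one
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        records.append(current)
    return records


# The seven events from the slide, in bytes (assuming k = 1,000).
slide_events = [100_000, 20_000, 500_000, 200_000, 40_000, 20_000, 40_000]
one_round = aggregate(slide_events)
two_rounds = aggregate(slide_events * 2)
print(len(one_round), len(two_rounds))
```

Fewer, fuller records mean fewer PutRecord(s) calls for the same volume of events, which is the efficiency the aggregation aims for.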
![Page 81: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/81.jpg)
KCL Libraries available for Java, Ruby, Node, Go, and a Multi-Lang Implementation with Native Python support
All State Management in DynamoDB
Kinesis Client Library
DynamoDB
![Page 82: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/82.jpg)
AWS Analytics Demo
![Page 83: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/83.jpg)
Long Term Durability
![Page 84: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/84.jpg)
Amazon EC2
Amazon Kinesis
MQTT
HTTPS
![Page 85: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/85.jpg)
Amazon EC2
Amazon S3
Amazon Kinesis
Amazon EC2
MQTT
HTTPS
![Page 86: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/86.jpg)
Kinesis Connectors
• S3
Batch Write Files for Archive into S3
Extensible file naming
• Redshift
Once Written to S3, load to Redshift
Manifest support
User defined transformers
• DynamoDB
BatchPut append to table
User defined transformers
• Spark
Spark Streaming RDDs
• Storm
Use Kinesis as a Spout
• ElasticSearch
Automatically index stream contents
[Diagram: Kinesis connected to Storm, S3, DynamoDB, Redshift and ElasticSearch]
![Page 87: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/87.jpg)
Connectors Architecture
![Page 88: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/88.jpg)
Elastic Block Store: High performance block storage device
1GB to 1TB in size
Mount as drives to instances with snapshot/cloning functionalities
Simple Storage Service: Highly scalable object storage for the internet
1 byte to 5TB in size
99.999999999% durability
Availability: 99.99%
Durability: 99.999999999%
Is a Web Store: Not a file system
No Single Points of Failure: Eventually consistent
Paradigm: Object store
Performance: Very Fast
Redundancy: Across Availability Zones
Security: Public Key / Private Key
Pricing: $0.095/GB/month
Typical use case: Write once, read many
Limits: 100 Buckets, Unlimited Storage, 5TB Objects
![Page 89: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/89.jpg)
S3 Performance & Scalability
Amazon S3 provides near linear scalability
S3 Streaming Performance
100 VMs; 9.6GB/s; $26/hr
350 VMs; 28.7GB/s; $90/hr
34 secs per terabyte
[Chart: GB/Second against Reader Connections]
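The figures on this slide can be checked with a little arithmetic. The sketch below assumes a decimal terabyte (1000 GB) and that the hourly price covers the whole fleet; the function names are illustrative. Under those assumptions the 350-VM run reads a terabyte in roughly 35 seconds, in line with the quoted 34, and per-VM throughput at 350 VMs is about 85% of the 100-VM figure, which is what "near linear" amounts to here.

```python
GB_PER_TB = 1000  # assumption: decimal terabyte


def seconds_per_tb(gb_per_sec):
    """Time to stream one terabyte at the given aggregate throughput."""
    return GB_PER_TB / gb_per_sec


def cost_per_tb(gb_per_sec, dollars_per_hour):
    """Fleet cost of streaming one terabyte, assuming the hourly price is for all VMs."""
    return dollars_per_hour * seconds_per_tb(gb_per_sec) / 3600


def per_vm_throughput(vms, gb_per_sec):
    """Average throughput contributed by each VM."""
    return gb_per_sec / vms


# (VM count, aggregate GB/s, $/hr) as quoted on the slide
small = (100, 9.6, 26.0)
large = (350, 28.7, 90.0)

# Ratio of per-VM throughput at 350 VMs to per-VM throughput at 100 VMs
scaling_efficiency = (per_vm_throughput(large[0], large[1])
                      / per_vm_throughput(small[0], small[1]))

if __name__ == "__main__":
    print(f"350 VMs: {seconds_per_tb(large[1]):.1f} s per TB")
    print(f"350 VMs: ${cost_per_tb(large[1], large[2]):.2f} per TB")
    print(f"scaling efficiency: {scaling_efficiency:.0%}")
```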
![Page 90: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/90.jpg)
AWS Analytics Demo
![Page 91: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/91.jpg)
Real Time Analytics
![Page 92: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/92.jpg)
Amazon EC2
Amazon S3
Amazon Kinesis
Amazon EC2
MQTT
HTTPS
![Page 93: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/93.jpg)
Amazon EC2
Elastic Beanstalk
DynamoDB
Amazon S3
Amazon Kinesis
CloudWatch
Amazon EC2
MQTT
HTTPS json
![Page 94: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/94.jpg)
Deployment & Admin
Elastic Beanstalk
1 click deployment from Eclipse, Visual Studio and Git
Rapid deployment of applications
All AWS resources automatically created
Feature: Details
Platform support: Containers for Java, .NET, Ruby and PHP
Resource creation: Creates load balancer, instances, autoscaling and monitoring automatically
Monitoring & Logs: Integrated with CloudWatch and consolidates server logs
Versioning: Manage versions of applications and easily roll back deployments
Notifications: Receive alerts on key events
Full resource access: Access all underlying AWS resources as necessary
Compute Storage
AWS Global Infrastructure
Database
App Services
Deployment & Administration
Networking
Analytics
![Page 95: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/95.jpg)
KCL libraries available for Java, Ruby, Node, Go, and a Multi-Lang Implementation with Native Python support
All State Management in DynamoDB
Kinesis Client Library
DynamoDB
![Page 96: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/96.jpg)
Kinesis Aggregators
Kinesis Aggregators provide a powerful and simple mechanism for creating Real Time Aggregates of data as it traverses Kinesis
Simple Configuration
Create a configuration file defining the Aggregations required
Run the application using Elastic Beanstalk
Data is persisted automatically to DynamoDB
DynamoDB provisioning is fully managed
Data can be graphed using CloudWatch
Utilities to integrate Real Time Aggregates with Elastic MapReduce Hive or Amazon Redshift
Σ
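As a rough illustration of what a real-time aggregate over a stream looks like, the sketch below groups JSON events by a label and a one-minute time bucket and keeps a count and a sum for each group. It does not reproduce the Kinesis Aggregators configuration format or API: the function name `aggregate`, the field names and the epoch-seconds timestamp are all assumptions, and the result is returned as a dictionary rather than persisted to DynamoDB.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def aggregate(events, label_field, sum_field, time_field="ts"):
    """Count and sum JSON events per (label, minute) bucket.

    `events` is an iterable of JSON strings; `time_field` is assumed to hold
    epoch seconds.
    """
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for raw in events:
        event = json.loads(raw)
        minute = datetime.fromtimestamp(
            event[time_field], tz=timezone.utc
        ).strftime("%Y-%m-%d %H:%M")
        bucket = totals[(event[label_field], minute)]
        bucket["count"] += 1
        bucket["sum"] += event[sum_field]
    return dict(totals)
```

Each (label, minute) key corresponds to one row that an aggregator would write to its backing table, which is also the granularity at which values could be graphed.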
![Page 97: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/97.jpg)
Database
DynamoDB
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive
RDS Dynamo DB
Redshift Elasticache
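"Provisioned throughput" means capacity is requested up front in read and write units. The sketch below estimates those units, assuming the standard sizing rules: one write unit covers one write per second of an item up to 1 KB, one read unit covers one strongly consistent read per second of an item up to 4 KB, and eventually consistent reads cost half as much. A KB is taken as 1024 bytes here, and the function names are illustrative.

```python
import math


def write_capacity_units(item_bytes, writes_per_sec):
    """Write units needed: items are billed in 1 KB steps, rounded up."""
    return math.ceil(item_bytes / 1024) * writes_per_sec


def read_capacity_units(item_bytes, reads_per_sec, strongly_consistent=True):
    """Read units needed: items are billed in 4 KB steps, rounded up.

    Eventually consistent reads need half the units of strongly consistent ones.
    """
    units = math.ceil(item_bytes / 4096) * reads_per_sec
    return units if strongly_consistent else math.ceil(units / 2)
```

For example, writing 1500-byte items 100 times per second rounds each item up to 2 KB, so the table would need 200 write units under these assumptions.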
![Page 98: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/98.jpg)
CloudWatch Integration
Σ
![Page 99: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/99.jpg)
AWS Analytics Demo
![Page 100: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/100.jpg)
Massively Parallel Transformations
![Page 101: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/101.jpg)
Amazon EC2
Elastic Beanstalk
DynamoDB
Amazon S3
Amazon Kinesis
CloudWatch
Amazon EC2
MQTT
HTTPS json
![Page 102: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/102.jpg)
Amazon EC2
Elastic Beanstalk
DynamoDB
Amazon S3
Amazon Kinesis
Amazon EMR
CloudWatch
Amazon EC2
MQTT
HTTPS json
![Page 103: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/103.jpg)
Elastic MapReduce
Managed, elastic Hadoop (1.x & 2.x) cluster
Integrates with S3, DynamoDB and Redshift
Install Storm, Spark & Shark, Hive, Pig, Impala & End User Tools Automatically
Support for Spot Instances
Integrated HBase NoSQL Database
Analytics
Elastic MapReduce
![Page 104: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/104.jpg)
AWS Analytics Demo
![Page 105: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/105.jpg)
Accessible for Analysts & Dashboards
![Page 106: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/106.jpg)
Amazon EC2
Elastic Beanstalk
DynamoDB
Amazon S3
Amazon Kinesis
Amazon EMR
CloudWatch
Amazon EC2
MQTT
HTTPS json
![Page 107: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/107.jpg)
Amazon EC2
Elastic Beanstalk
DynamoDB
Amazon S3
Amazon Kinesis
Amazon Redshift
Amazon EMR
CloudWatch
Amazon EC2
MQTT
HTTPS json
AWS Lambda
![Page 108: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/108.jpg)
S3 Events
AWS Lambda
SQS Queues
SNS Topics
Amazon S3 Bucket
RRS Object Lost
Object Deleted
Object Delete Marker Created
Object Created (Put)
Object Created (Post)
Object Created (Copy)
Object Created (Multi-Part)
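The event types above arrive at the target (Lambda, SQS or SNS) as a JSON notification containing a list of records. The sketch below pulls out the fields a consumer usually needs. Python is used only to show the event shape; `summarize_s3_event` is a hypothetical helper, not part of any AWS SDK. It assumes the standard notification layout (`Records[].eventName`, `Records[].s3.bucket.name`, `Records[].s3.object.key`) and that object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def summarize_s3_event(event):
    """Return one small dict per record in an S3 event notification."""
    summaries = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        event_name = record["eventName"]  # e.g. "ObjectCreated:Put"
        summaries.append({
            "event": event_name,
            "bucket": s3["bucket"]["name"],
            # Keys are URL-encoded in the notification, so decode before use
            "key": unquote_plus(s3["object"]["key"]),
            "created": event_name.startswith("ObjectCreated:"),
        })
    return summaries
```

A consumer can then branch on `created` to decide, for instance, whether a newly written file should be queued for loading.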
![Page 109: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/109.jpg)
Event Processing
AWS Lambda
Fully Managed Event Processor
Node.js, Integrated AWS SDK & ImageMagick
Natively Compile & Install any Node.js modules
Specify Runtime RAM & Timeout
Automatically Scaled to support Event Volume
Events from S3, DynamoDB, Kinesis & Lambda
Integrated CloudWatch Logging
![Page 110: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/110.jpg)
Database
Redshift
Managed Massively Parallel Petabyte Scale Data Warehouse
Streaming Backup/Restore to S3
Load data from S3, DynamoDB and EMR
Extensive Security Features
Scale from 160GB -> 2 PB Online
RDS Dynamo DB
Redshift Elasticache
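Loading from S3 is often driven by a manifest, as the connectors slide earlier notes: a small JSON file listing exactly which objects a Redshift COPY with the MANIFEST option should read. The sketch below builds one, assuming the standard manifest layout of an `entries` list with `url` and `mandatory` fields. The function name `build_manifest` and the bucket and key values in any call are placeholders.

```python
import json


def build_manifest(bucket, keys, mandatory=True):
    """Return manifest JSON listing the given S3 objects for a COPY load."""
    entries = [
        {"url": f"s3://{bucket}/{key}", "mandatory": mandatory}
        for key in keys
    ]
    return json.dumps({"entries": entries}, indent=2)
```

Marking entries as mandatory makes the load fail if a listed file is missing, instead of silently loading a partial set.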
![Page 111: Large Scale Data Analytics on AWS](https://reader036.vdocuments.site/reader036/viewer/2022081801/584a2cdc1a28ab0b678b6bc8/html5/thumbnails/111.jpg)