AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Matt Yanchyshyn, Sr. Manager Solutions Architecture

June 17th, 2015

AWS Deep Dive: Big Data Analytics and Business Intelligence

Analytics and BI on AWS

Amazon S3

Amazon Kinesis

Amazon DynamoDB

Amazon RDS (Aurora)

AWS Lambda

KCL Apps

Amazon EMR

Amazon Redshift

Amazon Machine Learning

Collect → Store → Process → Analyze

Data collection and storage

Data processing

Event processing

Data analysis

Batch processing

GBs of logs pushed to Amazon S3 hourly

Daily Amazon EMR cluster using Hive to process the data

Input and output stored in Amazon S3

Load subset into Amazon Redshift

Reporting

Pipeline: Amazon S3 log bucket → Amazon EMR (structured log data) → Amazon Redshift → operational reports; a launch sketch follows.
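As a rough sketch, the daily cluster and its Hive step can be launched with boto; the bucket names, script path, and instance counts below are hypothetical:

>>> import boto.emr
>>> from boto.emr.step import InstallHiveStep, HiveStep
>>> emr = boto.emr.connect_to_region('us-east-1')
>>> steps = [InstallHiveStep(),
...          HiveStep('daily-log-etl',                      # hypothetical Hive script
...                   's3://examplebucket/scripts/daily_etl.q')]
>>> emr.run_jobflow(name='daily-log-processing',
...                 log_uri='s3://examplebucket/emr-logs/',
...                 master_instance_type='m3.xlarge',
...                 slave_instance_type='m3.xlarge',
...                 num_instances=5,
...                 steps=steps)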

Streaming data processing

TBs of logs sent daily

Logs stored in Amazon Kinesis; a producer sketch follows the list below

Amazon Kinesis Client Library

AWS Lambda

Amazon EMR

Amazon EC2
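A minimal producer sketch with boto; the stream name and record fields are hypothetical, and the partition key routes each record to a consistent shard:

>>> import json
>>> import boto.kinesis
>>> kinesis = boto.kinesis.connect_to_region('us-east-1')
>>> event = {'user': 'u123', 'action': 'click'}
>>> kinesis.put_record(stream_name='log-stream',           # hypothetical stream
...                    data=json.dumps(event),
...                    partition_key=event['user'])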

TBs of logs sent daily

Logs stored in Amazon S3

Amazon EMR clusters

Hive Metastore on Amazon EMR

Interactive query

Structured data in Amazon Redshift

Batch predictions

Your application queries for predictions with the Amazon ML batch API; predictions are written to S3.

Load predictions into Amazon Redshift, or read prediction results directly from S3.

Real-time predictions

Your application with Amazon DynamoDB + AWS Lambda: trigger an event with Lambda, then query for predictions with the Amazon ML real-time API.

Amazon Machine Learning

Easy to use, managed machine learning service built for developers

Create models using data stored in AWS

Deploy models to production in seconds

Powerful machine learning technology

Based on Amazon’s battle-hardened internal systems

Not just the algorithms:

• Smart data transformations

• Input data and model quality alerts

• Built-in industry best practices

Grows with your needs:

• Train on up to 100 GB of data

• Generate billions of predictions

• Obtain predictions in batches or in real time

Pay-as-you-go and inexpensive

Data analysis, model training, and evaluation: $0.42/instance hour

Batch predictions: $0.10 per 1,000 predictions

Real-time predictions: $0.10 per 1,000 predictions, plus an hourly capacity reservation charge

Building smart applications with Amazon ML

1. Build & train model → 2. Evaluate and optimize → 3. Retrieve predictions

Create a Datasource object

>>> import boto

>>> ml = boto.connect_machinelearning()

>>> ds = ml.create_data_source_from_s3(
...     data_source_id='my_datasource',
...     data_spec={
...         'DataLocationS3': 's3://bucket/input/',
...         'DataSchemaLocationS3': 's3://bucket/input/.schema'},
...     compute_statistics=True)
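The .schema file referenced above is an Amazon ML data schema in JSON. A minimal sketch for a CSV input, with hypothetical attribute names and target, might look like:

{
  "version": "1.0",
  "dataFormat": "CSV",
  "targetAttributeName": "amount",
  "attributes": [
    {"attributeName": "age",    "attributeType": "NUMERIC"},
    {"attributeName": "state",  "attributeType": "CATEGORICAL"},
    {"attributeName": "amount", "attributeType": "NUMERIC"}
  ]
}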

Explore and understand your data

Train your model

>>> import boto

>>> ml = boto.connect_machinelearning()

>>> model = ml.create_ml_model(
...     ml_model_id='my_model',
...     ml_model_type='REGRESSION',
...     training_data_source_id='my_datasource')

Building smart applications with Amazon ML, step 2 of 3: Evaluate and optimize

Explore model quality

Fine-tune model interpretation
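A rough sketch of this step with boto, assuming a held-out evaluation datasource (the IDs below are hypothetical); for binary classification models, fine-tuning includes adjusting the score threshold used to interpret predictions:

>>> evaluation = ml.create_evaluation(
...     evaluation_id='my_evaluation',
...     ml_model_id='my_model',
...     evaluation_data_source_id='my_eval_datasource')

>>> ml.get_evaluation('my_evaluation')['PerformanceMetrics']

>>> # binary models only: move the cutoff that maps scores to 0/1
>>> ml.update_ml_model(ml_model_id='my_model', score_threshold=0.7)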

Building smart applications with Amazon ML, step 3 of 3: Retrieve predictions

Batch predictions

Asynchronous, large-volume prediction generation

Request through service console or API

Best for applications that deal with batches of data records

>>> import boto

>>> ml = boto.connect_machinelearning()

>>> bp = ml.create_batch_prediction(
...     batch_prediction_id='my_batch_prediction',
...     batch_prediction_data_source_id='my_datasource',
...     ml_model_id='my_model',
...     output_uri='s3://examplebucket/output/')

Real-time predictions

Synchronous, low-latency, high-throughput prediction generation

Request through the service API, or the server and mobile SDKs

Best for interactive applications that deal with individual data records

>>> import boto

>>> ml = boto.connect_machinelearning()

>>> # a real-time endpoint must exist first: ml.create_realtime_endpoint('my_model')
>>> ml.predict(
...     ml_model_id='my_model',
...     predict_endpoint='example_endpoint',
...     record={'key1': 'value1', 'key2': 'value2'})
{
    'Prediction': {
        'predictedValue': 13.284348,
        'details': {
            'Algorithm': 'SGD',
            'PredictiveModelType': 'REGRESSION'
        }
    }
}

Amazon Elastic MapReduce (EMR)

Why Amazon EMR?

Easy to use: launch a cluster in minutes

Low cost: pay an hourly rate

Elastic: easily add or remove capacity

Reliable: spend less time monitoring

Secure: manage firewalls

Flexible: control the cluster

The Hadoop ecosystem can run in Amazon EMR

Try different configurations to find your optimal architecture

Choose your instance types

• CPU: c3 family; cc1.4xlarge, cc2.8xlarge

• Memory: m2 family, r3 family

• Disk/IO: d2 family, i2 family

• General: m1 family, m3 family

Match the family to the workload: batch processing, machine learning, Spark and interactive analysis, or large HDFS.

Resizable clusters

Easy to add or remove compute capacity on your cluster

Match compute demands with cluster sizing; for example, via the resize API shown below
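A minimal sketch with boto; the instance group ID and target count are hypothetical:

>>> import boto.emr
>>> emr = boto.emr.connect_to_region('us-east-1')
>>> # grow a task instance group to 10 nodes (the ID below is hypothetical)
>>> emr.modify_instance_groups(['ig-EXAMPLE'], [10])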

Easy to use Spot Instances

Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing

On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity

Meet your SLA at predictable cost, or exceed it at lower cost; a sketch of adding Spot task capacity follows.
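One way to add Spot task capacity with boto, reusing the connection from the resize sketch above (the cluster ID, instance count, and bid price are hypothetical):

>>> from boto.emr.instance_group import InstanceGroup
>>> task_spot = InstanceGroup(num_instances=8, role='TASK',
...                           type='m3.xlarge', market='SPOT',
...                           name='spot-task-nodes', bidprice='0.10')
>>> emr.add_instance_groups('j-EXAMPLE', [task_spot])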

Amazon S3 as your persistent data store

Separate compute and storage

Resize and shut down Amazon EMR clusters with no data loss

Point multiple Amazon EMR clusters at the same data in Amazon S3


EMRFS makes it easier to use Amazon S3

Read-after-write consistency

Very fast list operations

Error handling options

Support for Amazon S3 encryption

Transparent to applications: s3://

EMRFS client-side encryption

Amazon S3 holds the client-side encrypted objects. With EMRFS enabled for Amazon S3 client-side encryption, the Amazon S3 encryption clients on the cluster encrypt and decrypt data using keys from a key vendor (AWS KMS or your custom key vendor).

HDFS is still there if you need it:

• Iterative workloads, if you're processing the same dataset more than once

• Disk I/O intensive workloads

Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing, as sketched below.
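A rough sketch of an S3DistCp copy step with boto; the jar path matches older EMR AMIs, and the cluster ID and paths are hypothetical:

>>> import boto.emr
>>> from boto.emr.step import JarStep
>>> emr = boto.emr.connect_to_region('us-east-1')
>>> copy_step = JarStep(
...     name='Copy logs from S3 to HDFS',
...     jar='/home/hadoop/lib/emr-s3distcp-1.0.jar',   # location on older EMR AMIs
...     step_args=['--src', 's3://examplebucket/logs/',
...                '--dest', 'hdfs:///data/logs/'])
>>> emr.add_jobflow_steps('j-EXAMPLE', [copy_step])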

Amazon Redshift

Amazon Redshift Architecture

Leader node: SQL endpoint; stores metadata; coordinates query execution

Compute nodes: execute queries in parallel; node types to match your workload (Dense Storage DS2 or Dense Compute DC1); divided into multiple slices; local, columnar storage

Architecture diagram: SQL clients and BI tools connect over JDBC/ODBC to the leader node in the customer VPC; the compute nodes (for example, 16 cores, 128 GB RAM, and 16 TB of disk each) sit in an internal VPC on 10 GigE (HPC) networking; ingestion, backup, and restore run against Amazon S3, EMR, DynamoDB, or SSH sources.

Amazon Redshift performance features: column storage, data compression, zone maps, direct-attached storage

Column storage: you only read the data you need

ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+---------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw

Amazon Redshift: data compression

• COPY compresses automatically (see the sketch after this list)

• You can analyze and override

• More performance, less cost
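As an illustrative sketch (the bucket and credential values are placeholders), automatic compression is applied when COPY loads an empty table with COMPUPDATE on:

copy listing
from 's3://examplebucket/data/listing/'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
compupdate on;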

Amazon Redshift: zone maps

• Track the minimum and maximum value for each block

• Skip over blocks that don't contain relevant data

Sorted blocks and their zone maps (min/max per block):

10 | 13 | 14 | 26 | … | 100 | 245 | 324   (min 10, max 324)
375 | 393 | 417 | … | 512 | 549 | 623     (min 375, max 623)
637 | 712 | 809 | … | 834 | 921 | 959     (min 637, max 959)

Amazon Redshift: direct-attached storage

• Local storage for performance

• High scan rates

• Automatic replication

• Continuous backup and streaming restores to/from Amazon S3

• User snapshots on demand

• Cross-region backups for disaster recovery

Amazon Redshift online resize

Continue querying during resize

New cluster deployed in the background at no extra cost

Data copied in parallel from node to node

Automatic SQL endpoint switchover via DNS

Amazon Redshift works with existing data models (star, snowflake)

Amazon Redshift data distribution

Each node is divided into slices (for example, node 1 holds slices 1 and 2; node 2 holds slices 3 and 4). Three distribution styles:

• Key: same key to same location (rows with key1…key4 always hash to the same slice)

• All: all data on every node

• Even: round-robin distribution across slices

Sorting data in Amazon Redshift

In the slices (on disk), the data is sorted by a sort key

Choose a sort key that is frequently used in your queries

Data in columns is marked with a min/max value so Redshift can skip blocks not relevant to the query

A good sort key lets queries skip entire blocks; a table-definition sketch follows.
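A minimal sketch (the table and column names are hypothetical) that picks a distribution key and a sort key at table creation:

create table activity (
  userid     integer distkey,
  event      varchar(64),
  created_at timestamp sortkey
);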

User-Defined Functions

• Python 2.7

• PostgreSQL UDF Syntax System

• Network calls within UDFs are prohibited

• Pandas, NumPy, and SciPy pre-installed

• Import your own
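A sketch of the syntax (the function name and logic are hypothetical), with the body written in Python 2.7:

create function f_hostname(url varchar)
returns varchar
immutable as $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ language plpythonu;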

Interleaved Multi-Column Sort

Currently supported: compound sort keys

• Optimized for applications that filter data by one leading column

Adding support for interleaved sort keys:

• Optimized for filtering data by up to eight columns

• No storage overhead, unlike an index

• Lower maintenance penalty compared to indexes
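Once interleaved sort keys are available, declaring one is a small change to the table DDL; a hypothetical sketch:

create table clicks (
  userid  integer,
  eventid integer,
  url     varchar(256)
)
interleaved sortkey (userid, eventid);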

Amazon Redshift works with your existing analysis tools

JDBC/ODBC

Amazon Redshift

Questions?

AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices and new cloud services.

Details: July 1, 2015 • Chicago, Illinois • McCormick Place

Featuring: new product launches • 36+ sessions, labs, and bootcamps • executive and partner networking

Registration is now open. Come and see what AWS and the cloud can do for you. Register here: http://amzn.to/1RooPPL
