Building a Big Data & Analytics Platform using AWS


Upload: amazon-web-services

Post on 09-Aug-2015

TRANSCRIPT

Page 1: Building a Big Data & Analytics Platform using AWS

Chris Hampartsoumian

Technology Evangelist - ASEAN

End to End Data Flows on the Cloud

Structured, Unstructured & Streaming

July 2015

Page 2: Building a Big Data & Analytics Platform using AWS

How is Cloud Computing important for Big Data Applications?

Page 3: Building a Big Data & Analytics Platform using AWS

How did Amazon get into cloud computing?

Page 4: Building a Big Data & Analytics Platform using AWS

11 Regions

30 Availability Zones

53 Edge locations

AWS Global Infrastructure

Page 5: Building a Big Data & Analytics Platform using AWS

Why are customers adopting cloud computing?

• Variable expense: replace capital expenditure with variable expense

• Elastic capacity: no need to guess capacity requirements and over-provision

• Speed and agility: infrastructure in minutes, not weeks

• Global reach: go global in minutes and reach a global audience

Page 6: Building a Big Data & Analytics Platform using AWS

Your Applications

• Mobile: Push Notifications, Mobile Analytics, Cognito, Cognito Sync

• Enterprise Applications: WorkSpaces, WorkMail, WorkDocs

• Application: SQS, SWF, AppStream, Elastic Transcoder, SES, CloudSearch, SNS

• Analytics: Kinesis, Data Pipeline, Redshift, EMR, Machine Learning

• Database: DynamoDB, RDS, ElastiCache

• Deployment & Management: Elastic Beanstalk, OpsWorks, CloudFormation, CodeDeploy, CodePipeline, CodeCommit

• Security & Administration: CloudWatch, Config, CloudTrail, IAM, Directory, KMS

• Compute: EC2, ELB, Auto Scaling, Lambda, ECS

• Storage: EBS, Glacier, CloudFront, EFS, S3

• Network: VPC, Direct Connect, Route 53

• API

• Human Interaction: Support, Web Console

• Interaction: Command Line; Libraries, SDKs

AWS Global Infrastructure: 11 Regions, 30 Availability Zones, 53 Edge Locations

Page 7: Building a Big Data & Analytics Platform using AWS

[Chart: data stores plotted by structure (low to high) and size (small to large): Traditional Database, NoSQL, Hadoop, MPP Database]

Page 8: Building a Big Data & Analytics Platform using AWS

• Structured: MPP Databases (Amazon Redshift)

• Unstructured: Hadoop (Amazon EMR)

• Streaming: Real-time Analysis (Amazon Kinesis)

Page 9: Building a Big Data & Analytics Platform using AWS

• Standard SQL

• Optimized for fast analysis

• Very scalable

Page 10: Building a Big Data & Analytics Platform using AWS

Amazon Redshift

Page 11: Building a Big Data & Analytics Platform using AWS

Q1. What is it?

Page 12: Building a Big Data & Analytics Platform using AWS

Amazon Redshift

• MPP SQL Database

• Optimised for Analytics

• Gigabytes to Petabytes

• Fully relational

• Fully managed

Page 13: Building a Big Data & Analytics Platform using AWS

Q2. How does it work?

Page 14: Building a Big Data & Analytics Platform using AWS

JDBC/ODBC

Page 15: Building a Big Data & Analytics Platform using AWS

JDBC/ODBC

ID | Name
---+--------------
1 | John Smith
2 | Jane Jones
3 | Peter Black
4 | Pat Partridge
5 | Sarah Cyan
6 | Brian Snail

The same rows, distributed in three groups:

1 John Smith, 4 Pat Partridge

2 Jane Jones, 5 Sarah Cyan

3 Peter Black, 6 Brian Snail
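A minimal sketch of how the six rows above could end up in three groups, assuming simple round-robin distribution. The function name `distribute_round_robin` and the choice of three slices are illustrative assumptions, not part of Redshift's API.

```python
def distribute_round_robin(rows, num_slices):
    """Assign rows to slices in turn: row 0 -> slice 0, row 1 -> slice 1, ..."""
    slices = [[] for _ in range(num_slices)]
    for index, row in enumerate(rows):
        slices[index % num_slices].append(row)
    return slices


# The six sample rows from the slide.
rows = [(1, "John Smith"), (2, "Jane Jones"), (3, "Peter Black"),
        (4, "Pat Partridge"), (5, "Sarah Cyan"), (6, "Brian Snail")]

slices = distribute_round_robin(rows, 3)
for number, contents in enumerate(slices, start=1):
    # Print the slice number and the IDs it holds.
    print(number, [row_id for row_id, _ in contents])
```

Each slice can then scan its own rows in parallel, which is the point of an MPP layout.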

Page 16: Building a Big Data & Analytics Platform using AWS

Dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• With row storage you do unnecessary I/O

• To get average Amount by State, you have to read everything

ID | Age | State | Amount
----+-----+-------+-------
123 | 20 | CA | 500
345 | 25 | WA | 250
678 | 40 | FL | 125
957 | 37 | WA | 375

Page 17: Building a Big Data & Analytics Platform using AWS

Dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• With column storage, you only read the data you need

ID | Age | State | Amount
----+-----+-------+-------
123 | 20 | CA | 500
345 | 25 | WA | 250
678 | 40 | FL | 125
957 | 37 | WA | 375
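A small sketch of the row-versus-column point, using the four sample rows from the slide. It counts how many stored values a query for average Amount by State has to touch under each layout. The in-memory lists and the helper name `average_amount_by_state` are illustrative assumptions; a real engine reads disk blocks, not Python lists.

```python
from collections import defaultdict

# The four sample rows, stored row by row.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]

# The same data stored column by column.
columns = {
    "id": [123, 345, 678, 957],
    "age": [20, 25, 40, 37],
    "state": ["CA", "WA", "FL", "WA"],
    "amount": [500, 250, 125, 375],
}


def average_amount_by_state(states, amounts):
    """Group amounts by state and return the mean per state."""
    totals = defaultdict(lambda: [0, 0])  # state -> [sum, count]
    for state, amount in zip(states, amounts):
        totals[state][0] += amount
        totals[state][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


# Row storage: every field of every row is read to reach State and Amount.
values_read_row_store = sum(len(row) for row in rows)

# Column storage: only the two columns the query needs are read.
needed = ("state", "amount")
values_read_column_store = sum(len(columns[name]) for name in needed)

averages = average_amount_by_state(columns["state"], columns["amount"])
print(averages)
print(values_read_row_store, values_read_column_store)
```

On this toy table the column layout touches half as many values; the gap widens as tables gain columns the query does not use.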

Page 18: Building a Big Data & Analytics Platform using AWS

Dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• COPY compresses automatically

• You can analyze and override

• More performance, less cost

analyze compression listing;

Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw

Page 19: Building a Big Data & Analytics Platform using AWS

Dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Track the minimum and maximum value for each block

• Skip over blocks that don’t contain relevant data

Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)

Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)

Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)
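A minimal sketch of block skipping with a zone map, using only the per-block minimum and maximum values shown on the slide. The dictionary layout and the function name `blocks_to_scan` are illustrative assumptions, not Redshift internals.

```python
# Minimum and maximum per block, taken from the slide.
zone_map = [
    {"block": 1, "min": 10, "max": 324},
    {"block": 2, "min": 375, "max": 623},
    {"block": 3, "min": 637, "max": 959},
]


def blocks_to_scan(zone_map, low, high):
    """Return the blocks whose [min, max] range overlaps [low, high]."""
    return [entry["block"] for entry in zone_map
            if entry["max"] >= low and entry["min"] <= high]


# A range inside one block: the other two blocks are skipped.
print(blocks_to_scan(zone_map, 400, 600))
# A range that straddles two blocks.
print(blocks_to_scan(zone_map, 300, 400))
# A range that falls in a gap between blocks: nothing is read.
print(blocks_to_scan(zone_map, 325, 370))
```

The check never looks at the values inside a block, only at two numbers per block, which is why it is cheap.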

Page 20: Building a Big Data & Analytics Platform using AWS

Q3. What’s good about it?

Performance, Scalability, Ease of Use, Cost

Page 21: Building a Big Data & Analytics Platform using AWS

Performance Evaluation on 2B Rows

Aggregate by month 02:08:35 00:35:46 00:00:12

Traditional SQL Database

Amazon Redshift

Page 22: Building a Big Data & Analytics Platform using AWS

[Diagram: a single DW2.L node (160 GB) scaling up to a cluster of 100 8XL nodes (2 PB)]

Page 23: Building a Big Data & Analytics Platform using AWS

Q4. How do I integrate with Redshift?

Page 24: Building a Big Data & Analytics Platform using AWS

Works with your existing analysis tools

JDBC/ODBC

Amazon Redshift

Page 25: Building a Big Data & Analytics Platform using AWS

Loading data

[Diagram: data loaded into Redshift from S3, DynamoDB, EMR, and Linux]

Page 26: Building a Big Data & Analytics Platform using AWS

[Diagram: Source Systems, ETL, Amazon Redshift]

Page 27: Building a Big Data & Analytics Platform using AWS

• Structured: MPP Databases (Amazon Redshift)

• Unstructured: Hadoop (Amazon EMR)

• Streaming: Real-time Analysis (Amazon Kinesis)

Page 28: Building a Big Data & Analytics Platform using AWS

Input File

Hadoop cluster

Functions

Output

1. Very Flexible

2. Very Scalable

3. Often Transient

Page 29: Building a Big Data & Analytics Platform using AWS

Amazon Elastic MapReduce (EMR)

Page 30: Building a Big Data & Analytics Platform using AWS

Q1. What is it?

Managed Hadoop

Page 31: Building a Big Data & Analytics Platform using AWS

[Diagram: Input File, Functions, and Output on an EMR cluster made up of six EC2 instances]

Page 32: Building a Big Data & Analytics Platform using AWS

Q2. How does it work?

Page 33: Building a Big Data & Analytics Platform using AWS

EMR

EMR Cluster

S3

1. Put the data into S3

2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase

3. Launch the cluster using the EMR console, CLI, SDK, or APIs

4. Get the output from S3
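The choices in step 2 can be sketched as a plain request dictionary. The keys below follow the shape of the `run_job_flow` request in boto3's EMR client, but nothing here calls AWS. The helper name `build_cluster_request`, the release label, the instance type, and the cluster name are illustrative assumptions; the role names are the EMR defaults and may differ in your account.

```python
def build_cluster_request(name, log_uri, core_nodes, applications):
    """Build keyword arguments shaped like an EMR run_job_flow request."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-4.0.0",  # assumed release; pick one available to you
        # Hadoop apps to install, for example Hive or Pig.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            # Number and type of nodes (instance type is an assumption).
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m3.xlarge", "InstanceCount": core_nodes},
            ],
            # Transient cluster: shut down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(
    name="example-cluster",
    log_uri="s3://examplebucket/logs/",
    core_nodes=4,
    applications=["Hive", "Pig"],
)
print(request["Instances"]["InstanceGroups"])
```

With credentials configured, the assumed next step would be `boto3.client("emr").run_job_flow(**request)`; that call is left out so the sketch runs without an AWS account.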

Page 34: Building a Big Data & Analytics Platform using AWS

EMR

EMR Cluster

S3

You can easily resize the cluster and launch parallel clusters using the same data

Page 35: Building a Big Data & Analytics Platform using AWS

EMR

EMR Cluster

S3

Use Spot nodes to save time and money

Page 36: Building a Big Data & Analytics Platform using AWS

EMR Cluster

S3

When processing is complete, you can terminate the cluster (and stop paying)

Page 37: Building a Big Data & Analytics Platform using AWS

Q3. What’s good about it?

Scalability, Cost & Ease of Use

Page 38: Building a Big Data & Analytics Platform using AWS

EMR with spot instances

Scenario #1 (without Spot)

Duration: 14 hours

Cost: 4 instances * 14 hrs * $0.50 = $28

Scenario #2 (with Spot)

Duration: 7 hours

Cost: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75

Total = $22.75

Time savings: 50%. Cost savings: ~19% ($5.25 of $28)
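The arithmetic for the two scenarios can be checked with a few lines. The instance counts, durations, and hourly rates are the ones on the slide; the helper name `cluster_cost` is an illustrative assumption. Measured against the cost without Spot, the saving works out to 18.75%.

```python
def cluster_cost(instances, hours, hourly_rate):
    """Cost in dollars of running a group of instances for a number of hours."""
    return instances * hours * hourly_rate


# Scenario #1: 4 on-demand instances for 14 hours at $0.50 per hour.
cost_without_spot = cluster_cost(4, 14, 0.50)

# Scenario #2: the same 4 on-demand instances plus 5 Spot instances
# at $0.25 per hour, finishing in 7 hours.
cost_with_spot = cluster_cost(4, 7, 0.50) + cluster_cost(5, 7, 0.25)

time_savings = (14 - 7) / 14
cost_savings = (cost_without_spot - cost_with_spot) / cost_without_spot

print(cost_without_spot, cost_with_spot)
print(f"time saved: {time_savings:.0%}, cost saved: {cost_savings:.2%}")
```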

Page 39: Building a Big Data & Analytics Platform using AWS

EMR cluster

• Master instance group

• Core instance group (HDFS)

• Task instance group

Amazon S3

Great for Spot Instances

Page 40: Building a Big Data & Analytics Platform using AWS

The Hadoop Ecosystem

Page 41: Building a Big Data & Analytics Platform using AWS

• Structured: MPP Databases (Amazon Redshift)

• Unstructured: Hadoop (Amazon EMR)

• Streaming: Real-time Analysis (Amazon Kinesis)

Page 42: Building a Big Data & Analytics Platform using AWS

Page 43: Building a Big Data & Analytics Platform using AWS

Q1. What is it?

Page 44: Building a Big Data & Analytics Platform using AWS

Kinesis

A fully managed service for real-time processing of high-volume, streaming data.

Page 45: Building a Big Data & Analytics Platform using AWS

Q2. How does it work?

Page 46: Building a Big Data & Analytics Platform using AWS

[Diagram: data sources write to a Kinesis stream that spans three Availability Zones; logging, metrics, analysis, and machine learning applications read from the stream, alongside S3, DynamoDB, Redshift, and EMR]

Page 47: Building a Big Data & Analytics Platform using AWS

Putting data into Kinesis

• Each shard: 1000 Tx per second, 1 MB per second, 50 KB payload per Tx

• Messages kept for 24 hours

• Simple PUT interface to store data in Kinesis

• A Partition Key is used to distribute the PUTs across Shards

• A unique Sequence # is created
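A minimal sketch of how a partition key can spread PUTs across shards. Kinesis hashes the partition key with MD5 into a 128-bit value and each shard owns a range of that hash space; the sketch assumes the space is split evenly between shards, which is a simplification. The function name `shard_for_key` and the sample keys are illustrative.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index via its MD5 hash.

    Assumes the hash key space is divided evenly between the shards.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, byteorder="big")
    return hash_key * shard_count // HASH_SPACE


# The same key always lands on the same shard, which keeps one source's
# records in order within that shard.
for key in ["user-1", "user-2", "user-3", "user-1"]:
    print(key, shard_for_key(key, shard_count=4))
```

A key with very high traffic still maps to a single shard, so a varied set of partition keys is what keeps the load within the per-shard limits listed above.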

Page 48: Building a Big Data & Analytics Platform using AWS

Getting data out of Kinesis

Kinesis Client Library (KCL):

• Abstracts code from individual shards

• Starts a Kinesis Worker for each shard

• Increases and decreases workers

• Tracks a Worker’s location in the stream
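A toy illustration of the two ideas above: one worker per shard, and a checkpoint that records how far each worker has read. This is not the KCL API; the class `ShardWorker`, the shard names, and the records are hypothetical stand-ins.

```python
class ShardWorker:
    """Toy stand-in for a per-shard worker; not the KCL API."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.checkpoint = None  # sequence number of the last processed record

    def process(self, records):
        """Handle (sequence_number, data) pairs that come after the checkpoint."""
        handled = []
        for sequence_number, data in records:
            if self.checkpoint is not None and sequence_number <= self.checkpoint:
                continue  # already processed, for example before a restart
            handled.append(data)
            self.checkpoint = sequence_number
        return handled


# Hypothetical stream with two shards; one worker is started per shard.
shards = {
    "shard-0": [(1, "a"), (2, "b"), (3, "c")],
    "shard-1": [(1, "x"), (2, "y")],
}
workers = {shard_id: ShardWorker(shard_id) for shard_id in shards}

for shard_id, worker in workers.items():
    print(shard_id, worker.process(shards[shard_id]), worker.checkpoint)

# Re-delivering the same records does nothing: the checkpoint is past them.
print(workers["shard-0"].process(shards["shard-0"]))
```

Application code only implements the per-record handling; shard assignment and checkpoint storage are what the library takes care of.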

Page 49: Building a Big Data & Analytics Platform using AWS

Q3. What’s good about it?

Page 50: Building a Big Data & Analytics Platform using AWS

• Easy administration

• Real-time performance

• High throughput, elastic

• Integration: S3, Redshift, DynamoDB, Storm, Elasticsearch

• Build real-time applications

• Low cost

Page 51: Building a Big Data & Analytics Platform using AWS

Amazon Machine Learning

Page 52: Building a Big Data & Analytics Platform using AWS

A Legacy of Machine Learning at Amazon

“Customers who bought this also bought…”

Page 53: Building a Big Data & Analytics Platform using AWS

Why Did We Build Amazon Machine Learning?

Page 54: Building a Big Data & Analytics Platform using AWS

Three types of data-driven development

• Retrospective analysis and reporting: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR

Page 55: Building a Big Data & Analytics Platform using AWS

Three types of data-driven development

• Retrospective analysis and reporting: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR

• Here-and-now real-time processing and dashboards: Amazon Kinesis, Amazon EC2, AWS Lambda

Page 56: Building a Big Data & Analytics Platform using AWS

Three types of data-driven development

• Retrospective analysis and reporting: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR

• Here-and-now real-time processing and dashboards: Amazon Kinesis, Amazon EC2, AWS Lambda

• Predictions to enable smart applications

Page 57: Building a Big Data & Analytics Platform using AWS

Machine learning and smart applications

• Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available

Page 58: Building a Big Data & Analytics Platform using AWS

Machine learning and smart applications

• Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available

Your data + machine learning = smart applications

Page 59: Building a Big Data & Analytics Platform using AWS

Smart applications by example

• Based on what you know about the user: Will they use your product?

Page 60: Building a Big Data & Analytics Platform using AWS

Smart applications by example

• Based on what you know about the user: Will they use your product?

• Based on what you know about an order: Is this order fraudulent?

Page 61: Building a Big Data & Analytics Platform using AWS

Smart applications by example

• Based on what you know about the user: Will they use your product?

• Based on what you know about an order: Is this order fraudulent?

• Based on what you know about a news article: What other articles are interesting?

Page 62: Building a Big Data & Analytics Platform using AWS

Challenges to Building Smart Applications Today

• Expertise: limited supply of data scientists; expensive to hire or outsource

• Technology: many choices, few mainstays; difficult to use and scale

• Operationalization: complex and error-prone data workflows; custom platforms and APIs

Page 63: Building a Big Data & Analytics Platform using AWS

What is Amazon Machine Learning?

Page 64: Building a Big Data & Analytics Platform using AWS

Amazon Machine Learning

• Easy to use, managed machine learning service built for developers

• Robust, powerful machine learning technology based on Amazon’s internal systems

• Create models using your data already stored in the AWS cloud

• Deploy models to production in seconds

Page 65: Building a Big Data & Analytics Platform using AWS

Easy to use and developer-friendly

• Use the intuitive, powerful service console to build and explore your initial models: data retrieval; model training, quality evaluation, and fine-tuning; deployment and management

• Automate model lifecycle with fully featured APIs and SDKs: Java, Python, .NET, JavaScript, Ruby, PHP

• Easily create smart iOS and Android applications with AWS Mobile SDK

Page 66: Building a Big Data & Analytics Platform using AWS

Powerful machine learning technology

• Based on Amazon’s battle-hardened internal systems

• Not just the algorithms: smart data transformations, input data and model quality alerts, built-in industry best practices

• Grows with your needs: train on up to 100 GB of data, generate billions of predictions, obtain predictions in batches or real-time

Page 67: Building a Big Data & Analytics Platform using AWS

Integrated with AWS Data Ecosystem

• Access data that is stored in Amazon S3, Amazon Redshift, or MySQL databases in RDS

• Output predictions to Amazon S3 for easy integration with your data flows

• Use AWS Identity and Access Management (IAM) for fine-grained data-access permission policies

Page 68: Building a Big Data & Analytics Platform using AWS

Fully-managed model and prediction services

• End-to-end service, with no servers to provision and manage

• One-click production model deployment

• Programmatically query model metadata to enable automatic retraining workflows

• Monitor prediction usage patterns with Amazon CloudWatch metrics

Page 69: Building a Big Data & Analytics Platform using AWS

Pay-as-you-go and inexpensive

• Data analysis, model training, and evaluation: $0.42 per instance hour

• Batch predictions: $0.10 per 1,000

• Real-time predictions: $0.10 per 1,000, plus an hourly capacity reservation charge
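A rough cost estimate built from the per-unit prices on the slide. The function name `estimate_cost` and the sample workload are illustrative assumptions. The slide gives no rate for the hourly capacity reservation charge, so it is left as a parameter that defaults to zero.

```python
TRAINING_RATE_PER_HOUR = 0.42    # data analysis, model training, evaluation
PREDICTION_RATE_PER_1000 = 0.10  # batch and real-time predictions


def estimate_cost(compute_hours, predictions, reserved_hours=0.0,
                  reservation_rate_per_hour=0.0):
    """Estimate a bill in dollars from the prices listed on the slide.

    reservation_rate_per_hour is a placeholder: the slide mentions an hourly
    capacity reservation charge for real-time use but does not state the rate.
    """
    compute = compute_hours * TRAINING_RATE_PER_HOUR
    predicting = predictions / 1000 * PREDICTION_RATE_PER_1000
    reservation = reserved_hours * reservation_rate_per_hour
    return round(compute + predicting + reservation, 2)


# Hypothetical workload: 10 compute hours and 1,000,000 batch predictions.
print(estimate_cost(compute_hours=10, predictions=1_000_000))
```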

Page 70: Building a Big Data & Analytics Platform using AWS

Three Supported Types of Predictions

• Binary classification: predict the answer to a Yes/No question

• Multi-class classification: predict the correct category from a list

• Regression: predict the value of a numeric variable
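To make the regression case concrete, here is a toy model trained with stochastic gradient descent (SGD), the algorithm named in the sample prediction output later in the deck. This is a sketch on made-up data, not Amazon ML's implementation; the function name, learning rate, and number of epochs are assumptions.

```python
def train_linear_regression(samples, learning_rate=0.01, epochs=2000):
    """Fit y = weight * x + bias with plain stochastic gradient descent."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            # Gradient step on the squared error of this single sample.
            error = (weight * x + bias) - y
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias


# Hypothetical training data that follows y = 2x + 1 exactly.
samples = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]
weight, bias = train_linear_regression(samples)


def predict(x):
    """Predict a numeric value for a new data point."""
    return weight * x + bias


print(round(predict(10), 2))
```

Binary and multi-class classification follow the same train-then-predict pattern, but return a label rather than a number.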

Page 71: Building a Big Data & Analytics Platform using AWS

How Do I Get Started Using Amazon Machine Learning?

Page 72: Building a Big Data & Analytics Platform using AWS

Get Started Quickly

• Create, access, and manage all Amazon ML entities through the AWS Management Console

• Easily learn to build a model with the tutorial dataset provided

• Add prediction capabilities to your iOS and Android applications with AWS Mobile SDK

• Use Amazon ML APIs, CLIs, or SDKs

Page 73: Building a Big Data & Analytics Platform using AWS

Building smart applications with Amazon ML

1. Build model

2. Evaluate and optimize

3. Retrieve predictions

Page 74: Building a Big Data & Analytics Platform using AWS

Building smart applications with Amazon ML

1. Train model

2. Evaluate and optimize

3. Retrieve predictions

Step 1, train model:

- Create a Datasource object pointing to your data

- Explore and understand your data

- Transform data and train your model

Page 75: Building a Big Data & Analytics Platform using AWS

Explore and understand your data

Page 76: Building a Big Data & Analytics Platform using AWS

Train your model

>>> import boto
>>> ml = boto.connect_machinelearning()
>>> # Train a regression model on an existing datasource
>>> model = ml.create_ml_model(
...     ml_model_id='my_model',
...     ml_model_type='REGRESSION',
...     training_data_source_id='my_datasource')

Page 77: Building a Big Data & Analytics Platform using AWS

Building smart applications with Amazon ML

1. Train model

2. Evaluate and optimize

3. Retrieve predictions

Step 2, evaluate and optimize:

- Understand model quality

- Adjust model interpretation

Page 78: Building a Big Data & Analytics Platform using AWS

Explore model quality

Page 79: Building a Big Data & Analytics Platform using AWS

Fine-tune model interpretation

Page 80: Building a Big Data & Analytics Platform using AWS

Fine-tune model interpretation

Page 81: Building a Big Data & Analytics Platform using AWS

Building smart applications with Amazon ML

1. Train model

2. Evaluate and optimize

3. Retrieve predictions

Step 3, retrieve predictions:

- Batch predictions

- Real-time predictions

Page 82: Building a Big Data & Analytics Platform using AWS

Batch predictions

• Asynchronous, large-volume prediction generation

• Request through service console or API

• Best for applications that deal with batches of data records

>>> import boto
>>> ml = boto.connect_machinelearning()
>>> # Queue an asynchronous batch job; results are written under output_uri
>>> batch_prediction = ml.create_batch_prediction(
...     batch_prediction_id='my_batch_prediction',
...     batch_prediction_data_source_id='my_datasource',
...     ml_model_id='my_model',
...     output_uri='s3://examplebucket/output/')

Page 83: Building a Big Data & Analytics Platform using AWS

Real-time predictions

• Synchronous, low-latency, high-throughput prediction generation

• Request through service API or server or mobile SDKs

• Best for interactive applications that deal with individual data records

>>> import boto
>>> ml = boto.connect_machinelearning()
>>> # Request a single prediction synchronously from the real-time endpoint
>>> ml.predict(
...     ml_model_id='my_model',
...     predict_endpoint='example_endpoint',
...     record={'key1': 'value1', 'key2': 'value2'})
{
    'Prediction': {
        'predictedValue': 13.284348,
        'details': {
            'Algorithm': 'SGD',
            'PredictiveModelType': 'REGRESSION'
        }
    }
}

Page 84: Building a Big Data & Analytics Platform using AWS

Architecture Patterns for Smart Applications

Page 85: Building a Big Data & Analytics Platform using AWS

Batch predictions with Amazon EMR

1. Raw data in Amazon S3

2. Process data with Amazon EMR

3. Aggregated data in Amazon S3

4. Query for predictions with Amazon ML batch API

5. Predictions in Amazon S3

6. Your application

Page 86: Building a Big Data & Analytics Platform using AWS

Batch predictions with Amazon Redshift

1. Structured data in Amazon Redshift

2. Query for predictions with Amazon ML batch API

3. Predictions in Amazon S3

4. Load predictions into Amazon Redshift, or read prediction results directly from Amazon S3

5. Your application

Page 87: Building a Big Data & Analytics Platform using AWS

Real-time predictions for interactive applications

Your application

Query for predictions with Amazon ML real-time API

Page 88: Building a Big Data & Analytics Platform using AWS

Thank You!

Page 89: Building a Big Data & Analytics Platform using AWS

aws.amazon.com/big-data

Page 90: Building a Big Data & Analytics Platform using AWS

Thank you!

@AWSCloudSEAsia

Chris Hampartsoumian

Technology Evangelist ASEAN