Deploying a Data Lake on AWS - AWS Online Tech Talks, March 2017
TRANSCRIPT
Deploying a Data Lake in AWS
Siva Raghupathy, Senior Manager, Big Data Solutions Architecture, AWS
March 21, 2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
- Data Lake concepts
- Simplify the data lake
- What technologies should you use? Why? How?
- Reference architecture
- Design patterns
What is a Data Lake?
It is an architecture that allows you to collect, store, process, analyze, and consume all data that flows into your organization.
Is it just about storing all data in a single location? That alone is not enough. You need all the critical components, and I will show you how to assemble them together: persist an immutable copy of the data, then materialize views from it.
Why Data Lake?
- Leverage all data that flows into your organization
- Customer centricity
- Business agility
- Better predictions via machine learning
- Competitive advantage
Data Lake Enablers
- Big data technology evolution
- Cloud services evolution/economics
- Big data + cloud architecture convergence
Big Data Evolution
Batch processing -> Stream processing -> Artificial intelligence
- Hourly server logs: were your systems misbehaving an hour ago?
- Weekly/monthly bill: what you spent this billing cycle
- Daily customer-preferences report from your web site's clickstream: what deal or ad to try next time
- Daily fraud reports: was there fraud yesterday?
- Real-time alerts: what went wrong now
- Real-time spending caps: prevent overspending now
- Real-time analysis: what to offer the current customer now
- Real-time detection: block fraudulent use now
I need to harness big data, fast. I want more happy customers. I want to save/make more money.
Cloud Services Evolution
Virtual machines -> Managed services -> Serverless
http://www.allthingsdistributed.com/2016/06/aws-lambda-serverless-reference-architectures.html
Plethora of Tools
[Diagram: the AWS big data portfolio - Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, AWS Lambda, Amazon ML, Amazon SQS, Amazon ElastiCache, Amazon DynamoDB Streams, Amazon Elasticsearch Service, Amazon Kinesis Analytics, Amazon QuickSight - alongside open-source tools such as Hive, Spark, Storm, Kafka, HBase, Flume, Impala, and Cascading]
Data Lake Challenges
Is there a reference architecture? What tools should I use? How? Why?
Architectural Principles
- Build decoupled systems: Data -> Store -> Process -> Store -> Analyze -> Answers
- Use the right tool for the job: data structure, latency, throughput, access patterns
- Leverage AWS managed services: scalable/elastic, available, reliable, secure, no/low admin
- Use log-centric design patterns: immutable logs, materialized views
- Be cost-conscious: big data ≠ big cost
Before we go into solving the big data architecture, I want to introduce some tried and tested architecture principles.
Here at AWS we believe you should use the right tool for the job: instead of using a big Swiss Army knife to drive a screw, it is best to use a screwdriver. This is especially important for big data architectures; we'll talk about this more. Decoupled architecture (http://whatis.techtarget.com/definition/decoupled-architecture): in general, a decoupled architecture is a framework for complex work that allows components to remain completely autonomous and unaware of each other. This has been tried and battle tested. Managed services are relatively new. Should I install Cassandra or MongoDB or CouchDB on AWS? You obviously can; sometimes there are good reasons for doing this, and many customers still do. Netflix is a great example: they run multi-region Cassandra and are a poster child for how to do this. But for most customers, delegating this task to AWS makes more sense. You are better off spending your time building features for your customers rather than building highly scalable distributed systems. Lambda Architecture -
Simplify Data Lake
[Diagram: data flows in and answers flow out; the dimensions to optimize are time to answer (latency), throughput, and cost]
Types of Data (COLLECT)
[Diagram: applications (mobile apps, web apps, data centers via AWS Direct Connect) produce RECORDS (in-memory data structures, database records); logging and transport (AWS Import/Export Snowball, Amazon CloudWatch, AWS CloudTrail) produce DOCUMENTS and FILES (search documents, log files); messaging systems produce MESSAGES; and devices, sensors & IoT platforms (AWS IoT) produce STREAMS of transactions, files, and events]
Types of Data
- Heavily read, app-defined data structures: database records
- Search documents
- Log files
- Messaging events
- Devices / sensors / IoT streams
What is streaming data?
An unbounded sequence of events that is continuously captured and processed with low latency.
What Is the Temperature of Your Data?

Data Characteristics: Hot, Warm, Cold
(columns: Hot | Warm | Cold)
- Volume: MB-GB | GB-TB | PB-EB
- Item size: B-KB | KB-MB | KB-TB
- Latency: ms | ms, sec | min, hrs
- Durability: Low-high | High | Very high
- Request rate: Very high | High | Low
- Cost/GB: $$-$ | $-¢ | ¢
Store
[Diagram: the COLLECT tier feeding the STORE tier of the reference architecture]

Types of Data Stores
- Database: SQL & NoSQL databases
- Search: search engines
- File store: file systems
- Queue: message queues
- Stream storage: pub/sub message queues
- In-memory: caches, data structure servers
Message & Stream Storage
- Amazon SQS: managed message queue service
- Apache Kafka: high-throughput distributed streaming platform
- Amazon Kinesis Streams: managed stream storage + processing
- Amazon Kinesis Firehose: managed data delivery
- Amazon DynamoDB: managed NoSQL database; tables can be stream-enabled
[Diagram: these stream options (Amazon Kinesis Streams, Amazon Kinesis Firehose, Apache Kafka, Amazon DynamoDB Streams) and the Amazon SQS message queue sit in the STORE tier, fed by the COLLECT tier]
Why Stream Storage?
- Decouple producers & consumers
- Persistent buffer
- Collect multiple streams
- Preserve client ordering
- Parallel consumption
- Streaming MapReduce
[Diagram: producers 1..n write records keyed red, green, blue, and violet into shard 1/partition 1 and shard 2/partition 2 of a DynamoDB stream, Amazon Kinesis stream, or Kafka topic - a huge buffer; consumer 1 reads one shard (count of red = 4, count of violet = 4) while consumer 2 reads the other (count of blue = 4, count of green = 4)]
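To make the ordering property concrete, here is a minimal producer sketch using boto3; the stream name and payloads are hypothetical. Records sharing a partition key hash to the same shard and are stored in arrival order, which is how stream storage preserves per-key client ordering while still allowing parallel consumption across shards.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def put_event(key, payload):
    # Records with the same PartitionKey land on the same shard, in
    # arrival order -- the "preserve client ordering" property above
    # (keys: red, green, blue, violet in the diagram).
    kinesis.put_record(
        StreamName="clickstream",           # hypothetical stream name
        Data=json.dumps(payload).encode(),
        PartitionKey=key,
    )

put_event("red", {"seq": 1})
put_event("red", {"seq": 2})   # guaranteed to follow seq 1 on the same shard
```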
What About Amazon SQS?
- Decouples producers & consumers
- Persistent buffer
- Collects multiple streams
- No client ordering (Standard); a FIFO queue preserves client ordering
- No streaming MapReduce
- No parallel consumption; Amazon SNS can publish to multiple SNS subscribers (queues or functions)
[Diagram: a publisher sends to an Amazon SNS topic, which fans out to an Amazon SQS queue and an AWS Lambda function for subscribing consumers; below, producers send messages 1-4 to an Amazon SQS queue - a Standard queue may deliver them out of order, while a FIFO queue preserves the order]
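As a sketch of the FIFO behavior just described (the queue URL and IDs are hypothetical): a FIFO queue preserves ordering per MessageGroupId and uses MessageDeduplicationId for exactly-once delivery, whereas a Standard queue would take the same call without those two parameters.

```python
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue URL; FIFO queue names must end in ".fifo".
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"

for seq in (1, 2, 3, 4):
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=f'{{"order": {seq}}}',
        MessageGroupId="customer-17",           # ordering preserved per group
        MessageDeduplicationId=f"order-{seq}",  # enables exactly-once delivery
    )
```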
Which Stream/Message Storage Should I Use?
(columns: Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS Standard | Amazon SQS FIFO)
- AWS managed: Yes | Yes | Yes | No | Yes | Yes
- Guaranteed ordering: Yes | Yes | No | Yes | No | Yes
- Delivery (deduping): Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once
- Data retention period: 24 hours | 7 days | N/A | Configurable | 14 days | 14 days
- Availability: 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ
- Scale/throughput: No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic | 300 TPS / queue
- Parallel consumption: Yes | Yes | No | Yes | No | No
- Stream MapReduce: Yes | Yes | N/A | Yes | N/A | N/A
- Row/object size: 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
- Cost: Higher (table cost) | Low | Low | Low (+admin) | Low-medium | Low-medium
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html
https://aws.amazon.com/blogs/aws/new-for-amazon-simple-queue-service-fifo-queues-with-exactly-once-delivery-deduplication/
http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-BE3BA3E4-1AC5-4E7A-B542-015056D8EDAF
- Kinesis -> $52.14 per month
- SQS -> $133.42 per month for puts, or ~$400/month (put, get, delete)
- DynamoDB -> $3,809.88 per month (10 TB of storage alone costs $2,500/month)
Cost (100 rps x 35 KB): $52/month vs. $133/month x 2 = $266/month?
Amazon DynamoDB Service (US-East): Provisioned Throughput Capacity $120; Indexed Data Storage $2,560.90; DynamoDB Streams $1.30. Amazon SQS Service (US-East):
Pricing Example
Let's assume that our data producers put 100 records per second in aggregate, and each record is 35 KB. In this case, the total data input rate is 3.4 MB/sec (100 records/sec x 35 KB/record). For simplicity, we assume that the throughput and data size of each record are stable and constant throughout the day. Note that we can dynamically adjust the throughput of our Amazon Kinesis stream at any time.
We first calculate the number of shards needed for our stream to achieve the required throughput. As one shard provides a capacity of 1 MB/sec data input and supports 1,000 records/sec, four shards provide a capacity of 4 MB/sec data input and support 4,000 records/sec. So a stream with four shards satisfies our required throughput of 3.4 MB/sec at 100 records/sec.
We then calculate our monthly Amazon Kinesis costs using Amazon Kinesis pricing in the US-East Region:
- Shard hour: one shard costs $0.015 per hour, or $0.36 per day ($0.015 x 24). Our stream has four shards, so it costs $1.44 per day ($0.36 x 4). For a month with 31 days, our monthly shard-hour cost is $44.64 ($1.44 x 31).
- PUT payload unit (25 KB): as our record is 35 KB, each record contains two PUT payload units. Our data producers put 100 records, or 200 PUT payload units, per second in aggregate. That is 267,840,000 records, or 535,680,000 PUT payload units, per month. As one million PUT payload units cost $0.014, our monthly PUT payload unit cost is $7.50 ($0.014 x 535.68).
Adding the shard-hour and PUT payload unit costs together, our total Amazon Kinesis cost is $1.68 per day, or $52.14 per month. For $1.68 per day, we have a fully managed streaming data infrastructure that enables us to continuously ingest 4 MB of data per second, or 337 GB of data per day, in a reliable and elastic manner.
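The shard-sizing and cost arithmetic above can be reproduced in a few lines. This sketch simply encodes the numbers from the worked example (US-East, 2017 prices), so treat the constants as illustrative:

```python
import math

RECORDS_PER_SEC = 100
RECORD_KB = 35
SHARD_HOUR_USD = 0.015            # per shard-hour (example pricing above)
PUT_UNIT_USD_PER_MILLION = 0.014  # one PUT payload unit = 25 KB
DAYS = 31

# A shard ingests 1 MB/sec and 1,000 records/sec; take the larger requirement.
shards = max(math.ceil(RECORDS_PER_SEC * RECORD_KB / 1024),
             math.ceil(RECORDS_PER_SEC / 1000))
shard_cost = shards * 24 * DAYS * SHARD_HOUR_USD                  # $44.64
put_units = RECORDS_PER_SEC * math.ceil(RECORD_KB / 25) * 86400 * DAYS
put_cost = put_units / 1_000_000 * PUT_UNIT_USD_PER_MILLION       # ~$7.50
print(shards, round(shard_cost + put_cost, 2))                    # 4, 52.14
```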
File Storage
[Diagram: Amazon S3 joins the STORE tier as the file store, alongside the stream stores (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams) and the Amazon SQS message store]
Why Is Amazon S3 the Fabric of the Data Lake?
- Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
- Decouples storage and compute:
  - No need to run compute clusters for storage (unlike HDFS)
  - Can run transient Hadoop clusters & Amazon EC2 Spot Instances
  - Multiple & heterogeneous analysis clusters can use the same data
- Unlimited number of objects and volume of data
- Very high bandwidth; no aggregate throughput limit
- Designed for 99.99% availability; can tolerate zone failure
- Designed for 99.999999999% durability
- No need to pay for data replication
- Native support for versioning
- Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies
- Secure: SSL, client-side encryption, server-side encryption at rest
- Low cost
No need to run compute clusters for storage (unlike HDFS); you can run transient Hadoop clusters on Amazon EC2 Spot Instances, and multiple distinct clusters (Spark, Hive, Presto) can use the same data. Highly durable, highly available, highly scalable, and secure: designed for 99.999999999% durability (https://aws.amazon.com/s3/storage-classes/). Lifecycle policies let objects be migrated to S3 Standard-IA, archived to Amazon Glacier, or deleted after a specified period of time. Also: unlimited number of objects, object sizes up to 5 TB, very high bandwidth, versioning, server-side encryption, a 99.99% availability SLA (Standard), event notifications, cross-region replication, low cost, and native support from almost all big data frameworks & tools.
What About HDFS & Data Tiering?
- Use HDFS for very frequently accessed (hot) data
- Use Amazon S3 Standard for frequently accessed data
- Use Amazon S3 Standard-IA for less frequently accessed data
- Use Amazon Glacier for archiving cold data
- Use Amazon S3 Analytics (new) for storage class analysis
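The tiering guidance above maps directly onto an S3 lifecycle policy. A minimal boto3 sketch, where the bucket name, prefix, and day thresholds are assumptions:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",                      # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-by-temperature",
            "Filter": {"Prefix": "raw/"},       # apply only to raw data
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                {"Days": 90, "StorageClass": "GLACIER"},      # cold
            ],
        }]
    },
)
```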
In-memory, Database, Search
[Diagram: the STORE tier now adds in-memory, database, and search stores next to the file, stream, and message stores]
Anti-Pattern
[Diagram: every application reading and writing one monolithic RDBMS data tier]
Best Practice: Use the Right Tool for the Job
Data tier:
- Search: Amazon Elasticsearch Service
- In-memory: Amazon ElastiCache (Redis, Memcached)
- SQL: Amazon Aurora, Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server)
- NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB
Materialized Views & Immutable Log
[Diagram: views are materialized from an immutable log]
[Diagram: the COLLECT and STORE tiers, with the data stores in place - search (Amazon Elasticsearch Service), SQL (Amazon RDS), NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache), file (Amazon S3), stream, and message stores]
- Amazon ElastiCache: managed Memcached or Redis service
- Amazon DynamoDB: managed NoSQL database service
- Amazon RDS: managed relational database service
- Amazon Elasticsearch Service: managed Elasticsearch service
Which Data Store Should I Use?
- Data structure: fixed schema, JSON, key-value
- Access patterns: store data in the format you will access it
- Data characteristics: hot, warm, cold
- Cost: right cost
Data Structure and Access Patterns

Access pattern -> What to use
- Put/Get (key, value) -> In-memory, NoSQL
- Simple relationships (1:N, M:N) -> NoSQL
- Multi-table joins, transactions, SQL -> SQL
- Faceting, search -> Search

Data structure -> What to use
- Fixed schema -> SQL, NoSQL
- Schema-free (JSON) -> NoSQL, Search
- (Key, value) -> In-memory, NoSQL
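For the first row of that mapping, a put/get (key, value) workload is one call each against a NoSQL store. A minimal DynamoDB sketch, with a hypothetical table:

```python
import boto3

# Hypothetical table with partition key "user_id".
table = boto3.resource("dynamodb").Table("user-preferences")

table.put_item(Item={"user_id": "u-123", "theme": "dark", "locale": "en-US"})
item = table.get_item(Key={"user_id": "u-123"}).get("Item")
print(item)   # {'user_id': 'u-123', 'theme': 'dark', 'locale': 'en-US'}
```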
2 x 2 matrix: structure vs. level of query (from none to complex). Draw down the slide.
[Diagram: data stores arranged by temperature - in-memory and NoSQL for hot data, SQL and search for warm data, S3 and Amazon Glacier for cold data; moving from hot to cold, request rate and cost/GB go from high to low while latency and data volume go from low to high]
Which Data Store Should I Use? (hot -> warm -> cold)
(columns: Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon ES | Amazon S3 | Amazon Glacier)
- Average latency: ms | ms | ms, sec | ms, sec | ms, sec, min (~ size) | hrs
- Typical data stored: GB | GB-TB (no limit) | GB-TB (64 TB max) | GB-TB | MB-PB (no limit) | GB-PB (no limit)
- Typical item size: B-KB | KB (400 KB max) | KB (64 KB max) | B-KB (2 GB max) | KB-TB (5 TB max) | GB (40 TB max)
- Request rate: High - very high | Very high (no limit) | High | High | Low - high (no limit) | Very low
- Storage cost GB/month: $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢ 4/10
- Durability: Low - moderate | Very high | Very high | High | Very high | Very high
- Availability: High (2 AZ) | Very high (3 AZ) | Very high (3 AZ) | High (2 AZ) | Very high (3 AZ) | Very high (3 AZ)
Elasticsearch notes: 40-50K writes/second (with 10 MB batches; Firehose flushes streams every 5 seconds). Segment size is a key parameter to tune; larger segment merges are better. A shard is a portion of an index. Domains have two size limits - 10 TB of EBS storage or 32 TB of instance storage - and usable data is half that. 20 data nodes. Logging retention period: 2 or 3 days of data. Latency: seconds, tunable by how often you flush; low volume -> ms, huge volume -> batched puts every 5 seconds.
Cost-Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
"I'm currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month."
Request rate: 300 writes/sec | Object size: 2,048 bytes | Total size: 1,483 GB/month | Objects per month: 777,600,000
Simple Monthly Calculator: https://calculator.s3.amazonaws.com/index.html
Amazon S3 or DynamoDB?
(columns: Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month)
- Scenario 1: 300 | 2,048 | 1,483 | 777,600,000 -> use Amazon DynamoDB
- Scenario 2: 300 | 32,768 | 23,730 | 777,600,000 -> use Amazon S3
Scenario 1: http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-F6B3AD98-1404-4770-BAB0-1F5397F445A7
Scenario 2: http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-2440EC2A-1C16-4BCE-B5CE-5075887F4A47
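The two scenarios can be sanity-checked without the calculator. The sketch below uses illustrative 2017-era US-East prices (the constants are assumptions; use the Simple Monthly Calculator for real numbers) and shows why request-heavy small objects favor DynamoDB while larger objects favor S3:

```python
import math

S3_PUT_PER_1K = 0.005       # $ per 1,000 S3 PUT requests (assumed)
S3_GB_MONTH = 0.023         # S3 Standard storage (assumed)
DDB_WCU_HOUR = 0.00065      # $ per write-capacity-unit-hour (assumed)
DDB_GB_MONTH = 0.25         # DynamoDB storage (assumed)

def monthly_cost(writes_per_sec, object_kb, total_gb):
    objects = writes_per_sec * 86400 * 30
    s3 = objects / 1000 * S3_PUT_PER_1K + total_gb * S3_GB_MONTH
    wcu = writes_per_sec * math.ceil(object_kb)   # 1 WCU writes 1 KB/sec
    ddb = wcu * 24 * 30 * DDB_WCU_HOUR + total_gb * DDB_GB_MONTH
    return round(s3), round(ddb)

print(monthly_cost(300, 2, 1483))     # scenario 1: S3 ~$3922, DynamoDB ~$652
print(monthly_cost(300, 32, 23730))   # scenario 2: S3 ~$4434, DynamoDB ~$10425
```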
PROCESS / ANALYZE
- Batch: takes minutes to hours. Example: daily/weekly/monthly reports. Amazon EMR (MapReduce, Hive, Pig, Spark)
- Interactive: takes seconds. Example: self-service dashboards. Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
- Message: takes milliseconds to seconds. Example: message processing. Amazon SQS applications on Amazon EC2
- Stream: takes milliseconds to seconds. Example: fraud alerts, 1-minute metrics. Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
- Artificial intelligence: takes milliseconds to minutes. Example: fraud detection, demand forecasting, text-to-speech. Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)
Analytics Types & Frameworks
[Diagram: the PROCESS/ANALYZE tier - message processing (Amazon SQS apps on Amazon EC2), stream processing (Amazon Kinesis Analytics, KCL apps, AWS Lambda, Amazon EMR), interactive (Amazon Redshift, Amazon Athena, Presto on Amazon EMR), batch (Amazon EMR), and AI (Amazon AI) - arranged from fast to slow]
Which Stream & Message Processing Technology Should I Use?
(columns: Amazon EMR (Spark Streaming) | Apache Storm | KCL Application | Amazon Kinesis Analytics | AWS Lambda | Amazon SQS Application)
- AWS managed: Yes (Amazon EMR) | No (do it yourself) | No (EC2 + Auto Scaling) | Yes | Yes | No (EC2 + Auto Scaling)
- Serverless: No | No | No | Yes | Yes | No
- Scale/throughput: No limits / ~ nodes | No limits / ~ nodes | No limits / ~ nodes | Up to 8 KPU / automatic | No limits / automatic | No limits / ~ nodes
- Availability: Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ | Multi-AZ
- Programming languages: Java, Python, Scala | Almost any language via Thrift | Java, others via MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, ...)
- Uses: Multistage processing | Multistage processing | Single-stage processing | Multistage processing | Simple event-based triggers | Simple event-based triggers
- Reliability: KCL and Spark checkpoints | Framework managed | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda | Managed by SQS visibility timeout
Directed acyclic graphs? Exactly-once processing & DAGs - how do you do this?
https://storm.apache.org/documentation/Rationale.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
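As one concrete example from the table, an AWS Lambda function consuming a Kinesis stream is just a handler over base64-encoded records. The fraud threshold below is a hypothetical single-stage rule:

```python
import base64
import json

def handler(event, context):
    # Kinesis delivers batches of records; each payload is base64-encoded.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("amount", 0) > 10_000:      # hypothetical fraud rule
            print("ALERT:", payload)               # e.g., publish to Amazon SNS
```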
Which Analysis Tool Should I Use?
(columns: Amazon Redshift | Amazon Athena | Amazon EMR (Presto, Spark, Hive))
- Use case: Optimized for data warehousing | Ad-hoc interactive queries | Interactive queries (Presto); general purpose, e.g. iterative ML and real-time (Spark); batch (Hive)
- Scale/throughput: ~ nodes | Automatic / no limits | ~ nodes
- AWS managed service: Yes | Yes, serverless | Yes
- Storage: Local storage | Amazon S3 | Amazon S3, HDFS
- Optimization: Columnar storage, data compression, and zone maps | CSV, TSV, JSON, Parquet, ORC, Apache web logs | Framework dependent
- Metadata: Amazon Redshift managed | Athena Catalog Manager | Hive metastore
- BI tools support: Yes (JDBC/ODBC) | Yes (JDBC) | Yes (JDBC/ODBC & custom)
- Access controls: Users, groups, and access controls | AWS IAM | Integration with LDAP
- UDF support: Yes (scalar) | No | Yes
Athena: simple. Redshift: fast. EMR: configurable. 30% of queries & 70% of data -> Athena.
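Running an ad-hoc interactive query against data in place on S3 takes one API call. A minimal boto3 sketch (the database, table, and bucket names are hypothetical):

```python
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits "
                "FROM logs.access GROUP BY status",   # hypothetical table
    ResultConfiguration={
        "OutputLocation": "s3://my-data-lake/athena-results/"
    },
)
# Athena is asynchronous: poll get_query_execution() with this ID,
# then read the result set from the S3 output location.
print(resp["QueryExecutionId"])
```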
What About ETL?
https://aws.amazon.com/big-data/partner-solutions/
[Diagram: ETL sits between the STORE and PROCESS/ANALYZE tiers]
Data Integration Partners: reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.
AWS Glue (new): a fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores.
CONSUME
[Diagram: the full reference architecture - the COLLECT tier (applications, logging, transport, messaging, and IoT sources), the STORE tier (Amazon S3 file store; Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams for streams; Amazon SQS for messages; Amazon Elasticsearch Service, Amazon RDS, Amazon DynamoDB, and Amazon ElastiCache for search/SQL/NoSQL/cache), the PROCESS/ANALYZE tier (message, streaming, interactive, batch, and AI, connected by ETL), and the CONSUME tier]

Who consumes:
- Business users: analysis & visualization (Amazon QuickSight), apps & services
- Data scientists & developers: notebooks, IDE, applications & API
Putting It All Together
[Diagram: the complete data lake architecture assembled - everything from COLLECT through STORE and PROCESS/ANALYZE to CONSUME, as built up in the preceding sections]
What About Metadata?
- Amazon Athena Catalog: an internal data catalog for tables/schemas on S3
- Glue Catalog: Hive metastore compliant; crawlers detect new data, schemas, and partitions; search enables metadata discovery
- EMR Hive Metastore (Presto, Spark, Hive, Pig): can be hosted on Amazon RDS
[Diagram: data catalog options - the Amazon Athena Catalog, the Glue Catalog, and an Amazon EMR Hive metastore hosted on Amazon RDS - serving Presto on Amazon EMR and Amazon Athena]
Amazon Athena uses an internal data catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3.
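Registering a schema in that catalog is itself just a DDL statement over data already sitting in S3. A hedged sketch (the table layout and locations are hypothetical, and the `logs` database is assumed to exist):

```python
import boto3

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs.access (
  ts string,
  status int,
  bytes bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://my-data-lake/raw/access/'
"""

boto3.client("athena").start_query_execution(
    QueryString=ddl,   # the table definition lands in Athena's data catalog
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
```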
Security & Governance
- AWS Identity and Access Management (IAM)
- Amazon Cognito
- Amazon CloudWatch & AWS CloudTrail
- AWS KMS & AWS CloudHSM
- AWS Directory Service
- Apache Ranger
- AWS Identity and Access Management (IAM) enables you to securely control access to AWS services and resources for your users.
- Amazon Cognito lets you easily add user sign-up and sign-in to your mobile and web apps.
- Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS.
- With CloudTrail, you can log, continuously monitor, and retain events related to API calls across your AWS infrastructure.
- AWS Key Management Service (KMS) is a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data, and uses Hardware Security Modules (HSMs) to protect the security of your keys. It is integrated with several other AWS services to help you protect the data you store with them, and with AWS CloudTrail to provide logs of all key usage to help meet your regulatory and compliance needs.
- AWS Directory Service for Microsoft Active Directory (Enterprise Edition), also known as AWS Microsoft AD, enables your directory-aware workloads and AWS resources to use managed Active Directory in the AWS Cloud. It is built on actual Microsoft Active Directory and does not require you to synchronize or replicate data from your existing Active Directory to the cloud.
- https://aws.amazon.com/blogs/big-data/implementing-authorization-and-auditing-using-apache-ranger-on-amazon-emr/
Data Lake Reference Architecture
[Diagram: the reference architecture wrapped by Security & Governance (IAM, AWS STS, Amazon CloudWatch, AWS CloudTrail, AWS KMS, AWS CloudHSM, AWS Directory Service) and the Data Catalog (Amazon Athena Catalog, a Hive metastore on RDS, and the Glue Catalog)]
Design Patterns
[Diagram: design patterns plotted by data temperature (hot -> cold) and processing speed (fast -> slow): real-time (AWS Lambda and KCL apps over Amazon Kinesis), interactive (Amazon Redshift, Amazon Athena, Spark and Presto on Amazon EMR, native apps over Amazon DynamoDB), and batch (Hive on Amazon EMR over Amazon S3) - all turning data into answers]
This is a summary of all six design patterns together: all of the solutions available, placed in the context of the temperature of the data and the data-processing latency requirements.
Hive: a year's worth of clickstream data. Spark: a year of clickstream data - what people frequently buy together. Redshift: reporting with enterprise reporting tools, SQL-heavy. Impala: same as Redshift. Presto: same league as Impala - interactive SQL analytics for shops with a Hadoop installed base. NoSQL: analytics on NoSQL.
Real-time Analytics
[Diagram: an Amazon Kinesis stream fans out to Amazon Kinesis Analytics, a KCL app, AWS Lambda, and Spark Streaming on Amazon EMR for processing; results drive Amazon SNS notifications and alerts, Amazon AI real-time predictions, and app state or materialized views stored in Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, or Amazon ES for KPIs, while a raw log is archived to Amazon S3]
Best practices
BDT318 - Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second
- Use Kinesis or Kafka for stream storage
- Use the appropriate stream-processing tool:
  - KCL: simple, only for Kinesis (a bare-bones alternative is sketched below)
  - Apache Storm: general purpose
  - Spark Streaming: windowing, MLlib, stateful
  - AWS Lambda: fully managed, no servers, stateless
- Use Amazon SNS for alerts
- Use Amazon ML for predictions
- Use a managed database/cache for state management
- Use a visualization tool for KPI visualization
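For contrast with the KCL (which handles checkpointing, failover, and shard rebalancing for you), here is what a bare-bones single-shard consumer looks like with boto3 alone; the stream name is hypothetical, and real applications should prefer the KCL:

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "clickstream"   # hypothetical stream

shard_id = kinesis.describe_stream(StreamName=STREAM)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in out["Records"]:
        print(record["Data"])        # the KCL would also checkpoint progress
    iterator = out["NextShardIterator"]
    time.sleep(1)                    # respect the per-shard read limits
```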
Interactive & Batch Analytics
[Diagram: Amazon Kinesis Firehose delivers streams and files into Amazon S3; Amazon EMR (Hive, Pig, Spark) produces batch views and Amazon AI batch predictions, while Amazon Redshift, Amazon Athena, and Amazon EMR (Presto, Spark) serve interactive queries]

[Diagram: the Lambda architecture - an Amazon Kinesis stream feeds a speed layer (KCL apps, AWS Lambda, Storm, or Spark Streaming on Amazon EMR maintaining app state or materialized views in Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, or Amazon ES, with Amazon AI for real-time predictions) and, via Amazon Kinesis Firehose, an immutable data lake on Amazon S3 queried in batch by Amazon EMR (Presto, Hive, Pig, Spark), Amazon Redshift, and Amazon Athena; change data capture carries transactions from Amazon DynamoDB and Amazon RDS into the stream]
Lambda Architecture best practices:
- Use Kinesis or Kafka for stream storage
- Maintain an immutable (append-only) raw master dataset in Amazon S3; use the Amazon Kinesis S3 connector or Firehose (see the sketch below)
- Use appropriate batch-processing technologies for creating batch views
- Use appropriate stream-processing technology for real-time views
- Use a serving layer with an appropriate database to serve downstream applications
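Maintaining the immutable raw master dataset is mostly a delivery problem; with Firehose it reduces to put_record calls, and the service batches and writes the append-only files into S3 (the delivery stream name and event shape are hypothetical):

```python
import json
import boto3

firehose = boto3.client("firehose")

def append_raw_event(event):
    # Firehose buffers records and delivers them to S3 as immutable,
    # append-only objects -- the raw master data for batch views.
    firehose.put_record(
        DeliveryStreamName="raw-events-to-s3",   # hypothetical stream
        Record={"Data": (json.dumps(event) + "\n").encode()},
    )

append_raw_event({"type": "page_view", "ts": 1490000000})
```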
Summary
- Build decoupled systems: Data -> Store -> Process -> Store -> Analyze -> Answers
- Use the right tool for the job: data structure, latency, throughput, access patterns
- Leverage AWS managed services: scalable/elastic, available, reliable, secure, no/low admin
- Use log-centric design patterns: immutable log; batch, interactive & real-time views
- Be cost-conscious: big data ≠ big cost
Data Lake Reference Architecture
[Diagram: the reference architecture again, wrapped by Security & Governance (now also including Amazon Cognito) and the Data Catalog]
The Glue catalog does schema detection and partition inference.
Resources
- https://aws.amazon.com/blogs/big-data/introducing-the-data-lake-solution-on-aws/
- AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)
- AWS re:Invent 2016: Deep Dive on Amazon S3 (STG303)
- https://aws.amazon.com/blogs/big-data/reinvent-2016-aws-big-data-machine-learning-sessions/
- https://aws.amazon.com/blogs/big-data/implementing-authorization-and-auditing-using-apache-ranger-on-amazon-emr/
Thank you!