Deploying a Data Lake on AWS - AWS Online Tech Talks, March 2017
TRANSCRIPT
Deploying a Data Lake in AWS
Siva Raghupathy, Senior Manager, Big Data Solutions Architecture, AWS
March 21, 2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
- Data Lake concepts
- Simplify the data lake
- What technologies should you use? Why? How?
- Reference architecture
- Design patterns
What is a Data Lake?
It is an architecture that allows you to collect, store, process, analyze, and consume all data that flows into your organization.
Is it just about storing all data in a single location? That alone is not enough. You need all the critical components, and I will show you how to assemble them together: persist an immutable copy of the data, then materialize views from it.
Why Data Lake?
- Leverage all data that flows into your organization
- Customer centricity
- Business agility
- Better predictions via machine learning
- Competitive advantage
Data Lake Enablers
- Big data technology evolution
- Cloud services evolution/economics
- Big data + cloud architecture convergence
Big Data Evolution
Batch processing -> Stream processing -> Artificial intelligence
- Hourly server logs: were your systems misbehaving an hour ago?
- Weekly/monthly bill: what you spent this billing cycle
- Daily customer-preferences report from your web site's clickstream: what deal or ad to try next time
- Daily fraud reports: was there fraud yesterday?
- Real-time alerts: what went wrong now
- Real-time spending caps: prevent overspending now
- Real-time analysis: what to offer the current customer now
- Real-time detection: block fraudulent use now
I need to harness big data, fast. I want more happy customers. I want to save/make more money.
Cloud Services Evolution
Virtual machines -> Managed services -> Serverless
http://www.allthingsdistributed.com/2016/06/aws-lambda-serverless-reference-architectures.html
Plethora of Tools
[Diagram: the AWS big data portfolio - Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, AWS Lambda, Amazon ML, Amazon SQS, Amazon ElastiCache, Amazon DynamoDB Streams, Amazon Elasticsearch Service, Amazon Kinesis Analytics, Amazon QuickSight - alongside open-source tools such as Hive, Spark, Storm, Kafka, HBase, Flume, Impala, and Cascading]
Data Lake Challenges
Is there a reference architecture? What tools should I use? How? Why?
Architectural Principles
- Build decoupled systems: Data -> Store -> Process -> Store -> Analyze -> Answers
- Use the right tool for the job: data structure, latency, throughput, access patterns
- Leverage AWS managed services: scalable/elastic, available, reliable, secure, no/low admin
- Use log-centric design patterns: immutable logs, materialized views
- Be cost-conscious: big data ≠ big cost
Before we go into solving the big data architecture, I want to introduce some tried and tested architecture principles.
Here at AWS we believe you should use the right tool for the job: instead of using a big Swiss Army knife to drive a screw, it is best to use a screwdriver. This is especially important for big data architectures; we'll talk about this more. Decoupled architecture (http://whatis.techtarget.com/definition/decoupled-architecture): in general, a decoupled architecture is a framework for complex work that allows components to remain completely autonomous and unaware of each other. This has been tried and battle tested. Managed services are relatively new. Should I install Cassandra or MongoDB or CouchDB on AWS? You obviously can; sometimes there are good reasons for doing this, and many customers still do. Netflix is a great example: they run multi-region Cassandra and are a poster child for how to do this. But for most customers, delegating this task to AWS makes more sense. You are better off spending your time building features for your customers rather than building highly scalable distributed systems. Lambda Architecture -
Simplify Data Lake
[Diagram: data flows in and answers flow out; the dimensions to optimize are time to answer (latency), throughput, and cost]
Types of Data (COLLECT)
[Diagram: applications (mobile apps, web apps, data centers via AWS Direct Connect) produce RECORDS (in-memory data structures, database records); logging and transport (AWS Import/Export Snowball, Amazon CloudWatch, AWS CloudTrail) produce DOCUMENTS and FILES (search documents, log files); messaging systems produce MESSAGES; and devices, sensors & IoT platforms (AWS IoT) produce STREAMS of transactions, files, and events]
Types of Data
- Heavily read, app-defined data structures: database records
- Search documents
- Log files
- Messaging events
- Devices / sensors / IoT streams
What is streaming data?
An unbounded sequence of events that is continuously captured and processed with low latency.
What Is the Temperature of Your Data?

Data Characteristics: Hot, Warm, Cold
(columns: Hot | Warm | Cold)
- Volume: MB-GB | GB-TB | PB-EB
- Item size: B-KB | KB-MB | KB-TB
- Latency: ms | ms, sec | min, hrs
- Durability: Low-high | High | Very high
- Request rate: Very high | High | Low
- Cost/GB: $$-$ | $-¢ | ¢
Store
[Diagram: the COLLECT tier feeding the STORE tier of the reference architecture]

Types of Data Stores
- Database: SQL & NoSQL databases
- Search: search engines
- File store: file systems
- Queue: message queues
- Stream storage: pub/sub message queues
- In-memory: caches, data structure servers
Message & Stream Storage
- Amazon SQS: managed message queue service
- Apache Kafka: high-throughput distributed streaming platform
- Amazon Kinesis Streams: managed stream storage + processing
- Amazon Kinesis Firehose: managed data delivery
- Amazon DynamoDB: managed NoSQL database; tables can be stream-enabled
[Diagram: these stream options (Amazon Kinesis Streams, Amazon Kinesis Firehose, Apache Kafka, Amazon DynamoDB Streams) and the Amazon SQS message queue sit in the STORE tier, fed by the COLLECT tier]
Why Stream Storage?
- Decouple producers & consumers
- Persistent buffer
- Collect multiple streams
- Preserve client ordering
- Parallel consumption
- Streaming MapReduce
[Diagram: producers 1..n write records keyed red, green, blue, and violet into shard 1/partition 1 and shard 2/partition 2 of a DynamoDB stream, Amazon Kinesis stream, or Kafka topic - a huge buffer; consumer 1 reads one shard (count of red = 4, count of violet = 4) while consumer 2 reads the other (count of blue = 4, count of green = 4)]
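To make the ordering property concrete, here is a minimal producer sketch using boto3; the stream name and payloads are hypothetical. Records sharing a partition key hash to the same shard and are stored in arrival order, which is how stream storage preserves per-key client ordering while still allowing parallel consumption across shards.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def put_event(key, payload):
    # Records with the same PartitionKey land on the same shard, in
    # arrival order -- the "preserve client ordering" property above
    # (keys: red, green, blue, violet in the diagram).
    kinesis.put_record(
        StreamName="clickstream",           # hypothetical stream name
        Data=json.dumps(payload).encode(),
        PartitionKey=key,
    )

put_event("red", {"seq": 1})
put_event("red", {"seq": 2})   # guaranteed to follow seq 1 on the same shard
```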
What About Amazon SQS?
- Decouples producers & consumers
- Persistent buffer
- Collects multiple streams
- No client ordering (Standard); a FIFO queue preserves client ordering
- No streaming MapReduce
- No parallel consumption; Amazon SNS can publish to multiple SNS subscribers (queues or functions)
[Diagram: a publisher sends to an Amazon SNS topic, which fans out to an Amazon SQS queue and an AWS Lambda function for subscribing consumers; below, producers send messages 1-4 to an Amazon SQS queue - a Standard queue may deliver them out of order, while a FIFO queue preserves the order]
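As a sketch of the FIFO behavior just described (the queue URL and IDs are hypothetical): a FIFO queue preserves ordering per MessageGroupId and uses MessageDeduplicationId for exactly-once delivery, whereas a Standard queue would take the same call without those two parameters.

```python
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue URL; FIFO queue names must end in ".fifo".
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"

for seq in (1, 2, 3, 4):
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=f'{{"order": {seq}}}',
        MessageGroupId="customer-17",           # ordering preserved per group
        MessageDeduplicationId=f"order-{seq}",  # enables exactly-once delivery
    )
```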
Which Stream/Message Storage Should I Use?
(columns: Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS Standard | Amazon SQS FIFO)
- AWS managed: Yes | Yes | Yes | No | Yes | Yes
- Guaranteed ordering: Yes | Yes | No | Yes | No | Yes
- Delivery (deduping): Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once
- Data retention period: 24 hours | 7 days | N/A | Configurable | 14 days | 14 days
- Availability: 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ
- Scale/throughput: No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic | 300 TPS / queue
- Parallel consumption: Yes | Yes | No | Yes | No | No
- Stream MapReduce: Yes | Yes | N/A | Yes | N/A | N/A
- Row/object size: 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
- Cost: Higher (table cost) | Low | Low | Low (+admin) | Low-medium | Low-medium
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html
https://aws.amazon.com/blogs/aws/new-for-amazon-simple-queue-service-fifo-queues-with-exactly-once-delivery-deduplication/
http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-BE3BA3E4-1AC5-4E7A-B542-015056D8EDAF
- Kinesis -> $52.14 per month
- SQS -> $133.42 per month for puts, or ~$400/month (put, get, delete)
- DynamoDB -> $3,809.88 per month (10 TB of storage alone costs $2,500/month)
Cost (100 rps x 35 KB): $52/month vs. $133/month x 2 = $266/month?
Amazon DynamoDB Service (US-East): Provisioned Throughput Capacity $120; Indexed Data Storage $2,560.90; DynamoDB Streams $1.30. Amazon SQS Service (US-East):
Pricing Example
Let's assume that our data producers put 100 records per second in aggregate, and each record is 35 KB. In this case, the total data input rate is 3.4 MB/sec (100 records/sec x 35 KB/record). For simplicity, we assume that the throughput and data size of each record are stable and constant throughout the day. Note that we can dynamically adjust the throughput of our Amazon Kinesis stream at any time.
We first calculate the number of shards needed for our stream to achieve the required throughput. As one shard provides a capacity of 1 MB/sec data input and supports 1,000 records/sec, four shards provide a capacity of 4 MB/sec data input and support 4,000 records/sec. So a stream with four shards satisfies our required throughput of 3.4 MB/sec at 100 records/sec.
We then calculate our monthly Amazon Kinesis costs using Amazon Kinesis pricing in the US-East Region:
- Shard hour: one shard costs $0.015 per hour, or $0.36 per day ($0.015 x 24). Our stream has four shards, so it costs $1.44 per day ($0.36 x 4). For a month with 31 days, our monthly shard-hour cost is $44.64 ($1.44 x 31).
- PUT payload unit (25 KB): as our record is 35 KB, each record contains two PUT payload units. Our data producers put 100 records, or 200 PUT payload units, per second in aggregate. That is 267,840,000 records, or 535,680,000 PUT payload units, per month. As one million PUT payload units cost $0.014, our monthly PUT payload unit cost is $7.50 ($0.014 x 535.68).
Adding the shard-hour and PUT payload unit costs together, our total Amazon Kinesis cost is $1.68 per day, or $52.14 per month. For $1.68 per day, we have a fully managed streaming data infrastructure that enables us to continuously ingest 4 MB of data per second, or 337 GB of data per day, in a reliable and elastic manner.
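The shard-sizing and cost arithmetic above can be reproduced in a few lines. This sketch simply encodes the numbers from the worked example (US-East, 2017 prices), so treat the constants as illustrative:

```python
import math

RECORDS_PER_SEC = 100
RECORD_KB = 35
SHARD_HOUR_USD = 0.015            # per shard-hour (example pricing above)
PUT_UNIT_USD_PER_MILLION = 0.014  # one PUT payload unit = 25 KB
DAYS = 31

# A shard ingests 1 MB/sec and 1,000 records/sec; take the larger requirement.
shards = max(math.ceil(RECORDS_PER_SEC * RECORD_KB / 1024),
             math.ceil(RECORDS_PER_SEC / 1000))
shard_cost = shards * 24 * DAYS * SHARD_HOUR_USD                  # $44.64
put_units = RECORDS_PER_SEC * math.ceil(RECORD_KB / 25) * 86400 * DAYS
put_cost = put_units / 1_000_000 * PUT_UNIT_USD_PER_MILLION       # ~$7.50
print(shards, round(shard_cost + put_cost, 2))                    # 4, 52.14
```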
File Storage
[Diagram: Amazon S3 joins the STORE tier as the file store, alongside the stream stores (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams) and the Amazon SQS message store]
Why Is Amazon S3 the Fabric of the Data Lake?
- Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
- Decouples storage and compute:
  - No need to run compute clusters for storage (unlike HDFS)
  - Can run transient Hadoop clusters & Amazon EC2 Spot Instances
  - Multiple & heterogeneous analysis clusters can use the same data
- Unlimited number of objects and volume of data
- Very high bandwidth; no aggregate throughput limit
- Designed for 99.99% availability; can tolerate zone failure
- Designed for 99.999999999% durability
- No need to pay for data replication
- Native support for versioning
- Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies
- Secure: SSL, client-side encryption, server-side encryption at rest
- Low cost
No need to run compute clusters for storage (unlike HDFS); you can run transient Hadoop clusters on Amazon EC2 Spot Instances, and multiple distinct clusters (Spark, Hive, Presto) can use the same data. Highly durable, highly available, highly scalable, and secure: designed for 99.999999999% durability (https://aws.amazon.com/s3/storage-classes/). Lifecycle policies let objects be migrated to S3 Standard-IA, archived to Amazon Glacier, or deleted after a specified period of time. Also: unlimited number of objects, object sizes up to 5 TB, very high bandwidth, versioning, server-side encryption, a 99.99% availability SLA (Standard), event notifications, cross-region replication, low cost, and native support from almost all big data frameworks & tools.
What About HDFS & Data Tiering?
- Use HDFS for very frequently accessed (hot) data
- Use Amazon S3 Standard for frequently accessed data
- Use Amazon S3 Standard-IA for less frequently accessed data
- Use Amazon Glacier for archiving cold data
- Use Amazon S3 Analytics (new) for storage class analysis
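The tiering guidance above maps directly onto an S3 lifecycle policy. A minimal boto3 sketch, where the bucket name, prefix, and day thresholds are assumptions:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",                      # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-by-temperature",
            "Filter": {"Prefix": "raw/"},       # apply only to raw data
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                {"Days": 90, "StorageClass": "GLACIER"},      # cold
            ],
        }]
    },
)
```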
In-memory, Database, Search
[Diagram: the STORE tier now adds in-memory, database, and search stores next to the file, stream, and message stores]
Anti-Pattern
[Diagram: every application reading and writing one monolithic RDBMS data tier]
Best Practice: Use the Right Tool for the Job
Data tier:
- Search: Amazon Elasticsearch Service
- In-memory: Amazon ElastiCache (Redis, Memcached)
- SQL: Amazon Aurora, Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server)
- NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB
Materialized Views & Immutable Log
[Diagram: views are materialized from an immutable log]
[Diagram: the COLLECT and STORE tiers, with the data stores in place - search (Amazon Elasticsearch Service), SQL (Amazon RDS), NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache), file (Amazon S3), stream, and message stores]
- Amazon ElastiCache: managed Memcached or Redis service
- Amazon DynamoDB: managed NoSQL database service
- Amazon RDS: managed relational database service
- Amazon Elasticsearch Service: managed Elasticsearch service
Which Data Store Should I Use?
- Data structure: fixed schema, JSON, key-value
- Access patterns: store data in the format you will access it
- Data characteristics: hot, warm, cold
- Cost: right cost
Data Structure and Access Patterns

Access pattern -> What to use
- Put/Get (key, value) -> In-memory, NoSQL
- Simple relationships (1:N, M:N) -> NoSQL
- Multi-table joins, transactions, SQL -> SQL
- Faceting, search -> Search

Data structure -> What to use
- Fixed schema -> SQL, NoSQL
- Schema-free (JSON) -> NoSQL, Search
- (Key, value) -> In-memory, NoSQL
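For the first row of that mapping, a put/get (key, value) workload is one call each against a NoSQL store. A minimal DynamoDB sketch, with a hypothetical table:

```python
import boto3

# Hypothetical table with partition key "user_id".
table = boto3.resource("dynamodb").Table("user-preferences")

table.put_item(Item={"user_id": "u-123", "theme": "dark", "locale": "en-US"})
item = table.get_item(Key={"user_id": "u-123"}).get("Item")
print(item)   # {'user_id': 'u-123', 'theme': 'dark', 'locale': 'en-US'}
```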
2 x 2 matrix: structure vs. level of query (from none to complex). Draw down the slide.
[Diagram: data stores arranged by temperature - in-memory and NoSQL for hot data, SQL and search for warm data, S3 and Amazon Glacier for cold data; moving from hot to cold, request rate and cost/GB go from high to low while latency and data volume go from low to high]
Which Data Store Should I Use? (hot -> warm -> cold)
(columns: Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon ES | Amazon S3 | Amazon Glacier)
- Average latency: ms | ms | ms, sec | ms, sec | ms, sec, min (~ size) | hrs
- Typical data stored: GB | GB-TB (no limit) | GB-TB (64 TB max) | GB-TB | MB-PB (no limit) | GB-PB (no limit)
- Typical item size: B-KB | KB (400 KB max) | KB (64 KB max) | B-KB (2 GB max) | KB-TB (5 TB max) | GB (40 TB max)
- Request rate: High - very high | Very high (no limit) | High | High | Low - high (no limit) | Very low
- Storage cost GB/month: $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢ 4/10
- Durability: Low - moderate | Very high | Very high | High | Very high | Very high
- Availability: High (2 AZ) | Very high (3 AZ) | Very high (3 AZ) | High (2 AZ) | Very high (3 AZ) | Very high (3 AZ)
Elasticsearch notes: 40-50K writes/second (with 10 MB batches; Firehose flushes streams every 5 seconds). Segment size is a key parameter to tune; larger segment merges are better. A shard is a portion of an index. Domains have two size limits - 10 TB of EBS storage or 32 TB of instance storage - and usable data is half that. 20 data nodes. Logging retention period: 2 or 3 days of data. Latency: seconds, tunable by how often you flush; low volume -> ms, huge volume -> batched puts every 5 seconds.
Cost-Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
"I'm currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month."
Request rate: 300 writes/sec | Object size: 2,048 bytes | Total size: 1,483 GB/month | Objects per month: 777,600,000
Simple Monthly Calculator: https://calculator.s3.amazonaws.com/index.html
Amazon S3 or DynamoDB?
(columns: Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month)
- Scenario 1: 300 | 2,048 | 1,483 | 777,600,000 -> use Amazon DynamoDB
- Scenario 2: 300 | 32,768 | 23,730 | 777,600,000 -> use Amazon S3
Scenario 1: http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-F6B3AD98-1404-4770-BAB0-1F5397F445A7
Scenario 2: http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-2440EC2A-1C16-4BCE-B5CE-5075887F4A47
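The two scenarios can be sanity-checked without the calculator. The sketch below uses illustrative 2017-era US-East prices (the constants are assumptions; use the Simple Monthly Calculator for real numbers) and shows why request-heavy small objects favor DynamoDB while larger objects favor S3:

```python
import math

S3_PUT_PER_1K = 0.005       # $ per 1,000 S3 PUT requests (assumed)
S3_GB_MONTH = 0.023         # S3 Standard storage (assumed)
DDB_WCU_HOUR = 0.00065      # $ per write-capacity-unit-hour (assumed)
DDB_GB_MONTH = 0.25         # DynamoDB storage (assumed)

def monthly_cost(writes_per_sec, object_kb, total_gb):
    objects = writes_per_sec * 86400 * 30
    s3 = objects / 1000 * S3_PUT_PER_1K + total_gb * S3_GB_MONTH
    wcu = writes_per_sec * math.ceil(object_kb)   # 1 WCU writes 1 KB/sec
    ddb = wcu * 24 * 30 * DDB_WCU_HOUR + total_gb * DDB_GB_MONTH
    return round(s3), round(ddb)

print(monthly_cost(300, 2, 1483))     # scenario 1: S3 ~$3922, DynamoDB ~$652
print(monthly_cost(300, 32, 23730))   # scenario 2: S3 ~$4434, DynamoDB ~$10425
```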
PROCESS / ANALYZE
- Batch: takes minutes to hours. Example: daily/weekly/monthly reports. Amazon EMR (MapReduce, Hive, Pig, Spark)
- Interactive: takes seconds. Example: self-service dashboards. Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
- Message: takes milliseconds to seconds. Example: message processing. Amazon SQS applications on Amazon EC2
- Stream: takes milliseconds to seconds. Example: fraud alerts, 1-minute metrics. Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
- Artificial intelligence: takes milliseconds to minutes. Example: fraud detection, demand forecasting, text-to-speech. Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)
Analytics Types & Frameworks
[Diagram: the PROCESS/ANALYZE tier - message processing (Amazon SQS apps on Amazon EC2), stream processing (Amazon Kinesis Analytics, KCL apps, AWS Lambda, Amazon EMR), interactive (Amazon Redshift, Amazon Athena, Presto on Amazon EMR), batch (Amazon EMR), and AI (Amazon AI) - arranged from fast to slow]
Which Stream & Message Processing Technology Should I Use?
(columns: Amazon EMR (Spark Streaming) | Apache Storm | KCL Application | Amazon Kinesis Analytics | AWS Lambda | Amazon SQS Application)
- AWS managed: Yes (Amazon EMR) | No (do it yourself) | No (EC2 + Auto Scaling) | Yes | Yes | No (EC2 + Auto Scaling)
- Serverless: No | No | No | Yes | Yes | No
- Scale/throughput: No limits / ~ nodes | No limits / ~ nodes | No limits / ~ nodes | Up to 8 KPU / automatic | No limits / automatic | No limits / ~ nodes
- Availability: Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ | Multi-AZ
- Programming languages: Java, Python, Scala | Almost any language via Thrift | Java, others via MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, ...)
- Uses: Multistage processing | Multistage processing | Single-stage processing | Multistage processing | Simple event-based triggers | Simple event-based triggers
- Reliability: KCL and Spark checkpoints | Framework managed | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda | Managed by SQS visibility timeout
Directed acyclic graphs? Exactly-once processing & DAGs - how do you do this?
https://storm.apache.org/documentation/Rationale.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
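As one concrete example from the table, an AWS Lambda function consuming a Kinesis stream is just a handler over base64-encoded records. The fraud threshold below is a hypothetical single-stage rule:

```python
import base64
import json

def handler(event, context):
    # Kinesis delivers batches of records; each payload is base64-encoded.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("amount", 0) > 10_000:      # hypothetical fraud rule
            print("ALERT:", payload)               # e.g., publish to Amazon SNS
```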
Which Analysis Tool Should I Use?
(columns: Amazon Redshift | Amazon Athena | Amazon EMR (Presto, Spark, Hive))
- Use case: Optimized for data warehousing | Ad-hoc interactive queries | Interactive queries (Presto); general purpose, e.g. iterative ML and real-time (Spark); batch (Hive)
- Scale/throughput: ~ nodes | Automatic / no limits | ~ nodes
- AWS managed service: Yes | Yes, serverless | Yes
- Storage: Local storage | Amazon S3 | Amazon S3, HDFS
- Optimization: Columnar storage, data compression, and zone maps | CSV, TSV, JSON, Parquet, ORC, Apache web logs | Framework dependent
- Metadata: Amazon Redshift managed | Athena Catalog Manager | Hive metastore
- BI tools support: Yes (JDBC/ODBC) | Yes (JDBC) | Yes (JDBC/ODBC & custom)
- Access controls: Users, groups, and access controls | AWS IAM | Integration with LDAP
- UDF support: Yes (scalar) | No | Yes
Athena: simple. Redshift: fast. EMR: configurable. 30% of queries & 70% of data -> Athena.
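Running an ad-hoc interactive query against data in place on S3 takes one API call. A minimal boto3 sketch (the database, table, and bucket names are hypothetical):

```python
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits "
                "FROM logs.access GROUP BY status",   # hypothetical table
    ResultConfiguration={
        "OutputLocation": "s3://my-data-lake/athena-results/"
    },
)
# Athena is asynchronous: poll get_query_execution() with this ID,
# then read the result set from the S3 output location.
print(resp["QueryExecutionId"])
```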
What About ETL?
https://aws.amazon.com/big-data/partner-solutions/
[Diagram: ETL sits between the STORE and PROCESS/ANALYZE tiers]
Data Integration Partners: reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.
AWS Glue (new): a fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores.
CONSUME
[Diagram: the full reference architecture - the COLLECT tier (applications, logging, transport, messaging, and IoT sources), the STORE tier (Amazon S3 file store; Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams for streams; Amazon SQS for messages; Amazon Elasticsearch Service, Amazon RDS, Amazon DynamoDB, and Amazon ElastiCache for search/SQL/NoSQL/cache), the PROCESS/ANALYZE tier (message, streaming, interactive, batch, and AI, connected by ETL), and the CONSUME tier]

Who consumes:
- Business users: analysis & visualization (Amazon QuickSight), apps & services
- Data scientists & developers: notebooks, IDE, applications & API
Putting It All Together
[Diagram: the complete data lake architecture assembled - everything from COLLECT through STORE and PROCESS/ANALYZE to CONSUME, as built up in the preceding sections]
What About Metadata?
- Amazon Athena Catalog: an internal data catalog for tables/schemas on S3
- Glue Catalog: Hive metastore compliant; crawlers detect new data, schemas, and partitions; search enables metadata discovery
- EMR Hive Metastore (Presto, Spark, Hive, Pig): can be hosted on Amazon RDS
[Diagram: data catalog options - the Amazon Athena Catalog, the Glue Catalog, and an Amazon EMR Hive metastore hosted on Amazon RDS - serving Presto on Amazon EMR and Amazon Athena]
Amazon Athena uses an internal data catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3.
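Registering a schema in that catalog is itself just a DDL statement over data already sitting in S3. A hedged sketch (the table layout and locations are hypothetical, and the `logs` database is assumed to exist):

```python
import boto3

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs.access (
  ts string,
  status int,
  bytes bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://my-data-lake/raw/access/'
"""

boto3.client("athena").start_query_execution(
    QueryString=ddl,   # the table definition lands in Athena's data catalog
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
```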
Security & Governance
- AWS Identity and Access Management (IAM)
- Amazon Cognito
- Amazon CloudWatch & AWS CloudTrail
- AWS KMS & AWS CloudHSM
- AWS Directory Service
- Apache Ranger
- AWS Identity and Access Management (IAM) enables you to securely control access to AWS services and resources for your users.
- Amazon Cognito lets you easily add user sign-up and sign-in to your mobile and web apps.
- Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS.
- With CloudTrail, you can log, continuously monitor, and retain events related to API calls across your AWS infrastructure.
- AWS Key Management Service (KMS) is a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data, and uses Hardware Security Modules (HSMs) to protect the security of your keys. It is integrated with several other AWS services to help you protect the data you store with them, and with AWS CloudTrail to provide logs of all key usage to help meet your regulatory and compliance needs.
- AWS Directory Service for Microsoft Active Directory (Enterprise Edition), also known as AWS Microsoft AD, enables your directory-aware workloads and AWS resources to use managed Active Directory in the AWS Cloud. It is built on actual Microsoft Active Directory and does not require you to synchronize or replicate data from your existing Active Directory to the cloud.
- https://aws.amazon.com/blogs/big-data/implementing-authorization-and-auditing-using-apache-ranger-on-amazon-emr/
Data Lake Reference Architecture
[Diagram: the reference architecture wrapped by Security & Governance (IAM, AWS STS, Amazon CloudWatch, AWS CloudTrail, AWS KMS, AWS CloudHSM, AWS Directory Service) and the Data Catalog (Amazon Athena Catalog, a Hive metastore on RDS, and the Glue Catalog)]
Design Patterns
[Diagram: design patterns plotted by data temperature (hot -> cold) and processing speed (fast -> slow): real-time (AWS Lambda and KCL apps over Amazon Kinesis), interactive (Amazon Redshift, Amazon Athena, Spark and Presto on Amazon EMR, native apps over Amazon DynamoDB), and batch (Hive on Amazon EMR over Amazon S3) - all turning data into answers]
This is a summary of all six design patterns together: all of the solutions available, placed in the context of the temperature of the data and the data-processing latency requirements.
Hive: a year's worth of clickstream data. Spark: a year of clickstream data - what people frequently buy together. Redshift: reporting with enterprise reporting tools, SQL-heavy. Impala: same as Redshift. Presto: same league as Impala - interactive SQL analytics for shops with a Hadoop installed base. NoSQL: analytics on NoSQL.
Real-time Analytics
[Diagram: an Amazon Kinesis stream fans out to Amazon Kinesis Analytics, a KCL app, AWS Lambda, and Spark Streaming on Amazon EMR for processing; results drive Amazon SNS notifications and alerts, Amazon AI real-time predictions, and app state or materialized views stored in Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, or Amazon ES for KPIs, while a raw log is archived to Amazon S3]
Best practices
BDT318 - Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second
- Use Kinesis or Kafka for stream storage
- Use the appropriate stream-processing tool:
  - KCL: simple, only for Kinesis (a bare-bones alternative is sketched below)
  - Apache Storm: general purpose
  - Spark Streaming: windowing, MLlib, stateful
  - AWS Lambda: fully managed, no servers, stateless
- Use Amazon SNS for alerts
- Use Amazon ML for predictions
- Use a managed database/cache for state management
- Use a visualization tool for KPI visualization
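For contrast with the KCL (which handles checkpointing, failover, and shard rebalancing for you), here is what a bare-bones single-shard consumer looks like with boto3 alone; the stream name is hypothetical, and real applications should prefer the KCL:

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "clickstream"   # hypothetical stream

shard_id = kinesis.describe_stream(StreamName=STREAM)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in out["Records"]:
        print(record["Data"])        # the KCL would also checkpoint progress
    iterator = out["NextShardIterator"]
    time.sleep(1)                    # respect the per-shard read limits
```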
Interactive & Batch Analytics
[Diagram: Amazon Kinesis Firehose delivers streams and files into Amazon S3; Amazon EMR (Hive, Pig, Spark) produces batch views and Amazon AI batch predictions, while Amazon Redshift, Amazon Athena, and Amazon EMR (Presto, Spark) serve interactive queries]

[Diagram: the Lambda architecture - an Amazon Kinesis stream feeds a speed layer (KCL apps, AWS Lambda, Storm, or Spark Streaming on Amazon EMR maintaining app state or materialized views in Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, or Amazon ES, with Amazon AI for real-time predictions) and, via Amazon Kinesis Firehose, an immutable data lake on Amazon S3 queried in batch by Amazon EMR (Presto, Hive, Pig, Spark), Amazon Redshift, and Amazon Athena; change data capture carries transactions from Amazon DynamoDB and Amazon RDS into the stream]
Lambda Architecture best practices:
- Use Kinesis or Kafka for stream storage
- Maintain an immutable (append-only) raw master dataset in Amazon S3; use the Amazon Kinesis S3 connector or Firehose (see the sketch below)
- Use appropriate batch-processing technologies for creating batch views
- Use appropriate stream-processing technology for real-time views
- Use a serving layer with an appropriate database to serve downstream applications
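Maintaining the immutable raw master dataset is mostly a delivery problem; with Firehose it reduces to put_record calls, and the service batches and writes the append-only files into S3 (the delivery stream name and event shape are hypothetical):

```python
import json
import boto3

firehose = boto3.client("firehose")

def append_raw_event(event):
    # Firehose buffers records and delivers them to S3 as immutable,
    # append-only objects -- the raw master data for batch views.
    firehose.put_record(
        DeliveryStreamName="raw-events-to-s3",   # hypothetical stream
        Record={"Data": (json.dumps(event) + "\n").encode()},
    )

append_raw_event({"type": "page_view", "ts": 1490000000})
```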
Summary
- Build decoupled systems: Data -> Store -> Process -> Store -> Analyze -> Answers
- Use the right tool for the job: data structure, latency, throughput, access patterns
- Leverage AWS managed services: scalable/elastic, available, reliable, secure, no/low admin
- Use log-centric design patterns: immutable log; batch, interactive & real-time views
- Be cost-conscious: big data ≠ big cost
Data Lake Reference Architecture
[Diagram: the reference architecture again, wrapped by Security & Governance (now also including Amazon Cognito) and the Data Catalog]
The Glue catalog does schema detection and partition inference.
Resources
- https://aws.amazon.com/blogs/big-data/introducing-the-data-lake-solution-on-aws/
- AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)
- AWS re:Invent 2016: Deep Dive on Amazon S3 (STG303)
- https://aws.amazon.com/blogs/big-data/reinvent-2016-aws-big-data-machine-learning-sessions/
- https://aws.amazon.com/blogs/big-data/implementing-authorization-and-auditing-using-apache-ranger-on-amazon-emr/
Thank you!