Amazon Web Services: EMR (Elastic Map Reduce) with ITOC Australia - What Is EMR/Hadoop?

Upload: david-nedved

Post on 12-Nov-2014

DESCRIPTION

David Nedved (ITOC Australia) returns to give you the rundown on using Amazon's Elastic MapReduce to run complex queries on large-scale data sets.

TRANSCRIPT

Page 1: Amazon Web Services: EMR (Elastic Map Reduce) with ITOC Australia - What Is EMR/Hadoop?

Elastic MapReduce on AWS Cloud

http://linkedin.com/in/davidnedved

Page 2

What is MapReduce?

MapReduce is a programming model that comes from functional programming (like LISP).

MapReduce is a "framework" for:
● Processing parallel problems
● Across HUGE datasets
● Using a LARGE number of computers!

Made "popular" by Google
● Used to compute the index that maps "terms" to "pages" (an inverted index). This is the search index itself, not Google's PageRank algorithm, which ranks pages by the links between them.
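The term-to-page index mentioned on this slide can be sketched in a few lines of plain Python. This is only a single-machine illustration of the map, shuffle and reduce steps, not how Hadoop or Google implement them; the function names and the two sample pages are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(page_id, text):
    """Map: emit one (term, page_id) pair for every distinct word on a page."""
    return [(term, page_id) for term in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for term, page_id in pairs:
        grouped[term].append(page_id)
    return grouped


def reduce_phase(term, page_ids):
    """Reduce: collapse all values for one term into a sorted list of pages."""
    return term, sorted(set(page_ids))


def build_index(pages):
    """Run map, shuffle and reduce in sequence over a dict of page_id -> text."""
    mapped = chain.from_iterable(map_phase(pid, text) for pid, text in pages.items())
    grouped = shuffle(mapped)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Hypothetical sample data, purely for illustration.
pages = {
    "page1": "hadoop runs mapreduce jobs",
    "page2": "emr runs hadoop on aws",
}
index = build_index(pages)
print(index["hadoop"])  # prints ['page1', 'page2']
```

On a real cluster each map call runs on whichever machine holds that slice of the data, and the shuffle moves the pairs across the network so that every term ends up at exactly one reducer.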

Page 3

Ok - so...???

For Instance:
● Recommendations (books, restaurants, etc.)
● Predict trends (job skills in demand, amazon's recent )
● Show customised ads on my site, etc.
● Record every query a user makes on my site http://w3dt.net/

● Primary data has grown exponentially in the last 10 years on the internet...
● Secondary data has gone "off the scale"...
● UH?
  ○ We seem to log everything and "ask questions later"

● Big Data is no longer a problem just for the big boys (Google, Microsoft, etc.)
● Startups are "epically failing" to get on top of their big data...

Page 4

The "Big Boys" You: SME/Startup

Page 5

Hadoop
● Hadoop can help with BigData
  ○ It's proven in the field
  ○ Under active development
  ○ Will only get cheaper as hardware/AWS prices drop!

● Cheaper storage and retrieval (through a limited SQL interface)
● Easier to use for parallel programming
● Scalability for storage/retrieval

"Ok, so is Hadoop a database?"
NO, NO, NO!

Hadoop is a processing platform.

It combines data storage, retrieval and programming into a single highly scalable package.

Page 6

Hadoop on AWS = EMR

IT IS API DRIVEN :)
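To make "API driven" concrete, here is a rough sketch of launching a cluster from Python with boto3's `run_job_flow` call. Everything specific in it is an assumption for illustration and not taken from the talk: the cluster name, the `logs/` prefix, the release label, the instance types, the region, and the use of the default EMR roles (which must already exist in the account).

```python
def build_cluster_request(name, log_uri, core_nodes=2):
    """Build the keyword arguments for the EMR RunJobFlow API call."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.15.0",  # assumption: substitute a release you use
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
            ],
            # Keep the cluster up after its steps finish so it can be queried.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="ap-southeast-2"):
    """Send the request to EMR and return the new cluster (job flow) id."""
    import boto3  # imported here so the request can be built without AWS access

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


# Hypothetical name and log prefix; nothing is sent to AWS by these lines.
request = build_cluster_request("itoc-demo", "s3://itoc-usergroup/logs/")
print(request["Instances"]["InstanceGroups"][1]["InstanceCount"])  # prints 2
```

Calling `launch_cluster(request)` needs AWS credentials and starts billable instances, so the sketch stops at building the request.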

Page 7

EMR simply kicks ass
● Import/Export your BigData to the AWS platform quickly
● Multipart Upload (S3)
● Resize running job flows

● Balance cost and performance
● Resize based on usage patterns
● Access control --> IAM, VPC, everything else in standard EC2...
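Resizing a running job flow is also just an API call. The sketch below assumes boto3 and uses its `list_instance_groups` and `modify_instance_groups` calls; the cluster id and instance group ids are placeholders, and the helper names are invented for this example.

```python
def build_resize_request(cluster_id, instance_groups, group_type, target_count):
    """Build ModifyInstanceGroups arguments for one group of a running cluster.

    instance_groups has the shape of the "InstanceGroups" list returned by
    list_instance_groups; group_type is "CORE" or "TASK".
    """
    if target_count < 0:
        raise ValueError("target_count must not be negative")
    for group in instance_groups:
        if group["InstanceGroupType"] == group_type:
            return {
                "ClusterId": cluster_id,
                "InstanceGroups": [
                    {"InstanceGroupId": group["Id"], "InstanceCount": target_count}
                ],
            }
    raise LookupError(f"no {group_type} instance group in cluster {cluster_id}")


def resize_cluster(cluster_id, group_type, target_count):
    """Look up the cluster's groups and ask EMR to resize one of them."""
    import boto3  # imported here so the helper above works without AWS access

    emr = boto3.client("emr")
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    emr.modify_instance_groups(
        **build_resize_request(cluster_id, groups, group_type, target_count)
    )


# Placeholder ids, shaped like a list_instance_groups response.
sample_groups = [
    {"Id": "ig-MASTER", "InstanceGroupType": "MASTER"},
    {"Id": "ig-TASK", "InstanceGroupType": "TASK"},
]
resize = build_resize_request("j-EXAMPLE", sample_groups, "TASK", 8)
print(resize["InstanceGroups"][0])
```

A scheduled job could call `resize_cluster` to grow the task group before a nightly batch and shrink it afterwards, which is one way to read "resize based on usage patterns".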

Page 8

EMR in AWS console

Page 9

For example...

EMR can be used to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.

● By exporting rarely used data to S3 you will save $$$.
● Exported data in S3 is directly queryable (via EMR)
● Join exported tables with current DynamoDB tables!

-- The partition column (code) is declared only in PARTITIONED BY;
-- Hive rejects a column that appears in both lists.
CREATE EXTERNAL TABLE sms_prices_s3 (
  country string,
  network int,
  networkname string,
  price string
)
PARTITIONED BY (code string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://itoc-usergroup/sample';

Create Hive table (notice the S3 location)

Page 10

For example...

SELECT code, country, networkname, price
FROM sms_prices_s3
WHERE code = 'AU';

Querying the external table (data in S3)
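Because the table is declared with PARTITIONED BY (code string), Hive only reads partitions it has been told about, and by default it expects each one under a column=value prefix. The small helper below, whose names are invented for this example, shows the S3 prefix and the ALTER TABLE statement that the query for 'AU' would rely on, assuming the data was laid out with that default naming.

```python
def partition_location(table_location, column, value):
    """Return the prefix Hive uses by default for one partition value."""
    return f"{table_location.rstrip('/')}/{column}={value}/"


def add_partition_statement(table, table_location, column, value):
    """Build the HiveQL that registers one existing partition with the table."""
    location = partition_location(table_location, column, value)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column}='{value}') LOCATION '{location}';"
    )


# Table name and bucket come from the slide; the layout is an assumption.
print(partition_location("s3://itoc-usergroup/sample", "code", "AU"))
print(add_partition_statement("sms_prices_s3", "s3://itoc-usergroup/sample", "code", "AU"))
```

Until a partition is registered this way (or in bulk), a query that filters on it returns no rows even though the files are sitting in S3.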

● Remember: you can run EMR (Hadoop) on just about ANY form of data!
● Use EMR to query your NoSQL DB with SQL-like queries (:
● Store your BigData in S3, DynamoDB, etc. (S3 is designed for 99.999999999% durability)

DynamoDB catch-out...
If you want to query DynamoDB using Hadoop you MUST use EMR...
The library for Hive isn't available for your own EC2 instances.

Page 11

A few real life examples
● Data Analytics: Google Analytics/Quantcast
● Crawling: Google Search
● Full-text Indexing: Just about every HUGE system
● Data Mining: LinkedIn Maps (:

Page 12

Thank You!

http://linkedin.com/in/davidnedved

http://aws.amazon.com/elasticache

Email: [email protected]

Amazon Elastic MapReduce