Amazon Web Services: EMR (Elastic MapReduce) with ITOC Australia - What is EMR/Hadoop?

Post on 12-Nov-2014

Category:

Technology

DESCRIPTION

David Nedved (ITOC Australia) returns to give you the rundown on using Amazon's Elastic MapReduce to run complex queries on large-scale data sets.

TRANSCRIPT

Elastic MapReduce on AWS Cloud

http://linkedin.com/in/davidnedved

What is MapReduce?

MapReduce is a programming model that comes from functional programming (like LISP).

MapReduce is a "framework" for:
● Processing parallel problems
● Across HUGE datasets
● Using a LARGE number of computers!
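To make the functional-programming roots concrete, here is a minimal sketch using Python's built-in map and functools.reduce. The three records are made-up sample data, and everything runs on one machine; it only illustrates the shape of the model, not how Hadoop distributes the work.

```python
from functools import reduce

# Hypothetical input: each string stands in for one record of a huge dataset.
records = ["big data", "big compute", "data everywhere"]

# Map step: turn every record into a partial result, independently of the others.
# Because no record depends on another, this step could be spread over many machines.
word_counts = list(map(lambda line: len(line.split()), records))

# Reduce step: fold the partial results into a single answer.
total_words = reduce(lambda a, b: a + b, word_counts, 0)

print(word_counts)
print(total_words)
```

The key point is that the map function sees one record at a time, which is what lets a framework run it in parallel.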

Made "Popular" by Google
● Used to compute the index that maps "terms" to "pages" (an inverted index, which is separate from the PageRank algorithm that ranks those pages)
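A rough sketch of how a term-to-page index can be built with map, shuffle and reduce steps. The page URLs and text are invented sample data, and the function names (map_page, shuffle, reduce_term, build_index) are illustrative only; this is not Google's or Hadoop's actual code.

```python
from collections import defaultdict

# Hypothetical "pages": URL -> text. Real input would be a crawl spread over many machines.
pages = {
    "a.example/1": "hadoop on aws",
    "a.example/2": "hive on hadoop",
}

def map_page(url, text):
    """Map: emit one (term, url) pair per distinct term on the page."""
    return [(term, url) for term in set(text.split())]

def shuffle(pairs):
    """Shuffle: group every emitted url under its term (the key)."""
    grouped = defaultdict(list)
    for term, url in pairs:
        grouped[term].append(url)
    return grouped

def reduce_term(term, urls):
    """Reduce: collapse the urls for one term into a sorted list of pages."""
    return term, sorted(set(urls))

def build_index(pages):
    """Run map, shuffle and reduce in sequence on a single machine."""
    pairs = [pair for url, text in pages.items() for pair in map_page(url, text)]
    return dict(reduce_term(term, urls) for term, urls in shuffle(pairs).items())

index = build_index(pages)
print(index["hadoop"])
```

On a real cluster the framework does the shuffle for you; you only supply the map and reduce functions.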

Ok - so...???

For Instance:
● Recommendations (books, restaurants, etc.)
● Predict Trends (job skills in demand, Amazon's recent )
● Show customised Ads on my site etc.
● Record every query a user makes on my site http://w3dt.net/

● Primary data has grown exponentially in the last 10 years on the internet...
● Secondary data has gone "off the scale"...
● UH?
○ We seem to log everything and "ask questions later"

● Big Data is no longer a problem just for the big boys (Google, Microsoft etc.)
● Startups are "epically failing" to get on top of their big data...

The "Big Boys" vs. You: SME/Startup

Hadoop
● Hadoop can help with BigData
○ It's proven in the field
○ Under active development
○ Will only get cheaper as hardware/AWS prices drop!
● Cheaper storage and retrieval (through a limited SQL interface)
● Easier to use with parallel programming
● Scalability for storage/retrieval

"Ok, so is Hadoop a database?"
NO, NO, NO!

Hadoop is a processing platform.

It combines data storage, retrieval and programming into a single highly scalable package.

Hadoop on AWS = EMR
IT IS API DRIVEN :)

EMR simply kicks ass
● Import/Export your BigData to AWS Platform quickly
● Multipart Upload (S3)
● Resize running job flows
● Balance cost and Performance
● Resize based on usage patterns
● Access Control --> IAM, VPC, Everything else in standard EC2...
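As an illustration of "API driven" resizing, here is a hedged sketch using the present-day boto3 SDK, which the slides do not mention. The cluster ID, instance group ID and region are placeholders, and the helper names build_resize_request and resize_cluster are made up for this example; only modify_instance_groups is the real EMR client call.

```python
def build_resize_request(cluster_id, instance_group_id, instance_count):
    """Build the keyword arguments for EMR's ModifyInstanceGroups call."""
    if instance_count < 0:
        raise ValueError("instance_count cannot be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": instance_count}
        ],
    }

def resize_cluster(cluster_id, instance_group_id, instance_count, region="ap-southeast-2"):
    """Ask EMR to change the size of one instance group on a running cluster."""
    import boto3  # imported lazily so the request builder works without the SDK installed

    emr = boto3.client("emr", region_name=region)  # region is an assumption
    emr.modify_instance_groups(
        **build_resize_request(cluster_id, instance_group_id, instance_count)
    )

# Placeholder IDs, not real resources.
request = build_resize_request("j-EXAMPLE", "ig-EXAMPLE", 4)
print(request)
```

Calling resize_cluster needs AWS credentials and a real running cluster; a scheduler could call it to grow the cluster for peak hours and shrink it afterwards.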

EMR in AWS console

For example... EMR can be used to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.

● By exporting rarely used data to S3 you will save $$$.
● Exported data in S3 is directly queryable (via EMR)
● Join exported tables with current DynamoDB Tables!

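-- code is the partition column, so it is declared only in PARTITIONED BY;
-- Hive rejects a column that is repeated in both lists.
-- Existing partitions must be registered before querying (e.g. MSCK REPAIR TABLE sms_prices_s3;).
CREATE EXTERNAL TABLE sms_prices_s3 (
  country string,
  network int,
  networkname string,
  price string
)
PARTITIONED BY (code string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://itoc-usergroup/sample';

Create Hive table (notice the S3 location)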

For example...

SELECT code, country, networkname, price
FROM sms_prices_s3
WHERE code = 'AU';

Querying the external table (data in S3)
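Why the WHERE clause on the partition column matters: assuming the files under the table location follow Hive's usual column=value directory layout (the slides do not show the bucket contents), a filter on code lets Hive read only the matching prefix. The sketch below mimics that pruning in plain Python; the NZ and US partitions and the helper names are hypothetical.

```python
def partition_prefix(table_location, column, value):
    """Return the S3 prefix Hive conventionally uses for one partition value."""
    return f"{table_location.rstrip('/')}/{column}={value}/"

def prune(prefixes, column, value):
    """Keep only the partition prefixes that can match `column = value`."""
    wanted = f"/{column}={value}/"
    return [prefix for prefix in prefixes if wanted in prefix]

location = "s3://itoc-usergroup/sample"

# AU comes from the query on the slide; NZ and US are made-up extra partitions.
all_partitions = [partition_prefix(location, "code", c) for c in ("AU", "NZ", "US")]

# WHERE code = 'AU' means only one of the three prefixes needs to be scanned.
to_scan = prune(all_partitions, "code", "AU")
print(to_scan)
```

Less data scanned means a shorter job, which on EMR also means a smaller bill.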

● Remember: you can run EMR (Hadoop) on just about ANY form of data!
● Use EMR to query your NoSQL DB with SQL-like queries (:
● Store your BigData in S3, Dynamo, etc. and you get the durability of the underlying service (S3 is designed for 99.999999999% durability)

DynamoDB catch-out... If you want to query DynamoDB using Hadoop you MUST use EMR... The library for Hive isn't available for your own EC2 instances.

A few real-life examples
● Data Analytics: Google Analytics/Quantcast
● Crawling: Google Search
● Full-text Indexing: Just about every HUGE system
● Data Mining: LinkedIn Maps (:

Thank You!

http://linkedin.com/in/davidnedved

http://aws.amazon.com/elasticache

Email: david.nedved@itoc.com.au

Amazon Elastic MapReduce
