Spark: Beyond MapReduce

Alin Blidisel - Spark: Big Data Beyond MapReduce

Upload: blidiselalin

Post on 07-Jan-2017

TRANSCRIPT

Page 1: Spark: Beyond mapreduce

Alin Blidisel - Spark: Big Data Beyond MapReduce

Page 2: Spark: Beyond mapreduce

Apache Spark: Introduction, Examples, Data Analysis and Statistics.

Blidisel Alin

Page 3: Spark: Beyond mapreduce

Alin Blidisel - Big Data: Beyond MapReduce

WHY SPARK?

Hadoop Spark

Page 4: Spark: Beyond mapreduce

SPARK - INTRODUCTION

- was created by Matei Zaharia at UC Berkeley's AMPLab

- was later donated to the Apache Software Foundation; it was designed to speed up Hadoop-style computation

- is not a modified version of Hadoop

- uses in-memory cluster computing

- has its own cluster management

- designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming

Page 5: Spark: Beyond mapreduce

SPARK COMPONENTS

Page 6: Spark: Beyond mapreduce

FEATURES OF APACHE SPARK

- Lightning-Fast Processing (up to 10 times faster than Hadoop MapReduce on disk, up to 100 times faster in memory)

- Ease of Use as it supports multiple languages

- Support for Sophisticated Analytics

- Real Time Stream Processing

- Ability to Integrate with Hadoop and Existing Hadoop Data

- Active and Expanding Community (more than 250 developers have contributed to Spark already)

Page 7: Spark: Beyond mapreduce

RESILIENT DISTRIBUTED DATASETS (RDDS)

- fault-tolerant collection of elements that can be operated on in parallel (distributed and immutable)

- two ways to create RDDs:

  - parallelizing an existing collection in your driver program

  - referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat

- persistence (MEMORY_ONLY*, MEMORY_AND_DISK*, DISK_ONLY, OFF_HEAP)

Page 8: Spark: Beyond mapreduce

SPARK CLUSTER MODE OVERVIEW

Page 9: Spark: Beyond mapreduce

SPARK USER INTERFACE

Page 10: Spark: Beyond mapreduce

EXAMPLE: DATA ANALYSIS

Sample data from a sales transactions CSV file

Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude

1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.1166667

1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194

1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83

1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.1333333,144.75

1/4/09 12:56,Product2,3600,Visa,Gerd W ,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025

1/4/09 13:19,Product1,1200,Visa,LAURENCE,Mickleton,NJ,United States,9/24/08 15:19,1/4/09 13:04,39.79,-75.23806

1/4/09 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/09 9:38,1/4/09 19:45,40.69361,-89.58889

1/2/09 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/09 17:43,1/4/09 20:01,36.34333,-88.85028

1/4/09 13:17,Product1,1200,Mastercard,Renee Elisabeth,Tel Aviv,Tel Aviv,Israel,1/4/09 13:03,1/4/09 22:10,32.0666667,34.7666667

Page 11: Spark: Beyond mapreduce

LOAD ORIGINAL CSV FROM HDFS

Create Spark Context and define input parameters

Create RDD from CSV file
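
The code shown on this slide is not preserved in the transcript. As a rough stand-in, the sketch below uses plain Python (standard library only, not Spark) to show the kind of per-line parsing that loading this CSV involves: separating the header from the data rows and turning each row into a record. The helper name `parse_lines` and the inline `RAW` text are assumptions made for illustration; the two data rows are copied from the sample data above.

```python
import csv
import io

# Header line plus two rows copied from the sample data in this deck.
RAW = """Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.1166667
1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194
"""


def parse_lines(lines):
    """Hypothetical helper: split CSV lines into a header and a list of row dicts."""
    reader = csv.reader(lines)
    header = next(reader)  # the first line holds the column names
    # Skip empty lines; pair every remaining field with its column name.
    rows = [dict(zip(header, fields)) for fields in reader if fields]
    return header, rows


header, rows = parse_lines(io.StringIO(RAW))
print(len(header), len(rows), rows[0]["City"])
```

Every field is still a string at this point, which is why a separate type-detection step is needed later.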

Page 12: Spark: Beyond mapreduce

GET RANDOM DATA AND CREATE A DATAFRAME

Page 13: Spark: Beyond mapreduce

DETERMINE FIELD TYPES
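
The slide's own code is missing from this transcript, so the following is only one plausible way to determine field types, written in plain Python rather than Spark. It tries progressively looser parsers on a sample of each column and falls back to a string. The function name `infer_type`, the type labels, and the date layout `%m/%d/%y %H:%M` (guessed from values such as `1/2/09 6:17`) are all assumptions.

```python
from datetime import datetime


def infer_type(values):
    """Return 'int', 'float', 'timestamp' or 'string' for a column of raw strings."""
    if not values:
        return "string"  # nothing to inspect, so stay with the safest type

    def all_parse(parser):
        # True only if every sampled value is accepted by the parser.
        try:
            for value in values:
                parser(value.strip())
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    # Date layout assumed from the sample rows, e.g. "1/2/09 6:17".
    if all_parse(lambda v: datetime.strptime(v, "%m/%d/%y %H:%M")):
        return "timestamp"
    return "string"


# A few values per column, copied from the sample rows in this deck.
sample = {
    "Transaction_date": ["1/2/09 6:17", "1/4/09 12:56"],
    "Price": ["1200", "3600"],
    "Name": ["carolina", "Gerd W "],
    "Latitude": ["51.5", "33.52056"],
}
types = {name: infer_type(column) for name, column in sample.items()}
print(types)
```

The result is only as reliable as the sample: a column that looks numeric in a small sample may still contain text elsewhere in the file.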

Page 14: Spark: Beyond mapreduce

CREATE NEW DATAFRAME BASED ON THE NEWLY DETERMINED FIELD TYPES
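
Since the slide's code did not survive in the transcript, here is a small plain-Python illustration (not Spark code) of what applying detected field types to raw records can look like. The `CONVERTERS` table, the `schema` mapping, and the helper `cast_row` are hypothetical names; the date layout is again assumed from the sample rows.

```python
from datetime import datetime

# Converters keyed by type label; the labels are assumptions for illustration.
CONVERTERS = {
    "int": int,
    "float": float,
    "timestamp": lambda v: datetime.strptime(v, "%m/%d/%y %H:%M"),
    "string": str,
}

# Hypothetical outcome of a type-detection step for four of the twelve columns.
schema = {
    "Transaction_date": "timestamp",
    "Price": "int",
    "City": "string",
    "Latitude": "float",
}


def cast_row(row, schema):
    """Return a new dict with each field in schema converted to its target type."""
    return {
        name: CONVERTERS[type_name](row[name].strip())
        for name, type_name in schema.items()
    }


# Raw string values taken from the first sample row in this deck.
raw = {
    "Transaction_date": "1/2/09 6:17",
    "Price": "1200",
    "City": "Basildon",
    "Latitude": "51.5",
}
typed = cast_row(raw, schema)
print(typed)
```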

Page 15: Spark: Beyond mapreduce

SAVE DATA IN PARQUET FORMAT

This is the new updated schema

Page 16: Spark: Beyond mapreduce

GENERATE STATISTICS
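
The statistics computed on this slide are not visible in the transcript. As a hedged illustration only, the plain-Python sketch below (standard library, not Spark) derives a few basic summary figures from the nine sample rows shown earlier: count, minimum, maximum, mean and median of `Price`, plus the number of transactions per `Payment_Type`. Which statistics the original presentation produced is not known.

```python
import statistics
from collections import Counter

# (Price, Payment_Type) pairs copied from the nine sample rows in this deck.
records = [
    (1200, "Mastercard"), (1200, "Visa"), (1200, "Mastercard"),
    (1200, "Visa"), (3600, "Visa"), (1200, "Visa"),
    (1200, "Mastercard"), (1200, "Mastercard"), (1200, "Mastercard"),
]

prices = [price for price, _ in records]

# Basic descriptive statistics for the numeric Price column.
summary = {
    "count": len(prices),
    "min": min(prices),
    "max": max(prices),
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
}

# Frequency of each value in the categorical Payment_Type column.
by_payment = Counter(payment for _, payment in records)

print(summary)
print(by_payment)
```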

Page 17: Spark: Beyond mapreduce

© 2016 Atigeo, Corporation. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Thank you!