Hadoop with Spark
TRANSCRIPT
ACADGILD presents: Spark
Presented by: Sandy
Introduction to ACADGILD
You can also click on this link to view the video: https://www.youtube.com/watch?v=7nipSdxv2Uo (Webinar on Spark)
copyright ACADGILD
Introduction of the Mentor
The mentor for this webinar is Mr. Sandy; his qualifications are:
- 15 years of experience in IT, focusing on Big Data, Data Science, and IoT solutions and implementations.
- Expert in the Apache Spark ecosystem, including Spark 1.6, Scala, Spark SQL, Spark Streaming, MLlib, SparkR, and GraphX.
- Extensive experience in Hadoop framework solutions, including YARN/Mesos, HDFS, MapReduce, Pig Latin, Hive, HBase/MongoDB/Cassandra, Mahout, Flume, ZooKeeper, Oozie, and Sqoop.
- Knowledge of Machine Learning, covering both supervised and unsupervised learning algorithms.
Agenda
1. What is Big Data?
2. MapReduce Limitations
3. Introduction to Spark
4. Spark in the Hadoop Ecosystem
5. Why In-memory Processing?
6. In-memory Caching
7. Resilient Distributed Dataset
8. Creating RDDs
9. Spark Unified Platform
10. Popular Use Cases
11. Apache Spark Case Studies
12. Get Your Feet Wet with Spark APIs
What is Big Data?
MapReduce Limitations
MapReduce is based on disk-based computing and is therefore disk-intensive. It is well suited to single-pass computations but not at all suited to iterative computations, since intermediate results must be written to and re-read from disk between passes.
Programming model limitations: developing efficient MapReduce applications requires advanced programming skills and a deep understanding of the system architecture.
Every problem has to be broken down into Map and Reduce phases.
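To make this constraint concrete, here is a minimal plain-Python sketch (not Hadoop code; the function names are illustrative) of how even a simple word count must be expressed as a map phase followed by a reduce phase:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/Reduce: group the pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["spark is fast", "hadoop and spark"]
print(reduce_phase(map_phase(lines)))
# {'spark': 2, 'is': 1, 'fast': 1, 'hadoop': 1, 'and': 1}
```

The point of the sketch is the rigidity: anything that does not decompose cleanly into these two phases (an iterative algorithm, for example) must be forced through them repeatedly.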
Introduction to Spark
Apache Spark is a fast and general-purpose cluster computing system.
Spark is a framework for scheduling, monitoring, and distributing applications.
Spark is a general unified engine that can replace many specialized systems such as Mahout, Tez, GraphLab, and Storm.
Some key points about Spark:
- Handles batch, interactive, and real-time workloads within a single framework.
- Native integration with Java, Python, and Scala.
- Programming at a higher level of abstraction.
- More general: map/reduce is just one set of supported constructs.
Spark in the Hadoop Ecosystem
[Diagram: Spark with its libraries and APIs (Spark SQL, Streaming, MLlib, GraphX) surrounded by the systems it integrates with: Hadoop distributions (CDH, HDP, MapR, DSE), databases and RDBMSs, file systems, streaming sources, and resource managers.]
Why In-memory Processing? A drastic change in hardware:

Earlier: RAM was very costly; disk was comparatively cheap, so disk was the primary store of data.
Now: The cost of RAM has fallen sharply while its performance has increased, so RAM is the primary store of data and disk is the fallback.

Earlier: The network was costly, so data locality mattered.
Now: Networks are much faster.

Earlier: Single-core machines were dominant.
Now: Multi-core machines are commonplace.
In-memory Caching
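As a rough illustration of why caching matters (plain Python standing in for Spark's rdd.cache(); the counter and names are invented for this sketch): without caching, an iterative job re-runs its whole pipeline on every pass, while caching materializes the intermediate result once and reuses it.

```python
calls = {"n": 0}

def expensive_transform(x):
    calls["n"] += 1          # count how often the transformation actually runs
    return x * x

data = range(3)

# Without caching: the pipeline is recomputed on every iteration.
for _ in range(2):
    _ = [expensive_transform(x) for x in data]
uncached_calls = calls["n"]           # 3 elements x 2 passes = 6 calls

# With "caching": materialize the result once, then reuse it.
calls["n"] = 0
cached = [expensive_transform(x) for x in data]
for _ in range(2):
    _ = cached                        # no recomputation on reuse
print(uncached_calls, calls["n"])     # 6 3
```

With more iterations or more expensive transformations, the gap widens, which is the order-of-magnitude speedup claimed for in-memory iterative workloads.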
Resilient Distributed Dataset
A Resilient Distributed Dataset (RDD) represents an immutable collection of objects, partitioned across a set of machines, that can be rebuilt if a partition is lost. It is a distributed memory abstraction.
Features:
- Cache an RDD in memory across machines.
- Reuse it in multiple MapReduce-like parallel operations.
- Fault tolerance through lineage.
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. An RDD shards its data over a cluster, like a virtualized, distributed collection.
RDDs are partitioned, locality-aware, distributed collections, and they are immutable.
RDDs are data structures that either point to a direct data source (e.g., HDFS) or apply transformations to their parent RDD(s) to generate new data elements. Computations on RDDs are represented by lazily evaluated lineage DAGs composed of chained RDDs.
RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.
In both cases, keeping data in memory can improve performance by an order of magnitude.
RDD lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to rebuild just that partition.
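The lineage idea can be sketched in plain Python (a toy illustration of the concept, not Spark's implementation; the class and method names are invented): each dataset records its parent and the transformation that derived it, so a lost partition can be recomputed rather than replicated.

```python
class MiniRDD:
    """Toy illustration of lineage: each dataset records how it was derived."""
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions   # list of lists, one list per partition
        self.parent = parent           # parent dataset in the lineage
        self.transform = transform     # function that derived this dataset

    def map(self, fn):
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return MiniRDD(new_parts, parent=self, transform=fn)

    def recompute_partition(self, i):
        # A lost partition is rebuilt from the parent via the recorded transform,
        # without touching any other partition.
        return [self.transform(x) for x in self.parent.partitions[i]]

source = MiniRDD([[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)
doubled.partitions[1] = None                        # simulate losing partition 1
doubled.partitions[1] = doubled.recompute_partition(1)
print(doubled.partitions)                           # [[2, 4], [6, 8]]
```

Real Spark keeps a full DAG of such derivations and recomputes transitively back to a stable source, but the per-partition recovery shown here is the core of the mechanism.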
An RDD can be created in two ways:
- Parallelize a collection.
- Read data from an external source (S3, C*, HDFS, etc.).
Creating RDDs
Turn a collection into an RDD:
val a = sc.parallelize(Array(1, 2, 3))

Load a text file from the local FS, HDFS, or S3:
val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")
There are currently two types:
- Parallelized collections take an existing Scala collection and run functions on it in parallel.
- Hadoop datasets run functions on each record of a file in the Hadoop distributed file system, or in any other storage system supported by Hadoop.
Spark can create RDDs from any file stored in HDFS or other storage systems supported by Hadoop, e.g., local file system, Amazon S3, Hypertable, HBase, etc.
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and can also take a directory or a glob.
There are two types of operations on RDDs: transformations and actions.
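The split can be illustrated with Python generators as an analogy (this is not Spark code): transformations only describe a pipeline and are lazily evaluated, while an action forces the pipeline to run and produce a result.

```python
data = [1, 2, 3, 4, 5]

# "Transformations": lazy generator pipelines; nothing is computed yet.
mapped = (x * 10 for x in data)
filtered = (x for x in mapped if x > 20)

# "Action": forces evaluation of the whole pipeline and returns a result.
result = list(filtered)
print(result)  # [30, 40, 50]
```

In Spark the same shape appears as chains like rdd.map(...).filter(...) (transformations, which just extend the lineage DAG) terminated by calls like collect() or count() (actions, which trigger execution).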
Spark Unified Platform
[Diagram: the Spark Core Engine with the libraries built on top of it: Spark SQL (DataFrames), Spark Streaming (streaming), MLlib (machine learning), GraphX (graph computation), and SparkR (R on Spark).]
Popular Use Cases
[Chart: survey percentages ranging from 29% to 68% across popular Spark use cases: Business Intelligence, Data Warehousing, Recommendation, Log Processing, User-Facing Services, and Fraud Detection/Security.]
- Data integration and ETL
- Interactive analytics and business intelligence
- High-performance batch computation
- Machine learning and advanced analytics
- Real-time stream processing
Apache Spark Case Studies
- Credit Card Fraud Detection
- Network Security
- Genomic Sequencing
- Real-Time Ad Processing
Get Your Feet Wet with Spark APIs
A quick tour of the Scala, Python, and Java APIs.
Any Questions?
Get in Touch with Us
Contact Info:
Website: http://www.acadgild.com
LinkedIn: https://www.linkedin.com/company/acadgild
Facebook: https://www.facebook.com/acadgild
Support: [email protected]
Thank You