Big Data 2.0 - How Spark Technologies Are Reshaping the World of Big Data Analytics
![Page 1: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/1.jpg)
Big Data 2.0
HOW SPARK TECHNOLOGIES ARE RESHAPING THE WORLD OF BIG DATA ANALYTICS
Presented By: Lillian Pierson, P.E.
![Page 2: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/2.jpg)
Today’s webinar
Apache Spark: Journey from “Hadoop ecosystem component” to “big data platform”
The story of how Spark began
Is Spark a data engineering or data science platform?
Who is using Spark and for what?
Got Spark skills? Here’s why you should
![Page 3: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/3.jpg)
Apache Spark
JOURNEY FROM “HADOOP ECOSYSTEM COMPONENT” TO “BIG DATA PLATFORM”
![Page 4: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/4.jpg)
What is Spark?
![Page 5: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/5.jpg)
Why in-memory applications?
“In-memory computing appliances are … faster than the traditional Hadoop system because in-memory appliances don’t use MapReduce… By storing data in memory, in-memory appliances are able to bypass the time-consuming disk accesses that are required as part of the map and reduce operations that comprise the MapReduce process. In-memory data storage processing, and analysis is fast enough to generate data analytics in real-time, derived from streaming data sources.”
Excerpt from my book: Big Data/Hadoop for Dummies
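To make the excerpt's point concrete, here is a minimal sketch in plain Python (not Spark code) of why avoiding repeated disk access helps iterative work. It assumes a small hypothetical dataset written to a temporary JSON file; the function names and the dataset are illustrative only. One function re-reads the file on every pass, roughly as a chain of disk-based jobs would, while the other loads the data once and iterates over the in-memory copy.

```python
import json
import os
import tempfile


def total_with_disk_reads(path, passes):
    """Re-read the dataset from disk on every pass over the data."""
    disk_reads = 0
    total = 0
    for _ in range(passes):
        with open(path) as handle:
            data = json.load(handle)
        disk_reads += 1
        total += sum(data)
    return total, disk_reads


def total_in_memory(path, passes):
    """Load the dataset once, then iterate over the in-memory copy."""
    with open(path) as handle:
        data = json.load(handle)
    disk_reads = 1
    total = 0
    for _ in range(passes):
        total += sum(data)
    return total, disk_reads


def demo(passes=5):
    # Write a small hypothetical dataset (0..99) to a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
        json.dump(list(range(100)), handle)
        path = handle.name
    try:
        return total_with_disk_reads(path, passes), total_in_memory(path, passes)
    finally:
        os.remove(path)


on_disk, in_memory = demo()
```

Both approaches produce the same total, but the in-memory version touches the disk once instead of once per pass. On a real cluster the gap grows with data size and the number of iterations.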
![Page 6: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/6.jpg)
From Hadoop ecosystem component…
HDFS
MapReduce 2.0
YARN
![Page 7: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/7.jpg)
From Hadoop ecosystem component…
HDFS
Spark
MapReduce 2.0
YARN
![Page 8: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/8.jpg)
To big data platform
HDFS
MapReduce 2.0
Spark
YARN
![Page 9: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/9.jpg)
To big data platform
Spark-as-a-Service
![Page 10: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/10.jpg)
Spark’s 4 submodules
Spark SQL
MLlib
GraphX
Streaming
![Page 11: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/11.jpg)
Spark SQL module
DataFrames
Spark SQL
◦ SQL
Hive
◦ HiveQL
◦ Spark Processing Engine
![Page 12: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/12.jpg)
MLlib module
Data analysis
Statistics
Machine learning
![Page 13: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/13.jpg)
GraphX module
Graph data storage and processing
GraphX
◦ In-memory graph data processing
HDFS
◦ Graph data storage
![Page 14: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/14.jpg)
Streaming module
Continuously streaming data
Discretized data streams (DStreams)
Micro-batch processing
![Page 15: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/15.jpg)
DStreams and micro-batch architecture
Source: http://www.slideshare.net/skpabba/hadoop-and-spark
RDD @ time 1 RDD @ time 2 RDD @ time 3
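The micro-batch idea in the diagram can be sketched in plain Python. This is not the Spark Streaming API, only an illustration of the concept: a continuous stream of timestamped events is cut into fixed-length time windows, and each window is processed as one small batch (the role an RDD plays at each time step). The event data, the `discretize` and `process_batch` names, and the 1-second interval are all assumptions for the example.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into micro-batches by time window."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval share a batch.
        index = int(timestamp // batch_interval)
        batches[index].append(value)
    # Return non-empty batches in time order; empty windows are skipped.
    return [batches[i] for i in sorted(batches)]


def process_batch(batch):
    """Stand-in for per-batch work; here it simply sums the values."""
    return sum(batch)


# Hypothetical stream: (seconds since start, value)
events = [(0.2, 3), (0.7, 4), (1.1, 5), (2.5, 1), (2.9, 2)]
micro_batches = discretize(events, batch_interval=1.0)
results = [process_batch(batch) for batch in micro_batches]
```

With a 1-second interval the five events fall into three batches, one per time step, and each batch is processed independently of the others.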
![Page 16: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/16.jpg)
Basic Spark Architecture
Spark SQL | MLlib | GraphX | Streaming (one processing module each)
Single Abstraction Layer
Spark Core Libraries
Resource Manager (YARN)
Data Storage Layer (HDFS)
Physical Hardware
![Page 17: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/17.jpg)
Changes with Spark 2.0
Spark 1.0: RDD API
Spark 1.3: RDD API, DataFrame API
Spark 1.6: RDD API, DataFrame API, Dataset API
Spark 2.0: Dataset API, DataFrame API, RDD API
![Page 18: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/18.jpg)
Changes with Spark 2.0
Spark 1.0: RDD API
Spark 2.0: Dataset API, DataFrame API, RDD API
![Page 19: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/19.jpg)
Changes with Spark 2.0
Structured Stream Processing
DataFrame API
Dataset API
![Page 20: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/20.jpg)
The story of how Spark began
![Page 21: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/21.jpg)
Taking things from the beginning…
2009
Mesos
UC Berkeley
Interactive, iterative parallel processing (in-memory)
◦ Machine learning requirements
Integrates with Hadoop ecosystem
Dr. Ion Stoica, Computer Science Professor, UC Berkeley
![Page 22: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/22.jpg)
Databricks… the cutting edge of Spark
Delivers Apache Spark-as-a-Service
Most popular solution for deploying Spark on the cloud
Dr. Ion Stoica, Executive Chairman, Databricks
![Page 23: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/23.jpg)
Databricks… the cutting edge of Spark
Spark on an as-needed basis
Automates
◦ Cluster building and configuration
◦ Security
◦ Process monitoring
◦ Resource monitoring
Notebooks
◦ For data analysis and machine learning using Python, R, and Scala
Data visualization capabilities
◦ Data visualization and dashboard design options
![Page 24: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/24.jpg)
Is Spark a data engineering or data science platform?
DATA ENGINEERING COMPONENTS AND TECHNOLOGIES
DATA SCIENCE COMPONENTS AND TECHNOLOGIES
![Page 25: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/25.jpg)
Spark’s data engineering elements
Automate cluster sizing and configuration requirements
Data Storage: HDFS
Resource Management:
◦ Spark Standalone
◦ Apache Mesos
◦ Hadoop YARN
![Page 26: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/26.jpg)
Spark’s data engineering elements
Spark Streaming submodule: reuse the same code you use for batch processing, but get real-time results!
Integrates with big data sources, like:
◦ HDFS
◦ Flume
◦ Kafka
◦ Twitter
◦ ZeroMQ
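The "reuse the same code" point can be illustrated with a plain-Python sketch (again, not the Spark API). One function holds the logic; it is run once over a complete stored dataset, and then run unchanged on each micro-batch as it arrives, with the per-batch results merged into a running total. The sample lines and the `count_words` name are hypothetical.

```python
from collections import Counter


def count_words(lines):
    """Shared logic: count word occurrences in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


# Batch use: run the logic once over a complete, stored dataset.
stored_lines = ["spark streaming", "spark sql", "kafka spark"]
batch_counts = count_words(stored_lines)

# Streaming use: run the same function on each micro-batch as it arrives,
# merging the per-batch results into a running total.
incoming_batches = [["spark streaming"], ["spark sql", "kafka spark"]]
running_counts = Counter()
for micro_batch in incoming_batches:
    running_counts += count_words(micro_batch)
```

Once every micro-batch has been seen, the running total matches the batch result, which is what makes it practical to share one code path between the two modes.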
![Page 27: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/27.jpg)
Doing data science with Spark
Useful for machine learning and analysis of big data
Build big data analytics products
Programmable in Python, R, Scala, and SQL
Submodules:
◦ SQL and DataFrames
◦ MLlib for machine learning
◦ GraphX for in-memory big (graph) data computations
![Page 28: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/28.jpg)
Doing data science with Spark
Spark integrates with the following data sources and formats:
◦ Hive, Avro, Parquet, CSV, JSON, JDBC, and HBase
◦ BI tools: Tableau, Qlik, Zoomdata, etc. (through JDBC)
![Page 29: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/29.jpg)
Who is using Spark and for what?
AUTOMATIC LABS
LENDUP
SELLPOINTS
FINDIFY
![Page 30: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/30.jpg)
Automatic Labs on Databricks
Making cars smarter with real-time analytics
Connect to, and make smart use of, your car’s data
![Page 31: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/31.jpg)
Automatic Labs on Databricks
Automatic apps do things like:
◦ Decoding engine problems
◦ Locating parked cars
◦ Crash detection and response
◦ Low fuel warnings, etc.
Automatic is using Spark to make cars smarter with real-time analytics
During product development, Automatic needs to query, explore, and visualize large amounts of data, QUICKLY. By moving this work over to Spark, Automatic was able to:
◦ Validate products in days, not weeks
◦ Complete complex queries in minutes
◦ Free up 1 full-time data scientist
◦ Save $10K/month on infrastructure costs
![Page 32: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/32.jpg)
LendUp on Databricks
Improving the lending process and experience
“Moving up the LendUp Ladder means earning access to more money, at better rates, for longer periods of time” - LendUp
![Page 33: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/33.jpg)
LendUp on Databricks
LendUp uses Spark for:
◦ Feature engineering at scale
◦ Fast model building and testing
By using Spark to do this work, LendUp is able to:
◦ Build more accurate models, faster
◦ Offer more lines of credit
◦ Develop new products more quickly
◦ Increase in-house productivity of data science team
![Page 34: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/34.jpg)
sellpoints on Databricks
Increasing ROI on ad spend
![Page 35: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/35.jpg)
sellpoints on Databricks
Increasing ROI on ad spend
Sellpoints offers services in:
◦ Identifying qualified shoppers
◦ Driving traffic
◦ Increasing sales conversion
By moving to Databricks, sellpoints was able to:
◦ Productize a new predictive analytics offering, improving the ad spend ROI threefold compared to competitive offerings.
◦ Reduce the time and effort required to deliver actionable insights to the business team while lowering costs.
◦ Improve productivity of the engineering and data science team by eliminating the time spent on DevOps and maintaining open source software.
![Page 36: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/36.jpg)
Findify on Databricks
Improving shopping experience for ecommerce customers
Uses machine learning to continually improve search accuracy
![Page 37: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/37.jpg)
Findify on Databricks
Improving shopping experience for ecommerce customers
By moving to Databricks, Findify was able to:
◦ Focus on development instead of infrastructure, allowing them to complete their feature development projects faster and reduce customer frustration from delayed analytics.
◦ Focus on building innovative features, because the managed Spark platform eliminated time spent on DevOps and infrastructure issues.
Uses machine learning to continually improve search accuracy
![Page 38: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/38.jpg)
Got Spark skills? Here’s why you should
IMPACT ON SALARY
TRAINING ISSUES AND OPPORTUNITIES
![Page 39: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/39.jpg)
How much do Spark skills pay?
2015 Data Science Salary Survey, by O’Reilly
[Bar chart: Annual Salary Increase, y-axis from $0 to $12,000. Categories: Spark Skills; Scala Programming; Basic Exploratory Analysis (>4 hr/wk); D3.js Skills. Bar labels as extracted: $11,000; $4,000; $4,600; $8,000.]
![Page 40: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/40.jpg)
Getting training and experience in Spark
$149.50
Sale until March 30 only
Discount code: ‘SPRING50’
![Page 41: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/41.jpg)
Getting training and experience in Spark
Get hands-on training in the following areas:
◦ Using RDDs
◦ Writing applications using Scala
◦ Spark SQL
◦ Spark Streaming
◦ Machine Learning in Spark (MLlib)
◦ Spark GraphX
◦ Spark Project Implementation
![Page 43: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/43.jpg)
Download these slides
![Page 44: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/44.jpg)
Why Data Science From Simplilearn
Key Features
40 hours of real-life industry project experience
25 hours of high-quality e-learning
Visualize and optimize data effectively using the built-in tools in R, SAS, and Excel
48 hours of live instructor-led online sessions
Get proficient in using R, SAS, and Excel to model data and predict solutions to business problems
Master the concepts of statistical analysis like linear & logistic regression, cluster analysis & forecasting
![Page 45: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics](https://reader031.vdocuments.site/reader031/viewer/2022030306/5870147d1a28ab7f428b517b/html5/thumbnails/45.jpg)
OUR JOURNEY SO FAR
Project Management
Digital Marketing
Big Data & Analytics
Business Productivity Tools
Quality Management
Virtualization and Cloud Computing
IT Security
Financial Management
CompTIA Certification
IT Hardware and N/W
ERP
IT Services and Architecture
Agile and Scrum Certification
OS and Database
Web and App Programming
Simplilearn: World’s Largest Certification Training Destination
One of the largest collections of accredited certification training in the world.