Posted on 15-Mar-2020
TRANSCRIPT
1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Enterprise Spark At Scale
Agile Analytics with Enterprise Apache Spark at Scale
[Diagram: Spark on YARN, coordinated with Operations, Security, Governance, and Storage]
Powering Agile Analytics via data science notebooks and automation for the most common analytics (including geospatial analysis and entity resolution)
Seamless Data Access that brings together as many data types as possible
Unmatched Economics combining the speed of in-memory processing with HDP's cost efficiencies at scale
Ready for the Enterprise with robust security, governance, and operations coordinated centrally by Apache Hadoop and YARN
What Is Apache Spark?
Apache open source project originally developed at AMPLab (University of California, Berkeley)
Unified data processing engine that operates across varied data workloads and platforms
Why Apache Spark?
Elegant Developer APIs
– Single environment for data munging and Machine Learning (ML)
In‐memory computation model – Fast!
– Effective for iterative computations and ML
Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)
Why Apache Spark on YARN?
Resource management
– Share cluster resources between Spark workloads and other workloads (Pig, Hive, etc.)
Utilizes existing HDP cluster infrastructure
Scheduling and queues
[Diagram: Spark on YARN — the client-side Spark Driver talks to an Application Master in a YARN container, which manages multiple Spark Executors, each in its own YARN container running tasks]
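As a concrete illustration, a Spark application is typically handed to YARN with spark-submit; the flags below are standard, but every value (queue name, executor sizing, application file) is hypothetical and depends on the cluster:

```
# All values hypothetical: queue and executor sizing depend on the cluster.
# --deploy-mode cluster runs the Driver inside the YARN Application Master.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue analytics \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  my_app.py
```

The --queue flag is what ties a Spark job into YARN's scheduling and queues, so Spark shares capacity with the other workloads on the cluster.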
Emerging Apache Spark Patterns on HDP
Spark as a query-federation processing engine and caching tool
– Bring data from multiple sources to join/query in Spark
Use multiple Spark libraries together
– Common to see Core, ML & SQL used together
Use Spark with various Hadoop ecosystem projects
– Hive, HBase, Solr, etc.
– HDFS for long-running secure clusters/TDE
– Secure Kafka connection in Kerberos clusters
Structured Streaming with Apache Spark
Single, high‐level streaming API on DataFrames
Scalable, high‐throughput, fault‐tolerant stream processing of live data streams
Supports batch and interactive queries
– Aggregate data in a stream, then serve it via JDBC
– Build and apply ML models on the stream
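The core idea above (maintain a running aggregate as an unbounded stream arrives) can be sketched without Spark at all; this is a plain-Python toy, not the Structured Streaming API, which would express the same thing declaratively with a streaming DataFrame and groupBy().count():

```python
from collections import defaultdict

# Toy stand-in for a streaming groupBy().count(): state is updated
# incrementally as each micro-batch of events arrives, the way a
# streaming engine maintains a running aggregate.
class RunningCount:
    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, batch):
        # batch: iterable of (event_type, payload) tuples
        for event_type, _ in batch:
            self.counts[event_type] += 1
        return dict(self.counts)  # the current result "table"

agg = RunningCount()
agg.update([("click", 1), ("view", 1), ("click", 1)])   # micro-batch 1
result = agg.update([("view", 1)])                       # micro-batch 2
print(result)  # running totals after two micro-batches
```

The serving step mentioned in the slide would then expose this continuously updated result table, e.g. over JDBC.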
Data Processing in Apache Spark
Single, high-level, seamless mix of SQL queries and Spark programs
Connect to any data source the same way
– Hive, Avro, Parquet, HBase, ORC, JSON, JDBC, etc.
Connect through JDBC or ODBC
Apache Spark Use Cases
Massive Volumes of Weblogs Fueled Webtrends Growth—but also its Skyrocketing Storage Costs
Webtrends provides digital marketing solutions for more than 2,000 companies in 60 countries – processing 13 billion daily online events
Data used to be processed in relational databases and stored on large NAS appliances, which were not economical at scale
Processing occurred on‐premises, without cloud‐based capabilities
Diseconomies of scale hampered the company objective to help its customers predict optimal online ad placement
Webtrends’ Journey
Webtrends’ Journey
Petabytes of Weblogs Analyzed with Spark at Scale
Data streams from a vast array of desktop and mobile devices
13 billion daily events collected at millisecond latency per event
No data cleansing necessary prior to analysis with Apache Spark
Two clusters consolidated into one YARN‐based HDP cluster
Launched new product Webtrends Explore™ – powered by HDP
[Diagram: Webtrends' journey from renovating its platform to innovating on it, culminating in Personalized Online Ads]
“We’re able to…look at this data set and process it and do predictions, behavioral analysis.
We can do things that allow us to determine ROI for different actions and behavioral patterns.”
Peter Crossley, Chief Architect
[Diagram: use-case journey — Active Archive, Data Discovery, Single View, Predictive Analytics — with workloads including per-customer click path, web log analysis, SQL Server offload, behavioral segmentation, ad click predictions, and LCV analysis]
Customer Use Cases with Apache Spark
Web Analytics for Marketing (web analytics company)
– Ingesting 13 billion events/day
– Uses Spark Streaming for data ingest
– Extremely low latency: 40 milliseconds
– Needs more metrics for Spark Streaming
– Wants two-way SSL for the Kafka-Spark receiver
Optimize Advertising (cable company)
– Monitor channel changes with Spark Streaming
– Correlate changes with ads/programming
– Allocate ads in real time: show ads to users who are watching a show and will stay for more than 20 seconds
– How to optimize Spark app development
Real-time Fraud Detection (bank/credit card company)
– Monitor ATMs with NiFi
– Log aggregation and fraud detection
Smart Meters (utility company)
– Now getting data every 15 minutes
– Improve theft/fraud detection
– Text customers on power outage
Interacting with Apache Spark
Interacting with Apache Spark
[Diagram: four entry points to Spark on YARN, each with its own Driver — Spark Thrift Server, REST Server, Spark Shell, and Zeppelin]
Apache Zeppelin GA: The Data Science Notebook
Web‐based data science notebook
Interactive data ingestion and data exploration
Easy sharing and collaboration
Secure with single sign‐on and encryption
How Apache Zeppelin Works
[Diagram: a notebook author and collaborators/report viewers connect through Zeppelin to an HDP cluster (Spark, Hive, HBase, Solr) via any of 30+ interpreter back ends]
Bringing Multitenancy to Apache Zeppelin
Introducing Livy
Livy is the open source REST interface for interacting with Apache Spark from anywhere
Installed as part of the Spark Ambari service; not yet exposed outside of Zeppelin
[Diagram: a Livy client talks HTTP to the Livy Server, which communicates over HTTP/RPC with Spark interactive sessions and Spark batch sessions, each holding its own SparkContext]
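To make the REST interface concrete, the snippet below builds the requests a Livy client would issue, following the Livy REST API (POST /sessions to create an interactive session, POST /sessions/{id}/statements to run code, GET to poll). The host name and the session/statement ids are hypothetical, and nothing is actually sent:

```python
import json

LIVY = "http://livy-host:8998"  # hypothetical Livy Server address

# 1. Create an interactive session; Livy spins up a SparkContext on YARN.
create_session = {"kind": "pyspark"}

# 2. Submit a code statement to the running session.
submit_statement = {"code": "sc.parallelize(range(100)).count()"}

# Requests a Livy client would issue (illustrated, not sent here):
requests_out = [
    ("POST", LIVY + "/sessions", json.dumps(create_session)),
    ("POST", LIVY + "/sessions/0/statements", json.dumps(submit_statement)),
    ("GET",  LIVY + "/sessions/0/statements/0", None),  # poll for the result
]
for method, url, body in requests_out:
    print(method, url, body)
```

Because the protocol is plain HTTP/JSON, any client (Zeppelin, a CLI, another service) can drive Spark the same way.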
Security Across Zeppelin‐Livy‐Spark
[Diagram: Zeppelin (with Shiro authenticating against LDAP) uses the Ispark group interpreter to call the Livy APIs over SPNego/Kerberos; the Livy Server then launches the Driver for Spark on YARN via Kerberos]
Reasons to Integrate with Livy
Bring Sessions to Apache Zeppelin
– Isolation
– Session sharing
Enable efficient cluster resource utilization
– The default Spark interpreter keeps the YARN/Spark job running indefinitely
– The Livy interpreter is recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
Identity Propagation
– Sends the user identity from Zeppelin to Livy to Spark on YARN
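The inactivity timeout above lives in the Livy server configuration; a minimal fragment might look like this (the 1h value is illustrative, not a recommendation):

```
# livy.conf — recycle idle interactive sessions after a period of inactivity
livy.server.session.timeout = 1h
```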
SparkContext Sharing
[Diagram: SparkContext sharing in the Livy Server — Clients 1 and 2 share Session-1 (SparkSession-1 with its SparkContext), while Client 3 uses Session-2 (SparkSession-2 with its SparkContext)]
New Features of Spark in HDP 2.5
Apache Spark 2.0 Technical Preview
API Improvements
– SparkSession – new entry point
– Unified DataFrame & DataSet API
– Structured Streaming/Continuous Application
Performance Improvements
– Tungsten Phase 2: multi-stage code gen
Machine Learning
– The ML Pipeline API is the new API; MLlib is deprecated
– Distributed R algorithms (GLM, Naïve Bayes, K‐Means, Survival Regression)
SparkSQL
– More SQL support (new ANSI SQL parser, subquery support)
First Hadoop distribution with Spark 2.0
Side‐by‐Side Apache Spark Installs within HDP 2.5
Spark 1.6.2 & 2.0 can be installed on the same cluster, even on the same nodes
Spark 1.6 & Spark 2.0 are separate Ambari services
– Each service gets its own Spark History Server, Thrift Server, Spark Clients
– Each Service configuration is independent
Spark 1.6 job history goes only to the Spark 1.6 History Server
Spark 2.0 job history goes only to the Spark 2.0 History Server
How to experiment with Spark 2.0 TP
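One way to experiment, assuming the HDP convention of a SPARK_MAJOR_VERSION environment variable for choosing between the side-by-side installs (verify against your HDP 2.5 documentation):

```
# Point client commands at the Spark 2.0 TP install for this shell session
export SPARK_MAJOR_VERSION=2
spark-shell        # launches the Spark 2.0 shell

# Switch back to the GA Spark 1.6.2 install
export SPARK_MAJOR_VERSION=1
spark-shell
```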
Apache Spark + HBase Connector – GA within HDP 2.5
Brings DataFrame-based Spark analytics to HBase
See blog for usage patterns: http://bit.ly/sparkhbaseconnector
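The connector is driven by a JSON catalog that maps HBase column families/qualifiers onto DataFrame columns; the sketch below builds such a catalog (table name, column families, and types are all hypothetical) and shows, in a comment, roughly how it would be passed to the reader described in the blog:

```python
import json

# Hypothetical mapping for the Spark-HBase connector: the catalog describes
# how HBase row keys and column-family/qualifier pairs become DataFrame columns.
catalog = {
    "table": {"namespace": "default", "name": "customers"},
    "rowkey": "key",
    "columns": {
        "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
        "name":  {"cf": "info",   "col": "name",  "type": "string"},
        "spend": {"cf": "info",   "col": "spend", "type": "double"},
    },
}
catalog_json = json.dumps(catalog)

# With the connector on the classpath, a DataFrame would be created roughly as:
#   spark.read.options(catalog=catalog_json) \
#        .format("org.apache.spark.sql.execution.datasources.hbase").load()
print(catalog_json[:40])
```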
[Diagram: the Spark Driver coordinates four Spark Executors (each in a YARN container, running tasks) that issue Scans and BulkGets directly against HBase Region Servers]
Key Features: Apache Spark Column Security with LLAP
Fine‐Grained Column Level Access Control for SparkSQL
Fully dynamic per-user policies; no views required
Use Standard Ranger policies and tools to control access and masking policies
Flow:
1. SparkSQL gets data locations, known as "splits", from HiveServer2 and plans the query
2. HiveServer2 authorizes access using Ranger; per-user policies like row filtering are applied
3. Spark gets a modified query plan based on the dynamic security policy
4. Spark reads data from LLAP; filtering/masking is guaranteed by the LLAP server
[Diagram: the Spark Client talks to HiveServer2 (authorization) and the Hive Metastore (data locations, view definitions); the Ranger Server supplies dynamic policies, and LLAP performs the data read with filter pushdown]
Example: Per‐User Row Filtering by Region in SparkSQL
Original query (issued by both users):

SELECT * FROM CUSTOMERS WHERE total_spend > 10000

Dynamic rewrite for Spark User 1 (West Region), based on dynamic Ranger policies:

SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'west'

Dynamic rewrite for Spark User 2 (East Region):

SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'east'

LLAP data access:

User ID | Region | Total Spend
      1 | East   |       5,131
      2 | East   |      27,828
      3 | West   |      55,493
      4 | West   |       7,193
      5 | East   |      18,193
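The rewrite shown above can be illustrated with a toy: a policy table maps users to row-filter predicates, and the predicate is appended before execution. This is only a sketch of the idea; in the real system Ranger supplies the policies and LLAP enforces filtering/masking server-side, so it cannot be bypassed by the client:

```python
# Hypothetical per-user row-filter policies (in HDP these come from Ranger).
POLICIES = {
    "spark_user_1": "region = 'west'",
    "spark_user_2": "region = 'east'",
}

def rewrite(query, user):
    """Append the user's row-filter predicate to the query, if one exists."""
    predicate = POLICIES.get(user)
    return f"{query} AND {predicate}" if predicate else query

q = "SELECT * FROM CUSTOMERS WHERE total_spend > 10000"
print(rewrite(q, "spark_user_2"))
```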
Apache Zeppelin Security: Authentication + SSL
[Diagram: a user (Tommy Callahan) connects to Zeppelin over SSL through a firewall; Zeppelin authenticates the user against LDAP and runs Spark on YARN]
Apache Zeppelin + Livy End‐to‐End Security
[Diagram: Zeppelin (authenticating users against LDAP) calls the Livy APIs through the Ispark group interpreter using SPNego/Kerberos; the Livy Server talks Kerberos/RPC to Spark on YARN, and the job runs as the end user, Tommy Callahan]
Apache Spark & Zeppelin Timeline from Hortonworks
Spark in HDP:
– Spark 1.2.1 GA (HDP 2.2.4)
– Spark 1.3.1 TP, May 2015; GA (HDP 2.3.0)
– Spark 1.4.1 TP, Aug 2015; GA (HDP 2.3.2)
– Spark 1.5.1 TP, Nov 2015; Spark 1.5.2* GA (HDP 2.3.4, Dec 2015)
– Spark 1.6 TP, Jan 2016; GA (HDP 2.4.0, Mar 2016)
– Spark 1.6.1 GA (HDP 2.4.2)
– Spark 1.6.2 GA + Spark 2.0 TP (HDP 2.5, Aug 2016)
Zeppelin:
– First Hortonworks contribution to Zeppelin, Mar 2015
– Zeppelin TP #1, Oct 2015
– Zeppelin TP #2, Mar 2016
– Zeppelin Final TP, Apr 2016
– Zeppelin became an Apache TLP, May 2016
– Zeppelin GA (HDP 2.5, Aug 2016)
Demo