Sparkflows - Build End-to-End Data Analytics Use Cases in Less Than 30 Minutes
TRANSCRIPT
Use Cases to Build & Deploy in < 30 min
Self-Serve Big Data Analytics & Applications
Agenda
Introduction
Sparkflows Solution
Use Cases
Problem Definition
Time Consuming
• Takes a long time to build Big Data Analytics & Applications
Many Potential Use Cases
• Hard to enable many of them currently
Spark
• Big Shift Happening to Spark
• Hard to build & deploy Spark applications
• Hard to bring many people up to speed on Spark
Users Enabled
• Very few users are enabled to perform analytics, machine learning or build applications on Big Data Systems
Streaming
• Streaming analytics becoming very popular, but hard to build
Mundane Tasks
• Many repeatable tasks take away a lot of time
• Parse logs/PDF, load into HBase/HIVE/Solr/ES, OCR/NLP
Introduction
• 100+ Building Blocks: ETL, ML, OCR, NLP, Connect to various Sources/Sinks
• Workflow Editor: Powerful Schema Inference, Schema Propagation, Interactive Execution
• Visualization & Dashboards
• Prebuilt Workflows
Sparkflows Solution
• Workflow Editor
• Rich Visualizations & Dashboards
• 100s of Pre-built Nodes
• Batch & Streaming Engine
• Interactive Execution
• Easy Deployment & Configuration
• Pre-built Workflows: Telco Churn Prediction, Housing Price Prediction, Bike Sharing Analysis, NY Taxi Data Analysis, MovieLens Recommendations
Sparkflows Product Stack
• Streaming Data: Kafka, Flume
• Data Sources: HIVE/HBase, HDFS/S3, Solr, RDBMS
• Apache Spark Cluster: Databricks, AWS, IBM Bluemix, On Prem, Azure
• Data Sinks: HIVE/HBase, HDFS/S3, Solr, RDBMS
• Visualizations / Dashboards
Building Blocks / Nodes
• Machine Learning: Classification, Regression, Clustering, Collaborative Filtering, Save/Load Model, Predict, Cross-Validator
• NLP: CoreNLP, StanfordNLP
• OCR: Tesseract
• Visualization: Line Chart, Bar Chart, Pie Chart, Updating Dashboards
• File Formats: CSV/TSV, Parquet, JSON, Avro, Images, Whole Files
• Feature Generation: Tokenization, TF, IDF, OneHotEncoder, StringIndexer, Imputer, Scaler
• Data Sources/Sinks: HDFS, S3, Kafka, Flume, Twitter, HBase, Solr, Elastic Search
• ETL: Joins, Unions, Filter, SQL, Scala, Python, GeoIP, ConcatColumns, Column Filter, Dedup
• Languages: SQL, Scala, Jython, Java
Why Sparkflows?
Delivers End-to-End Data Analytics, Applications & Streaming with Big Data
• Data Prep & Analytics: Easily prepare data and perform analytics
• Machine Learning: Easily perform Machine Learning, NLP, OCR on Big Data
• Streaming Analytics: Build & execute Streaming Analytics pipelines visually
• Mundane Big Data Tasks: Parse PDF, IP to Geo, load into HBase, Cassandra, Solr, Elastic Search etc. in a breeze
• Batch Applications: Build Batch Applications with 100+ building blocks. Incorporate SQL, Scala, Jython into the flow
• Dashboards & Visualizations: View data in charts and drag and drop to build our self-updating dashboards
• Multi-tenant & Secure: Enable users across the org to use Big Data with full security integrations
Use Cases in < 30 minutes
• Self-Serve Big Data Analytics: Do Big Data Analytics with Drag & Drop with 100+ building blocks
• ETL Pipelines: Build ETL pipelines with ease. Also incorporate SQL, Scala, Jython in them.
• NLP: Perform NLP on Big Data with OpenNLP and Stanford CoreNLP
• OCR: Perform OCR on millions of images with Tesseract
• Streaming Analytics: Perform Streaming Analytics reading from Kafka, performing complex transforms, generating graphs and writing out to Solr, HBase etc.
Use Cases in < 30 minutes
• Machine Learning: Perform Machine Learning on huge datasets with drag and drop
• Entity Resolution: Perform large-scale Entity Resolution on data from multiple channels
• Log Analytics: Build a Log Analytics Platform with Kafka, Spark, Solr/Elastic Search, Hue
• Format Conversion: Convert Big Data from one format to another
• Load data into Solr, ES, HBase: Easily load data into Solr, Elastic Search, HBase etc.
Use Cases in < 30 minutes
• Custom Nodes: Create Custom Nodes and drop them in the Library/Workflow Editor
• Dashboards: Combine various outputs of workflows into a Dashboard
Self-Serve Data Analytics
[Diagram: Read nodes load CSV, AVRO, JSON and Parquet into Spark; workflow nodes include Row Filter / Rename Col, JOIN, SQL / Scala / Jython and Random Forest; Save nodes write to Solr, HBase, Elastic Search and HIVE; outputs include Graph, Model and Dashboard]
ETL – Build ETL pipelines with ease
[Diagram: ReadCSV and ReadHIVE load CSV and HIVE data into Spark; Filter, JOIN and SQL nodes transform it; LoadSolr, LoadES, LoadHBase and LoadHIVE write to Solr, ES, HBase and HIVE]
ETL – Connect various SQL for powerful pipelines
[Diagram: ReadCSV and ReadHIVE load CSV and HIVE data into Spark; several connected SQL nodes transform it; LoadSolr, LoadES, LoadHBase and LoadHIVE write to Solr, ES, HBase and HIVE]
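The idea of connecting several SQL steps, each reading the previous step's output, can be sketched outside Spark. The example below is plain Python using the standard-library sqlite3 module as a stand-in; it is not the Sparkflows SQL node, and the table names, columns and rows are made up for illustration.

```python
import sqlite3

# In-memory database standing in for the data read by ReadCSV / ReadHIVE
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 5.0), (3, 10, 40.0)],  # hypothetical sample rows
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(10, "east"), (11, "west")],
)

# SQL step 1: keep only orders of at least 10
conn.execute(
    "CREATE TEMP VIEW large_orders AS "
    "SELECT * FROM orders WHERE amount >= 10"
)

# SQL step 2: join the output of step 1 with the customers table
conn.execute(
    "CREATE TEMP VIEW enriched AS "
    "SELECT o.order_id, o.amount, c.region "
    "FROM large_orders o JOIN customers c ON o.customer_id = c.customer_id"
)

# SQL step 3: aggregate the output of step 2
totals = conn.execute(
    "SELECT region, SUM(amount) FROM enriched GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

Each view plays the role of one SQL node whose output becomes the input of the next, which is the pattern the slide describes.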
NLP – Perform distributed NLP on Big Data
[Diagram: ReadPDF and ReadCSV load PDF and CSV data into Spark; NLP and JOIN nodes process it; LoadSolr, LoadES, LoadHBase and LoadHIVE write to Solr, ES, HBase and HIVE]
OCR – Perform distributed OCR on Big Data
[Diagram: ReadPDF loads PDF files, plus extracts images, into Spark; an OCR node processes them; LoadSolr, LoadES, LoadHBase and LoadHIVE write to Solr, ES, HBase and HIVE]
Streaming Analytics – With Kafka & Spark Streaming
[Diagram: ReadKafka reads from Kafka into Spark; Transform nodes apply various transforms; a Graph node charts the results; LoadSolr, LoadES, LoadHBase and LoadHIVE write to Solr, ES, HBase and HIVE]
Machine Learning – With Spark ML
[Diagram: HIVE data flows through Spark nodes: Transform (apply various transforms), Split, Logistic Regression, Score and Evaluate]
Entity Resolution – Applying various distance algorithms & scoring
[Diagram: DataSet 1 and DataSet 2 flow through Spark nodes (Dedup, Join & Transform, Filter low Scores), with HIVE as the store]
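The score-and-filter idea behind entity resolution can be illustrated with a small sketch. The slides do not say which distance algorithms Sparkflows applies, so this example swaps in the standard-library difflib.SequenceMatcher ratio as a stand-in; the field names, the 0.8 threshold and the sample records are assumptions made for illustration, and the nested loop is only suitable for tiny inputs.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_entities(dataset_1, dataset_2, threshold=0.8):
    """Score every cross-dataset pair and keep pairs at or above the threshold."""
    matches = []
    for left in dataset_1:
        for right in dataset_2:
            # Average the per-field similarities into a single pair score
            score = (
                similarity(left["name"], right["name"])
                + similarity(left["city"], right["city"])
            ) / 2
            if score >= threshold:  # corresponds to the "Filter low Scores" step
                matches.append((left["id"], right["id"], round(score, 3)))
    return matches


# Hypothetical records from two channels
dataset_1 = [
    {"id": "a1", "name": "Jon Smith", "city": "New York"},
    {"id": "a2", "name": "Maria Garcia", "city": "Austin"},
]
dataset_2 = [
    {"id": "b1", "name": "John Smith", "city": "New York"},
    {"id": "b2", "name": "Peter Chan", "city": "Seattle"},
]
matches = resolve_entities(dataset_1, dataset_2)
```

Here only the pair (a1, b1) survives the threshold. At Big Data scale the all-pairs comparison would normally be replaced by a join on a blocking key, so that only plausible candidates are scored.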
Log Analytics
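[Diagram: Apache Logs flow into Kafka; in Spark, ReadKafka, Parse Apache Logs and IP2Geo nodes process them; results feed a Graph node and a Save node writing to Solr, HBase, Elastic Search and HIVE; SQL and HUE sit on top of the stored data]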
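The Parse Apache Logs step can be illustrated outside Spark. The sketch below is plain Python, not the Sparkflows node: the regular expression assumes the Apache Common Log Format, and the function names and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Apache Common Log Format: host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_apache_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache writes "-" when no response body was sent
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def count_by_status(lines):
    """Count parsed records per HTTP status, skipping unparseable lines."""
    parsed = (parse_apache_log_line(line) for line in lines)
    return Counter(record["status"] for record in parsed if record is not None)


# Hypothetical sample input
sample_lines = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '10.0.0.7 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]
status_counts = count_by_status(sample_lines)
```

The extracted ip field is what an IP-to-geo lookup would consume next, and per-status counts like these are the kind of aggregate a graph could display.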
Small Files Problem
[Diagram: a Read node loads CSV or HIVE data into Spark; a Coalesce node merges it; a Save node writes CSV or HIVE output]
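The coalesce idea, merging many small partitions into a few larger ones before saving, can be shown with a toy sketch. This is plain Python over in-memory lists, not Spark's coalesce and not how Spark decides which partitions to combine; the round-robin assignment and the sample data are assumptions made for illustration.

```python
def coalesce(partitions, num_partitions):
    """Merge a list of small partitions into at most num_partitions larger ones."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Never create more output partitions than there are inputs
    target = min(num_partitions, len(partitions)) or 1
    merged = [[] for _ in range(target)]
    for index, partition in enumerate(partitions):
        # Whole input partitions are appended; their rows are never split up
        merged[index % target].extend(partition)
    return merged


# Five hypothetical small files, represented as lists of rows
small_files = [[1, 2], [3], [4, 5], [6], [7, 8, 9]]
large_files = coalesce(small_files, 2)
```

Writing the two merged partitions instead of the five originals yields fewer, larger output files, which is the point of the Coalesce step.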
Format Conversion
[Diagram: a Read node loads CSV, AVRO, JSON or Parquet into Spark; a Save node writes CSV, AVRO, JSON or Parquet]
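A read-then-save format conversion can be sketched for one pair of formats. The example below is plain Python with the standard-library csv and json modules, not a Sparkflows workflow; it covers CSV to newline-delimited JSON only, and the function name and sample data are made up for illustration.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text):
    """Convert CSV text with a header row into newline-delimited JSON."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each CSV row becomes one JSON object keyed by the header names
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample input
csv_text = "id,city\n1,Austin\n2,Boston\n"
json_lines = csv_to_json_lines(csv_text)
```

All values come out as strings because CSV carries no type information; a schema inference step would be needed to recover numeric types.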
Loading Data into Solr, Elastic Search, HBase, HIVE
[Diagram: a Read node loads CSV, AVRO, JSON or Parquet into Spark; a Save node writes to Solr, HBase, Elastic Search and HIVE]
Custom Nodes – Create & Use Custom Nodes which add custom features
[Diagram: DataSet 1 and DataSet 2 flow through Spark nodes (Join & Transform and two Custom Nodes), with HIVE as the store]
Dashboards – Combine output of various Workflows/Nodes into a Dashboard
THANK YOU