sparkflows.io
Reducing cost and time-to-market for Big Data Analytics & Applications by 10X
Self-Service Big Data Analytics & Applications
Cut down from months to hours
Agenda
Problem
Sparkflows Solution
Differentiators
Data Analysts
Data Engineers
Data Scientists
It's challenging for users to get value out of the Data Lake
Data Lake
● Data Analytics, Data Preparation & Blending
● Machine Learning
● Streaming Applications
● Batch Applications
● Dashboards & Visualization
Needs a lot of coding on Big Data
Machine Learning
Classification, Regression, Clustering, Collaborative Filtering, Save/Load Model, Predict, Cross-Validator
NLP
CoreNLP, StanfordNLP
OCR
Tesseract
File Formats
CSV/TSV, Parquet, JSON, Avro, PDF, Images, Whole Files
Feature Generation
Tokenization, TF, IDF, OneHotEncoder, StringIndexer, Imputer, Scaler
Data Sources/Sinks
HDFS, S3, Kafka, Flume, Twitter, HBase, Solr
ETL
Joins, Unions, Filter, SQL, Scala, Python, GeoIP, ConcatColumns, Column Filter, Dedup
Long time to Production & Value
Hard to maintain and extend the pipelines/applications
Very Hard to Collaborate
Business, Data Scientist, Data Engineer, IT
Very Complex Deployment
Hard to hand over code
Results In
Data Analysts
Data Engineers
Data Scientists
Spark
Relational
Batch + Streaming
Hadoop
Workflow / Application Repository
Nodes Repository
Future
● 100+ Nodes
● Entity Resolution
● Machine Learning
● Data Wrangling / ETL / Drools
● Sentiment Analysis
● Recommendations
● Churn Prediction
● Log Analytics
● Workflow Designer
● Preview Mode
● Execution Engine
● Visualization
+ SQL / Scala / Python
Sparkflows Solution
Workflow Editor
How Sparkflows Works
Rich Visualizations & Dashboards
100s of Nodes
Batch & Streaming Engine
Interactive Execution
Easy Deployment & Configuration
Pre-built Workflows
Telco Churn Prediction
Housing Price Prediction
Bike Sharing Analysis
NY Taxi Data Analysis
MovieLens Recommendations
Confidential Property of Sparkflows.io
Sparkflows Product Stack
Streaming Data
Kafka
Flume
Data Sources
HIVE/HBase
HDFS/S3
Solr
RDBMS
Apache Spark Cluster
Databricks
AWS
IBM Bluemix
On Prem
Azure
Visualizations
ETL/NLP/OCR
Model Building
Workflow Execution
Scala/SQL/Python
Data Wrangling
Data Analysis
Data Pipelines
Big Data Analytics / Applications
Visualization
Data Sinks
HIVE/HBase
HDFS/S3
Solr
RDBMS
Business Analyst
Data Scientist
Data Engineer
IT
Data Analytics for Business Use Cases by dragging and dropping nodes and using various datasets.
Visualization and deep understanding of the data
Build predictive models and apply predictions
Do predictive and analytical modeling with the drag-and-drop capabilities
Write custom SQL, Scala, Python to close the gaps
Blend static and real-time streams to build complex data pipelines
Build and deploy complex pipelines in minutes.
Connect to various sources and sinks including Kafka, HDFS, S3, HBase, Solr.
Build and expose custom nodes in Sparkflows for others to use
Embed SQL, Scala, Python within the workflow.
Easily configure multi-tenancy and security for Sparkflows users
Connect workflow results to platform of choice for visualization
Provision Hadoop infrastructure, monitor workflow jobs, and tune performance
Why Now?
Big Trend towards building with Templates
StreamSets
iPhone Apps
Building Websites
NiFi
StreamAnalytix (Impetus)
Alteryx
Dashboards
Combine output of various Workflows into Dashboards
Core Differentiators
Easy & Natural to use and Deploy
Deep Integration with Hadoop - Security/Impersonation/HIVE/HBase/Solr
Custom Nodes - Users can write their own Nodes and plug into the UI
Schema Propagation
Interactive Execution at Design Time
Rich Application Dashboards
Growing Repository of Workflows for various Solutions
Complex Nodes built out by Sparkflows - Dedup, Drools, OpenNLP, StanfordNLP, Tesseract, etc.
Batch & Streaming - Nodes support both Batch & Streaming workloads
Support for SQL, Scala, Jython as Nodes of the workflow
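To make "Schema Propagation" and "Custom Nodes" concrete, here is a minimal sketch in plain Python. It is not the Sparkflows API: the `Node` base class, the `output_schema` method, and the `propagate` helper are hypothetical names chosen for illustration. The node names `ColumnFilter` and `ConcatColumns` are borrowed from the ETL list in this deck, but their behaviour here is an assumption. The idea shown is that each node declares how it changes the column schema, so a designer can work out the schema at every step before any data is processed.

```python
from dataclasses import dataclass
from typing import Dict, List

# A schema is modelled as column name -> type name (illustrative only).
Schema = Dict[str, str]


class Node:
    """Hypothetical base class: a node declares how it transforms a schema."""

    def output_schema(self, schema: Schema) -> Schema:
        raise NotImplementedError


@dataclass
class ColumnFilter(Node):
    """Keeps only the listed columns (assumed behaviour)."""
    keep: List[str]

    def output_schema(self, schema: Schema) -> Schema:
        missing = [c for c in self.keep if c not in schema]
        if missing:
            raise ValueError(f"unknown columns: {missing}")
        return {c: schema[c] for c in self.keep}


@dataclass
class ConcatColumns(Node):
    """Adds a new string column built from existing columns (assumed behaviour)."""
    inputs: List[str]
    output: str

    def output_schema(self, schema: Schema) -> Schema:
        missing = [c for c in self.inputs if c not in schema]
        if missing:
            raise ValueError(f"unknown columns: {missing}")
        result = dict(schema)
        result[self.output] = "string"
        return result


def propagate(schema: Schema, nodes: List[Node]) -> List[Schema]:
    """Return the schema before the first node and after every node, without touching data."""
    schemas = [dict(schema)]
    for node in nodes:
        schemas.append(node.output_schema(schemas[-1]))
    return schemas


source = {"first": "string", "last": "string", "age": "int"}
workflow = [
    ConcatColumns(inputs=["first", "last"], output="full_name"),
    ColumnFilter(keep=["full_name", "age"]),
]
final_schema = propagate(source, workflow)[-1]
```

Under these assumptions, a user-written node only needs to subclass `Node` and implement `output_schema` to take part in propagation, and a misconfigured node (for example, one that references a column dropped upstream) fails at design time instead of at run time.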
Line of Products
Data Analytics (Analytics / Wrangling / Machine Learning)
Streaming Analytics Applications
THANK YOU
Building Big Data Analytics & Applications is very costly & time-consuming
Customer 360
Fraud Detection
Operations Analytics
Cyber Security
IoT Analytics
Analytics Applications
Not enough users are able to extract great value from the Data Lake
Needs a lot of coding on Big Data
Data Analytics, Data Preparation & Blending
Machine Learning
Streaming Applications
Batch Applications
Visualizations