sparkflows - build e2e data analytics use cases in less than 30 mins

Use Cases to Build & Deploy in < 30 min

Self-Serve Big Data Analytics & Applications

Agenda

Introduction

Sparkflows Solution

Use Cases

Problem Definition

Time Consuming
• Takes a long time to build Big Data Analytics & Applications

Many Potential Use Cases
• Hard to enable many of them currently

Spark
• Big shift happening to Spark
• Hard to build & deploy Spark applications
• Hard to bring many people up to speed on Spark

Users Enabled
• Very few users are enabled to perform analytics, machine learning or build applications on Big Data systems

Streaming
• Streaming analytics becoming very popular, but hard to build

Mundane Tasks
• Many repeatable tasks take away a lot of time
• Parse logs/PDF, load into HBase/HIVE/Solr/ES, OCR/NLP

Introduction

100+ Building Blocks
• ETL, ML, OCR, NLP, Connect to various Sources/Sinks

Workflow Editor
• Powerful Schema Inference, Schema Propagation, Interactive Execution

Visualization & Dashboards

Prebuilt Workflows

Sparkflows Solution

• Workflow Editor
• Rich Visualizations & Dashboards
• 100s of Pre-built Nodes
• Batch & Streaming Engine
• Interactive Execution
• Easy Deployment & Configuration

Pre-built Workflows
• Telco Churn Prediction
• Housing Price Prediction
• Bike Sharing Analysis
• NY Taxi Data Analysis
• MovieLens Recommendations

Sparkflows Product Stack

Streaming Data: Kafka, Flume

Data Sources: HIVE/HBase, HDFS/S3, Solr, RDBMS

Apache Spark Cluster: Databricks, AWS, IBM Bluemix, On Prem, Azure

Data Sinks: HIVE/HBase, HDFS/S3, Solr, RDBMS

Visualizations / Dashboards

Building Blocks / Nodes

Machine Learning: Classification, Regression, Clustering, Collaborative Filtering, Save/Load Model, Predict, Cross-Validator

NLP: CoreNLP, StanfordNLP

OCR: Tesseract

Visualization: Line Chart, Bar Chart, Pie Chart, Updating Dashboards

File Formats: CSV/TSV, Parquet, JSON, Avro, PDF, Images, Whole Files

Feature Generation: Tokenization, TF, IDF, OneHotEncoder, StringIndexer, Imputer, Scaler

Data Sources/Sinks: HDFS, S3, Kafka, Flume, Twitter, HBase, Solr, Elastic Search

ETL: Joins, Unions, Filter, SQL, Scala, Python, GeoIP, ConcatColumns, Column Filter, Dedup

Languages: SQL, Scala, Jython, Java

Why Sparkflows?
Delivers End-to-End Data Analytics, Applications & Streaming with Big Data

Data Prep & Analytics
• Easily prepare data and perform analytics

Machine Learning
• Easily perform Machine Learning, NLP, OCR on Big Data

Streaming Analytics
• Build & execute Streaming Analytics pipelines visually

Mundane Big Data Tasks
• Parse PDF, IP to Geo, load into HBase, Cassandra, Solr, Elastic Search etc. in a breeze

Batch Applications
• Build Batch Applications with 100+ building blocks
• Incorporate SQL, Scala, Jython into the flow

Dashboards & Visualizations
• View data in charts and drag and drop to build self-updating dashboards

Multi-tenant & Secure
• Enable users across the org to use Big Data with full security integrations

Use Cases in < 30 minutes

Self-Serve Big Data Analytics
• Do Big Data Analytics with drag & drop, using 100+ building blocks

ETL Pipelines
• Build ETL pipelines with ease, and incorporate SQL, Scala, Jython into them

NLP
• Perform NLP on Big Data with OpenNLP and Stanford CoreNLP

OCR
• Perform OCR on millions of images with Tesseract

Streaming Analytics
• Perform Streaming Analytics: read from Kafka, apply complex transforms, generate graphs and write out to Solr, HBase etc.


Machine Learning
• Perform Machine Learning on huge datasets with drag and drop

Entity Resolution
• Perform large scale Entity Resolution on data from multiple channels

Log Analytics
• Build Log Analytics Platform with Kafka, Spark, Solr/Elastic Search, Hue

Format Conversion
• Convert Big Data from one format to another

Load data into Solr, ES, HBase
• Easily load data into Solr, Elastic Search, HBase etc.


Custom Nodes
• Create Custom Nodes and drop them in the Library/Workflow Editor

Dashboards
• Combine various outputs of workflows into a Dashboard

Self-Serve Data Analytics

[Workflow diagram, running on Spark. Inputs: CSV, AVRO, JSON, Parquet. Nodes: Read, Row Filter / Rename Col, JOIN, SQL / Scala / Jython, Random Forest, Graph, Save. Outputs: Solr, HBase, Elastic Search, HIVE, Model, Dashboard.]

ETL – Build ETL pipelines with ease

[Workflow diagram, running on Spark. Sources: CSV, HIVE. Nodes: ReadCSV, ReadHIVE, Filter, Filter, JOIN, SQL, LoadSolr, LoadES, LoadHBase, LoadHIVE. Sinks: Solr, ES, HBase, HIVE.]

ETL – Connect various SQL for powerful pipelines

[Workflow diagram, running on Spark. Sources: CSV, HIVE. Nodes: ReadCSV, ReadHIVE, four connected SQL nodes, LoadSolr, LoadES, LoadHBase, LoadHIVE. Sinks: Solr, ES, HBase, HIVE.]

NLP – Perform distributed NLP on Big Data

[Workflow diagram, running on Spark. Sources: PDF, CSV. Nodes: ReadPDF, ReadCSV, NLP, NLP, JOIN, LoadSolr, LoadES, LoadHBase, LoadHIVE. Sinks: Solr, ES, HBase, HIVE.]

OCR – Perform distributed OCR on Big Data

[Workflow diagram, running on Spark. Source: PDF. Nodes: ReadPDF (plus extract images), OCR, LoadSolr, LoadES, LoadHBase, LoadHIVE. Sinks: Solr, ES, HBase, HIVE.]

Streaming Analytics – With Kafka & Spark Streaming

[Workflow diagram, running on Spark. Source: Kafka. Nodes: ReadKafka, Transform (apply various transforms), Graph, LoadSolr, LoadES, LoadHBase, LoadHIVE. Sinks: Solr, ES, HBase, HIVE.]

Machine Learning – With Spark ML

[Workflow diagram, running on Spark. Source: HIVE. Nodes: Transform (apply various transforms), Split, Logistic Regression, Score, Evaluate.]

Entity Resolution – Applying various distance algorithms & scoring

[Workflow diagram, running on Spark. Inputs: DataSet 1, DataSet 2. Nodes: Join & Transform, Dedup, Filter low Scores. Output: HIVE.]
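The score-and-filter idea on this slide can be sketched in plain Python. This is a minimal illustration, not the Sparkflows implementation: the slides do not name the distance algorithms used, so the standard library's `difflib.SequenceMatcher` ratio stands in as an assumed similarity score, and the function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def normalize(name):
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in scorer."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve_entities(dataset_1, dataset_2, threshold=0.8):
    """Score every cross-dataset pair and keep pairs at or above the threshold.

    The threshold plays the role of the "Filter low Scores" step; 0.8 is an
    arbitrary value chosen for this sketch.
    """
    matches = []
    for left, right in product(dataset_1, dataset_2):
        score = similarity(left, right)
        if score >= threshold:
            matches.append((left, right, round(score, 3)))
    # Highest-scoring candidate matches first.
    return sorted(matches, key=lambda m: m[2], reverse=True)


if __name__ == "__main__":
    channel_a = ["Acme Corp", "Globex Inc"]
    channel_b = ["ACME  corp", "Initech"]
    for left, right, score in resolve_entities(channel_a, channel_b):
        print(left, "<->", right, score)
```

Comparing every pair is quadratic in the input size; at Big Data scale the join step would normally restrict comparisons to records that share a blocking key before scoring.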

Log Analytics

[Workflow diagram, running on Spark. Source: Apache Logs delivered through Kafka. Nodes: ReadKafka, Parse Apache Logs, IP2Geo, Graph, Save. Sinks: Solr, HBase, Elastic Search, HIVE. Query layer: SQL, HUE.]
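As an illustration of what a "Parse Apache Logs" step does, the sketch below parses lines in the Apache Common Log Format with a regular expression. It is plain Python rather than a Spark node, the function names are hypothetical, and it assumes the Common Log Format; logs in another format would need a different pattern.

```python
import re
from collections import Counter

# Apache Common Log Format: host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_apache_log_line(line):
    """Return a dict of fields, or None if the line does not match the format."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache writes "-" when no bytes were sent; treat that as zero.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def count_by_status(lines):
    """Count parsed requests per HTTP status code, skipping unparseable lines."""
    parsed = (parse_apache_log_line(line) for line in lines)
    return Counter(r["status"] for r in parsed if r is not None)


SAMPLE_LINES = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '10.0.0.7 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]

if __name__ == "__main__":
    print(count_by_status(SAMPLE_LINES))
```

The `host` field is what an IP-to-geo lookup would consume next, and the per-status counts are the kind of aggregate a Graph node could plot.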

Small Files Problem

[Workflow diagram, running on Spark. Inputs: CSV, HIVE. Nodes: Read, Coalesce, Save. Outputs: CSV, HIVE.]

Format Conversion

[Workflow diagram, running on Spark. Inputs: CSV, AVRO, JSON, Parquet. Nodes: Read, Save. Outputs: CSV, AVRO, JSON, Parquet.]
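The Read and Save pair on this slide amounts to decoding one format into rows and encoding those rows in another. A minimal sketch of that idea, using only the Python standard library on in-memory text, converts CSV to JSON Lines and back. The function names are hypothetical, and a real Spark job would work on distributed files and also cover Avro and Parquet, which this sketch does not.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text):
    """Read CSV text (first row is the header) and emit one JSON object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # All values stay strings: CSV carries no type information.
    return "\n".join(json.dumps(row) for row in reader)


def json_lines_to_csv(json_text):
    """Read JSON Lines text and emit CSV, using the first record's keys as header."""
    rows = [json.loads(line) for line in json_text.splitlines() if line.strip()]
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    source = "id,city\n1,Oslo\n2,Lima\n"
    as_json = csv_to_json_lines(source)
    print(as_json)
    print(json_lines_to_csv(as_json))
```

Because CSV has no schema, the round trip preserves values only as strings; typed formats such as Avro and Parquet are where schema inference matters.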

Loading Data into Solr, Elastic Search, HBase, HIVE

[Workflow diagram, running on Spark. Inputs: CSV, AVRO, JSON, Parquet. Nodes: Read, Save. Sinks: Solr, HBase, Elastic Search, HIVE.]

Custom Nodes – Create & Use Custom Nodes which add custom features

[Workflow diagram, running on Spark. Inputs: DataSet 1, DataSet 2. Nodes: Join & Transform, Custom Node, Custom Node. Output: HIVE.]

Dashboards – Combine output of various Workflows/Nodes into a Dashboard

THANK YOU
