Pervasive Partner Presentation: KNIME + DataRush · 2017-05-23
TRANSCRIPT
www.pervasivebigdata.com
Pervasive Partner Presentation
KNIME + DataRush
Mike Hoskins, GM - Pervasive Big Data
KNIME Conf, Zurich Technopark, 1 Feb 2012
Big Data Pipeline
Data Scientists
Data Analysts
Business Analysts
Decision Makers
Operational Intelligence
Data Integrators
App Developers
Collect: monitor, log, ingest, event capture, decrypt
Prepare: profile, match, cleanse, aggregate, audit
Analyze: sample, model, discover, visualize, predict
Consume: report, chart, dashboard, alert, closed loop
Big Data Challenges
Volume
Pervasive DataRush
Full Core and Memory Utilization
Legacy Applications
• Single Threaded
• In-Memory
DataRush
• Dynamic Scaling Multi-Threaded
• Full Resource Utilization
• Data Flow
• Overcome Memory Heap Sizes
© Copyright 2011 Pervasive Software. All rights reserved
Auto-Scaling
Run time in minutes by core count:
   2 cores   370.0
   4 cores   192.4   (3.2 hours)
   8 cores    90.3   (1.5 hours)
  16 cores    51.6   (under 1 hour)
  32 cores    31.5
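The scaling figures on this slide can be turned into speedup and parallel efficiency numbers. The sketch below is illustrative only: it takes the five run times from the chart and uses the 2-core run as the baseline, which is an assumption, since the slide does not report a single-core time. The function name `scaling_table` is made up for this example.

```python
# Run times in minutes by core count, as read from the Auto-Scaling chart.
RUN_TIMES_MIN = {2: 370.0, 4: 192.4, 8: 90.3, 16: 51.6, 32: 31.5}


def scaling_table(run_times):
    """Return (cores, minutes, speedup, efficiency) rows.

    Speedup and efficiency are relative to the smallest core count in
    the input (assumption: no single-core measurement is available).
    """
    base_cores = min(run_times)
    base_time = run_times[base_cores]
    rows = []
    for cores in sorted(run_times):
        minutes = run_times[cores]
        speedup = base_time / minutes
        # Efficiency compares the observed speedup with the ideal one.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, minutes, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for cores, minutes, speedup, eff in scaling_table(RUN_TIMES_MIN):
        print(f"{cores:>2} cores: {minutes:6.1f} min  "
              f"speedup {speedup:5.2f}x  efficiency {eff:6.1%}")
```

Read this way, the chart shows near-linear scaling up to 8 cores and a gradual drop in efficiency at 16 and 32 cores.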
Full-Featured Data Preparation Functions
Analytics Functions For Deep Insights
DataRush & Hadoop
Malstone Benchmark – Logfile Processing
• Web site logs
• 10 billion rows (nearly 1 terabyte)
• Aggregates site intrusion information
Run Time vs. Total Cost of Ownership (TCO)
• 20-node cluster, 4 cores per node: 14 hours
• Single machine, 32 cores: 31.5 minutes
• 26x difference!
*www.opencloudconsortium.org/benchmarks
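The "26x" figure follows directly from the two run times on the slide. A minimal sketch of the arithmetic, with an extra core-minutes comparison that is not on the slide and is added here only to account for the different core counts (80 cores in the cluster versus 32 in the single machine):

```python
# Figures from the slide.
CLUSTER_NODES = 20
CORES_PER_NODE = 4
CLUSTER_RUN_MINUTES = 14 * 60      # 14 hours
SINGLE_MACHINE_CORES = 32
SINGLE_MACHINE_RUN_MINUTES = 31.5


def runtime_ratio(slow_minutes, fast_minutes):
    """How many times longer the slow run took than the fast run."""
    return slow_minutes / fast_minutes


def core_minutes(cores, minutes):
    """Total core time consumed by a run (cores x wall-clock minutes)."""
    return cores * minutes


wall_clock_ratio = runtime_ratio(CLUSTER_RUN_MINUTES, SINGLE_MACHINE_RUN_MINUTES)
core_time_ratio = (
    core_minutes(CLUSTER_NODES * CORES_PER_NODE, CLUSTER_RUN_MINUTES)
    / core_minutes(SINGLE_MACHINE_CORES, SINGLE_MACHINE_RUN_MINUTES)
)

if __name__ == "__main__":
    print(f"wall-clock ratio: {wall_clock_ratio:.1f}x")
    print(f"core-minutes ratio: {core_time_ratio:.1f}x")
```

The wall-clock ratio comes out just under 27, which the slide rounds down to 26x.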
Malstone Benchmark – Price/Performance
DataRush & Hadoop & KNIME
Pervasive DataRush Plug-in for KNIME
DataRush Plug-Ins
Drag and Drop to call DataRush for KNIME
Retrospective Analytics
What’s new since 2011 KNIME Conference
• Major Additions:
– New “DeriveFields” Operator
– Two new Join types from our Hive (SQL in Hadoop) work
• Semi-Join and Anti-Join
– Range Partitioning
• New Functions:
– Many Data Preparation functions
• Hadoop & Big Data Operators:
– Extreme high-performance HBase read/write
– Other Hadoop reader/writers
• Avro, Syslog, Netflow, Flume HBase sink
– KNIME nodes for HBase and HDFS read/write
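For readers unfamiliar with the two join types named above, here is a small sketch of their semantics in plain Python. It is not DataRush code and says nothing about how the DataRush operators are implemented; the function names and the sample rows are hypothetical.

```python
def semi_join(left, right, key):
    """Keep left rows whose key appears in right; no right columns are added."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def anti_join(left, right, key):
    """Keep left rows whose key does NOT appear in right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


# Hypothetical sample data: web visits and a list of compromised sites.
visits = [
    {"site": "a.example", "user": 1},
    {"site": "b.example", "user": 2},
    {"site": "a.example", "user": 3},
    {"site": "c.example", "user": 4},
]
compromised = [{"site": "a.example"}, {"site": "c.example"}]

if __name__ == "__main__":
    print(semi_join(visits, compromised, "site"))  # visits to compromised sites
    print(anti_join(visits, compromised, "site"))  # visits to all other sites
```

Unlike an inner join, a semi-join never duplicates a left row when the right side holds several matches, which is why it maps naturally onto SQL `IN` and `EXISTS` filters.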
What’s new since 2011 KNIME Conference (2)
• DataRush v6 (releasing later in 2012)
– Unified API/Composition model for scale-up SMP or scale-out Clusters
– Full Integration with NextGen MapReduce (DataRush as embedded dataflow computational alternative to coarse-grained MapReduce programming)
• DataRush for KNIME integration
– Continue the Krunner work (high-speed execution of contiguous DataRush nodes in a KNIME flow); make it work for DDR6 (Distributed DataRush v6, summer 2012)
– Standalone server or cluster execution of KNIME flows that contain only DataRush nodes
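The idea of finding contiguous DataRush nodes in a flow can be sketched as follows. This is a simplified illustration, not the Krunner implementation: it assumes the flow is a linear chain (real KNIME workflows are graphs), and the node names other than DeriveFields, the `engine` tags, and the function name are all hypothetical.

```python
from itertools import groupby


def contiguous_runs(flow):
    """Group a linear flow into runs of adjacent nodes that share an engine.

    `flow` is a list of (node_name, engine) pairs in execution order.
    Returns a list of (engine, [node_name, ...]) pairs.
    """
    return [
        (engine, [name for name, _ in group])
        for engine, group in groupby(flow, key=lambda node: node[1])
    ]


# Hypothetical flow mixing ordinary KNIME nodes with DataRush nodes.
flow = [
    ("File Reader", "knime"),
    ("DeriveFields", "datarush"),
    ("Join", "datarush"),
    ("Group", "datarush"),
    ("Scatter Plot", "knime"),
]

if __name__ == "__main__":
    for engine, names in contiguous_runs(flow):
        print(f"{engine}: {' -> '.join(names)}")
```

Each `datarush` run in the result is a candidate for being handed to the dataflow engine as one unit; a flow that yields a single `datarush` run corresponds to the "only DataRush nodes" case in the last bullet.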
Pervasive Big Data Stack
Azure
BigTable…
Pervasive Big Data Profiler
Pervasive Big Data Matcher
Moving from SDK to Consumable Products
Pervasive Big Miner
Telecom Analyzer
Pervasive Big ETL
SCADA manufacturing
Cybersecurity
Marketing/advertising
Pervasive BigOLAP
Time series, event, analytics
Platform
Tools
Products
Solutions
Pervasive DataRush
Big Data Integration and Analytics Platform
Hardware
• Single server or cluster
• On-premises or in cloud
Data Sources
• Flat files
• Relational databases
• NoSQL databases
• Hadoop
Pervasive Big BI
Pervasive Big Viz
Hadoop add-ons (TurboRush)
Ecosystem add-ons
Big Data (NoSQL) Tools
• TurboRush for HBase
• Big Tooling w/GUI
– BigIntegrator (aka PDI)
– BigETL (aka KNIME)
– BigBI
• Report, Chart, OLAP, Query
– BigMiner (aka KNIME)
Pervasive Data Integrator™ v10
• All Service Oriented / ESB
• Browser-based UI
• Deploy On-premises or Cloud
• Extensible and Embeddable
• New management capabilities
WEB INTERFACE
Drag and drop palette · Flexible workflow · Auto or drag and map
Predictive Analytics in DataRush for KNIME
Big Data Capture and Analysis for Telecom
Customer Churn
Network Performance
Fraud detection
Revenue Assurance
Customer Experience
Least-Cost Routing
Vendor Performance
SaaS apps
Server/Web/App logs
In-house apps
Sensors/Switches/Routers
Partner data
Flume, Snort, Esper
Collect Prepare Analyze
Monitor
Decrypt
Add timestamps
Log receipt
Store CSV, XLS
Store HDFS, HBase
Event ingest
What does it mean? Where is the fit good?
• KNIME is ready for Big Data! Just add DataRush
– Extreme scaling on modern commodity hardware: scale-up on Servers/Appliances, and scale-out on Clusters
– Native support for Hadoop and NoSQL
• Use cases already worked with DataRush for KNIME
– Telecomms CDR (Call Detail Records)
– Cybersecurity (Network and Weblog analytics)
– Life Sciences (Gene alignment and assembly)
– Financial Services and Healthcare
– General Data Mining (Clustering, Linear Regression, Decision Tree)
– Almost no limit to the use cases
• Well suited for:
– Machine generated “event” data (aka: log events)
– Long-running Analytic workloads (including Matching)
– Heavy “Data Prep” pre-processing
• Lacking Operators (today) for text, multimedia
Thanks! Q&A
Big Data Benchmarks on Hadoop
• Developed by the Open Cloud Consortium
• Benchmark related to web site visits and cyber infection status
• 10 billion row dataset with 100 bytes/row for a total of 1 Terabyte
1. The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing – Robert Grossman
http://rgrossman.com/2009/05/25/malstone-benchmark. Java code probably not optimized.
2. Subject to further review and potential optimization
3. Early test results – all subject to further optimization
Log file processing – Malstone benchmark
NOT FOR PUBLICATION

System                                               Note     Rows/sec    Rows/watt          Rows/$
20 nodes x 4 cores - Open Cloud Consortium cluster
  Grossman (Hadoop + Java MapReduce)                 [1]       187,266       62,422      46,816,479
Single server: 48-core, 64-disk "Hadoop Appliance"
  Pervasive 1 - Hadoop + Java MapReduce              [2]        75,597       88,938     110,630,075
  Pervasive 2 - Flat file + DataRush                 [3]     3,267,974    3,844,675   4,782,400,765
  Pervasive 3 - HDFS/HBase + DataRush                [3]     6,024,096    7,087,172   8,815,750,808
Performance ratio - Pervasive 3 vs Hadoop/MR cluster               32x         114x            188x
Read-only performance - HDFS/HBase + DataRush        [3]    12,800,000   15,058,824  18,731,707,317
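The performance-ratio row can be re-derived from the two rows it compares. The sketch below copies the figures from the table; the estimate of how long each rate would take on the 10 billion row dataset is an added illustration and assumes the rate is sustained for the whole run. All names are made up for this example.

```python
# (rows/sec, rows/watt, rows/$) as listed in the benchmark table.
HADOOP_MR_CLUSTER = (187_266, 62_422, 46_816_479)
PERVASIVE_3 = (6_024_096, 7_087_172, 8_815_750_808)

DATASET_ROWS = 10_000_000_000  # 10 billion rows, per the benchmark description


def performance_ratios(candidate, baseline):
    """Element-wise ratio of candidate metrics to baseline metrics."""
    return tuple(c / b for c, b in zip(candidate, baseline))


def hours_for_dataset(rows_per_second, rows=DATASET_ROWS):
    """Hours needed to process `rows` at a sustained rows/sec rate."""
    return rows / rows_per_second / 3600


ratios = performance_ratios(PERVASIVE_3, HADOOP_MR_CLUSTER)

if __name__ == "__main__":
    print("ratios (rows/sec, rows/watt, rows/$):",
          ", ".join(f"{r:.1f}x" for r in ratios))
    print(f"cluster:     {hours_for_dataset(HADOOP_MR_CLUSTER[0]):.1f} hours")
    print(f"Pervasive 3: {hours_for_dataset(PERVASIVE_3[0]) * 60:.1f} minutes")
```

Rounded to whole numbers, the three ratios match the 32x, 114x and 188x shown in the table.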
[Diagram: Big Data Platform. Labels, in extraction order: Hadoop; Structured Data; Events; ERP; CRM; APPs; Devices; Syslog; Event Collection Framework; Collector (x2); HBase; End User Tools; Aggregates (RDBMS); OLAP Engine; Data Prep; Real-time Visualization; Reporting; OLAP; Data Mining; ETL; HBase Sink (x2); SQL/MED; JDBC; XMLA; KNIME Wrapper; Query; HDFS; ETL; Integration]
Big Data Solutions
Telecom Provider Challenges
Switches / Network Elements
Off-net Usage
OSS/BSS Data
Corporate
Sales/Marketing
Network OPS
Customer Care
Information Technology
Vendor Performance
Pricing optimization
Product/Service Offers
Operational Performance
Profitability Analysis
Customer Experience
Capacity Optimization
Network Performance
Churn
Segment Insights
Usage Trends
Continuously Integrate
Problem Solving
Pervasive DataRush™
DataRush is a parallel dataflow platform that eliminates
performance bottlenecks in your data-intensive applications
• Scalable
• High Throughput
• Cost Efficient
• Easy to Implement
• Extensible
Business Issues
• Time to decision is critical
– Missed opportunities; wasted resources
– Customer issue reaction is too slow
• Deeper granularity of data is critical
– Understanding of trends is needed
– Pricing optimization
– Vendor performance
• Decision time - from days to minutes
– Deeper understanding of operational issues
– Which situations are problematic (or not)
Pervasive DataRush and Hadoop
• DataRush embedded within Hadoop
– Reduce complexities of MapReduce experience
– Increased efficiencies = significantly faster run times
– Cloudera Certification
[Diagram labels: four Mappers and two Reducers, six DataRush instances, Hadoop Distributed File System]
Malstone B, 0.5 TB:
• DataRush in Hadoop: 33 mins
• Hadoop: 135 mins
Pervasive DataRush™
DataRush is a parallel dataflow platform that eliminates performance bottlenecks in your data-intensive applications
• Scalable: Performance dynamically scales with increased core/server counts. No change to the code.
• High Throughput: Patented parallel dataflow technology enables fast, deep analysis of large data sets with no limit on input data size.
• Cost Efficient: Fully exploit commodity multicore servers and save significant capital and energy costs via efficient node utilization.
• Easy to Implement: DataRush takes care of complex parallel processing issues at design time: hides threading complexity; no deadlocks; runs on any platform, including Hadoop; etc.
• Extensible: DataRush is a component-based platform with an open API so you can easily extend it for your own needs.
DataRush Release Timeline
CQ1 2011 | CQ2 2011 | CQ3 2011 | CQ4 2011 | CQ1 2012 | CQ2 2012
DataRush 5.0 (January 2011)
• Distributed DR
• KNIME
• Performance
DataRush 5.0.1 (March 2011, ongoing …)
• Bug fixes
• Targeted features
DataRush 5.1 (December 2011)
• Hadoop and Hive integration
• I-Labs connectivity
• KNIME 2.4.1
• Bug fixes
DataRush 6 (TBD)
• Fully distributed composition and library
• Distributed execution in KNIME
• Next Gen MapReduce (?)
TurboRush for Hive 0.9
• Hive accelerator
• Limited release
DataRush & KNIME
KNIME Introduction
• Open source workflow for data mining
• Desktop designer
– Eclipse based (RCP app and plug-in)
– Node based architecture
• Nodes provide connectivity, transformations, algorithms, …
• Extensible model: user developed nodes supported
– Drag and drop, graphical editing of projects
– Project execution from GUI
– Workflow model – each node executes completely before next node is invoked
Predictive Analytics in DR-KNIME
Profiling in DR-KNIME
NextGen Sequencing and
Genomic Pipelines
NGS data explosion
Convert/filter FastA/FastQ files
Align/order/assemble
Report/visualize matching/coverage
Q & A
Big Data Products
Pervasive Big Data (NoSQL) Tools
• TurboRush for HBase
• Big Tooling w/GUI
– BigIntegrator
– BigBI
• Report, Chart, OLAP, Query
– BigMiner
– BigSearch
BigIntegrator: HBase as Source or Target
BigIntegrator: Visual Mapping to/from HBase
BigBI (aka BigQuery)
DataRush & KNIME
DataRush + KNIME – what is it?
• Plug-in of DataRush v5.1 to KNIME v3.2?
• Adds extreme high-performance data preparation and analytic functions
• Adds support for Hadoop data sources (both HDFS and HBase)
• Adds special dataflow “k-runner” mode that recognizes adjacent DataRush nodes and executes entirely in memory by “flowing” data from node to node
• KNIME functionality can be further extended with the DataRush SDK and Scripting
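The difference between running each node to completion and "flowing" data from node to node can be illustrated with Python generators. This is a single-threaded toy, not the k-runner or the DataRush engine (which is parallel); the three steps, the field names, and the sample log lines are hypothetical.

```python
def parse(lines):
    """Node 1: turn 'host,status' text lines into records."""
    for line in lines:
        host, status = line.split(",")
        yield {"host": host, "status": int(status)}


def keep_errors(rows):
    """Node 2: keep only server-error records."""
    for row in rows:
        if row["status"] >= 500:
            yield row


def count_by_host(rows):
    """Node 3: aggregate the number of records per host."""
    counts = {}
    for row in rows:
        counts[row["host"]] = counts.get(row["host"], 0) + 1
    return counts


def run_node_at_a_time(lines):
    """Each node finishes and materializes its full output before the next starts."""
    parsed = list(parse(lines))          # full intermediate table
    errors = list(keep_errors(parsed))   # second full intermediate table
    return count_by_host(errors)


def run_streaming(lines):
    """Rows flow through all three nodes one at a time; no intermediate tables."""
    return count_by_host(keep_errors(parse(lines)))


LOG_LINES = ["a,500", "b,200", "a,503", "b,500", "c,404"]

if __name__ == "__main__":
    print(run_node_at_a_time(LOG_LINES))
    print(run_streaming(LOG_LINES))
```

Both functions produce the same counts; the streaming version simply never holds a complete intermediate table, which is the property the k-runner bullet describes for adjacent DataRush nodes.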
Pervasive RushMiner
Visual Environment for Big Data Analytics and Preparation
• Quickly cleanse, profile and aggregate big data
• Use data mining, predictive analytics, and machine learning to uncover actionable intelligence
• Works with flat files, relational databases, NoSQL databases, and Hadoop filesystem (HDFS)
• High performance, scales up to terabytes of data
• Design on your desktop using simple drag-and-drop interface
• Execute on desktop, remote server, or clusters, including Hadoop clusters
Event Processing with DataRush
• Capture ALL data
• Discover previously unavailable patterns, correlations, etc.
• Scalable to meet growing needs
Processed 100 million Syslog events in 58 seconds on a 48-core system: a sustained run rate of 14 Tb per day.
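The event-rate claim can be unpacked with a few lines of arithmetic. The sketch below uses only the figures on the slide; the last step assumes that "14 Tb" means 14 terabytes (14 x 10^12 bytes), which the slide does not state, and derives the average event size that reading would imply.

```python
EVENTS = 100_000_000      # Syslog events processed, per the slide
ELAPSED_SECONDS = 58      # elapsed time, per the slide
SECONDS_PER_DAY = 86_400
# Assumption: "14 Tb per day" is read as 14 terabytes (decimal) per day.
ASSUMED_BYTES_PER_DAY = 14 * 10**12


def events_per_second(events, seconds):
    """Average throughput over the measured run."""
    return events / seconds


def events_per_day(rate_per_second):
    """Daily volume if the measured rate were sustained for 24 hours."""
    return rate_per_second * SECONDS_PER_DAY


def implied_bytes_per_event(bytes_per_day, daily_events):
    """Average event size implied by a daily byte volume."""
    return bytes_per_day / daily_events


rate = events_per_second(EVENTS, ELAPSED_SECONDS)
daily = events_per_day(rate)
event_size = implied_bytes_per_event(ASSUMED_BYTES_PER_DAY, daily)

if __name__ == "__main__":
    print(f"{rate:,.0f} events/sec")
    print(f"{daily:,.0f} events/day")
    print(f"about {event_size:.0f} bytes per event under the terabyte reading")
```

Under that reading the implied event size is a little under 100 bytes, which is in the same range as the 100 bytes/row stated for the MalStone dataset elsewhere in the deck.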