imc hadoop gridgain · 2019-06-18 · hadoop data lakes with gridgain denis magda, gridgain, vp of...

19
Real-Time Analytics for Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, PMC Chair

Upload: others

Post on 07-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

Real-Time Analytics for Hadoop Data Lakes with GridGain

Denis Magda,GridGain, VP of Product ManagementApache Ignite, PMC Chair

Page 2: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential1

Digital Transformation and The Need for Real-time Analytics

Page 3: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential2

The Common Digital Transformation Goal

“We want to be a tech company with a banking license”

Ralph Hamers, CEO

Page 4: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

But Traditional IT Architectures Cannot Support the Modern, Digital Enterprise…

§ Digital Transformations Require New Levels of Performance and Scalability

– 10-100x more queries and transactions per second driven by mobile and digital business

– 50x as much data today as a decade ago and growing rapidly

– Overnight analytics must become real-time

§ Real-Time Decision Making Requires That Transactions and Analytics are Run On the Same Data Set

§ HTAP, HOAP, Translytical solutions

§ Traditional Applications Built on Disk-Based Databases Cannot Provide the Needed Speed and Scalability

10-100x Queries and Transactions

(per Sec)

50xData Storage

(Big Data)

10-1000x Faster

Analytics (Hours to Sec)

Application Layer

Web-Scale Apps Mobile AppsIoT Social Media

Data Layer

NoSQLRDBMS Hadoop

Page 5: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

Data Layer

NoSQLRDBMS Hadoop

Application Layer

Web-Scale Apps Mobile AppsIoT Social Media

In-Memory Computing Solves the Speed and Scalability Challenges of Digital Business

§ The Only Affordable Path to Dramatic Speed and Scalability Improvements for Existing Applications

§ Application Performance Increases 10x to 1,000x

§ Applications Become Scalable to Petabytes of Data

§ Operational Data Can be Analyzed in Place in Real-Time Using HTAP/HOAP/Translytical Data Stores

In-Memory Computing Layer

Page 6: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential5

Data Lake Acceleration Architecture

Page 7: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

GridGain In-Memory Computing Platform

6

Data Lake Acceleration Solution With GridGain

Data Lake

Application Layer

Web-Scale Apps Mobile AppsIoT Social Media

GridGain APIs Hadoop APIs

Federated Queries

Hadoop Connector

• Data loading• Synchronization

Page 8: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

Roles of Hadoop and GridGain

7

Continue Using Hadoop…• As a storage for old historical data (weeks, months, years)• For batch processing (dozens of seconds, hours)• Standard analytics workloads

Switch to GridGain…• As a storage for operational and “warm” historical data• For real-time processing (seconds, millisecond)• Operational workloads + real-time analytics

Use Spark…• To span across operational and historical data sets

Page 9: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential8

Essential GridGain APIs

Page 10: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential9

Ignite SQL Queries

1. Initial Query2. Query execution over local data3. Reduce multiple results in one

Ignite Node

CanadaToronto

OttawaMontreal

Calgary

Ignite Node

IndiaMumbai

New Delhi

1

2

23

Page 11: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

Ignite Compute Grid

GridGain Cluster

C1

R1

C2

R2

C = C1 + C2

R = R1 + R2

C = Compute

R = Resultin T/2 time

Automatic Failover

Load BalancingZero Deployment

In-Memory Data Store

Persistent Store

Server Node

In-Memory Data Store

Persistent Store

Server Node

Page 12: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

Continuous Learning FrameworkIn-Memory Machine and Deep Learning

Persistent Store

Distributed Machine Learning Datasets

TensorFLowRegressionsK-Means Decision Trees

In-Memory Data Store

GridGain Continuous Learning Framework

Compute and Service Grid

C++.NETJava RESTBinary Protocal (Thin client)

DistributedAlgorithms

Large Scale Parallelization

Multi-language Support

No ETL

DistributedDataset basedon partitionedcaches

Page 13: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

Spark Application

Spark Worker

Spark Job

Spark Job

Yarn Mesos Docker HDFS

Spark Worker

Spark Job

Spark Job

Spark Worker

Spark Job

Spark Job

In-Memory Shared RDD or DataFrame

Share state and data

among Spark jobs

Boost DataFrame and SQL Performance

SQL on top of RDDs

No data movement

GridGain Spark Integration

In-place query

execution

In-Memory Data Store

Persistent Store

Server Node

In-Memory Data Store

Persistent Store

Server Node

In-Memory Data Store

Persistent Store

Server Node

Page 14: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential13

GridGain Hadoop Connector

Page 15: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

Data Loading

14

• SparkLoader Tool– For deployments with Spark

• Hive Store– Schema Import and Loading– Writes GridGain changes to Hadoop

Page 16: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

Clusters Synchronization with Apache Sqoop

15

• Keep GridGain and Hadoop in Sync– Asynchronous import/export

• Uses JDBC Driver for GridGain

• Captures Incremental Changes– Sqoop monitors Timestamp fields for updates

Page 17: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

Federated-Queries

16

Dataset<Row> hiveDS = session.table("default.cities") .select("city_id", "city_name");//Create Hive DataFrame

//Create GridGain DataFrameDataset<Row> gridgainDS = session.read() .format(IgniteDataFrameSettings.FORMAT_IGNITE()).option(IgniteDataFrameSettings.OPTION_TABLE(), "Person").load().select("id", "city_id", "name", "age", "company");

//INNER JOINhiveDS.join(gridgainDS, hiveDS.col("city_id").equalTo(gridgainDS.col("city_id"))).show();

Page 18: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential

Download and Implement Today

17

• GridGain Data Lake Accelerator Solution– https://docs.gridgain.com/docs/bdb-getting-started

• GridGain Hadoop Connector Downloads– https://www.gridgain.com/resources/download#bigdatapack

Page 19: imc hadoop gridgain · 2019-06-18 · Hadoop Data Lakes with GridGain Denis Magda, GridGain, VP of Product Management Apache Ignite, ... Hadoop Connector • Data loading ... Distributed

2019 © GridGain Systems GridGain Company Confidential18

Q&A