
© 2015 IBM Corporation

Luncheon Webinar Series

June 11th, 2015

InfoSphere Information Server on Hadoop!

Presented by Tony Curcio, IBM

Sponsored By:


InfoSphere Information Server

• Questions and suggestions regarding presentation topics? Send to [email protected]

• Downloading the presentation
  – http://www.dsxchange.net/IISonHadoop.html
  – Replay will be available within one day; watch for an email with details

• Pricing and configuration: send to [email protected], subject line: Pricing

• Bonus Offer – Free premium membership for your DataStage Management! Submit your management’s email address and we will offer them access on your behalf.
  – Email [email protected], subject line “Managers special”.
  – Join us all at LinkedIn: http://tinyurl.com/DSXmembers



Information Server on Hadoop beta update

06/11/2015

Tony Curcio

IBM Product Director


First Poll Question

What version of Information Server/DataStage are you using (choose your most current version in production)?

• 11.3.x
• 9.1.x
• 8.7
• 8.5
• umm… one of the more classic versions


Second Poll Question

What is the status of your Hadoop usage?

• In production with Hadoop already.
• Plan to go live with Hadoop in production this year
• Still learning about it
• That's the yellow elephant thing, right?


Big Data Integration Is Critical For Success With Hadoop

“By most accounts 80% of the development effort in a big data project goes into data integration … and only 20% goes towards data analysis.”

Extract, Transform, and Load Big Data With Apache Hadoop - White Paper
https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf

Most Hadoop initiatives involve collecting, moving, transforming, cleansing, integrating, exploring, and analysing volumes of disparate data sources and types.


Customers are telling us: “Data Integration is their #1 requirement for Hadoop”

“By most accounts, 80% of the development effort in a big data project goes into data integration and only 20% goes toward data analysis.” -- Intel Corporation (Extract, Transform, and Load Big Data With Apache Hadoop Whitepaper)

A customer set up a massive data integration engine, consisting of 640 BigMatch cores and 224 Information Server/DataStage cores, to address Big Data requirements.

Large Bank is soliciting proposals for a 1 petabyte Data Refinery with data volume growth of 5x over 3-4 years. Data Integration is the only use case for the Data Refinery.

Hadoop is not a data integration solution -- Gartner Research

“Right now data integration is our only use case for Hadoop.”

Our #1 use case for Hadoop is offloading Teradata ELT.

Smart customers know their Big Data Projects will fail without Big Data Integration


IBM Information Server – Leadership in Big Data Integration

• Optimized load; benchmark >5TB/hr
• 1st Hadoop Pushdown Optimization
• 1st Big data Discovery, Profiling, Transformation & governance of data
• Optimized load benchmark achieving >15TB/hr
• Inclusion of a fully supported Hadoop instance
• Self-service data integration for Hadoop (i.e. Data Click)
• Governance for Hadoop
• Kerberos-enabled access to HDFS data
• Native Hadoop runtime for Data Integration, Quality and Governance
• 1st to provide native high speed Hadoop connectivity for BI, Cloudera, HortonWorks


Supports all 7 dimensions of a great data integration platform

• Productivity - Rich user interface features simplify the design process and metadata management requirements.
• Transformation - Extensive set of pre-built objects that act on data to satisfy both simple & complex data integration tasks.
• Connectivity - Native access to common industry databases and applications, exploiting key features of each.
• Performance - Runtime engine providing unlimited scalability through all objects tasks in batch/real-time, ETL/ELT/DV/SOA.
• Operations - Simple management of the operational environment, lending analytics for understanding and investigation.
• Administration - Intuitive and robust features for install, maintenance, configuration, security and resiliency.
• Governance - Maximizes business & IT collaboration, providing business terms, policies, advanced impact analysis, search, comparison & more.


DataStage Parallelism

[Figure: DataStage flow GBC_Supp_step1, with records fanned out in ranges 1-100, 101-200, 201-300, 301-400, 401-500, 501-600, …]

1 Pipeline – Data moves between stages (boxes on flow) like an assembly line process. Each box has its own responsibilities for data.

2 Partition – Ability to fan out data to multiple streams based on some key (hash on ARR-ID in this case) allows like data to be processed in order on that stream.

3 Instance – Multiple DataStage instances can be run at the same time, either on the same server (if there are sufficient resources) or on separate physical servers.
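The partitioning idea above (hash on a key so that like data stays together and in order on one stream) can be illustrated with a small sketch. This is not DataStage code or its actual partitioner: the function names, the sample records, and the choice of CRC32 as the hash are all assumptions made for illustration only.

```python
import zlib


def partition_index(key, num_partitions):
    """Map a key to a partition number using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(records, key_field, num_partitions):
    """Fan records out to partitions; records sharing a key keep their input order."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        idx = partition_index(record[key_field], num_partitions)
        partitions[idx].append(record)
    return partitions


# Hypothetical sample data keyed on ARR-ID, as in the slide.
records = [
    {"ARR-ID": "A-100", "seq": 1},
    {"ARR-ID": "B-200", "seq": 2},
    {"ARR-ID": "A-100", "seq": 3},
    {"ARR-ID": "C-300", "seq": 4},
    {"ARR-ID": "B-200", "seq": 5},
    {"ARR-ID": "A-100", "seq": 6},
]

parts = hash_partition(records, "ARR-ID", num_partitions=3)
for i, part in enumerate(parts):
    print(i, [(r["ARR-ID"], r["seq"]) for r in part])
```

Because the partition is a pure function of the key, every record for a given ARR-ID lands on the same stream, and appending in input order preserves the relative order of those records within it.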


Third Poll Question

How would you characterize your requirements for data integration and cleansing in Hadoop?

• DI/DQ is the primary requirement for us using Hadoop.
• DI/DQ are supporting our modern data warehouse/analytics in Hadoop
• We need some DI to get data into Hadoop, but that's basically it
• No plans to embrace the elephant yet


INFORMATION SERVER FOR HADOOP

1


What if you could … ?

… integrate, transform, cleanse and govern your data natively within your Hadoop cluster?

… design your integration and cleansing workflows through an easy drag and drop UI?

… run it anywhere – inside or outside your cluster?

… have a fully governed environment alongside?

And … what if it would run faster and with less latency than a comparable Map/Reduce routine?

You think that’s impossible?


Benefits of Information Server for Hadoop

Superfast data ingest and processing (no other vendor has this scale or speed)
• Integrate, prepare and enrich data with speed and confidence
• running natively on Hadoop with speeds upwards of 15x faster than MapReduce

Complete confidence in your data (fully integrated quality and governance)
• understand what data is available and where it came from
• monitor and cleanse quality of data; catalog metadata assets and trace lineage

Higher Level of Productivity (proven development paradigm)
• develop integration processes much faster than hand coding – based on existing enterprise skills
• graphical data flow development environment with 100s of prebuilt stages and 1000s of prebuilt functions


InfoSphere Information Server for Hadoop

• The most scalable Transformation and Data Integration and Quality Engine now runs natively on Hadoop
• Get enterprise-class transformation and cleansing for your Hadoop data
• Use the power of your Hadoop Cluster to integrate, transform & cleanse data without writing a single line of code
• Easily govern all your transformation and data quality processes


Why Information Server on Hadoop

• More customers choosing HDFS/GPFS for data landing zone, archive, and analytics processing
  • HDFS / GPFS are cheap, scalable, and fault tolerant
• DataStage offers large advantages over Hadoop data processing tools
  • Graphically build data flows with little or no programming knowledge
  • No hand coding, improve developer productivity
  • Existing jobs can be moved directly over to run on Hadoop
• Reduce infrastructure, one cluster many applications
  • DataStage coexists with other Hadoop applications managed by YARN
• Performance and Scalability
  • DataStage scales linearly as nodes are added to cluster
  • Run DataStage jobs more efficiently local to HDFS / GPFS data


Information Server is many times faster than M/R

[Figure: Producer Operator → Record Pipeline → Consumer Operator]

Job-level Fault Tolerance

InfoSphere Information Server is designed to exploit data pipelining to minimize I/O and maximize overall performance.

• Data is streamed from producer to consumer with data repartitioning without landing to disk
• Intermediate data is not written to disk
• Flow control prevents the producer from overrunning the consumer
• Pipelined and partitioned parallelism provides efficiency and high performance

Task-level Fault Tolerance

Map-Reduce is designed for task-level fault tolerance. The downside of this is significantly lower performance.

• Producer (Map) writes intermediate results to disk
• Consumer (Reduce) “pulls” the data
• This design provides task-level fault tolerance
• This design lowers overall performance and efficiency by orders of magnitude*

*See: Themis: An I/O-Efficient MapReduce http://themis.sysnet.ucsd.edu/papers/themis_socc12.pdf
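The pipelined model described on this slide (records streamed from producer to consumer in memory, with flow control so the producer cannot overrun the consumer) can be sketched with a bounded in-memory queue. This is only an illustration of the general technique, not how the Information Server engine is implemented; the function names, the buffer size, and the doubling "transformation" are assumptions.

```python
import queue
import threading

SENTINEL = object()  # marks end of the record stream


def producer(records, out_q):
    """Upstream operator: emits records; put() blocks when the buffer is full."""
    for record in records:
        out_q.put(record)  # blocking put is the flow control
    out_q.put(SENTINEL)


def consumer(in_q, results):
    """Downstream operator: processes records as they arrive, nothing lands on disk."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item * 2)  # stand-in for a real transformation


def run_pipeline(records, buffer_size=4):
    """Run producer and consumer concurrently over a bounded in-memory buffer."""
    buf = queue.Queue(maxsize=buffer_size)
    results = []
    t_prod = threading.Thread(target=producer, args=(records, buf))
    t_cons = threading.Thread(target=consumer, args=(buf, results))
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
    return results


output = run_pipeline(range(10))
print(output)
```

The bounded queue means at most `buffer_size` records are ever in flight between the two operators, in contrast to the MapReduce model on the slide, where the producer writes its full intermediate output to disk before the consumer pulls it.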


DataStage Parallelism – Fully supported inside Hadoop


Information Server for Hadoop – Runtime Architecture Overview

[Figure: runtime architecture.
IS Engine Tier Node, on a Hadoop Edge Node (or Full Node): Conductor, IS YARN Client.
Hadoop Cluster: NameNode and multiple DataNodes. Some DataNodes host an IS AM; others host a YARN Container running a Section Leader with Player 1, Player 2, … Player N.
The path /opt/IBM/InformationServer is labeled on the nodes.]



Fourth Poll Question

Considering using tools for your Hadoop implementation, which factor appeals most to you?

• Scaling DI/DQ workloads to any level
• No need to hand-code MapReduce to transform/cleanse data
• Can use industry proven solutions for DI/DQ in this strategic evolution
• Can't I just get in my TARDIS and go back to 2009?


EARLY RELEASE PROGRAM UPDATE

2


Early Release Program Details

• Program Start: October 25, 2014

• Program End: September 30, 2015

• Supported Hadoop Distributions:
  • Phase 1: Cloudera 5.1+ & HortonWorks 2.1+
  • Phase 2: BigInsights 4
• Supported Information Server environments
  • V11.3.1 on Linux OS
  • Hadoop “Beta” patch


ERP Objectives

• Obtain feedback on installation & setup of Information Server on a Hadoop grid environment
• Obtain feedback on running DataStage, QualityStage or Information Analyzer within a YARN managed grid environment.
  • Users should be able to run any existing job unmodified (except for path changes)
• Obtain feedback on performance & scalability running workloads inside a Hadoop 2.0 cluster


New Features in Beta 2 Build Available 3/9/15

• Includes several customer requested enhancements, including:

o Kerberos enabled clusters are now supported

o Full Edge/Client node support for Engine Tier install

o NFS mount requirement removed, IS binaries may be copied to data nodes

o Option to automatically distribute IS binaries if they aren’t detected

o Data locality support for BDFS file reads

o Container size estimation


Current Beta Participation

• Nearly 20 Participants
  • Includes both customers and business partners
• Distribution in use:
  • All supported Hadoop distributions
• Product Coverage
  • DataStage, QualityStage and Information Analyzer


Feedback

• Positive feedback for installation & setup experience

  Validation of alignment to customer expectations for honoring Hadoop requirements for container resource usage, distribution of binaries across the cluster, and security configuration

• Positive feedback for integration platform functionality

  No issues with stage runtime. Able to successfully exploit all (tested) stages within Hadoop.

  “this solution already seems GA”


Next steps

• If you are interested in learning more about the beta, or becoming part of it, please contact:

• Tony Curcio ([email protected])

• Beate Porst ([email protected])
