information server v11 - datastagedsxchange.net/uploads/dsxchange_-_is_on_hadoop_beta_update.pdf ·...
© 2015 IBM Corporation
Luncheon Webinar Series
June 11th, 2015
InfoSphere Information Server on Hadoop!
Presented by Tony Curcio, IBM
Sponsored By:
InfoSphere Information Server
• Questions and suggestions regarding presentation topics? Send to [email protected]
• Downloading the presentation
– http://www.dsxchange.net/IISonHadoop.html
– Replay will be available within one day; watch for an email with details
• Pricing and configuration: send to [email protected], subject line: Pricing
• Bonus Offer – Free premium membership for your DataStage management! Submit your management’s email address and we will offer them access on your behalf.
– Email [email protected], subject line “Managers special”.
• Join us all at LinkedIn http://tinyurl.com/DSXmembers
© 2014 IBM Corporation
Information Server on Hadoop beta update
06/11/2015
Tony Curcio
IBM Product Director
First Poll Question
What version of Information Server/DataStage are you using (choose your most current version in production)?
• 11.3.x
• 9.1.x
• 8.7
• 8.5
• umm… one of the more classic versions
Second Poll Question
What is the status of your hadoop usage?
• In production with hadoop already.
• Plan to go live with hadoop in production this year
• Still learning about it
• That's the yellow elephant thing, right?
Big Data Integration Is Critical For Success With Hadoop
“By most accounts 80% of the development effort in a big data project goes into data integration …and only 20% goes towards data analysis.”
Extract, Transform, and Load Big Data With Apache Hadoop - White Paper
https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
Most Hadoop initiatives involve collecting, moving, transforming, cleansing, integrating, exploring, and analysing volumes of disparate data sources and types.
Customers are telling us: “Data Integration is their #1 requirement for Hadoop”
“By most accounts, 80% of the development effort in a big data project goes into data integration and only 20% goes toward data analysis.” -- Intel Corporation (Extract, Transform, and Load Big Data With Apache Hadoop Whitepaper)
Customer set up a massive data integration engine, consisting of 640 BigMatch cores and 224 Information Server/DataStage cores, to address Big Data requirements.
Large Bank is soliciting proposals for a 1 petabyte Data Refinery with data volume growth of 5x over 3-4 years. Data Integration is the only use case for the Data Refinery.
Hadoop is not a data integration solution -- Gartner Research
“Right now data integration is our only use case for Hadoop.”
Our #1 use case for Hadoop is offloading Teradata ELT.
Smart customers know their Big Data Projects will fail without Big Data Integration
IBM Information Server – Leadership in Big Data Integration
• Optimized load; benchmark >5TB/hr
• 1st Hadoop Pushdown Optimization
• 1st Big data Discovery, Profiling, Transformation & governance of data
• Optimized load benchmark achieving >15TB/hr
• Inclusion of a fully supported hadoop instance
• Self-service data integration for Hadoop (i.e. Data Click)
• Governance for Hadoop
• Kerberos-enabled access to HDFS data
• Native Hadoop runtime for Data Integration, Quality and Governance
• 1st to provide native high speed Hadoop connectivity for BI, Cloudera, HortonWorks
Supports all 7 dimensions of a great data integration platform
• Productivity - Rich user interface features simplify the design process and metadata management requirements.
• Transformation - Extensive set of pre-built objects that act on data to satisfy both simple & complex data integration tasks.
• Connectivity - Native access to common industry databases and applications, exploiting key features of each.
• Performance - Runtime engine providing unlimited scalability through all objects tasks in batch/real-time, ETL/ELT/DV/SOA.
• Operations - Simple management of the operational environment, lending analytics for understanding and investigation.
• Administration - Intuitive and robust features for install, maintenance, configuration, security and resiliency.
• Governance - Maximizes business & IT collaboration, providing business terms, policies, advanced impact analysis, search, comparison & more.
DataStage Parallelism
[Figure: GBC_Supp_step1 flow shown with record ranges Records 1-100, Records 101-200, Records 201-300, Records 301-400, Records 401-500, Records 501-600, …]
1 Pipeline – Data moves between stages (boxes on flow) like an assembly line process. Each box has its own responsibilities for data.
2 Partition – Ability to fan out data to multiple streams based on some key (hash on ARR-ID in this case) allows like data to be processed in order on that stream.
3 Instance – Multiple DataStage instances can be run at the same time, either on the same server (if there are sufficient resources) or in separate physical servers.
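The pipeline and partition ideas on this slide can be illustrated with a small Python sketch. This is only an analogy under stated assumptions, not DataStage code: the stage functions `cleanse` and `enrich`, the `name` and `seq` fields, and the choice of CRC32 as the hash are all hypothetical. Only the ARR-ID key comes from the slide. The generators run in a single thread, so the sketch shows the record-at-a-time hand-off between stages rather than true concurrent execution.

```python
import zlib
from collections import defaultdict


def partition_by_key(records, key, num_partitions):
    """Fan records out to partitions by hashing `key` (partition parallelism).

    Records sharing a key value always land in the same partition, and their
    relative order is preserved inside that partition.
    """
    partitions = defaultdict(list)
    for record in records:
        # crc32 is stable across runs, unlike Python's salted hash() for str.
        index = zlib.crc32(str(record[key]).encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return dict(partitions)


def cleanse(records):
    """Hypothetical stage 1: trim and upper-case the name field."""
    for record in records:
        yield {**record, "name": record["name"].strip().upper()}


def enrich(records):
    """Hypothetical stage 2: add a derived column."""
    for record in records:
        yield {**record, "name_length": len(record["name"])}


def run_flow(records):
    """Chain the stages; generators hand over one record at a time (pipeline)."""
    return list(enrich(cleanse(records)))


# Made-up sample data; "seq" records the arrival order.
records = [
    {"ARR-ID": "A1", "seq": 1, "name": " alice "},
    {"ARR-ID": "B7", "seq": 2, "name": "bob"},
    {"ARR-ID": "A1", "seq": 3, "name": " carol"},
    {"ARR-ID": "C3", "seq": 4, "name": "dave "},
    {"ARR-ID": "B7", "seq": 5, "name": "erin"},
]

partitions = partition_by_key(records, "ARR-ID", num_partitions=3)
results = {index: run_flow(part) for index, part in partitions.items()}

for index in sorted(results):
    print(index, [(r["ARR-ID"], r["seq"], r["name"]) for r in results[index]])
```

Because the partition index depends only on the key, every record for a given ARR-ID goes through the same stream, which is what lets like data be processed in order.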
Third Poll Question
How would you characterize your requirements for data integration and cleansing in hadoop?
• DI/DQ is the primary requirement for us using hadoop.
• DI/DQ are supporting our modern data warehouse/analytics in hadoop
• We need some DI to get data into hadoop, but that's basically it
• No plans to embrace the elephant yet
INFORMATION SERVER FOR HADOOP
1
What if you could … ?
… integrate, transform, cleanse and govern your data natively within your Hadoop cluster?
… design your integration and cleansing workflows through an easy drag and drop UI?
… run it anywhere – inside or outside your cluster
… have a fully governed environment alongside
And … what if it would run faster and with less latency than a comparable Map/Reduce routine?
You think that’s impossible?
Benefits of Information Server for Hadoop
• Superfast data ingest and processing: Integrate, prepare and enrich data with speed and confidence
– running natively on Hadoop with speeds upwards of 15x faster than MapReduce
– no other vendor has this scale or speed
• Complete confidence in your data: understand what data is available and where it came from
– monitor and cleanse quality of data; catalog metadata assets and trace lineage
– fully integrated quality and governance
• Higher Level of Productivity: develop integration processes much faster than hand coding, based on existing enterprise skills
– graphical data flow development environment with 100s of prebuilt stages and 1000s of prebuilt functions
– proven development paradigm
InfoSphere Information Server for Hadoop
• The most scalable Transformation and Data Integration and Quality Engine now runs natively on Hadoop
• Get enterprise-class transformation and cleansing for your Hadoop data
• Use the power of your Hadoop Cluster to integrate, transform & cleanse data without writing a single line of code
• Easily govern all your transformation and data quality processes
Why Information Server on Hadoop
• More customers choosing HDFS/GPFS for data landing zone, archive, and analytics processing
– HDFS / GPFS are cheap, scalable, and fault tolerant
• DataStage offers large advantages over Hadoop data processing tools
– Graphically build data flows with little or no programming knowledge
– No hand coding, improve developer productivity
– Existing jobs can be moved directly over to run on Hadoop
• Reduce infrastructure, one cluster many applications
– DataStage coexists with other Hadoop applications managed by YARN
• Performance and Scalability
– DataStage scales linearly as nodes are added to cluster
– Run DataStage jobs more efficiently local to HDFS / GPFS data
Information Server is many times faster than M/R
[Figure: Producer Operator → Record Pipeline → Consumer Operator]
Map-Reduce: Task-level Fault Tolerance
Map-Reduce is designed for task-level fault tolerance: The downside of this is significantly lower performance.
• Producer (Map) writes intermediate results to disk
• Consumer (Reduce) “pulls” the data
• This design provides task-level fault tolerance
• This design lowers overall performance and efficiency by orders of magnitude*
InfoSphere Information Server: Job-level Fault Tolerance
InfoSphere Information Server is designed to exploit data pipelining to minimize I/O and maximize overall performance.
• Data is streamed from producer to consumer with data repartitioning without landing to disk
• Intermediate data is not written to disk
• Flow control prevents the producer from overrunning the consumer
• Pipelined and partitioned parallelism provides efficiency and high performance
*See: Themis: An I/O-Efficient MapReduce http://themis.sysnet.ucsd.edu/papers/themis_socc12.pdf
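The streaming and flow-control behaviour described on this slide can be sketched in Python with a bounded in-memory queue. This is an in-process analogy only, not how the Information Server engine is implemented: the names `producer`, `consumer`, `run_pipelined` and `buffer_size` are hypothetical, and doubling each record stands in for real transformation work. The point it illustrates is that records pass from producer to consumer without being written to disk, and that a full buffer makes the producer wait.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the record stream


def producer(out_queue, records):
    """Push records downstream; put() blocks while the buffer is full."""
    for record in records:
        out_queue.put(record)  # flow control: waits for the consumer to catch up
    out_queue.put(SENTINEL)


def consumer(in_queue, results):
    """Process each record as soon as it arrives, without landing it on disk."""
    while True:
        record = in_queue.get()
        if record is SENTINEL:
            break
        results.append(record * 2)  # stand-in for real transformation work


def run_pipelined(records, buffer_size=4):
    """Run producer and consumer concurrently over a bounded in-memory pipe."""
    pipe = queue.Queue(maxsize=buffer_size)
    results = []
    threads = [
        threading.Thread(target=producer, args=(pipe, records)),
        threading.Thread(target=consumer, args=(pipe, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


output = run_pipelined(range(10))
print(output)
```

A Map-Reduce style hand-off would instead have the producer finish writing its intermediate results to disk before the consumer pulls them, which buys the ability to rerun an individual task at the cost of the extra I/O.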
DataStage Parallelism – Fully supported inside Hadoop
[Figure: repeat of the DataStage Parallelism slide (1 Pipeline, 2 Partition, 3 Instance)]
Information Server for Hadoop – Runtime Architecture Overview
[Figure: runtime architecture diagram. Labels recoverable from the diagram:
• IS Engine Tier Node / Hadoop Edge Node (or Full Node): Conductor, IS YARN Client
• Hadoop Cluster: NameNode, multiple DataNodes
• IS AM (shown with DataNodes)
• YARN Container: Section Leader with Player 1, Player 2 … Player N
• /opt/IBM/InformationServer (shown repeatedly across the nodes)]
Fourth Poll Question
Considering using tools for your hadoop implementation, which factor appeals most to you?
• Scaling DI/DQ workloads to any level
• No need to handcode mapreduce to transform/cleanse data
• Can use industry proven solutions for DI/DQ in this strategic evolution
• Can't I just get in my TARDIS and go back to 2009?
EARLY RELEASE PROGRAM UPDATE
2
Early Release Program Details
• Program Start: October 25, 2014
• Program End: September 30, 2015
• Supported Hadoop Distributions:
– Phase 1: Cloudera 5.1+ & HortonWorks 2.1+
– Phase 2: BigInsights 4
• Supported Information Server environments
– V11.3.1 on Linux OS
– Hadoop “Beta” patch
ERP Objectives
• Obtain feedback on installation & setup of Information Server on a Hadoop grid environment
• Obtain feedback on running DataStage, QualityStage or Information Analyzer within a YARN managed grid environment.
• Users should be able to run any existing job unmodified (except for path changes)
• Obtain feedback on performance & scalability running workloads inside a Hadoop 2.0 cluster
New Features in Beta 2 Build Available 3/9/15
• Includes several customer requested enhancements, including:
o Kerberos enabled clusters are now supported
o Full Edge/Client node support for Engine Tier install
o NFS mount requirement removed, IS binaries may be copied to data nodes
o Option to automatically distribute IS binaries if they aren’t detected
o Data locality support for BDFS file reads
o Container size estimation
Current Beta Participation
• Nearly 20 Participants
– Includes both customers and business partners
• Distribution in use:
– All supported Hadoop distributions
• Product Coverage
– DataStage, QualityStage and Information Analyzer
Feedback
• Positive feedback for installation & setup experience
– Validation of alignment to customer expectations for honoring hadoop requirements for container resource usage, distribution of binaries across the cluster, and security configuration
• Positive feedback for integration platform functionality
– No issues with stage runtime. Able to successfully exploit all (tested) stages within Hadoop.
“this solution already seems GA”
Next steps
• If you are interested in learning more about the beta, or
becoming part of it, please contact:
• Tony Curcio ([email protected])
• Beate Porst ([email protected])