Has Traditional MDM Finally Met Its Match?
TRANSCRIPT
Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
Has Traditional MDM Finally Met Its Match?
Twitter Tag: #briefr
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
This Month: INTEGRATION & DATA FLOW
October: ANALYTIC PLATFORMS
November: DISCOVERY & VISUALIZATION
There’s a New Sheriff in Town!
Executive Summary
• Speed and power trump the old way
• Traditional MDM is officially archaic
• YARN is the new fabric of MDM
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
@robinbloor
RedPoint Global
• RedPoint Global is a data management and integrated marketing technology company
• RedPoint Data Management offers solutions designed for master data management (MDM), collaboration and architecture integration
• RedPoint Data Management for Hadoop is YARN-compliant and enables analysts to access and manipulate data directly within the Hadoop cluster
Guest: George Corugedo
George Corugedo is Chief Technology Officer & Co-Founder at RedPoint Global Inc. A mathematician and seasoned technology executive, George has over 20 years of business and technical expertise. As co-founder and CTO of RedPoint Global, George is responsible for leading the development of the RedPoint Convergent Marketing Platform™. A former math professor, George left academia to co-found Accenture’s Customer Insight Practice, which specialized in strategic data utilization, analytics and customer strategy. Previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions to enterprise commercial entities, and COO/CIO of Riscuity, a receivables management company specializing in the utilization of analytics to drive collections.
MDM for the Modern Data Architecture September 2014
© RedPoint Global Inc. 2014 Confidential
Purpose of MDM
Create correct and consistent data across the enterprise that earns trust in information and accelerates growth.
Vicious Cycle of Unmanaged Data
1. Master data issues remain unaddressed or unresolved
2. Garbage in/garbage out creates process confusion
3. Lack of process trust slows business momentum
4. Data conflicts reinforce siloed operations
© Hortonworks Inc. 2014
A Data Architecture Under Pressure
Broad Spectrum of Benefits Across Industries
Gartner’s Nexus of Forces Making Things Worse
Business Benefits of MDM
Types of Data in a Typical Organization
Challenges to Data Lake Approach
• Severe shortage of MapReduce-skilled resources
• Inconsistent skills lead to inconsistent results from code-based solutions
• Nascent technologies require multiple point solutions
• Technologies are not enterprise grade
• Some functionality may not be possible within these frameworks
Benefits of a Hadoop Data Lake
• Data is ingested in its raw state regardless of format, structure or lack of structure
• Raw data can be used and reused for differing purposes across the enterprise
• Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform
• Master data can be fed across the enterprise, and deep analytics on clean data is immediately enabled
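The first two benefits above (raw ingestion, then reuse for different purposes) amount to applying structure when data is read, not when it is stored. A minimal Python sketch of that idea follows. The in-memory `raw_store`, the format hints and the field names are hypothetical stand-ins for files in a data lake and are not part of any RedPoint or Hadoop API.

```python
import csv
import io
import json

# Raw payloads are kept untouched, tagged only with a format hint (an assumption
# made for this sketch; a real lake would hold files of many formats).
raw_store = [
    ("json", '{"customer": "C-1", "amount": 12.5}'),
    ("csv", "C-2,7.25"),
]


def read_orders(store):
    """Apply structure at read time; other consumers could parse the same raw data differently."""
    for fmt, payload in store:
        if fmt == "json":
            doc = json.loads(payload)
            yield doc["customer"], float(doc["amount"])
        elif fmt == "csv":
            customer, amount = next(csv.reader(io.StringIO(payload)))
            yield customer, float(amount)


orders = list(read_orders(raw_store))
total = sum(amount for _, amount in orders)
print(orders, total)
```

Because the raw payloads are never rewritten, a second consumer could define its own reader over the same store without disturbing this one.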
Big Data Can Become Big Information
• Ingestion of all data available from any source, format, cadence, structure or non-structure
• ELT and data transformation, refinement, cleansing, completion, validation and standardization
• Geospatial processing and geocoding
• Data profiling, lineage and metadata management
• Identity resolution, persistent keying and entity profile management
• Attribute source and consumer mapping
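As a rough illustration of the cleansing, validation and standardization step listed above, the following Python sketch normalizes one raw customer record. The field names (`name`, `email`, `phone`) and every rule in it are hypothetical assumptions chosen for illustration; this is not RedPoint's implementation.

```python
import re


def standardize_record(raw):
    """Return a cleaned copy of a raw record plus a list of validation issues.

    Field names and rules are illustrative assumptions only.
    """
    issues = []

    # Collapse repeated whitespace and use title case for names.
    name = " ".join(raw.get("name", "").split()).title()
    if not name:
        issues.append("missing name")

    # Lower-case the email and check it against a deliberately simple pattern.
    email = raw.get("email", "").strip().lower()
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        issues.append("invalid email")
        email = None

    # Keep digits only; accepting exactly 10 digits is a simplifying assumption.
    digits = re.sub(r"\D", "", raw.get("phone", ""))
    phone = digits if len(digits) == 10 else None
    if phone is None:
        issues.append("invalid phone")

    return {"name": name, "email": email, "phone": phone}, issues


record, problems = standardize_record(
    {"name": "  ada   LOVELACE ", "email": " Ada@Example.COM", "phone": "(555) 010-2000"}
)
print(record)
print(problems)
```

Returning the issues alongside the cleaned record, instead of raising an error, lets a pipeline keep processing and report data quality separately.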
Data Lake Architecture for MDM
Data Sources
• CRM
• ERP
• Billing
• Subscriber
• Product
• Network
• Weather
• Compete
• Manuf.
• Clickstream
• Online Chat
• Sensor Data
• Social Media
• Call Detail Records
• Fabrication Logs
• Sales Feedback
• Field Feedback
Key Functions for Master Data Management
ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs

Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis

Integration & Matching
• Grouping
• Fuzzy match

Master Key Management
• Create keys
• Track changes
• Maintain matches over time

Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats

Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
• Meta Data Management
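The fuzzy match and key management functions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `MasterKeyRegistry` class, the 0.85 threshold and the use of the standard library's `difflib.SequenceMatcher` are choices made for this example, not a description of how RedPoint matches records.

```python
from difflib import SequenceMatcher
from itertools import count


def similarity(a, b):
    """Ratio in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


class MasterKeyRegistry:
    """Assigns a stable key to each entity and reuses it for fuzzy matches."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold   # assumed cut-off; tune per data set
        self._names_by_key = {}      # master key -> name variants seen so far
        self._next_key = count(1)

    def resolve(self, name):
        # Compare against every known name: fine for a sketch, not for scale.
        best_key, best_score = None, 0.0
        for key, names in self._names_by_key.items():
            score = max(similarity(name, known) for known in names)
            if score > best_score:
                best_key, best_score = key, score
        if best_key is not None and best_score >= self.threshold:
            self._names_by_key[best_key].append(name)  # track the new variant
            return best_key
        new_key = next(self._next_key)
        self._names_by_key[new_key] = [name]
        return new_key


registry = MasterKeyRegistry()
names = ["Acme Corporation", "ACME Corporation ", "Acme Corporaton", "Globex Inc"]
keys = [registry.resolve(n) for n in names]
print(keys)
```

Keeping every variant under its key is what lets a match be maintained over time: a later spelling only has to resemble one earlier variant to resolve to the same master key.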
So How to Proceed?
Overview - What is Hadoop/Hadoop 2.0
Hadoop 1.0
• All operations based on MapReduce
• Intrinsic inconsistency of code-based solutions
• Highly skilled and expensive resources needed
• 3rd-party applications constrained by the need to generate code
Hadoop 2.0
• Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
• Mature applications can now operate directly on Hadoop
• Reduced skill requirements and increased consistency
RedPoint Data Management on Hadoop
[Diagram labels: Parallel Section (UI); Key / Split Analysis; Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; YARN; MapReduce]
Reference Hadoop Architecture
[Diagram: reference Hadoop architecture. Labels:
Source data: sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory
Data sources: DBs, JMS queues, files, RDBMS, EDW
Load: Sqoop, Flume, WebHDFS, NFS, Sqoop/Hive
Data refinement: MapReduce, Hive, Pig
Structure: HCatalog (metadata services)
Interactive: HiveServer2
Other labels: REST, HTTP, stream
Cluster: YARN, HDFS nodes 1 to n
Monitoring and management tools: Ambari
Query, visualization, reporting and analytical tools and apps]
RedPoint Functional Footprint
[Diagram: same component labels as the Reference Hadoop Architecture slide: source data, data sources (DBs, JMS queues, files, RDBMS, EDW), load (Sqoop, Flume, WebHDFS, NFS, Sqoop/Hive), data refinement (MapReduce, Hive, Pig), structure (HCatalog metadata services), interactive (HiveServer2), REST, HTTP, stream, YARN, HDFS nodes 1 to n, Ambari monitoring and management tools, and query, visualization, reporting and analytical tools and apps]
Benchmarks – Project Gutenberg

Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):

    public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
        private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|«»¡¢£¤¥¦©¬®¯±¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(WordOffset key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

Sample Pig script without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';

              MapReduce                       Pig                                                       RedPoint
Code          >150 lines of MR code           ~50 lines of script code                                  0 lines of code
Development   6 hours                         3 hours                                                   15 min.
Runtime       6 minutes                       15 minutes                                                3 minutes
Notes         Extensive optimization needed   User Defined Functions required prior to running script   No tuning or optimization required
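To make concrete what the benchmark job computes, here is a small in-memory Python sketch of the same word count that the Pig script expresses: tokenize, upper-case, group, count, and order by occurrences. It is not one of the benchmarked implementations and says nothing about their performance. As a simplification it splits on whitespace only, whereas the MapReduce sample uses a longer delimiter list; the `word_counts` function and the sample lines are hypothetical.

```python
from collections import Counter


def word_counts(lines):
    """Tokenize on whitespace, upper-case, count, and sort by count descending."""
    counts = Counter(
        token.upper()
        for line in lines
        for token in line.split()
    )
    # Sort by count (descending), then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = ["the quick brown fox", "The lazy dog", "the end"]
result = word_counts(sample)
for word, occurrences in result:
    print(occurrences, word)
```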
Perceptions & Questions
Analyst: Robin Bloor
What Can You Do With a Data Lake?
Robin Bloor, Ph.D.
The Story So Far…
The old Data Warehouse World (environment) is fast dying – giving way to a dystopian future dominated by alien and mutant data, carried by vast unruly data streams that flow rapidly into dank and murky data lakes. This is Hadoop World.
HOW DO WE MAKE SENSE OF THIS?
The Big Data Architecture
[Diagram: the Big Data architecture. Labels:
Sources: streams, IoT, log files, DaaS, mobile devices, desktops, servers, the cloud, social media, etc.
Filtering, replicating and routing
Data Refinery and Processing Hub: Data Reservoir (Hadoop), general data server(s), specialist data server(s), data preparation, data flow (optimize), local workloads, ETL and data virtualization
The Data Layer
The Application Layer: data streaming apps, transactional apps, BI apps and office apps, each with local data and a data mart
Events, data flow, data export]
Applications may use the Data Hub directly.
The Main Point to Note
This is WAY more complicated than the old Data Warehouse world
The Governance of Data
It’s all GOVERNANCE!!
[Diagram: Corporate Data Hub. Labels:
Hub components: Data Reservoir (Hadoop), general data server(s), specialist data server(s), ETL and data virtualization, local workloads, data extracts
Governance functions: data security, data life cycle management, MDM and business glossary, data cleansing, system management, metadata management, performance monitoring and management, data lineage, data mapping, metadata discovery, service level management]
The Evolution of Hadoop
• There were many components before YARN and Tez
• But YARN and Tez have changed the picture
• Hadoop is becoming the default scale-out file system and the OS for data flow
The Prognosis
The foundation is in place for a comprehensive Big Data Information Architecture…
But BUILDING such integrated systems will not be easy
• How does RedPoint see the role of Hadoop (ingest point, ETL engine, MDM work area, analytical sandbox, database, etc.)? Some of these? All of these?
• MDM implementations have often proved disappointing in the past. What makes RedPoint different, given that the data environment is more challenging than ever?
• Which companies or technologies do you see as competitive with RedPoint?
• Which verticals have shown the greatest interest in RedPoint?
• How does a RedPoint engagement normally pan out?
• If you are intent upon doing MDM, where is it best to start?
Upcoming Topics
www.insideanalysis.com
THANK YOU for your ATTENTION!
Opening slide image courtesy of Wikimedia Commons