Enterprise Hadoop is Here to Stay: Plan Your Evolution Strategy
DESCRIPTION
The Briefing Room with Neil Raden and Teradata
Live Webcast on August 19, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=1acd0b7ace309f765dc3196001d26a5e
Modern enterprises have been able to solve information management woes with the data warehouse, now a staple across the IT landscape that has evolved to a high level of sophistication and maturity with thousands of global implementations. Today’s modern enterprise has a similar challenge: big data and the fast evolution of the Hadoop ecosystem create plenty of new opportunities but also a significant number of operational pains as new solutions emerge.
Register for this episode of The Briefing Room to hear veteran Analyst Neil Raden as he explores the details and nature of Hadoop’s evolution. He’ll be briefed by Cesar Rojas of Teradata, who will share how Teradata solves some of the Hadoop operational challenges. He will also explain how the integration between Hadoop and the data warehouse can help organizations develop a more responsive and robust data management environment.
Visit InsideAnalysis.com for more information.
TRANSCRIPT
Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
Enterprise Hadoop is Here to Stay: Plan Your Evolution Strategy
Twitter Tag: #briefr
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
This Month: BIG DATA ECOSYSTEM
September: INTEGRATION & DATA FLOW
October: ANALYTIC PLATFORMS
Executive Summary
• Hadoop changes data management
• Not just storage, but analytics as well
• The EDW will deliver ‘Certified Data’
• Someone must take the lead!
Analyst: Neil Raden
Neil Raden is the founder and Principal Analyst at Hired Brains Research. He is the co-author, with James Taylor, of “Smart (Enough) Systems: How To Deliver Competitive Advantage by Automating Hidden Decisions.” With 30 years of experience, he is a widely published writer, well-known speaker, analyst and consultant, having personally designed and implemented dozens of large analytical applications in finance, marketing, distribution, logistics, actuarial, intelligence, scientific, statistical and consumer products. As an industry analyst, he has published over 40 white papers, hundreds of articles, blogs and research reports. He welcomes your comments and can be reached at [email protected].
Teradata
• Teradata is known for its analytics data solutions with a focus on integrated data warehousing, big data analytics and business applications
• It offers a broad suite of technology platforms and solutions and a wide range of data management applications
• The Teradata Unified Data Architecture includes the Integrated Big Data Platform and Appliance for Hadoop
Guest: Cesar Rojas
Cesar Rojas is a data management veteran with fifteen years of experience in Product Management and Product Marketing working with Global 2000 users. At Teradata Labs, Cesar leads product evangelization strategies for Hadoop enthusiasts, data scientists and business analysts. More specifically, Cesar is responsible for key components of the Teradata Portfolio for Hadoop including SQL and Hadoop integration, Hadoop manageability and Hadoop Appliances. Prior to joining Teradata, Cesar worked at large industry vendors as well as Silicon Valley software startups in the areas of Database Management, Business Intelligence, Complex Event Processing, IT Infrastructure as a Service, and Enterprise Applications. Cesar holds an MBA with an emphasis in eBusiness from Notre Dame de Namur University and a bachelor’s degree in Computer Engineering.
Cesar Rojas, Teradata [email protected]
ENTERPRISE HADOOP IS HERE TO STAY: PLAN YOUR EVOLUTION STRATEGY
11 Copyright Teradata
Enterprise Hadoop is not an Island
Data Warehouse and Hadoop
Data Warehouse | Hadoop
Characteristics
Use Cases
Characteristics
• High performance analytics and complex joins
• High concurrency
• SQL (ANSI and ACID compliant)
• Advanced workload mgmt.
• High Availability
• Data Governance
• Fine Grain Security
• Emerging Late Binding
• One-stop support
• Fast Data Landing and Refinement
• Processing Flexibility
• Emerging SQL/SQL-like interfaces
• Batch-oriented processing
• Low workload concurrency
• Multi-structured and file based data
• Late Binding
• Open Source Community
• Long-Term Raw Data Storage
• Low $/TB
• ETL
• Reporting
• Deep Analytics
Starting Small: Two Proven Hadoop Use Cases
Data Lake
• Single source of raw data
• Drag-the-Lake for new insights
• Co-location versus line of business data marts
ETL
• Transforms
• Data set creation
• Data manipulation
• ETL new data
The Data Lake
A “Data Lake” is a massive repository enabled by low cost technologies that improves the capture, refinement, and exploration of raw data within an enterprise.
• Single source of raw, historical, operational data
• Cost effectively explore data sets
  > Unknown, under-appreciated, or unrecognized value
• Consolidate data environments
  > Reduces costs and analytical discrepancies
• Co-location of files enables light, on-the-fly integration
[Diagram labels: IDW, Web Logs, Sensors, Mobile, Files]
Let’s just build a Data Lake!
“Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp.”
The Data Lake Promise and Challenge
• The Data Lake promise
  > Data Lake provides an enterprise accessible data management platform for analyzing data from multiple heterogeneous sources in its native format
  > No need for data modeling or transformations
  > Data is immediately available for analysis
• The Data Lake challenge
  > The data remains in “knowledge” silos in the Data Lake – end users need to understand how to reconcile the data across sources
  > Data quality is unknown - all data in the Data Lake is treated with equal data quality which can result in inconsistencies or errors
  > There is no systematic method of understanding what is in the Data Lake or what knowledge has already been determined
Without metadata, lineage and governance across the Data Lake, it quickly becomes just a file repository for individuals working in their own data domains
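As a rough illustration of the point about metadata and lineage, the sketch below registers each data set landed in the lake with a description, an owner, and its upstream sources, so an analyst can ask where a refined data set came from. The class names (`DatasetEntry`, `LakeCatalog`), the fields, and the sample paths are all hypothetical and are not part of any Teradata or Hadoop product.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Descriptive metadata for one data set in the lake (hypothetical fields)."""
    path: str
    description: str
    owner: str
    derived_from: list = field(default_factory=list)  # upstream data set paths


class LakeCatalog:
    """Tiny in-memory catalog; a real one would be persisted and governed."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse lineage that points at data sets nobody has described.
        missing = [p for p in entry.derived_from if p not in self._entries]
        if missing:
            raise ValueError(f"unknown upstream data sets: {missing}")
        self._entries[entry.path] = entry

    def lineage(self, path):
        """Return every upstream data set of `path`, nearest first."""
        seen, queue = [], list(self._entries[path].derived_from)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self._entries[current].derived_from)
        return seen


catalog = LakeCatalog()
catalog.register(DatasetEntry("/raw/weblogs", "Raw web server logs", "web team"))
catalog.register(DatasetEntry("/raw/sensors", "Raw sensor readings", "ops team"))
catalog.register(DatasetEntry(
    "/refined/sessions", "Sessionized web logs", "analytics",
    derived_from=["/raw/weblogs"]))
catalog.register(DatasetEntry(
    "/refined/session_health", "Sessions joined to sensor data", "analytics",
    derived_from=["/refined/sessions", "/raw/sensors"]))

print(catalog.lineage("/refined/session_health"))
```

Rejecting entries whose upstream data sets are unknown is one simple design choice for keeping the catalog from drifting into undocumented files; other governance policies are equally possible.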
ETL on Hadoop or ELT in Teradata
Where Hadoop will Shine      | Where Hadoop will be Challenged
CPU intensive calculations   | I/O intensive calculations
Scans of data                | Seeks of data
Complex logic                | Complex joins
Fast ingest                  | Service level agreements
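The four "shine" versus "challenged" pairs above can be read as a rough placement checklist. The sketch below is a toy illustration only: the trait names and the simple majority rule are assumptions made for the example, not a Teradata sizing method.

```python
# Trait names are hypothetical labels for the rows of the slide's comparison.
HADOOP_STRENGTHS = {"cpu_intensive", "full_scans", "complex_logic", "fast_ingest"}
HADOOP_CHALLENGES = {"io_intensive", "seeks", "complex_joins", "strict_sla"}


def suggest_platform(traits):
    """Return 'hadoop', 'warehouse', or 'either' for a set of job traits."""
    traits = set(traits)
    unknown = traits - HADOOP_STRENGTHS - HADOOP_CHALLENGES
    if unknown:
        raise ValueError(f"unrecognized traits: {sorted(unknown)}")
    favors_hadoop = len(traits & HADOOP_STRENGTHS)
    favors_warehouse = len(traits & HADOOP_CHALLENGES)
    if favors_hadoop > favors_warehouse:
        return "hadoop"
    if favors_warehouse > favors_hadoop:
        return "warehouse"
    return "either"


print(suggest_platform({"fast_ingest", "full_scans"}))
print(suggest_platform({"complex_joins", "strict_sla", "full_scans"}))
```

A real decision would also weigh data volume, cost, and existing skills; the counting rule here only mirrors the slide's two columns.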
TERADATA PORTFOLIO FOR HADOOP
• Teradata Open Distribution for Hadoop (TDH) 2.1
  > Core Hadoop: Hortonworks Data Platform 2.1 optimized for Teradata solutions
  > Value Added Teradata Components
• Flexible Hadoop Platforms
  > Teradata Appliance for Hadoop
  > Teradata Aster Big Analytics Appliance
  > Teradata Commodity Offering with Dell
  > Hortonworks Data Platform software-only support resell
• Complete consulting and training capability
  > Big Analytics Services - across the UDA
  > Data Integration Optimization - ETL, ELT across the UDA
  > Hadoop deployment and mentoring
  > Teradata delivering Hortonworks training
  > Hadoop Managed Services - operations and administration
• Customer Support for Hadoop
  > World-class Teradata customer support, backed by Hortonworks
Teradata Portfolio for Hadoop: “Fastest Path to Hadoop Production”
Fastest Path to Hadoop Production
• Easiest Hadoop to implement (100% out of the box)
• Pre-configured hardware, software and services to accelerate time to value
• Teradata is the first vendor to support Hortonworks 2.1 with an appliance
Deepest Hadoop Integration with Teradata Solutions
• In Data Access, Data Movement, Manageability, Supportability, and Serviceability
• 2.1 support of current Teradata tools: QueryGrid (SQL-H), TDCH, Viewpoint, TVI, and Teradata Studio – Smart Loader for Hadoop
• 2.1 support of TD tools now supporting Hadoop: Teradata Unity Ecosystem Manager
• TDH 2.1 on appliance mode delivers hardware/software that is fully integrated with other Teradata platforms
Enterprise-Ready Hadoop experience from a single vendor
• Updated portfolio of services for assessment, architecting and management
• Teradata enterprise support
TERADATA OPEN DISTRIBUTION FOR HADOOP 2.1 “Ready for Production”
• Teradata Open Distribution for Hadoop (TDH) 2.1
  > Enhanced Hadoop distribution from Hortonworks exclusively on the Teradata Appliance for Hadoop
  > Includes Teradata built components for enhanced availability and manageability, supporting all benefits of YARN
• Value
  > Enterprise-ready Hadoop with single vendor support
  > Best integration capabilities with Teradata
  > Accelerated time to value with integrated hardware/software that is engineered, staged, & delivered complete
  > Industry optimized and hardened Hadoop configurations
  > World-class enterprise service and support for an automated, low-touch model promoting lower TCO
Teradata Open Distribution for Hadoop 2.1 Overview
[Diagram: Hortonworks Data Platform 2.1 Components, with Teradata components]
• Governance & Integration - Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
• Data Access (YARN: Data Operating System) - Batch: Map Reduce; Script: Pig; SQL: Hive/Tez, HCatalog; NoSQL: HBase, Accumulo; Stream: Storm; Others: In-Memory Analytics, ISV engines
• Data Management - HDFS (Hadoop Distributed File System), nodes 1 to N
• Security - Authentication, Authorization, Accounting, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
• Operations - Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie
• Teradata components - Enterprise Access: QueryGrid, TDCH, Studio; Administration & Management: HCLI, HadoopBuilder; Monitoring: Viewpoint
What is the Teradata Appliance for Hadoop?
• Appliance Solution
  > Purpose-built integrated hardware / software solution
  > Optimized hardware for Hadoop, software, storage, and networking in a single rack
  > Delivered ready to run at a competitive price point
• Enterprise Ready
  > Integrated with Teradata Analytical Ecosystem to expand analytical capabilities
  > Support for major business intelligence, visualization, and ETL tools
  > Management tools for monitoring system health
• Data Staging
  > Loading, storing, and refining data in preparation for analytics
• Active Archiving
  > Powerful solution for Unified Data Architecture for data archiving
Teradata Appliance for Hadoop Highlights
Optimized hardware for Hadoop
BYNET™ V5 40GB/s InfiniBand interconnect
Teradata Vital Infrastructure
Teradata Open Distribution for Hadoop
NameNode Failover
Intelligent Start and Stop
Teradata Connector for Hadoop (TDCH)
Aster and Teradata QueryGrid
Teradata Studio with Smart Loader
Teradata Viewpoint
Value Added Software from Partners
Teradata Hadoop Enhancements: Simplifying Hadoop for Enterprise Readiness
• Installation
  > HadoopBuilder – Systems arrive out of the box ready to run
• Cluster Management (with Teradata Hadoop Tools)
  > Intelligent Start/Stop – All Hadoop services are coordinated to begin/end automatically
  > Single Drive Replace – Simplified the hardware procedure
  > Add/Replace Data node – Automated the process for bare node hardware setup
• Monitoring
  > Viewpoint – Single GUI-based view of all systems in UDA
  > TVI – alerts and service dispatches for proactive issue monitoring
• Availability
  > Easy NameNode Failover: JobTracker and NameNode high availability works out of the box
  > Full Master node HA
Hadoop + Viewpoint
• System management
  > Hadoop services
  > System health
  > Alert viewer
  > Node monitor
  > Space usage
  > Metrics analysis
  > Metrics graph
  > Capacity heatmap
Studio and Smart Loader for Hadoop
• Hadoop view
  > Browse Hadoop tables
  > Bi-directional table copying
    – Drag and drop interface
    – Maps data types between Hadoop and Teradata tables
• Benefits
  > Simplifies Hadoop browsing
  > Ad hoc data movement
  > No scripting required
  > Point and click
[Screenshot: Hadoop Table Properties]
Teradata Vital Infrastructure for Hadoop
• Enterprise class Hadoop support
  > Hadoop hardware and software
  > Proactive problem detection and fixes
    – Reliability, availability, manageability
• Virtualized server management
  > System monitoring
  > Cabinet Management Interface Controller (CMIC)
  > Service Work Station (SWS)
  > Automatically installed on base/first cabinet
62–70% of incidents fixed proactively
Teradata 15.0: Teradata QueryGrid™
[Diagram: Business users and Data Scientists; Teradata Systems (IDW: Teradata Database; Discovery: Teradata Aster Database); connected engines: Teradata Database; Teradata Aster Database (SQL, SQL-MR, SQL-GR); Other Databases (remote data); Languages (SAS, Perl, Python, R, Ruby, etc.); Hadoop (remote, push-down processing in Hadoop)]
When fully implemented, the Teradata Database or the Teradata Aster Database will be able to intelligently use the functionality and data of multiple heterogeneous processing engines
Teradata QueryGrid
• Built with Hortonworks
  > Donated to Apache
• Business user query with favorite BI tools
• Join Hadoop data to
  > Teradata Data Warehouse
  > Aster Discovery Platform
• Teradata 15.0
  > Bi-directional SQL
  > Push down filters to Hive
• Fast, secure, reliable
[Diagram labels: Teradata Systems; SQL-H; Data; Data Filtering; Hadoop Layer: HDFS, Pig, Hive, Hadoop MR, HCatalog]
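To make the "push down filters" idea concrete, the sketch below compares shipping every remote row and filtering locally against filtering at the remote side and shipping only the matches. It is a plain-Python simulation under stated assumptions: the in-memory `hive_rows` list stands in for a remote Hive table, and the function names are hypothetical. It does not use or represent the QueryGrid API.

```python
def fetch_then_filter(remote_rows, predicate):
    """Ship every remote row across the link, then filter locally."""
    transferred = list(remote_rows)
    return [row for row in transferred if predicate(row)], len(transferred)


def filter_then_fetch(remote_rows, predicate):
    """Apply the predicate at the remote side and ship only matching rows."""
    transferred = [row for row in remote_rows if predicate(row)]
    return transferred, len(transferred)


# Stand-in for a remote table: one row in ten belongs to the EMEA region.
hive_rows = [{"id": i, "region": "EMEA" if i % 10 == 0 else "AMER"}
             for i in range(1000)]


def is_emea(row):
    return row["region"] == "EMEA"


local_result, shipped_without = fetch_then_filter(hive_rows, is_emea)
pushed_result, shipped_with = filter_then_fetch(hive_rows, is_emea)

# Both strategies return the same rows; only the transfer volume differs.
print(shipped_without, shipped_with)  # prints: 1000 100
```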
Revelytix Acquisition: Data Governance and Metadata Management
• Revelytix provided data management and data preparation tools for data in Hadoop. Specifically, Loom is an open platform for discovering, profiling, preparing and tracking data lineage for data in Hadoop.
• Data governance, and specifically metadata management, is a key missing component in the Hadoop ecosystem.
• Understanding metadata in Hadoop is one of the biggest challenges and impediments to success with Hadoop today.
• Loom represents a unique value proposition, delivering integrated metadata, lineage and data wrangling in a single software solution.
Teradata Customers: Successful Evolution Strategies
• Keep it simple and start small
• Focus on proven use cases
• Vendor considerations: a vendor should
  > Help you accelerate time to market
  > Integrate data easily with other IT platforms
  > Provide easy monitoring and manageability
  > Have sophisticated data management capabilities
  > Provide robust services and support
No!
Q&A
BACKUP
Why Teradata Appliance for Hadoop?
Building a Hadoop Cluster
• Multiple vendors
• DIY set up, install
• DIY SW/HW updates
• Integration–test–deploy
• Multiple consoles
Teradata
• Easy 1 vendor acquisition
• Quick set up, Plug ‘n’ play
• Eliminate integration complexity
• Single pane of glass management
Data Lake Challenges
• Lack of semantic consistency and governed metadata
  > Assumes that audiences are highly skilled at data manipulation and analysis
  > Without governance, the lake will end up being a collection of disconnected data silos all in one place
• Risks
  > Inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake
• Security and access control
• Performance
  > Tools and data interfaces simply cannot perform at the same level against a general-purpose store
Source: “The Data Lake Fallacy: All Water and Little Substance”, Gartner
Telematics in Insurance: Geospatial analytics for better risk management
Situation
• Insurer needs accurate risk scores to adjust premiums for corporate auto fleets
• Data collected: vehicle data, driver behavior, GPS, weather, traffic
• Current custom application limits scoring effectiveness
Problem
• Limited storage capacity/infrastructure for huge volumes of real time data
• No ad-hoc reporting or analytic systems
Solution
• Teradata Appliance for Hadoop to ingest telematics data
• Combine with other data sources to perform risk analysis
Impact
• Quickly analyze data plus ad hoc reporting
• Streamlined process to calculate vehicle and fleet scores
• Cost effectively quantify, adjust and manage risk premiums
Telematics Use Case Data Architecture
• Telematics Service Provider (TSP) streaming and transforming
• Apache Hive for ad-hoc querying and reporting
[Diagram labels: Streaming TSP data (sources, formats); Apache Storm; Sessionize; Standard Format; VIN data; Trip files; Enhanced GPS; Vehicle accelerometer data; Vehicle scores]
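The architecture names a "Sessionize" step but the slide does not say how trips are delimited. One common approach, assumed here purely for illustration, is to start a new trip whenever the gap between consecutive events exceeds a threshold. The sketch below is plain Python rather than a Storm topology, and the 300-second gap is an arbitrary assumption.

```python
def sessionize(timestamps, max_gap_seconds=300):
    """Group event timestamps (in seconds) into trips split at long gaps."""
    trips = []
    for ts in sorted(timestamps):
        if trips and ts - trips[-1][-1] <= max_gap_seconds:
            trips[-1].append(ts)   # close enough to the previous event: same trip
        else:
            trips.append([ts])     # long gap (or first event): start a new trip
    return trips


# Hypothetical GPS ping times for one vehicle, in seconds since midnight.
pings = [0, 30, 60, 1000, 1030, 5000]
print(sessionize(pings))  # prints: [[0, 30, 60], [1000, 1030], [5000]]
```

In a streaming setting the same rule would be applied per vehicle as events arrive, with each completed trip emitted downstream for scoring.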
Perceptions & Questions
Analyst: Neil Raden
Hired Brains is an independent firm providing research and advisory services and direct-to-client consulting for 25 years
Neil Raden, CEO and Founder, Hired Brains Research
[email protected]
Twitter: @neilraden
Blog: http://hiredbrains.wordpress.com
http://www.linkedin.com/in/neilraden
Copyright © 2010-2012 Hired Brains Inc. 41
[Timeline, 1950 to 2020: Batch Reporting; CICS/OLTP; C/S OLTP; Y2K/ERP; 4GL/PC/SS; DW/BI; Big Data; Hybrid; Convergence: End of managing from scarcity]
Big Is Relative: This Pace Isn’t New, Just the Magnitude
Though Volume is interesting, it isn’t what distinguishes Big Data
Data Doesn’t Speak for Itself
• Data is only a proxy for reality, footprints left behind
• Its meaning has to be understood
• This is the problem DW set out to solve
• Data integration provides meaning and context to data
• Data integration technologies need to be faster, but they aren’t yet
• Hadoop is far behind; its analytics can only be directionally correct
• DW schema-on-write is still needed for those analyses that require precision
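The contrast between precise and "directionally correct" results can be sketched with a small example. Under schema-on-write, every record is validated when it is loaded, so a malformed record stops the load. Under schema-on-read, raw text is stored as-is and the schema is applied at query time, here by skipping rows that do not parse. The record layout and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical raw records: "date,amount". The second one is malformed.
RAW_RECORDS = ["2014-08-19,100.50", "2014-08-20,abc", "2014-08-21,75.25"]


def parse(record):
    """Apply the schema to one raw record; raises ValueError if it does not fit."""
    day, amount = record.split(",")
    return {"day": day, "amount": float(amount)}


def load_schema_on_write(records):
    """Validate every record at load time; any bad record fails the whole load."""
    return [parse(record) for record in records]


def total_schema_on_read(records):
    """Interpret raw records at query time, skipping rows that do not parse."""
    total, skipped = 0.0, 0
    for record in records:
        try:
            total += parse(record)["amount"]
        except ValueError:
            skipped += 1
    return total, skipped


print(total_schema_on_read(RAW_RECORDS))  # prints: (175.75, 1)

try:
    load_schema_on_write(RAW_RECORDS)
except ValueError as error:
    print("load rejected:", error)
```

The schema-on-read total is usable for exploration but silently omits one row, which is the sense in which such answers are directional rather than exact.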
EDW vs EDH/Data Lake: Very shaky analogy
[Images labeled: EDW; EDH/Data Lake]
Is Hadoop the New Data Warehouse?
• SQL-on-Hadoop does not imply Hadoop replaces the data warehouse
• Rather, it means people with SQL skills can access “black data” in Hadoop
• The two are somewhat mutually exclusive
• If you think of both the DW and Hadoop as just data, you will arrive at the wrong conclusion
• DW is a controlled process that deals with data semantics
• The fixed schema, described as the DW’s greatest drawback, is its greatest strength
• Its cost and inflexibility vis-à-vis Hadoop is its weakness
• That is changing
A Hybrid Architecture
[Diagram columns: ALL DATA (Structured Data, Multi-Structured Data, Non-Relational Data, OLTP DBMS’s); DISCOVERY ANALYTICS (Discovery Platform: SQL, MapReduce, Statistical Functions; Iterative Analysis); USERS (Data Scientist, Data Analyst)]
• Doesn’t require extensive modeling
• Doesn’t balance the books
• Data completeness can be good enough
• No stringent SLAs
Behavioral Analytics
• Customer
• Product
• Machine
• Supply chain
Five Things to Remember
• We see in “spending intention” surveys that enterprises are already thinking Hadoop alone is a replacement for data warehouses
• Existing database and ETL vendors need to rapidly innovate their offerings to avoid decline
• Hadoop, even with its hacker roots, is also rapidly innovating
• But enterprise-ready Hadoop needs crucial fundamentals it currently lacks for the enterprise:
  - Security
  - Dynamic Workload Management (a term some Hadoop vendors use to disparage the DW)
  - Comparable failure/recovery features
  - Concurrency
  - Latency
• The open source community is actively developing these features
Neil Raden, Founder, Hired Brains Research
Twitter: NeilRaden
Blog: http://hiredbrains.wordpress.com
Website: http://www.hiredbrains.com
Mail: [email protected]
LinkedIn: http://www.linkedin.com/in/neilraden
Upcoming Topics
www.insideanalysis.com
THANK YOU for your ATTENTION!
Opening slide image courtesy of Wikimedia Commons