How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics
Strata Conference, Sept 22nd 2011, New York, NY
Dr. Amr Awadallah, Founder, CTO, VP of Engineering, aaa@cloudera.com, twitter: @awadallah
Business Intelligence Before Adopting Apache Hadoop
Copyright © 2011, Cloudera, Inc. All Rights Reserved.
Storage Only Grid (original raw data)
Instrumentation
Collection
RDBMS (processed data)
BI Reports + Interactive Apps
Mostly Append
ETL Compute Grid
Moving Data To Compute Doesn’t Scale
Can’t Explore Original High Fidelity Raw Data
Archiving = Premature Data Death
Business Intelligence After Adopting Apache Hadoop
Hadoop: Storage + Compute Grid
Instrumentation
Collection
RDBMS
BI Reports + Interactive Apps
Complex Data Processing
Mostly Append
Data Exploration & Advanced Analytics
ETL and Aggregations
Keep Data Alive Forever
So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Core Hadoop has two main components:
  – Hadoop Distributed File System: self-healing high-bandwidth clustered storage
  – MapReduce: fault-tolerant distributed processing
• Key business values:
  – Flexible: Store any data, Run any analysis (Mine First, Govern Later)
  – Scalable: Start at 1TB/3-nodes then grow to petabytes/thousands of nodes
  – Affordable: Cost per TB at a fraction of traditional options
  – Open Source: No Lock-In, Rich Ecosystem, Large developer community
  – Broadly adopted: A large and active ecosystem, Proven to run at scale
The Main Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before data is loaded
• Explicit load operation has to take place which transforms data to database internal structure
• New columns must be added explicitly before data for such columns can be loaded into the database
Benefits:
• Read is Fast
• Standards/Governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no special transformation is needed
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them
Benefits:
• Load is Fast
• Flexibility/Agility
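The schema-on-read idea can be illustrated with a minimal Python sketch. It is not Hive or any Hadoop API: `raw_store`, `load`, `serde_v1`, `serde_v2` and the JSON record layout are all hypothetical names chosen for illustration. Raw lines are stored untouched, and a deserializer is applied only when the data is read, so swapping in an updated deserializer exposes a new column over data that was already loaded.

```python
import json

# Hypothetical in-memory stand-in for the file store: raw lines, kept as they arrive.
raw_store = []

def load(line: str) -> None:
    """Schema-on-read 'load': append the raw line with no parsing or transformation."""
    raw_store.append(line)

def serde_v1(line: str) -> dict:
    """Hypothetical deserializer that extracts two columns at read time."""
    record = json.loads(line)
    return {"user": record.get("user"), "url": record.get("url")}

def serde_v2(line: str) -> dict:
    """Updated deserializer that also exposes a 'referrer' column."""
    record = json.loads(line)
    return {
        "user": record.get("user"),
        "url": record.get("url"),
        "referrer": record.get("referrer"),
    }

def read(serde) -> list:
    """Apply the chosen deserializer to every stored line at read time."""
    return [serde(line) for line in raw_store]

# New fields can start flowing before any reader knows about them.
load('{"user": "a", "url": "/home"}')
load('{"user": "b", "url": "/cart", "referrer": "/home"}')

rows_v1 = read(serde_v1)  # 'referrer' is stored but not yet visible
rows_v2 = read(serde_v2)  # after the deserializer update it appears retroactively
```

The trade-off matches the slide: `load` does almost no work, while every `read` pays the parsing cost that a schema-on-write system would have paid once at load time.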
What is Complex Data Processing?
1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility.
3. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
4. Hive: A SQL interpreter out of Facebook, also includes a meta-store mapping files to their schemas and associated SerDe.
5. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.
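As a rough illustration of the streaming style in option 2, the sketch below simulates a word count locally in Python. It does not call Hadoop: `mapper`, `reducer` and `run_local` are hypothetical names, and the `sorted` call stands in for the shuffle-and-sort step that the framework performs between map and reduce. Records are modeled as tab-separated `key\tvalue` text lines, which is how a streaming job would exchange them over standard input and output.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_pairs):
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (pair.split("\t", 1) for pair in sorted_pairs)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

def run_local(lines):
    """Local stand-in for the pipeline: map, sort (the shuffle), then reduce."""
    return list(reducer(sorted(mapper(lines))))

result = run_local(["Hadoop stores data", "Hadoop processes data"])
```

Because the mapper and reducer only read and write text lines, the same logic could in principle be written in any language, which is the flexibility the slide attributes to streaming.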
What This Means For You: Agility
Up Front Design vs. Just in Time
What This Means For You: Innovation
Data Committee vs. Data Scientist
What This Means For You: Consolidation
Silos vs. Sharing
What This Means For You: Extract Value from Latent Data
Archive to Tape vs. Keep Data Alive
Benefit #2: Scalability
What This Means For You: Ability to Grow Fluidly
What This Means For You: Data Beats Algorithm
Smarter Algos vs. More Data
Where Does Hadoop Fit in the Enterprise Data Stack?
[Diagram: Hadoop in the enterprise data stack]
• Data sources: Logs, Files, Web Data
• Connected systems: Enterprise Data Warehouse, Relational Databases, Low-Latency Serving Systems, Web Application
• Tools: IDEs, Development Tools, ETL Tools, Business Intelligence Tools, Enterprise Reporting, BI/Analytics, Cloudera Mgmt Suite
• People: Data Scientists, Data Architects, System Operators, Analysts, Business Users, Customers
Use The Right Tool For The Right Job
Relational Databases:
Use when:
• Interactive OLAP Analytics (<1sec)
• Multistep ACID Transactions
• 100% SQL Compliance
Hadoop:
Use when:
• Structured or Not (Agility)
• Scalability of Storage/Compute
• Complex Data Processing
Two Core Use Cases Common Across Many Industries
Industry        Use Case: Advanced Analytics   Use Case: Data Processing
Web             Social Network Analysis        Clickstream Sessionization
Media           Content Optimization           Clickstream Sessionization
Telco           Network Analytics              Mediation
Retail          Loyalty & Promotions           Data Factory
Financial       Fraud Analysis                 Trade Reconciliation
Federal         Entity Analysis                SIGINT
Bioinformatics  Genome Mapping                 Sequencing Analysis
Manufacturing   Product Quality                Mfg Process Tracking
CDH: Cloudera’s Distribution Including Apache Hadoop
• Open Source: 100% Apache licensed, 100% Open Source, 100% Free.
• Enterprise Ready: Predictable releases, Documentation, Hotfix Patches, Intensive QA
• Integrated: All required component versions & dependencies are managed for you
• Industry Standard: Existing RDBMS, ETL and BI systems work best with it
• Many Form Factors: Public Cloud, Private Cloud, Ubuntu, RHEL, 32/64bit, etc
CDH components by function:
• Coordination: ZOOKEEPER
• Data Integration: FLUME, SQOOP, ODBC
• Fast Read/Write Access: HBASE
• Languages / Compilers: PIG, HIVE
• Workflow: OOZIE
• Scheduling: OOZIE
• Metadata: HIVE
• UI Framework: HUE
• SDK: HUE SDK
SCM Express: Simplifies Installation and Configuration
Service & Configuration Manager (SCM) Express takes the complexity out of deploying and configuring CDH.
Provision a complete Hadoop stack in minutes
Centrally manage system services through a user-‐friendly interface
Manages services for up to 50 nodes
FREE to download
KEY FEATURES
1. Automated, wizard-based installation of the complete Hadoop stack
2. Central, real-time dashboard for configuration management
3. Ability to configure the cluster while it’s running
4. Incorporates comprehensive validation and error checking
5. Automates the expansion of services to new nodes when they come online
What is Cloudera Enterprise?
Simplify and Accelerate Hadoop Deployment
Reduce Adoption Costs and Risks
Lower the Cost of Administration
Increase the Transparency & Control of Hadoop
Leverage the Experience of Our Experts
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy
EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production
CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera Enterprise
Hadoop World 2011
The largest gathering of Hadoop practitioners, developers, business executives, industry luminaries and innovative companies in the Hadoop ecosystem.
• 1400 attendees, 25+ sponsors
• 60 sessions across 5 tracks for:
– Business Decision Makers
– Enterprise Architects
– IT Operators
– Data Scientists
– Developers
• Cloudera Training and Certification (November 7, 10, 11)
November 8-9, Sheraton New York Hotel & Towers, NYC
Learn more and register at www.hadoopworld.com
$50 discount for Strata attendees
What I Would Like You To Remember:
• The Key Benefits of the Apache Hadoop Data Platform:
  – Agility/Flexibility (Enables Innovation/Exploration).
  – Complex Data Processing (Any Language, Any Problem).
  – Scalability of Storage/Compute (Freedom to Grow).
  – Economical Active Archive (Keep All Your Data Alive).
• Cloudera Enterprise enables you to:
  – Lower the Cost of Management and Administration.
  – Simplify and Accelerate Hadoop Deployment.
  – Increase the Transparency & Control of Hadoop.
  – Rely on Firm SLAs on Issue Resolution.
Contact Information:
Amr Awadallah
aaa@cloudera.com
650-644-3921
http://twitter.com/awadallah
Appendix
Hadoop Timeline
Timeline axis: 2002 to 2009
• Doug Cutting & Mike Cafarella started working on Nutch
• Google publishes GFS & MapReduce papers
• Doug Cutting adds DFS & MapReduce support to Nutch
• Yahoo! hires Cutting, Hadoop spins out of Nutch
• Facebook launches Hive: SQL Support for Hadoop
• Fastest sort of a TB, 3.5 mins over 910 nodes
• NY Times converts 4TB of image archives over 100 EC2s
• Fastest sort of a TB, 62 secs over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes
• Hadoop Summit 2009, 750 attendees
• Cloudera Founded
• Doug Cutting joins Cloudera
Cloudera’s Track Record
• Customers: Multiple customers with >1,000 Hadoop nodes under management
  – Supporting dozens of diverse production use cases including ones that are revenue critical with tight SLAs
• Community: years of demonstrated leadership in the Apache Hadoop ecosystem. Cloudera employees are:
  – The largest contributor to the Hadoop ecosystem in patches
  – Founders of 70% of the projects in the Apache Hadoop ecosystem including Apache Hadoop itself
  – The first to build & integrate what is now the reference Hadoop stack
• Industry: Multiple years of experience providing Hadoop solutions across industries:
  – 2 of the top 5 payments companies run Cloudera
  – 3 of the top 5 commercial banks run Cloudera
  – 2 of the top 4 online travel companies run Cloudera
Cloudera Enterprise Management Suite
Utility: Activity Monitor
• It Helps You…: Consolidate all user activities into a real-time view; Diagnose user performance; Track activity metrics
• So You Can…: Improve performance; Improve conformance to SLAs; Improve QOS
• It’s Like…: MySQL Enterprise Monitor; Quest Foglight for Oracle / SQL Server
Utility: Service & Configuration Manager
• It Helps You…: Manage system services; Automate changes; Validate settings; 1-click security
• So You Can…: Lower cost of administration; Improve uptime
• It’s Like…: Red Hat Satellite Server; Microsoft System Center; Oracle Enterprise Manager
Utility: Resource Manager
• It Helps You…: Report on the usage of scarce resources; Plan for capacity expansion
• So You Can…: Improve quality of service; Extend the life of the cluster
• It’s Like…: VMware vCenter
Utility: Authorization Manager
• It Helps You…: Centralize management of all users, groups and privileges; Manage permissions via delegated administration
• So You Can…: Lower the costs of administration; Improve compliance
• It’s Like…: Teradata security administration
CDH Integrates with Existing IT Infrastructure
Integration categories: ETL, Databases, BI/Analytics, Cloud/OS, Hardware