1 © Copyright 2012 EMC Corporation. All rights reserved.
THE ROAD TO BIG DATA ANALYTICS
Introduction to Greenplum Database and HD (Hadoop)
First There Was The Data Warehouse
• A new architecture to host data from multiple sources to support decision-making
• Why the Data Warehouse exists:
– Centralization of high value data
– Tools to process data into information
– Highly regulated environment
Legacy EDW
Then The MPP Database Was Introduced
A new approach to databases was required to handle the new analytics environment
Why the MPP Database exists:
– Data got larger
– Queries got uglier
– Performance became critical
– R/SAS/Statistical languages needed to run in-database
Now There Is Hadoop
Traditional systems weren't built to handle the storage/processing needs of Web 2.0
Why Hadoop exists:
– Data volumes moved to the PB range
– Raw (unstructured) forms of data needed to be processed
– Cost needed to be low
– Processing must scale with storage
Value Of Data Co-Processing With Hadoop
Hadoop And MPP Represent A Paradigm Shift
• Requires a different approach to how you leverage data
• Removes limitations around what data is worth storing or analyzing
• Augments analysis capabilities to create competitive advantages
Initially Used For Web Logs But Now…
• Healthcare – EMR/Claims data
• Financials – Ticker/Social media data
• Retail – Transaction/Customer sentiment data
• Insurance/Automobile – Telemetry data
Different Tools Have Different Strengths
[Diagram: tools placed along a STRUCTURED to UNSTRUCTURED axis]
Labels shown: SQL; RDBMS; Tables and Schemas; GP MapReduce; Indexing; Partitioning; BI Tools
Different Tools Have Different Strengths
[Diagram: tools placed along a STRUCTURED to UNSTRUCTURED axis]
Labels shown: Hive; MapReduce; Pig; XML, JSON, …; Flat files; Schema on load; Directories; No ETL; Java; SequenceFile
Big Data Analytics Requires Both
[Diagram: both tool sets combined on one STRUCTURED to UNSTRUCTURED axis]
Labels shown: SQL; RDBMS; Tables and Schemas; GP MapReduce; Indexing; Partitioning; BI Tools; Hive; MapReduce; Pig; XML, JSON, …; Flat files; Schema on load; Directories; No ETL; Java; SequenceFile
Delivered in a Unified Platform
• One system for Multi-structured analysis
• MPP Performance for data load and query
• Massive Scale
• Unified Collaboration, Management & Monitoring
GREENPLUM DATABASE
Industry-Leading Massively Parallel Processing (MPP) Performance
Greenplum Database
Extreme Performance for Analytics
• Optimized for BI and analytics
– Deep integration with statistical packages
– High performance parallel implementations
• Simple and automatic
– Just load and query like any database
– Tables are automatically distributed across nodes
• Extremely scalable
– MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
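The bullets above describe tables being spread across nodes so that every node scans its own share in parallel. A minimal Python sketch of that idea follows. It is only an illustration: the CRC32 hash, the `customer_id` key, and the four-segment count are assumptions chosen for the example, not details of Greenplum's actual distribution logic.

```python
import zlib
from collections import defaultdict


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution-key value to a segment index with a stable hash.

    CRC32 is an assumption for this sketch; the slides do not name a hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments):
    """Place each row on exactly one segment (shared-nothing placement)."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_field]), num_segments)].append(row)
    return segments


def parallel_sum(segments, field):
    """Each segment sums only its own rows; the partial sums are then combined."""
    partials = [sum(row[field] for row in rows) for rows in segments.values()]
    return sum(partials)


# Hypothetical data: 100 rows keyed by customer_id, spread over 4 segments.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1, 101)]
segments = distribute(rows, "customer_id", num_segments=4)
total = parallel_sum(segments, "amount")
```

Because every row lives on exactly one segment, the combined result matches a single-node scan, while each segment only touches its own share of the data.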
A Mature Enterprise Platform
CLIENT ACCESS & TOOLS
• Client Access: ODBC, JDBC, OLEDB, MapReduce, etc.
• 3rd Party Tools: BI Tools, ETL Tools, Data Mining, etc.
• Admin Tools: Greenplum Command Center, Greenplum Package Manager
PRODUCT FEATURES
• Loading & Ext. Access: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access
• Storage & Data Access: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes (Btree, Bitmap, etc.), External Table Support
• Language Support: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics, Analytics Extensions
GREENPLUM DATABASE ADAPTIVE SERVICES
• Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)
• Online System Expansion
• Workload Management
CORE MPP ARCHITECTURE
• Shared-Nothing MPP
• Parallel Query Optimizer
• Polymorphic Data Storage™
• Parallel Dataflow Engine
• gNet™ Software Interconnect
• Scatter/Gather Streaming™ Data Loading
Greenplum Database
Performance Through Parallelism
• Scale-out architecture on standard commodity hardware
• Automatic parallelization
– Load and query like any database
– Tables automatically distributed across all nodes
– No need for manual partitioning or tuning
• Extremely scalable MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
– On-line expansion when adding nodes
Greenplum Database
Most Powerful Data Loading Capabilities
• Industry-leading performance at 10+ TB per hour per rack
• Scatter-Gather Streaming™ provides true linear scaling
• Support for both large-batch and continuous real-time loading strategies
• Enables complex data transformations “in-flight”
• Transparent interfaces to loading via support for files, applications, and services
Greenplum load rates scale linearly with the number of racks; others do not. For example, two racks deliver more than 20 TB/hour.
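The linear-scaling claim can be turned into simple arithmetic. The sketch below takes the slide's "10+ TB per hour per rack" figure as a lower bound and multiplies it by the rack count. The function names and the 100 TB example are made up for illustration, and real load rates would depend on hardware and data.

```python
# Lower-bound per-rack rate quoted on the slide ("10+ TB per hour per rack").
TB_PER_HOUR_PER_RACK = 10.0


def load_rate_tb_per_hour(racks: int) -> float:
    """Aggregate load rate under the slide's linear-scaling assumption."""
    if racks < 1:
        raise ValueError("racks must be at least 1")
    return TB_PER_HOUR_PER_RACK * racks


def hours_to_load(data_tb: float, racks: int) -> float:
    """Estimated hours to load data_tb terabytes at the aggregate rate."""
    return data_tb / load_rate_tb_per_hour(racks)


two_rack_rate = load_rate_tb_per_hour(2)      # matches the slide's two-rack example
hours_for_100_tb = hours_to_load(100.0, 2)    # hypothetical 100 TB data set
```

Under these assumptions, doubling the racks halves the estimated load time for the same data volume.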
Greenplum Database
Polymorphic Table Storage™
• Enable Information Lifecycle Management (ILM)
• Storage types can be mixed within a table or database
– Four table types: heap, row-oriented AO, column-oriented, external
– Block compression: Gzip (levels 1-9), QuickLZ
• Provide the choice of processing model for any table or partition
TABLE 'CUSTOMER' (monthly partitions):
Mar '11, Apr '11, May '11, Jun '11, Jul '11, Aug '11, Sept '11, Oct '11, Nov '11
Row-oriented for HOT DATA; column-oriented for COLD DATA
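One way to picture the hot/cold split is a rule that picks a storage orientation per monthly partition based on its age. The Python sketch below is an illustration only: the three-month "hot" window and the choice of Nov '11 as the current month are assumptions, since the slide does not say where the boundary falls.

```python
from datetime import date


def months_between(older: date, newer: date) -> int:
    """Whole calendar months from older to newer."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def storage_for(partition_month: date, current_month: date, hot_months: int = 3) -> str:
    """Pick a storage orientation for one partition.

    hot_months is a hypothetical threshold; partitions newer than it are
    treated as hot (row-oriented), older ones as cold (column-oriented).
    """
    age = months_between(partition_month, current_month)
    return "row-oriented" if age < hot_months else "column-oriented"


# Partitions from the slide: Mar '11 through Nov '11.
partitions = [date(2011, month, 1) for month in range(3, 12)]
current = date(2011, 11, 1)  # assumed "now" for the example

# Keys are "YYYY-MM" strings, e.g. "2011-11".
layout = {p.isoformat()[:7]: storage_for(p, current) for p in partitions}
```

With these assumed settings, the three most recent partitions come out row-oriented and the rest column-oriented, which mirrors the idea that storage can differ per partition within one table.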
Greenplum Database
Parallel Query Optimizer
PHYSICAL EXECUTION PLAN FROM SQL OR MAPREDUCE
Gather Motion 4:1(Slice 3)
Sort
HashAggregate
HashJoin
Redistribute Motion 4:4(Slice 1)
HashJoin
Hash Hash
HashJoin
Hash
Broadcast Motion 4:4(Slice 2)
Seq Scan on motion
Seq Scan on customer
Seq Scan on lineitem
Seq Scan on orders
• Cost-based optimization looks for the most efficient plan
• Physical plan contains scans, joins, sorts, aggregations, etc.
• Global planning avoids sub-optimal 'SQL pushing' to segments
• Directly inserts 'motion' nodes for inter-segment communication
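The plan above names three kinds of motion node: Redistribute, Broadcast, and Gather. The Python sketch below shows, under simplifying assumptions, what each one does to rows held on four segments. The CRC32 hash, the table contents, and the helper names are all invented for this illustration and are not Greenplum internals.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # matches the "4:4" and "4:1" motions in the plan above


def seg(key) -> int:
    """Stable hash of a key to a segment index (CRC32 is an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_SEGMENTS


def redistribute(segments, key_field):
    """Redistribute motion: rehash rows so equal keys land on the same segment."""
    out = [[] for _ in range(NUM_SEGMENTS)]
    for rows in segments:
        for row in rows:
            out[seg(row[key_field])].append(row)
    return out


def broadcast(segments):
    """Broadcast motion: every segment receives a full copy of the input."""
    everything = [row for rows in segments for row in rows]
    return [list(everything) for _ in range(NUM_SEGMENTS)]


def gather(segments):
    """Gather motion: collect the rows of all segments in one place."""
    return [row for rows in segments for row in rows]


def local_hash_join(left_rows, right_rows, key):
    """Hash join executed independently inside one segment."""
    table = defaultdict(list)
    for r in right_rows:
        table[r[key]].append(r)
    return [{**l, **r} for l in left_rows for r in table.get(l[key], [])]


# Hypothetical tables: customers hashed on cust_id, orders placed arbitrarily.
customers = [{"cust_id": i, "name": f"c{i}"} for i in range(8)]
orders = [{"order_id": 100 + i, "cust_id": i % 8} for i in range(20)]
customer_segs = redistribute([customers], "cust_id")
order_segs = [orders[i::NUM_SEGMENTS] for i in range(NUM_SEGMENTS)]

# Option 1: redistribute orders on the join key, join locally, then gather.
joined = gather([
    local_hash_join(o, c, "cust_id")
    for o, c in zip(redistribute(order_segs, "cust_id"), customer_segs)
])

# Option 2: leave orders in place and broadcast the smaller customer table.
joined_broadcast = gather([
    local_hash_join(o, c, "cust_id")
    for o, c in zip(order_segs, broadcast(customer_segs))
])
```

Both options yield the same joined rows. Choosing between them, for example broadcasting a small table rather than rehashing a large one, is the kind of decision a cost-based planner would make.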
gNet Software Interconnect
A supercomputing-based “soft-switch” responsible for:
– Efficiently pumping streams of data between motion nodes during query-plan execution
– Delivering messages, moving data, collecting results, and coordinating work among the segments in the system
High Performance gNet for Hadoop
– Parallel query access
– Parallel data exchange
Greenplum Database
High Availability
Master Server Data Protection
– Replicated transaction logs for server failure
– Optional RAID protection for drive failures
– Upon server failure: standby server activated, administrator alerted, orchestrated failover
Segment Server Data Protection
– Mirrored segments for server failures
– Optional RAID protection for drive failures
– Upon server failure: mirrored segments take over with no loss of service
– Fast online differential recovery
[Diagram: two Master servers and four Segment servers]
Simple To Manage
Greenplum Command Center
– Complete platform management and control
Greenplum Package Manager
– Automates install, uninstall, update, and query for analytics extensions
– Supports package migration during upgrade, segment recovery, expansion, and standby initialization
In-Database Analytics
Bringing the power of parallelism to commonly-used modeling and analytics functions
In-database analytics
– SAS – HPA, Access, and Scoring Accelerator
– MADlib – An open-source library of advanced analytics functions
– Analytics extensions supported, including
▪ PostGIS - Geospatial support, PL/R - Statistical Computing, PL/Java, PL/Perl, etc.
SAS and Greenplum Partnership Delivers High-Performance Computing and MAD Analytics
• Access relational data sets for agile analysis
– SAS/ACCESS provides fast, transparent and secure access to Greenplum data.
• Leverage database scalability for rapid model deployment
– SAS Scoring Accelerator publishes models for execution in parallel across the Greenplum cluster.
• Build complex models at massive scales
– SAS HPA Appliance combines SAS In-Memory Analytics with Greenplum parallelism to produce record-breaking scalability and performance.
GREENPLUM HD
Hadoop For The Enterprise
Greenplum HD
People And Skills Challenges
• Establish a strategic vision
– Roadmap for Hadoop and unified analytics
• Hadoop Architecture Services
– POC planning and deployment
– Installation and best practices
• GPHD Training & Education
– Business, Developer, Data Scientist, Administration
• Access to Analytics Workbench
Greenplum HD Platform Delivery
Simple, efficient and scalable
• Proven at scale in a 1,000-node test environment (AWB) with worldwide EMC support
• Purpose-built Hadoop infrastructure
• Pluggable storage layer
• Management & monitoring at scale
Greenplum HD Platform Delivery
[Diagram: Greenplum HD stack]
• Greenplum Chorus
• Hadoop Tools (Pig, Hive, HBase, Zookeeper, Mahout, etc.)
• MapReduce Layer
• Pluggable Storage Layer (HDFS API): Apache HDFS, Isilon OneFS
• Greenplum Command Center
Greenplum HD Platform Delivery
• Spring Hadoop – Integrates Spring and Hadoop frameworks
• Mahout – Scalable machine learning libraries
• HBase – Database for random, real-time read/write access
• Hive – System for SQL-like querying of data on top of HDFS
• Pig – Procedural language that abstracts MapReduce
• Zookeeper – Highly reliable distributed coordination
• MapReduce – Framework for writing scalable data applications
• HDFS – Hadoop Distributed File System
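To make the MapReduce entry above concrete, here is a minimal word-count sketch in plain Python that mimics the map, shuffle, and reduce phases in a single process. It illustrates the programming model only; it does not use the Hadoop API, and the function names and input lines are invented for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data analytics", "Big Data in a unified platform"])
```

In a real cluster the map and reduce calls would run on many nodes close to the data in HDFS; here they run sequentially to keep the example self-contained.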
Productivity with Hadoop
Establish Chorus Connection to GPHD Cluster
Browse HDFS files
Leverage gNet integration to parse HDFS using SQL interface
– Determine inherent data structure
Collaboration with business, analytics and infrastructure
Integration with Existing Technologies
[Diagram: Greenplum gNet linking GREENPLUM HD (HDFS) and GREENPLUM DATABASE (SQL)]
• Parallel query integration
• Parallel import/export
• Data Access & Query Layer: Java/Perl/Python, Command Line, PigLatin, HQL, ODBC, JDBC
• Create end-to-end workflows
• Leverage existing skills
Big Data Analytics Requires Both
Greenplum Delivers
Big Data in a Unified Analytics Platform