big data, data science & fast data
Copyright © 2012 EMC Corporation. All Rights Reserved.
EMC2 PROVEN PROFESSIONAL
Big Data Analytics, Data Science & Fast Data
Kunal [email protected]
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
BIG DATA
DATA SCIENCE
FAST DATA
1. Introduction to Big Data Analytics
Big Data Pioneers
1,000,000,000 Queries A Day
250,000,000 New Photos / Day
290,000,000 Updates / Day
Other Companies using Big Data
4,000,000 Claims / Day
2,800,000,000 Trades / Day
31,000,000,000 Interactions / Day
Moore’s Law
Gordon Moore (co-founder of Intel)
The number of transistors that can be placed on a processor DOUBLES approximately every TWO years.
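The doubling rule can be written as N(t) = N0 * 2^(t / 2), with t in years. The Python sketch below is only an illustration of that arithmetic; the function name and the starting count are hypothetical and do not come from the slides.

```python
def projected_transistors(initial_count: float, years: float,
                          doubling_period_years: float = 2.0) -> float:
    """Project a transistor count forward, assuming it doubles once per period."""
    return initial_count * 2 ** (years / doubling_period_years)


# Hypothetical starting point: 1,000 transistors, projected 10 years ahead.
# Ten years contains five doubling periods, so the count grows by 2**5 = 32x.
projection = projected_transistors(1_000, years=10)
```

Real processor generations only follow this curve approximately, so the result is a rule-of-thumb estimate rather than a forecast.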
Introduction to Big Data Analytics
What is Big Data?
What makes data “Big” Data?
Your Thoughts?
Big Data Defined
• “Big Data” is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
  – Requires new data architectures, analytic sandboxes
  – New tools
  – New analytical methods
  – Integrating multiple skills into new role of data scientist
• Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real time capabilities
Source: McKinsey, May 2011 article “Big Data: The next frontier for innovation, competition, and productivity”
Key Characteristics of Big Data
1. Data Volume
  – 44x increase from 2010 to 2020 (1.2 zettabytes to 35.2 ZB)
2. Processing Complexity
  – Changing data structures
  – Use cases warranting additional transformations and analytical techniques
3. Data Structure
  – Greater variety of data structures to mine and analyze
Module 1: Introduction to BDA
Big Data Characteristics: Data Structures
Data Growth is Increasingly Unstructured
(Listed from more structured to less structured)
Structured
• Data containing a defined data type, format, structure
• Example: Transaction data and OLAP
Semi-Structured
• Textual data files with a discernable pattern, enabling parsing
• Example: XML data files that are self-describing and defined by an XML schema
“Quasi” Structured
• Textual data with erratic data formats, can be formatted with effort, tools, and time
• Example: Web clickstream data that may contain some inconsistencies in data values and formats
Unstructured
• Data that has no inherent structure and is usually stored as different types of files.
• Example: Text documents, PDFs, images and video
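To make the semi-structured case concrete, the Python sketch below parses a small XML document with the standard library. The document and its element names (`orders`, `order`, `sku`, `quantity`) are made up for illustration and are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical self-describing XML: the tags name each field.
SAMPLE_XML = (
    "<orders>"
    '<order id="1"><sku>ABC-1</sku><quantity>2</quantity></order>'
    '<order id="2"><sku>XYZ-9</sku><quantity>1</quantity></order>'
    "</orders>"
)


def parse_orders(xml_text: str) -> list:
    """Turn each <order> element into a dictionary of typed fields."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": order.get("id"),
            "sku": order.findtext("sku"),
            "quantity": int(order.findtext("quantity")),
        }
        for order in root.findall("order")
    ]


orders = parse_orders(SAMPLE_XML)
```

Because the pattern is discernable, a few lines are enough to recover structured records. Quasi-structured data such as clickstream URLs usually needs extra cleaning rules before it reaches this point.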
Four Main Types of Data Structures
http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
The Red Wheelbarrow, by William Carlos Williams
View Source
Structured Data
Semi-Structured Data
Quasi-Structured Data
Unstructured Data
Business Drivers for Big Data Analytics
Driver: Examples
1. Desire to optimize business operations: Sales, pricing, profitability, efficiency
2. Desire to identify business risk: Customer churn, fraud, default
3. Predict new business opportunities: Upsell, cross-sell, best new customer prospects
4. Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending, Basel II
Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data Driven
Challenges with a Traditional Data Warehouse
[Diagram labels: Data Sources; Enterprise Applications; Departmental Warehouse (shown twice); Reporting; “SpreadMarts”; Siloed Analytics]
[Callouts: Non-Prioritized Data Provisioning; Non-Agile Models; Static schemas accrete over time; Prioritized Operational Processes; Errant data & marts]
Implications of a Traditional Data Warehouse
• High-value data is hard to reach and leverage
• Predictive analytics & data mining activities are last in line for data
  – Queued after prioritized operational processes
• Data is moving in batches from EDW to local analytical tools
  – In-memory analytics (such as R, SAS, SPSS, Excel)
  – Sampling can skew model accuracy
• Isolated, ad hoc analytic projects, rather than centrally-managed harnessing of analytics
  – Non-standardized initiatives
  – Frequently, not aligned with corporate business goals
Slow “time-to-insight” & reduced business impact
Opportunities for a New Approach to Analytics
New Applications Driving Data Volume
[Chart: volume of information, from small to large, by decade]
• 1990’s (RDBMS & DATA WAREHOUSE): measured in terabytes, 1TB = 1,000GB
• 2000’s (CONTENT & DIGITAL ASSET MANAGEMENT): measured in petabytes, 1PB = 1,000TB
• 2010’s (NO-SQL & KEY/VALUE): will be measured in exabytes, 1EB = 1,000PB
Considerations for Big Data Analytics
Criteria for Big Data Projects
1. Speed of decision making
2. Throughput
3. Analysis flexibility
New Analytic Architecture
Analytic Sandbox: Data assets gathered from multiple sources and technologies for analysis
• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”
State of the Practice in Analytics: Mini-Case Study
Big Data Enabled Loan Processing at XYZ bank
[Diagram: Underwriting Risk inputs: Income Verification; Employment History; Credit Scoring And History; Appraisal]
TRADITIONAL DATA LEVERAGED: Traditional Underwriting Risk Level
BIG DATA LEVERAGED: Big Data Enabled Underwriting Risk Level
Your Thoughts?
2. Big Data Analytics - Use Cases
Big Data Analytics: Industry Examples
1. Health Care: Reducing Cost of Care
2. Public Services: Preventing Pandemics
3. Life Sciences: Genomic Mapping
4. IT Infrastructure: Unstructured Data Analysis
5. Online Services: Social Media for Professionals
Data Collectors: Retail, Phone/TV, Government, Internet, Medical, Financial
Big Data Analytics: Healthcare
Use of Big Data
Key Outcomes
Situation
•Poor police response and problems with medical care, triggered by shooting of a Rutgers student
•The event drove local doctor to map crime data and examine local health care
•Dr. Jeffrey Brenner generated his own crime maps from medical billing records of 3 hospitals
•City hospitals & ERs provided expensive, low-quality care
•Reduced hospital costs by 56% by realizing that 80% of city’s medical costs came from 13% of its residents, mainly low-income or elderly
•Now offers preventative care over the phone or through home visits
Big Data Analytics: Public Services
Use of Big Data
Key Outcomes
Situation
•Threat of global pandemics has increased exponentially
•Pandemics spread at faster rates and are more resistant to antibiotics
•Created a network of viral listening posts
•Combines data from viral discovery in the field, research in disease hotspots, and social media trends
•Using Big Data to make accurate predictions on spread of new pandemics
•Identified a fifth form of human malaria, including its origin
•Identified why efforts failed to control swine flu
•Proposing more proactive approaches to preventing outbreaks
Big Data Analytics: Life Sciences
Use of Big Data
Key Outcomes
Situation
•Broad Institute (MIT & Harvard) mapping the Human Genome
•In 13 yrs, mapped 3 billion genetic base pairs; 8 petabytes
•Developed 30+ software packages, now shared publicly, along with the genomic data
•Using genetic mappings to identify cellular mutations causing cancer and other serious diseases
•Innovating how genomic research informs new pharmaceutical drugs
Big Data Analytics: IT Infrastructure
Use of Big Data
Key Outcomes
Situation
•Explosion of unstructured data required new technology to analyze quickly and efficiently
•Doug Cutting created Hadoop to divide large processing tasks into smaller tasks across many computers
•Analyzes social media data generated by hundreds of thousands of users
•New York Times used Hadoop to transform its entire public archive, from 1851 to 1922, into 11 million PDF files in 24 hrs
•Applications range from social media and sentiment analysis to wartime chatter and natural language processing
Big Data Analytics: Online Services
Use of Big Data
Key Outcomes
Situation
•Opportunity to create social media space for professionals
•Collects and analyzes data from over 100 million users
•Adding 1 million new users per week
•LinkedIn Skills, InMaps, Job Recommendations, Recruiting
•Established a diverse data scientist group, as founder believes this is the start of Big Data revolution
3. Technologies for Big Data Analytics
Greenplum Unified Analytic Platform
Partner Tools & Services
GREENPLUM CHORUS – Analytic Productivity Layer
Greenplum gNet
GREENPLUM DATABASE
DATA SCIENCE TEAM
Data Scientist
Data Engineer
Data Analyst
BI Analyst
LOB User
Data Platform Admin
Cloud, x86 Infrastructure, or Appliance
GREENPLUM HD
Unify your team
Drive Collaboration
Keep Your Options Open
The Power of Data Co-Processing
Greenplum Hadoop
STRUCTURED UNSTRUCTURED
Hive
MapReduce
Pig
XML, JSON, …
Flat files
Schema on load
Directories
No ETL
Java
SequenceFile
Greenplum Database
STRUCTURED UNSTRUCTURED
SQL
RDBMS
Tables and Schemas
Greenplum MapReduce
Indexing
Partitioning
BI Tools
What do we Mean by Hadoop
• A framework for handling big data
  – An implementation of the MapReduce paradigm
  – Hadoop glues the storage and analytics together and provides reliability, scalability, and management
Two Main Components
• Storage (Big Data)
  – HDFS – Hadoop Distributed File System
  – Reliable, redundant, distributed file system optimized for large files
• MapReduce (Analytics)
  – Programming model for processing sets of data
  – Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s)
Module 5: Advanced Analytics - Technology and Tools
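The MapReduce programming model can be sketched in a few lines of plain Python. This is a single-process word count for illustration only. It does not use Hadoop or any Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single answer."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "fast data"])
```

In a real cluster the mappers and reducers run on different machines against blocks stored in HDFS. The sketch keeps only the data flow: many mapper outputs are reduced to one answer per key.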
Hadoop Distributed File System
MapReduce and HDFS
[Diagram labels: Client/Dev; Large Data Set (Log files, Sensor Data); Job Tracker; three Task Trackers, each running Map Jobs and Reduce Jobs; Hadoop Distributed File System (HDFS); steps numbered 1 to 4]
Components of Hadoop
• As you move from Pig to Hive to HBase, you increasingly move away from the mechanics of Hadoop and get an RDBMS view of the Big Data world
[Scale: More Hadoop Visible (Mechanics of Hadoop) to Less Hadoop Visible (DBMS View)]
• Pig: Data flow language & execution environment
• Hive: SQL-based language
• HBase: Queries against defined tables
Greenplum Database
Extreme Performance for Analytics
• Optimized for BI and analytics
  – Deep integration with statistical packages
  – High performance parallel implementations
• Simple and automatic parallelization
  – Just load and query like any database
  – Tables are automatically distributed across nodes
  – No need for manual partitioning or tuning
• Extremely scalable
  – MPP* shared-nothing architecture
  – All nodes can scan and process in parallel
  – Linear scalability by adding nodes where each node adds storage, query & load performance
*MPP – Massively Parallel Processing
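The shared-nothing idea can be illustrated with a toy Python sketch: rows are spread across segments by hashing a key, each segment scans only its own rows, and the partial results are combined. This is an assumption-laden illustration, not Greenplum's implementation. The hashing scheme, function names, and thread pool are all hypothetical stand-ins.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def segment_for(key: str, num_segments: int) -> int:
    """Pick a segment with a deterministic hash of the distribution key."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments):
    """Place every row on exactly one segment; segments share nothing."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(str(row[key_field]), num_segments)].append(row)
    return segments


def parallel_sum(segments, value_field):
    """Each segment scans its own rows; the partial sums are then combined."""
    def scan(segment):
        return sum(row[value_field] for row in segment)

    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(scan, segments))
    return sum(partials)


# Hypothetical table of 100 rows spread over 4 segments.
rows = [{"id": i, "amount": i} for i in range(100)]
segments = distribute(rows, "id", num_segments=4)
total = parallel_sum(segments, "amount")
```

Adding a segment means each one holds and scans fewer rows, which is the intuition behind the linear scalability claim above.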
Greenplum DB & HD
Massively Parallel Access and Movement
Maximize Solution Flexibility
Minimize Data Duplication
Access Hadoop Data in Real Time From Greenplum DB
Import and export in Text, Binary and Compressed Formats
Custom formats via user-written MapReduce Java program And GPDB Format classes
[Diagram labels: gNet; 10Gb Ethernet; Greenplum DB (GP DB Master Host; Segment 1, Segment 2, Segment 3); Hadoop (Node 1, Node 2, Node 3); MapReduce; External Tables (Text, Binary, User-Defined)]
Analytical Software
Exploiting Parallelism
In-Database Analytics
Analytic Results
Interconnect
Storage
Independent Segment Processors
Independent Memory
Independent Direct Storage
Connection
Master Segment Processor
Interconnect Switch
Math & Statistical Functions
4. Introduction to Data Science
Big Data Requires Data Science
Data Science
• Predictive analysis
• What if…..?
Business Intelligence
• Standard reporting
• What happened?
[Chart: BUSINESS VALUE (Low to High) against TIME (Past to Future), plotting Business Intelligence and Data Science]
Data science and business intelligence
“TRADITIONAL BI”
• GBs to 10s of TBs
• Operational
• Structured
• Repetitive
“BIG DATA ANALYTICS”
• 10s of TBs to PBs
• External + Operational
• Mostly Semi-Structured
• Experimental, Ad Hoc
Profile of a Data Scientist
Data Science as a Process
Stages: People and Ecosystem; Domain; Data Prep; Variable Selection; Model Building; Model Execution; Communication & Operationalization; Evaluate
Data Science as a Process: People and Ecosystem
• People
  – Scientists / Analysts
  – Business Analysts
  – Consumers of analysis
  – Stakeholders
  – EMC sales and services
• Ecosystem
  – Sector (Telecom, banking, security agency etc.)
  – Modeling software and other tools used by analysts (MADlib, SAS, R etc.)
  – Database (Greenplum) & Data Sources
Data Science as a Process: Domain
Discovery & prioritized identification of opportunities
• Customer Retention
• Fraud detection
• Pricing
• Marketing effectiveness and optimization
• Product Recommendation
• Others……
Data Science as a Process: Data Prep
• What are the data sources?
• Do we have access to them?
• How big are they?
• How often are they updated?
• How far back do they go?
• Which of these data sources are being used for analysis? Can we use a data source which is currently unused? What problems would that help us solve?
Data Science as a Process: Variable Selection
• Selection of raw variables which are potentially relevant to problem being solved
• Transformations to create a set of candidate variables
• Clustering and other types of categorization which could provide insights
Data Science as a Process: Model Building
Pick suitable statistics, or suitable model form and algorithm, and build model
Data Science as a Process: Model Execution
The model needs to be executable in database on big data with reasonable execution time
Data Science as a Process: Communication & Operationalization
The model results need to be communicated & operationalized to have a measurable impact on the business
Data Science as a Process: Evaluate
• Accuracy of results and forecasts
• Analysis of real-world experiments
• A/B testing on target samples
• End-user and LOB feedback
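One common way to read an A/B test is a two-proportion z-test on conversion counts. The slides do not say which test was used, so the Python sketch below is an assumed choice, and the visitor and conversion numbers are invented for illustration.

```python
import math


def two_proportion_z(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical sample: variant A converts 100 of 1,000, variant B 150 of 1,000.
z_score, p_value = two_proportion_z(100, 1000, 150, 1000)
```

A small p-value suggests the difference between the two target samples is unlikely to be chance alone. It does not say whether the difference is large enough to matter to the business.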
5. Data Science - Use Cases
Use Case 1: Trip modeling
Problem: Analyze behaviour of visitors to MakeMyTrip.com
Particularly interested in unregistered visitors
– About 99% of total visitor traffic
Applications of model
• Tailor promotions for popular types of trips
  – Most popular types probably already well-known; potential in next tier down
• ... and for different types of customers
• Present customised promotions to visitors based on clicks
• Ad optimization: present ads based on modelled behavior
Hypertargeting
• Serving content to customers based on individual characteristics and preferences, rather than broad generalizations
Available data
• Data available from server:
  – Date/time
  – IP address
  – Parts of site visited
• Geographic location can be obtained via geo lookup on IP
• Personal information available for registered visitors only
Approach
• Use clustering to identify trip/visitor types
  – Sport (IPL, F1, Football, etc)
  – Festivals
  – Other seasonal movements
• Decision trees to predict which type of trip a visitor is likely to make
  – Based on successively more information as they move through the site
• Use registered visitor info to augment models
Use Case 2: Municipal traffic analysis
• Client domain: Municipal city government
• Available data:
  – Cross-city loop detectors measuring traffic volume
  – Detailed city bus movement information from Bluetooth devices
  – Video detection of traffic volume, velocity
• Goal: Exploit available data for unrealized business insights and values
Data loading and manipulation
• Parallel data loading
  – Data loaded from local file system and distributed across Greenplum servers in parallel.
  – Loading 9 months of traffic volume data (16 GB, 464 million rows) in 69.4 seconds.
• SQL data manipulation
  – Standard SQL permits city personnel to use existing skillsets.
  – Greenplum SQL extensions offer control over data distribution.
  – Open source packages (e.g. in Python, R) can be conveniently deployed within Greenplum for visualization and analytics purposes.
Basic reporting on traffic volume
• Easy generation of reports via straightforward user-defined functions
• Standard graphing utilities called from within Greenplum to create figures
• Detector downtimes can be clearly spotted in the figure, or via an SQL query, thus mitigating maintenance challenges
Basic reporting on city buses
• Data from Bluetooth devices has a wealth of information on city buses that we can report on:
  – Travel route of each bus
  – Deviations of arrival times compared to provided timetable
  – Occurrences of driver errors (e.g. taking a wrong turn) and possible causes
  – Occurrences where buses of the same service arrive at the same stop within seconds of each other
  – Whether new bus services translate into lower traffic volume on introduced roads
Result visualizations (Google Earth)
Applications for traffic network modelling
• Compute the fastest path between any two locations at a future time point
• Identify potential bottlenecks in the traffic
• Identify phase transition points for massive traffic congestion using simulation techniques
• Study the likely impact of new roads and traffic policies, without having to observe real disruptive events to determine the impact
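The fastest-path application can be sketched with Dijkstra's algorithm. The slides do not name the algorithm used, so this is an assumed choice. The sketch also assumes the edge weights are travel times in minutes already predicted for the future time point of interest, and the road network shown is hypothetical.

```python
import heapq


def fastest_path(graph, source, target):
    """Return (path, total_minutes) using Dijkstra's algorithm.

    graph maps each location to {neighbor: predicted_travel_minutes}.
    """
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # Stale queue entry; a cheaper route was already found.
        for neighbor, minutes in graph.get(node, {}).items():
            new_cost = cost + minutes
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return None, float("inf")
    # Walk the predecessor links back from the target to rebuild the route.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, best[target]


# Hypothetical network: the direct road A -> B is slower than going via C.
roads = {"A": {"B": 5, "C": 2}, "C": {"B": 1}, "B": {}}
route, minutes = fastest_path(roads, "A", "B")
```

Re-running the search with weights predicted for a different time of day would give the route for that time point.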
Parallel traffic network modelling
• Greenplum’s parallel architecture permits traffic network analysis on a city scale
• Travel time can be predicted via model learning, involving hundreds of thousands of optimizations in parallel, across the entire traffic network
• Variables that can be considered include
  – Distance between two locations
  – Concurrent traffic volume
  – Time of day
  – Weather
  – Construction work
• Computationally prohibitive for traditional non-parallel database environments
Use Case 3 - Product Recommendation Analysis
• Eight banks became one
  – Branches across the US
• Consolidation of products and customers
  – Employees faced with new products and customers
  – Visibility into churn and retention was challenged
• Analytics focus was historically reporting-centric
  – Descriptive “hindsight”
Customer Segmentation
Customer segments
– First, define a measurement of customer value
– Then create clusters of customers based on customer value, and then product profiles.
Association Rules
Product associations
– Now find products that are common in the segment, but not owned by the given household.
[Diagram: Product A, Product B; Product X, Product Y, Product Z]
Product Recommendations
Next best offer
– Now, filter down to products associated with high-value customers in the same segment.
[Diagram: Product A, Product B; Product X, Product Y, Product Z]
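The product association and next-best-offer steps can be sketched as a simple frequency filter. This is a simplification for illustration and not the bank's actual model. The function name, the 50% threshold, and the sample households are assumptions.

```python
from collections import Counter


def next_best_offers(household_products, segment_households, min_share=0.5):
    """Rank products common in the segment that the household does not own.

    segment_households is a list of product sets, one per household. Passing
    only the high-value households of the segment gives the final filter.
    """
    counts = Counter()
    for products in segment_households:
        counts.update(set(products))
    total = len(segment_households)
    owned = set(household_products)
    candidates = [
        (product, count / total)
        for product, count in counts.items()
        if count / total >= min_share and product not in owned
    ]
    # Most widely held products first; ties broken alphabetically.
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Hypothetical high-value households in one segment.
segment = [{"A", "B", "X"}, {"A", "X", "Y"}, {"A", "X"}, {"B", "Z"}]
offers = next_best_offers({"A", "B"}, segment)
```

Here the household already owns Products A and B, so only Product X clears the threshold and becomes the recommendation.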
Increased customer value
Customer Comments
– “The Greenplum Solution has scaled from 6 to 11 TB of data.”
– Moved from 7 hours / month of data to 7.5 hours / 2.5 years of data
Product Recommender
6. Introduction to Fast Data
Ferrari VS Freight Train
                Ferrari              Freight Train
0-100 KMPH      2.3 seconds          100 seconds
Top Speed       360 KMPH             140 KMPH
Stops / hr      1000                 5
Horse Power     660 bhp              16,000 bhp
Throughput      220 KG in 27 mins    55,000,000 KG in 60 mins
Fast Data VS Big Data
                        Fast Data                       Big Data
Transactions / Second   100000+ per second              n.a.
Concurrent hits         10000+ per sec                  10 per second
Update Patterns         Read / Write                    Appends
Data Complexity         Simple Joins on a few tables    Can be highly complex
Data Volumes            GBs / TBs                       PB to ZB
Access Tools            GemFire / SQLFire               GP DB, GP Hadoop
Not a fast OLTP DB!
APPLICATION(S)
Fast Data is
• More than just an OLTP DB
• Super Fast access to Data
• Server side flexibility
• Data is HA
• Supports transactions
• Setup is fault tolerant
• Can handle thousands of concurrent hits
• Distributed hence horizontally scalable
• Runs on cheap x86 hardware
CAP Theorem
A distributed system can only achieve TWO out of the three qualities of Consistency, Availability and Partition Tolerance
C – Consistency
A – Availability
P – Partition Tolerance
Fast Data =
Database
• Storage
• Persistence
• Transactions
• Queries

• High Availability
• Load Balancing
• Data Replication
• L1 Caching
+ Service Bus
• Service Loose Coupling
• Data Transformation
• System Integration
+ Messaging System
• Guaranteed Delivery
• Event Propagation
• Data Distribution
+ Complex Event Processor
• Event Driven Architectures
• Real-time Analysis
• Business Event Detection
+ Grid Controller
• Map-Reduce, Scatter-Gather
• Distributed Task Assignment
• Task Decomposition
• Result Summarization
Fast Data takes select features from all of these products and combines them into a low-latency, linearly scalable, memory-based data fabric
A Typical Fast Data Setup
Web Tier
Application Tier
Load Balancer
Add/remove web/application/data servers
Add/remove storage
Database Tier
Storage Tier
Disks may be direct or network attached
Optional reliable, asynchronous feed to a Big Data Store
Memory-based Performance
Perform
Fast Data uses memory on a peer machine to make data updates durable, allowing the updating thread to return 10x to 100x faster than updates that must be written through to disk, without risking any data loss. Typical latencies are in the few hundreds of microseconds instead of in the tens to hundreds of milliseconds.
One can optionally write updates to disk / data warehouse / big data store asynchronously and reliably.
WAN Distribution
Distribute
Fast Data can keep clusters that are distributed around the world synchronized in real-time and can operate reliably in Disconnected, Intermittent and Low-Bandwidth network environments.
Distributed Events
Targeted, guaranteed delivery, event notification and Continuous Queries
Notify
Parallel Queries
Batch Controller or Client
Scatter-Gather (Map-Reduce) Queries
Compute
Data-Aware Routing
Execute
Fast Data provides ‘data aware function routing’ – moving the behavior to the correct data instead of moving the data to the behavior.
Batch Controller or Client
Data Aware Function
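Data-aware routing can be sketched as "hash the key, find the owning node, run the function there". The example below is an illustrative assumption: `Node`, `Cluster`, `owner` and `execute_on` are hypothetical names, and a CRC32 hash stands in for whatever partitioning scheme a real data fabric uses.

```python
import zlib


class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def execute(self, func, key):
        """Run the function here, next to the data it needs."""
        return func(self.data[key])


class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def owner(self, key):
        # Stable hash so the same key always maps to the same node.
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key, value):
        self.owner(key).data[key] = value

    def execute_on(self, key, func):
        """Route the function to the node that owns the key."""
        node = self.owner(key)
        return node.name, node.execute(func, key)


cluster = Cluster([Node("n0"), Node("n1"), Node("n2")])
cluster.put("order:42", {"quantity": 3, "unit_price": 10.0})
ran_on, total = cluster.execute_on(
    "order:42", lambda order: order["quantity"] * order["unit_price"]
)
```

The caller ships a small function and receives a small result; the order record itself never leaves the node that holds it.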
Accessing Fast Data
GemFire
• Stores Objects (Java, C++, C#, .NET) or unstructured data
• Key-Value store with OQL Queries
• Spring-GemFire
• L2 Cache plugin for Hibernate
• HTTP Session replication module
SQLFire
• Stores Relational Data with SQL interface
• Supports JDBC, ODBC, Java and .NET interfaces
• Uses existing relational tools
Example data model: Order; Order Line Item (Quantity, Discount); Product (SKU, Unit Price)
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Use Cases
Applying the technology
A few examples of Fast Data technology applied to real business cases
Mainframe Migration
A mainframe-based, nightly customer account reconciliation batch run
Mainframe (timeline: 0 to 120 min)
• I/O Wait: 9%
• CPU Busy: 15%
• CPU Unavailable: 76%
COTS Cluster
• Batch now runs in 60 seconds
• 93% Network Wait! Time could have been reduced further with higher network bandwidth
Mainframe Migration
So What? So the batch runs faster – who cares?
1. It ran on cheaper, modern, scalable hardware
2. If something goes wrong with the batch, you only wait 60 seconds to find out
3. Now, the hardware and the data are available to do other things in the remaining 119 minutes:
• Fraud detection
• Regulatory compliance
• Re-run risk calculations with 119 different scenarios
• Up sell customers
4. You can move from batch to real-time processing!
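As a rough worked figure, assuming the original mainframe batch window was 120 minutes (which the "remaining 119 minutes" above implies), the speedup and the freed time come out as:

```latex
\[
\text{speedup} = \frac{120\ \text{min} \times 60\ \text{s/min}}{60\ \text{s}} = 120\times,
\qquad
\text{freed window} = 120\ \text{min} - 1\ \text{min} = 119\ \text{min}
\]
```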
Online Betting
A popular online gambling site attracts new players through ads on affiliate sites
Customized Banner Ad on affiliate site
Affiliate's Web Server
Banner Ad Server
In a fraction of a second, the banner ad server must:
• Generate a tracking id specific to the request
• Apply temporal, sequential, regional, contractual and other policies in order to decide which banner to deliver
• Customize the banner
• Record that the banner ad was delivered
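The four tasks the banner ad server performs per request can be sketched as one small function. Everything here is hypothetical and for illustration only: the policy functions, the banner fields and the in-memory delivery log are assumptions, not details of the actual system.

```python
import uuid


def choose_banner(request, policies, banners):
    """Return the first banner that every policy accepts, else None."""
    for banner in banners:
        if all(policy(request, banner) for policy in policies):
            return banner
    return None


def serve_banner(request, policies, banners, delivery_log):
    tracking_id = uuid.uuid4().hex                      # tracking id per request
    banner = choose_banner(request, policies, banners)  # apply the policies
    if banner is None:
        return None
    customized = {**banner,                             # customize the banner
                  "tracking_id": tracking_id,
                  "affiliate": request["affiliate"]}
    delivery_log.append(customized)                     # record the delivery
    return customized


# Hypothetical policies: a regional check and a contractual check.
def regional(request, banner):
    return request["region"] in banner["regions"]


def contractual(request, banner):
    return request["affiliate"] not in banner["excluded_affiliates"]


banners = [
    {"name": "poker", "regions": {"UK"}, "excluded_affiliates": {"site-b"}},
    {"name": "sports", "regions": {"UK", "DE"}, "excluded_affiliates": set()},
]
log = []
ad = serve_banner({"affiliate": "site-b", "region": "UK"},
                  [regional, contractual], banners, log)
```

With all policy data and the delivery log held in memory, each request touches no disk on the critical path, which is the property the use case depends on.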
Online Betting (Contd.)
Their initial RDBMS-based system:
• Limited their ability to sign up new affiliates
• Limited their ability to add new products on their site
• Limited the delivery performance experienced by their affiliates and their customers
• Limited their ability to add additional internal applications and policies to the process
Their new Fast Data based system:
• Responded with sub-millisecond latency
• Met their target of 2500 banner ad deliveries per second
• Provides for future scalability
• Improved performance to the browser by 4x
• Cost less
Asset/Position Monitoring
Centralized data storage was not possible
Multi-agency, multi-force integration
Numerous Applications needed access to multiple data sources simultaneously
Networks constantly changing, unreliable, mobile deployments
Upwards of 60,000 object updates each minute
Over 70 data feeds
Needed a real-time situational awareness system to track assets that could be used by the war fighters in theatre
Northrop Grumman (integrator) investigated the following technologies before deciding on GemFire:
• RDBMS – Oracle, Sybase, Postgres, TimesTen, MySQL
• ODBMS - Objectivity
• jCache – GemFire, Oracle Coherence
• JMS – SonicMQ, BEA Weblogic, IBM, jBoss
• TIBCO Rendezvous
• Web Services
Asset/Position Monitoring
655 sites, 11 thousand users
Real-time, 3 dimensional, NASA World Wind User Interface
60,000 Position updates per minute
Real time info available on the desk of
President of the United States
US Secretary of Defense
Each of the Joint Chiefs of Staff
Every commander in the US Military
Global Foreign Exchange Trading System
The project achieved:
• Low-latency trade insertion
• Permanent Archival of every trade
• Kept pace with fast ticking market data
• Rapid, Event Based Position Calculation
• Distribution of Position Updates Globally
• Consistent Global Views of Positions
• Pass the Book
• Regional Close-of-day
• High Availability
• Disaster Recovery
• Regional Autonomy
Global Foreign Exchange Trading System
In that same application, Fast Data replaced:
• Sybase Database In Every Region (still need 1 instance for archival purposes)
• TIBCO Rendezvous for Local Area Messaging
• IBM MQ Series for WAN Distribution
• Veritas N+1 Clustering for H/A (in fact, we save the physical +1 node itself)
• 3DNS or Wide IP
Admin personnel reduced from 1.5 to 0.5
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Application High Level Overview
APPLICATION(S)
Single DB can't handle both OLTP and OLAP workloads