02 a holistic approach to big data
Raul F. Chong
Senior Big Data and Cloud Program Manager
Big Data University Community Leader
@raulchong
A holistic approach to Big Data
© 2013 BigDataUniversity.com
Agenda
Introduction to Big Data
The state of Big Data adoption
Big Data – A holistic approach
The 5 high value Big Data use cases
Technical details of key Big Data components
The future of Big Data and Cloud
Demos
Resources
What is Big Data?
Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools.
Difficulties include capture, storage, search, sharing, analytics, and visualizing.
Source: Wikipedia
Big Data Characteristics
Information is growing at a phenomenal rate
44x as much data and content over the coming decade
2009: 800,000 petabytes
2020: 35 zettabytes (= 4 trillion 8 GB iPods)
Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
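The growth figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) units (1 petabyte = 10^15 bytes, 1 zettabyte = 10^21 bytes, 1 GB = 10^9 bytes); the slide does not say which convention IDC used, and the variable names are illustrative only.

```python
# Sanity check of the IDC growth figures quoted on the slide.
# Assumption: decimal (SI) units; the source does not state the convention.
PB = 10**15
ZB = 10**21
GB = 10**9

data_2009 = 800_000 * PB   # 800,000 petabytes in 2009
data_2020 = 35 * ZB        # 35 zettabytes projected for 2020

growth = data_2020 / data_2009      # how many times larger 2020 is than 2009
ipods = data_2020 / (8 * GB)        # number of 8 GB devices needed to hold it

print(f"Growth factor 2009 -> 2020: {growth:.2f}x")
print(f"8 GB iPods needed for the 2020 volume: {ipods:.3e}")
```

Under these assumptions the ratio is 43.75, which the slide rounds to 44x, and the device count is a little over 4 trillion.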
Big Data Characteristics
• About 80% of the world’s data is unstructured
• It may be data we’ve been collecting before, but could not process
Types of Big Data
• Data in movement - streams
  • Twitter / Facebook comments
  • Stock market data
  • Sensors: vital signs of a newborn
• Data at rest - oceans
  • Collection of what has streamed
  • Web logs, emails, social media
  • Unstructured documents: forms, claims
  • Structured data from disparate systems
Traditional vs. big data business approaches
Traditional Approach: Structured & Repeatable Analysis
  Business Users: Determine what question to ask
  IT: Structures the data to answer that question
  Examples: Monthly sales reports, Profitability analysis, Customer surveys
Big Data Approach: Iterative & Exploratory Analysis
  IT: Delivers a platform to enable creative discovery
  Business: Explores what questions could be asked
  Examples: Brand sentiment, Product strategy, Maximum asset utilization
Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO
Big Data: In Demand Well Paying Skill
Skills are in demand
Pays well
“If you can claim to be a data scientist and have the chops to back that up, you can pretty much write your own ticket even in this tough job market.”
Source: Gigaom http://gigaom.com/cloud/big-data-skills-bring-big-dough/
KTH Swedish Royal Institute of Technology: Reducing Traffic Congestion
• Deployed real-time Smarter Traffic system to predict and improve traffic flow.
• Analyzes streaming real-time data gathered from cameras at entry/exit to city, GPS data from taxis and trucks, and weather information.
• Predicts best time and method to travel, such as when to leave to catch a flight at the airport
Results
• Enables ability to analyze and predict traffic faster and more accurately than ever before
• Provides new insight into mechanisms that affect a complex traffic system
• Smarter, more efficient, and more environmentally friendly traffic
University of Southern California Innovation Lab Monitors Political Debates
Benefits
• Real-time display of public sentiment as candidates respond to questions
• Debate winner prediction based on public opinion instead of solely political analysts
Big Data – A holistic approach
Big Data is Not Only Hadoop!
Examples where Hadoop is not entirely applicable:
– Cyber security, Stock market, Traffic control, Sensor information, monitoring trends in Social Media
– What if your company has many silos of information, difficult to move to HDFS?
– What about governance? Can we trust the source of this data?
Solutions
Big Data Platform
Analytics and Decision Management
Big Data Infrastructure
Big data holistic approach: A platform
The IBM Big Data Platform
Data Warehouse: Delivers deep insight with advanced in-database analytics & operational analytics
Stream Computing: Analyze streaming data and large data bursts for real-time insights
Hadoop System: Cost-effectively analyze petabytes of unstructured and structured data
Information Integration & Governance: Govern data quality and manage the information lifecycle
Accelerators: Speed time to value with analytic and application accelerators
Visualization & Discovery: Discover, understand, search, and navigate federated sources of big data
Application Development
Systems Management
Process any type of data
– Structured, unstructured, in-motion, at-rest, in-place
Built-for-purpose engines
– Designed to handle different requirements
Manage and govern data in the ecosystem
Enterprise data integration
Grow and evolve on current infrastructure
The whole is greater than the sum of parts
Integrated components
Out of the box, standards-based services
Start small (value is additive)
Big data holistic approach: A platform
Solutions
Analytics and Decision Management
Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance
Big Data Infrastructure
An example of the big data platform in practice
Ingestion and Real-time Analytic Zone: Streams, Connectors
Landing and Analytics Sandbox Zone: Hadoop, MapReduce, Hive/HBase, Col Stores, Documents in variety of formats
Warehousing Zone: Enterprise Warehouse, Data Marts
Analytics and Reporting Zone: BI & Reporting, Predictive Analytics, Visualization & Discovery
Metadata and Governance Zone: ETL, MDM, Data Governance
The 5 High Value Big Data Use Cases
Big Data Exploration: Find, visualize, understand all big data to improve business knowledge
Enhanced 360° View of the Customer: Achieve a true unified view, incorporating internal and external sources
Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real-time
Operations Analysis: Analyze a variety of machine data for improved business results
Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency
Find, visualize and understand all big data to improve business knowledge
• Greater efficiencies in business processes
• New insights from combining and analyzing data types in new ways
• Develop new business models with resulting increased market presence and revenue
Big Data Exploration: Illustrated
Data sources: CM, RM, DM; RDBMS; Feeds; Web 2.0; Email; Web; CRM, ERP; File Systems
Data Explorer: Connector Framework, App Builder, UI / User
Platform components: Hadoop, Streams, Warehouse, Integration & Governance
Big Data Exploration: Example in Practice
Airplane Manufacturer (blinded for confidentiality)
• Exploring 4 TB to drive point business solutions (supplier portal, call center, etc.)
• Single point of data fusion for all employees to use
• Reduced costs & improved operational performance for the business
Is Big Data Exploration Right for You?
• How do you separate the “noise” from useful content?
• How do you perform data exploration on large and complex data?
• How do you find insights in new or unstructured data types (e.g. social media and email)?
• How do you enable employees to navigate and explore enterprise and external content? Can you present this in a single user interface?
• How do you identify areas of data risk before they become a problem?
• What is the starting point for your big data initiatives?
Big Data Platform Component Starting Point: Data Explorer
Enhanced 360º View of the Customer: Illustrated
SOURCE SYSTEMS
CRM
  Name: J Robertson
  Address: 35 West 15th
  Address: Pittsburgh, PA 15213
ERP
  Name: Janet Robertson
  Address: 35 West 15th St.
  Address: Pittsburgh, PA 15213
Legacy
  Name: Jan Robertson
  Address: 36 West 15th St.
  Address: Pittsburgh, PA 15213
Master Data Management: 360 View of Party Identity
  First: Janet
  Last: Robertson
  Address: 35 West 15th St
  City: Pittsburgh
  State/Zip: PA / 15213
  Gender: F
  Age: 48
  DOB: 1/4/64
Unified View of Party’s Information: Hadoop, Streams, Warehouse
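The idea of resolving the three source records into one party can be illustrated with a small matching sketch. It is not how a Master Data Management product works internally; it only assumes a hypothetical rule (same city line, plus similar name and street as measured by Python's standard-library difflib), a hypothetical threshold of 0.75, and a naive "longest name wins" survivorship choice.

```python
from difflib import SequenceMatcher

# The three source records shown on the slide.
records = [
    {"source": "CRM", "name": "J Robertson", "street": "35 West 15th", "city": "Pittsburgh, PA 15213"},
    {"source": "ERP", "name": "Janet Robertson", "street": "35 West 15th St.", "city": "Pittsburgh, PA 15213"},
    {"source": "Legacy", "name": "Jan Robertson", "street": "36 West 15th St.", "city": "Pittsburgh, PA 15213"},
]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_party(r1: dict, r2: dict, threshold: float = 0.75) -> bool:
    """Hypothetical rule: identical city line and similar name/street on average."""
    if r1["city"] != r2["city"]:
        return False
    score = (similarity(r1["name"], r2["name"]) + similarity(r1["street"], r2["street"])) / 2
    return score >= threshold


def group_parties(recs: list) -> list:
    """Greedy grouping: a record joins the first group that contains a match."""
    groups = []
    for rec in recs:
        for group in groups:
            if any(same_party(rec, member) for member in group):
                group.append(rec)
                break
        else:
            groups.append([rec])
    return groups


groups = group_parties(records)
for group in groups:
    # Naive survivorship assumption: the longest name is the most complete one.
    best_name = max((r["name"] for r in group), key=len)
    print(best_name, "<-", [r["source"] for r in group])
```

A production system would add standardization (for example "St." vs. "Street"), probabilistic weights and stewardship workflows; the sketch only shows why near-duplicate records can be linked even when no field matches exactly.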
Security/Intelligence Extension: Illustrated
Traditional Security Operations and Technology: Logs, Events, Alerts, Configuration information, System audit trails, External threat intelligence feeds, Network flows and anomalies, Identity context
Big Data Analytics: Web page text, Video/audio surveillance, E-mail and social activity, Business process data, Customer transactions
New Considerations
Collection, Storage and Processing: Collection and integration; Size and speed; Enrichment and correlation
Analytics and Workflow: Visualization; Unstructured analysis; Learning and prediction; Customization; Sharing and export
Security/Intelligence Extension: Customer Example
“Reconstructing Events” – Integrating Multimedia from Diverse Sources
• Correlate multimedia content across a wide diversity of sources and dynamic topology of cameras
• Exploit partial overlaps in field of view, re-identification of objects/people and contextual information
• Obtain real-time operational picture across diverse content
  • 100K security cameras (static cameras, slowly changing topology)
  • 10M mobile photos/day (limited knowledge about locations)
  • 50M social media photos/video (uncertain geo-temporal context)
  • Moving vehicles (patrol cars), overhead drones, broadcast, retail, 311, etc.
Diagram labels: Overhead, Social Media, Mobile Cameras, Security Cameras
Would the Security / Intelligence Extension benefit you?
• What are your plans to enrich your security or intel system with unused or underleveraged data sources (video, audio, smart devices, network, Telco, social media)?
• How will you address the need for sub-second detection, identification, and resolution of physical or cyber threats?
• How do you intend to follow activities of criminals, terrorists, or persons on a blacklist?
• How do you plan to enhance your surveillance system with real-time data from video, acoustic, thermal or other security sensors?
• Do you want to correlate lots of technical or human intel data and sources, looking for associations or patterns (big data forensics)?
• How are you going to deal with unstructured data (email, social, etc.) in your Security Information & Event Management (SIEM) solution to improve cyber threat detection & remediation?
Captured and analyzed 42TB of daily traffic in real-time for tracking persons of interest to take suitable action and reduce risk.
Big Data Platform Component Starting Point: Streams, Hadoop
Operations Analysis: Illustrated
Raw Logs and Machine Data
Indexing, Search
Statistical Modeling
Root Cause Analysis
Federated Navigation & Discovery
Real-time Analysis
Only store what is needed
Machine Data Accelerator
Operations Analysis is a Business Imperative
Cost of System Down Time
– 49% of Fortune 500 companies > 80 hrs down time/year [1]
  • Cost of down time: $90,000/hr to $6.48 million/hr
  • 80 hours * $6.48M = approx. $500M per year
– System downtime costs North American businesses $26.5 billion a year in lost revenue [2]
[1] http://www.information-management.com/infodirect/2009_133/downtime_cost-10015855-1.html
[2] http://www.itchannelplanet.com/business_news/article.php/3916786/IT-System-Downtime-Costs-265-Billion-A-Year-Study-Finds.htm
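The downtime estimate on this slide is simple multiplication, sketched here for both ends of the quoted hourly range. The constant names are illustrative; the inputs are the slide's own figures (80 hours per year, $90,000 to $6.48 million per hour).

```python
# Annual downtime cost range implied by the slide's figures.
HOURS_DOWN_PER_YEAR = 80          # "> 80 hrs down time/year", taken as 80
COST_PER_HOUR_LOW = 90_000        # USD per hour, low end
COST_PER_HOUR_HIGH = 6_480_000    # USD per hour, high end

annual_low = HOURS_DOWN_PER_YEAR * COST_PER_HOUR_LOW
annual_high = HOURS_DOWN_PER_YEAR * COST_PER_HOUR_HIGH

print(f"Annual downtime cost: ${annual_low:,} to ${annual_high:,}")
```

The high end works out to $518.4 million, which the slide rounds to roughly $500M; the low end is $7.2 million.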
Operations Analysis: Customer Example
• Intelligent Infrastructure Management: log analytics, energy bill forecasting, energy consumption optimization, anomalous energy usage detection, presence-aware energy management
• Optimized building energy consumption with centralized monitoring; Automated preventive and corrective maintenance
• Utilized InfoSphere Streams, InfoSphere BigInsights, IBM Cognos
Would Operations Analysis benefit you?
• Do you deal with large volumes of machine data?
• How do you access and search that data?
• How do you perform root cause analysis?
• How do you perform complex real-time analysis to correlate across different data sets?
• How do you monitor and visualize streaming data in real time and generate alerts?
Big Data Platform Component Starting Point: Hadoop, Streams
Data Warehouse Augmentation: Needs
Integrate big data and data warehouse capabilities to increase operational efficiency
Need to leverage variety of data
• Structured, unstructured, and streaming data sources required for deep analysis
• Low latency requirements (hours, not weeks or months)
• Required query access to data
Extend warehouse infrastructure
• Optimized storage, maintenance and licensing costs by migrating rarely used data to Hadoop
• Reduced storage costs through smart processing of streaming data
• Improved warehouse performance by determining what data to feed into it
Data Warehouse Augmentation: Illustrated
Hadoop as a query-ready archive for a data warehouse
Open Source Hadoop
Visualization & Discovery Connectors
Workload Optimization
Flume
Runtime
Advanced Engines
File System
MapReduce
HDFS
Data Store: HBase
Development Tools: Eclipse Plug-ins
Systems Management
Jaql
Pig
ZooKeeper
Lucene
Oozie
Hive
Open Source
Mahout
Whirr
Sqoop
Hue
HCatalog
R
Visualization & Discovery Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights v2.1 Enterprise Edition
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data Store: HBase
Text Processing Engine & Extractor Library
BigSheets
JDBC
Applications & Development
Text Analytics
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard & Visualization Apps Workflow Monitoring
Management
Security
Audit & History
Lineage
R
Guardium
Platform Computing
Cognos
GPFS
Legend: IBM / Open Source
High Availability
Big SQL
HCatalog
Whirr
Mahout
Hue
Added Value on Top of Open Source Hadoop
InfoSphere BigInsights Added Value
InfoSphere BigInsights
Administration & Security
Workload Optimization (MapReduce/SQL)
Connectors
Development Tools
IBM tested & supported open source components
Accelerators
Open source based
components
Workload Management
Security
Development Environment
Analytics/ExtractorsAnalytics
Extraction engine (System T)
Visualization & Exploration
Extractors and APIs
SQL API
InfoSphere BigInsights Added Value: Accelerators
Social Data Analytics Accelerator Architecture
Input: Social Media Data
Online flow: Data-in-motion analysis (Stream Computing and Analytics)
  • Data Ingest and Prep
  • Extract Buzz, Intent, Sentiment
  • Entity Analytics: Profile Resolution
  • Dashboard: Real-time analytics, pre-defined views and charts
Offline flow: Data-at-rest analysis (BigInsights System and Analytics)
  • Extract Buzz, Intent, Sentiment and Consumer Profiles
  • Entity Analytics and Integration
  • Comprehensive Social Media Customer Profiles
  • Pre-defined Workbooks and Dashboards
Optional: Indexed Search
  • Index using Push API
  • Data Explorer
  • Ad hoc access
InfoSphere BigInsights Added Value: BigSheets
BigSheets Visualization and Exploration
• Web-based analysis and visualization for users
• Familiar spreadsheet-like interface
• Define and manage long running data collection jobs
No programming knowledge needed!
How it works
• Model “big data” collected from various sources as collections
• Filter and enrich content with built-in functions
• Combine data in different collections
• Visualize results through spreadsheets, charts
• Export data into common formats (if desired)
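BigSheets itself needs no programming, but the sequence of steps it exposes (collections, filter and enrich, combine, export) can be mirrored in plain Python to show what each step does to the data. The sample rows, field names and the word-count enrichment below are all hypothetical and are not part of BigSheets or of the original slides.

```python
import csv
import io

# Two small "collections" (hypothetical sample data).
comments = [
    {"user": "ana", "text": "Love the new phone", "lang": "en"},
    {"user": "bo", "text": "Battery died again", "lang": "en"},
    {"user": "cy", "text": "Me encanta", "lang": "es"},
]
profiles = [
    {"user": "ana", "country": "CA"},
    {"user": "bo", "country": "US"},
]

# Filter: keep English rows. Enrich: add a derived word-count column.
english = [
    dict(row, words=len(row["text"].split()))
    for row in comments
    if row["lang"] == "en"
]

# Combine: join the two collections on the "user" column.
country_by_user = {p["user"]: p["country"] for p in profiles}
combined = [
    dict(row, country=country_by_user.get(row["user"], "unknown"))
    for row in english
]

# Export: write the combined collection as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user", "text", "lang", "words", "country"])
writer.writeheader()
writer.writerows(combined)
exported = buffer.getvalue()
print(exported)
```

The visualization step is omitted here; in a spreadsheet-style tool the combined collection would be charted directly.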
InfoSphere BigInsights Added Value: Dev Tools
Development Environment
• Eclipse-based development environment
• Developer tools and a set of analytic extractors for fast adoption and reduction in coding and debugging time
• Plug-ins for Text Analytics, MapReduce programming, Jaql development, Hive query development, and more
How it works
• Built-in apps make it easy to run Big Data applications & tasks:
  – Import and export data from a database or files
  – Import and export Web and social data
  – Perform text analytics on specified content
  – Query HBase content
  – Query content stored in BigInsights using Big SQL
  – Execute Pig or Jaql applications
• Extensible! Build your own applications and make them easy to execute from an appealing application launcher
© 2013 IBM Corporation
InfoSphere BigInsights Added Value: Text Analytics
Advanced Text Analytics Engine
Automatically identify and understand key information in text
Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.
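To make "identify key information in text" concrete, here is a minimal rule-based sketch over the sample passage. It uses Python regular expressions as a stand-in and is not AQL or the BigInsights text analytics engine; the two patterns (a playing role followed by a two-word capitalized name, and a hyphenated score) are assumptions chosen to fit this one passage.

```python
import re

# Sample passage from the slide (ASCII apostrophe used for simplicity).
text = (
    "Football World Cup 2010, one team distinguished themselves well, "
    "losing to the eventual champions 1-0 in the Final. Early in the second half, "
    "Netherlands' striker, Arjen Robben, had a breakaway, but the keeper for Spain, "
    "Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win."
)

# Hypothetical rule 1: a role word, optionally "for <Team>", then a two-word name.
ROLE_AND_PLAYER = re.compile(
    r"\b([Ss]triker|[Kk]eeper|[Ww]inger)(?: for [A-Z][a-z]+)?,? ([A-Z][a-z]+ [A-Z][a-z]+)"
)
# Hypothetical rule 2: a match score written as digits-hyphen-digits.
SCORE = re.compile(r"\b\d+-\d+\b")

players = [(role.lower(), name) for role, name in ROLE_AND_PLAYER.findall(text)]
scores = SCORE.findall(text)

for role, name in players:
    print(f"{name}: {role}")
print("Scores:", scores)
```

Hand-written patterns like these are brittle; a declarative rule language with an optimizer, as described for the text analytics engine, aims to make such extractors easier to write and faster to run at scale.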
InfoSphere BigInsights
Administration & Security
Workload Optimization
Connectors
Advanced Engines
Visualization & Exploration
Development Tools
Open source Hadoop components
Architecture Diagram
AQL -> Text Analytics Optimizer -> Compiled Operator Graph (.aog) -> Text Analytics Runtime
• AQL: Rule language with familiar SQL-like syntax. Specify annotator semantics declaratively
• Text Analytics Optimizer: Choose an efficient execution plan that implements the semantics
• Text Analytics Runtime: Highly scalable, embeddable Java runtime
• Input Document Stream -> Annotated Document Stream
InfoSphere BigInsights – Added Value: Connectors
Connectors
• Databases: DB2, Netezza, Oracle, Teradata
Integrations
• InfoSphere DataStage (data collection and integration)
• InfoSphere Streams (real-time streams processing)
• InfoSphere Guardium (security and monitoring)
• Cognos Business Intelligence (business intelligence capabilities)
• IBM Platform Computing (cluster/grid infrastructure and management)
and more…
BigInsights – Added Value: Workload optimization
Hadoop System Scheduler
• Identifies small and large jobs from prior experience
• Sequences work to reduce overhead
Adaptive MapReduce
• Drop-in replacement for Hadoop batch scheduler
• Dramatic performance gains for latency-sensitive application workloads
• Agile scheduling, dynamically adjust priorities at run-time
BigInsights – Added Value: Web Console
Web Console
• Start / stop services
• Run / monitor jobs (applications)
• Explore / modify file system
• Built-in apps simplify common tasks
BigInsights – Added Value: Security
Security
• LDAP authentication
• Support for PAM & flat file configuration
• Administrators restrict access to authorized users
• HTTPS support for the InfoSphere BigInsights console, and reverse proxy
• Role-based access
How Streams Works
Continuous ingestion, continuous analysis
Achieve scale:
• By partitioning applications into software components
• By distributing across stream-connected hardware hosts
Infrastructure provides services for:
• Scheduling analytics across hardware hosts
• Establishing streaming connectivity
Operators: Filter / Sample, Transform, Classify, Correlate, Annotate
Where appropriate: elements can be fused together for lower communication latency
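The operator-pipeline idea can be sketched with Python generators, where each stage consumes items one at a time as they arrive. This is only an analogy and not the InfoSphere Streams API or its SPL language; the stage names follow the slide, while the sensor readings, the 0 to 200 validity range and the alert limit of 100 are made-up assumptions.

```python
from typing import Iterable, Iterator


def source(readings: Iterable[float]) -> Iterator[dict]:
    """Continuous ingestion: wrap each raw reading as a small record."""
    for seq, value in enumerate(readings):
        yield {"seq": seq, "value": value}


def filter_valid(stream: Iterator[dict]) -> Iterator[dict]:
    """Filter: drop readings outside an assumed plausible range (0..200)."""
    for item in stream:
        if 0 <= item["value"] <= 200:
            yield item


def transform(stream: Iterator[dict]) -> Iterator[dict]:
    """Transform: add the running mean of the last three valid readings."""
    window = []
    for item in stream:
        window = (window + [item["value"]])[-3:]
        yield dict(item, mean3=sum(window) / len(window))


def classify(stream: Iterator[dict], limit: float = 100.0) -> Iterator[dict]:
    """Classify: label each record by comparing the smoothed value to a limit."""
    for item in stream:
        label = "alert" if item["mean3"] > limit else "normal"
        yield dict(item, label=label)


# Hypothetical sensor readings; -1 stands for a faulty sample.
readings = [72, 75, -1, 140, 150, 160, 90]
pipeline = classify(transform(filter_valid(source(readings))))
results = list(pipeline)
for item in results:
    print(item)
```

Here all stages run in one process and hand records to each other directly, which is loosely comparable to fused operators; in a distributed deployment each stage could instead run on a different host and exchange records over the network.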
The Future of Big Data and Cloud
SQL for Hadoop support improvements – towards full ANSI support
Hive
Impala (Cloudera)
Big SQL (IBM)
Stinger (Hortonworks)
Drill (MapR)
HAWQ (Pivotal)
SQL-H (Teradata)
Improvements in Multimedia Analytics
Growth in usage and adoption of R programming language
Cloud Bare metal support helping with Hadoop workloads
Private network
Full support with APIs
Big SQL overview
Big SQL fully integrates with SQL applications and BI tooling with benefits including:
• Existing queries run with no or few modifications
• Existing JDBC and ODBC compliant tools can be leveraged
• Applications do not have to compensate for constraints of Hive QL, which may result in:
  • more statements
  • potentially moving more data over the network to the application
Data Sources
Hive Tables HBase Tables CSV Files
BigSQL Engine
BigInsights
Application
SQL Language
JDBC / ODBC Driver
JDBC / ODBC Server
Try it out! Big SQL 3.0 Technology Preview: bigsql.imdemocloud.com
BigInsights on the Cloud - Making Learning Hadoop Easy and Fun
M2M Demos (using Streams)
• The Connected Car Demo
  – http://ausgsa.ibm.com/projects/c/connected_car/index.html
  – http://m2m.demos.ibm.com/
YouTube IBM Big Data Channel
  – http://www.youtube.com/user/ibmbigdata
Big Data University (bigdatauniversity.com)
Flexible on-line delivery allows learning @your place and @your pace
Free courses, free study materials.
Cloud-based sandbox for exercises – zero setup with Robust Course Management System and Content Distribution infrastructure
169,000 registered students.
Free IBM Hadoop, BigInsights Publications
Big Data University (bigdatauniversity.com)
BigInsights on the Cloud - Making Learning Hadoop Easy and Fun
Quick Start Editions available (free, non-production, no time bomb):
– IBM InfoSphere BigInsights (IBM’s Hadoop Distribution): ibm.co/QuickStart
– IBM InfoSphere Streams: ibm.co/streamsqs
Big Data University (bigdatauniversity.com)
My contact information
Twitter: @raulchong
Facebook: facebook.com/raul.f.chong
LinkedIn: linkedin.com/pub/raul-f-chong/8/aa2/b63