02 a holistic approach to big data

Raul F. Chong Senior Big Data and Cloud Program Manager Big Data University Community Leader @raulchong A holistic approach to Big Data © 2013 BigDataUniversity.com

Upload: raul-chong

Post on 20-Aug-2015


Raul F. Chong
Senior Big Data and Cloud Program Manager
Big Data University Community Leader
@raulchong

A holistic approach to Big Data

© 2013 BigDataUniversity.com

Agenda

Introduction to Big Data

The state of Big Data adoption

Big Data – A holistic approach

The 5 high value Big Data use cases

Technical details of key Big Data components

The future of Big Data and Cloud

Demos

Resources

What is Big Data?

Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools.

Difficulties include capture, storage, search, sharing, analytics, and visualizing.

Source: Wikipedia

Big Data Characteristics

Information is growing at a phenomenal rate

44x as much data and content over the coming decade

2009: 800,000 petabytes

2020: 35 zettabytes (= 4 trillion 8 GB iPods)

Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
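
The growth figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, which the slide does not state; the variable names are illustrative only.

```python
# Assumption: decimal (SI) units, i.e. 1 PB = 10**15 bytes and 1 ZB = 10**21 bytes.
PETABYTE = 10**15
ZETTABYTE = 10**21
GIGABYTE = 10**9

data_2009 = 800_000 * PETABYTE   # 800,000 petabytes (from the slide)
data_2020 = 35 * ZETTABYTE       # 35 zettabytes (from the slide)

growth_factor = data_2020 / data_2009        # how many times larger 2020 is than 2009
ipods_8gb = data_2020 / (8 * GIGABYTE)       # number of 8 GB devices needed to hold it

print(f"growth: {growth_factor:.2f}x")       # growth: 43.75x
print(f"8 GB iPods: {ipods_8gb:.3e}")        # 8 GB iPods: 4.375e+12
```

Under that assumption the ratio is 43.75, which the slide rounds to 44x, and the device count is about 4.4 trillion, which the slide rounds to 4 trillion.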

Big Data Characteristics

• About 80% of the world’s data is unstructured

• It may be data we’ve been collecting before, but could not process

Types of Big Data

• Data in movement - streams
– Twitter / Facebook comments
– Stock market data
– Sensors: vital signs of a newborn

• Data at rest - oceans
– Collection of what has streamed
– Web logs, emails, social media
– Unstructured documents: forms, claims
– Structured data from disparate systems

Traditional vs. big data business approaches

Traditional Approach: Structured & Repeatable Analysis
• Business Users: Determine what question to ask
• IT: Structures the data to answer that question
• Examples: Monthly sales reports, Profitability analysis, Customer surveys

Big Data Approach: Iterative & Exploratory Analysis
• IT: Delivers a platform to enable creative discovery
• Business: Explores what questions could be asked
• Examples: Brand sentiment, Product strategy, Maximum asset utilization

Applications for Big Data Analytics

• Homeland Security
• Finance
• Smarter Healthcare
• Multi-channel sales
• Telecom
• Manufacturing
• Traffic Control
• Trading Analytics
• Fraud and Risk
• Log Analysis
• Search Quality
• Retail: Churn, NBO

Big Data Adoption Phases

Use of Big Data globally and in the financial sector

Multiple responses accepted

Big Data: In Demand Well Paying Skill

• Skills are in demand
• Pays well

“If you can claim to be a data scientist and have the chops to back that up, you can pretty much write your own ticket even in this tough job market.”

Source: Gigaom http://gigaom.com/cloud/big-data-skills-bring-big-dough/

KTH Swedish Royal Institute of Technology: Reducing Traffic Congestion

• Deployed real-time Smarter Traffic system to predict and improve traffic flow.
• Analyzes streaming real-time data gathered from cameras at entry/exit to city, GPS data from taxis and trucks, and weather information.
• Predicts best time and method to travel, such as when to leave to catch a flight at the airport

Results
• Enables ability to analyze and predict traffic faster and more accurately than ever before
• Provides new insight into mechanisms that affect a complex traffic system
• Smarter, more efficient, and more environmentally friendly traffic

University of Southern California Innovation Lab Monitors Political Debates

Benefits
• Real-time display of public sentiment as candidates respond to questions
• Debate winner prediction based on public opinion instead of solely political analysts

Big Data – A holistic approach

Big Data is Not Only Hadoop!

Examples where Hadoop is not entirely applicable:

– Cyber security, Stock market, Traffic control, Sensor information, monitoring trends in Social Media

– What if your company has many silos of information, difficult to move to HDFS?

– What about governance? Can we trust the source of this data?

Big data holistic approach: A platform

Layers shown in the diagram: Solutions; Analytics and Decision Management; Big Data Platform; Big Data Infrastructure

The IBM Big Data Platform
• Data Warehouse: Delivers deep insight with advanced in-database analytics & operational analytics
• Stream Computing: Analyze streaming data and large data bursts for real-time insights
• Hadoop System: Cost-effectively analyze petabytes of unstructured and structured data
• Information Integration & Governance: Govern data quality and manage the information lifecycle
• Accelerators: Speed time to value with analytic and application accelerators
• Visualization & Discovery: Discover, understand, search, and navigate federated sources of big data
• Systems Management
• Application Development

Process any type of data

– Structured, unstructured, in-motion, at-rest, in-place

Built-for-purpose engines

– Designed to handle different requirements

Manage and govern data in the ecosystem

Enterprise data integration

Grow and evolve on current infrastructure

The whole is greater than the sum of parts

Integrated components

Out of the box, standards-based services

Start small (value is additive)

An example of the big data platform in practice

• Ingestion and Real-time Analytic Zone: Streams, Connectors
• Landing and Analytics Sandbox Zone: Hadoop, MapReduce, Hive/HBase, Col Stores, Documents in variety of formats
• Warehousing Zone: Enterprise Warehouse, Data Marts
• Analytics and Reporting Zone: BI & Reporting, Predictive Analytics, Visualization & Discovery
• Metadata and Governance Zone: ETL, MDM, Data Governance

The 5 High Value Big Data Use Cases

• Big Data Exploration: Find, visualize, understand all big data to improve business knowledge
• Enhanced 360º View of the Customer: Achieve a true unified view, incorporating internal and external sources
• Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real-time
• Operations Analysis: Analyze a variety of machine data for improved business results
• Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency

Find, visualize and understand all big data to improve business knowledge
• Greater efficiencies in business processes
• New insights from combining and analyzing data types in new ways
• Develop new business models with resulting increased market presence and revenue

Big Data Exploration: Illustrated

• Data sources: CM, RM, DM; RDBMS; Feeds; Web 2.0; Email; Web; CRM, ERP; File Systems
• Data Explorer: Connector Framework, App Builder, UI / User
• Platform components: Hadoop, Streams, Warehouse, Integration & Governance

Big Data Exploration: Example in Practice

Airplane Manufacturer (blinded for confidentiality)
• Exploring 4 TB to drive point business solutions (supplier portal, call center, etc.)
• Single point of data fusion for all employees to use
• Reduced costs & improved operational performance for the business

Is Big Data Exploration Right for You?
• How do you separate the “noise” from useful content?
• How do you perform data exploration on large and complex data?
• How do you find insights in new or unstructured data types (e.g. social media and email)?
• How do you enable employees to navigate and explore enterprise and external content? Can you present this in a single user interface?
• How do you identify areas of data risk before they become a problem?
• What is the starting point for your big data initiatives?

Big Data Platform Component Starting Point: Data Explorer

Enhanced 360º View of the Customer: Illustrated

SOURCE SYSTEMS

CRM
Name: J Robertson
Address: 35 West 15th
Address: Pittsburgh, PA 15213

ERP
Name: Janet Robertson
Address: 35 West 15th St.
Address: Pittsburgh, PA 15213

Legacy
Name: Jan Robertson
Address: 36 West 15th St.
Address: Pittsburgh, PA 15213

Master Data Management

360 View of Party Identity
First: Janet
Last: Robertson
Address: 35 West 15th St
City: Pittsburgh
State/Zip: PA / 15213
Gender: F
Age: 48
DOB: 1/4/64

Unified View of Party’s Information: Hadoop, Streams, Warehouse
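
The record matching pictured on this slide can be sketched in a few lines. This is only an illustration using Python's standard library, not how IBM's Master Data Management product works; the matching rules, the 0.8 threshold, and all function names are assumptions made for the example.

```python
from collections import Counter
from difflib import SequenceMatcher

# The three source records from the slide.
SOURCE_RECORDS = [
    {"source": "CRM", "name": "J Robertson", "street": "35 West 15th", "city": "Pittsburgh, PA 15213"},
    {"source": "ERP", "name": "Janet Robertson", "street": "35 West 15th St.", "city": "Pittsburgh, PA 15213"},
    {"source": "Legacy", "name": "Jan Robertson", "street": "36 West 15th St.", "city": "Pittsburgh, PA 15213"},
]

STREET_SUFFIXES = {"st", "street"}


def normalize(text):
    """Lowercase, strip punctuation, and drop street-suffix words."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(w for w in cleaned.split() if w not in STREET_SUFFIXES)


def names_compatible(name_a, name_b):
    """Hypothetical rule: same last name, and one first name is a prefix of the other."""
    a, b = normalize(name_a).split(), normalize(name_b).split()
    return a[-1] == b[-1] and (a[0].startswith(b[0]) or b[0].startswith(a[0]))


def same_party(rec_a, rec_b, street_threshold=0.8):
    """Hypothetical rule: same city line, compatible names, similar street."""
    street_score = SequenceMatcher(
        None, normalize(rec_a["street"]), normalize(rec_b["street"])
    ).ratio()
    return (
        normalize(rec_a["city"]) == normalize(rec_b["city"])
        and names_compatible(rec_a["name"], rec_b["name"])
        and street_score >= street_threshold
    )


def golden_record(records):
    """Merge matched records: longest name, most frequent street (longest spelling)."""
    name = max((r["name"] for r in records), key=len)
    common_street, _ = Counter(normalize(r["street"]) for r in records).most_common(1)[0]
    street = max(
        (r["street"] for r in records if normalize(r["street"]) == common_street), key=len
    )
    return {"name": name, "street": street, "city": records[0]["city"]}


matched = [r for r in SOURCE_RECORDS if same_party(SOURCE_RECORDS[0], r)]
print(golden_record(matched))
```

With these rules all three source records resolve to one party, and the merged record keeps the fullest name and the street number that two of the three systems agree on.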

Security/Intelligence Extension: Illustrated

Traditional Security Operations and Technology
• Logs, Events, Alerts
• Configuration information
• System audit trails
• External threat intelligence feeds
• Network flows and anomalies
• Identity context

Big Data Analytics
• Web page text
• Video/audio surveillance
• E-mail and social activity
• Business process data
• Customer transactions

New Considerations

Collection, Storage and Processing
• Collection and integration
• Size and speed
• Enrichment and correlation

Analytics and Workflow
• Visualization
• Unstructured analysis
• Learning and prediction
• Customization
• Sharing and export

“Reconstructing Events” – Integrating Multimedia from Diverse Sources

• Correlate multimedia content across a wide diversity of sources and dynamic topology of cameras

• Exploit partial overlaps in field of view, re-identification of objects/people and contextual information

• Obtain real-time operational picture across diverse content
• 100K security cameras (static cameras, slowly changing topology)
• 10M mobile photos/day (limited knowledge about locations)
• 50M social media photos/video (uncertain geo-temporal context)
• Moving vehicles (patrol cars), overhead drones, broadcast, retail, 311, etc.

Sources pictured: Security Cameras, Mobile Cameras, Social Media, Overhead

Security/Intelligence Extension: Customer Example

Captured and analyzed 42TB of daily traffic in real-time for tracking persons of interest to take suitable action and reduce risk.

Would the Security / Intelligence Extension benefit you?
• What are your plans to enrich your security or intel system with unused or underleveraged data sources (video, audio, smart devices, network, Telco, social media)?
• How will you address the need for sub-second detection, identification, and resolution of physical or cyber threats?
• How do you intend to follow activities of criminals, terrorists, or persons in a blacklist?
• How do you plan to enhance your surveillance system with real-time data from video, acoustic, thermal or other security sensors?
• Do you want to correlate lots of technical or human intel data and sources looking for associations or patterns (big data forensics)?
• How are you going to deal with unstructured data (email, social, etc.) in your Security Information & Event Management (SIEM) solution to improve cyber threat detection & remediation?

Big Data Platform Component Starting Point: Streams, Hadoop

Operations Analysis: Illustrated

Raw Logs and Machine Data feed the Machine Data Accelerator:
• Indexing, Search
• Statistical Modeling
• Root Cause Analysis
• Federated Navigation & Discovery
• Real-time Analysis
• Only store what is needed

Operations analysis is a Business Imperative

Cost of System Down Time
– 49% of Fortune 500 companies > 80 hrs down time/year [1]
• Cost of down time: $90,000/hr to $6.48 million/hr
• 80 hours * $6.48M = approx $500M per year
– System downtime costs North American businesses $26.5 billion a year in lost revenue [2]

[1] http://www.information-management.com/infodirect/2009_133/downtime_cost-10015855-1.html
[2] http://www.itchannelplanet.com/business_news/article.php/3916786/IT-System-Downtime-Costs-265-Billion-A-Year-Study-Finds.htm
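
The annual downtime cost quoted above follows from the hourly figures on the slide. A minimal sketch of that arithmetic, with illustrative variable names:

```python
# Figures taken from the slide; variable names are illustrative.
HOURS_DOWN_PER_YEAR = 80
COST_PER_HOUR_LOW = 90_000        # $90,000 per hour
COST_PER_HOUR_HIGH = 6_480_000    # $6.48 million per hour

annual_low = HOURS_DOWN_PER_YEAR * COST_PER_HOUR_LOW
annual_high = HOURS_DOWN_PER_YEAR * COST_PER_HOUR_HIGH

print(f"${annual_low:,} to ${annual_high:,} per year")
# $7,200,000 to $518,400,000 per year
```

The upper bound is $518.4M, which the slide rounds to approximately $500M; the lower bound at the same 80 hours would be $7.2M.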

Operations Analysis: Customer Example

• Intelligent Infrastructure Management: log analytics, energy bill forecasting, energy consumption optimization, anomalous energy usage detection, presence-aware energy management

• Optimized building energy consumption with centralized monitoring; Automated preventive and corrective maintenance

• Utilized InfoSphere Streams, InfoSphere BigInsights, IBM Cognos

Would Operations Analysis benefit you?
• Do you deal with large volumes of machine data?
• How do you access and search that data?
• How do you perform root cause analysis?
• How do you perform complex real-time analysis to correlate across different data sets?
• How do you monitor and visualize streaming data in real time and generate alerts?

Big Data Platform Component Starting Point: Hadoop, Streams

Data Warehouse Augmentation: Needs

Integrate big data and data warehouse capabilities to increase operational efficiency

Need to leverage variety of data
• Structured, unstructured, and streaming data sources required for deep analysis
• Low latency requirements (hours, not weeks or months)
• Required query access to data

Extend warehouse infrastructure
• Optimized storage, maintenance and licensing costs by migrating rarely used data to Hadoop
• Reduced storage costs through smart processing of streaming data
• Improved warehouse performance by determining what data to feed into it

Data Warehouse Augmentation: Illustrated
• Filter and summarize big data for the warehouse (Hadoop)
• Hadoop as a query-ready archive for a data warehouse
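
The "filter and summarize big data for the warehouse" pattern can be sketched in plain Python. This is an illustration only: in practice this step would run as a Hadoop job, and the event fields, sample data, and function name below are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw events, standing in for detailed data landed in Hadoop.
RAW_EVENTS = [
    {"day": "2013-05-01", "store": "A", "type": "sale", "amount": 20.0},
    {"day": "2013-05-01", "store": "A", "type": "page_view", "amount": 0.0},
    {"day": "2013-05-01", "store": "A", "type": "sale", "amount": 5.5},
    {"day": "2013-05-01", "store": "B", "type": "sale", "amount": 12.0},
    {"day": "2013-05-02", "store": "A", "type": "sale", "amount": 7.0},
]


def summarize_for_warehouse(events):
    """Keep only sales and roll them up to one row per (day, store)."""
    totals = defaultdict(lambda: {"sales": 0, "revenue": 0.0})
    for event in events:
        if event["type"] != "sale":      # filter: drop events the warehouse does not need
            continue
        key = (event["day"], event["store"])
        totals[key]["sales"] += 1        # summarize: count and sum per key
        totals[key]["revenue"] += event["amount"]
    return [
        {"day": day, "store": store, **agg}
        for (day, store), agg in sorted(totals.items())
    ]


for row in summarize_for_warehouse(RAW_EVENTS):
    print(row)
```

Only the three summary rows would be loaded into the warehouse; the five raw events stay in the cheaper store, where they remain available as a query-ready archive.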

Open Source Hadoop

• Runtime: MapReduce
• File System: HDFS
• Data Store: HBase
• Development Tools: Eclipse Plug-ins
• Other open source components shown: Flume, Jaql, Pig, ZooKeeper, Lucene, Oozie, Hive, Mahout, Whirr, Sqoop, Hue, HCatalog, R
• Layer labels in the diagram: Visualization & Discovery, Connectors, Workload Optimization, Advanced Engines, Systems Management

IBM InfoSphere BigInsights v2.1 Enterprise Edition: Added Value on Top of Open Source Hadoop

• Layer labels in the diagram: Visualization & Discovery, Integration, Applications & Development, Administration, Workload Optimization, Advanced Analytic Engines, Runtime, File System, Data Store, Management
• Legend: IBM components and Open Source components
• Components shown: BigSheets, Dashboard & Visualization, Apps, Workflow, Monitoring, Streams, Netezza, Flume, DB2, DataStage, JDBC, Sqoop, Guardium, Platform Computing, Cognos, R, Text Analytics, Text Processing Engine & Extractor Library, Adaptive Algorithms, Big SQL, Jaql, Pig, Hive, HCatalog, Mahout, Lucene, Oozie, ZooKeeper, Whirr, Hue, MapReduce, Adaptive MapReduce, Flexible Scheduler, Splittable Text Compression, Index, Enhanced Security, High Availability, Integrated Installer, Admin Console, HDFS, GPFS, HBase, Security, Audit & History, Lineage

InfoSphere BigInsights Added Value

InfoSphere BigInsights
• Visualization & Exploration
• Development Tools: Development Environment
• Analytics/Extractors: Extraction engine (System T), Extractors and APIs
• Accelerators
• Connectors
• Workload Optimization (MapReduce/SQL): Workload Management, SQL API
• Administration & Security
• Open source based components: IBM tested & supported open source components

InfoSphere BigInsights Added Value: Accelerators

Social Data Analytics Accelerator Architecture

Social Media Data is ingested and prepared (Data Ingest and Prep), then follows two flows:

Online flow: Data-in-motion analysis (Stream Computing and Analytics)
• Extract Buzz, Intent, Sentiment
• Entity Analytics: Profile Resolution
• Real time analytics. Pre-defined views and charts
• Dashboard

Offline flow: Data-at-rest analysis (BigInsights System and Analytics)
• Extract Buzz, Intent, Sentiment and Consumer Profiles
• Entity Analytics and Integration
• Comprehensive Social Media Customer Profiles
• Pre-defined Workbooks and Dashboards

Optional: Indexed Search
• Index using Push API
• Data Explorer
• Ad hoc access

InfoSphere BigInsights Added Value: BigSheets

BigSheets Visualization and Exploration
• Web-based analysis and visualization for users
• Familiar spreadsheet-like interface
• Define and manage long running data collection jobs
• No programming knowledge needed!

How it works
• Model “big data” collected from various sources as collections
• Filter and enrich content with built-in functions
• Combine data in different collections
• Visualize results through spreadsheets, charts
• Export data into common formats (if desired)
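
BigSheets itself needs no programming, but the sequence of steps above (collections, filter and enrich, combine, export) can be illustrated in code for readers who think that way. The sketch below is plain Python on made-up data and is not the BigSheets API; every name in it is hypothetical, and the visualization step is left out.

```python
import csv
import io

# Two hypothetical "collections" gathered from different sources.
tweets = [
    {"user": "ana", "text": "Love the new phone", "lang": "en"},
    {"user": "bo", "text": "Battery is bad", "lang": "en"},
    {"user": "cy", "text": "hola", "lang": "es"},
]
profiles = [
    {"user": "ana", "country": "CA"},
    {"user": "bo", "country": "US"},
]

# Filter: keep English rows only.
english = [row for row in tweets if row["lang"] == "en"]

# Enrich: add a derived column, as a built-in function would.
enriched = [{**row, "word_count": len(row["text"].split())} for row in english]

# Combine: join the two collections on the user column.
country_by_user = {p["user"]: p["country"] for p in profiles}
combined = [
    {**row, "country": country_by_user.get(row["user"], "unknown")} for row in enriched
]

# Export: write the result to a common format (CSV).
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["user", "text", "lang", "word_count", "country"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(combined)
csv_text = buffer.getvalue()
print(csv_text)
```

Each step produces a new collection from the previous one, which mirrors how a spreadsheet-style tool lets a user build up an analysis one sheet at a time.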

InfoSphere BigInsights Added Value: Dev Tools

Development Environment
• Eclipse based dev environment
• Developer tools and a set of analytic extractors for fast adoption and reduction in coding and debugging time
• Plugin for Text Analytics, MapReduce programming, Jaql development, Hive query development, and more

How it works
• Built-in Apps make it easy to run Big Data applications & tasks:
– Import and Export Data from a Database or files
– Import and Export Web and Social Data
– Perform Text Analytics on specified content
– Query HBase Content
– Query content stored in BigInsights using Big SQL
– Execute Pig or JAQL applications
• Extensible! Build your own applications and make them easy to execute from an appealing Application launcher

© 2013 IBM Corporation

InfoSphere BigInsights Added Value: Text Analytics

Advanced Text Analytics Engine

Automatically identify and understand key information in text

Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas, made the save. Winger Andres Iniesta scored for Spain for the win.
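
To give a feel for rule-based extraction on the sample text, here is a minimal sketch in Python. BigInsights text analytics uses its own rule language (AQL), so this is only a stand-in: the single regular-expression rule and all names below are assumptions made for the illustration.

```python
import re

TEXT = (
    "Early in the second half, Netherlands' striker, Arjen Robben, had a "
    "breakaway, but the keeper for Spain, Iker Casillas, made the save. "
    "Winger Andres Iniesta scored for Spain for the win."
)

# Hypothetical rule: a position word, an optional "for <Team>", an optional
# comma, then a capitalized first and last name.
PLAYER_RULE = re.compile(
    r"(?P<position>(?i:striker|keeper|winger))"
    r"(?: for (?P<team>[A-Z][a-z]+))?"
    r",? (?P<player>[A-Z][a-z]+ [A-Z][a-z]+)"
)


def extract_players(text):
    """Return one annotation per rule match, in document order."""
    return [
        {
            "player": m.group("player"),
            "position": m.group("position").lower(),
            "team": m.group("team"),  # None when the rule did not see "for <Team>"
        }
        for m in PLAYER_RULE.finditer(text)
    ]


annotations = extract_players(TEXT)
for annotation in annotations:
    print(annotation)
```

The rule finds the three players and their positions, but only links Casillas to a team; resolving that Robben plays for the Netherlands or Iniesta for Spain would need further rules, which is the kind of work a dedicated extraction engine is built for.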

Sentiments for movie Ra.One :-(

Architecture Diagram

• AQL: Rule language with familiar SQL-like syntax. Specify annotator semantics declaratively
• Text Analytics Optimizer: Choose an efficient execution plan that implements the semantics. Produces a Compiled Operator Graph (.aog)
• Text Analytics Runtime: Highly scalable, embeddable Java runtime. Takes an Input Document Stream and produces an Annotated Document Stream

InfoSphere BigInsights – Added Value: Connectors

Connectors
• Databases
– DB2, Netezza, Oracle, Teradata

Integrations
• InfoSphere DataStage (data collection and integration)
• InfoSphere Streams (real-time streams processing)
• InfoSphere Guardium (security and monitoring)
• Cognos Business Intelligence (Business Intelligence capabilities)
• IBM Platform Computing (cluster/grid infrastructure and management) and more…

BigInsights – Added Value: Workload optimization

Diagram labels: Task, Map, Adaptive Map, Reduce

Hadoop System Scheduler
• Identifies small and large jobs from prior experience
• Sequences work to reduce overhead

Adaptive MapReduce
• Drop-in replacement for Hadoop batch scheduler
• Dramatic performance gains for latency-sensitive application workloads
• Agile scheduling, dynamically adjust priorities at run-time

BigInsights – Added Value: Web Console

Web Console
• Start / stop services
• Run / monitor jobs (applications)
• Explore / modify file system
• Built in Apps simplify common tasks

BigInsights – Added Value: Security

Security
• LDAP authentication
• Support for PAM & Flat File configuration
• Administrators restrict access to authorized users
• HTTPS support for the InfoSphere BigInsights console, and reverse proxy
• Role based access

How Streams Works

Continuous ingestion, continuous analysis

Example operators: Filter / Sample, Transform, Annotate, Classify, Correlate

Achieve scale:
• By partitioning applications into software components
• By distributing across stream-connected hardware hosts

Infrastructure provides services for:
• Scheduling analytics across hardware hosts
• Establishing streaming connectivity

Where appropriate: Elements can be fused together for lower communication latency
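
The idea of chaining small operators over a continuous flow can be sketched with Python generators. This is an analogy only, not InfoSphere Streams or its programming language: it runs in a single process, and the sensor fields, threshold, and operator names are hypothetical.

```python
def ingest(readings):
    """Source operator: emit readings one at a time as they arrive."""
    for reading in readings:
        yield reading


def filter_valid(stream):
    """Filter operator: drop readings that carry no speed measurement."""
    for reading in stream:
        if reading["speed_kmh"] is not None:
            yield reading


def transform(stream):
    """Transform operator: add the speed converted to miles per hour."""
    for reading in stream:
        yield {**reading, "speed_mph": round(reading["speed_kmh"] * 0.621371, 1)}


def classify(stream, slow_below_kmh=20):
    """Classify operator: label each reading as congested or flowing."""
    for reading in stream:
        status = "congested" if reading["speed_kmh"] < slow_below_kmh else "flowing"
        yield {**reading, "status": status}


# Hypothetical sample input standing in for a live feed.
SAMPLE = [
    {"sensor": "cam-1", "speed_kmh": 12},
    {"sensor": "cam-2", "speed_kmh": None},
    {"sensor": "taxi-7", "speed_kmh": 55},
]

# Each reading flows through the whole chain before the next one is read.
results = list(classify(transform(filter_valid(ingest(SAMPLE)))))
for result in results:
    print(result)
```

Because generators are lazy, each reading is processed end to end as it arrives rather than after the whole input is stored. Running the chain inside one process loosely resembles fused elements; a real stream platform would also be able to place operators on different hosts.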

The Future of Big Data and Cloud

SQL for Hadoop support improvements – towards full ANSI support

Hive

Impala (Cloudera)

Big SQL (IBM)

Stinger (Hortonworks)

Drill (MapR)

HAWQ (Pivotal)

SQL-H (Teradata)

Improvements in Multimedia Analytics

Growth in usage and adoption of R programming language

Cloud Bare metal support helping with Hadoop workloads

Private network

Full support with APIs

Big SQL overview

Big SQL fully integrates with SQL applications and BI tooling with benefits including:

• Existing queries run with no or few modifications

• Existing JDBC and ODBC compliant tools can be leveraged

• Applications do not have to compensate for constraints of Hive QL which may result in:
– more statements
– potentially moving more data over the network to the application

Architecture: Application → SQL Language → JDBC / ODBC Driver → JDBC / ODBC Server → BigSQL Engine (BigInsights) → Data Sources (Hive Tables, HBase Tables, CSV Files)

Try it out! Big SQL 3.0 Technology Preview: bigsql.imdemocloud.com

BigInsights on the Cloud - Making Learning Hadoop Easy and Fun

M2M Demos (using Streams)
• The Connected Car Demo
– http://ausgsa.ibm.com/projects/c/connected_car/index.html
– http://m2m.demos.ibm.com/

YouTube IBM Big Data Channel
– http://www.youtube.com/user/ibmbigdata

Big Data University (bigdatauniversity.com)

Flexible on-line delivery allows learning @your place and @your pace

Free courses, free study materials.

Cloud-based sandbox for exercises – zero setup with Robust Course Management System and Content Distribution infrastructure

169,000 registered students.

Free IBM Hadoop, BigInsights Publications

Big Data University (bigdatauniversity.com)

BigInsights on the Cloud - Making Learning Hadoop Easy and Fun

Quick Start Editions available (Free, non-production, no time bomb):
– IBM InfoSphere BigInsights (IBM’s Hadoop Distribution): ibm.co/QuickStart
– IBM InfoSphere Streams: ibm.co/streamsqs

Big Data University (bigdatauniversity.com)

My contact information

Contact Info:
Twitter: @raulchong
Facebook: facebook.com/raul.f.chong
LinkedIn: linkedin.com/pub/raul-f-chong/8/aa2/b63

Thank You!

© 2013 BigDataUniversity.com