big data landscape

Big data: The technology landscape and its applications. Natalino Busa - 12 Feb. 2013

Upload: natalino-busa

Post on 05-Dec-2014


Category:

Technology



DESCRIPTION

An overview of several technologies that contribute to the Big Data landscape. An introduction to the technology challenges of Big Data, followed by key open-source components that help with various big data aspects such as OLAP, real-time online analytics, and machine learning on Map-Reduce. I conclude with an enumeration of the key areas where these technologies are most likely to unlock new opportunities for various businesses.

TRANSCRIPT

Page 1: Big data landscape

Big data: The technology landscape and its applications.

Natalino Busa - 12 Feb. 2013

Page 2: Big data landscape

Outline

● Big Data: Who art thou?
● Big Data: The technology landscape
● Hadoop: Overview
● Analytics & Machine Learning
● Opportunities


Page 3: Big data landscape

Hype cycle on new IT technologies

Gartner 2012


Page 4: Big data landscape

What is big data?

Velocity Diversity Volume

Hardware Software Services

BIG DATA

DATA (structured and un-structured, Logs, ETL, social)

Services: Marketing (e.g. Unica), Analytics (Tableau), Modeling (SAS)

Software: RDBMS, OLAP, Messaging

Hardware: Infrastructure, (Private) Cloud, Networking


Page 5: Big data landscape

Big Data Heat map


Page 6: Big data landscape

How big is big?

SkyTree (tm) defines the Analytics Requirements Index (ARI):

ARI = (# Rows × # Columns) / Time (secs)

Where:

# Rows = Number of records being analyzed

# Columns = Number of variables captured in each record

Time (secs) = The timeframe within which to complete the analysis

Example: To produce a personalized banner for each view (1000 views/sec), I need to analyze 100 variables on 1000 records (historic data) every 1 ms.

ARI = (1000 × 100) / 0.001 = 100 M values/sec
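The ARI arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `analytics_requirements_index` and its parameter names are made up for illustration and are not part of any SkyTree product.

```python
def analytics_requirements_index(rows: int, columns: int, seconds: float) -> float:
    """Return ARI = (rows * columns) / seconds, in values per second.

    Hypothetical helper: it only restates the formula on this slide.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows * columns / seconds


# Banner example from the slide: 1000 records, 100 variables, 1 ms budget.
ari = analytics_requirements_index(rows=1000, columns=100, seconds=0.001)
print(f"{ari:,.0f} values/sec")
```

Halving the time budget doubles the index, which is why the response-time requirement matters as much as the data size.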


Page 7: Big data landscape

What data?

Big Data can imply:

● Complex data refactoring in batch (lots of rows)
● Real-time event processing (high-speed responses)
● Multidimensional analysis (lots of parameters)
● ... or any combination of those three


[Figure: three axes labeled Parameters, Entities, Response time]

Page 8: Big data landscape

More data

From structured to unstructured: Database, Databases, Federated Data, Aggregated Data, Linked Data, Just Data

customers

customers + products

customers + products + surveys

customers + products + surveys + transactions

customers + products + surveys + transactions + social messages

● In today's IT environments there is a gradual shift from structured data to unstructured data.

RDBMSs are well suited to structured data -> but: ETL grows in volume and complexity, and how do we deal with new data (structures)?

Map-Reduce and NoSQL systems are good with unstructured data -> but: how do we query and analyze this data?


Page 9: Big data landscape

Big Data: how to deal with it

● Big Data at rest (storage, access)
● Big Data in motion (streaming, dataflows)
● Big Data analytics (OLAP, OTAP, BI)
● Big Data modeling (predictive, machine learning)


Page 10: Big data landscape

Big Data at rest

Analytical RDBMSs (EDW): Oracle, IBM, and various MPPs

Hadoop Distributed Systems: HDFS (distributed file system), HBase (Big Table)

[Diagram labels: Logs, HDFS, Cassandra, HBase, EDW, Analytics; columns: Batch, Real-time]


● Traditional EDW and distributed Big Data / NoSQL solutions are complementary to each other.

● These systems do not exclude each other and can coexist to form a full enterprise-level solution.

Page 11: Big data landscape

Big Data at rest

No need to get everything out of the Hadoop ecosystem:

NoSQL DBMSs:
● Couchbase (++ reads, caching)
● Cassandra (++ writes, OLAP)

... hybrid solutions are also possible:

● HDFS + Cassandra: in-memory analytics + large DFS
● HDFS + Solr/Lucene: fast text search on a distributed file system


Page 12: Big data landscape

Big Data in motion

Stream processing / Dataflow architectures

Used to support the automatic analysis of data in motion in real time or near real time:

- Identify meaningful patterns
- Trigger actions to respond to them as quickly as possible

Frameworks:

- Storm (from Twitter): dataflow processing framework, ++ multi-language
- Akka (from Typesafe): dataflow actor framework, ++ speed

Both are: distributed, fault-tolerant, streaming
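The "identify a pattern, then trigger an action" idea can be sketched in plain Python as below. It deliberately uses neither the Storm nor the Akka API, and it runs on a single machine with no fault tolerance. The function `detect_bursts`, its parameters, and the sample timestamps are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable


def detect_bursts(
    events: Iterable[float],
    window_secs: float,
    threshold: int,
    on_burst: Callable[[float, int], None],
) -> int:
    """Scan event timestamps in arrival order and call on_burst(ts, count)
    whenever at least `threshold` events fall in the last `window_secs`."""
    window: deque = deque()
    triggered = 0
    for ts in events:
        window.append(ts)
        # Drop events that have fallen out of the sliding window.
        while window and ts - window[0] > window_secs:
            window.popleft()
        if len(window) >= threshold:
            on_burst(ts, len(window))  # the "trigger action" step
            triggered += 1
    return triggered


# Hypothetical stream: two bursts of page views, about five seconds apart.
alerts = []
timestamps = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 5.3]
detect_bursts(timestamps, window_secs=1.0, threshold=3,
              on_burst=lambda ts, n: alerts.append((ts, n)))
print(alerts)
```

A framework such as Storm or Akka would distribute this kind of per-event logic across many workers; the sketch only shows the per-event step.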


Page 13: Big data landscape

Big Data Landscape

[Diagram labels: Logs (unstructured), FS, HDFS, HBase, Cassandra, EDW; data interfaces: sqoop, hiho, flume, scribe, REST; Hive, Pig, MapR, Impala, STORM, Mahout, SAS and R over HDFS; OLAP, OTAP, BI]

● Real-Time Analytics
● Streaming
● Batch Analytics
● Visualization
● Monitoring
● Marketing
● Machine Learning on Big Data

Page 14: Big data landscape

Lambda Architecture

Logic layer: Software as a Service, e.g. a real-time predictor

From http://www.manning.com/marz/

Page 15: Big data landscape

Why do machine learning on big data


http://www.skytree.net/why-do-machine-learning-on-big-data/

Page 16: Big data landscape

Machine Learning: What?

SIMILARITY SEARCH

Similarity search provides a way to find the objects that are the most similar, in an overall sense, to the object(s) of interest.


PREDICTIVE ANALYTICS

Predictive analytics is the science of analyzing current and historical facts/data to make predictions about future events.

CLUSTERING AND SEGMENTATION

Cluster analysis and segmentation represent a purely data-driven approach to grouping similar objects, behaviors, or whatever is represented by the data.
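Of the three tasks, similarity search is the easiest to illustrate in a few lines. The sketch below ranks items by cosine similarity to a query vector. Cosine similarity is one common choice of measure, not one named on the slide, and the customer vectors are invented sample data.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query: Sequence[float],
                 items: Dict[str, Sequence[float]],
                 k: int = 2) -> List[str]:
    """Return the names of the k items most similar to the query."""
    ranked = sorted(items,
                    key=lambda name: cosine_similarity(query, items[name]),
                    reverse=True)
    return ranked[:k]


# Invented data: purchases per product category for three customers.
customers = {"alice": [5, 1, 0], "bob": [4, 2, 0], "carol": [0, 1, 5]}
nearest = most_similar([5, 1, 1], customers, k=2)
print(nearest)
```

This brute-force scan compares the query against every item. At big data scale the comparison would typically be distributed or approximated.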

From http://www.skytree.net/why-do-machine-learning-on-big-data/use-cases/

Page 17: Big data landscape

Word Counting on Map Reduce
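A minimal sketch of word counting in the map-reduce style is shown below, written in plain Python rather than against the Hadoop API. The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are illustrative, and everything runs in one process instead of on a cluster.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the word)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the values collected for one word."""
    return word, sum(counts)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())


counts = word_count(["big data at rest", "big data in motion"])
print(counts)
```

On a real cluster each `map_phase` call would run where its block of input is stored, and the shuffle would move the pairs over the network to the reducers.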


Page 18: Big data landscape

Machine learning on Map Reduce


From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011

Page 19: Big data landscape

Machine learning on Map Reduce

From http://www.slideshare.net/hadoop/modeling-with-hadoop-kdd2011

Page 20: Big data landscape

Machine Learning: Use Cases

E-Commerce / E-Tailing
● Product Recommendation Engines
● Cross Channel Analytics
● Events/Activity Behavior Segmentation

Product Marketing
● Campaign Management and Optimization
● Market and Consumer Segmentation
● Pricing Optimization

Customer Marketing
● Customer Churn Management
● (Mobile) User Behavior Prediction
● Offer Personalization


Page 21: Big data landscape

Big Data: Opportunities

Unstructured Data
● Clustering
● Distributed Processing
● Distributed Storage

Modeling & Analytics
● Distributed Machine Learning
● Fast Online Analytics Cubes

Streaming and Real-Time Processing
● Build RT profiles
● Decision Trees and Predictions
● Offer Personalization


Page 22: Big data landscape

Thanks

linkedin:

www.linkedin.com/in/natalinobusa

blog:

www.natalinobusa.com