big data - need of converged data platform

BIG DATA Need of Converged Data Platform UV Saradhi.

Upload: geeknighthyderabad

Post on 12-Apr-2017

TRANSCRIPT

Page 1: Big Data - Need of Converged Data Platform

BIG DATA: Need of Converged Data Platform

UV Saradhi.

Page 2: Big Data - Need of Converged Data Platform

Presentation Overview

1. Big data. Why is it really big?

2. Technologies that are available today.

3. Need of a Converged Data Platform.

4. Innovation! What does it take?

Page 3: Big Data - Need of Converged Data Platform

Video Surveillance

A CCTV camera at 704 x 576 resolution generates roughly 1 GB of data per hour.

Video surveillance data is estimated to reach 6000 PB in 2017.

Surge in biometric applications.

Who stole my jacket? I forgot it on the desk, and the office has CCTV!
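To put the per-camera rate in perspective, here is a small back-of-the-envelope sketch. It assumes continuous recording at the rough 1 GB/hour figure and decimal units; the function name is only illustrative.

```python
# Back-of-the-envelope storage estimate for one CCTV camera.
# Assumptions: the slide's rough 1 GB/hour rate, continuous recording,
# and decimal units (1 TB = 1000 GB).
GB_PER_HOUR = 1.0


def camera_storage_gb(days, gb_per_hour=GB_PER_HOUR):
    """Return the GB produced by one camera recording non-stop for `days` days."""
    return gb_per_hour * 24 * days


if __name__ == "__main__":
    print(f"One day:  {camera_storage_gb(1):.0f} GB")
    print(f"One year: {camera_storage_gb(365) / 1000:.2f} TB")
```

Under these assumptions a single camera produces 24 GB a day and about 8.76 TB a year, so even a modest installation quickly reaches hundreds of terabytes.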

Page 4: Big Data - Need of Converged Data Platform

Autonomous Cars

A driverless car generates roughly 1 GB of data per second.

About 2 PB of data per car is expected.

The car goes for a trip and comes back safely. “Did the car drive well?”

What if someone files a lawsuit after six months?

Page 5: Big Data - Need of Converged Data Platform

Aadhaar

Biometric identity for Indian citizens.

~5 megabytes per citizen.

Maps to around 15 PB of raw data.

100 million authentications per day. Each authentication is roughly 4 KB or more of data.

Sub-second response needed.
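The authentication figures above translate into a sustained request rate and a daily data volume. The sketch below is a rough estimate only: it treats 4 KB as exactly 4000 bytes and spreads the load evenly over the day, which understates real peaks.

```python
# Rough load estimate from the slide's figures.
# Assumptions: 4 KB is taken as 4000 bytes (a lower bound, since the slide
# says "4 KB plus"), and requests are spread evenly across 24 hours.
AUTHS_PER_DAY = 100_000_000
BYTES_PER_AUTH = 4_000
SECONDS_PER_DAY = 24 * 60 * 60


def average_auths_per_second(auths_per_day=AUTHS_PER_DAY):
    """Average authentication requests per second over a full day."""
    return auths_per_day / SECONDS_PER_DAY


def daily_auth_volume_gb(auths_per_day=AUTHS_PER_DAY, bytes_per_auth=BYTES_PER_AUTH):
    """Authentication payload generated per day, in decimal gigabytes."""
    return auths_per_day * bytes_per_auth / 1e9


if __name__ == "__main__":
    print(f"Average rate: {average_auths_per_second():.0f} requests/sec")
    print(f"Daily volume: {daily_auth_volume_gb():.0f} GB")
```

That works out to more than a thousand requests every second on average, each needing a sub-second answer, plus roughly 400 GB of new authentication data per day.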

Page 6: Big Data - Need of Converged Data Platform

Aadhaar Continued ...

Enrolled data moves from hot to cold. Data temperature varies.

Data analytics are needed. https://uidai.gov.in/images/FrontPageUpdates/uid_doc_30012012.pdf

Data is stored on MapR Technologies. https://uidai.gov.in/images/AadhaarTechnologyArchitecture_March2014.pdf

Page 7: Big Data - Need of Converged Data Platform

Retailers View

Walmart needs to process 2.4 PB per hour.

Gain insights on data within a 30 - 40 minute time period.

Errors in insights caused by bugs and miscalculations will burn money.

Need to model 40 PB of recent transactional data.

Page 8: Big Data - Need of Converged Data Platform

Retailers View Continued ...

Data insights figured out that two particular stores are not selling popular cookies. It’s not easy to find!

Alert when a particular metric threshold is violated. Helps to reduce the turnaround time.

200 billion rows of transaction data have to be processed.

Page 9: Big Data - Need of Converged Data Platform

Retailer Needs ...

Building a 360-degree view of the customer.

Measuring brand sentiment.

Creating customized promotions.

Improving store layout. Layout matters in making you purchase more!

Click streams.

Inventory management.

Selling baby lotions to pregnant women; noticing that the weather is bad and selling pizzas.

Page 10: Big Data - Need of Converged Data Platform

BIG DATA: Technologies Primer

Search for “GOD” on a laptop with a 1 terabyte drive. Assume 100 MB/sec as throughput.

How to speed up the search for “GOD”?

Add more CPUs. Okay, how many? 128 or 256 or 512?

Add more memory. How much? How many DIMMs? 16 or 64 or 128?

Tired! Ah, I realize now that a single machine cannot solve the problem.

Do it with multiple machines. Maybe commodity machines, but scale in a huge way.

How to distribute storage?

How to access data?

Need of data locality?
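The search example can be worked out numerically. The sketch below assumes the scan is purely limited by disk throughput, that the data is split evenly across machines, and that each machine scans only its local share; coordination overhead is ignored.

```python
# Time to scan a data set end to end when disk throughput is the bottleneck.
# Assumptions: decimal units, even sharding, no coordination overhead.
TERABYTE = 10**12
MB_PER_SEC = 10**6


def scan_seconds(data_bytes, throughput_bytes_per_sec, machines=1):
    """Seconds needed when `machines` nodes each scan an equal local shard."""
    return data_bytes / (throughput_bytes_per_sec * machines)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        secs = scan_seconds(1 * TERABYTE, 100 * MB_PER_SEC, machines=n)
        print(f"{n:>4} machine(s): {secs:,.0f} s")
```

A single drive needs 10,000 seconds (close to three hours) regardless of how many CPUs or DIMMs are added, while 100 machines scanning their own shards finish in about 100 seconds. This is why distributing storage and keeping compute local to the data matter more than a bigger single box.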

Page 11: Big Data - Need of Converged Data Platform

Technologies : Compute, Storage and Network

Scale by moving compute close to data.

Store data efficiently on multiple nodes.

No compromise on reliability.

No compromise on availability.

Automatically take care of addition and deletion of nodes.

Help to extract underlying device performance characteristics.

Network:

Do not let compute happen on data over the network.

Ensure rack awareness of data during data placement.

Utilize multiple network cards efficiently. Always minimize the network footprint of data.
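Rack awareness during placement can be illustrated with a simplified sketch. This is not the placement policy of any particular product; the function name, the node and rack labels, and the "one copy per rack first" rule are assumptions chosen for illustration.

```python
# Simplified rack-aware replica placement (illustrative only).
# The first copy stays on the writer's node for locality; further copies
# prefer racks that do not yet hold one, so losing a rack cannot lose all copies.


def place_replicas(node_to_rack, writer, replicas=3):
    """Return the list of nodes that should hold copies of a block."""
    if writer not in node_to_rack:
        raise ValueError(f"unknown writer node: {writer}")

    chosen = [writer]
    used_racks = {node_to_rack[writer]}

    # Pass 1: add one node from each rack that has no copy yet.
    for node in sorted(node_to_rack):
        if len(chosen) == replicas:
            break
        rack = node_to_rack[node]
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)

    # Pass 2: fewer racks than replicas, so fill up from any remaining node.
    for node in sorted(node_to_rack):
        if len(chosen) == replicas:
            break
        if node not in chosen:
            chosen.append(node)

    return chosen


if __name__ == "__main__":
    cluster = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack3"}
    print(place_replicas(cluster, writer="n1"))
```

Real platforms weigh additional factors such as free space, node load and network topology depth; the sketch only shows the core idea of keeping one copy local and spreading the rest across failure domains.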

Page 12: Big Data - Need of Converged Data Platform

Technologies Available Today

Hadoop! What exactly is Hadoop?

Map-Reduce! When is this the right choice?

YARN? Is it a refined Map-Reduce? Tighter control over resource management and job scheduling/monitoring?

It looks like the Hadoop core is distributed storage and Map-Reduce is the compute engine. Is the processing real time? Are we good to go?
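For readers new to the Map-Reduce model, here is a single-process sketch of its three stages (map, shuffle, reduce) using word count. It runs in plain Python and does not use Hadoop; the function names are illustrative, and a real job would run the map and reduce stages in parallel across nodes.

```python
# Word count expressed as map, shuffle and reduce stages (single process).
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big platform"]))
```

The model suits batch jobs that can be split into independent pieces. Because every stage reads and writes whole data sets, it is a poor fit when answers are needed as soon as data arrives.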

Page 13: Big Data - Need of Converged Data Platform

Technologies continued ...

How to push data to Hadoop storage? Use Flume?

How to push data from an existing application that writes to a legacy file system? Does it have to be rebuilt?

Can the entire big data storage (aka Hadoop) be accessed over NFS?

Okay, we somehow manage to get data into Hadoop. Does it solve all needs? Is there a way to address data as key-value pairs?

Page 14: Big Data - Need of Converged Data Platform

Unstructured Data as Key-Value Pairs

Why do we need unstructured data as key-value pairs?

Aadhaar needs to store biometric signatures, addresses, fingerprints, etc.

Retailers need to show various attributes of their products, consisting of images, technical specifications, tables, columns, reviews, etc.

IoT (Internet of Things) devices generate a lot of unstructured data.

How to store and process them? Need of more technologies ...

Page 15: Big Data - Need of Converged Data Platform

Big Table

HBase. Tries to address the key-value pair problem.

Cassandra. Tries to address the key-value pair problem.

MapR DB. Addresses the key-value pair problem.

Is there a JOIN operation on these tables? Can there be atomic operations across different rows? How about calling the above NoSQL DBs?

How can one decide on the right technology?
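The questions about JOINs and cross-row atomicity are easier to see with a toy model. The sketch below is an in-memory stand-in, not the API of HBase, Cassandra or MapR DB; the class and function names are hypothetical. It assumes the common design where each row is a sparse set of columns, updates are atomic per row, and any join has to be done by the client.

```python
# Toy wide-column table: row key -> {column name: value} (illustrative only).
import threading


class ToyTable:
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, row_key, **columns):
        """Add or overwrite columns of one row; atomic for that row only."""
        with self._lock:
            self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        """Return a copy of the row, or an empty dict if it does not exist."""
        with self._lock:
            return dict(self._rows.get(row_key, {}))


def client_side_join(left, right, row_keys):
    """Merge the same row keys from two tables; the store itself has no JOIN."""
    return {key: {**left.get(key), **right.get(key)} for key in row_keys}


if __name__ == "__main__":
    profiles, orders = ToyTable(), ToyTable()
    profiles.put("user1", name="Asha", city="Hyderabad")
    orders.put("user1", last_order="cookies")
    print(client_side_join(profiles, orders, ["user1"]))
```

Rows need not share the same columns, which is what makes this model convenient for mixed attributes such as images, specifications and reviews.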

Page 16: Big Data - Need of Converged Data Platform

NOSQL DB

MongoDB.

CouchDB.

MapR DB - JSON

Why are there still more databases? What more do these tables provide?

Is querying data still a challenge?

Page 17: Big Data - Need of Converged Data Platform

Data Query Engines

Hive

Impala

Drill

Presto

Pig

Spark SQL

SQL is back on big tables? How can one define a schema on data? Can the schema be configured directly?

Page 18: Big Data - Need of Converged Data Platform

Real Time Analysis of Data

Hadoop, connectors to Hadoop, unstructured key-value pairs, Big Table, SQL engines. Ready to go?

Is there a need to process data as soon as it arrives?

Maybe streams are needed. Streams are like pipes!

APACHE KAFKA

APACHE STORM

APACHE FLINK

MAPR STREAMS
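The "streams are like pipes" idea can be sketched as an append-only log that producers write to and consumers read from at their own pace. This is a conceptual toy, not the API of Kafka, Storm, Flink or MapR Streams; the class names are hypothetical.

```python
# A stream modelled as an append-only log with per-consumer read offsets.


class ToyStream:
    def __init__(self):
        self._log = []

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._log.append(record)
        return len(self._log) - 1

    def read(self, offset, max_records=10):
        """Return up to `max_records` records starting at `offset`."""
        return self._log[offset:offset + max_records]


class Consumer:
    def __init__(self, stream):
        self.stream = stream
        self.offset = 0  # position of the next unread record

    def poll(self, max_records=10):
        """Fetch unread records and advance this consumer's offset."""
        batch = self.stream.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    stream = ToyStream()
    consumer = Consumer(stream)
    stream.publish({"store": 17, "item": "cookies"})
    print(consumer.poll())
```

Because each consumer tracks its own offset, several applications can process the same data independently as it arrives, without waiting for a batch job.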

Page 19: Big Data - Need of Converged Data Platform

AI, GRAPH, ...

Need to represent data as a graph.

Apache Giraph

Machine learning.

Apache Mahout

Page 20: Big Data - Need of Converged Data Platform

Platform

Purchased 1000 nodes.

Have to connect several pieces of software to make sense of the data.

IT needs a standard platform that runs day after day.

Development and business need continuous engagement with new tools and new software.

Security and fraud detection keep changing day by day.

What to do? Do I need virtualization software?

Page 21: Big Data - Need of Converged Data Platform

Virtualization

Go for existing virtualization techniques? Are they expensive?

How about Linux containers?

How about scheduling containers? Do we need scheduling software?

Apache Mesos

Kubernetes

How do I provision storage for containers?

Craft a disk independently for each container?

Is there a way to plug in storage from any node in the cluster to a container running on any node?

Page 22: Big Data - Need of Converged Data Platform

Performance and Security Problems

A 1000-node cluster is not performing well.

Back to the Big Data problem again.

Swim through the logs of 1000 nodes to identify the issue?

Security.

Is data access kept confidential?

Authentication and authorization are a must. Are they the same across all the software?

Is data encrypted on the wire?

DoS problems.

Page 23: Big Data - Need of Converged Data Platform

Multi Tenancy

Have 1000 tenants working on a 1000-node cluster.

How to provision storage, compute and network?

Is this going to be like the Amazon cloud? Does each enterprise have the scale and capacity to develop Amazon cloud software?

Is there a way for tenants to share data?

Page 24: Big Data - Need of Converged Data Platform

Hot and Cold Data

As time moves forward, data can become cold.

A need may arise to keep hot data on solid state drives.

How to retain cold data?

Move it to the cloud.

Does this need yet another piece of software?

Is there a way to watch the attributes of data that has been moved into the cloud? Say the file is /A/B/C. Can one see when C was modified while the data stays in the cloud?

Is there a way to dynamically move data between solid state drives and hard disk drives?
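One simple way to think about data temperature is a policy that picks a storage tier from the time since last access. The sketch below is only an illustration: the 7-day and 90-day thresholds and the tier names are assumptions, not values from the talk or from any product.

```python
# Pick a storage tier from how recently the data was accessed (illustrative).
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=7)    # assumed threshold for "hot" data
WARM_WINDOW = timedelta(days=90)  # assumed threshold before data goes "cold"


def choose_tier(last_access, now):
    """Return 'ssd' for hot data, 'hdd' for warm data, 'cloud' for cold data."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "ssd"
    if age <= WARM_WINDOW:
        return "hdd"
    return "cloud"


if __name__ == "__main__":
    now = datetime(2017, 4, 12)
    for days in (1, 30, 365):
        print(days, "days old ->", choose_tier(now - timedelta(days=days), now))
```

A platform that applies such a policy itself, while keeping file attributes visible under the same path, would avoid the separate archiving software the slide asks about.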

Page 25: Big Data - Need of Converged Data Platform

Reliability: Does it mean 3-way replication?

Data reliability means 3-way replication, by and large.

Petabytes of data being 3-way replicated causes storage waste.

How to eliminate it?

A platform should try to represent data in erasure coded format (probably 1.5x).

Yet while storing data in erasure coded format, it should allow the data to be modified if the need arises.
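The storage saving can be checked with a short calculation, followed by the simplest possible erasure code: a single XOR parity fragment. The 4 data + 2 parity layout used to reach 1.5x is an assumed example (the slide only gives the 1.5x figure), and XOR parity tolerates just one lost fragment; surviving two losses at 1.5x needs a stronger code such as Reed-Solomon.

```python
# Storage overhead of erasure coding, plus a single-parity XOR example.
# Assumption: all fragments have the same length.


def storage_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per byte of user data."""
    return (data_fragments + parity_fragments) / data_fragments


def xor_parity(fragments):
    """XOR equal-length fragments together into one parity fragment."""
    parity = bytearray(len(fragments[0]))
    for fragment in fragments:
        for i, byte in enumerate(fragment):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving_fragments, parity):
    """Rebuild the single missing data fragment from the survivors and parity."""
    return xor_parity(list(surviving_fragments) + [parity])


if __name__ == "__main__":
    print("4 data + 2 parity overhead:", storage_overhead(4, 2))
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_parity(data)
    print(recover_missing([data[0], data[2]], parity))
```

Compared with the 3x cost of 3-way replication, a 1.5x layout halves the raw storage needed for the same user data. The price is that modifying data means recomputing parity, which is why in-place updates are the hard part.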

Page 26: Big Data - Need of Converged Data Platform

IoT Devices : Edge Clusters

IoT devices generate a lot of data.

Data from each IoT device has to be processed and stored with high reliability to comply with government laws.

IoT devices have to process data.

We know that a single machine is limited in processing data by virtue of its CPUs, memory and hard disks.

A single machine also poses data reliability problems if the drive or CPU goes bad.

Is this asking for a cluster near the IoT devices? How can we do it? A NUC (Next Unit of Computing) cluster may be the answer!

Page 27: Big Data - Need of Converged Data Platform

IoT Edge Clusters

Process data and push it to the centralized cluster.

Access data in the centralized cluster and the local cluster when the need arises.

Unified global namespace access is a must.

Ability to stream data from the edge cluster to the centralized cluster.

Edge cluster applications may not be sophisticated. They may have to write data with standard file system calls.

Can the software platform we choose provide edge cluster processing?

Page 28: Big Data - Need of Converged Data Platform

Application Data Access Model

Table format.

Big data files. Hadoop files (write once, read many) or MapR files (readable and writable).

Object store.

Flat namespace.

Data is accessed as objects with strict SLAs.

Used to store videos, images, etc.

Does the software platform let all the applications run with their own access models?

Page 29: Big Data - Need of Converged Data Platform

Converged Data Platform

Needed as Big Data Store.

Ability to support unstructured key-value pairs.

Ability to support data with SQL engines like Drill, Hive, etc.

Ability to support real time streaming of data.

Ability to support container virtualization.

Ability to support applications accessing data through objects.

Ability to support global namespace for IoT Edge Clusters.

Ability to support data representation in erasure coded format.

Ability to support any new compute engine.

Page 30: Big Data - Need of Converged Data Platform

Converged Data Platform Continued ...

Ability to support Multi Tenancy.

Ability to ensure security across several users and tenants.

Ability to provision CPU, Storage and network across tenants or users.

Ability to support different temperatures of the data.

Ability to move data between cloud and the cluster.

Page 31: Big Data - Need of Converged Data Platform

Innovation

Is innovation a function of knowledge?

Isn’t knowledge a function of time?

What promotes innovation?

Salary?

Stock?

Recognition?

Peer Competition.

Page 32: Big Data - Need of Converged Data Platform

Innovation Continued ...

Innovation needs an innocent mind.

How can one be innocent in this world?

Is there a way the mind can be made innocent?

Recognizing innovation is innovation.

Page 33: Big Data - Need of Converged Data Platform

Questions

I may not be able to answer all your questions!

We can investigate the questions together, not alone.

Page 34: Big Data - Need of Converged Data Platform

THANK YOU

You can reach me

Email: uvsaradhi at gmail dot com.