
Big Data, Data Lake and Beyond

Rajesh Kumar, Sr. Data Architect, Big Data Analytics & Information Management

Agenda:

Big Data, Data Lake

Data Lake: Hype vs. Challenges

Governance in the Data Lake

Hadoop Ecosystem

Takeaway

A Big Data and Hadoop Platform Manifesto

Why is there so much buzz around Big Data and Hadoop, and why does it grow with each passing day? There's good reason for all the excitement. Gartner predicts that worldwide smart connected devices will exceed 50 billion by 2020, with the economic value-add across sectors crossing $1.7 trillion.

Data generated by devices, sensors, real-time data streams, and the IoT is becoming the new norm when it comes to discovering the true business value of data.

We know that the trusted platforms of yesterday, and in some cases today, have limitations. Existing infrastructure and data warehouse systems were not purpose-built for Big Data workloads or for all the types of data found across different domains today.

In the world of Big Data, organizations are looking beyond the data generated by operational systems, adopting Hadoop and Big Data analytics to meet complex data challenges, find new business opportunities, deliver faster solutions, and identify new revenue streams.

Big Data and analytics platforms have evolved out of the need to deliver solutions to complex data challenges where traditional applications and architectures have limitations.

What is causing this technology disruption and paradigm shift? Why are enterprises looking at big data's potential beyond the hype? Let us get into the real facts.

Data volumes and new types of data are on the rise: trends driving Big Data

Big Data:

In the past, data came primarily from operational systems. In today's world, data is coming from everywhere: sensors, smart phones, social media, Internet click streams, web logs, embedded machines, documents, video, audio, text, internet customer interactions, and many other sources. These data arrive in massive amounts (often terabytes per day) and are very complex, because most of them are semi-structured or unstructured (documents, streaming data, videos, images, tweets, log feeds, sensor data, and so on). In general, data characterized by high volume, variety, velocity, and veracity is defined as Big Data. (There is no single universal definition.)

Let us try to understand the four specific attributes that define big data. We always keep hearing about the "4 Vs of Big Data". What are the four Vs in the big data context?

Volume - Volume is the most obvious and important Big Data attribute. Data volumes are on the rise; the social-mobile-cloud data explosion and the Internet of Things (IoT) are some of the main reasons for it.

Velocity - One of the least understood characteristics of Big Data is velocity. Velocity is defined as the rate at which data arrives at the enterprise and the time that it takes the enterprise to process and understand that data. Big data is not just at-rest but also in-motion. How rapidly data (both structured and unstructured) can be moved to the end user and self-service user, and how fast you can analyze it, is the need of the hour.

Variety - The variety characteristic of Big Data is really about trying to capture all of the data that helps drive business value and the decision-making process, not only the operational data.

Veracity - Data here, data there, data everywhere. Veracity is the latest addition to the original three Vs and is being used more and more to describe Big Data. The fourth V, veracity, refers to the quality and accuracy of data: the trustworthiness and understandability of the data coming from all sources, internal and external, structured and unstructured, at rest and in motion.

Combining these large volumes and varieties of data yields massive data sets that provide high-value information and deep business insight. The platform that supports and handles big data solutions in a distributed, cluster-based computing environment (using a software stack of HDFS, MapReduce, YARN, HBase, Hive, Flume, Spark, and Kafka), running on commodity hardware, is called Hadoop. In simple words, Hadoop is the technology stack that handles Big Data solutions. And when we talk about Hadoop, the Data Lake takes center stage.

Hadoop:

Massively scalable data storage and processing platform

Schema on read

Open-source project comprising various technology stacks

Designed to be scalable, fault-tolerant, and distributed

Horizontally scalable: scales out to execute on thousands of servers

Hadoop Ecosystem: HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Kafka, Spark, Ambari, Oozie, Zookeeper

IoT - the "Internet of Things". Data generated by devices, sensors, and real-time data streams are good examples of IoT data. The Internet of Things is our constantly growing universe of sensors and devices that create a flood of granular data about our world. The "things" include everything from sensors monitoring the weather, traffic, or energy usage, to "smart" household appliances, self-driving cars, smart railway systems, fault monitoring of power grids, and so on. These sensors are constantly getting smarter, cheaper, and smaller.

Gartner predicts that IoT-connected things will exceed 50 billion by 2020, with the economic value-add across sectors crossing $1.7 trillion.

As the volume and variety of sensors and other telemetry sources grow, the connections between them and the analytic needs grow too, creating an IoT value curve that rises exponentially as time goes on.

Data Lake:

A data lake is a new-generation distributed data repository that stores data of any size and variety in its native format. The data can be structured or unstructured. Because a Data Lake can accommodate all types of data of different shapes and sizes (structured, unstructured, at rest, in motion), it is emerging as a powerful architectural approach to distributed data storage and processing for big data, especially as enterprises turn to mobile, cloud-based applications, and the Internet of Things (IoT) to leverage a value curve that rises exponentially. In the context of growing data volume and variety, it is evident that the traditional Data Warehouse and BI stack is not enough to satisfy emerging business needs and analytical complexity in a fast-changing data landscape.

To better understand the Data Lake/Hadoop, let us look at some of its important characteristics:

The data lake is the primary repository for ingesting, persisting, and managing raw source data stored in its native format

It is a scale-out technology based on Hadoop

Data does not have to be modeled and transformed before being loaded into the data lake

It is schema-less and based on the philosophy of schema-on-read (see the sketch after this list)

It is a file-based platform where data is stored in a Hadoop cluster of commodity hardware

The structure of the data is therefore not known when it is fed into the data lake, but only found through discovery, when read

Native support for NoSQL databases
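To make the schema-on-read philosophy concrete, here is a minimal Python sketch. The record layout, field names, and types are hypothetical, chosen only for illustration: raw records land in the lake untyped, and a schema is imposed only at read time.

```python
import csv
import io

# Raw records land in the lake untouched; no schema is enforced at write time.
raw_record = "2017-04-13,turbine-42,9.81,OK"

# A schema is imposed only when the data is read ("schema-on-read").
# These field names and types are hypothetical, for illustration only.
SCHEMA = [("event_date", str), ("device_id", str), ("reading", float), ("status", str)]

def read_with_schema(line, schema):
    """Parse one raw CSV line into a typed record at read time."""
    values = next(csv.reader(io.StringIO(line)))
    return {name: cast(value) for (name, cast), value in zip(schema, values)}

print(read_with_schema(raw_record, SCHEMA))
# {'event_date': '2017-04-13', 'device_id': 'turbine-42', 'reading': 9.81, 'status': 'OK'}
```

The same raw file could be read tomorrow with a different schema, which is exactly the flexibility the data lake promises.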

Some of the biggest advantages of a Data Lake include:

You can store all types of structured and unstructured data in a data lake, hence the value that can be derived is unlimited

The biggest advantage of a data lake is flexibility and fast movement of data to the end user and self-service user

It can be 10 to 50 times less expensive to deploy than traditional data warehouse technologies

It scales out to execute on thousands of servers

It allows business consumers to bring their tools to the data, rather than spreading data across different applications

Data Lake reference architecture diagram

Data Movement in Data Lake

Data Lake Hype:

Data Lake challenges:

Data swamp

It is a general misconception that a Data Lake is simply an enterprise repository that ingests all data in its native format, and that you don't need to adhere to data governance policies, processes, data access controls, data quality rules, and proper metadata or data catalog management. The perception that you can ignore governance, data lineage, data quality, and access control, load data freely into the lake, and address governance, quality, and metadata later is not true. Later, when you need to discover insights from the data, you will have to clean the data or find tools to clean it, and there are real risks to this approach, because you know nothing about how the data got there or its business context. You need to start somewhere in the massive amounts of data in the lake without knowing which data is for what, what its metadata is, whether the data is correct, or whether it can be trusted; you won't know where to begin, what to mine, or how to discover business insight. So parts of the data lake will be ignored, become stagnant and isolated, and slowly turn into a data swamp (massive data with little structure or lineage, lacking data quality and data provenance).

It is therefore highly important to incorporate a data lake management platform that has been purpose-built to ingest and manage large volumes of diverse data sets in the data lake. It will allow you to catalog the data, leverage metadata, and support the process of ensuring data quality and data lineage while automating the workflows.

Misconceptions and Big Data hype - "Hadoop is going to replace the Data Warehousing/BI framework." It's not true: Hadoop is not the replacement for the traditional, well-governed, well-understood data warehouse and RDBMS systems. It is better seen as a platform that complements the DW and RDBMS systems. Another assumption is that big data technology equals Hadoop. It is not true; it's more than Hadoop. There are real-time data stores, processing and analytics engines, and streaming technologies for monitoring and processing data as it flows. Some are built on top of Hadoop or HDFS, while others exist independently.

Next generation Hadoop reference architecture

Hadoop and Its Ecosystem

Hadoop and Its Ecosystem - Tools & Technology

HDFS - The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. It is a cluster-based file system that provides scalable storage and processing in a distributed file system across large clusters of commodity servers. HDFS was derived from the Google File System (GFS).

Scale out instead of scale up - In the relational data warehouse environment, performance is often addressed by using larger and faster hardware. In the Hadoop world, we instead add more nodes (servers) and do the work in parallel.

MapReduce - MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. MapReduce was derived from Google's concepts for distributed data processing on large clusters. The current MapReduce version is built over the Apache YARN framework. YARN stands for "Yet Another Resource Negotiator"; it is a new Hadoop 2.0 framework that offers a more scalable, faster, and more generic architecture than the earlier MapReduce 1.0 framework.
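As an illustration of the MapReduce programming model described above, here is a minimal word-count sketch written for Hadoop Streaming, which lets plain scripts act as mapper and reducer by reading stdin and writing stdout. The script names and HDFS paths are assumptions for illustration, not the author's code.

```python
#!/usr/bin/env python
import sys

# Map phase: emit (word, 1) for every word read from stdin.
def mapper():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

# Reduce phase: Hadoop sorts map output by key, so all counts for one
# word arrive together and can be summed in a single pass.
def reducer():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

# In practice these would be two separate scripts (each calling its function
# at module level), submitted with the Hadoop Streaming jar, e.g.
# (the jar's path varies by distribution):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/raw/text -output /data/out/wordcount \
#     -mapper mapper.py -reducer reducer.py
```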

Apache Pig - Pig provides an engine for executing data flows in parallel on Hadoop. Pig also provides a high-level procedural language that acts as an interface to HDFS. Pig is more frequently used in extract, transform, and load (ETL) scenarios than for just returning data results. Pig uses a text-based language called "Pig Latin" for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Early Hadoop jobs ran in a serial batch manner, so one job had to finish before another could begin.

Apache Hive (SQL and data warehouse capability)

It is an open-source data warehouse infrastructure for querying and analyzing large datasets stored in Hadoop files. In the Hive data layer you can do data summarization, query, and analysis. It provides a SQL-like language and batch-processing capabilities (not SQL-92 compliant).

Hive functions as a SQL metastore on top of HDFS: users can impose schemas (which look like tables to the user) onto files and then query them using a language called Hive Query Language (HiveQL).
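A hedged sketch of that idea: imposing a table-shaped schema onto files already sitting in HDFS and querying them with HiveQL, here via the third-party PyHive client (the host, table name, and file location are assumptions):

```python
from pyhive import hive  # third-party client: pip install pyhive

conn = hive.connect(host="hive-server.example.com", port=10000)
cur = conn.cursor()

# Impose a table-shaped schema onto files already in HDFS
# (external table: Hive reads the files in place, schema-on-read).
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ts STRING, url STRING, status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/web_logs'
""")

# Query with HiveQL; this runs as a batch job over the underlying files.
cur.execute("SELECT status, COUNT(*) FROM web_logs GROUP BY status")
for row in cur.fetchall():
    print(row)
```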

HBase (NoSQL database)

An open source, non-relational, distributed database based on the concepts of Google's BigTable. It was developed as part of the Apache Software Foundation's Apache Hadoop project; it runs on top of HDFS, providing BigTable-like capabilities for Hadoop.
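For a feel of the BigTable-style data model, here is a small sketch using the happybase Python client, which talks to HBase through its Thrift gateway (the host, table, and column family are hypothetical and assumed to already exist):

```python
import happybase  # third-party client: pip install happybase

connection = happybase.Connection("hbase-thrift.example.com")  # hypothetical host
table = connection.table("sensor_readings")                    # hypothetical table

# HBase stores cells under (row key, column family:qualifier); values are bytes.
table.put(b"turbine-42#2017-04-13", {b"m:temp": b"78.4", b"m:rpm": b"1450"})

row = table.row(b"turbine-42#2017-04-13")
print(row[b"m:temp"])  # b'78.4'
```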

HCatalog (metadata capability)

HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. Right now HCatalog is part of Hive.

Data Ingestion

Apache Flume - Apache Flume is a service for streaming logs into Hadoop. It is a distributed and reliable service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.


Apache Kafka (Continuous messaging-based data ingest)

A fast, scalable, durable, and fault-tolerant publish-subscribe messaging system, Kafka is often used in place of traditional message brokers. It offers higher throughput, reliability, and replication.
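A minimal publish-subscribe sketch using the kafka-python client (the broker address and topic name are assumptions): one process publishes device events, and an ingest process consumes them for onward delivery, e.g. into HDFS.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer side: publish device events to a topic (names are hypothetical).
producer = KafkaProducer(bootstrap_servers="broker1.example.com:9092")
producer.send("sensor-events", b'{"device": "turbine-42", "rpm": 1450}')
producer.flush()

# Consumer side: an ingest process subscribes and forwards the stream.
consumer = KafkaConsumer("sensor-events",
                         bootstrap_servers="broker1.example.com:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
```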

Apache Sqoop (data import/export capability)

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. You can use Sqoop to import data from external structured data stores into the Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can extract data from Hadoop and export it to external structured data stores such as relational databases and enterprise data warehouses.

Apache Spark - Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. It can integrate and run on top of the Hadoop Distributed File System (HDFS) and also in standalone mode. Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications. Spark provides a framework for writing fast, distributed programs; it solves similar problems to Hadoop MapReduce, but with a fast, in-memory approach.
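To show the high-level API in practice, here is a small PySpark sketch (the HDFS path and field names are hypothetical) that reads raw JSON from the lake and queries it through both the DataFrame API and Spark SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LakeExploration").getOrCreate()

# Read raw JSON straight out of the lake; Spark infers a schema at read time.
events = spark.read.json("hdfs:///data/raw/sensor-events")  # hypothetical path

# Register the DataFrame as a view so Spark SQL can query the same data.
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT device, AVG(rpm) AS avg_rpm
    FROM events
    GROUP BY device
""").show()
```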

Apache ZooKeeper (Coordination services)

It's a coordination service that gives you the tools you need to write correct distributed applications, and it fulfills many of the coordination roles for HDFS and other Hadoop infrastructure. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects already use ZooKeeper to coordinate the cluster and provide highly available distributed services.
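As a sketch of the coordination primitives, here is a minimal example using the kazoo Python client (the ensemble address and znode paths are hypothetical): znodes form a small, highly available tree that distributed processes read and watch.

```python
from kazoo.client import KazooClient  # third-party client: pip install kazoo

zk = KazooClient(hosts="zk1.example.com:2181")  # hypothetical ensemble
zk.start()

# Store a small piece of shared configuration as a znode.
zk.create("/app/config/batch_size", b"500", makepath=True)

# Any process in the cluster can now read (or watch) the same value.
data, stat = zk.get("/app/config/batch_size")
print(data)  # b'500'

zk.stop()
```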

Apache Oozie

(Scheduling and workflow capability)

A Hadoop workflow management and scheduling system for MapReduce jobs using DAGs (Directed Acyclic Graphs). The Oozie Coordinator can trigger jobs by time (frequency) and data availability.

Apache Falcon

(Scheduling)

Falcon is a new data processing and management platform for Hadoop: a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop. It enables users to configure, manage, and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows. Falcon's simplification of data management is quite useful to anyone building apps on Hadoop. Data management on Hadoop encompasses data motion, process orchestration, lifecycle management, and data discovery, among other concerns that are beyond ETL.

Apache Mahout - a machine learning and math library on top of MapReduce.

Conclusion:

New data types and analytical complexity have meant that the traditional data warehouse and BI stack is not enough to satisfy all types of data complexity and analytical needs.

Hadoop and Big Data analytics do not replace data warehouse and data mart applications, but strengthen the existing analytical environment.

The Big Data/Hadoop platform is now a good-to-have and is fast establishing itself as a must-have technology.

Advanced analytics/data science has emerged as an analytical process that needs to be applied to all types of data, not just operational data.

Data governance, metadata management, data quality frameworks, security, and access control still have to evolve and demonstrate their maturity.

Thank You…