
Page 1:

CLOUD COMPUTING AND BIG DATA

Department of Software Engineering and Information Technology

Page 2:

• Father: “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

• The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.

• The Girl Was Pregnant…!

Page 3:

As Target’s computers crawled through the data, the company’s analysts were able to identify about 25 products that, when analyzed together, allowed it to assign each shopper a “pregnancy prediction” score.

More important, it could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.

Example: Jenny Ward, who is 23, lives in Atlanta and in March bought cocoa-butter lotion, a purse large enough to double as a diaper bag, zinc and magnesium supplements and a bright blue rug. There’s, say, an 87 percent chance that she’s pregnant and that her delivery date is sometime in late August.
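A score like this can be sketched as a simple weighted sum over indicator products. This is a minimal sketch of the idea only; the product names and weights below are hypothetical, not Target’s actual model.

```python
# Hypothetical predictive products and weights (illustration only,
# not Target's real model or real coefficients).
PREDICTIVE_PRODUCTS = {
    "cocoa-butter lotion": 0.20,
    "oversized purse": 0.15,
    "zinc supplement": 0.12,
    "magnesium supplement": 0.12,
    "bright blue rug": 0.10,
}

def pregnancy_score(purchases):
    """Sum the weights of predictive products found in a shopper's basket."""
    return sum(w for p, w in PREDICTIVE_PRODUCTS.items() if p in purchases)

basket = ["cocoa-butter lotion", "oversized purse",
          "zinc supplement", "magnesium supplement", "bright blue rug"]
print(round(pregnancy_score(basket), 2))  # 0.69
```

A real model would be trained on purchase histories of shoppers with known outcomes; the point here is only that a handful of products, combined, can yield a per-shopper score.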

Page 4:

“Cloud Computing (CC) is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” (NIST 2011).

Page 5:

Cloud Computing resources are delivered as services. A cloud is called a public cloud when it is made available in a pay-as-you-go manner to the general public, and a private cloud when the cloud infrastructure is operated solely for a single business or organization.

Page 6:

In general, Cloud Computing providers offer one of the following three categories of cloud services:

Software-as-a-Service (SaaS): Applications are accessible from various client devices

The provider is responsible for the application

Examples: Salesforce.com, NetSuite, Google, IBM, etc.

Platform-as-a-Service (PaaS): The client is responsible for the end-to-end life cycle of developing, testing, and deploying applications

The provider supplies all the underlying systems (operating system, applications, and development environment)

Examples: Google App Engine, Microsoft Azure, etc.

Page 7:

Infrastructure-as-a-Service (IaaS): The service client has control over the operating system, storage, and applications, which are offered through a Web-based access point

In this type of service the client manages the storage and development environments for Cloud Computing applications, such as the Hadoop Distributed File System (HDFS) and the MapReduce development framework.

Examples of infrastructure providers are GoGrid, AppNexus, Eucalyptus, Amazon EC2, etc.
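The responsibility split across the three categories above can be summarized as a small lookup table. This is a rough sketch (real offerings blur these lines):

```python
# Simplified responsibility split per service model (illustration only).
SERVICE_MODELS = {
    "SaaS": {"provider_manages": ["application", "platform", "infrastructure"],
             "client_manages": []},
    "PaaS": {"provider_manages": ["platform", "infrastructure"],
             "client_manages": ["application"]},
    "IaaS": {"provider_manages": ["infrastructure"],
             "client_manages": ["application", "platform"]},
}

def who_manages(layer):
    """Return the service models in which the provider manages a given layer."""
    return [m for m, r in SERVICE_MODELS.items()
            if layer in r["provider_manages"]]

print(who_manages("platform"))        # ['SaaS', 'PaaS']
print(who_manages("infrastructure"))  # ['SaaS', 'PaaS', 'IaaS']
```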

Page 8:

Page 9:

Page 10:

Cloud Computing (CC) is a technology aimed at processing and storing very large amounts of data, also known as Big Data (BD).

In December 2012, the International Data Corporation (IDC) released a report titled "The Digital Universe in 2020". This report mentions that at the end of 2012, the total data generated was 2.8 Zettabytes (ZB), that is, 2.8 trillion gigabytes [1].

1. Gantz, J. and Reinsel, D., The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. IDC: Framingham, MA, USA, 2012, p. 16.

Page 11:

IDC predicts that the total data for 2020 will be 40 ZB, roughly equivalent to 5.2 terabytes (TB) of data generated for every human being alive in that year.
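The per-person figure can be checked with back-of-the-envelope arithmetic, assuming a 2020 world population of roughly 7.7 billion (the population number is an assumption here, not taken from the report):

```python
ZB = 10**21  # bytes in a zettabyte (decimal)
TB = 10**12  # bytes in a terabyte (decimal)

total_2020_bytes = 40 * ZB       # IDC's projection for 2020
population_2020 = 7.7e9          # assumed world population in 2020
per_person_tb = total_2020_bytes / population_2020 / TB
print(round(per_person_tb, 1))   # about 5.2
```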

The report mentions that only 0.5% of all data has been analyzed so far, and that about a quarter of all currently available data could contain valuable information if it were analyzed.

Just to mention one example, in 2012 alone Facebook generated more than 500TB of new data every day.

Page 12:

Page 13:

Hadoop is the Apache Software Foundation’s top-level Cloud Computing project, aimed at Big Data processing.

Hadoop comprises various subprojects that allow for distributed processing across clusters of computers using simple programming models.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. (Hadoop 2012)

Page 14:

Hadoop Technology Stack:

Low-level system: Hadoop Common (Core)

Object storage: Hadoop Distributed File System (HDFS)

Table storage: HBase

Data analysis tools (parallel programming, warehousing, machine learning, etc.): MapReduce, Pig, Hive, Mahout, Cassandra

Coordination: ZooKeeper

Data collection: Chukwa

Page 15:

In 2012, Facebook had one of the largest Hadoop clusters in the world, with over 100 PB of storage and 200 million files on over 5,000 computers.

Page 16:

There are several distributions of Hadoop; the best known come from Cloudera, Hortonworks, IBM, and EMC. (Hadoop Ecosystem, June 2012)

Page 17:

Page 18:

Three major Hadoop technologies are: the Hadoop Distributed File System (HDFS), which allows storing large amounts of data in files spread over multiple computers accessible via the Internet;

The MapReduce development framework, a (Java) programming model that allows developing and executing Cloud Computing applications over the HDFS; and

HBase, which hosts very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.
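The MapReduce model itself can be illustrated with the classic word-count example. This is a minimal in-process sketch in Python; real Hadoop jobs are typically written in Java and run the map and reduce phases in parallel over HDFS blocks.

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit a (word, 1) pair for every word in the input split."""
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

result = reduce_phase(map_phase("big data big cloud"))
print(result)  # {'big': 2, 'data': 1, 'cloud': 1}
```

In a real cluster, the map calls run on many machines at once, a shuffle step groups pairs by key, and the reduce calls run in parallel per key; the sequential version above only shows the two programming-model phases.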

Page 19:

The Hadoop Distributed File System (HDFS) is the open source implementation of the Google File System (GFS) and it is the primary storage system used by Hadoop applications.

Three main goals of HDFS are: to handle very large files (hundreds of megabytes, gigabytes, or terabytes in size);

Streaming data access: the idea that the most efficient data-processing pattern is write once, read many times; and

Commodity hardware: it is designed to run smoothly on a cluster of commodity hardware.
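The "very large files" goal rests on chopping each file into fixed-size blocks. A minimal sketch of that splitting, using a tiny block size purely for illustration (HDFS defaults have been much larger, e.g. 64 MB or 128 MB):

```python
def split_into_blocks(data, block_size):
    """Split a byte string into blocks; all blocks are block_size bytes
    except possibly the last one, mirroring HDFS's layout."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"x" * 10, block_size=4)
print([len(b) for b in blocks])  # [4, 4, 2]; only the last block is short
```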

Page 20:

HDFS has been designed with a master/slave architecture of clusters, which consists of a single Name Node (NN) and a number of Data Nodes (DN) that manage large amounts of storage (Borthakur, 2013).

Page 21:

The NameNode (NN) server is made up of two main elements, which expose the file system to users and allow data storage and retrieval. These elements are the File System Namespace (FSN) and the File System Metadata (FSM).

File System Namespace. The HDFS supports a traditional hierarchical file organization, like other file systems, in which the basic operations on files are create, delete, move, and rename.

Page 22:

File System Metadata. When the FSN is stored by the NN, it uses a transaction log, called EditLog, to persistently record every change that occurs to file system metadata. For example, creating a new file in the HDFS causes the NN to insert a record into the EditLog indicating this.

The entire FSN, including the mapping of blocks to files and the file system properties, is stored in a file called the FsImage, which is kept in the NN’s local file system. Both the EditLog and the FsImage are part of the FSM.
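The EditLog/FsImage pairing is essentially a write-ahead journal plus a checkpoint: every change is logged before it is applied, so the namespace can be rebuilt after a crash. A toy in-memory sketch (the class and methods are hypothetical, not the real NameNode API):

```python
class ToyNameNode:
    """Toy stand-in for the NN's metadata handling (in-memory only;
    the real NN persists both structures to disk)."""

    def __init__(self):
        self.edit_log = []      # stands in for the EditLog journal
        self.namespace = set()  # stands in for the FsImage contents

    def create_file(self, path):
        self.edit_log.append(("create", path))
        self.namespace.add(path)

    def delete_file(self, path):
        self.edit_log.append(("delete", path))
        self.namespace.discard(path)

    @staticmethod
    def replay(edit_log):
        """Rebuild the namespace from the journal alone (crash recovery)."""
        nn = ToyNameNode()
        for op, path in edit_log:
            getattr(nn, f"{op}_file")(path)
        return nn

nn = ToyNameNode()
nn.create_file("/logs/a.txt")
nn.create_file("/logs/b.txt")
nn.delete_file("/logs/a.txt")
recovered = ToyNameNode.replay(nn.edit_log)
print(recovered.namespace)  # {'/logs/b.txt'}
```

In real HDFS the FsImage is a periodic checkpoint, so recovery replays only the EditLog entries written after the last checkpoint rather than the whole history.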

Page 23:

The DataNode (DN) stores HDFS data in files in its local file system, but has no knowledge of the HDFS files.

It stores each block of HDFS data in a separate file using a heuristic to determine the optimal number of files per directory and creates subdirectories as required.

When the DN starts up, it sends a Blockreport to the NN, which is a list of all the HDFS data blocks located in its local file system.

Each block is replicated across several DNs to create a distributed file system that remains reliable if a DN failure occurs.
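From the Blockreports, the NN can build a map from each block to the DNs holding it and flag blocks with too few replicas. A small sketch of that bookkeeping (function names are illustrative, not the real API):

```python
from collections import defaultdict

def build_block_map(blockreports):
    """blockreports: {datanode_id: [block ids found on its local disks]}.
    Returns a map from block id to the set of DNs holding a replica."""
    block_map = defaultdict(set)
    for dn, blocks in blockreports.items():
        for b in blocks:
            block_map[b].add(dn)
    return block_map

def under_replicated(block_map, replication=3):
    """Blocks with fewer replicas than the target replication factor."""
    return sorted(b for b, dns in block_map.items() if len(dns) < replication)

reports = {"dn1": ["blk_1", "blk_2"],
           "dn2": ["blk_1", "blk_2"],
           "dn3": ["blk_1"]}
print(under_replicated(build_block_map(reports)))  # ['blk_2']
```

On detecting such a block, the real NN schedules re-replication onto another DN; the sketch stops at detection.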

Page 24:

[HDFS architecture diagram: a single NameNode, holding the File System Namespace and the File System Metadata (EditLog and FsImage), coordinates several DataNodes spread across two racks (Rack One and Rack Two). DataNodes send Block Reports to the NameNode, the NameNode issues Block Operations, and data blocks are replicated across DataNodes and racks by the Replication Mechanism.]

Page 25:

Replication Mechanism The HDFS stores each file as a sequence of blocks, where all the blocks in a file (except the last one) are the same size.

The blocks in a file are replicated for fault tolerance, the block size and the replication factor being configurable for each file.

The NN makes all the decisions regarding the replication of blocks, and periodically receives a Heartbeat and a Blockreport from each of the DNs in the cluster.

Page 26:

Replication Mechanism (cont…) The HDFS uses a rack-aware replica placement policy to improve data reliability, availability, and network bandwidth utilization, since large HDFS instances run on clusters of computers that commonly spread across many racks.

Consequently, the NN determines the rack ID of each DN by means of a process called Rack Awareness.

Page 27:

Example of Replication Mechanism Suppose that the replication factor is three, the HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the same local rack, and the last one on a different node in a different rack.

The chance of a rack failure is far less than that of a node failure, so this policy does not compromise data reliability and availability guarantees.
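The placement described in this example can be sketched directly. This hypothetical helper simply picks the first suitable node in each case, whereas the real NN also weighs load and available space:

```python
def place_replicas(writer, racks):
    """Rack-aware placement for replication factor 3, as in the slide:
    first replica on the writer's node, second on another node in the
    same rack, third on a node in a different rack.
    racks: {rack_id: [node ids]}; writer must be one of those nodes."""
    local_rack = next(r for r, nodes in racks.items() if writer in nodes)
    same_rack_peer = next(n for n in racks[local_rack] if n != writer)
    remote_rack = next(r for r in racks if r != local_rack)
    return [writer, same_rack_peer, racks[remote_rack][0]]

racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
print(place_replicas("n1", racks))  # ['n1', 'n2', 'n4']
```

Keeping two replicas in one rack means a write crosses the rack interconnect only once, while the third replica still survives a whole-rack failure.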

Page 28:

The HDFS tries to satisfy a read request from the replica that is closest to the reader, in order to minimize global bandwidth consumption and read latency. If there is a replica on the same rack as the reader node, then that replica is preferred for satisfying the read request.
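Replica selection for reads can be sketched as a distance ranking: the reader's own node first, then its rack, then anywhere else. The function below is a hypothetical illustration of that preference order, not the real client API:

```python
def pick_replica(reader_node, reader_rack, replicas):
    """replicas: list of (node, rack) pairs holding the requested block.
    Returns the replica closest to the reader."""
    def distance(replica):
        node, rack = replica
        if node == reader_node:
            return 0  # local copy: no network transfer at all
        if rack == reader_rack:
            return 1  # same rack: cheap in-rack hop
        return 2      # off-rack: crosses the rack interconnect
    return min(replicas, key=distance)

replicas = [("n4", "rack2"), ("n2", "rack1"), ("n5", "rack2")]
print(pick_replica("n1", "rack1", replicas))  # ('n2', 'rack1')
```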