StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters

LIU Kai
Email: [email protected]
Blog: http://kiwenlau.com/
National Institute of Informatics, Japan

Upload: kiwenlau

Post on 13-Aug-2015



Contents

Introduction (What?)
Motivation (Why?)
Implementation (How?)
Personal Ideas


Introduction – What is StoreApp?


Background

Hadoop (version 1): for big data storage and computation
  Hadoop Distributed File System (HDFS): for storage
  Hadoop MapReduce framework: for computation
Master/slave architecture
  Storage (DataNode) and computation (TaskTracker) are co-located on each node

[Figure: one master node running the NameNode and JobTracker, and four slave nodes each running a DataNode and a TaskTracker. Each node is a physical machine or a virtual machine.]

Overview

What is StoreApp?
  A Hadoop plugin
  Speeds up Hadoop running in virtual machines
  Separates storage (DataNode) from computation (TaskTracker)

[Figure: two physical machines. On one, every virtual machine runs both a DataNode and a TaskTracker; on the other, a single DataNode sits alongside three TaskTrackers.]

Benefit

Improves HDFS throughput by 78.3%
  The storage VM has higher scheduling priority than the computation VMs
  Consolidating storage into one VM reduces I/O contention

Reduces job completion time by 61%
  Most Hadoop jobs are data intensive
  Their performance is bottlenecked by slow disk access


Motivation – Why do we need StoreApp?


Challenge 1

Nodes cannot be added or removed easily
  Rebalancing data incurs significant data movement
  The cluster cannot exploit the elasticity of virtual machines
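To make the rebalancing cost concrete, here is a minimal sketch that estimates how much data must move when nodes join a cluster in which storage and computation are co-located. It assumes blocks are spread evenly before and after rebalancing; the function name and the 10 TB figure are illustrative assumptions, not numbers from the talk.

```python
def data_moved_on_scale_out(total_data_gb: float, old_nodes: int, new_nodes: int) -> float:
    """Data (GB) that must be copied so every node ends up with an equal share.

    Assumption: data is evenly spread over old_nodes before the change and
    over new_nodes after it.
    """
    if old_nodes <= 0 or new_nodes <= old_nodes:
        raise ValueError("expected 0 < old_nodes < new_nodes")
    share_after = total_data_gb / new_nodes
    # Each added node starts empty and must receive one full share.
    return share_after * (new_nodes - old_nodes)


# Hypothetical cluster: 10 TB of HDFS data, growing from 9 to 10 nodes.
moved = data_moved_on_scale_out(10_000, old_nodes=9, new_nodes=10)
print(f"{moved:.0f} GB must move to add one node")
```

Under these assumptions, a tenth of the data set is copied just to add a single node, which is why frequent resizing is impractical when every node also stores data.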

[Figure: nine virtual machines, each running both a DataNode and a TaskTracker. Legend: physical machine, virtual machine.]

Solution 1

Separate storage from computation
  Adding or removing a computation node requires no data movement
  The optimal number of computation nodes can be found for each Hadoop job

[Figure: three DataNodes, each accompanied by several TaskTrackers. Legend: physical machine, virtual machine.]

Challenge 2

Co-located virtual machines often access the disk concurrently
  Their random I/O operations compete with each other
  This significantly degrades Hadoop job performance

[Figure: same co-located layout as in Challenge 1.]

Solution 2

Separate storage from computation
  Each physical machine has only one storage virtual machine
  Only the storage virtual machine is I/O intensive
  No serious contention from concurrent I/O operations

[Figure: same separated layout as in Solution 1.]

Challenge 3

Virtual machines cannot be scheduled efficiently
  I/O-intensive VMs can be prioritized because they consume less CPU
  However, every VM is I/O intensive!

[Figure: same co-located layout as in Challenge 1.]

Solution 3

Separate storage from computation
  Only the storage virtual machine is I/O intensive
  The storage virtual machine receives a higher scheduling priority

[Figure: same separated layout as in Solution 1.]

Implementation – How to design StoreApp?


Architecture


A StoreApp manager and multiple storage nodes
  The StoreApp manager runs on the master node
  Each physical machine has one storage node


Components

StoreApp manager: coordinates the operations of all data nodes

Scheduler: schedules tasks according to data locations

HDFS Proxy: receives all HDFS requests and forwards them to the DataNode

Shuffler: receives map output and pushes it to the DataNode
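The scheduler's role can be illustrated with a minimal sketch of locality-aware task placement. This is not StoreApp's actual scheduler: the function name, the dictionaries that map blocks and TaskTrackers to physical hosts, and the least-loaded tie-breaking rule are all assumptions made for illustration.

```python
from collections import defaultdict


def schedule_by_locality(block_hosts, tracker_hosts, pending_tasks):
    """Assign each task to a TaskTracker on the physical host that stores
    the task's input block; fall back to any tracker otherwise.

    block_hosts:   {block_id: physical_host}    (hypothetical metadata)
    tracker_hosts: {tracker_id: physical_host}  (hypothetical metadata)
    pending_tasks: list of (task_id, block_id)
    """
    trackers_on = defaultdict(list)
    for tracker, host in tracker_hosts.items():
        trackers_on[host].append(tracker)

    all_trackers = sorted(tracker_hosts)
    load = {tracker: 0 for tracker in all_trackers}
    assignment = {}
    for task_id, block_id in pending_tasks:
        local = trackers_on.get(block_hosts.get(block_id), [])
        candidates = local or all_trackers
        # Pick the least-loaded candidate; ties break by tracker name.
        chosen = min(candidates, key=lambda t: (load[t], t))
        load[chosen] += 1
        assignment[task_id] = chosen
    return assignment


plan = schedule_by_locality(
    block_hosts={"b1": "pm1", "b2": "pm2"},
    tracker_hosts={"tt1": "pm1", "tt2": "pm1", "tt3": "pm2"},
    pending_tasks=[("t1", "b1"), ("t2", "b1"), ("t3", "b2")],
)
print(plan)
```

In this example both tasks that read b1 land on TaskTrackers hosted on pm1, next to the storage node that holds the block, so their reads stay on the same physical machine.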


HDFS Prefetching


Read the whole block b1 instead of only the needed partial records
Keep the unused data of block b1 in memory
Read consecutive blocks into memory to form input split s1
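The prefetching idea can be sketched as a small read-through cache. This is only an illustration under stated assumptions: the class name, the `read_block` callable, and the unbounded in-memory dictionary are hypothetical, and a real implementation would need eviction and would talk to HDFS rather than to a Python dict.

```python
class PrefetchingReader:
    """Caches whole blocks so that later partial reads are served from memory."""

    def __init__(self, read_block):
        self._read_block = read_block  # hypothetical callable: block_id -> bytes
        self._cache = {}               # unbounded here; no eviction in this sketch
        self.disk_reads = 0

    def _load(self, block_id):
        if block_id not in self._cache:
            # Fetch the whole block once, not just the requested records.
            self._cache[block_id] = self._read_block(block_id)
            self.disk_reads += 1
        return self._cache[block_id]

    def read(self, block_id, offset, length):
        """Return a slice of the block, loading the full block on first access."""
        return self._load(block_id)[offset:offset + length]

    def read_split(self, block_ids):
        """Concatenate consecutive blocks into one in-memory input split."""
        return b"".join(self._load(b) for b in block_ids)


blocks = {"b1": b"record-1|record-2|", "b2": b"record-3|"}
reader = PrefetchingReader(blocks.__getitem__)
first = reader.read("b1", 0, 9)           # one disk read fetches all of b1
second = reader.read("b1", 9, 9)          # served from memory
split = reader.read_split(["b1", "b2"])   # only b2 still has to be read
print(first, second, reader.disk_reads)
```

Two tasks that need different records from b1 trigger a single disk read between them, which is the effect the slide describes.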

[Figure labels: task0, task1]

Automated Cluster Resizing


Dynamically change the cluster size during job execution
An iterative algorithm searches for the optimal cluster size
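The slide does not spell out the iterative algorithm, so the following is a simple hill-climbing stand-in rather than the paper's method: keep adding computation nodes while each step still shortens the measured job time. The function name, the step size, and the toy cost model are assumptions for illustration.

```python
def find_cluster_size(measure_time, start=2, step=2, max_nodes=64):
    """Grow the cluster while each step still shortens the measured time.

    measure_time: hypothetical callable mapping a node count to a job time.
    Returns (chosen_size, time_at_that_size).
    """
    size = start
    best_time = measure_time(size)
    while size + step <= max_nodes:
        candidate_time = measure_time(size + step)
        if candidate_time >= best_time:
            break  # no further gain: stop growing
        size, best_time = size + step, candidate_time
    return size, best_time


def toy_job_time(nodes):
    """Toy cost model: parallel work shrinks with nodes, overhead grows."""
    return 600 / nodes + 5 * nodes


size, seconds = find_cluster_size(toy_job_time)
print(size, seconds)
```

Because computation nodes hold no HDFS data in the separated design, each probe in such a search only adds or removes TaskTracker VMs and requires no data rebalancing.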


Personal Ideas


Pros and cons

Pros
  Simple idea that shows good results
  Clear logic in locating and solving problems

Cons
  Restricted to Hadoop 1
  Not open source


Future direction

From Hadoop 1 to Hadoop 2
  Hadoop 2 is quite different from Hadoop 1
  Hadoop 2 supports more application frameworks, such as Spark

From virtual machines to containers
  Containers are a more lightweight virtualization technology
  Containers are more resource efficient than virtual machines
  Containers are easier to scale than virtual machines


References

Yanfei Guo, et al. "StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters", INFOCOM, 4, 2015


Thank you!
