
FlowOS-RM: Disaggregated Resource Management System

Ryousei Takano∗, Kuniyasu Suzaki∗, Hidetaka Koie∗
∗ Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan
{takano-ryousei, k.suzaki, koie-hidetaka}@aist.go.jp

I. MOTIVATION AND GOALS

A traditional data center consists of monolithic servers, and the software stack is built on top of them. When used for various AI and Big Data workloads, such an architecture faces limitations including a lack of operational flexibility, low resource utilization, and low maintainability.

Moreover, task-specific accelerators such as the Google TPU, Fujitsu DLU, Microsoft BrainWave, and D-Wave quantum annealer are emerging. Utilizing such heterogeneous accelerators is the key to achieving sustainable performance improvement in the post-Moore era. From a system perspective, holistic resource management is essential: a user should be able to easily combine components, including generic processors and accelerators, across the sequence of a job.

Resource disaggregation is a promising solution to address the limitations of traditional data centers. In a disaggregated data center, CPUs, accelerators, memory, and storage are separated and interconnected via a high-speed network. We propose a disaggregated data center architecture, Flow-in-Cloud, that enables an existing cluster to expand its accelerator pool through a high-speed network. This poster demonstrates the feasibility of the prototype system using a distributed deep learning application.

II. FLOW-IN-CLOUD: DISAGGREGATED DATA CENTER ARCHITECTURE

Flow-in-Cloud (FiC) is a shared pool of heterogeneous accelerators, such as GPUs and FPGAs, directly connected by a circuit-switched network. From this pool of accelerators, a slice is dynamically configured and provided according to a user request, as shown in Figure 1. The FiC network comprises a set of FiC switch boards, each with an FPGA (Xilinx Kintex UltraScale XCKU095), 32 10-Gbps serial connections, DRAM, and a Raspberry Pi 3 as controller. Circuit-switching logic and user-defined logic written in a high-level synthesis language run on the FPGA; the latter logic is partially reconfigurable in advance of an application deployment. Note that this poster uses ExpEther [1], [2] instead of the FiC network to disaggregate accelerators from servers.

FlowOS manages the entire set of FiC resources and supports the execution of user jobs on provided slices. FlowOS employs a layered architecture comprising FlowOS-API, FlowOS-RM, and FlowOS-drivers. This poster focuses on FlowOS-RM.

This paper is partially based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), Japan.

Fig. 1. Overview of the Flow-in-Cloud architecture: a set of compute nodes and a resource pool of FPGA, GPU, and SCM devices interconnected by the FiC network, from which slices ("meta-accelerators") are configured according to application requirements; a prototype FiC switch board is also shown.

III. FLOWOS-RM

FlowOS-RM works seamlessly in cooperation with a cluster resource manager such as Apache Mesos [3], Kubernetes, or SLURM. It provides users with a REST API to configure a slice and execute a job on it. FlowOS-RM supports both single-node tasks and MPI-style multi-node tasks.
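As a rough illustration of the kind of interaction such a REST API enables, the following Python sketch configures a slice and submits an MPI-type job. The base URL, endpoint paths, and JSON fields are assumptions made for illustration only; the actual FlowOS-RM API surface is not specified here.

```python
# Illustrative client for a FlowOS-RM-style REST API.
# NOTE: base URL, endpoints, and JSON fields are assumed, not taken
# from the FlowOS-RM specification.
import requests

FLOWOS_RM = "http://flowos-rm.example:8080"   # assumed base URL

# Request a slice of two nodes, each with two disaggregated P100 GPUs.
slice_spec = {
    "name": "mnist-2node-2gpu",
    "nodes": [
        {"devices": [{"type": "gpu", "model": "P100", "count": 2}]},
        {"devices": [{"type": "gpu", "model": "P100", "count": 2}]},
    ],
}
resp = requests.post(f"{FLOWOS_RM}/slices", json=slice_spec)
slice_id = resp.json()["id"]

# Submit an MPI-type multi-node job onto the slice.
job_spec = {
    "type": "mpi",
    "image": "chainermn-mnist:latest",
    "command": ["mpiexec", "-n", "4", "python", "train_mnist.py"],
}
requests.post(f"{FLOWOS_RM}/slices/{slice_id}/jobs", json=job_spec)
```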

To implement such a mechanism, FlowOS-RM combines the following components. (1) Disaggregated device management: ExpEther is a PCIe-over-Ethernet technology that allows us to dynamically attach and detach remote PCIe devices over Ethernet. (2) OS deployment: Bare-Metal Container (BMC) [4] constructs an execution environment that runs a Docker image with an application-optimized OS kernel on a node. (3) Task scheduling and execution: FlowOS-RM is implemented on top of a Mesos framework; it co-allocates nodes to meet a user requirement and launches a task on each node in the manner of Mesos.
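A minimal sketch of how these three roles could be separated behind a common interface is given below; the class and method names are hypothetical and only mirror the responsibilities described above, not the actual FlowOS-RM code.

```python
# Hypothetical abstraction of the three FlowOS-RM components;
# names are illustrative, not the actual FlowOS-RM classes.
from abc import ABC, abstractmethod

class DeviceManager(ABC):
    """Disaggregated device management (backed by ExpEther in the prototype)."""
    @abstractmethod
    def attach(self, node, device): ...
    @abstractmethod
    def detach(self, node, device): ...

class OSDeployer(ABC):
    """OS deployment (backed by Bare-Metal Container, BMC)."""
    @abstractmethod
    def launch_machine(self, node, kernel, image): ...
    @abstractmethod
    def destroy_machine(self, node): ...

class TaskExecutor(ABC):
    """Task scheduling and execution (backed by a Mesos framework)."""
    @abstractmethod
    def prepare_task(self, node, task): ...
    @abstractmethod
    def launch_task(self, node, task): ...
```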

Figure 2 presents a job execution flow in FlowOS-RM, where a job is a set of tasks and each task runs on a node belonging to a slice. First, constructing a slice is divided into two steps: (1) attach-device attaches disaggregated devices to a node, and (2) launch-machine powers on a node with a specific OS kernel and a Docker container. Second, launching a job is divided into two steps: (3) prepare-task does housekeeping for launching a task in the manner of Mesos, and (4) launch-task runs a task on a node. Finally, destructing a slice is divided into two steps: (5) detach-device detaches devices from a node, and (6) destroy-machine powers down a node.
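To make the six-step flow concrete, the sketch below strings the steps together using the hypothetical component interfaces from the previous section; error handling, asynchronous completion, and Mesos callbacks are omitted, and wait_for_completion is an assumed helper passed in by the caller.

```python
# Hypothetical orchestration of the six steps in Figure 2
# (illustrative only, not the FlowOS-RM implementation).
def run_job(slice_spec, job, devmgr, deployer, executor, wait_for_completion):
    # Construct the slice.
    for node in slice_spec["nodes"]:
        for dev in node["devices"]:
            devmgr.attach(node, dev)                               # (1) attach-device
        deployer.launch_machine(node, node["kernel"], node["image"])  # (2) launch-machine

    # Launch the job: one task per node.
    for node, task in zip(slice_spec["nodes"], job["tasks"]):
        executor.prepare_task(node, task)                          # (3) prepare-task
        executor.launch_task(node, task)                           # (4) launch-task
    wait_for_completion(job)                                       # assumed helper

    # Destruct the slice.
    for node in slice_spec["nodes"]:
        for dev in node["devices"]:
            devmgr.detach(node, dev)                               # (5) detach-device
        deployer.destroy_machine(node)                             # (6) destroy-machine
```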

Fig. 2. Job execution flow in FlowOS-RM: compute nodes and I/O boxes holding P100 GPUs are connected through an Ethernet switch via ExpEther HBAs; driven through the REST API, FlowOS-RM performs device management (1. attach-device, 5. detach-device) via ExpEther, OS deployment (2. launch-machine, 6. destroy-machine) via BMC, and task execution (3. prepare-task, 4. launch-task) via Mesos agents running in containers on each node.

IV. EXPERIMENT

A. Experimental Setting

To demonstrate the feasibility of FlowOS-RM, we conducted distributed deep learning experiments on a four-node cluster, as shown in Figure 3a. An MNIST application on ChainerMN [5] is used as the benchmark program. Each compute node has two ExpEther HBAs to connect to PCIe devices in I/O boxes through a 40 GbE switch.
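For reference, the data-parallel pattern the ChainerMN MNIST example follows looks roughly like the sketch below, with one MPI process per GPU. This is a simplified reconstruction based on the public ChainerMN examples, not the exact benchmark script used in the experiment.

```python
# Simplified ChainerMN-style data-parallel setup (one MPI process per GPU).
import chainer
import chainer.functions as F
import chainer.links as L
import chainermn

class MLP(chainer.Chain):
    """Small multi-layer perceptron for MNIST-like inputs."""
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 100)
            self.l2 = L.Linear(None, 10)
    def __call__(self, x):
        return self.l2(F.relu(self.l1(x)))

comm = chainermn.create_communicator()     # MPI communicator across processes
device = comm.intra_rank                   # local GPU id on this node

model = L.Classifier(MLP())
chainer.cuda.get_device_from_id(device).use()
model.to_gpu()

# Gradients are all-reduced across processes by the multi-node optimizer.
optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.Adam(), comm)
optimizer.setup(model)
```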

Fig. 3. Experimental configuration. (a) Physical cluster configuration: four compute nodes, each with two ExpEther HBAs, connected through a 40G Ethernet switch to I/O boxes holding P100 GPUs, a P40 GPU, and NVMe drives. (b) Slice configurations: 1node-4gpu, 2node-2gpu, and 4node-1gpu.

B. Experimental Results

1) Slice construction/destruction overhead: We ran the MNIST application on the three slice configurations shown in Figure 3b; the breakdown of the execution time for each slice is shown in Figure 4a. The MNIST training runs faster as the number of GPUs per node increases. Here we focus on the overhead of the slice construction and destruction operations. A launch-machine operation takes longer as the number of nodes increases, because downloading a container image of about 3 GB over GbE becomes the bottleneck. Some operations, including attach/detach-device and launch-task, take longer as the number of GPUs per node increases, because these operations are not parallelized. We plan to reduce this overhead by using a faster NIC and by parallelizing these operations.
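As a sketch of the planned parallelization, the attach-device calls for one node's GPUs could be issued concurrently rather than serially, for example as follows; devmgr.attach is the hypothetical ExpEther-backed call from the earlier sketch, not the authors' implementation.

```python
# Minimal sketch: attach a node's devices in parallel instead of serially
# (hypothetical optimization, not the FlowOS-RM code).
from concurrent.futures import ThreadPoolExecutor

def attach_devices_parallel(devmgr, node, devices):
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [pool.submit(devmgr.attach, node, dev) for dev in devices]
        for fut in futures:
            fut.result()   # re-raise any attach failure
```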

2) Resource sharing: We have confirmed that disaggregated resources are shared among several slices according to user requirements. In this experiment, a user submitted four jobs, and FlowOS-RM allocated resources to each slice in a FIFO manner. The slice configurations of the jobs are as follows: Slice1 and Slice2 each consist of 2node-2gpu (P100), Slice3 consists of 1node-1gpu (P40), and Slice4 consists of 4node-4gpu (P100). Figure 4b shows that resource sharing among slices works as expected.
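The poster states only that allocation is FIFO; a toy model of such a policy is sketched below, with the GPU pool and request format chosen purely for illustration.

```python
# Toy FIFO allocation of GPUs to pending slices (illustrative only).
from collections import deque

def allocate_fifo(free_gpus, pending):
    """free_gpus: list of GPU ids; pending: deque of (slice_name, gpus_needed)."""
    granted = []
    # Strict FIFO: stop as soon as the head request cannot be satisfied.
    while pending and pending[0][1] <= len(free_gpus):
        name, need = pending.popleft()
        granted.append((name, free_gpus[:need]))
        free_gpus = free_gpus[need:]
    return granted, free_gpus

free = [f"P100-{i}" for i in range(8)] + ["P40-0"]
queue = deque([("Slice1", 4), ("Slice2", 4), ("Slice3", 1), ("Slice4", 4)])
running, remaining = allocate_fifo(free, queue)
# Slice1-3 start immediately; Slice4 waits until GPUs are released.
```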

Fig. 4. Experimental results. (a) Execution time of MNIST on ChainerMN for the 1node-4gpu, 2node-2gpu, and 4node-1gpu slices, broken down into attach-device, launch-machine, prepare-task, run-task, detach-device, and destroy-machine. (b) GPU resources allocated to Slice1 through Slice4 over time.

V. CONCLUSION AND FUTURE WORK

We have demonstrated effective resource sharing on the proposed disaggregated resource management system for AI and Big Data applications. We found some performance issues, but their impact is limited for long-running applications. We plan to evaluate various applications on this system while applying performance optimization techniques such as PGO [6].

REFERENCES

[1] J. Suzuki, Y. Hidaka, J. Higuchi, T. Yoshikawa, and A. Iwata, "ExpressEther - Ethernet-based virtualization technology for reconfigurable hardware platform," in 14th IEEE Symposium on High-Performance Interconnects (HOTI), 2006, pp. 45-51.

[2] ExpEther Consortium, http://www.expether.org, 2018.

[3] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," in NSDI, vol. 11, 2011, pp. 22-22.

[4] K. Suzaki, H. Koie, and R. Takano, "Bare-metal container: direct execution of a container image on a remote machine with an optimized kernel," in 18th IEEE International Conference on High Performance Computing and Communications (HPCC), 2016, pp. 25-36.

[5] Preferred Networks, "ChainerMN," https://github.com/chainer/chainermn, 2018.

[6] K. Suzaki, H. Koie, and R. Takano, "Profile guided kernel optimization for individual container execution on bare-metal container," in ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Poster), 2017.