[225] Building a YARN-based Deep Learning Application Cluster...
TRANSCRIPT
![Page 1: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/1.jpg)
Building a Deep Learning Application Cluster Based on YARN and Docker
김제민
NAVER Search
![Page 2: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/2.jpg)
What we do now, and what we have done so far
![Page 3: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/3.jpg)
Naver Search
• Search modeling
• Search data refinement
• Search services
• Large-scale data processing
• Distributed data processing
• Real-time data processing
![Page 4: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/4.jpg)
Introduction to the C3 Hadoop Cluster
• Apache Ambari
• Cluster deployment, management, and monitoring
• Hadoop 2.7.1
• Hive, Spark, HBase, Oozie, etc.
![Page 5: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/5.jpg)
![Page 6: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/6.jpg)
Deep Learning
• One of several machine learning techniques
• An artificial neural network, but a deep one
< AlexNet >
• 60 million parameters and 650,000 neurons
From “ImageNet Classification with Deep Convolutional Neural Networks”
![Page 7: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/7.jpg)
Deep Learning Breakthrough
• Overcoming the difficulty of training
• 2006, “A fast learning algorithm for deep belief nets” by Geoffrey E. Hinton et al.
• Hardware
• Advances in GPU performance
• NVIDIA CUDA
![Page 8: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/8.jpg)
Deep Learning
delivers by far the best performance
in almost every field
![Page 9: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/9.jpg)
So now it is Go, Go with Deep Learning !!!
![Page 10: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/10.jpg)
• Many projects
• Many researchers and developers
But
• GPU resources are limited
![Page 11: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/11.jpg)
Server
Server
Server
Server
![Page 12: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/12.jpg)
The need to share GPU resources effectively
? GPU Cluster
![Page 13: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/13.jpg)
Deep Learning Application Development Environment
• GPU
• NVIDIA CUDA
• Deep Learning Frameworks
• Caffe
• TensorFlow
• Torch
• Theano
• Keras
![Page 14: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/14.jpg)
So,
in conclusion,
what we need right now is …
![Page 15: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/15.jpg)
A Multi-tenant Deep Learning Application Runtime Environment
• Support for a variety of deep learning frameworks
• Resource management for the GPU cluster
? GPU Cluster
DL Application
DL Application
DL Application
![Page 16: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/16.jpg)
A Multi-tenant Deep Learning Application Runtime Environment Based on Hadoop YARN
? GPU Cluster
Hadoop YARN
DL Application
DL Application
DL Application
![Page 17: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/17.jpg)
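Why YARN?
• Mesos?
• Advantages of YARN
• Capacity Scheduler: queue-based resource scheduling
• Makes resource planning easier in an enterprise environment
• Other
• Compatibility with, and possible integration into, the existing cluster (C3)
• A cluster deployment and management system already exists (Ambari)
• Know-how from operating YARN clusters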
![Page 18: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/18.jpg)
Contents
1. Technical Issues
2. Developing the Deep Learning Application Toolset
3. Building the C3 Deep Learning Cluster
![Page 19: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/19.jpg)
1. Technical Issues
![Page 20: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/20.jpg)
Technical Issues
1) Running a shell script application on a YARN cluster
2) Providing a variety of deep learning framework environments
3) GPU resource scheduling on a YARN cluster
![Page 21: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/21.jpg)
1. 1) Running a Shell Script Application on a YARN Cluster
![Page 22: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/22.jpg)
Shell Script Application on YARN?
[Diagram: a DL Application submitted to a YARN Cluster]
Request: CPU Core 1, Memory 1G, Main: main.sh
tar: dl_app.tar.gz
Contents: data/, examples/, output/, main.sh, create_mnist.sh, lenet_solver.prototxt, lenet_train_test.prototxt, train_lenet.sh
On the cluster: the same files, with main.sh marked ✓, run with CPU Core 1, Memory 1G
![Page 23: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/23.jpg)
Hadoop YARN
• Hadoop YARN (Hadoop v2)
• Yet Another Resource Negotiator
• Can run general applications, not only MapReduce
• Resource scheduling based on CPU and memory
• YARN architecture
• Resource Manager (Master)
• Node Manager (Slaves)
• Smallest unit of execution: Container
![Page 24: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/24.jpg)
YARN Architecture
Resource Manager(Master)
Node Manager(Slave)
Server Container
Container
Container
Container
NM
NM
![Page 25: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/25.jpg)
YARN Application
Resource Manager(Master) NM
Application Master
Container
NM
NM
Container Request
![Page 26: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/26.jpg)
YARN Application
Resource Manager(Master) NM
Application Master
Container
Container
NM
NM
Container
Container
![Page 27: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/27.jpg)
Shell Script Application on YARN?
[Diagram: a DL Application submitted to a YARN Cluster]
Request: CPU Core 1, Memory 1G, Main: main.sh
tar: dl_app.tar.gz
Contents: data/, examples/, output/, main.sh, create_mnist.sh, lenet_solver.prototxt, lenet_train_test.prototxt, train_lenet.sh
On the cluster: the same files, with main.sh marked ✓, run with CPU Core 1, Memory 1G
![Page 28: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/28.jpg)
Developing a YARN Application
• Between ApplicationMaster and ResourceManager
• org.apache.hadoop.yarn.client.api.async.AMRMClientAsync
• Interface AMRMClientAsync.CallbackHandler
• Between ApplicationMaster and NodeManager
• org.apache.hadoop.yarn.client.api.async.NMClientAsync
• Interface NMClientAsync.CallbackHandler
![Page 29: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/29.jpg)
YARN Distributed Shell (Example)
• https://github.com/apache/hadoop-common/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell
• ApplicationMaster.java
• Client.java
• DSConstants.java
• Log4jPropertyHelper.java
![Page 30: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/30.jpg)
C3 Distributed Shell
• Based on YARN Distributed Shell
• Added features
• CONTAINER_INDEX and NUM_CONTAINERS environment variables
• YARN-5696
• Resource requests that depend on the server node
• Request resources on a specific server node
• Server node blacklist setting
• YARN-4703
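A minimal sketch of how a script running inside one container might use the two environment variables to pick its share of the work. The slide does not say whether the index starts at 0 or what the exact variable spelling is on the cluster; the sketch assumes `CONTAINER_INDEX` is 0-based and falls back to a single container when the variables are absent. The function name `shard_for_container` is hypothetical.

```python
import os


def shard_for_container(items, environ=os.environ):
    """Return the subset of `items` that this container should process."""
    # Assumption: CONTAINER_INDEX is 0-based; adjust the names if the
    # cluster exports them under a different spelling.
    index = int(environ.get("CONTAINER_INDEX", "0"))
    total = int(environ.get("NUM_CONTAINERS", "1"))
    if not 0 <= index < total:
        raise ValueError(f"index {index} is out of range for {total} containers")
    # Round-robin split: container i takes items i, i + total, i + 2 * total, ...
    return items[index::total]


if __name__ == "__main__":
    files = [f"part-{n:05d}" for n in range(10)]
    print(shard_for_container(files))
```

With this kind of split, every container can run the same shell script and still work on disjoint inputs.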
![Page 31: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/31.jpg)
Technical Issues
✓ Running a shell script application on a YARN cluster
2) Providing a variety of deep learning framework environments
3) GPU resource scheduling on a YARN cluster
→ C3 Distributed Shell
![Page 32: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/32.jpg)
1. 2) Providing a Variety of Deep Learning Framework Environments
![Page 33: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/33.jpg)
Classical Approach
• Install every DL framework on every server
• Installation, upgrades, and updates are hard to maintain
• Supporting multiple versions of the same DL framework is an issue
• Library dependency conflicts between different DL frameworks
• DevOps tools:
• Dimensions
• (DL frameworks) * (versions) * (dependent libs) * (server environments)
![Page 34: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/34.jpg)
It would be nice if each deep learning framework environment
were cleanly isolated
and managed on its own ...
![Page 35: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/35.jpg)
Docker ?
![Page 36: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/36.jpg)
But what about the GPU?
What about CUDA?
Is that possible in Docker?
![Page 37: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/37.jpg)
Using CUDA in Docker
• Install the NVIDIA user-level libraries and the CUDA libraries in the Docker image
• Constraint: they must match the major and minor version of the NVIDIA driver installed on the host
• Expose the NVIDIA device files to the container when running Docker
$ docker run … \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia \
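As an illustration of the device-file approach, the following sketch builds the `--device` flags for a plain `docker run`. It assumes the usual Linux naming where each GPU has its own device file `/dev/nvidia<N>`; the image name `my/cuda-image` and the helper names are hypothetical, and the sketch only assembles the command line without running it.

```python
def docker_device_args(gpu_indices):
    """Build the --device flags that expose NVIDIA device files to a container."""
    # The control and unified-memory device files are shared by all GPUs.
    devices = ["/dev/nvidiactl", "/dev/nvidia-uvm"]
    # Assumption: per-GPU device files are named /dev/nvidia<N>.
    devices += [f"/dev/nvidia{i}" for i in gpu_indices]
    return [f"--device={d}" for d in devices]


def docker_run_command(image, command, gpu_indices):
    """Return the argv list for `docker run` with the GPU devices exposed."""
    return ["docker", "run", "--rm", *docker_device_args(gpu_indices), image, *command]


if __name__ == "__main__":
    print(" ".join(docker_run_command("my/cuda-image", ["nvidia-smi"], [0, 1])))
```

Exposing the device files is only half of the setup: the image still has to carry user-level NVIDIA libraries that match the host driver version.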
![Page 38: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/38.jpg)
Using CUDA in Docker
Host Server
NVIDIA Kernel Module 361.48
User Level NVIDIA Libs 361.48
NVIDIA Driver Version: Major 361 / Minor
Docker Container (Image)
User Level NVIDIA Libs 361.48
GPU Application
CUDA Libs
/dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia
![Page 39: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/39.jpg)
However,
the Docker image
now depends on the host …
![Page 40: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/40.jpg)
Nvidia-Docker !!
![Page 41: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/41.jpg)
Nvidia-Docker
• Resolves the host dependency that arises when using GPU devices in Docker
• Supports GPU resource isolation !!
![Page 42: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/42.jpg)
Nvidia-Docker
Host Server
NVIDIA Kernel Module 361.48
User Level NVIDIA Libs 361.48
Docker Container (Image)
GPU Application
CUDA Libs
volume: User Level NVIDIA Libs 361.48
![Page 43: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/43.jpg)
Running Nvidia-Docker
• Basic usage
• nvidia-docker <docker-options> <docker-command> <docker-args>
• GPU isolation
• Set the NV_GPU environment variable to a list of GPU device IDs
• NV_GPU='0,1' nvidia-docker <docker-options> <docker-command> …
• Docker image constraint
• Use an image pre-built for nvidia-docker
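The GPU isolation step can be sketched as a small helper that prepares the command line and environment for `nvidia-docker`. Only the `NV_GPU` variable and the `nvidia-docker <docker-command>` form come from the slide; the helper name and the image name in the usage note are hypothetical, and the sketch stops short of launching anything.

```python
import os


def nvidia_docker_invocation(image, command, gpu_ids, base_env=None):
    """Return (argv, env) for running `command` in `image` on the given GPUs only."""
    env = dict(os.environ if base_env is None else base_env)
    # NV_GPU takes a comma-separated list of GPU device IDs, e.g. "0,1".
    env["NV_GPU"] = ",".join(str(g) for g in gpu_ids)
    argv = ["nvidia-docker", "run", "--rm", image, *command]
    return argv, env


if __name__ == "__main__":
    argv, env = nvidia_docker_invocation("some/cuda-image", ["nvidia-smi"], [0, 3])
    print("NV_GPU=" + env["NV_GPU"], " ".join(argv))
```

On a host with nvidia-docker installed, the pair could be passed to `subprocess.run(argv, env=env, check=True)`.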
![Page 44: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/44.jpg)
Deep learning framework environments
are provided as Docker images
built on nvidia-docker
![Page 45: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/45.jpg)
Technical Issues
✓ Running a shell script application on a YARN cluster
✓ Providing a variety of deep learning framework environments
3) GPU resource scheduling on a YARN cluster
→ C3 Distributed Shell
→ Nvidia-Docker
![Page 46: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/46.jpg)
1. 3) GPU Resource Scheduling on a YARN Cluster
![Page 47: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/47.jpg)
Current State of GPU Scheduling in YARN
• YARN currently schedules on CPU and memory only
• GPU scheduling is not supported
• Scheduling a new resource type requires additional development
![Page 48: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/48.jpg)
Developing YARN GPU Scheduling
Main source code to change
• YARN Protocol
• YARN Configuration
• Resource abstraction object
• Dominant Resource Calculator
* YARN JIRA: YARN-5517
![Page 49: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/49.jpg)
YARN Protocol
• yarn_protos.proto
• yarn_service_protos.proto
Implementation Details
• yarn_protos.proto
    message ResourceProto {
      optional int32 memory = 1;
      optional int32 virtual_cores = 2;
      optional int32 gpu_cores = 3;
    }
    enum SchedulerResourceTypes {
      MEMORY = 0;
      CPU = 1;
      GPU = 2;
    }
• yarn_service_protos.proto
![Page 50: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/50.jpg)
YARN Configuration
• org.apache.hadoop.yarn.conf.YarnConfiguration
Implementation Details
• org.apache.hadoop.yarn.conf.YarnConfiguration
    public static final String RM_SCHEDULER_MINIMUM_ALLOCATION_GCORES =
        YARN_PREFIX + "scheduler.minimum-allocation-gcores";
    public static final int DEFAULT_RM_SCHEDULER_MINIMUM_ALLOCATION_GCORES = 0;

    public static final String RM_SCHEDULER_MAXIMUM_ALLOCATION_GCORES =
        YARN_PREFIX + "scheduler.maximum-allocation-gcores";
    public static final int DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_GCORES = 8;

    public static final String NM_GCORES = NM_PREFIX + "resource.gcores";
    public static final int DEFAULT_NM_GCORES = 0;
![Page 51: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/51.jpg)
Resource Object
• org.apache.hadoop.yarn.util.resource.Resources
Implementation Details
• org.apache.hadoop.yarn.util.resource.Resources
    public static Resource createResource(int memory) {
      return createResource(memory, (memory > 0) ? 1 : 0, 0);
    }

    public static Resource createResource(int memory, int cores) {
      return createResource(memory, cores, 0);
    }

    public static Resource createResource(int memory, int cores, int gcores) {
      Resource resource = Records.newRecord(Resource.class);
      resource.setMemory(memory);
      resource.setVirtualCores(cores);
      resource.setGpuCores(gcores);
      return resource;
    }

    // The methods below appear to belong to the immutable NONE resource.
    public int getGpuCores() {
      return 0;
    }

    public void setGpuCores(int gcores) {
      throw new RuntimeException("NONE cannot be modified!");
    }

    public int compareTo(Resource o) {
      int diff = 0 - o.getMemory();
      if (diff == 0) {
        diff = 0 - o.getVirtualCores();
        if (diff == 0) {
          diff = 0 - o.getGpuCores();
        }
      }
      return diff;
    }
![Page 52: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/52.jpg)
Dominant Resource Calculator
• org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
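To illustrate what adding GPU to the dominant resource calculation means, here is a simplified sketch of the dominant-share idea with three resource dimensions. It is not the patched Hadoop class: the `Resource` dataclass, the field names, and the cluster sizes are assumptions made for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int
    gpu_cores: int = 0


def dominant_share(used, cluster):
    """Largest fraction of any single cluster resource that `used` occupies."""
    shares = []
    for f in fields(Resource):
        total = getattr(cluster, f.name)
        if total > 0:  # skip dimensions the cluster does not offer
            shares.append(getattr(used, f.name) / total)
    return max(shares, default=0.0)


if __name__ == "__main__":
    cluster = Resource(memory_mb=256_000, vcores=64, gpu_cores=8)
    training_job = Resource(memory_mb=16_000, vcores=4, gpu_cores=4)
    etl_job = Resource(memory_mb=64_000, vcores=8)
    print(dominant_share(training_job, cluster))
    print(dominant_share(etl_job, cluster))
```

In this example the training job's dominant resource is GPU (4 of 8, a share of 0.5), while the GPU-free job is dominated by memory (a share of 0.25). A scheduler that compared only memory and CPU would treat the training job as the smaller of the two.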
![Page 53: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/53.jpg)
YARN UI Changes
* Screenshot from a test cluster
![Page 54: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/54.jpg)
C3 Distributed Shell: GPU Added
[Diagram: dl_app.tar.gz submitted to a YARN Cluster]
Request: CPU Core 1, Memory 1G, GPU 3, Main: main.sh
tar: dl_app.tar.gz
Contents: data/, examples/, output/, main.sh, create_mnist.sh, lenet_solver.prototxt, lenet_train_test.prototxt, train_lenet.sh
On the cluster: the same files, with main.sh marked ✓, run with CPU Core 1, Memory 1G, GPU 3
![Page 55: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/55.jpg)
Technical Issues
✓ Running a shell script application on a YARN cluster
✓ Providing a variety of deep learning framework environments
✓ GPU resource scheduling on a YARN cluster
→ C3 Distributed Shell
→ Nvidia-Docker
→ Developed YARN with GPU scheduling
![Page 56: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/56.jpg)
2. Developing the Deep Learning Application Launcher and Toolset
![Page 57: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/57.jpg)
Developing the Deep Learning Application Toolset
• GPU Device ID Manager
• DL (Deep Learning) App Concept
• DL App Launcher
• DL App Tools
• DL App Log
• DL App development environment
![Page 58: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/58.jpg)
The GPU Device ID Allocation Problem
• Problem: YARN tracks only the number of GPUs
• Logic is needed to explicitly assign GPU device IDs to each deep learning application
![Page 59: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/59.jpg)
GPU Device ID Manager
• GPU device ID allocation information for the whole cluster is managed through ZooKeeper
• GPU Device ID Manager
• python kazoo lib
• Distributed lock
• GPU device ID allocation / release
• Garbage collection
{
  "host01": {
    "gpu0": "application_id_01",
    "gpu1": "application_id_01",
    "gpu2": "application_id_02",
    "gpu3": "application_id_03"
  },
  "host02": {
    "gpu0": "application_id_04",
    …
  },
  …
}
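The bookkeeping behind such an allocation table can be sketched in memory as follows. This is not the talk's implementation: a `threading.Lock` stands in for the ZooKeeper distributed lock, a plain dict stands in for the data kept in ZooKeeper, and the class and method names are hypothetical.

```python
import threading


class GpuDeviceIdManager:
    """In-memory stand-in for a ZooKeeper-backed GPU allocation table."""

    def __init__(self, gpus_per_host):
        # {host: {gpu_id: application_id or None}}
        self._table = {
            host: {gpu: None for gpu in range(count)}
            for host, count in gpus_per_host.items()
        }
        # A local lock stands in for the distributed lock.
        self._lock = threading.Lock()

    def allocate(self, host, app_id, count):
        """Reserve `count` free GPU IDs on `host` for `app_id`."""
        with self._lock:
            free = [g for g, owner in self._table[host].items() if owner is None]
            if len(free) < count:
                raise RuntimeError(f"only {len(free)} free GPUs on {host}")
            chosen = free[:count]
            for gpu in chosen:
                self._table[host][gpu] = app_id
            return chosen

    def release(self, host, app_id):
        """Free every GPU ID on `host` that is held by `app_id`."""
        with self._lock:
            freed = [g for g, owner in self._table[host].items() if owner == app_id]
            for gpu in freed:
                self._table[host][gpu] = None
            return freed

    def collect_garbage(self, live_app_ids):
        """Reclaim GPU IDs whose owning application is no longer running."""
        with self._lock:
            reclaimed = 0
            for gpus in self._table.values():
                for gpu, owner in gpus.items():
                    if owner is not None and owner not in live_app_ids:
                        gpus[gpu] = None
                        reclaimed += 1
            return reclaimed
```

Garbage collection matters because an application that is killed abruptly never reaches its own release step; comparing the table against the list of live applications is one way to recover those IDs.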
![Page 60: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/60.jpg)
DL (Deep Learning) App Implementation Concept
• Components of a DL App
• Deep learning framework environment (Caffe, TensorFlow …)
• User program (source code)
• Input data
• Output data
[Diagram labels: Input, Output, DL framework environment, Source Code]
![Page 61: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/61.jpg)
C3 Distributed Shell
[Diagram: dl_app.tar.gz submitted to a YARN Cluster through C3 Distributed Shell]
Request: CPU Core 1, Memory 1G, GPU 3, Main: main.sh
tar: dl_app.tar.gz
Contents: data/, examples/, output/, main.sh, create_mnist.sh, lenet_solver.prototxt, lenet_train_test.prototxt, train_lenet.sh
On the cluster: the same files, with main.sh marked ✓, run with CPU Core 1, Memory 1G, GPU 3
User Program (Source Code)
Input Data ?
![Page 62: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/62.jpg)
DL (Deep Learning) App Implementation Concept
• Components of a DL App
• Deep learning framework environment (Caffe, TensorFlow …)
• User program (source code)
• Input data
• Output data
[Diagram labels: Input, Output, DL framework environment, Source Code]
![Page 63: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/63.jpg)
DL (Deep Learning) App Implementation Concept
• Input/Output data
• Data repository technologies
• NFS
• Ceph, GlusterFS, Lustre
✓ HDFS
• C3 HDFS storage is used as the repository
![Page 64: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/64.jpg)
DL (Deep Learning) App Implementation Concept
• Data inside a Docker container is volatile
• A permanent working area (directory) is needed inside the Docker container
• DL App Workspace (directory)
• The area holding the files and directories needed to run the DL App
• User program source code and input/output data are placed here
• A directory on the host server that is volume-mounted into the Docker container
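A rough sketch of the workspace idea: copy the user's files into a fresh host directory, then hand that directory to Docker as a bind mount. The mount point `/workspace`, the `output` subdirectory, and both helper names are assumptions for illustration; the talk does not show how the actual launcher lays out the workspace.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_workspace(user_dev_workspace, base_dir=None):
    """Copy the user's source tree into a fresh host directory and return its path."""
    workspace = Path(tempfile.mkdtemp(prefix="dl_app_workspace_", dir=base_dir))
    # dirs_exist_ok lets copytree fill the directory mkdtemp just created.
    shutil.copytree(user_dev_workspace, workspace, dirs_exist_ok=True)
    # Assumption: results are written to an "output" directory in the workspace.
    (workspace / "output").mkdir(exist_ok=True)
    return workspace


def volume_args(workspace, mount_point="/workspace"):
    """Docker flags that bind-mount the host workspace and start inside it."""
    return ["-v", f"{workspace}:{mount_point}", "-w", mount_point]
```

Because the workspace lives on the host, anything the container writes there survives after the container exits and can then be copied to permanent storage.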
![Page 65: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/65.jpg)
DL (Deep Learning) App Implementation Concept
/User_Dev_Workspace
/DL_App_Workspace
Deep Learning App
Input
Output
User
Container
![Page 66: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/66.jpg)
DL App Execution Flow
![Page 67: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/67.jpg)
DL App Execution Flow
YARN Container
Start the YARN container
![Page 68: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/68.jpg)
DL App Execution Flow
YARN Container
Allocate GPU device IDs
GPU Device ID Manager
GPU 0,3
Zookeeper
![Page 69: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/69.jpg)
YARN Container
DL App Execution Flow
DL App Workspace
Source Code
Copy the user workspace
GPU 0,3
![Page 70: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/70.jpg)
YARN Container
DL App Execution Flow
DL App Workspace
Input
Source Code
Copy the input data
GPU 0,3
![Page 71: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/71.jpg)
YARN Container
DL App Execution Flow
DL App Workspace (NVIDIA Docker)
Input
Source Code
Start the NVIDIA Docker container
GPU 0,3
![Page 72: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/72.jpg)
YARN Container
DL App Execution Flow
DL App Workspace (NVIDIA Docker)
Input
Source Code
Docker Volume
Mount the DL App workspace
GPU 0,3
![Page 73: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/73.jpg)
YARN Container
DL App Execution Flow
(NVIDIA Docker)
Input
Source Code
Run the DL App in Docker
GPU 0,3
![Page 74: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/74.jpg)
YARN Container
DL App Execution Flow
(NVIDIA Docker)
Input
Source Code
Output
Run the DL App in Docker
GPU 0,3
![Page 75: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/75.jpg)
YARN Container
DL App Execution Flow
Input
Source Code
Output
DL App Workspace
DL App Docker container exits
GPU 0,3
![Page 76: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/76.jpg)
YARN Container
DL App Execution Flow
Input
Source Code
Output
DL App Workspace
Copy the output data
GPU 0,3
![Page 77: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/77.jpg)
DL App Execution Flow
YARN Container
Release GPU device IDs
GPU Device ID Manager
GPU 0,3
Zookeeper
![Page 78: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/78.jpg)
DL App Execution Flow
YARN Container
YARN container exits
![Page 79: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/79.jpg)
DL App Execution Flow
YARN container exits
![Page 80: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/80.jpg)
Creating and Running a DL App
Desired DL App properties
• Docker image
• Path of the source code directory to run
• Main script to run
• Resources: CPU, memory, GPU
• Input data path
• Output data path
DL App execution flow
2) Allocate GPU IDs
3) Copy the user workspace
4) Copy the input data
5) DL App workspace is complete
6) Run the DL App in NVIDIA Docker
7) Copy the output data
8) Release GPU IDs
[Diagram: the properties are turned into a shell script application, which C3 Distributed Shell submits to the YARN Cluster]
![Page 81: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/81.jpg)
DL App Launcher
dl_app.properties
• Docker image
• Path of the source code directory to run
• Main script to run
• Resources: CPU, memory, GPU
• Input data path
• Output data path
DL App execution flow
2) Allocate GPU IDs
3) Copy the user workspace
4) Copy the input data
5) DL App workspace is complete
6) Run the DL App in NVIDIA Docker
7) Copy the output data
8) Release GPU IDs
dlapp-launcher dl_app.properties
[Diagram: DL App Launcher creates the shell script application from dl_app.properties, and C3 Distributed Shell submits it to the YARN Cluster]
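The execution steps can be read as a lifecycle in which GPU IDs are held for the duration of the middle steps. The sketch below models only that ordering. Releasing the IDs in a `finally` block, so that a failed step does not leak GPUs, is a design assumption rather than something the slides state, and `run_dl_app` and its parameters are hypothetical names.

```python
def run_dl_app(allocate_gpus, release_gpus, steps):
    """Run the workspace, Docker, and output steps while holding GPU IDs.

    `steps` is a list of (name, callable) pairs; each callable receives the
    allocated GPU IDs. Returns the names of the steps that completed.
    """
    gpu_ids = allocate_gpus()  # step 2: allocate GPU IDs
    completed = []
    try:
        for name, step in steps:  # steps 3 to 7, in order
            step(gpu_ids)
            completed.append(name)
    finally:
        # Step 8: release the IDs even if a step raised (assumed behaviour).
        release_gpus(gpu_ids)
    return completed


if __name__ == "__main__":
    log = []
    run_dl_app(
        allocate_gpus=lambda: [0, 3],
        release_gpus=lambda ids: log.append(f"released {ids}"),
        steps=[
            ("copy workspace", lambda ids: log.append("copy workspace")),
            ("run docker", lambda ids: log.append(f"run docker on {ids}")),
        ],
    )
    print(log)
```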
![Page 82: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/82.jpg)
dlapp-launcher Example
• Caffe LeNet MNIST example
• Based on caffe/examples/mnist
.
├── dl_app_caffe.properties
└── user_dev_workspace
    ├── create_mnist.sh
    ├── data/
    ├── dl_app_start.sh
    ├── examples/
    ├── lenet_solver.prototxt
    ├── lenet_train_test.prototxt
    ├── output/
    └── train_lenet.sh
![Page 83: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/83.jpg)
dlapp-launcher Example
• dl_app_caffe.properties
    [application]
    username=tony
    appname='c3 deeplearning app caffe mnist example'
    docker_image=naver/c3_dl-caffe
    user_dev_workspace_path=./user_dev_workspace
    user_shell_script=dl_app_start.sh
    user_shell_args='-gpu all'

    [from_hdfs]
    hdfs=/user/tony/caffe_test/*
    input_path=data

    [to_hdfs]
    output_path=output
    hdfs=/user/tony/caffe_mnist_example_output
    overwrite=true
• dl_app_start.sh
    #!/usr/bin/env bash
    ./create_mnist.sh
    ./train_lenet.sh "$@"
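The properties file uses an INI-style layout, so it can be read with Python's standard `configparser`. Whether dlapp-launcher itself parses the file this way is not stated; the sketch below, including the `load_properties` helper and the choice to strip the single quotes, is only one plausible reading of the format.

```python
import configparser

EXAMPLE = """\
[application]
username=tony
appname='c3 deeplearning app caffe mnist example'
docker_image=naver/c3_dl-caffe
user_dev_workspace_path=./user_dev_workspace
user_shell_script=dl_app_start.sh
user_shell_args='-gpu all'

[from_hdfs]
hdfs=/user/tony/caffe_test/*
input_path=data

[to_hdfs]
output_path=output
hdfs=/user/tony/caffe_mnist_example_output
overwrite=true
"""


def load_properties(text):
    """Parse a dl_app properties file into the values a launcher would need."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    app = parser["application"]
    return {
        "docker_image": app["docker_image"],
        # configparser keeps the surrounding quotes, so strip them here.
        "appname": app["appname"].strip("'"),
        "shell_args": app.get("user_shell_args", fallback="").strip("'").split(),
        "input": (parser["from_hdfs"]["hdfs"], parser["from_hdfs"]["input_path"]),
        "overwrite": parser.getboolean("to_hdfs", "overwrite", fallback=False),
    }


if __name__ == "__main__":
    for key, value in load_properties(EXAMPLE).items():
        print(key, "=", value)
```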
![Page 84: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/84.jpg)
dlapp-launcher Example
$ dlapp-launcher dl_app_caffe.properties
C3 DL Cluster Version check : ok
CheckDLAppProperties : ok
user_dev_workspace_path = /path/to/user_dev_workspace
user_dev_workspace_exclude_list = []
user_dev_workspace dir size = 4 kb
DL App Submission : OK
DL App ID = application_aaaaaaaaaaa_bbbb
DL App Dashboard
* While app is running , check : http://XXXXXXXXX/cluster/app/application_aaaaaaaaaaa_bbbb
* After app is completed, check : http://XXXXXXXXX/applicationhistory/app/application_aaaaaaaaaaa_bbbb
DL App Status and stdout/stderr Log URLs
* dlapp-status application_aaaaaaaaaaa_bbbb
![Page 85: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/85.jpg)
DL App Tools
• dlapp-status [DL App ID]
• dlapp-list [Username]
• dlapp-softkill [Username] [DL App ID]
• dlapp-kill [Username] [DL App ID]
![Page 86: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/86.jpg)
dlapp-launcher Example
![Page 87: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/87.jpg)
Checking DL App Logs
• Uses the YARN History and Timeline Server features
• YARN application stdout/stderr logs can be viewed on the web or with the yarn command
• Look them up by the YARN container's attempt ID and container ID
• The dlapp-status tool provides the stdout/stderr log URLs
![Page 88: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/88.jpg)
dlapp-status Example
$ dlapp-status application_aaaaaaaaaaa_bbbb
DL App application_aaaaaaaaaaa_bbbb is FINISHED
Username : tony
DL App ID : application_aaaaaaaaaaa_bbbb
DL App Name : 'c3 deeplearning app caffe mnist example'
startTime : 2016-09-26 22:24:11.575000
finishTime : 2016-09-26 22:24:57.703000
elapsedTime : 0:00:46.13
stdout
http://XXXXX/applicationhistory/logs/XXXXX/container_aaaaaaaaaaa_bbbb_01_000004/container_aaaaaaaaaaa_bbbb_01_000004/tony/stdout/?start=-4096
stderr
http://XXXXX/applicationhistory/logs/XXXXX/container_aaaaaaaaaaa_bbbb_01_000004/container_aaaaaaaaaaa_bbbb_01_000004/tony/stderr/?start=-4096
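The elapsed time in the sample output follows directly from the start and finish timestamps. A small sketch of that arithmetic, assuming the timestamps are formatted as shown in the sample (the `elapsed` helper is hypothetical):

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed(start, finish):
    """Return finish - start as a timedelta for two timestamp strings."""
    return (datetime.strptime(finish, TIMESTAMP_FORMAT)
            - datetime.strptime(start, TIMESTAMP_FORMAT))


if __name__ == "__main__":
    delta = elapsed("2016-09-26 22:24:11.575000", "2016-09-26 22:24:57.703000")
    print(delta)  # prints 0:00:46.128000
```

The difference is 46.128 seconds, which the tool's output shows rounded to 0:00:46.13.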
![Page 89: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/89.jpg)
Example
![Page 90: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/90.jpg)
dlapp-status Example
![Page 91: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/91.jpg)
dlapp-list Example
![Page 92: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/92.jpg)
dlapp-kill Example
![Page 93: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/93.jpg)
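Why a DL App Development Environment Is Needed
• Developing a DL App requires GPU hardware
• Using the DL Cluster for basic validation in the early stages of DL App development is very cumbersome
• Trivial coding mistakes have to be fixed over and over
• The directory structure of input/output data has to be checked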
![Page 94: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/94.jpg)
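DL App Development Environment
• A dedicated GPU server was set up for DL App development
• dlapp-shell
  • Provides a Docker bash environment that runs with the same Docker image and workspace as dlapp-launcher
• dlapp-run: runs a DL App on the C3 DL Cluster from within the dlapp-shell environment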
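The idea that dlapp-shell and dlapp-launcher share one image and workspace can be sketched as a single command builder that differs only in the final command. This assumes the nvidia-docker 1.x wrapper, which selects GPUs through the `NV_GPU` environment variable. The function `build_docker_command`, the `/workspace` mount point, and the example paths are hypothetical and are not the toolset's actual implementation.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def build_docker_command(
    image: str,
    workspace: str,
    gpu_ids: Sequence[int],
    app_command: Optional[List[str]] = None,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment, argv) for an nvidia-docker run (hypothetical)."""
    # nvidia-docker 1.x exposes only the GPUs listed in NV_GPU.
    env = {"NV_GPU": ",".join(str(gpu) for gpu in gpu_ids)}
    argv = [
        "nvidia-docker", "run", "--rm",
        "-v", f"{workspace}:/workspace",  # assumed mount point
        "-w", "/workspace",
    ]
    if app_command is None:
        # Shell mode: interactive bash for development.
        argv += ["-it", image, "bash"]
    else:
        # Launcher mode: run the DL App non-interactively.
        argv += [image, *app_command]
    return env, argv


# Same image, workspace, and GPUs; only the final command differs.
shell_env, shell_argv = build_docker_command(
    "naver/c3_dl-caffe", "/home/tony/mnist", [0, 3]
)
run_env, run_argv = build_docker_command(
    "naver/c3_dl-caffe", "/home/tony/mnist", [0, 3], ["sh", "train.sh"]
)
print(shell_env, " ".join(shell_argv))
print(run_env, " ".join(run_argv))
```

Keeping both modes behind one builder means that code validated in the shell sees the same libraries and directory layout as the job submitted to the cluster.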
![Page 95: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/95.jpg)
DL App Execution Flow
[Diagram labels: YARN Container; DL App Docker run (NVIDIA Docker); Input; Source Code; GPU 0,3]
![Page 96: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/96.jpg)
DL App Development Server
[Diagram labels: DL App Shell (NVIDIA Docker): Bash Shell; Source Code; GPU 0,3]
![Page 97: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/97.jpg)
dlapp-shell example run
[Screenshot: a dlapp-shell session that opens a docker shell. Masked values in the capture: dl_app_dev_server, user, yarn_resource_manager, yarn_timeline_server, aaaaaaaaa_bbbb.]
![Page 98: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/98.jpg)
DL App Toolset
• dlapp-launcher dl_app.properties
• dlapp-shell dl_app.properties
• dlapp-run
• dlapp-status [DL App ID]
• dlapp-list [Username]
• dlapp-softkill [Username] [DL App ID]
• dlapp-kill [Username] [DL App ID]
![Page 99: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/99.jpg)
3. C3 Deep Learning Cluster
![Page 100: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/100.jpg)
A Hadoop YARN-based multi-tenant Deep Learning Application execution environment
[Diagram labels: GPU Cluster; Hadoop YARN; DL Application (x3)]
![Page 101: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/101.jpg)
C3 Deep Learning Cluster
[Architecture diagram labels:
• DL Application
• Zookeeper
• YARN Resource Manager
• YARN Node Managers (YARN with GPU Scheduling)
• C3 Distributed Shell
• NVIDIA Docker
• GPU Device ID Manager
• Docker Registry
• C3 HDFS Storage (Input, Output)
• Dev. Environment: Dev. Server
• DL App Toolset: dlapp-launcher, dlapp-shell, dlapp-run]
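To make the role of a GPU Device ID Manager concrete, here is a minimal in-memory sketch of per-node GPU ID bookkeeping: a container asks for a number of GPUs, receives specific device IDs, and returns them when it finishes. The class `GpuDeviceIdManager` and its methods are hypothetical. The diagram lists Zookeeper as a cluster component but this section does not say how the real manager stores its state, so this illustrates the idea rather than the actual design.

```python
from typing import Dict, List


class GpuDeviceIdManager:
    """Hypothetical per-node bookkeeping of which GPU IDs are in use."""

    def __init__(self, num_gpus: int) -> None:
        self._free: List[int] = list(range(num_gpus))
        self._held: Dict[str, List[int]] = {}

    def allocate(self, container_id: str, count: int) -> List[int]:
        if container_id in self._held:
            raise ValueError(f"{container_id} already holds GPUs")
        if count > len(self._free):
            raise RuntimeError("not enough free GPUs on this node")
        # Hand out the lowest free IDs and remember who holds them.
        ids, self._free = self._free[:count], self._free[count:]
        self._held[container_id] = ids
        return ids

    def release(self, container_id: str) -> None:
        # Releasing an unknown container is a no-op.
        ids = self._held.pop(container_id, [])
        self._free = sorted(self._free + ids)


manager = GpuDeviceIdManager(num_gpus=4)
first = manager.allocate("container_a", 2)
second = manager.allocate("container_b", 1)
manager.release("container_a")
third = manager.allocate("container_c", 2)
print(first, second, third)
```

The allocated IDs are what a launcher would pass on to NVIDIA Docker so that each DL App sees only its own devices.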
![Page 102: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/102.jpg)
DL Framework Docker Images

| DL Solution | Docker Image |
| --- | --- |
| Caffe | naver/c3_dl-caffe |
| Torch | naver/c3_dl-torch |
| Theano | naver/c3_dl-theano |
| TensorFlow | naver/c3_dl-tensorflow |
| Theano+Keras | naver/c3_dl-theano-keras |
| TensorFlow+Keras | naver/c3_dl-tensorflow-keras |
| Base (CUDA/cuDNN) | naver/c3_dl-base |
![Page 103: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/103.jpg)
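Improvements
• Resource management
  • Higher GPU utilization
  • GPU resource planning (YARN Queue)
• Administration
  • Job history management
  • Project management
  • User management
• DevOps
  • A standardized development environment
  • A shared development environment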
![Page 104: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/104.jpg)
Currently on the C3 DL Cluster …
• seq2seq-based search keyword rewriting to improve search quality
• word2vec- and CNN-based document topic classification
• CNN-based Image Tagger
• Many other projects are also running
* CNN : Convolutional Neural Network
![Page 105: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/105.jpg)
To Do
![Page 106: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/106.jpg)
To Do
• More advanced GPU resource scheduling
• Large-scale input data
• Input data caching
• Input data feeder
• Distributed Deep Learning
![Page 107: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/107.jpg)
Q&A
![Page 108: [225]yarn 기반의 deep learning application cluster 구축 김제민](https://reader030.vdocuments.site/reader030/viewer/2022020723/58700b8b1a28ab427f8b729f/html5/thumbnails/108.jpg)
Thank You