Big Data Computing Overview

2015.04.07 Youngsung Son

2

Agenda

§ What am I doing?
§ Big Data Computing History
– Supercomputer
– Parallel Computing
– Linux Cluster
– Big Data Computing
§ Google File System (GFS)
§ Hadoop Map and Reduce
§ Spark Stream Processing
§ References

3

What am I doing?

4

1. Personal Cloud Repository Access

2. Personal Health Record Retrieval

3. Case-Based Reasoning (Similar Case Search)

4. Comparison among Similar Patients (for Health Planning, Prediction, Advice)


Healing Platform

5

Healing Platform

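[Figure: Healing Platform architecture]

§ Mobile platform
– Open API (RESTful API)
– Mobile services 1..N
– Response within 3 seconds (load request)

§ Data sources, connected through a data relay (request / transmit / store)
– Medical data providers 1..N: 50 million people x 17 records / 365 days = ~2 million records/day
– Life record providers 1..N: 50 million people x 5 times = ~300 million/day
– Personal healing record repositories 1..N: 50 million people/day

§ Big Data Personal Data Control Service
– Service analysis engine
– Standard conversion, conversion / filter
– Stream computing (update management)
– Database for high-speed computation, DW construction
– Targeted data (TD) / healing knowledge base (KB) (NoSQL DB)
– DC nodes

§ Analysis platform
– Raw big data (HDFS): public clinical case big data, personal healing record case big data
– Knowledge base construction engine: Cluster, CBR, …
– Similar case search, trends, planning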

6

Big Data Computing History

7

Supercomputer

8

Supercomputer

9

Architecture of HyperCube

John P. Hayes, “Architecture of a Hypercube Supercomputer,” International Conference on Parallel Processing, 1986. http://web.eecs.umich.edu/~tnm/trev_test/papersPDF/1986.08.Architecture%20Of%20A%20Hypercube%20Supercomputer_Conf_Parallel_Processing.pdf

10

Architecture of HyperCube

11

Architecture of HyperCube

12

Architecture of HyperCube

http://web.eecs.umich.edu/~tnm/trev_test/papersPDF/1986.08.Architecture%20Of%20A%20Hypercube%20Supercomputer_Conf_Parallel_Processing.pdf

13

Parallel Computing

§ MPI – Message Passing Interface

§ PVM – Parallel Virtual Machine

14

Parallel Computing

§ MPI (Message Passing Interface)

15

Parallel Computing

§ PVM (Parallel Virtual Machine)

16

Architecture of HyperCube

Too costly!!!!

Too difficult!!!!

17

Linux Cluster

18

Berkeley NOW Project (1995)

19

Linux Cluster Project

CROWN System

Clustering Resources of Workstation’s Network (1997~1999)

20

21

22

Linux Cluster Specifications

§ 16 PCs
§ PC’s specification
– Pentium3
– 16MB
– 20GB
§ Myrinet (300Mbps)

23

Linux Cluster’s Goals

24

Linux Cluster’s Goals

Real-time Rendering

25

Limitations of achieving this goal

§ Visible Human Project
– Data Size : 40GB (~100GB)
§ Linux File System (ext2)
– 16GB / 1 file
– IDE bandwidth : 33Mbps (66Mbps)
– Ethernet bandwidth : 100Mbps (below 30Mbps)
– RAM : not enough
§ Myrinet network interface
– Too difficult to use
– Kernel hooking required!!!
§ Programming Model
– PVM or MPI
– Too Slow & Difficult!!!

26

Google File System

27

Google File System (GFS, 2003)

Sanjay Ghemawat, “The Google File System,” http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

28

Google File System (GFS)

Distributed, Overlaid, Scalable File System

29

Hadoop System (2005)

30

Map & Reduce

§ User Logs Counting

31

Map & Reduce

Why do it this way?

32

How about this example?

§ Count Phone Call Logs?
– Each user’s total phone call time
– KT’s case : 40TB / month
– No exception available
§ Oracle Database
– HW cost : ?
– SW cost : Over 400,000,000 Korean Won
– Time cost : about 1 day

33

Solution?

§ Simple is best
– Log Merge

    for (int i = 0; i < max_log; i++)   /* one sequential pass over every log record */
        user[log[i].id].usage_time += log[i].usage_time;

But too much time is required!!!

34

Map & Reduce

§ User Logs Counting
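The per-user call-time count can be sketched in the map / shuffle / reduce style. This is a single-process illustration of the idea, not Hadoop's actual API: the sample records, the chunking scheme, and all function names are hypothetical. On a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical call-log records: (user_id, call_seconds). Not from the slides.
LOGS = [("alice", 120), ("bob", 30), ("alice", 45), ("carol", 300), ("bob", 15)]

def split(records, n_parts):
    """Divide the input into n_parts chunks, standing in for file splits."""
    return [records[i::n_parts] for i in range(n_parts)]

def map_phase(chunk):
    """Map: emit one (key, value) pair per log record."""
    return [(user_id, seconds) for user_id, seconds in chunk]

def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every mapper's output."""
    groups = defaultdict(list)
    for user_id, seconds in chain.from_iterable(mapped_chunks):
        groups[user_id].append(seconds)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one total."""
    return {user_id: sum(values) for user_id, values in groups.items()}

def total_call_time(records, n_parts=2):
    # Each chunk is mapped independently; this loop is the part that parallelizes.
    mapped = [map_phase(chunk) for chunk in split(records, n_parts)]
    return reduce_phase(shuffle(mapped))

totals = total_call_time(LOGS)
print(totals)
```

The result does not depend on how many chunks the log is split into, which is what lets the work spread across many machines instead of one sequential loop.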

35

Spark Stream Processing (2009)

36

Hadoop’s Performance Problem

37

Hadoop’s Performance Problem

38

Spark Stream Processing

39

Spark Stream Processing

40

41

42

Spark Code Example

43

Conclusion

§ Big Data Computing?
– Of course, it is needed!! But for us?
§ We did a lot.
– Do we need to broaden our perspective?
§ What’s next?
– Trends repeat!!!
– Your major might come around again?

44

45

Oh well, it's all meaningless.

46

References

§ John P. Hayes, “Architecture of a Hypercube Supercomputer,” International Conference on Parallel Processing, 1986.

§ MPI code example, http://mpitutorial.com/tutorials/mpi-hello-world/

§ PVM code example, http://www.netlib.org/pvm3/book/node17.html

§ Sanjay Ghemawat, “The Google File System,” SOSP 2003

§ Hadoop Code Example, http://azure.microsoft.com/en-us/documentation/articles/hdinsight-sample-wordcount/

§ Madhukara Phatak, Introduction to Apache Spark, http://blog.madhukaraphatak.com/introduction-to-spark/

47

Thank you

Young-Sung Son