Learning Spark ch1-2

Page 1: Learning spark ch1-2

Quick Start to Spark

아키텍트를 꿈꾸는 사람들 (Aspiring Architects study group), Cecil

Page 2: Learning spark ch1-2

What is Spark?

A cluster computing platform designed to run general-purpose workloads at high speed.

A single system that supports many computing models: batch applications, iterative algorithms, interactive queries, and streaming.

Page 3: Learning spark ch1-2

Spark Components

Cluster manager

Core engine

Workload components

Page 4: Learning spark ch1-2

Spark Core
• Provides the basic functionality: task scheduling, memory management, fault recovery, and storage integration
• Provides the API for creating and manipulating resilient distributed datasets (RDDs) (see the sketch below)

Cluster Manager
• Manages the distributed nodes so that performance can scale out efficiently
• Supported: YARN (Hadoop), Apache Mesos, and the Standalone Scheduler
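The slide carries no code, but a minimal sketch of the RDD API that Spark Core exposes, as typed into spark-shell (where the context `sc` already exists; the sample data is invented for illustration):

scala> val nums = sc.parallelize(List(1, 2, 3, 4))   // create an RDD from a local collection
scala> val squares = nums.map(x => x * x)            // transformation: builds a new RDD lazily
scala> squares.collect()                             // action: computes and returns the result
res0: Array[Int] = Array(1, 4, 9, 16)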

Page 5: Learning spark ch1-2

Workload Components
• Spark SQL: structured data (a short example follows this list)
    - Supports a variety of data sources such as Hive tables, Parquet, and JSON

• Spark Streaming: real-time data streams
    - Provides a manipulation API that closely mirrors Spark Core's RDD API
    - Lets stored data and live data be handled in the same way

• MLlib: machine learning
    - Supports common machine learning tasks: classification, regression, clustering, and collaborative filtering
    - Also exposes low-level learning primitives such as optimization algorithms

• GraphX: graph library
    - Extends the RDD API and provides a library of common graph algorithms
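As a taste of one workload component, a minimal Spark SQL sketch in the Spark 1.x style this deck targets (typed into spark-shell; the JSON path and the `name` field are assumptions for illustration):

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val people = sqlContext.read.json("examples/people.json")   // assumed sample file
scala> people.registerTempTable("people")                          // expose the DataFrame to SQL
scala> sqlContext.sql("SELECT name FROM people").show()            // run SQL over JSON data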

Page 6: Learning spark ch1-2

Installing Spark
• Download Spark

• GitHub (clone command below)
    - git://github.com/apache/spark.git
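To build from source, the repository can be cloned directly (the git protocol URL from the slide; GitHub serves the same repository over https as well):

$ git clone git://github.com/apache/spark.git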

Page 7: Learning spark ch1-2

Spark Interactive Shells
• Python: bin/pyspark

• Scala: bin/spark-shell

Page 8: Learning spark ch1-2

Line count - Scala

scala> val lines = sc.textFile("README.md")      <- creates an RDD
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> lines.count()                             <- action
res0: Long = 98

scala> lines.first()                             <- action
res1: String = # Apache Spark

Page 9: Learning spark ch1-2

Spark Core Concepts

[Architecture diagram: a driver program holding the Spark Context talks to a cluster manager (Mesos, YARN, Standalone), which assigns work to worker nodes; each worker node runs an executor that executes tasks.]

• Driver program
    - The user-defined program that contains the main function
• Spark Context
    - The object that represents the connection to the compute cluster (see the sketch below)
    - Provides the API for working with RDDs
• Executor
    - The process on a worker machine where the program's tasks actually run
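In a standalone driver program that connection has to be built explicitly; a minimal sketch (the application name and master URL are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// In the interactive shells `sc` is pre-created; a standalone driver builds its own.
val conf = new SparkConf()
  .setAppName("MyApp")                 // assumed application name
  .setMaster("spark://master:7077")    // assumed cluster manager address
val sc = new SparkContext(conf)        // the connection to the compute cluster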

Page 10: Learning spark ch1-2

Filtering - Scala

scala> val lines = sc.textFile("README.md")                              <- creates an RDD
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile …

scala> val pythonLines = lines.filter(line => line.contains("Python"))   <- transformation
pythonLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at filter …

scala> pythonLines.first()                                               <- action
res2: String = high-level APIs in Scala, Java, Python, and R, and an …

scala> pythonLines.count()                                               <- action
res3: Long = 3

Page 11: Learning spark ch1-2

Standalone Applications

• Implement the code

• Write the build script

• Submit with spark-submit

Page 12: Learning spark ch1-2

WordCount - Code

[Code screenshot, lost in extraction; its margin annotations marked the Spark Context, an RDD creation, transformations, and an action. A sketch of the code follows.]
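A minimal reconstruction matching those annotations, in the Spark 1.x style of the deck (the object name, argument-based paths, and split-on-spaces tokenization are assumptions, not the slide's exact code):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)            // <- Spark Context

    val lines = sc.textFile(args(0))           // <- creates an RDD
    val counts = lines
      .flatMap(line => line.split(" "))        // <- transformations
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(1))             // <- action

    sc.stop()
  }
}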

Page 13: Learning spark ch1-2

WordCount - Build.sbt

<- Dependency setting: marked as provided so that Spark is included only at compile time (sketched below)
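The build file itself was a screenshot; a minimal sbt sketch matching that note (the name and version numbers are assumptions; only the provided scope comes from the slide):

name := "WordCount"

version := "1.0"

scalaVersion := "2.10.5"   // assumed; must match the Spark build on the cluster

// "provided": needed at compile time, but supplied by the cluster at run time,
// so spark-core is not bundled into the application jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"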

Page 14: Learning spark ch1-2

WordCount - Submission

[Screenshot of the execution output, lost in extraction.]

• Key spark-submit options (an example command follows)
    - class: the entry class of the user-defined job
    - deploy-mode: cluster or client
    - master: the address of the master when deploying to a cluster
    - conf: key-value configuration settings
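A sketch of a submission command using those options (the master URL, jar name, and application arguments are assumptions for illustration):

$ bin/spark-submit \
    --class WordCount \
    --master spark://master:7077 \
    --deploy-mode cluster \
    --conf spark.executor.memory=2g \
    wordcount.jar README.md output/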

Page 15: Learning spark ch1-2

Q&A

Page 16: Learning spark ch1-2

References
• Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. Learning Spark (Korean edition: 러닝 스파크, trans. 박종용). Paju, Gyeonggi-do: J-Pub, 2015
• Spark home page: http://spark.apache.org
• SlideShare: http://www.slideshare.net/yongho/rdd-paper-review