2016/8/20 ~ 9/3 ([email protected]) · 2016-08-19
Schedule

Day 1 – AM: Background & Overview; PM: Environment Setup and Basic Labs
Day 2 – AM: MapReduce Programming; PM: N/A
Day 3 – AM: MapReduce Programming; PM: Pig & Hive
Day 4 – AM: Pig & Hive; PM: Flume & Sqoop
Day 5 – AM: Flume & Sqoop; PM: Data Analysis Basics
Day 6 – AM: Using R; PM: Basic Statistics and Visualization
Day 7 – AM: Machine Learning (1); PM: Machine Learning (2)
Day 8 – AM: Using the Cloud; PM: Case Studies
http://www.openwith.net
D1
Big Data Technology Overview
Introduction
• A changing world
• The power of data
Changing World

Irreversible

Data Power

Big Data
Image source: zdnet
Big Data Technology Overview
Background – 3V
• Tidal Wave – 3VC
• Supercomputing – high-throughput computing
  – Two directions:
    • Remote, distributed large-scale computing (grid computing)
    • Centralized (MPP)
  • Scale-up vs. scale-out
• BI (Business Intelligence) – especially DW/OLAP/data mining
BI
• BI Overview

  Category         Solution   Description
  Strategic BI     BSC        Balanced Scorecard
                   VBM        Value-Based Management
                   ABC        Activity-Based Costing
  Analytical BI    OLAP       On-Line Analytical Processing; multidimensional analysis
  Extended BI      ERP, CRM   Extends ERP, CRM, SCM, etc. to provide BI functions
  Infrastructure/  ETL        Extraction-Transformation-Loading
  Operational BI   DW         Data Warehouse; data repository
  Delivery BI      Portal     Portal
Hadoop
• The birth of Hadoop – background
  • Google!
  • Spun off from the Nutch/Lucene project in 2006 – Doug Cutting
  – An Apache top-level open-source project
– Characteristics
  • Distributed processing framework for large-scale data – http://hadoop.apache.org – pure software
  • Linear scalability (flat linearity) through a simplified programming model – "function-to-data model vs. data-to-function" (locality)
  – KVP (Key-Value Pair)
The Birth of Hadoop
1990s – Excite, Alta Vista, Yahoo, …
2000 – Google; PageRank, GFS/MapReduce
2003~4 – Google papers
2005 – Hadoop born (D. Cutting & Cafarella)
2006 – Registered as an Apache project
Frameworks
• Big Picture
• Hadoop Kernel
• Hadoop distributions – Apache releases
  • 2.x.x: based on 0.23.x
  – 3rd-party distributions
  • Cloudera, HortonWorks, and MapR

• Hadoop distributions? – Apache's Hadoop started at 0.10 and is currently at 0.23
  – Current – Apache
    • 2.x.x: based on 0.23.x
    • 1.1.x: current stable release (based on 0.22)
    • 0.20.x: legacy stable release, still widely used
  – Current – 3rd-party distributions
    • Cloudera – CDH
    • HortonWorks
    • MapR
    • …

• Hadoop Ecosystem Map
Hadoop – HDFS & MapReduce

HDFS
Requirements
• Commodity hardware – frequent failures are a fact of life
• Many large files – hundreds of GB or TB
  – Large-scale streaming reads – not random access
• "Write-once, read-many-times"
• High throughput matters more than low latency
• A "modest" number of HUGE files – just millions; each > 100 MB, multi-GB files typical
• Large streaming reads – …
HDFS's Answers
• Files are stored in blocks
  – Much larger than in ordinary filesystems (default: 64 MB)
• Reliability through replication
  – Each block replicated across 3+ DataNodes
• A single master (NameNode) coordinates access and metadata
  – Simplified central management
• No data caching
  – Of little benefit for streaming reads
• Familiar interface, but a customized API
  – Simplify the problem and focus on the distributed solution
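The block and replication scheme above can be illustrated with a toy sketch. The 64 MB block size and 3-way replication come from the slide; the round-robin placement is a simplification for illustration, not HDFS's actual rack-aware placement policy:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size from the slide (64 MB)
REPLICATION = 3                # replicas per block

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks a file of `file_size` bytes occupies."""
    return (file_size + block_size - 1) // block_size  # ceiling division

def place_replicas(block_id, datanodes, replication=REPLICATION):
    """Toy placement: pick `replication` distinct nodes round-robin."""
    n = len(datanodes)
    return [datanodes[(block_id + i) % n] for i in range(replication)]

# a 200 MB file occupies 4 blocks of 64 MB
blocks = split_into_blocks(200 * 1024 * 1024)
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = {b: place_replicas(b, nodes) for b in range(blocks)}
```

Losing any single DataNode in this sketch still leaves two copies of every block, which is the point of the 3+ replication factor.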
GFS Architecture
Image source: Ghemawat et al., "The Google File System", SOSP, 2003
HDFS File Storage
HDFS Access Options
• Command-line interface
• Java API
• Web interface
• REST interface (WebHDFS REST API)
• Mounting HDFS
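As a sketch of the REST option, WebHDFS requests are ordinary HTTP URLs of the form /webhdfs/v1/<path>?op=…. The helper below only composes such URLs so the shape of the API is visible without a running cluster; the namenode host/port and paths are placeholder assumptions:

```python
# Build WebHDFS REST URLs (no cluster needed to see the shape of the API).
BASE = "http://namenode:50070/webhdfs/v1"  # host/port are placeholders

def webhdfs_url(path, op, **params):
    """Compose a WebHDFS request URL for `path` with operation `op`."""
    query = "&".join(["op=" + op] + [f"{k}={v}" for k, v in sorted(params.items())])
    return f"{BASE}{path}?{query}"

# list a directory / read a file (GET), make a directory (PUT)
ls_url    = webhdfs_url("/user/idcuser/data", "LISTSTATUS")
open_url  = webhdfs_url("/user/idcuser/data/cit-Patents.txt", "OPEN", offset=0)
mkdir_url = webhdfs_url("/user/idcuser/data", "MKDIRS")
```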
HDFS Command-Line Interface
• Create a directory
  $ hadoop fs -mkdir /user/idcuser/data
• Copy a file from the local filesystem to HDFS
  $ hadoop fs -copyFromLocal cit-Patents.txt /user/idcuser/data/.
• List all files in the HDFS file system
  $ hadoop fs -ls data/*
• Show the end of the specified HDFS file
  $ hadoop fs -tail /user/idcuser/data/cit-patents-copy.txt
• Concatenate multiple files and move them to HDFS (via stdin/pipes)
  $ cat /data/ita13-tutorial/pg*.txt | hadoop fs -put - data/all_gutenberg.txt
• File/directory commands:
  – copyFromLocal, copyToLocal, cp, getmerge, ls, lsr (recursive ls),
  – moveFromLocal, moveToLocal, mv, rm, rmr (recursive rm), touchz, mkdir
• Status/list/show commands:
  – stat, tail, cat, test (checks for existence of a path, zero-length files), du, dus
• Misc commands:
  – setrep, chgrp, chmod, chown, expunge (empties the trash folder)
HDFS Java API
• Listing files/directories (globbing)
• Open/close input streams
• Copy bytes (IOUtils)
• Seeking
• Write/append data to files
• Create/rename/delete files
• Create/remove directories
• Reading data from HDFS
  org.apache.hadoop.fs.FileSystem (abstract)
  org.apache.hadoop.hdfs.DistributedFileSystem
  org.apache.hadoop.fs.LocalFileSystem
  org.apache.hadoop.fs.s3.S3FileSystem
HDFS Web Interface
Typical Topology
HDFS Recap
• Focused on large-scale workloads on many low-cost machines
  – Copes with frequent failures
  – Focused on large files (mostly appended to and read)
  – A filesystem interface aimed at developers
• Scale-out & batch jobs – several complementary projects in recent years
MapReduce
MapReduce – Programming Model
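The programming model sketched on the slides above can be condensed into a toy single-process walk-through of the map → shuffle & sort → reduce flow. This is plain Python imitating the model, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit (word, 1) key-value pairs
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle & sort: group the emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {key: sum(values) for key, values in groups}

counts = reduce_phase(shuffle(map_phase(["the cat", "the dog"])))
# counts == {"the": 2, "cat": 1, "dog": 1}
```

On a cluster the same three stages run in parallel across nodes, with the framework handling the shuffle between mappers and reducers.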
Job Execution
Word Count Output
Improving the WordCount Example
• Problem: a single reducer becomes the bottleneck
  – Work can be distributed over multiple nodes (better work balance)
  – All the input data has to be sorted before processing
  – Question: which data should be sent to which reducer?
• Solution:
  – Distribute arbitrarily, based on a hash function (default mode)
  – A Partitioner class that determines the corresponding reducer for every output tuple
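The default hash-based distribution mentioned above can be sketched in a few lines. This mirrors the idea behind Hadoop's HashPartitioner; it is not the actual Java implementation:

```python
def partition(key, num_reducers):
    # Send each key to reducer hash(key) mod num_reducers, so every
    # occurrence of the same key reaches the same reducer.
    return hash(key) % num_reducers

# within one run, the same key always maps to the same reducer
r = partition("hadoop", 5)
assert r == partition("hadoop", 5)
assert 0 <= r < 5
```

Because identical keys always land on the same reducer, each reducer sees the complete group for its keys, which is what makes the per-key aggregation correct.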
• Notes
  – 1. Number of maps:
    • Depends on the input data size; usually 10-100 per node
    • Adjustable via setNumMapTasks(int)
  – 2. Number of reducers:
    • Rule of thumb:
      – = 0.95~1.75 x <no. of nodes> x mapred.tasktracker.reduce.tasks.maximum
    • Zero reducers is also common
    • Adjustable via JobConf.setNumReduceTasks(int)
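Plugging example numbers into the rule of thumb above (assuming a hypothetical 10-node cluster with mapred.tasktracker.reduce.tasks.maximum = 2):

```python
def suggested_reducers(nodes, max_reduce_tasks_per_node, factor=0.95):
    # rule of thumb from the slide: 0.95~1.75 x nodes x per-node maximum
    return int(factor * nodes * max_reduce_tasks_per_node)

# 10 nodes with 2 reduce slots each:
low  = suggested_reducers(10, 2)               # 0.95 x 10 x 2 = 19
high = suggested_reducers(10, 2, factor=1.75)  # 1.75 x 10 x 2 = 35
```

The lower factor lets all reducers finish in one wave; the higher one overlaps reduce waves with failures and stragglers.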
Unix Commands and the Streaming API
• Question: how many cities does each country have?

  hadoop jar /mnt/biginsights/opt/ibm/biginsights/pig/test/e2e/pig/lib/hadoop-streaming.jar \
    -input input/city.csv \
    -output output \
    -mapper "cut -f2 -d," \
    -reducer "uniq -c" \
    -numReduceTasks 5

• Explanation:
  cut -f2 -d,   # extract the 2nd column of a CSV
  uniq -c       # filter adjacent matching lines from the input
                # -c: prefix lines with the number of occurrences
• Additional remark:
  # numReduceTasks=0: no shuffle & sort phase!!
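What this streaming job computes can be mimicked in a few lines of plain Python. The city rows below are made-up sample data, and the 2nd CSV column is assumed to hold the country code:

```python
from itertools import groupby

rows = ["seoul,kr", "busan,kr", "tokyo,jp", "osaka,jp", "kyoto,jp"]

# mapper: cut -f2 -d,  -> extract the country column
countries = [line.split(",")[1] for line in rows]

# the shuffle & sort phase between mapper and reducer
countries.sort()

# reducer: uniq -c -> count runs of adjacent identical lines
counts = {country: len(list(group)) for country, group in groupby(countries)}
# counts == {"jp": 3, "kr": 2}
```

Note that uniq -c only works as a counter because the framework sorts the mapper output first, which is exactly why numReduceTasks=0 (no shuffle & sort) would break this job.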
[Lab] MR with Python Streaming
• "The Hound of the Baskervilles" (Project Gutenberg)
• Plain text (UTF-8) as input.txt
• --
• Python: mapper.py

  #!/usr/bin/env python
  import sys

  counts = {}
  for line in sys.stdin:
      words = line.split()
      for word in words:
          counts[word] = counts.get(word, 0) + 1
  print(counts)

• --
• $ chmod +x mapper.py
• $ ./mapper.py < input.txt
• mapper2.py

  #!/usr/bin/env python
  import sys

  for line in sys.stdin:
      for word in line.split():
          print(word + "\t" + str(1))

• $ ./mapper2.py < input.txt | sort
• reducer.py

  #!/usr/bin/env python
  import sys

  previous_key = None
  total = 0
  for line in sys.stdin:
      key, value = line.split("\t", 1)
      if key != previous_key:
          if previous_key is not None:
              print(previous_key + " was found " + str(total) + " times")
          previous_key = key
          total = 0
      total += int(value)
  if previous_key is not None:
      print(previous_key + " was found " + str(total) + " times")
• $ ./mapper2.py < input.txt | sort | ./reducer.py
• (factoring)
• $ ./mapper2.py < input.txt | sort | ./reducer2.py
• $ cat *.txt | ./mapper2.py | sort | ./reducer2.py
MapReduce High Level
MRv1 vs. MRv2
How It Works
• Overview – the JobTracker/TaskTracker roles are split into finer-grained pieces:
  • a global ResourceManager
  • a per-application ApplicationMaster
  • a per-node slave NodeManager
  • a per-application Container running on a NodeManager
  – The ResourceManager and NodeManager are newly introduced
• ResourceManager
  – Arbitrates resource requests among applications
  – Allocates resources to applications through its scheduler
• ApplicationMaster
  – A framework-specific entity that receives the required resource containers from the scheduler
  – After negotiating with the ResourceManager, executes component tasks via the NodeManager(s)
  – Also tracks status and monitors progress
• NodeManager
  – A per-machine slave responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager
MRv2 Progress
Why YARN Is Needed

  Feature              Description
  Multi-tenancy        YARN allows multiple access engines to use Hadoop as the common standard for batch, interactive, and real-time engines that can simultaneously access the same data set. Multi-tenant data processing improves an enterprise's return on its Hadoop investment.
  Cluster utilization  Dynamic allocation of cluster resources improves MR job performance
  Scalability          Improved scheduling strengthens scalability (thousands of nodes managing PBs of data)
Hadoop1 MR Daemons
Hadoop 1 Limitations
  Scalability – max cluster size: 4,000 nodes; max concurrent tasks: 40,000; coarse synchronization in the JobTracker
  Single point of failure – a failure kills all queued and running jobs
  Re-startability – restart is very tricky due to complex state
  Low resource utilization – hard partitioning of resources into map and reduce slots
  Limited to MR – doesn't support other programs; iterative application implementations are 10x slower
  Lack of wire-compatible protocols – client and cluster must be the same version; applications and workflows cannot migrate to different clusters
Hadoop 2 Design Concept
• The JobTracker's role is split into two functions:
  – Cluster resource management
  – Application life-cycle management
• MR becomes a user library, one of the applications residing in Hadoop
Key Concepts for Understanding MR2
• Application – a job submitted to the framework – e.g., an MR job
• Container – the basic unit of allocation; fine-grained resource allocation – e.g., container A = 2 GB, 1 CPU – replaces the fixed MR slots
• ResourceManager – the global resource scheduler – hierarchical queues
• NodeManager – a per-machine agent – manages container life-cycles – monitors container resources
• ApplicationMaster – per-application; manages application scheduling and task execution – e.g., the MR ApplicationMaster
• YARN = MR 2.0 +
  – Framework to develop and/or execute distributed processing applications
  – e.g., MR, Spark, Hama, Giraph
Hadoop 2 High-Level Architecture
Comparison
Problems with YARN
• Complexity – protocols are very low-level and verbose
• Not well suited to long-running jobs
• Applications don't survive a master crash
• No built-in communication between container and master
• Hard to debug
Hadoop's Strengths, Weaknesses, and Responses
• Hadoop's strengths
  – Commodity hardware
  – Scale-out
  – Fault tolerance
  – Flexibility through MR
• Hadoop's weaknesses
  – MR!
  – Missing: schemas, an optimizer, indexes, views, ...
  – Lack of compatibility with existing tools
• A solution: Hive
  – SQL to MR
  – Compiler + execution engine
  – Pluggable storage layer (SerDes)
• Hive's unsolved homework
  – ANSI SQL, UDFs, ...
  – MR latency overhead
  – Still a work in progress...!
SQL-on-MapReduce
• Direction
  – Quickly query and analyze data stored in HDFS with SQL
  – Aims at real-time (low-latency) analysis without using MR
  – Used for large-scale batch and real-time interactive analysis
  – ETL, ad-hoc queries, and online integration over HDFS and other data
• New architecture for SQL on Hadoop
  – Data locality
  – Real-time query (instead of MR)
  – Schema-on-read
  – Tight integration with the SQL ecosystem
• Example SQL-on-Hadoop projects
  – Google Dremel
  – Apache Drill
  – Cloudera Impala
  – Citus Data
• Tajo
  – Accepted as an Apache Incubator project in March 2013
    • APL V2.0
  – Adopted by Korean companies – SK Telecom and others
Use the right tool for the right job
Typical Hadoop Uses
• Text mining
• Index building
• Graph analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk analysis
Big Data Analytics Training (2015-11)
Use Cases by Type
(Quadrant chart plotting data velocity – real-time vs. batch – against data type – structured vs. unstructured)
• Risk analysis (banking)
• Fraud detection (credit cards), money-laundering risk detection
• Social-network analysis; marketing (events) at financial firms and telcos
• Distribution optimization (simulation); fraudulent insurance-claim and tax-evasion detection
• Preventive maintenance (airlines); sentiment analysis/SNA; demand forecasting in manufacturing; health-insurance/disease-information analysis; traditional DW; text analysis; real-time video surveillance
Hadoop Ecosystems
Ecosystem Map
Image source: https://www.mssqltips.com/
Hadoop Ecosystem
• "Hadoop Ecosystem"
  – Primary subprojects
    • ZooKeeper
    • Hive and Pig
    • HBase
    • Flume
  – Secondary subprojects
    • Sqoop
    • Oozie
    • Hue
    • Mahout
The Ecosystem is the System
• Hadoop acts as the kernel of a distributed operating system for big data – no one uses the kernel alone
YARN & Hadoop Ecosystems
• MR – core component since Hadoop 1
• Tez – provides pre-warmed containers and low-latency dispatch – up to 100x performance improvement – used especially by Hive and Pig
• HBase – column-oriented data store
• Storm – streaming for large-scale live event processing
• Giraph – iterative graph processing
• Spark – in-memory cluster computing
Big Data Analytics

Big Data Platform
Image source: it.toolbox.com
Analysis Tools – Big Bang
• Function-specific tools
R
• Open-source mathematical/statistical analysis tool and programming language
  – Originated from the S language; some 7,000 packages
    • CRAN: http://cran.r-project.org/
  – Excellent performance and visualization capabilities
Python
• Open-source programming language
  – Multi-platform
  – Rich packages (≈ 10k)
• Readability
  – Logic-oriented language
  – Executable pseudocode
• Conciseness
  – Expressiveness: less code
• Full-stack
  – Web – GUI – OS – Science
• Active community
분석기법
• Data Mining
• Predictive Analysis
• Data Analysis
• Data Science
• OLAP
• BI
• Analytics
• Text Mining
• SNA (Social Network Analysis)
• Modeling
• Prediction
• Machine Learning
• Statistical/Mathematical Analysis
• KDD (Knowledge Discovery)
• Decision Support System
• Simulation
For convenience, "analysis" (data analysis) and "mining" (data mining) are used interchangeably.
• Basic Statistical Theory Taxonomy

• Machine Learning Theory Taxonomy
Setting Up the Lab Environment
Hadoop Installation
Hadoop?
• Hadoop? – Since around 2012, "Hadoop" has broadened to mean the Hadoop Ecosystem
• Base framework
  – Hadoop Common – contains libraries and utilities needed by other Hadoop modules
  – HDFS
  – Hadoop MapReduce – a programming model
  – YARN – a resource-management platform
• Ecosystems – …
Hadoop Installation Choices
• Choice (1): installation mode – standalone / pseudo-distributed cluster / multinode cluster
• Choice (2): distribution – hadoop.apache.org / Cloudera / Hortonworks / MapR / others
• Choice (3): what to install – one-by-one vs. all-in-one / cloud (e.g., Amazon) / virtual machine?
Cloudera Quick Start
Hadoop Labs
[Lab 1] Ubuntu + Apache Hadoop
• Download Ubuntu 14.04 – install in English
  – $ sudo apt-get install ssh
  – $ sudo apt-get install rsync
• -- [ Java installation ] --
• $ sudo apt-get install openjdk-7-jdk
• $ ls /usr/lib/jvm/java-7-openjdk-amd64/
• $ sudo vi /etc/bash.bashrc
  – append: export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
  – after saving: $ source /etc/bash.bashrc
• $ java -version
• Reference:
  – https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html
• --- Adding a hadoop user ---
• $ sudo adduser hadoop (passwd: hadoop)
• $ su - hadoop
• $ ssh-keygen -t rsa
• $ cat ~/.ssh/id_rsa.pub
• $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
• $ chmod 0600 ~/.ssh/authorized_keys
• $ ssh localhost
• $ exit
• [ Downloading and installing Hadoop ]
• $ wget http://apache.tt.co.kr/hadoop/common/stable/hadoop-2.7.2.tar.gz
• $ tar xzvf hadoop-2.7.2.tar.gz
• $ mv hadoop-2.7.2 hadoop
• --
• $ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
• Add the following:
  export JAVA_HOME= ~~
[Lab 2] Running a Simple MR Example
• Standalone operation
  – Example: copy the contents of the unpacked conf directory as input data, then apply a regular expression.
  – $ mkdir input
  – $ cp etc/hadoop/*.xml input
  – $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
  – $ cat output/*
• Pseudo-distributed operation
  – etc/hadoop/core-site.xml:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

  – etc/hadoop/hdfs-site.xml:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

• Format the filesystem:
  – $ bin/hdfs namenode -format
• (1) Start the daemons – the NameNode daemon and the DataNode daemon:
  – $ sbin/start-dfs.sh
• (2) Browse the NameNode web UI – http://localhost:50070/
• (3) Create the HDFS directories
  – $ bin/hdfs dfs -mkdir /user
  – $ bin/hdfs dfs -mkdir /user/<username>
• (4) Copy the input files
  – $ bin/hdfs dfs -put etc/hadoop input
• (5) Run the example program
  – $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
• (6) Examine the output
  – $ bin/hdfs dfs -get output output
  – $ cat output/*
  – Or view the output files on the distributed filesystem:
  – $ bin/hdfs dfs -cat output/*
• (7) Shut down
  – $ sbin/stop-dfs.sh
[Lab 3] Streaming
[Lab 4] Textbook Examples
• Map.java
  package com.PACKT.chapter1;

  import java.io.*;
  import java.util.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  public class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
          StringTokenizer st = new StringTokenizer(value.toString().toLowerCase());
          while (st.hasMoreTokens()) {
              output.collect(new Text(st.nextToken()), new IntWritable(1));
          }
      }
  }
• Reduce.java

  // Defining the package of the class
  package com.PACKT.chapter1;

  // Importing Java libraries
  import java.io.*;
  import java.util.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  // Defining the Reduce class
  public class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
      // Reduce method aggregating the output of the Map phase
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
          int count = 0;
          while (values.hasNext()) {
              count += values.next().get();
          }
          output.collect(key, new IntWritable(count));
      }
  }
• WordCount.java

  import …

  public class WordCount extends Configured implements Tool {
      // run() method for setting the job configuration
      public int run(String[] args) throws IOException {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(Map.class);
          conf.setReducerClass(Reduce.class);
          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);
          return 0;
      }

      public static void main(String[] args) throws Exception {
          int exitCode = ToolRunner.run(new WordCount(), args);
          System.exit(exitCode);
      }
  }
[Lab 5] Installing a Distribution
• Hortonworks HDP
• Cloudera
• MapR