2016/8/20 ~ 9/3 ([email protected]) · 2016-08-19
Schedule

Day 1 – AM: Background & Overview; PM: Environment Setup and Basic Labs
Day 2 – AM: MapReduce Programming; PM: N/A
Day 3 – AM: MapReduce Programming; PM: Pig & Hive
Day 4 – AM: Pig & Hive; PM: Flume & Sqoop
Day 5 – AM: Flume & Sqoop; PM: Data Analysis Basics
Day 6 – AM: Using R; PM: Basic Statistics and Visualization
Day 7 – AM: Machine Learning (1); PM: Machine Learning (2)
Day 8 – AM: Using the Cloud; PM: Case Studies
http://www.openwith.net
D1
Big Data Technology Overview
Introduction
• A changing world
• The power of data
Changing World

Irreversible

Data Power

Big Data
Image source: zdnet
Big Data Technology Overview
Background – 3V
• Tidal Wave – 3VC
• Supercomputing – high-throughput computing
  – Two directions:
    • Remote, distributed large-scale computing (grid computing)
    • Centralized (MPP)
  • Scale-up vs. scale-out
• BI (Business Intelligence) – especially DW/OLAP/data mining
BI
• BI Overview

  Category         Solution   Description
  Strategic BI     BSC        Balanced Scorecard
                   VBM        Value-Based Management
                   ABC        Activity-Based Costing
  Analytical BI    OLAP       On-Line Analytical Processing; multidimensional analysis
  Extended BI      ERP, CRM   Extends ERP, CRM, SCM, etc. to provide BI functions
  Infrastructure/  ETL        Extraction-Transformation-Loading
  Operational BI   DW         Data Warehouse; data repository
  Delivery BI      Portal     Portal
Hadoop
• The birth of Hadoop – background
  • Google!
  • Spun off from the Nutch/Lucene project in 2006 – Doug Cutting
  – An Apache top-level open-source project
– Characteristics
  • Distributed processing framework for large-scale data – http://hadoop.apache.org – pure software
  • Linear scalability (flat linearity) through a simplified programming model – "function-to-data model vs. data-to-function" (locality)
  – KVP (Key-Value Pair)
The Birth of Hadoop
1990s – Excite, Alta Vista, Yahoo, …
2000 – Google; PageRank, GFS/MapReduce
2003~4 – Google papers
2005 – Hadoop born (D. Cutting & Cafarella)
2006 – Registered as an Apache project
Frameworks
• Big Picture
• Hadoop Kernel
• Hadoop distributions – Apache releases
  • 2.x.x: based on 0.23.x
  – 3rd-party distributions
  • Cloudera, HortonWorks, and MapR

• Hadoop distributions? – Apache's Hadoop started at 0.10 and is currently at 0.23
  – Current – Apache
    • 2.x.x: based on 0.23.x
    • 1.1.x: current stable release (based on 0.22)
    • 0.20.x: legacy stable release, still widely used
  – Current – 3rd-party distributions
    • Cloudera – CDH
    • HortonWorks
    • MapR
    • …

• Hadoop Ecosystem Map
Hadoop – HDFS & MapReduce

HDFS
Requirements
• Commodity hardware – frequent failures are a fact of life
• Many large files – hundreds of GB or TB
  – Large-scale streaming reads – not random access
• "Write-once, read-many-times"
• High throughput matters more than low latency
• A "modest" number of HUGE files – just millions; each > 100 MB, multi-GB files typical
• Large streaming reads – …
HDFS's Answers
• Files are stored in blocks
  – Much larger than in ordinary filesystems (default: 64 MB)
• Reliability through replication
  – Each block replicated across 3+ DataNodes
• A single master (NameNode) coordinates access and metadata
  – Simplified central management
• No data caching
  – Of little benefit for streaming reads
• Familiar interface, but a customized API
  – Simplify the problem and focus on the distributed solution
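The block and replication scheme above can be illustrated with a toy sketch. The 64 MB block size and 3-way replication come from the slide; the round-robin placement is a simplification for illustration, not HDFS's actual rack-aware placement policy:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size from the slide (64 MB)
REPLICATION = 3                # replicas per block

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks a file of `file_size` bytes occupies."""
    return (file_size + block_size - 1) // block_size  # ceiling division

def place_replicas(block_id, datanodes, replication=REPLICATION):
    """Toy placement: pick `replication` distinct nodes round-robin."""
    n = len(datanodes)
    return [datanodes[(block_id + i) % n] for i in range(replication)]

# a 200 MB file occupies 4 blocks of 64 MB
blocks = split_into_blocks(200 * 1024 * 1024)
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = {b: place_replicas(b, nodes) for b in range(blocks)}
```

Losing any single DataNode in this sketch still leaves two copies of every block, which is the point of the 3+ replication factor.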
GFS Architecture
Image source: Ghemawat et al., "The Google File System", SOSP, 2003
HDFS File Storage
HDFS Access Options
• Command-line interface
• Java API
• Web interface
• REST interface (WebHDFS REST API)
• Mounting HDFS
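As a sketch of the REST option, WebHDFS requests are ordinary HTTP URLs of the form /webhdfs/v1/<path>?op=…. The helper below only composes such URLs so the shape of the API is visible without a running cluster; the namenode host/port and paths are placeholder assumptions:

```python
# Build WebHDFS REST URLs (no cluster needed to see the shape of the API).
BASE = "http://namenode:50070/webhdfs/v1"  # host/port are placeholders

def webhdfs_url(path, op, **params):
    """Compose a WebHDFS request URL for `path` with operation `op`."""
    query = "&".join(["op=" + op] + [f"{k}={v}" for k, v in sorted(params.items())])
    return f"{BASE}{path}?{query}"

# list a directory / read a file (GET), make a directory (PUT)
ls_url    = webhdfs_url("/user/idcuser/data", "LISTSTATUS")
open_url  = webhdfs_url("/user/idcuser/data/cit-Patents.txt", "OPEN", offset=0)
mkdir_url = webhdfs_url("/user/idcuser/data", "MKDIRS")
```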
HDFS Command-Line Interface
• Create a directory
  $ hadoop fs -mkdir /user/idcuser/data
• Copy a file from the local filesystem to HDFS
  $ hadoop fs -copyFromLocal cit-Patents.txt /user/idcuser/data/.
• List all files in the HDFS file system
  $ hadoop fs -ls data/*
• Show the end of the specified HDFS file
  $ hadoop fs -tail /user/idcuser/data/cit-patents-copy.txt
• Concatenate multiple files and move them to HDFS (via stdin/pipes)
  $ cat /data/ita13-tutorial/pg*.txt | hadoop fs -put - data/all_gutenberg.txt
• File/directory commands:
  – copyFromLocal, copyToLocal, cp, getmerge, ls, lsr (recursive ls),
  – moveFromLocal, moveToLocal, mv, rm, rmr (recursive rm), touchz, mkdir
• Status/list/show commands:
  – stat, tail, cat, test (checks for existence of a path, zero-length files), du, dus
• Misc commands:
  – setrep, chgrp, chmod, chown, expunge (empties the trash folder)
HDFS Java API
• Listing files/directories (globbing)
• Open/close input streams
• Copy bytes (IOUtils)
• Seeking
• Write/append data to files
• Create/rename/delete files
• Create/remove directories
• Reading data from HDFS
  org.apache.hadoop.fs.FileSystem (abstract)
  org.apache.hadoop.hdfs.DistributedFileSystem
  org.apache.hadoop.fs.LocalFileSystem
  org.apache.hadoop.fs.s3.S3FileSystem
HDFS Web Interface
Typical Topology
HDFS Recap
• Focused on large-scale workloads on many low-cost machines
  – Copes with frequent failures
  – Focused on large files (mostly appended to and read)
  – A filesystem interface aimed at developers
• Scale-out & batch jobs – several complementary projects in recent years
MapReduce
MapReduce – Programming Model
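The programming model sketched on the slides above can be condensed into a toy single-process walk-through of the map → shuffle & sort → reduce flow. This is plain Python imitating the model, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit (word, 1) key-value pairs
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle & sort: group the emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {key: sum(values) for key, values in groups}

counts = reduce_phase(shuffle(map_phase(["the cat", "the dog"])))
# counts == {"the": 2, "cat": 1, "dog": 1}
```

On a cluster the same three stages run in parallel across nodes, with the framework handling the shuffle between mappers and reducers.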
Job Execution
Word Count Output
Improving the WordCount Example
• Problem: a single reducer becomes the bottleneck
  – Work can be distributed over multiple nodes (better work balance)
  – All the input data has to be sorted before processing
  – Question: which data should be sent to which reducer?
• Solution:
  – Distribute arbitrarily, based on a hash function (default mode)
  – A Partitioner class that determines the corresponding reducer for every output tuple
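The default hash-based distribution mentioned above can be sketched in a few lines. This mirrors the idea behind Hadoop's HashPartitioner; it is not the actual Java implementation:

```python
def partition(key, num_reducers):
    # Send each key to reducer hash(key) mod num_reducers, so every
    # occurrence of the same key reaches the same reducer.
    return hash(key) % num_reducers

# within one run, the same key always maps to the same reducer
r = partition("hadoop", 5)
assert r == partition("hadoop", 5)
assert 0 <= r < 5
```

Because identical keys always land on the same reducer, each reducer sees the complete group for its keys, which is what makes the per-key aggregation correct.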
• Notes
  – 1. Number of maps:
    • Depends on the input data size; usually 10-100 per node
    • Adjustable via setNumMapTasks(int)
  – 2. Number of reducers:
    • Rule of thumb:
      – = 0.95~1.75 x <no. of nodes> x mapred.tasktracker.reduce.tasks.maximum
    • Zero reducers is also common
    • Adjustable via JobConf.setNumReduceTasks(int)
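Plugging example numbers into the rule of thumb above (assuming a hypothetical 10-node cluster with mapred.tasktracker.reduce.tasks.maximum = 2):

```python
def suggested_reducers(nodes, max_reduce_tasks_per_node, factor=0.95):
    # rule of thumb from the slide: 0.95~1.75 x nodes x per-node maximum
    return int(factor * nodes * max_reduce_tasks_per_node)

# 10 nodes with 2 reduce slots each:
low  = suggested_reducers(10, 2)               # 0.95 x 10 x 2 = 19
high = suggested_reducers(10, 2, factor=1.75)  # 1.75 x 10 x 2 = 35
```

The lower factor lets all reducers finish in one wave; the higher one overlaps reduce waves with failures and stragglers.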
Unix Commands and the Streaming API
• Question: how many cities does each country have?

  hadoop jar /mnt/biginsights/opt/ibm/biginsights/pig/test/e2e/pig/lib/hadoop-streaming.jar \
    -input input/city.csv \
    -output output \
    -mapper "cut -f2 -d," \
    -reducer "uniq -c" \
    -numReduceTasks 5

• Explanation:
  cut -f2 -d,   # extract the 2nd column of a CSV
  uniq -c       # filter adjacent matching lines from the input
                # -c: prefix lines with the number of occurrences
• Additional remark:
  # numReduceTasks=0: no shuffle & sort phase!!
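What this streaming job computes can be mimicked in a few lines of plain Python. The city rows below are made-up sample data, and the 2nd CSV column is assumed to hold the country code:

```python
from itertools import groupby

rows = ["seoul,kr", "busan,kr", "tokyo,jp", "osaka,jp", "kyoto,jp"]

# mapper: cut -f2 -d,  -> extract the country column
countries = [line.split(",")[1] for line in rows]

# the shuffle & sort phase between mapper and reducer
countries.sort()

# reducer: uniq -c -> count runs of adjacent identical lines
counts = {country: len(list(group)) for country, group in groupby(countries)}
# counts == {"jp": 3, "kr": 2}
```

Note that uniq -c only works as a counter because the framework sorts the mapper output first, which is exactly why numReduceTasks=0 (no shuffle & sort) would break this job.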
[Lab] MR with Python Streaming
• "The Hound of the Baskervilles" (Project Gutenberg)
• Plain text (UTF-8) as input.txt
• --
• Python: mapper.py

  #!/usr/bin/env python
  import sys

  counts = {}
  for line in sys.stdin:
      words = line.split()
      for word in words:
          counts[word] = counts.get(word, 0) + 1
  print(counts)

• --
• $ chmod +x mapper.py
• $ ./mapper.py < input.txt
• mapper2.py

  #!/usr/bin/env python
  import sys

  for line in sys.stdin:
      for word in line.split():
          print(word + "\t" + str(1))

• $ ./mapper2.py < input.txt | sort
• reducer.py

  #!/usr/bin/env python
  import sys

  previous_key = None
  total = 0
  for line in sys.stdin:
      key, value = line.split("\t", 1)
      if key != previous_key:
          if previous_key is not None:
              print(previous_key + " was found " + str(total) + " times")
          previous_key = key
          total = 0
      total += int(value)
  if previous_key is not None:
      print(previous_key + " was found " + str(total) + " times")
• $ ./mapper2.py < input.txt | sort | ./reducer.py
• (factoring)
• $ ./mapper2.py < input.txt | sort | ./reducer2.py
• $ cat *.txt | ./mapper2.py | sort | ./reducer2.py
MapReduce High Level
MRv1 vs. MRv2
How It Works
• Overview – the JobTracker/TaskTracker roles are split into finer-grained pieces:
  • a global ResourceManager
  • a per-application ApplicationMaster
  • a per-node slave NodeManager
  • a per-application Container running on a NodeManager
  – The ResourceManager and NodeManager are newly introduced
• ResourceManager
  – Arbitrates resource requests among applications
  – Allocates resources to applications through its scheduler
• ApplicationMaster
  – A framework-specific entity that receives the required resource containers from the scheduler
  – After negotiating with the ResourceManager, executes component tasks via the NodeManager(s)
  – Also tracks status and monitors progress
• NodeManager
  – A per-machine slave responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager
MRv2 Progress
Why YARN Is Needed

  Feature              Description
  Multi-tenancy        YARN allows multiple access engines to use Hadoop as the common standard for batch, interactive, and real-time engines that can simultaneously access the same data set. Multi-tenant data processing improves an enterprise's return on its Hadoop investment.
  Cluster utilization  Dynamic allocation of cluster resources improves MR job performance
  Scalability          Improved scheduling strengthens scalability (thousands of nodes managing PBs of data)
Hadoop1 MR Daemons
Hadoop 1 Limitations
  Scalability – max cluster size: 4,000 nodes; max concurrent tasks: 40,000; coarse synchronization in the JobTracker
  Single point of failure – a failure kills all queued and running jobs
  Re-startability – restart is very tricky due to complex state
  Low resource utilization – hard partitioning of resources into map and reduce slots
  Limited to MR – doesn't support other programs; iterative application implementations are 10x slower
  Lack of wire-compatible protocols – client and cluster must be the same version; applications and workflows cannot migrate to different clusters
Hadoop 2 Design Concept
• The JobTracker's role is split into two functions:
  – Cluster resource management
  – Application life-cycle management
• MR becomes a user library, one of the applications residing in Hadoop
Key Concepts for Understanding MR2
• Application – a job submitted to the framework – e.g., an MR job
• Container – the basic unit of allocation; fine-grained resource allocation – e.g., container A = 2 GB, 1 CPU – replaces the fixed MR slots
• ResourceManager – the global resource scheduler – hierarchical queues
• NodeManager – a per-machine agent – manages container life-cycles – monitors container resources
• ApplicationMaster – per-application; manages application scheduling and task execution – e.g., the MR ApplicationMaster
• YARN = MR 2.0 +
  – Framework to develop and/or execute distributed processing applications
  – e.g., MR, Spark, Hama, Giraph
Hadoop 2 High-Level Architecture
Comparison
Problems with YARN
• Complexity – protocols are very low-level and verbose
• Not well suited to long-running jobs
• Applications don't survive a master crash
• No built-in communication between container and master
• Hard to debug
Hadoop's Strengths, Weaknesses, and Responses
• Hadoop's strengths
  – Commodity hardware
  – Scale-out
  – Fault tolerance
  – Flexibility through MR
• Hadoop's weaknesses
  – MR!
  – Missing: schemas, an optimizer, indexes, views, ...
  – Lack of compatibility with existing tools
• A solution: Hive
  – SQL to MR
  – Compiler + execution engine
  – Pluggable storage layer (SerDes)
• Hive's unsolved homework
  – ANSI SQL, UDFs, ...
  – MR latency overhead
  – Still a work in progress...!
SQL-on-MapReduce
• Direction
  – Quickly query and analyze data stored in HDFS with SQL
  – Aims at real-time (low-latency) analysis without using MR
  – Used for large-scale batch and real-time interactive analysis
  – ETL, ad-hoc queries, and online integration over HDFS and other data
• New architecture for SQL on Hadoop
  – Data locality
  – Real-time query (instead of MR)
  – Schema-on-read
  – Tight integration with the SQL ecosystem
• Example SQL-on-Hadoop projects
  – Google Dremel
  – Apache Drill
  – Cloudera Impala
  – Citus Data
• Tajo
  – Accepted as an Apache Incubator project in March 2013
    • APL V2.0
  – Adopted by Korean companies – SK Telecom and others
Use the right tool for the right job
Typical Hadoop Uses
• Text mining
• Index building
• Graph analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk analysis
Big Data Analytics Training (2015-11)
Use Cases by Type
(Quadrant chart plotting data velocity – real-time vs. batch – against data type – structured vs. unstructured)
• Risk analysis (banking)
• Fraud detection (credit cards), money-laundering risk detection
• Social-network analysis; marketing (events) at financial firms and telcos
• Distribution optimization (simulation); fraudulent insurance-claim and tax-evasion detection
• Preventive maintenance (airlines); sentiment analysis/SNA; demand forecasting in manufacturing; health-insurance/disease-information analysis; traditional DW; text analysis; real-time video surveillance
Hadoop Ecosystems
Ecosystem Map
Image source: https://www.mssqltips.com/
Hadoop Ecosystem
• "Hadoop Ecosystem"
  – Primary subprojects
    • ZooKeeper
    • Hive and Pig
    • HBase
    • Flume
  – Secondary subprojects
    • Sqoop
    • Oozie
    • Hue
    • Mahout
The Ecosystem is the System
• Hadoop acts as the kernel of a distributed operating system for big data – no one uses the kernel alone
YARN & Hadoop Ecosystems
• MR – core component since Hadoop 1
• Tez – provides pre-warmed containers and low-latency dispatch – up to 100x performance improvement – used especially by Hive and Pig
• HBase – column-oriented data store
• Storm – streaming for large-scale live event processing
• Giraph – iterative graph processing
• Spark – in-memory cluster computing
Big Data Analytics

Big Data Platform
Image source: it.toolbox.com
Analysis Tools – Big Bang
• Function-specific tools
R
• Open-source mathematical/statistical analysis tool and programming language
  – Originated from the S language; some 7,000 packages
    • CRAN: http://cran.r-project.org/
  – Excellent performance and visualization capabilities
Python
• Open-source programming language
  – Multi-platform
  – Rich packages (≈ 10k)
• Readability
  – Logic-oriented language
  – Executable pseudocode
• Conciseness
  – Expressiveness: less code
• Full-stack
  – Web – GUI – OS – Science
• Active community
분석기법
• Data Mining
• Predictive Analysis
• Data Analysis
• Data Science
• OLAP
• BI
• Analytics
• Text Mining
• SNA (Social Network Analysis)
• Modeling
• Prediction
• Machine Learning
• Statistical/Mathematical Analysis
• KDD (Knowledge Discovery)
• Decision Support System
• Simulation
For convenience, "analysis" (data analysis) and "mining" (data mining) are used interchangeably.
• Basic Statistical Theory Taxonomy

• Machine Learning Theory Taxonomy
Setting Up the Lab Environment
Hadoop Installation
Hadoop?
• Hadoop? – Since around 2012, "Hadoop" has broadened to mean the Hadoop Ecosystem
• Base framework
  – Hadoop Common – contains libraries and utilities needed by other Hadoop modules
  – HDFS
  – Hadoop MapReduce – a programming model
  – YARN – a resource-management platform
• Ecosystems – …
Hadoop Installation Choices
• Choice (1): installation mode – standalone / pseudo-distributed cluster / multinode cluster
• Choice (2): distribution – hadoop.apache.org / Cloudera / Hortonworks / MapR / others
• Choice (3): what to install – one-by-one vs. all-in-one / cloud (e.g., Amazon) / virtual machine?
Cloudera Quick Start
Hadoop Labs
[Lab 1] Ubuntu + Apache Hadoop
• Download Ubuntu 14.04 – install in English
  – $ sudo apt-get install ssh
  – $ sudo apt-get install rsync
• -- [ Java installation ] --
• $ sudo apt-get install openjdk-7-jdk
• $ ls /usr/lib/jvm/java-7-openjdk-amd64/
• $ sudo vi /etc/bash.bashrc
  – append: export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
  – after saving: $ source /etc/bash.bashrc
• $ java -version
• Reference:
  – https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html
• --- Adding a hadoop user ---
• $ sudo adduser hadoop (passwd: hadoop)
• $ su - hadoop
• $ ssh-keygen -t rsa
• $ cat ~/.ssh/id_rsa.pub
• $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
• $ chmod 0600 ~/.ssh/authorized_keys
• $ ssh localhost
• $ exit
• [ Downloading and installing Hadoop ]
• $ wget http://apache.tt.co.kr/hadoop/common/stable/hadoop-2.7.2.tar.gz
• $ tar xzvf hadoop-2.7.2.tar.gz
• $ mv hadoop-2.7.2 hadoop
• --
• $ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
• Add the following:
  export JAVA_HOME= ~~
[Lab 2] Running a Simple MR Example
• Standalone operation
  – Example: copy the contents of the unpacked conf directory as input data, then apply a regular expression.
  – $ mkdir input
  – $ cp etc/hadoop/*.xml input
  – $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
  – $ cat output/*
• Pseudo-distributed operation
  – etc/hadoop/core-site.xml:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

  – etc/hadoop/hdfs-site.xml:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

• Format the filesystem:
  – $ bin/hdfs namenode -format
• (1) Start the daemons – the NameNode daemon and the DataNode daemon:
  – $ sbin/start-dfs.sh
• (2) Browse the NameNode web UI – http://localhost:50070/
• (3) Create the HDFS directories
  – $ bin/hdfs dfs -mkdir /user
  – $ bin/hdfs dfs -mkdir /user/<username>
• (4) Copy the input files
  – $ bin/hdfs dfs -put etc/hadoop input
• (5) Run the example program
  – $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
• (6) Examine the output
  – $ bin/hdfs dfs -get output output
  – $ cat output/*
  – Or view the output files on the distributed filesystem:
  – $ bin/hdfs dfs -cat output/*
• (7) Shut down
  – $ sbin/stop-dfs.sh
[Lab 3] Streaming
[Lab 4] Textbook Examples
• Map.java
  package com.PACKT.chapter1;

  import java.io.*;
  import java.util.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  public class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
          StringTokenizer st = new StringTokenizer(value.toString().toLowerCase());
          while (st.hasMoreTokens()) {
              output.collect(new Text(st.nextToken()), new IntWritable(1));
          }
      }
  }
• Reduce.java

  // Defining the package of the class
  package com.PACKT.chapter1;

  // Importing Java libraries
  import java.io.*;
  import java.util.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  // Defining the Reduce class
  public class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
      // Reduce method aggregating the output of the Map phase
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
          int count = 0;
          while (values.hasNext()) {
              count += values.next().get();
          }
          output.collect(key, new IntWritable(count));
      }
  }
• WordCount.java

  import …

  public class WordCount extends Configured implements Tool {
      // run() method for setting the job configuration
      public int run(String[] args) throws IOException {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(Map.class);
          conf.setReducerClass(Reduce.class);
          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);
          return 0;
      }

      public static void main(String[] args) throws Exception {
          int exitCode = ToolRunner.run(new WordCount(), args);
          System.exit(exitCode);
      }
  }
[Lab 5] Installing a Distribution
• Hortonworks HDP
• Cloudera
• MapR