Big Data, New Insight
TRANSCRIPT
-
TTA,
2012. 10. 17
-
2
4
5
6
1
3
-
DB DB DB
-
Calculating Database Online Ubiquitous
ICT
Intelligence
-
, SNS
IT
, ,
(embedded system)
-
()
CIO ERP
(Real Analytics)
(Mobile First) IT, IT
IT , IT ,
-
PC
/
EB(Exa Byte) (90 =100EB)
ZB(Zetta Byte) (2011=1.8ZB)
ZB (20=11 50 )
(, )
(, , SNS)
, (RFID, Sensor, )
, , ,
2011 1.8ZB()
1.8 = 1.8
2020 50
SNS Web2.0
1 1PC
www
(IDC & EMC, Digital Universe Study 2011)
IT everywhere
* Byte, Kilo, Mega, Giga, Tera, Peta, Exa, Zetta
1ZB (zettabyte) = 10^21 Byte = 1 trillion GB
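The unit ladder above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) prefixes where each step is a factor of 1,000; the helper names are illustrative and not from the slides.

```python
# Prefixes in the order listed on the slide; each step multiplies by 1000.
PREFIXES = ["Byte", "Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta"]

def bytes_per_unit(prefix: str) -> int:
    """Return how many bytes one unit of the given decimal prefix holds."""
    return 1000 ** PREFIXES.index(prefix)

def convert(value: float, src: str, dst: str) -> float:
    """Convert a quantity from one prefix to another, e.g. Zetta to Giga."""
    return value * bytes_per_unit(src) / bytes_per_unit(dst)

zb_in_bytes = bytes_per_unit("Zetta")                     # bytes in 1 ZB
zb_in_gb = convert(1, "Zetta", "Giga")                    # GB in 1 ZB
digital_universe_2011_gb = convert(1.8, "Zetta", "Giga")  # the 1.8 ZB figure in GB
```

Under these assumptions 1 ZB works out to 10^21 bytes, or 10^12 GB.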
(, , 2012. 3)
-
(Big Data)'
Volume Variety Velocity
Complexity Value
-
( )
(Hadoop, NoSQL, R )
,
(, ,
, , GPS )
(,
)
3V
++
: (2012), , IT 3
-
3
(Big Data Platform)
, (NoSQL, ETL..)
(Hadoop, MapReduce..)
( , , ..)
(Visualization)
(Big Data)
(Data Scientist)
, (IT )
, ,
-
(
)
,
Silos
Sharing
Aggregating
Co-creating
-
EU
(www.data.go.kr)
, 'Data.gov
65
Data.gov
(ODS: Open Data Strategy) (11. 12)
EU 2013 pan-European
2.0 (data.gov.au)
-
: (2011), Social Big Data & Collective Intelligence'
:
: :
-
: (2012), Big Data
-
(Hadoop)
(HDFS),
(MapReduce)
-
: KT
-
(Mathematics, Statistics..)
(Engineering, Computer Sciences, Natural Sciences, Social Sciences)
: Forbes, 'Amazon's John Rauser on "What Is a Data Scientist?"'(2011.10.7), , ..., (2012. 3. 18)
6
-
: HARD Skill : SOFT Skill
: , , , IT & Future Strategy, , 2012. 8.
-
Network World IT , , , ,
, , ,
Data Scientist
-
- Chief Economist, Hal R. Varian -
-
.
: , , , 2012.6
-
(Hadoop)
IT
BI ,
BI , , , ,
: , , , 2012.6
-
(, , 2012. 3)
?
-
`
DB KMS Web2.0
< '' ' >
--
2011 2 (Jeopardy!)' IBM '(Watson)'
,
-
,
, ,
(Huge Scale)
(Reality)
(Trend)
(Combination)
,
, ,
,
(, , 2012. 3)
-
Economist
(2010)
,
Gartner
(2011)
21 ,
(Information Silo)
McKinsey
(2011)
, ,
, 5 6
,
-
,
,
,
,
,
,
,
(, , 2012. 3)
-
: , , , 2012.6.
-
IT
-
google.com
50
() () 25,000
1 7,000 1
4 4
20 100,000
Gmail (SNS)
845
Google.org (, )
OS OS
G1 (Knol)
236
TV
S 380 30
-
TV
Google
(, )
-
Data Strategy Board
(BIS, 2012. 3) - - ,
Open Data Strategy
- , - ,
-
- 10 12~15 - Ad Hoc Group
: Active Japan ICT , 39-3-2
-
, , ,
, ,
,
, SW
, R&D
, , ,
7
-
: $3,300
: 60%
: McKinsey(2011)
: 10
: 12~15
: (2012)
: 10 7
: (2011)
: 160~330 ( 2.5~4.5%)
: Policy
Exchange(2012)
EU
: 2,500
: McKinsey(2011)
-
, Big data,
-
, IT , (2012.4.23)
-
Calculating Database Online Ubiquitous
ICT
Intelligence
Q: ?
A: , , B: , , ,
-
2012 IT IT
-
/
/
/
/
/
/
IT
-
:
-
(, & , Gov3.0 , 2012. 6)
-
8 (71% ) : &
-
+ GPS
-
IT!
-
/
/
/ /
/
-
1. , , ,
2. -> vs ->->
3. :
4. +++
5.
-
1. , , ,
2. : , , , ,
3. , &
4. , ,
5. : ;
-
www.bigdataforum.or.kr
-
: , (2011. 11. 7)
-
!
-
0/88 0
ETRI Proprietary Electronics And Telecommunication Research Institute
-
1/88
-
-
2/88
?
: , ,
Data Mining
Text Mining
Log Mining
Bio/Medical Mining
Stream Mining
-
3/88
21 :
: 2011, 1.8ZB 2020, 35ZB (44 , 1ZB = 1GB)
21 (Gartner, 2011)
5%
: Economist, Gartner, IDC, McKinsey, Nature Next Google
21 Information silo
Gartner (2011.03)
/ , / , 5 6
McKinsey (2011.05)
Big data: The next frontier for innovation, competition, and productivity
SNS M2M , , ,
Economist (2010.05)
-
4/88
1. Business application data (e.g., records, transactions)
2. Human-generated content (e.g., social media)
, , ,
3. Machine data (e.g., RFID, Log Files etc.)
-
5/88
-
6/88
21 (Gartner)
: Risk Assessment Horizon Scanning
: Evidence-driven decision support
Value
(//)
Horizon Scanning Advanced Analytics Decision Support
-
7/88
?
5 : (US), (EU), LBS , , : McKinsey, 2011
-
8/88
?
-
9/88
-
10/88
,
-- : ,
:
:
-
11/88
() / () / , ()
()
,
, ,
-
12/88
-
-
13/88
,
/
//
?
-
14/88
Data Mining, Predictive Analytics
Text Mining, Question Answering
Opinion Mining, Social Media Analytics, Social Network Analytics, Predictive Analytics
Log Data Mining
Modelling & Simulation
-
15/88
(1) Data Mining
(Association rule mining)
Market basket analysis
(Classification) : , Buying decision, churn rate, consumption rate
(Regression) , ,
(Cluster analysis) Segmenting customers into similar groups for targeted marketing
(Novelty Detection) Fault detection, Fraud detection
Red Ocean: SAP, IBM, SAS, Oracle, Microsoft
-
16/88
(2) vs.
: ) /, /, /
: ) , ,
-
17/88
: (Classification)
(Class) , (Class) , ,
-
18/88
: (Regression)
,
X
Y
X
Y
37
?? 33
-
19/88
Google Prediction API
Google's cloud-based machine learning tools can help analyze your data to add the following features:
Ford's Smart Car System
-
20/88
Predicting the Present with Google Trends
Can Google queries help predict economic activity? Google Trends provides an index of the volume of Google queries by geographic location and category.
Google classifies search queries into 27 categories at the top level and 241 categories at the second level.
GNU R
-
21/88
Google
10 (2009) ,
20
-
22/88
Google
Google
Google 18
-
23/88
[] GNU R Programming Language
R is an open source programming language and software environment
for statistical computing and graphics.
S
-
24/88
(3) Text Mining
Goal: to turn text into data for analysis via application of natural language processing (NLP) and analytical methods.
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation,
information extraction, data mining techniques including link and
association analysis, visualization, and predictive analytics.
, , , ,
, , , , , 10
-
25/88
[] Apache UIMA
UIMA Architecture Frameworks:
support configuring and running pipelines of Annotator
components
Components (i.e., Annotators):
do the actual work of analyzing the unstructured information
Infrastructure:
include a simple server that can receive requests and return
annotation results, for use by
other web services.
-
26/88
(4) Opinion Mining
Opinion Mining or Sentiment Analysis
-
27/88
Opinion Mining
-
28/88
Opinion Mining
-
29/88
Opinion Mining
Application of Sentiment Analysis Business Intelligence system
Purchase planning
Public opinion management
Web advertising
-
30/88
Aspect-based Opinion Mining
Aspect Identification Aspect Expression Extraction Aspect Expression Clustering Aspect Hierarchy Generation
Value Expression Extraction {Aspect, Value} Relation Extraction
Implicit Aspect Identification
{Aspect, Value} Polarity Assignment
30
Terminology
Aspect
: { , , , , } Aspect Expression
.: { , , } Value Expression ( value)
.: { , , , }
-
31/88
Aspect Hierarchy Generation
optimization approach
Domain-Assisted Product Aspect Hierarchy Generation: Towards Hierarchical Organization of Unstructured Consumer Reviews [2011 EMNLP] 31
-
32/88
(5) Question Answering
:
:
(Answer Engine) IT
Life is about questions & answers.
-> Decision making
-
33/88
IBM Watson QA
Watson , ,
Deep QA- ()
SW
-> 3 ( 2~6)
(2.6GHz) 2
-> 1(200 )
Apache Hadoop
Apache Lucene
Apache UIMA(Unstructured Information Management Architecture)
Deep QA -> 100
33
-
34/88
IBM's Grand Challenges
Chess -> Human Language
SW (2) , Big data deep analytics Deep QA
HW (1) IBM Power750 90(2,880 ) Deep blue 100 2010 Top 94 (80TFs)
SW HW Deep Blue
-
35/88
Jeopardy! Questions
< Game Board Category: US Cities> Hard Question
Simple Question
-
36/88
Watson QA
: 3 vs. 0.4
Watson can never be sure of anything
Question Difficulty
Usability (, , )
Content Language Difficulty
Confidence
Accuracy
Speed
Broad Domain
Query Language Difficulty
-
37/88
Watson for Business Intelligence
, , , Insight
-
38/88
IBM ?
Do they accomplish human-like language processing? Paraphrase an input text
Translate the text into another language
Answer questions about the contents of the text
Draw inferences from the text
Turing test proposed by Alan Turing (1950): Watson has not met Turing's standard or true AI.
It does not have the intelligence to understand the questions & the answers.
However, Watson is certainly intelligence augmentation (IA) that extends human brains.
: IBM
-
39/88
Wolfram Alpha
Wolfram Alpha supports Apple's Siri for factual question answering
Siri now accounts for 25 percent of all searches made on Wolfram Alpha (NY Times, 2012.2.7)
-
40/88
Google Knowledge Graph
Google's next frontier for search
-
41/88
(6) Log Data Mining: Personal Location Data
Personal Location Data Mining
-
42/88
Log Data Mining: Web Log Data
Google Insights ()
Big data
-
43/88
(7) Social Network Analysis
-
44/88
(8)
1. Predict Risk
2. Predict Market
3. Predict Popularity
4. Predict Mood
5. Predict Social Dynamics
-
45/88
Predict Risk
, , Natural Risk (storms, fires, traffic jams, riots, earthquakes etc.)
(249) Earthquake Shakes Twitter User:Analyzing Tweets for Real-Time Event Detection, IW3C2, 2010
(88) Microblogging during two natural hazards events: what twitter may contribute to situational awareness, CHI, 2010
Financial Risk
(27) Predicting risk from financial reports with regression, NAACL, 2009
(2) Hunting for the black swan: risk mining from text, ACL, 2010
-
46/88
Predict Market
, , (Wisdom of crowds) Social Media, News PM
(9) Predicting Movie Success and Academy Awards Through Sentiment and Social Network Analysis, 2008, ECIS
(124) Predicting the future with social media, 2010
(5) Using Social Media to Predict Future Events with Agent-Based Markets, 2010, IEEE
(130) Twitter mood predicts the stock market, 2010, journal of CS
Predicting Financial Markets: Comparing Survey, News, Twitter and Search Engine Data, 2011
(16) Reading the Markets: Forecasting Public Opinion of Political Candidates by News Analysis, 2008, Coling
(106) Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment, AAAI, 2010
-
47/88
Predict Popularity
social connection, link structure, user behavior pattern ( )
Digg, Youtube
(22) Digging Digg: Comment Mining, Popularity Prediction, and Social Network Analysis, IEEE, 2009
Dig ( , , ) digg-score
(111) Predicting the Popularity of Online Content, ACM, 2010
(Digg: 1 , Youtube: 7 ) 30
Forum.myspace.com, Forum.dpreview.com
(9) An Approach to Model and Predict the Popularity of Online Contents with Explanatory Factors
France News sites
(2) Predicting the popularity of online articles based on user comments, ACM, 2011
Twitter
(23) Trends in Social Media - Persistence and Decay, AAAI, 2011
- , , , 2012
-
48/88
Predict Mood
Sentiment ,
Global mood phenomena: ( )
Public mood
Mood modeling
(80) Capturing Global Mood Levels using Blog Posts, 2006, AAAI
(66) Modeling Public Mood and Emotion-twitter sentiment and socio-economic phenomena, 2009, AAAI
(1) Effects of the recession on public mood in the UK, 2012, WWW MSDN workshop
-
49/88
Predict Social Dynamics
Unemployment through the Lens of Social Media : ,
(2009.6.~2011.6)
: ,
: Un , SAS
40 5 , 6 90%
-
50/88
Recorded Future: Temporal Analytics Engine
Event Entity Time
CIA 2008
() (, ) /
() (: , ) .
() ,
-
51/88
(Ushahidi)
Ushahidi: , /
2007, ,
a tool to easily crowdsource information using multiple channels, including SMS, email, Twitter and the web.
, ,
++
,
,
51
-
52/88
(9) Modelling & Simulation
RAHS
- RAHS(Risk Assessment & Horizon Scanning)
- ,
- 11 RAHS 2.0
9.11
,
-
53/88
-
-
54/88
-/
- ? ,
, , , (1012) (SERI, 2010)
, , , ,
-
55/88
- ? /,
Insight
: , , , ,
: /
()
?
() S2 ?
(++ + )
-
56/88
-, ,
-
-/ ( )
//
-
57/88
,
() 6
.
(, , )
/nc /nc+/xsn+/jc /nc+/nc+/jj /nc+/jc /nc+gk/Xsv+/ec /pv+ /ep+/ef ./s
+/xsn+/jc+/jj +/jc /nc+gk/Xsv+/ec /pv+ /ep+/ef ./s
Verb():Arg1( ), Arg2( )
/
Entity: Object: , Value:
: (-9.5)
-
58/88
-
, Insight /
: , ,
1.
2.
3.
-
59/88
, (Evidence-driven)
:
:
:
-
60/88
Insight Delivery
Issue Predictive Analytics
Knowledge Analysis
Information Analysis
Data Sensing
/
/
/
/
-
SNS
/ / /
/
-
61/88
1 2 (12/9 )
98 187
39 67
39 92
43 99
/
//
Hadoop HBase
(Crawling API, Streaming API)
-
62/88
, , ,
:
Follower, Mention, Retweet PageRank ,
/
(SVM)
-
63/88
,
,
//
, ,
(B)
Depth Retweet
(/)
Nested
network
Depth Retweet
(/)
Nested
network
(A)
-
64/88
/
-
- //
- ()
-
- /// /
- ////
()
-
65/88
/
(, SNS )
( )
(2)
-
66/88
/
/
/
/
/
/
,
Transition-based parsing hash kernel , ( O(n^3) O(n): 8 ) Deterministic parser beam search
180 () 4 () Structural SVM
(2)
-
67/88
-/, -, -
/
SRL
,
/ ,
/ * SRL: Semantic Role Labeling
XX
S2
.
(2)
-
68/88
[// /()/()/]
Holder
Target
Aspect
Time
Sentiment
Trigger:
Anchor:
-
69/88
[] Theory of emotion
() () ()
()
() () ()
() () ()
() ()
() ()
() ()
() ()
[Plutchik's wheel of emotions: eight primary emotions] [ ]
-
70/88
17
/// /
/
Trigger
Sentiment Shifter(, )
NEGATIVE POSITIVE NEUTRAL
-
71/88
/
Sentiment Shifter(, )
-
72/88
-
73/88
-
74/88
(Seed)
?
?
:
:
(, )
-
4.11
3
1
5.16
: 2012 1-8 : 314,648,676 : 26,438,236(8.4%)
(8/11), . (7/31),
(4/5) 4.11(4/11). 3(5/24)
/
3 4 5 6 7 8
-
76/88
/ /// /
Competitive Intelligence
-
77/88
[]
-
78/88
-
, Insight /
: , ,
0.0000
0.2000
0.4000
0.6000
0.8000
1.0000
1.2000
1 2 3 4 5 6 7 8
11
:
(/ )
,
: 46,768
-
79/88
Novelty(h1): ? discrepancy score
Importance(h2): ? term
Strength(h3): ? //
Confidence(h4): ? source
Interestedness(h5): ? , , RT
-
80/88
[]
12/22: A
11/23:
12/30: A
A vs
[A ]
[ETRI-WISDOM]
-
81/88
, (Evidence-driven)
:
:
:
()
-
82/88
-
/ /
SNS
ARIMA: Autoregressive Integrated Moving Average
ECM: Error Correction Model
(ARIMA, ECM )
(, ) DB
(, )
-: / -: /
-: -: /
-
83/88
(1/6)
-
84/88
vs.
( )
/
(, , )
/
/
-
85/88
-
-
86/88 86
, - SNS , , , / Reasoning, ,
,
SW SW 2 10% (SERI, 2010)
Data-driven Insight / , ,
-
87/88
[] 5 Big Data Questions For CEOs
1. How is big data going to help my business?
2. How much will it cost?
3. How risky is it?
4. How will we measure the return?
5. How long will it take to see results?
: http://www.forbes.com/sites/ciocentral/2012/06/26/5-big-data-questions-for-ceos/
-
88/88
. Q&A
-
Big Data
Hadoop
Edward KIM
-
(JCO) 6 ( )
JBoss User Group
Architect
Hadoop Java EE
Open Flamingo (http://www.openflamingo.org)
Java Application Performance Tuning
IT
JBoss Application Server5, EJB 2/3
Oreilly RESTful Java
2
-
3
-
?
4
Insight, Context, Data Scientist
Early Adopter Collector .
-
?
5
10G? 50G? 100G?
1T? 10T? 50T? 100T?
1P ?
10
100 Byte * 6 (per minute) * 60 (per hour) * 24 (per day) * 6,000,000
= 864,000 * 6,000,000 = 5,184,000,000,000 Bytes
= 4,943,847 MB = 4,827 GB (per day)
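The sizing estimate above can be reproduced step by step. This is a minimal sketch: the 100-byte record size and the multipliers come from the slide, while the variable names and the reading of the last factor as six million data sources are assumptions.

```python
RECORD_BYTES = 100        # size of one record, from the slide
RECORDS_PER_MINUTE = 6    # assumed meaning of the factor 6
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24
SOURCES = 6_000_000       # assumed: number of users or devices producing records

# Bytes produced by a single source in one day.
bytes_per_source_per_day = (
    RECORD_BYTES * RECORDS_PER_MINUTE * MINUTES_PER_HOUR * HOURS_PER_DAY
)

# Total daily volume, converted with binary (1024-based) units as on the slide.
total_bytes_per_day = bytes_per_source_per_day * SOURCES
total_mb_per_day = total_bytes_per_day / 1024**2
total_gb_per_day = total_bytes_per_day / 1024**3
```

With these inputs a single source produces 864,000 bytes per day, and the total comes to roughly 4.8 TB per day, which is already beyond what a single commodity disk array handles comfortably over months of retention.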
-
Big Data
6
+++
H/W + S/W
DevOps
-
Big Data ?
7
-
Big Data
8
Platform
Service
-
Big Data OpenSource
9
Big Data
-
?
10
IT
-
?
11
-
Apache Hadoop
File System : HDFS(Hadoop Distributed File System)
64M
2003 Google Google File System
(MapReduce) (2004 Google )
HDFS
Parallelization, Distribution, Fault-Tolerance
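To make the 64M block size concrete, the following minimal sketch estimates how many blocks a file occupies in HDFS and how much raw storage it consumes. The replication factor of 3 is an assumption (a common HDFS default that the slide does not state), and the function name is illustrative.

```python
import math

BLOCK_SIZE = 64 * 1024**2   # 64 MB block size, as on the slide
REPLICATION = 3             # assumption: common HDFS default, not from the slide

def hdfs_footprint(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    blocks = math.ceil(file_bytes / block_size)
    # The last block is not padded, so raw usage is file size times replication.
    raw_bytes = file_bytes * replication
    return blocks, raw_bytes

blocks_1gb, raw_1gb = hdfs_footprint(1024**3)   # a 1 GB file
```

Each block can be processed by a separate map task, which is where the parallelization and fault tolerance listed above come from.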
12
-
Hadoop
13
!
) MapReduce Sorting Sorting
Local Sorting Out Of Memory
-
Apache Hadoop Architecture
14 Manning, Hadoop in Practice
-
Apache Hadoop ?
/ .
.
I/O CPU .
.
linear .
linear .
.
Apache Hadoop .
Intel Core .
15
-
Hadoop, RDBMS
16
Big Data
.
-
Hadoop
17
-
Hadoop
ETL(Extract, Transform, Load)
Data Warehouse
Storage for Log Aggregator
Distributed Data Storage (; CDN)
Spam Filtering
Bioinformatics
Online Content Optimization
Parallel Image, Movie Clip Processing
Machine Learning
Science
Search Engine
18
-
Apache Hadoop
19
-
Apache Hadoop
20
-
Apache Hadoop
21
-
Apache Hadoop
22
-
Hadoop Cluster
2 CPU(4 Core Per CPU) Xeons 2.5GHz
4x1TB SATA
16G RAM
1G
10G
20
Ubuntu Linux Server 10.04 64bit
Sun Java SDK 1.6.0_23
Apache Hadoop 0.20.2
23
3~4
- HDD Crash
- Kernel Crash
- LAN Fail
-
Big Data Appliance Hardware
18 Sun X4270 M2 Servers
48 GB memory per node = 864 GB memory
12 Intel cores per node = 216 cores
36 TB storage per node = 648 TB storage
40 Gb p/sec InfiniBand
10 Gb p/sec Ethernet
24
Processors 2 Six-Core Intel Xeon X5675 Processors (3.06 GHz)
Memory 48GB (6 * 8GB) expandable to 96 GB or 144 GB
Disks 12 x 3 TB 7.2K RPM High Capacity SAS (hot-swap)
Disk Controller Disk Controller HBA with 512MB Battery Backed Cache
Network 2 InfiniBand 4X QDR (40Gb/s) Ports (1 Dual-port PCIe 2.0 HCA)
4 Embedded Gigabit Ethernet Ports
-
Hadoop Ecosystem
25
-
Hadoop
26
Hadoop . Google Compute Engine
!!
-
Hadoop
27
Database
Hadoop
Analytics
Hadoop
New
Service
&
Platform
Architecture
Integration
Performance
Cost
Development
Data
Analytics
Practices
Focus Issue Project
-
SK Telecom Hadoop
28
AS-IS Oracle RAC Database Big Data (100 Tera Bytes)
3 Layer(Sub System)
Service Adaptation Layer(SAL)
KD CL
Open API XML
Collection Layer(CL)
ETL,
Knowledge Discovery(KD)
(; K-Means)
Big Data Analytics, Data Scientist
,
TO-BE Apache Hadoop
KD, CL Hadoop Migration
, , ,
-
SK Telecom Hadoop
29
Big Data Platform
Apache Hadoop, Pig, Hive
Workflow Engine & Designer, HDFS Browser
MapReduce based Mining Algorithm, ETL
AR, CF, K-Means,
Service Platform
Melon :: Association Rule
T store, AppMercer :: CF, Cold Start, Association Rule
Hoppin :: Real-Time Mining, CF, Cold Start
NATE
Vingo
Ad Platform
100 segmentation
.
-
SK Telecom Hadoop
30
-
SK Telecom Hadoop
31
-
SK Telecom Hadoop
32
/ Best, Best
T store 20 , 0.05%
14%
Apple App Store 1000
1.76%
Android Market Top 50 60%
,
Top 10 (Cold Start)
-
SK Telecom Hadoop
T store
Collaborative Filtering
Association Rule
Cold Start
AS-IS
AS-IS
TO-BE
Hadoop
33
-
SK Telecom Hadoop
34
-
SK Telecom Hadoop
35
Melon
-
Melon
36
-
37
SK Telecom Hadoop
Oracle Hadoop
CPU 100% 70%
Core 80 Core Intel 8 Core * 20 = 160 Core
1 34
1 1
120,000,000
(T) 1,300,000
6 High End Server
300 * 20 = 6,000
) Core 700 * 80 = 56,000
0
-
SK Telecom Hadoop
Hoppin N
38
-
SK Telecom Hadoop
Hoppin
Real-Time
Action ) ,
Collaborative Filtering, Cold Start
, ,
Text Mining
()
39
-
SK Telecom Hadoop
40
- -
User Preference
Streaming - Data Grid -
Implementation
A
B
C
D
E
Rock R&B K-POP J-POP Soul
5 6 4 1 6 0
Rock R&B K-POP J-POP Soul
4 2 1 4 2 1
Rock R&B K-POP J-POP Soul
5 6 3 2 1 1
Rock R&B K-POP J-POP Soul
1 5 6 2 3 0
User Preference
-
Real Time Big Data
41
-
Use-Case: Dispenser
42
-
Use-Case: Dispenser
43
-
Facebook Real Time Analytics System
44
-
Apple iOS6 Maps
45
-
46
Big Data 4 3 Realtime Big Data
Realtime & Big Data
SI
, , ,
Big Data
Big Data
Big Data
-
47
1 (2004.04~) :: SW
SW
NEIS Linux
SW
2 (2009.04~) :: SW
SW
SW
3 (2012.10~) :: SW , , SW
-
48
SW
SW SW
SW
SW
SW
SW /
SW
SW R&D SW /
SW
-
NIPA :: Architecture Reference Model
49
, , ,
OpenSource
, ,
,
AS-IS, TO-BE Architecture
: Hadoop, Pig, Hive, MongoDB, Slurper, Oozie, Sqoop, Storm, Flume, Ganglia, RHQ
Big Data Slurper Collector
-
Hadoop Project
50
No Experience
HW & SW tight coupling
Installation & Configuration
Performance Tuning
Provisioning
Integration
Trade Off
-
Apache Hadoop HDFS Architecture
51 Manning, Hadoop in Practice
-
MapReduce Logical Architecture
52
-
WordCount
Hadoop MapReduce Framework
ROW Word Word
53
(Mapper Input)
hadoop apache page hive hbase cluster hadoop page cloud copywrite
(Reduce Output)
apache 1
cloud 1
cluster 1
copywrite 1
hadoop 2
hbase 1
hive 1
page 2
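The WordCount flow above can be imitated in plain Python to show what each phase contributes. This is a minimal single-process sketch, not Hadoop code: the `mapper`, `shuffle`, and `reducer` names are illustrative stand-ins for the framework's phases.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit (word, 1) for every whitespace-separated word in a row."""
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, standing in for Hadoop's shuffle-and-sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return [reducer(word, counts) for word, counts in shuffle(pairs)]

rows = ["hadoop apache page hive hbase cluster hadoop page cloud copywrite"]
result = word_count(rows)
```

Because the grouped keys are sorted before reduction, the result comes back in alphabetical order, as in the reducer output listed on the slide.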
-
WordCount
54
-
Apache Pig
= Pig Latin
MapReduce
Pig Latin MapReduce
MapReduce
Bag, Tuple,
55
-
Pig Latin
56
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR
quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
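-- Sample tuples in records (year, temperature, quality):
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
-- grouped_records after GROUP ... BY year:
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
-- max_temp as printed by DUMP:
(1949,111)
(1950,22)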
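For readers new to Pig Latin, the same filter, group, and max pipeline can be written in plain Python. This is a minimal sketch for comparison only; the sample tuples are taken from the grouped output on the slide, and the function and constant names are illustrative.

```python
from collections import defaultdict

# (year, temperature, quality) tuples matching the grouped output on the slide.
records = [
    ("1950", 0, 1), ("1950", 22, 1), ("1950", -11, 1),
    ("1949", 111, 1), ("1949", 78, 1),
]
GOOD_QUALITY = {0, 1, 4, 5, 9}   # quality codes kept by the FILTER statement
MISSING = 9999                   # sentinel for a missing temperature reading

def max_temp_by_year(rows):
    # FILTER: drop missing readings and readings with a bad quality code.
    filtered = [r for r in rows if r[1] != MISSING and r[2] in GOOD_QUALITY]
    # GROUP ... BY year: collect temperatures per year.
    grouped = defaultdict(list)
    for year, temperature, _quality in filtered:
        grouped[year].append(temperature)
    # FOREACH ... GENERATE group, MAX(...): one maximum per year.
    return {year: max(temps) for year, temps in grouped.items()}

max_temp = max_temp_by_year(records)
```

Pig compiles the equivalent statements into MapReduce jobs, so the script scales to inputs that would not fit in one machine's memory, which this in-memory version cannot do.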
-
Apache Hive
Data Warehouse Infrastructure
Data Summarization
Ad hoc Query on Hadoop
MapReduce for Execution
HDFS for Storage
MetaStore
Table/Partition
Thrift API
Metadata stored in any SQL backend
Hive Query Language
Basic SQL : Select, From, Join, Group BY
Equi-Join, Multi-Table Insert, Multi-Group-By
Batch Query
https://cwiki.apache.org/Hive/languagemanual.html 57
-
Hive QL
SQL DDL Operation
HDFS
58
hive> CREATE TABLE ratings (userid STRING, movieid STRING, rating INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '^' STORED AS TEXTFILE;
https://cwiki.apache.org/Hive/languagemanual-ddl.html
hive> LOAD DATA INPATH '/movielens/ratings.dat' OVERWRITE INTO TABLE ratings;
-
Hive QL
59
hive> INSERT OVERWRITE DIRECTORY '/movielens/ratings.dat' SELECT r.* FROM ratings r WHERE r.movieid='1212';
hive> SELECT m.movieid, r.userid, r.rating FROM movies m JOIN ratings r ON (m.movieid = r.movieid);
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*)
FROM invites a
WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out'
SELECT a.* FROM invites a
WHERE a.ds='2008-08-15';
-
Big Data
Hadoop
Hadoop Project
) MapReduce ~ ~
Hadoop
60
-
Hadoop
( )
!
Hadoop, Pig, Hive !
!
!
61
-
62
,
-
63
SI
-
Big Data Market Forecast
64
-
Big Data Revenue
65
-
Big Data Market Share
66
-
Big Data Revenue By Type
67
-
Hadoop
Software Maestro 3rd
September 17, 2012
(SW Maestro) Hadoop September 17, 2012 1 / 47
-
Section 1
-
1 , HDFS HDFS .
2 Lucene TF-IDF (Term Frequency-Inverse Document Frequency) , MapReduce .
-
3 .
1 - , (HDFS) .
2 - Hadoop Full-Text (TF-IDF).
3 - , .
-
Section 2
-
(Crawler)
1 HDFS .
2 ( ) .
3 URL .
4 robots.txt .
5 IT ,Hadoop .
6 .
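The crawler requirements above mention robots.txt. A minimal sketch of how a crawler could honor it with Python's standard library follows; it assumes the requirement is to skip disallowed URLs, and the user agent name `memento-crawler` and the sample rules are hypothetical, not taken from the project.

```python
from urllib.robotparser import RobotFileParser

def build_robots_checker(robots_txt: str, user_agent: str):
    """Parse robots.txt content and return a function that tests one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed

# Hypothetical robots.txt content, for illustration only.
sample_rules = """User-agent: *
Disallow: /private/
"""

# "memento-crawler" is a made-up user agent name.
allowed = build_robots_checker(sample_rules, "memento-crawler")
can_fetch_index = allowed("http://example.com/index.html")
can_fetch_private = allowed("http://example.com/private/a.html")
```

In a Manager and Worker design, the check would most naturally run before a URL is handed to a Worker, so that disallowed pages never enter the queue.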
-
, Manager Worker .
Manager .
, , .
Worker .
Raw Data HDFS .
Manager .
, Manager .
-
Section 3
-
TF-IDF
.
TF(Term Frequency) IDF(Inverse Document Frequency) .
TF-IDF TF-IDF . , .
.
, . .
-
TF-IDF Algorithm
, .
$t$: a term; $D$: the set of documents
$n_{t,d}$: number of occurrences of term $t$ in document $d$
$|D|$: total number of documents
-
TF-IDF Algorithm
Term Frequency:
$tf_{t,d} = n_{t,d}$
Inverse Document Frequency:
$idf_{t,d} = \frac{1}{|\{d : t \in d \in D\}| + 1}$
TF-IDF of term $t$ for document $d$ in $D$, combining TF and IDF:
$tfidf_{t,d,D} = tf_{t,d} \cdot idf_{t,d} \quad (t \in d \in D)$
-
Enhanced TF-IDF
TF-IDF .
1 , . TF .
2 1000 1 A 2 B ?
-
Enhanced TF-IDF
TF-IDF .
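$tf_{t,d} = \begin{cases} 1 + \ln(n_{t,d}) & \text{if } n_{t,d} > 0 \\ 0 & \text{if } n_{t,d} = 0 \end{cases}$
$idf_{t,d} = \ln\left(\frac{|D|}{|\{d : t \in d \in D\}| + 1}\right)$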
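The enhanced TF and IDF definitions on this slide translate directly into code. This is a minimal sketch that follows the formulas as written, including the +1 in the IDF denominator; the function names are illustrative and are not taken from the memento-engine source.

```python
import math

def term_frequency(n_td: int) -> float:
    """Sublinear TF: 1 + ln(n) when the term occurs in the document, else 0."""
    return 1.0 + math.log(n_td) if n_td > 0 else 0.0

def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """IDF as ln(|D| / (document frequency + 1)), following the slide."""
    return math.log(total_docs / (docs_with_term + 1))

def tf_idf(n_td: int, total_docs: int, docs_with_term: int) -> float:
    """Weight of a term in one document: TF multiplied by IDF."""
    return term_frequency(n_td) * inverse_document_frequency(total_docs, docs_with_term)
```

The logarithm in TF dampens the effect of a term repeated many times in one document, and the logarithm in IDF keeps very rare terms from dominating the score.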
-
Example
$t$ = health. $idf_{t,d} = \ln\left(\frac{4}{2}\right) = 0.6931$
doc | text | $n_{i,d}$ | $n_{t,d}$ | $tf_{t,d}$ | $tfidf$
d1 | Health is a necessary condition for happiness. | 7 | 1 | 0.134 | 0.093
d2 | It is the business of the police to protect the community. | 11 | 0 | 0 | 0
-
Example
doc | text | $n_{i,d}$ | $n_{t,d}$ | $tf_{t,d}$ | $tfidf$
d3 | The city health business department runs several free clinics for health professionals throughout the year. | 15 | 2 | 0.13 | 0.087
d4 | That plane crash was a terrible business. | 7 | 0 | 0 | 0
, health TF-IDF (d1, d3) .
-
Section 4
-
Vector Space Model
.
Vector , Vector (Dimension) .
d VSM .
$V_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$
where $w_{t,d}$ is
$w_{t,d} = tfidf_{t,d,D} = tf_{t,d} \cdot idf_{t,d}$
-
Cosine Similarity
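[Figure: query vector $\vec{q}$ and document vectors $\vec{d_1}$, $\vec{d_2}$]
The similarity to $\vec{q}$ is measured by $\cos\theta$, the cosine of the angle between the two vectors:
$\cos\theta = \frac{\vec{d_1} \cdot \vec{q}}{|\vec{d_1}| |\vec{q}|}$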
, Cosine Similarity .
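Cosine Similarity can be computed in a few lines. This is a minimal sketch over dense lists of weights; returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero, not something the slide specifies.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the score depends only on direction, a long document and a short one with the same term proportions receive the same similarity to the query.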
-
, .
1 .
2 .
3 Cosine Similarity .
4 Similarity .
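The search procedure can be sketched end to end. This is a minimal in-memory sketch, assuming the four steps correspond to the flows named later in the deck (Vectorize, List Preload, Scoring, Sorting and Paging); the sparse dictionary representation, the `search` function, and the sample weights are illustrative only.

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse term-to-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query_vec: dict, doc_vecs: dict, top_k: int = 10):
    # Preload: keep only documents that share at least one term with the query.
    candidates = {
        doc_id: vec for doc_id, vec in doc_vecs.items()
        if any(term in vec for term in query_vec)
    }
    # Scoring: cosine similarity between the query and each candidate.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in candidates.items()]
    # Sorting and paging: highest similarity first, then cut to one page.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up document vectors; real weights would be TF-IDF values.
docs = {
    "d1": {"health": 0.9, "happiness": 0.4},
    "d2": {"police": 0.8},
    "d3": {"health": 0.5, "clinic": 0.5},
}
hits = search({"health": 1.0}, docs)
```

Restricting scoring to candidate documents keeps the cost proportional to the number of matching documents rather than to the size of the whole index.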
-
Section 5
-
Subsection 1
TF-IDF()
-
Flow Diagram
MapReduce Flow , Flow Diagram .
- HDFS .
- HDFS TextFile .
- .
-
TF-IDF Data Flow Diagram
Document
Flow A: Term Document Index
Flow B: Document Term Index
Flow C: Calculate TF
Flow D: Calculate DF
MySQL
.
TD, DT
TF, DF
-
Flow A. Term-Document Index
Document
Document
Noun Extracter
Noun Extracter
Term Document Indexer
MySQL(TD Index)
ID: 13, "
."
ID: 14, " OS X
."
["","","","",""]
["","OS","X","","",""]
Mapper Reducer
MapReduce Job
-
Flow B. Document-Term Index
Document
Document
Noun Extracter
Noun Extracter
Document Term Indexer
MySQL(DT Index)
ID: 13, "
."
ID: 14, " OS X
."
["","","","",""]
["","OS","X","","",""]
Mapper
MapReduce Job
-
Flow C. Term Frequency
Document
Document
Noun Extracter
Noun Extracter
Term Frequency Counter
MySQL(TF)
ID: 15, "
."
ID: 27, "OmmiGraffle 99 ."
["", "", "", "", ""]
["OmmiGraffle", "", "", "99", "",
""]
Mapper Combiner
MapReduce Job
WordCount .
-
Flow D. Document Frequency
MySQL(TD Index)
Document Frequency Counter
MySQL(DF)
SQL Query
IDF DF
DocumentCount .
-
Subsection 2
-
Data Flow Diagram
Flow A: Vectorize
Flow B: List Preload
Query(User Input)
MySQL
Flow C: Scoring
MySQL(Temporary)
Flow D: Sorting and Paging
Search Result
(Query)
-
Flow A. Vectorize
Query(User Input)
Noun Extracter
Term Frequency Counter
Next Flow
" " ["", "", ""] , ,
VSM
Term Frequency .
-
Flow B. List Preload
Query Vector
Merge document list contain terms in query vector
MySQL
Load Document Vector Information
.
, TF 300 .
-
Flow C. Scoring
Query Vector
Load Document Frequency
MySQL
Loaded Document Vector
Scoring TF-IDF
, ,
, ,
, ,
.
Cosine-Similarity .
-
Flow D. Sorting and Paging
Presorted TF-IDF Scores
, ,
.
Sorting Sorted Data
,,
, .
.
-
Section 6
-
SKT T cloud biz 4
1 : 1 Vcore, 2GB RAM, 40GB HDD, CentOS 5.5 64bit
Sun Java 1.6.0_35
Apache Hadoop 1.0.3 IP
Hadoop1: 1.234.45.90 (Namenode, Secondary Namenode)
Hadoop2: 1.234.45.94 (Datanode)
Hadoop3: 1.234.62.102 (Datanode)
Hadoop4: 1.234.62.101 (Datanode)
-
Connect to Hadoop1 (1.234.45.90) over ssh, with the input documents stored in HDFS under /chiwanpark/memento-input, then run:
> hadoop jar memento-engine-0.1-SNAPSHOT.jar com.chiwanpark.memento.mapreduce.WorkRunner
.
-
-
Connect to Hadoop1 over ssh and run the searcher:
> java -classpath memento-engine-0.1-SNAPSHOT.jar:/opt/hadoop/conf com.chiwanpark.memento.searcher.cli.SearchRunner query ""
id TF-IDF Score . ID HDFS .
> hadoop fs -cat /chiwanpark/memento-input/e02f5b1df830e8fcf89df333dc2dd642a9f0569ee6aea26cc1e3ec3a22e4b988bfadb397c1ba7bd593feb5bd99276b9ce15a84741b5fe583d1dc2cb9110ae70c.txt
-
-
-
Section 7
-
Subsection 1
-
MapReduce , .
TF-IDF Lucene Lucene Score TF-IDF Score .
-
Test1 Job1 - 102 /3 58 ( ) Job2 - 102 /3 43 ( ) 0.22
Test2 Job1 - 99 /3 54 ( ) Job2 - 99 /4 4 ( ) 0.21
-
Test3 Job1 230 /8 44 ( ) Job2 230 /8 16 ( ) 0.22
Test4 Job1 1862 /1 3 55 ( ) Job2 1862 /1 4 27 ( ) 0.24
-
Subsection 2
-
,
, .
.
-
, Hadoop File Split Mapper . , Single line Split .
Cloud System 4 , VM I/O . VM .
-
0
/ 1
TTA
-
1
,
.
.
-
2
. .
"We also want to challenge industry, research universities, and nonprofits to join with the administration to make the most of the opportunities created by BIG DATA. We need what the president calls an 'all hands on deck' effort." Tom Kalil (OSTP)
-
3
( ?)
,
2012
: ??
-
4
IBM
2012 CEO
IBM ,
PC
-
5
IBM CEO
60 100 CEO
"One of the most profound things they talk about is data will separate the winners and losers in every single industry."
CEO
??
-
6
BIG DATA ( )
/
/ New Revolution
-
7
?
BIG : (volume) -
Gartner 3V = Volume + Variety + Velocity
-
8
HDD (1980~2010)
-
9
-
10
IT
,
Hadoop :
Amazon Web Service
-
11
,
,
,
Definition (Broad sense):
-
12
3V
, ,
, , ,
-
13
/
,
(context-based service)
-
14
PC
??
?
-
15
()
() . Tim O'Reilly
-
16
-
17
10
10
-
18
Occupy BIG DATA!
-
19
, -
- 1/3 10TB
BIG DATA
BIG DATA TECH
,
-
20
-
, ()
-
,
-
21
[] (sensing)
-
22
The Santa Cruz Experiment
:
2011 7 1 27%
-
23
/
, ,
-
24
-
? 10
LTE 1
-
25
( CEO )
-
26
()
,
100
-
27
BIG Data = Big Brother?
Privacy
/
vs.
, ,
-
28
, ?
,
-
29
1
,
,
ICT
-
30
, ,
-
31
Tim Berners-Lee Nigel Shadbolt
2011
-
32
~2010 2011 2012 2013 2014 2015 2016~
(IoT)
/ , SNS
DATA
MPP DWH - PB
MPP DWH
Stock
+ Flow
(POS/ ) ,
(SNS ) ,
Stock/Flow
: (2011).
-
33
2013 (10/50%)
2013 (4/20%)
*
WHY?
and
1
-
34
,
1
, Go or Stop? []
ICT
/
Slope of Enlightenment
2012
2013
2015~6
2016~7
2018
-
35
--
8
10
-
Big Data
October 18, 2012
-
2012 SAP AG. All rights reserved. 2
Agenda
1. Big Data
2. Big Data Technology Outlook
3. Big Data
4. SAP Big Data SAP Big Data Framework
5.
-
Big Data
-
Big Data Gartner, IDC
, .
(Critical Mass)
Big Data
Mobile Device (Smart Device)
Cloud Service
Social Media
Big Data 3
Cloud Computing
Real Time
Network
Big Data
E-mail: 290
: 375 MB
Youtube :
20
Google :
240 MB
twitter : 5,000
Facebook :
7,000
Mobile Internet :
1.3 MB
Amazon :
72.9 GOOD & Munday, 2011 the world of Data
-
Big Data 2012 9
Aberdeen presents a baseline of current "Big Data" initiatives and highlights some of the most attention-grabbing strategies and solutions.
Surprisingly, 93% of companies surveyed listed structured data as key to their "Big Data" efforts, followed by the more typical sources such as social media and customer sentiment data.
Predictive analytics features prominently in "Big Data's" future, but about three out of five companies polled also cited mobile BI and in-memory computing as technologies they will be investing in within the next two years.
-
Big Data 2012 9
Source: Aberdeen Group, January 2012
1: Drivers for Fast, Streamlined Analysis of More Data
47% 1
35% Real Time Near Real Time
71% , 3 1
: 150 TB
17% 1 PB
42% , 1/5 75%
23%
47%
: 14, 9, 5 Big Data Enterprise
Big Data , Active Business Data 5 TB 99
, ;
Dark Data
Velocity
-
Big Data 2012 9
Big Data .
Big Data , , 93% Big Data ( )
: High Volume, High Velocity, Internet generated source Click Stream, Social Media, customer sentiment data
, ,
,
, ,
Human Resource , Location & Geo-spatial
Digital Media
Machine to Machine (M2M), Sensor
,
: (Doc, PPT, XLS), e-Mail
2: Sources that feed Big Data
Source: Aberdeen Group, January 2012
Big Data Enterprise
Big Data , Active Business Data 5 TB 99
-
Big Data 2012 9
Currently Use
Plan to Use
Predictive Analytics Big Data , Big Data
3: The Technological Wave of the Future Big Data
Source: Aberdeen Group, January 2012
Big Data Enterprise
Big Data , Active Business Data 5 TB 99
Big Data High Volume
MPP: cluster computing
Columnar DB:
Real time Integration Tools: / Stream
BI Mobile BI
In-Memory Computing
, Commodity
-
Big Data 2012 9
1: Unique Data Source Used for Business Analysis
Source: Aberdeen Group, January 2012
2: The Top Processes Driving Data Management Initiative
Source: Aberdeen Group, January 2012
,
12 : 38%
3 2.5
(EDW, DM, Application, Unstructured, Social Data)
,
, , ,
Volume Velocity
Dark Data
Variety / Complexity
-
Big Data 2012 9
3: Top Strategic Actions to Support Data Management
Source: Aberdeen Group, January 2012
4: Who Owns Data Management / Government
Source: Aberdeen Group, January 2012
Big Data IT
IT , .
Big Data
-
Big Data Technology Outlook
-
Big Data Eco-System
NoSQL
Data .
/
Hadoop
Apache Open source project
Map/Reduce: , Web logs, text data, graph data.
Hbase:
Hive: , , DW
Commercial support Cloudera, HortonWorks, IBM, & EMC/Greenplum.
R Language
Open Source
-
Big Data Hype Cycle, 2012
Figure 1. Hype Cycle for Big Data, 2012
-
Big Data Priority Matrix, 2012
Years to mainstream adoption (columns): Less than 2 years | 2 to 5 years | 5 to 10 years | More than 10 years
Transformational: Column Store DBMS; Cloud Computing; In-Memory Database Management Systems; Complex-Event Processing; Content Analytics; Context-Enriched Services; Hybrid Cloud Computing; Information Capabilities Framework; Telematics; Information Valuation; Internet of Things
High: Predictive Analytics; Advanced Fraud Detection and Analysis Technologies; Cloud-Based Grid Computing; Data Scientist; In-Memory Analytics; In-Memory Data Grids; Open Government Data; Predictive Modeling Solutions; Social Analytics; Social Content; Text Analytics; Cloud Parallel Processing; High-Performance Message Infrastructure; IT Service Root Cause Analysis Tools; Logical Data Warehouse; Sales Analytics; Search-Based Data Discovery Tools; Social Network Analysis; Semantic Web
Moderate: Social Media Monitors; Web Analytics; Activity Streams; Claims Analytics; Database Platform as a Service (dbPaaS); Database Software as a Service (dbSaaS); Intelligent Electronic Devices; MapReduce and Alternatives; noSQL Database Management Systems; Speech Recognition; Web Experience Analytics; Cloud Collaboration Services; Dynamic Data Masking; Geographic Information Systems for Mapping, Visualization and Analytics; Open SCADA; Video Search
Low:
-
Big Data
-
11 Industry Big Data Opportunity Heat Map
Big Data .
Volume, Velocity, Variety
Hardware, Software, Service
-
Big Data AS-IS
ERP/CRM/SCM/PLM/MES
+
/ :
: High
ACID :
Data Governance : High
DW/eDW/DM/RMS/BI
+
/ :
: Middle
ACID :
Data Governance : High
ECM/EDMS/KMS/ILM
+
/ :
: High
ACID :
Data Governance : Middle
Blog/Facebook/Twitter/Log
/ :
: Low
ACID :
Data Governance : Low
, , , , ACID , Data Governance, ,
Business Social Media
-
Big Data AS-IS : AS-IS
() ()
/ Dot Com
: 162
Dark Data Big Data .
** Dark Data , , ,
Source: Gartner, July 2012 [Dark Data Represents the Most Immediate Opportunity to Leverage Big Data]
-
Big Data AS-IS : Big Data Market Big Data Big Data
Business Big Data ( ) Market Big Data (Portal )
+
+
, , ACID (Atomicity/Consistency/Isolation/Durability ) - , , , - , ,
, ACID CAP (Consistency / Availability / Partition Tolerance 2 )
Real Time Time Latency
Fact Past , Future
BI Tool Tool Open Source
Data Scientists, Experts
RDBMS SQL Open Source
Open Source Platform NoSQL Map/Reduce + Hadoop
* Open Source
* / ,
-
Big Data
/
Cloud Digital Prototyping &Testing On demand Cloud
branch /Self
(Trading, , Processing)
Trading //
ICT Content
Content /Social
Content
Tracking
/
/
Processing
Booz & Company (2011), The Next Wave of Digitization: Setting Your Direction, Building Your Capabilities
-
Big Data Best Practices -
Big Data Best Practice
, , IT , ,
.
o Hadoop Big Data
o Hadoop DW
o MapReduce Hadoop
Big Data
, off line
[Gartner 12 dimension model for Big Data]
-
Big Data : Open Source Big Data
Data
o Commodity System VS Enterprise System
Hadoop (HDFS) Batch Processing
o ,
Big Data BI tool
Skill Set
o Hadoop, Data Scientist, NoSQL, Map/Reduce, R Language
Big Data Back Up
Big Data Data Governance / Compliance
Big Data ( , )
[Diagram: HDFS. A client works with the Name Node (stores metadata) and the Data Nodes (store actual data in blocks); blocks are replicated between Data Nodes.]
[Diagram: MapReduce data flow. HDFS (input) -> MapReduce (process) -> HDFS (output).]
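The HDFS roles in the diagram (a Name Node holding metadata, Data Nodes holding replicated blocks) can be sketched as follows. This is a toy model under stated assumptions: the 4-byte block size, the node names, and the round-robin placement are illustrative only, and real HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, data_nodes: list, replication: int) -> dict:
    """Return Name-Node-style metadata: block id -> Data Nodes holding a replica.

    Placement is simple round-robin, an assumption made for this sketch.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    return {
        block_id: [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
        for block_id in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    metadata = place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
    for block_id, nodes in metadata.items():
        print(block_id, blocks[block_id], nodes)
```

Because every block lives on more than one Data Node, losing a single node leaves at least one replica of each block readable.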
-
SAP Big Data SAP Big Data Framework
-
Big Data 3V (Velocity, Volume, Variety)
[Diagram: data sources: CRM data, GPS, Demand, Speed, Velocity, Transactions, Opportunities, Service Calls, Customer, Sales orders, Inventory, E-mails, Tweets, Planning, Things, Mobile, Instant messages]
Velocity 18 2 ,
IDC
Volume 2005 150 Exabyte, 2011 1,200 Exabyte
The Economist
Variety 80 % ( + )
Gartner
-
SAP Big Data Framework (Velocity, Volume, Variety)
SAP Sybase ESP Complex Event Processing Engine
Real Time Analytic
Query then Data, not Data then Query
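The complex event processing idea on this slide, where the query is in place before the data arrives, can be sketched in plain Python. This is not Sybase ESP syntax: the `ContinuousQuery` class, the window size, the threshold, and the sample readings are all assumptions made for illustration.

```python
from collections import deque


class ContinuousQuery:
    """A standing query over a stream: moving average of the last `window`
    readings, reported only when it exceeds `threshold`."""

    def __init__(self, window: int, threshold: float):
        self.values = deque(maxlen=window)  # old readings fall out automatically
        self.threshold = threshold

    def on_event(self, value: float):
        """Process one incoming event; return the average if it triggers an alert."""
        self.values.append(value)
        average = sum(self.values) / len(self.values)
        return average if average > self.threshold else None


# The query is registered first ...
query = ContinuousQuery(window=3, threshold=100.0)

# ... and the data flows through it afterwards, one event at a time.
alerts = [query.on_event(reading) for reading in [90, 95, 120, 130, 80]]

if __name__ == "__main__":
    print(alerts)
```

A conventional database works the other way round: the data is stored first and a query is run against it later, which adds latency between an event and its detection.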
SAP HANA In Memory Computing Engine
In Memory Appliance
In Memory Analytic
Up to 1,000 times faster
SAP Sybase IQ Smarter Analytic engine
The 1st Columnar DBMS
Open Platform
In Database Analytic
:
Now-casting
-
SAP Big Data Framework (Velocity, Volume, Variety)
SAP Sybase IQ Smarter Analytic engine
Multiplex Grid Architecture
No Volume Limitation The Largest EDW Platform
SAP HANA In Memory Computing Engine
In Memory Appliance
Up to 100 node scale out Capacity
->
-
SAP Big Data Framework (Velocity, Volume, Variety)
SAP Sybase IQ Smarter Analytic engine
Unstructured Data Management
Hadoop Integration
SAP HANA In Memory Computing Engine
Text Analytic Engine
R embedded
-
Ingest Store Process Present
Effort
Effort
/
Extract-Transform-Load
Event Stream Processing
ACID
SQL/OLAP
DB UDF
DB DFS
Low-latency
,
(DFS)
BASE
BI
Map/Reduce ,
SQL
Connectivity SQL
High-latency
Big Data
-
SAP Real-time Analytics
SAP Big Data Processing Framework
Hadoop
Smart Meter
,
Big Data ad-hoc
Big Data streaming
Big Data
, ,
&
-
SAP BusinessObjects BI solutions
Transaction Processing
DB Engine
In-memory Computing Engine
DB Engine
Analytic Grid
DB Engine
MapReduce Batch Compute Framework
Sybase Replication Server, SAP BusinessObjects Data Services (Integrate / synchronize data across deployment options)
Sybase ESP Stream & event processing
SAP Big Data Processing Framework
SAP HANA Sybase IQ
Sybase ESP Monitor / filter streaming events
Semi-structured Data Structured Data Unstructured Data
Hadoop Sybase ASE
Hive/HDFS
SAP Big Data Framework :
,
1) , 2) , 3)
Ingest
Store
Process
Present
( )
Targeting
-
Hadoop Distributions | OS + Hardware | Map-Reduce (M/R) Support
Reporting / Analytics
Reporting / Analytics
Reporting / Analytics
EDW ETL / Push Down Transformations
ETL / Move
Scheduled reports
Data Mart Data Warehouse
Big Data EDW Streaming Real-Time Analytics
M/R Analytics
M/R Analytics
M/R Analytics
HADOOP HADOOP HADOOP
CEP
Hadoop Big Data
-
: Mitsui Knowledge Industry Healthcare industry Cancer cell genomic analysis
: Real-time Big data (R + Hadoop + HANA)
Mitsui IT
, , Big Data , : 1,990
:
1 1 TB DNA Sequence Matching
:
2 3 . HANA MKI 15 , 216
: DNA
:
: ,
Generate Reports
Generate Reports
Generate Reports
HANA
Hadoop
Hadoop-HANA Connector
Variant calling with SAMtools
More Analysis with R packages
R Integration Predictive Analysis
Library
Preprocess Data Analysis Annotation
: 2~3 -
: 2~3 ( )
: 20~40 - SAP HANA & Apache Hadoop
Manual tasks Computational tasks
-
: T Mobile USA
: SAP HANA + SAP BusinessObjects + DW
2011 ( 2 1 )
,
( 9 2 )
:
50 - 60
18 (Teradata)
5.5 , 60
2 1
. ,
Company: T-Mobile USA
Headquarters: Bellevue, Washington
Industry: Telecommunications
Products and Services: Mobile telephone service
Employees: 36,000 worldwide
Revenue: US$20.6 billion
50x improvement in the performance of analytics: "We can recalibrate offers in the market place in one day that took a week using our existing solutions." Erez Yarkoni, T-Mobile CIO
-
SAP Big Data Value SAP HANA Real Time Big Data
Big Data
Big Data
Billing
CDR
Real Time Replication Pre-processing
In DB Mining Real Time BI Market Big Data Business Big Data
Integrated Analytics on SAP HANA
-
Big Data SAP's Value
Higher Performance
Higher Speed
More Data
Better Capability
SAP's Advanced Value
Business
Social Media
Hadoop
-
Big Data Big Data
SAP Big Data Framework Big Data Value
Volume + Variety
Volume + Velocity
Hadoop batch pattern analysis
SAP real-time analytical processing
SAP Big Data
Value
, , SAP Big Data
,
Big Data
-
!
SAP D&T
-
-
HANA | Database & Technology | SAP Korea
: SAP HANA
-
2012 SAP Korea All rights reserved. 2
1. In-memory Computing ?
2. SAP In-memory Technologies
3. -
4. Roadmap
-
In-memory Computing ?
-
IMC(In-Memory Computing)
Big Data :
Mobile :
RTE, Cloud, SaaS
x86 64bit multi-cores
DRAM $10 / GB NAND Flash $1 / GB
- by Gartner : Top 10 Strategic Technology Trends, 2012 Feb
~100ns
>1Mns
+
IT Readiness
S/W (IMDB)
+
-
IMC
2012, 70% Global 1000 BI , .
- Tipping Point 2013 .
2016 - - DBMS 25% DW (OLTP) .
Big Data 93% DBMS 63% In-Memory Computing, 50% Columnar DB, 50% Hadoop .
Oct 2011
Oct 2006
Jan 2012
Feb 2012
()
~
-
-
- SAP .
.
.
.
.
1990 .
.
-
- IT
-
SAP IMC Technologies
-
SAP In-Memory Computing Evolution
Object Store
APO In-memory Object Cache
2000 Object Store
Column Store
In-memory Text Search Column Index
2001
Object Store
Column Store
Row Store SQL
OLTP
Row Store IMDB 2005 SAP
2002
Object Store
Column Store
Row Store SQL
OLTP
MPP Appliance
BW In-Memory MPP Appliance
2006
SAP HANA In-Memory Database
Row & Column Store OLTP OLAP
H/W Appliance 2011
-
In-Memory DB : SAP HANA
-
: Disk-based vs Memory-based
Data Block Memory Cache
Database ( 10 TB)
Conventional RDBMS
Disk I/O
Memory (128 GB)
Memory
Data Volume Log Volume
All Data Sets
Persistent Storage
SAP HANA
Data Modeling
( Page)
Database
Disk Database
(100TB+)
-
SAP HANA
Synergy : In-memory + Columnar + MPP
HANA
DW
+ 5,000
> 1,000
SAP HANA
Row
,
Column
1/10
-
In-Memory MPP DB
Disk-based MPP
In-memory MPP
MPP
SMP
-
Latency
-
()
With HANA
Without HANA
-
Stand-by Fail-over
100TB = SAP 8
Petabyte
HANA -
-
Batch Processing
Intraday+
Very Large 1 PB+
Ad-Hoc Predictive
HADOOP
Event Driven
Transactional
Processing EDW
Operational Data Store
Multi-Dimensional
OLAP
Real-Time Real-Time Intraday+ Intra-hour Intraday+
Small < 1GB
Small < 1GB
Large 1 TB+
Medium 100 GB+
Medium 100 GB+
Eventing Parametrized Parametrized Parametrized Ad-Hoc
Predictive Analysis
Data Volume
Latency
Event Insight
Sybase ASE
Sybase IQ
HANA
Drive Insights into Structured Data Analytics Framework
+
HANA -
-
DBMS vs Hadoop
-
SAP HANA
. /
.
. , R
.
-
7
BI
/ / SI/SM/
SAP
HANA
ODBC
JDBC
-
HANA -
-
Readiness
3rd party
3rd party backup tools - IBM Tivoli, HP Data Protector, Symantec Netbackup etc.
3rd party monitoring tools - IBM Tivoli, HP Service Guard etc. (In preparation)
(HA)
Stand-by Node/System
Disaster Tolerance
HANA Instance Failover.
Automatic and manual procedures possible
&
Full Data Backup
Log Backup
Disaster Recovery
(Bare Metal Restore)
Data Center Readiness
SAP HANA
Available today Available today Available today Available soon In preparation
& Administration
SAP Solution Manager End to End monitoring/ alerting/ scheduling
Security & Auditing
-
SAP HANA
Memory
Persistence Storage
Log Volume
(SSD)
Data Volume
(SSD, High-speed SAS)
[Persistency Layer] [Scale-out HA] [Disaster Tolerance, Warm stand-by]
-
HANA vs DW Appliance ?
+
-
Exadata 3 vs SAP HANA
-
-
-
-
Go deep
Go broad
In Real-time
with High-speed
w/o pre-fabrication
,,
,
//
-
- :
1 600+ , 200+
HANA HANA
1 10+
1.5 30+
1 10
=> IT .
, , , , .
-
2012 86
-
,
270
, DB
-
Manufacturer
Computing Engine
Machine Owner/Operator
Dealer (option: Delivered via CRM portal)
Manufacturer
Real Time
Equipment data: engine temp, oil pressure, RPM, CO2, defect codes, speed, etc.
HANA
HANA DB
, ,
> >
-
60 times faster
HANA DB R . SAS .
-
408,000x faster than traditional disk-based systems in technical PoC
216 (DNA): 2-3 -> 20
-
"Transforming information into intelligence in real time is a cornerstone for McLaren's winning formula and increasingly critical for the future of every company." Jim Hagemann Snabe, co-CEO, SAP AG
"Using HANA we can hopefully automate decision making. People have always made decisions based on the data, but we want to get to the point where the system can make the decision." Stuart Birrell, McLaren CIO
14,000 : 5 -> 1
99% predict the outcome of a race
5,000 events per second loaded onto SAP HANA (not possible before)
10-30%
Interactive data analysis leading to improved design thinking and game planning
1,000x faster: tumor data analyzed in seconds instead of hours
:
2-10 seconds for report execution
-
McLaren Group Limited Automotive Industry (Formula One) Predict and Transform the outcome of races
Telemetry
.
.
.
99%
14,000 : 5 -> 1
-
McLaren Case Study
-
3
95% reduction in data load time: 2 minutes in BW HANA vs. 35-40 min in BW Oracle
266x faster query response time with 15x average
/ : (BW/Oracle) 15 (BW/HANA)
/
2.5x faster reporting with sub-optimized queries: from 28.54 sec. to 11.38 sec.
453.7 : 1787.49 -> 3.94
70% saving on storage space with data compressed to 30%
1,000 : 77 -> 13
60% improvement in data load time
4-10 times faster DSO activation
(2)
-
"Co-PA was the most interesting thing to look at in the first step. We saw response times reduce from about 620 seconds to about five seconds in one case." Andrew Pike, (former) CIO
124x faster analytics (drilldown by alphacode): from 620 sec. to 5 sec.
37x faster cost allocation (drilldown by sending cost center): from 260 sec. to 7 sec.
40x faster reporting (runtime reading line items for EBIT with commodity sales): from 260 sec. to 7 sec.
9x faster cost allocation (initial report): from 45 sec. to 5 sec.
355x faster data analysis: from 77 minutes to 13 seconds
8 weeks: rapid, non-disruptive implementation
2x data compression
60x faster SKU/Month reporting: from 120 sec. to 2 sec.
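The speedup factors quoted on this slide are simple before/after ratios, and they can be checked with a few lines of Python. The timings below are copied from the slide; the `speedup` helper and the short labels are assumptions made for this sketch.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Speedup factor = old runtime / new runtime."""
    return before_seconds / after_seconds


# (label, before in seconds, after in seconds, factor quoted on the slide)
CASES = [
    ("analytics drilldown by alphacode", 620, 5, 124),
    ("cost allocation drilldown", 260, 7, 37),
    ("EBIT line-item reporting", 260, 7, 40),
    ("cost allocation initial report", 45, 5, 9),
    ("data analysis", 77 * 60, 13, 355),  # 77 minutes converted to seconds
    ("SKU/Month reporting", 120, 2, 60),
]

if __name__ == "__main__":
    for label, before, after, quoted in CASES:
        factor = speedup(before, after)
        print(f"{label}: {before}s -> {after}s = {factor:.1f}x (slide: {quoted}x)")
```

Most rows reproduce the quoted factor after rounding. The row quoted as 40x carries the same 260 sec. and 7 sec. timings as the 37x row, so its computed ratio is also about 37; the transcript does not show which of the two figures is the intended one.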
: , /
-
SAP HANA Roadmap
-
4 HANA
-
SAP BPC ( )
SAP Finance and Controlling Accelerator
SAP Smart Meter Analytics
SAP Sales Pipeline Analysis
SAP Predictive Analytics
SAP Customer Segmentation Accelerator
SAP HANA Platform
SAP Business Warehouse
SAP BusinessObjects BI
SAP CO-PA ( )
SAP B1 ( ERP)
Third Party Apps
SAP ERP
Today
New Cloud Apps
New Mobile Apps
SAP Planning for Retail
SAP Customer Value Intelligence
SAP Predictive Segmentation
SAP Sales & Operations Planning
SAP Account Intelligence
SAP Demand Signal Management
SAP Liquidity Risk Management ( )
SAP Customer Energy Mgmt.
SAP Trade Promotion Mgmt
Future
HANA
-
Legacy ODS EDW Data Marts BI/Report Mart
/
/
BI/
Legacy ODS EDW Data Marts BI/Report Mart
Oracle
(=)
SAP
()
Legacy ODS EDW Data Marts BI/Report Mart
SAP
()
Sybase ASE
/
Teradata Exadata Exadata Exalytics
+ Sybase ASIQ
-
SAP HANA DB
ERP , , Backflushing
, (Mobile BI) Time Gap (Predictive Analysis)
SAP HANA with Sensor Technology, Mobile, Big-Data, Social Data, etc , ,
-
-
- ,- BI -
-
. Email: [email protected]
-
Case Study
2012 10 18
621 C 5
Tel: 02-6246-1400 http://www.wise.co.kr
TTA
-
1 WISEiTech Case Study
1. ,
2.
3.
4. ? SNS ? ?
5.
-
Case Study
, ,
() .
.
,
.
!
-
Case Study
.
, ?
1 ? -
3 .
.
-
> >
()
3 RDBMS
,
Case Study
()
?
-
,
BI (OLAP Report )
?
-
Case Study - v.s
.
.
, ,
.
?
-
?
3V?
( ) ?
100 TB ?
,
-
1. ,
2.
3.
4. ? SNS ? ?
5.
-
Case Study - Global
TV . TV app
, Video .
.
. .
2~3 ,
50 . .
? ?
?
-
Case Study - Global
Global Public Cloud 2 Global Public Cloud 1
ODS
,
DW Mart
Mart
OLAP
Reporting
ODS: Operational Data Store
DW: Data Warehouse
OLAP: On-Line Analytical Processing
RDB BI
-
Case Study - Global
.
.
. .
SW .
?
. . ,