Download - Big Data Curation And Its Application
Copyright © 2013 Hanmin Jung
Hanmin JungHead of Dept. of Computer Intelligence Research
KISTI
Big Data Curation & Its Application
Copyright © 2013 Hanmin Jung
Let Me Introduce Myself :-)
2
Copyright © 2013 Hanmin Jung
� Recent Social Activities (2012~)� 보건복지부빅데이터기술전문가패널� 한국연구재단융합연구우수성평가위원회위원� 국립전파연구원방송통신표준전문위원� 기술표준원빅데이터표준화프레임워크표준연구회위원� ITRC 과제기획위원회빅데이터분과, UI/UX 분과기획위원/기술간사� 교육과학기술부과학기술빅데이터포럼전문가위원회간사� 국가경쟁력강화위원회 ‘범부처 Giga KOREA 사업’전문가위원회위원� 국가과학기술위원회기술영향평가위원회위원� 방송통신위원회빅데이터포럼인력양성분과위원장/자문위원/운영위원� IBS 장비심의위원회, 지식경제부중앙장비심의위원회심의위원� 한국과학창의재단과학창의앰배서더� 지식경제부 ‘SW+인문포럼’멘토� IT VC 전문성강화프로그램전문강사� ISO/IEC JTC1/SC32 전문위원회위원, JTC 1/SC34 표준개발위원회위원� ISO 20022 TC68 Advisory Expert
� 한국정보통신기술협회정보통신표준화위원회 –디지털콘텐츠/SW품질평가/메타데이타/사물지능통신프로젝트그룹위원
� 교육과학기술부차세대정보컴퓨팅분과 –기획위원� 국무총리실정보지식인대회출제/평가위원� 문화체육관광부공공문화정보전문가위원
Let Me Introduce Myself :-)
3
Copyright © 2013 Hanmin Jung
� Weekly IT Trends (NIPA)� 과학기술빅데이터동향및활용방안 (Vol. 1593)
� 모바일비즈니스인텔리전스기술동향 (Vol. 1573)
� 시맨틱소셜네트워크를구성하는온톨로지어휘기술현황 (Vol. 1560)
� 정부빅데이터분석기술적용동향 (Vol. 1556)
� 제품및기술수명주기활용동향 (Vol. 1550)
� 개체식별관점에서바라본링크드데이터동향 (Vol. 1524)
� 유망기술발굴모델및서비스동향 (Vol. 1521)
� 포지셔닝의측면에서살펴본모바일태블릿컴퓨터동향 (Vol. 1486)
� 테크놀로지인텔리전스동향 (Vol. 1477)
� 추론기술연구동향 (Vol. 1446)
� 트리플레파지토리벤치마킹 (Vol. 1439)
� 차세대 IT 기기와HCI 기술동향전망 (Vol. 1435)
� 시맨틱검색기술동향 (Vol. 1431)
� 시맨틱웹국내특허동향 (Vol. 1420)
� 감성분석과브랜드모니터링기술동향 (Vol. 1396)
� 시맨틱웹이경제⋅사회에미치는영향 (Vol. 1372)
� 웹매핑서비스비교분석 (Vol. 1352)
� 시맨틱웹 2.0 기술동향 (Vol. 1344)
� 국내포털검색시장및특허동향 (Vol. 1341)
� Open API 기술동향 (Vol. 1296)
� 엔터프라이즈검색기술동향 (Vol. 1276)
� 전자상거래검색기술동향 (Vol. 1273)
� 시맨틱웹포털기술동향 (Vol. 1264)
Trend Reports
4
Copyright © 2013 Hanmin Jung
Art Curator
5http://www.chrysler.org/files/resources/gallery-talk-3.jpg
Copyright © 2013 Hanmin Jung6
Press
Copyright © 2013 Hanmin Jung
� Definition
� Active management of data over its lifecycle
� Ensuring that data is trustworthy, discoverable, accessible, reusable, and
fit for use
� E.g. data preparation for analytics
� Curation Types
� Manual curation
� (Semi-)automatic curation
� E.g. data cleansing, record duplication, classification
� Sheer curation (curation at source)
� Integrated in workflow for creating and managing data
Data Curation
7E. Curry, A. Freitas, and S. O’Riain, “Data Curation at the New York Times”, 2011.
Copyright © 2013 Hanmin Jung
� Steps to Setup the Process
� Identify what data you need to curate
� Identity who will curate the data
� Define the curation workflow
� Identify appropriate data-in & data-out formats
� Identity the artifacts, tools, and processes needed to support the process
Data Curation
8E. Curry, A. Freitas, and S. O’Riain, “Data Curation at the New York Times”, 2011.
Copyright © 2013 Hanmin Jung
Case Studies
9
1. Apple Maps
2. Strategic Foresight
3. Big Data
Copyright © 2013 Hanmin Jung
http://cdn0.sbnation.com/entry_photo_images/5667445/20120920-DSC_7192VERGE_large_verge_medium_landscape.jpg
Apple Maps
10
Copyright © 2013 Hanmin Jung
Google Maps vs. Apple Maps
11http://www.gadgetreview.com/2013/01/google-maps-vs-apple-maps-ios-comparison.html
Copyright © 2013 Hanmin Jung
Apple Siri
http://www.apple.com/ios/siri/
12
Copyright © 2013 Hanmin Jung13
We can see it as much about itas we know
http://investor.google.com/financial/tables.html
� Google’s Financial Table
Copyright © 2013 Hanmin Jung14
Nokia’s Burning Ships Strategy
http://www.brightsideofnews.com/Data/2011_5_5/Steven-Elop-Burn-Boats-Strategy-is-a-Cortez-Move-for-Nokia/Hernando_Cortez_BurningBoat.jpg
Copyright © 2013 Hanmin Jung
Result of the Strategy
� Worldwide Market Share
� Worldwide mobile device sales to end users in 2008 ~ 2013
Gartner, IDC Worldwide Mobile Phone Tracker
38.6, 117.937.8, 108.531.6,110.427.1, 106.618.7, 82.914.8, 61.9Nokia
3.5, 12.14.9, 19.13.1, 13.73.2, 13.5ZTE
418.6
41.9, 175.4
3.7, 15.4
8.9, 37.4
27.5, 115.0
1Q2013(%, M. Units)Company
3Q2012(%, M. Units)
3Q2011(%, M. Units)
3Q2010(%, M. Units)
3Q2009(%, M. Units)
3Q2008(%, M. Units)
Samsung 23.7, 105.4 22.3, 87.8 20.5, 71.4 21.0, 60.2 17.0, 52.0
Apple 6.1, 26.9 4.3, 17.1 4.0, 14.1
LG 3.1, 14.0 5.4, 21.1 8.1, 28.4 11.0, 31.6 7.5, 23.0
Sony Ericsson 4.9, 14.1 8.4, 25.7
Motorola 4.7, 13.6 8.3, 25.4
Others 45.3, 201.6 36.1, 142 32.2, 112.5 20.6, 59.1 20.1, 61.5
Total 444.5 393.7 348.9 287.1 305.4
15
Copyright © 2013 Hanmin Jung16
Sign of the Result
http://www.google.com/insights/search/
� Google Insights for Search
Copyright © 2013 Hanmin Jung
Strategic Foresight
R. Rohrbeck, H. Arnold, and J. Heuer, “Strategic Foresight in Multimedia Enterprises”, 2007.
17
Copyright © 2013 Hanmin Jung18
Hype Cycle
Emerging Technologies Hype Cycle 2012
Copyright © 2013 Hanmin Jung19
Social Data
http://bynoy.files.wordpress.com/2011/08/united-noy-weblife-60-seconds.jpg
Copyright © 2013 Hanmin Jung20
Machine Data
T. Baer, “What is Big Data? The Reality for Analytics”, OVUM, 2011.
Call data recordsCall data records
Sensory dataSensory data
Web log filesWeb log files
Financial Instrument TradeFinancial Instrument Trade
Copyright © 2013 Hanmin Jung21
Crop Yield Estimation
http://www.geovar.com/data/satellite/rapideye/rapideye_sample_image_bands543.jpg
Copyright © 2013 Hanmin Jung22
Crop Yield Estimation
https://www.agriskmanagementforum.org/content/improved-supply-intelligence-global-agriculture-markets
South American soybean production forecasts South American soybean production forecasts
Copyright © 2013 Hanmin Jung23
Soybeans Futures (CME)
Copyright © 2013 Hanmin Jung24
Data Mining Methods
A. Cheung, “Forecast Anything! The Seven Data Mining Models”
Decision Trees
Naive Bayes
Cluster Analysis
Sequence Clustering
Association Rules
Time Series
Neural Networks
Copyright © 2013 Hanmin Jung25
Value Pyramid
Search
Clustering
Extracting
DecisionSupport
Forecasting
ScenarioPlanning
Advising
Modified from D. Bousfield & P. Fooladi, “STM Information: 2009 Final Market Size and Share Report”, 2010.
InSciTe Advanced (2011)
InSciTe Adaptive (2012)
InSciTe Advisory (2013~)
1st Ta
rget o
f Data
Curat
ion
Copyright © 2013 Hanmin Jung
Data Curationfor Getting Insights & Foresights
26
1. Crawling Data
2. Extracting Information (Text Mining)
3. Resolving Identities
Copyright © 2013 Hanmin Jung
Crawling Data
� Crawling Full Texts from Google Scholar
27
Copyright © 2013 Hanmin Jung
Crawling Data
� Crawling Web Data by RSS & Google API
28
Copyright © 2013 Hanmin Jung
Crawling Data
� Web Data Statistics in InSciTe Adaptive
29
Source/Year 2001 … 2008 2009 2010 2011 2012 Total
IDC 0 … 1 13 21 313 250 598
Wikipedia 0 … 151,942 181,721 408,017 1,555,867 249,0209 4,975,178
InformationWeek 0 … 470 551 536 388 81 2,316
Gizmag 0 … 1,371 2,301 2,970 2,850 2,520 16,731
Technologyreview 89 … 646 693 758 873 405 5,330
IEEE spectrum 64 … 468 426 385 306 205 3,173
Technewsworld 31 … 696 742 1,396 2,332 2,485 9,843
DiscoverMagazine 2 … 104 51 79 57 49 606
NewYork Times 20,940 … 7,979 4,696 3,971 3,669 2,064 124,100
BBC 3,155 … 2,687 3,158 3,318 5,964 4,243 37,073
Fox News 0 … 1,577 1,965 1,123 2,015 1,872 10,749
CNN 3,594 … 1,154 1,499 1,071 1,308 814 19,769
Thomson Reuters 0 … 450 371 309 338 222 3,408
USA Today 458 … 4,616 3,878 4,630 6,579 4,276 38,750
EtnTws.com 447 … 1,964 2,241 2,099 1,585 844 14,259
Total 28,780 … 176,125 204,306 430,683 1,584,444 2,510,539 5,261,883
Copyright © 2013 Hanmin Jung
Text Mining – Concept
30
Copyright © 2013 Hanmin Jung31
� Mining Targets
� Named entities: e.g. PLOs
� Pattern-based entities: e.g. a-mail addresses, phone numbers
� Concepts: abstractions of entities
� Facts and relationships
� Concrete and abstract attributes: e.g. 10-year, expensive, comfortable
� Subjectivity in the forms of opinions, sentiments, and emotions: e.g. positive/negative, angry
S. Grimes, “Text Analytics Overview: Technology, Solutions, Market”, 2011.
Text Mining – Concept
Copyright © 2013 Hanmin Jung32
Text Mining – SINDI Architecture
Copyright © 2013 Hanmin Jung
Text Mining – Example
33
Copyright © 2013 Hanmin Jung
Text Mining – Example
34
Copyright © 2013 Hanmin Jung
Text Mining – Example
35
Copyright © 2013 Hanmin Jung
Text Mining – Example
36
Copyright © 2013 Hanmin Jung
Text Mining – Example
37
Copyright © 2013 Hanmin Jung
Text Mining – Application
38
Copyright © 2013 Hanmin Jung
Resolving Identities
Christian Bizer, Tom Heath, and Tim Berners-Lee, “Linked Data – The Story So Far”, 2009.
� URI Aliases
� URIs that refer to the same real-world objects
� E.g. http://dbpedia.org/resource/Berlin (for Berlin in DBpedia)
� E.g. http://sws.geonames.org/2950159 (for Berlin in Geonames)
� Information providers can set owl:sameAs links to URI aliases they know
about
� Resolution of Data Conflicts in Data Fusion
� Choosing a value in situations where multiple sources provide different
values for the same property of an object
39
Copyright © 2013 Hanmin Jung40
Resolving Identities
� Example
� 소녀시대, 소시, 少女時代, しょうじょじだい, 少時, Sho Jo Ji Dai, So Nyeo
Shi Dae, SoShi, Girl’s Generation, SNSD, So Nyuh Shi Dae, …
Copyright © 2013 Hanmin Jung
Resolving Identities
http://sindice.com/search?q=Hanmin+Jung
41
Copyright © 2013 Hanmin Jung42
OntoURIResolver
� Demonstration
Copyright © 2013 Hanmin Jung
LOD Project
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.html
� Linked Data
� 31 billion RDF triples, 504 RDF links (2011.9)
43
Copyright © 2013 Hanmin Jung
InSciTe Advanced (2011)
44
Copyright © 2013 Hanmin Jung45
InSciTe Adaptive (2012)
https://play.google.com/store/apps/details?id=net.xenix.inscite&feature=search_result#?t=W10.
Copyright © 2013 Hanmin Jung46
InSciTe Adaptive (2012)
Copyright © 2013 Hanmin Jung
InSciTe Adaptive (2012)
� Data Fact Sheet
� Articles: 22.6 millions (9.8 millions for papers, 7.6 millions for patents, 5.3 millions for Web data)
� All technical areas (2001~2011)
� Named entities: 1.9 millions
� Authority dictionary: 1.5 millions entries
� LOD data: 290 GB (are being connected)
47
Copyright © 2013 Hanmin Jung
Press
48
Copyright © 2013 Hanmin Jung
InSciTe Architecture
Analytics Models
ETD ModelEmerging Technology Discovery Model
TLCD ModelTechnology Life Cycle Discovery Model
TLC ModelTechnology Life Cycle Model
OntoRelFinder®
Relationship Path Finder
OntoReasoner®
Reasoning Engine
OntoURI®Semantic Knowledge Manager
OntoPipeliner®
Semantic Service Composer
SS&AESemantic Search & Analytics Engine
OntoURIResolver®
Identity Resolver
SINDI-CORE/LINKEntity & Relationship Extractor
TUC ModelTerminology Use Cycle Model
Ontology
Linked Data
OntoFrame
OntoVerifier®
Reasoning Verifier
Web Data CrawlerRSS/Google API
Web Data
Literatures
49
Copyright © 2013 Hanmin Jung
Ontology
DB
Ontology Schema
Ontology Instances
RDF Triples
Portability & Connectibility
For Storing & Managing
For ServicePlanning ServicesDefining ConceptsExploiting Relations
50
Copyright © 2013 Hanmin Jung
Big Data Processing in InSciTe
51
Copyright © 2013 Hanmin Jung52
InSciTe Homepage
http://inscite.kisti.re.kr
Copyright © 2013 Hanmin Jung53
Thank you
“A lot of times, people don’t know what they want until you show it to them.”
by Steve Jobs
“Many people won’t be convinced until they’ve seen it for themselves.”
by Jakob Nielsen