Talking about LOD!
2014.6.27
Kim Wooju (김우주)
Department of Information and Industrial Engineering, Yonsei University
Talking about LOD. Sharing LOD (Linked Data Party 5)
Contents
I. The Big Data Era and the Flood of Information
II. Big Data Use Cases
III. Limits of Big Data and How to Overcome Them
IV. Building and Using Linked Data
V. LOD2 - The Future of Semantic Technology
I. The Big Data Era and the Flood of Information
An Instrumented Interconnected World
2+ billion people on the Web by end 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009… 200M by 2014
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
Information Overflow on the Web
Growth of the Web
The amount of information available on the Web is growing very fast.
The February 2014 Netcraft survey found at least 920,120,079 sites
(http://news.netcraft.com/archives/category/web-server-survey/).
Information Overflow on the Web
The Indexed Web contains at least 19.8 billion pages (Sunday, 02 March, 2014).
http://www.worldwidewebsize.com/
What Is Big Data?
What is big data? (07/11/2013, European Commission)
Every minute the world generates 1.7 million billion
bytes of data, equivalent to 360,000 standard DVDs.
The big data sector is growing at a rate of 40% a year.
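The two figures quoted above can be cross-checked with a little arithmetic. The sketch below assumes that "million billion" means 10^15 bytes (short scale); the function name is introduced only for this illustration.

```python
# Figures quoted on the slide (European Commission, 07/11/2013).
BYTES_PER_MINUTE = 1.7e6 * 1e9   # "1.7 million billion bytes", assumed to be 1.7 * 10^15
DVDS_PER_MINUTE = 360_000        # "360,000 standard DVDs"


def implied_dvd_capacity_gb(total_bytes: float, dvd_count: int) -> float:
    """Return the per-DVD capacity (in decimal GB) implied by the two figures."""
    return total_bytes / dvd_count / 1e9


capacity_gb = implied_dvd_capacity_gb(BYTES_PER_MINUTE, DVDS_PER_MINUTE)
print(f"Implied capacity per DVD: {capacity_gb:.2f} GB")
```

The result is close to the 4.7 GB of a single-layer DVD, so the two numbers on the slide agree with each other.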
What makes big data important?
Big data is already affecting all areas of the economy.
Data-driven decision making leads to 5-6% efficiency
gains in the different sectors observed.
Intelligent processing of data is also essential for
addressing societal challenges.
IBM's Forecast: Six Big Data Trends for 2014
More analytical management rather than gut instinct
Companies will grow increasingly data driven and willing to apply
analytics-derived insights to key business operations.
Big data privacy and security issues
Organizations will make a greater effort to build security, privacy,
and governance policies into their big data processes.
Expanded investment in big data
The rise of the CDO (Chief Data Officer)
More organizations will bring a chief data officer (CDO) on board.
More useful big data applications
Growing interest in external data
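II. Big Data Use Cases
Google Flu Trends
Confirmed the feasibility of flu forecasting through analysis of flu-related search terms
Examines the frequency of search terms that users enter on Google's search site to provide
information on the distribution and spread of flu patients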
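San Francisco: Crime Prevention System
Police staffing based on analysis of where and when past crimes occurred
Predicts the likelihood of follow-on crimes by analyzing past crime patterns
Proposes ways to prevent incidents by analyzing offender behavior in historical data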
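US Internal Revenue Service: Tax Evasion Prevention System
Built a system to prevent tax evasion and fraud through big data analysis
Uses fraud prevention solutions, social network analysis, data integration, and data mining
Reduces missed tax revenue and unnecessary tax refunds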
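KT and Seoul Metropolitan Government: Big-Data-Based Policy Support for Late-Night Bus Routes
Floating-population analysis and route analysis for deciding late-night bus routes
Derives the optimal stop locations for each area from Seoul's traffic environment
(stops / bus-only lanes / transfers), then validates the routes by combining them with
late-night floating-population and destination statistics from KT's CDR (call detail record) data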
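BC Card: Store Evaluation Service
Big data analysis based on commercial property data and credit card transaction data,
aimed at raising the startup success rate of small business owners
Provides analysis of past conditions (store history, trade-area analysis, business-type
recommendation) and sales and profit forecasts for recommended or user-selected business types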
III. Limits of Big Data and How to Overcome Them
Information Overflow Problems
Problems
How to cover all available information? - Recall
How to find the relevant information? - Precision
Not data (search), but integration, analysis and insight, leading to decisions
and discovery
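Recall and precision, the two measures named above, can be made concrete with a small calculation. The document identifiers below are made up for the illustration, and the function name is hypothetical.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four relevant documents exist, the search returns three.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc1", "doc2", "doc9"}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three returned documents are relevant (precision 2/3), but only two of the four relevant documents were found (recall 1/2).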
Information Silo Problem
Stove-piped Systems and Poor Content
Aggregation
Semantic Interoperability
To cope with the problems mentioned in the
preceding slide, we need Semantic
Interoperability.
Semantics
“The meaning or the interpretation of a word,
sentence, or other language form.”
What is Semantic Interoperability?
“Processing or Integration of resources based on the
understanding what’s intended or expressed by
other systems or parties.”
What if I want to ...
Move my content from one place to another?
RSS ? Not enough
Aggregate my data
An open FriendFeed?
Re-use my Flickr friends on Twitter?
Invite. Again and again ...
The Semantic Web and Ontology can help !
By providing a common framework to interlink
data from various providers in an open way.
How is it Possible?
Ontology: Agreement with Common Vocabulary
& Domain Knowledge
Semantic Annotation: metadata (manual &
automatic metadata extraction)
Reasoning: semantics enabled search,
integration, analysis, mining, discovery
Three Technical Building Blocks
Basic Building Blocks
URIs for unambiguous names for resources,
RDF for common data model for expressing metadata,
Ontology(OWL) for common vocabularies.
Semantic Web becomes:
web of data/things/concepts
• What is a Thing/Concept? It can be anything in the world - a movie, a
person, a disease, a location…
• Machines will be able to understand the concept behind an HTML page.
• For example: this page is talking about ‘Barack Obama’; he is a ‘Person’ and he is
the ‘President of the USA’.
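The building blocks above (URIs as names, RDF triples as the data model, a shared vocabulary) can be sketched with plain Python tuples. This is only an illustration: the `http://example.org/` namespace, the property names `isAbout` and `holdsOffice`, and the helper functions are assumptions made for the example; only the `rdf:type` URI is the real one.

```python
# Hypothetical namespace for the example; rdf:type is the actual RDF URI.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Each statement is a (subject, predicate, object) triple of URIs.
triples = {
    (EX + "Barack_Obama", RDF_TYPE, EX + "Person"),
    (EX + "Barack_Obama", EX + "holdsOffice", EX + "President_of_the_USA"),
    (EX + "page42", EX + "isAbout", EX + "Barack_Obama"),
}


def match(s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))


def topics_of_type(page, cls):
    """Things the page is about that are instances of the given class."""
    return [o for _, _, o in match(s=page, p=EX + "isAbout")
            if match(s=o, p=RDF_TYPE, o=cls)]


people = topics_of_type(EX + "page42", EX + "Person")
print(people)
```

Because the page, the person, and the class are all named by URIs, a program can answer "which people is this page about?" without parsing the page text.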
Who borrows this Idea?
Facebook Open Graph Protocol and Graph Search
Knowledge Graph
Real-time Semantic Web with Twitter Annotations
IV. Building and Using Linked Data
Linked Data
Building a “Web of Data” to enhance the current
Web
The Linking Open Data (LOD) project:
http://linkeddata.org/
Translating existing datasets into RDF and linking them
together.
• For example, DBpedia (Wikipedia) and GeoNames, Freebase, BBC
programmes, etc.
Government data also available as Linked Data
• DATA.gov
• DATA.gov.uk
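To illustrate what "linking them together" buys, the sketch below merges the descriptions of one city from two separately published data sets through an `owl:sameAs` link. The data sets, URIs, and values are made up for the example, and `describe` is a hypothetical helper that follows only direct links.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Two independently published (made-up) data sets describing the same city.
dataset_a = [("http://example.org/a/Seoul",
              "http://example.org/a/population", "10000000")]
dataset_b = [("http://example.org/b/city/1",
              "http://example.org/b/country", "South Korea")]

# A single RDF link states that both URIs name the same thing.
links = [("http://example.org/a/Seoul", OWL_SAME_AS,
          "http://example.org/b/city/1")]


def describe(uri, triples):
    """Collect property/value pairs for uri and for URIs directly linked via owl:sameAs."""
    aliases = {uri}
    for s, p, o in triples:
        if p == OWL_SAME_AS and uri in (s, o):
            aliases.update((s, o))
    return {p: o for s, p, o in triples if s in aliases and p != OWL_SAME_AS}


merged = describe("http://example.org/a/Seoul", dataset_a + dataset_b + links)
print(merged)
```

Without the link, a query against data set A would never see the country recorded in data set B.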
Web of Data (Statistics)
The size of the Web of Data
The size of the Web of Data can be estimated based on
the data set statistics that are collected by the LOD
community in the ESW wiki.
According to these statistics, the Web of Data currently
consists of 31 billion RDF triples, which are
interlinked by around 500 million RDF inter-links
(09/19/2011).
Semantic Search Engines
Top 7 Semantic Search Engines as an Alternative to Google
Kngine
Hakia
Kosmix: now part of @WalmartLabs
DuckDuckGo
Evri: specialized for iPad and iPhone
Powerset: now part of Bing
Truevert: focuses only on environmental concerns.
V. LOD2 - The Future of Semantic Technology
LOD2 : What is LOD2?
LOD2 (Linked Open Data)
LOD2 is the large-scale integrating project co-funded by
the European Commission within the FP7 Information
and Communication Technologies Work Programme.
• Started in September 2010
Partners
• 14 partners (11 European countries)
LOD2 : Objectives of LOD2
LOD2 Project Objectives
Achieving visualization, deployment, sharing,
accessibility for linked open data by software
technology.
• Increase visibility of Linked Data activities [Visualization]
• Support deployment of Linked Data components [Deployment]
• Improve information sharing between Linked Data
components so that publishing Linked Data is eased. [Sharing]
• Improve access to the content: the online Linked Open Data
[Accessibility]
• Improve the software technology that supports it [By software
technology]
LOD2 Stack : Overview
LOD2 Stack
The LOD2 project provides the LOD2 Stack for easy access to
Linked Data software.
The LOD2 software stack is an integrated distribution of
aligned tools supporting the life cycle of Linked Data, from
extraction and authoring/creation through enrichment,
interlinking, and fusing to visualization and maintenance.
LOD2 Stack : The overview of tools
Apache Stanbol
In the LOD2 Stack, Apache Stanbol can be used for
NLP services which rely on the stack internal
knowledge bases, such as named entity recognition
and text classification.
CubeViz
CubeViz is a faceted browser for statistical data
utilizing the RDF Data Cube vocabulary, which is the
state of the art in representing statistical data in RDF.
DBpedia Spotlight
DBpedia Spotlight is a tool for automatically
annotating mentions of DBpedia resources in
text, providing a solution for linking unstructured
information sources to the Linked Open Data cloud
through DBpedia.
D2RQ
D2RQ is a system for accessing relational
databases (RDBMS) as virtual RDF graphs.
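The idea of a virtual RDF graph over a relational database can be sketched as a generator that turns rows into triples only when they are asked for. This is not D2RQ's mapping language or API; the base URI, the URI pattern, the table, and the function name are all assumptions made for the illustration.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base URI for generated resources


def virtual_triples(conn, table, key, columns):
    """Yield (subject, predicate, object) triples lazily from the rows of a table.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted configuration, not from user input.
    """
    cols = ", ".join([key] + columns)
    for row in conn.execute(f"SELECT {cols} FROM {table} ORDER BY {key}"):
        subject = f"{BASE}{table}/{row[0]}"
        for column, value in zip(columns, row[1:]):
            yield (subject, f"{BASE}{table}#{column}", str(value))


# Made-up relational data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany("INSERT INTO city VALUES (?, ?, ?)",
                 [(1, "Seoul", "KR"), (2, "Leipzig", "DE")])

triples = list(virtual_triples(conn, "city", "id", ["name", "country"]))
for triple in triples:
    print(triple)
```

The relational data stays where it is; the triples exist only as a view computed on demand, which is the sense of "virtual" here.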
DL-Learner
The DL-Learner software learns concepts in
Description Logics (DLs) from user-provided
examples (supervised learning).
ORE
The ORE (Ontology Repair and Enrichment) tool allows
knowledge engineers to improve an OWL
ontology by fixing inconsistencies and making
suggestions for adding further axioms to it.
PoolParty
The PoolParty Extractor (PPX) offers an API
providing text mining algorithms based on semantic
knowledge models.
SemMap
SemMap visualizes knowledge bases that have a spatial
dimension.
Silk
The Silk Link Discovery Framework supports data publishers
in setting RDF links from their data sources to other data
sources on the Web. Using the declarative Silk Link
Specification Language (Silk-LSL), developers can
specify which types of RDF links should be discovered
between data sources as well as which conditions data
items must fulfill in order to be interlinked.
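What such a link specification expresses (a link type plus a condition that data items must fulfill) can be sketched in a few lines. This is plain Python, not Silk-LSL; the data, the similarity measure from the standard library's `difflib`, and the 0.9 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Made-up labels from two data sources, keyed by resource URI.
source = {"http://example.org/a/1": "Yonsei University",
          "http://example.org/a/2": "Leipzig University"}
target = {"http://example.org/b/x": "yonsei university",
          "http://example.org/b/y": "University of Mannheim"}


def label_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source, target, threshold=0.9):
    """Emit an owl:sameAs link for every pair whose labels are similar enough."""
    links = []
    for s_uri, s_label in sorted(source.items()):
        for t_uri, t_label in sorted(target.items()):
            if label_similarity(s_label, t_label) >= threshold:
                links.append((s_uri, OWL_SAME_AS, t_uri))
    return links


links = discover_links(source, target)
print(links)
```

The link type (`owl:sameAs`) and the condition (label similarity at or above the threshold) are the two things a developer would declare; the framework does the pairwise search.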
Sieve
Sieve allows Web data to be filtered according to
different data quality assessment policies and
provides for fusing Web data according to different
conflict resolution methods.
LIMES
LIMES is a link discovery framework for the Web of
Data. It implements time-efficient approaches for
large-scale link discovery based on the
characteristics of metric spaces.
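One property of metric spaces that makes large-scale link discovery cheaper is the triangle inequality: the distance from each item to a fixed reference point gives a lower bound on the distance between two items, so many pairs can be ruled out without comparing them. The sketch below shows that idea only; it is not LIMES's actual algorithm or API, and the names, data, and the choice of edit distance as the metric are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance, which is a metric (it satisfies the triangle inequality)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def link_with_pruning(sources, targets, exemplar, threshold):
    """Return (links, number of full source/target comparisons performed)."""
    to_exemplar = {t: levenshtein(t, exemplar) for t in targets}
    links, comparisons = [], 0
    for s in sources:
        s_to_exemplar = levenshtein(s, exemplar)
        for t in targets:
            # Triangle inequality: d(s, t) >= |d(s, e) - d(t, e)|.
            if abs(s_to_exemplar - to_exemplar[t]) > threshold:
                continue  # the pair cannot be within the threshold; skip it
            comparisons += 1
            if levenshtein(s, t) <= threshold:
                links.append((s, t))
    return links, comparisons


sources = ["Seoul", "Leipzig"]
targets = ["Seul", "Leipzig", "Rio de Janeiro"]
links, comparisons = link_with_pruning(sources, targets,
                                       exemplar="Seoul", threshold=1)
print(links, comparisons)
```

In this toy run only two of the six possible pairs need a full comparison; the other four are discarded using distances to the reference string alone.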
Silk : Link Discovery Framework
Interlinking and Fusion Stage Component of
LOD2 Stack
Can be used by data providers to generate RDF links
between data sets on the web of data
• Especially, to set explicit RDF links between data items
within different data sources
“Data publishers can use Silk to set RDF links
from their data sources to other data sources
on the Web”
Silk : Silk Link Specification Language Example
Aggregation example:
Combines multiple confidence values into a single
value (average)
The confidence value is the average of the two compared weights
Numeric differences between parameters
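The average aggregation described above can be sketched as follows. The slide's Silk-LSL listing is not part of this transcript, so this is plain Python rather than Silk-LSL; the mapping from a numeric difference to a confidence, the parameter names, and the values are assumptions for the illustration.

```python
from statistics import mean


def numeric_similarity(a: float, b: float, max_difference: float) -> float:
    """Map the numeric difference between two values to a confidence in [0, 1]."""
    return max(0.0, 1.0 - abs(a - b) / max_difference)


def average_aggregation(confidences: list[float]) -> float:
    """Combine several confidence values into one by averaging them."""
    return mean(confidences)


# Made-up comparison of two records on two numeric properties.
population_conf = numeric_similarity(1000, 900, max_difference=1000)
area_conf = numeric_similarity(600, 300, max_difference=600)

confidence = average_aggregation([population_conf, area_conf])
print(f"{population_conf:.1f} {area_conf:.1f} -> {confidence:.1f}")
```

With an average, one weak comparison lowers the overall confidence without vetoing the link outright, which is the main design difference from taking the minimum.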
DL-Learner
Introduction
The goal of DL-Learner is to provide a DL/OWL based
machine learning tool to solve supervised learning
tasks.
The DL-Learner software learns concepts in
Description Logics (DLs) from examples.
DL-Learner : Features
Learning Problems
Positive and Negative Examples (=previous example)
Class Learning
• Find a class expression for a given class
• Example: father
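Learning a class expression for "father" from positive and negative examples can be sketched with a brute-force search over a handful of candidate expressions. This is only an illustration of the learning problem: it is not DL-Learner's search strategy or API, and the toy knowledge base and the candidate set are assumptions.

```python
# Toy knowledge base (names and facts are illustrative only).
classes = {"male": {"stefan", "markus", "martin", "heinz"},
           "female": {"anna", "michelle"}}
has_child = {"stefan": {"markus"}, "markus": {"anna"},
             "martin": {"heinz"}, "anna": {"heinz"}}

positives = {"stefan", "markus", "martin"}   # known fathers
negatives = {"heinz", "anna", "michelle"}    # known non-fathers

# Atomic candidate expressions, each paired with a membership test.
atoms = {
    "male": lambda x: x in classes["male"],
    "female": lambda x: x in classes["female"],
    "hasChild some Thing": lambda x: bool(has_child.get(x)),
}

# Add every conjunction of two different atoms as a further candidate.
candidates = dict(atoms)
names = sorted(atoms)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        candidates[f"{a} and {b}"] = (
            lambda x, a=a, b=b: atoms[a](x) and atoms[b](x))


def learn(positives, negatives):
    """Return the shortest expression covering all positives and no negatives."""
    fitting = [name for name, test in candidates.items()
               if all(test(p) for p in positives)
               and not any(test(n) for n in negatives)]
    return min(fitting, key=len) if fitting else None


learned = learn(positives, negatives)
print(learned)
```

Neither "male" nor "hasChild some Thing" alone separates the examples (heinz is male, anna has a child), so the search settles on their conjunction.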
SWCL - Sample Example
[Figure: class diagram. Country hasPart Province; Country and Province each have a PopulationValue of type positiveInteger; a "?" marks the open question in the diagram.]
Constraints Representation in SWCL
Target Constraint
Corresponding SWCL Code
<swcl:Constraint rdf:ID="numberOfPopulation">
  <!-- Qualifier: the constraint is stated for every y in class Country -->
  <swcl:qualifier>
    <swcl:Variable rdf:ID="y">
      <swcl:bindingClass rdf:resource="#Country"/>
    </swcl:Variable>
  </swcl:qualifier>
  <!-- Left-hand side: sum (Sigma) of populationValue over every x that is partOf y -->
  <swcl:hasLHS>
    <swcl:TermBlock rdf:ID="termBlock_1">
      <swcl:sign rdf:resource="&swcl;plus"/>
      <swcl:aggregateOperator rdf:resource="&swcl;Sigma"/>
      <swcl:parameter>
        <swcl:Variable rdf:ID="x">
          <rdfs:subClassOf>
            <owl:Restriction>
              <owl:onProperty rdf:resource="#partOf"/>
              <owl:hasValue rdf:resource="#y"/>
            </owl:Restriction>
          </rdfs:subClassOf>
        </swcl:Variable>
      </swcl:parameter>
      <swcl:factor>
        <swcl:FactorAtom>
          <swcl:bindingClass rdf:resource="#x"/>
          <swcl:bindingDatatypeProperty rdf:resource="#populationValue"/>
        </swcl:FactorAtom>
      </swcl:factor>
    </swcl:TermBlock>
  </swcl:hasLHS>
  <swcl:hasOperator rdf:resource="&swcl;equal"/>
  <!-- Right-hand side: populationValue of y itself -->
  <swcl:hasRHS>
    <swcl:TermBlock rdf:ID="termBlock_2">
      <swcl:sign rdf:resource="&swcl;plus"/>
      <swcl:factor>
        <swcl:FactorAtom>
          <swcl:bindingClass rdf:resource="#y"/>
          <swcl:bindingDatatypeProperty rdf:resource="#populationValue"/>
        </swcl:FactorAtom>
      </swcl:factor>
    </swcl:TermBlock>
  </swcl:hasRHS>
</swcl:Constraint>
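Read from its structure, the SWCL constraint above says that for every Country y, the sum of populationValue over all x that are partOf y equals the populationValue of y. A minimal Python check of that reading, on made-up data, might look like the following; it is a sketch of the constraint's meaning, not an SWCL processor, and all instance names and numbers are hypothetical.

```python
# Made-up toy data: which provinces are partOf which country, and populations.
part_of = {"provinceA": "countryX", "provinceB": "countryX",
           "provinceC": "countryY"}
population_value = {"countryX": 300, "provinceA": 100, "provinceB": 200,
                    "countryY": 500, "provinceC": 450}


def violations(countries):
    """Return (country, lhs, rhs) for each country where the constraint fails."""
    bad = []
    for y in countries:
        # LHS: Sigma of populationValue over every x with partOf(x) == y.
        lhs = sum(population_value[x]
                  for x, whole in part_of.items() if whole == y)
        # RHS: populationValue of y itself.
        rhs = population_value[y]
        if lhs != rhs:
            bad.append((y, lhs, rhs))
    return bad


result = violations(["countryX", "countryY"])
print(result)
```

countryX satisfies the constraint (100 + 200 = 300), while countryY is reported because its single province accounts for only 450 of 500.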
Our Direction to the Future
Directions
Open: share your data whenever and wherever you
want
Semantic: enhance your data to make more sense of it
An example: LinkedGeoData.org
We need an integrated framework to enhance
communication and information sharing in GeoData.