from web search to semantic search -...
TRANSCRIPT
[1]
X i a n g Z h a n g
x . z h a ng @seu . e d u . c n
h t t p : / / c s e . s e u . e d u . c n / Pe r s ona l Pa g e / x . z h a ng /
From Web Search to Semantic Search
[2]
Content 2
The history of Web and Search
Two faces of Semantic Web
Falcons
Watson
Wolfram Alpha
Knowledge Graph
[3]
As We May Think
"A memex is a device in which an
individual stores all his books,
records, and communications, and
which is mechanized so that it may
be consulted with exceeding speed
and flexibility. It is an enlarged
intimate supplement to his memory."
3
[4]
As We May Think
“Our wearable data collection system lets users collect their experiences into a
continually growing and adapting multimedia diary. The system—called
inSense—uses the patterns in sensor readings from a camera, microphone,
and accelerometers to classify the user's activities and automatically collect
multimedia clips when the user is in an "interesting" situation.”
4
[5]
As We May Think
Memex is internally complex in mechenics
dry photography, microphotography…
photocells, thermionic tubes, cathode ray tubes…
Memex is externally simple for human usage
associative indexing
tying two items together and name it
linking enormous items to form a trail and name it
5
[6]
Building Internet in the Cold War 6
[7]
Building Internet in the Cold War
The idea of package switching (early 1960s)
APARNet project launched (1960s) – US Gov
robust / fault-tolerant / distributed
APARNet with 4 nodes(Univ. Nodes) (1969)
The idea of OSI (Open System Interconnection) (1970s)
TCP/IP (1974-1982)
NSFNet (National Science Foundation) (1986)
APARNet closed (1990)
>1 million hosts in Internet (1992)
7
[8]
Building Internet in the Cold War
why Internet becomes a successful creation
divide-and-conquer
package switching
abstraction
OSI
Scalability / reliability > speed / efficiency
TCP / IP
8
[9]
Ted Nelson and His Hypertext
I don’t like Tim Berners-Lee and his
WWW. Too Simple!!!
“HTML is precisely what we were
trying to PREVENT— ever-breaking
links, links going outward only,
quotes you can't follow to their
origins, no version management, no
rights management.
9
[10]
Ted Nelson and His Hypertext
his first job is a photographer and film editor
“Hypertext” is corned by him in 1963
based on files
data was stored once
no deletion
information was accessible by a link from anywhere
10
[11]
Ted Nelson and His Hypertext
other achievement of Ted
Project Xanadu (1960)
found HTTP Company
produced the first bag for the first laptop
now producing 47% bags for laptop on the world
11
[12]
Doug Engelbart and His…(in 1968) 12
[13]
Doug Engelbart
other achievement of Doug (Tuning Award)
mouse (commercially implemented in LISA 1983, anybody
knows LISA??)
Windows
shared-screen teleconferencing
hypermedia
groupware
13
[14]
LISA是史蒂夫乔布斯女儿的
名字,也被乔布斯用于命名图形界面计算机和其上的操作系统。 没有Lisa就没有Macintosh,在Mac的开发
早期,很多系统软件都是在Lisa上设计的。她具有16位CPU,鼠标,硬盘,以及支
持图形用户界面和多任务的操作系统。并且随机捆绑了7个商用软件。LISA电脑售价$9,935。
14
[15]
Windows 1.0
(1985)
Windows 3.1 (1992)
15
[16]
Ted and Doug
they are both really avant-garde
every technology was in place in 1970s
but it took the PC revolution and widespread
internet to inspire the WWW
the process lasted for almost 20 years!
16
[17]
Apple II and 中华学习机
17
[18]
Sir Tim Berners-Lee and His Web Dream
18
[19]
In March 1989, Tim Berners-Lee
submitted a proposal for an
information management system to
his boss, Mike Sendall. ‘Vague, but
exciting’, were the words that
Sendall wrote on the proposal,
allowing Berners-Lee to continue.
http://info.cern.ch/Proposal.html
19
[20]
Tim Berners-Lee's original World Wide Web browser: 1990
20
[21]
Tim and His Web Dream
Why we use HTML for hypertext representation?
HTML vs. XML
21
[22]
Tim and His Web Dream
HTTP merits
quite simple (get or post)
flexible (content-type)
weak coupling (no connection / no state)
22
[23]
TED Talk: Tim Berners-Lee : The Magna Carta for the Web. (2014)
http://www.ted.com/talks/tim_berners_lee_a_magna_carta_for_the_web
23
[24]
CCTV
Documentary
互联网时代
The Internet Age (2014)
https://movie.douban.co
m/subject/25772505/
24
[25]
Semantic Web: A Glance
25 http://ws.nju.edu.cn/falcons
[26]
Architecture of Semantic Web 26
[27]
Why Semantic Web?
Web Intelligence
Machine-readable
Reasoning
Web of Data
Identification and Interlinks
Programmable Web
27
[28]
Ontology and RDF 28
[29]
Instantiation of Ontology 29
[30] 30
[31]
XML Representation of Ontology 31
[32]
The Logic Face of Semantic Web 32
[33]
Description Logic and Reasoning
33
[34]
The Data Face of Semantic Web
34
[35]
Applications of Semantic Web
Biomedicine
Transportation Engineering
Homeland Security
Software Design
Travel Planning
Job Finding
Online Dating…
35
[36]
TED Talk: Tim Berners-Lee : The Next Web. (2009)
http://www.ted.com/talks/tim_berners_lee_on_the_next_web
36
[37]
TED Talk: Tim Berners-Lee : The year open data went worldwide. (2010) http://www.ted.com/talks/tim_berners_lee_the_year_open_data_went_worldwide
37
[38]
Semantic Web in a Nutshell
Similar but different with
Relational Database
Object-oriented System
Shifted from logic to data
To keep it simple
To grow into large scale
38
[39]
Web Search
1993: Web Crawler
1994: Yahoo
1994: Lycos
1995: Altavista
1998: Google
39
[40]
[41]
Book: How Google Works
https://book.douban.com/subject/26008422/
[42]
Larry Page and Sergey Brin from Stanford
From “What Box”to 10100
Refused by InfoSeek
100,000 dollars from Sun
Yahoo, Excite.com, InfoSeek, Lycos
Marissa Mayer
Went to Yahoo in 2012
42
[43]
The two co-founders
of Google working in
the garage with a
monthly rent of $1700
[44]
Google Search Stats
search per second: 2.3M
Unique searchers per month: 1.17B
Number of pages indexed: 60T
Lines of codes for all Google services: 2B
Alphabet market capitalization: $570B
http://expandedramblings.com/index.php/by-the-numbers-a-gigantic-list-
of-google-stats-and-facts/
[45]
More about Google
Pagerank
Avg. 0.25 response time
45
Google-designed Server
Servers in Container
[46]
Understanding How Google Works 46
English: http://ppcblog.com/how-
google-works/
Chinese:
http://www.kuqin.com/searchengi
ne/20130320/334054.html
[47]
Google to Alphabet 47
[48]
23andme.com
By Sergey Brin`s wife, Anne Wojcicki
DNA analysis for $99 ($100M in 2000)
Angelina to avoid her breast cancer
An introduction: http://36kr.com/p/5035340.html
48
[49]
Web Search
Challenge of Web Search
Distributed Data
Volatile Data – changed or dead data
Large Volume
Unstructured and Redundant Data
Data Quality
Data Heterogeneous
49
[50]
Semantic Search
What is Semantic Search
Natural Language or Pseudo-natural Language search
Finding answers (data, facts, information, knowledge)
Exploiting domain knowledge to process search requests
50
[51]
Semantic Search
Different with Web Search
RDF Graphs vs. Bag of Words
Ontology, RDF document, Entity(object) vs. Pages
Ranking
Summarization / Snippet / Recommendation
Similar with Web Search
Keyword-based
51
[52]
Semantic Search 52
Tim
Jack
Mary
Southeast University
Nanjing
hasWife
worksAt
worksAt knows
located
Tim.foaf.xml
What can we get when searching “tim”
[53]
More examples: Who is Bob Marley? From Falcons
[54]
More examples: Who is Bob Marley? From Falcons
[55]
More examples: Who is Bob Marley? From Google and Baidu
[56]
Bob Marley:
No Woman, No Cry
Bob Marley:
Three Little Birds
[57]
More examples: Who is the tallest player in NBA ? From Falcons
[58]
More examples: Who is the tallest player in NBA ? From Google and Baidu
[59] 59
[60] 60
[61] 61
More examples: Who is [email protected]? He knows who? From Falcons
[62] 62
More examples: Who knows Chris Bizer? From Falcons
[63]
Semantic Search
2003: TAP (Guha, McCool, & Miller, 2003)
Earliest semantic web search engine
Keyword query
Return an object and its surrounding subgraph using
label information
Selection is based on popularity, user profile, and search
context
63
[64]
Semantic Search
2005: Swoogle (Ding et al., 2005)
One of the most popular semantic web search engine
Class/property search and ontology search
Pagerank-like ranking algorithm
Provides statistical metadata of results
64
[65] 65
[66] 66
Ontology Search of Swoogle
[67] 67
Document Search of Swoogle
[68]
Term Search of Swoogle
[69] 69
Metadata of foaf:knows in Swoogle
[70]
Semantic Search
2007: SWSE (Harth et al., 2007)
Keyword-based
Pagerank-liked ranking
Combining RDF graph ranks with data source ranks
Filtering results by specifying a class
70
[71]
Semantic Search
2007: Watson (d’Aquin et al., 2007)
Organize resulting objects by documents / ontologies
User can specify the searching scope
Local name
Labels
All literals
71
[72]
Searching Steve Jobs in Watson
[73]
Another Watson
An open-domain QA system
IBM DeepQA Project
MIT: Adaptive View-Based Appearance Model
UT: Reasoning and General Knowledge
USC: Information Extraction and Analysis
RPI: Virtualization Tools
University at Albany: Assurance of capacity for massive system
University of Trento: Self-study, Conversation
University of Massachusetts: Information Retrieval
Carnegie Mellon: QA basic algorithms
73
[74]
How Watson Works
[75]
Another Watson
Feature:
Unstructured text;
5 secs into more than 200 million pages;
Understanding human language, subtle meaning,
paddles, humor, satire…
Determing the certainty of the answer
No help from the WWW and engineers
Beat ken and Brad on Jeopardy in 2011;
75
[76]
Jeopardy 2011
[77]
Ken’s Talk in TEDTalk
[78]
Another Watson
Future
Business Decision-Making
Medical Treatment
78
[79]
Semantic Search
2008: Sindice (Oren et al., 2008)
Property-value pair look-up knowing a property of an
object
Keyword-based RDF document and Microformat search
Not focused on linked objects
79
[80]
Searching documents in Sindice
[81]
Searching documents
using property-value pair
in Sindice
[82]
82
SPARQL Query in Sindice
[83]
Semantic Search
2009: Falcons (Cheng et al., 2009 )
http://iws.seu.edu.cn/services/falcons/
Searching entities / linked objects / RDF documents
Keyword-based
Searching based on Virtual Document
Ranking results based on relevance and popularity
Class-based query refinement
83
[84]
Falcons – Scenario 1 84
“Who is the demo chair of ISWC 2008?”
query with “ISWC2008” AND “demo chair”
[85]
Falcons – Scenario 2 85
“Find relations between
Chris Bizer and Tom Heath”
query with “Chris Bizer”
AND “Tom Heath””
[86]
Falcons – Scenario 2
86
After type
refinement
[87]
Falcons – System Architecture
87
[88]
Falcons- Data Crawling
Seeding
Searching filetype:rdf / filetype:owl in Google with keywords
from Open Directory Project
Sampling from pingthesemanticweb.com / schemaweb.info / …
Sampling from Linked Open Data Project
Crawling
300,000 RDF documents per day
20M RDF documents total (55% well-formed) on August 2008
600M quadruples
88
[89]
Falcons – Data Crawling 89
The distribution of
the number of RDF
documents on pay-
level domains
[90]
Falcons – Constructing Virtual Documents
Virtual Documents of an object
local name
rdfs:label
rdfs:comment
other annotations
Virtual documents of neighbors
90
[91]
Falcons – Virtual Documents 91
VD(eg1:Journal) = ”journal” + ”article” + ”magazine”
[92]
Falcons – Ranking Objects
Query Relevance
Cosine similarity between query and virtual document
Popularity
The occurrence of the object in the document set
92
[93]
Falcons – Other Features
Query-relevant Snippet Generation
PD-Thread
Query Refinement with Class Hierarchies
Filtering the resulting objects with class Restrictions
Discovering implicit typing information by class-inclusion
reasoning
Recommending subclasses for incremental refinement
93
[94]
Wolfram Alpha
www.wolframalpha.com
Computational Knowledge Engine
2009.5.18
Wolfram Alpha: Semantic Search Is
Born – InfoToday
The Knowledge engine behind Siri
94
[95]
Wolfram Alpha
Chat with Siri:
Somebody:我的孩子总喜欢缠着我问我离圣诞节还有几天。
Siri: 让我查查……稍等……我为你找到了答案。
Siri:82 天,也就是 2 个月又 21 天,也就是 11 周零 5 天,
也就是 58 个工作日,也就是 0.22 年。
How Siri knows all these?
95
[96]
[97]
[98]
[99]
[10
0]
[10
1]
[10
2]
[10
3]
[10
4]
[10
5]
[10
6]
106
[10
7]
[10
8]
108
[10
9]
[110]
Wolfram Alpha
WA is computation + knowledge
Holding a vast of knowledge-bases
Using computational linguistics
Steven Wolfram was a quantum physicist!
First version of WA
15,000,000 lines of Mathmatica codes
10,000 CPU
The Technology behind Wolfram|Alpha
110
[111
]
Stephen Wolfram:计算万物的理论
[112]
Knowledge Graph 知识图谱
“The Knowledge Graph is a knowledge
base used by Google to enhance its
search engine's search results with
semantic-search information gathered
from a wide variety of sources.”
– From Wikipedia
112
[11
3]
113
[11
4]
[11
5]
Introducing to Knowledge Graph
[116]
Satori
Adopted for Bing.com by Microsoft
From 2013.12.12 TED讲座:当用户搜索某一个人名,如果这个人曾在TED做过演讲,那么用户将可以直接在搜索结果的
Snapshot pane中播放TED视频。
著名演讲及国歌:不仅仅是TED,其他一些著名人物的演讲也将会直接呈现在搜索结果中,另外也包括
国歌这样的内容,用户同意可以直接在Snapshot pane点击播放。
在线课程:搜索一所大学名称,Bing会在搜索结果中直接展现该学校的热门的在线课程。
科学知识:当用户搜索某个科学名词时,Bing将会突出展示来自维基百科的该词条基本内容。
历史事件:Bing会直接提供历史事件的概要预览,包括简单描述、开始时间、结束时间等。
相关人员:相关人员功能将会在用户搜索某一事件或事物时提供,并会给出此人与用户搜索词语相关的
原因。
动物种类:当用户搜索某个动物时,Bing将会提供更详细的相关品种让用户能够进一步进行查询。比如
搜“老虎”,Bing将会在Snapshot pane展现孟加拉虎、东北虎等种类,用户可以进一步进行搜索
116
[117]
Satori 117
[118]
Graph DB
The fundamental component of most KG
Graph Storage
CRUD
Create a graph
Read a graph
Update a graph
Delete a graph
118
[119]
Graph DB
Recommendation:
Neo4j
Virtuoso
119
[12
0]
Graph DB
Ranking
http://db-
engines.c
om/en/ran
king/grap
h+dbms
[12
1]
RDF Store
Ranking
http://db-
engines.co
m/en/ranki
ng/rdf+stor
e
[122]
Wikipedia 122
[123]
DBpedia
Extracted from Wikipedia
RDF / SPARQL
125 languages, totally >3 billion triples, all freely downloadable
For English version, 4.58M entities:
1,445,000 Persons
735,000 Places
123,000 Music Albums
87,000 Movies
241,000 Species
6,000 Diseases
123
[124]
YAGO
Wikipedia+WordNet+GeoNames
10M entities + 120M facts
Features
Manually evaluated the accuracy of
95%, every relation is annotated with its
confidence value
Combining the taxonomy of wiki and
WordNet
With temporal and spacial dimension
124
[125]
Freebase
Knowledge Graph behind Google
A Structured Wikipedia
Crowdsourcing
125
[126]
WordNet and BabelNet
126
[127]
WordNet and BabelNet 127
BabelNet = Wikipedia + WordNet
[12
8]
[129]
Thanks http://cse.seu.edu.cn/PersonalPage/x.zhang/teaching.html
129