![Page 1: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/1.jpg)
Probase: Understanding Data on the Web
Haixun Wang, Microsoft Research Asia
![Page 2: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/2.jpg)
What’s our Goal?
injecting common sense into computing
![Page 3: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/3.jpg)
28 Oct 1955, Bill Gates, American
![Page 4: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/4.jpg)
… animals other than cats such as dogs …
[Diagram: two candidate isA edges from “dogs”, one to “animals” and one to “cats”; the edge to “animals” is marked “Correct!”]
![Page 5: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/5.jpg)
… household pets other than animals such as reptiles, aquarium fish …
[Diagram: two candidate isA edges from “reptiles”, one to “household pets” and one to “animals”; one edge is marked “Correct!”]
![Page 6: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/6.jpg)
Progress on Two Fronts
• System – accumulating and serving knowledge
• Applications – making smart use of knowledge
![Page 7: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/7.jpg)
Trinity: Distributed Graph DB with Full Transaction Support
Graph DB API (interface)
Graph = (Nodes, HyperEdges)
Node Set
HyperEdge Set
Abstract Storage Layer
Memory Pool 1 … Memory Pool i … Memory Pool n
Memory Cloud
![Page 8: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/8.jpg)
Trinity: Memory Cloud/Cell
![Page 9: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/9.jpg)
Knowledge Base
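Concepts: artist > painter; instance: Picasso
Picasso: Movement = Cubism, Born = 1881, Died = 1973, …
Concepts: art > painting; instance: Guernica
Guernica: …, Year = 1937, Type = Oil on Canvas
Relationship: Guernica “created by” Picasso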
![Page 10: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/10.jpg)
• Probase: 2.7 M concepts, automatically harnessed
• Freebase: 2 K concepts, built by community effort
• Cyc: 120 K concepts, 25 years of human labor
Probase has a logic foundation that supports evidential reasoning.
![Page 11: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/11.jpg)
Nodes: 2.7 million concepts (size distribution)
• 2.7 million concepts, e.g., countries, basic watercolor techniques, celebrity wedding dress designers
![Page 12: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/12.jpg)
Nodes: 2.7 million concepts (frequency distribution)
![Page 13: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/13.jpg)
Concepts are the glue that holds our mental world together.
Gregory L. Murphy, NYU
![Page 14: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/14.jpg)
Edges: relationships
• isA (backbone of the taxonomy)
• similarity (derived relationship)
• part-whole (to be incorporated)
![Page 15: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/15.jpg)
Classes/Instances in Search
Concepts 0.02% only? Two reasons:
• Concept modifiers are often interpreted as instances, e.g., San Diego biotech companies.
• Search engines do not handle concepts very well, and users stopped trying.
![Page 26: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/26.jpg)
Are good results in our top 10 returned by Bing or Google? (up to their top 1000)
[Figure: four panels (C+K in Bing, C+K+C in Bing, C+K+C in Google, C+K in Google); x-axis ticks: 10, 20, 50, 100, 500, 1000; y-axis: 0% to 100%]
![Page 27: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/27.jpg)
Probase vs. Freebase
Freebase:
• Knowledge is black and white.
• Clean up everything.
• Dirty data is unusable.
Probase:
• Correctness is a probability.
• Live with dirty data.
• Dirty data is very useful.
![Page 28: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/28.jpg)
How to handle noisy data?
Score the data!
![Page 29: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/29.jpg)
Score the data
• Consensus: e.g., is there a company called Apple?
• Popularity: e.g., is Apple a top-3 company, or a top-5, or a top-10 company?
• Ambiguity: e.g., does the word Apple, sans any context, represent Apple the company?
• Similarity: e.g., how likely is an actor also a celebrity?
• Freshness: e.g., Pluto as a dwarf planet is a fresher claim than Pluto as a planet.
![Page 30: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/30.jpg)
Quality
![Page 31: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/31.jpg)
Compare with Probase
![Page 32: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/32.jpg)
Consensus / Popularity
Is there a company called Apple?
is the same type of question as
Is Apple a top-3 company, or a top-5, top-10 company?
![Page 33: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/33.jpg)
Consensus/Popularity
• Noisy-or:
• Voting model:
– each piece of evidence votes to support a claim with some probability
– the probability that the claim is true = the probability that it receives more than 50% of the votes
• Urns model:
– How many times is Paris drawn from the “City” urn?
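The first two models can be sketched in a few lines. This is a minimal illustration, not the formulation used in Probase: the slide does not show the noisy-or formula, so the standard definition (the claim holds unless every independent piece of evidence fails) is assumed, and the voting model is assumed to use independent votes with per-evidence probabilities. The function names are hypothetical.

```python
from math import prod


def noisy_or(probs):
    """Standard noisy-or: the claim is true unless every piece of evidence fails.

    probs[i] is the assumed probability that evidence i alone is correct.
    """
    return 1.0 - prod(1.0 - p for p in probs)


def majority_vote(probs):
    """Probability that strictly more than 50% of independent votes support the claim.

    probs[i] is the assumed probability that evidence i votes in favor.
    """
    dist = [1.0]  # dist[k] = probability of exactly k supporting votes so far
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # this evidence votes against
            nxt[k + 1] += q * p       # this evidence votes in favor
        dist = nxt
    n = len(probs)
    return sum(q for k, q in enumerate(dist) if 2 * k > n)
```

Noisy-or can only increase as evidence accumulates, while the voting score can drop when weak evidence is added, which is one reason the two are worth comparing.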
![Page 34: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/34.jpg)
Negative Evidence
• E.g., two claims:
– China is a company: 100 pieces of evidence
– MyCrazyStartup is a company: 10 pieces of evidence
• Negative evidence:
– treat each occurrence of China as negative evidence unless it is about “China is a company”
– treat the fact that Company and Countries have low similarity (overlap) as negative evidence
![Page 35: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/35.jpg)
Ambiguous Identity
• Apple is a company
• Apple is a fruit
• Tiger is a vertebrate
• Tiger is a mammal
There are two apples but just one tiger. How do we know?
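One plausible signal, offered here only as an assumption since the slide leaves the question open, is the overlap between the instance sets of the concepts involved: company and fruit share almost nothing, while mammal largely sits inside vertebrate. The sketch below groups concepts by overlap and counts the groups as senses. The instance sets, the threshold, and the function names are all hypothetical.

```python
def overlap(a, b):
    """Fraction of the smaller instance set that also appears in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def sense_groups(concepts, instances_of, threshold=0.5):
    """Greedily group concepts whose instance sets overlap; each group is one sense."""
    groups = []
    for c in concepts:
        hits = [g for g in groups
                if any(overlap(instances_of[c], instances_of[o]) >= threshold for o in g)]
        merged = {c}
        for g in hits:
            merged |= g
            groups.remove(g)
        groups.append(merged)
    return groups


# Toy instance sets, invented for illustration only.
instances_of = {
    "company": {"apple", "microsoft", "ibm"},
    "fruit": {"apple", "banana", "pear"},
    "vertebrate": {"tiger", "eagle", "salmon", "lion"},
    "mammal": {"tiger", "lion", "whale"},
}
apple_senses = sense_groups(["company", "fruit"], instances_of)
tiger_senses = sense_groups(["vertebrate", "mammal"], instances_of)
```

With these toy sets, the concepts of Apple fall into two groups and the concepts of Tiger into one.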
![Page 36: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/36.jpg)
Important Instances
![Page 37: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/37.jpg)
What are the tasks?
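Concepts: artist > painter; instance: Picasso
Picasso: Movement = Cubism, Born = 1881, Died = 1973, …
Concepts: art > painting; instance: Guernica
Guernica: …, Year = 1937, Type = Oil on Canvas
Relationship: Guernica “created by” Picasso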
![Page 38: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/38.jpg)
Data Sources for Taxonomy Construction
• Hearst’s patterns in HF data (1.68B docs)
• HTML tables in Wikipedia
• HTML tables in HF data
• Freebase data
• Many more can be added in the future
![Page 39: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/39.jpg)
Hearst’s Patterns
• Patterns for single statements
NP such as {NP, NP, ..., (and|or)} NP
such NP as {NP,}* {(or|and)} NP
NP {, NP}* {,} or other NP
NP {, NP}* {,} and other NP
NP {,} including {NP ,}* {or | and} NP
NP {,} especially {NP,}* {or|and} NP
![Page 40: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/40.jpg)
Examples
Easy: “rich countries such as USA and Japan …”
Tough: “animals other than cats such as dogs …”
Almost hopeless: “At Berklee, I was playing with cats such as Jeff Berlin, Mike Stern, Bill Frisell, and Neil Stubenhaus.”
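To see why the easy case is easy and the tough case is tough, here is a deliberately naive matcher for the “NP such as NP, … and NP” pattern. It is a sketch under strong assumptions: there is no noun-phrase chunker, a concept is taken to be the one or two words before “such as”, and the names `SUCH_AS` and `extract_isa` are hypothetical.

```python
import re

# Concept = one or two words immediately before "such as" (a crude stand-in for an NP).
SUCH_AS = re.compile(r"(?P<concept>\w+(?: \w+)?) such as (?P<items>[^.;]+)")


def extract_isa(sentence):
    """Return (instance, concept) pairs found by the naive "such as" pattern."""
    m = SUCH_AS.search(sentence)
    if not m:
        return []
    concept = m.group("concept")
    # Split the instance list on commas and on "and"/"or".
    items = re.split(r",\s*(?:and |or )?|\s+(?:and|or)\s+", m.group("items"))
    return [(item.strip(), concept) for item in items if item.strip()]


easy = extract_isa("rich countries such as USA and Japan")
tough = extract_isa("animals other than cats such as dogs")
```

On the easy sentence this yields USA and Japan under “rich countries”. On the tough one it attaches dogs to the nearest preceding words (“than cats”) instead of “animals”, which is exactly the failure that syntax alone cannot resolve.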
![Page 41: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/41.jpg)
Taxonomy Construction
• Each evidence is an edge
• Put edges together into a graph
• Problem: if two edges have end nodes with the same label, should we merge them?
![Page 42: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/42.jpg)
Example
• Example:
– plants such as trees and grass
– plants such as steam turbines, pumps, and boilers
• Fortunately it’s extremely rare to see
– “plants such as trees and steam turbines”
• “such as” naturally groups instances by their senses
![Page 43: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/43.jpg)
Hierarchy Construction
• Merging overlapping groups
– “C such as X1, X2, …” and “C such as Y1, Y2, …”
– “X1, X2, …” and “Y1, Y2, …” have certain overlap
– then merge “X1, X2, …” and “Y1, Y2, …” under C
• Missing links
– the group with the largest instance frequency usually represents the dominant sense of the class label
– the merging may not be complete (e.g., a group Turing, Church under mathematicians somehow does not merge with the larger group containing instances like Leibniz and Hilbert)
– use supervised learning for further merging
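The overlap-based merging step can be sketched as follows. This is an illustration under stated assumptions, not the method used to build Probase: “certain overlap” is taken to mean sharing at least `min_overlap` instances, and the function name and threshold are hypothetical.

```python
def merge_groups(groups, min_overlap=1):
    """Merge instance groups of one class label that share >= min_overlap instances.

    Each input group is the instance list from one "C such as X1, X2, ..." sentence.
    Groups that never overlap stay separate and are read as distinct senses of C.
    """
    merged = []
    for group in groups:
        group = set(group)
        untouched = []
        for existing in merged:
            if len(existing & group) >= min_overlap:
                group |= existing          # absorb the overlapping group
            else:
                untouched.append(existing)
        merged = untouched + [group]
    return merged


plant_groups = [
    ["trees", "grass"],
    ["steam turbines", "pumps", "boilers"],
    ["trees", "flowers"],
    ["pumps", "generators"],
]
senses = merge_groups(plant_groups)
```

Here the four sentences collapse into two senses of “plants”, one botanical and one industrial. A group with no shared instance, such as Turing and Church in the example above, would stay unmerged, which is the gap the supervised step is meant to close.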
![Page 44: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/44.jpg)
Attributes
• Given a class, find its attributes
• Candidate seed attributes:
– “What is the [attribute] of [instance]?”
– “Where”, “When”, “Who” are also considered
Picasso: Movement = Cubism, Born = 1881, Died = 1973, …
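A minimal sketch of harvesting candidate seed attributes with the “What is the [attribute] of [instance]?” template, assuming a list of question strings and a known set of instances for the class. The input data and the names `PATTERN` and `candidate_attributes` are hypothetical, and the “Where”, “When”, and “Who” variants are not covered.

```python
import re
from collections import Counter

# Matches "What is the <attribute> of <instance>?" (trailing "?" optional).
PATTERN = re.compile(r"^what is the (?P<attr>.+?) of (?P<inst>.+?)\??$", re.IGNORECASE)


def candidate_attributes(questions, instances):
    """Count attributes asked about any known instance of the class."""
    known = {i.lower() for i in instances}
    counts = Counter()
    for q in questions:
        m = PATTERN.match(q.strip())
        if m and m.group("inst").lower() in known:
            counts[m.group("attr").lower()] += 1
    return counts


questions = [
    "What is the birthplace of Picasso?",
    "what is the movement of Picasso",
    "What is the capital of France?",
    "What is the birthplace of Monet?",
]
attrs = candidate_attributes(questions, ["Picasso", "Monet"])
```

Attributes asked about many instances of the same class (here, birthplace for two painters) are stronger candidates than ones seen for a single instance.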
![Page 45: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/45.jpg)
Reasoning
After building a coherent set of beliefs, reasoning can then follow.
Rules are uncertain/probabilistic as well.
![Page 46: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/46.jpg)
Expanding Concepts
• Low order concepts (noun phrases): cities, tech companies, basic watercolor techniques
• High order concepts (noun phrases + verb + prepositional phrases): learn swimming, buy books on Amazon
![Page 47: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/47.jpg)
Expanding Relationships
• Relationships among concepts (noun phrases)
– locatedIn, friendOf, createdBy, etc.
– relationship between apple and Newton
• Relationships among high order concepts
– causal relationships
– tasks and subtasks
![Page 48: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/48.jpg)
Find questions for answers
• For each claim, find all possible questions that the claim can be used to answer.
• <China, population, 1.3 billion>
– Q: How many people are there in China?
• For a set of claims of the same class, find possible aggregate questions.
• <China, population, 1.3 billion>, <India, population, 1 billion>, …
– Q: What’s the most populous nation?
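As a rough sketch of the idea, a claim can be treated as an (entity, attribute, value) triple: a single claim supports a lookup question, and a set of claims sharing an attribute supports an aggregate question such as “which entity has the largest value?”. The question template and function names below are hypothetical, and values are assumed to be numeric.

```python
def single_question(claim):
    """Generate one templated question that a single claim can answer."""
    entity, attribute, _value = claim
    return f"What is the {attribute} of {entity}?"


def largest(claims):
    """Answer the aggregate question 'which entity has the largest value?'.

    All claims are assumed to share the same attribute and carry numeric values.
    """
    entity, _attribute, _value = max(claims, key=lambda c: c[2])
    return entity


claims = [
    ("China", "population", 1.3e9),
    ("India", "population", 1.0e9),
]
question = single_question(claims[0])
most_populous = largest(claims)
```

Using the two claims from the slide, the aggregate answer is China. Mapping a templated question to natural phrasings such as “How many people are there in China?” is the hard part and is not attempted here.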
![Page 49: Probase : Understanding Data on the Web](https://reader033.vdocuments.site/reader033/viewer/2022051317/568168f2550346895ddff8cf/html5/thumbnails/49.jpg)
Thanks!