Using Crowdsourcing for Data Analytics (Drexel CCI, cci.drexel.edu/bigdata/bigdata2013/using...)
TRANSCRIPT
Using Crowdsourcing for Data Analytics
Hector Garcia-Molina (work with Steven Whang, Peter Lofgren, Aditya Parameswaran and others)
Stanford University
• Big Data Analytics
• Crowdsourcing

Crowdsourcing
Real World Examples
• Image Matching
• Translation
• Categorizing Images
• Search Relevance
• Data Gathering
Many Crowdsourcing Marketplaces!
Many Research Projects!
[Diagram: data → analytics → results, with humans in the loop]
Example tasks:
• get missing data
• verify results
• analyze data
Key Point: use humans judiciously
Today I will illustrate with
• Entity Resolution
• (may cover another topic briefly)
Traditional Entity Resolution
[Diagram: records from System 1 … System n pass through cleansing into analysis; the core question is "what matches what?"]
Why is ER Challenging?
• Huge data sets
• No unique identifiers
• Missing data
• Lots of uncertainty
• Many ways to skin a cat
Simple ER Example
[Diagram: four records a, b, c, d; one pair has sim = 0.9 and another has sim = 0.8]
ER: Exact vs Approximate
[Diagram: products are split into cameras, CDs, books, …; ER is run on each subset to produce resolved cameras, resolved CDs, resolved books, …]
Simple ER Algorithm
• Compute pairwise similarities
• Apply threshold
• Perform transitive closure
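A hedged sketch of these three steps (function names and the union-find choice are mine, not prescribed by the talk):

```python
from itertools import combinations

def simple_er(records, sim, threshold):
    """Cluster records: pairwise similarities, threshold, transitive closure."""
    parent = {r: r for r in records}  # union-find forest, one node per record

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    # Steps 1 and 2: compute each pairwise similarity and apply the threshold.
    # Step 3: union matching pairs, which realizes the transitive closure.
    for a, b in combinations(records, 2):
        if sim(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())
```

With threshold 0.7 and similarities sim(a,b)=0.9, sim(c,d)=0.8 (all others low), this yields the two clusters {a, b} and {c, d}.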
[Diagram: all pairwise similarities computed, ranging from 0.4 to 0.95]
[Diagram: threshold = 0.7 applied; only edges with similarity ≥ 0.7 remain]
[Diagram: transitive closure merges the connected records into clusters]
Crowd ER
Same as this?
• First Cut: For every pair of records, ask workers if they match (i.e., get similarity)
• Too expensive!
[Diagram: the full similarity graph; every pairwise edge would need a crowd question]
• Second Cut: Compute similarities; workers verify "critical" pairs
[Diagram: similarity graph with some borderline edges marked "critical??"]
[Diagram: records → pairwise analysis → global analysis → clusters; the system generates questions for the crowd, and crowd answers flow back as new evidence]
Key Point: use humans judiciously
Key Issue: Semantics of Crowd Answer
[Diagram: records A, B, C, D, E partly clustered; what exactly does a crowd answer about one pair tell us?]
Also issue: Similarities as Probabilities
sim(a,b) → prob(a,b)
Strategy
[Diagram: current state: records a, b, c with pairwise similarities 0.9, 0.5, 0.2; any given ER algorithm maps the current state to an ER result]
[Diagram: from the current state, consider ALL possible questions (three in this example): Q(a,b), Q(b,c), Q(a,c)]
[Diagram: each question has a Yes and a No outcome; every outcome leads to a new state, and each new state to an ER result]
[Diagram: example: if Q(b,c) is answered Yes, that pair's similarity becomes 1.0 in the new state and the ER result is recomputed]
[Diagram: each outcome's ER result is given a score; the scores are combined to rank the candidate questions]
Two Remaining Issues
• How do we score an ER result?
• Efficiency?
[Diagram: an ER result is compared against a gold standard to compute an F score]
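One common pair-based way to compute such an F score for clusterings (my formulation; the talk does not pin down which variant it uses):

```python
from itertools import combinations

def pairs(clustering):
    """All same-cluster record pairs implied by a clustering."""
    return {frozenset(p) for c in clustering for p in combinations(sorted(c), 2)}

def f_score(result, gold):
    """Pairwise F1 of an ER result against a gold-standard clustering."""
    r, g = pairs(result), pairs(gold)
    if not r or not g:
        return 1.0 if r == g else 0.0
    precision = len(r & g) / len(r)   # fraction of result pairs that are correct
    recall = len(r & g) / len(g)      # fraction of gold pairs that were found
    return 2 * precision * recall / (precision + recall)
```

For example, the result [{a,b}, {c}] scored against the gold clustering [{a,b,c}] gives precision 1, recall 1/3, and F1 = 0.5.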
19
Gold Standard?
[Diagram: pairwise similarities (0.9, 0.5, 0.2) for a, b, c are converted to probabilities (1.0, 0.6, 0.2) ("sim to prob")]
[Diagram: the probabilities induce possible worlds over a, b, c, here with probabilities 0.12, 0.48, 0.08, 0.32]
[Diagram: running the ER algorithm in each possible world yields possible clusterings, here with probabilities 0.68 and 0.32; these probabilistic clusterings play the role of the gold standard]
Strategy
[Diagram: as before, each question's Yes/No outcomes lead to new states and ER results; each ER result is now scored against the gold standard]
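Putting the strategy together, a brute-force sketch of one question-selection step (the `er` and `score` parameters are placeholders for any ER algorithm and any gold-standard score; doing this efficiently is the subject of the paper referenced below):

```python
def best_question(prob, er, score):
    """Pick the pair whose answer maximizes the expected score of the ER result.

    prob        -- current match probability per pair of records
    er(probs)   -- ER result computed from a probability assignment
    score(r, p) -- quality of ER result r under probabilities p
    """
    best_pair, best_expected = None, float('-inf')
    for q in prob:
        expected = 0.0
        # A Yes answer pins the pair's probability to 1.0, a No answer to 0.0
        for answer, p_answer in ((1.0, prob[q]), (0.0, 1.0 - prob[q])):
            new_prob = dict(prob)
            new_prob[q] = answer
            expected += p_answer * score(er(new_prob), new_prob)
        if expected > best_expected:
            best_pair, best_expected = q, expected
    return best_pair
```

Each candidate question is evaluated by weighting the score of the Yes and No outcomes by their probabilities, mirroring the tree of new states and ER results on the slides.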
Evaluating Efficiently
• See: Steven E. Whang, Peter Lofgren, and H. Garcia-Molina. Question Selection for Crowd Entity Resolution. To appear in Proc. 39th Int'l Conf. on Very Large Data Bases (PVLDB), Trento, Italy, 2013.
Sample Result
Summary
[Diagram: data → analytics → results, with humans in the loop]
Example tasks: get missing data, verify results, analyze data
Key Point: use humans judiciously
Now for something completely different!
[Diagram: big data → DBMS → analytics, with humans added as a data source]
Deco: Declarative Crowdsourcing
[Diagram: an end user asks the DBMS "what is the best price for Nikon DSLR cameras?"]
[Diagram: the DBMS consults a stored table:]

model   type   brand
D7100   DSLR   Nikon
7D      DSLR   Canon
P5000   comp   Nikon
…
[Diagram: for data not in the table, the DBMS generates crowd questions, e.g. "what is the best price for a Nikon D7100 camera?"]
Example with a bit more detail:

User view (the outer join ⋈o of the tables below):

restaurant     rating   cuisine
Chez Panisse   4.9      French
Chez Panisse   4.9      California
Bytes          3.8      California
…

Anchor:

restaurant
Chez Panisse
Bytes
…

Dependent:

restaurant     rating
Chez Panisse   4.8
Chez Panisse   5.0
Chez Panisse   4.9
Bytes          3.6
Bytes          4.0
…

Dependent:

restaurant     cuisine
Chez Panisse   French
Chez Panisse   California
Bytes          California
Bytes          California
…
[Diagram: fetch rules pull new values from the crowd: anchor values (e.g. "Chez Panisse", "Bytes") and dependent values (e.g. cuisine "French")]
[Diagram: resolution rules reconcile the multiple raw crowd answers for each restaurant, e.g. combining the ratings for "Chez Panisse" and "Bytes"]
[Diagram: overall query semantics: 1. Fetch, 2. Resolve, 3. Join; the outer join of the anchor and resolved dependents yields the user view]
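A toy sketch of the Fetch → Resolve → Join semantics on the restaurant example (the averaging and dedup rules are my stand-ins for whatever resolution rules a Deco schema would actually declare):

```python
from statistics import mean

# Raw crowd answers (values taken from the slides)
ratings = [("Chez Panisse", 4.8), ("Chez Panisse", 5.0), ("Chez Panisse", 4.9),
           ("Bytes", 3.6), ("Bytes", 4.0)]
cuisines = [("Chez Panisse", "French"), ("Chez Panisse", "California"),
            ("Bytes", "California"), ("Bytes", "California")]
anchors = ["Chez Panisse", "Bytes"]  # 1. Fetch: anchor values

# 2. Resolve: reconcile duplicate crowd answers per anchor value
def resolve(dep, rule):
    grouped = {}
    for key, val in dep:
        grouped.setdefault(key, []).append(val)
    return {k: rule(vs) for k, vs in grouped.items()}

rating = resolve(ratings, lambda vs: round(mean(vs), 1))  # averaging rule
cuisine = resolve(cuisines, lambda vs: sorted(set(vs)))   # dedup rule

# 3. Join: combine the anchor with the resolved dependents into the user view
view = [(r, rating.get(r), c) for r in anchors for c in cuisine.get(r, [])]
```

This reproduces the user-view rows from the slides: Chez Panisse resolves to rating 4.9, Bytes to 3.8, and the duplicate "California" answer is collapsed.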
Many Query Processing Challenges
[Diagram: a Deco query plan for
    SELECT n, l, c FROM country WHERE l = 'Spanish' ATLEAST 8
with operators AtLeast[8], Join, Filter[l='Spanish'], Resolve, Fetch, and Scan over tables A(n), D1(n,l), D2(n,c)]
Deco Prototype V1.0
Conclusion
• Crowdsourcing is important for managing data!
• Still many challenges ahead!