mining and querying multimedia data fan guo sep 19, 2011 committee members: christos faloutsos,...
TRANSCRIPT
Mining and Querying Multimedia DataFan GuoSep 19, 2011Committee Members:Christos Faloutsos, ChairEric P. XingWilliam W. CohenAmbuj K. Singh, University of California at Santa Babara
What this talk is about?
2
What this talk is about?
3
Going Multimedia
4
Beyond Text and Images
5
Thesis Outline
MiningM1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
6
Thesis Outline
MiningM1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
7
Mining Multimedia Data (1)
•Labeling Satellite Imagery
8
Input Output
Mining Multimedia Data (2)
•Network Traffic Log Analysis
9
Mining Multimedia Data (3)
•Web Knowledge Base
10
Mining Multimedia Data
•Data-driven problem solving over multiple modes at a non-trivial scale.
11
Thesis Outline
MiningM1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
12
Querying Multimedia Data (1)
•A querying system provides an interface to retrieve records that best match users’ information need.
13
Querying Multimedia Data (1)
•Here is another example:
14
https://www.facebook.com/pages/browser.php
Querying Multimedia Data (1)
•May be transformed into a graph search problem
15
Querying Multimedia Data (2)
•Calibrate ranking from user feedback
16
Querying Multimedia Data (2)
•Calibrate ranking from user feedback
17
Thesis Outline
MiningM1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
18
Data
•Large-Scale Heterogeneous Networks
19
Port
198.129.1.2131.243.2.10
131.243.2.5
128.3.10.40 128.3.1.50
IP-source IP-destination
80 (HTTP)
80 (HTTP)
993 (IMAP)
Goal
•How can we automatically detect and visualize patterns within a local community of nodes?
20
Preliminary
•Tensor for high-order data representation▫3 data modes: source IP, destination IP,
port #
21
Approach
22
Data Decomposition
•The canonical polyadic (CP) decomposition can factor tensor into a sum of rank-1 tensors
23
Data Decomposition
•A special case is Singular Value Decomposition
24
Attribute Plot
25
How to compute?
Spike Detection
•Iteratively search for spikes in the histogram plot along each data mode.
26
“ “” ”
Substructure Discovery
•Focus on part of the data within the spike
•Categorize into a few subgraph patterns
27
Pattern 1: Generalized Star (1)
28
IP-src’s sending
packets to the same IP-
dst & the same portTypical
client/server system
Pattern 1: Generalized Star (1)
29
A ‘bar’ in a carefully
reordered tensor
Pattern 1: Generalized Star (2)
30
Extending along “Port-Number”
Pattern 1: Generalized Star (2)
31
Port scanning or P2P
Port numbers used in
packets from the same IP-
src to the same IP-dst
Pattern 2: Generalized Bipartite-Core (1)
32
A ‘plane’ in a carefully
reordered tensor
Pattern 2: Generalized Bipartite-Core (1)
33
IP-src’s sending
packets to the same IP-dst’s & the same portClients
talking to a shared server
pool
Pattern 2: Generalized Bipartite-Core (2)
34
A ‘plane’ in a carefully
reordered tensor
Pattern 2: Generalized Bipartite-Core (2)
35
IP-src’s sending
packets over multiple
ports to one IP-dstA multi-
purpose windows server
M1: MultiAspectForensics
•Automatically detects novel patterns in heterogenous networks
36
Thesis Outline
MiningM1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
37
QMAS: Mining Satellite Imagery (1)•Low-labor labeling
38
Input Output
QMAS: Mining Satellite Imagery (2)•Low-labor labeling•Identification of Representatives
39
QMAS: Mining Satellite Imagery (2)•Low-labor labeling•Identification of Representatives and
Outliers
40
QMAS: Mining Satellite Imagery (2)•Low-labor labeling•Identification of Representatives and
Outliers
41
QMAS: Mining Satellite Imagery (3)•Low-labor labeling•Identification of Representatives and
Outliers•Linear in time & space
42
Thesis Outline
MiningM1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
43
Web Search
44
User Clicks as Quality Feedback
45
# of total clicks
Motivation
•Leverage the signal from click data to improve search ranking.
46
Click Through Rate (CTR)
•CTR = # of Clicks / # of Impressions
47
Position Bias
48
Relevance of Web Document
•Relevance = CTR @ Position 1
49
# Clicks @ Position 1# Impressions @ Position 1
=
Problem Definition
•Estimate the relevance of web documents given clicks and their positions.
50
Design Goals / Constraints
•Scalable: single-pass, easy to parallel.
•Incremental: real-time updates possible.
•Accurate: consistent with past and future observations.
51
Approach
52
User Behavior Model
53
Last Clicked Position
54
Empirical Results
•Click data after pre-processing▫110K distinct queries, 8.8M query sessions.
•Training time: <6 mins
•Online update:▫Bump impression and click counters▫No data retention required
55
Empirical Results
•Higher log-likelihood indicates better quality.
56
27% accuracy in prediction2% improvement over ICM, the baseline model
Empirical Results
•Position-bias visualized
57
Ground Truth
DCM
Scaling to Terabytes
•265TB data, 1.15B document relevance results,running time on wall clock ~ 3 hours
58
Q1: Click Models
•A statistical approach to leveraging click data for better ranking aware of position-bias.
•They are incremental, more accurate than the baseline, scaling to almost petabyte-scale data.
59
Thesis Outline
MiningM1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
60
Q2: C-DEM
•A flexible query interface for 3-mode data: images, genes, annotation terms.
61
Q2: C-DEM
62
Images
Terms Genes
Q2: C-DEM
•Solution: random walk with restart on graphs.
63
Thesis Outline
MiningM1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
64
Q3: BEFH (1)• Bayesian exponential family harmonium• Deriving topical representations for
multimedia corpora (e.g., video snapshots and captions)
65
Input Model
Q3: BEFH (2)• Bayesian exponential family harmonium• Deriving topical representations for
multimedia corpora (e.g., video snapshots and captions)
66
Validation – Synthetic Data
Validation – TRECVID Data
Better Quality
Better Quality
Thesis Outline
MiningM1: MultiAspectForensics
M2: QMAS
Querying
Q1: Click Models
Q2: C-DEM
Q3: BEFH
67
Conclusion
•Data-driven research under the theme of pattern mining and similarity querying.
68
Conclusion
•Data-driven research under the theme of pattern mining and similarity querying.
•An array of practical tasks addressed:▫Internet traffic surveillance (M1)
69
Conclusion
•Data-driven research under the theme of pattern mining and similarity querying.
•An array of practical tasks addressed:▫Internet traffic surveillance (M1)▫Satellite image analysis (M2)
70
Conclusion
•Data-driven research under the theme of pattern mining and similarity querying.
•An array of practical tasks addressed:▫Internet traffic surveillance (M1)▫Satellite image analysis (M2)▫Web search (Q1)
71
Conclusion
•Data-driven research under the theme of pattern mining and similarity querying.
•An array of practical tasks addressed:▫Internet traffic surveillance (M1)▫Satellite image analysis (M2)▫Web search (Q1)▫…
72
Thank You!
•http://www.cs.cmu.edu/~fanguo/dissertation/
73
74