technische universität dortmund
Faculty of Computer Science, LS 8
Prof. Dr. Katharina Morik
The Challenge of Heterogeneity
Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008
Overview
Heterogeneity in Data
Distributed Data
Web 2.0
Heterogeneity of Users
Structuring music collections
Structuring tag collections
Heterogeneity in Data
Databases
  Fixed set of attributes
  Declared data types
  Multi-relational
  Very large number of records
Preparation for mining
  Extract, Transform, Load
  Select attributes
  Declare label for learning
  Handle missing values
  Compose new attributes
Schema-mapping for re-use of DM
  MiningMart application to customer churn -- Telecom Italia
Heterogeneity in Data
Time series data
  Measurements over time: business, medicine, production
  Handwriting, pictures, music
Prediction, classification, clustering, signal to symbol
Heterogeneity in Data
Texts
  High-dimensional vectors
  Sparse word vectors
  Texts of the same class need not share a word!
  Syntactic, semantic structures
Classification
Clustering
Named Entity Recognition, Information Extraction
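The point that texts of the same class need not share a word can be illustrated with a minimal sketch. The two headlines below are invented for illustration, and the helper names `word_vector` and `cosine` are hypothetical, not taken from the talk.

```python
import math
from collections import Counter

def word_vector(text):
    """Sparse bag-of-words vector: only words that occur are stored."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dict-like counters."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Two invented headlines that a reader would put into the same class (sports),
# although they have no word in common.
doc_a = word_vector("striker scores late goal")
doc_b = word_vector("keeper saves penalty kick")
similarity = cosine(doc_a, doc_b)  # no shared word, so the dot product is zero
```

A purely word-based similarity therefore gives no evidence that the two texts belong together, which is one reason why additional structure is of interest for text data.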
Distributed Data
Distributed databases of the same schema
Distributed databases of different schemas
Low-level, low-capacity sensors
Peer-to-peer networks
Heterogeneity of Users
The same label name does not necessarily mean the same concept.
Different names may refer to the same set of items.
Users apply diverse aspects, e.g., genre, time of day, episodes (summer 99),...
Users share some set of items (possibly under different names).
[Figure: example tag hierarchies of different users, with tags such as hip hop, pop, metal, alternative, death metal, true metal, piano, classic guitar, classic, jazz, favourites, blues, modern, work, home, office, plane]
Web 2.0
Organizing large data collections requires semantic annotations.
Users annotate items with arbitrary tags.
No common ontology is required (“folksonomies”).
Users want to keep their tags, but like to benefit from efforts of others.
Structuring Music Collections
A concept’s meaning is its extension, e.g., some music.
A concept’s meaning can be expressed by a classifier.
A concept hierarchy for each aspect --> hierarchical classification.
Acquiring the hierarchy by clustering under the assumption that user-given taggings are kept.
[Figure: example concept hierarchies with the tags pop, rock, metal, blues, bad, good, aggressive over items a, b, d, e, f]
Localized Alternative Cluster Ensembles (ECML 2006)
Acquiring hierarchical clusterings from
  own partial clusterings
  clusterings of other peers
Preserve taggings of users
Produce several alternatives
Exploit input clusterings
Consider locality instead of global consensus
LACE Algorithm
[Figure: example tag hierarchies (alternative, metal, true metal, death metal, hip hop, pop) with items a to g]
Items are represented by Ids.
Best matching cluster node is selected by f-measure.
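A minimal sketch of selecting a cluster node by f-measure follows. It assumes the standard balanced F-measure between the set of query item Ids and the item Ids under a node; the slides do not spell out the exact definition used in LACE. The names `f_measure`, `best_matching_node` and the example data are hypothetical.

```python
def f_measure(query_items, node_items):
    """Balanced F-measure between a query item set and a cluster node's items."""
    query_items, node_items = set(query_items), set(node_items)
    common = len(query_items & node_items)
    if common == 0:
        return 0.0
    precision = common / len(node_items)   # share of the node that is asked for
    recall = common / len(query_items)     # share of the query that is covered
    return 2 * precision * recall / (precision + recall)

def best_matching_node(query_items, nodes):
    """nodes maps a tag to the set of item Ids filed under that tag."""
    return max(nodes, key=lambda tag: f_measure(query_items, nodes[tag]))

# Invented example: item Ids a..g as on the slides, tag assignment assumed.
peer_nodes = {
    "hip hop": {"e", "g"},
    "pop": {"d", "f"},
    "metal": {"a", "b", "c"},
}
query = {"a", "c", "d", "f"}
best = best_matching_node(query, peer_nodes)
```

Here "pop" wins over "metal": both share two items with the query, but the pop node contains nothing the query did not ask for.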
Items that are sufficiently similar to items in the best matching clustering are deleted from the query set.
A new query is posed containing the remaining items. Only tags not used yet are considered.
The process continues until all items are covered, no additional match is possible or a maximal number of rounds is reached.
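The covering process described on these slides can be sketched as a greedy loop. This is a simplified illustration, not the published LACE implementation: "sufficiently similar" is reduced to identical item Ids, and the function names and `max_rounds` default are assumptions.

```python
def f_measure(query, node):
    """Balanced F-measure between two sets of item Ids."""
    common = len(query & node)
    if common == 0:
        return 0.0
    precision, recall = common / len(node), common / len(query)
    return 2 * precision * recall / (precision + recall)

def cover_query(query_items, peer_nodes, max_rounds=10):
    """Greedily cover the query items with best-matching cluster nodes."""
    remaining = set(query_items)
    unused = dict(peer_nodes)      # only tags not used yet are considered
    selected = []                  # (tag, covered items) chosen in each round
    for _ in range(max_rounds):
        if not remaining or not unused:
            break                  # all items covered or nothing left to try
        tag = max(unused, key=lambda t: f_measure(remaining, unused[t]))
        if f_measure(remaining, unused[tag]) == 0.0:
            break                  # no additional match is possible
        covered = remaining & unused[tag]
        selected.append((tag, covered))
        remaining -= covered       # the new query contains the remaining items
        del unused[tag]
    return selected, remaining

# Invented example data; item "h" is not known to any peer node.
peer_nodes = {"metal": {"a", "b", "c"}, "pop": {"d", "f"}, "hip hop": {"e", "g"}}
selected, remaining = cover_query({"a", "c", "d", "e", "h"}, peer_nodes)
```

In this example three rounds select "metal", "pop" and "hip hop", and item "h" stays uncovered because no node contains it.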
Remaining items are added by classification (kNN).
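As a hedged illustration of the kNN step, the sketch below assigns a leftover item to the cluster that is most common among its k nearest already-clustered items. The two-dimensional feature vectors, the Euclidean distance and the name `knn_assign` are assumptions made for this example; the slide only states that kNN classification is used.

```python
import math
from collections import Counter

def knn_assign(item_features, labeled, k=3):
    """Return the majority cluster among the k nearest labeled items.

    labeled is a list of (feature_vector, cluster_name) pairs.
    """
    nearest = sorted(labeled, key=lambda pair: math.dist(item_features, pair[0]))[:k]
    votes = Counter(cluster for _, cluster in nearest)
    return votes.most_common(1)[0][0]

# Invented feature vectors for items that are already placed in clusters.
labeled = [
    ((0.9, 0.1), "metal"),
    ((0.8, 0.2), "metal"),
    ((0.1, 0.9), "pop"),
    ((0.2, 0.8), "pop"),
]
assigned = knn_assign((0.85, 0.3), labeled, k=3)
```

The two closest neighbours are metal items, so the leftover item is filed under "metal".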
Process starts anew until no more matches are possible or the maximal number of results is reached.
Ad hoc peer-to-peer network.
Structuring Music Collections
Challenge of music data: There is no perfect feature set for all mining tasks.
Learning feature extraction for a classification task (Mierswa/Morik, MLJ 2005)
Structuring music collections (Wurst/Morik/Mierswa, ECML 2006)
User views are local models: no global consensus wanted! (Mierswa/Morik/Wurst, in: Masseglia, Poncelet, and Teisseire (editors), Successes and New Directions in Data Mining, 2007)
Structuring Tag Collections
Users annotate resources with arbitrary tags.
Frequency of tags is shown by the tag cloud.
Tags structure the collection.
Navigation
User may select a tag and see the resources.
User may follow related tags.
Problem:
No hierarchical structure.
Restricted navigation to given tags.
No navigation according to subsets.
Photography and art cannot be found!
Given: Folksonomy
A folksonomy $(U, T, R, Y)$, with
$U$ users
$T$ tags
$R$ resources
$Y \subseteq U \times T \times R$; a record $(u, t, r) \in Y$ means that user $u$ has annotated resource $r$ with tag $t$.
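The folksonomy definition maps directly onto a set of triples. The sketch below is one possible representation, with invented records and hypothetical names (`Annotation`, `resources_tagged`, `resources_tagged_with_all`); for simplicity it reads U, T and R off the records instead of taking them as given sets.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    """One record (u, t, r) of the relation Y."""
    user: str
    tag: str
    resource: str

# Invented example records; not data from the talk.
Y = {
    Annotation("u1", "photography", "r1"),
    Annotation("u1", "art", "r1"),
    Annotation("u2", "photography", "r2"),
    Annotation("u2", "art", "r3"),
}

# For this sketch U, T and R are simply read off the records.
U = {a.user for a in Y}
T = {a.tag for a in Y}
R = {a.resource for a in Y}

def resources_tagged(tag, annotations):
    """Resources that at least one user annotated with the given tag."""
    return {a.resource for a in annotations if a.tag == tag}

def resources_tagged_with_all(tags, annotations):
    """Resources that carry every tag in tags (navigation by tag subsets)."""
    per_tag = [resources_tagged(t, annotations) for t in tags]
    return set.intersection(*per_tag) if per_tag else set()

both = resources_tagged_with_all({"photography", "art"}, Y)
```

With these records only `r1` carries both tags, which is the kind of subset query that navigation over single tags does not offer.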
Wanted: Tagset clustering
Hierarchical clustering of tags for navigation, based on frequency: how many users used tag t?
$\mathrm{supp}_U : \mathcal{P}(T) \to \mathbb{N}$, $\mathrm{supp}_U(T') = |\{u \in U \mid \forall t \in T' \; \exists r \in R : (u, t, r) \in Y\}|$
Subset of the lattice of frequent tag sets that optimizes clustering criteria.
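A small sketch of the user-based support may help: a user supports a tag set if they have used every tag in it on some resource. The function name `user_support` and the records are invented for illustration, and the reading "each tag on some resource, not necessarily the same one" is an assumption about the formula.

```python
from collections import defaultdict

def user_support(tag_set, annotations):
    """Number of users who have used every tag in tag_set on some resource."""
    tags_of = defaultdict(set)
    for user, tag, _resource in annotations:
        tags_of[user].add(tag)
    return sum(1 for tags in tags_of.values() if set(tag_set) <= tags)

# Invented (user, tag, resource) records.
Y = {
    ("u1", "sun", "r1"),
    ("u1", "beach", "r2"),
    ("u2", "sun", "r1"),
    ("u3", "beach", "r3"),
    ("u3", "sun", "r3"),
}

support_sun = user_support({"sun"}, Y)
support_sun_beach = user_support({"sun", "beach"}, Y)

# A tag set is frequent if its support reaches a chosen minimal support.
min_support = 2  # assumed threshold for this example
frequent = [s for s in ({"sun"}, {"beach"}, {"sun", "beach"})
            if user_support(s, Y) >= min_support]
```

Support can only shrink when tags are added to a set, which is why the frequent tag sets form a lattice that can be searched level by level.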
Starting Point: Termset Clustering
Termset clustering: how many resources support a term?
Given frequent term sets, form a clustering with small overlap and large coverage.
Beil, Ester, Xu (2002) Frequent Term-Based Text Clustering, in KDD 2002
Fung, Wang, Ester (2003) Hierarchical Document Clustering Using Frequent Itemsets, in SDM 2003
Heuristics for minimizing overlap, maximizing coverage.
[Figure: lattice of frequent term sets over documents D1, ..., D16, from { } through {sun}, {beach}, {sun, fun}, {fun, beach}, {sun, beach} to {sun, fun, beach}, each node annotated with its supporting documents]
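To make the two criteria concrete, the sketch below computes one simple reading of coverage and overlap for a set of term-set clusters. These definitions are assumptions chosen for illustration (coverage as the fraction of documents in at least one cluster, overlap as the average number of extra clusters per covered document); they are not the measures defined in the cited papers. The documents and clusters are invented.

```python
from collections import Counter

def coverage(clusters, all_docs):
    """Fraction of all documents that fall into at least one cluster."""
    covered = set().union(*clusters.values())
    return len(covered & set(all_docs)) / len(all_docs)

def overlap(clusters):
    """Average number of additional clusters a covered document belongs to."""
    counts = Counter(doc for docs in clusters.values() for doc in docs)
    if not counts:
        return 0.0
    return sum(c - 1 for c in counts.values()) / len(counts)

# Invented example: each frequent term set is a cluster label,
# mapped to the documents that contain all of its terms.
all_docs = ["D1", "D2", "D3", "D4", "D5", "D6"]
clusters = {
    frozenset({"sun"}): {"D1", "D2", "D3"},
    frozenset({"beach"}): {"D3", "D4"},
}
cov = coverage(clusters, all_docs)
ovl = overlap(clusters)
```

Adding more term sets to the clustering tends to raise coverage but also overlap, which is why the two criteria have to be traded off.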
Heterogeneous Preferences
Child-count vs. completeness (left); coverage vs. overlap (right)
Multi-objective Optimization
Given frequent tag sets
Find all optimal clusterings according to two orthogonal criteria.
Orthogonal criteria can only be determined empirically.
Childcount: number of successors of a cluster
Overlap: average overlap of clusters at each level.
Completeness: how much of the lattice is retained?
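What "all optimal clusterings according to two criteria" means can be sketched with a plain Pareto filter. The sketch assumes that both criteria are to be maximized (a criterion to be minimized, such as overlap, could be negated first); the score pairs and the names `dominates` and `pareto_front` are invented for illustration.

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every criterion
    and strictly better in at least one (both criteria maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Invented (criterion 1, criterion 2) scores of five candidate clusterings.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
front = pareto_front(candidates)
```

The first three candidates remain: none of them is better than another in both criteria, so choosing among them is left to the user.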
GA for Optimization
NSGA II algorithm: Deb, Agrawal, Pratap, Meyarivan (2000) in Procs. Parallel Problem Solving from Nature
Delivers all Pareto-optimal clusterings to a partial lattice of frequent tag sets.
[Figure: flowchart of the genetic algorithm with the steps Initial population, Fitness, Stop?, Selection, Crossover, Mutation, Output]
Encoding Frequent Tag Sets
Given the lattice of possibly frequent tag sets, a binary vector indicates the inclusion of a tag set into the clustering.
A vector can be mutated by flipping bits.
Two vectors can be combined to a new one by crossover.
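The encoding and the two genetic operators can be sketched as follows. One-point crossover and a per-bit mutation rate are assumptions made for this example, since the slide does not say which variants were used; the function names and tag sets are hypothetical.

```python
import random

def decode(bits, tag_sets):
    """Tag sets whose bit is 1 are included in the clustering."""
    return [ts for bit, ts in zip(bits, tag_sets) if bit]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability rate."""
    return [1 - b if rng.random() < rate else b for b in bits]

def crossover(parent_a, parent_b, rng):
    """One-point crossover: head of parent_a, tail of parent_b."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

# Invented lattice nodes, in a fixed order that defines the bit positions.
tag_sets = [{"sun"}, {"beach"}, {"fun"}, {"sun", "beach"}, {"sun", "fun"}]
rng = random.Random(0)  # seeded only to make the sketch repeatable

individual = [1, 0, 0, 1, 0]
clustering = decode(individual, tag_sets)
mutant = mutate(individual, rate=0.2, rng=rng)
child = crossover([0, 0, 0, 0, 0], [1, 1, 1, 1, 1], rng)
```

Because every bit vector decodes to some subset of the lattice, mutation and crossover always yield candidate clusterings that the fitness function can evaluate.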
Result: Points of Pareto-front
Childcount vs. Completeness
Pareto-front for different minimal support
Instances
Application
Bibsonomy social bookmark system: Hotho, Jäschke, Schmitz, Stumme 2006
780 users, 59,000 resources, 25,000 tags
4000 frequent tag sets
Optimization according to Childcount vs. Completeness and Overlap vs. Coverage
Multi-objective Tagset Clustering
Multi-objective optimization allows the user to select among equally good clusterings --> heterogeneity of users is respected
High scalability, high dimensionality
Understandable labels (tags)
Hierarchical structure for navigation.
Challenges for Data Mining
High dimensional data
High throughput data
Distributed Data
P2P networks
Web 2.0
Diverse user preferences
Service for end-user systems, e.g. mobile “phones”