technische universität dortmund

Fakultät für Informatik LS 8

Prof. Dr. Katharina Morik

The Challenge of Heterogeneity

Prof. Dr. Katharina Morik | Heterogeneity | Hongkong 29.5.2008

Faculty Computer Science LS 8


Overview
  Heterogeneity in Data
  Distributed Data
  Web 2.0
  Heterogeneity of Users
  Structuring music collections
  Structuring tag collections


Heterogeneity in Data
Databases
  Fixed set of attributes
  Declared data types
  Multi-relational
  Very large number of records
Preparation for mining
  Extract, Transform, Load
  Select attributes
  Declare label for learning
  Handle missing values
  Compose new attributes
Schema-mapping for re-use of DM
  MiningMart application to customer churn (Telecom Italia)


Heterogeneity in Data
Time series data
  Measurements over time
    Business
    Medicine
    Production
  Handwriting
  Pictures
  Music
Tasks
  Prediction
  Classification
  Clustering
  Signal to Symbol


Heterogeneity in Data
Texts
  High dimensional vectors
  Sparse word vectors
  Texts of the same class need not share a word!
  Syntactic, semantic structures
Tasks
  Classification
  Clustering
  Named Entity Recognition, Information Extraction
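The remark that texts of the same class need not share a word can be illustrated with a small bag-of-words sketch. The helper names `word_vector` and `cosine` and the two example sentences are illustrative assumptions, not material from the talk.

```python
import math
from collections import Counter


def word_vector(text):
    """Sparse bag-of-words vector: only words that occur are stored."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse word vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two hypothetical texts of the same class ("football") without a common word.
doc_a = word_vector("striker scores late goal")
doc_b = word_vector("keeper saves penalty kick")
similarity = cosine(doc_a, doc_b)  # no shared word, so the dot product is 0
```

A purely word-based similarity therefore sees the two texts as unrelated, which is one reason why text classification is hard in this representation.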


Distributed Data
  Distributed databases of the same schema
  Distributed databases of different schemas
  Low-level, low capacity sensors
  Peer-to-peer networks


Heterogeneity of Users
  The same label name does not necessarily mean the same concept.
  Different names may refer to the same set of items.
  Users apply diverse aspects, e.g., genre, time of day, episodes (summer 99), ...
  Users share some set of items (possibly under different names).

[Figure: four example user tag hierarchies. Tags in the first: hip hop, pop, metal, alternative, death metal, true metal. Second: hip hop, pop, piano, classic guitar, classic, jazz. Third: classic, pop, jazz, favourites, blues, modern. Fourth: work, home, office, plane.]


Web 2.0

Organizing large data collections requires semantic annotations.

Users annotate items with arbitrary tags.

No common ontology is required (“folksonomies”).

Users want to keep their tags, but like to benefit from efforts of others.


Structuring Music Collections
  A concept’s meaning is its extension, e.g., some music.
  A concept’s meaning can be expressed by a classifier.
  A concept hierarchy for each aspect --> hierarchical classification.
  Acquiring the hierarchy by clustering under the assumption that user-given taggings are kept.

[Figure: items a, b, d, e, f placed in concept hierarchies for different aspects, with the tags pop, rock, metal, blues, bad, good and aggressive.]


Localized Alternative Cluster Ensembles (ECML 2006)
Acquiring hierarchical clusterings from
  Own partial clusterings
  Clusterings of other peers
Requirements
  Preserve taggings of users
  Produce several alternatives
  Exploit input clusterings
  Consider locality instead of global consensus


LACE Algorithm

[Figure: a query set of items a to g next to a tag hierarchy with the nodes alternative, metal, true metal, death metal, hip hop and pop.]

Items are represented by Ids.


The best matching cluster node is selected by f-measure.
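A minimal sketch of this selection step, assuming that each cluster node is given by the set of item Ids it contains and that the f-measure is computed between the query set and that set. The function names and the example data are hypothetical; the slides do not spell out the exact formula.

```python
def f_measure(query, extension):
    """Harmonic mean of precision and recall of a node's items w.r.t. the query."""
    overlap = len(query & extension)
    if overlap == 0:
        return 0.0
    precision = overlap / len(extension)
    recall = overlap / len(query)
    return 2 * precision * recall / (precision + recall)


def best_matching_node(query, nodes):
    """Return the name of the node whose item set matches the query best."""
    return max(nodes, key=lambda name: f_measure(query, nodes[name]))


# Hypothetical data: a query set of item Ids and two cluster nodes of a peer.
query = {"a", "b", "c", "d", "e", "f", "g"}
nodes = {
    "metal": {"a", "b", "c"},
    "pop": {"d", "f", "x", "y"},
}
best = best_matching_node(query, nodes)
```

Here the node "metal" wins, because all of its items belong to the query set, whereas half of the items under "pop" do not.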


Items that are sufficiently similar to items in the best matching clustering are deleted from the query set.


A new query is posed containing the remaining items. Only tags not used yet are considered.


The process continues until all items are covered, no additional match is possible or a maximal number of rounds is reached.
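The covering loop described in these steps can be sketched as follows. This is a simplified illustration, not the published algorithm: "sufficiently similar" is reduced to identity of item Ids, each candidate is a single cluster node, and all names and data are hypothetical.

```python
def f_measure(query, extension):
    """Harmonic mean of precision and recall of a node's items w.r.t. the query."""
    overlap = len(query & extension)
    if overlap == 0:
        return 0.0
    precision = overlap / len(extension)
    recall = overlap / len(query)
    return 2 * precision * recall / (precision + recall)


def cover_query(query, nodes, max_rounds=10):
    """Greedily cover the query set with best matching nodes.

    Returns the chosen node names and the items that stay uncovered.
    """
    remaining = set(query)
    unused = dict(nodes)  # only tags not used yet are considered
    chosen = []
    for _ in range(max_rounds):
        if not remaining or not unused:
            break  # all items covered, or nothing left to match
        best = max(unused, key=lambda name: f_measure(remaining, unused[name]))
        if f_measure(remaining, unused[best]) == 0.0:
            break  # no additional match is possible
        chosen.append(best)
        # Assumption: an item counts as covered if the matched node contains it.
        remaining -= unused.pop(best)
    return chosen, remaining


query = {"a", "b", "c", "d", "e", "f", "g"}
nodes = {"metal": {"a", "b", "c"}, "pop": {"d", "f"}, "jazz": {"x"}}
chosen, remaining = cover_query(query, nodes)
```

With this data the loop picks "metal", then "pop", and stops because "jazz" shares no item with the rest of the query; the items e and g stay uncovered.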


Remaining items are added by classification (kNN).
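A minimal k-nearest-neighbour sketch for this last step, assuming that every item has a numeric feature vector and that Euclidean distance is used. The features, the distance measure, the value of k and the function name are assumptions for illustration.

```python
import math
from collections import Counter


def knn_assign(item, labeled, k=3):
    """Assign the majority tag among the k nearest already tagged items."""
    nearest = sorted(labeled, key=lambda pair: math.dist(item, pair[0]))[:k]
    votes = Counter(tag for _, tag in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional audio features of items that are already tagged.
labeled = [
    ((0.0, 0.0), "metal"),
    ((0.0, 1.0), "metal"),
    ((1.0, 0.0), "metal"),
    ((5.0, 5.0), "pop"),
    ((5.0, 6.0), "pop"),
]
tag_for_e = knn_assign((0.2, 0.2), labeled)
tag_for_g = knn_assign((5.0, 5.5), labeled)
```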


Process starts anew until no more matches are possible or the maximal number of results is reached.



Ad hoc peer-to-peer network.


Structuring Music Collections
Challenge of music data: There is no perfect feature set for all mining tasks.
Learning feature extraction for a classification task
  Mierswa/Morik, MLJ 2005
Structuring music collections
  Wurst/Morik/Mierswa, ECML 2006
User views are local models - no global consensus wanted!
  Mierswa/Morik/Wurst, in: Masseglia, Poncelet, and Teisseire (editors), Successes and New Directions in Data Mining, 2007


Structuring Tag Collections
  Users annotate resources with arbitrary tags.
  Frequency of tags is shown by the tag cloud.
  Tags structure the collection.


Navigation
  A user may select a tag and see the resources.
  A user may follow related tags.
Problem:
  No hierarchical structure.
  Navigation is restricted to the given tags.
  No navigation according to subsets.
  Photography and art cannot be found!


Given: Folksonomy
A folksonomy $(U, T, R, Y)$, with
  $U$: users
  $T$: tags
  $R$: resources
  $Y \subseteq U \times T \times R$
A record $(u, t, r) \in Y$ means that user $u$ has annotated resource $r$ with tag $t$.
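As a data structure, a folksonomy can be held as three sets plus a set of (user, tag, resource) records. The sketch below is one possible representation; the class name, the field names and the choice to derive users, tags and resources from the records are assumptions.

```python
from typing import NamedTuple


class Folksonomy(NamedTuple):
    users: frozenset        # U
    tags: frozenset         # T
    resources: frozenset    # R
    assignments: frozenset  # Y, a set of (user, tag, resource) records


def make_folksonomy(records):
    """Build a folksonomy whose U, T and R are exactly those occurring in Y."""
    records = frozenset(records)
    return Folksonomy(
        users=frozenset(u for u, _, _ in records),
        tags=frozenset(t for _, t, _ in records),
        resources=frozenset(r for _, _, r in records),
        assignments=records,
    )


# Hypothetical records: (user, tag, resource).
folksonomy = make_folksonomy([
    ("ann", "sun", "photo1"),
    ("ann", "beach", "photo2"),
    ("bob", "sun", "photo1"),
])
```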


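Wanted: Tagset clustering
Hierarchical clustering of tags for navigation, based on frequency: how many users used tag $t$?
  $\mathrm{supp}\colon \mathcal{P}(T) \to \mathbb{N}$, where for a tag set $T' \subseteq T$
  $\mathrm{supp}_U(T') = |\{u \in U \mid \forall t \in T' \colon \exists r \in R \colon (u, t, r) \in Y\}|$
Subset of the lattice of frequent tag sets that optimizes clustering criteria.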
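The user-based support counts the users who have used every tag of a tag set, each on at least one resource. A minimal sketch, assuming the records are (user, tag, resource) triples; the function name and the example records are hypothetical.

```python
def user_support(tag_set, assignments):
    """Number of users who used every tag in tag_set on some resource."""
    tags_by_user = {}
    for user, tag, _resource in assignments:
        tags_by_user.setdefault(user, set()).add(tag)
    wanted = set(tag_set)
    return sum(1 for used in tags_by_user.values() if wanted <= used)


# Hypothetical records: (user, tag, resource).
records = {
    ("ann", "sun", "photo1"),
    ("ann", "beach", "photo2"),
    ("bob", "sun", "photo1"),
    ("cem", "beach", "photo3"),
}
support_sun = user_support({"sun"}, records)
support_sun_beach = user_support({"sun", "beach"}, records)
```

A tag set is then called frequent if this count reaches a chosen minimal support. Note that only users with at least one record are counted.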


Starting Point: Termset Clustering
  Termset clustering: how many resources support a term?
  Given frequent term sets, form a clustering with small overlap and large coverage.
    Beil, Ester, Xu (2002) Frequent Term-Based Text Clustering, in KDD 2002
    Fung, Wang, Ester (2003) Hierarchical Document Clustering Using Frequent Itemsets, in SDM 2003
  Heuristics for minimizing overlap, maximizing coverage.

[Figure: lattice of frequent term sets over the terms sun, fun and beach, from { } (documents D1, ..., D16) via {sun}, {beach}, {sun, fun}, {fun, beach} and {sun, beach} to {sun, fun, beach}, each node annotated with its supporting documents.]
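To make the two criteria concrete, the sketch below computes coverage and overlap for a candidate clustering. The definitions are deliberately simple assumptions (coverage as the fraction of documents in at least one cluster, overlap as the average number of extra clusters per covered document) and are not necessarily those of the cited papers; the data is invented.

```python
from collections import Counter


def coverage(clusters, documents):
    """Fraction of all documents that belong to at least one cluster."""
    covered = set().union(*clusters.values()) & set(documents)
    return len(covered) / len(documents)


def overlap(clusters):
    """Average number of additional clusters each covered document belongs to."""
    counts = Counter(doc for docs in clusters.values() for doc in docs)
    if not counts:
        return 0.0
    return sum(c - 1 for c in counts.values()) / len(counts)


# Hypothetical clustering: each frequent term set is mapped to its documents.
documents = ["D1", "D2", "D3", "D4", "D5", "D6"]
clusters = {
    "{sun}": {"D1", "D2", "D3"},
    "{beach}": {"D3", "D4"},
}
cov = coverage(clusters, documents)
ovl = overlap(clusters)
```

In this example four of six documents are covered, and only D3 lies in two clusters, so the overlap is 1/4.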


Heterogeneous Preferences

Child-count vs. completeness (left); coverage vs. overlap (right)


Multi-objective Optimization

Given frequent tag sets

Find all optimal clusterings according to two orthogonal criteria.

Orthogonal criteria can only be determined empirically.

Childcount: number of successors of a cluster

Overlap: average overlap of clusters at each level.

Completeness: how much of the lattice is retained?
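"Optimal according to two criteria" means Pareto-optimal: a clustering is kept if no other clustering is at least as good on both criteria and strictly better on one. A minimal sketch, assuming both criteria are to be maximised (a criterion that should be minimised can be negated first); the score pairs are invented.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better once."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Hypothetical (criterion 1, criterion 2) scores of six candidate clusterings.
scores = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
front = pareto_front(scores)
```

The three clusterings on the front are incomparable trade-offs; choosing among them is left to the user.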


GA for Optimization
NSGA II algorithm
  Deb, Agrawal, Pratap, Meyarivan (2000), in Procs. Parallel Problem Solving from Nature
Delivers all Pareto-optimal clusterings to a partial lattice of frequent tag sets.

[Figure: flowchart of the genetic algorithm with the steps initial population, fitness, stop?, selection, crossover, mutation and output.]
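NSGA II ranks the population by sorting it into successive non-dominated fronts before selection. The sketch below shows only this ranking idea in a simple "peeling" form; it omits the faster bookkeeping and the crowding distance of the full algorithm, assumes that all objectives are maximised, and uses invented fitness values.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better once."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def non_dominated_sort(points):
    """Split point indices into fronts: front 0 is Pareto-optimal, and so on."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


# Hypothetical fitness pairs of six individuals.
fitness = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
fronts = non_dominated_sort(fitness)
```

Individuals in earlier fronts are preferred during selection, which drives the population towards the Pareto front.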


Encoding Frequent Tag Sets

Given the lattice of possibly frequent tag sets, a binary vector indicates the inclusion of a tag set into the clustering.

A vector can be mutated by flipping bits.

Two vectors can be combined to a new one by crossover.
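The encoding and the two genetic operators can be sketched as follows. The mutation rate, the use of one-point crossover and all names are assumptions for illustration; the slides only state that bits are flipped and that two vectors are combined.

```python
import random


def decode(bits, tag_sets):
    """Return the tag sets whose bit is set, i.e. the encoded clustering."""
    return [tag_set for bit, tag_set in zip(bits, tag_sets) if bit]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def crossover(parent_a, parent_b, rng):
    """One-point crossover: head of parent_a followed by tail of parent_b."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


# Hypothetical lattice nodes in a fixed order; one bit per tag set.
tag_sets = [("sun",), ("beach",), ("sun", "beach"), ("sun", "fun")]
rng = random.Random(0)  # seeded only to make the example repeatable
individual = [1, 0, 1, 0]
clustering = decode(individual, tag_sets)
mutant = mutate(individual, 0.25, rng)
child = crossover([0, 0, 0, 0], [1, 1, 1, 1], rng)
```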


Result: Points of Pareto-front
  Childcount vs. Completeness
  Pareto-front for different minimal support

Instances


Application

Bibsonomy social bookmark system: Hotho, Jäschke, Schmitz, Stumme 2006
  780 users, 59,000 resources, 25,000 tags
  4,000 frequent tag sets
Optimization according to Childcount vs. Completeness and Overlap vs. Coverage


Multi-objective Tagset Clustering

Multi-objective optimization allows the user to select among equally good clusterings --> heterogeneity of users is respected

High scalability, high dimensionality

Understandable labels (tags)

Hierarchical structure for navigation.


Challenges for Data Mining
  High dimensional data
  High throughput data
  Distributed data
  P2P networks
  Web 2.0
  Diverse user preferences
  Service for end-user systems, e.g. mobile “phones”