content-based clustering for tag cloud visualization

DESCRIPTION

My presentation at ASONAM 2009 on July 21st, 2009

TRANSCRIPT

Content-based Clustering for Tag Cloud Visualization

ASONAM 2009

Arkaitz Zubiaga, Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez

NLP & IR Group @ UNED

July 21st, 2009

Index

1 Introduction

2 Dataset Generation

3 Our Method

4 Results

5 Conclusions

6 Future Work

NLP Group (UNED) Content-based Tag Clustering July 21st, 2009 2 / 25

Introduction

Simple Tagging

Collaborative Tagging

Tag Cloud

No organization.

No relations between tags.

Our Work

Find relations between tags to organize them:

To ease visualization and search.

To ease subscribing to a group of related tags.

Previous works rely on tag co-occurrence to find relations.

What about considering web documents’ content?

Dataset Generation

Starting point: 140 most popular tags on Delicious (T140, tag cloud).

Tag monitoring: ~6,000 documents per tag (~840,000 documents, HTML and PDF).

Data retrieval:

Tag data for each document.

Document content.

Filtering: English-written documents with tag data available.

Result: 144,574 documents (unbalanced).

Our Method

Representation

Most relevant tags for each document: at least 40.7% of the top tag.

Merge documents pertaining to each T140 tag.

Stopwords removal.

Stemming.

TF-IDF representation (reducing by DF).

1 vector/tag.
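
The representation steps above can be sketched roughly as follows. This is only an illustration, not the authors' code: the function name `tfidf_vectors`, the `min_df` cutoff, and the choice to drop low-DF terms are assumptions (the slides say only "reducing by DF"), and the input is assumed to be already stopword-filtered and stemmed.

```python
import math
from collections import Counter

def tfidf_vectors(tag_docs, min_df=2):
    """tag_docs maps each tag to a list of token lists (one per document).

    Tokens are assumed to be stopword-filtered and stemmed beforehand.
    """
    # Merge all documents of a tag into a single bag of terms.
    merged = {tag: Counter(tok for doc in docs for tok in doc)
              for tag, docs in tag_docs.items()}
    n_tags = len(merged)
    # Document frequency: in how many merged tag-documents a term occurs.
    df = Counter(term for counts in merged.values() for term in counts)
    # Assumed DF reduction: keep terms present in at least min_df tags.
    vocab = sorted(t for t, d in df.items() if d >= min_df)
    # One TF-IDF vector per tag, aligned with vocab.
    vectors = {tag: [counts[t] * math.log(n_tags / df[t]) for t in vocab]
               for tag, counts in merged.items()}
    return vocab, vectors
```

Each tag ends up with a single vector whose length equals the reduced vocabulary size, which is the input shape the clustering step expects.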

Clustering (SOM)

Clustering Settings

12x12 sized map: 144 neurons.

Vectors with 17,518 dimensions.

Learning rate: 0.1.

Neighborhood: 12.

Iterations: 50,000.
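
A minimal self-organizing map with these settings might look like the sketch below. The defaults mirror the slide (12x12 map, learning rate 0.1, neighborhood 12, 50,000 iterations); everything else is an assumption made for illustration: random initialization, Euclidean distance, a Gaussian neighborhood, and linear decay of both the learning rate and the radius.

```python
import math
import random

def best_matching_unit(weights, x):
    # Index of the neuron whose weight vector is closest to x (squared Euclidean).
    return min(range(len(weights)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(weights[i], x)))

def train_som(vectors, rows=12, cols=12, lr0=0.1, radius0=12.0,
              iterations=50_000, seed=0):
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Assumed initialization: uniform random weights in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for t in range(iterations):
        frac = 1.0 - t / iterations          # assumed linear decay
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        x = rng.choice(vectors)
        bmu_row, bmu_col = divmod(best_matching_unit(weights, x), cols)
        for i, w in enumerate(weights):
            r, c = divmod(i, cols)
            d2 = (r - bmu_row) ** 2 + (c - bmu_col) ** 2
            if d2 <= radius * radius:
                # Gaussian neighborhood: closer neurons move more.
                h = math.exp(-d2 / (2 * radius * radius))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights
```

After training, calling `best_matching_unit(weights, tag_vector)` for each tag vector places the tag on a cell of the map; tags that land on the same or adjacent cells are the ones treated as related.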

Terminology Extraction

Merge all the documents in each neuron.

Terminology extraction for each neuron.

Representative for the neuron, but not for the rest.

Language models (KLD, Kullback-Leibler Divergence).

Result: Representative terms for each neuron.
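
One way to read the KLD step is to score each term by its pointwise contribution to the Kullback-Leibler divergence between the neuron's language model and that of the remaining neurons. The sketch below follows that reading; the function name `kld_terms`, the add-one smoothing, and the `top_k` cutoff are assumptions, not details given in the slides.

```python
import math
from collections import Counter

def kld_terms(neuron_tokens, rest_tokens, top_k=10):
    """Rank a neuron's terms by p * log(p / q).

    p is the term probability in the neuron, q its probability in the
    rest of the map; both use add-one smoothing (an assumption).
    """
    fg = Counter(neuron_tokens)
    bg = Counter(rest_tokens)
    vocab_size = len(set(fg) | set(bg))
    fg_total = sum(fg.values()) + vocab_size
    bg_total = sum(bg.values()) + vocab_size
    scores = {}
    for term in fg:
        p = (fg[term] + 1) / fg_total
        q = (bg[term] + 1) / bg_total
        scores[term] = p * math.log(p / q)
    # Highest scores: frequent in this neuron, rare elsewhere.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Terms that are common everywhere get a score near or below zero, so they fall to the bottom of the ranking even if they are frequent inside the neuron.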

Results

Full map available at: http://nlp.uned.es/social-tagging/

Results: Computer Science

Results: Design

Results: Cooking

Results: Coherence

Results: Terminology

Conclusions

We analyzed tag clustering and terminology extraction relying on documents’ content.

We collected the DeliciousT140 dataset.

Unlike previous works, we considered documents’ content.

The resulting map shows encouraging results, exhibiting the potential of collaborative tagging systems.

It could allow community discovery.

It eases tag cloud visualization and improves navigation and subscribing.

Future Work

To compare our content-based approach to those based on tag co-occurrence.

To make a quantitative evaluation.

To semantically analyze tags (polysemy, synonymy, ...).

To extend the work to multilingual tag sets.

Thank You for Your Attention

Achiu Arigato Danke Dhannvaad Dua Netjer en ek Efcharisto Gracias Gracies Gratia Grazie Guishepeli Hvala Kiitos Koszonom Merce Merci Mila esker Obrigado Shukran Shukriya Tack Tak Takk Tanan Tapadh leat Tesekkur ederim Thank you Toda
