challenging retrieval scenarios: social media and linked open data

Institute for Web Science & Technologies – WeST Challenging Retrieval Scenarios: Social Media and Linked Open Data Dr. Thomas Gottron [email protected]


DESCRIPTION

Invited talk given in April 2012 at USI in Lugano at the IR research group of Fabio Crestani. Review of the work on Interestingness on Twitter and schema based indices on Linked Open Data (SchemEX).

TRANSCRIPT

Page 1: Challenging Retrieval Scenarios: Social Media and Linked Open Data

Institute for Web Science & Technologies – WeST

Challenging Retrieval Scenarios: Social Media and Linked Open Data

Dr. Thomas Gottron

Page 2: Challenging Retrieval Scenarios: Social Media and Linked Open Data

Thomas Gottron, Lugano, 23.4.2012 | Challenging Retrieval Scenarios

Outline

The ROBUST project
• Background
• Use cases

Retrieval on Microblogs
• Particularities of Twitter
• Interestingness
• LiveTweet

Search on the LOD cloud
• Querying LOD as IR task
• Schema extraction
• SchemEX

Page 3: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Online Communities

Page 4: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Business Communities

Information ecosystems
• Employees
• Business partners, customers
• General public

Valuable asset

Opportunities

Risks

Page 5: Challenging Retrieval Scenarios: Social Media and Linked Open Data

High Level Objectives

Risk Management
• Risk modelling
• Detection
• Automatic reaction

Community Analysis
• Contents
• Single users
• Entire communities

Community Forecasting
• Policies
• Prediction
• Decision support

Large Scale Processing
• Big Data
• Realtime
• Parallel processing

Page 6: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Scenario 1

Social Media - Microblogs

Page 7: Challenging Retrieval Scenarios: Social Media and Linked Open Data


IBM Connections

Page 8: Challenging Retrieval Scenarios: Social Media and Linked Open Data

Twitter

@janedoe: My dear @johndoe had troubles to wake up this #morning

Follower

Page 9: Challenging Retrieval Scenarios: Social Media and Linked Open Data

Retrieval on Twitter: First Steps

10 million tweets

Retrieval engine

Query: beer

Rank  User            Tweet
1     LoriAG          beer
2     Crushdwinebar   beer!!
3     Skippertaylor   BEER
4     BigMacScola     Beer
5     VANiamore       beer.......
6     CindyMcManis    To beer or not to beer on Beer Summit ?
7     silverlakewine  beer beer beer beer beer beer beer. Simple 3pm
8     eldoradobar     http://ping.fm/p/Bnra7 - In!!! BEER, BEER, BEER, BEER, BEER, BEER, BEER, BEER, BEER, BEER,
9     tonx            Lompoc. beer beer beer beer beer beer beer beer beer beer. http://twitpic.com/l68ld
10    punkeyfunky     Beer beer beer beer beer beer beer beer beer beer beer beer beer. Er, guess what I'm looking forward to?

What is going wrong?

Page 10: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Particularities of Twitter

Page 11: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Twitter is different

Maximum length: 140 characters

[Figure: histogram of tweet lengths. x-axis: characters (1 to 139); y-axis: number of tweets (0 to 500,000)]

Page 12: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Twitter is different

140 characters = few words

[Figure: number of tweets by maximum term frequency. x-axis: max TF in tweet (0 to 47); y-axis: number of tweets, logarithmic scale (1 to 100,000,000)]

85% of tweets contain each word only once

$$w(t_j, d_i) = \mathrm{tf}(t_j, d_i) \cdot \log\left(\frac{N}{\mathrm{df}(t_j)}\right)$$

The tf factor is effectively a binary value!
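The effect of a near-binary tf on the TF-IDF weight can be sketched in a few lines. This is a minimal illustration on a made-up three-tweet corpus; the function name `tfidf` and the toy tweets are assumptions for illustration, not material from the talk.

```python
import math
from collections import Counter

def tfidf(term, doc_tokens, corpus):
    """w(t, d) = tf(t, d) * log(N / df(t)) over a tokenised corpus."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Toy corpus (hypothetical tweets, already tokenised).
corpus = [
    ["beer"],
    ["pubs", "brewing", "their", "own", "beer"],
    ["good", "morning"],
]

# Each tweet contains "beer" at most once, so tf is 0 or 1 and the
# weight of a matching tweet reduces to the idf of the term alone.
weights = [tfidf("beer", d, corpus) for d in corpus]
```

With tf limited to 0 or 1, every matching tweet receives the same weight, so TF-IDF alone cannot distinguish the one-word tweet from the informative one.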

Page 13: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Length normalisation

Why are some documents longer? (classic explanation)

Verbosity hypothesis:
• Long documents repeat themselves
• Short documents are preferred as they are more concise

Scope hypothesis:
• Long documents address more topics
• Short documents are preferred as they are more focussed

Intuition: Not valid for Twitter

Page 14: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Verbosity hypothesis and Twitter?

Are long tweets more verbose?
• Consider length of tweets and number of repeated words

Correlation (Spearman's rank)
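A rank correlation between tweet length and word repetition could be computed along the following lines. This is only a sketch: the helper names, the definition of "repeated words" (occurrences beyond the first of each word) and the four toy tweets are assumptions, and the slide's actual correlation value is not reproduced here.

```python
from collections import Counter

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)  # undefined if one variable is constant

def repeated_words(tweet):
    """Number of word occurrences beyond the first one of each word."""
    counts = Counter(tweet.lower().split())
    return sum(c - 1 for c in counts.values())

# Hypothetical tweets for illustration only.
tweets = ["beer", "beer beer beer", "good morning to all of you", "to beer or not to beer"]
lengths = [len(t) for t in tweets]
repeats = [repeated_words(t) for t in tweets]
rho = spearman(lengths, repeats)
```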

Page 15: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Scope hypothesis and Twitter?

Are long tweets broader in scope?

LDA: 100 topics

Observations
• 8.5% of tweets have no strong topic
• Remaining tweets:
  • 77.1% are dominated by one topic
  • 99.6% are dominated by two topics

Page 16: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Length normalisation on tweets

Not necessary! … Negative impact?

YES: Short tweets are preferred!

Long tweets are considered too wide in scope.

Beer!

Pubs brewing their own beer: a list for Düsseldorf http://bit.ly/w2GZrV
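Why length normalisation favours the uninformative tweet can be shown with a small sketch. It assumes binary term weights and a cosine-style normaliser (square root of the number of distinct terms); the function names and the simple tokeniser are illustrative assumptions, not the scoring function of any particular engine.

```python
import math
import re

def tokens(text):
    """Lower-cased word tokens; a deliberately simple tokeniser."""
    return re.findall(r"\w+", text.lower())

def score(query, text, normalise):
    """Count matching query terms; optionally divide by sqrt(#distinct terms)."""
    distinct = set(tokens(text))
    matches = sum(1 for t in set(tokens(query)) if t in distinct)
    if not normalise:
        return float(matches)
    return matches / math.sqrt(len(distinct)) if distinct else 0.0

short_tweet = "Beer!"
long_tweet = "Pubs brewing their own beer: a list for Düsseldorf http://bit.ly/w2GZrV"

with_norm = (score("beer", short_tweet, True), score("beer", long_tweet, True))
without_norm = (score("beer", short_tweet, False), score("beer", long_tweet, False))
```

Under these assumptions the normalised score ranks "Beer!" above the longer tweet, while the unnormalised score treats both as equal matches.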

Page 17: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Interestingness

Page 18: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Interesting Content

Concept of "relevance" in IR:
• Document is about a topic

Additionally for Twitter:
• Timeliness
• Current trend
• Informative

Interestingness
• Tweet is about a topic AND is interesting!

Question: How to determine what is interesting???

Page 19: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Retweets

@janedoe: My dear @johndoe had troubles to wake up this #morning

Follower: RT @janedoe: My dear @johndoe had troubles to wake up this #morning

Page 20: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Retweets

Retweet indicates quality "of interest for others"

Depends on
• Content
• Context (time, follower)

Idea: Learn to predict retweets!

Likelihood of retweet as metric for Interestingness

Page 21: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Retweets: Prediction model

Aim: Prediction of a probability

Logistic regression

Model parameters learned on training data

Dataset               Users      Tweets      Retweets
Choudhury             118,506    9,998,756   7.89%
Choudhury (extended)  277,666    29,000,000  8.64%
Petrovic              4,050,944  21,477,484  8.46%
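The idea of learning a retweet probability with logistic regression can be sketched as follows. This is a toy illustration, not the model from the talk: the two binary features, the six training samples and the plain gradient-ascent trainer are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Estimated probability that a tweet with feature vector x is retweeted."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train(samples, labels, epochs=2000, lr=0.1):
    """Fit weights by batch gradient ascent on the log-likelihood."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(samples, labels):
            err = y - predict(weights, bias, x)
            grad_b += err
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
        bias += lr * grad_b / len(samples)
        weights = [w + lr * g / len(samples) for w, g in zip(weights, grad_w)]
    return weights, bias

# Hypothetical binary features: [contains URL, is direct message].
samples = [[1, 0], [1, 0], [0, 0], [0, 1], [0, 1], [1, 1]]
labels = [1, 1, 0, 0, 0, 0]  # 1 = tweet was retweeted (made-up labels)

weights, bias = train(samples, labels)
p_url = predict(weights, bias, [1, 0])  # tweet with a URL
p_dm = predict(weights, bias, [0, 1])   # direct message
```

On this toy data the learned model assigns a higher retweet probability to the tweet with a URL than to the direct message; the predicted probability is what serves as the interestingness score.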

Page 22: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Logistic Regression: Weights

Feature          Dimension       Weight
Constant         (intercept)     -5.45
Message feature  Direct message  -147.89
                 Username        146.82
                 Hashtag         42.27
                 URL             249.09
Sentiment        Valence         -26.88
                 Arousal         33.97
                 Dominance       19.56
Emoticons        Positive        -21.8
                 Negative        9.94
Exclamation      Positive        13.66
                 Negative        8.72
Punctuation      !               -16.85
                 ?               23.67
Terms            Odds            19.79

Page 23: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Logistic Regression: Topic Weights

Topic Weight

social media market post site web tool traffic network 27.54

follow thank twitter welcome hello check nice cool people 16.08

credit money market business rate economy home 15.25

christmas shop tree xmas present today wrap finish 2.87

home work hour long wait airport week flight head -14.43

twitter update facebook account page set squidoo check -14.43

cold snow warm today degree weather winter morning -26.56

night sleep work morning time bed feel tired home -75.19

Page 24: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Re-Ranking using Interestingness

Top-k relevant tweets

Re-rank based on interestingness

Rank  Username      Tweet
1     BeeracrossTX  UK beer mag declares "the end of beer writing." @StanHieronymus says not so in the US. http://bit.ly/424HRQ #beer
2     narmmusic     beer summit @bspward @jhinderaker no one had billy beer? heehee #narm - beer summit @bspward @jhinde http://tinyurl.com/n29oxj
3     beeriety      Go green and turn those empty beer bottles into recycled beer glasses! | http://bit.ly/2src7F #beer #recycle (via: @td333)
4     hblackmon     Great Divide beer dinner @ Porter Beer Bar on 8/19 - $45 for 3 courses + beer pairings. http://trunc.it/172wt
5     nycraftbeer   Interesting Concept-Beer Petitions.com launches&hopes 2help craft beer drinkers enjoy beer they want @their fave pubs. http://bit.ly/11gJQN
6     carichardson  Beer Cheddar Soup: Dish number two in my famed beer dinner series is Beer Cheddar Soup. I hadn't had too.. http://bit.ly/1diDdF
7     BeerBrewing   New York City Beer Events - Beer Tasting - New York Beer Festivals - New York Craft Beer http://is.gd/39kXj #beer
8     delphiforums  Love beer? Our member is trying to build up a new beer drinker's forum. Grab a #beer and join us: http://tr.im/pD1n
9     Jamie_Mason   #Baltimore Beer Week continues w/ a beer brkfst, beer pioneers luncheon, drink & donate event, beer tastings & more. http://ping.fm/VyTwg
10    carichardson  Seattle and Beer: I went to Seattle last weekend. It was my friend's stag - he likes beer - we drank beer.. http://tinyurl.com/cpb4n9
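The two-stage procedure (retrieve the top-k tweets by relevance, then re-rank them by interestingness) can be sketched like this. The relevance scores, the `toy_interestingness` function and the sample results are hypothetical stand-ins; in the talk the interestingness score is the learned retweet probability.

```python
def rerank(results, interestingness, k=10):
    """Take the top-k results by relevance, then order them by interestingness."""
    top_k = sorted(results, key=lambda r: r["relevance"], reverse=True)[:k]
    return sorted(top_k, key=lambda r: interestingness(r["text"]), reverse=True)

def toy_interestingness(text):
    """Hypothetical stand-in for a learned retweet probability."""
    return 0.9 if "http://" in text else 0.1

# Made-up retrieval results with made-up relevance scores.
results = [
    {"text": "beer", "relevance": 3.0},
    {"text": "BEER", "relevance": 2.9},
    {"text": "Go green: recycled beer glasses http://bit.ly/2src7F", "relevance": 1.2},
    {"text": "good morning", "relevance": 0.0},
]

reranked = rerank(results, toy_interestingness, k=3)
```

Relevance decides which tweets enter the candidate set; interestingness only decides their order within it, so an off-topic but interesting tweet cannot enter the result list.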

Page 25: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Application

Page 26: Challenging Retrieval Scenarios: Social Media and Linked Open Data


LiveTweet

Data: Twitter streaming API: sample 1% of all tweets

Architecture:
• Time slices over tweets
• Analytical component with REST API
• Web frontend for end user

Page 27: Challenging Retrieval Scenarios: Social Media and Linked Open Data


LiveTweet

http://livetweet.west.uni-koblenz.de/

Page 28: Challenging Retrieval Scenarios: Social Media and Linked Open Data


LiveTweet: What comes next?

Retrieval
• Incorporate with other retrieval metrics
• Include Interestingness in a learning to rank approach
• Social graph

System extension
• Personalisation
• Public API
• Work with IBM data

Page 29: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Scenario 2

Linked Open Data

Page 30: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Information needs requiring semantic structure

Examples
• Male persons who have a public profile document
• Computing science papers authored by social scientists
• American actors who are also politicians and are married to a model

Maybe specific databases available:
• Person search engines
• Bibliographic databases
• Movie database

How to integrate?

Page 31: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Linked Data

[Figure: data sources A, B, C, D and E, each containing Things that are connected by typed links within and across data sources]

Semantic Web technology to
1. Provide structured data on the web
2. Link data across data sources

Page 32: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Entities are identified via URIs

pd:cygri  rdf:type         foaf:Person
pd:cygri  foaf:name        "Richard Cyganiak"
pd:cygri  foaf:based_near  dbpedia:Berlin

pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri
dbpedia:Berlin = http://dbpedia.org/resource/Berlin

Description of a link between two data sources

One statement = one triple

Subject Predicate Object

Page 33: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Resolving URIs

pd:cygri        rdf:type         foaf:Person
pd:cygri        foaf:name        "Richard Cyganiak"
pd:cygri        foaf:based_near  dbpedia:Berlin
dbpedia:Berlin  dp:population    3,405,259
dbpedia:Berlin  skos:subject     dp:Cities_in_Germany

Page 34: Challenging Retrieval Scenarios: Social Media and Linked Open Data


The LOD Cloud

Page 35: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Querying linked data

SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}

Page 36: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Querying linked data – an IR task?

Information need

Classic IR: Keyword query | Documents | Information ("Here happens IR magic")

Linked data: SPARQL query | Data sources | Entities ("Here we need magic")

Page 37: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Querying linked data – using an index

SELECT ?x
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}

Page 38: Challenging Retrieval Scenarios: Social Media and Linked Open Data


A Schema for LOD

Page 39: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Idea

Schema Index:
• Define families of graph patterns
• Assign entities to graph patterns
• Map graph patterns to context / source

Construction:
• Stream-based for scalability
• Little loss of accuracy

NOTE:
• Index defined over entities
• But: Index stores the contexts (sources)

Page 40: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Input Data

N-Quads:
<subject> <predicate> <object> <context> .

Example:

<http://www.w3.org/People/Connolly/#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> <http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> .

[Figure: w3p:#me is of type foaf:Person, stated in the context http://dig.csail.mit.edu/2008/...]
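Reading such a statement into its four components could look roughly like this. The sketch is deliberately limited: it only accepts lines whose four terms are all IRIs in angle brackets and does not handle literals, blank nodes or escapes from the full N-Quads grammar; `parse_quad` is a hypothetical helper name.

```python
import re

# Four <IRI> terms followed by a terminating dot (IRI-only subset of N-Quads).
QUAD = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")

def parse_quad(line):
    """Return (subject, predicate, object, context) for an IRI-only quad line."""
    match = QUAD.match(line)
    if match is None:
        raise ValueError("not an IRI-only N-Quads statement: %r" % line)
    return match.groups()

line = (
    "<http://www.w3.org/People/Connolly/#me> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf> ."
)
subject, predicate, obj, context = parse_quad(line)
```

The fourth component, the context, is what the schema index stores: it records the data source in which the statement was found.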

Page 41: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Layer 1: RDF Classes

[Figure: entity timbl:card#i of type foaf:Person, found in http://www.w3.org/People/Berners-Lee/card and http://dig.csail.mit.edu/2008/...; class C1 (foaf:Person) maps to data sources DS 1, DS 2, DS 3]

SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
}

All entities of a particular type
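A first index layer of this kind, mapping each RDF class to the data sources that contain instances of it, can be sketched as a dictionary of sets. The function name, the shortened prefixed names and the `ds:` source identifiers are illustrative assumptions, not the SchemEX implementation.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # shortened; stands for the full rdf-syntax-ns#type IRI

def build_class_index(quads):
    """Map each class to the set of contexts (data sources) with instances of it."""
    index = defaultdict(set)
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            index[obj].add(context)
    return index

# Hypothetical quads: (subject, predicate, object, context).
quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", "ds:1"),
    ("w3p:#me", RDF_TYPE, "foaf:Person", "ds:2"),
    ("timbl:card", RDF_TYPE, "foaf:PersonalProfileDocument", "ds:1"),
    ("timbl:card#i", "foaf:name", "Tim", "ds:1"),
]

class_index = build_class_index(quads)
sources_for_person = class_index["foaf:Person"]
```

A query for all entities of type foaf:Person then only needs to be sent to the sources listed under that class.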

Page 42: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Layer 2: Type Clusters

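[Figure: entity timbl:card#i with types foaf:Person and pim:Male, found in http://www.w3.org/People/Berners-Lee/card; classes C1 (foaf:Person) and C2 (pim:Male) form type cluster TC1 (tc4711), which maps to data sources DS 1, DS 2, DS 3]

SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
}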

All entities belonging to the same set of types
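Type clusters can be sketched by grouping entities on their complete set of types and recording where those entities were seen. As before, the function name, prefixed names and `ds:` identifiers are assumptions made for illustration.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # shortened; stands for the full rdf-syntax-ns#type IRI

def build_type_clusters(quads):
    """Map each set of types (a type cluster) to the sources of its entities."""
    types = defaultdict(set)     # entity -> classes it is an instance of
    contexts = defaultdict(set)  # entity -> sources stating those types
    for subject, predicate, obj, context in quads:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
            contexts[subject].add(context)
    clusters = defaultdict(set)
    for entity, type_set in types.items():
        clusters[frozenset(type_set)] |= contexts[entity]
    return clusters

# Hypothetical quads: (subject, predicate, object, context).
quads = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person", "ds:1"),
    ("timbl:card#i", RDF_TYPE, "pim:Male", "ds:1"),
    ("ex:bob", RDF_TYPE, "foaf:Person", "ds:2"),
    ("ex:bob", RDF_TYPE, "pim:Male", "ds:3"),
    ("ex:alice", RDF_TYPE, "foaf:Person", "ds:2"),
]

clusters = build_type_clusters(quads)
male_persons = clusters[frozenset({"foaf:Person", "pim:Male"})]
```

An entity that is only a foaf:Person ends up in a different cluster than one that is both foaf:Person and pim:Male, so the two-type query above is answered by a single lookup.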

Page 43: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Layer 3: Equivalence Classes

[Figure: classes C1, C2, C3; type clusters TC1, TC2; equivalence class EQC1, which maps to data sources DS 1, DS 2, DS 3]

Two entities are equivalent iff:
• They are in the same TC
• They have the same properties
• The property targets are in the same TC
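The three conditions translate directly into a grouping key: the entity's own type cluster plus the set of (property, type cluster of target) pairs. The following sketch works on plain triples with shortened names; the entities `ex:bob`, `ex:bobdoc` and `ex:carol` are hypothetical, and objects without type statements are simply given an empty type cluster.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # shortened; stands for the full rdf-syntax-ns#type IRI

def equivalence_classes(triples):
    """Group entities by (own type cluster, {(property, target type cluster)})."""
    types = defaultdict(set)  # entity -> classes
    links = defaultdict(set)  # entity -> (property, target) pairs
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
        else:
            links[subject].add((predicate, obj))

    def type_cluster(entity):
        return frozenset(types.get(entity, ()))

    classes = defaultdict(set)
    for entity in set(types) | set(links):
        pattern = frozenset((p, type_cluster(o)) for p, o in links.get(entity, ()))
        classes[(type_cluster(entity), pattern)].add(entity)
    return classes

triples = [
    ("timbl:card#i", RDF_TYPE, "foaf:Person"),
    ("timbl:card#i", RDF_TYPE, "pim:Male"),
    ("timbl:card#i", "foaf:maker", "timbl:card"),
    ("timbl:card", RDF_TYPE, "foaf:PPD"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", RDF_TYPE, "pim:Male"),
    ("ex:bob", "foaf:maker", "ex:bobdoc"),
    ("ex:bobdoc", RDF_TYPE, "foaf:PPD"),
    ("ex:carol", RDF_TYPE, "foaf:Person"),
    ("ex:carol", RDF_TYPE, "pim:Male"),
]

classes = equivalence_classes(triples)
with_timbl = next(members for members in classes.values() if "timbl:card#i" in members)
```

Here `ex:carol` shares the type cluster of the other two persons but has no foaf:maker link, so it falls into a separate equivalence class.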

Page 44: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Layer 3: Equivalence Classes

[Figure: timbl:card#i (types foaf:Person and pim:Male; type cluster tc4711; equivalence class eqc0815) is linked via foaf:maker to timbl:card (type foaf:PPD; type cluster tc1234); index entry eqc0815-maker-tc1234; data source http://www.w3.org/People/Berners-Lee/card]

SELECT ?x
FROM …
WHERE {
  ?x rdf:type foaf:Person .
  ?x rdf:type pim:Male .
  ?x foaf:maker ?y .
  ?y rdf:type foaf:PersonalProfileDocument .
}

Page 45: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Schema Index Overview

3 Layers – 3 different graph patterns

Page 46: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Schema Computation

Page 47: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Building the Index from a Stream

Stream of n-quads (coming from a LD crawler)

… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1

[Figure: the quads pass through a FIFO cache of entities (numbered 1 to 6), which are assigned to classes C1, C2 and C3]
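Stream-based construction with a bounded FIFO cache can be sketched as below, here for type clusters only. This is an assumption-laden simplification rather than the SchemEX code: entity descriptions are collected in arrival order, and when the cache is full the oldest entity is evicted and written to the index with whatever types were seen so far.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "rdf:type"  # shortened; stands for the full rdf-syntax-ns#type IRI

def stream_type_clusters(quads, cache_size):
    """Build a type-cluster index from a quad stream using a FIFO entity cache."""
    cache = OrderedDict()     # entity -> (types, contexts), oldest first
    index = defaultdict(set)  # type cluster -> data sources

    def flush(info):
        types, contexts = info
        if types:
            index[frozenset(types)] |= contexts

    for subject, predicate, obj, context in quads:
        if subject not in cache:
            if len(cache) >= cache_size:
                _, oldest = cache.popitem(last=False)  # evict first-in entity
                flush(oldest)
            cache[subject] = (set(), set())
        types, contexts = cache[subject]
        if predicate == RDF_TYPE:
            types.add(obj)
            contexts.add(context)

    for info in cache.values():  # write out what is still cached at the end
        flush(info)
    return index

# Hypothetical stream in which the statements about ex:a are interleaved.
quads = [
    ("ex:a", RDF_TYPE, "foaf:Person", "ds:1"),
    ("ex:b", RDF_TYPE, "foaf:Person", "ds:2"),
    ("ex:a", RDF_TYPE, "pim:Male", "ds:1"),
]

exact = stream_type_clusters(quads, cache_size=10)
lossy = stream_type_clusters(quads, cache_size=1)
```

With a cache of size 1, `ex:a` is evicted before its second type arrives, so its type cluster is split into two incomplete ones. This illustrates the kind of accuracy loss that a limited window can cause.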

Page 48: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Does it work well?

• Comparison of stream-based schema vs. gold standard schema
• On an 11 M triple data set

Page 49: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Does it scale?

Semantic Web Challenge: Billion Triples Track
• Provision of large scale RDF dataset
• Crawled from LOD

Task:
• Do something "useful"
• Do it (web-)scalable
• Do it with at least 1 billion triples

Presentation at ISWC

Page 50: Challenging Retrieval Scenarios: Social Media and Linked Open Data


BTC results

                       1st billion  2nd billion  full BTC
# triples              1 billion    1 billion    2.17 billion
# instances            187.7 M      222.6 M      450.0 M
# data sources         13.5 M       9.5 M        24.1 M
# type clusters        208.5 k      248.5 k      448.6 k
# equivalence classes  0.97 M       1.14 M       2.12 M
# triples index        29.1 M       24.8 M       54.7 M
Compression ratio      2.91%        2.48%        2.52%
# triples/sec.         40.5 k       45.6 k       39.5 k

1st place BTC'11
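The compression ratio in the table is the number of index triples divided by the number of input triples. The figures can be checked with a short calculation; the dictionary names below are illustrative only.

```python
# Values copied from the BTC results table.
index_triples = {"1st billion": 29.1e6, "2nd billion": 24.8e6, "full BTC": 54.7e6}
data_triples = {"1st billion": 1e9, "2nd billion": 1e9, "full BTC": 2.17e9}

# Compression ratio in percent, rounded to two decimals as in the table.
ratios = {
    name: round(100 * index_triples[name] / data_triples[name], 2)
    for name in index_triples
}
```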

Page 51: Challenging Retrieval Scenarios: Social Media and Linked Open Data


SchemEX: What comes next?

Hierarchy of semantic information:
• Type clusters
• Equivalence clusters
• Related types

Optimization
• Smarter caching
• Performance – Hadoop
• Error correction

Page 52: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Conclusion

Page 53: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Take away message

Web evolving in interesting directions
• Social networks, user generated content
• Semantic data

Challenges for IR
• Different settings
• Different tasks
• Question basic assumptions

Page 54: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Thank you!

Contact:
WeST – Institute for Web Science and Technologies
Universität Koblenz-Landau

Page 55: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Relevant Publications

1. A. Che Alhadi, S. Staab, and T. Gottron. Exploring user purpose writing single tweets. In WebSci ’11: Proceedings of the 3rd International Conference on Web Science, 2011.

2. A. Che Alhadi, T. Gottron, J. Kunegis, and N. Naveed, Livetweet: Microblog retrieval based on interestingness, in TREC’11: Proceedings of the Text Retrieval Conference 2011, 2011.

3. A. Che Alhadi, T. Gottron, J. Kunegis, and N. Naveed, Livetweet: Monitoring and predicting interesting microblog posts, in ECIR’12: Proceedings of the 34th European Conference on Information Retrieval, 2012. in preparation.

4. T. Gottron and N. Lipka, A comparison of language identification approaches on short, query-style texts, in ECIR ’10: Proceedings of the 32nd European Conference on Information Retrieval, pp. 611–614, Mar. 2010.

5. M. Konrath, T. Gottron, and A. Scherp. Schemex – web-scale indexed schema extraction of linked open data, in Semantic Web Challenge, Submission to the Billion Triple Track, 2011.

6. N. Naveed, T. Gottron, J. Kunegis, and A. Che Alhadi. Bad news travel fast: A content-based analysis of interestingness on twitter. In WebSci ’11: Proceedings of the 3rd International Conference on Web Science, 2011.

7. N. Naveed, T. Gottron, J. Kunegis, and A. Che Alhadi. Searching microblogs: Coping with sparsity and document quality. In CIKM’11: Proceedings of 20th ACM Conference on Information and Knowledge Management, 2011.

Page 56: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Attic

Page 57: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Use Cases

Business Partners (Extranet): SAP Community Network (SCN)
Communities
• Customers
• Partners
• Suppliers
• Developers
Business value
• Product support
• Services
• Find business partners
Volume
• 6,000 posts/day
• 1,700,000 subscribers
• 16GB log/day

Employees (Intranet): Lotus Connections
Communities
• Employees
• Working groups
• Interest groups
• Projects
Business value
• Task relevant information
• Collaboration
• Innovation
Volume
• 4,000 posts/day
• 386,000 employees
• 1.5GB content/day

Public Domain (Internet): MeaningMine
Communities
• Social media
• News
• Web fora
• Public communities
Business value
• Topics
• Opinions
• Service for partners
Volume
• 1,400,000 posts/day
• 708,000 web sources
• 45GB content/day

Page 58: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Twitter is different

Followers form a social graph
• PageRank applicable?!

BUT:
• Following is not (only) motivated by content
• No statement about tweets!

Page 59: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Information seeking behaviour on Twitter

Web
• 2-4 query terms
• Broader terms
• Intentions:
  • Navigation
  • Information
  • Resources
• Get to know a topic

Twitter
• 1-2 query terms
• Specific terms
• Intentions:
  • Timely information
  • Trends
  • People
• Follow a topic

Page 60: Challenging Retrieval Scenarios: Social Media and Linked Open Data


TREC

Microblog Track 2011
• 12,000,000 tweets
• 2 weeks
• 49 "topics" (queries)
• Task: Filtering

Constraints
• No external knowledge!
• English tweets only
• Temporal order of topic & tweets
• Official extension of "relevance" to "interestingness" (!!!)

Page 61: Challenging Retrieval Scenarios: Social Media and Linked Open Data


WeST @ TREC Microblog Track

Basics:
• Lucene
• No length normalisation
• Interestingness

4 configurations:
• WESTfilter: Retrieval via Lucene, filtering non-interesting tweets
• WESTfilext: like WESTfilter, but with sentiments
• WESTrelint: like WESTfilter, but re-ranking according to interestingness
• WESTrlext: like WESTrelint, but with sentiments

Page 62: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Results

• Filtering is significantly better than re-ranking
• Sentiments are a disadvantage (not significant)

[Figure: bar chart of scores (0 to 0.4) per metric (P5, P10, P15, P20, P30, R-prec, bpref, MAP, nDCG) for WESTfilter, WESTfilext, WESTrelint and WESTrlext]

Page 63: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Results

Effective especially for shorter queries

[Figure: MAP (0 to 0.3) by query length in words (1 to 7) for WESTfilext, WESTfilter, WESTrelint and WESTrlext]

Page 64: Challenging Retrieval Scenarios: Social Media and Linked Open Data


Schema representation using VoiD