evolving web, evolving search

Post on 23-Jan-2015


Evolving Web, Evolving Search

Yuan Tian & Tianqi Chen
Apex Data & Knowledge Management Lab

Shanghai Jiao Tong University

Agenda

Introduction to SJTU
Introduction to Apex Lab
Research
Demo


Shanghai Jiao Tong University

Location
History
Student
Campus


Apex Lab

Director: Professor Yong Yu
Associate Professor: Guirong Xue

Apex Lab

Research
  Web Search
  Social Web
  Semantic Search
  Machine Learning
  Image Search

Apex Lab

Project Partners

Apex Lab

Ph.D. Students: Haofen Wang, Jing Lu, Jia Chen, Guangcan Liu, Xian Wu, Yunbo Cao, Ruihua Song

35 Master Students


Research

Traditional Web
Social Web
Semantic Web
Machine Learning


Search on Traditional Web

Focus on how to
  improve search relevance?
  rank pages?
  integrate mining technologies into search?
  search finer grained objects instead of documents?

Search Applications
  General search engine
  Vertical search engine
  Meta search engine

Expert Search

Expert Search (introduction)

Treat web page as bag of words
Queries are not fully understood

Expert Search (motivation)

Searching for Experts:
• A more and more important information need
• PM search for Dev
• Patient search for Doctor
• Student search for Professor
• ……
• Not only in Enterprise
• But also on WWW

Query -> Ranked List of Experts

An Evidence: an expert and a query co-occur in a document under certain relation constraint
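
The evidence definition above can be sketched in code. This is a minimal illustration, not the lab's implementation: it assumes the "relation constraint" is plain token proximity, and the function name `rank_experts`, the `window` parameter, and the sample documents are all hypothetical.

```python
from collections import Counter


def rank_experts(documents, experts, query, window=10):
    """Rank experts by counting co-occurrence evidence with the query.

    An expert earns one piece of evidence per document in which the
    expert's name and every query term occur within `window` tokens.
    The window is an assumed stand-in for the "relation constraint".
    """
    query_terms = query.lower().split()
    scores = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for expert in experts:
            name = expert.lower().split()
            # Positions where the full expert name starts.
            starts = [i for i in range(len(tokens) - len(name) + 1)
                      if tokens[i:i + len(name)] == name]
            for start in starts:
                lo = max(0, start - window)
                hi = start + len(name) + window
                context = tokens[lo:hi]
                if all(term in context for term in query_terms):
                    scores[expert] += 1
                    break  # count each document at most once per expert
    return [expert for expert, _ in scores.most_common()]


# Hypothetical documents and expert names, for illustration only.
docs = [
    "Alice Wang gave a talk on semantic search at the lab",
    "Bob Li maintains the build system",
    "semantic search prototype was written by Alice Wang last year",
]
ranking = rank_experts(docs, ["Alice Wang", "Bob Li"], "semantic search")
print(ranking)
```

Experts with no evidence are left out of the ranked list; a real system would also weight evidence by document quality and by the type of relation.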


The Emergence of Web 2.0

Web gets social

Web 1.0 -> Web 2.0
Publishing -> Participation
Personal Websites -> Blogging
Content Management Systems -> Wikis
Britannica Online -> Wikipedia
Directories (taxonomy) -> Tagging ("folksonomy")

Lower the barrier for contribution.
More people are involved. They are less professional.

Search on Web 2.0

Focus on how to
  elaborate user involved data?
  search on new social media?

Deegle

(WWW 2006, WWW 2007, SIGIR 2008)

[Figure: search interface showing search results, related facets, related tags and related users]

Emotion Analysis on the Blog

A blog can be a source of news, but also a stage for expressing emotion
Enhancing the blog search for different users

Blog Search

Two types of blog

Informative article
  News that is similar to the news on traditional news websites
  Technical descriptions, e.g. programming techniques
  Commonsense knowledge
  Objective comments on the events in the world
Affective article
  Diaries about personal affairs
  Self-feelings or self-emotions descriptions

Intent-driven blog search (WWW 2007)

Rank  Informative Sense  Snippets

1   1.00   The catalogue of IBM certification: DB2 Database Administrator DB2 Application Developer MQSeries Engineer VisualAge For Java …

2   -0.94  Crazy Me! I have hesitated between Acer and smuggled IBM for one week. I wouldn’t have taken into account the price, quality or service if I had enough money …

3   1.00   Selling IBM laptop, t22p3-900, , dvd S3/, independent accelerating display card. 3550 YUAN. (Post fee not included). Please contact 30316255. We guarantee the quality. This product is only sold within Tianjing city ...

4   -0.35  I got a laptop from my friend this week. Although outdated, it is still a classical one in IBM enthusiast’s mind. There are many second hand IBM laptops in the market. Although I have sold many IBM laptops …

5   -0.53  Doctor said that I should make more preparations mentally. You have stayed with me for three years, leaving without any words. Do you feel fair for me? Do you remember the moments we were together? You are heartless, I hate you! ...

Rank  Informative Sense  Snippets

1   1.00   The catalogue of IBM certification: DB2 Database Administrator DB2 Application Developer MQSeries Engineer VisualAge For Java …

2   1.00   Biao Lin is a military talent. Stalin called him “the gifted general”. Americans called him “the unbeaten general”. Chiang Kai-shek called him “devil of war”. Biao Lin is a special person in modern history …

3   0.99   Microsoft’s hotmail can only be registered with suffix “@hotmail.com” by default. You can register @msn.com by visiting…

4   0.95   Yi Shang is still sending the file to me. I will practice it later. 1. Start up Instance (db2inst1) db2start; 2. Stop Instance (db2inst1) db2stop …

5   0.84   Name: Lei Zhang. Student number: 5030309959. Class number: 007. The analysis and review about the tendency of Jilin Chemical Industry’ stock in 2005. Date, Increasing and Decreasing ranges, Open Price, Close Price, Amount of deals …

6   0.01   Recently I like reading the Buddhist Scripture. I can learn philosophies in it. It makes me comfortable. It is from ...

7   -0.11  It’s out of my mind when I first saw it. The water seemed to be exuding from the building. There was much water on the floor of education building. Water was all around us, anywhere you can touch had water. …

8   -0.51  I read an article about the last emperor Po-yee today. I have watched “The Last Emperor” before, which realistically described his life without losing artistry. His love impressed me. As an emperor, he can’t choose the one he loved …

9   -0.53  She is 164 in height with white skin, black hair and long limp leg. I like the girl who has long hair and likes sport and dancing. I like sweet girls. …

10  -0.94  I have many things to do at the end of this semester. There are five final examinations, Discrete Mathematics, Communication Theory, Architecture of Computer, Algorithm and Law. I know little about them. OMG! Only four weeks are left. There are also two projects, Compiler and Operation System. Compiler can be easily completed but Operation System …
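
The informative-sense scores in the tables above run from about 1.00 (informative) down to about -0.94 (affective). A minimal sketch of how such a score could drive intent-driven ranking follows; the function name `rerank_by_intent` and the binary notion of intent are assumptions made for illustration, and the slides do not describe the actual ranking function.

```python
def rerank_by_intent(results, intent="informative"):
    """Order (score, snippet) pairs to match the user's intent.

    Scores near 1.0 are treated as informative and scores near -1.0 as
    affective, following the sign convention of the tables above.
    """
    if intent not in ("informative", "affective"):
        raise ValueError("intent must be 'informative' or 'affective'")
    # sorted() is stable, so ties keep their original relative order.
    return sorted(results, key=lambda item: item[0],
                  reverse=(intent == "informative"))


# Scores and abbreviated snippets taken from the first table.
results = [
    (-0.94, "Crazy Me! I have hesitated between Acer and ..."),
    (1.00, "The catalogue of IBM certification: DB2 ..."),
    (-0.35, "I got a laptop from my friend this week ..."),
]
informative_first = rerank_by_intent(results, "informative")
affective_first = rerank_by_intent(results, "affective")
print(informative_first[0][1])
print(affective_first[0][1])
```

A production system would presumably combine this score with topical relevance rather than sort on it alone.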


Our Vision of Semantic Web Search

• It covers most of the important topics in SW
• A lot of tools are built in each layer

• 10+ top papers (WWW’09, SIGMOD’09, SIGMOD’08, VLDB’07, ICDE’09, ISWC’07, etc)

Knowledge Engineering Layer

Ontology Engineering
  Orient: Integrating Ontology Engineering into Industry Tooling Environment (ISWC 2004)
Ontology Learning & Population
  EachWiki: Facilitating Semantics Reuse for Wikipedia Authoring (ISWC/ASWC 2007)
  PORE: Semi-supervised Positive Only Relation Extraction from Wikipedia (ISWC/ASWC 2007)
  HS Explorer: Unsupervised Hierarchical Semantics Explorer for Social Annotations (ISWC/ASWC 2007)
  Catriple: Extracting Triples from Wikipedia Categories (ASWC 2008)

Indexing and Search Layer

Ontology Query Engine based on DBMS
  SOR: A Practical System for OWL Ontology Storage, Reasoning and Search (VLDB 2007, SIGMOD 2008)
Annotation-based Semantic Search Engine (DB + IR)
  CE2: Towards Large Scale Annotation-based Semantic Search (CIKM 2008)
An Extension to IR index for Relational Search
  Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data (ISWC/ASWC 2007, ASWC 2008, WWW 2009, JWS)
Pattern-based RDF Store

SOR

Semantic Object Repository

Based on IBM DB2
Supports T-Box reasoning

Semplore

Extension to traditional IR engine

Ranking is considered

CE^2

Search over semantically annotated corpus

Combination of DB and IR search engines

Pattern-based RDF store

Learning to materialize join results
Efficient retrieval of pattern matches
Reasonable extra space -> Significant performance increase (on some dataset)
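
The space-for-time trade described above can be illustrated with a toy example. This is only a sketch under stated assumptions: a "pattern" is taken to be a two-hop predicate path, the pattern to materialize is picked by hand rather than learned, and the helper `materialize_path` and the sample triples are hypothetical.

```python
from collections import defaultdict


def materialize_path(triples, p1, p2):
    """Precompute (x, z) pairs matching: x -p1-> y and y -p2-> z."""
    by_subject = defaultdict(list)
    for s, p, o in triples:
        if p == p2:
            by_subject[s].append(o)
    joined = []
    for s, p, o in triples:
        if p == p1:
            for z in by_subject.get(o, []):
                joined.append((s, z))
    return joined


# Hypothetical RDF-style triples (subject, predicate, object).
triples = [
    ("paper1", "author", "alice"),
    ("paper2", "author", "bob"),
    ("alice", "affiliation", "SJTU"),
    ("bob", "affiliation", "SJTU"),
    ("paper1", "year", "2009"),
]

# The materialized table costs extra space but answers the two-hop
# pattern with a scan of precomputed pairs instead of a join.
paper_affiliation = materialize_path(triples, "author", "affiliation")
print(paper_affiliation)
```

Which patterns are worth the extra space depends on the query workload, which is presumably what the "learning" in the slide refers to.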

Query Interface and User Interaction Layer

Keyword Interface for Semantic Search
  Q2Semantic: Lightweight Ontology based Keyword Interpretation for Semantic Search (ESWC 2008, ICDE 2009)
Natural Language Interface for Semantic Search
  PANTO: A Portable Natural Language Interface to Ontologies (ESWC 2007)
Snippet Generation
  Snippet Generation for Semantic Web Search Engines (ASWC 2008)
Ontology Presentation
  ZoomRDF: Semantic-driven Fisheye Zooming for RDF Data (WWW 2010)

Q2Semantic

Structured queries vs. keyword queries

Structural data

RDF Snippet

Representation of search results

How will you know which answers are most relevant?

ZoomRDF


How to make them as a whole?

We focused on Semantic Web search
  Closed corpus / one single data source involved
  Just scale to million triples
  Uncertainty is not fully considered or used
We need Semantic Web search, however
  More than 11 million data sources (Web heterogeneity)
  More than 2 billion triples (Scalability)
  Uncertainty everywhere
Thus, we carefully consider the following topics
  Pay as you go for semantic data integration
  Semantic search engine towards billion triples
  User-friendly query Interface for Semantic Web

Missing
Let’s Forget

Hermes (2nd place Billion Triple Challenge, SIGMOD 2009, JWS)

[Figure: Hermes architecture]
Steps: 1. Integrate and index data sources; 2. Understand user’s need; 3. Search and refine
User actions: select a query, input keywords, refine or navigate
Example keywords: “Article Stanford Turing Award”
Interface panels: Results (Rudi Studer, Semantic Web...), Suggestions (Affiliations...)
Components: Keyword Translation (Keyword Mapping, Element Label Extraction), Mapping Discovery (Schema-level Mapping, Data-level Mapping), Graph Data Processing (Data Graph Summarization, Graph Element Scoring), Top-k Query Graph Search, Distributed Query Processing (Query Graph Decomposition, Query Planning & Optimization, Local Query Processing, Result Combination & Ranking)
Internal Indices: Keyword Index, Schema Index, Mapping Index, Graph Indices, Structure Index

Heterogeneous Transfer Learning

Machine Learning Team
APEXLAB
Shanghai Jiao Tong University

Machine Learning Team in APEX

Focus on machine learning and its application in Web mining and IR.
  Transfer learning
  Advertising techniques in Web
  Short text classification & clustering
  Multilingual search result integration

Outline

Introduction to heterogeneous transfer learning
Cross media: Text to Image
  Clustering
  Classification
Cross language: English to Chinese
Application: Visual Contextual Advertising


Traditional machine learning

Training data and test data are in the same distribution.

Training data: news
Test data

Transfer learning

Transfer learning: distributions are not identical.

Training data: news
Test data

Heterogeneous Transfer Learning

Learning across different feature spaces.

A fixed-wing aircraft, typically called an airplane, aeroplane or simply plane, is an aircraft capable of flight using forward motion that generates lift as the wing moves through the air…

An automobile, motor car or car is a wheeled motor vehicle used for transporting passengers, which also carries its own engine or motor...

Training data: Text Do…
Test data

Related Areas of Heterogeneous Learning

Multiple Domain Data
  Feature Space among Domains
    Heterogeneous: Instance Correspondences among Domains
      Each instance in one domain has its correspondences in other domains: Multi-view Learning
      There are few or no instance correspondence among domains: Heterogeneous Transfer Learning
    Homogeneous: Data Distribution among Domains
      Different: Transfer Learning across Different Distributions
      Same: Traditional Machine Learning

Source Domain / Target Domain examples:
  Apple is a fruit that can be found …
  Banana is the common name for…


Text to Images
[Dai et al. NIPS 2008] [Lin et al. APWeb 2010]

Mining and learning from multimedia data is becoming increasingly important

Limited by scarce labeled image data, can we use abundant text data in the Web?

Our answer is YES

Objective

[Figure: "Learning" and "translating learning", each mapping Input to Output]

Basic Ideas

Exploiting co-occurrence data as a bridge between text and image

Data Sets

Documents from ODP
Images from Caltech-256

Experimental Result

Approach 2: Naïve Bayes Way [Lin et al. APWeb 2010]

P(w|c)   P(v|w)
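
The two quantities named on this slide, P(w|c) and P(v|w), suggest chaining a text class model with a word-to-visual-feature model. One plausible reading, which the slide itself does not confirm, is P(v|c) = sum over w of P(v|w) P(w|c), followed by ordinary Naive Bayes over an image's visual words. The sketch below implements that reading; the function names and every number in the toy distributions are hypothetical.

```python
import math


def visual_word_model(p_w_given_c, p_v_given_w):
    """Compute P(v|c) = sum_w P(v|w) * P(w|c) for every class c."""
    model = {}
    for c, word_probs in p_w_given_c.items():
        model[c] = {}
        for w, p_wc in word_probs.items():
            for v, p_vw in p_v_given_w[w].items():
                model[c][v] = model[c].get(v, 0.0) + p_vw * p_wc
    return model


def classify(image_visual_words, model, prior):
    """Pick the class with the highest Naive Bayes log-posterior."""
    best_class, best_score = None, -math.inf
    for c, v_probs in model.items():
        score = math.log(prior[c])
        for v in image_visual_words:
            score += math.log(v_probs[v])
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy distributions (hypothetical numbers, not from the slides).
p_w_given_c = {
    "plane": {"wing": 0.8, "wheel": 0.2},
    "car": {"wing": 0.1, "wheel": 0.9},
}
p_v_given_w = {
    "wing": {"v_sky": 0.7, "v_road": 0.3},
    "wheel": {"v_sky": 0.2, "v_road": 0.8},
}
model = visual_word_model(p_w_given_c, p_v_given_w)
label = classify(["v_sky", "v_sky", "v_road"], model,
                 {"plane": 0.5, "car": 0.5})
print(label)
```

In this reading, P(w|c) would be estimated from labeled text and P(v|w) from co-occurrence data such as annotated images, so no labeled images are needed to build P(v|c).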

Text-aided Image Classification (TAIC)

Experiments: TAIC

Data sets: 9 binary classification data sets and 5 six-class classification data sets
  Image data from Caltech-256 and Fifteen scene
  Auxiliary text data from Open Directory Project
Baseline methods
  Base classifiers: Naïve Bayes (NBC) and Support vector machine (SVM)

Evaluation 1: Classification

                     Heterogeneous TL   No-Heterogeneous TL
Average Error Rate   0.318              0.334

4.8% error reduction
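
The 4.8% figure is consistent with a relative reduction computed from the two average error rates. The short check below assumes that is how the number was derived; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, improved):
    """Return the relative reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Average error rates reported on the slide.
error_reduction = relative_reduction(baseline=0.334, improved=0.318)
print(f"{error_reduction:.1%}")
```

The absolute gap is 0.016, so the choice between absolute and relative reporting matters when comparing this result with others.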


Text-aided Image Clustering
[Yang et al. ACL 2009]

Image clustering is an effective method for increasing accessibility of image search results

[Figure: "Apple" = one image OR another image]

But traditional clustering methods do not work well with a small amount of data

We consider using annotated images in the social Web to help image clustering

Annotated PLSA Model for Clustering

Leveraging the auxiliary text data by using the topics as a bridge

[Figure: model in which topics Z link words from the auxiliary data (from Flickr) with image features of the image data]

Making the transfer…

Log-likelihood objective function
  Two parts: image features and auxiliary text features
  Image feature to image instance correlation: A
  Word feature to image feature correlation: B

\[
\mathcal{L} = \sum_j \left[ \lambda \sum_i \frac{A_{ij}}{\sum_{j'} A_{ij'}} \log P(f_j \mid v_i) + (1 - \lambda) \sum_l \frac{B_{lj}}{\sum_{j'} B_{lj'}} \log P(f_j \mid w_l) \right]
\]

The denominators are normalization terms and \lambda is the tradeoff parameter between the likelihoods of the two parts.

Experiment Setup

Data sets: Generated from Caltech-256 and 15-scene corpora

Baseline methods
  Baseline clustering methods: KMeans, PLSA and STC
  Strategies:
    separate: clustering on target image data only
    combined: clustering target image data and annotated image data together and evaluate result for target image data

Experimental Result

[Figure: bar chart of entropy for KM_Separate, KM_Combine, PLSA_Separate, PLSA_Combine, STC and aPLSA]

                  Heterogeneous TL   No-Heterogeneous TL
Average Entropy   0.741              0.786

5.7% entropy reduction
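
Entropy is used here as the clustering quality measure, with lower values being better. The slides do not give the exact formula, so the sketch below assumes a common definition: the size-weighted average of the Shannon entropy (base 2) of the true labels inside each cluster. The function name and the toy labels are illustrative.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(cluster_ids, true_labels):
    """Size-weighted average entropy (base 2) of labels inside each cluster."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    total = len(true_labels)
    entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        h = -sum((n / len(labels)) * math.log2(n / len(labels))
                 for n in counts.values())
        entropy += (len(labels) / total) * h
    return entropy


# Two toy clusterings of four images with two true categories.
pure = clustering_entropy([0, 0, 1, 1], ["frog", "frog", "kayak", "kayak"])
mixed = clustering_entropy([0, 0, 1, 1], ["frog", "kayak", "frog", "kayak"])
print(pure, mixed)
```

A clustering that separates the categories perfectly scores zero, while one that mixes two categories evenly in every cluster scores one bit.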

Clustering Results on Caltech256 [Griffin et al. TR 2007]

[Figure: example clusters for frog, kayak, bear, jesus-christ, watch]


Cross-language Classification
[Ling et al. WWW 2008]

Text Classification: a classifier learns from labeled Chinese Web pages and classifies unlabeled Chinese Web pages

Cross-language Classification

Much labelled data in English, but little in Chinese.

Labeled Data   English                         Chinese
News           Reuters-21578                   ?
Newsgroups     20 Newsgroups                   ?
Web pages      Open Directory Project (> 1M)   Very few ODP data (< 20k, ~ 1%)

Cross-language Classification

Cross-language Classification: a classifier learns from labeled English Web pages and classifies unlabeled Chinese Web pages

Cross-language Classification

Information Bottleneck
  X: signals to be encoded (Web pages)
  X̃: codewords (class labels)
  Y: features related to X (terms)

Cross-language Classification

Optimization

minimize: Information betw…

Minimize this distance
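
The objective on this slide is cut off in the transcript. As a hedged illustration of the kind of quantity an information-bottleneck method tracks, the sketch below computes mutual information from a joint distribution and the information lost when pages X are merged into class labels. Treating that loss, I(X;Y) minus I(codewords;Y), as the objective is an assumption, and the joint distribution and label assignments are hypothetical.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): p}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def compress(joint, assignment):
    """Merge each page x into its class label assignment[x]."""
    merged = defaultdict(float)
    for (x, y), p in joint.items():
        merged[(assignment[x], y)] += p
    return dict(merged)


# Hypothetical joint distribution over (page, term).
joint = {
    ("page1", "db2"): 0.25, ("page2", "db2"): 0.25,
    ("page3", "diary"): 0.25, ("page4", "diary"): 0.25,
}
good = {"page1": "tech", "page2": "tech", "page3": "life", "page4": "life"}
bad = {"page1": "tech", "page2": "life", "page3": "tech", "page4": "life"}

loss_good = mutual_information(joint) - mutual_information(compress(joint, good))
loss_bad = mutual_information(joint) - mutual_information(compress(joint, bad))
print(loss_good, loss_bad)
```

A labeling that groups pages with the same term usage loses no information about the terms, while a labeling that cuts across them loses all of it.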

Cross-language Classification

Performance


Application: Visual Contextual Advertising [Chen et al. AAAI 2010]

Previous research focused on advertising for text Web pages.
With the booming of multimedia data, we need to recommend advertisement for these data
Difficulty: image and the text in different feature spaces
Use the co-occurrence data to bridge these two feature spaces

Figure illustration of Visual Contextual Advertising

Visual Contextual Advertising

We assume that there is … (based on the independent assumption)

Where …

Experimental Results

Co-occurrence data from Flickr.
Test Image from Flickr and Fifteen scene data set
Advertisement are crawled from MSN search engine with queries chosen from AOL query log.

Experimental Result

Thank you

For more details of APEXLAB http://apex.sjtu.edu.cn/apex_wiki/FrontPage

Our works http://apex.sjtu.edu.cn/apex_wiki/Papers
