
Post on 20-Dec-2015


Web classification

Ontology and Taxonomy

2

References

Using Ontologies to Discover Domain-Level Web Usage Profiles. H. Dai, B. Mobasher, DePaul University ({hdai,mobasher}@cs.depaul.edu)

Learning to Construct Knowledge Bases from the World Wide Web. M. Craven, D. DiPasquo, T. Mitchell, K. Nigam, S. Slattery, Carnegie Mellon University, Pittsburgh, USA; D. Freitag, A. McCallum, Just Research, Pittsburgh, USA

3

Definitions

Ontology: an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.

Taxonomy: a classification of organisms into groups based on similarities of structure, origin, etc.

4

Goal

Capture and model behavioral patterns and profiles of users interacting with a web site.

Why?
Collaborative filtering
Personalization systems
Improve the organization and structure of the site
Provide dynamic recommendations (www.recommend-me.com)

5

Algorithm 0 (by Rafa’s brother: Gabriel)

Recommend pages viewed by other users with similar page ranks.

Problems:
New item problem
Doesn't consider content similarity or item-to-item relationships.

6

User session

User session s: <w(p1,s), w(p2,s), ..., w(pn,s)>
w(pi,s) is the weight associated with page pi in session s.

Session clusters {cl1, cl2, ...}: each cli is a subset of the set of sessions.

Usage profile pr_cl = {<p, weight(p,pr_cl)> : weight(p,pr_cl) ≥ μ}
weight(p,pr_cl) = (1/|cl|) * Σ_{s∈cl} w(p,s)
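The profile construction above can be sketched directly from the formula; the function name and the dict-based session representation are illustrative assumptions, not from the paper.

```python
def usage_profile(cluster, mu):
    """cluster: list of sessions, each a dict page -> weight w(p, s).
    Returns {page: weight(p, pr_cl)} keeping pages with weight >= mu,
    where weight(p, pr_cl) = (1/|cl|) * sum over sessions of w(p, s)."""
    pages = set().union(*cluster)  # every page seen in the cluster
    profile = {}
    for p in pages:
        w = sum(s.get(p, 0.0) for s in cluster) / len(cluster)
        if w >= mu:  # keep only significant pages
            profile[p] = w
    return profile
```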

7

Algorithm 1
1. For every session, create a vector containing the viewed pages and a weight for each page.
2. Each vector represents a point in an N-dimensional space, so we may identify the clusters.
3. For a new session, check to which cluster this vector/point belongs, and recommend the high-scoring pages of that cluster.
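Steps 2-3 can be sketched as follows, assuming the clusters' usage profiles have already been computed; cosine similarity as the distance measure and all names here are illustrative assumptions.

```python
import math

def similarity(session, profile):
    """Cosine similarity between a session and a cluster's usage profile,
    both given as sparse dicts page -> weight."""
    dot = sum(w * profile.get(p, 0.0) for p, w in session.items())
    na = math.sqrt(sum(w * w for w in session.values()))
    nb = math.sqrt(sum(w * w for w in profile.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(session, profiles, top_n=3):
    """Assign the new session to the closest usage profile and suggest
    its highest-weight pages that the session has not visited yet."""
    best = max(profiles, key=lambda pr: similarity(session, pr))
    unseen = [(p, w) for p, w in best.items() if p not in session]
    return [p for p, _ in sorted(unseen, key=lambda t: -t[1])[:top_n]]
```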

Problems:
New item problem
Doesn't consider content similarity or item-to-item relationships.

8

Algorithm 2: keyword search
Solves the new item problem, but is not good enough:
A page can contain information about more than one object.
Fundamental data may be pointed to by the page rather than included in it.
What exactly is a keyword?

Solution: domain ontologies for objects

9

Domain Ontologies
Domain-Level Aggregate Profile: a set of pseudo objects, each characterizing objects of different types occurring commonly across the user sessions.

Class: C
Attribute: a = <Da, Ta, ≤a, Ψa>
Ta: type of the attribute
Da: domain of the values for a (red, blue, ...)
≤a: ordering relation among Da
Ψa: combination function

10

Example – movie web site
Classes: movies, actors, directors, etc.
Attributes:
Movies: title, genre, starring actors
Actors: name, filmography, gender, nationality

Functions:
Ψactor(<{S:0.7; T:0.2; U:0.1}, 1>, <{S:0.5; T:0.5}, 0.7>) = Σi(wi*wo) / Σi(wi)
Ψyear({1991}, {1994}) = {1991, 1994}
Ψis_a({person, student}, {person, TA}) = {person}
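The three combination functions can be sketched in Python: a significance-weighted average for actors (per Σi(wi*wo)/Σi(wi)), set union for years, and intersection for is_a. Function names and data shapes are illustrative assumptions.

```python
def psi_actor(*objects):
    """objects: (actor_weights, significance) pairs, where actor_weights
    is a dict actor -> weight. Weighted average per sum(wi*wo)/sum(wi)."""
    total = sum(sig for _, sig in objects)
    actors = set().union(*(aw for aw, _ in objects))
    return {a: sum(sig * aw.get(a, 0.0) for aw, sig in objects) / total
            for a in actors}

def psi_year(*year_sets):
    """Union of the year values."""
    return set().union(*year_sets)

def psi_is_a(*ancestor_sets):
    """Intersection: keep only the common ancestors."""
    return set.intersection(*map(set, ancestor_sets))
```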

11

Movie

Title: About a Boy
Genre: {Romantic; Comedy; Family}
Actors: {H. Grant: 0.6; R. Weisz: 0.1; T. Collette: 0.3}
Year: 2002

12

Creating an Aggregated Representation of a Usage Profile

pr = {<o1, w_o1>, ..., <on, w_on>}
oi is an object; w_oi is its significance in the profile pr.

Assume all the objects are instances of the same class.
Create a new virtual object o', with attributes ai' = Ψi(o1, ..., on).
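The construction of the virtual object o' can be sketched generically: apply each attribute's combination function Ψi to the (value, significance) pairs collected from the profile's objects. The combiner-map shape is an illustrative assumption.

```python
def make_pseudo_object(objects, weights, combiners):
    """objects: list of attribute dicts, all instances of one class;
    weights: significance w_oi of each object in the profile pr;
    combiners: attribute name -> Psi function taking (value, weight) pairs.
    Returns the virtual object o' with ai' = Psi_i(o1, ..., on)."""
    return {attr: psi([(o[attr], w) for o, w in zip(objects, weights)])
            for attr, psi in combiners.items()}
```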

13

Item-level usage profile

Name | Genre | Actor | Year
{A} | Genre-all → Romance; Romance → Comedy; Comedy → Kids & family | {S:0.7; T:0.2; U:0.1} | {2002}
{B} | Genre-all → Romance → Comedy | {S:0.5; T:0.5} | {1999}
{C} | Genre-all → Romance | {W:0.6; S:0.4} | {2001}
{A:1; B:1; C:1} | Genre-all → Romance | {S:0.58; T:0.27; W:0.09; U:0.05} | {1999, 2002}

14

A real (estate property) example

Property

Price | Location | Room num
{300K} | {Chicago} | {5}

15

Item-Level Usage Profile

Weight | Price | Location | Room num
1 | 475K | Chicago | 5
0.7 | 299K | Chicago | 4
0.18 | 272K | Evanston | 4
0.18 | 99K | Chicago | 3
1 | 365K | {Chicago, Evanston} | 4

16

Algorithm 2
Do not just recommend other items viewed by other users; recommend items similar to the class representative.

Advantages:
More accuracy
Needs fewer examples
No new item problem
Also considers content similarity (item-to-item relationships).

17

Item-Level Usage Profile

Weight | Price | Location | Room num
1 | 475K | Chicago | 5
0.7 | 299K | Chicago | 4
0.18 | 272K | Evanston | 4
0.18 | 99K | Chicago | 3
1 | 365K | {Chicago, Evanston} | 4
1 | 370K | Chicago | 4

18

Final Algorithm

Given a web site:
1. Classify its contents into classes and attributes.
2. Merge the objects of each user profile and create a pseudo-object.
3. Recommend according to this pseudo-object.

19

Problems
A per-topic solution
Found patterns can be incomplete
User patterns may change with time (for movies): the "I loved ET" problem.
Need cookies and other methods to identify users.
How is weight calculated? It can need many examples: the "I loved American Beauty" problem.
How to automatically group the web pages?

20

Break?

21

Constructing a Knowledge Base from the WWW
Goal:
Automatically create a computer-understandable knowledge base from the web.

Why?
To use in the previously described work, and similar:
"Find all universities that offer Java programming courses"
"Make me hotel and flight arrangements for the upcoming Linux conference"

22

…Constructing a Knowledge Base from the WWW

How?
Use machine learning to create information extraction methods for each of the desired types of knowledge.
Apply them to extract symbolic, probabilistic statements directly from the web: Student-of(Rafa, sdbi) = 99%

Method used:
Provide an initial ontology (classes and relations).
Training examples: 3 out of 4 university sites (8,000 web pages, 1,400 web-page pairs).

23

Example of web pages:

Fundamentals of CS Home Page
Instructors: Jim, Tom

Jim's Home Page
I teach several courses: Fundamentals of CS, Intro to AI
My research includes intelligent web agents

Classes: Faculty, Research-project, Student, Staff, (Person), Course, Department, Other
Relations: instructor-of, members-of-project, department-of.

24

[Figure: the ontology alongside Web KB instances. Ontology tree: Entity (home page, home page title, activity, other) → Person (department of, project of, courses taught by, name of) → Faculty (projects led by, students of); Course (instructors of, TAs of); Research Project (members of project). Instances: Jim (courses taught by: Fundamentals of CS, Intro to AI; home page: ...); Fundamentals of CS (instructors of: Jim, Tom; home page: ...).]

25

Problem assumption: one class instance per web page. This ignores:
Multiple instances in one web page
Multiple linked/related web pages for one instance
The Elvis problem

A relation R(A,B) is represented by:
Hyperlinks A→B or A→C→D→...→B
Inclusion in a particular context ("I teach Intro2cs")
A statistical model of typical words

26

To Learn

1. Recognizing class instances by classifying bodies of hypertext
2. Recognizing relation instances by classifying chains of hyperlinks
3. Extracting text fields

27

Recognizing class instances by classifying bodies of hypertext

1. Statistical bag-of-words approach over:
   1. Full text
   2. Hyperlinks
   3. Title/Head
2. Learning first-order rules

Combine the previous four methods.

28

Statistical bag-of-words approach

Context-less classification.
Given a set of classes C = {c1, c2, ..., cN} and a document consisting of n ≤ 2000 words {w1, w2, ..., wn}:
c* = argmax_c Pr(c | w1, ..., wn)
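The argmax above is typically computed with a naive Bayes model, c* = argmax_c Pr(c) Πi Pr(wi|c). A minimal multinomial sketch with Laplace smoothing follows; the paper's exact estimator, vocabulary pruning, and smoothing are not reproduced, so treat this as illustrative.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (class_label, list_of_words). Returns log priors and
    Laplace-smoothed log word likelihoods per class."""
    vocab = {w for _, ws in docs for w in ws}
    classes = {c for c, _ in docs}
    prior = {c: math.log(sum(1 for c2, _ in docs if c2 == c) / len(docs))
             for c in classes}
    counts = {c: Counter() for c in classes}
    for c, ws in docs:
        counts[c].update(ws)
    like = {c: {w: math.log((counts[c][w] + 1) /
                            (sum(counts[c].values()) + len(vocab)))
                for w in vocab} for c in classes}
    return prior, like

def classify(model, words):
    """c* = argmax_c Pr(c) * prod_i Pr(wi | c), computed in log space;
    out-of-vocabulary words are simply ignored."""
    prior, like = model
    return max(prior, key=lambda c: prior[c] +
               sum(like[c].get(w, 0.0) for w in words))
```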

29

Confusion matrix (rows: predicted class; columns: actual class):

predicted \ actual | cours | stud | facu | staff | rese | dept | other | Accuracy
Course | 202 | 17 | 0 | 0 | 1 | 0 | 552 | 26.2
Student | 0 | 421 | 14 | 17 | 2 | 0 | 519 | 43.3
Faculty | 5 | 56 | 118 | 16 | 3 | 0 | 264 | 17.9
Staff | 0 | 15 | 1 | 4 | 0 | 0 | 45 | 6.2
Research | 8 | 9 | 10 | 5 | 62 | 0 | 384 | 13
Department | 1 | 0 | 8 | 3 | 1 | 54 | 20 | 91.7
Other | 19 | 32 | 7 | 3 | 12 | 0 | 1064 | 93.6
Coverage | 82.8 | 75.4 | 77.1 | 8.7 | 72.9 | 100 | 35

30

Statistical bag-of-words approach: top words per class, ranked by Pr(wi|c) log(Pr(wi|c)/Pr(wi|~c)) (D stands for a digit):

student: my 0.0247, page 0.0109, home 0.0104, am 0.0085, university 0.0061, computer 0.0060, science 0.0059, me 0.0058, at 0.0049, here 0.0046
faculty: DDDD 0.0138, of 0.0113, and 0.0109, professor 0.0088, computer 0.0073, research 0.0060, science 0.0057, university 0.0049, DDD 0.0042, systems 0.0042
course: course 0.0151, DD:DD 0.0130, homework 0.0106, will 0.0088, D 0.0080, assignments 0.0079, class 0.0073, hours 0.0059, assignment 0.0058, due 0.0058
research-project: group 0.0060, project 0.0049, research 0.0049, of 0.0030, laboratory 0.0029, systems 0.0028, and 0.0027, our 0.0026, system 0.0024
department: department 0.0179, science 0.0153, computer 0.0111, faculty 0.0070, information 0.0069, undergraduate 0.0058, graduate 0.0047, sta 0.0045, server 0.0042
other: D 0.0374, DD 0.0246, the 0.0153, eros 0.001, hplayD 0.0097, uDDb 0.0067, to 0.0064, bluto 0.0052, gt 0.0050

31

Accuracy/Coverage tradeoff for full-text classifiers

32

Accuracy/coverage tradeoff for hyperlinks classifiers

33

Accuracy/Coverage for title heading classifiers

34

Learning first order rules

The previous method doesn't consider relations between pages.

Example: a page is a course home page if it contains the words "textbook" and "TA" and points to a page containing the word "assignment".

FOIL is a learning system that constructs Horn-clause programs from examples.

35

Relations:
has_word(Page). Words are stemmed: computer = computing = comput. A word is kept if it has 200 occurrences but appears in less than 30% of the other classes' pages.
link_to(Page, Page)

m-estimate accuracy = (nc + m*p) / (n + m)
nc: number of instances correctly classified by the rule
n: total number of instances classified by the rule
m = 2
p: proportion of instances in the training set that belong to that class

Predict each class with confidence = best_match / total_#_of_matches
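The m-estimate above is a one-liner; a direct transcription of the slide's formula, with m defaulting to 2 as stated:

```python
def m_estimate(n_c, n, p, m=2):
    """m-estimate accuracy of a rule: (nc + m*p) / (n + m).
    n_c: instances correctly classified by the rule;
    n: total instances classified by the rule;
    p: the class's proportion in the training set; m = 2 on the slide."""
    return (n_c + m * p) / (n + m)
```

The m*p term pulls the estimate toward the class prior, so a rule matching very few examples is not credited with near-perfect accuracy.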

36

Newly learned rules:

student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).

faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).

course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B,1)), has_assign(B).

37

Accuracy/coverage for FOIL page classifiers

38

Boosting

Which classifier gives the best prediction depends on the class.
Combine the predictions using the confidence measure.

39

Accuracy/coverage tradeoff for combined classifiers (2000 words vocabulary)

40

Boosting

Disappointing: somehow it is not uniformly better.

Possible solutions:
Use reduced-size dictionaries (next)
Use other methods for combining predictions (voting instead of best_match / total_#_of_matches)

41

Accuracy/coverage tradeoff for combined classifiers (200 words vocabulary)

42

Multi-Page segments
The group is the longest prefix (indicated in parentheses):
(@/{user,faculty,people,home,projects}/*)/*.{html,htm}
(@/{cs???,www/,*})/*.{html,htm}
(@/{cs???,www/,*})/
...

A primary page is any page whose URL matches:
@/index.{html,htm}
@/home.{html,htm}
@/%1/%1.{html,htm}
...

If no page in the group matches one of these patterns, then the page with the highest score for any non-Other class is the primary page.
Any non-primary page is tagged as Other.
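The primary-page patterns can be checked with regular expressions, reading @ as the group's URL prefix and %1 as a back-reference to the same name; this is a sketch of the matching step only, not the full grouping heuristic.

```python
import re

# Patterns from the slide: @/index.{html,htm}, @/home.{html,htm},
# @/%1/%1.{html,htm} (a directory whose page repeats the directory name).
PRIMARY = [
    re.compile(r"^index\.html?$"),
    re.compile(r"^home\.html?$"),
    re.compile(r"^([^/]+)/\1\.html?$"),  # \1 plays the role of %1
]

def is_primary(path):
    """path: page URL relative to the group prefix '@'."""
    return any(p.match(path) for p in PRIMARY)
```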

43

Accuracy/coverage tradeoff for the full text after URL grouping heuristics

44

Conclusion – Recognizing Classes
Hypertext provides redundant information.

We can classify using several methods:
Full text
Heading/title
Hyperlinks
Text in neighboring pages + grouping pages

No method alone is good enough. Combining the classifiers' predictions gives a better result.

45

Learning to Recognize Relation Instances
Assume: relations are represented by hyperlinks.

Given the following background relations:
class(Page)
link_to(Hyperlink, P1, P2)
has_word(H): the word is part of the hyperlink
all_words_capitalized(H)
has_alphanumeric_word(H): e.g. "I Teach CS2765"
has_neighborhood_word(H): neighborhood = paragraph

46

…Learning to Recognize Relation Instances

Try to learn the following:
members_of_project(P1,P2)
instructors_of_course(P1,P2)
department_of_person(P1,P2)

47

Learned relations:

instructors_of(A,B) :- course(A), person(B), link_to(C,B,A).
Test set: 133 pos, 5 neg

department_of(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_word_graduate(E).
Test set: 371 pos, 4 neg

members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), has_neighborhood_word_people(C).
Test set: 18 pos, 0 neg

48

Accuracy/Coverage tradeoff for learned relation rules

49

Learning to Extract Text Fields

Sometimes we want a small fragment of text, not a whole web page or class: a person's name (Jon, Peter, etc.), as in "Make me hotel and flight arrangements for the upcoming Linux conference".

50

Predefined predicates

Let F = w1, w2, ..., wj be a fragment of text.
length({<, >, =, ...}, N)
some(Var, Path, Feat, Value): e.g. some(A, [next_token, next_token], numeric, true)
position(Var, From, Relop, N)
relpos(Var1, Var2, Relop, N)

51

A wrong example

ownername(Fragment) :-
some(A, [prev_token], word, "gmt"),
some(A, [], in_title, true),
some(A, [], word, unknown),
some(A, [], quadrupletonp, false),
length(<, 3)

Example page (the "gmt" comes from the Last-Modified header):

Last-Modified: Wednesday, 26-Jun-96 01:37:46 GMT
<title>Bruce Randall Donald</title>
<h1><img src="ftp://ftp.cs.cornell.edu/pub/brd/images/brd.gif"><p>Bruce Randall Donald<br>Associate Professor<br>

52

Accuracy/coverage tradeoff for Name Extraction

53

Conclusions
Used machine learning algorithms to create information extraction methods for each desired type of knowledge.

WebKB achieves 70% accuracy at 30% coverage.

Bag-of-words classifiers (hyperlinks, web pages and full text) and first-order learning can be combined to boost the confidence.

First-order learning can be used to look outward from the page and consider its neighbors.

54

Problems
Not as accurate as we want:
You can get more accuracy at the cost of coverage
Use linguistic features (verbs)
Add new methods to the booster (predict the department of a professor based on the departments of his student advisees)

A per-topic, per-language, per-... method.
Needs hand-made labeling to learn. Learners with high accuracy can be used to teach learners with low accuracy.
