Post on 20-Dec-2015
Web classification
Ontology and Taxonomy
2
References

Using Ontologies to Discover Domain-Level Web Usage Profiles. H. Dai, B. Mobasher {hdai,mobasher}@cs.depaul.edu, DePaul University.
Learning to Construct Knowledge Bases from the World Wide Web. M. Craven, D. DiPasquo, T. Mitchell, K. Nigam, S. Slattery, Carnegie Mellon University, Pittsburgh, USA; D. Freitag, A. McCallum, Just Research, Pittsburgh, USA.
3
Definitions

Ontology: an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.

Taxonomy: a classification of organisms into groups based on similarities of structure, origin, etc.
4
Goal

Capture and model behavioral patterns and profiles of users interacting with a web site.

Why?
Collaborative filtering
Personalization systems
Improve the organization and structure of the site
Provide dynamic recommendations (www.recommend-me.com)
5
Algorithm 0 (by Rafa's brother, Gabriel)

Recommend pages viewed by other users with similar page ranks.

Problems
New item problem
Doesn't consider content similarity or item-to-item relationships.
6
User session

User session s: <w(p1,s), w(p2,s), ..., w(pn,s)>
w(pi,s) is the weight associated with page pi in session s.

Session clusters {cl1, cl2, ...}: each cli is a subset of the set of sessions.

Usage profile: pr_cl = {<p, weight(p,pr_cl)> : weight(p,pr_cl) ≥ μ}
weight(p,pr_cl) = (1/|cl|) * Σ_{s∈cl} w(p,s)
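The profile definition above can be sketched directly in code; a minimal sketch, assuming each session is a dict mapping pages to their weights w(p,s) (all names here are illustrative):

```python
def usage_profile(cluster, mu):
    """Compute the usage profile of a session cluster.

    `cluster` is a list of sessions, each a dict mapping page -> w(p, s);
    pages absent from a session have weight 0.  The profile keeps every
    page whose mean weight across the cluster is >= mu.
    """
    pages = {p for session in cluster for p in session}
    profile = {}
    for p in pages:
        weight = sum(s.get(p, 0.0) for s in cluster) / len(cluster)
        if weight >= mu:
            profile[p] = weight
    return profile
```

For example, a two-session cluster where a.html is viewed heavily and b.html only once yields a profile containing a.html alone when μ = 0.5.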
7
Algorithm 1

1. For every session, create a vector containing the viewed pages and a weight for each page.
2. Each vector represents a point in an N-dimensional space, so we may identify the clusters.
3. For a new session, check to which cluster this vector/point belongs, and recommend the high-scoring pages of this cluster.

Problems
New item problem
Doesn't consider content similarity or item-to-item relationships.
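The three steps above can be sketched as follows. The slide does not fix a clustering method, so representing each cluster by its centroid and assigning the new session to the nearest one is an assumption of this sketch:

```python
import math

def centroid(cluster, pages):
    # Mean session vector of the cluster over a fixed page ordering.
    return [sum(s.get(p, 0.0) for s in cluster) / len(cluster) for p in pages]

def recommend(clusters, new_session, top_n=3):
    # Step 1-2: a fixed page ordering maps every session to a point
    # in N-dimensional space.
    pages = sorted({p for cl in clusters for s in cl for p in s}
                   | set(new_session))
    v = [new_session.get(p, 0.0) for p in pages]
    # Step 3: assign the new session to the nearest cluster (by centroid)...
    best = min(clusters, key=lambda cl: math.dist(v, centroid(cl, pages)))
    # ...and recommend its high-scoring pages the user hasn't viewed yet.
    scores = {p: sum(s.get(p, 0.0) for s in best) / len(best) for p in pages}
    return [p for p, w in sorted(scores.items(), key=lambda kv: -kv[1])
            if w > 0 and p not in new_session][:top_n]
```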
8
Algorithm 2: keyword search

Solves the new item problem, but is not good enough:
A page can contain info about more than one object.
Fundamental data can be pointed to by the page rather than included in it.
What exactly is a keyword?

Solution: domain ontologies for objects.
9
Domain Ontologies

Domain-Level Aggregate Profile: a set of pseudo objects, each characterizing objects of different types occurring commonly across the user sessions.

Class: C
Attribute a: <Da, Ta, ≤a, Ψa>
Ta: type of the attribute
Da: domain of the values for a (red, blue, ...)
≤a: ordering relation over Da
Ψa: combination function
10
Example – movie web site

Classes: movies, actors, directors, etc.
Attributes:
Movies: title, genre, starring actors
Actors: name, filmography, gender, nationality

Combination functions:
Ψactor(<{S:0.7; T:0.2; U:0.1}, 1>, <{S:0.5; T:0.5}, 0.7>) = Σi(wi*wo) / Σi(wi)
Ψyear({1991}, {1994}) = {1991, 1994}
Ψis_a({person, student}, {person, TA}) = {person}
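These combination functions can be sketched as follows. Reading Ψactor as a significance-weighted average of the actor-weight sets is an assumption based on the slide's formula; the other two are the union and intersection the examples show:

```python
def psi_year(*value_sets):
    # Ψ_year: simple union of the observed values.
    return set().union(*value_sets)

def psi_is_a(*value_sets):
    # Ψ_is_a: keep only the concepts common to all objects.
    return set.intersection(*map(set, value_sets))

def psi_actor(*weighted_sets):
    # Ψ_actor: each argument is (actor_weights, object_significance);
    # combine as a significance-weighted average of the actor weights.
    total = sum(sig for _, sig in weighted_sets)
    actors = {a for weights, _ in weighted_sets for a in weights}
    return {a: sum(sig * w.get(a, 0.0) for w, sig in weighted_sets) / total
            for a in actors}
```

With the slide's inputs, psi_actor gives S ≈ 0.62, T ≈ 0.32, U ≈ 0.06, and the weights still sum to 1.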
11
Movie

Title: About a boy
Genre: {Romantic; Comedy; Family}
Actor: {H. Grant: 0.6; R. Weisz: 0.1; T. Collete: 0.3}
Year: 2002
12
Creating an Aggregated Representation of a usage profile

pr = {<o1, w_o1>, ..., <on, w_on>}
oi: object; w_oi: significance of oi in the profile pr.

Assume all the objects are instances of the same class.
Create a new virtual object o' with attributes ai' = Ψi(o1, ..., on).
13
Item level usage profile

Name             Genre                              Actor                  Year
{A}              Genre-all: Romance, Romance        {S:0.7; T:0.2; U:0.1}  {2002}
                 Comedy, Comedy, Kids & family
{B}              Genre-all: Romance, Comedy         {S:0.5; T:0.5}         {1999}
{C}              Genre-all: Romance                 {W:0.6; S:0.4}         {2001}
{A:1; B:1; C:1}  Genre-all: Romance                 {S:0.58; T:0.27;       {1999, 2002}
                                                    W:0.09; U:0.05}
14
A real (estate property) example

Property
Price: {300K}  Location: {Chicago}  Room num: {5}
15
Item Level Usage Profile

Weight  Price  Location             Room num
1       475K   Chicago              5
0.7     299K   Chicago              4
0.18    272K   Evanston             4
0.18    99K    Chicago              3
1       365K   {Chicago, Evanston}  4
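The last row of the profile above is the aggregated pseudo-object. A minimal sketch of that aggregation, assuming numeric attributes combine by a significance-weighted average and the location attribute by set union (the helper name is illustrative; prices are in thousands):

```python
def aggregate_profile(rows):
    """rows: (weight, price_in_K, location, rooms) tuples from the profile."""
    total = sum(w for w, *_ in rows)
    # Weighted average for the numeric attributes, union for locations.
    price = sum(w * p for w, p, _, _ in rows) / total
    rooms = sum(w * r for w, _, _, r in rows) / total
    locations = {loc for _, _, loc, _ in rows}
    return round(price), locations, round(rooms)
```

Applied to the four profile rows this reproduces the aggregate row: 365K, {Chicago, Evanston}, 4 rooms.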
16
Algorithm 3

Do not just recommend items viewed by other users; recommend items similar to the class representative.

Advantages:
More accuracy
Needs fewer examples
No new item problem
Also considers content similarity (item-to-item relationships).
17
Item Level Usage Profile

Weight  Price  Location             Room #
1       475K   Chicago              5
0.7     299K   Chicago              4
0.18    272K   Evanston             4
0.18    99K    Chicago              3
1       365K   {Chicago, Evanston}  4
1       370K   Chicago              4
18
Final Algorithm

Given a web site:
1. Classify its contents into classes and attributes.
2. Merge the objects of each user profile and create a pseudo object.
3. Recommend according to this pseudo-object.
19
Problems

A per-topic solution
Found patterns can be incomplete
User patterns may change with time (for movies): the "I loved ET" problem.
Needs cookies and other methods to identify users.
How is the weight calculated? Can need many examples: the "I loved American Beauty" problem.
How to automatically group the web pages?
20
Break? (Hafsaka)
21
Constructing a Knowledge Base from the WWW

Goal: automatically create a computer-understandable knowledge base from the web.

Why?
To use in the previously described work, and similar
Find all universities that offer Java Programming courses
Make me hotel and flight arrangements for the upcoming Linux conference
22
…Constructing a Knowledge Base from the WWW

How?
Use machine learning to create information extraction methods for each of the desired types of knowledge.
Apply them to extract symbolic, probabilistic statements directly from the web: Student-of(Rafa, sdbi) = 99%

Method used:
Provide an initial ontology (classes and relations)
Training examples: 3 out of 4 university sites (8000 web pages, 1400 web-page pairs)
23
Example of web pages

Fundamentals of CS Home Page
Instructors: Jim, Tom

Jim's Home Page
I teach several courses: Fundamentals of CS, Intro to AI.
My research includes intelligent web agents.

Classes: Faculty, Research-project, Student, Staff, (Person), Course, Department, Other
Relations: instructor-of, members-of-project, department-of
24
[Figure: the ontology and Web KB instances. The class hierarchy roots at Entity (attributes: home page, home page title, activity, other) and specializes into Person (department-of, project-of, course-taught-by, name-of), Course (instructor-of, TAs-of), Faculty (project-led-by), Student (student-of), and Research-project (members-of-project). Example instances: Jim, with courses taught Fundamentals of CS and Intro to AI and a home page; the course Fundamentals of CS, with instructors Jim and Tom and a home page.]
25
Problem Assumptions

Assumed: one class instance per web page. This breaks when:
Multiple instances appear in one web page
Multiple linked/related web pages describe one instance
The Elvis problem

A relation R(A,B) is represented by:
Hyperlinks A→B or A→C→D→...→B
Inclusion in a particular context ("I teach Intro2cs")
A statistical model of typical words
26
To Learn

1. Recognizing class instances by classifying bodies of hypertext
2. Recognizing relation instances by classifying chains of hyperlinks
3. Extracting text fields
27
Recognizing class instances by classifying bodies of hypertext

1. Statistical bag-of-words approach
   1. Full text
   2. Hyperlinks
   3. Title/Head
2. Learning first-order rules
Combine the previous 4 methods
28
Statistical bag-of-words approach

Context-less classification:
Given a set of classes C = {c1, c2, ..., cN}
Given a document consisting of n ≤ 2000 words {w1, w2, ..., wn}
c* = argmax_c Pr(c | w1, ..., wn)
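The argmax above is the standard naive Bayes bag-of-words classifier; a minimal sketch (Laplace smoothing is this sketch's choice, not something the slide specifies):

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class maps class -> list of documents (token lists)."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        n = sum(counts.values())
        # Laplace-smoothed log Pr(w|c), plus the log prior Pr(c).
        model[c] = (math.log(len(docs) / total_docs),
                    {w: math.log((counts[w] + 1) / (n + len(vocab)))
                     for w in vocab})
    return model

def classify(model, words):
    # c* = argmax_c Pr(c) * prod_i Pr(w_i|c), computed in log space.
    def score(c):
        prior, logp = model[c]
        return prior + sum(logp.get(w, 0.0) for w in words)
    return max(model, key=score)
```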
29
Confusion matrix (rows: predicted class; columns: actual class)

        cours  stud  facu  staff  rese  dept  other  Accuracy
Cours     202    17     0      0     1     0    552      26.2
Stud        0   421    14     17     2     0    519      43.3
Facu        5    56   118     16     3     0    264      17.9
Staff       0    15     1      4     0     0     45       6.2
Rese        8     9    10      5    62     0    384      13.0
Dept        1     0     8      3    15     4    209       1.7
Other      19    32     7      3    12     0   1064      93.6
Coverage 82.8  75.4  77.1    8.7  72.9   100     35
30
Statistical bag-of-words approach: most predictive words per class, ranked by Pr(wi|c) * log(Pr(wi|c)/Pr(wi|~c)) (a "D" in a token stands for a digit):

student: my 0.0247, page 0.0109, home 0.0104, am 0.0085, university 0.0061, computer 0.0060, science 0.0059, me 0.0058, at 0.0049, here 0.0046
faculty: DDDD 0.0138, of 0.0113, and 0.0109, professor 0.0088, computer 0.0073, research 0.0060, science 0.0057, university 0.0049, DDD 0.0042, systems 0.0042
course: course 0.0151, DD:DD 0.0130, homework 0.0106, will 0.0088, D 0.0080, assignments 0.0079, class 0.0073, hours 0.0059, assignment 0.0058, due 0.0058
research-project: group 0.0060, project 0.0049, research 0.0049, of 0.0030, laboratory 0.0029, systems 0.0028, and 0.0027, our 0.0026, system 0.0024
department: department 0.0179, science 0.0153, computer 0.0111, faculty 0.0070, information 0.0069, undergraduate 0.0058, graduate 0.0047, sta 0.0045, server 0.0042
other: D 0.0374, DD 0.0246, the 0.0153, eros 0.0010, hplayD 0.0097, uDDb 0.0067, to 0.0064, bluto 0.0052, gt 0.0050
31
Accuracy/Coverage tradeoff for full-text classifiers
32
Accuracy/coverage tradeoff for hyperlinks classifiers
33
Accuracy/coverage tradeoff for title/heading classifiers
34
Learning first-order rules

The previous method doesn't consider relations between pages. For example: a page is a course home page if it contains the words "textbook" and "TA" and points to a page containing the word "assignment".

FOIL is a learning system that constructs Horn-clause programs from examples.
35
Relations

has_word(Page): stemmed words (computer = computing = comput) with at least 200 occurrences, but appearing in fewer than 30% of the other classes' pages.
link_to(Page, Page)

m-estimate accuracy = (nc + m*p) / (n + m)
nc: number of instances correctly classified by the rule
n: total number of instances classified by the rule
m = 2
p: proportion of instances in the training set that belong to that class

Predict each class with confidence = best_match / total_#_of_matches
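The m-estimate formula above is a one-liner; a minimal sketch with the slide's definitions:

```python
def m_estimate(nc, n, p, m=2):
    """m-estimate of a rule's accuracy: (nc + m*p) / (n + m).

    nc: instances the rule classifies correctly
    n:  total instances the rule classifies
    p:  prior proportion of the class in the training set
    m:  equivalent-sample-size parameter (the slide's setting is m=2)
    """
    return (nc + m * p) / (n + m)
```

Note how a rule that covers nothing (n = nc = 0) falls back to the class prior p, which is the point of the m-estimate over raw accuracy nc/n.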
36
Newly learned rules

student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).

faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).

course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B, 1)), has_assign(B).
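To see how such a Horn clause is evaluated, the faculty rule can be sketched over a toy page graph (the page/word/link encoding here is an assumption for illustration, not the paper's representation):

```python
def has_word(pages, page, word):
    # has_<word>(Page) in the rules above: the stemmed word occurs on the page.
    return word in pages[page]["words"]

def link_to(links, src, dst):
    return (src, dst) in links

def faculty(pages, links, a):
    """faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B)."""
    return (has_word(pages, a, "professor") and has_word(pages, a, "ph")
            and any(link_to(links, b, a) and has_word(pages, b, "faculti")
                    for b in pages))
```

A page containing "professor" and "ph" only counts as faculty if some page containing "faculti" links to it, which is exactly the look-outward behavior FOIL's relational rules add.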
37
Accuracy/coverage for FOIL page classifiers
38
Boosting

The best classifier depends on the class.
Combine the predictions using the confidence measure.
39
Accuracy/coverage tradeoff for combined classifiers (2000 words vocabulary)
40
Boosting

Disappointing: somehow it is not uniformly better.

Possible solutions:
Use reduced-size dictionaries (next)
Use other methods for combining predictions (voting instead of best_match / total_#_of_matches)
41
Accuracy/coverage tradeoff for combined classifiers (200 words vocabulary)
42
Multi-Page segments

The group is the longest matching prefix (indicated in parentheses):
(@/{user,faculty,people,home,projects}/*)/*.{html,htm}
(@/{cs???,www/,*})/*.{html,htm}
(@/{cs???,www/,*})/
...

A primary page is any page whose URL matches:
@/index.{html,htm}
@/home.{html,htm}
@/%1/%1.{html,htm}
...

If no page in the group matches one of these patterns, then the page with the highest score for any non-Other class is the primary page.
Any non-primary page is tagged as Other.
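The primary-page patterns can be sketched with ordinary regexes. This is one loose reading of the slide's notation, where '@' stands for the group prefix and '%1/%1' means a directory whose index file repeats the directory name; both the translation and the character class are assumptions:

```python
import re

# Primary-page patterns, loosely translated from the slide's notation.
PRIMARY = [r"index\.html?", r"home\.html?", r"([\w~-]+)/\1\.html?"]

def is_primary(url, group_prefix):
    """True if the part of the URL after the group prefix matches a pattern."""
    rest = url[len(group_prefix):].lstrip("/")
    return any(re.fullmatch(p, rest) for p in PRIMARY)
```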
43
Accuracy/coverage tradeoff for the full text after URL grouping heuristics
44
Conclusion: Recognizing Classes

Hypertext provides redundant information.
We can classify using several methods:
Full text
Heading/title
Hyperlinks
Text in neighboring pages + grouping pages

No method alone is good enough; combining the classifiers' predictions gives a better result.
45
Learning to Recognize Relation Instances

Assume relations are represented by hyperlinks.

Given the following background relations:
class(Page)
link_to(Hyperlink, P1, P2)
has_word(H): the word is part of the hyperlink
all_words_capitalized(H)
has_alphanumeric_word(H): e.g. "I Teach CS2765"
has_neighborhood_word(H): neighborhood = paragraph
46
…Learning to Recognize Relation Instances

Try to learn the following:
members_of_project(P1, P2)
instructors_of_course(P1, P2)
department_of_person(P1, P2)
47
Learned relations

instructors_of(A,B) :- course(A), person(B), link_to(C,B,A).
Test set: 133 Pos, 5 Neg

department_of(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_word_graduate(E).
Test set: 371 Pos, 4 Neg

members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), has_neighborhood_word_people(C).
Test set: 18 Pos, 0 Neg
48
Accuracy/Coverage tradeoff for learned relation rules
49
Learning to Extract Text Fields

Sometimes we want a small fragment of text (like Jon, Peter, etc.), not the whole web page or class:
"Make me hotel and flight arrangements for the upcoming Linux conference"
50
Predefined predicates

Let F = w1, w2, ..., wj be a fragment of text.
length({<,>,=,...}, N)
some(Var, Path, Feat, Value): e.g. some(A, [next_token, next_token], numeric, true)
position(Var, From, Relop, N)
relpos(Var1, Var2, Relop, N)
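The length and some predicates can be sketched over token lists. The token/feature encoding (dicts of features per token) is an assumption of this sketch, not the paper's representation:

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

def length(fragment, relop, n):
    # length({<,>,=}, N): constrain the number of tokens in the fragment.
    return OPS[relop](len(fragment), n)

def some(fragment, path, feat, value):
    # some(Var, Path, Feat, Value): some token in the fragment such that
    # following Path (e.g. [prev_token] or [next_token, next_token])
    # lands on a token whose feature Feat has the given Value.
    step = {"next_token": 1, "prev_token": -1}
    for i, _ in enumerate(fragment):
        j = i + sum(step[s] for s in path)
        if 0 <= j < len(fragment) and fragment[j].get(feat) == value:
            return True
    return False
```

An extraction rule is then just a conjunction of such predicate calls over a candidate fragment.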
51
A wrong example

ownername(Fragment) :-
    some(A, [prev_token], word, "gmt"),
    some(A, [], in_title, true),
    some(A, [], word, unknown),
    some(A, [], quadrupletonp, false),
    length(<, 3).

Last-Modified: Wednesday, 26-Jun-96 01:37:46 GMT
<title>Bruce Randall Donald</title>
<h1><img src="ftp://ftp.cs.cornell.edu/pub/brd/images/brd.gif"><p>Bruce Randall Donald<br>Associate Professor<br>
52
Accuracy/coverage tradeoff for Name Extraction
53
Conclusions

Used machine learning algorithms to create information extraction methods for each desired type of knowledge.
WebKB achieves 70% accuracy at 30% coverage.
Bag-of-words (hyperlinks, web pages and full text) and first-order learning can be combined to boost the confidence.
First-order learning can be used to look outward from the page and consider its neighbors.
54
Problems

Not as accurate as we want; you can get more accuracy at the cost of coverage.
Use linguistic features (verbs).
Add new methods to the booster (e.g. predict the department of a professor based on the departments of his advisees).
A per-topic, per-language, per-... method.
Needs hand-made labeling to learn. Learners with high accuracy can be used to teach learners with low accuracy.