Post on 20-Dec-2015
Web classification
Ontology and Taxonomy
2
References

Using Ontologies to Discover Domain-Level Web Usage Profiles. H. Dai, B. Mobasher {hdai,mobasher}@cs.depaul.edu, DePaul University.
Learning to Construct Knowledge Bases from the World Wide Web. M. Craven, D. DiPasquo, T. Mitchell, K. Nigam, S. Slattery, Carnegie Mellon University, Pittsburgh, USA; D. Freitag, A. McCallum, Just Research, Pittsburgh, USA.
3
Definitions

Ontology: an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.

Taxonomy: a classification of organisms into groups based on similarities of structure, origin, etc.
4
Goal

Capture and model behavioral patterns and profiles of users interacting with a web site.

Why?
Collaborative filtering
Personalization systems
Improve the organization and structure of the site
Provide dynamic recommendations (www.recommend-me.com)
5
Algorithm 0 (by Rafa's brother, Gabriel)

Recommend pages viewed by other users with similar page ranks.

Problems
New item problem
Doesn't consider content similarity or item-to-item relationships.
6
User session

User session s: <w(p1,s), w(p2,s), ..., w(pn,s)>
w(pi,s) is the weight associated with page pi in session s.

Session clusters {cl1, cl2, ...}: each cli is a subset of the set of sessions.

Usage profile: pr_cl = {<p, weight(p,pr_cl)> : weight(p,pr_cl) ≥ μ}
weight(p,pr_cl) = (1/|cl|) * Σ_{s∈cl} w(p,s)
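The profile definition above can be sketched directly in code; a minimal sketch, assuming each session is a dict mapping pages to their weights w(p,s) (all names here are illustrative):

```python
def usage_profile(cluster, mu):
    """Compute the usage profile of a session cluster.

    `cluster` is a list of sessions, each a dict mapping page -> w(p, s);
    pages absent from a session have weight 0.  The profile keeps every
    page whose mean weight across the cluster is >= mu.
    """
    pages = {p for session in cluster for p in session}
    profile = {}
    for p in pages:
        weight = sum(s.get(p, 0.0) for s in cluster) / len(cluster)
        if weight >= mu:
            profile[p] = weight
    return profile
```

For example, a two-session cluster where a.html is viewed heavily and b.html only once yields a profile containing a.html alone when μ = 0.5.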
7
Algorithm 1

1. For every session, create a vector containing the viewed pages and a weight for each page.
2. Each vector represents a point in an N-dimensional space, so we may identify the clusters.
3. For a new session, check to which cluster this vector/point belongs, and recommend the high-scoring pages of this cluster.

Problems
New item problem
Doesn't consider content similarity or item-to-item relationships.
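The three steps above can be sketched as follows. The slide does not fix a clustering method, so representing each cluster by its centroid and assigning the new session to the nearest one is an assumption of this sketch:

```python
import math

def centroid(cluster, pages):
    # Mean session vector of the cluster over a fixed page ordering.
    return [sum(s.get(p, 0.0) for s in cluster) / len(cluster) for p in pages]

def recommend(clusters, new_session, top_n=3):
    # Step 1-2: a fixed page ordering maps every session to a point
    # in N-dimensional space.
    pages = sorted({p for cl in clusters for s in cl for p in s}
                   | set(new_session))
    v = [new_session.get(p, 0.0) for p in pages]
    # Step 3: assign the new session to the nearest cluster (by centroid)...
    best = min(clusters, key=lambda cl: math.dist(v, centroid(cl, pages)))
    # ...and recommend its high-scoring pages the user hasn't viewed yet.
    scores = {p: sum(s.get(p, 0.0) for s in best) / len(best) for p in pages}
    return [p for p, w in sorted(scores.items(), key=lambda kv: -kv[1])
            if w > 0 and p not in new_session][:top_n]
```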
8
Algorithm 2: keyword search

Solves the new item problem, but is not good enough:
A page can contain info about more than one object.
Fundamental data can be pointed to by the page rather than included in it.
What exactly is a keyword?

Solution: domain ontologies for objects.
9
Domain Ontologies

Domain-Level Aggregate Profile: a set of pseudo objects, each characterizing objects of different types occurring commonly across the user sessions.

Class: C
Attribute a: <Da, Ta, ≤a, Ψa>
Ta: type of the attribute
Da: domain of the values for a (red, blue, ...)
≤a: ordering relation over Da
Ψa: combination function
10
Example – movie web site

Classes: movies, actors, directors, etc.
Attributes:
Movies: title, genre, starring actors
Actors: name, filmography, gender, nationality

Combination functions:
Ψactor(<{S:0.7; T:0.2; U:0.1}, 1>, <{S:0.5; T:0.5}, 0.7>) = Σi(wi*wo) / Σi(wi)
Ψyear({1991}, {1994}) = {1991, 1994}
Ψis_a({person, student}, {person, TA}) = {person}
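These combination functions can be sketched as follows. Reading Ψactor as a significance-weighted average of the actor-weight sets is an assumption based on the slide's formula; the other two are the union and intersection the examples show:

```python
def psi_year(*value_sets):
    # Ψ_year: simple union of the observed values.
    return set().union(*value_sets)

def psi_is_a(*value_sets):
    # Ψ_is_a: keep only the concepts common to all objects.
    return set.intersection(*map(set, value_sets))

def psi_actor(*weighted_sets):
    # Ψ_actor: each argument is (actor_weights, object_significance);
    # combine as a significance-weighted average of the actor weights.
    total = sum(sig for _, sig in weighted_sets)
    actors = {a for weights, _ in weighted_sets for a in weights}
    return {a: sum(sig * w.get(a, 0.0) for w, sig in weighted_sets) / total
            for a in actors}
```

With the slide's inputs, psi_actor gives S ≈ 0.62, T ≈ 0.32, U ≈ 0.06, and the weights still sum to 1.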
11
Movie

Title: About a boy
Genre: {Romantic; Comedy; Family}
Actor: {H. Grant: 0.6; R. Weisz: 0.1; T. Collete: 0.3}
Year: 2002
12
Creating an Aggregated Representation of a usage profile

pr = {<o1, w_o1>, ..., <on, w_on>}
oi: object; w_oi: significance of oi in the profile pr.

Assume all the objects are instances of the same class.
Create a new virtual object o' with attributes ai' = Ψi(o1, ..., on).
13
Item level usage profile

Name             Genre                              Actor                  Year
{A}              Genre-all: Romance, Romance        {S:0.7; T:0.2; U:0.1}  {2002}
                 Comedy, Comedy, Kids & family
{B}              Genre-all: Romance, Comedy         {S:0.5; T:0.5}         {1999}
{C}              Genre-all: Romance                 {W:0.6; S:0.4}         {2001}
{A:1; B:1; C:1}  Genre-all: Romance                 {S:0.58; T:0.27;       {1999, 2002}
                                                    W:0.09; U:0.05}
14
A real (estate property) example

Property
Price: {300K}  Location: {Chicago}  Room num: {5}
15
Item Level Usage Profile

Weight  Price  Location             Room num
1       475K   Chicago              5
0.7     299K   Chicago              4
0.18    272K   Evanston             4
0.18    99K    Chicago              3
1       365K   {Chicago, Evanston}  4
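The last row of the profile above is the aggregated pseudo-object. A minimal sketch of that aggregation, assuming numeric attributes combine by a significance-weighted average and the location attribute by set union (the helper name is illustrative; prices are in thousands):

```python
def aggregate_profile(rows):
    """rows: (weight, price_in_K, location, rooms) tuples from the profile."""
    total = sum(w for w, *_ in rows)
    # Weighted average for the numeric attributes, union for locations.
    price = sum(w * p for w, p, _, _ in rows) / total
    rooms = sum(w * r for w, _, _, r in rows) / total
    locations = {loc for _, _, loc, _ in rows}
    return round(price), locations, round(rooms)
```

Applied to the four profile rows this reproduces the aggregate row: 365K, {Chicago, Evanston}, 4 rooms.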
16
Algorithm 3

Do not just recommend items viewed by other users; recommend items similar to the class representative.

Advantages:
More accuracy
Needs fewer examples
No new item problem
Also considers content similarity (item-to-item relationships).
17
Item Level Usage Profile

Weight  Price  Location             Room #
1       475K   Chicago              5
0.7     299K   Chicago              4
0.18    272K   Evanston             4
0.18    99K    Chicago              3
1       365K   {Chicago, Evanston}  4
1       370K   Chicago              4
18
Final Algorithm

Given a web site:
1. Classify its contents into classes and attributes.
2. Merge the objects of each user profile and create a pseudo object.
3. Recommend according to this pseudo-object.
19
Problems

A per-topic solution
Found patterns can be incomplete
User patterns may change with time (for movies): the "I loved ET" problem.
Needs cookies and other methods to identify users.
How is the weight calculated? Can need many examples: the "I loved American Beauty" problem.
How to automatically group the web pages?
20
Break? (Hafsaka)
21
Constructing a Knowledge Base from the WWW

Goal: automatically create a computer-understandable knowledge base from the web.

Why?
To use in the previously described work, and similar
Find all universities that offer Java Programming courses
Make me hotel and flight arrangements for the upcoming Linux conference
22
…Constructing a Knowledge Base from the WWW

How?
Use machine learning to create information extraction methods for each of the desired types of knowledge.
Apply them to extract symbolic, probabilistic statements directly from the web: Student-of(Rafa, sdbi) = 99%

Method used:
Provide an initial ontology (classes and relations)
Training examples: 3 out of 4 university sites (8000 web pages, 1400 web-page pairs)
23
Example of web pages

Fundamentals of CS Home Page
Instructors: Jim, Tom

Jim's Home Page
I teach several courses: Fundamentals of CS, Intro to AI.
My research includes intelligent web agents.

Classes: Faculty, Research-project, Student, Staff, (Person), Course, Department, Other
Relations: instructor-of, members-of-project, department-of
24
[Figure: the ontology and Web KB instances. The class hierarchy roots at Entity (attributes: home page, home page title, activity, other) and specializes into Person (department-of, project-of, course-taught-by, name-of), Course (instructor-of, TAs-of), Faculty (project-led-by), Student (student-of), and Research-project (members-of-project). Example instances: Jim, with courses taught Fundamentals of CS and Intro to AI and a home page; the course Fundamentals of CS, with instructors Jim and Tom and a home page.]
25
Problem Assumptions

Assumed: one class instance per web page. This breaks when:
Multiple instances appear in one web page
Multiple linked/related web pages describe one instance
The Elvis problem

A relation R(A,B) is represented by:
Hyperlinks A→B or A→C→D→...→B
Inclusion in a particular context ("I teach Intro2cs")
A statistical model of typical words
26
To Learn

1. Recognizing class instances by classifying bodies of hypertext
2. Recognizing relation instances by classifying chains of hyperlinks
3. Extracting text fields
27
Recognizing class instances by classifying bodies of hypertext

1. Statistical bag-of-words approach
   1. Full text
   2. Hyperlinks
   3. Title/Head
2. Learning first-order rules
Combine the previous 4 methods
28
Statistical bag-of-words approach

Context-less classification:
Given a set of classes C = {c1, c2, ..., cN}
Given a document consisting of n ≤ 2000 words {w1, w2, ..., wn}
c* = argmax_c Pr(c | w1, ..., wn)
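The argmax above is the standard naive Bayes bag-of-words classifier; a minimal sketch (Laplace smoothing is this sketch's choice, not something the slide specifies):

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class maps class -> list of documents (token lists)."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        n = sum(counts.values())
        # Laplace-smoothed log Pr(w|c), plus the log prior Pr(c).
        model[c] = (math.log(len(docs) / total_docs),
                    {w: math.log((counts[w] + 1) / (n + len(vocab)))
                     for w in vocab})
    return model

def classify(model, words):
    # c* = argmax_c Pr(c) * prod_i Pr(w_i|c), computed in log space.
    def score(c):
        prior, logp = model[c]
        return prior + sum(logp.get(w, 0.0) for w in words)
    return max(model, key=score)
```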
29
Confusion matrix (rows: predicted class; columns: actual class)

        cours  stud  facu  staff  rese  dept  other  Accuracy
Cours     202    17     0      0     1     0    552      26.2
Stud        0   421    14     17     2     0    519      43.3
Facu        5    56   118     16     3     0    264      17.9
Staff       0    15     1      4     0     0     45       6.2
Rese        8     9    10      5    62     0    384      13.0
Dept        1     0     8      3    15     4    209       1.7
Other      19    32     7      3    12     0   1064      93.6
Coverage 82.8  75.4  77.1    8.7  72.9   100     35
30
Statistical bag-of-words approach: most predictive words per class, ranked by Pr(wi|c) * log(Pr(wi|c)/Pr(wi|~c)) (a "D" in a token stands for a digit):

student: my 0.0247, page 0.0109, home 0.0104, am 0.0085, university 0.0061, computer 0.0060, science 0.0059, me 0.0058, at 0.0049, here 0.0046
faculty: DDDD 0.0138, of 0.0113, and 0.0109, professor 0.0088, computer 0.0073, research 0.0060, science 0.0057, university 0.0049, DDD 0.0042, systems 0.0042
course: course 0.0151, DD:DD 0.0130, homework 0.0106, will 0.0088, D 0.0080, assignments 0.0079, class 0.0073, hours 0.0059, assignment 0.0058, due 0.0058
research-project: group 0.0060, project 0.0049, research 0.0049, of 0.0030, laboratory 0.0029, systems 0.0028, and 0.0027, our 0.0026, system 0.0024
department: department 0.0179, science 0.0153, computer 0.0111, faculty 0.0070, information 0.0069, undergraduate 0.0058, graduate 0.0047, sta 0.0045, server 0.0042
other: D 0.0374, DD 0.0246, the 0.0153, eros 0.0010, hplayD 0.0097, uDDb 0.0067, to 0.0064, bluto 0.0052, gt 0.0050
31
Accuracy/Coverage tradeoff for full-text classifiers
32
Accuracy/coverage tradeoff for hyperlinks classifiers
33
Accuracy/coverage tradeoff for title/heading classifiers
34
Learning first-order rules

The previous method doesn't consider relations between pages. For example: a page is a course home page if it contains the words "textbook" and "TA" and points to a page containing the word "assignment".

FOIL is a learning system that constructs Horn-clause programs from examples.
35
Relations

has_word(Page): stemmed words (computer = computing = comput) with at least 200 occurrences, but appearing in fewer than 30% of the other classes' pages.
link_to(Page, Page)

m-estimate accuracy = (nc + m*p) / (n + m)
nc: number of instances correctly classified by the rule
n: total number of instances classified by the rule
m = 2
p: proportion of instances in the training set that belong to that class

Predict each class with confidence = best_match / total_#_of_matches
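The m-estimate formula above is a one-liner; a minimal sketch with the slide's definitions:

```python
def m_estimate(nc, n, p, m=2):
    """m-estimate of a rule's accuracy: (nc + m*p) / (n + m).

    nc: instances the rule classifies correctly
    n:  total instances the rule classifies
    p:  prior proportion of the class in the training set
    m:  equivalent-sample-size parameter (the slide's setting is m=2)
    """
    return (nc + m * p) / (n + m)
```

Note how a rule that covers nothing (n = nc = 0) falls back to the class prior p, which is the point of the m-estimate over raw accuracy nc/n.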
36
Newly learned rules

student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).

faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).

course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B, 1)), has_assign(B).
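To see how such a Horn clause is evaluated, the faculty rule can be sketched over a toy page graph (the page/word/link encoding here is an assumption for illustration, not the paper's representation):

```python
def has_word(pages, page, word):
    # has_<word>(Page) in the rules above: the stemmed word occurs on the page.
    return word in pages[page]["words"]

def link_to(links, src, dst):
    return (src, dst) in links

def faculty(pages, links, a):
    """faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B)."""
    return (has_word(pages, a, "professor") and has_word(pages, a, "ph")
            and any(link_to(links, b, a) and has_word(pages, b, "faculti")
                    for b in pages))
```

A page containing "professor" and "ph" only counts as faculty if some page containing "faculti" links to it, which is exactly the look-outward behavior FOIL's relational rules add.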
37
Accuracy/coverage for FOIL page classifiers
38
Boosting

The best classifier depends on the class.
Combine the predictions using the confidence measure.
39
Accuracy/coverage tradeoff for combined classifiers (2000 words vocabulary)
40
Boosting

Disappointing: somehow it is not uniformly better.

Possible solutions:
Use reduced-size dictionaries (next)
Use other methods for combining predictions (voting instead of best_match / total_#_of_matches)
41
Accuracy/coverage tradeoff for combined classifiers (200 words vocabulary)
42
Multi-Page segments

The group is the longest matching prefix (indicated in parentheses):
(@/{user,faculty,people,home,projects}/*)/*.{html,htm}
(@/{cs???,www/,*})/*.{html,htm}
(@/{cs???,www/,*})/
...

A primary page is any page whose URL matches:
@/index.{html,htm}
@/home.{html,htm}
@/%1/%1.{html,htm}
...

If no page in the group matches one of these patterns, then the page with the highest score for any non-Other class is the primary page.
Any non-primary page is tagged as Other.
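The primary-page patterns can be sketched with ordinary regexes. This is one loose reading of the slide's notation, where '@' stands for the group prefix and '%1/%1' means a directory whose index file repeats the directory name; both the translation and the character class are assumptions:

```python
import re

# Primary-page patterns, loosely translated from the slide's notation.
PRIMARY = [r"index\.html?", r"home\.html?", r"([\w~-]+)/\1\.html?"]

def is_primary(url, group_prefix):
    """True if the part of the URL after the group prefix matches a pattern."""
    rest = url[len(group_prefix):].lstrip("/")
    return any(re.fullmatch(p, rest) for p in PRIMARY)
```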
43
Accuracy/coverage tradeoff for the full text after URL grouping heuristics
44
Conclusion: Recognizing Classes

Hypertext provides redundant information.
We can classify using several methods:
Full text
Heading/title
Hyperlinks
Text in neighboring pages + grouping pages

No method alone is good enough; combining the classifiers' predictions gives a better result.
45
Learning to Recognize Relation Instances

Assume relations are represented by hyperlinks.

Given the following background relations:
class(Page)
link_to(Hyperlink, P1, P2)
has_word(H): the word is part of the hyperlink
all_words_capitalized(H)
has_alphanumeric_word(H): e.g. "I Teach CS2765"
has_neighborhood_word(H): neighborhood = paragraph
46
…Learning to Recognize Relation Instances

Try to learn the following:
members_of_project(P1, P2)
instructors_of_course(P1, P2)
department_of_person(P1, P2)
47
Learned relations

instructors_of(A,B) :- course(A), person(B), link_to(C,B,A).
Test set: 133 Pos, 5 Neg

department_of(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_word_graduate(E).
Test set: 371 Pos, 4 Neg

members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), has_neighborhood_word_people(C).
Test set: 18 Pos, 0 Neg
48
Accuracy/Coverage tradeoff for learned relation rules
49
Learning to Extract Text Fields

Sometimes we want a small fragment of text (like Jon, Peter, etc.), not the whole web page or class:
"Make me hotel and flight arrangements for the upcoming Linux conference"
50
Predefined predicates

Let F = w1, w2, ..., wj be a fragment of text.
length({<,>,=,...}, N)
some(Var, Path, Feat, Value): e.g. some(A, [next_token, next_token], numeric, true)
position(Var, From, Relop, N)
relpos(Var1, Var2, Relop, N)
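The length and some predicates can be sketched over token lists. The token/feature encoding (dicts of features per token) is an assumption of this sketch, not the paper's representation:

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

def length(fragment, relop, n):
    # length({<,>,=}, N): constrain the number of tokens in the fragment.
    return OPS[relop](len(fragment), n)

def some(fragment, path, feat, value):
    # some(Var, Path, Feat, Value): some token in the fragment such that
    # following Path (e.g. [prev_token] or [next_token, next_token])
    # lands on a token whose feature Feat has the given Value.
    step = {"next_token": 1, "prev_token": -1}
    for i, _ in enumerate(fragment):
        j = i + sum(step[s] for s in path)
        if 0 <= j < len(fragment) and fragment[j].get(feat) == value:
            return True
    return False
```

An extraction rule is then just a conjunction of such predicate calls over a candidate fragment.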
51
A wrong example

ownername(Fragment) :-
    some(A, [prev_token], word, "gmt"),
    some(A, [], in_title, true),
    some(A, [], word, unknown),
    some(A, [], quadrupletonp, false),
    length(<, 3).

Last-Modified: Wednesday, 26-Jun-96 01:37:46 GMT
<title>Bruce Randall Donald</title>
<h1><img src="ftp://ftp.cs.cornell.edu/pub/brd/images/brd.gif"><p>Bruce Randall Donald<br>Associate Professor<br>
52
Accuracy/coverage tradeoff for Name Extraction
53
Conclusions

Used machine learning algorithms to create information extraction methods for each desired type of knowledge.
WebKB achieves 70% accuracy at 30% coverage.
Bag-of-words (hyperlinks, web pages and full text) and first-order learning can be combined to boost the confidence.
First-order learning can be used to look outward from the page and consider its neighbors.
54
Problems

Not as accurate as we want; you can get more accuracy at the cost of coverage.
Use linguistic features (verbs).
Add new methods to the booster (e.g. predict the department of a professor based on the departments of his advisees).
A per-topic, per-language, per-... method.
Needs hand-made labeling to learn. Learners with high accuracy can be used to teach learners with low accuracy.