

A Term Based Ranking Methodology for

Resources on the Semantic Web

Aaron Huang


Abstract

The HTML web used today enables people to navigate efficiently over vast collections of information stored on pages. However, the computers that host these pages have no knowledge of the data stored; they only record the page destinations (URLs) and the navigation links between them. The semantic web is a relatively new web structure proposed to give the computer not only knowledge of the data stored within the documents of the web, but also the ability to comprehend that data through associations and potentially use that knowledge to perform autonomous actions. The semantic web keeps track of terms and their associations with other terms in the form of triples, which are then used to form documents, thus enabling the web to have knowledge of the context and significance of data within pages. However, algorithms for the HTML web such as Google's PageRank do not take advantage of the semantic web's numerous capabilities, and are also less efficient because documents must first be converted into HTML format in order to be ranked. An algorithm was therefore developed that first ranks each term based upon its usage and associations with other terms, and then ranks each document based on the terms it contains. Additionally, a function to determine the depth and specificity of a document (reference, scholarly, etc.) was integrated alongside the document rank to provide a more streamlined and relevant search.


Introduction

The current world wide web is a collection of HTML (HyperText Markup Language)

documents interconnected through URL (Uniform Resource Locator) links. While HTML sets a

standard for accessible creation of structured documents, pages created in the language must be read and interpreted by humans and are incomprehensible to the computer. The computer sees the HTML web as a collection of pages, and although it can access the data on each page (for example, by performing a search for a specific term), it is unable to assign any type of meaning

to the data. The Semantic Web aims to create a "web of data", as opposed to a "web of

documents", in which data can be stored in a form understood by machines, allowing for

computer automation of tasks. Such a web would advance artificial intelligence and learning by

enabling machines to not only comprehend terms and data, but more importantly learn the

relationships and associations between objects on the web, contributing to a better understanding

of the environment and the ability to make independent decisions based on previous knowledge.

The Page Rank algorithm was developed by Sergey Brin and Larry Page, and today is the

underlying structure of the Google search engine. It works by assigning each page a PageRank (PR) based on its incoming links: each linking page contributes its own rank divided by its number of outbound links. Thus, a relatively unimportant page with a small PR contributes little to another page's PR, and a page with numerous outbound links gives only a small contribution to each of the linked pages. Since each

page’s rank depends on the others', and the page ranks of all the pages change after each

iteration, this algorithm is executed repeatedly until relative convergence. The algorithm also

uses the "random surfer model" to predict whether links will be followed or not. Using an

arbitrary damping factor d representing the likelihood a random surfer will follow a link, the


algorithm accounts for two scenarios: the surfer follows the link, or he does not, in which case he picks another random page and starts surfing again. Studies have shown that d is around 0.85 [7].
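As an illustration of the recurrence described above (not Google's production system), the following is a minimal power-iteration sketch over a made-up three-page graph; the function name and graph are purely illustrative:

```python
# Minimal PageRank power iteration (illustrative sketch).
# `links` maps each page to the pages it links out to.
def pagerank(links, d=0.85, tol=1e-9, max_iter=200):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # uniform initial ranks
    for _ in range(max_iter):
        new = {}
        for p in pages:
            # rank flowing in from every page that links to p,
            # divided by that page's number of outbound links
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if p in links[q])
            # teleport term for the random surfer, plus damped link term
            new[p] = (1 - d) / n + d * incoming
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new
    return pr
```

Running this on a toy graph such as `{"A": ["B", "C"], "B": ["C"], "C": ["A"]}` converges after roughly a hundred iterations, with the ranks summing to 1.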

However, the traditional PageRank algorithm does not take advantage of the semantic web structure and requires SWDs (Semantic Web Documents) to be converted into HTML form, so this project intended to create a new algorithm that would rank not only documents but also terms, leading to more

meaningful rankings based upon content rather than links. Developing such a system would

enable computers to learn what data and ideas are important on the web, rather than just which

pages, and perhaps autonomously predict and publish new information of potential significance.

The Semantic Web is organized as numerous Semantic Web Documents (SWD’s), populated by instances of terms (SWT’s) that can either represent data or words within a document. Terms are defined through ontologies (SWO’s), which describe data through

associations, so that terms are defined relative to each other. For example, if there is a

relationship within an SWO, “artist creates art”, where “artist”, “creates”, and “art” are SWT’s;

an SWD may have the statement, “Picasso creates cubism”, where “Picasso” is an instance of

artist, “cubism” is an instance of “art”, and “creates” is directly used in the document. Thus, the

computer can assign meaning to the relationship between Picasso and cubism from the general

association in the SWO. Instead of using HTML, the Semantic Web is written in XML

(Extensible Markup Language) syntax, and composed through more object-oriented languages

designed specifically for data management, including RDF (Resource Description Framework)

and OWL (Web Ontology Language).

RDF is an infrastructure that defines a standard model for the integration and

organization of data. Resources are described through subject-predicate-object expressions,

otherwise known as RDF triples. The subject refers to the resource being described, the object is


the literal or other resource describing the subject, and the predicate denotes the type of

relationship between the subject and the object. For example, for the statement, “The water is

blue,” “water” would be the subject, “blue” would be the object describing the water, and “is”

would be the predicate defining the relationship between blue and water. URI’s (Uniform Resource Identifiers) are assigned to the subject and predicate, as well as the object if it is a

resource, in order to allow the system to identify the content and comprehend the relationship in

context of other resource associations. RDF triples, or statements, can be given URI’s and used

as resources in other statements, a process known as reification, creating a hierarchy of object-

oriented data description.

The RDF vocabulary is used to describe the abstract RDF graph and includes classes, such as rdf:Statement (the class of all RDF statements) and rdf:Property (the class of all possible properties), as well as properties such as rdf:type (used to declare a resource as an instance of a class) and rdf:predicate (the predicate of an RDF triple) [1].

Figure 1: A visual representation of the Semantic Web. The Semantic Web may be visualized in two parts: a web visible to users containing documents and ontologies defining terms; and an RDF graph containing individual resources and literals following association triples defined in an abstract graph. The abstract graph defines classes and properties of which instances are created to form resources. These resources follow the same associations and properties as their classes, and contain terms that can be used in documents on the web. In order to access these relationships, web ontologies are used, which display the terms and associations in the RDF graph on the accessible web.

Additionally, vocabularies can be built upon the RDF, such as RDFS (Resource Description Framework Schema), which is used to

describe actual instances. Examples are the class rdfs:Resource which contains all resources on

the web, and the property rdfs:subClassOf which defines whether a resource is a subclass of

another resource [1].

OWL is a language used to describe classes, properties, and instances of resources through associations, allowing the web not only to store knowledge of specific resources, but also to recognize the type of a resource and identify it as such. OWL presents ontologies through an abstract syntax, portraying them as a combination of annotations, axioms, and facts [2].

Annotations include human and machine input meta-data descriptions about the contained

resources. The actual resource information is described through axioms and facts. The OWL

abstract syntax allows people to define relations and characteristics, which OWL then translates

into triples following the RDF structure, a form comprehensible by the machine.

Currently, there exist a few approaches to ranking semantic web resources, most notable

of which include Swoogle and SemRank. Swoogle is a crawler based search and metadata

engine, which discovers and indexes pages and corresponding metadata, providing basic access

to semantic web resources. The ranking algorithm used for these resources is very similar to

PageRank, with one weighting function integrated to take in the usage of terms on a document rather than hyperlinks between pages [10]. SemRank is an algorithm developed to rank the different

types of associations on the semantic web, specifically complex associations, and allows users to

vary their search mode to reflect their definition of importance. The algorithm is based upon

studies of ρ-path associations and ordered using a Top-K algorithm in an SSARK system [12, 15].


Theory

The objective was to develop an algorithm that could provide an accurate ranking system for resources on the semantic web, while also taking advantage of the data knowledge of the web and ranking pages based on content instead of exchanged links.

The algorithm was developed on two levels, term and document, which would first rank all

terms based upon their usage and associations with other terms, and then rank documents based

on the terms they contained. The term rank of an SWT is based on the weighted average of all the

term’s occurrences as subjects in triples. Initially, all terms are assigned an equal rank, 1 divided

by the total number of terms, such that all the ranks sum to 1. Within each triple, the rank of the

object, the property of the association defined by the predicate, and the specified triple’s usage

on the web contribute to the subject’s rank. The property factor is a constant from 0 to 1 that

determines the relationship between the object and subject, and thus how much rank the object

may pass on to the subject. If the object is a literal, there are no associations in which it is a

subject, so its rank will always remain the initialized rank. Otherwise, since there are objects that

are resources whose ranks may change after each iteration, the ranks of all terms must be

repeatedly calculated until convergence. In order to ensure that each object does not pass on a

larger total rank than it contains, the equation is divided by the number of triples the object

describes another subject in. Thus the term rank is defined by equation 1,

    TR(SWT) = \sum_{T \in SWT} \frac{TR(T_{object}) \cdot P_{pred} \cdot freq}{O_{triple}}    (1)

where TR is the term rank ranging from 0 to 1, SWT is the term to be ranked, T is each triple defining the SWT as a subject, P_pred is the property factor, freq is the fraction of the term’s total usage on the web that follows the indicated triple relationship, and O_triple is the number of triples containing the object.
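As a sketch of this iteration, the following Python (the implementation described later in the paper was in Java) computes term ranks from hypothetical triples, property factors, and usage frequencies; all the data names here are made up for illustration:

```python
# Iterative TermRank per equation 1 -- an illustrative sketch, not the
# paper's Java implementation.
def term_rank(triples, p_pred, freq, tol=1e-9, max_iter=100):
    """triples: list of (subject, predicate, object) tuples.
    p_pred: property factor (0..1) per predicate.
    freq: per-triple fraction of the subject's total usage."""
    terms = {t for s, _, o in triples for t in (s, o)}
    init = 1.0 / len(terms)              # equal initial ranks summing to 1
    tr = {t: init for t in terms}
    # O_triple: number of triples in which each object describes a subject
    o_count = {}
    for _, _, o in triples:
        o_count[o] = o_count.get(o, 0) + 1
    for _ in range(max_iter):
        new = {}
        for t in terms:
            own = [(s, p, o) for (s, p, o) in triples if s == t]
            if not own:
                new[t] = init            # literals keep the initialized rank
            else:
                new[t] = sum(tr[o] * p_pred[p] * freq[(s, p, o)] / o_count[o]
                             for (s, p, o) in own)
        if max(abs(new[t] - tr[t]) for t in terms) < tol:
            return new
        tr = new
    return tr
```

With the paper's "Picasso creates cubism" example as toy input, "Art" (never a subject) keeps its initial rank while "Cubism" and "Picasso" settle to ranks scaled down by the property factors along the chain.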

The Ppred factors are different constants, ranging from 0 to 1, describing the strength of

association of each type of predicate relationship. This relative strength is determined through

the average usage and importance of the predicate in triples on the web, which is given in

equation 2,

    P_{pred} = \sum_{doc \in pred} \frac{freq \cdot PR(doc)}{N_{doc}}    (2)

where doc is any document on the web containing a relationship of the specified predicate pred,

freq is the fraction of the total number of associations on the doc that contain the predicate,

PR(doc) is the Google PageRank of the document, and Ndoc is the number of documents

containing the predicate.
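Equation 2 is a straightforward weighted average; a minimal sketch, with the per-document frequencies and PageRank values supplied as made-up inputs:

```python
# Property factor per equation 2 -- illustrative sketch.
def property_factor(docs):
    """docs: one (freq, pagerank) pair per document containing the
    predicate; freq is the fraction of that document's associations
    that use the predicate, pagerank its Google PageRank."""
    n = len(docs)  # N_doc: number of documents containing the predicate
    return sum(f * pr for f, pr in docs) / n
```

For example, two documents with (freq, PR) of (0.5, 0.4) and (0.25, 0.8) yield a property factor of 0.2.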

The Document Rank, DR (range 0 to 1), ranks documents by the TR and frequency of

each distinct term occurring within the document (The frequency is scaled down to give

precedence to the term’s simple appearance over its number of occurrences). Alternatively, the

DR can be calculated based upon specific terms, which will isolate a search from external

factors. This allows for relevancy searches to be based upon actual understanding of the usage

and context of search terms rather than a simple count of number of occurrences. The DR of an

SWD is given by equation 3,

    DR(SWD) = \sum_{SWT \in SWD} TR(SWT) \cdot S_t(freq_{term})    (3)

where St is the scaling factor, and freqterm is the fraction of total terms on the document, Nterm,

that match the specified term.
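A sketch of the DocumentRank computation in equation 3, assuming the simplest choice of scaling, multiplying the frequency by the damping constant d = 0.85 mentioned in the Discussion (the paper leaves the final scaling function to future tests, so this choice is an assumption):

```python
# DocumentRank per equation 3 -- sketch with S_t taken as multiplication
# by d = 0.85 (an assumed, not final, scaling function).
def document_rank(doc_terms, term_ranks, s_t=0.85):
    """doc_terms: the document's terms in order of occurrence (with repeats).
    term_ranks: TR value per distinct term."""
    n = len(doc_terms)  # N_term: total terms on the document
    return sum(term_ranks[t] * s_t * (doc_terms.count(t) / n)
               for t in set(doc_terms))
```

A document ["a", "b", "a"] with TR("a") = 0.3 and TR("b") = 0.6 then ranks at 0.85 · (0.3 · 2/3 + 0.6 · 1/3) = 0.34.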


Using the class abstraction structure of the RDF graph defined within the ontologies, the

ontological rank of a term, OR (range 0 to 1), can be defined as how specific a term is based on

how many subclass levels the term resides on. All the terms are initialized with an OR of 0, and

incremented for each subclass level. The specificity levels gained from all parents are summed

up and then divided by the total number of parents. Thus, the OR of an SWT can be given by

equation 4,

    OR(SWT) = \sum_{T \in SWO} \frac{OR(T_{parent}) + i}{N_{parent}}    (4)

where T is each triple in which the term is a subclass, and i is the increment, set to be 1 divided

by the total number of terms.
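The subclass-averaging of equation 4 can be sketched recursively; the hierarchy and increment below are made up for illustration:

```python
# OntologicalRank per equation 4 -- sketch over a made-up subclass hierarchy.
def ontological_rank(term, parents, i, memo=None):
    """parents maps a term to its direct superclasses; i is the increment,
    1 divided by the total number of terms. Top-level terms have OR = 0."""
    if memo is None:
        memo = {}
    if term not in memo:
        ps = parents.get(term, [])
        if not ps:
            memo[term] = 0.0  # initialized rank for root terms
        else:
            # specificity gained from every parent, averaged over N_parent
            memo[term] = sum(ontological_rank(p, parents, i, memo) + i
                             for p in ps) / len(ps)
    return memo[term]
```

In a three-term chain (Art → Cubism → AnalyticCubism) with i = 0.25, each subclass level adds one increment: 0, 0.25, 0.5.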

The specificity rank of a document, SR (range 0 to 1), is found through the average of the

OR’s of all the terms on the document, given by equation 5,

    SR(SWD) = \sum_{SWT \in SWD} \frac{OR(SWT)}{N_{term}}    (5)

Thus, when performing a search, the user specifies a level of depth to be searched at as

well as a term or group of terms, and results are ranked based upon a combination of SR and DR.

When computing the search, the deviation of the SR from the specified level and the deviation of the DR (note this is the form that takes in specified SWT’s) from the maximum relevancy of 1 are averaged to find the total deviation from the search inputs, as seen in equation 6,

    Search(level, SWT) = \frac{(level - SR(SWD)) + (1 - DR(SWD, SWT))}{2}    (6)

where level is the user defined specificity level, and the search is performed over all documents,

with the search ranks ranging from 0 (no deviation, most relevant) to 1 (no relevance).
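Equations 5 and 6 amount to a simple average and a deviation score; a minimal sketch with made-up ranks:

```python
# SpecificityRank (equation 5) and search deviation (equation 6) -- sketch.
def specificity_rank(doc_terms, or_ranks):
    # average of the OR's of all terms on the document
    return sum(or_ranks[t] for t in doc_terms) / len(doc_terms)

def search_deviation(level, sr, dr):
    # 0 = no deviation (most relevant), 1 = no relevance
    return ((level - sr) + (1 - dr)) / 2
```

For instance, a document with SR = 0.3 and DR = 0.8 against a requested level of 0.5 deviates by (0.2 + 0.2) / 2 = 0.2.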


Procedures

In order to understand the process and factors behind ranking resources and pages,

manual curation was performed on the Resource Disambiguator Web (RDW), a web that pulls

and collects potential resources to add to the Neuroscience Information Framework (NIF). A

curator then browses through the candidates and for each: determines whether the resource is

already contained on the NIF registry and what registry item it is associated with; categorizes the

resource as a database, ontology, software, core facility, or tissue bank; and marks whether the

candidate is a valuable addition to the NIF, adding additional comments if necessary.

Figure 2: An overview of the ranking procedure. Given an input search with a given level of specificity and certain term or group of terms to search for, the algorithm starts with each individual term, calculating a term rank and an ontological rank based on its associations with other terms. For all documents containing the specified term, the term rank contributes to the document rank, and the ontological rank contributes to the specificity rank. The specificity rank and document rank are then combined to find the importance of the document with respect to the search inputs.


A mathematical model for a semantic web ranking algorithm was developed loosely

based on the logic of PageRank, then refined to adapt to usage for the semantic web. Initially, the

term rank was based on a weighted average of the frequency of occurrences on the web as well

as the rank’s change over time, but was changed as counting the occurrences relied on the

PageRank of the pages containing the terms, and tracking the rank meant keeping a record of all

previous ranks, presenting enormous and impractical memory overheads. Also, the equations and

variables were changed to allow each ranking to have a range from 0 to 1, and initialization

values were determined for each algorithm.

The resulting algorithms were then implemented in Java and documented to allow for

conversion to other languages such as C, C++, and Python. The ranking algorithm consisted of three classes: a Document class, a Term class, and a Rank class. The Document class was responsible for

keeping track of the terms it contained, including the TermRank, frequency, and

OntologicalRank of each term; calculating the DocumentRank of the document based on the

contained terms; and calculating the SpecificityRank of the document based on the terms’ OR’s.

The Term class was responsible for keeping track of the triples containing the term; calculating

the TermRank based on the triple associations in which it was a subject; and calculating the

OntologicalRank based on the triples it was an object subclass in. The Rank class initialized all

the Documents and Terms contained on the given web, and performed a search given an input as

well as a term or set of terms, returning the rank of each Document.
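The three-class design described above might be transcribed roughly as follows; this is a hypothetical skeleton in Python rather than the original Java, with names and signatures invented for illustration:

```python
# Hypothetical skeleton of the described three-class design
# (the original implementation was in Java; this outline is illustrative).

class Term:
    def __init__(self, name):
        self.name = name
        self.triples = []         # triples in which this term is the subject
        self.parent_triples = []  # triples in which it is an object subclass

    def term_rank(self, ranks):        # equation 1
        ...

    def ontological_rank(self):        # equation 4
        ...

class Document:
    def __init__(self, terms):
        # term -> (TermRank, frequency, OntologicalRank)
        self.terms = terms

    def document_rank(self):           # equation 3
        ...

    def specificity_rank(self):        # equation 5
        ...

class Rank:
    def __init__(self, documents, terms):
        self.documents = documents     # all documents on the given web
        self.terms = terms             # all terms, initialized here

    def search(self, level, query_terms):   # equation 6
        ...
```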

In order to test the implemented methods on data, simple ontologies were first created using the Stanford-created application Protégé, then exported and tested. Then, ontologies from the

NIF developer tools were imported and tested. Ontology pages were ranked first using Google’s

PageRank algorithm and then with the developed DocumentRank algorithm, and compared


against the human-curator-assigned rank. Due to the different scales of the rankings, there is

still work being done in order to be able to compare all ranks on an equal basis. Also, a web

crawler is being implemented so that more resources may be integrated into the ranking

algorithm, creating more accurate rankings for each resource. Currently, tests are still in progress

to determine constants for the property factors for the different types of predicates, as well as the

scaling factor for DocumentRank under various web surfing models.

Discussion

In order to test the ranking algorithm, various factors such as the property factor and

scaling factor must be determined and given constant values. There may also be a factor in the

search ranking in order to scale the importance of specificity and the DocumentRank.

For the property factor, the constant is calculated for each type of predicate, and is based

on Google’s PageRank. Although as previously noted, PageRank is not optimal for ranking

semantic web resources, there is need for some ranking basis in order to initially determine the

properties’ strengths relative to each other, and so PageRank provides a way to differentiate the

sources of the triples contributing to each property factor. As each property type is subject to the

same PageRank algorithm, and only relative strengths are taken into account, systematic errors

of PageRank should not affect the accuracy of the resulting constants. This study of how to rank relationships is a separate project in its own right, and new methods for ranking predicate types are currently being developed.

The scaling factor in calculating DocumentRank is necessary since a term’s general

appearance in documents should take precedence over the frequency of appearance within the

document. For example, a document with two instances of a certain term should not have twice


the rank of a document with only one instance of the term. This factor depends on what surfer

model is used, and for the random surfer model, the scaling factor should be d, 0.85, if each term

is considered as a link, as d measures the likelihood a random surfer will follow a link, or in this

case, read an instance of a term.

When performing a search, it should be up to users how much precedence either

specificity or document rank takes, such as whether they would like to perform a search of all

documents of a specific topic but not a specific depth, or a search of all documents of a specific

level of complexity. However, a default configuration between the two will be initially set as 1 to

1, and modified in accordance to the popularity of certain types of searches.

Conclusion

A method for determining how accurate the ranking algorithm is, and how it compares to PageRank, is still needed. Human curation rankings are binary, either yes or no, and are not on

any type of scale. Averaging the human rankings for all resources would return a number

between 0 and 1 (if 0 were taken as no and 1 taken as yes), but the number would not be truly

indicative of the document’s rank, since it would imply all resources are equally weighted, which

they are not.

The algorithm has been developed, and now the need is for data to be collected to

determine constants in the ranking equations, as well as to develop accurate rankings for

individual resources. The ranking algorithm is only as accurate as the sample of resources it is given, so comparing it, when based on a small sample of resources, to PageRank, which is based upon over a billion pages, is not a fair or significant indicator of accuracy.


Acknowledgements

I would like to thank my mentor, Dr. Anita Bandrowski for introducing me to the field of

web semantics, and helping me develop and test my ranking algorithm. I would also like to thank

the REHS program of the San Diego Supercomputer Center at UCSD for giving me this internship

opportunity. Finally, I thank the entire NIF team for their support and help along my entire

research experience.

References

1. “Resource Description Framework (RDF).” W3C. 15 Mar. 2014. http://www.w3.org/RDF/.

2. OWL Web Ontology Language 1.0 Reference, W3C Working Draft, July 29, 2002.

3. “OntoQuestMain.” Confluence. 14 Apr. 2014 https://confluence.crbs.ucsd.edu/display/NIF/OntoQuestMain

4. “protégé.” Stanford Center of Biomedical Informatics Research. 2014. http://protege.stanford.edu.

5. http://nif-services.neuinfo.org/ontoquest/ontologies/, NIF ontology library, by NIF.

6. Aleman-Meza, B., Halaschek, C., Arpinar, I.B., Sheth, A: Context-Aware Semantic Association Ranking, First Intl. Workshop on Semantic Web and DBs, Berlin, Germany 2003.

7. Page, L.; Brin, S.; Motwani, R.; and Winograd, T. The Pagerank citation ranking: Bringing order to the web. Technical report, Stanford Database group. 1998

8. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7): 1998. pp. 107-117

9. I. Rogers. The Google Pagerank Algorithm and How it Works. http://www.iprcom.com/papers/pagerank/, May 2002.

10. Ding L., Finin T., Joshi A., Pan R., Cost R.S., Peng Y., Reddivari P., Doshi V.C., Sachs J.: Swoogle: A search and metadata engine for the semantic web. In: CIKM’04. 2004.

11. Guo, L., Shao, F., Botev, C., Shanmugasundaram, J. XRANK: Ranked keyword search over XML documents. In ACM SIGMOD 2003, pp. 16–27, San Diego, California.

12. Anyanwu, K., Sheth, A. ρ-Queries: enabling querying for Semantic Associations on the Semantic Web. WWW 2003. pp. 690 – 699.

13. Cohen. S., Mamou, J., Kanza, Y., Sagiv, Y. XSEarch: A Semantic Search Engine for XML, VLDB 2003.

14. Barton, S. Designing Indexing Structure for Discovering Relationships in RDF Graphs. DATESO 2004, pp. 7-17.


15. Anyanwu, K., Maduko, A., and Sheth, A.P.: SemRank: Ranking Complex Relationship Search Results on the Semantic Web, Proceedings of the 14th International World Wide Web Conference, ACM Press, May 2005.

16. Stojanovic, N., Mädche, A., Staab, S., Studer, R., Sure, Y. SEAL -- A Framework for Developing SEmantic PortALs. In: K-CAP 2001 – In Proc. of ACM Conference on Knowledge Capture, October 21-23, 2001.

17. Sheth, A., Aleman-Meza, B., Arpinar, I. B., Halaschek, C., Ramakrishnan, C., Bertram, C., Warke, Y., Avant, D., Arpinar, F. S., Anyanwu, K., Kochut, K. Semantic Association Identification and Knowledge Discovery for National Security Applications Journal of Database Management, 16 (1), Jan-Mar 2005, pp. 33-53.

18. Sheth, A., Ramakrishnan, C.: Semantic (Web) Technology In Action: Ontology Driven Information Systems for Search, Integration and Analysis. IEEE Data Engineering Bulletin, Special issue on Making the Semantic Web Real 2003.

19. Lin, S., Chalupsky, H.: Unsupervised Link Discovery in Multi-relational Data via Rarity Analysis. The Third IEEE International Conference on Data Mining 2003.