A Term Based Ranking Methodology for
Resources on the Semantic Web
Aaron Huang
Abstract
The HTML web in use today enables people to navigate efficiently over vast collections
of information stored on pages. However, the computers that host these pages have no
knowledge of the data they store; they record only the page destinations (URLs) and the
navigation links between them. The semantic web is a relatively new web structure proposed to give
the computer not only knowledge of the data stored within the documents of the web, but
also the ability to comprehend that data through associations and potentially use that
knowledge to perform autonomous actions. The semantic web keeps track of terms and their
associations with other terms in the form of triples, which are then used to form documents, thus
enabling the web to have knowledge of the context and significance of data within pages.
However, algorithms for the HTML web such as Google's PageRank do not take advantage of
the semantic web's numerous capabilities, and they are also less efficient there, since documents
must first be converted into HTML format in order to be ranked. An algorithm was therefore
developed that first ranks each term based upon its usage and associations with other terms, and
then ranks each document based on the terms it contains. Additionally, a function to determine the depth and
specificity of a document (reference, scholarly, etc.) was integrated alongside the document rank
to provide a more streamlined and relevant search.
Introduction
The current world wide web is a collection of HTML (HyperText Markup Language)
documents interconnected through URL (Uniform Resource Locator) links. While HTML sets a
standard for accessible creation of structured documents, pages created through the language
require humans to read and interpret them, and are incomprehensible to the computer. The computer
sees the HTML web as a collection of pages, and although it can access the data on each page
(for example, when performing a search for a specific term), it is unable to assign any type of meaning
to the data. The Semantic Web aims to create a "web of data", as opposed to a "web of
documents", in which data can be stored in a form understood by machines, allowing for
computer automation of tasks. Such a web would advance artificial intelligence and learning by
enabling machines to not only comprehend terms and data, but more importantly learn the
relationships and associations between objects on the web, contributing to a better understanding
of the environment and the ability to make independent decisions based on previous knowledge.
The PageRank algorithm was developed by Sergey Brin and Larry Page, and today is the
underlying structure of the Google search engine. It works by giving each page a PageRank (PR)
based on the number of incoming links it receives, factoring in the rank divided by the number of
outbound links of each page that the links come from. Thus, a relatively unimportant page with a
small PR would not give a large contribution to another page's PR, and a page with numerous
outbound links would give only small contributions to each of the linked pages. Since each
page’s rank depends on the others', and the page ranks of all the pages change after each
iteration, this algorithm is executed repeatedly until relative convergence. The algorithm also
uses the "random surfer model" to predict whether links will be followed or not. Using an
arbitrary damping factor d representing the likelihood a random surfer will follow a link, the
algorithm accounts for two scenarios: the surfer follows the link, or he does not, in which case he
picks another random page and starts surfing again. Studies have shown that d is around 0.85 [7].
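The iterative update described above can be sketched in code. This is a minimal illustration of the idea, not Google's implementation; the link structure, iteration count, and handling of dangling pages are assumptions made for the example.

```java
import java.util.Arrays;

// Minimal PageRank sketch: each page receives (1 - d)/N from the "random
// surfer" teleport, plus d times the sum of rank/outdegree over the pages
// linking to it, iterated until the ranks settle.
public class PageRankSketch {
    // links[i] lists the pages that page i links out to.
    public static double[] rank(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n); // uniform initialization
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - d) / n); // teleport term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) continue; // dangling page: not redistributed here
                double share = d * pr[i] / links[i].length; // rank split over outbound links
                for (int target : links[i]) next[target] += share;
            }
            pr = next;
        }
        return pr;
    }
}
```

A page with many outbound links passes only a small share to each target, and an unimportant page contributes little overall, matching the description above.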
However, the traditional PageRank algorithm does not take advantage of the semantic
web structure and requires Semantic Web Documents (SWD's) to be converted into HTML form, so this project intended to
create a new algorithm that would rank not only documents, but also terms, leading to more
meaningful rankings based upon content rather than links. Developing such a system would
enable computers to learn what data and ideas are important on the web, rather than just which
pages, and perhaps autonomously predict and publish new information of potential significance.
The Semantic Web consists of numerous Semantic Web Documents (SWD's),
populated by instances of terms (SWT's) that can represent either data or words within a
document. Terms are defined through ontologies (SWO's), which describe data through
associations, so that terms are defined relative to each other. For example, if there is a
relationship within an SWO, “artist creates art”, where “artist”, “creates”, and “art” are SWT’s;
an SWD may have the statement, “Picasso creates cubism”, where “Picasso” is an instance of
artist, “cubism” is an instance of “art”, and “creates” is directly used in the document. Thus, the
computer can assign meaning to the relationship between Picasso and cubism from the general
association in the SWO. Instead of using HTML, the Semantic Web is written in XML
(Extensible Markup Language) syntax, and composed through more object-oriented languages
designed specifically for data management, including RDF (Resource Description Framework)
and OWL (Web Ontology Language).
The RDF is an infrastructure that defines a standard model for integration and
organization of data. Resources are described through subject-predicate-object expressions,
otherwise known as RDF triples. The subject refers to the resource being described, the object is
the literal or other resource describing the subject, and the predicate denotes the type of
relationship between the subject and the object. For example, for the statement, “The water is
blue,” “water” would be the subject, “blue” would be the object describing the water, and “is”
would be the predicate defining the relationship between blue and water. URIs (Uniform
Resource Identifiers) are assigned to the subject and predicate, as well as the object if it is a
resource, in order to allow the system to identify the content and comprehend the relationship in
context of other resource associations. RDF triples, or statements, can themselves be given URIs and used
as resources in other statements, a process known as reification, creating a hierarchy of object-oriented data description.
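The subject-predicate-object structure, and the reification of a whole statement into a resource, can be modeled directly. The class names and URIs below are illustrative sketches, not part of any RDF library.

```java
// Sketch of an RDF triple. A statement assigned a URI is itself a resource
// (reification), so it can appear as the subject or object of another
// statement. All URIs here are illustrative.
public class TripleSketch {
    interface Resource { String uri(); }

    record Named(String uri) implements Resource { }

    // Subject and predicate are resources; the object may be a resource or a literal.
    record Statement(String uri, Resource subject, Resource predicate, Object object)
            implements Resource { }

    public static void main(String[] args) {
        // "The water is blue" -- the object is a literal, so it carries no URI.
        Statement s1 = new Statement("ex:stmt1",
                new Named("ex:water"), new Named("ex:is"), "blue");

        // Reification: a statement about the statement itself.
        Statement s2 = new Statement("ex:stmt2",
                s1, new Named("ex:assertedBy"), "someObserver");
        System.out.println(s2.subject().uri()); // prints "ex:stmt1"
    }
}
```

Because `Statement` implements `Resource`, reified statements nest to any depth, giving the hierarchy of object-oriented description noted above.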
The RDF vocabulary describes the abstract RDF graph and includes classes, such as
rdf:Statement, the class of all RDF statements, and rdf:Property, the class of all possible
properties, as well as properties such as rdf:type, used to declare a resource as an instance of a
class, and rdf:predicate, which gives the predicate of an RDF triple [1]. Additionally, vocabularies can be built
upon the RDF, such as RDFS (Resource Description Framework Schema), which is used to
describe actual instances. Examples are the class rdfs:Resource, which contains all resources on
the web, and the property rdfs:subClassOf, which declares one class a subclass of
another [1].

Figure 1: A visual representation of the Semantic Web. The Semantic Web may be visualized in two
parts: a web visible to users containing documents and ontologies defining terms, and an RDF graph
containing individual resources and literals following association triples defined in an abstract graph. The
abstract graph defines classes and properties, of which instances are created to form resources. These resources
follow the same associations and properties as their classes, and contain terms that can be used in
documents on the web. In order to access these relationships, web ontologies are used, which display
the terms and associations of the RDF graph on the accessible web.
OWL is a language used to describe classes, properties, and instances of resources
through associations, allowing the web not only to store knowledge of specific resources, but
also to recognize the type of a resource and identify it accordingly. OWL presents ontologies through an
abstract syntax, portraying them as a combination of annotations, axioms, and facts [2].
Annotations include human- and machine-readable metadata descriptions of the contained
resources. The actual resource information is described through axioms and facts. The OWL
abstract syntax allows people to define relations and characteristics, which OWL then translates
into triples following the RDF structure, a form comprehensible to the machine.
Currently, a few approaches to ranking semantic web resources exist, the most notable
of which are Swoogle and SemRank. Swoogle is a crawler-based search and metadata
engine that discovers and indexes pages and corresponding metadata, providing basic access
to semantic web resources. The ranking algorithm used for these resources is very similar to
PageRank, with one weighting function integrated to take in the usage of terms on a document
rather than hyperlinks between pages [10]. SemRank is an algorithm developed to rank the different
types of associations on the semantic web, specifically complex associations, and allows users to
vary their search mode to reflect their definition of importance. The algorithm is based upon
studies of ρ-path associations and ordered using a Top-K algorithm in an SSARK system [12, 15].
Theory
The objective was to develop an algorithm that could provide an accurate ranking system
for resources on the semantic web, while also taking advantage of the data knowledge of the web
and ranking pages based on content instead of exchanged links.
The algorithm was developed on two levels, term and document, which would first rank all
terms based upon their usage and associations with other terms, and then rank documents based
on the terms they contained. The term rank of an SWT is based on the weighted average of all the
term’s occurrences as subjects in triples. Initially, all terms are assigned an equal rank, 1 divided
by the total number of terms, such that all the ranks sum to 1. Within each triple, the rank of the
object, the property of the association defined by the predicate, and the specified triple’s usage
on the web contribute to the subject’s rank. The property factor is a constant from 0 to 1 that
determines the relationship between the object and subject, and thus how much rank the object
may pass on to the subject. If the object is a literal, there are no associations in which it is a
subject, so its rank will always remain the initialized rank. Otherwise, since there are objects that
are resources whose ranks may change after each iteration, the ranks of all terms must be
repeatedly calculated until convergence. In order to ensure that each object does not pass on a
larger total rank than it contains, each contribution is divided by the number of triples in which the object
describes another subject. Thus the term rank is defined by equation 1,
TR(SWT) = Σ_{T ∈ SWT} ( TR(T_object) · P_pred · freq / O_triple ) ,   (1)
where TR is the term rank ranging from 0 to 1, SWT is the term to be ranked, T is each triple
defining the SWT as a subject, P_pred is the property factor, freq is the fraction of the term's total
usage on the web that follows the indicated triple relationship, and O_triple is the number of triples
containing the object.
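Equation 1 can be turned into a small iterative sketch. The data structures, names, and fixed iteration count below are assumptions made for illustration, not the project's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Iterative sketch of equation (1): a subject term's rank is the sum, over the
// triples in which it appears as subject, of the object's rank times the
// predicate's property factor times the triple's usage frequency, divided by
// O_triple (the number of triples containing that object). Terms that never
// appear as subjects (e.g. literals) keep their initial rank.
public class TermRankSketch {
    // One triple seen from its subject's side.
    record Edge(String object, double pPred, double freq) { }

    public static Map<String, Double> rank(Map<String, List<Edge>> triplesBySubject,
                                           Map<String, Integer> oTriple,
                                           Set<String> allTerms,
                                           int iterations) {
        Map<String, Double> tr = new HashMap<>();
        for (String t : allTerms) tr.put(t, 1.0 / allTerms.size()); // ranks sum to 1
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>(tr); // non-subjects keep their rank
            for (var entry : triplesBySubject.entrySet()) {
                double sum = 0.0;
                for (Edge e : entry.getValue())
                    sum += tr.get(e.object()) * e.pPred() * e.freq()
                           / oTriple.get(e.object());
                next.put(entry.getKey(), sum);
            }
            tr = next;
        }
        return tr;
    }
}
```

As in the text, a fixed iteration count stands in for a proper convergence test here.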
The P_pred factors are constants, ranging from 0 to 1, describing the strength of
association of each type of predicate relationship. This relative strength is determined through
the average usage and importance of the predicate in triples on the web, which is given in
equation 2,

P_pred = Σ_{doc ∈ pred} ( freq · PR(doc) ) / N_doc ,   (2)

where doc is any document on the web containing a relationship of the specified predicate pred,
freq is the fraction of the total number of associations on the doc that contain the predicate,
PR(doc) is the Google PageRank of the document, and N_doc is the number of documents
containing the predicate.
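A sketch of equation 2, assuming the per-document frequencies and PageRank values are already known; the record and its field names are illustrative.

```java
import java.util.List;

// Sketch of equation (2): a predicate's property factor averages, over the
// documents containing it, the predicate's in-document frequency times the
// document's PageRank.
public class PropertyFactorSketch {
    // freq: fraction of the document's associations that use this predicate.
    record DocUsage(double freq, double pageRank) { }

    public static double pPred(List<DocUsage> docs) {
        double sum = 0.0;
        for (DocUsage d : docs) sum += d.freq() * d.pageRank();
        return sum / docs.size(); // N_doc = number of documents with the predicate
    }
}
```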
The Document Rank, DR (range 0 to 1), ranks documents by the TR and frequency of
each distinct term occurring within the document (the frequency is scaled down to give
precedence to the term's simple appearance over its number of occurrences). Alternatively, the
DR can be calculated based upon specific terms, which isolates a search from external
factors. This allows relevancy searches to be based upon actual understanding of the usage
and context of search terms rather than a simple count of occurrences. The DR of an
SWD is given by equation 3,

DR(SWD) = Σ_{SWT ∈ SWD} TR(SWT) · S_t( freq_term ) ,   (3)

where S_t is the scaling factor, and freq_term is the fraction of the total terms on the document, N_term,
that match the specified term.
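A sketch of equation 3. The text leaves the exact form of the scaling S_t open, so a simple multiplicative factor is assumed here purely for illustration.

```java
import java.util.Map;

// Sketch of equation (3): the document rank sums, over the distinct terms in a
// document, the term's rank times its scaled in-document frequency.
public class DocumentRankSketch {
    public static double rank(Map<String, Double> termRank,  // TR of each distinct term
                              Map<String, Integer> counts,   // occurrences in the document
                              double st) {                   // scaling factor St (assumed multiplicative)
        int nTerm = counts.values().stream().mapToInt(Integer::intValue).sum();
        double dr = 0.0;
        for (var e : counts.entrySet()) {
            double freqTerm = (double) e.getValue() / nTerm; // fraction of the doc's terms
            dr += termRank.get(e.getKey()) * st * freqTerm;
        }
        return dr;
    }
}
```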
Using the class abstraction structure of the RDF graph defined within the ontologies, the
ontological rank of a term, OR (range 0 to 1), can be defined as how specific a term is based on
how many subclass levels the term resides on. All the terms are initialized with an OR of 0, and
incremented for each subclass level. The specificity levels gained from all parents are summed
up and then divided by the total number of parents. Thus, the OR of an SWT can be given by
equation 4,
OR(SWT) = Σ_{T ∈ SWO} ( OR(T_parent) + i ) / N_parent ,   (4)

where T is each triple in which the term is a subclass, T_parent is the term's parent class in that
triple, N_parent is the number of parents, and i is the increment, set to 1 divided by the total
number of terms.
The specificity rank of a document, SR (range 0 to 1), is found through the average of the
OR’s of all the terms on the document, given by equation 5,
SR(SWD) = Σ_{SWT ∈ doc} OR(SWT) / N_term .   (5)
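Equations 4 and 5 can be sketched together over a small class hierarchy. The recursive traversal and the toy hierarchy in the usage are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;

// Sketch of equations (4) and (5): a term's ontological rank (OR) averages,
// over its parent classes, the parent's OR plus a fixed increment i, so deeper
// subclass levels score higher; a document's specificity rank (SR) averages
// the ORs of its terms.
public class SpecificitySketch {
    // parents maps each term to its direct superclasses (empty for root terms).
    public static double or(String term, Map<String, List<String>> parents, double inc) {
        List<String> ps = parents.getOrDefault(term, List.of());
        if (ps.isEmpty()) return 0.0; // roots keep the initial OR of 0
        double sum = 0.0;
        for (String p : ps) sum += or(p, parents, inc) + inc; // one increment per level
        return sum / ps.size(); // average over all parents (N_parent)
    }

    public static double sr(List<String> docTerms, Map<String, List<String>> parents,
                            double inc) {
        double sum = 0.0;
        for (String t : docTerms) sum += or(t, parents, inc);
        return sum / docTerms.size(); // N_term = terms on the document
    }
}
```

For example, in a hierarchy art → painting → cubism, "cubism" sits two subclass levels down and so receives twice the increment of "painting".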
Thus, when performing a search, the user specifies a level of depth to be searched at as
well as a term or group of terms, and results are ranked based upon a combination of SR and DR.
When computing the search, the deviation of the SR from the specified level and of the DR (note
this is the form that takes in specified SWT) from the maximum relevancy of 1, are averaged to
find the total deviation from the search inputs, as seen in equation 6,
Search(level, SWT) = [ (level − SR(SWD)) + (1 − DR(SWD, SWT)) ] / 2 ,   (6)
where level is the user defined specificity level, and the search is performed over all documents,
with the search ranks ranging from 0 (no deviation, most relevant) to 1 (no relevance).
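A sketch of equation 6. Taking the absolute value of the specificity deviation is an assumption added here, since SR can fall on either side of the requested level.

```java
// Sketch of equation (6): the search score averages the deviation from the
// requested specificity level and the deviation from maximum relevance, so 0
// is a perfect match and 1 is no relevance. The Math.abs on the specificity
// deviation is an assumption, not stated in the text.
public class SearchSketch {
    public static double score(double level, double sr, double dr) {
        return (Math.abs(level - sr) + (1.0 - dr)) / 2.0;
    }
}
```

Documents would then be sorted in ascending order of this score, most relevant first.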
Procedures
In order to understand the process and factors behind ranking resources and pages,
manual curation was performed on the Resource Disambiguator Web (RDW), a web that pulls
and collects potential resources to add to the Neuroscience Information Framework (NIF). A
curator then browses through the candidates and for each: determines whether the resource is
already contained on the NIF registry and what registry item it is associated with; categorizes the
resource as a database, ontology, software, core facility, or tissue bank; and marks whether the
candidate is a valuable addition to the NIF, adding additional comments if necessary.
Figure 2: An overview of the ranking procedure. Given an input search with a given level of specificity and certain term or group of terms to search for, the algorithm starts with each individual term, calculating a term rank and an ontological rank based on its associations with other terms. For all documents containing the specified term, the term rank contributes to the document rank, and the ontological rank contributes to the specificity rank. The specificity rank and document rank are then combined to find the importance of the document with respect to the search inputs.
A mathematical model for a semantic web ranking algorithm was developed loosely
based on the logic of PageRank, then refined to adapt to usage for the semantic web. Initially, the
term rank was based on a weighted average of the frequency of occurrences on the web as well
as the rank’s change over time, but was changed as counting the occurrences relied on the
PageRank of the pages containing the terms, and tracking the rank meant keeping a record of all
previous ranks, presenting enormous and impractical memory overheads. Also, the equations and
variables were changed to allow each ranking to have a range from 0 to 1, and initialization
values were determined for each algorithm.
The resulting algorithms were then implemented in Java and documented to allow for
conversion to other languages such as C, C++, and Python. The ranking algorithm consisted of three
classes: a Document class, a Term class, and a Rank class. The Document class was responsible for
keeping track of the terms it contained, including the TermRank, frequency, and
OntologicalRank of each term; calculating the DocumentRank of the document based on the
contained terms; and calculating the SpecificityRank of the document based on the terms’ OR’s.
The Term class was responsible for keeping track of the triples containing the term; calculating
the TermRank based on the triple associations in which it was a subject; and calculating the
OntologicalRank based on the triples it was an object subclass in. The Rank class initialized all
the Documents and Terms contained on the given web, and performed a search given an input as
well as a term or set of terms, returning the rank of each Document.
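The three-class design described above might be skeletonized as follows. The responsibilities follow the text, but the field layout and method signatures are illustrative assumptions, and the rank computations themselves are elided.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Skeleton of the Document / Term / Rank design described above.
public class RankerSketch {
    static class Term {
        final String name;
        final List<String[]> triples = new java.util.ArrayList<>(); // subject/predicate/object
        double termRank, ontologicalRank;
        Term(String name) { this.name = name; }
    }

    static class Document {
        final String uri;
        final Map<Term, Integer> termCounts = new HashMap<>(); // contained terms, frequencies
        double documentRank, specificityRank;
        Document(String uri) { this.uri = uri; }
    }

    static class Rank {
        final List<Document> documents;
        final List<Term> terms;
        Rank(List<Document> documents, List<Term> terms) { // initialize the given web
            this.documents = documents;
            this.terms = terms;
        }
        // Score every document against the requested level using already
        // computed SR and DR, as in equation (6); term matching is elided here.
        Map<Document, Double> search(double level, List<Term> searchTerms) {
            Map<Document, Double> scores = new HashMap<>();
            for (Document d : documents)
                scores.put(d, (Math.abs(level - d.specificityRank)
                               + (1.0 - d.documentRank)) / 2.0);
            return scores;
        }
    }
}
```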
In order to test the implemented methods on data, simple ontologies were first created
using Protégé, an application developed at Stanford, then exported and tested. Next, ontologies from the
NIF developer tools were imported and tested. Ontology pages were ranked first using Google's
PageRank algorithm and then with the developed DocumentRank algorithm, and compared
against the rank assigned by the human curator. Because each ranking uses a different scale, work is
still being done to enable comparison of all ranks on an equal basis. Also, a web
crawler is being implemented so that more resources may be integrated into the ranking
algorithm, creating more accurate rankings for each resource. Currently, tests are still in progress
to determine constants for the property factors for the different types of predicates, as well as the
scaling factor for DocumentRank under various web surfing models.
Discussion
In order to test the ranking algorithm, various factors such as the property factor and
scaling factor must be determined and given constant values. There may also be a factor in the
search ranking in order to scale the importance of specificity and the DocumentRank.
For the property factor, the constant is calculated for each type of predicate, and is based
on Google’s PageRank. Although as previously noted, PageRank is not optimal for ranking
semantic web resources, there is need for some ranking basis in order to initially determine the
properties’ strengths relative to each other, and so PageRank provides a way to differentiate the
sources of the triples contributing to each property factor. As each property type is subject to the
same PageRank algorithm, and only relative strengths are taken into account, systematic errors
of PageRank should not affect the accuracy of the resulting constants. The study of how to rank
relationships is a project in its own right, and new methods for ranking predicate types
are currently being developed.
The scaling factor in calculating DocumentRank is necessary since a term’s general
appearance in documents should take precedence over the frequency of appearance within the
document. For example, a document with two instances of a certain term should not have twice
the rank of a document with only one instance of the term. This factor depends on what surfer
model is used, and for the random surfer model, the scaling factor should be d, 0.85, if each term
is considered as a link, as d measures the likelihood a random surfer will follow a link, or in this
case, read an instance of a term.
When performing a search, it should be up to users how much precedence either
specificity or document rank takes, such as whether they would like to perform a search of all
documents of a specific topic but not a specific depth, or a search of all documents of a specific
level of complexity. However, a default weighting between the two will initially be set at 1 to
1, and modified in accordance with the popularity of certain types of searches.
Conclusion
A method for determining how accurate the ranking algorithm is, and how it compares to
PageRank, is still needed. Human curation rankings are binary, either yes or no, and are not on
any type of scale. Averaging the human rankings for all resources would return a number
between 0 and 1 (if 0 were taken as no and 1 taken as yes), but the number would not be truly
indicative of the document’s rank, since it would imply all resources are equally weighted, which
they are not.
The algorithm has been developed, and now the need is for data to be collected to
determine constants in the ranking equations, as well as to develop accurate rankings for
individual resources. The ranking algorithm is only as accurate as the sample of resources it is
given, so comparing it, when based on a small sample, to PageRank, which is based
upon over a billion pages, is not a fair or significant indicator of accuracy.
Acknowledgements
I would like to thank my mentor, Dr. Anita Bandrowski for introducing me to the field of
web semantics, and helping me develop and test my ranking algorithm. I would also like to thank
the REHS program of San Diego Supercomputer at UCSD for giving me this internship
opportunity. Finally, I thank the entire NIF team for their support and help along my entire
research experience.
References

1. "Resource Description Framework (RDF)." W3C. 15 Mar. 2014. http://www.w3.org/RDF/.
2. OWL Web Ontology Language 1.0 Reference, W3C Working Draft, July 29, 2002.
3. “OntoQuestMain.” Confluence. 14 Apr. 2014 https://confluence.crbs.ucsd.edu/display/NIF/OntoQuestMain
4. “protégé.” Stanford Center of Biomedical Informatics Research. 2014. http://protege.stanford.edu.
5. http://nif-services.neuinfo.org/ontoquest/ontologies/, NIF ontology library, by NIF.
6. Aleman-Meza, B., Halaschek, C., Arpinar, I.B., Sheth, A: Context-Aware Semantic Association Ranking, First Intl. Workshop on Semantic Web and DBs, Berlin, Germany 2003.
7. Page, L.; Brin, S.; Motwani, R.; and Winograd, T. The Pagerank citation ranking: Bringing order to the web. Technical report, Stanford Database group. 1998
8. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7): 1998. pp. 107-117
9. I. Rogers. The Google Pagerank Algorithm and How it Works. http://www.iprcom.com/papers/pagerank/, May 2002.
10. Ding L., Finin T., Joshi A., Pan R., Cost R.S., Peng Y., Reddivari P., Doshi V.C., Sachs J.: Swoogle: A search and metadata engine for the semantic web. In: CIKM’04. 2004.
11. Guo, L., Shao, F., Botev, C., Shanmugasundaram, J. XRANK: Ranked keyword search over XML documents. In ACM SIGMOD 2003, pp. 16–27, San Diego, California.
12. Anyanwu, K., Sheth, A. ρ-Queries: enabling querying for Semantic Associations on the Semantic Web. WWW 2003. pp. 690 – 699.
13. Cohen. S., Mamou, J., Kanza, Y., Sagiv, Y. XSEarch: A Semantic Search Engine for XML, VLDB 2003.
14. Barton, S. Designing Indexing Structure for Discovering Relationships in RDF Graphs. DATESO 7-17. 2004.
15. Anyanwu, K., Maduko, A., and Sheth, A.P.: SemRank: Ranking Complex Relationship Search Results on the Semantic Web, Proceedings of the 14th International World Wide Web Conference, ACM Press, May 2005.
16. Stojanovic, N., Mädche, A., Staab, S., Studer, R., Sure, Y. SEAL -- A Framework for Developing SEmantic PortALs. In: K-CAP 2001 – In Proc. of ACM Conference on Knowledge Capture, October 21-23, 2001.
17. Sheth, A., Aleman-Meza, B., Arpinar, I. B., Halaschek, C., Ramakrishnan, C., Bertram, C., Warke, Y., Avant, D., Arpinar, F. S., Anyanwu, K., Kochut, K. Semantic Association Identification and Knowledge Discovery for National Security Applications Journal of Database Management, 16 (1), Jan-Mar 2005, pp. 33-53.
18. Sheth, A., Ramakrishnan, C.: Semantic (Web) Technology In Action: Ontology Driven Information Systems for Search, Integration and Analysis. IEEE Data Engineering Bulletin, Special issue on Making the Semantic Web Real 2003.
19. Lin, S., Chalupsky, H.: Unsupervised Link Discovery in Multi-relational Data via Rarity Analysis. The Third IEEE International Conference on Data Mining 2003.