Outline
Motivations of “Semantic Web” “Semantic Web” structure Motivations of “Automatic Ontology
Extraction” “Concept Extraction” Problems Main ideas
Motivations of “Semantic Web” Computation History:
W3C definition: “Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries”.
Huge Computations
Gathering Human Data
Sharing Data
Processing Huge Data
First Computers First ApplicationsInventing Web
Semantic Web
Motivations of “Semantic Web” Some Efforts:
http://dbpedia.org/ (RDF enabled Wikipedia Data) http://www.foaf-project.org/ (Friend of a Friend) http://sioc-project.org/ (Semantically-Interlinked
Online Communities) http://openguid.net/ http://simile.mit.edu/ (Semantic Interoperability of
Metadata and Information in unLike Environments)
Outline
Motivations of “Semantic Web” “Semantic Web” structure Motivations of “Automatic Ontology
Extraction” “Concept Extraction” Problems Main ideas
“Semantic Web” structure
Road from “zero” to Semantic Web
Having an Ontology
Annotating Documents
Submit Query on Semantic Enabled Data
Having an Ontology
Generating Data According to Ontology
Submit Query on Semantic Enabled Data
Existing Data
Existing Data
Outline
Motivations of “Semantic Web” “Semantic Web” structure Motivations of “Automatic Ontology
Extraction” “Concept Extraction” Problems Main ideas
Motivations of “Automatic Ontology Extraction” Ontology is domain specific In each domain we need a team of experts to
create an ontology Existing ontologies cover small ranges of
knowledge. WordNet Data about members, projects and … in
Universities DBLP, …
Motivations of “Automatic Ontology Extraction”
DailyMed for drugs FOAF MySpace BBC Widely used one is AudioSrobbler Based on event
ontology with 600 million entry but it is a simple ontology.
Problem: Representing Knowledge in a Machine Readable Format.
Outline
Motivations of “Semantic Web” “Semantic Web” structure Motivations of “Automatic Ontology
Extraction” “Concept Extraction” Problems Main ideas
“Concept Extraction”
An ontology is a set of “Concepts”, properties of concepts (and constraints on them) and “Rules” between “Concepts”.
If we can’t generate a general ontology in a domain, “Concepts” can be used in determining the relevance of documents and their ranks.
Outline
Motivations of “Semantic Web” “Semantic Web” structure Motivations of “Automatic Ontology
Extraction” “Concept Extraction” Problems Main ideas
Main Components in existing methods are as below: Similarity Measures
Euclidean Measures: Cosine Distance Problem: If we have a document “B” that is the subset of
a document “A” then the cosine distance will be large
Problems
Problems
Importance Measures Frequency: Has big false positive error (on technical
words) and false negative error (on common words) TFiDF (Term Frequency . Inverse Document
Frequency): Has more accuracy But if we have two semantic correlated words this measure
will make mistake on main purpose of the document. Also for long documents it will fail Information about order of words will be lost If we consider stems and synonyms or other related set of
words it can work fine, otherwise many problems.
Problems
LSI (Latent Semantic Indexing/Analysis) This method uses SVD (singular value decomposition) After applying SVD we have term×concept (U matrix),
eigenvalues, concept×document (V matrix). Matrices U and V are eigenvectors of terms correlation
matrix and document correlation matrix. Eigenvectors have property of clustering the nodes of a
graph (for example in PageRank algorithm we use same property)
But there are problems.
Problems
This method assumes some probabilistic properties for term-document matrix that may be not hold, for example Gaussian Model properties.
This method is not customized for sparse matrices This method is offline This method has big computation complexity
Some efforts have been made such as SDD This method again is based on cosine distance and similar
measures
Outline
Motivations of “Semantic Web” “Semantic Web” structure Motivations of “Automatic Ontology
Extraction” “Concept Extraction” Problems Main ideas
Main ideas
Main ideas on this process are as below Using more general spaces instead of Euclidian
space, such as “Banach Spaces” and defining a new distance measure that can serve desirable properties of term-document space. Banach spaces have interesting properties for high
dimensional problems because of their generlity. Using Linguistics theorems in determining the
importance of a term in a document instead of TFiDF
Main ideas
Customizing LSI for sparse matrices and creating a framework for any method for matrix decomposing.
Giving an online version of LSI LSI (SVD) basically is an approximation for term-
document matrix but also have big computation complexity and can be approximated to.
We use clustering properties of eigenvectors in LSI, so we can substitute it with any clustering that is not based on Euclidian as mentioned.