
iVia and Data Fountains: Open Source Internet Portal System and Metadata Generation Service for Amplifying the Efforts of Subject Experts

Julie Mason, Data Fountains Service Manager, University of California, Riverside

LITA 2004 National Forum, St. Louis, Missouri

SECTION I:

Technologies and Architectures of iVia and Data Fountains

SECTION II:

Classification

SECTION III:

Preview of Data Fountains Interface

iVia and Data Fountains

SECTION I: Technologies and Architectures of iVia and Data Fountains


Technologies and Architecture

New and Interactive Collection Building Technologies to Amplify Expert Effort / New Uses of Expertise to Refine Collection Building Technologies

Focused crawling

Rich Text Identification and Harvest

Machine-based Classification

Foundation Record

Hybrid Collections Architecture

Usage in Existing, Closed, Relatively Homogeneous Collections

Cooperative Technology

Appropriately Scaled and Modularly Designed



http://infomine.ucr.edu/iVia/

The new open source software platform designed to help INFOMINE and other virtual libraries scale by amplifying expert effort, enabling the development of better and more representative collections.

Automated and semi-automated Internet resource identification and collection is made possible through focused crawling software.

Automated and semi-automated indexing or metadata generation is made possible through classifier software.

A hybrid, two-tiered collection is supported. The first tier consists of expert-created records and the second tier of machine-created records.

Brings together some of the best of expert-created virtual library approaches with the best of automated approaches to collection building.


Architecture overview of iVia


Master Database

MySQL-based SQL database

Contains both expert and robot generated records

Contains metadata (URLs, subjects, keywords, authors, titles, …)

Contains full text in the form of compressed Web page content


The Adder Interface

Sophisticated Web interface for expert classification of Web pages

Password-protected with varying privilege levels

Allows both adding of new resources and editing of already existing ones

Has automatic resource duplicate checking

Contains an automatic metadata extractor

Configurable via a preferences screen


Crawlers

Add robot records to the master database

Assign metadata to crawled records

Three different types of crawlers in iVia/DF

Expert-guided crawler with drill-down and drill-out to crawl single sites

VL-crawler to crawl virtual libraries

Nalanda iVia Focused Crawler (NIFC) to crawl Web communities defined around a given topic


Search Engine

Public search interface (e.g. http://infomine.ucr.edu)

Based on inverted index databases built from the contents of the master database on a nightly basis

Supports sophisticated searching through metadata and full-text

Nested Boolean queries, truncated searches, word proximity searches, etc.

Search results can be displayed in a wide variety of different themes (skins) that allow collaborating institutions to brand their interface
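As a rough illustration of the inverted-index idea behind the search engine, the Python sketch below maps each word to the set of record IDs that contain it and answers a simple Boolean AND query. The sample records and the function names `build_inverted_index` and `search_all` are made up for this example; it does not reflect iVia's actual index format or query syntax.

```python
import re
from collections import defaultdict

def build_inverted_index(records):
    """Map each lowercased word to the set of record IDs that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(record_id)
    return index

def search_all(index, query):
    """Return IDs of records containing every word in the query (Boolean AND)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical records standing in for master-database content.
records = {
    1: "Korea Rice Genome Database",
    2: "Rice cultivation statistics",
    3: "Human genome project",
}
index = build_inverted_index(records)
print(search_all(index, "rice genome"))  # {1}
```

Rebuilding such an index in one batch, as in the nightly build described above, keeps the search side read-only and simple.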


http://infomine.ucr.edu/Data_Fountains (under development this year)

A cooperative, cost-recovery-based metadata generation service. It will be an array of iVias, one for each participating project or subject community, that creates metadata records for the participants.

A big emphasis, in addition to fully-automated resource discovery and metadata generation, will be on semi-automated approaches that strongly involve and amplify the efforts of collection experts. They, in turn, work to refine and perfect machine approaches and processes.

The metadata created can be bundled into different “products” according to participant needs: the amount of metadata, its type (natural language or controlled terminology), and the degree of relevance or comprehensiveness desired (highly relevant or moderately relevant records).


Architecture overview of DF


Seed Set Generator

Seed sets are sets of URLs that define a topic of interest

Seed sets can be supplied in various formats by a client (e.g. simple text file with a list of URLs)

Typically need around 200 highly topic-specific URLs

Problem: most users would come up with only a few dozen

Solution: scout module uses a search engine such as Google to fatten up the user-provided initial set


Nalanda iVia Focused Crawler

Primarily developed by Dr. Soumen Chakrabarti (IIT Bombay), a leading crawler researcher

Sophisticated focused crawler using document classification methods and Web graph analysis techniques to stay on topic

Supports user interaction via URL pattern blacklisting, etc.

Uses an apprentice classifier to prioritize links that should be followed

Returns a list of URLs likely to be on the initial seed set topic
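The idea of prioritizing links and blacklisting URL patterns can be sketched as a best-first traversal. This is a toy illustration only, not the NIFC implementation: the in-memory `links` graph stands in for fetching pages, the `score` function stands in for a trained classifier, and all names here are hypothetical.

```python
import heapq

def focused_crawl(seeds, links, score, blacklist=(), limit=10):
    """Best-first traversal: always expand the highest-scoring known URL next.

    links: dict mapping URL -> list of outgoing URLs (stand-in for fetching)
    score: function URL -> relevance number (stand-in for a classifier)
    blacklist: URL substrings to skip (stand-in for pattern blacklisting)
    """
    # heapq is a min-heap, so store negated scores to pop the best URL first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        if any(pattern in url for pattern in blacklist):
            continue  # skipped URLs are neither reported nor expanded
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score(target), target))
    return visited

# Hypothetical link graph and relevance scores.
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
relevance = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.7}
print(focused_crawl(["a"], links, relevance.get))  # ['a', 'c', 'b', 'd']
```

Passing `blacklist=["b"]` in this toy graph drops "b" and, because "b" is never expanded, also keeps "d" from being discovered.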


Distiller

Attempts to rank URLs returned by the NIFC according to their relevance to the client-provided topic

Uses improved Kleinberg-like Web graph analysis to assign hub and authority values to each URL

Returns scores for each provided URL
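For readers unfamiliar with hub and authority values, here is a minimal sketch of the textbook Kleinberg (HITS) iteration. It is not the Distiller's improved variant, whose details are not given here; the `hits` function name and the tiny link graph are assumptions for illustration.

```python
import math

def hits(links, iterations=50):
    """Textbook HITS: return (hub, authority) score dicts for a link graph.

    links: dict mapping each URL to the list of URLs it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages linking to the node.
        auth = {n: 0.0 for n in nodes}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub: sum of the authority scores of pages the node links to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize both score sets so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical graph: two hub pages pointing at candidate resources.
links = {"hub1": ["x", "y"], "hub2": ["x"], "x": [], "y": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # x
```

A page linked from many good hubs gets a high authority value, and a page linking to many good authorities gets a high hub value.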


Metadata Exporter

Final stage of DF

Provides clients with convenient data formats to incorporate the best on-topic URLs into their own databases

Allows different amounts/quality of metadata to be exported based on the client’s selected service model

Supports various export types and file formats (simple URL lists, delimiter-separated file formats, XML file formats, MARC records and export via OAI-PMH)


Modular Architecture that Supports a Federated Array of Subject-Specific Focused Crawlers and Classifiers

INFOMINE is a virtual library containing over 100,000 links (a hybrid collection of 26,000 librarian-created links and more than 75,000 robot/crawler-created links).

Founded in January 1994, it is one of the first Web-based services offered by a library anywhere.

It is a cooperative effort of librarians from UC Riverside, other UCs (including UCLA, UCSC and the UC Shared Cataloging Project), three California State Universities, Wake Forest University and the University of Detroit. Special cooperative efforts are in progress with the Library of Congress and NSDL.


http://infomine.ucr.edu


SECTION II: Classification


LCC: Library of Congress Classification

LCSH: Library of Congress Subject Headings

INFOMINE Subject Categories
• Biological, Agricultural, and Medical Sciences
• Business and Economics
• Cultural Diversity
• Electronic Journals
• Government Info
• Maps and Geographical Information Systems
• Physical Sciences, Engineering, and Mathematics
• Social Sciences and Humanities
• Visual and Performing Arts

Classification: Example Subject Categories


Example


Example: Korea Rice Genome Database

Is it about…
– Geography?
– Agriculture?
– Genetics?

Which INFOMINE category do we put it in?
– Biological, Agricultural, and Medical Sciences

Pretty obvious, right?
– For humans, yes. But how do we automate it?


Automating Document Classification

• We need a way to measure document similarity

• Each document is basically just a list of words, so we can count how frequently each word appears in it

• These word frequencies are one of many possible document attributes

• Document similarity is mathematically defined in terms of document attributes



The previous slide contains 51 words
– document: 6
– word, of: 3 each
– we, a, in, is, each: 2 each
– All other words: 1 each

Note that we consider words such as word and words to be the same

We also don’t care about capitalization

In general, we’d also ignore non-descriptive words such as we, a, of, the, and so on
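The counting rules above (ignore capitalization, treat word and words as the same, optionally drop non-descriptive words) can be sketched in Python. The `normalize` rule that strips a trailing "s" is a crude assumption standing in for real stemming, and the stopword list is only an example.

```python
import re
from collections import Counter

def normalize(word):
    """Lowercase and strip a plural 's' (a crude stand-in for real stemming)."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def word_frequencies(text, stopwords=frozenset()):
    """Count normalized words, skipping any listed in stopwords."""
    words = (normalize(w) for w in re.findall(r"[A-Za-z]+", text))
    return Counter(w for w in words if w not in stopwords)

# Text of the "Automating Document Classification" slide, title included.
slide = """Automating Document Classification
We need a way to measure document similarity
Each document is basically just a list of words, so we can count how frequently each word appears in it
These word frequencies are one of many possible document attributes
Document similarity is mathematically defined in terms of document attributes"""

counts = word_frequencies(slide)
print(sum(counts.values()), counts["document"], counts["word"])  # 51 6 3
```

Passing `stopwords={"we", "a", "of", "the", "in", "is", "each"}` removes the non-descriptive words before counting.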



Not an easy task
– The distribution of words shows that the slide in question is not very rich in content
• The most frequent word (document) is not very descriptive
• The most descriptive word (classification) does not appear very frequently in the slide
– How descriptive and how frequent a word should be depends on the category

The task is easier when:
– we have a large number of content-rich documents
– categories are characterized by very specific words which don’t appear very frequently in other categories



Two documents sharing a large number of category-specific words are considered to be very similar to each other

Document similarity can thus be quantified and computed automatically

Documents can then be ranked by their similarity to each other

A large group of documents that are all very similar to each other can then be considered to define the category they belong to (the set of all such groups is called the Training Corpus)

One way to classify a document is then to put it in the same category as that of the training document that it’s most similar to
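A minimal sketch of this idea follows, assuming cosine similarity over raw word counts as the similarity measure (one common choice; the measure is not specified here). The tiny training corpus and the `classify` function are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(document, training_corpus):
    """Return the category of the training document most similar to document.

    training_corpus: list of (category, text) pairs.
    """
    vector = Counter(document.lower().split())
    best_category, _ = max(
        training_corpus,
        key=lambda item: cosine_similarity(vector, Counter(item[1].lower().split())),
    )
    return best_category

# Hypothetical training documents, one per category.
training_corpus = [
    ("Biological, Agricultural, and Medical Sciences",
     "rice genome gene sequence crop agriculture"),
    ("Business and Economics",
     "market trade price economics finance"),
    ("Maps and Geographical Information Systems",
     "map atlas korea geography region"),
]
print(classify("Korea Rice Genome Database", training_corpus))
```

In this toy corpus the example title shares two words with the first training document and only one with the maps document, so it lands in the first category.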



The classification method just described is known as the Nearest Neighbor method

There are other methods, which may be better suited to the classification of documents from the Internet
– Naïve Bayes
– Support Vector Machine (SVM)
– Logistic Regression

INFOMINE uses a flexible approach, supporting all of these methods, in an attempt to produce highly accurate classifications


SECTION III: Preview of Data Fountains Interface