CP3024 Lecture 12 Search Engines

Upload: veronica-wiggins

Post on 18-Jan-2018

Page 1: CP3024 Lecture 12 Search Engines. What is the main WWW problem?  With an estimated 800 million web pages finding the one you want is difficult!

CP3024 Lecture 12

Search Engines

Page 2:

What is the main WWW problem?

With an estimated 800 million web pages, finding the one you want is difficult!

Page 3:

What is a Search Engine?

A page on the web connected to a backend program

Allows a user to enter words which characterise a required page

Returns links to pages which match the query

Page 4:

A Typical Search Engine

Page 5:

Types of Search Engine

Automatic search engine, e.g. AltaVista, Lycos

Classified directory, e.g. Yahoo!

Meta-search engine, e.g. Dogpile

Page 6:

Components of a Search Engine

Robot (or Worm or Spider)
– collects pages
– checks for page changes

Indexer
– constructs a sophisticated file structure to enable fast page retrieval

Searcher
– satisfies user queries
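
The indexer and searcher roles listed above can be illustrated with a very small sketch. It assumes the robot has already collected some pages; the `PAGES` dictionary, the example.org URLs and the function names are all hypothetical, and the "sophisticated file structure" is reduced here to a plain in-memory inverted index.

```python
from collections import defaultdict

# Hypothetical stand-in for pages a robot has already collected (URL -> text).
PAGES = {
    "http://example.org/a": "search engines index web pages",
    "http://example.org/b": "robots follow links between pages",
}

def build_index(pages):
    """Indexer: map each word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, word):
    """Searcher: return the URLs containing the word, sorted for stable output."""
    return sorted(index.get(word.lower(), set()))

INDEX = build_index(PAGES)
```

Calling `search(INDEX, "pages")` returns both URLs, while a word that was never indexed returns an empty list. A real engine would keep this structure on disk and store word positions and weights as well.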

Page 7:

Query Interface

Usually a boolean interface
– (Fred and Jean) or (Bill and Sam)

Normally allows phrase searches
– "Fred Smith"

Also proximity searches

Not generally understood by users

May have extra 'friendlier' features
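
The three query styles above can be sketched over a handful of toy documents. The `DOCS` collection and the helper names are assumptions made for illustration only; the slides do not say how any particular engine evaluates queries.

```python
# Hypothetical documents; the names come from the slide's example queries.
DOCS = {
    1: "fred smith met jean",
    2: "bill and sam went home",
    3: "fred met bill",
    4: "smith fred",
}

def docs_with(word):
    """Return the ids of documents that contain the word."""
    return {doc_id for doc_id, text in DOCS.items() if word in text.split()}

def phrase_match(phrase):
    """Return ids of documents containing the phrase's words adjacently, in order."""
    words = phrase.lower().split()
    hits = set()
    for doc_id, text in DOCS.items():
        tokens = text.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                hits.add(doc_id)
                break
    return hits

def near(word_a, word_b, max_gap):
    """Return ids of documents where the two words occur within max_gap positions."""
    hits = set()
    for doc_id, text in DOCS.items():
        tokens = text.split()
        pos_a = [i for i, t in enumerate(tokens) if t == word_a]
        pos_b = [i for i, t in enumerate(tokens) if t == word_b]
        if any(abs(i - j) <= max_gap for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits

# (Fred and Jean) or (Bill and Sam): AND is set intersection, OR is set union.
boolean_hits = (docs_with("fred") & docs_with("jean")) | (docs_with("bill") & docs_with("sam"))
phrase_hits = phrase_match("Fred Smith")
proximity_hits = near("fred", "jean", 3)
```

Document 4 contains both "fred" and "smith" but in the wrong order, so it does not match the phrase query. This is one reason phrase searches tend to return fewer, more relevant links.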


Page 8:

Search Results

Presented as links

Supposedly ordered in terms of relevancy to the query

Some search engines score results

Normally organised in groups of ten per page

Page 9:

Problems

Links are often out of date

Usually too many links are returned

Returned links are not very relevant

The engines don't know about enough pages

Different engines return different results

U.S. bias

Page 10:

Improving query results

To look for a particular page use an unusual phrase you know is on that page

Use phrase queries where possible

Check your spelling!

Progressively use more terms

If you don't find what you want, use another search engine!

Page 11:

Who operates Search Engines?

People who can get money from venture capitalists!

Many search engines originate from U.S. universities

Often paid for by advertisements

Engines monitor carefully what else interests you (paid by the click)

Page 12:

How do pages get into a Search Engine?

Robot discovery (following links)

Self submission

Payments

Page 13:

Robot Discovery

Robots visit sites while following links

The more links the more visits

Make sure you don't exclude robots from visiting public pages
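
Robots are usually excluded through a site's robots.txt file, so one way to check the last point is to test the rules the way a well-behaved robot would. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the robot name `ExampleBot` and the example.org URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keeps robots out of /private/ but leaves the rest public.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a robot calling itself "ExampleBot" may fetch each URL.
public_ok = parser.can_fetch("ExampleBot", "http://example.org/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://example.org/private/notes.html")
```

Here `public_ok` is true and `private_ok` is false. A rule such as `Disallow: /` would hide the whole site from any robot that honours the file.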

Page 14:

Payments

Some search engines only index paying customers

The more you pay, the higher you appear in answers to queries

Page 15:

Self submission

Register your page with a search engine

Pay for a company to register you with many search engines

Get registration with many search engines for free!

Page 16:

Getting to the top

Only pages relevant to a query should be ranked highly

Search engines only look at text

Search engine operators try to stop "search engine spamming"

Some queries are pre-answered

Page 17:

Get where you should be!

Put more than graphics on a page

Don't use frames

Use the ALT attribute (ALT="…") on images

Make good use of <TITLE> and <H1>

Consider using the <META> tag

Get people to link to your page
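
Since search engines only look at text, it can help to see which text those tags expose. The following is a minimal sketch using Python's standard `html.parser`; the `IndexableText` class, the sample page and the choice of fields are assumptions for illustration, not a description of how any real engine reads a page.

```python
from html.parser import HTMLParser

class IndexableText(HTMLParser):
    """Collect the text a text-only indexer could use from a page."""

    def __init__(self):
        super().__init__()
        self.found = {"title": "", "h1": "", "alt": [], "meta": {}}
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
        elif tag == "img" and attrs.get("alt"):
            self.found["alt"].append(attrs["alt"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.found["meta"][attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.found[self._current] += data.strip()

# Hypothetical page that follows the advice on the slide.
PAGE = """<html><head><title>Fred Smith Bikes</title>
<meta name="keywords" content="bikes, repairs">
</head><body><h1>Bike repairs</h1>
<img src="shop.gif" alt="Front of the shop"></body></html>"""

extractor = IndexableText()
extractor.feed(PAGE)
```

After `feed`, `extractor.found` holds the title, the heading, the image's ALT text and the keywords. A page made only of images with no ALT text would leave every field empty, giving an indexer nothing to work with.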

Page 18:

Summary

Search Engines are vital to the Web user

Search Engines are not perfect by a long way

There are tactics for better searching

Page design can bring more visitors via Search Engines

The more links the better!

Page 19:

WWLib-TNG

A Next Generation Search Engine

Page 20:

In the beginning

WWLib-TOS
– Manually constructed directory
– Classified on Dewey Decimal
– Simple data structure
– Proof of concept

Page 21:

The New Architecture

Page 22:

The Classifier

Page 23:

Motive - Why Generate Metadata Automatically?

Meta tags are not compulsory

Old pages are less likely to have meta tags

Available data can be unreliable

The Web of Trust requires comprehensive resource description

An essential prerequisite for widespread deployment of RDF applications

Page 24:

Method - How can Metadata be Generated Automatically?

Using an automatic classifier

The classifier classifies Web pages according to Dewey Decimal Classification

Other useful metadata can be extracted during the process of automatic classification

Page 25:

Automatic Classification

Intended to combine the intuitive accuracy of manually maintained classified directories with the speed and comprehensive coverage of automated search engines

DDC has been adopted because of its universal coverage, multilingual scope and hierarchical nature

Page 26:

Automatic Classifier - How does it work?

Firstly, the page is retrieved from a URL or local file and parsed to produce a document object

Page 27:

Automatic Classifier - How does it work?

The document object is then compared with DDC objects representing the ten top-level DDC classes

Page 28:

Automatic Classifier - How does it work?

Each time a word in the document matches a word in the DDC object, the two associated weights are added to a total score

A measure of similarity is then calculated using a similarity coefficient

Page 29:

Automatic Classifier - How does it work?

If there is a significant measure of similarity, the document will be compared with any subclasses of that DDC class

If there are no subclasses (i.e. the DDC class is a leaf node) the document is assigned the classmark

If the result is not significant, the comparison process will proceed no further through that particular branch of the DDC object hierarchy
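
The classification steps described on the last few slides can be sketched as a recursive descent through a toy hierarchy. Everything concrete here is an assumption: the slides do not name the similarity coefficient or the significance threshold, so the sketch uses a Dice-style ratio (matched weight over total weight) and a cut-off of 0.3, and the `DDCClass` type, the word weights and the two-level hierarchy are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DDCClass:
    """A node in a toy DDC hierarchy: a classmark, weighted words, and subclasses."""
    classmark: str
    weights: dict
    subclasses: list = field(default_factory=list)

def similarity(doc_weights, ddc):
    """Assumed Dice-style coefficient: matched weight over total weight, in [0, 1]."""
    # For every matching word, the document weight and the class weight are both added.
    score = sum(doc_weights[w] + ddc.weights[w] for w in doc_weights if w in ddc.weights)
    total = sum(doc_weights.values()) + sum(ddc.weights.values())
    return score / total if total else 0.0

def classify(doc_weights, classes, threshold=0.3):
    """Descend only into significantly similar classes; return classmarks of matching leaves."""
    classmarks = []
    for ddc in classes:
        if similarity(doc_weights, ddc) < threshold:
            continue  # not significant: go no further down this branch
        if ddc.subclasses:
            classmarks.extend(classify(doc_weights, ddc.subclasses, threshold))
        else:
            classmarks.append(ddc.classmark)  # leaf node: assign the classmark
    return classmarks

# Hypothetical two-level hierarchy and document; real DDC objects would be far larger.
TOP = [
    DDCClass("000", {"computer": 2, "information": 1}, [
        DDCClass("004", {"computer": 2, "data": 1}),
        DDCClass("020", {"library": 2, "information": 1}),
    ]),
    DDCClass("500", {"science": 2, "mathematics": 1}),
]
DOC = {"computer": 3, "data": 1, "web": 1}
RESULT = classify(DOC, TOP)
```

With these weights the document scores 0.625 against class 000, so its subclasses are examined. Subclass 004 scores 0.875 and, being a leaf, supplies the classmark, while 020 and 500 score zero and are not explored further. Because several branches can pass the threshold, a document may receive more than one classmark.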

Page 30:

Metadata elements

The automatic classification process can be used to extract useful metadata elements in addition to the classification classmarks:

Keywords

Classmarks

Word count

Title

URL

Abstract

A unique accession number and associated dates can be obtained and supplied by the system

Page 31:

Metadata elements - Wolverhampton Core

   Wolverhampton Core          Dublin Core
1  Unique Accession number     Identifier
2  Title                       Title
3  URL                         Identifier
4  Abstract                    Description
5  Keywords                    Subject
6  Classmarks                  Subject
7  Word count                  (none)
8  Classification date         (none)
9  Last modified date          Date

Page 32:

RDF Data Model

Page 33:

RDF Schema

There is a significant overlap with the Dublin Core element set

Requirement for implementation clarity

Those that have Dublin Core equivalents are declared as sub-properties

Maintain interoperability with Dublin Core applications

Page 34:

RDF Schema

<rdf:Description ID="Keyword">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Keyword</rdfs:label>
</rdf:Description>

<rdf:Description ID="Classmark">
  <rdf:type rdf:resource="http://www.w3.org/TR/WD-rdf-syntax#Property"/>
  <rdfs:subPropertyOf resource="http://purl.org/metadata/dublin_core#Subject"/>
  <rdfs:label>Classmark</rdfs:label>
</rdf:Description>

Page 35:

Classifier Evaluation

Automatic metadata generation will become important for the widespread deployment of RDF based applications

Documents created before the invention of RDF-generating authoring tools also need to be described

RDF utilised in this manner may encourage interoperability between search engines

More info: http://www.scit.wlv.ac.uk/~ex1253/

Page 36:

Current Status of WWLib-TNG

New results interface proposed
– R-wheel (CirSA)

Builder and searcher constructed, now being tested

Classifier constructed

Test Dispatcher/Analyser/Archiver in place