
Post on 02-Jan-2016


Search Engine Comparisons
By: Thomie Ventura

Search Engines
  Today, much, but not all, of the work we do revolves around the web
  The Internet is accessible to almost anyone
  Impact on businesses, schools, professionals, home users
  The Web changes every day, but not everything on it is ACCESSIBLE yet

FTP Servers
  Main way of sharing files on the Internet until about 1990
  FTP servers and FTP clients
  Downside
    Servers were mostly known through word of mouth
    Not everyone was setting up their own servers

Grandfather, Grandmother, Mother
  Archie (Grandfather)
    Used FTP file servers
  Veronica (Grandmother)
    Used Gopher file servers
  World Wide Web Wanderer (Mother)
    First robot
    Caused controversy
  Are robots a good or bad thing for the Internet?

“Web Search”
  What exactly does it mean?
    Does it involve tools?
    Does it include accessing proprietary databases such as www.Factiva.com or www.dialog.com?
  We’ll focus on “web search” as search of the open web, and look at it from a searcher’s point of view

Difficulty Coping
  Volume and speed of the web and search engines
    Something new happens each day
    So many things to do, so little time to do them
    Dynamic nature of web searching (indexing new documents)
    Staying up to date with traditional tools (which also undergo changes)
    Other random issues that arise every day

Will an “open web” search engine always have my answers?
  Questions that should arise about searching the web
    How long did it take to get it?
    What is the database or search engine?
    What kinds of questions will it help me answer?
  The open web will not always give me the answer
    What can it be used for?

Quality of Information
  Anyone can become a publisher
  Evaluating content is crucial
    Reputation
    Background
    Qualifications
    Where did it come from?
    What is its purpose?
    Is it relevant to my topic?

Limitations of General Web Search Tools
  Spiders don’t crawl in real time
    Recency
  Linked or submitted sites
    If a website contains 1000 pages, that does not mean search engines make all of them accessible
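
The "linked or submitted" limitation can be illustrated with a toy crawl. This is only a sketch: the site map, the page paths, and the depth limit are hypothetical, and real spiders use far more elaborate crawl policies. It shows that a breadth-first crawl starting from a submitted home page never indexes a page nobody links to, or a page that sits deeper than the crawler is willing to go.

```python
from collections import deque

def crawl(links, seeds, max_depth):
    """Breadth-first crawl of a link graph, stopping max_depth links away from the seeds."""
    indexed = set(seeds)
    queue = deque((page, 0) for page in seeds)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # too deep: do not follow links from this page
        for target in links.get(page, []):
            if target not in indexed:
                indexed.add(target)
                queue.append((target, depth + 1))
    return indexed

# Hypothetical site: each page maps to the pages it links to.
site = {
    "/": ["/about", "/products"],
    "/products": ["/products/widget"],
    "/products/widget": ["/products/widget/specs"],
    "/orphan": [],  # no page links here, and it was never submitted
}

indexed = crawl(site, seeds=["/"], max_depth=2)
all_pages = set(site) | {t for targets in site.values() for t in targets}
missed = sorted(all_pages - indexed)
print(missed)  # ['/orphan', '/products/widget/specs']
```

The two missed pages exist on the site, but a searcher using this engine would never find them.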

Invisible or Hidden Web Resources
  Examples:
    Interactive resources that return “custom” pages
    Registration
  Why is it hidden?
    Created on the fly
    Spiders don’t fill in registration forms
    “No-Robot” tag
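
The slide does not say which "No-Robot" mechanism it means. Assuming it is the robots meta tag in a page's HTML head, a polite spider might check it roughly as follows. The class and function names here are made up for illustration; only the `noindex` and `none` directive values come from the robots meta tag convention.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

def may_index(html):
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives and "none" not in parser.directives

hidden_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
open_page = "<html><head><title>Welcome</title></head></html>"
print(may_index(hidden_page), may_index(open_page))  # False True
```

A page that opts out this way stays on the web but out of the index, which is one reason it counts as "hidden".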

Hidden is not always bad
  Research and effort
    Without proper tools, we can make large databases even larger
      Google
      Altavista
      Excite
  Distributing information properly

Specialized, Focused, and Site-Specific Search Tools
  Necessary and important
  Hidden Web is out of reach of general-purpose search engines
  More precision than recall
  Examples:
    www.Psychcrawler.com
    www.Inomics.com
    [http://newssearch.bbc.co.uk/ksenglish/query.htm]
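
Precision and recall can be made concrete with a small calculation. The document IDs and result lists below are invented for illustration. Precision is the share of returned results that are relevant; recall is the share of all relevant documents that were returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against the set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query with four relevant documents in total.
relevant = {"d1", "d2", "d3", "d4"}

# A specialized engine returns few results, all on topic.
specialized = ["d1", "d2"]
# A general engine returns everything relevant plus a lot of noise.
general = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]

print(precision_recall(specialized, relevant))  # (1.0, 0.5)
print(precision_recall(general, relevant))      # (0.4, 1.0)
```

In this made-up case the specialized tool misses half of the relevant documents, but nothing it returns wastes the searcher's time.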

Identifying and Collecting Specialized Engines
  Profusion
    [http://www.profusion.com]
  Librarians’ Index
    Covers a large number of specialized and invisible web databases
    [http://www.lii.org]

Meta-Search Engines
  Major disadvantages
    You get it all! High recall, low precision
    Only the basic features of each search engine are used
    Queries are sent to “pay for placement” engines
  A good metasearch engine
    www.vivisimo.com
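
As a rough sketch of what a metasearch engine does with the results it collects, the function below interleaves ranked lists from several engines and drops duplicate URLs. The engine names and URLs are hypothetical, and round-robin interleaving is just one simple merging strategy, not necessarily what any real metasearch engine uses.

```python
from itertools import zip_longest

def merge_results(result_lists):
    """Interleave ranked result lists rank by rank, keeping the first copy of each URL."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):  # all rank-1 results, then rank-2, ...
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical results from two engines for the same query.
engine_a = ["a.example/1", "b.example/2", "c.example/3"]
engine_b = ["b.example/2", "d.example/4"]

merged = merge_results([engine_a, engine_b])
print(merged)  # ['a.example/1', 'b.example/2', 'd.example/4', 'c.example/3']
```

The merged list is longer than either input, which is the "you get it all" effect: recall goes up, while nothing in the merge step filters out irrelevant hits.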

Old Pages, GONE!
  Trying to find old pages?
    Contact the webmaster
  Fortunately
    Archiving old material
    Example:
      [http://www.clinton.nara.gov/index.html]
    Alexa Research
      [http://archive.alexa.com/]
      carries over 18 terabytes of data covering some 5 million Web sites and some 1.9 billion pages

Search Engine Sizes
  This is a search engine size analysis as of December 11, 2001

Google Dominates

Sizes Over Time

Closer Look

Dealing with Coping
  Use the search engine
    Conduct research on a topic
      This will get you familiar with the search engine
      You can see how results are displayed
      You can judge the relevancy of returned documents
      It lets you gather your own bookmarks

Understanding Limitations
  What to do with these limitations?
    Know the limitations
    Use more than one search engine
    Use “specialized” search engines that go deeper into a site to collect more information
    Use “invisible web” resources
    Use web directories, and bookmark important sites

Ability to Search Multimedia
  Now available, but still expanding
    What used to take weeks is now instant
    Search tools that provide access to video and audio material using a non-text mechanism
      ex: searching for a specific background or type color
  Still image tools
    Google, Altavista, and Fast use the text surrounding an image
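
Indexing an image by the text around it can be sketched as follows. This is an assumption-heavy illustration, not how Google, Altavista, or Fast actually work: the three-word window, the use of the `alt` attribute, and all names in the code are choices made for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

class ImageContextParser(HTMLParser):
    """Records words and <img> tags in the order they appear in the page."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            self.tokens.append(("img", attr.get("src") or "", attr.get("alt") or ""))

    def handle_data(self, data):
        self.tokens.extend(("word", w.lower()) for w in data.split())

def index_images(html, window=3):
    """Map each word near an image (or in its alt text) to the image sources it describes."""
    parser = ImageContextParser()
    parser.feed(html)
    parser.close()
    index = defaultdict(set)
    for i, token in enumerate(parser.tokens):
        if token[0] != "img":
            continue
        _, src, alt = token
        nearby = parser.tokens[max(0, i - window): i + window + 1]
        words = [t[1] for t in nearby if t[0] == "word"] + alt.lower().split()
        for word in words:
            index[word].add(src)
    return index

page = '<p>Sunset over the harbor <img src="harbor.jpg" alt="Boats at dusk"> taken last summer</p>'
index = index_images(page)
print(sorted(index["harbor"]))  # ['harbor.jpg']
```

A text query for "harbor" or "dusk" would find the picture, while "sunset" would not, because that word falls outside the window. The engine never looks at the pixels, which is why an image with no descriptive text nearby is effectively invisible.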

Become Aware of Multimedia Search
  Video searches
    Virage www.virage.com
    TVeyes www.tveyes.com
    ShadowTv www.shadowtv.com
    Wordwave www.wordwave.com
    SpeechBot (keyword search engine demo by Compaq, uses speech technology to create real-time transcripts) www.speechbot.com
  Image searches
    Webseek (search or browse by criteria in the image) www.ctr.columbia.edu/webseek/
    Visoo (uses software that looks for words embedded in the image) www.visoo.com

Making Old Pages Stay
  Long term?
    Offer comments (suggest how material can be made more accessible and searchable; a great archive of content without the right means of accessing it is a hassle and of little use)
  Short term?
    Take advantage of Google’s cache feature (Google crawls a site and, unless it is not authorized to, makes a copy and puts it on its server. If the site is gone, the copy is still on Google’s server: go to the search results and click “cached” next to the URL. The copy will not always be there; the next time the spider crawls the site and finds the page missing, it will not save a copy.)
    www.savethis.com (lets you save web pages and access them later)