the players the majors dead search engines international search engines metasearch engines

Post on 22-Dec-2015


The Players

The Majors
Dead Search Engines
International Search Engines
Metasearch Engines

Google

Developed as BackRub by Stanford University students Larry Page and Sergey Brin

Became a private company, and changed name to Google in 1998

One of the largest databases: more than 8 billion pages (the count includes pages its robots have found, even if the indexing program has not fully indexed them)

Indexes 3 billion pages every 28 days; 3 million every day

Makes money by powering search for more than 130 portals and corporate Web sites, and through AdWords

Google

Google Spidering
Uses its own bots to spider the Web
Generally ignores meta keywords and description tags

Google

Google Indexing
Descriptions (snippets) are formed automatically by extracting the most relevant portions of pages
Finds the first instance of the search term on a page, then includes the words that appear around this term
Only indexes the first 100 KB or so of a page
Some pages don't have a description: Google will include a "botted" page even if it has not been "indexed"
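The snippet rule described above (find the first instance of the search term, then keep the words around it) can be sketched roughly as follows. This is an illustrative sketch only, not Google's actual code; the function name `make_snippet` and the `window` parameter are assumptions made for the example.

```python
from typing import Optional


def make_snippet(text: str, term: str, window: int = 5) -> Optional[str]:
    """Return the words around the first occurrence of `term`, or None.

    `window` (an assumed parameter) is the number of words kept on each side.
    """
    words = text.split()
    target = term.lower()
    for i, word in enumerate(words):
        # Strip surrounding punctuation so "engine," still matches "engine".
        if word.strip(".,;:!?\"'()").lower() == target:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # The term does not appear, so no description can be formed.
    return None


page = "Google was developed as BackRub by two Stanford students and later renamed"
print(make_snippet(page, "backrub", window=2))
```

A page where the term never appears yields `None`, which loosely mirrors the case of a result shown without a description.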

Google

Indexes:
Web - Indexed Web pages and other file types
Ads - Paid advertisements appear on the right side or above search results under a "Sponsored Links" heading
Images - 880 million+ images searched
Groups - 845 million+ usenet messages searched
News
Directory - A ranked version of the Open Directory using Google's PageRank
Froogle - Shopping and product search
Catalog Search - Scanned, searchable retail catalogs

Google

Web index subsets:
Government sites
Military sites
University sites
Linux sites
Apple/Macintosh sites
Microsoft sites

Google

New!
“Google teams with the libraries of Harvard, Stanford, the University of Michigan, the University of Oxford, and The New York Public Library to digitally scan books from their collections so that users worldwide can search them in Google…Users searching with Google will see links in their search results page when there are books relevant to their query. Clicking on a title delivers a Google Print page where users can browse the full text of public domain works and brief excerpts and/or bibliographic data of copyrighted material. Library content will be displayed in keeping with copyright law.”

http://www.google.com/press/pressrel/print_library.html

Yahoo! Search

Originally just a subject directory
Search engine launched Feb. 2004
Indexes first 500 KB of a Web page
Includes some pay-for-inclusion sites

Teoma

Founded in 2000 by a team of scientists from Rutgers University
Teoma means "expert" in Gaelic
Acquired by Ask Jeeves, Inc. in September 2001

Teoma

More than 2 billion English-only web documents
Spam, duplicates and pornographic results removed from index
Indexes whole page; no stop words
Considers meta-tag descriptions
Aims to re-index every month (freshness)
Sponsored links from Google AdWords

Teoma

Establishing authority and relevancy:
Refine - organizes sites into naturally occurring communities that are about the subject of each search query

Results - analyzes the relationship of sites within a community, ranking a site based on the number of same-subject pages that reference it (Subject-Specific Popularity)

Resources - identifies expert resources about a particular subject
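Subject-Specific Popularity, as described above, ranks a site by how many same-subject pages reference it. The sketch below shows that counting idea only; Teoma's real algorithm is proprietary, and the function name, the `(source, target)` link format and the `community` set are assumptions made for illustration.

```python
from collections import Counter


def subject_specific_popularity(links, community):
    """Rank pages by how many distinct same-subject pages link to them.

    links: iterable of (source_page, target_page) pairs (assumed input format).
    community: set of pages judged to be about the query's subject.
    """
    seen = set()
    scores = Counter()
    for source, target in links:
        # Only links that start and end inside the community count.
        if source in community and target in community and source != target:
            if (source, target) not in seen:
                seen.add((source, target))
                scores[target] += 1
    return scores.most_common()


links = [("a", "b"), ("c", "b"), ("x", "b"), ("a", "c"), ("a", "b")]
print(subject_specific_popularity(links, {"a", "b", "c"}))
```

In this toy input the link from `x` is ignored because `x` is outside the community, which is the difference from a plain link count.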

Gigablast

Founded in 2000
Built and operated by sole proprietor Matt Wells
Created to index up to 200 billion pages with the least amount of hardware possible
Currently indexes 650 million pages
Provides "Gigabits" to help searchers refine their search based upon related topics from search results
Makes money by selling search services to private companies

Wisenut

Newer database ~2001
850 million pages indexed
1.5 billion pages identified but not crawled/indexed
Few advanced search features
Spider capable of fetching more than 100 million pages a day
Often months out of date
Smart/Relevant: considers all words on the page, the text of referring links and the words around them, and the significance and content of the pages with the links
Generates automatic semantic searches called WiseGuide categories

MSN Search

New, improved
~4.2 billion pages searched/indexed?
Formerly used Inktomi, now has proprietary robots, indexer, and retrieval engine

Dead Search Engines

Whatever happened to…?
Direct Hit - defunct, redirecting to Teoma
Infoseek - defunct, redirecting to Go
Magellan - dead, redirects to WebCrawler
Northern Light - defunct
Openfind - Under "reconstruction" as of 2003
WebTop - Dead

Dead Search Engines

The search engine formerly known as…
AlltheWeb - uses Yahoo! database
AltaVista - uses Yahoo! database
Excite - uses an InfoSpace meta search
Go - took over Infoseek, but now just uses Overture
iWon - now uses Google "sponsored" ads, web, and image databases
Looksmart - uses Wisenut search engine
Lycos - uses Yahoo!/Inktomi database and LookSmart directory
NBCi (formerly Snap) - uses metasearch engine Dogpile
WebCrawler - uses an InfoSpace meta search

International Search Engines

There are hundreds of search engines all over the world. We will not be investigating any of these very closely, but you can use the resources below to locate and master international search engines:

All Search Engines: foreign search engines
Search Engines Worldwide
Search Engine Colossus
Country-specific Search Engines

Metasearch Engines

A search engine that queries other search engines and then combines the results received from all of them

The user is not relying on just one search engine but on a combination of many search engines at once to optimize Web searching

Metasearch Engines

The differences among them:
Engines covered (many pay-for-placement)
Number of engines that can be searched at once
Sophistication of search query
Number of records from each search engine
Length of time it will search each search engine
Delete duplicates (de-duping)
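Two of the differences listed above, the number of records taken from each engine and de-duping, can be sketched as below. This is a minimal sketch under stated assumptions: the engine names and result lists are made up, no real search engine is queried, and the URL normalization (ignore case of the host, a leading "www." and a trailing slash) is one simple choice among many.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparison key (assumed, simplified rule)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_results(results_by_engine, per_engine_limit=10):
    """Combine result lists from several engines and remove duplicates.

    results_by_engine: dict mapping an engine name to its ranked URL list.
    per_engine_limit: how many records to take from each engine.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls[:per_engine_limit]:
            key = normalize(url)
            # First sighting keeps its URL; later sightings only add the engine.
            entry = merged.setdefault(key, {"url": url, "engines": []})
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
    return list(merged.values())


# Hypothetical result lists standing in for real engine responses.
results = {
    "engine_a": ["http://www.example.com/", "http://example.org/page"],
    "engine_b": ["http://example.com", "http://example.net/x"],
}
for entry in merge_results(results):
    print(entry["url"], entry["engines"])
```

Keeping the list of engines that returned each page is a design choice that makes later ranking by agreement possible.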

Metasearch Engines

Dogpile
Metacrawler
Mamma
KartOO
Clusty
Surfwax
Ixquick
Fazzle
InfoGrid
Gimenei

Metasearch Engines

Good for getting a lay of the land:
What is out there?
Is there anything out there?
Who covers a topic best?
Learning the names of new or emerging search engines

Metasearch Engines

Otherwise, usually better off searching multiple SE’s individually:

Syntax varies among search engines and metasearch engines may not allow you to make use of all search engines

May not translate your query well into different SE’s
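The translation problem can be illustrated with a small sketch. The two syntax styles below are hypothetical stand-ins (one engine using +/- prefixes, one using Boolean operators); they are not the documented syntax of any particular search engine, and the names `ENGINE_SYNTAX` and `translate` are assumptions made for the example.

```python
# Hypothetical per-engine syntax rules; real engines differ and change over time.
ENGINE_SYNTAX = {
    "engine_plus_minus": {"require": "+{t}", "exclude": "-{t}", "joiner": " "},
    "engine_boolean": {"require": "{t}", "exclude": "NOT {t}", "joiner": " AND "},
}


def translate(required, excluded, engine):
    """Render one logical query in the syntax of the given (hypothetical) engine."""
    rules = ENGINE_SYNTAX[engine]
    parts = [rules["require"].format(t=t) for t in required]
    parts += [rules["exclude"].format(t=t) for t in excluded]
    return rules["joiner"].join(parts)


for engine in ENGINE_SYNTAX:
    print(engine, "->", translate(["jaguar", "car"], ["cat"], engine))
```

A feature with no counterpart in a target engine's syntax has to be dropped or approximated, which is one reason a metasearch query may not carry over well.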

Metasearch Engines

Check out some cool, value-adding features emerging in metasearch engines

Clusty

Clusty (using the Vivisimo clustering engine):
Clustering: uses an algorithm to put search results together based on textual and linguistic similarity.
Groups are further refined using heuristics (i.e., human knowledge) designed to show what users wish to see when they examine clustered documents.
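Grouping results by textual similarity can be sketched with a very simple stand-in technique: greedy clustering on Jaccard word overlap. This is not Vivisimo's algorithm, which is not described in detail here; the function names and the `threshold` value are assumptions made for illustration.

```python
def tokens(text):
    """Lowercased set of words with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(snippets, threshold=0.3):
    """Greedily group snippets whose word overlap with a cluster's seed is high."""
    clusters = []  # each cluster is a list of (snippet, token set) pairs
    for snippet in snippets:
        toks = tokens(snippet)
        for group in clusters:
            # The first member of each group acts as its seed.
            if jaccard(toks, group[0][1]) >= threshold:
                group.append((snippet, toks))
                break
        else:
            clusters.append([(snippet, toks)])
    return [[s for s, _ in group] for group in clusters]


results = [
    "jaguar car dealer prices",
    "jaguar car reviews and prices",
    "jaguar big cat habitat",
    "big cat habitat in the rainforest",
]
for group in cluster(results):
    print(group)
```

With this input the car results and the animal results end up in separate groups even though every snippet could match the query "jaguar".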

Clusty

“Vivísimo's Clustering Engine lets you see deeper and farther--with less effort--into a large number of search results to:
Get a quick overview of the main themes that relate to the query.
See similar results grouped together for faster access.
Find results that are buried in the ranked list and would otherwise be missed.
Discover unexpected results and relationships between items.”

Mamma

rSort
Considers each listing duplicated in more than one SE as a "vote" for that page.
Uses votes to rank pages per the "Condorcet Method"
One of the big advantages of this ranking method is the elimination of search engine spam.
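A Condorcet-style ranking compares pages pairwise: a page beats another if more engines rank it higher. The sketch below uses Copeland scoring (count of pairwise wins), which is one Condorcet-consistent method; whether Mamma's rSort works exactly this way is not stated, so treat the function name, the handling of unlisted pages and the tie-break as assumptions.

```python
from itertools import combinations


def condorcet_rank(rankings):
    """Order pages by the number of pairwise contests they win.

    rankings: one ranked URL list per engine, best result first.
    """
    pages = sorted({p for r in rankings for p in r})
    wins = {p: 0 for p in pages}

    def position(ranking, page):
        # A page an engine did not return counts as below everything it returned.
        return ranking.index(page) if page in ranking else len(ranking)

    for a, b in combinations(pages, 2):
        a_pref = sum(position(r, a) < position(r, b) for r in rankings)
        b_pref = sum(position(r, b) < position(r, a) for r in rankings)
        if a_pref > b_pref:
            wins[a] += 1
        elif b_pref > a_pref:
            wins[b] += 1
    # Ties in win count are broken alphabetically (an arbitrary choice).
    return sorted(pages, key=lambda p: (-wins[p], p))


engines = [["a", "b", "c"], ["b", "a", "d"], ["a", "c", "b"]]
print(condorcet_rank(engines))
```

A page pushed to the top of a single engine still loses most pairwise contests if the other engines do not list it, which is how this kind of voting can dampen search engine spam.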

KartOO

Interactive mapping display for results
Uses proprietary algorithm to sort pages
Relevance of results is displayed as different-sized pages
When you move the pointer over these pages, the relevant keywords are illuminated and a brief description of the site appears on the left side of the screen
Click keywords to refine the search
Refined or further results are also displayed on a map

Surfwax

Targeted multi-source searching
Searches only sources from specific domains or topics determined as relevant
SurfWax can spider deeper into any public site, including pages or parts that are invisible to traditional search engines
Uses a site's existing search syntax to uncover "deeper" content

Ixquick

Understands and translates, when possible, complex syntax

Complete Boolean searching
Truncation/wildcard searching

Fazzle

Meta-searches SEs, plus unique searches in news and other invisible web resources
Ranks everything together
Delivers timely resources from news sources
Delivers dynamic content missing from other metasearch engines
