the players the majors dead search engines international search engines metasearch engines
Post on 22-Dec-2015
The Players
The Majors
Dead Search Engines
International Search Engines
Metasearch Engines
Google
Developed as BackRub by Stanford University students Larry Page and Sergey Brin
Became a private company, and changed name to Google in 1998
One of the largest databases: more than 8 billion pages (the count includes pages its robots have crawled, even if its indexing program has not fully indexed them)
Indexes 3 billion pages every 28 days; 3 million every day
Makes money by powering search for over 130 portals and corporate Web sites, and through AdWords
Google Spidering
Uses its own ‘bots to spider the Web
Generally ignores meta keywords and description tags
Google Indexing
Descriptions (snippets) are formed automatically by extracting the most relevant portions of pages
Finds the first instance of the search term on a page, then includes the words that appear around this term
Only indexes the first 100K or so of a page
Some pages don’t have a description - Google will include a “botted” page even if it has not been “indexed”
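The snippet rule described above (find the first instance of the search term, then take the words around it) can be sketched in a few lines. This is only an illustration of the idea, not Google's actual implementation; the function name make_snippet and the context_words parameter are made up for the example.

```python
import re


def make_snippet(page_text, term, context_words=5):
    """Return the words around the first occurrence of `term`, or None if absent."""
    words = page_text.split()
    needle = term.lower()
    for i, word in enumerate(words):
        # Strip leading/trailing punctuation and compare case-insensitively.
        if re.sub(r"^\W+|\W+$", "", word).lower() == needle:
            start = max(0, i - context_words)
            end = min(len(words), i + context_words + 1)
            # Mark the snippet as an excerpt when text was cut off on either side.
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return None


page = ("Metasearch engines query several search engines at once "
        "and merge the results into one list")
print(make_snippet(page, "merge", context_words=2))
```

A page that never contains the term yields None, which loosely mirrors the case of a listed page with no description.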
Indexes:
Web - Indexed Web pages and other file types
Ads - Paid advertisements appear on the right side or above search results under a "Sponsored Links" heading
Images - 880 million+ images searched
Groups - 845 million+ Usenet messages searched
News
Directory - A ranked version of the Open Directory using Google's PageRank
Froogle - Shopping and product search
Catalog Search - Scanned, searchable retail catalogs
Web index subsets:
Government sites
Military sites
University sites
Linux sites
Apple/Macintosh sites
Microsoft sites
New! “Google teams with the libraries of Harvard, Stanford, the University
of Michigan, the University of Oxford, and The New York Public Library to digitally scan books from their collections so that users worldwide can search them in Google…Users searching with Google will see links in their search results page when there are books relevant to their query. Clicking on a title delivers a Google Print page where users can browse the full text of public domain works and brief excerpts and/or bibliographic data of copyrighted material. Library content will be displayed in keeping with copyright law.”
http://www.google.com/press/pressrel/print_library.html
Yahoo! Search
Originally just a subject directory
Search engine launched Feb. 2004
Indexes first 500 KB of a Web page
Includes some pay-for-inclusion sites
Teoma
Founded in 2000 by a team of scientists from Rutgers University
Teoma means "expert" in Gaelic
Acquired by Ask Jeeves, Inc. in September 2001
Teoma
More than 2 billion English-only web documents
Spam, duplicates and pornographic results removed from index
Indexes whole page; no stop words
Considers meta-tag descriptions
Aims to re-index every month (freshness)
Sponsored links from Google AdWords
Teoma
Establishing authority and relevancy:
Refine - organizes sites into naturally occurring communities that are about the subject of each search query
Results - analyzes the relationship of sites within a community, ranking a site based on the number of same-subject pages that reference it (Subject-Specific Popularity)
Resources - identifies expert resources about a particular subject
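The "Results" step can be illustrated with a toy link graph: only links that come from pages inside the same subject community count toward a page's score. This is a minimal sketch of the idea as the slide describes it, not Teoma's actual algorithm; the function name and the input shapes (a list of (source, target) link pairs and a set of community pages) are assumptions for the example.

```python
def subject_specific_popularity(links, community):
    """Rank community pages by how many other community pages link to them."""
    scores = {page: 0 for page in community}
    # set() drops repeated links so one page cannot vote twice for the same target.
    for source, target in set(links):
        if source in community and target in community and source != target:
            scores[target] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


community = {"a", "b", "c"}
links = [("a", "c"), ("b", "c"), ("x", "c"), ("c", "a"), ("x", "b")]
print(subject_specific_popularity(links, community))
```

Links from "x" are ignored because "x" is outside the community, which is the difference from plain link counting.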
Gigablast
Founded in 2000
Built and operated by sole proprietor Matt Wells
Created to index up to 200 billion pages with the least amount of hardware possible
Currently indexes 650 million
Provides "Gigabits" to help searchers refine their search based upon related topics from search results
Makes money by selling search services to private companies
Wisenut
Newer database, ~2001
850 million pages indexed
1.5 billion pages identified but not crawled/indexed
Few advanced search features
Spider capable of fetching more than 100 million pages a day
Often months out of date
Smart/Relevant: considers all words on the page, the text of referring links and the words around them, and the significance and content of the pages with the links
Generates automatic semantic searches called WiseGuide categories
MSN Search
New, improved
~4.2 billion pages searched/indexed?
Formerly used Inktomi; now has proprietary robots, indexer, and retrieval engine
Dead Search Engines
What ever happened to…?
Direct Hit - defunct, redirecting to Teoma
Infoseek - defunct, redirecting to Go
Magellan - dead, redirects to WebCrawler
Northern Light - defunct
Openfind - under "reconstruction" as of 2003
WebTop - dead
Dead Search Engines
The search engine formerly known as…
AlltheWeb - uses Yahoo! database
AltaVista - uses Yahoo! database
Excite - uses an InfoSpace meta search
Go - took over Infoseek, but now just uses Overture
iWon - now uses Google "sponsored" ads, web, and image databases
Looksmart - uses Wisenut search engine
Lycos - uses Yahoo!/Inktomi database and LookSmart directory
NBCi (formerly Snap) - uses metasearch engine Dogpile
WebCrawler - uses an InfoSpace meta search
International Search Engines
There are hundreds of search engines all over the world. We will not be investigating any of these very closely, but you can use the resources below to locate and master international search engines:
All Search Engines: foreign search engines
Search Engines Worldwide
Search Engine Colossus
Country-specific Search Engines
Metasearch Engines
A search engine that queries other search engines and then combines the results received from all of them
The user is not using just one search engine but a combination of many search engines at once to optimize Web searching
Metasearch Engines
The difference among them:
Engines covered (many pay-for-placement)
Number of engines that can be searched at once
Sophistication of search query
Number of records from each search engine
Length of time it will search each search engine
Delete duplicates (de-duping)
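Several of the differences listed above (number of records taken from each engine, de-duping) are easiest to see in code. The sketch below is a toy metasearch loop, not the implementation of any engine named in these slides. The "engines" are hypothetical plain functions that return a ranked list of URLs, and the normalize rule (lowercase host, drop a leading "www." and a trailing slash, ignore query strings) is an assumption made for the example.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a rough key so near-identical listings de-dupe together."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Query strings and fragments are deliberately ignored in this sketch.
    return host + parts.path.rstrip("/")


def metasearch(query, engines, per_engine=10):
    """Query every engine, keep the top `per_engine` records each, merge and de-dupe."""
    merged = {}
    for name, search in engines.items():
        for rank, url in enumerate(search(query)[:per_engine], start=1):
            entry = merged.setdefault(
                normalize(url), {"url": url, "engines": [], "best_rank": rank}
            )
            if name not in entry["engines"]:
                entry["engines"].append(name)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Pages found by more engines come first, then the best individual rank.
    return sorted(
        merged.values(),
        key=lambda e: (-len(e["engines"]), e["best_rank"], e["url"]),
    )


# Hypothetical stand-ins for real search engines.
engines = {
    "alpha": lambda q: ["http://www.example.com/", "http://a.org/x"],
    "beta": lambda q: ["http://example.com", "http://b.org/y"],
}
for hit in metasearch("metasearch", engines):
    print(hit["url"], hit["engines"])
```

A real metasearch engine would also need per-engine timeouts and query translation, which this sketch leaves out.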
Metasearch Engines
Dogpile
Metacrawler
Mamma
KartOO
Clusty
Surfwax
Ixquick
Fazzle
InfoGrid
Gimenei
Metasearch Engines
Good for getting a lay of the land:
What is out there?
Is there anything out there?
Who covers a topic best?
Learning the names of new or emerging search engines
Metasearch Engines
Otherwise, usually better off searching multiple SE’s individually:
Syntax varies among search engines, and metasearch engines may not allow you to make use of all search engines
May not translate your query well into different SE’s
Metasearch Engines
Check out some cool, value-adding features emerging in metasearch engines
Clusty
Clusty (using Vivisimo clustering engine):
Clustering: uses an algorithm to put search results together based on textual and linguistic similarity
Groups are further refined using heuristics (i.e., human knowledge) designed to show what users wish to see when they examine clustered documents
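Grouping results by textual similarity can be illustrated with a very small clustering routine. The sketch below is not Vivisimo's algorithm, which is proprietary; it swaps in a simple greedy, single-pass approach that compares word sets with Jaccard similarity. The function names and the threshold value are assumptions chosen for the example.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words in a snippet."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all words across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(snippets, threshold=0.3):
    """Assign each snippet to the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster: {"words": set of words, "members": list of snippets}
    for snippet in snippets:
        words = tokens(snippet)
        best, best_score = None, threshold
        for candidate in clusters:
            score = jaccard(words, candidate["words"])
            if score >= best_score:
                best, best_score = candidate, score
        if best is None:
            clusters.append({"words": set(words), "members": [snippet]})
        else:
            best["members"].append(snippet)
            best["words"] |= words
    return [c["members"] for c in clusters]


results = [
    "jaguar car dealer",
    "jaguar car review",
    "jaguar cat habitat",
    "big cat habitat loss",
]
for group in cluster(results):
    print(group)
```

With these four snippets the car results and the cat results land in separate groups, which is the kind of theme overview a clustering engine aims to give.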
Clusty
“Vivísimo's Clustering Engine lets you see deeper and farther--with less effort--into a large number of search results to:
Get a quick overview of the main themes that relate to the query.
See similar results grouped together for faster access.
Find results that are buried in the ranked list and would otherwise be missed.
Discover unexpected results and relationships between items.”
Mamma
rSort
Considers each listing duplicated in more than one SE as a “vote” for that page
Uses votes to rank pages per the "Condorcet Method"
One of the big advantages of this ranking method is the elimination of search engine spam
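To make the voting idea concrete, the sketch below treats each search engine's ranked result list as a ballot and compares pages pairwise, which is the core of any Condorcet-style method. It is not Mamma's rSort implementation, whose details are not given here; it orders pages by their number of pairwise wins (the Copeland variant), and it assumes that a page missing from an engine's list ranks below every page that engine did return.

```python
from itertools import combinations


def condorcet_rank(ballots):
    """Order pages by how many head-to-head contests they win across all ballots."""
    pages = sorted({page for ballot in ballots for page in ballot})
    positions = [{page: i for i, page in enumerate(ballot)} for ballot in ballots]
    wins = {page: 0 for page in pages}
    for a, b in combinations(pages, 2):
        prefer_a = prefer_b = 0
        for pos in positions:
            # Unlisted pages get a rank just past the end of this engine's list.
            rank_a = pos.get(a, len(pos))
            rank_b = pos.get(b, len(pos))
            if rank_a < rank_b:
                prefer_a += 1
            elif rank_b < rank_a:
                prefer_b += 1
        if prefer_a > prefer_b:
            wins[a] += 1
        elif prefer_b > prefer_a:
            wins[b] += 1
    return sorted(pages, key=lambda page: (-wins[page], page))


# Three hypothetical engines; "spam" is returned by only one of them.
ballots = [
    ["a", "b", "c"],
    ["b", "a", "spam"],
    ["a", "c", "b"],
]
print(condorcet_rank(ballots))
```

The page that only one engine returns loses every head-to-head contest and sinks to the bottom, which is the anti-spam effect the slide points to.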
KartOO
Interactive mapping display for results
Uses proprietary algorithm to sort pages
Relevance of results is displayed as different-sized pages
When you move the pointer over these pages, the relevant keywords are illuminated and a brief description of the site appears on the left side of the screen
Click keywords to refine the search
Refined or further results are also displayed on a map
Surfwax
Targeted multi-source searching
Searches only sources from specific domains or topics determined as relevant
SurfWax can spider deeper into any public site, including pages or parts that are invisible to traditional search engines
Uses a site's existing search syntax to uncover “deeper” content
Ixquick
Understands and translates, when possible, complex syntax
Complete Boolean searching
Truncation/wildcard searching
Fazzle
Meta-searches SE’s, plus unique searches in news and other invisible web resources
Ranks everything together
Delivers timely resources from news sources
Delivers dynamic content missing from other metasearch engines