Internet Resources Discovery (IRD)
The Invisible Web
T.Sharon-A.Frank2
Contents
• What is the Invisible Web?
• How big is the Invisible Web?
• Why is there an Invisible Web (and what’s in it)?
• Case study – patent search
• How to find Invisible Web resources?
What is the Invisible Web?
• Also called the “Deep Web”, in contrast to the “Surface Web”, which is the Visible Web.
• The term “Invisible Web” refers to content that is available on the Web but is not indexed by the regular search engines (SEs). It consists mostly of:
– searchable databases
– excluded pages
• These pages do not appear in SE search results.
• Information on the Invisible Web can be found through direct access or through specialized SEs.
• The Invisible Web is larger than the Visible Web.
How Big is the Invisible Web?
According to a BrightPlanet study (2000):
• The deep Web (Invisible Web) is 500 times larger than the surface Web.
– Number of search utilities:
• 45,000 search engines on the surface Web.
• 200,000 searchable databases within the deep Web.
– Number of documents:
• 1 billion documents on the surface Web.
• 550 billion documents within the deep Web.
• Deep Web quality is 1,000 times greater.
• 95% of deep Web information is publicly available.
http://www.brightplanet.com/technology/deepweb.asp
Visible vs. Invisible Web
More Details (1)
• Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
• The deep Web contains 7,500 terabytes of information compared to 19 terabytes of information in the surface Web.
• The deep Web contains nearly 550 billion individual documents compared to the 1 billion of the surface Web.
• More than 200,000 deep Web sites presently exist.
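The "400 to 550 times" range can be checked against the figures quoted on this slide. A minimal sketch; the variable names are illustrative and the numbers are the slide's own:

```python
# Figures quoted on the slide (BrightPlanet, 2000).
deep_tb, surface_tb = 7_500, 19          # terabytes of information
deep_docs, surface_docs = 550e9, 1e9     # individual documents

size_ratio = deep_tb / surface_tb        # ratio by volume of data
doc_ratio = deep_docs / surface_docs     # ratio by document count

print(f"by size:      {size_ratio:.0f}x")   # by size:      395x
print(f"by documents: {doc_ratio:.0f}x")    # by documents: 550x
```

The lower bound comes from comparing data volume, the upper bound from comparing document counts.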
More Details (2)
• Sixty of the largest deep Web sites collectively contain about 750 terabytes of information - sufficient by themselves to exceed the size of the surface Web forty times.
• On average, deep Web sites receive 50% greater monthly traffic than surface sites and are more highly linked to than surface sites.
• However, the typical (median) deep Web site is not well known to the Internet-searching public.
More Details (3)
• The deep Web is the fastest-growing category of new information on the Internet.
• Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
• More than half of the deep Web content resides in topic-specific databases.
More Details (4)
• Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
• Deep Web content is highly relevant to every information need, market, and domain.
• A full ninety-five per cent of the deep Web is publicly accessible information -- not subject to fees or subscriptions.
Invisible Web?!
The Invisible Web
Is it Really that Big?
• Some argue that the Invisible Web is actually only 50-80 times bigger than the Visible Web.
Deep Web Growing Faster than Surface Web
Original deep content exceeds all printed global content
The Size Today???
• 2000 - 1:550 (with source)
• 2004 - 10:1000 (estimate) - 2 orders of magnitude more
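The "2 orders of magnitude" remark follows directly from the 10:1000 estimate. A small sketch, assuming both numbers are in the same (unstated) unit:

```python
import math

# 2004 estimate from the slide, surface : deep (unit not stated on the slide).
surface, deep = 10, 1000

ratio = deep / surface          # how many times larger the deep Web is
orders = math.log10(ratio)      # the same ratio expressed in orders of magnitude

print(ratio, orders)
```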
Why is there an Invisible Web? (1)
1. Specialized searchable databases:
– Have dynamic pages.
– Require parameters and user judgment.
– Require a username and password.
2. Script-based pages:
– Include “?” in their URL.
– Are a hazard to SEs (crawler traps).
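The “?” rule can be sketched as a simple crawler heuristic. This is an illustration only, not how any particular SE works; the function name and the example.org URLs are hypothetical:

```python
from urllib.parse import urlsplit

def looks_script_based(url: str) -> bool:
    """Heuristic: treat a URL with a query string ("?...") as script-generated."""
    return bool(urlsplit(url).query)

# Hypothetical URLs for illustration.
urls = [
    "http://example.org/catalog/index.html",
    "http://example.org/search.cgi?q=patent&page=2",
]
for u in urls:
    print(u, "->", "skip" if looks_script_based(u) else "crawl")
```

A crawler that applies such a rule avoids endless script-generated URL spaces, at the cost of never seeing the database content behind them.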
Why is there an Invisible Web? (2)
3. Real-time, constantly changing, content.
4. Very large websites are partially indexed.
5. Private/secret websites:
– Internal company portals.
– Excluded from SEs (using robots.txt or similar schemes).
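The robots.txt exclusion can be illustrated with Python's standard `urllib.robotparser`. The robots.txt content and the example.org URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an illustrative site.
robots_txt = """\
User-agent: *
Disallow: /intranet/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler asks before fetching each URL.
blocked = parser.can_fetch("*", "http://example.org/intranet/report.html")
allowed = parser.can_fetch("*", "http://example.org/public/index.html")
print(blocked)  # False
print(allowed)  # True
```

Pages under the disallowed path stay reachable for a person who knows the URL, but a compliant SE never indexes them.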
Why is there an Invisible Web? (3)
6. Multimedia (and other format) files
– Special formats
– Examples: PDF, DOC, PPT, GIF, Flash
– Some SEs can’t or won’t index these.
Why is there an Invisible Web? (4)
7. Additional reasons (mostly spam-related or to save resources):
– File size
– Number of words in the page
– Pages requiring cookies
– Other spam characteristics
– Multimedia files
– Files and URLs with special characters.
Why is there an Invisible Web? (5)
8. Non-linked pages (no incoming/inbound links).
9. Pages on servers with dynamic IP.
• For example, 5% of the Internet is not connected! The Internet is partitioned, and there exists “Dark Address Space”: prefixes that are not reachable from one provider but are reachable from other providers for long periods of time. This amounts to 5% of the total number of prefixes in the Internet, or tens of millions of end hosts.
Source: Arbor Networks - http://www.arbornetworks.com/downloads/research38/dark_address_space.pdf
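Why a page with no inbound links stays invisible can be sketched with a toy crawler that only follows links. The link graph and page names are hypothetical:

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "product-1"],
    "product-1": [],
    "orphan-report": ["home"],  # links out, but nothing links to it
}

def crawl(seed: str) -> set[str]:
    """Breadth-first traversal following outgoing links from the seed page."""
    seen, queue = {seed}, deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

visible = crawl("home")
invisible = set(links) - visible
print(sorted(invisible))  # ['orphan-report']
```

The orphan page exists on the server, but a link-following crawler that starts from the home page never discovers it.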
Invisible
The Invisible Web is mostly topic databases
Databases (1)
• Tens of thousands of databases on different topics exist.
• The SEs know about many databases, but not their content.
• Entry to many databases is blocked: the SE knows the database's URL (main page), but not the information inside!
Databases (2)
• Searching these databases requires entry via the site's user interface, and often also registration, a password, or cookies.
• The advantage of using the database’s user interface is that it is specialized and designed to get the best results searching that specific information.
Specialized interface
VS.
Topics of databases (1)
• Auction
• Public, government information
• Expert directories
• Buyer’s guides
• Scientific databases from educational institutions and research labs
• News & magazine archives
• Discussions
• Mailing list archives
• Product reviews
• Legal databases
• Shopping catalogs
• Medical databases
Topics of databases (2)
• Product support knowledge bases
• Phone number, address, and email databases
• Dictionaries
• Thesauri
• Patent databases
• Trademark databases
• Genealogy and surname lists
• and much more
Sites by Subject Area
[Bar chart: percentage of sites per subject area, axis from 0% to 15%; bar values not recoverable. Subject areas: Agriculture; Arts; Business; Computing/Web; Education; Employment; Engineering; Government; Health; Humanities; Law/Politics; Lifestyles; News, Media; People, Companies; Recreation, Sports; References; Science, Math; Shopping; Travel.]
Example Databases (1)
• WebDietitian
– http://www.webdietitian.com
• AnimalSearch
– http://animalsearch.net
– A database for family-safe animal-related sites.
• NatureServe Explorer
– http://www.natureserve.org/explorer
– Online encyclopedia, provides authoritative conservation information on plants, animals, & ecological communities.
Example Databases (2)
• Search Systems
– Public systems: courts, criminal records, birth, death, marriages, recalls, etc.
– http://www.searchsystems.net/
• PubMed
– Provides access to 14 million+ MEDLINE citations
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
• MedlinePlus
– Trusted Medical Information
– http://www.medlineplus.gov/
• National Library of Medicine
– Directory of trusted medical information
– http://www.nlm.nih.gov
Advantages of such databases
• Quality of Content (Authority)
• Deep Content on Subject Area (Comprehensiveness)
• Focused Databases (Limited Scope): Smaller Universe of Documents to Search (Maximize Precision/Recall)
• Material Unavailable Elsewhere on the Web (Uniqueness)
• Many Options to Limit, Sort, Interact with the Data (Maximize Precision)
• Timeliness vs. Time Lag of General Search Tools (Currency)
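Precision and recall, mentioned in the list above, have standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A minimal sketch with made-up document IDs; the two result sets are hypothetical and only illustrate why a focused database can score better:

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one result set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: d* are relevant documents, x* are irrelevant ones.
relevant = {"d1", "d2", "d3", "d4"}
general_se = {"d1", "x1", "x2", "x3", "x4", "x5", "x6", "x7"}
focused_db = {"d1", "d2", "d3", "x1"}

print(precision_recall(general_se, relevant))  # (0.125, 0.25)
print(precision_recall(focused_db, relevant))  # (0.75, 0.75)
```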
Will the Invisible Web become Visible?
Definitely!
• Using intelligent SEs.
• Over time, SEs that can access the Invisible Web update their indexes with new information and make it visible.
Google and the Invisible Web
• Google’s Search “By Number” service, examples:
– UPS tracking numbers: "1Z9999W99999999999"
– FedEx tracking numbers: "999999999999"
– USPS tracking numbers: "9999 9999 9999 9999 9999 99"
– Vehicle ID (VIN) numbers: "AAAAA999A9AA99999"
– UPC codes example: "073333531084"
– Telephone area codes: "650"
– Patent numbers: "patent 5123123" (Need to put the word "patent" before your patent number).
– Google Scholar
• http://www.google.com/help/features.html#number
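How a query might be recognized as a "number" search can be sketched by turning the placeholder strings above into regular expressions. This is an illustration, not Google's implementation. It assumes that `9` stands for any digit, `A` for any uppercase letter, and every other character is literal, which is only a reading of the slide; real carrier formats may differ.

```python
import re

# Placeholder formats copied from the slide.
TEMPLATES = {
    "UPS tracking": "1Z9999W99999999999",
    "FedEx tracking": "999999999999",
    "USPS tracking": "9999 9999 9999 9999 9999 99",
    "VIN": "AAAAA999A9AA99999",
}

def template_to_regex(template: str) -> re.Pattern:
    """Assumed convention: 9 = digit, A = uppercase letter, anything else literal."""
    parts = []
    for ch in template:
        if ch == "9":
            parts.append(r"\d")
        elif ch == "A":
            parts.append("[A-Z]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

PATTERNS = {name: template_to_regex(t) for name, t in TEMPLATES.items()}

def classify(query: str) -> list[str]:
    """Return every number type whose format the whole query matches."""
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(query)]

print(classify("AAAAA999A9AA99999"))  # the slide's own VIN placeholder
print(classify("650"))                # too short for any format above
```

Formats can overlap (a 12-digit string could be a tracking number or a UPC code), so the sketch returns a list rather than a single type.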
Yahoo Shortcuts
• Yahoo provides a lot of “shortcuts” to retrieve information:
– Gas prices
– Exchange rates
– Hotels
– Airport Information
– Sports Scores
– Stock Quotes
– and many more
• http://help.yahoo.com/help/us/ysearch/tips/tips-01.html
Case Study - Patents
• Searching for “CD box” in patent abstract.
• Google’s Search by Number
• USPTO
• FreePatentsOnline
Google’s Search by Number
USPTO Database
USPTO Advanced Search
Free Patents Online
Free Patents Online (Continuation)
Searching Database “CD box”
Searching Google
50 results – mostly irrelevant
Example: Irrelevant
Is the Invisible Web Invisible?
Google
Yahoo!
X
How to Find Invisible Web Resources?
Search for sources – not information!
• Two-step searching in general search engines.
• Specialty Search Engines
• Subject Directories
• Digital Libraries
Two-Step Searching
• Use a general search engine (such as Google) to find a good database, then search for the information with that database's or website's own search engine.
• Example words to add to the query:
– database
– association
– tutorial
– demographics
– “how to”
– webcam
– streaming video + search engine
– encyclopedia
– product review
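Step one of two-step searching can be sketched as generating source-seeking queries from a topic. The function name and the chosen subset of words are illustrative; step two (searching inside the database that was found) happens on that site's own interface.

```python
# A few of the source-type words listed on the slide.
SOURCE_WORDS = ["database", "association", "encyclopedia", "product review"]

def source_queries(topic: str, words: list[str] = SOURCE_WORDS) -> list[str]:
    """Build queries that look for sources about the topic, not for answers."""
    # Multi-word terms are quoted so the SE treats them as a phrase.
    return [f'{topic} "{w}"' if " " in w else f"{topic} {w}" for w in words]

for query in source_queries("patents"):
    print(query)
# patents database
# patents association
# patents encyclopedia
# patents "product review"
```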
Invisible Web Pathfinders
• Sherman-Price Invisible-Web Directory http://www.invisible-web.net/
• CompletePlanet http://www.completeplanet.com
• Beaucoup http://www.beaucoup.com
• Turbo10 http://turbo10.com/
Sherman-Price Invisible-Web directory
Invisible-Web directory: People Search
Complete Planet
Beaucoup (1)
• Over 2500 engines
• The engines listed on the main site are "free information" sites -- a *lot* of information.
• Subject Directory/Annotated
Beaucoup (2)
Turbo10 (1)
Turbo10 (2)
Turbo10 (3)
Turbo10 – Edit Collections
More Invisible Web Pathfinders
• Librarians’ Index to the Internet http://www.lii.org
• MeL Michigan eLibrary http://www.mel.org/
• Internet Scout Project http://scout.wisc.edu/
• Infomine http://infomine.ucr.edu/
• More: http://www.calvin.edu/library/searreso/internet/webdirec.stm
References
• http://websearch.about.com/od/invisibleweb/
• http://www.shelton.cc.al.us/library/lbs102/lbs102session12.html
• http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
• http://www.campus-technology.com/article.asp?id=7477
• http://www.searchengineoptimising.com/optimisation
• http://www.press.umich.edu/jep/07-01/bergman.html