
WE SCHOOL

CASE STUDY – SEARCH ENGINES AS IMAGE BUILDERS

HEENA JAISINGHANI

DPGD/JL13/1836

SPECIALIZATION: GENERAL MANAGEMENT

PRIN. L.N. WELINGKAR INSTITUTE OF

MANAGEMENT DEVELOPMENT & RESEARCH

YEAR OF SUBMISSION: MARCH 2015


ANNEXURE 1

FLOW CHART INDICATING THE BASIC ELEMENTS OF THE PROJECT

To reach "Search Engine as Image Builder" we need to know:

- Search Engine Optimization (SEO)
- Code techniques that minimize the use of Flash and frames
- Keywords or keyword phrases that fit the target market
- Linking strategy


ANNEXURE 3

UNDERTAKING BY CANDIDATE

I declare that the project entitled "Case Study: Search Engines as Image Builders" is my own work, conducted as part of my syllabus.

I further declare that the project work presented has been prepared personally by me and is not sourced from any outside agency. I understand that any such malpractice will have very serious consequences, and my admission to the program will be cancelled without any refund of fees.

I am also aware that I may face legal action if I engage in such malpractice.

Heena Jaisinghani

(Signature of Candidate)


Table of contents

Introduction

Background

Methodology

Conclusions & Recommendations

Limitations

Bibliography


Introduction

The topic "Search Engine as Image Builder" has two points of view, as follows:

1) How a company builds its image using search engines, and
2) How an image gets built through a search engine spider simulator combined with Search Engine Optimization.

Let us start with the second topic, which will lead us to the first.

Search Engine Optimization (SEO) is the process of not only making web pages easy to find, easy to crawl and easy to categorise, but also of making those pages rank high for certain keywords or search terms.

The technique behind image building is the search engine spider simulator.

Basically, all search engine spiders function on the same principle: they crawl the Web and index pages, which are stored in a database; various algorithms are later used to determine the ranking, relevancy, etc. of the collected pages. While the algorithms for calculating ranking and relevancy differ widely among search engines, the way they index sites is more or less uniform, so it is very important that you know what spiders are interested in and what they neglect.

Businesses are growing more aware of the need to understand and implement at least the

basics of search engine optimization (SEO). But if you read a variety of blogs and websites,

you’ll quickly see that there’s a lot of uncertainty over what makes up “the basics.” Without

access to high-level consulting and without a lot of experience knowing what SEO resources

can be trusted, there’s also a lot of misinformation about SEO strategies and tactics.

Below are techniques for making good use of SEO:

1. Commit yourself to the process: SEO isn’t a one-time event. Search engine algorithms

change regularly, so the tactics that worked last year may not work this year. SEO requires a


long-term outlook and commitment.

2. Be patient: SEO isn’t about instant gratification. Results often take months to see, and this

is especially true the smaller you are, and the newer you are to doing business online.

3. Ask a lot of questions when hiring an SEO company: It’s your job to know what kind of

tactics the company uses. Ask for specifics. Ask if there are any risks involved. Then get

online yourself and do your own research—about the company, about the tactics they

discussed, and so forth.

4. Become a student of SEO: If you’re taking the do-it-yourself route, you’ll have to become

a student of SEO and learn as much as you can.

5. Have web analytics in place at the start: You should have clearly defined goals for your

SEO efforts, and you’ll need web analytics software in place so you can track what’s working

and what’s not.

6. Build a great web site: Ask yourself, “Is my site really one of the 10 best sites in the

world on this topic?” Be honest. If it’s not, make it better.

7. Include a site map page: Spiders can’t index pages that can’t be crawled. A site map will

help spiders find all the important pages on your site, and help the spider understand your

site’s hierarchy. This is especially helpful if your site has a hard-to-crawl navigation menu. If

your site is large, make several site map pages. Keep each one to fewer than 100 links; 75 at most is advisable, to be safe.

8. Make SEO-friendly URLs: Use keywords in your URLs and file names, such as

yourdomain.com/red-widgets.html. Don’t overdo it, though. A file with 3+ hyphens tends to

look spammy and users may be hesitant to click on it. Use hyphens in URLs and file names, not underscores. Hyphens are treated as a "space," while underscores are not. (A short sketch for generating such hyphenated file names appears after this list.)


9. Do keyword research at the start of the project: If you’re on a tight budget, use the free

versions of Keyword Discovery or WordTracker, both of which also have more powerful

paid versions. Ignore the numbers these tools show; what’s important is the relative volume

of one keyword to another. Another good free tool is Google’s AdWords Keyword Tool,

which doesn’t show exact numbers.

10. Open up a PPC account: Whether it’s Google’s AdWords, Microsoft adCenter or

something else, this is a great way to get actual search volume for your keywords. Yes, it

costs money, but if you have the budget it’s worth the investment. It’s also the solution if you

didn’t like the “Be patient” suggestion above and are looking for instant visibility.

11. Use a unique and relevant title and meta description on every page: The page title is

the single most important on-page SEO factor. It’s rare to rank highly for a primary term (2-3

words) without that term being part of the page title. The meta description tag won’t help you

rank, but it will often appear as the text snippet below your listing, so it should include the

relevant keyword(s) and be written so as to encourage searchers to click on your listing.

12. Write for users first: Google, Yahoo, etc., have pretty powerful bots crawling the web,

but to my knowledge these bots have never bought anything online, signed up for a

newsletter, or picked up the phone to call about your services. Humans do those things, so

write your page copy with humans in mind. Yes, you need keywords in the text, but don’t

stuff each page like a Thanksgiving turkey. Keep it readable.

13. Create great, unique content: This is important for everyone, but it’s a particular

challenge for online retailers. If you’re selling the same widget that 50 other retailers are

selling, and everyone is using the boilerplate descriptions from the manufacturer, this is a

great opportunity. Write your own product descriptions, using the keyword research you did

earlier (see #9 above) to target actual words searchers use, and make product pages that blow

the competition away. Plus, retailer or not, great content is a great way to get inbound links.


14. Use your keywords as anchor text when linking internally: Anchor text helps tell

spiders what the linked-to page is about. Links that say “click here” do nothing for your

search engine visibility.

15. Build links intelligently: Begin with foundational links like trusted directories. (Yahoo

and DMOZ are often cited as examples, but don’t waste time worrying about DMOZ

submission. Submit it and forget it.) Seek links from authority sites in your industry. If local

search matters to you (more on that coming up), seek links from trusted sites in your

geographic area — the Chamber of Commerce, local business directories, etc. Analyze the

inbound links to your competitors to find links you can acquire, too. Create great content on a

consistent basis and use social media to build awareness and links.

16. Use press releases wisely: Developing a relationship with media covering your industry

or your local region can be a great source of exposure, including getting links from trusted

media web sites. Distributing releases online can be an effective link building tactic, and

opens the door for exposure in news search sites. Only issue a release when you have

something newsworthy to report. Don’t waste journalists’ time.

17. Start a blog and participate with other related blogs: Search engines, Google

especially, love blogs for the fresh content and highly-structured data. Beyond that, there’s no

better way to join the conversations that are already taking place about your industry and/or

company. Reading and commenting on other blogs can also increase your exposure and help

you acquire new links. Put your blog at yourdomain.com/blog so your main domain gets the

benefit of any links to your blog posts. If that’s not possible, use blog.yourdomain.com.

18. Use social media marketing wisely. If your business has a visual element, join the

appropriate communities on Flickr and post high-quality photos there. If you’re a service-

oriented business, use Quora and/or Yahoo Answers to position yourself as an expert in your

industry. Any business should also be looking to make use of Twitter and Facebook, as social

information and signals from these are being used as part of search engine rankings for


Google and Bing. With any social media site you use, the first rule is don’t spam! Be an

active, contributing member of the site. The idea is to interact with potential customers, not

annoy them.

19. Take advantage of local search opportunities. Online research for offline buying is a

growing trend. Optimize your site to catch local traffic by showing your address and local

phone number prominently. Write a detailed Directions/Location page using neighbourhoods

and landmarks in the page text. Submit your site to the free local listings services that the

major search engines offer. Make sure your site is listed in local/social directories such as

CitySearch, Yelp, Local.com, etc., and encourage customers to leave reviews of your

business on these sites, too.

20. Take advantage of the tools the search engines give you. Sign up for

Google Webmaster Central, Bing Webmaster Tools and Yahoo Site Explorer to learn more

about how the search engines see your site, including how many inbound links they’re aware

of.

21. Diversify your traffic sources. Google may bring you 70% of your traffic today, but

what if the next big algorithm update hits you hard? What if your Google visibility goes away

tomorrow? Newsletters and other subscriber-based content can help you hold on to

traffic/customers no matter what the search engines do. In fact, many of the DOs on this

list—creating great content, starting a blog, using social media and local search, etc.—will

help you grow an audience of loyal prospects and customers that may help you survive the

whims of search engines.
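As promised in tip #8 above, here is a minimal sketch of generating an SEO-friendly, hyphenated file name from a page title (my own illustration in Python, with a made-up title; real sites usually rely on their CMS for this):

Python

# Sketch: turn a page title into a hyphenated, SEO-friendly URL slug.
import re

def slugify(title):
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)  # non-alphanumerics become hyphens
    return slug.strip("-")

print(slugify("Red Widgets & Gadgets"))  # red-widgets-gadgets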


Background

This section shows the behind-the-scenes concepts of the search engine spider simulator and the techniques mentioned above.

Are Your Hyperlinks Spiderable?

The search engine spider simulator can be of great help when trying to figure out if your hyperlinks lead to the right place. For instance, link exchange websites often put fake links to your site with JavaScript (using mouseover events and the like to make the link look genuine), but this is not a link that search engines will see and follow. Since the spider simulator would not display such links, you'll know that something about the link is wrong.

It is highly recommended to use the <noscript> tag, as opposed to JavaScript-based menus. The reason is that JavaScript-based menus are not spiderable, and all the links in them will be ignored rather than treated as page text. The solution to this problem is to put all menu item links in the <noscript> tag. The <noscript> tag can hold a lot, but please avoid using it for link stuffing or any other kind of SEO manipulation.
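As a rough illustration of that spider's-eye view, here is a minimal sketch in Python (my own simplified approximation, not a real SEO tool): it keeps ordinary <a href> links and visible text, and discards everything inside <script> tags, just as a spider ignores JavaScript-generated menus.

Python

# Minimal "spider simulator" sketch: collect the links and text a crawler
# would actually see, ignoring everything generated inside <script> tags.
from html.parser import HTMLParser

class SpiderSimulator(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.text, self.in_script = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True        # spiders skip script contents
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)  # only real <a href> links count

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.text.append(data.strip())

sim = SpiderSimulator()
sim.feed('<a href="/about.html">About</a>'
         '<script>document.write("<a href=/fake>fake</a>")</script>')
print(sim.links)  # ['/about.html'] -- the script-generated link is invisible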

If you happen to have tons of hyperlinks on your pages (although it is highly recommended to have fewer than 100 hyperlinks on a page), you might have a hard time checking that they are all OK. For instance, if you have pages that return "403 Forbidden", "404 Page Not Found" or similar errors that prevent the spider from accessing the page, then it is certain that the page will not be indexed. It is necessary to mention that a spider simulator does not deal with 403 and 404 errors, because it checks where links lead to, not whether the target of the link is in place; you therefore need to use other tools to check whether the targets of hyperlinks are the intended ones.

Looking for Your Keywords

While there are specific tools, like the Keyword Playground or the Website Keyword

Suggestions, which deal with keywords in more detail, search engine spider simulators also

help you see, through the eyes of a spider, where keywords are located in the text of the page.


Why is this important? Because keywords in the first paragraphs of a page weigh more than

keywords in the middle or at the end. And if keywords visually appear to us to be on the top,

this may not be the way spiders see them. Consider a standard Web page with tables. In this

case chronologically the code that describes the page layout (like navigation links or separate

cells with text that are the same sitewide) might come first and, what is worse, can be so long

that the actual page-specific content will be screens away from the top of the page.

Are Dynamic Pages Too Dynamic to Be Seen At All?

Dynamic pages (especially ones with question marks in the URL) are also something that spiders do not love, although many search engines do index dynamic pages as well. Running

the spider simulator will give you an idea how well your dynamic pages are accepted by

search engines.

Meta Keywords and Meta Description

Meta keywords and meta description, as the name implies, are to be found in the <META>

tag of an HTML page. Meta keywords and meta descriptions were once the single most important criteria for determining the relevance of a page, but search engines now employ alternative mechanisms for determining relevancy, so you can safely skip listing keywords and descriptions in meta tags (unless you want to add instructions there telling the spider what to index and what not to; apart from that, meta tags are not very useful anymore).

Meta tags are a great way for webmasters to provide search engines with information about

their sites. Meta tags can be used to provide information to all sorts of clients, and each

system processes only the meta tags they understand and ignores the rest. Meta tags are added

to the <head> section of your HTML page and generally look like this:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="Description" CONTENT="Author: A.N. Author, Illustrator: P. Picture, Category: Books, Price: £9.24, Length: 784 pages">
</head>
<body>
...
</body>
</html>


Methodology

Here we come to know how the whole process started.

Finding information on the World Wide Web had been a difficult and frustrating task, but it became much easier with breakthroughs in search engine technology in the late 1990s.

A web search engine is a software system that is designed to search for information on the

World Wide Web. The search results are generally presented in a line of results often referred

to as search engine results pages (SERPs). The information may be a mix of web pages,

images, and other types of files. Some search engines also mine data available in databases or

open directories. Unlike web directories, which are maintained only by human editors, search

engines also maintain real-time information by running an algorithm on a web crawler (A

Web crawler is an Internet bot that systematically browses the World Wide Web, typically

for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant,

an automatic indexer).

History


Further information: Timeline of web search engines

Timeline (full list):

1993: W3Catalog (Inactive); Aliweb (Inactive); JumpStation (Inactive); WWW Worm (Inactive)
1994: WebCrawler (Active, Aggregator); Go.com (Active, Yahoo Search); Lycos (Active); Infoseek (Inactive)
1995: AltaVista (Inactive, redirected to Yahoo!); Daum (Active); Magellan (Inactive); Excite (Active); SAPO (Active); Yahoo! (Active, launched as a directory)
1996: Dogpile (Active, Aggregator); Inktomi (Inactive, acquired by Yahoo!); HotBot (Active, lycos.com); Ask Jeeves (Active, rebranded ask.com)
1997: Northern Light (Inactive); Yandex (Active)
1998: Google (Active); Ixquick (Active, also as Startpage); MSN Search (Active as Bing); empas (Inactive, merged with NATE)
1999: AlltheWeb (Inactive, URL redirected to Yahoo!); GenieKnows (Active, rebranded Yellowee.com); Naver (Active); Teoma (Inactive, redirects to Ask.com); Vivisimo (Inactive)
2000: Baidu (Active); Exalead (Active); Gigablast (Active)
2003: Info.com (Active); Scroogle (Inactive)
2004: Yahoo! Search (Active, launched own web search; see Yahoo! Directory, 1995); A9.com (Inactive); Sogou (Active)
2005: AOL Search (Active); GoodSearch (Active); SearchMe (Inactive)
2006: Soso (Active); Quaero (Inactive); Ask.com (Active); Live Search (Active as Bing, launched as rebranded MSN Search); ChaCha (Active); Guruji.com (Inactive)
2007: wikiseek (Inactive); Sproose (Inactive); Wikia Search (Inactive); Blackle.com (Active, Google Search)
2008: Powerset (Inactive, redirects to Bing); Picollator (Inactive); Viewzi (Inactive); Boogami (Inactive); LeapFish (Inactive); Forestle (Inactive, redirects to Ecosia); DuckDuckGo (Active)
2009: Bing (Active, launched as rebranded Live Search); Yebol (Inactive); Mugurdy (Inactive due to a lack of funding); Scout (Goby) (Active); NATE (Active)
2010: Blekko (Active); Cuil (Inactive); Yandex (Active, launched global English search)
2011: YaCy (Active, P2P web search engine)
2012: Volunia (Inactive)
2013: Halalgoogling (Active, Islamic/halal filter search)

During early development of the web, there was a list of webservers edited by Tim Berners-

Lee and hosted on the CERN webserver. One historical snapshot of the list in 1992 remains,


but as more and more webservers went online the central list could no longer keep up. On the

NCSA (National Center for Supercomputing Applications) site, new servers were announced

under the title "What's New!"

The first tool used for searching on the Internet was Archie. The name stands for "archive"

without the "v". It was created in 1990 by Alan Emtage, Bill Heelan and J. Peter Deutsch,

computer science students at McGill University in Montreal. The program downloaded the

directory listings of all the files located on public anonymous FTP (File Transfer Protocol)

sites, creating a searchable database of file names; however, Archie did not index the contents

of these sites since the amount of data was so limited it could be readily searched manually.

In June 1993, Matthew Gray, then at MIT (Massachusetts Institute of Technology), produced what was probably the first web robot (a software application that runs automated tasks over the Internet), the Perl-based World Wide Web Wanderer (Perl is a family of high-level, general-purpose, interpreted, dynamic programming languages, including Perl 5 and Perl 6), and used it to generate an index called 'Wandex'.

The purpose of the Wanderer was to measure the size of the World Wide Web, which it did

until late 1995. The web's second search engine Aliweb appeared in November 1993. Aliweb

did not use a web robot, but instead depended on being notified by website administrators of

the existence at each site of an index file in a particular format.

JumpStation (created in December 1993 by Jonathon Fletcher) used a web robot to find web

pages and to build its index, and used a web form (which allows a user to enter data that is sent to a server for processing) as the interface to its query program. It was thus the first


WWW resource-discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching) as described below. Because of the limited resources

available on the platform it ran on, its indexing and hence searching were limited to the titles

and headings found in the web pages the crawler encountered.

One of the first "all text" crawler-based search engines was WebCrawler, which came out in

1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which

has become the standard for all major search engines since. It was also the first one widely

known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was

launched and became a major commercial endeavor.

Soon after, many search engines appeared and vied for popularity. These included Magellan,

Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most

popular ways for people to find web pages of interest, but its search function operated on its web directory (a directory specializes in linking to other web sites and categorizing those links) rather than full-text copies of web pages. Information seekers could also browse the directory instead of doing a keyword-based search.

Google adopted the idea of selling search terms in 1998 from a small search engine company named goto.com (this relates to internet advertising). This move had a significant effect on the search engine business, which went from struggling to being one of the most profitable businesses on the internet.

In 1996, Netscape (an American computer services company, best known for its web browser, Netscape Navigator) was looking to give a single search engine an exclusive deal as

the featured search engine on Netscape's web browser. There was so much interest that

instead Netscape struck deals with five of the major search engines: for $5 million a year,


each search engine would be in rotation on the Netscape search engine page. The five engines

were Yahoo!, Magellan, Lycos, Infoseek, and Excite.

Search engines were also known as some of the brightest stars in the Internet investing frenzy

that occurred in the late 1990s. Several companies entered the market spectacularly,

receiving record gains during their initial public offerings. Some have taken down their

public search engine, and are marketing enterprise-only editions, such as Northern Light.

Many search engine companies were caught up in the dot-com bubble, a speculation-driven

market boom that peaked in 1999 and ended in 2001.

Around 2000, Google's search engine rose to prominence. The company achieved better results for many searches with an innovation called PageRank (a way of measuring the importance of website pages), as explained in the paper Anatomy of a Search Engine written by Sergey Brin and Larry Page, who later founded Google. This iterative algorithm ranks web pages based on the number and PageRank of the other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others (a simplified sketch of the iteration follows below). Google also maintained a minimalist interface for its search engine. In contrast, many of its competitors embedded a search engine in a web portal. In fact, the Google search engine became so popular that spoof engines emerged, such as Mystery Seeker.
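To make the iterative idea concrete, here is a simplified PageRank sketch (my own illustration, not Google's production algorithm; the three-page link graph and the 0.85 damping factor are assumptions for the example):

Python

# Simplified PageRank sketch: repeatedly redistribute rank along links
# until the scores settle; heavily linked-to pages accumulate more rank.
links = {
    "a.html": ["b.html", "c.html"],  # page a links to pages b and c
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}
damping = 0.85
rank = {page: 1.0 / len(links) for page in links}

for _ in range(50):  # iterate until the scores converge
    new_rank = {}
    for page in links:
        # sum the rank flowing in from every page that links here
        incoming = sum(rank[src] / len(outs)
                       for src, outs in links.items() if page in outs)
        new_rank[page] = (1 - damping) / len(links) + damping * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))  # highest-ranked first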

By 2000, Yahoo! was providing search services based on Inktomi's search engine. Yahoo!

acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003.

Yahoo! switched to Google's search engine until 2004, when it launched its own search

engine based on the combined technologies of its acquisitions.

Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In


early 1999 the site began to display listings from Looksmart (an American, publicly traded online advertising company founded in 1995), blended with results from

Inktomi. For a short time in 1999, MSN Search used results from AltaVista instead. In

2004, Microsoft began a transition to its own search technology, powered by its own web

crawler (called msnbot). Microsoft's rebranded search engine, Bing, was launched on June 1,

2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search

would be powered by Microsoft Bing technology.

Before going ahead with further details, I would like to highlight aggregators. "Aggregators" are the buzzword of choice for the various online companies that gather information from fragmented marketplaces into a single portal to make life easier for everyone. A classic example is online airline and hotel reservations.

How web search engines work

A search engine operates in the following order:

1. Web crawling (a Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing)

2. Indexing (it collects, parses, and stores data to facilitate fast and accurate

information retrieval. Index design incorporates interdisciplinary concepts from

linguistics, cognitive psychology, mathematics, informatics, and computer science)

3. Searching (a search query is what a user enters into a web search engine to satisfy his or her information needs)

An explanation of each is given below.


Web search engines work by storing information about many web pages, which they retrieve

from the HTML markup of the pages. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated program which follows every link on the site. The site owner can exclude specific pages by using robots.txt.
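A minimal sketch of that crawl loop, using only Python's standard library (the seed URL and page limit are hypothetical, and a real crawler needs politeness delays, per-host robots handling and error handling far beyond this):

Python

# Minimal crawler sketch: fetch pages, follow links, respect robots.txt.
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # collect every hyperlink on the page
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    robots = urllib.robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    robots.read()
    frontier, seen = [seed], set()
    while frontier and len(seen) < limit:
        url = frontier.pop(0)
        if url in seen or not robots.can_fetch("*", url):
            continue  # skip pages the site owner excluded via robots.txt
        seen.add(url)
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        parser = LinkParser()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
    return seen

# crawl("http://example.com/")  # hypothetical seed URL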

The search engine then analyzes the contents of each page to determine how it should be indexed (for example, words can be extracted from the titles, page content, headings, or special fields called meta tags, which are part of a web page's head section). Data about web pages is stored in an index database for use in later queries. A query from a user can be a single word. The index helps find information relating to the query as quickly as possible.

Some search engines, such as Google, store all or part of the source page (referred to as a

cache) as well as information about the web pages, whereas others, such as AltaVista, store

every word of every page they find. This cached page always holds the actual search text

since it is the one that was actually indexed, so it can be very useful when the content of the

current page has been updated and the search terms are no longer in it. This problem might be

considered a mild form of linkrot, and Google's handling of it increases usability by

satisfying user expectations that the search terms will be on the returned webpage. This

satisfies the principle of least astonishment, since the user normally expects that the search

terms will be on the returned pages. Increased search relevance makes these cached pages

very useful as they may contain data that may no longer be available elsewhere.


[Figure: High-level architecture of a standard Web crawler]

When a user enters a query into a search engine (typically by using keywords), the engine

examines its inverted index and provides a listing of best-matching web pages according to

its criteria, usually with a short summary containing the document's title and sometimes parts

of the text. The index is built from the information stored with the data and the method by

which the information is indexed. From 2007 the Google.com search engine has allowed one

to search by date by clicking "Show search tools" in the leftmost column of the initial search

results page, and then selecting the desired date range. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the Web search query.

Boolean operators are for literal searches that allow the user to refine and extend the terms of

the search. The engine looks for the words or phrases exactly as entered. Some search

engines provide an advanced feature called proximity search, which allows users to define the

distance between keywords. There is also concept-based searching where the research

involves using statistical analysis on pages containing the words or phrases you search for.

As well, natural language queries allow the user to type a question in the same form one would ask it of a human. An example of such a site is ask.com.


The usefulness of a search engine depends on the relevance of the result set it gives back.

While there may be millions of web pages that include a particular word or phrase, some

pages may be more relevant, popular, or authoritative than others. Most search engines

employ methods to rank the results to provide the "best" results first. How a search engine

decides which pages are the best matches, and what order the results should be shown in,

varies widely from one engine to another. The methods also change over time as Internet

usage changes and new techniques evolve. There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an "inverted index" by analyzing the texts it locates. (In computer science, an inverted index, also referred to as a postings file or inverted file, is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database.) This second form relies much more heavily on the computer itself to do the bulk of the work; a minimal sketch follows.
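To make the second form concrete, here is a minimal inverted-index sketch in Python (the three sample documents are invented for illustration); it also shows how a Boolean AND query, as discussed above, reduces to intersecting posting lists:

Python

# Minimal inverted index sketch: map each word to the set of documents
# containing it, so a query is answered without scanning every document.
docs = {
    "doc1": "search engines crawl the web",
    "doc2": "spiders crawl and index pages",
    "doc3": "search results are ranked",
}

index = {}
for doc_id, text in docs.items():
    for word in set(text.lower().split()):
        index.setdefault(word, set()).add(doc_id)

# A Boolean AND query: intersect the posting lists of both terms.
hits = index.get("search", set()) & index.get("crawl", set())
print(hits)  # {'doc1'} -- the only document containing both words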

Most Web search engines are commercial ventures supported by advertising revenue and thus

some of them allow advertisers to have their listings ranked higher in search results for a fee.

Search engines that do not accept money for their search results make money by running

search related ads alongside the regular search engine results. The search engines make

money every time someone clicks on one of these ads.


Market share

Here we will see which search engine has the highest market share.

Google is the world's most popular search engine, with a market share of 68.69 per cent.

Baidu comes in a distant second, answering 17.17 per cent of online queries.

The world's most popular search engines are:

Market share in October 2014:

Google: 58.01%
Baidu: 29.06%
Bing: 8.01%
Yahoo!: 4.01%
AOL: 0.21%
Ask: 0.10%
Excite: 0.00%

East Asia and Russia

East Asian countries and Russia constitute a few places where Google is not the most popular

search engine. Soso is more popular than Google in China.

Yandex commands a marketshare of 61.9 per cent in Russia, compared to Google's 28.3 per

cent. In China, Baidu is the most popular search engine. South Korea's homegrown

search portal, Naver, is used for 70 per cent of online searches in the country. Yahoo! Japan

and Yahoo! Taiwan are the most popular avenues for internet search in Japan and Taiwan,

respectively.


Search engine bias

Although search engines are programmed to rank websites based on some combination of

their popularity and relevancy, empirical studies indicate various political, economic, and

social biases in the information they provide. These biases can be a direct result of economic and commercial processes (e.g., companies that advertise with a search engine can also become more popular in its organic search results) and of political processes (e.g., the removal of search results to comply with local laws). Organic search results are listings on search engine results pages that appear because of their relevance to the search terms, as opposed to being advertisements; in contrast, non-organic search results may include pay-per-click advertising. For example, Google will not surface certain neo-Nazi websites in France and Germany, where Holocaust denial is illegal.

Biases can also be a result of social processes, as search engine algorithms are frequently

designed to exclude non-normative viewpoints in favor of more "popular" results. Indexing

algorithms of major search engines skew towards coverage of U.S.-based sites, rather than

websites from non-U.S. countries.

Google Bombing (The terms Google bomb and Googlewashing refer to the practice of

causing a web page to rank highly in search engine results for unrelated or off-topic search

terms by linking heavily.) is one example of an attempt to manipulate search results for

political, social or commercial reasons.

Customized results and filter bubbles


Many search engines such as Google and Bing provide customized results based on the user's

activity history. This leads to an effect that has been called a filter bubble. The term describes

a phenomenon in which websites use algorithms to selectively guess what information a user

would like to see, based on information about the user (such as location, past click behavior

and search history). As a result, websites tend to show only information that agrees with the

user's past viewpoint, effectively isolating the user in a bubble that tends to exclude contrary

information. Prime examples are Google's personalized search results and Facebook's

personalized news stream. According to Eli Pariser, who coined the term, users get less

exposure to conflicting viewpoints and are isolated intellectually in their own informational

bubble. Pariser related an example in which one user searched Google for "BP" and got investment news about British Petroleum while another searcher got information about the Deepwater Horizon oil spill, and the two search results pages were "strikingly different". The bubble effect may have negative implications for civic discourse, according to Pariser.

Since this problem has been identified, competing search engines have emerged that seek to

avoid this problem by not tracking or "bubbling" users.

Faith-based search engines

The global growth of the Internet and the popularity of electronic content in the Arab and Muslim world during the last decade have encouraged faith adherents, notably in the Middle East and the Asian subcontinent, to "dream" of their own faith-based, i.e. "Islamic", search engines or filtered search portals that would enable users to avoid accessing forbidden websites, such as pornography, and would only allow them to access sites that are compatible with the Islamic faith. Shortly before the Muslim holy month of Ramadan, Halalgoogling, which collects results from other search engines like Google and Bing, was introduced to the world in July 2013 to present halal results to its users, nearly two years after I'mHalal, another search engine initially launched in September 2011 to serve the Middle East Internet, had to close its search service due to what its owner blamed on a lack of funding.

While a lack of investment and the slow pace of technology in the Muslim world, as home to the main consumers or targeted end users, have hindered progress and thwarted the success of a serious Islamic search engine, the spectacular failure of heavily invested Muslim lifestyle web projects like Muxlim, which received millions of dollars from investors like Rite Internet Ventures, has, according to the I'mHalal shutdown notice, made almost laughable the idea that the next Facebook or Google can come from the Middle East only if you support your bright youth. Yet Muslim internet experts have been determining for years what is or is not allowed according to the "Law of Islam", and have been categorizing websites as either "halal" or "haram". All existing and past Islamic search engines are merely custom searches indexed or monetized by major web search giants like Google, Yahoo! and Bing, with certain filtering systems applied to ensure that their users cannot access Haram sites, which include sites featuring nudity, gay content or gambling, or anything that is deemed to be anti-Islamic.

Another religiously oriented search engine is Jewogle, a Jewish version of Google; yet another is SeekFind.org, a Christian website that includes filters preventing users from seeing anything on the internet that attacks or degrades their faith.

Till now we have studied how search engines are built and their contribution to today's high-tech atmosphere. Now we will look at how, with the help of these technologies, an image is built.

How do I increase my site visibility to search engines?

These days you don’t have to limit your search to just websites. Many other forms of content

are easy to find, including images. No matter what you’re looking for, an image is (for better

or worse) just one image search away.

You may wonder, however, how image search works. How are images sorted and classified,

making it possible to find tens or hundreds of relevant results? Perhaps you’re just curious, or

perhaps you run a site and want to know so you can improve your own ranking. In either

case, taking a deeper look could be helpful.

Some people assume that image search is conducted via fancy algorithms that determine what

an image is about and then index it. I know that’s where I started. As it turns out, however,

old-fashioned text is one of the most important factors in an image's ranking.

More specifically, the file name matters. Go ahead – do an image search. What do the top

results have in common? Almost invariably, it’s a portion of their file name. Most of the top

results for “pizza” have the word pizza in the file name.

That might seem obvious. But actually, it’s not. Most digital photographs, for example, will

start life with a file name like “1020302.jpg.” It’s only later that they’re re-named. For

webmasters, ensuring that a relevant file name is given to an image is just as basic and

important as making sure that a webpage’s keyword appears in that page’s metadata title

and/or description. But it’s not automatic. It takes constant effort.
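As a small illustration of that effort (a hypothetical sketch in Python; the file name and keywords are made up, and a real workflow would draw keywords from your keyword research):

Python

# Sketch: replace a camera's numeric file name with a keyword-rich one.
import os

old_name = "1020302.jpg"               # typical camera default
keywords = "margherita-pizza-closeup"  # hypothetical target keywords

new_name = keywords + os.path.splitext(old_name)[1]
print(new_name)  # margherita-pizza-closeup.jpg
# os.rename(old_name, new_name)        # uncomment to rename the real file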


We are now at the final step of building an image search engine — accepting a query image

and performing an actual search.

Let’s take a second to review how we got here:

Step 1: Defining Your Image Descriptor. Before we even consider building an

image search engine, we need to consider how we are going to represent and quantify

our image using only a list of numbers (i.e. a feature vector). We explored three

aspects of an image that can easily be described: color, texture, and shape. We can use

one of these aspects, or many of them.

Step 2: Indexing Your Dataset. Now that we have selected a descriptor, we can

apply the descriptor to extract features from each and every image in our dataset. The

process of extracting features from an image dataset is called “indexing”. These

features are then written to disk for later use. Indexing is also a task that is easily

made parallel by utilizing multiple cores/processors on our machine.

Step 3: Defining Your Similarity Metric. In Step 1, we defined a method to extract

features from an image. Now, we need to define a method to compare our feature

vectors. A distance function should accept two feature vectors and then return a value

indicating how “similar” they are. Common choices for similarity functions include

(but are certainly not limited to) the Euclidean, Manhattan, Cosine, and Chi-Squared

distances.
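To make those distance functions concrete, here is a small NumPy sketch (the two feature vectors are invented for illustration):

Python

# Sketch of common distance functions between two feature vectors.
import numpy as np

a = np.array([0.2, 0.5, 0.3])  # hypothetical feature vectors
b = np.array([0.1, 0.6, 0.3])

euclidean = np.sqrt(np.sum((a - b) ** 2))      # straight-line distance
manhattan = np.sum(np.abs(a - b))              # sum of absolute differences
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
chi2 = 0.5 * np.sum((a - b) ** 2 / (a + b + 1e-10))  # common for histograms

print(euclidean, manhattan, cosine, chi2)  # all four are 0 when a == b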

Finally, we are now ready to perform our last step in building an image search engine:


Searching and Ranking

The Query

Before we can perform a search, we need a query.

The last time you went to Google, you typed in some keywords into the search box, right?

The text you entered into the input form was your “query”.

Google then took your query, analyzed it, and compared it to their gigantic index of

webpages, ranked them, and returned the most relevant webpages back to you.

Similarly, when we are building an image search engine, we need a query image.

Query images come in two flavors: an internal query image and an external query image.

As the name suggests, an internal query image already belongs in our index. We have already

analyzed it, extracted features from it, and stored its feature vector.

The second type of query image is an external query image. This is the equivalent to typing

our text keywords into Google. We have never seen this query image before and we can’t

make any assumptions about it. We simply apply our image descriptor, extract features, rank

the images in our index based on similarity to the query, and return the most relevant results.

Let’s think back to our similarity metrics for a second and assume that we are using the

Euclidean distance. The Euclidean distance has a nice property called the Coincidence

Axiom, implying that the function returns a value of 0 (indicating perfect similarity) if and

only if the two feature vectors are identical.

Example: If I were to search for an image already in my index, then the Euclidean distance

between the two feature vectors would be zero, implying perfect similarity. This image would

then be placed at the top of my search results since it is the most relevant. This makes sense


and is the intended behavior.

How strange it would be if I searched for an image already in my index and did not find it in

the #1 result position. That would likely imply that there was a bug in my code somewhere or

I’ve made some very poor choices in image descriptors and similarity metrics.

Overall, using an internal query image serves as a sanity check. It allows you to make sure

that your image search engine is functioning as expected.
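As a quick illustration of the check (hypothetical feature vector, using NumPy):

Python

# Sanity-check sketch: an image compared against its own feature vector
# must come back with a distance of exactly zero (perfect similarity).
import numpy as np

query = np.array([0.25, 0.50, 0.25])  # hypothetical indexed feature vector
print(np.linalg.norm(query - query))  # 0.0 -- the image matches itself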

Once you can confirm that your image search engine is working properly, you can then

accept external query images that are not already part of your index.

The Search

So what's the process of actually performing a search? Check out the outline below:

1. Accept a query image from the user

A user could be uploading an image from their desktop or from their mobile device. As

image search engines become more prevalent, I suspect that most queries will come from

devices such as iPhones and Droids. It’s simple and intuitive to snap a photo of a place,

object, or something that interests you using your cellphone, and then have it automatically

analyzed and relevant results returned.

2. Describe the query image

Now that you have a query image, you need to describe it using the exact same image

descriptor(s) as you did in the indexing phase. For example, if I used a RGB color histogram

with 32 bins per channel when I indexed the images in my dataset, I am going to use the same

32 bin per channel histogram when describing my query image. This ensures that I have a

consistent representation of my images. After applying my image descriptor, I now have a

feature vector for the query image.


3. Perform the Search

To perform the most basic method of searching, you need to loop over all the feature vectors

in your index. Then, you use your similarity metric to compare the feature vectors in your

index to the feature vectors from your query. Your similarity metric will tell you how

“similar” the two feature vectors are. Finally, sort your results by similarity.

Looping over your entire index may be feasible for small datasets. But if you have a large

image dataset, like Google or TinEye, this simply isn’t possible. You can’t compute the

distance between your query features and the billions of feature vectors already present in

your dataset.

4. Display Your Results to the User

Now that we have a ranked list of relevant images we need to display them to the user. This

can be done using a simple web interface if the user is on a desktop, or we can display the

images using some sort of app if they are on a mobile device. This step is pretty trivial in the

overall context of building an image search engine, but you should still give thought to the

user interface and how the user will interact with your image search engine.

Summary

So there you have it, the four steps of building an image search engine, from front to back:

1. Define your image descriptor.
2. Index your dataset.
3. Define your similarity metric.
4. Perform a search, rank the images in your index in terms of relevancy to the user, and

display the results to the user.


Here is a good example of an image search engine that illustrates the explanation above.

Think about it this way. When you go to Google and type “Lord of the Rings” into the search

box, you expect Google to return pages to you that are relevant to Tolkien’s books and the

movie franchise. Similarly, if we present an image search engine with a query image, we expect it to return images that are relevant to the content of that image. Hence, we sometimes call image search engines by what they are more commonly known as in academic circles: Content-Based Image Retrieval (CBIR) systems.

So what’s the overall goal of our Lord of the Rings image search engine?

The goal, given a query image from one of our five different categories, is to return the

category’s corresponding images in the top 10 results.

That was a mouthful. Let’s use an example to make it more clear.

If I submitted a query image of The Shire to our system, I would expect it to give me all 5

Shire images in our dataset back in the first 10 results. And again, if I submitted a query

image of Rivendell, I would expect our system to give me all 5 Rivendell images in the first

10 results.

Make sense? Good. Let’s talk about the four steps to building our image search engine.

The 4 Steps to Building an Image Search Engine

On the most basic level, there are four steps to building an image search engine:

Define your descriptor: What type of descriptor are you going to use? Are you describing

color? Texture? Shape?

Index your dataset: Apply your descriptor to each image in your dataset, extracting a set of

features.


Define your similarity metric: How are you going to define how “similar” two images are?

You'll likely be using some sort of distance metric. (A metric or distance function is a function that defines a distance between elements of a set. A set with a metric is called a metric space. A metric induces a topology on a set, but not all topologies can be generated by a metric.) Common choices include Euclidean, Cityblock (Manhattan), Cosine, and chi-squared, to name a few.

Searching: To perform a search, apply your descriptor to your query image, and then ask your distance metric to rank how similar the images in your index are to your query image. Sort your results by similarity and then examine them.

Step #1: The Descriptor – A 3D RGB Color Histogram

Our image descriptor is a 3D color histogram in the RGB color space with 8 bins per red,

green, and blue channel.

The best way to explain a 3D histogram is to use the conjunctive AND. This image descriptor

will ask a given image how many pixels have a Red value that falls into bin #1 AND a Green

value that falls into bin #2 AND how many Blue pixels fall into bin #1. This process will be

repeated for each combination of bins; however, it will be done in a computationally efficient

manner.

When computing a 3D histogram with 8 bins, OpenCV will store the feature vector as an (8,

8, 8) array. We’ll simply flatten it and reshape it to (512,). Once it’s flattened, we can easily

compare feature vectors together for similarity.

Ready to see some code? Okay, here we go:

3D RGB Histogram in OpenCV and Python

Python

 1  # import the necessary packages
 2  import numpy as np
 3  import cv2
 4
 5  class RGBHistogram:
 6      def __init__(self, bins):
 7          # store the number of bins the histogram will use
 8          self.bins = bins
 9
10      def describe(self, image):
11          # compute a 3D histogram in the RGB colorspace,
12          # then normalize the histogram so that images
13          # with the same content, but either scaled larger
14          # or smaller will have (roughly) the same histogram
15          hist = cv2.calcHist([image], [0, 1, 2],
16              None, self.bins, [0, 256, 0, 256, 0, 256])
17          hist = cv2.normalize(hist, hist)
18
19          # return our 3D histogram as a flattened array
20          return hist.flatten()

As you can see, a RGBHistogram class has been defined. The reason for this is that you

rarely ever extract features from a single image alone. You instead extract features from an

entire dataset of images. Furthermore, you expect that the features extracted from all images

utilize the same parameters — in this case, the number of bins for the histogram. It wouldn’t

make much sense to extract a histogram using 32 bins from one image and then 128 bins for

another image if you intend on comparing them for similarity.

Let’s take the code apart and understand what’s going on:

Lines 6-8: Here I am defining the constructor for the RGBHistogram. The only parameter we

need is the number of bins for each channel in the histogram. Again, this is why I prefer using

classes instead of functions for image descriptors — by putting the relevant parameters in the

constructor, you ensure that the same parameters are utilized for each image.

Line 10: You guessed it. The describe method is used to “describe” the image and return a

feature vector.

Lines 15-16: Here we extract the actual 3D RGB histogram (or actually, BGR, since OpenCV

stores the image as a NumPy array, but with the channels in reverse order).  We assume

self.bins is a list of three integers, designating the number of bins for each channel.

Line 17: It's important that we normalize the histogram in terms of pixel counts. If we used

the raw (integer) pixel counts of an image, then shrunk it by 50% and described it again, we

would have two different feature vectors for identical images. In most cases, you want to

avoid this scenario. We obtain scale invariance by converting the raw integer pixel counts

into real-valued percentages. For example, instead of saying bin #1 has 120 pixels in it, we


would say bin #1 has 20% of all pixels in it. Again, by using the percentages of pixel counts

rather than raw, integer pixel counts, we can assure that two identical images, differing only

in size, will have (roughly) identical feature vectors.

Line 20: When computing a 3D histogram, the histogram will be represented as a NumPy

array with (N, N, N) bins. In order to more easily compute the distance between histograms,

we simply flatten this histogram to have a shape of (N ** 3,). Example: When we instantiate

our RGBHistogram, we will use 8 bins per channel. Without flattening our histogram, the

shape would be (8, 8, 8). But by flattening it, the shape becomes (512,).
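A quick NumPy sketch of that reshaping (the zero-filled array stands in for a real histogram):

Python

# Sketch: flattening an (8, 8, 8) histogram into a 512-dimensional vector.
import numpy as np

hist = np.zeros((8, 8, 8))   # stand-in for a 3D histogram, 8 bins per channel
print(hist.flatten().shape)  # (512,) -- 8 * 8 * 8 bins laid out in one row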

Now that we have defined our image descriptor, we can move on to the process of indexing our dataset.

Step #2: Indexing our Dataset

Okay, so we’ve decided that our image descriptor is a 3D RGB histogram. The next step is to

apply our image descriptor to each image in the dataset.

This simply means that we are going to loop over our 25 image dataset, extract a 3D RGB

histogram from each image, store the features in a dictionary, and write the dictionary to file.

Yep, that’s it.

In reality, you can make indexing as simple or complex as you want. Indexing is a task that is

easily made parallel. If we had a four core machine, we could divide the work up between the

four cores and speedup the indexing process. But since we only have 25 images, that’s pretty

silly, especially given how fast it is to compute a histogram.
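If you did want to divide the work up, it could look something like this minimal sketch (my own illustration using Python's multiprocessing module; describe_image and the file names are hypothetical stand-ins for the descriptor call and the dataset):

Python

# Sketch of parallel indexing: spread feature extraction across CPU cores.
from multiprocessing import Pool

def describe_image(image_path):
    # load the image and compute its feature vector here; a placeholder
    # vector is returned so the sketch stays self-contained
    return image_path, [0.0]

if __name__ == "__main__":
    paths = ["img1.png", "img2.png", "img3.png"]  # hypothetical image paths
    with Pool(processes=4) as pool:               # divide work across 4 cores
        index = dict(pool.map(describe_image, paths))
    print(sorted(index))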

Let's dive into some code.

Indexing an Image Dataset using Python

Python

 1  # import the necessary packages
 2  from pyimagesearch.rgbhistogram import RGBHistogram
 3  import argparse
 4  import cPickle
 5  import glob
 6  import cv2
 7
 8  # construct the argument parser and parse the arguments
 9  ap = argparse.ArgumentParser()
10  ap.add_argument("-d", "--dataset", required = True,
11      help = "Path to the directory that contains the images to be indexed")
12  ap.add_argument("-i", "--index", required = True,
13      help = "Path to where the computed index will be stored")
14  args = vars(ap.parse_args())
15
16  # initialize the index dictionary to store our quantified
17  # images, with the 'key' of the dictionary being the image
18  # filename and the 'value' our computed features
19  index = {}

Alright, the first thing we are going to do is import the packages we need.

The --dataset argument is the path to where our images are stored on disk and the --index

option is the path to where we will store our index once it has been computed.

Finally, we’ll initialize our index — a builtin Python dictionary type. The key for the

dictionary will be the image filename. We’ve made the assumption that all filenames are

unique, and in fact, for this dataset, they are. The value for the dictionary will be the

computed histogram for the image.

Using a dictionary for this example makes the most sense, especially for explanation

purposes. Given a key, the dictionary points to some other object. When we use an image

filename as a key and the histogram as the value, we are implying that a given histogram H is

used to quantify and represent the image with filename K. 

Again, you can make this process as simple or as complicated as you want. More complex

image descriptors make use of term frequency-inverse document frequency weighting (tf-idf)

and an inverted index, but for the time being, let’s keep it simple.

Indexing an Image Dataset using Python

Python

1  # initialize our image descriptor -- a 3D RGB histogram with
2  # 8 bins per channel
3  desc = RGBHistogram([8, 8, 8])

Here we instantiate our RGBHistogram. Again, we will be using 8 bins for each of the red, green, and blue channels, respectively.

Indexing an Image Dataset using Python

Python

 1  # use glob to grab the image paths and loop over them
 2  for imagePath in glob.glob(args["dataset"] + "/*.png"):
 3      # extract our unique image ID (i.e. the filename)
 4      k = imagePath[imagePath.rfind("/") + 1:]
 5
 6      # load the image, describe it using our RGB histogram
 7      # descriptor, and update the index
 8      image = cv2.imread(imagePath)
 9      features = desc.describe(image)
10      index[k] = features

Here is where the actual indexing takes place. Let's break it down:

Line 2: We use glob to grab the image paths and start to loop over our dataset.

Line 4: We extract the "key" for our dictionary. All filenames are unique in this sample dataset, so the filename itself will be enough to serve as the key.

Lines 8-10: The image is loaded off disk, and we then use our RGBHistogram to extract a histogram from the image. The histogram is then stored in the index.

Indexing an Image Dataset using Python

# we are now done indexing our image -- now we can write our
# index to disk
f = open(args["index"], "w")
f.write(cPickle.dumps(index))
f.close()

Now that our index has been computed, let's write it to disk so we can use it for searching later on.

Step #3: The Search

We now have our index sitting on disk, ready to be searched. The problem is, we need some code to perform the actual search. How are we going to compare two feature vectors, and how are we going to determine how similar they are? This question is better addressed first with some code.

Building an Image Search Engine in Python and OpenCV

1   # import the necessary packages
2   import numpy as np
3
4   class Searcher:
5       def __init__(self, index):
6           # store our index of images
7           self.index = index
8
9       def search(self, queryFeatures):
10          # initialize our dictionary of results
11          results = {}
12
13          # loop over the index
14          for (k, features) in self.index.items():
15              # compute the chi-squared distance between the features
16              # in our index and our query features -- using the
17              # chi-squared distance which is normally used in the
18              # computer vision field to compare histograms
19              d = self.chi2_distance(features, queryFeatures)
20
21              # now that we have the distance between the two feature
22              # vectors, we can update the results dictionary -- the
23              # key is the current image ID in the index and the
24              # value is the distance we just computed, representing
25              # how 'similar' the image in the index is to our query
26              results[k] = d
27
28          # sort our results, so that the smaller distances (i.e. the
29          # more relevant images) are at the front of the list
30          results = sorted([(v, k) for (k, v) in results.items()])
31
32          # return our results
33          return results
34
35      def chi2_distance(self, histA, histB, eps = 1e-10):
36          # compute the chi-squared distance
37          d = 0.5 * np.sum([((a - b) ** 2) / (a + b + eps)
38              for (a, b) in zip(histA, histB)])
39
40          # return the chi-squared distance
41          return d

First off, most of this code is just comments, so don't be scared that it's 41 lines. Let's investigate what's going on:

Lines 4-7: The first thing I do is define a Searcher class and a constructor with a single parameter, the index. This index is assumed to be the index dictionary that we wrote to file during the indexing step.

Line 11: We define a dictionary to store our results. The key is the image filename (from the index) and the value is how similar the given image is to the query image.

Lines 14-26: Here is the part where the actual searching takes place. We loop over the image filenames and corresponding features in our index. We then use the chi-squared distance to compare our color histograms. The computed distance is then stored in the results dictionary, indicating how similar the two images are to each other.

Lines 30-33: The results are sorted in terms of relevancy (the smaller the chi-squared distance, the more relevant/similar) and returned.

Lines 35-41: Here we define the chi-squared distance function used to compare the two histograms. In general, the difference between large bins vs. small bins is less important and should be weighted as such. This is exactly what the chi-squared distance does. We provide an epsilon dummy value to avoid those pesky "divide by zero" errors. Images will be considered identical if their feature vectors have a chi-squared distance of zero; the larger the distance gets, the less similar they are.
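For reference, the distance that chi2_distance computes can be written out as a formula (this is implied by the code above, shown here in standard LaTeX notation):

d_{\chi^2}(H_A, H_B) = \frac{1}{2} \sum_{i=1}^{n} \frac{(a_i - b_i)^2}{a_i + b_i + \varepsilon}

where a_i and b_i are the i-th bin counts of histograms H_A and H_B, and \varepsilon is the small constant that guards against division by zero.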

So there you have it: a Python class that can take an index and perform a search. Now it's time to put this searcher to work.

Step #4: Performing a Search

Finally, we are closing in on a functioning image search engine. But we're not quite there yet. We need a little extra code to handle loading the images off disk and performing the search:

Building an Image Search Engine in Python and OpenCV

# import the necessary packages
from pyimagesearch.searcher import Searcher
import numpy as np
import argparse
import cPickle
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
    help = "Path to the directory that contains the images we just indexed")
ap.add_argument("-i", "--index", required = True,
    help = "Path to where we stored our index")
args = vars(ap.parse_args())

# load the index and initialize our searcher
index = cPickle.loads(open(args["index"]).read())
searcher = Searcher(index)

First things first. Import the packages that we will need. We then define our arguments in the same manner that we did during the indexing step. Finally, we use cPickle to load our index off disk and initialize our Searcher.

1   # loop over images in the index -- we will use each one as
2   # a query image
3   for (query, queryFeatures) in index.items():
4       # perform the search using the current query
5       results = searcher.search(queryFeatures)
6
7       # load the query image and display it
8       path = args["dataset"] + "/%s" % (query)
9       queryImage = cv2.imread(path)
10      cv2.imshow("Query", queryImage)
11      print "query: %s" % (query)
12
13      # initialize the two montages to display our results --
14      # we have a total of 25 images in the index, but let's only
15      # display the top 10 results; 5 images per montage, with
16      # images that are 400x166 pixels
17      montageA = np.zeros((166 * 5, 400, 3), dtype = "uint8")
18      montageB = np.zeros((166 * 5, 400, 3), dtype = "uint8")
19
20      # loop over the top ten results
21      for j in xrange(0, 10):
22          # grab the result (we are using row-major order) and
23          # load the result image
24          (score, imageName) = results[j]
25          path = args["dataset"] + "/%s" % (imageName)
26          result = cv2.imread(path)
27          print "\t%d. %s : %.3f" % (j + 1, imageName, score)
28
29          # check to see if the first montage should be used
30          if j < 5:
31              montageA[j * 166:(j + 1) * 166, :] = result
32
33          # otherwise, the second montage should be used
34          else:
35              montageB[(j - 5) * 166:((j - 5) + 1) * 166, :] = result
36
37      # show the results
38      cv2.imshow("Results 1-5", montageA)
39      cv2.imshow("Results 6-10", montageB)
40      cv2.waitKey(0)

Most of this code handles displaying the results. The actual "search" is done in a single line (Line 5). Regardless, let's examine what's going on:

Line 3: We are going to treat each image in our index as a query and see what results we get back. Normally, queries are external and not part of the dataset, but before we get to that, let's just perform some example searches.

Line 5: Here is where the actual search takes place. We treat the current image as our query and perform the search.

Lines 8-11: Load and display our query image.

Lines 17-35: In order to display the top 10 results, I have decided to use two montage images. The first montage shows results 1-5 and the second montage results 6-10. The name of the image and its distance are printed on Line 27.

Lines 38-40: Finally, we display our search results to the user.

So there you have it: an entire image search engine in Python.

Figure: Search results using Mordor-002.png as a query. Our image search engine is able to return images from Mordor and the Black Gate.

Let's start at the ending of The Return of the King, using Frodo and Sam's ascent into the volcano as our query image. As you can see, our top 5 results are from the "Mordor" category.

Perhaps you are wondering why the query image of Frodo and Sam is also the image in the #1 result position. Well, let's think back to our chi-squared distance. We said that an image would be considered "identical" if the distance between the two feature vectors is zero. Since we are using images we have already indexed as queries, they are in fact identical and will have a distance of zero. Since a value of zero indicates perfect similarity, the query image appears in the #1 result position.

Now, let's try another image, this time using The Goblin King in Goblin Town:

Figure: Search results using Goblin-004.png as a query. The top 5 images returned are from Goblin Town.

The Goblin King doesn't look very happy. But we sure are happy that all five images from Goblin Town are in the top 10 results.

Finally, here are three more example searches, for Dol-Guldur, Rivendell, and The Shire. Again, we can clearly see that all five images from their respective categories are in the top 10 results.

Figure: Using images from Dol-Guldur (Dol-Guldur-004.png), Rivendell (Rivendell-003.png), and The Shire (Shire-002.png) as queries.

But clearly, this is not how all image search engines work. Google allows you to upload an image of your own. TinEye allows you to upload an image of your own. Why can't we? Let's see how we can perform a search using an image that we haven't already indexed:

Building an Image Search Engine using Python and OpenCV

1   # import the necessary packages
2   from pyimagesearch.rgbhistogram import RGBHistogram
3   from pyimagesearch.searcher import Searcher
4   import numpy as np
5   import argparse
6   import cPickle
7   import cv2
8
9   # construct the argument parser and parse the arguments
10  ap = argparse.ArgumentParser()
11  ap.add_argument("-d", "--dataset", required = True,
12      help = "Path to the directory that contains the images we just indexed")
13  ap.add_argument("-i", "--index", required = True,
14      help = "Path to where we stored our index")
15  ap.add_argument("-q", "--query", required = True,
16      help = "Path to query image")
17  args = vars(ap.parse_args())
18
19  # load the query image and show it
20  queryImage = cv2.imread(args["query"])
21  cv2.imshow("Query", queryImage)
22  print "query: %s" % (args["query"])
23
24  # describe the query in the same way that we did in
25  # index.py -- a 3D RGB histogram with 8 bins per
26  # channel
27  desc = RGBHistogram([8, 8, 8])
28  queryFeatures = desc.describe(queryImage)
29
30  # load the index and perform the search
31  index = cPickle.loads(open(args["index"]).read())
32  searcher = Searcher(index)
33  results = searcher.search(queryFeatures)
34
35  # initialize the two montages to display our results --
36  # we have a total of 25 images in the index, but let's only
37  # display the top 10 results; 5 images per montage, with
38  # images that are 400x166 pixels
39  montageA = np.zeros((166 * 5, 400, 3), dtype = "uint8")
40  montageB = np.zeros((166 * 5, 400, 3), dtype = "uint8")
41
42  # loop over the top ten results
43  for j in xrange(0, 10):
44      # grab the result (we are using row-major order) and
45      # load the result image
46      (score, imageName) = results[j]
47      path = args["dataset"] + "/%s" % (imageName)
48      result = cv2.imread(path)
49      print "\t%d. %s : %.3f" % (j + 1, imageName, score)
50
51      # check to see if the first montage should be used
52      if j < 5:
53          montageA[j * 166:(j + 1) * 166, :] = result
54
55      # otherwise, the second montage should be used
56      else:
57          montageB[(j - 5) * 166:((j - 5) + 1) * 166, :] = result
58
59  # show the results
60  cv2.imshow("Results 1-5", montageA)
61  cv2.imshow("Results 6-10", montageB)
62  cv2.waitKey(0)

Lines 2-17: This should feel like pretty standard stuff by now. We are importing our packages and setting up our argument parser, although you should note the new argument --query. This is the path to our query image.

Lines 20-21: We're going to load your query image and show it to you, just in case you forgot what your query image is.

Lines 27-28: Instantiate our RGBHistogram with the exact same number of bins as during our indexing step. We then extract features from our query image.

Lines 31-33: Load our index off disk using cPickle and perform the search.

Lines 39-62: Just as in the code above to perform a search, this code simply shows us our results.

Our two query images are ones the engine has never indexed: one of Rivendell and one of The Shire. These two images will be our queries. Check out the results below:

Figure: Using external Rivendell (left) and The Shire (right) query images. For both cases, we find the top 5 search results are from the same category.

In this case, we searched using two images that we haven't seen previously. The one on the left is of Rivendell. We can see from our results that the other 5 Rivendell images in our index were returned, demonstrating that our image search engine is working properly.

On the right, we have a query image from The Shire. Again, this image is not present in our index. But when we look at the search results, we can see that the other 5 Shire images were returned from the image search engine, once again demonstrating that our image search engine is returning semantically similar images.

Summary

Here we've explored how to create an image search engine from start to finish. The first step was to choose an image descriptor: we used a 3D RGB histogram to characterize the color of our images. We then indexed each image in our dataset using our descriptor by extracting feature vectors (i.e. the histograms). From there, we used the chi-squared distance to define "similarity" between two images. Finally, we glued all the pieces together and created a Lord of the Rings image search engine.

Having seen the above example, we will now see that although text search and image search differ, they are correlated with each other.

Most webmasters don't see any difference between image alt text and the image title, and mostly keep them the same. A great discussion over at Google Webmaster Groups provides exhaustive information on the differences between an image alt attribute and an image title, and standard recommendations on how to use them.

Alt text is meant to be an alternative information source for those people who have chosen to disable images in their browsers, and for those user agents that are simply unable to "see" the images. It should describe what the image is about and get those visitors interested to see it. Without alt text, an image will be displayed as an empty icon. In Internet Explorer, alt text also pops up when you hover over an image.

Plus, Google officially confirmed it mainly focuses on alt text when trying to understand what an image is about. The image title (and the element name speaks for itself) should provide additional information and follow the rules of a regular title: it should be relevant, short, catchy, and concise (a title "offers advisory information about the element for which it is set"). In Firefox and Opera it pops up when you hover over an image.

So based on the above, we can discuss how to properly handle them:

Both tags are primarily meant for visitors (though alt text seems more important for crawlers), so provide explicit information on an image to encourage views.

Include your main keywords in both, but change them up. Keyword stuffing in alt text and title is still keyword stuffing, so keep them relevant and meaningful. An illustrative markup example follows.
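As an illustration, a logo image that follows both guidelines might be marked up like this (the filename and wording are hypothetical, not taken from any real site):

<img src="logo.png"
     alt="Acme Shoes logo - handmade leather footwear"
     title="Acme Shoes - browse the handmade footwear catalogue">

The alt text describes the image itself for crawlers and for users who cannot see it, while the title adds short advisory information shown on hover; the keywords appear in both, but worded differently.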

Another good point to take into consideration: according to Aaron Wall, alt text is crucially important when used for a site-wide header banner.

One of the reasons Aaron Wall was so motivated to change the tagline of his site recently was because the new site design contained the site's logo as a background image. The logo link was a regular static link, but it had no anchor text, only a link title to describe the link. If you do not look at the source code, the link title attribute can seem like an image alt tag when you scroll over it, but to a search engine they do not look the same. A link title is not weighted anywhere near as aggressively as an image alt tag is.

The old link title on the header link for that site was "search engine optimization book". While the site ranks #6 and #8 for that query in Google, neither of the ranking pages is the homepage (the tools page and sales letter rank). That shows that Google currently places negligible, if any, weight on link titles.

If the only link to your homepage is a logo, check the source code to verify you are using descriptive image alt text.

Conclusion & Recommendations

Conclusion:

We conclude that, with the help of the above-mentioned technologies, any company can build its image, in a way that is helpful to society, through these search engines.

While nobody can guarantee top-level positioning in search engine organic results, proper search engine optimization can help. Because search engines such as Google, Yahoo!, and Bing are so important today, it is necessary to make each page in a Web site conform to the principles of good SEO as much as possible. To do this it is necessary to:

Understand the basics of how search engines rate sites

Use proper keywords and phrases throughout the Web site

Avoid giving the appearance of spamming the search engines

Write all text for real people, not just for search engines

Use well-formed alternate attributes on images

Make sure that the necessary meta tags (and title tag) are installed in the head of each Web page

Have good incoming links to establish popularity

Make sure the Web site is regularly updated so that the content is fresh

Recommendations:

The following recommendations are made on the basis of the overall usage of such systems:

Overview

Recommender systems typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering. Collaborative filtering approaches build a model from a user's past behaviour (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users, and then use that model to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties. These approaches are often combined.

Main Article

Collaborative filtering:

One approach to the design of recommender systems that has seen wide use is collaborative filtering. Collaborative filtering methods are based on collecting and analyzing a large amount of information on users' behaviours, activities or preferences, and predicting what users will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, and it is therefore capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself. Many algorithms have been used in measuring user similarity or item similarity in recommender systems, for example the k-nearest neighbour (k-NN) approach and the Pearson Correlation; a sketch of the latter follows.
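Here is a minimal sketch of the Pearson Correlation used as a user-similarity measure, assuming toy rating dictionaries invented for illustration:

# a minimal sketch of user-user similarity with the Pearson
# Correlation -- the ratings below are hypothetical
import numpy as np

def pearson(ratingsA, ratingsB):
    # only items rated by both users can be compared
    common = set(ratingsA) & set(ratingsB)
    if len(common) < 2:
        return 0.0
    a = np.array([ratingsA[i] for i in common], dtype = "float")
    b = np.array([ratingsB[i] for i in common], dtype = "float")
    # correlate the mean-centered rating vectors
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

# users who rate items similarly score close to +1
alice = {"movie1": 5, "movie2": 3, "movie3": 4}
bob = {"movie1": 4, "movie2": 2, "movie3": 4}
print(pearson(alice, bob))  # roughly 0.87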

Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items to those they liked in the past. When building a model from a user's profile, a distinction is often made between explicit and implicit forms of data collection.

Examples of explicit data collection include the following:

Asking a user to rate an item on a sliding scale.

Asking a user to search.

Asking a user to rank a collection of items from favourite to least favourite.

Presenting two items to a user and asking him/her to choose the better one of them.

Asking a user to create a list of items that he/she likes.

Examples of implicit data collection include the following:

Observing the items that a user views in an online store.

Analyzing item/user viewing times.

Keeping a record of the items that a user purchases online.

Obtaining a list of items that a user has listened to or watched on his/her computer.

Analyzing the user's social network and discovering similar likes and dislikes.

The recommender system compares the collected data to similar and dissimilar data collected from others and calculates a list of recommended items for the user. Several commercial and non-commercial examples are listed in the article on collaborative filtering systems.

One of the most famous examples of collaborative filtering is item-to-item collaborative filtering (people who buy x also buy y), an algorithm popularized by Amazon.com's recommender system; a sketch of the idea follows.
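Here is a minimal sketch of the "people who buy x also buy y" idea, using invented purchase baskets; this co-occurrence counting is a simplification of Amazon's actual algorithm:

# a minimal sketch of item-to-item recommendation from purchase
# co-occurrence -- the baskets are hypothetical
from collections import defaultdict
from itertools import combinations

baskets = [
    ["bread", "milk", "butter"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "milk", "cereal"],
]

# count how often each pair of items is bought together
cooccur = defaultdict(lambda: defaultdict(int))
for basket in baskets:
    for (x, y) in combinations(sorted(set(basket)), 2):
        cooccur[x][y] += 1
        cooccur[y][x] += 1

# recommend the items most frequently bought with 'bread'
recs = sorted(cooccur["bread"].items(), key = lambda p: -p[1])
print(recs)  # e.g. [('butter', 2), ('milk', 2), ('cereal', 1)]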

Facebook, MySpace, LinkedIn, and other social networks use collaborative filtering to recommend new friends, groups, and other social connections (by examining the network of connections between a user and their friends). Twitter uses many signals and in-memory computations to recommend to its users who to follow.

Collaborative filtering approaches often suffer from three problems: cold start, scalability, and sparsity.

Cold start: These systems often require a large amount of existing data on a user in order to make accurate recommendations.

Scalability: In many of the environments in which these systems make recommendations, there are millions of users and products. Thus, a large amount of computation power is often necessary to calculate recommendations.

Sparsity: The number of items sold on major e-commerce sites is extremely large. The most active users will only have rated a small subset of the overall database. Thus, even the most popular items have very few ratings. A particular type of collaborative filtering algorithm copes with this using matrix factorization, a low-rank matrix approximation technique; a sketch of the idea follows.
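Here is a minimal sketch of the low-rank approximation idea, using a truncated SVD over an invented user-item rating matrix. Note that treating the unrated (zero) cells as observed values is a simplification; production systems fit the latent factors with iterative methods that skip missing entries:

# a minimal sketch of low-rank matrix approximation -- the rating
# matrix is hypothetical and 0 stands for "not rated"
import numpy as np

# rows are users, columns are items
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype = "float")

# factor the matrix and keep only the k strongest latent factors
U, s, Vt = np.linalg.svd(R, full_matrices = False)
k = 2
R_hat = U[:, :k].dot(np.diag(s[:k])).dot(Vt[:k, :])

# R_hat now holds smoothed scores, including for the unrated cells
print(np.round(R_hat, 2))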

Collaborative filtering methods are classified as memory-based and model-based collaborative filtering. A well-known example of the memory-based approach is the user-based algorithm, and of the model-based approach, the Kernel-Mapping Recommender.

Content-based filtering

Another common approach when designing recommender systems is content-based filtering. Content-based filtering methods are based on a description of the item and a profile of the user's preferences. In a content-based recommender system, keywords are used to describe the items; besides this, a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended. This approach has its roots in information retrieval and information filtering research.

To abstract the features of the items in the system, an item presentation algorithm is applied. A widely used algorithm is the tf-idf representation (short for term frequency-inverse document frequency, a numerical statistic that is intended to reflect how important a word is to a document in a collection), also called the vector space representation; a sketch follows.
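Here is a minimal sketch of tf-idf weighting over invented keyword descriptions of items:

# a minimal sketch of tf-idf over item descriptions -- the
# documents are hypothetical
import math

docs = {
    "item1": "action space adventure",
    "item2": "romantic comedy adventure",
    "item3": "space documentary",
}

# document frequency: in how many descriptions each word appears
df = {}
for text in docs.values():
    for word in set(text.split()):
        df[word] = df.get(word, 0) + 1

def tfidf(text):
    # words frequent in this description but rare across the
    # collection get the highest weights
    words = text.split()
    weights = {}
    for word in set(words):
        tf = words.count(word) / float(len(words))
        idf = math.log(float(len(docs)) / df[word])
        weights[word] = tf * idf
    return weights

print(tfidf(docs["item1"]))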

To create a user profile, the system mostly focuses on two types of information:

1) A model of the user's preferences.

2) A history of the user's interaction with the recommender system.

Basically, these methods use an item profile (i.e. a set of discrete attributes and features) characterizing the item within the system. The system creates a content-based profile of users based on a weighted vector of item features. The weights denote the importance of each feature to the user and can be computed from individually rated content vectors using a variety of techniques.

Simple approaches use the average values of the rated item vector (a sketch of this follows the list below), while other, more sophisticated methods use machine learning techniques such as:

Bayesian classifiers (in machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features);

Cluster analysis (the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar, in some sense or another, to each other than to those in other groups);

Decision trees (a decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility); and

Artificial neural networks (in machine learning, artificial neural networks (ANNs) are a family of statistical learning algorithms inspired by biological neural networks, i.e. the central nervous systems of animals, in particular the brain, and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown), in order to estimate the probability that the user is going to like the item.
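The simple averaging approach mentioned before the list can be sketched as follows, with invented three-feature item vectors:

# a minimal sketch of the "average of the rated item vectors"
# user profile -- the feature vectors are hypothetical
import numpy as np

# item profiles: tf-idf style feature weights per item
items = {
    "item1": np.array([0.9, 0.1, 0.0]),
    "item2": np.array([0.8, 0.2, 0.1]),
    "item3": np.array([0.0, 0.1, 0.9]),
    "item4": np.array([0.7, 0.3, 0.0]),
}

# the user profile is the mean of the vectors of liked items
liked = ["item1", "item2"]
profile = np.mean([items[i] for i in liked], axis = 0)

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# rank unseen items by similarity to the profile
candidates = [i for i in items if i not in liked]
ranked = sorted(candidates, key = lambda i: -cosine(profile, items[i]))
print(ranked)  # item4 (similar to the liked items) ranks first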


Direct feedback from a user, usually in the form of a like or dislike button, can be used to assign higher or lower weights to the importance of certain attributes.

A key issue with content-based filtering is whether the system is able to learn user preferences from a user's actions regarding one content source and use them across other content types. When the system is limited to recommending content of the same type as the user is already using, the value of the recommendation system is significantly lower than when other content types from other services can be recommended. For example, recommending news articles based on the browsing of news is useful, but it is much more useful when music, videos, products, discussions, etc. from different services can be recommended based on news browsing.

Hybrid Recommender Systems

Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering, can be more effective in some cases. Hybrid approaches can be implemented in several ways: by making content-based and collaborative-based predictions separately and then combining them; by adding content-based capabilities to a collaborative-based approach (and vice versa); or by unifying the approaches into one model.

Several studies empirically compare the performance of the hybrid with the pure collaborative and content-based methods, and demonstrate that the hybrid methods can provide more accurate recommendations than pure approaches. These methods can also be used to overcome some of the common problems in recommender systems, such as cold start and the sparsity problem.

Netflix is a good example of a hybrid system. It makes recommendations by comparing the watching and searching habits of similar users (i.e. collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).

A variety of techniques have been proposed as the basis for recommender systems: collaborative, content-based, knowledge-based, and demographic techniques. Each of these techniques has known shortcomings, such as the well-known cold-start problem for collaborative and content-based systems (what to do with new users with few ratings) and the knowledge engineering bottleneck in knowledge-based approaches (a knowledge base being a technology used to store the complex structured and unstructured information used by a computer system).

A hybrid recommender system is one that combines multiple techniques together to achieve some synergy between them.

Collaborative: The system generates recommendations using only information about rating profiles for different users. Collaborative systems locate peer users with a rating history similar to the current user and generate recommendations using this neighbourhood.

Content-based: The system generates recommendations from two sources: the features associated with products and the ratings that a user has given them. Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user's likes and dislikes based on product features.

Demographic: A demographic recommender provides recommendations based on a demographic profile of the user. Recommended products can be produced for different demographic niches by combining the ratings of users in those niches.

Knowledge-based: A knowledge-based recommender suggests products based on inferences about a user's needs and preferences. This knowledge will sometimes contain explicit functional knowledge about how certain product features meet user needs.

The term hybrid recommender system is used here to describe any recommender system that combines multiple recommendation techniques together to produce its output. There is no reason why several different techniques of the same type could not be hybridized; for example, two different content-based recommenders could work together, and a number of projects have investigated this type of hybrid: NewsDude, which uses both naive Bayes and kNN classifiers in its news recommendations, is just one example.

Seven hybridization techniques (a sketch of the first one follows the list):

1) Weighted: The scores of different recommendation components are combined numerically.

2) Switching: The system chooses among recommendation components and applies the selected one.

3) Mixed: Recommendations from different recommenders are presented together.

4) Feature Combination: Features derived from different knowledge sources are combined together and given to a single recommendation algorithm.

5) Feature Augmentation: One recommendation technique is used to compute a feature or set of features, which is then part of the input to the next technique.

6) Cascade: Recommenders are given strict priority, with the lower-priority ones breaking ties in the scoring of the higher ones.

7) Meta-level: One recommendation technique is applied and produces some sort of model, which is then the input used by the next technique.
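Here is a minimal sketch of the Weighted technique, with invented component scores:

# a minimal sketch of a weighted hybrid -- the collaborative (cf)
# and content-based (cb) scores below are hypothetical
def weighted_hybrid(cf_score, cb_score, alpha = 0.7):
    # combine the two component scores numerically
    return alpha * cf_score + (1 - alpha) * cb_score

# scores from the two components for three candidate items
cf = {"item1": 0.9, "item2": 0.4, "item3": 0.6}
cb = {"item1": 0.2, "item2": 0.8, "item3": 0.7}

ranked = sorted(cf, key = lambda i: -weighted_hybrid(cf[i], cb[i]))
print(ranked)  # ['item1', 'item3', 'item2']

The weight alpha would in practice be tuned on held-out data, or even varied per user as the system learns which component it should trust more.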

Beyond Accuracy

Typically, research on recommender systems is concerned with finding the most accurate recommendation algorithms. However, there are a number of other factors that are also important.

Diversity - Users tend to be more satisfied with recommendations when there is a higher intra-list diversity, i.e. items from, for example, different artists.

Recommender persistence - In some situations it is more effective to re-show recommendations, or to let users re-rate items, than to show new items. There are several reasons for this. Users may ignore items when they are shown for the first time, for instance because they had no time to inspect the recommendations carefully.

Privacy - Recommender systems usually have to deal with privacy concerns, because users have to reveal sensitive information. Building user profiles using collaborative filtering can be problematic from a privacy point of view. Many European countries have a strong culture of data privacy, and every attempt to introduce any level of user profiling can result in a negative customer response. A number of privacy issues arose around the dataset offered by Netflix for the Netflix Prize competition. Although the data sets were anonymised in order to preserve customer privacy, in 2007 two researchers from the University of Texas were able to identify individual users by matching the data sets with film ratings on the Internet Movie Database. As a result, in December 2009 an anonymous Netflix user sued Netflix in Doe v. Netflix, alleging that Netflix had violated U.S. fair trade laws and the Video Privacy Protection Act by releasing the datasets. This led in part to the cancellation of a second Netflix Prize competition in 2010. Much research has been conducted on ongoing privacy issues in this space. Ramakrishnan et al. have conducted an extensive overview of the trade-offs between personalization and privacy, and found that the combination of weak ties and other data sources can be used to uncover the identities of users in an anonymised dataset.

User demographics - Beel et al. found that user demographics may influence how satisfied users are with recommendations. In their paper they show that elderly users tend to be more interested in recommendations than younger users.

Robustness - When users can participate in the recommender system, the issue of fraud must be addressed.

Serendipity - Serendipity is a measure of "how surprising the recommendations are". For instance, a recommender system that recommends milk to a customer in a grocery store might be perfectly accurate, but it is still not a good recommendation, because milk is an obvious item for the customer to buy.

Trust - A recommender system is of little value to a user if the user does not trust the system. Trust can be built by a recommender system by explaining how it generates recommendations, and why it recommends an item.

Labelling - User satisfaction with recommendations may be influenced by the labelling of the recommendations. For instance, in the cited study, the click-through rate (CTR; a way of measuring the success of an online advertising campaign for a particular website, or the effectiveness of an email campaign, by the number of users who clicked on a specific link) for recommendations labelled as "Sponsored" was lower (CTR = 5.93%) than the CTR for identical recommendations labelled as "Organic" (CTR = 8.86%). Interestingly, recommendations with no label performed best (CTR = 9.87%) in that study.

Mobile Recommender Systems

One growing area of research in the area of recommender systems is mobile recommender systems. With the increasing ubiquity of internet-accessing smartphones, it is now possible to offer personalized, context-sensitive recommendations. This is a particularly difficult area of research, as mobile data is more complex than the data recommender systems often have to deal with (it is heterogeneous and noisy, requires spatial and temporal auto-correlation, and has validation and generality problems). Additionally, mobile recommender systems suffer from a transplantation problem: recommendations may not apply in all regions (for instance, it would be unwise to recommend a recipe in an area where all of the ingredients may not be available).

One example of a mobile recommender system is one that offers potentially profitable driving routes for taxi drivers in a city. This system takes as input data in the form of GPS traces of the routes that taxi drivers took while working, which include location (latitude and longitude), time stamps, and operational status (with or without passengers). It then recommends a list of pickup points along a route that will lead to optimal occupancy times and profits. This type of system is obviously location-dependent, and as it must operate on a handheld or embedded device, the computation and energy requirements must remain low.

Another example of mobile recommendation is the system that Bouneffouf et al. (2012) developed for professional users. This system takes as input the GPS traces of the user and his agenda, in order to suggest suitable information depending on his situation and interests. The system uses machine learning techniques and a reasoning process in order to dynamically adapt the mobile system to the evolution of the user's interests. The author called his algorithm hybrid-ε-greedy; a sketch of the plain ε-greedy strategy it builds on follows.
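For illustration only, here is a minimal sketch of the plain ε-greedy exploration/exploitation strategy that hybrid-ε-greedy extends; the candidate documents and click probabilities are invented:

# a minimal sketch of epsilon-greedy recommendation -- the
# documents and simulated feedback are hypothetical
import random

epsilon = 0.1
counts = {"docA": 0, "docB": 0, "docC": 0}
values = {"docA": 0.0, "docB": 0.0, "docC": 0.0}

def choose():
    # explore a random document with probability epsilon,
    # otherwise exploit the best-scoring one so far
    if random.random() < epsilon:
        return random.choice(list(counts))
    return max(values, key = values.get)

def update(doc, reward):
    # incremental average of the observed rewards
    counts[doc] += 1
    values[doc] += (reward - values[doc]) / counts[doc]

for _ in range(1000):
    doc = choose()
    # simulated click feedback: docB is the genuinely best choice
    reward = 1.0 if (doc == "docB" and random.random() < 0.6) else 0.0
    update(doc, reward)

print(values)  # docB should accumulate the highest estimated value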

Mobile recommendation systems have also been successfully built using the Web of Data as a source for structured information. A good example of such a system is SMARTMUSEUM. The system uses semantic modelling, information retrieval, and machine learning techniques in order to recommend content matching the user's interests, even when the evidence of the user's interests is initially vague and based on heterogeneous information.

Risk-Aware Recommender Systems

The majority of existing approaches to recommender systems focus on recommending the most relevant documents to the users using contextual information, and do not take into account the risk of disturbing the user in specific situations. However, in many applications, such as recommending personalized content, it is also important to incorporate the risk of upsetting the user into the recommendation process, in order not to recommend documents to users in certain circumstances, for instance during a professional meeting, early in the morning, or late at night. The performance of the recommender system therefore depends on the degree to which it has incorporated this risk into the recommendation process.

Risk definition: "The risk in recommender systems is the possibility to disturb or to upset the user, which leads to a bad answer of the user."

In response to these problems, the authors have developed a dynamic risk-sensitive recommendation system called DRARS (Dynamic Risk-Aware Recommender System), which models context-aware recommendation as a bandit problem. This system combines a content-based technique and a contextual bandit algorithm. They have shown that DRARS improves on the Upper Confidence Bound (UCB) policy, the currently available best algorithm, by calculating the most optimal exploration value to maintain a trade-off between exploration and exploitation based on the risk level of the current user's situation. The authors conducted experiments in an industrial context with real data and real users, and have shown that taking into account the risk level of users' situations significantly increased the performance of the recommender system.

The Netflix Prize

One of the key events that energized research in recommender systems was the Netflix Prize. From 2006 to 2009, Netflix sponsored a competition, offering a grand prize of $1,000,000 to the team that could take an offered dataset of over 100 million movie ratings and return recommendations that were 10% more accurate than those offered by the company's existing recommender system. This competition energized the search for new and more accurate algorithms. On 21 September 2009, the grand prize of US$1,000,000 was given to the BellKor's Pragmatic Chaos team, using tiebreaking rules.

The most accurate algorithm in 2007 used an ensemble method of 107 different algorithmic approaches, blended into a single prediction. In the words of the winning team: "Predictive accuracy is substantially improved when blending multiple predictors. Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique. Consequently, our solution is an ensemble of many methods." Many benefits accrued to the web due to the Netflix project.

A second contest was planned, but was ultimately cancelled in response to an ongoing lawsuit and concerns from the Federal Trade Commission. (The Federal Trade Commission (FTC) is an independent agency of the United States government, established in 1914 by the Federal Trade Commission Act. Its principal mission is the promotion of consumer protection and the elimination and prevention of anticompetitive business practices, such as coercive monopoly.)

Multi-criteria Recommender Systems

Multi-Criteria Recommender Systems (MCRS) can be defined as recommender systems that incorporate preference information on multiple criteria. Instead of developing recommendation techniques based on a single criterion value (the overall preference of user u for item i), these systems try to predict a rating for the unexplored items of u by exploiting preference information on the multiple criteria that affect this overall preference value. Several researchers approach MCRS as a Multi-Criteria Decision Making (MCDM) problem, and apply MCDM methods and techniques to implement MCRS systems.

The limitations of SEO

One important characteristic of an expert and professional consultant is that he or she understands that the theories and techniques in their field are subject to quite concrete and specific limitations. Here, I review the important ones that apply to search engine optimization and website marketing, and their practical significance.

SEO is hand-made to order. If you have planned a product launch or a wedding, or been a party to a lawsuit, you know that the best-laid plans are seldom executed without some major changes. Life is simply too complex. SEO is another one of those human activities, because, to be effective, it must be hand-made for a specific site and business.

There are other important limitations which you need to understand and take into consideration.

Searching is an evolving process from the point of view of providers (the search engines), users, and website owners. What worked yesterday may not work today, and may be counter-productive or harmful tomorrow. As a result, monitoring or regular checks of the key search engines and directories are required to maintain a high ranking once it is achieved.

Quality is everything. Since virtually everything that we do to improve a site's ranking will be known to anyone who knows how to get at it, innovations tend to be short-lived. Moreover, search engines are always on the lookout for exploits that manipulate their ranking algorithms. The only thing that cannot be copied or exploited is high-quality, valuable content, especially when others link to it for those reasons. Only content of even higher quality and value trumps it.

The cost of SEO is rising. More expertise is required than before, and this trend will continue. The techniques employed are more sophisticated, complex, and time-consuming. There are fewer worthwhile search engines and directories that offer free listings. Paid placement costs are rising, and the best keywords are expensive.

The search lottery. Search engines collect only a fraction of the billions of pages on the web, for various technological reasons which change over time but which will nonetheless mean, for the foreseeable future, that searching is akin to a lottery. SEO improves the odds but cannot remove the uncertainty altogether.

SEO is a marketing exercise and, accordingly, the same old business rules apply. You can sell almost anything to someone once, but businesses are built and prosper through repeat customers, to whom reputation, brand, or goodwill is important. Content quality and value is the key, and that remains elusive, expensive, and difficult to source; but, for websites, it is the only basis of effective marketing using SEO techniques.

Suffice it to say, if your site is included in a search engine and you achieve a high enough ranking for your requirements, then these limitations are simply costs of doing business in cyberspace.

Bibliography

Fittingly, the topic itself completes the bibliography: the research for this project was carried out through search engines. The entire project was done with the help of the following sites, and with guidance from Mr C.P. Venkatesh in a Digital Marketing Workshop:

www.google.com

www.searchcounsel.com

www.searchenginejournal.com

www.pyimagesearch.com

www.seobook.com