
Robots.txt File

What is robots.txt?

Robots.txt is a simple text file on your website that tells search engine bots how to crawl and index the site or individual pages.

By default, search engine bots crawl everything they can reach unless they are forbidden from doing so, and they always check the robots.txt file before crawling a site.

Declaring rules in robots.txt tells visiting bots that they should not crawl or index certain data, but it doesn't mean they can't. Legitimate bots follow the instructions they are given, but malware robots ignore them entirely, so don't rely on robots.txt as a security measure for your website.

How to build a robots.txt file (Terms, Structure & Placement)?

The main terms used in robots.txt and their meanings are illustrated in the example below. The robots.txt file is placed in the root folder of your website, so that the URL of your robots.txt file resembles www.example.com/robots.txt in the web browser. Remember to use all lower-case letters for the filename.
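
As an illustrative sketch (the domain, folder names and sitemap URL are placeholders, not taken from any real site), a minimal robots.txt served at www.example.com/robots.txt might look like this:

User-agent: *                                 # the rules below apply to every crawler
Disallow: /private/                           # do not crawl anything under /private/
Disallow: /tmp/                               # another blocked folder
Sitemap: http://www.example.com/sitemap.xml   # optional extension pointing crawlers to the sitemap

Each record starts with a User-agent line naming the crawler it applies to (* means all crawlers), followed by the paths that crawler should not fetch.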


You can define different restrictions for different bots by applying bot-specific rules, but be aware that the more complicated you make the file, the harder it becomes to spot its traps. Always specify bot-specific rules before the common rules, so that a bot reading the file finds the rules written for its name and otherwise falls back to the common rules.

You can check many other sites' robots.txt files to get a feel for how these are generally implemented:

http://www.searchenabler.com/robots.txt
http://www.google.com/robots.txt
http://searchengineland.com/robots.txt

Example scenarios for robots.txt

If you have a close look at the Search Enabler robots.txt, you can see that we have blocked the following pages from search indexing. Analyze which pages and links should be blocked on your own website; on a general note, we advise hiding pages such as internal search results, user logins, profiles, logs and CSS styling sheets.

1. Disallow: /?s=

This is the dynamic search results page. There is no point in indexing it, and doing so would only create duplicate content problems.

2. Disallow: /blog/2010/

These are the blog posts grouped into year-wise archives. They are blocked because they lead to duplication errors, with different URLs pointing to the same web page.

3. Disallow: /login/

This is a login page meant only for users of the searchenabler tool, so it is blocked from getting crawled.
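
Put together, and assuming these were the only rules (a simplified sketch, not the site's actual complete file), the corresponding robots.txt would read:

User-agent: *            # applies to all crawlers
Disallow: /?s=           # dynamic search results
Disallow: /blog/2010/    # year-wise archive duplicates
Disallow: /login/        # tool login page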

How does robots.txt affect search results?

By using the robots.txt file you can keep pages such as user profiles and temporary folders out of the index, so your SEO effort is not diluted by junk pages that are useless in search results. In general, your results will be more precise and carry more value.
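
For instance, a site wanting to keep such pages out of the index might use something like the following (the folder names /profiles/ and /temp/ are hypothetical, chosen only to match the scenario above):

User-agent: *
Disallow: /profiles/     # user profile pages add nothing to search results
Disallow: /temp/         # temporary folders are useless to searchers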


Default Robots.txt

The default robots.txt file basically tells every crawler that it is allowed into any directory of the website to its heart's content:

User-agent: *

Disallow:

(which translates as “disallow nothing”)

The often-asked question here is why use it at all. Well, it is not required, but it is recommended for the simple reason that search bots will request it anyway (which means you'll see 404 errors in your log files from bots requesting your non-existent robots.txt page). Besides, having a default robots.txt ensures there won't be any misunderstandings between your site and a crawler.

Robots.txt Blocking Specific Folders / Content:

The most common usage of robots.txt is to ban crawlers from visiting private folders or content that gives them no additional information. This is done primarily to save the crawler's time: bots crawl on a budget, and if you ensure that a bot doesn't waste time on unnecessary content, it will crawl your site deeper and quicker.

Samples of robots.txt files blocking specific content (note: I highlighted only a few of the most basic cases):

User-agent: *

Disallow: /database/

(blocks all crawlers from the /database/ folder)

User-agent: *

Disallow: /*?

(blocks all crawlers from all URLs containing ?)

User-agent: *

Disallow: /navy/

Allow: /navy/about.html

(blocks all crawlers from the /navy/ folder but allows access to one page in that folder)

Note from John Mueller: the "Allow:" statement is not part of the robots.txt standard (it is, however, supported by many search engines, including Google).


Robots.txt Allowing Access to Specific Crawlers

Some people choose to save bandwidth and allow access only to the crawlers they care about (e.g. Google, Yahoo and MSN). In this case, the robots.txt file should list those robots, each followed by its own commands:

User-agent: *

Disallow: /

User-agent: googlebot

Disallow:

User-agent: slurp

Disallow:

User-agent: msnbot

Disallow:

(the first part blocks all crawlers from everything, while the following three blocks list the three crawlers that are allowed to access the whole site)

Need Advanced Robots.txt Usage?

I tend to recommend that people refrain from doing anything too tricky in their robots.txt file unless they are 100% knowledgeable on the topic. A messed-up robots.txt file can wreck a project launch.

Many people spend weeks and months trying to figure out why their site is ignored by crawlers, until they realize (often with some external help) that they have misused their robots.txt file. A better way to control crawler activity may be to rely on on-page solutions (robots meta tags). Aaron did a great job summing up the difference in his guide (bottom of the page).


Best Robots.txt Tools: Generators and Analyzers

While I do not encourage anyone to rely too much on robots.txt tools (you should either do your best to understand the syntax yourself or turn to an experienced consultant to avoid any issues), the robots.txt generators and checkers listed below will hopefully be of additional help.

Robots.txt generators:

Common procedure:

1. choose default / global commands (e.g. allow/disallow all robots);

2. choose files or directories blocked for all robots;

3. choose user-agent specific commands:

   1. choose action;

   2. choose a specific robot to be blocked.

As a general rule of thumb, I don't recommend using robots.txt generators, for a simple reason: don't create any advanced (i.e. non-default) robots.txt file until you are 100% sure you understand what you are blocking with it. Still, here are the two most trustworthy generators to check:

Google Webmaster Tools: the Robots.txt generator allows you to create simple robots.txt files. What I like most about this tool is that it automatically adds all global commands to each specific user-agent's commands, helping to avoid one of the most common mistakes (see the sketch below).
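
The mistake in question: once a crawler finds a group addressed to its own user-agent, it follows only that group and ignores the global rules, so any global restrictions have to be repeated inside the specific group. A minimal sketch (the paths are illustrative placeholders):

User-agent: *
Disallow: /admin/        # global rule: all crawlers should stay out of /admin/

User-agent: googlebot
Disallow: /search/       # Googlebot obeys only this group, so /admin/ must be
Disallow: /admin/        # repeated here or it is no longer blocked for Googlebot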

The SEObook Robots.txt generator unfortunately misses the above feature, but it is really easy (and fun) to use.


Robots.txt checkers:

Google Webmaster Tools: the Robots.txt analyzer "translates" what your robots.txt dictates to the Googlebot.

Robots.txt Syntax Checker finds some common errors in your file by checking for whitespace-separated lists, standards that are not widely supported, wildcard usage, etc.

A Validator for Robots.txt Files also checks for syntax errors and confirms correct directory paths.