Balkhis - Vision For Success

Robots.txt Guide for Wordpress – Avoid Duplicate Content


Today, I got an instant message on MSN from a regular reader. They suggested that I write a decent article on Robots.txt because they had been searching and could not find a good one. So I decided that would make a good topic for the Balkhis SEO section. The first thing you should do is view my Robots.txt. You can copy and paste the entire thing for all I care, but it won’t make sense if you don’t understand what it is doing.

The main purpose of Robots.txt is to control search engine bots. This file single-handedly controls what search engine bots can crawl, and so what they can and can’t index. It plays an important role in avoiding duplicate content.

Hint:

You use Disallow: to Disallow files. (Disallow: /page/)
You use Allow: to Allow files (Allow: /about/)
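
To see how those two directives behave, here is a minimal sketch using Python’s standard urllib.robotparser module. Only Disallow: /page/ and Allow: /about/ come from the hint above; the User-agent line, the example.com host and the /2008/08/some-post/ path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt built from the two directives in the hint.
RULES = """\
User-agent: *
Disallow: /page/
Allow: /about/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(RULES)

# Illustrative paths only; example.com is a placeholder host.
for path in ("/page/2/", "/about/", "/2008/08/some-post/"):
    url = "http://example.com" + path
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(f"{path}: {verdict}")
```

A path that matches no rule at all is allowed by default, which is why the post URL passes.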

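The main thing you need to know in robots.txt is the pattern syntax: * matches any run of characters, and a $ sign at the end means the URL must end right there. That is how you block by file extension, like the /*.css$ rule I have on Balkhis, which matches any URL ending in .css.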

Now let’s analyze some of the important parts of my robots.txt that deal with duplicate content.

Disallow: /category/ – This rule prevents a whole heck of a lot of duplicate content, because your category pages contain the exact same thing as a single post page does. So you don’t want bots to see them.

Disallow: /page/ – I have mentioned multiple times that archives are duplicate content. Pretty obvious. So add this one as well.

Disallow: /tag/ – I don’t know if you are using tags or not. Just add it in case you ever decide to use them. I have tags on my Archive page and my search page, so I have it there, because content categorized by tags is still the same content.

Disallow: */feed/ – Personally I feel that users should pay more attention to my blog rather than my feeds. So I have all feeds blocked.

Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$

I don’t think that spiders should be allowed to see any of your JavaScript, CSS, or include files, let alone index them. So block these off as well.

Disallow: /*? – This rule blocks any URL that has a ? in it, which covers query-string URLs such as search results. So use this one also.
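
The wildcard rules above can be checked with a small matcher. This is only a sketch, assuming the pattern behaviour Google documents for Googlebot: * matches any run of characters, and a trailing $ pins the pattern to the end of the URL. It is hand-rolled because Python’s urllib.robotparser does plain prefix matching. The function names and the sample paths are hypothetical.

```python
import re
from typing import Optional

# The wildcard rules quoted in this post.
RULES = ["*/feed/", "/*.js$", "/*.inc$", "/*.css$", "/*?"]


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' becomes '.*'; a trailing '$' anchors the end of the URL.
    Without '$' the pattern only has to match a prefix.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces, then rejoin them with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    """True if the path (including any query string) matches the pattern."""
    return pattern_to_regex(pattern).match(path) is not None


def first_match(path: str) -> Optional[str]:
    """Return the first rule in RULES that blocks the path, or None."""
    return next((rule for rule in RULES if is_blocked(rule, path)), None)


# Made-up paths, for illustration only.
for path in (
    "/2008/08/some-post/",
    "/2008/08/some-post/feed/",
    "/wp-includes/js/jquery.js",
    "/wp-content/themes/style.css",
    "/wp-content/themes/style.css?ver=2",
    "/?s=robots",
):
    print(f"{path} -> {first_match(path)}")
```

Notice that style.css?ver=2 slips past /*.css$ because of the $ anchor, and is caught by /*? instead.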

Now I hope you know what my robots.txt is doing. Now feel free to use it as a sample one for your site.
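
If you would rather generate the file than copy it, the rules covered in this post can be assembled like this. It is a sketch only: build_robots_txt is a made-up helper, User-agent: * is an assumption, and the list holds just the rules quoted above, so the real file I linked to may contain more.

```python
from typing import Iterable

# Only the Disallow rules discussed in this post, in the order they appear.
DISALLOW = [
    "/category/",
    "/page/",
    "/tag/",
    "*/feed/",
    "/*.js$",
    "/*.inc$",
    "/*.css$",
    "/*?",
]


def build_robots_txt(user_agent: str, disallow: Iterable[str]) -> str:
    """Return robots.txt text with one Disallow line per rule."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {rule}" for rule in disallow]
    return "\n".join(lines) + "\n"


# "*" (all bots) is an assumption; swap in a specific bot name if you prefer.
robots_txt = build_robots_txt("*", DISALLOW)
print(robots_txt, end="")
```

Save the printed text as robots.txt in the root folder of your site.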




26 Comments »

Comment by Colin
2008-08-23 08:00:46

Thanks for the info.

Are you disallowing the Google Image Search bot?

I get quite a lot of traffic from Google Image Search…


Comment by Syed Balkhi
2008-08-23 08:17:52

No, I am not banning them entirely, just from the /wp-includes/ folder.

The images I like them to index are in other folders :) Considering I have received over 2.5k visitors from there, it would be stupid on my end.


 
 
Comment by FlickrFotos
2008-08-23 08:19:20

I never had any idea about using the robots.txt file to control Google’s bots. But, as I read this article, I think I’m gonna use it soon. Thanks for the great info. :)


Comment by Syed Balkhi
2008-08-23 08:37:18

Yup, that is one of the best things you can do for your blog and its ranking, because right now you probably have a lot of duplicate content which is preventing you from ranking high. By doing this you will potentially rank higher.


 
 
Comment by Jim Hutchinson
2008-08-23 11:54:16

The only search engine you are blocking is Google. All the others, such as MSN, Yahoo!, WebCrawler, etc. still have access to index all of your site.

Either change User-agent at the top to “all” or remove it completely so it applies to all search engines.

When you put an article into a category, it is only there once, so duplicate content is not an issue. Preventing access to the other files and the wp-content folder is a good idea though.


Comment by Syed Balkhi
2008-08-23 15:56:46

Hey Jim,

Considering that 90% of my Search Engine Traffic comes from Google… I don’t think I really care about any other SE.

Duplicate content can also hurt your PR (Also another Google tool).

But yes, your advice is correct; people who want to do that can go ahead and do it :)


 
 
Comment by Angad Sodhi
2008-08-23 14:58:26

I always felt that I’ve messed up my site’s SEO. Though I had a robots.txt file prior to this, today’s post will take me one step closer to better SERPs.


 
Comment by Angad Sodhi
2008-08-23 15:00:15

Hey why disallow the sitemap???


Comment by Syed Balkhi
2008-08-23 15:58:10

I don’t want them to see my sitemap because the sitemap also has links to tags and others.


 
 
Comment by Michael Aulia
2008-08-23 19:14:45

You can be paranoid and disallow the date archives too, to avoid more duplicate content:

/2007/
/2008/ and so on :)


 
Comment by tyna
2008-08-23 21:07:50

I am lost here. Can you explain all this in plain English? You know how I get confused with code.


Comment by Syed Balkhi
2008-08-23 23:29:03

Just copy and paste the robots.txt that I linked to and upload it to your web host. It will help your site’s ranking. That is pretty simple :P


 
 
Comment by Otooo
2008-08-23 22:52:55

Hey thanks – I currently don’t have a robots.txt implemented so I will do one now – hope it boosts my SERP’s a bit :)


 
Comment by David
2008-08-24 11:17:59

What about a duplicate content plugin that does pretty much the same thing as the robots.txt? There are several plugins available, but I’m not sure if it’s better to use them or create the robots.txt file.


Comment by Syed Balkhi
2008-08-24 14:07:29

I don’t use them. I just use the raw method … which is Robots.txt

I like to use code rather than having everything widgetized. But that’s me.


 
 
2008-08-25 09:55:37

Good coverage on robots.txt but I’m not too sure if I were to agree on disallowing sitemap. This is something unheard of…

Yan


Comment by Syed Balkhi
2008-08-25 10:04:35

Well, you don’t have to agree with every single bit. Sometimes it is personal preference ;)


 
 
2008-08-27 11:58:07

[...] If you are using WordPress, Syed Balkhi wrote a must-read Robots.txt Guide for Wordpress – Avoid Duplicate Content. [...]


 
Comment by AZ Blogging
2008-08-31 07:48:20

I usually used a plugin for my sitemap and just allowed the default settings because I thought they must be the best. After reading this post, I feel I have a better understanding of this subject matter.


 
Comment by Free Directories
2008-09-03 04:22:02

Most people just don’t realize how much duplicate content can impact their ranks. This is a good post, and people should apply these principles.


Comment by Syed Balkhi
2008-09-03 05:21:03

yeah duplicate content can really hurt sometimes.


 
 
2008-09-06 05:13:55

Thanks I did what you said here and I am looking for great traffic to come


 
Comment by Frank Richard
2008-09-19 13:31:00

Hmmm, great guide, I love it, thanks for sharing!


 
Comment by syafur
2009-03-01 07:26:51

Thanks for the great idea


 
Comment by Harsh Agrawal
2009-04-25 19:32:27

Hey Syed,
This is my robots.txt file

Have a look and let me know if it’s good or if I’m missing something here
http://www.shoutmeloud.com/robots.txt


Comment by Syed Balkhi
2009-04-25 20:16:34

All looks good except why is contact page on disallow?

You should have contact page indexed so people can see it in search engines.


 
 