Robots.txt Guide for WordPress – Avoid Duplicate Content
Today I got an instant message on MSN from a regular reader. He suggested that I write a decent article on Robots.txt because he had been searching and could not find a good one. So I decided it would make a good topic for the Balkhis SEO section. The first thing you should do is view my Robots.txt. You can copy and paste the entire thing for all I care, but it won’t make sense if you don’t understand what it is doing.
The main purpose of Robots.txt is to control search engine bots. This file single-handedly tells well-behaved bots which parts of your site they may crawl and which they may not, so it plays an important role in avoiding duplicate content.
Hint:
Use Disallow: to block a path. (Disallow: /page/)
Use Allow: to allow a path. (Allow: /about/)
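The main thing you need to know about robots.txt patterns is that a $ sign at the end of a rule means the URL must end right there, and a * matches any run of characters. Together they are handy for matching file extensions, like the /*.css$ rule I have on Balkhis.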
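If you want to see how a pattern like /*.css$ behaves, here is a rough Python sketch. It is not how any search engine actually implements matching; it only assumes the wildcard behavior the major crawlers document (* matches any run of characters, a trailing $ anchors the end of the URL, and everything else is a prefix match). The function name rule_matches and the sample paths are made up for illustration.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the URL. Without '$' the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, like a robots.txt rule does.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    samples = [
        ("/*.css$", "/wp-content/themes/demo/style.css"),
        ("/*.css$", "/wp-content/themes/demo/style.css?ver=2"),
        ("/*.css", "/wp-content/themes/demo/style.css?ver=2"),
        ("/*?", "/index.php?p=12"),
        ("/page/", "/page/2/"),
    ]
    for pattern, path in samples:
        print(pattern, path, rule_matches(pattern, path))
```

Notice that with the $ in place, a stylesheet URL that carries a query string no longer matches the rule, because the URL does not end in .css.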
Now let’s analyze some of the important parts of my file that deal with duplicate content.
Disallow: /category/ – This rule prevents a whole heck of a lot of duplicate content, because your category pages contain the exact same thing as your single post pages. So you don’t want bots to see them.
Disallow: /page/ – I have mentioned multiple times that archives are duplicate content. Pretty obvious. So add this one as well.
Disallow: /tag/ – I don’t know if you are using tags or not. Just add it in case you ever decide to use them. I have tags on my Archive page and my search page, so I have it there, because content categorized by tags is still the same content.
Disallow: */feed/ – Personally I feel that users should pay more attention to my blog rather than my feeds. So I have all feeds blocked.
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
I don’t think that spiders should be allowed to see any of your JavaScript, CSS, or include files, let alone index them. So block these off as well.
Disallow: /*? – This rule blocks any URL that has a ? in it, which covers URLs with query strings. So use this one also.
Now I hope you know what my robots.txt is doing. Now feel free to use it as a sample one for your site.

Hey, I am Syed Balkhi, the guy behind Balkhis Inc. I entered the industry back in 2002 not knowing a single thing. I barely spoke English at that time. In the past six years, my language barrier has been eliminated. Aside from English, I now also speak HTML and PHP. Along with the languages, I have also managed to master a few arts. The art of web designing started when I first entered: messing around with Photoshop, I learned how to create my first web design. I have since founded a web design firm, Uzzz Productions. After running numerous websites in various niches, I have mastered the art of web development. Now I am compiling a resource of what I already know and what I am learning on this blog. This resource is there to help me if I ever need a guide to look back on, and it is there to help my fellow webmasters.




Thanks for the info.
Are you disallowing the Google Image Search bot?
I get quite a lot of traffic from Google Image Search…
No, I am not banning them entirely. Just from the /wp-includes/ folder.
The images I would like them to index are in other folders.
Considering I have received over 2.5k visitors from there, it would be stupid on my end.
I never had any idea about using the robots.txt file to control Google’s bots. But, as I read this article, I think I’m gonna use it soon. Thanks for the great info.
Yup, that is one of the best things you can do for your blog and its ranking, because right now you probably have a lot of duplicate content which is preventing you from ranking high. By doing this you will potentially rank higher.
The only search engine you are blocking is Google. All the others, such as MSN, Yahoo!, WebCrawler, etc. still have access to index all of your site.
Either change User-agent at the top to “all” (written as User-agent: *) or remove it completely so it applies to all search engines.
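To see the difference the User-agent line makes, here is a small sketch using Python’s standard urllib.robotparser module. The two robots.txt texts, the example.com URLs, and the helper name can_crawl are made up for illustration, and only plain prefix rules are used because that module does not understand the * and $ wildcards.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse a robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical file whose rules are addressed to Google's bot only.
GOOGLE_ONLY = """\
User-agent: Googlebot
Disallow: /category/
Disallow: /page/
"""

# The same rules addressed to every bot.
ALL_BOTS = """\
User-agent: *
Disallow: /category/
Disallow: /page/
"""

if __name__ == "__main__":
    url = "http://example.com/category/seo/"
    for name, text in (("GOOGLE_ONLY", GOOGLE_ONLY), ("ALL_BOTS", ALL_BOTS)):
        for bot in ("Googlebot", "msnbot"):
            print(name, bot, can_crawl(text, bot, url))
```

With the first file, only Googlebot is kept out of /category/, while a bot such as msnbot is still free to crawl it. With the second file, the same rules apply to both.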
When you put an article into a category, it is only there once, so duplicate content is not an issue. Preventing access to the other files and the wp-content folder is a good idea though.
Hey Jim,
Considering that 90% of my Search Engine Traffic comes from Google… I don’t think I really care about any other SE.
Duplicate content can also hurt your PR (another Google tool).
But yes, your advice is correct. People who want to do that can go ahead and do it.
I always felt that I’ve messed up my site’s SEO. Though I had a robots.txt file prior to this, today’s post will take me one step closer to better SERPs.
Hey why disallow the sitemap???
I don’t want them to see my sitemap because the sitemap also has links to tags and other such pages.
You can be paranoid and disallow the date archives too, to avoid even more duplicate content:
/2007/
/2008/ and so on
I am lost here. Can you explain all this in plain English? You know how I get confused with code.
Just copy and paste the robots.txt that I linked to. And upload it on your webhost. It will help your site’s ranking. That is pretty simple
Hey thanks – I currently don’t have a robots.txt implemented so I will do one now – hope it boosts my SERPs a bit
What about a duplicate content plugin that does pretty much the same thing as the robots.txt? There are several plugins available, but I’m not sure if it’s better to use them or create the robots.txt file.
I don’t use them. I just use the raw method … which is Robots.txt
I like to use code rather than having everything widgetized. But that’s me.
Good coverage on robots.txt but I’m not too sure if I were to agree on disallowing sitemap. This is something unheard of…
Yan
Well, you don’t have to agree with every single bit. Sometimes it is personal preference
[...] If you are using WordPress, Syed Balkhi wrote a must-read Robots.txt Guide for Wordpress – Avoid Duplicate Content. [...]
I usually used a plugin for my sitemap and just allowed the default settings because I thought they must be the best. After reading this post, I feel I have a better understanding of this subject matter.
Most people just don’t realize how much duplicate content can impact their ranks. This is a good post, and people should apply these principles.
yeah duplicate content can really hurt sometimes.
Thanks I did what you said here and I am looking for great traffic to come
Hmmm, great guide, I love it, thanks for sharing!
Thanks for the great idea
Hey Syed,
This is my robots.txt file
Have a look and let me know if it’s good or if I’m missing something here
http://www.shoutmeloud.com/robots.txt
All looks good except why is contact page on disallow?
You should have contact page indexed so people can see it in search engines.