WordPress robots.txt tips for avoiding duplicate content

I've been getting some questions about my robots.txt file and what certain lines in it do.

Thankfully, the major crawlers support some regex-like wildcards in robots.txt (but not full regular expressions).

$ marks the end of the URL. So if you put .php$ in your robots.txt, the rule will match any URL that ends in .php.

This is really handy when you want to block all .exe, .php, or other file types. Note that matching is case-sensitive, so .pdf and .PDF are different patterns. For example:

Disallow: /*.pdf$
Disallow: /*.jpeg$
Disallow: /*.exe$
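If you want to sanity-check what a wildcard rule actually matches, here is a small sketch in Python that translates a robots.txt path pattern (with * and a trailing $) into a regex, the way Google-style matching roughly works. The function names are my own, and this is an illustration rather than a full robots.txt parser:

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    * matches any run of characters, a trailing $ anchors the match
    to the end of the URL path, and everything else is matched
    literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped * back into ".*"
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_blocked(path, disallow_pattern):
    """True if the Disallow pattern matches this URL path."""
    return robots_pattern_to_regex(disallow_pattern).match(path) is not None


# /*.pdf$ blocks any path ending in .pdf, but not .pdf followed by more characters
print(is_blocked("/files/report.pdf", "/*.pdf$"))       # True
print(is_blocked("/report.pdf?download=1", "/*.pdf$"))  # False
```

This makes the role of the $ anchor concrete: without it, /*.pdf would also match /report.pdf?download=1, because the rule would only need to match a prefix of the URL.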

Specifically, here are some of the things I use in my own robots.txt:

Disallow: /*? – this blocks all URLs with a ? in them. A good way to avoid duplicate content issues with WordPress blogs. Obviously you only want to use this if you have switched your permalink structure away from the default ?p= query-string URLs.
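For example, with pretty permalinks enabled, the /*? rule blocks query-string URLs like these (hypothetical examples) while leaving the clean permalinks alone:

```
Blocked:     /?p=123
Blocked:     /2009/05/some-post/?replytocom=42
Not blocked: /2009/05/some-post/
```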

Disallow: /*.php$ – this blocks all URLs ending in .php. Another good way to avoid duplicate content with a WordPress blog.

Disallow: /*.inc$ – you should not be exposing .inc or other include files to bots (Google Code Search will eat you alive).

Disallow: /*.css$ – there is no reason to offer CSS files up for indexing. The wildcard is used here so the rule catches CSS files anywhere on the site, whatever their names.

Disallow: */feed/ – feeds being indexed dilutes your site equity. The leading wildcard * is used in case there are preceding characters in the path.

Disallow: */trackback/ – no reason a trackback URL should be indexed. Again, the leading wildcard * covers any preceding characters.

Disallow: /page/ – WordPress paginated archives are assloads of duplicate content.

Disallow: /tag/ – more duplicate content.

Disallow: /category/ – even more duplicate content.
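Putting the rules above together, the relevant chunk of a WordPress robots.txt would look something like this (a sketch — tailor it to your own permalink structure before using it):

```
User-agent: *
Disallow: /*?
Disallow: /*.php$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /page/
Disallow: /tag/
Disallow: /category/
```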

So what if you want to ALLOW a page? For instance, my SERPs tool lives at serps.php, and under the rules above it would be blocked.

Allow: /serps.php – this does the trick! For Google, the most specific (longest) matching rule wins, so this Allow overrides the broader /*.php$ Disallow.
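To see why the Allow wins, here is a hedged sketch of Google's documented precedence rule in Python: the longest matching pattern governs, and on a tie Allow beats Disallow. The function name and rule representation are my own invention for illustration:

```python
import re


def choose_rule(path, rules):
    """Pick the governing (kind, pattern) rule for a URL path.

    A sketch of Google-style precedence: the most specific (longest)
    matching pattern wins, and on a length tie Allow beats Disallow.
    Patterns support * wildcards and a trailing $ anchor; this is an
    illustration, not a complete robots.txt implementation.
    """

    def matches(pattern, path):
        anchored = pattern.endswith("$")
        core = pattern[:-1] if anchored else pattern
        regex = "^" + re.escape(core).replace(r"\*", ".*")
        if anchored:
            regex += "$"
        return re.match(regex, path) is not None

    best = ("allow", "")  # default: everything is allowed
    for kind, pattern in rules:
        if matches(pattern, path):
            # Longer pattern = more specific; Allow wins ties.
            candidate = (len(pattern), kind == "allow")
            if candidate > (len(best[1]), best[0] == "allow"):
                best = (kind, pattern)
    return best


rules = [("disallow", "/*.php$"), ("allow", "/serps.php")]
print(choose_rule("/serps.php", rules))     # ('allow', '/serps.php')
print(choose_rule("/wp-login.php", rules))  # ('disallow', '/*.php$')
```

The Allow: /serps.php pattern is longer than Disallow: /*.php$, so it wins for that one URL while every other .php URL stays blocked. Older crawlers that predate the Allow directive may ignore it entirely, so don't rely on it for anything critical.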

Keep in mind I am not an SEO, but I have picked up a few tricks along the way.