A robots.txt file is useful for telling web crawlers which areas of your website you do not want crawled and indexed. The example below denies crawlers access to several directories under the root of the public_html folder. On a large site this keeps crawl cycles from being wasted on unneeded folders, so only the important pages are indexed.
```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /cache/
Disallow: /class/
Disallow: /images/
Disallow: /include/
Disallow: /install/
Disallow: /kernel/
Disallow: /language/
Disallow: /templates_c/
Disallow: /themes/
Disallow: /uploads/
```
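You can sanity-check rules like these before deploying them with Python's standard-library `urllib.robotparser`. A minimal sketch (the `example.com` domain and the bot name are placeholders, and the rule list is abbreviated):

```python
from urllib.robotparser import RobotFileParser

# A few of the rules from the robots.txt above.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /cache/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A disallowed directory: crawlers should stay out.
print(rp.can_fetch("AnyBot", "https://example.com/cgi-bin/script.pl"))  # False

# Everything not listed remains crawlable.
print(rp.can_fetch("AnyBot", "https://example.com/about.html"))  # True
```

This is handy because a single typo in robots.txt (a missing colon, a wrong path) can silently block your whole site from being crawled.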
The example below works well for a WordPress website.
```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /readme.html
Disallow: /refer/
Allow: /wp-admin/admin-ajax.php
# Replace example.com with your own domain; Sitemap must be a full URL.
Sitemap: https://example.com/sitemap.xml
```
This is a great way to optimise how Google crawls the website and to prevent it from wasting crawl budget on unnecessary files.
Here is yet another variant that mixes the two directives: it explicitly allows certain paths (such as CSS and JavaScript files) while disallowing others.
```
# Default robots file version:2
User-agent: *
Disallow: /calendar/action*
Disallow: /events/action*
Allow: /*.css
Allow: /*.js
Disallow: /*?
Crawl-delay: 3
```
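Two caveats about this example. First, wildcard patterns such as `/*.css` are understood by Googlebot and Bingbot but not by every parser; Python's standard `urllib.robotparser`, for instance, treats `*` literally, so don't use it to validate wildcard rules. Second, `Crawl-delay` is a non-standard directive that Google ignores (Bing and Yandex honour it). The stdlib parser does expose it, which gives a quick way to confirm the value parses (a sketch with an abbreviated rule set and a placeholder bot name):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the robots.txt above.
rules = """\
User-agent: *
Disallow: /calendar/action*
Crawl-delay: 3
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Crawl-delay is read per user-agent group (Python 3.6+).
print(rp.crawl_delay("AnyBot"))  # 3
```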
And finally, this is how to block certain bots from crawling your website.
```
# Disallow Money for Google News
User-agent: Googlebot-News
Disallow: /tmoney/*

# Allow AdSense
User-agent: Mediapartners-Google
Disallow:

User-agent: CrystalSemanticsBot
Disallow: /

User-agent: GPTBot
Disallow: /
```
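Note that an empty `Disallow:` allows everything for that user agent, while `Disallow: /` blocks the whole site. A quick sketch checking that the per-bot groups behave as intended, again using the stdlib parser (URL and rule subset are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Two of the per-bot groups from the robots.txt above.
rules = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

url = "https://example.com/article.html"

# Empty Disallow means AdSense may fetch anything.
print(rp.can_fetch("Mediapartners-Google", url))  # True

# Disallow: / blocks GPTBot from the entire site.
print(rp.can_fetch("GPTBot", url))  # False
```

Keep in mind that robots.txt is advisory: well-behaved bots obey it, but nothing forces a crawler to.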
Or, to enforce the block at the server level rather than merely asking bots to stay away, use this in your Apache .htaccess file.
```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(Baiduspider|HTTrack|Yandex).*$ [NC]
RewriteRule .* - [F,L]
```
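The `[NC]` flag makes the match case-insensitive, and `[F]` returns a 403 Forbidden. To see which User-Agent strings the pattern would catch, here is the same regex exercised in Python (the sample User-Agent strings are illustrative, not exhaustive):

```python
import re

# The RewriteCond pattern above; [NC] corresponds to re.IGNORECASE.
pattern = re.compile(r"^.*(Baiduspider|HTTrack|Yandex).*$", re.IGNORECASE)

# Illustrative User-Agent strings.
agents = [
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/124.0",
]

for ua in agents:
    blocked = bool(pattern.search(ua))
    print(f"{'BLOCK' if blocked else 'allow'}: {ua}")
```

The first two are blocked; the ordinary browser string falls through. Be aware that User-Agent strings are trivially spoofed, so this stops only bots that identify themselves honestly.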