Typical Robots.txt Mistakes
1. The File Name Contains Upper Case
The only valid file name is robots.txt, not Robots.txt or ROBOTS.TXT.
2. Using Robot.txt Instead of Robots.txt
Once again, the file must be called robots.txt.
3. Incorrectly Formatted Instructions
Wrong:
    Disallow: Googlebot
The only correct option is:
    User-agent: Googlebot
    Disallow: /
4. Mentioning Several Directories in a Single ‘Disallow’ Instruction
Do not place all the directories you want to hide in one ‘Disallow’ line, like this:
    Disallow: /CSS/ /CGI-bin/ /images/
The only correct option is one directory per line:
    Disallow: /CSS/
    Disallow: /CGI-bin/
    Disallow: /images/
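To see why the one-line form fails, both variants can be fed to a parser. The following is a minimal sketch using Python's standard `urllib.robotparser`; the bot name `ExampleBot` and the path `/images/logo.png` are assumptions made for illustration, and real crawlers may treat a malformed line differently.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `path` is disallowed for `agent` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


merged = """\
User-agent: *
Disallow: /CSS/ /CGI-bin/ /images/
"""

separate = """\
User-agent: *
Disallow: /CSS/
Disallow: /CGI-bin/
Disallow: /images/
"""

# The merged line is read as a single odd path, so /images/ is not matched.
print(is_blocked(merged, "/images/logo.png"))
# One path per line matches the directory as intended.
print(is_blocked(separate, "/images/logo.png"))
```

With this parser, the merged rule matches none of the three directories, while the one-per-line form blocks each of them.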
5. Empty Line in ‘User-Agent’
Wrong option:
    User-agent:
    Disallow:
The only correct option is:
    User-agent: *
    Disallow:
6. Using Upper Case in the File
Writing the directives entirely in upper case is treated as bad style:
    USER-AGENT: GOOGLEBOT
    DISALLOW:
Write the directives in their usual form instead:
    User-agent: Googlebot
    Disallow:
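The formatting mistakes above are mechanical enough to check automatically. Here is a minimal sketch of such a check in Python; the function name `lint_robots`, the list of known directives, and the wording of the messages are all assumptions made for this example, not part of any standard tool.

```python
# Conventional spelling of some common directives (an illustrative list).
CANONICAL = {
    "user-agent": "User-agent",
    "disallow": "Disallow",
    "allow": "Allow",
    "host": "Host",
    "sitemap": "Sitemap",
}


def lint_robots(text: str) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for common formatting mistakes."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' between directive and value"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        canonical = CANONICAL.get(field.lower())
        if canonical is None:
            problems.append((number, f"unknown directive {field!r}"))
            continue
        if field != canonical:
            problems.append((number, f"write {canonical!r}, not {field!r}"))
        if canonical == "User-agent" and not value:
            problems.append((number, "empty User-agent; use '*' for all bots"))
        if canonical == "Disallow" and value:
            if len(value.split()) > 1:
                problems.append((number, "several paths in one Disallow line"))
            elif not value.startswith(("/", "*")):
                problems.append((number, "Disallow value is not a path"))
    return problems


sample = "USER-AGENT: GOOGLEBOT\nDisallow: /CSS/ /images/\nDisallow: Googlebot\n"
for number, message in lint_robots(sample):
    print(f"line {number}: {message}")
```

An empty `Disallow:` is deliberately not reported, since it is a valid way to allow everything.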
7. Mirror Websites & URL in the Host Directive
To state which website is the main one and which is the mirror (replica), specialists use a 301 redirect for Google and the ‘Host’ directive for Yandex. Although http://www.site.com, http://site.com, https://www.site.com, and https://site.com look identical to humans, search engines treat them as four different websites. Be careful when writing ‘Host’ directives so that search engines understand you correctly:
Wrong:
    User-agent: Googlebot
    Disallow: /CGI-bin
    Host: http://www.site.com/
Correct:
    User-agent: Googlebot
    Disallow: /CGI-bin
    Host: www.site.com
If your site uses HTTPS, the correct option is:
    User-agent: Googlebot
    Disallow: /CGI-bin
    Host: https://www.site.com
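The rules for the ‘Host’ value (no scheme for plain HTTP, the scheme kept for HTTPS, no trailing slash) can be captured in a small helper. This is only a sketch under those rules as stated above; the function name `host_directive` is hypothetical.

```python
from urllib.parse import urlsplit


def host_directive(site_url: str) -> str:
    """Build a Host line from a full site URL, following the rules above."""
    parts = urlsplit(site_url.strip())
    if not parts.netloc:
        raise ValueError(f"expected a full URL with a scheme, got {site_url!r}")
    host = parts.netloc.lower()
    # Plain HTTP: host name only. HTTPS: keep the scheme. No trailing slash.
    if parts.scheme == "https":
        return f"Host: https://{host}"
    return f"Host: {host}"


print(host_directive("http://www.site.com/"))
print(host_directive("https://www.site.com/"))
```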
8. Listing All the Files Within the Directory
Wrong:
    User-agent: *
    Disallow: /AL/Alabama.html
    Disallow: /AL/AR.html
    Disallow: /Az/AZ.html
    Disallow: /Az/bali.html
    Disallow: /Az/bed-breakfast.html
Correct: just hide the entire directory:
    User-agent: *
    Disallow: /AL/
    Disallow: /Az/
9. Absence of Disallow Instructions
The ‘Disallow’ instructions are required so that search engine bots understand your intent.
10. Redirect 404
Even if you are not going to create and fill out a robots.txt file for your website, search engines may still try to reach the file. Consider creating at least an empty robots.txt to avoid disappointing search engines with 404 Not Found pages.
11. Using Additional Directives in the * Section
If you have additional directives, such as ‘Host’, you should create separate sections.
Wrong:
    User-agent: *
    Disallow: /css/
    Host: www.example.com
Correct:
    User-agent: *
    Disallow: /css/

    User-agent: Googlebot
    Disallow: /css/
    Host: www.example.com
12. Incorrect HTTP Header
Some bots can refuse to index the file if there is a mistake in the HTTP header.
Wrong:
    Content-Type: text/html
Correct:
    Content-Type: text/plain
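Both the 404 case and the Content-Type header can be checked with a short script. The sketch below uses only Python's standard library; the function names and the warning texts are assumptions made for illustration, and `fetch_and_check` needs network access, so only the offline check is demonstrated.

```python
import urllib.error
import urllib.request


def check_robots_response(status: int, content_type: str) -> list[str]:
    """Return warnings for the status code and Content-Type of robots.txt."""
    if status == 404:
        return ["robots.txt returns 404; consider serving an empty file"]
    if status != 200:
        return [f"unexpected status {status}"]
    # Ignore parameters such as "; charset=utf-8" when comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        return [f"Content-Type is {media_type or 'missing'}, expected text/plain"]
    return []


def fetch_and_check(site: str) -> list[str]:
    """Request <site>/robots.txt and check the response (needs network access)."""
    url = site.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            content_type = response.headers.get("Content-Type", "")
            return check_robots_response(response.status, content_type)
    except urllib.error.HTTPError as error:
        return check_robots_response(error.code, "")


print(check_robots_response(200, "text/html; charset=utf-8"))
print(check_robots_response(200, "text/plain"))
```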
Checking Pages Blocked with Robots.txt
Let’s use Screaming Frog to check the web pages blocked with our robots.txt file.
1. Go to the right panel and choose ‘Overview’ (1), ‘Response Codes’ (2), ‘Blocked by Robots.txt’ (3).
2. Check to ensure that no pages with essential content are accidentally hidden from search engines.
3. Choose ‘User Agent’ to test robots.txt for various search engines.
4. Specify which search engine bots the tool should imitate.
5. You may test various robots.txt sections by repeating the entire process and pressing ‘Start’.
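A similar check can be approximated in a script when a crawler is not at hand. This is a minimal sketch using Python's standard `urllib.robotparser`; the rules, the page list, and the helper name `blocked_paths` are made up for the example. This parser does simple prefix matching, so its verdicts may differ from those of a real search engine bot.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, paths: list[str], agent: str) -> list[str]:
    """Return the paths that `agent` may not fetch under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


rules = """\
User-agent: *
Disallow: /css/

User-agent: Googlebot
Disallow: /css/
Disallow: /drafts/
"""

pages = ["/index.html", "/css/site.css", "/drafts/post.html"]

# Repeat the check for each bot, as with the 'User Agent' setting above.
for agent in ("Googlebot", "Bingbot"):
    print(agent, blocked_paths(rules, pages, agent))
```

Here `Googlebot` falls under its own section, while `Bingbot` falls back to the `*` section.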