
The robots.txt file, explained

The robots.txt file is a simple text file that websites use to tell web crawlers and search engine robots which pages or files on the site may and may not be crawled. It plays an important role in search engine optimization (SEO) and website management.

Important features of robots.txt

  1. Storage location: The robots.txt file must be stored in the root directory of the website, i.e. at www.yourwebsite.com/robots.txt; crawlers only request it from this location. It is common to store the sitemap.xml in the same directory.
  2. Syntax: The file has a simple syntax and consists of a set of rules that web crawlers should follow.
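Many sites also announce the location of their sitemap directly in robots.txt using the Sitemap directive; a minimal fragment (the URL is a placeholder):

```
Sitemap: https://www.yourwebsite.com/sitemap.xml
```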

Basic structure of a robots.txt

The robots.txt file consists of one or more groups of rules. Each group begins with a User-agent line and is followed by one or more Disallow or Allow lines.

Example of a simple robots.txt

User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /public/
  • User-agent: Specifies which web crawler the following rules apply to. The asterisk * means that the rules apply to all crawlers.
  • Disallow: Specifies which directories or files on the server may not be crawled.
  • Allow: Specifies which directories or files may be crawled despite a Disallow rule (important for individual files in an otherwise blocked directory).
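The effect of these rules can be checked with Python's standard-library robots.txt parser. A minimal sketch, assuming the example file above; "MyBot" and the example.com URLs are hypothetical:

```python
import urllib.robotparser

# The example robots.txt from above, parsed line by line.
rules = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /public/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# "MyBot" is a hypothetical crawler name; it matches the wildcard (*) group.
print(rp.can_fetch("MyBot", "https://www.example.com/private/data.html"))  # False
print(rp.can_fetch("MyBot", "https://www.example.com/public/index.html"))  # True
# Paths not covered by any rule are allowed by default.
print(rp.can_fetch("MyBot", "https://www.example.com/about.html"))         # True
```

Note that paths a robots.txt does not mention are crawlable by default; Disallow rules opt specific areas out.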

Examples and applications

  1. Block all crawlers:
    User-agent: *
    Disallow: /
    

    This prevents all web crawlers from indexing any part of the website.

  2. Block only specific areas:
    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    
  3. Block specific crawlers:
    User-agent: Googlebot
    Disallow: /no-google/

    User-agent: Bingbot
    Disallow: /no-bing/
    
  4. Allow specific file:
    User-agent: *
    Disallow: /files/
    Allow: /files/special-file.txt
    

Important considerations

  • robots.txt is just a guideline: Search engines are not required to follow the instructions in robots.txt, and some crawlers ignore them completely.
  • Security considerations: The robots.txt file should not be used to hide sensitive information, as it is publicly available and can be viewed by anyone; listing a path may even draw attention to it.
  • Search engine indexing: While robots.txt provides instructions on whether pages can be crawled, it does not directly affect the indexing of pages that have already been crawled. This requires HTML tags such as <meta name="robots" content="noindex">.
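For completeness, the meta tag mentioned above goes in the <head> of the page that should stay out of the index; a minimal fragment ("follow" is optional and merely keeps link discovery enabled):

```
<head>
  <!-- Allow crawlers to follow links, but do not index this page -->
  <meta name="robots" content="noindex, follow">
</head>
```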

Conclusion

The robots.txt file is a useful tool for controlling how web crawlers interact with your website. It helps reduce server load by preventing irrelevant or private areas of the website from being crawled. Despite its simplicity, it is an important part of website optimization and management.
