⚡️ What robots.txt does and does not do
The famous robots.txt file, explained
The robots.txt file is a simple text file that websites use to tell web crawlers and search engine robots which pages or files on the site may or may not be crawled. This file plays an important role in search engine optimization (SEO) and website management.
Important features of robots.txt
- Storage location: The robots.txt file must be stored in the root directory of the website, i.e. at www.yourwebsite.com/robots.txt. It is recommended to store the sitemap.xml in the same directory.
- Syntax: The file has a simple syntax and consists of a set of rules that web crawlers should follow.
Basic structure of a robots.txt
The robots.txt file consists of one or more groups of rules. Each group begins with a User-agent line, followed by one or more Disallow or Allow lines.
Example of a simple robots.txt
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /public/
- User-agent: Specifies which web crawler the following rules apply to. The asterisk * means the rules apply to all crawlers.
- Disallow: Specifies which directories or files on the server may not be crawled.
- Allow: Specifies which directories or files may be crawled despite a Disallow rule (useful for permitting individual files in an otherwise blocked directory).
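A quick way to sanity-check such rules is Python's standard-library urllib.robotparser, sketched here against the example file above (the crawler name "MyBot" and domain example.com are just placeholders):

```python
from urllib import robotparser

# The example robots.txt from above, as a string instead of a fetched file.
rules = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /public/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A crawler named "MyBot" falls under the wildcard group "*".
print(rp.can_fetch("MyBot", "https://example.com/private/secret.html"))  # blocked
print(rp.can_fetch("MyBot", "https://example.com/public/index.html"))    # allowed
print(rp.can_fetch("MyBot", "https://example.com/other/"))               # allowed: no rule matches
```

Note that paths with no matching rule are allowed by default; Disallow only blocks what it explicitly names.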
Examples and applications
- Block all crawlers:
User-agent: *
Disallow: /
This prevents all web crawlers from crawling any part of the website.
- Block only specific areas:
User-agent: *
Disallow: /admin/
Disallow: /private/
- Block specific crawlers:
User-agent: Googlebot
Disallow: /no-google/

User-agent: Bingbot
Disallow: /no-bing/
- Allow a specific file:
User-agent: *
Disallow: /files/
Allow: /files/special-file.txt
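The per-crawler example above can also be verified programmatically. This sketch again uses Python's standard-library urllib.robotparser (crawler names match the User-agent lines; example.com is a placeholder); each crawler sees only the rules in its own group:

```python
from urllib import robotparser

# The "block specific crawlers" example from above.
rules = """\
User-agent: Googlebot
Disallow: /no-google/

User-agent: Bingbot
Disallow: /no-bing/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is blocked from /no-google/ but may fetch /no-bing/,
# because /no-bing/ only appears in Bingbot's group.
print(rp.can_fetch("Googlebot", "https://example.com/no-google/page.html"))  # blocked
print(rp.can_fetch("Googlebot", "https://example.com/no-bing/page.html"))    # allowed
print(rp.can_fetch("Bingbot", "https://example.com/no-bing/page.html"))      # blocked
```

Be aware that parsers differ in detail: Python's robotparser applies rules in file order, while Google documents longest-path-match precedence, so combined Disallow/Allow rules can evaluate differently across crawlers.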
Important considerations
- robots.txt is just a guideline: Search engines are not required to follow the instructions in the robots.txt, and some crawlers ignore them completely.
- Security considerations: The robots.txt file should not be used to protect sensitive information, as it is publicly accessible and can be viewed by anyone.
- Search engine indexing: While robots.txt controls whether pages may be crawled, it does not directly affect the indexing of pages that have already been crawled. That requires HTML tags such as <meta name="robots" content="noindex">.
Conclusion
The robots.txt file is a useful tool for controlling how web crawlers interact with your website. By keeping crawlers out of irrelevant or private areas, it also helps reduce the load on the server. Despite its simplicity, it is an important part of website optimization and management.