Introduction
A robots.txt file is a text file that contains rules instructing web crawlers and search engines to either access or ignore specific sections of your website. Commonly referred to as web robots, crawlers follow the directives in the robots.txt file before scanning any part of your website. The robots.txt file must be placed in the website’s document root directory so that any web crawler can access it.
This article explains how you can use robots.txt to control web crawlers on your website.
Prerequisites
- Deploy a Cloud Server on Rcs.
- Point an active domain name to the server.
- Log in through SSH as a non-root user with sudo privileges.
- Host a website, such as WordPress, on the server.
The Robots.txt File Structure
A valid robots.txt file contains one or more directives, each declared in the format field: value (a field name, followed by a colon and a value).
- User-agent: Declares the web crawler a rule applies to.
- Allow: Specifies the path a web crawler should access.
- Disallow: Declares the path a web crawler should not access.
- Sitemap: Full URL to the website structure sitemap.
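For example, a minimal robots.txt using all four fields might look like the following sketch (example.com, the /private/ path, and the sitemap URL are placeholders):
User-agent: *
Allow: /
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml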
Values must include relative paths for the allow/disallow fields, absolute paths (valid URL) for sitemap, and web crawler names for the user-agent field. Common user-agent names and respective search engines you can safely declare in a robots.txt file include:
- Alexa: ia_archiver
- AOL: aolbuild
- Bing: Bingbot, BingPreview
- DuckDuckGo: DuckDuckBot
- Google: Googlebot, Googlebot-Image, Googlebot-Video
- Yahoo: Slurp
- Yandex: Yandex
Crawlers that are not explicitly declared follow the rules set under the wildcard * user agent.
Common Robots.txt Directives
Rules in the robots.txt file must use valid syntax; web crawlers ignore rules with invalid syntax. A valid rule must include a path or a fully qualified URL. The examples below explain how to allow, disallow, and control web crawlers in the robots.txt file.
1. Grant Web Crawlers Access to Website Files
Allow a single web crawler to access all website files.
User-agent: Bingbot
Allow: /
Allow all web crawlers to access website files.
User-agent: *
Allow: /
Grant a web crawler access to a single file.
User-agent: Bingbot
Allow: /documents/helloworld.php
Grant all web crawlers access to a single file.
User-agent: *
Allow: /documents/helloworld.php
2. Deny Web Crawlers Access to Website Files
Deny a web crawler access to all website files.
User-agent: Googlebot
Disallow: /
Deny all web crawlers access to website files.
User-agent: *
Disallow: /
Deny a web crawler access to a single image.
User-agent: MSNBot-Media
Disallow: /documents/helloworld.jpg
Deny a web crawler access to all images of a specific type.
User-agent: MSNBot-Media
Disallow: /*.jpg$
You can also deny a specific image crawler access to all website images. For example, the following rule instructs Googlebot-Image to ignore all website images and remove previously indexed images from Google’s image index.
User-agent: Googlebot-Image
Disallow: /
Deny all web crawlers access to a single file.
User-agent: *
Disallow: /~documents/helloworld.php
To deny access to multiple files, repeat the Disallow rule:
User-agent: *
Disallow: /~documents/hello.php
Disallow: /~documents/world.php
Disallow: /~documents/again.php
Instruct all web crawlers to access website files, but ignore a specific file.
User-agent: *
Allow: /
Disallow: /documents/index.html
Instruct all web crawlers to ignore a specific directory, for example, wp-admin.
User-agent: *
Disallow: /wp-admin/
3. Grouping Robots.txt Directives
To apply robots.txt directives in groups, declare multiple user agents, then apply a single set of rules to all of them.
For example:
User-agent: Googlebot # First Group
User-agent: Googlebot-News
Allow: /
Disallow: /wp-admin/
User-agent: Bingbot # Second Group
User-agent: Slurp
Allow: /
Disallow: /wp-includes/
Disallow: /wp-content/uploads/ # Ignore WordPress Images
The above directives apply the same rules to every user agent declared within each group.
4. Control Web Crawler Intervals
Web crawler requests can increase your server load, so you may need to regulate the rate, in seconds, at which crawlers scan your website.
For example, the following directive instructs all web crawlers that support the Crawl-delay directive to wait at least 60 seconds between successive requests to your server.
User-agent: *
Crawl-delay: 60
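Crawlers that honor Crawl-delay also let you set different intervals per user agent. For example, the following sketch asks Bingbot to wait 10 seconds and all other crawlers to wait 60 seconds between requests:
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Crawl-delay: 60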
Example
The following robots.txt sample instructs all web crawlers to access website files, ignore critical directories, and use the sitemap to understand the website’s structure.
User-agent: *
Allow: /
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Sitemap: https://www.example.com/sitemap_index.xml
To test and view your robots.txt file, open your domain in a web browser and append /robots.txt to the URL. For example: http://example.com/robots.txt.
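You can also fetch the file from your SSH session. For example, using curl (replace example.com with your domain):
curl http://example.com/robots.txt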
If your website returns a 404 error, create a new robots.txt file and upload it to your document root directory, usually /var/www/html or /var/www/public_html.
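For example, assuming your document root is /var/www/html, you could create the file from your SSH session with a text editor such as nano, then add your directives and save the file:
cd /var/www/html
sudo nano robots.txt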
Most web crawlers follow your robots.txt directives. However, bad bots and malware crawlers may ignore your rules. To secure your server, block bad bots through the .htaccess file. If your server runs the LAMP stack, add the following lines to the file:
# Flag user agents with abnormally long (2000+ character) names as bad bots
SetEnvIfNoCase User-Agent ([a-z0-9]{2000}) bad_bots
# Flag known scrapers, scanners, and site-ripping tools as bad bots
SetEnvIfNoCase User-Agent (archive.org|binlar|casper|checkpriv|choppy|clshttp|cmsworld|diavol|dotbot|extract|feedfinder|flicky|g00g1e|harvest|heritrix|httrack|kmccrew|loader|miner|nikto|nutch|planetwork|postrank|purebot|pycurl|python|seekerspider|siclab|skygrid|sqlmap|sucker|turnit|vikspider|winhttp|xxxyy|youda|zmeu|zune) bad_bots
# Allow all requests except those flagged as bad bots
Order Allow,Deny
Allow from All
Deny from env=bad_bots
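Apache reads .htaccess on each request, so the rules take effect without a restart. For example, the following curl command sends a User-Agent string from the list above (sqlmap) and should receive a 403 Forbidden response (replace example.com with your domain):
curl -I -A "sqlmap" http://example.com/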