A robots.txt file is a text file that controls which parts of your website search engine crawlers can access. It sits in your site’s root directory and gives instructions to bots like Googlebot and Bingbot about which pages to crawl or ignore. The main benefits include preventing duplicate content in search results, protecting private pages from indexing, managing server load through crawl delays, and helping search engines find your sitemap. You use it to block staging sites from appearing in search results, stop crawlers from accessing admin areas, prevent server overload, and guide bots toward important content. The main components are the user-agent directive, allow and disallow commands, sitemap directive, and crawl delay instruction.
What is Robots.txt?
A robots.txt file is a plain text file located in the root directory of your website that provides instructions to search engine crawlers. It tells bots like Googlebot, Bingbot, and YandexBot which parts of your site they can crawl and which parts they should avoid.
The robots.txt file follows the Robots Exclusion Protocol (REP), a web standard that governs how crawlers interact with websites. When a search engine bot visits your site, it first checks for this file at the root of the domain (for example, https://pureseo.com/robots.txt). The bot reads the instructions and follows them during the crawling process.
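Because the file always sits at the root of the host, its location can be derived from any page URL. The following is a minimal sketch using Python's standard urllib.parse module; the helper name robots_txt_url and the sample page path are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host (with port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL on the domain used in the example above.
print(robots_txt_url("https://pureseo.com/blog/some-post?ref=home"))
```

Each host (including each subdomain and port) resolves to its own robots.txt location under this scheme.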
The History of Robots.txt
Martijn Koster proposed the robots.txt file in 1994 after a badly behaved web crawler overloaded his server.
- 1994: Koster created robots.txt to guide search bots away from problematic areas.
- 1997: A formal draft specified web robot control methods using the robots.txt file.
- 2019: On July 1, 2019, nearly 25 years after the original proposal, Google announced that it was working to make the Robots Exclusion Protocol (REP) an official internet standard and published a draft specification. The protocol was later published as RFC 9309 in 2022.
The modern draft includes requirements for parsing at least the first 500 kibibytes of a robots.txt file, caching robots.txt content for up to 24 hours, and continuing to respect known disallowed pages when the file becomes inaccessible.
How Does a Robots.txt File Work?
A robots.txt file contains directives that tell specific user agents which parts of your website they can crawl. The user agent is the specific web crawler receiving the instructions. The file uses commands that either allow or disallow access to certain pages, folders, or your entire site.
Using correct syntax is crucial for a robots.txt file to function properly. Here are two examples of a basic robots.txt file:
User-agent: *
Disallow: /
This syntax blocks website crawlers from accessing all your website pages, including the homepage.
User-agent: *
Disallow:
This syntax allows the user agent to access all pages of the website, including the homepage.
To block access to an individual webpage, specify it in the syntax:
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
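One way to confirm how these three rule sets behave is to run them through a parser. The sketch below uses Python's standard urllib.robotparser module; the helper name can_crawl and the /other-page.html path are assumptions for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, agent: str, path: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


block_all = "User-agent: *\nDisallow: /"
allow_all = "User-agent: *\nDisallow:"
block_page = "User-agent: Bingbot\nDisallow: /example-subfolder/blocked-page.html"

print(can_crawl(block_all, "Googlebot", "/"))   # False: everything is blocked
print(can_crawl(allow_all, "Googlebot", "/"))   # True: an empty Disallow blocks nothing
print(can_crawl(block_page, "Bingbot", "/example-subfolder/blocked-page.html"))  # False
print(can_crawl(block_page, "Bingbot", "/other-page.html"))  # True: only one page is blocked
```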
How to Create One
Creating a robots.txt file is straightforward because it is a basic text file. Use any text editor, such as Notepad or TextEdit. The file must be hosted in the root directory of your domain so crawlers can find it. Each website domain should contain only one robots.txt file, and it must be named “robots.txt.”
After naming the file, add rules about which parts of the website can or cannot be crawled by specified user-agents. The rules you enter depend on your website content and goals. After establishing the rules, upload the file, confirm that it is publicly accessible by opening it in a browser, and test the rules with a tool such as the robots.txt report in Google Search Console, which replaced Google’s standalone Robots.txt Tester.
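If you prefer to generate the file rather than type it by hand, the rules can be rendered from a small data structure. This is only a sketch: the function name build_robots_txt, its argument layout, and the sample sitemap URL are assumptions for illustration, not part of any standard tool.

```python
def build_robots_txt(groups, sitemaps=()):
    """Render rule groups and sitemap URLs as robots.txt text.

    groups is a list of (user_agent, rules) pairs, where rules is a list of
    (directive, path) pairs such as ("Disallow", "/wp-admin/").
    """
    lines = []
    for user_agent, rules in groups:
        lines.append(f"User-agent: {user_agent}")
        for directive, path in rules:
            lines.append(f"{directive}: {path}")
        lines.append("")  # a blank line separates rule groups
    for url in sitemaps:
        lines.append(f"Sitemap: {url}")
    return "\n".join(lines).rstrip("\n") + "\n"


content = build_robots_txt(
    [("*", [("Disallow", "/wp-admin/")])],
    sitemaps=["https://websitename.com/sitemap.xml"],
)
print(content)
# To publish, save the text as "robots.txt" in the site's root directory, e.g.:
# pathlib.Path("robots.txt").write_text(content, encoding="utf-8")
```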
Why Use a Robots.txt File?
There are 4 main benefits of using a robots.txt file for your website:
Maintain Privacy – Keep compliant bots away from private sections of your website. This is useful when you are building a staging site or have a specific page you do not want public yet.
Help Search Engines Find Your Sitemap – Your sitemap allows crawlers to access the most important areas of your website more efficiently. The robots.txt file helps search engines locate your sitemap, which benefits SEO.
Prevent Duplicate Content – Duplicate content on your website can harm your SEO. With a robots.txt file, you can prevent duplicate content from appearing on the SERPs.
Prevent Server Overload – Crawlers loading too much content at once can overload servers. With a robots.txt file, you can specify a crawl delay to prevent this issue.
What Are Web Crawlers?
Web crawlers, also called spider bots or search bots, are internet bots operated mainly by search engines. These bots crawl the web to discover and examine web pages so that their content can be indexed and retrieved by users.
There are 6 types of site crawlers found on the web:
- Search engine bots
- Commercial web spiders
- Personal crawler bots
- Desktop site crawlers
- Copyright crawling bots
- Cloud-based crawler robots
User Agent
The user-agent directive names the crawler that a rule group applies to. It is the first line of any robots.txt rule group. The value can be the * symbol, which means the group applies to all bots, or the name of a specific crawler. Every crawler has a unique name. Google web crawlers are called Googlebot, while the Yahoo spider is called Slurp.
Example 1:
User-agent: *
Disallow: /wp-admin/
Since * was used, no user-agent is allowed to access URLs under /wp-admin/.
Example 2:
User-agent: Googlebot
Disallow: /wp-admin/
Since Googlebot was named as the user-agent, only Google crawlers are blocked from /wp-admin/. All other search spiders can still access it.
Example 3:
User-agent: Googlebot
User-agent: Slurp
Disallow: /wp-admin/
All user-agents except Google crawlers and Yahoo bots have access to /wp-admin/.
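Example 3 can be checked with Python's standard urllib.robotparser module. The sketch below is illustrative only; Bingbot is used here simply as a crawler that the group does not name.

```python
from urllib.robotparser import RobotFileParser

# Example 3: one rule group shared by two named user-agents.
EXAMPLE_3 = """\
User-agent: Googlebot
User-agent: Slurp
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_3.splitlines())

for agent in ("Googlebot", "Slurp", "Bingbot"):
    # Named agents are blocked; any agent not listed has no matching group.
    print(agent, parser.can_fetch(agent, "/wp-admin/"))
```

Stacking several User-agent lines above one set of rules applies those rules to each named crawler and to no one else.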
Allow Command
The allow command states which content can be accessed by the user-agent. The robots.txt allow command is supported by Bing and Google. The allow directive must be followed by the path that web crawlers can access. If the path is not specified, crawlers will ignore the allow directive.
Example 1:
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
The allow directive applies to all user-agents. All search engine spiders are blocked from accessing the /wp-admin/ directory except for the page /wp-admin/admin-ajax.php.
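The exception can be verified with Python's standard urllib.robotparser module, as in the sketch below. Two assumptions to note: /wp-admin/options.php is a path chosen only for illustration, and this parser applies the first rule that matches, whereas Google documents choosing the most specific (longest) matching rule. With the Allow line placed first, as in the example above, both approaches give the same answer.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The explicitly allowed file is reachable.
print(parser.can_fetch("Googlebot", "/wp-admin/admin-ajax.php"))  # True
# Anything else in the directory stays blocked.
print(parser.can_fetch("Googlebot", "/wp-admin/options.php"))     # False
```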
Disallow Command
The disallow command indicates all URLs that must not be accessed by web crawlers. Similar to the allow command, the disallow directive should be followed by the path you do not want web crawlers to access.
Example 1:
User-agent: *
Disallow: /wp-admin/
The command restricts all user-agents from accessing the /wp-admin/ directory.
Sitemap
The sitemap command directs search engine bots to the XML sitemap. This is supported by search engines like Google, Yahoo, Bing, and Ask.
Example 1:
User-agent: *
Disallow: /wp-admin/
Sitemap: https://websitename.com/sitemap1.xml
Sitemap: https://websitename.com/sitemap2.xml
The disallow command tells all search bots not to access /wp-admin/. The syntax also indicates there are 2 sitemaps on the website. You can add several XML sitemaps in the robots.txt file.
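Sitemap lines can also be read back programmatically. The sketch below uses RobotFileParser.site_maps() from Python's standard library, which is available from Python 3.8 onward; the file content is the example above.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /wp-admin/
Sitemap: https://websitename.com/sitemap1.xml
Sitemap: https://websitename.com/sitemap2.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns the listed sitemap URLs, or None when there are none.
for url in parser.site_maps() or []:
    print(url)
```

Sitemap lines are independent of the user-agent groups, so they are reported regardless of which crawler is asking.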
Crawl Delay
The crawl delay command prevents web crawlers from overtaxing a server. This command enables admins to specify the time spiders should wait between each crawl request, measured in seconds.
Example:
User-agent: *
Disallow: /wp-admin/
Disallow: /calendar/
Disallow: /events/
User-agent: Bingbot
Disallow: /calendar/
Disallow: /events/
Crawl-delay: 10
Sitemap: https://websitename.com/sitemap.xml
The crawl delay command instructs Bingbot to wait 10 seconds before requesting the next URL. Some web spiders, such as Googlebot, do not support crawl delay commands. Run your syntax through a robots.txt checker before publishing the file.
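The example can be inspected with Python's standard urllib.robotparser module. This is a sketch of how one parser reads the file, not a statement of how every search engine behaves.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /calendar/
Disallow: /events/
User-agent: Bingbot
Disallow: /calendar/
Disallow: /events/
Crawl-delay: 10
Sitemap: https://websitename.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.crawl_delay("Bingbot"))    # 10 (seconds)
print(parser.crawl_delay("Googlebot"))  # None: the * group sets no delay

# A crawler with its own group follows only that group, not the * group.
print(parser.can_fetch("Bingbot", "/wp-admin/"))    # True
print(parser.can_fetch("Googlebot", "/wp-admin/"))  # False
```

Note the last two lines: because the Bingbot group does not repeat the /wp-admin/ rule, this parser treats /wp-admin/ as open to Bingbot. If that is not intended, the rule needs to be listed in the Bingbot group as well.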
Where Will You Find Robots.txt in WordPress?
WordPress powers more than 40% of all websites. Follow these 5 steps to access the WordPress robots.txt file:
- Log in to your WordPress dashboard
- Find “SEO”
- Click “Yoast” (these steps assume the Yoast SEO plugin is installed)
- Click on “File Editor”
- View the WordPress robots.txt file and edit it in the WordPress directory
How to Use a Robots.txt File for SEO
Search engine optimisation is a key component of a successful website. Using robots.txt correctly benefits your SEO strategies, while mistakes can cause unintentional harm. Keep SEO best practice in mind and avoid common errors.
Pages you want crawled by search engines must not be accidentally blocked by robots.txt. Links featured on a blocked page will not be followed, meaning linked resources will not be crawled or indexed. This negatively affects link equity and has a direct impact on your website’s SEO.
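A quick way to catch accidental blocks is to test a list of must-crawl pages against the rules before publishing. The sketch below uses Python's standard urllib.robotparser module; the helper name blocked_paths, the sample rules, and the page paths are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, agent, must_crawl):
    """Return the paths in must_crawl that robots_txt blocks for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in must_crawl if not parser.can_fetch(agent, path)]


# Hypothetical rules that block the whole blog by mistake.
rules = "User-agent: *\nDisallow: /blog/\nDisallow: /wp-admin/"
important = ["/", "/blog/robots-guide/", "/contact/"]

print(blocked_paths(rules, "Googlebot", important))
```

An empty result means none of the listed pages is blocked for that crawler under this parser's interpretation of the rules.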
Overly broad or unnecessary disallow instructions can keep crawlers away from pages you want ranked, which can harm search rankings. Use the command only where necessary. Check for syntax errors in your robots.txt before saving the file in the root directory.
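Basic syntax problems can be caught with a small script before the file goes live. The following is a rough sketch, not a full validator: the function name lint_robots_txt is hypothetical, and the set of recognised directives is limited to the ones covered in this article, so a crawler-specific directive outside that set would be flagged even if some bots accept it.

```python
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}
GROUP_RULES = {"allow", "disallow", "crawl-delay"}


def lint_robots_txt(text):
    """Return (line_number, message) pairs for lines that look malformed."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive '{field}'"))
        elif field == "user-agent":
            seen_user_agent = True
        elif field in GROUP_RULES and not seen_user_agent:
            problems.append((number, f"'{field}' appears before any User-agent line"))
    return problems


# Deliberately broken sample: rule before a group, a typo, and a missing colon.
sample = "Disallow: /tmp/\nUser-agent: *\nDissallow: /wp-admin/\nDisallow /private/"
for number, message in lint_robots_txt(sample):
    print(f"line {number}: {message}")
```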
Wrap Up
A robots.txt file is an essential tool for controlling how search engine crawlers interact with your website. It helps maintain privacy, prevents duplicate content, manages server load, and guides crawlers to your sitemap. The main components include user-agent directives, allow and disallow commands, sitemap references, and crawl delay instructions. Proper implementation of a robots.txt file improves crawl efficiency and protects your search rankings.


