Everything You Need to Know About the WordPress Robots.txt File
If you manage a WordPress website, chances are that you’ve heard of ‘robots.txt’ and probably wondered what it is.
You've probably asked yourself, "Is it an important part of my site?" Well, we’ve got you covered! In this post, you will get a clear picture of what robots.txt is and how managing it can help improve your site's security.
What Is the WordPress Robots.txt File?
Before getting into the details of robots.txt, let’s first define what a “robot” means in this context. The classic example is search engine crawlers, which “crawl” the web and help search engines like Google index and rank the many pages available online. These crawlers are simply ‘bots’ or ‘robots’ visiting websites on the internet.
To put it simply, bots are a necessary part of the internet. That doesn’t mean you should let them run around your site unregulated. The robots.txt file, which implements the ‘Robots Exclusion Protocol’, was developed because website owners wanted to control how these robots interact with their websites. The robots.txt file can be used to limit the access of bots to certain areas of the site or even block them completely.
This regulation is subject to certain limitations, though. For instance, bots cannot be forced to follow the commands in the robots.txt file, and malicious bots are free to ignore it entirely. Even Google and other prominent organizations ignore certain directives that you add to robots.txt. If you are having lots of problems with bots, a security solution like Cloudflare or Sucuri can be quite useful.
How Does the Robots.txt File Help Your Website?
There are two basic benefits of a well-configured robots.txt file. First, blocking bots that waste your server resources increases the efficiency of your site. Second, it optimizes search engines’ crawl budget by telling them which URLs on your site they’re allowed to crawl.
Before a search engine crawls any page on a domain it hasn’t come across before, it opens the domain's robots.txt file and analyzes its commands. Contrary to popular belief, robots.txt isn’t a reliable way to regulate which pages get indexed in search engines.
If stopping certain pages from being included in search engine results is your primary aim, a better way of doing this is by using a noindex meta tag or another equally direct approach. The reason behind this is that robots.txt does not explicitly command search engines not to index content. It only commands them not to crawl it.
This means that even though Google will not crawl the specified areas within your site, those pages can still be indexed whenever an external site links to them.
Creating and Editing Your Robots.txt File
Your site will already have a robots.txt file created for it by WordPress. The robots.txt file always lives at the root of your domain, so if your domain is www.nameofwebsite.com, it should be found at www.nameofwebsite.com/robots.txt. By default, though, this is a virtual file and cannot be edited. To be able to edit your robots.txt file, you need to create a physical file on your server, which can then be tweaked according to your requirements.
Creating and Editing a Robots.txt File with Yoast SEO
Yoast SEO is a very popular plugin, and its interface allows you to create/edit the robots.txt file using the following steps.
- First, you need to enable Yoast SEO’s advanced features. To do this, go to SEO, click on Dashboard, and choose Features from the menu that appears. Then toggle the Advanced settings pages option to Enabled.
- Once it is activated, go to SEO and select Tools, then click on File Editor. You are then given an option to create the robots.txt file.
- When you click on the “Create robots.txt file” button, you will be able to use the same interface to edit the contents of your robots.txt file.
We’ll talk about what types of commands to put in your robots.txt file later in this article.
Creating and Editing a Robots.txt File with All in One SEO
When it comes to popularity, the All in One SEO Pack plugin is almost on par with Yoast SEO. This plugin’s interface can be used to create and edit the robots.txt file. Just follow these simple steps:
- Go to the plugin dashboard, select Feature Manager, and activate the Robots.txt feature.
- Now, choose Robots.txt, and you’ll be able to manage your robots.txt file from there.
Creating and Editing a Robots.txt File via FTP
Don't use an SEO plugin that offers robots.txt? There’s no need to worry. A robots.txt file can still be created and edited by using SFTP. Follow these steps:
- Make a blank file named “robots.txt” using any text editor and save it.
- Upload this file to the root folder of your site while you are connected to your site via SFTP.
- You can now use SFTP to make changes to your robots.txt file. You can also upload new versions of the file if you wish.
Deciding What to Put in Your Robots.txt File
Now that you have a physical robots.txt file, you can tweak and edit it as per your requirements. Let’s look at what you can do by using this file! We’ve already talked about how robots.txt is useful for controlling the interaction between bots and your site. We’ll now discuss the two core commands that are required to accomplish this.
- The goal of the User-agent command is to target particular bots. This command will help you create a rule that applies to one search engine but not to another. Bots use user-agents to identify themselves.
- The Disallow command enables you to keep robots from accessing specific areas of your site.
There’s another command called Allow, which comes in handy when you disallow access to a folder and its child folders but want to allow access to one particular file or subfolder within it. Keep in mind that all the content on your website is marked with “Allow” by default.
When adding rules, first state the user-agent to which the rule will apply, then list the rules themselves using the Disallow and Allow commands.
Let’s look at some specific use cases for robots.txt.
If your site is still in the development stage, you may want to block crawler access to it. To do this, you will have to add the following code to your WordPress robots.txt file:
User-agent: *
Disallow: /
How Does This Code Work?
The * (asterisk) after User-agent signifies “all user agents”, and the / (slash) after Disallow signifies that access to every URL beginning with “www.nameofwebsite.com/” (every page on your website) should be disallowed.
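You can check how rules like this behave locally with Python's standard urllib.robotparser module. This is just a sketch against the example rules above; the domain is a placeholder from this article.

```python
from urllib import robotparser

# The example rules above: every bot ("*") is blocked from every path ("/").
rules = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # record that the rules are loaded (can_fetch() treats an unloaded file as all-blocked)

# Any bot, any URL: crawling is disallowed.
print(rp.can_fetch("Googlebot", "https://www.nameofwebsite.com/"))        # False
print(rp.can_fetch("Bingbot", "https://www.nameofwebsite.com/a-page"))    # False
```

The same `can_fetch()` call can be used to sanity-check any of the rule sets discussed below before you upload them.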
Now let’s say you want to prevent a particular search engine from crawling your content. For example, you might want to allow Google to crawl your site but want to disallow Bing. You can do so simply by replacing the * (asterisk) in the previous example with Bingbot.
User-agent: Bingbot
Disallow: /
If you only want to block access to a particular folder or file (and, in the case of a folder, its sub-folders), this is the command to use.
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Here, we use the example of the wp-admin folder and the wp-login.php file. It can be any folder or file as per your requirements; all you need to do is replace the paths in the above code with whatever you want to prevent from being crawled by search engines.
Let’s say you wish to block an entire folder but still allow access to a particular file within it.
In the previous example, we blocked access to the WordPress admin folder completely. What if we want to block access to the entire contents of the /wp-admin/ folder EXCEPT the /wp-admin/admin-ajax.php file? All you have to do is add an Allow command to the code in the previous example.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
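Here is a sketch of how that exception behaves, again using Python's urllib.robotparser. One caveat worth knowing: Python's parser applies rules in order (first match wins), so the more specific Allow line is placed before the Disallow in this snippet, whereas Google's crawler honors the longest matching rule regardless of order.

```python
from urllib import robotparser

# Allow the AJAX endpoint while blocking the rest of /wp-admin/.
# urllib.robotparser uses first-match ordering, so Allow is listed first here.
rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # record that the rules are loaded so can_fetch() evaluates them

print(rp.can_fetch("Googlebot", "https://www.nameofwebsite.com/wp-admin/admin-ajax.php"))  # True
print(rp.can_fetch("Googlebot", "https://www.nameofwebsite.com/wp-admin/options.php"))     # False
```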
If you wish to prevent crawlers from accessing your search results pages, there’s a very simple set of commands that will save the day. By default, WordPress uses the query parameter “?s=” for searches. Just add these commands to block access. For example:
User-agent: *
Disallow: /?s=
Disallow: /search/
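As a quick local sanity check of these search-blocking rules (the URLs below are hypothetical), urllib.robotparser matches query strings as part of the URL:

```python
from urllib import robotparser

# The search-blocking rules above, checked locally.
rules = """\
User-agent: *
Disallow: /?s=
Disallow: /search/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # record that the rules are loaded so can_fetch() evaluates them

print(rp.can_fetch("Googlebot", "https://www.nameofwebsite.com/?s=widgets"))   # False
print(rp.can_fetch("Googlebot", "https://www.nameofwebsite.com/search/page"))  # False
print(rp.can_fetch("Googlebot", "https://www.nameofwebsite.com/blog/"))        # True
```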
In all the above cases, we worked with one rule that accomplished a singular goal. But what if you want to create different sets of commands for different bots?
This is actually very easy to do. All you have to do is create a separate set of rules under a User-agent line for each bot. Say you want one rule for all bots but a separate rule only for Bingbot; this is what you’ll do:
User-agent: *
Disallow: /wp-admin/

User-agent: Bingbot
Disallow: /
What you’re doing in this case is blocking all bots from accessing the wp-admin folder, while blocking Bingbot from accessing your entire website.
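A quick way to convince yourself the two rule groups behave independently is, once more, Python's urllib.robotparser (the page URLs are made up for illustration):

```python
from urllib import robotparser

# Separate rule groups: all bots lose /wp-admin/, Bingbot loses everything.
rules = """\
User-agent: *
Disallow: /wp-admin/

User-agent: Bingbot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # record that the rules are loaded so can_fetch() evaluates them

print(rp.can_fetch("Googlebot", "https://www.nameofwebsite.com/a-post"))     # True
print(rp.can_fetch("Googlebot", "https://www.nameofwebsite.com/wp-admin/"))  # False
print(rp.can_fetch("Bingbot", "https://www.nameofwebsite.com/a-post"))       # False
```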
How to Test Your Robots.txt File
You can check your robots.txt file to see whether your entire website is crawlable, whether you have blocked specific URLs, and whether you have blocked or disallowed certain crawlers.
You can do this in Google Search Console. Open your website’s property, go to “Crawl”, select “robots.txt Tester”, and enter any URL to check its accessibility.
Look out for the UTF-8 BOM
Your robots.txt file may look completely fine and yet hide a major issue. For example, you may find that its directives are not being adhered to and that pages that are not supposed to be crawled are in fact being crawled. The reason often comes down to an invisible character called the UTF-8 BOM.
Here BOM signifies byte order mark, and it is sometimes added to files by older text editors. If this character is present in your robots.txt file, Google may fail to read it and report “Syntax not understood”. This can render your robots.txt file useless and significantly hurt your SEO. While you are testing your robots.txt file, look out for the UTF-8 BOM by checking whether Google flags any of your directives as not understood.
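If you suspect a BOM, a few lines of Python can detect and strip it before you re-upload the file. This is a minimal sketch; the raw bytes below are a made-up example of a robots.txt saved with a BOM.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the file contents without a leading UTF-8 BOM (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A robots.txt that a text editor saved with an invisible BOM in front of it.
raw = b"\xef\xbb\xbfUser-agent: *\nDisallow: /wp-admin/\n"
print(raw.startswith(codecs.BOM_UTF8))  # True: the BOM is there

clean = strip_utf8_bom(raw)
print(clean[:13])                       # b'User-agent: *'
```

In practice you would read the file's bytes, run them through `strip_utf8_bom`, and write the result back before uploading.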
Ensuring the Correct Use of the Robots.txt File
Let’s end this guide with a quick reminder that, while robots.txt blocks crawling, it does not necessarily stop indexing. Even though robots.txt lets you add guidelines that control and outline your site’s interaction with search engines and bots, it doesn't explicitly control whether or not your content is indexed.
Tweaking your site's robots.txt file can be very helpful if:
- Your website is having trouble with a specific bot.
- You want better control over the interaction between search engines and certain content or plugins on your site.
If you don't meet either of these criteria, there may not be an urgent need for you to change the virtual robots.txt file that is present in your WordPress site by default.