WordPress robots.txt: Guide to understanding and using it
If you manage a WordPress website, chances are you have heard of 'robots.txt'. Yet you may still wonder what it is, and whether it is an important part of your site. Well, we have got you covered. In this post, you will get a clear picture of what robots.txt is, how it works, and how it helps you manage bot access to your site.
What Is the WordPress Robots.txt File?
Before getting into the details of robots.txt, let us first define what a 'robot' means in this context. The classic example is a search engine crawler. Crawlers 'crawl' the web and help search engines like Google index and rank pages (check our tips on getting Google to index your site). These crawlers are the 'bots' or 'robots' that visit websites across the internet.
To be clear, bots are a necessary part of the internet. Nonetheless, that does not mean you should let them roam your site unregulated. The robots.txt file implements what is known as the 'Robots Exclusion Protocol', which was developed because website owners wanted some control over how bots interact with their sites. The robots.txt file can be used to limit bots' access to certain areas of a site, or even to block them completely.
Even so, this regulation has limitations. For instance, bots cannot be forced to follow the directives in a robots.txt file, and malicious bots are free to ignore it. Even Google and other prominent organizations disregard certain directives you can add to robots.txt. If you are having serious problems with bots, a security solution such as Cloudflare or Sucuri can be quite useful.
How Does the Robots.txt File Help Your Website?
There are two basic benefits of a well-configured robots.txt file. First, it blocks bots that waste your server resources, which improves your site's efficiency. Second, it optimizes search engines' crawl budget by telling them which URLs on your site they are allowed to crawl. Before a search engine crawls any page on a domain it has not encountered before, it opens that domain's robots.txt file and parses its directives. Contrary to popular belief, robots.txt is not meant to control which pages get indexed by search engines.
Is keeping certain pages out of search engine results your main aim? If so, a better way of doing this is to use a noindex meta tag or another equally direct approach. The reason is that robots.txt does not tell search engines not to index content; it only tells them not to crawl it. As a result, even though Google will not crawl the disallowed areas of your site, those pages can still get indexed whenever an external site links to them.
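For reference, the noindex directive is a standard robots meta tag placed in the page's `<head>` section (most SEO plugins, including Yoast, can add it for you without editing templates by hand):

```html
<head>
  <!-- Tells compliant search engines not to include this page in their index -->
  <meta name="robots" content="noindex">
</head>
```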
Creating and Editing Your Robots.txt File
WordPress automatically creates a virtual robots.txt file for your site. The robots.txt file always lives at the root of your domain, so if your domain is www.nameofwebsite.com, it can be found at http://nameofwebsite.com/robots.txt. Because this file is virtual, it cannot be edited directly. To edit your robots.txt file, you need to create a physical file on your server, which you can then tweak to your requirements.
Creating and Editing a Robots.txt File With Yoast SEO
Yoast SEO is a very popular plugin, and its interface allows you to create and edit the robots.txt file. Here are the steps to follow:
1. First, you need to enable Yoast SEO's advanced features. To do so, go to SEO, click Dashboard, choose Features from the menu that appears, and toggle Advanced settings pages to Enabled.
2. Once that is activated, go to SEO, select Tools, then click File Editor. You will be given the option to create a robots.txt file.
3. Click the "Create robots.txt file" button. You can then use the same interface to edit the contents of your file.
We will talk about what types of commands to put in your robots.txt file later in this article.
Creating and Editing a Robots.txt File With All in One SEO
When it comes to popularity, the All in One SEO Pack plugin is almost on par with Yoast SEO. This plugin’s interface can be used to create and edit the robots.txt file. Just follow these simple steps:
- Go to the plugin dashboard, select Feature Manager, and activate the Robots.txt feature.
- Then choose Robots.txt; you will be able to manage your robots.txt file from there.
Creating and Editing a Robots.txt File via SFTP
If you do not use an SEO plugin that offers robots.txt functionality, there is no need to worry. You can still create and edit a robots.txt file using SFTP. Follow these steps:
1. Create a blank file named "robots.txt" in any text editor and save it.
2. Connect to your site via SFTP and upload the file to your site's root folder.
3. You can now use SFTP to edit your robots.txt file, or upload new versions of it whenever you wish.
Deciding What to Put in Your Robots.txt File
Now that you have a physical robots.txt file, you can tweak and edit it to your requirements. Let us look at what you can do with it. We have already talked about how robots.txt lets you control bots on your site; now we will discuss the two core directives that make this possible.
1. The User-agent directive targets particular bots. Bots use user-agents to identify themselves, so this directive lets you create a rule that applies to one search engine but not to another.
2. The Disallow directive tells bots not to access specific areas of your site.
Furthermore, there is another directive called Allow. It comes into play when you disallow access to a folder and its sub-folders but want to permit access to one specific file or sub-folder within it. Keep in mind that all content on your site is allowed by default. When adding rules, first state the user-agent the rule applies to, then list the Allow and Disallow directives that should apply to it.
Specific use cases for robots.txt
If your site is still in the development stage, you may want to block crawler access to it. To do this, you will have to add the following code to your WordPress robots.txt file:
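Assuming your robots.txt sits at the domain root as described above, the rule looks like this:

```txt
# Block every crawler from every page on the site
User-agent: *
Disallow: /
```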
How Does This Code Work?
The * (asterisk) after User-agent means "all user agents", and the / (slash) after Disallow means that every URL under "www.nameofwebsite.com/" (that is, every page on your website) is disallowed.
Now let’s say you want to prevent a particular search engine from crawling your content. For example, you might want to allow Google to crawl your site but want to disallow Bing. You can do so simply by replacing the *(asterisk) in the previous example with Bingbot.
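Following that pattern, the rule would look like this (Bingbot is the user-agent string Bing's crawler identifies itself with; other crawlers, including Googlebot, are unaffected because no rule targets them):

```txt
# Block only Bing's crawler; all other bots remain unrestricted
User-agent: Bingbot
Disallow: /
```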
If you only want to block access to a particular folder or file (and consequently its sub-folders), this is the command that you should follow.
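Using the wp-admin folder as the example (substitute any path you like), the rule looks like this:

```txt
# Block all crawlers from /wp-admin/ and everything beneath it
User-agent: *
Disallow: /wp-admin/
```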
Here, we use the wp-admin folder as an example. It can be any folder or file; all you need to do is replace "wp-admin" in the above code with the name of the folder or file you want to keep search engines from crawling.
Let’s say you wish to block an entire folder but still allow access to a particular file within it.
In the previous example, we blocked access to the WordPress admin folder completely. What if we want to block access to the entire contents of the /wp-admin/ folder EXCEPT the /wp-admin/admin-ajax.php file? All you have to do is add an Allow command to the code in the previous example.
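Building on the previous rule, adding one Allow line does the trick:

```txt
# Block /wp-admin/ except for admin-ajax.php, which front-end features may need
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```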
If you wish to prevent search crawlers from crawling your search results pages, there’s a very simple command that will save the day. WordPress, by default, uses the query parameter “?s=”. Just add this command to block access. For example:
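Since WordPress appends the search term to "?s=", disallowing that query prefix blocks all search result URLs:

```txt
# Block crawling of internal search result pages
User-agent: *
Disallow: /?s=
```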
In all the above cases, we worked with one rule that accomplished a single goal. But what if you want to create different sets of directives for different bots? This is easy to do: simply create a separate set of rules under a User-agent line for each bot. For example, if you want one rule for all bots but a separate rule for Bingbot alone, this is what you would do:
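The two rule sets sit one after the other, each under its own User-agent line:

```txt
# All bots: keep out of the admin area only
User-agent: *
Disallow: /wp-admin/

# Bingbot specifically: keep out of the entire site
User-agent: Bingbot
Disallow: /
```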
In this case, you are blocking all bots from accessing the wp-admin folder, while blocking Bingbot from accessing your entire website.
How to Test Your Robots.txt File
You can check your robots.txt file to see if your entire website is crawlable, if you have blocked specific URLs, or if you already have blocked or disallowed certain crawlers.
You can do this in Google Search Console. Just open your site's property, go to "Crawl", select "robots.txt Tester", and enter any URL to check whether it is accessible.
Look out for the UTF-8 BOM
Your robots.txt file may look completely fine yet still have a major issue. For example, you may find that its directives are not being followed, and pages that are not supposed to be crawled are being crawled anyway. The culprit is almost always an invisible character called the UTF-8 BOM.
Here, BOM stands for byte order mark, a character that older text editors sometimes add to files. If it is present in your robots.txt file, Google may fail to read the file and complain about "Syntax not understood". This can have a significant impact on SEO and render your robots.txt file useless. While testing your robots.txt file, watch out for the UTF-8 BOM by checking whether Google reports any syntax it does not understand.
Ensuring the Correct Use of the Robots.txt File
Let us end this guide with a quick reminder: although robots.txt blocks crawling, it does not necessarily stop indexing. Robots.txt lets you add guidelines that shape how search engines and bots interact with your site, but it does not explicitly control whether your content gets indexed. Tweaking your site's robots.txt file can be very helpful if you intend:
1. To fix problems your website is having with a specific bot.
2. To have finer control over how search engines interact with certain content or plugins on your site.
If neither of the above applies to you, there is no need to replace the default virtual robots.txt file on your site.