What is robots.txt and why is it important for search engine optimization (SEO)? Robots.txt is a set of optional directives that tell web crawlers which parts of your website they can access. Most search engines, including Google, Bing, Yahoo and Yandex, support and use robots.txt to identify which web pages to crawl, index and display in search results.
If you’re having issues getting your website indexed by search engines, your robots.txt file may be the problem. Robots.txt errors are among the most common technical SEO issues that appear on SEO audit reports and can cause a massive drop in search rankings. Even seasoned technical SEO services providers and web developers are susceptible to committing robots.txt errors.
As such, it is important that you understand two things: 1) what is robots.txt and 2) how to use robots.txt in WordPress and other content management systems (CMS). This will help you create a robots.txt file that is optimized for SEO and make it easier for web spiders to crawl and index your web pages.
In this guide, we cover:
• What Is Robots.txt?
• What Is a Web Crawler and How Does It Work?
• What Does Robot Txt Look Like?
• What Is Robots.txt Used For?
• WordPress Robots.txt Location
• Where Is Robots.txt in WordPress?
• How To Find Robots.txt in cPanel
• How To Find Magento Robots.txt
• Robots Txt Best Practices
Let’s dive deep into the basics of robots.txt. Read on and discover how you can leverage the robots.txt file to improve your website’s crawlability and indexability.
What Is Robots.txt?
Robots.txt, also known as the robots exclusion standard or protocol, is a text file located in the root or main directory of your website. It serves as an instruction to SEO spiders on which parts of your website they can and cannot crawl.
Robots.txt Timeline
The robots.txt file is a standard proposed by Aliweb creator Martijn Koster to regulate how different search engine bots and web crawlers access web content. Here’s an overview of the robots.txt file’s development over the years:
In 1994, a poorly behaved web spider overwhelmed Koster’s server. To protect websites from bad SEO crawlers, Koster developed robots.txt to guide search bots to the right pages and keep them from reaching certain areas of a website.
In 1997, an internet draft was created to specify web robot control methods using a robots.txt file. Since then, robots.txt has been used to restrict or channel a spider robot to select parts of a website.
On July 1, 2019, Google announced that it was working toward formalizing the robots exclusion protocol (REP) specifications and making it a web standard – 25 years after the robots.txt file was created and adopted by search engines.
The goal was to detail unspecified scenarios for robots.txt parsing and matching and adapt them to modern web standards. This internet draft indicates that:
1. Any Uniform Resource Identifier (URI)-based transfer protocol, such as HTTP, the Constrained Application Protocol (CoAP) and the File Transfer Protocol (FTP), can use robots.txt.
2. Crawlers must parse at least the first 500 kibibytes of a robots.txt file to alleviate unnecessary strain on servers.
3. Robots.txt content is generally cached for up to 24 hours to provide website owners and web developers enough time to update their robots.txt file.
4. Disallowed pages are not crawled for a reasonably long period when a robots.txt file becomes inaccessible because of server problems.
Several industry efforts have been made over time to extend robots exclusion mechanisms. However, not all web crawlers may support these new robots.txt protocols. To clearly understand how robots.txt works, let’s first define web crawlers and answer an important question: how do web crawlers work?
What Is a Web Crawler and How Does It Work?
A website crawler, also called a spider robot, site crawler or search bot, is an internet bot typically operated by search engines like Google and Bing. A web spider crawls the web to analyze web pages and ensure information can be retrieved by users any time they need it.
What are web crawlers and what’s their role in technical SEO? To define a web crawler, it’s vital that you familiarize yourself with the different types of site crawlers across the web. Each spider robot has a different purpose:
1. Search Engine Bots
What is a search engine spider? A search engine spider bot is one of the most common SEO crawlers used by search engines to crawl and scrape the internet. Search engine bots use robots.txt SEO protocols to understand your web crawling preferences. Knowing the answer to “what is a search engine spider?” gives you an upper hand in optimizing your robots.txt and ensuring it works.
2. Commercial Web Spider
A commercial site crawler is a tool developed by software solution companies to help website owners collect data from their own platforms or public sites. Several firms provide guidelines on how to build a web crawler for this purpose. Be sure to partner with a commercial web crawling company that maximizes the efficiency of an SEO crawler to meet your specific needs.
3. Personal Crawler Bot
A personal website crawler is built to help businesses and individuals scrape data from search results and/or monitor their website performance. Unlike a spider search engine bot, a personal crawler bot has limited scalability and functionality. If you’re curious about how to make a website crawler that performs specific jobs to support your technical SEO efforts, consult one of the many guides on the internet that show you how to build a web crawler that runs from your local device.
4. Desktop Site Crawler
A desktop crawler robot runs locally from your computer and is useful for analyzing small websites. Desktop site crawlers, however, are not recommended if you’re analyzing tens or hundreds of thousands of web pages. This is because data crawling large sites requires custom setup or proxy servers that a desktop crawler bot does not support.
5. Copyright Crawling Bots
A copyright website crawler looks for content that violates copyright law. This type of search bot can be operated by any company or person who owns copyrighted material, regardless of whether they know how to build a web crawler or not.
6. Cloud-based Crawler Robot
Cloud-based crawling bots are used as a technical SEO services tool. A cloud-based crawler robot, delivered as Software as a Service (SaaS), runs on any device with an internet connection. This internet spider has become increasingly popular because it crawls websites of any size and does not require multiple licenses to use on different devices.
Why It’s Important To Know: What Are Web Crawlers?
Search bots are usually programmed to search for robots.txt and follow its directives. Some crawling bots, such as spambots, email harvesters and malware robots, however, often disregard the robots.txt SEO protocol and do not have the best intentions when accessing your site content.
Understanding web crawler behavior is a proactive measure to improve your online presence and enhance your user experience. By learning how a search engine spider differs from bad site crawlers, you can ensure that good search engine spiders can access your website and prevent unwanted SEO crawlers from ruining your user experience (UX) and search rankings.
The 8th Annual Bad Bot Report by Imperva shows that bad web crawling bots drove 25.6 percent of all site traffic in 2020, while good SEO spiders generated only 15.2 percent of traffic. With the many disastrous activities bad spider crawl bots are capable of, such as click fraud, account takeovers, content scraping and spamming, it pays to know 1) which web crawlers are beneficial to your site and 2) which bots you need to block when you create your robots.txt file.
Should Marketers Learn How To Make a Website Crawler?
You don’t necessarily need to learn how to make a website crawler. Leave the technical aspects of developing an SEO crawler to software solution companies and focus on your robots.txt SEO optimization instead.
“No one creates their own web crawler unless they’re specifically scraping data from a site,” said Ronnel Viloria, Thrive’s demand generation senior SEO strategist. “From the standpoint of technical SEO, the tools for website crawling already exist. Only if you’re scraping tens of GB of data constantly would it be cost-effective to build and host your own internet crawler.”
How Do Web Crawlers Work?
In this fast-paced digital landscape, simply knowing what a web crawler is isn’t enough to guide your robots.txt SEO optimization. Besides “what are web crawlers?” you also need to answer “how do web crawlers work?” to ensure you create a robots.txt file that contains the proper directives.
Search spiders are primarily programmed to perform automatic, repetitive searches on the web to build an index. The index is where search engines store web information to be retrieved and displayed on relevant search results upon user query.
An internet crawler follows certain processes and policies to improve its website crawling process and reach its target pages.
So, how does a web crawler work, exactly? Let’s take a look.
1. Discover URLs. Web spiders begin web crawling from a list of URLs, then pass between page links to crawl websites. To boost your site’s crawlability and indexability, be sure to prioritize your website navigability, create a clear robots.txt sitemap and submit robots.txt to Google.
2. Explore a List of Seeds. Search engines provide their search engine spiders a list of seeds, or URLs, to check out. Search engine spiders then visit each URL on the list, identify all the links on each page and add them to the list of seeds to visit. Web spiders use sitemaps and databases of previously crawled URLs to explore more web pages across the web.
3. Add to the Index. Once a search engine’s spider visits the URLs on the list, it locates and renders the content, including the text, files, videos and images, on each web page and adds it to the index.
4. Update the Index. Search engine spiders consider key signals, such as keywords and content relevance and freshness, when analyzing a web page. Once an internet crawler locates any changes on your website, it updates its search index accordingly to ensure it reflects the latest version of the web page.
According to Google, computer programs determine how to crawl a website. They look at the perceived importance and relevance, crawl demand and the level of interest that search engines and online users have in your website. These factors impact how frequently an internet spider would crawl your web pages.
How does a web crawler work and ensure all Google web crawling policies and spider crawl requests are met?
To better communicate to a search engine spider how to crawl a website, technical SEO services providers and WordPress web design experts advise you to create a robots.txt file that clearly indicates your data crawling preferences. The SEO robots.txt file is one of the protocols web spiders use to guide their Google web crawling and data crawling process across the internet.
What Does Robot Txt Look Like?
You can customize your robots.txt file to apply to specific search spiders, disallow access to particular files or web pages or control your robots.txt crawl delay.
This is what a default robots.txt file looks like:
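A minimal sketch of what that default file typically contains – this mirrors the virtual robots.txt WordPress generates, and your CMS may produce something slightly different:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php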
Spider crawl instructions are specified by using the following directives:
User-agent
The user-agent directive pertains to the name of the SEO crawler for which the command was meant. It is the first line for any robots.txt format or rule group.
The user-agent directive can use a wildcard, the * symbol, which means the rules that follow apply to all search bots. Directives may also target specific user-agents.
Each SEO crawler has a different name. Google web crawlers are called Googlebot, Bing’s SEO crawler is identified as BingBot and Yahoo’s internet spider is called Slurp. You can find the list of all user-agents here.
# Example 1
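# Reconstructed illustration – /example-page/ is a placeholder path
User-agent: *
Disallow: /example-page/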
In this example, since the * wildcard was used, robots.txt blocks all user-agents from accessing the specified URL.
# Example 2
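# Reconstructed illustration – /example-page/ is a placeholder path
User-agent: Googlebot
Disallow: /example-page/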
Googlebot was specified as a user-agent. This means all search spiders can access the URL except Google crawlers.
# Example 3
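# Reconstructed illustration – /example-page/ is a placeholder path
User-agent: Googlebot
User-agent: Slurp
Disallow: /example-page/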
Example #3 indicates that all user-agents except Google’s crawler and Yahoo’s internet spider, Slurp, are allowed to access the URL.
Allow
The robots.txt allow command indicates which content is accessible to the user-agent. The Robots.txt allow directive is supported by Google and Bing.
Keep in mind that the robots.txt allow directive should be followed by the path that can be accessed by Google web crawlers and other SEO spiders. If no path is indicated, Google crawlers will ignore the robots.txt allow directive.
# Example 1
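# Reconstructed from the description below
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php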
For this example, the robots.txt allow directive applies to all user-agents. This means robots.txt blocks all search engine spiders from accessing the /wp-admin/ directory except for the page /wp-admin/admin-ajax.php.
# Example 2: Avoid conflicting directives like this
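# Hypothetical conflicting rules – both lines match http://www.yourwebsite.com/example.php
User-agent: *
Allow: /example*
Disallow: /*.php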
When you create a robots.txt directive like this, Google crawlers and search spiders would be confused about what to do with the URL http://www.yourwebsite.com/example.php. It’s unclear which rule to follow.
To avoid Google web crawling issues, avoid using wildcards when combining robots.txt allow and disallow directives like this.
Disallow
The robots.txt disallow command is used to specify which URLs should not be accessed by Google crawl robots and website crawling spiders. Like the robots.txt allow command, the robots.txt disallow directive should be followed by the path you don’t want Google web crawlers to access.
# Example 1
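# Reconstructed from the description below
User-agent: *
Disallow: /wp-admin/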
For this example, the robots.txt disallow command prevents all user-agents from accessing the /wp-admin/ directory.
# Example 2
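# Reconstructed from the description below
User-agent: *
Disallow: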
This empty robots.txt disallow command tells Google web crawlers and other search bots that they may crawl the entire website because nothing is disallowed.
Note: Even though this robots disallow directive contains only two lines, be sure to follow the right robots.txt format. Do not write user-agent: * Disallow: on one line because this is wrong. When you create robots.txt, each directive should be on a separate line.
# Example 3
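# Reconstructed from the description below
User-agent: *
Disallow: /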
The / symbol represents the root in a website’s hierarchy. For this example, the robots.txt disallow directive is equivalent to a robots disallow all command. Simply put, you are hiding your entire website from Google spiders and other search bots.
Note: Similar to the example above (user-agent: * Disallow:), avoid using the one-line robots.txt syntax (user-agent: * Disallow: /) to disallow access to your website.
A robots.txt format like user-agent: * Disallow: / written on a single line would confuse a Google crawler and might cause WordPress robots.txt parsing issues.
Sitemap
The robots.txt sitemap command is used to point Google spiders and web crawlers to the XML sitemap. The robots.txt sitemap is supported by Bing, Yahoo, Google and Ask.
How do you add a sitemap to robots.txt? Knowing the answer to this question is useful, especially if you want as many search engines as possible to access your sitemap.
# Example
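# Reconstructed illustration – the sitemap URLs are placeholders
User-agent: *
Disallow: /wp-admin/
Sitemap: https://yourwebsite.com/sitemap1.xml
Sitemap: https://yourwebsite.com/sitemap2.xml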
In this example, the robots.txt disallow command tells all search bots not to access the /wp-admin/ directory. The robots.txt syntax also indicates that there are two sitemaps for the website. When you know how to add a sitemap to robots.txt, you can place multiple XML sitemaps in your robots.txt file.
Crawl-delay
The robots.txt crawl-delay directive is supported by several major spider bots. It stops search spiders from overtaxing a server. The crawl-delay command allows administrators to specify, in seconds, how long web crawlers should wait between each crawl request.
# Example
User-agent: BingBot
Crawl-delay: 10
Sitemap: https://yourwebsite.com/sitemap.xml
In this example, the robots.txt crawl delay directive tells search bots to wait a minimum of 10 seconds before requesting another URL.
Some web spiders, like Google’s web crawler, do not support the robots.txt crawl-delay command. Be sure to run your robots.txt syntax through a robots.txt checker before you submit robots.txt to Google and other search engines to avoid parsing issues.
Baidu, for one, also does not support crawl-delay directives, but you can leverage Baidu Webmaster Tools to control your website’s crawl frequency. You can also use Google Search Console (GSC) to define your website’s crawl rate.
Host
The host directive tells search spiders your preferred mirror domain or the replica of your website hosted at a different server. The mirror domain is used to distribute the traffic load and avoid the latency and the server load on your website.
# Example
Host: yourwebsite.com
The WordPress robots.txt host directive lets you decide whether you want search engines to show yourwebsite.com or www.yourwebsite.com.
End-of-string Operator
The $ sign is used to indicate the end of a URL and direct a Google web crawler on how to crawl a website with parameters. It is placed at the end of the path.
# Example
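# Reconstructed from the description below
User-agent: *
Disallow: /*.html$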
In this example, the robots.txt directive tells a Google crawler and other user-agents not to crawl URLs that end with .html.
This means URLs with parameters, like https://yourwebsite.com/page.html?lang=en, would still be included in the Google crawl request since the URL doesn’t end with .html.
Comments
Comments serve as a guide for web design and development specialists, and they are preceded by the # sign. They can be placed at the start of a WordPress robot.txt line or after a command. If you’re placing comments after a directive, be sure that they are on the same line.
Everything after the # will be ignored by Google crawl robots and search spiders.
# Example 1: Block access to /wp-admin/ directory for all search bots.
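User-agent: *
Disallow: /wp-admin/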
# Example 2
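User-agent: * # this rule applies to all search bots
Disallow: /wp-admin/ # an inline comment placed after the directive – this block is an illustrative reconstruction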
What Is Robots.txt Used For?
Robots.txt syntax is used to manage spider crawl traffic to your website. It plays a crucial role in making your website more accessible to search engines and online visitors.
Want to learn how to use robots.txt and create robots txt for your website? Here are the top ways you can improve your SEO performance with robots.txt for WordPress and other CMS:
1. Avoid overloading your website with Google web crawl and search bot requests.
2. Prevent Google crawl robots and search spiders from crawling private sections of your website using robots.txt disallow directives.
3. Protect your website from bad bots.
4. Maximize your crawl budget – the number of pages web crawlers can crawl and index on your website within a given timeframe.
5. Increase your website crawlability and indexability.
6. Avoid duplicate content in search results.
7. Hide unfinished pages from Google web crawl robots and search spiders before they are ready for publishing.
8. Improve your user experience.
9. Pass link equity or link juice to the right pages.
Wasting your crawl budget and resources on pages with low-value URLs can negatively impact your crawlability and indexability. Don’t wait until your site experiences several technical SEO issues and a significant drop in rankings before you finally prioritize learning how to create robots txt for SEO.
Master robots.txt Google optimization and you’ll protect your website from bad bots and online threats.
Do All Websites Need To Create Robot Text?
Not all websites need to create a robots.txt file. Search engines like Google have systems in place for deciding how to crawl website pages, and they automatically disregard duplicate or unimportant versions of a page.
Technical SEO specialists, however, recommend that you create a robots.txt file and implement robots txt best practices to allow faster and better web crawling and indexing by Google crawl robots and search spiders.
According to Edgar Dagohoy, Thrive’s SEO specialist, new websites do not need to worry about how to use robots.txt since the goal is to make your web pages accessible to as many search spiders as possible. On the other hand, if your website is more than a year old, it might start gaining traffic and attracting crawl request issues from Google crawl robots and search spiders.
“[When this happens] you would need to block those URLs in the WordPress robots.txt file so that your crawl budget won’t be affected,” said Dagohoy. “Note that websites with many broken URLs are crawled less by search engine bots, and you wouldn’t want that for your site.”
As mentioned above, knowing how to edit robots.txt for SEO gives you a significant advantage. More importantly, it gives you peace of mind that your website is secure from malicious attacks by bad bots.
WordPress Robots.txt Location
Ready to create robots.txt? The first step toward managing your crawl budget is to learn how to find robots.txt on your website. You can find the WordPress robots.txt location by going to your site URL and adding /robots.txt at the end.
For example: yourwebsite.com/robots.txt
Here’s a screenshot of a robots.txt syntax on the Thrive website:
This is an example of an optimized robots.txt file for Google and other search engines. Thrive’s robots.txt syntax contains disallow and allow commands that guide Google web crawl robots and search spiders on which pages to crawl and index.
Besides the disallow and allow directives, the file also includes a robots.txt sitemap directive to point web crawlers to the XML sitemap and avoid wasting crawl budget.
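The file follows the same structure covered in the directive sections above. A simplified sketch of that structure – illustrative only, not Thrive’s actual directives:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yourwebsite.com/sitemap_index.xml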
Where Is Robots.txt in WordPress?
WordPress is the world’s most popular and widely used CMS, powering approximately 40 percent of all websites on the web. It’s no wonder many website owners aim to learn how to edit robots.txt in WordPress. Some even tap WordPress web design professionals to get help with optimizing robots.txt for WordPress.
Where is robots.txt in WordPress? Follow these steps to access your WordPress robots.txt file:
1. Log in to your WordPress dashboard as an admin.
2. Navigate to “SEO.”
3. Click on “Yoast.” Yoast SEO is a WordPress plugin that you must install on your website to edit robots.txt in WordPress and make robots.txt updates anytime you need.
4. Click on “File editor.” This tool allows you to make quick changes to your robots.txt Google directives.
5. Now you can view and edit your WordPress robots.txt file.
Want to access robots.txt in WordPress and update your robots.txt disallow directives to show URLs restricted by robots.txt? Just follow the same process you used to determine where robots.txt is found in WordPress.
Don’t forget to save all the changes you make to your robots.txt for WordPress to ensure your disallow and allow commands are up to date.
How To Find Robots.txt in cPanel
cPanel is one of the most popular Linux-based control panels, used to manage web hosting accounts with maximum efficiency. Web developers also use cPanel to create robots.txt files.
How to find robots.txt in cPanel: Follow these steps to access your robots.txt file in cPanel.
1. Log in to your cPanel account.
2. Open “File Manager” and go to the root directory of your site.
3. You should be able to find the robots.txt file in the same location as the index or first page of your website.
How To Edit Robots.txt in cPanel
If you want to edit your robots.txt disallow directives or make necessary changes to your robots.txt syntax, simply:
1. Highlight the robots.txt file.
2. Click on “Editor” or “Code Edit” in the top menu to edit your robots.txt commands.
3. Click “Save Changes” to save the latest changes to your robots.txt directives.
How To Create Robots Txt in cPanel
To create a robots.txt file in cPanel, perform the following steps:
1. Log in to your cPanel account.
2. Navigate to the “Files” section and click on “File Manager.”
3. Click on “New File,” name it robots.txt and hit the “Create New File” button to create your robots.txt file.
How To Find Magento Robots.txt
Besides the common question of how to access robots.txt in WordPress, many website owners also aim to learn how to access, edit and optimize Magento robots.txt to better communicate to search spiders which URLs are restricted by robots.txt.
Magento is an eCommerce platform built on PHP, designed to help web developers create SEO-optimized eCommerce websites. So how do you find the Magento robots.txt file?
1. Log in to your Magento dashboard.
2. Navigate to the “Admin panel,” then click “Stores.”
3. Go to the “Settings,” then select “Configuration.”
4. Open the “Search Engine Robots” section. You can now view and edit your robots.txt file to determine the URLs restricted by robots.txt.
5. When complete, click the “Save Config” button.
What about creating robots.txt in Magento? The same process applies when you create a robots.txt file for Magento. You may also click the “Reset to Default” button should you need to restore the default instructions.
Robots Txt Best Practices
Learning how to access robots.txt in WordPress and how to edit robots.txt on various platforms are just the initial steps in optimizing your robots.txt disallow and allow directives.
To guide your robots.txt optimization process, follow these steps:
1. Run regular audits using a robots.txt checker. Google offers a free robots.txt checker to help you identify any robots.txt issues on your website.
2. Learn how to add a sitemap to robots.txt and apply it to your robots.txt file.
3. Leverage robots.txt disallow directives to prevent search bots from accessing private files or unfinished pages on your website.
4. Check your server logs.
5. Monitor your crawl report on Google Search Console (GSC) to identify how many search spiders are crawling your website. The GSC report shows your total crawl requests by response, file type, purpose and Googlebot type.
6. Check if your website is generating traffic and requests from bad bots. If so, you need to block them using robots.txt disallow directives (see the sketch after this list).
7. If your website has many 404 and 500 errors and they are causing web crawling issues, implement 301 redirects. If the errors escalate quickly and reach millions of 404 pages and 500 errors, you can use robots.txt disallow directives to restrict some user-agents from accessing the affected web pages and files. Be sure to optimize your robots.txt file to resolve recurring web crawling issues.
8. Enlist professional technical SEO services and web development solutions to properly implement robots txt block all, robot.txt allow and other directives on your robots.txt syntax.
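As an example of item 6 above, here is a minimal sketch of a block-all rule aimed at a single unwanted crawler – the user-agent name BadBot is a placeholder, not a real crawler:
User-agent: BadBot
Disallow: /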
Common Robots.txt Mistakes You Need To Avoid
Take note of these common errors when creating a robots.txt file and make sure to avoid them to improve your site crawlability and online performance:
❌ Placing robots.txt directives on a single line. Each robots.txt directive should always be on a separate line to provide clear instructions to web crawlers on how to crawl a website.
Incorrect: User-agent: * Disallow: /
Incorrect: User-agent: * Disallow:
Correct:
User-agent: *
Disallow: /
❌ Failure to submit robots.txt to Google. Always submit your updated robots.txt file to Google. Whether you made small changes, such as adding robots.txt deny all commands for specific user-agents, or deleted robots disallow all directives, be sure to click the submit button. This way, Google will be notified of any changes you’ve made to your robots.txt file.
❌ Placing the wrong robots.txt noindex directives. Doing so puts your website at risk of not getting crawled by search bots, losing valuable traffic and, worse, suffering a sudden drop in search rankings.
❌ Not placing the robots.txt file in the root directory. Putting your robots.txt file in sub-directories could make it undiscoverable by web crawlers.
Incorrect: https://www.yourwebsite.com/assets/robots.txt
Correct: https://www.yourwebsite.com/robots.txt
❌ Improper use of robots.txt deny all commands, wildcards, trailing slashes and other directives. Always run your robots.txt file through a robots.txt validator before saving and submitting it to Google and other search engines, so you don’t generate robots.txt errors.
❌ Relying on a robots.txt file generator to generate your robots.txt file. Although a robots.txt file generator is a useful tool, relying solely on it without doing manual checks of the robots.txt deny all directives, allow commands and user-agents in your robots.txt file is a bad practice. If you have a small website, using a robots.txt file generator to generate robots.txt is acceptable. But if you own an eCommerce website or offer many services, be sure to get expert help in creating and optimizing your robots.txt file.
❌ Disregarding robots.txt validator reports. A robots.txt validator is there for a reason. Maximize your robots.txt checker and other tools to ensure your robots.txt optimization efforts for SEO are on the right track.
Gain Control of Your Crawl Budget
Dealing with robots.txt optimization and other technical SEO issues can be taxing, especially if you don’t have the right resources, manpower and capabilities to perform the necessary tasks. Don’t stress yourself out by dealing with website issues that professionals could quickly resolve.
Entrust your local SEO, technical optimization and other digital marketing needs to Thrive Internet Marketing Agency and let us help you strengthen your online authority.