February 13, 2025

What Is a Robots.txt File?

A robots.txt file is a set of instructions that tell search engines which pages to crawl and which pages to avoid, guiding crawler access but not necessarily keeping pages out of Google's index.

A robots.txt file looks like this:

robots.txt example

Robots.txt files may seem complicated, but the syntax (computer language) is straightforward.

Before explaining the details of robots.txt, we'll clarify how robots.txt differs from other terms that sound similar.

Robots.txt files, meta robots tags, and x-robots tags all guide search engines in handling site content, but they differ in their level of control, where they're located, and what they control.

Consider these specifics:

  • Robots.txt: This file is located in the website's root directory and provides site-wide instructions to search engine crawlers on which areas of the site they should and shouldn't crawl
  • Meta robots tags: These tags are snippets of code in the <head> section of individual webpages and provide page-specific instructions to search engines on whether to index (include in search results) and follow (crawl the links on) each page
  • X-robots tags: These code snippets are used primarily for non-HTML files, such as PDFs and images, and are implemented in the file's HTTP header (see the examples after this list)
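
For illustration, here is a minimal, hedged sketch of the second and third options; the header names are standard, but how they get set depends on your server configuration.

A meta robots tag in a page's <head>:

<meta name="robots" content="noindex, nofollow">

An X-Robots-Tag sent as an HTTP response header for a PDF:

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex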

Further reading: Meta Robots Tag & X-Robots-Tag Explained

Why Is Robots.txt Important for SEO?

A robots.txt file is important for SEO because it helps manage web crawler activity, preventing crawlers from overloading your website and from crawling pages not intended for public access.

Below are a few reasons to use a robots.txt file:

1. Optimize Crawl Budget

Blocking unnecessary pages with robots.txt lets Google's web crawler spend more of your crawl budget (the number of pages Google will crawl on your site within a certain timeframe) on pages that matter.

Crawl budget can vary based on your site's size, health, and number of backlinks.

If your site has more pages than its crawl budget, important pages may fail to get indexed.

Unindexed pages won't rank, which means you've wasted time creating pages that users never see in search results.

2. Block Duplicate and Non-Public Pages

Not all pages are meant to appear in the search engine results pages (SERPs), and a robots.txt file lets you block these private pages from crawlers.

Consider staging sites, internal search results pages, duplicate pages, or login pages. Some content management systems handle these internal pages automatically.

WordPress, for example, automatically disallows the login page "/wp-admin/" for all crawlers, as shown in the sketch below.
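
For reference, the virtual robots.txt that a default WordPress install serves typically looks something like this (the exact output can vary by WordPress version and plugins):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php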

3. Hide Resources

Robots.txt lets you exclude resources like PDFs, videos, and images from crawling if you want to keep them private or have Google focus on more important content.
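
As a hedged sketch, a site that keeps downloadable PDFs and raw media in dedicated folders (the paths below are hypothetical) might use rules like these:

User-agent: *
# Keep crawlers focused on the main content rather than the asset folders
Disallow: /downloads/
Disallow: /videos/
Disallow: /images/raw/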

How Does a Robots.txt File Work?

A robots.txt file tells search engine bots which URLs to crawl and (more importantly) which URLs to avoid crawling.

When search engine bots crawl webpages, they discover and follow links. This process leads them from one site to another across various pages.

If a bot finds a robots.txt file, it reads that file before crawling any pages.

The syntax is straightforward. You assign rules by identifying the user-agent (the search engine bot) and specifying directives (the rules).

You can use an asterisk (*) to assign directives to all user-agents at once.

For example, the following instruction allows all bots except DuckDuckGo to crawl your site:

all bots except DuckDuckGo instructed to crawl the site
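
In plain text, the pictured instruction would look roughly like this (DuckDuckBot is DuckDuckGo's crawler):

User-agent: DuckDuckBot
Disallow: /

User-agent: *
Allow: /

Because a crawler follows the group that matches it most specifically, DuckDuckBot obeys its own block while every other bot falls back to the wildcard group.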

Semrush bots crawl the web to gather insights for our website optimization tools, such as Site Audit, Backlink Audit, and On Page SEO Checker.

Semrush bots respect the rules in your robots.txt file, meaning they won't crawl your site if you block Semrush bots from crawling.

However, blocking Semrush bots limits the use of certain Semrush tools.

For example, if you block the SiteAuditBot from crawling your site, you can't audit your site with the Site Audit tool. This tool helps analyze and fix technical issues on your site.

Errors section in Site Audit tool

If you block the SemrushBot-SI from crawling your site, you can't use the On Page SEO Checker tool effectively.

As a result, you lose the opportunity to generate optimization ideas that could improve your webpages' rankings.

On Page SEO Checker tool

How to Find a Robots.txt File

Your robots.txt file is hosted on your server, just like other files on your website.

You can view any website's robots.txt file by typing the site's homepage URL into your browser and adding "/robots.txt" at the end.

For example: "https://semrush.com/robots.txt."

https://semrush.com/robots.txt example

Examples of Robots.txt Files

Here are some real-world robots.txt examples from popular websites.

YouTube

YouTube's robots.txt file tells crawlers not to access user comments, video feeds, login/signup pages, and age verification pages.

The rules in YouTube's robots.txt file discourage indexing of user-specific or dynamic content that doesn't help search results and may raise privacy concerns.

YouTube robots.txt file with highlights showing disallowed URL paths

G2

G2's robots.txt file tells crawlers not to access sections with user-generated content, like survey responses, comments, and contributor profiles.

The rules in G2's robots.txt file help protect user privacy by restricting access to potentially sensitive personal information. The rules also prevent attempts to manipulate search results.

G2 robots.txt file with highlights showing disallowed URL paths

Nike

Nike's robots.txt file uses the disallow directive to block crawlers from accessing user-generated directories, like "/checkout/" and "*/member/inbox."

The rules in Nike's robots.txt file prevent sensitive user data from appearing in search results and reduce opportunities to manipulate SEO rankings.

Nike robots.txt file with highlights showing disallowed URL paths

Search Engine Land

Search Engine Land's robots.txt file uses the disallow directive to discourage indexing of "/tag/" directory pages, which often have low SEO value and can cause duplicate content issues.

The rules in Search Engine Land's robots.txt file encourage search engines to focus on higher-quality content and optimize the site's crawl budget, something especially important for large websites like Search Engine Land.

Search Engine Land robots.txt file with highlights showing disallowed url path

Forbes

Forbes's robots.txt file instructs Google not to crawl the "/test/" directory, which likely contains testing or staging environments.

The rules in Forbes's robots.txt file prevent unfinished or sensitive content from being indexed, assuming it isn't linked to from elsewhere.

Forbes robots.txt file with highlights showing disallowed URL path

Explaining Robots.txt Syntax

A robots.txt file consists of one or more directive blocks, with each block specifying a user-agent (a search engine bot) and providing "allow" or "disallow" instructions.

A simple block may look like this:

User-agent: Googlebot
Disallow: /not-for-google
User-agent: DuckDuckBot
Disallow: /not-for-duckduckgo
Sitemap: https://www.yourwebsite.com/sitemap.xml

The User-Agent Directive

The first line of each directive block specifies the user-agent, which identifies the crawler.

For example, use these lines to prevent Googlebot from crawling your WordPress admin page:

User-agent: Googlebot
Disallow: /wp-admin/

When multiple directives exist, a bot may choose the most specific one.

Imagine you have three sets of directives: one for *, one for Googlebot, and one for Googlebot-Image.

If the Googlebot-News user agent crawls your site, it will follow the Googlebot directives.

However, the Googlebot-Image user agent will follow the more specific Googlebot-Image directives, as in the sketch below.
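
A hedged sketch of that setup, using placeholder paths, might look like this:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Googlebot-Image
Disallow: /images/not-for-google-images/

Googlebot-News has no group of its own here, so it falls back to the Googlebot group; Googlebot-Image matches its own group and ignores the broader Googlebot rules.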

The Disallow Robots.txt Directive

The disallow directive lists parts of the site a crawler shouldn't access.

An empty disallow line means no restrictions exist.

For example, the below rule allows all crawlers access to your entire site:

User-agent: *
Allow: /

To block all crawlers from your entire site, use the below block:

User-agent: *
Disallow: /

The Allow Directive

The allow directive permits search engines to crawl a subdirectory or specific page, even in an otherwise disallowed directory.

For example, use the below rule to prevent Googlebot from accessing all blog posts except one:

User-agent: Googlebot
Disallow: /blog
Allow: /blog/example-post

The Sitemap Directive

The sitemap directive tells search engines (specifically Bing, Yandex, and Google) where to find your XML sitemap, a file that lists all the pages you want search engines to index.

The image below shows what the sitemap directive looks like:

The Sitemap directive example
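
In text form, the directive is a single line pointing at the sitemap's full URL (the URL below is a placeholder):

# The sitemap directive can appear anywhere in the file and applies to all crawlers
Sitemap: https://www.yourwebsite.com/sitemap.xml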

Including a sitemap directive in your robots.txt file is a quick way to share your sitemap.

However, you should also submit your XML sitemap directly to search engines via their webmaster tools to speed up crawling.

The Crawl-Delay Directive

The crawl-delay directive tells crawlers how many seconds to wait between requests, which helps avoid server overload.

Google no longer supports the crawl-delay directive. To set crawl rates for Googlebot, use Google Search Console.

Bing and Yandex do support the crawl-delay directive.

For example, use the below rule to set a 10-second delay after each crawl action:

User-agent: *
Crawl-delay: 10

Further reading: 15 Crawlability Problems & How to Fix Them

The Noindex Directive

A robots.txt file tells search engines what to crawl and what not to crawl, but it can't reliably keep a URL out of search results, even if you use a noindex directive.

If you use noindex in robots.txt, the page can still appear in search results without visible content.

Example of page indexed and blocked by robots.txt in search results

Google never officially supported the noindex directive in robots.txt and confirmed as much in September 2019.

To reliably exclude a page from search results, use a meta robots noindex tag instead.

How to Create a Robots.txt File

Use a robots.txt generator tool to quickly create a robots.txt file.

Follow these steps to create a robots.txt file from scratch:

1. Create a File and Name It Robots.txt

Open a .txt document in a text editor or web browser.

Name the document "robots.txt."

You can now start typing directives.

2. Add Directives to the Robots.txt File

A robots.txt file contains one or more groups of directives, and each group consists of multiple lines of instructions.

Each group begins with a user-agent and specifies:

  • Who the group applies to (the user-agent)
  • Which directories (pages) or files the agent should access
  • Which directories (pages) or files the agent shouldn't access
  • A sitemap (optional) to tell search engines which pages and files you deem important

Crawlers ignore lines that don't match these directives.

Imagine you don't want Google to crawl your "/clients/" directory because it's for internal use only.

The first group in your file would look like this block:

User-agent: Googlebot
Disallow: /clients/

You can add more instructions for Google after that, like the one below:

User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google

Then press enter twice to start a new group of directives.

Now imagine you want to prevent all search engines from accessing the "/archive/" and "/support/" directories.

A block preventing access to those directories looks like this:

User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google
User-agent: *
Disallow: /archive/
Disallow: /support/

When you're finished, add your sitemap:

User-agent: Googlebot
Disallow: /clients/
Disallow: /not-for-google
User-agent: *
Disallow: /archive/
Disallow: /support/
Sitemap: https://www.yourwebsite.com/sitemap.xml

Save the file as "robots.txt."

3. Upload the Robots.txt File

After saving your robots.txt file, upload the file to your site so search engines can find it.

The process of uploading your robots.txt file depends on your hosting environment.

Search online or contact your hosting provider for details.

For example, search "upload robots.txt file to [your hosting provider]" for platform-specific instructions.

Most popular website platforms also publish their own guides explaining how to upload robots.txt files.

After uploading, confirm that the file is accessible and that Google can read it.

4. Test Your Robots.txt File

First, verify that anyone can view your robots.txt file by opening a private browser window and entering your robots.txt URL.

For example, "https://semrush.com/robots.txt."

Example of https://semrush.com/robots.txt search

If you see your robots.txt content, test the markup.

Google provides two testing options:

  1. The robots.txt report in Search Console
  2. Google's open-source robots.txt library (advanced)

Use the robots.txt report in Search Console if you aren't an advanced user.

Open the robots.txt report

If you haven't connected your site to Search Console, add a property and verify site ownership first.

If you already have verified properties, select one from the drop-down after opening the robots.txt report.

The tool reports syntax warnings and errors.

Display of warnings and errors

Edit errors or warnings directly on the page and retest as you go.

Changes made within the robots.txt report aren't saved to your site's live robots.txt file, so copy and paste corrected code into your actual robots.txt file.

Semrush's Site Audit tool can also check for robots.txt issues.

Set up a project and run an audit.

When the tool is ready, navigate to the "Issues" tab and search for "robots.txt."

Search for "robots.txt" in Site Audit

Click "Robots.txt file has format errors" if it appears.

"Robots.txt file has format errors"

View the list of invalid lines.

List of invalid lines

Click "Why and how to fix it" for specific instructions.

"Why and how to fix it" section

Check your robots.txt file regularly. Even small errors can affect your site's indexability.

Robots.txt Best Practices

Use a New Line for Each Directive

Place each directive on its own line to ensure search engines can read them and follow the instructions.

Incorrect example:

User-agent: * Disallow: /admin/
Disallow: /directory/

Correct example:

User-agent: *
Disallow: /admin/
Disallow: /directory/

Use Each User-Agent Only Once

List each user-agent once to keep the file organized and reduce the chance of human error.

Confusing example:

User-agent: Googlebot
Disallow: /example-page
User-agent: Googlebot
Disallow: /example-page-2

Clean example:

User-agent: Googlebot
Disallow: /example-page
Disallow: /example-page-2

Writing all directives under the same user-agent is cleaner and helps you stay organized.

Use Wildcards to Clarify Directions

Use wildcards (*) to apply directives broadly.

To prevent search engines from accessing URLs with parameters, you could technically list them out one by one.

However, you can simplify your directions with a wildcard.

Inefficient example:

User-agent: *
Disallow: /shoes/vans?
Disallow: /shoes/nike?
Disallow: /shoes/adidas?

Efficient example:

User-agent: *
Disallow: /shoes/*?

The above example blocks all search engine bots from crawling all URLs under the "/shoes/" subfolder that contain a question mark.

Use '$' to Indicate the End of a URL

Use "$" to indicate the end of a URL.

To block search engines from crawling all files of a certain type, using "$" helps you avoid listing the files individually.

Inefficient:

User-agent: *
Disallow: /photo-a.jpg
Disallow: /photo-b.jpg
Disallow: /photo-c.jpg

Efficient:

User-agent: *
Disallow: /*.jpg$

Use "$" carefully because mistakes can lead to accidental unblocking.
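
As a quick, hedged illustration of how "$" can surprise you:

User-agent: *
# Blocks /photo.jpg but NOT /photo.jpg?size=large, because that URL doesn't end in ".jpg"
Disallow: /*.jpg$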

Use '#' to Add Comments

Add comments by starting a line with "#". Crawlers ignore anything that begins with a hash.

For example:

User-agent: *
#Landing Pages
Disallow: /landing/
Disallow: /lp/
#Files
Disallow: /files/
Disallow: /private-files/
#Websites
Allow: /website/*
Disallow: /website/search/*

Developers sometimes add humorous comments using hashes since most users never see the file.

For example, YouTube's robots.txt file reads: "Created in the distant future (the year 2000) after the robotic uprising of the mid 90's which wiped out all humans."

YouTube's robots.txt file example

And Nike's robots.txt reads "just crawl it" (a nod to its "just do it" tagline) and features the brand's logo.

Nike's robots.txt example

Use Separate Robots.txt Files for Different Subdomains

Robots.txt files only control crawling on the subdomain where they reside, which means you may need multiple files.

If your site is "domain.com" and your blog is "blog.domain.com," create a robots.txt file for both the domain's root directory and the blog's root directory, as sketched below.
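
A minimal sketch of that setup (the disallowed paths are hypothetical) looks like this:

# https://domain.com/robots.txt (applies only to domain.com)
User-agent: *
Disallow: /archive/

# https://blog.domain.com/robots.txt (applies only to blog.domain.com)
User-agent: *
Disallow: /drafts/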

5 Robots.txt Mistakes to Avoid

When creating your robots.txt file, watch out for the following common mistakes:

1. Not Including Robots.txt in the Root Directory

Your robots.txt file must be located in your site's root directory to ensure search engine crawlers can find it easily.

For example, if your website's homepage is "www.example.com," place the file at "www.example.com/robots.txt."

If you put it in a subdirectory, like "www.example.com/contact/robots.txt," search engines may not find it and may assume you haven't set any crawling instructions.

2. Using Noindex Instructions in Robots.txt

Don't use noindex instructions in robots.txt. Google doesn't support the noindex rule in the robots.txt file.

Instead, use meta robots tags (e.g., <meta name="robots" content="noindex">) on individual pages to control indexing.

noindex meta robots tag in page source code

3. Blocking JavaScript and CSS

Avoid blocking access to JavaScript and CSS files via robots.txt unless necessary (e.g., restricting access to sensitive data).

Blocking crawling of JavaScript and CSS files makes it difficult for search engines to understand your site's structure and content, which can harm your rankings.
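
For example, rules like the following (the asset paths are hypothetical) are usually a mistake because they hide rendering resources from crawlers:

User-agent: *
# Usually avoid this: blocking JS and CSS keeps Google from rendering the page properly
Disallow: /assets/js/
Disallow: /assets/css/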

Further reading: JavaScript SEO: How to Optimize JS for Search Engines

4. Not Blocking Access to Your Unfinished Site or Pages

Block search engines from crawling unfinished versions of your site to keep them from being found before you're ready (also use a meta robots noindex tag for each unfinished page).

Search engines crawling and indexing an in-development page may lead to a poor user experience and potential duplicate content issues.

Use robots.txt to keep unfinished content private until you're ready to launch.
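
A minimal sketch for an unfinished site (assuming it lives on its own hypothetical subdomain, such as staging.example.com) is simply:

# staging.example.com/robots.txt
User-agent: *
Disallow: /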

5. Using Absolute URLs

Use relative URLs in your robots.txt file to make it easier to manage and maintain.

Absolute URLs are unnecessary and can cause errors if your domain changes.

❌ Example with absolute URLs (not recommended):

User-agent: *
Disallow: https://www.example.com/private-directory/
Disallow: https://www.example.com/temp/
Allow: https://www.example.com/important-directory/

✅ Example with relative URLs (recommended):

User-agent: *
Disallow: /private-directory/
Disallow: /temp/
Allow: /important-directory/

Keep Your Robots.txt File Error-Free

Now that you understand how robots.txt files work, you should make sure yours is optimized. Even small mistakes can affect how your site is crawled, indexed, and displayed in search results.

Semrush's Site Audit tool makes it easy to analyze your robots.txt file for errors and provides actionable recommendations to fix any issues.