SEO - Robots.txt

What is a robots.txt file?

When Google or another search engine visits your site to read and store its content in a search index, it looks for a special file called robots.txt. This file is a set of instructions telling search engines which areas of your site they may crawl and which they must not. We can use these rules to ensure that search engines don't waste their time looking at links that have no valuable content, and to steer them away from links that produce faceted (filtered, duplicate) content.
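At its simplest, a robots.txt file pairs a User-Agent line (naming which crawlers the rules apply to, with * meaning all of them) with Disallow and Allow lines listing path prefixes. A minimal illustrative sketch, using hypothetical paths:

User-Agent: *
Disallow: /private/
Allow: /public/

Crawlers that honour the file will skip anything under /private/ while continuing to crawl /public/ and everything else.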

Why is this important?

Search engines aim to discover and store as many of the pages on the internet as possible. There are currently an estimated 4.5 billion active web pages. That's a lot of work for Google.

It cannot crawl and store every single page, so it must decide what to keep and how long to spend indexing pages on your site. This is called a crawl budget.

How many pages Google will index each day depends on many factors, including how fresh the site is, how much content you have and how popular your site is. Some websites will see Google index as few as 30 links a day. We want every link to count and not waste Google's time.

What does the suggested Robots.txt file do?

The Invision Community optimised rules exclude areas of the site that contain no unique content and instead merely redirect to existing topics, such as the leaderboard and the default activity stream. Also excluded are pages such as the privacy policy, cookie policy, log in and register pages. Submit buttons and filter links are excluded as well, to prevent faceted pages from being crawled. Finally, user profiles are excluded: they offer little content of value to Google but contain around 150 redirect links each. Given that Google has mere seconds to spend on your site, these links, which all exist elsewhere, eat up your crawl budget quickly.

What is the suggested Robots.txt file?

Here is the content of the suggested robots.txt file. Depending on your configuration, Invision Community can serve it automatically. If your community is installed in a subdirectory, you will need to place the file in the root of your site manually. For example, if your community is at /home/site/public_html/community/, you would create this robots.txt file and add it to /home/site/public_html/. The Admin CP will guide you through this.

User-Agent: *
# Block pages with no unique content
Disallow: /startTopic/
Disallow: /*?do=add
Disallow: /*?do=submit
Disallow: /discover/unread/
Disallow: /markallread/
Disallow: /staff/
Disallow: /online/
Disallow: /discover/
Disallow: /leaderboard/
Disallow: /search/
Disallow: /*?advancedSearchForm=
Disallow: /register/
Disallow: /lostpassword/
Disallow: /login/

# Block faceted pages and 301 redirect pages
Disallow: /*?sortby=
Disallow: /*?filter=
Disallow: /*?tab=comments
Disallow: /*?do=findComment
Disallow: /*?do=getLastComment
Disallow: /*?do=getNewComment
Disallow: /*?do=reportComment
Disallow: /*?do=markRead

# Block profile pages as these have little unique value, consume a lot of crawl time and contain hundreds of 301 links
Disallow: /profile/

# Sitemap URL
Sitemap: http://domain.tld/sitemap.php

*Note: if you are copying this file, you may need to add your path name and correct the sitemap URL.
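Before deploying, you can sanity-check the plain-prefix rules with Python's standard-library robots.txt parser. Note that urllib.robotparser follows the original robots.txt convention and does not understand the Googlebot-style * wildcards used above, so only simple prefix rules such as /profile/ can be verified this way; example.com below is a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

# A subset of the suggested rules; urllib.robotparser only supports
# plain path prefixes, so wildcard rules like /*?do=add are omitted.
rules = """\
User-Agent: *
Disallow: /profile/
Disallow: /leaderboard/
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Profile pages are blocked; ordinary topic pages remain crawlable.
print(parser.can_fetch("*", "https://example.com/profile/123-someuser/"))
print(parser.can_fetch("*", "https://example.com/topic/456-a-discussion/"))
```

Running this prints False for the profile URL and True for the topic URL, confirming the rules behave as intended for any crawler that respects them.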
