
Robots.txt Overkill?



I am hoping to get some input on my site's Robots.txt file. Some of the rules in it are likely outdated, and some may be overkill and could be hurting my site's crawlability, so I would like to start a discussion about it here. Most of it was put together long before IPB offered a standard version of the Robots.txt file. Some of what I include was borrowed from past posts here, and some of it may even have been poorly or incorrectly formatted, which is why I want input on it.

I've changed how my site handles profiles: a custom plugin adds a noindex to profiles without "About Us" info, but allows indexing of profiles that have info there.

Here is my file:

# START Default Rules for Invision Community (https://invisioncommunity.com)

User-Agent: *
# Block pages with no unique content
Disallow: /startTopic/
Disallow: /discover/unread/
Disallow: /markallread/
#Disallow: /staff/
Disallow: /online/
Disallow: /discover/
Disallow: /leaderboard/
Disallow: /search/
Disallow: /*?advancedSearchForm=
Disallow: /register/
Disallow: /lostpassword/
Disallow: /login/
#
# Block faceted pages and 301 redirect pages
Disallow: /*?sortby=
Disallow: /*?filter=
Disallow: /*?tab=
Disallow: /*?do=
Disallow: /*ref=
Disallow: /*?forumId*
#
# Block profile pages as these have little unique value, consume a lot of crawl time and contain hundreds of 301 links
#Disallow: /profile/
#
# END Default Rules for Invision Community (https://invisioncommunity.com)
#
#
# START CUSTOM RULES 
Disallow: /tags/
Disallow: /notifications/
#Disallow: /applications/
Disallow: /announcement/
Disallow: /*?*sortby=
Disallow: /*?*sort=
Disallow: /*?sort=
Disallow: /*?*sortdirection=
Disallow: /*?sortdirection=
Disallow: /*?set_template=mobile*
Disallow: /*&section=notifications*
Disallow: /*&do=topContributors*
Disallow: /*&do=askAQuestion*
Disallow: /*?app=core*
Disallow: /*?act=calendar*
Disallow: /*?act=rssout*
#Disallow: /articles/*/*/*/Page1.html/addfav
#Disallow: /articles/*/*/*/Page1.html/addread
#Disallow: /articles/*/*/*/Page1.html/print
Disallow: /profile/*/?do=*
Disallow: /profile/*/content/
Disallow: /profile/*/followers/
Disallow: /profile/*/reputation*
Disallow: /profile/0-Guest/*
Disallow: /blogs/*?view=grid*
Disallow: /blogs/*?view=list*
Disallow: /blogs/submit/*
Disallow: /calendar/*/week/
Disallow: /calendar/*/submit/
Disallow: /calendar/submit/*
Disallow: /clubs/*?view=grid*
Disallow: /clubs/*?view=list*
Disallow: /clubs/index.php?app=core*
# added to stop social share links
Disallow: /submit?url=*
#
# Custom Plugin (DP47) Bad Link Fixer for Bots
Disallow: /*&do=retrieveUrl*
Disallow: /*&do=retrieveUrl*
Disallow: /*?app=dp47badlinksfixer*
# Sitemaps
Sitemap: https://www.celiac.com/sitemap.php
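
One way to check whether rules in a file like this overlap is to test sample URLs against individual patterns. The sketch below is a rough illustration only, not the matcher any crawler actually uses. It assumes Google-style matching: a rule matches from the start of the path, `*` matches any run of characters, and a trailing `$` anchors the end of the URL. The helper names `rule_to_regex` and `is_blocked` and the sample URLs are made up for this example.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one Disallow pattern into a regex (assumed Google-style wildcards)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape everything literally, then let each '*' stand for any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_blocked(path: str, rule: str) -> bool:
    """Return True if the rule matches the path; re.match gives prefix semantics."""
    return rule_to_regex(rule).match(path) is not None


if __name__ == "__main__":
    # Hypothetical sample URLs, paired with rules taken from the file above.
    checks = [
        ("/discover/unread/", "/discover/"),
        ("/forums/?sortby=date", "/*?sortby="),
        ("/forums/?page=2&sortby=date", "/*?sortby="),
        ("/forums/?page=2&sortby=date", "/*?*sortby="),
        ("/index.php?app=core&module=x", "/*?app=core*"),
        ("/index.php?app=core&module=x", "/*?app=core"),
    ]
    for path, rule in checks:
        print(f"{rule!r:20} blocks {path!r}: {is_blocked(path, rule)}")
```

Under these matching assumptions, `/discover/` already covers `/discover/unread/`, `/*?*sortby=` covers everything `/*?sortby=` does plus cases where sortby is not the first parameter, and a trailing `*` does not change what a rule matches.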

 

Edited by sadams101

We would recommend and support only the default robots.txt that is included in the software. Anything else would be something you can discuss with the community at large so I will move this to the proper forum. If you are wanting to suggest things to be added or removed from the default robots.txt, you can certainly suggest those in our Feature Suggestion forum for further evaluation.


I am fairly sure that I added the tags there before there was noindex code on them. As I recall, my reason for doing this was that I kept getting flagged by Google for the search throttle delay. But thank you, this is exactly why I'm sharing this.

The question is, will Google start to crawl the tags again and slow my site (lots of searches slow things down)?


I did some research on using both the meta noindex AND robots.txt. If a page is not disallowed in robots.txt, then Google WILL access it, meaning it will follow the links on that page, which I don't want to happen. I found this here:

https://developers.google.com/search/blog/2007/03/using-robots-meta-tag

  • "If you allow a page with robots.txt but block it from being indexed using a meta tag, Googlebot will access the page, read the meta tag, and subsequently not index it."
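
The interaction described in that quote can be sketched as a small decision function. This is only a simplified model of the behaviour as I read it, not Google's actual logic. The function name `googlebot_outcome` and its return strings are invented for illustration, and it assumes a disallowed page is never fetched, so any meta tag on it is never read.

```python
def googlebot_outcome(disallowed_in_robots_txt: bool, has_noindex_meta: bool) -> str:
    """Simplified model of how robots.txt and a noindex meta tag interact."""
    if disallowed_in_robots_txt:
        # Assumption: a disallowed page is not fetched, so its meta tags go unseen.
        return "not fetched, meta tag never read"
    if has_noindex_meta:
        # Allowed in robots.txt: the page is fetched, the tag is read and honored.
        return "fetched, not indexed"
    return "fetched, may be indexed"


if __name__ == "__main__":
    for disallowed in (True, False):
        for noindex in (True, False):
            print(disallowed, noindex, "->", googlebot_outcome(disallowed, noindex))
```

In this model, a noindex tag only has an effect on pages that robots.txt lets the crawler fetch.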

The proper meta tag to stop Google from following the links on a page would be nofollow. It seems like the /tags pages should include nofollow, rather than only noindex, if the goal is to stop the bots from crawling through them.

