
Severe indexing issues from using plugins and apps vs robots.txt



Posted (edited)

The default robots.txt on IC cloud disallows the following:

# Block faceted pages and 301 redirect pages
Disallow: /*?sortby=
Disallow: /*?filter=
Disallow: /*?tab=
Disallow: /*?do=
Disallow: /*ref=
Disallow: /*?forumId*
Disallow: /*?&controller=embed


After we installed the two plugins below, Google Analytics started reporting errors, about 9K pages on our fairly new site were deindexed, and traffic dropped drastically:

Pages Comments/Reviews Tab Order By @Adriano Faria


(TB) Sort Questions Forums by Date By @teraßyte


We also use the Movies app, which resulted in about 5K similar errors ("Duplicate without user-selected canonical", etc.) for the movies already added. Here are a few sample URLs from the error reports:

/movies/movie/27-the-international/?do=addMovie&type=movie&movieid=1101123
/movies/movie/27-the-international/?do=getMovieDataMovieView&type=posters
/movies/movie/27-the-international/?do=getMovieDataMovieView&type=images
/movies/movie/13-migration/?do=getMovieDataMovieView&type=cast
/movies/movie/13-migration/?do=getMovieDataMovieView&type=crew
/movies/movie/13-migration/?tab=watchers
/movies/movie/13-migration/?tab=reviews
/movies/movie/13-migration/?tab=comments
/movies/movie/13-migration/?do=getMovieDataMovieView&type=videos
/articles/youtube-keyboard-shortcuts-r74/?tab=comments
/articles/true-horror-the-cannibalistic-tragedy-of-the-donner-party/?tab=reviews
/topic/517-how-sloths-survive-thrive-as-nature%E2%80%99s-couch-potato-60-minutes/?sortby=date
/topic/437-stackoverflow%E2%80%99s-shockingly-simple-architecture/?sortby=votes
/topic/613-work-visas-in-europe/?sortby=date
/events/event/194-salaar/?do=embed
/events/submit/?do=submit&id=19&y=2024&m=05&d=01&view=day
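
To see which default rule catches which of the sample URLs above, here is a minimal Python sketch of robots.txt wildcard matching as Google documents it (`*` matches any run of characters, a trailing `$` anchors the end, and a rule matches from the start of the path). The helper names `rule_to_regex` and `blocking_rule` are made up for illustration; this is not how IC or Google actually implement it.

```python
import re

# Default Disallow patterns quoted earlier in this post.
DEFAULT_RULES = [
    "/*?sortby=",
    "/*?filter=",
    "/*?tab=",
    "/*?do=",
    "/*ref=",
    "/*?forumId*",
    "/*?&controller=embed",
]


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow pattern: '*' is a wildcard, a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def blocking_rule(path_and_query: str, rules=DEFAULT_RULES):
    """Return the first rule matching from the start of the URL, or None."""
    for rule in rules:
        if rule_to_regex(rule).match(path_and_query):
            return rule
    return None


samples = [
    "/movies/movie/13-migration/?tab=reviews",
    "/topic/613-work-visas-in-europe/?sortby=date",
    "/events/event/194-salaar/?do=embed",
    "/movies/movie/13-migration/",  # plain item URL, no query string
]
for url in samples:
    print(url, "->", blocking_rule(url))
```

Under this reading, a rule such as /*?sortby= only matches when sortby comes directly after the ?, so a hypothetical URL like /topic/1-x/?page=2&sortby=date would not be caught by it.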


To overcome the errors, we switched to a custom robots.txt and commented out the lines below:

# Block faceted pages and 301 redirect pages
#Disallow: /*?sortby=
Disallow: /*?filter=
#Disallow: /*?tab=
#Disallow: /*?do=
Disallow: /*ref=
Disallow: /*?forumId*
#Disallow: /*?&controller=embed

This opened a can of worms with more errors, as it exposed other areas of the application that are not supposed to be indexed. So we had to disable the 2 required plugins for now and revert to the default robots.txt (it will take weeks or months for Google to reindex the pages).

Both plugins are free and work as expected, with no issues of their own.

The plugins and app above rely on the parameters we commented out in robots.txt (sortby, tab, do, embed), but they should not use them, because those parameters are disallowed in the default robots.txt for a reason. A better approach would be to use other parameter names such as sortby2, tab2, etc. for plugins/apps in the frontend and reuse the same sortby, tab... parameters in the backend.

The most graceful solution to these issues is to incorporate the logic of the 2 plugins above into the IC software, as these are sane, expected behaviors. Please add these to your roadmap.

I'm posting these issues here so the devs could coordinate with the IC team in figuring out alternate parameter names for apps/plugins to bypass the conditions in robots.txt (not sure who would decide alternate parameter names).

Edited by WebCMS

  • Management

It's hard for me to get a foothold on the issue without seeing the site. Can you link it here, or DM it if you want to keep it private?

I'm sure there's an easy solution. A mix of better robots.txt and perhaps getting some canonical links added will fix most of that.


18 minutes ago, Matt said:

It's hard for me to get a foothold on the issue without seeing the site. Can you link it here, or DM it if you want to keep it private?

I'm sure there's an easy solution. A mix of better robots.txt and perhaps getting some canonical links added will fix most of that.

https://www.telugus.com

We would prefer to use the default robots.txt provided by the software, as it is the most optimized. Resolving this at the software and apps/plugins level would fix the issue for ALL clients using the default robots.txt, instead of tweaking a custom robots.txt for each client who uses apps/plugins.


  • Management

Thanks, so I'll go through some of the links you're having issues with:

/movies/movie/27-the-international/?do=addMovie&type=movie&movieid=1101123

Obviously you don't want Google to index submission forms; that's a waste of crawl budget. The default do= rule will stop that from happening.

/movies/movie/27-the-international/?do=getMovieDataMovieView&type=posters
/movies/movie/27-the-international/?do=getMovieDataMovieView&type=images
/movies/movie/13-migration/?do=getMovieDataMovieView&type=cast
/movies/movie/13-migration/?do=getMovieDataMovieView&type=crew

The issue at the heart of this problem is that you have unique content without unique URLs. In an ideal world, you would have unique URLs such as: /movies/movie/13-migration/cast/ and /movies/movie/13-migration/crew/ which would stop the problem as Google would see them as unique pages. The issue is caused by those tabbed pages having a canonical URL of the main page, eg:

<link rel="canonical" href="https://www.telugus.com/movies/movie/35-ride-on-%E9%BE%99%E9%A9%AC%E7%B2%BE%E7%A5%9E/" />
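
One way to check which canonical URL a tabbed page declares is to parse it out of the page source. The following is a minimal sketch using only Python's standard library; the `example.com` URL and the sample markup are placeholders, and `CanonicalFinder` / `canonical_of` are illustrative names rather than anything from the IC framework.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_canonical = tag == "link" and (attr.get("rel") or "").lower() == "canonical"
        if is_canonical and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_of(html: str):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Placeholder markup standing in for what a tab URL might return.
TAB_URL = "https://example.com/movies/movie/13-migration/?tab=reviews"
TAB_PAGE_HTML = """
<html><head>
<link rel="canonical" href="https://example.com/movies/movie/13-migration/" />
</head><body>Reviews tab</body></html>
"""

declared = canonical_of(TAB_PAGE_HTML)
print("declared canonical:", declared)
print("self-canonical:", declared == TAB_URL)
```

If the declared canonical differs from the URL that was fetched, the tab page is telling search engines that its content belongs to the main page.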

/articles/true-horror-the-cannibalistic-tragedy-of-the-donner-party/?tab=reviews
/topic/517-how-sloths-survive-thrive-as-nature%E2%80%99s-couch-potato-60-minutes/?sortby=date
/topic/437-stackoverflow%E2%80%99s-shockingly-simple-architecture/?sortby=votes
/topic/613-work-visas-in-europe/?sortby=date

The sortby URLs should not be indexed. These are faceted pages, which Google is not keen on and which can waste crawl budget.

You said that this is a custom app? If so, direct the developer to this topic and ask them to add some FURL rules in /data/furl.json so that those tabbed pages have a unique URL fixing the duplication issue, and canonical issue.
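
To illustrate the idea of giving each tab its own URL, here is a small Python sketch that rewrites a query-string tab URL into a path-style one. It is not the /data/furl.json format and not IC code; the function name `tab_url_to_path` and the choice of parameter names are assumptions made purely for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def tab_url_to_path(url: str, param: str = "tab") -> str:
    """Rewrite '/item/?tab=cast' as '/item/cast/'.

    Works on site-relative URLs. Other query parameters are dropped, and
    URLs that do not carry the given parameter are returned unchanged.
    """
    parts = urlsplit(url)
    values = parse_qs(parts.query).get(param)
    if not values:
        return url
    base = parts.path if parts.path.endswith("/") else parts.path + "/"
    return f"{base}{values[0]}/"


print(tab_url_to_path("/movies/movie/13-migration/?tab=reviews"))
print(tab_url_to_path(
    "/movies/movie/13-migration/?do=getMovieDataMovieView&type=cast",
    param="type",
))
print(tab_url_to_path("/movies/movie/13-migration/"))
```

With path-style URLs like these, each tab can declare itself as its own canonical URL, and none of them contain the query parameters that the default robots.txt blocks.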


4 hours ago, Matt said:

You said that this is a custom app? If so, direct the developer to this topic and ask them to add some FURL rules in /data/furl.json so that those tabbed pages have a unique URL fixing the duplication issue, and canonical issue.

@teraßyte

@Adriano Faria


My plugin simply changes the default sortby value from votes to date when there is none specified in the URL/request. Everything else is then handed to the framework to handle behind the scenes. There's nothing I can really change with it. 🤷‍♂️


4 minutes ago, teraßyte said:

My plugin simply changes the default sortby value from votes to date when there is none specified in the URL/request. Everything else is then handed to the framework to handle behind the scenes. There's nothing I can really change with it. 🤷‍♂️

sortby is blocked by the default robots.txt, and GA reported that as a crawling issue.


Hmm, wait. Are you saying Google can't index any pages starting from 2 because they all have the sortby value in them (which is not there by default)?

If that's the case I can add a check to exclude adding it if a bot/search engine is viewing the topic, but I'm not sure if Google would then complain that your site is showing different content compared to guests. Since they're not indexing any sortby links it should be fine, though. 🤔


4 hours ago, teraßyte said:

Hmm, wait. Are you saying Google can't index any pages starting from 2 because they all have the sortby value in them (which is not there by default)?

Yes. These parameters are used by apps/plugins but blocked in robots.txt, hence the crawling issues:

sortby=, tab=, do=, /*?&controller=embed

4 hours ago, teraßyte said:

If that's the case I can add a check to exclude adding it if a bot/search engine is viewing the topic, but I'm not sure if Google would then complain that your site is showing different content compared to guests. Since they're not indexing any sortby links it should be fine, though. 🤔

I'm not sure what Google would or would not do. As far as I know, altering behavior dynamically like this has almost always resulted in heartache when dealing with Google.

Any other graceful solution? Like using a different parameter name to bypass the default name in robots?


42 minutes ago, WebCMS said:

Any other graceful solution? Like using a different parameter name to bypass the default name in robots?

I can't think of anything else, unfortunately. The sortby value is what the framework uses in several functions, some of which I can't even hook in. I don't think changing it to another variable is feasible.

 

Considering v5 won't have any hooks, and that Q&A forums are also going away, I don't plan on rewriting this plugin from scratch to change how it works to try and somehow work around this problem.

At this point, your best option is to disable the plugin and wait for v5.


22 hours ago, Matt said:

/movies/movie/27-the-international/?do=getMovieDataMovieView&type=posters
/movies/movie/27-the-international/?do=getMovieDataMovieView&type=images
/movies/movie/13-migration/?do=getMovieDataMovieView&type=cast
/movies/movie/13-migration/?do=getMovieDataMovieView&type=crew

The issue at the heart of this problem is that you have unique content without unique URLs. In an ideal world, you would have unique URLs such as: /movies/movie/13-migration/cast/ and /movies/movie/13-migration/crew/ which would stop the problem as Google would see them as unique pages. The issue is caused by those tabbed pages having a canonical URL of the main page, eg:

<link rel="canonical" href="https://www.telugus.com/movies/movie/35-ride-on-%E9%BE%99%E9%A9%AC%E7%B2%BE%E7%A5%9E/" />

I wanted to ask about this point as it's something I'm struggling with on my own community. I have a "wiki" like section set up on my community, using Pages, and I use tabs to squirrel away 'sections' of media or info about a given page (example below):

https://www.sonicstadium.org/wiki/games/mainline/sonic-the-hedgehog-16-bit/

If I wanted to split some of the content on this page into sub-pages (i.e. the "Prototypes and Mysteries" section), I wouldn't be able to create something like:

https://www.sonicstadium.org/wiki/games/mainline/sonic-the-hedgehog-16-bit/prototypes

for two reasons:

  1. You cannot structure Pages in that way ("sonic-the-hedgehog-16-bit" would need to be a category, and 'prototypes' a page within that category)
  2. It appears that Pages won't let you use the same name for a page twice, no matter where it is placed. So, assuming I could make '/sonic-the-hedgehog-16-bit/prototypes', I then would not be able to make '/sonic-the-hedgehog-3-and-knuckles/prototypes'; it'd have to be called 'prototypes2' (or even '/sonic-the-hedgehog-3-and-knuckles/sonic-the-hedgehog-3-and-knuckles-prototypes', which is ridiculous).

I am probably getting ahead of some cool new/restructured Pages app features for V5 that will address the above, but just wanted to get your thoughts on that.

 


Posted (edited)

Hi Matt,

On 6/3/2024 at 8:09 AM, Matt said:

The issue at the heart of this problem is that you have unique content without unique URLs. In an ideal world, you would have unique URLs such as: /movies/movie/13-migration/cast/ and /movies/movie/13-migration/crew/ which would stop the problem as Google would see them as unique pages.

Just a question: how does the framework handle tab=comments or tab=reviews in Downloads, for example? All the tabs in the Movies app are created in the "commentReviewTabs" method of its content item. I mean:

Quote

and

Quote

would generate the same "errors", right?

I didn't find anything specific for them. Are they "excluded" somehow?

Wouldn't a "noindex" in these links solve the issue?
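
For reference, a noindex can be sent either as a robots meta tag or as an `X-Robots-Tag` response header. One caveat worth keeping in mind: a crawler only sees a noindex on pages it is allowed to fetch, so a URL that is disallowed in robots.txt never has its noindex read. Below is a minimal Python sketch of the header variant; the parameter set and the function name `robots_headers` are assumptions for illustration, not part of the IC framework.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: the query parameters whose pages should stay out of the index.
NOINDEX_PARAMS = {"tab", "sortby", "do"}


def robots_headers(url: str) -> dict:
    """Return extra response headers for a URL.

    URLs carrying any of the listed parameters get a noindex directive,
    while their links can still be followed.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if query.keys() & NOINDEX_PARAMS:
        return {"X-Robots-Tag": "noindex, follow"}
    return {}


print(robots_headers("/articles/youtube-keyboard-shortcuts-r74/?tab=comments"))
print(robots_headers("/articles/youtube-keyboard-shortcuts-r74/"))
```

For this to have any effect, the matching Disallow lines would have to be lifted for those URLs, which is the trade-off already described earlier in this topic.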

Edited by Adriano Faria

1 hour ago, Matt said:

I am thinking about a better solution for v5 where each tab gets its own FURL.

Yes, FURLs are the way. I'd prefer to have FURLs for everything that should be indexable by search engines, and just a simple rule in robots.txt that disallows indexing of anything with additional (non-FURL) parameters instead of a bunch of disallowed parameters.
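
As a rough sketch of what that single rule would mean in practice: a robots.txt line such as `Disallow: /*?` blocks every URL that contains a query string, so only clean FURL paths remain crawlable. The Python below mimics that one rule; the function name and the path-style tab URL are hypothetical.

```python
def allowed_by_blanket_rule(path_and_query: str) -> bool:
    """Mimic a single 'Disallow: /*?' rule: any URL containing '?' is blocked."""
    return "?" not in path_and_query


urls = [
    "/movies/movie/13-migration/",                  # FURL, stays crawlable
    "/movies/movie/13-migration/cast/",             # hypothetical per-tab FURL
    "/movies/movie/13-migration/?tab=reviews",      # query parameter, blocked
    "/topic/613-work-visas-in-europe/?sortby=date", # query parameter, blocked
]
for url in urls:
    print(url, "->", "allowed" if allowed_by_blanket_rule(url) else "blocked")
```

The flip side is that any page that still depends on a query string would also be blocked, so everything meant to be indexed would need a FURL first.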

