WebCMS
Posted June 3 (edited)

The default robots.txt on IC cloud disallows these:

    # Block faceted pages and 301 redirect pages
    Disallow: /*?sortby=
    Disallow: /*?filter=
    Disallow: /*?tab=
    Disallow: /*?do=
    Disallow: /*ref=
    Disallow: /*?forumId*
    Disallow: /*?&controller=embed

After we installed the two plugins below, Google Analytics started reporting errors, 9K pages on our fairly new site were deindexed, and traffic dropped drastically:

    Pages Comments/Reviews Tab Order, by @Adriano Faria (TB)
    Sort Questions Forums by Date, by @teraßyte

We also use the Movies app, which resulted in 5K similar errors ("Duplicate without user-selected canonical", etc.) for the movies already added. Here are a few sample errors:

    /movies/movie/27-the-international/?do=addMovie&type=movie&movieid=1101123
    /movies/movie/27-the-international/?do=getMovieDataMovieView&type=posters
    /movies/movie/27-the-international/?do=getMovieDataMovieView&type=images
    /movies/movie/13-migration/?do=getMovieDataMovieView&type=cast
    /movies/movie/13-migration/?do=getMovieDataMovieView&type=crew
    /movies/movie/13-migration/?tab=watchers
    /movies/movie/13-migration/?tab=reviews
    /movies/movie/13-migration/?tab=comments
    /movies/movie/13-migration/?do=getMovieDataMovieView&type=videos
    /articles/youtube-keyboard-shortcuts-r74/?tab=comments
    /articles/true-horror-the-cannibalistic-tragedy-of-the-donner-party/?tab=reviews
    /topic/517-how-sloths-survive-thrive-as-nature%E2%80%99s-couch-potato-60-minutes/?sortby=date
    /topic/437-stackoverflow%E2%80%99s-shockingly-simple-architecture/?sortby=votes
    /topic/613-work-visas-in-europe/?sortby=date
    /events/event/194-salaar/?do=embed
    /events/submit/?do=submit&id=19&y=2024&m=05&d=01&view=day

To overcome the errors, we switched to a custom robots.txt and commented out the lines below:

    # Block faceted pages and 301 redirect pages
    #Disallow: /*?sortby=
    Disallow: /*?filter=
    #Disallow: /*?tab=
    #Disallow: /*?do=
    Disallow: /*ref=
    Disallow: /*?forumId*
    #Disallow: /*?&controller=embed

That opened a can of worms: it produced more errors because it exposed other areas of the application that are not supposed to be indexed. So we had to disable the 2 required plugins for now and revert to the default robots.txt (it will take weeks or months for Google to reindex the pages).

Both plugins are free and work as expected, with no issues of their own. The plugins and the app above need the parameters we commented out in robots.txt (sortby, tab, do, embed), but they cannot use those default parameters because they are disallowed in robots.txt for a reason. A better approach would be to use other parameters such as sortby2, tab2, etc. for plugins/apps in the frontend and reuse the same sortby, tab... parameters in the backend.

The most graceful solution is to incorporate the logic of the 2 plugins above into the IC software, as these are sane, expected behaviors, so please add them to your roadmap.

I'm posting these issues here so the devs can coordinate with the IC team on alternate parameter names for apps/plugins that bypass the conditions in robots.txt (not sure who would decide the alternate parameter names).

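To see which of the default rules catches a given URL, the wildcard matching can be approximated in a few lines of Python. This is a minimal sketch, not the crawler's actual implementation: it assumes the common convention that `*` matches any run of characters and a trailing `$` anchors the end of the URL, and the helper names `rule_to_regex` and `blocking_rules` are made up for illustration.

```python
import re

# Default Disallow rules as quoted in the post above.
DEFAULT_DISALLOW = [
    "/*?sortby=",
    "/*?filter=",
    "/*?tab=",
    "/*?do=",
    "/*ref=",
    "/*?forumId*",
    "/*?&controller=embed",
]


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one robots.txt path rule into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end; everything else is literal, matched from the start.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


def blocking_rules(path: str, rules=DEFAULT_DISALLOW) -> list[str]:
    """Return every rule that matches the given path-plus-query string."""
    return [rule for rule in rules if rule_to_regex(rule).search(path)]


if __name__ == "__main__":
    for sample in (
        "/topic/613-work-visas-in-europe/?sortby=date",
        "/movies/movie/13-migration/?tab=reviews",
        "/movies/movie/13-migration/",
    ):
        print(sample, blocking_rules(sample))
```

Under this reading, a rule such as `/*?do=` only matches when `do` is the first query parameter, directly after the `?`, which is the shape of every sample URL listed above.
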
Matt (Management)
Posted June 3

It's hard for me to get a foothold on the issue without seeing the site. Can you link it here, or DM it if you want to keep it private? I'm sure there's an easy solution. A mix of a better robots.txt and perhaps getting some canonical links added will fix most of that.

WebCMS (Author)
Posted June 3

18 minutes ago, Matt said:
    It's hard for me to get a foothold on the issue without seeing the site. Can you link it here, or DM it if you want to keep it private? I'm sure there's an easy solution. A mix of a better robots.txt and perhaps getting some canonical links added will fix most of that.

https://www.telugus.com

We would prefer to use the default robots.txt provided by the software, as it is the most optimal. Resolving this at the software and apps/plugins level would fix the issue for ALL clients using the default robots.txt, instead of tweaking a custom robots.txt for each client who uses apps/plugins.

Matt (Management)
Posted June 3

Thanks, so I'll go through some of the links you're having issues with:

    /movies/movie/27-the-international/?do=addMovie&type=movie&movieid=1101123

Obviously you don't want Google to index submission forms; that's a waste of the crawl budget. The default do= rule will stop that from happening.

    /movies/movie/27-the-international/?do=getMovieDataMovieView&type=posters
    /movies/movie/27-the-international/?do=getMovieDataMovieView&type=images
    /movies/movie/13-migration/?do=getMovieDataMovieView&type=cast
    /movies/movie/13-migration/?do=getMovieDataMovieView&type=crew

The issue at the heart of this problem is that you have unique content without unique URLs. In an ideal world, you would have unique URLs such as /movies/movie/13-migration/cast/ and /movies/movie/13-migration/crew/, which would stop the problem as Google would see them as unique pages.

The issue is caused by those tabbed pages having a canonical URL of the main page, e.g.:

    <link rel="canonical" href="https://www.telugus.com/movies/movie/35-ride-on-%E9%BE%99%E9%A9%AC%E7%B2%BE%E7%A5%9E/" />

    /articles/true-horror-the-cannibalistic-tragedy-of-the-donner-party/?tab=reviews
    /topic/517-how-sloths-survive-thrive-as-nature%E2%80%99s-couch-potato-60-minutes/?sortby=date
    /topic/437-stackoverflow%E2%80%99s-shockingly-simple-architecture/?sortby=votes
    /topic/613-work-visas-in-europe/?sortby=date

The sort-by URLs should not be indexed. These are faceted pages, which Google is not keen on and which can waste crawl budget.

You said that this is a custom app? If so, direct the developer to this topic and ask them to add some FURL rules in /data/furl.json so that those tabbed pages have a unique URL, fixing both the duplication issue and the canonical issue.

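The canonical mismatch described above can be checked offline with Python's standard library. This is a rough sketch under stated assumptions: the class and function names are hypothetical, and the sample HTML in the test is illustrative rather than copied from the site. It reads the first `<link rel="canonical">` from a page and reports whether it points somewhere other than the URL that was requested.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_of(html: str):
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def points_elsewhere(requested_url: str, html: str) -> bool:
    """True when the declared canonical differs from the requested URL.

    Host, path and query string are compared; fragments are ignored.
    """
    canonical = canonical_of(html)
    if canonical is None:
        return False
    req, can = urlsplit(requested_url), urlsplit(canonical)
    return (req.netloc, req.path, req.query) != (can.netloc, can.path, can.query)
```

A tab URL such as `?tab=reviews` whose page declares the main record as canonical would return `True` here, which is the situation that leads to the duplicate reports discussed in this topic.
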
WebCMS (Author)
Posted June 3

4 hours ago, Matt said:
    Thanks, so I'll go through some of the links you're having issues with: [...]

@teraßyte @Adriano Faria

teraßyte
Posted June 3

My plugin simply changes the default sortby value from votes to date when none is specified in the URL/request. Everything else is then handed to the framework to handle behind the scenes. There's nothing I can really change with it. 🤷‍♂️

WebCMS (Author)
Posted June 3

4 minutes ago, teraßyte said:
    My plugin simply changes the default sortby value from votes to date when none is specified in the URL/request. Everything else is then handed to the framework to handle behind the scenes. There's nothing I can really change with it. 🤷‍♂️

sortby is blocked by the default robots.txt, which GA reported as a crawling issue.

teraßyte
Posted June 3

Hmm, wait. Are you saying Google can't index any pages from page 2 onward because they all have the sortby value in them (which is not there by default)?

If that's the case, I can add a check that skips adding it when a bot/search engine is viewing the topic, but I'm not sure whether Google would then complain that your site is showing bots different content from what guests see. Since they're not indexing any sortby links it should be fine, though. 🤔

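The bot check floated above could look roughly like the following. It is only a sketch of the idea in Python, not the plugin's code: the marker list, both function names, and the `page/N/` URL layout are assumptions made up for illustration.

```python
from urllib.parse import urlencode

# Hypothetical, deliberately short list of crawler markers.
BOT_MARKERS = ("googlebot", "bingbot")


def is_search_bot(user_agent: str) -> bool:
    """Crude user-agent sniffing: True if a known crawler marker appears."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def pagination_link(topic_path: str, page: int, user_agent: str,
                    default_sort: str = "date") -> str:
    """Build a link to another page of a topic (assumed URL layout).

    Crawlers get a link without the sortby parameter, so the link is not
    caught by a 'Disallow: /*?sortby=' rule; everyone else gets the
    plugin's default sort appended.
    """
    link = f"{topic_path}page/{page}/"
    if is_search_bot(user_agent):
        return link
    return link + "?" + urlencode({"sortby": default_sort})
```

The trade-off is the one already raised in the post: crawlers and guests would be handed different links for the same topic, so this is a workaround to weigh rather than a clear fix.
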
WebCMS (Author)
Posted June 3

4 hours ago, teraßyte said:
    Hmm, wait. Are you saying Google can't index any pages starting from 2 because they all have the sortby value in them (which is not there by default)?

Yes. These are used by apps/plugins but blocked in robots.txt, hence the crawling issues: sortby=, tab=, do=, /*?&controller=embed

4 hours ago, teraßyte said:
    If that's the case I can add a check to exclude adding it if a bot/search engine is viewing the topic, but I'm not sure if Google would then complain that your site is showing different content compared to guests. Since they're not indexing any sortby links it should be fine, though. 🤔

I'm not sure what Google would or would not do. As far as I know, altering behavior dynamically like this has almost always resulted in heartache when dealing with Google. Is there any other graceful solution, like using a different parameter name to bypass the default name in robots.txt?

teraßyte
Posted June 3

42 minutes ago, WebCMS said:
    Any other graceful solution? Like using a different parameter name to bypass the default name in robots?

I can't think of anything else, unfortunately. The sortby value is what the framework uses in several functions, some of which I can't even hook into. I don't think changing it to another variable is feasible.

Considering v5 won't have any hooks, and that Q&A forums are also going away, I don't plan on rewriting this plugin from scratch to change how it works to try and somehow work around this problem. At this point, your best option is to disable the plugin and wait for v5.

Dreadknux
Posted June 4

22 hours ago, Matt said:
    /movies/movie/27-the-international/?do=getMovieDataMovieView&type=posters
    /movies/movie/27-the-international/?do=getMovieDataMovieView&type=images
    /movies/movie/13-migration/?do=getMovieDataMovieView&type=cast
    /movies/movie/13-migration/?do=getMovieDataMovieView&type=crew
    The issue at the heart of this problem is that you have unique content without unique URLs. In an ideal world, you would have unique URLs such as: /movies/movie/13-migration/cast/ and /movies/movie/13-migration/crew/ which would stop the problem as Google would see them as unique pages. The issue is caused by those tabbed pages having a canonical URL of the main page, eg:
    <link rel="canonical" href="https://www.telugus.com/movies/movie/35-ride-on-%E9%BE%99%E9%A9%AC%E7%B2%BE%E7%A5%9E/" />

I wanted to ask about this point as it's something I'm struggling with on my own community. I have a "wiki"-like section set up on my community, using Pages, and I use tabs to squirrel away 'sections' of media or info about a given page (example below):

https://www.sonicstadium.org/wiki/games/mainline/sonic-the-hedgehog-16-bit/

If I wanted to split some of the content on this page into sub-pages (e.g. the "Prototypes and Mysteries" section), I wouldn't be able to create something like:

https://www.sonicstadium.org/wiki/games/mainline/sonic-the-hedgehog-16-bit/prototypes

for two reasons:

1. You cannot structure Pages in that way ("sonic-the-hedgehog-16-bit" would need to be a category, and 'prototypes' a page within that category).
2. It appears that Pages won't let you use the same name for a page twice, no matter where it is placed. So even if I could make '/sonic-the-hedgehog-16-bit/prototypes', I would then not be able to make '/sonic-the-hedgehog-3-and-knuckles/prototypes'; it would have to be called 'prototypes2' (or even '/sonic-the-hedgehog-3-and-knuckles/sonic-the-hedgehog-3-and-knuckles-prototypes', which is ridiculous).

I am probably getting ahead of some cool new/restructured Pages app features for V5 that will address the above, but I just wanted to get your thoughts on that.

Adriano Faria
Posted June 5 (edited)

Hi Matt,

On 6/3/2024 at 8:09 AM, Matt said:
    The issue at the heart of this problem is that you have unique content without unique URLs. In an ideal world, you would have unique URLs such as: /movies/movie/13-migration/cast/ and /movies/movie/13-migration/crew/ which would stop the problem as Google would see them as unique pages.

Just a question: how does the framework handle tab=comments or tab=reviews in Downloads, for example? All the tabs in the Movies app are created in the "commentReviewTabs" method of its content item. I mean:

    http://localhost/47X/index.php?/files/file/1-file-1/&tab=reviews

and

    http://localhost/47X/index.php?/files/file/1-file-1/&tab=comments

would generate the same "errors", right? I didn't find anything specific for them. Are they "excluded" somehow? Wouldn't a "noindex" on these links solve the issue?

Matt (Management)
Posted June 6

We just block tab= in the robots.txt by default. I am thinking about a better solution for v5 where each tab gets its own FURL. It would be easy to do: site.com/articles/123-record/tab-content/ or something.

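The per-tab FURL idea can be illustrated with a small mapping between the two URL shapes. This is a hypothetical sketch in Python, not how the framework or /data/furl.json works: the `tab-<name>/` path segment is borrowed from the example above, and both function names are invented.

```python
import re
from urllib.parse import parse_qs, urlsplit


def tab_query_to_furl(url: str) -> str:
    """Rewrite '.../record/?tab=reviews' as '.../record/tab-reviews/'.

    URLs without a tab parameter are returned unchanged.
    """
    parts = urlsplit(url)
    tabs = parse_qs(parts.query).get("tab")
    if not tabs:
        return url
    base = parts.path if parts.path.endswith("/") else parts.path + "/"
    return f"{base}tab-{tabs[0]}/"


def furl_to_tab(path: str):
    """Split a tab-style path back into (record path, tab name or None)."""
    match = re.fullmatch(r"(.*/)tab-([A-Za-z0-9_]+)/", path)
    if match is None:
        return path, None
    return match.group(1), match.group(2)
```

With unique paths like these, each tab could carry its own canonical URL and the blanket `tab=` rule in robots.txt would no longer apply to it. One limitation of the sketch is that a record slug that itself begins with `tab-` would be misread as a tab.
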
aia
Posted June 6

1 hour ago, Matt said:
    I am thinking about a better solution for v5 where each tab gets its own FURL.

Yes, FURLs are the way. I'd prefer to have FURLs for everything that should be indexable by search engines, and just one simple rule in robots.txt that disallows indexing of anything with additional (non-FURL) parameters, instead of a bunch of disallowed parameters.

WebCMS (Author)
Posted June 20

Hi @Adriano Faria @Matt

Any feedback on these issues?

Not found (404):
    /{url="app=movies&module=movies&controller=nowplaying&do=getPersonDetail&personid=3686772
    /{url="app=movies&module=movies&controller=nowplaying&do=getPersonDetail&personid=3292

Excluded by ‘noindex’ tag:
    /movies/movie/46-divergent/
    /movies/movie/47-insurgent/

Blocked due to other 4xx issue:
    /{url="app=movies&module=movies&controller=nowplaying&do=getPersonDetail&personid=2120814

Adriano Faria
Posted June 20 (edited)

3 hours ago, WebCMS said:
    Not found (404)

The links exist and point to a controller that displays a popup when you click on any movie in the Now Playing widget. Example:

    .../index.php?app=movies&module=movies&controller=nowplaying&id=573435

The 3 links you provided work for me, and they go to the same controller to display data for the cast/crew.

WebCMS (Author)
Posted June 21

On 6/20/2024 at 1:43 PM, Adriano Faria said:
    The links exist and point to a controller to display a popup when you click on any movie from the widget Now Playing: Example: .../index.php?app=movies&module=movies&controller=nowplaying&id=573435 The 3 links you provided work for me and they go to the same controller to display data from the cast/crew:

In the link, do=... is blocked by robots.txt, which is why Google is not able to crawl it:

    /{url="app=movies&module=movies&controller=nowplaying&do=getPersonDetail&personid=3292

These are not in robots.txt, but I'm not sure why they are excluded by a noindex tag and hence not crawled:

Excluded by ‘noindex’ tag:
    /movies/movie/46-divergent/
    /movies/movie/47-insurgent/

Adriano Faria
Posted June 21 (edited)

1 hour ago, WebCMS said:
    In the link, do=... is blocked by robots.txt which is why Google is not able to crawl it: /{url="app=movies&module=movies&controller=nowplaying&do=getPersonDetail&personid=3292

I’m not sure what you are asking here, honestly. DO is the way to call a function in the controller. If you are asking me to change everything just because you want to get better results, sorry, that won’t happen.

You are free to change your install and do whatever you want with it. Start by adding a noindex on it.

Plus, support is provided on my board only, so stop mentioning me here.
