Posted April 16, 2010
Hi there, I want to start a discussion here on how we can optimize our default robots.txt file so that it can be updated in future IP.Board versions. :) It's simple: share your experiences. Use services like Google Webmaster Tools, Bing Webmaster Center and Yahoo! Site Explorer to identify which pages are getting duplicated or are throwing errors/problems for these crawlers. From what I've already detected (if I'm not wrong), we can block 5 more URLs:
Disallow: /*?s=
Disallow: /*&s=
Disallow: /index.php?app=core&module=global&section=login&do=deleteCookies
Disallow: /index.php?app=forums&module=extras&section=rating
Disallow: /index.php?app=forums&module=forums&section=markasread
I still haven't put these lines in my own robots.txt, but I tested them in Google Webmaster Tools (GMT) and I'm convinced they will have a positive impact: 1) fewer useless indexed pages, 2) less duplicated content and 3) fewer HTML suggestions from GMT.
For the record, crawlers hate this: http://www.google.com/search?q=intitle%3A%22Board+Message%22+%22An+Error+Occurred%22+site%3Acommunity.invisionpower.com&filter=0
All these duplicated and useless pages have a negative impact on our rank in rigorous crawlers (like Google). We need to block them to reduce our penalty. :( I'm still analysing my GMT reports and I'll update here with every new useless URL I find. But I want more people involved, sharing their knowledge. :) Sorry if my English is not perfect, I still need to dedicate more time to learning it. :huh: Best regards,
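Before putting rules like these live, it can help to check them offline against a few sample URLs. The following is only a minimal sketch of the wildcard matching that Google-style crawlers document (`*` matches any run of characters, a trailing `$` anchors the end of the URL). The helper names `rule_to_regex` and `is_blocked` and the sample paths are made up for illustration, and `Allow` rules and per-crawler differences are not modelled.

```python
import re
from typing import Iterable


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one robots.txt path rule into a compiled regex.

    `*` becomes "any run of characters"; a trailing `$` anchors the end.
    Everything else is matched literally, from the start of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if any non-empty Disallow rule matches the path."""
    return any(rule_to_regex(r).match(path) for r in disallow_rules if r)


# The five rules proposed in the post above.
RULES = [
    "/*?s=",
    "/*&s=",
    "/index.php?app=core&module=global&section=login&do=deleteCookies",
    "/index.php?app=forums&module=extras&section=rating",
    "/index.php?app=forums&module=forums&section=markasread",
]

if __name__ == "__main__":
    # Hypothetical sample paths, not taken from a real board.
    for path in [
        "/index.php?s=0123456789abcdef",
        "/index.php?showtopic=1&s=0123456789abcdef",
        "/index.php?showtopic=1",
        "/index.php?app=forums&module=forums&section=markasread&f=2",
    ]:
        print(path, "->", "blocked" if is_blocked(path, RULES) else "allowed")
```

Note that `/*?s=` only matches when `s` is the first query parameter, which is why the `/*&s=` rule is listed as well.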
April 16, 2010
IP.Board 3.1 uses the correct HTTP headers on error pages, so this won't be a problem in future. I don't think it's a good idea to block the session variable; it could have a knock-on effect when robots first visit your site. You can either force your PHP setup to use cookies only, or wait for the session bug to be fixed.
April 16, 2010
Session ids are not placed in the URL for spiders we identify, so you shouldn't have to try to block session ids in the URL via robots.txt. Our session class actually wipes out the session ID in memory when it sees a bot.
if ( $uAgent['uagent_type'] == 'search' )
{
    self::setSearchEngine( $uAgent );
    /* Reset some data */
    $this->session_type = 'cookie';
    $this->session_id = "";
}
April 16, 2010
Sure? In Google search results I have a looooooooooot of these entries:
forums.domain.tld/blogs/page__s__f0fafa4bb50ae01f6c1a7a92e7f16e8c
forums.domain.tld/blogs/page__s__08ab014e65b12c37833dbc33a8cb3f43
forums.domain.tld/blogs/page__s__33c29f46885918b68bd9df8867f4826e
etc...
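The three entries above differ only in the value after `__s__`, so they are the same page indexed under different session IDs. As a rough sketch of how such duplicates could be spotted in an exported list of indexed URLs, the snippet below strips the session part and groups what is left. It assumes session IDs always look like `__s__` followed by 32 lowercase hex characters, which is inferred from these three examples only; `strip_session` and `group_by_canonical` are hypothetical helper names, and the grouping key is just a comparison key, not necessarily a valid URL.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumption: a session ID is "__s__" plus 32 lowercase hex characters,
# as in the examples quoted in the post above.
SESSION_RE = re.compile(r"__s__[0-9a-f]{32}")


def strip_session(url: str) -> str:
    """Remove the session ID segment so duplicates compare equal."""
    return SESSION_RE.sub("", url)


def group_by_canonical(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group URLs that differ only by their session ID."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        groups[strip_session(url)].append(url)
    return dict(groups)


INDEXED = [
    "forums.domain.tld/blogs/page__s__f0fafa4bb50ae01f6c1a7a92e7f16e8c",
    "forums.domain.tld/blogs/page__s__08ab014e65b12c37833dbc33a8cb3f43",
    "forums.domain.tld/blogs/page__s__33c29f46885918b68bd9df8867f4826e",
]

if __name__ == "__main__":
    for key, members in group_by_canonical(INDEXED).items():
        if len(members) > 1:
            print(f"{len(members)} indexed copies of {key}")
```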
April 16, 2010
"Session ids are not placed in the URL for spiders we identify, so you shouldn't have to try to block session ids in the URL via robots.txt. Our session class actually wipes out the session ID in memory when it sees a bot."
In theory. :P Here's the practice: http://www.google.com/search?q=site%3Acommunity.invisionpower.com%2Fcalendar+-inurl%3A(day|week|event)&filter=0 (just an example) :( Regards,
Edit: Oops, answered in the wrong browser/account. :P
April 16, 2010
Fair enough. I've added the suggestions in the first post for the next update. Naturally that won't change anything for you, since we only ship a default robots.txt under a different name, so you'd need to put them in your actual robots.txt yourself if you already have one.
April 16, 2010 (thread author)
"IP.Board 3.1 uses the correct HTTP headers on error pages, so this won't be a problem in future."
I don't know precisely, but I think that even when these pages return 403 errors, crawlers will still waste your traffic unnecessarily by requesting them. Blocking the URLs in robots.txt stops compliant crawlers from following these addresses at all. I can imagine how much this would cost boards with huge traffic. :ermm: Just my POV. :) Regards,
April 16, 2010
"I don't know precisely, but I think that even when these pages return 403 errors, crawlers will still waste your traffic unnecessarily by requesting them. Blocking the URLs in robots.txt stops compliant crawlers from following these addresses at all. I can imagine how much this would cost boards with huge traffic. :ermm: Just my POV. :) Regards,"
Are you going to add a new rule to your robots.txt every time you move a topic to a private area or delete it?
April 16, 2010 (thread author)
"Are you going to add a new rule to your robots.txt every time you move a topic to a private area or delete it?"
No, but I know that not doing this has a very negative impact. It seems that IP.Board 3.1 will help with this, but the "fix" is not immediate: crawlers are quite slow to remove pages from their index. Regards,
April 16, 2010
Can someone please explain to me whether the URL below should be blocked or not? If not, then why? http://domain.com/forum/index.php?app=forums&module=forums&section=findpost&pid=38832
April 18, 2010
Mine:
User-agent: *
Disallow: /foorumi/admin/
Disallow: /foorumi/cache/
Disallow: /foorumi/converge_local/
Disallow: /foorumi/hooks/
Disallow: /foorumi/ips_kernel/
Disallow: /foorumi/retail/
Disallow: /foorumi/public/js/
Disallow: /foorumi/public/style_captcha/
Disallow: /foorumi/public/style_css/
Disallow: /foorumi/index.php?action=verificationcode
Disallow: /foorumi/index.php?app=core&module=task
Disallow: /foorumi/index.php?app=core&module=usercp&tab=forums&area=forumsubs
Disallow: /foorumi/index.php?app=core&module=usercp&tab=forums&area=watch&watch=topic
Disallow: /foorumi/index.php?app=forums&module=extras&section=forward
Disallow: /foorumi/index.php?app=members&module=messaging
Disallow: /foorumi/index.php?app=members&module=chat
Disallow: /foorumi/index.php?app=members&module=search
Disallow: /foorumi/index.php?app=members&module=search&do=active
Disallow: /foorumi/index.php?&unlockUserAgent=1
Disallow: /foorumi/index.php?app=core&module=global&section=login&do=deleteCookies
Disallow: /foorumi/index.php?app=forums&module=extras&section=rating
Disallow: /foorumi/index.php?app=forums&module=forums&section=markasread
Disallow: /*app=core&module=usercp
Disallow: /*app=core&module=usercp
Disallow: /*app=members&module=messaging
Disallow: /*&p=
Disallow: /*&pid=
Disallow: /*&hl=
Disallow: /*&start=
Disallow: /*view__getnewpost$
Disallow: /*view__getlastpost$
Disallow: /*view__old$
Disallow: /*view__new$
Disallow: /*view__getfirst$
Disallow: /*view__getprevious$
Disallow: /*view__getnext$
Disallow: /*view__getlast$
Disallow: /*&view=getnewpost$
Disallow: /*&view=getlastpost$
Disallow: /*&view=old$
Disallow: /*&view=new$
Disallow: /*&view=getfirst$
Disallow: /*&view=getprevious$
Disallow: /*&view=getnext$
Disallow: /*&view=getlast$
Disallow: /*?s=
Disallow: /*&s=
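A file of this size tends to collect repeated and overlapping rules: here `/*app=core&module=usercp` is listed twice, and the `module=search&do=active` rule is already covered by the shorter `module=search` rule before it. The sketch below shows one way such entries could be flagged. It is deliberately simple: it ignores `User-agent` groups and `Allow` lines, and it only detects redundancy against plain prefix rules (rules with no `*` or `$`). The function names are invented for this example.

```python
from collections import Counter
from typing import List, Tuple


def disallow_rules(robots_txt: str) -> List[str]:
    """Collect the values of all non-empty Disallow lines, in order."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            rules.append(value.strip())
    return rules


def find_duplicates(rules: List[str]) -> List[str]:
    """Rules that appear more than once, in first-seen order."""
    return [rule for rule, count in Counter(rules).items() if count > 1]


def find_redundant(rules: List[str]) -> List[Tuple[str, str]]:
    """Pairs (rule, shorter_rule) where shorter_rule already covers rule.

    Only plain prefix rules are considered as the covering rule, because
    every path matched by a rule starts with that rule's literal prefix.
    """
    unique = list(dict.fromkeys(rules))
    plain = [r for r in unique if "*" not in r and "$" not in r]
    covered = []
    for rule in unique:
        for prefix in plain:
            if rule != prefix and rule.startswith(prefix):
                covered.append((rule, prefix))
                break
    return covered


# Excerpt of the file posted above.
EXCERPT = """\
User-agent: *
Disallow: /foorumi/index.php?app=members&module=search
Disallow: /foorumi/index.php?app=members&module=search&do=active
Disallow: /*app=core&module=usercp
Disallow: /*app=core&module=usercp
Disallow: /*&p=
Disallow: /*&pid=
"""

if __name__ == "__main__":
    rules = disallow_rules(EXCERPT)
    print("duplicates:", find_duplicates(rules))
    print("redundant:", find_redundant(rules))
```

Duplicate and covered rules are harmless to crawlers; removing them only makes the file easier to maintain.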
June 28, 2010 (thread author)
I'm still very busy here, but what I can say for now is that there is some quite strange behavior in user profile pages, as you can see here: http://www.google.com/search?q=inurl:"community.invisionpower.com/user/94759-paulo+-freitas"&filter=0 I don't know where they're coming from, but the results with "page__f__" seem a bit buggy and probably dangerous for SEO. Can you take a look at it? :) I'm racing against the clock to find more time to analyze these things; I think we can still optimize a lot more. :D Regards,
July 15, 2010 (thread author)
"I don't know where they're coming from, but the results with "page__f__" seem a bit buggy and probably dangerous for SEO. Can you take a look at it? :) I'm racing against the clock to find more time to analyze these things; I think we can still optimize a lot more. :D Regards,"
Could any dev take a look into this? I think it's another SEO improvement we can get, since it's causing SERPs with duplicated content (which isn't good). I think this issue is a good one to pick for 3.1.2, as you've been taking SEO seriously in the last releases. :-) Cheers,
July 15, 2010
There's nothing stopping you from customizing your robots.txt further yourself already. :)
January 21, 2011
I have settings like this in robots.txt for my IPB 3.1.4. Are these good for me?
User-agent: *
Disallow: admin/
Disallow: style_images/
Disallow: index.php?act=idx
Disallow: index.php?act=Login
Disallow: index.php?act=Search
Disallow: index.php?act=Shoutbox
Disallow: index.php?act=Reg
Disallow: index.php?act=Msg
Disallow: index.php?act=Mail
Disallow: index.php?act=Forward
Disallow: index.php?act=Track
Disallow: index.php?act=Post
Disallow: index.php?act=post
Disallow: index.php?act=Print
Disallow: index.php?act=ST
Disallow: index.php?act=boardrules
Disallow: ?act=boardrules
Disallow: index.php?act=Help
Disallow: index.php?act=Stats
Disallow: index.php?act=stats
Disallow: index.php?act=Members
Disallow: index.php?act=Online
Disallow: index.php?act=calendar
Disallow: index.php?act=SR
Disallow: index.php?act=SF
Disallow: index.php?act=ICQ
Disallow: index.php?act=MSN
Disallow: index.php?act=AOL
Disallow: index.php?act=AIM
Disallow: index.php?act=SC
Disallow: index.php?act=task
Disallow: index.php?act=findpost
Disallow: index.php?act=UserCP
Disallow: index.php?act=usercp
Disallow: index.php?&act=
Disallow: index.php?act=report
Disallow: index.php?act=buddy
Disallow: index.php?act=legends
Disallow: index.php?CODE=
Disallow: index.php?act=attach
Disallow: index.php?act=Attach
Disallow: index.php?&&CODE=
Disallow: index.php?&debug=1
Disallow: index.php?act=Profile
Disallow: index.php?showuser
Disallow: index.php?s=
Disallow: *&view=getnewpost$
Disallow: *&view=getlastpost$
Disallow: *&view=old$
Disallow: *&view=new$
Disallow: *mode=linear
Disallow: *mode=threaded
Disallow: *mode=linearplus
Disallow: *&p=
Disallow: *&pid=
Disallow: *&gopid=
Disallow: *&hl=
Disallow: *&start=
Disallow: *&showtopic
Disallow: *gallery&req=stats
Disallow: *gallery&req=user
Disallow: *gallery&req=slideshow
Disallow: *reportimage
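One thing that stands out in this list is that none of the paths start with `/`, while robots.txt path rules are normally written from the site root (compare the `/*&p=` style used earlier in the thread). How a given crawler treats a rule like `admin/` is not something to rely on. The sketch below simply prepends `/` where it is missing. It assumes the forum is installed at the site root, which the post does not say; `normalize_rule` is a hypothetical helper, and the output is a suggestion to review by hand rather than a guaranteed fix.

```python
from typing import List


def normalize_rule(rule: str) -> str:
    """Prepend "/" to a Disallow path that does not start at the root.

    Assumption: the forum lives at the site root, so "admin/" is meant
    as "/admin/". A leading "*" becomes "/*", which matches the same
    URLs because every URL path begins with "/".
    """
    rule = rule.strip()
    if not rule or rule.startswith("/"):
        return rule
    return "/" + rule


def normalize_all(rules: List[str]) -> List[str]:
    """Normalize every rule, keeping the original order."""
    return [normalize_rule(r) for r in rules]


# A few rules copied from the post above.
SAMPLE = ["admin/", "index.php?act=Login", "?act=boardrules", "*&p=", "*&view=old$"]

if __name__ == "__main__":
    for before, after in zip(SAMPLE, normalize_all(SAMPLE)):
        print(f"Disallow: {before:<22} ->  Disallow: {after}")
```

If the board is installed in a subdirectory, the prefix would need to be that directory instead, as in the `/foorumi/` example earlier in the thread.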
February 6, 2013
Are these suggestions still valid in 3.4.2? I also want not just to disallow status updates but to remove them from the index.
February 6, 2013
I don't use robots.txt and wouldn't recommend it. It relies on search bots choosing to honor it. Too many disregard it and crawl where they please.
February 6, 2013
Google will respect robots.txt, and that is who I'm concerned about. I don't want Google giving any weight to my users' status updates.
February 6, 2013
"I don't use robots.txt and wouldn't recommend it. It relies on search bots choosing to honor it. Too many disregard it and crawl where they please."
But since it works for those that honor it, why wouldn't you? Google/Bing honor it, and if you don't want member profiles showing in Google, it's a fast and easy way to stop that.
February 7, 2013
"But since it works for those that honor it, why wouldn't you? Google/Bing honor it, and if you don't want member profiles showing in Google, it's a fast and easy way to stop that."
Bad bots are blocked by IP range, and as for member profiles, I don't care if they are crawled. They are not something that needs to be secure, and most don't fill out much more than their name anyhow. Profiles are not private.
February 7, 2013
Just a quick tip for anyone who reads this thread: you should follow the exact opposite route Sandi_ follows :)
November 13, 2015
On 6/29/2010, 8:10:56, Paulo Freitas said: "I'm still very busy here, but what I can say for now is that there is some quite strange behavior in user profile pages, as you can see here: http://www.google.com/search?q=inurl:"community.invisionpower.com/user/94759-paulo+-freitas"&filter=0 I don't know where they're coming from, but the results with "page__f__" seem a bit buggy and probably dangerous for SEO. Can you take a look at it? :) I'm racing against the clock to find more time to analyze these things; I think we can still optimize a lot more. :D Regards,"
Can you elaborate on this 'page__f__' issue? What about these results seems buggy and dangerous? (I know this is an old thread, but it might still be valid.)
Archived
This topic is now archived and is closed to further replies.