Paulo Freitas Posted April 16, 2010
Hi there,
I want to start a discussion here on how we can optimize our default robots.txt file so that it can be updated in future IP.Board versions. :)
It's simple: share your experiences. Use services like Google Webmaster Tools, Bing Webmaster Center and Yahoo! Site Explorer to identify which pages are being duplicated or are throwing errors for the crawlers.
From what I've already detected (if I'm not wrong), we can block 5 more URLs:
Disallow: /*?s=
Disallow: /*&s=
Disallow: /index.php?app=core&module=global&section=login&do=deleteCookies
Disallow: /index.php?app=forums&module=extras&section=rating
Disallow: /index.php?app=forums&module=forums&section=markasread
I still haven't put these lines in my own robots.txt, but I tested them in Google Webmaster Tools (GWT) and I'm convinced they will have a positive impact: 1) fewer useless indexed pages, 2) less duplicated content and 3) fewer HTML suggestions from GWT.
For the record, crawlers hate this: http://www.google.com/search?q=intitle%3A%22Board+Message%22+%22An+Error+Occurred%22+site%3Acommunity.invisionpower.com&filter=0
All these duplicated and useless pages have a negative impact on our ranking in rigorous crawlers (like Google). We need to block them to reduce our penalty. :(
I'm still analysing my GWT reports and I'll post here any new useless URLs I find. But I'd like more people to get involved and share their knowledge. :)
Sorry if my English is not perfect; I still need to dedicate more time to learning it. :huh:
Best regards,
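For anyone who wants to check wildcard rules like the ones proposed above before putting them live, here is a minimal Python sketch. It assumes Google-style matching (`*` matches any run of characters, a trailing `$` anchors the end of the URL, everything else is a prefix match); other crawlers may interpret wildcards differently. The function name `rule_matches` and the sample paths are made up for illustration.

```python
import re

def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path (with query).

    Assumption: Google-style semantics. '*' is a wildcard, a trailing '$'
    anchors the end, and anything else is matched as a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, url_path) is not None

# The two session rules from the post above.
rules = ["/*?s=", "/*&s="]

# Hypothetical paths, only to exercise the rules.
samples = [
    "/index.php?s=f0fafa4bb50ae01f6c1a7a92e7f16e8c",
    "/index.php?showtopic=1&s=f0fafa4bb50ae01f6c1a7a92e7f16e8c",
    "/index.php?showtopic=1",
    "/index.php?showtopic=1&st=20",
]

for path in samples:
    blocked = any(rule_matches(rule, path) for rule in rules)
    print(path, "->", "blocked" if blocked else "allowed")
```

The last sample is there on purpose: `&st=20` must not be caught by `/*&s=`, because the rule requires the literal `&s=`.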
stoo2000 Posted April 16, 2010
IP.Board 3.1 uses the correct HTTP headers on error pages, so this won't be a problem in future.
I don't think it's a good idea to block the session variable; it could have a knock-on effect when robots first visit your site. You can either force your PHP setup to use cookies only, or wait for the session bug to be fixed.
bfarber Posted April 16, 2010
Session ids are not placed in the URL for spiders we identify, so you shouldn't have to try to block session ids in the URL via robots.txt. Our session class actually wipes out the session ID in memory when it sees a bot.
if ( $uAgent['uagent_type'] == 'search' )
{
    self::setSearchEngine( $uAgent );
    /* Reset some data */
    $this->session_type = 'cookie';
    $this->session_id = "";
}
Axel Wers Posted April 16, 2010
Sure? In Google search results I have a lot of these entries:
forums.domain.tld/blogs/page__s__f0fafa4bb50ae01f6c1a7a92e7f16e8c
forums.domain.tld/blogs/page__s__08ab014e65b12c37833dbc33a8cb3f43
forums.domain.tld/blogs/page__s__33c29f46885918b68bd9df8867f4826e
etc...
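To get a feel for how many indexed entries are really the same page, one option is to strip the session segment and count what is left. This is only a sketch: it assumes the session always appears as `page__s__` followed by 32 lowercase hex characters, as in the entries above, and the names `SESSION_SEGMENT` and `strip_session` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: the session is carried as "page__s__<32 hex chars>", as in the
# entries listed above. Other URL shapes are not handled here.
SESSION_SEGMENT = re.compile(r"page__s__[0-9a-f]{32}")

def strip_session(url: str) -> str:
    """Remove the session segment and any trailing slash it leaves behind."""
    return SESSION_SEGMENT.sub("", url).rstrip("/")

indexed = [
    "forums.domain.tld/blogs/page__s__f0fafa4bb50ae01f6c1a7a92e7f16e8c",
    "forums.domain.tld/blogs/page__s__08ab014e65b12c37833dbc33a8cb3f43",
    "forums.domain.tld/blogs/page__s__33c29f46885918b68bd9df8867f4826e",
]

# Count how many indexed URLs collapse to the same page.
duplicates = Counter(strip_session(url) for url in indexed)
for page, count in duplicates.items():
    print(page, count)
```

Fed with a real export of indexed URLs, any page with a count above 1 is a candidate for duplicated content.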
CanalDev Posted April 16, 2010
bfarber said: Session ids are not placed in the URL for spiders we identify, so you shouldn't have to try to block session ids in the URL via robots.txt. Our session class actually wipes out the session ID in memory when it sees a bot.
In theory. :P Here's the practice: http://www.google.com/search?q=site%3Acommunity.invisionpower.com%2Fcalendar+-inurl%3A(day|week|event)&filter=0 (just an example) :(
Regards,
Edit: Oops, answered in the wrong browser/account. :P
bfarber Posted April 16, 2010
Fair enough. I've added the suggestions in the first post for the next update. Naturally, that won't change anything on existing installs, since we only ship a default robots.txt under a different filename; if you already have a robots.txt, you'd need to add the rules to it yourself.
Paulo Freitas (Author) Posted April 16, 2010
stoo2000 said: IP.Board 3.1 uses the correct HTTP headers on error pages, so this won't be a problem in future.
I don't know precisely, but I think that even when 403 errors are returned, crawlers will still waste your traffic unnecessarily. Blocking URLs in robots.txt keeps compliant crawlers from following these addresses at all. I can imagine how much this would cost on high-traffic boards. :ermm:
Just my POV. :)
Regards,
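The difference between a 403 and a robots.txt rule is that a compliant crawler checks the rules before it sends the request, so the blocked URL never costs any traffic. A rough Python sketch of that check, using the standard library's `urllib.robotparser` (which does plain prefix matching and does not understand `*` or `$` wildcards); the bot name `ExampleBot` and the host `forum.example` are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Two of the non-wildcard rules suggested in the first post.
robots_txt = """\
User-agent: *
Disallow: /index.php?app=forums&module=forums&section=markasread
Disallow: /index.php?app=forums&module=extras&section=rating
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def should_request(url: str, agent: str = "ExampleBot") -> bool:
    """What a compliant crawler asks itself before fetching a URL."""
    return parser.can_fetch(agent, url)

# Placeholder URLs on a made-up host.
for url in (
    "http://forum.example/index.php?app=forums&module=forums&section=markasread",
    "http://forum.example/topic/123-example/",
):
    print(url, "->", "fetch" if should_request(url) else "skip")
```

With error pages, by contrast, the crawler only learns about the 403 after the board has already handled the request.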
stoo2000 Posted April 16, 2010
Paulo Freitas said: I don't know precisely, but I think that even when 403 errors are returned, crawlers will still waste your traffic unnecessarily. Blocking URLs in robots.txt keeps compliant crawlers from following these addresses at all. I can imagine how much this would cost on high-traffic boards. :ermm: Just my POV. :) Regards,
Are you going to add a new rule to your robots.txt every time you move a topic to a private area or delete it?
Paulo Freitas (Author) Posted April 16, 2010
stoo2000 said: Are you going to add a new rule to your robots.txt every time you move a topic to a private area or delete it?
No, but I know that not doing so has a very negative impact. It seems that IP.Board 3.1 will help with this, but the "fix" is not immediate: crawlers are quite slow to remove pages from their index.
Regards,
AlexJ Posted April 16, 2010
Can someone please explain to me whether the URL below should be blocked or not? If not, why?
http://domain.com/forum/index.php?app=forums&module=forums&section=findpost&pid=38832
Owdy Posted April 18, 2010
Mine:
User-agent: *
Disallow: /foorumi/admin/
Disallow: /foorumi/cache/
Disallow: /foorumi/converge_local/
Disallow: /foorumi/hooks/
Disallow: /foorumi/ips_kernel/
Disallow: /foorumi/retail/
Disallow: /foorumi/public/js/
Disallow: /foorumi/public/style_captcha/
Disallow: /foorumi/public/style_css/
Disallow: /foorumi/index.php?action=verificationcode
Disallow: /foorumi/index.php?app=core&module=task
Disallow: /foorumi/index.php?app=core&module=usercp&tab=forums&area=forumsubs
Disallow: /foorumi/index.php?app=core&module=usercp&tab=forums&area=watch&watch=topic
Disallow: /foorumi/index.php?app=forums&module=extras&section=forward
Disallow: /foorumi/index.php?app=members&module=messaging
Disallow: /foorumi/index.php?app=members&module=chat
Disallow: /foorumi/index.php?app=members&module=search
Disallow: /foorumi/index.php?app=members&module=search&do=active
Disallow: /foorumi/index.php?&unlockUserAgent=1
Disallow: /foorumi/index.php?app=core&module=global&section=login&do=deleteCookies
Disallow: /foorumi/index.php?app=forums&module=extras&section=rating
Disallow: /foorumi/index.php?app=forums&module=forums&section=markasread
Disallow: /*app=core&module=usercp
Disallow: /*app=core&module=usercp
Disallow: /*app=members&module=messaging
Disallow: /*&p=
Disallow: /*&pid=
Disallow: /*&hl=
Disallow: /*&start=
Disallow: /*view__getnewpost$
Disallow: /*view__getlastpost$
Disallow: /*view__old$
Disallow: /*view__new$
Disallow: /*view__getfirst$
Disallow: /*view__getprevious$
Disallow: /*view__getnext$
Disallow: /*view__getlast$
Disallow: /*&view=getnewpost$
Disallow: /*&view=getlastpost$
Disallow: /*&view=old$
Disallow: /*&view=new$
Disallow: /*&view=getfirst$
Disallow: /*&view=getprevious$
Disallow: /*&view=getnext$
Disallow: /*&view=getlast$
Disallow: /*?s=
Disallow: /*&s=
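Once a robots.txt grows to this size, repeated lines are easy to miss (the `/*app=core&module=usercp` rule appears twice in the list above). A small Python sketch for spotting exact repeats; the helper name `duplicate_rules` is made up, and the check is deliberately naive: it compares whole lines and ignores User-agent grouping, so a rule repeated on purpose under two different groups would also be reported.

```python
from collections import Counter

def duplicate_rules(robots_txt: str) -> list:
    """Return the lines that occur more than once, ignoring blanks and comments."""
    lines = [line.strip() for line in robots_txt.splitlines()]
    counts = Counter(line for line in lines if line and not line.startswith("#"))
    return [line for line, n in counts.items() if n > 1]

# A short excerpt from the list above.
sample = """\
User-agent: *
Disallow: /*app=core&module=usercp
Disallow: /*app=core&module=usercp
Disallow: /*app=members&module=messaging
"""

print(duplicate_rules(sample))
```

A duplicate is harmless to crawlers, but removing it keeps the file easier to review.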
Paulo Freitas (Author) Posted June 28, 2010
I'm still very busy here, but what I can say for now is that there is some quite strange behavior in user profile pages, as you can see here: http://www.google.com/search?q=inurl:"community.invisionpower.com/user/94759-paulo+-freitas"&filter=0
I don't know where they're coming from, but the results with "page__f__" seem a bit buggy and probably dangerous for SEO. Can you take a look at it? :)
I'm racing against the clock to find more time to analyze these things; I think we can still optimize a lot more. :D
Regards,
Paulo Freitas (Author) Posted July 15, 2010
Paulo Freitas said: I don't know where they're coming from, but the results with "page__f__" seem a bit buggy and probably dangerous for SEO. Can you take a look at it? :) I'm racing against the clock to find more time to analyze these things; I think we can still optimize a lot more. :D Regards,
Could any dev take a look into this? I think it's another SEO improvement we can get, since it's causing SERPs with duplicated content (which isn't good). I think this issue is a good one to pick for 3.1.2; you've been taking SEO seriously in the latest releases. :-)
Cheers,
bfarber Posted July 15, 2010
There's nothing stopping you from customizing your robots.txt further yourself already. :)
Indo-IPB Posted January 21, 2011
I have settings like this in robots.txt for my IPB 3.1.4. Is this good for me?
User-agent: *
Disallow: admin/
Disallow: style_images/
Disallow: index.php?act=idx
Disallow: index.php?act=Login
Disallow: index.php?act=Search
Disallow: index.php?act=Shoutbox
Disallow: index.php?act=Reg
Disallow: index.php?act=Msg
Disallow: index.php?act=Mail
Disallow: index.php?act=Forward
Disallow: index.php?act=Track
Disallow: index.php?act=Post
Disallow: index.php?act=post
Disallow: index.php?act=Print
Disallow: index.php?act=ST
Disallow: index.php?act=boardrules
Disallow: ?act=boardrules
Disallow: index.php?act=Help
Disallow: index.php?act=Stats
Disallow: index.php?act=stats
Disallow: index.php?act=Members
Disallow: index.php?act=Online
Disallow: index.php?act=calendar
Disallow: index.php?act=SR
Disallow: index.php?act=SF
Disallow: index.php?act=ICQ
Disallow: index.php?act=MSN
Disallow: index.php?act=AOL
Disallow: index.php?act=AIM
Disallow: index.php?act=SC
Disallow: index.php?act=task
Disallow: index.php?act=findpost
Disallow: index.php?act=UserCP
Disallow: index.php?act=usercp
Disallow: index.php?&act=
Disallow: index.php?act=report
Disallow: index.php?act=buddy
Disallow: index.php?act=legends
Disallow: index.php?CODE=
Disallow: index.php?act=attach
Disallow: index.php?act=Attach
Disallow: index.php?&&CODE=
Disallow: index.php?&debug=1
Disallow: index.php?act=Profile
Disallow: index.php?showuser
Disallow: index.php?s=
Disallow: *&view=getnewpost$
Disallow: *&view=getlastpost$
Disallow: *&view=old$
Disallow: *&view=new$
Disallow: *mode=linear
Disallow: *mode=threaded
Disallow: *mode=linearplus
Disallow: *&p=
Disallow: *&pid=
Disallow: *&gopid=
Disallow: *&hl=
Disallow: *&start=
Disallow: *&showtopic
Disallow: *gallery&req=stats
Disallow: *gallery&req=user
Disallow: *gallery&req=slideshow
Disallow: *reportimage
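One thing worth checking in a list like this: robots.txt path values are conventionally written starting with `/` (RFC 9309 defines path patterns that way), and none of the rules above do. How a particular crawler treats rules without the leading slash is not something to rely on, so the Python sketch below only collects them as warnings. The helper name `rules_without_leading_slash` is made up, and the `/cache/` line in the sample is added purely to show a rule that passes.

```python
def rules_without_leading_slash(robots_txt: str) -> list:
    """Collect Allow/Disallow values that do not start with '/'."""
    flagged = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        # Skip lines that are not Allow/Disallow rules (User-agent, blanks, etc.).
        if not sep or field.strip().lower() not in ("allow", "disallow"):
            continue
        value = value.strip()
        # An empty Disallow is valid ("allow everything"), so only flag non-empty values.
        if value and not value.startswith("/"):
            flagged.append(value)
    return flagged

# Excerpt from the list above, plus one hypothetical well-formed rule (/cache/).
sample = """\
User-agent: *
Disallow: admin/
Disallow: /cache/
Disallow: index.php?act=Login
Disallow: *&view=getnewpost$
"""

for value in rules_without_leading_slash(sample):
    print("no leading slash:", value)
```

Rules that begin with `*` are flagged too; some crawlers may accept them, but writing them as `/*...` (as in the earlier lists in this thread) is the safer form.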
CheersnGears Posted February 6, 2013
Are these suggestions still valid in 3.4.2? I also want to not just disallow status updates but remove them from the index.
Royzee Posted February 6, 2013
I don't use robots.txt and wouldn't recommend it. It relies on the search bots choosing to honor it. Too many disregard it and crawl where they please.
CheersnGears Posted February 6, 2013
Google will respect robots.txt, and that is who I'm concerned about. I don't want Google giving any weight to my users' status updates.
Dmacleo Posted February 6, 2013
Royzee said: I don't use robots.txt and wouldn't recommend it. It relies on the search bots choosing to honor it. Too many disregard it and crawl where they please.
But since it works for those that honor it, why wouldn't you? Google/Bing honor it, and if you don't want member profiles showing in Google, it's a fast and easy way to stop that.
Royzee Posted February 7, 2013
Dmacleo said: But since it works for those that honor it, why wouldn't you? Google/Bing honor it, and if you don't want member profiles showing in Google, it's a fast and easy way to stop that.
Bad bots are blocked by IP range, and as for member profiles, I don't care if they are crawled. They are not something that needs to be secure, and most don't fill out much more than their name anyhow. Profiles are not private.
GreenLinks Posted February 7, 2013
Just a quick tip for anyone who reads this thread: you should follow the exact opposite of the route Sandi_ follows :)
Prank Posted November 13, 2015
On 6/29/2010, 8:10:56, Paulo Freitas said: I'm still very busy here, but what I can say for now is that there is some quite strange behavior in user profile pages, as you can see here: http://www.google.com/search?q=inurl:"community.invisionpower.com/user/94759-paulo+-freitas"&filter=0 I don't know where they're coming from, but the results with "page__f__" seem a bit buggy and probably dangerous for SEO. Can you take a look at it? :) I'm racing against the clock to find more time to analyze these things; I think we can still optimize a lot more. :D Regards,
Can you elaborate on this 'page__f__' issue? What about them seems buggy and dangerous? (I know this is an old thread, but it might still be valid.)
This topic is now archived and is closed to further replies.