
SEO: Improving robots.txt


Paulo Freitas

Recommended Posts

Hi there,

I want to start a discussion here on how we can optimize the default robots.txt file so it gets updated in future IP.Board versions. :)

It's simple: share your experiences. Use services like Google Webmaster Tools, Bing Webmaster Center and Yahoo! Site Explorer to identify which pages are getting duplicated or are throwing errors/problems for the crawlers.

From what I've detected so far (if I'm not wrong), we can block 5 more URL patterns (a sketch of a complete block follows the list):

Disallow: /*?s=

Disallow: /*&s=

Disallow: /index.php?app=core&module=global&section=login&do=deleteCookies

Disallow: /index.php?app=forums&module=extras&section=rating

Disallow: /index.php?app=forums&module=forums&section=markasread
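
For anyone who wants to try these before an official update ships, here is a minimal sketch of what a complete block might look like (my assumption: the board is installed at the web root; also note that the * wildcard is a Google/Bing extension rather than part of the original robots.txt standard):

User-agent: *
Disallow: /*?s=
Disallow: /*&s=
Disallow: /index.php?app=core&module=global&section=login&do=deleteCookies
Disallow: /index.php?app=forums&module=extras&section=rating
Disallow: /index.php?app=forums&module=forums&section=markasread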


I still haven't put these lines in my own robots.txt, but I tested them in Google Webmaster Tools (GWT) and I'm convinced they will have a positive impact: 1) fewer useless indexed pages, 2) less duplicated content and 3) fewer HTML suggestions from GWT.

For the record, crawlers hate this: http://www.google.com/search?q=intitle%3A%22Board+Message%22+%22An+Error+Occurred%22+site%3Acommunity.invisionpower.com&filter=0

All these duplicated and useless pages have a negative impact on our ranking with rigorous crawlers (like Google). We need to block them to reduce the penalty. :(

I'm still analysing my GWT reports and I'll update here with any new useless URLs I find. But I'd like more people to get involved and share their knowledge. :)

Sorry if my English isn't perfect; I still need to dedicate more time to learning it. :huh:

Best regards,

IP.Board 3.1 uses the correct HTTP headers on error pages, so this won't be a problem in the future.
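
Just to illustrate what "correct HTTP headers" means here, a hypothetical standalone sketch (not IP.Board's actual code) of an error page that sends a proper status plus a no-index hint:

<?php
// Hypothetical example: send an error status and tell crawlers not to index the page.
header( 'HTTP/1.1 403 Forbidden' );
header( 'X-Robots-Tag: noindex' );
echo 'An Error Occurred';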

I don't think it's a good idea to block the session variable; it could have a knock-on effect when robots first visit your site. You can either force your PHP setup to use cookies only, or wait for the session bug to be fixed.
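
If anyone wants to go the cookies-only route, a minimal sketch of the relevant settings (assuming you can edit php.ini or an equivalent per-directory configuration; these are standard PHP ini directives, not IP.Board settings):

; php.ini
session.use_only_cookies = 1
session.use_trans_sid = 0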

Session IDs are not placed in the URL for spiders we identify, so you shouldn't have to block session IDs in the URL via robots.txt. Our session class actually wipes out the session ID in memory when it sees a bot.

if ( $uAgent['uagent_type'] == 'search' )
{
	self::setSearchEngine( $uAgent );

	/* Reset some data */
	$this->session_type = 'cookie';
	$this->session_id   = "";
}



Sure? In Google's search results I have a looooooooooot of these entries:

forums.domain.tld/blogs/page__s__f0fafa4bb50ae01f6c1a7a92e7f16e8c

forums.domain.tld/blogs/page__s__08ab014e65b12c37833dbc33a8cb3f43

forums.domain.tld/blogs/page__s__33c29f46885918b68bd9df8867f4826e



etc...
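
Note that the two rules proposed above only catch the query-string form (?s= / &s=). A hedged sketch of an extra rule that would also cover the friendly-URL form shown here, assuming those URLs always contain the literal segment page__s__:

Disallow: /*page__s__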


Session IDs are not placed in the URL for spiders we identify, so you shouldn't have to block session IDs in the URL via robots.txt. Our session class actually wipes out the session ID in memory when it sees a bot.


In theory. :P

Here's the practice: http://www.google.com/search?q=site%3Acommunity.invisionpower.com%2Fcalendar+-inurl%3A(day|week|event)&filter=0 (just an example) :(

Regards,

Edit: Oops, answered in the wrong browser/account. :P

Fair enough.

I've added the suggestions from the first post for the next update. Naturally, that only helps with the default (renamed) robots.txt we ship, so if you already have a robots.txt of your own you'd need to add the rules to it yourself.


IP.Board 3.1 uses the correct HTTP headers on error pages, so this won't be a problem in the future.


I didn't know that precisely, but I think that even when these pages return 403 errors, crawlers will still waste your bandwidth unnecessarily. Blocking the URLs in robots.txt prevents compliant crawlers from even requesting those addresses. I can imagine how much this would cost on high-traffic boards. :ermm:

Just my POV. :)

Regards,


I didn't know that precisely, but I think that even when these pages return 403 errors, crawlers will still waste your bandwidth unnecessarily. Blocking the URLs in robots.txt prevents compliant crawlers from even requesting those addresses. I can imagine how much this would cost on high-traffic boards. :ermm:

Just my POV. :)

Regards,


Are you going to add a new rule to your robots.txt every time you move a topic to a private area or delete it?

Are you going to add a new rule to your robots.txt every time you move a topic to a private area or delete it?


No, but I know that not doing so has a very negative impact. It seems that IP.Board 3.1 will help with this, but the "fix" isn't immediate; crawlers are quite slow to remove pages from their index.

Regards,

Mine:


User-agent: *
Disallow: /foorumi/admin/
Disallow: /foorumi/cache/
Disallow: /foorumi/converge_local/
Disallow: /foorumi/hooks/
Disallow: /foorumi/ips_kernel/
Disallow: /foorumi/retail/
Disallow: /foorumi/public/js/
Disallow: /foorumi/public/style_captcha/
Disallow: /foorumi/public/style_css/
Disallow: /foorumi/index.php?action=verificationcode
Disallow: /foorumi/index.php?app=core&module=task
Disallow: /foorumi/index.php?app=core&module=usercp&tab=forums&area=forumsubs
Disallow: /foorumi/index.php?app=core&module=usercp&tab=forums&area=watch&watch=topic
Disallow: /foorumi/index.php?app=forums&module=extras&section=forward
Disallow: /foorumi/index.php?app=members&module=messaging
Disallow: /foorumi/index.php?app=members&module=chat
Disallow: /foorumi/index.php?app=members&module=search
Disallow: /foorumi/index.php?app=members&module=search&do=active
Disallow: /foorumi/index.php?&unlockUserAgent=1
Disallow: /foorumi/index.php?app=core&module=global&section=login&do=deleteCookies
Disallow: /foorumi/index.php?app=forums&module=extras&section=rating
Disallow: /foorumi/index.php?app=forums&module=forums&section=markasread
Disallow: /*app=core&module=usercp
Disallow: /*app=members&module=messaging
Disallow: /*&p=
Disallow: /*&pid=
Disallow: /*&hl=
Disallow: /*&start=
Disallow: /*view__getnewpost$
Disallow: /*view__getlastpost$
Disallow: /*view__old$
Disallow: /*view__new$
Disallow: /*view__getfirst$
Disallow: /*view__getprevious$
Disallow: /*view__getnext$
Disallow: /*view__getlast$
Disallow: /*&view=getnewpost$
Disallow: /*&view=getlastpost$
Disallow: /*&view=old$
Disallow: /*&view=new$
Disallow: /*&view=getfirst$
Disallow: /*&view=getprevious$
Disallow: /*&view=getnext$
Disallow: /*&view=getlast$
Disallow: /*?s=
Disallow: /*&s=


  • 2 months later...

I'm still very busy here, but what I can say for now is that there is some quite strange behavior on user profile pages, as you can see here: http://www.google.com/search?q=inurl:"community.invisionpower.com/user/94759-paulo+-freitas"&filter=0

I don't know where they're coming from, but the results with "page__f__" seem a bit buggy and probably harmful for SEO. Can you take a look at it? :)
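
As a possible stopgap while that's investigated, a hedged sketch of a rule a site owner could add, assuming the problematic profile URLs always contain the literal segment page__f__ (and that nothing you do want indexed does):

Disallow: /*page__f__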

I'm racing against time to find more time to analyze these things; I think we can still optimize a lot more. :D

Regards,


  • 3 weeks later...



I don't know where they're coming from, but the results with "page__f__" seem a bit buggy and probably harmful for SEO. Can you take a look at it? :)



I'm racing against time to find more time to analyze these things; I think we can still optimize a lot more. :D



Regards,


Could any dev take a look into this? I think it's another SEO improvement we can get, since it's causing SERPs with duplicated content (which isn't good).

I think this issue is a good one to pick for 3.1.2; you've been taking SEO seriously in the latest releases. :-)

Cheers,

  • 6 months later...

I have settings like this in robots.txt for my IPB 3.1.4.
Is this good for me?

User-agent: *
Disallow: admin/
Disallow: style_images/
Disallow: index.php?act=idx
Disallow: index.php?act=Login
Disallow: index.php?act=Search
Disallow: index.php?act=Shoutbox
Disallow: index.php?act=Reg
Disallow: index.php?act=Msg
Disallow: index.php?act=Mail
Disallow: index.php?act=Forward
Disallow: index.php?act=Track
Disallow: index.php?act=Post
Disallow: index.php?act=post
Disallow: index.php?act=Print
Disallow: index.php?act=ST
Disallow: index.php?act=boardrules
Disallow: ?act=boardrules
Disallow: index.php?act=Help
Disallow: index.php?act=Stats
Disallow: index.php?act=stats
Disallow: index.php?act=Members
Disallow: index.php?act=Online
Disallow: index.php?act=calendar
Disallow: index.php?act=SR
Disallow: index.php?act=SF
Disallow: index.php?act=ICQ
Disallow: index.php?act=MSN
Disallow: index.php?act=AOL
Disallow: index.php?act=AIM
Disallow: index.php?act=SC
Disallow: index.php?act=task
Disallow: index.php?act=findpost
Disallow: index.php?act=UserCP
Disallow: index.php?act=usercp
Disallow: index.php?&act=
Disallow: index.php?act=report
Disallow: index.php?act=buddy
Disallow: index.php?act=legends
Disallow: index.php?CODE=
Disallow: index.php?act=attach
Disallow: index.php?act=Attach
Disallow: index.php?&&CODE=
Disallow: index.php?&debug=1
Disallow: index.php?act=Profile
Disallow: index.php?showuser
Disallow: index.php?s=
Disallow: *&view=getnewpost$
Disallow: *&view=getlastpost$
Disallow: *&view=old$
Disallow: *&view=new$
Disallow: *mode=linear
Disallow: *mode=threaded
Disallow: *mode=linearplus
Disallow: *&p=
Disallow: *&pid=
Disallow: *&gopid=
Disallow: *&hl=
Disallow: *&start=
Disallow: *&showtopic
Disallow: *gallery&req=stats
Disallow: *gallery&req=user
Disallow: *gallery&req=slideshow
Disallow: *reportimage


  • 2 years later...

I don't use robots.txt and wouldn't recommend it. It relies on search bots choosing to honor it or not, and too many disregard it and crawl where they please.

But since it works for those that honor it, why wouldn't you?

Google/Bing honor it, and if you don't want member profiles showing in Google, it's a fast and easy way to stop that.


But since it works for those that honor it, why wouldn't you?

Google/Bing honor it, and if you don't want member profiles showing in Google, it's a fast and easy way to stop that.

Bad bots are blocked by IP range, and as for member profiles, I don't care if they are crawled. They are not something that needs to be secure, and most don't fill out much more than their name anyhow.

Profiles are not private.


  • 2 years later...
  • 4 months later...
On 6/29/2010, 8:10:56, Paulo Freitas said:

I'm still very busy here, but what I can say for now is that there is some quite strange behavior on user profile pages, as you can see here: http://www.google.com/search?q=inurl:"community.invisionpower.com/user/94759-paulo+-freitas"&filter=0

I don't know where they're coming from, but the results with "page__f__" seem a bit buggy and probably harmful for SEO. Can you take a look at it? :)

I'm racing against time to find more time to analyze these things; I think we can still optimize a lot more. :D

Regards,

Can you elaborate on this 'page__f__' issue? What about them seems buggy and dangerous? 

(I know this is an old thread, but it might still be valid.)

