
IPS4 ~ robots.txt



DO NOT USE THIS ROBOTS.TXT CODE, AS IT MAY DO MORE HARM THAN GOOD. READ ON TO FIND OUT THE PURPOSE OF THIS THREAD.

Unless I've missed something, IPS4 doesn't ship with a robots.txt file, and I've also noticed that a lot of the links appearing on Google (first page) for my site are "Email this page" links, which is a little concerning. I'm not sure whether this file should be applied, so instead of uploading it to the Marketplace I've added it here for everyone to see and scrutinise, because I am no expert.

So, basing it loosely on the one that comes with IPB 3.4.8, I've compiled a basic robots.txt for IPS4 installations.

# HOW TO USE THIS FILE:
# 1) Edit this file to change the leading "/" in each rule to the correct relative path from your base URL; for example, if your forum was at "domain.com/sites/community", then you'd use "/sites/community/"
# 2) If you allow guests to view profiles, calendar, gallery and the like, remove those lines.
# 3) Rename the file to 'robots.txt' and move it to your web root (public_html, www, or htdocs)
# 4) Edit the file to remove this comment (anything above the dashed line, including the dashed line)
#
# NOTES:
# Even though wildcards and pattern matching are not part of the robots.txt specification, many search bots understand and make use of them
#------ REMOVE THIS LINE AND EVERYTHING ABOVE SO THAT User-agent: * IS THE FIRST LINE -------
User-agent: *
Disallow: /admin/
Disallow: /applications/
Disallow: /datastore/
Disallow: /plugins/
Disallow: /system/
Disallow: /uploads/
Disallow: /profile/
Disallow: /calendar/
Disallow: /gallery/
Disallow: /*$do=getNewComment
Disallow: /*$do=findComment&comment$
Disallow: /*$do=email&comment$
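
To illustrate step 1: for a forum installed in a subdirectory such as "domain.com/sites/community", the first few rules would be rewritten along these lines (a sketch only; adapt it to your own paths):

User-agent: *
Disallow: /sites/community/admin/
Disallow: /sites/community/applications/
Disallow: /sites/community/datastore/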



I just added the noindex robots meta tag to the contact page, as I don't want it indexed by the search engines, and this is working fine. But perhaps it would be better to disallow it in the robots file.
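
For reference, both routes are one-liners. The meta tag goes in the page head, and the robots.txt rule would look like the line below it (the /contact/ path is an assumption; check the actual URL of your contact page first):

<meta name="robots" content="noindex">

Disallow: /contact/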

I had many disallow entries in the robots file for the 3.4 system, both the default ones and many custom ones, but now I have just 2 entries for the IPS4 system and it seems to be working fine.



Machsterdaemon, would you mind posting a copy of your robots.txt?

Thanks!


I was just going by IPB 3.x's logic. I'm just not sure there's any legitimate reason why Google should index those folders, and I'm pretty sure that without these two:

Disallow: /*$do=getNewComment
Disallow: /*$do=findComment&comment$

Google would index the same content twice or more and treat it as duplicate content. My SEO is very rusty, but doesn't duplicate content detract from your standing with Google? I was hoping for more feedback than this, though. Maybe it's only an issue for me, and for everyone else it's a non-issue. :/



I get redirected to the normal topic URL if I click a URL with this:

do=getNewComment

But a profile activity link with the below parameter:

do=findComment&comment

redirects me to a post URL ending with the fragment below (I've put x in place of the post ID):

#comment-xxxxxxx

URLs with the above fragment (#comment) are probably not indexed by Google, even if you allow guests to view the profile pages.

If you are using Google Search Console (Webmaster Tools), there is a section that lets you view the URL parameters on your site that Google is ignoring. You can also configure more parameters in that section.


Quote

I get redirected to the normal topic URL if I click a URL with this:

do=getNewComment

This is the case on one site running 4.0.8.1 when you do this as a guest but not as a member, and it also seems to be the case when using this forum as a guest. But it seems that if you are logged in, you get redirected to this URL instead:

#comment

  • 3 weeks later...
  • 6 months later...

Hi

Same question here:

What useful directives should we use in robots.txt?

This one:

Disallow: /profile/

looks very useful, as many profiles are empty and it is not good at all to let Google index thousands of empty (or almost empty) identical pages.


It's really a big problem.

There are a lot of 404 errors because the Google bot wants to read old CSS and JS files.

Example:

Quote

2016/02/16 10:53:39 [error] 16090#0: *14454 open() "/var/www/forum/uploads/javascript_global/root_map.js.0a131f7c22268fd0f11d92ed432ca70a.js" failed (2: No such file or directory), client: 66.249.78.145, server: www.website.fr, request: "GET /forum/uploads/javascript_global/root_map.js.0a131f7c22268fd0f11d92ed432ca70a.js?v=2c4842661c HTTP/1.1", host: "www.website.fr"
2016/02/16 10:57:34 [error] 16093#0: *16365 open() "/var/www/forum/uploads/css_built_1/76e62c573090645fb99a15a363d8620e_forums_responsive.css.7141dd3c0e8bb8a476448853384525e2.css" failed (2: No such file or directory), client: 66.249.78.145, server: www.website.fr, request: "GET /forum/uploads/css_built_1/76e62c573090645fb99a15a363d8620e_forums_responsive.css.7141dd3c0e8bb8a476448853384525e2.css?v=2c4842661c HTTP/1.1", host: "www.website.fr"
2016/02/16 10:57:35 [error] 16093#0: *16391 open() "/var/www/forum/uploads/javascript_global/root_front.js.20fa81eb4ec063d131495937504c7b62.js" failed (2: No such file or directory), client: 66.249.64.156, server: www.website.fr, request: "GET /forum/uploads/javascript_global/root_front.js.20fa81eb4ec063d131495937504c7b62.js?v=2c4842661c HTTP/1.1", host: "www.website.fr"

But if you block the /uploads/ directory, Google can't read the CSS files, so there is no design in Google's rendered preview and cache.
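
If you still want to keep /uploads/ out of the index while letting Google fetch the CSS and JS it needs to render pages, robots.txt Allow rules can carve out exceptions. This is a sketch only, based on the directory names in the log above; Google honours Allow rules, but not every bot does:

User-agent: *
Allow: /uploads/css_built_*
Allow: /uploads/javascript_*
Disallow: /uploads/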


Noted; as I stated in my post, I didn't intend for that to be used, but thank you for alerting me to it. I moved away from this and went with a more customised solution. I assume that's why no one really responded to this, and why the discussion never really took off.


2 hours ago, Bliblou said:

It's really a big problem.

Please elaborate on why that should be a big problem.

2 hours ago, Bliblou said:

There are a lot of 404 errors because the Google bot wants to read old CSS and JS files.

So what? That’s how the internet works. If stuff gets deleted, the server throws a 404, and if that continues to be the case, the search engines know that this stuff is probably gone for good. That’s all intended behaviour.

2 hours ago, Bliblou said:

But if you block the /uploads/ directory, Google can't read the CSS files

Correct. Don’t block that stuff. Just let the crawler do its work.


9 hours ago, Bliblou said:

Are you serious?

No. It was a joke as a reply to your comment, which showed you don’t know what you are talking about regarding this issue. 


9 hours ago, Bliblou said:

IPS deletes the files, not Google. IPS doesn't care about this problem.

They don’t care because THERE IS NO PROBLEM. And it has been explained to you several times now, by other users AND the IPS staff in the bug tracker. You need to accept that instead of reopening the topic over and over again.


I don’t know what you are talking about lol.

I'm probably an amateur like you! The 404 page is not a solution, and if you are really a dev, you know that.

It's not a solution to use the 404 error like that.

A bot wants to read a file and the file is not there. It's normal for the bot to rescan the page over the next few days (in case the file was deleted by mistake).

So never mind; if you don't understand, and if for you it's normal to use the 404 error like that, that's cool!


So,

I found a partial solution (for NGINX).

In the virtual host config:

Quote

# CSS and Javascript
location ~* \.(?:css|js)$ {
  expires 24h;
  access_log off;
  add_header Cache-Control "public";
  log_not_found off;
}


log_not_found off; => no logging for 404s
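
If you'd rather not silence missing-file logging for every CSS/JS path on the server, the same directives can be scoped to the IPS uploads directory instead. A sketch only, assuming the /forum/uploads/ prefix seen in the error log above:

Quote

# Only the rebuilt IPS assets; 404s elsewhere are still logged
location ~* ^/forum/uploads/.*\.(?:css|js)$ {
  expires 24h;
  access_log off;
  add_header Cache-Control "public";
  log_not_found off;
}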


30 minutes ago, Bliblou said:

I'm probably an amateur like you!

Can’t speak for you, but I’ve been a professional web designer since 1999.

But that doesn’t matter much. Anyone can be right or wrong regarding certain questions. 

30 minutes ago, Bliblou said:

It's not a solution to use the 404 error like that.

Yes, it is. You are just not listening to anyone’s arguments in the 3 topics that are now open. You are just stubbornly repeating your complaint, without being able to properly explain why this should even be a “big problem”, as you say.


If you really want to understand, you can just ASK politely in ONE topic or bug report. There is no need to speak of a “big problem” if you can’t be sure it even is a problem at all, and there is no need to point fingers at IPS if you can’t be sure they actually did something wrong.


It's not normal for me, and if the professional web designer since 1999 thinks the 404 error is normal, that's your choice.

I asked politely (with my limited vocabulary in English!) in 2 topics because I think it's interesting, because the 2 topics have different subjects, and I think it's not your problem (you're not a moderator, no?).

It's IPB that deletes the CSS and JS files. Normal or not, I want to understand why this situation exists, and if for me it's a big problem, that's because a 404 error is not a solution.

It's a support forum, so I ask questions. If the subject doesn't interest you, don't answer. Easy.

Don't tell me where I can post and which questions I can post.


11 minutes ago, Bliblou said:

I want to understand why this situation exists

Because resources like CSS and JavaScript files get rebuilt after certain actions like upgrades. That’s a GOOD thing. It AVOIDS problems, e.g. with cached resources.

Google is okay with that. It will just pick up the new URLs and learn that the old ones are gone.

11 minutes ago, Bliblou said:

and if for me it's a big problem

Why?

11 minutes ago, Bliblou said:

that's because a 404 error is not a solution.

Why?

11 minutes ago, Bliblou said:

It's a support forum, so I ask questions.

No, you didn’t ask about this, you complained: “The problem is IPB Team and DEV. Why files are deleted ??!! It's ridiculous.”

11 minutes ago, Bliblou said:

If the subject doesn't interest you, don't answer.

Who says it doesn’t interest me?

11 minutes ago, Bliblou said:

Don't tell me where I can post and which questions I can post.

I didn’t do that. 


Archived

This topic is now archived and is closed to further replies.
