
IPS4 ~ robots.txt



DO NOT USE THIS ROBOTS.TXT CODE, AS IT MAY DO MORE HARM THAN GOOD. READ ON TO FIND OUT THE PURPOSE OF THIS THREAD.

Unless I've missed something, IPS4 doesn't ship with a robots.txt file, and I've also noticed that a lot of the links appearing on Google (first page) for my site are "Email this page" links, which is a little concerning. I'm not sure whether this file should be applied, so instead of uploading it to the Marketplace I've added it here for everyone to see and scrutinise, because I am no expert.

So, basing it loosely on the one that comes with IPB 3.4.8, I've compiled a basic robots.txt for IPS4 installations.

# HOW TO USE THIS FILE:
# 1) Edit this file to change the leading "/" in each rule to the correct relative path from your base URL; for example, if your forum was at "domain.com/sites/community", then you'd use "/sites/community/"
# 2) If you allow guests to view profiles, calendar, gallery and the like, remove those lines.
# 3) Rename the file to 'robots.txt' and move it to your web root (public_html, www, or htdocs)
# 4) Edit the file to remove this comment (anything above the dashed line, including the dashed line)
#
# NOTES:
# Even though wildcards and pattern matching are not part of the robots.txt specification, many search bots understand and make use of them
#------ REMOVE THIS LINE AND EVERYTHING ABOVE SO THAT User-agent: * IS THE FIRST LINE -------
User-agent: *
Disallow: /admin/
Disallow: /applications/
Disallow: /datastore/
Disallow: /plugins/
Disallow: /system/
Disallow: /uploads/
Disallow: /profile/
Disallow: /calendar/
Disallow: /gallery/
Disallow: /*$do=getNewComment
Disallow: /*$do=findComment&comment$
Disallow: /*$do=email&comment$
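
To illustrate step 1: for a forum installed in a subdirectory such as "domain.com/sites/community", the first few rules would be rewritten along these lines (a sketch only; adapt it to your own paths):

User-agent: *
Disallow: /sites/community/admin/
Disallow: /sites/community/applications/
Disallow: /sites/community/datastore/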



I just added the noindex robots meta tag to the contact page, as I don't want it indexed by the search engines, and this is working fine. But perhaps it would be better to disallow it in the robots file.
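
For reference, both routes are one-liners. The meta tag goes in the page head, and the robots.txt rule would look like the line below it (the /contact/ path is an assumption; check the actual URL of your contact page first):

<meta name="robots" content="noindex">

Disallow: /contact/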

I had many disallow entries in the robots file for the 3.4 system, both the default ones and many custom ones, but now I have just 2 entries for the IPS4 system and it seems to be working fine.



Machsterdaemon, would you mind posting a copy of your robots.txt?

Thanks!


I was just going by IPB 3.x's logic. I'm just not sure there's any legitimate reason why Google should index those folders, and I'm pretty sure that without these two:

Disallow: /*$do=getNewComment
Disallow: /*$do=findComment&comment$

Google would index the same content twice or more and treat it as duplicate content. My SEO is very rusty, but doesn't duplicate content detract from your standing with Google? I was hoping for more feedback than this, though. Maybe it's only an issue for me, and for everyone else it's a non-issue. :/



I get redirected to the normal topic URL if I click a URL with this:

do=getNewComment

But a profile activity link with the below parameter:

do=findComment&comment

redirects me to a post URL ending with the fragment below (I've put x in place of the post ID):

#comment-xxxxxxx

URLs with the above fragment (#comment) are probably not indexed by Google, even if you allow guests to view the profile pages.

If you are using Google Search Console (Webmaster Tools), there is a section that lets you view the URL parameters on your site that Google is ignoring. You can also configure more parameters in that section.


Quote

I get redirected to the normal topic URL if I click a URL with this:

do=getNewComment

This is the case on one site running 4.0.8.1 when you do this as a guest but not as a member, and it also seems to be the case when using this forum as a guest. But it seems that if you are logged in, you get redirected to this URL instead:

#comment

  • 3 weeks later...
  • 6 months later...

Hi

Same question here:

What useful directives should we use in robots.txt?

This one:

Disallow: /profile/

looks very useful, as many profiles are empty and it is not good at all to let Google index thousands of empty (or almost empty) identical pages.


It's really a big problem.

There are a lot of 404 errors because the Google bot wants to read old CSS and JS files.

Example:

Quote

2016/02/16 10:53:39 [error] 16090#0: *14454 open() "/var/www/forum/uploads/javascript_global/root_map.js.0a131f7c22268fd0f11d92ed432ca70a.js" failed (2: No such file or directory), client: 66.249.78.145, server: www.website.fr, request: "GET /forum/uploads/javascript_global/root_map.js.0a131f7c22268fd0f11d92ed432ca70a.js?v=2c4842661c HTTP/1.1", host: "www.website.fr"
2016/02/16 10:57:34 [error] 16093#0: *16365 open() "/var/www/forum/uploads/css_built_1/76e62c573090645fb99a15a363d8620e_forums_responsive.css.7141dd3c0e8bb8a476448853384525e2.css" failed (2: No such file or directory), client: 66.249.78.145, server: www.website.fr, request: "GET /forum/uploads/css_built_1/76e62c573090645fb99a15a363d8620e_forums_responsive.css.7141dd3c0e8bb8a476448853384525e2.css?v=2c4842661c HTTP/1.1", host: "www.website.fr"
2016/02/16 10:57:35 [error] 16093#0: *16391 open() "/var/www/forum/uploads/javascript_global/root_front.js.20fa81eb4ec063d131495937504c7b62.js" failed (2: No such file or directory), client: 66.249.64.156, server: www.website.fr, request: "GET /forum/uploads/javascript_global/root_front.js.20fa81eb4ec063d131495937504c7b62.js?v=2c4842661c HTTP/1.1", host: "www.website.fr"

But if you block the /uploads/ directory, Google can't read the CSS files, so there is no design in Google's rendered preview and cache.
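
If you still want to keep /uploads/ out of the index while letting Google fetch the CSS and JS it needs to render pages, robots.txt Allow rules can carve out exceptions. This is a sketch only, based on the directory names in the log above; Google honours Allow rules, but not every bot does:

User-agent: *
Allow: /uploads/css_built_*
Allow: /uploads/javascript_*
Disallow: /uploads/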


Noted; as I stated in my post, I didn't intend for that to be used, but thank you for alerting me to it. I moved away from this and went with a more customised solution. I assume that's why no one really responded to this, and why the discussion never really took off.


2 hours ago, Bliblou said:

It's really a big problem.

Please elaborate on why that should be a big problem.

2 hours ago, Bliblou said:

There are a lot of 404 errors because the Google bot wants to read old CSS and JS files.

So what? That’s how the internet works. If stuff gets deleted, the server throws a 404, and if that continues to be the case, the search engines know that this stuff is probably gone for good. That’s all intended behaviour.

2 hours ago, Bliblou said:

But if you block the /uploads/ directory, Google can't read the CSS files

Correct. Don’t block that stuff. Just let the crawler do its work.


9 hours ago, Bliblou said:

Are you serious?

No. It was a joke as a reply to your comment, which showed you don’t know what you are talking about regarding this issue. 


9 hours ago, Bliblou said:

IPS deletes the files, not Google. IPS doesn't care about this problem.

They don’t care because THERE IS NO PROBLEM. And it has been explained to you several times now, by other users AND the IPS staff in the bug tracker. You need to accept that instead of reopening the topic over and over again.


I don’t know what you are talking about lol.

I'm probably an amateur like you! The 404 page is not a solution, and if you are really a dev, you know that.

It's not a solution to use the 404 error like that.

A bot wants to read a file and the file is not there. It's normal for the bot to rescan the page over the next few days (in case the file was deleted by mistake).

So never mind; if you don't understand, and if for you it's normal to use the 404 error like that, that's cool!


So,

I found a partial solution (for NGINX).

In the virtual host config:

Quote

# CSS and Javascript
location ~* \.(?:css|js)$ {
  expires 24h;
  access_log off;
  add_header Cache-Control "public";
  log_not_found off;
}


log_not_found off; => no logging for 404s
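
If you'd rather not silence missing-file logging for every CSS/JS path on the server, the same directives can be scoped to the IPS uploads directory instead. A sketch only, assuming the /forum/uploads/ prefix seen in the error log above:

Quote

# Only the rebuilt IPS assets; 404s elsewhere are still logged
location ~* ^/forum/uploads/.*\.(?:css|js)$ {
  expires 24h;
  access_log off;
  add_header Cache-Control "public";
  log_not_found off;
}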


30 minutes ago, Bliblou said:

I'm probably an amateur like you!

Can’t speak for you, but I’ve been a professional web designer since 1999.

But that doesn’t matter much. Anyone can be right or wrong regarding certain questions. 

30 minutes ago, Bliblou said:

It's not a solution to use the 404 error like that.

Yes, it is. You are just not listening to anyone’s arguments in the 3 topics that are now open. You are just stubbornly repeating your complaint, without being able to properly explain why this should even be a “big problem”, as you say.


If you really want to understand, you can just ASK politely in ONE topic or bug report. There is no need to speak of a “big problem” if you can’t be sure it even is a problem at all, and there is no need to point fingers at IPS if you can’t be sure they actually did something wrong.


It's not normal for me, and if the professional web designer since 1999 thinks the 404 error is normal, that's your choice.

I asked politely (with my limited vocabulary in English!) in 2 topics because I think it's interesting, because the 2 topics have different subjects, and I think it's not your problem (you're not a moderator, no?).

It's IPB that deletes the CSS and JS files. Normal or not, I want to understand why this situation exists, and if for me it's a big problem, that's because a 404 error is not a solution.

It's a support forum, so I ask questions. If the subject doesn't interest you, don't answer. Easy.

Don't tell me where I can post and which questions I can post.


11 minutes ago, Bliblou said:

I want to understand why this situation exists

Because resources like CSS and JavaScript files get rebuilt after certain actions like upgrades. That’s a GOOD thing. It AVOIDS problems, e.g. with cached resources.

Google is okay with that. It will just pick up the new URLs and learn that the old ones are gone.

11 minutes ago, Bliblou said:

and if for me it's a big problem

Why?

11 minutes ago, Bliblou said:

that's because a 404 error is not a solution.

Why?

11 minutes ago, Bliblou said:

It's a support forum, so I ask questions.

No, you didn’t ask about this, you complained: “The problem is IPB Team and DEV. Why files are deleted ??!! It's ridiculous.”

11 minutes ago, Bliblou said:

If the subject doesn't interest you, don't answer.

Who says it doesn’t interest me?

11 minutes ago, Bliblou said:

Don't tell me where I can post and which questions I can post.

I didn’t do that. 


Archived

This topic is now archived and is closed to further replies.
