
Large community? You have a problem with the sitemap!



37 minutes ago, jair101 said:

I believe this issue is with the share links, not with the embeds:

 

No ... and no.

That's the problem in every post (the post date with its direct link), for example here:

https://invisioncommunity.com/forums/topic/442742-large-community-you-have-a-problems-with-sitemap/?do=findComment&comment=2728237

 


4 minutes ago, mark007 said:

 

No ... and no.

That's the problem in every post (the post date with its direct link), for example here:


https://invisioncommunity.com/forums/topic/442742-large-community-you-have-a-problems-with-sitemap/?do=findComment&comment=2728237

 

Click on the icon to share a post and see what the link looks like...


Hello, I have the same problem, plus another one.

My sitemap generation breaks when a blog has been deleted but its blog_entry rows are still there.
The sitemap generation stops.

Would it be possible to add a safeguard so that an entry is not added if entry_author_id=0?
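A minimal sketch of the kind of safeguard I mean, in Python for illustration only (the real generator is not written in Python). `entry_author_id` is the column mentioned above; `entry_blog_id` and the function name are my own assumptions, not the software's actual schema or API.

```python
def indexable_entries(entries, existing_blog_ids):
    """Yield only the blog entries that are safe to add to the sitemap."""
    for entry in entries:
        # Skip entries without an author (entry_author_id=0).
        if entry["entry_author_id"] == 0:
            continue
        # Skip orphaned entries whose parent blog has been deleted
        # (entry_blog_id is a hypothetical column name).
        if entry["entry_blog_id"] not in existing_blog_ids:
            continue
        yield entry


entries = [
    {"id": 1, "entry_author_id": 7, "entry_blog_id": 10},
    {"id": 2, "entry_author_id": 0, "entry_blog_id": 10},  # no author
    {"id": 3, "entry_author_id": 7, "entry_blog_id": 99},  # blog deleted
]
kept = [e["id"] for e in indexable_entries(entries, existing_blog_ids={10})]
```

With a filter like this the orphaned rows are skipped and the rest of the sitemap can still be built.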

 

As for the topic date, my script is updated, but the date does not appear in the sitemap.

Edited by Xavier Hallade

Also check your robots.txt file. A good configuration looks like this:

Quote

# Sitemap...
Sitemap: http://yoursite.com/sitemap.php

User-agent: *

# Disallow directory
Disallow: /api/
Disallow: /cgi-bin/
Disallow: /datastore/
Disallow: /plugins/
Disallow: /system/
Disallow: /themes/
Disallow: /go/

#Disallow files
Disallow: /403error.php
Disallow: /404error.php
Disallow: /500error.php
Disallow: /Credits.txt
Disallow: /error.php
Disallow: /upgrading.html

# Querystring
Disallow: /?tab=*
Disallow: /index.php?*
Disallow: /*?app=*
Disallow: /*sortby=*
Disallow: /*/?do=download
Disallow: /profil/*/?do=*
Disallow: /profil/*/content/
Disallow: /*?do=add
Disallow: /*?do=email
Disallow: /*?do=getNewComment
Disallow: /*?do=getLastComment
Disallow: /*?do=findComment*
Disallow: /*?do=reportComment*

# Allow specific parts
Allow: /applications/core/interface/imageproxy/imageproxy.php?img=*
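To check what rules like these would block, here is a minimal sketch of wildcard matching in Python. It assumes Google-style patterns (`*` matches any run of characters, a trailing `$` anchors the end of the URL) and only looks at Disallow lines, so Allow precedence is ignored. The helper names are made up for illustration.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each '*' match any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path, disallow_rules):
    """Return True if any Disallow pattern matches the start of the path."""
    return any(robots_pattern_to_regex(rule).match(path) for rule in disallow_rules)


# A few of the Disallow rules from the configuration above.
RULES = ["/api/", "/*sortby=*", "/*?do=findComment*"]

comment_link = (
    "/forums/topic/442742-large-community-you-have-a-problems-with-sitemap/"
    "?do=findComment&comment=2728237"
)
topic_link = "/forums/topic/442742-large-community-you-have-a-problems-with-sitemap/"

print(is_disallowed(comment_link, RULES))  # True
print(is_disallowed(topic_link, RULES))    # False
```

So with this configuration the direct comment links discussed earlier in the topic would be blocked, while the plain topic URL stays crawlable.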

 


7 hours ago, SeNioR- said:

Also check your robots.txt file. A good configuration looks like this:

 

We can add admin too.

In my case, with more than 1,200 sitemap files, I think it's more useful to regenerate the list of the latest topics and blog comments frequently than to regenerate everything.
If I run the script every minute, my latest-topics sitemap takes more than 20 hours to come around again.
So I modified the script so that, every x runs, it forces the regeneration of the last sitemap file for blogs and for topics.
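Roughly, the idea looks like this. It is a sketch in Python rather than my actual script, and it assumes one file is rebuilt per run, the newest content lives in the last file, and `force_every` stands for the "x" above.

```python
from itertools import count, islice


def rebuild_schedule(total_files, force_every):
    """Yield the index of the sitemap file to rebuild on each run.

    Normal runs walk through the files in order (round-robin);
    every `force_every`-th run rebuilds the newest file instead.
    """
    newest = total_files - 1
    cursor = 0
    for run in count(1):
        if run % force_every == 0:
            yield newest
        else:
            yield cursor
            cursor = (cursor + 1) % total_files


# With 1,200 files and one run per minute, a plain full cycle takes
# 1200 / 60 = 20 hours before the newest file is rebuilt again.
hours_for_full_cycle = 1200 / 60

first_runs = list(islice(rebuild_schedule(total_files=1200, force_every=10), 10))
print(first_runs)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 1199]
```

With `force_every=10` and a run every minute, the newest file is refreshed every 10 minutes, while the older files still get rebuilt, only a little more slowly.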


  • Management

Just so you know, we're watching this topic and looking at our own stats to build a better picture.

The facts we know:

1) Almost every site I've got access to (via friends, etc.) has seen a massive drop in indexed pages since June. This is not exclusive to Invision Community powered sites. I've seen the same with WordPress.

2) Google slipped in an update in 2017 to target several things, including poor backlinks and other poor quality links. It looks like this means that user profiles that have no content have been dropped from the index, along with links that 301. That is fine. You don't want Google storing the 301 link, as long as it stores the real link (and it does seem to).

3) A drop in what is indexed doesn't actually correlate to the health of the site. We've seen our index volume drop, but clicks, engagement and discovery slightly increase (probably due to better quality results?)

As always, Google say nothing so we're left guessing.

We will look at stopping user profiles from being submitted. For example, we see nearly 380k links as 'discovered' but Google has chosen to not index them. Looking through the list, it's all user profiles.

This means:

1) Sitemaps are working fine. There's no massive problem with them that correlates with a drop in indexed pages

2) The cornerstones of good SEO are taken care of in the software

3) Google is being weird and mysterious as always.

What can we do in the short term?

1) Stop sending profiles with no content to the sitemap. They are now ignored and Google appears to be dropping them from its indexes

2) Add in nofollow on links that 301 so Google doesn't bother 'discovering' them at all.
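As a rough illustration of point 1, here is a sketch in Python. It is not the actual Invision Community code; the `Member` class, its `content_count` field and the example URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Member:
    profile_url: str
    content_count: int  # hypothetical: posts, status updates, etc.


def profiles_for_sitemap(members, min_content=1):
    """Return profile URLs only for members who have contributed content."""
    return [m.profile_url for m in members if m.content_count >= min_content]


members = [
    Member("https://example.com/profile/1-active-member/", 42),
    Member("https://example.com/profile/2-never-posted/", 0),
]
urls = profiles_for_sitemap(members)
print(urls)  # ['https://example.com/profile/1-active-member/']
```

The threshold is a parameter here because what counts as "no content" is a judgment call.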


Profiles that are not being indexed seem to be ones that have not posted anything, so the pages have little to no content. At least that's what I'm seeing. I don't know if I would exclude all profiles. You could have content, status updates and such that make up an indexed page that could draw in traffic. I would change the page title to "xxxxx's profile page - site name" or something like that, with the ability to customize it. Most display names are short, unlike topic names.

I would add lastmod as mentioned and update frequency tags into the sitemap. I would see if a separate image sitemap could be generated as part of gallery app with google recommended tags.
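For reference, this is roughly what sitemap entries with those tags look like, generated here with a small Python sketch. The `lastmod` and `changefreq` elements and the namespace come from the sitemaps.org protocol; the example URL, the function name and the chosen values are made up.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Build a sitemap <urlset> from (loc, last_modified, changefreq) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified, changefreq in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # W3C datetime format, normalised to UTC.
        stamp = last_modified.astimezone(timezone.utc).isoformat(timespec="seconds")
        ET.SubElement(url, "lastmod").text = stamp
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_urlset([
    (
        "https://example.com/forums/topic/1-hello/",
        datetime(2018, 2, 2, 16, 25, 22, tzinfo=timezone.utc),
        "daily",
    ),
])
print(sitemap_xml)
```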

Also, some SEO things... I would give the option of changing the automatic meta description length in settings. Google seems to be allowing longer descriptions. Add the ability to automatically include tags as meta keywords; even though Google may not use keywords, there are other traffic sources that do.


I think accounts with 0 posts can be excluded from the sitemap.
Other options:
- allow excluding topics with 0 replies
- add lastmod to all sitemap content (to try to trigger a Google crawl on new content)
- allow excluding / noindexing a particular topic / profile / page
- generate a robots.txt with good values
- allow refreshing the sitemap more frequently for the most recently added content
  I've been using IPboard since 2004. If my old topics are not regenerated in the sitemap frequently, that's not a problem, but for my latest topics / blogs / profiles I would like a more frequent refresh. I have more than 1000 sitemap files; with a 15 min refresh, my new topics only reach the sitemap every 10 days.
- generate errors during sitemap generation :D
  I have old entries, old blogs, old content... My sitemap had not been updated for 2 months because there are 10 blog entries which are linked to a deleted blog. No error was shown, but the sitemap was not updated ($e->last_message is empty, but $e is not).
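For the last point, a minimal sketch of what I mean by generating errors, written in Python only to show the idea. The function and builder names are hypothetical and this is not how the software is actually structured.

```python
import logging

logger = logging.getLogger("sitemap")


def build_all(builders):
    """Run every sitemap builder and report failures instead of stopping silently."""
    failures = []
    for name, build in builders.items():
        try:
            build()
        except Exception as exc:
            # repr() keeps the exception type visible even when its message is empty.
            logger.error("sitemap %s failed: %r", name, exc)
            failures.append((name, repr(exc)))
    return failures


def build_topics():
    pass  # stands in for a builder that works


def build_blog_entries():
    raise RuntimeError()  # stands in for a failure with an empty message


failures = build_all({"topics": build_topics, "blog_entries": build_blog_entries})
```

Here one broken builder is logged and the others still run, so a failure with an empty message is no longer invisible.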


8 minutes ago, SebastienG said:

I think accounts with 0 posts can be excluded from the sitemap.
Other options:
- allow excluding topics with 0 replies
- add lastmod to all sitemap content (to try to trigger a Google crawl on new content)
- allow excluding / noindexing a particular topic / profile / page
- generate a robots.txt with good values
- allow refreshing the sitemap more frequently for the most recently added content
  I've been using IPboard since 2004. If my old topics are not regenerated in the sitemap frequently, that's not a problem, but for my latest topics / blogs / profiles I would like a more frequent refresh. I have more than 1000 sitemap files; with a 15 min refresh, my new topics only reach the sitemap every 10 days.
- generate errors during sitemap generation :D
  I have old entries, old blogs, old content... My sitemap had not been updated for 2 months because there are 10 blog entries which are linked to a deleted blog. No error was shown, but the sitemap was not updated ($e->last_message is empty, but $e is not).

Good list but I am not for excluding topics with no responses. Most of the time the first post contains all the initial topic content and should index well as long as members aren't starting topics with a very low word count.


  • Management

I think we need to be mindful that the sitemap is just one way that Google discovers and crawls links.

What goes in the sitemap isn't a hard rule that Google must only check out those links, so there's little point in adding too many restrictions here and there because it'll be mostly pointless. You'll submit fewer links, but Google will still pull up the ones you didn't add.

I did add a setting for profiles, because of the huge number of 'dead' profiles that stuff up the sitemap, which is just a waste.

[Screenshot: Search Engine Optimization settings, 2018-02-02]

What may or may not be in the sitemap doesn't solve why Google is shedding indexed pages.

That said, when using the new search console, the figures are totally different.

We have 92k indexed pages
We have about 400k pages that Google has either 'discovered' or 'crawled but not indexed' due to its own algorithms. These are 301 redirect links (this is OK, it has no reason to store these) and empty profiles which have almost zero content.

But it's important to realise that Google is not punishing us, it is just working harder to index content that it thinks others will find useful, and "Johnny@11" who registered in 2011 and has never posted doesn't count any more.

 

 


4 hours ago, Matt said:

1) Sitemaps are working fine. There's no massive problem with them that correlates with a drop in indexed pages

Yes, they're doing great. :mad:
Then how would you explain this?

Google says:
[Screenshot]

Google says that this Topic in my Forum has 25 posts written by 16 authors.

But, reality is different:

[Two screenshots]

That topic has 1,238 replies written by who knows how many authors.

Who is to blame for Google not seeing 98% of the posts written in this topic?
Who is responsible, if the sitemap is working fine? Please give me a reasonable explanation.

If I understand your post correctly, IPS only wants to reduce the number of pages that are excluded, not to increase the number of indexed pages.


  • Management

Again, the sitemap is not a YOU CAN ONLY LOOK AT THESE LINKS GOOGLE LOL.

The sitemap just informs Google of "important" URLs on your site. It will use these as a base to spider out from.

I have no idea why Google is not updating the meta data of your indexed URL. That's not down to the sitemap. That's down to Google not refreshing the data. Google will pull the replies meta data from the page itself.

To save me the bother, what is the URL to that topic? I'd like to review the meta tags in the JSON-LD to make sure they're correct.


3 minutes ago, Nesa said:

Yes, they're doing great. :mad:
Then how would you explain this?

Google says:
[Screenshot]

Google says that this Topic in my Forum has 25 posts written by 16 authors.

But, reality is different:

[Two screenshots]

That topic has 1,238 replies written by who knows how many authors.

Who is to blame for Google not seeing 98% of the posts written in this topic?
Who is responsible, if the sitemap is working fine? Please give me a reasonable explanation.

If I understand your post correctly, IPS only wants to reduce the number of pages that are excluded, not to increase the number of indexed pages.

Wouldn't Google only capture how many authors are part of that page / URL? If you want to test, change how many posts you show per page, which will reduce or increase your topic page count. Maybe I'm wrong though.

The question for topics with a lot of replies is: how well are those additional pages indexed? Are the page titles, URLs, etc. SEF and without duplicate meta info?


  • Management

Ok, right away I can see the LD is fine.

    "interactionStatistic": [
        {
            "@type": "InteractionCounter",
            "interactionType": "http://schema.org/ViewAction",
            "userInteractionCount": 80927
        },
        {
            "@type": "InteractionCounter",
            "interactionType": "http://schema.org/CommentAction",
            "userInteractionCount": 1239
        },
        {
            "@type": "InteractionCounter",
            "interactionType": "http://schema.org/FollowAction",
            "userInteractionCount": 3
        }
    ],
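If you want to check these counters yourself, a small Python sketch like the one below can pull a value out of the JSON-LD. It assumes the fragment above is wrapped in a complete JSON object; the helper name is made up.

```python
import json

# The interactionStatistic fragment from above, wrapped in a full object.
json_ld = json.loads("""
{
    "interactionStatistic": [
        {
            "@type": "InteractionCounter",
            "interactionType": "http://schema.org/ViewAction",
            "userInteractionCount": 80927
        },
        {
            "@type": "InteractionCounter",
            "interactionType": "http://schema.org/CommentAction",
            "userInteractionCount": 1239
        },
        {
            "@type": "InteractionCounter",
            "interactionType": "http://schema.org/FollowAction",
            "userInteractionCount": 3
        }
    ]
}
""")


def interaction_count(data, action):
    """Return the count for a schema.org action such as 'CommentAction'."""
    wanted = "http://schema.org/" + action
    for counter in data.get("interactionStatistic", []):
        if counter.get("interactionType") == wanted:
            return counter.get("userInteractionCount")
    return None


print(interaction_count(json_ld, "CommentAction"))  # 1239
```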

Testing the link using Google's tool shows the meta data is being received perfectly.

[Screenshot: Structured Data Testing Tool, 2018-02-02]

Invision Community is doing its job.

 

 


1 minute ago, AlexWebsites said:

Wouldn't Google only capture how many authors are part of that page / URL? If you want to test, change how many posts you show per page, which will reduce or increase your topic page count. Maybe I'm wrong though.

Yes, that sounds logical. Like you, I'm not sure...
I've noticed that new posts and new topics do not appear in the Google index for 10 days... and that same 10-day delay was also mentioned by other people in this topic.


  • Management

10 days might be fine depending on how often Google visits your site.

Again, the frequency that Google visits your site has nothing to do with the sitemap.

In 4.3, we have added the lastmod timestamp, and added a button to rebuild your index from scratch.

Also, just double check your forum and topic permissions. Remember, if a guest cannot see the page, then Google cannot either.


14 minutes ago, Matt said:

I think we need to be mindful that the sitemap is just one way that Google discovers and crawls links.

What goes in the sitemap isn't a hard rule that Google must only check out those links, so there's little point in adding too many restrictions here and there because it'll be mostly pointless. You'll submit fewer links, but Google will still pull up the ones you didn't add.

I did add a setting for profiles, because of the huge number of 'dead' profiles that stuff up the sitemap, which is just a waste.

[Screenshot: Search Engine Optimization settings, 2018-02-02]

What may or may not be in the sitemap doesn't solve why Google is shedding indexed pages.

That said, when using the new search console, the figures are totally different.

We have 92k indexed pages
We have about 400k pages that Google has either 'discovered' or 'crawled but not indexed' due to its own algorithms. These are 301 redirect links (this is OK, it has no reason to store these) and empty profiles which have almost zero content.

But it's important to realise that Google is not punishing us, it is just working harder to index content that it thinks others will find useful, and "Johnny@11" who registered in 2011 and has never posted doesn't count any more.

 

 

Of course, I agree with you. I have about 1.2M non-indexed pages and 200K indexed pages.
But the sitemap is important. Google analyses it and can plan its crawl using the submitted URLs.

If I have 200k indexed pages and Google crawls 30,000 pages/day, covering both the 'daily' pages and my indexed pages, Google needs about 1 week to crawl my indexed pages.

I think that if lastmod is set, we could probably also use the changefreq value:

New topic: changefreq daily or hourly
Topic not updated for 1 week: daily
Topic not updated for 1 month: weekly
Topic not updated for 1 year: yearly

The goal is to give the various crawlers the new values quickly and not to spend crawl resources on old resources which have not been updated.
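Here is a small Python sketch of that mapping. The thresholds are only my reading of the list above (I picked hourly for new topics, and 30 and 365 days for a month and a year), and the function name is made up.

```python
from datetime import datetime, timedelta, timezone


def changefreq_for(last_update, now):
    """Pick a sitemap changefreq value from the time since the last update."""
    age = now - last_update
    if age < timedelta(days=7):
        return "hourly"   # new or recently active topic
    if age < timedelta(days=30):
        return "daily"    # not updated for 1 week
    if age < timedelta(days=365):
        return "weekly"   # not updated for 1 month
    return "yearly"       # not updated for 1 year


now = datetime(2018, 2, 2, tzinfo=timezone.utc)
print(changefreq_for(now - timedelta(days=1), now))    # hourly
print(changefreq_for(now - timedelta(days=10), now))   # daily
print(changefreq_for(now - timedelta(days=90), now))   # weekly
print(changefreq_for(now - timedelta(days=800), now))  # yearly
```

Keep in mind that changefreq is only a hint to crawlers, so this does not guarantee any particular crawl schedule.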
 

 


  • Management

Also, make sure if you have switched to HTTPS that you add your HTTPS link to Google's search console, or it won't pick up those hits and indexes.

We've seen this being the reason that people have seen drop offs in multiple cases now. There isn't a drop off, it's just Google dropping http indexes and picking up https indexes.


10 hours ago, sadams101 said:

@SeNioR- can you tell me where this robots.txt is from? Also, what is the current standard robots.txt?

You have to create it 

15 hours ago, Matt said:

Also, make sure if you have switched to HTTPS that you add your HTTPS link to Google's search console, or it won't pick up those hits and indexes.

We've seen this being the reason that people have seen drop offs in multiple cases now. There isn't a drop off, it's just Google dropping http indexes and picking up https indexes.

I have both the http and https URLs in Google Search Console and I have still seen drop-offs.
