
Large community? You have a problem with your sitemap!



The IPS sitemap generator uses a dedicated database table, core_sitemap, as its source for refreshing.

The primary sitemap URL that search engines fetch is https://example.com/sitemap.php, which is an index of sub-sitemap files. You can see the list of those files by following that link.

Each of those files contains no more than 1,000 URLs to specific pages (profile statuses, topics (without page or comment parameters) and other elements that have sitemap support as a core extension).

One of our cases is a forum with more than 100k topics, more than 4.2 million posts and more than 6 million users. With simple math that gives us 5214 sitemap files. You can count the number of those files with this query:

select count(*) from core_sitemap; // 5214

The sitemap generator task runs by default once every 15 minutes and updates only the single oldest file from that big list. With simple math we can answer the question 'how long does it take to update everything?' (users don't post only in the newest topics, they post in old ones too... and a newly created topic will only be added to a sitemap file once ALL older files are newer than the file that should contain it). So, how much time do we need for a full update?

5214*15 = 78210 minutes = 1303 hours = 54 days! 54! days! Search engines will pick up your newest content 54 days after it was posted. Incredible, isn't it? Don't believe it? Or want to know the lag for your own community? You can find your lag time with this SQL:

select FROM_UNIXTIME(updated,'%a %b %d %H:%i:%s UTC %Y') from core_sitemap order by updated asc limit 1; // Wed Nov 01 14:13:49 UTC 2017

Yep.. In our case the oldest file was last updated on November 1st...

What should we do to fix it? A very quick solution: create a temporary file, e.g. 'mycustomsitemapupdater.php', with this content:

<?php

// bootstrap the IPS suite
require 'init.php';

// rebuild the single oldest sitemap file, the same thing the background task does
$generator = new \IPS\Sitemap;
$generator->buildNextSitemap();

// print the age of the oldest sitemap file after the rebuild
$last = \IPS\Db::i()->select('FROM_UNIXTIME(updated, "%a %b %d %H:%i:%s UTC %Y")', 'core_sitemap', null, 'updated asc', 1)->first();
print_r('Oldest time now: ' . $last . PHP_EOL);

Then run it via web or CLI as many times as you need (until the oldest time is no longer so old).

A longer-term solution: add this script to cron and run it every minute, or better, change the 'sitemap generator' task interval from 15 minutes to one minute (this may still not solve your particular problem; if you need even faster updates, tune it sensibly).
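
If you go the cron route, a crontab entry might look like this (a minimal sketch; the path is only an example and should point at your own IPS installation):

# run the custom sitemap updater once per minute (example path)
* * * * * php /var/www/community/mycustomsitemapupdater.php > /dev/null 2>&1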

A better solution: wait for IPS to improve this system.

Thanks for your attention!

P.S. If you read my text as negative, that's wrong. I love IPS and just want to draw attention to this problem and help others with their large communities. ^_^


A little improvement (at one run per minute, 5214 files would still take more than 3 days to update). You can speed this up further. First, measure the time needed for one run:

time php mycustomsitemapupdater.php // returns something like 4 sec

With that you can add a loop inside the script that calls $generator->buildNextSitemap() X times (a sketch follows below). In my case, 10 times per one-minute cron run. So for 5214 files I need 521 minutes for a full update (roughly 8 hours, not bad).
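
A minimal sketch of that batched version of mycustomsitemapupdater.php, assuming a batch size of 10 (pick a value that fits inside your cron interval):

<?php

// bootstrap the IPS suite
require 'init.php';

// how many sitemap files to rebuild per run (example value)
$batchSize = 10;

$generator = new \IPS\Sitemap;
for ($i = 0; $i < $batchSize; $i++)
{
    // each call rebuilds the single oldest sitemap file
    $generator->buildNextSitemap();
}

// show how far behind the oldest sitemap file is now
$last = \IPS\Db::i()->select('FROM_UNIXTIME(updated, "%a %b %d %H:%i:%s UTC %Y")', 'core_sitemap', null, 'updated asc', 1)->first();
print_r('Oldest time now: ' . $last . PHP_EOL);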


6 hours ago, opentype said:

I am also not sure what ProSkill is asking about for example. The original post is about the speed of generating sitemaps for large sites, ProSkill talks about “decrease in cached URLs”. Not sure what that is and how it relates to sitemap generation. 

I don't think he read it correctly. lol


13 hours ago, opentype said:

I am also not sure what ProSkill is asking about for example. The original post is about the speed of generating sitemaps for large sites, ProSkill talks about “decrease in cached URLs”. Not sure what that is and how it relates to sitemap generation. 

Cached URLs in Google's index. If the sitemap isn't generated fast enough, or doesn't contain enough URLs, then obviously the sitemap won't give Google the full picture.


Yet, you mention a “decrease” in cached (you mean indexed?) URLs. How do you lose pages because of slow sitemap creation? It doesn’t make any sense.

By the way: Google will crawl your site anyway and discover new content. A sitemap is technically not even necessary. So I would be careful about blaming problems with your site on the sitemap creation.


"Each of that file contain no more than 1000 urls to specail pages"

I take that to mean that each sitemap file is only going to show up to 1,000 URLs. If you have 8k+ topics and 700k posts, that can be an issue.

"Search engine will add your newest content after 54 days after them posted"

I take that to mean that there is a long delay before the sitemap reflects changes. If you move posts, topics, or forums around, there will be a significant delay before the sitemap is updated.

All of the above can lead to the sitemap being out of date, and thus you can lose cached URLs.

 

By the way: Google will crawl your site anyway and discover new content.

Yes, they will. However, a sitemap is the best way to tell Google about the contents of your site. You can see how many URLs are cached from the sitemap vs. how many are found via crawling. The sitemap typically has a lot more cached URLs.
 

Edited by ProSkill

20 minutes ago, ProSkill said:

All of the above can lead to the sitemap being out of date, and thus you can lose cached URLs.

No. Nothing gets dropped from the index for not being in the sitemap. 

Or maybe we just have a language problem. You keep saying “cached URLs” and probably mean something else. Maybe you also don’t mean “loosing”. I don’t know. 


On 1/6/2018 at 3:18 PM, opentype said:

No. Nothing gets dropped from the index for not being in the sitemap. 

Or maybe we just have a language problem. You keep saying “cached URLs” and probably mean something else. Maybe you also don’t mean “loosing”. I don’t know. 

Put simply, if your sitemap is out of date and/or inaccurate, you may lose indexed pages in Google or experience long delays in indexing. From what I understand, the OP has identified an issue where the default sitemap generator is too slow and doesn't give a complete accounting of all pages once 1,000 URLs are generated per sitemap file. If you run a large and dynamic community, this may present an issue.


It would be great if IPS allowed you to manually rebuild the sitemap on demand. XenForo allows you to do this, and it does it via a cron.

When you adjust the sitemap settings in IPS there is a long delay (OP mentioned it was every 15 mins? Unless I misread)


Found one more sitemap problem.

The <lastmod> tag shows the generation time of the current sitemap file. That is formally correct, but... what does the standard say?

Quote

<lastmod> optional
Identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. The value for the lastmod tag should be in W3C Datetime format.

By providing the last modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index i.e. a crawler may only retrieve Sitemaps that were modified since a certain date. This incremental Sitemap fetching mechanism allows for the rapid discovery of new URLs on very large sites.

So, coming back to our case: we now have 5271 sitemap files, and Google has to fetch all of them! It sees 'it's modified, take it' regardless of whether the content inside actually changed. Moreover, the URLs inside the sub-sitemaps have no <lastmod> tags at all. So Google fetches a very stale sub-sitemap file and sees just a list of URLs without any additional metadata.

 

My proposal:

Add a <lastmod> tag to every URL inside all sub-sitemaps. It will tell Google which URLs have new content and should be scanned, and which ones have not changed and don't need a re-scan => this will optimize crawl performance (see the example below).

Make the <lastmod> tag in the index sitemap file stop reporting the file's generation date; it should instead report the newest modification date of the URLs inside that file. With that, Google never has to download a sub-sitemap of 500 URLs where nothing has changed => this will optimize crawl performance.
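
For illustration only, a sub-sitemap entry would then carry its own <lastmod> in W3C Datetime format, something like this (the <loc> is a placeholder, not the exact IPS URL format):

<url>
  <loc>https://example.com/topic/12345-example-topic/</loc>
  <lastmod>2018-01-10T14:13:49+00:00</lastmod>
</url>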

P.S. I'll try to create a patch. If I do, I'll share it here (for other devs to check and to help IPS).

Thanks for your attention and support )


Did it.

Before:

[screenshot: before the patch]

After:

[screenshot: after the patch]

No issues detected by several sitemap online checking tools:

[screenshot: checker results]

I did it in a very ugly way, just to try it and check. You can improve it yourself (and please share it with us):

/applications/core/extensions/core/Sitemap/Content.php

Line 209: after the $data line, add this:

if (get_class($node) === 'IPS\forums\Forum' && isset($node->last_post)) {
    $data['lastmod'] = $node->last_post;
}

And at line 259 (line 262 after adding the previous block), add this after the $data line:

if (get_class($item) === 'IPS\forums\Topic' && isset($item->last_post)) {
    $data['lastmod'] = $item->last_post;
}

After that, the sitemap script needs to regenerate all sub-sitemaps to write the new data to the database.

And I haven't yet implemented the correct lastmod in the index sitemap, based on the newest date inside each sub-sitemap.

Thanks.


On 03.01.2018 at 5:28 PM, Nathan Explosion said:

Has it been logged as a bug with support? If not, do so.....you can't rely solely on the Peer to Peer support forum as a way of reporting issues to the developers.

My way of communicating with IPS: write a detailed topic in the forums first, discuss it with other devs (maybe I'm wrong about something?), and after some time ping them via the ticket system. The ticket has now been created (id 996743, for IPS support).

On 06.01.2018 at 10:52 PM, ProSkill said:

I take that to mean that each sitemap file is only going to show up to 1,000 URLs. If you have 8k+ topics and 700k posts, that can be an issue.

No issues with that. 8k topics will create 8 sub-sitemap files, and they will be linked from the index sitemap file. There is also no issue with 5000+ links to sub-sitemaps from the index file; Google handles it successfully. Confirmed.

On 06.01.2018 at 10:52 PM, ProSkill said:

I take that to mean that there is a long delay before the sitemap reflects changes. If you move posts, topics, or forums around, there will be a significant delay before the sitemap is updated.

The sitemap will only update if you move a post from one topic to another. The sitemap contains 'item' elements, where an 'item' is a topic, a profile status, a calendar category... Sub-items like posts, calendar events and profile status replies are just content united inside those 'items'. Google doesn't build something like a 'tree'; that is just a convenient, human-understandable way to imagine how it works. In fact Google stores URLs from the given domain and their content, and with a sitemap we just help it do that more intelligently. And not only with sitemaps: URL parameter settings help too. For example, I told Google that the 'page' query parameter in a link means page navigation, so it knows it should scan every page with ?page=2, ?page=3 and so on, because each contains different content. Whereas comment= is just a link to a specific location, not a content change, etc.

On 08.01.2018 at 5:20 AM, Optic14 said:

When you adjust the sitemap settings in IPS there is a long delay (OP mentioned it was every 15 mins? Unless I misread)

Right now, once per 15 minutes, IPS updates only 1 sitemap file, not everything. Yep, that's not good. And I agree with the proposal for the ability to run a full update manually.


Can you not rebuild by emptying the database table?

Of course it will then take a lot of time to rebuild, as it does only 1 'page' at a time every 15 minutes, but if you can reduce that interval as above then it may be workable?

 

Plus, as the sitemap is only submitted to Google once every 24 hours by the Invision software,

then if 'improving' this feature it may also be worthwhile changing this fixed 24-hour term to a user-selectable setting?

 

 

Edited by sound

I have some questions:

1) Do I need to turn off IPB's every-15-minutes sitemap feature, or do anything?

2) I did set this up to run every minute: mycustomsitemapupdater.php. But I did not understand the improvements you mentioned below. Is this a code change in the file? Can you explain?

On 12/25/2017 at 4:51 AM, Upgradeovec said:

A little improvement (at one run per minute, 5214 files would still take more than 3 days to update). You can speed this up further. First, measure the time needed for one run:

time php mycustomsitemapupdater.php // returns something like 4 sec

With that you can add a loop inside the script that calls $generator->buildNextSitemap() X times. In my case, 10 times per one-minute cron run. So for 5214 files I need 521 minutes for a full update (roughly 8 hours, not bad).

3) I could not find the code mentioned in this file:

/applications/core/extensions/core/Sitemap/Content.php

Is it located somewhere else, or has IPB changed it?

Edited by sadams101

On 12.01.2018 at 11:06 PM, sadams101 said:

I wasn't able to find this code in the file mentioned:

Perhaps they included this in a recent update?

It's my code, which needs to be inserted after the $data line.

On 12.01.2018 at 11:44 PM, sadams101 said:

1) Do I need to turn off IPB's every-15-minutes sitemap feature, or do anything?

No. Neither of them affects the other.

On 12.01.2018 at 11:44 PM, sadams101 said:

2) I did set this up to run every minute: mycustomsitemapupdater.php. But I did not understand the improvements you mentioned below. Is this a code change in the file? Can you explain?

You can run 

select FROM_UNIXTIME(updated,'%a %b %d %H:%i:%s UTC %Y') from core_sitemap order by updated asc limit 1;

and see the lag between the last-updated file and the current date. If they are similar, the script has already caught up to the current date. Before my script was running, this command showed the lag without any additional runs; as I described at the top, on my server that lag got to more than 1 month.


On 13.01.2018 at 8:29 PM, AlexWebsites said:

Wonder what they are going to actually improve in 4.3. Can you keep us posted if applying your patch helped?

[screenshot: Google webmaster tools sitemap report]

Mainly, the result of the first patch is already there. Google webmaster tools now shows 4 Jan for my oldest sitemap file (actually all of them were updated today, but Google fetches them on its own schedule). The second patch, with <lastmod>, can provide ordering (last column). I can't see improvements in the statistics yet. This is a big job for Google and takes a lot of time and resources (and every update does too), so it needs more time. Here is the graph of downloaded size per day:

[graph: downloaded size per day]

I see the average going up, and this is good. Numbers are in KiB (as the legend says).

I'll post more info here when I get more solid proof and results. Thank you for your interest )

Edited by Upgradeovec
