
Can we have a more efficient upgrade process?


CheersnGears


So I've been running 3.4 -> 4.1 upgrades multiple times over the last 5 days on my test board. What I'm wondering is whether anything can be moved from the initial upgrade process into the cleanup tasks that run after the upgrade. There is just a painful amount of downtime involved. The step I am focusing on is the conversion of tags and the subsequent tag cleanup.

 

This is taking hours for my server to complete. It seems to be processing only about 175 tags every 15 seconds.

[Screenshot: the upgrader stuck on the slow meta-refresh fallback screen]

Anything that can be booted out of the initial upgrade process and moved to post-upgrade cleanup tasks probably should be. I started this upgrade test at 11:40am Eastern and it is 4:09pm now. On my production install, I would want to get back in and start my template updates, block installs, and page creation ASAP, and let a task like this run in the background.
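For a sense of scale, the observed rate makes the total time easy to estimate. A back-of-the-envelope sketch, assuming the ~175 tags per 15 seconds fallback rate holds (the 200,000-tag figure is just an illustrative input, not from this board):

```python
# Rough ETA for the tag-conversion step at the observed fallback rate.
tags_per_batch = 175
seconds_per_batch = 15
rate = tags_per_batch / seconds_per_batch  # ~11.7 tags per second

def eta_hours(total_tags: int) -> float:
    """Hours needed to convert total_tags at the observed rate."""
    return total_tags / rate / 3600

# e.g. 200,000 tags would take roughly 4.8 hours at this pace
print(round(eta_hours(200_000), 1))
```

At that rate even a modest tag table keeps the whole board offline for hours, which is the argument for deferring it.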


  • Management

If you are seeing a screen that looks like that, it means that somewhere along the line your server took a long time to respond and the AJAX system timed out. It automatically reverts to the simple refresh system you see there to keep things going, but that system is very slow, as you can see.

If that happens, I would close your browser and go back to /admin/upgrade. You will see an option to continue your upgrade and it will pick up where it left off, but it will use the AJAX system, which will go much faster than that meta-refresh screen.

I did an upgrade yesterday morning with more tags than that and that step took maybe 15 minutes, but it never fell back to the method you're seeing there.


  • 2 weeks later...
On 6/9/2016 at 6:12 AM, Charles said:

If you are seeing a screen that looks like that, it means that somewhere along the line your server took a long time to respond and the AJAX system timed out. It automatically reverts to the simple refresh system you see there to keep things going, but that system is very slow, as you can see.

If that happens, I would close your browser and go back to /admin/upgrade. You will see an option to continue your upgrade and it will pick up where it left off, but it will use the AJAX system, which will go much faster than that meta-refresh screen.

I did an upgrade yesterday morning with more tags than that and that step took maybe 15 minutes, but it never fell back to the method you're seeing there.

So... I'm running another test upgrade today and it went back to this fallback process again. I did what you said: backed out to the main upgrader and restarted the upgrade process. It used the main screen for a bit and then fell back again. My server load is really low (0.48), so I'm not sure what the issue could be.


  • 11 months later...
  • Management
11 hours ago, TSP said:

Could you please consider implementing a CLI-upgrade process? For large communities it would be a godsend. Please, I'm begging you. 

This topic was about 3.x to v4 upgrades... what are you referring to? 

Whether it's still 3.x to v4 or just v4 and beyond, there is such a limited use case for a CLI upgrader that I really don't see us putting the resources into it; I'm sorry. Even our enterprise clients, some with tens of millions of posts and content items, wouldn't require a command-line upgrade.

Maybe if you can outline the issue for us we can better address your concerns? Thanks.


1 hour ago, Lindy said:

This topic was about 3.x to v4 upgrades... what are you referring to? 

Whether it's still 3.x to v4 or just v4 and beyond, there is such a limited use case for a CLI upgrader that I really don't see us putting the resources into it; I'm sorry. Even our enterprise clients, some with tens of millions of posts and content items, wouldn't require a command-line upgrade.

Maybe if you can outline the issue for us we can better address your concerns? Thanks.

I'm referring to a 3.4 to 4.2 upgrade, but it would be useful for all upgrades.

Currently I'm doing a test upgrade that has so far taken six and a half hours, and at this point I'm not done with the core application. During all this time I have to closely monitor the upgrade process, because it pops up queries that I have to run manually, since this is a very large community. I still have queries ahead of me, against the posts table for example, that will likely take a minimum of 2 hours. There is also the possibility that the browser could time out on one of the requests, crash, or that I could lose my internet connection, which, depending on where in the upgrade process it happens, could cause issues on the retry. While I'm sure you can come up with scenarios where a CLI process would also fail, the risks and potential failure sources are fewer if I could start a CLI command directly in a screen session on the server the software runs on; there are simply fewer components involved.

I would still have to monitor a CLI task, but not as closely, and I could have left the upgrader running with greater peace of mind. A CLI upgrader could also give me a detailed error log immediately when something does fail, which would make solving an upgrader problem more efficient.

It has been the same story with all other major upgrades, but even smaller ones could have benefited from a CLI upgrader, as it would help minimize downtime: a CLI process will always be faster than continually making requests through the browser and interacting with the upgrader because of manual queries.
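To illustrate the idea, here is a minimal sketch of what a resumable CLI upgrade runner could look like. This is not IPS code; the step names, the state file, and the run_step placeholder are all hypothetical. It just shows the two properties argued for above: it picks up where it left off after a crash, and it logs the full failure detail immediately:

```python
import json
import logging
import pathlib
import sys

logging.basicConfig(level=logging.INFO, filename="upgrade.log")
STATE = pathlib.Path("upgrade_state.json")

# Hypothetical ordered upgrade steps; real ones would wrap the DB work.
STEPS = ["convert_members", "convert_topics", "convert_posts", "convert_tags"]

def run_step(name: str) -> None:
    logging.info("running %s", name)  # placeholder for the real work

def main() -> None:
    done = json.loads(STATE.read_text()) if STATE.exists() else []
    for step in STEPS:
        if step in done:
            continue  # resume: skip steps finished on a previous run
        try:
            run_step(step)
        except Exception:
            logging.exception("step %s failed", step)  # full traceback at once
            sys.exit(1)
        done.append(step)
        STATE.write_text(json.dumps(done))  # checkpoint after every step

if __name__ == "__main__":
    main()
```

Because each step is checkpointed to disk, a dropped SSH session or crash costs only the current step, not the whole run, which is exactly what the browser-driven upgrader can't guarantee.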

I don't know how you do the smaller upgrades for your CiC clients, but I assume things would be simpler with more automation.

Minimizing downtime is very important to me. Unlike what other clients seem to be comfortable with, I'm not comfortable taking the forum down for any more time than absolutely necessary (12 hours is really pushing it), or waiting 4-5+ weeks for the forum to reach the point where it finally functions properly because the post-upgrade tasks have at last completed.


@The Old Man: I don't feel this can be attributed too much to hardware. There is simply a lot of data to process, plus "dead time" during the periods when I have to run manual queries. But trust me, I've spent these 8+ hours as efficiently as possible for a first run, where I also want to document everything that takes time and how long it takes.

This community has close to 24 million posts, over 1.7 million topics and 350,000 members. The SQL backup file has previously been above 25GB, approximately 1GB per million posts. Loaded into the database it's obviously more because of indexes.

The server is running nginx with PHP Version => 7.0.15-0ubuntu0.16.04.4

MySQL is 5.7.16-10-log Percona Server (GPL), Release '10', Revision 'a0c7d0d'

MySQL is running on its own hardware, which has previously been described to me as: 8 x 16 GB RAM with 2 x 6-core 2.4 GHz CPUs.

I'm not too sure about the hardware on the "app" servers, but they are virtual. Running cat /proc/cpuinfo I get the output below, times 4 (processors 0 through 3 all have the same stats).

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 13
model name	: QEMU Virtual CPU version (cpu64-rhel6)
stepping	: 3
microcode	: 0x1
cpu MHz		: 2500.144
cache size	: 4096 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 4
wp		: yes
flags		: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm nopl pni cx16 hypervisor lahf_lm
bugs		:
bogomips	: 5000.28
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

I have increased the limits on a lot of the processing queries as I've gone along, since the server handles that fine within the timeout value.

This is the result of cat /proc/meminfo

MemTotal:        8175148 kB
MemFree:          485948 kB
MemAvailable:    4044188 kB
Buffers:          363348 kB
Cached:          2889988 kB
SwapCached:        40828 kB
Active:          4369588 kB
Inactive:        2184452 kB
Active(anon):    2677524 kB
Inactive(anon):   970204 kB
Active(file):    1692064 kB
Inactive(file):  1214248 kB
Unevictable:           0 kB
Mlocked:            1840 kB
SwapTotal:       1048572 kB
SwapFree:         274420 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:       3260160 kB
Mapped:           110416 kB
Shmem:            347024 kB
Slab:            1020576 kB
SReclaimable:     955852 kB
SUnreclaim:        64724 kB
KernelStack:        9488 kB
PageTables:        51132 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     5136144 kB
Committed_AS:    5722296 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1001472 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      141300 kB
DirectMap2M:     8247296 kB

This is part of a larger collection of "app" servers; it's the only server running our test environment, but it's specced the same as our production app servers (of which we have 6).


7 hours ago, TSP said:

@The Old Man: I don't feel this can be attributed too much to hardware. There is simply a lot of data to process, and "dead time" because of the periods I have to run manual queries. [...]

If this is a VPS/cloud server, which it seems to be, you will have many restrictions on resources that likely slow this down a little. That said, a 25GB data set will take a long while for an upgrade from 3.4 to Invision 4. We have made many, many enhancements to this process over the years, and it's much faster than it was originally. The upgrade takes the old 3.4 data and copies it all into the new structure of Invision 4, so it is mainly dependent on processor strength and disk read/write speed; if you are able to increase either of those for this process, it would help greatly.

 


This isn't a weak VPS/cloud host. Although the app nodes are virtual, the setup can't be compared to a generic budget-friendly VPS host. The price for this server rig, which does not only run this community, is well over 1,000 USD a month. This host is geared towards businesses; they don't provide services/hosting to individuals.

I know it's faster than before; I've pushed for it to be improved over the years, as I've upgraded four other large 3.4 installations to 4.x. Now I'm pushing for it to be improved further; in the long run this will benefit your customers immensely, the way I see it.

If the upgrader had been coded in a way that properly followed the MVC architecture in the first place, this wouldn't be as large a task as it is now. I still think it would be worth the effort.

Again, there is a lot of data to process, and it seems to do it as fast as one could expect. I've increased the limits, as the server can do a lot more per run than you seem to expect. It spent 80-90 minutes upgrading the members, for example, but for 350,000 members I don't think you would consider that slow.
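For reference, the member step works out to a healthy per-row rate. Simple arithmetic, assuming the midpoint of the reported 80-90 minutes:

```python
# Throughput of the member-upgrade step from the figures reported above.
members = 350_000
minutes = 85  # midpoint of the reported 80-90 minutes
per_second = members / (minutes * 60)
print(round(per_second, 1))  # roughly 68.6 members upgraded per second
```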


1 minute ago, TSP said:

This isn't a weak VPS/Cloud server host. Although the app-nodes are virtual, the setup can't be compared to a generic budget friendly VPS host. [...]

I'll be honest here: considering that the majority of people upgraded from 3.x many years ago, you're not likely to see many changes to 3.x to 4.x upgrades at this time. Much time was spent on this back in 2015 when it was a focus point; the focus in 2017 is 4.2+.


@Rhett do you have any sense of how long it should take (in general) to upgrade members? In my case it took 80-90 minutes for 350,000 members, which I don't think you'd consider slow (or slower than what you're used to), but I could be wrong.


Just now, TSP said:

@Rhett do you have any sense of how long it should take (in general) to upgrade members? In my case it took 80-90 minutes for 350,000 members, which I don't think you'd consider slow (or slower than what you're used to), but I could be wrong.

That is pretty fast, to be honest, if that is 350,000 members. If so, you are doing fine.


1 minute ago, Rhett said:

That is pretty fast, to be honest, if that is 350,000 members. If so, you are doing fine.

So the point I'm trying to make is that I feel my servers are free of any blame here, which you seem to agree with. As I've said, there is a lot of data to process, and I understand that takes time. 

But having to babysit an upgrade for 10 hours makes it that much more difficult and inconvenient to run it multiple times for figuring out issues that occurred, debugging etc. 


9 minutes ago, TSP said:

So the point I'm trying to make is that I feel my servers are free of any blame here, which you seem to agree with. As I've said, there is a lot of data to process, and I understand that takes time. 

But having to babysit an upgrade for 10 hours makes it that much more difficult and inconvenient to run it multiple times for figuring out issues that occurred, debugging etc. 

If you are able to complete that upgrade in 10 hours you are doing well. 

 


  • Management
On 6/17/2017 at 2:33 AM, TSP said:

I still have queries against for example the posts-table that'll likely take me a minimum of 2 hours, ahead of me.

I am curious: how many posts are in that table? We host sites with many millions of posts and I have never seen a query take 2 hours to run.


We have a large community (5+ million users) with a similar amount of content posted. Our conversion process took more than a week, but we couldn't accept a one-week downtime period. So we created an incremental conversion script, which loads the last IDs of the needed tables (forums, topics, posts, messenger), connects to the production DB, and loads each row into the new DB with the correct class (\IPS\forums\Topic, \IPS\forums\Topic\Post), running the needed LegacyTextParser work (with our special methods included) and the other incremental conversion steps. This method gave us the ability to switch the forums from the old engine to the new one at a chosen time. I hope it helps you.
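The incremental idea above can be sketched generically. The snippet below uses SQLite and invented table/column names purely for illustration; the real script would connect to the production MySQL database and save each row through the new engine's model classes (\IPS\forums\Topic\Post etc.), but the core trick is the same: remember the highest ID already converted and copy only rows beyond it.

```python
import sqlite3

def sync_new_rows(src: sqlite3.Connection, dst: sqlite3.Connection) -> int:
    """Copy rows from src.posts that dst has not yet seen (ID high-water mark)."""
    last_id = dst.execute("SELECT COALESCE(MAX(id), 0) FROM posts").fetchone()[0]
    rows = src.execute(
        "SELECT id, body FROM posts WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    # In the real script each row would be saved through the new engine's
    # model classes so parsing hooks run; here we simply insert.
    dst.executemany("INSERT INTO posts (id, body) VALUES (?, ?)", rows)
    dst.commit()
    return len(rows)
```

Run it repeatedly while the old board is still live, then do one final short run during the actual switch-over; that final delta is what keeps the downtime window small.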

On the conversion problems, I don't think a CLI upgrader can be better than the web upgrader. Most of the available performance goes unused because the conversion can only run on one CPU, in one thread. I am sure that if the upgrader (and background tasks, such as the queue) could do the work in parallel (10 copies on 10 CPUs), it would be 10x faster. A big members table, a big posts table: all of them need more CPU than DB select/insert speed. And that's before mentioning the posts table, where LegacyTextParser with its many regexes spends 99% of the time on CPU and 1% on the DB.

So, IPS, if you are thinking about upgrader and background-task improvements, please look into multitasking.
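To illustrate the multi-process point: assuming the per-post work (rebuilding post markup) is a pure function, splitting the rows across worker processes is straightforward. This is a generic Python sketch, not IPS code; rebuild_post is a toy stand-in for the regex-heavy LegacyTextParser conversion:

```python
from multiprocessing import Pool
import re

def rebuild_post(body: str) -> str:
    # Toy stand-in for the regex-heavy LegacyTextParser conversion work.
    return re.sub(r"\[b\](.*?)\[/b\]", r"<strong>\1</strong>", body)

def rebuild_chunk(bodies: list[str]) -> list[str]:
    return [rebuild_post(b) for b in bodies]

def rebuild_parallel(bodies: list[str], workers: int = 4) -> list[str]:
    # Split the posts into one chunk per worker and convert them in parallel;
    # CPU-bound regex work then scales with core count instead of one thread.
    size = max(1, len(bodies) // workers)
    chunks = [bodies[i:i + size] for i in range(0, len(bodies), size)]
    with Pool(workers) as pool:
        return [b for chunk in pool.map(rebuild_chunk, chunks) for b in chunk]
```

Since the conversion is CPU-bound rather than database-bound, this kind of fan-out is where the 10x the post above asks for would come from.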

Thanks!


14 hours ago, Charles said:

I am curious: how many posts are in that table? We host sites with many millions of posts and I have never seen a query take 2 hours to run.

My initial fears about the time spent on that table were off. I based them on previous experience, which could be attributed to queries against that table on the earlier server hardware we had.

The table has 24 million posts. Two substantial manual queries had to be run against it; the first took 21 minutes 30 seconds and the second 19 minutes 30 seconds.


Archived

This topic is now archived and is closed to further replies.
