PeterUK Posted October 21, 2013 Share Posted October 21, 2013 File Name: Unofficial Multi-Threaded UTF-8 Converter File Submitter: PeterUK File Submitted: 20 Oct 2013 File Category: Maintenance Supported Versions: IP.Board 3.1.x, IP.Board 3.2.x, IP.Board 3.3.x, IP.Board 3.4.x With the release of IPB 4 coming in the future, one of the things that is going to hit people with large communities hard is the conversion to UTF-8 if you are not already there. This can be a huge task, and with bfarber's converter, this works fine, but can be very slow as PHP can only run so quickly as it's bound by a single CPU thread. On my forum, which is a total DB size of ~10GB, bfarber's single threaded script takes over 48 hours straight to run. I found this amount of downtime for my live forum to be unacceptable, so I created a fork of his script and used PHP's fork functions to multi-thread it. With 8 cores assigned to the script, I was able to convert this database in just 5 hours under testing. When I did my live conversion, I assigned 12 cores and was able to do it in 3 hours. This script is aimed at people with large databases, where a conversion would normally take a significant amount of time, and people who are power users and understand server maintenance and configuration.Please ensure you carefully read the included readme.txt file, and the documentation in the PHP file for the settings. The database I converted was an IPB 3.4.5 database, but since this is based on bfarber's script from 2010, there's no reason this shouldn't work on pretty much any IPB version in the 3.x.x series, but I will only be officially supporting it on 3.4.5. This converter has the following enhancements over the original: Multi-threading Use of PDO directly for DB connections, which gives a small improvement over using IPB's library Use of PDO prepared statements for multiple inserts meaning less data and less CPU time are used by MySQL Ability to remap characters post-conversion Easy to read exceptions thrown to STDERR on query failure Useful ETA information based on the records already processed here to download this file Link to comment Share on other sites More sharing options...
PeterUK Posted October 21, 2013 Author Share Posted October 21, 2013 Fixed a small typo in the post-conversion replacement section of the code and backdated the upload since it didn't really warrant a new version number. Link to comment Share on other sites More sharing options...
AlexJ Posted October 21, 2013 Share Posted October 21, 2013 Isn't the 4.0 version will have some pre-in-build UTF-8 converter? Also, I have UTF-8 but some characters are still from other character set.. Can I still use the script? Link to comment Share on other sites More sharing options...
PeterUK Posted October 21, 2013 Author Share Posted October 21, 2013 Isn't the 4.0 version will have some pre-in-build UTF-8 converter? Also, I have UTF-8 but some characters are still from other character set.. Can I still use the script? It's understood that IPB4 will require users to convert to UTF-8 first, yes. However this encoder is aimed at people where the time taken for that process is unacceptable and they need something faster. You can use the script if you want, however it will take what you give as your current character set, and convert it to the new one, if you tell it your current characters are UTF-8, and ask it to convert to UTF-8 I doubt anything will happen other than your time will be wasted. :P Link to comment Share on other sites More sharing options...
marklcfc Posted October 22, 2013 Share Posted October 22, 2013 My forums mysql tables are in latin1_swedish_ci and for the first 6-7 years the character set was iso-8859-1. I then changed to UTF-8 for the past 4 years. This has meant that there are errors in the older posts, for example it doesn't display £ or ' correctly and it'll look a bit of a mess. So if I used this, would it fix the older posts but retain the newer posts are they are displaying fine (even though the database is still latin_swedish_ci and needs changing to UTF). Link to comment Share on other sites More sharing options...
PeterUK Posted October 22, 2013 Author Share Posted October 22, 2013 My forums mysql tables are in latin1_swedish_ci and for the first 6-7 years the character set was iso-8859-1. I then changed to UTF-8 for the past 4 years. This has meant that there are errors in the older posts, for example it doesn't display £ or ' correctly and it'll look a bit of a mess. So if I used this, would it fix the older posts but retain the newer posts are they are displaying fine (even though the database is still latin_swedish_ci and needs changing to UTF). Your question is a bit confusing. You say that you changed to utf8 which broke some of your older characters, but then later on you say your database is in latin1. Are you saying you just changed the character set of your tables from latin1 to utf8 (and didn't convert any data?), so new posts were inserted in utf8 but older posts are still in latin1? If so, you could give this a try (using latin1as the input character set), and see what it comes out with, I honestly don't know how it would handle that. Another thing you could do is specify utf8 as the input and output character set, and use the mappings to convert characters which you know are problematic. I could patch the script so that if the input charset equals the output charset, it only performs replacements and doesn't attempt the (slower) character conversion. Link to comment Share on other sites More sharing options...
marklcfc Posted October 22, 2013 Share Posted October 22, 2013 Yes I just changed the character set. I had no idea what I was doing, after one upgrade it said that UTF-8 was recommended, and mine said iso so I just changed it.. Link to comment Share on other sites More sharing options...
Luis Manson Posted October 25, 2013 Share Posted October 25, 2013 Hi, thanks for this tool, its a great enhancement i just found out my sessions table was not created, can you add an option to create the schema for not convertet tables? wait, i had an error and after a fix (emptied the original table) i used the skip option, somehow i had other errors after that, not sure what whas the cause, so i will do another conversion test Link to comment Share on other sites More sharing options...
Luis Manson Posted October 26, 2013 Share Posted October 26, 2013 i think i have a bug somewhere, i can no create a topic with title áéíóú, it wil be shown as áé, and if a try to change the title the query gets formed as: UPDATE topics SET title='tema áé� óú',title_seo='tema' WHERE tid= Link to comment Share on other sites More sharing options...
PeterUK Posted October 27, 2013 Author Share Posted October 27, 2013 That looks like a conversion problem somewhere before the query hits the DB. Is this a problem *after* the conversion? As for the creation of skipped tables, by default it should skip over the session data but create the table itself. Link to comment Share on other sites More sharing options...
Luis Manson Posted October 27, 2013 Share Posted October 27, 2013 nevermind, i have a BIG mess, i made a thread on "server config" forum, it feels like i have latin1 data, and utf8 unicode/general collations set Link to comment Share on other sites More sharing options...
Luis Manson Posted October 28, 2013 Share Posted October 28, 2013 not sure if its a bug in code or something is not correctly set, but your code create the tables as UTF8_general_ci and i want to use unicode_ci is that a setup problem on my end? Link to comment Share on other sites More sharing options...
PeterUK Posted October 28, 2013 Author Share Posted October 28, 2013 not sure if its a bug in code or something is not correctly set, but your code create the tables as UTF8_general_ci and i want to use unicode_ci is that a setup problem on my end? It sounds like it, because by default utf8_general_ci isn't mentioned anywhere in the file, by default it creates tables with utf8_unicode_ci collation. Can you issue a SHOW CREATE TABLE `table_name`; for one of the tables in question and paste the output? Perhaps your server is putting the collation within the table definition which the script does not override. You can also do SHOW COLLATION LIKE 'utf8%'; and if utf8_general_ci is listed as default for you then that's most likely why. Link to comment Share on other sites More sharing options...
Luis Manson Posted October 28, 2013 Share Posted October 28, 2013 Thanks peter, im waiting for a conversion, i wil run that later, to be honest i have rally NO idea on the right solution now, its also beyond the scope of this thread (your converter) What I know almost for sure is that my data is latin1 but collations on the running database is set to utf8 (unicode and general) Link to comment Share on other sites More sharing options...
PeterUK Posted October 28, 2013 Author Share Posted October 28, 2013 Thanks peter, im waiting for a conversion, i wil run that later, to be honest i have rally NO idea on the right solution now, its also beyond the scope of this thread (your converter) What I know almost for sure is that my data is latin1 but collations on the running database is set to utf8 (unicode and general) Well remember the collation is irrelevant to the character set. It's no problem having a mix of utf8_general_ci and utf8_unicode_ci, it's all still utf8, that just affects how MySQL does sorting on the data. The problem comes when you have what you describe, which is characters in one character set but their character set in the DB does not match this. It sounds like if you have all latin1 characters you need to set that as your "from" character set, and utf8 as your "to" character set and see if it converts correctly. Link to comment Share on other sites More sharing options...
Luis Manson Posted October 28, 2013 Share Posted October 28, 2013 right now i will go for a sed seach & replace, since what you said did not work fome, but i will do another test just to be sure, thanks! Link to comment Share on other sites More sharing options...
Luis Manson Posted October 28, 2013 Share Posted October 28, 2013 It sounds like it, because by default utf8_general_ci isn't mentioned anywhere in the file, by default it creates tables with utf8_unicode_ci collation. Can you issue a SHOW CREATE TABLE `table_name`; for one of the tables in question and paste the output? Perhaps your server is putting the collation within the table definition which the script does not override. You can also do SHOW COLLATION LIKE 'utf8%'; and if utf8_general_ci is listed as default for you then that's most likely why. show create on topics table: show create table gatopicsG *************************** 1. row *************************** Table: gatopics Create Table: CREATE TABLE `gatopics` ( `tid` int(10) NOT NULL AUTO_INCREMENT, `title` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '', `description` varchar(250) COLLATE utf8_unicode_ci DEFAULT NULL, `state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL, `posts` int(10) DEFAULT NULL, `starter_id` mediumint(8) NOT NULL DEFAULT '0', `start_date` int(10) DEFAULT NULL, `last_poster_id` mediumint(8) NOT NULL DEFAULT '0', `last_post` int(10) DEFAULT NULL, `icon_id` tinyint(2) DEFAULT NULL, `starter_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, `last_poster_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, `poll_state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL, `last_vote` int(10) DEFAULT NULL, `views` int(10) DEFAULT NULL, `forum_id` smallint(5) NOT NULL DEFAULT '0', `approved` tinyint(1) NOT NULL DEFAULT '0', `author_mode` tinyint(1) DEFAULT NULL, `pinned` tinyint(1) DEFAULT NULL, `moved_to` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL, `topic_hasattach` smallint(5) NOT NULL DEFAULT '0', `topic_firstpost` int(10) NOT NULL DEFAULT '0', `topic_queuedposts` int(10) NOT NULL DEFAULT '0', `topic_open_time` int(10) NOT NULL DEFAULT '0', `topic_close_time` int(10) NOT NULL DEFAULT '0', `topic_rating_total` smallint(5) unsigned NOT NULL DEFAULT '0', `topic_rating_hits` smallint(5) unsigned NOT NULL DEFAULT '0', `title_seo` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '', `seo_last_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '', `seo_first_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '', `topic_deleted_posts` int(10) NOT NULL DEFAULT '0', `short_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, `tweet_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, `tweet` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, `template_fields` text COLLATE utf8_unicode_ci, `tdelete_time` int(11) NOT NULL DEFAULT '0', `moved_on` int(11) NOT NULL DEFAULT '0', `last_real_post` int(10) NOT NULL DEFAULT '0', `topic_archive_status` int(1) NOT NULL DEFAULT '0', `topic_answered_pid` int(10) NOT NULL DEFAULT '0', PRIMARY KEY (`tid`), KEY `topic_firstpost` (`topic_firstpost`), KEY `last_post` (`forum_id`,`pinned`,`last_post`,`state`), KEY `forum_id` (`forum_id`,`pinned`,`approved`), KEY `last_post_sorting` (`last_post`,`forum_id`), KEY `start_date` (`start_date`), KEY `last_x_topics` (`forum_id`,`approved`,`start_date`), KEY `approved` (`approved`,`tdelete_time`), KEY `moved_redirects` (`moved_on`,`moved_to`,`pinned`), KEY `starter_id` (`starter_id`,`forum_id`,`approved`,`start_date`), KEY `topic_archive_status` (`topic_archive_status`,`forum_id`), KEY `last_poster_id` (`last_poster_id`) ) ENGINE=MyISAM AUTO_INCREMENT=89169 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 1 row in set (0.00 sec) about the collation, yes, general is the default on both server, one has unicode specified, yet general is shown as default : Link to comment Share on other sites More sharing options...
xtech Posted November 17, 2013 Share Posted November 17, 2013 Hi, i get the "Cannot find PCTNL functions used to fork processes". I am running windows server, so i suppose this is normal. Is there any workaround to be able to run this (i don't mind not getting it to be multithreaded). Regards Link to comment Share on other sites More sharing options...
PeterUK Posted January 4, 2014 Author Share Posted January 4, 2014 Hi, i get the "Cannot find PCTNL functions used to fork processes". I am running windows server, so i suppose this is normal. Is there any workaround to be able to run this (i don't mind not getting it to be multithreaded). Regards Sorry for the slow response, you are correct, Linux is a requirement for this script. It would be possible to make it run single threaded but ultimately you may as well just run bfarber's script, I am about to issue a patched version of this though which uses a superior third party library for UTF-8 encoding. Link to comment Share on other sites More sharing options...
teraßyte Posted January 4, 2014 Share Posted January 4, 2014 Sorry for the slow response, you are correct, Linux is a requirement for this script. It would be possible to make it run single threaded but ultimately you may as well just run bfarber's script, I am about to issue a patched version of this though which uses a superior third party library for UTF-8 encoding. Rather than Brandon's version it's better to use the new script by Matt in the IPS Extra category: '?do=embed' frameborder='0' data-embedContent>> Matt is working on it to include it in the 4.0 final version since they'll be converting every board to UTF-8 with 4.0 anyway. Link to comment Share on other sites More sharing options...
PeterUK Posted January 5, 2014 Author Share Posted January 5, 2014 Rather than Brandon's version it's better to use the new script by Matt in the IPS Extra category: '?do=embed' frameborder='0' data-embedContent>> Matt is working on it to include it in the 4.0 final version since they'll be converting every board to UTF-8 with 4.0 anyway. Ah yes, I didn't realise this had been released publicly, I tested it some time ago before it was released. You should indeed use that one if you don't want multi-threading. Link to comment Share on other sites More sharing options...
PeterUK Posted January 25, 2014 Author Share Posted January 25, 2014 Updated the script to use a superior library amongst other things. Link to comment Share on other sites More sharing options...
InvisionHQ Posted March 1, 2014 Share Posted March 1, 2014 Hi Peter, your script is very fast but I got this error: [1393659319] [CHILD #5] Processed 1,250/80,012 (1.56%) (ETA: 00h 02m 06s) [ibf_search_keywords] [1393659319] SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'republik osterreich catalogo xxxxxx' for key 'idx_keyword_unq' Killed Link to comment Share on other sites More sharing options...
Luis Manson Posted March 3, 2014 Share Posted March 3, 2014 if you are not very seo orinted empty that table, it just has search keywords Link to comment Share on other sites More sharing options...
InvisionHQ Posted March 4, 2014 Share Posted March 4, 2014 I empy this table and conversion script make the job without errors: But when I try to change $INFO['sql_database'] with the name of DB converted I got this: Link to comment Share on other sites More sharing options...
Recommended Posts
Archived
This topic is now archived and is closed to further replies.