Jump to content

Unofficial Multi-Threaded UTF-8 Converter


PeterUK

Recommended Posts

File Name: Unofficial Multi-Threaded UTF-8 Converter

File Submitter: PeterUK

File Submitted: 20 Oct 2013

File Category: Maintenance

Supported Versions: IP.Board 3.1.x, IP.Board 3.2.x, IP.Board 3.3.x, IP.Board 3.4.x



With the release of IPB 4 coming in the future, one of the things that is going to hit people with large communities hard is the conversion to UTF-8 if you are not already there. This can be a huge task, and with bfarber's converter, this works fine, but can be very slow as PHP can only run so quickly as it's bound by a single CPU thread. On my forum, which is a total DB size of ~10GB, bfarber's single threaded script takes over 48 hours straight to run. I found this amount of downtime for my live forum to be unacceptable, so I created a fork of his script and used PHP's fork functions to multi-thread it. With 8 cores assigned to the script, I was able to convert this database in just 5 hours under testing. When I did my live conversion, I assigned 12 cores and was able to do it in 3 hours.

This script is aimed at people with large databases, where a conversion would normally take a significant amount of time, and people who are power users and understand server maintenance and configuration.

Please ensure you carefully read the included readme.txt file, and the documentation in the PHP file for the settings.

The database I converted was an IPB 3.4.5 database, but since this is based on bfarber's script from 2010, there's no reason this shouldn't work on pretty much any IPB version in the 3.x.x series, but I will only be officially supporting it on 3.4.5.

This converter has the following enhancements over the original:

  • Multi-threading
  • Use of PDO directly for DB connections, which gives a small improvement over using IPB's library
  • Use of PDO prepared statements for multiple inserts meaning less data and less CPU time are used by MySQL
  • Ability to remap characters post-conversion
  • Easy to read exceptions thrown to STDERR on query failure
  • Useful ETA information based on the records already processed



here to download this file

Link to comment
Share on other sites

Isn't the 4.0 version will have some pre-in-build UTF-8 converter? Also, I have UTF-8 but some characters are still from other character set.. Can I still use the script?

It's understood that IPB4 will require users to convert to UTF-8 first, yes. However this encoder is aimed at people where the time taken for that process is unacceptable and they need something faster.

You can use the script if you want, however it will take what you give as your current character set, and convert it to the new one, if you tell it your current characters are UTF-8, and ask it to convert to UTF-8 I doubt anything will happen other than your time will be wasted. :P

Link to comment
Share on other sites

My forums mysql tables are in latin1_swedish_ci and for the first 6-7 years the character set was iso-8859-1.

I then changed to UTF-8 for the past 4 years. This has meant that there are errors in the older posts, for example it doesn't display £ or ' correctly and it'll look a bit of a mess.

So if I used this, would it fix the older posts but retain the newer posts are they are displaying fine (even though the database is still latin_swedish_ci and needs changing to UTF).

Link to comment
Share on other sites

My forums mysql tables are in latin1_swedish_ci and for the first 6-7 years the character set was iso-8859-1.

I then changed to UTF-8 for the past 4 years. This has meant that there are errors in the older posts, for example it doesn't display £ or ' correctly and it'll look a bit of a mess.

So if I used this, would it fix the older posts but retain the newer posts are they are displaying fine (even though the database is still latin_swedish_ci and needs changing to UTF).

Your question is a bit confusing. You say that you changed to utf8 which broke some of your older characters, but then later on you say your database is in latin1. Are you saying you just changed the character set of your tables from latin1 to utf8 (and didn't convert any data?), so new posts were inserted in utf8 but older posts are still in latin1? If so, you could give this a try (using latin1as the input character set), and see what it comes out with, I honestly don't know how it would handle that. Another thing you could do is specify utf8 as the input and output character set, and use the mappings to convert characters which you know are problematic. I could patch the script so that if the input charset equals the output charset, it only performs replacements and doesn't attempt the (slower) character conversion.

Link to comment
Share on other sites

Hi, thanks for this tool, its a great enhancement

i just found out my sessions table was not created, can you add an option to create the schema for not convertet tables?

wait, i had an error and after a fix (emptied the original table) i used the skip option, somehow i had other errors after that, not sure what whas the cause, so i will do another conversion test

Link to comment
Share on other sites

not sure if its a bug in code or something is not correctly set, but your code create the tables as UTF8_general_ci and i want to use unicode_ci is that a setup problem on my end?

It sounds like it, because by default utf8_general_ci isn't mentioned anywhere in the file, by default it creates tables with utf8_unicode_ci collation.

Can you issue a

SHOW CREATE TABLE `table_name`; 

for one of the tables in question and paste the output? Perhaps your server is putting the collation within the table definition which the script does not override.

You can also do

SHOW COLLATION LIKE 'utf8%';

and if utf8_general_ci is listed as default for you then that's most likely why.

Link to comment
Share on other sites

Thanks peter, im waiting for a conversion, i wil run that later, to be honest i have rally NO idea on the right solution now, its also beyond the scope of this thread (your converter)

What I know almost for sure is that my data is latin1 but collations on the running database is set to utf8 (unicode and general)

Link to comment
Share on other sites

Thanks peter, im waiting for a conversion, i wil run that later, to be honest i have rally NO idea on the right solution now, its also beyond the scope of this thread (your converter)

What I know almost for sure is that my data is latin1 but collations on the running database is set to utf8 (unicode and general)

Well remember the collation is irrelevant to the character set. It's no problem having a mix of utf8_general_ci and utf8_unicode_ci, it's all still utf8, that just affects how MySQL does sorting on the data. The problem comes when you have what you describe, which is characters in one character set but their character set in the DB does not match this. It sounds like if you have all latin1 characters you need to set that as your "from" character set, and utf8 as your "to" character set and see if it converts correctly.

Link to comment
Share on other sites

It sounds like it, because by default utf8_general_ci isn't mentioned anywhere in the file, by default it creates tables with utf8_unicode_ci collation.

Can you issue a

SHOW CREATE TABLE `table_name`; 

for one of the tables in question and paste the output? Perhaps your server is putting the collation within the table definition which the script does not override.

You can also do

SHOW COLLATION LIKE 'utf8%';

and if utf8_general_ci is listed as default for you then that's most likely why.

show create on topics table:

show create table gatopicsG
*************************** 1. row ***************************
       Table: gatopics
Create Table: CREATE TABLE `gatopics` (
  `tid` int(10) NOT NULL AUTO_INCREMENT,
  `title` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  `description` varchar(250) COLLATE utf8_unicode_ci DEFAULT NULL,
  `state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,
  `posts` int(10) DEFAULT NULL,
  `starter_id` mediumint(8) NOT NULL DEFAULT '0',
  `start_date` int(10) DEFAULT NULL,
  `last_poster_id` mediumint(8) NOT NULL DEFAULT '0',
  `last_post` int(10) DEFAULT NULL,
  `icon_id` tinyint(2) DEFAULT NULL,
  `starter_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `last_poster_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `poll_state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,
  `last_vote` int(10) DEFAULT NULL,
  `views` int(10) DEFAULT NULL,
  `forum_id` smallint(5) NOT NULL DEFAULT '0',
  `approved` tinyint(1) NOT NULL DEFAULT '0',
  `author_mode` tinyint(1) DEFAULT NULL,
  `pinned` tinyint(1) DEFAULT NULL,
  `moved_to` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
  `topic_hasattach` smallint(5) NOT NULL DEFAULT '0',
  `topic_firstpost` int(10) NOT NULL DEFAULT '0',
  `topic_queuedposts` int(10) NOT NULL DEFAULT '0',
  `topic_open_time` int(10) NOT NULL DEFAULT '0',
  `topic_close_time` int(10) NOT NULL DEFAULT '0',
  `topic_rating_total` smallint(5) unsigned NOT NULL DEFAULT '0',
  `topic_rating_hits` smallint(5) unsigned NOT NULL DEFAULT '0',
  `title_seo` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  `seo_last_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  `seo_first_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  `topic_deleted_posts` int(10) NOT NULL DEFAULT '0',
  `short_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `tweet_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `tweet` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `template_fields` text COLLATE utf8_unicode_ci,
  `tdelete_time` int(11) NOT NULL DEFAULT '0',
  `moved_on` int(11) NOT NULL DEFAULT '0',
  `last_real_post` int(10) NOT NULL DEFAULT '0',
  `topic_archive_status` int(1) NOT NULL DEFAULT '0',
  `topic_answered_pid` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`tid`),
  KEY `topic_firstpost` (`topic_firstpost`),
  KEY `last_post` (`forum_id`,`pinned`,`last_post`,`state`),
  KEY `forum_id` (`forum_id`,`pinned`,`approved`),
  KEY `last_post_sorting` (`last_post`,`forum_id`),
  KEY `start_date` (`start_date`),
  KEY `last_x_topics` (`forum_id`,`approved`,`start_date`),
  KEY `approved` (`approved`,`tdelete_time`),
  KEY `moved_redirects` (`moved_on`,`moved_to`,`pinned`),
  KEY `starter_id` (`starter_id`,`forum_id`,`approved`,`start_date`),
  KEY `topic_archive_status` (`topic_archive_status`,`forum_id`),
  KEY `last_poster_id` (`last_poster_id`)
) ENGINE=MyISAM AUTO_INCREMENT=89169 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
1 row in set (0.00 sec)
 

about the collation, yes, general is the default on both server, one has unicode specified, yet general is shown as default :

Link to comment
Share on other sites

  • 3 weeks later...
  • 1 month later...

Hi,

i get the "Cannot find PCTNL functions used to fork processes". I am running windows server, so i suppose this is normal. Is there any workaround to be able to run this (i don't mind not getting it to be multithreaded).

Regards

Sorry for the slow response, you are correct, Linux is a requirement for this script. It would be possible to make it run single threaded but ultimately you may as well just run bfarber's script, I am about to issue a patched version of this though which uses a superior third party library for UTF-8 encoding.

Link to comment
Share on other sites

Sorry for the slow response, you are correct, Linux is a requirement for this script. It would be possible to make it run single threaded but ultimately you may as well just run bfarber's script, I am about to issue a patched version of this though which uses a superior third party library for UTF-8 encoding.

Rather than Brandon's version it's better to use the new script by Matt in the IPS Extra category: '?do=embed' frameborder='0' data-embedContent>>

Matt is working on it to include it in the 4.0 final version since they'll be converting every board to UTF-8 with 4.0 anyway.

Link to comment
Share on other sites

Rather than Brandon's version it's better to use the new script by Matt in the IPS Extra category: '?do=embed' frameborder='0' data-embedContent>>

Matt is working on it to include it in the 4.0 final version since they'll be converting every board to UTF-8 with 4.0 anyway.

Ah yes, I didn't realise this had been released publicly, I tested it some time ago before it was released. You should indeed use that one if you don't want multi-threading.

Link to comment
Share on other sites

  • 3 weeks later...
  • 1 month later...

Hi Peter, your script is very fast but I got this error:

[1393659319] [CHILD #5] Processed 1,250/80,012 (1.56%) (ETA: 00h 02m 06s) [ibf_search_keywords]


[1393659319] SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'republik osterreich catalogo xxxxxx' for key 'idx_keyword_unq'

Killed

Link to comment
Share on other sites

Archived

This topic is now archived and is closed to further replies.

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...