Jump to content

Invision Community Blog


Managing successful online communities


IPS UTF-8 Database Converter

Please note that this entry may be a little technical, if you do have any questions, please post them in the comments below.

A little history
IP.Board was first released over ten years ago when the web landscape was very different. A lot of emerging technologies were still trying to define new standards. Very early versions of IP.Board allowed one to specify the document character set and had a default of "ISO-8559-1" which is useful for languages that use latin based characters. This meant, for example, that if you needed Chinese characters you would need to change the character set to something more suitable. This disparity between character sets creates many challenges when working with a single code base.

UTF-8
Over the past handful of years there has been a push towards a single document character set; UTF-8. UTF-8 is a variable-width encoding that is able to show every character in the Unicode character set. This makes it suitable for latin and Chinese characters (and many many more!). Popular Javascript libraries such as jQuery require that data is sent and received with UTF-8 and many native PHP string functions prefer UTF-8. The future is very much UTF-8 and trying to keep our codebase working with any other character set is going to be more and more challenging.

IPS 4.0
Even though IP.Board 3 introduced UTF-8 as the default character set for new installations, we're aware that we still have many clients that are not using UTF-8 currently. IPS 4.0 is going to be strictly UTF-8 only which means we need to convert the data before or as part of the upgrade process.

Converting to UTF-8 isn't as simple as changing the database encoding. Merely doing this will simply corrupt the data you have in your database. We need to be a little smarter and use a script to do this work for us.

The great news is that even if you choose to convert your data to UTF-8 today, your IP.Board 3.x will run just fine and you may even find it more efficient as it doesn't have to convert lots of data on the fly.

The IPS UTF-8 Database Converter
We've written a script that can safely convert your database to UTF-8. The script does not overwrite your data until you manually confirm that the conversion process has been successful. This means that there is no risk of corrupting your existing data.

Of course, it is good practice to perform a full MySQL back-up before making any changes to the database as a precaution and we recommend that you do this.

You can download the converter and its instructions here.

How can I tell if I need to convert my database?
When you first run the converter, it'll check your database and let you know if you need to convert or not. Even if you are running UTF-8, you may not be using the correct collation (utf8_unicode_ci) so you have the option of changing your collation which is a very fast procedure and does not need a full conversion to complete.

If you first used IP.Board with IP.Board 3.0 then you may only need to change your database table collation. This isn't a required step and the IPS 4 upgrade process will perform this task if you'd prefer to wait until IPS 4.0 is released.
Support
Please note that while we're happy to provide some pointers within the client forums, this release is not officially supported by our technical support department.

Beta Release
As this is a beta release, please be aware that there may be bugs. If you do spot one, please post it to the IPS Extras bug tracker.


Comments

Recommended Comments



My host upgraded to SQL 5.5, afterwards I was getting SQL driver errors about mixed charsets although everything was set to UTF-8.

 

Ran the converter, now when i start the web converter i'm getting:

Error: Division by zero File: /modules/browser/browser.php Line: 117

Link to comment
Share on other sites

I tried the old convertor UTF converter IPS had, but it was useless to me. It changed the table collation to UTF8, but let the individual table fields in their old collation. I am wondering if it is the same with this version?

 

Oh well, can't hurt to try I guess.

Link to comment
Share on other sites

I tried the old convertor UTF converter IPS had, but it was useless to me. It changed the table collation to UTF8, but let the individual table fields in their old collation. I am wondering if it is the same with this version?

 

Oh well, can't hurt to try I guess.

 

I never heard of an old converter :|

 

Unless you mean the one brandon released separately (which I used and worked quite well on my site :P ).

Link to comment
Share on other sites

Anyone have a suggestion for me how to get my problem solved ? - website down almost 20 hours now :-(

 

You won't get any help in a blog comment:

 

Support
Please note that while we're happy to provide some pointers within the client forums, this release is not officially supported by our technical support department.

 

 

And aside from that, my understanding of the script is that it uses another database where it puts the converted data. Can't you simply edit your conf_global.php file and point it to the original DB? If that's not the case open a topic in the forum just as mentioned in the quote above.

Link to comment
Share on other sites

Question - My character set has been set as UTF-8 for a few years, my database has always been in latin_swedish though. Will the converter realise this and convert or will it detect I'm using UTF-8 as a character set and just convert the collation?

 

Sorry for lack of knowledge

Link to comment
Share on other sites

Running from shell i get this error:

 

Processing:                                                                              0%

Error: Column 'appcomponent' cannot be null


File: /system/Convert/Convert.php


Line: 293

Link to comment
Share on other sites

Ok, i deleted all the null appcomponet arrows and it resolved that issue from the admin logs table.

 

But now i have a new one:

 

Processing: cache_store                                                              0.02%

Error: htmlentities(): Invalid multibyte sequence in argument


File: /system/Convert/Convert.php


Line: 260

Link to comment
Share on other sites

Well, UTF-8 does not work.  We have had it enabled from the start and when I translate the word Turkey using Google translate into Turkish and paste it in the forum name, I get a "?" for 'i' that you see in Türkiye, right after the k.

 

The same is true for some other foreign words, especially Hungarian and Russian.  I need to use country names spelled in their local language in the forums and half the time, IPB gets indigestion.

Link to comment
Share on other sites

Well, UTF-8 does not work.  We have had it enabled from the start and when I translate the word Turkey using Google translate into Turkish and paste it in the forum name, I get a "?" for 'i' that you see in Türkiye, right after the k.
 
The same is true for some other foreign words, especially Hungarian and Russian.  I need to use country names spelled in their local language in the forums and half the time, IPB gets indigestion.

 
 
Try disabling 'Remove chr(0xCA) from input' from the settings.

Link to comment
Share on other sites

UTF-8 does not work?  I'm afraid that's simply not possible.  UTF-8 is an open standard used by many things beyond web software, and it works just fine.

 

Perhaps your site, specifically, is having trouble with individual characters.  Ian's advice is probably the correct resolution for the problem, however if you still notice problems I would encourage you to submit a ticket so a technician can investigate.  Many things can cause character problems, but it's not because the character set itself "doesn't work".

Link to comment
Share on other sites

UTF-8 does not work?  I'm afraid that's simply not possible.  UTF-8 is an open standard used by many things beyond web software, and it works just fine.

 

That UTF-8 is a recognized standard, that is true.  Perhaps what I should have said [implied] was that UTF-8 does not work flawlessly in IPS products. 

 

I am using the ZuZu skin which is just a couple of tweaks off of the default.  I really don't think you can immediately declare that the character set itself works perfectly.  If Ian is correct, and I have to modify the settings, then the UTF-8 integration is not without issues in IPS products.

Link to comment
Share on other sites

Anyone have a suggestion for me how to get my problem solved ? - website down almost 20 hours now :-(

You ran this on a production site without testing it in a dev/QA instance and without a backup prior? Otherwise It is time to look and see what is throwing that error with firebug or similar tool.

Link to comment
Share on other sites

Well, UTF-8 does not work.  We have had it enabled from the start and when I translate the word Turkey using Google translate into Turkish and paste it in the forum name, I get a "?" for 'i' that you see in Türkiye, right after the k.

 

The same is true for some other foreign words, especially Hungarian and Russian.  I need to use country names spelled in their local language in the forums and half the time, IPB gets indigestion.

I think between your database and IPB forum is not right...

probably you need to use set_names ut8 function to connect your database or you need to set sql_charset to ut8....

Good luck...

Link to comment
Share on other sites

Why do you use so slow methods? Why not sql convert routines?
I use this routines to copy tables to new base:

SET sql_mode = 'NO_AUTO_VALUE_ON_ZERO';
CREATE TABLE `NEW_BASE`.`TABLE1` LIKE `OLD_BASE`.`TABLE1`; 
INSERT INTO `NEW_BASE`.`TABLE1` SELECT * FROM `OLD_BASE`.`TABLE1`;
CREATE TABLE `NEW_BASE`.`TABLE2` LIKE `OLD_BASE`.`TABLE2`;
INSERT INTO `NEW_BASE`.`TABLE2` SELECT * FROM `OLD_BASE`.`TABLE2`;
...  

It takes about 5 minutes for 2kk posts table.
Then i convert database and tables to UTF-8:

ALTER DATABASE `".$dbname."` DEFAULT CHARACTER SET $codepage COLLATE $collation;
ALTER TABLE $table DEFAULT CHARACTER SET $codepage COLLATE $collation, CONVERT TO CHARACTER SET $codepage COLLATE $collation;
...

Only problem with this method is serialized data.
This is my functions to fix serialized dato for multi-byte encoding:
 

/*
$data - serialized? string
$max_recurs - int max recursion level
$recurs - int current iteration

return  string
*/
function fix_ser_recursively($data,$max_recurs=10,$recurs=0){
	if (!is_string($data) OR !preg_match('/s:(d+):"(.*)";($|w:|}$)/i',$data) OR $recurs > $max_recurs) return $data;
	$test = @unserialize(fix_ser($data));
	if ($test !== FALSE AND $recurs <= $max_recurs){
		if (is_array($test) and count($test)){
			foreach ($test as $k => $v){
				if (is_string($v)){
					$_test[$k] = fix_ser_recursively($v,$max_recurs,$recurs+1);
				}
				else {
					$_test[$k] = $v;
				}
			}
			$test = $_test;
		}
		$data = serialize($test);
	}
	return $data;
}

/*
$data - serialized? string
return  string
*/
function fix_ser($data){
	$offset=1;
	$result=$data;
	while (preg_match('/s:(d+):"(.*)";($|w:|}$)/i',$data,$match,PREG_OFFSET_CAPTURE,$offset)) {
		
		$try1 = substr($match[0][0],strlen("s:{$match[1][0]}:""),$match[1][0]);
		if(strlen($try1) != mb_strlen($try1,'utf-8') AND substr($match[0][0],strlen("s:{$match[1][0]}:"")+$match[1][0],2) != '";'){
			$realstring = mb_substr($match[0][0],mb_strlen("s:{$match[1][0]}:""),$match[1][0],'utf-8');
		
			$reallen = strlen($realstring);
		
			$data=substr_replace($data,$reallen,$match[1][1],strlen($match[1][0]));
			$offset = $match[1][1]+strlen($reallen)+$reallen+4;
		}
		else {
			$offset = $match[1][1]+strlen($match[1][0])+$match[1][0]+4;
		}
	}
	return $data;

}

15 minutes for 3Gb Databases

Link to comment
Share on other sites

We've tested that approach before and it doesn't actually convert the data.  It also assumes the original tables has the correct character set and collations applied which is typically NOT the case.  Many of our clients have been using our software for a long time and they are storing, as a random example, Arabic data in a latin1-swedish database setup.  Relying on MySQL to try to convert that data simply does not work.

 

@surferboy I'm sorry, but I have to say you are incorrect.  UTF-8 works just fine with our software.  If you find any bugs explicitly related to UTF-8 please report them to our tracker, however the default character set for new installations of our software has been UTF-8 since 3.0.0 and is what we recommend for new clients.  UTF-8 works just fine with our software.

Link to comment
Share on other sites

Is it also worth people checking that their servers/hosting is set to UTF8 - if not then any new databases will be setup as whatever the server is set to.

 

I assume that with a new install that would be checked and corrected, but how about third party apps installing tables - will these throw up any issues if the server is not set to UTF8, as I assume they would not be UTF8?

Link to comment
Share on other sites
  • Management

Why do you use so slow methods? Why not sql convert routines?
I use this routines to copy tables to new base:

SET sql_mode = 'NO_AUTO_VALUE_ON_ZERO';
CREATE TABLE `NEW_BASE`.`TABLE1` LIKE `OLD_BASE`.`TABLE1`; 
INSERT INTO `NEW_BASE`.`TABLE1` SELECT * FROM `OLD_BASE`.`TABLE1`;
CREATE TABLE `NEW_BASE`.`TABLE2` LIKE `OLD_BASE`.`TABLE2`;
INSERT INTO `NEW_BASE`.`TABLE2` SELECT * FROM `OLD_BASE`.`TABLE2`;
...  

It takes about 5 minutes for 2kk posts table.
Then i convert database and tables to UTF-8:

 

Yes, this will work if you are using ISO-8859-* and have no special characters like € £ ", etc. What you're doing is changing the character set of the table and the collation. As ISO-8859-1 is latin, and this maps almost identically to the corresponding entry in UTF8 it mostly works.

 

However, this is not a solution we can give to our customers as they will find that even if using ISO-8559-1 they'll have unconverted artefacts and other oddities.

 

UTF8 will work fine if the database is set to UTF8, the database connection is set to UTF8, the table fields are set to UTF8, the table collation is set to UTF8* AND the data itself is converted from whatever to UTF8. If you miss any of these steps, UTF8 'won't work'.

 

Our converter does all of the above.

Link to comment
Share on other sites

I am currently converting my community DB from UTF8 to UTF8MB4 with the official converter. The machine that i am using is quite good: intel I7 4770, 32 GB RAM, SSD Disk Drive. The DB has about 1,3 GB.

I am running for about 50 minutes, and it only shows: "Converting" and "Processing admin_logs (Total 0.01%), after 20 minutes. Is this normal? Should i refresh the browser? To my knowledge this table is quite small...? 

 

The converter seems to be stuck... shall i wait?

Link to comment
Share on other sites

 

I am currently converting my community DB from UTF8 to UTF8MB4 with the official converter. The machine that i am using is quite good: intel I7 4770, 32 GB RAM, SSD Disk Drive. The DB has about 1,3 GB.

I am running for about 50 minutes, and it only shows: "Converting" and "Processing admin_logs (Total 0.01%), after 20 minutes. Is this normal? Should i refresh the browser? To my knowledge this table is quite small...? 

 

The converter seems to be stuck... shall i wait?

 

 

Have you tried the command line version of the converter? For anything but the smallest sites, the command line tool should be much quicker.

Link to comment
Share on other sites



Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Add a comment...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...

Important Information

We use technologies, such as cookies, to customise content and advertising, to provide social media features and to analyse traffic to the site. We also share information about your use of our site with our trusted social media, advertising and analytics partners. See more about cookies and our Privacy Policy

×