Jump to content

Charset Conversion Libraries


Ali Majrashi

Recommended Posts

Posted

The charset conversion libraries from IPB2 are the same ones that will be used for IPB3.

With IPB3 we are going to provide them as a separate download - practically none of our users have a need for them, and they consume something like 10MB of space, so we'll just offer them as a separate download for those who do need them. In the mean time, you can just use the folder from IPB2.

Posted

[quote name='aljareh4ever' post='1783316' date='Jan 29 2009, 05:21 PM']sorry but that not complete Solution to my web site DB .

i found an interesting plugin for VB that completely convert your DB .

but need alot of tweak to work with IPB her

http://www.vcharset.com/


Wow...
That's a pretty good conversion script...
I like it....

This script needs that too...
I have to convert 3 IPB to UTF-8 because of the SEF URL issue...

Thank you for sharing that URL aljareh4ever

Posted

[quote name='media' post='1783430' date='Jan 30 2009, 10:33 PM']Wow...
That's a pretty good conversion script...
I like it....

This script needs that too...
I have to convert 3 IPB to UTF-8 because of the SEF URL issue...

Thank you for sharing that URL aljareh4ever

will it's not only FURL also Attachment,search and to take fully advantage of ajax by converting
your DB into utf8

@ bfarber thanks but by that libraries can i convert my DB easily

Posted

We do not presently have any tools that allow you to convert your database character sets. I posted a sample of how such a library could be written for ipb3 earlier but there is no final script, no.

Posted

[quote name='bfarber' post='1783587' date='Feb 1 2009, 01:38 PM']We do not presently have any tools that allow you to convert your database character sets. I posted a sample of how such a library could be written for ipb3 earlier but there is no final script, no.

Hi,

I followed this thread (and the corresponding IPB3 thread) and I am also very much interested in this issue.
I attempt to migrate an existing IPB2.3.5 database from iso-8859-7 to utf8.
I will try to write such an automatic conversion php script and post it here for everybody to try, based on the IPB3 script you provided and the i18n charset library.
However, before I do this, I need to ask you some important questions:
1) Since I haven't studied any IPB3 code, will the IPB2 function:

$post = $this->ipsclass->txt_convert_charsets($r['post'] , 'old_char_set', 'utf-8' );

do exactly the same thing as the IPB3 function you used:

$post = IPSText::convertCharsets( $r['post'], 'old_char_set', 'utf-8' );

or not?

2) I will automatically get a list of IPB fields to convert. I plan to convert the fields:

char varchar text enum set tinytext mediumtext longtext

Should I convert some of them? Or maybe there are more to convert? Can you please specify the proper set?

3) In the database I see a lot Greek characters converted to html entities for some reason (as I said iso-8859-7 was used as the charset in IPB) that I assume cause problems in searches. Is this handled by txt_convert_charsets() function?
Should I try to convert these entities to utf8 e.g. using a standard PHP function like html_entity_decode() or mb_convert_encoding(), or not?

4) A terrible headache could be text fields written with the serialize() function in IPB, where I guess conversion will fail, since string lengths may change after converting to utf8. Is this true? If so, what is the best way to deal with this, or at least is there a way to automatically identify serialized fields and leave them out of the conversion (maybe only latin characters really exist in such fields)?

Many thanks in advance.

Posted

[quote name='Valtasar' post='1783669' date='Feb 1 2009, 08:04 PM']Hi,

I followed this thread (and the corresponding IPB3 thread) and I am also very much interested in this issue.
I attempt to migrate an existing IPB2.3.5 database from iso-8859-7 to utf8.
I will try to write such an automatic conversion php script and post it here for everybody to try, based on the IPB3 script you provided and the i18n charset library.
However, before I do this, I need to ask you some important questions:

Thank you Valtasar....
I really apriciated... That would save tons of time for us...
:)

Posted

[quote name='Valtasar' post='1783669' date='Feb 1 2009, 09:04 PM']Hi,

I followed this thread (and the corresponding IPB3 thread) and I am also very much interested in this issue.
I attempt to migrate an existing IPB2.3.5 database from iso-8859-7 to utf8.
I will try to write such an automatic conversion php script and post it here for everybody to try, based on the IPB3 script you provided and the i18n charset library.
However, before I do this, I need to ask you some important questions:
1) Since I haven't studied any IPB3 code, will the IPB2 function:

$post = $this->ipsclass->txt_convert_charsets($r['post'] , 'old_char_set', 'utf-8' );

do exactly the same thing as the IPB3 function you used:

$post = IPSText::convertCharsets( $r['post'], 'old_char_set', 'utf-8' );

or not?

Yes, it is effectively the same function.

[quote name='Valtasar' post='1783669' date='Feb 1 2009, 09:04 PM']2) I will automatically get a list of IPB fields to convert. I plan to convert the fields:

char varchar text enum set tinytext mediumtext longtext

Should I convert some of them? Or maybe there are more to convert? Can you please specify the proper set?

We do not use any enum, set, or longtext fields, to my knowledge, but you can look and convert them if there are any. The list is fine otherwise.

[quote name='Valtasar' post='1783669' date='Feb 1 2009, 09:04 PM']3) In the database I see a lot Greek characters converted to html entities for some reason (as I said iso-8859-7 was used as the charset in IPB) that I assume cause problems in searches. Is this handled by txt_convert_charsets() function?
Should I try to convert these entities to utf8 e.g. using a standard PHP function like html_entity_decode() or mb_convert_encoding(), or not?

This is most likely from AJAX functionality. html_entity_decode won't generally be able to convert these character entities, however mb_convert_encoding might be able to using the special HTML-ENTITIES "character set".

[quote name='Valtasar' post='1783669' date='Feb 1 2009, 09:04 PM']4) A terrible headache could be text fields written with the serialize() function in IPB, where I guess conversion will fail, since string lengths may change after converting to utf8. Is this true? If so, what is the best way to deal with this, or at least is there a way to automatically identify serialized fields and leave them out of the conversion (maybe only latin characters really exist in such fields)?

Many thanks in advance.

Indeed this is a big problem, and I'm afraid there's no easy solution. They are not solely latin characters (e.g. polls and poll data is serialized), so they cannot just be left as is, unless that's a compromise in creating the script from the start ("your polls will not convert") that the user is ok with.

The proper solution would be to select the serialized data, unserialize it, iterate recursively through the array, convert each value and store the converted value back into the array over itself, reserialized, then the db value updated.

Posted

As promised, here is my php script to automatically convert a IPB2 database to utf8.
It does not use IPB2-style queries, but seems to work for me, converting most of my iso-8859-7 database to utf8.
However, it still fails in a few cases, which I have to investigate in the next days.
The reason is probably (as I noticed) that there is some binary content within some text fields (??),
which I can only guess are either already utf8-encoded chars, or I don't know what.
I will have to write to a file the specific cases where txt_convert_charsets fails, to find out what really happened.

As I also explain inside the file, in my case, default charset for remote and local database was utf8 but the database had latin1 charset, latin1_swedish_ci collation and IPB charset was set to 'iso-8859-7'.
Hence it was also erroneously displayed in phpmyadmin.
When I got my local sql backup, the content was really utf8-encoded latin1 chars (I couldn't get anything better from backup/phpmyadmin)!
I then changed all the "CREATE TABLE" and "SET character_set_client" commands from latin1 to utf8, trying to do
ABSOLUTELY NO modification to the content of the database, before running the attached script.

- Serialization is handled quite nicely I think, with a command I found to change the string lengths (without unserializing).
- It is fully automatic - it will examine your database and convert only "char varchar text enum set tinytext mediumtext longtext" fileds.
- Skips fields to be converted if they are a primary key. Any ideas on how to convert these? (small problem-I guess they do not contain utf8 strings).
- Skips conversion if the table does not have a primary key (odd but there existed such tables in my IPB2 database-not any important ones of course).

To run it, just place it at your IPB home directory, and change the database connection details and your source charset.
You can optionally convert html entities after utf8 conversion:it worked for me, but it was the opposite of what bfarber says in the above post: html_entity_decode( $out, ENT_NOQUOTES, "utf-8" ) worked fine (php5.2.6, mysql 5.0, windows xp) while mb_convert_encoding($out, 'utf-8', 'HTML-ENTITIES') failed-I really don't know why!

Feel free to try it, and to correct/improve it if possible, so that at the end we all manage to convert our databases to utf8!
ipb2_utf8_convert.php

VERY IMPORTANT: ONLY USE THIS FOR EXPERIMENTATION AND ONLY USING A LOCAL COPY!

Posted

Valtasar,
Thank you for this, but your post is too technical for us....

What do you mean by "LOCAL COPY ONLY"????

If i make a back up from my database with PHPMYADMIN and run this script on my backup on the server, CAN I DO THIS?

Posted

[quote name='media' post='1783929' date='Feb 3 2009, 11:08 PM']Valtasar,
Thank you for this, but your post is too technical for us....

What do you mean by "LOCAL COPY ONLY"????

If i make a back up from my database with PHPMYADMIN and run this script on my backup on the server, CAN I DO THIS?

Yes, I think he's just saying not to use this on your live site as of yet. ;) Could/probably will break something until more testing is done.

Impressive Valtasar. :) This isn't the easiest thing to deal with by any means.

Posted

Guys don't get me wrong I have to understand before i do anything please follow my steps and tell if anything wrong with these steps...

[*]A new copy of my database on the server [*]Upload script to the root of IPB [*]Edit new database information (copy) in the script (Database and charset) [*]Run the script Do i need to do anything other than above steps? Valtasar, I have to give the GOLD MEDAL man.... Thank you :cool:














Posted

[quote name='media' post='1783973' date='Feb 4 2009, 03:11 PM']Guys don't get me wrong I have to understand before i do anything please follow my steps and tell if anything wrong with these steps...

[*]A new copy of my database on the server [*]Upload script to the root of IPB [*]Edit new database information (copy) in the script (Database and charset) [*]Run the script Do i need to do anything other than above steps? Valtasar, I have to give the GOLD MEDAL man.... Thank you :cool: Many thanks to all of you for your nice words! Yes, I can probably upload the script to resources, after a few improvements. Well, you seem to want to try this on a backup copy of your database at your server - I think this is possible, although I did it at my copy at home. I do not know the current charset of your database, but you should try to run the script, on a database copy having exactly the same content, however both the database copy and its tables should be created with utf8 encoding. I did this by: 1) Exporting the database (with create tables commands) to an sql script 2) Replacing latin1 to utf8 everywhere in the script (create tables commands and client charset) 3) Creating a new utf8 database 4) Importing the sql script in this new utf8 database Good luck and please tell us if it worked! I hope it works better in your case - as I said in my case some posts/fields were not converted, and I am still checking what went wrong and if I find something I will post it here.





























Posted

Ok let me rewrite what I do again and let's see if you see anything i am missing.... :)

1. Make a copy of my database with phpmyadmin
2. On new database, I run this little script to convert collums and fields to new charset



If change above information to new copy database, i assume that you are using conf_global.php and you get again the database connection info....

Do I have to change bot info to new copy database or just above code would be enough????

Thanks

<?php

// Database info


$dbhost = "localhost";

$dbuser = "username";

$dbpass = "password";

$dbname = "db_name";


//ALTER TABLE $table DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci, CONVERT TO CHARACTER SET utf8

//ALTER TABLE $table DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci

//—————


header("Content-type: text/plain");


$dbconn = mysql_connect($dbhost, $dbuser, $dbpass) or die( mysql_error() );

$db = mysql_select_db($dbname) or die( mysql_error() );


$sql = "SHOW TABLES";

$result = mysql_query($sql) or die( mysql_error() );


while ( $row = mysql_fetch_row($result) )

{

$table = mysql_real_escape_string($row[0]);

$sql = "ALTER TABLE $table DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci, CONVERT TO CHARACTER SET utf8";

mysql_query($sql) or die( mysql_error() );

print "$table changed to UTF-8.\n";

}


mysql_close($dbconn);

?>


3. upload your script (new conversion script) to root of IPB

4. Edit your script and update new database information

5. Run The script


Am i missing anything????


The only problem that i see is this:

In your script you do have this

require_once ROOT_PATH   . "conf_global.php";


and this

// OPTIONS to check and change

$hostname = "localhost";

$username = "admin";

$password = "password";

$dbname = "bak_forum";

$old_char_set = "iso-8859-9";

$convert_html_entities=true;
Posted

Hi again,

You pose two issues:
1) I also thought about including ALTER TABLE commands provided by your script in my script, but I was a bit afraid how MySQL deals with them. Please go to the manual page and search for "For a column that has a data type". The problem mentioned here is that the field types COULD even change (extend) after the ALTER TABLE command (i.e. text->mediumtext, varchar -> mediumtext, etc) to fit the "larger" utf8 data.

bfarber, or any other IPB expert please tell us if such field length changes MAY cause any problem to IPB.
I was also unsure if the same problem exists when e.g. we use "CREATE TABLE" with utf8 charset (I guess the field size is still 64k bytes, i.e. 64k/3 utf8 chars). So I decided not to include at all such commands in my script, until I know exactly how this should be done :)
There is also the probability that different MySQL versions may behave differently on the above issue!

Anyway, IMHO your script will probably work.

2) Maybe I do not perfectly understand your question, but you are probably asking which of the two databases will be modified after including conf_global.php script, which contains reference to your REAL (LIVE) database.
The answer to this is very clear: ONLY the database you state in $dbname = "bak_forum"; will be affected/converted.
In fact you can even comment out (or delete) the 5 "require once" lines just after "require_once( './sources/ipsclass.php' );" and the script will still work! Thus, you will be sure that nothing will be modified in your LIVE database.

Posted

[quote name='Valtasar' post='1784039' date='Feb 4 2009, 04:39 PM']Hi again,

You pose two issues:
1) I also thought about including ALTER TABLE commands provided by your script in my script, but I was a bit afraid how MySQL deals with them. Please go to the manual page and search for "For a column that has a data type". The problem mentioned here is that the field types COULD even change (extend) after the ALTER TABLE command (i.e. text->mediumtext, varchar -> mediumtext, etc) to fit the "larger" utf8 data.

bfarber, or any other IPB expert please tell us if such field length changes MAY cause any problem to IPB.
I was also unsure if the same problem exists when e.g. we use "CREATE TABLE" with utf8 charset (I guess the field size is still 64k bytes, i.e. 64k/3 utf8 chars). So I decided not to include at all such commands in my script, until I know exactly how this should be done :)
There is also the probability that different MySQL versions may behave differently on the above issue!

Anyway, IMHO your script will probably work.

2) Maybe I do not perfectly understand your question, but you are probably asking which of the two databases will be modified after including conf_global.php script, which contains reference to your REAL (LIVE) database.
The answer to this is very clear: ONLY the database you state in $dbname = "bak_forum"; will be affected/converted.
In fact you can even comment out (or delete) the 5 "require once" lines just after "require_once( './sources/ipsclass.php' );" and the script will still work! Thus, you will be sure that nothing will be modified in your LIVE database.


Thank you I am ready to give it a try right now....
will see what happens...
At least it is a copy of my database we can play with it...

Thanks Valtasar....

Posted

Ok i got this as output:



No did not do the job...
I guess memory limit or something is effecting the script what should i do to prevent that.... or something else

Table inv_acp_help: fields to be converted: 4 primary keys: 1 Converted: page_key Converted: help_title Converted: help_body Converted: help_mouseover Table inv_admin_login_logs: fields to be converted: 3 primary keys: 1 Converted: admin_ip_address Converted: admin_username Converted: admin_post_details Table inv_admin_logs: fields to be converted: 4 primary keys: 1 Converted: act Converted: code Converted: note Converted: ip_address Table inv_admin_permission_keys: Skipping field perm_key since it is a primary key Table inv_admin_permission_keys: fields to be converted: 3 primary keys: 1 Converted: perm_main Converted: perm_child Converted: perm_bit Table inv_admin_permission_rows: fields to be converted: 1 primary keys: 1 Converted: row_perm_cache Table inv_admin_sessions: Skipping field session_id since it is a primary key Table inv_admin_sessions: fields to be converted: 4 primary keys: 1 Converted: session_ip_address Converted: session_member_name Converted: session_member_login_key Converted: session_location Table inv_announcements: fields to be converted: 3 primary keys: 1 Converted: announce_title Converted: announce_post
Posted

[quote name='media' post='1784047' date='Feb 4 2009, 11:11 PM']No did not do the job...
I guess memory limit or something is effecting the script what should i do ot prevent that.... or something else

First it is strange that important IPB tables (e.g. ibf_topics, ibf_posts and others) that definitely contain Turkish characters are not in the output? I have no idea why the script seems to not even try to convert them.

I have a suggestion to see what goes wrong:

1) Uncomment line 87 and provide there a table that definitely contains Turkish characters. Now the script will try to convert THIS table ONLY.
2) Then go to line 132 of the script (printf("in: %s\n",$row[$num_pk]); ) and uncomment it out.
3) Redirect the output to a file, as you already did. If your database and table you selected and the database connection and content is fine, you should now see the correct contents of the specific Table and real iso-8859-9 characters in this file.

Good luck.
PS: You probably HAVE to write the php script from command line or alter the PHP variable "max_execution_time", because this php script really needs a lot of time to run.

Posted

After I change those new setting...
It is working i see chars in UTF-8...

But still problem exist, if the database or table is big the script stops working....

Is there a way to setup the memory thing in php? to prevent that...

Posted

I set php_value max_execution_time 0 in .htaccess and try to run again....
But for the big databases we need batch system like to run 500 records at a time...

Finally Timed out... :)

Thanks

Posted

Firstly, you should run the script from command line. If you do, your .htaccess changes will have no impact because you're not using apache. Additionally, you'll need to get rid of memory limit and max execution times to give it enough power to finish.

Secondly, I do agree, processing in batches might prove more reliable. What will happen if someone has 1M posts?

Posted

[quote name='bfarber' post='1784122' date='Feb 5 2009, 08:42 AM']Firstly, you should run the script from command line. If you do, your .htaccess changes will have no impact because you're not using apache. Additionally, you'll need to get rid of memory limit and max execution times to give it enough power to finish.

Secondly, I do agree, processing in batches might prove more reliable. What will happen if someone has 1M posts?

Thank you...

I will run it from command line next time but someone needs to modify the script with batch system...

I hope someone willing to do that..

Thanks guys...

Archived

This topic is now archived and is closed to further replies.

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...