Jump to content

the utf8 problem


Recommended Posts

Posted

Hello, im playing with an utf9 convesion script and have found a bug somewhere in my setup

The final problem is that if i set a topic title to "prueba áéíóú" its saved as: "prueba áé"

the query executed is:

UPDATE gatopics SET title='otro áé� óú vocalesd3',title_seo='otro-vocalesd3' WHERE tid=88970

to thest my server setup i setup a new test board and changed the title of the demo topic to something similar and it worked nicely, the query executed is:

UPDATE utf8_topics SET title='Welcome áéíóú¡²³¤€¼½¾‘dsf',title_seo='welcome-%c3%a1%c3%a9%c3%ad%c3%b3%c3%ba%c2%a1%c2%b2%c2%b3%c2%a4%e2%82%ac%c2%bc%c2%bd%c2%be%e2%80%98dsf' WHERE tid=1

I think at somepoint in some migration i got everyringh talking latin1 while the database was specified utf8 or something along that line

the schemas of the tables are:

original:



--

-- Table structure for table `utf8_topics`

--



CREATE TABLE IF NOT EXISTS `utf8_topics` (

  `tid` int(10) NOT NULL AUTO_INCREMENT,

  `title` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `posts` int(10) DEFAULT NULL,

  `starter_id` mediumint(8) NOT NULL DEFAULT '0',

  `start_date` int(10) DEFAULT NULL,

  `last_poster_id` mediumint(8) NOT NULL DEFAULT '0',

  `last_post` int(10) DEFAULT NULL,

  `starter_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_poster_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `poll_state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_vote` int(10) DEFAULT NULL,

  `views` int(10) DEFAULT NULL,

  `forum_id` smallint(5) NOT NULL DEFAULT '0',

  `approved` tinyint(1) NOT NULL DEFAULT '0',

  `author_mode` tinyint(1) DEFAULT NULL,

  `pinned` tinyint(1) DEFAULT NULL,

  `moved_to` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,

  `topic_hasattach` smallint(5) NOT NULL DEFAULT '0',

  `topic_firstpost` int(10) NOT NULL DEFAULT '0',

  `topic_queuedposts` int(10) NOT NULL DEFAULT '0',

  `topic_open_time` int(10) NOT NULL DEFAULT '0',

  `topic_close_time` int(10) NOT NULL DEFAULT '0',

  `topic_rating_total` smallint(5) unsigned NOT NULL DEFAULT '0',

  `topic_rating_hits` smallint(5) unsigned NOT NULL DEFAULT '0',

  `title_seo` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_last_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_first_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `topic_deleted_posts` int(10) NOT NULL DEFAULT '0',

  `tdelete_time` int(11) NOT NULL DEFAULT '0',

  `moved_on` int(11) NOT NULL DEFAULT '0',

  `topic_archive_status` int(1) NOT NULL DEFAULT '0',

  `last_real_post` int(10) NOT NULL DEFAULT '0',

  `topic_answered_pid` int(10) NOT NULL DEFAULT '0',

  PRIMARY KEY (`tid`),

  KEY `topic_firstpost` (`topic_firstpost`),

  KEY `last_post` (`forum_id`,`pinned`,`last_post`,`state`),

  KEY `forum_id` (`forum_id`,`pinned`,`approved`),

  KEY `starter_id` (`starter_id`,`forum_id`,`approved`,`start_date`),

  KEY `last_post_sorting` (`last_post`,`forum_id`),

  KEY `start_date` (`start_date`),

  KEY `last_x_topics` (`forum_id`,`approved`,`start_date`),

  KEY `approved` (`approved`,`tdelete_time`),

  KEY `moved_redirects` (`moved_on`,`moved_to`,`pinned`),

  KEY `topic_archive_status` (`topic_archive_status`,`forum_id`),

  KEY `last_poster_id` (`last_poster_id`),

  FULLTEXT KEY `title` (`title`)

) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=2 ;



after conversion:



--

-- Table structure for table `gatopics`

--



CREATE TABLE IF NOT EXISTS `gatopics` (

  `tid` int(10) NOT NULL AUTO_INCREMENT,

  `title` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `description` varchar(250) COLLATE utf8_unicode_ci DEFAULT NULL,

  `state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `posts` int(10) DEFAULT NULL,

  `starter_id` mediumint(8) NOT NULL DEFAULT '0',

  `start_date` int(10) DEFAULT NULL,

  `last_poster_id` mediumint(8) NOT NULL DEFAULT '0',

  `last_post` int(10) DEFAULT NULL,

  `icon_id` tinyint(2) DEFAULT NULL,

  `starter_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_poster_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `poll_state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_vote` int(10) DEFAULT NULL,

  `views` int(10) DEFAULT NULL,

  `forum_id` smallint(5) NOT NULL DEFAULT '0',

  `approved` tinyint(1) NOT NULL DEFAULT '0',

  `author_mode` tinyint(1) DEFAULT NULL,

  `pinned` tinyint(1) DEFAULT NULL,

  `moved_to` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,

  `topic_hasattach` smallint(5) NOT NULL DEFAULT '0',

  `topic_firstpost` int(10) NOT NULL DEFAULT '0',

  `topic_queuedposts` int(10) NOT NULL DEFAULT '0',

  `topic_open_time` int(10) NOT NULL DEFAULT '0',

  `topic_close_time` int(10) NOT NULL DEFAULT '0',

  `topic_rating_total` smallint(5) unsigned NOT NULL DEFAULT '0',

  `topic_rating_hits` smallint(5) unsigned NOT NULL DEFAULT '0',

  `title_seo` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_last_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_first_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `topic_deleted_posts` int(10) NOT NULL DEFAULT '0',

  `short_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `tweet_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `tweet` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `template_fields` text COLLATE utf8_unicode_ci,

  `tdelete_time` int(11) NOT NULL DEFAULT '0',

  `moved_on` int(11) NOT NULL DEFAULT '0',

  `last_real_post` int(10) NOT NULL DEFAULT '0',

  `topic_archive_status` int(1) NOT NULL DEFAULT '0',

  `topic_answered_pid` int(10) NOT NULL DEFAULT '0',

  PRIMARY KEY (`tid`),

  KEY `topic_firstpost` (`topic_firstpost`),

  KEY `last_post` (`forum_id`,`pinned`,`last_post`,`state`),

  KEY `forum_id` (`forum_id`,`pinned`,`approved`),

  KEY `last_post_sorting` (`last_post`,`forum_id`),

  KEY `start_date` (`start_date`),

  KEY `last_x_topics` (`forum_id`,`approved`,`start_date`),

  KEY `approved` (`approved`,`tdelete_time`),

  KEY `moved_redirects` (`moved_on`,`moved_to`,`pinned`),

  KEY `starter_id` (`starter_id`,`forum_id`,`approved`,`start_date`),

  KEY `topic_archive_status` (`topic_archive_status`,`forum_id`),

  KEY `last_poster_id` (`last_poster_id`)

) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=88972 ;

 

original:



--

-- Table structure for table `gatopics`

--



CREATE TABLE IF NOT EXISTS `gatopics` (

  `tid` int(10) NOT NULL AUTO_INCREMENT,

  `title` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `description` varchar(250) COLLATE utf8_unicode_ci DEFAULT NULL,

  `state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `posts` int(10) DEFAULT NULL,

  `starter_id` mediumint(8) NOT NULL DEFAULT '0',

  `start_date` int(10) DEFAULT NULL,

  `last_poster_id` mediumint(8) NOT NULL DEFAULT '0',

  `last_post` int(10) DEFAULT NULL,

  `icon_id` tinyint(2) DEFAULT NULL,

  `starter_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_poster_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `poll_state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_vote` int(10) DEFAULT NULL,

  `views` int(10) DEFAULT NULL,

  `forum_id` smallint(5) NOT NULL DEFAULT '0',

  `approved` tinyint(1) NOT NULL DEFAULT '0',

  `author_mode` tinyint(1) DEFAULT NULL,

  `pinned` tinyint(1) DEFAULT NULL,

  `moved_to` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,

  `topic_hasattach` smallint(5) NOT NULL DEFAULT '0',

  `topic_firstpost` int(10) NOT NULL DEFAULT '0',

  `topic_queuedposts` int(10) NOT NULL DEFAULT '0',

  `topic_open_time` int(10) NOT NULL DEFAULT '0',

  `topic_close_time` int(10) NOT NULL DEFAULT '0',

  `topic_rating_total` smallint(5) unsigned NOT NULL DEFAULT '0',

  `topic_rating_hits` smallint(5) unsigned NOT NULL DEFAULT '0',

  `title_seo` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_last_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_first_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `topic_deleted_posts` int(10) NOT NULL DEFAULT '0',

  `short_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `tweet_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `tweet` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `template_fields` text COLLATE utf8_unicode_ci,

  `tdelete_time` int(11) NOT NULL DEFAULT '0',

  `moved_on` int(11) NOT NULL DEFAULT '0',

  `last_real_post` int(10) NOT NULL DEFAULT '0',

  `topic_archive_status` int(1) NOT NULL DEFAULT '0',

  `topic_answered_pid` int(10) NOT NULL DEFAULT '0',

  PRIMARY KEY (`tid`),

  KEY `topic_firstpost` (`topic_firstpost`),

  KEY `last_post` (`forum_id`,`pinned`,`last_post`,`state`),

  KEY `forum_id` (`forum_id`,`pinned`,`approved`),

  KEY `last_post_sorting` (`last_post`,`forum_id`),

  KEY `start_date` (`start_date`),

  KEY `last_x_topics` (`forum_id`,`approved`,`start_date`),

  KEY `approved` (`approved`,`tdelete_time`),

  KEY `moved_redirects` (`moved_on`,`moved_to`,`pinned`),

  KEY `starter_id` (`starter_id`,`forum_id`,`approved`,`start_date`),

  KEY `topic_archive_status` (`topic_archive_status`,`forum_id`),

  KEY `last_poster_id` (`last_poster_id`)

) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=88970 ;

 
Posted

i think that all data in MySQL is latin1, while all collations are UTF

i say this because:

all schemas show utf8_general_ci

if i set sql_charset in conf_global.php as latin1 everything works

if i set it to UTF i have a database error

Posted

made a new test, restored original database made a mysqldump with latin1 as codepage=latin1 and imported as utf, now i can use UTF8 in conf_global yet áéíóú in topic titles (i wonder if there is any misbehaving filter, since the cut is always in the accented I

Posted

weird, alfter a LOT, and i mean more than 20 dumps, restores, greps and thinkgs like that i always endup with the same problems, the í character in post title makes a mess

not sure why this is happening, but im pretty much sure everything is in utf8, but found this in the logs:

 INSERT INTO gatopics (`title`,`title_seo`,`state`,`posts`,`starter_id`,`starter_name`,`seo_first_name`,`start_date`,`last_poster_id`,`last_poster_name`,`seo_last_name`,`last_post`,`author_mode`,`poll_state`,`last_vote`,`views`,`forum_id`,`approved`,`topic_archive_status`,`pinned`,`topic_open_time`,`topic_close_time`) VALUES('otro áé<C3> óú vocales','otro-vocales','open',0,1,'Luis Manson','luis-manson',1383171922,1,'Luis Manson','luis-manson',1383171922,1,0,0,0,12,1,0,0,0,0)

==============================================================================
==============================================================================

 Time Taken: 0.00158
 INSERT INTO gaposts (`author_id`,`use_sig`,`use_emo`,`ip_address`,`post_date`,`post`,`author_name`,`topic_id`,`queued`,`post_htmlstate`,`post_bwoptions`,`post_key`,`new_topic`) VALUES(1,1,1,'190.229.88.203',1383171922,'<p>otro áéíóú vocales</p>n<p>äåé®þüúíóö</p>n<p>qwertyuiop</p>n','Luis Manson',89176,0,0,0,'977dba96f133077e2fd5f23842871690',1)

as you can see IPB inserts into the topic table "otro áé<C3> óú vocales" while it should be "otro áéíóú vocales"

Posted

It's probably not trying to actually insert C3, it just happens that the byte structure of the character í is C3AD, so it's almost like it's losing the last byte or something which is pretty crazy. I presume you got that query from a log file, in which case the fact the log file has a character corrupted in that way is a bit worrying. Where did it come from? Is it a IPB log or your MySQL slow query log? That question seems important to me because if it's an IPB log, then that means the query seems to be corrupted before it hits the DB, but if it's the MySQL slow log then perhaps it is being corrupted on the way there.

In the other topic, you said that utf8 is your default character set, is it definitely?

You could run this MySQL query and you should see something along these lines:

mysql> SHOW VARIABLES WHERE Variable_Name LIKE 'character_set%' OR Variable_Name LIKE 'collation_%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
| collation_connection     | utf8_unicode_ci            |
| collation_database       | utf8_unicode_ci            |
| collation_server         | utf8_unicode_ci            |
+--------------------------+----------------------------+
11 rows in set (0.00 sec)
Posted

Hi PeterUK, i have suck a big mess that i think i will end up paying for someone to fix this

i posted also another topic, in ipb support forums because i got that query from the IPB log file, and in one of my tests i had disabled all hooks and apps, thats why im confused in what kind of problem i really have

in my production server i have this:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
| collation_connection     | utf8_general_ci            |
| collation_database       | latin1_swedish_ci          |
| collation_server         | latin1_swedish_ci          |
+--------------------------+----------------------------+

and the one where im testing:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
| collation_connection     | utf8_general_ci            |
| collation_database       | latin1_swedish_ci          |
| collation_server         | latin1_swedish_ci          |
+--------------------------+----------------------------+

Posted

So you do have a mix between character sets and collations within your server. I do not know enough in depth about MySQL configuration to know how having a differing character_set_database/server from the character_set_client/connection will cause problems, but I guess it probably isn't great. You can modify your MySQL configuration to make all of those utf8 which may help with some of your issues of queries seemingly being corrupted.

If you want to do that, you could add the following to your my.cnf file under [mysqld]:

skip-character-set-client-handshake
collation-server=utf8_unicode_ci
character-set-server=utf8
Posted

i had that and did not work, i also took a dump and replaced all latin1 for utf8 and made no difference, but i will give it a try again

By any chance do you have an idea how to correctly identify this issue? Im not quite sure yet

Posted

this is TOO weird, i came up to one problem (for now)

i can NEVER save a thread with an í on its title, it gets cut (and saved) right there, there is no problem on post content.

Also, it only works if i use the "latin1" database

Posted

me again:

Foudnd this query:

SELECT title, HEX( title ) , LENGTH( title ) , CHAR_LENGTH( title )
FROM gatopics
ORDER BY `tid` DESC
LIMIT 0 , 5

the hex shows UTF encoding:

title                HEX( title )         LENGTH( title ) CHAR_LENGTH( title ) 
otro áéÝ            6F74726F20C3A1C3A9C39D      11                8

still dont know whats the thing with the í

Posted

I would be happy to take a look around your test server for you if you wish to provide me details in PM, and you don't mind me messing with configs or restarting it etc.. Do you have a decent copy of your database, ie. one with no encoding issues with characters but maybe the problem with the mix of character sets? If so maybe you could send me a dump of that and some examples of where characters are wrong and I could have a look.

Posted

I would be happy to take a look around your test server for you if you wish to provide me details in PM, and you don't mind me messing with configs or restarting it etc.. Do you have a decent copy of your database, ie. one with no encoding issues with characters but maybe the problem with the mix of character sets? If so maybe you could send me a dump of that and some examples of where characters are wrong and I could have a look.

thanks Peter, i sent you a PM

Archived

This topic is now archived and is closed to further replies.

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...