Jump to content

the utf8 problem


Recommended Posts

Hello, im playing with an utf9 convesion script and have found a bug somewhere in my setup

The final problem is that if i set a topic title to "prueba áéíóú" its saved as: "prueba áé"

the query executed is:

UPDATE gatopics SET title='otro áé� óú vocalesd3',title_seo='otro-vocalesd3' WHERE tid=88970

to thest my server setup i setup a new test board and changed the title of the demo topic to something similar and it worked nicely, the query executed is:

UPDATE utf8_topics SET title='Welcome áéíóú¡²³¤€¼½¾‘dsf',title_seo='welcome-%c3%a1%c3%a9%c3%ad%c3%b3%c3%ba%c2%a1%c2%b2%c2%b3%c2%a4%e2%82%ac%c2%bc%c2%bd%c2%be%e2%80%98dsf' WHERE tid=1

I think at somepoint in some migration i got everyringh talking latin1 while the database was specified utf8 or something along that line

the schemas of the tables are:

original:



--

-- Table structure for table `utf8_topics`

--



CREATE TABLE IF NOT EXISTS `utf8_topics` (

  `tid` int(10) NOT NULL AUTO_INCREMENT,

  `title` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `posts` int(10) DEFAULT NULL,

  `starter_id` mediumint(8) NOT NULL DEFAULT '0',

  `start_date` int(10) DEFAULT NULL,

  `last_poster_id` mediumint(8) NOT NULL DEFAULT '0',

  `last_post` int(10) DEFAULT NULL,

  `starter_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_poster_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `poll_state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_vote` int(10) DEFAULT NULL,

  `views` int(10) DEFAULT NULL,

  `forum_id` smallint(5) NOT NULL DEFAULT '0',

  `approved` tinyint(1) NOT NULL DEFAULT '0',

  `author_mode` tinyint(1) DEFAULT NULL,

  `pinned` tinyint(1) DEFAULT NULL,

  `moved_to` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,

  `topic_hasattach` smallint(5) NOT NULL DEFAULT '0',

  `topic_firstpost` int(10) NOT NULL DEFAULT '0',

  `topic_queuedposts` int(10) NOT NULL DEFAULT '0',

  `topic_open_time` int(10) NOT NULL DEFAULT '0',

  `topic_close_time` int(10) NOT NULL DEFAULT '0',

  `topic_rating_total` smallint(5) unsigned NOT NULL DEFAULT '0',

  `topic_rating_hits` smallint(5) unsigned NOT NULL DEFAULT '0',

  `title_seo` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_last_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_first_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `topic_deleted_posts` int(10) NOT NULL DEFAULT '0',

  `tdelete_time` int(11) NOT NULL DEFAULT '0',

  `moved_on` int(11) NOT NULL DEFAULT '0',

  `topic_archive_status` int(1) NOT NULL DEFAULT '0',

  `last_real_post` int(10) NOT NULL DEFAULT '0',

  `topic_answered_pid` int(10) NOT NULL DEFAULT '0',

  PRIMARY KEY (`tid`),

  KEY `topic_firstpost` (`topic_firstpost`),

  KEY `last_post` (`forum_id`,`pinned`,`last_post`,`state`),

  KEY `forum_id` (`forum_id`,`pinned`,`approved`),

  KEY `starter_id` (`starter_id`,`forum_id`,`approved`,`start_date`),

  KEY `last_post_sorting` (`last_post`,`forum_id`),

  KEY `start_date` (`start_date`),

  KEY `last_x_topics` (`forum_id`,`approved`,`start_date`),

  KEY `approved` (`approved`,`tdelete_time`),

  KEY `moved_redirects` (`moved_on`,`moved_to`,`pinned`),

  KEY `topic_archive_status` (`topic_archive_status`,`forum_id`),

  KEY `last_poster_id` (`last_poster_id`),

  FULLTEXT KEY `title` (`title`)

) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=2 ;



after conversion:



--

-- Table structure for table `gatopics`

--



CREATE TABLE IF NOT EXISTS `gatopics` (

  `tid` int(10) NOT NULL AUTO_INCREMENT,

  `title` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `description` varchar(250) COLLATE utf8_unicode_ci DEFAULT NULL,

  `state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `posts` int(10) DEFAULT NULL,

  `starter_id` mediumint(8) NOT NULL DEFAULT '0',

  `start_date` int(10) DEFAULT NULL,

  `last_poster_id` mediumint(8) NOT NULL DEFAULT '0',

  `last_post` int(10) DEFAULT NULL,

  `icon_id` tinyint(2) DEFAULT NULL,

  `starter_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_poster_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `poll_state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_vote` int(10) DEFAULT NULL,

  `views` int(10) DEFAULT NULL,

  `forum_id` smallint(5) NOT NULL DEFAULT '0',

  `approved` tinyint(1) NOT NULL DEFAULT '0',

  `author_mode` tinyint(1) DEFAULT NULL,

  `pinned` tinyint(1) DEFAULT NULL,

  `moved_to` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,

  `topic_hasattach` smallint(5) NOT NULL DEFAULT '0',

  `topic_firstpost` int(10) NOT NULL DEFAULT '0',

  `topic_queuedposts` int(10) NOT NULL DEFAULT '0',

  `topic_open_time` int(10) NOT NULL DEFAULT '0',

  `topic_close_time` int(10) NOT NULL DEFAULT '0',

  `topic_rating_total` smallint(5) unsigned NOT NULL DEFAULT '0',

  `topic_rating_hits` smallint(5) unsigned NOT NULL DEFAULT '0',

  `title_seo` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_last_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_first_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `topic_deleted_posts` int(10) NOT NULL DEFAULT '0',

  `short_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `tweet_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `tweet` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `template_fields` text COLLATE utf8_unicode_ci,

  `tdelete_time` int(11) NOT NULL DEFAULT '0',

  `moved_on` int(11) NOT NULL DEFAULT '0',

  `last_real_post` int(10) NOT NULL DEFAULT '0',

  `topic_archive_status` int(1) NOT NULL DEFAULT '0',

  `topic_answered_pid` int(10) NOT NULL DEFAULT '0',

  PRIMARY KEY (`tid`),

  KEY `topic_firstpost` (`topic_firstpost`),

  KEY `last_post` (`forum_id`,`pinned`,`last_post`,`state`),

  KEY `forum_id` (`forum_id`,`pinned`,`approved`),

  KEY `last_post_sorting` (`last_post`,`forum_id`),

  KEY `start_date` (`start_date`),

  KEY `last_x_topics` (`forum_id`,`approved`,`start_date`),

  KEY `approved` (`approved`,`tdelete_time`),

  KEY `moved_redirects` (`moved_on`,`moved_to`,`pinned`),

  KEY `starter_id` (`starter_id`,`forum_id`,`approved`,`start_date`),

  KEY `topic_archive_status` (`topic_archive_status`,`forum_id`),

  KEY `last_poster_id` (`last_poster_id`)

) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=88972 ;

 

original:



--

-- Table structure for table `gatopics`

--



CREATE TABLE IF NOT EXISTS `gatopics` (

  `tid` int(10) NOT NULL AUTO_INCREMENT,

  `title` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `description` varchar(250) COLLATE utf8_unicode_ci DEFAULT NULL,

  `state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `posts` int(10) DEFAULT NULL,

  `starter_id` mediumint(8) NOT NULL DEFAULT '0',

  `start_date` int(10) DEFAULT NULL,

  `last_poster_id` mediumint(8) NOT NULL DEFAULT '0',

  `last_post` int(10) DEFAULT NULL,

  `icon_id` tinyint(2) DEFAULT NULL,

  `starter_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_poster_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `poll_state` varchar(8) COLLATE utf8_unicode_ci DEFAULT NULL,

  `last_vote` int(10) DEFAULT NULL,

  `views` int(10) DEFAULT NULL,

  `forum_id` smallint(5) NOT NULL DEFAULT '0',

  `approved` tinyint(1) NOT NULL DEFAULT '0',

  `author_mode` tinyint(1) DEFAULT NULL,

  `pinned` tinyint(1) DEFAULT NULL,

  `moved_to` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,

  `topic_hasattach` smallint(5) NOT NULL DEFAULT '0',

  `topic_firstpost` int(10) NOT NULL DEFAULT '0',

  `topic_queuedposts` int(10) NOT NULL DEFAULT '0',

  `topic_open_time` int(10) NOT NULL DEFAULT '0',

  `topic_close_time` int(10) NOT NULL DEFAULT '0',

  `topic_rating_total` smallint(5) unsigned NOT NULL DEFAULT '0',

  `topic_rating_hits` smallint(5) unsigned NOT NULL DEFAULT '0',

  `title_seo` varchar(250) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_last_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `seo_first_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

  `topic_deleted_posts` int(10) NOT NULL DEFAULT '0',

  `short_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `tweet_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `tweet` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,

  `template_fields` text COLLATE utf8_unicode_ci,

  `tdelete_time` int(11) NOT NULL DEFAULT '0',

  `moved_on` int(11) NOT NULL DEFAULT '0',

  `last_real_post` int(10) NOT NULL DEFAULT '0',

  `topic_archive_status` int(1) NOT NULL DEFAULT '0',

  `topic_answered_pid` int(10) NOT NULL DEFAULT '0',

  PRIMARY KEY (`tid`),

  KEY `topic_firstpost` (`topic_firstpost`),

  KEY `last_post` (`forum_id`,`pinned`,`last_post`,`state`),

  KEY `forum_id` (`forum_id`,`pinned`,`approved`),

  KEY `last_post_sorting` (`last_post`,`forum_id`),

  KEY `start_date` (`start_date`),

  KEY `last_x_topics` (`forum_id`,`approved`,`start_date`),

  KEY `approved` (`approved`,`tdelete_time`),

  KEY `moved_redirects` (`moved_on`,`moved_to`,`pinned`),

  KEY `starter_id` (`starter_id`,`forum_id`,`approved`,`start_date`),

  KEY `topic_archive_status` (`topic_archive_status`,`forum_id`),

  KEY `last_poster_id` (`last_poster_id`)

) ENGINE=MyISAM  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=88970 ;

 
Link to comment
Share on other sites

weird, alfter a LOT, and i mean more than 20 dumps, restores, greps and thinkgs like that i always endup with the same problems, the í character in post title makes a mess

not sure why this is happening, but im pretty much sure everything is in utf8, but found this in the logs:

 INSERT INTO gatopics (`title`,`title_seo`,`state`,`posts`,`starter_id`,`starter_name`,`seo_first_name`,`start_date`,`last_poster_id`,`last_poster_name`,`seo_last_name`,`last_post`,`author_mode`,`poll_state`,`last_vote`,`views`,`forum_id`,`approved`,`topic_archive_status`,`pinned`,`topic_open_time`,`topic_close_time`) VALUES('otro áé<C3> óú vocales','otro-vocales','open',0,1,'Luis Manson','luis-manson',1383171922,1,'Luis Manson','luis-manson',1383171922,1,0,0,0,12,1,0,0,0,0)

==============================================================================
==============================================================================

 Time Taken: 0.00158
 INSERT INTO gaposts (`author_id`,`use_sig`,`use_emo`,`ip_address`,`post_date`,`post`,`author_name`,`topic_id`,`queued`,`post_htmlstate`,`post_bwoptions`,`post_key`,`new_topic`) VALUES(1,1,1,'190.229.88.203',1383171922,'<p>otro áéíóú vocales</p>n<p>äåé®þüúíóö</p>n<p>qwertyuiop</p>n','Luis Manson',89176,0,0,0,'977dba96f133077e2fd5f23842871690',1)

as you can see IPB inserts into the topic table "otro áé<C3> óú vocales" while it should be "otro áéíóú vocales"

Link to comment
Share on other sites

It's probably not trying to actually insert C3, it just happens that the byte structure of the character í is C3AD, so it's almost like it's losing the last byte or something which is pretty crazy. I presume you got that query from a log file, in which case the fact the log file has a character corrupted in that way is a bit worrying. Where did it come from? Is it a IPB log or your MySQL slow query log? That question seems important to me because if it's an IPB log, then that means the query seems to be corrupted before it hits the DB, but if it's the MySQL slow log then perhaps it is being corrupted on the way there.

In the other topic, you said that utf8 is your default character set, is it definitely?

You could run this MySQL query and you should see something along these lines:

mysql> SHOW VARIABLES WHERE Variable_Name LIKE 'character_set%' OR Variable_Name LIKE 'collation_%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
| collation_connection     | utf8_unicode_ci            |
| collation_database       | utf8_unicode_ci            |
| collation_server         | utf8_unicode_ci            |
+--------------------------+----------------------------+
11 rows in set (0.00 sec)
Link to comment
Share on other sites

Hi PeterUK, i have suck a big mess that i think i will end up paying for someone to fix this

i posted also another topic, in ipb support forums because i got that query from the IPB log file, and in one of my tests i had disabled all hooks and apps, thats why im confused in what kind of problem i really have

in my production server i have this:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
| collation_connection     | utf8_general_ci            |
| collation_database       | latin1_swedish_ci          |
| collation_server         | latin1_swedish_ci          |
+--------------------------+----------------------------+

and the one where im testing:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
| collation_connection     | utf8_general_ci            |
| collation_database       | latin1_swedish_ci          |
| collation_server         | latin1_swedish_ci          |
+--------------------------+----------------------------+

Link to comment
Share on other sites

So you do have a mix between character sets and collations within your server. I do not know enough in depth about MySQL configuration to know how having a differing character_set_database/server from the character_set_client/connection will cause problems, but I guess it probably isn't great. You can modify your MySQL configuration to make all of those utf8 which may help with some of your issues of queries seemingly being corrupted.

If you want to do that, you could add the following to your my.cnf file under [mysqld]:

skip-character-set-client-handshake
collation-server=utf8_unicode_ci
character-set-server=utf8
Link to comment
Share on other sites

me again:

Foudnd this query:

SELECT title, HEX( title ) , LENGTH( title ) , CHAR_LENGTH( title )
FROM gatopics
ORDER BY `tid` DESC
LIMIT 0 , 5

the hex shows UTF encoding:

title                HEX( title )         LENGTH( title ) CHAR_LENGTH( title ) 
otro áéÝ            6F74726F20C3A1C3A9C39D      11                8

still dont know whats the thing with the í

Link to comment
Share on other sites

I would be happy to take a look around your test server for you if you wish to provide me details in PM, and you don't mind me messing with configs or restarting it etc.. Do you have a decent copy of your database, ie. one with no encoding issues with characters but maybe the problem with the mix of character sets? If so maybe you could send me a dump of that and some examples of where characters are wrong and I could have a look.

Link to comment
Share on other sites

I would be happy to take a look around your test server for you if you wish to provide me details in PM, and you don't mind me messing with configs or restarting it etc.. Do you have a decent copy of your database, ie. one with no encoding issues with characters but maybe the problem with the mix of character sets? If so maybe you could send me a dump of that and some examples of where characters are wrong and I could have a look.

thanks Peter, i sent you a PM

Link to comment
Share on other sites

Archived

This topic is now archived and is closed to further replies.

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...