UTF8, standard string functions and multibyte encoding


Sonya*

I have seen that some questions and bugs here deal with standard string functions like strtolower(), which do not properly handle multibyte-encoded strings. This is a common problem for non-English resources and is often ignored by English-only :smile: speaking developers.

I have a suggestion: use the Function Overloading Feature http://php.net/manua...ng.overload.php for those who would like to replace standard string functions globally with their multibyte counterparts. Of course this should be tested, but this solution is smart and does not require numerous patches in the core files.

Please add this tip to the documentation for non-English systems. As far as I know, it can be set in .htaccess or in php.ini.


Thank you for your feedback.


after some investigation disabling such function resolved everything.



except for UTF8 issues with non-English content or display names :( There are over 50 files that must be patched by replacing standard string functions with their multibyte counterparts... At least for Russian characters.

Why do they need to be patched, specifically? We don't *typically* use regular string functions for data that is output to the screen. We use strtolower() on member name when storing it in a column in the DB, but then we use the same function when querying against it later, so it shouldn't really matter (although we have a FFFV bug about it anyways). I'm curious where you are seeing issues with a non-patched area of code that you are patching to resolve.


  • 2 weeks later...

We use strtolower() on member name when storing it in a column in the DB


I have a fresh install here, without any modification except for $INFO['sql_charset'] = 'utf8'; I created two users and changed the display name of the first to a Cyrillic one and that of the second to a German one with special characters. Then I looked in the DB and noticed that the column ipb_members.members_l_display_name is empty for both users. The reason: broken UTF-8 strings cannot be saved in MySQL.

Can you try these two display names please?
1. Администратор
2. Ögül

What is saved in the column ipb_members.members_l_display_name in your installation?

  • 1 year later...

Sorry for digging up this old thread, but I am having the same problem.

Thank you for your feedback.


except for UTF8 issues with non-English content or display names :( There are over 50 files that must be patched by replacing standard string functions with their multibyte counterparts... At least for Russian characters.

I have reached the same conclusion as Sonya. My board is in Portuguese and members cannot register with punctuated (accented) usernames; it gives an SQL error. I tried patching the ipsmember.php file, replacing the strtolower() function with mb_strtolower() and passing UTF-8 as the second parameter, and it worked. Nevertheless, all files need to have that function replaced, because users with punctuated characters could not log in, since registration and login then used different functions.
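The difference between the two functions can be illustrated outside PHP. The following is a minimal Python sketch, not IP.Board code. `bytewise_lower` is a hypothetical stand-in for what strtolower() appears to do under a single-byte Western locale (Latin-1 is assumed here in place of the server's actual code page), and `unicode_lower` is a hypothetical stand-in for mb_strtolower() with UTF-8.

```python
def bytewise_lower(utf8_bytes: bytes) -> bytes:
    """Lowercase every byte on its own, as if each byte were one Latin-1 character."""
    return utf8_bytes.decode("latin-1").lower().encode("latin-1")


def unicode_lower(utf8_bytes: bytes) -> bytes:
    """Decode as UTF-8 first, lowercase whole characters, then re-encode."""
    return utf8_bytes.decode("utf-8").lower().encode("utf-8")


name = "Jácão".encode("utf-8")

# The lead byte 0xC3 of each accented letter is treated as an uppercase
# letter and becomes 0xE3, which breaks the two-byte UTF-8 sequence.
print(bytewise_lower(name))

# The multibyte-aware version leaves the accented letters intact.
print(unicode_lower(name))
```

Under these assumptions, only the multibyte-aware result is still a valid UTF-8 string, which is consistent with the patched file accepting the registration.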

Why do they need to be patched, specifically? We don't *typically* use regular string functions for data that is output to the screen. We use strtolower() on member name when storing it in a column in the DB, but then we use the same function when querying against it later, so it shouldn't really matter (although we have a FFFV bug about it anyways). I'm curious where you are seeing issues with a non-patched area of code that you are patching to resolve.

bfarber, it matters because strtolower() spoils punctuated names. It is well known that this function does not support UTF-8 encoding, so why do you keep using it instead of mb_strtolower()? I understand that for English-language boards this is not a problem, but what about other languages?


You are misunderstanding. :smile: I completely 100% realize it matters when changing a character that is to be displayed, because it corrupts it.

What I am saying is, if when you register on a site it uses strtolower($name) to make a lower-case version of the name (yes, it corrupts punctuated characters) and stores this in a special column in the database, and then later when you go to login on the site it ALSO uses strtolower($name), you will have the same corrupted characters in BOTH cases, so the login attempt DOES still match. This value is never displayed to the user. We have many clients running non-English sites in every language imaginable, and this is not a bug at present - it works fine.

We cannot do a simple replacement of strtolower() with mb_strtolower() in the current version, for multiple reasons.

  • Not all users have mb* functions available, and it is not a required extension presently.
  • Users who upgrade will suddenly find their site broken. Taking the above example, if we stored your name previously as strtolower($name) and now when you log in we do mb_strtolower($name), the values will NOT match, and the user will be unable to log in.
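The upgrade problem in the second point can be sketched in a few lines. This is a Python simulation under stated assumptions, not the board's code: `legacy_key` is a hypothetical stand-in for strtolower() under a single-byte Western locale (Latin-1 is assumed), and `multibyte_key` is a hypothetical stand-in for mb_strtolower() with UTF-8.

```python
def legacy_key(name: str) -> bytes:
    """Lookup key built by lowercasing the UTF-8 bytes one byte at a time."""
    return name.encode("utf-8").decode("latin-1").lower().encode("latin-1")


def multibyte_key(name: str) -> bytes:
    """Lookup key built by lowercasing whole Unicode characters."""
    return name.lower().encode("utf-8")


# Value written to the lowercase-name column when the member registered.
stored_at_registration = legacy_key("Jácão")

# Same function at login: the (corrupted) keys are identical, so the lookup matches.
print(legacy_key("Jácão") == stored_at_registration)

# Multibyte-aware function at login against the old stored key: no match.
print(multibyte_key("Jácão") == stored_at_registration)
```

This suggests that any switch would also need the stored keys to be rebuilt with the new function, which the thread does not discuss further.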

Having said that, we are moving to UTF-8 across the board in 4.0 and will be utilizing mb* functions at that time. This does not affect functionality in 3.x, which does work fine, but will allow us to better support languages that have non-ANSI characters.


Thank you bfarber for your explanation.
That will be a great improvement for 4.0, for sure :smile:

The problem is that the MySQL side complains about the insertion of those "spoiled" characters: I get an error 1366 "incorrect string value" like the one below when I try to register a punctuated name, and users cannot register:

Error: 1366 - Incorrect string value: '\xE3\xA1c\xE3\xA3o' for column 'members_l_username' at row 1
IP Address: 188.250.193.89 - /index.php?app=core&module=global&section=register
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
mySQL query error: INSERT INTO ibf_members (`name`,`members_display_name`,`email`,`member_group_id`,`joined`,`ip_address`,`time_offset`,`coppa_user`,`members_auto_dst`,`allow_admin_mails`,`language`,`members_l_username`,`members_created_remote`,`member_login_key`,`member_login_key_expire`,`view_sigs`,`bday_day`,`bday_month`,`bday_year`,`restrict_post`,`auto_track`,`msg_count_total`,`msg_count_new`,`msg_show_notification`,`last_visit`,`last_activity`,`member_uploader`,`members_pass_salt`,`members_pass_hash`,`members_l_display_name`,`fb_uid`,`fb_emailhash`,`members_seo_name`,`members_bitoptions`) VALUES('Jácão','Jácão','chopsshotout@forumusica.com',1,1382447821,'188.250.193.89',0,0,1,0,2,'jã¡cã£o',0,'730916bffd2cfee8dd267bf19c3b57b0',0,1,0,0,0,0,'',0,0,1,1382447823,1382447823,'flash','"B{im','8686f95d690a2c67d7a91a4ba5171eb1','jã¡cã£o',0,'','j%c3%a1c%c3%a3o',0)
.--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------.
| File | Function | Line No. |
|------+----------+----------|
| C:\inetpub\wwwroot\admin\sources\base\ipsMember.php | [db_main_mysql].insert | 661 |
| C:\inetpub\wwwroot\admin\applications\core\modules_public\global\register.php | [IPSMember].create | 1906 |
| C:\inetpub\wwwroot\admin\applications\core\modules_public\global\register.php | [public_core_global_register].registerProcessForm | 65 |
| C:\inetpub\wwwroot\admin\sources\base\ipsController.php | [public_core_global_register].doExecute | 306 |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This error does not happen when using mb_strtolower() function with the UTF-8 parameter.
Do you have any thoughts about this?
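The byte sequence in the error message can be reproduced from the registered name. The sketch below is a Python simulation, not PHP. `bytewise_lower` is a hypothetical stand-in for strtolower() under a single-byte Western locale (Latin-1 is assumed), and `mysql_style_escape` is a hypothetical helper that prints non-ASCII bytes the way the error message does. It also assumes the message quotes the value starting at the first offending byte.

```python
def bytewise_lower(data: bytes) -> bytes:
    """Lowercase each byte separately, as a single-byte locale would."""
    return data.decode("latin-1").lower().encode("latin-1")


def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def mysql_style_escape(data: bytes) -> str:
    """Show ASCII bytes as characters and all other bytes as \\xHH."""
    return "".join(chr(b) if b < 0x80 else f"\\x{b:02X}" for b in data)


stored = bytewise_lower("Jácão".encode("utf-8"))
offset = first_invalid_utf8_offset(stored)

print(offset)                               # position of the first bad byte
print(mysql_style_escape(stored[offset:]))  # the remainder, escaped
```

If the assumptions hold, the lowercased value is no longer valid UTF-8, so a utf8 column in strict mode has reason to reject it regardless of collation.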

I have somehow sorted it out by changing php.ini to the following settings (setting mbstring.func_overload to 2):

Directive                             Local Value                       Master Value
mbstring.detect_order                 auto                              auto
mbstring.encoding_translation         On                                On
mbstring.func_overload                2                                 2
mbstring.http_input                   UTF-8                             UTF-8
mbstring.http_output                  UTF-8                             UTF-8
mbstring.http_output_conv_mimetypes   ^(text/|application/xhtml\+xml)   ^(text/|application/xhtml\+xml)
mbstring.internal_encoding            UTF-8                             UTF-8
mbstring.language                     neutral                           neutral
mbstring.strict_detection             Off                               Off
mbstring.substitute_character         no value                          no value

There are a few possibilities:

1) The easiest option may be to turn off MySQL strict mode in your MySQL configuration. This will stop the complaining about the characters.

2) You should be able to change the charset and collation on the table/column so that it allows the characters to be inserted.

Feel free to submit a ticket for technical support. If your board gives you a MySQL error when attempting to register and it isn't caused by third party code, we will troubleshoot and find a solution. :)


Thank you bfarber,

I have already submitted the ticket to IP support, but they weren't able to help me... "out of scope" :(

I have been doing some research:

1) If I turn off strict mode, I get no error. Nevertheless, the fields where the errors would have occurred, such as 'members_l_username', are filled with NULL (blank). So when the user tries to log in, the user is not found.
2) Somehow, ipboard's strtolower() is converting the punctuated characters to 4-byte UTF. If I change the column collation to utf8mb4_general_ci, I get no errors. Nevertheless, SQL complains about the length of the VARCHAR, which I have to reduce from 255 to 100. Besides that, I had to convert the whole database to utf8mb4_general_ci, as inner joins run into trouble when different collations are used (UTF-8 vs UTF8mb4).

I am warning users not to use punctuated characters in their username when they register... it's the only workaround.. :(
And somehow, PHP cannot get the right locale information from Windows or IIS... mine is pt-PT, from what I can check.

Any thoughts about this?


My board is Polish. All settings / tables are 100% UTF-8. And I have the same problem with some "spoiled" characters, only in different columns (cache_content, log_url).

For example, these characters (Polish characters replaced by a distorted "equivalent") as part of a URL in posts:

Error: 1366 - Incorrect string value: '\xB3yny.p...' for column 'cache_content' at row 1

Error: 1366 - Incorrect string value: '\xEAcia&a...' for column 'cache_content' at row 1

... completely locked whole topics, which could not be opened due to a "Driver error". I had to go directly into the database and remove the characters. That was the only thing I could do at a non-invasive level.
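The two fragments quoted in these errors can be inspected byte by byte. The following Python sketch is only a diagnostic illustration. The thread never establishes where the bytes come from, so reading them as ISO-8859-2 is an assumption (Windows-1250 maps these two particular bytes to the same letters), and `describe` is a hypothetical helper.

```python
def describe(raw: bytes) -> str:
    """Report whether raw is valid UTF-8; if not, show it read as ISO-8859-2."""
    try:
        raw.decode("utf-8")
        return "valid UTF-8"
    except UnicodeDecodeError:
        return "not valid UTF-8; as ISO-8859-2: " + raw.decode("iso-8859-2")


# Byte fragments taken from the two error messages (without the trailing "...").
for fragment in (b"\xb3yny.p", b"\xeacia&a"):
    print(describe(fragment))
```

Under that assumption both fragments begin with an ordinary Polish letter in a single-byte encoding, which would mean the pasted URLs reached the database without being converted to UTF-8.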


xtech


1) If I turn off strict mode, I get no error. Nevertheless, the fields where the errors would have occurred, such as 'members_l_username', are filled with NULL (blank). So when the user tries to log in, the user is not found.
2) Somehow, ipboard's strtolower() is converting the punctuated characters to 4-byte UTF. If I change the column collation to utf8mb4_general_ci, I get no errors. Nevertheless, SQL complains about the length of the VARCHAR, which I have to reduce from 255 to 100. Besides that, I had to convert the whole database to utf8mb4_general_ci, as inner joins run into trouble when different collations are used (UTF-8 vs UTF8mb4).


I also did some research and found the utf8mb4 tip, but I held off on converting anything. In the past, I did a conversion to UTF-8 and it cost me a lot of work plus direct character replacements in the database. I won't repeat a similar process without any guarantee.
Disabling SQL strict mode is only a "dirty workaround".


For the column length, that's due to the index, I'd imagine. You don't have to reduce the column to 100 chars, just the index on that column (index definitions can specify the number of characters in the column to index).
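The arithmetic behind the index limit can be sketched as follows. This is a rough Python illustration, not MySQL output. The key-size limits of 767 bytes (InnoDB with older row formats) and 1000 bytes (MyISAM) are commonly documented MySQL values and are assumptions here, since the thread does not say which storage engine is in use. The helper names are hypothetical.

```python
def index_fits(column_chars: int, bytes_per_char: int, key_limit_bytes: int) -> bool:
    """True if a full-column index stays within the key size limit."""
    return column_chars * bytes_per_char <= key_limit_bytes


def max_prefix_chars(bytes_per_char: int, key_limit_bytes: int) -> int:
    """Largest number of characters an index prefix can cover."""
    return key_limit_bytes // bytes_per_char


# Assumed limits; which one applies depends on the storage engine.
for engine, limit in (("InnoDB (older row formats)", 767), ("MyISAM", 1000)):
    print(engine)
    print("  VARCHAR(255) utf8    fits:", index_fits(255, 3, limit))
    print("  VARCHAR(255) utf8mb4 fits:", index_fits(255, 4, limit))
    print("  max utf8mb4 prefix chars:", max_prefix_chars(4, limit))
```

Under either assumed limit, a full index on a 255-character utf8mb4 column is too large, while a shorter index prefix fits without changing the column length.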

What was the ticket number? I'd like to check into it.


Hi bfarber, the ticket number is 872702.

Luuuk, I have not converted to utf8mb4... I only tried converting the column to see what happened, and it worked. Nevertheless, I then reverted to regular utf8, as I already have plenty of encoding issues to solve :D, and when 4.0 comes out it will be fully UTF8 compatible with the use of the mb_strtolower function, so I didn't want to take the risk.

I have been researching this issue for some days, and I have reached some interesting conclusions:

- The locale info on Windows Server systems (like mine; I am using Windows Server 2012) is quite different. When I set the language locale to "pt_PT" (with no commas), nothing happened.
But when I tried "Portuguese_Portugal" as a random guess, magic happened: I could see the correct date format. Nevertheless, I had one strange issue: the € euro character appeared as a strange square with a question mark, so I could not see it.

- I then tried setting the charset in the ACP to the string "pt_utf8", and it worked: the euro sign was shown correctly... but all the info I had previously entered with punctuated characters appeared "spoiled"! When I refreshed the forum, all the punctuated characters appeared spoiled, and I could see that some uploaded images had disappeared (the image path was probably corrupted). I tried the registration process and, although I got the same error, I could see that the SQL statement had the correct characters. This was quite odd: the statement was well built, but the error still occurred? I did not understand how this could be possible! I reverted the charset to UTF-8, which, to be honest, spoiled the euro symbol again, but at least kept my forum looking good.

- Btw, one thing I forgot to mention: the SQL error I described some posts above had one peculiarity: if I copied the query and pasted it into MySQL Workbench, it ran with no complaints. So I do not know why the error keeps appearing.

With all this mess.. I have to say that I am really confused by encodings, charsets, locales, and everything. :sad:


bfarber

I see that in my case the biggest issues are URLs pasted by users. The characters lock topics with a "Driver error". I gave some examples before. The newest occurrence involved a non-breaking space at the end of a URL.

[Attachments: sqlerror.txt, s799.png]

I run IPB 3.4.5 with the URLs patch. And to clarify: the problem in the cache_content column appeared recently. I did two things before the problem was first noticed: installing the URLs patch and reinstalling the crashed server (the differences are slightly newer versions: Mysql 5.0.8 + PHP Version 5.3.24).

Do you have any advice?


I see you had converted your database to UTF-8 and/or recently changed your locale, etc. Unfortunately changes like that can cause new issues to crop up, and while I understand it ran for a little while after making the changes before it started causing issues, *that* is outside the scope of our support I'm afraid. As is often the case, there is more to this story in the ticket than was available here on the forums.

@Luuuk - submit a ticket if you are getting a driver error when submitting a topic. Ultimately looking at the SQL error, however, it would seem that your database charset/collation does not support the characters being inserted.


@Luuuk - submit a ticket if you are getting a driver error when submitting a topic. Ultimately looking at the SQL error, however, it would seem that your database charset/collation does not support the characters being inserted.

Thanks for the answer. I created a ticket (Request ID 873179) and provided more detailed info about charsets/collations.

But I would like to point out:
- A few already existing topics previously opened without a "Driver error". The forum software did not complain about these already posted URLs before...
- The problem is that I can't control what kind of links users paste, from where, or how (there is always a chance of a strange character in a URL), but a whole topic or the ability to edit it can be locked due to a single character in a particular link ...


There are too many factors to answer that here. This is a support issue, however, not feedback about the software or company. :) Hence why I suggested opening a ticket. I can't tell you why you would suddenly start seeing errors if you haven't in the past (although I can say that virtually 10 out of 10 times this is due to a recent change someone has made).


  • 3 months later...

Well, let us shed some more light on this long-term unsolved issue. I figured out the following:

- The strtolower function uses the locale information.
- The Windows locale information comes not in UTF-8, but in 1252 encoding.
- Because of this, punctuated characters are spoiled when this function is used.

The question is, and I would like someone from IP to explain it:

- Why does this happen for the username column (members_l_username), for example, and not for post content? On my board, post content is stored and retrieved correctly regardless of the characters it contains. But usernames are not, and we get database errors each time a user with a punctuated username tries to register. What is the difference in implementation? I just cannot understand it: if both were handled the same way, overriding strtolower with mb_strtolower would work for both, but when we do it, it works for usernames but not for posts. Why this inconsistency?

We get the error 1366 as usual.

Is there a way I can manually fix this issue?


Archived

This topic is now archived and is closed to further replies.
