
Get rid of special characters from FURLs


Axel Wers


I have suggested making character replacements customizable by language and admin preference: either as ACP settings where replacements can be added manually, character by character, or at least as a data hook that allows hooking into FURL generation.

It takes time to figure out whether this feature is globally desired. We are a minority here with non-English boards ;) English boards do not suffer from special characters.


Mark, it is not about support, it is about how browsers put these URLs on the clipboard. They encode them, producing long, messy, spammy-looking URLs like these Wikipedia URLs:

http://de.wikipedia.org/wiki/Universit%C3%A4t_Z%C3%BCrich
http://ru.wikipedia.org/wiki/%D0%A6%D1%8E%D1%80%D0%B8%D1%85%D1%81%D0%BA%D0%B8%D0%B9_%D1%83%D0%BD%D0%B8%D0%B2%D0%B5%D1%80%D1%81%D0%B8%D1%82%D0%B5%D1%82
http://el.wikipedia.org/wiki/%CE%A0%CE%B1%CE%BD%CE%B5%CF%80%CE%B9%CF%83%CF%84%CE%AE%CE%BC%CE%B9%CE%BF_%CF%84%CE%B7%CF%82_%CE%96%CF%85%CF%81%CE%AF%CF%87%CE%B7%CF%82
http://he.wikipedia.org/wiki/%D7%90%D7%95%D7%A0%D7%99%D7%91%D7%A8%D7%A1%D7%99%D7%98%D7%AA_%D7%A6%D7%99%D7%A8%D7%99%D7%9A

They might work correctly, but if somebody copies a link to my board from the browser address bar and pastes it somewhere else, I do not want these beautiful codes in my backlinks :D That is the reason why I substitute any non-ASCII characters in URLs. It is just netiquette and the common way to do it on non-English boards.
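As a minimal sketch of what happens to such a URL when it is copied, the following Python snippet percent-encodes the German Wikipedia title from the list above. It only illustrates the UTF-8 percent-encoding step using the standard library's `urllib.parse.quote`; the helper name `percent_encode_title` is made up for this example and is not part of IP.Board or any browser.

```python
from urllib.parse import quote


def percent_encode_title(title):
    """Encode a page title the way it appears in a copied URL.

    quote() converts each non-ASCII character to its UTF-8 bytes and
    writes every byte as %XX; unreserved characters such as letters,
    digits and "_" are left untouched.
    """
    return quote(title)


encoded = percent_encode_title("Universität_Zürich")
print("http://de.wikipedia.org/wiki/" + encoded)
# http://de.wikipedia.org/wiki/Universit%C3%A4t_Z%C3%BCrich
```

Each accented letter becomes two escaped bytes here, which is why URLs in Cyrillic, Greek or Hebrew grow so much longer than the titles they represent.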


Do you get the same problem copying links from, say, Wikipedia: http://en.wikipedia.org/wiki/Śūnyatā ?

Yes. Any link with special characters in it, not only wikipedia.


Is your community using the UTF-8 character set?

Yes. However, this is not about my specific community but about browsers. If you use a non-English OS and copy the URL from the address bar, you'll see that the URL on your clipboard is percent-encoded. This is nothing that you can change in IPS. It is just the current standard, RFC 3986:


The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values.

And it is not only browsers that percent-encode UTF-8 URLs. If I used special characters in URLs, my server logs would look like this:

GET /dir1/%D0%A6%D1%8E%D1%80%D0%B8%D1%85%D1%81%D0%BA%D0%B8%D0%B9_%D1%83%D0%BD%D0%B8%D0%B2%D0%B5%D1%80%D1%81%D0%B8%D1%82%D0%B5%D1%82.html HTTP/1.1
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; 
  en-US; rv:1.5a) Gecko/20030728 Mozilla Firebird/0.6.1
…

That is just impossible to analyze visually without an additional tool. If I want to block specific URLs in robots.txt or add a new rule to mod_rewrite, I have to use the same long string. In other words, these URLs are simply not human-readable.
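The "additional tool" can be quite small. As a rough sketch, assuming the log lines are plain text with UTF-8 percent-encoded paths, Python's standard `urllib.parse.unquote` turns the request line above back into readable text. The function name `readable_request_line` is hypothetical and chosen only for this example.

```python
from urllib.parse import unquote


def readable_request_line(line):
    """Decode %XX escapes in a log line, interpreting the bytes as UTF-8."""
    return unquote(line)


log_line = (
    "GET /dir1/%D0%A6%D1%8E%D1%80%D0%B8%D1%85%D1%81%D0%BA%D0%B8%D0%B9_"
    "%D1%83%D0%BD%D0%B8%D0%B2%D0%B5%D1%80%D1%81%D0%B8%D1%82%D0%B5%D1%82.html"
    " HTTP/1.1"
)
print(readable_request_line(log_line))
# GET /dir1/Цюрихский_университет.html HTTP/1.1
```

This helps when reading logs, but rules in robots.txt or mod_rewrite would still need the encoded form, which is the point made above.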

Another problem is netiquette. Do you have special characters like ü, ö, ä, ß on your keyboard? Do you have Cyrillic letters on your keyboard? Chinese, Arabic? Could you type (not copy and paste, but type!) the URLs below if I told you the address by phone?

http://www.mysite.com/grüße
http://www.mysite.com/привет

You cannot type them yourself. You can only copy and paste those URLs.

That's why I transliterate all URLs to ASCII only on my projects. ;)




What problems do you mean? URLs can support UTF-8 characters... even Emoji works if you're on an OS that supports them.

I tested my board (3.4.3) on Android. The mobile browser couldn't handle FURLs with special characters (I got the error message "too many redirects"). But I think the new version 3.4.5 fixed this (tested only on a test board).

Worse is what I see in Google Webmaster Tools.

Links with special characters appear like this:

2vud7cl.jpg

It's almost impossible to analyze. What the hell kind of link is that?

I see the same in the server error logs.


I have the same issue with Viglink in their stats. They cannot decode links back to their UTF-8 representation. It looks like Google Webmaster Tools. I reported it to Viglink support in November 2012 and there is still no progress. I know that IPS does not have anything to do with Viglink stats. This is just an example of how others try to handle such URLs and fail.


There are 4 main reasons why we do what we do:

  • Transliteration is difficult to do universally. Although, for example, it's quite common for ß in German to be transliterated to ss, for some characters in some languages it's not appropriate to transliterate, and in many cases the same character would be translated to different values depending on the language. For example, we've had bug reports saying æ should be "ae", so we changed it, then another bug report said it should be "e", so we changed it again, then we had a third bug report saying it should be "ae" again.
  • Languages which don't use Latin characters at all (any east-Asian language) simply cannot be transliterated. So you end up with all your URLs being "/topic/123-/" with no friendly URL element.
  • It is the proper way to do things. As you yourself point out, the RFC which defines the standard for URIs says that characters may be UTF-8 encoded - it does not suggest transliteration.
  • It's good for SEO (otherwise you sort of defeat the point of friendly URLs).
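The "/topic/123-/" outcome from the second point can be sketched as follows. This is only an illustration of a naive ASCII-only slug routine, not IP.Board's actual code; the function names and the URL pattern are assumptions based on the example in the list.

```python
import re
import unicodedata


def ascii_only_slug(title):
    """Build a slug by dropping everything that has no ASCII equivalent."""
    # NFKD splits accented Latin letters into base letter + combining mark,
    # so "ü" keeps its "u" while the mark is discarded by the ASCII encode.
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def topic_url(topic_id, title):
    """Hypothetical FURL pattern: /topic/<id>-<slug>/"""
    return f"/topic/{topic_id}-{ascii_only_slug(title)}/"


print(topic_url(123, "Universität Zürich"))  # /topic/123-universitat-zurich/
print(topic_url(123, "チューリッヒ大学"))      # /topic/123-/
```

A Latin-script title survives in simplified form, while a title written entirely in a non-Latin script leaves nothing behind unless a real transliteration table is applied first.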

Mark, sorry but what you wrote is entirely false.

Transliteration is difficult to do universally.

No. It is simple: just a two-dimensional array, as I have suggested here. I have used it for years without any issues or impact on any IPS functions.


Although, for example, it's quite common for ß in German to be transliterated to ss, for some characters in some languages it's not appropriate to transliterate, and in many cases the same character would be translated to different values depending on the language.

This is the reason why I have asked for a data hook, so that everybody can do the substitution individually in a hook. In other software it is just a part of localization, where such arrays differ per language.
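The per-language replacement arrays described above could look roughly like this. It is a sketch in Python rather than the PHP patch mentioned in the thread, and the table contents, the language codes and the name `furl_slug` are all illustrative assumptions; the Russian table contains only the letters needed for the example.

```python
import re

# One replacement table per language, as a localization file might ship it.
# These entries are illustrative, not a complete or authoritative mapping.
TRANSLIT_TABLES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "ru": {"п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t"},
}


def furl_slug(title, lang):
    """Transliterate a title with the table for `lang`, then make it URL-safe."""
    table = TRANSLIT_TABLES.get(lang, {})  # unknown language: no replacements
    text = "".join(table.get(ch, ch) for ch in title.lower())
    # Anything still outside a-z and 0-9 collapses into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


print(furl_slug("Grüße", "de"))   # gruesse
print(furl_slug("привет", "ru"))  # privet
```

Because the table is chosen per language, a board admin could map the same character differently for different languages, which is the disagreement described in the quoted post.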

Languages which don't use latin characters at all (any east-Asian language) simply cannot be transliterated.

False. There are ISO standards for transliteration for every language http://en.wikipedia.org/wiki/List_of_ISO_transliterations

It is the proper way to do things.

It is an IPS way to do it but is not proper in the global world where English is not the only language people speak ;)


  • 2 weeks later...

There is another problem, big problem.

I wondered why traffic on my board has been steadily decreasing lately.
It seems Google doesn't like FURLs with special characters; those FURLs are excluded from the main results.

I have about 130 categories on my board. Before the upgrade to 3.4.3 ALL of them were in the results, but AFTER the upgrade only about 30 are. The others are excluded in this way:

zilnrp.jpg

It is similar with topics. I have about 3200 topics on my board, but only about 30 in the search results. All topics with special characters disappeared from the SERP.

I am still running version 3.4.3. Was anything improved regarding FURLs with special characters in 3.4.5? This is important for me, because your experiment with special characters ruined my board. So get rid of those damned special characters in FURLs.


I opened a ticket about this a long time ago: #851800

and got this reply from support:

Unfortunately this isn't possible without changing your site's charset to be something other than UTF-8, and that isn't something which I would advise.

Romanization features included within IP.Board are only supported for non-UTF-8 sites, as you are already aware.

I'm sorry but it doesn't look like there's anything we can do to 'fix' this as the software is Working As Intended.

Please let me know if you have any further questions.

Thank you,


Axel, this *could* be related to the canonical URL. Your canonical URL contains special characters; if Google does not like special characters, perhaps it cannot read the canonical properly and cannot match it? Just a guess... I cannot reproduce the issue on my boards because I use a patch to transliterate special characters in URLs.


How?

Link to the solution above > This way I can transliterate and replace any character I do not like. It does not have any impact on Latin URLs; it only deals with special characters in addition. It works for every URL generated in IPS, for all IPS and custom apps that use furlTemplates.php, except for tags (tags are handled differently in IPS: they are saved raw and not translated in the database, which is why it would not work for tags).


The problem is different. Basically there is no problem with indexing FURLs; they are indexed very well. This is an example from my board:

2zyhok2.jpg

The problem is that Google won't show FURLs with special characters in the SERP.

When I search Google for site:forum.freespace.sk/tema/ I get only 3 pages of results, which is about 30 topics. The remaining topics are hidden as very similar.

So this is not working as intended; this issue is ruining my board in the Google SERP.

2z68jya.jpg


Nope. There is no alert about this.

And I know two other boards with exactly the same issue.

And THIS board has the same issue.

Try searching Google for site:community.invisionpower.com/topic/

But I *think* this is a problem with Google search rather than with IP.Board.


Link to the solution above > This way I can transliterate and replace any character I do not like. It does not have any impact on Latin URLs; it only deals with special characters in addition. It works for every URL generated in IPS, for all IPS and custom apps that use furlTemplates.php, except for tags (tags are handled differently in IPS: they are saved raw and not translated in the database, which is why it would not work for tags).

Sonya,

Would you please write step-by-step instructions for that modification, so that I can follow them and get my board ready too?
Thanks


Archived

This topic is now archived and is closed to further replies.
