Native Chinese (non-Latin) search possible?


Huai

Posted

Hi,

I really like using IPB.

But it uses MySQL fulltext search, or manual search, which is very slow.
The only way to solve this is to use Sphinx.

Is it possible to improve it the way phpBB does?
phpBB has its own search method that uses PCRE and mbstring behind the scenes, and it works very well with non-Latin characters.

  • 2 weeks later...
Posted

I have no idea how phpBB works, but I can't fathom how any PHP libraries would help with database searching, unless it's to convert the content to some normalized form before sending it to the database. In any event, I don't think this is necessary at all - you just need your database configured correctly for the languages you use on your site.
Posted

IPB only uses MySQL's native search, which has really poor support for languages such as Korean, Japanese and Chinese.

I'm not sure how phpBB handles it; I think they convert the content, as you said.

The only way I know to solve the problem is to use Sphinx...
Posted

You'll have to excuse me if any of this is a little off - I'm by no means an expert on Asian languages, but have come across this problem while working on a Japanese site.



The problem with MySQL fulltext and Asian languages is that fulltext searches by word - and assumes that each word is separated by a space (by default).
This isn't the case in Chinese, Japanese and Korean.
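
To make that concrete, here is a rough Python sketch of what a space-delimited word splitter sees. It only mimics the idea; it is not MySQL's actual parser, and the function name and sample sentences are made up for the illustration.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace, roughly how a space-delimited word index sees text."""
    return text.split()


english = "full text search works by word"
chinese = "我喜欢使用论坛搜索功能"  # hypothetical sample post with no spaces

english_tokens = whitespace_tokens(english)
chinese_tokens = whitespace_tokens(chinese)

# The English sentence yields one token per word; the Chinese sentence is
# treated as a single long "word", so a search for a word inside it finds nothing.
print(len(english_tokens), len(chinese_tokens))
print("search" in english_tokens)
print("搜索" in chinese_tokens)
```

The whole Chinese sentence ends up as one token, so a word-level lookup for 搜索 ("search") has nothing to match against.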

Now, before I carry on, I should point out that normal (non-fulltext) search does work fine, and we do have an option to use that.
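
As a rough illustration of why the non-fulltext route copes with these languages, here is a small Python sketch. I'm assuming it boils down to a plain substring match such as SQL LIKE, and I'm using an in-memory SQLite database as a stand-in for MySQL; the table, column and function names are invented for the example and are not IP.Board's schema.

```python
import sqlite3

# Stand-in database: SQLite in memory, not MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (body) VALUES (?)",
    [("我喜欢使用论坛搜索功能",), ("This post is in English",)],
)


def substring_search(db: sqlite3.Connection, term: str) -> list[int]:
    """Return ids of posts whose body contains `term` anywhere.

    A substring match does not care about word boundaries, so it works for
    text without spaces. The leading wildcard means every row is scanned,
    which is why this approach gets slow on large boards.
    (LIKE wildcards in `term` are not escaped in this sketch.)
    """
    rows = db.execute(
        "SELECT id FROM posts WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    ).fetchall()
    return [row[0] for row in rows]


print(substring_search(conn, "搜索"))
print(substring_search(conn, "English"))
```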

The best thing you can do with MySQL is pre-process the text before it is stored, separating the words. I'm not sure if phpBB does this - last time I checked they just used non-fulltext search, but it sounds possible from what you're describing.
Once the words are separated by a delimiter, MySQL can use that as its word delimiter for fulltext searches.
Now, this isn't perfect; a few problems come into play:

  • You'd need to remove the delimiter when showing the content, so in general it adds a fair amount of overhead (and could cause struggles with very long posts)
  • You wouldn't be able to have a bilingual site if you were doing this
  • You'd need to configure MySQL to allow fulltext searches on 1 character, since just 1 character can represent a word (for MyISAM fulltext indexes that means lowering ft_min_word_len, which defaults to 4) - not only is this not ideal, many hosting providers will be unwilling to do this.


Now when you add up all those problems, I have to wonder if you're not just better off using non-fulltext search (I never did any extensive tests, but if I had to guess I'd say you'd be very close to losing any benefit fulltext gives you at all). Or using Sphinx - after all, if you're going to need to change your MySQL configuration anyway, you may as well install Sphinx.


As an alternative, if you're really set on making this work, there is a project called Senna which is like MySQL but with support for fulltext searching Asian languages. The above problems still apply, but it should in theory make fulltext searching work without any code-level changes to IP.Board.



All in all, I think it's a very invasive change to make for little benefit when 3 solutions (non-fulltext searching, Sphinx and Senna) are already available.
Posted

Yes, I know those 3 solutions.
But non-fulltext searching is really slow, and the other 2 have to be set up manually.
So I was wondering whether there are any better choices.

I'm not a coder, so I'm just asking questions here; sorry about that.
Thank you for the detailed explanation!

Archived

This topic is now archived and is closed to further replies.
