HACKER Q&A
📣 graderjs

Full text search engine in JavaScript for English and Chinese?


I'm interested in providing full text search for 22120, a product that has a web archive capability.

I've investigated a few options, such as those based on Solr, but my concern is that they do not handle multiple human languages with minimal configuration. Ideally I'd like something that has stemming and tokenization for multiple languages, including East Asian languages such as Chinese, Japanese, and Korean, and ideally South Asian languages like Hindi and Urdu as well.

Unfortunately, a great many search engines out there seem to have stemming and tokenization available only for European languages, such as the Romance (Latin-derived) or Germanic languages. People from everywhere like to archive the web content they browse, and that content is in many human languages, so to provide a good full text search I need something that can handle multiple human languages. I considered writing something myself, such as a simple Trie, but I think the rabbit hole of creating a good full text search is a very long and convoluted one, so I'd prefer to plug in something that already exists.
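
To make the do-it-yourself route concrete, here is a minimal sketch of a toy inverted index in JavaScript. It assumes a runtime that ships `Intl.Segmenter` (recent Node.js and browsers do) and leans on it for word segmentation; the names `tokenize` and `TinyIndex` are hypothetical and exist only for this illustration. It does no stemming, ranking, or fuzzy matching, and as far as I understand the CJK segmentation it gets is dictionary-based, so quality on casual text will vary.

```javascript
// Toy inverted index. Tokenization is delegated to Intl.Segmenter.
// `tokenize` and `TinyIndex` are hypothetical names, not from any library.

// Split text into lowercased word-like tokens for the given locale.
function tokenize(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const tokens = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Skip whitespace and punctuation segments.
    if (isWordLike) tokens.push(segment.toLowerCase());
  }
  return tokens;
}

class TinyIndex {
  constructor() {
    // token -> Set of document ids containing that token
    this.postings = new Map();
  }

  // Index one document under the given id.
  add(id, text, locale) {
    for (const token of tokenize(text, locale)) {
      if (!this.postings.has(token)) this.postings.set(token, new Set());
      this.postings.get(token).add(id);
    }
  }

  // Return ids of documents that contain every token of the query (AND).
  search(query, locale) {
    const tokens = tokenize(query, locale);
    if (tokens.length === 0) return [];
    let result = null;
    for (const token of tokens) {
      const ids = this.postings.get(token) ?? new Set();
      result =
        result === null
          ? new Set(ids)
          : new Set([...result].filter((id) => ids.has(id)));
    }
    return [...result];
  }
}
```

Usage would look like `index.add(1, "Hello world", "en")` followed by `index.search("world", "en")`. Even this sketch shows where the rabbit hole starts: everything hard (stemming, per-language segmentation quality, ranking) is either missing or pushed onto the runtime.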

I really love what FlexSearch is doing, especially how they use signals from context; I think that's the future. But I'm concerned about how basic their support for stemming and tokenization is. For example: https://github.com/nextapps-de/flexsearch/issues/207


  👤 marshallbananas Accepted Answer ✓
I spent a few years developing a Japanese-English dictionary that had searchable example sentences. Full text indexing for Japanese is a nightmare. I used MeCab, Kumon, and Kuromoji for morphological analysis and tokenization; you should check them out. I played a bit with Chinese and it was relatively easy (compared to Japanese). Korean, I suppose, is somewhere in between those two.

AFAIK there is nothing out there for East Asian languages that works as well as their romanized counterparts. They work pretty OK with textbook material: perfect grammar and easy kanji. They fall apart completely on casual human text/speech.

Do not attempt to solve this problem yourself! I'm guessing only the likes of Google and ML experts will be able to tackle this.