
Danny's Weblog

2009 Apr 13 [ Mon ]

Using Google for Thai and Khmer words

In a photo of one of the demonstrations against the current Thai government, I happened to notice a small sign that I could just about read (as opposed to the large signs, where I have no idea what half the characters even are). It seemed to say just "Trat", like the town/province in southeastern Thailand. www.telegraph.co.uk [http://www.telegraph.co.uk/telegraph/multimedia/archive/01382/protests_1382426c.jpg]

The London Telegraph captions the pic "Red Shirt anti government demonstrators face off the army lines outside the Pattaya Exhibition and Convention Hall Photo: EPA"

It didn't seem to make any sense that someone was holding up a sign with just the name of a town, so I vaguely wondered if some hapless taxi driver was trying to meet a client in that mob! Idly I tried to find it in Google, and couldn't. I figured I must be misreading the last letter, so I tried various possibilities, but still nothing seemed quite right. I gave up.

Later, it occurred to me to try to find Trat itself, so I found it. It looked just like my first guess at what was on the sign. I plugged it into Google, and indeed Google again could not find it. Actually, Google had found it, but for some reason only two out of the first ten hits actually found Trat. The rest found "rat" (pour, spill, top).

I think this sort of behavior is liable to be very common when we ask Google to search for words in Thai text, because neither the webpages *nor the text we're looking for* has whitespace at word boundaries. ...Hmm, at least I don't *think* they do. I wonder if Thai internet users are actually inserting some sort of nonprinting space at word boundaries in search terms? Anyway, I was looking for a *single* word.
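To make the word-boundary problem concrete, here is a minimal sketch in Python. It is only an analogy: English with the spaces stripped out stands in for unsegmented Thai, and the helper name `naive_hits` and the sample string are made up for illustration. It says nothing about how Google actually matches text.

```python
def naive_hits(query, text):
    """Return every offset where query occurs in text, ignoring word boundaries."""
    hits = []
    start = text.find(query)
    while start != -1:
        hits.append(start)
        start = text.find(query, start + 1)
    return hits


# English with the spaces removed stands in for unsegmented text.
page = "thecratefellonthepirateship"

# Both hits fall inside longer words ("crate", "pirate"); neither is the word "rat".
print(naive_hits("rat", page))
```

Without spaces to mark where words start and stop, a plain substring search has no way to prefer a whole-word hit over a hit buried inside some other word.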

If you want to try it, here it is in Unicode: ฅราด

...Wow, my editor points out it needs 12 bytes in UTF-8 (three per character) to express just four Thai characters.
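The byte count is easy to check. A small Python sketch, assuming the page is stored as UTF-8 (the variable name `word` is just for illustration):

```python
word = "ฅราด"  # the four characters given above

print(len(word))                      # number of code points
print(len(word.encode("utf-8")))      # bytes when stored as UTF-8
print(len(word.encode("utf-16-le")))  # bytes when stored as UTF-16, no byte-order mark
```

Every character in the Thai block takes three bytes in UTF-8 and two in UTF-16, so the total depends on the encoding, not on Unicode as such.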

Cambodian (Khmer) suffers the same problem. Here is a simple word in Khmer which you can search for: តង

Google returns a bunch of hits which mostly refer to sentences which contain those two characters at the end of one word and the beginning of the next. Google also has an actual flub on one of the hits: a PDF file – whose sample text for some reason is displayed with all the jerng characters displayed on the normal instead of lower level – shows those two wanted characters followed by a bantaq diacritic, which by Khmer/Unicode spelling rules *must* belong to the preceding character... Hmm. I suppose that's fair enough. They're ignoring diacritics for the search, which matches their behavior for French, German etc.
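For the curious, here is a rough Python sketch of one way a search could ignore diacritics before comparing text. This is my guess at the general idea, not Google's actual method; `strip_marks` is a hypothetical helper, and U+17CB is the Unicode code point for the Khmer bantoc sign.

```python
import unicodedata


def strip_marks(text):
    """Decompose the text, then drop every nonspacing mark (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


query = "\u178f\u1784"              # the two Khmer consonants from above
with_bantoc = "\u178f\u1784\u17cb"  # the same two, followed by the bantoc sign

print(with_bantoc == query)               # exact comparison
print(strip_marks(with_bantoc) == query)  # comparison after dropping marks
print(strip_marks("café"))                # the same trick on French
```

This is a crude approach for Khmer: several Khmer vowel signs are also nonspacing marks, so they would be dropped along with the bantoc.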


Copyright © 2003-2010 Alternate Worlds Publishing, Boston MA USA
