Friday, March 26, 2004

Unicode vs. search engines

Someone came here looking for "צורי USB" on Google. צורי is "rock" from this post, and USB came from here. Probably not what this person was looking for.

I'm happy. I was worried Google didn't pick up the UTF-8 stuff. After all, if the search engine can't see the pages, how would I know which ones aren't getting indexed? The search engine issue was what pushed me over the edge to give up on the more portable HTML escapes. This is also why good spelling is important: only the people who spell consistently with everyone else are the folks who will be found.

Not that the escapes were hard. In Mozilla, all I had to do was make sure I was looking at the post submission form in a non-Unicode encoding, and Mozilla, knowing the characters can't be displayed in that encoding, automatically handled the conversion.
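For the curious, here's roughly what that conversion amounts to. This is only a sketch in Python, not what Mozilla actually runs internally: the function name `to_html_escapes` is made up for illustration, and I'm assuming plain ASCII as the "non-Unicode encoding". Anything the target encoding can't represent gets swapped for a numeric character reference.

```python
import html


def to_html_escapes(text: str, encoding: str = "ascii") -> str:
    """Replace characters the target encoding can't represent
    with numeric character references (&#NNNN;)."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode(encoding, errors="xmlcharrefreplace").decode(encoding)


original = "צורי USB"
escaped = to_html_escapes(original)
print(escaped)  # &#1510;&#1493;&#1512;&#1497; USB

# A browser (or html.unescape) turns the escapes back into the same text.
restored = html.unescape(escaped)
```

The escaped form survives any ASCII-safe pipeline, which is what made it the more portable option; the cost is that anything reading the raw bytes has to decode the references to see the Hebrew.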

What's neat is, with Unicode or the HTML entities, Lynx tries its best even if you tell it you don't have Unicode capabilities. Kohelet with the vowels left out (which is how it's written in the Bible) shows up as Q+H+L+T+. Japanese phonetic characters fare even better, though Lynx doesn't deal with Kanji. I'm guessing that's because getting the phonetic equivalents would computationally require about 70% of the effort of just translating the whole thing anyway.
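A table-driven fallback like that could be sketched as below. To be clear, this is a guess at the general idea, not Lynx's real code or its real table: the four entries in `ASCII_FALLBACK` are just the ones implied by the Q+H+L+T+ output above, and the names are invented for the example.

```python
# Hypothetical mapping: only the four letters seen in the output above.
ASCII_FALLBACK = {
    "\u05e7": "Q+",  # qof
    "\u05d4": "H+",  # he
    "\u05dc": "L+",  # lamed
    "\u05ea": "T+",  # tav
}


def ascii_fallback(text: str, table=ASCII_FALLBACK, unknown: str = "?") -> str:
    """Pass ASCII through; replace other characters via a lookup table."""
    return "".join(
        ch if ch.isascii() else table.get(ch, unknown)
        for ch in text
    )


kohelet = "\u05e7\u05d4\u05dc\u05ea"  # Kohelet, no vowel points
print(ascii_fallback(kohelet))  # Q+H+L+T+
```

A per-character lookup works for alphabets and syllabaries because each symbol has a fixed stand-in. Kanji readings depend on context, so a flat table like this one can't cover them.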
