Using JavaScript to split a text string into word tokens, taking account of punctuation, whitespace and the UTF-8 charset

I got an interesting problem today. I was supposed to check an HTML form before submission to see whether the text the user entered in a textarea contains certain words. Googling around, I found a lot of stuff like "how to split text separated by commas" and such, but I simply wanted to extract words from a paragraph like this one.

My instinct was to use the String.split() function, but with a plain string separator it only splits on that one separator, so I would have to write a recursive or iterative function to split on all non-word characters. Not being able to predict all the crap users can enter, this did not look like the right choice.
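For comparison, split() can also take a regular expression as the separator. Here is a minimal sketch of both variants; the sample text is made up for illustration. The regex variant gets close, but it can leave empty strings in the result when the text starts or ends with punctuation.

```javascript
// The sample text is made up for illustration.
var sample = 'Hello, world! This is -- a test.';

// A plain string separator only splits on that exact string.
var bySpace = sample.split(' ');
// bySpace is ['Hello,', 'world!', 'This', 'is', '--', 'a', 'test.']

// A regex separator splits on any run of non-word characters...
var byRegex = sample.split(/\W+/);
// byRegex is ['Hello', 'world', 'This', 'is', 'a', 'test', '']
// ...but leaves an empty string behind because the text ends with '.'.
```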

Luckily, I discovered String.match(), which takes a regex and, with the g flag, returns an array of all the words it finds, using something like this:

var arr = inputString.match(/\w+/g);
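To make the result concrete, here is a small sketch with a made-up sample string. One thing worth knowing is that match() returns null rather than an empty array when nothing matches, so the result should be checked before looping over it.

```javascript
// The sample text is made up for illustration.
var sample = 'Hello, world! This is -- a test.';
var words = sample.match(/\w+/g);
// words is ['Hello', 'world', 'This', 'is', 'a', 'test']

// With no word characters at all, match() returns null, not [].
var none = '... !!!'.match(/\w+/g);
// none is null
```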

Cool, eh? Now, this all went fine for ASCII English text. But I need to work with UTF-8 text, or more specifically, the Serbian language, and \w only matches ASCII letters, digits and the underscore, so a letter like Š is treated as a separator. The Serbian Latin script used by my users has only 5 letters that are not in the ASCII set, so I wrote a small function to replace those 5 with their closest ASCII matches. The final code looks like this:

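// Uppercase first so the comparison is case-insensitive, then transliterate.
var s = srb2lat(inputString.toUpperCase());
// match() returns null when there are no words, hence the "a &&" guard below.
var a = s.match(/\w+/g);
for (var i = 0; a && i < a.length; i++)
{
    if (a[i] === 'SPECIAL')
        alert('Special word found!');
}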

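// Replace the five non-ASCII Serbian Latin letters with ASCII look-alikes.
// Expects uppercase input; add lowercase keys if you skip toUpperCase().
function srb2lat(str)
{
    var len = str.length;
    var res = '';
    var rules = { 'Đ':'DJ', 'Ž':'Z', 'Ć':'C', 'Č':'C', 'Š':'S' };
    for (var i = 0; i < len; i++)
    {
        var ch = str.charAt(i);
        if (rules[ch])
            res += rules[ch];
        else
            res += ch;
    }
    return res;
}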
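To see why the transliteration step is needed, here is a short sketch of what \w does to Serbian text; the sample string is made up. The second pattern is not the approach used in this post: it relies on Unicode property escapes, which only exist in engines that support ES2018 or later, and is shown purely as an alternative.

```javascript
// The sample text is made up for illustration.
var text = 'Šuma, kuća';

// \w is ASCII-only, so Š and ć act as separators and words get cut up.
var broken = text.match(/\w+/g);
// broken is ['uma', 'ku', 'a']

// Alternative (ES2018+ only): match runs of Unicode letters directly.
var unicodeWords = text.match(/\p{L}+/gu);
// unicodeWords is ['Šuma', 'kuća']
```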

If you use some other language, just replace the rules object with different transliteration rules.
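As a rough sketch of what that could look like, here is a variant that takes the rules as a parameter. The function name transliterate and the German rules are my own illustration, not part of the code above. Note that toUpperCase() already turns ß into SS, so only the three umlauts need rules here.

```javascript
// Hypothetical generalisation of srb2lat(): the rules are passed in.
function transliterate(str, rules)
{
    var res = '';
    for (var i = 0; i < str.length; i++)
    {
        var ch = str.charAt(i);
        // hasOwnProperty ensures only keys defined in rules are replaced.
        if (Object.prototype.hasOwnProperty.call(rules, ch))
            res += rules[ch];
        else
            res += ch;
    }
    return res;
}

// Illustrative German rules, uppercase only since the input is uppercased first.
var germanRules = { 'Ä':'AE', 'Ö':'OE', 'Ü':'UE' };
var upper = 'Grüße aus München'.toUpperCase();
var germanWords = transliterate(upper, germanRules).match(/\w+/g);
// germanWords is ['GRUESSE', 'AUS', 'MUENCHEN']
```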

Milan Babuškov, 2011-12-01