The JavaScript RegEx API

Regular Expressions are about as elegant as a pig on a bicycle. Using a regular expression feels like resorting to machine code when all those patterns we’re taught to love just aren’t up to the job. Which, I suppose, is also a reason to like them. They have a brute force directness, free from pattern politics and endless analysis.

And they work. Eventually.

If the JavaScript Regular Expressions API makes your head spin then this might be for you. I’ll document the basics and demonstrate how you might use them to full effect.

For the sake of brevity (not to mention my own lack of regex proficiency) I won’t discuss the syntax of the expressions themselves. Suffice it to say, JavaScript RegEx syntax is Perl-based. There are many excellent online resources for this, as well as some nice online RegEx testers.

The RegExp object

RegExp is a global object which serves three purposes:-

1) It’s a constructor function for creating new instances of Regular Expressions…

It accepts the expression and (optionally) the flags as arguments. As with strings, you can drop the constructor syntax and just specify the literal on its own. RegEx literals are delimited by the / symbol instead of quotes.

var a = new RegExp("\\b[\\w]{4}\\b","g"); //match all four letter words

//same as...
a = /\b\w{4}\b/g;
a.constructor //RegExp()
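The constructor form earns its keep when part of the pattern is only known at runtime. Here is a minimal sketch of that idea; wordsOfLength is a hypothetical helper invented for this illustration, not part of any API. Note that backslashes must be doubled inside the string form.

```javascript
// The same pattern built two ways: from a string and as a literal.
var fromString = new RegExp("\\b\\w{4}\\b", "g");
var fromLiteral = /\b\w{4}\b/g;
var sameSource = fromString.source === fromLiteral.source; // true

// Hypothetical helper: the word length is only known at call time,
// so the pattern has to be assembled from a string.
function wordsOfLength(n) {
    return new RegExp("\\b\\w{" + n + "}\\b", "g");
}

var sentence = "some words have four letters";
var fourLetter = sentence.match(wordsOfLength(4)); // ["some", "have", "four"]
var fiveLetter = sentence.match(wordsOfLength(5)); // ["words"]
```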

2) It aggregates a set of global (static) properties reflecting the most recent regex match…

(Edit: all of these properties were omitted from ECMA 3 but still work in the latest versions of all the major browsers)

leftContext, the text to the left of the most recent match
rightContext, text to the right of the most recent match
lastMatch, the most recently matched text
lastParen, the text matched by the last parenthesized subexpression
$n, the text matched by the nth parenthesized group (up to n==9)

"(penalty)Lampard, Frank(1-0)".match(/\b([\w]+),\s?([\w]+)/g);

RegExp.leftContext //"(penalty)"
RegExp.rightContext //"(1-0)"
RegExp.$1 //"Lampard"
RegExp.$2 //"Frank"

…and a variable that will be applied to the next regex match…

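input, the string that exec and test fall back on when no argument is passed. (This fallback is historical. Current engines coerce a missing argument to the string "undefined" instead, so pass the string explicitly in new code.)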

var a = /\b[a-z]{10,}\b/i; //match long alpha-only word

RegExp.input=document.body.innerHTML;

a.test(); //true (on google.com)
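Since the static properties are non-standard, the same information can usually be recovered from the array that exec returns, which carries an index property alongside the matched groups. A rough sketch using the earlier Lampard example (the variable names left and right are my own):

```javascript
var text = "(penalty)Lampard, Frank(1-0)";
var m = /\b(\w+),\s?(\w+)/.exec(text);

// m.index is where the match starts; m[0] is the matched text
var left = text.slice(0, m.index);             // like RegExp.leftContext: "(penalty)"
var right = text.slice(m.index + m[0].length); // like RegExp.rightContext: "(1-0)"
var surname = m[1];                            // like RegExp.$1: "Lampard"
var forename = m[2];                           // like RegExp.$2: "Frank"
```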

3) Each instance stores additional properties…

source, the full source text of the regular expression
global, whether to search for all matches (the expression’s g attribute is present)
multiline, whether ^ and $ match at line breaks as well as at the start and end of the string (equivalent to the m attribute)
ignoreCase, whether the search ignores case (the expression’s i attribute is present)
lastIndex, the index at which to begin the next search

(lastIndex is writeable, the other four properties are not)
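A quick sketch of reading these instance properties and rewinding lastIndex by hand. The pattern and sample text are invented for illustration.

```javascript
var re = /ab+c/gi;

var props = {
    source: re.source,         // "ab+c"
    global: re.global,         // true
    ignoreCase: re.ignoreCase, // true
    multiline: re.multiline    // false
};

var text = "xxABBC abc";
re.exec(text);                 // matches "ABBC" at index 2
var afterFirst = re.lastIndex; // 6, just past the match

re.lastIndex = 0;              // writeable: rewind the search
var again = re.exec(text)[0];  // "ABBC" once more
```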

The RegExp prototype also defines 3 methods:-

test

Was the match successful? (see example above)
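A minimal illustration of test with the string passed explicitly, reusing the long-word pattern from earlier. The sample sentences are made up.

```javascript
var longWord = /\b[a-z]{10,}\b/i; // a word of ten or more letters

var none = longWord.test("a short sentence");             // false
var some = longWord.test("an extraordinarily long word"); // true
```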

exec

When a match is found it returns an array of results where element 0 is the matched text and elements 1 to n represent the matched groups in sequence (equivalent to the RegExp.$n values). If the expression includes the global(g) attribute, the lastIndex property is updated after each call so that repeated calls to exec will loop through each match in the string.

Here’s a method to return the first n cards from the “pack”, such that their total value does not exceed 21. Notice we define an optional group 2 to match the numeric value of cards with non-numeric names (e.g. King).

var expr = /\b([^@\(]+)\(?(\d*)\)?@([^\s]+)\s?/g;
var theString = '3@Clubs King(10)@Hearts 3@Spades 5@Diamonds 7@Clubs 2@Hearts 9@Spades Jack(10)@Clubs 4@Diamonds 9@Hearts';
var result = [], total = 0, matching;

//each call to exec resumes from lastIndex and returns null when the cards run out
while ((matching = expr.exec(theString))) {
    //group 2 holds the value of a named card (e.g. King), otherwise group 1 is the value
    var value = parseInt(matching[2] || matching[1], 10);
    if ((total += value) > 21) {
        break;
    }
    //matching[n] carries the same text as RegExp.$n
    result.push(matching[1] + " of " + matching[3]);
}

result; //["3 of Clubs", "King of Hearts", "3 of Spades", "5 of Diamonds"]
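To watch lastIndex advance between calls, here is a smaller sketch of the same exec loop. The sample text and the found array are my own.

```javascript
var re = /\d+/g;
var text = "10 apples, 200 pears";
var found = [];
var m;

// With the g flag, each exec call resumes from re.lastIndex
while ((m = re.exec(text)) !== null) {
    found.push({ text: m[0], index: m.index, next: re.lastIndex });
}

// found: [{ text: "10", index: 0, next: 2 }, { text: "200", index: 11, next: 14 }]
var finalIndex = re.lastIndex; // 0, reset by the failed final call
```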

compile

Replace the pattern of this RegExp instance in place. If you’re neurotic about the overhead of creating a new RegExp instance every time then this is for you, though be aware that compile is now deprecated. Enough said.

The String methods

The three string methods below accept regular expressions as arguments. They differ from the RegExp methods in that they ignore the RegExp’s lastIndex property (more accurately they set it to zero) and if the pattern is global they return all matches in one pass, rather than one match for each call. RegExp static properties (e.g. RegExp.$1) are set with each call.
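A short sketch of that difference for a global pattern: exec honours a lastIndex set by hand, while match starts from the beginning and leaves lastIndex at zero. The sample text is invented, and only match is shown here.

```javascript
var re = /\d+/g;
var text = "1 22 333";

re.lastIndex = 5;
var viaExec = re.exec(text)[0]; // "333", the search began at index 5

re.lastIndex = 5;
var viaMatch = text.match(re);  // ["1", "22", "333"], lastIndex was ignored
var afterMatch = re.lastIndex;  // 0
```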

match

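Returns an array of pattern matches in a string, or null if there are none. With a global pattern the array holds every match; without one it holds the first match followed by its captured groups, just like exec.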

var a = /(-[\d*\.\d*]{2,})|(-\d+)/g //all negative numbers

"74 -5.6 9 -.5 -2 49".match(a); //["-5.6", "-.5", "-2"]
RegExp.$2; //"-2"
RegExp.leftContext; //"74 -5.6 9 -.5 "

var queryExpr = /\?/;
var getQueryString = function(url) {
    url.match(queryExpr);
    return RegExp.rightContext; //everything after the first "?"
};
getQueryString("http://www.wunderground.com/cgi-bin/findweather/getForecast?query=94101&hourly=1&yday=138&weekday=Wednesday");
//"query=94101&hourly=1&yday=138&weekday=Wednesday"
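For comparison, here is a sketch of the same helper that avoids the non-standard static properties by capturing the text after the first ? in a group. Returning an empty string when there is no query is my own assumption about the desired behaviour, and the function name is made up.

```javascript
// Hypothetical variant: returns the text after the first "?" or "" if none.
function getQueryStringFromGroup(url) {
    var m = url.match(/\?(.*)$/);
    return m ? m[1] : "";
}

var query = getQueryStringFromGroup("http://www.wunderground.com/cgi-bin/findweather/getForecast?query=94101&hourly=1&yday=138&weekday=Wednesday");
// "query=94101&hourly=1&yday=138&weekday=Wednesday"
var empty = getQueryStringFromGroup("http://www.wunderground.com/"); // ""
```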

split

Converts a string to an array according to the supplied delimiter. Optionally takes a regular expression as the delimiter.

var names = "Smith%20O'Shea%20Cameron%44Brown".split(/[^a-z\']+/gi); //names = ["Smith", "O'Shea", "Cameron", "Brown"];
RegExp.lastMatch; //"%44"

Nick Fitzgerald points out that IE is out on a limb when it comes to splitting on grouped expressions

var time = "Two o'clock PM".split(/(o'clock)/);
//time = ['Two ', ' PM'] (IE)
//time = ['Two ', "o'clock", ' PM'] (FF, webkit)
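A sketch of how the grouping changes the result in engines that follow the spec, and one way to sidestep the difference by not capturing at all. The variable names are my own.

```javascript
var phrase = "Two o'clock PM";

// A capturing group splices the delimiter into the result
var withGroup = phrase.split(/(o'clock)/);      // ["Two ", "o'clock", " PM"]

// A non-capturing group (or no group) leaves the delimiter out
var withoutGroup = phrase.split(/(?:o'clock)/); // ["Two ", " PM"]

// Swallowing the surrounding whitespace in the delimiter tidies the parts
var trimmed = phrase.split(/\s*o'clock\s*/);    // ["Two", "PM"]
```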

replace

Replaces argument 1 with argument 2. Argument 1 can be a regular expression and if it’s a global pattern, all matches will be replaced.

Additionally, replace comes with two little-used but very nice features.

First, you can use $1…$n in the second argument (representing 1…n matched groups)

var a = "Smith, Bob; Raman, Ravi; Jones, Mary";
a.replace(/([\w]+), ([\w]+)/g,"$2 $1"); //"Bob Smith; Ravi Raman; Mary Jones"

a = "California, San Francisco, O'Rourke, Gerry";
a.replace(/([\w'\s]+), ([\w'\s]+), ([\w'\s]+), ([\w'\s]+)/,"$4 $3 lives in $2, $1"); //"Gerry O'Rourke lives in San Francisco, California"

Second, you can also use a function as the second argument. This function will get passed the entire match followed by each matched group ($1…$n) as arguments.

var chars = "72 101 108 108 111  87 111 114 108 100 33";
chars.replace(/(\d+)(\s?)/gi,function(all,$1){return String.fromCharCode($1)}); //"Hello World!"
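Another small sketch of the function form, this time turning hyphenated names into camel case. toCamelCase is a hypothetical helper written for this illustration.

```javascript
// Hypothetical helper: "border-top" becomes "borderTop"
function toCamelCase(name) {
    // all is the whole match ("-t"), letter is group 1 ("t")
    return name.replace(/-([a-z])/g, function (all, letter) {
        return letter.toUpperCase();
    });
}

var camel = toCamelCase("border-top-left-radius"); // "borderTopLeftRadius"
```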

10 thoughts on “The JavaScript RegEx API”

  1. I particularly like “someStr.replace(someRegex, function (originalStr, match) { /* … */ })”!

    Originally, Tempest was pretty similar to this minimal templating engine:

    function simpleTemplater(template, data) {
        return template.replace(/\{\{[ ]*(.*)[ ]*\}\}/gi, function (originalStr, match) {
            with (data) {
               return eval("(" + match + ")");
            }
        });
    }
    
    // Should return "Hello, Javascript!" but I haven't run this code :)
    simpleTemplater("Hello, {{ name }}!", { name: "Javascript" });
    
    • I like this a lot. You could also do something similar using partial. The fully loaded version of replace() is a gift to us functional programmers. I have no idea how many people use it or even know about it.

      BTW props for having the courage to use with (not kosher in ECMA 5!). I feel a blog coming along about how certain gurus are trying to force javascript into a java-style rat hole by wrapping it in cotton wool ;-)

      • Yeah, it would be pretty easy to add some basic currying to the templater function so if you don’t pass in data, it returns a function that takes the data.

        Generally, I don’t ever use “with” because of the many little quirks and it slows down variable lookup times. I guess I could rewrite the mini templater like this:

        function revisedSimpleTemplater(template, data) {
            return template.replace(/\{\{[ ]*(.*)[ ]*\}\}/gi, function (originalStr, match) {
                // No "with" by prepending "data." to template vars.
                // Probably makes eval safer, too.
                return eval("( data." + match + " )");
            });
        }
        

        This revised version is probably much better than the first version :)

    • wavded – thanks! (I never used static props anyway but thought I should mention them – that will teach me to rely on my dog eared copy of The Definitive Guide as a reference!). These props were changed in ECMA 3 – my bad.

      So the deprecated props still work in all browsers. But of course the static properties that were moved to instance (e.g. global and multiline) no longer work statically.

      I’m going to update the post

  2. Pingback: Templater for Joomla!1.5 | www.avehot.com

  3. You mentioned that “$n, the text matched by the nth parenthezised groups (up to n==9)” is available in all major browsers, but my impression was that it’s not available in IE. Any idea if I’m misinformed…?
