George MacKerron: code blog

GIS, software development, and other snippets

Extended and properly-multi-line Regular Expressions for JavaScript

Perl, Ruby, and some other languages support a readable ‘extended’ regular expression syntax, in which literal whitespace is ignored and comments (starting with #) are available. They also support a mode in which the . character matches anything, including a newline (Ruby calls this multi-line mode, /m; Perl calls it single-line mode, /s).

JavaScript does neither of these: it doesn’t recognise the extended syntax, and its version of multi-line only allows the ^ and $ characters to match the beginnings and ends of lines within a string (it will never allow the . to match a newline).
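To make that concrete, here is a small sketch of the behaviour described above. The sample string and variable names are made up for illustration, and it assumes a regex engine without the `s` (dotAll) flag that later versions of the language added.

```javascript
// Demonstrates the two JavaScript behaviours described above.
var text = 'first line\nsecond line';

// The dot does not match a newline, with or without the m flag.
var dotPlain = /first.*second/.test(text);    // false
var dotMulti = /first.*second/m.test(text);   // false

// The m flag only changes where ^ and $ are allowed to match.
var anchorPlain = /^second/.test(text);       // false
var anchorMulti = /^second/m.test(text);      // true

// The usual workaround: a character class that matches any character at all.
var anyChar = /first[\s\S]*second/.test(text);  // true
```

The `[\s\S]` class is the same substitution the conversion function makes for each unescaped dot.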

So I wrote the following function to convert extended and fully-multi-line RegExp source strings to the basic syntax that JavaScript understands.

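function convertRegExpSource(source, ext, multi) {  // string, boolean, boolean
  if (! ext && ! multi) return source;
  var convertedSource = '', len = source.length;
  var inCharClass = false, inComment = false, justBackslashed = false;
  for (var i = 0; i < len; i ++) {
    var c = source.charAt(i);
    // Inside a comment, drop everything up to the end of the line. This check
    // comes first so that a backslash at the end of a comment cannot swallow
    // the line break and run the comment on into the next line.
    if (inComment) {
      if (c == "\n" || c == "\r") inComment = false;
      continue;
    }
    // The character after a backslash is always copied verbatim.
    if (justBackslashed) {
      convertedSource += c;
      justBackslashed = false;
      continue;
    }
    if (c == '\\') {
      convertedSource += c;
      justBackslashed = true;
      continue;
    }
    // Inside [...], copy everything (whitespace, # and . included) up to the closing ].
    if (inCharClass) {
      convertedSource += c;
      if (c == ']') inCharClass = false;
      continue;
    }
    if (c == '[') {
      convertedSource += c;
      inCharClass = true;
      continue;
    }
    // An unescaped # starts a comment in extended mode.
    if (ext && c == '#') {
      inComment = true;
      continue;
    }
    // An unescaped . becomes a class that matches any character, newlines included.
    if (multi && c == '.') {
      convertedSource += '[\\s\\S]';
      continue;
    }
    // In extended mode, unescaped whitespace is dropped.
    if (! ext || ! /\s/.test(c)) convertedSource += c;
  }
  return convertedSource;
}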

Use it

The interactive demo has two boxes: an editable one for the extended RegExp source, which starts out holding the Daring Fireball monster URL regex, and one showing the basic RegExp source output.

Let me know if you find any bugs, or if you know a nicer way to do this (I know nothing about parsing, so there surely is one).
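One possibly nicer way, offered as a sketch under assumptions rather than a drop-in replacement: let a single regular expression do the tokenising, and decide what to do with each token in a replace callback. The names `TOKEN_EXT`, `TOKEN_BASIC` and `convertWithReplace` are invented for this sketch, as is the sample pattern. It is not guaranteed to agree with the loop above on every input; for one thing, it treats an unclosed `[` as an ordinary character.

```javascript
// Tokens: an escaped character, a whole character class, and a bare dot.
// The extended variant also matches a comment (to end of line) and a whitespace run.
var TOKEN_BASIC = /\\[\s\S]|\[(?:\\[\s\S]|[^\]\\])*\]|\./g;
var TOKEN_EXT = /\\[\s\S]|\[(?:\\[\s\S]|[^\]\\])*\]|#[^\n\r]*|\s+|\./g;

// Hypothetical alternative to a character-by-character loop.
function convertWithReplace(source, ext, multi) {
  if (!ext && !multi) return source;
  return source.replace(ext ? TOKEN_EXT : TOKEN_BASIC, function (tok) {
    var first = tok.charAt(0);
    if (first === '\\' || first === '[') return tok;     // escapes and classes pass through
    if (first === '.') return multi ? '[\\s\\S]' : tok;  // bare dot
    return '';  // comment or whitespace, only ever matched in extended mode
  });
}

// Usage: an illustrative extended pattern, written one piece per line.
var source = [
  '(\\d+)  # digits',
  '.+?     # anything, newlines included',
  '[ #.]   # inside a class, nothing changes'
].join('\n');

var converted = convertWithReplace(source, true, true);
// converted is the string (\d+)[\s\S]+?[ #.]
var matches = new RegExp(converted).test('12\nab#');  // true
```

Because anything not matched by the token regex is left alone, the callback only has to handle the handful of cases that need changing.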

I think I may suggest extended RegExp support as a possible CoffeeScript enhancement.


Written by George

August 8th, 2010 at 7:44 pm

Posted in JavaScript