Inclusion of Unicode character set constants in RegExp (\p{L} ... )

Hans Schmucker hansschmucker at gmail.com
Thu Sep 4 06:53:26 PDT 2008


Apologies if this isn't the right place or the right format to suggest
a modification to ES. I'll try to be as specific as possible to
minimize problems if and when this is moved into the spec.

So far, the regular expression functions in ES are based on the
assumption that the only language worth matching is English (\w) and
that the limited number of special cases (e.g. äöü...) can easily be
handled by combining the predefined class with an explicit declaration
of the allowed Unicode ranges (e.g. \u1FE0-\u1FEC ...). While this is
mostly true for Latin-based languages like German or French, it cannot
be applied to non-Latin-based languages like Arabic or Japanese.
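
To illustrate the limitation described above, here is a minimal sketch.
The variable names and sample names are mine and purely illustrative;
the umlaut list is an example of the "explicit declaration" approach,
not a complete set for any language.

```javascript
// \w only covers [A-Za-z0-9_], so it rejects anything beyond ASCII.
var asciiOnly = /^\w+$/;

// Workaround for German: \w plus explicitly listed ä ö ü Ä Ö Ü ß.
var withUmlauts = /^[\w\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF]+$/;

var okAscii = asciiOnly.test("Hans");          // plain ASCII name
var failsUmlaut = asciiOnly.test("Jürgen");    // contains ü
var okUmlaut = withUmlauts.test("Jürgen");     // ü is listed explicitly
var failsJapanese = withUmlauts.test("山田");  // no range covers kanji
```

The workaround scales to German, but every further script needs its own
hand-written ranges.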

This shortcoming causes major issues if international names are to be
allowed in ES applications, as they cannot reliably be checked using
the built-in RegExp character classes. Instead, all ranges have to be
specified explicitly, which translates to a RegExp of around 4000
characters (using ranges like \u1FE0-\u1FEC; see
http://pastebin.mozilla.org/530081 for the full definition of a
regular expression matching the characters in \p{Ll}, \p{Lu}, \p{Lt},
\p{Lm}, \p{Lo} and \p{Nl}). This is highly impractical, as it bloats
the code and dramatically increases compile time.
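
As a rough sketch of how such a range-based expression is put together:
the table below is a small, hand-picked excerpt that I chose for
illustration (it is NOT the pastebin definition and is far from
complete), and the helper names are hypothetical.

```javascript
// Illustrative excerpt of letter ranges; the real table is much longer.
var LETTER_RANGES = [
  [0x0041, 0x005A], // A-Z
  [0x0061, 0x007A], // a-z
  [0x00C0, 0x00D6], // Latin-1 letters before the multiplication sign
  [0x00D8, 0x00F6], // Latin-1 letters before the division sign
  [0x00F8, 0x00FF], // remaining Latin-1 letters
  [0x0620, 0x064A], // Arabic letters
  [0x3041, 0x3096], // hiragana
  [0x30A1, 0x30FA], // katakana
  [0x4E00, 0x9FA5]  // CJK unified ideographs (partial)
];

// Format a BMP code point as a \uXXXX escape for use in a pattern string.
function toUnicodeEscape(codePoint) {
  var hex = codePoint.toString(16).toUpperCase();
  while (hex.length < 4) {
    hex = "0" + hex;
  }
  return "\\u" + hex;
}

// Join all ranges into a single character class such as [\u0041-\u005A...].
function buildCharacterClass(ranges) {
  var parts = [];
  for (var i = 0; i < ranges.length; i++) {
    parts.push(toUnicodeEscape(ranges[i][0]) + "-" + toUnicodeEscape(ranges[i][1]));
  }
  return "[" + parts.join("") + "]";
}

var letterClass = buildCharacterClass(LETTER_RANGES);
var nameRe = new RegExp("^" + letterClass + "+$");
```

Each range costs 13 characters in the source, so even this nine-entry
excerpt produces a class of well over 100 characters.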

Recent regular expression environments like Perl, Java or PCRE
therefore ship with an extended syntax that allows matching based on a
character's Unicode properties. All Unicode characters are flagged
with properties based on their role, e.g. uppercase letter (Lu),
lowercase letter (Ll) and so on, and can be matched by property using
the aforementioned syntax (\p{Ll}, \p{Lu}, ...).

The practical value would be enormous, while implementation is
relatively trivial if the regular expression library in use already
supports this feature, which increases the chances of adoption. Also,
regular expressions like the aforementioned
http://pastebin.mozilla.org/530081 provide a fallback for web authors
in case the functionality is not available.

The only issue I see so far is that these escapes are already legal
in RegExp patterns, but produce a different result. Therefore, in
order to avoid undesired behaviour, I'd suggest an additional flag, so
that authors can detect support using try/catch blocks and the new
behaviour is only enabled when actually requested.
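
A minimal sketch of both points, assuming an engine that treats
unknown escapes in the usual web-compatible way. The helper name
regexpSupported is hypothetical, and since no flag letter is proposed
here, the flag is left as a parameter.

```javascript
// Without a new flag, \p is read as a plain "p" and {L} as literal text,
// so the pattern matches the four characters "p{L}" and not a letter.
var legacyPattern = new RegExp("\\p{L}");
var matchesLiteral = legacyPattern.test("p{L}");
var matchesLetter = legacyPattern.test("ä");

// Feature detection: constructing a RegExp with an unknown flag (or a
// pattern that is invalid under that flag) throws a SyntaxError.
function regexpSupported(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}
```

An author would call regexpSupported("\\p{L}", flag) once with whatever
flag is eventually chosen, and fall back to the long range-based
expression when it returns false.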

I think this would be an obvious candidate for Harmony, as it
increases functionality in common scenarios without requiring changes
to the basic logic of ES.

Hans Schmucker
Mannheim
Germany

(hansschmucker at gmail.com)

