[rust-dev] unicode support and core

Evan Martin martine at danga.com
Fri Jan 6 17:15:59 PST 2012

On Sat, Dec 24, 2011 at 2:03 AM, Rust <rust at kudling.de> wrote:
> I've opened pull request #1382 to add is_lower/uppercase() to "char".
> I wonder if and how we will be implementing to_lower/uppercase()
> 1) without including libicu in libcore
> 2) without replicating libicu functionality

In my experience, code that wants something like is_lower falls in one
of two categories:

1) code that is implementing some well-defined specification like a
lexer for a programming language, where either the ASCII rule will do
or they have some complex side requirement (like how the IRC protocol
has weird up/down-casing rules for punctuation)

2) code that is attempting to do some sort of human language
processing, where the Unicode definition of upper/lower is unlikely to
be what you want.  For example, to properly lowercase a character in
the Unicode sense you need to know the source language of the text (as
capital I lowercases differently in Turkish than in other languages).
Or consider that the return type of to_upper can't be a single char
due to upper-casing ß.
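
To make the two points in #2 concrete, here is a small sketch.  It
assumes the char methods from present-day Rust's standard library,
which did not exist in this form when this thread was written; the
helper names upper_of and lower_of are made up for illustration.

```rust
// Sketch only: relies on char::to_uppercase / char::to_lowercase from
// present-day Rust std (an assumption; not the libcore API under
// discussion in this thread). Helper names are hypothetical.

/// Upper-case a single char. The result is a String because one input
/// char can map to several output chars.
fn upper_of(c: char) -> String {
    c.to_uppercase().collect()
}

/// Lower-case a single char using the locale-independent mapping.
/// No language is consulted, so Turkish text gets the generic rule.
fn lower_of(c: char) -> String {
    c.to_lowercase().collect()
}
```

With these helpers, upper_of('\u{00DF}') (that is, ß) gives "SS", two
chars from one, and lower_of('I') gives a dotted "i" no matter what
language the text is in, which is the wrong answer for Turkish.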

To elaborate on #2, let me give another example.  People often want to
break a string on whitespace to extract the words.  You might think
this leads down the rabbit hole of Unicode definitions of whitespace
characters, but in practice once you're worrying about Unicode you
need to handle real text properly, including both English rules
("it's" is one word) and French ("L'Académie" is likely two) or even
Arabic (where you can't even compute the word break programmatically).
The proper thing is not to make the \s regex match the Unicode
definition of whitespace, but instead to use a Unicode break iterator
as defined in one of the monster Unicode reports.
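
As a rough illustration of why whitespace splitting is not word
breaking, the sketch below uses str::split_whitespace from
present-day Rust's standard library (an assumption, since that API
postdates this thread); naive_words is a hypothetical name.

```rust
// Sketch only: str::split_whitespace is present-day Rust std, and
// naive_words is a hypothetical helper, not an existing API.

/// Split text on Unicode whitespace and call each piece a "word".
/// This knows nothing about apostrophes, elision, or language, so it
/// cannot tell a one-word contraction from a two-word elided form.
fn naive_words(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```

Given "it's L'Académie", this returns two tokens, "it's" and
"L'Académie", treating both the same way even though the second is
likely two words.  Getting that right takes a break iterator with
language-aware rules, which is library territory rather than
something a whitespace definition can fix.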

The worst case is when a library attempts to provide some of Unicode
without doing it right.  Consider the C library's implementation of
tolower(), which helpfully attempts to obey your locale.  That means
that when SQLite tried to lowercase a query like "INSERT INTO ..." in
a Turkish locale, it helpfully corrupted all the capital I's with the
Turkish dotless one and then failed to parse the query.

From these sorts of experiences I've concluded the best strategy for
these sorts of APIs is to provide two forms: a simple
lowercase/uppercase that only works with ASCII and clearly works in
that way (for example, by only defining it for the 'byte' type, or
however you represent non-Unicode characters), and everything else
punted off into a monster library like ICU.
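
A minimal sketch of what the ASCII-only half could look like, written
against present-day Rust with u8 standing in for the 'byte' type.
The function names are hypothetical; today's standard library offers
u8::to_ascii_lowercase, which should behave the same way.

```rust
// Sketch only: ascii_lower and ascii_lower_bytes are hypothetical
// names. u8 plays the role of the 'byte' type mentioned above.

/// Lower-case a single byte under the ASCII rule: b'A'..=b'Z' map to
/// b'a'..=b'z', and every other byte is returned unchanged. No locale
/// is consulted.
fn ascii_lower(b: u8) -> u8 {
    if b.is_ascii_uppercase() {
        b + 32 // distance between 'A' (65) and 'a' (97)
    } else {
        b
    }
}

/// Apply the ASCII rule to a whole byte string. Bytes belonging to
/// multi-byte UTF-8 sequences are all >= 0x80, so they pass through
/// untouched.
fn ascii_lower_bytes(input: &[u8]) -> Vec<u8> {
    input.iter().map(|&b| ascii_lower(b)).collect()
}
```

Because the functions only accept bytes, a query keyword like
b"INSERT INTO" lowercases to b"insert into" regardless of locale, and
nobody can mistake this for Unicode-aware case conversion.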
