[rust-dev] unicode support and core
graydon at mozilla.com
Sat Jan 7 12:04:53 PST 2012
On 06/01/2012 5:15 PM, Evan Martin wrote:
> From these sorts of experiences I've concluded the best strategies for
> these sorts of APIs is to provide two forms: a simple
> lowercase/uppercase that only works with ASCII but clearly works in
> that way -- for example, only define it for the 'byte' type (or
> however you represent non-Unicode characters), and then punt the rest
> off into a monster library like ICU.
Agreed. This is the strategy we're following, with one additional
category: tasks that satisfy all three of these points:
- Requiring some >ASCII, unicode logic
- Not-requiring any linguistic or locale-related logic
- Common-ish in routine 'language ignorant' data-processing tasks
These, and only these, are what I'm going to put in libcore manually for
char:: stuff. You might think it's an empty set, but there are a small
handful of things:
- Language-neutral extensions of concepts like 'is a metachar' or,
in our lexer, 'is an identifier'. This uses XID_Start/XID_Continue.
- Normalization forms, NFKC and such. I have a conversion of this
logic, but it adds another couple hundred kb of footprint to libcore;
I'm hoping to be able to reduce that.
- Suspicious input "sanitization" by general-category whitelist.
- Possibly UCA and DUCET (?) I'm not as sure this addresses any real
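
To make the split concrete, here is a minimal sketch in present-day Rust
(not the 2012 libcore API). The function names ascii_lower and sanitize
are hypothetical, and the whitelist only approximates a general-category
filter using predicates from the standard library; a real one would
consult the Unicode general-category tables directly.

```rust
/// ASCII-only lowercasing, defined on bytes so the limitation is
/// visible in the type. Bytes outside A-Z pass through unchanged.
/// (Hypothetical helper name; wraps the standard u8 method.)
fn ascii_lower(b: u8) -> u8 {
    b.to_ascii_lowercase()
}

/// Language-neutral "sanitization" by whitelist: keep alphanumerics,
/// plain spaces, and ASCII punctuation; drop everything else, such as
/// control and format characters. The std predicates stand in for a
/// proper general-category lookup (an assumption of this sketch).
fn sanitize(input: &str) -> String {
    input
        .chars()
        .filter(|c| c.is_alphanumeric() || *c == ' ' || c.is_ascii_punctuation())
        .collect()
}
```

Neither function needs a locale or any linguistic knowledge, which is
the property that qualifies this kind of logic for the core library.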
Really, any serious linguistically-aware task beyond these requires a
linguistically-aware library, and we're not in that business. This is the
same reason I am mostly uninterested in the "I need random access to
unicode codepoints!" argument about whether to represent strings as utf8
vs. ucs4 in memory. If you think you need random access to unicode
codepoints, you're *probably* making an algorithmic mistake.
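
As an illustration of that point, a sketch in present-day Rust (the
helper names split_first_word and nth_codepoint are hypothetical): most
string processing is a sequential scan that yields byte offsets, and
those offsets can be used to slice a utf8 string directly. Asking for
"the n-th codepoint" is still possible, but it is a linear walk.

```rust
/// Split at the first whitespace codepoint. The scan is sequential and
/// produces a byte offset, so no codepoint indexing is needed.
fn split_first_word(s: &str) -> (&str, &str) {
    match s.char_indices().find(|&(_, c)| c.is_whitespace()) {
        // `i` is a byte offset on a codepoint boundary, so slicing is safe.
        Some((i, _)) => (&s[..i], &s[i..]),
        None => (s, ""),
    }
}

/// Codepoint "random access" on utf8: works, but costs O(n) per call.
fn nth_codepoint(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}
```

An algorithm that calls something like nth_codepoint inside a loop can
usually be restructured around a single pass like the one in
split_first_word.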