[rust-dev] unicode support and core

Graydon Hoare graydon at mozilla.com
Sat Jan 7 12:04:53 PST 2012


On 06/01/2012 5:15 PM, Evan Martin wrote:

> From these sorts of experiences I've concluded the best strategy for
> these sorts of APIs is to provide two forms: a simple
> lowercase/uppercase that only works with ASCII but clearly works in
> that way -- for example, only define it for the 'byte' type (or
> however you represent non-Unicode characters), and then punt the rest
> off into a monster library like ICU.
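
A minimal sketch of the ASCII-only half of that strategy, written in
present-day Rust rather than the 2012 dialect. The function names
ascii_upper and ascii_upper_bytes are hypothetical; the only library
call used is u8::to_ascii_uppercase from the standard library. Defining
the operation on bytes makes the limitation visible in the type:

```rust
/// ASCII-only uppercase on a single byte (hypothetical helper name).
fn ascii_upper(b: u8) -> u8 {
    // Only b'a'..=b'z' is changed; every other byte passes through untouched.
    b.to_ascii_uppercase()
}

/// Apply the ASCII-only conversion to a whole byte slice.
fn ascii_upper_bytes(input: &[u8]) -> Vec<u8> {
    input.iter().map(|&b| ascii_upper(b)).collect()
}

fn main() {
    let out = ascii_upper_bytes("naïve".as_bytes());
    // The two UTF-8 bytes of 'ï' are non-ASCII, so they are left alone.
    assert_eq!(out, "NAïVE".as_bytes());
}
```

Anything that needs to uppercase the non-ASCII letter correctly for a
given language is the part that gets punted to a library like ICU.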

Agreed. This is the strategy we're following, with one additional 
category: tasks that satisfy all three of these points:

   - Requiring some beyond-ASCII Unicode logic
   - Not requiring any linguistic or locale-related logic
   - Common-ish in routine 'language ignorant' data-processing tasks

These, and only these, are what I'm going to put in libcore manually for 
char:: stuff. You might think it's an empty set, but there are a small 
handful of things:

   - Language-neutral extensions of concepts like 'is a metachar' or,
     in our lexer, 'is an identifier'. This uses XID_Start/XID_Continue.

   - Normalization forms, NFKC and such. I have a conversion of this
     logic, but it adds another couple hundred KB of footprint to
     libcore; I'm hoping to be able to reduce that.

   - Suspicious input "sanitization" by general-category whitelist.

   - Possibly UCA and DUCET (?). I'm less sure these address any real
     use cases.
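
To make the "sanitization by whitelist" item concrete, here is a rough
sketch in present-day Rust. It is an assumption-laden stand-in, not the
libcore code described above: the names is_allowed and sanitize are
hypothetical, and the standard-library predicate char::is_alphanumeric
only approximates a general-category whitelist (a precise one would
need a general-category table, which the standard library does not
expose).

```rust
/// Hypothetical whitelist: keep letters, digits, and the plain space.
/// `is_alphanumeric` is used as an approximation of "categories L* and N*".
fn is_allowed(c: char) -> bool {
    c.is_alphanumeric() || c == ' '
}

/// Drop every character that is not on the whitelist.
fn sanitize(input: &str) -> String {
    input.chars().filter(|&c| is_allowed(c)).collect()
}

fn main() {
    // NUL, a right-to-left override (U+202E), and BEL are all rejected;
    // the accented letters are kept without any locale knowledge.
    let dirty = "héllo\u{0}\u{202E} wörld\u{7}";
    assert_eq!(sanitize(dirty), "héllo wörld");
}
```

The check is language-neutral: it asks what kind of character this is,
never what it means in a particular locale.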

Really, any serious linguistically-aware task beyond this sort requires
a linguistically-aware library, and we're not in that business. This is
the same reason I am mostly uninterested in the "I need random access to
Unicode codepoints!" argument about whether to represent strings as
UTF-8 vs. UCS-4 in memory. If you think you need random access to
Unicode codepoints, you're *probably* making an algorithmic mistake.
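
As an illustration of that last point, a sketch in present-day Rust
(the function name first_word is hypothetical): a typical text task
scans the UTF-8 string once and keeps the byte offset it found, so it
never needs to index "the Nth codepoint".

```rust
/// Return the text before the first whitespace character, scanning once.
fn first_word(s: &str) -> &str {
    // `char_indices` yields (byte offset, char) pairs in order, so the
    // offset can be used to slice the UTF-8 string directly.
    match s.char_indices().find(|&(_, c)| c.is_whitespace()) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

fn main() {
    // Each of the first three characters occupies three bytes in UTF-8,
    // yet no codepoint index is ever computed.
    assert_eq!(first_word("日本語 テキスト"), "日本語");
    assert_eq!(first_word("single"), "single");
}
```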

-Graydon

