[rust-dev] First thoughts on Rust

Graydon Hoare graydon at mozilla.com
Mon Jan 23 12:48:10 PST 2012


On 23/01/2012 2:43 AM, Masklinn wrote:
> On 2012-01-23, at 05:37 , Kevin Cantu wrote:
>> I'm curious though, because I've not used it in depth, what makes NSString
>> so good.  What does it do that Haskell's Text and other languages' string
>> types don't do?
> First-class interaction with grapheme clusters (which it calls "composed characters")[0], I don't remember seeing that in any other language, and good first-class (not tucked in a library hidden out of the way) support for Unicode text manipulation algorithms (lower and upper case conversions, sorting, etc…)

You're asking for a locale-qualified composed-character type. That's 
more than a string type. That's much higher up the ladder towards UI.

Str is more like int or float, or say one of those nice date-time types 
like TAI64. Not like a UI object representing "number" or "text" or 
"calendar date". Those UI concepts contain buckets of affordances that 
have no relevance -- just cost and limitation -- to most instances of 
the datatype that occur lower down in the program.

Use of strings for display-to-humans-as-UI actually covers a pretty 
narrow subset of all the strings your average program works with. Making 
such UI-level concepts the sole representation might make sense in a 
world where:

   - Performance doesn't matter. Space, time, etc.
   - Programmers never want to think of the thing in terms of its
     lower-level representation.

But IMO that's not the general condition.

>>   And what do you need from a core string library that
>> doesn't belong in, say, an extended package of ICU bindings?
>
> As far as I am concerned, any string operation which is defined in Unicode should be either implemented "correctly" (according to unicode) on the string type or not at all (and delegated to a third-party library).

I agree with this. If it's a "string operation defined in unicode", we 
intend to ship the real thing. The things in libcore that claim to be 
unicode-y are correct unicode algorithms. We just don't ship *all* and 
*only* unicode algorithms in libcore; some will be pushed to libstd or 
further (punting to libICU) if they seem rare. Unicode is a huge standard.

Moreover, we do (and will continue to) ship algorithms on str values 
that are not defined by unicode. More on this next...
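(For a small taste before getting there, again a rough sketch: byte 
length and byte-indexed slicing are cheap, useful, and nowhere defined 
by unicode.)

    fn main() {
        let s = "héllo";
        assert_eq!(s.len(), 6);     // 6 bytes, though only 5 codepoints
        assert_eq!(&s[0..1], "h");  // slicing at byte offsets
    }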

> This means, for instance, either string comparisons should implement the UCA or they should be forbidden.

I disagree with this.

In particular, I disagree with the idea that the only comparison that 
exists is UCA comparison. UCA-comparison is not even a single operation 
at all: it's a *family* of highly customizable operations. And the 
documentation clearly lists a number of serious shortcomings:

   - It's very slow.
   - It is not a stable sort.
   - It is not preserved under concatenation or substring.
   - It's highly variable: results vary by locale, legal system
     and organizational tailorings, phonetic dictionaries, etc. etc.
   - It will disagree with any tool doing codepoint or byte order.
   - It is subject to revision by the unicode consortium and may
     be found to be in wildly different states on opposite ends of
     a communication medium (say).

All this is not to disparage the fine work done by the consortium. UCA 
is a massive work of linguistic engineering. It's also inappropriate for 
jamming into the middle of all uses of strings. The authors are quite 
clear on that:

   "The Unicode Collation Algorithm does not restrict the many different
    ways in which implementations can compare strings"

Most uses of strings involve computers talking to other computers, or to 
themselves, not humans via a GUI. And most of those operations are more 
like:

   - Bulk IO.
   - Use as keys in hashtables or balanced trees.
   - Substring and concatenation operations.

These operations do, regularly, have use for a "<" operator that does 
something a lot less than any particular locale-customization of UCA. 
Namely: memcmp. So that's what we do for <. It's a different operation 
than any UCA operator. It has a different API. A proper API for a UCA 
operation isn't even *expressible* as a < expression, since as the spec 
states "collation is not a property of strings". It has to be tailored 
by locale and a dozen other features of the scripts.
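To make the shape difference concrete, here's a rough sketch (the 
Locale and Tailoring types below are hypothetical stand-ins, not 
anything we ship):

    use std::cmp::Ordering;

    // Hypothetical stand-ins, for illustration only.
    struct Locale;
    struct Tailoring;

    // A UCA-style comparison needs extra inputs that `<` has no room
    // for; in practice you'd punt the body to something like libICU.
    fn uca_compare(_a: &str, _b: &str, _loc: &Locale, _t: &Tailoring) -> Ordering {
        unimplemented!()
    }

    fn main() {
        // By contrast, `<` on str is a fixed two-argument,
        // memcmp-style byte comparison:
        assert!("abc" < "abd"); // byte order
        assert!("Z" < "a");     // 'Z' (0x5A) precedes 'a' (0x61)
    }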

Similarly, demanding the sole representation for our strings be in 
precomposed-grapheme form means that all bulk IO on strings takes not 
just a codepoint conversion hit, but a normalization-pass hit. That's a 
very high cost and there's no reason to assume "most" uses of strings 
require it.
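(Rough sketch of the cost being avoided: the precomposed and decomposed 
spellings of "é" are different byte sequences, and an un-normalized str 
never pays to reconcile them behind your back.)

    fn main() {
        let precomposed = "\u{00E9}";   // "é" as one codepoint
        let decomposed  = "e\u{0301}";  // "e" + combining acute accent

        // Same grapheme to a human, different bytes to the machine;
        // str does not silently normalize one into the other.
        assert_ne!(precomposed, decomposed);
        assert_eq!(precomposed.len(), 2);  // bytes in UTF-8
        assert_eq!(decomposed.len(), 3);
    }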

(Plus one can't even implement either of these things without shipping a 
program with an 18MB library. Again, this only makes sense in a world 
where performance costs are somehow invisible or not counted.)

> This, of course, does not apply to a bytes/[u8] type, which would operate solely at the byte level.

In rust, str is not [u8]. Str is unicode; [u8] is a step further down. 
Str is just held in the most common and future-proof unicode encoding to 
avoid constant round-tripping through different encodings and 
normalizations during bulk IO, and to grant default operations like < 
and == based on performance and commonality estimates and our own 
experience writing code that uses strings.
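(Concretely, sketching again: the str is held as UTF-8, and dropping 
down to the [u8] level is an explicit, zero-cost step.)

    fn main() {
        let s = "é";                      // str: unicode, held as UTF-8
        let bytes: &[u8] = s.as_bytes();  // [u8]: a step further down
        assert_eq!(bytes, &[0xC3, 0xA9]); // the UTF-8 encoding of U+00E9
    }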

I'm willing to listen to arguments about "commonality" to some extent, 
but I'd be very surprised if your position is that most programs you've 
worked on would benefit from (say) their balanced trees implementing 
DUCET rather than memcmp. I think many of them would just break, and the 
remainder would slow down by a factor of 100.
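Sketch of what I mean (rough code once more): a balanced tree keyed on 
strings orders its keys by bytes, cheaply and deterministically.

    use std::collections::BTreeMap;

    fn main() {
        let mut tree = BTreeMap::new();
        tree.insert("Zebra", 1);
        tree.insert("apple", 2);
        tree.insert("Äpfel", 3);

        // memcmp order: "Zebra" < "apple" < "Äpfel", because
        // 0x5A < 0x61 < 0xC3 on the first byte. DUCET would order
        // these differently, slowly, and locale-dependently.
        let keys: Vec<&str> = tree.keys().copied().collect();
        assert_eq!(keys, ["Zebra", "apple", "Äpfel"]);
    }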

Str is, in other words, a point of tension between many forces, like int 
and float. One of those forces is definitely "be unicode" -- we have no 
intention of *burying* unicode-ness -- but performance, commonality, 
simplicity and compatibility are also concerns.

-Graydon
