[rust-dev] Unicode vs hex escapes in Rust

Graydon Hoare graydon at mozilla.com
Thu Jul 5 11:18:36 PDT 2012


On 12-07-04 1:12 PM, Christian Siefkes wrote:

> Personally, I find the current behavior of Rust less risky and more logical.
> If you can write '\u263a', why would you want to write the cumbersome
> '\xE2\x98\xBA' instead? Moreover, it's dangerous--just writing '\xE2\x98' or
> '\xE2' would result in a broken UTF-8 string. Perl and C couldn't avoid that
> since they are older than Unicode/UTF-8, but what would be the point of
> allowing it in Rust?

Oh, a good point, but we wouldn't accept that during parsing. I don't 
want to get into the game of letting in strings that aren't valid 
UTF-8. Use a [u8] for that.
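
To make the [u8] route concrete, here is a rough sketch (the checked 
conversion std::str::from_utf8 and the exact syntax are illustrative, 
not the forms under discussion in this thread):

    fn main() {
        // The full three-byte UTF-8 encoding of U+263A is accepted as a
        // string.
        let ok = [0xE2u8, 0x98, 0xBA];
        assert!(std::str::from_utf8(&ok).is_ok());

        // A truncated sequence like \xE2 \x98 is fine as raw bytes, but
        // the checked conversion rejects it rather than letting a broken
        // UTF-8 string in.
        let broken = [0xE2u8, 0x98];
        assert!(std::str::from_utf8(&broken).is_err());
    }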

The string-specific reasons I can see for this are:

   - You want to denote some UTF-8 bytes and you want to avoid doing
     the work of figuring out which codepoints they decode to. For
     example, if you were writing a crude tool that emitted Rust string
     literals by doing byte-at-a-time copies of text files (a rough
     sketch follows this list).

   - You want to copy a string literal from C or C++.
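
A hypothetical sketch of that first case, written with made-up names 
purely for illustration: the emitter copies raw bytes straight into 
\xNN escapes, never decoding them, which only works if \xNN denotes a 
byte rather than a codepoint.

    // Hypothetical "crude tool": copy raw bytes into a string literal
    // one \xNN escape at a time, without working out which codepoints
    // they decode to.
    fn emit_literal(bytes: &[u8]) -> String {
        let mut lit = String::from("\"");
        for b in bytes {
            lit.push_str(&format!("\\x{:02X}", b));
        }
        lit.push('"');
        lit
    }

    fn main() {
        // The three UTF-8 bytes of U+263A, copied through untouched.
        println!("{}", emit_literal(&[0xE2, 0x98, 0xBA]));
        // prints "\xE2\x98\xBA"
    }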

Neither of these is a _great_ reason, but together they feel like 
enough to consider the change. I'm not actually sure how to interpret 
the "risk" Behdad suggested of users thinking strings are Latin-1 (as 
in: why they would think that, and how to mitigate it). I mean, maybe 
if the user believed that \xNN was the only escape form and that no 
longer escapes existed? I don't know, it's 2012 and I am sort of 
perplexed that anyone would think strings would be anything other than 
Unicode-of-some-sort. Anyone looking at \xNN and wanting to write 
longer escapes would, I expect, google "rust unicode escapes", or try 
writing "\uNNNN" or something :)

> No such danger exists in the current implementation, where every \xNN
> sequence refers to a Unicode codepoint < 256 (which also happens to be a
> Latin-1 character, but that's just because Unicode is a superset of
> Latin-1). The current implementation is simple and consistent: all escapes
> refer to code points, none refers to bytes. If your code point is below
> 2^8, you can use any of "\xHH, \u00HH, \U000000HH"; if it's below 2^16,
> you can use either of "\uHHHH, \U0000HHHH"; otherwise you have to use
> "\UHHHHHHHH". Nice and sane.

I agree. This is the counterargument and the one I had in mind when 
picking the current scheme. Any other feelings / rationales for deciding 
one way or another? I'm not super clear on which way to go on this.

> Admittedly, if string literals should be useful not only for entering UTF-8
> sequences, but for entering arbitrary byte sequences ([u8]), then Behdad's
> proposal makes more sense. But for such purposes, wouldn't it be better to
> specify them directly as u8 vectors, e.g. [0xE2, 0x98, 0xBA]?

Definitely. This is really only an interop question, in my mind, not an 
expressivity one. That is: how likely is our behavior to be an unwelcome 
surprise when someone's trying to do something specific with a 
string literal?
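
For a concrete example of the kind of surprise I mean, a sketch of the 
gap between the two readings of \xE2 \x98 \xBA (using \u{...} notation 
and std::str::from_utf8 purely as illustration, not as the syntax under 
discussion):

    fn main() {
        let bytes = [0xE2u8, 0x98, 0xBA];

        // Read as UTF-8 bytes, this is one codepoint, U+263A, three
        // bytes long.
        let as_utf8 = std::str::from_utf8(&bytes).unwrap();
        assert_eq!(as_utf8, "\u{263A}");
        assert_eq!(as_utf8.len(), 3);

        // Read as three codepoints below 256 (the current \xNN meaning),
        // the same values re-encode to six UTF-8 bytes, not three.
        let as_codepoints: String = bytes.iter().map(|&b| b as char).collect();
        assert_eq!(as_codepoints, "\u{E2}\u{98}\u{BA}");
        assert_eq!(as_codepoints.len(), 6);
    }

Someone pasting the C literal "\xE2\x98\xBA" and expecting a three-byte 
smiley would, under the codepoint reading, silently get the six-byte 
string instead.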

-Graydon

