[rust-dev] Unicode vs hex escapes in Rust
graydon at mozilla.com
Thu Jul 5 11:18:36 PDT 2012
On 12-07-04 1:12 PM, Christian Siefkes wrote:
> personally, I find the current behavior of Rust less risky and more logical.
> If you can write '\u263a', why would you want to write the cumbersome
> '\xE2\x98\xBA' instead? Moreover, it's dangerous--just writing '\xE2\x98' or
> '\xE2' would result in a broken UTF-8 string. Perl and C couldn't avoid that
> since they are older than Unicode/UTF-8, but what would be the point of
> allowing it in Rust?
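(As a sketch of the point being made, in the `\u{...}` escape syntax Rust later settled on, which postdates this thread: an escape denotes a code point, and a truncated byte sequence simply isn't a string.)

```rust
fn main() {
    // A \u{...} escape denotes a Unicode code point; its UTF-8 encoding
    // for U+263A (WHITE SMILING FACE) is the three-byte sequence E2 98 BA.
    let smiley = "\u{263A}";
    assert_eq!(smiley.as_bytes(), &[0xE2, 0x98, 0xBA]);

    // A truncated byte sequence such as [E2, 98] is not valid UTF-8,
    // so it cannot be viewed as a &str at all:
    assert!(std::str::from_utf8(&[0xE2, 0x98]).is_err());
}
```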
Oh, a good point, but we wouldn't accept that during parsing. I don't want
to get into the game of allowing in strings that aren't valid utf8. Use
a [u8] for that.
The string-specific reasons I can see for this are:
- You want to denote some utf8 bytes and you want to avoid doing
  the work of figuring out which codepoint it decodes to. For
  example, if you were writing a crude tool that emitted rust string
  literals by doing byte-at-a-time copies of text files.
- You want to copy a string literal from C or C++.
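(For what it's worth, the interop case in the second bullet is how later Rust ended up accommodating this: `\xNN` means a raw byte only inside byte-string literals, so a C literal can be carried over byte-for-byte. A sketch in that later syntax:)

```rust
fn main() {
    // Carrying a C literal like "\xE2\x98\xBA" over byte-for-byte: in a
    // Rust byte-string literal, \xNN always denotes a raw byte.
    let from_c: &[u8] = b"\xE2\x98\xBA";

    // Equivalently, the bytes can be spelled out as a u8 vector:
    let explicit: Vec<u8> = vec![0xE2, 0x98, 0xBA];
    assert_eq!(from_c, explicit.as_slice());

    // If the bytes happen to form valid UTF-8, they can be checked and
    // viewed as a string afterwards:
    assert_eq!(std::str::from_utf8(from_c).unwrap(), "\u{263A}");
}
```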
Neither of these are _great_ reasons, but they feel like enough to
consider the change. I'm not actually sure how to interpret the "risk"
Behdad suggested of users thinking strings are latin-1 (as in: why they
would, and how to mitigate that). I mean, maybe if the user believed
that \xNN was the only escape form, with no longer escapes available? I don't know,
it's 2012 and I am sort of perplexed that anyone would think strings
would be anything other than unicode-of-some-sort. Anyone looking at
\xNN and wanting to write longer escapes would, I expect, google "rust
unicode escapes", or try writing "\uNNNN" or something :)
> No such danger exists in the current implementation, where every \xNN
> sequence refers to a Unicode codepoint < 256 (which also happens to be a
> Latin1 character, but that's just because Unicode is a superset of Latin1).
> The current implementation is simple and consistent: all escapes refer to
> code points, none refers to bytes. If your code point is below 2^8, you can
> use any of "\xHH, \u00HH, \U000000HH", if it's below 2^16, you can use
> either of "\uHHHH, \U0000HHHH", otherwise you have to use "\UHHHHHHHH". Nice
> and sane.
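(For the record, the design Rust eventually shipped resolved this differently than either 2012 scheme, and it sides with the codepoint view: in string literals, `\xHH` is only accepted for HH <= 0x7F, where byte value and code point coincide, so the byte-vs-codepoint ambiguity never arises; anything higher must use `\u{...}`. A sketch in that later syntax:)

```rust
fn main() {
    // Below 0x80, the \xHH byte and the code point are the same thing:
    assert_eq!("\x41", "A");
    assert_eq!("\x41", "\u{41}");

    // Above 0x7F, string literals require the \u{...} form; "\xE9" would
    // be rejected at compile time, but the code point works fine:
    assert_eq!("\u{E9}", "é");
}
```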
I agree. This is the counterargument and the one I had in mind when
picking the current scheme. Any other feelings / rationales for deciding
one way or another? I'm not super clear on which way to go on this.
> Admittedly, if string literals should be useful not only for entering UTF-8
> sequences, but for entering arbitrary byte sequences ([u8]), then Behdad's
> proposal makes more sense. But for such purposes, wouldn't it be better to
> specify them directly as u8 vectors, e.g. [0xE2,0x98,0xBA] ?
Definitely. This is really only an interop question, in my mind, not an
expressivity one. That is: how likely is our behavior to be an unwelcome
surprise when someone's trying to do something specific with a