[rust-dev] strings, slices and nulls

Graydon Hoare graydon at mozilla.com
Thu Apr 19 12:59:07 PDT 2012


On 12-04-19 07:25 AM, Jesse Ruderman wrote:
> My preference is to remove null termination:
> 
> * I'm guessing most strings aren't passed to C. (What are the most
> common C string calls in rustc?)

All the filesystem access stuff, at this point. In the future it's
harder to say.

> * C functions that scan for null are inefficient, so they're even more
> likely to be replaced with Rust equivalents than other C functions.

Hm, I think this is not a reasonable stance:

$ find /usr/include/ -name \*.h \
  | xargs cat \
  | grep -c 'char\( *const\)\? *\*'
10488

There are a lot of C APIs that take strings. "Rewrite the world in Rust"
is going to take a long time.

> * Null termination is not sufficient for interop with C. You also have
> to ensure the strings don't contain null characters. (This is a common
> source of bugs in Firefox, since JavaScript strings and strings from
> the network can contain null characters.) And if null characters are
> present, what do you do?

I can see some cases where that might be a bug, but in general I think
an embedded null just ... makes a string shorter, from C's perspective.
It's the same as passing a short string. Of course if the C code
requires some other kind of well-formedness condition in the prefix,
you'd need to enforce that, but that condition presumably holds over
shorter and longer strings alike. Most C APIs aren't written to take
strings of a fixed size.
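
To make the two positions concrete, here is a minimal sketch in
present-day Rust (std::ffi::CString postdates this message, and the
helper names c_view and to_c_checked are hypothetical, made up for
illustration). The first helper models what a null-scanning C function
would see; the second models the stricter policy of refusing strings
with an interior null instead of silently shortening them.

```rust
use std::ffi::CString;

/// Hypothetical helper: the prefix a C function scanning for NUL would see.
/// NUL is a single-byte code point in UTF-8, so slicing at its index
/// always lands on a character boundary.
fn c_view(s: &str) -> &str {
    match s.find('\0') {
        Some(i) => &s[..i],
        None => s,
    }
}

/// Hypothetical helper: the stricter policy. Refuse the string and report
/// the byte position of the first interior NUL instead of truncating.
fn to_c_checked(s: &str) -> Result<CString, usize> {
    CString::new(s).map_err(|e| e.nul_position())
}

fn main() {
    let s = "abc\0def";
    println!("C sees: {:?}", c_view(s));
    println!("checked conversion: {:?}", to_c_checked(s));
}
```

Which of the two is right depends on the caller: truncation matches the
"it's just a shorter string" view, while the checked conversion suits
cases like the Firefox ones where a hidden suffix would be a bug.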

> * Each C function has its own expectations about character encoding
> and allowed characters, so calls to C involve extra state-tracking or
> checks anyway.

For APIs that take UTF-16, such as the win32 APIs, we already do the
conversion before calling, yes. But APIs that take "char *" tend to be
set up so they can accept UTF-8 input: they're either agnostic to the
differences between ASCII and UTF-8 (a property UTF-8 was designed to
provide) or else they can operate in UTF-8 mode via LC_CTYPE or such.
Sure, you need to enforce that, or re-encode when it's not true, but
again, this is about opportunistic recoding-avoidance by careful choice
of defaults, rather than a guarantee that we never need to recode.
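
As an illustration of the "agnostic" case, here is a small sketch in
present-day Rust (the function name last_component is hypothetical). It
scans raw bytes for an ASCII delimiter the way a byte-oriented C routine
would, without decoding anything. This is safe on UTF-8 input because
every byte of a multi-byte sequence is 0x80 or above, so an ASCII byte
such as '/' can never appear inside an encoded character.

```rust
/// Hypothetical helper: return the bytes after the last '/' in `path`,
/// treating the input as opaque bytes with no knowledge of UTF-8.
fn last_component(path: &[u8]) -> &[u8] {
    match path.iter().rposition(|&b| b == b'/') {
        Some(i) => &path[i + 1..],
        None => path,
    }
}

fn main() {
    let path = "/tmp/café/naïve.txt";
    let tail = last_component(path.as_bytes());
    // The byte-level split still yields well-formed UTF-8.
    match std::str::from_utf8(tail) {
        Ok(name) => println!("last component: {}", name),
        Err(e) => println!("not valid UTF-8: {}", e),
    }
}
```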

Sometimes users want an array of UCS-4 code points as well, but that's
not our default string representation.
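
For the occasional caller that does want UCS-4, the conversion can be
done on demand at the boundary. A minimal sketch in present-day Rust
follows (to_ucs4 and from_ucs4 are hypothetical names, and the APIs
used postdate this message); it assumes the string itself is stored as
UTF-8.

```rust
/// Hypothetical helper: expand a UTF-8 string into one u32 per code point.
fn to_ucs4(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

/// Hypothetical helper: rebuild a string from UCS-4 units. Returns None
/// if any unit is not a valid Unicode scalar value (e.g. a surrogate).
fn from_ucs4(units: &[u32]) -> Option<String> {
    units.iter().map(|&u| char::from_u32(u)).collect()
}

fn main() {
    let units = to_ucs4("aé€");
    println!("{:x?}", units);
    println!("{:?}", from_ucs4(&units));
}
```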

-Graydon

