[rust-dev] Arrays, vectors...

Graydon Hoare graydon at mozilla.com
Fri Mar 23 13:42:36 PDT 2012


I've been thinking and sketching a lot about vectors & arrays lately.
There are a _lot_ of design pressures on this part of the language, but
our current approach is clearly unsatisfactory:

  - Existing vecs are always unique. Sometimes you want shared, but
    boxing them as @[] causes double-indirection, feels awkward.

  - Existing vecs are always in the heap. So:
     - There's no way to alloca one (in stack)
     - There's no way to allocate one in constant memory
     - There's no way to allocate one inside another structure

  - The latter is particularly egregious when it comes to C interop.
    C has the idea of a dense block of fixed-size, int[10] is
    10-ints-long. We have no way to model this in rust right now.

I think we need to deal with this stew of disappointments in our vectors
soon, probably in the 0.2 - 0.3 timeframe. I think with the appearance
of region pointers we have a reasonable leg to stand on in terms of
formulating lifecycles. Here's my proposal:

  - [int] means (&int,len), it's a slice
  - str means ((&u8,len) : is_utf8(*)), a textual slice

  - the region system is used to relate the visibility of [t] and str
    to their region-of-allocation, the same way it relates normal &t
    values. The additional 'length' integer on a slice is used only
    by unsafe code such as that implementing operator[] and iteration.

  - [int/10] and str/10 mean "fixed-size, 10-elements long"
    and are _interior values_. The number is part of the type.
    The type's size is N * sizeof(T). We define const-expressions and
    integer-const-expressions as part of this work, so that number
    can in some cases be symbolic.

  - [int/~] is a heap-resident unique vector, like [int] today
  - [int/@] is a heap-resident shared vector, like old [int]
  - str/~ and str/@ are the same for strings
  - The compiler open-codes all of this

  - [int] <: [int/10]
    [int] <: [int/~]
    [int] <: [int/@]
    str <: str/10
    str <: str/~
    str <: str/@

  - In other words, you can cast anything down to the "slice" form,
    the same way you can cast any of the closures to fn() type

  - Literals are subtle; assume p,q,r are dynamic vals:

     - [1,2,3,4,5]    -- constant memory, type [int]
     - [1,2,3,4,5]/5  -- constant memory, type [int/5]
     - [1,2,3,4,5]/_  -- same, type [int/5], type-size inferred
     - "hello"        -- constant memory, type str
     - "hello"/5      -- constant memory, type str/5
     - "hello"/_      -- same, type str/5, type-size inferred
     - [p,q,r]        -- alloca'ed memory, type [int]
     - [p,q,r]/3      -- alloca'ed memory, type [int/3]
     - [p,q,r]/_      -- same, type [int/3], type-size inferred
     - [p,q,r]/@      -- heap-resident shared, type [int/@]
     - [p,q,r]/~      -- heap-resident unique, type [int/~]

  - The compiler has to know about all these types. It's a bit of a
    bestiary but it means all the use-cases are satisfied and I think
    it composes well and is congruent with the rest of the language.

  - Things like foo[i], +, += and such are pushed down to library code,
    use operator overloading. The compiler only does lifecycle:
    literals, allocation, casting, passing and destruction.

Positives of this approach:

  - The common representation is cheapest, is a subtype of all, and is
    the default type of the easiest-to-write literal forms. Most people
    will write 'let x = "hello";' and get x:str inferred. And it'll be
    cheap to pass, and constant-memory. The next-most-common (vectors of
    dynamic values) are alloca'ed, but also passed as slices, cheap.

  - It's possible to denote the aggregate, not just a pointer-to-it,
    iff you fix its size. If you can't commit to a size, you can't
    denote the aggregate, can only get a pointer-to-it.

  - In particular, you can now translate C array types directly to a
    rust type. { x: int, y: [int/15], z: int } is a dense structure.

  - Pointers-to-fixed-sized-types are still possible: &[int/1024] is
    a pointer. As is *[int/1024]. Orthogonality is maintained.

  - Prefix box operators ~ and @ still mean what they always mean; there
    is no weirdness that arises by trying to make ~[] special and then
    confront the inconsistency of ~T != ~[] when T == []. Like with
    fn closures, we make the "how you manage your associated storage" a
    syntactically distinct component of the tycon.

  - Heap-resident versions exist; like in any data structure, you have
    to discuss uniquieness or shared-ness to some extent. If you want a
    record with 3 strings you can send over a channel, it has to be of
    type {x:str/~, y:str/~, z:str/~}.

So, thoughts, likes, dislikes?

Are the semantics over-complex? I think we really have a large number of
use-cases and there's no way we can hit them all with a single
abstraction. Shared-heap != unique-heap != constant != alloca !=
borrowed region, and fixed-size != variable-size.

Is the trailing slash absolutely hideous? There are a few other unused
ASCII symbols in the type grammar but this was the nicest-looking I
could see (that didn't collide with something else). It's also possible
to write them as str/10, str@ and str~, say, and similarly [int/10],
[int]~ and [int]@. That's fewer slashes but a bit more visual ambiguity
if you have a leading ~ or @ as well.

-Graydon


More information about the Rust-dev mailing list